# A Brief History of NLP — Part 2

Published on Jul 08, 2020 by Antoine Louis

This post summarizes the history of NLP during the deep learning era (2000–present).

From the 2000s onwards, neural networks began to be used for language modeling, the task of predicting the next word in a text given the previous words.

2003. Bengio et al. proposed the first neural language model [1], consisting of a one-hidden-layer feed-forward neural network. They also introduced what is now referred to as a word embedding: a real-valued word feature vector in $\mathbb{R}^d$. More precisely, their model took as input vector representations of the $n$ previous words, which were looked up in a table learned jointly with the model. The vectors were fed into a hidden layer, whose output was then provided to a softmax layer that predicted the next word of the sequence. Although classic feed-forward neural networks have been progressively replaced with recurrent neural networks (RNNs) [2] for language modeling [3], they remain competitive with recurrent architectures in some settings, the latter being affected by “catastrophic forgetting” [4]. Furthermore, the general building blocks of Bengio et al.’s network are still found in most neural language and word embedding models today.
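To make the architecture concrete, the forward pass of such a model can be sketched in a few lines of NumPy. This is a toy illustration with random weights and made-up dimensions, not the paper's actual configuration; the function name `next_word_probs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
V, d, n, h = 10, 4, 3, 8   # vocab size, embedding dim, context length, hidden units

C = rng.normal(size=(V, d))      # word feature vectors, learned with the model
W = rng.normal(size=(n * d, h))  # concatenated context -> hidden layer
U = rng.normal(size=(h, V))      # hidden layer -> vocabulary scores

def next_word_probs(context_ids):
    """Forward pass of a Bengio-style feed-forward language model."""
    x = C[context_ids].reshape(-1)   # look up and concatenate the n embeddings
    hidden = np.tanh(x @ W)          # one hidden layer
    scores = hidden @ U
    e = np.exp(scores - scores.max())
    return e / e.sum()               # softmax over the vocabulary

probs = next_word_probs([1, 5, 2])   # distribution over the next word
```

Training would then adjust `C`, `W`, and `U` jointly to maximize the probability of the observed next words, which is how the embedding table ends up being learned together with the model.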

2008. Collobert and Weston applied multi-task learning [5], a sub-field of machine learning in which multiple learning tasks are solved simultaneously, to neural networks for NLP. They used a single convolutional neural network (CNN) [6] that, given a sentence, could output many language processing predictions such as part-of-speech tags, named entity tags, and semantic roles. The entire network was trained jointly on all the tasks using weight-sharing of the look-up tables, which enabled the different models to collaborate and share general low-level information in the word embedding matrix. As models are increasingly evaluated on multiple tasks to gauge their generalization ability, multi-task learning has gained importance and is now used across a wide range of NLP tasks. Moreover, the paper's impact went beyond multi-task learning: it spearheaded ideas such as pre-training word embeddings and using CNNs for text, which were only widely adopted years later.

2013. Mikolov et al. introduced arguably the most popular word embedding model: Word2Vec [7, 8]. Although dense vector representations of words had been used as early as 2003, the main innovation of their paper was an efficient training procedure that removed the hidden layer and approximated the loss function. Together with an efficient implementation, these simple changes enabled large-scale training of word embeddings on vast corpora of unstructured text. Later that year, they improved the Word2Vec model with additional strategies to enhance training speed and accuracy. While these embeddings are not conceptually different from those learned with a feed-forward neural network, training on a vast corpus enables them to capture relationships between words such as gender, verb tense, and country-capital relations, which sparked much interest in word embeddings as well as in the origin of these linear relationships [9, 10, 11, 12]. However, what made word embeddings a mainstay in current NLP was the evidence that using pre-trained embeddings as initialization improved performance across a wide range of downstream tasks. Despite many more recent developments, Word2Vec remains a popular choice and is still widely used today.
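The "no hidden layer" idea can be illustrated with a minimal sketch of one skip-gram update with negative sampling: the model is just dot products between two embedding tables, trained with a logistic loss. This is a toy version on random data (dimensions, learning rate, and the helper name `sgns_step` are all illustrative), not Mikolov et al.'s implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 50, 16                       # toy vocabulary and embedding sizes

W_in = rng.normal(0, 0.1, (V, d))   # "input" (target word) embeddings
W_out = rng.normal(0, 0.1, (V, d))  # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, neg_ids, lr=0.05):
    """One skip-gram step with negative sampling: no hidden layer, just
    dot products, classifying the true context word against random words."""
    v = W_in[target].copy()
    ids = np.concatenate(([context], neg_ids))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                         # 1 for the observed context word
    preds = sigmoid(W_out[ids] @ v)
    g = preds - labels                      # gradient of the logistic loss
    W_in[target] -= lr * (g @ W_out[ids])
    W_out[ids] -= lr * np.outer(g, v)
    return float(-np.log(preds[0]) - np.log(1 - preds[1:]).sum())

# Repeatedly training on one (target, context) pair drives the loss down.
losses = [sgns_step(3, 7, rng.integers(0, V, 5)) for _ in range(200)]
```

Replacing a softmax over the full vocabulary with a handful of sampled negatives is what makes training cheap enough to run over billions of tokens.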

The year 2013 also marked the broader adoption of neural network models in NLP, in particular three well-defined types: recurrent neural networks (RNNs) [2], convolutional neural networks (CNNs) [6], and recursive neural networks [13]. Because of their architecture, RNNs became popular for dealing with the dynamic input sequences ubiquitous in NLP. However, vanilla RNNs were quickly replaced with the classic long short-term memory networks (LSTMs) [14], which proved more resilient to the vanishing and exploding gradient problems. Simultaneously, convolutional neural networks, which were then being widely adopted by the computer vision community, started to be applied to natural language [15, 16]. The advantage of CNNs for text sequences is that they are more parallelizable than RNNs, as the state at every time step depends only on the local context (via the convolution operation) rather than on all past states as in RNNs. Finally, recursive neural networks were inspired by the principle that human language is inherently hierarchical: words are composed into higher-order phrases and sentences, which can themselves be recursively combined according to a set of production rules. Based on this linguistic perspective, recursive neural networks treated sentences as trees rather than as sequences. Some research also extended RNNs and LSTMs to work with hierarchical structures [17].
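The locality and parallelism of CNNs over text can be seen in a minimal sketch: each filter looks only at a k-word window, so all positions are independent of one another, and max-over-time pooling collapses any sentence length into a fixed-size vector. This is a toy, loop-based illustration (random filters, made-up sizes), not an actual CNN-for-text implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, k, f = 7, 5, 3, 4      # sentence length, embedding dim, filter width, filters

X = rng.normal(size=(T, d))           # one sentence as a matrix of word vectors
filters = rng.normal(size=(f, k, d))  # f convolution filters over k-word windows

def conv_max_pool(X, filters):
    """Each output position depends only on a local k-word window, so all
    positions could be computed in parallel; max-over-time pooling then
    yields a sentence vector whose size does not depend on T."""
    T, _ = X.shape
    f, k, _ = filters.shape
    feats = np.array([[np.sum(X[t:t + k] * filters[j]) for t in range(T - k + 1)]
                      for j in range(f)])
    return feats.max(axis=1)          # one value per filter

sent_vec = conv_max_pool(X, filters)  # fixed-size sentence representation
```

A longer sentence produces more convolution positions but the same output size, which is what makes this pooling scheme convenient for downstream classifiers.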

2014. Sutskever et al. proposed sequence-to-sequence learning [18], an end-to-end approach for mapping one sequence to another with a neural network. In their method, an encoder neural network processes a sentence term by term and compresses it into a fixed-size vector. A decoder neural network then predicts the output sequence symbol by symbol, conditioned on the encoder state and on the previously predicted symbols taken as input at every step. Encoders and decoders for sequences are typically based on RNNs, but other architectures have also emerged. Recent models include deep LSTMs [19], convolutional encoders [20, 21], the Transformer [22], and a combination of an LSTM and a Transformer [23]. Machine translation turned out to be the perfect application for sequence-to-sequence learning. The progress was so significant that Google announced in 2016 that it was officially replacing the monolithic phrase-based machine translation models in Google Translate with a neural sequence-to-sequence model.
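The encode-then-decode loop can be sketched with a plain (untrained) RNN in NumPy. All weights are random and the function names (`encode`, `decode`) are illustrative, so the "translation" is meaningless; the point is only the data flow: the encoder compresses the source into one vector, and the decoder emits one symbol at a time, feeding each prediction back in.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, h = 8, 6, 10               # toy vocab, embedding and hidden-state sizes
E = rng.normal(0, 0.3, (V, d))   # toy embedding table (shared for simplicity)
Wx = rng.normal(0, 0.3, (d, h))
Wh = rng.normal(0, 0.3, (h, h))
Wo = rng.normal(0, 0.3, (h, V))
BOS, EOS = 0, 1                  # special begin/end-of-sequence symbols

def encode(src_ids):
    """Compress the whole source sequence into one fixed-size vector."""
    s = np.zeros(h)
    for t in src_ids:
        s = np.tanh(E[t] @ Wx + s @ Wh)
    return s

def decode(state, max_len=5):
    """Greedy decoding: each step conditions on the running state and on
    the previously predicted symbol, until EOS or a length limit."""
    out, prev = [], BOS
    for _ in range(max_len):
        state = np.tanh(E[prev] @ Wx + state @ Wh)
        prev = int(np.argmax(state @ Wo))
        out.append(prev)
        if prev == EOS:
            break
    return out

translation = decode(encode([2, 5, 3]))
```

In a trained system the same loop runs with learned LSTM weights and beam search instead of a greedy argmax, but the overall structure is the one shown here.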

2015. Bahdanau et al. introduced the principle of attention [24], one of the core innovations in neural machine translation (NMT) that enabled NMT models to outperform classic phrase-based MT systems. It alleviates the main bottleneck of sequence-to-sequence learning: the requirement to compress the entire content of the source sequence into a single fixed-size vector. Attention allows the decoder to look back at the source sequence hidden states, which are combined through a weighted average and provided as an additional input to the decoder. Attention is potentially useful for any task that requires making decisions based on certain parts of the input; it has also been applied to constituency parsing [25], reading comprehension [26], and one-shot learning [27]. More recently, a new form of attention called self-attention has appeared, which is at the core of the Transformer architecture. In short, it looks at the surrounding words in a sentence or paragraph to obtain more contextually sensitive word representations.
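The weighted-average mechanism is compact enough to sketch directly. Note one simplification: dot-product scoring is used here for brevity, whereas Bahdanau et al. score each source state with a small feed-forward network; the data and the function name `attend` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, h = 6, 8                           # source length, hidden size
enc_states = rng.normal(size=(T, h))  # one encoder hidden state per source word
dec_state = rng.normal(size=h)        # current decoder state

def attend(dec_state, enc_states):
    """Score each source state against the decoder state, normalize the
    scores with a softmax, and return the weighted average of the source
    states (the context vector fed to the decoder)."""
    scores = enc_states @ dec_state        # one relevance score per position
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                  # attention weights sum to 1
    return weights, weights @ enc_states   # context vector, size h

weights, context = attend(dec_state, enc_states)
```

Because the weights are recomputed at every decoding step, the decoder can focus on different source positions for different output words instead of relying on one fixed summary vector.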

2018. The latest major innovation in NLP is undoubtedly large pre-trained language models. While first proposed in 2015 [28], only recently were they shown to yield considerable improvements over the state of the art across a diverse range of tasks. Pre-trained language model embeddings can be used as features in a target model [29], or a pre-trained language model can be fine-tuned on target-task data [30, 31, 32, 33], which has been shown to enable efficient learning with significantly less labeled data. The main advantage of these pre-trained language models comes from their ability to learn word representations from large unannotated text corpora, which is particularly beneficial for low-resource languages where labeled data is scarce.

1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.

2. Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

3. Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010.

4. Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. Frustratingly short attention spans in neural language modeling. arXiv preprint arXiv:1702.04521, 2017.

5. Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008.

6. Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pages 319–345. Springer, 1999.

7. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

8. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013b.

9. David Mimno and Laure Thompson. The strange geometry of skip-gram with negative sampling. In Empirical Methods in Natural Language Processing, 2017.

10. Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.

11. Maria Antoniak and David Mimno. Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119, 2018.

12. Laura Wendlandt, Jonathan K Kummerfeld, and Rada Mihalcea. Factors influencing the surprising instability of word embeddings. arXiv preprint arXiv:1804.09692, 2018.

13. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.

14. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

15. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.

16. Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

17. Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

18. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.

19. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

20. Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

21. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org, 2017.

22. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

23. Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.

24. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

25. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in neural information processing systems, pages 2692–2700, 2015.

26. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015.

27. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.

28. Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087, 2015.

29. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

30. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

31. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

32. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

33. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019.