The Shift from Symbolic AI to Deep Learning in Natural Language Processing
Authors:
(1) Raphaël Millière, Department of Philosophy, Macquarie University ([email protected]);
(2) Cameron Buckner, Department of Philosophy, University of Houston ([email protected]).
Table of Links
Abstract and 1 Introduction
2. A primer on LLMs
2.1. Historical foundations
2.2. Transformer-based LLMs
3. Interface with classic philosophical issues
3.1. Compositionality
3.2. Nativism and language acquisition
3.3. Language understanding and grounding
3.4. World models
3.5. Transmission of cultural knowledge and linguistic scaffolding
4. Conclusion, Glossary, and References
2.1. Historical foundations
The origins of large language models can be traced back to the inception of AI research. The early history of natural language processing (NLP) was marked by a schism between two competing paradigms: the symbolic and the stochastic approaches. A major influence on the symbolic paradigm in NLP was Noam Chomsky’s transformational-generative grammar (Chomsky 1957), which posited that the syntax of natural languages could be captured by a set of formal rules that generated well-formed sentences. Chomsky’s work laid the foundation for the development of rule-based syntactic parsers, which leverage linguistic theory to decompose sentences into their constituent parts. Early conversational NLP systems, such as Winograd’s SHRDLU (Winograd 1971), required syntactic parsers with a complex set of ad hoc rules to process user input.
In parallel, the stochastic paradigm was pioneered by researchers such as the mathematician Warren Weaver, who was influenced by Claude Shannon’s information theory. In a memorandum written in 1949, Weaver proposed using computers for machine translation based on statistical techniques (Weaver 1955). This work paved the way for the development of statistical language models, such as n-gram models, which estimate the likelihood of word sequences based on the observed frequencies of word combinations in a corpus (Jelinek 1998). Initially, however, the stochastic paradigm lagged behind symbolic approaches to NLP, showing only modest success in toy models with limited applications.
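To make the idea behind n-gram models concrete, here is a minimal sketch of a bigram model; the toy corpus and function names are invented for illustration, and a real model would be estimated from a much larger corpus (typically with smoothing for unseen word pairs).

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; real n-gram models are estimated
# from millions of sentences and use smoothing for unseen word pairs.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev) from observed counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("cat", "sat"))  # 1.0: "cat" is always followed by "sat" here
print(bigram_prob("the", "cat"))  # 0.25: "cat" is one of four continuations of "the"
```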
Another important theoretical stepping stone on the road to modern language models is the so-called distributional hypothesis, first proposed by the linguist Zellig Harris in the 1950s (Harris 1954). This idea was grounded in the structuralist view of language, which posits that linguistic units acquire meaning through their patterns of co-occurrence with other units in the system. Harris specifically suggested that the meaning of a word could be inferred by examining its distributional properties, or the contexts in which it occurs. Firth (1957) aptly summarized this hypothesis with the slogan “You shall know a word by the company it keeps,” acknowledging the influence of Wittgenstein’s (1953) conception of meaning-as-use to highlight the importance of context in understanding linguistic meaning.
As research on the distributional hypothesis progressed, scholars began exploring the possibility of representing word meanings as vectors in a multidimensional space. Early empirical work in this area stemmed from psychology and examined the meaning of words along various dimensions, such as valence and potency (Osgood 1952). While this work introduced the idea of representing meaning in a multidimensional vector space, it relied on explicit participant ratings about word connotations along different scales (e.g., good–bad), rather than analyzing the distributional properties of a linguistic corpus. Subsequent research in information retrieval combined vector-based representations with a data-driven approach, developing automated techniques for representing documents and words as vectors in high-dimensional vector spaces (Salton et al. 1975).
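As a rough illustration of that data-driven, vector-based approach (a simplified stand-in rather than the specific method of Salton et al. 1975), the sketch below represents documents as raw term-count vectors over a shared vocabulary and compares them with cosine similarity; the toy documents are invented for illustration.

```python
import numpy as np

# Toy documents, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell on the market",
]

# Shared vocabulary; each document becomes a vector of raw term counts.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    words = doc.split()
    return np.array([words.count(term) for term in vocab], dtype=float)

vectors = [to_vector(doc) for doc in docs]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))  # higher: the two sentences share many terms
print(cosine(vectors[0], vectors[2]))  # lower: little lexical overlap
```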
After decades of experimental research, these ideas eventually reached maturity with the development of word embedding models using artificial neural networks (Bengio et al. 2000). These models are based on the insight that the distributional properties of words can be learned by training a neural network to predict a word’s context given the word itself, or vice versa. Unlike previous statistical methods such as n-gram models, word embedding models encode words into dense, low-dimensional vector representations (Fig. 1). The resulting vector space drastically reduces the dimensionality of linguistic data while preserving information about meaningful linguistic relationships beyond simple co-occurrence statistics. Notably, many semantic and syntactic relationships between words are reflected in linear substructures within the vector space of word embedding models. For example, Word2Vec (Mikolov et al. 2013) demonstrated that word embeddings can capture both semantic and syntactic regularities, as evidenced by the ability to solve word analogy tasks through simple vector arithmetic that reveals the latent linguistic structure encoded in the vector space (e.g., king + woman − man ≈ queen, or walking + swam − walked ≈ swimming).
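The analogy arithmetic above can be made concrete with a short sketch. The embeddings below are hand-crafted toy vectors (real Word2Vec embeddings are learned from a corpus and typically have hundreds of dimensions), but the offset-and-nearest-neighbour search mirrors how such analogy tasks are solved in practice.

```python
import numpy as np

# Hand-crafted toy embeddings (3 dimensions standing in for hundreds);
# real embeddings would be learned by a model such as Word2Vec.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

def nearest(vector, exclude):
    """Return the vocabulary word whose embedding is most cosine-similar to `vector`."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: cos(vector, v) for w, v in emb.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# king + woman - man ≈ queen
target = emb["king"] + emb["woman"] - emb["man"]
print(nearest(target, exclude={"king", "woman", "man"}))  # "queen"
```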
The development of word embedding models marked a turning point in the history of NLP, providing a powerful and efficient means of representing linguistic units in a continuous vector space based on their statistical distribution in a large corpus. However, these models have several significant limitations. First, they are not capable of capturing polysemy and homonymy, because they assign a
single or “static” embedding to each word type, which cannot account for changes in meaning based on context; for example, “bank” is assigned a unique embedding regardless of whether it refers to the side of a river or the financial institution. Second, they rely on “shallow” artificial neural network architectures with a single hidden layer, which limits their ability to model complex relationships between words. Finally, being designed to represent language at the level of individual words, they are not well-suited to modeling complex linguistic expressions, such as phrases, sentences, and paragraphs. While it is possible to represent a sentence as a vector by averaging the embeddings of every word in the sentence, this is a very poor way of representing sentence-level meaning, as it loses information about the compositional structure reflected in word order. In other words, word embedding models merely treat language as a “bag of words”; for example, “a law book” and “a book law” are treated identically as the unordered set {‘a’, ‘book’, ‘law’}.
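The order-insensitivity described above is easy to demonstrate: averaging static embeddings produces exactly the same vector for “a law book” and “a book law”. The sketch below uses hypothetical toy embeddings purely for illustration.

```python
import numpy as np

# Hypothetical static embeddings; each word type has exactly one vector.
emb = {
    "a":    np.array([0.1, 0.0, 0.3]),
    "law":  np.array([0.7, 0.2, 0.1]),
    "book": np.array([0.4, 0.9, 0.2]),
}

def average_embedding(sentence):
    """Represent a sentence as the mean of its word embeddings (a 'bag of words')."""
    return np.mean([emb[w] for w in sentence.split()], axis=0)

v1 = average_embedding("a law book")
v2 = average_embedding("a book law")
print(np.allclose(v1, v2))  # True: word order is lost entirely
```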
The shortcomings of shallow word embedding models were addressed with the introduction of “deep” language models, beginning with recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) networks (Hochreiter & Schmidhuber 1997) and gated recurrent units (GRUs) (Cho et al. 2014). These deep neural network architectures incorporate a memory-like mechanism, allowing them to remember and process sequences of inputs over time, rather than individual, isolated words. Despite this advantage over word embedding models, they suffer from their own limitations: they are slow to train and struggle with long sequences of text. These issues were ultimately overcome with the introduction of the Transformer architecture by Vaswani et al. (2017), which laid the groundwork for modern LLMs.
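As a schematic illustration of the recurrent, memory-like mechanism described above, the sketch below implements a single vanilla RNN update rolled over a toy sequence; the weights are randomly initialized purely for illustration (a trained LSTM or GRU adds gating on top of this basic recurrence).

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions and randomly initialized parameters (for illustration only).
d_in, d_hidden = 4, 8
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden (the "memory")
b = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# A toy sequence of 5 input vectors, processed one step at a time.
sequence = rng.normal(size=(5, d_in))
h = np.zeros(d_hidden)
for x_t in sequence:
    h = rnn_step(x_t, h)  # h carries information about everything seen so far

print(h.shape)  # (8,): a running summary of the whole sequence
```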