
Since the productization of Large Language Model (LLM) conversational chatbots powered by the Transformer architecture and human feedback, reactions in the machine and deep learning community have been split into two camps, symbolized by the positions of two prominent figures in the field:
- Team Le Cun: characterized by the engineer’s caution and pragmatism, these people acknowledge the new era opened by conversational “AI”, rejoice that their field receives a lot of public attention, and put the current marketing, media, and doomerism fuss in perspective of the progress yet to be accomplished to reach Artificial General Intelligence (AGI).
- Team Hinton: characterized by a scare-mongering approach which extends far beyond scientific considerations, these people fear the potentially harmful social consequences of unforeseeable applications of new “generative AI” technologies1. According to recent New York Times coverage, G. Hinton allegedly “regrets his life’s work”, a shocking statement coming from such an important figure in the field.
On the matter of the potential threat posed by an AI upheaval, I am on team Le Cun: “generative AI” still lacks intrinsic motivation; without intrinsic motivation, a program is basically condemned to be a slave of human endeavours. Granted, questionable endeavours, and the lack of democratic control thereof, are precisely what fuel G. Hinton’s fears, but the risk of technology misuse is not specific to “AI”. And fortunately in this regard, the compelled strategy of OpenAI’s competitors to go open source2,3 is already mitigating this risk.
Both of these stances together highlight a critical problem posed by these large neural networks: we don’t understand how they represent information, and therefore, future perspectives are difficult to ground on scientific considerations. Sadly, it appears that the amazing efficiency of “generative AI”, together with corporate lock-in, has relegated proper scientific investigation to the background.
In this regard, the recent paper4 by the Microsoft Research team claiming that the most advanced version of conversational LLMs, GPT-4, exhibits “sparks of AGI” is a shining example that the field (and the marketing momentum) is going too fast to let researchers absorb the findings, take the time to reflect, and make solid contributions to our understanding of how semantic representations, and hence meaning, are encoded in deep neural networks.
My attention was drawn to this paper by comments on a LinkedIn thread where I, following the author of the original post, contended that, by design, LLMs cannot have a mental model of the world in the sense generally accepted for humans. After reading the core paragraphs of the paper and watching the accompanying talk5, not only do I stand by my initial comment, but I am also perplexed by the epistemological shortcomings of the discussion on the topic.
Concluding that GPT-4-like LLMs have a “mental model” of the world would indeed require two demonstrations:
- That a “mental model” can be reduced to semantic knowledge. Or alternatively, that a mental model has “emerged” (whatever that means) in the hidden states of the neural network from semantic representations.
- That GPT-4 internals exhibit interesting patterns akin to human mental manipulations such as judgments of causal relationships, similarity ratings, future predictions, etc.
While proposition 1 may not be formally disproved (i.e., there is no reason to a priori discard the possibility for semantic representations to recapitulate the mental models children develop in early childhood), it appears very unlikely, considering for example that children’s intuitive physics6 and psychology7 predate language acquisition.
Proposition 2 should be the aim of Bubeck et al.’s paper. But this would require the researchers to lift the veil of natural language, properly formulate hypotheses, and test them rigorously.
Good scientific intentions collide with corporate logic
Let’s begin with praise: this is a very entertaining, exciting, persuasive paper making a bold statement, namely that the latest iteration of LLMs could be viewed as an “early version of an artificial general intelligence”. Bubeck and colleagues want to lay the groundwork for their belief that “GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond”. These are very big claims that require solid evidence.
After a catchy introduction drawing on the most impressive feats of GPT-4 (more on that below), the paper goes on to emphasize an important challenge for the evaluation of large language models trained on potentially the whole internet: they cannot easily be benchmarked, because the benchmark datasets are likely part of the training data. Thus, to separate true learning from mere memorization of patterns in the training data, one has to assess the capacities of the system on novel tasks, independent of the training data or the training paradigm.
Although this precautionary statement is a valid one, it immediately raises two questions:
- How can one assess true learning of a model trained on potentially the whole internet?
- What answers can be reasonably expected from a state-of-the-art LLM, given what is known of its architecture, training data, and training procedure?
In the case of OpenAI’s GPT family of corporate-locked models, these questions call for clear-cut answers: one cannot.
Thus, Bubeck et al.’s article resorts to a compilation of awe-inspiring comparisons between the performances of two inscrutable black boxes.
The unfortunate parallel with experimental psychology
The authors’ call to apply the scientific paradigm of experimental cognitive psychology to the study of LLM capabilities falls short, because it overlooks what constitutes a scientific paradigm.
Every scientific field is structured around a few core, publicly exposed principles, assumptions, and methods which provide a satisfactory account of the phenomena of interest. This knowledge core represents what Thomas Kuhn calls a Scientific Paradigm8.
All the results from experimental psychology lean on cognitive theories which provide foundations to account for past results and guide further investigations in the field. For example, the study of visual attention now relies on a sophisticated, integrated framework of cognitive psychology and functional neuroscience9.
Such a scientific paradigm is a sheer necessity to design decisive experiments: experiments whose outcomes can further strengthen or undermine the paradigm, and ultimately make science progress.
A unified paradigm for NLP-linguistics integration being crucially lacking today, this paper was doomed from the outset: we are never told what can be reasonably expected from GPT-4; instead, we are merely invited to appreciate the “qualitative” shift in its performance compared to a less advanced version.
The mystifying effect of the mythical unicorn
The Strange Case of the Unicorn, emphasized at the beginning of the paper, is supposed to be the epitome of this “qualitative” shift in the chatbot’s performance: we are asked to believe that prompting the model to draw a unicorn in LaTeX is a “challenging task that requires combining visual imagination and coding skills”.
When I read it, this example immediately raised several scientific red flags:
- There is no clear definition of what is being assessed by the unicorn example. Worse, the use of anthropomorphic descriptions is inherently mystifying: if one replaces visual imagination with a mapping between a sequence of tokens and vector arithmetic, then the assertion above becomes far less provocative.
- There is no quantification of the phenomenon being investigated.
- There is no cross-validation of the method. How does the model perform with similar, supposedly “out-of-distribution”, prompts?
Drawing in the LaTeX language is certainly a challenging task for humans, but it is far from obvious that it is one for a statistical model with a trillion parameters. As Bubeck himself states in his talk, a trillion-parameter space is a big, flexible representational space.
As pointed out by Margaret Mitchell in her Twitter thread, there are plenty of TikZ examples of drawing fantasy animals on the internet that the model must have been exposed to during training.
Furthermore, from a cognitive point of view, asking the model to correct a piece of LaTeX code is not qualitatively different from asking it to correct an ill-constructed sentence into proper English.
Thus, a parsimonious conclusion from the unicorn experiment is not that it “demonstrates that GPT-4 can ‘see’ despite being a pure language model”, but that it has learned to map a sequence of tokens referring to a mythical animal to the LaTeX code specifying the corresponding vector graphics.
How is meaning encoded?
The unicorn example, and all the other impressive performances in various domains, do demonstrate a striking feature: meaning has emerged somewhere in the internal states of the network in a way computer scientists, linguists, and philosophers did not expect10.
In the early days of LLMs, back in 2019, an interesting paper investigated how BERT’s attention heads weigh contextual information when building the vector representation of a token11. The researchers showed that syntactic rules emerge in self-supervised trained Transformer models as attention heads learn to pay attention to syntactically related tokens (e.g., auxiliary verb forms, coreferent terms…). The activation of these attention heads could then be used to predict syntactic relationships and obtain a quantitative assessment of how these “attention maps” capture syntactic structures. This demonstrates that syntactic knowledge can be acquired without being explicitly taught in the training paradigm.
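As a rough, minimal sketch of what such an analysis involves (assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint; the layer/head pair is arbitrary, and this is an illustration rather than the cited paper’s code), one can extract the per-head attention maps and inspect which token each position attends to most strongly:

```python
# Minimal sketch: pull attention maps out of a pretrained BERT model and
# inspect, for one layer/head pair, which token each position attends to most.
# Assumes the Hugging Face `transformers` library; not the cited paper's code.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The keys to the cabinet are on the table."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each of shape (batch, num_heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 7, 9                       # arbitrary pair; a real probe scans them all
attn = outputs.attentions[layer][0, head]

# For each token, print the token it attends to most strongly.
for i, tok in enumerate(tokens):
    j = int(attn[i].argmax())
    print(f"{tok:>10} -> {tokens[j]}  ({attn[i, j].item():.2f})")
```

Scanning every layer/head pair for systematic alignment with dependency relations (subject-verb agreement, coreference, and so on) is what turns such raw maps into the quantitative assessment described above.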
This kind of analysis is precisely what is needed to start assessing the representation capacities of LLMs quantitatively and go beyond superficial awe-inspiring anthropomorphic descriptions.
From the perspective of an outsider to the field who once studied philosophy, this approach could also shed light on very obscure debates in linguistics, such as the articulation between compositionality (that is, linguistic meaning arising from the syntactic combination of words or locutions) and lexical semantics (that is, linguistic meaning conveyed by a word itself, in virtue of its relationship to the rest of the lexicon).
The impressive performance of generative LLMs may suggest that contextual information is all you need to build meaning. However, experiments testing latent semantic dimensions highlight that substantial gaps remain in the internal semantic representations of GPT-412, suggesting that human-specific, ecologically relevant, abstract semantic dimensions are not acquired by GPT-4.
Josh Tenenbaum’s team has formalized this idea in a paper13: building mental models of the world requires not only mastering the rules of language, but also comprehending the various “domain rules” (e.g., causal reasoning, social intelligence) governing the world that language helps us operate in. Can GPT-4 solve all these domains? Maybe it already can or eventually will, and researchers will undoubtedly explore this quantitatively in the near future.
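As a purely hypothetical illustration of what such a quantitative exploration could look like (the words, scores, and ratings below are placeholders, not data from any cited study), one could compare the model’s judgments along one abstract semantic dimension with human ratings of the same items:

```python
# Hypothetical sketch of a latent-semantic-dimension probe (placeholders only):
# compare a model's scores along one dimension (e.g., perceived danger)
# with human ratings of the same words, via rank correlation.
from scipy.stats import spearmanr

words = ["knife", "pillow", "shark", "kitten", "storm"]

# Placeholder numbers: in a real probe, model_scores would be elicited from the
# LLM (or read out of its representations) and human_ratings from a norming study.
model_scores = [4.0, 3.0, 4.5, 2.0, 2.5]
human_ratings = [4.5, 1.2, 4.8, 1.0, 3.9]

rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Rank agreement with human ratings: rho={rho:.2f} (p={p_value:.2f})")
# A weak or unstable agreement on ecologically relevant dimensions would be
# the kind of gap discussed above.
```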
I believe, however, that even if GPT-4 were convincingly shown to exhibit these capabilities, it would not mean it has acquired a naturalistic model of the world.
Thought is beyond connectionism
In a landmark 2011 review14, Josh Tenenbaum argues that the “world models” that allow humans to learn from and reason about the world originate in the paucity of data human infants have to cope with. He suggests that “thought” can be construed as the operations of a Bayesian cognitive machine which effectively combines observed natural regularities (the evidence) with hypotheses about structures in the world (the prior). In this review, Tenenbaum proposes that this abstract knowledge is itself acquired during infant cognitive development, i.e., infants “learn how to learn” through a multistage Hierarchical Bayesian Modeling of the world.
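To make that idea concrete, here is a toy, single-level sketch of the prior-evidence combination (my own illustration, not Tenenbaum’s hierarchical model): a prior belief about a coin’s bias is updated with a handful of flips via Bayes’ rule.

```python
# Toy sketch of Bayesian updating (illustration only, not Tenenbaum's model):
# combine a prior over hypotheses with the likelihood of sparse evidence.
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)              # hypothesis space: possible coin biases
prior = np.exp(-((thetas - 0.5) ** 2) / 0.02)     # prior belief: coins are usually near fair
prior /= prior.sum()

flips = [1, 1, 1, 0, 1]                           # sparse, noisy evidence (1 = heads)
heads, tails = sum(flips), len(flips) - sum(flips)
likelihood = thetas**heads * (1 - thetas)**tails  # probability of the data under each hypothesis

posterior = likelihood * prior                    # Bayes' rule (unnormalized)
posterior /= posterior.sum()

print("Posterior mode:", round(float(thetas[posterior.argmax()]), 2))
# With only five flips the prior still dominates: the posterior mode stays near
# 0.55, far from the maximum-likelihood estimate of 0.8.
```

The hierarchical twist in the review is that the prior itself is learned at higher levels of abstraction, which is what “learning how to learn” means in this framework.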
Thus, the way LLMs acquire their knowledge by self-supervised learning on extensive corpora of naturally observed language is radically different from how humans acquire their knowledge from sparse, noisy, and ambiguous data.
It is likely that the study of LLM intrinsic training and inference mechanisms will help us refine our understanding of human knowledge acquisition in a similar way that Deep Neural Networks have become a tool for studying the biology of visual perception15.
Yet to the best of current knowledge, LLMs are to be considered as mere statistical pattern matching machines, albeit extremely mighty, fun, and helpful.
References
Footnotes
Section titled [object Undefined]-
Geoffrey Hinton tells us why he’s now scared of the tech he helped build, MIT technology review ↩
-
Physics for infants: characterizing the origins of knowledge about objects, substances, and number, WIREs Cognitive Science ↩
-
The brain circuitry of attention, Trends in Cognitive Sciences ↩
-
What does BERT look at? An Analysis of BERT’s Attention, ACL Anthology ↩
-
Dissociating language and thought in large language models: a cognitive perspective, ArXiv ↩
-
How to Grow a Mind: Statistics, Structure, and Abstraction, Science ↩
-
Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future ↩