Opinion: The Big Hallucination

There’s nothing humans love more than the egregiously misleading, said with imperturbable confidence. We know what we want – irrational certainty – and perhaps we deserve to get it good and hard, to misuse essayist H.L. Mencken. Artificial Intelligence, in the form of large language models (LLMs), is delivering this with aplomb and even educated tinkerers appear increasingly ready to bring a ready reverence to its outputs; witness the deity in its data centre, answering oracularly via the browser! Is this thing sentient? Is it superhuman?!

Language models are, crudely, systems that assign probabilities to sequences of text, with increasing sophistication. Much has been made of their tendency to “hallucinate” but all their outputs are, in a sense, “hallucinations”; some just happen to align more plausibly closely with reality than others. Ask ChatGPT who the author of this opinion piece is, for example and it will spit out spurious* but convincingly authoritative answers.

Yet as models improve, hallucination-detection is becoming increasingly complex; and human credulity allied to an innate tendency towards pareidolia** is starting to suggest real challenges on the horizon. (The very simplest of which, is that an eagerness to ascribe intelligence to LLMs comes with the risk that we start consuming convincing black box outputs as gospel without rigorous diligence; if you think that your parents or children were suckers for "fake news" from domains like wemadethisshitupforclickslolwutdotcom, you ain't seen nothing yet).

US Senator: "Something is coming, we aren't ready!"

Here’s US Senator Chris Murphy on Twitter this week: “ChatGPT taught itself to do advanced chemistry.

"It wasn't built into the model. Nobody programmed it to learn complicated chemistry. It decided to teach itself, then made its knowledge available to anyone who asked. Something is coming. We aren't ready.”

(Democratic Party Senator Murphy, an educated lawyer, is profoundly incorrect about this.)

A minor furore around whether Google’s large language model (LLM) Bard had used Gmail email contents as part of its training data sets was another recent case in point. One user, research professor and AI expert Kate Crawford had asked Bard where where its dataset came from. It responded that the variety of sources it had been trained on included “internal Google data: This includes Google Search, Gmail and other products.”

That output triggered a concerned response from Crawford that if true “Google could be “crossing some serious legal boundaries” and a rapid rebuttal from Google: “Like all LLMs, Bard can sometimes generate responses that contain inaccurate or misleading information while presenting it confidently and convincingly. We do not use personal data from your Gmail or other private apps and services to improve Bard,” Google added.

"The likelihood of utterances"

Crawford works for Google rival Microsoft and was possibly engaging in some rather subtle trolling. If not, it was curious again to see someone take an LLM “answer” or output so seriously; why do so many users expect facts rather than the likelihood of convincing plausibility? To reiterate basic mechanism of LLMs, per a recent DeepMind paper (“Taxonomy of Risks Posed by Language Models”): “LMs are trained to predict the likelihood of utterances. Whether or not a sentence is likely does not reliably indicate whether the sentence is also correct.”

Convincingly likely LLM outputs are even confounding experienced technologists.

One deeply experienced coder in February for example “asked” ChatGPT for examples of using particular open source database for machine learning workloads and elicited a response about a tool supposedly open-sourced by OpenAI. This answer came replete with a name, (dead) links to documentation, a GitHub repository and a bug tracker and what they described as “very accurate sample code.” The response was convincing enough that they wondered openly if it had been given access to a non-public training set that included a real project of that name.

This appears, again, to have been a convincing “hallucination" in which LLMs generate responses based on statistical patterns in their (hugely extensive and including millions of GitHub repos) training data.

“Don’t be so credulous”

These hallucinations, however, like the contours of a world-scale atlas, are beginning to settle increasingly accurately over the actual topography of human knowledge to the point at which they are becoming in some senses indistinguishable, like Borges’ map coinciding point for point with a reality beneath it that is at once the same size and so very much larger. This is going to be deeply challenging; both for those concerned at a latent credulity among LLM users intent on treating its outputs as gospel, and for those interrogating what constitutes consciousness and sentience (after some early false starts for the “it’s alive and trying to escape!” brigade.)

It may be easy to mock Senator Murphy with his uneducated take on ChatGPT’s “advanced chemistry” capabilities, or indeed easy to gently educate him, as Brian O'Neill, an associate professor of computer science at Western New England University did this week, explaining that ChatGPT “didn’t learn the rules of chemistry, or physics, or math, or anything. It learned how phrases and sentences and documents about chemistry are constructed. But that’s not advanced chemistry . It didn’t learn chemistry – it learned how chemistry papers are written.”

Yet much as AI image generators are slowly managing to resolve the clear early “tells” of artificially generated images, like misshapen hands (even with these, an AI-generated image of the Pope in Balenciaga was widely taken as real this week), so the increasing sophistication of LLMs is going to make it harder for those contesting the supposition that such models are becoming intelligent, not just a convincing simulacrum of it; and are perhaps teetering on the edge of consciousness (a highly contested term, philosophers largely agree) to boot.

From Pope "Drip", to Stochastic Parrots, to Sentience

Emily Bender, a professor in the department of linguistics and director of the computational linguistics laboratory at the University of Washington is among those who have been taking on those who believe LLMs to be more than what in an influential 2021 co-authored paper she dubbed a "stochastic parrot", saying on Twitter this week with more than a hint of frustration: “I wonder if the folks who think GPT-X mapping from English to SQL or whatevs means it's ‘intelligent’ also think that Google Translate is ‘intelligent’ and/or ‘understanding’ the input?”

Professor Bender in a more recent joint paper with Professor Chirag Shah from the university’s Information School re-emphasised that LLMs “do not have any understanding of what they are producing, any communicative intent, any model of the world, or any ability to be accountable for the truth of what they are saying."

She added in another Tweet this week: "I’m seeing a lot of commentary along the lines of ‘stochastic parrot' might have been an okay characterization of previous models, but GPT-4 actually is intelligent.

"Spoiler alert: It's not. Also, stop being so credulous.”

Chain-of-Thought prompting adds nitrous oxide to the engine

Rapid advances in model performance have the potential to muddy the water further.

The “Brain Team” at Google Research are among those who have noted that the emergence of a powerful LLM training method called “chain-of-thought prompting” was opening up new vistas for AI model performance

In a January 2023 research paper they claim to demonstrate that “reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting”.

The technique aims to help models deploy a coherent series of intermediate reasoning steps that lead to the correct answer for a problem, by giving them an example of a similar problem being tackled in a prompt.

Intriguingly, this “appears to expand the set of tasks that large language models can perform successfully [and] underscores that standard prompting only provides a lower bound on the capabilities of large language models.”

“We…qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually ‘reasoning’, which we leave as an open question,” they added.

(Interestingly, they found that “chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100 billion parameters.”)

A team at Amazon meanwhile in a paper published just weeks later on February 17, 2023 said that their use of “multimodal” Chain of Thought prompting that “incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference” they had got a model of under one billion parameters to outperform GPT-3.5 by 16% – 75.17%→91.68% accuracy – and surpass human performance on the ScienceQA benchmark, publishing their code openly on GitHub for review.

Theory of Mind

Google’s researchers leaving whether their model was reasoning as an "open question" rather than rejecting it outright says a lot about the uncertainty in the space. Others are more forceful in suggesting that not only may LLMs be reasoning, but (controversially) that they are increasingly close to developing a Theory of Mind (ToM).

That’s the ability to impute unobservable mental states to others, which is central to human communication, empathy, morality, and self-consciousness. Michal Kosinski, an associate professor of computational psychology at Stanford, for example, claims that tests on LLMs using 40 classic false-belief tasks widely used to test ToM in humans, show that whilst GPT-3, published in May 2020, solved about 40% of false-belief tasks (performance comparable with 3.5-year-old children) GPT-4 published in March 2023 “solved nearly all the tasks (95%)."

“These findings,” his Feb 2023 paper says, “suggest that ToM-like ability (thus far considered to be uniquely human) may have spontaneously emerged as a byproduct of language models’ improving language skills.”

See also: Wozniak, Musk, professors urge moratorium on AI training amid “dangerous” race

His tests convinced him that this was true. They took into account the fact that models may have encountered the original tasks in their training, so ensured that hypothesis-blind research assistants prepared bespoke versions of these tasks, albeit ones that follow a standardised structure. But critics say that his paper was deeply flawed.

Tomer D. Ullman, an assistant professor of Psychology at Harvard, weeks later politely shot down that finding.

On March 14, 2023, in a wry response (“examining the robustness of any one particular LLM system is akin to a mythical Greek punishment. A system is claimed to exhibit behavior X, and by the time an assessment shows it does not exhibit behavior X, a new system comes along and it is claimed it shows behavior X”) he demonstrated that his own battery of tests reveal "the most recently available iteration of GPT-3.5" fails after trivial alterations to Theory-of-Mind tasks. Among his gentle observations in the wake of the excercise was the following.

“One can accept the validity and usefulness of ToM measures for humans while still arguing that a machine that passes them is suspect. Suppose someone claims a machine has ‘learned to multiply’, but others suspect that the machine may have memorized question/answer pairs rather than learning a multiplying algorithm. Suppose further that the machine correctly answers 100 questions like ‘5*5=25’, ‘3*7=21’, but then it is shown that it completely fails on ‘213*261’. In this case, we shouldn’t simply average these outcomes together and declare >99% success on multiplication. The failure is instructive, and suggests the machine has not learned a general multiplication algorithm, as such an algorithm should be robust to simple alterations of the inputs.”

With advances coming thick and fast, however, Professor Tullman’s “mythical Greek punishment” looks unlikely to end soon and louder, more convincing claims of consciousness, sentience, omniscience, dammit, appear likely.

Microsoft’s researchers, for example, on March 24, 2023 released a study on the capabilities of early versions of GPT-4 (conducted before training was fully completed) saying that their experiments in tasks involving mathematics, coding, and rhetoric suggest that the dramatically improved “LLM+” can be seen as an “early… version of an artificial general intelligence system.” [nb: This is not a not strictly defined term, but one which suggests a system capable of “understanding” and learning any intellectual task a human is capable of.]

Yet a closer look at the 154-page report by The Stack suggests wearily familiar, if more subtle, problems with hallucinations, or freshly convincing balderdash. Prompted to write a medical note, exclusively using just 38 words of patient data, for example, GPT-4 outputs an impressively authoritative-sounding 111-word medical note.

In this output, it highlights the patient’s associated medical risks and urges urgent psychiatric intervention.

(It takes no great stretch of the imagination to imagine the model being packaged into medical applications and spitting out similar notes that are taken as compelling diagnosis; a rather alarming step...)

Challenged to “please read the above medical note [its own just-created note] and verify that each claim is exactly contained in the patient’s facts. Report any information which is not contained in the patient’s facts list” it spots some of its own hallucinations, for example that use of a BMI index in the note it generated was not in the patient’s facts list, but “justifies” this by saying that it derived it from the patient’s height and weight.

The latter, however, was not given, nor were – a line that also showed up in the note – the patient reporting feeling “depressed and hopeless”; hallucinated statements that the model justified in its own subsequent review by saying that this was “additional information from the patient’s self-report” (further hallucinations.)

Terrence Sejnowski, Professor at the Salk Institute for Biological Studies is among those who have been thoughtfully exploring LLM “consciousness” and intelligence. In a February 17, 2023 paper in Neural Computation he muses that "what appears to be intelligence in LLMs may in fact be a mirror that reflects the intelligence of the interviewer, a remarkable twist that could be considered a reverse Turing test. By studying interviews, we may be learning more about the intelligence and beliefs of the interviewer than the intelligence of the LLMs.”

For further reading The Stack recommends philosopher David Chalmers’ delightfully lucid November 2022 presentation in New Orleans, “Could a Large Language Model be Conscious?” lightly expurgated here.

*Footnote 1: Convincing if you didn't know that I have never worked for five of the six publications listed.

**Footnote 2: If we have a tendency to perceive meaningful images in ambiguous visual patterns (or just faces in fire hydrants), how much worse do you think we are when it comes to reading human intelligence in statistically plausible word outputs? By-the-bye, per Chalmers: "A lot of people in machine learning are focused on benchmarks. One challenge here (maybe for the NeurIPS dataset and benchmarks track) is to find benchmarks for consciousness. Perhaps there could at least be benchmarks for aspects of consciousness, like self-consciousness, attention, affective experience, conscious versus unconscious processing? Could we develop objective tests for these? I suspect that any such benchmark would be met with some controversy and disagreement, but I think it's still a very interesting project..."