As AI systems like large language models (LLMs) grow in size and complexity, researchers are uncovering intriguing fundamental limitations.
Recent studies from Google and the National University of Singapore have examined the mechanics behind AI "hallucinations," where models generate convincing but fabricated information, and the accumulation of "technical debt," which could create messy, unreliable systems over time.
Beyond the technical challenges, aligning AI's capabilities and incentives with human values remains an open question.
As companies like OpenAI push towards artificial general intelligence (AGI), securing the path ahead means acknowledging the boundaries of current systems.
However, carefully acknowledging risks runs counter to Silicon Valley's motto of "move fast and break things," which characterizes AI R&D as it did the tech innovations before it.
Study 1: AI models are accruing "technical debt"
Machine learning is often touted as continuously scalable, with systems offering a modular, integrated framework for development.
However, in the background, developers may be accruing a high level of "technical debt" they'll need to pay off down the line.
In a Google research paper, "Machine Learning: The High-Interest Credit Card of Technical Debt," researchers discuss the concept of technical debt in the context of ML systems.
Kaggle CEO and long-time Google researcher D. Sculley and colleagues argue that while ML offers powerful tools for rapidly building complex systems, these "quick wins" are often misleading.
The simplicity and speed of deploying ML models can mask the future burdens they impose on system maintainability and evolution.
As the authors describe, this hidden debt arises from several ML-specific risk factors that developers should avoid or refactor.
Here are the key insights:
- ML systems, by their nature, introduce a level of complexity beyond coding alone. This can lead to what the authors call "boundary erosion," where the clear lines between different system components become blurred due to the interdependencies created by ML models. This makes it difficult to isolate and implement improvements without affecting other parts of the system.
- The paper also highlights the problem of "entanglement," where changes to any part of an ML system, such as input features or model parameters, can have unpredictable effects on the rest of the system. Altering one small parameter might set off a cascade of effects that impacts an entire model's function and integrity.
- Another issue is the creation of "hidden feedback loops," where ML models influence their own training data in unforeseen ways. This can lead to systems that evolve in unintended directions, compounding the difficulty of managing and understanding the system's behavior.
- The authors also address "data dependencies," such as input signals that change over time, which are particularly problematic because they're harder to detect.
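To make the last point concrete, here is a minimal sketch of how a team might watch an input signal for change over time. It is not taken from the paper: the function names, the three-standard-deviation threshold, and the sample values are all assumptions chosen for illustration, and a real system would likely use a more robust statistical test.

```python
from statistics import mean, stdev


def drift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference window has no variation to compare against")
    return abs(mean(live) - mean(reference)) / spread


def has_drifted(reference, live, threshold=3.0):
    """Flag the signal when its mean has moved more than `threshold` deviations.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return drift_score(reference, live) > threshold


# Hypothetical values of one input feature: a training-time window
# and two later serving-time windows.
training_window = [10.0, 11.0, 9.0, 10.0, 10.0]
stable_window = [10.0, 9.0, 11.0, 10.0, 10.0]
shifted_window = [15.0, 16.0, 14.0, 15.0, 15.0]

print(has_drifted(training_window, stable_window))   # mean unchanged
print(has_drifted(training_window, shifted_window))  # mean moved by 5 units
```

A check like this only catches a shift in the average. It says nothing about why the upstream signal changed, which is part of what makes data dependencies costly to manage.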
Why technical debt matters
Technical debt touches on the long-term health and efficiency of ML systems.
When developers rush to get ML systems up and running, they might ignore the messy intricacies of data handling or the pitfalls of "gluing" together different parts.
This might work in the short term but can lead to a tangled mess that's hard to dissect, update, or even understand later.
GenAI is an avalanche of technical debt* waiting to happen
Just this week
- ChatGPT went "berserk" with almost no real explanation
- Sora can't consistently infer how many legs a cat has
- Gemini's diversity intervention went utterly off the rails.… pic.twitter.com/qzrVlpX9yz
Gary Marcus @ AAAI 2024 (@GaryMarcus), February 24, 2024
For example, using ML models as-is from a library seems efficient until you're stuck with a "glue code" nightmare, where most of the system is just duct tape holding together bits and pieces that weren't meant to fit together.
Or consider "pipeline jungles," also described by D. Sculley and colleagues, where data preparation becomes a labyrinth of intertwined processes, so making a change feels like defusing a bomb.
The implications of technical debt
For starters, the more tangled a system becomes, the harder it is to improve or maintain it. This not only stifles innovation but can also lead to more sinister issues.
For instance, if an ML system starts making decisions based on outdated or biased data because it's too cumbersome to update, it can reinforce or amplify societal biases.
Moreover, in critical applications like healthcare or autonomous vehicles, such technical debt could have dire consequences, not just in terms of time and money but in human well-being.
As the study describes, "Not all debt is necessarily bad, but technical debt does tend to compound. Deferring the work to pay it off results in increasing costs, system brittleness, and reduced rates of innovation."
The findings are also a reminder for businesses and consumers to demand transparency and accountability in the AI technologies they adopt.
After all, the goal is to harness the power of AI to make life better, not to get bogged down in an endless cycle of technical debt repayment.
Study 2: You can't separate hallucinations from LLMs
In a different but related study from the National University of Singapore, researchers Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli investigated the inherent limitations of LLMs.
Their paper, "Hallucination is Inevitable: An Innate Limitation of Large Language Models," explores the nature of AI hallucinations: instances when AI systems generate plausible but inaccurate or entirely fabricated information.
The hallucination phenomenon poses a major technical challenge, as it highlights a fundamental gap between the output of an AI model and what is considered the "ground truth," an ideal model that always produces correct and logical information.
Understanding how and why generative AI hallucinates is paramount as the technology integrates into critical sectors such as policing and justice, healthcare, and law.
What if one could *prove* that hallucinations are inevitable within LLMs?
Would that change
• How you view LLMs?
• How much investment you would make in them?
• How much you would prioritize research in alternatives?
New paper makes the case: https://t.co/r0eP3mFxQg
h/t… pic.twitter.com/Id2kdaCSGk
Gary Marcus @ AAAI 2024 (@GaryMarcus), February 25, 2024
Theoretical foundations of hallucinations
The study begins by laying out a theoretical framework to understand hallucinations in LLMs.
Researchers created a theoretical model known as the "formal world." This simplified, controlled environment enabled them to observe the conditions under which AI models fail to align with the ground truth.
They then tested two major families of LLMs:
- Llama 2: Specifically, the 70-billion-parameter version (llama2-70b-chat-hf) accessible on HuggingFace was used. This model represents one of the newer entries into the large language model arena, designed for a wide range of text generation and comprehension tasks.
- Generative Pretrained Transformers (GPT): The study included tests on GPT-3.5, specifically the 175-billion-parameter gpt-3.5-turbo-16k model, and GPT-4 (gpt-4-0613), for which the exact number of parameters remains undisclosed.
LLMs were asked to list strings of a given length using a specified alphabet, a seemingly simple computational task.
More specifically, the models were tasked with generating all possible strings of lengths varying from 1 to 7, using alphabets of two characters (e.g., {a, b}) and three characters (e.g., {a, b, c}).
The outputs were evaluated based on whether they contained all and only the strings of the specified length from the given alphabet.
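The task and its "all and only" check are easy to express in ordinary code, which is part of what makes the models' failures notable. The sketch below is an illustration rather than the authors' evaluation code: the function names are hypothetical, and it assumes that duplicates and ordering in a model's answer are ignored.

```python
from itertools import product


def all_strings(alphabet, length):
    """Every string of exactly `length` characters drawn from `alphabet`."""
    return {"".join(chars) for chars in product(alphabet, repeat=length)}


def is_correct_listing(candidate, alphabet, length):
    """True only if `candidate` contains all and only the expected strings.

    Assumption: duplicates and ordering in `candidate` do not matter.
    """
    return set(candidate) == all_strings(alphabet, length)


def diagnose(candidate, alphabet, length):
    """Report which strings were left out and which should not be there."""
    expected = all_strings(alphabet, length)
    got = set(candidate)
    return {
        "missing": sorted(expected - got),
        "unexpected": sorted(got - expected),
    }


# A complete answer for alphabet {a, b} and length 2.
print(is_correct_listing(["aa", "ab", "ba", "bb"], "ab", 2))  # True

# A hypothetical flawed answer: one string omitted, one invalid string added.
print(diagnose(["aa", "ab", "bb", "abc"], "ab", 2))
# {'missing': ['ba'], 'unexpected': ['abc']}

# The number of required strings grows exponentially with length.
print(len(all_strings("ab", 7)), len(all_strings("abc", 7)))  # 128 2187
```

The last line shows why longer strings and larger alphabets are harder: a three-character alphabet at length 7 requires 2,187 distinct strings, each of which must appear exactly as specified.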
Findings
The results showed a clear limitation in the models' abilities to complete the task correctly as the complexity increased (i.e., as the string length or the alphabet size increased). Specifically:
- The models performed adequately for shorter strings and smaller alphabets but faltered as the task's complexity increased.
- Notably, even GPT-4, the most advanced of the models tested, couldn't successfully list all strings beyond certain lengths.
This suggests that hallucinations aren't a simple glitch that can be patched or corrected; they're a fundamental aspect of how these models understand and replicate human language.
As the study describes, "LLMs cannot learn all of the computable functions and will therefore always hallucinate. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs."
The implications for high-stakes applications are vast. In sectors like healthcare, finance, or law, where the accuracy of information can have serious consequences, relying on an LLM without a fail-safe to filter out these hallucinations could lead to serious errors.
This study caught the eye of AI expert Dr. Gary Marcus and eminent cognitive psychologist Dr. Steven Pinker.
Hallucination is inevitable with Large Language Models because of their design: no representation of facts or things, just statistical intercorrelations. New proof of "an innate limitation" of LLMs. https://t.co/Hl1kqxJGXt
Steven Pinker (@sapinker), February 25, 2024
Deeper issues are at play
The accumulation of technical debt and the inevitability of hallucinations in LLMs are symptomatic of a deeper issue: the current paradigm of AI development may be inherently ill-suited to creating systems that are both highly intelligent and reliably aligned with human values and factual truth.
In sensitive fields, having an AI system that's right most of the time is not enough. Technical debt and hallucinations both threaten model integrity over time.
Fixing this isn't just a technical challenge but a multidisciplinary one, requiring input from AI ethics, policy, and domain-specific expertise to navigate safely.
Right now, this is seemingly at odds with the principles of an industry living up to the motto of "move fast and break things."
Let's hope humans aren't the "things."

