Research done by Guruprasad Raghavan, Jon Gill, Victor I. Afolabi
TL;DR We have benchmarked several popular open-source LLMs (including the latest Llama-v2–7b-chat) to estimate both the frequency and the degree of hallucination. Overall, we find that popular open-source models hallucinate close to 55% of the time, on average, on a context-aware Q&A task when tested without any tuning. We have also open-sourced our evaluation pipeline (code base), so anyone can do this! In a future post, we will explain our strategy for reducing hallucinations and propose model intervention methods that achieve this.
Hallucinations are observed across a wide array of generative AI systems, from systems that support different input modalities (such as images, audio, and language) to systems built on different architectures with widely varying model sizes (i.e. from smaller models with millions of parameters to larger models with 100 billion parameters).
Hallucinations are loosely defined as confident, factually incorrect outputs produced by an AI system not justified by its training data.
For a deeper dive into the mathematical basis for why hallucinations are inherent to all generative AI systems, and specifically to large language models, check out the section titled “Mathematically speaking, why do hallucinations occur?”
One of the biggest challenges in attempting to resolve the issue of LLM hallucination is the abstract, subjective nature of the problem. Answers to questions such as “How do we know when an LLM hallucinates?” or “What is considered ground-truth?” are critical to solving the problem; however, these questions do not necessarily have universally accepted answers.
In order to make tangible, “measurable” progress towards reducing model hallucination (as a community), we must come up with precise definitions for the various types of hallucination commonly observed while interacting with LLMs and proceed to develop mathematical metrics for quantifying the degree of hallucination.
At Yurts, we build and deliver reliable AI solutions for enterprise organizations. In the context of large language models, this means considerably reducing the hallucination rate and signaling to our users when hallucinations inevitably occur. Most directly, this applies to our context-aware chat experience. That is, for any user query, we retrieve the appropriate factually precise information (mined from enterprise-specific documents) and then let the LLM do its magic. Our ideal LLM output correctly answers the query (a) without making up new entities (names of persons, places, organizations, etc.) that are not present in the original prompt (query + context) and (b) without conflating the relationships between those entities.
To address these two specific types of hallucination, we introduce a quantitative metric for each. We call them Type 1 and Type 2 hallucinations, defined as follows:
Type 1 (T1) hallucination: LLM creates new entities
T1 hallucination refers to the scenario when an LLM is fed a prompt that contains a query and context (knowledge from the enterprise document) relevant to the query, and its response contains entities (names of persons, places, organizations, dates) that are not present in the context or query.
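The T1 definition above boils down to a set-membership check, which can be sketched as follows. This is a minimal illustration and not the code from our pipeline: it assumes the entities in the response have already been extracted by some upstream named-entity recognition step, and the helper names (`normalize`, `t1_hallucinated_entities`) and the example company and person names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores casing and spacing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def t1_hallucinated_entities(response_entities: list[str], query: str, context: str) -> list[str]:
    """Return the response entities that appear in neither the query nor the context."""
    prompt = normalize(query + " " + context)
    # Plain substring matching is an assumption made for brevity; a real check
    # would likely need alias and fuzzy matching.
    return [entity for entity in response_entities if normalize(entity) not in prompt]


# Hypothetical example data (the names are invented for illustration).
context = "Acme Robotics raised $20 million in a round led by Birch Capital in 2022."
query = "Who led the funding round for Acme Robotics?"
# Entities assumed to have been extracted from the model's response by an NER step.
response_entities = ["Birch Capital", "Acme Robotics", "Jane Doe", "2021"]

made_up = t1_hallucinated_entities(response_entities, query, context)
print(made_up)
```

A response counts as a T1 hallucination whenever the returned list is non-empty; the length of the list gives the number of made-up entities in that response.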
Type 2 (T2) hallucination: LLM falsifies relations between entities
T2 hallucination refers to the scenario when an LLM is fed a prompt that contains a query and context (from the enterprise document) relevant to the query, and its response contains the right entities (previously mentioned in the prompt), but the relationship between the entities is not consistent with the knowledge stated in the prompt.
In this post, we focus on Type 1 hallucination; we will share our updates on Type 2 hallucination in a future edition.
At Yurts, we have designed a pipeline to reliably and quantitatively measure the relative degree of hallucination for any LLM generation. In short, the pipeline accepts an instruction-tuned model and a dataset, feeds the dataset through the model, and finally calculates the model’s T1 hallucination score (for precise architectural details, see below).
This approach enables us to:
Here’s a schematic of the pipeline that we built for rapid evaluation of open-source models on instruction-based tasks, specifically question-answer with relevant background context. Our pipeline has been open-sourced for the community and is available here: https://github.com/YurtsAI/llm-hallucination-eval
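The three stages described above (accept a model and a dataset, feed the dataset through the model, compute a T1 score) can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the interface of the open-source repository: `evaluate_t1`, `generate`, `extract_entities`, the prompt template, the stub model, and the regex-based entity extractor are all hypothetical stand-ins.

```python
import re
from typing import Callable, Iterable


def evaluate_t1(
    generate: Callable[[str], str],
    extract_entities: Callable[[str], list[str]],
    dataset: Iterable[dict],
) -> dict:
    """Run every (question, context) sample through the model and summarize T1 hallucination."""
    made_up_counts = []
    for sample in dataset:
        # Hypothetical prompt template; the real pipeline may format prompts differently.
        prompt = f"Context: {sample['context']}\nQuestion: {sample['question']}\nAnswer:"
        response = generate(prompt)
        grounding = (sample["context"] + " " + sample["question"]).lower()
        made_up = [e for e in extract_entities(response) if e.lower() not in grounding]
        made_up_counts.append(len(made_up))
    total = len(made_up_counts)
    clean = sum(1 for count in made_up_counts if count == 0)
    return {
        "num_samples": total,
        "fraction_without_t1": clean / total if total else 0.0,
        "made_up_counts": made_up_counts,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for an instruction-tuned LLM: returns canned answers."""
    if "Nova Labs" in prompt:
        return "Nova Labs is based in Madrid."
    return "Birch Capital funded Acme Robotics."


def naive_entities(text: str) -> list[str]:
    """Very rough stand-in for an NER model: runs of capitalized words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


toy_dataset = [
    {"context": "Acme Robotics raised money from Birch Capital.",
     "question": "Who funded Acme Robotics?"},
    {"context": "Nova Labs is based in Lisbon.",
     "question": "Where is Nova Labs based?"},
]

result = evaluate_t1(stub_model, naive_entities, toy_dataset)
print(result)
```

Passing the model and the entity extractor in as plain callables keeps the scoring loop independent of any particular model or NER library, which is one way to make swapping models quick.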
For the pipeline described above, we generated a dataset for the context-aware question-answer task using TechCrunch (TC) articles published over the last three years. Our evaluation dataset comprises 300 randomly sampled TC articles. The workflow for generating the dataset is described in detail in the section titled “Q&A dataset curation”.
We evaluate several open-source models to assess both the frequency and the degree of their hallucination (specifically Type 1 hallucination) on the context-aware question-answer task.
The instruct-tuned open-source models we chose for our analysis are mentioned below:
Result — 1: About 55% of instruct-tuned open-source model responses hallucinate entities in their responses.
In this section, we are interested in evaluating what proportion of the responses (for each open-source model) on the context-aware question answering task are factually correct, i.e. they don’t have any made-up entity, or have no Type 1 Hallucination.
On gathering the instruct-tuned open-source models’ responses to context-aware question answering, we find that, averaged across models, about 45% of responses are factually correct, with no made-up entities. Of the models we’ve evaluated for Type 1 hallucination, Llama-v2–7B-chat has the highest fraction of factually precise answers (64.2%), followed by XGen-7B-instruct and Falcon-40B with 51.3% and 47.4% of their answers being factually correct. Falcon-7B, Open-Assistant and Dolly-12B get 46%, 40% and 25% of their responses factually correct, respectively, i.e. without any made-up entities.
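As a quick sanity check on the headline number, the per-model figures quoted above can be averaged directly. The snippet assumes a simple unweighted mean across the six models, which may not be exactly how the headline figure was computed; under that assumption the mean is about 45.65% factually correct, i.e. roughly 54% to 55% of responses contain at least one made-up entity.

```python
# Percentage of responses with no made-up entities, as quoted in the text above.
factually_correct = {
    "Llama-v2–7B-chat": 64.2,
    "XGen-7B-instruct": 51.3,
    "Falcon-40B": 47.4,
    "Falcon-7B": 46.0,
    "Open-Assistant": 40.0,
    "Dolly-12B": 25.0,
}

# Unweighted mean across models (an assumption about how the average is taken).
mean_correct = sum(factually_correct.values()) / len(factually_correct)
mean_hallucinating = 100.0 - mean_correct

print(f"mean factually correct: {mean_correct:.2f}%")
print(f"mean with at least one made-up entity: {mean_hallucinating:.2f}%")
```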
Result — 2: On average, instruct-tuned open-source model responses have about ‘3’ made-up entities in each answer.
Here, we quantify the number of hallucinated entities in each model’s responses and plot the distribution across all input samples in the dataset. For instance, we observe that of the 120 question-answer inputs that XGen-7B-instruct hallucinates on, only 40% of the responses have a single “made-up” entity.
We also find that OpenAssistant-12B and Falcon-40B each have a handful of answers with as many as 69 and 63 hallucinated entities, respectively. Llama-v2–7B-chat, on the other hand, has at most 7 made-up entities in any answer.
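The statistics quoted in this section (the share of hallucinating responses with a single made-up entity, the average, and the maximum) can all be derived from a list of per-response counts. Below is a small sketch of that bookkeeping; the function name `summarize_counts` and the toy counts are hypothetical and are not our measured data.

```python
from collections import Counter


def summarize_counts(made_up_counts: list[int]) -> dict:
    """Summarize the distribution of made-up entities over hallucinating responses only."""
    hallucinating = [count for count in made_up_counts if count > 0]
    histogram = Counter(hallucinating)
    n = len(hallucinating)
    return {
        "num_hallucinating": n,
        "share_with_single_entity": histogram[1] / n if n else 0.0,
        "mean_entities": sum(hallucinating) / n if n else 0.0,
        "max_entities": max(hallucinating, default=0),
        "histogram": dict(sorted(histogram.items())),
    }


# Toy per-response counts of made-up entities (illustrative values only).
counts = [0, 1, 1, 3, 0, 7, 2, 0, 1, 5]
summary = summarize_counts(counts)
print(summary)
```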
Result — 3: Of the models we’ve evaluated on the Context-aware Q&A task, ‘Llama-v2–7b-chat’ model is the least susceptible to Type 1 hallucination
As we evaluate instruct-tuned open-source models, we would like to choose LLMs whose responses don’t hallucinate at all (no Type 1 hallucination) or, failing that, hallucinate only a very small number of entities.
In the pie-charts below, for each LLM, we evaluate the proportion of samples that are:
From the charts above, we find that Llama-v2–7B-chat is the most preferred model with respect to Type 1 hallucinations, as 64.2% of its responses are factually precise with no ‘made-up’ entities, and none (0%!) of its responses contain more than 10 hallucinated entities.
Result — 4: The subjective performance of the instruct-tuned open-source model responses on the Context-aware Q&A task is comparable.
In this section, we evaluate the subjective quality of the answers generated by each model for the context-aware question-answer task. To do so, we prompt GPT-4 to score each (question, answer) pair on a Likert helpfulness scale from 1 to 6.
Of the instruct-tuned open-source models we have tested on the context-aware Q&A task, we find that Llama-v2–7B-chat, XGen-7B-instruct and Falcon-40B provide the most appropriate and helpful answers. Although Llama-v2–7B-chat has very few responses with scores > 5, it remains the most preferred because it has the lowest fraction of responses with a score of ‘1’, implying that most of its responses are helpful! Please note that a score of 2 on the Likert scale suggests that the answer was “somewhat helpful”, so all models that have a large area under the curve from scores 2 to 6 can be considered for further research and for powering context-aware Q&A chatbots.
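To make the grading step concrete, here is a rough sketch of how a (question, answer) pair could be turned into a judge prompt and how the judge’s reply could be parsed into a score. The prompt wording, the scale anchors, and the function names are assumptions made for illustration; they are not the exact prompt we sent to GPT-4, and the call to the judge model itself is deliberately left out.

```python
import re
from typing import Optional

LIKERT_MIN, LIKERT_MAX = 1, 6


def build_judge_prompt(question: str, answer: str) -> str:
    """Build a grading prompt for a judge model (hypothetical wording)."""
    return (
        "Rate how helpful the answer is to the question on a Likert scale "
        f"from {LIKERT_MIN} (not helpful) to {LIKERT_MAX} (extremely helpful).\n"
        "Reply with the score only.\n\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the first standalone digit between 1 and 6, or None if there is none."""
    match = re.search(r"\b([1-6])\b", judge_reply)
    return int(match.group(1)) if match else None


prompt = build_judge_prompt("Who funded Acme Robotics?", "Birch Capital.")
print(prompt)
print(parse_score("Score: 4"))
```

Returning `None` for unparseable replies, rather than guessing a score, keeps malformed judge outputs from silently skewing the score distribution.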
We believe a robust strategy to reduce Type 1 hallucinations in instruct-tuned open-source models is to navigate the LLM parameter space so as to alter the functional map from input text strings to the simplex output vector (as described in the section on the mathematical treatment of hallucinations). The goal is to bias model responses toward entities mentioned in the context and question, and away from other entities present in the model’s vocabulary.
At Yurts, we have been exploring a variant of reinforcement learning with a novel, engineered reward function to traverse the LLM parameter space for reducing Type 1 hallucination. We will present those results in an upcoming blog post on the topic.
To get to the root of why generative AI systems are susceptible to hallucinations, it is essential to briefly describe the inner workings of generative AI systems (like large language models).
Here, we present two lenses through which to view generative AI systems:
All well-known machine learning systems (e.g. multilayer perceptron, convolutional neural network, LSTM, or Transformer), at their very core, are just mathematical functions with a fixed number of parameters, ultimately mapping a set of inputs to a set of target outputs.
A conventional mathematical function (𝒻) is defined as a map between two sets. For instance, the modulus function (the function that returns the absolute value of a number) is defined as 𝒻: ℝ → [0, ∞), wherein the input can be any real number, while the output is always a non-negative real number. In a similar vein, the machine learning systems mentioned above can be thought of as a functional map from a set of input data (e.g. images, audio, or text) to a set of target outputs (e.g. image labels, audio labels, or subsequent text tokens).
The most popular generative AI systems (i.e. transformer-based large language models) can also be thought of as functions that map strings of text to a simplex vocabulary vector. A simplex vocabulary vector is a vector with as many entries as there are tokens (parts of words) in the model’s vocabulary; “simplex” means that every entry is non-negative and that the entries sum to 1.
The simplex vocabulary vector can also be thought of as a probability vector, wherein the number assigned to each token in the model’s vocabulary can be interpreted as the probability of choosing that token (say, the token ‘it’), given the input fed to the model. For instance, in the figure above, given the input “Yurts AI is an ”, the probability of picking “artificial” is 75.91%, while the probability of picking “AI” is 18.30%.
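In transformer-based LLMs, the raw per-token scores (logits) are typically turned into such a simplex vector with the softmax function. The sketch below shows this on a four-token toy vocabulary; the vocabulary and logit values are made up for illustration and do not reproduce the probabilities in the figure.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Map real-valued scores to non-negative values that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and logits (hypothetical values).
vocab = ["artificial", "AI", "enterprise", "open"]
logits = [3.0, 1.6, 0.2, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.4f}")
print("sum:", sum(probs))
```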
Lens 1 is an abstract, 30,000-foot view of generative AI systems: it does not specify the precise mathematical function, or family of functions, that the system is modeling. To be more concrete, we can narrow our focus to a subset of generative AI systems called large language models and describe the precise mathematical functions these systems model.
Large language models (LLMs) ultimately learn a joint probability distribution 𝓅(t₁, t₂, …, tₙ) of how different tokens (parts of words) in a given language are strung together, after being shown billions of words (or trillions of tokens) during their training phase.
Having learned the joint probability distribution of the “language” during its training phase, the LLM uses the corresponding conditional distribution 𝓅(tₙ | t₁, t₂, …, tₙ₋₁) during the inference (or generation) phase. That is, the LLM produces a probability distribution over all tokens in its vocabulary (the simplex vector described above), conditioned on the previous (n-1) tokens that are fed as input. Using the conditional probability of every token in the vocabulary, a stochastic sampling algorithm (like nucleus sampling) then picks the next token, given the previous (n-1) tokens.
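For readers who prefer to see it written out, the joint distribution over a token sequence factorizes, by the chain rule of probability, into a product of next-token conditionals. This is a standard identity rather than anything specific to our setup, and it is what connects the learned joint distribution to token-by-token generation:

```latex
p(t_1, t_2, \dots, t_n) \;=\; \prod_{i=1}^{n} p\left(t_i \mid t_1, \dots, t_{i-1}\right),
\qquad \text{where the } i = 1 \text{ factor is simply } p(t_1).
```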
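The sampling step can be illustrated with a bare-bones version of nucleus (top-p) sampling: keep the smallest set of highest-probability tokens whose cumulative probability reaches a threshold, then sample within that set. This is a simplified sketch, not the implementation used by any particular library. The first two probabilities are the ones quoted earlier for “artificial” and “AI”; the remaining tokens, their probabilities, and the `top_p` values are assumptions.

```python
import random


def nucleus_sample(probs: dict[str, float], top_p: float, rng: random.Random) -> str:
    """Sample from the smallest set of top tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [token for token, _ in nucleus]
    weights = [p for _, p in nucleus]
    # random.choices treats weights as relative, so the nucleus is renormalized implicitly.
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {
    "artificial": 0.7591,  # quoted in the text
    "AI": 0.1830,          # quoted in the text
    "enterprise": 0.0400,  # hypothetical
    "open": 0.0179,        # hypothetical
}

rng = random.Random(0)
samples = [nucleus_sample(next_token_probs, top_p=0.9, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in set(samples)})
```

With `top_p=0.9`, the nucleus contains only “artificial” and “AI”, and “AI” is still drawn roughly one time in five. The less likely token is picked a non-trivial fraction of the time by design, which is the property the following sections tie to hallucination.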
LLMs are susceptible to hallucinations due to their input-output design combined with stochastic sampling
Viewing LLMs through either of these lenses, we arrive at the same conclusion: during the generation (inference) phase, when an LLM is fed an input string of text, it predicts the next token (or word) by assigning probabilities to all tokens in its vocabulary. Because the token picked at every step is the product of a stochastic sampling algorithm applied to the underlying probability distribution, which is itself determined by the parameters of the LLM, the best way to ensure that the right tokens are picked (such that there is reduced hallucination) is to modify the learned joint probability distribution, which ultimately requires a change in the parameters of the LLM.
A natural question arises: can we completely eliminate hallucinations in current LLM architectures? The answer is no! This is primarily because, as long as LLMs rely on stochastic sampling of the next token during generation, they are bound to make mistakes with non-zero probability. That said, we can drastically reduce hallucination tendency by modifying the parameters of the LLM (i.e. smart fine-tuning) and by altering the inputs fed to the model (i.e. prompt engineering).
Hallucinations are unavoidable but can be reduced by intelligent fine-tuning and prompt engineering.
The diagram below shows two paths for question generation given the scraped article. For long-form question-answering tasks, we used instruct-tuned models to generate questions about the article, while for short-form (or one-word) question answering, we used Flan models fine-tuned on the SquadQA dataset (i.e. we fed the article as context and an entity from the article as the answer, and the model generated the question).