Galileo, a provider of machine learning tools, recently introduced the LLM Hallucination Index to help businesses evaluate and monitor large language models (LLMs).
The index evaluates 11 leading LLMs across three common use cases: question answering (Q&A) without Retrieval-Augmented Generation (RAG), Q&A with RAG, and long-form text generation. For Q&A without RAG and for long-form text generation, models are scored on "correctness," the likelihood that a response is factually accurate: a higher score indicates a greater likelihood of correctness, while a lower score suggests a higher probability of hallucination. Q&A with RAG is instead scored on "context adherence," which measures how faithfully a response sticks to the information retrieved from the knowledge base.
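To make the two metrics concrete, here is a minimal, purely illustrative sketch in Python. The scoring functions are toy token-overlap proxies, not Galileo's actual methodology (the index relies on model-based evaluation); the function names and example data are assumptions chosen for illustration.

```python
# Toy proxies for the two index metrics. Illustrative only: Galileo has
# not published these formulas, and a real system would use LLM-based judges.

def correctness_proxy(response: str, reference_facts: list[str]) -> float:
    """Fraction of reference facts whose terms all appear in the response.

    Higher values suggest a more factually grounded answer; lower values
    suggest a higher risk of hallucination.
    """
    if not reference_facts:
        return 0.0
    hits = sum(
        all(term in response.lower() for term in fact.lower().split())
        for fact in reference_facts
    )
    return hits / len(reference_facts)

def context_adherence_proxy(response: str, retrieved_context: str) -> float:
    """Fraction of response tokens that also occur in the RAG context.

    A crude stand-in for "context adherence": fidelity of the answer to
    the documents the retriever supplied.
    """
    context_vocab = set(retrieved_context.lower().split())
    tokens = response.lower().split()
    if not tokens:
        return 0.0
    return sum(t in context_vocab for t in tokens) / len(tokens)

# A response that drifts from the supplied context scores visibly lower.
context = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
faithful = "The Eiffel Tower was completed in 1889"
drifting = "The Eiffel Tower was completed in 1925 by Roman engineers"
print(context_adherence_proxy(faithful, context))  # close to 1.0
print(context_adherence_proxy(drifting, context))  # noticeably lower
```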
Definition of Hallucinations in LLMs
Hallucinations in LLMs are defined as outputs that are factually incorrect and grounded neither in the training data nor in the context of the session; a model confidently citing a study that does not exist is a typical example.
Results of the Model Evaluation
According to the consolidated results, OpenAI's GPT-4 stood out as the best-performing model across all three tests, with the GPT-3.5 models generally taking second place. Meta's Llama 2 70B, Mistral AI's Mistral 7B, and Hugging Face's Zephyr 7B, however, demonstrated nearly comparable performance in the text generation evaluation.
Vikram Chatterji, co-founder and CEO of Galileo, noted that while OpenAI's models lead the index, there are other very good options on the market, and he stressed the importance of balancing correctness against cost.
Measurement and Monitoring of LLM Performance
The problem of evaluating and monitoring models first arose for Chatterji while he was at Google customizing LLMs for companies. He recognized that performance on the target use case should be weighted more heavily than general benchmarks, an insight that led to the founding of Galileo. Chatterji offered a more detailed discussion of LLM evaluation, along with a demonstration of Galileo's metric-tracking software, at Synthedia's LLM Innovation Conference.
From Experimentation to Data-Driven Decisions
Discussion of LLMs rarely goes deeper than superficial remarks. That is changing, however, as companies invest millions in generative AI capabilities and begin applying the technology to critical processes.
The measurement of LLM performance is not standardized, and custom tests are rarely well matched to business requirements. Data science and enterprise MLOps teams need tools that let them select appropriate benchmarks and run them at a reasonable cost; a sketch of such a harness appears below. Reducing errors such as hallucinations delivers significant value, and Galileo's Hallucination Index is a useful starting point because it indicates an LLM's propensity to hallucinate.
Starting from a model with a lower likelihood of hallucination should yield better results after retraining, fine-tuning, and prompt engineering. It also points to an emerging trend: business experimentation with LLMs to determine fit and impact will necessarily give way to a need for more performance data.
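As a sketch of what such benchmarking tooling might look like, the hypothetical harness below runs a fixed, business-specific test set against several models and reports an average score per model. `call_model`, the model names, and the exact-match scorer are all placeholders assumed for illustration, not any real vendor's API or Galileo's implementation.

```python
# A hypothetical benchmark harness, sketched under the assumption that
# you supply your own model adapter for your provider's SDK.

from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str      # question posed to the model
    reference: str   # expected answer, used by the scorer

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder adapter: replace the body with a real SDK call."""
    return f"[{model_name} response to: {prompt}]"

def exact_match(response: str, reference: str) -> float:
    """Toy scorer: 1.0 if the reference answer appears in the response."""
    return 1.0 if reference.lower() in response.lower() else 0.0

def run_benchmark(models: list[str], cases: list[EvalCase]) -> dict[str, float]:
    """Average the scorer over the same fixed test set, once per model."""
    return {
        model: sum(
            exact_match(call_model(model, c.prompt), c.reference)
            for c in cases
        ) / len(cases)
        for model in models
    }

if __name__ == "__main__":
    cases = [EvalCase("What year was the Eiffel Tower completed?", "1889")]
    # With the placeholder echo adapter, every model trivially scores 0.0;
    # wire call_model to real endpoints to get meaningful numbers.
    print(run_benchmark(["model-a", "model-b"], cases))
```

Keeping the test set fixed and business-specific is the point: it weights performance on the target use case over general benchmarks, which is exactly the gap the article describes.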
What Does All This Mean?
Galileo's Hallucination Index and associated performance metrics represent a significant advancement in the evaluation and monitoring of large language models. These tools not only offer a deeper understanding of model performance in specific situations but also enable businesses to make more informed and data-driven decisions.
As the field of artificial intelligence continues to advance, the ability to accurately and efficiently evaluate models becomes increasingly important. Businesses adopting these advanced technologies will be better equipped to navigate the complex world of AI, optimizing their investments and maximizing the value of their AI solutions.
Innovation in LLM evaluation, exemplified by Galileo's Hallucination Index, not only helps improve the quality and accuracy of models but also opens new possibilities for practical applications. With continuous improvement in model measurement and monitoring, businesses can expect more effective implementations and more reliable outcomes.
This evolution is crucial both for technological advancement in artificial intelligence and for ensuring that LLM-based applications are responsible and efficient, marking a new horizon for AI in business and beyond.
Remember that at Generative Labs we specialize in developing AI-based solutions. Learn more about our services.