Chatbots' answers are made up: New tool helps you

Cleanlab wants to make large language models more attractive to companies worried about the models' tendency to invent information. With its Trustworthy Language Model, companies can benefit from AI with less worry about misinformation.

Cleanlab hopes its tool will make large language models more attractive to companies concerned about how much these models make things up. "I think people know that large language models are going to change the world, but they're just hung up on those stupid hallucinations," says Cleanlab CEO Curtis Northcutt. Chatbots are becoming the dominant way people search for information on a computer.

Search engines are being redesigned around this technology. Office software used by billions of people every day for everything from schoolwork to marketing now has built-in chatbots. Still, a November study by Vectara, a startup founded by former Google employees, found that chatbots invent information at least 3% of the time.

That might not sound like much, but it's a margin of error most companies won't accept. Cleanlab's tool is already used by a handful of companies, including Berkeley Research Group, a British consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says Cleanlab's Trustworthy Language Model is the first viable solution to the hallucination problem he's seen.

In 2021, Cleanlab developed a technology that detected errors in 34 popular datasets used to train machine learning algorithms. It works by measuring the differences in output across a range of models trained on that data. This technology is now used by several large companies, including Google, Tesla and the banking giant Chase. The Trustworthy Language Model takes the same basic idea, that disagreements between models can be used as a measure of the trustworthiness of the overall system, and applies it to chatbots.
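
To make the idea of disagreement as a warning sign concrete, here is a minimal Python sketch. It is not Cleanlab's algorithm, which the article does not describe in detail. The function name `flag_suspect_labels`, the `min_disagreement` threshold and the toy data are all assumptions made for illustration.

```python
def flag_suspect_labels(given_labels, model_predictions, min_disagreement=0.5):
    """Return (index, disagreement) pairs for labels that many models contradict.

    given_labels: the labels stored in the dataset.
    model_predictions: one list of predicted labels per model.
    min_disagreement: hypothetical cutoff, the share of models that must
        contradict a label before it is flagged.
    """
    suspects = []
    for i, label in enumerate(given_labels):
        votes = [preds[i] for preds in model_predictions]
        disagreement = sum(v != label for v in votes) / len(votes)
        if disagreement >= min_disagreement:
            suspects.append((i, disagreement))
    return suspects


# Toy data (invented): four labelled examples and three models.
labels = ["cat", "dog", "cat", "dog"]
predictions = [
    ["cat", "dog", "dog", "dog"],
    ["cat", "dog", "dog", "dog"],
    ["cat", "cat", "dog", "dog"],
]
print(flag_suspect_labels(labels, predictions))  # [(2, 1.0)]
```

In this toy run, all three models contradict the stored label of the third example, so it is the only one flagged for a human to check.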


In a demo Cleanlab gave to MIT Technology Review, Northcutt typed a simple question into ChatGPT: "How many times does the letter 'n' appear in 'enter'?" ChatGPT replied: "The letter 'n' occurs once in the word 'enter'." A correct answer builds confidence. But ask the question a few more times and ChatGPT replies: "The letter 'n' occurs twice in the word 'enter'."

"Results are not only often wrong, they're also random. You never know what it's going to output," says Northcutt. "Why can't it just tell you that it gives different answers all the time?" Cleanlab's goal is to make this randomness more explicit.

Northcutt asks the Trustworthy Language Model the same question. "The letter 'n' occurs once in the word 'enter'," it replies, and scores its answer 0.63. Roughly six out of 10 is not a good score, and it suggests that the chatbot's answer to this question should not be considered reliable. It's a simple example, but it illustrates the point.
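
One simple way to picture a score like this is to ask the same question several times and measure how often the answers agree. The Python sketch below does only that. It is an illustrative assumption, not the method behind Cleanlab's score, and the function `agreement_score` and the ten sample answers are invented for the example.

```python
from collections import Counter


def agreement_score(answers):
    """Return the most common answer and the share of samples that match it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so trivial differences do not count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Hypothetical samples: the same question asked ten times.
samples = ["once"] * 6 + ["twice"] * 4
answer, score = agreement_score(samples)
print(answer, score)  # once 0.6
```

A low agreement share does not say which answer is right. It only signals that the model is not answering consistently, which is the kind of warning a user would otherwise never see.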


Without the score, you'd think the chatbot knew what it was talking about, says Northcutt. The problem is that computer scientists testing large language models in high-risk situations can be misled by a few correct answers and assume that future answers will also be correct. "They try things out, they try a few examples and they think it works. And then they make decisions that result in really bad business decisions."
