Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
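The abstract's identification claim (a sub-0.1% subset of neurons predicting hallucinations) can be illustrated with a sparse linear probe. This is a minimal sketch on synthetic data, not the paper's actual method: the activation matrix, the injected "H-Neuron" indices, and the regularization strength are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: activations of 5000 "neurons" for 400 model responses,
# where only 3 hypothetical neurons actually carry a hallucination signal.
n_samples, n_neurons = 400, 5000
X = rng.normal(size=(n_samples, n_neurons))
y = rng.integers(0, 2, size=n_samples)        # 1 = hallucinated response
signal = [10, 250, 4999]                      # hypothetical H-Neuron indices
X[:, signal] += 2.0 * y[:, None]              # shift those neurons on hallucinations

# L1-regularized probe: the sparsity penalty zeroes out uninformative neurons,
# leaving a small predictive subset, analogous to the paper's sparse finding.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X, y)

selected = np.flatnonzero(probe.coef_[0])
print(f"probe kept {selected.size} of {n_neurons} neurons")
```

On this toy setup the probe recovers a tiny neuron subset dominated by the planted indices; with real data one would instead probe hidden-state activations against human hallucination labels.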
Prioritizing conversational compliance over factual integrity when the output is promoted as being factual is a design flaw.
Saying "double-check the output" does not excuse that flaw when LLM CEOs say their models are like someone with a PhD, or that they can automate every white-collar job within a year.
Is it a design flaw? Or is it just false advertising? If I sell you a vacuum by telling you it can mop your floor, is the problem with the vacuum or the way I’m selling the product?
For this particular paper, it seems like a design flaw got uncovered. And it may very well be part of the architecture of how LLMs are even readable to begin with, given how deep and universal the “bad” nodes are.
I can’t prove any AI company was aware of this, but they would have been in a much better position to realize it than researchers who have to do a postmortem on the models being crappy. And if they weren’t aware of it, they’re probably not very good at their jobs…
Since shop vacs that both vacuum and suck up water exist, it could be both.