I found a very interesting comment on HN about entropy-based sampling, which suggests this approach is fundamentally flawed:
Everyone in the comments seems to be arguing over the semantics of the words and anthropomorphization of LLMs. Putting that aside, there is a real problem with this approach that lies at the mathematical level.
For any given input text, there is a corresponding output text distribution (i.e. the probabilities over word sequences from which the model draws samples).
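Since the thread is about entropy-based sampling, it may help to make the per-step side of that concrete. Below is a minimal sketch (my own, not taken from the comment or from Entropix's code) of the entropy and varentropy of a single next-token distribution, the kind of statistics an entropy-based sampler branches on.

```python
import numpy as np

def logit_entropy_stats(logits):
    """Entropy and varentropy of one next-token distribution.

    A generic sketch of the per-step signals entropy-based samplers
    look at; it does not reproduce Entropix's actual decision rules.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax.
    log_p = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    p = np.exp(log_p)
    entropy = -(p * log_p).sum()                     # H(p), in nats
    varentropy = (p * (log_p + entropy) ** 2).sum()  # variance of the surprisal
    return entropy, varentropy
```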
The problem with drawing several samples and evaluating the entropy and/or disagreement between those draws is that it relies on already knowing the properties of the output distribution. One distribution may legitimately be much more uniformly random than another that has high certainty. It's not clear to me that they have demonstrated the underlying assumption.
Take celebrity info, for example: "What is Tom Cruise known for?". The phrases "movie star", "Katie Holmes", "Top Gun", and "Scientology" are all quite different in terms of their location in the word vector space and would result in low semantic similarity, but all are accurate outputs.
On the other hand, for "What is Taylor Swift known for?", the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations. Without knowing the distribution's characteristics (e.g. multivariate moments and estimates), we cannot tell whether answers are correct merely from their proximity in vector space.
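To make the objection concrete, here is a small sketch of the sample-and-compare check the comment is describing: embed the sampled answers and read their mean pairwise cosine similarity as "agreement". The embedding model is my assumption for illustration; the point is that the score alone cannot separate "diverse but correct" (the Tom Cruise answers) from "clustered but hallucinated" (the Taylor Swift answers).

```python
# pip install sentence-transformers  (the model choice is an assumption for illustration)
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_agreement(answers):
    """Mean pairwise cosine similarity of sampled answers.

    High agreement is what sample-and-compare methods read as confidence;
    the two examples above show why that reading can fail in both directions.
    """
    emb = embedder.encode(answers, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T                                          # cosine similarities
    off_diag = sims[~np.eye(len(answers), dtype=bool)]
    return float(off_diag.mean())

# Diverse but correct: low agreement would be misread as uncertainty.
print(semantic_agreement(["movie star", "Katie Holmes", "Top Gun", "Scientology"]))
# Clustered but hallucinated: high agreement would be misread as confidence.
print(semantic_agreement(["standup comedy", "comedian", "comedy actress"]))
```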
As some have pointed out in this thread, knowing the correct distribution of word sequences for a given input sequence is the very job the LLM is solving, so there is no way of evaluating the output distribution to determine its correctness.
There are actual statistical models for evaluating the uncertainty in ANN outputs (albeit somewhat limited), but they are probably not feasible at the scale of LLMs. Perhaps a layer or two (e.g. the final two layers) could be used to create a partial estimate of uncertainty, but this would be a severe truncation of the overall network uncertainty.
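For what a "final layers only" estimate might look like, here is one possible sketch using Monte-Carlo dropout restricted to the output projection. `hidden` and `lm_head` are placeholders for whatever model is being probed, and this is just one reading of the idea, not something proposed in the thread.

```python
import torch
import torch.nn.functional as F

def last_layer_mc_uncertainty(hidden, lm_head, n_samples=16, p_drop=0.1):
    """Partial uncertainty estimate from the final projection only.

    Applies MC dropout to the last hidden state, runs it through the
    output head several times, and splits predictive entropy into an
    expected-entropy term plus a mutual-information (disagreement) term.
    Everything upstream of `hidden` is treated as deterministic, which is
    exactly the "severe truncation" warned about above.
    """
    probs = []
    for _ in range(n_samples):
        h = F.dropout(hidden, p=p_drop, training=True)   # stochastic forward pass
        probs.append(F.softmax(lm_head(h), dim=-1))
    probs = torch.stack(probs)                           # (n_samples, vocab)
    mean_p = probs.mean(dim=0)
    predictive_entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum()
    expected_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
    mutual_info = predictive_entropy - expected_entropy  # epistemic / disagreement part
    return predictive_entropy.item(), mutual_info.item()
```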
Another reason I mention this is that most hallucinations I encounter are very plausible and often close to the right thing (a swapped variable name, a confabulated config key); they appear very convincing and "in sample", but are actually incorrect.
I really wonder whether viewing Entropix as an EBM and considering the Helmholtz free energy, instead of just the entropy, would be a better idea. Let me look into this and whether it could potentially resolve one or more of these points.
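For reference, the free-energy view is easy to state for a softmax head: treat the logits as negative energies, so the Helmholtz free energy F = U - T*S folds the expected energy and the entropy into one scalar instead of looking at the entropy alone. A minimal sketch of that identity (standard statistical mechanics, nothing specific to Entropix):

```python
import numpy as np
from scipy.special import logsumexp

def ebm_quantities(logits, T=1.0):
    """Treat logits as negative energies (an EBM view of the softmax head).

    Returns (F, U, S): Helmholtz free energy F = -T * log Z, expected
    energy U, and Shannon entropy S of the softmax at temperature T.
    F = U - T*S, so it weighs certainty (low energy) against entropy
    rather than looking at entropy in isolation.
    """
    energies = -np.asarray(logits, dtype=np.float64)
    log_Z = logsumexp(-energies / T)                     # log partition function
    p = np.exp(-energies / T - log_Z)                    # Boltzmann / softmax weights
    U = (p * energies).sum()                             # expected energy
    S = -(p * np.log(np.clip(p, 1e-300, None))).sum()    # Shannon entropy (nats)
    F = -T * log_Z                                       # Helmholtz free energy = U - T*S
    return F, U, S
```

Whether thresholding F during decoding behaves any better than thresholding S is exactly the open question here.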