vuciv 8 hours ago

Author here. I've always been fascinated by the Sorites Paradox (at what point does a pile of sand become a heap?), so I decided to run an experiment to see how different LLMs handle vague predicates.

I didn't just want a text answer, so I measured the next-token probabilities (softmaxed logits) for the "Yes"/"No" tokens across pile sizes ranging from 1 to 100M grains.
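In case it's useful, here's roughly the shape of that measurement as a minimal sketch with Hugging Face transformers. The model name and prompt wording are placeholders, not the exact ones from the post:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder; any HF causal LM works
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    # Token ids for " Yes" / " No" (the leading space matters for most BPE tokenizers)
    YES = tok.encode(" Yes", add_special_tokens=False)[0]
    NO = tok.encode(" No", add_special_tokens=False)[0]

    def p_yes(prompt: str) -> float:
        """P('Yes') renormalized over just the {Yes, No} pair at the next token."""
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]  # next-token logits, shape [vocab]
        pair = torch.softmax(logits[[YES, NO]], dim=0)
        return pair[0].item()

    print(p_yes("A pile contains 500 grains of sand. Is it a heap? Answer Yes or No.\nAnswer:"))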

Key takeaways:

1. Prompting "Is this a heap?" directly is useless (the model just agrees with your framing).

2. Few-shot prompting creates a fascinating sigmoid "heapness curve" for most models (Mistral, DeepSeek); see the sketch after this list.

3. Llama-3-8B was the outlier—it remained perpetually uncertain (probs ~0.35-0.55) across almost the entire range. I argue this is actually the most "philosophically honest" reflection of how humans use the word.
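The few-shot variant is the same measurement with anchor examples prepended. The two anchors below are made up for illustration (not the ones from the post), and this reuses p_yes from the sketch above:

    # Hypothetical few-shot anchors, for illustration only
    FEW_SHOT = (
        "Q: A pile contains 3 grains of sand. Is it a heap? A: No\n"
        "Q: A pile contains 5000000 grains of sand. Is it a heap? A: Yes\n"
    )

    def heapness(n: int) -> float:
        return p_yes(FEW_SHOT + f"Q: A pile contains {n} grains of sand. Is it a heap? A:")

    sizes = [10 ** k for k in range(9)]        # 1, 10, ..., 100M grains
    curve = [(n, heapness(n)) for n in sizes]  # the "heapness curve" data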

I have a feeling there's an optimal prompt for this type of experiment, but I've struggled to find it, or even to know whether I've found it. The charts in the post are rendered in-browser using the data points I collected. Curious to hear your thoughts :)
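On "how would I know if I've found it": one crude way to compare prompts quantitatively is to fit a logistic in log10(grains) to the sweep and look at the midpoint and slope. This assumes scipy and the `curve` list from the sketch above:

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(x, x0, k):
        return 1.0 / (1.0 + np.exp(-k * (x - x0)))

    xs = np.log10([n for n, _ in curve])
    ys = np.array([p for _, p in curve])
    (x0, k), _ = curve_fit(logistic, xs, ys, p0=[4.0, 1.0])
    print(f"midpoint ~ 10^{x0:.2f} grains, slope {k:.2f}")

A sharper slope would mean the prompt pushes the model toward a crisp cutoff, while a flatter one preserves the Llama-3-8B-style uncertainty.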