Vibe-code-golf and other LLM games

Shashvat Shukla

Jul 9, 2025

The purpose of these LLM games is to understand LLMs better.

3 Comments

Ashwin

Jul 16, 2025

I tried the n+1 game, and the smallest number that failed for me so far was 9998999: for 9998999+1 it incorrectly gave me 10,000,000.

I had some queries from this.

1. In your blog picture, the number featured was 99999899999, was that the smallest number you personally found? Or was that just meant as an example?

2. If 99999899999 was indeed the smallest number you found, why is ChatGPT giving different answers to different people? My prompt was extremely basic (literally "9998999+1"), I did not gaslight the model, and it was in a completely new chat.

3. Why has this issue not been fixed yet? This is a very simple mathematical calculation problem. Is it hard to fix? I thought that it should be easy to fix, especially considering their other issues in the past that they managed to improve on, like image generation with a table on a cat or a wine glass which is full to the brim with orange juice for example.

4. Adding on to query 3, if it is indeed hard to fix mathematical problems, why don't the developers just make it so that ChatGPT passes any simple arithmetic sums to a calculator and then returns the output?

Shashvat Shukla

Jul 16, 2025

1. Just the best I found, but I only tried a handful of numbers.

2. My prompt also had an = in it, so our prompts were different. Even if they had been the same, the "temperature" setting of the LLM controls how random the output is. I'm not sure what temperature ChatGPT uses, but we can make this test more objective by using the API and setting the temperature to 0.

3. The next launch from OpenAI will do this. They made a UX error by letting the user pick the model, which became complicated for most users. Currently, if you switch the model to o3, it can handle these small numbers, but as I said, there are bigger arithmetic problems it still fails at. When they release GPT-5, it will do a lot more of selecting the right model. That also means the models use more compute to be more intelligent, so that might be why they didn't do it right away. The default is 4o because it's cheap; o3 is expensive for them.

4. This is called tool calling and yes that's how more complex chat agents work, like o3.
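
To make point 4 concrete, here is a minimal sketch of the "pass it to a calculator" idea. It is a simplification: in real tool calling the model itself decides when to request a tool, whereas here a plain router stands in for that decision. The names `calculator`, `answer`, and `fallback` are hypothetical and not part of any ChatGPT or OpenAI API.

```python
import ast
import operator

# Hypothetical "calculator tool": exact integer arithmetic only.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def calculator(expression: str) -> int:
    """Evaluate an expression made of integers, +, -, * and parentheses."""

    def ev(node):
        # Plain integer literal (bools are rejected by the exact type check).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        # Supported binary operators.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        # Unary minus, e.g. "-5".
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    # Tolerate a trailing "=" as in the prompt "9998999+1=".
    cleaned = expression.strip().rstrip("=").strip()
    return ev(ast.parse(cleaned, mode="eval").body)


def answer(prompt: str, fallback) -> str:
    """Use the calculator for plain arithmetic, otherwise ask the model.

    `fallback` is an assumed callable that sends the prompt to an LLM.
    """
    try:
        return str(calculator(prompt))
    except (ValueError, SyntaxError):
        return fallback(prompt)
```

With this router, `answer("99999899999+1=", fallback)` never reaches the model, so the result does not depend on how the model handles long digit strings.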

Ashwin

Jul 17, 2025

Very informative replies, thanks!

Btw, for point 2, I tried it again with the equal sign too, and now it gave the right answer to "9998999+1=", but still the wrong answer to "99999899999+1=". Crazy how the seemingly meaningless addition of a symbol like an equal sign at the end of the query influences the accuracy.
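
One way to compare prompt variants like these more objectively is to score each reply against the exact answer. The sketch below is only an illustration: `ask` is a hypothetical callable that would wrap an API call (for example with the temperature set to 0, as suggested in the reply above), and taking the last integer in the reply is a heuristic assumption, not a robust parser.

```python
import re


def parse_integer(reply: str):
    """Return the last integer in a reply, ignoring thousands separators.

    Heuristic: assumes the model's final answer is the last number it prints.
    """
    matches = re.findall(r"-?\d[\d,]*", reply)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def score(n: int, ask, templates=("{n}+1", "{n}+1=")):
    """Ask for n+1 with each prompt template and record whether it was right.

    `ask` is an assumed callable: prompt string in, reply string out.
    """
    results = {}
    for template in templates:
        prompt = template.format(n=n)
        results[prompt] = parse_integer(ask(prompt)) == n + 1
    return results


def smallest_failure(candidates, ask, template="{n}+1"):
    """Return the smallest candidate n whose n+1 answer is wrong, or None."""
    for n in sorted(candidates):
        prompt = template.format(n=n)
        if parse_integer(ask(prompt)) != n + 1:
            return n
    return None
```

Passing a stub for `ask` makes it possible to check the scoring logic without any network access before plugging in a real client.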

Gentle Computing