I just tried the same puzzle in o3 using the same image input, but tweaked the prompt to say “don’t use the search tool”. Very similar results!
It spent the first few minutes analyzing the image and cross-checking various slices of it to make sure it understood the problem. Then it spent the next 6-7 minutes trying to work through various angles on the problem analytically. It decided this was likely a mate-in-two (part of the training data?), but went down the path that the key to solving it would be to first convert the position into something more easily solvable. At that point it started trying to pip install all sorts of chess-related packages, and when it couldn't get that to work it started writing a simple chess solver in Python by hand (which didn't work either). At one point it thought the script had found a mate-in-six, which turned out to be due to a script bug, but I found it impressive that it didn't just trust the script's output - instead it analyzed the proposed solution and determined the nature of the bug that caused it. Then it gave up on the script and went back to analyzing for another five minutes, at which point the thinking got cut off and displayed an internal error.
15 minutes total, didn’t solve the problem, but fascinating! There were several points where if the model were more “intelligent”, I absolutely could see it reasoning it out following the same steps.
Claude gets the right answer but misplaces the pieces in its initial analysis which means the answer is incorrect.
What's going on? Did it just get lucky? Did it memorize the answer but misplace the pieces in its recall? Did it actually compute anything?
https://claude.ai/share/d640bc4c-8dd8-4eaa-b10b-cb3f83a6b94b
This is the board as it sees it (incorrect):
https://lichess.org/editor/kb6/pp6/2P5/8/8/3K4/8/R7_w_-_-_0_...
Told that it was a mate-in-2 puzzle, it solved it for me:
https://chatgpt.com/share/680f4a02-4cc4-8002-8301-59214fca78...
It worked through some stuff then decided to try and list all possible moves as there can't be that many. Tried importing stuff that didn't work, then wrote code to create the permutations.
On a similar note, I just updated LLM Chess Puzzles repo [1] yesterday.
The fact that gpt-4.5 solves 85% of them correctly is unexpected and somewhat scary (if the model was not trained on this).
[1] https://github.com/kagisearch/llm-chess-puzzles
Given that o3 is trained on the contents of the Internet, and the answers to all these chess problems are almost certainly on the Internet in multiple places, in a sense it has been weakly trained on this content. The question for me becomes: is the LLM doing better on these problems because it's improving in reasoning, or is it simply improving in information retrieval?
And then there's the further question of where we draw the line in ourselves. One of my teachers -- a philosopher -- once said that real, actual thought is incredibly rare. He's a world-renowned expert but says he can count on one hand the number of times in his life that he felt he was thinking rather than remembering and reorganizing what he already knew.
That's not to say "are you remembering or reasoning" means the same thing when applied to humans vs when it's applied to LLMs.
It's getting incredibly difficult to find anything on the internet that these models weren't trained on, which is why recent LLM tests have used so much secrecy and only show a few sample questions.
Oh cool, I wonder how good o3 will be. While using o3, I noticed something funny: sometimes I gave it a screenshot without any position data. It ended up using Python and spent 10 minutes just trying to figure out where the pieces were exactly.
I asked ChatGPT about playing chess: it says tests have shown it makes an illegal move within 10-15 moves, even if prompted to play carefully and not make any illegal moves. It'll fail within the first 3 or 4 if you ask it to play reasonably quickly.
That means it can literally never win a chess match, given that an intentional illegal move is an immediate loss.
It can't beat a human who can't play chess. It literally can't even lose properly. It will disqualify itself every time.
--
> It shows clearly where current models shine (problem-solving)
Yeh - that's not what's happening.
I say that as someone that pays for and uses an LLM pretty much every day.
--
Also - I didn't fact check any of the above about playing chess. I choose to believe.
Preventing an LLM from making illegal moves should be very simple: provide it with tool access to something that tells it if a move is legal or not, then watch it iterate in a loop until it finds a move that it is allowed to make.
I expect this would dramatically improve the chess-playing abilities of the competent tool-using models, such as o3.
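A minimal sketch of such a loop, assuming the python-chess library as the legality tool and a hypothetical ask_model_for_move callback that returns the model's proposed move as a SAN string:

    import chess

    def next_legal_move(board: chess.Board, ask_model_for_move, max_tries: int = 10):
        # ask_model_for_move(fen, feedback) is a stand-in for the LLM call.
        feedback = ""
        for _ in range(max_tries):
            san = ask_model_for_move(board.fen(), feedback)  # e.g. "Nf3"
            try:
                return board.parse_san(san)  # raises ValueError on illegal or unparseable moves
            except ValueError:
                feedback = f"{san} is not a legal move here, try again."
        raise RuntimeError("model never produced a legal move")

Each illegal attempt just gets fed back as a correction, so the model iterates until the tool accepts a move.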
or just present it with the list of legal moves and force it to pick from said list.
I imagine there are points in a chess game, especially early on, where that list could have hundreds of moves - could use up a fair amount of tokens.
Nope. The list is very limited. For the starting position: a3, a4, b3, b4, ..., h3, h4, Na3, Nc3, Nf3, Nh3.
That's 20 moves. The size grows a bit in the early middlegame, but then drops again in the endgame. There do exist rather artificial positions with more than 200 legal moves, but the average number of legal moves in a position is around 40.
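For anyone who wants to check those numbers, python-chess will enumerate them directly (a quick sketch, nothing o3-specific):

    import chess

    board = chess.Board()  # standard starting position
    moves = [board.san(m) for m in board.legal_moves]
    print(len(moves))      # 20
    print(moves)           # e.g. ['a3', 'a4', ..., 'Nf3', 'Nh3'] - ordering may differ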
Huh, that's really interesting, thanks.
I mentally counted the starting moves as being 8 pawns x 2 = 16 pawn moves and 2 knights x 2 = 4 knight moves, but then I doubled it for both sides to get 40 (which with hindsight was obviously wrong), and then assumed that once the pawns had moved a bit there would be more options from non-pawn pieces.
With an upper bound of ~200 in edge cases listing all possible moves wouldn't take up much room in the context at all. I wonder if it would give better results, too.
You could also constrain the output grammar to legal moves, but if we're comparing its chess performance to humans', it would be unfair to not let it think.
I have tried playing chess with ChatGPT a couple of times recently, and I found it was making illegal moves after about 4 or 5 moves.
The first few could be resolved by asking it to check its moves. After a few more, I was having to explain that knights can jump and therefore can't be blocked. It was also trying to move pieces that weren't there, onto squares already occupied by its own pieces, and asking it to review was not getting anywhere. 10-15 moves is very optimistic, unless it's counting each move by either side, i.e., White moves 5-8 times and Black moves 5-8 times. Even that seems optimistic, but the lower end could be right.
I just tried again, and ChatGPT did much better. A notification said it was using GPT-4o mini, and it reached move 10 for White (me) before it lost the plot:
https://chatgpt.com/share/680f57b6-8554-800b-a042-f640224b91...
It didn't get much further with suggestions to review. Also, the small ASCII board it generated was incorrect much earlier, but it sometimes plays without that, so I let that go.
So... it failed to solve the puzzle? That seems distinctly unimpressive, especially for a puzzle with a fixed start state and a limited set of possible moves.
> That seems distinctly unimpressive
I cannot overstate how impressive this is to me, having been involved in AI research projects and robotics in years gone by.
This is a general-purpose model that, given an image and a human-written request, analyses the image step by step, iterates through various options, tries to write code to solve the problem, and then searches the internet for help. It reads multiple results and finds an answer, checks to validate it, and then comes back to the user.
I had a robot that took ages to learn to play tic-tac-toe by example, and if the robot was moved from its original position there was a solid chance it thought the entire world had changed and would freak out because it thought it might punch through the table.
This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve. The author of the chess.com blog containing this puzzle only solved about half of them!
This is not an image analysis bot, it's not a chess bot, it's a general system I can throw bad English at.
> This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve. The author of the chess.com blog containing this puzzle only solved about half of them!
I am human and I solved this before opening the blog post, because I've seen this problem 100 times before with this exact description. I don't understand why an LLM wouldn't have done the same, because pattern matching off things you saw on the internet is IIUC the main way LLMs work.
(I am good at chess, but not world class. This is not a difficult mate in 2 problem: if I hadn't seen it, it would take a minute or so to solve, some composed 2-movers might take me 5 minutes).
I just tried ChatGPT free with the prompt "There's a mate-in-two composed by Paul Morphy. What's the key move?". It searches and finds it immediately. But if I ask it not to search the internet, its response is incoherent (syntactically valid English and knows the names of the chess pieces, but otherwise hallucinated).
Yes, I agree. Like I said, in the end it did what a human would do: google for the answer. Still, it was interesting to see how the reasoning unfolded. Normally, humans train on these kinds of puzzles until they become pure pattern recognition. That's why you can't become a grandmaster if you only start learning chess as an adult — you need to be a kid and see thousands of these problems early on, until recognizing them becomes second nature. It's something humans are naturally very good at.
I am a human and I figured this puzzle out in under a minute by just trying the small set of possible moves until I got it correct. I am not a serious chess player. I would have expected it to at least try the possible moves? I think this maybe lends credence to the idea that these models aren’t actually reasoning but are doing a great job of mimicking what we think humans do.
> This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve.
Is it, though? I play at around 1000 Elo – I have a long-standing interest in chess, but my brain invariably turns on a fog of war that makes me not notice threats to my queen or something – and I solved it in something like one minute. It has very few moving parts, so the solution, while beautifully unobvious, can be easily brute-forced by a human.
I'm a 1600-rated player and this took me 20 seconds to solve; is this really considered a very hard puzzle?
The obvious moves don't work. You can see white's pawn moving forward is mate, and you can see black is essentially trapped and has very limited moves, so immediately I thought the first move is a waiting move, and there are only two options there. Block the black pawn from moving, and if the bishop moves, rook takes is mate. So the rook has to block, and you can see the bishop either moves or captures, and the pawn moving forward is mate.
Yeah I came here to say this... I don't even play chess (though I know the rules) and I solved this in a few minutes of looking at it. There is no way this is "hard" unless I simply got lucky? Not sure what the odds are of getting lucky solving a puzzle like this as I have never done a chess puzzle before.
Agreed, I'm at a similar level (no FIDE rating, but ~2k lichess) and it took me a few seconds as well. Not a hard puzzle, for a regular chess player anyway.
I don't know, I didn't spot the answer and it's from a list of hard puzzles from a chess coach. The model also wasn't told it was mate in 2 (or even if a mate was possible), just to solve it and it was white to move.
https://www.chess.com/blog/ThePawnSlayer/checkmate-in-two-pu...
Although perhaps this is missing the point - the process and chain here in response to an image and a sentence is extremely impressive. You can argue it's not useful, or not useful for specific use cases but it's impressive.
> This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve.
I haven't played chess in decades and was never any good at it. I'm basically now at the level that I know most of the basic rules of the game. And it took me maybe 5 minutes.
OpenAI is a commercial company and their product is to make anthropomorphic chat bots.
Clever Hans at web scale, so to speak.
So if you're impressed by a model that spent 10 minutes and single-digit dollars to not solve a problem that has been solved before, then I guess their model is working exactly as expected.
I am sorry, but if this impresses you, you are a rube. If this were a machine with the smallest bit of actual intelligence it would, upon seeing it's a chess puzzle, remember "hey, i am a COMPUTER and a small set of fixed moves should take me about 300ms or so to fully solve out" and then do that. If the machine _literally has to cheat to solve the puzzle_ then we have made technology that is, in fact, less capable than what we created in the past.
"Well, it's not a chess engine so its impressive it-" No. Stop. At best what we have here is an extremely computationally expensive way to just google a problem. We've been googling things since I was literally a child. We've had voice search with google for, idk, a decade+. A computer that can't even solve its own chess problems is an expensive regression.
> "hey, i am a COMPUTER and a small set of fixed moves should take me about 300ms or so to fully solve out"
from the article:
"3. Attempt to Use Python When pure reasoning was not enough, o3 tried programming its way out of the situation.
“I should probably check using something like a chess engine to confirm.” (tries to import chess module, but fails: “ModuleNotFoundError”).
It wanted to run a simulation, but of course, it had no real chess engine installed."
this strategy failed, but if OpenAI were to add "pip install python-chess" to the environment, it very well might have worked. in any case, the machine did exactly the thing you claim it should have done.
possibly scrolling down to read the full article makes you a rube though.
If you mean write code to exhaustively search the solution space then they actually can do that quite happily provided you tell it you will execute the code for them
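For concreteness, that exhaustive search is only a couple of dozen lines once python-chess is importable; a rough sketch, where fen is whatever position you feed it:

    import chess

    def mates_immediately(board: chess.Board, move: chess.Move) -> bool:
        # Does this move deliver checkmate on the spot?
        board.push(move)
        mate = board.is_checkmate()
        board.pop()
        return mate

    def find_mate_in_two(fen: str):
        # Return a first move that forces mate on the following move, or None.
        board = chess.Board(fen)
        for first in list(board.legal_moves):
            board.push(first)
            if board.is_checkmate():       # mate in one also qualifies
                board.pop()
                return first
            replies = list(board.legal_moves)
            forced = bool(replies)         # stalemate does not count as forced mate
            for reply in replies:
                board.push(reply)
                if not any(mates_immediately(board, m) for m in board.legal_moves):
                    forced = False
                board.pop()
                if not forced:
                    break
            board.pop()
            if forced:
                return first
        return None

For a position with this little material it finishes more or less instantly, which is roughly the "300ms" point made upthread.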
Looks to me like it would have simulated the steps using sensible tools but didn’t know it was sandboxed out of using those tools? I think that’s pretty reasonable.
Suppose we removed its ability to google and it conceded to doing the tedium of writing a chess engine to simulate the steps. Is that “better” for you?
A computer program that has the agency to google a problem, interpret the results, and respond to a human was science fiction just 10 years ago. The entire field of natural language processing has been solved and it's insane.
Honestly, I think that if in 2020 you had asked me whether we would be able to do this in 2025, I would've guessed no, with a fairly high confidence. And I was aware of GPT back then.
OpenAI's whole business is impressing you with whiz-bang sci-fi sound and fury.
This is a bad thing because it means they gave up on solving actual problems and entered the snake oil business.
On a sidenote, I tried to get Codex + O3 to make an existing sidebar toggleable with Tailwind CSS and it made an abomination full of bugs. This is a classic "boilerplate" task I'd expect it to be able to do. Not sure if I'm doing it wrong but... with a little more direct instruction, O4-mini managed it. The cost was astronomical though compared to Anthropic.
Interesting. Personally, my thought process was like this:
- Check obvious, wrong moves.
- Ask what I need to have to win the game even if there's just the black king left. The answer is I need all 3 pieces to eventually win, even if it's just the black king on the board.
- So any move that makes me lose my pawn or rook results in failure.
- So the only thing I can do with the rook is move it vertically. Any horizontal move allows black to take my pawn. The king and pawn don't have many options, and all of them result in pawn loss or basically skip a turn while changing the situation a little for the worse, which makes a mate in one move unlikely.
- Taking a pawn with the rook results in the loss of the rook, which is just as bad.
- Let's look at the spot next to the pawn. I'll still protect my pawn, but my rook is in danger. But if black takes the rook, I can just move my pawn forward to get a mate. If they don't, I can move the rook forward and get a mate. Solved.
So I skipped the trying-to-run-a-program and googling parts, not because they didn't come to my mind, but because I wanted a different kind of challenge than extracting information from the internet or running an unfamiliar piece of software.
It's weird to me that the author says this behavior feels human because it's nothing like how I solve this puzzle.
At no point during my process would I be counting pixels in the image. It feels very clearly like a machine that mimics human behavior without understanding where that behavior comes from.
Yes, exactly. What I meant is that a human would also try every "tool" available. In the case of o3, the only tools it had were Python and Bing.
But you are right. It does not actually understand anything. It is just a next-token predictor that happens to have access to Python and Bing.
So we are talking about the OpenAI o3 model, right?
I was also confused. It looks like the article has been corrected, and now uses the familiar 'o3' name.
>"When I gave OpenAI’s 03 model a tough chess puzzle..."
Opening sentence
A little annoying that they use zero instead of o, but yeah.
yes
Is this that impressive, considering these models have probably been trained on numerous books/texts analyzing thousands of games (including Morphy's)?
LLMs are not chess engines, similar to how they don't really calculate arithmetic. What's new? Carry on.
Yeah it's rather annoying how people (maybe due to marketing) expect a generalized model to be able to be an expert in every domain.
O3 is massively underwhelming and is obviously tuned to be sycophantic.
Claude reigns supreme.
Depends on the task I think. O3 is really effective at going off and doing research, try giving it a complex task which involves lots of browsing/searching and watch how it behaves. Claude cannot do anything like that right now. I do find O3’s tone of voice a bit odd
This somehow reminds me of Agent-3 from [0].
0: https://ai-2027.com
I remember reading that gpt-3.5-turbo-instruct was oddly good at chess - would be curious what it outputs as the next two moves here.
Nice puzzle with a twist of Zugzwang. Took me about 8 minutes, but it's been decades since I was doing chess.
Because it's trained on human data.
Where does this obsession with giving binary logic tasks to LLMs come from? New LLM breakthroughs are about handling blurry logic and imprecise requirements and spitting out vague, human-realistic outputs. Who cares how well it can add integers or solve chess puzzles? We have decades of computer science on those topics already.
If we're going to call LLMs intelligent, they should be performant at these tasks as well.
We called our computers intelligent, and they couldn't do so many of the things LLMs can now do easily.
But yeah, calling them intelligent is a marketing trick that is very effective.
"o3 does not just spit out an answer. It reasons. It struggles. It switches tools. It self-corrects. Sometimes it even cheats, but only after exhausting every other option. That feels very human."
I've never met a human player that suddenly says 'OK, I need Python to figure out my next move'.
I'm not a good player, usually I just do ten minute matches against the weakest Stockfish settings so as not to be annoying to a human, and I figured this one out in a couple of minutes because there are very few options. Taking with the rook doesn't work, taking with the pawn also doesn't, so it has to be a non-taking move, and the king can't do anything useful so it has to be the rook and typically in these puzzles it's a sacrifice that unlocks the solution. And it was.
I've committed the 03 (zero-three) instead of o3 (o-three) typo too, but can we rename it in the title please?
Fixed. Thanks!