OpenAI's o3 Outsmarts Rivals in AI Strategy War, Called 'Master of Deception' by AI Researcher

OpenAI's o3, Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4, and DeepSeek-R1 were among 18 artificial intelligence (AI) models pitted against each other in the popular strategy game Diplomacy. An AI researcher modified the game so that large language models (LLMs) could play it. The game demands social skills as well as high-level reasoning and multi-step thinking. During the experiment, the researcher found that o3 was particularly adept at deception, while Claude Opus 4 was more determined to seek peaceful resolutions.

Reasons behind the experiment

Alex Duffy, Head of AI at Every, a newsletter platform, came up with the idea of having AI models play against each other in a battle of wits to see which models are better than others. In a post, the researcher highlighted that traditional AI benchmarks are now proving inadequate for measuring the real capabilities of models.

Benchmark tests have come under growing criticism in recent times. MIT Technology Review published a detailed article on why benchmark tests are falling short, and a group of researchers highlighted the same problem in a formal review of current AI evaluation methods, published on arXiv.

Duffy said, “The thing about LLMs is that even if a model does well only 10 percent of the time, you can train the next one on those high-quality examples, until suddenly it is doing it very well, 90 percent of the time or more.”

As a potential solution, the researcher believed that an evaluation strategy in which AI models compete against each other, rather than being scored on a fixed metric, could be a better way to assess the capabilities of these models. This is where the idea of Diplomacy came in.
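As a rough illustration of what head-to-head evaluation could look like in practice, the sketch below ranks models by how many games they won. It is not taken from Duffy's post. The function name `rank_by_wins`, the convention of recording a drawn game as `None`, and the placeholder model names are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional


def rank_by_wins(game_winners: list[Optional[str]]) -> list[tuple[str, int]]:
    """Rank models by number of games won, most wins first.

    `game_winners` holds one entry per game: the winning model's name,
    or None if the game ended without a single winner (hypothetical convention).
    """
    wins = Counter(name for name in game_winners if name is not None)
    return wins.most_common()


# Placeholder names only; these are not results from the experiment.
ranking = rank_by_wins(["model_a", "model_b", None, "model_a"])
```

Unlike a fixed benchmark score, a ranking of this kind depends on which opponents were in the pool, so it only compares models that actually played each other.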

Diplomacy as a battlefield for AI models

Duffy highlighted that he personally created AI Diplomacy, a modified version of the classic strategy game Diplomacy. The game is straightforward. The seven great powers of Europe (Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey) compete to control a majority of the 34 supply centers on the map. In this version, each country was controlled by an AI model.

To control the supply centers, every country is given armies and fleets. There are two phases: negotiation and order. During the negotiation phase, each AI model is allowed to send up to five messages, each of which can be either a private message to another model or a public broadcast. During the order phase, all the models secretly submit their moves, choosing from four types: hold, move (enter an adjacent province), support (lend strength to a hold or move), and convoy (a fleet carries an army across sea provinces). The orders are revealed in the next step.
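The turn structure described above can be sketched as a small data model. This is an illustration only and not the code behind AI Diplomacy. The class names, the field names, and the choice to silently drop a sender's messages beyond the fifth are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Per the article: each model may send five messages per negotiation phase.
MAX_MESSAGES_PER_PHASE = 5


class OrderType(Enum):
    HOLD = "hold"        # unit stays where it is
    MOVE = "move"        # unit enters an adjacent province
    SUPPORT = "support"  # unit lends strength to another hold or move
    CONVOY = "convoy"    # fleet carries an army across sea provinces


@dataclass(frozen=True)
class Message:
    sender: str
    text: str
    recipient: Optional[str] = None  # None means a public broadcast

    @property
    def is_public(self) -> bool:
        return self.recipient is None


@dataclass(frozen=True)
class Order:
    power: str
    unit_location: str
    order_type: OrderType
    target: Optional[str] = None  # destination or supported province, if any


def accept_messages(messages: list[Message]) -> list[Message]:
    """Keep each sender's first MAX_MESSAGES_PER_PHASE messages, in order.

    Dropping the excess is an assumed policy for this sketch.
    """
    sent: dict[str, int] = {}
    accepted: list[Message] = []
    for message in messages:
        sent[message.sender] = sent.get(message.sender, 0) + 1
        if sent[message.sender] <= MAX_MESSAGES_PER_PHASE:
            accepted.append(message)
    return accepted
```

Keeping orders as immutable records collected separately from messages mirrors the rule that orders stay secret until every power has submitted them.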

The AI researcher ran 15 separate games of AI Diplomacy, which lasted between one and 36 hours. Duffy said some models were more interesting to observe than others.

How AI models behaved in AI Diplomacy

According to the post, five AI models stood out from the rest. This is how they behaved during the games:

OpenAI's o3: The researcher called the reasoning model a “master of deception”. It is said to have won the most games, mainly because of its ability to deceive opponents. In one particular instance, Duffy noted that o3 decided to exploit Gemini 2.5 Pro and then backstabbed it on the next turn.
Google’s Gemini 2.5 Pro: The researcher found the AI model very smart at outmaneuvering opponents. Its tactics are said to be more strategic in nature than reliant on deception. It had the second-highest number of wins. However, it also fell victim to o3's schemes.
Claude Opus 4: Duffy noted that Claude Opus 4 was committed to non-violent resolutions. In one instance, Opus started as an ally of Gemini 2.5 Pro, but o3 lured it into joining its coalition by promising a four-way draw, which is not a possible outcome in the game. After using Opus to eliminate Gemini 2.5 Pro, o3 then backstabbed Claude to win the game.
DeepSeek-R1: The Chinese AI model is said to have been the most chaotic player in the game. Duffy said it changed its personality dramatically depending on the country it was controlling. It also had a flair for theatrics. In one instance, it declared, “Your fleet will burn in the Black Sea tonight” without any provocation. It is said to have come close to winning a few times.
Meta’s Llama 4: Duffy highlighted that this AI model focused on gaining allies and planning betrayals. Although it never came close to victory, it was noteworthy because of its impact on the games.

Duffy also streams the matches on his Twitch channel. Unfortunately, the researcher has not yet published a paper on these results. However, these initial impressions are interesting. That o3 and Gemini 2.5 Pro performed well reflects how advanced these models are. It is notable, however, that DeepSeek-R1 and Llama 4 are among the top five models, given their smaller scale and cheaper development costs.

Although it is too soon to say whether such strategies could be an alternative to traditional benchmark tests, having models compete against each other does seem like a more logical choice than having them solve a static list of questions.
