OpenAI finally reveals “🍓”: crazy chain-of-thought-tuned model >> GPT-4o 💥
OpenAI had hinted at a mysterious “project strawberry” for a long time: they published this new model called “o1” 1 hour ago, and the performance is just mind-blowing.
🤯 Ranks among the top 500 students in the US in a qualifier for the USA Math Olympiad
🤯 Beats human-PhD-level accuracy by 8 percentage points on GPQA, a benchmark of hard science problems where the previous best was Claude 3.5 Sonnet with 59.4%
🤯 Scores 78.2% on vision benchmark MMMU, making it the first model competitive w/ human experts
🤯 GPT-4o on MATH scored 60% → o1 scores 95%
How did they pull this off? Sadly, OpenAI keeps increasing its performance at “making cryptic AF reports to not reveal any real info”, so here are some excerpts:
💬 “o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes.”
And of course, they decided to hide the content of this precious chain of thought. Would it be for maximum profit? Of course not, you awful capitalist, it’s to protect users:
💬 “We also do not want to make an unaligned chain of thought directly visible to users.”
They’re right, it would certainly have hurt my feelings to see the internals of this model tearing apart math problems.
🤔 I suspect it could be not only CoT, but also some agentic behaviour where the model can just call a code executor. The kind of score improvements they show certainly look like the ones you see with agents.
This model will be immediately released for ChatGPT and some “trusted API users”.
Let’s start cooking to release the same thing in 6 months!