LMSYS Chatbot Arena Leaderboard

LMSYS Chatbot Arena Leaderboard (March 26, 2024)

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Contribute your vote 🗳️ at chat.lmsys.org! Find more analysis in the notebook.

Rank	🤖 Model	⭐ Arena Elo	📊 95% CI	🗳️ Votes	Organization	License	Knowledge Cutoff
1	Claude 3 Opus	1253	+5/-5	33250	Anthropic	Proprietary	2023/8
1	GPT-4-1106-preview	1251	+4/-4	54141	OpenAI	Proprietary	2023/4
1	GPT-4-0125-preview	1248	+4/-4	34825	OpenAI	Proprietary	2023/12
4	Bard (Gemini Pro)	1203	+5/-7	12476	Google	Proprietary	Online
4	Claude 3 Sonnet	1198	+5/-5	32761	Anthropic	Proprietary	2023/8
6	GPT-4-0314	1185	+5/-4	33499	OpenAI	Proprietary	2021/9
6	Claude 3 Haiku	1179	+5/-5	18776	Anthropic	Proprietary	2023/8
8	GPT-4-0613	1158	+4/-5	51860	OpenAI	Proprietary	2021/9
8	Mistral-Large-2402	1157	+5/-4	26734	Mistral	Proprietary	Unknown
9	Qwen1.5-72B-Chat	1148	+5/-5	20211	Alibaba	Qianwen LICENSE	2024/2
10	Claude-1	1146	+6/-6	21908	Anthropic	Proprietary	Unknown
10	Mistral Medium	1145	+5/-4	26196	Mistral	Proprietary	Unknown
13	Starling-LM-7B-beta	1127	+9/-10	4270	UC Berkeley	Apache-2.0	2024/3
13	Claude-2.0	1126	+7/-4	13543	Anthropic	Proprietary	Unknown
13	Gemini Pro (Dev API)	1125	+6/-6	14856	Google	Proprietary	2023/4
13	Mistral-Next	1122	+7/-5	13132	Mistral	Proprietary	Unknown

Note: we take the 95% confidence interval into account when determining a model’s ranking. A model is ranked higher only if its lower bound of model score is higher than the upper bound of the other model’s score. See Figure 3 below for visualization of the confidence intervals.
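The ranking rule in the note can be sketched in a few lines of Python. This is only one reading of the note, not LMSYS's actual code: it assumes a model's rank is 1 plus the number of models whose lower confidence bound sits above that model's upper bound. The `Entry` type and `ci_rank` function are hypothetical names made up for this illustration, and the inputs are the rounded values from the first seven rows of the table, so a borderline case could come out differently than it would with unrounded scores.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    elo: int
    ci_plus: int   # distance from the Elo score up to the upper 95% bound
    ci_minus: int  # distance from the Elo score down to the lower 95% bound


def ci_rank(entries: list[Entry]) -> list[int]:
    """Assumed rule: rank = 1 + number of models whose lower bound
    is strictly above this model's upper bound."""
    ranks = []
    for e in entries:
        upper = e.elo + e.ci_plus
        clearly_better = sum(1 for o in entries if o.elo - o.ci_minus > upper)
        ranks.append(1 + clearly_better)
    return ranks


# First seven rows of the table above, using the rounded published values.
rows = [
    Entry("Claude 3 Opus", 1253, 5, 5),
    Entry("GPT-4-1106-preview", 1251, 4, 4),
    Entry("GPT-4-0125-preview", 1248, 4, 4),
    Entry("Bard (Gemini Pro)", 1203, 5, 7),
    Entry("Claude 3 Sonnet", 1198, 5, 5),
    Entry("GPT-4-0314", 1185, 5, 4),
    Entry("Claude 3 Haiku", 1179, 5, 5),
]

for entry, rank in zip(rows, ci_rank(rows)):
    print(rank, entry.model)
```

Under this reading, the three top models share rank 1 because their intervals overlap, which matches the ranks shown for these rows in the table.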

Via Superhuman AI

“Anthropic’s Claude Opus dethrones OpenAI’s GPT-4

Claude is America’s next top model. GPT-4’s long reign as the undisputed king of AI models is coming to an end, as the latest results from one of the biggest benchmarks in AI have placed Anthropic’s Claude 3 Opus at the top of its ranking.

TLDR: Opus is the largest model from Anthropic’s newest family of Claude 3 models. It now ranks at the top of the LMSYS Chatbot Arena Leaderboard, a crowdsourced open platform for evaluating AI models.”

“But that’s not the biggest surprise. Haiku, the smallest of the Claude 3 models, has beaten an earlier version of GPT-4. Haiku’s smaller size is impressive in itself but the achievement is absolutely seismic when you consider that Haiku is orders of magnitude cheaper than GPT-4.

Source: LMSYS Chatbot Arena Leaderboard

Haiku’s price and performance combo is an enticing proposition for users and builders. “This is excellent news for the market! We now have a GPT-4 class model that is 10x cheaper than GPT-4,” claimed Abacus AI CEO Bindu Reddy. “That’s insane for how cheap & fast it is,” added app builder Nick Dobos.

The ball is now in OpenAI’s court. “I don’t see how OpenAI survives on gpt-3.5 and gpt-4. Literally gpt-3.5 is utterly useless in the presence of Claude haiku,” declared software engineer Anton (@abacaj on X). OpenAI might have a thing or two to say about that when it launches the widely-anticipated GPT-5.”

Posted on: March 27, 2024, 10:40 am Category: Uncategorized

By: Stephen Abram
