

ChatGPT and Various Exam Results

https://openai.com/research/gpt-4


Simulated exam | GPT-4 (estimated percentile) | GPT-4 (no vision) (estimated percentile) | GPT-3.5 (estimated percentile)
Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th)
LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th)
SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th)
SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th) | 590 / 800 (~70th)
Graduate Record Examination (GRE) Quantitative | 163 / 170 (~80th) | 157 / 170 (~62nd) | 147 / 170 (~25th)
Graduate Record Examination (GRE) Verbal | 169 / 170 (~99th) | 165 / 170 (~96th) | 154 / 170 (~63rd)
Graduate Record Examination (GRE) Writing | 4 / 6 (~54th) | 4 / 6 (~54th) | 4 / 6 (~54th)
USABO Semifinal Exam 2020 | 87 / 150 (99th–100th) | 87 / 150 (99th–100th) | 43 / 150 (31st–33rd)
USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60
Medical Knowledge Self-Assessment Program | 75% | 75% | 53%
Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th)
AP Art History | 5 (86th–100th) | 5 (86th–100th) | 5 (86th–100th)
AP Biology | 5 (85th–100th) | 5 (85th–100th) | 4 (62nd–85th)
AP Calculus BC | 4 (43rd–59th) | 4 (43rd–59th) | 1 (0th–7th)
AP Chemistry | 4 (71st–88th) | 4 (71st–88th) | 2 (22nd–46th)
AP English Language and Composition | 2 (14th–44th) | 2 (14th–44th) | 2 (14th–44th)
AP English Literature and Composition | 2 (8th–22nd) | 2 (8th–22nd) | 2 (8th–22nd)
AP Environmental Science | 5 (91st–100th) | 5 (91st–100th) | 5 (91st–100th)
AP Macroeconomics | 5 (84th–100th) | 5 (84th–100th) | 2 (33rd–48th)
AP Microeconomics | 5 (82nd–100th) | 4 (60th–82nd) | 4 (60th–82nd)
AP Physics 2 | 4 (66th–84th) | 4 (66th–84th) | 3 (30th–66th)
AP Psychology | 5 (83rd–100th) | 5 (83rd–100th) | 5 (83rd–100th)
AP Statistics | 5 (85th–100th) | 5 (85th–100th) | 3 (40th–63rd)
AP US Government | 5 (88th–100th) | 5 (88th–100th) | 4 (77th–88th)
AP US History | 5 (89th–100th) | 4 (74th–89th) | 4 (74th–89th)
AP World History | 4 (65th–87th) | 4 (65th–87th) | 4 (65th–87th)
AMC 10 | 30 / 150 (6th–12th) | 36 / 150 (10th–19th) | 36 / 150 (10th–19th)
AMC 12 | 60 / 150 (45th–66th) | 48 / 150 (19th–40th) | 30 / 150 (4th–8th)
Intro Sommelier (theory knowledge) | 92% | 92% | 80%
Certified Sommelier (theory knowledge) | 86% | 86% | 58%
Advanced Sommelier (theory knowledge) | 77% | 77% | 46%
Leetcode (easy) | 31 / 41 | 31 / 41 | 12 / 41
Leetcode (medium) | 21 / 80 | 21 / 80 | 8 / 80
Leetcode (hard) | 3 / 45 | 3 / 45 | 0 / 45

We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models, as well as most state-of-the-art (SOTA) models, which may include benchmark-specific crafting or additional training protocols:

Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM evaluated few-shot) | SOTA (best external model, includes benchmark-specific training)
Multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% | 75.2%
Commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% | 85.6%
Grade-school multiple-choice science questions (challenge set) | 96.3% (25-shot) | 85.2% (25-shot) | 84.2% | 85.6%
Commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 84.2% | 85.6%
Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% | 65.8%
DROP (F1 score): reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 | 88.4
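
As a rough illustration of what "k-shot" means in the benchmark results above: the model is shown k solved examples before the question that is actually scored, and "0-shot" means it sees only the question. The Python sketch below assembles such a prompt for a multiple-choice question. The MCQuestion type, the build_k_shot_prompt helper, the prompt layout, and the toy questions are all hypothetical and chosen for illustration; they are not the format or data OpenAI used.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class MCQuestion:
    """A multiple-choice item (hypothetical structure, for illustration only)."""
    question: str
    choices: list[str]
    answer: str  # correct choice letter, e.g. "B"


def format_question(q: MCQuestion, include_answer: bool) -> str:
    """Render one question; solved examples carry the answer, the target does not."""
    lines = [q.question]
    for letter, choice in zip(LETTERS, q.choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_k_shot_prompt(examples: list[MCQuestion], target: MCQuestion, k: int) -> str:
    """Prepend k solved examples to the unanswered target question."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of available examples")
    blocks = [format_question(e, include_answer=True) for e in examples[:k]]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)


# Toy data, not taken from any benchmark.
examples = [
    MCQuestion("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    MCQuestion("Which planet is closest to the Sun?",
               ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
target = MCQuestion("What is the chemical symbol for water?",
                    ["H2O", "CO2", "O2", "NaCl"], "A")

two_shot_prompt = build_k_shot_prompt(examples, target, k=2)
zero_shot_prompt = build_k_shot_prompt(examples, target, k=0)
```

The model's completion after the final "Answer:" would then be compared with the target's answer letter to compute accuracy.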

Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, we translated the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, into a variety of languages using Azure Translate (see Appendix). In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili:

Figure: GPT-4 3-shot accuracy on MMLU across languages

Baselines (English): random guessing 25.0%; Chinchilla 67.0%; PaLM 69.3%; GPT-3.5 70.1%

GPT-4: English 85.5%; Italian 84.1%; Afrikaans 84.1%; Spanish 84.0%; German 83.7%; French 83.6%; Indonesian 83.1%; Russian 82.7%; Polish 82.1%; Ukrainian 81.9%; Greek 81.4%; Latvian 80.9%; Mandarin 80.1%; Arabic 80.0%; Turkish 80.0%; Japanese 79.9%; Swahili 78.5%; Welsh 77.5%; Korean 77.0%; Icelandic 76.5%; Bengali 73.2%; Urdu 72.6%; Nepali 72.2%; Thai 71.8%; Punjabi 71.4%; Marathi 66.7%; Telugu 62.0%
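
The "24 of 26" figure can be checked directly against the chart values. The short Python sketch below does that, assuming the accuracy figures pair with the language labels in the order they appear in the chart; the variable and function names are illustrative only.

```python
# English-language baselines from the chart (accuracy in percent).
BASELINES = {"Chinchilla": 67.0, "PaLM": 69.3, "GPT-3.5": 70.1}

# GPT-4 3-shot MMLU accuracy per translated language (percent), as read from
# the chart, assuming values and labels appear in the same order.
gpt4_by_language = {
    "Italian": 84.1, "Afrikaans": 84.1, "Spanish": 84.0, "German": 83.7,
    "French": 83.6, "Indonesian": 83.1, "Russian": 82.7, "Polish": 82.1,
    "Ukrainian": 81.9, "Greek": 81.4, "Latvian": 80.9, "Mandarin": 80.1,
    "Arabic": 80.0, "Turkish": 80.0, "Japanese": 79.9, "Swahili": 78.5,
    "Welsh": 77.5, "Korean": 77.0, "Icelandic": 76.5, "Bengali": 73.2,
    "Urdu": 72.6, "Nepali": 72.2, "Thai": 71.8, "Punjabi": 71.4,
    "Marathi": 66.7, "Telugu": 62.0,
}


def languages_above(baseline: float, scores: dict[str, float]) -> list[str]:
    """Return the languages whose score strictly exceeds the baseline, sorted by name."""
    return sorted(lang for lang, score in scores.items() if score > baseline)


# Compare against the strongest English baseline (GPT-3.5 in this chart).
strongest = max(BASELINES.values())
above = languages_above(strongest, gpt4_by_language)
below = sorted(set(gpt4_by_language) - set(above))

print(f"{len(above)} of {len(gpt4_by_language)} languages beat {strongest}%")
print("At or below the baseline:", ", ".join(below))
```

Under that pairing, only Marathi and Telugu fall below the GPT-3.5 English score.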

We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming. We are also using it to assist humans in evaluating AI outputs, starting the second phase in our alignment strategy.

Visual inputs

GPT-4 can accept a prompt of text and images, which, parallel to the text-only setting, lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images. Over a range of domains, including documents with text and photographs, diagrams, or screenshots, GPT-4 exhibits capabilities similar to those it shows on text-only inputs. Furthermore, it can be augmented with test-time techniques that were developed for text-only language models, including few-shot and chain-of-thought prompting. Image inputs are still a research preview and not publicly available.

 

 


Posted on: March 15, 2023, 6:28 am Category: Uncategorized
