ChatGPT and Various Exam Results

Simulated exams	GPT-4 score (estimated percentile)	GPT-4 (no vision) score (estimated percentile)	GPT-3.5 score (estimated percentile)
Uniform Bar Exam (MBE+MEE+MPT)¹	298 / 400 (~90th)	298 / 400 (~90th)	213 / 400 (~10th)
LSAT	163 (~88th)	161 (~83rd)	149 (~40th)
SAT Evidence-Based Reading & Writing	710 / 800 (~93rd)	710 / 800 (~93rd)	670 / 800 (~87th)
SAT Math	700 / 800 (~89th)	690 / 800 (~89th)	590 / 800 (~70th)
Graduate Record Examination (GRE) Quantitative	163 / 170 (~80th)	157 / 170 (~62nd)	147 / 170 (~25th)
Graduate Record Examination (GRE) Verbal	169 / 170 (~99th)	165 / 170 (~96th)	154 / 170 (~63rd)
Graduate Record Examination (GRE) Writing	4 / 6 (~54th)	4 / 6 (~54th)	4 / 6 (~54th)
USABO Semifinal Exam 2020	87 / 150 (99th–100th)	87 / 150 (99th–100th)	43 / 150 (31st–33rd)
USNCO Local Section Exam 2022	36 / 60	38 / 60	24 / 60
Medical Knowledge Self-Assessment Program	75%	75%	53%
Codeforces Rating	392 (below 5th)	392 (below 5th)	260 (below 5th)
AP Art History	5 (86th–100th)	5 (86th–100th)	5 (86th–100th)
AP Biology	5 (85th–100th)	5 (85th–100th)	4 (62nd–85th)
AP Calculus BC	4 (43rd–59th)	4 (43rd–59th)	1 (0th–7th)
AP Chemistry	4 (71st–88th)	4 (71st–88th)	2 (22nd–46th)
AP English Language and Composition	2 (14th–44th)	2 (14th–44th)	2 (14th–44th)
AP English Literature and Composition	2 (8th–22nd)	2 (8th–22nd)	2 (8th–22nd)
AP Environmental Science	5 (91st–100th)	5 (91st–100th)	5 (91st–100th)
AP Macroeconomics	5 (84th–100th)	5 (84th–100th)	2 (33rd–48th)
AP Microeconomics	5 (82nd–100th)	4 (60th–82nd)	4 (60th–82nd)
AP Physics 2	4 (66th–84th)	4 (66th–84th)	3 (30th–66th)
AP Psychology	5 (83rd–100th)	5 (83rd–100th)	5 (83rd–100th)
AP Statistics	5 (85th–100th)	5 (85th–100th)	3 (40th–63rd)
AP US Government	5 (88th–100th)	5 (88th–100th)	4 (77th–88th)
AP US History	5 (89th–100th)	4 (74th–89th)	4 (74th–89th)
AP World History	4 (65th–87th)	4 (65th–87th)	4 (65th–87th)
AMC 10	30 / 150 (6th–12th)	36 / 150 (10th–19th)	36 / 150 (10th–19th)
AMC 12	60 / 150 (45th–66th)	48 / 150 (19th–40th)	30 / 150 (4th–8th)
Intro Sommelier (theory knowledge)	92%	92%	80%
Certified Sommelier (theory knowledge)	86%	86%	58%
Advanced Sommelier (theory knowledge)	77%	77%	46%
Leetcode (easy)	31 / 41	31 / 41	12 / 41
Leetcode (medium)	21 / 80	21 / 80	8 / 80
Leetcode (hard)	3 / 45	3 / 45	0 / 45

We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models, as well as most state-of-the-art (SOTA) models, which may include benchmark-specific crafting or additional training protocols:

Benchmark	GPT-4 (evaluated few-shot)	GPT-3.5 (evaluated few-shot)	LM SOTA (best external LM evaluated few-shot)	SOTA (best external model, includes benchmark-specific training)
MMLU: multiple-choice questions in 57 subjects (professional & academic)	86.4% (5-shot)	70.0% (5-shot)	70.7% (5-shot, U-PaLM)	75.2% (5-shot, Flan-PaLM)
HellaSwag: commonsense reasoning around everyday events	95.3% (10-shot)	85.5% (10-shot)	84.2% (LLaMA, validation set)	85.6% (ALUM)
AI2 Reasoning Challenge (ARC): grade-school multiple-choice science questions, challenge set	96.3% (25-shot)	85.2% (25-shot)	84.2% (8-shot, PaLM)	85.6% (ST-MoE)
WinoGrande: commonsense reasoning around pronoun resolution	87.5% (5-shot)	81.6% (5-shot)	84.2% (5-shot, PaLM)	85.6% (5-shot, PaLM)
HumanEval: Python coding tasks	67.0% (0-shot)	48.1% (0-shot)	26.2% (0-shot, PaLM)	65.8% (CodeT + GPT-3.5)
DROP (F1 score): reading comprehension & arithmetic	80.9 (3-shot)	64.1 (3-shot)	70.8 (1-shot, PaLM)	88.4 (QDGAT)
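
The "k-shot" labels in the benchmark table mean that the model sees k solved examples before the question being scored (0-shot means none). The sketch below shows one way such a prompt could be assembled. The function name `build_few_shot_prompt`, the "Question:/Answer:" layout, and the arithmetic examples are all illustrative assumptions; the source does not describe the exact prompt format used in these evaluations.

```python
def build_few_shot_prompt(examples, question, k):
    """Assemble a k-shot prompt: k solved examples, then the unanswered question.

    `examples` is a list of (question, answer) pairs. The text layout used
    here is a hypothetical one chosen for illustration only.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of examples")
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final question is left open for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Made-up demonstration data, not taken from any benchmark.
demo_examples = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 - 4 = ?", "3")]

two_shot = build_few_shot_prompt(demo_examples, "6 + 1 = ?", k=2)
zero_shot = build_few_shot_prompt(demo_examples, "6 + 1 = ?", k=0)
print(two_shot)
```

With `k=0` the prompt contains only the open question, which corresponds to the 0-shot setting reported for HumanEval.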

Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, we translated the MMLU benchmark (a suite of 14,000 multiple-choice problems spanning 57 subjects) into a variety of languages using Azure Translate (see Appendix). In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili:

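Figure: GPT-4 3-shot accuracy on MMLU across languages (horizontal axis: accuracy, 0% to 90%).

Model and language	Accuracy
Random guessing	25.0%
Chinchilla (English)	67.0%
PaLM (English)	69.3%
GPT-3.5 (English)	70.1%
GPT-4 (English)	85.5%
GPT-4 (Italian)	84.1%
GPT-4 (Afrikaans)	84.1%
GPT-4 (Spanish)	84.0%
GPT-4 (German)	83.7%
GPT-4 (French)	83.6%
GPT-4 (Indonesian)	83.1%
GPT-4 (Russian)	82.7%
GPT-4 (Polish)	82.1%
GPT-4 (Ukrainian)	81.9%
GPT-4 (Greek)	81.4%
GPT-4 (Latvian)	80.9%
GPT-4 (Mandarin)	80.1%
GPT-4 (Arabic)	80.0%
GPT-4 (Turkish)	80.0%
GPT-4 (Japanese)	79.9%
GPT-4 (Swahili)	78.5%
GPT-4 (Welsh)	77.5%
GPT-4 (Korean)	77.0%
GPT-4 (Icelandic)	76.5%
GPT-4 (Bengali)	73.2%
GPT-4 (Urdu)	72.6%
GPT-4 (Nepali)	72.2%
GPT-4 (Thai)	71.8%
GPT-4 (Punjabi)	71.4%
GPT-4 (Marathi)	66.7%
GPT-4 (Telugu)	62.0%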
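
The "24 of 26 languages" claim can be checked directly against the chart values. The sketch below transcribes the GPT-4 per-language accuracies from the chart and counts how many exceed the GPT-3.5 English score of 70.1%. The variable and function names are illustrative, and the numbers are copied from the chart rather than recomputed from any model output.

```python
# GPT-3.5 English 3-shot MMLU accuracy, as shown in the chart (percent).
GPT35_ENGLISH = 70.1

# GPT-4 3-shot MMLU accuracy per translated language, transcribed from the chart.
gpt4_by_language = {
    "Italian": 84.1, "Afrikaans": 84.1, "Spanish": 84.0, "German": 83.7,
    "French": 83.6, "Indonesian": 83.1, "Russian": 82.7, "Polish": 82.1,
    "Ukrainian": 81.9, "Greek": 81.4, "Latvian": 80.9, "Mandarin": 80.1,
    "Arabic": 80.0, "Turkish": 80.0, "Japanese": 79.9, "Swahili": 78.5,
    "Welsh": 77.5, "Korean": 77.0, "Icelandic": 76.5, "Bengali": 73.2,
    "Urdu": 72.6, "Nepali": 72.2, "Thai": 71.8, "Punjabi": 71.4,
    "Marathi": 66.7, "Telugu": 62.0,
}


def languages_above(baseline, scores):
    """Return the languages whose score is strictly above the baseline."""
    return [lang for lang, score in scores.items() if score > baseline]


above = languages_above(GPT35_ENGLISH, gpt4_by_language)
not_above = sorted(set(gpt4_by_language) - set(above))

print(f"{len(above)} of {len(gpt4_by_language)} languages beat GPT-3.5 English")
print("At or below the baseline:", not_above)
```

On these figures, the two languages that fall short of the GPT-3.5 English baseline are Marathi and Telugu.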

We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming. We are also using it to assist humans in evaluating AI outputs, starting the second phase in our alignment strategy.

Visual inputs

GPT-4 can accept a prompt of text and images, which, parallel to the text-only setting, lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images. Over a range of domains, including documents with text and photographs, diagrams, or screenshots, GPT-4 exhibits capabilities similar to those it shows on text-only inputs. Furthermore, it can be augmented with test-time techniques that were developed for text-only language models, including few-shot and chain-of-thought prompting. Image inputs are still a research preview and not publicly available.

Posted on: March 15, 2023, 6:28 am Category: Uncategorized

By: Stephen Abram
