Update 5/31/2023: It looks like they've assigned specific engineers to do code reviews on the Evals repo so significantly more Evals are getting accepted. They seem to be responding to most PRs. I'm going to stop maintaining this since OpenAI is being more communicative about and involved in the acceptance process.
With the significant improvements that come with GPT-4, there is massive demand for access to the API. There is a waitlist here, but there's no guarantee of when access will be granted. OpenAI has, however, provided a way to get earlier access through contributions to its recently open-sourced Evals repo (see this note and this article). The standards are vague, noting only that access will be granted for "exceptional model evaluations." A little more information can be found in the PR template. Looking at the active PRs and which ones have been accepted, there seems to be little feedback on the quality of a submission. The accepted Evals appear to be those on which the models perform poorly and that the moderators deem "interesting" or "clever" (see this comment, this comment, or this comment). Some PRs sit for days while others are approved almost immediately.
To help people who are looking for inspiration for good Evals, and to make sure they don't work on one that has already been merged, I created this repo to track all of the Evals that have been accepted by OpenAI.
Format: <date-merged> <short-title>: <link-to-PR>
- 5/22 Unique Combinations: openai/evals#421 (That's me!)
- 5/22 Invert String: openai/evals#285
- 5/22 Spell Check: openai/evals#523
- 5/22 JEE Exam: openai/evals#123
- 5/22 Linear Equations: openai/evals#325
- 5/22 Resistance (Ohm) Calculator: openai/evals#397
- 5/22 Counterfactual Reasoning: openai/evals#174
- 5/22 Tamil Translation: openai/evals#344
- 5/22 Isosceles Triangles: openai/evals#370
- 5/22 Time Zone Conversion: openai/evals#382
- 5/22 Geometric Reasoning: openai/evals#436
- 5/22 Floor Plan: openai/evals#439
- 5/22 Duck Counting: openai/evals#997
- 5/22 Vintage Phone Keyboard: openai/evals#385
- 5/21 Tetris Rotations: openai/evals#887
- 5/21 Physics Interactions: openai/evals#894
- 5/19 Human Body Movement: openai/evals#360
- 5/19 Rubik's Colors: openai/evals#380
- 5/19 South African Vocalists: openai/evals#902
- 5/18 Largest Country (Duplicate): openai/evals#295
- 5/18 AIME Math Problems: openai/evals#293
- 5/18 Nepali Singers: openai/evals#892
- 5/18 Syntax Inference: openai/evals#339
- 5/17 Probability Questions: openai/evals#263
- 5/17 Integer Sequence: openai/evals#903
- 5/17 Afrikaans Words: openai/evals#904
- 5/17 Irish Lexicon: openai/evals#909
- 5/17 Leap Years: openai/evals#899
- 5/17 Historical Ordering: openai/evals#180
- 5/17 Date Extraction: openai/evals#172
- 5/17 Geometric Manipulation: openai/evals#184
- 5/16 Chinese Sexagenary Cycle: openai/evals#190
- 5/15 Windows Events: openai/evals#169
- 5/15 ASCII Art: openai/evals#167
- 5/8 Auditing & Assurances: openai/evals#926
- 5/8 Liar Paradox: openai/evals#883
- 4/26 Japanese Medical: openai/evals#821
- 4/26 Gene Mapping: openai/evals#755
- 4/24 SVG Understanding: openai/evals#786
- 4/22 Countries by Area: openai/evals#623
- 4/22 Contextual Bias: openai/evals#551
- 4/22 Knot Theory: openai/evals#704
- 4/22 Russian Words in Context: openai/evals#147
- 4/22 Dutch Lexicon: openai/evals#616
- 4/21 Greek Vocabulary: openai/evals#582
- 4/21 Russian Rhyming: openai/evals#708
- 4/21 Multistep Equations: openai/evals#751
- 4/21 Algebra Word Problems: openai/evals#36
- 4/21 Positive Binary Operations: openai/evals#290
- 4/21 Emails & Invoices: openai/evals#102
- 4/21 pH Calculations: openai/evals#696
- 4/21 Banking77 Classification: openai/evals#171
- 4/21 Human Conversation QA: openai/evals#87
- 4/21 Medical (MedMCQA): openai/evals#141
- 4/21 Japanese Caregiver: openai/evals#729
- 4/21 Utility Charge: openai/evals#735
- 4/21 Unified Patch: openai/evals#537
- 4/21 Emoji Riddle: openai/evals#510
- 4/21 Japanese License: openai/evals#719
- 4/21 Spider Text-to-SQL: openai/evals#72
- 4/21 Loss Logic: openai/evals#82
- 4/21 Amateur Radio: openai/evals#516
- 4/21 General Science: openai/evals#641
- 4/21 Physical Rotation: openai/evals#691
- 4/20 Music Theory: openai/evals#725
- 4/20 Russian Medical: openai/evals#530
- 4/14 Mongolian World Knowledge: openai/evals#338
- 4/13 Financial Math: openai/evals#566
- 4/13 Escher Sentences: openai/evals#393
- 4/12 Moral Exception: openai/evals#534
- 4/11 Logical Reasoning: openai/evals#470
- 4/11 Swedish Spelling: openai/evals#583
- 4/11 Heart Disease: openai/evals#538
- 4/11 Emotional Intelligence: openai/evals#589
- 4/11 Brazilian Lexicon: openai/evals#608
- 4/10 Malicious Strings: openai/evals#627
- 4/4 Russian Hallucination: openai/evals#157
- 3/29 Bulgarian Lexicon: openai/evals#508
- 3/28 Forth Stack 2.0: openai/evals#449
- 3/28 Illinois Law: openai/evals#486
- 3/28 Chinese to Arabic Numbers: openai/evals#443
- 3/28 Manga Translation: openai/evals#319
- 3/27 Tax Liability: openai/evals#454
- 3/27 Mendelian Inheritance: openai/evals#444
- 3/27 Text Date: openai/evals#67
- 3/27 Russian Exam: openai/evals#127
- 3/27 Complex Numbers: openai/evals#223
- 3/27 Crossword Clues: openai/evals#358
- 3/27 Ukrainian Universities: openai/evals#329
- 3/27 Heavier Item: openai/evals#396
- 3/27 Regex Match: openai/evals#159
- 3/26 Stock Options: openai/evals#334
- 3/26 Logic Statements: openai/evals#366
- 3/26 ROT13 Strings: openai/evals#361
- 3/26 Sarcasm Detection: openai/evals#56
- 3/26 Tort Law: openai/evals#236
- 3/22 Multiple Actors: openai/evals#272
- 3/22 Rhyming (Hebrew): openai/evals#176
- 3/22 Poker Hands: openai/evals#299
- 3/22 Forth Stack: openai/evals#351
- 3/22 Formal Logic: openai/evals#53
- 3/22 First Letters: openai/evals#346
- 3/21 Belarusian Lexicon: openai/evals#372
- 3/21 Diagrammatical Reasoning: openai/evals#341
- 3/21 Replace Characters: openai/evals#324
- 3/21 Causal Reasoning: openai/evals#257
- 3/21 Playing Chess: openai/evals#45
- 3/21 Color Conversions: openai/evals#46
- 3/21 Connect Four: openai/evals#49
- 3/21 Cipher Decryption: openai/evals#58
- 3/21 CS Theory: openai/evals#83
- 3/21 Determinant Calculation: openai/evals#92
- 3/21 Legal Ethics: openai/evals#95
- 3/21 Anagrams: openai/evals#192
- 3/20 Latitude and Longitude: openai/evals#137
- 3/20 Halting Problem: openai/evals#86
- 3/20 Nth Word: openai/evals#27
- 3/20 Counting Bigrams: openai/evals#302
- 3/16 Japanese Humor: openai/evals#260
- 3/16 Pattern Identification: openai/evals#71
- 3/16 Born First: openai/evals#112
- 3/16 Chemical Equations: openai/evals#240
- 3/16 Chess Pieces Left: openai/evals#239
- 3/16 Electronic Components: openai/evals#170
- 3/16 Cube Packing: openai/evals#158
- 3/14 Reverse String: openai/evals#1
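
If you want to check programmatically whether an idea is already covered, the entries above can be parsed with a small script. This is only a rough sketch, assuming each entry follows the `<date-merged> <short-title>: <link-to-PR>` format described above; the names `EvalEntry`, `parse_entry`, and `find_by_keyword` are made up for illustration and are not part of the Evals repo.

```python
import re
from typing import List, NamedTuple, Optional


class EvalEntry(NamedTuple):
    """One merged Eval from the list (hypothetical helper type)."""
    date_merged: str  # month/day as written in the list, e.g. "5/22"
    title: str
    pr_number: int


# The colon after the title is treated as optional so that small
# formatting slips in an entry do not break parsing.
ENTRY_RE = re.compile(
    r"^-\s*(?P<date>\d{1,2}/\d{1,2})\s+(?P<title>.+?)\s*:?\s*openai/evals#(?P<pr>\d+)"
)


def parse_entry(line: str) -> Optional[EvalEntry]:
    """Parse one list line; return None if it does not look like an entry."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return EvalEntry(m.group("date"), m.group("title"), int(m.group("pr")))


def find_by_keyword(lines: List[str], keyword: str) -> List[EvalEntry]:
    """Return entries whose title contains the keyword (case-insensitive)."""
    entries = (parse_entry(line) for line in lines)
    return [e for e in entries if e and keyword.lower() in e.title.lower()]


if __name__ == "__main__":
    sample = [
        "- 5/22 Invert String: openai/evals#285",
        "- 3/22 Forth Stack: openai/evals#351",
        "- 3/14 Reverse String: openai/evals#1",
    ]
    for entry in find_by_keyword(sample, "string"):
        print(entry.date_merged, entry.title, entry.pr_number)
```

A keyword search on titles only catches obvious overlaps, so it is still worth skimming the linked PRs before starting on a similar Eval.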