[ENH] Weighted win rates (#189)
* [ENH] using logprob parser

* [ENH] using logprob parser

* pass all tests

* WIP

* WIP

* fix length bias

* finish logprob

* finish logprob

* finish logprob

* finish logprob

* finish all test for the log probs

* remove sklearn

* pass all tests

* setup scipy

* add correlations to analysis

* add new baseline

* alpaca_eval_2 constants

* use AE 1 for tests

* remove v3

* add the tmp lb

* use turbo as baseline

* add mistral / mixtral / gemini

* Revert "add mistral / mixtral / gemini"

This reverts commit b4f37e9.

* default AE 1
YannDubs authored Jan 3, 2024
1 parent c32a615 commit 15fd513
Showing 43 changed files with 7,676 additions and 1,878 deletions.
2 changes: 1 addition & 1 deletion client_configs/README.md
@@ -52,7 +52,7 @@ gpt-4-1106-preview: # only when using `model_name: gpt-4-1106-preview`
```
Here the configurations will be appended to `default` when using the model_name `gpt-4` in the `evaluators_configs` such as [here](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/configs.yaml#L6). When hitting a rate limit we will be then switching between two OpenAI clients and one Azure, each using the same underlying model.
Here the configurations will be appended to `default` when using the model_name `gpt-4` in the `evaluators_configs` such as [here](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/configs.yaml#L6). When hitting a rate limit we will be then switching between two OpenAI clients and one Azure, each using the same underlying model. Note that when using Azure, some parameters might be slightly different and thus cause issues, as Azure typically lags a few months behind OpenAI's API.

## Fully backward compatible

12 changes: 6 additions & 6 deletions docs/alpaca_eval_gpt4_leaderboard.csv
@@ -14,15 +14,15 @@ UltraLM 13B (best-of-16),91.54228856,1980,https://huggingface.co/openbmb/UltraRM
CUT 13B,91.35572139303484,1637,https://github.com/wwxu21/CUT,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/cut-13b/model_outputs.json,community
Claude 2,91.35572139,1069,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude-2/model_outputs.json,minimal
PairRM+Tulu 2+DPO 13B (best-of-16),91.055900621118,1454,https://huggingface.co/llm-blender/PairRM,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pairrm-tulu-2-13b/model_outputs.json,community
Cohere Command,90.62111801242236,1983,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/cohere/model_outputs.json,minimal
Cohere Command,90.62111801242236,1983,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/cohere/model_outputs.json,verified
Zephyr 7B Beta,90.5977584059776,1444,https://huggingface.co/HuggingFaceH4/zephyr-7b-beta,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/zephyr-7b-beta/model_outputs.json,community
DEITA 7B v1.0,90.06211180124224,1417,https://github.com/hkust-nlp/deita,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/deita-7b-v1.0/model_outputs.json,community
OpenChat V3.1 13B,89.49004975,1484,https://github.com/imoneoi/openchat,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openchat-v3.1-13b/model_outputs.json,community
ChatGPT,89.36567164,827,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/chatgpt/model_outputs.json,minimal
Evo v2 7B,89.35242839352429,1754,https://evolusion.ai,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/evo-v2-7b/model_outputs.json,community
WizardLM 13B V1.2,89.16562889,1635,https://huggingface.co/WizardLM/WizardLM-13B-V1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b-v1.2/model_outputs.json,community
Vicuna 33B v1.3,88.99253731,1479,https://huggingface.co/lmsys/vicuna-33b-v1.3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-33b-v1.3/model_outputs.json,verified
Claude,88.38509317,1082,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude/model_outputs.json,minimal
Claude,88.38509317,1082,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/claude/model_outputs.json,verified
CausalLM-14B,88.26086956521739,1391,https://huggingface.co/CausalLM/14B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/causallm-14b/model_outputs.json,community
Tulu 2+DPO 13B,88.12189054726367,1614,https://huggingface.co/allenai/tulu-2-dpo-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/tulu-2-dpo-13b/model_outputs.json,community
Humpback LLaMa2 70B,87.93532338,1822,https://arxiv.org/abs/2308.06259,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/humpback-llama2-70b/model_outputs.json,community
@@ -55,17 +55,17 @@ OpenCoderPlus-15B,78.69565217,1628,https://github.com/imoneoi/openchat,https://g
MiniChat 1.5 3B,78.55361596009975,1545,https://huggingface.co/GeneZC/MiniChat-1.5-3B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minichat-1.5-3b/model_outputs.json,community
OpenBudddy-LLaMA2-13B-v11.1,77.48756219,1057,https://huggingface.co/OpenBuddy/openbuddy-llama2-13b-v11.1-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-llama2-13b-v11.1/model_outputs.json,community
Vicuna 7B v1.3,76.84144819,1110,https://huggingface.co/lmsys/vicuna-7b-v1.3,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-7b-v1.3/model_outputs.json,verified
WizardLM 13B,75.31094527,985,https://huggingface.co/WizardLM/WizardLM-13B-1.0,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b/model_outputs.json,minimal
WizardLM 13B,75.31094527,985,https://huggingface.co/WizardLM/WizardLM-13B-1.0,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/wizardlm-13b/model_outputs.json,verified
JinaChat,74.12718204,676,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/jina-chat/model_outputs.json,community
airoboros 65B,73.91304348,1512,https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/airoboros-65b/model_outputs.json,community
airoboros 33B,73.29192547,1514,https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.2,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/airoboros-33b/model_outputs.json,community
Guanaco 65B,71.80124224,1249,https://huggingface.co/timdettmers/guanaco-65b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-65b/model_outputs.json,minimal
Guanaco 65B,71.80124224,1249,https://huggingface.co/timdettmers/guanaco-65b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-65b/model_outputs.json,verified
LLaMA2 Chat 7B,71.36645963,1479,https://ai.meta.com/llama/,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/llama-2-7b-chat-hf/model_outputs.json,minimal
Vicuna 13B,70.43478261,1037,https://huggingface.co/lmsys/vicuna-13b-delta-v1.1,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/vicuna-13b/model_outputs.json,minimal
OpenBuddy-Falcon-7b-v6,70.3611457,1152,https://huggingface.co/OpenBuddy/openbuddy-falcon-7b-v6-bf16,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/openbuddy-falcon-7b-v6/model_outputs.json,community
Phi-2 SFT,68.53233830845771,1068,https://huggingface.co/lxuechen/phi-2-sft,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/phi-2-sft/model_outputs.json,verified
Baize-v2 13B,66.95652174,930,https://huggingface.co/project-baize/baize-v2-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/baize-v2-13b/model_outputs.json,community
LLaMA 33B OASST RLHF,66.52173913,1079,https://huggingface.co/OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps-xor,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/oasst-rlhf-llama-33b/model_outputs.json,minimal
LLaMA 33B OASST RLHF,66.52173913,1079,https://huggingface.co/OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps-xor,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/oasst-rlhf-llama-33b/model_outputs.json,verified
Minotaur 13B,66.02484472,881,https://huggingface.co/openaccess-ai-collective/minotaur-13b-fixed,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minotaur-13b/model_outputs.json,community
Guanaco 33B,65.96273292,1311,https://huggingface.co/timdettmers/guanaco-33b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-33b/model_outputs.json,verified
Nous Hermes 13B,65.46583851,844,https://huggingface.co/NousResearch/Nous-Hermes-13b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/nous-hermes-13b/model_outputs.json,verified
@@ -78,7 +78,7 @@ Davinci003,50.0,307,,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/
MiniChat 3B,48.818407960199,868,https://huggingface.co/GeneZC/MiniChat-3B,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/minichat-3b/model_outputs.json,community
ChatGLM2-6B,47.12858926,1027,https://huggingface.co/THUDM/chatglm2-6b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/chatglm2-6b/model_outputs.json,community
Guanaco 7B,46.58385093,1364,https://huggingface.co/timdettmers/guanaco-7b,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/guanaco-7b/model_outputs.json,verified
Falcon 40B Instruct,45.71428571,662,https://huggingface.co/tiiuae/falcon-40b-instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/falcon-40b-instruct/model_outputs.json,minimal
Falcon 40B Instruct,45.71428571,662,https://huggingface.co/tiiuae/falcon-40b-instruct,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/falcon-40b-instruct/model_outputs.json,verified
Alpaca Farm PPO Sim (GPT-4) 7B,44.09937888,511,https://huggingface.co/tatsu-lab/alpaca-farm-ppo-sim-gpt4-20k-wdiff,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/alpaca-farm-ppo-sim-gpt4-20k/model_outputs.json,verified
Pythia 12B SFT,41.86335404,913,https://huggingface.co/OpenAssistant/pythia-12b-sft-v8-7k-steps,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/pythia-12b-mix-sft/model_outputs.json,verified
Alpaca Farm PPO Human 7B,41.24223602,803,https://huggingface.co/tatsu-lab/alpaca-farm-ppo-human-wdiff,https://github.com/tatsu-lab/alpaca_eval/blob/main/results/alpaca-farm-ppo-human/model_outputs.json,minimal
5 changes: 4 additions & 1 deletion pytest.ini
@@ -1,2 +1,5 @@
[pytest]
doctest_optionflags = IGNORE_EXCEPTION_DETAIL NORMALIZE_WHITESPACE
doctest_optionflags = IGNORE_EXCEPTION_DETAIL NORMALIZE_WHITESPACE
# alpaca_eval_1 was used for writing tests
env =
    IS_ALPACA_EVAL_2 = False
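
A note on the `env` entry above: values set this way reach the process as strings, so a literal `False` is truthy if passed straight to `bool()`. The sketch below shows one way such a flag could be parsed. It assumes the `env` key is handled by a plugin such as `pytest-env`; the helper name `env_flag` and the set of accepted truthy spellings are hypothetical and are not taken from the `alpaca_eval` code base.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper).

    Environment values are always strings, so "False" must be compared
    explicitly instead of being passed to bool().
    """
    raw = os.environ.get(name)
    if raw is None:
        # Variable not set at all: fall back to the caller's default.
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


# With the pytest.ini above, the variable would hold the string "False".
os.environ["IS_ALPACA_EVAL_2"] = "False"
print(bool(os.environ["IS_ALPACA_EVAL_2"]))  # naive check: non-empty string is truthy
print(env_flag("IS_ALPACA_EVAL_2"))          # explicit parsing of the string
```

Comparing against a small allow-list keeps the behavior predictable when the variable is unset or misspelled, at the cost of rejecting other truthy spellings.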
3 changes: 2 additions & 1 deletion requirements.txt
@@ -3,4 +3,5 @@ datasets
openai>=1.0.0
pandas
tiktoken>=0.3.2
fire
fire
scipy