Enabled native function calling for O1 + added support for the reasoning_effort option in the config. #6256
base: main
Conversation
@@ -71,6 +71,7 @@
    'claude-3-5-haiku-20241022',
    'gpt-4o-mini',
    'gpt-4o',
    'o1',
Have we tried without native function calling, to compare results with it enabled versus disabled (the prompting-based replacement)?
Just to note, strictly speaking, using native function calling is already supported; it's just not enabled by default. There's a native_function_calling setting to enable it.
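For context, a minimal sketch of what enabling that setting could look like in config.toml; the section name and exact key spelling here are assumptions based on this comment, not verified against the docs:

```toml
# Hypothetical config.toml sketch (section and key names assumed from the comment above)
[llm]
model = "o1"
# Opt in to native tool calling instead of the prompting-based replacement
native_function_calling = true
```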
With native function calling the model solves 48% of the issues; with simulated function calling, 30%.
I will make the results available soon; I still need to finish running SWE-Bench Verified (the result above is preliminary, after running 300/500 issues).
That's a good result! I'm surprised; I'm losing track of our current evals, and I thought it was much lower last time.
When using the current simulated tools from OH, O1's performance degrades significantly. It is quite interesting because 4o's performance is not impacted as much (19% vs. 12%).
That makes sense to me actually! We have seen significant differences before. That might even include Sonnet 3.5; I just don't think we know for sure why, because when it jumped from something like ~26% to over 50%, three things happened:
- switched from simulated "actions" to native tool calling
- also redefined the prompts/tools very very close to Anthropic's tools
- also went from Sonnet 3.5 (old) to Sonnet 3.5 (new) 😂
I'm not sure that we know which factor mattered how much on that one. 😅
Are these preliminary results on this branch, or on the supervisor branch?
Interesting! O1_native_tool_calls gets a higher score than Sonnet 3.5 (but not way higher, and in no way enough to justify its price), so being close to Anthropic's tools might matter, but not that much.
The results will be shared today on Hugging Face; I am currently evaluating them using the harness.
The supervisor branch will be done soon, but I will run the experiments first and then update the branch before or after the ICML deadline (30 Jan), depending on how much work I have left 😅
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Thank you!
@AlexCuadron An alternative way to implement this in llm.py is to set the new kwarg directly in the partial function, along with the other kwargs we know at init time; then it all works the same. But I'm fine with the current PR implementation too. I haven't tested it, but if you're happy with it, we can merge it?
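For reference, a rough sketch of that alternative; the function and variable names are illustrative assumptions rather than the actual llm.py code, and it assumes the installed litellm version forwards reasoning_effort through to OpenAI:

```python
from functools import partial

import litellm


def build_completion(model: str, api_key: str, reasoning_effort: str | None = None):
    """Hypothetical sketch: bind all init-time kwargs, including the new
    reasoning_effort, directly into the partial so later calls pick them up."""
    kwargs = {"model": model, "api_key": api_key}
    # Only attach reasoning_effort for models that accept it (the o1 family);
    # this gating condition is an illustrative assumption.
    if reasoning_effort is not None and model.startswith("o1"):
        kwargs["reasoning_effort"] = reasoning_effort
    return partial(litellm.completion, **kwargs)


# Usage sketch:
# completion = build_completion("o1", api_key="...", reasoning_effort="high")
# response = completion(messages=[{"role": "user", "content": "Hello"}])
```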
Thanks for the heads up! I tested 4o and o1 and both work without any issue. I can merge it after the tests are completed.
End-user friendly description of the problem this fixes or functionality that this introduces
The reasoning_effort parameter can be defined for the o1 family!
Give a summary of what the PR does, explaining any non-trivial design decisions
Added support for native function calling for O1 and added support for specifying the reasoning_effort in the configuration file.
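As a rough illustration of the config-file side (the section and key names are assumptions for illustration, not taken from the PR diff), the reasoning_effort value would sit alongside the model selection:

```toml
# Hypothetical config.toml sketch; key names are assumptions for illustration
[llm]
model = "o1"
# OpenAI's o1 family accepts "low", "medium", or "high"
reasoning_effort = "high"
```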
Link of any specific issues this addresses