Speed up the inference #24
Comments
Are you using Triton flash attention and bfloat16, as described in the model's Hugging Face README? You can also accelerate this model with FasterTransformer.
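For reference, a minimal sketch of that configuration, assuming the model follows the loading pattern documented in the MPT README (`attn_config['attn_impl']` and `init_device` set on the config). The function name `load_mpt_bf16_triton` and its parameters are hypothetical; `model_name` is whatever checkpoint you are using, and the Triton path needs a CUDA GPU plus the triton flash attention dependencies installed.

```python
def load_mpt_bf16_triton(model_name: str, device: str = "cuda:0"):
    """Load an MPT-style checkpoint with Triton attention and bfloat16 weights.

    Assumption: the checkpoint ships MPT's custom modeling code, so its config
    exposes `attn_config` and `init_device` as in the MPT README.
    """
    # Imported lazily so the helper can be defined without a GPU stack present.
    import torch
    import transformers

    config = transformers.AutoConfig.from_pretrained(
        model_name, trust_remote_code=True
    )
    config.attn_config["attn_impl"] = "triton"  # use the Triton flash attention kernel
    config.init_device = device                 # initialize weights directly on the GPU

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name,
        config=config,
        torch_dtype=torch.bfloat16,  # load weights in bfloat16
        trust_remote_code=True,
    )
    model.eval()
    return model
```

Calling `load_mpt_bf16_triton("<your-checkpoint>")` returns the model ready for `generate`; whether this matches the intended setup for this repository is an assumption to check against the README.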
Thanks for the reply! I will try the right configuration for Triton and bfloat16. With them enabled, how many milliseconds per token should I expect on an A100-80G or a V100-32G?
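To compare numbers across GPUs and configurations, it helps to measure ms/token the same way each time. Below is a small sketch of such a measurement; `ms_per_token` and the `generate` callable are hypothetical names, and `generate` is assumed to wrap your own model call and return the newly generated tokens.

```python
import time
from typing import Callable, Sequence


def ms_per_token(
    generate: Callable[[int], Sequence[int]],
    new_tokens: int = 64,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Average wall-clock milliseconds per generated token.

    `generate(n)` must produce up to n new tokens and return them. On a GPU it
    should synchronize the device before returning (e.g. torch.cuda.synchronize()),
    otherwise asynchronous kernels make the timing look too fast.
    """
    # Warm-up runs absorb one-off costs such as kernel compilation.
    for _ in range(warmup):
        generate(new_tokens)

    total_seconds = 0.0
    total_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(new_tokens)
        total_seconds += time.perf_counter() - start
        total_tokens += len(produced)

    if total_tokens == 0:
        raise ValueError("generate() returned no tokens; nothing to measure")
    return 1000.0 * total_seconds / total_tokens
```

Counting the tokens actually returned, rather than the number requested, keeps the figure honest when generation stops early at an end-of-sequence token.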
Hi, I enabled Triton and bfloat16 inside the Docker image provided here: https://github.com/mosaicml/llm-foundry/, with the dependencies installed, but the following error is thrown:
Hi, this model seems nice, but I find the inference speed very slow (70 ms/token on a single A100), so I want to speed it up.
It seems to be related to MPT itself: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/23
Any suggestions or best practices for speeding it up? E.g., FasterTransformer (a bit low-level), ONNX Runtime, or OneFlow?