feat: implement batching strategies #3630

Open: sauyon wants to merge 22 commits into bentoml:main from sauyon:target-latency
Commits (22)
- 7510f21 set training request wait to 0 (sauyon)
- 06ffa76 add documentation (sauyon)
- ca5edb7 use slots class instead of a tuple (sauyon)
- c11fb81 implement batching strategies (sauyon)
- 451eff0 Update src/bentoml/_internal/marshal/dispatcher.py (sauyon)
- 6af6df2 Revert "revert: "chore(dispatcher): refactor out training code (#3663… (sauyon)
- bc4753f fix refactor implementation (sauyon)
- c4e2bec set training request wait to 0 (sauyon)
- 3e69b4d add documentation (sauyon)
- 6652866 use slots class instead of a tuple (sauyon)
- b153452 implement batching strategies (sauyon)
- 4211a6b --wip-- [skip ci] (sauyon)
- a8e17f4 Merge branch 'target-latency' of github:sauyon/BentoML into target-la… (sauyon)
- a4c3eac Merge branch 'main' into target-latency (sauyon)
- 5e6c844 update optimizer (sauyon)
- ce337a6 more misc fixes (sauyon)
- ca9bfcf Merge branch 'main' into target-latency (sauyon)
- 17e61d5 Merge branch 'main' into target-latency (sauyon)
- a4d4850 format (sauyon)
- ce403c1 Merge branch 'main' into target-latency (sauyon)
- 28d209a ci: auto fixes from pre-commit.ci (pre-commit-ci[bot])
- 56088fe minor fixes (sauyon)
@@ -52,28 +52,82 @@ In addition to declaring model as batchable, batch dimensions can also be config
Configuring Batching
--------------------

If a model supports batching, adaptive batching is enabled by default. To explicitly disable or
control adaptive batching behaviors at runtime, configuration can be specified under the
``batching`` key. Additionally, there are three configuration keys for customizing batching
behaviors: ``max_batch_size``, ``max_latency_ms``, and ``strategy``.
Max Batch Size
^^^^^^^^^^^^^^

Configured through the ``max_batch_size`` key, max batch size represents the maximum size a batch
can reach before being released for inferencing. Max batch size should be set based on the capacity
of the available system resources, e.g. memory or GPU memory.
Max Latency
^^^^^^^^^^^

Configured through the ``max_latency_ms`` key, max latency represents the maximum latency in
milliseconds that the scheduler will attempt to uphold by cancelling requests when it predicts that
the runner server cannot service them in time. Max latency should be set based on the service level
objective (SLO) of the inference requests.
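
To make the interplay of these two keys concrete, here is a minimal sketch of the kind of
bookkeeping a batching scheduler might perform on each tick. This is an illustration only, not
BentoML's actual dispatcher: the ``Request`` class, the function names, and the
``expected_exec_ms`` estimate are all invented for the example.

.. code-block:: python

   import time

   class Request:
       """Hypothetical queued request; records its arrival time."""
       def __init__(self, payload):
           self.payload = payload
           self.enqueued_at = time.time()

   def age_ms(req, now=None):
       now = time.time() if now is None else now
       return (now - req.enqueued_at) * 1000.0

   def poll(queue, max_batch_size, max_latency_ms, expected_exec_ms):
       """One scheduler tick under the two limits described above.

       A request is cancelled when even an immediate dispatch could not
       finish inside max_latency_ms; the batch is released as soon as it
       reaches max_batch_size. When the batch is not yet full, the batching
       strategy (next section) decides how much longer to wait.
       """
       cancelled = [r for r in queue if age_ms(r) + expected_exec_ms >= max_latency_ms]
       queue = [r for r in queue if r not in cancelled]
       batch = queue[:max_batch_size] if len(queue) >= max_batch_size else None
       return queue, batch, cancelled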
Batching Strategy
^^^^^^^^^^^^^^^^^

Configured through the ``strategy`` and ``strategy_options`` keys, the batching strategy determines
how the scheduler chooses a batching window, i.e. how long it waits for additional requests to
combine into a batch before dispatching it for execution. There are three options (a decision-rule
sketch follows this list):
- target_latency: this strategy waits until it expects that the first request received will take
  around the configured ``latency_ms`` target to complete before beginning execution. Choose this
  method if you expect your service workload to be very bursty, so that the intelligent wait
  algorithm would do a poor job of identifying average wait times.

  It takes one option, ``latency_ms`` (default 1000), which is the latency target to use for
  dispatch.

- fixed_wait: this strategy waits a fixed amount of time after the first request has been received.
  It differs from the target_latency strategy in that it does not consider the amount of time it
  expects a batch will take to execute.

  It takes one option, ``wait_ms`` (default 1000), the amount of time to wait after receiving the
  first request.

- intelligent_wait: this strategy waits intelligently in an effort to optimize average latency
  across all requests. It takes the average time spent in queue, then calculates the average time
  it expects to wait for and then execute the batch including the next request. If that time,
  multiplied by the number of requests in the queue, is less than the average wait time, it
  continues waiting for the next request to arrive. This is the default, and the other options
  should only be chosen if undesirable latency behavior is observed.

  It has one option, ``decay`` (default 0.95), which is the rate at which the dispatcher decays the
  wait time per dispatched job. Note that this does not decay the actual expected wait time, but
  instead reduces the batching window, which indirectly reduces the average waiting time.
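
The three strategies differ only in how they decide whether to hold the batching window open. The
sketch below shows each decision rule in illustrative Python; it is not BentoML's dispatcher code,
and the function name, its parameters, and the latency estimates it consumes are all assumptions
made for this example.

.. code-block:: python

   def should_keep_waiting(strategy, options, queue, now_ms,
                           expected_exec_ms, avg_arrival_interval_ms,
                           avg_queue_time_ms, window_scale=1.0):
       """Return True to keep the batching window open, False to dispatch.

       The estimates (expected_exec_ms, avg_arrival_interval_ms,
       avg_queue_time_ms) are assumed to be tracked elsewhere, e.g. by the
       latency optimizer. `queue` holds (enqueue_time_ms, payload) pairs.
       """
       if not queue:
           return True  # nothing to dispatch yet
       first_enqueued_ms = queue[0][0]

       if strategy == "target_latency":
           # Dispatch once the oldest request is expected to complete
           # roughly `latency_ms` after its arrival.
           latency_ms = options.get("latency_ms", 1000)
           return now_ms - first_enqueued_ms + expected_exec_ms < latency_ms

       if strategy == "fixed_wait":
           # Dispatch a fixed time after the first request, ignoring how
           # long execution is expected to take.
           wait_ms = options.get("wait_ms", 1000)
           return now_ms - first_enqueued_ms < wait_ms

       if strategy == "intelligent_wait":
           # Keep waiting while the expected cost of waiting for (and then
           # executing with) one more request, paid by every queued request,
           # is less than the average time requests already spend in queue.
           extra_wait_ms = (avg_arrival_interval_ms + expected_exec_ms) * window_scale
           return extra_wait_ms * len(queue) < avg_queue_time_ms

       raise ValueError(f"unknown strategy: {strategy}")

One way to read the ``decay`` option under this sketch: after each dispatched job the dispatcher
would shrink the window, e.g. ``window_scale *= options.get("decay", 0.95)``, narrowing the batching
window over time without touching the latency estimates themselves.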
.. code-block:: yaml
   :caption: ⚙️ `configuration.yml`

   runners:
     # batching options for all runners
     batching:
       enabled: true
       max_batch_size: 100
       max_latency_ms: 500
       strategy: intelligent_wait
     iris_clf:
       # batching options for specifically the iris_clf runner
       # these options override the above
       batching:
         enabled: true
         max_batch_size: 100
         max_latency_ms: 500
         strategy: target_latency
         strategy_options:
           latency_ms: 200
Monitoring
----------
@@ -81,9 +81,21 @@ api_server:

runners:
  resources: ~
  timeout: 300
  optimizer:
    name: linear
    options:
      initial_slope_ms: 2.
      initial_intercept_ms: 1.
  batching:
    enabled: true
    max_batch_size: 100
    # which strategy to use to batch requests
    # there are currently two available options:
    # - target_latency: attempt to ensure requests are served within a certain amount of time
    # - adaptive: wait a variable amount of time in order to optimize for minimal average latency
    strategy: adaptive
    strategy_options:
      decay: 0.95
  max_latency_ms: 10000
  logging:
    access:
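
The ``optimizer`` block above configures a linear model of batch execution time, seeded by
``initial_slope_ms`` and ``initial_intercept_ms``. A plausible reading, sketched below in
illustrative Python rather than the code in dispatcher.py, is that expected latency grows linearly
with batch size and the two parameters are refined online from observed samples; the class name,
update rule, and learning rate are all invented for this example.

.. code-block:: python

   class LinearOptimizer:
       """Illustrative model: latency_ms ~= slope * batch_size + intercept.

       Seeded from the configuration above and nudged toward each observed
       (batch_size, duration) sample with a simple online gradient step; the
       real implementation may use a different fitting scheme.
       """
       def __init__(self, initial_slope_ms=2.0, initial_intercept_ms=1.0, lr=0.01):
           self.slope = initial_slope_ms
           self.intercept = initial_intercept_ms
           self.lr = lr

       def predict_ms(self, batch_size: int) -> float:
           return self.slope * batch_size + self.intercept

       def observe(self, batch_size: int, duration_ms: float) -> None:
           # Gradient step on the squared prediction error.
           err = self.predict_ms(batch_size) - duration_ms
           self.slope -= self.lr * err * batch_size
           self.intercept -= self.lr * err

   opt = LinearOptimizer()
   opt.observe(batch_size=8, duration_ms=20.0)
   print(opt.predict_ms(16))  # expected duration for a batch of 16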
Review comment: Docs changes are outdated, correct?