Implement script to cleanup user-facing commands #733
base: main
Conversation
The function for compilation still needs an implementation.
Force-pushed e67c9d0 to 2f1bf47 (compare).
Compile is now implemented.
Force-pushed 2598cf7 to bf6ac52 (compare).
Nice start! It definitely helps already seeing each step in sequence as part of a script, and this would immediately help with the number of environment variables and steps in the user guide at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md.
My largest piece of feedback is about which language the script is written in: prefer Python, given how many options this has, what sorts of other libraries and tools it interoperates with, and how we will want to distribute this to users in our packages.
Beyond that, my other suggestions are about finer user experience points. We could get by without addressing some of those if we scope and name the tool appropriately, like with mlc_llm convert_weight and trtllm-build. Something like "shark ai importllm" (with some punctuation) maybe?
run_shark_ai.sh (Outdated)
Let's find a better location and name for this script, rather than run_shark_ai.sh at the repository root.
Location
- If this connects sharktank to shortfin then it could sit under https://github.com/nod-ai/shark-ai/tree/main/shark-ai.
- If this just handles model import and compilation and not serving then it could be part of https://github.com/nod-ai/shark-ai/tree/main/sharktank
- Some projects put core scripts at the top level, like https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py. I don't think we should do that quite yet though, given the other considerations here
Naming
As this script is specific to LLMs (it wraps export_paged_llm_v1, among other things), it should have "LLM" in the name, or at least not be overly general to all of "shark_ai". We support other models, like SDXL.
This also isn't "running" the project, or an LLM, in a sense that I would expect. It is doing model preparation, similar to the mlc_llm convert_weight tool + mode described at https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight or the trtllm-build command at https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#compile-the-model-into-a-tensorrt-engine.
Brainstorming more name choices...
Use the project name as the tool name then have a subcommand for each task:
shark-ai compile_llm
shark-ai serve
Put -build in the name, like trtllm:
shark-ai-build compile_llm
shark-ai-build compile_sdxl
I'm not sure if we should try to fit everything that we support into a neat box like those, or if we should embrace having more bespoke pipelines and servers, at least for our current split between LLMs (llama) and image generators (SDXL, Flux).
If we pick a sufficiently specific tool name, then we can wiggle out of it later. If we pick a generic name like shark-ai then we should continue to use that across future releases. I definitely want to aim for something that we include as a console script that gets installed as part of the Python packages.
run_shark_ai.sh (Outdated)
main() {
    if [[ $# -eq 0 ]]; then
        print_help_string_and_exit 1
    fi

    parse_and_handle_args "$@"
    check_valid
I'm impressed that you wrote this all in Bash, but I'd much prefer for a user-facing tool to be written in Python to be more portable and easier to maintain. We can even provide a console_script for a Python script so installing the Python packages installs that script as if it was a binary on the user's PATH: https://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point. That's how the IREE tools are distributed, fyi: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/iree/compiler/tools/scripts/iree_compile/__main__.py
We can then use argparse instead of this bash parsing/validation too.
With a Python script we could even go a few steps further and use IREE's compiler API and the sharktank.examples.export_paged_llm_v1 library code from Python, making the script self contained and thus more robust, rather than jumping out of process to individual tools and scripts. I do like the unix tools philosophy though of having a collection of tools that each do one thing well.
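For reference, a minimal sketch of the console_scripts wiring (the module and command names here are placeholders, not the actual package layout):

```python
# Hypothetical setup.py excerpt: exposes a Python main() as a command on PATH
# when the package is installed.
from setuptools import find_packages, setup

setup(
    name="shark-ai",  # assumed distribution name, for illustration only
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # "<command the user types> = <module path>:<function>"
            "shark-ai-build = shark_ai.tools.build:main",  # placeholder module path
        ],
    },
)
```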
@ScottTodd thanks for the review! The philosophy behind choosing bash for the script was that, since we are combining so many different tools, a bash script would essentially do little beyond acting as glue between the steps.
In fact, the first option was indeed Python, and I do agree it would have been a more convenient choice. I can port the script to Python too, no worries.
run_shark_ai.sh (Outdated)
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hip-legacy-sync=false \
--iree-hal-memoization=true \
--iree-opt-strip-assertions \
Lots of non-default flags here, just for tp8 mode? If these are necessary then I'd put them in a flagfile or well-documented variable (much easier when writing this in Python instead of Bash)
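For illustration, a sketch of what a "well-documented variable" could look like on the Python side; the flag values are copied from the diff above, and which ones are actually required is still the open question:

```python
# Flags currently used when compiling llama for tp8 benchmarking.
# TODO: document why each one is needed, or drop it.
TP8_COMPILE_FLAGS = [
    "--iree-dispatch-creation-enable-aggressive-fusion=true",
    "--iree-global-opt-propagate-transposes=true",
    "--iree-opt-aggressively-propagate-transposes=true",
    "--iree-opt-data-tiling=false",
    # Shell quoting around the pass pipeline is unnecessary when passing argv
    # directly to subprocess (no shell involved).
    "--iree-preprocessing-pass-pipeline="
    "builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))",
    # ... remaining flags from the current script ...
    "--iree-opt-strip-assertions",
]
```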
These flags are very much open to discussion. These are what we currently use for compiling for benchmarking tp8 mode. I am sure some of these can be removed.
run_shark_ai.sh (Outdated)
--iree-hal-target-device=hip[0] \
--iree-hal-target-device=hip[1] \
--iree-hal-target-device=hip[2] \
--iree-hal-target-device=hip[3] \
--iree-hal-target-device=hip[4] \
--iree-hal-target-device=hip[5] \
--iree-hal-target-device=hip[6] \
--iree-hal-target-device=hip[7] \
(out of scope for this PR)
We may want to add a meta flag or some special parsing for hip[0,7] to condense this.
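If the script ends up in Python, one way to condense it on the script side (a hypothetical helper, not a new iree-compile flag):

```python
# Hypothetical helper: expand a device count into the per-device HAL target flags,
# so tp8 becomes hip_target_device_flags(8) instead of eight hand-written lines.
def hip_target_device_flags(device_count: int) -> list[str]:
    return [f"--iree-hal-target-device=hip[{i}]" for i in range(device_count)]
```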
run_shark_ai.sh (Outdated)
Some options are required to be specified.
-h    prints the help string. The same output is emitted when no arguments are given.
-v    enables verbose logging. When not specified, only errors are logged.
-w    the location of the GGUF/IRPA file(s) that contain the parameters. (required)
Some LLM serving projects include weight downloading from huggingface or other model repositories as part of their scripting, at least with a thin wrapper around huggingface-cli. See for example https://docs.vllm.ai/en/latest/getting_started/quickstart.html
Other projects like https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html and https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight are explicit about you needing to download the files yourself ahead of time.
This can be a stretch goal for such a script/tool, but I do want it at least considered during the initial design discussions and architecture planning. For user-facing tools, we want the happy/golden path to be as simple and easy to follow as possible. Many users will (at least at first) just want to try an off-the-shelf model, and if we can download that for them then there is lower risk of them getting a model that won't work well or having some configuration issue.
I would want our user guide to talk about huggingface model repositories using their canonical paths like facebook/opt-125m and meta-llama/Meta-Llama-3.1-8B-Instruct, not our shorthand names (lacking context) like llama3_8B_fp16. Even just explaining "for example, to run the model from https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, do X Y and Z" would help put these options into context. Any scripting/tooling we build should work in concert with the user guide. The tools should be self-documenting so the user guide doesn't need to be referenced constantly.
In a Python script, I'd maybe have a mutually exclusive argparse group (https://docs.python.org/3/library/argparse.html#mutual-exclusion) that lets you choose between an already downloaded/local weight file and a huggingface repository name. Then you could either --model-hf=meta-llama/Meta-Llama-3.1-8B-Instruct or --model-weights=/path/to/llama_8b_instruct_f16.gguf.
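A rough sketch of that argparse shape (option names are illustrative, not final):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
source = parser.add_mutually_exclusive_group(required=True)
# A Hugging Face repository to fetch (the download itself could be a thin
# wrapper around huggingface_hub or huggingface-cli)...
source.add_argument("--model-hf", metavar="REPO_ID",
                    help="e.g. meta-llama/Meta-Llama-3.1-8B-Instruct")
# ...or a weight file the user already has locally.
source.add_argument("--model-weights", type=Path, metavar="FILE",
                    help="e.g. /path/to/llama_8b_instruct_f16.gguf")

args = parser.parse_args()
```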
I thought of this as well, but @kumardeepakamd's POV was that the weight file should be handled by the user however they would like, and not be a part of the script in any way, so the idea was dropped.
run_shark_ai.sh (Outdated)
Some options are required to be specified.
-h    prints the help string. The same output is emitted when no arguments are given.
-v    enables verbose logging. When not specified, only errors are logged.
-w    the location of the GGUF/IRPA file(s) that contain the parameters. (required)
I also don't see a tokenizer config mentioned in here. That's not needed for the sharktank model export process, but it is needed for serving via shortfin. If this script intends to connect model downloading/exporting/compiling and serving, that should be included somehow.
Yes. The script does not handle running shortfin (yet). Again, the idea was that the user might have a different workflow than what the script enforces, so that was not included.
@ScottTodd thanks a lot for the review. I completely agree with your views on moving to Python, so I am closing this PR in favor of a new one that implements the script in Python.
Why close this PR and create a new one? The review history and GitHub cross-references (linking to #402 and #691) are all useful to keep here. Splitting to a new PR fragments the discussion. You can rename the PR to better reflect what it is about (e.g. "Introduce new tool for LLM import and compilation"), without being overly specific about one detail of the implementation (a single file written in Bash).
@ScottTodd re-opened the PR.
class CliParser(ArgumentParser):
    def print_help(self, file=None) -> None:
        if file is None:
            file = sys.stdout

        help_text = """usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir                       the directory where the exported artifacts will be saved.

options:
  -h, --help                       show this help message and exit
  -v, --verbose                    set logging level to INFO. The default logging level is WARNING.
  -c, --compile                    compile the exported model as part of the pipeline. Default is FALSE.
  -w, --weight-file WEIGHT_FILE    the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export                     export the model in tp1 mode.
  -a, --artifact-dir ARTIFACT_DIR  the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
  -s, --shard                      shard the weight file in tp8 mode and export to MLIR.
  -b, --batch-sizes BATCH_SIZES    batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i, --ir IR                      location for the MLIR to be compiled, if compilation is done independently.
  -p, --tensor-parallel {1,8}      tensor parallel size. Used for independent compilation. Defaults to 1.
"""
        _ = file.write(help_text + "\n")
ArgumentParser should provide most of this help text for you, then you won't need to duplicate options and help text in two different parts of the file. Is there a particular reason you opted to format all this yourself and use RawTextHelpFormatter instead of the standard format?
@ScottTodd the reason to have a separate help string is that the default format is actually quite unclean and difficult to read. It repeats options against each flag (the abbreviation and the long form) and then wraps the help string for that flag onto the next line. Since this is supposed to be user-facing, IMO this would be an important part of the UX.
Explicit help text:
λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir                       the directory where the exported artifacts will be saved.

options:
  -h, --help                       show this help message and exit
  -v, --verbose                    set logging level to INFO. The default logging level is WARNING.
  -c, --compile                    compile the exported model as part of the pipeline. Default is FALSE.
  -w, --weight-file WEIGHT_FILE    the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export                     export the model in tp1 mode.
  -a, --artifact-dir ARTIFACT_DIR  the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
  -s, --shard                      shard the weight file in tp8 mode and export to MLIR.
  -b, --batch-sizes BATCH_SIZES    batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i, --ir IR                      location for the MLIR to be compiled, if compilation is done independently.
  -p, --tensor-parallel {1,8}      tensor parallel size. Used for independent compilation. Defaults to 1.
Default help text (line width 100):
λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s]
                           [-b BATCH_SIZES] [-i IR] [-p {1,8}]
                           export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir            the directory where the exported artifacts will be saved.

options:
  -h, --help            show this help message and exit
  -v, --verbose         set logging level to INFO. The default logging level is WARNING.
  -c, --compile         compile the exported model as part of the pipeline. Default is FALSE.
  -w WEIGHT_FILE, --weight-file WEIGHT_FILE
                        the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export          export the model in tp1 mode.
  -a ARTIFACT_DIR, --artifact-dir ARTIFACT_DIR
                        the location where the artifacts (sharded weights) should be saved.
                        Defaults to EXPORT_DIR/artifacts/
  -s, --shard           shard the weight file in tp8 mode and export to MLIR.
  -b BATCH_SIZES, --batch-sizes BATCH_SIZES
                        batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i IR, --ir IR        location for the MLIR to be compiled, if compilation is done
                        independently.
  -p {1,8}, --tensor-parallel {1,8}
                        tensor parallel size (required for independent compilation).
I prefer the default line wrapping behavior, especially at line width 80, since it left pads the wrapped lines. I see your point about the repetition, but sticking to standard argparse behavior will make the tool output more predictable and simplify maintenance.
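If the repetition and wrapping are the main concerns, one possible middle ground (a sketch, not a requirement) is to keep the argparse-generated help but widen its help column:

```python
import argparse

# Sketch: keep argparse-generated help (no duplicated text to maintain) while
# giving option descriptions more room before wrapping. max_help_position and
# width are HelpFormatter constructor parameters; they are not part of the
# documented public API, so treat this as a pragmatic workaround.
parser = argparse.ArgumentParser(
    prog="export_and_serve.py",
    description="Utility script to combine shark-ai tools",
    formatter_class=lambda prog: argparse.HelpFormatter(
        prog, max_help_position=40, width=100
    ),
)
```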
@@ -0,0 +1,412 @@
import logging
Please add copyright header comments to source files.
if not self.shard and not self.export and not self.ir_path and self.compile:
    logger.error(
        "To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
    )
I'm finding this condition confusing to read. Maybe reorder to match the text.
Suggested change:
-        if not self.shard and not self.export and not self.ir_path and self.compile:
+        if self.compile and (not self.shard and not self.export and not self.ir_path):
             logger.error(
                 "To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
             )
Actually, see https://stackoverflow.com/questions/19414060/argparse-required-argument-y-if-x-is-present. That has the same style and a tip about using parser.error() instead of logger and exit.
    )
    exit(1)

if not os.path.isfile(self.weight_loc):
weight_loc is already a pathlib.Path object, so you can use exists() or is_file():
https://docs.python.org/3/library/pathlib.html#pathlib.Path.exists
https://docs.python.org/3/library/pathlib.html#pathlib.Path.is_file

Suggested change:
- if not os.path.isfile(self.weight_loc):
+ if not self.weight_loc.is_file():

I generally default to pathlib and try to cut down on any uses of os.path or os. Pathlib is much more modern.
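For reference, a small sketch of the pathlib equivalents of the os.path calls in question (the path value is illustrative):

```python
from pathlib import Path

weight_loc = Path("/models/llama_8b_instruct_f16.gguf")  # illustrative path

weight_loc.exists()              # os.path.exists(weight_loc)
weight_loc.is_file()             # os.path.isfile(weight_loc)
weight_loc.parent / "artifacts"  # os.path.join(os.path.dirname(weight_loc), "artifacts")
```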
# The argparser should not print the usage when an error occurs.
# We handle that ourselves.
parser = CliParser(
    prog="export_and_serve.py",
A note on serving: plan on there being checkpoint steps after each of:
- Downloading weights
- Compiling the model (I'd group the export from PyTorch to .mlir and the .mlir to .vmfb compilation in this, from a user's perspective)
- Serving the model
A user may want to serve the model on multiple machines, not just the machine that was used to compile. If we can package the artifacts for serving (program.vmfb, weights_shard1.irpa, tokenizer_config.json) in some organized way then we can help users push those artifacts and start the server.
See for example how https://llm.mlc.ai/docs/compilation/compile_models.html is structured:
- There is a single mlc_llm tool with multiple modes: compile, convert_weight, gen_config
- The running/serving step points at the directory of outputs
We could do something similar here using sub-commands: https://docs.python.org/3/library/argparse.html#sub-commands
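A rough sketch of that sub-command shape (the command and option names are just the ideas brainstormed above, not a decision):

```python
import argparse
from pathlib import Path

# One top-level tool with a sub-command per task, mirroring the mlc_llm layout.
parser = argparse.ArgumentParser(prog="shark-ai")
subparsers = parser.add_subparsers(dest="command", required=True)

compile_cmd = subparsers.add_parser("compile", help="export and compile a model")
compile_cmd.add_argument("--weight-file", type=Path, required=True)
compile_cmd.add_argument("--output-dir", type=Path, required=True,
                         help="directory collecting .vmfb, sharded .irpa, tokenizer config")

serve_cmd = subparsers.add_parser("serve", help="serve artifacts produced by 'compile'")
serve_cmd.add_argument("--artifacts-dir", type=Path, required=True)

args = parser.parse_args()
```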
cmd = [
    "iree-compile",
    f"{self.mlir_path}",
    f"-o={tp1_vmfb_path}",
]

# TODO(vinayakdsci): Add a flag to support backends other than rocm.
cmd += [f"--iree-hal-target-backends={self.hal_target_backend}"]

# TODO(vinayakdsci): Add a flag to support targets other than gfx942.
cmd += [f"--iree-hip-target={self.hip_target}"]

compile_subp = subprocess.run(
    cmd,
    capture_output=True,
)
We may want this to use the IREE compiler API or the iree.build API (https://iree-python-api.readthedocs.io/en/latest/compiler/build.html) instead of using a subprocess to call iree-compile.
Then:
- we wouldn't need to check for iree-compile on the PATH
- developers wanting to bring their own iree-compile build would need to build the python bindings
- we'd get the benefits of iree.build for compiling multiple programs in a build graph (as needed... LLMs don't need that as much as SDXL though)
- we'd get integrated error handling instead of checking return codes and sys.stderr
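A sketch of the in-process variant, assuming the iree.compiler package's compile_file() helper; the keyword arguments are assumptions based on that API and should be verified against the installed version:

```python
# Sketch, not a drop-in replacement: compile exported MLIR in-process via the
# IREE compiler Python package instead of shelling out to iree-compile.
from pathlib import Path

from iree import compiler as ireec  # provided by the IREE compiler Python package


def compile_tp1(mlir_path: Path, vmfb_path: Path, hip_target: str = "gfx942") -> None:
    # target_backends / extra_args / output_file are assumed keyword arguments;
    # "rocm" mirrors the current --iree-hal-target-backends value in the script.
    ireec.compile_file(
        str(mlir_path),
        target_backends=["rocm"],
        extra_args=[f"--iree-hip-target={hip_target}"],
        output_file=str(vmfb_path),
    )
```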
)

def _compile_tp8(self):
    logger.info("Compiling sharded IR")
May also want to include a time estimate here, or ideally a progress bar (iree-org/iree#14369)
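Until the compiler exposes real progress reporting, even coarse wall-clock logging around the compile call would help set expectations; a small sketch:

```python
import logging
import time

logger = logging.getLogger(__name__)

# Sketch: log how long the sharded compile takes, as a stopgap for a progress bar.
logger.info("Compiling sharded IR (this can take a while)...")
start = time.monotonic()
# ... invoke iree-compile or the compiler API here ...
logger.info("Sharded compile finished in %.1f seconds", time.monotonic() - start)
```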
artifacts_dir = (
    args.artifact_dir
    if args.artifact_dir is not None
    else Path(os.path.join(args.export_dir, "artifacts/"))
pathlib:
Suggested change:
- else Path(os.path.join(args.export_dir, "artifacts/"))
+ else args.export_dir / "artifacts"
shard_file = os.path.join(
    str(self.artifacts_dir),
    str(self.weight_loc).split("/")[-1][:-5] + "_tp8.irpa",
)
Use pathlib here. Join with / and replace the [-1][:-5] with some of the tools from https://docs.python.org/3/library/pathlib.html#corresponding-tools (.stem, .parent, etc.)
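A sketch of that suggestion (the paths are illustrative; Path.stem drops the final suffix such as ".irpa" or ".gguf", which is what the [:-5] slice was doing):

```python
from pathlib import Path

# Sketch: pathlib equivalent of the os.path.join / split("/") / [:-5] chain above.
artifacts_dir = Path("exports/artifacts")                 # illustrative value
weight_loc = Path("/models/llama_8b_instruct_f16.irpa")   # illustrative value
shard_file = artifacts_dir / f"{weight_loc.stem}_tp8.irpa"
```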
Thanks for rewriting in Python! This looks much easier to maintain and I see no portability issues. Comments are split between Python style and overall user experience details. Some of this we can iterate on after landing an initial version, but I want to get a solid foundation before we advertise this to users.
Progress on #402 and #691.
Adds a Python script that will unify all the commands in llama_serving.md, providing a cleaner, single interface for the user to interact with.