Implement script to cleanup user-facing commands #733

Open

vinayakdsci wants to merge 5 commits into main from e2e-script-impl
Conversation

@vinayakdsci vinayakdsci commented Jan 1, 2025

Progress on #402 and #691.

Adds a Python script that unifies all the commands in llama_serving.md, providing a single, cleaner interface for the user to interact with.

@vinayakdsci (Author)

The function for compilation still needs an implementation.

@vinayakdsci vinayakdsci force-pushed the e2e-script-impl branch 5 times, most recently from e67c9d0 to 2f1bf47 on January 1, 2025 13:57
@vinayakdsci (Author)

Compile is now implemented.

@vinayakdsci vinayakdsci marked this pull request as ready for review January 2, 2025 10:55
@ScottTodd (Member) left a comment

Nice start! It definitely helps already seeing each step in sequence as part of a script, and this would immediately help with the number of environment variables and steps in the user guide at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md.

My largest piece of feedback is around what language the script is written in - prefer Python, given how many options this has, what sorts of other libraries and tools it interoperates with, and how we will want to distribute this to users in our packages.

Beyond that, my other suggestions are about finer user experience points. We could get by without some of those being addressed if we scope and name the tool appropriately, as with mlc_llm convert_weight and trtllm-build. Something like "shark ai importllm" (with some punctuation) maybe?

run_shark_ai.sh Outdated
Member

Let's find a better location and name for this script, rather than run_shark_ai.sh at the repository root.

Location

Naming

As this script is specific to LLMs (it wraps export_paged_llm_v1, among other things), it should have "LLM" in the name, or at least not be overly general to all of "shark_ai". We support other models, like SDXL.

This also isn't "running" the project, or an LLM, in a sense that I would expect. It is doing model preparation, similar to the mlc_llm convert_weight tool + mode described at https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight or the trtllm-build command at https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#compile-the-model-into-a-tensorrt-engine.

Member

Brainstorming more name choices...

Use the project name as the tool name then have a subcommand for each task:

  • shark-ai compile_llm
  • shark-ai serve

Put -build in the name, like trtllm:

  • shark-ai-build compile_llm
  • shark-ai-build compile_sdxl

I'm not sure if we should try to fit everything that we support into a neat box like those, or if we should embrace having more bespoke pipelines and servers, at least for our current split between LLMs (llama) and image generators (SDXL, Flux).

If we pick a sufficiently specific tool name, then we can wiggle out of it later. If we pick a generic name like shark-ai then we should continue to use that across future releases. I definitely want to aim for something that we include as a console script that gets installed as part of the Python packages.

run_shark_ai.sh Outdated
Comment on lines 255 to 261
main() {
if [[ $# -eq 0 ]]; then
print_help_string_and_exit 1
fi

parse_and_handle_args "$@"
check_valid
Member

I'm impressed that you wrote this all in Bash, but I'd much prefer for a user-facing tool to be written in Python to be more portable and easier to maintain. We can even provide a console_script for a Python script so installing the Python packages installs that script as if it was a binary on the user's PATH: https://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point. That's how the IREE tools are distributed, fyi: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/iree/compiler/tools/scripts/iree_compile/__main__.py
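For example, the packaging entry could look something like this (a sketch only; the shark-import-llm command name and shark_ai.tools.import_llm module path are placeholders, not decided):

from setuptools import setup

setup(
    name="shark-ai",
    # Sketch: expose the tool as a console script so installing the Python
    # package puts a `shark-import-llm` executable on the user's PATH.
    # The names below are placeholders, not the actual packaging layout.
    entry_points={
        "console_scripts": [
            "shark-import-llm = shark_ai.tools.import_llm:main",
        ],
    },
)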

We can then use argparse instead of this bash parsing/validation too.

With a Python script we could even go a few steps further and use IREE's compiler API and the sharktank.examples.export_paged_llm_v1 library code from Python, making the script self-contained and thus more robust, rather than jumping out of process to individual tools and scripts. I do like the Unix philosophy, though, of having a collection of tools that each do one thing well.

Author

@ScottTodd thanks for the review! The philosophy behind choosing Bash for the script was that, since we are combining so many different tools, a Bash script would essentially just act as glue between the steps.

In fact, the first option was indeed Python, and I do agree it would have been a more convenient choice. I can port the script to Python too, no worries.

run_shark_ai.sh Outdated (resolved)
run_shark_ai.sh Outdated (resolved)
run_shark_ai.sh Outdated
Comment on lines 223 to 232
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hip-legacy-sync=false \
--iree-hal-memoization=true \
--iree-opt-strip-assertions \
Member

Lots of non-default flags here, just for tp8 mode? If these are necessary then I'd put them in a flagfile or well-documented variable (much easier when writing this in Python instead of Bash)
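(In Python that could be as simple as a documented module-level constant — a sketch using a subset of the flags above; the exact flag set is still up for discussion:)

# Non-default flags currently used when compiling for tp8 benchmarking.
# Each entry should get a short note on why it is needed.
TP8_COMPILE_FLAGS = [
    "--iree-dispatch-creation-enable-aggressive-fusion=true",
    "--iree-global-opt-propagate-transposes=true",
    "--iree-opt-data-tiling=false",
    "--iree-opt-strip-assertions",
]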

Author

These flags are very much open to discussion. They are what we currently use when compiling for tp8 benchmarking. I am sure some of these can be removed.

run_shark_ai.sh Outdated
Comment on lines 215 to 222
--iree-hal-target-device=hip[0] \
--iree-hal-target-device=hip[1] \
--iree-hal-target-device=hip[2] \
--iree-hal-target-device=hip[3] \
--iree-hal-target-device=hip[4] \
--iree-hal-target-device=hip[5] \
--iree-hal-target-device=hip[6] \
--iree-hal-target-device=hip[7] \
Member

(out of scope for this PR)

We may want to add a meta flag or some special parsing for hip[0,7] to condense this.
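(For instance, once this is in Python, the per-device flags could be derived from a single device-count argument — a rough sketch, the helper name is just illustrative:)

def hip_device_flags(device_count: int) -> list[str]:
    # Expand a count into --iree-hal-target-device=hip[0] .. hip[N-1] flags,
    # e.g. hip_device_flags(8) reproduces the eight flags quoted above.
    return [f"--iree-hal-target-device=hip[{i}]" for i in range(device_count)]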

run_shark_ai.sh Outdated (resolved)
run_shark_ai.sh Outdated
Some options are required to be specified.
-h prints the help string. The same output is emitted when no arguments are given.
-v enables verbose logging. When not specified, only errors are logged.
-w the location of the GGUF/IRPA file(s) that contain the parameters. (required)
Member

Some LLM serving projects include weight downloading from huggingface or other model repositories as part of their scripting, at least with a thin wrapper around huggingface-cli. See for example https://docs.vllm.ai/en/latest/getting_started/quickstart.html

Other projects like https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html and https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight are explicit about you needing to download the files yourself ahead of time.

This can be a stretch goal for such a script/tool, but I do want it at least considered during the initial design discussions and architecture planning. For user-facing tools, we want the happy/golden path to be as simple and easy to follow as possible. Many users will (at least at first) just want to try an off-the-shelf model, and if we can download that for them then there is lower risk of them getting a model that won't work well or having some configuration issue.

I would want our user guide to talk about huggingface model repositories using their canonical paths like facebook/opt-125m and meta-llama/Meta-Llama-3.1-8B-Instruct, not our shorthand names (lacking context) like llama3_8B_fp16. Even just explaining "for example, to run the model from https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, do X Y and Z" would help put these options into context. Any scripting/tooling we build should work in concert with the user guide. The tools should be self-documenting so the user guide doesn't need to be referenced constantly.

In a Python script, I'd maybe have a mutually exclusive argparse group (https://docs.python.org/3/library/argparse.html#mutual-exclusion) that lets you choose between an already downloaded/local weight file and a huggingface repository name. Then you could either --model-hf=meta-llama/Meta-Llama-3.1-8B-Instruct or --model-weights=/path/to/llama_8b_instruct_f16.gguf.
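A minimal sketch of that group (the option names are just the ones floated above, nothing settled):

import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# Exactly one model source must be given: a Hugging Face repo or a local file.
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument(
    "--model-hf",
    help="Hugging Face repository name, e.g. meta-llama/Llama-3.1-8B-Instruct",
)
source.add_argument(
    "--model-weights",
    type=Path,
    help="path to an already downloaded GGUF/IRPA weight file",
)
args = parser.parse_args()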

Author

I thought of this too, but @kumardeepakamd's POV was that the weight file should be handled by the user however they like and not be part of the script in any way, so the idea was dropped.

run_shark_ai.sh Outdated
Some options are required to be specified.
-h prints the help string. The same output is emitted when no arguments are given.
-v enables verbose logging. When not specified, only errors are logged.
-w the location of the GGUF/IRPA file(s) that contain the parameters. (required)
Member

I also don't see a tokenizer config mentioned in here. That's not needed for the sharktank model export process, but it is needed for serving via shortfin. If this script intends to connect model downloading/exporting/compiling and serving, that should be included somehow.

Author

Yes. The script does not handle running shortfin (yet). Again, the idea was that the user might have a different workflow than what the script enforces, so that was not included.

@vinayakdsci (Author)

@ScottTodd thanks a lot for the review. I completely agree with your views on moving to Python, so I am closing this PR in favor of a new one that implements the script in Python.

@vinayakdsci vinayakdsci closed this Jan 3, 2025
@ScottTodd (Member)

@ScottTodd thanks a lot for the review. I completely agree with your views on moving to Python, so I am closing this PR in favor of a new one that implements the script in Python.

Why close this PR and create a new one? The review history and GitHub cross-references (linking to #402 and #691) are all useful to keep here. Splitting to a new PR fragments the discussion. You can rename the PR to better reflect what it is about (e.g. "Introduce new tool for LLM import and compilation"), without being overly specific about one detail of the implementation (a single file written in Bash).

@vinayakdsci vinayakdsci reopened this Jan 3, 2025
@vinayakdsci vinayakdsci changed the title from "Implement bash script to unify user-end commands" to "Implement script to cleanup user-facing commands" on Jan 3, 2025
@vinayakdsci (Author)

@ScottTodd re-opened the PR.

@vinayakdsci vinayakdsci requested a review from ScottTodd January 3, 2025 16:48
Comment on lines +19 to +43
class CliParser(ArgumentParser):
    def print_help(self, file=None) -> None:
        if file is None:
            file = sys.stdout

        help_text = """usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
export_dir the directory where the exported artifacts will be saved.

options:
-h, --help show this help message and exit
-v, --verbose set logging level to INFO. The default logging level is WARNING.
-c, --compile compile the exported model as part of the pipeline. Default is FALSE.
-w, --weight-file WEIGHT_FILE the location of the GGUF/IRPA file(s) that contain the parameters.
-e, --export export the model in tp1 mode.
-a, --artifact-dir ARTIFACT_DIR the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
-s, --shard shard the weight file in tp8 mode and export to MLIR.
-b, --batch-sizes BATCH_SIZES batch sizes for export. Multiple batch sizes should be separated by a ','.
-i, --ir IR location for the MLIR to be compiled, if compilation is done independently.
-p, --tensor-parallel {1,8} tensor parallel size. Used for independent compilation. Defaults to 1.
"""
        _ = file.write(help_text + "\n")
Member

ArgumentParser should provide most of this help text for you, then you won't need to duplicate options and help text in two different parts of the file. Is there a particular reason you opted to format all this yourself and use RawTextHelpFormatter instead of the standard format?

Author

@ScottTodd the reason to have a separate help string is that the default format is actually quite unclean and difficult to read. It repeats the options for each flag (the abbreviation and the long form) and then wraps the help string for that flag onto the next line. Since this is supposed to be user-facing, IMO this is an important part of the UX.

Member

Explicit help text:

λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir                        the directory where the exported artifacts will be saved.

options:
  -h, --help                        show this help message and exit
  -v, --verbose                     set logging level to INFO. The default logging level is WARNING.

  -c, --compile                     compile the exported model as part of the pipeline. Default is FALSE.
  -w, --weight-file WEIGHT_FILE     the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export                      export the model in tp1 mode.
  -a, --artifact-dir ARTIFACT_DIR   the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
  -s, --shard                       shard the weight file in tp8 mode and export to MLIR.
  -b, --batch-sizes BATCH_SIZES     batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i, --ir IR                       location for the MLIR to be compiled, if compilation is done independently.
  -p, --tensor-parallel {1,8}       tensor parallel size. Used for independent compilation. Defaults to 1.

Default help text (line width 100):

λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s]
                           [-b BATCH_SIZES] [-i IR] [-p {1,8}]
                           export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir            the directory where the exported artifacts will be saved.

options:
  -h, --help            show this help message and exit
  -v, --verbose         set logging level to INFO. The default logging level is WARNING.
  -c, --compile         compile the exported model as part of the pipeline. Default is FALSE.
  -w WEIGHT_FILE, --weight-file WEIGHT_FILE
                        the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export          export the model in tp1 mode.
  -a ARTIFACT_DIR, --artifact-dir ARTIFACT_DIR
                        the location where the artifacts (sharded weights) should be saved.
                        Defaults to EXPORT_DIR/artifacts/
  -s, --shard           shard the weight file in tp8 mode and export to MLIR.
  -b BATCH_SIZES, --batch-sizes BATCH_SIZES
                        batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i IR, --ir IR        location for the MLIR to be compiled, if compilation is done
                        independently.
  -p {1,8}, --tensor-parallel {1,8}
                        tensor parallel size (required for independent compilation).

I prefer the default line wrapping behavior, especially at line width 80, since it left pads the wrapped lines. I see your point about the repetition, but sticking to standard argparse behavior will make the tool output more predictable and simplify maintenance.
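If the main sticking point is the wrapped option lines, a possible middle ground (not something in the current PR) is to keep argparse's generated help but widen the column where help strings start:

import argparse
import functools

parser = argparse.ArgumentParser(
    description="Utility script to combine shark-ai tools",
    # Widen the help column so most options fit on a single line while
    # keeping argparse's own wrapping and padding behavior.
    formatter_class=functools.partial(argparse.HelpFormatter, max_help_position=36),
)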

@@ -0,0 +1,412 @@
import logging
Member

Please add copyright header comments to source files.

Comment on lines +85 to +88
if not self.shard and not self.export and not self.ir_path and self.compile:
    logger.error(
        "To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
    )
Member

I'm finding this condition confusing to read. Maybe reorder to match the text.

Suggested change
if not self.shard and not self.export and not self.ir_path and self.compile:
logger.error(
"To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
)
if self.compile and (not self.shard and not self.export and not self.ir_path):
logger.error(
"To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
)

Actually, see https://stackoverflow.com/questions/19414060/argparse-required-argument-y-if-x-is-present. That has the same style and a tip about using parser.error() instead of logger and exit.
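(Roughly, using parser.error() the check above could become the following — a sketch against the flag names in this PR, assuming the validation runs where the parser object is available:)

if args.compile and not (args.shard or args.export or args.ir):
    # parser.error() prints the message to stderr and exits with status 2.
    parser.error(
        "To run compilation, either TP1 export (-e) or TP8 export (-s) must be "
        "specified, or a path to IR must be passed in (-i)"
    )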

)
exit(1)

if not os.path.isfile(self.weight_loc):
Member

weight_loc is already a pathlib.Path object, so you can use exists() or is_file()

https://docs.python.org/3/library/pathlib.html#pathlib.Path.exists
https://docs.python.org/3/library/pathlib.html#pathlib.Path.is_file

Suggested change
if not os.path.isfile(self.weight_loc):
if not self.weight_loc.is_file():

I generally default to Pathlib and try to cut down on any uses of os.path or os. Pathlib is much more modern.

# The argparser should not print the usage when an error occurs.
# We handle that ourselves.
parser = CliParser(
prog="export_and_serve.py",
Member

A note on serving: plan on there being checkpoint steps after

  1. Downloading weights
  2. Compiling the model (I'd group the export from PyTorch to .mlir and the .mlir to .vmfb compilation in this, from a user's perspective)
  3. Serving the model

A user may want to serve the model on multiple machines, not just the machine that was used to compile. If we can package the artifacts for serving (program.vmfb, weights_shard1.irpa, tokenizer_config.json) in some organized way then we can help users push those artifacts and start the server.

See for example how https://llm.mlc.ai/docs/compilation/compile_models.html is structured:

  • There is a single mlc_llm tool with multiple modes: compile, convert_weight, gen_config
  • The running/serving step points at the directory of outputs

We could do something similar here using sub-commands: https://docs.python.org/3/library/argparse.html#sub-commands
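A rough sketch of what that could look like with argparse sub-commands (command and option names are placeholders):

import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="shark-ai")
subparsers = parser.add_subparsers(dest="command", required=True)

# "compile" would group export-from-PyTorch and .mlir -> .vmfb compilation.
compile_cmd = subparsers.add_parser("compile", help="export and compile an LLM")
compile_cmd.add_argument("--weight-file", type=Path, required=True)
compile_cmd.add_argument("output_dir", type=Path)

# "serve" would point at the directory of packaged outputs.
serve_cmd = subparsers.add_parser("serve", help="serve a previously compiled output directory")
serve_cmd.add_argument("artifact_dir", type=Path)

args = parser.parse_args()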

Comment on lines +237 to +252
cmd = [
    "iree-compile",
    f"{self.mlir_path}",
    f"-o={tp1_vmfb_path}",
]

# TODO(vinayakdsci): Add a flag to support backends other than rocm.
cmd += [f"--iree-hal-target-backends={self.hal_target_backend}"]

# TODO(vinayakdsci): Add a flag to support targets other than gfx942.
cmd += [f"--iree-hip-target={self.hip_target}"]

compile_subp = subprocess.run(
    cmd,
    capture_output=True,
)
Member

We may want this to use the IREE compiler API or the iree.build API (https://iree-python-api.readthedocs.io/en/latest/compiler/build.html) instead of using a subprocess to call iree-compile.

Then:

  • we wouldn't need to check for iree-compile on the PATH
  • developers wanting to bring their own iree-compile build would need to build the python bindings
  • we'd get the benefits of iree.build for compiling multiple programs in a build graph (as needed... LLMs don't need that as much as SDXL though)
  • we'd get integrated error handling instead of checking return codes and sys.stderr
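For reference, an in-process version might look roughly like this (assuming the iree.compiler Python bindings are installed; the target flags are carried over from the script, and the helper name is just illustrative):

import iree.compiler as ireec

def compile_llm_mlir(mlir_path, vmfb_path, hip_target="gfx942"):
    # Sketch: compile the exported MLIR in-process instead of shelling out
    # to iree-compile; failures surface as Python exceptions.
    ireec.compile_file(
        str(mlir_path),
        target_backends=["rocm"],
        extra_args=[f"--iree-hip-target={hip_target}"],
        output_file=str(vmfb_path),
    )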

)

def _compile_tp8(self):
logger.info("Compiling sharded IR")
Member

May also want to include a time estimate here, or ideally a progress bar (iree-org/iree#14369)
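(Even a simple elapsed-time log around this step would help set expectations — a sketch:)

import logging
import time

logger = logging.getLogger(__name__)

start = time.monotonic()
# ... run the tp8 compile step here ...
logger.info("Compiled sharded IR in %.1f seconds", time.monotonic() - start)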

artifacts_dir = (
    args.artifact_dir
    if args.artifact_dir is not None
    else Path(os.path.join(args.export_dir, "artifacts/"))
Member

pathlib

Suggested change
else Path(os.path.join(args.export_dir, "artifacts/"))
else args.export_dir / "artifacts"

Comment on lines +173 to +176
shard_file = os.path.join(
    str(self.artifacts_dir),
    str(self.weight_loc).split("/")[-1][:-5] + "_tp8.irpa",
)
Member

use pathlib here. Join with / and replace the [-1][:-5] with some of the functions from https://docs.python.org/3/library/pathlib.html#corresponding-tools (.stem(), .parent(), etc.)
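(i.e. something along the lines of this sketch, assuming artifacts_dir and weight_loc are Path objects as above:)

# .stem drops the ".gguf"/".irpa" suffix that the [:-5] slice was removing.
shard_file = self.artifacts_dir / f"{self.weight_loc.stem}_tp8.irpa"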

@ScottTodd (Member) left a comment

Thanks for rewriting in Python! This looks much easier to maintain and I see no portability issues. Comments are split between Python style and overall user experience details. Some of this we can iterate on after landing an initial version, but I want to get a solid foundation in place before we advertise this to users.
