Implement script to cleanup user-facing commands #733
base: main
Conversation
The function for compilation still needs an implementation.
Force-pushed e67c9d0 to 2f1bf47 (compare).
Compile is now implemented.
Force-pushed 2598cf7 to bf6ac52 (compare).
Nice start! It definitely helps already seeing each step in sequence as part of a script, and this would immediately help with the number of environment variables and steps in the user guide at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md.
My largest piece of feedback is about which language the script is written in: prefer Python, given how many options this has, what sorts of other libraries and tools it interoperates with, and how we will want to distribute this to users in our packages.
Beyond that, my other suggestions are about finer user experience points. We could get by without addressing some of those if we scope and name the tool appropriately, like with mlc_llm convert_weight and trtllm-build. Something like "shark ai importllm" (with some punctuation) maybe?
run_shark_ai.sh (Outdated)
Let's find a better location and name for this script, rather than run_shark_ai.sh at the repository root.
Location
- If this connects sharktank to shortfin then it could sit under https://github.com/nod-ai/shark-ai/tree/main/shark-ai.
- If this just handles model import and compilation and not serving then it could be part of https://github.com/nod-ai/shark-ai/tree/main/sharktank
- Some projects put core scripts at the top level, like https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py. I don't think we should do that quite yet though, given the other considerations here
Naming
As this script is specific to LLMs (it wraps export_paged_llm_v1, among other things), it should have "LLM" in the name, or at least not be overly general to all of "shark_ai". We support other models, like SDXL.
This also isn't "running" the project, or an LLM, in a sense that I would expect. It is doing model preparation, similar to the mlc_llm convert_weight tool + mode described at https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight or the trtllm-build command at https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#compile-the-model-into-a-tensorrt-engine.
Brainstorming more name choices...
Use the project name as the tool name then have a subcommand for each task:
shark-ai compile_llm
shark-ai serve
Put -build in the name, like trtllm:
shark-ai-build compile_llm
shark-ai-build compile_sdxl
I'm not sure if we should try to fit everything that we support into a neat box like those, or if we should embrace having more bespoke pipelines and servers, at least for our current split between LLMs (llama) and image generators (SDXL, Flux).
If we pick a sufficiently specific tool name, then we can wiggle out of it later. If we pick a generic name like shark-ai then we should continue to use that across future releases. I definitely want to aim for something that we include as a console script that gets installed as part of the Python packages.
run_shark_ai.sh (Outdated)
main() {
    if [[ $# -eq 0 ]]; then
        print_help_string_and_exit 1
    fi

    parse_and_handle_args "$@"
    check_valid
I'm impressed that you wrote this all in Bash, but I'd much prefer for a user-facing tool to be written in Python to be more portable and easier to maintain. We can even provide a console_script for a Python script so installing the Python packages installs that script as if it was a binary on the user's PATH: https://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point. That's how the IREE tools are distributed, fyi: https://github.com/iree-org/iree/blob/main/compiler/bindings/python/iree/compiler/tools/scripts/iree_compile/__main__.py
We can then use argparse instead of this bash parsing/validation too.
With a Python script we could even go a few steps further and use IREE's compiler API and the sharktank.examples.export_paged_llm_v1 library code from Python, making the script self contained and thus more robust, rather than jumping out of process to individual tools and scripts. I do like the unix tools philosophy though of having a collection of tools that each do one thing well.
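For reference, a minimal sketch of the console_scripts wiring (the module and command names here are placeholders, not the actual package layout):

```python
# Hypothetical setup.py excerpt: exposes a Python main() as a command on PATH
# when the package is installed.
from setuptools import find_packages, setup

setup(
    name="shark-ai",  # assumed distribution name, for illustration only
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # "<command the user types> = <module path>:<function>"
            "shark-ai-build = shark_ai.tools.build:main",  # placeholder module path
        ],
    },
)
```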
@ScottTodd thanks for the review! The philosophy behind choosing bash for the script was that, since we are combining so many different tools, a bash script would essentially do little beyond acting as glue between the steps.
In fact, the first option was indeed Python, and I do agree it would have been a more convenient choice. I can port the script to Python too, no worries.
run_shark_ai.sh (Outdated)
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hip-legacy-sync=false \
--iree-hal-memoization=true \
--iree-opt-strip-assertions \
Lots of non-default flags here, just for tp8 mode? If these are necessary then I'd put them in a flagfile or well-documented variable (much easier when writing this in Python instead of Bash)
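For illustration, a sketch of what a "well-documented variable" could look like on the Python side; the flag values are copied from the diff above, and which ones are actually required is still the open question:

```python
# Flags currently used when compiling llama for tp8 benchmarking.
# TODO: document why each one is needed, or drop it.
TP8_COMPILE_FLAGS = [
    "--iree-dispatch-creation-enable-aggressive-fusion=true",
    "--iree-global-opt-propagate-transposes=true",
    "--iree-opt-aggressively-propagate-transposes=true",
    "--iree-opt-data-tiling=false",
    # Shell quoting around the pass pipeline is unnecessary when passing argv
    # directly to subprocess (no shell involved).
    "--iree-preprocessing-pass-pipeline="
    "builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))",
    # ... remaining flags from the current script ...
    "--iree-opt-strip-assertions",
]
```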
These flags are very much open to discussion. These are what we currently use for compiling for benchmarking tp8 mode. I am sure some of these can be removed.
run_shark_ai.sh (Outdated)
--iree-hal-target-device=hip[0] \
--iree-hal-target-device=hip[1] \
--iree-hal-target-device=hip[2] \
--iree-hal-target-device=hip[3] \
--iree-hal-target-device=hip[4] \
--iree-hal-target-device=hip[5] \
--iree-hal-target-device=hip[6] \
--iree-hal-target-device=hip[7] \
(out of scope for this PR)
We may want to add a meta flag or some special parsing for hip[0,7] to condense this.
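If the script ends up in Python, one way to condense it on the script side (a hypothetical helper, not a new iree-compile flag):

```python
# Hypothetical helper: expand a device count into the per-device HAL target flags,
# so tp8 becomes hip_target_device_flags(8) instead of eight hand-written lines.
def hip_target_device_flags(device_count: int) -> list[str]:
    return [f"--iree-hal-target-device=hip[{i}]" for i in range(device_count)]
```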
run_shark_ai.sh (Outdated)
Some options are required to be specified.
-h    prints the help string. The same output is emitted when no arguments are given.
-v    enables verbose logging. When not specified, only errors are logged.
-w    the location of the GGUF/IRPA file(s) that contain the parameters. (required)
Some LLM serving projects include weight downloading from huggingface or other model repositories as part of their scripting, at least with a thin wrapper around huggingface-cli. See for example https://docs.vllm.ai/en/latest/getting_started/quickstart.html
Other projects like https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html and https://llm.mlc.ai/docs/compilation/compile_models.html#clone-from-hf-and-convert-weight are explicit about you needing to download the files yourself ahead of time.
This can be a stretch goal for such a script/tool, but I do want it at least considered during the initial design discussions and architecture planning. For user-facing tools, we want the happy/golden path to be as simple and easy to follow as possible. Many users will (at least at first) just want to try an off-the-shelf model, and if we can download that for them then there is lower risk of them getting a model that won't work well or having some configuration issue.
I would want our user guide to talk about huggingface model repositories using their canonical paths like facebook/opt-125m and meta-llama/Meta-Llama-3.1-8B-Instruct, not our shorthand names (lacking context) like llama3_8B_fp16. Even just explaining "for example, to run the model from https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, do X Y and Z" would help put these options into context. Any scripting/tooling we build should work in concert with the user guide. The tools should be self-documenting so the user guide doesn't need to be referenced constantly.
In a Python script, I'd maybe have a mutually exclusive argparse group (https://docs.python.org/3/library/argparse.html#mutual-exclusion) that lets you choose between an already downloaded/local weight file and a huggingface repository name. Then you could either --model-hf=meta-llama/Meta-Llama-3.1-8B-Instruct or --model-weights=/path/to/llama_8b_instruct_f16.gguf.
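A rough sketch of that argparse shape (option names are illustrative, not final):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
source = parser.add_mutually_exclusive_group(required=True)
# A Hugging Face repository to fetch (the download itself could be a thin
# wrapper around huggingface_hub or huggingface-cli)...
source.add_argument("--model-hf", metavar="REPO_ID",
                    help="e.g. meta-llama/Meta-Llama-3.1-8B-Instruct")
# ...or a weight file the user already has locally.
source.add_argument("--model-weights", type=Path, metavar="FILE",
                    help="e.g. /path/to/llama_8b_instruct_f16.gguf")

args = parser.parse_args()
```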
I thought of this as well, but @kumardeepakamd's POV was that the weight file should be handled by the user however they would like, and not be a part of the script in any way, so the idea was dropped.
run_shark_ai.sh (Outdated)
Some options are required to be specified.
-h    prints the help string. The same output is emitted when no arguments are given.
-v    enables verbose logging. When not specified, only errors are logged.
-w    the location of the GGUF/IRPA file(s) that contain the parameters. (required)
I also don't see a tokenizer config mentioned in here. That's not needed for the sharktank model export process, but it is needed for serving via shortfin. If this script intends to connect model downloading/exporting/compiling and serving, that should be included somehow.
Yes. The script does not handle running shortfin (yet). Again, the idea was that the user might have a different workflow than what the script enforces, so that was not included.
@ScottTodd thanks a lot for the review. I completely agree with your views on moving to Python, so I am closing this PR in favor of a new one that implements the script in Python.
Why close this PR and create a new one? The review history and GitHub cross-references (linking to #402 and #691) are all useful to keep here. Splitting to a new PR fragments the discussion. You can rename the PR to better reflect what it is about (e.g. "Introduce new tool for LLM import and compilation"), without being overly specific about one detail of the implementation (a single file written in Bash).
@ScottTodd re-opened the PR.
class CliParser(ArgumentParser):
    def print_help(self, file=None) -> None:
        if file is None:
            file = sys.stdout

        help_text = """usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir                       the directory where the exported artifacts will be saved.

options:
  -h, --help                       show this help message and exit
  -v, --verbose                    set logging level to INFO. The default logging level is WARNING.
  -c, --compile                    compile the exported model as part of the pipeline. Default is FALSE.
  -w, --weight-file WEIGHT_FILE    the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export                     export the model in tp1 mode.
  -a, --artifact-dir ARTIFACT_DIR  the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
  -s, --shard                      shard the weight file in tp8 mode and export to MLIR.
  -b, --batch-sizes BATCH_SIZES    batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i, --ir IR                      location for the MLIR to be compiled, if compilation is done independently.
  -p, --tensor-parallel {1,8}      tensor parallel size. Used for independent compilation. Defaults to 1.
"""
        _ = file.write(help_text + "\n")
ArgumentParser should provide most of this help text for you, then you won't need to duplicate options and help text in two different parts of the file. Is there a particular reason you opted to format all this yourself and use RawTextHelpFormatter instead of the standard format?
@ScottTodd the reason to have a separate help string is that the default format is actually quite unclean and difficult to read. It repeats options against each flag (the abbreviation and the long form) and then wraps the help string for that flag onto the next line. Since this is supposed to be user-facing, IMO this would be an important part of the UX.
Explicit help text:
λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s] [-b BATCH_SIZES] [-i IR] [-p {1,8}] export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir                       the directory where the exported artifacts will be saved.

options:
  -h, --help                       show this help message and exit
  -v, --verbose                    set logging level to INFO. The default logging level is WARNING.
  -c, --compile                    compile the exported model as part of the pipeline. Default is FALSE.
  -w, --weight-file WEIGHT_FILE    the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export                     export the model in tp1 mode.
  -a, --artifact-dir ARTIFACT_DIR  the location where the artifacts (sharded weights) should be saved. Defaults to EXPORT_DIR/artifacts/
  -s, --shard                      shard the weight file in tp8 mode and export to MLIR.
  -b, --batch-sizes BATCH_SIZES    batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i, --ir IR                      location for the MLIR to be compiled, if compilation is done independently.
  -p, --tensor-parallel {1,8}      tensor parallel size. Used for independent compilation. Defaults to 1.
Default help text (line width 100):
λ python .\shark-ai\export_and_serve.py --help
usage: export_and_serve.py [-h] [-v] [-c] -w WEIGHT_FILE [-e] [-a ARTIFACT_DIR] [-s]
                           [-b BATCH_SIZES] [-i IR] [-p {1,8}]
                           export_dir

Utility script to combine shark-ai tools

positional arguments:
  export_dir            the directory where the exported artifacts will be saved.

options:
  -h, --help            show this help message and exit
  -v, --verbose         set logging level to INFO. The default logging level is WARNING.
  -c, --compile         compile the exported model as part of the pipeline. Default is FALSE.
  -w WEIGHT_FILE, --weight-file WEIGHT_FILE
                        the location of the GGUF/IRPA file(s) that contain the parameters.
  -e, --export          export the model in tp1 mode.
  -a ARTIFACT_DIR, --artifact-dir ARTIFACT_DIR
                        the location where the artifacts (sharded weights) should be saved.
                        Defaults to EXPORT_DIR/artifacts/
  -s, --shard           shard the weight file in tp8 mode and export to MLIR.
  -b BATCH_SIZES, --batch-sizes BATCH_SIZES
                        batch sizes for export. Multiple batch sizes should be separated by a ','.
  -i IR, --ir IR        location for the MLIR to be compiled, if compilation is done
                        independently.
  -p {1,8}, --tensor-parallel {1,8}
                        tensor parallel size (required for independent compilation).
I prefer the default line wrapping behavior, especially at line width 80, since it left pads the wrapped lines. I see your point about the repetition, but sticking to standard argparse behavior will make the tool output more predictable and simplify maintenance.
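If the repetition and wrapping are the main concerns, one possible middle ground (a sketch, not a requirement) is to keep the argparse-generated help but widen its help column:

```python
import argparse

# Sketch: keep argparse-generated help (no duplicated text to maintain) while
# giving option descriptions more room before wrapping. max_help_position and
# width are HelpFormatter constructor parameters; they are not part of the
# documented public API, so treat this as a pragmatic workaround.
parser = argparse.ArgumentParser(
    prog="export_and_serve.py",
    description="Utility script to combine shark-ai tools",
    formatter_class=lambda prog: argparse.HelpFormatter(
        prog, max_help_position=40, width=100
    ),
)
```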
@@ -0,0 +1,412 @@
import logging
Please add copyright header comments to source files.
if not self.shard and not self.export and not self.ir_path and self.compile:
    logger.error(
        "To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
    )
I'm finding this condition confusing to read. Maybe reorder to match the text.
Suggested change:
-        if not self.shard and not self.export and not self.ir_path and self.compile:
+        if self.compile and (not self.shard and not self.export and not self.ir_path):
             logger.error(
                 "To run compilation, either TP1 export (-e) or TP8 export (-s) must be specified, or path to IR must be passed in"
             )
Actually, see https://stackoverflow.com/questions/19414060/argparse-required-argument-y-if-x-is-present. That has the same style and a tip about using parser.error() instead of logger and exit.
    )
    exit(1)

if not os.path.isfile(self.weight_loc):
weight_loc is already a pathlib.Path object, so you can use exists() or is_file():
https://docs.python.org/3/library/pathlib.html#pathlib.Path.exists
https://docs.python.org/3/library/pathlib.html#pathlib.Path.is_file

Suggested change:
- if not os.path.isfile(self.weight_loc):
+ if not self.weight_loc.is_file():

I generally default to pathlib and try to cut down on any uses of os.path or os. Pathlib is much more modern.
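For reference, a small sketch of the pathlib equivalents of the os.path calls in question (the path value is illustrative):

```python
from pathlib import Path

weight_loc = Path("/models/llama_8b_instruct_f16.gguf")  # illustrative path

weight_loc.exists()              # os.path.exists(weight_loc)
weight_loc.is_file()             # os.path.isfile(weight_loc)
weight_loc.parent / "artifacts"  # os.path.join(os.path.dirname(weight_loc), "artifacts")
```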
# The argparser should not print the usage when an error occurs.
# We handle that ourselves.
parser = CliParser(
    prog="export_and_serve.py",
A note on serving: plan on there being checkpoint steps after each of:
- Downloading weights
- Compiling the model (I'd group the export from PyTorch to .mlir and the .mlir to .vmfb compilation in this, from a user's perspective)
- Serving the model
A user may want to serve the model on multiple machines, not just the machine that was used to compile. If we can package the artifacts for serving (program.vmfb, weights_shard1.irpa, tokenizer_config.json) in some organized way then we can help users push those artifacts and start the server.
See for example how https://llm.mlc.ai/docs/compilation/compile_models.html is structured:
- There is a single mlc_llm tool with multiple modes: compile, convert_weight, gen_config
- The running/serving step points at the directory of outputs
We could do something similar here using sub-commands: https://docs.python.org/3/library/argparse.html#sub-commands
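A rough sketch of that sub-command shape (the command and option names are just the ideas brainstormed above, not a decision):

```python
import argparse
from pathlib import Path

# One top-level tool with a sub-command per task, mirroring the mlc_llm layout.
parser = argparse.ArgumentParser(prog="shark-ai")
subparsers = parser.add_subparsers(dest="command", required=True)

compile_cmd = subparsers.add_parser("compile", help="export and compile a model")
compile_cmd.add_argument("--weight-file", type=Path, required=True)
compile_cmd.add_argument("--output-dir", type=Path, required=True,
                         help="directory collecting .vmfb, sharded .irpa, tokenizer config")

serve_cmd = subparsers.add_parser("serve", help="serve artifacts produced by 'compile'")
serve_cmd.add_argument("--artifacts-dir", type=Path, required=True)

args = parser.parse_args()
```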
cmd = [
    "iree-compile",
    f"{self.mlir_path}",
    f"-o={tp1_vmfb_path}",
]

# TODO(vinayakdsci): Add a flag to support backends other than rocm.
cmd += [f"--iree-hal-target-backends={self.hal_target_backend}"]

# TODO(vinayakdsci): Add a flag to support targets other than gfx942.
cmd += [f"--iree-hip-target={self.hip_target}"]

compile_subp = subprocess.run(
    cmd,
    capture_output=True,
)
We may want this to use the IREE compiler API or the iree.build API (https://iree-python-api.readthedocs.io/en/latest/compiler/build.html) instead of using a subprocess to call iree-compile.
Then:
- we wouldn't need to check for iree-compile on the PATH
- developers wanting to bring their own iree-compile build would need to build the python bindings
- we'd get the benefits of iree.build for compiling multiple programs in a build graph (as needed... LLMs don't need that as much as SDXL though)
- we'd get integrated error handling instead of checking return codes and sys.stderr
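A sketch of the in-process variant, assuming the iree.compiler package's compile_file() helper; the keyword arguments are assumptions based on that API and should be verified against the installed version:

```python
# Sketch, not a drop-in replacement: compile exported MLIR in-process via the
# IREE compiler Python package instead of shelling out to iree-compile.
from pathlib import Path

from iree import compiler as ireec  # provided by the IREE compiler Python package


def compile_tp1(mlir_path: Path, vmfb_path: Path, hip_target: str = "gfx942") -> None:
    # target_backends / extra_args / output_file are assumed keyword arguments;
    # "rocm" mirrors the current --iree-hal-target-backends value in the script.
    ireec.compile_file(
        str(mlir_path),
        target_backends=["rocm"],
        extra_args=[f"--iree-hip-target={hip_target}"],
        output_file=str(vmfb_path),
    )
```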
)

def _compile_tp8(self):
    logger.info("Compiling sharded IR")
May also want to include a time estimate here, or ideally a progress bar (iree-org/iree#14369)
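Until the compiler exposes real progress reporting, even coarse wall-clock logging around the compile call would help set expectations; a small sketch:

```python
import logging
import time

logger = logging.getLogger(__name__)

# Sketch: log how long the sharded compile takes, as a stopgap for a progress bar.
logger.info("Compiling sharded IR (this can take a while)...")
start = time.monotonic()
# ... invoke iree-compile or the compiler API here ...
logger.info("Sharded compile finished in %.1f seconds", time.monotonic() - start)
```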
artifacts_dir = (
    args.artifact_dir
    if args.artifact_dir is not None
    else Path(os.path.join(args.export_dir, "artifacts/"))
pathlib:
Suggested change:
- else Path(os.path.join(args.export_dir, "artifacts/"))
+ else args.export_dir / "artifacts"
shard_file = os.path.join(
    str(self.artifacts_dir),
    str(self.weight_loc).split("/")[-1][:-5] + "_tp8.irpa",
)
Use pathlib here. Join with / and replace the [-1][:-5] with some of the tools from https://docs.python.org/3/library/pathlib.html#corresponding-tools (.stem, .parent, etc.)
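A sketch of that suggestion (the paths are illustrative; Path.stem drops the final suffix such as ".irpa" or ".gguf", which is what the [:-5] slice was doing):

```python
from pathlib import Path

# Sketch: pathlib equivalent of the os.path.join / split("/") / [:-5] chain above.
artifacts_dir = Path("exports/artifacts")                 # illustrative value
weight_loc = Path("/models/llama_8b_instruct_f16.irpa")   # illustrative value
shard_file = artifacts_dir / f"{weight_loc.stem}_tp8.irpa"
```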
Thanks for rewriting in Python! This looks much easier to maintain and I see no portability issues. Comments are split between Python style and overall user experience details. Some of this we can iterate on after landing an initial version, but I want to get a solid foundation before we advertise this to users.
Progress on #402 and #691.
Adds a Python script that will unify all the commands in llama_serving.md, providing a cleaner, single interface for the user to interact with.