Get Started with Huawei Ascend (Atlas 800T A2)

The usage of lmdeploy on a Huawei Ascend device is almost the same as its usage on CUDA with PytorchEngine in lmdeploy. Please read the original Get Started guide before reading this tutorial.

Here is the supported model list.

Installation

We highly recommend that users build a Docker image for streamlined environment setup.

Git clone the source code of lmdeploy and the Dockerfile locates in the docker directory:

git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy

Environment Preparation

The Docker version is supposed to be no less than 18.03. And Ascend Docker Runtime should be installed by following the official guide.

Caution

If error message libascend_hal.so: cannot open shared object file shows, that means Ascend Docker Runtime is not installed correctly!

Ascend Drivers, Firmware and CANN

The target machine needs to install the Huawei driver and firmware version not lower than 23.0.3, refer to CANN Driver and Firmware Installation and download resources.

And the CANN (version 8.0.RC2.beta1) software packages should also be downloaded from Ascend Resource Download Center themselves. Make sure to place the Ascend-cann-kernels-910b*.run, Ascend-cann-nnal_*.run and Ascend-cann-toolkit*-aarch64.run under the root directory of lmdeploy source code

Build Docker Image

Run the following command in the root directory of lmdeploy to build the image:

DOCKER_BUILDKIT=1 docker build -t lmdeploy-aarch64-ascend:latest \
    -f docker/Dockerfile_aarch64_ascend .

The Dockerfile_aarch64_ascend is tested on Kunpeng CPU. For intel CPU, please try this dockerfile (which is not fully tested)

If the following command executes without any errors, it indicates that the environment setup is successful.

docker run -e ASCEND_VISIBLE_DEVICES=0 --rm --name lmdeploy -t lmdeploy-aarch64-ascend:latest lmdeploy check_env

For more information about running the Docker client on Ascend devices, please refer to the guide

Offline batch inference

Tip

Graph mode has been supported on Atlas 800T A2. Users can set eager_mode=False to enable graph mode, or, set eager_mode=True to disable graph mode. (Please source /usr/local/Ascend/nnal/atb/set_env.sh before enabling graph mode)

LLM inference

Set device_type="ascend" in the PytorchEngineConfig:

from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
if __name__ == "__main__":
    pipe = pipeline("internlm/internlm2_5-7b-chat",
                    backend_config=PytorchEngineConfig(tp=1, device_type="ascend", eager_mode=True))
    question = ["Shanghai is", "Please introduce China", "How are you?"]
    response = pipe(question)
    print(response)

VLM inference

Set device_type="ascend" in the PytorchEngineConfig:

from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image
if __name__ == "__main__":
    pipe = pipeline('OpenGVLab/InternVL2-2B',
                    backend_config=PytorchEngineConfig(tp=1, device_type='ascend', eager_mode=True))
    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response)

Online serving

Tip

Graph mode has been supported on Atlas 800T A2. Graph mode is default enabled in online serving. Users can add --eager-mode to disable graph mode. (Please source /usr/local/Ascend/nnal/atb/set_env.sh before enabling graph mode)

Serve a LLM model

Add --device ascend in the serve command.

lmdeploy serve api_server --backend pytorch --device ascend --eager-mode internlm/internlm2_5-7b-chat

Serve a VLM model

Add --device ascend in the serve command

lmdeploy serve api_server --backend pytorch --device ascend --eager-mode OpenGVLab/InternVL2-2B

Inference with Command line Interface

Add --device ascend in the serve command.

lmdeploy chat internlm/internlm2_5-7b-chat --backend pytorch --device ascend --eager-mode

Run the following commands to launch lmdeploy chatting after starting container:

docker exec -it lmdeploy_ascend_demo \
    bash -i -c "lmdeploy chat --backend pytorch --device ascend --eager-mode internlm/internlm2_5-7b-chat"

Quantization

w4a16 AWQ

Run the following commands to quantize weights on Atlas 800T A2.

lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

Please check supported_models before use this feature.

int8 KV-cache Quantization

Ascend backend has supported offline int8 KV-cache Quantization on eager mode.

Please refer this doc for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_started.md

get_started.md

Get Started with Huawei Ascend (Atlas 800T A2)

Installation

Environment Preparation

Ascend Drivers, Firmware and CANN

Build Docker Image

Offline batch inference

LLM inference

VLM inference

Online serving

Serve a LLM model

Serve a VLM model

Inference with Command line Interface

Quantization

w4a16 AWQ

int8 KV-cache Quantization

Files

get_started.md

Latest commit

History

get_started.md

File metadata and controls

Get Started with Huawei Ascend (Atlas 800T A2)

Installation

Environment Preparation

Ascend Drivers, Firmware and CANN

Build Docker Image

Offline batch inference

LLM inference

VLM inference

Online serving

Serve a LLM model

Serve a VLM model

Inference with Command line Interface

Quantization

w4a16 AWQ

int8 KV-cache Quantization