We will demonstrate (1) setting up a local LLM and (2) configuring a FastAPI server that exposes it to the public internet for inference via port forwarding.
- Deploy LLMs locally with FastAPI
- Enable remote access via port forwarding
- Handle multiple simultaneous queries
Download your model from Hugging Face. For example, download `llava-v1.6-mistral-7b.Q4_K_M.gguf` and save it in the `models` folder under your project directory `local_llm`.
Alternatively, you can download the model programmatically with `huggingface_hub`:
from huggingface_hub import hf_hub_download
model_name = "cjpais/llava-1.6-mistral-7b-gguf"
model_file = "llava-v1.6-mistral-7b.Q4_K_M.gguf"
HF_TOKEN = "your_huggingface_token" # Replace with your Hugging Face access token
model_path = hf_hub_download(model_name, filename=model_file, local_dir='models/', token=HF_TOKEN)
print("Model path:", model_path)
When installing Visual Studio (or the Visual Studio Build Tools), make sure the following components are selected:
- C++ CMake tools for Windows
- C++ core features
- Windows 10/11 SDK
Download and install CUDA Toolkit 11.7. Verify the installation with `nvcc --version` and `nvidia-smi`.
Check if CUDA is enabled in PyTorch:
import torch
# Install PyTorch with CUDA 11.7 support if necessary:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('Current CUDA device:', torch.cuda.current_device())
    print('Number of GPUs:', torch.cuda.device_count())
    print('CUDA device name:', torch.cuda.get_device_name(0))
Check which BLAS backend NumPy was built against:
import numpy as np

# Show NumPy's build configuration (BLAS/LAPACK backends)
np.__config__.show()

try:
    # numpy.distutils is deprecated and unavailable on newer Python/NumPy versions
    from numpy.distutils.system_info import get_info
    print(get_info('blas_opt'))
except ImportError as e:
    print("Error importing numpy.distutils.system_info:", e)
Type `nvidia-smi` in the VS Code terminal. Example output:
- Model: NVIDIA GeForce RTX 3070
- Driver Mode: WDDM
- PCI Address: 00000000:01:00.0
- Power State: Off
- Temperature: 49°C
- Power Usage: 29W / 135W (Max)
- Total Memory: 8192 MiB (8 GB)
- Used Memory: 0 MiB
- No running processes found
The output of `nvcc --version` should look like:
nvcc: NVIDIA (R) Cuda compiler driver
Built on Tue_May__3_19:00:59_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
In the VS Code terminal, run:
set LLAMA_CUBLAS=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
python -m pip install llama-cpp-python==0.2.7 --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu117
Reference: private-gpt issue 1242
Run "simple_interface.py". Ensure BLAS=1 appears in the output to confirm GPU acceleration. In experiments, inference time reduced from ~70 seconds to ~1-3 seconds due to GPU acceleration.
If you need to load a model larger than your VRAM, use these techniques (a combined sketch follows this list):
- Reduce Model Size: Use a smaller model (e.g., 7B instead of 13B).
- Optimize Model Loading: Offload fewer layers to the GPU and keep the rest in CPU RAM:
llm = Llama(..., n_gpu_layers=20, ...)
- Use Quantized Models: e.g., a 4-bit quantization such as Q4_K_M.
- Use Mixed Precision: Keep the KV cache in half precision (the f16_kv option in llama-cpp-python 0.2.x):
llm = Llama(..., f16_kv=True)
- Optimize Batch Size and Generation Length: Lower the prompt-processing batch size and cap the number of generated tokens:
llm = Llama(..., n_batch=256)
output = llm(..., max_tokens=512)
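As a rough sketch of how these options fit together (the parameter values are illustrative, not tuned recommendations):

```python
from llama_cpp import Llama

# Fit a model that is larger than VRAM by combining the techniques above.
llm = Llama(
    model_path="models/llava-v1.6-mistral-7b.Q4_K_M.gguf",  # 4-bit quantized model
    n_ctx=2048,        # context window; larger contexts need more memory
    n_gpu_layers=20,   # offload only part of the layers; the rest stay in CPU RAM
    n_batch=256,       # smaller prompt-processing batches reduce peak VRAM usage
    f16_kv=True,       # half-precision KV cache (default in llama-cpp-python 0.2.x)
)

# Cap the number of tokens generated per request.
output = llm("Summarize the benefits of quantization.", max_tokens=512)
print(output["choices"][0]["text"])
```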
Different models require different prompting templates. For detailed guidance on mistral-7b prompting, refer to the Prompting Guide.
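For example (our illustration, not taken from the Prompting Guide), Mistral-instruct-style models expect the user turn wrapped in [INST] ... [/INST] tags; a hypothetical helper might look like this:

```python
# Hypothetical helper; adapt to the exact template your model expects.
def build_mistral_prompt(user_message: str, system_message: str = "") -> str:
    # Mistral-instruct models wrap the user turn in [INST] ... [/INST].
    # A system message, if any, is commonly prepended inside the first [INST] block.
    # The BOS token (<s>) is added automatically by llama.cpp's tokenizer.
    instruction = f"{system_message}\n\n{user_message}" if system_message else user_message
    return f"[INST] {instruction} [/INST]"

# `llm` is the Llama object loaded earlier.
prompt = build_mistral_prompt("List three advantages of running an LLM locally.")
output = llm(prompt, max_tokens=256, stop=["</s>"])
print(output["choices"][0]["text"])
```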
- Run "fastapi_app.py" & follow the instructions in the comments to install grok (it also support Kubernetes), to set up port forwarding from grok's public url to your local port 8000.
If you find this tutorial helpful, please consider citing this GitHub page:
@misc{local-llm-inference,
author = {Henry Luan},
title = {A Tutorial on deploying a local LLM: A FastAPI server to expose a local LLM to the public internet},
year = {2024},
howpublished = {\url{https://github.com/casualcomputer/local_llm}},
}