Update moe.md
haeggee authored Sep 6, 2024
1 parent f8e30b4 commit ac27ada
Showing 1 changed file with 19 additions and 26 deletions.
45 changes: 19 additions & 26 deletions moe.md
@@ -1,18 +1,23 @@
# MoE Env Setup

TL;DR: MoEs need megablocks, which depends on triton. Triton cannot be installed inside the Docker image because it requires a CUDA-capable GPU, which is not available in the build environment. Therefore, install triton from source inside a venv in the container, then install megablocks.
TL;DR: need to install megablocks for MoEs. just use the environment `/store/swissai/a06/containers/nanotron_moe/nanotron_moe.toml` :)

The setup is documented in that folder on the cluster and reproduced here:

```Dockerfile
FROM nvcr.io/nvidia/pytorch:24.04-py3
FROM nvcr.io/nvidia/pytorch:24.05-py3

ENV DEBIAN_FRONTEND=noninteractive
# setup
RUN apt-get update && apt-get install -y \
python3-pip \
python3-venv \
git tmux htop nvtop \
&& apt-get clean && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip setuptools==69.5.1

# Update flash-attn.
RUN pip install --upgrade --no-build-isolation flash-attn==2.5.8

# Install the rest of dependencies.
RUN pip install \
datasets \
@@ -23,14 +28,20 @@ RUN pip install \
numpy \
packaging \
safetensors \
sentencepiece \
tqdm

```
WORKDIR /workspace
RUN git clone https://github.com/swiss-ai/nanotron.git
WORKDIR /workspace/nanotron
RUN pip install -e .[nanosets]

RUN pip install megablocks==0.5.1 stanford-stk==0.7.1 --no-deps
```

After the image is built, create the env file `~/.edf/nanotron-moe.toml` with the following content (adapt the image path to wherever the image is stored):
The env file `nanotron-moe.toml` has the following content:
```
image = "/capstor/scratch/cscs/$USER/container-images/nanotron-moe/nanotron-moe-v1.0.sqsh"
image = "/store/swissai/a06/containers/nanotron_moe/nanotron_moe.sqsh"
mounts = ["/capstor", "/users", "/store"]
workdir = "/users/$USER/"
@@ -45,21 +56,3 @@ FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"
```

TODO: make image available on the cluster in /store



In a running container (`srun --reservation=todi --environment=nanotron-moe --container-workdir=$PWD --pty bash`):
```bash
cd $SCRATCH/$USER/nanotron-multilingual # or wherever you want the venv
mkdir multilingual-venv && cd multilingual-venv
python -m venv --system-site-packages ./moe-venv
source ./moe-venv/bin/activate
git clone https://github.com/triton-lang/triton.git
cd triton
pip install ninja cmake wheel  # build-time dependencies
pip install -e python
cd ..
pip install megablocks==0.5.1
```
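
Both versions install megablocks with pip, and the updated Dockerfile does so with `--no-deps`, so a quick check inside the container that the packages are actually visible to Python can be useful. The following is only a sketch and not part of the commit: `check_module` is a hypothetical helper, and the import names `megablocks` and `stk` (for `stanford-stk`) are assumptions. It only locates the modules and does not import them, so it does not need a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): returns 0 if a top-level
# Python module can be located, without importing it.
check_module() {
    python3 - "$1" <<'PY'
import importlib.util
import sys

# find_spec returns None when a top-level module cannot be found.
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) is not None else 1)
PY
}

# Assumed import names: `megablocks` for megablocks, `stk` for stanford-stk.
for mod in megablocks stk; do
    if check_module "$mod"; then
        echo "ok: $mod"
    else
        echo "missing: $mod"
    fi
done
```

Run it from a shell started in the container; a `missing:` line suggests the image or venv does not match the setup above.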
