This document describes how to run the main experiments in the SOSP '24 paper Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving.
We have tested Apparate on CPU nodes on Cloudlab.
Artifact evaluators: Hello! In theory, our source code should be runnable on any Linux machine. If you want access to Cloudlab, please send your public key to ruipan@princeton.edu and we will set up a Cloudlab node for you to reproduce our experiments.
For ease of reproduction, we use conda to create a virtual environment (Miniconda can be installed by following the instructions in this doc). We have prepared an `environment.yml` file that lists our dependencies and their versions. Once conda has been installed, create an environment from the `.yml` file by following the instructions in this doc:
```bash
mkdir apparate-ae; cd apparate-ae
# clone this repo
git clone https://github.com/dywsjtu/apparate.git
conda env create -f ./apparate/environment.yml
conda activate apparate_ae
```
For ease and efficiency of reproduction, we provide a simulator that replays request arrival traces on CPUs. The simulator implements all of the core logic in our system. Alongside the simulator, we also provide pickled data for the requests in our workloads. Because these pickle files are large, we compress them (~435M) and host them on Google Drive; they can be downloaded via `gdown`.
```bash
# install gdown to download the file
# NOTE: gdown is already included in the conda environment. The following only
# needs to be run if the gdown command hits a "Permission Denied" issue.
# See https://github.com/wkentaro/gdown/issues/43#issuecomment-621356443 for more details.
pip install -U --no-cache-dir gdown --pre

# download the tar file
gdown --fuzzy 'https://drive.google.com/file/d/1EN6ciNDBL2dEzSW4qdUTc9t4vOYkzWD8/view?usp=sharing'

# uncompress the tar file
tar -xzvf apparate-data.tar.gz && rm apparate-data.tar.gz

# create the directory for storing pickled experiment output
mkdir apparate_latency

# create the directory for storing output logs
cd apparate; mkdir logs
```
Once all dependencies have been set up, the directory should have the following structure:
```
--apparate-ae
    --apparate (this repo)
    --apparate_latency (empty, will be populated in the next step for plotting)
    --batch_decisions (downloaded from Google Drive and decompressed)
    --bootstrap_pickles (...)
    --optimal_latency
    --profile_pickles_bs
    --simulation_pickles
```
- `batch_decisions`: Contains the batching decisions of Clockwork using different models and request arrival traces.
  - Format: `{model_name}_1_fixed_30.pickle` for 30 FPS video traces (CV workloads), and `{model_name}_azure.pickle` for Microsoft Azure MAF traces (NLP workloads).
- `{bootstrap,simulation}_pickles`: Contains the confidence and accuracy of the bootstrapping/simulation dataset at all EE ramps.
  - Format: `{bootstrap,simulation}_{dataset}_{model_name}.pickle`. The pickled object `p` is a dict with two keys: `"conf"` and `"acc"`. The confidence/accuracy of sample `i` at ramp `r` can be accessed via `p["conf"/"acc"][r][i]` (see the sketch after this list).
- `optimal_latency`: Contains the per-sample optimal latency for different workloads.
  - Format: `{model_name}_{dataset}_optimal.pickle`. The pickled object is a list of floats, each denoting the queuing delay plus the optimal model inference latency of a request.
- `profile_pickles_bs`: Contains the operator-level latency profiles of different models at different batch sizes, all measured on an NVIDIA RTX A6000 GPU.
  - Format: `{model_name}_{batch_size}_profile.pickle` for vanilla models, and `{model_name}_{batch_size}_earlyexit_profile.pickle` for EE models.
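To make the pickle layouts concrete, here is a minimal Python sketch of loading and indexing these files. The file names below are hypothetical placeholders that follow the formats above; substitute the names of files that actually exist in your download.

```python
import pickle

# Hypothetical file name following {bootstrap,simulation}_{dataset}_{model_name}.pickle;
# substitute a file that exists in bootstrap_pickles/.
with open("bootstrap_pickles/bootstrap_urban_resnet50.pickle", "rb") as f:
    p = pickle.load(f)

r, i = 0, 0  # ramp index, sample index
print(p["conf"][r][i])  # confidence of sample i at ramp r
print(p["acc"][r][i])   # accuracy of sample i at ramp r

# Hypothetical file name following {model_name}_{dataset}_optimal.pickle.
with open("optimal_latency/resnet50_urban_optimal.pickle", "rb") as f:
    optimal = pickle.load(f)  # list of floats: queuing delay + optimal inference latency
print(len(optimal), min(optimal), max(optimal))
```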
First, `cd` into the apparate directory with the source code: `cd apparate`.
To reproduce the CV main results in Figs. 12 and 13, run `python run_cv.py` (takes ~10-30 minutes on a 32-core CPU). To reproduce the NLP main results in Fig. 14, run `python run_nlp.py` (takes ~xxx minutes on a 32-core CPU).
Aggregate results can be found in `output_{cv,nlp}.txt`, which are generated by executing the above scripts. The system log, which details how our system performs ramp adjustment and threshold tuning, can be found in `logs/output_{model_name}_{dataset}.log`.
Running the above scripts will also produce pickle files (`{model_name}_{dataset}_{arrival_trace}`) that contain detailed, per-request latencies in the `apparate_latency` directory created earlier.
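Before plotting, you can sanity-check one of these pickles directly. A minimal sketch, assuming the pickled object is a list of per-request latencies and using a hypothetical file name:

```python
import pickle
import statistics

# Hypothetical file name following {model_name}_{dataset}_{arrival_trace};
# substitute a file actually produced in ../apparate_latency.
with open("../apparate_latency/resnet50_urban_fixed_30", "rb") as f:
    latencies = pickle.load(f)

print(f"requests: {len(latencies)}")
print(f"median latency: {statistics.median(latencies):.2f}")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[-1]:.2f}")
```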
- Fig. 12: Run `plot_cv_results_median.py` to plot the median latency wins (%) compared to vanilla inference. The output figure is named `cv_results_median.pdf`.
- Fig. 13: Run `plot_cv_results_p95.py` to plot the tail latency (ms) of Apparate and vanilla inference. The output figure is named `cv_results_p95.pdf`.
- Fig. 14: Run `plot_nlp_results.py` to plot the latency CDF of NLP workloads with and without Apparate. The output figures are named `nlp_results_{model_name}.pdf`.