This document describes how to run the main experiments in the SOSP '24 paper Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving.
We have tested Apparate on CPU nodes on Cloudlab.
Artifact evaluators: Hello! In theory, our source code should be runnable on any Linux machine. If you want access to Cloudlab, please send your public key to ruipan@princeton.edu and we will set up a Cloudlab node for you to reproduce our experiments.
For ease of reproduction, we use conda to create a virtual environment (Miniconda can be installed by following the instructions in this doc). We have prepared an `environment.yml` file that lists our dependencies and their versions. Once conda has been installed, create an environment from the `.yml` file by following the instructions in this doc:
```bash
mkdir apparate-ae; cd apparate-ae
# clone this repo
git clone https://github.com/dywsjtu/apparate.git
conda env create -f ./apparate/environment.yml
conda activate apparate_ae
```
For ease and efficiency of reproduction, we provide a simulator that replays request arrival traces on CPUs. The simulator implements all of the core logic in our system. Alongside the simulator, we also provide pickled data for the requests in our workloads. Because these pickle files are large, we compress them (~435M) and host them on Google Drive; they can be downloaded via `gdown`.
```bash
# install gdown to download the file
# NOTE: gdown is already included in the conda environment. The following only
# needs to be run if the gdown command hits a "Permission Denied" issue.
# See https://github.com/wkentaro/gdown/issues/43#issuecomment-621356443 for more details.
pip install -U --no-cache-dir gdown --pre

# download the tar file
gdown --fuzzy 'https://drive.google.com/file/d/1EN6ciNDBL2dEzSW4qdUTc9t4vOYkzWD8/view?usp=sharing'

# uncompress the tar file
tar -xzvf apparate-data.tar.gz && rm apparate-data.tar.gz

# create the directory for storing pickled experiment output
mkdir apparate_latency

# create the directory for storing output logs
cd apparate; mkdir logs
```
Once all dependencies have been set up, the directory should have the following structure:
```
--apparate-ae
    --apparate (this repo)
    --apparate_latency (empty, will be populated in the next step for plotting)
    --batch_decisions (downloaded from Google Drive and decompressed)
    --bootstrap_pickles (...)
    --optimal_latency
    --profile_pickles_bs
    --simulation_pickles
```
- `batch_decisions`: Contains the batching decisions of Clockwork using different models and request arrival traces.
  - Format: `{model_name}_1_fixed_30.pickle` for 30 FPS video traces (CV workloads), and `{model_name}_azure.pickle` for Microsoft Azure MAF traces (NLP workloads).
- `{bootstrap,simulation}_pickles`: Contains the confidence and accuracy of the bootstrapping/simulation dataset at all EE ramps.
  - Format: `{bootstrap,simulation}_{dataset}_{model_name}.pickle`. The pickled object `p` is a dict with two keys: `"conf"` and `"acc"`. The confidence/accuracy of sample `i` at ramp `r` can be accessed via `p["conf"/"acc"][r][i]` (see the sketch after this list).
- `optimal_latency`: Contains the per-sample optimal latency for different workloads.
  - Format: `{model_name}_{dataset}_optimal.pickle`. The pickled object is a list of floats, each denoting the queuing delay plus the optimal model inference latency of a request.
- `profile_pickles_bs`: Contains the operator-level latency profiles of different models at different batch sizes, all measured on an NVIDIA RTX A6000 GPU.
  - Format: `{model_name}_{batch_size}_profile.pickle` for vanilla models, and `{model_name}_{batch_size}_earlyexit_profile.pickle` for EE models.
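To make the pickle layouts concrete, here is a minimal Python sketch of loading and indexing these files. The file names below are hypothetical placeholders that follow the formats above; substitute the names of files that actually exist in your download.

```python
import pickle

# Hypothetical file name following {bootstrap,simulation}_{dataset}_{model_name}.pickle;
# substitute a file that exists in bootstrap_pickles/.
with open("bootstrap_pickles/bootstrap_urban_resnet50.pickle", "rb") as f:
    p = pickle.load(f)

r, i = 0, 0  # ramp index, sample index
print(p["conf"][r][i])  # confidence of sample i at ramp r
print(p["acc"][r][i])   # accuracy of sample i at ramp r

# Hypothetical file name following {model_name}_{dataset}_optimal.pickle.
with open("optimal_latency/resnet50_urban_optimal.pickle", "rb") as f:
    optimal = pickle.load(f)  # list of floats: queuing delay + optimal inference latency
print(len(optimal), min(optimal), max(optimal))
```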
First, `cd` into the apparate directory with the source code: `cd apparate`.
To reproduce the CV main results in Figs. 12 and 13, run `python run_cv.py` (takes ~10-30 minutes on a 32-core CPU). To reproduce the NLP main results in Fig. 14, run `python run_nlp.py` (takes ~xxx minutes on a 32-core CPU).
Aggregate results can be found in `output_{cv,nlp}.txt`, which are generated by executing the above scripts. The system log, which details how our system performs ramp adjustment and threshold tuning, can be found in `logs/output_{model_name}_{dataset}.log`.
Running the above scripts will also produce pickle files (`{model_name}_{dataset}_{arrival_trace}`) that contain detailed, per-request latencies in the `apparate_latency` directory created earlier.
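Before plotting, you can sanity-check one of these pickles directly. A minimal sketch, assuming the pickled object is a list of per-request latencies and using a hypothetical file name:

```python
import pickle
import statistics

# Hypothetical file name following {model_name}_{dataset}_{arrival_trace};
# substitute a file actually produced in ../apparate_latency.
with open("../apparate_latency/resnet50_urban_fixed_30", "rb") as f:
    latencies = pickle.load(f)

print(f"requests: {len(latencies)}")
print(f"median latency: {statistics.median(latencies):.2f}")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[-1]:.2f}")
```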
- Fig. 12: Run `plot_cv_results_median.py` to plot the median latency wins (%) compared to vanilla inference. The output figure is named `cv_results_median.pdf`.
- Fig. 13: Run `plot_cv_results_p95.py` to plot the tail latency (ms) of Apparate and vanilla inference. The output figure is named `cv_results_p95.pdf`.
- Fig. 14: Run `plot_nlp_results.py` to plot the latency CDF of NLP workloads with and without Apparate. The output figures are named `nlp_results_{model_name}.pdf`.