add generation time metrics #613

Merged
1 change: 1 addition & 0 deletions samples/CMakeLists.txt
@@ -10,6 +10,7 @@ add_subdirectory(cpp/greedy_causal_lm)
add_subdirectory(cpp/multinomial_causal_lm)
add_subdirectory(cpp/prompt_lookup_decoding_lm)
add_subdirectory(cpp/speculative_decoding_lm)
add_subdirectory(cpp/benchmark_genai)

install(FILES requirements.txt DESTINATION samples
COMPONENT cpp_samples_genai)
25 changes: 25 additions & 0 deletions samples/cpp/benchmark_genai/CMakeLists.txt
@@ -0,0 +1,25 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


find_package(OpenVINOGenAI REQUIRED PATHS
"${CMAKE_BINARY_DIR}" # Reuse the package from the build.
    ${OpenVINO_DIR} # GenAI may be installed alongside OpenVINO.
)

include(FetchContent)

FetchContent_Declare(cxxopts
URL https://github.com/jarro2783/cxxopts/archive/refs/tags/v3.1.1.tar.gz
URL_HASH SHA256=523175f792eb0ff04f9e653c90746c12655f10cb70f1d5e6d6d9491420298a08)
FetchContent_MakeAvailable(cxxopts)

add_executable(benchmark_genai benchmark_genai.cpp)
target_link_libraries(benchmark_genai PRIVATE openvino::genai cxxopts::cxxopts)
set_target_properties(benchmark_genai PROPERTIES
COMPILE_PDB_NAME benchmark_genai
# Ensure out of box LC_RPATH on macOS with SIP
INSTALL_RPATH_USE_LINK_PATH ON)
# target_compile_features(benchmark_genai PRIVATE cxx_std_11)
install(TARGETS benchmark_genai
RUNTIME DESTINATION samples_bin/
COMPONENT samples_bin
EXCLUDE_FROM_ALL)
47 changes: 47 additions & 0 deletions samples/cpp/benchmark_genai/README.md
@@ -0,0 +1,47 @@
# Benchmarking Vanilla GenAI

This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, text generation, and the calculation of various performance metrics.

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../requirements.txt](../../requirements.txt) for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```

## Usage

```sh
benchmark_genai [OPTIONS]
```

### Options

- `-m, --model`: Path to the model and tokenizers base directory.
- `-p, --prompt` (default: `"The Sky is blue because"`): The prompt to generate text.
- `-nw, --num_warmup` (default: `1`): Number of warmup iterations.
- `-mt, --max_new_tokens` (default: `20`): Maximal number of new tokens to generate.
- `-n, --num_iter` (default: `3`): Number of iterations.
- `-d, --device` (default: `"CPU"`): Device to run the model on.

### Output:

```
benchmark_genai -m TinyLlama-1.1B-Chat-v1.0 -n 10
```

```
Load time: 3405.69 ms
Generate time: 1430.77 ± 3.04 ms
Tokenization time: 0.51 ± 0.02 ms
Detokenization time: 0.37 ± 0.01 ms
TTFT: 81.60 ± 0.54 ms
TPOT: 71.52 ± 2.72 ms
Throughput tokens/s: 13.98 ± 0.53
```

For more information on how performance metrics are calculated, refer to the [performance metrics tutorial](../../../src/README.md#performance-metrics).
69 changes: 69 additions & 0 deletions samples/cpp/benchmark_genai/benchmark_genai.cpp

Collaborator:

Looks like an extended greedy sample. One option is to merge it with chat_sample because it already has a loop over prompts. Is there a reason to have it as a separate sample?

Please add tests: both sample tests and functional tests.

Contributor (author):

I just thought that the greedy sample should be as simple as possible, not complicated with additional arguments and outputs such as the number of warmup iterations, and so on.

We communicated personally; main takeaways:
We already have a separate sample for each case, e.g. multinomial decoding, even though it differs only by generation parameters. Instead of adding more samples, we should probably add features to the existing ones.

Still, benchmark_genai also allows benchmarking with multi-batch inputs, while chat_sample supports only batch size 1. Therefore we decided to keep it as a separate sample, at least in this PR.
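
For illustration, here is a minimal sketch of multi-batch benchmarking with the Python API (the model path and batch size are placeholders; the `perf_metrics` fields follow the samples in this PR):

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
config = ov_genai.GenerationConfig()
config.max_new_tokens = 20

# A batch is simply a list of prompts; here the same prompt is repeated 4 times.
prompts = ["The Sky is blue because"] * 4
res = pipe.generate(prompts, config)  # list input returns DecodedResults with perf_metrics
print(f"Throughput tokens/s: {res.perf_metrics.throughput.mean:.2f}")
```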

@@ -0,0 +1,69 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "openvino/genai/llm_pipeline.hpp"
#include <cxxopts.hpp>

int main(int argc, char* argv[]) try {
    cxxopts::Options options("benchmark_genai", "LLM generation benchmarking sample");

options.add_options()
("m,model", "Path to model and tokenizers base directory", cxxopts::value<std::string>()->default_value("."))
("p,prompt", "Prompt", cxxopts::value<std::string>()->default_value("The Sky is blue because"))
("nw,num_warmup", "Number of warmup iterations", cxxopts::value<size_t>()->default_value(std::to_string(1)))
("n,num_iter", "Number of iterations", cxxopts::value<size_t>()->default_value(std::to_string(3)))
("mt,max_new_tokens", "Maximal number of new tokens", cxxopts::value<size_t>()->default_value(std::to_string(20)))
("d,device", "device", cxxopts::value<std::string>()->default_value("CPU"))
("h,help", "Print usage");

cxxopts::ParseResult result;
try {
result = options.parse(argc, argv);
} catch (const cxxopts::exceptions::exception& e) {
std::cout << e.what() << "\n\n";
std::cout << options.help() << std::endl;
return EXIT_FAILURE;
}

if (result.count("help")) {
std::cout << options.help() << std::endl;
return EXIT_SUCCESS;
}

std::string prompt = result["prompt"].as<std::string>();
const std::string model_path = result["model"].as<std::string>();
std::string device = result["device"].as<std::string>();
size_t num_warmup = result["num_warmup"].as<size_t>();
size_t num_iter = result["num_iter"].as<size_t>();

ov::genai::GenerationConfig config;
config.max_new_tokens = result["max_new_tokens"].as<size_t>();

ov::genai::LLMPipeline pipe(model_path, device);

for (size_t i = 0; i < num_warmup; i++)
pipe.generate(prompt, config);

ov::genai::DecodedResults res = pipe.generate(prompt, config);
ov::genai::PerfMetrics metrics = res.perf_metrics;
for (size_t i = 0; i < num_iter - 1; i++) {
res = pipe.generate(prompt, config);
metrics = metrics + res.perf_metrics;
}

std::cout << "Load time: " << metrics.load_time << " ms" << std::endl;
std::cout << "Generate time: " << metrics.generate_duration.mean << " ± " << metrics.generate_duration.std << " ms" << std::endl;
std::cout << "Tokenization time: " << metrics.tokenization_duration.mean << " ± " << metrics.tokenization_duration.std << " ms" << std::endl;
std::cout << "Detokenization time: " << metrics.detokenization_duration.mean << " ± " << metrics.detokenization_duration.std << " ms" << std::endl;
std::cout << "ttft: " << metrics.ttft.mean << " ± " << metrics.ttft.std << " ms" << std::endl;
std::cout << "tpot: " << metrics.tpot.mean << " ± " << metrics.tpot.std << " ms " << std::endl;
std::cout << "Tokens/s: " << metrics.throughput.mean << " ± " << metrics.throughput.std << std::endl;

return 0;
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
} catch (...) {
std::cerr << "Non-exception object thrown\n";
return EXIT_FAILURE;
}
47 changes: 47 additions & 0 deletions samples/python/benchmark_genai/README.md
@@ -0,0 +1,47 @@
# Benchmarking Vanilla GenAI

This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, text generation, and the calculation of various performance metrics.

## Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

It's not required to install [../../requirements.txt](../../requirements.txt) for deployment if the model has already been exported.

```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```

## Usage

```sh
python benchmark_genai.py [OPTIONS]
```

### Options

- `-m, --model`: Path to the model and tokenizers base directory.
- `-p, --prompt` (default: `"The Sky is blue because"`): The prompt to generate text.
- `-nw, --num_warmup` (default: `1`): Number of warmup iterations.
- `-n, --num_iter` (default: `3`): Number of iterations.
- `-mt, --max_new_tokens` (default: `20`): Maximal number of new tokens to generate.
- `-d, --device` (default: `"CPU"`): Device to run the model on.

### Output:

```
python benchmark_genai.py -m TinyLlama-1.1B-Chat-v1.0 -n 10
```

```
Load time: 3405.69 ms
Generate time: 1430.77 ± 3.04 ms
Tokenization time: 0.51 ± 0.02 ms
Detokenization time: 0.37 ± 0.01 ms
TTFT: 81.60 ± 0.54 ms
TPOT: 71.52 ± 2.72 ms
Throughput tokens/s: 13.98 ± 0.53
```

For more information on how performance metrics are calculated, see [performance metrics readme](../../../src/README.md#performance-metrics).
49 changes: 49 additions & 0 deletions samples/python/benchmark_genai/benchmark_genai.py
@@ -0,0 +1,49 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import openvino_genai as ov_genai

def main():
    parser = argparse.ArgumentParser(description="LLM generation benchmarking sample")
parser.add_argument("-m", "--model", type=str, help="Path to model and tokenizers base directory")
parser.add_argument("-p", "--prompt", type=str, default="The Sky is blue because", help="Prompt")
parser.add_argument("-nw", "--num_warmup", type=int, default=1, help="Number of warmup iterations")
parser.add_argument("-n", "--num_iter", type=int, default=2, help="Number of iterations")
parser.add_argument("-mt", "--max_new_tokens", type=int, default=20, help="Maximal number of new tokens")
parser.add_argument("-d", "--device", type=str, default="CPU", help="Device")

args = parser.parse_args()

    # Performance metrics are stored in DecodedResults.
    # To get DecodedResults instead of a string, the input must be a list.
prompt = [args.prompt]
model_path = args.model
device = args.device
num_warmup = args.num_warmup
num_iter = args.num_iter

config = ov_genai.GenerationConfig()
config.max_new_tokens = args.max_new_tokens

pipe = ov_genai.LLMPipeline(model_path, device)

for _ in range(num_warmup):
pipe.generate(prompt, config)

res = pipe.generate(prompt, config)
perf_metrics = res.perf_metrics
for _ in range(num_iter - 1):
res = pipe.generate(prompt, config)
perf_metrics += res.perf_metrics

print(f"Load time: {perf_metrics.load_time:.2f} ms")
print(f"Generate time: {perf_metrics.generate_duration.mean:.2f} ± {perf_metrics.generate_duration.std:.2f} ms")
print(f"Tokenization time: {perf_metrics.tokenization_duration.mean:.2f} ± {perf_metrics.tokenization_duration.std:.2f} ms")
print(f"Detokenization time: {perf_metrics.detokenization_duration.mean:.2f} ± {perf_metrics.detokenization_duration.std:.2f} ms")
print(f"TTFT: {perf_metrics.ttft.mean:.2f} ± {perf_metrics.ttft.std:.2f} ms")
print(f"TPOT: {perf_metrics.tpot.mean:.2f} ± {perf_metrics.tpot.std:.2f} ms")
print(f"Throughput tokens/s: {perf_metrics.throughput.mean:.2f} ± {perf_metrics.throughput.std:.2f}")

if __name__ == "__main__":
main()
62 changes: 62 additions & 0 deletions samples/python/benchmark_genai/benchmark_genai_automatic.py
@@ -0,0 +1,62 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import pandas as pd
import openvino_genai as ov_genai

def main():
    parser = argparse.ArgumentParser(description="LLM generation benchmarking sample")
parser.add_argument("-m", "--model", type=str, help="Path to model and tokenizers base directory")
parser.add_argument("-p", "--prompt", type=str, default="The Sky is blue because", help="Prompt")
parser.add_argument("-nw", "--num_warmup", type=int, default=1, help="Number of warmup iterations")
parser.add_argument("-n", "--num_iter", type=int, default=5, help="Number of iterations")
parser.add_argument("-mt", "--max_new_tokens", type=int, default=20, help="Maximal number of new tokens")
parser.add_argument("-d", "--device", type=str, default="CPU", help="Device")

args = parser.parse_args()

    # Performance metrics are stored in DecodedResults.
    # To get DecodedResults instead of a string, the input must be a list.

model_path = args.model
device = args.device
num_warmup = args.num_warmup
num_iter = args.num_iter

config = ov_genai.GenerationConfig()
    config.max_new_tokens = args.max_new_tokens
# config.num_beam_groups = 3
# config.num_beams = 15

pipe = ov_genai.LLMPipeline(model_path, device)

    metrics_df = pd.DataFrame(columns=['batch_size', 'throughput', 'ttft', 'tpot', 'std_throughput', 'std_ttft', 'std_tpot'])

batch_sizes = [1, 2, 4, 16, 32, 64, 256]
for batch_size in batch_sizes:
prompt = [args.prompt] * batch_size
for _ in range(num_warmup):
pipe.generate(prompt, config)

        res = pipe.generate(prompt, config)
        metrics = res.perf_metrics
        for _ in range(num_iter - 1):
            res = pipe.generate(prompt, config)
            metrics += res.perf_metrics
        metrics_df = pd.concat([metrics_df, pd.DataFrame([{
            'batch_size': batch_size,
            'throughput': metrics.mean_throughput,
            'ttft': metrics.mean_ttft,
            'tpot': metrics.mean_tpot,
            'std_throughput': metrics.std_throughput,
            'std_ttft': metrics.std_ttft,
            'std_tpot': metrics.std_tpot,
        }])], ignore_index=True)

metrics_df.to_csv('metrics.csv', index=False)

if __name__ == "__main__":
main()
49 changes: 49 additions & 0 deletions src/README.md
@@ -196,6 +196,55 @@ int main(int argc, char* argv[]) {
}
```

### Performance Metrics

Collaborator:

When it gets merged, please open another PR adding it to the C++ and Python docstrings.

Contributor (author):

Done #713


`ov.genai.PerfMetrics` (referred to as `PerfMetrics` for simplicity) is a structure that holds performance metrics for each generate call. `PerfMetrics` holds fields with the mean and standard deviation for the following metrics:
- `ttft`
- `tpot`
- `load_time`
- `generate_duration`
- `tokenization_duration`
- `detokenization_duration`
- `throughput`

and:
- `num_generated_tokens`
- `num_input_tokens`

Performance metrics are stored in the `perf_metrics` field of either `DecodedResults` or `EncodedResults`. In addition to the fields mentioned above, `PerfMetrics` has a member `raw_metrics` of type `ov.genai.RawPerfMetrics` (referred to as `RawPerfMetrics` for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as medians or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on `PerfMetrics`.

```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
res = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = res.perf_metrics
print(f'generate_duration: {perf_metrics.mean_generate_duration:.2f}')
print(f'ttft: {perf_metrics.mean_ttft:.2f}')
print(f'tpot: {perf_metrics.mean_tpot:.2f}')
```
output:
```sh
generate_duration: 76.28
ttft: 42.58
tpot: 3.80
```

>**Note**: If the input prompt is just a string, the generate function will return only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
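
A quick sketch of the difference (`model_path` is a placeholder, as in the examples above):

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Plain string input returns a plain string with no perf_metrics attached.
text = pipe.generate("The Sun is yellow because", max_new_tokens=20)

# List input returns DecodedResults, which carries perf_metrics.
res = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = res.perf_metrics
```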

Several `perf_metrics` objects can be added to each other. In that case, `raw_metrics` are concatenated and the mean/std values are recalculated. This simplifies benchmarking and accumulating statistics across several generate calls.

```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
res_2 = pipe.generate(["Why Sky is blue because"], max_new_tokens=20)
perf_metrics = res_1.perf_metrics + res_2.perf_metrics

print(f'generate_duration: {perf_metrics.mean_generate_duration:.2f}')
print(f'ttft: {perf_metrics.mean_ttft:.2f}')
print(f'tpot: {perf_metrics.mean_tpot:.2f}')
```
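
If mean and standard deviation are not enough, `raw_metrics` exposes the underlying measurements so you can compute your own statistics. Below is a rough sketch; the field name `m_durations` and its unit (microseconds per generation step) are assumptions made for illustration, so check the `RawPerfMetrics` definition for the exact names and units.

```python
import numpy as np
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")
res = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
raw_metrics = res.perf_metrics.raw_metrics

# Assumed: m_durations holds the raw duration of each generation step in microseconds.
durations_ms = np.array(raw_metrics.m_durations) / 1000
print(f"Median time per generation step: {np.median(durations_ms):.2f} ms")
print(f"p95 time per generation step: {np.percentile(durations_ms, 95):.2f} ms")
```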

## How It Works

For information on how OpenVINO™ GenAI works, refer to the [How It Works Section](https://github.com/openvinotoolkit/openvino.genai/tree/releases/2024/2/src/docs/HOW_IT_WORKS.md).