add generation time metrics #613

Merged

pavel-esir
Contributor

@pavel-esir pavel-esir commented Jul 12, 2024

  • Added performance metrics and updated the README with a description of how to use them
  • Added C++ and Python samples for benchmarking

Sample to calculate and visualize performance metrics.

import openvino_genai as ov_genai
import tqdm
import pandas as pd
import matplotlib.pylab as pl

pipe = ov_genai.LLMPipeline('TinyLlama-1.1B-Chat-v1.0/')
config = ov_genai.GenerationConfig(max_new_tokens=15)
rows = []  # one dict per batch size; converted to a DataFrame after the loop

num_iter = 3
for batch_size in tqdm.tqdm([1, 2, 4, 16, 32, 64, 128]):
    prompts = ["The Sky is blue because"] * batch_size
    res = pipe.generate(prompts, config)
    metrics = res.perf_metrics

    # Accumulate metrics over the remaining iterations.
    for _ in range(num_iter - 1):
        res = pipe.generate(prompts, config)
        metrics += res.perf_metrics
    rows.append({
        'batch_size': batch_size,
        'throughput': metrics.get_throughput().mean, 'ttft': metrics.get_ttft().mean, 'tpot': metrics.get_tpot().mean,
        'std_throughput': metrics.get_throughput().std, 'std_ttft': metrics.get_ttft().std, 'std_tpot': metrics.get_tpot().std,
    })

# Avoid the private DataFrame._append: build the frame once from the collected rows.
metrics_df = pd.DataFrame(rows)

fig, axes = pl.subplots(nrows=3, ncols=1, figsize=(6, 8), sharex=True)

axes[0].plot(metrics_df['batch_size'], metrics_df['throughput'], '-o')
axes[1].plot(metrics_df['batch_size'], metrics_df['ttft'], '-o')
axes[2].plot(metrics_df['batch_size'], metrics_df['tpot'], '-o')

for ax, label in zip(axes, ['Throughput', 'TTFT', 'TPOT']):
    ax.set_ylabel(label)
    ax.grid(True)
axes[2].set_xlabel('Batch Size')
pl.tight_layout()

ticket: CVS-132859
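
For readers unfamiliar with the column names in the sample, here is a minimal sketch of how TTFT (time to first token), TPOT (time per output token), and throughput could be derived from per-token timestamps. It assumes the usual definitions of these metrics; `TokenTimings` and `summarize` are hypothetical names for illustration and are not part of the library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary struct, not the library's PerfMetrics.
struct TokenTimings {
    float ttft_ms = 0.0f;         // time to first token
    float tpot_ms = 0.0f;         // mean time per output token after the first
    float throughput_tps = 0.0f;  // tokens per second, derived from tpot_ms
};

// token_times_ms[i] is the elapsed time in ms from the start of the request
// to the moment token i was produced (assumed to be non-decreasing).
TokenTimings summarize(const std::vector<float>& token_times_ms) {
    TokenTimings t;
    if (token_times_ms.empty()) {
        return t;
    }
    t.ttft_ms = token_times_ms.front();
    if (token_times_ms.size() > 1) {
        const float decode_span = token_times_ms.back() - token_times_ms.front();
        t.tpot_ms = decode_span / static_cast<float>(token_times_ms.size() - 1);
        if (t.tpot_ms > 0.0f) {
            t.throughput_tps = 1000.0f / t.tpot_ms;
        }
    }
    return t;
}
```

With timestamps of 50, 60, 70 and 80 ms, this gives a TTFT of 50 ms, a TPOT of 10 ms, and a throughput of 100 tokens per second.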

@pavel-esir pavel-esir changed the base branch from master to releases/2024/3 July 22, 2024 11:06
@pavel-esir pavel-esir force-pushed the add_perf_counters branch 2 times, most recently from 4d4942e to c680bb2 Compare July 22, 2024 11:10
mzegla and others added 3 commits July 22, 2024 13:17
…oop for greedy sampling (openvinotoolkit#607)

Searching for max element in a custom loop gives better performance than
using std::max_element
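
As an illustration of what that commit message describes, a hand-written argmax loop over logits might look like the sketch below. The function name `argmax` and the `std::vector<float>` input are assumptions for this example; the actual code in the referenced commit may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the largest logit, found with a plain loop
// instead of std::max_element. Returns 0 for an empty input.
std::size_t argmax(const std::vector<float>& logits) {
    std::size_t best_idx = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        // Strict comparison keeps the first occurrence on ties.
        if (logits[i] > logits[best_idx]) {
            best_idx = i;
        }
    }
    return best_idx;
}
```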
namespace genai {

float PerfMetrics::get_duration_ms(std::chrono::steady_clock::duration duration) {
return std::chrono::duration_cast<std::chrono::milliseconds>(duration).count();
Collaborator

@Wovchena Wovchena Jul 22, 2024

Don't cast the duration until you really need the value (for printing). Return the duration itself; that ensures the best accuracy. When you divide a duration, change its representation to float or double. For example, https://github.com/openvinotoolkit/openvino/blob/ffc135cb1240831411799bdb82ecac352c956f22/samples/cpp/benchmark/throughput_benchmark/main.cpp#L19. But your implementation needs an extra step: when the mean is computed, keep using the source units (most likely nanoseconds, but that's unspecified, so you can't rely on it) with a float or double representation. Cast the duration to Ms or any other suitable unit only to call count() and print.
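
A minimal sketch of that suggestion, assuming raw samples are kept as `steady_clock::duration` values: the mean is computed in the clock's native period with a `double` representation, and the conversion to milliseconds happens only when a number is needed. The aliases and the `mean_duration` and `to_ms` helpers are hypothetical names, not code from this PR.

```cpp
#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;
// Same period as the clock's native duration, but with a floating-point representation.
using FloatDuration = std::chrono::duration<double, Clock::duration::period>;
using Ms = std::chrono::duration<double, std::milli>;

// Mean of the raw durations, still expressed in the clock's native unit.
FloatDuration mean_duration(const std::vector<Clock::duration>& samples) {
    if (samples.empty()) {
        return FloatDuration::zero();
    }
    FloatDuration sum = FloatDuration::zero();
    for (const auto& sample : samples) {
        sum += sample;  // integer duration converts implicitly to the floating one
    }
    return sum / static_cast<double>(samples.size());
}

// Convert to milliseconds only at the point where a plain number is needed.
double to_ms(FloatDuration d) {
    return Ms(d).count();
}
```

Because the period is taken from `Clock::duration`, the sketch does not depend on the clock actually ticking in nanoseconds.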

Contributor Author

Done. Durations are now stored as microsecond chrono::duration<float, std::ratio<1, 1000000>> for better accuracy. If I store them in ms, tokenization/detokenization times can sometimes come out as 0; with microseconds they don't.

I convert them only when mean/std are calculated.

Collaborator

You shouldn't store duration as any of the specified units. Store whatever time_point_one - time_point_zero returns. There's also a cast to int representation in constructors. For example

    auto start_time = std::chrono::steady_clock::now();
    m_pimpl = std::make_unique<StatefulLLMPipeline>(request, tokenizer, generation_config);
    auto stop_time = std::chrono::steady_clock::now();
    m_pimpl->m_load_time_ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time - start_time).count();   

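This is bad in two ways: the first is described above, and the second is the int representation used in the cast. The cast performs an integer division and discards the remainder (a division by 1,000,000 when going from nanoseconds to milliseconds, since the default unit is usually nanoseconds).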
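
To make the truncation concrete, here is a small sketch contrasting an integer `duration_cast` with a floating-point millisecond duration. The function names are made up for the example, and it takes `std::chrono::nanoseconds` as input purely for illustration.

```cpp
#include <chrono>

using FloatMs = std::chrono::duration<float, std::milli>;

// Integer cast: anything below one millisecond is discarded.
long long truncated_ms(std::chrono::nanoseconds elapsed) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// Floating-point representation: sub-millisecond precision survives,
// and no duration_cast is needed for the conversion.
float fractional_ms(std::chrono::nanoseconds elapsed) {
    return FloatMs(elapsed).count();
}
```

For an elapsed time of 250 microseconds, the first function returns 0, while the second returns roughly 0.25.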

@pavel-esir pavel-esir marked this pull request as ready for review July 23, 2024 20:53
@pavel-esir pavel-esir requested a review from Wovchena July 23, 2024 20:53
@pavel-esir pavel-esir marked this pull request as draft July 23, 2024 20:54
@pavel-esir pavel-esir marked this pull request as ready for review July 23, 2024 21:00
@pavel-esir
Contributor Author

The PR is now final. @Wovchena, please take a look.

@Wovchena
Collaborator

It got a conflict

@pavel-esir
Contributor Author

> It got a conflict

Resolved. The metrics match llm_bench's numbers. I will open a separate PR to switch to native counters.

@@ -196,6 +196,55 @@ int main(int argc, char* argv[]) {
}
```

### Performance Metrics
Collaborator

Once this is merged, please open another PR adding it to the C++ and Python docstrings.

Contributor Author

Done #713

res.num_generated_tokens = num_generated_tokens + right.num_generated_tokens;
res.num_input_tokens = num_input_tokens + right.num_input_tokens;
res.load_time = load_time;
res.evaluate_statistics();
Collaborator

evaluate_statistics() is called on every +. Given that this happens inside a benchmarking loop and most of the results are thrown away, it's worth providing a getter or a standalone function to do that job.

Contributor Author

Added getters. To read a value, the user now calls perf_metrics.get_tokenization_duration().mean. If the statistics are fresh and already evaluated, the getter returns the stored values; if they are not fresh, the getter calls evaluate_statistics() first.
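
The lazy-evaluation pattern described in this thread can be sketched roughly as follows. `LazyMetrics`, `MeanStd`, `add_sample`, and the `evaluations()` counter are hypothetical names used only to show the idea (merge raw samples in `operator+`, recompute statistics only when a getter finds them stale); this is not the PR's actual `PerfMetrics` class.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct MeanStd {
    float mean = 0.0f;
    float std = 0.0f;
};

// Hypothetical stand-in for a metrics object, for illustration only.
class LazyMetrics {
public:
    void add_sample(float duration_us) {
        m_samples.push_back(duration_us);
        m_fresh = false;  // cached statistics are now stale
    }

    // Merging only concatenates raw samples; no statistics are computed here.
    LazyMetrics operator+(const LazyMetrics& right) const {
        LazyMetrics res = *this;
        res.m_samples.insert(res.m_samples.end(), right.m_samples.begin(), right.m_samples.end());
        res.m_fresh = false;
        return res;
    }

    // The getter recomputes only when the cached values are stale.
    MeanStd get_duration() {
        if (!m_fresh) {
            evaluate_statistics();
        }
        return m_stats;
    }

    // How many times statistics were actually computed (for demonstration).
    int evaluations() const { return m_evaluations; }

private:
    void evaluate_statistics() {
        ++m_evaluations;
        m_stats = MeanStd{};
        if (!m_samples.empty()) {
            float sum = 0.0f;
            float sum_sq = 0.0f;
            for (float s : m_samples) {
                sum += s;
                sum_sq += s * s;
            }
            const float n = static_cast<float>(m_samples.size());
            m_stats.mean = sum / n;
            // Population standard deviation; clamp guards against rounding below zero.
            m_stats.std = std::sqrt(std::max(0.0f, sum_sq / n - m_stats.mean * m_stats.mean));
        }
        m_fresh = true;
    }

    std::vector<float> m_samples;
    MeanStd m_stats;
    bool m_fresh = false;
    int m_evaluations = 0;
};
```

In a benchmarking loop that repeatedly adds results together, this design computes the mean and standard deviation once, at the first getter call after the last merge, rather than on every `+`.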

@pavel-esir pavel-esir requested a review from Wovchena July 26, 2024 13:04
@Wovchena Wovchena enabled auto-merge July 26, 2024 13:56
@andrei-kochin andrei-kochin dismissed Wovchena’s stale review July 26, 2024 14:45

Comments were applied

@Wovchena Wovchena added this pull request to the merge queue Jul 26, 2024
@andrei-kochin andrei-kochin removed this pull request from the merge queue due to a manual request Jul 26, 2024
@andrei-kochin andrei-kochin merged commit 102f00a into openvinotoolkit:releases/2024/3 Jul 26, 2024
26 of 27 checks passed
@pavel-esir pavel-esir deleted the add_perf_counters branch July 29, 2024 07:10
@ilya-lavrenov ilya-lavrenov self-assigned this Jul 31, 2024