Make iter_lines 80x faster #2300

maxmouchet · 2022-07-10T08:56:35Z

Hi,

HTTPX's iter_lines() method is currently much slower than the one in requests. Here is a proof of concept inspired by the requests code (models.py#L853-L885): master...maxmouchet:httpx:faster-iter-lines.
As shown below, it is about 80x faster than the current implementation (and much simpler!), although it slightly changes the output.

What are your thoughts on this?

Breaking changes

Line endings are no longer included in the output: the client will return ["a", "", "b"] instead of ["a\n", "\n", "b\n"].
If a \r\n line ending is split across two chunks, between the \r and the \n, an additional empty line is output: ["a", ""] instead of ["a\n"].
Lines are also split on delimiters other than \n, \r and \r\n. See https://docs.python.org/3/library/stdtypes.html#str.splitlines.
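
The three changes above follow from how str.splitlines behaves. The snippet below is only an illustration of that standard-library behavior, not the code of the proof of concept; the variable names are made up for the example, and the "per chunk" case assumes each chunk is split on its own with no state carried over.

```python
text = "a\n\nb\n"

# 1. By default, splitlines() drops the line endings.
stripped = text.splitlines()           # ['a', '', 'b']
kept = text.splitlines(keepends=True)  # ['a\n', '\n', 'b\n']

# 2. If "\r\n" is cut between two chunks and each chunk is split
#    independently, the lone "\n" shows up as an extra empty line.
chunks = ["a\r", "\n"]
per_chunk = [line for chunk in chunks for line in chunk.splitlines()]  # ['a', '']

# 3. splitlines() also treats other characters as line boundaries,
#    for example vertical tab (\x0b) and line separator (\u2028).
extra = "a\x0bb\u2028c".splitlines()   # ['a', 'b', 'c']

print(stripped, kept, per_chunk, extra)
```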

Benchmark

Python 3.10.2
MacBook Air (M1, 2020)

import httpx
import requests
from tqdm import tqdm

def benchmark_httpx(url):
    with httpx.stream("GET", url) as r:
        r.raise_for_status()
        for line in tqdm(r.iter_lines()):
            pass

def benchmark_requests(url, chunk_size):
    r = requests.get(url, stream=True)
    r.raise_for_status()
    for line in tqdm(r.iter_lines(chunk_size=chunk_size)):
        pass

Time to iterate over 182634 lines of 9185 characters on average

My specific input file can be downloaded here, but any other file will do.

Case	Time (s)	Lines/s
Requests, chunk size = 512 (Requests default)	40	4497
Requests, chunk size = 2^16 (HTTPX default)	1	118675
HTTPX, master	188	970
HTTPX, faster-iter-lines	2	79963

jhominal · 2022-07-10T10:20:18Z

Collaborator

Thank you for this idea!

A performance improvement along the lines of your idea of using str.splitlines is likely possible.

However, I do not think that we want to introduce any behavior changes to that method - in particular, the “there is an additional empty line when \r and \n are on separate chunks” is actually a bug-in-waiting.

I think it should be possible to maintain the current method behavior by using .splitlines(keepends=True) and doing the following post-processing:

if the line does not end in one of the 3 documented line endings, then merge it with the next line;
if the last line ends with \r, keep it in the buffer in case it is followed immediately by a \n in the next chunk;
if the line ends with one of the 3 documented line endings, then replace that line ending with \n;
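
The post-processing steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not HTTPX's actual implementation: LineBuffer, feed, flush and this standalone iter_lines function are hypothetical names introduced for illustration, and the input is assumed to be already-decoded text chunks. Its performance has not been measured.

```python
from typing import Iterable, Iterator, List


class LineBuffer:
    """Hypothetical helper: splits text chunks into lines ending in "\n"."""

    def __init__(self) -> None:
        self._pending = ""  # incomplete line carried over between chunks

    def feed(self, text: str) -> List[str]:
        pieces = (self._pending + text).splitlines(keepends=True)
        lines: List[str] = []
        current = ""
        for i, piece in enumerate(pieces):
            current += piece
            is_last = i == len(pieces) - 1
            if current.endswith("\r\n"):
                lines.append(current[:-2] + "\n")
                current = ""
            elif current.endswith("\n"):
                lines.append(current)
                current = ""
            elif current.endswith("\r") and not is_last:
                # A "\r" in the middle of the text cannot be half of "\r\n",
                # because splitlines() keeps "\r\n" together in one piece.
                lines.append(current[:-1] + "\n")
                current = ""
            # Otherwise the piece ended on a non-documented delimiter, on no
            # delimiter at all, or on a trailing "\r": keep accumulating.
        self._pending = current
        return lines

    def flush(self) -> List[str]:
        pending, self._pending = self._pending, ""
        if pending.endswith("\r"):
            pending = pending[:-1] + "\n"
        return [pending] if pending else []


def iter_lines(chunks: Iterable[str]) -> Iterator[str]:
    buffer = LineBuffer()
    for chunk in chunks:
        yield from buffer.feed(chunk)
    yield from buffer.flush()


print(list(iter_lines(["a\r", "\nb\r\nc\x0bd\n", "e"])))
# ['a\n', 'b\n', 'c\x0bd\n', 'e']
```

Holding back a trailing \r until the next chunk arrives is what avoids the extra empty line when \r and \n land in separate chunks.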

2 replies

maxmouchet Jul 10, 2022
Author

That makes sense. I'll try this and see what performance we can get.

ofek Aug 9, 2022

Any update on this?
