[python] Python is 2x~3x slower than official binary on simple benchmarks (gcc emutls) #22917

Open
1 task done
wareya opened this issue Dec 24, 2024 · 7 comments
@wareya

wareya commented Dec 24, 2024

Description / Steps to reproduce the issue

Using a numeric pi calculation microbenchmark, timed using hyperfine (but time works ok too):

wareya@Toriaezu UCRT64 ~/dev/flinch
$ hyperfine.exe '"/c/Program Files/Python310/python.exe" etc/too_simple.py' "python etc/too_simple.py" --warmup 3
Benchmark 1: "C:/Program Files/Python310/python.exe" etc/too_simple.py
  Time (mean ± σ):     587.7 ms ±   5.0 ms    [User: 580.9 ms, System: 8.1 ms]
  Range (min … max):   581.3 ms … 599.0 ms    10 runs

Benchmark 2: python etc/too_simple.py
  Time (mean ± σ):      1.673 s ±  0.029 s    [User: 1.659 s, System: 0.010 s]
  Range (min … max):    1.649 s …  1.730 s    10 runs

Summary
  "C:/Program Files/Python310/python.exe" etc/too_simple.py ran
    2.85 ± 0.06 times faster than python etc/too_simple.py

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -Q|grep python
python 3.12.8-1

wareya@Toriaezu UCRT64 ~/dev/flinch
$ which python
/usr/bin/python

wareya@Toriaezu UCRT64 ~/dev/flinch
$ python -c "import sysconfig; print(sysconfig.get_config_var('CFLAGS'))"
-fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -march=nocona -msahf -mtune=generic -O2 -pipe -march=nocona -msahf -mtune=generic -O2 -pipe

Benchmark program:

#!/usr/bin/env python

def main():
    sumval = 0.0
    flip = -1.0
    for i in range(1, 10000001):
        flip = -flip
        sumval += flip / ((i << 1) - 1)
    print(f"{sumval * 4.0:.16f}")

if __name__ == "__main__":
    main()
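
Since hyperfine and `time` both include interpreter startup, it can help to time the loop from inside the process as well. The following is a minimal sketch, not part of the original report; the names `leibniz_pi` and `timed_pi` are made up for illustration, and the loop body is the same as in the benchmark above.

```python
import time


def leibniz_pi(terms: int) -> float:
    """Same alternating series as the benchmark, parameterised by term count."""
    sumval = 0.0
    flip = -1.0
    for i in range(1, terms + 1):
        flip = -flip
        sumval += flip / ((i << 1) - 1)
    return sumval * 4.0


def timed_pi(terms: int) -> tuple[float, float]:
    """Return (result, elapsed seconds) for one run of the loop only."""
    start = time.perf_counter()
    value = leibniz_pi(terms)
    elapsed = time.perf_counter() - start
    return value, elapsed


if __name__ == "__main__":
    value, elapsed = timed_pi(10_000_000)
    print(f"{value:.16f}")
    print(f"loop only: {elapsed:.3f} s")
```

If the in-process time differs between interpreters by roughly the same factor as the hyperfine numbers, startup cost is probably not the main contributor.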

Expected behavior

Roughly same performance.

Actual behavior

Wildly different performance.

Verification

Windows Version

MINGW64_NT-10.0-19045

Are you willing to submit a PR?

No response

@wareya wareya added the bug label Dec 24, 2024
@lazka
Member

lazka commented Dec 24, 2024

You are not using the native Python; try: pacman -S mingw-w64-ucrt-x86_64-python

$ which python
/ucrt64/bin/python

@wareya
Author

wareya commented Dec 24, 2024

The python at /usr/bin/python is msys2's python; that's where the python from the root-level python package goes.

I installed the UCRT64-specific one according to your recommendation and it has the same problem, just slightly less bad:

wareya@Toriaezu UCRT64 ~/dev/flinch
$ which python
/ucrt64/bin/python

wareya@Toriaezu UCRT64 ~/dev/flinch
$ hyperfine.exe '"/c/Program Files/Python310/python.exe" etc/too_simple.py' "/ucrt64/bin/python etc/too_simple.py" "/usr/bin/python etc/too_simple.py" --warmup 3
Benchmark 1: "C:/Program Files/Python310/python.exe" etc/too_simple.py
  Time (mean ± σ):     641.3 ms ±  25.1 ms    [User: 624.7 ms, System: 12.8 ms]
  Range (min … max):   623.7 ms … 710.0 ms    10 runs

Benchmark 2: C:/msys64/ucrt64/bin/python etc/too_simple.py
  Time (mean ± σ):      1.401 s ±  0.009 s    [User: 1.382 s, System: 0.007 s]
  Range (min … max):    1.390 s …  1.421 s    10 runs

Benchmark 3: C:/msys64/usr/bin/python etc/too_simple.py
  Time (mean ± σ):      1.815 s ±  0.024 s    [User: 1.789 s, System: 0.013 s]
  Range (min … max):    1.789 s …  1.855 s    10 runs

Summary
  "C:/Program Files/Python310/python.exe" etc/too_simple.py ran
    2.18 ± 0.09 times faster than C:/msys64/ucrt64/bin/python etc/too_simple.py
    2.83 ± 0.12 times faster than C:/msys64/usr/bin/python etc/too_simple.py

wareya@Toriaezu UCRT64 ~/dev/flinch
$ time python etc/too_simple.py
3.1415925535897915

real    0m1.852s
user    0m1.781s
sys     0m0.000s

/usr/bin/python being the root/msys2-level python package:

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -R python
checking dependencies...
:: git optionally requires python: various helper scripts
:: subversion optionally requires python: for some hook scripts

Packages (1) python-3.12.8-1

Total Removed Size:  182.76 MiB

:: Do you want to remove these packages? [Y/n]
:: Processing package changes...
(1/1) removing python                                                                                [###########################################################] 100%

wareya@Toriaezu UCRT64 ~/dev/flinch
$ ls /usr/bin/python
ls: cannot access '/usr/bin/python': No such file or directory

wareya@Toriaezu UCRT64 ~/dev/flinch
$ pacman -S python
resolving dependencies...
looking for conflicting packages...

Packages (1) python-3.12.8-1

Total Installed Size:  182.76 MiB

:: Proceed with installation? [Y/n]
(1/1) checking keys in keyring                                                                       [###########################################################] 100%
(1/1) checking package integrity                                                                     [###########################################################] 100%
(1/1) loading package files                                                                          [###########################################################] 100%
(1/1) checking for file conflicts                                                                    [###########################################################] 100%
(1/1) checking available disk space                                                                  [###########################################################] 100%
:: Processing package changes...
(1/1) installing python                                                                              [###########################################################] 100%

wareya@Toriaezu UCRT64 ~/dev/flinch
$ ls /usr/bin/python
/usr/bin/python

@lazka
Member

lazka commented Dec 24, 2024

Ok, thanks

@lazka
Member

lazka commented Dec 24, 2024

It seems to be a gcc vs clang thing, no idea why:

Benchmark 1: C:/Python312/python.exe too_simple.py
  Time (mean ± σ):      1.203 s ±  0.010 s    [User: 1.182 s, System: 0.016 s]
  Range (min … max):    1.194 s …  1.221 s    10 runs

Benchmark 2: C:/msys64/ucrt64/bin/python too_simple.py
  Time (mean ± σ):      2.462 s ±  0.025 s    [User: 2.428 s, System: 0.021 s]
  Range (min … max):    2.408 s …  2.479 s    10 runs

Benchmark 3: C:/msys64/clang64/bin/python too_simple.py
  Time (mean ± σ):      1.123 s ±  0.006 s    [User: 1.094 s, System: 0.021 s]
  Range (min … max):    1.115 s …  1.136 s    10 runs

Benchmark 4: C:/msys64/mingw64/bin/python too_simple.py
  Time (mean ± σ):      2.471 s ±  0.014 s    [User: 2.430 s, System: 0.028 s]
  Range (min … max):    2.457 s …  2.493 s    10 runs
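
To confirm which compiler produced each interpreter being compared, something like the following could be run under each `python` binary. This is only a sketch: `build_info` is a hypothetical helper name, and the exact strings returned vary by build (`CFLAGS` may be `None` on interpreters that ship no Makefile-derived configuration).

```python
import platform
import sys
import sysconfig


def build_info() -> dict:
    """Collect details about how the running interpreter was built."""
    return {
        "executable": sys.executable,
        "version": platform.python_version(),
        # Compiler identification string recorded at build time.
        "compiler": platform.python_compiler(),
        "platform": sysconfig.get_platform(),
        # May be None when the build provides no such config variable.
        "cflags": sysconfig.get_config_var("CFLAGS"),
    }


if __name__ == "__main__":
    for key, value in build_info().items():
        print(f"{key}: {value}")
```

Running it once per interpreter (for example with the ucrt64, clang64 and mingw64 binaries from the benchmark above) makes the gcc/clang split explicit next to the timings.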

@Morilli

Morilli commented Dec 27, 2024

I did some profiling using the Intel VTune profiler, and it looks like the main issue is related to thread local storage:
image
A lot of functions are calling __emutls_get_address which looks relatively expensive. I sadly don't know much about how tls works in general or what might differ between compilers, but I cannot see such calls in either the official python executable or the clang variant.
image

@lazka
Member

lazka commented Dec 27, 2024

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881 (Implement Windows native TLS)

edit: I've heard that hopefully we'll see some progress there in the near future.

@lazka lazka transferred this issue from msys2/MSYS2-packages Dec 27, 2024
@lazka lazka changed the title Python is 2x~3x slower than official binary on simple benchmarks [python] Python is 2x~3x slower than official binary on simple benchmarks (gcc emutls) Dec 27, 2024
@TheShermanTanker

> I did some profiling using the Intel VTune profiler, and it looks like the main issue is related to thread local storage: A lot of functions are calling __emutls_get_address which looks relatively expensive. I sadly don't know much about how tls works in general or what might differ between compilers, but I cannot see such calls in either the official python executable or the clang variant.

The difference is that gcc uses emulated TLS, while clang and VC are able to use the following assembly sequence to load TLS variables directly:

mov eax, DWORD PTR _tls_index[rip]
mov rcx, QWORD PTR gs:88
mov rax, QWORD PTR [rcx+rax*8]
mov eax, DWORD PTR local@secrel32[rax]

(The assembly above assumes we're loading a variable of type int named local)
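
For readers less familiar with the sequence, here is a toy Python model of what the four instructions compute. It is purely illustrative: `TLS_INDEX`, `LOCAL_SECREL`, `make_thread_tls` and `load_local` are hypothetical names and values, and the model assumes that `gs:88` (offset 0x58 in the x64 thread environment block) holds the pointer to the thread's array of per-module TLS blocks.

```python
import struct

# Hypothetical values, chosen only for illustration.
TLS_INDEX = 2      # stands in for _tls_index, the slot the loader gave this module
LOCAL_SECREL = 8   # stands in for local@secrel32, the offset of `local` in the TLS block


def make_thread_tls(local_value: int) -> list:
    """Build one thread's TLS pointer array with this module's block filled in."""
    block = bytearray(16)
    struct.pack_into("<i", block, LOCAL_SECREL, local_value)
    # Slots belonging to other modules are left empty in this toy model.
    return [None, None, block]


def load_local(tls_array: list) -> int:
    """Mirror the four instructions: index the array, then read at the offset."""
    index = TLS_INDEX        # mov eax, DWORD PTR _tls_index[rip]
    array = tls_array        # mov rcx, QWORD PTR gs:88
    block = array[index]     # mov rax, QWORD PTR [rcx+rax*8]
    # mov eax, DWORD PTR local@secrel32[rax]
    (value,) = struct.unpack_from("<i", block, LOCAL_SECREL)
    return value


if __name__ == "__main__":
    thread_a = make_thread_tls(111)
    thread_b = make_thread_tls(222)
    print(load_local(thread_a), load_local(thread_b))
```

The point of the model is that the native path is a couple of dependent loads with no function call, whereas the emulated path goes through a call to `__emutls_get_address` on each access, as seen in the profile above.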

I've been working on enabling native TLS, as it's called, for gcc, but it's a very breaking change, and some work needs to be done first so that everything compiled by gcc doesn't suddenly stop working once it's enabled for MinGW.
