test: enzyme latest patches regression
avik-pal committed Dec 5, 2024
1 parent fd7b740 commit ef0d450
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion test/layers/basic_tests.jl
@@ -165,7 +165,7 @@ end
)
x = randn(SVector{N, Float64})

- broken = pkgversion(Enzyme) v"0.13.18"
+ broken = pkgversion(Enzyme) == v"0.13.18"

@test begin
grad1 = ForwardDiff.gradient(ComponentArray(ps)) do ps
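The changed line gates a known-broken test on the installed Enzyme version. As a minimal sketch of that pattern, the snippet below contrasts an exact-version gate (as on the added line) with a range gate. The helper names `is_broken_exact` and `is_broken_from`, the `installed` variable, and the placeholder test expression are hypothetical and not taken from the Lux test suite. It assumes Julia 1.7 or later for the `broken=` keyword of `@test`.

```julia
using Test

# Exact gate: mirrors the `==` comparison on the added line of the diff.
is_broken_exact(v::VersionNumber) = v == v"0.13.18"

# Range gate: a hypothetical alternative, shown only for contrast.
is_broken_from(v::VersionNumber) = v >= v"0.13.18"

# Hypothetical stand-in for `pkgversion(Enzyme)` so the sketch runs without Enzyme.
installed = v"0.13.18"

@testset "version-gated broken test" begin
    # Placeholder for a check that fails on the affected release. With
    # `broken=true` the failure is recorded as Broken rather than Fail.
    @test 1 + 1 == 3 broken=is_broken_exact(installed)
end
```

With an exact gate, any other release runs the check as an ordinary `@test`. Note that a test which passes while still marked `broken=true` is reported as an unexpected pass, so the gate has to be narrowed or removed once a fix lands.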

5 comments on commit ef0d450

@avik-pal
Member Author

@JuliaRegistrator register subdir=lib/LuxCore

@avik-pal
Member Author

@JuliaRegistrator register subdir=lib/MLDataDevices

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/120708

Tip: Release Notes

Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a LuxCore-v1.2.1 -m "<description of version>" ef0d4500c49dfbb91a3f53ac1baf37af04d346f7
git push origin LuxCore-v1.2.1

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/120709

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a MLDataDevices-v1.6.3 -m "<description of version>" ef0d4500c49dfbb91a3f53ac1baf37af04d346f7
git push origin MLDataDevices-v1.6.3

@github-actions
Contributor

Lux Benchmarks
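
The Ratio column in the benchmark table is consistent with the current timing divided by the previous one, rounded to two decimals, so values above 1 indicate a slowdown. A minimal sketch of that arithmetic follows. The function names and the 1.5 threshold are assumptions for illustration; the source does not state an alert threshold.

```julia
# Ratio as it appears in the table: current time divided by previous time.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits=2)

# Hypothetical regression check; the threshold is an assumed value.
is_regression(current_ns, previous_ns; threshold=1.5) =
    current_ns / previous_ns >= threshold

# Timings taken from two rows of the table (in ns).
r_layernorm = ratio(4834, 3958)     # layernorm(2, act=gelu, affine=false), 4 threads
r_batchnorm = ratio(64375, 38208)   # batchnorm(4, act=relu, affine=false), 4 threads
```

Sub-microsecond rows can swing by tens of percent between runs, so a single ratio is better read as a hint than as proof of a regression.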

Benchmark suite Current: ef0d450 Previous: 78ad9c9 Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4208 ns 4291 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4834 ns 3958 ns 1.22
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5375 ns 5125 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4083 ns 4250 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 58557 ns 60770 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10625 ns 10250 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10542 ns 10125 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11375 ns 10333 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10083 ns 10334 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 415171 ns 423675 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1334 ns 1125 ns 1.19
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1209 ns 1166 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1333.5 ns 1229.5 ns 1.08
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1208 ns 1250 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 17961 ns 17992 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4084 ns 4250 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3959 ns 4000 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4333 ns 4167 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 3958 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 107003.5 ns 109284 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70834 ns 57417 ns 1.23
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 64375 ns 38208 ns 1.68
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 64500 ns 46375 ns 1.39
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80375 ns 80167 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36906 ns 36667.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2031562.5 ns 2021709 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2088542 ns 2097000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093958 ns 2077875 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1926833 ns 2001000 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192315 ns 195812 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 196625 ns 145166.5 ns 1.35
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 195542 ns 142666 ns 1.37
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 185209 ns 146500 ns 1.26
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 182375 ns 144167 ns 1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166552 ns 165803 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1111896 ns 1104750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1118729.5 ns 1156062 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1119708 ns 1104750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1130333.5 ns 1129458 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 514050 ns 527714 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 4000 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3416 ns 3625 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4459 ns 4375 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3416.5 ns 3459 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 67303.5 ns 70555.5 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9084 ns 9084 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 8709 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9625 ns 9667 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 9167 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 472568 ns 481518.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15020.5 ns 15416 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14666 ns 16958 ns 0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18625 ns 16791.5 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14875 ns 14792 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 53079 ns 54315.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224750 ns 213958 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215104.5 ns 214042 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215917 ns 214208 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215083 ns 214334 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 267364.5 ns 273628 ns 0.98
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 750 ns 500 ns 1.50
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 709 ns 583 ns 1.22
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 667 ns 1.12
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 750 ns 583.5 ns 1.29
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17115 ns 17264 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1500 ns 1500 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1625 ns 1.10
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1792 ns 0.84
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1708 ns 0.81
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 99326.5 ns 102318 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7833 ns 7000 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7291 ns 5084 ns 1.43
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7083 ns 5958 ns 1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 9916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23212 ns 23961 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233458.5 ns 221542 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228125 ns 229708.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228666 ns 229667 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214125 ns 226542 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 164950.5 ns 170388 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3958 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23508 ns 23385 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16959 ns 16625 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17042 ns 16500 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17083 ns 17000 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16708 ns 16833 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 160457.5 ns 161544 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 611125 ns 581791 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 609042 ns 578709 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 606834 ns 569958 ns 1.06
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 605520.5 ns 572333.5 ns 1.06
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113172 ns 113621 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1423834 ns 1428958 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1422458 ns 1421292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1424292 ns 1415833 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1420334 ns 1420000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 209423.5 ns 210533 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1082229.5 ns 1081750 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 970792 ns 938708 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1346208 ns 1353291.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1300333 ns 1296666 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 270348.5 ns 269675 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5996021 ns 5971292 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4506125 ns 4530771.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4914416 ns 4949917 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5507375 ns 5624041 ns 0.98
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1074060 ns 1072622 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23487 ns 23468 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2208 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 168855 ns 169303 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4167 ns 4167 ns 1
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4334 ns 4208 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5041 ns 4708 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3667 ns 4125 ns 0.89
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 64100 ns 66233.5 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11291 ns 11125 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11875 ns 11250 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12291 ns 12000 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11000 ns 10792 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 442842 ns 452338 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6042 ns 6292 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6104.5 ns 6417 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7209 ns 7604.5 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5708 ns 5833 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51573 ns 52542 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17041.5 ns 18583 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17292 ns 17500 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17625 ns 18833 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17250 ns 16833 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 299598.5 ns 301964.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32513 ns 32911 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8458 ns 8625 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9000 ns 8542 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9084 ns 9125 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8458 ns 8917 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 155298 ns 160010 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 96666 ns 64500 ns 1.50
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 96708 ns 64666 ns 1.50
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 96292 ns 64500 ns 1.49
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 96375 ns 64500 ns 1.49
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111447.5 ns 112101 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 278125 ns 279458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 275250 ns 288583 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 274583.5 ns 273583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 277584 ns 286083 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 190076 ns 185547.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3409792 ns 3376750.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3047666 ns 2898291.5 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3023958 ns 3024854 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3959958 ns 3941104 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 579376.5 ns 581323 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7632583 ns 7603583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7497667 ns 7358750 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7451520.5 ns 7466208 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8199583 ns 8146792 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1349456 ns 1318419 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17500916.5 ns 17484792 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17545437.5 ns 17670999.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17599584 ns 17533250 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14108083 ns 9220187.5 ns 1.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23772875 ns 23603916 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34134729 ns 43639208 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37435375 ns 37125083 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34708708 ns 34980187.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1860458 ns 1854234 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 316659729.5 ns 188207417 ns 1.68
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 235623563 ns 251666438 ns 0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 195619437 ns 194864208 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 279867979.5 ns 434287708 ns 0.64
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13932935 ns 13931919 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 273833833 ns 287943833 ns 0.95
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 267231583 ns 355406479.5 ns 0.75
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 255610333 ns 297803834 ns 0.86
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 329098667 ns 400767145.5 ns 0.82
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21375 ns 22458 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22125 ns 22208 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25292 ns 25041 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21125 ns 22270.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 94977 ns 96107.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103542 ns 113166.5 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103791 ns 104292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105125 ns 105083 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103250 ns 103812.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 500332.5 ns 502678.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 6833 ns 0.86
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6417 ns 6479.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6750 ns 7041.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6000 ns 5958 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68160.5 ns 68593 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 15000 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15000 ns 15479 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16500 ns 16333 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14584 ns 14708.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 477825.5 ns 475032.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3101458 ns 3031167 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2118542 ns 2061583 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2321249.5 ns 2253209 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4650021 ns 4505270.5 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585427 ns 586394 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23564209 ns 23625708.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18768041 ns 18333062.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17974229 ns 17998916.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35659708 ns 35608125.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2760352.5 ns 2764773.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34076750.5 ns 33284000 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27653896 ns 28078500 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28752229 ns 28952938 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 40853625 ns 41446187.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74667 ns 72167 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 71833.5 ns 81083 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 73521 ns 86562.5 ns 0.85
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 71770.5 ns 75479 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100115 ns 104806 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 292083 ns 223458.5 ns 1.31
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224167 ns 325166 ns 0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 297708 ns 320958 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205792 ns 210500 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 537710 ns 552193 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11750 ns 11917 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11416 ns 12583 ns 0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12542 ns 12708 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12270.5 ns 12083 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71148.5 ns 71752 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26208 ns 26667 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26875 ns 26583 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27625 ns 28000 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26500 ns 26500 ns 1
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 468928 ns 476956.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12250 ns 11667 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12166 ns 12333 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 12917 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12042 ns 11834 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52398 ns 53475 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25250 ns 25792 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26125 ns 25500 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26042 ns 26500 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26000 ns 26000 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301242 ns 305905.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179104.5 ns 181458 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 179750 ns 180541 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 180583 ns 184604.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 178625 ns 179667 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55842.5 ns 57257.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 582584 ns 592917 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 591917 ns 587687.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 594313 ns 595750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583166 ns 582791.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 280084 ns 291107 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 8958 ns 0.67
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 6583 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6500 ns 8042 ns 0.81
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5625 ns 6375 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70229 ns 71199.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13875 ns 13916 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14542 ns 14875 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15187.5 ns 15459 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458 ns 13958.5 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 456073.5 ns 465947 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1235292 ns 1219708 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1304042 ns 1231750 ns 1.06
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1374021 ns 1269667 ns 1.08
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1092083 ns 1009666 ns 1.08
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302409 ns 300921 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4120521 ns 4103750 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4446875 ns 4571833 ns 0.97
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4623750 ns 4574959 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3716729.5 ns 3707208 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1039016 ns 1038858 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1917 ns 1875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23753 ns 23656 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4833 ns 4875 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4792 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 186693 ns 190147.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5959 ns 5375 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6000 ns 5708.5 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 6917 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5667 ns 5437.5 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54622.5 ns 56411.5 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11167 ns 10750 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11541 ns 11000 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11250 ns 11834 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10542 ns 10729.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 325703 ns 336162 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22898 ns 22819 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2750 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3041 ns 2750 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 3042 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2792 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 157339 ns 159135.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11458 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12083 ns 11333 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12417 ns 12750 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11229.5 ns 11208 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55735 ns 58102 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24959 ns 24750 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25042 ns 24334 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25042 ns 25084 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25042 ns 24750 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 288122.5 ns 298883.5 ns 0.96
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4209 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4209 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4291 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24760 ns 24823 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16333 ns 16084 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16333 ns 15959 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16500 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16459 ns 16167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 193221.5 ns 197271 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5791 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5791 ns 5916 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33178 ns 34115 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20750 ns 20500 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20708 ns 20417 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20916 ns 21250 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20708 ns 20708 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 172900.5 ns 178582.5 ns 0.97
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 420188 ns 423708.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 386937.5 ns 366416.5 ns 1.06
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 482833 ns 484917 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 106250 ns 103541 ns 1.03
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67134 ns 67022 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 865417 ns 943375 ns 0.92
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 948604 ns 950687 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1189500 ns 1197916.5 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 411770.5 ns 330416.5 ns 1.25
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190610 ns 193979 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 136750 ns 80541.5 ns 1.70
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 133396 ns 81125 ns 1.64
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 133166.5 ns 81541.5 ns 1.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 138854 ns 80479.5 ns 1.73
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192824 ns 194031 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917250 ns 1919833 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1912124.5 ns 1936958 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920250 ns 1930229 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1942521 ns 1923250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 395139 ns 400084 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22003 ns 21834 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1750 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 168855 ns 168563 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6812.5 ns 6416 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6750 ns 6166 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8187.5 ns 7667 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6334 ns 6709 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59378.5 ns 61087.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9312.5 ns 8959 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9209 ns 8875 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9333 ns 9250 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 9312.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 305200.5 ns 309875.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 112669000 ns 118672458 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174180000 ns 182326458 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143189875 ns 148081791.5 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 112387917 ns 102035042 ns 1.10
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5463061 ns 5467326.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 616937396 ns 610447729.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 558474917 ns 582022188 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 448891770.5 ns 452913708.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 624388062.5 ns 751418979 ns 0.83
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38238112 ns 34971564 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 665577792 ns 646694167 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 667381166.5 ns 688250333 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 616459979 ns 583281666.5 ns 1.06
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 747251209 ns 744581417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 62750 ns 59000 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 53834 ns 37792 ns 1.42
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 53458 ns 47750 ns 1.12
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82125 ns 83417 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37037 ns 38231 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1926667 ns 1925854 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974291 ns 1987562.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980021 ns 1779021 ns 1.11
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1901875 ns 1864125 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 171617 ns 175192.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 265333 ns 292250 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 269750 ns 268916 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 269083.5 ns 269500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 264854.5 ns 266000 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 124229 ns 128884 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 687584 ns 686771 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 678833 ns 702187.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 680125 ns 591083 ns 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 635854 ns 688958 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 697446 ns 706872 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2242458 ns 2268958 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2097875 ns 2245875 ns 0.93
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2254458 ns 2101125 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2199750.5 ns 2176375 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132519 ns 133295.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5507312 ns 5521229.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5516959 ns 5587167 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5495292 ns 5520666.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5486271 ns 5493834 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 737355 ns 748599 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 678417 ns 642084 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 671291 ns 648917 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 668458 ns 636667 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 682958 ns 635875 ns 1.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46914 ns 46696 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1824791.5 ns 1822625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1728375 ns 1670333 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1718604.5 ns 1719875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2080500 ns 2097416.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221890.5 ns 221082 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70750 ns 57833 ns 1.22
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 53125 ns 38500 ns 1.38
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 52916 ns 46250 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82375 ns 82750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28168 ns 28653 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2031792 ns 2020167 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2096833.5 ns 2105417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2088000 ns 2093958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001083.5 ns 1999958.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 187289.5 ns 190261 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13449750 ns 13356563 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12528021.5 ns 12441584 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12554687.5 ns 12535208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15230083 ns 15154375 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 513617 ns 512188.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 46862979 ns 47248458 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41543521 ns 42098688 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40829437.5 ns 40986395.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58532271 ns 58394208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2896866 ns 2891115 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74392375 ns 74033603.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90893292 ns 68368417 ns 1.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 92732000 ns 90690875 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76658749.5 ns 76143146 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70625 ns 58250 ns 1.21
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 64875 ns 38583 ns 1.68
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 64625 ns 47625 ns 1.36
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81917 ns 79125 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47851 ns 47024 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923187.5 ns 1918250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1983437.5 ns 1983396 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1973333 ns 1965584 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1883833 ns 1830750 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 193982.5 ns 192100.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32956 ns 32257 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6083 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6416 ns 6000 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6375 ns 6416 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5875 ns 6104.5 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 176118.5 ns 172267 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32831 ns 31372 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2625 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2916 ns 2625 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2875 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2666 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 165694 ns 158332 ns 1.05
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 278326104 ns 283213208 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340448937.5 ns 347751604 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 308909437.5 ns 314361479.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 278977666.5 ns 273430250 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7109405 ns 7090888 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 997951584 ns 992205416 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 940941292 ns 964468250 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 832217625 ns 838327667 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1009333917 ns 1152689375 ns 0.88
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33893371 ns 34106482 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1394325042 ns 1303968312.5 ns 1.07
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1705224209 ns 1327504666.5 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1693911291 ns 1629886334 ns 1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1308776729 ns 1314925417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1456667 ns 1455709 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1462958 ns 1463125 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1454521 ns 1415166.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1451416.5 ns 1410000 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127922 ns 127607 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5012417 ns 5015979 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5028750 ns 5060792 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5027959 ns 5051500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5027187.5 ns 5009458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 506424 ns 574399.5 ns 0.88
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 157716375 ns 170351312 ns 0.93
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 136859042 ns 167663375 ns 0.82
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 164218250 ns 130848583.5 ns 1.26
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 151479417 ns 167905166.5 ns 0.90
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4879107 ns 4881672 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 634203459 ns 618588292 ns 1.03
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 607766083 ns 577882000 ns 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 456653750 ns 497505667 ns 0.92
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 653815125 ns 647917125 ns 1.01
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17510307 ns 16266169 ns 1.08
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8926646 ns 8910542 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9038916.5 ns 9026291.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7947771 ns 7927084 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10104354 ns 9711125 ns 1.04
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1594648 ns 1592738 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36795042 ns 35730646 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 38004792 ns 38522375 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34295916.5 ns 33553041 ns 1.02
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37862042 ns 37755625 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6452447 ns 6512589 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47334 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47417 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47625 ns 47334 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47042 ns 47875 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18361 ns 18035 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50042 ns 52792 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50292 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50542 ns 50458 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50292 ns 50667 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 194710.5 ns 197012 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6750 ns 6375 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6875 ns 6250 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7709 ns 7417 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6541 ns 6750 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 94841 ns 112280 ns 0.84
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 9584 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10209 ns 9458 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10125 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 10209 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 543786 ns 615930.5 ns 0.88
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5416 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6292 ns 5791 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6750 ns 7146 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5666 ns 5959 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 105080 ns 123840 ns 0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12583 ns 12583 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13750 ns 12750 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13375 ns 13208 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13375 ns 12708 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 521491.5 ns 529723.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1042 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33226 ns 32491 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8000 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 7750 ns 1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 8209 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8041 ns 7959 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 215927 ns 209838 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23125 ns 23417 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23209 ns 23041 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23250 ns 23584 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23250 ns 23417 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18682 ns 18029 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52250 ns 54667 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 53125 ns 52417 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52833 ns 52667 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52250 ns 52458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 310779 ns 299710 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1455520.5 ns 1444833 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1461770.5 ns 1449584 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1464563 ns 1399209 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1420375.5 ns 1396958.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196494.5 ns 195765 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5004917 ns 5000042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4928042 ns 5049833 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5012292 ns 5044562 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5010708.5 ns 5015291.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 619791 ns 612366.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3153125 ns 3043104 ns 1.04
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2140000 ns 2098583 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2307083.5 ns 2313209 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4612500 ns 4606709 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580901 ns 580804.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24408833 ns 24374458 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19732667 ns 19110937.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19045729.5 ns 18926833 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36515125 ns 36250750 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2842137 ns 2861963.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34057083.5 ns 33972875 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28326333 ns 28642167 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28024667 ns 28092229 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42838792 ns 41633541.5 ns 1.03
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 140571271 ns 141888875 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143484104 ns 146034209 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 120774500 ns 126705062.5 ns 0.95
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 187527416 ns 173781771 ns 1.08
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22777810 ns 22552094 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1387998541 ns 1227732750 ns 1.13
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 2164279542 ns 839227916.5 ns 2.58
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1082658958.5 ns 739276458 ns 1.46
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 828842208.5 ns 683957250 ns 1.21
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118414466 ns 117875105 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 79708.5 ns 73084 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72542 ns 74479 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75520.5 ns 75750 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73458 ns 74958 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 238954.5 ns 240665.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 286459 ns 280208.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 295292 ns 288959 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 302292 ns 193791 ns 1.56
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 240521 ns 192583 ns 1.25
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1217040 ns 1331151 ns 0.91
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35202521 ns 35557542 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35899625 ns 36592625 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 31197042 ns 32410750 ns 0.96
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 39929583.5 ns 40376458 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5845222 ns 5838475 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 147855667 ns 148073500 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153555375 ns 158619999.5 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 134579979 ns 139542333.5 ns 0.96
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 150196958.5 ns 282659625 ns 0.53
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34892998 ns 34873454 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 114292542 ns 120976041.5 ns 0.94
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173321542 ns 182674416.5 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143543334 ns 147566209 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 93943084 ns 105641958.5 ns 0.89
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5434556 ns 5456587 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 473131708 ns 471084687.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 515810125.5 ns 489605103.5 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 442518292 ns 432706750 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 614699291.5 ns 737367000 ns 0.83
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35179278 ns 32284178 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 804964083 ns 707739104.5 ns 1.14
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 656838729.5 ns 677702687.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 594341604 ns 572041062.5 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 735687542 ns 735458208 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1353083 ns 1303791.5 ns 1.04
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 1020917 ns 778750 ns 1.31
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 995292 ns 904854 ns 1.10
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2104875 ns 1945625 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 569348 ns 581135.5 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2979875 ns 2961271 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2615833 ns 2515584 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2614124.5 ns 2624334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3699541.5 ns 3695417 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1670621 ns 1838423 ns 0.91
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5794812.5 ns 5788229.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5833354.5 ns 5903625 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5800917 ns 5805354.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2911437.5 ns 2899667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7875 ns 7375 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7000 ns 5250 ns 1.33
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7000 ns 6167 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10583 ns 9916 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24801 ns 25653 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222541.5 ns 212479.5 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221250 ns 226833 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220833.5 ns 220417 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217041.5 ns 206167 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 245776 ns 275653 ns 0.89
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 451162917 ns 307447667 ns 1.47
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 205123625.5 ns 279760625 ns 0.73
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 178414666.5 ns 198268687.5 ns 0.90
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 454897875 ns 308090500 ns 1.48
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7671486 ns 7673335 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1093247396 ns 1074946146 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 925248250 ns 1069981500 ns 0.86
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 837547083 ns 801953875 ns 1.04
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1163363584 ns 1147606167 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26761104.5 ns 26674789 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5500 ns 4958 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5458 ns 5208 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6875 ns 5958 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5291.5 ns 5042 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 149694 ns 169081.5 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6833.5 ns 6833 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7395.5 ns 6917 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7625 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 7125 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 579102 ns 666084 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23601 ns 24582 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9166 ns 9125 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9042 ns 8459 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9250 ns 9084 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10166.5 ns 9041 ns 1.12
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 199458 ns 231180 ns 0.86
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 354500 ns 352416.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352375 ns 351792 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 355687.5 ns 354500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 357479.5 ns 352125 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21220 ns 21300.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 824396 ns 814416 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 778375 ns 809021 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 777666 ns 782042 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 821813 ns 827334 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 231309.5 ns 305499.5 ns 0.76
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 331125 ns 336479.5 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 344833 ns 321125 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 453000 ns 450500 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10292 ns 10542 ns 0.98
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18084 ns 18195 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 709750 ns 721208 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 741354 ns 733229 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1003291.5 ns 1007271 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26479 ns 26666 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 223194.5 ns 274145 ns 0.81
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 370292 ns 383062 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 353396 ns 329312 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 439292 ns 442417 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 29916.5 ns 30792 ns 0.97
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22856 ns 22813 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 727458 ns 737625 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 790208 ns 785604 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1034916 ns 1032042 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 90395.5 ns 105375 ns 0.86
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 197661 ns 222871.5 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3417 ns 3708 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3625 ns 3417 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3666 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3417 ns 3583 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17539 ns 17737 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4208 ns 4417 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4375 ns 4209 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4250 ns 4333 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4125 ns 4292 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 213017 ns 278790 ns 0.76
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3729 ns 3791 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4083 ns 3604.5 ns 1.13
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4958 ns 4145.5 ns 1.20
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3417 ns 3666.5 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 159837 ns 207112 ns 0.77
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8167 ns 8125 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8000 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 8542 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8458 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1042725 ns 1220818 ns 0.85
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 205667 ns 203687.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 213208 ns 210041 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 213500 ns 210625 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200458 ns 200708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34523 ns 34937 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 645542 ns 645270.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 671042 ns 631770.5 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621458.5 ns 622458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 580854.5 ns 630750 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 298737.5 ns 343085 ns 0.87
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1234437.5 ns 1001750 ns 1.23
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1277666 ns 1034729 ns 1.23
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1190750 ns 956333 ns 1.25
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1152750 ns 879958 ns 1.31
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206763.5 ns 207672.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4518542 ns 4524208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4787042 ns 4821708 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4473666.5 ns 4482250 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5146541 ns 5132979 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 931436.5 ns 922465 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3667 ns 3666 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3667 ns 3292 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4041 ns 3417 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2959 ns 3583 ns 0.83
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 185683 ns 232276 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7167 ns 7292 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 6792 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7500 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6833 ns 6875 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 942579 ns 1014308 ns 0.93
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1642000 ns 1651708 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1207250 ns 1164875 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1390000 ns 1344708 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2427938 ns 2500875 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212907.5 ns 214937 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12368250 ns 12379084 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9590500 ns 9615125.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9295438 ns 9247041 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18019000 ns 18054792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1954764 ns 1946109 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17359458 ns 17413000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14385104 ns 14415146.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14370541 ns 14339250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21035500 ns 21151646 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134083.5 ns 134917 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 139416.5 ns 88958 ns 1.57
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 134958 ns 91334 ns 1.48
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 131334 ns 87666 ns 1.50
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125600 ns 126488 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2022916.5 ns 2026792 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2047021 ns 2043625 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2034334 ns 1766792 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2039125 ns 2026459 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 948556 ns 1034650 ns 0.92
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1458 ns 2770.5 ns 0.53
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1792 ns 1334 ns 1.34
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3520.5 ns 3208 ns 1.10
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1229.5 ns 3791 ns 0.32
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16310 ns 16389 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2542 ns 2584 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 2459 ns 1.14
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2875 ns 2709 ns 1.06
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2834 ns 2791 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 182763.5 ns 192723.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7958 ns 7250 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6875 ns 5208 ns 1.32
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6875 ns 5959 ns 1.15
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10583 ns 9959 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33908 ns 34193 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225041 ns 225250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221625 ns 227063 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220833 ns 220708 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215291 ns 213333 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 320916 ns 312634.5 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22605 ns 22321 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14500 ns 14417 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14625 ns 14250 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14500 ns 14416 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14500 ns 14375 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 456450 ns 475484 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 142749.5 ns 134292 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91312 ns 93667 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 142292 ns 94354.5 ns 1.51
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 138792 ns 91958 ns 1.51
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125035 ns 125921 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919500 ns 1924541.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1942104 ns 1939333 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1929000 ns 1709625 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1927250 ns 1925042 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 877064 ns 949226.5 ns 0.92
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 877458.5 ns 874708 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 825458.5 ns 796250 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1230104 ns 1220958 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 955479 ns 963208 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 269410 ns 277966 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2816333 ns 2838542 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2528771 ns 2538917 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3342458 ns 3341125 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3349729.5 ns 3415500 ns 0.98
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1555391.5 ns 1590492.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14833 ns 17646 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14875 ns 16500 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18500 ns 18042 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16875 ns 17333 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 131035 ns 142389.5 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227209 ns 226250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215791 ns 239208.5 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216958 ns 215666.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225250 ns 227708 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 594103.5 ns 648593.5 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221333 ns 222666 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222875 ns 220083 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222583 ns 222792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219042 ns 221875 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 242007 ns 275688.5 ns 0.88
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 548917 ns 564542 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 511041.5 ns 507292 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 509917 ns 506333 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 508458 ns 559542 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1234181 ns 1323540.5 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4083 ns 4229.5 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4041 ns 3958 ns 1.02
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4417 ns 3916 ns 1.13
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3666.5 ns 4333 ns 0.85
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17140 ns 16749 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7209 ns 7187 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7459 ns 6917 ns 1.08
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7333.5 ns 7292 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7417 ns 7416 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 183429.5 ns 193558 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18833 ns 19333.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16666 ns 17167 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21083 ns 19291 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18396 ns 16959 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 131942 ns 145420.5 ns 0.91
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 245395.5 ns 223917 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212292 ns 216437.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214833 ns 215375 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213708 ns 213812.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 833743 ns 914033 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4208 ns 4958 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4833 ns 4250 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4916.5 ns 4417 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3854.5 ns 3917 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 208168.5 ns 206416 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10333 ns 10250 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10459 ns 10000 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11084 ns 10958 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10145.5 ns 10000 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 994315 ns 1027488.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3458 ns 3833 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3791 ns 3459 ns 1.10
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4042 ns 3416 ns 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3167 ns 3250 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 209797 ns 236791.5 ns 0.89
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7416 ns 7417 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7459 ns 7250 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8083.5 ns 7625 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7459 ns 7375 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 997101.5 ns 1067899 ns 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23443625 ns 23463750.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34805208 ns 43484791.5 ns 0.80
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37298500 ns 37835875 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34536209 ns 34880875 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1851929 ns 1833754 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 185954395.5 ns 184463792 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159888645.5 ns 172964124.5 ns 0.92
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 144873209 ns 146554521 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 438754792 ns 410369375 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16496173 ns 16525549 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 269927937.5 ns 424815979 ns 0.64
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 259799312.5 ns 259769792 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298856875 ns 297288958 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 487045354.5 ns 478383791 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 189541.5 ns 183959 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182167 ns 183375 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183416.5 ns 186187.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182375 ns 183187.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 187318 ns 205888.5 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 636187.5 ns 602916.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 597458.5 ns 596416.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 588459 ns 592375 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596146 ns 596542 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 944443 ns 1054788 ns 0.90
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3952375 ns 3829562.5 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 4007646 ns 3998791.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3594292 ns 3564812.5 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4885708 ns 4550791.5 ns 1.07
batchedmm(128, Bsize=512)/forward/GPU/CUDA 552348.5 ns 532059.5 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 18061833 ns 17302667 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18498208.5 ns 18565313 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 17053770.5 ns 16600312.5 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19733813 ns 20208979.5 ns 0.98
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2636788.5 ns 2631431 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32315 ns 33095 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9145.5 ns 9083 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9625 ns 9042 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9291 ns 9458.5 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8792 ns 9125 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 247143.5 ns 266296 ns 0.93
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 497882542 ns 498097750 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 466893292 ns 506743916 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 356555750 ns 424015542 ns 0.84
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 601192353.5 ns 594637416 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12465773.5 ns 12483759 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1887759917 ns 1878936437.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1627534167 ns 1662067875 ns 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1505961604 ns 1496755770.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2123318791.5 ns 2214230167 ns 0.96
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49303078 ns 49527395 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1652917 ns 1663166 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1209833 ns 1177833 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1397667 ns 1370041 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2460062.5 ns 2349521 ns 1.05
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214417 ns 217522 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12745021 ns 12726750 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9950208 ns 10036417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9693541 ns 9643083 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18371500 ns 18397833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2028129 ns 2037123 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17681833 ns 17723584 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14711375 ns 14827916 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14648250 ns 14555416.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21429709 ns 21415041 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26167 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26167 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26167 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26166 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23744 ns 23706 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67208 ns 67354.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67208 ns 66792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67166 ns 68375 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66916 ns 66875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 365755.5 ns 393355.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206375 ns 203458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212666 ns 209417 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211542 ns 210084 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200291 ns 199125 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25711 ns 26245.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 655729 ns 647916 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 632000 ns 672375.5 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 673667 ns 621792 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630708 ns 593542 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 322192 ns 351878.5 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 683459 ns 679750 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 682708 ns 657291 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 691916.5 ns 595709 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 680834 ns 632771 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 130902.5 ns 131601.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2242354.5 ns 2238750 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2244709 ns 2300791 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2244875.5 ns 2241896 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2229125 ns 2244958 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1093705 ns 1242570.5 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20396 ns 18625 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16833 ns 17979 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23020.5 ns 18375 ns 1.25
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19166 ns 17104 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 131648.5 ns 144244 ns 0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 265541.5 ns 256458 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232167 ns 245646 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 264625 ns 221750 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 259979 ns 230416 ns 1.13
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 939947 ns 1056298 ns 0.89
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 584 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23249 ns 23741 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9583.5 ns 9208 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9708 ns 9708 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10041 ns 9458 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9541 ns 9333 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 242690 ns 257592.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5542 ns 5125 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5709 ns 5500 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6667 ns 6395.5 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5250 ns 5458 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 206130.5 ns 231821.5 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6709 ns 6833 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7417 ns 6792 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 7458 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6708 ns 6917 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 735324.5 ns 801589.5 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2000 ns 2167 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2229.5 ns 2000 ns 1.11
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2208 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2292 ns 2375 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17909 ns 17797 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6375 ns 6375 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6792 ns 6542 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6875 ns 6667 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6208 ns 6375 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 303359 ns 330267.5 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 751688 ns 748708 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 779292 ns 756208 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 779395.5 ns 752750 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 776146 ns 753542 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 20845 ns 20724 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 796792 ns 792417 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791166 ns 796875 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 808708 ns 786834 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 775292 ns 808000 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 267264 ns 297689.5 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8000 ns 7250 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6687.5 ns 5250 ns 1.27
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6958 ns 6042 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10458 ns 10125 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32932 ns 33074 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261062.5 ns 228604.5 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 237583 ns 251041 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 271396 ns 227708 ns 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 252646 ns 226000 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 331767 ns 362298.5 ns 0.92
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10250 ns 10209 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10542 ns 10209 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11208 ns 10458 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10250 ns 9750 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 218675.5 ns 252317 ns 0.87
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25000 ns 25334 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24625 ns 24312.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25583 ns 25959 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24416 ns 24395.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1056250 ns 1133104 ns 0.93
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106355042 ns 106928354 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117397229.5 ns 126898666 ns 0.93
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120585312.5 ns 121692334 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117183084 ns 117598792 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2657952 ns 2629460 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 374187771 ns 390743083 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 350821292 ns 379904750 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 361003333 ns 361277959 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 479876375 ns 481946125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15234863.5 ns 15184946 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 604863708 ns 754771020.5 ns 0.80
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 773786667 ns 597861750 ns 1.29
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 812604291 ns 748681771 ns 1.09
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 770323375 ns 760209125 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6833 ns 6500 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 6667 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8062.5 ns 8333 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6667 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 213616 ns 239111 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13458 ns 14125 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13875 ns 14125 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14416 ns 14437.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13625 ns 13667 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1017707 ns 1073718 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6208 ns 5542 ns 1.12
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6042 ns 5542 ns 1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7145.5 ns 6395.5 ns 1.12
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5417 ns 5792 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 208255 ns 235877.5 ns 0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11958 ns 12208 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12729.5 ns 12542 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 12750 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12500 ns 12166 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 723959 ns 781667 ns 0.93
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 6209 ns 5709 ns 1.09
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 6375 ns 5437.5 ns 1.17
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6375 ns 5750 ns 1.11
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5500 ns 5833 ns 0.94
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16943 ns 16760 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15250 ns 15417 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15625 ns 15333 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15625 ns 15500 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15500 ns 15625 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 186257 ns 199275.5 ns 0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23245 ns 23515 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6333 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6167 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6417 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6187.5 ns 6333 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 225046 ns 240257 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5875 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 6083 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24205 ns 24789 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20875 ns 20958 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21417 ns 20958.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21541.5 ns 21334 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21229.5 ns 21000 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 246651 ns 263523 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 194166.5 ns 188417 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 200521 ns 162166 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 190666.5 ns 146708.5 ns 1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 185562 ns 149625 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166320.5 ns 167166 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1329104.5 ns 1323812.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324792 ns 1371958 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1328041 ns 1317937.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1337729.5 ns 1325562.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1221500 ns 1350174 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24687.5 ns 25292 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22000 ns 22500 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25667 ns 23146.5 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21250 ns 22979.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 254624.5 ns 352259 ns 0.72
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 130791 ns 173645.5 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 132062.5 ns 180041 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 179458 ns 119500 ns 1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 179520.5 ns 126334 ns 1.42
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1317432 ns 1470411 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 334 ns 1.25
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22902 ns 23380 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6208 ns 6125 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6709 ns 6229.5 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6917 ns 6708 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6291 ns 6167 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 240780 ns 256300 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4875 ns 5084 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4542 ns 5083 ns 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5500 ns 5083 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4417 ns 4292 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 229531.5 ns 256465.5 ns 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10083 ns 10209 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 9750 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10583 ns 10750 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10416 ns 10208 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1276460 ns 1354750 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1667 ns 1583 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1583 ns 1708 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22954 ns 22916 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5792 ns 5750 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5958 ns 5667 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5875 ns 6167 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5584 ns 5750 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 258626 ns 272343 ns 0.95
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6841563 ns 6820375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6377645.5 ns 6368417 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6542167 ns 6567000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7612146 ns 7648166 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213873 ns 214879 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24061541 ns 24083333.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21280959 ns 21351687.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21049937 ns 21140875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29725708.5 ns 29752125.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2091556 ns 2100360 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37658500 ns 37299645.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45669958 ns 34217771 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45878312.5 ns 45700125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38309416.5 ns 38021000 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5917 ns 5750 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 5583.5 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958.5 ns 6395.5 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5542 ns 5292 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 210091 ns 235350 ns 0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8041 ns 8167 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 8416.5 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8542 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8500 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 992082 ns 1060836 ns 0.94
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1552375 ns 1566292 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1278292 ns 1237250 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1634959 ns 1619208 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2176750 ns 2132958 ns 1.02
lenet(28, 28, 1, 128)/forward/GPU/CUDA 269882.5 ns 278998 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7890000 ns 7937625 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6564479 ns 6656917 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7223979 ns 7130604.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10470041 ns 10453333.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1748953.5 ns 1878437 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 375500 ns 370292 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 379708 ns 353124.5 ns 1.08
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 454583 ns 459083 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 34834 ns 23666 ns 1.47
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46336 ns 42541.5 ns 1.09
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 739834 ns 753083 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 821979 ns 809125 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1062042 ns 1063125 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 119270.5 ns 116979.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 274066 ns 239130.5 ns 1.15
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 412125 ns 397291 ns 1.04
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 305917 ns 212417 ns 1.44
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 305916 ns 288125 ns 1.06
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 757958 ns 752000 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44006 ns 44180 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 658583 ns 667583 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 525792 ns 474167 ns 1.11
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 523167 ns 531812.5 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973083 ns 973083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 189089 ns 194058 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 672875 ns 678250 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 676521 ns 667145.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 644292 ns 621709 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 672333 ns 646959 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131017.5 ns 133035 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2466812.5 ns 2484229 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2456312.5 ns 2543916.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2425417 ns 2480312.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2465333 ns 2471875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1103271 ns 1215811 ns 0.91
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 2333 ns 2791 ns 0.84
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2875 ns 2084 ns 1.38
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4500 ns 4333 ns 1.04
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3167 ns 3354 ns 0.94
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16213 ns 16281.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5208 ns 5375 ns 0.97
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5625 ns 5209 ns 1.08
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5667 ns 5500 ns 1.03
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5459 ns 5584 ns 0.98
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 184737.5 ns 201076.5 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1481125 ns 1457583 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1519875 ns 1497084 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1522875 ns 1498833 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1453417 ns 1436500 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40096 ns 41204 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5124333 ns 5117834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5295937.5 ns 5304542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5290354 ns 5300500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4993187.5 ns 4807333 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194429.5 ns 199725 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3666 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3666 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3625 ns 3709 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33150 ns 32858 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15208 ns 15250 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15375 ns 15000 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15416 ns 15292 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15250 ns 15083 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 349182 ns 377713 ns 0.92
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 93000 ns 70792 ns 1.31
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 103209 ns 71417 ns 1.45
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 92958 ns 71125 ns 1.31
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 92833 ns 70000 ns 1.33
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113197 ns 113374.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 315959 ns 318333 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 319270.5 ns 334916 ns 0.95
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 317000 ns 318083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317333 ns 318209 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 191577 ns 193117.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1000 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1084 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 959 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23307 ns 23866.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7833 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 7875 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8125 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 7875 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 244539 ns 261797 ns 0.93
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 531791 ns 512646 ns 1.04
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 517334 ns 479541 ns 1.08
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 578729.5 ns 566104 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 256916 ns 216667 ns 1.19
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130622 ns 130101 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1386812.5 ns 1405541 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1483208.5 ns 1481750 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1776708 ns 1758666 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 871125 ns 872625 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273552 ns 274250.5 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31822 ns 31596 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 5958 ns 6375 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6459 ns 5854.5 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6416 ns 6500 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns 6042 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 246678.5 ns 263141.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1774479 ns 1731916.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1782250.5 ns 1768000 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1777916 ns 1725583 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1766937 ns 1724459 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169504.5 ns 168363 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4354563 ns 4401542 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3899583 ns 4406313 ns 0.88
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4361500 ns 4361083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4355333 ns 4360083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1064911 ns 1173884.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 24479 ns 6583 ns 3.72
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7541 ns 6791 ns 1.11
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7833 ns 7062.5 ns 1.11
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 22208.5 ns 6791 ns 3.27
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19777 ns 20597 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 72854.5 ns 32792 ns 2.22
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 51667 ns 62083 ns 0.83
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 51833 ns 33292 ns 1.56
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 70542 ns 51084 ns 1.38
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 193123 ns 293465.5 ns 0.66
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17625 ns 18000 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18250 ns 17458 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 17708 ns 17916 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17250 ns 18042 ns 0.96
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18352 ns 18220 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53000 ns 53250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53250 ns 53292 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53542 ns 53583 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53375 ns 53416.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 317963.5 ns 340467.5 ns 0.93
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 107500 ns 75333 ns 1.43
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 107125 ns 75417 ns 1.42
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 105625 ns 75292 ns 1.40
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 97584 ns 74833 ns 1.30
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46786 ns 46370 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 323417 ns 324292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 327750 ns 342291.5 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 322667 ns 336708 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325000 ns 324667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 207825 ns 208689 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1504209 ns 1483500 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1545458 ns 1520542 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1549042 ns 1528333 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1478167 ns 1461958 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51382 ns 51330 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5122771 ns 5116916.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5291458 ns 5306417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5291125 ns 4956417 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5000125 ns 4985125.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 200987.5 ns 204511 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28167 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28125 ns 28292 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24367 ns 24159 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66375 ns 66584 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66583 ns 66208 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66375 ns 67583 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66375 ns 66208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 493214.5 ns 518001 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1497500 ns 1500667 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1150584 ns 935916 ns 1.23
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1142791.5 ns 1063395.5 ns 1.07
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2256875 ns 2253583 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 579142.5 ns 585024 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3080625.5 ns 3089125 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2682000 ns 2661333 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2729917 ns 2581104 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3656583 ns 3818625 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1939352 ns 1992242 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7890875 ns 7906625 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7897375 ns 8031000 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7904208 ns 7927541.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4815458 ns 4820333 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 138395.5 ns 134041 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 78917 ns 81459 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 132458.5 ns 82833 ns 1.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 140084 ns 81833 ns 1.71
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193872 ns 194356 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020209 ns 2010167 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1690750 ns 2043167 ns 0.83
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2025250 ns 2009750 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2006209 ns 2026792 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 742900 ns 794414 ns 0.94

This comment was automatically generated by workflow using github-action-benchmark.
