chore: bump version for release
avik-pal authored Dec 7, 2024
1 parent 51c0e47 commit 1ea272a
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "Lux"
uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
authors = ["Avik Pal <avikpal@mit.edu> and contributors"]
-version = "1.4.1-DEV"
+version = "1.4.1"

[deps]
ADTypes = "47edcb42-4c32-4615-8424-f2b9edc5f35b"

3 comments on commit 1ea272a

@avik-pal
Member Author

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/120881

Tip: Release Notes

Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a v1.4.1 -m "<description of version>" 1ea272a135ad1ab2f3acc2d570c462434da5c02e
git push origin v1.4.1
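
The tag name in the commands above is the `version` field from Project.toml prefixed with `v`. As a rough illustration only, the Python sketch below derives that tag name and rejects development versions such as `1.4.1-DEV`. The helper names `read_version` and `tag_name` are hypothetical, and the line-based lookup is a simplification for this sketch, not a full TOML parser.

```python
import re


def read_version(project_toml: str) -> str:
    """Return the top-level `version` value from Project.toml text.

    Simplified line-based lookup (an assumption of this sketch).
    """
    for line in project_toml.splitlines():
        if line.startswith("["):  # first table header: top-level keys are over
            break
        match = re.match(r'^version\s*=\s*"([^"]+)"\s*$', line)
        if match:
            return match.group(1)
    raise ValueError("no top-level version field found")


def tag_name(version: str) -> str:
    """Build the `v<version>` tag name, rejecting non-release versions."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"not a release version: {version!r}")
    return f"v{version}"


project = '''name = "Lux"
uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
version = "1.4.1"

[deps]
'''
print(tag_name(read_version(project)))
```

Passing the pre-bump value `1.4.1-DEV` to `tag_name` raises a `ValueError`, which is the kind of guard one might want before tagging by hand.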

@github-actions
Contributor

Lux Benchmarks
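
Each row in the table that follows gives a benchmark name, the timing on the current commit, the timing on the previous commit, and their ratio. The Ratio column is consistent with current divided by previous (for example, 3958 / 4208 ≈ 0.94), so values above 1 indicate a slowdown. A minimal Python sketch for scanning such rows, assuming they keep this plain-text layout; `parse_row`, `regressions`, and the 1.2 threshold are hypothetical choices for illustration, not part of the benchmark tooling:

```python
import re
from dataclasses import dataclass

# Benchmark name, then "<current> ns <previous> ns <ratio>" at the end of the line.
ROW_RE = re.compile(r"^(?P<name>.+?) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns [\d.]+$")


@dataclass(frozen=True)
class Row:
    name: str
    current_ns: float
    previous_ns: float

    @property
    def ratio(self) -> float:
        # Above 1.0 means the current commit is slower than the previous one.
        return self.current_ns / self.previous_ns


def parse_row(line: str) -> Row:
    """Split one flattened table row into name and timings."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark row: {line!r}")
    return Row(match["name"], float(match["cur"]), float(match["prev"]))


def regressions(rows, threshold=1.2):
    """Return rows whose recomputed ratio exceeds the (assumed) threshold."""
    return [row for row in rows if row.ratio > threshold]


sample = [
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3958 ns 4208 ns 0.94",
    "layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 3416 ns 1.21",
    "layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 292458 ns 205792 ns 1.42",
]
rows = [parse_row(line) for line in sample]
for row in regressions(rows):
    print(f"{row.name}: {row.ratio:.2f}x")
```

Many of the timings here are in the nanosecond-to-microsecond range, so a single outlying ratio may reflect measurement noise rather than a real regression; the threshold is best treated as a prompt for a rerun.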

Benchmark suite Current: 1ea272a Previous: ef0d450 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3958 ns 4208 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4791 ns 4834 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4792 ns 5375 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3958 ns 4083 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59494 ns 58557 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10750 ns 10625 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10959 ns 10542 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 11375 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10562.5 ns 10083 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 417797.5 ns 415171 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1334 ns 0.84
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1292 ns 1209 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1208 ns 1333.5 ns 0.91
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1083 ns 1208 ns 0.90
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18173 ns 17961 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4083 ns 4084 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3417 ns 3959 ns 0.86
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns 4333 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3709 ns 4000 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 107683 ns 107003.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70750 ns 70834 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 64000 ns 64375 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 64250 ns 64500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83042 ns 80375 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36561 ns 36906 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030500 ns 2031562.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082541.5 ns 2088542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2089104 ns 2093958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2008667 ns 1926833 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 193196.5 ns 192315 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 140083 ns 196625 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 181291 ns 195542 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 181167 ns 185209 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 185250 ns 182375 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166362 ns 166552 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1120708 ns 1111896 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1119000 ns 1118729.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1120041.5 ns 1119708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1124104 ns 1130333.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 525948 ns 514050 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3334 ns 3500 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 3416 ns 1.21
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 3729.5 ns 4459 ns 0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3542 ns 3416.5 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70915 ns 67303.5 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9125 ns 9084 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 9750 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 9625 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 8625 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 475931.5 ns 472568 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15062.5 ns 15020.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15250 ns 14666 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17437.5 ns 18625 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15375 ns 14875 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 53231 ns 53079 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216458.5 ns 224750 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 225042 ns 215104.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213541.5 ns 215917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222375 ns 215083 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 270372.5 ns 267364.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 750 ns 0.67
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 750 ns 709 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 750 ns 0.89
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 750 ns 0.78
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17324 ns 17115 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1500 ns 1500 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1520.5 ns 1792 ns 0.85
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1750 ns 1500 ns 1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1500 ns 1375 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 100368.5 ns 99326.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8125 ns 7833 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 8125 ns 7291 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7041 ns 7083 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10667 ns 9958 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 22992 ns 23212 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 234000 ns 233458.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 239937.5 ns 228125 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228833.5 ns 228666 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222271 ns 214125 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 167254 ns 164950.5 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3834 ns 3875 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23377 ns 23508 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16833 ns 16959 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16667 ns 17042 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 18375 ns 17083 ns 1.08
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16583 ns 16708 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 160878 ns 160457.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 610542 ns 611125 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 613209 ns 609042 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 634042 ns 606834 ns 1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 609000 ns 605520.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113540.5 ns 113172 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1430375 ns 1423834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1420292 ns 1422458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1446167 ns 1424292 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1425542 ns 1420334 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 210405 ns 209423.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1076083 ns 1082229.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 968959 ns 970792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1348187.5 ns 1346208 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1290083 ns 1300333 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 272167 ns 270348.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5791000 ns 5996021 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4597104 ns 4506125 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4948917 ns 4914416 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5522395.5 ns 5507375 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1076534 ns 1074060 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23590 ns 23487 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2166 ns 2167 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 173376 ns 168855 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3917 ns 4167 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4208 ns 4334 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5125 ns 5041 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4083.5 ns 3667 ns 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65133.5 ns 64100 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10979.5 ns 11291 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11375 ns 11875 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11667 ns 12291 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11125 ns 11000 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 444460.5 ns 442842 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5959 ns 6042 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6416 ns 6104.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7209 ns 7209 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6333 ns 5708 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51265 ns 51573 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17125 ns 17041.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17208 ns 17292 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17709 ns 17625 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17500 ns 17250 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 297640 ns 299598.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32574 ns 32513 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8312.5 ns 8458 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8395.5 ns 9000 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8834 ns 9084 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9125 ns 8458 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 156527 ns 155298 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 96458 ns 96666 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 96250 ns 96708 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 95958 ns 96292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 97333 ns 96375 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111569 ns 111447.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 279917 ns 278125 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 272666 ns 275250 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 276958 ns 274583.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 291791 ns 277584 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 184593 ns 190076 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3390792 ns 3409792 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3045416 ns 3047666 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3031500 ns 3023958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3960417 ns 3959958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 572942 ns 579376.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7593625 ns 7632583 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7437042 ns 7497667 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7444584 ns 7451520.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8265979 ns 8199583 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1334670 ns 1349456 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 12605208 ns 17500916.5 ns 0.72
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17554084 ns 17545437.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17556062 ns 17599584 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14272042 ns 14108083 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24062729 ns 23772875 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34415959 ns 34134729 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37185584 ns 37435375 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34968250 ns 34708708 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1858779 ns 1860458 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 317027145.5 ns 316659729.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 233784625 ns 235623563 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 195359167 ns 195619437 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 280568396 ns 279867979.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13916432 ns 13932935 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 273605875 ns 273833833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 269293459 ns 267231583 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 251015375 ns 255610333 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 332609042 ns 329098667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21834 ns 21375 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21750 ns 22125 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25500 ns 25292 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22916 ns 21125 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95464 ns 94977 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 118125 ns 103542 ns 1.14
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103417 ns 103791 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104833.5 ns 105125 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104125 ns 103250 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 509331.5 ns 500332.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5417 ns 5875 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6417 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6500 ns 6750 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5583.5 ns 6000 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 67886 ns 68160.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14625 ns 14500 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15292 ns 15000 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15542 ns 16500 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14917 ns 14584 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 472243.5 ns 477825.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3101833 ns 3101458 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2134333 ns 2118542 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2303021 ns 2321249.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 5007292 ns 4650021 ns 1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586798 ns 585427 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23546583 ns 23564209 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18840521 ns 18768041 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18012083 ns 17974229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36120167 ns 35659708 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2918041 ns 2760352.5 ns 1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33910770.5 ns 34076750.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27527417 ns 27653896 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28620667 ns 28752229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41842979 ns 40853625 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72812 ns 74667 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74542 ns 71833.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 74187.5 ns 73521 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72666 ns 71770.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101631 ns 100115 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 292354.5 ns 292083 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 217084 ns 224167 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 315166 ns 297708 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 292458 ns 205792 ns 1.42
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 549955 ns 537710 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11541.5 ns 11750 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11791 ns 11416 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12250 ns 12542 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11417 ns 12270.5 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70877.5 ns 71148.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26084 ns 26208 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26583 ns 26875 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28645.5 ns 27625 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27000 ns 26500 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 471342.5 ns 468928 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12083.5 ns 12250 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12250 ns 12166 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13292 ns 13500 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12417 ns 12042 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52255 ns 52398 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25416 ns 25250 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25750 ns 26125 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 25791 ns 26042 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26459 ns 26000 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 302749 ns 301242 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 178333 ns 179104.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 179875 ns 179750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 180750 ns 180583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180334 ns 178625 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56120 ns 55842.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 581770.5 ns 582584 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583250 ns 591917 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583208.5 ns 594313 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 589771 ns 583166 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 283667.5 ns 280084 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6187.5 ns 5958 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6333 ns 6000 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6354.5 ns 6500 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 5625 ns 1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70397 ns 70229 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13583 ns 13875 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14000 ns 14542 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14792 ns 15187.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14709 ns 14458 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 461030.5 ns 456073.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1242791 ns 1235292 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1300208 ns 1304042 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1359354 ns 1374021 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1186229.5 ns 1092083 ns 1.09
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301478 ns 302409 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4116667 ns 4120521 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4395875 ns 4446875 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4529125 ns 4623750 ns 0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3917271.5 ns 3716729.5 ns 1.05
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1038425.5 ns 1039016 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23500 ns 23753 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4834 ns 4917 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4958 ns 4875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 188737.5 ns 186693 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5584 ns 5959 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6084 ns 6000 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7333 ns 7083 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5959 ns 5667 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55083.5 ns 54622.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10750 ns 11167 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11209 ns 11541 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11542 ns 11250 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11292 ns 10542 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 335254 ns 325703 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 291 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 334 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22752 ns 22898 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2792 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 3041 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2709 ns 3041 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 159355.5 ns 157339 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11084 ns 11625 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11458 ns 12083 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12854.5 ns 12417 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12083 ns 11229.5 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57729 ns 55735 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24167 ns 24959 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24541 ns 25042 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24916 ns 25042 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25167 ns 25042 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299680 ns 288122.5 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4250 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4208 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4209 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24651 ns 24760 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16166 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16083 ns 16333 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16292 ns 16500 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16042 ns 16459 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 199395 ns 193221.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5667 ns 5791 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5709 ns 5792 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5791 ns 5791 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33617 ns 33178 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20020.5 ns 20750 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20583 ns 20708 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21083 ns 20916 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21042 ns 20708 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 175086.5 ns 172900.5 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 407729 ns 420188 ns 0.97
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 380271 ns 386937.5 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 483500 ns 482833 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 105458.5 ns 106250 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67085 ns 67134 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 926875 ns 865417 ns 1.07
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 968750 ns 948604 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1173375 ns 1189500 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 378000 ns 411770.5 ns 0.92
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 188736 ns 190610 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 132583 ns 136750 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 130188 ns 133396 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 129458 ns 133166.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 137584 ns 138854 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192853 ns 192824 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920250.5 ns 1917250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1918583 ns 1912124.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1924438 ns 1920250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920500 ns 1942521 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 409280 ns 395139 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21945 ns 22003 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 171197.5 ns 168855 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6042 ns 6812.5 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6625 ns 6750 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7916.5 ns 8187.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7042 ns 6334 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 58992.5 ns 59378.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8791 ns 9312.5 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8792 ns 9209 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9291 ns 9333 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9292 ns 9083 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 311073 ns 305200.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 110075500 ns 112669000 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174018250 ns 174180000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143516291 ns 143189875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 116009417 ns 112387917 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5438117 ns 5463061 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 617670521 ns 616937396 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555321542 ns 558474917 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453019437.5 ns 448891770.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 637539146 ns 624388062.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34975009 ns 38238112 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 654977875 ns 665577792 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666181396 ns 667381166.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 629801020.5 ns 616459979 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 742545875 ns 747251209 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 61500 ns 62750 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 52500 ns 53834 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 53125 ns 53458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85458 ns 82125 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37175.5 ns 37037 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1912375 ns 1926667 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1971000 ns 1974291 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1984958.5 ns 1980021 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1907791.5 ns 1901875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 173650 ns 171617 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 285104 ns 265333 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 265292 ns 269750 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 267750 ns 269083.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 266625 ns 264854.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130504 ns 124229 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 686125 ns 687584 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 704333 ns 678833 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 683541.5 ns 680125 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 663104 ns 635854 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 717967 ns 697446 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2234292 ns 2242458 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2244771 ns 2097875 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2244750 ns 2254458 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2241333.5 ns 2199750.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133396.5 ns 132519 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5451812.5 ns 5507312 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5487812.5 ns 5516959 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5498042 ns 5495292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5562521 ns 5486271 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 754203 ns 737355 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 685959 ns 678417 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 670541 ns 671291 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 666167 ns 668458 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 680000 ns 682958 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46765 ns 46914 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1817416 ns 1824791.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1716895.5 ns 1728375 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1744292 ns 1718604.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2082750 ns 2080500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 220971 ns 221890.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 70125 ns 70750 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 53125 ns 53125 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 52708 ns 52916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84625 ns 82375 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28234 ns 28168 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030854.5 ns 2031792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2081770.5 ns 2096833.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2100958 ns 2088000 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2007416 ns 2001083.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 188927 ns 187289.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13472458 ns 13449750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12508625 ns 12528021.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12582124.5 ns 12554687.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15073041.5 ns 15230083 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512756.5 ns 513617 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47011770.5 ns 46862979 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41636000 ns 41543521 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40969375 ns 40829437.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 59058645.5 ns 58532271 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3033111.5 ns 2896866 ns 1.05
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 73891958 ns 74392375 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 67845145.5 ns 90893292 ns 0.75
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 92214500 ns 92732000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99774291.5 ns 76658749.5 ns 1.30
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 71166.5 ns 70625 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 64583 ns 64875 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 65791 ns 64625 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84792 ns 81917 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47424 ns 47851 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1905937.5 ns 1923187.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1967666.5 ns 1983437.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977375 ns 1973333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1898333.5 ns 1883833 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192864 ns 193982.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 416 ns 292 ns 1.42
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32583 ns 32956 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6041 ns 6125 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6125 ns 6416 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6459 ns 6375 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 5875 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 172656.5 ns 176118.5 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 250 ns 1.33
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32498 ns 32831 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2667 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2709 ns 2916 ns 0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2625 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 162027.5 ns 165694 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 278479062 ns 278326104 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339860437.5 ns 340448937.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 309104833 ns 308909437.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 282371084 ns 278977666.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7112114 ns 7109405 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 997282375 ns 997951584 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 939909542 ns 940941292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 834322792 ns 832217625 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1020744375 ns 1009333917 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34065304 ns 33893371 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1416221791.5 ns 1394325042 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1324822042 ns 1705224209 ns 0.78
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1631228625 ns 1693911291 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1675762813 ns 1308776729 ns 1.28
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1450812.5 ns 1456667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1456521 ns 1462958 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1455333 ns 1454521 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1460167 ns 1451416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127677 ns 127922 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5023459 ns 5012417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5018833 ns 5028750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5024791.5 ns 5027959 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5045271 ns 5027187.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 588360 ns 506424 ns 1.16
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 157992750 ns 157716375 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 148446708 ns 136859042 ns 1.08
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 164732625 ns 164218250 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 153538583.5 ns 151479417 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4886668 ns 4879107 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 637312250 ns 634203459 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 611560250 ns 607766083 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 470585834 ns 456653750 ns 1.03
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 662978834 ns 653815125 ns 1.01
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16094164 ns 17510307 ns 0.92
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8954458 ns 8926646 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9014875 ns 9038916.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7941438 ns 7947771 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10320875 ns 10104354 ns 1.02
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593595 ns 1594648 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 37088334 ns 36795042 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37925916.5 ns 38004792 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34179167 ns 34295916.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 39118729 ns 37862042 ns 1.03
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6471873.5 ns 6452447 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47416 ns 47334 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47292 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47459 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47458 ns 47042 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18458 ns 18361 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50042 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50291 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50834 ns 50542 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50458 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 188984 ns 194710.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6125 ns 6750 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6708 ns 6875 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 7709 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7042 ns 6541 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 89761 ns 94841 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9750 ns 9542 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 10209 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10292 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10541 ns 9958 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 516571.5 ns 543786 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5917 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5958.5 ns 6292 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7417 ns 6750 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6417 ns 5666 ns 1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 106479.5 ns 105080 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12750 ns 12583 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13042 ns 13750 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13291 ns 13375 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13270.5 ns 13375 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 479931 ns 521491.5 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 958 ns 1083 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 959 ns 1083 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32924 ns 33226 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 8125 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 8500 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7958 ns 7875 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8041 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 200265 ns 215927 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 22875 ns 23125 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23041 ns 23209 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23917 ns 23250 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23208 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18525 ns 18682 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52208 ns 52250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52583 ns 53125 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52625 ns 52833 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52542 ns 52250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 267460 ns 310779 ns 0.86
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1451417 ns 1455520.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1459084 ns 1461770.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1459500 ns 1464563 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1465416.5 ns 1420375.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196174 ns 196494.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5014166.5 ns 5004917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5005062.5 ns 4928042 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5014250 ns 5012292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5037250 ns 5010708.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 579761 ns 619791 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3149500 ns 3153125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 1975646 ns 2140000 ns 0.92
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2323562.5 ns 2307083.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4912270.5 ns 4612500 ns 1.06
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583087.5 ns 580901 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24421562.5 ns 24408833 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19801250.5 ns 19732667 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18967959 ns 19045729.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37230000 ns 36515125 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2963899 ns 2842137 ns 1.04
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34154937.5 ns 34057083.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28340541 ns 28326333 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28271812.5 ns 28024667 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 43122000 ns 42838792 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 140810292 ns 140571271 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143457875 ns 143484104 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 120969000 ns 120774500 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 190332292 ns 187527416 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22567410 ns 22777810 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1439193417 ns 1387998541 ns 1.04
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1035778354.5 ns 2164279542 ns 0.48
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1029350563 ns 1082658958.5 ns 0.95
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 847160583 ns 828842208.5 ns 1.02
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118590973 ns 118414466 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72979 ns 79708.5 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72229.5 ns 72542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75417 ns 75520.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73416.5 ns 73458 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210693.5 ns 238954.5 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 296396 ns 286459 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 283542 ns 295292 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 309000 ns 302292 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 282667 ns 240521 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1113011 ns 1217040 ns 0.91
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35428583 ns 35202521 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35740146 ns 35899625 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 31356458 ns 31197042 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 39882791 ns 39929583.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5846172 ns 5845222 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148563000 ns 147855667 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152825542 ns 153555375 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 135772750.5 ns 134579979 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 153516333 ns 150196958.5 ns 1.02
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34902152 ns 34892998 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 112450083 ns 114292542 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173734500 ns 173321542 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143024292 ns 143543334 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 97164708 ns 93943084 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5471199 ns 5434556 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 468949292 ns 473131708 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 523211021 ns 515810125.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 440488146 ns 442518292 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 623433833.5 ns 614699291.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32285967 ns 35179278 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 800549541 ns 804964083 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 656663541.5 ns 656838729.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 567293062.5 ns 594341604 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 735113417 ns 735687542 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1357292 ns 1353083 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 1006709 ns 1020917 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 993792 ns 995292 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2076875 ns 2104875 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 574648.5 ns 569348 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2981104 ns 2979875 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2614562.5 ns 2615833 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2632479 ns 2614124.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3749687.5 ns 3699541.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1705197 ns 1670621 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5826896 ns 5794812.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5792500 ns 5833354.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5792645.5 ns 5800917 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2968021 ns 2911437.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8042 ns 7875 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7000 ns 7000 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7042 ns 7000 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10875 ns 10583 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24779 ns 24801 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212208 ns 222541.5 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233625 ns 221250 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220750 ns 220833.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 209750 ns 217041.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 246929 ns 245776 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 452114625 ns 451162917 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 205741771 ns 205123625.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 181027291.5 ns 178414666.5 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 462543917 ns 454897875 ns 1.02
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7673150.5 ns 7671486 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1095771812.5 ns 1093247396 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 925308125 ns 925248250 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 875879750 ns 837547083 ns 1.05
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1183196167 ns 1163363584 ns 1.02
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26783812 ns 26761104.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5125 ns 5500 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5312.5 ns 5458 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6375 ns 6875 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6083 ns 5291.5 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 143484 ns 149694 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 6833.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7395.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583.5 ns 7792 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 6875 ns 1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 569216 ns 579102 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23876 ns 23601 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8584 ns 9166 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8917 ns 9042 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9250 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9292 ns 10166.5 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 202303 ns 199458 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352875 ns 354500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 382959 ns 352375 ns 1.09
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352625 ns 355687.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351625 ns 357479.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21342 ns 21220 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 776270.5 ns 824396 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 810812.5 ns 778375 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 775187.5 ns 777666 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 827583.5 ns 821813 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 240060.5 ns 231309.5 ns 1.04
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 332770.5 ns 331125 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 332583 ns 344833 ns 0.96
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451459 ns 453000 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 9959 ns 10292 ns 0.97
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18163 ns 18084 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 714000 ns 709750 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 727125 ns 741354 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 999833 ns 1003291.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26625 ns 26479 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 238711 ns 223194.5 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 374437 ns 370292 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 347917 ns 353396 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 440937.5 ns 439292 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 28792 ns 29916.5 ns 0.96
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22488 ns 22856 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 733000 ns 727458 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 778479 ns 790208 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1023541.5 ns 1034916 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 89875 ns 90395.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 205326 ns 197661 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3354.5 ns 3417 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3417 ns 3625 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3625 ns 3750 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3750 ns 3417 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17749 ns 17539 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4125 ns 4208 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4292 ns 4375 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4250 ns 4250 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4375 ns 4125 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 235900.5 ns 213017 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3417 ns 3729 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4000 ns 4083 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4041 ns 4958 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 3417 ns 1.21
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 174157.5 ns 159837 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8042 ns 8167 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 8583 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8667 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8375 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1076434 ns 1042725 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 207542 ns 205667 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 213916 ns 213208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 212833 ns 213500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 202625 ns 200458 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34097 ns 34523 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601333 ns 645542 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 633916.5 ns 671042 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621208 ns 621458.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582666.5 ns 580854.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 291620 ns 298737.5 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1245375 ns 1234437.5 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1251750 ns 1277666 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1177937.5 ns 1190750 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1207083 ns 1152750 ns 1.05
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207232 ns 206763.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4566750 ns 4518542 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4712249.5 ns 4787042 ns 0.98
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4457500 ns 4473666.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4779979 ns 5146541 ns 0.93
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 927700.5 ns 931436.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 2958 ns 3667 ns 0.81
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 3667 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3896 ns 4041 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3833 ns 2959 ns 1.30
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 167597.5 ns 185683 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7167 ns 7167 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 7333 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7208 ns 7667 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7459 ns 6833 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 944745 ns 942579 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1646750 ns 1642000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1186708 ns 1207250 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1375541.5 ns 1390000 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2434792 ns 2427938 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214131 ns 212907.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12360250 ns 12368250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9584833 ns 9590500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9257792 ns 9295438 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18118625 ns 18019000 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1941495.5 ns 1954764 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17409917 ns 17359458 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14369603.5 ns 14385104 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14347521 ns 14370541 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21171916 ns 21035500 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85209 ns 134083.5 ns 0.64
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 138875 ns 139416.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 134958 ns 134958 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 132917 ns 131334 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125576 ns 125600 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2040229.5 ns 2022916.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026646 ns 2047021 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2030000 ns 2034334 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2046729 ns 2039125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 954388.5 ns 948556 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1000 ns 1458 ns 0.69
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1292 ns 1792 ns 0.72
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 1791 ns 3520.5 ns 0.51
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1416 ns 1229.5 ns 1.15
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16301 ns 16310 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2458 ns 2542 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2583 ns 2792 ns 0.93
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2792 ns 2875 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2875 ns 2834 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 180190.5 ns 182763.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8041 ns 7958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6959 ns 6875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7125 ns 6875 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10833 ns 10583 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33324 ns 33908 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 217125 ns 225041 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220125 ns 221625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220542 ns 220833 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207145.5 ns 215291 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 294304 ns 320916 ns 0.92
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3666 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22268 ns 22605 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14542 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14458 ns 14625 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14500 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14250 ns 14500 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 451646.5 ns 456450 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 135084 ns 142749.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 135167 ns 91312 ns 1.48
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145833 ns 142292 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 135771 ns 138792 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 124920.5 ns 125035 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1931125 ns 1919500 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1923875 ns 1942104 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1933583.5 ns 1929000 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1941584 ns 1927250 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 895888.5 ns 877064 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 869083.5 ns 877458.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 814146 ns 825458.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1222709 ns 1230104 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 942729 ns 955479 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 269464 ns 269410 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2833167 ns 2816333 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2528333.5 ns 2528771 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3338750 ns 3342458 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3399146 ns 3349729.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1538408 ns 1555391.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20750 ns 14833 ns 1.40
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15041.5 ns 14875 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16229.5 ns 18500 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14959 ns 16875 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 129111.5 ns 131035 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215916 ns 227209 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229604.5 ns 215791 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215709 ns 216958 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224833 ns 225250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 586555.5 ns 594103.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 219250 ns 221333 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220020.5 ns 222875 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222125 ns 222583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219916 ns 219042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 244257 ns 242007 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 529291.5 ns 548917 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 509000 ns 511041.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 509666 ns 509917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 509542 ns 508458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1272897.5 ns 1234181 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3125 ns 4083 ns 0.77
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4500 ns 4041 ns 1.11
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4542 ns 4417 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3959 ns 3666.5 ns 1.08
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16759 ns 17140 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7208 ns 7209 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7208 ns 7459 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7250 ns 7333.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7334 ns 7417 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 181468 ns 183429.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16792 ns 18833 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17062.5 ns 16666 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17812.5 ns 21083 ns 0.84
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17250 ns 18396 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 134619.5 ns 131942 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 211583 ns 245395.5 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213625 ns 212292 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212812.5 ns 214833 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213083 ns 213708 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 895952 ns 833743 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3917 ns 4208 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4833 ns 4833 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4625 ns 4916.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4625 ns 3854.5 ns 1.20
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 212453 ns 208168.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10166 ns 10333 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10417 ns 10459 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10875 ns 11084 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10583 ns 10145.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 992994 ns 994315 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3125 ns 3458 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3709 ns 3791 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4750 ns 4042 ns 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3916 ns 3167 ns 1.24
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 212054.5 ns 209797 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7083 ns 7416 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 7459 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 8083.5 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7459 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1004688 ns 997101.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23464771 ns 23443625 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35060375 ns 34805208 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37779167 ns 37298500 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34969333 ns 34536209 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1848833 ns 1851929 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184464833.5 ns 185954395.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 160073583.5 ns 159888645.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 145086500 ns 144873209 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 445100854 ns 438754792 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16527443 ns 16496173 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 271288729 ns 269927937.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 263438959 ns 259799312.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 302324416 ns 298856875 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 496832583.5 ns 487045354.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 181417 ns 189541.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 185458 ns 182167 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185750 ns 183416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181708 ns 182375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 193313 ns 187318 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 589438 ns 636187.5 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 631229 ns 597458.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 598125 ns 588459 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590687.5 ns 596146 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 966959 ns 944443 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3877125 ns 3952375 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3946625 ns 4007646 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3651083.5 ns 3594292 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5012833.5 ns 4885708 ns 1.03
batchedmm(128, Bsize=512)/forward/GPU/CUDA 530368 ns 552348.5 ns 0.96
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17988625 ns 18061833 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18469458 ns 18498208.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 17328979.5 ns 17053770.5 ns 1.02
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20374792 ns 19733813 ns 1.03
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2619767.5 ns 2636788.5 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32351 ns 32315 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9041 ns 9145.5 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9541.5 ns 9625 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9833 ns 9291 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9500 ns 8792 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 247867.5 ns 247143.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 498558729 ns 497882542 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 468495750 ns 466893292 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 362160229 ns 356555750 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 607173041 ns 601192353.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12482436 ns 12465773.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1885912604.5 ns 1887759917 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1633604541 ns 1627534167 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1504714375 ns 1505961604 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2155903916.5 ns 2123318791.5 ns 1.02
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49283559 ns 49303078 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1664666.5 ns 1652917 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1200396 ns 1209833 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1387542 ns 1397667 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2441166 ns 2460062.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216027 ns 214417 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12783813 ns 12745021 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9969333 ns 9950208 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9630041 ns 9693541 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18564625 ns 18371500 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2024417 ns 2028129 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17729000 ns 17681833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14689833 ns 14711375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14572562.5 ns 14648250 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21460792 ns 21429709 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26167 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26167 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26334 ns 26167 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24291 ns 23744 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67375 ns 67208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66792 ns 67208 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67250 ns 67166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66916 ns 66916 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 376851.5 ns 365755.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206292 ns 206375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 213042 ns 212666 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 212292 ns 211542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200542 ns 200291 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25875 ns 25711 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 608438 ns 655729 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 631687.5 ns 632000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622729.5 ns 673667 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 592459 ns 630708 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 328754.5 ns 322192 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 702583 ns 683459 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644542 ns 682708 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 631083 ns 691916.5 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 682250 ns 680834 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131950 ns 130902.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2262083 ns 2242354.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2242917 ns 2244709 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2231125 ns 2244875.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2307979 ns 2229125 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1167364 ns 1093705 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17125 ns 20396 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20083 ns 16833 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18791 ns 23020.5 ns 0.82
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18041.5 ns 19166 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132602 ns 131648.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229500 ns 265541.5 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218833 ns 232167 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219792 ns 264625 ns 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 230333.5 ns 259979 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 967555 ns 939947 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23714 ns 23249 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9583.5 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9833 ns 9708 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9875 ns 10041 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9833 ns 9541 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 247044.5 ns 242690 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5209 ns 5542 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5812.5 ns 5709 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6812.5 ns 6667 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5916.5 ns 5250 ns 1.13
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 211718.5 ns 206130.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7084 ns 6709 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7459 ns 7417 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7875 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 6708 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 739090.5 ns 735324.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1917 ns 2000 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2229.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2125 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2292 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18219 ns 17909 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6292 ns 6375 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6417 ns 6792 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6729.5 ns 6875 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6584 ns 6208 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 307391 ns 303359 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749208 ns 751688 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 748625 ns 779292 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 746500 ns 779395.5 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748625 ns 776146 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21224.5 ns 20845 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 803167 ns 796792 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 792833 ns 791166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 792834 ns 808708 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 813166 ns 775292 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 271736 ns 267264 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8125 ns 8000 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7583 ns 6687.5 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6959 ns 6958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10917 ns 10458 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32567.5 ns 32932 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232666 ns 261062.5 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240625 ns 237583 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227604 ns 271396 ns 0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258125 ns 252646 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 333854 ns 331767 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9959 ns 10250 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10709 ns 10542 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10833 ns 11208 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10271 ns 10250 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 226295 ns 218675.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24167 ns 25000 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24729.5 ns 24625 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24417 ns 25583 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25354.5 ns 24416 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1051998 ns 1056250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106630458.5 ns 106355042 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117910875 ns 117397229.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120489750 ns 120585312.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117867166.5 ns 117183084 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2630839 ns 2657952 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 375572750 ns 374187771 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 347200750 ns 350821292 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 370237167 ns 361003333 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 484151625 ns 479876375 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15207487.5 ns 15234863.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 607408041 ns 604863708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 591624416 ns 773786667 ns 0.76
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 811424250 ns 812604291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 961849167 ns 770323375 ns 1.25
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6834 ns 6833 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6708 ns 7084 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8041 ns 8062.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7354 ns 6250 ns 1.18
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 213896 ns 213616 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14000 ns 13458 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15125 ns 13875 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14458 ns 14416 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13666 ns 13625 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 993505 ns 1017707 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 6208 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6145.5 ns 6042 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7458 ns 7145.5 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6312.5 ns 5417 ns 1.17
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 209272 ns 208255 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12500 ns 11958 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12625 ns 12729.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 13250 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12250 ns 12500 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 719970 ns 723959 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5000 ns 6209 ns 0.81
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5667 ns 6375 ns 0.89
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 5500 ns 6375 ns 0.86
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5458 ns 5500 ns 0.99
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17137 ns 16943 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15083 ns 15250 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15459 ns 15625 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15458 ns 15625 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15583 ns 15500 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 185445 ns 186257 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23381 ns 23245 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns 6375 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6334 ns 6375 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6520.5 ns 6625 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6541 ns 6187.5 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 227150.5 ns 225046 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5750 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5834 ns 5833 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24282 ns 24205 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23416.5 ns 20875 ns 1.12
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 20542 ns 21417 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21292 ns 21541.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21416 ns 21229.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 249310.5 ns 246651 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 192603.5 ns 194166.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 190208 ns 200521 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 187125 ns 190666.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 189437.5 ns 185562 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167056.5 ns 166320.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1339333.5 ns 1329104.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1319750.5 ns 1324792 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1298333 ns 1328041 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1349625 ns 1337729.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1248940 ns 1221500 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22188 ns 24687.5 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22167 ns 22000 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23250 ns 25667 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 30833 ns 21250 ns 1.45
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 318042 ns 254624.5 ns 1.25
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 175104 ns 130791 ns 1.34
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 129354 ns 132062.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 147250 ns 179458 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 180250 ns 179520.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1355497.5 ns 1317432 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 417 ns 0.70
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23100 ns 22902 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6167 ns 6208 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6416 ns 6709 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6917 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6583 ns 6291 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 245385 ns 240780 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4208 ns 4875 ns 0.86
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 4542 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4833 ns 5500 ns 0.88
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4708 ns 4417 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 232572 ns 229531.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9583 ns 10083 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10020.5 ns 10375 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9791 ns 10583 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10291.5 ns 10416 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1286978.5 ns 1276460 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 1583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1667 ns 1584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23645 ns 22954 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5708 ns 5792 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5750 ns 5958 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5667 ns 5875 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5667 ns 5584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 263109.5 ns 258626 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6835750 ns 6841563 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6400459 ns 6377645.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6536604 ns 6542167 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7672542 ns 7612146 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215618 ns 213873 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24116958 ns 24061541 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21263041 ns 21280959 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20976375 ns 21049937 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29871542 ns 29725708.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2094351.5 ns 2091556 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37551959 ns 37658500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34396208.5 ns 45669958 ns 0.75
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45713375 ns 45878312.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49651167 ns 38309416.5 ns 1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5583 ns 5917 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6250 ns 6042 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6625 ns 6958.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6625 ns 5542 ns 1.20
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 210693 ns 210091 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8166 ns 8041 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9000 ns 8250 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8500 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8250 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 993726 ns 992082 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1570250 ns 1552375 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1273479 ns 1278292 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1626896 ns 1634959 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2142333 ns 2176750 ns 0.98
lenet(28, 28, 1, 128)/forward/GPU/CUDA 271789 ns 269882.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7954709 ns 7890000 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6282562.5 ns 6564479 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7141958 ns 7223979 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10525875 ns 10470041 ns 1.01
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1760839.5 ns 1748953.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 377437.5 ns 375500 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 378125 ns 379708 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 450292 ns 454583 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 30500 ns 34834 ns 0.88
batchedmm(128, Bsize=4)/forward/GPU/CUDA 42718 ns 46336 ns 0.92
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 743209 ns 739834 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 790458 ns 821979 ns 0.96
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1051750 ns 1062042 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 123333 ns 119270.5 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 280362 ns 274066 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 415750 ns 412125 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 305875 ns 305917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 306125 ns 305916 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 757167 ns 757958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44026.5 ns 44006 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 662333 ns 658583 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 523625 ns 525792 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 524208 ns 523167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973917 ns 973083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188149 ns 189089 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 698417 ns 672875 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 669875 ns 676521 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 674375 ns 644292 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 683041.5 ns 672333 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131691 ns 131017.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2527000 ns 2466812.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2445791.5 ns 2456312.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2456458.5 ns 2425417 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2515459 ns 2465333 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1199048 ns 1103271 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 1917 ns 2333 ns 0.82
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2041.5 ns 2875 ns 0.71
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 2459 ns 4500 ns 0.55
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2437.5 ns 3167 ns 0.77
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16312 ns 16213 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5208 ns 5208 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5500 ns 5625 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5625 ns 5667 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5479.5 ns 5459 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 184945 ns 184737.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1481291 ns 1481125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1524125 ns 1519875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1521750 ns 1522875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1447604.5 ns 1453417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39655 ns 40096 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5139771 ns 5124333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5014250 ns 5295937.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5294625 ns 5290354 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5015729.5 ns 4993187.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194949 ns 194429.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3666 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3625 ns 3666 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3625 ns 3625 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33334 ns 33150 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15291 ns 15208 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 15375 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15292 ns 15416 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15167 ns 15250 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 349359.5 ns 349182 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 94542 ns 93000 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 103166 ns 103209 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 103209 ns 92958 ns 1.11
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 95625 ns 92833 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113041.5 ns 113197 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318084 ns 315959 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 316917 ns 319270.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 316666 ns 317000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 321750 ns 317333 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 192326 ns 191577 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 958 ns 1000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 959 ns 1084 ns 0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23389 ns 23307 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7792 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7916 ns 8375 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7959 ns 8125 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8270.5 ns 8000 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 246988.5 ns 244539 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 534875 ns 531791 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 514875 ns 517334 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 572375 ns 578729.5 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 256145.5 ns 256916 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129558.5 ns 130622 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1420041.5 ns 1386812.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1466708.5 ns 1483208.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1756250 ns 1776708 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 902625 ns 871125 ns 1.04
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 276092.5 ns 273552 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31832 ns 31822 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6084 ns 5958 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6459 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6292 ns 6416 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6292 ns 6167 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 248681.5 ns 246678.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1729313 ns 1774479 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1725667 ns 1782250.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1769167 ns 1777916 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1772187.5 ns 1766937 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168168 ns 169504.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4416792 ns 4354563 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4351145.5 ns 3899583 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4368958 ns 4361500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4403479.5 ns 4355333 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1091804.5 ns 1064911 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7041.5 ns 24479 ns 0.29
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7333 ns 7541 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7375 ns 7833 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7375 ns 22208.5 ns 0.33
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20581 ns 19777 ns 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32334 ns 72854.5 ns 0.44
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 62021 ns 51667 ns 1.20
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33333 ns 51833 ns 0.64
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 71833 ns 70542 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 196104.5 ns 193123 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17208 ns 17625 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17520.5 ns 18250 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 17875 ns 17708 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17459 ns 17250 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18509 ns 18352 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 52875 ns 53000 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53625 ns 53250 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53541 ns 53542 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53084 ns 53375 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 318108.5 ns 317963.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 104959 ns 107500 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 107334 ns 107125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 107250 ns 105625 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 101250 ns 97584 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46996 ns 46786 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324500 ns 323417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 325958 ns 327750 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 323083 ns 322667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 327500 ns 325000 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 208617.5 ns 207825 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1506583 ns 1504209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1549708 ns 1545458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1549292 ns 1549042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1480958 ns 1478167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51270 ns 51382 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5143666.5 ns 5122771 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5297771 ns 5291458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5293084 ns 5291125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5004625.5 ns 5000125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 201935.5 ns 200987.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28125 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28167 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28187.5 ns 28125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24383 ns 24367 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66666.5 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66333 ns 66583 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66459 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66292 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 489192 ns 493214.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1485833 ns 1497500 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1144729 ns 1150584 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1129875 ns 1142791.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2267333 ns 2256875 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 580996.5 ns 579142.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3110979 ns 3080625.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2747916.5 ns 2682000 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2752750 ns 2729917 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3882333 ns 3656583 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1989937 ns 1939352 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7919834 ns 7890875 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7899375 ns 7897375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7923709 ns 7904208 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4904167 ns 4815458 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 77917 ns 138395.5 ns 0.56
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 139667 ns 78917 ns 1.77
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 140875 ns 132458.5 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 133958 ns 140084 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193313 ns 193872 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2016625 ns 2020209 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021791 ns 1690750 ns 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2024750 ns 2025250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026750 ns 2006209 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 747334.5 ns 742900 ns 1.01
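
In the rows above, the last column appears to be the first timing divided by the second, rounded to two decimals (for example, 32334 / 72854.5 ≈ 0.44). The following is a minimal sketch, assuming that column layout, for re-deriving the ratio from a single row when checking the table by hand. `parse_row` and `recomputed_ratio` are hypothetical helpers written for illustration; they are not part of github-action-benchmark or Lux.

```python
import re

# Assumed row layout: "<benchmark name> <first> ns <second> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<first>[\d.]+) ns\s+(?P<second>[\d.]+) ns\s+(?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split one flattened table row into (name, first_ns, second_ns, ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return m["name"], float(m["first"]), float(m["second"]), float(m["ratio"])


def recomputed_ratio(first_ns, second_ns):
    """Ratio of the first timing to the second, rounded as in the table."""
    return round(first_ns / second_ns, 2)


row = "bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32334 ns 72854.5 ns 0.44"
name, first_ns, second_ns, reported = parse_row(row)
print(name, recomputed_ratio(first_ns, second_ns), reported)
```

A recomputed value that differs from the reported one by more than rounding error would suggest the columns are laid out differently than assumed here.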

This comment was automatically generated by workflow using github-action-benchmark.
