ef0d450
@JuliaRegistrator register subdir=lib/LuxCore
ef0d450
@JuliaRegistrator register subdir=lib/MLDataDevices
ef0d450
Registration pull request created: JuliaRegistries/General/120708
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
ef0d450
Registration pull request created: JuliaRegistries/General/120709
ef0d450
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4208
ns4291
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4834
ns3958
ns1.22
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5375
ns5125
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4083
ns4250
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
58557
ns60770
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10625
ns10250
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10542
ns10125
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11375
ns10333
ns1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10083
ns10334
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
415171
ns423675
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1334
ns1125
ns1.19
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1209
ns1166
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1333.5
ns1229.5
ns1.08
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1208
ns1250
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
17961
ns17992
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4084
ns4250
ns0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3959
ns4000
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4333
ns4167
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4000
ns3958
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
107003.5
ns109284
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
70834
ns57417
ns1.23
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
64375
ns38208
ns1.68
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
64500
ns46375
ns1.39
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
80375
ns80167
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36906
ns36667.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2031562.5
ns2021709
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2088542
ns2097000
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2093958
ns2077875
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1926833
ns2001000
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
192315
ns195812
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
196625
ns145166.5
ns1.35
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
195542
ns142666
ns1.37
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
185209
ns146500
ns1.26
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
182375
ns144167
ns1.27
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166552
ns165803
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1111896
ns1104750
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1118729.5
ns1156062
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1119708
ns1104750
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1130333.5
ns1129458
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
514050
ns527714
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3500
ns4000
ns0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3416
ns3625
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4459
ns4375
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3416.5
ns3459
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
67303.5
ns70555.5
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9084
ns9084
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9750
ns8709
ns1.12
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9625
ns9667
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8625
ns9167
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
472568
ns481518.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15020.5
ns15416
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
14666
ns16958
ns0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
18625
ns16791.5
ns1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14875
ns14792
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
53079
ns54315.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
224750
ns213958
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
215104.5
ns214042
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215917
ns214208
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
215083
ns214334
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
267364.5
ns273628
ns0.98
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
750
ns500
ns1.50
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
709
ns583
ns1.22
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns667
ns1.12
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
750
ns583.5
ns1.29
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17115
ns17264
ns0.99
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1500
ns1500
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1792
ns1625
ns1.10
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1500
ns1792
ns0.84
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1375
ns1708
ns0.81
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
99326.5
ns102318
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7833
ns7000
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
7291
ns5084
ns1.43
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
7083
ns5958
ns1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9958
ns9916
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23212
ns23961
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
233458.5
ns221542
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228125
ns229708.5
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228666
ns229667
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
214125
ns226542
ns0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
164950.5
ns170388
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3916
ns3958
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3875
ns3958
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23508
ns23385
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16959
ns16625
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17042
ns16500
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17083
ns17000
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16708
ns16833
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
160457.5
ns161544
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
611125
ns581791
ns1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
609042
ns578709
ns1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
606834
ns569958
ns1.06
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
605520.5
ns572333.5
ns1.06
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113172
ns113621
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1423834
ns1428958
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1422458
ns1421292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1424292
ns1415833
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1420334
ns1420000
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
209423.5
ns210533
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1082229.5
ns1081750
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
970792
ns938708
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1346208
ns1353291.5
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1300333
ns1296666
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
270348.5
ns269675
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5996021
ns5971292
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4506125
ns4530771.5
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4914416
ns4949917
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5507375
ns5624041
ns0.98
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1074060
ns1072622
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23487
ns23468
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2208
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
168855
ns169303
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4167
ns4167
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4334
ns4208
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5041
ns4708
ns1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3667
ns4125
ns0.89
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
64100
ns66233.5
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11291
ns11125
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11875
ns11250
ns1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12291
ns12000
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11000
ns10792
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
442842
ns452338
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6042
ns6292
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6104.5
ns6417
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7209
ns7604.5
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5708
ns5833
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
51573
ns52542
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17041.5
ns18583
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17292
ns17500
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17625
ns18833
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17250
ns16833
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
299598.5
ns301964.5
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns583
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
542
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32513
ns32911
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8458
ns8625
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9000
ns8542
ns1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9084
ns9125
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8458
ns8917
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
155298
ns160010
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
96666
ns64500
ns1.50
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
96708
ns64666
ns1.50
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
96292
ns64500
ns1.49
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
96375
ns64500
ns1.49
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111447.5
ns112101
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
278125
ns279458
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
275250
ns288583
ns0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
274583.5
ns273583
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
277584
ns286083
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
190076
ns185547.5
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3409792
ns3376750.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3047666
ns2898291.5
ns1.05
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3023958
ns3024854
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3959958
ns3941104
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
579376.5
ns581323
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7632583
ns7603583
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7497667
ns7358750
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7451520.5
ns7466208
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8199583
ns8146792
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1349456
ns1318419
ns1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17500916.5
ns17484792
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17545437.5
ns17670999.5
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17599584
ns17533250
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14108083
ns9220187.5
ns1.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23772875
ns23603916
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
34134729
ns43639208
ns0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37435375
ns37125083
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34708708
ns34980187.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1860458
ns1854234
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
316659729.5
ns188207417
ns1.68
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
235623563
ns251666438
ns0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
195619437
ns194864208
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
279867979.5
ns434287708
ns0.64
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13932935
ns13931919
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
273833833
ns287943833
ns0.95
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
267231583
ns355406479.5
ns0.75
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
255610333
ns297803834
ns0.86
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
329098667
ns400767145.5
ns0.82
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21375
ns22458
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22125
ns22208
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25292
ns25041
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21125
ns22270.5
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
94977
ns96107.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103542
ns113166.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103791
ns104292
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
105125
ns105083
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103250
ns103812.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
500332.5
ns502678.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5875
ns6833
ns0.86
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6417
ns6479.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6750
ns7041.5
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6000
ns5958
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68160.5
ns68593
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14500
ns15000
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15000
ns15479
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16500
ns16333
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14584
ns14708.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
477825.5
ns475032.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3101458
ns3031167
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2118542
ns2061583
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2321249.5
ns2253209
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4650021
ns4505270.5
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
585427
ns586394
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23564209
ns23625708.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18768041
ns18333062.5
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17974229
ns17998916.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35659708
ns35608125.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2760352.5
ns2764773.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
34076750.5
ns33284000
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27653896
ns28078500
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28752229
ns28952938
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
40853625
ns41446187.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74667
ns72167
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
71833.5
ns81083
ns0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
73521
ns86562.5
ns0.85
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
71770.5
ns75479
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
100115
ns104806
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
292083
ns223458.5
ns1.31
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
224167
ns325166
ns0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
297708
ns320958
ns0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
205792
ns210500
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
537710
ns552193
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11750
ns11917
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11416
ns12583
ns0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12542
ns12708
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12270.5
ns12083
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71148.5
ns71752
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26208
ns26667
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26875
ns26583
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27625
ns28000
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26500
ns26500
ns1
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
468928
ns476956.5
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12250
ns11667
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12166
ns12333
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13500
ns12917
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12042
ns11834
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52398
ns53475
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25250
ns25792
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26125
ns25500
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26042
ns26500
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26000
ns26000
ns1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
301242
ns305905.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
179104.5
ns181458
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
179750
ns180541
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
180583
ns184604.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
178625
ns179667
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
55842.5
ns57257.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
582584
ns592917
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
591917
ns587687.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
594313
ns595750
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
583166
ns582791.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
280084
ns291107
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5958
ns8958
ns0.67
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6000
ns6583
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6500
ns8042
ns0.81
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5625
ns6375
ns0.88
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70229
ns71199.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13875
ns13916
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14542
ns14875
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15187.5
ns15459
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14458
ns13958.5
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
456073.5
ns465947
ns0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1235292
ns1219708
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1304042
ns1231750
ns1.06
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1374021
ns1269667
ns1.08
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1092083
ns1009666
ns1.08
batchedmm(512, Bsize=4)/forward/GPU/CUDA
302409
ns300921
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4120521
ns4103750
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4446875
ns4571833
ns0.97
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4623750
ns4574959
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3716729.5
ns3707208
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1039016
ns1038858
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1792
ns1834
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1917
ns1875
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23753
ns23656
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4833
ns4875
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4917
ns4792
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4875
ns4917
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
186693
ns190147.5
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5959
ns5375
ns1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6000
ns5708.5
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7083
ns6917
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5667
ns5437.5
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
54622.5
ns56411.5
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11167
ns10750
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11541
ns11000
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11250
ns11834
ns0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10542
ns10729.5
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
325703
ns336162
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
375
ns333
ns1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
334
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
292
ns375
ns0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
333
ns334
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22898
ns22819
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2792
ns2750
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3041
ns2750
ns1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3041
ns3042
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2750
ns2792
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
157339
ns159135.5
ns0.99
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11625 ns | 11458 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12083 ns | 11333 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12417 ns | 12750 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11229.5 ns | 11208 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 55735 ns | 58102 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24959 ns | 24750 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25042 ns | 24334 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25042 ns | 25084 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25042 ns | 24750 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 288122.5 ns | 298883.5 ns | 0.96 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4250 ns | 4209 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4209 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4291 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24760 ns | 24823 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16333 ns | 16084 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16333 ns | 15959 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16500 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16459 ns | 16167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 193221.5 ns | 197271 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5791 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5791 ns | 5916 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33178 ns | 34115 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20750 ns | 20500 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20708 ns | 20417 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20916 ns | 21250 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20708 ns | 20708 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 172900.5 ns | 178582.5 ns | 0.97 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 420188 ns | 423708.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 386937.5 ns | 366416.5 ns | 1.06 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 482833 ns | 484917 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 106250 ns | 103541 ns | 1.03 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67134 ns | 67022 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 865417 ns | 943375 ns | 0.92 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 948604 ns | 950687 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1189500 ns | 1197916.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 411770.5 ns | 330416.5 ns | 1.25 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 190610 ns | 193979 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 136750 ns | 80541.5 ns | 1.70 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 133396 ns | 81125 ns | 1.64 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 133166.5 ns | 81541.5 ns | 1.63 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 138854 ns | 80479.5 ns | 1.73 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192824 ns | 194031 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1917250 ns | 1919833 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1912124.5 ns | 1936958 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920250 ns | 1930229 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1942521 ns | 1923250 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 395139 ns | 400084 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22003 ns | 21834 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1750 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 168855 ns | 168563 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6812.5 ns | 6416 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6750 ns | 6166 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8187.5 ns | 7667 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6334 ns | 6709 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59378.5 ns | 61087.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9312.5 ns | 8959 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9209 ns | 8875 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9333 ns | 9250 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 9312.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 305200.5 ns | 309875.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 112669000 ns | 118672458 ns | 0.95 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174180000 ns | 182326458 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143189875 ns | 148081791.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 112387917 ns | 102035042 ns | 1.10 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5463061 ns | 5467326.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616937396 ns | 610447729.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 558474917 ns | 582022188 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 448891770.5 ns | 452913708.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 624388062.5 ns | 751418979 ns | 0.83 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38238112 ns | 34971564 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 665577792 ns | 646694167 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 667381166.5 ns | 688250333 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 616459979 ns | 583281666.5 ns | 1.06 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 747251209 ns | 744581417 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 62750 ns | 59000 ns | 1.06 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 53834 ns | 37792 ns | 1.42 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 53458 ns | 47750 ns | 1.12 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82125 ns | 83417 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37037 ns | 38231 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926667 ns | 1925854 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1974291 ns | 1987562.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980021 ns | 1779021 ns | 1.11 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1901875 ns | 1864125 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 171617 ns | 175192.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 265333 ns | 292250 ns | 0.91 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 269750 ns | 268916 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269083.5 ns | 269500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 264854.5 ns | 266000 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 124229 ns | 128884 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 687584 ns | 686771 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 678833 ns | 702187.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 680125 ns | 591083 ns | 1.15 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 635854 ns | 688958 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 697446 ns | 706872 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2242458 ns | 2268958 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2097875 ns | 2245875 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2254458 ns | 2101125 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2199750.5 ns | 2176375 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132519 ns | 133295.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5507312 ns | 5521229.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5516959 ns | 5587167 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5495292 ns | 5520666.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5486271 ns | 5493834 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 737355 ns | 748599 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 678417 ns | 642084 ns | 1.06 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 671291 ns | 648917 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 668458 ns | 636667 ns | 1.05 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 682958 ns | 635875 ns | 1.07 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46914 ns | 46696 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1824791.5 ns | 1822625 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1728375 ns | 1670333 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1718604.5 ns | 1719875 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2080500 ns | 2097416.5 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221890.5 ns | 221082 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 70750 ns | 57833 ns | 1.22 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 53125 ns | 38500 ns | 1.38 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 52916 ns | 46250 ns | 1.14 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82375 ns | 82750 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28168 ns | 28653 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2031792 ns | 2020167 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2096833.5 ns | 2105417 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2088000 ns | 2093958 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2001083.5 ns | 1999958.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 187289.5 ns | 190261 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13449750 ns | 13356563 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12528021.5 ns | 12441584 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12554687.5 ns | 12535208 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15230083 ns | 15154375 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 513617 ns | 512188.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 46862979 ns | 47248458 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41543521 ns | 42098688 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40829437.5 ns | 40986395.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58532271 ns | 58394208 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2896866 ns | 2891115 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74392375 ns | 74033603.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 90893292 ns | 68368417 ns | 1.33 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 92732000 ns | 90690875 ns | 1.02 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76658749.5 ns | 76143146 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 70625 ns | 58250 ns | 1.21 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 64875 ns | 38583 ns | 1.68 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 64625 ns | 47625 ns | 1.36 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81917 ns | 79125 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47851 ns | 47024 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923187.5 ns | 1918250 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983437.5 ns | 1983396 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1973333 ns | 1965584 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1883833 ns | 1830750 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 193982.5 ns | 192100.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32956 ns | 32257 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6083 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6416 ns | 6000 ns | 1.07 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6375 ns | 6416 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5875 ns | 6104.5 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 176118.5 ns | 172267 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32831 ns | 31372 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2625 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2916 ns | 2625 ns | 1.11 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2875 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2666 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 165694 ns | 158332 ns | 1.05 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 278326104 ns | 283213208 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340448937.5 ns | 347751604 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 308909437.5 ns | 314361479.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 278977666.5 ns | 273430250 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7109405 ns | 7090888 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 997951584 ns | 992205416 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 940941292 ns | 964468250 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 832217625 ns | 838327667 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1009333917 ns | 1152689375 ns | 0.88 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33893371 ns | 34106482 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1394325042 ns | 1303968312.5 ns | 1.07 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1705224209 ns | 1327504666.5 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1693911291 ns | 1629886334 ns | 1.04 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1308776729 ns | 1314925417 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1456667 ns | 1455709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1462958 ns | 1463125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1454521 ns | 1415166.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1451416.5 ns | 1410000 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127922 ns | 127607 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5012417 ns | 5015979 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5028750 ns | 5060792 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5027959 ns | 5051500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5027187.5 ns | 5009458 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 506424 ns | 574399.5 ns | 0.88 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 157716375 ns | 170351312 ns | 0.93 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 136859042 ns | 167663375 ns | 0.82 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 164218250 ns | 130848583.5 ns | 1.26 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 151479417 ns | 167905166.5 ns | 0.90 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4879107 ns | 4881672 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 634203459 ns | 618588292 ns | 1.03 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 607766083 ns | 577882000 ns | 1.05 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 456653750 ns | 497505667 ns | 0.92 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 653815125 ns | 647917125 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 17510307 ns | 16266169 ns | 1.08 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8926646 ns | 8910542 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9038916.5 ns | 9026291.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7947771 ns | 7927084 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10104354 ns | 9711125 ns | 1.04 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594648 ns | 1592738 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36795042 ns | 35730646 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38004792 ns | 38522375 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34295916.5 ns | 33553041 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37862042 ns | 37755625 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6452447 ns | 6512589 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47334 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47417 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47625 ns | 47334 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47042 ns | 47875 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18361 ns | 18035 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50042 ns | 52792 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50292 ns | 50292 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50542 ns | 50458 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50292 ns | 50667 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 194710.5 ns | 197012 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 6375 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 6250 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7709 ns | 7417 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6541 ns | 6750 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 94841 ns | 112280 ns | 0.84 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9584 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10209 ns | 9458 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10125 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9958 ns | 10209 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 543786 ns | 615930.5 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5416 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6292 ns | 5791 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6750 ns | 7146 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5666 ns | 5959 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 105080 ns | 123840 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12583 ns | 12583 ns | 1 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13750 ns | 12750 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 13208 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 12708 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 521491.5 ns | 529723.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1042 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33226 ns | 32491 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 8000 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 7750 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 8209 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8041 ns | 7959 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 215927 ns | 209838 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23125 ns | 23417 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23209 ns | 23041 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23250 ns | 23584 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23250 ns | 23417 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18682 ns | 18029 ns | 1.04 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52250 ns | 54667 ns | 0.96 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 53125 ns | 52417 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52833 ns | 52667 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52250 ns | 52458 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 310779 ns | 299710 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1455520.5 ns | 1444833 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1461770.5 ns | 1449584 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1464563 ns | 1399209 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1420375.5 ns | 1396958.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196494.5 ns | 195765 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5004917 ns | 5000042 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4928042 ns | 5049833 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5012292 ns | 5044562 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5010708.5 ns | 5015291.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 619791 ns | 612366.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3153125 ns | 3043104 ns | 1.04 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2140000 ns | 2098583 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2307083.5 ns | 2313209 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4612500 ns | 4606709 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580901 ns | 580804.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24408833 ns | 24374458 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19732667 ns | 19110937.5 ns | 1.03 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19045729.5 ns | 18926833 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36515125 ns | 36250750 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2842137 ns | 2861963.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34057083.5 ns | 33972875 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28326333 ns | 28642167 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28024667 ns | 28092229 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42838792 ns | 41633541.5 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 140571271 ns | 141888875 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143484104 ns | 146034209 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 120774500 ns | 126705062.5 ns | 0.95 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 187527416 ns | 173781771 ns | 1.08 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22777810 ns | 22552094 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1387998541 ns | 1227732750 ns | 1.13 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 2164279542 ns | 839227916.5 ns | 2.58 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1082658958.5 ns | 739276458 ns | 1.46 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 828842208.5 ns | 683957250 ns | 1.21 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118414466 ns | 117875105 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 79708.5 ns | 73084 ns | 1.09 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 72542 ns | 74479 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75520.5 ns | 75750 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73458 ns | 74958 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 238954.5 ns | 240665.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 286459 ns | 280208.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 295292 ns | 288959 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 302292 ns | 193791 ns | 1.56 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 240521 ns | 192583 ns | 1.25 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1217040 ns | 1331151 ns | 0.91 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35202521 ns | 35557542 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35899625 ns | 36592625 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 31197042 ns | 32410750 ns | 0.96 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 39929583.5 ns | 40376458 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5845222 ns | 5838475 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147855667 ns | 148073500 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153555375 ns | 158619999.5 ns | 0.97 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 134579979 ns | 139542333.5 ns | 0.96 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 150196958.5 ns | 282659625 ns | 0.53 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34892998 ns | 34873454 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 114292542 ns | 120976041.5 ns | 0.94 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173321542 ns | 182674416.5 ns | 0.95 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143543334 ns | 147566209 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 93943084 ns | 105641958.5 ns | 0.89 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5434556 ns | 5456587 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 473131708 ns | 471084687.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 515810125.5 ns | 489605103.5 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 442518292 ns | 432706750 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 614699291.5 ns | 737367000 ns | 0.83 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35179278 ns | 32284178 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 804964083 ns | 707739104.5 ns | 1.14 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 656838729.5 ns | 677702687.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 594341604 ns | 572041062.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 735687542 ns | 735458208 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1353083 ns | 1303791.5 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 1020917 ns | 778750 ns | 1.31 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 995292 ns | 904854 ns | 1.10 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2104875 ns | 1945625 ns | 1.08 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 569348 ns | 581135.5 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2979875 ns | 2961271 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2615833 ns | 2515584 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2614124.5 ns | 2624334 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3699541.5 ns | 3695417 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1670621 ns | 1838423 ns | 0.91 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5794812.5 ns | 5788229.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5833354.5 ns | 5903625 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5800917 ns | 5805354.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2911437.5 ns | 2899667 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7875 ns | 7375 ns | 1.07 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7000 ns | 5250 ns | 1.33 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 7000 ns | 6167 ns | 1.14 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 9916 ns | 1.07 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24801 ns | 25653 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222541.5 ns | 212479.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221250 ns | 226833 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220833.5 ns | 220417 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217041.5 ns | 206167 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 245776 ns | 275653 ns | 0.89 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 451162917 ns | 307447667 ns | 1.47 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 205123625.5 ns | 279760625 ns | 0.73 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 178414666.5 ns | 198268687.5 ns | 0.90 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 454897875 ns | 308090500 ns | 1.48 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7671486 ns | 7673335 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1093247396 ns | 1074946146 ns | 1.02 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 925248250 ns | 1069981500 ns | 0.86 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 837547083 ns | 801953875 ns | 1.04 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1163363584 ns | 1147606167 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26761104.5 ns | 26674789 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5500 ns | 4958 ns | 1.11 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5458 ns | 5208 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6875 ns | 5958 ns | 1.15 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5291.5 ns | 5042 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 149694 ns | 169081.5 ns | 0.89 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6833.5 ns | 6833 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7395.5 ns | 6917 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 7625 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 7125 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 579102 ns | 666084 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 625 ns | 0.80 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 667 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23601 ns | 24582 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9166 ns | 9125 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9042 ns | 8459 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 9084 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10166.5 ns | 9041 ns | 1.12 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 199458 ns | 231180 ns | 0.86 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 354500 ns | 352416.5 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352375 ns | 351792 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 355687.5 ns | 354500 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 357479.5 ns | 352125 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21220 ns | 21300.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 824396 ns | 814416 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 778375 ns | 809021 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 777666 ns | 782042 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 821813 ns | 827334 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 231309.5 ns | 305499.5 ns | 0.76 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 331125 ns | 336479.5 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 344833 ns | 321125 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 453000 ns | 450500 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10292 ns | 10542 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18084 ns | 18195 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 709750 ns | 721208 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 741354 ns | 733229 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1003291.5 ns | 1007271 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26479 ns | 26666 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 223194.5 ns | 274145 ns | 0.81 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 370292 ns | 383062 ns | 0.97 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 353396 ns | 329312 ns | 1.07 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 439292 ns | 442417 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 29916.5 ns | 30792 ns | 0.97 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22856 ns | 22813 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 727458 ns | 737625 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 790208 ns | 785604 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1034916 ns | 1032042 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 90395.5 ns | 105375 ns | 0.86 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 197661 ns | 222871.5 ns | 0.89 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3417 ns | 3708 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3417 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3666 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3417 ns | 3583 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17539 ns | 17737 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4417 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4375 ns | 4209 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4333 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4125 ns | 4292 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 213017 ns | 278790 ns | 0.76 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3729 ns | 3791 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4083 ns | 3604.5 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4958 ns | 4145.5 ns | 1.20 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3417 ns | 3666.5 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 159837 ns | 207112 ns | 0.77 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8167 ns | 8125 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8583 ns | 8000 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 8542 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8458 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1042725 ns | 1220818 ns | 0.85 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 205667 ns | 203687.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 213208 ns | 210041 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 213500 ns | 210625 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200458 ns | 200708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34523 ns | 34937 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645542 ns | 645270.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671042 ns | 631770.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621458.5 ns | 622458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 580854.5 ns | 630750 ns | 0.92 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 298737.5 ns | 343085 ns | 0.87 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1234437.5 ns | 1001750 ns | 1.23 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1277666 ns | 1034729 ns | 1.23 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1190750 ns | 956333 ns | 1.25 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1152750 ns | 879958 ns | 1.31 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206763.5 ns | 207672.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4518542 ns | 4524208 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4787042 ns | 4821708 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4473666.5 ns | 4482250 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5146541 ns | 5132979 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 931436.5 ns | 922465 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3667 ns | 3666 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 3292 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4041 ns | 3417 ns | 1.18 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2959 ns | 3583 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 185683 ns | 232276 ns | 0.80 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7167 ns | 7292 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 6792 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7500 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6833 ns | 6875 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 942579 ns | 1014308 ns | 0.93 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1642000 ns | 1651708 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1207250 ns | 1164875 ns | 1.04 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1390000 ns | 1344708 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2427938 ns | 2500875 ns | 0.97 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212907.5 ns | 214937 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12368250 ns | 12379084 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9590500 ns | 9615125.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9295438 ns | 9247041 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18019000 ns | 18054792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1954764 ns | 1946109 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17359458 ns | 17413000 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14385104 ns | 14415146.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14370541 ns | 14339250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21035500 ns | 21151646 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134083.5 ns | 134917 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 139416.5 ns | 88958 ns | 1.57 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 134958 ns | 91334 ns | 1.48 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 131334 ns | 87666 ns | 1.50 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125600 ns | 126488 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2022916.5 ns | 2026792 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2047021 ns | 2043625 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2034334 ns | 1766792 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2039125 ns | 2026459 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 948556 ns | 1034650 ns | 0.92 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1458 ns | 2770.5 ns | 0.53 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1792 ns | 1334 ns | 1.34 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3520.5 ns | 3208 ns | 1.10 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1229.5 ns | 3791 ns | 0.32 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16310 ns | 16389 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2542 ns | 2584 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2792 ns | 2459 ns | 1.14 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2875 ns | 2709 ns | 1.06 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2834 ns | 2791 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 182763.5 ns | 192723.5 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7958 ns | 7250 ns | 1.10 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 5208 ns | 1.32 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6875 ns | 5959 ns | 1.15 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 9959 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33908 ns | 34193 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225041 ns | 225250 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221625 ns | 227063 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220833 ns | 220708 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215291 ns | 213333 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 320916 ns | 312634.5 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22605 ns | 22321 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14500 ns | 14417 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14625 ns | 14250 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14416 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14500 ns | 14375 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 456450 ns | 475484 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 142749.5 ns | 134292 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91312 ns | 93667 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 142292 ns | 94354.5 ns | 1.51 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 138792 ns | 91958 ns | 1.51 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125035 ns | 125921 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919500 ns | 1924541.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1942104 ns | 1939333 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1929000 ns | 1709625 ns | 1.13 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1927250 ns | 1925042 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 877064 ns | 949226.5 ns | 0.92 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 877458.5 ns | 874708 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 825458.5 ns | 796250 ns | 1.04 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1230104 ns | 1220958 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 955479 ns | 963208 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 269410 ns | 277966 ns | 0.97 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2816333 ns | 2838542 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2528771 ns | 2538917 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3342458 ns | 3341125 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3349729.5 ns | 3415500 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1555391.5 ns | 1590492.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14833 ns | 17646 ns | 0.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 14875 ns | 16500 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18500 ns | 18042 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16875 ns | 17333 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131035 ns | 142389.5 ns | 0.92 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 227209 ns | 226250 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215791 ns | 239208.5 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216958 ns | 215666.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225250 ns | 227708 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 594103.5 ns | 648593.5 ns | 0.92 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221333 ns | 222666 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222875 ns | 220083 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222583 ns | 222792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 219042 ns | 221875 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 242007 ns | 275688.5 ns | 0.88 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 548917 ns | 564542 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 511041.5 ns | 507292 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 509917 ns | 506333 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 508458 ns | 559542 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1234181 ns | 1323540.5 ns | 0.93 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4083 ns | 4229.5 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4041 ns | 3958 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 4417 ns | 3916 ns | 1.13 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3666.5 ns | 4333 ns | 0.85 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17140 ns | 16749 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7209 ns | 7187 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7459 ns | 6917 ns | 1.08 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7333.5 ns | 7292 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7417 ns | 7416 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 183429.5 ns | 193558 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18833 ns | 19333.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16666 ns | 17167 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21083 ns | 19291 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18396 ns | 16959 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131942 ns | 145420.5 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 245395.5 ns | 223917 ns | 1.10 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212292 ns | 216437.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214833 ns | 215375 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213708 ns | 213812.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 833743 ns | 914033 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4208 ns | 4958 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4833 ns | 4250 ns | 1.14 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4916.5 ns | 4417 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3854.5 ns | 3917 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 208168.5 ns | 206416 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10333 ns | 10250 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10459 ns | 10000 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11084 ns | 10958 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10145.5 ns | 10000 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 994315 ns | 1027488.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3458 ns | 3833 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3791 ns | 3459 ns | 1.10 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4042 ns | 3416 ns | 1.18 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3167 ns | 3250 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 209797 ns | 236791.5 ns | 0.89 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7416 ns | 7417 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7459 ns | 7250 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8083.5 ns | 7625 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7459 ns | 7375 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 997101.5 ns | 1067899 ns | 0.93 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23443625 ns | 23463750.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34805208 ns | 43484791.5 ns | 0.80 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37298500 ns | 37835875 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34536209 ns | 34880875 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1851929 ns | 1833754 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 185954395.5 ns | 184463792 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159888645.5 ns | 172964124.5 ns | 0.92 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 144873209 ns | 146554521 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 438754792 ns | 410369375 ns | 1.07 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16496173 ns | 16525549 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 269927937.5 ns | 424815979 ns | 0.64 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 259799312.5 ns | 259769792 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 298856875 ns | 297288958 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 487045354.5 ns | 478383791 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 189541.5 ns | 183959 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182167 ns | 183375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183416.5 ns | 186187.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182375 ns | 183187.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 187318 ns | 205888.5 ns | 0.91 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 636187.5 ns | 602916.5 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 597458.5 ns | 596416.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 588459 ns | 592375 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 596146 ns | 596542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 944443 ns | 1054788 ns | 0.90 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3952375 ns | 3829562.5 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4007646 ns | 3998791.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3594292 ns | 3564812.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4885708 ns | 4550791.5 ns | 1.07 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 552348.5 ns | 532059.5 ns | 1.04 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 18061833 ns | 17302667 ns | 1.04 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18498208.5 ns | 18565313 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 17053770.5 ns | 16600312.5 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19733813 ns | 20208979.5 ns | 0.98 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2636788.5 ns | 2631431 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32315 ns | 33095 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9145.5 ns | 9083 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9625 ns | 9042 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9291 ns | 9458.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8792 ns | 9125 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 247143.5 ns | 266296 ns | 0.93 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 497882542 ns | 498097750 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 466893292 ns | 506743916 ns | 0.92 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 356555750 ns | 424015542 ns | 0.84 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 601192353.5 ns | 594637416 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12465773.5 ns | 12483759 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1887759917 ns | 1878936437.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1627534167 ns | 1662067875 ns | 0.98 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1505961604 ns | 1496755770.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2123318791.5 ns | 2214230167 ns | 0.96 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49303078 ns | 49527395 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1652917 ns | 1663166 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1209833 ns | 1177833 ns | 1.03 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1397667 ns | 1370041 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2460062.5 ns | 2349521 ns | 1.05 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214417 ns | 217522 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12745021 ns | 12726750 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9950208 ns | 10036417 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9693541 ns | 9643083 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18371500 ns | 18397833 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2028129 ns | 2037123 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17681833 ns | 17723584 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14711375 ns | 14827916 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14648250 ns | 14555416.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21429709 ns | 21415041 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26167 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26167 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26167 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26166 ns | 26209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23744 ns | 23706 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67208 ns | 67354.5 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67208 ns | 66792 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67166 ns | 68375 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66916 ns | 66875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 365755.5 ns | 393355.5 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206375 ns | 203458 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212666 ns | 209417 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211542 ns | 210084 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200291 ns | 199125 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25711 ns | 26245.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 655729 ns | 647916 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 632000 ns | 672375.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 673667 ns | 621792 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630708 ns | 593542 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 322192 ns | 351878.5 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 683459 ns | 679750 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 682708 ns | 657291 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 691916.5 ns | 595709 ns | 1.16 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 680834 ns | 632771 ns | 1.08 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 130902.5 ns | 131601.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2242354.5 ns | 2238750 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2244709 ns | 2300791 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2244875.5 ns | 2241896 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2229125 ns | 2244958 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1093705 ns | 1242570.5 ns | 0.88 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20396 ns | 18625 ns | 1.10 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16833 ns | 17979 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23020.5 ns | 18375 ns | 1.25 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19166 ns | 17104 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131648.5 ns | 144244 ns | 0.91 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 265541.5 ns | 256458 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232167 ns | 245646 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 264625 ns | 221750 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 259979 ns | 230416 ns | 1.13 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 939947 ns | 1056298 ns | 0.89 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 584 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23249 ns | 23741 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9583.5 ns | 9208 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9708 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10041 ns | 9458 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9541 ns | 9333 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 242690 ns | 257592.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5542 ns | 5125 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5709 ns | 5500 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6667 ns | 6395.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5250 ns | 5458 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 206130.5 ns | 231821.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6709 ns | 6833 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 6792 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 7458 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6917 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 735324.5 ns | 801589.5 ns | 0.92 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2000 ns | 2167 ns | 0.92 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2229.5 ns | 2000 ns | 1.11 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2125 ns | 2208 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2292 ns | 2375 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17909 ns | 17797 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6375 ns | 6375 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6792 ns | 6542 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6875 ns | 6667 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6208 ns | 6375 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 303359 ns | 330267.5 ns | 0.92 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 751688 ns | 748708 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 779292 ns | 756208 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 779395.5 ns | 752750 ns | 1.04 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 776146 ns | 753542 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 20845 ns | 20724 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 796792 ns | 792417 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 791166 ns | 796875 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 808708 ns | 786834 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 775292 ns | 808000 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 267264 ns | 297689.5 ns | 0.90 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8000 ns | 7250 ns | 1.10 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6687.5 ns | 5250 ns | 1.27 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6958 ns | 6042 ns | 1.15 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10458 ns | 10125 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32932 ns | 33074 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261062.5 ns | 228604.5 ns | 1.14 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 237583 ns | 251041 ns | 0.95 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 271396 ns | 227708 ns | 1.19 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 252646 ns | 226000 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 331767 ns | 362298.5 ns | 0.92 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10250 ns | 10209 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10542 ns | 10209 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11208 ns | 10458 ns | 1.07 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10250 ns | 9750 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 218675.5 ns | 252317 ns | 0.87 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25000 ns | 25334 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24625 ns | 24312.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25583 ns | 25959 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24416 ns | 24395.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1056250 ns | 1133104 ns | 0.93 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106355042 ns | 106928354 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117397229.5 ns | 126898666 ns | 0.93 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120585312.5 ns | 121692334 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117183084 ns | 117598792 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2657952 ns | 2629460 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 374187771 ns | 390743083 ns | 0.96 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 350821292 ns | 379904750 ns | 0.92 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 361003333 ns | 361277959 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 479876375 ns | 481946125 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15234863.5 ns | 15184946 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 604863708 ns | 754771020.5 ns | 0.80 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 773786667 ns | 597861750 ns | 1.29 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 812604291 ns | 748681771 ns | 1.09 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 770323375 ns | 760209125 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6833 ns | 6500 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7084 ns | 6667 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8062.5 ns | 8333 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6667 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 213616 ns | 239111 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13458 ns | 14125 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13875 ns | 14125 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14416 ns | 14437.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13625 ns | 13667 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1017707 ns | 1073718 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6208 ns | 5542 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5542 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7145.5 ns | 6395.5 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5417 ns | 5792 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 208255 ns | 235877.5 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11958 ns | 12208 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12729.5 ns | 12542 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 12750 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12500 ns | 12166 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 723959 ns | 781667 ns | 0.93 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 6209 ns | 5709 ns | 1.09 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 6375 ns | 5437.5 ns | 1.17 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6375 ns | 5750 ns | 1.11 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5500 ns | 5833 ns | 0.94 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16943 ns | 16760 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15250 ns | 15417 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15625 ns | 15333 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15625 ns | 15500 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15500 ns | 15625 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 186257 ns | 199275.5 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23245 ns | 23515 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6333 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6167 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6417 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6187.5 ns | 6333 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 225046 ns | 240257 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5875 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 6083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5875 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24205 ns | 24789 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 20875 ns | 20958 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21417 ns | 20958.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21541.5 ns | 21334 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21229.5 ns | 21000 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 246651 ns | 263523 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 194166.5 ns | 188417 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 200521 ns | 162166 ns | 1.24 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 190666.5 ns | 146708.5 ns | 1.30 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 185562 ns | 149625 ns | 1.24 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166320.5 ns | 167166 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1329104.5 ns | 1323812.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1324792 ns | 1371958 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1328041 ns | 1317937.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1337729.5 ns | 1325562.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1221500 ns | 1350174 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24687.5 ns | 25292 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22000 ns | 22500 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25667 ns | 23146.5 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21250 ns | 22979.5 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 254624.5 ns | 352259 ns | 0.72 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 130791 ns | 173645.5 ns | 0.75 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 132062.5 ns | 180041 ns | 0.73 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 179458 ns | 119500 ns | 1.50 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 179520.5 ns | 126334 ns | 1.42 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1317432 ns | 1470411 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 334 ns | 1.25 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22902 ns | 23380 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6208 ns | 6125 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6709 ns | 6229.5 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6917 ns | 6708 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6291 ns | 6167 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 240780 ns | 256300 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4875 ns | 5084 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4542 ns | 5083 ns | 0.89 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5500 ns | 5083 ns | 1.08 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4417 ns | 4292 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 229531.5 ns | 256465.5 ns | 0.89 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 10209 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9750 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10583 ns | 10750 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10416 ns | 10208 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1276460 ns | 1354750 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1667 ns | 1583 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1583 ns | 1708 ns | 0.93 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1584 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22954 ns | 22916 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5792 ns | 5750 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5958 ns | 5667 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5875 ns | 6167 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5584 ns | 5750 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 258626 ns | 272343 ns | 0.95 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6841563 ns | 6820375 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6377645.5 ns | 6368417 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6542167 ns | 6567000 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7612146 ns | 7648166 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213873 ns | 214879 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24061541 ns | 24083333.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21280959 ns | 21351687.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21049937 ns | 21140875 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29725708.5 ns | 29752125.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2091556 ns | 2100360 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37658500 ns | 37299645.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45669958 ns | 34217771 ns | 1.33 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45878312.5 ns | 45700125 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38309416.5 ns | 38021000 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5750 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5583.5 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6958.5 ns | 6395.5 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5542 ns | 5292 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 210091 ns | 235350 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8041 ns | 8167 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8416.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8542 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8500 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 992082 ns | 1060836 ns | 0.94 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1552375 ns | 1566292 ns | 0.99 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1278292 ns | 1237250 ns | 1.03 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1634959 ns | 1619208 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2176750 ns | 2132958 ns | 1.02 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 269882.5 ns | 278998 ns | 0.97 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7890000 ns | 7937625 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6564479 ns | 6656917 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7223979 ns | 7130604.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10470041 ns | 10453333.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1748953.5 ns | 1878437 ns | 0.93 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 375500 ns | 370292 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 379708 ns | 353124.5 ns | 1.08 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 454583 ns | 459083 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 34834 ns | 23666 ns | 1.47 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46336 ns | 42541.5 ns | 1.09 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 739834 ns | 753083 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 821979 ns | 809125 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1062042 ns | 1063125 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 119270.5 ns | 116979.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 274066 ns | 239130.5 ns | 1.15 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 412125 ns | 397291 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 305917 ns | 212417 ns | 1.44 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 305916 ns | 288125 ns | 1.06 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 757958 ns | 752000 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44006 ns | 44180 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 658583 ns | 667583 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 525792 ns | 474167 ns | 1.11 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 523167 ns | 531812.5 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973083 ns | 973083 ns | 1 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189089 ns | 194058 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 672875 ns | 678250 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 676521 ns | 667145.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 644292 ns | 621709 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 672333 ns | 646959 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131017.5 ns | 133035 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2466812.5 ns | 2484229 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2456312.5 ns | 2543916.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2425417 ns | 2480312.5 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2465333 ns | 2471875 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1103271 ns | 1215811 ns | 0.91 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2333 ns | 2791 ns | 0.84 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 2875 ns | 2084 ns | 1.38 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4500 ns | 4333 ns | 1.04 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3167 ns | 3354 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16213 ns | 16281.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5208 ns | 5375 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5625 ns | 5209 ns | 1.08 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5667 ns | 5500 ns | 1.03 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5459 ns | 5584 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 184737.5 ns | 201076.5 ns | 0.92 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1481125 ns | 1457583 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1519875 ns | 1497084 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1522875 ns | 1498833 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1453417 ns | 1436500 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40096 ns | 41204 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5124333 ns | 5117834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5295937.5 ns | 5304542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5290354 ns | 5300500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4993187.5 ns | 4807333 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194429.5 ns | 199725 ns | 0.97 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3666 ns | 3708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3666 ns | 3709 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3709 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33150 ns | 32858 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15208 ns | 15250 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15375 ns | 15000 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15416 ns | 15292 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15250 ns | 15083 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 349182 ns | 377713 ns | 0.92 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 93000 ns | 70792 ns | 1.31 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 103209 ns | 71417 ns | 1.45 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 92958 ns | 71125 ns | 1.31 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 92833 ns | 70000 ns | 1.33 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113197 ns | 113374.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 315959 ns | 318333 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 319270.5 ns | 334916 ns | 0.95 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 317000 ns | 318083 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317333 ns | 318209 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 191577 ns | 193117.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1084 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1084 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 959 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23307 ns | 23866.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7792 ns | 7833 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 7875 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8125 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 7875 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 244539 ns | 261797 ns | 0.93 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 531791 ns | 512646 ns | 1.04 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 517334 ns | 479541 ns | 1.08 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 578729.5 ns | 566104 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 256916 ns | 216667 ns | 1.19 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130622 ns | 130101 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1386812.5 ns | 1405541 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1483208.5 ns | 1481750 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1776708 ns | 1758666 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 871125 ns | 872625 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273552 ns | 274250.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31822 ns | 31596 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5958 ns | 6375 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6459 ns | 5854.5 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6416 ns | 6500 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6167 ns | 6042 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 246678.5 ns | 263141.5 ns | 0.94 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1774479 ns | 1731916.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1782250.5 ns | 1768000 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1777916 ns | 1725583 ns | 1.03 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1766937 ns | 1724459 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169504.5 ns | 168363 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4354563 ns | 4401542 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3899583 ns | 4406313 ns | 0.88 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4361500 ns | 4361083 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4355333 ns | 4360083 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1064911 ns | 1173884.5 ns | 0.91 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 24479 ns | 6583 ns | 3.72 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7541 ns | 6791 ns | 1.11 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7833 ns | 7062.5 ns | 1.11 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 22208.5 ns | 6791 ns | 3.27 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19777 ns | 20597 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 72854.5 ns | 32792 ns | 2.22 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 51667 ns | 62083 ns | 0.83 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 51833 ns | 33292 ns | 1.56 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 70542 ns | 51084 ns | 1.38 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 193123 ns | 293465.5 ns | 0.66 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17625 ns | 18000 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 18250 ns | 17458 ns | 1.05 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 17708 ns | 17916 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17250 ns | 18042 ns | 0.96 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18352 ns | 18220 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53000 ns | 53250 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53250 ns | 53292 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53542 ns | 53583 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53375 ns | 53416.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 317963.5 ns | 340467.5 ns | 0.93 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 107500 ns | 75333 ns | 1.43 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 107125 ns | 75417 ns | 1.42 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 105625 ns | 75292 ns | 1.40 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 97584 ns | 74833 ns | 1.30 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46786 ns | 46370 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 323417 ns | 324292 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327750 ns | 342291.5 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 322667 ns | 336708 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325000 ns | 324667 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 207825 ns | 208689 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1504209 ns | 1483500 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1545458 ns | 1520542 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1549042 ns | 1528333 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1478167 ns | 1461958 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51382 ns | 51330 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5122771 ns | 5116916.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5291458 ns | 5306417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5291125 ns | 4956417 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5000125 ns | 4985125.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 200987.5 ns | 204511 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28167 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28250 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28125 ns | 28292 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28167 ns | 28167 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24367 ns | 24159 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66375 ns | 66584 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66583 ns | 66208 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66375 ns | 67583 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66375 ns | 66208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 493214.5 ns | 518001 ns | 0.95 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1497500
ns1500667
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
1150584
ns935916
ns1.23
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1142791.5
ns1063395.5
ns1.07
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2256875
ns2253583
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
579142.5
ns585024
ns0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3080625.5
ns3089125
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2682000
ns2661333
ns1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2729917
ns2581104
ns1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3656583
ns3818625
ns0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
1939352
ns1992242
ns0.97
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7890875
ns7906625
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
7897375
ns8031000
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7904208
ns7927541.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4815458
ns4820333
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
138395.5
ns134041
ns1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
78917
ns81459
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
132458.5
ns82833
ns1.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
140084
ns81833
ns1.71
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
193872
ns194356
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2020209
ns2010167
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1690750
ns2043167
ns0.83
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2025250
ns2009750
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2006209
ns2026792
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
742900
ns794414
ns0.94
This comment was automatically generated by workflow using github-action-benchmark.