@JuliaRegistrator register
1ea272a
Registration pull request created: JuliaRegistries/General/120881
Tip: Release Notes
Did you know you can add release notes too? Add markdown-formatted text underneath the comment, after the text
"Release notes:", and it will be added to the registry PR. If TagBot is installed, it will also be added to the
release that TagBot creates.
To add them here just re-invoke and the PR will be updated.
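As an illustration of the tip above, a re-invocation comment carrying release notes could look like the following sketch. The heading and bullet under "Release notes:" are placeholder text, not the actual notes for this release.

```
@JuliaRegistrator register

Release notes:

## Breaking changes

- (placeholder) describe the change here
```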
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed; otherwise it can be done manually through the GitHub interface or from the command line with `git tag`.
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3958
ns4208
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4791
ns4834
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4792
ns5375
ns0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3958
ns4083
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
59494
ns58557
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10750
ns10625
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10959
ns10542
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10125
ns11375
ns0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10562.5
ns10083
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
417797.5
ns415171
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1125
ns1334
ns0.84
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1292
ns1209
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1208
ns1333.5
ns0.91
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1083
ns1208
ns0.90
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18173
ns17961
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4083
ns4084
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3417
ns3959
ns0.86
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4250
ns4333
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3709
ns4000
ns0.93
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
107683
ns107003.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
70750
ns70834
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
64000
ns64375
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
64250
ns64500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83042
ns80375
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36561
ns36906
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2030500
ns2031562.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2082541.5
ns2088542
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2089104
ns2093958
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2008667
ns1926833
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
193196.5
ns192315
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
140083
ns196625
ns0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
181291
ns195542
ns0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
181167
ns185209
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
185250
ns182375
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166362
ns166552
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1120708
ns1111896
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1119000
ns1118729.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1120041.5
ns1119708
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1124104
ns1130333.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
525948
ns514050
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3334
ns3500
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4125
ns3416
ns1.21
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
3729.5
ns4459
ns0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3542
ns3416.5
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70915
ns67303.5
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9125
ns9084
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9542
ns9750
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8708
ns9625
ns0.90
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8875
ns8625
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
475931.5
ns472568
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15062.5
ns15020.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15250
ns14666
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17437.5
ns18625
ns0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
15375
ns14875
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
53231
ns53079
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
216458.5
ns224750
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
225042
ns215104.5
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
213541.5
ns215917
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
222375
ns215083
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
270372.5
ns267364.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns750
ns0.67
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
750
ns709
ns1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
666
ns750
ns0.89
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns750
ns0.78
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17324
ns17115
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1500
ns1500
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1520.5
ns1792
ns0.85
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1750
ns1500
ns1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1500
ns1375
ns1.09
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
100368.5
ns99326.5
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
8125
ns7833
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
8125
ns7291
ns1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
7041
ns7083
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10667
ns9958
ns1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
22992
ns23212
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
234000
ns233458.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
239937.5
ns228125
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228833.5
ns228666
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
222271
ns214125
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
167254
ns164950.5
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3834
ns3875
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3958
ns3916
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3916
ns3917
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23377
ns23508
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16833
ns16959
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16667
ns17042
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
18375
ns17083
ns1.08
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16583
ns16708
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
160878
ns160457.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
610542
ns611125
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
613209
ns609042
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
634042
ns606834
ns1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
609000
ns605520.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113540.5
ns113172
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1430375
ns1423834
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1420292
ns1422458
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1446167
ns1424292
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1425542
ns1420334
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
210405
ns209423.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1076083
ns1082229.5
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
968959
ns970792
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1348187.5
ns1346208
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1290083
ns1300333
ns0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA
272167
ns270348.5
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5791000
ns5996021
ns0.97
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4597104
ns4506125
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4948917
ns4914416
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5522395.5
ns5507375
ns1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1076534
ns1074060
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23590
ns23487
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2166
ns2167
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2208
ns2167
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
173376
ns168855
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
3917
ns4167
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4208
ns4334
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5125
ns5041
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4083.5
ns3667
ns1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
65133.5
ns64100
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10979.5
ns11291
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11375
ns11875
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11667
ns12291
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11125
ns11000
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
444460.5
ns442842
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5959
ns6042
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6416
ns6104.5
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7209
ns7209
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6333
ns5708
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
51265
ns51573
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17125
ns17041.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17208
ns17292
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17709
ns17625
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17500
ns17250
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
297640
ns299598.5
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
541
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
542
ns625
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
625
ns542
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32574
ns32513
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8312.5
ns8458
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8395.5
ns9000
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
8834
ns9084
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9125
ns8458
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
156527
ns155298
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
96458
ns96666
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
96250
ns96708
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
95958
ns96292
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
97333
ns96375
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111569
ns111447.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
279917
ns278125
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
272666
ns275250
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
276958
ns274583.5
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
291791
ns277584
ns1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
184593
ns190076
ns0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3390792
ns3409792
ns0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3045416
ns3047666
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3031500
ns3023958
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3960417
ns3959958
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
572942
ns579376.5
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7593625
ns7632583
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7437042
ns7497667
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7444584
ns7451520.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8265979
ns8199583
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1334670
ns1349456
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
12605208
ns17500916.5
ns0.72
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17554084
ns17545437.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17556062
ns17599584
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14272042
ns14108083
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
24062729
ns23772875
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
34415959
ns34134729
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37185584
ns37435375
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34968250
ns34708708
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1858779
ns1860458
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
317027145.5
ns316659729.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
233784625
ns235623563
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
195359167
ns195619437
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
280568396
ns279867979.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13916432
ns13932935
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
273605875
ns273833833
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
269293459
ns267231583
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
251015375
ns255610333
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
332609042
ns329098667
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21834
ns21375
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
21750
ns22125
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25500
ns25292
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22916
ns21125
ns1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
95464
ns94977
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
118125
ns103542
ns1.14
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103417
ns103791
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104833.5
ns105125
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
104125
ns103250
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
509331.5
ns500332.5
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5417
ns5875
ns0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6500
ns6417
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6500
ns6750
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5583.5
ns6000
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
67886
ns68160.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14625
ns14500
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15292
ns15000
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15542
ns16500
ns0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14917
ns14584
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
472243.5
ns477825.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3101833
ns3101458
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2134333
ns2118542
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2303021
ns2321249.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
5007292
ns4650021
ns1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
586798
ns585427
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23546583
ns23564209
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18840521
ns18768041
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
18012083
ns17974229
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
36120167
ns35659708
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2918041
ns2760352.5
ns1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33910770.5
ns34076750.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27527417
ns27653896
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28620667
ns28752229
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41842979
ns40853625
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72812
ns74667
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74542
ns71833.5
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
74187.5
ns73521
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
72666
ns71770.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101631
ns100115
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
292354.5
ns292083
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
217084
ns224167
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
315166
ns297708
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
292458
ns205792
ns1.42
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
549955
ns537710
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11541.5
ns11750
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11791
ns11416
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12250
ns12542
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11417
ns12270.5
ns0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
70877.5
ns71148.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26084
ns26208
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26583
ns26875
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28645.5
ns27625
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27000
ns26500
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
471342.5
ns468928
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12083.5
ns12250
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12250
ns12166
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13292
ns13500
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12417
ns12042
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52255
ns52398
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25416
ns25250
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25750
ns26125
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
25791
ns26042
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26459
ns26000
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
302749
ns301242
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
178333
ns179104.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
179875
ns179750
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
180750
ns180583
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
180334
ns178625
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56120
ns55842.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
581770.5
ns582584
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
583250
ns591917
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
583208.5
ns594313
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
589771
ns583166
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
283667.5
ns280084
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6187.5
ns5958
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6333
ns6000
ns1.06
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6354.5
ns6500
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6250
ns5625
ns1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70397
ns70229
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13583
ns13875
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14000
ns14542
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14792
ns15187.5
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14709
ns14458
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
461030.5
ns456073.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1242791
ns1235292
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1300208
ns1304042
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1359354
ns1374021
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1186229.5
ns1092083
ns1.09
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301478
ns302409
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4116667
ns4120521
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4395875
ns4446875
ns0.99
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4529125
ns4623750
ns0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3917271.5
ns3716729.5
ns1.05
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1038425.5
ns1039016
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1792
ns1792
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1833
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1834
ns1833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1917
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23500
ns23753
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4834
ns4833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4834
ns4917
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4917
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4958
ns4875
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
188737.5
ns186693
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5584
ns5959
ns0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6084
ns6000
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7333
ns7083
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5959
ns5667
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
55083.5
ns54622.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10750
ns11167
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11209
ns11541
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11542
ns11250
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
11292
ns10542
ns1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
335254
ns325703
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
291
ns375
ns0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
334
ns334
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
375
ns333
ns1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22752
ns22898
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2750
ns2792
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2792
ns3041
ns0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2709
ns3041
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2750
ns2750
ns1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
159355.5
ns157339
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11084
ns11625
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11458
ns12083
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
12854.5
ns12417
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
12083
ns11229.5
ns1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57729
ns55735
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24167
ns24959
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24541
ns25042
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24916
ns25042
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
25167
ns25042
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
299680
ns288122.5
ns1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4250
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4125
ns4208
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4208
ns4208
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4209
ns4250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24651
ns24760
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16166
ns16333
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16083
ns16333
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16292
ns16500
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16042
ns16459
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
199395
ns193221.5
ns1.03
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5667 ns | 5791 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5709 ns | 5792 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5791 ns | 5791 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5791 ns | 5750 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33617 ns | 33178 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20020.5 ns | 20750 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20583 ns | 20708 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21083 ns | 20916 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21042 ns | 20708 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 175086.5 ns | 172900.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 407729 ns | 420188 ns | 0.97 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 380271 ns | 386937.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 483500 ns | 482833 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 105458.5 ns | 106250 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67085 ns | 67134 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 926875 ns | 865417 ns | 1.07 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 968750 ns | 948604 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1173375 ns | 1189500 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 378000 ns | 411770.5 ns | 0.92 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 188736 ns | 190610 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 132583 ns | 136750 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 130188 ns | 133396 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 129458 ns | 133166.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 137584 ns | 138854 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192853 ns | 192824 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920250.5 ns | 1917250 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1918583 ns | 1912124.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1924438 ns | 1920250 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920500 ns | 1942521 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 409280 ns | 395139 ns | 1.04 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 375 ns | 333 ns | 1.13 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21945 ns | 22003 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1833 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 171197.5 ns | 168855 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 6812.5 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6625 ns | 6750 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7916.5 ns | 8187.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7042 ns | 6334 ns | 1.11 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 58992.5 ns | 59378.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8791 ns | 9312.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 9209 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9291 ns | 9333 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 9083 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 311073 ns | 305200.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 110075500 ns | 112669000 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174018250 ns | 174180000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143516291 ns | 143189875 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 116009417 ns | 112387917 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5438117 ns | 5463061 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 617670521 ns | 616937396 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555321542 ns | 558474917 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 453019437.5 ns | 448891770.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 637539146 ns | 624388062.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34975009 ns | 38238112 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 654977875 ns | 665577792 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 666181396 ns | 667381166.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 629801020.5 ns | 616459979 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 742545875 ns | 747251209 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 61500 ns | 62750 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 52500 ns | 53834 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 53125 ns | 53458 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85458 ns | 82125 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37175.5 ns | 37037 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1912375 ns | 1926667 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1971000 ns | 1974291 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1984958.5 ns | 1980021 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1907791.5 ns | 1901875 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173650 ns | 171617 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 285104 ns | 265333 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 265292 ns | 269750 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 267750 ns | 269083.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 266625 ns | 264854.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130504 ns | 124229 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 686125 ns | 687584 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 704333 ns | 678833 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 683541.5 ns | 680125 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 663104 ns | 635854 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 717967 ns | 697446 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2234292 ns | 2242458 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2244771 ns | 2097875 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2244750 ns | 2254458 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2241333.5 ns | 2199750.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133396.5 ns | 132519 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5451812.5 ns | 5507312 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5487812.5 ns | 5516959 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5498042 ns | 5495292 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5562521 ns | 5486271 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 754203 ns | 737355 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 685959 ns | 678417 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 670541 ns | 671291 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 666167 ns | 668458 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 680000 ns | 682958 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46765 ns | 46914 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1817416 ns | 1824791.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1716895.5 ns | 1728375 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1744292 ns | 1718604.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2082750 ns | 2080500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220971 ns | 221890.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 70125 ns | 70750 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 53125 ns | 53125 ns | 1 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 52708 ns | 52916 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84625 ns | 82375 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28234 ns | 28168 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030854.5 ns | 2031792 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2081770.5 ns | 2096833.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2100958 ns | 2088000 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2007416 ns | 2001083.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 188927 ns | 187289.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13472458 ns | 13449750 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12508625 ns | 12528021.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12582124.5 ns | 12554687.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15073041.5 ns | 15230083 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512756.5 ns | 513617 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47011770.5 ns | 46862979 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41636000 ns | 41543521 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40969375 ns | 40829437.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 59058645.5 ns | 58532271 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3033111.5 ns | 2896866 ns | 1.05 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 73891958 ns | 74392375 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 67845145.5 ns | 90893292 ns | 0.75 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 92214500 ns | 92732000 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99774291.5 ns | 76658749.5 ns | 1.30 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 71166.5 ns | 70625 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 64583 ns | 64875 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 65791 ns | 64625 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84792 ns | 81917 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47424 ns | 47851 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1905937.5 ns | 1923187.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1967666.5 ns | 1983437.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977375 ns | 1973333 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1898333.5 ns | 1883833 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192864 ns | 193982.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 292 ns | 1.42 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32583 ns | 32956 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6041 ns | 6125 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6125 ns | 6416 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6459 ns | 6375 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 5875 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 172656.5 ns | 176118.5 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32498 ns | 32831 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2667 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2709 ns | 2916 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2875 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 2625 ns | 1.10 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 162027.5 ns | 165694 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 278479062 ns | 278326104 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339860437.5 ns | 340448937.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 309104833 ns | 308909437.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 282371084 ns | 278977666.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7112114 ns | 7109405 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 997282375 ns | 997951584 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 939909542 ns | 940941292 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 834322792 ns | 832217625 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1020744375 ns | 1009333917 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34065304 ns | 33893371 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1416221791.5 ns | 1394325042 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1324822042 ns | 1705224209 ns | 0.78 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1631228625 ns | 1693911291 ns | 0.96 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1675762813 ns | 1308776729 ns | 1.28 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1450812.5 ns | 1456667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1456521 ns | 1462958 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1455333 ns | 1454521 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1460167 ns | 1451416.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127677 ns | 127922 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5023459 ns | 5012417 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5018833 ns | 5028750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5024791.5 ns | 5027959 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5045271 ns | 5027187.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 588360 ns | 506424 ns | 1.16 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 157992750 ns | 157716375 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 148446708 ns | 136859042 ns | 1.08 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 164732625 ns | 164218250 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 153538583.5 ns | 151479417 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4886668 ns | 4879107 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 637312250 ns | 634203459 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 611560250 ns | 607766083 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 470585834 ns | 456653750 ns | 1.03 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 662978834 ns | 653815125 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16094164 ns | 17510307 ns | 0.92 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8954458 ns | 8926646 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9014875 ns | 9038916.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7941438 ns | 7947771 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10320875 ns | 10104354 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593595 ns | 1594648 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 37088334 ns | 36795042 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37925916.5 ns | 38004792 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34179167 ns | 34295916.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 39118729 ns | 37862042 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6471873.5 ns | 6452447 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47416 ns | 47334 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47292 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47459 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47458 ns | 47042 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18458 ns | 18361 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50042 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50291 ns | 50292 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50834 ns | 50542 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50458 ns | 50292 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 188984 ns | 194710.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6125 ns | 6750 ns | 0.91 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6708 ns | 6875 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 7709 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7042 ns | 6541 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 89761 ns | 94841 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9750 ns | 9542 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 10209 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10292 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10541 ns | 9958 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 516571.5 ns | 543786 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5917 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5958.5 ns | 6292 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7417 ns | 6750 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6417 ns | 5666 ns | 1.13 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 106479.5 ns | 105080 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12750 ns | 12583 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13042 ns | 13750 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13291 ns | 13375 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13270.5 ns | 13375 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 479931 ns | 521491.5 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1083 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 1083 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1083 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32924 ns | 33226 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 8125 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8500 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7958 ns | 7875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8041 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 200265 ns | 215927 ns | 0.93 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 22875 ns | 23125 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23041 ns | 23209 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23917 ns | 23250 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23208 ns | 23250 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18525 ns | 18682 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52208 ns | 52250 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52583 ns | 53125 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52625 ns | 52833 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52542 ns | 52250 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 267460 ns | 310779 ns | 0.86 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1451417 ns | 1455520.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1459084 ns | 1461770.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1459500 ns | 1464563 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1465416.5 ns | 1420375.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196174 ns | 196494.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5014166.5 ns | 5004917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5005062.5 ns | 4928042 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5014250 ns | 5012292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5037250 ns | 5010708.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 579761 ns | 619791 ns | 0.94 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3149500 ns | 3153125 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 1975646 ns | 2140000 ns | 0.92 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2323562.5 ns | 2307083.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4912270.5 ns | 4612500 ns | 1.06 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583087.5 ns | 580901 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24421562.5 ns | 24408833 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19801250.5 ns | 19732667 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18967959 ns | 19045729.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 37230000 ns | 36515125 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2963899 ns | 2842137 ns | 1.04 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34154937.5 ns | 34057083.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28340541 ns | 28326333 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28271812.5 ns | 28024667 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 43122000 ns | 42838792 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 140810292 ns | 140571271 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143457875 ns | 143484104 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 120969000 ns | 120774500 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 190332292 ns | 187527416 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22567410 ns | 22777810 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1439193417 ns | 1387998541 ns | 1.04 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1035778354.5 ns | 2164279542 ns | 0.48 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1029350563 ns | 1082658958.5 ns | 0.95 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 847160583 ns | 828842208.5 ns | 1.02 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118590973 ns | 118414466 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72979 ns | 79708.5 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 72229.5 ns | 72542 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75417 ns | 75520.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73416.5 ns | 73458 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 210693.5 ns | 238954.5 ns | 0.88 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 296396 ns | 286459 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 283542 ns | 295292 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 309000 ns | 302292 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 282667 ns | 240521 ns | 1.18 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1113011 ns | 1217040 ns | 0.91 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35428583 ns | 35202521 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35740146 ns | 35899625 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 31356458 ns | 31197042 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 39882791 ns | 39929583.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5846172 ns | 5845222 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148563000 ns | 147855667 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 152825542 ns | 153555375 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135772750.5 ns | 134579979 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 153516333 ns | 150196958.5 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34902152 ns | 34892998 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 112450083 ns | 114292542 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173734500 ns | 173321542 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143024292 ns | 143543334 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 97164708 ns | 93943084 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5471199 ns | 5434556 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 468949292 ns | 473131708 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 523211021 ns | 515810125.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440488146 ns | 442518292 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 623433833.5 ns | 614699291.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32285967 ns | 35179278 ns | 0.92 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 800549541 ns | 804964083 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 656663541.5 ns | 656838729.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 567293062.5 ns | 594341604 ns | 0.95 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 735113417 ns | 735687542 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1357292 ns | 1353083 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 1006709 ns | 1020917 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 993792 ns | 995292 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2076875 ns | 2104875 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 574648.5 ns | 569348 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2981104 ns | 2979875 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2614562.5 ns | 2615833 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2632479 ns | 2614124.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3749687.5 ns | 3699541.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1705197 ns | 1670621 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5826896 ns | 5794812.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5792500 ns | 5833354.5 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5792645.5 ns | 5800917 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2968021 ns | 2911437.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 7875 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7000 ns | 7000 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7000 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10875 ns | 10583 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24779 ns | 24801 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212208 ns | 222541.5 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 233625 ns | 221250 ns | 1.06 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220750 ns | 220833.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 209750 ns | 217041.5 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 246929 ns | 245776 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 452114625 ns | 451162917 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 205741771 ns | 205123625.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 181027291.5 ns | 178414666.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 462543917 ns | 454897875 ns | 1.02 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7673150.5 ns | 7671486 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1095771812.5 ns | 1093247396 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 925308125 ns | 925248250 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 875879750 ns | 837547083 ns | 1.05 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1183196167 ns | 1163363584 ns | 1.02 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26783812 ns | 26761104.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5125 ns | 5500 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5312.5 ns | 5458 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6375 ns | 6875 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6083 ns | 5291.5 ns | 1.15 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 143484 ns | 149694 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6875 ns | 6833.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7395.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583.5 ns | 7792 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 6875 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 569216 ns | 579102 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 583 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23876 ns | 23601 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8584 ns | 9166 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 9042 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9250 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 10166.5 ns | 0.91 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 202303 ns | 199458 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352875 ns | 354500 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 382959 ns | 352375 ns | 1.09 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352625 ns | 355687.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 351625 ns | 357479.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21342 ns | 21220 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 776270.5 ns | 824396 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 810812.5 ns | 778375 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 775187.5 ns | 777666 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 827583.5 ns | 821813 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 240060.5 ns | 231309.5 ns | 1.04 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 332770.5 ns | 331125 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 332583 ns | 344833 ns | 0.96 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 451459 ns | 453000 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 9959 ns | 10292 ns | 0.97 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18163 ns | 18084 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 714000 ns | 709750 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 727125 ns | 741354 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 999833 ns | 1003291.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26625 ns | 26479 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 238711 ns | 223194.5 ns | 1.07 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 374437 ns | 370292 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 347917 ns | 353396 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 440937.5 ns | 439292 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 28792 ns | 29916.5 ns | 0.96 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22488 ns | 22856 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 733000 ns | 727458 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 778479 ns | 790208 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1023541.5 ns | 1034916 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 89875 ns | 90395.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 205326 ns | 197661 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3354.5 ns | 3417 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3417 ns | 3625 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3750 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3417 ns | 1.10 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17749 ns | 17539 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4125 ns | 4208 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4292 ns | 4375 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4250 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4375 ns | 4125 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 235900.5 ns | 213017 ns | 1.11 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3417 ns | 3729 ns | 0.92 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4000 ns | 4083 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4041 ns | 4958 ns | 0.82 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 3417 ns | 1.21 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 174157.5 ns | 159837 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8042 ns | 8167 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 8583 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8667 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 8375 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1076434 ns | 1042725 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 207542 ns | 205667 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 213916 ns | 213208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 212833 ns | 213500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 202625 ns | 200458 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34097 ns | 34523 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 601333 ns | 645542 ns | 0.93 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 633916.5 ns | 671042 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621208 ns | 621458.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582666.5 ns | 580854.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 291620 ns | 298737.5 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1245375 ns | 1234437.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1251750 ns | 1277666 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1177937.5 ns | 1190750 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1207083 ns | 1152750 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207232 ns | 206763.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4566750 ns | 4518542 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4712249.5 ns | 4787042 ns | 0.98 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4457500 ns | 4473666.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4779979 ns | 5146541 ns | 0.93 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 927700.5 ns | 931436.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 2958 ns | 3667 ns | 0.81 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3917 ns | 3667 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3896 ns | 4041 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3833 ns | 2959 ns | 1.30 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 167597.5 ns | 185683 ns | 0.90 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7167 ns | 7167 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7708 ns | 7333 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7208 ns | 7667 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7459 ns | 6833 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 944745 ns | 942579 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1646750 ns | 1642000 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1186708 ns | 1207250 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1375541.5 ns | 1390000 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2434792 ns | 2427938 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214131 ns | 212907.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12360250 ns | 12368250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9584833 ns | 9590500 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9257792 ns | 9295438 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18118625 ns | 18019000 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1941495.5 ns | 1954764 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17409917 ns | 17359458 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14369603.5 ns | 14385104 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14347521 ns | 14370541 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21171916 ns | 21035500 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 85209 ns | 134083.5 ns | 0.64 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 138875 ns | 139416.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 134958 ns | 134958 ns | 1 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 132917 ns | 131334 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125576 ns | 125600 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2040229.5 ns | 2022916.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2026646 ns | 2047021 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2030000 ns | 2034334 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2046729 ns | 2039125 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 954388.5 ns | 948556 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1000 ns | 1458 ns | 0.69 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1292 ns | 1792 ns | 0.72 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 1791 ns | 3520.5 ns | 0.51 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1416 ns | 1229.5 ns | 1.15 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16301 ns | 16310 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2458 ns | 2542 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2583 ns | 2792 ns | 0.93 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2792 ns | 2875 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2875 ns | 2834 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 180190.5 ns | 182763.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8041 ns | 7958 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6959 ns | 6875 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 7125 ns | 6875 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10833 ns | 10583 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33324 ns | 33908 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217125 ns | 225041 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220125 ns | 221625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220542 ns | 220833 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207145.5 ns | 215291 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 294304 ns | 320916 ns | 0.92 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3666 ns | 3708 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3667 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22268 ns | 22605 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14542 ns | 14500 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14458 ns | 14625 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14500 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14250 ns | 14500 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 451646.5 ns | 456450 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 135084 ns | 142749.5 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 135167 ns | 91312 ns | 1.48 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145833 ns | 142292 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 135771 ns | 138792 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 124920.5 ns | 125035 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1931125 ns | 1919500 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1923875 ns | 1942104 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1933583.5 ns | 1929000 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1941584 ns | 1927250 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 895888.5 ns | 877064 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 869083.5 ns | 877458.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 814146 ns | 825458.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1222709 ns | 1230104 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 942729 ns | 955479 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 269464 ns | 269410 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2833167 ns | 2816333 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2528333.5 ns | 2528771 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3338750 ns | 3342458 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3399146 ns | 3349729.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1538408 ns | 1555391.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20750 ns | 14833 ns | 1.40 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15041.5 ns | 14875 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16229.5 ns | 18500 ns | 0.88 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14959 ns | 16875 ns | 0.89 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 129111.5 ns | 131035 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215916 ns | 227209 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229604.5 ns | 215791 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215709 ns | 216958 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224833 ns | 225250 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 586555.5 ns | 594103.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 219250 ns | 221333 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220020.5 ns | 222875 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222125 ns | 222583 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 219916 ns | 219042 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 244257 ns | 242007 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 529291.5 ns | 548917 ns | 0.96 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 509000 ns | 511041.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 509666 ns | 509917 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 509542 ns | 508458 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1272897.5 ns | 1234181 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3125 ns | 4083 ns | 0.77 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4500 ns | 4041 ns | 1.11 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 4542 ns | 4417 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3959 ns | 3666.5 ns | 1.08 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16759 ns | 17140 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7208 ns | 7209 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7208 ns | 7459 ns | 0.97 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7250 ns | 7333.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7334 ns | 7417 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 181468 ns | 183429.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16792 ns | 18833 ns | 0.89 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17062.5 ns | 16666 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17812.5 ns | 21083 ns | 0.84 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17250 ns | 18396 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 134619.5 ns | 131942 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 211583 ns | 245395.5 ns | 0.86 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213625 ns | 212292 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212812.5 ns | 214833 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213083 ns | 213708 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 895952 ns | 833743 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 3917 ns | 4208 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4833 ns | 4833 ns | 1 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4625 ns | 4916.5 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4625 ns | 3854.5 ns | 1.20 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 212453 ns | 208168.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10166 ns | 10333 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10417 ns | 10459 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10875 ns | 11084 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10145.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 992994 ns | 994315 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3125 ns | 3458 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3709 ns | 3791 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4750 ns | 4042 ns | 1.18 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3916 ns | 3167 ns | 1.24 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 212054.5 ns | 209797 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7083 ns | 7416 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 7459 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583 ns | 8083.5 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7459 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1004688 ns | 997101.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23464771 ns | 23443625 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35060375 ns | 34805208 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37779167 ns | 37298500 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34969333 ns | 34536209 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1848833 ns | 1851929 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184464833.5 ns | 185954395.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 160073583.5 ns | 159888645.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 145086500 ns | 144873209 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 445100854 ns | 438754792 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16527443 ns | 16496173 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 271288729 ns | 269927937.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 263438959 ns | 259799312.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 302324416 ns | 298856875 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 496832583.5 ns | 487045354.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 181417 ns | 189541.5 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 185458 ns | 182167 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185750 ns | 183416.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181708 ns | 182375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 193313 ns | 187318 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 589438 ns | 636187.5 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631229 ns | 597458.5 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 598125 ns | 588459 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 590687.5 ns | 596146 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 966959 ns | 944443 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3877125 ns | 3952375 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3946625 ns | 4007646 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3651083.5 ns | 3594292 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5012833.5 ns | 4885708 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 530368 ns | 552348.5 ns | 0.96 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17988625 ns | 18061833 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18469458 ns | 18498208.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 17328979.5 ns | 17053770.5 ns | 1.02 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20374792 ns | 19733813 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2619767.5 ns | 2636788.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32351 ns | 32315 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9041 ns | 9145.5 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9541.5 ns | 9625 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9833 ns | 9291 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 8792 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 247867.5 ns | 247143.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 498558729 ns | 497882542 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 468495750 ns | 466893292 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 362160229 ns | 356555750 ns | 1.02 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 607173041 ns | 601192353.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12482436 ns | 12465773.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1885912604.5 ns | 1887759917 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1633604541 ns | 1627534167 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1504714375 ns | 1505961604 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2155903916.5 ns | 2123318791.5 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49283559 ns | 49303078 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1664666.5 ns | 1652917 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1200396 ns | 1209833 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1387542 ns | 1397667 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2441166 ns | 2460062.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216027 ns | 214417 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12783813 ns | 12745021 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9969333 ns | 9950208 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9630041 ns | 9693541 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18564625 ns | 18371500 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2024417 ns | 2028129 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17729000 ns | 17681833 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14689833 ns | 14711375 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14572562.5 ns | 14648250 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21460792 ns | 21429709 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26167 ns | 26167 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26167 ns | 26167 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26334 ns | 26167 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26166 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24291 ns | 23744 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67375 ns | 67208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66792 ns | 67208 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67250 ns | 67166 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66916 ns | 66916 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 376851.5 ns | 365755.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206292 ns | 206375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 213042 ns | 212666 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 212292 ns | 211542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200542 ns | 200291 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25875 ns | 25711 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 608438 ns | 655729 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631687.5 ns | 632000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622729.5 ns | 673667 ns | 0.92 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592459 ns | 630708 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 328754.5 ns | 322192 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 702583 ns | 683459 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 644542 ns | 682708 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 631083 ns | 691916.5 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 682250 ns | 680834 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131950 ns | 130902.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2262083 ns | 2242354.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2242917 ns | 2244709 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2231125 ns | 2244875.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2307979 ns | 2229125 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1167364 ns | 1093705 ns | 1.07 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17125 ns | 20396 ns | 0.84 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20083 ns | 16833 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18791 ns | 23020.5 ns | 0.82 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18041.5 ns | 19166 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132602 ns | 131648.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229500 ns | 265541.5 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218833 ns | 232167 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219792 ns | 264625 ns | 0.83 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 230333.5 ns | 259979 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 967555 ns | 939947 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23714 ns | 23249 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9583.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 9708 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 10041 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 9541 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 247044.5 ns | 242690 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5209 ns | 5542 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5812.5 ns | 5709 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6812.5 ns | 6667 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5916.5 ns | 5250 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 211718.5 ns | 206130.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7084 ns | 6709 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7459 ns | 7417 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7875 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 6708 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 739090.5 ns | 735324.5 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1917 ns | 2000 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2229.5 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2250 ns | 2125 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2292 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18219 ns | 17909 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6292 ns | 6375 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6417 ns | 6792 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6729.5 ns | 6875 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6584 ns | 6208 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 307391 ns | 303359 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 749208 ns | 751688 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748625 ns | 779292 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 746500 ns | 779395.5 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748625 ns | 776146 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21224.5 ns | 20845 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 803167 ns | 796792 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 792833 ns | 791166 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 792834 ns | 808708 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 813166 ns | 775292 ns | 1.05 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 271736 ns | 267264 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8125 ns | 8000 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7583 ns | 6687.5 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6959 ns | 6958 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10917 ns | 10458 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32567.5 ns | 32932 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232666 ns | 261062.5 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240625 ns | 237583 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227604 ns | 271396 ns | 0.84 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258125 ns | 252646 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 333854 ns | 331767 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9959 ns | 10250 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10709 ns | 10542 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10833 ns | 11208 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10271 ns | 10250 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 226295 ns | 218675.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24167 ns | 25000 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24729.5 ns | 24625 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24417 ns | 25583 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25354.5 ns | 24416 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1051998 ns | 1056250 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106630458.5 ns | 106355042 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117910875 ns | 117397229.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120489750 ns | 120585312.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117867166.5 ns | 117183084 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2630839 ns | 2657952 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 375572750 ns | 374187771 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 347200750 ns | 350821292 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 370237167 ns | 361003333 ns | 1.03 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 484151625 ns | 479876375 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15207487.5 ns | 15234863.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 607408041 ns | 604863708 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 591624416 ns | 773786667 ns | 0.76 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 811424250 ns | 812604291 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 961849167 ns | 770323375 ns | 1.25 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6834 ns | 6833 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6708 ns | 7084 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8041 ns | 8062.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7354 ns | 6250 ns | 1.18 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 213896 ns | 213616 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14000 ns | 13458 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15125 ns | 13875 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14458 ns | 14416 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13666 ns | 13625 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 993505 ns | 1017707 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 6208 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6145.5 ns | 6042 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7458 ns | 7145.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6312.5 ns | 5417 ns | 1.17 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 209272 ns | 208255 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12500 ns | 11958 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12625 ns | 12729.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13250 ns | 13250 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12250 ns | 12500 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 719970 ns | 723959 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5000 ns | 6209 ns | 0.81 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5667 ns | 6375 ns | 0.89 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 5500 ns | 6375 ns | 0.86 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5458 ns | 5500 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17137 ns | 16943 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15083 ns | 15250 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15459 ns | 15625 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15458 ns | 15625 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15583 ns | 15500 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 185445 ns | 186257 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23381 ns | 23245 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6291 ns | 6375 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6334 ns | 6375 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6520.5 ns | 6625 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6541 ns | 6187.5 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 227150.5 ns | 225046 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5750 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5875 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5834 ns | 5833 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24282 ns | 24205 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 23416.5 ns | 20875 ns | 1.12 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20542 ns | 21417 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21292 ns | 21541.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21416 ns | 21229.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 249310.5 ns | 246651 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 192603.5 ns | 194166.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 190208 ns | 200521 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 187125 ns | 190666.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 189437.5 ns | 185562 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167056.5 ns | 166320.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1339333.5 ns | 1329104.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1319750.5 ns | 1324792 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1298333 ns | 1328041 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1349625 ns | 1337729.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1248940 ns | 1221500 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22188 ns | 24687.5 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22167 ns | 22000 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23250 ns | 25667 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 30833 ns | 21250 ns | 1.45 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 318042 ns | 254624.5 ns | 1.25 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 175104 ns | 130791 ns | 1.34 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 129354 ns | 132062.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 147250 ns | 179458 ns | 0.82 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 180250 ns | 179520.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1355497.5 ns | 1317432 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 417 ns | 0.70 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23100 ns | 22902 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6167 ns | 6208 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6416 ns | 6709 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6583 ns | 6917 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6583 ns | 6291 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 245385 ns | 240780 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4208 ns | 4875 ns | 0.86 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4625 ns | 4542 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4833 ns | 5500 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4708 ns | 4417 ns | 1.07 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 232572 ns | 229531.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9583 ns | 10083 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10020.5 ns | 10375 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9791 ns | 10583 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10291.5 ns | 10416 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1286978.5 ns | 1276460 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1583 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1667 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1583 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1667 ns | 1584 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23645 ns | 22954 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5708 ns | 5792 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5750 ns | 5958 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5667 ns | 5875 ns | 0.96 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5667 ns | 5584 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 263109.5 ns | 258626 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6835750 ns | 6841563 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6400459 ns | 6377645.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6536604 ns | 6542167 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7672542 ns | 7612146 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215618 ns | 213873 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24116958 ns | 24061541 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21263041 ns | 21280959 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 20976375 ns | 21049937 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29871542 ns | 29725708.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2094351.5 ns | 2091556 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37551959 ns | 37658500 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34396208.5 ns | 45669958 ns | 0.75 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45713375 ns | 45878312.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49651167 ns | 38309416.5 ns | 1.30 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5583 ns | 5917 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6042 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6625 ns | 6958.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6625 ns | 5542 ns | 1.20 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 210693 ns | 210091 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 8041 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 8250 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8625 ns | 8500 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8250 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 993726 ns | 992082 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1570250 ns | 1552375 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1273479 ns | 1278292 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1626896 ns | 1634959 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2142333 ns | 2176750 ns | 0.98 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271789 ns | 269882.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7954709 ns | 7890000 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6282562.5 ns | 6564479 ns | 0.96 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7141958 ns | 7223979 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10525875 ns | 10470041 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1760839.5 ns | 1748953.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 377437.5 ns | 375500 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 378125 ns | 379708 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 450292 ns | 454583 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 30500 ns | 34834 ns | 0.88 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42718 ns | 46336 ns | 0.92 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 743209 ns | 739834 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 790458 ns | 821979 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1051750 ns | 1062042 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 123333 ns | 119270.5 ns | 1.03 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 280362 ns | 274066 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 415750 ns | 412125 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 305875 ns | 305917 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 306125 ns | 305916 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 757167 ns | 757958 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44026.5 ns | 44006 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 662333 ns | 658583 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 523625 ns | 525792 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 524208 ns | 523167 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973917 ns | 973083 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 188149 ns | 189089 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 698417 ns | 672875 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 669875 ns | 676521 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 674375 ns | 644292 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 683041.5 ns | 672333 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131691 ns | 131017.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2527000 ns | 2466812.5 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2445791.5 ns | 2456312.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2456458.5 ns | 2425417 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2515459 ns | 2465333 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1199048 ns | 1103271 ns | 1.09 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 1917 ns | 2333 ns | 0.82 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 2041.5 ns | 2875 ns | 0.71 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 2459 ns | 4500 ns | 0.55 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 2437.5 ns | 3167 ns | 0.77 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16312 ns | 16213 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5208 ns | 5208 ns | 1 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5500 ns | 5625 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5625 ns | 5667 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5479.5 ns | 5459 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 184945 ns | 184737.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1481291 ns | 1481125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1524125 ns | 1519875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1521750 ns | 1522875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1447604.5 ns | 1453417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39655 ns | 40096 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5139771 ns | 5124333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5014250 ns | 5295937.5 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5294625 ns | 5290354 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015729.5 ns | 4993187.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194949 ns | 194429.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3666 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3666 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3625 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3667 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33334 ns | 33150 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15291 ns | 15208 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15083 ns | 15375 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15292 ns | 15416 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15167 ns | 15250 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 349359.5 ns | 349182 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 94542 ns | 93000 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 103166 ns | 103209 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 103209 ns | 92958 ns | 1.11 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 95625 ns | 92833 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113041.5 ns | 113197 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 318084 ns | 315959 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 316917 ns | 319270.5 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 316666 ns | 317000 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 321750 ns | 317333 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 192326 ns | 191577 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 1084 ns | 0.88 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23389 ns | 23307 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7708 ns | 7792 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7916 ns | 8375 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7959 ns | 8125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8270.5 ns | 8000 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 246988.5 ns | 244539 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 534875 ns | 531791 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 514875 ns | 517334 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 572375 ns | 578729.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 256145.5 ns | 256916 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129558.5 ns | 130622 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1420041.5 ns | 1386812.5 ns | 1.02 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1466708.5 ns | 1483208.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1756250 ns | 1776708 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 902625 ns | 871125 ns | 1.04 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 276092.5 ns | 273552 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31832 ns | 31822 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6084 ns | 5958 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6542 ns | 6459 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6292 ns | 6416 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6292 ns | 6167 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 248681.5 ns | 246678.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1729313 ns | 1774479 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1725667 ns | 1782250.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1769167 ns | 1777916 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1772187.5 ns | 1766937 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168168 ns | 169504.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4416792 ns | 4354563 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4351145.5 ns | 3899583 ns | 1.12 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4368958 ns | 4361500 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4403479.5 ns | 4355333 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1091804.5 ns | 1064911 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7041.5 ns | 24479 ns | 0.29 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7333 ns | 7541 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7375 ns | 7833 ns | 0.94 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7375 ns | 22208.5 ns | 0.33 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20581 ns | 19777 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 32334 ns | 72854.5 ns | 0.44 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 62021 ns | 51667 ns | 1.20 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33333 ns | 51833 ns | 0.64 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 71833 ns | 70542 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 196104.5 ns | 193123 ns | 1.02 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17208 ns | 17625 ns | 0.98 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17520.5 ns | 18250 ns | 0.96 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 17875 ns | 17708 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17459 ns | 17250 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18509 ns | 18352 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 52875 ns | 53000 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53625 ns | 53250 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53541 ns | 53542 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53084 ns | 53375 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 318108.5 ns | 317963.5 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 104959 ns | 107500 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 107334 ns | 107125 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 107250 ns | 105625 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 101250 ns | 97584 ns | 1.04 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46996 ns | 46786 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324500 ns | 323417 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 325958 ns | 327750 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 323083 ns | 322667 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 327500 ns | 325000 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 208617.5 ns | 207825 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1506583 ns | 1504209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1549708 ns | 1545458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1549292 ns | 1549042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1480958 ns | 1478167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51270 ns | 51382 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5143666.5 ns | 5122771 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5297771 ns | 5291458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5293084 ns | 5291125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5004625.5 ns | 5000125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 201935.5 ns | 200987.5 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28125 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28167 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28187.5 ns | 28125 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24383 ns | 24367 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66666.5 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66333 ns | 66583 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66459 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66292 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 489192 ns | 493214.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1485833 ns | 1497500 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1144729 ns | 1150584 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1129875 ns | 1142791.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2267333 ns | 2256875 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 580996.5 ns | 579142.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3110979 ns | 3080625.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2747916.5 ns | 2682000 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2752750 ns | 2729917 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3882333 ns | 3656583 ns | 1.06 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1989937 ns | 1939352 ns | 1.03 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7919834 ns | 7890875 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7899375 ns | 7897375 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7923709 ns | 7904208 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4904167 ns | 4815458 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 77917 ns | 138395.5 ns | 0.56 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 139667 ns | 78917 ns | 1.77 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 140875 ns | 132458.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 133958 ns | 140084 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193313 ns | 193872 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2016625 ns | 2020209 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2021791 ns | 1690750 ns | 1.20 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2024750 ns | 2025250 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026750 ns | 2006209 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 747334.5 ns | 742900 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
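
For readers checking the table by hand: in every row the last value appears to be the first timing divided by the second (for example 18041.5 / 19166 ≈ 0.94), so a value above 1 means the newer run was slower. The Python sketch below recomputes that ratio for three rows copied from the table and lists the ones that exceed a cutoff. The `Row` class, the `flag_regressions` helper, and the 1.2 cutoff are illustrative assumptions, not part of github-action-benchmark or Lux.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One benchmark row: name plus the two timings in nanoseconds."""
    name: str
    current_ns: float   # first timing column
    previous_ns: float  # second timing column

    @property
    def ratio(self) -> float:
        # Ratio as it appears in the table: first timing / second timing.
        return self.current_ns / self.previous_ns


def flag_regressions(rows, threshold=1.2):
    """Return names of rows whose ratio exceeds `threshold` (hypothetical cutoff)."""
    return [r.name for r in rows if r.ratio > threshold]


# Three rows copied from the table above.
rows = [
    Row("layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)", 30833, 21250),
    Row("bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)", 7041.5, 24479),
    Row("batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)", 625, 542),
]

for r in rows:
    print(f"{r.name}: {r.ratio:.2f}")

print("flagged:", flag_regressions(rows))
```

Many of the sub-microsecond CPU rows swing by 20 to 30 percent in either direction between runs, so a single cutoff like this is probably best read as a prompt to re-run the benchmark rather than as proof of a regression.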