Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* feat: softmax and logsoftmax jvp rules
* feat: add pooling rules
* test: logsoftmax and softmax forwarddiff rules
* fix: patch meanpool
* test: more tests fixed
Showing 7 changed files with 158 additions and 29 deletions.
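The commit message above names JVP (forward-mode) rules for softmax and logsoftmax. As a rough illustration of what such a rule computes, here is a minimal sketch in plain Python. It is not a rendering of the LuxLib code, which is written in Julia and is not shown on this page, and the function names `softmax_jvp` and `logsoftmax_jvp` are hypothetical. It relies only on the standard identities: for `y = softmax(x)` and an input tangent `dx`, the output tangent is `y * (dx - <y, dx>)`; for logsoftmax it is `dx - <y, dx>`.

```python
import math


def softmax(x):
    """Numerically stable softmax of a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jvp(x, dx):
    """Return (y, dy): softmax(x) and its directional derivative along dx.

    Hypothetical helper for illustration; uses dy = y * (dx - <y, dx>).
    """
    y = softmax(x)
    inner = sum(yi * dxi for yi, dxi in zip(y, dx))
    dy = [yi * (dxi - inner) for yi, dxi in zip(y, dx)]
    return y, dy


def logsoftmax_jvp(x, dx):
    """Return (ly, dly): logsoftmax(x) and its directional derivative along dx.

    Hypothetical helper for illustration; uses dly = dx - <softmax(x), dx>.
    """
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))
    ly = [v - lse for v in x]
    y = [math.exp(v) for v in ly]
    inner = sum(yi * dxi for yi, dxi in zip(y, dx))
    dly = [dxi - inner for dxi in dx]
    return ly, dly
```

A hand-written rule of this kind avoids differentiating through the `exp` and the normalising sum element by element: the primal output `y` is computed once and reused for the tangent. The softmax tangent always sums to zero, which is a convenient sanity check.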
63d3434
@JuliaRegistrator register subdir=lib/LuxLib
63d3434
Registration pull request created: JuliaRegistries/General/122233
Tip: Release Notes
Did you know you can add release notes too? Add markdown-formatted text underneath the comment, after the text "Release notes:", and it will be added to the registry PR. If TagBot is installed, it will also be added to the release that TagBot creates.
To add them here, just re-invoke the command and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag be created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or it can be done manually through the GitHub interface or with `git tag` on the command line.
63d3434
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4083.5
ns3625
ns1.13
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4042
ns4541
ns0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4917
ns5125
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3833
ns3791
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
59941
ns61743
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
11250
ns10125
ns1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10500
ns10875
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11541
ns10334
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10958
ns10417
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
421187
ns430910
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1167
ns1209
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1250
ns1209
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1417
ns1500
ns0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1167
ns1042
ns1.12
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
17939
ns18223.5
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4125
ns4000
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3958
ns4042
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4292
ns4334
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4062.5
ns3875
ns1.05
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
108432
ns110886
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57333
ns56709
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46250
ns38334
ns1.21
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47041
ns46917
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82125
ns81750
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36736
ns37932
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1991000.5
ns2043708.5
ns0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2094313
ns2096520.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2094167
ns2096437.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1997041.5
ns1991167
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
194384.5
ns197294.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
143854.5
ns144625
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
143125
ns145667
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
147041
ns144916
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144750
ns144854.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165602
ns166157.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1114896
ns1116791
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1128937.5
ns1150459
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1128792
ns1128083
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1114542
ns1121458
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
526049
ns535998
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3458
ns3417
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3416
ns4042
ns0.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4145.5
ns4459
ns0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3584
ns3187.5
ns1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70040
ns72464.5
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8917
ns9417
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9042
ns9458
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9459
ns9750
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8917
ns8708
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
447136
ns469472
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15041
ns14375
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17541.5
ns16208
ns1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17625
ns18750
ns0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
15917
ns16875
ns0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
54471
ns54038
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
217417
ns213375
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
213417
ns220000
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214979.5
ns217250
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
225771
ns213916
ns1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
270355
ns270771
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
791
ns541
ns1.46
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
625
ns542
ns1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
708
ns708
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
667
ns667
ns1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17190
ns17308
ns0.99
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1500
ns1417
ns1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1500
ns1375
ns1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1666
ns1541
ns1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1500
ns1417
ns1.06
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
101385
ns101606.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7208
ns7083
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5916
ns5250
ns1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5917
ns5958
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9875
ns10084
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23163
ns23383
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
223083
ns221709
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228500
ns229750
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
230208
ns229125
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
217000
ns214125
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
166961
ns167775.5
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3958
ns4000
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23600
ns23070
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16792
ns17083
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16750
ns16625
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17041
ns17083
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
17000
ns16833
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
161078
ns162035
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
577750
ns575083
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
572709
ns571792
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
574833
ns570750
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
575625
ns577208
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
112893
ns113295
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1420292
ns1418250
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1425209
ns1422875
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1426583
ns1422500
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1429020.5
ns1425750
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
211317.5
ns211866.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1077500
ns1081041.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
960792
ns946916.5
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1350854.5
ns1353229.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1298750
ns1292458
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
273506
ns269913.5
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
6004937.5
ns6001958
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4547292
ns4632042
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4929708.5
ns4929041.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5555333
ns5549750.5
ns1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1074648
ns1070564
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
583
ns542
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23430
ns23780
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2167
ns2209
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2209
ns0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2208
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2084
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
173597
ns170642
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4292
ns3667
ns1.17
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
3750
ns4750
ns0.79
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
4917
ns5208
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3958
ns4041
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
65160
ns65525
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11209
ns11084
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11250
ns12083
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12208
ns12208
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11125
ns10834
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
447745.5
ns445478.5
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6166
ns5917
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6375
ns6666
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8125
ns8167
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6583
ns6166
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52163
ns52877
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16750
ns18250
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18209
ns18458
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18500
ns18542
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17000
ns17520.5
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
298259.5
ns296963
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
583
ns583
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
583
ns625
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns667
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
542
ns542
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32532
ns32928.5
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8208
ns9271
ns0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8667
ns9208
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9333
ns9354.5
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8083
ns8375
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
158900.5
ns157633
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64500
ns64458
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64500
ns64917
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64458
ns64583
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64375
ns64375
ns1
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111633.5
ns111288
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
274542
ns278375
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
287042
ns292291
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
274708
ns278833
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
280292
ns279500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
186083
ns186917
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3329333
ns3287958
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3017229
ns2909792
ns1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3024687.5
ns3017771
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3956250
ns3935292
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
577429
ns579655
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7623958
ns7602875
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7210334
ns7372333
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7453270.5
ns7461313
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8209375
ns8220167
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1359043.5
ns1357048
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17513124.5
ns17533125
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17530146
ns17557125
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17518395.5
ns17531667
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14128813
ns9214250
ns1.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23645979.5
ns23446917
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33821104.5
ns43586125
ns0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37080041
ns37247062.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34888834
ns35028291.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1866294
ns1855921.5
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189046208
ns189114500
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
164619624.5
ns178190333
ns0.92
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152711479
ns153393396
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
436948083
ns434855500
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13894254.5
ns13947546
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
289373791
ns290046875
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
251042625
ns271392771
ns0.93
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
296809167
ns284812041.5
ns1.04
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
474994229.5
ns473569708.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22250
ns23021
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
24542
ns22458
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
23188
ns23625
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22417
ns22708
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96027
ns96516
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
116584
ns115458.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
113125
ns103250
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
117833
ns104375
ns1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103854
ns105042
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
510213
ns508001.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5750
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5917
ns6500
ns0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6812.5
ns6708
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6292
ns6125
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68158.5
ns68991.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14875
ns14042
ns1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14812.5
ns15500
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14875
ns15687.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15042
ns14500
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
478636.5
ns478721
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3009146
ns2979083.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2061334
ns2084000
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2279208
ns2281500
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4871541.5
ns4814250
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
589315.5
ns585630.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23547375
ns23560375
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
17982875.5
ns18266583.5
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16893209
ns16959209
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
34849958
ns34863041.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2772744
ns2766675
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33314834
ns33305667
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27464208
ns27994104
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27410208
ns27448959
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41078500
ns40756916
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72375
ns74000
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74375
ns73333
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
75166
ns74917
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75167
ns74500
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
102682
ns104050
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
286145.5
ns218083
ns1.31
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
210021.5
ns210625
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
315000
ns296708.5
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
218458
ns217792
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
553543
ns558286.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11875
ns11750
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11708
ns12417
ns0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13334
ns12458.5
ns1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13125
ns11834
ns1.11
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71259
ns72847.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26833.5
ns26125
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26375
ns27167
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27417
ns27375
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
25854.5
ns26458
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
477064.5
ns484580
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12041.5
ns11583
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12229.5
ns12167
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13958
ns14000
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12584
ns11792
ns1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
53895.5
ns55176
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25875
ns25542
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25834
ns26417
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26125
ns28709
ns0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
25667
ns26042
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
305285
ns307604.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
179417
ns179208
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
179417
ns181042
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
181041
ns184333.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
180042
ns179416
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
58113
ns57654
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
590084
ns590646
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
585083
ns591479
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
591062.5
ns593500
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
584333
ns582749.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
289662.5
ns291261
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6083
ns6083.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5500
ns6375
ns0.86
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7542
ns6708
ns1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6604.5
ns6292
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70599
ns71643
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14291
ns14250
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14209
ns15167
ns0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14917
ns15292
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13062.5
ns14042
ns0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
466681.5
ns470922.5
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1223541.5
ns1203770.5
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1236625
ns1236645.5
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1285666.5
ns1343083
ns0.96
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1007959
ns1024395.5
ns0.98
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301986
ns300123
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4226959
ns4091000
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4384249.5
ns4576917
ns0.96
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4572312.5
ns4574875.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3695104.5
ns3718250
ns0.99
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1047036
ns1038641
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1833
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1792
ns1875
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
24200
ns23874.5
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4875
ns5083
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4833
ns5000
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4875
ns4959
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
192268.5
ns193867
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5458
ns5500
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5542
ns5709
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6791.5
ns6875
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5792
ns5416
ns1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
56595.5
ns57200
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10500
ns11042
ns0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10416
ns11584
ns0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11375
ns11500
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10875
ns10625
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
335979.5
ns332575
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
334
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns333
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
ns334
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
334
ns334
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23172
ns22978
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2833
ns2834
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2709
ns2792
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3042
ns3000
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2791
ns2833
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
162255.5
ns163496
ns0.99
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11084 ns | 11625 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11000 ns | 11292 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13563 ns | 12875 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11458 ns | 11209 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 58685.5 ns | 58225 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24542 ns | 24958 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24542 ns | 25208 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25167 ns | 25375 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25000 ns | 25042 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 298266 ns | 299318 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4250 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4250 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25307 ns | 25190 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16166 ns | 16209 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16292 ns | 16083 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16334 ns | 16625 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16084 ns | 16500 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 199542 ns | 202972 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5709 ns | 5833 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5792 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5792 ns | 5959 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33833 ns | 34611 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20292 ns | 20625 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20375 ns | 21042 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20875 ns | 21083 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20250 ns | 20125 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 178083 ns | 178483.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 420500 ns | 414125 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 372625 ns | 367771 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 482833 ns | 480813 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103292 ns | 104146 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67723.5 ns | 67750.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 922417 ns | 927125 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 955208.5 ns | 964354 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1180875 ns | 1186833 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 379083 ns | 376584 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 192988 ns | 192974.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 136917 ns | 77583 ns | 1.76 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 79854.5 ns | 79125 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82750 ns | 83542 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81167 ns | 79958 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194081 ns | 193934 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915042 ns | 1917959 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1919750 ns | 1933541 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926125 ns | 1931521.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1915750 ns | 1860375 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 401908.5 ns | 392771 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22364 ns | 22416 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 174295 ns | 174762 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6042 ns | 6562.5 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6417 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7812.5 ns | 8166 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6541 ns | 6208 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61489.5 ns | 59227 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9000 ns | 9292 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 9250 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9375 ns | 1 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9459 ns | 9083 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 308375 ns | 304901.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118419979.5 ns | 120543687.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173770000 ns | 181954416.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148397083 ns | 148126750 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104919541 ns | 106134709 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5493586 ns | 5492614.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 611739750.5 ns | 609833750 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 553521958 ns | 578593208 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 449841709 ns | 451045708.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 631089333.5 ns | 627478333.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38209825 ns | 35107131 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 652096250 ns | 652518625 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 661126562.5 ns | 683671437.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 580970687.5 ns | 587115583.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 848782167 ns | 852245209 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58667 ns | 58000 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47500 ns | 39209 ns | 1.21 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48250 ns | 48208 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83625 ns | 85167 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37628 ns | 38635 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919312.5 ns | 1920104 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1980333.5 ns | 1988000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1982541.5 ns | 1980667 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895625 ns | 1907896 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 176341 ns | 176329 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 266208 ns | 267041 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 265334 ns | 270500 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 288604 ns | 268750 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268167 ns | 265291 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130454.5 ns | 123893.5 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 664646 ns | 596166 ns | 1.11 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671062.5 ns | 698625 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 665875 ns | 702916.5 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 597542 ns | 589292 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 690208 ns | 677537.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2192312.5 ns | 2180187.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2179542 ns | 2215229 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2181333.5 ns | 2212000 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2207146 ns | 2207792 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134808 ns | 133207 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5469791 ns | 5497667 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5472958.5 ns | 5581500 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5499916 ns | 5516125 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5442583.5 ns | 5545124.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 720984 ns | 717120 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 644667 ns | 656041 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 644084 ns | 642917 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 642042 ns | 637375 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 644167 ns | 644167 ns | 1 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47636.5 ns | 46463 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1819917 ns | 1822875 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1720500 ns | 1668958.5 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1721792 ns | 1723334 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2100000 ns | 2101084 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 224071 ns | 222123 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57667 ns | 57667 ns | 1 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46666 ns | 38708 ns | 1.21 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46583 ns | 46916 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83750 ns | 85084 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28795 ns | 28664 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029583 ns | 2028604.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2087375 ns | 2097916.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087791.5 ns | 2087625 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1991416.5 ns | 2005812 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190320 ns | 188609 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13371041.5 ns | 13343604 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12439187.5 ns | 12536250 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12491875 ns | 12547834 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15195833.5 ns | 15250271 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 516777 ns | 510611.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47119104.5 ns | 47204500 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41727062.5 ns | 41927292 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41051417 ns | 40799666 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58599458 ns | 58864104 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2892052.5 ns | 2889030 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74212666 ns | 73523334 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 67877750 ns | 91557750 ns | 0.74 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90536499.5 ns | 90571250.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98549792 ns | 75976041 ns | 1.30 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58375 ns | 58083 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46459 ns | 38875 ns | 1.20 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47708 ns | 47709 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83958 ns | 82042 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47165 ns | 48950 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919583.5 ns | 1916542 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1980791 ns | 1982083 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1979229.5 ns | 1947333 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886958 ns | 1876854 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 193816.5 ns | 195268 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32624 ns | 32997 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5833 ns | 5834 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6083 ns | 6500 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6416.5 ns | 6458.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5833 ns | 5958 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 171378.5 ns | 171034 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32204 ns | 32918 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2583 ns | 2750 ns | 0.94 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2750 ns | 0.95 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2917 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 159764 ns | 161268 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286393770.5 ns | 286917729.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340253500 ns | 347948583.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313806270.5 ns | 314136145.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 268566520.5 ns | 267700542 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7103110 ns | 7080984 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1012043792 ns | 1009676125 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 955581708 ns | 974877416 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 855297583 ns | 854637270.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1259239875 ns | 1260982959 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33847341 ns | 34048271 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1418325958.5 ns | 1387098104 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1338395020.5 ns | 1694333625 ns | 0.79 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1636087292 ns | 1631003167 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1775858125 ns | 1358038896 ns | 1.31 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1409833 ns | 1411604.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1414458.5 ns | 1409250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1465562.5 ns | 1407354.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1413458.5 ns | 1405916 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127951 ns | 128067 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5027250 ns | 5023999.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5036354 ns | 5051396 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5030437.5 ns | 5029104.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5027250.5 ns | 5040479 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 479205.5 ns | 514176 ns | 0.93 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 170869291 ns | 170919250 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 128735708 ns | 183735542 ns | 0.70 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 105431542 ns | 115460229.5 ns | 0.91 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 167706958 ns | 168486416 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4877746.5 ns | 4853309 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 511068334 ns | 627387000 ns | 0.81 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 490911792 ns | 561666625 ns | 0.87 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 385742875 ns | 453969542 ns | 0.85 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 650161000 ns | 654142166 ns | 0.99 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16340937 ns | 17017885 ns | 0.96 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9003042 ns | 8912729 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8983042 ns | 9063708 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7909375 ns | 7941979 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9604229.5 ns | 9820979.5 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1611438.5 ns | 1590505 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36334167 ns | 36015084 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37265291.5 ns | 38799959 ns | 0.96 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33553354 ns | 33679959 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37555333 ns | 37936417 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6454550 ns | 6472671 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47333 ns | 47459 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47500 ns | 47708 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47625 ns | 47625 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47417 ns | 47209 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18252 ns | 17832 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50417 ns | 50416 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50666 ns | 50292 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50625 ns | 50458 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50250 ns | 50291 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 164880 ns | 162828 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6208 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6792 ns | 7083 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7583.5 ns | 7562.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6792 ns | 6292 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 76692.5 ns | 74130 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10125 ns | 9375 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 10250 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10375 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 9917 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 448214.5 ns | 422862.5 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5666 ns | 5666 ns | 1 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5791 ns | 6500 ns | 0.89 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7583 ns | 6916 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6042 ns | 5375 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 81735 ns | 78877.5 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13208 ns | 12875 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12709 ns | 13583 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 13583 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13417 ns | 13208 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 399198.5 ns | 370972.5 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 959 ns | 1083 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1083 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32447 ns | 33127 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7666 ns | 7792 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7708 ns | 8167 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7958 ns | 8083 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8166 ns | 7792 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 187787.5 ns | 187081.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23167 ns | 23333 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23209 ns | 23417 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23250 ns | 23583 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23292 ns | 23084 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18320.5 ns | 18527 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52917 ns | 52042 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52167 ns | 52750 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52917 ns | 52875 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52875 ns | 52542 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 214503.5 ns | 204233 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1398125 ns | 1398875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1402146 ns | 1455625 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1406437.5 ns | 1404042 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1448937.5 ns | 1406584 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196187.5 ns | 196492.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5003458 ns | 4999875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5029708 ns | 5037708 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5015042 ns | 5003083 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5005729.5 ns | 5024916 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 509817 ns | 495167 ns | 1.03 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3051834 ns | 3047396 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2076520.5 ns | 2106521 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2302500 ns | 2296895.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4658291.5 ns | 4962229.5 ns | 0.94 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 581685 ns | 583841 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24315708 ns | 24384458 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18877250 ns | 19075709 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17822166 ns | 17765562.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35790999.5 ns | 35955916.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2842698 ns | 2836787 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33982916.5 ns | 33991937.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28228208.5 ns | 28748917 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27940958 ns | 28081042 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41757334 ns | 41668854.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 143078500 ns | 142678458 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 146668125 ns | 147270333 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 127355624.5 ns | 126985770.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 171841729.5 ns | 174826021 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22550146 ns | 22556485 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1234730083.5 ns | 1026522125 ns | 1.20 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1060723417 ns | 866022875.5 ns | 1.22 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1027004875 ns | 743843334 ns | 1.38 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 674561583 ns | 682878792 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117659213 ns | 116543149 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74125 ns | 76083 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73146 ns | 76250 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76000 ns | 77625 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85834 ns | 75833.5 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 175925 ns | 163749.5 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215750 ns | 275437.5 ns | 0.78 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 192541.5 ns | 283542 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 284542 ns | 275959 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 285708 ns | 282375 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 952026.5 ns | 882740 ns | 1.08 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35486000 ns | 35483000 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36428646.5 ns | 36565000 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32475229 ns | 32543896 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40408041.5 ns | 40679500 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5831517 ns | 5828412 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 146000771 ns | 147536708 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 154808750 ns | 157209875 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137043083.5 ns | 136063312.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 285556542 ns | 286255000 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34852076.5 ns | 34875549.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121592083 ns | 122158104.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174639125 ns | 181447688 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148027541 ns | 147872917 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105917833 ns | 104774833.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5344344 ns | 5433572 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 468650958 ns | 468969166 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466713000 ns | 487732687.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437158458 ns | 437061208 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 744371959 ns | 745602708 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35992005 ns | 31632434 ns | 1.14 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 712765167 ns | 708533125.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 641204167 ns | 662068729.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 624084979.5 ns | 625681375 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 856208084 ns | 856533500 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1270583 ns | 1243917 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 995709 ns | 778625 ns | 1.28 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 995875 ns | 961709 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2037625 ns | 2098041.5 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 569478 ns | 581626.5 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2961229.5 ns | 2966062.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2647792 ns | 2513979 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2621500 ns | 2620167 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3709750 ns | 3551916 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1587708.5 ns | 1532656 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5785812.5 ns | 5803146 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5824083 ns | 5896375 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5785375 ns | 5798708 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2904896 ns | 2924083 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7083 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 5291 ns | 1.16 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6208 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10166 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24479.5 ns | 25159 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223812.5 ns | 212500 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222667 ns | 220625 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220792 ns | 220709 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 240666 ns | 213625 ns | 1.13 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 212315.5 ns | 199491.5 ns | 1.06 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 296229125 ns | 297113041 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 216728584 ns | 291058458 ns | 0.74 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 190254604.5 ns | 193310291.5 ns | 0.98 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 304954521 ns | 304396812.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7671461.5 ns | 7678125.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1229817167 ns | 1231332166.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 902846291.5 ns | 973933875 ns | 0.93 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 824304209 ns | 836913500 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1157856750.5 ns | 1148765416.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26996841 ns | 26856489.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5292 ns | 4792 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5291.5 ns | 5875 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6375 ns | 6354 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5250 ns | 4667 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 112898 ns | 93183 ns | 1.21 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6875 ns | 7000 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6958 ns | 7625 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583 ns | 7458 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7125 ns | 7395.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 535221.5 ns | 440751 ns | 1.21 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 667 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 500 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23660 ns | 24653 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9084 ns | 9500 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9917 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8708 ns | 8792 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 195936.5 ns | 176547.5 ns | 1.11 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352958.5 ns | 353584 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352792 ns | 353833 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351479 ns | 352208 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 356708.5 ns | 351500 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 20962 ns | 21275 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 775625 ns | 807916.5 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 825833 ns | 789854 ns | 1.05 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 812229.5 ns | 776042 ns | 1.05 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 834959 ns | 778833 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 234827 ns | 215262.5 ns | 1.09 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 341562.5 ns | 339229 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 341958 ns | 321000 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 455917 ns | 454187 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 11083 ns | 10916 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17699 ns | 18631 ns | 0.95 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 712500 ns | 714125 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 739896 ns | 731625 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1007854 ns | 1006333 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26459 ns | 26667 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 214680.5 ns | 196596.5 ns | 1.09 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 381042 ns | 381833.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 346750 ns | 330959 ns | 1.05 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 449187.5 ns | 444916.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 39042 ns | 31417 ns | 1.24 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22537 ns | 23162 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 733792 ns | 727875 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 788958 ns | 783542 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1032500 ns | 1030146 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 105583 ns | 90750 ns | 1.16 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 200835.5 ns | 193002.5 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3791 ns | 3583 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3541 ns | 3709 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3625 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3375 ns | 1.10 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17542 ns | 17634 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4291 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4167 ns | 4208 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4333 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4250 ns | 4125 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 204574.5 ns | 200435.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3834 ns | 3500 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 4167 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4375 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3625 ns | 3583 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 160115.5 ns | 151437.5 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8292 ns | 8458 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 8583 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8333 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8333 ns | 8458 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 989699 ns | 927946.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203375 ns | 204583 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212791 ns | 209000 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210666 ns | 210500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200834 ns | 199084 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34428 ns | 35183 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 652624.5 ns | 602833.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 622667 ns | 629209 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 631604.5 ns | 625584 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632750 ns | 582250 ns | 1.09 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 280400.5 ns | 266930.5 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 994229.5 ns | 990542 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1040292 ns | 1053625 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 956020.5 ns | 954292 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 853917 ns | 901104 ns | 0.95 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208023.5 ns | 206789.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4502437.5 ns | 4511208 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4668229.5 ns | 4854542 ns | 0.96 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4455084 ns | 4490209 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4280937 ns | 4299083.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 935555 ns | 930739 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3292 ns | 3084 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3458 ns | 3500 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4042 ns | 4083.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3209 ns | 3000 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 159049 ns | 144120 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7291 ns | 7250 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7333 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7334 ns | 7500 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6833 ns | 7041 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 850635.5 ns | 806482 ns | 1.05 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1640041 ns | 1636250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1196604.5 ns | 1158208.5 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1383250 ns | 1368083 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2417500 ns | 2308063 ns | 1.05 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215018 ns | 214505 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12333396 ns | 12270583 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9592791.5 ns | 9567750 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9267625 ns | 9243645.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18011459 ns | 18134146 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1959459 ns | 1954133 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17332937.5 ns | 17281250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14386792 ns | 14453375 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14369396.5 ns | 14325333 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21112291.5 ns | 21045500 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 87708 ns | 85708 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88542 ns | 91520.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92833 ns | 93250 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 116000 ns | 87833.5 ns | 1.32 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126352.5 ns | 126207 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2022959 ns | 2017958 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2049666 ns | 2050542 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2035562.5 ns | 2029834 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2025938 ns | 2026959 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 878938 ns | 841405 ns | 1.04 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2750 ns | 1375 ns | 2 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 3209 ns | 1917 ns | 1.67 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3417 ns | 3583.5 ns | 0.95 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2792 ns | 2375 ns | 1.18 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16283 ns | 16017 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2542 ns | 2875 ns | 0.88 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2708 ns | 2833 ns | 0.96 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2875 ns | 2750 ns | 1.05 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2834 ns | 2792 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 176848 ns | 165765.5 ns | 1.07 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 7208 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5333 ns | 1.13 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 5958 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10084 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34134 ns | 34231 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221583 ns | 214458 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220000 ns | 220042 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220417 ns | 221416 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215333 ns | 235834 ns | 0.91 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 285763.5 ns | 263066.5 ns | 1.09 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22875 ns | 22879.5 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14500 ns | 14459 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14375 ns | 14375 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14458 ns | 14541 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14500 ns | 14500 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 410580 ns | 399546.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 92125 ns | 94312.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 92916 ns | 95875 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96979 ns | 97583 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 138000 ns | 94354.5 ns | 1.46 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125660 ns | 125486.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923792 ns | 1919437.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1935291 ns | 1938250 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1932916.5 ns | 1927084 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920500 ns | 1803750 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 861874.5 ns | 794850 ns | 1.08 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 873916 ns | 875354.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 826583 ns | 802104.5 ns | 1.03 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1222000 ns | 1225042 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 963750 ns | 970374.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 276546 ns | 273954 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2791083 ns | 2714354 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2445687.5 ns | 2504167 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3347916 ns | 3360375 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3371375 ns | 3360334 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1487194.5 ns | 1467965 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17250 ns | 17542 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17959 ns | 16937.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17875 ns | 18708 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17417 ns | 14584 ns | 1.19 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130892 ns | 129735 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218625 ns | 214709 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 260667 ns | 215958.5 ns | 1.21 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227792 ns | 215562.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 256083 ns | 217958 ns | 1.17 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 584591.5 ns | 539139.5 ns | 1.08 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222000 ns | 223375 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222667 ns | 220958 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222312.5 ns | 222645.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220833 ns | 219625 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 243596.5 ns | 217203.5 ns | 1.12 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 501417 ns | 495895.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 496084 ns | 506625 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 508541.5 ns | 510958 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 561833 ns | 561375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1202534 ns | 1153506.5 ns | 1.04 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3895.5 ns | 3917 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4270.5 ns | 4667 ns | 0.92 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5708 ns | 4834 ns | 1.18 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4458.5 ns | 4833 ns | 0.92 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16584 ns | 17326 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7208.5 ns | 7520.5 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7000 ns | 7625 ns | 0.92 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7625 ns | 7458 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7500 ns | 7417 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 179332 ns | 176736 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17687 ns | 16646 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17917 ns | 18500 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18625 ns | 19625 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18729 ns | 18042 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 135434 ns | 133143.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 211041 ns | 213000 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220417 ns | 212916 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212542 ns | 213667 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212271 ns | 224895.5 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 847267 ns | 820129 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 3959 ns | 4354.5 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4209 ns | 4625 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4875 ns | 4917 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4291 ns | 3875 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 187480.5 ns | 175343 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10459 ns | 10208 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10541.5 ns | 10333 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10042 ns | 10834 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10208 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 955985 ns | 980341 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3145.5 ns | 3250 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 2937.5 ns | 3687.5 ns | 0.80 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4000 ns | 4292 ns | 0.93 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3167 ns | 2917 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 188520.5 ns | 215866 ns | 0.87 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 7166 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7209 ns | 7625 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7792 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 7375 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 987324 ns | 1015020 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23406938 ns | 23687417 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35765125 ns | 42666354 ns | 0.84 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37705500 ns | 37344478.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34946604 ns | 34948333.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1830206.5 ns | 1824017 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 183995333 ns | 183871416 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 165575375 ns | 182812313 ns | 0.91 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146468292 ns | 145975437.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 274483625 ns | 274277542 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16521685 ns | 16507012 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 276817937 ns | 273782791 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 246377395.5 ns | 257949042 ns | 0.96 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231576042 ns | 231995083.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 325032833.5 ns | 323882958.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182896.5 ns | 183541 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 184292 ns | 184000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184958 ns | 185292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183167 ns | 182542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 200810.5 ns | 191911.5 ns | 1.05 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 635333 ns | 629458.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 633354.5 ns | 587334 ns | 1.08 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 600291 ns | 587125.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 597271 ns | 649291 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 958799 ns | 963628 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3842750 ns | 3851750 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3997500 ns | 3983792 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3542792 ns | 3579833 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4556625 ns | 4612292 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532425 ns | 531156 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17396104 ns | 17385812.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18078958 ns | 18439958.5 ns | 0.98 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16589917 ns | 16577084 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19981167 ns | 20232667 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2633170 ns | 2638769 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32094 ns | 32361 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 9312.5 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 9604.5 ns | 0.91 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9041 ns | 9541 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 8750 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 249030 ns | 248738 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 652464437.5 ns | 650277229.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 394034604 ns | 513797917 ns | 0.77 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 326393417 ns | 364513416 ns | 0.90 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 748745833 ns | 753229708 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12466975 ns | 11759811 ns | 1.06 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1885107791.5 ns | 1878034500 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1638827875 ns | 1671899375 ns | 0.98 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1512914354 ns | 1507608416.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2208603583.5 ns | 2202946667 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49231175.5 ns | 49516620 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1616792 ns | 1535958.5 ns | 1.05 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1200917 ns | 1179292 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1389625 ns | 1380729.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2477916.5 ns | 2368083 ns | 1.05 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215338 ns | 215337 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12691834 ns | 12730083 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9979354.5 ns | 9937625 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9689896 ns | 9659583.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18371271 ns | 18459917 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1985308 ns | 2010689 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17676916 ns | 17677292 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14722000 ns | 14810083 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14613667 ns | 14573229.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21413395.5 ns | 21483000 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26292 ns | 26292 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23721 ns | 23665 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67333 ns | 67166 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67333 ns | 66875 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67209 ns | 67250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67333 ns | 66958 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 367128.5 ns | 367986.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203542 ns | 204583 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208625 ns | 209292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209584 ns | 210500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199792 ns | 199625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25494 ns | 26073 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 604625 ns | 613125 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 670666.5 ns | 625459 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632166.5 ns | 633583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630000 ns | 632083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 321975.5 ns | 320857.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 639021 ns | 592750 ns | 1.08 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 643458 ns | 647000 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 658750 ns | 648834 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 632750 ns | 671792 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131332 ns | 131354 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2244229 ns | 2247291 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2277708.5 ns | 2303208 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2240167 ns | 2243604 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2235458.5 ns | 2314875.5 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1075922 ns | 1083962 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17167 ns | 16687.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17916 ns | 18458 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18167 ns | 19770.5 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18208 ns | 18146 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130720.5 ns | 132087.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 258584 ns | 229375 ns | 1.13 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227459 ns | 262896 ns | 0.87 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 232750 ns | 231208 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 230791 ns | 258624.5 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 887768.5 ns | 885149.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 667 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23104 ns | 23686 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9750 ns | 8708 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 10000 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9208 ns | 10000 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9417 ns | 9250 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 242418 ns | 241904 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5208 ns | 5417 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5125 ns | 5583 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6375 ns | 6417 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 4770.5 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 193804 ns | 194851.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7167 ns | 7667 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7417 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7792 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7042 ns | 7250 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 706410 ns | 705733 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2125 ns | 2167 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2250 ns | 2208 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2209 ns | 2542 ns | 0.87 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2208 ns | 2208 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17672 ns | 17804 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6458 ns | 6541 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6291 ns | 6500 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6709 ns | 6875 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6500 ns | 6417 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 300575 ns | 294742 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 749459 ns | 746916 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748959 ns | 761333 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 750854 ns | 750541 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749167 ns | 749459 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 20805 ns | 20924 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 775208 ns | 790875 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 795916.5 ns | 777375 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 792791 ns | 792500 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792792 ns | 778250 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 274546.5 ns | 268681.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7375 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5250 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 5875 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10292 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33244 ns | 32725 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219625 ns | 219208 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240291 ns | 230937.5 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 237583 ns | 236625 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 260042 ns | 214312.5 ns | 1.21 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 337443 ns | 332717.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10084 ns | 10291 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9583 ns | 10937.5 ns | 0.88 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10750 ns | 10625 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10167 ns | 9916 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 223296.5 ns | 219475.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25125 ns | 24416 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24312.5 ns | 25417 ns | 0.96 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24917 ns | 24875 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24667 ns | 24354.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1047460.5 ns | 1060762 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106018062.5 ns | 106190416 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 118144520.5 ns | 126215417 ns | 0.94 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120409292 ns | 120200125 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117468833 ns | 117655917 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2652084 ns | 2587994 ns | 1.02 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 373672500 ns | 395454916.5 ns | 0.94 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 359102771.5 ns | 372350083.5 ns | 0.96 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 356068521.5 ns | 355285895.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 543525042 ns | 542892500 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15230726 ns | 15209611 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 605345333 ns | 607219000 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 584604208 ns | 775694542 ns | 0.75 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 744606604.5 ns | 743546708 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 793208583.5 ns | 606917208 ns | 1.31 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6500 ns | 6729.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6375 ns | 7458 ns | 0.85 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8062 ns | 8791 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7146 ns | 6084 ns | 1.17 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 216878 ns | 214170 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13625 ns | 14645.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13625 ns | 14167 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14125 ns | 14334 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14084 ns | 13417 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1010131 ns | 1010027 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5625 ns | 6042 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6708.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7895.5 ns | 6958 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 5166.5 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 211472.5 ns | 211003 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12583 ns | 12916 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12333 ns | 12979.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12708 ns | 13041 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12709 ns | 12375 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 725788 ns | 725511 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5583 ns | 5792 ns | 0.96 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5875 ns | 6084 ns | 0.97 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6583.5 ns | 7166 ns | 0.92 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 6167 ns | 5979.5 ns | 1.03 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17002 ns | 16985 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15916 ns | 16375 ns | 0.97 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15250 ns | 15917 ns | 0.96 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 16125 ns | 15750 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15834 ns | 15750 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 187784.5 ns | 184955.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23531 ns | 23469 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6167 ns | 6375 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6292 ns | 6292 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6459 ns | 6458 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6084 ns | 6020.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 228744 ns | 226513 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5834 ns | 5917 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5916 ns | 6000 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5959 ns | 6083 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5959 ns | 5833 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24273 ns | 24637 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 20833 ns | 21375 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20750 ns | 21083 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21292 ns | 21167 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21041 ns | 20875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 251207.5 ns | 248819 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 185375 ns | 144938 ns | 1.28 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144625 ns | 147666 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147917 ns | 147500 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144417 ns | 144208 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166909.5 ns | 166863.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1321833 ns | 1328917 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1350479 ns | 1366916.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1337166 ns | 1323667 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1323625 ns | 1330125 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1251196 ns | 1231201 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24833 ns | 21917 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25041 ns | 23250 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23958 ns | 25417 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24271 ns | 24583 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 315591 ns | 261684.5 ns | 1.21 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131292 ns | 126249.5 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118396 ns | 132125 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 176916 ns | 180458 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 129458 ns | 182166 ns | 0.71 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1353120 ns | 1329052 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 334 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23127 ns | 23064 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6417 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6459 ns | 6500 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6333 ns | 6583 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6083 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 245064.5 ns | 241726 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4208 ns | 4583 ns | 0.92 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4875 ns | 4875 ns | 1 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5125 ns | 5062.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4667 ns | 4375 ns | 1.07 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 228957.5 ns | 230879.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 9792 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 10375 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10334 ns | 10333 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10125 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1285818.5 ns | 1281938 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1584 ns | 1584 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1583 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23344 ns | 23016.5 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5750 ns | 5709 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5709 ns | 5750 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 6042 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5666 ns | 5625 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 264086.5 ns | 260870.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6807541.5 ns | 6736854 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6433375 ns | 6358292 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6489875 ns | 6526333 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7649521 ns | 7511917 ns | 1.02 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214938 ns | 214549 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24073959 ns | 24072542 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21296000 ns | 21309271.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21044062.5 ns | 21010584 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29805771 ns | 29840125 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2104181 ns | 2110310.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37247625 ns | 37228250 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34089791 ns | 45827250 ns | 0.74 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45725979.5 ns | 45480416 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49397750 ns | 38465479 ns | 1.28 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5500 ns | 5708 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5708 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6541 ns | 6729.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5708 ns | 5208.5 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 208256 ns | 215925.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8084 ns | 8833 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 8417 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8625 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8145.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 991485 ns | 1004537.5 ns | 0.99 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1509000 ns | 1503813 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1282542 ns | 1243541.5 ns | 1.03 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1634916.5 ns | 1631312.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2162000.5 ns | 2004542 ns | 1.08 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271116.5 ns | 280207 ns | 0.97 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7902209 ns | 7912062.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6449312.5 ns | 6650042 ns | 0.97 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7195708 ns | 7185875 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10462229 ns | 10076645.5 ns | 1.04 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1752716.5 ns | 1812720 ns | 0.97 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 371187.5 ns | 371770.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 374208 ns | 359708 ns | 1.04 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 461250 ns | 457000 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 22208 ns | 27125 ns | 0.82 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42428.5 ns | 47414 ns | 0.89 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 745437.5 ns | 728042 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 815833 ns | 792916 ns | 1.03 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1062958 ns | 1060625 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 117396 ns | 122625 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 283256.5 ns | 280856 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397208 ns | 397666 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288667 ns | 213417 ns | 1.35 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287875 ns | 288291 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750917 ns | 754041 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43636 ns | 44363 ns | 0.98 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 667000 ns | 669875 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531375 ns | 474875 ns | 1.12 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 531417 ns | 529792 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974083 ns | 975625 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 188745 ns | 194646.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 644833 ns | 678312.5 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 648750 ns | 642583 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 644479 ns | 646625 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 652458.5 ns | 638374.5 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131347.5 ns | 132515 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2445334 ns | 2433792 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2500021 ns | 2525125 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2463250 ns | 2458416 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2463375 ns | 2464167 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1238313 ns | 1286025 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 3417 ns | 4270.5 ns | 0.80 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3625 ns | 2791 ns | 1.30 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4250 ns | 4334 ns | 0.98 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3437.5 ns | 3021 ns | 1.14 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16066 ns | 17018 ns | 0.94 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5375 ns | 5583 ns | 0.96 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5292 ns | 5542 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5750 ns | 5500 ns | 1.05 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5583 ns | 5584 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 182995 ns | 187936.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458042 ns | 1463042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1499750 ns | 1495875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1503250 ns | 1503458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1437708 ns | 1446334 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40191 ns | 41308.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5113291 ns | 5127000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5287958 ns | 5300416.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5307041.5 ns | 5293458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4985125 ns | 4725667 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196599 ns | 195229 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3709 ns | 3709 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3709 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3709 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33557 ns | 33264.5 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15125 ns | 15250 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15167 ns | 15083 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15416 ns | 15417 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15208 ns | 15125 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 349206 ns | 350238 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71125 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71542 ns | 71417 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71209 ns | 71208 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71041 ns | 71500 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113114 ns | 112408 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317667 ns | 318125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 324125 ns | 327584 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318292 ns | 319500 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317625 ns | 320333 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 193277 ns | 194166 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1041 ns | 1084 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1125 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1125 ns | 1000 ns | 1.13 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23048 ns | 23803 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 8000 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8270.5 ns | 8417 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8417 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8041 ns | 7708 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 245757.5 ns | 246141 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 502770.5 ns | 501979.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 484500 ns | 480104 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 561750 ns | 566979 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 219917 ns | 220416 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129178 ns | 128980 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1387645.5 ns | 1391667 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1473958 ns | 1479770.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1779041.5 ns | 1756604 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 862917 ns | 864792 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273950 ns | 275170 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 417 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31657.5 ns | 31717 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6625 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6208 ns | 6542 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6500 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6042 ns | 5958 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 251419 ns | 248251 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1733792 ns | 1776021 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1721208 ns | 1733687.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1724250 ns | 1727458 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1773541 ns | 1726125 ns | 1.03 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168671 ns | 167904 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4114542 ns | 4363208 ns | 0.94 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4392834 ns | 4382750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4368208.5 ns | 4374000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4369208.5 ns | 4367334 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1291475.5 ns | 1079923 ns | 1.20 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6834 ns | 6875 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6667 ns | 6708 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7999.5 ns | 6792 ns | 1.18 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7041 ns | 6666 ns | 1.06 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20138.5 ns | 19517 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51250 ns | 59895.5 ns | 0.86 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32625 ns | 49208 ns | 0.66 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 73833 ns | 52583 ns | 1.40 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51084 ns | 32417 ns | 1.58 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 340107 ns | 267079.5 ns | 1.27 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17833 ns | 18084 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 18083 ns | 18292 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18875 ns | 19709 ns | 0.96 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 18208 ns | 18292 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18400 ns | 18390 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53250 ns | 53833 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53041 ns | 53375 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53375 ns | 53375 ns | 1 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53542 ns | 53625 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 319083.5 ns | 319120 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75166 ns | 75333 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75625 ns | 75583 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75291.5 ns | 75250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75083 ns | 75500 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47469 ns | 46304 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324958 ns | 324291 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 342000 ns | 336479.5 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 325000 ns | 324708 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 324542 ns | 327458 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 211595 ns | 209708.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484959 ns | 1487583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526854.5 ns | 1522083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527250 ns | 1529334 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1462542 ns | 1471333 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51799 ns | 52335 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5111083.5 ns | 5126125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5312417 ns | 5305125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5299333.5 ns | 5295000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4982354 ns | 4684000 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 204934 ns | 202194.5 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28208 ns | 28333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28187.5 ns | 28292 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28209 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24742 ns | 24238 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66500 ns | 66500 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66709 ns | 66250 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66500 ns | 66416 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66541 ns | 66625 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 484630.5 ns | 495044 ns | 0.98 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1480583.5
ns1478812
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
1136563
ns933416.5
ns1.22
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1136750
ns1129625
ns1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2265937.5
ns2267917
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
579622.5
ns577563.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3074562.5
ns3095187.5
ns0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2788145.5
ns2641125
ns1.06
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2743021
ns2747417
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3819500.5
ns3815833.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
1931643
ns1965829
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7902458
ns7798041
ns1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
7834062.5
ns8017625
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7920375
ns7904083.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4826312.5
ns4861812
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
77625
ns119833.5
ns0.65
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
81167
ns81604
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
84041.5
ns82000
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
111396
ns80604
ns1.38
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
193746
ns193857.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2012875
ns2020000
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2046292
ns2021083
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2031354
ns2024292
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2015417
ns1749917
ns1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
746361.5
ns744082.5
ns1.00
This comment was automatically generated by workflow using github-action-benchmark.