fix: mark kwargs in functor as leaf (#1085)
Showing 3 changed files with 24 additions and 7 deletions.
04494b5
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/119558
Tip: Release Notes
Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text "Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the release that TagBot creates.
To add them here, just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or it can be done manually through the GitHub interface or from the command line.
Lux Benchmarks
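As a quick check on how to read the entries under this heading: the ratio in each entry appears to be the first timing divided by the second, rounded to two decimals. The sketch below is illustrative Python, not part of Lux or its benchmark tooling. The helper name `timing_ratio` is hypothetical, and the sample timings and reported ratios are copied from three entries in this comment. Reading the first timing as the current commit and the second as the previous baseline is an assumption, since the column header did not survive.

```python
def timing_ratio(first_ns: float, second_ns: float) -> float:
    """Return first/second rounded to two decimals (hypothetical helper)."""
    return round(first_ns / second_ns, 2)


# (benchmark name, first timing in ns, second timing in ns, ratio as reported)
SAMPLE_ROWS = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)",
     4125, 3917, 1.05),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)",
     1125, 1333, 0.84),
    ("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)",
     43637146, 34373209, 1.27),
]

for name, first_ns, second_ns, reported in SAMPLE_ROWS:
    computed = timing_ratio(first_ns, second_ns)
    # Compare the recomputed ratio with the one listed in the comment.
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

Under that assumption, a ratio above 1 means the first timing is slower than the second, and a ratio below 1 means it is faster. Many of the sub-microsecond CPU entries move by 10% or more in both directions, which is more consistent with measurement noise than with a systematic change.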
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4125
ns3917
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4375
ns4125
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5084
ns5292
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3792
ns3791.5
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
62298.5
ns60493
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10709
ns10125
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
12042
ns10125
ns1.19
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11458
ns10042
ns1.14
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10375
ns10041
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
437714.5
ns426840
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1250
ns1125
ns1.11
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1291
ns1208
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1417
ns1375
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1125
ns1333
ns0.84
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18695.5
ns18326.5
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4125
ns4125
ns1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4167
ns4000
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4375
ns4250
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4041
ns4000
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
113193.5
ns111007
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57458
ns57208
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
38458
ns46584
ns0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46208
ns46875
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82584
ns82834
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
38573
ns37385
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2036958
ns2030208
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2091833.5
ns2090166
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2080833.5
ns2097541.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2003542
ns2024291
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
199791
ns199762
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
144291.5
ns151479.5
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
142396
ns143895.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145709
ns145166.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144354
ns147166
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167203.5
ns166256.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1119417
ns1118083
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1142208
ns1121604
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1106187.5
ns1123333.5
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1118333
ns1147792
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
538894
ns530774
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3375
ns3459
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4042
ns3833
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4500
ns4167
ns1.08
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3062.5
ns3625
ns0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
69914
ns67815
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8500
ns8708
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9541
ns9292
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10791
ns8708
ns1.24
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8750
ns8875
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
508206.5
ns492282.5
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
14270.5
ns16020.5
ns0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15937.5
ns16166.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17167
ns16395.5
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
16416
ns15395.5
ns1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56662
ns54331
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
213541.5
ns225750
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
215375
ns214542
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215500
ns213708
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
221500
ns214708
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
281033.5
ns273073
ns1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
625
ns542
ns1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
667
ns667
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
875
ns750
ns1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns667
ns0.87
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17768.5
ns17433
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1375
ns1541
ns0.89
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1584
ns1542
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1625
ns1417
ns1.15
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1584
ns1625
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
105523
ns102453
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7125
ns7209
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5250
ns5917
ns0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6041
ns5958
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9958
ns10292
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24475
ns23456
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
232875
ns229875
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
230687.5
ns230687
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228875
ns230041
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
228291.5
ns220833
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
172887
ns170399.5
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3917
ns3958
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23927
ns23777
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16584
ns16625
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16875
ns16708
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17125
ns16667
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16625
ns16459
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
163843.5
ns162269
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
575875
ns574500
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
602666
ns568000
ns1.06
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
574625
ns572792
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
586875
ns575667
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113393.5
ns113429.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1420792
ns1419458
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1449750
ns1419209
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1421833
ns1414729.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1420666
ns1420875
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
214515
ns211299.5
ns1.02
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1072291
ns1072937.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
943646
ns965417
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1355083
ns1352709
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1299500
ns1268500
ns1.02
lenet(28, 28, 1, 64)/forward/GPU/CUDA
279673
ns273664
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5776625
ns5908520.5
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4547375
ns4453354
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4956667
ns4968833.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5681291.5
ns5709812.5
ns1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1100789
ns1074376
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
583
ns500
ns1.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23781
ns24117
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2042
ns2084
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2084
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2208
ns2209
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
174061
ns175449
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4000
ns4041
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4041
ns3917
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5041
ns4833
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3520.5
ns3458
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66653
ns65395
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11042
ns10709
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11667
ns11208
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11917
ns11834
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11333
ns11084
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
459089
ns450961.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5625
ns7979.5
ns0.70
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7583
ns6250
ns1.21
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7500
ns7979.5
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6167
ns6292
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
53360
ns52467
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16542
ns17291
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17209
ns17375
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18333
ns18875
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16791
ns16667
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
310536.5
ns305695
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
542
ns542
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns542
ns1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
584
ns666
ns0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33641
ns32659
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8542
ns8584
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9042
ns9000
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9312.5
ns9000
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8709
ns8542
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
163903
ns159178.5
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64708
ns64459
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64375
ns64666
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64584
ns64417
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64541
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111608.5
ns111598.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
279917
ns289312.5
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
284542
ns277542
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
283917
ns289083
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
282500
ns289417
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
187507
ns185068
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3287833.5
ns3359958.5
ns0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2780208
ns3026438
ns0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3046979
ns3022125
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4045333
ns3951146
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
571508
ns587969
ns0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7619500
ns7494500
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7346729
ns7453875
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7476208
ns7451896
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8208937.5
ns8244416.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1337153
ns1382663
ns0.97
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18827750
ns18772542
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19152375
ns19139125
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19142542
ns19128542
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15656917
ns16197250
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23640000
ns23953437.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
43637146
ns34373209
ns1.27
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37268333.5
ns37031750
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34800709
ns35339917
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1857649.5
ns1848447
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189644917
ns188047583
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
178178021
ns164639729
ns1.08
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
153631625
ns152806417
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
441917583
ns448441291
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13878768
ns13907488
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
290268250
ns289867458
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
356472291
ns338595687.5
ns1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
297083542
ns299072917
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
333841541
ns413224792
ns0.81
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21645.5
ns21917
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23083.5
ns22958
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25417
ns24791
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22458
ns22459
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
98111.5
ns99382.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103708
ns104062.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104645.5
ns103833
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104833
ns103792
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103709
ns117021
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
513446
ns522084
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5895.5
ns5875
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6292
ns6500
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7042
ns6854.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5792
ns6208
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
69892
ns70401.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14708
ns14875
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16333
ns15520.5
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16375
ns16000
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14895.5
ns15083
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
489627
ns486902.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
2830166.5
ns3019896
ns0.94
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2097041.5
ns2093333
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2268875
ns2249333
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4810500
ns4929333
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
591815.5
ns589363
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23521125
ns23557417
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18356854
ns18041937
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16883187.5
ns16958375
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35797333.5
ns36564937.5
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3103730
ns3109138
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33291833
ns33331458.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28074625
ns27714666.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27392042
ns27590000
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41001146
ns42335041.5
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
71583
ns74208
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
80375
ns73500
ns1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
74479
ns76208.5
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
72333.5
ns73292
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
103944
ns106173
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
297959
ns257312.5
ns1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
216229
ns207375
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
209917
ns208750
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
272250
ns224458.5
ns1.21
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
558503.5
ns569917.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11792
ns11500
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11833
ns12125
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13167
ns12625
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11416
ns12333
ns0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
74112
ns73676.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26250
ns25958
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27833
ns26770.5
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27625
ns27542
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26917
ns26959
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
486928.5
ns486341.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11791.5
ns12834
ns0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12875
ns12291
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13791.5
ns14916
ns0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12437.5
ns12458
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
53830
ns54897
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25417
ns26917
ns0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26625
ns25583
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26458
ns26500
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26333
ns26334
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
310408.5
ns315364
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
180916
ns181625
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
180479.5
ns182687.5
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
184042
ns180959
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
181875
ns180334
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
58334
ns58698
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
582875
ns630333
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
584292
ns588125
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
584875
ns585125
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
583771
ns615042
ns0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
295362
ns295850.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5958
ns6166.5
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6042
ns6000
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7000
ns6958
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5875
ns6458
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
73116.5
ns72786
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13792
ns14375
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15417
ns14458
ns1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15417
ns15500
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14458
ns14459
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
473681.5
ns474917.5
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1172041
ns1242458
ns0.94
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1199583
ns1726750
ns0.69
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1286750
ns1284250
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1323125
ns1304916.5
ns1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301801
ns301368.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4107916
ns4121291.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4485521
ns4347458
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4485834
ns4658625
ns0.96
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4442521
ns4651583.5
ns0.96
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1036926.5
ns1044741
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1792
ns1875
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1792
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1834
ns1917
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23428
ns24235
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4833
ns4875
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4917
ns4833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5000
ns4958
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4917
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
189745
ns195530.5
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5395.5
ns6104.5
ns0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6500
ns5667
ns1.15
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6583
ns6875
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5708
ns5750
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
55733.5
ns57282.5
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10750
ns10979.5
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11917
ns10917
ns1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11916
ns12042
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10750
ns10792
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
340606.5
ns344261.5
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
375
ns292
ns1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
334
ns333
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
375
ns334
ns1.12
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22887
ns23845
ns0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2792
ns2708
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3083
ns2709
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3084
ns2750
ns1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2791
ns2750
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
158761
ns164126.5
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10646
ns11917
ns0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11750
ns11167
ns1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
12791
ns12459
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11833
ns11417
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57602
ns58669
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24416
ns25000
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
25250
ns24500
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24875
ns24708
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
25167
ns25083
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
298284.5
ns304559.5
ns0.98
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4166 ns | 4208 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24817 ns | 25840 ns | 0.96 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16125 ns | 16250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16166 ns | 16208 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16458 ns | 16208 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16084 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 198823 ns | 205670.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5875 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5834 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34246 ns | 34552 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20520.5 ns | 20833 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21209 ns | 20791 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22729 ns | 21666 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21041 ns | 20709 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 178397 ns | 181576.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 404667 ns | 399208 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 354625 ns | 372584 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 489104 ns | 483833 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 525646 ns | 506583 ns | 1.04 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66601 ns | 67542.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 953375 ns | 1007042 ns | 0.95 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 892417 ns | 884958.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1238812.5 ns | 1232166.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1399167 ns | 1433709 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191860 ns | 193226.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83500 ns | 78250 ns | 1.07 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 87604 ns | 81292 ns | 1.08 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82958 ns | 81084 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 131084 ns | 82584 ns | 1.59 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193332.5 ns | 194079.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915875 ns | 1922417 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1936354 ns | 1920833 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926333 ns | 1926562 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1922000 ns | 1930104 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 409092 ns | 412262 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22200 ns | 22781 ns | 0.97 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1792 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 173011 ns | 177219.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5542 ns | 6958.5 ns | 0.80 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6708 ns | 6250 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7709 ns | 7354.5 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6750 ns | 6604.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61904.5 ns | 63079.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 9250 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9583 ns | 8958 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 9250 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9167 ns | 9500 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 320204.5 ns | 323232 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121310646 ns | 120001354.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181760542 ns | 173860959 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147955208 ns | 147799416 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 107047750 ns | 105257459 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5473575 ns | 5487976 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616325250 ns | 617199083.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 579539584 ns | 555347958 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 452436916.5 ns | 452797563 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 761604583 ns | 772493125 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34930079 ns | 34928705 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 649842708 ns | 649523250 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 687832188 ns | 666577687.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 589947958 ns | 586637666.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 745852459 ns | 745838709 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58833 ns | 59333 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38917 ns | 47459 ns | 0.82 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47791 ns | 48167 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83458 ns | 83542 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37730 ns | 38875 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1929812 ns | 1923646.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1979208.5 ns | 1977041 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1979375 ns | 1985354.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1901667 ns | 1902208 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175008.5 ns | 178774.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 265750 ns | 267625 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 272708 ns | 266833.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 277250 ns | 269833 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 264875 ns | 268750 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 141169 ns | 135794.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 664708 ns | 650000 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 673604.5 ns | 693729 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 596521 ns | 588833 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 584791 ns | 671104.5 ns | 0.87 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 740154.5 ns | 736846.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2239000 ns | 2216292 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2243541.5 ns | 2247708 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2186042 ns | 2179083 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2207334 ns | 2228833.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133565 ns | 134998 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5494667 ns | 5489729.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5564917 ns | 5488416.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5526500 ns | 5510083.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5510208 ns | 5509500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 772189 ns | 772298.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 643917 ns | 644667 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 648042 ns | 636750 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 643750 ns | 643625 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 644167 ns | 664292 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46665 ns | 46667 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1820250 ns | 1824500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1693229.5 ns | 1724500 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1723895.5 ns | 1725291 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2104875 ns | 2103083 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221733 ns | 222112 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58500 ns | 58292 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38750 ns | 47416 ns | 0.82 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 ns | 47229.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84416 ns | 84250 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28603 ns | 28385 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2028417 ns | 2028896 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2096541 ns | 2089125 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2094542 ns | 2095958 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2003250 ns | 2000333 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190253.5 ns | 190029 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13359542 ns | 13380666.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12463437.5 ns | 12446458.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12529500 ns | 12502375.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 14818083.5 ns | 15323021 ns | 0.97 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 513933 ns | 514892.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47308250 ns | 47308500 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41855187.5 ns | 41876750 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41092584 ns | 40911521 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58335625 ns | 59068458 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3235135 ns | 3251312.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 73575437.5 ns | 74382250 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91455709 ns | 67968458 ns | 1.35 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90846333 ns | 90502167 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76471792 ns | 99974645.5 ns | 0.76 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58833 ns | 58709 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38834 ns | 47416 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 47542 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84250 ns | 84750 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48573 ns | 48017 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920417 ns | 1919167 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1968896 ns | 1967646 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1976187.5 ns | 1985146.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1889896 ns | 1905917 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198559.5 ns | 198479.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 333 ns | 1.25 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33508 ns | 33267 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6041 ns | 6166 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6166 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6500 ns | 6583 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6125 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 183482.5 ns | 181033.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32079 ns | 32220 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2584 ns | 2583 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 2584 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2958 ns | 2833 ns | 1.04 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2584 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 170409 ns | 169790 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 287317291.5 ns | 285598395.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 346633417 ns | 340428166.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314215396 ns | 314514604.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 269592292 ns | 271854666 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7121953.5 ns | 7112648 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 999266459 ns | 998529958 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 960016625 ns | 938308750 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 852780958.5 ns | 856850979 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1161564542 ns | 1172841333 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34079762.5 ns | 33924275.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1308969708.5 ns | 1309270416.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1689204000 ns | 1342306792 ns | 1.26 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1642484833 ns | 1639996875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1298514771 ns | 1671556167 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1461645.5 ns | 1461166.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1422250 ns | 1458417 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1414000.5 ns | 1416916.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1410541 ns | 1463625 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128615.5 ns | 128327 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5018812.5 ns | 5018562.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5052458 ns | 5017583.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5042958 ns | 5035291 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5025792 ns | 5028709 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 515506.5 ns | 596055.5 ns | 0.86 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 175005812.5 ns | 175022104 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 179670270.5 ns | 129880000 ns | 1.38 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 129962333.5 ns | 128343604 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 155707333.5 ns | 159345417 ns | 0.98 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4857330 ns | 4880680 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 669593458 ns | 661717500 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 553990208 ns | 491987125 ns | 1.13 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 519026625 ns | 484849458 ns | 1.07 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 676320542 ns | 694008417 ns | 0.97 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16003946 ns | 15696658 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8934291 ns | 8921021 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8826020.5 ns | 8820666.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7895083 ns | 7856604 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10163542 ns | 10334062.5 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1609035 ns | 1590128 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36061791 ns | 36048792 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37793833 ns | 36944812.5 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33277167 ns | 33324416 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 39115875 ns | 40001167 ns | 0.98 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6457436 ns | 6454559 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47333 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47459 ns | 47333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47750 ns | 47583 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47541 ns | 47625 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18918 ns | 18400.5 ns | 1.03 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 52750 ns | 50292 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50250 ns | 50042 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 52958 ns | 52833 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50458 ns | 50500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 201180.5 ns | 213323 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6041 ns | 7125 ns | 0.85 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7125 ns | 6334 ns | 1.12 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8354.5 ns | 7709 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6708 ns | 6833 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 95394.5 ns | 101491.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9709 ns | 9709 ns | 1 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10209 ns | 9584 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10333 ns | 10292 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10291.5 ns | 10208 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 558425.5 ns | 591948.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5041 ns | 6792 ns | 0.74 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6312.5 ns | 5666.5 ns | 1.11 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6792 ns | 6979.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5687.5 ns | 6729 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 129167.5 ns | 157875.5 ns | 0.82 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12895.5 ns | 13000 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13459 ns | 12917 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13458 ns | 13250 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13125 ns | 13167 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 520805 ns | 601632 ns | 0.87 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1083 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1125 ns | 1000 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1042 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33824 ns | 33030 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7959 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8209 ns | 7834 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 7958 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8333 ns | 7917 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 212223 ns | 230449 ns | 0.92 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23333 ns | 23375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23208 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23708 ns | 23417 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23417 ns | 23291.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18851 ns | 19113 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52229.5 ns | 52895.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52833 ns | 52125 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53083 ns | 52584 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52250 ns | 52625 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 290323.5 ns | 342803.5 ns | 0.85 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1444563 ns | 1397500 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1416667 ns | 1398958.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1398000 ns | 1397959 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1397854.5 ns | 1398875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196725 ns | 197104 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4896875 ns | 5023770.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5033250 ns | 5008292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4732209 ns | 5019833 ns | 0.94 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4656125 ns | 5023375 ns | 0.93 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 628203 ns | 702077 ns | 0.89 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3050917 ns | 3038750 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2085917 ns | 2100458.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2299229 ns | 2297437.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4546416 ns | 4584063 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583838 ns | 584858 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24394208 ns | 24409833.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19011416 ns | 18883709 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18859500 ns | 18947166 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36642625 ns | 37250541.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3197291 ns | 3224554 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34077084 ns | 34090937.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28833000 ns | 28380792 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27989458.5 ns | 28106292 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41738687 ns | 42177791.5 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144879458 ns | 143784584 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 141694833 ns | 141596854 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124678750 ns | 124395791.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 172183417 ns | 175754958 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22781734 ns | 22549895 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1324126083.5 ns | 984048979.5 ns | 1.35 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 862326062 ns | 959988750 ns | 0.90 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 822435542 ns | 915312208 ns | 0.90 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 674446500 ns | 686199250 ns | 0.98 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117954250 ns | 118476719 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72708 ns | 73916.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73458.5 ns | 73959 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78666 ns | 77458 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 80771 ns | 76083 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 247050 ns | 297349 ns | 0.83 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 238083 ns | 191541.5 ns | 1.24 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 259458.5 ns | 190083.5 ns | 1.36 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 192500 ns | 193375 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 286542 ns | 205187.5 ns | 1.40 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1278131.5 ns | 1506117 ns | 0.85 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35505542 ns | 35382375 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35779583 ns | 35433270.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32132583.5 ns | 32211792 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 41038625 ns | 41320000 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842363 ns | 5840191 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148318166 ns | 146474167 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 156700916.5 ns | 152518104 ns | 1.03 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 134987916 ns | 136637000 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287669583 ns | 228362188 ns | 1.26 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34854720.5 ns | 34868018 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121572750 ns | 121348250 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181182167 ns | 174167875 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147944417 ns | 147844437.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105962708.5 ns | 109402458.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5473146 ns | 5468413 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 473565583 ns | 470104666 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 483250896 ns | 467279541 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 442160333 ns | 440692042 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742736375 ns | 756630062.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32265706.5 ns | 32246487 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 709724354 ns | 708913479.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 670668250 ns | 654076083.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 576699666.5 ns | 576331875 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 846352792 ns | 868264500 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1306395.5 ns | 1341333 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 761604 ns | 969333 ns | 0.79 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 909459 ns | 905958 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2049500 ns | 2085854 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 564019 ns | 569576 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2968812.5 ns | 2971041 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2494708 ns | 2591478.5 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2620270.5 ns | 2624375 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3705625.5 ns | 3763333 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1753312 ns | 1911630 ns | 0.92 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6649125 ns | 6646062.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6469521 ns | 6511333.5 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6522166.5 ns | 6212750 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4446500 ns | 4512000 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7417 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5334 ns | 5958 ns | 0.90 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 6209 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10334 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25527 ns | 25303 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212458 ns | 212916 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220833 ns | 220187.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221896 ns | 220125 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218229.5 ns | 206750 ns | 1.06 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 256834.5 ns | 309128.5 ns | 0.83 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 302398208 ns | 301567020.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 280301750 ns | 220519459 ns | 1.27 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 195492125 ns | 195586458 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 312069834 ns | 308649729.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7871517 ns | 7785127 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1078719979.5 ns | 1082593062 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 989013042 ns | 897538125 ns | 1.10 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 814818166 ns | 875339084 ns | 0.93 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1156526521 ns | 1186516209 ns | 0.97 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26511423 ns | 26500341 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5083 ns | 5979.5 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5709 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6333 ns | 6333 ns | 1 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5292 ns | 5958 ns | 0.89 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 148964 ns | 202987.5 ns | 0.73 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 7750 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7583 ns | 6937.5 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7395.5 ns | 7166 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7583.5 ns | 7208 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 625721.5 ns | 715249.5 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 625 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 541 ns | 1.16 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 792 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 541 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24037 ns | 24024 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 9417 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9416 ns | 8875 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9375 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 8917 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 217991.5 ns | 236297 ns | 0.92 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351083.5 ns | 352166.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351833 ns | 352000 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352313 ns | 352625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352104.5 ns | 356354.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21184.5 ns | 21208 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 778000 ns | 821042 ns | 0.95 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 782396 ns | 774875 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 808562.5 ns | 774479.5 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 816583 ns | 784125 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 268217.5 ns | 305684 ns | 0.88 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 336938 ns | 329917 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 314604 ns | 340958 ns | 0.92 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 454458 ns | 452250 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 332020.5 ns | 310020.5 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18180 ns | 18040 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 681937.5 ns | 694333 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 738625 ns | 741979.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1029500 ns | 1031791.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 696625 ns | 699708 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 249957.5 ns | 288808 ns | 0.87 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 354229 ns | 346250 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 327229.5 ns | 346812.5 ns | 0.94 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 418250 ns | 414166 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 369104.5 ns | 354500.5 ns | 1.04 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22561 ns | 22617 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 745292 ns | 756959 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 750916 ns | 745375 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1075584 ns | 1075625 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 825833 ns | 831312.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 217990.5 ns | 224885 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3292 ns | 3667 ns | 0.90 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3625 ns | 3542 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3708 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3687.5 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17937 ns | 17766 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4334 ns | 4167 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4458 ns | 4166 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4417 ns | 4250 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4208 ns | 4292 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 264329 ns | 276086.5 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3667 ns | 3792 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3875 ns | 3625 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4625 ns | 4208 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3750 ns | 3833.5 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 193740 ns | 212882.5 ns | 0.91 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8208 ns | 8250 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8709 ns | 8083 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8750 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8583.5 ns | 8375 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1184156 ns | 1204651 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203500 ns | 204125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208750 ns | 210208 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211167 ns | 211541 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201083 ns | 200916 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35565 ns | 35493 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645875 ns | 644375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623750 ns | 622542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 635250 ns | 621416 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592646 ns | 633021 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 327199.5 ns | 345266 ns | 0.95 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 959708.5 ns | 960666 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 938042 ns | 934042 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 954875 ns | 962459 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1293708 ns | 1306000 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206955 ns | 205400 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4483167 ns | 4490416.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4618833 ns | 4462042 ns | 1.04 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4317645.5 ns | 4294020.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6243708.5 ns | 6374063 ns | 0.98 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 961553 ns | 948468 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3750 ns | 3250 ns | 1.15 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 3375 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4250 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3708 ns | 3520.5 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 193389 ns | 233579.5 ns | 0.83 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7208 ns | 7458 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7917 ns | 6959 ns | 1.14 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7375 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 6834 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1009480 ns | 1006195 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1631000 ns | 1636208 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1166042 ns | 1184104 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1365625 ns | 1372937.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2367458 ns | 2438792 ns | 0.97 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215455 ns | 214646 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12301250 ns | 12368625 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9628833 ns | 9576084 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9292104.5 ns | 9277625 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18003312.5 ns | 18160625 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1946489 ns | 1950393 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17321917 ns | 17396666.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14455625 ns | 14353333 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14321416.5 ns | 14318812 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21161000 ns | 21205104.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 116959 ns | 89646 ns | 1.30 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90000 ns | 87125 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92500 ns | 91416.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89667 ns | 89833.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126257.5 ns | 126102.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1952542 ns | 2028271 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2033584 ns | 2016292 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2036458 ns | 1722416.5 ns | 1.18 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026500 ns | 2034291.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1082649.5 ns | 1030027.5 ns | 1.05 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 342854.5 ns | 337333 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 324042 ns | 348770.5 ns | 0.93 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 395834 ns | 398395.5 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 311645.5 ns | 291854 ns | 1.07 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16225 ns | 16026 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 698292 ns | 701687.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 727291 ns | 738854 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1022499.5 ns | 1026479 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 655000 ns | 659000 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 195381.5 ns | 192241.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7000 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5875 ns | 0.90 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6000 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10084 ns | 10166 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34236 ns | 34083 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212375 ns | 212666.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221500 ns | 222291.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 233708 ns | 219708 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217250 ns | 214959 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 351634.5 ns | 310410 ns | 1.13 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3750 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22712 ns | 22765 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14417 ns | 14417 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14209 ns | 14333 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14416 ns | 14375 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14375 ns | 14250 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 483403.5 ns | 477947.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 136416 ns | 92104.5 ns | 1.48 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 93292 ns | 92791.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97979.5 ns | 93458.5 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 138291 ns | 94479 ns | 1.46 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125690 ns | 125492 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920333 ns | 1915542 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1941542 ns | 1931312.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928708 ns | 1652875 ns | 1.17 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1924833 ns | 1932812.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1091617.5 ns | 963079 ns | 1.13 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 873125 ns | 868062.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 800083 ns | 820062.5 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1213708 ns | 1224562.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 955958 ns | 939937.5 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 275553.5 ns | 272869 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2789084 ns | 2818583 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2533416.5 ns | 2448750 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3351166.5 ns | 3349041.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3416521 ns | 3429667 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1671794 ns | 1623362 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14250 ns | 15416 ns | 0.92 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16229 ns | 16479 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18500 ns | 18042 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16666 ns | 15416 ns | 1.08 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 176658 ns | 142624 ns | 1.24 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 256791 ns | 261396 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216875 ns | 216208 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216666.5 ns | 216500 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216917 ns | 259938 ns | 0.83 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 714190.5 ns | 645742 ns | 1.11 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 218709 ns | 221958 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221500 ns | 221521 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222167 ns | 222250 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 219000 ns | 219959 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 284510.5 ns | 270517 ns | 1.05 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 494666 ns | 557249.5 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 498749.5 ns | 495479.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497250 ns | 498021 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 560875 ns | 511625 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1421507 ns | 1384352 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 328417 ns | 329292 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 311917 ns | 332479 ns | 0.94 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 389416 ns | 373062 ns | 1.04 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 322229 ns | 302292 ns | 1.07 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16925.5 ns | 16837 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 710750 ns | 712062.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 728271 ns | 736750 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1020583 ns | 1027458 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 666375 ns | 669125 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 195423 ns | 196687 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15854 ns | 18083 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18187.5 ns | 19417 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19791.5 ns | 19812.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17125 ns | 18125 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 176945 ns | 146545.5 ns | 1.21 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212208 ns | 219125 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 217125 ns | 216937.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 224729 ns | 215041.5 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221750 ns | 212667 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1030393 ns | 944620 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4250 ns | 4375 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4625 ns | 3833 ns | 1.21 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4958.5 ns | 5000 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3833.5 ns | 4375 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 242789 ns | 215167 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10020.5 ns | 10583 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10750 ns | 9875 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10834 ns | 10917 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10354 ns | 10125 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1096605.5 ns | 1064041.5 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3292 ns | 3208 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 3083 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3854.5 ns | 4083 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3042 ns | 3625 ns | 0.84 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 249519 ns | 238602 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7291 ns | 7125 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7292 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7666 ns | 7500 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7417 ns | 7625 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1104412 ns | 1072124 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23491541 ns | 23510042 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43048000 ns | 35239042 ns | 1.22 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37820875 ns | 37521895.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34890750 ns | 35273916 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1856324 ns | 1835321 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184204375 ns | 185664541 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 172049312.5 ns | 160177375 ns | 1.07 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146204354.5 ns | 146706312.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 414413041 ns | 422527708.5 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16518831.5 ns | 16512466 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 427089542 ns | 425858208 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 257885125 ns | 253069291.5 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232435333 ns | 231211500 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 487128167 ns | 494264646 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 181770.5 ns | 182792 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183750 ns | 184542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185667 ns | 185187.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182458 ns | 183709 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 230246 ns | 212062 ns | 1.09 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 589417 ns | 623709 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 588333.5 ns | 585625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 596874.5 ns | 587750 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 597604 ns | 635312.5 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1117454.5 ns | 1070840 ns | 1.04 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3826354 ns | 3842250 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3667353.5 ns | 3651042 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3512500 ns | 3490583 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5363750 ns | 5452395.5 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 537508 ns | 531221 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17313854 ns | 17413417 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17708458.5 ns | 17282104.5 ns | 1.02 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16567146 ns | 16562771 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22157916.5 ns | 23195250 ns | 0.96 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2616426 ns | 2624623.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 625 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 708 ns | 0.88 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32262 ns | 33132 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9084 ns | 9333 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 8687.5 ns | 1.09 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9459 ns | 9958 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9291 ns | 9104 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 266752 ns | 269468.5 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 501606583 ns | 502544125 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 504711792 ns | 428545375 ns | 1.18 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 433731812.5 ns | 370375396 ns | 1.17 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 675887813 ns | 676271583.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12472195 ns | 12479257.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2045113375 ns | 2046192292 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1661340625 ns | 1629661125 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1497713333.5 ns | 1491097083.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2221091187.5 ns | 2229658271 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49072522.5 ns | 49368782 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1634250 ns | 1646083 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1163708 ns | 1197833 ns | 0.97 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1361458 ns | 1360625 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2497000 ns | 2466020.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215136 ns | 218218 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12684958 ns | 12725062 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 10006667 ns | 9942708 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9642583 ns | 9678458.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18382166 ns | 18472437.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2044162.5 ns | 2043078 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17670958.5 ns | 17708209 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14802250 ns | 14671208.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14555750 ns | 14589541.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21305750 ns | 21579542 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26167 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26417 ns | 26208 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26209 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23856 ns | 24352 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66875 ns | 67041 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67209 ns | 66792 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67125 ns | 67125 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66750 ns | 66584 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 405103 ns | 402288.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203416 ns | 203166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209125 ns | 209292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209584 ns | 210166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200667 ns | 199791 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27000 ns | 27685 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602375 ns | 609250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 669125 ns | 622062.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622834 ns | 630875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631125 ns | 631917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352778 ns | 357035 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 648917 ns | 538250 ns | 1.21 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 642917 ns | 641250 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 544167 ns | 600459 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 639083 ns | 670041 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131841.5 ns | 132946 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2236583 ns | 2237333.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2283750 ns | 2232437 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1492958 ns | 2242500 ns | 0.67 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2238375 ns | 2254875 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1252532 ns | 1187953 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16375 ns | 18937 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18208.5 ns | 18812.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19292 ns | 19708 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16333 ns | 19000 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146257.5 ns | 147397 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225875 ns | 218791.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 254708.5 ns | 221416 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220145.5 ns | 229084 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 261833 ns | 260604.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1134332.5 ns | 1013651.5 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 584 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 542 ns | 1.23 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 709 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23621 ns | 23882 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 10125 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9583 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10084 ns | 9750 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9542 ns | 9583 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 259757 ns | 264213 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4958 ns | 5708 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5084 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6875 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 6125 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 236718.5 ns | 236698 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6917 ns | 7500 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7208 ns | 7167 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7417 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 7125 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 801879.5 ns | 814113 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1917 ns | 2333 ns | 0.82 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2167 ns | 2250 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2542 ns | 2250 ns | 1.13 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2209 ns | 2250 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17960.5 ns | 18404 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6666 ns | 6625 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6750 ns | 6416 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6958 ns | 6667 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6708 ns | 6500 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 333776.5 ns | 335568.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 746500 ns | 749208 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749354 ns | 746584 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749479 ns | 747708 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 752041.5 ns | 749437.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21034 ns | 21512 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 775729 ns | 810792 ns | 0.96 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 787812.5 ns | 772583 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 773209 ns | 791666 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792937.5 ns | 815292 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 378579.5 ns | 300193.5 ns | 1.26 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7042 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 6000 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5959 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 10417 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33702 ns | 33409 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220541 ns | 260625 ns | 0.85 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228500 ns | 227833 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229083.5 ns | 229042 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257687.5 ns | 239834 ns | 1.07 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 366571.5 ns | 365578.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10084 ns | 10208 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10375 ns | 9917 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10750 ns | 10792 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9791 ns | 10709 ns | 0.91 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 255521 ns | 247512.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24792 ns | 24958 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 26604.5 ns | 24250 ns | 1.10 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25500 ns | 23791 ns | 1.07 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24750 ns | 24916 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1129584 ns | 1135547.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 105995125 ns | 106115583 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 125793000 ns | 118501687.5 ns | 1.06 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120388208 ns | 120163958 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 119036125 ns | 118736333 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2663640 ns | 2655926 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 393141687.5 ns | 392620458 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 378610917 ns | 366282792 ns | 1.03 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 358649208 ns | 355680542 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 482237667 ns | 483640417 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15147052.5 ns | 15268892 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 758670979.5 ns | 758389270.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 774479250 ns | 585230833 ns | 1.32 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 745435354 ns | 746534979.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 768055062.5 ns | 959125958 ns | 0.80 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6354.5 ns | 7396 ns | 0.86 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7667 ns | 6645.5 ns | 1.15 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8209 ns | 8917 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7209 ns | 7792 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 244571.5 ns | 237625.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13875 ns | 14208 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14250 ns | 13854.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14208 ns | 15062.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13834 ns | 13959 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1083702.5 ns | 1099893 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5125 ns | 6500 ns | 0.79 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6604.5 ns | 5583 ns | 1.18 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6750 ns | 6645.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 6666 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 238521.5 ns | 238125 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12125 ns | 12750 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 12333 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12625 ns | 12625 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12542 ns | 12459 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 799702 ns | 799101.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 343416 ns | 342896 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 320583 ns | 344667 ns | 0.93 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 398646 ns | 398458 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 311958 ns | 295917 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17585 ns | 17123 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 701708.5 ns | 709000 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 729708 ns | 732875 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1022708 ns | 1023500 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 663646 ns | 661541.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 202905.5 ns | 201466.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 416 ns | 0.70 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 292 ns | 1.43 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 458 ns | 0.91 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24123 ns | 23927 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6084 ns | 6625 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6750 ns | 6208 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6584 ns | 6458 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6333 ns | 6042 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 244929.5 ns | 243036.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5834 ns | 5875 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5917 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6083 ns | 6000 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 5833 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25271 ns | 24827 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21208 ns | 21708 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21708 ns | 21208 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22334 ns | 21750 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21584 ns | 21041 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 269606 ns | 266228 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146291 ns | 144209 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145562.5 ns | 144021 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146520.5 ns | 146750 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 187916.5 ns | 147687.5 ns | 1.27 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167772.5 ns | 168564.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 856895.5 ns | 1309750 ns | 0.65 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1318270.5 ns | 1253895.5 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1319917 ns | 1328541 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1319542 ns | 1340854.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1369184 ns | 1366226 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21917 ns | 22458.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22250 ns | 24292 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23209 ns | 24792 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21333.5 ns | 22770.5 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 358843.5 ns | 292862.5 ns | 1.23 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117458 ns | 130708 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 175500 ns | 117458 ns | 1.49 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118854.5 ns | 119625 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130625 ns | 127520.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1501306 ns | 1486893 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 292 ns | 1.42 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23632 ns | 23357 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6083 ns | 6583 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6833 ns | 6208 ns | 1.10 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6541 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6292 ns | 6250 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 262048.5 ns | 259976.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4791.5 ns | 4333.5 ns | 1.11 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4583 ns | 4375 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5250 ns | 5333 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4250 ns | 5041 ns | 0.84 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 259862 ns | 257469 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 9666.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10208 ns | 9979.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10459 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10292 ns | 10375 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1366687.5 ns | 1365106 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23932 ns | 23579.5 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5708 ns | 5666 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5959 ns | 5667 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6125 ns | 5792 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5708 ns | 5625 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 280444.5 ns | 278192.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6858875 ns | 6816875 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6356666.5 ns | 6362583 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6558354 ns | 6488041 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7638125 ns | 7598354 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215319 ns | 215727 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24020854 ns | 24100104 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21336812.5 ns | 21301625 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21069625 ns | 21056624.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29727208 ns | 29843458 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2103488 ns | 2118629 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37386500 ns | 37406124.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45855312 ns | 34318854 ns | 1.34 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45723854 ns | 45786625 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 37910104.5 ns | 49609729 ns | 0.76 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5458 ns | 6270.5 ns | 0.87 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6500 ns | 5584 ns | 1.16 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 7125 ns | 0.92 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5667 ns | 6375 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 238563.5 ns | 236566.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8250 ns | 8458 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8083 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 8291 ns | 1.13 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 8541 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1077587 ns | 1064326 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1545521 ns | 1546833 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1248854 ns | 1262667 ns | 0.99 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1616959 ns | 1614875 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2159562 ns | 2100437.5 ns | 1.03 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 283099 ns | 273505 ns | 1.04 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7874500 ns | 7902479 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6609250 ns | 6457875 ns | 1.02 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7081396 ns | 7153750 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10467646 ns | 10520687 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1880933.5 ns | 1860306 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 338958 ns | 338041 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 329417 ns | 344187.5 ns | 0.96 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 416208 ns | 403666 ns | 1.03 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 346000 ns | 325146 ns | 1.06 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42726 ns | 46661 ns | 0.92 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 729291.5 ns | 739354.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 779145.5 ns | 791041.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1069708 ns | 1068125 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 747500 ns | 777583 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 300168 ns | 308223 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397500 ns | 397500 ns | 1 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 210958 ns | 287708 ns | 0.73 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287875 ns | 287834 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750791 ns | 751187.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44334 ns | 43983 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 669417 ns | 669375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 471583 ns | 531834 ns | 0.89 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 532312.5 ns | 530584 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973125 ns | 974333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 191851 ns | 189215 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 638229 ns | 660395.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 595750 ns | 644167 ns | 0.92 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 634583.5 ns | 613333 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 639209 ns | 624292 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132118 ns | 132150 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2339333.5 ns | 2236750 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2471792 ns | 2459583.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2452750 ns | 2461041.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2462792 ns | 2472250 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1204724 ns | 1281622 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 342834 ns | 341645.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 321667 ns | 342792 ns | 0.94 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 399520.5 ns | 398021 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 310625 ns | 294000 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16619 ns | 16403 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 698708.5 ns | 701083 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 726959 ns | 730208 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1020042 ns | 1025708.5 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 654208 ns | 660646 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 199213 ns | 198717 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458250 ns | 1458166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1492125 ns | 1497709 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1498333 ns | 1498458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1440625 ns | 1441209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41986 ns | 41537 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5106291 ns | 5119375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5286750 ns | 5293042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5038208 ns | 5308020.5 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4995083.5 ns | 5000771 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 200103.5 ns | 198307.5 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3709 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33402.5 ns | 33127 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15000 ns | 15166 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15375 ns | 15125 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15416 ns | 15292 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15125 ns | 14834 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 381057 ns | 376728 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70750 ns | 71208 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71125 ns | 71208 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71250 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71208 ns | 70875 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113931 ns | 112994 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 316458 ns | 317541.5 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 326395.5 ns | 323479.5 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324958.5 ns | 318833 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317042 ns | 322541 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 196642 ns | 193878 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1041 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1084 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24139 ns | 23624 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7666 ns | 8083 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 7875 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8375 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7916 ns | 8041 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 264797.5 ns | 262928.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 468542 ns | 463437.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 458583 ns | 467917 ns | 0.98 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 550499.5 ns | 555208 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 557354 ns | 535979.5 ns | 1.04 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129474.5 ns | 131030 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1378667 ns | 1392125 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1407354 ns | 1366083.5 ns | 1.03 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1620083 ns | 1631541 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1575771 ns | 1632667 ns | 0.97 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274432 ns | 274742 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 417 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32500 ns | 32052 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5917 ns | 6375 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6708 ns | 6417 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6375 ns | 6541 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6167 ns | 6209 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 267871.5 ns | 267424 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1784500 ns | 1722042 ns | 1.04 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1728562.5 ns | 1722500 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1724000 ns | 1723833.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1773500 ns | 1723750 ns | 1.03 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169383 ns | 169639.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4347916.5 ns | 3951375.5 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4395208 ns | 4344708.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4356291 ns | 4311167 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4355708 ns | 4386625 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1176388 ns | 1177448 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6583 ns | 6833 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7375 ns | 7083 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7167 ns | 9000 ns | 0.80 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6834 ns | 6958 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20833 ns | 21025 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 32833 ns | 51500 ns | 0.64 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33125 ns | 32500 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 32708 ns | 33416 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 70188 ns | 68687.5 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 219443 ns | 211473.5 ns | 1.04 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 349250 ns | 347416 ns | 1.01 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 324917 ns | 346562.5 ns | 0.94 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 416250 ns | 405812.5 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 319959 ns | 300895.5 ns | 1.06 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18352 ns | 18759 ns | 0.98 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 716166.5 ns | 717750 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 733562.5 ns | 738125 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1034625 ns | 1033854 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 680062.5 ns | 685145.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 343358 ns | 346768 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75375 ns | 75208 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75375 ns | 75208 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75250 ns | 75209 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75417 ns | 75250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46573.5 ns | 47731 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 323708 ns | 325375 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 328375 ns | 341167 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 337667 ns | 327917 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 323916 ns | 328875 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 210239 ns | 213096 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485625 ns | 1484542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1517625 ns | 1524541.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1525916 ns | 1525708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1464958 ns | 1464917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52987 ns | 52950 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5107375 ns | 5117000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5290249.5 ns | 5293834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5022896 ns | 5304042 ns | 0.95 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4990771.5 ns | 5007708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 206297 ns | 208140.5 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28125 ns | 28291 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28250 ns | 28125 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28458 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28292 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24564 ns | 25236 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66125 ns | 66417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66625 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66416 ns | 66292 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66375 ns | 66333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 535111 ns | 544174 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1491375 ns | 1473145.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 929500 ns | 1063750 ns | 0.87 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1106125 ns | 1081749.5 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2118479.5 ns | 2244854 ns | 0.94 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 569055 ns | 581542 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 2874354 ns | 3075833.5 ns | 0.93 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2606312.5 ns | 2738625 ns | 0.95 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2740292 ns | 2748312.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3820625.5 ns | 3862521 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2075439 ns | 2089266.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8426958 ns | 8827041 ns | 0.95 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8751479 ns | 8764208 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8782333 ns | 8760437.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6369124.5 ns | 6449750 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 79250 ns | 82520.5 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81250 ns | 80750 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82250 ns | 82667 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 99334 ns | 88708 ns | 1.12 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192927.5 ns | 192989 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1987333.5 ns | 2019083 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1989937.5 ns | 2020500 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2008750 ns | 1750146 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022667 ns | 2040166 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 817057 ns | 811335 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
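
As a reading aid for the results above: in every row checked, the final number matches the first timing divided by the second, rounded to two decimals. The sketch below recomputes that ratio for three rows copied from the table. The helper name `ratio` is hypothetical, and treating the first timing as the current run and the second as the previous run is an assumption based on the usual github-action-benchmark layout, which this section does not state. Under that assumption, a ratio above 1 would mean the current run is slower.

```python
def ratio(first_ns: float, second_ns: float) -> str:
    """Return first/second formatted to two decimals (hypothetical helper)."""
    if second_ns <= 0:
        raise ValueError("second timing must be positive")
    return f"{first_ns / second_ns:.2f}"


# (first timing in ns, second timing in ns, ratio reported in the table)
sample_rows = [
    (1492958, 2242500, "0.67"),      # layernorm ... zygote/CPU/8 thread(s)
    (774479250, 585230833, "1.32"),  # Conv 32 => 32 ... enzyme/CPU/4 thread(s)
    (5125, 6500, "0.79"),            # groupnorm ... (32 x 32)/forward/CPU/2 thread(s)
]

for first_ns, second_ns, reported in sample_rows:
    # Print the recomputed ratio next to the value shown in the table.
    print(ratio(first_ns, second_ns), reported)
```

Note that the table prints `1` rather than `1.00` when both timings are identical, whereas this helper always emits two decimals.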