-
Notifications
You must be signed in to change notification settings - Fork 11
/
030 SIMD.jl
234 lines (189 loc) · 5.84 KB
/
030 SIMD.jl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
import Pkg; Pkg.activate(@__DIR__); Pkg.instantiate()
# # SIMD: The parallelism that can (sometimes) happen automatically
#
# SIMD: Single-instruction, multiple data
#
# (Also confusingly called vectorization)
#-
# ## The architecture
#
# Instead of computing four sums sequentially:
#
# \begin{align}
# x_1 + y_1 &\rightarrow z_1 \\
# x_2 + y_2 &\rightarrow z_2 \\
# x_3 + y_3 &\rightarrow z_3 \\
# x_4 + y_4 &\rightarrow z_4
# \end{align}
#
# Modern processors have vector processing units that can do it all at once:
#
# $$
# \left(\begin{array}{cc}
# x_1 \\
# x_2 \\
# x_3 \\
# x_4
# \end{array}\right)
# +
# \left(\begin{array}{cc}
# y_1 \\
# y_2 \\
# y_3 \\
# y_4
# \end{array}\right)
# \rightarrow
# \left(\begin{array}{cc}
# z_1 \\
# z_2 \\
# z_3 \\
# z_4
# \end{array}\right)
# $$
#-
# ## Making it happen
#-
# Simple task: compute the sum of a vector:
A = rand(100_000)
function simplesum(A)
result = zero(eltype(A))
for i in eachindex(A)
@inbounds result += A[i]
end
return result
end
simplesum(A)
#-
using BenchmarkTools
@btime simplesum($A)
# So, is that good?
@btime sum($A)
# We're slower that the builtin `sum` — and we're getting a different answer, too! Let's look at what happens with a 32-bit float instead of a 64 bit one. Each element has half the number of bits, so lets also double the length (so the total number of bits processed remains constant).
A32 = rand(Float32, length(A)*2)
@btime simplesum($A32)
@btime sum($A32);
# That's even worse! What's going on here? We're seeing an even multiple number
# difference in our performance — perhaps Julia's builtin sum is using some
# parallelism? Let's try using SIMD ourselves:
function simdsum(A)
result = zero(eltype(A))
@simd for i in eachindex(A)
@inbounds result += A[i]
end
return result
end
@btime simdsum($A)
@btime simdsum($A32)
# What did that do and why don't we always use `@simd for` — or why doesn't Julia
# just always use `@simd` for every `for` loop automatically? Look at the values:
simplesum(A), simdsum(A), sum(A)
#-
simplesum(A32), simdsum(A32), sum(A32)
# Why aren't they the same?
#
# Without `@simd`, Julia is doing _exactly_ what we told it to do: it's taking
# each element of our array and adding it to a big pile sequentially. Our answer
# is smaller than what Julia's builtin `sum` thinks it is: that's because as our
# pile gets bigger we begin losing the lower bits of each element that we're
# adding, and those small losses begin to add up!
#
# The `@simd` macro tells Julia that it can re-arrange floating point additions —
# even if it would change the answer. Depending on your CPU, this may lead to 2x or 4x
# or even 8x parallelism. Essentially, Julia is computing independent sums for
# the even indices and the odd indices simultaneously:
#
# \begin{align}
# odds &\leftarrow 0 \\
# evens &\leftarrow 0 \\
# \text{loop}&\ \text{odd}\ i: \\
# &\left(\begin{array}{cc}
# odds \\
# evens
# \end{array}\right)
# \leftarrow
# \left(\begin{array}{cc}
# odds \\
# evens
# \end{array}\right)
# +
# \left(\begin{array}{cc}
# x_{i} \\
# x_{i+1}
# \end{array}\right) \\
# total &\leftarrow evens + odds
# \end{align}
#
# In many cases, Julia can and does know that a for-loop can be SIMD-ed and it
# will take advantage of this by default!
B = rand(1:10, 100_000)
@btime simplesum($B)
@btime sum($B)
B32 = rand(Int32(1):Int32(10), length(B)*2)
@btime simplesum($B32)
@btime simdsum($B32)
# How can we see if something is getting vectorized?
@code_llvm simdsum(A32)
# So what are the challenges?
#
# * Biggest hurdle is that you have to convince Julia and LLVM that it's able to
# use SIMD instructions for your given algorithm. That's not always possible.
# * There are lots of limitations of what can and cannot be SIMD-ed:
@doc @simd
# * You do need to think through the consequences of re-ordering your algorithm.
#-
# ## A slightly trickier case
using BenchmarkTools
#-
function diff!(A, B)
A[1] = B[1]
for i in 2:length(A)
@inbounds A[i] = B[i] - B[i-1]
end
return A
end
A = zeros(Float32, 100_000)
B = rand(Float32, 100_000)
diff!(A, B)
[B[1];diff(B)] == A
#-
@btime diff!($A, $B)
@btime diff($B);
# But what happens if we do it in-place?
Bcopy = copy(B)
@btime diff!($Bcopy, $Bcopy);
# What happened?
@code_llvm diff!(A, B)
# We can manually assert that arrays don't alias (or have any loop-dependencies),
# with the very special `@simd ivdep` flag, but this can be disastrous:
function unsafe_diff!(A, B)
A[1] = B[1]
@simd ivdep for i in 2:length(A)
@inbounds A[i] = B[i] - B[i-1]
end
return A
end
@btime unsafe_diff!($A, $B)
[B[1];diff(B)] == A
Bcopy = copy(B)
unsafe_diff!(Bcopy, Bcopy)
[B[1];diff(B)] == Bcopy
# If you really want to get your hands dirty, you can use the [SIMD.jl](https://github.com/eschnett/SIMD.jl)
# package to manually specify those `<8 x float>` things that LLVM generates.
# BUT: this is tricky and a pain; often it's just to be aware of what makes
# Julia code automatically SIMD-able, some of the cases where it may fail, and
# how to check its work.
#-
# ## SIMD
#
# * Exploits built-in parallelism in a processor
# * Best for small, tight innermost loops
# * Often happens automatically if you're careful
# * Follow the [perforance best practices](https://docs.julialang.org/en/v1/manual/performance-tips/)
# * `@inbounds` any array acesses
# * No branches or (non-inlined) function calls
# * Can use `@simd` to allow Julia to break some rules to make it happen
# * But be careful, especially with `@simd ivdep`!
# * Depending on processor and types involved, can yield 2-16x gains with extraordinarily little overhead
# * Smaller datatypes can improve this further; use `Float32` instead of `Float64`
# if possible, `Int32` instead of `Int64`, etc.
# * When buying a new processor, look for [AVX-512](https://en.wikichip.org/wiki/x86/avx-512) support