<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title></title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<script src="custom.js"></script>
<link rel="stylesheet" href="custom.css">
</head>
<body>
<!-- Top panel: Title, Authors -->
<div class="jumbotron text-center page-width">
<p style="text-align: left;"><a href="https://issp2020.yale.edu/" target="_blank" style="color:black">12<sup>th</sup> International Seminar on Speech Production ISSP 2020 (Poster: 163)</a></p>
<h1>Estimating "Good" variability in Speech Production using Invertible Neural Networks</h1>
<p style="font-size:larger">Jaekoo Kang<sup>1,2</sup>, Hosung Nam<sup>3,4</sup> and D. H. Whalen<sup>1,2,4</sup></p>
<p><sup>1</sup>The Graduate Center, CUNY; <sup>2</sup>Haskins Laboratories; <sup>3</sup>Korea University; <sup>4</sup>Yale University</p>
</div>
<!-- Middle panel: Contents -->
<div class="container page-width show-border">
<div class="row row-margin show-border" style="display:block">
<p class="text-inner-margin">
This project page supplements the poster submitted to ISSP 2020 with detailed descriptions
of the data and modeling techniques.
</p>
<hr>
<h2>Introduction</h2>
<p class="text-inner-margin">
Variability is inherent in skilled human motor movements. Playing a piano or riding a bicycle
requires skilled coordination of motor elements, such as arms and legs, to achieve a motor goal.
Although the movements are skillful, the positions of the motor elements are never exactly the same,
no matter how many times the movement is repeated (“repetition without repetition”; Bernstein, 1967).
This variability across repeated movements, previously disregarded as noise, can be understood as an
informative biological feature of the human motor system because of its underlying structure and regularity
(Latash et al., 2002; Riley & Turvey, 2002; Sternad, 2018; Whalen & Chen, 2019).
One such structure is that skilled motor movements are highly synergistic and flexibly organized
when their variability is decomposed into “good” and “bad” parts, that is, variability that leaves the motor goal unchanged versus variability that alters it
(i.e., the uncontrolled manifold approach or the UCM; Latash et al., 2002; Scholz & Schöner, 1999, 2014).
Whether variability in speech production can also be decomposed according to the same principle,
however, has rarely been examined to date. This project focuses on the "good" part of variability
in speech production and explores the use of invertible neural networks as a quantitative approach to understanding
how "good" variability is structured and how it can be learned by these neural-net models.
</p>
</div>
<div class="row row-margin show-border">
<h2>Data</h2>
<p class="text-inner-margin">
The Haskins IEEE rate comparison database (Sivaraman et al., 2015; henceforth, the EMA-IEEE database) was used
for the UCM analysis and FlowINN modeling. This database includes simultaneous articulatory and
acoustic recordings from eight native American English speakers (4 females, 4 males).
The articulatory data were collected using electromagnetic articulography (EMA), where eight pellet sensors,
sampled at 100 Hz, were attached to speakers’ articulators and later corrected for head movements.
Synchronous acoustic data were recorded at a sampling rate of 44.1 kHz.
Speakers read 720 phonetically balanced sentences at both a normal and a fast rate,
providing various consonant contexts to examine at both rates across speakers.
Nine English vowels (/i, ɪ, ɛ, æ, ʌ, ɔ, ɑ, ʊ, u/) were selected for modeling
the forward mapping functions, while the analysis focused on the four front vowels (/i, ɪ, ɛ, æ/),
which represent vertical articulatory movement.
<br><br>
Data preprocessing steps included data normalization and dimension reduction.
The purpose of normalizing articulatory and acoustic data was to account for individual differences
as well as differences in the ranges of values and units (millimeters versus hertz).
For articulatory data, six pellet sensors (each with a horizontal and a vertical coordinate in the sagittal plane)
were selected: TR (Tongue Root), TB (Tongue Body), TT (Tongue Tip), UL (Upper Lip), LL (Lower Lip) and JAW.
Twelve-dimensional kinematic data (six sensors with x-y coordinates) were Z-scored by speaker and then
reduced to three principal components (PCs) using principal component analysis; the PCs
roughly reflect the vertical, horizontal and residual movement of the articulators.
For acoustic data, the first two formant frequencies (F1, F2) were used and likewise Z-scored by speaker.
For each vowel, both articulatory and acoustic data were extracted at nine equidistant,
normalized time points, of which the 5th time point was the midpoint.
Outliers were identified and removed prior to data normalization in order to reduce noise in the data.
This data normalization procedure followed the guidelines of Whalen et al. (2018).
<br><br>
The following plots show the vowel production data of a single speaker (F01) at the normal rate. The first four-by-two grid of plots
visualizes, for each vowel, the articulation and the corresponding acoustics. The bottom plots show, for the same speaker,
the effects of moving along each of the Principal Component dimensions.
<br><br>
<div style="display:block;text-align:center">
<h5>Articulatory and acoustic data for the four front vowels /i, ɪ, ɛ, æ/ of speaker F01</h5>
<img src="https://raw.githubusercontent.com/jaekookang/issp2020/master/img/F01_N_ctr.png" style="width:50%">
<br><br>
<h5>Synthesis along the first three Principal Components in articulatory data from speaker F01</h5>
<img src="https://raw.githubusercontent.com/jaekookang/issp2020/master/img/F01_pca_gradient.png" style="width:80%">
</div>
</p>
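<p class="text-inner-margin">
Two of the preprocessing steps described above, Z-scoring by speaker and sampling nine equidistant time points,
can be sketched as follows. This is only an illustration and not the project's actual code: the function names,
the sample values and the choice of the population standard deviation are assumptions.
</p>
<pre>
<code>
```python
from statistics import mean, pstdev


def zscore_by_speaker(values, speakers):
    """Z-score each value using the mean and SD of its own speaker."""
    stats = {}
    for spk in set(speakers):
        spk_values = [v for v, s in zip(values, speakers) if s == spk]
        # Assumption: population SD; a sample SD would work the same way
        stats[spk] = (mean(spk_values), pstdev(spk_values))
    return [(v - stats[s][0]) / stats[s][1] for v, s in zip(values, speakers)]


def normalized_time_points(onset, offset, n_points=9):
    """Return n_points equidistant times from onset to offset (inclusive)."""
    step = (offset - onset) / (n_points - 1)
    return [onset + i * step for i in range(n_points)]


# Hypothetical F1 values (Hz) from two hypothetical speakers
f1 = [300.0, 500.0, 700.0, 400.0, 600.0, 800.0]
speakers = ["F01", "F01", "F01", "M01", "M01", "M01"]
f1_z = zscore_by_speaker(f1, speakers)

# Hypothetical vowel interval from 1.00 s to 1.16 s;
# with nine points, times[4] (the 5th point) is the midpoint
times = normalized_time_points(1.00, 1.16)
```
</code>
</pre>
<p class="text-inner-margin">
After Z-scoring, each speaker's values have zero mean and unit variance, so speakers with different
vocal tract sizes become comparable on a common scale.
</p>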
</div>
<br>
<div class="row row-margin show-border">
<h2>Methods</h2>
<p class="text-inner-margin" style="display:block">
For the mapping between articulation and acoustics, we borrowed the INN model architecture from Ardizzone et al. (2019)
and customized it further for our purpose. The input data (\(x\)) was a 3-D articulatory vector of the Principal Components reduced
from the EMA sensor coordinates. The output data (\(y\)) was a 2-D acoustic vector; that is, the first two formant frequencies (F1, F2).
The dimension of the latent space (\(z\)) was set to 2-D and appended to \(y\). The total dimension was set to 6, which led to padding
on both the input (\(6 - 3 = 3\)) and the output (\(6 - 4 = 2\)). Given that \(\mathbf{x} \in \mathbb{R}^{3}\), \( \mathbf{y} \in \mathbb{R}^{2} \)
and \( \mathbf{z} \in \mathbb{R}^{2} \), the forward mapping model \(f\) and its inverse \( f^{-1}=g \) can be written as follows
(note that \(\theta\) denotes the neural-net parameters):
<p class="mathp-center-align"> Forward process: \( [\mathbf{y}, \mathbf{z}] = f(\mathbf{x}; \theta) \) </p>
<br>
<p class="mathp-center-align"> Inverse process: \( \mathbf{x} = g(\mathbf{y}, \mathbf{z}; \theta), \, \mathbf{z} \sim \mathcal{N}(\mathbf{z}; 0, I_{2}) \) </p>
<br><br>
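<p class="text-inner-margin">
The dimension bookkeeping above can be illustrated with a short sketch. It assumes that padding is done by
appending zeros at the end of each vector; the helper <code>pad_to</code> and the sample values are hypothetical,
and the actual model may pad differently.
</p>
<pre>
<code>
```python
TOTAL_DIM = 6  # total dimension on each side of the invertible network


def pad_to(vector, total_dim=TOTAL_DIM):
    """Append zeros so that the vector has total_dim entries (assumed scheme)."""
    n_pad = total_dim - len(vector)
    return list(vector) + [0.0] * n_pad


x = [0.5, -1.2, 0.3]  # articulatory PCs (3-D), hypothetical values
y = [-0.8, 1.1]       # Z-scored F1 and F2 (2-D), hypothetical values
z = [0.0, 0.0]        # latent vector (2-D)

x_padded = pad_to(x)        # 3 values followed by 6 - 3 = 3 zeros
yz_padded = pad_to(y + z)   # 4 values followed by 6 - 4 = 2 zeros
```
</code>
</pre>
<p class="text-inner-margin">
Both sides end up with the same dimension, which is what makes a bijective mapping between them possible.
</p>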
<p class="text-inner-margin">
Both forward and inverse processes share the same neural network parameters \( \theta \) and are implemented in a single invertible neural-net model.
Using these definitions, the current INN model for articulation and acoustics can be expressed as follows using the change-of-variable technique.
</p>
<p class="mathp-center-align"> \( p(\mathbf{x} = g(\mathbf{y}, \mathbf{z}; \theta)) = p(\mathbf{y}, \mathbf{z}) \left| \det \left( \frac{\partial g(\mathbf{y},\mathbf{z};\theta)}{\partial[\mathbf{y},\mathbf{z}]} \right) \right|^{-1} \) </p>
<br><br>
<p class="text-inner-margin">
We used the simplest form of affine coupling layer (additive coupling), as described in Dinh et al. (2014); the computation of the Jacobian determinant
was therefore simple, because each coupling layer is a volume-preserving transformation whose determinant is one. Each coupling layer
first splits \(x\) into two blocks (\( x_1, x_2 \)) and then transforms (\( x_1, x_2 \)) into (\( y_1, y_2 \)) as follows.
</p>
<p class="mathp-center-align">
\( \begin{eqnarray*}
y_1 &=& x_1 \\
y_2 &=& x_2 + m(x_1).
\end{eqnarray*} \)
</p>
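<p class="text-inner-margin">
As a short worked step, the Jacobian of this transformation can be written out explicitly. It is block lower-triangular
with identity blocks on the diagonal (here \(I\) stands for identity blocks of the appropriate sizes), so its determinant
is one regardless of the form of \( m \):
</p>
<p class="mathp-center-align">
\( \frac{\partial [y_1, y_2]}{\partial [x_1, x_2]} = \begin{pmatrix} I & 0 \\ \frac{\partial m(x_1)}{\partial x_1} & I \end{pmatrix}, \qquad \det \left( \frac{\partial [y_1, y_2]}{\partial [x_1, x_2]} \right) = 1. \)
</p>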
<br><br>
<p class="text-inner-margin">
The function \( m \) was implemented as a multilayer perceptron (MLP) with rectified linear unit (ReLU) activations. This forward mapping
has a unit Jacobian determinant for any \( m \), and because \( x_1 \) passes through unchanged, the inverse is trivially found.
</p>
<p class="mathp-center-align">
\( \begin{eqnarray*}
x_1 &=& y_1 \\
x_2 &=& y_2 - m(y_1).
\end{eqnarray*} \)
</p>
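<p class="text-inner-margin">
The forward and inverse equations above can be checked numerically with a minimal sketch. It is not the TensorFlow2
implementation used in this project: the function <code>m</code> below is a toy stand-in for the MLP, and the even
3 + 3 split of a 6-D input is an assumption made for illustration.
</p>
<pre>
<code>
```python
import random


def m(x1):
    """Toy stand-in for the MLP; m itself does not need to be invertible."""
    return [max(0.0, 2.0 * a - 1.0) for a in x1]


def coupling_forward(x):
    """Additive coupling: y1 = x1, y2 = x2 + m(x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y2 = [b + s for b, s in zip(x2, m(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Inverse of the coupling: x1 = y1, x2 = y2 - m(y1)."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    x2 = [b - s for b, s in zip(y2, m(y1))]
    return y1 + x2


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(6)]  # hypothetical 6-D input
y = coupling_forward(x)
x_rec = coupling_inverse(y)

# Reconstruction error should be at the level of floating-point rounding
max_error = max(abs(a - b) for a, b in zip(x, x_rec))
```
</code>
</pre>
<p class="text-inner-margin">
Because the first block is copied unchanged, the inverse can recompute \( m(y_1) = m(x_1) \) exactly and subtract it,
which is why no inversion of \( m \) is ever required.
</p>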
<br><br>
<p class="text-inner-margin">
The actual TensorFlow2 implementation of the model architecture is summarized below.
</p>
<div style="text-align:center;display:flex;margin:auto">
<pre>
<code>
Model: "NICECouplingBlock"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
layer0 (AdditiveAffineLayer) (None, 6) 981
_________________________________________________________________
layer1 (AdditiveAffineLayer) (None, 6) 981
_________________________________________________________________
layer2 (AdditiveAffineLayer) (None, 6) 981
_________________________________________________________________
layer3 (AdditiveAffineLayer) (None, 6) 981
_________________________________________________________________
layer4 (AdditiveAffineLayer) (None, 6) 981
_________________________________________________________________
ScaleLayer (Scale) (None, 6) 6
=================================================================
Total params: 4,911
Trainable params: 4,881
Non-trainable params: 30
_________________________________________________________________
</code>
</pre>
</div>
</p>
</div>
<div class="row row-margin show-border">
<h2>Results</h2>
<p class="text-inner-margin">
The modeling results are summarized in the following two plots. The top plot shows the root mean-squared error (RMSE)
by speaker and rate. Overall, the pattern is similar across rates, but the normal-rate models perform slightly better than
the fast-rate models, possibly because of the smaller variability in the normal-rate data.
The bottom plot illustrates how much variability is explained by the model, as indexed by the regression sum of squares (SSreg).
Despite the individual differences, the normal-rate models are generally better at explaining the variability
in the data.
</p>
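<p class="text-inner-margin">
For reference, the two measures named above can be sketched for a single output dimension. These are the textbook
definitions and are given only as an assumption about how such measures are typically computed; the exact computation
behind the plots is not specified on this page, and the sample values are hypothetical.
</p>
<pre>
<code>
```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean-squared error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def ss_reg(observed, predicted):
    """Regression sum of squares: squared deviations of the
    predictions from the mean of the observed values."""
    mean_obs = sum(observed) / len(observed)
    return sum((p - mean_obs) ** 2 for p in predicted)


# Hypothetical Z-scored formant values and model predictions
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

error = rmse(observed, predicted)
explained = ss_reg(observed, predicted)
```
</code>
</pre>
<p class="text-inner-margin">
A lower RMSE indicates predictions closer to the data, while a larger SSreg indicates that the predictions
capture more of the spread around the mean.
</p>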
<br><br>
<div style="display:block;text-align:center">
<h5>The Root Mean-Squared Error by speaker (total 8) and speaking rate (N vs. F)</h5>
<br>
<img src="https://raw.githubusercontent.com/jaekookang/issp2020/master/img/RMSE.png" style="width:80%">
<br><br>
<h5>The Sum-of-Squared regression by speaker (total 8) and speaking rate (N vs. F) </h5>
<br>
<img src="https://raw.githubusercontent.com/jaekookang/issp2020/master/img/SSreg.png" style="width:80%">
</div>
<br><br>
<p class="text-inner-margin">
The result of forward and inverse mapping with latent-space sampling is visualized in the following plots.
You can select different <b>Speaker</b>, <b>Rate</b> and <b>Vowel</b> in the dropdown menu. The plots will change accordingly.
</p>
<br><br>
<div class="col show-border" style="display:block">
<h4>1. Forward and inverse mapping between articulation and acoustics</h4>
<div class="row row-margin show-border">
<p>↓↓ Please choose <b>Speaker</b>, <b>Rate</b> and <b>Vowel</b> from the dropdown menus to see the result.</p>
<b>Speaker:</b>
<select name="speaker" id="sel-speaker-ar2ac" onchange="showImageAR2AC()">
<option value="F01" select="selected"></option>
</select>
<br>
<b>Rate:</b>
<select name="rate" id="sel-rate-ar2ac" onchange="showImageAR2AC()">
<option value="Normal" select="selected"></option>
</select>
<br>
<b>Vowel:</b>
<select name="vowel" id="sel-vowel-ar2ac" onchange="showImageAR2AC()">
<option value="IY1" select="selected"></option>
</select>
</div>
<div class="row row-margin show-border" style="text-align:center;">
<img id="vis-ar2ac" src="" style="width:80%">
</div>
</div>
<br>
<div class="col show-border" style="display:block">
<h4>2. Forward and inverse mapping between acoustics and vowel categories</h4>
<div class="row row-margin show-border">
<p>↓↓ Please choose <b>Speaker</b>, <b>Rate</b> and <b>Vowel</b> from the dropdown menus to see the result.</p>
<b>Speaker:</b>
<select name="speaker" id="sel-speaker-ac2vw" onchange="showImageAC2VW()">
<option value="F01" select="selected"></option>
</select>
<br>
<b>Rate:</b>
<select name="rate" id="sel-rate-ac2vw" onchange="showImageAC2VW()">
<option value="Normal" select="selected"></option>
</select>
<br>
<b>Vowel:</b>
<select name="vowel" id="sel-vowel-ac2vw" onchange="showImageAC2VW()">
<option value="IY1" select="selected"></option>
</select>
</div>
<div class="row row-margin show-border" style="text-align:center;">
<img id="vis-ac2vw" src="" style="width:80%">
</div>
</div>
</div>
<div class="row row-margin show-border">
<h2>Summary & Conclusion</h2>
<p class="text-inner-margin">
Language is produced by movement. Articulatory movement in speech is accompanied by a substantial amount of variability,
like any other skilled human behavior. The current project aimed to examine how such variability is structured
and how it can be modeled using flow-based invertible neural networks. This work is still in progress, and further experiments
will follow to test the effects of different phonetic contexts, choices of articulatory/acoustic features and
hyper-parameter settings for the INNs.
</p>
<p>
<h6>Implications</h6>
<ul>
<li>Comparison of the use of "good" variability by different speakers, language backgrounds or dialects.</li>
<li>Developing a speech inversion system with the interpretable forward-inverse mapping.</li>
<li>Developing an articulatory speech synthesizer which can utilize redundancy in articulation.</li>
</ul>
</p>
</div>
<div class="row row-margin show-border">
<h2>References</h2>
<p class="text-inner-margin">
<ul class="text-ref">
<li>Bernstein, N. (1967). The co-ordination and regulation of movements. Pergamon Press.</li>
<li>Latash, M., Scholz, J., & Schöner, G. (2002). Motor control strategies revealed in the structure of motor variability. Exercise and Sport Sciences Reviews, 30(1), 26–31.</li>
<li>Riley, M. A., & Turvey, M. T. (2002). Variability and determinism in motor behaviour. Journal of Motor Behaviour, 34(2), 26.</li>
<li>Sternad, D. (2018). It’s not (only) the mean that matters: variability, noise and exploration in skill learning. Current Opinion in Behavioral Sciences, 20, 183–195.</li>
<li>Whalen, D. H., & Chen, W.-R. (2019). Variability and central tendencies in speech production. Frontiers in Communication, 4.</li>
<li>Schöner, G., Martin, V., Reimann, H., & Scholz, J. (2008). Motor equivalence and the uncontrolled manifold. 8th International Seminar on Speech Production, 23–28.</li>
<li>Scholz, J., & Schöner, G. (2014). Use of the Uncontrolled Manifold (UCM) approach to understand motor variability, motor equivalence, and self-motion. In Advances in Experimental Medicine and Biology (Vol. 826, pp. 91–100).</li>
<li>Sivaraman, G., Mitra, V., Tiede, M., Saltzman, E., Goldstein, L., & Espy-Wilson, C. (2015). Analysis of coarticulated speech using estimated articulatory trajectories. Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech, 369–373.</li>
<li>Whalen, D. H., Chen, W.-R., Tiede, M. K., & Nam, H. (2018). Variability of articulator positions and formants across nine English vowels. Journal of Phonetics, 68, 1–14.</li>
<li>Ardizzone, L., Kruse, J., Wirkert, S., Rahner, D., Pellegrini, E. W., Klessen, R. S., Maier-Hein, L., Rother, C., & Köthe, U. (2019). Analyzing inverse problems with invertible neural networks. 7th International Conference on Learning Representations, ICLR 2019, 1–20.</li>
<li>Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear Independent Components Estimation. ArXiv, 1–13.</li>
</ul>
</p>
</div>
<div class="row row-margin show-border">
<h2>Code examples</h2>
<p class="text-inner-margin">
<ul>
<li>Flow-based models: <a target="_blank" href="https://github.com/jaekookang/flow_based_models">https://github.com/jaekookang/flow_based_models</a></li>
<li>Invertible neural networks: <a target="_blank" href="https://github.com/jaekookang/invertible_neural_networks">https://github.com/jaekookang/invertible_neural_networks</a></li>
<li>Ardizzone et al. (2019): <a target="_blank" href="https://github.com/VLL-HD/analyzing_inverse_problems">https://github.com/VLL-HD/analyzing_inverse_problems</a></li>
</ul>
</p>
</div>
</div>
<!-- Bottom panel: Contents -->
<div class="container page-width show-border">
<footer>
<br>
<p>Jaekoo Kang
<br>
<a href="mailto:jkang@gradcenter.cuny.edu">jkang@gradcenter.cuny.edu</a>
</p>
</footer>
</div>
</body>
</html>