Acoustic distance
Recent work has used dynamic time warping on sequences of Mel-frequency cepstral coefficient (MFCC) vectors to compute a form of acoustic distance (Mielke, 2012; Kelley, 2018; Kelley & Tucker, 2018; Bartelds et al., 2020). This package provides a number of convenience functions for these computations; for the most part, they wrap the DynamicAxisWarping.jl and MFCC.jl packages. See also the Phonological CorpusTools page on acoustic similarity.
Computing acoustic distance
Let's start by creating some sample sounds to work with. You could also load your own sounds from file using the Sound constructor that takes a filename.
using Random
rng = MersenneTwister(9)
x = rand(rng, 1, 1000)
y = rand(rng, 1, 3000)
acdist(x, y)
128.5382477647987
The output value is the result of performing dynamic time warping on x and y. If x and y are Sound objects (in this example, they are not), they are first converted to MFCC vectors with the sound2mfcc function. This value has been found to situate phonological similarity in terms of acoustics (Mielke, 2012), to reflect aspects of the activation/competition process during spoken word recognition (Kelley, 2018; Kelley & Tucker, 2018), and to reflect judgments of nativelike pronunciation (Bartelds et al., 2020).
As an implementation note, the distance metric used to compare two MFCC vectors is the squared Euclidean distance.
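For instance, here is a minimal sketch of comparing two Sound objects directly, using the Sound constructor that takes a vector of samples and a sampling rate (as in the examples below); the conversion to MFCC frames happens internally:
using Random
rng = MersenneTwister(9)
s1 = Sound(rand(rng, 1000), 8000)  # 1000 random samples at 8 kHz
s2 = Sound(rand(rng, 3000), 8000)  # 3000 random samples at 8 kHz
acdist(s1, s2)                     # both sounds are first converted with sound2mfcc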
Sequence averaging
Kelley & Tucker (2018) also used the dynamic barycenter averaging technique (DBA; Petitjean et al., 2011) to create "average" acoustic representations of English words, in an attempt to better model the kind of acoustic representation a listener may be accessing when hearing a word (given that a listener has typically heard most words more than once). The average sequence is calculated with the avgseq function.
using Random
rng = MersenneTwister(9)
x = rand(rng, 1000)
y = rand(rng, 3000)
z = rand(rng, 10000)
a = [Sound(x, 8000), Sound(y, 8000), Sound(z, 8000)]
avgseq(a)
13×36 Matrix{Float64}:
15.5042 15.3827 15.4891 … 15.4959 15.5196 15.6319
-32.974 -31.0656 -30.5978 -31.821 -30.9982 -32.2402
-6.58161 -7.59358 -5.95746 -6.51342 -5.81628 -6.52048
-6.90518 -11.6106 -9.51348 -7.90932 -6.12832 -6.88405
2.43805 -6.84897 -3.71971 -2.2706 -1.91239 -3.16708
0.435474 -6.01483 -9.38888 … -1.07369 -7.71319 -2.86996
4.41804 -3.30033 0.0447839 -2.34971 0.296673 2.73236
0.304654 3.79358 6.17691 -2.38894 1.65544 2.57971
3.28059 15.7686 4.79525 6.3325 0.623664 3.96259
2.71743 8.60756 8.06177 3.78983 -10.0412 7.60993
-3.43174 -0.0267343 5.89553 … 4.69502 -0.809725 2.54183
0.967363 0.450934 0.712783 -1.8821 0.244977 -1.14452
5.34803 0.960416 1.08785 0.672639 2.74868 -0.329657
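The keyword arguments documented in the function documentation below can also be passed here. As a sketch (reusing a from above), one might request the fast dtw approximation with a random initial center:
avgseq(a; method=:fastdtw, center=:rand)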
Acoustic distinctiveness
Kelley (2018) and Kelley & Tucker (2018) introduced the concept of acoustic distinctiveness: how far away a word is, on average, from all the other words in a language. The distinctiveness function performs this calculation.
using Random
rng = MersenneTwister(9)
x = rand(rng, 1000)
y = rand(rng, 3000)
z = rand(rng, 10000)
a = [Sound(x, 8000), Sound(y, 8000), Sound(z, 8000)]
distinctiveness(a[1], a[2:3])
45165.66072314727
The number is effectively an index of how acoustically unique a word is in a language.
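With the default reduction=mean (see the function documentation below), this value should correspond to simply averaging the pairwise acoustic distances, roughly as in the following sketch:
using Statistics
mean(acdist(a[1], w) for w in a[2:3])  # should approximate distinctiveness(a[1], a[2:3])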
Function documentation
Phonetics.acdist — Method
acdist(s1, s2; [method=:dtw, dist=SqEuclidean(), radius=10])
Calculate the acoustic distance between s1 and s2 using the method version of dynamic time warping and dist as the interior distance function. Using method=:dtw uses vanilla dynamic time warping, while method=:fastdtw uses the fast dtw approximation. Note that this is not a true mathematical distance metric because dynamic time warping does not necessarily satisfy the triangle inequality, nor does it guarantee the identity of indiscernibles.
Args
s1: Features-by-time array of first sound to compare
s2: Features-by-time array of second sound to compare
method: (keyword) Which method of dynamic time warping to use
dist: (keyword) Any distance function implementing the SemiMetric interface from the Distances package
dtwradius: (keyword) Maximum warping radius for vanilla dynamic time warping; if no value is passed, no warping constraint is used; argument unused when method=:fastdtw
fastradius: (keyword) The radius to use for the fast dtw method; argument unused when method=:dtw
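As an illustration, a minimal sketch of passing the method and dist keywords from the signature above (the inputs here are arbitrary random features-by-time arrays):
using Random, Distances
rng = MersenneTwister(9)
x = rand(rng, 1, 1000)  # 1 feature, 1000 frames
y = rand(rng, 1, 3000)
acdist(x, y; method=:fastdtw, dist=SqEuclidean())  # fast dtw approximation with squared Euclidean frame distance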
Phonetics.acdist — Function
acdist(s1::Sound, s2::Sound, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10])
Convert s1 and s2 to a frequency representation specified by rep, then calculate acoustic distance between s1 and s2. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.
Phonetics.avgseq — Method
avgseq(S; [method=:dtw, dist=SqEuclidean(), radius=10, center=:medoid, dtwradius=nothing, progress=false])
Return a sequence representing the average of the sequences in S using the dba method for sequence averaging. Supports method=:dtw for vanilla dtw and method=:fastdtw for fast dtw approximation when performing the sequence comparisons. With center=:medoid, finds the medoid as the sequence to use as the initial center, and with center=:rand selects a random element in S as the initial center.
Args
S: An array of sequences to average
method: (keyword) The method of dynamic time warping to use
dist: (keyword) Any distance function implementing the SemiMetric interface from the Distances package
radius: (keyword) The radius to use for the fast dtw method; argument unused when method=:dtw
center: (keyword) The method used to select the initial center of the sequences in S
dtwradius: (keyword) How far a time step can be mapped when comparing sequences; passed directly to the DTW function from DynamicAxisWarping; if set to nothing, the length of the longest sequence will be used, effectively removing the radius restriction
progress: Whether to show the progress coming from dba
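For instance, a sketch of averaging raw features-by-time arrays with this method, passing the keywords from the signature above (the shapes here are arbitrary and chosen only for illustration):
using Random
rng = MersenneTwister(9)
seqs = [rand(rng, 13, 50), rand(rng, 13, 80), rand(rng, 13, 65)]  # e.g., 13 coefficients per frame
avgseq(seqs; center=:medoid, dtwradius=nothing, progress=false)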
Phonetics.avgseq — Function
avgseq(S::Array{Sound}, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10, center=:medoid, dtwradius=nothing, progress=false])
Convert the Sound objects in S to a representation designated by rep, then find the average sequence of them. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.
Phonetics.distinctiveness — Method
distinctiveness(s, corpus; [method=:dtw, dist=SqEuclidean(), radius=10, reduction=mean])
Calculates the acoustic distinctiveness of s given the corpus corpus. The method, dist, and radius arguments are passed into acdist. The reduction argument can be any function that reduces an iterable to one number, such as mean, sum, or median.
For more information, see Kelley (2018, September, How acoustic distinctiveness affects spoken word recognition: A pilot study, DOI: 10.7939/R39G5GV9Q) and Kelley & Tucker (2018, Using acoustic distance to quantify lexical competition, DOI: 10.7939/r3-wbhs-kr84).
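For example, a sketch of swapping in a different reduction, reusing the Sound objects a from the earlier examples (median comes from the Statistics standard library):
using Statistics
distinctiveness(a[1], a[2:3]; reduction=median)  # median distance to the corpus instead of the mean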
Phonetics.distinctiveness — Function
distinctiveness(s::Sound, corpus::Array{Sound}, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10, reduction=mean])
Converts s and corpus to a representation specified by rep, then calculates the acoustic distinctiveness of s given corpus. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.
References
Bartelds, M., Richter, C., Liberman, M., & Wieling, M. (2020). A new acoustic-based pronunciation distance measure. Frontiers in Artificial Intelligence, 3, 39.
Kelley, M. C. (2018). How acoustic distinctiveness affects spoken word recognition: A pilot study. Presented at the 11th International Conference on the Mental Lexicon (Edmonton, AB). https://doi.org/10.7939/R39G5GV9Q
Kelley, M. C., & Tucker, B. V. (2018). Using acoustic distance to quantify lexical competition. University of Alberta ERA (Education and Research Archive). https://doi.org/10.7939/r3-wbhs-kr84
Mielke, J. (2012). A phonetically based metric of sound similarity. Lingua, 122(2), 145–163.
Petitjean, F., Ketterlin, A., & Gançarski, P. (2011). A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 678–693.