Acoustic distance

Recent work has used dynamic time warping on sequences of Mel frequency cepstral coefficient (MFCC) vectors to compute a form of acoustic distance (Mielke, 2012; Kelley, 2018; Kelley & Tucker, 2018; Bartelds et al., 2020). This package provides a number of convenience functions for these calculations. For the most part, they wrap the DynamicAxisWarping.jl and MFCC.jl packages. See also the Phonological CorpusTools page on acoustic similarity.

Computing acoustic distance

Let's start by creating some sample sounds to work with. You could also load your own sounds from file using the Sound constructor that takes a filename.

using Random
rng = MersenneTwister(9)
x = rand(rng, 1, 1000)
y = rand(rng, 1, 3000)
acdist(x, y)
128.5382477647987

The output value is the result of performing dynamic time warping on x and y. If x and y are Sound objects (in this example, they are not), they are first converted to MFCC vectors with the sound2mfcc function. This value has been found to situate phonological similarity in terms of acoustics (Mielke, 2012), to reflect aspects of the activation/competition process during spoken word recognition (Kelley, 2018; Kelley & Tucker, 2018), and to correlate with judgments of nativelike pronunciation (Bartelds et al., 2020).

As an implementation note, the distance metric used to compare the MFCC vectors is the squared Euclidean distance between two vectors.
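Although the package wraps DynamicAxisWarping.jl for this computation, the core recurrence is easy to sketch. The following Python sketch (the name dtw_distance and the toy arrays are illustrative, not part of the package) computes vanilla dynamic time warping over features-by-time arrays, using squared Euclidean as the interior distance:

```python
import numpy as np

def dtw_distance(x, y):
    """Vanilla dynamic time warping over features-by-time arrays,
    with squared Euclidean distance as the interior metric."""
    n, m = x.shape[1], y.shape[1]
    D = np.full((n + 1, m + 1), np.inf)  # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # squared Euclidean distance between frame i of x and frame j of y
            cost = np.sum((x[:, i - 1] - y[:, j - 1]) ** 2)
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.array([[0.0, 1.0, 2.0]])
y = np.array([[0.0, 0.0, 1.0, 2.0, 2.0]])
dtw_distance(x, y)  # 0.0: y is a time-warped copy of x
```

A distance of 0 falls out when one sequence is a time-warped copy of the other, which is the intuition behind using DTW to compare utterances of different durations.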

Sequence averaging

Kelley & Tucker (2018) also used the dynamic barycenter averaging (Petitjean et al., 2011) technique to create "average" acoustic representations of English words, in an attempt to better model the kind of acoustic representation a listener may be accessing when hearing a word (given that a listener has heard most words more than just once). The interface for calculating the average sequence is with the avgseq function.

using Random
rng = MersenneTwister(9)
x = rand(rng, 1000)
y = rand(rng, 3000)
z = rand(rng, 10000)
a = [Sound(x, 8000), Sound(y, 8000), Sound(z, 8000)]
avgseq(a)
13×36 Matrix{Float64}:
  15.5042     15.3827      15.4891     …   15.4959     15.5196     15.6319
 -32.974     -31.0656     -30.5978        -31.821     -30.9982    -32.2402
  -6.58161    -7.59358     -5.95746        -6.51342    -5.81628    -6.52048
  -6.90518   -11.6106      -9.51348        -7.90932    -6.12832    -6.88405
   2.43805    -6.84897     -3.71971        -2.2706     -1.91239    -3.16708
   0.435474   -6.01483     -9.38888    …   -1.07369    -7.71319    -2.86996
   4.41804    -3.30033      0.0447839      -2.34971     0.296673    2.73236
   0.304654    3.79358      6.17691        -2.38894     1.65544     2.57971
   3.28059    15.7686       4.79525         6.3325      0.623664    3.96259
   2.71743     8.60756      8.06177         3.78983   -10.0412      7.60993
  -3.43174    -0.0267343    5.89553    …    4.69502    -0.809725    2.54183
   0.967363    0.450934     0.712783       -1.8821      0.244977   -1.14452
   5.34803     0.960416     1.08785         0.672639    2.74868    -0.329657
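For intuition, dynamic barycenter averaging repeatedly aligns every sequence to the current center with DTW and replaces each center frame with the mean of all frames aligned to it. This simplified Python sketch (function names are illustrative; the package itself wraps the dba routine from DynamicAxisWarping.jl) initializes the center from the first sequence rather than the medoid:

```python
import numpy as np

def dtw_path(c, s):
    """DTW between center c and sequence s (features-by-time);
    returns the optimal alignment path as (i, j) frame-index pairs."""
    n, m = c.shape[1], s.shape[1]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((c[:, i - 1] - s[:, j - 1]) ** 2)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack along the cheapest predecessors from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path

def dba(sequences, iterations=10):
    """Dynamic barycenter averaging (Petitjean et al., 2011): refine a
    center sequence toward the DTW-based average of the input sequences."""
    center = sequences[0].copy()  # simplistic initial center (cf. center=:medoid)
    for _ in range(iterations):
        buckets = [[] for _ in range(center.shape[1])]
        for s in sequences:
            # collect every frame of s that aligns to each center frame
            for i, j in dtw_path(center, s):
                buckets[i].append(s[:, j])
        # new center frame = mean of the frames aligned to it
        center = np.column_stack([np.mean(b, axis=0) for b in buckets])
    return center
```

Averaging identical sequences returns them unchanged, and in general the center converges to a sequence that minimizes the total DTW distance to the inputs.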

Acoustic distinctiveness

Kelley (2018) and Kelley & Tucker (2018) introduced the concept of acoustic distinctiveness. It is how far away a word is, on average, from all the other words in a language. The distinctiveness function performs this calculation.

using Random
rng = MersenneTwister(9)
x = rand(rng, 1000)
y = rand(rng, 3000)
z = rand(rng, 10000)
a = [Sound(x, 8000), Sound(y, 8000), Sound(z, 8000)]
distinctiveness(a[1], a[2:3])
45165.66072314727

The number is effectively an index of how acoustically unique a word is in a language.
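Conceptually, the computation reduces the pairwise acoustic distances between a word and every other word in the corpus to a single number. A minimal Python sketch (the names dtw_distance and distinctiveness here are illustrative stand-ins, not the package's Julia implementation):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal DTW with squared Euclidean as the interior distance."""
    n, m = x.shape[1], y.shape[1]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((x[:, i - 1] - y[:, j - 1]) ** 2)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def distinctiveness(word, corpus, reduction=np.mean):
    """Reduce (by default, average) the DTW distances between a word
    and every other word in the corpus."""
    return reduction([dtw_distance(word, other) for other in corpus])

word = np.array([[0.0, 0.0]])
corpus = [np.array([[0.0, 0.0, 0.0]]), np.array([[5.0, 5.0]])]
distinctiveness(word, corpus)  # 25.0, the mean of distances 0.0 and 50.0
```

Swapping in a different reduction (such as a median or sum) mirrors the reduction keyword of the package's distinctiveness function.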

Function documentation

Phonetics.acdistMethod
acdist(s1, s2; [method=:dtw, dist=SqEuclidean(), radius=10])

Calculate the acoustic distance between s1 and s2, using the variant of dynamic time warping specified by method and dist as the interior distance function. Using method=:dtw performs vanilla dynamic time warping, while method=:fastdtw uses the fast dtw approximation. Note that the result is not a true mathematical distance metric, because dynamic time warping does not necessarily satisfy the triangle inequality, nor does it guarantee the identity of indiscernibles.

Args

  • s1 Features-by-time array of first sound to compare
  • s2 Features-by-time array of second sound to compare
  • method (keyword) Which method of dynamic time warping to use
  • dist (keyword) Any distance function implementing the SemiMetric interface from the Distances package
  • dtwradius (keyword) Maximum warping radius for vanilla dynamic time warping; if no value is passed, no warping constraint is used; argument unused when method=:fastdtw
  • fastradius (keyword) The radius to use for the fast dtw method; argument unused when method=:dtw
source
Phonetics.acdistFunction
acdist(s1::Sound, s2::Sound, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10])

Convert s1 and s2 to a frequency representation specified by rep, then calculate acoustic distance between s1 and s2. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.

source
Phonetics.avgseqMethod
avgseq(S; [method=:dtw, dist=SqEuclidean(), radius=10, center=:medoid, dtwradius=nothing, progress=false])

Return a sequence representing the average of the sequences in S using the dba method for sequence averaging. Supports method=:dtw for vanilla dtw and method=:fastdtw for fast dtw approximation when performing the sequence comparisons. With center=:medoid, finds the medoid as the sequence to use as the initial center, and with center=:rand selects a random element in S as the initial center.

Args

  • S An array of sequences to average
  • method (keyword) The method of dynamic time warping to use
  • dist (keyword) Any distance function implementing the SemiMetric interface from the Distances package
  • radius (keyword) The radius to use for the fast dtw method; argument unused when method=:dtw
  • center (keyword) The method used to select the initial center of the sequences in S
  • dtwradius (keyword) How far a time step can be mapped when comparing sequences; passed directly to DTW function from DynamicAxisWarping; if set to nothing, the length of the longest sequence will be used, effectively removing the radius restriction
  • progress (keyword) Whether to show the progress coming from dba
source
Phonetics.avgseqFunction
avgseq(S::Array{Sound}, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10, center=:medoid, dtwradius=nothing, progress=false])

Convert the Sound objects in S to a representation designated by rep, then find the average sequence of them. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.

source
Phonetics.distinctivenessMethod
distinctiveness(s, corpus; [method=:dtw, dist=SqEuclidean(), radius=10, reduction=mean])

Calculates the acoustic distinctiveness of s given the corpus corpus. The method, dist, and radius arguments are passed into acdist. The reduction argument can be any function that reduces an iterable to one number, such as mean, sum, or median.

For more information, see Kelley (2018, September, How acoustic distinctiveness affects spoken word recognition: A pilot study, DOI: 10.7939/R39G5GV9Q) and Kelley & Tucker (2018, Using acoustic distance to quantify lexical competition, DOI: 10.7939/r3-wbhs-kr84).

source
Phonetics.distinctivenessFunction
distinctiveness(s::Sound, corpus::Array{Sound}, rep=:mfcc; [method=:dtw, dist=SqEuclidean(), radius=10, reduction=mean])

Converts s and corpus to a representation specified by rep, then calculates the acoustic distinctiveness of s given corpus. Currently only :mfcc is supported for rep, using defaults from the MFCC package except that the first coefficient for each frame is removed and replaced with the sum of the log energy of the filterbank in that frame, as is standard in ASR.

source

References

Bartelds, M., Richter, C., Liberman, M., & Wieling, M. (2020). A new acoustic-based pronunciation distance measure. Frontiers in Artificial Intelligence, 3, 39.

Kelley, M. C. (2018). How acoustic distinctiveness affects spoken word recognition: A pilot study. Presented at the 11th International Conference on the Mental Lexicon (Edmonton, AB). https://doi.org/10.7939/R39G5GV9Q

Kelley, M. C., & Tucker, B. V. (2018). Using acoustic distance to quantify lexical competition. University of Alberta ERA (Education and Research Archive). https://doi.org/10.7939/r3-wbhs-kr84

Mielke, J. (2012). A phonetically based metric of sound similarity. Lingua, 122(2), 145–163.

Petitjean, F., Ketterlin, A., & Gançarski, P. (2011). A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 678–693.