Text VP Tree

A vantage-point tree is a data structure that takes advantage of the spatial distribution of data and lets allows for faster searching through the data by lowering the amount of comparisons that need to be made. Consider the traditional example of phonological neighborhood density calculation. The code would be written to compare each item to all the other items. For $n$ items, there would be $n-1$ comparisons. So, to calculate the phonological neighborhood density for each item in a given corpus, there would need to be $n \times (n-1)\, = \, n^2-n$ comparisons. This is a lot of comparisons, especially when you're working with tens or hundreds of thousands of words!

With a vantage-point tree, however, we might get an average of only needing $\log_2(n)$ comparisons per query because of the way the data are organized. This means we would only need $n \times \log_2(n)$ comparisons in total, which can be substantially lower than $n^2-n$ for larger corpora. Though, analyzing the runtime of a VP tree is difficult, so the actual speedup may not be as drastic, but it should still be faster than the naive phonological neighborhood density calculation.

This impelentation is based on the description by Samet (2006).

Function documentation

Phonetics.TextVPTree — Method

TextVPTree(items::Array, d)

Outer constructor for a TextVPTree. Takes in an array of items items and a distance function d and proceeds to build a vantage-point tree from them.

source

Phonetics.radiusSearch — Method

radiusSearch(tree::TextVPTree, query, epsilon)

Performs a search for all items in a VP tree tree that are within a radius epsilon from a query query.

Returns

A Vector of items that are within the given radius epsilon

source

Phonetics.nneighbors — Method

nneighbors(tree::TextVPTree, query, n)

Find the n nearest neighbors in a VP tree tree to a given query query.

Returns

A PriorityQueue of items where the keys are the items themselves and the values are the distances from the items to query; the PriorityQueue is defined such that small values have higher priorities than large ones

source

References

Samet, H. (2006). Foundations of multidimensional and metric data structures. San Francisco, California: Morgan Kaufmann.