Text VP Tree
A vantage-point tree is a data structure that takes advantage of the spatial distribution of data and lets allows for faster searching through the data by lowering the amount of comparisons that need to be made. Consider the traditional example of phonological neighborhood density calculation. The code would be written to compare each item to all the other items. For $n$ items, there would be $n-1$ comparisons. So, to calculate the phonological neighborhood density for each item in a given corpus, there would need to be $n \times (n-1)\, = \, n^2-n$ comparisons. This is a lot of comparisons, especially when you're working with tens or hundreds of thousands of words!
With a vantage-point tree, however, we might get an average of only needing $\log_2(n)$ comparisons per query because of the way the data are organized. This means we would only need $n \times \log_2(n)$ comparisons in total, which can be substantially lower than $n^2-n$ for larger corpora. Though, analyzing the runtime of a VP tree is difficult, so the actual speedup may not be as drastic, but it should still be faster than the naive phonological neighborhood density calculation.
This impelentation is based on the description by Samet (2006).
Function documentation
Phonetics.TextVPTree
— MethodTextVPTree(items::Array, d)
Outer constructor for a TextVPTree
. Takes in an array of items items
and a distance function d
and proceeds to build a vantage-point tree from them.
Phonetics.radiusSearch
— MethodradiusSearch(tree::TextVPTree, query, epsilon)
Performs a search for all items in a VP tree tree
that are within a radius epsilon
from a query query
.
Returns
A Vector
of items that are within the given radius epsilon
Phonetics.nneighbors
— Methodnneighbors(tree::TextVPTree, query, n)
Find the n
nearest neighbors in a VP tree tree
to a given query query
.
Returns
- A
PriorityQueue
of items where the keys are the items themselves and the values are the distances from the items toquery
; thePriorityQueue
is defined such that small values have higher priorities than large ones
References
Samet, H. (2006). Foundations of multidimensional and metric data structures. San Francisco, California: Morgan Kaufmann.