Technology

Prefix Trees in Action | Engineering, Product and Design at AMBOSS

This is the point where a tiny bit of textbook knowledge goes a long way. You don’t have to have every tree data structure memorized, as long as you are vaguely aware that there is an entire family of trees that’s good at dealing with prefix searches: tries.

A trie, also called prefix tree, is a great fit for things like auto completion and search suggestions — where the user enters the first part of a word and you need to quickly find more words that fit to that prefix.

Top 5 Android Smartphones In 2021 Review

It’s much easier to explain with diagrams than with words, so here’s a hypothetical representation of the “heart” and “heart failure” case.

In a trie, the nodes are typically strings and nodes are connected by individual characters. In the above example, the first non-empty node h and the node after that (he) are connected through the character e. I collapsed the two leaves (heart failure and heart burn), just so it’s easier to visualize. Also note that the children of any node share a prefix, meaning both children of heart share the heart prefix, which is perfect for our use case.

In this trie three nodes would hold a value: heart, heart failure and heart burn. The values could be anything, but in our case they’d be the destination objects. We can therefore not only use the trie to check if the currentPhrase is a prefix of a known destination, but we can also retrieve that destination object.

There is one thing to keep in mind though:

If we ask the trie for the destination object for heart, what do we expect to get back?

Obviously, we’re looking for the value associated with the node heart. But there are two more nodes which could be relevant, since they contain the prefix we asked for. In cases like this, it’s important to check the documentation of whatever library you’re using. It will probably differentiate between exact match (e.g., asking for heart failure) and ambiguous match (e.g., asking for heart).

Lastly, what kind of performance gains can we expect from using a trie? For any look up, we can treat each character in our search string as a path segment. You can imagine searching for heart to be akin to starting at the root node and following the connection by character h, then the connection by he, and so on.

The Big O complexity is therefore O(k), where k is the length of the search string, regardless of how many destination objects we have.

With the previous solution, we had O(n*m*p) complexity. We’ve replaced that with O(n * p). That’s a huge improvement because we have thousands of destinations (m), whereas the search string (p) will usually be in the 1—20 character range.

Adding tries to our code was very straight forward. It essentially boiled down to replacing the inner loop with a look up, as shown below.

for _, character := range documentContent {
// ... do other stuff until we come across white space
if destinationTrie.HasPrefix(currentPhrase) {
phrase.WriteRune(whitespace)
continue outer
}
}

So instead of iterating over all keys in the map, and comparing each key against the currentPhrase, we make just a single call to our trie, here called destinationTrie. The surrounding code can stay exactly the same, which makes for a fairly small commit!

Of course we still need to verify that our change has indeed made things noticeably faster. So we ran exactly the same benchmarks as before (otherwise it’s hard to really compare the data) and indeed memeqbody went from taking up 325.57s to just 11.43s. mapiternext doesn’t even show up in the top 10 function calls anymore.

Apart from profiling one entire run, we also used Go’s support for benchmarking to get a more detailed idea of just how much faster the new implementation of AddLinks is:

Before 367244 3231 ns/op 1728 B/o 38 allocs/opAfter 775364 1344 ns/op 944 B/o 23 allocs/op

The new implementation takes 1344 nanoseconds per operation (ns/op) and is therefore 2.4 times faster. It also allocates less (38 vs 23 allocations per operation), which wasn’t even a focus of ours.

The overall runtime of our program, as measured by a human sitting in front of the machine, also went down by about two thirds.

Code complexity is hard to measure but I’d argue that the code is now simpler to reason about than before.

Earlier in this article, I wrote that we wanted to achieve the following:

Therefore we need to know if the phrase we’re processing is a sub string of any of the destinations.

The prefix tree implementation is the exact equivalent in code of the above requirement. The for loop we had before was an implementation detail and just distracted us from the overall goal. In a sense, the code now works at a higher level of abstraction.

Lastly, code size didn’t increase because there are various open source implementations for these data structures already.

Related Articles

Back to top button