I’ve recently written on the importance of buffing up on computer science fundamentals for anyone working in data science — it is, after all, the ecosystem the whole of AI/machine learning/data analytics is predicated on — and have myself been working through some of the coursework from the OSSU curriculum. I just finished – — — — course — — — — — — , UBC’s – — — — — — course — — — — — , the module of which was binary search trees (BSTs). This is a fundamental data structure, and one that offers a significant computational boost to a ubiquitous computing demand, so I’d like to record some highlights here, both for myself and for anyone else who’s thinking about digging into data structures or into a free computer science curriculum like OSSU.
The ubiquitous computing demand is something you’ve easily asked a computer to do for you today: find an entry in a list with some sort of order.
A classic example is looking up an account by its account number. Let’s look at some different situations, from least to most ordered, that you might have to dig through. In their most unordered state, maybe these accounts are just in a kind of “bag of accounts” with no real order to the placement (account #1, 99, 15, 27, 13, 67). If you wanted to look for account #13, you’d just have to move down the line, going one at a time, and hope you got lucky that account #13 occurred early in line. If there are n accounts in the bag, it will take you looking through, on average, n/2 accounts before you find the one you’re looking for.
A potential solution would be to order the accounts (account #1, 13, 15, 27, 67, 99). That would surely speed up the search, right? In fact it offers no improvement to the above “bag of accounts” situation: you’ll still have to search through, on average, n/2 number of accounts before finding the one you’re looking for.
Binary search trees, on the other hand, offer a real improvement to this task, speeding up the “search time” from n/2 to log(n) (that is, base 2 log).
The advantage isn’t so great over just a list of accounts for a small number of accounts, but for a large number of accounts, the increase in computational speed is huge.
The requirements for a BST are relatively simple: a) the tree is made of nodes, b) each node has a key (for this example, it would be account number), c) each node has at most 1 left branch, containing only nodes with smaller keys, and lastly, d) each node has at most 1 right branch, containing only nodes with larger keys. Here’s a graphical representation:
And the really cool part? If the amount of things to search doubled, it would also double your search time given the original method of crawling through the list; with a BST, it adds, on average, only one more step to the search process.
[graphical rep of doubling tree to 101, 129, 167, 174, 189, 200, red lines for 1 extra step].