To find an item in an unknown list, a human would probably go from the beginning to the end of it, checking every item he finds one by one. What about computers? Nowadays, programs have a large arsenal of search algorithms at their disposal, and today we will briefly describe the most common ones.
The first thing that comes to mind is to sequentially go through all possible elements in some array and compare each one with the desired element. This algorithm is known as linear search. It may take up to O(N) = N operations, where N is the size of an array and is widely considered to be horribly slow. Despite it, it can still be used when:
– You need to perform this search only once,
– You are forbidden to rearrange the elements and you do not have any extra memory,
– The array is tiny, such as ten elements or less, or the performance is not an issue at all,
– You have no idea what you are searching.
When you ask MySQL something like “SELECT x FROM y WHERE z = t”, and z is a column without an index, linear search is performed with all the consequences of it. This is why adding an index to ‘searchable’ columns is important. Fortunately, primary keys are indexed automatically.
Nevertheless. What about exchanging elements? Sorting algorithms is relatively expensive, they require around O(N) = N logN operations, but sorting may be performed at once. Then we can go this way: find a middle, compare it with the element we are searching for, locate the half with the desired element and repeat the algorithm for that half. Each step reduces size two times, giving us overall O(N) = logN. This algorithm is known as binary search and it is extremely widespread in the world of computing.
However, there is a way to make it even better, if we know the approximate distribution of the elements in the array. Instead of picking the center each time, we may try to predict the position of our element and split the array there. The prediction is usually done via linear interpolation, hence the name of the algorithm – interpolation search. This way reduces complexity down to O(N) = log logN in the best case, but with a drawback: if our guess about the distribution is not correct, we can degrade down to even O(N) = N, while binary search always performs in O(N) = logN.
Sorted arrays are extremely effective when it comes to search. However, addition and removal of elements for sorted arrays are both O(N) = N, which is sometimes not acceptable.
Moreover, what about the additional memory?
Hashtable is like an index for the phone book. Each entry has its hash-based on a key, which is usually a relatively small number (in a phone book, the hash is just a first letter). Then, entries are distributed in an array-based on their hash value. When we need to find some element in the hash table, we first evaluate its hash, then look up the hash table and retrieve the element. However, what if there are multiple elements with the same hash? This situation is known as the collision, and because of the birthday paradox, it is frequent. There are multiple ways to solve it.
First, we can just store multiple elements in the same place of the hash. This is known as separate chaining, and it requires O(N, m) = N / m operations on average to find an element (here, m is the size of the hash table).
Alternatively, we can store the element in a wrong place next to its right place (or two times next, or three…). This is known as open addressing. Based on the load percentage of the hash table and the uniformness of the hash function, search complexity may vary from O(N, m) = 1 to O(N, m) = m.
There are other approaches for resolving collisions. For instance, cuckoo hashing has guaranteed search complexity O(N, m) = 1 at the expense of using two hashes at the same time instead of one.
In real programming, the hash table is the usual way to implement associative arrays in interpreted programming languages such as Python or Ruby.
Another well-known approach is to use a tree. The search algorithm in the simplest tree is:
- Set root as current element.
- If the current element is empty, then exit: no element found.
- Compare the current element with the desired element. If we found it, exit.
- Based on the comparison, select left or right child as a new current element, and go to step two.
Such tree is known as a binary search tree. In theory, it should speed up the search up to O(N) = logN. At the same time, a simple unordered list with the head can also be viewed as a tree, with a head as a root and with one child for each element except the tail, and, as we may expect, search in such list takes up to O(N) = N operations. So, actually, search in a binary tree falls somewhere in between of O(N) = N and O(N) = logN based on how the tree was constructed. To construct a tree ‘properly’, different approaches may be used to rebalance the tree on the fly to achieve search complexity around logN. The most common is a red-black tree – it is implemented in a C++ STL library for containers such as map and set.
A special structure known as B-tree is developed for working with a large block of data instead of little key-value pairs. It usually has large nodes and large branching factor, and is suitable for storing huge arrays on disk, maintaining O(N) = logN complexity for search. B-trees are used in filesystems, such as NTFS and Ext4, in MySQL as indexes and more.
Of course, this list is incomplete: trees are incredibly common in search algorithms. For instance, there is a tree called ‘trie’, which stores strings or streams as a list of nodes: instead of storing ‘trie’ in a single node, it would have five nodes: root –t-> empty –r-> empty –i-> empty –e-> value.
Unfortunately, it is impossible to cover such wide field of knowledge within such short article. If you are interested, some additional hints for further reading are:
– Heap, if you always need to retrieve minimum or maximum element. For a binary heap, O(N) = 1 for retrieval, O(N) = logN for removal and retrieving next minimum or maximum. Usually implemented as part of the priority queue.
– Skip list is a data structure containing multiple ordered lists. It allows fast logN search and is probabilistically balanced.
– Bloom filter is an extremely efficient data structure that answers ‘not in set’ and ‘probably in set’ for each search. Bloom filters are used as the first step to filter elements, which would not be found in ‘proper’ data structure anyway. If an element has passed the check, then some form of usual search is performed. Used when data sets are really big.
– And there is Grover’s algorithm for quantum computer!
However, we hope that the amount of material is enough to get you started in a right direction. Happy coding!