Tuesday, September 9, 2008

Binary Trees - Does it matter?

Ever programmed a Binary Tree before? If you took computer science classes past an introductory level, you probably studied the theory before. But to what extent do decisions matter?

To review for those of you who don't quite recall what a binary tree is. You have a series of nodes, each node contains data, some or all of which is the key, up to two pointers to children nodes, and possibly other meta data. The most popular type of binary tree is the Binary Search Tree, where each node conceptually has nodes with lesser keys descending on the left, and nodes with greater keys descending on the right.

How to implement a left and right can be done with two node pointers called left and right, lesser and greater, an array of two node pointers, or any other method which makes sense to you.

All operations begin with the topmost node in the figure above, known as the root, head, or top node of the tree. To find a node with a particular key, one compares it against the current node, if it found it good, otherwise, if it's less, take the left route, otherwise the right. If there is no path left or right as requested, since there is only a null pointer there, then a node with a matching key is not in the tree. At each node reached, the same algorithm is applied. In this instance, a Binary Tree is very similar to Binary Search, each check effectively cuts the data in half, leading to very efficient searches.

Adding data to a tree is very much the same. The data is encapsulated in a node, which again goes down the tree finding where the data should belong, in this case, once a null pointer is encountered while trying to traverse the tree further, it is replaced by a pointer to the new node.

Deleting data can be a bit trickier. If a node has no children, deleting a node is quite straight forward. If the node only has a path left or right, one simply cuts out the node being deleted, and makes the pointer once pointing to it now point to the child. However, for removing a node which has two children descending from it, there's a number of methods that can be applied.

The simplest method is to make one child a replacement, and add the other child onto the child replacing, meaning add the left node to the right or the right to the left.

Look what happens when deleting the root node, which in this case has two children.

The node with a key of 50 is deleted which had two children, 25 and 100. Either 25 becomes the new root, with 100 as the rightmost of 25's descendants, or 100 becomes the new root, with 25 as the leftmost of 100's descendants.

The only problem with this algorithm is that it can quickly make the tree become unbalanced, degrading into a linked list. A tree should be as close to balanced as possible, so search or addition can be done with cutting half the tree out per each node looked at. An unbalanced tree would make each operation generally consist of looking at most of the nodes in the tree, which is inefficient.

Therefore, the generally proposed algorithm is to find the next or previous node by key, and move it to replace the node being deleted. First lest us understand tree order.

A binary tree can be processed in direct order using the following algorithm.

function process_node(Node *node)
if (node->has_left()) { process_node(node->left()); }
if (node->has_right()) { process_node(node->right()); }

Which is plain English means handle left, then current, then right, in a recursive fashion. Because a Binary Tree can be processed in this fashion, in addition to being a nice data structure, it can also be used as the basis of a sorting algorithm. All data is encapsulated in a node and is added to the tree, then it is traversed in order as just described, placing the data back into the original data structure, and then the tree is destroyed.

Now understanding this, the node prior to any particular node in regards to its key is always one left, and then rightmost, and the node after a particular key is one right, and then left most.

And the delete:

50 is deleted, and the previous or next one takes its place. In this case, 35 being the previous, and 75 being the next one.

Now to get to the meat of the article, does it matter? Does what matter?

First off, we didn't at all mention how to handle nodes which have equivalent keys. Back when I was in college learning computer science, we were told to just throw it down the right or the left, but be consistent about it. Also, when it comes to deleting, take the previous or the next, but again, it doesn't matter, as long as one was consistent about it. Or perhaps in this latter case, alternate sides per delete.

But the question we must ask is: "Does it matter?"

And the answer which they at the very least didn't teach me when I was in school, yes, it does matter!

When one sorts, one can do a stable or an unstable sort. Stable means that if two elements had equivalent keys, when sorting, the order of those nodes would be preserved. Meaning if we had: "50 C, 45 A, 23 C, 50 B, 75 D, 50 A", and the numbers would make up the key, but not the letters, then it would sort to: "23 C, 45 A, 50 C, 50 B, 50 A, 75 D", keeping the order of C,B,A among the elements with a key of 50. Unstable would mean that the relative order of those 50s is not guaranteed to be kept when sorting.

A binary tree can only be used as a stable sort if equal data is always placed on the right side. This is because the order of succession is left, current, right, and if it was placed on the left, it would mean the node added later came before the current node.

What not to do:

However, just putting equal data on the right isn't the end of the matter yet. Most good Binary Trees, such as a Red & Black Tree will implement some kind of self balancing system, so areas which are starting to become linked lists rotate to become balanced trees.

Looks like a linked list:

As can be seen in the diagrams, the middle node of the three becomes the new parent of the section with the first node becoming the left. This still follows the in order rules of left, center, right, because the one of the left is indeed the first node. And rotations like this, would also preserve numbering orders of the keys (otherwise a Red & Black Tree or another self balancing tree would never employ them).

Now adding another 50 on our proposed case still works with our initially stated algorithm, we traverse down till we reach our area of 50s, then place it to the right of the right most one. However, we have to slightly rewrite our find a node which matches a key algorithm. If we want our tree to always return the first node which matches a key, or in the case of deletion, delete the first node which matches a key, one can't just search until one finds a matching node, since it in fact may be the second or third node if rotations took place. Rather, one has check to see if the left contains the same value, and if so, go left most across matching keys, and return the deepest one found there.

Now after reviewing these revised rules, is your binary tree written properly? Furthermore, have you ever even seen an implementation which when searching checks left most for matching keys?

Oh, and before we leave, what about deleting, does it matter if the prior or the next node is chosen? The other night, a student of mine asked me that, and presented to me a case which we both agreed proved that to properly delete and keep a stable sort, one has to use the next node, and not the prior one to properly preserve the key order. Although, we were both tired at the time, and now when trying to recreate the case I can't seem to do it. So I do hope I was in error the other night, and am not leaving out a crucial point. If you happen to know of a case where it matters, please describe it in the comments.