Database 101: Understanding B-Trees

The structure of a B-Tree
The B-Tree Operations
Why we need the balancing?
How B-Tree are being structured in storages?
How Database applies B-Tree structure?
Conclusion
Resources

In the last months, I’m really interested in how database works internally. One of my questions was: “How databases manages indexes? And how they are organized into the storage?”.

I tried to find some resources to understand it, but I didn’t find a good resource that could explain to me in a clear way exactly how it works.

Then, I started to read the Database Internals book, inside it they have some chapters related to B-Tree and how they work. I really recommend this book if you want to learn better how databases work internally.

In this post, I’ll explain to you how B-Trees works as a way I would like to learn about it.

As a way to a better communication through the post, I’ll be calling B-Tree, but I think that it’s important to mention that it will be in notion of B-Tree as an “umbrella” for a given particular family of structures. In case, for our scenario of explanation, I’ll be using the “on-disk B+Tree” as our reference.

If you want to understand better the differences between B-Tree and B+Tree, I recommend you to read this post here.

The structure of a B-Tree

B-Trees are a kind of self-balancing tree data structure that maintains sorted data over a number of levels. It’s composed by nodes that have two specific properties: a key and a value, where the key is the data that will be sorted to let the tree be balanced and handle the binary search through it.

This tree are composed by three specific kind of nodes:

Root node: The top node of the tree, it’s the only node that doesn’t have a parent node.
Internal node: The nodes that are between the root node and the leaf nodes. They are used to guide the search through the tree.
Leaf node: The bottom nodes of the tree, they are the nodes that have the data that we want to search.

The structure of a B-Tree with the nodes and their kinds

A core property of B-Trees is what we can call it as “fanout”, which is the maximum number of childrens that each non-leaf node can have. The fact that the fanout is how much pointers targeting the children nodes

a node can have, it’s a key property to understand how B-Trees are efficient to search through the data.

In the example of the scenario above, the fanout of the tree is 3. This means that each non-leaf node can have up to 3 children. For a production-ready B-Tree, this fanout number would be much higher, in the order of 100 or more, but for the sake of simplicity, we’ll use 2 as a reference.

The B-Tree Operations

Explain about insert
Explain about delete
Explain about lookup
Hook to start an explanation about the balancing in both insert/delete

Why we need the balancing?

Explain how the balancing is important for B-Trees
Why a balanced tree has more performance than a non-balanced tree
Explain how each operation has being processed by the rebalancing of a B-Tree

How B-Tree are being structured in storages?

Give an introduction about the difference between HDDs and SSDs
Explain about the disk structure containing cells, pages, etc
Explain how each B-Tree works handling nodes and pages
Explain about marshalling and how serialized B-Tree data are organized on storage in byte sequence

How Database applies B-Tree structure?

Explain how database implements B-Trees
How the indices are build
How they insert the B-Trees in memory and dump into storage after fulfilled
How the leaf nodes targets the offset range data for a given record data on the given table

Conclusion

Just a conclusion lmao
Write anything that complements the dumped infos

Resources

Database Internals