Or adding a “tree” to your app.
Problem
It’s often necessary to store hierarchical data in DB. We’ll review available models and their strengths and weaknesses. Since knowing theory is good but not enough, pluggable solutions are considered for Ruby On Rails framework.
Theory
There are several ways to model recursive/hierarchical data with relational DB:
- Adjacency List
- Nested Sets
- Path Enumeration (aka Materialized Path)
- Closure Table
- Adjacency List with recursive query
Note: following examples are missing foreign keys and indexes just for sake of simplicity
Lets consider simplified example
Representations
Graph Table
A node, parent
/ \ A,
B E B, A
/ \ C, B
C D D, B
E, A
Adjacency List
The simplest. Chosen as default by many developers.
CREATE TABLE nodes(
id SERIAL PRIMARY KEY,
parent_id BIGINT NOT NULL,
text varchar(10) NOT NULL,
);
-- add foreign keys for referential consistency
-- add unique index to avoid duplication
Inserting records is straightforward:
INSERT INTO nodes(id, parent_id, text)
VALUES
(1, 1, 'A'), -- Root.
(2, 1, 'B'),
(3, 2, 'C'),
(4, 2, 'D'),
(5, 1, 'E');
Retrieving data becomes problematic though.
Say we want to SELECT
whole hierarchy:
SELECT * FROM nodes where parent_id = 1;
id | parent_id | text
----+-----------+------
1 | 1 | A
2 | 1 | B
5 | 1 | E
but only the immediate children get fetched instead whole “tree”.
Same issue exists for both UPDATE
and DELETE
queries.
Pros
- simplest data model
Cons
- expensive querying (unless DB supports recursive querying)
“Adjacency List” model is sufficient data model describing a tree. But the lack of recursive querying support(ie in MySQL) leads to complicated client code.
“Adjacency List” model alone is often not enough as a real world solution.
Nested Sets
This one is probably the second most popular
Graphical representation:
+---------------------------+
| A |
| |
| +----------------+ +----+ |
| | B | | E | |
| | | | | |
| | +----+ +----+ | +----+ |
| | | C | | D | | |
| | | | | | | |
| | +----+ +----+ | |
| +----------------+ |
+---------------------------+
1 2 3 4 5 6 7 8 9 10
with corresponding table representation:
id, text, lft, rgt
1, A, 1, 10
2, B, 2, 7
3, C, 3, 4
4, D, 5, 6
5, E, 8, 9
SQL example
Inserting initial records requires knowing lft
and rgt
values in advance or inserting them one by one calculating and updating lft
and rgt
values.
INSERT INTO nodes(id, text, lft, rgt)
VALUES
(1, 'A', 1, 10),
(2, 'B', 2, 7 ),
(3, 'C', 3, 4 ),
(4, 'D', 5, 6 ),
(5, 'E', 8, 9 );
Selecting is very efficient:
SELECT c.id, c.text, c.lft, c.rgt FROM nodes c, nodes p WHERE p.lft < c.lft AND p.rgt > c.rgt AND p.id = 2;
id, text, lft, rgt
3, 'C', 3, 4
4, 'D', 5, 6
On the other hand Inserting nodes may be very inefficient as requires updating existing records depending on where record gets inserted. This is one of drawbacks of the “Nested Set” model, as it may get quite slow with many records.
Pros
- fast selects
- relatively simple model
Cons
- slow updates
- gets complicated
Path Enumeration
This solution tries to address shortages of “Adjacency List” by storing ancestor ids, usually, as a string.
In other words, instead storing single parent_id
, all parent_id
s up to the very root and the node id
itself are stored.
CREATE TABLE nodes(
id SERIAL PRIMARY KEY,
path VARCHAR(1000) NOT NULL, -- 1000 may become a limitation in case of very deep hierarchies
text TEXT NOT NULL
-- make sure to index path for faster lookups
);
Inserting records is straightforward
INSERT INTO nodes(id, path, text) VALUES
(1, '1/' , 'A'), -- Root.
(2, '1/2/' , 'B'),
(3, '1/2/3/', 'C'),
(4, '1/2/4/', 'D'),
(5, '1/5/' , 'E');
Please note that there’s a “little” issue in the INSERT
above: in general the id of the node is not yet known at the time of INSERT
.
So in order to be consistent, it requires an extra UPDATE
to get the id
appended to the corresponding path
.
Retrieving data is relatively straightforward.
To retrieve hierarchy starting with root id=2
SELECT * FROM nodes WHERE path LIKE '1/2/' || '%';
2 | 1/2/ | B
3 | 1/2/3/ | C
4 | 1/2/4/ | D
To retrieve ancestors and node with id=4
itself
SELECT * FROM nodes WHERE '1/2/4/' like path || '%';
1 | 1/ | A
2 | 1/2/ | B
4 | 1/2/4/ | D
UPDATE
and DELETE
use same approach.
Pros
- simple data model
Cons
- no referential integrity
- path has length limitation with all possible outcomes
Closure table
Closure Table is like Path Enumeration, except that paths are stored in separate table.
Table representation
Nodes Paths
id, text ancestor, descendant
1, A 1, 1
2, B 1, 2
3, C 1, 3
4, D 1, 4
5, E 1, 5
2, 2
2, 3
2, 4
3, 3
4, 4
5, 5
SQL example
CREATE TABLE nodes(id SERIAL, text VARCHAR(10));
CREATE TABLE paths(ancestor_id INTEGER, descendant_id INTEGER, PRIMARY KEY(ancestor_id, descendant_id));
At least 2 inserts are required to create a tree, since nodes and paths are stored in separate tables:
-- nodes
INSERT INTO nodes(id, text)
VALUES
(1, 'A'),
(2, 'B'),
(3, 'C'),
(4, 'D'),
(5, 'E');
-- paths
INSERT INTO paths(ancestor_id, descendant_id)
VALUES
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
(2, 2), (2, 3), (2, 4),
(3, 3),
(4, 4),
(5, 5);
Selecting a subtree with root id=2
SELECT nodes.*, paths.ancestor_id as parent_id
FROM nodes
JOIN paths on (nodes.id = paths.descendant_id)
WHERE paths.ancestor_id = 2
yields:
id, text, parent_id
2, B, 2
3, C, 2
4, D, 2
Closure Table has one extra feature unlike previous models: it allows a node to have multiple parents.
Pros
- relatively simple model
- multiple parents
- referential integrity
Cons
- requires 2 tables
- redundant with paths storage
Adjacency List with recursive query
It’s important to realize that previously discussed approaches emerged as result of RDBMS lacking native support for querying recursive data structures. The problem was addressed in ANSI SQL-99 standard with CTE
The model is the same Adjacency List. The difference is only in actual SQL used to query nodes.
Lets reuse the example:
CREATE TABLE nodes( id SERIAL PRIMARY KEY, parent_id integer NOT NULL, text varchar(10) NOT NULL);
Inserting records is straightforward:
INSERT INTO nodes(id, parent_id, text)
VALUES
(1, 1, 'A'), -- Root.
(2, 1, 'B'),
(3, 2, 'C'),
(4, 2, 'D'),
(5, 1, 'E');
Say we want to SELECT
a whole hierarchy, (note WITH RECURSIVE
):
WITH RECURSIVE tree AS (
-- initial node
SELECT id, parent_id, text
FROM nodes
WHERE id = parent_id -- start from the root
UNION all
-- recursive descent
SELECT c.id, c.parent_id, c.text
FROM tree p
JOIN nodes c ON c.parent_id = p.id AND c.parent_id != c.id
)
-- and finally the actual select
SELECT * FROM tree order by id;
yields
id | parent_id | text
----+-----------+------
1 | 1 | A
2 | 1 | B
3 | 2 | C
4 | 2 | D
5 | 1 | E
yields nodes as expected.
UPDATE
and DELETE
use same approach
Pros
- simple model
- native DB support
- minimum client code
Cons
- not supported by many DBs
Models Comparison summary
Here’s a summary of the approaches:
Models | Tables | QChild | QTree | Insert | Delete | Ref.Int |
---|---|---|---|---|---|---|
AdjacencyList | 1 | Easy | Hard | Easy | Easy | Yes |
Path Enumeration | 1 | Easy | Easy | Easy | Easy | No |
Nested Sets | 1 | Hard | Easy | Hard | Hard | No |
Closure Table | 2 | Easy | Easy | Easy | Easy | Yes |
AdjacencyList & recur. query | 1 | Easy | Easy | Easy | Easy | Yes |
Legend:
QChild
= Query ChildQTree
= Query TreeRef.Int
= Referent Integrity
Taken from “SQL Antipatterns”
Rails plugins to handle “Trees”
Knowing theory is good, but “great coders” reuse code so here’s the list of Rails-ready plugins for your next app:
Adjacency List
Path Enumeration
Nested Sets
Closure Tree
Adjacency List with recursive query
Postgres
I’d like to give Postgres a credit for supporting “Adjacency List” and “Path Enumeration” aka ltree natively. And because Postgres supports Recursive Querying “Adjacency List” and “Adjacency List with Recursive Querying” are the same.
This is one more reason to use Postgres.