Recursive data structures with Rails

gmarik 7 min
Table Of Contents ↓

Or adding a “tree” to your app.

Problem

It’s often necessary to store hierarchical data in DB. We’ll review available models and their strengths and weaknesses. Since knowing theory is good but not enough, pluggable solutions are considered for Ruby On Rails framework.

Theory

There are several ways to model recursive/hierarchical data with relational DB:

Note: following examples are missing foreign keys and indexes just for sake of simplicity

Lets consider simplified example

Representations

    Graph                 Table

       A                  node, parent
      / \                  A,
     B   E                 B,   A
    / \                    C,   B
   C   D                   D,   B
                           E,   A

Adjacency List

The simplest. Chosen as default by many developers.

CREATE TABLE nodes(
  id SERIAL PRIMARY KEY,
  parent_id BIGINT NOT NULL,
  text varchar(10) NOT NULL,
);
-- add foreign keys for referential consistency
-- add unique index to avoid duplication

Inserting records is straightforward:

INSERT INTO nodes(id, parent_id, text) 
  VALUES
    (1, 1, 'A'), -- Root. 
    (2, 1, 'B'),
    (3, 2, 'C'),
    (4, 2, 'D'),
    (5, 1, 'E');

Retrieving data becomes problematic though.

Say we want to SELECT whole hierarchy:

SELECT * FROM nodes where parent_id = 1;
id | parent_id | text 
----+-----------+------
  1 |         1 | A
  2 |         1 | B
  5 |         1 | E

but only the immediate children get fetched instead whole “tree”. Same issue exists for both UPDATE and DELETE queries.

Pros

Cons

“Adjacency List” model is sufficient data model describing a tree. But the lack of recursive querying support(ie in MySQL) leads to complicated client code.

“Adjacency List” model alone is often not enough as a real world solution.

Nested Sets

This one is probably the second most popular

Graphical representation:

  +---------------------------+
  |             A             |
  |                           |
  | +----------------+ +----+ |
  | |       B        | | E  | |
  | |                | |    | |
  | | +----+  +----+ | +----+ |
  | | | C  |  | D  | |        |
  | | |    |  |    | |        |
  | | +----+  +----+ |        |
  | +----------------+        |
  +---------------------------+
  1 2 3    4  5    6 7 8   9  10

with corresponding table representation:

id, text, lft, rgt
 1, A,    1,   10
 2, B,    2,   7
 3, C,    3,   4
 4, D,    5,   6
 5, E,    8,   9

SQL example

Inserting initial records requires knowing lft and rgt values in advance or inserting them one by one calculating and updating lft and rgt values.

INSERT INTO nodes(id, text, lft, rgt) 
VALUES
  (1, 'A', 1, 10),
  (2, 'B', 2, 7 ),
  (3, 'C', 3, 4 ),
  (4, 'D', 5, 6 ),
  (5, 'E', 8, 9 );

Selecting is very efficient:

SELECT c.id, c.text, c.lft, c.rgt FROM nodes c, nodes p WHERE p.lft < c.lft AND p.rgt > c.rgt AND p.id = 2;
id, text, lft, rgt
3,  'C',  3,   4
4,  'D',  5,   6

On the other hand Inserting nodes may be very inefficient as requires updating existing records depending on where record gets inserted. This is one of drawbacks of the “Nested Set” model, as it may get quite slow with many records.

Pros

Cons

Path Enumeration

This solution tries to address shortages of “Adjacency List” by storing ancestor ids, usually, as a string.

In other words, instead storing single parent_id, all parent_ids up to the very root and the node id itself are stored.

CREATE TABLE nodes(
    id        SERIAL PRIMARY KEY,
    path      VARCHAR(1000) NOT NULL, -- 1000 may become a limitation in case of very deep hierarchies
    text      TEXT NOT NULL
    -- make sure to index path for faster lookups
);

Inserting records is straightforward

INSERT INTO nodes(id, path, text) VALUES
    (1, '1/'    , 'A'), -- Root. 
    (2, '1/2/'  , 'B'),
    (3, '1/2/3/', 'C'),
    (4, '1/2/4/', 'D'),
    (5, '1/5/'  , 'E');

Please note that there’s a “little” issue in the INSERT above: in general the id of the node is not yet known at the time of INSERT. So in order to be consistent, it requires an extra UPDATE to get the id appended to the corresponding path.

Retrieving data is relatively straightforward.

To retrieve hierarchy starting with root id=2

SELECT * FROM nodes WHERE path LIKE '1/2/' || '%';
2 | 1/2/   | B
3 | 1/2/3/ | C
4 | 1/2/4/ | D

To retrieve ancestors and node with id=4 itself

SELECT * FROM nodes WHERE '1/2/4/' like path || '%';
  1 | 1/     | A
  2 | 1/2/   | B
  4 | 1/2/4/ | D

UPDATE and DELETE use same approach.

Pros

Cons

Closure table

Closure Table is like Path Enumeration, except that paths are stored in separate table.

Table representation

  Nodes                 Paths
  id, text              ancestor, descendant
   1, A                  1,       1
   2, B                  1,       2
   3, C                  1,       3
   4, D                  1,       4
   5, E                  1,       5
                         2,       2
                         2,       3
                         2,       4
                         3,       3
                         4,       4
                         5,       5

SQL example

CREATE TABLE nodes(id SERIAL, text VARCHAR(10));
CREATE TABLE paths(ancestor_id INTEGER, descendant_id INTEGER, PRIMARY KEY(ancestor_id, descendant_id));

At least 2 inserts are required to create a tree, since nodes and paths are stored in separate tables:

-- nodes
INSERT INTO nodes(id, text) 
  VALUES
    (1, 'A'),
    (2, 'B'),
    (3, 'C'),
    (4, 'D'),
    (5, 'E');

-- paths
INSERT INTO paths(ancestor_id, descendant_id)
  VALUES
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 2), (2, 3), (2, 4),
    (3, 3),
    (4, 4),
    (5, 5);

Selecting a subtree with root id=2

SELECT nodes.*, paths.ancestor_id as parent_id 
  FROM nodes 
  JOIN paths on (nodes.id = paths.descendant_id) 
  WHERE paths.ancestor_id = 2

yields:

id, text, parent_id
 2,  B,    2
 3,  C,    2
 4,  D,    2

Closure Table has one extra feature unlike previous models: it allows a node to have multiple parents.

Pros

Cons

Adjacency List with recursive query

It’s important to realize that previously discussed approaches emerged as result of RDBMS lacking native support for querying recursive data structures. The problem was addressed in ANSI SQL-99 standard with CTE

The model is the same Adjacency List. The difference is only in actual SQL used to query nodes.

Lets reuse the example:

CREATE TABLE nodes( id SERIAL PRIMARY KEY, parent_id integer NOT NULL, text varchar(10) NOT NULL);

Inserting records is straightforward:

INSERT INTO nodes(id, parent_id, text)
  VALUES
    (1, 1, 'A'), -- Root.
    (2, 1, 'B'),
    (3, 2, 'C'),
    (4, 2, 'D'),
    (5, 1, 'E');

Say we want to SELECT a whole hierarchy, (note WITH RECURSIVE):

WITH RECURSIVE tree AS (
  -- initial node
  SELECT id, parent_id, text
    FROM nodes
    WHERE id = parent_id -- start from the root
  UNION all
  -- recursive descent
  SELECT c.id, c.parent_id, c.text
    FROM tree p
    JOIN nodes c ON c.parent_id = p.id AND c.parent_id != c.id
)

-- and finally the actual select
SELECT * FROM tree order by id;

yields

 id | parent_id | text
----+-----------+------
  1 |         1 | A
  2 |         1 | B
  3 |         2 | C
  4 |         2 | D
  5 |         1 | E

yields nodes as expected.

UPDATE and DELETE use same approach

Pros

Cons

Models Comparison summary

Here’s a summary of the approaches:

ModelsTablesQChildQTreeInsertDeleteRef.Int
AdjacencyList1EasyHardEasyEasyYes
Path Enumeration1EasyEasyEasyEasyNo
Nested Sets1HardEasyHardHardNo
Closure Table2EasyEasyEasyEasyYes
AdjacencyList & recur. query1EasyEasyEasyEasyYes

Legend:

Taken from “SQL Antipatterns”

Rails plugins to handle “Trees”

Knowing theory is good, but “great coders” reuse code so here’s the list of Rails-ready plugins for your next app:

Adjacency List

Path Enumeration

Nested Sets

Closure Tree

Adjacency List with recursive query

Postgres

I’d like to give Postgres a credit for supporting “Adjacency List” and “Path Enumeration” aka ltree natively. And because Postgres supports Recursive Querying “Adjacency List” and “Adjacency List with Recursive Querying” are the same.

This is one more reason to use Postgres.

References

Read More
Orphan vs Zombie vs Daemon processes
Regularly.me or how to GTD
Comments
read or add one↓