Cycle Detection for Recursive Search in Hierarchical Trees

Recursive queries are a straightforward solution to querying hierarchical trees. However, one loop in the relationship references results in a failing or never ending query when cycle detection is not used.

For storing trees in a database the adjacency list model is a straightforward implementation, where every row contains a reference to the parent row. One recursive query can select all parents or children for a specific row instead of the most common implementation to execute multiple queries. However, accidental loops in the data's references would fail a recursive query when cycle detection is not used to skip these errors. The database can provide cycle detection, or the same logic is implemented manually with a JSON array accumulating the recursion's path to stop at already seen rows.

Usage

MySQL

WITH RECURSIVE tree AS (
    SELECT id, name, 1 AS level, JSON_ARRAY(id) AS path
    FROM employees
    WHERE manager_id = 1
  UNION
    SELECT employees.id, employees.name, tree.level + 1, JSON_ARRAY_APPEND(tree.path, '$', employees.id)
    FROM tree
    JOIN employees ON(tree.id = employees.manager_id)
    WHERE NOT employees.id MEMBER OF(tree.p)
)
SELECT * FROM tree

PostgreSQL

WITH RECURSIVE tree AS (
    SELECT id, name, 1 AS level
    FROM employees
    WHERE manager_id = 1
  UNION
    SELECT employees.id, employees.name, tree.level + 1
    FROM tree
    JOIN employees ON(tree.id = employees.manager_id)
) CYCLE id SET is_cycle USING cycle_path
SELECT * FROM tree

Detailed Explanation

Trees can be stored in a database as e.g. nested sets or materialized paths. However, the most popular one is the adjacency list model storing a parent_id reference within every record. This approach has the most effortless implementation for saving the data as no complex transformations need to be done. It is a very straightforward concept.

A naive implementation will select the data for one level and execute consecutive queries for every further tree level. For big trees, this will result in increased latency as another query needs to be executed after the last one has finished. However, a single recursive query can do that in a single step:

  1. A virtual table tree is created using a recursive Common Table Expression (CTE) that accumulates the results. As the CTE is recursive, the queries within the CTE can reference themselves.
  2. A base query will select the first level of the tree as starting point for the recursive query: SELECT id, name, 1 AS level FROM employees WHERE manager_id = 1
  3. A recursive query using the tree CTE will be executed to find new rows from a different level of the tree incrementing the level value: SELECT employees.id, employees.name, tree.level + 1 FROM tree JOIN employees ON(tree.id = employees.manager_id)

The adjacency list is a simple model which is easy to implement but has one serious drawback: The relationships should not have loops! In PostgreSQL such a query will run endlessly and never finish. However, limiting the rows read from the recursive CTE will automatically stop it when the threshold is reached. This workaround will not work in MySQL as the recursive CTE will first have to finish executing before the data is read from it and a LIMIT will be applied. Nevertheless, in MySQL any recursive query will automatically fail with an error after 1.000 recursive steps. This limit can be increased with the cte_max_recursion_depth system variable to search within deep trees.

These safety measures to limit the runtime of a query are lovely, but the optimal solution would be to stop the recursion when a loop is detected. This feature is called “cycle detection” and is part of the SQL standard. It is supported since PostgreSQL 14 and only needs a small extension to the CTE. The CYCLE clause states that loops should be detected on the id column by automatically accumulating the tree path in the cycle_path ` column and using it to calculate the exit condition is_cycle : CYCLE id SET is_cycle USING cycle_path . The recursion will automatically stop for any detected loop.

With MySQL and older PostgreSQL versions automatic cycle-detection is not available. Fortunately, the `CYCLE` feature is just syntactic sugar for the manual approach that has been available for years:

  • The base query adds the starting element id for every row into a JSON array containing the elements added on the path.
  • The recursive query appends the current row to the path to update the query's path.
  • The recursive query will only continue with rows not contained in the query's path for the current recursion step.

Additional Resources

Author Tobias Petry

Tobias Petry

SQL for Devs Founder


Are you interested to learn more?

Be notified on future content. Never spam.