Tracing Paths with Recursive CTEs

A Pattern That Appears Everywhere

As a network engineer, I frequently analyze IPFIX (IP Flow Information Export) data to understand how traffic traverses our network. When a packet travels from source to destination, it passes through multiple routers, and each router exports flow records showing where it forwarded the traffic next. The challenge is reconstructing the complete path from these individual hop records.

This is a perfect use case for recursive CTEs, SQL's way of following parent-child relationships through connected records. But here's what I've learned: this exact same pattern shows up constantly across completely different domains:

Network routing: Following packets through routers (my original problem)
Package delivery: Tracking shipments through distribution centers
Manufacturing: Tracing components through assembly stages
Org charts: Navigating reporting hierarchies
File systems: Traversing directory structures
Supply chains: Following products from supplier to customer

The algorithm doesn't care what you're tracing, it just follows links until there are no more links to follow. Once you master this pattern, you'll recognize it everywhere.

I've adapted my network flow query here to use package delivery as the example because it's more intuitive, but the SQL structure is identical.

The Problem

Given individual point-to-point records, reconstruct the complete journey:

Individual records show:
- Boston Hub → Chicago Sort
- Chicago Sort → Denver Hub  
- Denver Hub → San Francisco Delivery
- San Francisco Delivery → (delivered)

Need to produce:
Boston Hub → Chicago Sort → Denver Hub → San Francisco Delivery

The Data Model

Table definition

Click the "Run" button to create the tables.

DROP TABLE IF EXISTS package_routing;
CREATE TABLE package_routing (
    id SERIAL PRIMARY KEY,
    tracking_number TEXT NOT NULL,
    current_facility TEXT NOT NULL,
    next_facility TEXT,  -- NULL means final delivery
    scan_time TIMESTAMP NOT NULL,
    package_weight_kg DECIMAL(10,2),
    origin_city TEXT NOT NULL,
    destination_city TEXT NOT NULL
);

Data

Click the "Run" button to setup the data

INSERT INTO package_routing 
    (tracking_number, current_facility, next_facility, scan_time, package_weight_kg, origin_city, destination_city)
VALUES 
    -- Package 1: Boston to San Francisco
    ('PKG-001', 'BOS-HUB', 'CHI-SORT', '2025-03-01 08:00:00', 2.5, 'Boston', 'San Francisco'),
    ('PKG-001', 'CHI-SORT', 'DEN-HUB', '2025-03-01 14:30:00', 2.5, 'Boston', 'San Francisco'),
    ('PKG-001', 'DEN-HUB', 'SFO-DEL', '2025-03-01 20:15:00', 2.5, 'Boston', 'San Francisco'),
    ('PKG-001', 'SFO-DEL', NULL, '2025-03-02 09:45:00', 2.5, 'Boston', 'San Francisco'),
    
    -- Package 2: New York to Los Angeles
    ('PKG-002', 'NYC-HUB', 'CHI-SORT', '2025-03-01 07:00:00', 1.2, 'New York', 'Los Angeles'),
    ('PKG-002', 'CHI-SORT', 'PHX-HUB', '2025-03-01 13:00:00', 1.2, 'New York', 'Los Angeles'),
    ('PKG-002', 'PHX-HUB', 'LAX-DEL', '2025-03-01 18:30:00', 1.2, 'New York', 'Los Angeles'),
    ('PKG-002', 'LAX-DEL', NULL, '2025-03-02 08:00:00', 1.2, 'New York', 'Los Angeles'),
    
    -- Package 3: Short local delivery
    ('PKG-003', 'BOS-HUB', 'BOS-DEL', '2025-03-01 10:00:00', 0.5, 'Boston', 'Cambridge'),
    ('PKG-003', 'BOS-DEL', NULL, '2025-03-01 14:00:00', 0.5, 'Boston', 'Cambridge');

The Solution: Recursive CTE

Start at the first facility and recursively follow the chain until reaching final delivery:

WITH RECURSIVE package_path AS (
    -- Base case: Find the starting facility for each package
    -- The starting point is a facility that isn't the "next_facility" for any other scan
    SELECT 
        tracking_number,
        current_facility,
        next_facility,
        scan_time,
        package_weight_kg,
        origin_city,
        destination_city,
        1 AS hop_number,
        current_facility AS path_so_far,
        scan_time AS first_scan_time
    FROM package_routing pr1
    WHERE NOT EXISTS (
        SELECT 1 
        FROM package_routing pr2 
        WHERE pr2.tracking_number = pr1.tracking_number 
        AND pr2.next_facility = pr1.current_facility
    )
    
    UNION ALL
    
    -- Recursive case: Follow the chain by joining current_facility to next_facility
    -- Keep going until we hit a NULL next_facility (final delivery)
    SELECT 
        pr.tracking_number,
        pr.current_facility,
        pr.next_facility,
        pr.scan_time,
        pr.package_weight_kg,
        pr.origin_city,
        pr.destination_city,
        pp.hop_number + 1,
        pp.path_so_far || ' → ' || pr.current_facility,
        pp.first_scan_time
    FROM package_routing pr
    INNER JOIN package_path pp 
        ON pr.tracking_number = pp.tracking_number
        AND pr.current_facility = pp.next_facility
)
-- Get the complete path for each package
SELECT 
    tracking_number,
    origin_city,
    destination_city,
    MAX(hop_number) AS total_hops,
    MAX(path_so_far) AS complete_path,
    MIN(first_scan_time) AS journey_start,
    MAX(scan_time) AS journey_end,
    ROUND(
        EXTRACT(EPOCH FROM (MAX(scan_time) - MIN(first_scan_time))) / 3600, 
        2
    ) AS transit_hours,
    MAX(package_weight_kg) AS weight_kg
FROM package_path
GROUP BY tracking_number, origin_city, destination_city
ORDER BY tracking_number;

Understanding the Results

tracking_number	origin_city	destination_city	total_hops	complete_path	journey_start	journey_end	transit_hours	weight_kg
PKG-001	Boston	San Francisco	4	BOS-HUB → CHI-SORT → DEN-HUB → SFO-DEL	2025-03-01 08:00	2025-03-02 09:45	25.75	2.50
PKG-002	New York	Los Angeles	4	NYC-HUB → CHI-SORT → PHX-HUB → LAX-DEL	2025-03-01 07:00	2025-03-02 08:00	25.00	1.20
PKG-003	Boston	Cambridge	2	BOS-HUB → BOS-DEL	2025-03-01 10:00	2025-03-01 14:00	4.00	0.50

How Recursive CTEs Work

A recursive CTE has two parts connected by UNION ALL:

Base Case (Anchor): Finds the starting point(s)
- Package routing: facilities that aren't anyone's "next stop"
- Network flows: the ingress router
- Org charts: the CEO (no manager)
Recursive Case: Follows the chain
- Joins the previous result back to the table
- Continues until no more matches are found

The recursion automatically stops when the recursive query returns no rows.

Adapting to Other Domains

The structure stays the same. For my original network flow problem:

-- Base case finds ingress router (not anyone's next_hop)
-- Recursive case joins on: exporter_ip = previous next_hop_ip
-- Result: complete packet path through the network

For an organizational hierarchy:

-- Base case finds CEO (manager_id IS NULL)
-- Recursive case joins on: employee_id = previous manager_id  
-- Result: reporting chain from CEO down to each employee

For manufacturing bill-of-materials:

-- Base case finds finished product (not used in any assembly)
-- Recursive case joins on: component_id = previous assembly_id
-- Result: complete parts breakdown

Same pattern, different column names.

Try It Yourself

Challenge 1: Find Average Transit Time by Route

Modify the query to group by origin and destination cities to find which routes are fastest on average. This helps identify consistently efficient paths versus problematic ones.

Challenge 2: Detect Routing Anomalies

Add a check to identify packages that visit the same facility more than once (potential routing loops). Hint: you can track visited facilities in an array during recursion and check for duplicates.

Switch to SQL to try these challenges.

Why This Matters

When I first built a recursive query for network flow analysis, I thought it was specialized networking SQL. Then I needed it for manufacturing data. Then for org charts. Then for file system analysis. Then someone asked about package tracking.

The realization: most connected data is the same problem. Trees, chains, hierarchies, paths—they all use recursive CTEs. Master this pattern once, recognize it everywhere.

The SQL doesn't care what you're connecting. It just follows links until there are no more links to follow.