As a network engineer, I frequently analyze IPFIX (IP Flow Information Export) data to understand how traffic traverses our network. When a packet travels from source to destination, it passes through multiple routers, and each router exports flow records showing where it forwarded the traffic next. The challenge is reconstructing the complete path from these individual hop records.
This is a perfect use case for recursive CTEs, SQL's way of following parent-child relationships through connected records. But here's what I've learned: this exact same pattern shows up constantly across completely different domains:
The algorithm doesn't care what you're tracing, it just follows links until there are no more links to follow. Once you master this pattern, you'll recognize it everywhere.
I've adapted my network flow query here to use package delivery as the example because it's more intuitive, but the SQL structure is identical.
Given individual point-to-point records, reconstruct the complete journey:
Individual records show:
- Boston Hub → Chicago Sort
- Chicago Sort → Denver Hub
- Denver Hub → San Francisco Delivery
- San Francisco Delivery → (delivered)
Need to produce:
Boston Hub → Chicago Sort → Denver Hub → San Francisco Delivery
Click the "Run" button to create the tables.
DROP TABLE IF EXISTS package_routing;
CREATE TABLE package_routing (
id SERIAL PRIMARY KEY,
tracking_number TEXT NOT NULL,
current_facility TEXT NOT NULL,
next_facility TEXT, -- NULL means final delivery
scan_time TIMESTAMP NOT NULL,
package_weight_kg DECIMAL(10,2),
origin_city TEXT NOT NULL,
destination_city TEXT NOT NULL
);
Click the "Run" button to setup the data
INSERT INTO package_routing
(tracking_number, current_facility, next_facility, scan_time, package_weight_kg, origin_city, destination_city)
VALUES
-- Package 1: Boston to San Francisco
('PKG-001', 'BOS-HUB', 'CHI-SORT', '2025-03-01 08:00:00', 2.5, 'Boston', 'San Francisco'),
('PKG-001', 'CHI-SORT', 'DEN-HUB', '2025-03-01 14:30:00', 2.5, 'Boston', 'San Francisco'),
('PKG-001', 'DEN-HUB', 'SFO-DEL', '2025-03-01 20:15:00', 2.5, 'Boston', 'San Francisco'),
('PKG-001', 'SFO-DEL', NULL, '2025-03-02 09:45:00', 2.5, 'Boston', 'San Francisco'),
-- Package 2: New York to Los Angeles
('PKG-002', 'NYC-HUB', 'CHI-SORT', '2025-03-01 07:00:00', 1.2, 'New York', 'Los Angeles'),
('PKG-002', 'CHI-SORT', 'PHX-HUB', '2025-03-01 13:00:00', 1.2, 'New York', 'Los Angeles'),
('PKG-002', 'PHX-HUB', 'LAX-DEL', '2025-03-01 18:30:00', 1.2, 'New York', 'Los Angeles'),
('PKG-002', 'LAX-DEL', NULL, '2025-03-02 08:00:00', 1.2, 'New York', 'Los Angeles'),
-- Package 3: Short local delivery
('PKG-003', 'BOS-HUB', 'BOS-DEL', '2025-03-01 10:00:00', 0.5, 'Boston', 'Cambridge'),
('PKG-003', 'BOS-DEL', NULL, '2025-03-01 14:00:00', 0.5, 'Boston', 'Cambridge');
Start at the first facility and recursively follow the chain until reaching final delivery:
WITH RECURSIVE package_path AS (
-- Base case: Find the starting facility for each package
-- The starting point is a facility that isn't the "next_facility" for any other scan
SELECT
tracking_number,
current_facility,
next_facility,
scan_time,
package_weight_kg,
origin_city,
destination_city,
1 AS hop_number,
current_facility AS path_so_far,
scan_time AS first_scan_time
FROM package_routing pr1
WHERE NOT EXISTS (
SELECT 1
FROM package_routing pr2
WHERE pr2.tracking_number = pr1.tracking_number
AND pr2.next_facility = pr1.current_facility
)
UNION ALL
-- Recursive case: Follow the chain by joining current_facility to next_facility
-- Keep going until we hit a NULL next_facility (final delivery)
SELECT
pr.tracking_number,
pr.current_facility,
pr.next_facility,
pr.scan_time,
pr.package_weight_kg,
pr.origin_city,
pr.destination_city,
pp.hop_number + 1,
pp.path_so_far || ' → ' || pr.current_facility,
pp.first_scan_time
FROM package_routing pr
INNER JOIN package_path pp
ON pr.tracking_number = pp.tracking_number
AND pr.current_facility = pp.next_facility
)
-- Get the complete path for each package
SELECT
tracking_number,
origin_city,
destination_city,
MAX(hop_number) AS total_hops,
MAX(path_so_far) AS complete_path,
MIN(first_scan_time) AS journey_start,
MAX(scan_time) AS journey_end,
ROUND(
EXTRACT(EPOCH FROM (MAX(scan_time) - MIN(first_scan_time))) / 3600,
2
) AS transit_hours,
MAX(package_weight_kg) AS weight_kg
FROM package_path
GROUP BY tracking_number, origin_city, destination_city
ORDER BY tracking_number;
| tracking_number | origin_city | destination_city | total_hops | complete_path | journey_start | journey_end | transit_hours | weight_kg |
|---|---|---|---|---|---|---|---|---|
| PKG-001 | Boston | San Francisco | 4 | BOS-HUB → CHI-SORT → DEN-HUB → SFO-DEL | 2025-03-01 08:00 | 2025-03-02 09:45 | 25.75 | 2.50 |
| PKG-002 | New York | Los Angeles | 4 | NYC-HUB → CHI-SORT → PHX-HUB → LAX-DEL | 2025-03-01 07:00 | 2025-03-02 08:00 | 25.00 | 1.20 |
| PKG-003 | Boston | Cambridge | 2 | BOS-HUB → BOS-DEL | 2025-03-01 10:00 | 2025-03-01 14:00 | 4.00 | 0.50 |
A recursive CTE has two parts connected by UNION ALL:
Base Case (Anchor): Finds the starting point(s)
Recursive Case: Follows the chain
The recursion automatically stops when the recursive query returns no rows.
The structure stays the same. For my original network flow problem:
-- Base case finds ingress router (not anyone's next_hop)
-- Recursive case joins on: exporter_ip = previous next_hop_ip
-- Result: complete packet path through the network
For an organizational hierarchy:
-- Base case finds CEO (manager_id IS NULL)
-- Recursive case joins on: employee_id = previous manager_id
-- Result: reporting chain from CEO down to each employee
For manufacturing bill-of-materials:
-- Base case finds finished product (not used in any assembly)
-- Recursive case joins on: component_id = previous assembly_id
-- Result: complete parts breakdown
Same pattern, different column names.
Modify the query to group by origin and destination cities to find which routes are fastest on average. This helps identify consistently efficient paths versus problematic ones.
Add a check to identify packages that visit the same facility more than once (potential routing loops). Hint: you can track visited facilities in an array during recursion and check for duplicates.
Switch to SQL to try these challenges.
When I first built a recursive query for network flow analysis, I thought it was specialized networking SQL. Then I needed it for manufacturing data. Then for org charts. Then for file system analysis. Then someone asked about package tracking.
The realization: most connected data is the same problem. Trees, chains, hierarchies, paths—they all use recursive CTEs. Master this pattern once, recognize it everywhere.
The SQL doesn't care what you're connecting. It just follows links until there are no more links to follow.
SQL Output