
PostgreSQL Index Basics: What They Do and Why You Need Them

Ever wonder why your queries are slow? You write a simple SELECT ... WHERE customer_id = 42 and it takes seconds instead of milliseconds. The table isn't even that big.

The answer, almost every time, is a missing index. Without one, PostgreSQL reads every row in the table to find the few you need. With one, it jumps directly to the matching rows.

Indexes are the single biggest performance lever you have in a relational database. Let's understand exactly what they do and how to verify they're working with EXPLAIN ANALYZE.

The Setup

We'll use a small but realistic pair of tables, customers and orders. The schema is compact, but the concepts apply to tables of any size.

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
  customer_id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  email TEXT NOT NULL,
  city TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE orders (
  order_id SERIAL PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(customer_id),
  order_date DATE NOT NULL,
  status TEXT NOT NULL,
  total NUMERIC(10,2) NOT NULL
);

Now let's load enough data to make indexes matter. Small tables don't show the difference because PostgreSQL will just scan the whole thing regardless.

INSERT INTO customers (name, email, city)
SELECT
  'Customer ' || i,
  'customer' || i || '@example.com',
  (ARRAY['London','Paris','Berlin','Madrid','Rome'])[1 + (i % 5)]
FROM generate_series(1, 1000) AS i;

INSERT INTO orders (customer_id, order_date, status, total)
SELECT
  (random() * 999 + 1)::INTEGER,
  DATE '2020-01-01' + (random() * 1500)::INTEGER,
  (ARRAY['pending','shipped','delivered','cancelled'])[1 + (i % 4)],
  (random() * 500 + 10)::NUMERIC(10,2)
FROM generate_series(1, 50000) AS i;

That gives us 1,000 customers and 50,000 orders. Enough to see real differences in query plans.
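Optionally, refresh the planner's statistics right away rather than waiting for autovacuum to notice the bulk load. It makes the row estimates in the plans below reproducible:

-- Update planner statistics for both freshly loaded tables
ANALYZE customers;
ANALYZE orders;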

What Is an Index?

An index is a separate data structure that PostgreSQL maintains alongside your table. Think of it like the index in a textbook — instead of reading every page to find "B-tree", you flip to the back, find the entry, and jump directly to page 247.

Without an index, PostgreSQL does a Sequential Scan (Seq Scan) — it reads every row in the table and checks each one against your WHERE clause. With an index, PostgreSQL has two main strategies:

  • Index Scan: Looks up entries in the index and fetches each matching row directly. Best when very few rows match.
  • Bitmap Index Scan: Scans the index to build a map of which pages contain matching rows, then fetches those pages in physical order. Best when multiple rows match and are scattered across many pages.

Both are dramatically faster than a Seq Scan. PostgreSQL picks the strategy automatically based on how many rows it expects to find.

The trade-off: indexes speed up reads but slow down writes (because the index must be updated on every INSERT, UPDATE, or DELETE). They also use disk space.

Query 1: Finding Orders Without an Index

Let's find all orders for a specific customer:

EXPLAIN ANALYZE
SELECT order_id, order_date, status, total
FROM orders
WHERE customer_id = 42;

Output:

Seq Scan on orders  (cost=0.00..1135.00 rows=50 width=27)
                    (actual time=0.031..4.257 rows=46 loops=1)
  Filter: (customer_id = 42)
  Rows Removed by Filter: 49954
Planning Time: 0.085 ms
Execution Time: 4.312 ms

What this tells us:

  • Seq Scan on orders: PostgreSQL is reading every row in the table
  • Rows Removed by Filter: 49,954: It checked 50,000 rows to find 46 matches
  • Execution Time: 4.312 ms: Not terrible for 50K rows, but this grows linearly with table size

That Rows Removed by Filter number is the red flag. PostgreSQL did the work of reading 50,000 rows to return 46. That's a 99.9% waste ratio.

Query 2: Adding an Index

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Now run the same query:

EXPLAIN ANALYZE
SELECT order_id, order_date, status, total
FROM orders
WHERE customer_id = 42;

Output:

Bitmap Heap Scan on orders  (cost=6.23..357.44 rows=50 width=27)
                            (actual time=0.049..0.156 rows=46 loops=1)
  Recheck Cond: (customer_id = 42)
  Heap Blocks: exact=43
  ->  Bitmap Index Scan on idx_orders_customer_id
                            (cost=0.00..6.17 rows=50 width=0)
        Index Cond: (customer_id = 42)
Planning Time: 0.290 ms
Execution Time: 0.218 ms

What changed:

  • Bitmap Index Scan replaced Seq Scan — PostgreSQL used the index
  • Index Cond replaced Filter — the index found matching rows directly
  • No "Rows Removed by Filter" — it only touched the 46 rows it needed
  • Execution Time: 0.218 ms vs 4.312 ms — that's 20x faster

The plan has two steps: first, the Bitmap Index Scan reads the index and builds a bitmap of which pages contain matching rows (43 pages, shown in Heap Blocks: exact=43). Then the Bitmap Heap Scan fetches those pages in physical order. This is faster than bouncing randomly across the table for each row.

PostgreSQL chose a Bitmap scan here because the 46 matching rows are scattered across 43 different pages. If only 1-2 rows matched, you'd see a plain Index Scan instead. The planner picks the best strategy automatically.
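If you're curious what the plain Index Scan variant looks like on this same data, you can temporarily disable bitmap scans for your session. enable_bitmapscan is a standard planner setting, but treat this as an experiment, not something to do in production:

SET enable_bitmapscan = off;

EXPLAIN ANALYZE
SELECT order_id, order_date, status, total
FROM orders
WHERE customer_id = 42;
-- typically shows: Index Scan using idx_orders_customer_id on orders

RESET enable_bitmapscan;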

Reading EXPLAIN Output

Let's break down what EXPLAIN tells you.

EXPLAIN vs EXPLAIN ANALYZE:

-- Shows the plan without running the query
EXPLAIN SELECT ...;

-- Runs the query and shows actual timing
EXPLAIN ANALYZE SELECT ...;

Use EXPLAIN when you just want to see the plan. Use EXPLAIN ANALYZE when you need real timing numbers. Be careful with EXPLAIN ANALYZE on INSERT/UPDATE/DELETE — it actually executes the statement.
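When you do need real timings for a write, one safe pattern is to wrap it in a transaction and roll back, so the changes never persist (side effects such as sequence increments are not undone, though):

BEGIN;

EXPLAIN ANALYZE
UPDATE orders SET status = 'shipped' WHERE customer_id = 42;

ROLLBACK;  -- the update is measured but never committed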

Key fields to watch:

Field                  | What It Means
-----------------------|-------------------------------------------------
Seq Scan               | Reading every row — no useful index
Index Scan             | Using an index to find rows directly
Index Only Scan        | Answering the query entirely from the index
Bitmap Index Scan      | Using the index to build a map of matching pages
Filter                 | Checking rows after retrieval (post-filter)
Index Cond             | Condition pushed into the index (efficient)
Rows Removed by Filter | Rows read but not returned (waste)
actual time            | Wall clock time in milliseconds
rows                   | Number of rows processed

The goal is to see Index Cond instead of Filter, and any kind of index scan instead of Seq Scan.

Query 3: Index Only Scan

If the index contains all the columns your query needs, PostgreSQL can skip the table entirely:

EXPLAIN ANALYZE
SELECT customer_id
FROM orders
WHERE customer_id = 42;

Output:

Index Only Scan using idx_orders_customer_id on orders
                    (cost=0.42..53.88 rows=50 width=4)
                    (actual time=0.021..0.048 rows=46 loops=1)
  Index Cond: (customer_id = 42)
  Heap Fetches: 0
Planning Time: 0.078 ms
Execution Time: 0.067 ms

Index Only Scan — PostgreSQL answered the entire query from the index without touching the table at all. Notice Heap Fetches: 0 — zero trips to the actual table data.

This is the fastest possible scan type. It happens when:

  1. The query only selects columns that exist in the index
  2. The table's visibility map is up to date (run VACUUM if Heap Fetches is high)
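You can also design for Index Only Scans deliberately. Since PostgreSQL 11, the INCLUDE clause stores extra non-key columns in the index's leaf pages (the index name below is just an example):

-- customer_id is the search key; total is carried as payload
CREATE INDEX idx_orders_customer_total
  ON orders(customer_id) INCLUDE (total);

EXPLAIN ANALYZE
SELECT customer_id, total
FROM orders
WHERE customer_id = 42;
-- with an up-to-date visibility map, this can be an Index Only Scan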

Query 4: When PostgreSQL Ignores Your Index

Indexes aren't always used. PostgreSQL's query planner decides based on cost estimates.

High selectivity — index used:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- Bitmap Index Scan (finding ~50 rows out of 50,000)

Low selectivity — index ignored:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id < 500;
-- Seq Scan (returning ~25,000 rows out of 50,000)

Output for the low-selectivity query:

Seq Scan on orders  (cost=0.00..1285.00 rows=24975 width=27)
                    (actual time=0.013..3.471 rows=25013 loops=1)
  Filter: (customer_id < 500)
  Rows Removed by Filter: 24987
Planning Time: 0.127 ms
Execution Time: 4.384 ms

PostgreSQL chose a Seq Scan even though the index exists. Why? Because when you're returning half the table, it's faster to just read the whole thing sequentially than to bounce between the index and the table for each row. The planner is smart about this.

Rule of thumb: Indexes help most when your query returns a small fraction of the table (typically under 10-15%).
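You can see the cost reasoning directly. Disabling sequential scans for the session (again, an experiment-only setting) forces the index-based plan, and its estimated cost comes out higher than the Seq Scan it replaced:

EXPLAIN SELECT * FROM orders WHERE customer_id < 500;
-- Seq Scan, with some estimated cost

SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id < 500;
-- index-based plan, with a higher estimated cost

RESET enable_seqscan;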

Query 5: Functions and Expressions Break Indexes

A common trap — wrapping an indexed column in a function:

-- This WON'T use an index on order_date
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2023;

Output:

Seq Scan on orders  (cost=0.00..1460.00 rows=250 width=27)
                    (actual time=0.023..5.258 rows=8870 loops=1)
  Filter: (EXTRACT(year FROM order_date) = '2023'::numeric)
  Rows Removed by Filter: 41130

Seq Scan. Even an index on order_date (we'll create one next) couldn't help here, because the query filters on EXTRACT(YEAR FROM order_date), a function applied to the column. A plain index stores dates, not extracted years, so the planner can't use it for this predicate.

Fix — rewrite to use the column directly:

CREATE INDEX idx_orders_date ON orders(order_date);

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01';

Output:

Bitmap Heap Scan on orders  (cost=168.42..952.33 rows=8870 width=27)
                            (actual time=0.812..3.472 rows=8870 loops=1)
  Recheck Cond: ((order_date >= '2023-01-01') AND (order_date < '2024-01-01'))
  Heap Blocks: exact=259
  ->  Bitmap Index Scan on idx_orders_date
                            (cost=0.00..166.20 rows=8870 width=0)
        Index Cond: ((order_date >= '2023-01-01') AND (order_date < '2024-01-01'))
Planning Time: 0.112 ms
Execution Time: 4.091 ms

Same result, but now PostgreSQL uses the index. The trick is expressing your condition as a range on the raw column instead of applying a function.

Alternative — expression index:

CREATE INDEX idx_orders_year ON orders((EXTRACT(YEAR FROM order_date)));

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2023;

Now it works because the index matches the expression exactly. Use expression indexes when you can't rewrite the query.
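One caveat: PostgreSQL keeps statistics about the indexed expression itself, but it only gathers them when ANALYZE runs. After creating an expression index, refresh statistics so the planner's row estimates for the expression are accurate:

ANALYZE orders;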

Query 6: Checking What Indexes Exist

Before creating indexes, check what's already there:

SELECT
  indexname,
  indexdef
FROM pg_indexes
WHERE tablename = 'orders'
ORDER BY indexname;

Output:

indexname                | indexdef
-------------------------|--------------------------------------------
idx_orders_customer_id   | CREATE INDEX idx_orders_customer_id ON ...
idx_orders_date          | CREATE INDEX idx_orders_date ON ...
orders_pkey              | CREATE UNIQUE INDEX orders_pkey ON ...

orders_pkey is the primary key index — PostgreSQL creates this automatically.

Check index size:

SELECT
  indexname,
  pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'orders'
ORDER BY pg_relation_size(indexname::regclass) DESC;

Output:

indexname                | size
-------------------------|--------
orders_pkey              | 1112 kB
idx_orders_customer_id   | 1112 kB
idx_orders_date          | 1112 kB

Each index takes disk space. For a 50,000-row table, that's roughly 1 MB per index. Scale that up — on a table with 100 million rows, each index could be gigabytes. Don't create indexes you don't need.
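To put that overhead in context, pg_table_size and pg_indexes_size split a relation's footprint between the table proper and all of its indexes:

SELECT
  pg_size_pretty(pg_table_size('orders'))          AS table_size,
  pg_size_pretty(pg_indexes_size('orders'))        AS all_indexes,
  pg_size_pretty(pg_total_relation_size('orders')) AS total;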

Breaking Down How Indexes Work

PostgreSQL's default index type is a B-tree (balanced tree). Conceptually:

                    [500]
                   /     \
             [200]         [800]
            /     \       /     \
      [100] [300]   [600] [900]
       / \   / \     / \   / \
     ... ... ... ... ... ... ...
        (leaf nodes point to table rows)

When you search for customer_id = 42:

  1. Start at the root: 42 < 500, go left
  2. Next level: 42 < 200, go left
  3. Next level: find the leaf containing 42
  4. Follow the pointer to the actual table row

This takes O(log n) steps. For 50,000 rows, that's about 16 comparisons instead of 50,000. For 100 million rows, it's about 27 comparisons.
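You can check that arithmetic in SQL itself: PostgreSQL's two-argument log(b, x) computes a base-b logarithm.

SELECT
  ceil(log(2, 50000))     AS steps_50k,   -- about 16
  ceil(log(2, 100000000)) AS steps_100m;  -- about 27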

Challenge 1: Find the Slow Query

Goal: Write a query on the orders table that does a Seq Scan, then add an index to make it use an Index Scan. Verify with EXPLAIN ANALYZE.

Setup: Drop extra indexes to start clean:

DROP INDEX IF EXISTS idx_orders_customer_id;
DROP INDEX IF EXISTS idx_orders_date;

Hints:

  1. Filter on the total column with a very selective predicate (for example, total > 505 matches roughly 1% of rows); there's no index on total yet. The status column is tempting, but each status matches about 25% of the table, so the planner would keep the Seq Scan even after you added an index.
  2. Run EXPLAIN ANALYZE to confirm the Seq Scan
  3. Create the appropriate index
  4. Run EXPLAIN ANALYZE again and compare

Pseudocode:

-- Step 1: Query that does a Seq Scan
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total > 505;

-- Step 2: Add the index
CREATE INDEX ... ON orders(...);

-- Step 3: Same query now uses the index
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total > 505;

-- Compare the execution times

Challenge 2: Expression Index

Goal: Find all orders placed on a Monday. First try it without an index, then create an expression index that makes it fast.

Expected output from EXPLAIN:

-- Before: Seq Scan with Filter
-- After:  Index Scan (or Bitmap Index Scan) with Index Cond

Hints:

  1. EXTRACT(DOW FROM order_date) returns 1 for Monday in PostgreSQL
  2. This is a function on a column — a regular index on order_date won't help
  3. Create an expression index that indexes EXTRACT(DOW FROM order_date)
  4. Run EXPLAIN ANALYZE before and after

Pseudocode:

-- The expression index
CREATE INDEX idx_orders_dow ON orders((EXTRACT(DOW FROM order_date)));

-- Now this uses the index
EXPLAIN ANALYZE
SELECT * FROM orders WHERE EXTRACT(DOW FROM order_date) = 1;

Challenge 3: Index Only Scan

Goal: Write a query that produces an Index Only Scan (no heap fetches). Then modify the query to add a column that forces it back to a regular Index Scan.

Hints:

  1. Create an index on (status)
  2. Write a query that only selects status with a WHERE on status
  3. Run VACUUM orders first to update the visibility map
  4. Check for Index Only Scan and Heap Fetches: 0 in the output
  5. Now add total to the SELECT list; with that many matching rows, the plan falls back to one that must touch the heap (likely a Bitmap Heap Scan or even a Seq Scan)

Real-World Applications

This same pattern — "query is slow, add an index, verify with EXPLAIN" — is the first thing you do in any performance investigation:

  1. E-commerce: Index orders.customer_id for "show my orders" pages
  2. Authentication: Index users.email for login lookups
  3. Logging: Index logs.created_at for time-range queries
  4. Search: Index products.category for filtered browsing
  5. Reporting: Index columns used in WHERE, JOIN, and ORDER BY clauses
  6. APIs: Any endpoint that filters by a parameter needs an index on that column

When NOT to Index

Indexes aren't free. Skip them when:

  • Small tables (under ~10,000 rows): Seq Scan is fast enough
  • Write-heavy tables with rare reads: Every INSERT/UPDATE/DELETE updates the index
  • Low-cardinality columns queried alone: A boolean is_active column with 50/50 split won't benefit much
  • Columns you never filter or join on: If it's only in SELECT, an index won't help
  • Too many indexes: Each one slows down writes and uses disk space. Audit regularly
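Two tools help with those last bullets. A partial index covers only the slice you actually query, which can rescue a low-cardinality column when one value is rare and hot; and pg_stat_user_indexes reveals indexes that are never read (the index name below is illustrative):

-- Only 'pending' rows are indexed, so the index stays small and is
-- considered only for queries whose WHERE clause matches
CREATE INDEX idx_orders_pending
  ON orders(order_date)
  WHERE status = 'pending';

-- Audit usage: indexes with idx_scan = 0 are candidates for dropping
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan;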

The Checklist

When a query is slow:

  1. Run EXPLAIN ANALYZE on it
  2. Look for Seq Scan on large tables
  3. Check the Rows Removed by Filter — high numbers mean wasted work
  4. Identify the column(s) in the WHERE clause
  5. Check if an index exists with pg_indexes
  6. Create the index
  7. Run EXPLAIN ANALYZE again
  8. Confirm an index scan (Index Scan, Bitmap Index Scan, or Index Only Scan) replaced the Seq Scan
  9. Check the execution time improvement

That's the workflow. Learn it once, use it every day.


Tags: #SQL #PostgreSQL #Indexes #Performance #EXPLAIN #QueryPlanning

Difficulty: Beginner
Topics: Indexes, EXPLAIN ANALYZE, Seq Scan, Index Scan, B-tree, Query Planning

Learn how indexes work in PostgreSQL, why missing indexes cause slow queries, and how to use EXPLAIN ANALYZE to verify index usage. Covers Seq Scan vs Index Scan vs Bitmap Index Scan with practical examples.
jrz1977
PostgreSQL
Created Feb 7, 2026