Architecture & Core Concepts
Process Model
Multi-process architecture. A central listener (postmaster/`postgres`) forks a new backend process for each client connection.
Details
- Master Process (`postgres`): Listens for connections, manages shared memory, starts background utility processes (e.g., checkpointer, WAL writer, autovacuum launcher).
- Backend Process (`postgres`): One per connection. Handles parsing, planning, execution, communication with the client. Isolates crashes to a single connection.
- Pros: Robustness, stability (crash in one backend doesn't take down DB).
- Cons/Quirk: Higher connection overhead (memory/CPU per process). Requires external connection poolers (e.g., PgBouncer, Pgpool-II) for applications with many short-lived connections (like typical web apps) to avoid resource exhaustion.
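A quick way to observe the process model on a running cluster: `pg_stat_activity` shows one row per client backend plus the background workers (the `backend_type` column exists since v10).
SELECT pid, backend_type, state, query
FROM pg_stat_activity
ORDER BY backend_type;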
Memory Areas
Key memory settings include shared_buffers (main cache), work_mem (per-operation sort/hash), and maintenance_work_mem (VACUUM, indexes).
Shared Memory (Server-wide)
- `shared_buffers`: PostgreSQL's primary data cache. Stores frequently accessed table/index blocks. Typically set to ~25% of system RAM (up to a point, depends on workload/OS). [docs]
- WAL Buffers: Buffers Write-Ahead Log records before writing to disk. Usually auto-tuned (`-1`), small relative to `shared_buffers`. [docs]
Per-Backend/Operation Memory
- `work_mem`: Memory used by *each* sort operation (ORDER BY, DISTINCT), hash join, hash aggregation. Multiple operations within one query can each use `work_mem`. Set carefully to avoid OOM errors; too low causes disk spills. [docs]
- `maintenance_work_mem`: Memory used for maintenance tasks like `VACUUM`, `CREATE INDEX`, `ALTER TABLE ADD FOREIGN KEY`. Can be set much higher than `work_mem` as fewer such operations run concurrently. [docs]
- `temp_buffers`: Caches temporary tables. [docs]
Other
- Commit Log (CLOG/pg_xact) Buffers: Caches transaction status.
- OS Cache: PostgreSQL relies heavily on the operating system's file cache. `effective_cache_size` tells the planner how much memory is likely available in OS + PG cache.
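A minimal sketch for inspecting the memory settings above on a live server (names as in `postgresql.conf`):
SHOW shared_buffers;
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size');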
Storage
Data stored in files based on OIDs. Tablespaces map logical names to disk locations. Large values use TOAST.
File Layout
- Data directory (`PGDATA`) contains global config, WAL (`pg_wal`), subtransaction status (`pg_subtrans`), transaction status (`pg_xact`), per-database directories, etc.
- Each database has its own directory (named by its OID) within `PGDATA/base/`.
- Each table and index (relation) is stored in one or more files (named by its filenode OID), typically segmented into 1GB files (`relname.1`, `relname.2`...).
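To see the mapping for yourself (assuming an existing table, here the `myschema.mytable` used later in this section):
-- Database OID = directory name under PGDATA/base/
SELECT oid, datname FROM pg_database;
-- Relative path of a relation's main file, e.g. base/<db_oid>/<filenode>
SELECT pg_relation_filepath('myschema.mytable');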
Tablespaces
- Allow defining locations on the filesystem where database objects (tables, indexes, databases) can be stored. [docs]
- Useful for separating objects onto different disk types (e.g., fast SSDs for indexes, large HDDs for tables) or managing storage quotas.
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd_storage';
CREATE TABLE ... TABLESPACE fastspace;
TOAST (The Oversized Attribute Storage Technique)
- Handles storage of large column values that don't fit within a standard data block (~8KB). [docs]
- Automatically splits large values into chunks stored in a separate TOAST table associated with the main table.
- Can apply compression (default `pglz`, or `lz4` since v14) to TOASTed values.
- Transparent to the user but affects performance for very large fields. Contributes to bloat.
- Control via `ALTER TABLE ... SET STORAGE {PLAIN|EXTERNAL|EXTENDED|MAIN}`.
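A sketch of the per-column controls, assuming a hypothetical `documents` table with a large `body` column (`SET COMPRESSION` requires v14+):
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;   -- out-of-line storage, no compression
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;    -- per-column TOAST compression (v14+)
-- Locate the TOAST table backing the relation
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'documents';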
Write-Ahead Logging (WAL)
Ensures durability by logging changes *before* they are written to data files. Essential for recovery, replication, and PITR. [docs]
Mechanism
- All changes (INSERT, UPDATE, DELETE, DDL) are first written as log records to WAL buffers, then flushed to WAL segment files on disk (`pg_wal` directory).
- Only *after* the WAL record is durably stored on disk is the change allowed to be written to the actual table/index files (heap/index pages) in shared buffers, which are flushed later by the checkpointer or background writer.
Purpose
- Crash Recovery: On startup after a crash, Postgres replays WAL records since the last checkpoint to restore the database to a consistent state.
- Durability Guarantee: A transaction commit is confirmed only after its WAL records are flushed to disk (controlled by `synchronous_commit`).
- Replication: Standby servers continuously stream and apply WAL records from the primary (Streaming Rep) or decode WAL for logical changes (Logical Rep).
- Point-in-Time Recovery (PITR): Requires continuous archiving of WAL segment files. [docs]
Key Concepts
- WAL Segments: Files (default 16MB) containing WAL records. Sequentially numbered. Located in `pg_wal`.
- Checkpoints: Points in the WAL stream where all data file changes prior to the checkpoint are guaranteed to have been flushed to disk. Limits recovery time. Triggered by time (`checkpoint_timeout`), amount of WAL generated (`max_wal_size`), or manually (`CHECKPOINT` command). Managed by the Checkpointer process. `checkpoint_completion_target` spreads I/O.
- WAL Archiving: Process of copying completed WAL segment files to a safe location (`archive_command`). Essential for PITR and some replication setups. Requires `archive_mode = on`.
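A few functions for poking at WAL on a live server (v10+ names; `CHECKPOINT` needs superuser or, since v15, the `pg_checkpoint` role):
SELECT pg_current_wal_lsn();                     -- current WAL write position
SELECT pg_walfile_name(pg_current_wal_lsn());    -- segment file in pg_wal containing that LSN
CHECKPOINT;                                      -- force an immediate checkpoint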
MVCC
Multi-Version Concurrency Control. Updates create new row versions instead of overwriting. Readers don't block writers. Leads to bloat. [docs]
How it Works
- Each row version has system columns: `xmin` (inserting transaction ID), `xmax` (deleting/locking transaction ID), `cmin`/`cmax` (command ID within tx), `ctid` (physical location).
- When a row is updated or deleted, Postgres marks the old version as "expired" by setting `xmax` and inserts a new version (for UPDATE) with its own `xmin`.
- Each transaction gets a snapshot of the database (list of visible transaction IDs) when it starts (in `READ COMMITTED` for each statement, `REPEATABLE READ`/`SERIALIZABLE` for the transaction). [visibility]
- A transaction can only "see" row versions where `xmin` is committed and visible to its snapshot, AND (`xmax` is not set OR `xmax` is aborted OR `xmax` is not visible to its snapshot).
- Expired row versions (dead tuples) physically remain until cleaned up by `VACUUM`.
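The system columns can be selected explicitly; a minimal sketch, assuming a hypothetical `accounts` table:
SELECT xmin, xmax, ctid, * FROM accounts WHERE id = 1;
-- After another (still open) transaction UPDATEs the row, the old version's xmax
-- is set to that transaction's ID and a new version appears with a fresh xmin/ctid.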
Benefits
- High concurrency: Readers don't block writers, and writers don't block readers (for data access).
- Read consistency: Queries see a consistent snapshot of the data as defined by the isolation level.
Consequences (Quirks)
- Bloat: Dead tuples accumulate, consuming disk space and potentially slowing down scans. Requires regular `VACUUM`.
- UPDATEs are expensive: An UPDATE is internally like a DELETE + INSERT, requiring new tuple creation and index entries for all indexes on the table. (HOT updates can mitigate index churn if no indexed columns are changed and space exists on the same page).
- Transaction ID Wraparound: Requires freezing via `VACUUM` (see Quirks section).
System Catalogs
Internal tables in the `pg_catalog` schema storing metadata about database objects. Query them for introspection. [docs]
Purpose
Hold the schema definitions, configuration, and status information for the database itself. Everything `psql`'s `\d` commands show comes from here.
Key Catalogs (Examples)
- `pg_class`: Information about tables, indexes, sequences, views (relations). Columns: `relname`, `relnamespace` (schema OID), `relkind` (type), `reltuples` (approx rows), `relpages` (approx pages).
- `pg_attribute`: Information about table columns. Columns: `attrelid` (table OID), `attname`, `atttypid` (type OID), `attnum` (column number), `attnotnull`.
- `pg_namespace`: Information about schemas. Columns: `nspname`, `nspowner`.
- `pg_index`: Information about indexes. Columns: `indexrelid` (index OID), `indrelid` (table OID), `indkey` (indexed column numbers).
- `pg_proc`: Information about functions and procedures.
- `pg_type`: Information about data types.
- `pg_settings`: Current server configuration parameters.
- `pg_locks`: Information about current locks.
- `pg_roles`: Information about roles.
Querying
Join catalogs using OID (Object Identifier) columns. Often complex, consider using higher-level Information Schema views (`information_schema.tables`, `information_schema.columns`) for standard metadata, though catalogs offer more detail.
-- Find columns of a table
SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS type
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'myschema.mytable'::regclass -- Use regclass cast
AND a.attnum > 0
AND NOT a.attisdropped
ORDER BY a.attnum;
Data Types (Postgres Strengths)
JSON / JSONB
JSONB stores decomposed binary JSON. Supports GIN indexing for fast containment/existence lookups with operators like `@>`, `?`, `?|`, `?&` (the `->`/`->>` extraction operators are not GIN-accelerated unless you use an expression index).
JSON vs JSONB
- `JSON`: Stores an exact textual copy. Preserves whitespace, key order, duplicate keys. Slower processing.
- `JSONB`: Stores decomposed binary format. More efficient storage (usually), much faster processing. Removes duplicate keys, doesn't preserve whitespace/key order. **Generally preferred.** [docs]
Key Operators
- `->`: Get JSON object field/array element (returns `json`/`jsonb`).
- `->>`: Get JSON object field/array element as `text`.
- `#>`: Get JSON object at specified path (returns `json`/`jsonb`).
- `#>>`: Get JSON object at specified path as `text`.
- `@>`: Contains (left contains right). `jsonb '{"a":1, "b":2}' @> '{"a":1}'` -> true.
- `<@`: Is contained by (left is contained by right).
- `?`: Key exists (string operand). `'{"a":1}'::jsonb ? 'a'` -> true.
- `?|`: Any key in array exists.
- `?&`: All keys in array exist.
- [Full List in Docs]
Indexing (GIN)
- `CREATE INDEX idx_gin ON mytable USING GIN (jsonb_column);` (Indexes all keys/values). [docs]
- `CREATE INDEX idx_gin_ops ON mytable USING GIN (jsonb_column jsonb_ops);` (Default opclass [docs]).
- `CREATE INDEX idx_gin_path_ops ON mytable USING GIN (jsonb_column jsonb_path_ops);` (Optimized for `@>` operator [docs]).
- Supports operators like `@>`, `?`, `?|`, `?&`. Crucial for performance!
Use Cases
Storing semi-structured data, document storage, flexible schemas, API payloads.
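A minimal end-to-end sketch, using a hypothetical `api_events` table:
CREATE TABLE api_events (id bigserial PRIMARY KEY, payload jsonb);
CREATE INDEX idx_api_events_payload ON api_events USING GIN (payload);
INSERT INTO api_events (payload) VALUES ('{"type": "login", "user": {"id": 42, "name": "alice"}}');
SELECT * FROM api_events WHERE payload @> '{"type": "login"}';   -- containment, can use the GIN index
SELECT payload #>> '{user,name}' AS user_name FROM api_events;   -- extract a nested value as text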
Arrays
Syntax
- Declaration: `my_ints integer[]`, `my_tags text[]`.
- Literals: `ARRAY[1, 2, 3]`, `'{apple, banana, cherry}'::text[]`.
- Access: `my_array[1]` (1-based indexing!). `my_array[1:2]` (slice). [docs]
Key Operators
- `=`: Array equality (element-by-element, order matters).
- `@>`: Contains (left contains right). `ARRAY[1,2,3] @> ARRAY[1,2]` -> true.
- `<@`: Is contained by.
- `&&`: Overlaps (have elements in common). `ARRAY[1,2,3] && ARRAY[3,4,5]` -> true.
- `ANY(array)` / `SOME(array)`: Check against array elements, e.g. `col = ANY(my_array)`.
- `ALL(array)`: Check all elements satisfy condition.
- [Full List in Docs]
Indexing (GIN)
- `CREATE INDEX idx_array_gin ON mytable USING GIN (array_column);` [docs]
- Supports `@>`, `<@`, `&&`, `=` (for some types [opclasses]). Essential for searching within arrays efficiently.
Functions
- `array_append()`, `array_prepend()`, `array_cat()`, `array_dims()`, `cardinality()` (size), `unnest()` (expand array to rows).
Use Cases
Storing tags, simple lists of related IDs, multi-value attributes.
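A short sketch of tag storage and search, using a hypothetical `articles` table (the same shape as the GIN example later in this section):
CREATE TABLE articles (id serial PRIMARY KEY, title text, tags text[]);
CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
INSERT INTO articles (title, tags) VALUES ('Intro to MVCC', ARRAY['postgres', 'internals']);
SELECT * FROM articles WHERE tags @> ARRAY['postgres'];           -- contains, uses the GIN index
SELECT * FROM articles WHERE tags && ARRAY['sql', 'internals'];   -- overlaps
SELECT id, unnest(tags) AS tag FROM articles;                     -- expand to one row per tag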
Range Types
Represent ranges (`int4range`, `tsrange`, etc.). Support GiST indexing and exclusion constraints to prevent overlaps.
Built-in Types
- `int4range`, `int8range`, `numrange`, `tsrange` (timestamp without tz), `tstzrange` (timestamp with tz), `daterange`. [docs]
- Can create custom range types. [docs]
Syntax
- Literals: `[)`, `(]`, `()`, `[]` denote inclusive/exclusive bounds.
- Examples: `'[2024-01-01, 2024-01-31)'::daterange`, `'[10, 20]'::int4range`. Empty ranges are allowed. Unbounded ranges simply omit a bound (e.g., `'[2024-01-01,)'`). [docs]
Key Operators
- `@>`: Contains element or range. `daterange('2024-01-01', '2024-02-01') @> '2024-01-15'::date` -> true.
- `<@`: Is contained by.
- `&&`: Overlaps.
- `<<`: Strictly left of.
- `>>`: Strictly right of.
- `&<`: Does not extend to the right of (left's upper bound does not exceed right's).
- `&>`: Does not extend to the left of (left's lower bound is not below right's).
- `-|-` (adjacent): `tsrange('2024-01-01 10:00', '2024-01-01 11:00') -|- tsrange('2024-01-01 11:00', '2024-01-01 12:00')` -> true.
- [Full List in Docs]
Indexing (GiST)
- `CREATE INDEX idx_range_gist ON mytable USING GIST (range_column);` [docs]
- Supports range operators like `@>`, `&&`, `<@`, etc. [opclasses]
Exclusion Constraints
- Prevent overlapping ranges within a table, often used for scheduling or booking systems. [docs]
- Requires GiST index support for the overlap operator (`&&`); combining a scalar equality (e.g., `room_id WITH =`) with a range overlap in one constraint requires the `btree_gist` extension.
ALTER TABLE events ADD CONSTRAINT no_overlapping_times EXCLUDE USING GIST (room_id WITH =, event_times WITH &&);
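An end-to-end sketch of the same pattern (the inline constraint is equivalent to the `ALTER TABLE` above; `btree_gist` supplies the GiST support for `room_id WITH =`):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE events (
    room_id int NOT NULL,
    event_times tstzrange NOT NULL,
    EXCLUDE USING GIST (room_id WITH =, event_times WITH &&)
);
INSERT INTO events VALUES (1, '[2024-06-01 10:00, 2024-06-01 11:00)');
INSERT INTO events VALUES (1, '[2024-06-01 10:30, 2024-06-01 11:30)');  -- fails: exclusion constraint violation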
Enum Types
User-defined enumerated types (`CREATE TYPE mood AS ENUM (...)`). Provides type safety and better semantics than check constraints. [docs]
Definition
CREATE TYPE primary_color AS ENUM ('red', 'green', 'blue');
CREATE TABLE items (id serial PRIMARY KEY, color primary_color);
Benefits
- Type Safety: Ensures only defined values can be stored.
- Readability: More descriptive than storing magic numbers or strings.
- Storage Efficiency: Stored efficiently internally (typically 4 bytes).
- Ordering: Enum values have an implicit sort order based on definition order.
Usage
- Insert values like strings: `INSERT INTO items (color) VALUES ('red');`
- Compare directly: `WHERE color = 'red'`, `WHERE color > 'green'`.
Modification
- Add values: `ALTER TYPE primary_color ADD VALUE 'yellow' AFTER 'blue';` (Cannot easily remove or reorder values). [docs]
- Renaming values is possible: `ALTER TYPE ... RENAME VALUE ... TO ...`.
Comparison to Check Constraints
Enums are generally preferred over `CHECK (color IN ('red', 'green', 'blue'))` because they are more type-safe, potentially faster, easier to manage centrally, and self-documenting.
Other Notable Types
Includes native UUID, Geometric, Network (`inet`), `hstore`, `tsvector`/`tsquery`, `DOMAIN`, and Composite Types.
- `UUID`: Stores Universally Unique Identifiers. Preferred over storing as `text` for efficiency and semantics. Generate using `gen_random_uuid()` (built into core since v13; earlier versions need the `pgcrypto` extension) or client-side. [docs]
- Geometric Types (`point`, `line`, `lseg`, `box`, `path`, `polygon`, `circle`): Basic 2D geometric types. Indexed using GiST. Foundation for the powerful PostGIS extension. [docs]
- Network Address Types (`cidr`, `inet`, `macaddr`, `macaddr8`): Store and query IP addresses/networks and MAC addresses. Supports subnet containment operators (`>>`, `<<`). Indexable (B-Tree/GiST). [docs]
- `Hstore` Extension (`CREATE EXTENSION hstore`): Simple key-value store within a column. Keys/values are strings. Indexed using GIN or GiST. Often superseded by JSONB now, but still useful for simpler cases or legacy systems. Operators: `->`, `?`, `@>`.
- Full-Text Search Types (`tsvector`, `tsquery`): Used for storing preprocessed documents (`tsvector`) and search queries (`tsquery`). Used with FTS functions (`to_tsvector`, `to_tsquery`, `ts_rank`) and operators (`@@`). Indexable with GIN or GiST. [docs]
- `DOMAIN` Types: Create user-defined types based on existing types but with added constraints (`CHECK`, `NOT NULL`). Useful for enforcing data rules centrally. E.g., `CREATE DOMAIN email_address AS text CHECK (value ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');`
- Composite Types (`CREATE TYPE name AS (...)`): User-defined types representing the structure of a row or record (a list of field names and their types). Can be used for function arguments, return values, or table columns. E.g., `CREATE TYPE address AS (street text, city text, zip text);`
- `TIMESTAMP WITH TIME ZONE` (`timestamptz`) vs `TIMESTAMP WITHOUT TIME ZONE` (`timestamp`):
- `timestamptz`: **Strongly recommended**. Stores UTC timestamp. Converts input to UTC based on session timezone, converts back to session timezone on output. Unambiguous point in time.
- `timestamp`: Stores literal date/time provided, ignoring timezone. Ambiguous. Avoid unless you have a very specific reason.
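A small demonstration of the difference (results shown as comments; the `timestamptz` output depends on the session `TimeZone`):
SET TIME ZONE 'UTC';
SELECT TIMESTAMPTZ '2024-06-01 12:00:00+02' AS tz_aware,   -- converted to UTC: 2024-06-01 10:00:00+00
       TIMESTAMP   '2024-06-01 12:00:00+02' AS tz_naive;   -- offset ignored:   2024-06-01 12:00:00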
Indexing Power
B-Tree Index
Default index type. Excellent for equality (`=`), range (`<`, `>`, `BETWEEN`), `IN`, `IS NULL`, and sorting. [docs]
Core Use Case
The workhorse index for most relational data. Stores indexed values in a sorted tree structure, allowing efficient lookups, range scans, and ordered retrieval.
Supported Operators
`=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IN`, `IS NULL`, `IS NOT NULL`. [details]
Sorting
Can satisfy `ORDER BY` clauses matching the index definition (including ASC/DESC, NULLS FIRST/LAST), potentially avoiding a separate sort step. [docs]
Examples
CREATE INDEX idx_users_email ON users (email);
(For `WHERE email = '...'`)
CREATE INDEX idx_orders_created ON orders (created_at DESC);
(For `WHERE created_at > '...' ORDER BY created_at DESC`)
Considerations
Effective for high-cardinality columns. Can become large. Index maintenance (bloat, see VACUUM and REINDEX) is a factor.
GIN Index
Generalized Inverted Index. Optimized for composite types like JSONB, Arrays, Hstore, Full-Text Search. Indexes elements *within* values. [docs]
Core Concept
Creates an index entry for each unique element (key, array item, lexeme) pointing to the rows (heap TIDs) containing that element. Ideal for "contains" or "exists" style queries. [intro]
Use Cases & Supported Operators
- Arrays: Supports `@>` (contains), `<@` (contained by), `&&` (overlaps), `=` (equality). [opclasses]
- JSONB: Supports `@>`, `<@`, `?` (key exists), `?|` (any key exists), `?&` (all keys exist). Use `jsonb_ops` (default) or `jsonb_path_ops` (optimized for `@>`).
- Hstore: Supports `@>`, `?`, `?|`, `?&`. [docs]
- Full-Text Search (`tsvector`): Supports `@@` (match) operator. [docs]
Performance
- Very fast lookups for containment/existence queries.
- Can be significantly slower to build/update than B-Tree or GiST, as inserting/updating one row might require updating many index entries if the value contains many unique elements. Tuning `gin_pending_list_limit` can help. [fast update]
- Index size can be large, but often smaller than GiST for the same data.
Example
CREATE INDEX idx_tags_gin ON articles USING GIN (tags);
(for `tags text[]`)
CREATE INDEX idx_metadata_gin ON products USING GIN (metadata jsonb_path_ops);
(for `metadata jsonb`)
GiST Index
Generalized Search Tree. Framework index for complex types: Geometric, Full-Text Search, Ranges. Handles overlap/containment, nearest-neighbor. [docs]
Core Concept
A height-balanced tree structure like B-Tree, but generalized to handle complex data types and query operators beyond simple comparison. Uses "lossy" indexing in some cases, meaning the index might return some false positives that need rechecking against the heap. [intro]
Use Cases & Supported Operators
- Geometric Types (PostGIS): Supports spatial operators like intersection (`&&`), containment (`@`, `~`), distance (`<->` for KNN). [opclasses]
- Range Types: Supports overlap (`&&`), containment (`@>`, `<@`), adjacency (`-|-`), etc. [opclasses]
- Full-Text Search (`tsvector`): Supports `@@` (match) operator. Often faster updates but slower searches than GIN for FTS. [docs]
- Other extensions (`hstore`, `pg_trgm` for fuzzy search): Can provide GiST support.
Performance
- Generally faster to build/update than GIN.
- Search performance can be slower than GIN for types GIN handles well (like JSONB/Array containment).
- Handles KNN (K-Nearest Neighbor) searches efficiently (e.g., find points closest to X). [KNN]
Example
CREATE INDEX idx_events_times_gist ON events USING GIST (event_times);
(for `event_times tstzrange`)
CREATE INDEX idx_locations_geom_gist ON locations USING GIST (geom);
(for PostGIS geometry column)
BRIN Index
Block Range Index. Very small index for huge tables where values correlate strongly with physical storage order (e.g., timestamps). [docs]
Core Concept
Stores summary information (min/max values, potentially other info depending on opclass) for ranges of physical table blocks (`pages_per_range` setting). During a query, checks the summary to see if blocks *might* contain relevant values, skipping blocks where values are guaranteed to be outside the query range. [intro]
Ideal Scenario
- Very large tables (many GBs/TBs).
- Column values have a strong linear correlation with their physical storage location. This often happens naturally for append-only tables with a timestamp or sequence column (e.g., log data, IoT measurements).
- Queries select relatively large ranges.
Benefits
- Extremely small index size compared to B-Tree.
- Very fast to build and maintain.
Limitations
- Only effective if the physical correlation exists. Randomly distributed data or frequently updated tables (causing rows to move physically) will not benefit.
- Less efficient than B-Tree for highly selective queries (finding single rows).
- Index results are "lossy" - it identifies candidate blocks, the query still needs to scan those blocks.
Example
CREATE INDEX idx_logs_ts_brin ON logs USING BRIN (log_timestamp) WITH (pages_per_range = 32);
Hash Index
Supports only equality (`=`) lookups. Less commonly used than B-Tree. Fully WAL-logged and usable since v10. [docs]
Core Concept
Stores a hash of the indexed value, mapping hashes to heap TIDs. Designed for fast equality checks.
Supported Operators
Only `=`. [details]
Historical Context & Quirks
- Before PostgreSQL v10, Hash indexes were not WAL-logged, meaning they had to be rebuilt after a crash and couldn't be replicated. This severely limited their use.
- Since v10, they are fully recoverable and replicable.
Use Cases (Limited)
- Potentially smaller and slightly faster than B-Tree *only* for equality lookups on certain data patterns (e.g., long strings where only equality is needed).
- In most cases, B-Tree performs comparably or better even for equality and offers much more flexibility (range queries, sorting).
- B-Tree is usually the better default choice even if only equality is needed. Use Hash only if benchmarks show a significant benefit for a specific `=`-only workload.
Example
CREATE INDEX idx_data_hash ON large_objects USING HASH (data_blob_hash);
Special Index Features
Includes Partial, Expression, Covering (INCLUDE), and Concurrent indexes for optimization and maintenance.
- Partial Indexes: Index only a subset of rows defined by a `WHERE` clause.
- Reduces index size and maintenance overhead if queries frequently filter on the same condition.
- Example: `CREATE INDEX idx_orders_pending ON orders (order_id) WHERE status = 'pending';` (Only useful for queries also containing `WHERE status = 'pending'`).
- Expression Indexes (Functional Indexes): Index the result of a function or expression applied to one or more columns.
- Useful for queries filtering/sorting on functions (e.g., `lower()`, date extraction). The query must use the *exact* same expression.
- Example: `CREATE INDEX idx_users_email_lower ON users (lower(email));` (for `WHERE lower(email) = '...'`)
- Covering Indexes (INCLUDE clause): Store additional, non-key columns directly in the index leaf pages.
- Allows Index-Only Scans where the query can be satisfied entirely from the index without visiting the table heap (requires table visibility map to be up-to-date via VACUUM).
- Example: `CREATE INDEX idx_items_name ON items (name) INCLUDE (price, category);` (for `SELECT price FROM items WHERE name = '...'`)
- Concurrent Index Builds: Create indexes without blocking writes (`INSERT`/`UPDATE`/`DELETE`) on the table.
- Takes longer and uses more resources than a standard `CREATE INDEX`. Requires multiple passes and can fail (leaving an invalid index that needs dropping).
- Syntax: `CREATE INDEX CONCURRENTLY ...;`
- Essential for adding indexes to busy production tables with minimal downtime. Cannot run inside a transaction block.
REINDEX
Rebuilds an index, useful for shrinking bloated indexes or repairing corruption. Supports `CONCURRENTLY` option. [docs]
Purpose
- Reduce Bloat: Rebuilds the index from scratch using current table data, removing dead space (similar effect to `VACUUM FULL` but just for the index).
- Repair Corruption: Can fix certain types of index corruption.
- Update Index Definition: Historically needed for changes in collation versions (less common now).
Commands
- `REINDEX INDEX index_name;`: Rebuilds a specific index. Takes `ACCESS EXCLUSIVE` lock on the index (blocks reads/writes using the index).
- `REINDEX TABLE table_name;`: Rebuilds all indexes on a specific table. Locks the table.
- `REINDEX SCHEMA schema_name;`: Rebuilds all indexes in a schema.
- `REINDEX DATABASE database_name;`: Rebuilds all indexes in a database (requires caution).
- `REINDEX SYSTEM database_name;`: Rebuilds system catalog indexes (requires caution).
Concurrent Reindex
- `REINDEX INDEX index_name CONCURRENTLY;` (Postgres 12+): Rebuilds the index without taking heavy locks that block reads/writes.
- Similar mechanism to `CREATE INDEX CONCURRENTLY`: builds a new index in the background, then replaces the old one in a short lock window.
- Takes longer, uses more resources, cannot run in a transaction block. Preferred method for large/busy tables.
When to Use
Consider `REINDEX` (preferably `CONCURRENTLY`) when index bloat is significant (detected via monitoring) and affecting performance, or if index corruption is suspected.
SQL Features & Querying
Window Functions
Perform calculations across sets of table rows related to the current row. Uses the `OVER()` clause (`PARTITION BY`, `ORDER BY`).
Core Concept
Unlike aggregate functions which collapse rows, window functions return a value for *each* row based on a "window" of related rows. [docs]
Syntax
function_name() OVER ( [PARTITION BY expr_list] [ORDER BY expr_list] [frame_clause] )
[syntax]
- `PARTITION BY`: Divides rows into partitions (groups). Window function is applied independently to each partition. (Optional)
- `ORDER BY`: Defines the order of rows within each partition. Required for ranking and frame-dependent functions. (Optional for some functions)
- `frame_clause`: Defines the subset of rows within the partition relative to the current row (e.g., `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`). (Optional) [frame]
Common Functions
- Ranking: `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `NTILE(n)`.
- Aggregate Windows: `SUM()`, `AVG()`, `COUNT()`, `MAX()`, `MIN()` applied over the window frame.
- Value Fetching: `LAG()`, `LEAD()`, `FIRST_VALUE()`, `LAST_VALUE()`, `NTH_VALUE()`.
Example Use Cases
Calculating running totals, ranking results within categories, finding previous/next values, computing moving averages.
SELECT order_id, order_date, amount, SUM(amount) OVER (ORDER BY order_date) as running_total FROM orders;
SELECT product, category, sales, RANK() OVER (PARTITION BY category ORDER BY sales DESC) as category_rank FROM product_sales;
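A sketch of an explicit frame clause (moving average over the current row and the six preceding rows), reusing the `orders` columns from the running-total example:
SELECT order_id, order_date, amount,
       AVG(amount) OVER (
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM orders;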
Common Table Expressions (CTEs)
Define temporary, named result sets within a query using `WITH`. Improves readability. Supports `RECURSIVE` for hierarchies.
Basic CTE
Allows breaking down complex queries into logical, named steps.
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region FROM regional_sales
WHERE total_sales > 10000
)
SELECT * FROM orders
WHERE region IN (SELECT region FROM top_regions);
Recursive CTEs
Used for querying hierarchical data (e.g., organizational charts, parts explosion, graph traversal). [docs]
WITH RECURSIVE subordinates AS (
-- Non-recursive term (anchor)
SELECT employee_id, name, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive term
SELECT e.employee_id, e.name, e.manager_id, s.level + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates;
Materialization (Quirk)
- Since PostgreSQL 12, a non-recursive CTE that is referenced only once (and has no side effects) is inlined into the outer query by default; before v12, CTEs were always materialized and acted as an optimization fence.
- Use `WITH cte AS MATERIALIZED (...)` (Postgres 12+) to force the CTE result to be computed once and stored temporarily (like a temp table). Can help performance if the CTE is complex and referenced multiple times, but can hurt if it's large and only used partially or if inlining would allow better overall optimization.
- Use `WITH cte AS NOT MATERIALIZED (...)` (PG12+) to hint against materialization.
Data Modifying CTEs
You can have `INSERT`, `UPDATE`, or `DELETE` statements within a CTE, using their `RETURNING` clause to pass data to subsequent parts of the query. [docs]
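A minimal sketch: move old rows in one statement, assuming a hypothetical `logs_archive` table with the same structure as `logs`:
WITH archived AS (
    DELETE FROM logs
    WHERE log_date < '2023-01-01'
    RETURNING *
)
INSERT INTO logs_archive
SELECT * FROM archived;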
DISTINCT ON
Postgres-specific extension. Selects the *first* row for each unique combination of expressions in the `DISTINCT ON (...)` clause, based on `ORDER BY`.
Syntax
SELECT DISTINCT ON (expression [, ...]) select_list FROM ... ORDER BY expression [, ...], order_expr [, ...]
How it Works
- The query result is ordered according to the *entire* `ORDER BY` clause.
- For each group of rows that are identical according to the `DISTINCT ON` expressions, only the *first* row according to the `ORDER BY` clause is kept.
- **Crucial:** The `ORDER BY` clause *must* start with the same expressions as `DISTINCT ON` (order matters!) to get predictable results. Additional `ORDER BY` expressions determine which row is chosen within each distinct group. [docs]
Use Case: "Greatest/Latest N per Group" (N=1)
Very common for finding the most recent entry for each item.
-- Get the latest event for each device_id
SELECT DISTINCT ON (device_id)
device_id, event_timestamp, event_data
FROM device_events
ORDER BY device_id, event_timestamp DESC;
Comparison to Alternatives
Often more concise and performant than using window functions (`ROW_NUMBER() OVER (...) ... WHERE rn = 1`) or correlated subqueries for the same "latest-per-group" task, especially if indexed appropriately on the `DISTINCT ON` and `ORDER BY` columns.
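A supporting index for the query above would typically match the `DISTINCT ON`/`ORDER BY` columns:
CREATE INDEX idx_device_events_latest ON device_events (device_id, event_timestamp DESC);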
INSERT ... ON CONFLICT (UPSERT)
Atomic "UPSERT" operation. Handles unique constraint violations by either doing nothing (`DO NOTHING`) or updating the existing row (`DO UPDATE`). [docs]
Syntax
INSERT INTO table_name (column_list)
VALUES (value_list)
ON CONFLICT (conflict_target) DO conflict_action;
- `conflict_target`: Specifies the constraint that triggers the conflict handling. Can be:
- `(column_name [, ...])`: A list of columns with a unique constraint/index.
- `ON CONSTRAINT constraint_name`: The name of a unique or exclusion constraint.
- `WHERE index_predicate`: (Optional) Used together with a column list to infer a partial unique index as the conflict target.
- `conflict_action`: Can be:
- `DO NOTHING`: Silently ignore the row if a conflict occurs.
- `DO UPDATE SET col1 = value1 [, ...] [WHERE condition]`: Update the existing row that caused the conflict. Can reference the existing row using the table name (e.g., `counters.value`) and the proposed insertion row using the special `EXCLUDED` table (`SET count = mytable.count + EXCLUDED.count`). The optional `WHERE` condition refers to the existing row.
Benefits
- Atomic operation, avoiding race conditions inherent in separate SELECT-then-INSERT/UPDATE logic.
- Simplifies application code for handling data synchronization or idempotent inserts.
Example
-- Increment counter or insert new row
INSERT INTO counters (key, value)
VALUES ('page_views', 1)
ON CONFLICT (key) DO UPDATE
SET value = counters.value + EXCLUDED.value;
Advanced DML
Supports `UPDATE ... FROM`, `DELETE ... USING` for joining. `RETURNING` clause gets back modified rows. `TABLESAMPLE` for subsets.
UPDATE FROM / DELETE USING
Allows joining other tables directly into `UPDATE` or `DELETE` statements to determine which rows to modify or filter based on related data.
-- Update order status based on payment status
UPDATE orders o
SET status = 'shipped'
FROM payments p
WHERE o.order_id = p.order_id AND p.status = 'paid';
-- Delete users who haven't logged in recently
DELETE FROM users u
USING user_last_login l
WHERE u.user_id = l.user_id AND l.last_login < NOW() - INTERVAL '1 year';
RETURNING Clause
Appends a list of values from the rows modified by an `INSERT`, `UPDATE`, or `DELETE` statement to the command result. [docs]
- Useful for getting default values (like serial IDs) after an `INSERT`, or confirming exactly which rows were affected.
- Avoids a separate `SELECT` statement, improving efficiency and atomicity (within the same command).
-- Insert and get the generated ID
INSERT INTO products (name, price) VALUES ('New Gadget', 99.99)
RETURNING product_id;
-- Update stock and return the updated product names
UPDATE products SET stock = stock - 1 WHERE category = 'electronics'
RETURNING name, stock;
-- Delete old logs and return the IDs deleted
DELETE FROM logs WHERE log_date < '2023-01-01'
RETURNING log_id;
TABLESAMPLE Clause
Selects a random subset of rows from a table, typically much faster than `ORDER BY random() LIMIT n` for large tables. [docs]
- Methods: `BERNOULLI` (scans whole table, includes each row with specified probability), `SYSTEM` (faster, selects random blocks, probability is approximate).
- Example: `SELECT * FROM large_log_table TABLESAMPLE SYSTEM (1);` (Select ~1% of blocks)
- Use `REPEATABLE (seed)` for reproducible samples.
Lateral Joins
`LATERAL` allows a subquery in the `FROM` clause to reference columns from preceding `FROM` items. Like a correlated `foreach` loop.
Core Concept
Normally, a subquery in the `FROM` clause executes independently. `LATERAL` makes it dependent, executing it *for each row* of the preceding table, allowing the subquery to use values from that row. [docs]
Syntax
SELECT ...
FROM table_a a, LATERAL (subquery using a.columns) sub
WHERE ...
-- Also common with JOIN syntax:
SELECT ...
FROM table_a a
LEFT JOIN LATERAL (subquery using a.columns) sub ON true -- or other join condition
WHERE ...
Use Cases
- Finding the top N related items per row (e.g., get top 3 posts for each user).
- Calling set-returning functions with arguments from the preceding table.
- Simplifying complex correlated subqueries previously written in `SELECT` list or `WHERE` clause.
Example: Get 2 most recent orders for each customer
SELECT c.customer_id, c.name, o.order_id, o.order_date
FROM customers c
LEFT JOIN LATERAL (
SELECT order_id, order_date
FROM orders
WHERE customer_id = c.customer_id -- References outer table 'c'
ORDER BY order_date DESC
LIMIT 2
) o ON true; -- Join condition often true as filtering is inside LATERAL
Performance
Can be very efficient if the lateral subquery can use indexes based on the correlated columns. Replaces some procedural logic or complex window functions.
Aggregate `FILTER` Clause
SQL standard way to apply a condition *before* aggregation using `FILTER (WHERE ...)`. Cleaner than `CASE` statements.
Syntax
aggregate_function(expression) FILTER (WHERE condition)
Purpose
Provides a standard and often more readable way to perform conditional aggregation compared to using `CASE` inside the aggregate function.
Example: Count paid vs unpaid orders
-- Traditional CASE method
SELECT
SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders,
SUM(CASE WHEN status = 'unpaid' THEN 1 ELSE 0 END) AS unpaid_orders
FROM orders;
-- Using FILTER clause
SELECT
COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
COUNT(*) FILTER (WHERE status = 'unpaid') AS unpaid_orders
FROM orders;
-- Can be combined with GROUP BY
SELECT customer_id,
COUNT(*) FILTER (WHERE amount > 100) as large_orders,
COUNT(*) FILTER (WHERE amount <= 100) as small_orders
FROM orders
GROUP BY customer_id;
Benefits
- Improved readability, especially for complex conditions or multiple conditional aggregates.
- SQL standard syntax.
- Potentially allows for better optimization by the planner in some cases.
Native Partitioning
Declarative Partitioning
Split large tables into smaller, manageable pieces (partitions) based on LIST, RANGE, or HASH of a key. Improves performance and manageability. [docs]
Concept
A partitioned table is a logical "parent" table that is physically composed of multiple child tables (partitions). Rows inserted into the parent are automatically routed to the correct partition based on the partition key and method.
Partitioning Methods
- RANGE: Partitions based on ranges of the partition key (e.g., date ranges, numeric ranges). `PARTITION BY RANGE (created_at)` [docs]
- LIST: Partitions based on specific values of the partition key (e.g., country codes, status codes). `PARTITION BY LIST (country_code)` [docs]
- HASH: Partitions based on a hash of the partition key, distributing rows roughly evenly. `PARTITION BY HASH (user_id)` [docs]
Creating Partitions
-- Parent Table Definition
CREATE TABLE measurements (
city_id int not null,
logdate date not null,
peaktemp int,
unitsales int
) PARTITION BY RANGE (logdate);
-- Creating Partitions
CREATE TABLE measurements_y2023m01 PARTITION OF measurements
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
CREATE TABLE measurements_y2023m02 PARTITION OF measurements
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
-- List Partition Example
CREATE TABLE cities ( ... ) PARTITION BY LIST (country);
CREATE TABLE cities_us PARTITION OF cities FOR VALUES IN ('US');
CREATE TABLE cities_jp PARTITION OF cities FOR VALUES IN ('JP');
CREATE TABLE cities_other PARTITION OF cities DEFAULT; -- Optional default
Benefits
- Performance (Partition Pruning): Queries filtering by the partition key can skip scanning irrelevant partitions entirely.
- Manageability: Easier bulk data loading/deletion by manipulating partitions (`ATTACH`/`DETACH PARTITION`, `DROP TABLE partition_name`). Efficient for time-series data retention (drop old partitions).
- Maintenance: Operations like `VACUUM` or `REINDEX` can often be performed on individual partitions.
Considerations
- Indexes and constraints usually need to be created on each partition individually (though primary keys/unique constraints must include the partition key). Use `CREATE INDEX ON parent_table ...` to automatically create on all partitions (and future ones).
- Partition key cannot easily be changed after creation.
- Too many partitions can increase planning time and resource usage.
- Use extensions like `pg_partman` to automate partition creation and maintenance.
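A short sketch of a parent-level index plus pruning, continuing the `measurements` example above:
-- Created on every existing partition and automatically on future ones
CREATE INDEX idx_measurements_logdate ON measurements (logdate);
-- Partition pruning: the plan should only touch measurements_y2023m01
EXPLAIN SELECT * FROM measurements WHERE logdate = DATE '2023-01-15';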
Partition Management
Add new partitions, remove old ones using `CREATE TABLE ... PARTITION OF`, `ALTER TABLE ... ATTACH PARTITION`, `ALTER TABLE ... DETACH PARTITION`, `DROP TABLE`. Automate with tools.
Adding Partitions
- Direct Creation: Create a new table that is immediately a partition.
CREATE TABLE measurements_y2023m03 PARTITION OF measurements FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
- Attaching Existing Table: Attach a regular table (matching the parent's structure) as a new partition. Useful for bulk loading data into a separate table first. [docs]
Since v12 this requires only a `SHARE UPDATE EXCLUSIVE` lock on the parent (earlier versions took `ACCESS EXCLUSIVE`).
CREATE TABLE measurements_staging (...); -- Load data here
ALTER TABLE measurements ATTACH PARTITION measurements_staging FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
Removing Partitions
- Detaching Partition: Convert a partition back into a standalone regular table. The data remains but is no longer part of the partitioned table. Takes a brief `ACCESS EXCLUSIVE` lock (`DETACH PARTITION ... CONCURRENTLY`, v14+, avoids it but cannot run inside a transaction block). [docs]
ALTER TABLE measurements DETACH PARTITION measurements_y2023m01;
-- measurements_y2023m01 is now a regular table; archive or drop it later.
- Dropping Partition: Directly drop the partition table. This permanently deletes the data in that partition. Fast operation.
DROP TABLE measurements_y2023m01;
Automation
- Manual creation/dropping of range partitions (e.g., monthly) is tedious and error-prone.
- Use extensions like `pg_partman` which provide functions to automatically create new partitions ahead of time and drop old ones based on retention policies. Often scheduled via `pg_cron` or external schedulers.
Indexes & Constraints
- Remember to create indexes on new partitions. Using `CREATE INDEX ON parent_table ...` handles this automatically for future partitions.
- Foreign keys referencing partitioned tables are possible but have limitations. [limitations]
Concurrency & Locking
Isolation Levels
`READ COMMITTED` (default, statement-level snapshot), `REPEATABLE READ` (transaction snapshot, can fail), `SERIALIZABLE` (full serializability, more failures). [docs]
Levels & Guarantees
- `READ UNCOMMITTED`: Accepted syntactically but behaves exactly like `READ COMMITTED`; PostgreSQL never allows dirty reads.
- `READ COMMITTED` (Default): Guarantees no dirty reads. Each statement sees a snapshot of data committed *before that statement began*. Can experience non-repeatable reads (same query gets different results later in tx) and phantom reads (new rows appear later in tx). **Important Quirk:** Often sufficient, but be aware of its statement-level snapshot behavior. [docs]
- `REPEATABLE READ`: Guarantees no dirty reads or non-repeatable reads. Entire transaction sees a snapshot of data committed *before the transaction's first query began*. Although the SQL standard permits phantom reads at this level, PostgreSQL's snapshot-based implementation prevents them as well. **Quirk:** May fail with a serialization failure (SQLSTATE 40001, "could not serialize access due to concurrent update") when it tries to modify a row that another transaction changed after the snapshot was taken; it does *not* detect write skew (that requires `SERIALIZABLE`). Requires application retry logic. [docs]
- `SERIALIZABLE`: Strongest level. Guarantees transactions behave as if they executed one after another sequentially. Prevents all anomalies (dirty, non-repeatable, phantom). **Quirk:** More likely to experience serialization failures than `REPEATABLE READ` due to stricter checking. Requires application retry logic. Higher performance overhead. [docs]
Choosing a Level
- Stick with `READ COMMITTED` unless specific read consistency across multiple statements within a transaction is required.
- Use `REPEATABLE READ` or `SERIALIZABLE` for multi-statement transactions requiring a consistent view or complex read-modify-write operations where race conditions or anomalies are possible. Be prepared to handle serialization failures by retrying the transaction.
Setting Level
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Locking Mechanisms
Implicit row-level via MVCC/DML. Explicit `SELECT FOR UPDATE/SHARE`, `LOCK TABLE`. Application-level Advisory Locks. Monitor with `pg_locks`.
Row-Level Locks (Implicit & Explicit)
- Managed automatically by DML commands (e.g., `UPDATE` takes a `ROW EXCLUSIVE` table-level lock plus exclusive row-level locks on the rows it modifies).
- `SELECT ... FOR UPDATE`/`FOR NO KEY UPDATE`/`FOR SHARE`/`FOR KEY SHARE`: Explicitly lock rows returned by a `SELECT` statement, preventing other transactions from modifying or locking them incompatibly until the current transaction ends. Used to prevent race conditions in read-modify-write cycles. Options `NOWAIT` and `SKIP LOCKED`.
Table-Level Locks (Explicit)
- `LOCK TABLE table_name [IN lock_mode]`: Explicitly lock an entire table.
- Modes (increasing strength): `ACCESS SHARE`, `ROW SHARE`, `ROW EXCLUSIVE`, `SHARE UPDATE EXCLUSIVE`, `SHARE`, `SHARE ROW EXCLUSIVE`, `EXCLUSIVE`, `ACCESS EXCLUSIVE`. See Lock Compatibility Matrix.
- Use with caution! Can severely impact concurrency. Needed for some DDL operations or specific application logic.
Advisory Locks
- Application-defined locks managed by function calls (e.g., `pg_advisory_lock(key)`, `pg_advisory_xact_lock(key)`). [docs]
- Based on arbitrary numbers (e.g., derived from an application entity ID). Not tied to specific tables or rows.
- Useful for coordinating application-level processes or locking concepts not directly represented by a single table row (e.g., limiting concurrent processing of a job type).
- Transaction-scoped (`pg_advisory_xact_lock`) or session-scoped (`pg_advisory_lock`). Remember to unlock session locks!
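A minimal sketch; the key (`42` here) is an arbitrary application-chosen bigint:
SELECT pg_try_advisory_lock(42);     -- session-level; returns false if another session holds it
SELECT pg_advisory_unlock(42);       -- must be released explicitly
BEGIN;
SELECT pg_advisory_xact_lock(42);    -- transaction-level; released automatically at COMMIT/ROLLBACK
COMMIT;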
Deadlocks
- Occur when two (or more) transactions wait for locks held by each other.
- Postgres automatically detects deadlocks and aborts one of the transactions (raising an error). [docs]
- **Diagnosis:** Check PostgreSQL logs (if `log_lock_waits` is enabled). Query `pg_locks` and `pg_stat_activity` during contention.
- **Prevention:** Acquire locks on multiple objects in a consistent, globally defined order across all application code. Use coarser locks (e.g., table lock) if fine-grained locking proves too complex. Reduce transaction duration. Use `SELECT ... FOR UPDATE SKIP LOCKED` for queue processing patterns.
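A sketch of the `SKIP LOCKED` queue pattern mentioned above, assuming a hypothetical `jobs` table; concurrent workers each claim a different pending row without blocking on one another:
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;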
Monitoring
Use the `pg_locks` view to inspect currently held locks and waiting transactions. Query `pg_stat_activity` to see `wait_event_type` and `wait_event` (e.g., `Lock`, `LWLock`).
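A quick way to see who is blocked and by whom (`pg_blocking_pids()` is available since 9.6):
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;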
Performance & Maintenance
VACUUM
Essential maintenance! Reclaims space from dead rows (bloat), updates visibility map, prevents TXID Wraparound. Autovacuum is critical but needs monitoring/tuning. [docs]
Why VACUUM is Needed (MVCC)
Because MVCC creates new row versions for UPDATEs/DELETEs, the old ("dead") versions remain physically until `VACUUM` runs. Without it, tables grow indefinitely (bloat) and performance degrades. [routine vacuuming]
What VACUUM Does
- Reclaims Space: Marks space occupied by dead tuples as reusable for future INSERTs/UPDATEs within the same table (standard `VACUUM` doesn't typically return space to the OS).
- Updates Visibility Map (VM): Marks blocks containing only visible-to-all tuples, enabling Index-Only Scans.
- Updates Statistics: Optionally runs `ANALYZE` (`VACUUM ANALYZE`).
- Prevents TXID Wraparound: "Freezes" old tuple transaction IDs to prevent comparison issues when the TXID counter wraps around (every ~4 billion transactions). Critical for database operation.
VACUUM vs VACUUM FULL
- `VACUUM [tablename]`: Standard vacuum. Runs concurrently with reads/writes (minimal locking). Reclaims space within the table. **Run regularly.**
- `VACUUM FULL [tablename]`: Rewrites the entire table to a new file, reclaiming maximum space (including returning to OS). Requires `ACCESS EXCLUSIVE` lock (blocks all access). Use sparingly, only when severe bloat needs aggressive reclaiming. Consider `pg_repack` extension as an online alternative.
Autovacuum
- Background process that automatically runs `VACUUM` and `ANALYZE` on tables based on thresholds (number of dead tuples, inserts/updates/deletes). [docs]
- **Essential for health.** Default settings are often too conservative for busy databases. Requires tuning (`autovacuum_max_workers`, `autovacuum_vacuum_scale_factor` / `_threshold`, `autovacuum_analyze_threshold` / `_scale_factor`, etc. [config]). Monitor its activity!
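Per-table storage parameters are the usual tuning knob for large, busy tables; a sketch assuming a hypothetical `big_events` table:
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.01,   -- vacuum after ~1% of rows are dead (default 20%)
    autovacuum_vacuum_threshold    = 1000
);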
ANALYZE
Collects statistics about table data distribution (histograms, distinct values) used by the query planner to choose optimal execution plans. Run automatically by Autovacuum. [docs]
Why ANALYZE is Needed
The query planner needs accurate information about the data in tables to make good decisions (e.g., estimate how many rows a `WHERE` clause will return, decide between index scan vs sequential scan, choose join methods). Outdated stats lead to bad plans. [planner stats]
What ANALYZE Collects
- Total number of rows (`reltuples` in `pg_class`).
- Number of distinct values per column (`n_distinct`).
- Most common values (MCVs) and their frequencies.
- Histograms representing data distribution for range comparisons.
- Correlation between physical row order and column values (for BRIN indexes and plan costing).
- Stored in system catalog `pg_statistic`.
When to Run
- After significant data changes (bulk loads, deletes, updates).
- After creating indexes on expressions.
- Regularly on all tables (Autovacuum usually handles this).
- Manually run `ANALYZE tablename;` or `ANALYZE VERBOSE tablename;` if query plans seem poor or after large data modifications before autovacuum kicks in.
Autovacuum & ANALYZE
Autovacuum typically triggers `ANALYZE` based on a percentage of rows changed (`autovacuum_analyze_scale_factor` + `autovacuum_analyze_threshold`). Tuning may be required, especially for large tables where the default % is too high. [docs]
Statistics Target
Control the detail level of statistics (number of histogram buckets, MCVs) per column using `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS n;`, which overrides the global `default_statistics_target` (default 100).
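For example, raising the target on a heavily filtered column (the new target takes effect at the next `ANALYZE`):
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;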
EXPLAIN
Analyze query execution plans. `EXPLAIN ANALYZE` runs the query and shows actual times/rows. Use `BUFFERS` for cache/IO info. Essential diagnostic tool. [docs]
Purpose
Understand how PostgreSQL intends to execute (or actually executed) a query. Essential for diagnosing performance issues and validating index usage. [using explain]
Basic Usage
- `EXPLAIN SELECT ...`: Shows the planner's *estimated* execution plan, costs, and row counts without running the query. Safe to use anytime.
- `EXPLAIN ANALYZE SELECT ...`: **Runs the query** and shows the actual execution plan, including actual times per node, actual row counts, loops, etc. Use cautiously on production for long-running or modifying queries.
Key Options
- `BUFFERS`: With `ANALYZE`, shows buffer usage (shared hits, reads, dirtied, written; temp reads/written) per node. Crucial for identifying I/O bottlenecks and cache efficiency.
EXPLAIN (ANALYZE, BUFFERS) ...
[docs]
- `WAL`: With `ANALYZE`, shows WAL record generation counts and sizes.
- `SETTINGS`: Shows non-default configuration settings affecting the plan.
- `COSTS`: Enable/disable display of estimated costs (default true). `EXPLAIN (COSTS false) ...`
- `TIMING`: Enable/disable display of actual timing per node with `ANALYZE` (default true). `EXPLAIN (ANALYZE, TIMING false) ...`
- `VERBOSE`: Shows more details, like output column lists per node, schema-qualified names.
- `SUMMARY`: Enable/disable summary info (planning/execution time) with `ANALYZE` (default true).
- `FORMAT {TEXT|XML|JSON|YAML}`: Choose output format (default TEXT). JSON is useful for programmatic analysis or plan visualizers.
Interpreting Output
- Read plans from inside-out / bottom-up (most indented node executes first).
- Compare estimated rows vs actual rows (large discrepancies suggest outdated statistics or planner limitations).
- Identify slow nodes (high `actual time`).
- Check for Sequential Scans on large tables where Index Scans or Bitmap Heap Scans were expected.
- Analyze buffer hits/reads (`Buffers:` line) to understand caching effectiveness (high reads indicate poor caching or cold cache).
- Look for expensive Sorts or Hash operations (check `work_mem`).
- [Interpreting ANALYZE]
Tools
Visualizers like explain.depesz.com, explain.dalibo.com, PEV make complex plans easier to read (use `FORMAT JSON`).
Statistics Views & Extensions
Monitor activity: `pg_stat_activity`, `pg_stat_statements`, `pg_stat_*_tables/indexes`, `pg_locks`. Use `pg_buffercache` for cache inspection.
Key Views for Monitoring
- `pg_stat_activity`: Shows information about currently active backend processes/connections.
- Columns: `pid`, `datname`, `usename`, `client_addr`, `backend_start`, `query_start`, `state` (active, idle, idle in transaction), `wait_event_type`/`wait_event`, `query`, `backend_type`.
- Essential for seeing long-running queries, idle transactions, waiting connections (lock contention).
- `pg_stat_statements` (`CREATE EXTENSION pg_stat_statements;`): Tracks execution statistics for all normalized queries executed. **Crucial for identifying expensive queries.**
- Requires setting `shared_preload_libraries = 'pg_stat_statements'` in `postgresql.conf` and a server restart. Tune settings like `pg_stat_statements.max`.
- Columns: `queryid`, `query`, `calls`, `total_exec_time`, `mean_exec_time`, `rows`, `shared_blks_hit`/`read`/`dirtied`/`written`, `temp_blks_read`/`written`, `wal_records`, `wal_bytes`.
- Use `pg_stat_statements_reset()` to clear stats.
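A typical "top queries by total time" query against the view (column names as of v13; older versions use `total_time`/`mean_time`):
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;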
- `pg_stat_user_tables` / `pg_stat_all_tables`: Shows table access statistics.
- Columns: `seq_scan`, `seq_tup_read`, `idx_scan`, `idx_tup_fetch`, `n_tup_ins`/`upd`/`del`/`hot_upd`, `n_live_tup`, `n_dead_tup`, `last_vacuum`/`autovacuum`, `last_analyze`/`autoanalyze`.
- Useful for monitoring table bloat (`n_dead_tup`), scan types, vacuum/analyze activity.
- `pg_statio_user_tables` / `pg_statio_all_tables`: Shows table I/O statistics (buffer cache).
- Columns: `heap_blks_read`/`hit`, `idx_blks_read`/`hit`, `toast_blks_read`/`hit`, `tidx_blks_read`/`hit`.
- Helps identify tables causing heavy I/O or poor caching (`*_hit` vs `*_read` ratio).
- `pg_locks`: Shows currently held locks and waiting transactions. Useful for diagnosing locking contention and deadlocks.
- `pg_stat_replication`: Monitor streaming replication status and lag on the primary.
- `pg_stat_wal_receiver`: Monitor streaming replication status on the standby.
- `pg_buffercache` Extension (`CREATE EXTENSION pg_buffercache;`): Provides a detailed view of blocks currently held in `shared_buffers`. Useful for advanced cache analysis. Requires superuser privileges.
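A simple cache-efficiency sketch using `pg_statio_user_tables` (a low hit percentage on a hot table suggests undersized `shared_buffers` or cold data):
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;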
Configuration Tuning (`postgresql.conf`)
Key settings: memory (`shared_buffers`, `work_mem`), WAL (`max_wal_size`), Checkpoints, Autovacuum. Use `pgtune` as starting point. Reload/Restart needed.
Configuration is primarily done via `postgresql.conf` (requires reload or restart for many settings) and `pg_hba.conf` (authentication, requires reload). Use `ALTER SYSTEM SET ...` to modify `postgresql.conf` via SQL (writes to `postgresql.auto.conf`, overrides main file).
Key Parameter Groups
- Memory: (See Memory card) `shared_buffers`, `work_mem`, `maintenance_work_mem`, `effective_cache_size`. Critical for performance.
- WAL: `wal_level` (`replica` is default, needed for replication/PITR), `max_wal_size` & `min_wal_size` (control WAL disk usage before checkpoint), `wal_buffers`, `commit_delay`, `commit_siblings`, `synchronous_commit`. Affects durability and write performance.
- Checkpoints: `checkpoint_timeout` (max time between checkpoints), `checkpoint_completion_target` (spread checkpoint I/O over time, e.g., 0.9). Controls recovery time vs. I/O spikes.
- Autovacuum: (See VACUUM card) `autovacuum`, `autovacuum_max_workers`, `autovacuum_naptime`, `autovacuum_vacuum_threshold`, `autovacuum_vacuum_scale_factor`, `autovacuum_analyze_threshold`, `autovacuum_analyze_scale_factor`, `autovacuum_vacuum_cost_delay`, `autovacuum_vacuum_cost_limit`, `log_autovacuum_min_duration`. **Requires careful tuning for busy systems.**
- Planner: `random_page_cost` (tune based on storage type, lower for SSDs, e.g., 1.1), `seq_page_cost` (usually 1.0), `effective_cache_size`, `cpu_tuple_cost`, `cpu_index_tuple_cost`, `cpu_operator_cost`, `enable_*` flags (e.g., `enable_bitmapscan`). Influences query plan choices.
- Connections: `max_connections` (set high enough for clients + pooler, mindful of memory), `superuser_reserved_connections`.
- Logging: `log_destination`, `logging_collector`, `log_directory`, `log_filename`, `log_statement` (use carefully, e.g. `ddl` or `mod`), `log_min_duration_statement` (crucial for finding slow queries), `log_checkpoints`, `log_lock_waits`, `log_temp_files`, `log_autovacuum_min_duration`. Essential for troubleshooting.
Tools & Approach
- Start with defaults or recommendations from `pgtune` (online tool or script) based on system resources.
- Monitor performance (using stats views, OS tools) and adjust parameters iteratively based on workload.
- Understand the impact of each parameter before changing it. Read the documentation! Check the `context` in `pg_settings` (needs reload/restart?).
- Use `SHOW parameter_name;` or query `pg_settings` view to see current values.
- Use `SELECT pg_reload_conf();` or `pg_ctl reload` to apply changes that don't require a restart. Use `pg_ctl restart` for others.
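A minimal change-and-verify sketch using `ALTER SYSTEM`:
ALTER SYSTEM SET work_mem = '64MB';    -- written to postgresql.auto.conf
SELECT pg_reload_conf();               -- re-reads config files; 'postmaster' context settings still need a restart
SELECT name, setting, unit, context FROM pg_settings WHERE name = 'work_mem';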
Bloat Monitoring & Remediation
MVCC leads to dead tuples (bloat) in tables/indexes. Monitor using queries or extensions. Remediate with VACUUM, REINDEX, or `pg_repack`.
What is Bloat?
Unused space in table and index files resulting from accumulated dead tuples (from `UPDATE`/`DELETE`) that haven't been fully reclaimed by `VACUUM`. Reduces performance and wastes disk space.
Detecting Bloat
- **Queries:** Use queries comparing the physical size (`pg_total_relation_size`) with estimated live data size. Many community scripts exist (search for "postgres bloat query"). These are estimates. Example snippet:
SELECT ..., pg_size_pretty(bloat_size) AS bloat, pg_size_pretty(table_size - bloat_size) AS live_data FROM ( ... complex query using pg_class, pg_statistic ... ) ...
- **`pgstattuple` Extension** (`CREATE EXTENSION pgstattuple;`): Provides functions (`pgstattuple`, `pgstatindex`) to scan relations and give precise statistics on dead tuples, free space, etc. Can be I/O intensive.
- **Monitoring `n_dead_tup`:** Track `n_dead_tup` in `pg_stat_user_tables`. High numbers indicate potential for bloat if `VACUUM` isn't keeping up.
Remediation
- **Regular `VACUUM` / Autovacuum Tuning:** The primary preventative measure. Ensure autovacuum is running frequently and effectively enough for your workload (command examples follow this list).
- **Manual `VACUUM [tablename]`:** Can help clear out dead tuples if autovacuum is lagging. Does not typically shrink files significantly.
- **`REINDEX INDEX index_name [CONCURRENTLY]`:** Rebuilds a specific index, removing bloat from the index. Concurrent option recommended.
- **`VACUUM FULL tablename;`:** Rewrites the entire table and its indexes, removing all bloat and shrinking files. Requires `ACCESS EXCLUSIVE` lock (downtime). Use sparingly.
- **`pg_repack` Extension:** Popular tool to perform an online `VACUUM FULL`-like operation with minimal locking (creates a new table copy, applies changes via triggers, then swaps). Requires installing the extension and running the command-line tool.
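Hedged examples of the commands above (object names are placeholders):
VACUUM (VERBOSE, ANALYZE) my_table;             -- reclaim dead tuples, refresh statistics
REINDEX INDEX CONCURRENTLY my_table_email_idx;  -- rebuild a bloated index without blocking writes (PG12+)
-- VACUUM FULL my_table;                        -- last resort: rewrites the table under ACCESS EXCLUSIVE lock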
OS-Level Tuning Hints
OS configuration impacts DB performance. Consider filesystem, huge pages, swappiness, read-ahead, CPU governor.
Key Areas (Linux Focused Examples)
- **Filesystem:**
- Prefer `XFS` or `ext4` for data directories.
- Mount options: `noatime`, `nodiratime` can reduce minor I/O overhead. Ensure barriers are enabled (default).
- **Transparent Huge Pages (THP):**
- Often recommended to be **disabled** (`never`) for database workloads, especially older kernels. Can cause latency spikes and memory fragmentation issues. [docs reference]
- Check: `cat /sys/kernel/mm/transparent_hugepage/enabled`
- Disable: Add `transparent_hugepage=never` to kernel boot parameters or via `systemd` service unit.
- **Swappiness (`vm.swappiness`):**
- Controls how aggressively the kernel swaps memory. Default is often 60.
- For dedicated DB servers with ample RAM, lowering this (e.g., 1, 10) discourages swapping of database cache, preventing performance cliffs. Don't set to 0 unless you fully understand implications.
- Set via `sysctl vm.swappiness=10` (temporary) or in `/etc/sysctl.conf` (persistent).
- **Virtual Memory Overcommit:**
- `vm.overcommit_memory`: Often set to `2` (don't overcommit) along with configuring `vm.overcommit_ratio` appropriately to prevent OOM killer issues if memory usage spikes unexpectedly. Needs careful calculation based on RAM + Swap. [docs]
- **Block Device Read-Ahead:** Check with `blockdev --getra /dev/sdX`, set with `blockdev --setra 256 /dev/sdX`. Setting appropriate read-ahead (e.g., 256 or 512 sectors = 128K/256K) for the data disk can sometimes improve large sequential scan performance. Depends heavily on storage type and workload. Benchmark changes.
- **CPU Governor:**
- Ensure the CPU frequency scaling governor is set to `performance` rather than `ondemand` or `powersave` for consistent database performance on dedicated servers.
- Check: `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`
- Set via tools like `cpupower` or kernel parameters.
General Advice
These are starting points. Always monitor system performance (CPU, memory, I/O with tools like `vmstat`, `iostat`, `top`/`htop`) and benchmark changes carefully. Defaults are often not optimal for heavy database workloads.
Common Query Anti-Patterns
Avoid common pitfalls: non-SARGable predicates, `SELECT *`, N+1 queries, large `IN` lists, function calls in `WHERE` without indexes.
- **Non-SARGable Predicates:** Conditions in `WHERE` clauses that prevent the planner from using an index effectively ("SARGable" = Search ARGument Able). Rewrite examples follow this list.
- Applying functions to indexed columns: `WHERE lower(email) = '...'` (Fix: use expression index or store lowercase). `WHERE date_trunc('day', created_at) = '...'` (Fix: use range query `WHERE created_at >= '...' AND created_at < '...'`).
- Wildcard prefix in `LIKE`: `WHERE name LIKE '%Smith'` (Fix: consider full-text search or reverse index if suffix search needed).
- Numeric/Date operations on column: `WHERE amount * 1.1 > 100` (Fix: rewrite as `WHERE amount > 100 / 1.1`).
- **`SELECT *`:** Selecting all columns when only a few are needed.
- Wastes network bandwidth and server/client memory.
- Prevents Index-Only Scans.
- Makes code brittle if table structure changes.
- Fix: Explicitly list required columns (`SELECT id, name, email FROM users ...`).
- **N+1 Query Problem:** Common in ORMs or application loops. Fetching a list of parent items (1 query), then looping through them to fetch related child items one by one (N queries).
- Causes excessive database round trips and load.
- Fix: Use `JOIN` in the initial query, use `IN` clauses with fetched IDs, or use dataloader patterns.
- **Large `IN` Lists / Many `OR` Conditions:** `WHERE id IN (1, 2, ..., 10000)` or `WHERE col='a' OR col='b' OR ...`.
- Can lead to poor plan choices or excessive planning time.
- Fix: Use `JOIN` against a temporary table or `VALUES` list containing the IDs/values. For `OR`, consider `UNION ALL` or rewriting logic.
- **Implicit Type Casting:** Relying on Postgres to cast types in comparisons (e.g., comparing `integer` column to a string).
- Can sometimes prevent index usage.
- Fix: Ensure application sends parameters with matching types or use explicit `CAST`.
- **Overuse of `OFFSET` for Pagination:** `LIMIT x OFFSET y` becomes increasingly slow as `y` grows, because the server still needs to compute and discard `y` rows.
- Fix: Use keyset/seek pagination (based on remembering the last seen value of an ordered column: `WHERE indexed_col > last_value ORDER BY indexed_col LIMIT x`). See the sketch after this list.
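Two of the rewrites above, sketched against a hypothetical `users` table:
-- Non-SARGable vs. SARGable date filter
SELECT id FROM users WHERE date_trunc('day', created_at) = DATE '2024-01-01';  -- defeats a plain index on created_at
SELECT id FROM users
WHERE created_at >= DATE '2024-01-01'
  AND created_at <  DATE '2024-01-02';                                         -- range predicate can use the index
-- Expression index for case-insensitive lookups
CREATE INDEX users_email_lower_idx ON users (lower(email));
SELECT id FROM users WHERE lower(email) = 'alice@example.com';
-- Keyset pagination instead of OFFSET
SELECT id, name FROM users
WHERE id > 10000   -- last id seen on the previous page
ORDER BY id
LIMIT 50;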
Extensibility (Postgres Superpower)
Stored Procedures/Functions
Write server-side logic. Default is `PL/pgSQL`. Others include `PL/Python`, `PL/v8`, `SQL`, `C`. [docs]
Purpose
Encapsulate business logic, perform complex operations atomically within the database, reduce network round trips, enforce complex constraints or security logic.
Languages
- `PL/pgSQL`: Default, procedural language similar to Oracle's PL/SQL. Block-structured, variables, control flow (IF, LOOP), exception handling. Widely used and robust. [docs]
- `PL/Python` (`plpython3u`): Write functions in Python. Access to Python libraries. Untrusted (`u`) means it runs outside the sandbox restrictions of trusted languages, and only superusers can create functions in it; core PostgreSQL ships no trusted PL/Python variant, so use with care. Useful for data analysis, complex string manipulation, external API calls (via libraries). [docs]
- `PL/v8` (`plv8`): Write functions in JavaScript (using the V8 engine). Useful for JSON processing, web-related logic. Needs separate installation. [repo]
- `SQL`: Simple functions written purely in SQL. Can often be inlined by the planner for better performance. Good for simple wrappers or calculations. [docs]
- `C`: Highest performance, allows low-level access, but more complex development, compilation, and deployment. [docs]
- Others: `PL/Perl`, `PL/Tcl`, `PL/Java`, etc.
Creating Functions
CREATE OR REPLACE FUNCTION get_user_count()
RETURNS integer AS $$ -- Dollar Quoting
DECLARE
user_count integer;
BEGIN
SELECT count(*) INTO user_count FROM users;
RETURN user_count;
END;
$$ LANGUAGE plpgsql VOLATILE SECURITY INVOKER; -- Defaults
Procedures (Postgres 11+)
Use `CREATE PROCEDURE ... LANGUAGE ... AS $$ ... $$;`. Unlike functions, procedures do not return a value directly and *can* issue transaction control commands (`COMMIT`, `ROLLBACK`) in their body, provided the procedure is invoked with `CALL my_proc();` outside an explicit transaction block. [Tx Ctrl]
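A minimal sketch (the `audit_log` table is hypothetical); the in-procedure `COMMIT` works only when `CALL` itself is not run inside an explicit transaction block:
CREATE OR REPLACE PROCEDURE purge_old_audit_rows()
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM audit_log WHERE log_time < now() - interval '90 days';
    COMMIT;  -- allowed in a procedure, not in a function
END;
$$;
CALL purge_old_audit_rows();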
Function Volatility & Security
- Volatility (`VOLATILE`, `STABLE`, `IMMUTABLE`): Hints to the planner about function side effects and result consistency. Crucial for performance and correct use in indexes. [docs]
- Security (`SECURITY INVOKER` vs `SECURITY DEFINER`): Determines if the function runs with the privileges of the *calling user* (INVOKER, default) or the *user who defined the function* (DEFINER). See Security section card. [docs]
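For example, only an `IMMUTABLE` function may be used in an index expression (a sketch against a hypothetical `users` table):
CREATE OR REPLACE FUNCTION normalize_email(e text)
RETURNS text
LANGUAGE sql IMMUTABLE
AS $$ SELECT lower(trim(e)) $$;
-- IMMUTABLE is required for use in an index expression
CREATE INDEX users_email_norm_idx ON users (normalize_email(email));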
Triggers
Execute a function automatically on DML events (`INSERT`, `UPDATE`, `DELETE`). Use `BEFORE`/`AFTER`, Row/Statement level. Complex logic, use carefully. [docs]
Concept
Triggers define that a specific function (trigger function, must return type `trigger`) should be executed automatically when a certain DML event occurs on a specific table.
Creating Triggers
- Create Trigger Function: Usually in PL/pgSQL. Receives context via special variables (`TG_OP`, `NEW`, `OLD`). Must return `NEW` (for `BEFORE INSERT/UPDATE`), `OLD` (for `BEFORE DELETE`), or `NULL`.
CREATE OR REPLACE FUNCTION update_changetimestamp_func() RETURNS trigger AS $$
BEGIN
    NEW.changed_at = clock_timestamp();  -- Update column
    RETURN NEW;                          -- Return row to be inserted/updated
END;
$$ LANGUAGE plpgsql;
- Create Trigger Definition: Attach the function to the table and event. [docs]
CREATE TRIGGER set_changetimestamp
    BEFORE UPDATE ON my_table    -- Timing (BEFORE/AFTER/INSTEAD OF)
    FOR EACH ROW                 -- Level (ROW/STATEMENT)
    EXECUTE FUNCTION update_changetimestamp_func();
Key Options
- Timing (`BEFORE`, `AFTER`): Execute function before or after the row operation. `BEFORE` triggers can modify the `NEW` row or skip the operation (`RETURN NULL`). `AFTER` triggers see the final result but cannot change it directly (often used for auditing or cascading actions).
- Timing (`INSTEAD OF`): For Views only. Execute function *instead of* the DML operation on the view, allowing updates to non-updatable views.
- Level (`FOR EACH ROW`, `FOR EACH STATEMENT`): Row-level triggers fire once per affected row. Statement-level triggers fire once per SQL statement, regardless of rows affected.
- Events (`INSERT`, `UPDATE [OF column, ...]`, `DELETE`, `TRUNCATE`): Specify which DML operations activate the trigger.
- **Constraints:** Trigger behavior can interact with constraints (`DEFERRABLE INITIALLY DEFERRED`).
Use Cases
- Auditing changes.
- Maintaining derived/denormalized data.
- Enforcing complex business rules or constraints not possible with standard checks/foreign keys.
- Updating modification timestamps.
Pitfalls
- Can make data modification logic complex and hard to debug ("hidden" logic).
- Can significantly impact performance if trigger functions are slow or cause cascading effects.
- Risk of infinite recursion if triggers cause actions that fire the same trigger again.
- Consider alternatives (application logic, rules, procedures) before resorting to complex triggers.
Extensions
Package new types, functions, operators via `CREATE EXTENSION`. Huge ecosystem (PostGIS, TimescaleDB, `pg_stat_statements`, `pg_cron`, `pgaudit`). [docs]
Concept
Extensions bundle related SQL objects (types, functions, operators, index methods, etc.) into a single package that can be easily installed (`CREATE EXTENSION`) and removed (`DROP EXTENSION`) from a database. Simplifies deployment and version management of add-on features. [docs]
Managing Extensions
- `CREATE EXTENSION extension_name [SCHEMA schema_name] [VERSION version] [CASCADE];`: Installs the extension. Requires extension control/script files to be present on the server filesystem (usually via OS package manager or compilation).
- `\dx` (in psql): List installed extensions. `SELECT * FROM pg_extension;`
- `SELECT * FROM pg_available_extensions;`: List extensions available to be installed.
- `ALTER EXTENSION extension_name UPDATE [TO 'new_version'];`: Upgrades an extension to a newer installed version.
- `DROP EXTENSION extension_name [CASCADE|RESTRICT];`: Removes the extension and its objects. `CASCADE` also drops dependent objects.
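A typical workflow, assuming the extension's files are already installed on the server (`pg_stat_statements` additionally needs `shared_preload_libraries` and a restart):
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT extname, extversion FROM pg_extension;            -- installed in this database
SELECT name, default_version, installed_version
FROM pg_available_extensions WHERE name = 'pg_stat_statements';
ALTER EXTENSION pg_stat_statements UPDATE;               -- upgrade to the newest installed version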
Notable Extensions (Examples)
- PostGIS: Adds comprehensive support for geographic objects and spatial queries.
- TimescaleDB: Turns PostgreSQL into a powerful time-series database with hypertables and specialized functions.
- Citus: Distributes PostgreSQL horizontally for sharding and parallel query processing.
- `pg_stat_statements`: Tracks query execution statistics (essential for performance tuning).
- `pgcrypto`: Provides cryptographic functions (hashing, encryption).
- `uuid-ossp` / `pgcrypto`: Functions to generate UUIDs.
- `hstore`: Key-value data type.
- `pg_trgm`: Trigram matching for fuzzy string search (e.g., `LIKE '%word%'` alternative using GiST/GIN indexes).
- `pg_cron`: In-database job scheduler (like OS cron).
- `pg_repack`: Online table reorganization (alternative to `VACUUM FULL`).
- `pg_partman`: Automates time-based and serial partitioning management.
- `pgaudit`: Provides detailed session and object audit logging.
- `pgvector`: Vector similarity search for AI/ML applications.
- `bloom`: Bloom filter index access method.
- `pg_buffercache`: Inspect shared buffer content.
Finding Extensions
PostgreSQL Extension Network (PGXN), GitHub, vendor websites, cloud provider offerings.
Foreign Data Wrappers (FDWs)
Access external data sources as if they were local tables using `CREATE FOREIGN TABLE`. Connect to other PG dbs, MySQL, files, etc. [docs]
Concept
FDWs provide a standard SQL interface to query data residing outside the current PostgreSQL database, potentially even on different database systems or flat files. [PG Wiki FDW List]
Setup Steps
- Install FDW Extension: `CREATE EXTENSION postgres_fdw;` (for connecting to other PG dbs [postgres_fdw]) or `oracle_fdw`, `mysql_fdw`, `file_fdw`, `odbc_fdw`, `mongo_fdw`, etc.
- Create Server: `CREATE SERVER remote_pg_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'other_host', dbname 'other_db', port '5432');`
- Create User Mapping: `CREATE USER MAPPING FOR current_user SERVER remote_pg_server OPTIONS (user 'remote_user', password 'remote_pass');` (maps local user to remote credentials)
- Create Foreign Table(s): Define the structure of the remote table locally, or import schema definitions.
-- Manual Definition
CREATE FOREIGN TABLE remote_users (
    id integer OPTIONS (column_name 'user_id'),  -- Map column name if needed
    name text,
    email text
) SERVER remote_pg_server OPTIONS (schema_name 'public', table_name 'users');
-- Import Schema (Easier)
IMPORT FOREIGN SCHEMA remote_public_schema
    LIMIT TO (users, orders)  -- Optional: limit tables
    FROM SERVER remote_pg_server INTO local_schema;
Usage
Query the foreign table (`SELECT * FROM local_schema.remote_users;`) just like a local table. Postgres translates the query and pushes down operations (filtering, joins, aggregates if supported by the FDW) to the remote source where possible (check `EXPLAIN`). [querying]
Benefits
- Unified view of distributed data.
- Leverage PostgreSQL's SQL and features to query diverse sources.
- Can simplify data integration tasks without complex ETL.
- Some FDWs support writes (`INSERT`/`UPDATE`/`DELETE`) back to the foreign source.
Considerations
Performance depends heavily on the specific FDW, network latency, and how much processing can be pushed down to the remote server. Authentication and security need careful management. Transaction handling across systems can be complex (consider two-phase commit extensions if needed).
Replication & High Availability
Streaming Replication
Mechanism
- Standby server connects to the primary (using credentials defined in `primary_conninfo`) and streams WAL records via replication protocol (`walsender` on primary, `walreceiver` on standby).
- Standby continuously replays these WAL records to keep its data files nearly identical to the primary's. [docs]
Hot Standby (Read Replicas)
- If `hot_standby = on` (default) on the standby, it can serve read-only queries while continuously applying WALs. [docs]
- Read queries on the standby see a slightly delayed view of the data. Can configure `hot_standby_feedback` to prevent VACUUM on the primary from removing rows still needed by long-running standby queries. `max_standby_streaming_delay` / `max_standby_archive_delay` control how long WAL replay waits before cancelling standby queries that conflict with it.
Synchronous vs Asynchronous
- Asynchronous (Default): Primary commits transaction once WAL is written locally (`synchronous_commit = on` or `local`). Fastest, but potential for small data loss on primary crash if WAL hasn't reached standby.
- Synchronous: Primary waits for confirmation from one or more synchronous standbys (listed in `synchronous_standby_names`) that they have received/written/applied the WAL (depending on the `synchronous_commit` level: `remote_write`, `on` = flush on standby (the default synchronous level), or `remote_apply`) before confirming the commit to the client. Slower commits, but guarantees little or no data loss if a synchronous standby takes over.
- **Quorum Commit:** Configure `synchronous_standby_names` like `ANY N (stby1, stby2, ...)` to wait for any N standbys (example below).
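For example, a quorum-commit setup could be applied like this (standby names are placeholders; `remote_apply` is shown only to illustrate the strictest level):
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (stby1, stby2)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';  -- 'on' (remote flush) is the usual default
SELECT pg_reload_conf();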
Replication Slots
- Created on primary via `pg_create_physical_replication_slot()`. Dropped via `pg_drop_replication_slot()`. View with `pg_replication_slots`.
- Ensures the primary retains WAL segments needed by a specific standby, even if the standby disconnects temporarily. Prevents premature WAL deletion that would break replication. **Essential for robust replication.** [docs] Monitor disk usage on primary (`pg_wal`) if slots become inactive.
Other Concepts
- **Cascading Replication:** A standby can be configured to replicate from another standby, reducing load on the primary.
- **Monitoring:** Use `pg_stat_replication` on the primary and `pg_stat_wal_receiver` on the standby to check lag (`write_lag`, `flush_lag`, `replay_lag`); see the queries below.
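A monitoring sketch, run on the primary:
-- Per-standby replication progress and lag
SELECT application_name, client_addr, state, sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- Inactive slots retain WAL on the primary; watch pg_wal growth
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;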
Use Cases
Read scaling, High Availability (HA) failover, disaster recovery standby, near real-time data warehousing feed.
Logical Replication
Replicates logical data changes (row changes). Allows cross-version, selective replication via Publisher/Subscriber model. [docs]
Mechanism
- Uses logical decoding (`pgoutput` plugin is default) to convert WAL records into a stream of logical changes (BEGIN, COMMIT, INSERT, UPDATE, DELETE with row data). [decoding]
- Transmits these logical changes to subscribers which apply them as DML.
- Requires `wal_level = logical` on the publisher. Also needs sufficient `max_wal_senders` and `max_replication_slots`.
Publisher/Subscriber Model
- Publisher: The source database. Define a `PUBLICATION` specifying which tables (or `ALL TABLES`) and optionally which DML operations (`publish = 'insert, update'`) to publish. `CREATE PUBLICATION mypub FOR TABLE users, orders;` [docs]
- Subscriber: The target database. Define a `SUBSCRIPTION` connecting to the publisher and mapping to one or more publications. `CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (copy_data = false, create_slot = true);` [docs]
- Target tables must exist on the subscriber and have compatible schemas (usually identical names and data types, primary keys needed for UPDATE/DELETE).
- Uses replication slots on publisher automatically (`create_slot=true`).
Advantages over Streaming Replication
- Selective Replication: Replicate only specific tables or filter rows (using `WHERE` clause in publication, PG15+).
- Cross-Version/Platform Replication: Replicate between different major PostgreSQL versions or potentially to other systems.
- Consolidate/Distribute Changes: Multiple publishers can replicate to a single subscriber, or one publisher to many subscribers.
- **Writeable Subscriber:** Subscriber tables can be written to independently (can cause conflicts if not managed carefully).
Limitations
- Does not replicate DDL changes (schema changes must be applied manually on both sides, carefully). [restrictions]
- Does not replicate sequence changes or large objects directly.
- Can have higher performance overhead and replication lag than streaming replication.
- Initial data synchronization often requires separate step (`copy_data=true` option during `CREATE SUBSCRIPTION`, or manual `pg_dump`).
- Conflicts can occur if subscriber data is modified independently.
Use Cases
Selective data aggregation/distribution, feeding data warehouses, replicating between major versions during upgrades, zero-downtime upgrades (complex setup), data sharing between microservices.
HA / Failover Tools
While PostgreSQL provides the core replication mechanisms, automating the detection of primary failure and promotion of a standby requires external tooling. [PG Wiki Overview]
Common Tools
- Patroni: Popular Python-based template for building HA clusters. Uses a Distributed Configuration Store (DCS) like etcd, Consul, or Zookeeper for leader election and cluster state management. Manages PostgreSQL configuration and handles automated failover/switchover. Highly flexible and widely used, often in Kubernetes environments.
- repmgr: Open-source tool suite focused specifically on managing PostgreSQL streaming replication and failover. Provides monitoring, node management (`repmgrd` daemon), and failover/switchover commands. Can be simpler than Patroni for basic setups, less reliant on external DCS (but can integrate).
- Pgpool-II: Acts as a middleware proxy providing connection pooling, load balancing for read replicas, and automated failover capabilities (watchdog process). Can also parallelize queries. Often considered more complex to configure correctly for HA compared to dedicated failover managers.
- Cloud Provider Solutions: Managed services like AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL often have built-in HA features that handle failover automatically behind the scenes using their own control planes.
Key HA Concepts Managed by Tools
- Health Monitoring: Regularly checking the status of primary and standby nodes via SQL connections or other means.
- Failure Detection: Identifying when the primary server is unresponsive based on monitoring checks and timeouts.
- Leader Election/Standby Promotion: Choosing the best standby (e.g., least lag, priority setting) to promote to the new primary using `pg_promote()` (or `pg_ctl promote`).
- Topology Updates: Reconfiguring remaining standbys to follow the newly promoted primary (updating `primary_conninfo`).
- Client Redirection: Ensuring applications connect to the new primary (often via DNS update, virtual IP failover managed by tools like Keepalived, or interaction with a load balancer/proxy layer like PgBouncer/HAProxy configured by the HA tool).
- Split-Brain Prevention: Mechanisms (usually via the DCS in Patroni, or fencing scripts) to ensure only one node acts as the primary at any time, even during network partitions.
Backup & Recovery
Logical Backups (`pg_dump`)
Creates logical backups (SQL commands). Flexible, portable. Slower restore for large DBs. Not for PITR. Use `pg_dumpall` for globals. [docs]
Tool & Purpose
`pg_dump` utility connects to a database and generates a script file containing SQL commands (`CREATE TABLE`, `COPY` data, `CREATE INDEX`, etc.) needed to recreate the database objects and data.
Key Features & Options
- Formats: Plain text (`.sql`, `-Fp`), Custom (`.dump`, compressed, requires `pg_restore`, `-Fc`), Directory (`dir/`, parallel dump/restore, requires `pg_restore`, `-Fd`), Tar (`.tar`, requires `pg_restore`, `-Ft`). Custom/Directory formats are generally preferred for flexibility (parallel restore, object selection during restore).
- Consistency: Runs inside a single `REPEATABLE READ` transaction so the dump reflects a consistent snapshot; it holds `ACCESS SHARE` locks on the dumped tables (see `--lock-wait-timeout`), which conflict only with DDL, not with normal reads/writes.
- Granularity: Can dump entire DB, specific schemas (`-n schema`), specific tables (`-t table`). Exclude schemas/tables (`-N`, `-T`). Dump data-only (`-a`), schema-only (`-s`).
- Portability: Backups are generally portable across different machine architectures and PostgreSQL major versions (though compatibility issues can arise with very old/new versions).
- Restoration: Plain text restored via `psql < dump.sql`. Custom/Directory/Tar formats restored using `pg_restore` (allows parallel restore with `-j`).
`pg_dumpall`
- `pg_dumpall` utility dumps *all* databases in a cluster plus global objects (roles, tablespaces). Output is plain SQL format.
- Use `-g` for globals only, `-r` for roles only, `-t` for tablespaces only.
- Essential for backing up the complete cluster state including users/permissions.
Pros
- Very flexible for migrating data, restoring specific objects, upgrading major versions.
- Human-readable format (plain text).
Cons
- Can be slow to restore large databases because it replays all SQL commands and rebuilds indexes from scratch.
- Not suitable for Point-in-Time Recovery (PITR).
- `pg_dump` (not `pg_dumpall`) doesn't include global objects.
Example
pg_dump -U postgres -Fc -f my_database.dump my_database
pg_dumpall -U postgres -f globals.sql --globals-only
pg_restore -U postgres -d target_database -j 4 my_database.dump
Physical Backups (`pg_basebackup`)
Creates a binary copy of the database cluster files. Faster restore than `pg_dump`. Foundation for PITR & replication setup. [docs]
Tool & Purpose
`pg_basebackup` connects to a running primary or standby server using the replication protocol and copies the entire data directory (`PGDATA`) file by file.
Key Features & Options
- Binary Copy: Creates a filesystem-level copy of the database files at a specific point in the WAL stream.
- Consistency: Ensures a consistent snapshot by coordinating with the server (`pg_backup_start`/`stop`).
- WAL Inclusion (`-X`/`--wal-method`):
- `none`: Don't include WAL (requires separate WAL archiving for recovery).
- `fetch`: Include WAL files needed to make backup consistent (fetched after files copied).
- `stream`: Stream WAL files alongside data files during backup (preferred method, ensures needed WAL is captured).
- Format (`-F`/`--format`): `plain` (copy of `PGDATA`), `tar` (`-Ft`).
- Progress (`-P`/`--progress`). Checkpoint (`-c`/`--checkpoint=fast|spread`). Rate Limit (`--max-rate`).
Pros
- Much faster restoration for large databases compared to `pg_dump`, as no SQL needs re-execution or index rebuilding (just WAL replay from the backup checkpoint onwards).
- The necessary starting point for Point-in-Time Recovery (PITR).
- Can be used to easily set up new streaming replication standbys.
Cons
- Less flexible: Restores the entire database cluster. Not easy to restore single tables (requires restoring cluster elsewhere and extracting).
- Backup size is the full size of the data directory (unless using external tools with compression/deduplication).
- Generally not portable across major PostgreSQL versions or different machine architectures (endianness, block size).
Example
pg_basebackup -h source_host -U replicator -D /path/to/backup/dir -Ft -X stream -P -R
(Creates backup in tar format, streams WAL, shows progress, and writes basic recovery settings to output dir)
Point-in-Time Recovery (PITR)
Restore database to any specific moment using a physical base backup and continuous WAL Archiving. Requires careful setup and testing. [docs]
Concept
Combines a physical base backup with archived WAL segment files to replay database changes *beyond* the end of the base backup, stopping at a specific target point (timestamp, transaction ID, named restore point). [recovery]
Requirements
- Physical Base Backup: Taken using `pg_basebackup` or similar method (e.g., `pgBackRest`, `Barman`). [base backup]
- Continuous WAL Archiving: The primary server must be configured (`wal_level = replica` or higher, `archive_mode = on` or `always`, `archive_command`) to continuously copy completed WAL segments to a separate, safe archive location (e.g., NFS, S3 via helper scripts). [archiving]
Recovery Process (High Level)
- Restore the chosen physical base backup to a new data directory.
- Ensure the WAL archive is accessible from the recovery location.
- Configure recovery settings in `postgresql.conf` (PG12+) or create `recovery.conf` (pre-PG12).
- `restore_command`: Command to fetch WAL files from archive (e.g., `restore_command = 'cp /path/to/archive/%f %p'` or script using `aws s3 cp`, etc.). `%f`=filename, `%p`=path to copy to.
- Recovery Target: Specify *when* to stop recovery (e.g., `recovery_target_time`, `_xid`, `_lsn`, `_name`). Or `recovery_target = 'immediate'` to recover just past backup end.
- `recovery_target_action`: `pause` (default), `promote` (finish recovery and become primary), `shutdown`.
- Create a `recovery.signal` file (PG12+) in the data directory to trigger recovery mode on startup. (Remove `standby.signal` if present).
- Start PostgreSQL on the restored directory. It will enter recovery mode, replay WALs using `restore_command` until the target is reached, then perform the `recovery_target_action`.
Benefits
- Recover from logical errors (accidental deletes/updates) by restoring to a point *before* the incident occurred.
- Provides fine-grained disaster recovery capability, minimizing data loss.
Tools & Verification
Tools like `pgBackRest` or `Barman` significantly simplify managing base backups, WAL archiving, retention, and the PITR process. **Regularly test your backups and PITR procedures!**
Backup Management Tools
Tools like pgBackRest, Barman, WAL-G simplify backup/restore/PITR, adding features like parallel ops, incremental/diff backups, compression, validation.
While `pg_dump` and `pg_basebackup` provide basic backup capabilities, dedicated tools offer significant advantages for managing backups, especially in production environments.
Popular Tools
- **pgBackRest:** Feature-rich, reliable, high-performance tool.
- Features: Parallel backup/restore, full/incremental/differential backups, compression, checksums, backup validation, S3/Azure/GCS support, flexible retention policies, PITR automation, tablespace support.
- Widely adopted in the community. Actively developed.
- **Barman (Backup and Recovery Manager):** Python-based tool, integrates well with standard PG tools.
- Features: Manages base backups and WAL archiving, PITR, compression, remote backup/restore, replication slot management, backup catalog.
- Mature and stable option.
- **WAL-G:** Successor to WAL-E. Focuses on WAL archiving/fetching and base backups, often to cloud storage.
- Features: Parallel WAL push/fetch, compression, encryption, S3/GCS/Azure/Swift support, base backups, PITR support.
- Often used for its efficient WAL archiving capabilities.
Why Use Them?
- **Automation:** Simplify complex processes like WAL archiving, PITR setup, retention management.
- **Performance:** Parallel operations significantly speed up backup/restore for large databases.
- **Efficiency:** Incremental/differential backups save time and storage space compared to repeated full base backups. Compression reduces storage costs.
- **Reliability:** Features like checksums and validation help ensure backup integrity.
- **Cloud Integration:** Built-in support for major cloud storage providers.
Recommendation
For any serious production deployment requiring PITR, using a dedicated backup tool like `pgBackRest` or `Barman` is highly recommended over manual scripting of `pg_basebackup` and `archive_command`.
Verification
Regardless of the tool used, **regularly schedule and perform test restores** (ideally PITR tests) to ensure your backups are valid and your recovery procedures work as expected.
Security
Authentication (`pg_hba.conf`)
Host-Based Authentication controls *who* can connect from *where* using *which* method (e.g., `scram-sha-256`, `md5`, `peer`, `cert`, `ldap`). Order matters! [docs]
File Location & Reload
Located in the data directory (`PGDATA`). Requires a server reload (`pg_ctl reload` or `SELECT pg_reload_conf();`) to apply changes.
Format
Each line defines a connection rule: [docs]
type database user address method [options]
- `type`: Connection type (`local` for Unix sockets, `host` for TCP/IP whether or not SSL is used, `hostssl` for SSL-only TCP/IP, `hostnossl` for non-SSL TCP/IP only).
- `database`: Database name (`all`, specific name, comma-separated list, `@file`, `replication`).
- `user`: Role name (`all`, specific name, comma-separated list, `@file`).
- `address`: Client IP address range (CIDR notation like `192.168.1.0/24`, `samehost`, `samenet`, `all`). Required for `host*` types.
- `method`: Authentication method: [methods]
- `trust`: Allow connection unconditionally, without any password. Acceptable only for tightly controlled `local` connections, if at all; very dangerous otherwise.
- `reject`: Deny connection unconditionally.
- `scram-sha-256`: Salted Challenge Response Authentication (preferred password method since PG10).
- `md5`: Older challenge-response password method (less secure than SCRAM, avoid if possible).
- `password`: Send plain text password (avoid unless connection is SSL/TLS secured via `hostssl`).
- `peer`: For `local` connections, authenticate if OS username matches database role name.
- `ident`: For TCP/IP, query client's ident server (rarely used, often blocked).
- `ldap`, `gss`, `sspi`, `pam`, `radius`: Integrate with external auth systems.
- `cert`: Authenticate using client SSL certificate (`clientcert=verify-ca` or `verify-full` option).
- `options`: Method-specific options (e.g., `clientcert`, LDAP config, PAM service name).
Processing Order
Rules are processed sequentially from top to bottom. The *first* matching rule for the connection type, database, user, and source address is used. Place more specific rules *before* more general ones.
Best Practices
Be specific. Avoid `all all 0.0.0.0/0 trust`. Use `scram-sha-256` for passwords. Use `hostssl` to enforce SSL/TLS for remote TCP/IP connections. Use `peer` authentication for local superuser access where appropriate. Limit `replication` connections to specific users/addresses.
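After editing and reloading, the `pg_hba_file_rules` view (PG10+) shows how the server parsed the file and flags broken lines:
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules;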
SSL/TLS Encryption
Encrypt client-server communication. Configure server (`ssl=on`, certs) and enforce via `pg_hba.conf` (`hostssl`). Verify certs via client `sslmode`. [docs]
Purpose
Protect data in transit between clients and the database server from eavesdropping and tampering, especially over untrusted networks.
Server Configuration (`postgresql.conf`)
- `ssl = on`: Enable SSL/TLS support.
- `ssl_cert_file = 'server.crt'`: Path to the server's PEM certificate file.
- `ssl_key_file = 'server.key'`: Path to the server's PEM private key file. Permissions must be restrictive (e.g., `0600`, readable only by postgres user).
- `ssl_ca_file = 'root.crt'`: (Optional) Path to trusted Certificate Authority certs for verifying client certificates (if using `cert` auth).
- `password_encryption` should be `scram-sha-256` (or `md5`) even with SSL.
- [Server Files] Restart server after changing SSL settings.
Enforcing SSL/TLS (`pg_hba.conf`)
- Use connection type `hostssl` instead of `host` for specific rules. This forces clients matching that rule to use an SSL connection; non-SSL attempts will be rejected. [docs]
- Example: `hostssl all myuser 192.168.1.0/24 scram-sha-256` (Requires SSL for `myuser` from that subnet).
- Using `host` allows both SSL and non-SSL connections if `ssl=on` is set. Use `hostnossl` to explicitly allow only non-SSL (rarely needed).
Client Configuration
- Clients specify connection parameters to control SSL usage. [libpq client docs]
- Connection String/Libpq: `sslmode` parameter is crucial:
- `disable`: No SSL.
- `allow`: Try non-SSL first, then SSL if server requires.
- `prefer`: Try SSL first, then non-SSL if server doesn't support. (Default)
- `require`: Only try SSL. Fail if server doesn't support. (Doesn't verify cert).
- `verify-ca`: Only try SSL, verify server cert against trusted CA (`sslrootcert` parameter). Fail if no match.
- `verify-full`: Only try SSL, verify CA, *and* verify server hostname matches certificate CN/SAN. **Most secure.**
- Other parameters: `sslcert` (client cert), `sslkey` (client key), `sslrootcert` (path to trusted CA cert(s) on client).
Client Certificate Authentication
Require clients to present a valid certificate signed by a trusted CA (configured via `ssl_ca_file`): either use the `cert` auth method (authentication by certificate, which implies full verification) or add the `clientcert=verify-ca` / `verify-full` option to another auth method's `pg_hba.conf` rule. [cert auth]
`SECURITY DEFINER` Functions
Functions run with privileges of the *definer* (owner), not the *caller*. Powerful but dangerous. Secure carefully! [docs]
Concept
By default, functions run with `SECURITY INVOKER` semantics - they execute with the permissions of the user calling the function. Functions defined with `SECURITY DEFINER` execute with the privileges of the user who *owns* the function.
Use Cases
- Allowing less privileged users to perform specific, controlled actions on tables they don't normally have direct access to (e.g., insert into a specific log table, run a specific update).
- Abstracting complex operations involving multiple tables with different permissions.
Security Risks & Best Practices (CRITICAL!)
If not written carefully, `SECURITY DEFINER` functions can be exploited for privilege escalation.
- **Restrict `EXECUTE` Privilege:** `REVOKE EXECUTE ON FUNCTION my_secdef_func(...) FROM PUBLIC;` then `GRANT EXECUTE ON FUNCTION ... TO specific_role;`. Never grant `EXECUTE` to `PUBLIC` unless absolutely necessary and safe.
- **Set Secure `search_path`:** Inside the function, *always* set a safe `search_path` to prevent hijacking by objects in untrusted schemas: `SET search_path = pg_catalog, public;` (or just `pg_catalog` if `public` isn't needed). Do this at the function definition level: [secure path]
CREATE FUNCTION ... SECURITY DEFINER SET search_path = pg_catalog AS $$ ... $$ LANGUAGE ...;
- **Qualify All Object Names:** Explicitly schema-qualify all tables, functions, types, etc. accessed within the function (e.g., `SELECT * FROM myschema.mytable;`) unless the `search_path` is absolutely trustworthy (rare).
- **Validate Inputs:** Treat all function arguments as potentially hostile. Sanitize and validate them appropriately.
- **Principle of Least Privilege:** Ensure the function *owner* has only the minimum privileges necessary for the function's operation. Avoid owning `SECURITY DEFINER` functions as a superuser if possible.
Example (Conceptual)
-- Owner: admin_role (has INSERT on audit_log)
-- Caller: app_user (NO insert on audit_log)
CREATE FUNCTION log_action(user_id integer, action text)
RETURNS void AS $$
BEGIN
INSERT INTO audit_schema.audit_log(actor_id, action_desc, log_time)
VALUES (user_id, action, now());
END;
$$ LANGUAGE plpgsql SECURITY DEFINER
SET search_path = pg_catalog; -- Secure path!
REVOKE EXECUTE ON FUNCTION log_action(integer, text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION log_action(integer, text) TO app_user;
Encryption at Rest
Protect data stored on disk. Options: Full Disk Encryption (FDE) at OS/HW level (common), filesystem-level, or column-level via `pgcrypto`.
Goal
Prevent unauthorized access to database files (`PGDATA`, WAL, backups) if physical storage media is compromised.
Common Approaches
- **Full Disk Encryption (FDE):**
- Implemented at the Operating System level (e.g., LUKS on Linux, BitLocker on Windows, FileVault on macOS) or Hardware level (Self-Encrypting Drives - SEDs).
- Encrypts the entire block device containing the database files.
- Transparent to PostgreSQL. Protects against offline attacks (stolen drives).
- Does *not* protect against threats when the OS is running and the filesystem is mounted (e.g., malicious OS user, SQL injection).
- **Generally the recommended baseline approach.** Often required by compliance standards.
- **Filesystem-Level Encryption:**
- Some filesystems offer native encryption features (e.g., ZFS encryption, ext4 fscrypt).
- Encrypts files or directories within a mounted filesystem.
- May offer more granularity than FDE. Similar protection profile (primarily offline attacks).
- **Column-Level Encryption (using `pgcrypto`):** (see the sketch after this list)
- Encrypt specific sensitive columns within tables using functions like `pgp_sym_encrypt()` / `pgp_sym_decrypt()` (symmetric) or `pgp_pub_encrypt()` / `pgp_priv_decrypt()` (asymmetric).
- Provides protection even from privileged database users (e.g., DBAs) if they don't have the decryption key.
- Requires application logic to handle encryption/decryption. Keys must be managed securely outside the database.
- Significantly impacts query performance (cannot index encrypted data directly for searching, except with workarounds like hashing or specialized indexes on unencrypted components if feasible).
- Use case: Protecting highly sensitive specific fields (e.g., SSNs, credit card numbers - though PCI DSS has strict rules).
- **Transparent Data Encryption (TDE):**
- Some commercial forks or extensions of PostgreSQL offer TDE solutions that automatically encrypt/decrypt data files, often with external key management integration.
- Not part of standard PostgreSQL core.
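A minimal `pgcrypto` sketch for column-level encryption (table, column, and key are hypothetical; the key must be supplied and stored outside the database):
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE customer_secrets (
    customer_id integer PRIMARY KEY,
    ssn_enc     bytea   -- ciphertext; cannot be indexed or searched as plaintext
);
INSERT INTO customer_secrets (customer_id, ssn_enc)
VALUES (1, pgp_sym_encrypt('123-45-6789', 'app-supplied-key'));
SELECT pgp_sym_decrypt(ssn_enc, 'app-supplied-key') AS ssn
FROM customer_secrets WHERE customer_id = 1;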
Key Management
Securely managing encryption keys is crucial for FDE, filesystem, and column-level encryption. Use hardware security modules (HSMs), cloud KMS, or robust key management practices.
Auditing
Track database activity. Use standard logging (`log_statement`, `log_connections`, etc.) or dedicated extension `pgaudit` for detailed, structured logging.
Why Audit?
Security monitoring (detecting suspicious activity), compliance requirements (HIPAA, SOX, GDPR), troubleshooting application behavior.
Built-in Logging Parameters (`postgresql.conf`)
- `log_statement`: Logs executed SQL statements (`none`, `ddl`, `mod`, `all`). `all` is very verbose, use with caution.
- `log_min_duration_statement`: Log statements exceeding a duration (ms). Good for performance analysis, less for security audit.
- `log_connections` / `log_disconnections`: Log session start/end.
- `log_hostname`: Log client hostname (can add overhead).
- `log_line_prefix`: Customize log entry format (include user, db, pid, timestamp, etc. - essential context). Example: `'%m [%p] %q%u@%d/%a '`.
- Limitations: Can be hard to parse reliably, may miss some details, performance overhead with `log_statement=all`.
`pgaudit` Extension
- `pgaudit` provides detailed, structured audit logging. Requires `shared_preload_libraries = 'pgaudit'` and restart.
- **Features:**
- Logs specific event types (READ, WRITE, FUNCTION, ROLE, DDL, MISC).
- Fine-grained object-level auditing (log access only to specific tables).
- Logs statement text *and* parameters separately (optional).
- Structured output format (e.g., JSON) for easier parsing by log analysis tools (SIEM).
- **Configuration:** Via `postgresql.conf` parameters (e.g., `pgaudit.log = 'read, write'`, `pgaudit.log_parameter = on`, `pgaudit.role = 'auditor_role'`); see the sketch below.
- **Recommendation:** Generally preferred over `log_statement=all` for serious auditing needs due to structure and granularity.
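A sketch of enabling it, assuming `pgaudit` is already listed in `shared_preload_libraries` and the server has been restarted (the setting values are only examples):
CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'write, ddl, role';
ALTER SYSTEM SET pgaudit.log_parameter = on;
SELECT pg_reload_conf();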
Log Management
Ensure logs are stored securely, rotated appropriately, and ideally shipped to a central log management system for analysis and alerting.
Essential Tools
`psql` Command-Line
The indispensable interactive terminal. Master its meta-commands (`\?`, `\d`, `\l`, `\timing`, `\x`, `\e`, `\copy`, etc.) for efficiency. [docs]
Connecting
psql -h host -p port -U user -d dbname
Key Meta-Commands (`\`) [full list]
- Help & Info: `\?`, `\h [SQL]`, `\l[+]`, `\c[onnect] ...`, `\conninfo`, `\encoding [enc]`, `\password [user]`, `\q`.
- Object Inspection (`\d`): [docs `\d`] `\d[S+] [pat]`, `\dt` (tables), `\di` (indexes), `\ds` (sequences), `\dv` (views), `\dm` (mat views), `\df` (functions), `\dT` (types), `\dn` (schemas), `\du` / `\dg` (roles), `\dp` (privs), `\ddp` (def privs), `\dx` (extensions), `\dy` (event triggers), `\dL` (languages), `\des` (foreign servers), `\dew` (fdws).
- Query Exec & Format: `\timing`, `\x [on|off|auto]`, `\H` (HTML), `\watch [sec]`, `\g [file]`, `\s [file]`, `\i file`, `\e [file]`, `\ef [func]`, `\prompt name var`, `\set name val`, `\unset name`, `\echo text`, `\o [file]` (redirect output).
- Data I/O: `\copy table FROM/TO 'file' [opts]` (Client-side).
- **Transaction:** `\set AUTOCOMMIT off/on`, `BEGIN`, `COMMIT`, `ROLLBACK`.
Variables & Scripting
- Use `\set name value` to set variables. Reference with `:name` (raw substitution), `:'name'` (quoted as a literal), or `:"name"` (quoted as an identifier).
- Use `-v name=value` on command line.
- Conditional execution with `\if expr` ... `\elif expr` ... `\else` ... `\endif`.
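A small illustration (run inside `psql`; the table name is hypothetical):
\set min_id 100
\set tbl 'users'
SELECT id, name FROM :"tbl" WHERE id > :min_id;  -- :"name" quotes as identifier, :name substitutes raw
\if :{?min_id}
  \echo min_id is set
\endif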
Customization
`.psqlrc` file in home directory for custom settings (e.g., `\set PROMPT1 '%n@%m:%> '`), aliases (`\set`), startup commands.
Environment variables like `PGHOST`, `PGPORT`, `PGUSER`, `PGDATABASE`, `PGPASSWORD` (use `.pgpass` file instead for password). [Env Vars]
GUI Tools
Common Options
- pgAdmin:
- Official open-source graphical tool.
- Features: Server status monitoring, object browser/editor, SQL query tool (explain analyze visualization), data editor, backup/restore UI, user management, schema diff.
- Runs as a web application (pgAdmin 4). Can feel heavy sometimes.
- DBeaver:
- Free, open-source universal database tool (supports many DBs).
- Features: Connection management, SQL editor (auto-complete, formatting), visual query builder, data viewer/editor, ER diagrams, data transfer/migration tools, SSH tunneling.
- Popular choice due to versatility and performance. Community & Pro editions.
- DataGrip (JetBrains):
- Commercial database IDE from JetBrains.
- Features: Advanced SQL editor (introspection, refactoring, excellent code completion), schema navigation/comparison, data editor, import/export, VCS integration, SSH tunneling.
- Powerful, excellent editor. Requires license (free for students/OS projects). Included in IntelliJ Ultimate.
- Other Tools: Postico (macOS), TablePlus (Native GUI, multi-platform), Navicat (Commercial).
Choosing a Tool
Depends on preference, budget, OS, and required features. `psql` is essential for scripting and server access. GUIs are helpful for visual exploration, administration tasks, and complex query development/debugging.
Admin Command-Line Tools
Server-side tools: `pg_ctl` (start/stop/reload), `initdb` (create cluster), `pg_controldata` (read control file), `pgbench` (benchmark).
These tools are typically found in the PostgreSQL `bin` directory and operate directly on the database cluster or server process.
- **`pg_ctl`:** Utility for initializing, starting, stopping, restarting, reloading configuration, promoting standbys, and getting status of a PostgreSQL server.
- Syntax: `pg_ctl -D /path/to/PGDATA start|stop|restart|reload|status|promote`
- Often used by service management scripts (`systemd`, `init.d`).
- **`initdb`:** Creates a new PostgreSQL database cluster (the `PGDATA` directory structure and template databases). Run only once to set up a new instance.
- Options: `-D` (data dir), `-U` (superuser name), `-W` (prompt for password), `--locale`, `--encoding`.
- **`pg_controldata`:** Displays control information stored in `PGDATA/global/pg_control`, such as WAL segment info, checkpoint location, cluster state, database OID. Useful for diagnostics. Run while server is stopped.
- Syntax: `pg_controldata /path/to/PGDATA`
- **`pg_resetwal` (formerly `pg_resetxlog`):** Emergency recovery tool to reset WAL information. **Extremely dangerous!** Can cause data loss/corruption. Use only as a last resort if the server won't start due to WAL corruption and you understand the consequences. Requires the server to be stopped.
- **`pgbench`:** Simple benchmarking tool included with PostgreSQL.
- Initialize test tables: `pgbench -i [options] dbname`
- Run standard TPC-B like test: `pgbench [options] dbname`
- Run custom scripts: `pgbench -f script.sql [options] dbname`
- Options: `-c` (clients), `-j` (threads), `-T` (duration), `-P` (progress report interval), `-n` (no vacuum).
- Useful for basic performance testing and comparing configuration changes.
Key Quirks & Terminology Recap
Identifier Case Sensitivity
Unquoted identifiers are folded to lowercase. Use double quotes (`"MyTable"`) to preserve case (causes hassles!). Max length ~63 chars. [docs]
The Rule
- If you create a table like `CREATE TABLE MyUsers (...)`, PostgreSQL stores it as `myusers`. Subsequent queries like `SELECT * FROM MyUsers;` or `SELECT * FROM myusers;` will both work because the unquoted identifier in the query is also folded to lowercase.
- If you create a table using double quotes `CREATE TABLE "MyUsers" (...)`, PostgreSQL stores the name exactly as `MyUsers` (preserving case).
- To query this case-sensitive table, you *must* use double quotes: `SELECT * FROM "MyUsers";`. Querying `SELECT * FROM MyUsers;` or `SELECT * FROM myusers;` will fail (unless a lowercase table `myusers` also happens to exist). [docs]
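A quick illustration:
CREATE TABLE MyUsers (id int);      -- stored as myusers
SELECT * FROM MYUSERS;              -- works: folded to myusers
CREATE TABLE "MyUsers" (id int);    -- a second, case-sensitive name
SELECT * FROM "MyUsers";            -- quoting now required for this one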
Migration Impact
Databases like SQL Server or MySQL (on Windows/macOS by default) are often case-insensitive or preserve case differently. Migrating schemas and queries requires careful attention to quoting or consistent use of lowercase identifiers to avoid errors.
Identifier Length
Default maximum length for identifiers (tables, columns, functions, etc.) is 63 bytes (`NAMEDATALEN` constant). Can be changed by recompiling PostgreSQL if absolutely necessary (rarely done).
Best Practice
Avoid quoted identifiers unless absolutely necessary (e.g., using reserved words, requiring special characters). Use lowercase names with underscores (`my_table`, `user_id`) consistently for maximum portability and minimum confusion.
VACUUM Necessity (MVCC)
MVCC causes bloat (dead rows). Regular `VACUUM` (via Autovacuum) is MANDATORY for performance and to prevent TXID wraparound failure.
Recap
MVCC doesn't physically delete or update rows in place. Old row versions (dead tuples) remain until cleaned up by `VACUUM`. [routine vacuuming]
Consequences of Neglect
- Bloat: Tables and indexes grow larger than necessary, wasting disk space.
- Performance Degradation: Sequential scans read more dead tuples; index scans might become less efficient; Index-Only Scans might fail if the Visibility Map isn't updated.
- TXID Wraparound Failure: Transaction IDs are 32-bit numbers. If `VACUUM` doesn't run often enough to "freeze" very old tuple transaction IDs on extremely busy or long-running systems (where >2 billion transactions occur between vacuums of oldest data), the database may eventually be forced to shut down into single-user mode to prevent data corruption when the TXID counter wraps around. This is catastrophic but preventable with proper vacuuming. Monitor `age(datfrozenxid)` via SQL.
Action Required
- Ensure Autovacuum is enabled (`autovacuum = on`).
- **Monitor Autovacuum:** Check logs (`log_autovacuum_min_duration`), `pg_stat_user_tables` (last vacuum times, dead tuples). Monitor TXID age (`SELECT datname, age(datfrozenxid) FROM pg_database;`).
- **Tune Autovacuum:** Adjust settings (`*_scale_factor`, `*_threshold`, `*_cost_delay/limit`, `max_workers`, `autovacuum_freeze_max_age`) based on table size, update frequency, and system resources. Defaults are often insufficient for active databases. [config]
- Occasionally monitor for excessive bloat (see Bloat Monitoring card) and consider manual `VACUUM`, `REINDEX CONCURRENTLY`, or `pg_repack` if necessary.
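For a large, frequently updated table that autovacuum cannot keep up with, per-table storage parameters can be tightened (table name and values are illustrative):
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead tuples instead of the 20% default
    autovacuum_vacuum_threshold    = 1000
);
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;  -- watch wraparound risk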
Connection Pooling (External)
The Issue
Each connection to PostgreSQL forks a new OS process (`postgres` backend), consuming significant RAM (tens of MBs) and CPU resources for process management. Creating and tearing down connections is also relatively expensive. [process model]
Applications (especially web/serverless apps) that open many short-lived connections directly to the database can quickly exhaust server resources (`max_connections` and memory) and lead to poor performance.
The Solution: External Connection Poolers
Middleware that maintains a pool of persistent physical connections to the PostgreSQL server. Application connects to the pooler instead of directly to Postgres. [PG Wiki Pooling]
- PgBouncer: Lightweight, single-purpose connection pooler. Very popular, stable, low overhead. Offers different pooling modes:
- Session Pooling: Client keeps logical connection and physical backend connection until disconnect (least effective pooling, mostly just limits connection count).
- Transaction Pooling: Physical connection returned to pool after each transaction (`COMMIT`/`ROLLBACK`). Requires client not to use session-based features (e.g., `SET` commands, advisory locks, temp tables) across transactions. **Commonly used.**
- Statement Pooling: Physical connection returned after each statement (most aggressive, most restrictions, rare).
- Pgpool-II: More feature-rich middleware. Provides connection pooling, load balancing across replicas, replication management, and automated failover (watchdog). Higher overhead and complexity than PgBouncer.
- Some application frameworks/drivers have built-in pooling, but an external pooler is often still beneficial for managing total server connections across multiple app instances.
Benefits of Pooling
- Reduces connection overhead (memory/CPU) on the PostgreSQL server.
- Allows supporting thousands of application clients with only tens or hundreds of actual database connections.
- Improves performance by reusing existing connections, reducing connection latency.
Recommendation
For most applications making frequent connections, using PgBouncer in transaction pooling mode between the application and the database is highly recommended.
`public` Schema Defaults
Default schema where all roles have `CREATE` and `USAGE` initially. Security risk! Best practice: `REVOKE CREATE`, use dedicated schemas. [docs]
Default Behavior
- Every new database contains a schema named `public`. [docs]
- By default, the special `PUBLIC` pseudo-role (meaning all roles) is granted `CREATE` (permission to create objects like tables) and `USAGE` (permission to access objects within) privileges on the `public` schema. (Since PostgreSQL 15, `CREATE` on `public` is no longer granted to `PUBLIC` by default.)
- If no schema is specified when creating an object (e.g., `CREATE TABLE mytable (...)`), it gets created in the first schema listed in the user's `search_path` where they have `CREATE` permission, which is typically `public`.
Problems
- Can lead to a cluttered namespace with objects from different applications or users mixed together.
- Security Risk: Any user can potentially create objects (tables, functions) in `public`, potentially leading to naming conflicts or unexpected behavior (e.g., function/operator hijacking if `public` is early in the `search_path`). [patterns]
Recommended Practice
- **Revoke Default Privileges:** As a superuser, immediately after database creation (or in template database `template1`):
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
-- Optional, but recommended for stricter control:
-- REVOKE USAGE ON SCHEMA public FROM PUBLIC;
-- GRANT USAGE ON SCHEMA public TO specific_roles_that_need_it;  -- e.g., for extensions
- **Create Dedicated Schemas:** Use schemas to organize objects logically (e.g., `CREATE SCHEMA app_schema AUTHORIZATION app_owner;`). [create schema]
- **Grant Permissions Explicitly:** Grant necessary `USAGE` on the schema and object-specific privileges (SELECT, INSERT, etc.) on objects within dedicated schemas to specific application roles. [grant]
- **Set `search_path` Carefully:** Configure the `search_path` appropriately for application roles (e.g., `ALTER ROLE app_user SET search_path = app_schema, public;`) so they don't need to schema-qualify table names frequently, but still prioritize the correct schema. Place `public` last or omit it if possible.
`search_path`
Determines the order schemas are searched for unqualified objects (tables, functions). Impacts name resolution and security. Default includes user schema, then `public`. [docs]
Functionality
When you reference an object (table, function, type, operator, etc.) without explicitly qualifying it with a schema name (e.g., `SELECT * FROM users` instead of `SELECT * FROM app_schema.users`), PostgreSQL searches the schemas listed in the `search_path` setting in order, using the first matching object found.
Default Value
Typically defaults to `"$user", public`. This means it first looks for a schema with the same name as the current user role (`CREATE SCHEMA myuser AUTHORIZATION myuser;`), then looks in the `public` schema. If `$user` schema doesn't exist, it's skipped.
Setting the Path
- Session level: `SET search_path TO schema1, schema2, public;`
- User level (persistent for user): `ALTER ROLE myuser SET search_path = schema1, public;`
- Database level (default for db): `ALTER DATABASE mydb SET search_path = schema1, public;`
- Server level (`postgresql.conf`): Sets the system default for new sessions.
- Function level: `ALTER FUNCTION myfunc(...) SET search_path = ...;` or `SET search_path = ...` in function definition (crucial for `SECURITY DEFINER`).
- [setting path]
Security Implications (Trojan Horse)
- If a user can create objects in a schema listed *earlier* in the `search_path` than the intended schema (especially `public` if default `CREATE` privileges aren't revoked), they might be able to create objects (like functions or operators) that hijack calls intended for objects in later schemas. [security]
- Example: If `search_path = public, app_schema`, and a malicious user creates a function `public.important_function()`, a call to `important_function()` (expecting `app_schema.important_function()`) might execute the malicious public version. This is especially dangerous within `SECURITY DEFINER` functions if their `search_path` isn't explicitly secured.
Best Practice
- Always be aware of the current `search_path` (`SHOW search_path;`).
- Set a specific, minimal, and controlled `search_path` for application roles (e.g., only the necessary application schema(s), maybe `pg_catalog`).
- Place `public` last in the path, or remove it entirely if not needed (e.g., for extensions).
- Qualify object names with schemas (`app_schema.users`) in security-sensitive contexts (like `SECURITY DEFINER` functions) or when ambiguity is possible.
- Follow best practices for the `public` schema (revoke default create privileges).
Unified Role Concept
PostgreSQL uses Role for both users (with `LOGIN`) and groups (with `NOLOGIN`). Roles can inherit privileges from other roles they are members of.
Unified Model
Unlike some databases with distinct concepts for "User" (can log in) and "Group" (container for privileges), PostgreSQL uses a single `ROLE` construct for both. [docs]
- A role that can log in has the `LOGIN` attribute (`CREATE ROLE myuser WITH LOGIN PASSWORD '...';` is equivalent to `CREATE USER ...`).
- A role intended as a group typically has `NOLOGIN` (`CREATE ROLE readonly_group WITH NOLOGIN;`).
Membership and Inheritance
- Roles can be members of other roles: `GRANT group_role TO user_role;`. [membership]
- By default (the `INHERIT` attribute is the default), a role automatically gains all privileges granted to the roles it is a member of, and, recursively, the privileges of the roles *those* roles are members of.
- Example: If `readonly_group` is granted `SELECT` on `mytable`, and `myuser` is granted membership in `readonly_group` (`GRANT readonly_group TO myuser;`), then `myuser` automatically has `SELECT` permission on `mytable`.
- Give a member role the `NOINHERIT` attribute if the privileges of its group roles should only become active after an explicit `SET ROLE group_role;`.
- Use `SET ROLE` to temporarily assume the privileges of another role the current user is a member of (see the sketch below).
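A short sketch of the group-role pattern under these rules; `app_schema`, `readonly_group`, `batch_admin`, and the user names are illustrative:
-- Group roles (NOLOGIN) hold the privileges; login roles gain them via membership
CREATE ROLE readonly_group NOLOGIN;
GRANT USAGE ON SCHEMA app_schema TO readonly_group;
GRANT SELECT ON ALL TABLES IN SCHEMA app_schema TO readonly_group;

CREATE ROLE myuser LOGIN PASSWORD 'changeme';
GRANT readonly_group TO myuser;        -- with default INHERIT, SELECT applies immediately

-- A NOINHERIT login role only gets group privileges after SET ROLE
CREATE ROLE batch_admin NOLOGIN;
CREATE ROLE ops_user LOGIN NOINHERIT PASSWORD 'changeme';
GRANT batch_admin TO ops_user;
-- Later, inside an ops_user session:
SET ROLE batch_admin;                  -- explicitly assume the group's privileges
RESET ROLE;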
Benefits
- Flexible privilege management through group roles. Grant permissions to the group role once, then manage user access by adding/removing members (`GRANT group_role TO user_role` / `REVOKE group_role FROM user_role`).
- Allows complex hierarchical privilege structures if needed.
Managing Membership
- `GRANT group_role TO user_role [, ...];`
- `REVOKE group_role FROM user_role [, ...];`
- Check membership: `\du user_role` in psql, or query `pg_auth_members` (see the query sketch below).
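From a regular client (no psql meta-commands), membership can be read from the catalogs; a possible query joining `pg_auth_members` against `pg_roles`:
-- List which roles are members of which group roles
SELECT g.rolname AS group_role,
       m.rolname AS member,
       am.admin_option
FROM pg_auth_members am
JOIN pg_roles g ON g.oid = am.roleid
JOIN pg_roles m ON m.oid = am.member
ORDER BY g.rolname, m.rolname;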
Consideration
While powerful, understand the inheritance model (`INHERIT` vs `NOINHERIT`, `SET ROLE`) and use group roles strategically to simplify permission management rather than granting everything directly to individual login roles.
Terminology Recap
Key PG Terms
Quick reference for core PG vocabulary: Relation, Schema, Role, Heap, Tuple, WAL, LSN, MVCC, TOAST, Planner, Executor, Bloat, Pooler.
- Relation: General term for any table-like object storing data. Most commonly Tables and Indexes. Sometimes Views. Stored as files. [glossary]
- Schema: A namespace within a database containing objects (tables, views, functions, etc.). Used for organization and access control. Default is `public`. Distinct from "table schema" (column definitions). [docs]
- Role: Unified concept for Users (with `LOGIN`) and Groups (`NOLOGIN`). Owns objects, has privileges, can inherit from other roles. [docs]
- Heap: The main storage structure for a table's data, typically an unordered collection of pages containing tuples. [glossary]
- Tuple: A physical row version stored on disk within a heap block. Contains data + header (xmin, xmax, etc.). [glossary]
- WAL (Write-Ahead Log): Transaction log ensuring durability and enabling recovery/replication. Changes logged before applied to heap/index. Stored in `pg_wal` directory. [docs]
- LSN (Log Sequence Number): Pointer to a specific location in the WAL stream, identifying a unique record position. Used for recovery/replication progress tracking.
- MVCC (Multi-Version Concurrency Control): Concurrency model where updates create new tuple versions instead of overwriting. Enables readers not blocking writers. Requires `VACUUM`. [docs]
- TOAST (The Oversized Attribute Storage Technique): Mechanism for storing large field values out of line in a separate TOAST table. [docs]
- Planner/Optimizer: Component that analyzes a SQL query and statistics to determine the most efficient execution plan (e.g., which indexes to use, join methods). [docs]
- Executor: Component that runs the plan generated by the planner, fetching data and performing operations. [docs]
- Bloat: Wasted/unused space in table and index files due to dead tuples not yet vacuumed or to fragmentation. Requires `VACUUM`, or `REINDEX`/`VACUUM FULL`/`pg_repack`, to reclaim.
- Connection Pooler: External middleware (PgBouncer, Pgpool-II) managing persistent DB connections for applications; necessary because of the process-per-connection model.
- Checkpoint: Point in the WAL stream at which all prior dirty buffers are guaranteed to have been flushed to disk. Limits recovery time.
- Catalog: Internal tables (`pg_catalog` schema) storing database metadata.