Architecture & Core Concepts
Process Model
Multi-process architecture. A central listener (postmaster/`postgres`) forks a new backend process for each client connection.
Details
- Master Process (`postgres`): Listens for connections, manages shared memory, starts background utility processes (e.g., checkpointer, WAL writer, autovacuum launcher).
- Backend Process (`postgres`): One per connection. Handles parsing, planning, execution, communication with the client. Isolates crashes to a single connection.
- Pros: Robustness, stability (crash in one backend doesn't take down DB).
- Cons/Quirk: Higher connection overhead (memory/CPU per process). Requires external connection poolers (e.g., PgBouncer, Pgpool-II) for applications with many short-lived connections (like typical web apps) to avoid resource exhaustion.
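A quick way to observe the process model on a running cluster: `pg_stat_activity` shows one row per client backend plus the background workers (the `backend_type` column exists since v10).
SELECT pid, backend_type, state, query
FROM pg_stat_activity
ORDER BY backend_type;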
Memory Areas
Key memory settings include shared_buffers (main cache), work_mem (per-operation sort/hash), and maintenance_work_mem (VACUUM, indexes).
Shared Memory (Server-wide)
- `shared_buffers`: PostgreSQL's primary data cache. Stores frequently accessed table/index blocks. Typically set to ~25% of system RAM (up to a point, depends on workload/OS). [docs]
- WAL Buffers: Buffers Write-Ahead Log records before writing to disk. Usually auto-tuned (`-1`), small relative to `shared_buffers`. [docs]
Per-Backend/Operation Memory
- `work_mem`: Memory used by *each* sort operation (ORDER BY, DISTINCT), hash join, hash aggregation. Multiple operations within one query can each use `work_mem`. Set carefully to avoid OOM errors; too low causes disk spills. [docs]
- `maintenance_work_mem`: Memory used for maintenance tasks like `VACUUM`, `CREATE INDEX`, `ALTER TABLE ADD FOREIGN KEY`. Can be set much higher than `work_mem` as fewer such operations run concurrently. [docs]
- `temp_buffers`: Caches temporary tables. [docs]
Other
- Commit Log (CLOG/pg_xact) Buffers: Caches transaction status.
- OS Cache: PostgreSQL relies heavily on the operating system's file cache. `effective_cache_size` tells the planner how much memory is likely available in OS + PG cache.
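A minimal sketch for inspecting the memory settings above on a live server (names as in `postgresql.conf`):
SHOW shared_buffers;
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size');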
Storage
Data stored in files based on OIDs. Tablespaces map logical names to disk locations. Large values use TOAST.
File Layout
- Data directory (`PGDATA`) contains global config, WAL (`pg_wal`), subtransaction status (`pg_subtrans`), transaction status (`pg_xact`), per-database directories, etc.
- Each database has its own directory (named by its OID) within `PGDATA/base/`.
- Each table and index (relation) is stored in one or more files (named by its filenode OID), typically segmented into 1GB files (`relname.1`, `relname.2`...).
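To see the mapping for yourself (assuming an existing table, here the `myschema.mytable` used later in this section):
-- Database OID = directory name under PGDATA/base/
SELECT oid, datname FROM pg_database;
-- Relative path of a relation's main file, e.g. base/<db_oid>/<filenode>
SELECT pg_relation_filepath('myschema.mytable');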
Tablespaces
- Allow defining locations on the filesystem where database objects (tables, indexes, databases) can be stored. [docs]
- Useful for separating objects onto different disk types (e.g., fast SSDs for indexes, large HDDs for tables) or managing storage quotas.
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd_storage';
CREATE TABLE ... TABLESPACE fastspace;
TOAST (The Oversized Attribute Storage Technique)
- Handles storage of large column values that don't fit within a standard data block (~8KB). [docs]
- Automatically splits large values into chunks stored in a separate TOAST table associated with the main table.
- Can apply compression (default `pglz`, or `lz4` since v14) to TOASTed values.
- Transparent to the user but affects performance for very large fields. Contributes to bloat.
- Control via `ALTER TABLE ... SET STORAGE {PLAIN|EXTERNAL|EXTENDED|MAIN}`.
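A sketch of the per-column controls, assuming a hypothetical `documents` table with a large `body` column (`SET COMPRESSION` requires v14+):
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;   -- out-of-line storage, no compression
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;    -- per-column TOAST compression (v14+)
-- Locate the TOAST table backing the relation
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'documents';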
Write-Ahead Logging (WAL)
Ensures durability by logging changes *before* they are written to data files. Essential for recovery, replication, and PITR. [docs]
Mechanism
- All changes (INSERT, UPDATE, DELETE, DDL) are first written as log records to WAL buffers, then flushed to WAL segment files on disk (`pg_wal` directory).
- Only *after* the WAL record is durably stored on disk is the change allowed to be written to the actual table/index files (heap/index pages) in shared buffers, which are flushed later by the checkpointer or background writer.
Purpose
- Crash Recovery: On startup after a crash, Postgres replays WAL records since the last checkpoint to restore the database to a consistent state.
- Durability Guarantee: A transaction commit is confirmed only after its WAL records are flushed to disk (controlled by `synchronous_commit`).
- Replication: Standby servers continuously stream and apply WAL records from the primary (Streaming Rep) or decode WAL for logical changes (Logical Rep).
- Point-in-Time Recovery (PITR): Requires continuous archiving of WAL segment files. [docs]
Key Concepts
- WAL Segments: Files (default 16MB) containing WAL records. Sequentially numbered. Located in `pg_wal`.
- Checkpoints: Points in the WAL stream where all data file changes prior to the checkpoint are guaranteed to have been flushed to disk. Limits recovery time. Triggered by time (`checkpoint_timeout`), amount of WAL generated (`max_wal_size`), or manually (`CHECKPOINT` command). Managed by the Checkpointer process. `checkpoint_completion_target` spreads I/O.
- WAL Archiving: Process of copying completed WAL segment files to a safe location (`archive_command`). Essential for PITR and some replication setups. Requires `archive_mode = on`.
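A few functions for poking at WAL on a live server (v10+ names; `CHECKPOINT` needs superuser or, since v15, the `pg_checkpoint` role):
SELECT pg_current_wal_lsn();                     -- current WAL write position
SELECT pg_walfile_name(pg_current_wal_lsn());    -- segment file in pg_wal containing that LSN
CHECKPOINT;                                      -- force an immediate checkpoint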
MVCC
Multi-Version Concurrency Control. Updates create new row versions instead of overwriting. Readers don't block writers. Leads to bloat. [docs]
How it Works
- Each row version has system columns: `xmin` (inserting transaction ID), `xmax` (deleting/locking transaction ID), `cmin`/`cmax` (command ID within tx), `ctid` (physical location).
- When a row is updated or deleted, Postgres marks the old version as "expired" by setting `xmax` and inserts a new version (for UPDATE) with its own `xmin`.
- Each transaction gets a snapshot of the database (list of visible transaction IDs) when it starts (in `READ COMMITTED` for each statement, `REPEATABLE READ`/`SERIALIZABLE` for the transaction). [visibility]
- A transaction can only "see" row versions where `xmin` is committed and visible to its snapshot, AND (`xmax` is not set OR `xmax` is aborted OR `xmax` is not visible to its snapshot).
- Expired row versions (dead tuples) physically remain until cleaned up by `VACUUM`.
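The system columns can be selected explicitly; a minimal sketch, assuming a hypothetical `accounts` table:
SELECT xmin, xmax, ctid, * FROM accounts WHERE id = 1;
-- After another (still open) transaction UPDATEs the row, the old version's xmax
-- is set to that transaction's ID and a new version appears with a fresh xmin/ctid.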
Benefits
- High concurrency: Readers don't block writers, and writers don't block readers (for data access).
- Read consistency: Queries see a consistent snapshot of the data as defined by the isolation level.
Consequences (Quirks)
- Bloat: Dead tuples accumulate, consuming disk space and potentially slowing down scans. Requires regular `VACUUM`.
- UPDATEs are expensive: An UPDATE is internally like a DELETE + INSERT, requiring new tuple creation and index entries for all indexes on the table. (HOT updates can mitigate index churn if no indexed columns are changed and space exists on the same page).
- Transaction ID Wraparound: Requires freezing via `VACUUM` (see Quirks section).
System Catalogs
Internal tables in the `pg_catalog` schema storing metadata about database objects. Query them for introspection. [docs]
Purpose
Hold the schema definitions, configuration, and status information for the database itself. Everything `psql`'s `\d` commands show comes from here.
Key Catalogs (Examples)
- `pg_class`: Information about tables, indexes, sequences, views (relations). Columns: `relname`, `relnamespace` (schema OID), `relkind` (type), `reltuples` (approx rows), `relpages` (approx pages).
- `pg_attribute`: Information about table columns. Columns: `attrelid` (table OID), `attname`, `atttypid` (type OID), `attnum` (column number), `attnotnull`.
- `pg_namespace`: Information about schemas. Columns: `nspname`, `nspowner`.
- `pg_index`: Information about indexes. Columns: `indexrelid` (index OID), `indrelid` (table OID), `indkey` (indexed column numbers).
- `pg_proc`: Information about functions and procedures.
- `pg_type`: Information about data types.
- `pg_settings`: Current server configuration parameters.
- `pg_locks`: Information about current locks.
- `pg_roles`: Information about roles.
Querying
Join catalogs using OID (Object Identifier) columns. Often complex, consider using higher-level Information Schema views (`information_schema.tables`, `information_schema.columns`) for standard metadata, though catalogs offer more detail.
-- Find columns of a table
SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS type
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'myschema.mytable'::regclass -- Use regclass cast
AND a.attnum > 0
AND NOT a.attisdropped
ORDER BY a.attnum;
Data Types (Postgres Strengths)
JSON / JSONB
JSONB stores decomposed binary JSON. Supports GIN indexing for fast containment/existence lookups with operators like `@>`, `?`, `?|`, `?&` (the `->`/`->>` extraction operators are not GIN-accelerated unless you use an expression index).
JSON vs JSONB
- `JSON`: Stores an exact textual copy. Preserves whitespace, key order, duplicate keys. Slower processing.
- `JSONB`: Stores decomposed binary format. More efficient storage (usually), much faster processing. Removes duplicate keys, doesn't preserve whitespace/key order. **Generally preferred.** [docs]
Key Operators
- `->`: Get JSON object field/array element (returns `json`/`jsonb`).
- `->>`: Get JSON object field/array element as `text`.
- `#>`: Get JSON object at specified path (returns `json`/`jsonb`).
- `#>>`: Get JSON object at specified path as `text`.
- `@>`: Contains (left contains right). `jsonb '{"a":1, "b":2}' @> '{"a":1}'` -> true.
- `<@`: Is contained by (left is contained by right).
- `?`: Key exists (string operand). `'{"a":1}'::jsonb ? 'a'` -> true.
- `?|`: Any key in array exists.
- `?&`: All keys in array exist.
- [Full List in Docs]
Indexing (GIN)
- `CREATE INDEX idx_gin ON mytable USING GIN (jsonb_column);` (Indexes all keys/values). [docs]
- `CREATE INDEX idx_gin_ops ON mytable USING GIN (jsonb_column jsonb_ops);` (Default opclass [docs]).
- `CREATE INDEX idx_gin_path_ops ON mytable USING GIN (jsonb_column jsonb_path_ops);` (Optimized for `@>` operator [docs]).
- Supports operators like `@>`, `?`, `?|`, `?&`. Crucial for performance!
Use Cases
Storing semi-structured data, document storage, flexible schemas, API payloads.
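A minimal end-to-end sketch, using a hypothetical `api_events` table:
CREATE TABLE api_events (id bigserial PRIMARY KEY, payload jsonb);
CREATE INDEX idx_api_events_payload ON api_events USING GIN (payload);
INSERT INTO api_events (payload) VALUES ('{"type": "login", "user": {"id": 42, "name": "alice"}}');
SELECT * FROM api_events WHERE payload @> '{"type": "login"}';   -- containment, can use the GIN index
SELECT payload #>> '{user,name}' AS user_name FROM api_events;   -- extract a nested value as text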
Arrays
Syntax
- Declaration: `my_ints integer[]`, `my_tags text[]`.
- Literals: `ARRAY[1, 2, 3]`, `'{apple, banana, cherry}'::text[]`.
- Access: `my_array[1]` (1-based indexing!). `my_array[1:2]` (slice). [docs]
Key Operators
- `=`: Array equality (element-by-element, order matters).
- `@>`: Contains (left contains right). `ARRAY[1,2,3] @> ARRAY[1,2]` -> true.
- `<@`: Is contained by.
- `&&`: Overlaps (have elements in common). `ARRAY[1,2,3] && ARRAY[3,4,5]` -> true.
- `ANY(array)` / `SOME(array)`: Check against array elements, e.g. `col = ANY(my_array)`.
- `ALL(array)`: Check all elements satisfy condition.
- [Full List in Docs]
Indexing (GIN)
- `CREATE INDEX idx_array_gin ON mytable USING GIN (array_column);` [docs]
- Supports `@>`, `<@`, `&&`, `=` (for some types [opclasses]). Essential for searching within arrays efficiently.
Functions
- `array_append()`, `array_prepend()`, `array_cat()`, `array_dims()`, `cardinality()` (size), `unnest()` (expand array to rows).
Use Cases
Storing tags, simple lists of related IDs, multi-value attributes.
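A short sketch of tag storage and search, using a hypothetical `articles` table (the same shape as the GIN example later in this section):
CREATE TABLE articles (id serial PRIMARY KEY, title text, tags text[]);
CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
INSERT INTO articles (title, tags) VALUES ('Intro to MVCC', ARRAY['postgres', 'internals']);
SELECT * FROM articles WHERE tags @> ARRAY['postgres'];           -- contains, uses the GIN index
SELECT * FROM articles WHERE tags && ARRAY['sql', 'internals'];   -- overlaps
SELECT id, unnest(tags) AS tag FROM articles;                     -- expand to one row per tag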
Range Types
Represent ranges (`int4range`, `tsrange`, etc.). Support GiST indexing and exclusion constraints to prevent overlaps.
Built-in Types
- `int4range`, `int8range`, `numrange`, `tsrange` (timestamp without tz), `tstzrange` (timestamp with tz), `daterange`. [docs]
- Can create custom range types. [docs]
Syntax
- Literals: `[)`, `(]`, `()`, `[]` denote inclusive/exclusive bounds.
- Examples: `'[2024-01-01, 2024-01-31)'::daterange`, `'[10, 20]'::int4range`. Empty ranges are allowed. Unbounded ranges simply omit a bound (e.g., `'[2024-01-01,)'`). [docs]
Key Operators
- `@>`: Contains element or range. `daterange('2024-01-01', '2024-02-01') @> '2024-01-15'::date` -> true.
- `<@`: Is contained by.
- `&&`: Overlaps.
- `<<`: Strictly left of.
- `>>`: Strictly right of.
- `&<`: Does not extend to the right of (left's upper bound does not exceed right's).
- `&>`: Does not extend to the left of (left's lower bound is not below right's).
- `-|-` (adjacent): `tsrange('2024-01-01 10:00', '2024-01-01 11:00') -|- tsrange('2024-01-01 11:00', '2024-01-01 12:00')` -> true.
- [Full List in Docs]
Indexing (GiST)
- `CREATE INDEX idx_range_gist ON mytable USING GIST (range_column);` [docs]
- Supports range operators like `@>`, `&&`, `<@`, etc. [opclasses]
Exclusion Constraints
- Prevent overlapping ranges within a table, often used for scheduling or booking systems. [docs]
- Requires GiST index support for the overlap operator (`&&`); combining a scalar equality (e.g., `room_id WITH =`) with a range overlap in one constraint requires the `btree_gist` extension.
ALTER TABLE events ADD CONSTRAINT no_overlapping_times EXCLUDE USING GIST (room_id WITH =, event_times WITH &&);
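An end-to-end sketch of the same pattern (the inline constraint is equivalent to the `ALTER TABLE` above; `btree_gist` supplies the GiST support for `room_id WITH =`):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE events (
    room_id int NOT NULL,
    event_times tstzrange NOT NULL,
    EXCLUDE USING GIST (room_id WITH =, event_times WITH &&)
);
INSERT INTO events VALUES (1, '[2024-06-01 10:00, 2024-06-01 11:00)');
INSERT INTO events VALUES (1, '[2024-06-01 10:30, 2024-06-01 11:30)');  -- fails: exclusion constraint violation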
Enum Types
User-defined enumerated types (`CREATE TYPE mood AS ENUM (...)`). Provides type safety and better semantics than check constraints. [docs]
Definition
CREATE TYPE primary_color AS ENUM ('red', 'green', 'blue');
CREATE TABLE items (id serial PRIMARY KEY, color primary_color);
Benefits
- Type Safety: Ensures only defined values can be stored.
- Readability: More descriptive than storing magic numbers or strings.
- Storage Efficiency: Stored efficiently internally (typically 4 bytes).
- Ordering: Enum values have an implicit sort order based on definition order.
Usage
- Insert values like strings: `INSERT INTO items (color) VALUES ('red');`
- Compare directly: `WHERE color = 'red'`, `WHERE color > 'green'`.
Modification
- Add values: `ALTER TYPE primary_color ADD VALUE 'yellow' AFTER 'blue';` (Cannot easily remove or reorder values). [docs]
- Renaming values is possible: `ALTER TYPE ... RENAME VALUE ... TO ...`.
Comparison to Check Constraints
Enums are generally preferred over `CHECK (color IN ('red', 'green', 'blue'))` because they are more type-safe, potentially faster, easier to manage centrally, and self-documenting.
Other Notable Types
Includes native UUID, Geometric, Network (`inet`), `hstore`, `tsvector`/`tsquery`, `DOMAIN`, and Composite Types.
- `UUID`: Stores Universally Unique Identifiers. Preferred over storing as `text` for efficiency and semantics. Generate using `gen_random_uuid()` (built into core since v13; earlier versions need the `pgcrypto` extension) or client-side. [docs]
- Geometric Types (`point`, `line`, `lseg`, `box`, `path`, `polygon`, `circle`): Basic 2D geometric types. Indexed using GiST. Foundation for the powerful PostGIS extension. [docs]
- Network Address Types (`cidr`, `inet`, `macaddr`, `macaddr8`): Store and query IP addresses/networks and MAC addresses. Supports subnet containment operators (`>>`, `<<`). Indexable (B-Tree/GiST). [docs]
- `Hstore` Extension (`CREATE EXTENSION hstore`): Simple key-value store within a column. Keys/values are strings. Indexed using GIN or GiST. Often superseded by JSONB now, but still useful for simpler cases or legacy systems. Operators: `->`, `?`, `@>`.
- Full-Text Search Types (`tsvector`, `tsquery`): Used for storing preprocessed documents (`tsvector`) and search queries (`tsquery`). Used with FTS functions (`to_tsvector`, `to_tsquery`, `ts_rank`) and operators (`@@`). Indexable with GIN or GiST. [docs]
- `DOMAIN` Types: Create user-defined types based on existing types but with added constraints (`CHECK`, `NOT NULL`). Useful for enforcing data rules centrally. E.g., `CREATE DOMAIN email_address AS text CHECK (value ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');`
- Composite Types (`CREATE TYPE name AS (...)`): User-defined types representing the structure of a row or record (a list of field names and their types). Can be used for function arguments, return values, or table columns. E.g., `CREATE TYPE address AS (street text, city text, zip text);`
- `TIMESTAMP WITH TIME ZONE` (`timestamptz`) vs `TIMESTAMP WITHOUT TIME ZONE` (`timestamp`):
- `timestamptz`: **Strongly recommended**. Stores UTC timestamp. Converts input to UTC based on session timezone, converts back to session timezone on output. Unambiguous point in time.
- `timestamp`: Stores literal date/time provided, ignoring timezone. Ambiguous. Avoid unless you have a very specific reason.
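A small demonstration of the difference (results shown as comments; the `timestamptz` output depends on the session `TimeZone`):
SET TIME ZONE 'UTC';
SELECT TIMESTAMPTZ '2024-06-01 12:00:00+02' AS tz_aware,   -- converted to UTC: 2024-06-01 10:00:00+00
       TIMESTAMP   '2024-06-01 12:00:00+02' AS tz_naive;   -- offset ignored:   2024-06-01 12:00:00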
Indexing Power
B-Tree Index
Default index type. Excellent for equality (`=`), range (`<`, `>`, `BETWEEN`), `IN`, `IS NULL`, and sorting. [docs]
Core Use Case
The workhorse index for most relational data. Stores indexed values in a sorted tree structure, allowing efficient lookups, range scans, and ordered retrieval.
Supported Operators
`=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IN`, `IS NULL`, `IS NOT NULL`. [details]
Sorting
Can satisfy `ORDER BY` clauses matching the index definition (including ASC/DESC, NULLS FIRST/LAST), potentially avoiding a separate sort step. [docs]
Examples
CREATE INDEX idx_users_email ON users (email);
(For `WHERE email = '...'`)
CREATE INDEX idx_orders_created ON orders (created_at DESC);
(For `WHERE created_at > '...' ORDER BY created_at DESC`)
Considerations
Effective for high-cardinality columns. Can become large. Index maintenance (bloat, see VACUUM and REINDEX) is a factor.
GIN Index
Generalized Inverted Index. Optimized for composite types like JSONB, Arrays, Hstore, Full-Text Search. Indexes elements *within* values. [docs]
Core Concept
Creates an index entry for each unique element (key, array item, lexeme) pointing to the rows (heap TIDs) containing that element. Ideal for "contains" or "exists" style queries. [intro]
Use Cases & Supported Operators
- Arrays: Supports `@>` (contains), `<@` (contained by), `&&` (overlaps), `=` (equality). [opclasses]
- JSONB: Supports `@>`, `<@`, `?` (key exists), `?|` (any key exists), `?&` (all keys exist). Use `jsonb_ops` (default) or `jsonb_path_ops` (optimized for `@>`).
- Hstore: Supports `@>`, `?`, `?|`, `?&`. [docs]
- Full-Text Search (`tsvector`): Supports `@@` (match) operator. [docs]
Performance
- Very fast lookups for containment/existence queries.
- Can be significantly slower to build/update than B-Tree or GiST, as inserting/updating one row might require updating many index entries if the value contains many unique elements. Tuning `gin_pending_list_limit` can help. [fast update]
- Index size can be large, but often smaller than GiST for the same data.
Example
CREATE INDEX idx_tags_gin ON articles USING GIN (tags);
(for `tags text[]`)
CREATE INDEX idx_metadata_gin ON products USING GIN (metadata jsonb_path_ops);
(for `metadata jsonb`)
GiST Index
Generalized Search Tree. Framework index for complex types: Geometric, Full-Text Search, Ranges. Handles overlap/containment, nearest-neighbor. [docs]
Core Concept
A height-balanced tree structure like B-Tree, but generalized to handle complex data types and query operators beyond simple comparison. Uses "lossy" indexing in some cases, meaning the index might return some false positives that need rechecking against the heap. [intro]
Use Cases & Supported Operators
- Geometric Types (PostGIS): Supports spatial operators like intersection (`&&`), containment (`@`, `~`), distance (`<->` for KNN). [opclasses]
- Range Types: Supports overlap (`&&`), containment (`@>`, `<@`), adjacency (`-|-`), etc. [opclasses]
- Full-Text Search (`tsvector`): Supports `@@` (match) operator. Often faster updates but slower searches than GIN for FTS. [docs]
- Other extensions (`hstore`, `pg_trgm` for fuzzy search): Can provide GiST support.
Performance
- Generally faster to build/update than GIN.
- Search performance can be slower than GIN for types GIN handles well (like JSONB/Array containment).
- Handles KNN (K-Nearest Neighbor) searches efficiently (e.g., find points closest to X). [KNN]
Example
CREATE INDEX idx_events_times_gist ON events USING GIST (event_times);
(for `event_times tstzrange`)
CREATE INDEX idx_locations_geom_gist ON locations USING GIST (geom);
(for PostGIS geometry column)
BRIN Index
Block Range Index. Very small index for huge tables where values correlate strongly with physical storage order (e.g., timestamps). [docs]
Core Concept
Stores summary information (min/max values, potentially other info depending on opclass) for ranges of physical table blocks (`pages_per_range` setting). During a query, checks the summary to see if blocks *might* contain relevant values, skipping blocks where values are guaranteed to be outside the query range. [intro]
Ideal Scenario
- Very large tables (many GBs/TBs).
- Column values have a strong linear correlation with their physical storage location. This often happens naturally for append-only tables with a timestamp or sequence column (e.g., log data, IoT measurements).
- Queries select relatively large ranges.
Benefits
- Extremely small index size compared to B-Tree.
- Very fast to build and maintain.
Limitations
- Only effective if the physical correlation exists. Randomly distributed data or frequently updated tables (causing rows to move physically) will not benefit.
- Less efficient than B-Tree for highly selective queries (finding single rows).
- Index results are "lossy" - it identifies candidate blocks, the query still needs to scan those blocks.
Example
CREATE INDEX idx_logs_ts_brin ON logs USING BRIN (log_timestamp) WITH (pages_per_range = 32);
Hash Index
Supports only equality (`=`) lookups. Less commonly used than B-Tree. Fully WAL-logged and usable since v10. [docs]
Core Concept
Stores a hash of the indexed value, mapping hashes to heap TIDs. Designed for fast equality checks.
Supported Operators
Only `=`. [details]
Historical Context & Quirks
- Before PostgreSQL v10, Hash indexes were not WAL-logged, meaning they had to be rebuilt after a crash and couldn't be replicated. This severely limited their use.
- Since v10, they are fully recoverable and replicable.
Use Cases (Limited)
- Potentially smaller and slightly faster than B-Tree *only* for equality lookups on certain data patterns (e.g., long strings where only equality is needed).
- In most cases, B-Tree performs comparably or better even for equality and offers much more flexibility (range queries, sorting).
- B-Tree is usually the better default choice even if only equality is needed. Use Hash only if benchmarks show a significant benefit for a specific `=`-only workload.
Example
CREATE INDEX idx_data_hash ON large_objects USING HASH (data_blob_hash);
Special Index Features
Includes Partial, Expression, Covering (INCLUDE), and Concurrent indexes for optimization and maintenance.
- Partial Indexes: Index only a subset of rows defined by a `WHERE` clause.
- Reduces index size and maintenance overhead if queries frequently filter on the same condition.
- Example: `CREATE INDEX idx_orders_pending ON orders (order_id) WHERE status = 'pending';` (Only useful for queries also containing `WHERE status = 'pending'`).
- Expression Indexes (Functional Indexes): Index the result of a function or expression applied to one or more columns.
- Useful for queries filtering/sorting on functions (e.g., `lower()`, date extraction). The query must use the *exact* same expression.
- Example: `CREATE INDEX idx_users_email_lower ON users (lower(email));` (for `WHERE lower(email) = '...'`)
- Covering Indexes (INCLUDE clause): Store additional, non-key columns directly in the index leaf pages.
- Allows Index-Only Scans where the query can be satisfied entirely from the index without visiting the table heap (requires table visibility map to be up-to-date via VACUUM).
- Example: `CREATE INDEX idx_items_name ON items (name) INCLUDE (price, category);` (for `SELECT price FROM items WHERE name = '...'`)
- Concurrent Index Builds: Create indexes without blocking writes (`INSERT`/`UPDATE`/`DELETE`) on the table.
- Takes longer and uses more resources than a standard `CREATE INDEX`. Requires multiple passes and can fail (leaving an invalid index that needs dropping).
- Syntax: `CREATE INDEX CONCURRENTLY ...;`
- Essential for adding indexes to busy production tables with minimal downtime. Cannot run inside a transaction block.
REINDEX
Rebuilds an index, useful for shrinking bloated indexes or repairing corruption. Supports `CONCURRENTLY` option. [docs]
Purpose
- Reduce Bloat: Rebuilds the index from scratch using current table data, removing dead space (similar effect to `VACUUM FULL` but just for the index).
- Repair Corruption: Can fix certain types of index corruption.
- Update Index Definition: Historically needed for changes in collation versions (less common now).
Commands
- `REINDEX INDEX index_name;`: Rebuilds a specific index. Takes `ACCESS EXCLUSIVE` lock on the index (blocks reads/writes using the index).
- `REINDEX TABLE table_name;`: Rebuilds all indexes on a specific table. Locks the table.
- `REINDEX SCHEMA schema_name;`: Rebuilds all indexes in a schema.
- `REINDEX DATABASE database_name;`: Rebuilds all indexes in a database (requires caution).
- `REINDEX SYSTEM database_name;`: Rebuilds system catalog indexes (requires caution).
Concurrent Reindex
- `REINDEX INDEX index_name CONCURRENTLY;` (Postgres 12+): Rebuilds the index without taking heavy locks that block reads/writes.
- Similar mechanism to `CREATE INDEX CONCURRENTLY`: builds a new index in the background, then replaces the old one in a short lock window.
- Takes longer, uses more resources, cannot run in a transaction block. Preferred method for large/busy tables.
When to Use
Consider `REINDEX` (preferably `CONCURRENTLY`) when index bloat is significant (detected via monitoring) and affecting performance, or if index corruption is suspected.
SQL Features & Querying
Window Functions
Perform calculations across sets of table rows related to the current row. Uses the `OVER()` clause (`PARTITION BY`, `ORDER BY`).
Core Concept
Unlike aggregate functions which collapse rows, window functions return a value for *each* row based on a "window" of related rows. [docs]
Syntax
function_name() OVER ( [PARTITION BY expr_list] [ORDER BY expr_list] [frame_clause] )
[syntax]
- `PARTITION BY`: Divides rows into partitions (groups). Window function is applied independently to each partition. (Optional)
- `ORDER BY`: Defines the order of rows within each partition. Required for ranking and frame-dependent functions. (Optional for some functions)
- `frame_clause`: Defines the subset of rows within the partition relative to the current row (e.g., `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`). (Optional) [frame]
Common Functions
- Ranking: `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `NTILE(n)`.
- Aggregate Windows: `SUM()`, `AVG()`, `COUNT()`, `MAX()`, `MIN()` applied over the window frame.
- Value Fetching: `LAG()`, `LEAD()`, `FIRST_VALUE()`, `LAST_VALUE()`, `NTH_VALUE()`.
Example Use Cases
Calculating running totals, ranking results within categories, finding previous/next values, computing moving averages.
SELECT order_id, order_date, amount, SUM(amount) OVER (ORDER BY order_date) as running_total FROM orders;
SELECT product, category, sales, RANK() OVER (PARTITION BY category ORDER BY sales DESC) as category_rank FROM product_sales;
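A sketch of an explicit frame clause (moving average over the current row and the six preceding rows), reusing the `orders` columns from the running-total example:
SELECT order_id, order_date, amount,
       AVG(amount) OVER (
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM orders;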
Common Table Expressions (CTEs)
Define temporary, named result sets within a query using `WITH`. Improves readability. Supports `RECURSIVE` for hierarchies.
Basic CTE
Allows breaking down complex queries into logical, named steps.
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region FROM regional_sales
WHERE total_sales > 10000
)
SELECT * FROM orders
WHERE region IN (SELECT region FROM top_regions);
Recursive CTEs
Used for querying hierarchical data (e.g., organizational charts, parts explosion, graph traversal). [docs]
WITH RECURSIVE subordinates AS (
-- Non-recursive term (anchor)
SELECT employee_id, name, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive term
SELECT e.employee_id, e.name, e.manager_id, s.level + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates;
Materialization (Quirk)
- Since PostgreSQL 12, a non-recursive CTE that is referenced only once (and has no side effects) is inlined into the outer query by default; before v12, CTEs were always materialized and acted as an optimization fence.
- Use `WITH cte AS MATERIALIZED (...)` (Postgres 12+) to force the CTE result to be computed once and stored temporarily (like a temp table). Can help performance if the CTE is complex and referenced multiple times, but can hurt if it's large and only used partially or if inlining would allow better overall optimization.
- Use `WITH cte AS NOT MATERIALIZED (...)` (PG12+) to hint against materialization.
Data Modifying CTEs
You can have `INSERT`, `UPDATE`, or `DELETE` statements within a CTE, using their `RETURNING` clause to pass data to subsequent parts of the query. [docs]
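A minimal sketch: move old rows in one statement, assuming a hypothetical `logs_archive` table with the same structure as `logs`:
WITH archived AS (
    DELETE FROM logs
    WHERE log_date < '2023-01-01'
    RETURNING *
)
INSERT INTO logs_archive
SELECT * FROM archived;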
DISTINCT ON
Postgres-specific extension. Selects the *first* row for each unique combination of expressions in the `DISTINCT ON (...)` clause, based on `ORDER BY`.
Syntax
SELECT DISTINCT ON (expression [, ...]) select_list FROM ... ORDER BY expression [, ...], order_expr [, ...]
How it Works
- The query result is ordered according to the *entire* `ORDER BY` clause.
- For each group of rows that are identical according to the `DISTINCT ON` expressions, only the *first* row according to the `ORDER BY` clause is kept.
- **Crucial:** The `ORDER BY` clause *must* start with the same expressions as `DISTINCT ON` (order matters!) to get predictable results. Additional `ORDER BY` expressions determine which row is chosen within each distinct group. [docs]
Use Case: "Greatest/Latest N per Group" (N=1)
Very common for finding the most recent entry for each item.
-- Get the latest event for each device_id
SELECT DISTINCT ON (device_id)
device_id, event_timestamp, event_data
FROM device_events
ORDER BY device_id, event_timestamp DESC;
Comparison to Alternatives
Often more concise and performant than using window functions (`ROW_NUMBER() OVER (...) ... WHERE rn = 1`) or correlated subqueries for the same "latest-per-group" task, especially if indexed appropriately on the `DISTINCT ON` and `ORDER BY` columns.
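A supporting index for the query above would typically match the `DISTINCT ON`/`ORDER BY` columns:
CREATE INDEX idx_device_events_latest ON device_events (device_id, event_timestamp DESC);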
INSERT ... ON CONFLICT (UPSERT)
Atomic "UPSERT" operation. Handles unique constraint violations by either doing nothing (`DO NOTHING`) or updating the existing row (`DO UPDATE`). [docs]
Syntax
INSERT INTO table_name (column_list)
VALUES (value_list)
ON CONFLICT (conflict_target) DO conflict_action;
- `conflict_target`: Specifies the constraint that triggers the conflict handling. Can be:
- `(column_name [, ...])`: A list of columns with a unique constraint/index.
- `ON CONSTRAINT constraint_name`: The name of a unique or exclusion constraint.
- `WHERE index_predicate`: (Optional) Used together with a column list to infer a partial unique index as the conflict target.
- `conflict_action`: Can be:
- `DO NOTHING`: Silently ignore the row if a conflict occurs.
- `DO UPDATE SET col1 = value1 [, ...] [WHERE condition]`: Update the existing row that caused the conflict. Can reference the existing row using the table name (e.g., `counters.value`) and the proposed insertion row using the special `EXCLUDED` table (`SET count = mytable.count + EXCLUDED.count`). The optional `WHERE` condition refers to the existing row.
Benefits
- Atomic operation, avoiding race conditions inherent in separate SELECT-then-INSERT/UPDATE logic.
- Simplifies application code for handling data synchronization or idempotent inserts.
Example
-- Increment counter or insert new row
INSERT INTO counters (key, value)
VALUES ('page_views', 1)
ON CONFLICT (key) DO UPDATE
SET value = counters.value + EXCLUDED.value;
Advanced DML
Supports `UPDATE ... FROM`, `DELETE ... USING` for joining. `RETURNING` clause gets back modified rows. `TABLESAMPLE` for subsets.
UPDATE FROM / DELETE USING
Allows joining other tables directly into `UPDATE` or `DELETE` statements to determine which rows to modify or filter based on related data.
-- Update order status based on payment status
UPDATE orders o
SET status = 'shipped'
FROM payments p
WHERE o.order_id = p.order_id AND p.status = 'paid';
-- Delete users who haven't logged in recently
DELETE FROM users u
USING user_last_login l
WHERE u.user_id = l.user_id AND l.last_login < NOW() - INTERVAL '1 year';
RETURNING Clause
Appends a list of values from the rows modified by an `INSERT`, `UPDATE`, or `DELETE` statement to the command result. [docs]
- Useful for getting default values (like serial IDs) after an `INSERT`, or confirming exactly which rows were affected.
- Avoids a separate `SELECT` statement, improving efficiency and atomicity (within the same command).
-- Insert and get the generated ID
INSERT INTO products (name, price) VALUES ('New Gadget', 99.99)
RETURNING product_id;
-- Update stock and return the updated product names
UPDATE products SET stock = stock - 1 WHERE category = 'electronics'
RETURNING name, stock;
-- Delete old logs and return the IDs deleted
DELETE FROM logs WHERE log_date < '2023-01-01'
RETURNING log_id;
TABLESAMPLE Clause
Selects a random subset of rows from a table, typically much faster than `ORDER BY random() LIMIT n` for large tables. [docs]
- Methods: `BERNOULLI` (scans whole table, includes each row with specified probability), `SYSTEM` (faster, selects random blocks, probability is approximate).
- Example: `SELECT * FROM large_log_table TABLESAMPLE SYSTEM (1);` (Select ~1% of blocks)
- Use `REPEATABLE (seed)` for reproducible samples.
Lateral Joins
`LATERAL` allows a subquery in the `FROM` clause to reference columns from preceding `FROM` items. Like a correlated `foreach` loop.
Core Concept
Normally, a subquery in the `FROM` clause executes independently. `LATERAL` makes it dependent, executing it *for each row* of the preceding table, allowing the subquery to use values from that row. [docs]
Syntax
SELECT ...
FROM table_a a, LATERAL (subquery using a.columns) sub
WHERE ...
-- Also common with JOIN syntax:
SELECT ...
FROM table_a a
LEFT JOIN LATERAL (subquery using a.columns) sub ON true -- or other join condition
WHERE ...
Use Cases
- Finding the top N related items per row (e.g., get top 3 posts for each user).
- Calling set-returning functions with arguments from the preceding table.
- Simplifying complex correlated subqueries previously written in `SELECT` list or `WHERE` clause.
Example: Get 2 most recent orders for each customer
SELECT c.customer_id, c.name, o.order_id, o.order_date
FROM customers c
LEFT JOIN LATERAL (
SELECT order_id, order_date
FROM orders
WHERE customer_id = c.customer_id -- References outer table 'c'
ORDER BY order_date DESC
LIMIT 2
) o ON true; -- Join condition often true as filtering is inside LATERAL
Performance
Can be very efficient if the lateral subquery can use indexes based on the correlated columns. Replaces some procedural logic or complex window functions.
Aggregate `FILTER` Clause
SQL standard way to apply a condition *before* aggregation using `FILTER (WHERE ...)`. Cleaner than `CASE` statements.
Syntax
aggregate_function(expression) FILTER (WHERE condition)
Purpose
Provides a standard and often more readable way to perform conditional aggregation compared to using `CASE` inside the aggregate function.
Example: Count paid vs unpaid orders
-- Traditional CASE method
SELECT
SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders,
SUM(CASE WHEN status = 'unpaid' THEN 1 ELSE 0 END) AS unpaid_orders
FROM orders;
-- Using FILTER clause
SELECT
COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
COUNT(*) FILTER (WHERE status = 'unpaid') AS unpaid_orders
FROM orders;
-- Can be combined with GROUP BY
SELECT customer_id,
COUNT(*) FILTER (WHERE amount > 100) as large_orders,
COUNT(*) FILTER (WHERE amount <= 100) as small_orders
FROM orders
GROUP BY customer_id;
Benefits
- Improved readability, especially for complex conditions or multiple conditional aggregates.
- SQL standard syntax.
- Potentially allows for better optimization by the planner in some cases.
Native Partitioning
Declarative Partitioning
Split large tables into smaller, manageable pieces (partitions) based on LIST, RANGE, or HASH of a key. Improves performance and manageability. [docs]
Concept
A partitioned table is a logical "parent" table that is physically composed of multiple child tables (partitions). Rows inserted into the parent are automatically routed to the correct partition based on the partition key and method.
Partitioning Methods
- RANGE: Partitions based on ranges of the partition key (e.g., date ranges, numeric ranges). `PARTITION BY RANGE (created_at)` [docs]
- LIST: Partitions based on specific values of the partition key (e.g., country codes, status codes). `PARTITION BY LIST (country_code)` [docs]
- HASH: Partitions based on a hash of the partition key, distributing rows roughly evenly. `PARTITION BY HASH (user_id)` [docs]
Creating Partitions
-- Parent Table Definition
CREATE TABLE measurements (
city_id int not null,
logdate date not null,
peaktemp int,
unitsales int
) PARTITION BY RANGE (logdate);
-- Creating Partitions
CREATE TABLE measurements_y2023m01 PARTITION OF measurements
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
CREATE TABLE measurements_y2023m02 PARTITION OF measurements
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
-- List Partition Example
CREATE TABLE cities ( ... ) PARTITION BY LIST (country);
CREATE TABLE cities_us PARTITION OF cities FOR VALUES IN ('US');
CREATE TABLE cities_jp PARTITION OF cities FOR VALUES IN ('JP');
CREATE TABLE cities_other PARTITION OF cities DEFAULT; -- Optional default
Benefits
- Performance (Partition Pruning): Queries filtering by the partition key can skip scanning irrelevant partitions entirely.
- Manageability: Easier bulk data loading/deletion by manipulating partitions (`ATTACH`/`DETACH PARTITION`, `DROP TABLE partition_name`). Efficient for time-series data retention (drop old partitions).
- Maintenance: Operations like `VACUUM` or `REINDEX` can often be performed on individual partitions.
Considerations
- Indexes and constraints usually need to be created on each partition individually (though primary keys/unique constraints must include the partition key). Use `CREATE INDEX ON parent_table ...` to automatically create on all partitions (and future ones).
- Partition key cannot easily be changed after creation.
- Too many partitions can increase planning time and resource usage.
- Use extensions like `pg_partman` to automate partition creation and maintenance.
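A short sketch of a parent-level index plus pruning, continuing the `measurements` example above:
-- Created on every existing partition and automatically on future ones
CREATE INDEX idx_measurements_logdate ON measurements (logdate);
-- Partition pruning: the plan should only touch measurements_y2023m01
EXPLAIN SELECT * FROM measurements WHERE logdate = DATE '2023-01-15';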
Partition Management
Add new partitions, remove old ones using `CREATE TABLE ... PARTITION OF`, `ALTER TABLE ... ATTACH PARTITION`, `ALTER TABLE ... DETACH PARTITION`, `DROP TABLE`. Automate with tools.
Adding Partitions
- Direct Creation: Create a new table that is immediately a partition.
CREATE TABLE measurements_y2023m03 PARTITION OF measurements FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
- Attaching Existing Table: Attach a regular table (matching the parent's structure) as a new partition. Useful for bulk loading data into a separate table first. [docs]
Since v12 this requires only a `SHARE UPDATE EXCLUSIVE` lock on the parent (earlier versions took `ACCESS EXCLUSIVE`).
CREATE TABLE measurements_staging (...); -- Load data here
ALTER TABLE measurements ATTACH PARTITION measurements_staging FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
Removing Partitions
- Detaching Partition: Convert a partition back into a standalone regular table. The data remains but is no longer part of the partitioned table. Takes a brief `ACCESS EXCLUSIVE` lock (`DETACH PARTITION ... CONCURRENTLY`, v14+, avoids it but cannot run inside a transaction block). [docs]
ALTER TABLE measurements DETACH PARTITION measurements_y2023m01;
-- measurements_y2023m01 is now a regular table; archive or drop it later.
- Dropping Partition: Directly drop the partition table. This permanently deletes the data in that partition. Fast operation.
DROP TABLE measurements_y2023m01;
Automation
- Manual creation/dropping of range partitions (e.g., monthly) is tedious and error-prone.
- Use extensions like `pg_partman` which provide functions to automatically create new partitions ahead of time and drop old ones based on retention policies. Often scheduled via `pg_cron` or external schedulers.
Indexes & Constraints
- Remember to create indexes on new partitions. Using `CREATE INDEX ON parent_table ...` handles this automatically for future partitions.
- Foreign keys referencing partitioned tables are possible but have limitations. [limitations]
Concurrency & Locking
Isolation Levels
`READ COMMITTED` (default, statement-level snapshot), `REPEATABLE READ` (transaction snapshot, can fail), `SERIALIZABLE` (full serializability, more failures). [docs]
Levels & Guarantees
- `READ UNCOMMITTED`: Accepted syntactically but behaves exactly like `READ COMMITTED`; PostgreSQL never allows dirty reads.
- `READ COMMITTED` (Default): Guarantees no dirty reads. Each statement sees a snapshot of data committed *before that statement began*. Can experience non-repeatable reads (same query gets different results later in tx) and phantom reads (new rows appear later in tx). **Important Quirk:** Often sufficient, but be aware of its statement-level snapshot behavior. [docs]
- `REPEATABLE READ`: Guarantees no dirty reads or non-repeatable reads. Entire transaction sees a snapshot of data committed *before the transaction's first query began*. Although the SQL standard permits phantom reads at this level, PostgreSQL's snapshot-based implementation prevents them as well. **Quirk:** May fail with a serialization failure (SQLSTATE 40001, "could not serialize access due to concurrent update") when it tries to modify a row that another transaction changed after the snapshot was taken; it does *not* detect write skew (that requires `SERIALIZABLE`). Requires application retry logic. [docs]
- `SERIALIZABLE`: Strongest level. Guarantees transactions behave as if they executed one after another sequentially. Prevents all anomalies (dirty, non-repeatable, phantom). **Quirk:** More likely to experience serialization failures than `REPEATABLE READ` due to stricter checking. Requires application retry logic. Higher performance overhead. [docs]
Choosing a Level
- Stick with `READ COMMITTED` unless specific read consistency across multiple statements within a transaction is required.
- Use `REPEATABLE READ` or `SERIALIZABLE` for multi-statement transactions requiring a consistent view or complex read-modify-write operations where race conditions or anomalies are possible. Be prepared to handle serialization failures by retrying the transaction.
Setting Level
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
Locking Mechanisms
Implicit row-level via MVCC/DML. Explicit `SELECT FOR UPDATE/SHARE`, `LOCK TABLE`. Application-level Advisory Locks. Monitor with `pg_locks`.
Row-Level Locks (Implicit & Explicit)
- Managed automatically by DML commands (e.g., `UPDATE` takes a `ROW EXCLUSIVE` table-level lock plus exclusive row-level locks on the rows it modifies).
- `SELECT ... FOR UPDATE`/`FOR NO KEY UPDATE`/`FOR SHARE`/`FOR KEY SHARE`: Explicitly lock rows returned by a `SELECT` statement, preventing other transactions from modifying or locking them incompatibly until the current transaction ends. Used to prevent race conditions in read-modify-write cycles. Options `NOWAIT` and `SKIP LOCKED`.
Table-Level Locks (Explicit)
- `LOCK TABLE table_name [IN lock_mode]`: Explicitly lock an entire table.
- Modes (increasing strength): `ACCESS SHARE`, `ROW SHARE`, `ROW EXCLUSIVE`, `SHARE UPDATE EXCLUSIVE`, `SHARE`, `SHARE ROW EXCLUSIVE`, `EXCLUSIVE`, `ACCESS EXCLUSIVE`. See Lock Compatibility Matrix.
- Use with caution! Can severely impact concurrency. Needed for some DDL operations or specific application logic.
Advisory Locks
- Application-defined locks managed by function calls (e.g., `pg_advisory_lock(key)`, `pg_advisory_xact_lock(key)`). [docs]
- Based on arbitrary numbers (e.g., derived from an application entity ID). Not tied to specific tables or rows.
- Useful for coordinating application-level processes or locking concepts not directly represented by a single table row (e.g., limiting concurrent processing of a job type).
- Transaction-scoped (`pg_advisory_xact_lock`) or session-scoped (`pg_advisory_lock`). Remember to unlock session locks!
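A minimal sketch; the key (`42` here) is an arbitrary application-chosen bigint:
SELECT pg_try_advisory_lock(42);     -- session-level; returns false if another session holds it
SELECT pg_advisory_unlock(42);       -- must be released explicitly
BEGIN;
SELECT pg_advisory_xact_lock(42);    -- transaction-level; released automatically at COMMIT/ROLLBACK
COMMIT;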
Deadlocks
- Occur when two (or more) transactions wait for locks held by each other.
- Postgres automatically detects deadlocks and aborts one of the transactions (raising an error). [docs]
- **Diagnosis:** Check PostgreSQL logs (if `log_lock_waits` is enabled). Query `pg_locks` and `pg_stat_activity` during contention.
- **Prevention:** Acquire locks on multiple objects in a consistent, globally defined order across all application code. Use coarser locks (e.g., table lock) if fine-grained locking proves too complex. Reduce transaction duration. Use `SELECT ... FOR UPDATE SKIP LOCKED` for queue processing patterns.
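A sketch of the `SKIP LOCKED` queue pattern mentioned above, assuming a hypothetical `jobs` table; concurrent workers each claim a different pending row without blocking on one another:
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;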
Monitoring
Use the `pg_locks` view to inspect currently held locks and waiting transactions. Query `pg_stat_activity` to see `wait_event_type` and `wait_event` (e.g., `Lock`, `LWLock`).
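A quick way to see who is blocked and by whom (`pg_blocking_pids()` is available since 9.6):
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;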
Performance & Maintenance
VACUUM
Essential maintenance! Reclaims space from dead rows (bloat), updates visibility map, prevents TXID Wraparound. Autovacuum is critical but needs monitoring/tuning. [docs]
Why VACUUM is Needed (MVCC)
Because MVCC creates new row versions for UPDATEs/DELETEs, the old ("dead") versions remain physically until `VACUUM` runs. Without it, tables grow indefinitely (bloat) and performance degrades. [routine vacuuming]
What VACUUM Does
- Reclaims Space: Marks space occupied by dead tuples as reusable for future INSERTs/UPDATEs within the same table (standard `VACUUM` doesn't typically return space to the OS).
- Updates Visibility Map (VM): Marks blocks containing only visible-to-all tuples, enabling Index-Only Scans.
- Updates Statistics: Optionally runs `ANALYZE` (`VACUUM ANALYZE`).
- Prevents TXID Wraparound: "Freezes" old tuple transaction IDs to prevent comparison issues when the TXID counter wraps around (every ~4 billion transactions). Critical for database operation.
VACUUM vs VACUUM FULL
- `VACUUM [tablename]`: Standard vacuum. Runs concurrently with reads/writes (minimal locking). Reclaims space within the table. **Run regularly.**
- `VACUUM FULL [tablename]`: Rewrites the entire table to a new file, reclaiming maximum space (including returning to OS). Requires `ACCESS EXCLUSIVE` lock (blocks all access). Use sparingly, only when severe bloat needs aggressive reclaiming. Consider `pg_repack` extension as an online alternative.
Autovacuum
- Background process that automatically runs `VACUUM` and `ANALYZE` on tables based on thresholds (number of dead tuples, inserts/updates/deletes). [docs]
- **Essential for health.** Default settings are often too conservative for busy databases. Requires tuning (`autovacuum_max_workers`, `autovacuum_vacuum_scale_factor` / `_threshold`, `autovacuum_analyze_threshold` / `_scale_factor`, etc. [config]). Monitor its activity!
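Per-table storage parameters are the usual tuning knob for large, busy tables; a sketch assuming a hypothetical `big_events` table:
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.01,   -- vacuum after ~1% of rows are dead (default 20%)
    autovacuum_vacuum_threshold    = 1000
);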
ANALYZE
Collects statistics about table data distribution (histograms, distinct values) used by the query planner to choose optimal execution plans. Run automatically by Autovacuum. [docs]
Why ANALYZE is Needed
The query planner needs accurate information about the data in tables to make good decisions (e.g., estimate how many rows a `WHERE` clause will return, decide between index scan vs sequential scan, choose join methods). Outdated stats lead to bad plans. [planner stats]
What ANALYZE Collects
- Total number of rows (`reltuples` in `pg_class`).
- Number of distinct values per column (`n_distinct`).
- Most common values (MCVs) and their frequencies.
- Histograms representing data distribution for range comparisons.
- Correlation between physical row order and column values (for BRIN indexes and plan costing).
- Stored in system catalog `pg_statistic`.
When to Run
- After significant data changes (bulk loads, deletes, updates).
- After creating indexes on expressions.
- Regularly on all tables (Autovacuum usually handles this).
- Manually run `ANALYZE tablename;` or `ANALYZE VERBOSE tablename;` if query plans seem poor or after large data modifications before autovacuum kicks in.
Autovacuum & ANALYZE
Autovacuum typically triggers `ANALYZE` based on a percentage of rows changed (`autovacuum_analyze_scale_factor` + `autovacuum_analyze_threshold`). Tuning may be required, especially for large tables where the default % is too high. [docs]
Statistics Target
Control the detail level of statistics (number of histogram buckets, MCVs) per column using `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS n;`, which overrides the global `default_statistics_target` (default 100).
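For example, raising the target on a heavily filtered column (the new target takes effect at the next `ANALYZE`):
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;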
EXPLAIN
Analyze query execution plans. `EXPLAIN ANALYZE` runs the query and shows actual times/rows. Use `BUFFERS` for cache/IO info. Essential diagnostic tool. [docs]
Purpose
Understand how PostgreSQL intends to execute (or actually executed) a query. Essential for diagnosing performance issues and validating index usage. [using explain]
Basic Usage
- `EXPLAIN SELECT ...`: Shows the planner's *estimated* execution plan, costs, and row counts without running the query. Safe to use anytime.
- `EXPLAIN ANALYZE SELECT ...`: **Runs the query** and shows the actual execution plan, including actual times per node, actual row counts, loops, etc. Use cautiously on production for long-running or modifying queries.
Key Options
- `BUFFERS`: With `ANALYZE`, shows buffer usage (shared hits, reads, dirtied, written; temp reads/written) per node. Crucial for identifying I/O bottlenecks and cache efficiency.
EXPLAIN (ANALYZE, BUFFERS) ...
[docs]
- `WAL`: With `ANALYZE`, shows WAL record generation counts and sizes.
- `SETTINGS`: Shows non-default configuration settings affecting the plan.
- `COSTS`: Enable/disable display of estimated costs (default true). `EXPLAIN (COSTS false) ...`
- `TIMING`: Enable/disable display of actual timing per node with `ANALYZE` (default true). `EXPLAIN (ANALYZE, TIMING false) ...`
- `VERBOSE`: Shows more details, like output column lists per node, schema-qualified names.
- `SUMMARY`: Enable/disable summary info (planning/execution time) with `ANALYZE` (default true).
- `FORMAT {TEXT|XML|JSON|YAML}`: Choose output format (default TEXT). JSON is useful for programmatic analysis or plan visualizers.
Interpreting Output
- Read plans from inside-out / bottom-up (most indented node executes first).
- Compare estimated rows vs actual rows (large discrepancies suggest outdated statistics or planner limitations).
- Identify slow nodes (high `actual time`).
- Check for Sequential Scans on large tables where Index Scans or Bitmap Heap Scans were expected.
- Analyze buffer hits/reads (`Buffers:` line) to understand caching effectiveness (high reads indicate poor caching or cold cache).
- Look for expensive Sorts or Hash operations (check `work_mem`).
- [Interpreting ANALYZE]
Tools
Visualizers like explain.depesz.com, explain.dalibo.com, PEV make complex plans easier to read (use `FORMAT JSON`).
Statistics Views & Extensions
Monitor activity: `pg_stat_activity`, `pg_stat_statements`, `pg_stat_*_tables/indexes`, `pg_locks`. Use `pg_buffercache` for cache inspection.
Key Views for Monitoring
- `pg_stat_activity`: Shows information about currently active backend processes/connections.
- Columns: `pid`, `datname`, `usename`, `client_addr`, `backend_start`, `query_start`, `state` (active, idle, idle in transaction), `wait_event_type`/`wait_event`, `query`, `backend_type`.
- Essential for seeing long-running queries, idle transactions, waiting connections (lock contention).
- `pg_stat_statements` (`CREATE EXTENSION pg_stat_statements;`): Tracks execution statistics for all normalized queries executed. **Crucial for identifying expensive queries.**
- Requires setting `shared_preload_libraries = 'pg_stat_statements'` in `postgresql.conf` and a server restart. Tune settings like `pg_stat_statements.max`.
- Columns: `queryid`, `query`, `calls`, `total_exec_time`, `mean_exec_time`, `rows`, `shared_blks_hit`/`read`/`dirtied`/`written`, `temp_blks_read`/`written`, `wal_records`, `wal_bytes`.
- Use `pg_stat_statements_reset()` to clear stats.
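A typical "top queries by total time" query against the view (column names as of v13; older versions use `total_time`/`mean_time`):
SELECT calls,
       round(total_exec_time) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;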
- `pg_stat_user_tables` / `pg_stat_all_tables`: Shows table access statistics.
- Columns: `seq_scan`, `seq_tup_read`, `idx_scan`, `idx_tup_fetch`, `n_tup_ins`/`upd`/`del`/`hot_upd`, `n_live_tup`, `n_dead_tup`, `last_vacuum`/`autovacuum`, `last_analyze`/`autoanalyze`.
- Useful for monitoring table bloat (`n_dead_tup`), scan types, vacuum/analyze activity.
- `pg_statio_user_tables` / `pg_statio_all_tables`: Shows table I/O statistics (buffer cache).
- Columns: `heap_blks_read`/`hit`, `idx_blks_read`/`hit`, `toast_blks_read`/`hit`, `tidx_blks_read`/`hit`.
- Helps identify tables causing heavy I/O or poor caching (`*_hit` vs `*_read` ratio).
- `pg_locks`: Shows currently held locks and waiting transactions. Useful for diagnosing locking contention and deadlocks.
- `pg_stat_replication`: Monitor streaming replication status and lag on the primary.
- `pg_stat_wal_receiver`: Monitor streaming replication status on the standby.
- `pg_buffercache` Extension (`CREATE EXTENSION pg_buffercache;`): Provides a detailed view of blocks currently held in `shared_buffers`. Useful for advanced cache analysis. Requires superuser privileges.
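A simple cache-efficiency sketch using `pg_statio_user_tables` (a low hit percentage on a hot table suggests undersized `shared_buffers` or cold data):
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;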
Configuration Tuning (`postgresql.conf`)
Key settings: memory (`shared_buffers`, `work_mem`), WAL (`max_wal_size`), Checkpoints, Autovacuum. Use `pgtune` as starting point. Reload/Restart needed.
Configuration is primarily done via `postgresql.conf` (requires reload or restart for many settings) and `pg_hba.conf` (authentication, requires reload). Use `ALTER SYSTEM SET ...` to modify `postgresql.conf` via SQL (writes to `postgresql.auto.conf`, overrides main file).
Key Parameter Groups
- Memory: (See Memory card) `shared_buffers`, `work_mem`, `maintenance_work_mem`, `effective_cache_size`. Critical for performance.
- WAL: `wal_level` (`replica` is default, needed for replication/PITR), `max_wal_size` & `min_wal_size` (control WAL disk usage before checkpoint), `wal_buffers`, `commit_delay`, `commit_siblings`, `synchronous_commit`. Affects durability and write performance.
- Checkpoints: `checkpoint_timeout` (max time between checkpoints), `checkpoint_completion_target` (spread checkpoint I/O over time, e.g., 0.9). Controls recovery time vs. I/O spikes.
- Autovacuum: (See VACUUM card) `autovacuum`, `autovacuum_max_workers`, `autovacuum_naptime`, `autovacuum_vacuum_threshold`, `autovacuum_vacuum_scale_factor`, `autovacuum_analyze_threshold`, `autovacuum_analyze_scale_factor`, `autovacuum_vacuum_cost_delay`, `autovacuum_vacuum_cost_limit`, `log_autovacuum_min_duration`. **Requires careful tuning for busy systems.**
- Planner: `random_page_cost` (tune based on storage type, lower for SSDs, e.g., 1.1), `seq_page_cost` (usually 1.0), `effective_cache_size`, `cpu_tuple_cost`, `cpu_index_tuple_cost`, `cpu_operator_cost`, `enable_*` flags (e.g., `enable_bitmapscan`). Influences query plan choices.
- Connections: `max_connections` (set high enough for clients + pooler, mindful of memory), `superuser_reserved_connections`.
- Logging: `log_destination`, `logging_collector`, `log_directory`, `log_filename`, `log_statement` (use carefully, e.g. `ddl` or `mod`), `log_min_duration_statement` (crucial for finding slow queries), `log_checkpoints`, `log_lock_waits`, `log_temp_files`, `log_autovacuum_min_duration`. Essential for troubleshooting.
Tools & Approach
- Start with defaults or recommendations from `pgtune` (online tool or script) based on system resources.
- Monitor performance (using stats views, OS tools) and adjust parameters iteratively based on workload.
- Understand the impact of each parameter before changing it. Read the documentation! Check the `context` in `pg_settings` (needs reload/restart?).
- Use `SHOW parameter_name;` or query `pg_settings` view to see current values.
- Use `SELECT pg_reload_conf();` or `pg_ctl reload` to apply changes that don't require a restart. Use `pg_ctl restart` for others.
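A minimal change-and-verify sketch using `ALTER SYSTEM`:
ALTER SYSTEM SET work_mem = '64MB';    -- written to postgresql.auto.conf
SELECT pg_reload_conf();               -- re-reads config files; 'postmaster' context settings still need a restart
SELECT name, setting, unit, context FROM pg_settings WHERE name = 'work_mem';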
Bloat Monitoring & Remediation
MVCC leads to dead tuples (bloat) in tables/indexes. Monitor using queries or extensions. Remediate with VACUUM, REINDEX, or `pg_repack`.
What is Bloat?
Unused space in table and index files resulting from accumulated dead tuples (from `UPDATE`/`DELETE`) that haven't been fully reclaimed by `VACUUM`. Reduces performance and wastes disk space.
Detecting Bloat
- **Queries:** Use queries comparing the physical size (`pg_total_relation_size`) with estimated live data size. Many community scripts exist (search for "postgres bloat query"). These are estimates. Example snippet:
SELECT ..., pg_size_pretty(bloat_size) AS bloat, pg_size_pretty(table_size - bloat_size) AS live_data FROM ( ... complex query using pg_class, pg_statistic ... ) ...
- **`pgstattuple` Extension** (`CREATE EXTENSION pgstattuple;`): Provides functions (`pgstattuple`, `pgstatindex`) to scan relations and give precise statistics on dead tuples, free space, etc. Can be I/O intensive.
- **Monitoring `n_dead_tup`:** Track `n_dead_tup` in `pg_stat_user_tables`. High numbers indicate potential for bloat if `VACUUM` isn't keeping up.
Remediation
- **Regular `VACUUM` / Autovacuum Tuning:** The primary preventative measure. Ensure autovacuum is running frequently and effectively enough for your workload (command examples follow this list).
- **Manual `VACUUM [tablename]`:** Can help clear out dead tuples if autovacuum is lagging. Does not typically shrink files significantly.
- **`REINDEX INDEX index_name [CONCURRENTLY]`:** Rebuilds a specific index, removing bloat from the index. Concurrent option recommended.
- **`VACUUM FULL tablename;`:** Rewrites the entire table and its indexes, removing all bloat and shrinking files. Requires `ACCESS EXCLUSIVE` lock (downtime). Use sparingly.
- **`pg_repack` Extension:** Popular tool to perform an online `VACUUM FULL`-like operation with minimal locking (creates a new table copy, applies changes via triggers, then swaps). Requires installing the extension and running the command-line tool.
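Hedged examples of the commands above (object names are placeholders):
VACUUM (VERBOSE, ANALYZE) my_table;             -- reclaim dead tuples, refresh statistics
REINDEX INDEX CONCURRENTLY my_table_email_idx;  -- rebuild a bloated index without blocking writes (PG12+)
-- VACUUM FULL my_table;                        -- last resort: rewrites the table under ACCESS EXCLUSIVE lock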
OS-Level Tuning Hints
OS configuration impacts DB performance. Consider filesystem, huge pages, swappiness, read-ahead, CPU governor.
Key Areas (Linux Focused Examples)
- **Filesystem:**
- Prefer `XFS` or `ext4` for data directories.
- Mount options: `noatime`, `nodiratime` can reduce minor I/O overhead. Ensure barriers are enabled (default).
- **Transparent Huge Pages (THP):**
- Often recommended to be **disabled** (`never`) for database workloads, especially older kernels. Can cause latency spikes and memory fragmentation issues. [docs reference]
- Check: `cat /sys/kernel/mm/transparent_hugepage/enabled`
- Disable: Add `transparent_hugepage=never` to kernel boot parameters or via `systemd` service unit.
- **Swappiness (`vm.swappiness`):**
- Controls how aggressively the kernel swaps memory. Default is often 60.
- For dedicated DB servers with ample RAM, lowering this (e.g., 1, 10) discourages swapping of database cache, preventing performance cliffs. Don't set to 0 unless you fully understand implications.
- Set via `sysctl vm.swappiness=10` (temporary) or in `/etc/sysctl.conf` (persistent).
- **Virtual Memory Overcommit:**
- `vm.overcommit_memory`: Often set to `2` (don't overcommit) along with configuring `vm.overcommit_ratio` appropriately to prevent OOM killer issues if memory usage spikes unexpectedly. Needs careful calculation based on RAM + Swap. [docs]
- **Block Device Read-Ahead:** Check with `blockdev --getra /dev/sdX`, set with `blockdev --setra 256 /dev/sdX`. Setting appropriate read-ahead (e.g., 256 or 512 sectors = 128K/256K) for the data disk can sometimes improve large sequential scan performance. Depends heavily on storage type and workload. Benchmark changes.
- **CPU Governor:**
- Ensure the CPU frequency scaling governor is set to `performance` rather than `ondemand` or `powersave` for consistent database performance on dedicated servers.
- Check: `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`
- Set via tools like `cpupower` or kernel parameters.
General Advice
These are starting points. Always monitor system performance (CPU, memory, I/O with tools like `vmstat`, `iostat`, `top`/`htop`) and benchmark changes carefully. Defaults are often not optimal for heavy database workloads.
Common Query Anti-Patterns
Avoid common pitfalls: non-SARGable predicates, `SELECT *`, N+1 queries, large `IN` lists, function calls in `WHERE` without indexes.
- **Non-SARGable Predicates:** Conditions in `WHERE` clauses that prevent the planner from using an index effectively ("SARGable" = Search ARGument Able). Rewrite examples follow this list.
- Applying functions to indexed columns: `WHERE lower(email) = '...'` (Fix: use expression index or store lowercase). `WHERE date_trunc('day', created_at) = '...'` (Fix: use range query `WHERE created_at >= '...' AND created_at < '...'`).
- Wildcard prefix in `LIKE`: `WHERE name LIKE '%Smith'` (Fix: consider full-text search or reverse index if suffix search needed).
- Numeric/Date operations on column: `WHERE amount * 1.1 > 100` (Fix: rewrite as `WHERE amount > 100 / 1.1`).
- **`SELECT *`:** Selecting all columns when only a few are needed.
- Wastes network bandwidth and server/client memory.
- Prevents Index-Only Scans.
- Makes code brittle if table structure changes.
- Fix: Explicitly list required columns (`SELECT id, name, email FROM users ...`).
- **N+1 Query Problem:** Common in ORMs or application loops. Fetching a list of parent items (1 query), then looping through them to fetch related child items one by one (N queries).
- Causes excessive database round trips and load.
- Fix: Use `JOIN` in the initial query, use `IN` clauses with fetched IDs, or use dataloader patterns.
- **Large `IN` Lists / Many `OR` Conditions:** `WHERE id IN (1, 2, ..., 10000)` or `WHERE col='a' OR col='b' OR ...`.
- Can lead to poor plan choices or excessive planning time.
- Fix: Use `JOIN` against a temporary table or `VALUES` list containing the IDs/values. For `OR`, consider `UNION ALL` or rewriting logic.
- **Implicit Type Casting:** Relying on Postgres to cast types in comparisons (e.g., comparing `integer` column to a string).
- Can sometimes prevent index usage.
- Fix: Ensure application sends parameters with matching types or use explicit `CAST`.
- **Overuse of `OFFSET` for Pagination:** `LIMIT x OFFSET y` becomes increasingly slow as `y` grows, because the server still needs to compute and discard `y` rows.
- Fix: Use keyset/seek pagination (based on remembering the last seen value of an ordered column: `WHERE indexed_col > last_value ORDER BY indexed_col LIMIT x`). See the sketch after this list.
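Two of the rewrites above, sketched against a hypothetical `users` table:
-- Non-SARGable vs. SARGable date filter
SELECT id FROM users WHERE date_trunc('day', created_at) = DATE '2024-01-01';  -- defeats a plain index on created_at
SELECT id FROM users
WHERE created_at >= DATE '2024-01-01'
  AND created_at <  DATE '2024-01-02';                                         -- range predicate can use the index
-- Expression index for case-insensitive lookups
CREATE INDEX users_email_lower_idx ON users (lower(email));
SELECT id FROM users WHERE lower(email) = 'alice@example.com';
-- Keyset pagination instead of OFFSET
SELECT id, name FROM users
WHERE id > 10000   -- last id seen on the previous page
ORDER BY id
LIMIT 50;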
Extensibility (Postgres Superpower)
Stored Procedures/Functions
Write server-side logic. Default is `PL/pgSQL`. Others include `PL/Python`, `PL/v8`, `SQL`, `C`. [docs]
Purpose
Encapsulate business logic, perform complex operations atomically within the database, reduce network round trips, enforce complex constraints or security logic.
Languages
- `PL/pgSQL`: Default, procedural language similar to Oracle's PL/SQL. Block-structured, variables, control flow (IF, LOOP), exception handling. Widely used and robust. [docs]
- `PL/Python` (`plpython3u`): Write functions in Python. Access to Python libraries. Untrusted (`u`) means it runs outside the sandbox restrictions of trusted languages, and only superusers can create functions in it; core PostgreSQL ships no trusted PL/Python variant, so use with care. Useful for data analysis, complex string manipulation, external API calls (via libraries). [docs]
- `PL/v8` (`plv8`): Write functions in JavaScript (using the V8 engine). Useful for JSON processing, web-related logic. Needs separate installation. [repo]
- `SQL`: Simple functions written purely in SQL. Can often be inlined by the planner for better performance. Good for simple wrappers or calculations. [docs]
- `C`: Highest performance, allows low-level access, but more complex development, compilation, and deployment. [docs]
- Others: `PL/Perl`, `PL/Tcl`, `PL/Java`, etc.
Creating Functions
CREATE OR REPLACE FUNCTION get_user_count()
RETURNS integer AS $$ -- Dollar Quoting
DECLARE
user_count integer;
BEGIN
SELECT count(*) INTO user_count FROM users;
RETURN user_count;
END;
$$ LANGUAGE plpgsql VOLATILE SECURITY INVOKER; -- Defaults
Procedures (Postgres 11+)
Use `CREATE PROCEDURE ... LANGUAGE ... AS $$ ... $$;`. Unlike functions, procedures do not return a value directly and *can* issue transaction control commands (`COMMIT`, `ROLLBACK`) in their body, provided the procedure is invoked with `CALL my_proc();` outside an explicit transaction block. [Tx Ctrl]
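A minimal sketch (the `audit_log` table is hypothetical); the in-procedure `COMMIT` works only when `CALL` itself is not run inside an explicit transaction block:
CREATE OR REPLACE PROCEDURE purge_old_audit_rows()
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM audit_log WHERE log_time < now() - interval '90 days';
    COMMIT;  -- allowed in a procedure, not in a function
END;
$$;
CALL purge_old_audit_rows();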
Function Volatility & Security
- Volatility (`VOLATILE`, `STABLE`, `IMMUTABLE`): Hints to the planner about function side effects and result consistency. Crucial for performance and correct use in indexes. [docs]
- Security (`SECURITY INVOKER` vs `SECURITY DEFINER`): Determines if the function runs with the privileges of the *calling user* (INVOKER, default) or the *user who defined the function* (DEFINER). See Security section card. [docs]
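For example, only an `IMMUTABLE` function may be used in an index expression (a sketch against a hypothetical `users` table):
CREATE OR REPLACE FUNCTION normalize_email(e text)
RETURNS text
LANGUAGE sql IMMUTABLE
AS $$ SELECT lower(trim(e)) $$;
-- IMMUTABLE is required for use in an index expression
CREATE INDEX users_email_norm_idx ON users (normalize_email(email));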
Triggers
Execute a function automatically on DML events (`INSERT`, `UPDATE`, `DELETE`). Use `BEFORE`/`AFTER`, Row/Statement level. Complex logic, use carefully. [docs]
Concept
Triggers define that a specific function (trigger function, must return type `trigger`) should be executed automatically when a certain DML event occurs on a specific table.
Creating Triggers
- Create Trigger Function: Usually in PL/pgSQL. Receives context via special variables (`TG_OP`, `NEW`, `OLD`). Must return `NEW` (for `BEFORE INSERT/UPDATE`), `OLD` (for `BEFORE DELETE`), or `NULL`.
CREATE OR REPLACE FUNCTION update_changetimestamp_func() RETURNS trigger AS $$
BEGIN
    NEW.changed_at = clock_timestamp();  -- Update column
    RETURN NEW;                          -- Return row to be inserted/updated
END;
$$ LANGUAGE plpgsql;
- Create Trigger Definition: Attach the function to the table and event. [docs]
CREATE TRIGGER set_changetimestamp
    BEFORE UPDATE ON my_table    -- Timing (BEFORE/AFTER/INSTEAD OF)
    FOR EACH ROW                 -- Level (ROW/STATEMENT)
    EXECUTE FUNCTION update_changetimestamp_func();
Key Options
- Timing (`BEFORE`, `AFTER`): Execute function before or after the row operation. `BEFORE` triggers can modify the `NEW` row or skip the operation (`RETURN NULL`). `AFTER` triggers see the final result but cannot change it directly (often used for auditing or cascading actions).
- Timing (`INSTEAD OF`): For Views only. Execute function *instead of* the DML operation on the view, allowing updates to non-updatable views.
- Level (`FOR EACH ROW`, `FOR EACH STATEMENT`): Row-level triggers fire once per affected row. Statement-level triggers fire once per SQL statement, regardless of rows affected.
- Events (`INSERT`, `UPDATE [OF column, ...]`, `DELETE`, `TRUNCATE`): Specify which DML operations activate the trigger.
- **Constraints:** Trigger behavior can interact with constraints (`DEFERRABLE INITIALLY DEFERRED`).
Use Cases
- Auditing changes.
- Maintaining derived/denormalized data.
- Enforcing complex business rules or constraints not possible with standard checks/foreign keys.
- Updating modification timestamps.
Pitfalls
- Can make data modification logic complex and hard to debug ("hidden" logic).
- Can significantly impact performance if trigger functions are slow or cause cascading effects.
- Risk of infinite recursion if triggers cause actions that fire the same trigger again.
- Consider alternatives (application logic, rules, procedures) before resorting to complex triggers.
Extensions
Package new types, functions, operators via `CREATE EXTENSION`. Huge ecosystem (PostGIS, TimescaleDB, `pg_stat_statements`, `pg_cron`, `pgaudit`). [docs]
Concept
Extensions bundle related SQL objects (types, functions, operators, index methods, etc.) into a single package that can be easily installed (`CREATE EXTENSION`) and removed (`DROP EXTENSION`) from a database. Simplifies deployment and version management of add-on features. [docs]
Managing Extensions
- `CREATE EXTENSION extension_name [SCHEMA schema_name] [VERSION version] [CASCADE];`: Installs the extension. Requires extension control/script files to be present on the server filesystem (usually via OS package manager or compilation).
- `\dx` (in psql): List installed extensions. `SELECT * FROM pg_extension;`
- `SELECT * FROM pg_available_extensions;`: List extensions available to be installed.
- `ALTER EXTENSION extension_name UPDATE [TO 'new_version'];`: Upgrades an extension to a newer installed version.
- `DROP EXTENSION extension_name [CASCADE|RESTRICT];`: Removes the extension and its objects. `CASCADE` also drops dependent objects.
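A typical workflow, assuming the extension's files are already installed on the server (`pg_stat_statements` additionally needs `shared_preload_libraries` and a restart):
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT extname, extversion FROM pg_extension;            -- installed in this database
SELECT name, default_version, installed_version
FROM pg_available_extensions WHERE name = 'pg_stat_statements';
ALTER EXTENSION pg_stat_statements UPDATE;               -- upgrade to the newest installed version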
Notable Extensions (Examples)
- PostGIS: Adds comprehensive support for geographic objects and spatial queries.
- TimescaleDB: Turns PostgreSQL into a powerful time-series database with hypertables and specialized functions.
- Citus: Distributes PostgreSQL horizontally for sharding and parallel query processing.
- `pg_stat_statements`: Tracks query execution statistics (essential for performance tuning).
- `pgcrypto`: Provides cryptographic functions (hashing, encryption).
- `uuid-ossp` / `pgcrypto`: Functions to generate UUIDs.
- `hstore`: Key-value data type.
- `pg_trgm`: Trigram matching for fuzzy string search (e.g., `LIKE '%word%'` alternative using GiST/GIN indexes).
- `pg_cron`: In-database job scheduler (like OS cron).
- `pg_repack`: Online table reorganization (alternative to `VACUUM FULL`).
- `pg_partman`: Automates time-based and serial partitioning management.
- `pgaudit`: Provides detailed session and object audit logging.
- `pgvector`: Vector similarity search for AI/ML applications.
- `bloom`: Bloom filter index access method.
- `pg_buffercache`: Inspect shared buffer content.
Finding Extensions
PostgreSQL Extension Network (PGXN), GitHub, vendor websites, cloud provider offerings.
Foreign Data Wrappers (FDWs)
Access external data sources as if they were local tables using `CREATE FOREIGN TABLE`. Connect to other PG dbs, MySQL, files, etc. [docs]
Concept
FDWs provide a standard SQL interface to query data residing outside the current PostgreSQL database, potentially even on different database systems or flat files. [PG Wiki FDW List]
Setup Steps
- Install FDW Extension: `CREATE EXTENSION postgres_fdw;` (for connecting to other PG dbs [postgres_fdw]) or `oracle_fdw`, `mysql_fdw`, `file_fdw`, `odbc_fdw`, `mongo_fdw`, etc.
- Create Server: `CREATE SERVER remote_pg_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'other_host', dbname 'other_db', port '5432');`
- Create User Mapping: `CREATE USER MAPPING FOR current_user SERVER remote_pg_server OPTIONS (user 'remote_user', password 'remote_pass');` (maps local user to remote credentials)
- Create Foreign Table(s): Define the structure of the remote table locally, or import schema definitions.
-- Manual Definition
CREATE FOREIGN TABLE remote_users (
    id integer OPTIONS (column_name 'user_id'),  -- Map column name if needed
    name text,
    email text
) SERVER remote_pg_server OPTIONS (schema_name 'public', table_name 'users');
-- Import Schema (Easier)
IMPORT FOREIGN SCHEMA remote_public_schema
    LIMIT TO (users, orders)  -- Optional: limit tables
    FROM SERVER remote_pg_server INTO local_schema;
Usage
Query the foreign table (`SELECT * FROM local_schema.remote_users;`) just like a local table. Postgres translates the query and pushes down operations (filtering, joins, aggregates if supported by the FDW) to the remote source where possible (check `EXPLAIN`). [querying]
Benefits
- Unified view of distributed data.
- Leverage PostgreSQL's SQL and features to query diverse sources.
- Can simplify data integration tasks without complex ETL.
- Some FDWs support writes (`INSERT`/`UPDATE`/`DELETE`) back to the foreign source.
Considerations
Performance depends heavily on the specific FDW, network latency, and how much processing can be pushed down to the remote server. Authentication and security need careful management. Transaction handling across systems can be complex (consider two-phase commit extensions if needed).
Replication & High Availability
Streaming Replication
Mechanism
- Standby server connects to the primary (using credentials defined in `primary_conninfo`) and streams WAL records via replication protocol (`walsender` on primary, `walreceiver` on standby).
- Standby continuously replays these WAL records to keep its data files nearly identical to the primary's. [docs]
Hot Standby (Read Replicas)
- If `hot_standby = on` (default) on the standby, it can serve read-only queries while continuously applying WALs. [docs]
- Read queries on the standby see a slightly delayed view of the data. Can configure `hot_standby_feedback` to prevent VACUUM on the primary from removing rows still needed by long-running standby queries. `max_standby_streaming_delay` / `max_standby_archive_delay` control how long WAL replay waits before cancelling standby queries that conflict with it.
Synchronous vs Asynchronous
- Asynchronous (Default): Primary commits transaction once WAL is written locally (`synchronous_commit = on` or `local`). Fastest, but potential for small data loss on primary crash if WAL hasn't reached standby.
- Synchronous: Primary waits for confirmation from one or more synchronous standbys (listed in `synchronous_standby_names`) that they have received/written/applied the WAL (depending on the `synchronous_commit` level: `remote_write`, `on` = flush on standby (the default synchronous level), or `remote_apply`) before confirming the commit to the client. Slower commits, but guarantees little or no data loss if a synchronous standby takes over.
- **Quorum Commit:** Configure `synchronous_standby_names` like `ANY N (stby1, stby2, ...)` to wait for any N standbys (example below).
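For example, a quorum-commit setup could be applied like this (standby names are placeholders; `remote_apply` is shown only to illustrate the strictest level):
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (stby1, stby2)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';  -- 'on' (remote flush) is the usual default
SELECT pg_reload_conf();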
Replication Slots
- Created on primary via `pg_create_physical_replication_slot()`. Dropped via `pg_drop_replication_slot()`. View with `pg_replication_slots`.
- Ensures the primary retains WAL segments needed by a specific standby, even if the standby disconnects temporarily. Prevents premature WAL deletion that would break replication. **Essential for robust replication.** [docs] Monitor disk usage on primary (`pg_wal`) if slots become inactive.
Other Concepts
- **Cascading Replication:** A standby can be configured to replicate from another standby, reducing load on the primary.
- **Monitoring:** Use `pg_stat_replication` on the primary and `pg_stat_wal_receiver` on the standby to check lag (`write_lag`, `flush_lag`, `replay_lag`); see the queries below.
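A monitoring sketch, run on the primary:
-- Per-standby replication progress and lag
SELECT application_name, client_addr, state, sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- Inactive slots retain WAL on the primary; watch pg_wal growth
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;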
Use Cases
Read scaling, High Availability (HA) failover, disaster recovery standby, near real-time data warehousing feed.
Logical Replication
Replicates logical data changes (row changes). Allows cross-version, selective replication via Publisher/Subscriber model. [docs]
Mechanism
- Uses logical decoding (`pgoutput` plugin is default) to convert WAL records into a stream of logical changes (BEGIN, COMMIT, INSERT, UPDATE, DELETE with row data). [decoding]
- Transmits these logical changes to subscribers which apply them as DML.
- Requires `wal_level = logical` on the publisher. Also needs sufficient `max_wal_senders` and `max_replication_slots`.
Publisher/Subscriber Model
- Publisher: The source database. Define a `PUBLICATION` specifying which tables (or `ALL TABLES`) and optionally which DML operations (`publish = 'insert, update'`) to publish. `CREATE PUBLICATION mypub FOR TABLE users, orders;` [docs]
- Subscriber: The target database. Define a `SUBSCRIPTION` connecting to the publisher and mapping to one or more publications. `CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (copy_data = false, create_slot = true);` [docs]
- Target tables must exist on the subscriber and have compatible schemas (usually identical names and data types, primary keys needed for UPDATE/DELETE).
- Uses replication slots on publisher automatically (`create_slot=true`).
Advantages over Streaming Replication
- Selective Replication: Replicate only specific tables or filter rows (using `WHERE` clause in publication, PG15+).
- Cross-Version/Platform Replication: Replicate between different major PostgreSQL versions or potentially to other systems.
- Consolidate/Distribute Changes: Multiple publishers can replicate to a single subscriber, or one publisher to many subscribers.
- **Writeable Subscriber:** Subscriber tables can be written to independently (can cause conflicts if not managed carefully).
Limitations
- Does not replicate DDL changes (schema changes must be applied manually on both sides, carefully). [restrictions]
- Does not replicate sequence changes or large objects directly.
- Can have higher performance overhead and replication lag than streaming replication.
- Initial data synchronization often requires separate step (`copy_data=true` option during `CREATE SUBSCRIPTION`, or manual `pg_dump`).
- Conflicts can occur if subscriber data is modified independently.
Use Cases
Selective data aggregation/distribution, feeding data warehouses, replicating between major versions during upgrades, zero-downtime upgrades (complex setup), data sharing between microservices.
HA / Failover Tools
While PostgreSQL provides the core replication mechanisms, automating the detection of primary failure and promotion of a standby requires external tooling. [PG Wiki Overview]
Common Tools
- Patroni: Popular Python-based template for building HA clusters. Uses a Distributed Configuration Store (DCS) like etcd, Consul, or Zookeeper for leader election and cluster state management. Manages PostgreSQL configuration and handles automated failover/switchover. Highly flexible and widely used, often in Kubernetes environments.
- repmgr: Open-source tool suite focused specifically on managing PostgreSQL streaming replication and failover. Provides monitoring, node management (`repmgrd` daemon), and failover/switchover commands. Can be simpler than Patroni for basic setups, less reliant on external DCS (but can integrate).
- Pgpool-II: Acts as a middleware proxy providing connection pooling, load balancing for read replicas, and automated failover capabilities (watchdog process). Can also parallelize queries. Often considered more complex to configure correctly for HA compared to dedicated failover managers.
- Cloud Provider Solutions: Managed services like AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL often have built-in HA features that handle failover automatically behind the scenes using their own control planes.
Key HA Concepts Managed by Tools
- Health Monitoring: Regularly checking the status of primary and standby nodes via SQL connections or other means.
- Failure Detection: Identifying when the primary server is unresponsive based on monitoring checks and timeouts.
- Leader Election/Standby Promotion: Choosing the best standby (e.g., least lag, priority setting) to promote to the new primary using `pg_promote()` (or `pg_ctl promote`).
- Topology Updates: Reconfiguring remaining standbys to follow the newly promoted primary (updating `primary_conninfo`).
- Client Redirection: Ensuring applications connect to the new primary (often via DNS update, virtual IP failover managed by tools like Keepalived, or interaction with a load balancer/proxy layer like PgBouncer/HAProxy configured by the HA tool).
- Split-Brain Prevention: Mechanisms (usually via the DCS in Patroni, or fencing scripts) to ensure only one node acts as the primary at any time, even during network partitions.
Backup & Recovery
Logical Backups (`pg_dump`)
Creates logical backups (SQL commands). Flexible, portable. Slower restore for large DBs. Not for PITR. Use `pg_dumpall` for globals. [docs]
Tool & Purpose
`pg_dump` utility connects to a database and generates a script file containing SQL commands (`CREATE TABLE`, `COPY` data, `CREATE INDEX`, etc.) needed to recreate the database objects and data.
Key Features & Options
- Formats: Plain text (`.sql`, `-Fp`), Custom (`.dump`, compressed, requires `pg_restore`, `-Fc`), Directory (`dir/`, parallel dump/restore, requires `pg_restore`, `-Fd`), Tar (`.tar`, requires `pg_restore`, `-Ft`). Custom/Directory formats are generally preferred for flexibility (parallel restore, object selection during restore).
- Consistency: Runs inside a single `REPEATABLE READ` transaction so the dump reflects a consistent snapshot; it holds `ACCESS SHARE` locks on the dumped tables (see `--lock-wait-timeout`), which conflict only with DDL, not with normal reads/writes.
- Granularity: Can dump entire DB, specific schemas (`-n schema`), specific tables (`-t table`). Exclude schemas/tables (`-N`, `-T`). Dump data-only (`-a`), schema-only (`-s`).
- Portability: Backups are generally portable across different machine architectures and PostgreSQL major versions (though compatibility issues can arise with very old/new versions).
- Restoration: Plain text restored via `psql < dump.sql`. Custom/Directory/Tar formats restored using `pg_restore` (allows parallel restore with `-j`).
`pg_dumpall`
- `pg_dumpall` utility dumps *all* databases in a cluster plus global objects (roles, tablespaces). Output is plain SQL format.
- Use `-g` for globals only, `-r` for roles only, `-t` for tablespaces only.
- Essential for backing up the complete cluster state including users/permissions.
Pros
- Very flexible for migrating data, restoring specific objects, upgrading major versions.
- Human-readable format (plain text).
Cons
- Can be slow to restore large databases because it replays all SQL commands and rebuilds indexes from scratch.
- Not suitable for Point-in-Time Recovery (PITR).
- `pg_dump` (not `pg_dumpall`) doesn't include global objects.
Example
pg_dump -U postgres -Fc -f my_database.dump my_database
pg_dumpall -U postgres -f globals.sql --globals-only
pg_restore -U postgres -d target_database -j 4 my_database.dump
Physical Backups (`pg_basebackup`)
Creates a binary copy of the database cluster files. Faster restore than `pg_dump`. Foundation for PITR & replication setup. [docs]
Tool & Purpose
`pg_basebackup` connects to a running primary or standby server using the replication protocol and copies the entire data directory (`PGDATA`) file by file.
Key Features & Options
- Binary Copy: Creates a filesystem-level copy of the database files at a specific point in the WAL stream.
- Consistency: Ensures a consistent snapshot by coordinating with the server (`pg_backup_start`/`stop`).
- WAL Inclusion (`-X`/`--wal-method`):
- `none`: Don't include WAL (requires separate WAL archiving for recovery).
- `fetch`: Include WAL files needed to make backup consistent (fetched after files copied).
- `stream`: Stream WAL files alongside data files during backup (preferred method, ensures needed WAL is captured).
- Format (`-F`/`--format`): `plain` (copy of `PGDATA`), `tar` (`-Ft`).
- Progress (`-P`/`--progress`). Checkpoint (`-c`/`--checkpoint=fast|spread`). Rate Limit (`--max-rate`).
Pros
- Much faster restoration for large databases compared to `pg_dump`, as no SQL needs re-execution or index rebuilding (just WAL replay from the backup checkpoint onwards).
- The necessary starting point for Point-in-Time Recovery (PITR).
- Can be used to easily set up new streaming replication standbys.
Cons
- Less flexible: Restores the entire database cluster. Not easy to restore single tables (requires restoring cluster elsewhere and extracting).
- Backup size is the full size of the data directory (unless using external tools with compression/deduplication).
- Generally not portable across major PostgreSQL versions or different machine architectures (endianness, block size).
Example
pg_basebackup -h source_host -U replicator -D /path/to/backup/dir -Ft -X stream -P -R
(Creates backup in tar format, streams WAL, shows progress, and writes basic recovery settings to output dir)
Point-in-Time Recovery (PITR)
Restore database to any specific moment using a physical base backup and continuous WAL Archiving. Requires careful setup and testing. [docs]
Concept
Combines a physical base backup with archived WAL segment files to replay database changes *beyond* the end of the base backup, stopping at a specific target point (timestamp, transaction ID, named restore point). [recovery]
Requirements
- Physical Base Backup: Taken using `pg_basebackup` or similar method (e.g., `pgBackRest`, `Barman`). [base backup]
- Continuous WAL Archiving: The primary server must be configured (`wal_level = replica` or higher, `archive_mode = on` or `always`, `archive_command`) to continuously copy completed WAL segments to a separate, safe archive location (e.g., NFS, S3 via helper scripts). [archiving]
Recovery Process (High Level)
- Restore the chosen physical base backup to a new data directory.
- Ensure the WAL archive is accessible from the recovery location.
- Configure recovery settings in `postgresql.conf` (PG12+) or create `recovery.conf` (pre-PG12).
- `restore_command`: Command to fetch WAL files from archive (e.g., `restore_command = 'cp /path/to/archive/%f %p'` or script using `aws s3 cp`, etc.). `%f`=filename, `%p`=path to copy to.
- Recovery Target: Specify *when* to stop recovery (e.g., `recovery_target_time`, `_xid`, `_lsn`, `_name`). Or `recovery_target = 'immediate'` to recover just past backup end.
- `recovery_target_action`: `pause` (default), `promote` (finish recovery and become primary), `shutdown`.
- Create a `recovery.signal` file (PG12+) in the data directory to trigger recovery mode on startup. (Remove `standby.signal` if present).
- Start PostgreSQL on the restored directory. It will enter recovery mode, replay WALs using `restore_command` until the target is reached, then perform the `recovery_target_action`.
Benefits
- Recover from logical errors (accidental deletes/updates) by restoring to a point *before* the incident occurred.
- Provides fine-grained disaster recovery capability, minimizing data loss.
Tools & Verification
Tools like `pgBackRest` or `Barman` significantly simplify managing base backups, WAL archiving, retention, and the PITR process. **Regularly test your backups and PITR procedures!**
Backup Management Tools
Tools like pgBackRest, Barman, WAL-G simplify backup/restore/PITR, adding features like parallel ops, incremental/diff backups, compression, validation.
While `pg_dump` and `pg_basebackup` provide basic backup capabilities, dedicated tools offer significant advantages for managing backups, especially in production environments.
Popular Tools
- **pgBackRest:** Feature-rich, reliable, high-performance tool.
- Features: Parallel backup/restore, full/incremental/differential backups, compression, checksums, backup validation, S3/Azure/GCS support, flexible retention policies, PITR automation, tablespace support.
- Widely adopted in the community. Actively developed.
- **Barman (Backup and Recovery Manager):** Python-based tool, integrates well with standard PG tools.
- Features: Manages base backups and WAL archiving, PITR, compression, remote backup/restore, replication slot management, backup catalog.
- Mature and stable option.
- **WAL-G:** Successor to WAL-E. Focuses on WAL archiving/fetching and base backups, often to cloud storage.
- Features: Parallel WAL push/fetch, compression, encryption, S3/GCS/Azure/Swift support, base backups, PITR support.
- Often used for its efficient WAL archiving capabilities.
Why Use Them?
- **Automation:** Simplify complex processes like WAL archiving, PITR setup, retention management.
- **Performance:** Parallel operations significantly speed up backup/restore for large databases.
- **Efficiency:** Incremental/differential backups save time and storage space compared to repeated full base backups. Compression reduces storage costs.
- **Reliability:** Features like checksums and validation help ensure backup integrity.
- **Cloud Integration:** Built-in support for major cloud storage providers.
Recommendation
For any serious production deployment requiring PITR, using a dedicated backup tool like `pgBackRest` or `Barman` is highly recommended over manual scripting of `pg_basebackup` and `archive_command`.
Verification
Regardless of the tool used, **regularly schedule and perform test restores** (ideally PITR tests) to ensure your backups are valid and your recovery procedures work as expected.
Security
Authentication (`pg_hba.conf`)
Host-Based Authentication controls *who* can connect from *where* using *which* method (e.g., `scram-sha-256`, `md5`, `peer`, `cert`, `ldap`). Order matters! [docs]
File Location & Reload
Located in the data directory (`PGDATA`). Requires a server reload (`pg_ctl reload` or `SELECT pg_reload_conf();`) to apply changes.
Format
Each line defines a connection rule: [docs]
type database user address method [options]
- `type`: Connection type (`local` for Unix sockets, `host` for TCP/IP whether or not SSL is used, `hostssl` for SSL-only TCP/IP, `hostnossl` for non-SSL TCP/IP only).
- `database`: Database name (`all`, specific name, comma-separated list, `@file`, `replication`).
- `user`: Role name (`all`, specific name, comma-separated list, `@file`).
- `address`: Client IP address range (CIDR notation like `192.168.1.0/24`, `samehost`, `samenet`, `all`). Required for `host*` types.
- `method`: Authentication method: [methods]
- `trust`: Allow connection unconditionally, without any password. Acceptable only for tightly controlled `local` connections, if at all; very dangerous otherwise.
- `reject`: Deny connection unconditionally.
- `scram-sha-256`: Salted Challenge Response Authentication (preferred password method since PG10).
- `md5`: Older challenge-response password method (less secure than SCRAM, avoid if possible).
- `password`: Send plain text password (avoid unless connection is SSL/TLS secured via `hostssl`).
- `peer`: For `local` connections, authenticate if OS username matches database role name.
- `ident`: For TCP/IP, query client's ident server (rarely used, often blocked).
- `ldap`, `gss`, `sspi`, `pam`, `radius`: Integrate with external auth systems.
- `cert`: Authenticate using client SSL certificate (`clientcert=verify-ca` or `verify-full` option).
- `options`: Method-specific options (e.g., `clientcert`, LDAP config, PAM service name).
Processing Order
Rules are processed sequentially from top to bottom. The *first* matching rule for the connection type, database, user, and source address is used. Place more specific rules *before* more general ones.
Best Practices
Be specific. Avoid `all all 0.0.0.0/0 trust`. Use `scram-sha-256` for passwords. Use `hostssl` to enforce SSL/TLS for remote TCP/IP connections. Use `peer` authentication for local superuser access where appropriate. Limit `replication` connections to specific users/addresses.
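After editing and reloading, the `pg_hba_file_rules` view (PG10+) shows how the server parsed the file and flags broken lines:
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules;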
SSL/TLS Encryption
Encrypt client-server communication. Configure server (`ssl=on`, certs) and enforce via `pg_hba.conf` (`hostssl`). Verify certs via client `sslmode`. [docs]
Purpose
Protect data in transit between clients and the database server from eavesdropping and tampering, especially over untrusted networks.
Server Configuration (`postgresql.conf`)
- `ssl = on`: Enable SSL/TLS support.
- `ssl_cert_file = 'server.crt'`: Path to the server's PEM certificate file.
- `ssl_key_file = 'server.key'`: Path to the server's PEM private key file. Permissions must be restrictive (e.g., `0600`, readable only by postgres user).
- `ssl_ca_file = 'root.crt'`: (Optional) Path to trusted Certificate Authority certs for verifying client certificates (if using `cert` auth).
- `password_encryption` should be `scram-sha-256` (or `md5`) even with SSL.
- [Server Files] Restart server after changing SSL settings.
Enforcing SSL/TLS (`pg_hba.conf`)
- Use connection type `hostssl` instead of `host` for specific rules. This forces clients matching that rule to use an SSL connection; non-SSL attempts will be rejected. [docs]
- Example: `hostssl all myuser 192.168.1.0/24 scram-sha-256` (Requires SSL for `myuser` from that subnet).
- Using `host` allows both SSL and non-SSL connections if `ssl=on` is set. Use `hostnossl` to explicitly allow only non-SSL (rarely needed).
Client Configuration
- Clients specify connection parameters to control SSL usage. [libpq client docs]
- Connection String/Libpq: `sslmode` parameter is crucial:
- `disable`: No SSL.
- `allow`: Try non-SSL first, then SSL if server requires.
- `prefer`: Try SSL first, then non-SSL if server doesn't support. (Default)
- `require`: Only try SSL. Fail if server doesn't support. (Doesn't verify cert).
- `verify-ca`: Only try SSL, verify server cert against trusted CA (`sslrootcert` parameter). Fail if no match.
- `verify-full`: Only try SSL, verify CA, *and* verify server hostname matches certificate CN/SAN. **Most secure.**
- Other parameters: `sslcert` (client cert), `sslkey` (client key), `sslrootcert` (path to trusted CA cert(s) on client).
Client Certificate Authentication
Require clients to present a valid certificate signed by a trusted CA (configured via `ssl_ca_file`): either use the `cert` auth method (authentication by certificate, which implies full verification) or add the `clientcert=verify-ca` / `verify-full` option to another auth method's `pg_hba.conf` rule. [cert auth]
`SECURITY DEFINER` Functions
Functions run with privileges of the *definer* (owner), not the *caller*. Powerful but dangerous. Secure carefully! [docs]
Concept
By default, functions run with `SECURITY INVOKER` semantics - they execute with the permissions of the user calling the function. Functions defined with `SECURITY DEFINER` execute with the privileges of the user who *owns* the function.
Use Cases
- Allowing less privileged users to perform specific, controlled actions on tables they don't normally have direct access to (e.g., insert into a specific log table, run a specific update).
- Abstracting complex operations involving multiple tables with different permissions.
Security Risks & Best Practices (CRITICAL!)
If not written carefully, `SECURITY DEFINER` functions can be exploited for privilege escalation.
- **Restrict `EXECUTE` Privilege:** `REVOKE EXECUTE ON FUNCTION my_secdef_func(...) FROM PUBLIC;` then `GRANT EXECUTE ON FUNCTION ... TO specific_role;`. Never grant `EXECUTE` to `PUBLIC` unless absolutely necessary and safe.
- **Set Secure `search_path`:** Inside the function, *always* set a safe `search_path` to prevent hijacking by objects in untrusted schemas: `SET search_path = pg_catalog, public;` (or just `pg_catalog` if `public` isn't needed). Do this at the function definition level: [secure path]
CREATE FUNCTION ... SECURITY DEFINER SET search_path = pg_catalog AS $$ ... $$ LANGUAGE ...;
- **Qualify All Object Names:** Explicitly schema-qualify all tables, functions, types, etc. accessed within the function (e.g., `SELECT * FROM myschema.mytable;`) unless the `search_path` is absolutely trustworthy (rare).
- **Validate Inputs:** Treat all function arguments as potentially hostile. Sanitize and validate them appropriately.
- **Principle of Least Privilege:** Ensure the function *owner* has only the minimum privileges necessary for the function's operation. Avoid owning `SECURITY DEFINER` functions as a superuser if possible.
Example (Conceptual)
-- Owner: admin_role (has INSERT on audit_log)
-- Caller: app_user (NO insert on audit_log)
CREATE FUNCTION log_action(user_id integer, action text)
RETURNS void AS $$
BEGIN
INSERT INTO audit_schema.audit_log(actor_id, action_desc, log_time)
VALUES (user_id, action, now());
END;
$$ LANGUAGE plpgsql SECURITY DEFINER
SET search_path = pg_catalog; -- Secure path!
REVOKE EXECUTE ON FUNCTION log_action(integer, text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION log_action(integer, text) TO app_user;
Encryption at Rest
Protect data stored on disk. Options: Full Disk Encryption (FDE) at OS/HW level (common), filesystem-level, or column-level via `pgcrypto`.
Goal
Prevent unauthorized access to database files (`PGDATA`, WAL, backups) if physical storage media is compromised.
Common Approaches
- **Full Disk Encryption (FDE):**
- Implemented at the Operating System level (e.g., LUKS on Linux, BitLocker on Windows, FileVault on macOS) or Hardware level (Self-Encrypting Drives - SEDs).
- Encrypts the entire block device containing the database files.
- Transparent to PostgreSQL. Protects against offline attacks (stolen drives).
- Does *not* protect against threats when the OS is running and the filesystem is mounted (e.g., malicious OS user, SQL injection).
- **Generally the recommended baseline approach.** Often required by compliance standards.
- **Filesystem-Level Encryption:**
- Some filesystems offer native encryption features (e.g., ZFS encryption, ext4 fscrypt).
- Encrypts files or directories within a mounted filesystem.
- May offer more granularity than FDE. Similar protection profile (primarily offline attacks).
- **Column-Level Encryption (using `pgcrypto`):** (see the sketch after this list)
- Encrypt specific sensitive columns within tables using functions like `pgp_sym_encrypt()` / `pgp_sym_decrypt()` (symmetric) or `pgp_pub_encrypt()` / `pgp_priv_decrypt()` (asymmetric).
- Provides protection even from privileged database users (e.g., DBAs) if they don't have the decryption key.
- Requires application logic to handle encryption/decryption. Keys must be managed securely outside the database.
- Significantly impacts query performance (cannot index encrypted data directly for searching, except with workarounds like hashing or specialized indexes on unencrypted components if feasible).
- Use case: Protecting highly sensitive specific fields (e.g., SSNs, credit card numbers - though PCI DSS has strict rules).
- **Transparent Data Encryption (TDE):**
- Some commercial forks or extensions of PostgreSQL offer TDE solutions that automatically encrypt/decrypt data files, often with external key management integration.
- Not part of standard PostgreSQL core.
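A minimal `pgcrypto` sketch for column-level encryption (table, column, and key are hypothetical; the key must be supplied and stored outside the database):
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE customer_secrets (
    customer_id integer PRIMARY KEY,
    ssn_enc     bytea   -- ciphertext; cannot be indexed or searched as plaintext
);
INSERT INTO customer_secrets (customer_id, ssn_enc)
VALUES (1, pgp_sym_encrypt('123-45-6789', 'app-supplied-key'));
SELECT pgp_sym_decrypt(ssn_enc, 'app-supplied-key') AS ssn
FROM customer_secrets WHERE customer_id = 1;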
Key Management
Securely managing encryption keys is crucial for FDE, filesystem, and column-level encryption. Use hardware security modules (HSMs), cloud KMS, or robust key management practices.
Auditing
Track database activity. Use standard logging (`log_statement`, `log_connections`, etc.) or dedicated extension `pgaudit` for detailed, structured logging.
Why Audit?
Security monitoring (detecting suspicious activity), compliance requirements (HIPAA, SOX, GDPR), troubleshooting application behavior.
Built-in Logging Parameters (`postgresql.conf`)
- `log_statement`: Logs executed SQL statements (`none`, `ddl`, `mod`, `all`). `all` is very verbose, use with caution.
- `log_min_duration_statement`: Log statements exceeding a duration (ms). Good for performance analysis, less for security audit.
- `log_connections` / `log_disconnections`: Log session start/end.
- `log_hostname`: Log client hostname (can add overhead).
- `log_line_prefix`: Customize log entry format (include user, db, pid, timestamp, etc. - essential context). Example: `'%m [%p] %q%u@%d/%a '`.
- Limitations: Can be hard to parse reliably, may miss some details, performance overhead with `log_statement=all`.
`pgaudit` Extension
- `pgaudit` provides detailed, structured audit logging. Requires `shared_preload_libraries = 'pgaudit'` and restart.
- **Features:**
- Logs specific event types (READ, WRITE, FUNCTION, ROLE, DDL, MISC).
- Fine-grained object-level auditing (log access only to specific tables).
- Logs statement text *and* parameters separately (optional).
- Structured output format (e.g., JSON) for easier parsing by log analysis tools (SIEM).
- **Configuration:** Via `postgresql.conf` parameters (e.g., `pgaudit.log = 'read, write'`, `pgaudit.log_parameter = on`, `pgaudit.role = 'auditor_role'`); see the sketch below.
- **Recommendation:** Generally preferred over `log_statement=all` for serious auditing needs due to structure and granularity.
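A sketch of enabling it, assuming `pgaudit` is already listed in `shared_preload_libraries` and the server has been restarted (the setting values are only examples):
CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'write, ddl, role';
ALTER SYSTEM SET pgaudit.log_parameter = on;
SELECT pg_reload_conf();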
Log Management
Ensure logs are stored securely, rotated appropriately, and ideally shipped to a central log management system for analysis and alerting.
Essential Tools
`psql` Command-Line
The indispensable interactive terminal. Master its meta-commands (`\?`, `\d`, `\l`, `\timing`, `\x`, `\e`, `\copy`, etc.) for efficiency. [docs]
Connecting
psql -h host -p port -U user -d dbname
Key Meta-Commands (`\`) [full list]
- Help & Info: `\?`, `\h [SQL]`, `\l[+]`, `\c[onnect] ...`, `\conninfo`, `\encoding [enc]`, `\password [user]`, `\q`.
- Object Inspection (`\d`): [docs `\d`] `\d[S+] [pat]`, `\dt` (tables), `\di` (indexes), `\ds` (sequences), `\dv` (views), `\dm` (mat views), `\df` (functions), `\dT` (types), `\dn` (schemas), `\du` / `\dg` (roles), `\dp` (privs), `\ddp` (def privs), `\dx` (extensions), `\dy` (event triggers), `\dL` (languages), `\des` (foreign servers), `\dew` (fdws).
- Query Exec & Format: `\timing`, `\x [on|off|auto]`, `\H` (HTML), `\watch [sec]`, `\g [file]`, `\s [file]`, `\i file`, `\e [file]`, `\ef [func]`, `\prompt name var`, `\set name val`, `\unset name`, `\echo text`, `\o [file]` (redirect output).
- Data I/O: `\copy table FROM/TO 'file' [opts]` (Client-side).
- **Transaction:** `\set AUTOCOMMIT off/on`, `BEGIN`, `COMMIT`, `ROLLBACK`.
Variables & Scripting
- Use `\set name value` to set variables. Reference with `:name` (raw substitution), `:'name'` (quoted as a literal), or `:"name"` (quoted as an identifier).
- Use `-v name=value` on command line.
- Conditional execution with `\if expr` ... `\elif expr` ... `\else` ... `\endif`.
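A small illustration (run inside `psql`; the table name is hypothetical):
\set min_id 100
\set tbl 'users'
SELECT id, name FROM :"tbl" WHERE id > :min_id;  -- :"name" quotes as identifier, :name substitutes raw
\if :{?min_id}
  \echo min_id is set
\endif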
Customization
`.psqlrc` file in home directory for custom settings (e.g., `\set PROMPT1 '%n@%m:%> '`), aliases (`\set`), startup commands.
Environment variables like `PGHOST`, `PGPORT`, `PGUSER`, `PGDATABASE`, `PGPASSWORD` (use `.pgpass` file instead for password). [Env Vars]
GUI Tools
Common Options
- pgAdmin:
- Official open-source graphical tool.
- Features: Server status monitoring, object browser/editor, SQL query tool (explain analyze visualization), data editor, backup/restore UI, user management, schema diff.
- Runs as a web application (pgAdmin 4). Can feel heavy sometimes.
- DBeaver:
- Free, open-source universal database tool (supports many DBs).
- Features: Connection management, SQL editor (auto-complete, formatting), visual query builder, data viewer/editor, ER diagrams, data transfer/migration tools, SSH tunneling.
- Popular choice due to versatility and performance. Community & Pro editions.
- DataGrip (JetBrains):
- Commercial database IDE from JetBrains.
- Features: Advanced SQL editor (introspection, refactoring, excellent code completion), schema navigation/comparison, data editor, import/export, VCS integration, SSH tunneling.
- Powerful, excellent editor. Requires license (free for students/OS projects). Included in IntelliJ Ultimate.
- Other Tools: Postico (macOS), TablePlus (Native GUI, multi-platform), Navicat (Commercial).
Choosing a Tool
Depends on preference, budget, OS, and required features. `psql` is essential for scripting and server access. GUIs are helpful for visual exploration, administration tasks, and complex query development/debugging.
Admin Command-Line Tools
Server-side tools: `pg_ctl` (start/stop/reload), `initdb` (create cluster), `pg_controldata` (read control file), `pgbench` (benchmark).
These tools are typically found in the PostgreSQL `bin` directory and operate directly on the database cluster or server process.
- **`pg_ctl`:** Utility for initializing, starting, stopping, restarting, reloading configuration, promoting standbys, and getting status of a PostgreSQL server.
- Syntax: `pg_ctl -D /path/to/PGDATA start|stop|restart|reload|status|promote`
- Often used by service management scripts (`systemd`, `init.d`).
- **`initdb`:** Creates a new PostgreSQL database cluster (the `PGDATA` directory structure and template databases). Run only once to set up a new instance.
- Options: `-D` (data dir), `-U` (superuser name), `-W` (prompt for password), `--locale`, `--encoding`.
- **`pg_controldata`:** Displays control information stored in `PGDATA/global/pg_control`, such as WAL segment info, checkpoint location, cluster state, database OID. Useful for diagnostics. Run while server is stopped.
- Syntax: `pg_controldata /path/to/PGDATA`
- **`pg_resetwal` (formerly `pg_resetxlog`):** Emergency recovery tool to reset WAL information. **Extremely dangerous!** Can cause data loss/corruption. Use only as a last resort if the server won't start due to WAL corruption and you understand the consequences. Requires the server to be stopped.
- **`pgbench`:** Simple benchmarking tool included with PostgreSQL.
- Initialize test tables: `pgbench -i [options] dbname`
- Run standard TPC-B like test: `pgbench [options] dbname`
- Run custom scripts: `pgbench -f script.sql [options] dbname`
- Options: `-c` (clients), `-j` (threads), `-T` (duration), `-P` (progress report interval), `-n` (no vacuum).
- Useful for basic performance testing and comparing configuration changes.
Key Quirks & Terminology Recap
Identifier Case Sensitivity
Unquoted identifiers are folded to lowercase. Use double quotes (`"MyTable"`) to preserve case (causes hassles!). Max length ~63 chars. [docs]
The Rule
- If you create a table like `CREATE TABLE MyUsers (...)`, PostgreSQL stores it as `myusers`. Subsequent queries like `SELECT * FROM MyUsers;` or `SELECT * FROM myusers;` will both work because the unquoted identifier in the query is also folded to lowercase.
- If you create a table using double quotes `CREATE TABLE "MyUsers" (...)`, PostgreSQL stores the name exactly as `MyUsers` (preserving case).
- To query this case-sensitive table, you *must* use double quotes: `SELECT * FROM "MyUsers";`. Querying `SELECT * FROM MyUsers;` or `SELECT * FROM myusers;` will fail (unless a lowercase table `myusers` also happens to exist). [docs]
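A quick illustration:
CREATE TABLE MyUsers (id int);      -- stored as myusers
SELECT * FROM MYUSERS;              -- works: folded to myusers
CREATE TABLE "MyUsers" (id int);    -- a second, case-sensitive name
SELECT * FROM "MyUsers";            -- quoting now required for this one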
Migration Impact
Databases like SQL Server or MySQL (on Windows/macOS by default) are often case-insensitive or preserve case differently. Migrating schemas and queries requires careful attention to quoting or consistent use of lowercase identifiers to avoid errors.
Identifier Length
Default maximum length for identifiers (tables, columns, functions, etc.) is 63 bytes (`NAMEDATALEN` constant). Can be changed by recompiling PostgreSQL if absolutely necessary (rarely done).
Best Practice
Avoid quoted identifiers unless absolutely necessary (e.g., using reserved words, requiring special characters). Use lowercase names with underscores (`my_table`, `user_id`) consistently for maximum portability and minimum confusion.
VACUUM Necessity (MVCC)
MVCC causes bloat (dead rows). Regular `VACUUM` (via Autovacuum) is MANDATORY for performance and to prevent TXID wraparound failure.
Recap
MVCC doesn't physically delete or update rows in place. Old row versions (dead tuples) remain until cleaned up by `VACUUM`. [routine vacuuming]
Consequences of Neglect
- Bloat: Tables and indexes grow larger than necessary, wasting disk space.
- Performance Degradation: Sequential scans read more dead tuples; index scans might become less efficient; Index-Only Scans might fail if the Visibility Map isn't updated.
- TXID Wraparound Failure: Transaction IDs are 32-bit numbers. If `VACUUM` doesn't run often enough to "freeze" very old tuple transaction IDs on extremely busy or long-running systems (where >2 billion transactions occur between vacuums of oldest data), the database may eventually be forced to shut down into single-user mode to prevent data corruption when the TXID counter wraps around. This is catastrophic but preventable with proper vacuuming. Monitor `age(datfrozenxid)` via SQL.
Action Required
- Ensure Autovacuum is enabled (`autovacuum = on`).
- **Monitor Autovacuum:** Check logs (`log_autovacuum_min_duration`), `pg_stat_user_tables` (last vacuum times, dead tuples). Monitor TXID age (`SELECT datname, age(datfrozenxid) FROM pg_database;`).
- **Tune Autovacuum:** Adjust settings (`*_scale_factor`, `*_threshold`, `*_cost_delay/limit`, `max_workers`, `autovacuum_freeze_max_age`) based on table size, update frequency, and system resources. Defaults are often insufficient for active databases. [config]
- Occasionally monitor for excessive bloat (see Bloat Monitoring card) and consider manual `VACUUM`, `REINDEX CONCURRENTLY`, or `pg_repack` if necessary.
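For a large, frequently updated table that autovacuum cannot keep up with, per-table storage parameters can be tightened (table name and values are illustrative):
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead tuples instead of the 20% default
    autovacuum_vacuum_threshold    = 1000
);
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;  -- watch wraparound risk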
Connection Pooling (External)
The Issue
Each connection to PostgreSQL forks a new OS process (`postgres` backend), consuming significant RAM (tens of MBs) and CPU resources for process management. Creating and tearing down connections is also relatively expensive. [process model]
Applications (especially web/serverless apps) that open many short-lived connections directly to the database can quickly exhaust server resources (`max_connections` and memory) and lead to poor performance.
The Solution: External Connection Poolers
Middleware that maintains a pool of persistent physical connections to the PostgreSQL server. Application connects to the pooler instead of directly to Postgres. [PG Wiki Pooling]
- PgBouncer: Lightweight, single-purpose connection pooler. Very popular, stable, low overhead. Offers different pooling modes:
- Session Pooling: Client keeps logical connection and physical backend connection until disconnect (least effective pooling, mostly just limits connection count).
- Transaction Pooling: Physical connection returned to pool after each transaction (`COMMIT`/`ROLLBACK`). Requires client not to use session-based features (e.g., `SET` commands, advisory locks, temp tables) across transactions. **Commonly used.**
- Statement Pooling: Physical connection returned after each statement (most aggressive, most restrictions, rare).
- Pgpool-II: More feature-rich middleware. Provides connection pooling, load balancing across replicas, replication management, and automated failover (watchdog). Higher overhead and complexity than PgBouncer.
- Some application frameworks/drivers have built-in pooling, but an external pooler is often still beneficial for managing total server connections across multiple app instances.
Benefits of Pooling
- Reduces connection overhead (memory/CPU) on the PostgreSQL server.
- Allows supporting thousands of application clients with only tens or hundreds of actual database connections.
- Improves performance by reusing existing connections, reducing connection latency.
Recommendation
For most applications making frequent connections, using PgBouncer in transaction pooling mode between the application and the database is highly recommended.
`public` Schema Defaults
Default schema where all roles have `CREATE` and `USAGE` initially. Security risk! Best practice: `REVOKE CREATE`, use dedicated schemas. [docs]
Default Behavior
- Every new database contains a schema named `public`. [docs]
- By default, the special `PUBLIC` pseudo-role (meaning all roles) is granted `CREATE` (permission to create objects like tables) and `USAGE` (permission to access objects within) privileges on the `public` schema. (Since PostgreSQL 15, `CREATE` on `public` is no longer granted to `PUBLIC` by default.)
- If no schema is specified when creating an object (e.g., `CREATE TABLE mytable (...)`), it gets created in the first schema listed in the user's `search_path` where they have `CREATE` permission, which is typically `public`.
Problems
- Can lead to a cluttered namespace with objects from different applications or users mixed together.
- Security Risk: Any user can potentially create objects (tables, functions) in `public`, potentially leading to naming conflicts or unexpected behavior (e.g., function/operator hijacking if `public` is early in the `search_path`). [patterns]
Recommended Practice
- **Revoke Default Privileges:** As a superuser, immediately after database creation (or in template database `template1`):
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
-- Optional, but recommended for stricter control:
-- REVOKE USAGE ON SCHEMA public FROM PUBLIC;
-- GRANT USAGE ON SCHEMA public TO specific_roles_that_need_it;  -- e.g., for extensions
- **Create Dedicated Schemas:** Use schemas to organize objects logically (e.g., `CREATE SCHEMA app_schema AUTHORIZATION app_owner;`). [create schema]
- **Grant Permissions Explicitly:** Grant necessary `USAGE` on the schema and object-specific privileges (SELECT, INSERT, etc.) on objects within dedicated schemas to specific application roles. [grant]
- **Set `search_path` Carefully:** Configure the `search_path` appropriately for application roles (e.g., `ALTER ROLE app_user SET search_path = app_schema, public;`) so they don't need to schema-qualify table names frequently, but still prioritize the correct schema. Place `public` last or omit it if possible.
`search_path`
Determines the order schemas are searched for unqualified objects (tables, functions). Impacts name resolution and security. Default includes user schema, then `public`. [docs]
Functionality
When you reference an object (table, function, type, operator, etc.) without explicitly qualifying it with a schema name (e.g., `SELECT * FROM users` instead of `SELECT * FROM app_schema.users`), PostgreSQL searches the schemas listed in the `search_path` setting in order, using the first matching object found.
Default Value
Typically defaults to `"$user", public`. This means it first looks for a schema with the same name as the current user role (`CREATE SCHEMA myuser AUTHORIZATION myuser;`), then looks in the `public` schema. If `$user` schema doesn't exist, it's skipped.
Setting the Path
- Session level: `SET search_path TO schema1, schema2, public;`
- User level (persistent for user): `ALTER ROLE myuser SET search_path = schema1, public;`
- Database level (default for db): `ALTER DATABASE mydb SET search_path = schema1, public;`
- Server level (`postgresql.conf`): Sets the system default for new sessions.
- Function level: `ALTER FUNCTION myfunc(...) SET search_path = ...;` or `SET search_path = ...` in function definition (crucial for `SECURITY DEFINER`).
- [setting path]
Security Implications (Trojan Horse)
- If a user can create objects in a schema listed *earlier* in the `search_path` than the intended schema (especially `public` if default `CREATE` privileges aren't revoked), they might be able to create objects (like functions or operators) that hijack calls intended for objects in later schemas. [security]
- Example: If `search_path = public, app_schema`, and a malicious user creates a function `public.important_function()`, a call to `important_function()` (expecting `app_schema.important_function()`) might execute the malicious public version. This is especially dangerous within `SECURITY DEFINER` functions if their `search_path` isn't explicitly secured.
Best Practice
- Always be aware of the current `search_path` (`SHOW search_path;`).
- Set a specific, minimal, and controlled `search_path` for application roles (e.g., only the necessary application schema(s), maybe `pg_catalog`).
- Place `public` last in the path, or remove it entirely if not needed (e.g., for extensions).
- Qualify object names with schemas (`app_schema.users`) in security-sensitive contexts (like `SECURITY DEFINER` functions) or when ambiguity is possible.
- Follow best practices for the `public` schema (revoke default create privileges).
Unified Role Concept
PostgreSQL uses Role for both users (with `LOGIN`) and groups (with `NOLOGIN`). Roles can inherit privileges from other roles they are members of.
Unified Model
Unlike some databases with distinct concepts for "User" (can log in) and "Group" (container for privileges), PostgreSQL uses a single `ROLE` construct for both. [docs]
- A role that can log in has the `LOGIN` attribute (`CREATE ROLE myuser WITH LOGIN PASSWORD '...';` is equivalent to `CREATE USER ...`).
- A role intended as a group typically has `NOLOGIN` (`CREATE ROLE readonly_group WITH NOLOGIN;`).
Membership and Inheritance
- Roles can be members of other roles: `GRANT group_role TO user_role;`. [membership]
- By default (the `INHERIT` attribute is the default), a role automatically gains all privileges granted to the roles it is a member of, and, recursively, the privileges of the roles *those* roles are members of.
- Example: If `readonly_group` is granted `SELECT` on `mytable`, and `myuser` is granted membership in `readonly_group` (`GRANT readonly_group TO myuser;`), then `myuser` automatically has `SELECT` permission on `mytable`.
- Give a member role the `NOINHERIT` attribute if the privileges of its group roles should only become active after an explicit `SET ROLE group_role;`.
- Use `SET ROLE` to temporarily assume the privileges of another role the current user is a member of (see the sketch below).
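A short sketch of the group-role pattern under these rules; `app_schema`, `readonly_group`, `batch_admin`, and the user names are illustrative:
-- Group roles (NOLOGIN) hold the privileges; login roles gain them via membership
CREATE ROLE readonly_group NOLOGIN;
GRANT USAGE ON SCHEMA app_schema TO readonly_group;
GRANT SELECT ON ALL TABLES IN SCHEMA app_schema TO readonly_group;

CREATE ROLE myuser LOGIN PASSWORD 'changeme';
GRANT readonly_group TO myuser;        -- with default INHERIT, SELECT applies immediately

-- A NOINHERIT login role only gets group privileges after SET ROLE
CREATE ROLE batch_admin NOLOGIN;
CREATE ROLE ops_user LOGIN NOINHERIT PASSWORD 'changeme';
GRANT batch_admin TO ops_user;
-- Later, inside an ops_user session:
SET ROLE batch_admin;                  -- explicitly assume the group's privileges
RESET ROLE;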
Benefits
- Flexible privilege management through group roles. Grant permissions to the group role once, then manage user access by adding/removing members (`GRANT group_role TO user_role` / `REVOKE group_role FROM user_role`).
- Allows complex hierarchical privilege structures if needed.
Managing Membership
- `GRANT group_role TO user_role [, ...];`
- `REVOKE group_role FROM user_role [, ...];`
- Check membership: `\du user_role` in psql, or query `pg_auth_members` (see the query sketch below).
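From a regular client (no psql meta-commands), membership can be read from the catalogs; a possible query joining `pg_auth_members` against `pg_roles`:
-- List which roles are members of which group roles
SELECT g.rolname AS group_role,
       m.rolname AS member,
       am.admin_option
FROM pg_auth_members am
JOIN pg_roles g ON g.oid = am.roleid
JOIN pg_roles m ON m.oid = am.member
ORDER BY g.rolname, m.rolname;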
Consideration
While powerful, understand the inheritance model (`INHERIT` vs `NOINHERIT`, `SET ROLE`) and use group roles strategically to simplify permission management rather than granting everything directly to individual login roles.
Terminology Recap
Key PG Terms
Quick reference for core PG vocabulary: Relation, Schema, Role, Heap, Tuple, WAL, LSN, MVCC, TOAST, Planner, Executor, Bloat, Pooler.
- Relation: General term for any table-like object storing data. Most commonly Tables and Indexes. Sometimes Views. Stored as files. [glossary]
- Schema: A namespace within a database containing objects (tables, views, functions, etc.). Used for organization and access control. Default is `public`. Distinct from "table schema" (column definitions). [docs]
- Role: Unified concept for Users (with `LOGIN`) and Groups (`NOLOGIN`). Owns objects, has privileges, can inherit from other roles. [docs]
- Heap: The main storage structure for a table's data, typically an unordered collection of pages containing tuples. [glossary]
- Tuple: A physical row version stored on disk within a heap block. Contains data + header (xmin, xmax, etc.). [glossary]
- WAL (Write-Ahead Log): Transaction log ensuring durability and enabling recovery/replication. Changes logged before applied to heap/index. Stored in `pg_wal` directory. [docs]
- LSN (Log Sequence Number): Pointer to a specific location in the WAL stream, identifying a unique record position. Used for recovery/replication progress tracking.
- MVCC (Multi-Version Concurrency Control): Concurrency model where updates create new tuple versions instead of overwriting. Enables readers not blocking writers. Requires `VACUUM`. [docs]
- TOAST (The Oversized Attribute Storage Technique): Mechanism for storing large field values out of line in a separate TOAST table. [docs]
- Planner/Optimizer: Component that analyzes a SQL query and statistics to determine the most efficient execution plan (e.g., which indexes to use, join methods). [docs]
- Executor: Component that runs the plan generated by the planner, fetching data and performing operations. [docs]
- Bloat: Wasted/unused space in table and index files due to dead tuples not yet vacuumed or to fragmentation. Requires `VACUUM`, or `REINDEX`/`VACUUM FULL`/`pg_repack`, to reclaim.
- Connection Pooler: External middleware (PgBouncer, Pgpool-II) managing persistent DB connections for applications; necessary because of the process-per-connection model.
- Checkpoint: Point in the WAL stream at which all prior dirty buffers are guaranteed to have been flushed to disk. Limits recovery time.
- Catalog: Internal tables (`pg_catalog` schema) storing database metadata.