New WAL Archiving Tool ‘pgrwl’ Delivers Zero Data Loss for PostgreSQL in Containerized Environments
🚀 About
- The project serves as a research platform to explore streaming WAL archiving with a target of RPO=0 during recovery.
- It’s primarily designed for use in containerized environments.
- The utility replicates all key features of pg_receivewal, including automatic reconnection on connection loss, streaming into partial files, extensive error checking and more.
- The tool is easy to install as a single binary and simple to debug: just use your preferred editor and a Docker container running PostgreSQL.
🔐 Features
- ✅ Streaming WAL archiving with replication slots
- ✅ Safe .partial file handling (every server message is ‘fsynced’)
- ✅ S3/SFTP backends with optional GZIP/ZSTD compression + AES-GCM encryption
- ✅ Built-in HTTP server for serving WALs + metrics and alerting (planned)
- ✅ Minimal and composable configuration
- ✅ Fully testable with Docker-based integration tests
🛠️ Usage
Receive mode (the main loop of the WAL receiver)
cat <<EOF >config.yml
main:
  listen_port: 7070
  directory: wals
receiver:
  slot: pgrwl_v5
log:
  level: trace
  format: text
  add_source: true
EOF
export PGHOST=localhost
export PGPORT=5432
export PGUSER=postgres
export PGPASSWORD=postgres
export PGRWL_MODE=receive
pgrwl -c config.yml
Serve mode (used during restore to serve archived WAL files from storage)
cat <<EOF >config.yml
main:
  listen_port: 7070
  directory: wals
log:
  level: trace
  format: text
  add_source: true
EOF
export PGRWL_MODE=serve
pgrwl -c config.yml
restore_command example for postgresql.conf
# where 'k8s-worker5:30266' represents the host and port
# of a 'pgrwl' instance running in 'serve' mode.
restore_command = 'pgrwl restore-command --serve-addr=k8s-worker5:30266 %f %p'
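To make the restore step concrete, here is a minimal sketch of preparing a restored data directory so that PostgreSQL fetches WAL through this restore_command. The helper name `prepare_recovery` and its arguments are hypothetical and not part of pgrwl; the sketch assumes PostgreSQL 12 or newer, where an empty recovery.signal file requests archive recovery.

```shell
# Hypothetical helper (not part of pgrwl): prepare a restored PGDATA
# for archive recovery.
#   $1 = path to the restored data directory
#   $2 = host:port of a pgrwl instance running in 'serve' mode
prepare_recovery() (
  pgdata=$1
  serve_addr=$2
  # An empty recovery.signal asks PostgreSQL 12+ to start in archive recovery.
  : > "$pgdata/recovery.signal"
  # Append restore_command; if a setting appears more than once in
  # postgresql.conf, the last occurrence is the one that takes effect.
  printf "restore_command = 'pgrwl restore-command --serve-addr=%s %%f %%p'\n" \
    "$serve_addr" >> "$pgdata/postgresql.conf"
)
```

It could be called as `prepare_recovery /var/lib/postgresql/data k8s-worker5:30266` before starting the server; the data directory path here is only an example.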
⭐ See also: examples (step-by-step archive and recovery), and k8s (basic setup)
⚙️ Configuration Reference
The configuration file is in JSON or YAML format (*.json is preferred). It supports environment variable placeholders like ${PGRWL_SECRET_ACCESS_KEY}.
main:                                   # Required for both modes: 'receive' / 'serve'
  listen_port: 7070                     # HTTP server port (used for management)
  directory: "/var/lib/pgwal"           # Base directory for storing WAL files
receiver:                               # Required for 'receive' mode
  slot: replication_slot                # Replication slot to use
  no_loop: false                        # If true, do not loop on connection loss
uploader:                               # Optional (used in receive mode)
  sync_interval: 10s                    # Interval for the upload worker to check for new files
  max_concurrency: 4                    # Maximum number of files to upload concurrently
log:                                    # Optional
  level: info                           # One of: trace / debug / info / warn / error
  format: text                          # One of: text / json
  add_source: true                      # Include file:line in log messages (for local development)
storage:                                # Optional
  name: s3                              # One of: s3 / sftp
  compression:                          # Optional
    algo: gzip                          # One of: gzip / zstd
  encryption:                           # Optional
    algo: aesgcm                        # One of: aes-256-gcm
    pass: "${PGRWL_ENCRYPT_PASSWD}"     # Encryption password (from env)
  sftp:                                 # Required section for 'sftp' storage
    host: sftp.example.com              # SFTP server hostname
    port: 22                            # SFTP server port
    user: backupuser                    # SFTP username
    pass: "${PGRWL_VM_PASSWORD}"        # SFTP password (from env)
    pkey_path: "/home/user/.ssh/id_rsa" # Path to SSH private key (optional)
    pkey_pass: "${PGRWL_SSH_PKEY_PASS}" # Required if the private key is password-protected
  s3:                                   # Required section for 's3' storage
    url: https://s3.example.com         # S3-compatible endpoint URL
    access_key_id: AKIAEXAMPLE          # AWS access key ID
    secret_access_key: "${PGRWL_AWS_SK}" # AWS secret access key (from env)
    bucket: postgres-backups            # Target S3 bucket name
    region: us-east-1                   # S3 region
    use_path_style: true                # Use path-style URLs for S3
    disable_ssl: false                  # Disable SSL
🚀 Installation
Manual Installation
- Download the latest binary for your platform from the Releases page.
- Place the binary in your system’s PATH (e.g., /usr/local/bin).
Installation script for Unix-Based OS (requires: tar, curl, jq):
(
set -euo pipefail
OS="$(uname | tr '[:upper:]' '[:lower:]')"
ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')"
TAG="$(curl -s https://api.github.com/repos/hashmap-kz/pgrwl/releases/latest | jq -r .tag_name)"
curl -L "https://github.com/hashmap-kz/pgrwl/releases/download/${TAG}/pgrwl_${TAG}_${OS}_${ARCH}.tar.gz" |
tar -xzf - -C /usr/local/bin && \
chmod +x /usr/local/bin/pgrwl
)
🗃️ Usage In Backup Process
The full process may look like this (a typical, rough, and simplified example):
- You have a cron job that performs a base backup of your cluster every three days.
- You run pgrwl as a systemd unit or a Kubernetes pod (depending on your infrastructure).
- You have a configured retention worker that prunes WAL files older than three days.
- With this setup, you’re able to restore your cluster – in the event of a crash – to any second within the past three days.
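As a rough illustration of the retention step, the sketch below prunes completed WAL files by age. It is a hypothetical stand-in, not pgrwl's own retention worker: the function name `prune_old_wals` is invented here, and it assumes a `find` that supports `-mmin` and `-delete` (GNU or BSD).

```shell
# Hypothetical retention helper (not pgrwl's built-in worker):
# delete completed WAL files older than N days.
#   $1 = WAL archive directory
#   $2 = retention period in days
prune_old_wals() (
  dir=$1
  days=$2
  # "-mmin +N" matches files modified more than N minutes ago.
  # *.partial files are skipped because the receiver may still write to them.
  find "$dir" -type f ! -name '*.partial' \
    -mmin "+$((days * 24 * 60))" -delete
)
```

A call such as `prune_old_wals /var/lib/pgwal 3` would match the three-day window described above. Age-based pruning is only safe as long as every WAL file needed by the oldest base backup you intend to restore from is still kept.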
🧱 Architecture
Design Notes
pgrwl is designed to use the local filesystem exclusively. This is a deliberate choice because, as mentioned earlier, we must rely on fsync after each message is written to disk.
This ensures that *.partial files always contain fully valid WAL segments, making them safe to use during the restore phase (after simply removing the *.partial suffix).
pgrwl supports compression and encryption as optional features for completed WAL files (during upload to remote storage).
However, streaming *.partial files to any location other than the local filesystem can introduce numerous unpredictable issues.
In short: PostgreSQL waits for the replica to confirm commits, so we cannot afford to depend on external systems in such critical paths.
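Removing the *.partial suffix before a restore can be sketched as follows. The function name `promote_partial_wals` is hypothetical; it simply renames every *.partial file in one directory and assumes the receiver is no longer running.

```shell
# Hypothetical helper: drop the .partial suffix from every partial WAL file
# in a directory, e.g. 000000010000000000000003.partial
#                   -> 000000010000000000000003
#   $1 = WAL directory
promote_partial_wals() (
  dir=$1
  for f in "$dir"/*.partial; do
    [ -e "$f" ] || continue        # the glob matched nothing
    mv -- "$f" "${f%.partial}"     # strip the suffix, keep the contents
  done
)
```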
💾 Notes on fsync (since the utility works in synchronous mode only):
- After each WAL segment is written, an fsync is performed on the currently open WAL file to ensure durability.
- An fsync is triggered when a WAL segment is completed and the *.partial file is renamed to its final form.
- An fsync is triggered when a keepalive message is received from the server with the reply_requested option set.
- Additionally, fsync is called whenever an error occurs during the receive-copy loop.
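The ordering around segment completion can be illustrated with a small shell sketch. This is not how pgrwl is implemented (it is a Go program that calls fsync directly); the function name `finalize_segment` is hypothetical, the directory sync is an added assumption, and the `sync FILE` form requires GNU coreutils 8.24 or newer.

```shell
# Hypothetical illustration of "flush, then rename" for a completed segment.
#   $1 = path of the *.partial file
finalize_segment() (
  partial=$1
  final=${partial%.partial}
  sync "$partial"                   # flush file data before the rename
  mv -- "$partial" "$final"         # rename within the same filesystem
  sync "$(dirname -- "$final")"     # assumption: also flush the directory entry
)
```

Flushing before the rename means a file without the .partial suffix is never visible with unflushed contents.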
🔁 Notes on archive_command and archive_timeout
There’s a significant difference between using archive_command and archiving WAL files via the streaming replication protocol.
The archive_command is triggered only after a WAL file is fully completed, typically when it reaches 16 MiB (the default segment size). This means that in a crash scenario, you could lose up to 16 MiB of data.
You can mitigate this by setting a lower archive_timeout (e.g., 1 minute), but even then, in a worst-case scenario, you risk losing up to 1 minute of data. Also, it’s important to note that PostgreSQL preallocates WAL files to the configured wal_segment_size, so they are created with full size regardless of how much data has been written. (The PostgreSQL documentation warns that a very short archive_timeout is therefore unwise, because it will bloat your archive storage.)
In contrast, streaming WAL archiving, when used with replication slots and the synchronous_standby_names parameter, ensures that the system can be restored to the latest committed transaction. This approach provides true zero data loss (RPO=0), making it ideal for high-durability requirements.
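To put a number on the archive bloat caused by a short archive_timeout, here is a back-of-the-envelope sketch. The function name `archive_mib_per_day` is hypothetical; the result is an upper bound that assumes uncompressed segments and at least some WAL activity in every timeout interval, so that each interval forces a segment switch.

```shell
# Upper bound on archived WAL volume per day, in MiB.
#   $1 = archive_timeout in seconds
#   $2 = WAL segment size in MiB
archive_mib_per_day() (
  timeout_s=$1
  segment_mib=$2
  # segments per day = 86400 / timeout; each one is archived at full size
  echo $(( 86400 / timeout_s * segment_mib ))
)
```

With a 60-second timeout and 16 MiB segments, `archive_mib_per_day 60 16` prints 23040, which is 22.5 GiB per day even if each segment holds only a few kilobytes of real changes.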
👷 Developer Notes
🧪 Integration Testing:
Here is an example of a fundamental "golden" test. It verifies that we can restore to the latest committed transaction after an abrupt system crash. It also checks that the WAL files generated are byte-for-byte identical to those generated by pg_receivewal.
Test Steps:
- Initialize and start a PostgreSQL cluster
- Run WAL receivers (pgrwl and pg_receivewal)
- Create a base backup
- Create a table, and insert the current timestamp every second (in the background)
- Run pgbench to populate the database with 1 million rows
- Generate additional data (~512 MiB)
- Concurrently create 100 tables with 10000 rows each.
- Terminate the insert-script job
- Run pg_dumpall and save the output as plain SQL
- Terminate all PostgreSQL processes and delete the PGDATA directory (termination is forced and abnormal)
- Restore PGDATA from the base backup, add recovery.signal, and configure restore_command
- Rename all *.partial WAL files in the WAL archive directories
- Start the PostgreSQL cluster (cluster should recover to the latest committed transaction)
- Run pg_dumpall after the cluster is ready
- Diff the pg_dumpall results (before and after)
- Check the insert-script logs and verify that the table contains the last inserted row
- Compare WAL directories (filenames and contents must match 100%)
- Clean up WAL directories and rerun the WAL archivers on a new timeline (cleanup is necessary since we run receivers with the --no-loop option)
- Compare the WAL directories again
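The directory comparison in the steps above can be sketched with a plain recursive diff. The function name `compare_wal_dirs` is hypothetical and the project's actual test scripts may do this differently; `-q` is assumed to be available (GNU and BSD diff).

```shell
# Hypothetical check: succeed only if two WAL directories contain the same
# file names with byte-identical contents.
#   $1, $2 = directories to compare (e.g. pgrwl output vs pg_receivewal output)
compare_wal_dirs() (
  a=$1
  b=$2
  # -r walks both trees; -q reports only which files differ or are missing.
  # Exit status 0 means no differences were found.
  diff -rq -- "$a" "$b"
)
```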
To contribute or verify the project locally, the following make targets should all pass:
# Compile the project
make build
# Run linter (should pass without errors)
make lint
# Run unit tests (should all pass)
make test
# Run integration tests (slow, but critical)
# Requires Docker and Docker Compose to be installed
make test-integ-scripts
# Run GoReleaser builds locally
make snapshot
✅ All targets should complete successfully before submitting changes or opening a PR.
🗂️ Source Code Structure
internal/xlog/pg_receivewal.go
→ Entry point for WAL receiving logic.
Based on the logic found in PostgreSQL:
https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_receivewal.c
internal/xlog/receivelog.go
→ Core streaming loop and replication logic.
Based on the logic found in PostgreSQL:
https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/receivelog.c
internal/xlog/xlog_internal.go
→ Helpers for LSN math, WAL file naming, segment calculations.
Based on the logic found in PostgreSQL:
https://github.com/postgres/postgres/blob/master/src/include/access/xlog_internal.h
internal/xlog/walfile.go
→ Manages WAL file descriptors: open, write, close, sync.
internal/xlog/streamutil.go
→ Utilities for querying server parameters (e.g. wal_segment_size),
replication slot info, and streaming setup.
internal/xlog/fsync/
→ Optimized wrappers for safe and efficient `fsync` system calls.
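As an illustration of the LSN math and WAL file naming mentioned for xlog_internal.go, the sketch below derives a segment file name from a timeline ID and an LSN, following PostgreSQL's naming scheme (timeline, high 32 bits of the LSN, segment number within that range, each as 8 hex digits). It is not taken from pgrwl's code; the function name `wal_file_name` is hypothetical.

```shell
# Hypothetical helper: compute the WAL segment file name for an LSN.
#   $1 = timeline ID (decimal)
#   $2 = LSN in the usual "X/Y" hex notation, e.g. 0/3000000
#   $3 = wal_segment_size in bytes (optional, default 16 MiB)
wal_file_name() (
  tli=$1
  lsn=$2
  segsz=${3:-16777216}
  hi=$((0x${lsn%%/*}))      # high 32 bits of the LSN
  lo=$((0x${lsn#*/}))       # low 32 bits of the LSN
  # The third field is the segment index inside the 4 GiB range selected by hi.
  printf '%08X%08X%08X\n' "$tli" "$hi" "$((lo / segsz))"
)
```

For example, `wal_file_name 1 0/3000000` prints 000000010000000000000003, and `wal_file_name 1 1/A2000000` prints 0000000100000001000000A2.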
✅ TL;DR
If you’re building reliable PostgreSQL backup pipelines and want streaming, durability, and developer control, give pgrwl a try.
💬 Questions or feedback? Drop a GitHub Issue or comment here!
👉 Check out the source
🔖 Licensed under MIT