Metrics: Overview
Stable
Which SPG99 layers are worth observing, and how metrics help analyze cold start, soft basebackup, and the new autoscaler.
Updated: March 21, 2026
In SPG99, metrics are useful not only for the SRE team, but also for the database user. They are the easiest way to understand whether the problem is in the application, in cold start, in the autoscale handoff, in the storage chain, or in the SQL workload itself.
Which layers are useful to watch
Control Plane
Shows what is happening across the resource lifecycle: state transitions, orchestration errors, startup duration, writer handoff, and deletion.
Gateway
Shows the client entry layer: active connections, pooling, TLS errors, freeze/drain behavior, and problems on the path to the backend.
Compute / Agent
This is the key user-facing layer: readiness, soft bootstrap, CPU, memory, connections, and PostgreSQL runtime state.
Pageserver and Safekeeper
These metrics help you understand whether a problem is related to durable storage, WAL quorum, or bootstrap state.
Which questions metrics answer
- whether the database is actually active or idle;
- whether the writer is simply starting or whether an autoscale handoff is already in progress;
- whether the application is aborting its own requests because its timeouts are too short;
- whether pinned sessions are blocking a safe cutover;
- whether the storage chain has degraded;
- whether the problem is related to connection count and pooling.
What is especially important after the platform update
After the switch to soft basebackup and the new autoscaler, it is especially useful to track:
- cold-start duration;
- readiness of the warm/candidate writer;
- freeze/drain durations;
- checkout timeouts in Gateway;
- long transactions and pinned-session load;
- lag or unavailability of Pageserver / Safekeeper quorum.
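The watchlist above can be expressed as a small threshold check. This is a minimal sketch only: the metric names, units, and limits below are hypothetical placeholders, not actual SPG99 metric identifiers or recommended values. Substitute the names your deployment exports and limits that match your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str        # hypothetical metric name, not a real SPG99 identifier
    warn_above: float  # illustrative limit; tune to your own baseline
    unit: str


# One entry per item in the list above. All values are assumptions.
WATCHLIST = [
    Threshold("cold_start_seconds", 10.0, "s"),
    Threshold("candidate_writer_ready_seconds", 15.0, "s"),
    Threshold("freeze_drain_seconds", 5.0, "s"),
    Threshold("gateway_checkout_timeouts_per_min", 0.0, "1/min"),
    Threshold("longest_transaction_seconds", 60.0, "s"),
    Threshold("pinned_sessions", 20.0, "sessions"),
    Threshold("safekeeper_lag_bytes", 16 * 1024 * 1024, "bytes"),
]


def breaches(snapshot):
    """Return (metric, value, limit) for each watched metric above its limit.

    Metrics that are missing from the snapshot are skipped, not treated as zero.
    """
    found = []
    for t in WATCHLIST:
        value = snapshot.get(t.metric)
        if value is not None and value > t.warn_above:
            found.append((t.metric, value, t.warn_above))
    return found
```

Calling `breaches({"cold_start_seconds": 12.5, "freeze_drain_seconds": 1.0})` would report only the cold-start entry, because the drain duration is under its illustrative limit.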
Why this is useful to the user
Metrics shorten the time from symptom to cause. Instead of “the database is slow,” you can quickly move to a more precise conclusion:
- this is a normal cold start;
- the writer handoff is not finished yet;
- connections are being cut off at the pooling layer (for example, by checkout timeouts in Gateway);
- the storage chain is catching up to the target LSN;
- or the problem is really in the application queries.
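The move from symptom to cause can be sketched as an ordered set of rules over a few readings. This is an illustration under stated assumptions: the `Signals` fields are hypothetical inputs rather than SPG99 metric names, and the rule order is a guess at a sensible priority, not documented platform behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    COLD_START = "normal cold start"
    HANDOFF = "writer handoff not finished yet"
    POOLING = "connections cut off at the pooling layer"
    STORAGE = "storage chain catching up to the target LSN"
    APPLICATION = "application queries"


@dataclass(frozen=True)
class Signals:
    # All fields are hypothetical readings derived from your own metrics.
    compute_ready: bool        # compute reports readiness
    handoff_in_progress: bool  # an autoscale writer handoff is under way
    checkout_timeouts: int     # recent Gateway checkout timeouts
    lsn_lag_bytes: int         # distance to the target LSN


def triage(s: Signals) -> Cause:
    """Return the first matching cause. The rule order is an assumption."""
    if s.handoff_in_progress:
        return Cause.HANDOFF
    if not s.compute_ready:
        return Cause.COLD_START
    if s.lsn_lag_bytes > 0:
        return Cause.STORAGE
    if s.checkout_timeouts > 0:
        return Cause.POOLING
    # Nothing platform-side stands out, so look at the SQL workload.
    return Cause.APPLICATION
```

Only the first matching rule is returned, so in practice several causes may overlap; treat the result as a starting point for investigation rather than a verdict.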
