Metrics: Overview

Stable

Which SPG99 layers are worth observing and how metrics help analyze cold start, soft basebackup, and the new autoscaler.

Updated: March 21, 2026

In SPG99, metrics are useful not only for the SRE team, but also for the database user. They are the easiest way to understand whether the problem is in the application, in cold start, in the autoscale handoff, in the storage chain, or in the SQL workload itself.

Which layers are useful to watch

Control Plane

Shows what is happening in the resource lifecycle: state transitions, orchestration errors, startup duration, writer handoff, and deletion.

Gateway

Shows the client entry layer: active connections, pooling, TLS errors, freeze/drain behavior, and problems on the path to the backend.

Compute / Agent

This is the key user-facing layer: readiness, soft bootstrap, CPU, memory, connections, and PostgreSQL runtime state.

Pageserver and Safekeeper

These metrics help you understand whether a problem is related to durable storage, WAL quorum, or bootstrap state.

Which questions metrics answer

  • whether the database is actually active or idle;
  • whether the writer is simply starting or whether an autoscale handoff is already in progress;
  • whether the application's own timeouts are too short, so that it aborts requests before the database has a chance to respond;
  • whether pinned sessions are blocking a safe cutover;
  • whether the storage chain has degraded;
  • whether the problem is related to connection count and pooling.

What is especially important after the platform update

After the switch to soft basebackup and the new autoscaler, the following are worth tracking:

  • cold-start duration;
  • readiness of the warm/candidate writer;
  • freeze/drain durations;
  • checkout timeouts in Gateway;
  • long transactions and pinned-session load;
  • Pageserver lag or unavailability, and loss of Safekeeper quorum.
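
For the duration items in this list, a percentile summary over a time window is usually more telling than a single reading. The sketch below is a minimal illustration, assuming duration samples in seconds have already been exported from your metrics backend. The function name `summarize_durations` and the sample values are hypothetical and are not part of SPG99.

```python
import math


def summarize_durations(samples):
    """Return median, p95 (nearest-rank), and max for durations in seconds."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)

    def nearest_rank(p):
        # Nearest-rank percentile: the smallest sample such that
        # at least p percent of all samples are at or below it.
        rank = max(1, math.ceil(p / 100 * n))
        return ordered[rank - 1]

    return {"p50": nearest_rank(50), "p95": nearest_rank(95), "max": ordered[-1]}


# Hypothetical cold-start durations in seconds, made up for illustration.
cold_starts = [1.2, 1.4, 1.1, 1.3, 6.8, 1.2, 1.5, 1.3, 1.4, 1.2]
summary = summarize_durations(cold_starts)
print(summary)
```

A median close to the typical value with a much larger p95 points to occasional slow starts rather than a general slowdown. The same summary can be applied to freeze/drain durations.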

Why this is useful to the user

Metrics shorten the time from symptom to cause. Instead of “the database is slow,” you can quickly move to a more precise conclusion:

  • this is a normal cold start;
  • the writer handoff is not finished yet;
  • connections are being cut off at the pooling layer, for example by checkout timeouts in Gateway;
  • the storage chain is catching up to the target LSN;
  • or the problem is really in the application queries.
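
The move from symptom to conclusion can be sketched as a simple ordered check. This is an illustration only: the `Snapshot` fields, the lag threshold, and the order of the checks are assumptions chosen for the example, not SPG99 metric names or platform behavior. Map them to whatever your dashboards actually expose.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # All field names are hypothetical stand-ins for real metrics.
    compute_ready: bool             # compute reports readiness
    handoff_in_progress: bool       # an autoscale writer handoff is running
    gateway_checkout_timeouts: int  # checkout timeouts seen in the window
    storage_lsn_lag_bytes: int      # distance from the target LSN


def triage(s: Snapshot, lag_threshold_bytes: int = 16 * 1024 * 1024) -> str:
    """Return the first conclusion that matches; the order is an assumption."""
    if s.handoff_in_progress:
        return "writer handoff is not finished yet"
    if not s.compute_ready:
        return "normal cold start"
    if s.gateway_checkout_timeouts > 0:
        return "pool checkouts are timing out"
    if s.storage_lsn_lag_bytes > lag_threshold_bytes:  # threshold is hypothetical
        return "storage chain is catching up to the target LSN"
    return "look at the application queries"


# Example: everything is ready and healthy, so the workload itself is suspect.
print(triage(Snapshot(True, False, 0, 0)))
```

Only when none of the platform-side signals match does the check fall through to the application queries, which mirrors the order of the list above.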