Monitoring Federated Query Performance with Apollo Studio
Apollo Studio turns the traces a federated router reports into schema-aware performance data — per-operation p50/p95 latency, per-field usage, and a flame graph that pinpoints the slow subgraph resolver behind a slow operation. This guide covers reporting traces to Studio, reading field-level usage and latency, finding N+1 resolvers, and alerting on regressions.
It is the Studio-specific companion to Observability and Distributed Tracing in Federation.
When to use this pattern
- You want latency and usage attributed to specific schema fields and subgraphs, not just HTTP endpoints.
- You need operation-level p50/p95/p99 trends and regression alerts tied to your managed graph variant.
- You are hunting an N+1 or a slow __resolveReference and want a field-level flame graph rather than raw spans.
Prerequisites
How Studio Reporting Works
The router (or gateway) reports two related streams to Studio. Operation usage records every executed operation’s signature, client, and timing — this powers usage charts and p50/p95 statistics with no extra subgraph work. Federated field-level traces (FTV1) are sampled, per-field execution timings that subgraphs emit on request; the router stitches them into one trace following the query plan, giving Studio a flame graph in which each frame is a schema field on a named subgraph.
Studio aggregates these into field statistics: for any field you can see request count, error rate, and latency percentiles across all operations that select it — the key advantage over endpoint-level monitoring.
It is worth being explicit about why this matters for federation specifically. A federated graph exposes a single GraphQL endpoint that serves every operation, so HTTP-level monitoring collapses all traffic into one route and tells you almost nothing — the p95 of “the GraphQL endpoint” mixes a trivial { me { id } } with a 200-field dashboard query. Studio’s field-keyed model dissolves that problem: each field carries its own cost regardless of which operation requested it, and each operation carries its own latency distribution regardless of how many fields it selects. You can therefore ask both “which field is expensive everywhere it appears” and “which operation is slow,” and the two views reconcile because they are computed from the same reported traces. That reconciliation is the practical reason to report to Studio even when you already have OTLP: OTLP is service-shaped, Studio is schema-shaped, and schema is the dimension along which you actually evolve a federated graph.
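As a rough illustration of why the two views reconcile, the sketch below groups the same set of timed field executions once by field and once by operation. The `FieldSample` shape, the operation names, and the durations are invented for this example and are not Studio's actual trace format. The per-operation total is summed field time, not wall-clock latency.

```typescript
// Hypothetical, simplified record of one timed field execution in a trace.
interface FieldSample {
  operation: string; // operation name, e.g. "Dashboard"
  field: string; // schema coordinate, e.g. "Product.price"
  subgraph: string; // owning subgraph
  durationMs: number;
}

interface Totals {
  count: number;
  totalMs: number;
}

// Group samples by an arbitrary key, accumulating count and total duration.
function totalsBy(
  samples: FieldSample[],
  keyOf: (s: FieldSample) => string
): Record<string, Totals> {
  const out: Record<string, Totals> = {};
  samples.forEach((s) => {
    const key = keyOf(s);
    const entry = out[key] || (out[key] = { count: 0, totalMs: 0 });
    entry.count += 1;
    entry.totalMs += s.durationMs;
  });
  return out;
}

const fieldSamples: FieldSample[] = [
  { operation: "Me", field: "User.id", subgraph: "accounts", durationMs: 1 },
  { operation: "Dashboard", field: "User.id", subgraph: "accounts", durationMs: 2 },
  { operation: "Dashboard", field: "Product.price", subgraph: "products", durationMs: 40 },
  { operation: "Search", field: "Product.price", subgraph: "products", durationMs: 35 },
];

// Schema-shaped view: cost of a field wherever it is selected.
const byField = totalsBy(fieldSamples, (s) => s.subgraph + ":" + s.field);
// Operation-shaped view: field time attributed to each operation.
const byOperation = totalsBy(fieldSamples, (s) => s.operation);

console.log(byField["products:Product.price"]); // { count: 2, totalMs: 75 }
console.log(byOperation["Dashboard"]); // { count: 2, totalMs: 42 }
```

Both groupings read the same samples, so an expensive field and the operations that select it can be traced back to the same underlying data.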
Implementation Walkthrough
1. Point the router at Studio
The router reports to Studio whenever APOLLO_KEY and APOLLO_GRAPH_REF are set; the telemetry.apollo block tunes what is sent.
# router.yaml
telemetry:
  apollo:
    # fraction of requests for which subgraphs are asked for expensive per-field FTV1 traces
    field_level_instrumentation_sampler: 0.02 # 2% in production
    send_variable_values: none # never send variable values
    send_headers:
      only:
        - apollographql-client-name # attribute usage to clients
        - apollographql-client-version
    errors:
      subgraph:
        all:
          send: true # include subgraph errors in traces
          redact: true # redact error messages/PII
Provide credentials via environment, not the file:
export APOLLO_KEY="service:my-graph:xxxxxxxxxxxx"
export APOLLO_GRAPH_REF="my-graph@production"
./router --config router.yaml
2. Confirm subgraphs emit FTV1
@apollo/server v4 emits inline traces automatically when it receives the router’s apollo-federation-include-trace: ftv1 request header, so no extra config is usually needed. If you previously disabled it, re-enable the inline trace plugin:
import { ApolloServer } from '@apollo/server';
import { ApolloServerPluginInlineTrace } from '@apollo/server/plugin/inlineTrace';
import { buildSubgraphSchema } from '@apollo/subgraph';

// typeDefs and resolvers are assumed to be defined elsewhere in the subgraph
const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers }),
  // emits FTV1 per-field traces back to the router for Studio stitching
  plugins: [ApolloServerPluginInlineTrace()],
});
3. Set client awareness
Send apollographql-client-name/-version from clients so Studio segments usage and latency by client — essential for attributing a regression to a specific app rollout. Apollo Client sets these headers when you configure its name and version options; for other clients set the headers manually on every request. Once present, Studio lets you slice every usage and latency chart by client, which turns “p95 regressed at 14:00” into “p95 regressed at 14:00 for web@4.7.0 only” — frequently the difference between a five-minute triage and an hour of guessing. Client segmentation is also how you safely deprecate a field: you can confirm that the only callers of a field are old client versions you have already sunset before you remove it.
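For clients other than Apollo Client, setting the headers manually might look like the sketch below. The helper name `withClientAwareness`, the `ClientIdentity` type, and the example values are hypothetical; only the two header names come from this guide.

```typescript
// Hypothetical helper for HTTP clients that do not set the headers themselves.
interface ClientIdentity {
  clientName: string;
  clientVersion: string;
}

// Returns a copy of the given headers with the two client-awareness headers added.
function withClientAwareness(
  headers: Record<string, string>,
  identity: ClientIdentity
): Record<string, string> {
  return {
    ...headers,
    "apollographql-client-name": identity.clientName,
    "apollographql-client-version": identity.clientVersion,
  };
}

const requestHeaders = withClientAwareness(
  { "content-type": "application/json" },
  { clientName: "web", clientVersion: "4.7.0" }
);

// Pass requestHeaders to whichever HTTP client sends the GraphQL request.
console.log(requestHeaders);
```

Wrapping this in one shared function means every request path in the app carries the same name and version, so Studio does not see part of the traffic as an unidentified client.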
Reading Performance Data in Studio
Field usage and latency
The Fields view lists every schema field with request count, error percentage, and latency percentiles, each tagged with its owning subgraph. Two patterns matter:
- High-usage, high-latency fields are your optimisation priorities — they sit on the critical path of many operations.
- Zero-usage fields are deprecation candidates; safe removal is validated by rover subgraph check, which reads the same usage data. See Apollo Studio schema checks for managed federation.
Operation-level p50/p95
The Operations view charts request rate and latency percentiles per operation signature over time. Watch p95/p99 rather than averages — federation latency is bimodal (cache hits vs. cold multi-fetch plans), so the mean hides the tail that users feel. Filter by client to separate a backend batch caller from interactive traffic.
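To see why the mean misleads on a bimodal distribution, here is a small sketch using a nearest-rank percentile. The sample (90 fast requests, 10 slow ones) is made up for illustration, and Studio's own percentile computation may differ.

```typescript
// Nearest-rank percentile over a list of latencies in milliseconds.
function percentile(valuesMs: number[], p: number): number {
  if (valuesMs.length === 0) throw new Error("no samples");
  const sorted = valuesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(valuesMs: number[]): number {
  return valuesMs.reduce((sum, v) => sum + v, 0) / valuesMs.length;
}

// Invented bimodal sample: 90 cache hits at 10 ms, 10 cold plans at 400 ms.
const latenciesMs: number[] = [];
for (let i = 0; i < 90; i++) latenciesMs.push(10);
for (let i = 0; i < 10; i++) latenciesMs.push(400);

console.log(mean(latenciesMs)); // 49
console.log(percentile(latenciesMs, 50)); // 10
console.log(percentile(latenciesMs, 95)); // 400
```

No request in this sample actually took 49 ms: the mean lands between the two modes, while p50 and p95 each describe a latency that real requests experienced.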
Identifying slow subgraph resolvers and N+1
Open a slow operation’s trace to get the federated flame graph. Read it for two signatures:
- A wide subgraph frame: one fetch dominating the operation’s wall-clock time points at that subgraph; drill into its field frames to find the specific slow resolver, then optimise it per optimizing reference resolvers for performance.
- A repeated narrow frame under a list: the same field timed many times beneath a list field is the classic N+1, one query per element instead of one batched query. The fix is DataLoader batching, covered in batching entity resolution with DataLoader.
A __resolveReference frame that appears once per entity in a list is the federation-specific N+1 — the router asked the subgraph to resolve N references and the subgraph issued N database round-trips instead of one.
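The repeated-frame signature can also be checked mechanically if you have trace data in hand. The sketch below assumes a hypothetical, simplified `TraceNode` tree, not Studio's actual trace format. Every field under a list repeats once per element, so it counts only frames at or above a duration threshold, which filters out trivial property resolvers.

```typescript
// Hypothetical, simplified trace node; real federated traces are richer.
interface TraceNode {
  field: string; // e.g. "Query.topProducts" or "Product.inStock"
  listIndex?: number; // set when this node is one element of a list
  durationMs: number;
  children: TraceNode[];
}

interface RepeatedFrame {
  listField: string;
  field: string;
  occurrences: number;
  totalMs: number;
}

// Report fields that run once per list element and each take non-trivial time.
function findRepeatedFrames(
  root: TraceNode,
  minOccurrences: number,
  minDurationMs: number
): RepeatedFrame[] {
  const found: RepeatedFrame[] = [];
  const visit = (node: TraceNode): void => {
    const counts: Record<string, number> = {};
    const totals: Record<string, number> = {};
    node.children.forEach((element) => {
      if (element.listIndex === undefined) return; // not a list element
      element.children.forEach((frame) => {
        if (frame.durationMs < minDurationMs) return; // ignore trivial resolvers
        counts[frame.field] = (counts[frame.field] || 0) + 1;
        totals[frame.field] = (totals[frame.field] || 0) + frame.durationMs;
      });
    });
    Object.keys(counts).forEach((field) => {
      if (counts[field] >= minOccurrences) {
        found.push({
          listField: node.field,
          field,
          occurrences: counts[field],
          totalMs: totals[field],
        });
      }
    });
    node.children.forEach((child) => visit(child));
  };
  visit(root);
  return found;
}

// Invented trace: three list elements, each paying 12 ms for the same field.
const listElements: TraceNode[] = [0, 1, 2].map((i): TraceNode => ({
  field: "Product",
  listIndex: i,
  durationMs: 13,
  children: [
    { field: "Product.id", durationMs: 0.1, children: [] },
    { field: "Product.inStock", durationMs: 12, children: [] },
  ],
}));
const sampleTrace: TraceNode = {
  field: "Query.topProducts",
  durationMs: 45,
  children: listElements,
};

const suspects = findRepeatedFrames(sampleTrace, 3, 5);
console.log(suspects);
```

With these inputs the only reported frame is `Product.inStock` under `Query.topProducts`, with three occurrences totalling 36 ms. The thresholds are arbitrary and would need tuning against real traces.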
A second, subtler signature to read in the flame graph is fetch shape. The router batches entity references into a single _entities query per subgraph per plan step, so a healthy federated trace shows one reference-resolution fetch into the subgraph carrying many representations, and the subgraph should fan that single request out to one batched data-layer call. The pathology is the inverse: one _entities fetch arrives with fifty representations, and the subgraph’s flame graph shows fifty serial database spans beneath it. That is the canonical place DataLoader belongs — collapsing the fifty keys the router already batched into one WHERE id IN (...) query. Reading the trace top-down (router batched correctly?) and then inside the subgraph (did the subgraph preserve the batch?) localises the fault to exactly one of the two layers, which is far faster than instrumenting blindly.
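The sketch below shows the shape of a batch function that keeps the router's batch intact: many representations in, one data-layer call out. It is a minimal, synchronous stand-in written for this guide. The `Product` type, the in-memory table, and the function names are all hypothetical, and in a real subgraph the function would be async and handed to DataLoader rather than called directly.

```typescript
// Hypothetical entity representation as it arrives in an _entities fetch.
interface ProductRef {
  __typename: "Product";
  id: string;
}

interface Product {
  id: string;
  title: string;
}

// Stand-in data layer that counts round-trips so the effect is visible.
let queryCount = 0;
const productTable: Record<string, Product> = {
  "1": { id: "1", title: "Chair" },
  "2": { id: "2", title: "Table" },
  "3": { id: "3", title: "Lamp" },
};

function selectProductsByIds(ids: string[]): Product[] {
  queryCount += 1; // one round-trip, comparable to SELECT ... WHERE id IN (...)
  const rows: Product[] = [];
  ids.forEach((id) => {
    const row = productTable[id];
    if (row) rows.push(row);
  });
  return rows;
}

// One data-layer call for all keys; results come back with the same length and
// order as the input keys, with null for misses.
function batchLoadProducts(refs: ProductRef[]): Array<Product | null> {
  const uniqueIds: string[] = [];
  const seen: Record<string, boolean> = {};
  refs.forEach((ref) => {
    if (!seen[ref.id]) {
      seen[ref.id] = true;
      uniqueIds.push(ref.id);
    }
  });
  const byId: Record<string, Product> = {};
  selectProductsByIds(uniqueIds).forEach((row) => {
    byId[row.id] = row;
  });
  return refs.map((ref) => byId[ref.id] || null);
}

const productRefs: ProductRef[] = [
  { __typename: "Product", id: "2" },
  { __typename: "Product", id: "1" },
  { __typename: "Product", id: "2" },
  { __typename: "Product", id: "9" },
];
const resolvedProducts = batchLoadProducts(productRefs);

console.log(queryCount); // 1
console.log(resolvedProducts.map((p) => (p ? p.title : null)));
// [ 'Table', 'Chair', 'Table', null ]
```

Returning results in key order with an explicit placeholder for misses matters because a batching layer matches results to callers by position. A batch function that returns rows in database order would silently attach the wrong entity to a reference.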
Cache hits in the trace
If you run response or entity caching in the router, the flame graph also shows whether a given operation was served from cache. A trace with no subgraph fetches at all (just a router span and a cache-lookup) is a full cache hit; a trace with some subgraph frames and some absent is a partial hit. Reading these alongside latency percentiles explains the bimodal distribution federation latency usually exhibits — the fast mode is cache hits, the slow mode is cold multi-fetch plans. Interpreting these together with caching strategies for federated GraphQL is what lets you tell a real regression from a normal shift in cache-hit ratio.
Verification Steps
- Confirm reporting is live. After starting the router with credentials set, run a few operations, then check the graph’s Operations view in Studio; operations should appear within a minute or two.
- Confirm field traces. Open a recent operation and verify a trace/flame graph is present (not just timing). If traces are missing, field_level_instrumentation_sampler may be too low to have sampled yet; temporarily raise it to 1.0 to verify, then lower it.
- Confirm subgraph attribution. In the trace, each frame should name its subgraph. If frames are unattributed, the subgraph is not emitting FTV1, so re-check the inline trace plugin.
- Confirm client segmentation. Filter the Operations view by client name; if all traffic shows as unknown, clients aren’t sending apollographql-client-name or send_headers doesn’t allow it.
Alerting on Regressions
Studio’s notifications can alert on performance and error-rate thresholds per operation and per variant.
- Set p95 thresholds per critical operation, not a single global threshold — a 200ms p95 may be fine for a dashboard query and a regression for an autocomplete.
- Alert on error-rate deltas, catching a subgraph that starts throwing without raising latency.
- Gate schema changes with checks. Before deploying, rover subgraph check compares the proposed schema against real Studio usage and flags breaking changes for fields clients actually use, combining performance data with composition safety from federated schema validation in CI/CD pipelines.
# in CI: fail the build if the change breaks operations real clients run
rover subgraph check "$APOLLO_GRAPH_REF" \
  --schema ./products/schema.graphql \
  --name products
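The first two alerting rules (a p95 budget per operation and an error-rate delta) can be written out as plain logic, as in the sketch below. This is not Studio's notification API, since Studio's alerts are configured in Studio itself. The operation names, budgets, and thresholds are invented for illustration.

```typescript
// Hypothetical per-operation p95 budgets in milliseconds.
const p95BudgetMs: Record<string, number> = {
  Autocomplete: 80,
  Dashboard: 400,
};

interface OperationStat {
  operation: string;
  p95Ms: number;
  errorRate: number; // fraction of errored requests in the current window
  baselineErrorRate: number; // same fraction in a comparison window
}

interface RegressionAlert {
  operation: string;
  reason: string;
}

// Flag operations over their own p95 budget or with a jump in error rate.
function evaluateAlerts(
  stats: OperationStat[],
  maxErrorRateDelta: number
): RegressionAlert[] {
  const raised: RegressionAlert[] = [];
  stats.forEach((s) => {
    const budget = p95BudgetMs[s.operation];
    if (budget !== undefined && s.p95Ms > budget) {
      raised.push({ operation: s.operation, reason: "p95 over budget" });
    }
    if (s.errorRate - s.baselineErrorRate > maxErrorRateDelta) {
      raised.push({ operation: s.operation, reason: "error-rate delta" });
    }
  });
  return raised;
}

const raisedAlerts = evaluateAlerts(
  [
    { operation: "Autocomplete", p95Ms: 200, errorRate: 0.01, baselineErrorRate: 0.01 },
    { operation: "Dashboard", p95Ms: 200, errorRate: 0.06, baselineErrorRate: 0.01 },
  ],
  0.02
);
console.log(raisedAlerts);
```

Both operations report the same 200 ms p95, yet only Autocomplete breaches its latency budget. Dashboard is flagged solely for its error-rate jump, which a latency-only alert would miss.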
Common Mistakes and Gotchas
- Leaking PII through traces. Leaving send_variable_values or unredacted subgraph errors enabled ships sensitive payloads to Studio. Set send_variable_values: none and redact: true.
- Over-sampling FTV1. A high field_level_instrumentation_sampler in production adds real per-field timing overhead in subgraphs. Keep it at 1–5%; Studio’s aggregates remain accurate.
- Watching averages, not percentiles. Mean latency hides the cold-plan tail. Always alert and triage on p95/p99.
Frequently Asked Questions
What is the difference between operation usage reporting and field-level (FTV1) traces?
Usage reporting records every operation’s signature, client, and overall timing with negligible overhead — it powers usage counts and p50/p95 charts. FTV1 traces are sampled, per-field execution timings emitted by subgraphs and stitched by the router into a federated flame graph, letting you see which specific field and subgraph is slow. Keep usage at full volume and FTV1 sampled at 1–5%.
How do I find an N+1 in a federated query using Studio?
Open the slow operation’s trace and look for the same resolver or __resolveReference frame repeated once per element beneath a list field. That repetition is the N+1 signature — the subgraph is issuing one round-trip per item. Resolve it with DataLoader-based batching in the subgraph’s reference resolver.
Do I need OpenTelemetry if I already report to Apollo Studio?
They serve different needs. Studio gives schema-aware, field-level performance and usage scoped to your managed graph. OpenTelemetry/OTLP gives vendor-neutral distributed traces you can correlate with non-GraphQL services (databases, queues, other APIs) in a backend like Tempo or Datadog. Production teams commonly run both, with FTV1 and OTLP sampling tuned independently.
Related
- Observability and Distributed Tracing in Federation — parent guide
- Propagating Trace Context Across Subgraphs with OpenTelemetry
- Optimizing Reference Resolvers for Performance — fix the slow resolvers Studio surfaces
- Apollo Studio Schema Checks for Managed Federation — usage data gates schema changes
- Federated GraphQL Operations in Production — parent section