Defining Subgraph Boundaries for Microservices
Defining subgraph boundaries for microservices means aligning your GraphQL schema partitions with domain-driven service boundaries so that each subgraph owns a coherent slice of the graph, deploys independently, and minimises cross-service latency. Draw the lines well and teams ship against one graph without coordinating; draw them badly and you inherit a distributed monolith where every change ripples across services. This guide sits under GraphQL Federation Architecture & Design and covers how to discover, encode, validate, and operate boundaries in production. The tactical extraction sequence is covered separately in how to split a monolith GraphQL schema into subgraphs.
The boundary problem is fundamentally about ownership. In the model this guide follows, every entity in a federated graph has exactly one owning subgraph: the place where its @key is canonical and its __resolveReference lives. Federation 2 will let several subgraphs contribute to and resolve the same entity, but a single named owner is what keeps responsibility clear. Get ownership wrong and you either duplicate state across services (drift and inconsistency) or create chatty cross-boundary joins that dominate your latency budget. Boundaries are also expensive to move once clients depend on them, so the cost of an early mistake compounds. This page gives you a repeatable method for finding the right lines before they harden.
Prerequisites
Before applying the workflows below, confirm the following are in place:
Concept Deep-Dive
A subgraph boundary is the seam between two owning domains, expressed in SDL through entity keys and reference stubs. The defining test of a good boundary is cohesion versus coupling: types inside a subgraph should reference each other far more than they reference types outside it. A practical heuristic is the cross-domain reference ratio — when more than roughly 15% of a subgraph’s type surface is shared with or references another subgraph, the boundary is probably misdrawn and two domains are bleeding into one another.
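The reference ratio can be approximated from a type inventory. The sketch below is one possible interpretation, not a standard metric: it counts the fraction of a subgraph's types that reference at least one type assigned to another subgraph. The `TypeNode` shape, the `crossDomainRatio` function, and the sample data are all hypothetical.

```typescript
// Hypothetical model: each type is assigned to one subgraph and lists
// the names of the types its fields return.
interface TypeNode {
  name: string;
  subgraph: string;
  references: string[];
}

// Fraction of the subgraph's types that reference at least one type
// assigned to a different subgraph. Returns 0 for an empty subgraph.
function crossDomainRatio(types: TypeNode[], subgraph: string): number {
  const owner = new Map<string, string>();
  for (const t of types) owner.set(t.name, t.subgraph);

  const own = types.filter((t) => t.subgraph === subgraph);
  if (own.length === 0) return 0;

  const crossing = own.filter((t) =>
    t.references.some((ref) => owner.has(ref) && owner.get(ref) !== subgraph),
  );
  return crossing.length / own.length;
}

// Illustrative inventory only; real input would come from introspection.
const sampleTypes: TypeNode[] = [
  { name: "User", subgraph: "accounts", references: ["Credential", "Profile"] },
  { name: "Credential", subgraph: "accounts", references: [] },
  { name: "Session", subgraph: "accounts", references: ["User"] },
  { name: "Profile", subgraph: "profiles", references: [] },
];

const accountsRatio = crossDomainRatio(sampleTypes, "accounts");
const profilesRatio = crossDomainRatio(sampleTypes, "profiles");
console.log("accounts", accountsRatio.toFixed(2), "profiles", profilesRatio.toFixed(2));
```

With only a handful of types the ratio is noisy, so the 15% threshold is more meaningful on a realistically sized subgraph than on a toy inventory like this one.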
The diagram below contrasts a well-drawn boundary with an over-fragmented one, using the same five types in each case.
Boundaries are encoded with three directives. @key names the entity and the fields that identify it across subgraphs. @external declares that a referenced field is owned elsewhere. @provides tells the planner that a resolver can return certain external fields inline, letting it skip a hop. The example below shows a clean boundary between an accounts domain and a profiles domain, where accounts can opportunistically provide the avatar it already holds:
# accounts subgraph
extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.9",
        import: ["@key", "@external", "@provides"])
type User @key(fields: "id") {
  id: ID!
  # accounts references Profile but does not own it
  profile: Profile @provides(fields: "avatar")
}
# Stub: Profile is owned by the profiles subgraph
type Profile @key(fields: "id") {
  id: ID!
  avatar: String @external
}
# profiles subgraph (canonical owner of Profile)
extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.9", import: ["@key"])
type Profile @key(fields: "id") {
  id: ID!
  avatar: String
  bio: String
}
Here @provides(fields: "avatar") hints to the router that resolving User.profile in the accounts subgraph already yields avatar, so the planner can satisfy a query for user { profile { avatar } } without a second fetch to profiles. That is a boundary-shaping decision: you accepted a small amount of denormalisation (accounts must hold a current avatar) to eliminate a hop on a hot path. The deeper rules for these cross-service stubs live in designing cross-service type references, and when two domains genuinely co-own a field you reach for type ownership and shared schema contracts.
Three signals that a boundary is wrong
Boundaries rarely announce themselves as broken; they degrade quietly. Three signals catch most mistakes early. The first is the reference ratio already mentioned: a subgraph whose SDL is mostly @external stubs and @requires is not a domain owner, it is a satellite of another domain, and the two should probably merge. The second is the transactional signal — if a single business operation must write to two subgraphs to stay consistent, the boundary has cut through an aggregate that should be atomic, and you will spend the rest of the system’s life papering over the split with sagas and dual writes. The third is the deploy-coupling signal: if shipping a feature routinely requires coordinated releases of two subgraphs, the boundary is not buying you the independence federation exists to provide.
Aggregate roots, not tables
The most reliable way to draw a boundary is to follow the aggregate root from domain-driven design rather than the shape of your database. An aggregate is a group of objects treated as a single unit for data changes; the practical rule is that the aggregate root is the entity other domains reference, and everything inside the aggregate stays in one subgraph. An Order with its line items is one aggregate; User with its credentials and profile is another. References between aggregates become @key boundaries; references within an aggregate stay as ordinary GraphQL fields with no federation directives at all. When you find yourself putting a @key on a type that only ever exists as a child of another type, that is a sign you have over-fragmented.
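The over-fragmentation check at the end of that paragraph can be roughed out mechanically. The sketch below is a heuristic under stated assumptions, not a composition rule: it flags entities that carry a @key but are only ever returned by fields of a single parent type. The `EntityInfo` shape and the sample data are hypothetical.

```typescript
// Hypothetical summary of one type in the candidate schema.
interface EntityInfo {
  name: string;
  hasKey: boolean;        // true when the type carries a @key directive
  referencedBy: string[]; // names of types with a field returning this type
}

// Entities with a @key that are reachable from exactly one parent type are
// candidates for folding back into that parent's aggregate.
function overFragmentationCandidates(entities: EntityInfo[]): string[] {
  return entities
    .filter((e) => e.hasKey && new Set(e.referencedBy).size === 1)
    .map((e) => e.name);
}

const sampleEntities: EntityInfo[] = [
  { name: "Order", hasKey: true, referencedBy: ["Query", "User"] },
  { name: "LineItem", hasKey: true, referencedBy: ["Order"] },
  { name: "User", hasKey: true, referencedBy: ["Query", "Order"] },
];

const candidates = overFragmentationCandidates(sampleEntities);
console.log(candidates);
```

A flagged type is a prompt for review rather than a verdict: some child types legitimately need a key because another subgraph extends them.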
Directive & Config Spec Table
The directives and configuration keys that govern boundaries fall into composition-time concerns (validated when the supergraph is built) and runtime concerns (evaluated by the router per query).
| Directive / key | Syntax | Valid values | Effect (composition-time vs runtime) |
|---|---|---|---|
| @key | @key(fields: "id") | one or more field names, space-separated | Composition: declares the entity and validates the key fields exist and are shareable. |
| @external | field: T @external | applied to a field | Composition: marks the field as owned elsewhere; runtime: router will not call this subgraph to resolve it. |
| @provides | @provides(fields: "avatar") | selection set of external fields | Runtime: lets the planner skip a hop when the field is returned inline. |
| @requires | @requires(fields: "sku") | selection of external key/owned fields | Runtime: router fetches the named fields first, then resolves the annotated field. |
| @shareable | field: T @shareable | applied to a field or type | Composition: permits more than one subgraph to resolve the field without a conflict error. |
| federation_version | federation_version: =2.9.0 | pinned version string | Composition: selects the composition algorithm; mismatches across the pipeline cause drift. |
| routing_url | routing_url: http://svc:4001/graphql | reachable URL | Runtime: the endpoint the router fetches from for that subgraph. |
Step-by-Step Implementation
Boundaries are discovered, then encoded, then validated. Follow the steps in order; each produces an artifact the next consumes.
Step 1 — Map domains to candidate subgraphs. Run an event-storming or dependency-graph exercise to group types by aggregate root and bounded context. The output is a candidate assignment of every type to exactly one owning domain.
# Introspect the current graph to get a complete type inventory to partition.
rover graph introspect https://graph.internal/graphql > current.graphql
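Before moving on, it can help to check that the candidate assignment really does place every type in exactly one domain. The following is a minimal sketch of such a check; the `validateAssignment` function, the type names, and the domain lists are hypothetical and would be replaced by the inventory from the introspected schema.

```typescript
interface AssignmentReport {
  unassigned: string[];       // types no domain has claimed
  multiplyAssigned: string[]; // types claimed by more than one domain
}

function validateAssignment(
  allTypes: string[],
  domains: Record<string, string[]>,
): AssignmentReport {
  // Count how many distinct domains claim each type.
  const ownerCount = new Map<string, number>();
  for (const domain of Object.keys(domains)) {
    for (const typeName of Array.from(new Set(domains[domain]))) {
      ownerCount.set(typeName, (ownerCount.get(typeName) ?? 0) + 1);
    }
  }
  return {
    unassigned: allTypes.filter((t) => !ownerCount.has(t)),
    multiplyAssigned: allTypes.filter((t) => (ownerCount.get(t) ?? 0) > 1),
  };
}

// Illustrative input: User is claimed twice and LineItem not at all.
const report = validateAssignment(
  ["User", "Credential", "Profile", "Order", "LineItem"],
  {
    accounts: ["User", "Credential"],
    profiles: ["Profile", "User"],
    orders: ["Order"],
  },
);
console.log(report);
```

Both lists should be empty before the partition is treated as a candidate for Step 2.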
Step 2 — Measure cross-domain references against real traffic. Replay production operations against your candidate partition and count how often each operation crosses a proposed boundary. High-frequency crossings on hot paths are candidates for @provides or for redrawing the line.
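# Fetch the registered schema as a baseline for the candidate partition.
# The operations themselves come from your trace store; replay them against this schema to count boundary crossings.
rover graph fetch my-graph@prod --output baseline.graphql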
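The counting itself might look like the sketch below. It assumes each sampled operation has already been reduced to an ordered list of the types it traverses, which is a simplification (a real selection set is a tree, not a path). The `OperationSample` shape, the operation names, and the counts are hypothetical.

```typescript
interface OperationSample {
  name: string;
  count: number;      // executions in the sampled window
  typePath: string[]; // types visited in order, e.g. ["Order", "User"]
}

// Sums, per directed subgraph pair, how many executions crossed that boundary.
function countCrossings(
  ops: OperationSample[],
  ownerOf: Record<string, string>,
): Map<string, number> {
  const crossings = new Map<string, number>();
  for (const op of ops) {
    for (let i = 1; i < op.typePath.length; i++) {
      const from = ownerOf[op.typePath[i - 1]];
      const to = ownerOf[op.typePath[i]];
      // Skip unknown types and hops that stay inside one subgraph.
      if (from === undefined || to === undefined || from === to) continue;
      const edge = `${from}->${to}`;
      crossings.set(edge, (crossings.get(edge) ?? 0) + op.count);
    }
  }
  return crossings;
}

const ownerOf: Record<string, string> = {
  Order: "orders",
  LineItem: "orders",
  User: "accounts",
  Profile: "profiles",
};

const crossings = countCrossings(
  [
    { name: "OrderSummary", count: 900, typePath: ["Order", "LineItem"] },
    { name: "OrderWithCustomer", count: 400, typePath: ["Order", "User", "Profile"] },
    { name: "UserCard", count: 250, typePath: ["User", "Profile"] },
  ],
  ownerOf,
);
console.log(crossings);
```

Sorting the resulting edges by count gives a shortlist of boundaries to examine for @provides or for redrawing.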
Step 3 — Encode the boundary in SDL. For each domain, write the subgraph SDL: full definitions for owned entities, @key stubs for referenced entities, and @provides/@requires where a hot path justifies it.
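# orders subgraph references User (owned by accounts) by stub only
extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.9", import: ["@key"])
type Order @key(fields: "id") {
  id: ID!
  total: Float!
  customer: User!
}
# Reference-only stub: resolvable: false tells composition that orders never
# resolves User itself, so the key field does not need @external in Federation 2.
type User @key(fields: "id", resolvable: false) {
  id: ID!
}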
Step 4 — Implement the reference resolver in the owning subgraph. Every entity needs __resolveReference in its owner so the router can hydrate it from a key. The accounts subgraph, not orders, implements User.
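import { buildSubgraphSchema } from '@apollo/subgraph';
import { gql } from 'graphql-tag';
// Hypothetical in-memory stand-in for the accounts data source.
const users = new Map([['u1', { id: 'u1', name: 'Ada' }]]);
async function fetchUserById(id: string) {
  return users.get(id) ?? null;
}
const typeDefs = gql`
  extend schema @link(url: "https://specs.apollo.dev/federation/v2.9", import: ["@key"])
  type User @key(fields: "id") {
    id: ID!
    name: String!
  }
`;
const resolvers = {
  User: {
    // Called by the router's _entities fetch with the key fields from the stub.
    __resolveReference(ref: { id: string }) {
      return fetchUserById(ref.id); // hydrate from the accounts data source
    },
  },
};
export const schema = buildSubgraphSchema({ typeDefs, resolvers });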
Step 5 — Compose and validate. Merge the subgraphs and confirm the boundary composes before anything ships.
rover supergraph compose --config supergraph.yaml > supergraph.graphql
Composition Pipeline Integration
Boundaries drift the moment they are not enforced. Wire composition checks into CI so a pull request that breaks a boundary fails before merge. The supergraph.yaml that declares the topology is the source of truth for which subgraph owns which routing URL:
federation_version: =2.9.0
subgraphs:
  accounts:
    routing_url: http://accounts-svc:4001/graphql
    schema:
      file: ./accounts.graphql
  profiles:
    routing_url: http://profiles-svc:4002/graphql
    schema:
      file: ./profiles.graphql
  orders:
    routing_url: http://orders-svc:4003/graphql
    schema:
      file: ./orders.graphql
On each pull request, run rover subgraph check for the changed subgraph against the registered graph. It catches both composition errors (an ownership violation, a broken stub) and operation-breaking changes (a removed field a client query depends on):
rover subgraph check my-graph@prod \
--name orders \
--schema ./orders.graphql
The full pipeline patterns — including how to gate merges and run checks across many subgraphs — are covered in schema validation in CI/CD pipelines, and operating the registry these commands talk to is covered in schema registry and managed federation.
Performance & Scale Considerations
Every boundary is a potential network hop, and hops dominate federated latency. A query that crosses three boundaries in sequence produces a plan with up to four dependent fetch stages (the initial fetch plus one per crossing), each waiting on the previous. The trade-off between clean normalisation and federation chattiness must be evaluated at the boundary layer, not after the fact.
| Strategy | Latency impact | Data freshness | Implementation cost |
|---|---|---|---|
| Strict federation (normalised) | Higher — N+1 joins without batching | Real-time | Low — clean boundaries |
| Boundary denormalisation | Lower — single fetch | Eventual consistency | Medium — sync pipelines |
| DataLoader batching | Moderate — batched joins | Real-time | Medium — resolver complexity |
The first defence against cross-boundary N+1 is DataLoader batching inside each subgraph, so the router’s batched _entities fetch maps to a single backing query rather than one per key. The second is selective denormalisation on read-heavy hot paths — but only where eventual consistency is acceptable, and always with a synchronisation pipeline that keeps the duplicated field current. The detailed reference-resolver tuning belongs to optimizing reference resolvers for performance in the implementation section. Profile real plans before optimising: a boundary that looks expensive in the SDL may be cold in production, and one that looks cheap may sit on your hottest query.
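The contract a batched reference resolver has to honour is easy to get wrong: one backing query for the whole batch, and one result per requested key, in request order, with a null where nothing was found. The sketch below illustrates that contract synchronously for brevity. A real DataLoader batch function is asynchronous and returns a Promise, and `fetchUsersByIds`, `alignToKeys`, and the in-memory store here are hypothetical stand-ins.

```typescript
interface UserRow {
  id: string;
  name: string;
}

// Hypothetical store; stands in for a single `WHERE id IN (...)` query.
const userStore: UserRow[] = [
  { id: "u1", name: "Ada" },
  { id: "u2", name: "Grace" },
];
let backingQueries = 0;

function fetchUsersByIds(ids: readonly string[]): UserRow[] {
  backingQueries += 1;
  const wanted = new Set(ids);
  return userStore.filter((row) => wanted.has(row.id));
}

// Re-orders rows to match the requested keys; missing keys become null.
function alignToKeys<T extends { id: string }>(
  keys: readonly string[],
  rows: readonly T[],
): (T | null)[] {
  const byId = new Map<string, T>();
  for (const row of rows) byId.set(row.id, row);
  return keys.map((key) => byId.get(key) ?? null);
}

// One backing query per batch, regardless of how many keys or duplicates arrive.
function resolveUserBatch(keys: readonly string[]): (UserRow | null)[] {
  const uniqueKeys = Array.from(new Set(keys));
  return alignToKeys(keys, fetchUsersByIds(uniqueKeys));
}

const batch = resolveUserBatch(["u2", "missing", "u1", "u2"]);
console.log(batch.map((u) => (u ? u.name : null)), "queries:", backingQueries);
```

Returning rows in whatever order the database produced them is a common cause of entities being attached to the wrong parent, which is why the alignment step is kept separate and explicit.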
It helps to reason about the query plan the boundary produces, because that plan is what the router actually executes. For a query that fetches an Order and then its customer, the planner emits a two-stage plan: stage one fetches the order from the orders subgraph and collects the customer key; stage two sends a single batched _entities request to accounts with every collected key. The cost of the boundary is therefore one extra round trip plus the marginal cost of the batched fetch, not one fetch per order, provided your resolver batches. The cost climbs where the plan needs a further dependent stage, typically because a @requires field must first be fetched from its owning subgraph before the annotated field can be resolved, which is why @requires across a boundary deserves scrutiny. Use @provides to collapse a stage when the referencing subgraph already holds the data, and accept the denormalisation that implies. The point of measuring against real traffic is that the planner’s stage count, multiplied by your actual query mix, is the only honest predictor of federated latency.
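That last point can be turned into a back-of-envelope estimate. The sketch below assumes a deliberately crude additive model: sequential stages only, no router overhead, and no parallel fetches. The `PlanProfile` shape, the traffic shares, and the per-stage latencies are invented for illustration.

```typescript
interface PlanProfile {
  operation: string;
  share: number;              // fraction of total traffic, between 0 and 1
  stageLatenciesMs: number[]; // one entry per sequential fetch stage
}

// Traffic-weighted sum of per-stage latencies across the query mix.
function expectedLatencyMs(mix: PlanProfile[]): number {
  return mix.reduce(
    (total, plan) =>
      total + plan.share * plan.stageLatenciesMs.reduce((a, b) => a + b, 0),
    0,
  );
}

// Illustrative mix: most traffic stays in one subgraph, some crosses one or two boundaries.
const baselineMix: PlanProfile[] = [
  { operation: "OrderSummary", share: 0.6, stageLatenciesMs: [20] },
  { operation: "OrderWithCustomer", share: 0.3, stageLatenciesMs: [20, 15] },
  { operation: "OrderWithAvatar", share: 0.1, stageLatenciesMs: [20, 15, 25] },
];

// Same mix, with the third stage of OrderWithAvatar collapsed via @provides.
const providesMix: PlanProfile[] = baselineMix.map((plan) =>
  plan.operation === "OrderWithAvatar"
    ? { ...plan, stageLatenciesMs: [20, 15] }
    : plan,
);

const baselineMs = expectedLatencyMs(baselineMix);
const providesMs = expectedLatencyMs(providesMix);
console.log(baselineMs.toFixed(1), providesMs.toFixed(1));
```

The comparison matters more than the absolute numbers: collapsing a stage only pays off in proportion to the traffic share of the operations that use it.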
Failure Modes & Debugging
Composition error: field defined in multiple subgraphs. Federation 2 composition reports this under the INVALID_FIELD_SHARING code, with a message along the lines of Non-shareable field "User.email" is resolved from multiple subgraphs. It means two domains both claimed ownership of a field. Decide the single owner, then remove the field from the other subgraph (keep it there as @external only if that subgraph needs it for @requires or @provides), or, only if the resolver logic is genuinely identical, add @shareable to both. Conflict-resolution strategy is detailed in resolving schema conflicts in Apollo Federation.
Composition error: invalid @key. A message such as Type "Order" does not have a valid @key directive (exact wording varies by composition version; Federation 2 groups these under codes such as KEY_INVALID_FIELDS) means a referenced entity is missing its key or the key fields do not exist on the type. Add @key(fields: "id") to the canonical definition and confirm the key fields are present in every stub.
Runtime: null cross-boundary fields. When a query returns null for a field that should resolve across a boundary, the owning subgraph is almost always missing __resolveReference, or it returns a shape whose key fields do not match the stub. There is no composition error because the gap is at runtime; check the owner’s resolver and confirm it returns the @key fields.
Runtime: latency spikes on a single boundary. A boundary that fans out to a slow subgraph drags every query that touches it. Monitor per-subgraph fetch latency in your traces, and set a per-subgraph timeout under the traffic_shaping section of router.yaml so one slow boundary fails fast instead of stalling the whole plan. Reject queries with unbounded cross-boundary joins using complexity analysis before they reach the router.
Frequently Asked Questions
How do I determine the optimal number of subgraphs for a distributed GraphQL API?
Align subgraph count with bounded contexts and team topology, not an arbitrary target. Start with 3–5 high-cohesion domains, measure cross-boundary join latency against real traffic, and split further only when deployment velocity or ownership conflicts actually degrade. Premature fragmentation multiplies hops without buying independence.
What are the performance implications of cross-boundary entity resolution?
Each @key lookup is a network hop, a resolver execution, and a serialization step. Mitigate with DataLoader batching inside subgraphs, @provides on hot paths where the data is co-located, and selective denormalisation for read-heavy fields that tolerate eventual consistency.
Can I refactor subgraph boundaries after initial deployment?
Yes, but it requires migration planning because clients depend on the current shape. Use @override to move field ownership without breaking clients, dual-write to keep both sides consistent during the shift, and gate every step on rover subgraph check. Move one boundary at a time and watch composition stability and resolver latency throughout.
Should mutations and queries live in the same subgraph?
Yes — split by domain, not by operation type. A subgraph that owns the Order entity should own both reading and writing orders. Splitting reads from writes produces a distributed monolith where a single business operation spans services and you lose transactional locality.
Related
- GraphQL Federation Architecture & Design — parent section
- How to Split a Monolith GraphQL Schema into Subgraphs
- Designing Cross-Service Type References
- Type Ownership and Shared Schema Contracts
- Resolving Schema Conflicts in Apollo Federation
- Optimizing Reference Resolvers for Performance — cross-section: implementing the entities you bound here