Most crypto gateway outages don't announce themselves with a 500 from your API. They surface six hours later as a support ticket: "I paid but my order never fulfilled." The cause is almost always one of two things: an RPC provider serving stale state, or a block-lag gap during which the wrong confirmation count was used to advance a payment intent.
This post covers the observability stack we run at BchainPay to catch both problems before merchants notice: block-lag detection, fan-in provider topology, SLO histograms, and reorg detection — with enough code to implement it independently.
## The failure mode nobody models
Most payment-gateway post-mortems assume the bug is in application code. It usually isn't. When you span multiple chains, the RPC providers between you and the chain are the weakest link — and they fail silently.
A failing RPC provider doesn't return 500. It returns perfectly well-formed JSON with a `blockNumber` that is 30 blocks behind the network tip. Your confirmation logic then counts confirmations against that stale view: a 12-confirmation Polygon payment looks confirmed to your worker while the canonical chain has moved 42 blocks further, and you've been reporting wrong state for over a minute.
Two distinct failure modes exist:
- **Block lag:** `eth_blockNumber` (or the chain's equivalent) returns a height behind the real tip. Responses are structurally valid but stale, so you measure confirmations against the wrong height.
- **State partition:** the provider has synced a branch that was later orphaned. A transaction you've confirmed may not exist on the canonical chain. This is the reorganization scenario: rare but catastrophic.
Lag is caught by comparing tips across providers. Partitions require cross-checking block hashes at known heights.
## Fan-in provider topology
The structural mitigation: never depend on a single RPC provider. We run three providers per chain — typically two commercial providers and one self-hosted node — behind a routing layer that always selects the freshest healthy provider for reads.
```ts
type RPCProvider = {
  name: string;
  url: string;
  tip: number;     // latest block height seen
  lag: number;     // blocks behind max(tip) across the pool
  healthy: boolean;
  recoveryCount: number;
};

class ChainWatcher {
  constructor(
    readonly chain: string,
    readonly providers: RPCProvider[],
  ) {}

  // Returns the least-lagged healthy provider for reads
  selectProvider(): RPCProvider {
    const healthy = this.providers.filter(p => p.healthy);
    if (healthy.length === 0) {
      // Forced fallback: use the least-lagged provider regardless
      return [...this.providers].sort((a, b) => a.lag - b.lag)[0];
    }
    return healthy.sort((a, b) => a.lag - b.lag)[0];
  }
}
```

Reads (block queries, log fetches, receipt lookups) always go to the freshest provider. Broadcasts go to all providers in parallel; the first non-error wins. A single broadcast failure doesn't stall the payment — the transaction propagates via p2p once any provider accepts it.
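The broadcast fan-out itself isn't shown in the post; a minimal sketch, assuming an EVM-style `eth_sendRawTransaction` endpoint (the helper name and error handling are illustrative, not BchainPay's actual client):

```ts
// Send a signed transaction to every provider in parallel.
// Promise.any resolves on the first success and only rejects
// (with AggregateError) if every provider fails.
async function broadcastToAll(
  providers: RPCProvider[],
  rawTx: string,
): Promise<string> {
  return Promise.any(
    providers.map(async (p) => {
      const res = await fetch(p.url, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({
          jsonrpc: '2.0',
          id: 1,
          method: 'eth_sendRawTransaction',
          params: [rawTx],
        }),
      });
      const json = await res.json();
      if (json.error) throw new Error(json.error.message);
      return json.result as string; // transaction hash
    }),
  );
}
```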
## The block-lag watchdog
Every five seconds, a background task polls `eth_blockNumber` from all providers and recomputes relative lag:
```ts
const MAX_LAG_BLOCKS: Record<string, number> = {
  ethereum: 3,
  polygon: 10,
  bnb: 6,
  solana: 100, // slots, not blocks: 0.4 s/slot
  tron: 6,
};

const ALERT_LAG_BLOCKS: Record<string, number> = {
  ethereum: 10,
  polygon: 40,
  bnb: 20,
  solana: 400,
  tron: 20,
};

async function runLagWatchdog(watcher: ChainWatcher): Promise<void> {
  const results = await Promise.allSettled(
    watcher.providers.map(async (p) => ({
      provider: p,
      tip: await fetchBlockNumber(p.url),
    })),
  );

  const tips = results
    .filter((r): r is PromiseFulfilledResult<{ provider: RPCProvider; tip: number }> =>
      r.status === 'fulfilled')
    .map(r => r.value);

  // If every fetch failed there is nothing to compare against;
  // skip this cycle rather than computing lag from -Infinity.
  if (tips.length === 0) return;

  const maxTip = Math.max(...tips.map(t => t.tip));

  // Relative lag can't express "the whole pool is stale", so the
  // pool-wide tip is exported too; alerting watches its progress.
  metrics.rpcChainTip.labels({ chain: watcher.chain }).set(maxTip);

  for (const { provider, tip } of tips) {
    provider.tip = tip;
    provider.lag = maxTip - tip;

    const threshold = MAX_LAG_BLOCKS[watcher.chain];
    if (provider.lag <= threshold) {
      // Recovery gating: require 3 consecutive healthy cycles
      provider.recoveryCount = (provider.recoveryCount ?? 0) + 1;
      if (!provider.healthy && provider.recoveryCount >= 3) {
        provider.healthy = true;
        provider.recoveryCount = 0;
        logger.info(`rpc:recovered chain=${watcher.chain} provider=${provider.name}`);
      }
    } else {
      provider.healthy = false;
      provider.recoveryCount = 0;
      if (provider.lag > ALERT_LAG_BLOCKS[watcher.chain]) {
        alerting.fire('rpc_provider_lagging', {
          chain: watcher.chain,
          provider: provider.name,
          lag: provider.lag,
        });
      }
    }

    metrics.rpcBlockLag
      .labels({ chain: watcher.chain, provider: provider.name })
      .set(provider.lag);
  }
}
```

The table below shows how `MAX_LAG_BLOCKS` maps to wall-clock staleness:
| Chain | Block time | Max healthy lag | Wall-clock staleness |
|---|---|---|---|
| Ethereum | 12 s | 3 blocks | 36 s |
| Polygon | 2 s | 10 blocks | 20 s |
| BNB | 3 s | 6 blocks | 18 s |
| Solana | 0.4 s | 100 slots | 40 s |
| Tron | 3 s | 6 blocks | 18 s |
One caveat: because lag is measured against the pool's own max tip, relative lag can never show every provider going stale at once; the freshest provider always reads zero. That correlated failure (usually your network egress, not the chain) is what the tip-progress alert in the rules below catches. If a single provider lags while the others stay current, that provider is degraded.
Recovery is gated at three consecutive healthy cycles. A provider that oscillates around the threshold would otherwise churn the routing table every 15 seconds; gating prevents that flapping.
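The watchdog assumes a `fetchBlockNumber` helper. For EVM chains, a minimal JSON-RPC version might look like the sketch below (Solana would call `getSlot` instead); timeouts, retries, and auth headers are left out:

```ts
// Minimal eth_blockNumber fetch over JSON-RPC for EVM chains.
async function fetchBlockNumber(url: string): Promise<number> {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_blockNumber',
      params: [],
    }),
  });
  const json = await res.json();
  if (json.error) throw new Error(json.error.message);
  return parseInt(json.result, 16); // result is a hex quantity, e.g. "0xa1b2c3"
}
```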
## Confirmation time SLOs
Block lag drives confirmation latency. If you promise merchants that a USDC.polygon payment reaches `confirmed` within 60 seconds of the on-chain transfer, and your RPC provider is 40 blocks behind, you will miss that SLA even though the chain is healthy: 40 blocks at Polygon's 2 s block time is 80 seconds of staleness before your indexer sees the transfer at all.

Track this as a histogram, recorded when a payment intent transitions from `pending_confirmation` to `confirmed`:
```ts
// In your on-chain indexer, after marking an intent confirmed
const latencyMs = Date.now() - intent.detectedAt;
metrics.confirmationSeconds
  .labels({ chain: intent.chain, tier: amountTier(intent.amountUsd) })
  .observe(latencyMs / 1000);
```

The resulting Prometheus output gives you per-chain, per-tier percentiles:
```text
# HELP bchainpay_payment_confirmation_seconds Seconds from on-chain detection to confirmed
# TYPE bchainpay_payment_confirmation_seconds histogram
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="30"} 4102
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="60"} 4980
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="120"} 4999
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="+Inf"} 5001
bchainpay_payment_confirmation_seconds_sum{chain="polygon",tier="sub_1k"} 142371
bchainpay_payment_confirmation_seconds_count{chain="polygon",tier="sub_1k"} 5001
```

When the P95 breaches your SLA window, check block lag first. Nine times in ten, that's the cause.
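The `amountTier` helper isn't defined in the post; one plausible bucketing, consistent with the `sub_1k` label above (the thresholds are illustrative):

```ts
// Coarse amount tiers keep histogram label cardinality low.
// Thresholds here are assumptions, not BchainPay's actual tiering.
function amountTier(amountUsd: number): string {
  if (amountUsd < 1_000) return 'sub_1k';
  if (amountUsd < 10_000) return 'sub_10k';
  return '10k_plus';
}
```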
## Reorg detection
A reorg is rarer but more dangerous. The guard: at every watchdog cycle, fetch the block hash at a checkpoint five blocks behind the tip and compare it against the hash previously stored for that height.
```ts
// Runs once per watchdog cycle. A hash mismatch at a height we've
// already checkpointed means chain history was rewritten beneath us.
async function checkForReorg(
  provider: RPCProvider,
  chain: string,
  depth = 5,
): Promise<void> {
  const checkHeight = provider.tip - depth;
  const observedHash = await fetchBlockHash(provider.url, checkHeight);
  const storedHash = await db.getCheckpointHash(chain, checkHeight);

  if (!storedHash) {
    // First visit to this height: record the checkpoint.
    await db.saveCheckpointHash(chain, checkHeight, observedHash);
    return;
  }

  if (storedHash !== observedHash) {
    metrics.reorgsDetected.labels({ chain }).inc();
    alerting.fire('reorg_detected', {
      chain,
      height: checkHeight,
      stored: storedHash,
      observed: observedHash,
    });
    await triggerReorgScan(chain, checkHeight);
  }
}
```

`triggerReorgScan` re-fetches every payment intent whose last-known block height is >= `checkHeight`, rechecks each transaction receipt against the current canonical chain, and moves any orphaned intents back to `awaiting_payment`. The `payment_intent.investigating` webhook fires so merchants know their fulfillment logic should pause; no funds are released until the intent re-confirms on the canonical chain.
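A sketch of what that scan might look like; the query helpers, receipt fetcher, and webhook emitter are assumed names for illustration:

```ts
// Re-validate every intent that the rewritten history could have touched.
// db.getIntentsAtOrAbove, fetchReceipt, db.updateIntentStatus, and
// webhooks.emit are hypothetical helpers, not BchainPay's actual API.
async function triggerReorgScan(chain: string, fromHeight: number): Promise<void> {
  const intents = await db.getIntentsAtOrAbove(chain, fromHeight);

  for (const intent of intents) {
    const receipt = await fetchReceipt(chain, intent.txHash);

    // No receipt, or a receipt on a different block, means the
    // transaction was orphaned along with its branch.
    if (!receipt || receipt.blockHash !== intent.blockHash) {
      await db.updateIntentStatus(intent.id, 'awaiting_payment');
      await webhooks.emit('payment_intent.investigating', {
        intentId: intent.id,
        chain,
        reason: 'reorg',
      });
    }
  }
}
```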
This avoids a refund-after-the-fact scenario. Detecting the reorg and pausing is cheap; discovering it after fulfillment is not.
## Prometheus alert rules
```yaml
groups:
  - name: bchainpay.rpc
    rules:
      - alert: RPCProviderLagging
        expr: bchainpay_rpc_block_lag > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: >
            {{ $labels.provider }} on {{ $labels.chain }}
            is {{ $value }} blocks behind

      # Relative lag is measured against the pool's own max tip, so it
      # can't flag the whole pool going stale together. Watch whether
      # the chain tip stops advancing instead.
      - alert: AllRPCProvidersDegraded
        expr: delta(bchainpay_rpc_chain_tip[2m]) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: Chain tip for {{ $labels.chain }} has stopped advancing

      - alert: ConfirmationSLOBreach
        expr: |
          histogram_quantile(0.95,
            rate(bchainpay_payment_confirmation_seconds_bucket[10m])
          ) > 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: >
            P95 confirmation latency on {{ $labels.chain }}
            exceeds the 2-minute SLA

      - alert: ReorgDetected
        expr: increase(bchainpay_reorg_detected_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: Chain reorg detected on {{ $labels.chain }}
```

Route `AllRPCProvidersDegraded` and `ReorgDetected` to PagerDuty or equivalent; they warrant immediate human response. `RPCProviderLagging` can go to Slack; it's low signal when only one provider is affected and the pool reroutes automatically.
The `ConfirmationSLOBreach` alert is the one that wakes the on-call engineer most often. Almost every breach traces back to either block lag or a confirmation policy misconfigured for a new chain. Having the P95 histogram per chain and tier makes the root cause obvious at a glance.
## Key takeaways
- A structurally valid stale response is more dangerous than an error. Your error-handling code never fires; your stale-data code never existed.
- Run three providers per chain. Two commercial (Alchemy, Infura, QuickNode) and one self-hosted node give you uncorrelated failure modes. Self-hosted nodes are slower but immune to commercial provider incidents.
- Gate recovery at three consecutive healthy cycles. Immediate re-admission causes flapping that destabilizes your routing table without actually improving availability.
- Checkpoint block hashes, not just block numbers. A reorged chain can appear at the same tip height but with a different canonical history. Only hash comparison catches that.
- Map lag headroom to your SLA window. If your confirmation SLA is 60 seconds and block time is 2 seconds, a 30-block lag burns your entire SLA budget before a single confirmation lands. Keep lag thresholds well below `SLA_seconds / block_time - required_confirmations` blocks (see the sketch after this list).
- Gate fund release on your chain watcher's health. If the watchdog marks a chain degraded, halt confirmations on that chain entirely; never work around degraded state to stay online. A paused payment can be explained; a double-fulfillment caused by stale confirmation cannot.
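As a sanity check on that headroom arithmetic, a small helper; the function name and the 12-confirmation example are mine, drawn from the Polygon figures used earlier in the post:

```ts
// Maximum provider lag (in blocks) that still leaves room to count the
// required confirmations inside the SLA window.
function maxLagHeadroom(
  slaSeconds: number,
  blockTimeSeconds: number,
  requiredConfirmations: number,
): number {
  return Math.max(0, Math.floor(slaSeconds / blockTimeSeconds) - requiredConfirmations);
}

// 60 s SLA, 2 s Polygon blocks, 12 confirmations:
// 30 blocks of budget minus 12 for confirmations = 18 blocks of headroom.
console.log(maxLagHeadroom(60, 2, 12)); // 18
```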