
RPC Failover and Block-Lag Monitoring for Crypto Payment Gateways

Detect RPC provider drift, build a fan-in failover topology, set block-lag SLOs, and alert on reorgs before they cause missed or double-confirmed payments.

By Cipher · Founding engineer, BchainPay · 8 min read

Most crypto gateway outages don't announce themselves with a 500 from your API. They surface six hours later as a support ticket: "I paid but my order never fulfilled." The cause is almost always one of two things: an RPC provider serving stale state, or a block-lag gap during which the wrong confirmation count was used to advance a payment intent.

This post covers the observability stack we run at BchainPay to catch both problems before merchants notice: block-lag detection, fan-in provider topology, SLO histograms, and reorg detection — with enough code to implement it independently.

The failure mode nobody models

Most payment-gateway post-mortems assume the bug is in application code. It usually isn't. When you span multiple chains, the RPC providers between you and the chain are the weakest link — and they fail silently.

A failing RPC provider doesn't return a 500. It returns perfectly well-formed JSON with a blockNumber that is 30 blocks behind the network tip. Your confirmation logic counts confirmations against that stale view: a Polygon payment your worker sees at 12 confirmations is actually 42 blocks deep on the canonical chain, and every state transition you report runs over a minute behind reality.

Two distinct failure modes exist:

Block lag: eth_blockNumber (or equivalent) returns a height behind the real tip. Responses are structurally valid but stale. You're measuring confirmations against the wrong height.

State partition: the provider has synced a branch that was later orphaned. A transaction you've confirmed may not exist on the canonical chain. This is the reorganization scenario: rare but catastrophic.

Lag is caught by comparing tips across providers. Partitions require cross-checking block hashes at known heights.

Fan-in provider topology

The structural mitigation: never depend on a single RPC provider. We run three providers per chain — typically two commercial providers and one self-hosted node — behind a routing layer that always selects the freshest healthy provider for reads.

type RPCProvider = {
  name:     string;
  url:      string;
  tip:      number;   // latest block height seen
  lag:      number;   // blocks behind max(tip) across pool
  healthy:  boolean;
  recoveryCount: number;  // consecutive in-threshold cycles; gates recovery
};
 
class ChainWatcher {
  constructor(
    readonly chain:     string,
    readonly providers: RPCProvider[],
  ) {}
 
  // Returns the least-lagged healthy provider for reads
  selectProvider(): RPCProvider {
    const healthy = this.providers.filter(p => p.healthy);
    if (healthy.length === 0) {
      // Forced fallback: use least-lagged regardless
      return [...this.providers].sort((a, b) => a.lag - b.lag)[0];
    }
    return healthy.sort((a, b) => a.lag - b.lag)[0];
  }
}

Reads (block queries, log fetches, receipt lookups) always go to the freshest provider. Broadcasts go to all providers in parallel; the first non-error wins. A single broadcast failure doesn't stall the payment — the transaction propagates via p2p once any provider accepts it.
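
Broadcast fan-out fits in a few lines with Promise.any. A minimal sketch, assuming a hypothetical sendRawTransaction(url, rawTx) helper that submits the signed transaction to a single provider and resolves with the transaction hash:

// Resolve with the first provider that accepts the transaction; reject
// (with an AggregateError) only if every provider refuses it.
async function broadcastToAll(
  providers: RPCProvider[],
  rawTx: string,
): Promise<string> {
  return Promise.any(
    providers.map(p => sendRawTransaction(p.url, rawTx)),
  );
}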

The block-lag watchdog

Every five seconds a background task polls eth_blockNumber from all providers and recomputes relative lag:

const MAX_LAG_BLOCKS: Record<string, number> = {
  ethereum: 3,
  polygon:  10,
  bnb:      6,
  solana:   100,  // slots, not blocks; ~0.4 s per slot
  tron:     6,
};
 
const ALERT_LAG_BLOCKS: Record<string, number> = {
  ethereum: 10,
  polygon:  40,
  bnb:      20,
  solana:   400,
  tron:     20,
};
 
async function runLagWatchdog(watcher: ChainWatcher): Promise<void> {
  const results = await Promise.allSettled(
    watcher.providers.map(async (p) => ({
      provider: p,
      tip: await fetchBlockNumber(p.url),
    }))
  );
 
  const tips = results
    .filter((r): r is PromiseFulfilledResult<{ provider: RPCProvider; tip: number }> =>
      r.status === 'fulfilled')
    .map(r => r.value);
 
  if (tips.length === 0) return; // every provider errored this cycle; nothing to compare

  const maxTip = Math.max(...tips.map(t => t.tip));
 
  for (const { provider, tip } of tips) {
    provider.tip = tip;
    provider.lag = maxTip - tip;
 
    const threshold = MAX_LAG_BLOCKS[watcher.chain];
    if (provider.lag <= threshold) {
      // Recovery gating: require 3 consecutive healthy cycles
      provider.recoveryCount = (provider.recoveryCount ?? 0) + 1;
      if (!provider.healthy && provider.recoveryCount >= 3) {
        provider.healthy = true;
        provider.recoveryCount = 0;
        logger.info(`rpc:recovered chain=${watcher.chain} provider=${provider.name}`);
      }
    } else {
      provider.healthy = false;
      provider.recoveryCount = 0;
      if (provider.lag > ALERT_LAG_BLOCKS[watcher.chain]) {
        alerting.fire('rpc_provider_lagging', {
          chain: watcher.chain,
          provider: provider.name,
          lag: provider.lag,
        });
      }
    }
 
    metrics.rpcBlockLag
      .labels({ chain: watcher.chain, provider: provider.name })
      .set(provider.lag);
  }
}
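
The watchdog above assumes a fetchBlockNumber helper. A minimal sketch over raw JSON-RPC, assuming a runtime with a global fetch (Node 18+); on Solana the equivalent call is getSlot:

// Fetch the provider's latest block height via eth_blockNumber.
// The 3-second timeout is illustrative: a provider that answers slowly
// is nearly as useless to the watchdog as one that doesn't answer.
async function fetchBlockNumber(url: string): Promise<number> {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'eth_blockNumber', params: [] }),
    signal: AbortSignal.timeout(3_000),
  });
  if (!res.ok) throw new Error(`rpc ${url} returned HTTP ${res.status}`);
  const { result } = await res.json();
  return parseInt(result, 16); // result is a hex-encoded height
}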

The table below shows how MAX_LAG_BLOCKS maps to wall-clock staleness:

Chain      Block time   Max healthy lag   Wall-clock staleness
Ethereum   12 s         3 blocks          36 s
Polygon    2 s          10 blocks         20 s
BNB        3 s          6 blocks          18 s
Solana     0.4 s        100 slots         40 s
Tron       3 s          6 blocks          18 s

If a single provider lags while the others stay current, that provider is degraded. The all-providers case needs care: lag here is measured relative to the pool's own maximum tip, so a whole-pool stall (a network-egress problem on your side, for example) shows up not as lag but as a tip that stops advancing. Track time since the pool tip last advanced to catch that case.

Recovery is gated at three consecutive healthy cycles. A provider that oscillates around the threshold would otherwise flip the routing table on every five-second poll; gating prevents that flapping.

Confirmation-time SLOs

Block lag drives confirmation latency. If you promise merchants that a USDC.polygon payment reaches confirmed within 60 seconds of the on-chain transfer, and your RPC provider is 40 blocks behind, the staleness alone adds 80 seconds of latency: you will blow the SLA even though the chain itself is healthy.

Track this as a histogram, recorded when a payment intent transitions from pending_confirmation to confirmed:

// In your on-chain indexer, after marking an intent confirmed
const latencyMs = Date.now() - intent.detectedAt;
 
metrics.confirmationSeconds
  .labels({ chain: intent.chain, tier: amountTier(intent.amountUsd) })
  .observe(latencyMs / 1000);
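
The metric itself is a standard prom-client Histogram (hanging off the shared metrics object in the snippet above). A sketch of the registration and a hypothetical amountTier helper; the bucket boundaries and tier cutoffs here are illustrative:

import { Histogram } from 'prom-client';

// Align bucket boundaries with the SLA windows you report against.
export const confirmationSeconds = new Histogram({
  name: 'bchainpay_payment_confirmation_seconds',
  help: 'Seconds from on-chain detection to confirmed',
  labelNames: ['chain', 'tier'] as const,
  buckets: [15, 30, 60, 120, 300],
});

// Larger payments wait for more confirmations, so each amount tier
// gets its own latency distribution.
export function amountTier(amountUsd: number): string {
  if (amountUsd < 1_000) return 'sub_1k';
  if (amountUsd < 10_000) return 'sub_10k';
  return 'large';
}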

The resulting Prometheus output gives you per-chain, per-tier percentiles:

# HELP bchainpay_payment_confirmation_seconds Seconds from on-chain detection to confirmed
# TYPE bchainpay_payment_confirmation_seconds histogram
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="30"} 4102
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="60"} 4980
bchainpay_payment_confirmation_seconds_bucket{chain="polygon",tier="sub_1k",le="120"} 4999
bchainpay_payment_confirmation_seconds_sum{chain="polygon",tier="sub_1k"} 142371
bchainpay_payment_confirmation_seconds_count{chain="polygon",tier="sub_1k"} 5001

When the P95 breaches your SLA window, check block lag first. Nine times in ten, that's the cause.

Reorg detection

A reorg is rarer but more dangerous. The guard: at every watchdog cycle, fetch the block hash at a checkpoint five blocks behind tip and compare it against the hash stored at that height in the previous cycle.

async function checkForReorg(
  provider: RPCProvider,
  chain:    string,
  depth = 5,
): Promise<void> {
  const checkHeight = provider.tip - depth;
  const observedHash = await fetchBlockHash(provider.url, checkHeight);
 
  const storedHash = await db.getCheckpointHash(chain, checkHeight);
  if (!storedHash) {
    await db.saveCheckpointHash(chain, checkHeight, observedHash);
    return;
  }
 
  if (storedHash !== observedHash) {
    metrics.reorgsDetected.labels({ chain }).inc();
    alerting.fire('reorg_detected', {
      chain,
      height: checkHeight,
      stored:   storedHash,
      observed: observedHash,
    });
    await triggerReorgScan(chain, checkHeight);
    // Adopt the observed hash so the alert fires once per reorg, not every cycle
    await db.saveCheckpointHash(chain, checkHeight, observedHash);
  }
}
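
fetchBlockHash is the sibling of the watchdog's fetchBlockNumber, a sketch via eth_getBlockByNumber (again assuming a global fetch):

// Resolve the hash of the block at `height`. The `false` parameter
// requests the header only, without full transaction bodies.
async function fetchBlockHash(url: string, height: number): Promise<string> {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBlockByNumber',
      params: ['0x' + height.toString(16), false],
    }),
  });
  const { result } = await res.json();
  if (!result) throw new Error(`${url} has no block at height ${height}`);
  return result.hash;
}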

triggerReorgScan re-fetches every payment intent whose last-known block height is >= checkHeight, rechecks each transaction receipt against the current canonical chain, and moves any orphaned intents back to awaiting_payment. The payment_intent.investigating webhook fires so merchants know their fulfillment logic should pause; no funds are released until the intent re-confirms on the canonical chain.
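
A sketch of that scan, with hypothetical db, webhooks, and fetchReceipt helpers standing in for the real persistence and delivery layers:

// Re-validate every intent a reorg at or above `fromHeight` could touch.
async function triggerReorgScan(chain: string, fromHeight: number): Promise<void> {
  const intents = await db.getIntentsAtOrAboveHeight(chain, fromHeight);
  for (const intent of intents) {
    const receipt = await fetchReceipt(chain, intent.txHash);
    if (!receipt) {
      // The transaction is gone from the canonical chain: roll the intent
      // back and tell the merchant to pause fulfillment.
      await db.updateIntentStatus(intent.id, 'awaiting_payment');
      await webhooks.send(intent.merchantId, 'payment_intent.investigating', {
        intentId: intent.id,
        chain,
        reorgHeight: fromHeight,
      });
    }
  }
}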

This avoids a refund-after-the-fact scenario. Detecting the reorg and pausing is cheap; discovering it after fulfillment is not.

Prometheus alert rules

groups:
  - name: bchainpay.rpc
    rules:
      - alert: RPCProviderLagging
        # Flat threshold shown for brevity; production rules should mirror
        # the per-chain ALERT_LAG_BLOCKS values (e.g. 400 for Solana slots).
        expr: bchainpay_rpc_block_lag > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: >
            {{ $labels.provider }} on {{ $labels.chain }}
            is {{ $value }} blocks behind
 
      - alert: AllRPCProvidersDegraded
        expr: min by(chain) (bchainpay_rpc_block_lag) > 10
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: All RPC providers for {{ $labels.chain }} are lagging
 
      - alert: ConfirmationSLOBreach
        expr: |
          histogram_quantile(0.95,
            rate(bchainpay_payment_confirmation_seconds_bucket[10m])
          ) > 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: >
            P95 confirmation latency on {{ $labels.chain }}
            exceeds the 2-minute SLA
 
      - alert: ReorgDetected
        expr: increase(bchainpay_reorg_detected_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: Chain reorg detected on {{ $labels.chain }}

Route AllRPCProvidersDegraded and ReorgDetected to PagerDuty or equivalent — they warrant immediate human response. RPCProviderLagging can go to Slack; it's low signal when only one provider is affected and the pool reroutes automatically.
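
On the Alertmanager side that routing is a short tree. A sketch with placeholder receiver names; wire them to your own PagerDuty and Slack integrations:

route:
  receiver: slack-oncall            # default: low-urgency alerts go to Slack
  routes:
    - matchers:
        - alertname =~ "AllRPCProvidersDegraded|ReorgDetected"
      receiver: pagerduty-critical  # page a human immediately
    - matchers:
        - alertname = "RPCProviderLagging"
      receiver: slack-oncall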

The ConfirmationSLOBreach alert is what wakes up the on-call engineer most often. Almost every breach traces back to either block lag or a confirmation policy that was misconfigured for a new chain. Having the P95 histogram makes that root cause obvious in the first dashboard glance.

Key takeaways

  • A structurally valid stale response is more dangerous than an error. Your error-handling code never fires; your stale-data code never existed.
  • Run three providers per chain. Two commercial providers (pick two of Alchemy, Infura, QuickNode) and one self-hosted node give you uncorrelated failure modes. Self-hosted nodes are slower but immune to commercial-provider incidents.
  • Gate recovery at three consecutive healthy cycles. Immediate re-admission causes flapping that destabilizes your routing table without actually improving availability.
  • Checkpoint block hashes, not just block numbers. A reorged chain can appear at the same tip height but with a different canonical history. Only hash comparison catches that.
  • Map lag headroom to your SLA window. If your confirmation SLA is 60 seconds and block time is 2 seconds, a 30-block lag burns your entire SLA budget before a single confirmation is counted. Keep lag thresholds well below (SLA_seconds / block_time) - required_confirmations, measured in blocks; the worked example after this list makes the arithmetic concrete.
  • Gate fund release on your chain watcher health. If the watchdog marks a chain degraded, halt confirmations on that chain entirely — never work around degraded state to stay online. A paused payment can be explained; a double-fulfillment caused by stale confirmation cannot.
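
For concreteness, here is that headroom arithmetic for the Polygon numbers used throughout this post (60-second SLA, 2-second blocks, 12 required confirmations); the constants are illustrative, not configuration:

// Lag-headroom check: how much provider lag can the pool absorb while
// still confirming a 12-confirmation payment inside a 60 s SLA?
const slaSeconds = 60;
const blockTimeSeconds = 2;
const requiredConfirmations = 12;

// Budget in blocks, minus the blocks the confirmations themselves take,
// is the most lag we can tolerate.
const maxTolerableLagBlocks =
  slaSeconds / blockTimeSeconds - requiredConfirmations; // 30 - 12 = 18

// MAX_LAG_BLOCKS.polygon = 10 sits under that ceiling; the 40-block
// alert threshold, by contrast, already implies a blown SLA.
console.log(maxTolerableLagBlocks);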
