Downtime

Dashboard - B2C [Backend] and Returns [Backend] are down

May 4, 2026 at 11:08am UTC
Affected services
Dashboard - B2C [Backend] and Returns [Backend]

Resolved
May 5, 2026 at 6:02pm UTC

Root Cause Analysis

On 4 May 2026, customers were unable to access dashboard.clickpost.in from 4:32 PM to 5:00 PM because its backend server was down. Every incoming request hit stack overflows and elevated latency, leaving the dashboard effectively unusable until the change was reverted.

The trigger was a change made the same day as part of productionizing eBPF-based GC tracing metrics for Python. To correlate GC pressure with API latency on a per-host basis, we needed the host IP stamped on every span so we could pivot from "this endpoint is slow" to "this machine is in GC" in our traces.

To do this, we updated SystemInfoSpanProcessor.on_start to set machine.ip from a new MachineUtils.get_host_ip() helper. The helper first tries the EC2 metadata service (http://169.254.169.254/latest/meta-data/local-ipv4) and falls back to socket.gethostbyname.

What went wrong:
on_start runs on every span creation. The new get_host_ip() calls urllib.request.urlopen, which is auto-instrumented by OpenTelemetry. The instrumented urlopen starts a new span, which re-enters on_start, which calls get_host_ip() again, producing infinite recursion:

on_start → get_host_ip → urlopen (instrumented) → start_as_current_span
→ on_start → get_host_ip → urlopen → …

This fired on every request. Symptoms: stack overflows, elevated latency, and a flood of requests to the IMDS endpoint.
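The loop can be reproduced without any tracing library. The sketch below is a toy model only: ToyTracer, fake_instrumented_urlopen and HostIpProcessor are hypothetical stand-ins for the tracer, the auto-instrumented urlopen and the span processor, and no real network call is made.

```python
# Toy model of the failure mode; no OpenTelemetry import is needed to run it.
# ToyTracer, fake_instrumented_urlopen and HostIpProcessor are hypothetical names.


class ToyTracer:
    """Stands in for a tracer that notifies span processors on span start."""

    def __init__(self):
        self.processors = []
        self.spans_started = 0

    def start_span(self, name):
        self.spans_started += 1
        span = {"name": name, "attributes": {}}
        for processor in self.processors:
            processor.on_start(span)
        return span


tracer = ToyTracer()


def fake_instrumented_urlopen(url):
    """Stands in for an auto-instrumented urlopen: it opens a span per call."""
    tracer.start_span("HTTP GET")
    return "10.0.0.1"  # placeholder value, not a real lookup


class HostIpProcessor:
    """Mirrors the faulty change: a traced network call inside on_start."""

    def on_start(self, span):
        span["attributes"]["machine.ip"] = fake_instrumented_urlopen(
            "http://169.254.169.254/latest/meta-data/local-ipv4"
        )


tracer.processors.append(HostIpProcessor())


def handle_request():
    """Returns True if starting one request span overflowed the stack."""
    try:
        tracer.start_span("GET /orders")
    except RecursionError:
        return True
    return False
```

Calling handle_request() once is enough to exhaust the interpreter's recursion limit, because each span start triggers another span start before the first one returns.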

Resolution:
Change reverted. Service recovered.

Root cause:
Code on the instrumentation hot path (on_start) made a network call through a library (urllib) that is itself auto-instrumented, with no caching and no instrumentation-suppression context.

Action items:
Cache the host IP once at process start. It does not change for the lifetime of the process; there's no reason to look it up per span.
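One way this action item could look, as a sketch under assumptions rather than the actual fix: the processor resolves the IP once in its constructor and on_start only copies the stored value. HostIpSpanProcessor, resolve_host_ip and the 0.2-second timeout are hypothetical; the lookup order (metadata service, then hostname resolution) follows the description above.

```python
import socket
import urllib.request

IMDS_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def resolve_host_ip(timeout_seconds=0.2):
    """One-off lookup: metadata service first, then hostname resolution.

    The timeout value is an assumption chosen for illustration.
    """
    try:
        with urllib.request.urlopen(IMDS_URL, timeout=timeout_seconds) as response:
            return response.read().decode("ascii").strip()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return socket.gethostbyname(socket.gethostname())


class HostIpSpanProcessor:
    """Hypothetical processor that stamps a pre-resolved IP on each span."""

    def __init__(self, resolver=resolve_host_ip):
        # Resolve exactly once, when the processor is built at process start.
        self._host_ip = resolver()

    def on_start(self, span, parent_context=None):
        # Hot path: no I/O here, only an attribute write.
        span.set_attribute("machine.ip", self._host_ip)
```

Because the resolver is injected, tests can pass a stub that returns a fixed address instead of touching the network.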

Updated
May 4, 2026 at 11:44am UTC

Dashboard - B2C [Backend] and Returns [Backend] recovered.

Updated
May 4, 2026 at 11:23am UTC

Dashboard - B2C [Backend] and Returns [Backend] went down.

Updated
May 4, 2026 at 11:21am UTC

Dashboard - B2C [Backend] and Returns [Backend] recovered.

Created
May 4, 2026 at 11:08am UTC

Dashboard - B2C [Backend] and Returns [Backend] went down.