© 2025 Akhil Reddy. All rights reserved.


    From Silos to Signals: Unifying Logs, Metrics & Traces with OpenTelemetry

    Stop jumping between dashboards. Discover how we cut our Mean Time to Resolution (MTTR) by 35% by correlating logs, metrics, and traces into a single, unified view using OpenTelemetry and Grafana.

    August 5, 2025
    Observability · OpenTelemetry · Grafana · Loki · Tempo

    The Swivel Chair Investigation

    An alert fires: "HTTP 500 errors are spiking!" The on-call engineer opens the metrics dashboard and confirms the spike. Then they swivel their chair to the logging platform, searching for errors around that timestamp. They find an error message but lack context. So they swivel again to the tracing tool, trying to find a trace that matches the error. This painful, manual correlation across siloed data sources was our biggest bottleneck in resolving incidents.

    We weren't just collecting data; we were collecting disconnected data. We needed to break down these silos.

    The Three Pillars of Observability, Connected

    The solution is to treat logs, metrics, and traces not as separate entities, but as three views of the same event. This is made possible by OpenTelemetry (OTel), an open-source standard for instrumenting applications, and a backend stack like Grafana that can link them together.

    The key is a single identifier: the traceId.

    Here’s how the data flows and connects:

    graph TD
        subgraph Application
            A[User Request] --> B(Node.js App);
            B -- OTel SDK --> C{Generate Trace};
            C -- Injects traceId --> D[Logs];
            C -- Records --> E[Metrics];
            C -- Exports --> F[Traces];
        end
    
        subgraph "Observability Platform"
            G[OTel Collector] --> H[Grafana Loki for Logs];
            G --> I[Grafana Mimir/Prom for Metrics];
            G --> J[Grafana Tempo for Traces];
            H -- traceId --> K(Unified Grafana UI);
            I -- traceId --> K;
            J -- traceId --> K;
        end
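    To make the collector side of this diagram concrete, here is a minimal sketch of an OpenTelemetry Collector pipeline. The hostnames and ports (tempo, loki, mimir) are illustrative assumptions, not an exact production config:

```yaml
# Sketch: OTel Collector receives OTLP and fans out to the Grafana stack.
# Hostnames and ports below are illustrative assumptions.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/tempo]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```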
    

    How We Did It: Auto-Instrumentation Magic

    Getting started with OpenTelemetry is surprisingly easy, thanks to auto-instrumentation libraries. For our Node.js services, we just needed to initialize the SDK. It handles the heavy lifting of creating spans for incoming requests and outgoing calls.

    // tracer.js
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    
    // Configure the OTLP exporter to send traces to our OTel Collector
    const traceExporter = new OTLPTraceExporter({
      url: 'http://otel-collector:4318/v1/traces',
    });
    
    const sdk = new NodeSDK({
      traceExporter,
      // This automatically instruments popular libraries like Express, http, pg, etc.
      instrumentations: [getNodeAutoInstrumentations()],
    });
    
    sdk.start();
    

    To run our app, we simply preload this file: node -r ./tracer.js my-app.js
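    Rather than hard-coding the exporter URL in tracer.js, the same settings can come from the standard OpenTelemetry environment variables, which the SDK reads at startup. A quick sketch (the service name here is a made-up example):

```shell
# Standard OTel environment variables, picked up by the SDK at startup.
# "checkout-service" is a hypothetical name for illustration.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_SERVICE_NAME="checkout-service"

# Then preload the tracer before the app's own code runs:
# node -r ./tracer.js my-app.js
```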

    Now, every log message produced during a request will automatically have the traceId injected. When our logger sends this to Loki, Grafana can use that ID to create a direct link to the full trace in Tempo.
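    Conceptually, an injected log line looks like the output below. This is a self-contained sketch; in a real service the ID would come from the active span via the @opentelemetry/api package (trace.getActiveSpan()), not be passed by hand:

```javascript
// Sketch: a structured log line carrying the traceId.
// In production the id comes from the OTel API; here it is passed
// explicitly so the example runs stand-alone.
function formatLog(level, message, traceId) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    message,
    traceId, // the field Grafana matches to link this line to Tempo
  });
}

const line = formatLog(
  'error',
  'payment lookup failed',
  '4bf92f3577b34da6a3ce929d0e0e4736'
);
console.log(line);
```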

    The "Aha!" Moment in Grafana

    This is where the magic happens. An engineer sees an error log line in Grafana. Instead of starting a manual investigation, they see a button: "View Trace."

    One click takes them from the specific error message to a complete flame graph in Tempo showing the entire lifecycle of the request across multiple microservices. They can instantly see the failing service, the latency of each step, and the exact context of the error. The investigation is over in minutes, not hours.
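    The "View Trace" button is powered by a derived field on the Loki datasource, which extracts the traceId from each log line and turns it into a link to Tempo. A sketch of the Grafana provisioning (the tempo datasource UID and the regex are assumptions matching a JSON log format):

```yaml
# Sketch: Grafana datasource provisioning with a derived field that
# extracts traceId from log lines and links it to the Tempo datasource.
# The "tempo" UID and the matcherRegex are illustrative assumptions.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
```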

    The Impact: Faster, Smarter Debugging

    • 35% Reduction in MTTR: By eliminating the swivel-chair investigation, we drastically cut down the time it takes to identify the root cause of an issue.
    • Proactive Performance Tuning: Developers can now easily spot performance bottlenecks in traces before they become production incidents.
    • A Single Pane of Glass: No more context switching. Everything needed to understand system behavior is in one place.

    Unifying our telemetry wasn't just about adding another tool; it was about fundamentally changing how we approach debugging and system analysis.