
October 30, 2025 · 14 min read

Production-grade observability for Python: structured logging, RED/USE metrics, and distributed tracing with OpenTelemetry, Prometheus, Tempo, and Grafana. Copy-ready patterns and pitfalls.

Observability for Python Apps: Logging, Metrics, Tracing with OpenTelemetry

Estimated reading time: 14 min · Published Oct 31, 2025

Monitoring asks “is it up?” Observability asks “why is it slow?” This guide shows how to add structured logging, RED/USE metrics, and distributed tracing to Python services using OpenTelemetry, Prometheus, Tempo, and Grafana.

1) Architecture at a Glance

  • App (FastAPI/Django + OTel SDK) → OTel Collector
  • Logs → Loki · Metrics → Prometheus · Traces → Tempo
  • Dashboards/Alerts → Grafana
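
A minimal way to wire this stack up locally is a compose file along these lines. This is a sketch, not a production deployment: image tags, port mappings, and the Collector config path are illustrative assumptions.

```yaml
# docker-compose.yml (sketch; service names match the diagram above)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4318:4318"   # OTLP/HTTP from the app
      - "9464:9464"   # Prometheus scrape endpoint exposed by the Collector
  tempo:
    image: grafana/tempo:latest
  loki:
    image: grafana/loki:latest
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```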

2) Structured Logging


# logging_setup.py
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""
    def format(self, record):
        base = {
            "ts": record.created,  # the record's own timestamp, not time-of-format
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            base["exc"] = self.formatException(record.exc_info)
        return json.dumps(base)

def configure_json_logging(level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers.clear()  # avoid duplicate output if called twice
    root.addHandler(handler)
    root.setLevel(level)

Emit JSON to stdout; collectors parse and route without regex gymnastics.
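As a quick sanity check, the formatter round-trips through `json.loads` exactly the way a collector would parse it. This self-contained sketch repeats the formatter's core so it runs on its own:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # same shape as logging_setup.JsonFormatter above
    def format(self, record):
        return json.dumps({
            "ts": record.created,
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
        })

record = logging.LogRecord("app", logging.INFO, "demo.py", 1,
                           "user %s logged in", ("alice",), None)
line = JsonFormatter().format(record)
parsed = json.loads(line)   # collectors do exactly this parse
print(parsed["msg"])        # → user alice logged in
```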

3) Metrics with Prometheus Client


import time
from prometheus_client import Counter, Histogram, Gauge

REQS = Counter("http_requests_total", "Total HTTP requests", ["route", "method", "code"])
LAT = Histogram("http_request_duration_seconds", "Latency", ["route", "method"],
                buckets=[.05, .1, .2, .5, 1, 2, 5])
INFLIGHT = Gauge("http_inflight_requests", "Active requests")

def before_request():
    INFLIGHT.inc()
    return time.perf_counter()  # monotonic start time

def after_request(route, method, code, start):
    # Histogram.time() returns a context manager/decorator, not a timer object,
    # so observe the elapsed time explicitly.
    LAT.labels(route, method).observe(time.perf_counter() - start)
    INFLIGHT.dec()
    REQS.labels(route, method, code).inc()

Expose /metrics via WSGI/ASGI middleware and let Prometheus scrape it.
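For WSGI apps, `prometheus_client.make_wsgi_app` gives you the /metrics endpoint, and a small middleware can do the timing. The metric names below are illustrative; the sketch assumes `start_response` is called before the app returns (true for typical framework responses, not for streaming generators):

```python
import time
from prometheus_client import Counter, Histogram, make_wsgi_app

# Hypothetical metric names; align them with your naming scheme.
REQS = Counter("app_http_requests_total", "Total HTTP requests", ["route", "method", "code"])
LAT = Histogram("app_http_request_duration_seconds", "Latency", ["route", "method"])

metrics_app = make_wsgi_app()  # mount this under /metrics in your server

class MetricsMiddleware:
    """WSGI middleware that times each request and counts it by status code."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        route = environ.get("PATH_INFO", "/")
        method = environ.get("REQUEST_METHOD", "GET")
        start = time.perf_counter()
        status_holder = {}

        def capturing_start_response(status, headers, exc_info=None):
            status_holder["code"] = status.split(" ", 1)[0]  # "200 OK" -> "200"
            return start_response(status, headers, exc_info)

        result = self.app(environ, capturing_start_response)
        LAT.labels(route, method).observe(time.perf_counter() - start)
        REQS.labels(route, method, status_holder.get("code", "500")).inc()
        return result
```

In production you would also normalize `route` to the route template (e.g. `/users/{id}`), not the raw path, to keep label cardinality bounded.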

4) Distributed Tracing with OpenTelemetry


# otel_setup.py
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

def init_tracing(service_name="api"):
    provider = TracerProvider(resource=Resource.create({SERVICE_NAME: service_name}))
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces"))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)
      

FastAPI integration


from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from logging_setup import configure_json_logging
from otel_setup import init_tracing

app = FastAPI()
configure_json_logging()
tracer = init_tracing("redesign-api")
FastAPIInstrumentor.instrument_app(app)

@app.get("/health")
def health():
    return {"ok": True}
      

5) OpenTelemetry Collector Config (traces & metrics pipelines)


receivers:
  otlp:
    protocols:
      http:
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:9464
processors:
  batch: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
      

6) Dashboards & Alerting

  • RED (Rate, Errors, Duration) for user-facing endpoints.
  • USE (Utilization, Saturation, Errors) for system resources.
  • Set SLOs (e.g., “p95 < 300ms”), alert on burn rate, not single breaches.
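
A burn-rate alert in Prometheus rule syntax might look like the following. The 99.9% SLO, the 14.4x factor, and the 5m/1h window pair are illustrative values from the common multiwindow pattern; tune them to your error budget:

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        # Fires only when both a short and a long window burn the
        # error budget ~14.4x faster than sustainable for a 99.9% SLO.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
```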

7) Cost & Cardinality Control

  • Use exemplars to link metrics to traces; reduce high-cardinality labels.
  • Hash or truncate IDs in logs to avoid PII and cardinality explosions.
  • Enable tail-based sampling in Collector for “errors-first” traces.
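
In the Collector (contrib build), errors-first tail sampling looks roughly like this; the `decision_wait` and the 10% baseline are illustrative defaults, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans per trace before deciding
    policies:
      - name: keep-errors       # always keep traces containing an error
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: baseline          # plus a probabilistic sample of the rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```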

8) Common Pitfalls

  • Unstructured logs → impossible correlation.
  • Too many metrics → high scrape/TSDB cost; start with RED/USE.
  • Ignoring propagation headers → broken traces across services.

“See the system. Hear its signals. Then design with empathy.” — redesign.ir

Tip: Correlate everything via trace_id/span_id: add them to log records using a logging filter for one-click jumps from logs to trace.


© 2025 redesign.ir · Crafted by SCRIBE/CORE · “Illuminate through information.”
