You’re running an LLM backend. Maybe it’s Ollama on your workstation, maybe it’s GPT-4o behind an API key, maybe it’s vLLM on a GPU cluster. You’ve got a FastAPI proxy in front of it. Requests go in, tokens come out. But when something goes wrong — and it will — you’re blind.
OpenTelemetry has published GenAI semantic conventions that define exactly how LLM telemetry should look. `gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, `gen_ai.server.time_to_first_token` — the spec is thorough. But the spec doesn’t write your instrumentation code. You’re left answering questions like:
- How do I set up histogram buckets that match the GenAI semconv?
- What temporality does Dynatrace need — DELTA or CUMULATIVE?
- How do I compute Time-to-First-Token and Time-per-Output-Token for streaming?
- How do I switch from Ollama to OpenAI without rewriting my instrumentation?
- How do I correlate traces across a multi-service LLM pipeline?
I built llm-otel-kit to answer all of them. One pip install, six lines of setup, and your LLM backend emits production-grade OTel telemetry — traces, metrics, and logs — for any of 11 supported providers. No hand-written OTel boilerplate in your application.
Difficulty: Intermediate
Skill Level: Familiarity with FastAPI, OpenTelemetry concepts (traces, metrics), and at least one LLM API
Time to Integrate: 15 minutes
Ecosystem: PyPI
What It Does
llm-otel-kit gives you three pillars of observability, all following OTel GenAI semantic conventions:
Traces
Every LLM call becomes a span with full gen_ai.* attributes:
| Attribute | Example Value |
|---|---|
| `gen_ai.system` | `ollama`, `openai`, `anthropic` |
| `gen_ai.request.model` | `gemma4:26b` |
| `gen_ai.usage.input_tokens` | `142` |
| `gen_ai.usage.output_tokens` | `87` |
| `gen_ai.request.temperature` | `0.7` |
| `llm.is_streaming` | `true` |
| `llm.request.purpose` | `User Chat`, `Title Generation` |
| `conversation.fingerprint` | `a3f2c1...` (SHA-256 hash) |
| `enduser.id` | `7b2e...` (SHA-256 hash — PII-safe) |
User messages and assistant responses are attached as span events. Prompts are indexed (`gen_ai.prompt.0.content`) for searchability.
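The `enduser.id` and `conversation.fingerprint` values above are SHA-256 digests rather than raw identifiers. A minimal sketch of that kind of PII-safe hashing (illustrative; the package's actual helper name and salting scheme may differ):

```python
import hashlib

def pii_safe_hash(value: str) -> str:
    # One-way SHA-256 digest: stable enough for grouping and segmentation,
    # but the original identifier cannot be recovered from the attribute.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Usage on a span (hypothetical call site):
#   span.set_attribute("enduser.id", pii_safe_hash(user_id))
```

Because the digest is deterministic, the same user always maps to the same `enduser.id`, so per-user dashboards still work without storing PII in your telemetry backend.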
Metrics — 10 Instruments
| Instrument | Name | Type | Unit |
|---|---|---|---|
| Operation Duration | `gen_ai.client.operation.duration` | Histogram | `s` |
| Token Usage | `gen_ai.client.token.usage` | Histogram | `{token}` |
| Time to First Token | `gen_ai.server.time_to_first_token` | Histogram | `s` |
| Time per Output Token | `gen_ai.server.time_per_output_token` | Histogram | `s` |
| Request Count | `llm.request.count` | Counter | `1` |
| Error Count | `llm.request.errors` | Counter | `1` |
| Active Requests | `llm.request.active` | UpDownCounter | `1` |
| Stream Chunks | `llm.stream.chunks` | Counter | `1` |
| Token Throughput | `llm.token.throughput` | Histogram | `{token}/s` |
| Message Count | `llm.request.message_count` | Histogram | `1` |
Histograms use GenAI semconv bucket boundaries — not OTel defaults. Duration buckets go from 10ms to 82s. Token buckets use powers of 4 from 1 to 67M. TTFT and TPOT each have their own precision-optimized boundaries.
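Written out, those boundaries are just two small lists; a sketch (the values follow the GenAI semconv advisory buckets the article cites, and the `View` wiring in the comment is illustrative, not the package's exact code):

```python
# Duration buckets (seconds): doubling from 10 ms to ~82 s, as advised by
# the GenAI semconv for gen_ai.client.operation.duration.
DURATION_BUCKETS = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28,
                    2.56, 5.12, 10.24, 20.48, 40.96, 81.92]

# Token buckets: powers of 4 from 1 up to 4**13 (~67M tokens).
TOKEN_BUCKETS = [4 ** n for n in range(14)]

# These would feed an explicit-bucket view on the MeterProvider, roughly:
#   View(instrument_name="gen_ai.client.token.usage",
#        aggregation=ExplicitBucketHistogramAggregation(TOKEN_BUCKETS))
```

The point of custom buckets is resolution where LLM workloads actually live: OTel's default duration buckets top out far too low for multi-second generations, and token counts span six orders of magnitude.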
Logs
Structured log records exported via OTLP with contextual fields:
model=gemma4:latest duration_s=2.341 prompt_tokens=142 completion_tokens=87
Every log is correlated to a trace via trace_id.
Architecture
```
┌────────────────────────────────────────────────────────┐
│ Your FastAPI App (main.py)                             │
│                                                        │
│ from llm_otel_kit import init_observability,           │
│                           create_provider, GenAIMetrics│
│                                                        │
│ ┌─────────────┐     ┌──────────────┐                   │
│ │  AppConfig  │────▶│ LLMProvider  │──── HTTP ───┐     │
│ │ .from_env() │     │ .complete()  │             │     │
│ └──────┬──────┘     │  .stream()   │             │     │
│        │            └──────────────┘             │     │
│ ┌──────▼─────────┐                               │     │
│ │ OTel Bootstrap │  ┌──────────────┐             │     │
│ │ ┌─ Metrics ──┐ │  │ GenAI Spans  │             │     │
│ │ │ 10 instrum.│ │  │   + Events   │             │     │
│ │ └────────────┘ │  └──────────────┘             │     │
│ │ ┌─ Logs ─────┐ │                               │     │
│ │ │ OTLP export│ │                               │     │
│ │ └────────────┘ │                               │     │
│ │ ┌─ Tracing ──┐ │                               │     │
│ │ │ Traceloop  │ │                               │     │
│ │ └────────────┘ │                               │     │
│ └────────────────┘                               │     │
└────────┬─────────────────────────────────────────┬─────┘
         │ OTLP/HTTP                               │
         ▼                                         ▼
  ┌──────────────┐                        ┌────────────────┐
  │  Dynatrace   │                        │  LLM Backend   │
  │  Jaeger      │                        │  (Ollama,      │
  │  Grafana     │                        │   OpenAI,      │
  │  any OTLP    │                        │   Anthropic,   │
  │  backend     │                        │   vLLM, ...)   │
  └──────────────┘                        └────────────────┘
```
The data flow:
- Config → `AppConfig.from_env()` reads `LLM_PROVIDER`, `LLM_BASE_URL`, and OTLP settings from environment variables
- Bootstrap → `init_observability()` sets up MeterProvider (with correct temporality), LoggerProvider, and Traceloop SDK — in that exact order (critical for Dynatrace)
- Provider → `create_provider()` returns the right `LLMProvider` subclass based on config
- Request flow → your endpoint calls `provider.complete()` or `provider.stream()`; the package records spans, metrics, and logs automatically via helper functions
- Export → everything flows via OTLP/HTTP to your observability backend
The package intercepts nothing magically — it provides explicit helpers you call in your route handlers. You control what gets instrumented.
How to Use It
Install
pip install llm-otel-kit
For Anthropic Claude support:
pip install llm-otel-kit[anthropic]
Peer dependencies (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http, traceloop-sdk, httpx) are installed automatically.
Configure
Set these environment variables:
```bash
# Required — pick your provider
LLM_PROVIDER=ollama                  # or: openai, anthropic, vllm, groq, together, ...
LLM_BASE_URL=http://localhost:11434  # provider API base URL
LLM_API_KEY=                         # for cloud providers (leave empty for local)
DEFAULT_MODEL=gemma4:26b             # fallback model name
APP_NAME=my-llm-backend              # OTel service name

# Required for telemetry export
TRACELOOP_BASE_URL=https://your-env.apps.dynatrace.com/api/v2/otlp
DT_OTLP_TOKEN=dt0c01.YOUR_TOKEN_HERE
```
That’s it. No YAML files, no collector config, no SDK plumbing.
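If you're curious what `from_env()` amounts to, it is plain environment parsing with local-friendly defaults; a rough sketch (the class and field names here are assumptions for illustration, not the package's actual schema):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvConfig:
    provider: str
    base_url: str
    api_key: str
    default_model: str
    app_name: str

    @classmethod
    def from_env(cls) -> "EnvConfig":
        # Unset optional variables fall back to sensible local defaults,
        # so a bare Ollama setup needs almost no configuration.
        return cls(
            provider=os.getenv("LLM_PROVIDER", "ollama"),
            base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434"),
            api_key=os.getenv("LLM_API_KEY", ""),
            default_model=os.getenv("DEFAULT_MODEL", ""),
            app_name=os.getenv("APP_NAME", "llm-backend"),
        )
```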
Wire It Up
Here’s the complete integration — this is real code from a production FastAPI backend:
```python
import time

import httpx
from fastapi import FastAPI
from opentelemetry import trace
from traceloop.sdk.decorators import workflow

# ── 6 lines of setup ──────────────────────────────────────
from llm_otel_kit import (
    AppConfig, GenAIMetrics, create_provider,
    init_observability, record_metrics,
    set_genai_span, set_genai_response, classify_request,
)
from llm_otel_kit.spans import semconv_attrs

config = AppConfig.from_env()                 # 1. Read config from env
otel = init_observability(config.app_name,    # 2. Bootstrap OTel
                          config.otlp_endpoint, config.otlp_token)
provider = create_provider(config.provider)   # 3. Create provider
m = GenAIMetrics(otel.meter)                  # 4. Create metric instruments
# ───────────────────────────────────────────────────────────

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat(request: dict):
    model = request.get("model") or config.provider.default_model
    messages = request["messages"]

    # Track the request
    m.request_count.add(1, {"model": model})
    attrs = semconv_attrs(model, provider.host, provider.port)

    # Build provider-native payload
    payload = provider.build_payload(
        model=model, messages=messages, stream=False,
        temperature=request.get("temperature", 0.7),
    )

    # Call the LLM with full span instrumentation
    span = trace.get_current_span()
    set_genai_span(span, model, classify_request(messages),
                   False, messages, provider.host, provider.port)

    start = time.time()
    async with httpx.AsyncClient(timeout=300.0) as client:
        result = await provider.complete(client, payload)  # ← provider-agnostic call

    # Record response telemetry
    set_genai_response(span, result.content, model,
                       result.prompt_tokens, result.completion_tokens)
    record_metrics(m, attrs, model, time.time() - start,
                   result.prompt_tokens, result.completion_tokens,
                   result.timing.ttft, result.timing.tpot)

    return {"choices": [{"message": {"content": result.content}}],
            "usage": {"prompt_tokens": result.prompt_tokens,
                      "completion_tokens": result.completion_tokens}}
```
The critical line is `provider.complete(client, payload)` — it translates to the correct API format (Ollama’s `/api/chat`, OpenAI’s `/v1/chat/completions`, or Anthropic’s `/v1/messages`) and normalizes the response into a `CompletionResult` with timing info extracted.
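To make that translation concrete, here is a hedged sketch of what the per-provider payload step boils down to (illustrative only, not the package's code): Ollama's native `/api/chat` nests sampling parameters under an `options` object, while OpenAI-style endpoints take them top-level.

```python
def ollama_payload(model: str, messages: list, stream: bool,
                   temperature: float) -> dict:
    # Ollama native /api/chat: sampling params live under "options"
    return {"model": model, "messages": messages, "stream": stream,
            "options": {"temperature": temperature}}

def openai_payload(model: str, messages: list, stream: bool,
                   temperature: float) -> dict:
    # OpenAI-compatible /v1/chat/completions: params sit at the top level
    return {"model": model, "messages": messages, "stream": stream,
            "temperature": temperature}
```

Hiding this shape difference behind one `build_payload()` interface is exactly what lets a single env-var change swap providers.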
Switch providers by changing one env var:
```bash
# Switch from Ollama to Groq — zero code changes
LLM_PROVIDER=groq
LLM_BASE_URL=https://api.groq.com/openai
LLM_API_KEY=gsk_your_key_here
DEFAULT_MODEL=llama-3.3-70b-versatile
```
What You See
After integration, your observability backend lights up with LLM-specific telemetry.
Traces
Every `gen_ai.chat` span contains:
- Request attributes: model, temperature, max_tokens, streaming mode, request purpose
- Response attributes: output tokens, finish reason, response ID
- Span events: `gen_ai.user.message` and `gen_ai.assistant.message` with content
- Child HTTP spans with `server.address` and `http.response.status_code`
Metrics
Query these in Dynatrace DQL, Grafana, or any OTLP-compatible backend:
```
# Total tokens consumed (Dynatrace DQL)
timeseries val = sum(gen_ai.client.token.usage), by:{gen_ai.token.type}

# p95 response time by model
timeseries p95 = percentile(gen_ai.client.operation.duration, 95), by:{gen_ai.request.model}

# TTFT trend
timeseries val = avg(gen_ai.server.time_to_first_token)
```
Logs
Structured logs with trace correlation appear in your log backend:
INFO Chat done model=gemma4:latest duration_s=2.341 prompt_tokens=142 completion_tokens=87 trace_id=abc123...
In Dynatrace, this data powers the AI Observability screen, GenAI dashboards with model comparison tables, token economics breakdowns, and TTFT/TPOT trends — all from the semantic convention attributes this package sets.
Real-World Problem & Why It Matters
Here’s the production scenario: you’re running Ollama on a Linux box. Your team uses it through Open WebUI for coding help, document summarization, and brainstorming. It works fine — until it doesn’t.
Without llm-otel-kit:
- A user reports “the AI is slow today.” You have no data. You check `htop`, the GPU looks fine. Was it the model? The prompt size? Network? A streaming regression? You don’t know.
- Your manager asks “how much are people actually using this?” You grep access logs and count HTTP 200s. No token counts, no model breakdown, no user segmentation.
- You want to compare Gemma 4 vs. Llama 3.3 for your workload. You run benchmarks manually. There’s no historical data.
With llm-otel-kit:
- TTFT spikes from 0.5s to 4s at 2pm — you see it on the histogram, drill into the trace, find a 50K-token prompt that blew up context.
- Dashboard shows 847 conversations this week, 2.3M tokens consumed, output/input ratio of 1.4x, split 60/40 between User Chat and Title Generation.
- Model comparison table shows Gemma 4 at p95 of 8.2s with 23 tok/s throughput vs. Llama 3.3 at 3.1s and 41 tok/s — data-driven model selection.
This is the difference between “it feels slow” and “the p95 TTFT for streaming requests on gemma4:26b increased 3.2x between 1pm and 3pm, correlated with average input token count rising from 200 to 1,800.”
Why It’s a Separate Package
I extracted llm-otel-kit from a monolithic main.py that had grown to 480 lines — half of which was OTel boilerplate and provider-specific HTTP translation.
Separation of concerns:
- `llm-otel-kit` handles: OTel bootstrap, metric instruments, span attributes, provider abstraction, histogram buckets, temporality config
- Your app handles: FastAPI routes, request validation, business logic
The package is reusable. If you build a different LLM gateway, a batch processing pipeline, or a CLI tool that calls LLMs — same pip install, same 6-line setup, same telemetry. Your instrumentation is decoupled from your application.
What’s Inside
The package has 10 source files organized into two layers:
| File | Responsibility |
|---|---|
| `bootstrap.py` | OTel init: MeterProvider → LoggerProvider → Traceloop (order is critical). Sets histogram views with GenAI semconv bucket boundaries. Handles Dynatrace DELTA/CUMULATIVE temporality. |
| `metrics.py` | `GenAIMetrics` dataclass — creates all 10 metric instruments from a single Meter. |
| `spans.py` | Span attribute setters (`set_genai_span`, `set_genai_response`, `record_metrics`), request classifier, provider detection from model name, PII-safe hashing for user IDs. |
| `config.py` | `AppConfig` / `ProviderConfig` dataclasses with `from_env()` factory. Legacy `OLLAMA_BASE_URL` fallback. |
| `providers/base.py` | `LLMProvider` ABC with `complete()`, `stream()`, `list_models()`, `build_payload()`. Dataclasses: `CompletionResult`, `StreamChunk`, `TimingInfo`. |
| `providers/ollama.py` | Ollama native API (`/api/chat`). Extracts TTFT/TPOT from Ollama’s nanosecond timing fields (`prompt_eval_duration`, `eval_duration`). |
| `providers/openai_compat.py` | OpenAI-compatible API (`/v1/chat/completions`). Works with OpenAI, vLLM, llama.cpp, LM Studio, Groq, Together, Fireworks, Azure OpenAI, LiteLLM. Uses `stream_options.include_usage` for streaming token counts. |
| `providers/anthropic.py` | Anthropic Messages API (`/v1/messages`). Separates system messages from user messages (an Anthropic requirement). Handles SSE event types: `message_start`, `content_block_delta`, `message_delta`. |
| `providers/__init__.py` | Factory function `create_provider()` — maps provider name to the correct subclass. |
| `__init__.py` | Public API surface — re-exports everything users need. |
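As a concrete example of what the Ollama provider extracts: Ollama's final `/api/chat` response carries `prompt_eval_duration`, `eval_duration`, and `eval_count`, with durations in nanoseconds. A hedged sketch of the derivation (approximating TTFT by prompt-evaluation time is my assumption here, not necessarily the package's exact formula):

```python
def ollama_timing(resp: dict) -> tuple:
    """Derive (ttft_s, tpot_s) from an Ollama /api/chat response dict."""
    ns = 1e9  # Ollama reports timing fields in nanoseconds
    # Prompt evaluation dominates time-to-first-token for a completed call
    ttft = resp.get("prompt_eval_duration", 0) / ns
    eval_count = resp.get("eval_count", 0)
    # Average seconds per generated token; guard against zero tokens
    tpot = (resp.get("eval_duration", 0) / ns / eval_count) if eval_count else 0.0
    return ttft, tpot
```

For example, a response with 0.5 s of prompt evaluation and 2 s spent generating 10 tokens yields a TTFT of 0.5 s and a TPOT of 0.2 s.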
Notable design decision: the provider abstraction uses httpx.AsyncClient directly rather than wrapping provider SDKs. This means zero extra dependencies (no openai package, no anthropic package for basic usage) and full control over the HTTP layer for span instrumentation.
Links & Get Started
Install:
pip install llm-otel-kit
- PyPI: pypi.org/project/llm-otel-kit
- GitHub: theharithsa/Local-LLM-Application-with-OpenLLMetry
- Release: v0.1.0
- License: MIT
PRs and issues are welcome. If you’re using a provider I haven’t tested, open an issue — the OpenAI-compatible layer covers most of them, but edge cases exist.