You’re running an LLM backend. Maybe it’s Ollama on your workstation, maybe it’s GPT-4o behind an API key, maybe it’s vLLM on a GPU cluster. You’ve got a FastAPI proxy in front of it. Requests go in, tokens come out. But when something goes wrong — and it will — you’re blind.
OpenTelemetry has published GenAI semantic conventions that define exactly how LLM telemetry should look. `gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, `gen_ai.server.time_to_first_token` — the spec is thorough. But the spec doesn’t write your instrumentation code. You’re left answering questions like:
- How do I set up histogram buckets that match the GenAI semconv?
- What temporality does Dynatrace need — DELTA or CUMULATIVE?
- How do I compute Time-to-First-Token and Time-per-Output-Token for streaming?
- How do I switch from Ollama to OpenAI without rewriting my instrumentation?
- How do I correlate traces across a multi-service LLM pipeline?
I built llm-otel-kit to answer all of them. One pip install, six lines of setup, and your LLM backend emits production-grade OTel telemetry — traces, metrics, and logs — for any of 11 supported providers. No hand-written OTel boilerplate in your application.
Difficulty: Intermediate
Skill Level: Familiarity with FastAPI, OpenTelemetry concepts (traces, metrics), and at least one LLM API
Time to Integrate: 15 minutes
Ecosystem: PyPI
What It Does
llm-otel-kit gives you three pillars of observability, all following OTel GenAI semantic conventions:
Traces
Every LLM call becomes a span with full gen_ai.* attributes:
| Attribute | Example Value |
|---|---|
| `gen_ai.system` | `ollama`, `openai`, `anthropic` |
| `gen_ai.request.model` | `gemma4:26b` |
| `gen_ai.usage.input_tokens` | `142` |
| `gen_ai.usage.output_tokens` | `87` |
| `gen_ai.request.temperature` | `0.7` |
| `llm.is_streaming` | `true` |
| `llm.request.purpose` | `User Chat`, `Title Generation` |
| `conversation.fingerprint` | `a3f2c1...` (SHA-256 hash) |
| `enduser.id` | `7b2e...` (SHA-256 hash — PII-safe) |
User messages and assistant responses are attached as span events. Prompts are indexed (`gen_ai.prompt.0.content`) for searchability.
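The `enduser.id` and `conversation.fingerprint` values above are SHA-256 digests rather than raw identifiers. A minimal sketch of that kind of PII-safe hashing (illustrative; the package's actual helper name and salting scheme may differ):

```python
import hashlib

def pii_safe_hash(value: str) -> str:
    # One-way SHA-256 digest: stable enough for grouping and segmentation,
    # but the original identifier cannot be recovered from the attribute.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Usage on a span (hypothetical call site):
#   span.set_attribute("enduser.id", pii_safe_hash(user_id))
```

Because the digest is deterministic, the same user always maps to the same `enduser.id`, so per-user dashboards still work without storing PII in your telemetry backend.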
Metrics — 10 Instruments
| Instrument | Name | Type | Unit |
|---|---|---|---|
| Operation Duration | `gen_ai.client.operation.duration` | Histogram | `s` |
| Token Usage | `gen_ai.client.token.usage` | Histogram | `{token}` |
| Time to First Token | `gen_ai.server.time_to_first_token` | Histogram | `s` |
| Time per Output Token | `gen_ai.server.time_per_output_token` | Histogram | `s` |
| Request Count | `llm.request.count` | Counter | `1` |
| Error Count | `llm.request.errors` | Counter | `1` |
| Active Requests | `llm.request.active` | UpDownCounter | `1` |
| Stream Chunks | `llm.stream.chunks` | Counter | `1` |
| Token Throughput | `llm.token.throughput` | Histogram | `{token}/s` |
| Message Count | `llm.request.message_count` | Histogram | `1` |
Histograms use GenAI semconv bucket boundaries — not OTel defaults. Duration buckets go from 10ms to 82s. Token buckets use powers of 4 from 1 to 67M. TTFT and TPOT each have their own precision-optimized boundaries.
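Written out, those boundaries are just two small lists; a sketch (the values follow the GenAI semconv advisory buckets the article cites, and the `View` wiring in the comment is illustrative, not the package's exact code):

```python
# Duration buckets (seconds): doubling from 10 ms to ~82 s, as advised by
# the GenAI semconv for gen_ai.client.operation.duration.
DURATION_BUCKETS = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28,
                    2.56, 5.12, 10.24, 20.48, 40.96, 81.92]

# Token buckets: powers of 4 from 1 up to 4**13 (~67M tokens).
TOKEN_BUCKETS = [4 ** n for n in range(14)]

# These would feed an explicit-bucket view on the MeterProvider, roughly:
#   View(instrument_name="gen_ai.client.token.usage",
#        aggregation=ExplicitBucketHistogramAggregation(TOKEN_BUCKETS))
```

The point of custom buckets is resolution where LLM workloads actually live: OTel's default duration buckets top out far too low for multi-second generations, and token counts span six orders of magnitude.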
Logs
Structured log records exported via OTLP with contextual fields:
model=gemma4:latest duration_s=2.341 prompt_tokens=142 completion_tokens=87
Every log is correlated to a trace via trace_id.
Architecture
```
┌────────────────────────────────────────────────────────┐
│ Your FastAPI App (main.py)                             │
│                                                        │
│ from llm_otel_kit import init_observability,           │
│                           create_provider, GenAIMetrics│
│                                                        │
│ ┌─────────────┐     ┌──────────────┐                   │
│ │  AppConfig  │────▶│ LLMProvider  │──── HTTP ───┐     │
│ │ .from_env() │     │ .complete()  │             │     │
│ └──────┬──────┘     │  .stream()   │             │     │
│        │            └──────────────┘             │     │
│ ┌──────▼─────────┐                               │     │
│ │ OTel Bootstrap │  ┌──────────────┐             │     │
│ │ ┌─ Metrics ──┐ │  │ GenAI Spans  │             │     │
│ │ │ 10 instrum.│ │  │   + Events   │             │     │
│ │ └────────────┘ │  └──────────────┘             │     │
│ │ ┌─ Logs ─────┐ │                               │     │
│ │ │ OTLP export│ │                               │     │
│ │ └────────────┘ │                               │     │
│ │ ┌─ Tracing ──┐ │                               │     │
│ │ │ Traceloop  │ │                               │     │
│ │ └────────────┘ │                               │     │
│ └────────────────┘                               │     │
└────────┬─────────────────────────────────────────┬─────┘
         │ OTLP/HTTP                               │
         ▼                                         ▼
  ┌──────────────┐                        ┌────────────────┐
  │  Dynatrace   │                        │  LLM Backend   │
  │  Jaeger      │                        │  (Ollama,      │
  │  Grafana     │                        │   OpenAI,      │
  │  any OTLP    │                        │   Anthropic,   │
  │  backend     │                        │   vLLM, ...)   │
  └──────────────┘                        └────────────────┘
```
The data flow:
- Config → `AppConfig.from_env()` reads `LLM_PROVIDER`, `LLM_BASE_URL`, and OTLP settings from environment variables
- Bootstrap → `init_observability()` sets up MeterProvider (with correct temporality), LoggerProvider, and Traceloop SDK — in that exact order (critical for Dynatrace)
- Provider → `create_provider()` returns the right `LLMProvider` subclass based on config
- Request flow → your endpoint calls `provider.complete()` or `provider.stream()`; the package records spans, metrics, and logs automatically via helper functions
- Export → everything flows via OTLP/HTTP to your observability backend
The package intercepts nothing magically — it provides explicit helpers you call in your route handlers. You control what gets instrumented.
How to Use It
Install
pip install llm-otel-kit
For Anthropic Claude support:
pip install llm-otel-kit[anthropic]
Peer dependencies (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http, traceloop-sdk, httpx) are installed automatically.
Configure
Set these environment variables:
```bash
# Required — pick your provider
LLM_PROVIDER=ollama                  # or: openai, anthropic, vllm, groq, together, ...
LLM_BASE_URL=http://localhost:11434  # provider API base URL
LLM_API_KEY=                         # for cloud providers (leave empty for local)
DEFAULT_MODEL=gemma4:26b             # fallback model name
APP_NAME=my-llm-backend              # OTel service name

# Required for telemetry export
TRACELOOP_BASE_URL=https://your-env.apps.dynatrace.com/api/v2/otlp
DT_OTLP_TOKEN=dt0c01.YOUR_TOKEN_HERE
```
That’s it. No YAML files, no collector config, no SDK plumbing.
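If you're curious what `from_env()` amounts to, it is plain environment parsing with local-friendly defaults; a rough sketch (the class and field names here are assumptions for illustration, not the package's actual schema):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvConfig:
    provider: str
    base_url: str
    api_key: str
    default_model: str
    app_name: str

    @classmethod
    def from_env(cls) -> "EnvConfig":
        # Unset optional variables fall back to sensible local defaults,
        # so a bare Ollama setup needs almost no configuration.
        return cls(
            provider=os.getenv("LLM_PROVIDER", "ollama"),
            base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434"),
            api_key=os.getenv("LLM_API_KEY", ""),
            default_model=os.getenv("DEFAULT_MODEL", ""),
            app_name=os.getenv("APP_NAME", "llm-backend"),
        )
```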
Wire It Up
Here’s the complete integration — this is real code from a production FastAPI backend:
```python
import time

import httpx
from fastapi import FastAPI
from opentelemetry import trace
from traceloop.sdk.decorators import workflow

# ── 6 lines of setup ──────────────────────────────────────
from llm_otel_kit import (
    AppConfig, GenAIMetrics, create_provider,
    init_observability, record_metrics,
    set_genai_span, set_genai_response, classify_request,
)
from llm_otel_kit.spans import semconv_attrs

config = AppConfig.from_env()                 # 1. Read config from env
otel = init_observability(config.app_name,    # 2. Bootstrap OTel
                          config.otlp_endpoint, config.otlp_token)
provider = create_provider(config.provider)   # 3. Create provider
m = GenAIMetrics(otel.meter)                  # 4. Create metric instruments
# ───────────────────────────────────────────────────────────

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat(request: dict):
    model = request.get("model") or config.provider.default_model
    messages = request["messages"]

    # Track the request
    m.request_count.add(1, {"model": model})
    attrs = semconv_attrs(model, provider.host, provider.port)

    # Build provider-native payload
    payload = provider.build_payload(
        model=model, messages=messages, stream=False,
        temperature=request.get("temperature", 0.7),
    )

    # Call the LLM with full span instrumentation
    span = trace.get_current_span()
    set_genai_span(span, model, classify_request(messages),
                   False, messages, provider.host, provider.port)

    start = time.time()
    async with httpx.AsyncClient(timeout=300.0) as client:
        result = await provider.complete(client, payload)  # ← provider-agnostic call

    # Record response telemetry
    set_genai_response(span, result.content, model,
                       result.prompt_tokens, result.completion_tokens)
    record_metrics(m, attrs, model, time.time() - start,
                   result.prompt_tokens, result.completion_tokens,
                   result.timing.ttft, result.timing.tpot)

    return {"choices": [{"message": {"content": result.content}}],
            "usage": {"prompt_tokens": result.prompt_tokens,
                      "completion_tokens": result.completion_tokens}}
```
The critical line is `provider.complete(client, payload)` — it translates to the correct API format (Ollama’s `/api/chat`, OpenAI’s `/v1/chat/completions`, or Anthropic’s `/v1/messages`) and normalizes the response into a `CompletionResult` with timing info extracted.
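To make that translation concrete, here is a hedged sketch of what the per-provider payload step boils down to (illustrative only, not the package's code): Ollama's native `/api/chat` nests sampling parameters under an `options` object, while OpenAI-style endpoints take them top-level.

```python
def ollama_payload(model: str, messages: list, stream: bool,
                   temperature: float) -> dict:
    # Ollama native /api/chat: sampling params live under "options"
    return {"model": model, "messages": messages, "stream": stream,
            "options": {"temperature": temperature}}

def openai_payload(model: str, messages: list, stream: bool,
                   temperature: float) -> dict:
    # OpenAI-compatible /v1/chat/completions: params sit at the top level
    return {"model": model, "messages": messages, "stream": stream,
            "temperature": temperature}
```

Hiding this shape difference behind one `build_payload()` interface is exactly what lets a single env-var change swap providers.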
Switch providers by changing one env var:
```bash
# Switch from Ollama to Groq — zero code changes
LLM_PROVIDER=groq
LLM_BASE_URL=https://api.groq.com/openai
LLM_API_KEY=gsk_your_key_here
DEFAULT_MODEL=llama-3.3-70b-versatile
```
What You See
After integration, your observability backend lights up with LLM-specific telemetry.
Traces
Every `gen_ai.chat` span contains:
- Request attributes: model, temperature, max_tokens, streaming mode, request purpose
- Response attributes: output tokens, finish reason, response ID
- Span events: `gen_ai.user.message` and `gen_ai.assistant.message` with content
- Child HTTP spans with `server.address` and `http.response.status_code`
Metrics
Query these in Dynatrace DQL, Grafana, or any OTLP-compatible backend:
```
# Total tokens consumed (Dynatrace DQL)
timeseries val = sum(gen_ai.client.token.usage), by:{gen_ai.token.type}

# p95 response time by model
timeseries p95 = percentile(gen_ai.client.operation.duration, 95), by:{gen_ai.request.model}

# TTFT trend
timeseries val = avg(gen_ai.server.time_to_first_token)
```
Logs
Structured logs with trace correlation appear in your log backend:
INFO Chat done model=gemma4:latest duration_s=2.341 prompt_tokens=142 completion_tokens=87 trace_id=abc123...
In Dynatrace, this data powers the AI Observability screen, GenAI dashboards with model comparison tables, token economics breakdowns, and TTFT/TPOT trends — all from the semantic convention attributes this package sets.
Real-World Problem & Why It Matters
Here’s the production scenario: you’re running Ollama on a Linux box. Your team uses it through Open WebUI for coding help, document summarization, and brainstorming. It works fine — until it doesn’t.
Without llm-otel-kit:
- A user reports “the AI is slow today.” You have no data. You check `htop`, the GPU looks fine. Was it the model? The prompt size? Network? A streaming regression? You don’t know.
- Your manager asks “how much are people actually using this?” You grep access logs and count HTTP 200s. No token counts, no model breakdown, no user segmentation.
- You want to compare Gemma 4 vs. Llama 3.3 for your workload. You run benchmarks manually. There’s no historical data.
With llm-otel-kit:
- TTFT spikes from 0.5s to 4s at 2pm — you see it on the histogram, drill into the trace, find a 50K-token prompt that blew up context.
- Dashboard shows 847 conversations this week, 2.3M tokens consumed, output/input ratio of 1.4x, split 60/40 between User Chat and Title Generation.
- Model comparison table shows Gemma 4 at p95 of 8.2s with 23 tok/s throughput vs. Llama 3.3 at 3.1s and 41 tok/s — data-driven model selection.
This is the difference between “it feels slow” and “the p95 TTFT for streaming requests on gemma4:26b increased 3.2x between 1pm and 3pm, correlated with average input token count rising from 200 to 1,800.”
Why It’s a Separate Package
I extracted llm-otel-kit from a monolithic main.py that had grown to 480 lines — half of which was OTel boilerplate and provider-specific HTTP translation.
Separation of concerns:
- `llm-otel-kit` handles: OTel bootstrap, metric instruments, span attributes, provider abstraction, histogram buckets, temporality config
- Your app handles: FastAPI routes, request validation, business logic
The package is reusable. If you build a different LLM gateway, a batch processing pipeline, or a CLI tool that calls LLMs — same pip install, same 6-line setup, same telemetry. Your instrumentation is decoupled from your application.
What’s Inside
The package has 10 source files organized into two layers:
| File | Responsibility |
|---|---|
| `bootstrap.py` | OTel init: MeterProvider → LoggerProvider → Traceloop (order is critical). Sets histogram views with GenAI semconv bucket boundaries. Handles Dynatrace DELTA/CUMULATIVE temporality. |
| `metrics.py` | `GenAIMetrics` dataclass — creates all 10 metric instruments from a single Meter. |
| `spans.py` | Span attribute setters (`set_genai_span`, `set_genai_response`, `record_metrics`), request classifier, provider detection from model name, PII-safe hashing for user IDs. |
| `config.py` | `AppConfig` / `ProviderConfig` dataclasses with `from_env()` factory. Legacy `OLLAMA_BASE_URL` fallback. |
| `providers/base.py` | `LLMProvider` ABC with `complete()`, `stream()`, `list_models()`, `build_payload()`. Dataclasses: `CompletionResult`, `StreamChunk`, `TimingInfo`. |
| `providers/ollama.py` | Ollama native API (`/api/chat`). Extracts TTFT/TPOT from Ollama’s nanosecond timing fields (`prompt_eval_duration`, `eval_duration`). |
| `providers/openai_compat.py` | OpenAI-compatible API (`/v1/chat/completions`). Works with OpenAI, vLLM, llama.cpp, LM Studio, Groq, Together, Fireworks, Azure OpenAI, LiteLLM. Uses `stream_options.include_usage` for streaming token counts. |
| `providers/anthropic.py` | Anthropic Messages API (`/v1/messages`). Separates system messages from user messages (an Anthropic requirement). Handles SSE event types: `message_start`, `content_block_delta`, `message_delta`. |
| `providers/__init__.py` | Factory function `create_provider()` — maps provider name to the correct subclass. |
| `__init__.py` | Public API surface — re-exports everything users need. |
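As a concrete example of what the Ollama provider extracts: Ollama's final `/api/chat` response carries `prompt_eval_duration`, `eval_duration`, and `eval_count`, with durations in nanoseconds. A hedged sketch of the derivation (approximating TTFT by prompt-evaluation time is my assumption here, not necessarily the package's exact formula):

```python
def ollama_timing(resp: dict) -> tuple:
    """Derive (ttft_s, tpot_s) from an Ollama /api/chat response dict."""
    ns = 1e9  # Ollama reports timing fields in nanoseconds
    # Prompt evaluation dominates time-to-first-token for a completed call
    ttft = resp.get("prompt_eval_duration", 0) / ns
    eval_count = resp.get("eval_count", 0)
    # Average seconds per generated token; guard against zero tokens
    tpot = (resp.get("eval_duration", 0) / ns / eval_count) if eval_count else 0.0
    return ttft, tpot
```

For example, a response with 0.5 s of prompt evaluation and 2 s spent generating 10 tokens yields a TTFT of 0.5 s and a TPOT of 0.2 s.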
Notable design decision: the provider abstraction uses httpx.AsyncClient directly rather than wrapping provider SDKs. This means zero extra dependencies (no openai package, no anthropic package for basic usage) and full control over the HTTP layer for span instrumentation.
Links & Get Started
Install:
pip install llm-otel-kit
- PyPI: pypi.org/project/llm-otel-kit
- GitHub: theharithsa/Local-LLM-Application-with-OpenLLMetry
- Release: v0.1.0
- License: MIT
PRs and issues are welcome. If you’re using a provider I haven’t tested, open an issue — the OpenAI-compatible layer covers most of them, but edge cases exist.