
AI Agent Observability in Node.js with OpenTelemetry

Malik Hamza Shabbir · June 10, 2026 · 7 min read

In short

You can get production-grade observability for a Node.js AI agent with OpenTelemetry alone: the GenAI client span conventions exited experimental status in early 2026, and they now cover LLM calls, agent orchestration, and MCP tool calls. That matters because, as of May 2026, nearly every company runs agents and most of them cannot see inside them. In this post I instrument a TypeScript agent end to end: spans for model calls, tools, and retrieval, token cost per trace, eval wiring, and the three alerts that catch real agent failures.


On this page

Why do AI agents fail silently in production?
What do the OpenTelemetry GenAI semantic conventions cover?
How do I instrument a TypeScript agent step by step?
How do I track tokens and cost per trace?
How do I wire traces into evals?
What should I alert on?
Key takeaways

Why do AI agents fail silently in production?

Agents fail silently because their failures look like success. A broken tool call, an empty retrieval result, or a looping plan still ends in a fluent response with an HTTP 200 attached, so nothing throws, nothing pages, and the first person to notice is a customer. The supporting numbers are stark: as of May 28, 2026, 97% of companies have deployed AI agents but 79% report significant production challenges, and The Register reported on May 13, 2026 that roughly 74% of AI customer-service agent rollouts get rolled back.

I felt this in my own product before I read any survey. My reputation SaaS syncs Google reviews and drafts AI auto-replies, a few thousand replies a month. A profile-data tool once started returning empty objects after an upstream API change. The agent did not crash. It kept producing replies, just generic ones stripped of the business context that tool used to supply. No exception, no error log. A customer flagged the quality drop before anything in my logs did.

AI agent observability is the practice of recording every model call, tool invocation, and retrieval step an agent performs as structured, correlated telemetry so you can reconstruct exactly why the agent produced any given output.

The 2026 consensus is that agent failures are a runtime and observability problem, not a model problem. The models are good enough; the visibility into what they do with your tools is not. Most of the agents I ship in my AI agents and automation work break at the tool boundary, not inside the model.

Agents do not crash, they comply. The worst failures return HTTP 200 with a confident, wrong answer, and the only way to catch them is to record what the agent actually did, not what it said.

What do the OpenTelemetry GenAI semantic conventions cover?

The OpenTelemetry GenAI semantic conventions define standard span names and gen_ai.* attributes for generative AI operations: model calls (chat), embeddings, agent orchestration (invoke_agent), and tool execution (execute_tool), including MCP tool calls. GenAI client spans exited experimental status in early 2026, which was my green light to standardize on them.

Stability matters more than it sounds. While the conventions were experimental, attribute names churned and dashboards built on them broke between minor versions. With client spans stable, you can build alerts on these attributes and expect them to survive upgrades.

Nearly all reference content for these conventions is Python-first, but the conventions are language-neutral. In Node.js you set the same attributes with @opentelemetry/api and manual spans. These are the attributes that earn their storage for agents:

Two notes. First, execute_tool spans apply to MCP tools exactly as they do to in-process functions; the span layout maps cleanly onto the request lifecycle I described in migrating an MCP server to the 2026 stateless spec. Second, do not record full prompts and completions as attributes by default. They are large and often sensitive, and the conventions treat content capture as opt-in. Token counts and tool names answer most production questions.
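
To make the opt-in default concrete, here is a minimal sketch of a helper that builds the gen_ai.* attribute set for one chat span and attaches prompt text only when the caller asks for it. The ChatCall shape, the chatAttributes name, the 200-character cap, and the app.prompt.preview attribute are all assumptions for illustration; app.prompt.preview is an application-level name, not part of the GenAI conventions.

```typescript
// Attribute values used in this sketch (strings and numbers only).
type Attrs = Record<string, string | number>;

// Hypothetical summary of one model call.
interface ChatCall {
  provider: string;      // e.g. "anthropic"
  requestModel: string;  // model you asked for
  responseModel: string; // model that actually answered
  inputTokens: number;
  outputTokens: number;
  prompt: string;        // raw prompt text, possibly sensitive
}

// Builds the attribute set for one chat span.
// Prompt content is attached only when captureContent is explicitly true.
function chatAttributes(call: ChatCall, captureContent = false): Attrs {
  const attrs: Attrs = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": call.provider,
    "gen_ai.request.model": call.requestModel,
    "gen_ai.response.model": call.responseModel,
    "gen_ai.usage.input_tokens": call.inputTokens,
    "gen_ai.usage.output_tokens": call.outputTokens,
  };
  if (captureContent) {
    // Assumed app-level attribute; truncated so one span cannot balloon storage.
    attrs["app.prompt.preview"] = call.prompt.slice(0, 200);
  }
  return attrs;
}
```

The result can be passed straight to span.setAttributes. Keeping the flag off by default means a new call site cannot leak prompt text by accident.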

Figure: Trace waterfall of an AI agent run in Node.js showing nested OpenTelemetry spans for LLM calls, tool executions, and retrieval steps.

How do I instrument a TypeScript agent step by step?

Five steps, no vendor SDK: install the OpenTelemetry packages, initialize the Node SDK with an OTLP exporter, wrap LLM calls in chat spans, wrap every tool execution in execute_tool spans, and wrap the agent loop in an invoke_agent root span. It is about 60 lines of TypeScript total.

Install the packages.

BASH

npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/api

Initialize tracing before anything else. Import this file first in your entry point; otherwise spans created during module load are dropped.

TYPESCRIPT

// tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Export spans over OTLP/HTTP to a local collector or backend.
const sdk = new NodeSDK({
  serviceName: "review-reply-agent",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
});

sdk.start();

Wrap the LLM call in a chat span and record token usage from the response.

TYPESCRIPT

import { trace, SpanStatusCode } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";

const tracer = trace.getTracer("agent");
const anthropic = new Anthropic();
const model = "claude-opus-4-8";

export async function chat(messages: Anthropic.MessageParam[]) {
  return tracer.startActiveSpan(`chat ${model}`, async (span) => {
    // Request-side attributes are known before the call.
    span.setAttributes({
      "gen_ai.operation.name": "chat",
      "gen_ai.provider.name": "anthropic",
      "gen_ai.request.model": model,
    });
    try {
      const res = await anthropic.messages.create({ model, max_tokens: 1024, messages });
      // Response-side attributes: the model that answered and billed tokens.
      span.setAttributes({
        "gen_ai.response.model": res.model,
        "gen_ai.usage.input_tokens": res.usage.input_tokens,
        "gen_ai.usage.output_tokens": res.usage.output_tokens,
      });
      return res;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

Wrap every tool call in an execute_tool span. This is the step people skip and the one that matters most. Flag empty results explicitly, because that is the failure mode exceptions never catch.

TYPESCRIPT

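// Reuses `tracer` and `SpanStatusCode` from the chat example.
// Hypothetical registry; register your own tool implementations here.
const tools: Record<string, (input: unknown) => Promise<unknown>> = {};

export async function executeTool(name: string, callId: string, input: unknown) {
  return tracer.startActiveSpan(`execute_tool ${name}`, async (span) => {
    span.setAttributes({
      "gen_ai.operation.name": "execute_tool",
      "gen_ai.tool.name": name,
      "gen_ai.tool.call.id": callId,
    });
    try {
      const result = await tools[name](input);
      // Treat null, "", {} and [] as empty: a tool that returns an empty
      // object never throws, so it has to be flagged explicitly.
      const empty =
        result == null ||
        result === "" ||
        (typeof result === "object" && Object.keys(result).length === 0);
      if (empty) {
        span.setAttribute("agent.tool.empty_result", true);
      }
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}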

Wrap the loop in an invoke_agent root span and set gen_ai.conversation.id. Because startActiveSpan propagates context, the chat and tool spans nest under it automatically, and retrieval calls get the same treatment with gen_ai.operation.name: "embeddings" plus a result count.
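
The root span can be sketched as follows. To keep the example runnable without dependencies, it declares two small interfaces (SpanLike and TracerLike, both assumptions) that mirror the subset of @opentelemetry/api used in this post; in a real service you would most likely call trace.getTracer(...) and use its tracer directly. The nextStep callback, the Step type, and the agent.loop.depth attribute are hypothetical names, with nextStep standing in for whatever calls chat() and executeTool().

```typescript
// Minimal span and tracer shapes (assumed) so the sketch has no dependencies.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  end(): void;
}
interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// One step of a hypothetical agent: a final answer or a request to run a tool.
type Step = { done: true; answer: string } | { done: false; tool: string };

async function invokeAgent(
  tracer: TracerLike,
  agentName: string,
  conversationId: string,
  nextStep: (iteration: number) => Promise<Step>,
  maxIterations: number,
): Promise<string> {
  return tracer.startActiveSpan(`invoke_agent ${agentName}`, async (span) => {
    span.setAttribute("gen_ai.operation.name", "invoke_agent");
    span.setAttribute("gen_ai.agent.name", agentName);
    span.setAttribute("gen_ai.conversation.id", conversationId);
    let iterations = 0;
    try {
      while (iterations < maxIterations) {
        iterations += 1;
        const step = await nextStep(iterations);
        if (step.done) return step.answer;
      }
      throw new Error(`agent exceeded ${maxIterations} iterations`);
    } finally {
      // App-specific attribute (not a convention) that records loop depth.
      span.setAttribute("agent.loop.depth", iterations);
      span.end();
    }
  });
}
```

Recording the iteration count in the finally block means the depth is written whether the run succeeds, throws, or hits the cap, which is what a loop-depth alert needs.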

The exporter speaks OTLP, so the backend is your choice: Jaeger, Grafana Tempo, SigNoz, or a hosted endpoint. It is the same pipeline I run for plain REST services in my API development work, which means agent traces and ordinary API traces land in one backend with one mental model.

How do I track tokens and cost per trace?

Set gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on every chat span and your tracing backend becomes a cost dashboard. Sum the token attributes within a trace, multiply by your per-token prices, and you have the cost of one task. Group by gen_ai.conversation.id for cost per conversation, or by day for a forecast.

I also write a derived agent.cost.usd attribute so I never join against a price sheet at query time:

TYPESCRIPT

// USD per token, keyed by model.
const PRICE: Record<string, { in: number; out: number }> = {
  "claude-opus-4-8": { in: 5 / 1_000_000, out: 25 / 1_000_000 },
};

// Inside the chat span, once the response is available.
span.setAttribute(
  "agent.cost.usd",
  res.usage.input_tokens * PRICE[model].in + res.usage.output_tokens * PRICE[model].out,
);

In my reputation SaaS this paid for itself within a month. The service drafts roughly 2,000 review replies a month, and a prompt template edit once pushed median input tokens per reply from about 1,900 to 3,100. Before tracing I would have found that on the invoice; with a tokens-per-trace chart I found it the next morning. Tracing tells you where the money goes. What to do about it is its own topic, and I covered the fixes in cutting LLM API costs with caching and routing.
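
The per-trace and per-conversation sums can also be computed offline from exported span rows. This is a minimal sketch under stated assumptions: the ChatSpanRow shape is a guess at what a backend query might return (not a real backend schema), spanCost and costBy are hypothetical helpers, and the prices mirror the PRICE table in this post.

```typescript
// A flattened chat span as a backend query might return it (assumed shape).
interface ChatSpanRow {
  traceId: string;
  conversationId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per token, keyed by model; same numbers as the PRICE table in this post.
const PRICE_PER_TOKEN: Record<string, { in: number; out: number }> = {
  "claude-opus-4-8": { in: 5 / 1_000_000, out: 25 / 1_000_000 },
};

// Cost of a single chat span.
function spanCost(row: ChatSpanRow): number {
  const price = PRICE_PER_TOKEN[row.model];
  if (!price) throw new Error(`no price configured for ${row.model}`);
  return row.inputTokens * price.in + row.outputTokens * price.out;
}

// Sums cost by an arbitrary key: trace id for cost per task,
// conversation id for cost per conversation.
function costBy(rows: ChatSpanRow[], key: (row: ChatSpanRow) => string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const k = key(row);
    totals.set(k, (totals.get(k) ?? 0) + spanCost(row));
  }
  return totals;
}
```

Calling costBy(rows, (r) => r.traceId) gives cost per task, and costBy(rows, (r) => r.conversationId) gives cost per conversation. Throwing on an unknown model is a deliberate choice here: a silent zero would hide spend on a model nobody priced.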

How do I wire traces into evals?

Export a sample of production traces and run them through an eval framework offline; you do not need a second instrumentation layer. DeepEval v4.0.3 and Phoenix v16 both shipped on May 21, 2026, and both consume OpenTelemetry traces directly, so the spans you already emit double as eval datasets.

My pipeline is deliberately small. A nightly job queries the backend for the day's invoke_agent traces, samples around 50, and reconstructs each run from its spans: user input, tool sequence, final output. DeepEval scores them for faithfulness and for my own rules, like the reply-length cap on my auto-replies. Scores go back into the same backend as metrics tagged with the trace id, so a regressed score links straight to the exact trace that produced it. Debugging collapses from guessing to reading.

Alerts catch loud failures. Evals catch quiet ones, like replies that are polite, on-brand, and wrong. You want both, and traces are the shared substrate.
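
The sampling and reconstruction steps of the nightly job might look like the sketch below. It is not the actual pipeline: the SpanRow and EvalCase shapes and the sample and reconstructRun helpers are hypothetical, and it assumes the root invoke_agent span carries the user input and final output, which only holds if you have opted in to content capture for the sampled traffic.

```typescript
// A span row as a backend query might return it (assumed shape).
interface SpanRow {
  traceId: string;
  startTimeMs: number;
  operation: "invoke_agent" | "chat" | "execute_tool";
  toolName?: string;
  input?: string;  // present only when content capture is enabled
  output?: string; // present only when content capture is enabled
}

// What an eval needs per run: input, tool sequence, final output.
interface EvalCase {
  traceId: string;
  userInput: string;
  toolSequence: string[];
  finalOutput: string;
}

// Picks up to n items without replacement (Fisher-Yates shuffle on a copy).
// rng is injectable so a run can be made reproducible.
function sample<T>(items: T[], n: number, rng: () => number = Math.random): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = copy[i] as T;
    copy[i] = copy[j] as T;
    copy[j] = tmp;
  }
  return copy.slice(0, n);
}

// Rebuilds one run from the spans that share a trace id, ordered by start time.
function reconstructRun(traceId: string, spans: SpanRow[]): EvalCase | undefined {
  const own = spans
    .filter((s) => s.traceId === traceId)
    .sort((a, b) => a.startTimeMs - b.startTimeMs);
  const root = own.find((s) => s.operation === "invoke_agent");
  if (!root) return undefined; // incomplete trace, skip it
  return {
    traceId,
    userInput: root.input ?? "",
    toolSequence: own
      .filter((s) => s.operation === "execute_tool")
      .map((s) => s.toolName ?? "unknown"),
    finalOutput: root.output ?? "",
  };
}
```

Each EvalCase can then be handed to whichever eval framework you use, with traceId kept alongside the score so a regression links back to the trace.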

What should I alert on?

Alert on tool-call error rate, token spikes per trace, and loop depth first. Those three catch most real agent failures: broken integrations, runaway loops, and upstream APIs that change silently. Latency and cost alerts matter too, but they fire after the experience has degraded, while tool signals fire at the cause.


Table: gen_ai.* attributes worth recording for agents

Attribute	What it captures	TypeScript example
`gen_ai.operation.name`	Operation type	`"chat"`, `"execute_tool"`, `"invoke_agent"`
`gen_ai.provider.name`	Model provider	`"anthropic"`
`gen_ai.request.model`	Model you requested	`"claude-opus-4-8"`
`gen_ai.response.model`	Model that answered	`res.model`
`gen_ai.usage.input_tokens`	Prompt tokens billed	`res.usage.input_tokens`
`gen_ai.usage.output_tokens`	Completion tokens billed	`res.usage.output_tokens`
`gen_ai.tool.name`	Tool being executed	`"fetch_business_profile"`
`gen_ai.tool.call.id`	Correlates call to result	`block.id`
`gen_ai.conversation.id`	Groups traces per conversation	`session.id`

Table: alert signals and starting thresholds

Signal	Threshold (starting point)	Failure mode it catches
Tool-call error rate	Over 5% across 15 minutes	Broken integration, expired credentials, schema drift
Empty tool results	Over 10% across 1 hour	Upstream API returns 200 with nothing useful
Tokens per trace	Over 3x the 7-day median	Runaway loop, context stuffing, prompt regression
Loop depth (LLM calls per trace)	Over 8 iterations	Agent stuck cycling on a tool it cannot satisfy
p95 trace latency	Over 2x baseline	Slow tool or provider degradation backing up the queue

Thresholds are starting points; tune them against a week of your own traffic. The empty-result signal is the one I insist on, because it is the only one that catches a dependency that fails politely.
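
As a rough illustration of how the first four signals could be evaluated, here is a sketch that checks one window of per-trace summaries against the starting thresholds from the table. The TraceSummary shape, the Alert names, and the evaluateAlerts function are assumptions; a real setup would usually express these as rules in the backend's alerting system, with a separate window per signal (15 minutes for error rate, 1 hour for empty results).

```typescript
// Per-trace summary derived from spans (assumed shape).
interface TraceSummary {
  toolCalls: number;
  toolErrors: number;
  emptyToolResults: number;
  totalTokens: number;
  llmCalls: number; // loop depth
}

type Alert = "tool_error_rate" | "empty_tool_results" | "tokens_per_trace" | "loop_depth";

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Evaluates the starting thresholds against one window of traces.
// baselineTokens holds tokens per trace over the trailing 7 days.
function evaluateAlerts(window: TraceSummary[], baselineTokens: number[]): Alert[] {
  const alerts: Alert[] = [];
  const calls = window.reduce((n, t) => n + t.toolCalls, 0);
  const errors = window.reduce((n, t) => n + t.toolErrors, 0);
  const empties = window.reduce((n, t) => n + t.emptyToolResults, 0);

  if (calls > 0 && errors / calls > 0.05) alerts.push("tool_error_rate");
  if (calls > 0 && empties / calls > 0.1) alerts.push("empty_tool_results");

  const tokenLimit = 3 * median(baselineTokens);
  if (tokenLimit > 0 && window.some((t) => t.totalTokens > tokenLimit)) {
    alerts.push("tokens_per_trace");
  }
  if (window.some((t) => t.llmCalls > 8)) alerts.push("loop_depth");
  return alerts;
}
```

Comparing against the median rather than the mean keeps one runaway trace in the baseline from raising the token threshold for everything after it.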

My verdict after running this in production: you do not need an observability SaaS on day one. OpenTelemetry plus any OTLP backend covers a solo team, but instrument tool calls from day one, because that is where agents actually break. Tool spans also double as an audit trail of every external action your agent took, which becomes a security asset the moment you harden the deployment; my MCP server hardening checklist leans on exactly that trail for incident review.

Key takeaways

97% of companies have deployed AI agents, 79% report significant production challenges, and ~74% of AI customer-service rollouts get rolled back. The gap is observability, not model quality.

OTel GenAI client spans went stable in early 2026. The gen_ai.* attributes now cover chat, embeddings, agent orchestration, and MCP tool calls, and are safe to build alerts on.

Instrument tool calls before anything else. Tool failures, especially empty 200 responses, are where agents silently break.

Token attributes turn traces into cost receipts. A tokens-per-trace chart caught a roughly 60% input-token regression in my SaaS (about 1,900 to 3,100 tokens per reply) the morning after it shipped.

DeepEval v4.0.3 and Phoenix v16 (both May 21, 2026) consume OTel traces directly, so production spans double as eval datasets with no second instrumentation layer.

FAQ

What are the OpenTelemetry GenAI semantic conventions?

They are the standard span names and `gen_ai.*` attributes OpenTelemetry defines for generative AI: `chat`, embeddings, `invoke_agent`, and `execute_tool` operations, plus attributes for provider, model, and token usage. GenAI client spans exited experimental status in early 2026, so the core attribute set is now stable enough to build dashboards and alerts on.

How do I trace MCP tool calls?

Wrap each tool execution in a span named `execute_tool {name}` with `gen_ai.operation.name` set to `execute_tool`, plus `gen_ai.tool.name` and `gen_ai.tool.call.id`. The conventions treat MCP tools like any other tool call, so the same span shape works whether the tool runs in process or on a remote MCP server.

Do I need an observability SaaS like LangSmith to monitor agents?

No. OpenTelemetry plus any OTLP backend, such as Jaeger, Grafana Tempo, or SigNoz, covers a solo team or a small product. Vendor platforms add LLM-specific UIs and managed evals that become worth paying for at higher volume, but everything in this post is portable to any of them.

How much overhead does OpenTelemetry add to a Node.js agent?

Effectively none relative to LLM latency. Span creation costs microseconds and exports batch in the background, while an agent turn spends seconds in model and tool time. In my production service the OTLP exporter stays well under 1% CPU; the real cost is backend storage, which trace sampling keeps in check.

Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.

Related articles

How to Migrate Your MCP Server to the 2026 Stateless Spec
The final MCP spec ships July 28, 2026 and removes sessions from the protocol. I migrated my production Node server; here is the exact diff and checklist.

How to Secure an MCP Server: 2026 Hardening Checklist
I audited my production MCP stack against the NSA's May 2026 guidance and the OX Security RCE disclosure. Here is the 12-point hardening checklist I use.

What Is WebMCP? Making Your Web App Work with AI Agents
WebMCP, announced at Google I/O 2026, lets your web app register typed tools AI agents can call in Chrome 149. Here is how I exposed mine, with code.