Malik Hamza Shabbir

AI Agent Observability in Node.js with OpenTelemetry

Malik Hamza Shabbir · 7 min read

In short

You can get production-grade observability for a Node.js AI agent with OpenTelemetry alone: the GenAI client span conventions exited experimental status in early 2026, and they now cover LLM calls, agent orchestration, and MCP tool calls. That matters because, as of May 2026, nearly every company runs agents and most of them cannot see inside them. In this post I instrument a TypeScript agent end to end: spans for model calls, tools, and retrieval, token cost per trace, eval wiring, and the three alerts that catch real agent failures.


Why do AI agents fail silently in production?

Agents fail silently because their failures look like success. A broken tool call, an empty retrieval result, or a looping plan still ends in a fluent response with an HTTP 200 attached, so nothing throws, nothing pages, and the first person to notice is a customer. The supporting numbers are stark: as of May 28, 2026, 97% of companies have deployed AI agents but 79% report significant production challenges, and The Register reported on May 13, 2026 that roughly 74% of AI customer-service agent rollouts get rolled back.

I felt this in my own product before I read any survey. My reputation SaaS syncs Google reviews and drafts AI auto-replies, a few thousand replies a month. A profile-data tool once started returning empty objects after an upstream API change. The agent did not crash. It kept producing replies, just generic ones stripped of the business context that tool used to supply. No exception, no error log. A customer flagged the quality drop before anything in my logs did.

AI agent observability is the practice of recording every model call, tool invocation, and retrieval step an agent performs as structured, correlated telemetry so you can reconstruct exactly why the agent produced any given output.

The 2026 consensus is that agent failures are a runtime/observability problem, not a model problem. The models are good enough; the visibility into what they do with your tools is not. Most of the agents I ship in my AI agents and automation work break at the tool boundary, not inside the model.

Agents do not crash, they comply. The worst failures return HTTP 200 with a confident, wrong answer, and the only way to catch them is to record what the agent actually did, not what it said.

What do the OpenTelemetry GenAI semantic conventions cover?

The OpenTelemetry GenAI semantic conventions define standard span names and gen_ai.* attributes for generative AI operations: model calls (chat), embeddings, agent orchestration (invoke_agent), and tool execution (execute_tool), including MCP tool calls. GenAI client spans exited experimental status in early 2026, which was my green light to standardize on them.

Stability matters more than it sounds. While the conventions were experimental, attribute names churned and dashboards built on them broke between minor versions. With client spans stable, you can build alerts on these attributes and expect them to survive upgrades.

Nearly all reference content for these conventions is Python-first, but the conventions are language-neutral. In Node.js you set the same attributes with @opentelemetry/api and manual spans. These are the attributes that earn their storage for agents:











Two notes. First, execute_tool spans apply to MCP tools exactly as they do to in-process functions; the span layout maps cleanly onto the request lifecycle I described in migrating an MCP server to the 2026 stateless spec. Second, do not record full prompts and completions as attributes by default. They are large and often sensitive, and the conventions treat content capture as opt-in. Token counts and tool names answer most production questions.

Trace waterfall of an AI agent run in Node.js showing nested OpenTelemetry spans for LLM calls, tool executions, and retrieval steps

How do I instrument a TypeScript agent step by step?

Five steps, no vendor SDK: install the OpenTelemetry packages, initialize the Node SDK with an OTLP exporter, wrap LLM calls in chat spans, wrap every tool execution in execute_tool spans, and wrap the agent loop in an invoke_agent root span. It is about 60 lines of TypeScript total.

  1. Install the packages.


BASH
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/api

  2. Initialize tracing before anything else. Import this file first in your entry point, otherwise spans created during module load are dropped.


TYPESCRIPT
// tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "review-reply-agent",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
});
sdk.start();

  3. Wrap the LLM call in a chat span and record token usage from the response.


TYPESCRIPT
import { trace, SpanStatusCode } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";

const tracer = trace.getTracer("agent");
const anthropic = new Anthropic();
const model = "claude-opus-4-8";

export async function chat(messages: Anthropic.MessageParam[]) {
  return tracer.startActiveSpan(`chat ${model}`, async (span) => {
    span.setAttributes({
      "gen_ai.operation.name": "chat",
      "gen_ai.provider.name": "anthropic",
      "gen_ai.request.model": model,
    });
    try {
      const res = await anthropic.messages.create({ model, max_tokens: 1024, messages });
      span.setAttributes({
        "gen_ai.response.model": res.model,
        "gen_ai.usage.input_tokens": res.usage.input_tokens,
        "gen_ai.usage.output_tokens": res.usage.output_tokens,
      });
      return res;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

  4. Wrap every tool call in an execute_tool span. This is the step people skip and the one that matters most. Flag empty results explicitly, because that is the failure mode exceptions never catch.


TYPESCRIPT
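// Reuses `tracer` and `SpanStatusCode` from the chat example above.
// Tool registry: name -> async handler. Register your own tools here.
type ToolHandler = (input: unknown) => Promise<unknown>;
const tools: Record<string, ToolHandler> = {};

// Null, "", [], and {} all count as empty: the "200 with nothing useful" case.
function isEmptyResult(result: unknown): boolean {
  if (result == null || result === "") return true;
  if (Array.isArray(result)) return result.length === 0;
  return typeof result === "object" && Object.keys(result as object).length === 0;
}

export async function executeTool(name: string, callId: string, input: unknown) {
  return tracer.startActiveSpan(`execute_tool ${name}`, async (span) => {
    span.setAttributes({
      "gen_ai.operation.name": "execute_tool",
      "gen_ai.tool.name": name,
      "gen_ai.tool.call.id": callId,
    });
    try {
      const tool = tools[name];
      if (!tool) throw new Error(`Unknown tool: ${name}`);
      const result = await tool(input);
      // An empty result is not an exception, so mark it on the span explicitly.
      if (isEmptyResult(result)) {
        span.setAttribute("agent.tool.empty_result", true);
      }
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}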

  5. Wrap the loop in an invoke_agent root span and set gen_ai.conversation.id. Because startActiveSpan propagates context, the chat and tool spans nest under it automatically, and retrieval calls get the same treatment with gen_ai.operation.name: "embeddings" plus a result count.
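A minimal sketch of step 5, under some loud assumptions. To keep it runnable without dependencies, SpanLike and TracerLike are small stand-ins for the Span and Tracer types from @opentelemetry/api; in a real service you would pass the tracer from trace.getTracer. The Step type, the nextStep and runTool callbacks (thin wrappers around chat and executeTool), the MAX_ITERATIONS cap, and the agent.loop.iterations attribute are all names I made up for illustration, not part of the conventions.

```typescript
// Stand-ins for Span and Tracer from @opentelemetry/api (assumption: only the
// two methods used here are modeled).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  end(): void;
}
interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Hypothetical result of one model turn: call a tool, or finish.
type Step =
  | { kind: "tool"; name: string; callId: string; input: unknown }
  | { kind: "final"; text: string };

interface AgentDeps {
  tracer: TracerLike;
  nextStep: (history: unknown[]) => Promise<Step>; // would wrap chat()
  runTool: (name: string, callId: string, input: unknown) => Promise<unknown>; // would wrap executeTool()
}

const MAX_ITERATIONS = 8; // assumed cap; tune for your agent

export async function invokeAgent(
  deps: AgentDeps,
  agentName: string,
  conversationId: string,
  userInput: string,
): Promise<string> {
  return deps.tracer.startActiveSpan(`invoke_agent ${agentName}`, async (span) => {
    span.setAttribute("gen_ai.operation.name", "invoke_agent");
    span.setAttribute("gen_ai.agent.name", agentName);
    span.setAttribute("gen_ai.conversation.id", conversationId);

    const history: unknown[] = [userInput];
    let iterations = 0;
    try {
      while (iterations < MAX_ITERATIONS) {
        iterations += 1;
        const step = await deps.nextStep(history);
        if (step.kind === "final") return step.text;
        // Feed the tool result back so the next model turn can use it.
        history.push(await deps.runTool(step.name, step.callId, step.input));
      }
      throw new Error(`Agent exceeded ${MAX_ITERATIONS} iterations`);
    } finally {
      // Custom attribute: records loop depth on both the success and failure paths.
      span.setAttribute("agent.loop.iterations", iterations);
      span.end();
    }
  });
}
```

Recording the iteration count on the root span is a design choice of this sketch: it lets a backend query for loop depth without counting child spans.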


The exporter speaks OTLP, so the backend is your choice: Jaeger, Grafana Tempo, SigNoz, or a hosted endpoint. It is the same pipeline I run for plain REST services in my API development work, which means agent traces and ordinary API traces land in one backend with one mental model.

How do I track tokens and cost per trace?

Set gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on every chat span and your tracing backend becomes a cost dashboard. Sum the attributes within a trace and you have the exact cost of one task. Group by gen_ai.conversation.id for cost per conversation, or by day for a forecast.

I also write a derived agent.cost.usd attribute so I never join against a price sheet at query time:

TYPESCRIPT
const PRICE: Record<string, { in: number; out: number }> = {
  "claude-opus-4-8": { in: 5 / 1_000_000, out: 25 / 1_000_000 },
};

span.setAttribute(
  "agent.cost.usd",
  res.usage.input_tokens * PRICE[model].in + res.usage.output_tokens * PRICE[model].out,
);
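The fragment above indexes PRICE[model] directly, which throws if a model is missing from the table. Here is a slightly more defensive sketch that also sums cost across the chat calls of one trace. The Usage and Price shapes, the PRICES table (same figures as above, expressed per million tokens), and the function names are mine, chosen for illustration.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

interface Price {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// Assumed price sheet, mirroring the figures used earlier in this post.
const PRICES: Record<string, Price> = {
  "claude-opus-4-8": { inputPerMTok: 5, outputPerMTok: 25 },
};

// Returns undefined for an unknown model instead of throwing, so a missing
// price never breaks the request path.
export function costUsd(model: string, usage: Usage): number | undefined {
  const price = PRICES[model];
  if (!price) return undefined;
  const micros =
    usage.input_tokens * price.inputPerMTok + usage.output_tokens * price.outputPerMTok;
  return micros / 1_000_000;
}

// Sums the cost of every chat call in one trace; unpriced calls count as 0.
export function traceCostUsd(calls: Array<{ model: string; usage: Usage }>): number {
  return calls.reduce((sum, call) => sum + (costUsd(call.model, call.usage) ?? 0), 0);
}
```

In the chat span you would set agent.cost.usd only when costUsd returns a number, so an unpriced model shows up as a missing attribute rather than a misleading zero.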

In my reputation SaaS this paid for itself within a month. The service drafts roughly 2,000 review replies a month, and a prompt template edit once pushed median input tokens per reply from about 1,900 to 3,100. Before tracing I would have found that on the invoice; with a tokens-per-trace chart I found it the next morning. Tracing tells you where the money goes. What to do about it is its own topic, and I covered the fixes in cutting LLM API costs with caching and routing.

How do I wire traces into evals?

Export a sample of production traces and run them through an eval framework offline; you do not need a second instrumentation layer. DeepEval v4.0.3 and Phoenix v16 both shipped on May 21, 2026, and both consume OpenTelemetry traces directly, so the spans you already emit double as eval datasets.

My pipeline is deliberately small. A nightly job queries the backend for the day's invoke_agent traces, samples around 50, and reconstructs each run from its spans: user input, tool sequence, final output. DeepEval scores them for faithfulness and for my own rules, like the reply-length cap on my auto-replies. Scores go back into the same backend as metrics tagged with the trace id, so a regressed score links straight to the exact trace that produced it. Debugging collapses from guessing to reading.
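The reconstruction step can be sketched as a pure function over exported spans. This is an illustration, not my production job: SpanRecord is a hypothetical flat shape (your backend's export format will differ), and reconstructRuns and sample are names invented here. It rebuilds the tool sequence and token totals per trace, then draws a random sample for scoring.

```typescript
// Hypothetical flat span shape; adapt the field names to your backend's export.
interface SpanRecord {
  traceId: string;
  startTimeMs: number;
  attributes: Record<string, string | number | boolean>;
}

interface AgentRun {
  traceId: string;
  toolSequence: string[];
  llmCalls: number;
  inputTokens: number;
  outputTokens: number;
}

const asNumber = (v: unknown): number => (typeof v === "number" ? v : 0);

// Groups spans by trace and rebuilds each run in start-time order.
export function reconstructRuns(spans: SpanRecord[]): AgentRun[] {
  const byTrace = new Map<string, SpanRecord[]>();
  for (const span of spans) {
    const group = byTrace.get(span.traceId) ?? [];
    group.push(span);
    byTrace.set(span.traceId, group);
  }

  const runs: AgentRun[] = [];
  byTrace.forEach((group, traceId) => {
    const ordered = [...group].sort((a, b) => a.startTimeMs - b.startTimeMs);
    const run: AgentRun = { traceId, toolSequence: [], llmCalls: 0, inputTokens: 0, outputTokens: 0 };
    for (const span of ordered) {
      const op = span.attributes["gen_ai.operation.name"];
      if (op === "execute_tool") {
        run.toolSequence.push(String(span.attributes["gen_ai.tool.name"] ?? "unknown"));
      } else if (op === "chat") {
        run.llmCalls += 1;
        run.inputTokens += asNumber(span.attributes["gen_ai.usage.input_tokens"]);
        run.outputTokens += asNumber(span.attributes["gen_ai.usage.output_tokens"]);
      }
    }
    runs.push(run);
  });
  return runs;
}

// Partial Fisher-Yates shuffle: picks up to n items without replacement.
// The random source is injectable so the sampling is testable.
export function sample<T>(items: T[], n: number, random: () => number = Math.random): T[] {
  const pool = [...items];
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}
```

A nightly job along these lines would call sample(reconstructRuns(spans), 50) and hand each run to the eval framework; the user input and final output would come from whatever content you chose to capture.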

Alerts catch loud failures. Evals catch quiet ones, like replies that are polite, on-brand, and wrong. You want both, and traces are the shared substrate.

What should I alert on?

Alert on tool-call error rate, token spikes per trace, and loop depth first. Those three catch most real agent failures: broken integrations, runaway loops, and upstream APIs that change silently. Latency and cost alerts matter too, but they fire after the experience has degraded, while tool signals fire at the cause.







GenAI span attributes used in this post:

| Attribute | What it captures | TypeScript example |
| --- | --- | --- |
| gen_ai.operation.name | Operation type | "chat", "execute_tool", "invoke_agent" |
| gen_ai.provider.name | Model provider | "anthropic" |
| gen_ai.request.model | Model you requested | "claude-opus-4-8" |
| gen_ai.response.model | Model that answered | res.model |
| gen_ai.usage.input_tokens | Prompt tokens billed | res.usage.input_tokens |
| gen_ai.usage.output_tokens | Completion tokens billed | res.usage.output_tokens |
| gen_ai.tool.name | Tool being executed | "fetch_business_profile" |
| gen_ai.tool.call.id | Correlates call to result | block.id |
| gen_ai.conversation.id | Groups traces per conversation | session.id |

Alert signals and starting thresholds:

| Signal | Threshold (starting point) | Failure mode it catches |
| --- | --- | --- |
| Tool-call error rate | Over 5% across 15 minutes | Broken integration, expired credentials, schema drift |
| Empty tool results | Over 10% across 1 hour | Upstream API returns 200 with nothing useful |
| Tokens per trace | Over 3x the 7-day median | Runaway loop, context stuffing, prompt regression |
| Loop depth (LLM calls per trace) | Over 8 iterations | Agent stuck cycling on a tool it cannot satisfy |
| p95 trace latency | Over 2x baseline | Slow tool or provider degradation backing up the queue |

Thresholds are starting points; tune them against a week of your own traffic. The empty-result signal is the one I insist on, because it is the only one that catches a dependency that fails politely.
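To make the thresholds concrete, here is a rough sketch of how the first four signals could be evaluated over a window of per-trace summaries. Most backends express this as alert rules rather than application code, so treat it as an illustration of the logic only. TraceSummary, DEFAULT_THRESHOLDS, and evaluateAlerts are hypothetical names, and for simplicity the sketch uses a single window rather than the separate 15-minute and 1-hour windows, and leaves latency out.

```typescript
// Hypothetical per-trace summary, derived from the spans of one invoke_agent trace.
interface TraceSummary {
  toolCalls: number;
  toolErrors: number;
  emptyToolResults: number;
  llmCalls: number;
  totalTokens: number;
}

interface Thresholds {
  toolErrorRate: number; // fraction of tool calls that errored
  emptyResultRate: number; // fraction of tool calls flagged empty
  tokenMultiple: number; // multiple of the baseline median tokens per trace
  maxLoopDepth: number; // LLM calls allowed per trace
}

// Starting points taken from the thresholds discussed in this section.
const DEFAULT_THRESHOLDS: Thresholds = {
  toolErrorRate: 0.05,
  emptyResultRate: 0.1,
  tokenMultiple: 3,
  maxLoopDepth: 8,
};

export function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Returns the names of the signals that fired for this window.
export function evaluateAlerts(
  window: TraceSummary[],
  baselineMedianTokens: number,
  t: Thresholds = DEFAULT_THRESHOLDS,
): string[] {
  const alerts: string[] = [];
  const toolCalls = window.reduce((n, s) => n + s.toolCalls, 0);
  const toolErrors = window.reduce((n, s) => n + s.toolErrors, 0);
  const emptyResults = window.reduce((n, s) => n + s.emptyToolResults, 0);

  if (toolCalls > 0 && toolErrors / toolCalls > t.toolErrorRate) alerts.push("tool_error_rate");
  if (toolCalls > 0 && emptyResults / toolCalls > t.emptyResultRate) alerts.push("empty_tool_results");
  if (
    baselineMedianTokens > 0 &&
    window.some((s) => s.totalTokens > t.tokenMultiple * baselineMedianTokens)
  ) {
    alerts.push("tokens_per_trace");
  }
  if (window.some((s) => s.llmCalls > t.maxLoopDepth)) alerts.push("loop_depth");
  return alerts;
}
```

The baseline would come from median over the previous seven days of totalTokens values, computed separately from the window being checked so a regression cannot raise its own threshold.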

My verdict after running this in production: you do not need an observability SaaS on day one. OpenTelemetry plus any OTLP backend covers a solo team, but instrument tool calls from day one, because that is where agents actually break. Tool spans also double as an audit trail of every external action your agent took, which becomes a security asset the moment you harden the deployment; my MCP server hardening checklist leans on exactly that trail for incident review.

Key takeaways

  • 97% of companies have deployed AI agents, 79% report significant production challenges, and ~74% of AI customer-service rollouts get rolled back. The gap is observability, not model quality.

  • OTel GenAI client spans went stable in early 2026. The gen_ai.* attributes now cover chat, embeddings, agent orchestration, and MCP tool calls, and are safe to build alerts on.

  • Instrument tool calls before anything else. Tool failures, especially empty 200 responses, are where agents silently break.

  • Token attributes turn traces into cost receipts. A derived cost-per-trace attribute caught a 60% input-token regression in my SaaS the morning after it shipped.

  • DeepEval v4.0.3 and Phoenix v16 (both May 21, 2026) consume OTel traces directly, so production spans double as eval datasets with no second instrumentation layer.

FAQ

What are the OpenTelemetry GenAI semantic conventions?

They are the standard span names and `gen_ai.*` attributes OpenTelemetry defines for generative AI: `chat`, embeddings, `invoke_agent`, and `execute_tool` operations, plus attributes for provider, model, and token usage. GenAI client spans exited experimental status in early 2026, so the core attribute set is now stable enough to build dashboards and alerts on.

How do I trace MCP tool calls?

Wrap each tool execution in a span named `execute_tool {name}` with `gen_ai.operation.name` set to `execute_tool`, plus `gen_ai.tool.name` and `gen_ai.tool.call.id`. The conventions treat MCP tools like any other tool call, so the same span shape works whether the tool runs in process or on a remote MCP server.

Do I need an observability SaaS like LangSmith to monitor agents?

No. OpenTelemetry plus any OTLP backend, such as Jaeger, Grafana Tempo, or SigNoz, covers a solo team or a small product. Vendor platforms add LLM-specific UIs and managed evals that become worth paying for at higher volume, but everything in this post is portable to any of them.

How much overhead does OpenTelemetry add to a Node.js agent?

Effectively none relative to LLM latency. Span creation costs microseconds and exports batch in the background, while an agent turn spends seconds in model and tool time. In my production service the OTLP exporter stays well under 1% CPU; the real cost is backend storage, which trace sampling keeps in check.


Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.
