Evals & observability

Braintrust

By Braintrust

Eval platform for AI products: define test sets, run them across models, and track regressions over time. A common choice for teams shipping LLM features.

Freemium

Best for

systematic AI evals
comparing prompts and models
catching regressions before deploy
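
To make the workflow concrete, here is a minimal sketch of what "define a test set, run it across models, track regressions" can look like in plain Python. It does not use Braintrust's SDK or any real LLM API: the test set, the two stand-in model functions, the exact-match scorer, and the baseline scores are all hypothetical, chosen only to illustrate the loop.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test set: (input, expected output) pairs.
TEST_SET: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def exact_match_score(model: Callable[[str], str],
                      test_set: List[Tuple[str, str]]) -> float:
    """Fraction of test cases where the model output equals the expected answer."""
    hits = sum(1 for prompt, expected in test_set
               if model(prompt).strip() == expected)
    return hits / len(test_set)


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Names of models whose score dropped by more than `tolerance` vs. the baseline."""
    return sorted(name for name, score in current.items()
                  if name in baseline and baseline[name] - score > tolerance)


# Stand-in "models": plain lookup functions, not real LLM calls (assumption).
def model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    # Deliberately wrong on one case to simulate a regression.
    answers = {"2 + 2": "4", "capital of France": "Lyon", "3 * 3": "9"}
    return answers.get(prompt, "")


# Run the same test set across both models.
scores = {name: exact_match_score(fn, TEST_SET)
          for name, fn in {"model_a": model_a, "model_b": model_b}.items()}

# Hypothetical scores recorded from a previous run.
baseline = {"model_a": 1.0, "model_b": 1.0}

# Compare against the baseline before deploying.
regressions = find_regressions(baseline, scores)
```

In practice the scorer would usually be something richer than exact match (an LLM judge or a similarity metric), and the baseline would be loaded from stored results of an earlier run rather than hard-coded.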

Other Evals & observability

Helicone

LLM observability and logging proxy. A one-line code change logs every prompt, response, cost, and latency across providers.

Langfuse

Open-source LLM engineering platform with tracing, evals, prompt management, and dataset tools. Self-hostable or cloud.

Arize Phoenix

Open-source LLM tracing and eval tool from Arize. Built around OpenTelemetry — good fit if you already use OTEL elsewhere.

Patronus AI

Eval and guardrails platform focused on enterprise safety — hallucination detection, PII checks, and policy compliance for LLM outputs.

LangSmith

Observability and eval platform from the LangChain team. Tight integration if you're building agents with LangChain or LangGraph.

Langtrace

Open-source, OpenTelemetry-based end-to-end observability tool with real-time tracing, evals, and metrics for LLM apps.

Promptfoo

Open-source LLM testing and red-teaming framework that runs evals and security scans against AI apps, agents, and RAG pipelines.

Traceloop (OpenLLMetry)

LLM reliability platform built on OpenTelemetry that turns evals and monitors into a continuous release feedback loop.