Introduction
The other day, an email popped up in my inbox. It was from a project called TensorZero, and they wanted to tell me about it. I was curious, so I decided to check it out. It isn't the first time I've received an email like this, but this project seemed interesting, and since I love to talk about LLMs, I thought it would be a good idea to write about it.
TensorZero is an open source project that creates a feedback loop for optimizing LLM applications. This means that they are creating a system that allows LLMs to learn from their mistakes and improve over time. For example, imagine you have a customer service chatbot - TensorZero can help identify when the bot gives incorrect or unhelpful responses, and use that data to fine-tune the model. Or consider a code completion tool - if developers frequently reject certain suggestions, TensorZero can help adapt the model to provide more relevant completions. The system collects this real-world feedback data, curates it into high-quality training examples, and uses it to continuously improve the underlying models. This creates a virtuous cycle where the LLM gets better the more it's used in production.
The project is fully open source and provides a comprehensive data and learning flywheel for LLMs by unifying several key components. At its core is a model gateway that lets you integrate with a single API for all LLMs, adding minimal overhead (less than 1ms P99). You can easily send metrics and feedback, which gets stored in your database for observability. The system then helps you optimize various aspects - from prompt engineering to model fine-tuning and reinforcement learning. It also includes built-in experimentation capabilities like A/B testing, intelligent routing between models, and fallback options. This creates a complete feedback loop where you can continuously monitor performance, gather real-world usage data, and leverage it to improve your LLM applications over time. The end result is that your models become progressively smarter, faster, and more cost-effective through actual production use.
How it works
TensorZero's architecture consists of several key components that work together to create a complete feedback loop. At the center of it all is the TensorZero Gateway, a high-performance model gateway written in Rust that serves as the main interface between your application and various LLM providers.
The system works through the following steps:
- Application Integration: Your application connects to the TensorZero Gateway through two main channels (sketched in code after this list):
  - Structured Inference: Handles schema-based inference with minimal latency overhead (less than 1ms P99)
  - Metrics & Feedback: Collects performance data and user feedback from your LLM interactions
- Data Storage: All interactions and metrics are stored in a ClickHouse database as structured logs, providing real-time, scalable, and developer-friendly analytics.
- Optimization Pipeline: The collected data flows through TensorZero Recipes, which use this information to optimize:
  - Prompts
  - Models
  - Generation strategies
- Continuous Improvement: The system creates a feedback loop where optimized prompts and models are fed back into the gateway, allowing for continuous improvement based on real-world usage.
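To make this concrete, here is a minimal sketch of what those two channels look like from the application side: one call to the gateway's `/inference` endpoint and one to `/feedback` (both routes, and the default port 3000, appear in the gateway code later in this post). The function name, metric name, and exact JSON fields are illustrative assumptions rather than a verbatim copy of TensorZero's API, so check their documentation for the real schema.

```rust
use reqwest::Client;
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Channel 1: structured inference through the gateway.
    // `extract_entities` and the input shape are hypothetical for this sketch.
    let inference: Value = client
        .post("http://localhost:3000/inference")
        .json(&json!({
            "function_name": "extract_entities",
            "input": {
                "messages": [{ "role": "user", "content": "TensorZero is written in Rust." }]
            }
        }))
        .send()
        .await?
        .json()
        .await?;
    println!("inference response: {inference}");

    // Channel 2: report feedback on that inference so it ends up in ClickHouse
    // and can later be used by the optimization recipes.
    // `task_success` is a hypothetical metric name.
    client
        .post("http://localhost:3000/feedback")
        .json(&json!({
            "metric_name": "task_success",
            "inference_id": inference["inference_id"],
            "value": true
        }))
        .send()
        .await?
        .error_for_status()?;

    Ok(())
}
```

The important part is the pairing: every inference is logged, and any later signal about its quality can be attached to it, which is exactly the data the optimization step consumes.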
The gateway also includes built-in experimentation features and GitOps orchestration, enabling teams to confidently iterate and deploy their LLM applications, whether they're working with a single model or thousands of them.
The TensorZero Gateway
The TensorZero Gateway is a high-performance model gateway that provides a unified interface for all your LLM applications. Here are its key features:
One API for All LLMs
The gateway provides a unified interface for all major LLM providers, allowing for seamless cross-platform integration and fallbacks. TensorZero natively supports:
- Anthropic
- AWS Bedrock
- Azure OpenAI Service
- Fireworks
- GCP Vertex AI (Anthropic & Gemini)
- Google AI Studio (Gemini API)
- Hyperbolic
- Mistral
- OpenAI
- Together
- vLLM
- xAI
Additionally, TensorZero integrates with any OpenAI-compatible API (e.g. Ollama). If your provider isn't supported, you can open an issue on GitHub for integration.
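Since the gateway also exposes an OpenAI-compatible route (`/openai/v1/chat/completions`, visible in the router code later in this post), pointing an existing OpenAI integration at it can be as simple as changing the base URL. A rough sketch, with the caveat that how the `model` field maps to a gateway function or model is an assumption on my part:

```rust
use reqwest::Client;
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same request shape as OpenAI's chat completions API, but sent to the
    // TensorZero Gateway instead of api.openai.com.
    let response: Value = Client::new()
        .post("http://localhost:3000/openai/v1/chat/completions")
        .json(&json!({
            "model": "my_chatbot", // placeholder; see the TensorZero docs for the exact naming
            "messages": [{ "role": "user", "content": "Hello from the gateway!" }]
        }))
        .send()
        .await?
        .json()
        .await?;

    println!("{response}");
    Ok(())
}
```

Any client library that lets you override the base URL can talk to the gateway this way.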
Performance
Written in Rust 🦀, the gateway achieves <1ms P99 latency overhead under extreme load. Benchmarks show it significantly outperforms alternatives: while LiteLLM adds 25-100x+ more latency at 100 QPS, TensorZero sustains its sub-millisecond overhead even at 10,000 QPS.
Key Features
- Structured Inferences: Enforces schemas for inputs and outputs, ensuring application robustness. This structured data enables powerful optimization recipes.
- Multi-Step LLM Workflows: First-class support for complex workflows by associating multiple inferences with episodes. Enables feedback at both the inference and episode level (see the sketch after this list).
- Built-in Observability: Collects structured inference traces with metrics and feedback, stored in ClickHouse for real-time analytics.
- Experimentation Capabilities: Automatic traffic routing between variants for A/B testing, maintaining consistency within episodes.
- Automatic Fallbacks: Handles failed inferences by routing to different providers or variants, ensuring high availability.
- GitOps Orchestration: Manage prompts, models, parameters, tools, and experiments through GitOps-friendly configuration - suitable for both manual and programmatic management.
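To illustrate the episode idea behind the multi-step workflows feature, here is a small sketch: two related inferences share an episode, and feedback is attached to the episode as a whole rather than to a single inference. The function names, the metric name, and the exact field names (`episode_id` and friends) are assumptions for illustration.

```rust
use reqwest::Client;
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let gateway = "http://localhost:3000";

    // Step 1 of a hypothetical multi-step workflow: draft an answer.
    let draft: Value = client
        .post(format!("{gateway}/inference"))
        .json(&json!({
            "function_name": "draft_answer",
            "input": { "messages": [{ "role": "user", "content": "Summarize this support ticket..." }] }
        }))
        .send()
        .await?
        .json()
        .await?;

    // Step 2: refine the draft, reusing the episode id returned by step 1 so
    // both inferences are grouped into the same episode.
    let episode_id = draft["episode_id"].clone();
    let _refined: Value = client
        .post(format!("{gateway}/inference"))
        .json(&json!({
            "function_name": "refine_answer",
            "episode_id": episode_id.clone(),
            "input": { "messages": [{ "role": "user", "content": "Make it shorter." }] }
        }))
        .send()
        .await?
        .json()
        .await?;

    // Episode-level feedback: one signal for the whole workflow.
    // `customer_satisfied` is a hypothetical metric.
    client
        .post(format!("{gateway}/feedback"))
        .json(&json!({
            "metric_name": "customer_satisfied",
            "episode_id": episode_id,
            "value": true
        }))
        .send()
        .await?
        .error_for_status()?;

    Ok(())
}
```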
The code
I am not writing this just to regurgitate their documentation; I also reviewed the code. As I said before, the gateway is written in Rust. Its entry point is a single file of roughly 200 lines that handles the requests and responses from the application, though it leans heavily on internal libraries from the TensorZero project.
use axum::routing::{get, post};
use axum::Router;
use clap::Parser;
use mimalloc::MiMalloc;
use std::fmt::Display;
use std::io::ErrorKind;
use std::net::SocketAddr;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::time::Duration;
use tokio::signal;
use tensorzero_internal::clickhouse::ClickHouseConnectionInfo;
use tensorzero_internal::config_parser::Config;
use tensorzero_internal::endpoints;
use tensorzero_internal::endpoints::status::TENSORZERO_VERSION;
use tensorzero_internal::error;
use tensorzero_internal::gateway_util;
use tensorzero_internal::observability;
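// Use mimalloc as the global allocator instead of the system default.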
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
#[derive(Parser, Debug)]
#[command(version, about)]
struct Args {
/// Path to tensorzero.toml
#[arg(long)]
config_file: Option<PathBuf>,
/// Deprecated: use `--config-file` instead
tensorzero_toml: Option<PathBuf>,
}
#[tokio::main]
async fn main() {
// Set up logs and metrics
observability::setup_logs(true);
let metrics_handle = observability::setup_metrics().expect_pretty("Failed to set up metrics");
let args = Args::parse();
if args.tensorzero_toml.is_some() && args.config_file.is_some() {
tracing::error!("Cannot specify both `--config-file` and a positional path argument");
std::process::exit(1);
}
if args.tensorzero_toml.is_some() {
tracing::warn!(
"Specifying a positional path argument is deprecated. Use `--config-file path/to/tensorzero.toml` instead."
);
}
let config_path = args.config_file.or(args.tensorzero_toml);
let config = if let Some(path) = &config_path {
Arc::new(Config::load_from_path(Path::new(&path)).expect_pretty("Failed to load config"))
} else {
tracing::warn!("No config file provided, so only default functions will be available. Use `--config-file path/to/tensorzero.toml` to specify a config file.");
Arc::new(Config::default())
};
// Initialize AppState
let app_state = gateway_util::AppStateData::new(config.clone())
.await
.expect_pretty("Failed to initialize AppState");
// Create a new observability_enabled_pretty string for the log message below
let observability_enabled_pretty = match &app_state.clickhouse_connection_info {
ClickHouseConnectionInfo::Disabled => "disabled".to_string(),
ClickHouseConnectionInfo::Mock { healthy, .. } => {
format!("mocked (healthy={healthy})")
}
ClickHouseConnectionInfo::Production { .. } => "enabled".to_string(),
};
// Set debug mode
error::set_debug(config.gateway.debug).expect_pretty("Failed to set debug mode");
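// Wire up the gateway's HTTP routes: inference, batch inference, the
// OpenAI-compatible endpoint, feedback, status/health checks, and metrics.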
let router = Router::new()
.route("/inference", post(endpoints::inference::inference_handler))
.route(
"/batch_inference",
post(endpoints::batch_inference::start_batch_inference_handler),
)
.route(
"/batch_inference/:batch_id",
get(endpoints::batch_inference::poll_batch_inference_handler),
)
.route(
"/batch_inference/:batch_id/inference/:inference_id",
get(endpoints::batch_inference::poll_batch_inference_handler),
)
.route(
"/openai/v1/chat/completions",
post(endpoints::openai_compatible::inference_handler),
)
.route("/feedback", post(endpoints::feedback::feedback_handler))
.route("/status", get(endpoints::status::status_handler))
.route("/health", get(endpoints::status::health_handler))
.route(
"/metrics",
get(move || std::future::ready(metrics_handle.render())),
)
.fallback(endpoints::fallback::handle_404)
.with_state(app_state);
// Bind to the socket address specified in the config, or default to 0.0.0.0:3000
let bind_address = config
.gateway
.bind_address
.unwrap_or_else(|| SocketAddr::from(([0, 0, 0, 0], 3000)));
let listener = match tokio::net::TcpListener::bind(bind_address).await {
Ok(listener) => listener,
Err(e) if e.kind() == ErrorKind::AddrInUse => {
tracing::error!(
"Failed to bind to socket address {bind_address}: {e}. Tip: Ensure no other process is using port {} or try a different port.",
bind_address.port()
);
std::process::exit(1);
}
Err(e) => {
tracing::error!("Failed to bind to socket address {bind_address}: {e}");
std::process::exit(1);
}
};
let config_path_pretty = if let Some(path) = &config_path {
format!("config file `{}`", path.to_string_lossy())
} else {
"no config file".to_string()
};
tracing::info!(
"TensorZero Gateway version {TENSORZERO_VERSION} is listening on {bind_address} with {config_path_pretty} and observability {observability_enabled_pretty}.",
);
axum::serve(listener, router)
.with_graceful_shutdown(shutdown_signal())
.await
.expect_pretty("Failed to start server");
}
pub async fn shutdown_signal() {
let ctrl_c = async {
signal::ctrl_c()
.await
.expect_pretty("Failed to install Ctrl+C handler");
};
#[cfg(unix)]
let terminate = async {
signal::unix::signal(signal::unix::SignalKind::terminate())
.expect_pretty("Failed to install SIGTERM handler")
.recv()
.await;
};
#[cfg(unix)]
let hangup = async {
signal::unix::signal(signal::unix::SignalKind::hangup())
.expect_pretty("Failed to install SIGHUP handler")
.recv()
.await;
};
tokio::select! {
_ = ctrl_c => {
tracing::info!("Received Ctrl+C signal");
}
_ = terminate => {
tracing::info!("Received SIGTERM signal");
}
_ = hangup => {
tokio::time::sleep(Duration::from_secs(1)).await;
tracing::info!("Received SIGHUP signal");
}
};
}
/// ┌────────────────────────────┐
/// │    MAIN.RS ESCAPE HATCH    │
/// └────────────────────────────┘
///
/// We don't allow panic, escape, unwrap, or similar methods in the codebase,
/// except for the private `expect_pretty` method, which is to be used only in
/// main.rs during initialization. After initialization, we expect all code to
/// handle errors gracefully.
///
/// We use `expect_pretty` for better DX when handling errors in main.rs.
/// `expect_pretty` will print an error message and exit with a status code of 1.
trait ExpectPretty<T> {
fn expect_pretty(self, msg: &str) -> T;
}
impl<T, E: Display> ExpectPretty<T> for Result<T, E> {
fn expect_pretty(self, msg: &str) -> T {
match self {
Ok(value) => value,
Err(err) => {
tracing::error!("{msg}: {err}");
std::process::exit(1);
}
}
}
}
impl<T> ExpectPretty<T> for Option<T> {
fn expect_pretty(self, msg: &str) -> T {
match self {
Some(value) => value,
None => {
tracing::error!("{msg}");
std::process::exit(1);
}
}
}
}
This code represents the main entry point of the TensorZero gateway. It starts by setting up essential components like logging and metrics tracking. The program accepts a configuration file path as input, which can be specified using the `--config-file` flag. After loading the configuration, it initializes the application state and sets up various HTTP endpoints for handling different types of requests, including inference, batch processing, feedback collection, and health checks. The server listens on a configurable address (defaulting to 0.0.0.0:3000) and includes graceful shutdown handling for various signals (Ctrl+C, SIGTERM, SIGHUP). The code also implements a custom error handling trait called `ExpectPretty` that provides user-friendly error messages during the initialization phase. This implementation showcases Rust's strong error handling capabilities while maintaining clean and maintainable code.
As you can see, it pulls in a lot of internal libraries. That is far too much code to review here, but what I did read is pretty straightforward to follow. I really like that they wrote all of this in Rust; you would usually see a lot of Python slop for this kind of tooling.
But that's only the TensorZero Gateway.
The TensorZero Recipes
TensorZero also provides recipes for different aspects of LLM applications. For example, they have recipes for fine-tuning models on the inference and feedback data collected by the gateway.
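To give a feel for what a recipe does, here is a minimal sketch of the data-curation step in Rust: it queries ClickHouse over its HTTP interface (port 8123 by default) for inferences that received positive feedback and prints them as candidate fine-tuning examples. The table and column names are placeholders I made up for illustration; the actual recipes live in the TensorZero repository and use its real schema.

```rust
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical query: join inferences with positive boolean feedback.
    // `ChatInference` and `BooleanMetricFeedback` are illustrative names only.
    let sql = r#"
        SELECT i.input, i.output
        FROM ChatInference AS i
        INNER JOIN BooleanMetricFeedback AS f ON f.target_id = i.id
        WHERE f.value = true
        FORMAT JSONEachRow
    "#;

    // ClickHouse accepts SQL over plain HTTP, so a POST with the query as the
    // body is enough for a quick export.
    let body = Client::new()
        .post("http://localhost:8123/")
        .body(sql)
        .send()
        .await?
        .text()
        .await?;

    // Each line is one JSON object: a candidate example for a fine-tuning dataset.
    for line in body.lines() {
        println!("{line}");
    }
    Ok(())
}
```

From there, a recipe would format these examples for a provider's fine-tuning API and kick off a training job, closing the loop described earlier.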
GUI
This project is really big: they also have a GUI built with React and TypeScript. You can see more information about it here. It is really well done, and so is the project overall.
Conclusion
This project is really cool. I like that they used Rust for the gateway, and the whole thing is well put together. I am sure it will be useful for a lot of people.
If you want to know more about the project, you can visit the official website.