
Inference might be the least glamorous part of modern AI stacks, but it’s often where product experiences live or die. That’s why a fresh funding headline caught our eye at AI Tech Inspire: Baseten just raised $150M at a $2.1B valuation to double down on inference infrastructure. The pitch is simple: make serving models fast, efficient, and developer-friendly — and businesses will build more on top.
Quick facts (what’s actually being claimed)
- Baseten raised $150M in a Series D round at a $2.1B valuation.
- The company focuses on inference infrastructure: low-latency model serving, throughput optimizations, and developer experience.
- Baseten has shared benchmarks claiming its embeddings inference outperforms vLLM and TEI on throughput and latency.
- The strategic bet: inference infrastructure — not training — is the primary pain point for teams in production.
- Rivals such as Fireworks and Together are competing on the same axes: low latency and high throughput.
- Counterpoint in the industry: the bigger cost sinks are cold starts and low GPU utilization; elastically serving multiple models without waste remains unsolved at scale.
- Open questions: Will latency/throughput wins be enough to differentiate? Is GPU utilization the deeper bottleneck? Does inference infra commoditize like training, or is there room for defensible platforms?
Why this matters: inference is where user experience meets unit economics
The last two years focused the spotlight on training large models. But for most builders, the day-to-day pain is not custom pretraining — it’s keeping production inference both fast and affordable. Low p95 latency determines whether a chat UI feels responsive. High throughput determines whether your cost of goods sold scales with usage. And clever scheduling determines whether you’re paying for idle GPUs during traffic lulls.
Baseten’s focus sits squarely here: aggressive batching, concurrency control, request routing, and efficient memory/compute use for embeddings and generation workloads. Their benchmarks specifically call out embeddings serving — a workload where batching can be very effective — claiming an edge over vLLM and Hugging Face’s TEI. While those claims need independent verification under real customer conditions, the direction lines up with what many teams report: inference, not training, often drives the production headaches.
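To make the batching point concrete, here’s a minimal sketch. It assumes PyTorch and uses a small stand-in Transformer encoder — not Baseten’s, vLLM’s, or TEI’s engine — purely to show how requests-per-second moves with batch size:

```python
# Minimal sketch: how much batching alone can move embedding throughput.
# The "model" is a stand-in Transformer encoder, not any vendor's engine;
# the point is only the per-request vs. batched comparison.
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True),
    num_layers=4,
).to(device).eval()

# 512 fake "requests", each a sequence of 128 pre-embedded tokens.
requests = torch.randn(512, 128, 384, device=device)

@torch.no_grad()
def throughput(batch_size: int) -> float:
    start = time.perf_counter()
    for i in range(0, requests.shape[0], batch_size):
        encoder(requests[i : i + batch_size])
    if device == "cuda":
        torch.cuda.synchronize()
    return requests.shape[0] / (time.perf_counter() - start)

for bs in (1, 8, 64):
    print(f"batch_size={bs:<3} -> {throughput(bs):,.1f} requests/sec")
```

On a GPU the jump from batch size 1 to 64 is usually dramatic, which is exactly why embeddings are the workload where serving engines compete hardest on batching.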
The real bottleneck: latency/throughput or utilization?
There are two levels of the inference problem:
- Per-request performance: Reduce p50/p95 latency and maximize tokens/sec or vectors/sec. Techniques include continuous batching, KV cache reuse, tensor-parallel execution, quantization, and optimized runtimes like TensorRT or Triton Inference Server. (A simplified batching sketch follows this list.)
- Fleet efficiency: Keep GPUs busy across spiky, multi-tenant workloads while meeting SLAs. This is about elastic scaling, weight loading, model multiplexing, and minimizing cold starts.
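To ground the first bullet, here is a simplified dynamic-batching loop: collect requests until a batch fills or a small wait budget expires, then run one model call. It’s a sketch of the idea only — vLLM’s continuous batching goes further by rescheduling at every generation step — and `run_batch` is a placeholder for whatever model call a given stack uses.

```python
# Simplified dynamic batching: trade a bounded amount of queueing delay
# (MAX_WAIT_MS) for larger, more GPU-friendly batches.
import asyncio
import time

MAX_BATCH_SIZE = 32
MAX_WAIT_MS = 5.0

queue: asyncio.Queue = asyncio.Queue()

async def submit(payload):
    """Called by request handlers; resolves when the batch containing it finishes."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    return await fut

async def batcher(run_batch):
    """Single background task that forms batches and dispatches them."""
    while True:
        payload, fut = await queue.get()           # wait for the first request
        batch = [(payload, fut)]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # One forward pass for the whole batch; a real server would run this
        # in an executor or on a dedicated GPU stream instead of blocking the loop.
        results = run_batch([p for p, _ in batch])
        for (_, f), result in zip(batch, results):
            f.set_result(result)
```

The interesting knobs are exactly the ones in the evaluation checklist later in this piece: max batch size, max queue time, and how they interact with tail latency.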
It’s easy to win benchmarks on tightly controlled, warm, single-model scenarios. Real systems are messier: multiple models, mixed request sizes, uneven traffic, and strict SLOs. In that world, the bottleneck often shifts to utilization. If your GPU sits under 30% busy because you can’t batch across tenants or your autoscaler cold-starts too slowly, theoretical throughput doesn’t help your bill.
Takeaway: Per-request speed wins headlines; fleet-level utilization wins margins.
Cold starts and multi-model elasticity: the silent tax
Teams serving many models discover a recurring tax: cold starts. Spinning up new replicas, downloading weights, initializing kernels, and warming caches can add seconds to minutes — hurting both latency and cost. If you use Kubernetes or serverless patterns like Knative, the problem compounds when scale-to-zero meets large weight files.
Practical mitigations the industry gravitates toward:
- Warm pools: Keep a small pool of hot replicas per model or per architecture to avoid cold start spikes (a minimal sketch follows this list).
- Weight sharing and on-demand paging: Shared memory maps and smart caching reduce IO for repeat loads.
- Multiplexing: Serve multiple models on the same GPU when their peaks don’t overlap, using techniques like NVIDIA MIG and MPS to partition or co-schedule workloads.
- Dynamic and continuous batching: Combine small requests into larger GPU-friendly batches while preserving tail latency.
- Model specialization: Separate paths for embeddings vs. generation; the former is more batchable and easier to fully saturate.
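As a concrete illustration of the warm-pool idea, here’s a minimal sketch: an LRU-style pool that keeps a handful of models resident so repeat traffic skips the load-and-initialize cold start. The `load_model` callable and `POOL_SIZE` are placeholders, not any provider’s API.

```python
# Minimal warm pool: keep up to POOL_SIZE models resident, evict LRU.
# load_model() is a placeholder for whatever loader a stack uses
# (Hugging Face from_pretrained, TorchScript, an ONNX session, ...).
import threading
from collections import OrderedDict

POOL_SIZE = 4

class WarmPool:
    def __init__(self, load_model, pool_size: int = POOL_SIZE):
        self._load = load_model
        self._pool_size = pool_size
        self._models = OrderedDict()   # model_id -> loaded model, in LRU order
        self._lock = threading.Lock()

    def get(self, model_id: str):
        with self._lock:
            if model_id in self._models:            # warm hit: no load cost
                self._models.move_to_end(model_id)
                return self._models[model_id]
        model = self._load(model_id)                # cold path: pay the load once
        with self._lock:
            self._models[model_id] = model
            self._models.move_to_end(model_id)
            while len(self._models) > self._pool_size:
                self._models.popitem(last=False)    # evict least-recently-used
        return model
```

A production scheduler would also weigh weight size and per-model SLAs before evicting, and would disable scale-to-zero for the hottest models; the sketch only shows the shape of the trade-off.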
Where does this leave providers? If a platform can maintain high GPU occupancy while meeting strict p95 and p99 latencies — even under bursty, multi-model traffic — it gains a structural advantage. That’s the deeper game beyond headline tokens/sec.
How to evaluate an inference platform (a developer’s checklist)
Whether you’re looking at Baseten, Fireworks, Together, or rolling your own, ask for metrics and controls that expose the real trade-offs:
- Utilization visibility: Can you see per-GPU busy time and memory headroom? What is the average SM% and memory bandwidth utilization under your actual workload? (A sampling sketch follows this checklist.)
- Latency under load: Don’t just look at p50. Request p95/p99 at target concurrency and with mixed request sizes (short and long prompts).
- Queueing and batching policy: How does continuous batching affect tail latency? Are there knobs for max batch size, max queue time, or per-tenant fairness?
- Cold start behavior: What’s the cold start time for your largest model? Are there warm pools or preloading APIs to manage it?
- Elasticity: Time to scale from N to N+k replicas during a surge. Are weights pulled from local cache or remote storage each time?
- Cost metrics: Track cost by busy-minute, not wall-clock. Ask for tokens/sec or vectors/sec per dollar over a 24-hour trace, not a 30-second demo.
- Developer experience: How clean is the SDK? Does it integrate with PyTorch, TensorFlow, and CUDA pipelines? Can you deploy models from Hugging Face with one manifest?
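For the utilization-visibility item, you don’t have to wait for a vendor dashboard; a few lines of NVML polling give a first-order view. This assumes the `nvidia-ml-py` (pynvml) package and an NVIDIA GPU, and note that NVML’s “gpu” utilization is a coarse busy-time signal, not true SM occupancy (DCGM exposes finer-grained counters).

```python
# Coarse GPU busy-time and memory sampling via NVML (pip install nvidia-ml-py).
# util.gpu = % of the sample period in which at least one kernel was running;
# it is a proxy for "busy", not a real SM-occupancy metric.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    for _ in range(60):                       # one-minute sample at 1 Hz
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(
            f"gpu_busy={util.gpu:3d}%  "
            f"mem_bw={util.memory:3d}%  "
            f"mem_used={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB"
        )
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

If the busy percentage sits well below 30% while your bill keeps climbing, the bottleneck is fleet efficiency, not kernel speed.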
Run a realistic canary: embeddings-heavy traffic in the morning, long-context GPT-style generation mid-day, image generation with Stable Diffusion after hours. The best infra balances batching with latency guarantees without wasting GPU time.
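The latency side of a bake-off can be equally small. The sketch below is a generic async load generator, assuming `httpx` is installed; the endpoint URL and payload are placeholders for whichever platform you’re testing, and real runs should replay your production request mix rather than a fixed prompt.

```python
# Generic latency canary: fire CONCURRENCY parallel streams of requests and
# report p50/p95/p99. ENDPOINT and the payload are placeholders.
import asyncio
import statistics
import time

import httpx

ENDPOINT = "https://your-inference-endpoint.example.com/v1/embeddings"  # placeholder
CONCURRENCY = 16
REQUESTS_PER_WORKER = 50

async def worker(client: httpx.AsyncClient, latencies: list[float]) -> None:
    for _ in range(REQUESTS_PER_WORKER):
        start = time.perf_counter()
        resp = await client.post(ENDPOINT, json={"input": "hello world"})
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

async def main() -> None:
    latencies: list[float] = []
    async with httpx.AsyncClient(timeout=30.0) as client:
        await asyncio.gather(*(worker(client, latencies) for _ in range(CONCURRENCY)))
    cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
    print(
        f"n={len(latencies)}  p50={cuts[49]:.1f} ms  "
        f"p95={cuts[94]:.1f} ms  p99={cuts[98]:.1f} ms"
    )

asyncio.run(main())
```

Mix short and long payloads and re-run during a simulated traffic spike; the interesting number is how much p99 degrades relative to p50.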
Are latency and throughput enough to differentiate?
Short-term, yes — especially for embeddings and smaller models where batching and kernel-level optimizations translate directly to lower costs. If Baseten’s embeddings engine consistently beats vLLM and TEI on both throughput and latency in real-world traffic, that’s compelling for vector-heavy stacks: semantic search, RAG pre-processing, and analytics pipelines.
Long-term, differentiation likely moves up-stack:
- Workload-aware scheduling: Automatically detect request shapes and route to the most efficient pool, e.g., embeddings vs. generation vs. vision (a toy routing sketch follows this list).
- Multi-tenant optimization: Smarter co-location to drive near-constant GPU occupancy without SLA misses.
- Observability: First-class tracing of KV cache hits, batch composition, queue times, and per-tenant fairness.
- Model lifecycle: Easy promotion/rollback, A/B testing, canarying, and cost/latency budget enforcement.
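The first bullet is easy to picture with a toy router: inspect the request shape and send it to the pool that serves that shape most efficiently. Pool names, fields, and thresholds here are hypothetical, a sketch of the dispatch logic rather than any platform’s scheduler.

```python
# Toy workload-aware router: classify a request by shape, pick a pool.
# Pool names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Request:
    task: str             # "embed", "generate", or "vision"
    prompt_tokens: int
    max_new_tokens: int = 0

def route(req: Request) -> str:
    if req.task == "embed":
        return "embeddings-pool"            # highly batchable; saturate with big batches
    if req.task == "vision":
        return "vision-pool"                # different kernels and memory profile
    if req.prompt_tokens + req.max_new_tokens > 8192:
        return "long-context-pool"          # long sequences: protect KV-cache headroom
    return "general-generation-pool"

print(route(Request(task="embed", prompt_tokens=32)))                            # embeddings-pool
print(route(Request(task="generate", prompt_tokens=6000, max_new_tokens=4000)))  # long-context-pool
```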
If multiple platforms converge on similar raw performance, the edges will be in fleet efficiency and developer ergonomics. Think: Can I ship faster, with clearer visibility, at a predictable cost?
Commoditization vs. defensibility
Training infrastructure commoditized rapidly as the community standardized around PyTorch, open weights, and cloud GPU offerings. Inference might follow — but not entirely. There’s a strong case that the “last mile” of serving (cold starts, multiplexing, policy controls, caching) remains deeply operational and data-dependent. Platforms that nail cross-tenant packing and provide simple controls for SLAs and budgets could retain an edge.
That said, open projects and patterns tend to diffuse fast. vLLM elevated continuous batching for LLMs; providers and internal teams adopted similar ideas. If differentiation sits only in kernel-level speedups, it’s harder to defend. If it sits in day-2 operations (observability, policy, elasticity) and meaningful cost guarantees, there’s room for durable value.
What to do next
For teams with heavy embeddings traffic, the reported gains are reason enough to run a bake-off. Recreate your production patterns: mix batch sizes, add traffic spikes, include cold starts, and measure p95/p99 alongside GPU busy time. For chat and long-context generation, validate latency under mixed prompt lengths and ensure you can cap queue times without killing utilization.
At AI Tech Inspire, we see a pragmatic takeaway:
Winning platforms won’t just be faster — they’ll be better at keeping GPUs busy without breaking your SLOs.
That’s the question to put to every provider, Baseten included: not only “How fast is your engine?” but “How well do you drive in traffic?” If the answer includes clear utilization metrics, predictable scaling behavior, and knobs for batching, queueing, and cold starts, you’re on the right road.