Modal Auto Endpoints: Optimized inference you own

All posts Back News June 23, 2026•5 minute read Introducing Modal Auto Endpoints: Optimized inference you actually own Charles Frye@charles_irl Member of Technical Staff Deven Navani@DevenNavani Member of Technical Staff Hari Subbaraj@hsubbaraj Member of Technical Staff Greta Workman@gretaworkman Product Marketing Richard Gong@_gongy Member of Technical Staff Modal allows leading teams like Cognition, Decagon, Fathom, and DoorDash to own their inference without compromising on cost-performance or developer velocity. Now you can do the same with a single command: modal endpoint create --name agent --model zai-org/GLM-5.2-FP8Introducing Modal Auto Endpoints: a smooth, self-serve on-ramp to production-grade LLM inference. Take it for a spin right now, or read on to learn more about how we built it and why. Proprietary model providers can silently degrade models or suddenly retract access. If you don't own your inference, you don't own your destiny. If you work with open models served by an inference provider, you gain some control. But we think ownership runs deeper than the API. To actually own your inference, you need to own, understand, and optimize the code that runs the inference. Managed inference providers make it easy to get an API, but the serving stack is a black box. So until now, teams that wanted proper ownership of their inference have had only one option: roll an inference service yourself. That gives you control, but now you own a lot more than just inference: engine tuning, endpoint benchmarking, container deployment, replica autoscaling & routing, and inference metrics. That's why we built Modal Auto Endpoints, and why they look very different from what's offered by traditional inference providers. A Modal Endpoint is an OpenAI API-compatible, production-ready service, backed by a Modal App that you can see and control. There are three key differences in this approach: We can deliver all of this because we are building on a rock-solid foundation: Modal's AI infrastructure platform. Our users build on this platform to fold proteins, drive robots, and make music. The same fundamental components that work there also work for LLM inference, hand-rolled or via Auto Endpoints. With Modal, you don’t need to reserve months of expensive GPU capacity to handle load you can’t estimate. Instead, you pay for what you use, as you use it, and scale to meet demand with our high-performance autoscaling system and custom container runtime. You can use GPUs around the world, or close to your users, without worrying about capacity management. That’s our calling card, and that’s not changing. We’ve also added and released from beta a new fundamental component to our system to support the demands of low latency inference: Modal Servers for ultra-low-latency routing. Modal Servers keep the elastic scaling and deep compute capacity of Modal Web Functions. But they remove queueing and are regionalized by default so that you can serve HTTP requests on Modal with only 5ms overhead -- without compromising on reliability and autoscaling. More on how we built that later this week. Inference engines are akin to database management systems like PostgreSQL: complex, mission-critical software that must perform at the limits of the hardware. As with databases, this software has complex internals exposed by multitudinous knobs, and achieving the best performance possible requires learning to tune those knobs. That’s a tough grind. When a team is looking to own inference but used to building on proprietary model APIs, it is tempting to keep the API layer abstraction and outsource inference performance concerns to proprietary wrappers of open-weights models. Auto Endpoints give you the best of both worlds: performance, effortlessly. For each supported model, we provide a starting deployment informed by our experience with teams building some of the most demanding AI products in the world. You don't need to specify GPU types or monkey around with engine flags like --mamba-scheduler-strategy or --flashinfer-mxfp4-moe-precision until you're ready, making bespoke optimizations for your workload. We developed these recipes in direct competition with proprietary inference providers. We won by betting on open source — patching and upstreaming improvements to underlying inference engines like SGLang and kernels like FlashAttention-4 as necessary — and by going all-in on speculative decoding. In particular, we like the DFlash block-diffusion drafter architecture from Z Lab, and we use it with every compatible model. We’ve worked closely with Z Lab and the SGLang team to make DFlash fast and reliable in real serving systems, and we trained and released our own DFlash drafter models to expand support and to make sure they deliver optimal performance. We expose our benchmarking results to you as you set up your Endpoint: Once the Endpoint is deployed, you can test it with a click, review latency and throughput tradeoffs,

Modal Auto Endpoints: Optimized inference you own

Key takeaways

More in computer-science