Skip to Content

Cloudflare Workers AI 2026: Free AI Inference at the Edge

Run LLMs on 300+ Global Locations — No Server, No Setup, Pay Only for What You Use
May 11, 2026, 11:18 Eastern Daylight Time by
Cloudflare Workers AI 2026: Free AI Inference at the Edge

Cloudflare Workers AI lets you run AI models (including Llama, Mistral, and Whisper) on 300+ edge locations worldwide. No server setup, no infrastructure management — just deploy AI-powered apps that respond in milliseconds. The free tier includes 10,000 AI requests per day.

What is Workers AI?

Workers AI is Cloudflare's serverless AI inference platform. Unlike traditional AI APIs where your requests travel to a single data center, Workers AI runs models on Cloudflare's global network — closer to your users. This means faster responses (typically under 100ms) and lower costs.

You get access to open-source foundation models including:

  • Llama 3.1 — Meta's latest open-source LLM
  • Mistral 7B — Fast, efficient model
  • Whisper — Speech-to-text transcription
  • Stable Diffusion XL — Image generation
  • embedding models — For vector search

Workers AI Pricing 2026

ModelInput/1K tokensOutput/1K tokens
Llama 3.1 8B$0.00015$0.00015
Mistral 7B$0.00007$0.00007
Whisper (audio)$0.00024/min

Free tier: 10,000 requests/day — enough for prototypes and small apps.

Workers AI Neurons: How Pricing Actually Works

Cloudflare uses a proprietary unit called Neurons to measure AI compute across all models. One Neuron roughly equals one token of a small model, but the conversion rate varies by model size and task type. Understanding this is critical for accurate cost forecasting.

The official pricing: $0.011 per 1,000 Neurons. Free allocation: 10,000 Neurons per day (resets daily at 00:00 UTC). On Workers Paid plan, the same 10,000 Neuron/day free allocation applies — you only pay for usage beyond that at $0.011/1,000 Neurons. Overall token pricing across models ranges from approximately $0.05 to $5 per million tokens depending on model type and size.

One practical consideration: Cloudflare has updated Workers AI pricing to be more granular, with per-model unit-based pricing now displayed alongside neuron mappings. For example, the BGE reranker costs $0.003/M input tokens while the M2M100 translation model costs $0.342/M input and output tokens. Knowing which model fits your neuron budget before building prevents surprise costs at scale.

Hidden cost warning: Community reports indicate undocumented usage thresholds exist beyond which Cloudflare may push accounts toward Enterprise plans. The free plan hard-limits (daily neuron cap, 100K AI Gateway logs/month) cause sudden throttling — not billing overages. Moving to paid mid-stream may require architectural changes.

Workers AI Agents: February 2026 Update

On February 13, 2026, Cloudflare added GLM-4.7-Flash to Workers AI — a fast multilingual text generation model optimized for dialogue and instruction-following with a 131,072-token context window. More importantly, this release came with the brand-new @cloudflare/tanstack-ai package and workers-ai-provider v3.1.1, enabling you to run complete AI agents entirely on Cloudflare.

GLM-4.7-Flash supports multi-turn tool calling, and combined with full compatibility for TanStack AI and the Vercel AI SDK, you now have everything needed to build agentic applications that run end-to-end at the edge. This is a significant architectural shift — previously, agentic workflows required routing through external orchestration layers. Now the full loop (inference → tool call → inference) can run on Cloudflare's 300+ edge locations.

Earlier in January 2026, Cloudflare partnered with Black Forest Labs (BFL) to add FLUX.2 [klein] 9B to Workers AI — a distilled image generation model offering enhanced quality at cost-effective pricing with a fixed 4-step inference process. This makes Workers AI viable for real-time image generation applications at sub-100ms latency.

Workers AI vs Alternatives

FeatureWorkers AIAWS BedrockVercel AI SDK
Free tier✅ 10K/dayLimited
Edge deployment✅ Global⚠️ Via Edge
No cold starts

How to Get Started

  1. Install Wrangler: npm install -g wrangler
  2. Login: wrangler login
  3. Create worker: wrangler init my-ai-worker
  4. Add AI binding in wrangler.toml
  5. Deploy: wrangler deploy

AI Gateway + Vectorize: The Full Cloudflare AI Stack

Workers AI doesn't operate in isolation. Cloudflare's AI stack includes three complementary services that work together for production deployments:

  • AI Gateway — Observe and control AI applications with caching (reduces repeated model calls), rate limiting, request retries, model fallback, and unified logging. In 2026, Cloudflare added Unified Billing — you can now pay for third-party model usage (OpenAI, Anthropic, etc.) directly through your Cloudflare invoice. Free tier: 100,000 AI Gateway logs/month; Paid: 1,000,000 logs/month.
  • Vectorize — Cloudflare's vector database for semantic search, recommendations, and LLM memory. Pricing: $0.01/million vector dimensions queried, $0.05/100 million dimensions stored. The free tier has no Vectorize allocation, so any vector search requires Workers Paid.
  • Workers AI Inference — The model layer. Run 50+ models at the edge, billed in Neurons at $0.011/1,000.

A typical production AI app on Cloudflare combines all three: Workers AI for inference, Vectorize for retrieval-augmented generation (RAG), and AI Gateway for caching and observability. The Logpush integration (paid: $0.05/million records) streams all logs to S3 or SIEM tools for compliance teams.

When Workers AI Makes Sense (and When It Doesn't)

Best fit: Latency-sensitive applications where model response must be near the end user. Real-time chatbots, content moderation, translation at the CDN layer, image classification on upload, and edge-native agentic workflows all benefit from Cloudflare's 300+ location deployment. The free 10K neurons/day is genuinely useful for small production apps — not just prototypes.

Not the best fit: Large-scale inference requiring frontier-level models (Claude, GPT-5.5, Gemini). Workers AI's model catalog is strong for open-source models but cannot replace API access to the best proprietary models. For high-throughput batch inference without latency requirements, AWS Bedrock or direct API access to providers like dedicated inference platforms will offer better pricing at scale.

Last Updated: May 17, 2026 | Source: Cloudflare Official Docs, Cloudflare Changelog (Official)

Frequently Asked Questions

Workers AI pricing starts at $0.00007 per 1K tokens for Mistral 7B and $0.00015 per 1K tokens for Llama 3.1. A free tier of 10,000 requests per day is available — enough for prototypes and small applications.
Workers AI supports Llama 3.1, Mistral 7B, Whisper (speech-to-text), Stable Diffusion XL (image generation), and various embedding models for vector search.
Yes. Workers AI runs on 300+ edge locations globally, providing sub-100ms latency for most users. AWS Bedrock requires requests to travel to fixed data centers, typically taking 500ms+.
Yes. The free tier includes 10,000 AI requests per day — more than enough for learning, prototyping, or small production apps. No credit card required.
No. Workers AI is fully serverless. You deploy via Cloudflare's CLI (Wrangler), and Cloudflare handles all infrastructure, scaling, and availability automatically.