Apple has conceded the foundation model race, and the concession is the most interesting AI strategy decision of the year. The new Siri announced on June 8 runs on Google Gemini, not on Apple’s own models (MacRumors). One day later, Apple published the Core AI Framework, which includes fm serve, a command that starts a Chat Completions API server for on-device inference. Read those two moves together and the strategy is plain: Apple is no longer trying to win on model quality. It is positioning itself as the distribution layer that every model, including a rival’s, must pass through to reach its installed base.

That installed base is the entire argument. Apple reported that its active device count had passed 2.2 billion back in early 2024 (Apple Newsroom), and it has only grown since. A company with that footprint does not need to train the best model. It needs to own the interface between models and users, and to make its devices the default place where inference happens. Both announcements serve exactly that goal.

What Apple Actually Shipped

Three things landed across June 8 and 9, and they are worth separating because they pull in different directions.

First, the consumer move. Siri was rebuilt with Gemini supplying the intelligence and the Siri interface remaining as the front end. The relaunched Apple Intelligence page presents this as a unified Apple experience, which it is, in the same way that Safari presenting Google search results is a unified Apple experience. Apple keeps the relationship with the user; Google supplies the capability underneath.

Second, the developer move. The Core AI Framework is Apple’s first documented, supported mechanism for on-device large language model (LLM) inference through an official API. This is distinct from MLX, Apple’s earlier machine learning research project, which is a framework for people willing to live close to the metal. fm serve is something else entirely: a local server speaking the Chat Completions schema, the de facto interface contract for LLM inference. Apple describes it simply as a command to “Start a Chat Completions API server.”

Third, the regulatory consequence. Apple requested an 18-month exemption from the interoperability obligations of the European Union’s Digital Markets Act (DMA) and was denied; the new Siri will not launch in the EU in its current form (Reuters). I will come back to this, because it is the part of the story that most coverage treated as a footnote and that I consider structural.

The Chat Completions Schema Won

The API design story here deserves more attention than the Siri headline. When Apple ships an endpoint modeled on the Chat Completions schema, the question of what the standard interface for LLM inference will be is settled. It is the OpenAI Chat Completions API, a schema designed by one vendor for one product, now implemented by Ollama, llama.cpp and, as of this week, the largest device vendor on the planet. No standards body wrote it. No working group reviewed it. It won the way most de facto standards win: it was there first, it was good enough and everyone needed a common surface.

For developers this convergence is mostly good news. Code written against one local inference server now ports to another by changing a base URL:

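from openai import OpenAI

# Ollama on the default port; the client requires an api_key value,
# but a local server typically ignores it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# llama.cpp server, or Apple's fm serve: same client,
# same request shape, different base_url
response = client.chat.completions.create(
    model="local-model",  # placeholder: use a model name the server actually serves
    messages=[{"role": "user", "content": "Summarize this commit diff."}],
)
print(response.choices[0].message.content)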

The caution is in the word “mostly.” Whether fm serve is a true drop-in replacement for the OpenAI surface is disputed; early discussion confirmed the API is modeled on Chat Completions without establishing wire-level compatibility across streaming, tool calls and the long tail of parameters. Anyone who has run the same request against Ollama and llama.cpp knows the pattern: the schema converges, the behavior diverges. Teams evaluating local inference should treat compatibility as a claim to audit, not a property to assume. Write a conformance test for the subset of the API you actually use, and run it against every backend you intend to support.
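A conformance test can start very small. The sketch below is one possible shape for it, not an official test suite: it checks a parsed, non-streaming response body against a subset of fields that a hypothetical application depends on. The function name, the chosen subset and the accepted finish_reason values are all assumptions to adjust to your own usage.

```python
from typing import Any

# Assumed subset: the finish reasons this hypothetical app knows how to handle.
KNOWN_FINISH_REASONS = {"stop", "length", "tool_calls", "content_filter"}


def check_chat_completion(resp: dict[str, Any]) -> list[str]:
    """Return the divergences from the response subset this app relies on."""
    choices = resp.get("choices")
    if not isinstance(choices, list) or not choices:
        # Nothing else can be checked without at least one choice.
        return ["choices: expected a non-empty list"]

    first = choices[0]
    if not isinstance(first, dict):
        return ["choices[0]: expected an object"]

    problems: list[str] = []
    message = first.get("message")
    if not isinstance(message, dict):
        problems.append("choices[0].message: expected an object")
    else:
        if message.get("role") != "assistant":
            problems.append("choices[0].message.role: expected 'assistant'")
        # content may be null when the model answers with tool calls instead.
        if not isinstance(message.get("content"), str) and not message.get("tool_calls"):
            problems.append("choices[0].message: expected string content or tool_calls")

    finish_reason = first.get("finish_reason")
    if finish_reason not in KNOWN_FINISH_REASONS:
        problems.append(f"choices[0].finish_reason: unexpected {finish_reason!r}")

    usage = resp.get("usage")
    if not isinstance(usage, dict) or not isinstance(usage.get("total_tokens"), int):
        problems.append("usage.total_tokens: expected an integer")

    return problems
```

Feed it the raw JSON body each backend returns for the same request and compare the resulting lists. An empty list means that backend honors the subset; anything else is a divergence worth recording.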

On-Device Inference Becomes a Deployment Target

The practical consequence for developers is larger than the API trivia. Until now, shipping an application backed by local inference meant asking users to install and operate a model server themselves, which confined the approach to enthusiasts. An OpenAI-compatible server built into the operating system of a couple billion devices changes the default. On-device inference stops being a hobbyist deployment and becomes a target platform with a stable, familiar interface.

The economics follow. Inference that runs on the user’s hardware costs the developer nothing per token, leaks no data to a third party and works on an airplane. Those three properties (cost, privacy, availability) are the entire pitch for local inference, and the historical counterweight has been distribution friction and weak model quality. Apple just removed the friction. Model quality remains an open question, and a real one: nothing in the Core AI Framework documentation promises that the on-device models match what Gemini provides to Siri over the network. The plausible architecture is a tiered one, where local models handle latency-sensitive and privacy-sensitive work while the heavy reasoning goes upstream. Developers should plan for that split rather than pretending either tier subsumes the other.

There is also an honest argument against my framing. Outsourcing Siri to Gemini can be read not as strategic repositioning but as a capability failure: Apple tried to build competitive models, did not get there and bought a lifeline. The week’s broader commentary included the argument that AI capability gains are decelerating across the industry, which would make renting a frontier model the rational move for everyone except the two or three labs at the front. Both readings can be true at once. A company can fail at one layer and respond by consolidating its position at the layer where it is unbeatable. The history of platform companies is largely a history of exactly that maneuver.

Apple Has the Edge, Google Has Everything Else

Here is the part of my own distribution argument that deserves pressure. Apple has the devices that matter for putting AI at the edge; no other vendor combines that hardware fleet, the silicon to run local models and a developer channel to expose them. Follow the intelligence rather than the interface, though, and every path in this announcement terminates at Google. Gemini stands behind Siri. Google already holds the default search position on those same devices, an arrangement two decades old. The week’s net result is that Google’s models gained system-level placement on the largest premium device fleet in the world without Google shipping a single device.

The hardware deserves specific weight in that assessment. Apple Silicon’s unified memory architecture already makes a consumer Mac one of the cheapest ways to run a large model locally, a fact the MLX project demonstrated well before this week. Building on that base, Apple gets to rival its competitors’ inference offerings without the baggage of the AI race: no training runs that depreciate in months, no capital bidding war for accelerators, no obligation to defend a leaderboard position. It sells the hardware, ships the runtime and lets model builders absorb the risk of being wrong. I consider this the most defensible position any large company holds in AI right now, precisely because it does not depend on predicting which model wins.

A distribution layer only commands rent while the layers beneath it stay interchangeable. The Gemini deal suggests that, at the frontier tier, they are not: if models were truly commodities, Apple would have launched with a panel of providers or with its own adequate model, the way it treats maps and storage. Choosing a single rival’s flagship instead is a statement about where capability actually lives. So there are two honest readings of the same event. Apple as aggregator, taxing every model that wants to reach its users, is the strong reading. Apple as a front end that just became structurally dependent on the only supplier that matters is the weaker one. I hold both: Apple owns the edge, and the edge is real, but the intelligence underneath it is Google’s, and dependencies at the capability layer have a way of repricing the relationship over time.

The EU Carve-Out Is the Tell

The DMA denial is what convinces me the distribution-layer reading is correct. The dispute, as Reuters reports it, is over interoperability: when a gatekeeper integrates a third-party AI provider into system-level APIs, the DMA requires that competing assistants get equivalent access. The Commission’s spokesperson called the withdrawal “Apple’s and Apple’s only” decision and said Apple failed to develop a compliant interoperability solution; Apple’s executives countered that the requirement amounts to a risky experiment on tens of millions of users. Apple’s answer, for now, is to withhold the product from the EU rather than open the integration point.

That choice is informative. A company shipping a feature defends the feature. A company defending an exclusive integration channel is telling you where the value lives. The new Siri without exclusive system integration is just a chat app, and Apple has no interest in shipping a chat app. The asset is the privileged position between the model and an installed base of more than two billion devices, and Apple would rather forfeit an entire market, temporarily, than dilute it. This mirrors the delayed EU rollout of the first Apple Intelligence generation, and I expect the same eventual resolution: a negotiated version with interoperability concessions, arriving late.

For developers in or serving the EU, the immediate consequence is fragmentation. The deployment target I described above is not uniform; it is conditioned on regulatory geography, and any product plan that assumes the Core AI Framework reaches every device should be discounted accordingly.
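One way to absorb that fragmentation is to detect a local backend at runtime rather than assume it exists. The following is a minimal sketch under stated assumptions: that a local server answers GET on a models route under its base URL (nothing in the announcement confirms this for fm serve), and that the remote URL is a placeholder for whatever hosted endpoint you already use. The names http_probe and pick_backend are illustrative.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence


def http_probe(base_url: str, timeout: float = 0.5) -> bool:
    """True if GET {base_url}/models answers 200 (assumes the route exists)."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Refused connection, timeout, HTTP error or malformed URL:
        # treat the backend as unavailable on this device.
        return False


def pick_backend(
    local_candidates: Sequence[str],
    remote_fallback: str,
    probe: Callable[[str], bool] = http_probe,
) -> str:
    """Return the first local base URL that responds, else the remote fallback."""
    for base_url in local_candidates:
        if probe(base_url):
            return base_url
    return remote_fallback


if __name__ == "__main__":
    chosen = pick_backend(
        ["http://localhost:11434/v1"],       # Ollama's default, as used earlier
        "https://inference.example.com/v1",  # hypothetical hosted endpoint
    )
    print(chosen)
```

The probe is injectable so the selection logic can be tested without a network, and so a stricter check can replace the simple status test later.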

What I Will Do With This

My read, stated plainly: Apple has made on-device, OpenAI-compatible inference a platform primitive, and the durable winners will be developers who treat it as one. Three commitments follow, and they apply as directly to your projects as to mine.

I am auditing my inference abstractions first. Anything I maintain that hardcodes a single provider’s client, or worse a single provider’s behavioral quirks, now carries a migration cost, because adding a local tier just moved from speculative to scheduled.

I will write conformance tests before I trust any compatibility claim. The Chat Completions schema is a convention, not a specification with a test suite. The subset I depend on is small enough to verify mechanically, so I will verify it against every backend I intend to support and publish what diverges.
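Streaming is where I expect the most divergence, so it is worth a mechanical check of its own. This sketch assumes the framing the OpenAI API uses for streamed Chat Completions: server-sent event lines prefixed with "data:", each carrying a JSON chunk whose text lives in choices[0].delta.content, terminated by a "[DONE]" sentinel. Whether a given local backend follows that framing exactly is the thing under test. The function name is illustrative.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> tuple[str, list[str]]:
    """Reassemble streamed text and list divergences from the assumed framing."""
    text: list[str] = []
    problems: list[str] = []
    done = False

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(":"):
            continue  # blank separators and SSE comments carry no payload
        if not line.startswith("data:"):
            problems.append(f"unexpected line: {line!r}")
            continue

        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            done = True
            break

        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            problems.append(f"chunk is not valid JSON: {payload!r}")
            continue

        choices = chunk.get("choices") if isinstance(chunk, dict) else None
        if not isinstance(choices, list):
            problems.append(f"chunk without a choices list: {payload!r}")
            continue

        # Only the first choice is read; an empty list is tolerated.
        for choice in choices[:1]:
            delta = choice.get("delta") if isinstance(choice, dict) else None
            if isinstance(delta, dict) and isinstance(delta.get("content"), str):
                text.append(delta["content"])

    if not done:
        problems.append("stream ended without the [DONE] sentinel")
    return "".join(text), problems
```

Run the same prompt through each backend with streaming enabled, pass the raw response lines to this function, and diff both the text and the problem lists.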

I will sort my workloads by tier rather than by ideology. Latency-sensitive, privacy-sensitive and offline-required tasks go local. Heavy reasoning stays remote. Products that get this split right will feel faster and cheaper than products that pick one tier and defend it.
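The sorting rule is simple enough to write down as code. The sketch below is one way to encode it, with two policy choices that are my assumptions rather than anything the platform dictates: privacy and offline requirements override everything else, and unclassified work defaults to the local tier. The Task and Tier names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    REMOTE = "remote"


@dataclass(frozen=True)
class Task:
    latency_sensitive: bool = False
    privacy_sensitive: bool = False
    offline_required: bool = False
    heavy_reasoning: bool = False


def choose_tier(task: Task) -> Tier:
    """Pick an inference tier for a task under the assumed policy."""
    # Hard constraints first (assumed policy): data that must stay on the
    # device, or work that must run without a network, can only go local.
    if task.privacy_sensitive or task.offline_required:
        return Tier.LOCAL
    # Heavy reasoning goes upstream, even when latency would prefer local.
    if task.heavy_reasoning:
        return Tier.REMOTE
    if task.latency_sensitive:
        return Tier.LOCAL
    # Assumed default: unclassified work runs locally, where it costs
    # nothing per token.
    return Tier.LOCAL


if __name__ == "__main__":
    print(choose_tier(Task(heavy_reasoning=True)))     # Tier.REMOTE
    print(choose_tier(Task(privacy_sensitive=True)))   # Tier.LOCAL
```

Keeping the policy in one function makes the trade-offs explicit and easy to revisit once real quality and latency numbers for the local tier are available.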

Apple spent two decades proving that owning distribution beats owning technology. The Gemini deal and the Core AI Framework are that thesis restated for the AI era, and the EU standoff shows how much Apple believes it. The quiet winner, though, is Google: its models now stand behind the default assistant on the devices that matter most for edge AI, the same position its search engine has held on that hardware for twenty years. Apple owns the edge. Google is behind it all.

Developers should take the free distribution while the two of them settle the rent. If you start running conformance tests against a local Chat Completions backend, write up what breaks; that compatibility record is the most useful artifact this platform shift will produce, and I want to read it.

Sources