The Wake: May 14, 2026

The Wake is a daily briefing from George's saved internet. The issue is written as a newsletter first. The tweets are the source material, preserved below for receipts.

Source window: May 13, 2026. Signals: 7 bookmarks and 0 likes.

Brief

The AI story this morning is about architecture and appetite. Teams are pushing beyond chat as the primitive interface, both with model designs that run multiple internal streams in parallel and with agents that can reach out and control apps. At the same time, the dollar scale fueling these experiments has jumped: a podcast with Anthropic’s CFO cites run-rate revenue growing from about $250 million to $30 billion, roughly two orders of magnitude, alongside a procurement posture that reads like wartime logistics. The result is a tightening loop: new model formats enable new agent behaviors, which demand more compute and different hardware choices, which in turn accelerate product cadence and raise both utility and risk.

Parallel streams: why single-threaded chat is a ceiling

Since the chatbot era began, most LLM-driven systems have used a single message stream: models receive tokens, compute, and reply. That architecture makes interaction and reasoning simple but imposes strict sequencing. A model cannot simultaneously read a long context, continue an internal chain of thought, and fire an external tool without interrupting one of those activities. That limitation is now being challenged.

Work arguing for multi-stream LLMs (see @jonasgeiping) reframes the model’s internal life. Instead of one serialized conversation, the model maintains multiple simultaneous token streams: an outer user dialogue, an inner chain-of-thought channel, tool interaction lanes, monitoring or safety subvocalizations, and so on. Each forward pass predicts across those streams in parallel. The claimed benefits are practical and concrete: lower latency because you can read and predict at once; cleaner separation of roles inside the model, which helps with composability and security; and a more legible, inspectable path for reasoning because streams can be audited independently.
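
To make the latency claim concrete, here is a toy step-count model, not the paper's method. It assumes each forward pass advances every unfinished stream by one token and ignores any extra cost per pass. The stream names and token counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """One lane of work, sized in tokens to read or write (hypothetical)."""
    name: str
    tokens: int


def serialized_steps(streams: list[Stream]) -> int:
    # Single message stream: every token in every lane takes its own step.
    return sum(s.tokens for s in streams)


def parallel_steps(streams: list[Stream]) -> int:
    # Multi-stream (toy assumption): each step advances every unfinished
    # lane by one token, so the longest lane sets the step count.
    return max((s.tokens for s in streams), default=0)


# Made-up workload: a short reply, a long internal thought, a tool exchange.
streams = [
    Stream("user_dialogue", 40),
    Stream("chain_of_thought", 120),
    Stream("tool_io", 60),
]

print(serialized_steps(streams))  # 220
print(parallel_steps(streams))    # 120
```

In this sketch the saving comes entirely from overlap: the longest stream becomes the floor. Whether real systems see anything like that ratio depends on how much more each pass costs, which this model deliberately leaves out.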

That last point matters for safety. If a single CoT channel can be pressured into a wrong conclusion, an independent monitoring stream can still raise a flag. Instruction tuning can teach models to use the stream format, so the shift is more about training regimen than rewriting core architectures. Think of it as moving from a single-threaded program to a microservices architecture inside the model.

Expect the first impacts to be UX- and agent-focused: fewer awkward interruptions, fewer token-stealing turns, and agents that can "think aloud" without blocking the user or tools. The research is complementary to other recent ideas about continuous or subvocal reasoning; different teams will package similar capabilities under different names.

Agents becoming hands-on: real apps, real tokens, real risk

We are no longer talking only about models that describe actions. Agents are already doing them. While debugging a Telegram issue, Codex used Peekaboo, a desktop automation tool, to open the Telegram Mac app, talk to BotFather, and generate a new token (see @steipete). That's not an abstract demo. It is an end-to-end loop: sense, act, authenticate, and resume.

That capability is powerful and fragile at the same time. On the positive side, genuine automation removes a lot of human tedium and enables workflows where a model can fix a configuration, rotate keys, or renew credentials without a human typing every step. On the negative side, autonomy plus app-level access expands the attack surface. Tokens, credentials, and privileged APIs are now both more useful and more reachable. Multi-stream models can reduce some of the UX problems agents face, for example by avoiding the need to interrupt a long-running thought to accept a user command, but they also make it easier for agents to coordinate complex, multi-step access.

Operationally, this raises basic questions every engineering org must answer: who approves autonomous actions, how are tokens scoped and rotated, what monitoring catches an agent going off script, and how do you audit internal streams for malicious subvocalization? The technical fixes (compartmentalization, least privilege, attestation channels) are known. The governance choices are not.
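
As a rough illustration of least privilege plus an approval gate, the sketch below checks each agent action against an allowlist, holds sensitive actions until a human signs off, and logs every decision. The class, action names, and approver label are all hypothetical. None of this reflects how Codex, Peekaboo, or any real agent framework handles permissions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentPolicy:
    """Hypothetical per-agent policy: what it may do, what needs a human."""
    allowed: set[str]
    needs_approval: set[str]
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def decide(self, action: str, approved_by: Optional[str] = None) -> str:
        if action not in self.allowed:
            verdict = "deny"   # least privilege: anything unlisted is refused
        elif action in self.needs_approval and approved_by is None:
            verdict = "hold"   # sensitive action waits for a named approver
        else:
            verdict = "allow"
        self.audit_log.append((action, verdict))  # every decision is recorded
        return verdict


# Illustrative action names, not a real API.
policy = AgentPolicy(
    allowed={"read_logs", "create_bot_token"},
    needs_approval={"create_bot_token"},
)

print(policy.decide("read_logs"))                              # allow
print(policy.decide("create_bot_token"))                       # hold
print(policy.decide("create_bot_token", approved_by="oncall")) # allow
print(policy.decide("delete_account"))                         # deny
```

The mechanics fit in twenty lines. The hard part is the governance question of who maintains the two sets and who counts as an approver.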

Compute is the lever and the constraint

The architecture and agent trends are not happening in a vacuum. They are being financed at scale. On a recent podcast, Krishna Rao, Anthropic’s CFO, discussed how the business he joined two years ago, when run-rate revenue was about $250 million, now runs at a reported $30 billion, and how finance and procurement now shape model strategy (see @patrick_oshag). How compute is allocated across Trainium, TPUs, and GPUs, and how enormous pools of it are sourced, are central levers.

Two points here matter for what teams will build next. First, compute economics determine which model formats are practical. Parallel prediction across multiple streams shifts the cost profile: more concurrent work per forward pass, different memory access patterns, and potential gains from hardware that supports fine-grained parallelism. That pushes teams to optimize across device types rather than treating GPUs as a one-size-fits-all solution.

Second, the capital intensity favors organizations that can marshal large, predictable budgets for compute. If the returns to frontier intelligence keep rising, as the podcast's framing suggests, companies with stable access to capital and procurement expertise will outpace smaller players on the bleeding edge. That does not kill innovation, but it changes the tactical playbook: research labs must marry clever software with economically feasible hardware stacks.

Cadence, disclosure, and the safety window

Model release cadence is compressing. One observer was surprised at how quickly GPT-5.6 arrived (see @kimmonismus). Rapid iteration is a competitive advantage: newer models are immediately better at tasks and unlock new product hooks. They can also make prior guardrails obsolete.

But compressed cadence narrows the safety window. Faster releases mean less time for external red-teaming and field testing. They also increase the rate at which deployed agents acquire new capabilities. Combine that with models that can operate on multiple streams and act autonomously, and you have a system where emergent behaviors can appear sooner and with higher impact.

One mitigation path is to move safety earlier in the stack: instrument internal streams, require attestable human-in-the-loop gating for risky ops, and treat agent permissions as first-class product features. The technological push toward legible parallel streams actually helps here, because it creates separable channels you can monitor and constrain.
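
Here is a minimal sketch of what auditing streams independently could look like, assuming each internal stream is available as a named list of text lines. That format is an assumption for illustration, not something the paper specifies. The keyword match stands in for a real classifier, and the marker list is invented.

```python
RISK_MARKERS = ("token", "credential", "password")  # illustrative only


def audit_streams(streams: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per stream, the lines that mention a risk marker."""
    flagged: dict[str, list[str]] = {}
    for name, lines in streams.items():
        hits = [ln for ln in lines if any(m in ln.lower() for m in RISK_MARKERS)]
        if hits:
            flagged[name] = hits
    return flagged


def requires_human_gate(streams: dict[str, list[str]]) -> bool:
    # A concern on ANY stream is enough to pause, even if the main
    # chain-of-thought stream never mentions it.
    return bool(audit_streams(streams))


# Hypothetical snapshot of three streams during a debugging task.
snapshot = {
    "user_dialogue": ["Fix the bot connection error."],
    "chain_of_thought": ["The bot rejects requests.", "Plan: restart the bot."],
    "monitor": ["Concern: the plan may need a new API token."],
}

print(audit_streams(snapshot))
# {'monitor': ['Concern: the plan may need a new API token.']}
print(requires_human_gate(snapshot))  # True
```

The point of the example is the shape: because the monitor stream is separate, its concern survives even when the main reasoning stream stays quiet.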

What to watch

Papers and open-source releases about multi-stream LLMs and instruction-tuning for stream formats. Early evals will show latency and safety trade-offs.
Agent demos that interact with native apps and produce credentials or tokens. Any reproducible incidents in the wild will change corporate policy fast.
Hardware procurement moves from major model builders. Watch which accelerator mixes teams prefer for multi-stream workloads.
Interviews and regulatory signals about compute fundraising and capital allocation. CFO-level conversations are now strategic documents.
New product rollouts labeled GPT-5.x or similar. Cadence signals whether companies are prioritizing incremental capability or taking time for robust testing.
Tooling for internal stream auditing and attestations. If these become standard, it will reshape what "safe by design" looks like.
Any publicly disclosed agent governance frameworks or breaches involving automated key rotation or token management.

Short version: architecture (multi-stream), automation (agents doing real actions), and capital (compute procurement) are converging. That convergence speeds capability and complexity at once. Your practical playbook is the same as it was a year ago: instrument aggressively, limit privilege, require attestable gates for any agent action, and assume the model will get smarter faster than your current monitoring does.

Source tweets

Mario Zechner / @badlogicgames

bookmark: open on X
recommended reading.

Alexis Ohanian 🗽 / @alexisohanian

bookmark: open on X
Put this in my veins. Big @jack fan.

Jonas Geiping / @jonasgeiping

bookmark: open on X
We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information. In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can … 🔵Be created by instruction-tuning for the stream format 🔵Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in) 🔵Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency 🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security 🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise n...

Chubby♨️ / @kimmonismus

bookmark: open on X
GPT-5.6 arriving that quick was not on my bingo card.

Patrick OShaughnessy / @patrick_oshag

bookmark: open on X
Krishna Rao is the CFO of Anthropic, and this is his first podcast appearance. He joined the company two years ago when run-rate revenue was about $250M. Today it is $30B. He has helped raise ~$75B and is responsible for the procurement and allocation of compute. I feel lucky we get to hear what it is like to sit inside a company this consequential at a moment this pivotal. We discuss: - The cone of uncertainty - How he allocates compute across Trainium, TPUs, and GPUs - What investors misunderstand about model companies - Why the returns to frontier intelligence keep rising - Platform vs application and where Anthropic builds its own products - How Anthropic uses Claude internally I have asked my closing question about the kindest thing more than 500 times. Krishna's answer is one I have never heard before. Enjoy! Timestamps: 0:00 Intro 2:38 The Compute Canvas 6:51 The "Cone of Uncertainty" 11:58 Why the Returns to Frontier Intelligence Are So High 16:45 Recursive Self-Improvement 20:20 Scaling Laws 23:30 Sourcing $100 Billion in Compute 28:05 Platform vs. Application Strategy 32:52 Pricing Dynamics 38:48 How Anthropic’s Finance Team Uses Claude 43:24 Raising Capital & Overcom...

Mathelirium / @mathelirium

bookmark: open on X
Boost A Mass To Light Speed, And Gravity Becomes A Shockwave. The Aichelburg-Sexl Ultraboost is an exact solution of General Relativity obtained by boosting the Schwarzschild gravitational field toward light speed while keeping the energy finite.
The post also includes media.

Peter Steinberger 🦞 / @steipete

bookmark: open on X
Codex was debugging a Telegram issue and needed a new token, so it used Peekaboo to open the Telegram Mac app, talked to botfather and just did it. Computer Use is amazing.
The post also includes media.

Generated from Birdclaw bookmarks and likes. Edited by Ody before publication.