
The Wake: May 27, 2026

A daily briefing from George's X bookmarks and likes, with source links and older-memory echoes.

The Wake is a daily briefing from George's saved internet. The issue is written as a newsletter first. The tweets are the source material, preserved below for receipts.

Source window: May 26, 2026. Signals: 9 bookmarks and 1 like.

Brief

On Tuesday, the developer tools world saw a small but intense shift in how it measures progress. Datacurve’s Serena Ge put a new agentic coding standard, DeepSWE, on the table, and people who actually ship software reacted as if they had been waiting for it. The argument is simple: public leaderboards and static tasks flatter top models into tight clusters while hiding the differences that determine whether a model helps you ship or costs you a week. The reaction is equal parts relief and unease. Relief because engineers finally have a bench that claims to map to the real experience of coding with LLMs. Unease because "real experience" includes messy human heuristics, fragile context, and emergent agent behaviors that are hard to safeguard at scale.

Why current benchmarks stopped being useful

For the past two years the community has used the same palette of tasks to compare models: unit-test-style problems, code-completion accuracy, and narrow synthetic challenges. Those datasets were valuable early on, but they now have two structural limits.

First, they are static and short-horizon. Real engineering work is stateful and interactive: a developer iterates over a repo, reads failing tests, traces logs, calls tools, and makes judgment calls across modules. A single-shot HumanEval-style number does not reflect that loop.

Second, leaderboards encourage saturation games. Model teams push for incremental improvements on the same metric and end up optimizing for the benchmark rather than the human workflow. That produces tight-looking leaderboards where differences that matter in practice get smoothed over.

DeepSWE is being positioned precisely to address both problems. The design is agentic: long-horizon tasks, environment interaction, tool chaining, error recovery, and developer-style prompts. Its stated goal is not to produce a cleaner leaderboard but to expose divergence, showing which models actually make the day-to-day experience of shipping code better. That reading is reinforced by the developer reaction: Theo, who builds with these systems, called it the first bench that "aligns with how it feels" to code with models, and Nick Dobos called it the first correct benchmark he had seen in a while.

Measuring the feel of a tool, and why that is dangerous

"Feels like" metrics are seductive. They capture reconciliation costs that engineers pay: how often the model hallucinated, how gracefully it recovered after a broken test, how much hand-holding the developer needed, how much context it remembered across a session. Those are the things you cannot see in a single-shot pass/fail.

But optimizing for "feel" creates two hard trade-offs.

One, you can end up teaching models to imitate brittle human heuristics. A provocative line in the conversation came from Theo: the goal is to give the model the same "flavor of psychosis" as you. Read charitably, that is shorthand for wanting models to inhabit the developer's mental model so they make guesses in the same style. Read less charitably, it flags a risk: you might intentionally reproduce idiosyncratic or error-prone reasoning patterns because they match human expectations, sacrificing correctness or safety for psychological comfort.

Two, agentic benchmarks reward interactive tool use, which increases the attack surface. A model that chain-calls tools, writes files, runs shells, and interprets test output can be enormously helpful. It can also do damage when it misinterprets signals or is fed adversarial inputs. When you combine that with the VC and product pressure to deploy "1000x" more instances of these systems (Garry Tan’s framing), scaling the reach of brittle or gamed behavior magnifies harm.

So expect two concurrent movements: the engineer-driven push to embrace DeepSWE-style measures, and a safety/policy push to define what behaviors should be disallowed even if they improve "feel."

Product and market consequences

Benchmarks shape markets. DeepSWE will have outsized influence precisely because it claims to reflect developer experience. If enterprise procurement teams and platform vendors start using it as a filter, model makers will pivot fast to win on it. That will be good for customers who want usable dev tools, and it will be a shortcut for newer models to leapfrog on perceived product fit despite not winning classical metrics.

But a new benchmark also invites gaming. Teams will overfit to the evaluation environment, producing models that excel on DeepSWE while failing in the wild. Expect the first waves of model improvements to focus on session continuity, tool orchestration, and human-like prompt tact. The second wave will likely be about robustness and observability: how to detect when the model's interactive behavior is wrong and roll it back without breaking developer flow.

This is precisely where infrastructure players can make real money: telemetry that traces agentic decisions across a dev session, SLOs for hallucination and recovery time, and runtime governance that can intercept destructive actions. The market for that stack will look a lot like the observability stacks that grew up around microservices a decade ago.
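As a rough illustration of what "runtime governance that can intercept destructive actions" might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `SessionGuard`, `Decision`, and `DESTRUCTIVE_PREFIXES` are hypothetical, the prefix denylist is deliberately naive, and nothing here comes from DeepSWE or any real product.

```python
import shlex
import time
from dataclasses import dataclass, field

# Hypothetical denylist of command prefixes treated as destructive.
# A real policy would be far richer; this only matches exact leading tokens.
DESTRUCTIVE_PREFIXES = [
    ("rm", "-rf"),
    ("git", "push", "--force"),
    ("git", "reset", "--hard"),
]


@dataclass
class Decision:
    """One reviewed action, kept as a telemetry record for the session."""
    command: str
    allowed: bool
    reason: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class SessionGuard:
    """Reviews shell commands an agent proposes before anything runs them."""
    log: list = field(default_factory=list)

    def review(self, command: str) -> Decision:
        tokens = shlex.split(command)
        for pattern in DESTRUCTIVE_PREFIXES:
            if tuple(tokens[:len(pattern)]) == pattern:
                decision = Decision(
                    command, False,
                    "matches destructive pattern: " + " ".join(pattern),
                )
                break
        else:
            decision = Decision(command, True, "no destructive pattern matched")
        self.log.append(decision)  # every decision is traced, allowed or not
        return decision

    def blocked_rate(self) -> float:
        """Share of proposed commands that were blocked in this session."""
        if not self.log:
            return 0.0
        return sum(not d.allowed for d in self.log) / len(self.log)


guard = SessionGuard()
for proposed in ["pytest -q", "rm -rf build", "git reset --hard HEAD~1"]:
    verdict = guard.review(proposed)
    print(proposed, "->", "allow" if verdict.allowed else "block", "|", verdict.reason)
```

The guard only reviews and records; it never executes anything, which keeps the interception point separate from the tool runner. A prefix match like this is easy to evade (for example `rm -r -f`), so it should be read as a placeholder for a real policy engine, with the decision log standing in for the session telemetry described above.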

VCs will cheer. The "golden age of abundance" narrative (Garry Tan) remains a powerful demand driver: if these tools can truly increase developer productivity by orders of magnitude, the business upside is massive. But investors are not spending on bench papers; they are backing products that can be deployed at scale with acceptable risk profiles.

Design, aesthetics, and the soft signals

There is an interesting cultural note in the mix: the reactions to DeepSWE used language about "feel" rather than pure accuracy. That is no accident. High-end product design is leaning into the interior experience as a differentiator, and Jony Ive showing the Ferrari Luce interior was a parallel signal on Tuesday. The same move is happening in developer tooling: the shape of an interaction and the tactile impression of a session will matter to adoption as much as raw capability.

Behind that are two truths. One, humans pick tools that reduce cognitive friction and align with their mental models. Two, aesthetics matter even in code. Expect teams to invest in UX for agentic flows: session timelines, undo affordances, clear provenance of suggestions, and calm failure modes. Those are the product features that will separate the comfortable from the dangerous.

Also expect social noise. The platform-level chatter included crude or NSFW posts that are irrelevant to product discussions but highlight moderation headaches. If a model learns from broad public feeds without adequate curation, those behaviors can surface inside developer sessions. That is another small reason to fund strong safety and content-filtering layers.

What to watch

  • Adoption: which major leaderboards, tooling vendors, or enterprise evaluators pick up DeepSWE as a standard. That will determine whether it shifts procurement reality or stays an academic curiosity.
  • Response from major model providers: whether they publish targeted improvements for agentic behaviors or release "agentic mode" feature flags with safety constraints.
  • Overfitting signals: early model entries that ace DeepSWE but show brittle failure modes in independent audits or in customer pilots.
  • Safety pushback: formal critiques from alignment and policy groups about imitating human heuristics, especially anything framed as reproducing cognitive biases or "psychosis."
  • Product plays: companies emerging to provide telemetry, SLOs, and runtime governance for agentic coding assistants.
  • Investor motion: follow-on funding rounds for startups positioning to scale these experiences 1000x, and whether they prioritize robustness budgets.

This is a pivot point. We have moved from measuring what models can do in isolation to measuring what they do when given agency inside a developer workflow. That is where real value will be created and real risk amplified. The winners will be the teams that treat the new bench not just as a metric to win but as a product requirement that demands observability, safe defaults, and a design sensibility that respects the human in the loop.

Source tweets

Autism Capital 🧩 / @AutismCapital

  • bookmark: open on X
  • “Sir, Ma’am, I want you to know that your daughter has an incredible pussy.” (The post also includes media.)

Lord Bebo / @MyLordBebo

  • bookmark: open on X
  • This feeling is gone … now it makes the sound of a powerful toothbrush (The post also includes media.)

Theo - t3.gg / @theo

  • bookmark: open on X
  • @ziggydakid The goal isn't to tell the model where things are in your codebase. The goal is to give the model the same flavor of psychosis as you

Theo - t3.gg / @theo

  • bookmark: open on X
  • This is the first code bench that actually aligns with how it feels to use these models coding.

Serena Ge (Datacurve) / @serenaa_ge

  • bookmark: open on X
  • Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. (The post also includes media.)

█▒ sleep_paralysis_demon_peter / @pmullr

  • bookmark: open on X
  • No text beyond linked/media content. (The post also includes media.)

Cleo Abram / @cleoabram

  • bookmark: open on X
  • Jony Ive shows the inside of the new Ferrari Luce, the first ever all electric Ferrari, only on HUGE* Conversations (full section) (The post also includes media.)

José Luis #360 / @jlsantana

  • bookmark: open on X
  • Dildo holders, actually.

Garry Tan / @garrytan

  • bookmark: open on X
  • Ultimately the golden age of abundance will be this kind of tech built and deployed 1000x

Nick Dobos / @NickADobos

  • like: open on X
  • First correct benchmark I’ve seen in a while

Generated from Birdclaw bookmarks and likes. Edited by Ody before publication.