People are really worried about their jobs. And I just want to remind them that the purpose of your job and the tasks and tools that you use to do your job are related, not the same. I’ve been doing my job for 33 years. I’m the longest-running tech CEO in the world, 34 years. And the tools that I’ve used to do my job have changed continuously in the last 34 years, and sometimes quite dramatically, you know, over the course of a couple, two, three years.
“What does the Doctor want?”
“To translate all human knowledge into a new philosophical language, consisting of numbers. To write it down in a vast Encyclopedia that will be a sort of machine, not only for finding old knowledge but for making new, by carrying out certain logical operations on those numbers.”
Let me make it more concrete and less doomy: there exists a variety of systems in the world that have resolution paths built into the system, where those resolution paths guarantee a certain amount of human attention on behalf of the system to a person who attempts to engage the resolution path. And often there is a defined tripwire for what counts as “attempting to engage the resolution path.” That tripwire might be writing a facially plausible letter.
And if you are a bank, and you have a hundred million customers, which at least one bank in the United States does, you are institutionally aware of how many facially plausible letters your customer base is capable of writing every year. And there is a number there. And based on that number, you have a certain number of people sitting in an office, reading facially plausible letters and engaging the resolution path.
Here’s an example from the UK of this possibly happening: https://www.theguardian.com/politics/2025/nov/09/ai-powered-nimbyism-could-grind-uk-planning-system-to-a-halt-experts-warn
Matt Turck: What do you make of the “Are We on Track to AGI?” debate — the counter-thesis from people like Rich Sutton or Yann LeCun, who argue that a different approach is needed, or that it should be reinforcement learning only?
Sholto Douglas: I think it’s true that our models don’t learn anywhere near as efficiently as humans do. They take a thousand lifetimes to learn — but that’s fine, because they can live those thousand lifetimes in simulation or by doing work across a thousand firms, and so on.
I’d separate two arguments here. One is architectural — that transformers are insufficient. I don’t think that’s true. We haven’t yet found anything that transformers can’t model given enough data and compute.
Reinforcement learning as an objective is a powerful idea. Rich Sutton is actually a big fan of RL as an objective; he just thinks we’re encoding too many priors through pre-training and similar methods, which leads to an inadequate representation of the world.
So far, the evidence suggests our current methods haven’t yet encountered a domain that’s intractable with enough effort. The only thing that would make me change my mind would be if there were a problem domain where, despite heavy effort, benchmarks simply didn’t move for a long time. That would suggest a fundamental limitation.
But instead, what I constantly see is that every time we define a benchmark that measures something we care about, progress along that benchmark is incredibly rapid. Anything we can measure seems to be improving very quickly.
Matt Turck: What does taste mean when it comes to AI research?
Sholto Douglas: One of the most important things is mechanistically understanding exactly what you’re trying to do and having an implicit simplicity regularizer. When you think about taste in ML, it’s the crucial ingredient that allows you to decide what goes into your large training run when you have imperfect information.
We can study very deeply what the impact of an architectural change is, but past a certain level of scale, you have to guess whether the impact of that change will compound with others or conflict. You can’t test your full-scale run multiple times — you often have only one shot.
So a lot of taste comes from being able to make good inferences about whether something will deliver increasing returns to scale. It’s also about judging whether a research direction is worth pursuing. Our baselines in ML are often so well-tuned that it’s very hard to beat them, even with theoretically better methods, because so many small tricks are required to make a machine learning method work, and they can fail for any number of reasons.
It’s not like building a bridge, where you have a good idea why a particular shear was introduced. There can be all these quirks, and knowing whether to keep pushing in a given direction or abandon it for another is another aspect of taste. It always comes back to simplicity as a guiding regularizer.
There’s this thing called Moravec’s paradox — the idea that tasks humans find easy, like manipulating or picking up objects, are hard for AI, while things we find hard, like reasoning through mathematical problems, are easy.
I actually think Moravec’s paradox is a bit misleading. It’s mostly about data availability and reinforcement learning signals. Take robotic locomotion, for example — robots’ ability to walk and balance. If you look at the latest videos from Unitree robots, the difference between now and two years ago is dramatic. They’re incredibly agile; there’s even a video where one gets kicked over and pops back up like something out of The Matrix.
This progress is because locomotion provides a simple and strong reinforcement learning signal. It’s largely solved with basic RL now. Manipulation is harder, but I’m still optimistic about robotics for several reasons.
First, I’ve seen remarkable progress from robotics labs this year — they can now handle fairly complex physical tasks. Second, there’s what I’d call a large generator–verifier gap. Improving language models requires finding humans who can outperform them, but in robotics we can use general models as teachers or judges. For example, if a robot is told to “stack the red block on top of the blue block,” a language model can evaluate whether it succeeded and provide a reward accordingly.
Finally, for a long time people thought robotics required solving long-term coherence and planning, but language models have made that easier by breaking tasks into smaller steps. Most robotics labs are now focused on refining motor policies, and they’re making incredible progress. It’s mainly a question of improving data and feedback loops.
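The generator-verifier gap described above can be sketched as a reward function: a general model grades a robot’s attempt, turning a natural-language instruction into a scalar RL signal. This is a minimal illustrative sketch, not any lab’s actual pipeline; `call_vlm` is a hypothetical stand-in for a real vision-language-model API.

```python
# `call_vlm` stands in for a real judge: in practice you would send a camera
# image plus the instruction to a vision-language model and parse its verdict.
def call_vlm(instruction: str, scene: dict) -> bool:
    if instruction == "stack the red block on top of the blue block":
        return scene.get("red_on") == "blue"
    return False

def reward(instruction: str, scene: dict) -> float:
    # 1.0 if the judge says the instruction was satisfied, else 0.0.
    return 1.0 if call_vlm(instruction, scene) else 0.0

# A policy update would then reinforce trajectories that end with reward 1.0.
print(reward("stack the red block on top of the blue block", {"red_on": "blue"}))   # 1.0
print(reward("stack the red block on top of the blue block", {"red_on": "table"}))  # 0.0
```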
My favorite example of this is the work-to-rule strike. Where your employees go on strike by doing what you told them to do. And it turns out in almost every organization, unless it’s doing a very simple thing, if your employees simply start following orders, your company grinds to a complete halt. Because it turns out that your orders are terrible, they have to be adapted to the local conditions. People are doing that all the time, very smoothly. And if they just start following the actual rules, nothing can be done. The whole system breaks.
Casey Handmer 00:33:33
Let’s get concrete here for a second. Let’s say you’ve got one rack and it’s 1 megawatt. I’ll leave the cooling to someone who specializes in air conditioners, but it’s basically throwing air conditioners at the problem. Then you have batteries.
So in order to get four nines of uptime on this… In South Texas, you actually need less than this. But let’s just say it’s 24 hours worth of battery storage. That means it’ll get you through two bad nights in a row, basically. Actually, it turns out that you can significantly decrease power consumption with a very small reduction in overall compute. So if you’ve got like three really bad days in a row or something, you can dial back your power usage quite a lot without compromising your inference or training.
Okay, so you’ve got, say, a Tesla Megapack, something like four megawatt hours. So one megawatt rack, and then six Tesla Megapacks, each of which is roughly one truckload worth of stuff. So one truckload worth of rack, and then like six truckloads worth of batteries. Then in order to operate this at an average power of 1 megawatt, your solar arrays in Texas will be something like 25% utilization. So on average, if the sun came up every day and the day was the same length all the time, you would need 4 megawatts of solar arrays, which is about 4 acres of land. But in practice, because you’re aiming for four nines instead of one nine, you need an overbuild of about 2.5x. So you’ve got about 10 acres of solar.
So 10 acres of solar, six truckloads of batteries, one truckload of data center, and some cooling stuff.
Dwarkesh Patel 00:34:58
For how big of a data center?
Casey Handmer 00:35:00
One megawatt. That’s just for one megawatt. So 10 acres, one megawatt kind of situation at four nines.
If you want five gigawatts, then that’s 5,000 times 10. So 50,000 acres. At a larger scale, you can probably cut all those numbers down by 10-20%, but it’s on that order. And 50,000 acres sounds like a lot.
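The arithmetic in this passage checks out, and can be written down as a back-of-envelope calculator. The ~1 acre per MW of panels is an assumption implied by the discussion (4 MW ≈ 4 acres), not a stated figure.

```python
# Back-of-envelope check: 1 MW rack, 24 h of battery, solar sized for a ~25%
# capacity factor with a 2.5x overbuild for four nines of uptime.
RACK_MW = 1.0
BATTERY_HOURS = 24
MEGAPACK_MWH = 4.0          # one Tesla Megapack, roughly one truckload
CAPACITY_FACTOR = 0.25      # average solar output vs nameplate, Texas
OVERBUILD = 2.5             # margin for four nines instead of one nine
ACRES_PER_MW = 1.0          # assumption implied by "4 megawatts ... 4 acres"

battery_mwh = RACK_MW * BATTERY_HOURS             # 24 MWh of storage
megapacks = battery_mwh / MEGAPACK_MWH            # truckloads of batteries
solar_mw = RACK_MW / CAPACITY_FACTOR * OVERBUILD  # nameplate solar needed
acres = solar_mw * ACRES_PER_MW

print(megapacks, solar_mw, acres)   # 6.0 10.0 10.0
print(acres * 5000)                 # 50000.0 acres for a 5 GW build
```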
Interesting conversation on the factors - how do you get 1 MW of stable power?
Dwarkesh Patel 00:16:08
Okay. What is the cost of… GE makes these 100 megawatt gas turbines, right?
Casey Handmer 00:16:15
I don’t actually know what the retail price is. I would suspect that if their price is flexible, it would have gone up a lot. But if I recall correctly, $35 a megawatt hour is just the floor cost for…
Dwarkesh Patel 00:16:25
How much, sorry?
Casey Handmer 00:16:26
$35 a megawatt hour just for the Brayton cycle. We’re not talking about the fuel, we’re not talking about the heat exchangers, we’re not talking about the cooling ponds or anything like that. Just the amortized cost of the high-speed, high-temperature spinning components is $35 a megawatt hour.
This is interesting with respect to nuclear: basically it means nuclear has a floor price, but solar does not.
Dwarkesh Patel 00:07:18
Whenever a discussion like this comes up, it’s often phrased in the context of personal behavior. I think people will be assuming that what we’re going to get up to is this push to make you vegetarian. I happen to have been vegetarian. I grew up a Hindu, so I’ve never eaten meat. Then I just stayed a vegetarian after I was no longer a Hindu. But then I started prepping to interview you and I’m like…
I don’t know how valuable this is, especially if you look at some of these online charity evaluators and you’re just like, “A dollar of your donation will offset this much meat-eating.” You’re like, “What are we doing here?” But anyways, vegetarianism, overrated?
Lewis Bollard 00:08:02
I think we made a mistake as a movement making this about personal diet. It’s great when folks want to make a personal diet decision, whether that is eating less meat or meat from more humane sources, but the focus should not be on the individual. This is not how large-scale social change occurs. We need government reform. We need corporate reform. People can be a part of that regardless of what they eat, regardless of what their diet is. We need people to be advocates and funders and supporters of this cause.
I like the idea here of comparing carbon offsetting to meat offsetting. Why does the former feel more acceptable than the latter?
While I briefly have no employer, let me tell you what’s really happening with AI companies training on public data:
There are roughly two groups of actors:
- Those that care about US + EU laws and regulations.
- Those that don’t.
But both look the same from the outside. A consequence of this caring-about-rules spectrum is that there’s a “compliance gap” in many evals.
If you have an idea where a given company (or even academic lab) sits, you should be either more or less impressed by their numbers.
The vaccine passports debate is a perfect illustration of my new working theory: that the most important part of modern government, and its most important limitation, is database management. Please stick with me on this - it’s much more interesting than it sounds. (1/?) Twitter
This is such a perfect illustration of my database state theory - database management is both the most important task of modern govt and its most intractable limitation. Twitter
AI - Erosion of Trust
I think the concept of erosion of trust is useful when thinking about how to use LLMs to contribute to team codebases.
Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory.
We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
How to write
Thoughts on choosing what to write about, how to say it, and the value of writing.
Looking at Google, we’ve given various stats - around 30% of code now uses AI-generated suggestions, or whatever it is. But the most important metric, and we carefully measure it, is how much our engineering velocity has increased as a company due to AI. It’s a tough thing to measure, and we rigorously try to measure it, and our estimate is that the number is now at 10%.
But I hope - hopefully - we live in a world where, by making resources more plentiful, you make the world less of a zero-sum game over time. Which it isn’t, but in a resource-constrained environment, people perceive it to be.
And so that’s, I think, kind of the core challenge of building agents on top of traditional LLMs.
The missing ingredient has been that we need to directly train these agents end to end to do these workflow-like tasks.
By doing that, you can train the agents in such a way that they see failures during their training and they learn to recover from those failures.
So even if each step is only, you know, 90% accurate or 95% accurate, now the model has seen what it looks like to fail at that step and it’s able to reroute itself.
It’s able to think, oh, this doesn’t look right.
Like, let me go back and try that again.
Do you have a canonical example that kind of illustrates this problem?
Yeah.
So a good example of this would be like if you are, you know, if you’re trying to kind of like build an agent that does research for you.
So if you do a single web search and maybe you get the search term wrong, like the user doesn’t know exactly what to search for.
So you try to search for terms that seem relevant and you start to pull back a bunch of docs that don’t have information that really gets to the heart of the problem.
Then a naive agent might just get confused by that.
And it might think, hmm, okay, I searched for the term that I thought they meant and all these docs are irrelevant.
So maybe this is, you know, maybe what the user asked for doesn’t make sense.
Or maybe I need to go down this other rabbit hole.
Whereas an agent that’s trained to do web research and has been trained using reinforcement learning to sort of be good at this multi-step process.
Well, in its training, it’s seen many instances where it searched for the wrong term.
And the training has incentivized it to learn to recover from those instances and instead go back and think, oh, you know, I searched for this term, but I got results that weren’t relevant.
Maybe that means that I had the wrong search term.
So let me go, you know, try to try again and pick a different one.
And so the key is really reasoning models that are trained end to end, using reinforcement learning, to solve the kinds of tasks that users need them to solve.
So that they’re able to kind of like see these multi-step processes, see the kinds of failures that happen in training and learn to recover from them.
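The recovery behavior described here can be sketched as a tiny control loop: a research agent that checks whether retrieved documents look relevant and, instead of concluding the user’s question is bad, backtracks and tries a different query. `search` and the corpus are illustrative stand-ins for a real search tool, and the "relevance check" is deliberately crude.

```python
# Toy corpus: one plausible-seeming query returns nothing useful, another hits.
CORPUS = {
    "transformer scaling laws": ["Scaling Laws for Neural Language Models"],
    "llm scaling": [],
}

def search(query: str) -> list[str]:
    # Stand-in for a real web/document search tool.
    return CORPUS.get(query, [])

def research(queries: list[str]) -> list[str]:
    # A trained agent has internalized the pattern: irrelevant results usually
    # mean a bad query, so reroute and try the next candidate rather than
    # decide the user's request doesn't make sense.
    for q in queries:
        docs = search(q)
        if docs:                      # crude relevance check
            return docs
        # "This doesn't look right - let me go back and try that again."
    return []

print(research(["llm scaling", "transformer scaling laws"]))
```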
It strikes me that the creation of models that are specifically designed to do this is one powerful difference that we’re seeing in this new generation of models.
Kernighan’s Law goes:
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post-production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get it to co-write an essay with me, passage by passage. These are simple, self-contained, short-horizon, language-in, language-out tasks - the kinds of assignments that should be dead center in the LLMs’ repertoire. And they’re 5/10 at them. Don’t get me wrong, that’s impressive.
Introducing DuckLake
A podcast introducing the DuckLake project with Hannes Mühleisen and Mark Raasveldt. Transcript here.
Underrated reasons to be thankful
Thirty short reminders of how much is going right in the world, from nuclear weapons not destroying the atmosphere to the wonders of modern technology and medicine.
DeepWiki Splink docs
A service that takes any github repo and turns it into a docs site you can chat with.
Today an ordinary factory produces output worth as much as the factory itself cost within a few months. So our economy could double every few months if we could make everything in factories. But today we can’t make the people who work in the factories in factories, which means the economy only doubles every 20 years or so. But if you could make the AIs in the factories, then you could make everything the factory needs in the factory, and the world economy could double every few months. So one of the straightforward predictions is that the world economy grows much faster.
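The growth arithmetic in this quote reduces to a doubling-time calculation, sketched here with illustrative numbers (a 3-month payback vs today’s roughly 20-year economic doubling):

```python
# If a stock of capital can reproduce everything it needs (including its
# "workers") every `payback_months`, it doubles that often.
def doublings_per_year(payback_months: float) -> float:
    return 12 / payback_months

print(doublings_per_year(3))    # 4.0 doublings/year -> 16x growth in a year
print(doublings_per_year(240))  # today's ~20-year doubling: 0.05 per year
```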
Changing Data With Confidence using DuckDB - Hannes Mühleisen (PyData Global 2024)
Transcript here. I guess this could be used to store a db in s3 and use it to power a dashboard?
Image segmentation using Gemini 2.5
Ultra-cheap image segmentation feature and how to use it with the updated API. Tested myself and worked remarkably well.
My llm-fragments-github plugin has a new feature that lets you import an issue thread using -f issue:org/repo/number - so now you can feed it the repo contents with -f github:simonw/llm and tell it to “muse on this issue, then propose a whole bunch of code to help implement it”
llm install llm-fragments-github
llm -f github:simonw/llm \
  -f issue:simonw/llm/938 \
  -m gemini-2.5-pro-exp-03-25 \
  --system 'muse on this issue, then propose a whole bunch of code to help implement it'
This whole effort, which was hugely expensive in terms of people and time and dollars and everything else, was an experiment to further validate that the scaling laws keep going and why. And it turns out they do, and they probably keep going for a long time. I accept scaling laws like I accept quantum mechanics or something, but I still don’t know why. Why should that be a property of the universe? So why are scaling laws a property of the universe?
The fact that more compression will lead to more intelligence has this very strong philosophical grounding. So the question is: why does training bigger models for longer give you more compression? There are a lot of theories here. The one I like is that the relevant concepts are sort of sparse in the data of the world, and in particular, it’s a power law. The hundredth most important concept appears in one out of a hundred of the documents, or whatever. So there’s long tails.
If we make a perfect dataset and figure out very data-efficient algorithms, can we go home? It means that there’s potentially exponential compute wins on the table if you’re very sophisticated about your choice of data. But basically, when you just scoop up data passively, you’re going to require 10x-ing your compute and your data to get the next constant number of things in that tail. And that tail keeps going—it’s long. You can keep mining it, although you can probably do a lot better.
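The power-law claim above implies the data cost directly: if the k-th most important concept appears in roughly 1/k documents, then collecting a fixed number of examples of it from passively scooped data takes data proportional to k, so each 10x deeper slice of the tail costs 10x more data. A minimal sketch of that expectation (the 100-occurrence threshold is an illustrative assumption):

```python
# Expected documents needed to collect `occurrences` examples of the rank-k
# concept, given it appears in about 1/rank of documents.
def docs_needed(rank: int, occurrences: int = 100) -> int:
    return occurrences * rank

for rank in (100, 1_000, 10_000):
    print(rank, docs_needed(rank))
# Each 10x deeper into the tail costs 10x the data - the motivation for
# curating data rather than scooping it up passively.
```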
Intelligence is knowing that an LLM uses vector embeddings of tokens. Wisdom is knowing LLMs shouldn’t be used for business rules.
The idea was cost-push socialism - the idea that it wasn’t just accidental that you often had very, very high prices for the things liberals wanted to subsidize; it was the obvious outcome of subsidizing a good whose supply you were at the same time choking off. And just to jump in: yes, the paper is called “Cost Disease Socialism,” by Steven Teles, Sam Hammond, and Daniel Takash.
That a politics of sacrifice was going to fail, but a politics built on clean energy innovation and then rapid deployment might work. But we did not have, if you looked at how we built things in America, in blue states, the policies needed for rapid deployment. So we were going to need something like YIMBY for clean energy
That’s what YIMBYism does. It asks how much housing you have built, not how much you have spent on housing. That’s what YIMBYism for clean energy would be: not just how much have you authorized to spend on solar, but how much solar have you deployed.
This idea that in many places, people who call themselves liberals and progressives, and even sometimes people that have lawn signs in their front yards that say kindness is everything, nonetheless sit in houses zoned for single families and resist and often sue to block the development of affordable housing units that would, you know, cast a shadow over their own house. And so a part of this project is redefining liberalism away from the liberalism of the clenched fist toward a liberalism that recognizes that growth itself can be a good.
Render Markdown - A handy tool by Simon Willison
Useful markdown rendering tool. Code here: GitHub repository for Simon’s tools
Share Python Scripts Like a Pro: uv and PEP 723 for Easy Deployment
Good lengthy article for LLM ingestion about using uv. See also the.dusktreader blog
Self-contained Python scripts with uv
Waymo’s Foundation Model for Autonomous Driving with Drago Anguelov
Interesting discussion of how generative AI is impacting self-driving.
But if your view on an LLM’s capability is “I chat with Claude all the time. He seems very emotionally supportive. I’ve done this sort of song generation in Suno, which is a wonderful experience, by the way”—you’re probably not predicting what those capabilities in an API plus a two-to-five year enterprise integration cycle look like.
Because after that exists, it’s not going to be you invoking one a couple hours per day, it will be everybody getting a staggering number of LLM invocations on their behalf every day, most in the background. The right model isn’t a person having a conversation with an AI. It’s more similar to what happens when you open up the New York Times and several hundred robots conduct an instant auction for your attention on your behalf.
I’ll have a single LLM that acts as a kind of orchestrator, and I’ll have several other LLMs that act as members of a focus group.
[…]
And I will put my question in to the orchestrator and the orchestrator will pass that question, which is sort of “evaluate this idea” or “evaluate this product” to the underlying agent LLMs, typically I’ll use three or four. Each one of them will have a fairly detailed profile or a pen portrait, right? A persona, a marketing manager or an investor. And the last one might be a recent college grad and they will iterate and argue between themselves about the merits of this particular product with a view to coming to some kind of distinct consensus.
And in that virtual focus group where I run three or four of these, they will go back and forth and they will generate tens of thousands of tokens.
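The orchestrator-plus-focus-group setup described here can be sketched in a few lines. This is an illustrative reconstruction, not the speaker’s actual code; `ask_llm` is a hypothetical stand-in for a real chat-completions call, and the personas and single-pass "debate" rounds are assumptions.

```python
# Three persona agents, as in the description: each gets a pen-portrait
# system prompt, and the orchestrator circulates the question among them.
PERSONAS = ["marketing manager", "investor", "recent college grad"]

def ask_llm(system_prompt: str, question: str) -> str:
    # Stand-in: a real implementation would call a chat API here, passing the
    # transcript so far so agents can argue with each other.
    return f"[{system_prompt}] view on: {question}"

def focus_group(question: str, rounds: int = 2) -> list[str]:
    transcript = []
    for _ in range(rounds):
        for persona in PERSONAS:
            prompt = f"You are a {persona}. Argue the merits, citing the discussion so far."
            transcript.append(ask_llm(prompt, question))
    # Finally, the orchestrator would ask one more model to distill a consensus.
    return transcript

replies = focus_group("evaluate this product")
print(len(replies))  # 6 turns across 2 rounds of 3 personas
```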
Wikidata is a Giant Crosswalk File
Good post about how to download bulk address or persons data from wikidata for e.g. synthetic data generation
I use it every day. For coding, I use cursor composer to gather context about the existing codebase (context.md). Then I paste that into DeepSeek R1 to iterate on requirements and draft a high level design document, maybe some implementation details (design.md).
Paste that back into composer, and iterate; then write tests. When I’m almost done, I ask composer to generate me a document on the changes it made and I double check that with R1 again for a final pass (changes.md).
Then I’m basically done.
This is architect-editor mode: https://aider.chat/2024/09/26/architect.html.
I’ve found Cursor + DeepSeek R1 extremely useful, to the point that I’ve structured a lot of documents in the codebase to be easily greppable and executable by composer. Benefit of that is that other developers (and their composers) can read the docs themselves.
Engineers can self-onboard onto the codebase, and non-technical people can get themselves unstuck with SQL statements via composer now.
My LLM codegen workflow atm
A detailed walkthrough of using LLMs for both greenfield development and legacy code maintenance, with specific prompts and workflows for different scenarios.
It wrongly estimates the average amount of money that an American household headed by a 25- to 34-year-old spent on whisky in 2021, even though anyone familiar with the Bureau of Labor Statistics data can find the exact answer ($20) in a few seconds. It cannot accurately tell you what share of British businesses currently use AI, even though the statistics office produces a regular estimate.
[…]
Or consider the true meaning of Adam Smith’s “invisible hand”, the foundational idea in economics. In a paper published in 1994, Emma Rothschild of Harvard University demolished the notion that Smith used the term to refer to the benefits of free markets. Deep Research is aware of Ms Rothschild’s research but nonetheless repeats the popular misconception. In other words, those using Deep Research as an assistant risk learning about the consensus view, not that of the cognoscenti.
At least for quoting accurate stats, this feels susceptible to reinforcement learning, so I imagine it will improve fairly quickly.
One way to think of it is that these models largely interpolate, not extrapolate. In these incredibly high-dimensional spaces interpolation is extremely powerful. But still limited.
(In response to this tweet asking why LLMs aren’t good at making connections between subjects - see the Dwarkesh quote below)
The remarkable thing about these reasoning results, and especially the DeepSeek-R1 paper, is this result they call DeepSeek-R1-Zero: they took one of these pre-trained models, DeepSeek-V3-Base, and then did this reinforcement learning optimization on verifiable questions and verifiable rewards, for a lot of questions and a lot of training. And these reasoning behaviors emerge naturally - things like, “Wait, let me see. Wait, let me check this. Oh, that might be a mistake.” And they emerge from only having questions and answers. When you’re using the model, the part that you look at is the completion. So all of that just emerges from this large-scale RL training, and that model, whose weights are available, has no human preferences added in post-training.
But the very remarkable thing is that you can get these reasoning behaviors, and it’s very unlikely that there are humans writing out reasoning chains. It’s very unlikely that they somehow hacked OpenAI and got access to OpenAI o1’s reasoning chains. It’s something about the pre-trained language models and this RL training where you reward the model for getting the question right: it tries multiple solutions, and this chain of thought emerges.
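The shape of that training loop can be sketched in a few lines: sample several attempts per verifiable question, score exact-match answers 1 and everything else 0, and keep only the rewarded traces to reinforce. This is a toy of the loop shape only, not DeepSeek’s actual stack; the "sampler" just cycles through canned answers.

```python
# Two verifiable questions with known answers - no human preference labels.
QA = {"2+3": "5", "7*6": "42"}

def sample_answer(question: str, attempt: int) -> str:
    # Stand-in for sampling a chain of thought + final answer from the model.
    return ["5", "42", "idk"][attempt % 3]

def rl_step(samples_per_q: int = 8) -> list[tuple[str, str]]:
    kept = []
    for q, gold in QA.items():
        for i in range(samples_per_q):
            ans = sample_answer(q, i)
            reward = 1.0 if ans == gold else 0.0   # verifiable reward
            if reward > 0:
                kept.append((q, ans))              # reinforce these traces
    return kept

kept = rl_step()
print(len(kept))   # 6 rewarded traces out of 16 attempts
```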
That’s what OpenAI and Microsoft did in Arizona. They have 100,000 GPUs. Meta, similar thing. They took their standard existing data center design - it looks like an H - and connected multiple of them together. They first did 24,000 GPUs total, but only 16,000 of them were running in the training run, because GPUs are very unreliable and they need spares to swap in and out. All the way up to now: 100,000 GPUs that they’re currently training Llama 4 on - 128,000 or so.
Think about 100,000 GPUs with roughly 1,400 watts apiece. That’s 140 megawatts - 150 megawatts for 128,000. So you’ve jumped from 15-20 megawatts to almost 10x that number, 9x that number, to 150 megawatts in two years, from 2022 to 2024. And then some people like Elon - he admittedly, he says himself, got into the game a little bit late for pre-training large language models. xAI was started later, right? But then he bent heaven and hell to get his data center up and get the largest cluster in the world, which is 200,000 GPUs. And he did that. He bought a factory in Memphis. He’s upgrading the substation; at the same time, he’s got a bunch of mobile power generation, a bunch of single-cycle gas turbines. He tapped the natural gas line that’s right next to the factory, and he’s just pulling a ton of gas, burning gas.
He’s generating all this power. He’s in an old appliance factory that’s shut down and moved to China long ago, and he’s got 200,000 GPUs in it. And now, what’s the next scale? All the hyperscalers have done this. Now, the next scale is something that’s even bigger. And so Elon, just to stick on the topic, he’s building his own natural gas plant, like a proper one right next door. He’s deploying tons of Tesla Megapack batteries to make the power more smooth and all sorts of other things. He’s got industrial chillers to cool the water down because he’s water-cooling the chips. So, all these crazy things to get the clusters bigger and bigger.
But when you look at, say, what OpenAI did with Stargate in Abilene, Texas, right? What they’ve announced, at least. It’s not built. Elon says they don’t have the money. There’s some debate about this. The money is definitely accounted for at least for the first section, but there are multiple sections. At full scale, that data center is going to be 2.2 gigawatts - 2,200 megawatts of power in, and roughly 1.8 gigawatts or 1,800 megawatts of power delivered to the chips.
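The power figures in this passage are easy to sanity-check:

```python
# 100,000 GPUs at roughly 1,400 W apiece (chip plus its share of overhead).
GPUS = 100_000
WATTS_PER_GPU = 1_400

cluster_mw = GPUS * WATTS_PER_GPU / 1e6
print(cluster_mw)   # 140.0 MW - vs the ~15-20 MW clusters of 2022, a ~9x jump

# Stargate at full scale: 2.2 GW in, ~1.8 GW delivered to chips.
stargate_in_mw = 2_200
stargate_chips_mw = 1_800
print(stargate_chips_mw / stargate_in_mw)   # ~0.82 of input power reaches chips
```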
We have to take the LLMs to school.
When you open any textbook, you’ll see three major types of information:
Background information / exposition. The meat of the textbook that explains concepts. As you attend over it, your brain is training on that data. This is equivalent to pretraining, where the model is reading the internet and accumulating background knowledge.
Worked problems with solutions. These are concrete examples of how an expert solves problems. They are demonstrations to be imitated. This is equivalent to supervised finetuning, where the model is finetuning on “ideal responses” for an Assistant, written by humans.
Practice problems. These are prompts to the student, usually without the solution, but always with the final answer. There are usually many, many of these at the end of each chapter. They are prompting the student to learn by trial & error - they have to try a bunch of stuff to get to the right answer. This is equivalent to reinforcement learning.
We’ve subjected LLMs to a ton of 1 and 2, but 3 is a nascent, emerging frontier. When we’re creating datasets for LLMs, it’s no different from writing textbooks for them, with these 3 types of data. They have to read, and they have to practice.
Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI. You may not always be utilizing it fully but I would never bet against compute as the upper bound for achievable intelligence in the long run. Not just for an individual final training run, but also for the entire innovation/experimentation engine that silently underlies all the algorithmic innovations.
Data has historically been seen as a separate category from compute, but even data is downstream of compute to a large extent - you can spend compute to create data. Tons of it. You’ve heard this called synthetic data generation, but less obviously, there is a very deep connection (equivalence even) between “synthetic data generation” and “reinforcement learning”.
In the trial-and-error learning process in RL, the “trial” is model generating (synthetic) data, which it then learns from based on the “error” (/reward). Conversely, when you generate synthetic data and then rank or filter it in any way, your filter is straight up equivalent to a 0-1 advantage function - congrats you’re doing crappy RL.
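The equivalence claimed above (filtering synthetic data is RL with a 0-1 advantage) can be sketched in a few lines. `sample_fn` and the filter here are toy stand-ins for a model's sampler and a quality check, not real APIs.

```python
# Minimal sketch of "filtered synthetic data generation = crappy RL":
# generate samples (the "trial"), score each 0 or 1 (the "error"/reward),
# and keep only the winners to train on (the policy update).
import random

def zero_one_advantage(sample, is_good):
    """The filter, viewed as an advantage function: 1 if kept, 0 if discarded."""
    return 1.0 if is_good(sample) else 0.0

def synthetic_data_round(sample_fn, is_good, n=1000):
    kept = []
    for _ in range(n):
        s = sample_fn()                               # trial: generate a sample
        if zero_one_advantage(s, is_good) == 1.0:     # reward signal
            kept.append(s)                            # finetuning on `kept` = update
    return kept

# Toy example: the "model" samples digits, the filter keeps even ones.
data = synthetic_data_round(lambda: random.randint(0, 9), lambda x: x % 2 == 0)
assert all(x % 2 == 0 for x in data)
```

In RL terms this is one step of rejection sampling: every discarded sample gets advantage 0 and contributes no gradient, every kept sample gets advantage 1 and is imitated.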
There are two major types of learning, in both children and in deep learning:
- Imitation learning (watch and repeat, i.e. pretraining, supervised finetuning)
- Trial-and-error learning (reinforcement learning)
My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all magic is always 2. 2 is significantly more powerful. 2 is what surprises you.
2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the “aha moment” when DeepSeek (or o1, etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc.
It’s the solving strategies you see this model use in its chain of thought. It’s how it goes back and forth thinking to itself. These thoughts are emergent (!!!) and this is actually seriously incredible, impressive and new. The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different.
The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.
(Last thought: RL is powerful but RLHF is not. RLHF is not RL.)
A rumor I heard at Davos, which fits with some earlier reporting from the Wall Street Journal and another well-placed source I read recently, is that OpenAI is struggling to build GPT-5, focusing instead on the user interface in an effort to find a different, less technical advantage.
My bet is that […] advances will be more incremental than before and quickly matched. GPT-5 or a similarly impressive model will come eventually, perhaps led by OpenAI, a Chinese company, or maybe a competitor like Google will get there first. Whichever way it falls out, the advantage will be short-lived.
R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo
Not only was the model trained on the cheap, running it costs less as well. DeepSeek splits tasks over multiple chips more efficiently than its peers and begins the next step of a process before the previous one is finished. This allows it to keep chips working at full capacity with little redundancy. As a result, in February, when DeepSeek starts to let other firms create services that make use of v3, it will charge less than a tenth of what Anthropic does for use of Claude, its LLM.
Contra Tyler Cowen / Dwarkesh discussion: the correct economic model is not doubling the workforce, it’s the AlphaZero moment for literally everything. Plumbing new vistas of mind. It’s better to imagine a handful of unimaginably bright minds than a billion mid chatbots.
Much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. Every problem that an o1 solves is now a training data point for an o3[…] I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc. (This is apparently what happened with Anthropic and Claude-3.6-opus - it didn’t ‘fail’, they just chose to keep it private and distill it down into a small cheap but strangely smart Claude-3.6-sonnet.)
My next book, I’m writing even more for the AIs. Again, human readers are welcome. It will be free.
But who reviews it? Is TLS going to pick it up? It doesn’t matter anymore. The AIs will trawl it and know I’ve done this, and that will shape how they see me in, I hope, a very salient and important way.
Chat-driven programming. […] It requires at least as much messing about to get value out of LLM chat as it does to learn to use a slide rule, with the added annoyance that it is a non-deterministic service that is regularly changing its behavior and user interface. Indeed, the long-term goal in my work is to replace the need for chat-driven programming, to bring the power of these models to a developer in a way that is not so off-putting. But as of now I am dedicated to approaching the problem incrementally, which means figuring out how to do best with what we have and improve it.
A lot of the value I personally get out of chat-driven programming is I reach a point in the day when I know what needs to be written, I can describe it, but I don’t have the energy to create a new file, start typing, then start looking up the libraries I need… LLMs perform that service for me in programming. They give me a first draft, with some good ideas, with several of the dependencies I need, and often some mistakes. Often, I find fixing those mistakes is a lot easier than starting from scratch.
The world needs [more, better, harder, etc] evals for AI. This is one of the most important problems of our lifetime, and critical for continual progress.
There’s a variety of words that I wish we had, which we do not yet have. One is the concept of “alpha” in finance - that one Greek letter smuggles in a huge amount of understanding about how the world works. I would love to be able to describe someone’s alpha above the LLM baseline in discussing a topic, because there are a lot of human writers in the world who have no alpha above the LLM baseline, and that’s been true since before LLMs were a thing. The Twitterism is sometimes “this person is an NPC” - there is no intellectual content here; the performance of class and similar can allow one to pretend that there is intellectual content, but there is none.
Easy prediction for 2025 is that the gains in AI model capability will continue to grow much faster than (a) the vast majority of people’s understanding of what AI can do & (b) organizations’ ability to absorb the pace of change. Social change is slower than technological change. This all means that things will get weirder and the weirdness will be unevenly distributed.
The efficiency thing is really important for everyone who is concerned about the environmental impact of LLMs. These price drops tie directly to how much energy is being used for running prompts. There’s still plenty to worry about with respect to the environmental impact of the great AI datacenter buildout, but a lot of the concerns over the energy cost of individual prompts are no longer credible.
A lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!
General consensus in the replies and quotes of this seems to be that the entire concept of “AI skills” is a joke - how hard is typing text into a chatbot, really? I will continue to argue that it’s genuinely difficult, and that the challenge in using these tools is widely underestimated.
The interesting part is that they will crush tests but you wouldn’t hire them over a person for the most menial jobs. It’s a neat challenge how to properly evaluate the “easy stuff” that is secretly hard because of Moravec’s paradox. Very long contexts, autonomy, common sense, …
ARC is a silly benchmark; the other results in math and coding are much more impressive. o3 is just o1 scaled up. The main takeaway from this line of work is that we now have a proven way to RL our way to superhuman performance on tasks where it’s cheap to sample and easy to verify the final output. Programming falls in that category: they focused on known benchmarks, but the same process can be done for normal programs, using parsers, compilers, existing functions and unit tests as verifiers. Pre-o1 we only really had next-token prediction, which required high-quality human-produced data; with o1 you optimize for success instead of MLE of the next token.
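The verifier idea in the quote above is easy to sketch: a candidate program earns reward 1 only if it runs and passes the unit tests, and 0 otherwise. The candidate strings here are hand-written stand-ins for model samples, and `verifier_reward` is an illustrative toy, not a real training API.

```python
# Toy verifier-based reward: execute a candidate program and check it
# against unit tests. Crashes, syntax errors, and wrong answers all score 0.
def verifier_reward(program_src: str, tests) -> int:
    ns = {}
    try:
        exec(program_src, ns)                 # run the candidate's definition
        fn = ns["solution"]
        return int(all(fn(x) == y for x, y in tests))
    except Exception:
        return 0

tests = [(2, 4), (3, 9), (5, 25)]             # target behavior: square a number
candidates = [
    "def solution(x): return x + x",          # wrong for x != 2
    "def solution(x): return x ** 2",         # correct
    "def solution(x): return x /",            # doesn't even parse
]
rewards = [verifier_reward(c, tests) for c in candidates]
print(rewards)  # [0, 1, 0]
```

This is exactly what makes programming a “cheap to sample, easy to verify” domain: the reward function is mechanical, so no human labeler is needed in the loop.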
One very important thing to understand about the future: the economics of AI are about to change completely. We’ll soon be in a world where you can turn test-time compute into competence — for the first time in the history of software, marginal cost will become critical.
Latent Space Ultimate Guide to Prompting
Prompting can go very deep!
Building effective agents
What are agents and how do we expect them to evolve.
Is AI progress slowing down?
A good guide to thinking about whether scaling is dead.
Moon by Bartosz Ciechanowski
Not directly relevant to LLMs, but it’s interesting to think at what point an LLM could produce an article like this. I feel like they’re a long way off.
Dropping Spark for DuckDB Discussion
Hacker News discussion about companies moving from Apache Spark to DuckDB.
What if we take the body of knowledge - the orthodoxy of what a database engine should look like - and just put it into a package that doesn’t make you, you know, hate everything and everyone around you?
We showed with DuckDB that you can actually have full ACID-compliant transactional semantics in an analytical system without punishing performance.
Amanda Askell on Lex Fridman
How LLMs are trained to be useful, the importance of prompting.
Chris Olah on Lex Fridman
Interesting discussion of interpretability.
A central property in formal software engineering is compositionality: the idea that composite systems can be understood in terms of the meanings of their parts and the nature of the composition, rather than by having to look at the parts themselves.
This idea lies at the heart of piecewise development: parts can be engineered (and verified) separately and hence in parallel, and reused in the form of modules, libraries and the like […]
Current AI systems have no internal structure that relates meaningfully to their functionality. They cannot be developed, or reused, as components. There can be no separation of concerns or piecewise development.
Does current AI represent a dead end?
This article made me think of LLMs as software with no tests, no documentation, and lots of bugs. And yet very useful.
Analytics-Optimized Concurrent Transactions
DuckDB’s approach to handling concurrent transactions for analytics workloads.
Machines of Loving Grace
The CEO of Anthropic outlines how AI could transform the world for the better.
Cursor Team: Future of Programming with AI
How AI is being integrated into software development.
LLM Challenge: Writing Non-Biblical Sentences
There are lots of examples of strange capabilities like this you’d never see in a benchmark.
Francois Chollet on the Dwarkesh podcast - LLMs won't lead to AGI
Understanding LLMs’ reasoning abilities.
I always struggle a bit when I’m asked about the “hallucination problem” in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
What do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven’t been able to make a single new connection that has led to a discovery? Whereas if even a moderately intelligent person had this much stuff memorized, they would notice Oh, this thing causes this symptom. This other thing also causes this symptom. There’s a medical cure right here. Shouldn’t we be expecting that kind of stuff?
One more exciting thing about programs: as I said, in the case of language, one of the troubles is even evaluating it - when things are made up, you somehow need a human to say that they don’t make sense. In the case of programs, there is one extra lever: we can actually execute them and see what they evaluate to. So that process might be somewhat more automated, in order to improve the quality of generations.