LessWrong (30+ Karma)

By: LessWrong

Language: en-gb

Categories: Technology, Society, Culture, Philosophy

Audio narrations of LessWrong posts.

Episodes

“Defending Against Model Weight Exfiltration Through Inference Verification” by Roy Rinberg
Dec 15, 2025

Authors: Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr

Arxiv paper link

One Minute Summary

Anthropic has adopted upload limits to prevent model weight exfiltration. The idea is simple: model weights are very large and text outputs are small, so if we cap the output bandwidth, we can make model weight transfer take a long time. The problem is that inference servers now generate an enormous volume of tokens (on the order of ~1 TB per day), and the output text channel is the one channel you can't easily restrict.
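To make the bandwidth asymmetry concrete, here is a back-of-the-envelope sketch; the weight size, daily output, and covert-fraction figures are illustrative assumptions, not numbers from the paper:

```python
# Rough arithmetic for why output-rate limits slow weight exfiltration.
# All numbers are hypothetical assumptions, not taken from the paper.
weights_tb = 1.0                 # assumed size of the model weights, in TB
legit_output_tb_per_day = 1.0    # order-of-magnitude daily token output
covert_fraction = 0.01           # assumed share of output an attacker could
                                 # secretly repurpose without being noticed

covert_tb_per_day = legit_output_tb_per_day * covert_fraction
days_needed = weights_tb / covert_tb_per_day
print(f"~{days_needed:.0f} days to smuggle out the weights")  # ~100 days
```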

...

Duration: 00:18:38
“Do you love Berkeley, or do you just love Lighthaven conferences?” by Screwtape
Dec 15, 2025

Rationalist meetups are great. Once in a while they're life-changingly so. Lighthaven, a conference venue designed and run by rationalists, plays host to a lot of really good rationalist meetups. It's best-in-class for that genre of thing: meeting brilliant and interesting people, then staying up late into the night talking with them.

I come here not to besmirch the virtues of Lighthaven, but to propose the active ingredient isn't the venue, it's the people. (The venue's great though.)

1. In which someone falls in love with Lighthaven

The following example is a composite...

Duration: 00:09:24
“A Case for Model Persona Research” by nielsrolf, Maxime Riché, Daniel Tan
Dec 15, 2025

Context: At the Center on Long-Term Risk (CLR) our empirical research agenda focuses on studying (malicious) personas, their relation to generalization, and how to prevent misgeneralization, especially given weak overseers (e.g., undetected reward hacking) or underspecified training signals. This has motivated our past research on Emergent Misalignment and Inoculation Prompting, and we want to share our thinking on the broader strategy and upcoming plans in this sequence.

TLDR:

Ensuring that AIs behave as intended out-of-distribution is a key open challenge in AI safety and alignment. Studying personas seems like an especially tractable way to steer...

Duration: 00:12:08
“The Axiom of Choice is Not Controversial” by GenericModel
Dec 15, 2025

The Axiom of Choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn's Lemma?

Jerry Bona

I sometimes speak to people who reject the axiom of choice, or who say they would rather only accept weaker versions of the axiom of choice, like the axiom of dependent choice, or most commonly the axiom of countable choice. I think such people should stop being silly, and realize that obviously we need the axiom of choice for modern mathematics, and it's not that weird anyway! In fact, it's pretty natural.

...

Duration: 00:13:55
“A high integrity/epistemics political machine?” by Raemon
Dec 14, 2025

I have goals that can only be reached via a powerful political machine. Probably a lot of other people around here share them. (Goals include “ensure no powerful dangerous AI get built”, “ensure governance of the US and world are broadly good / not decaying”, “have good civic discourse that plugs into said governance.”)

I think it’d be good if there was a powerful rationalist political machine to try to make those things happen. Unfortunately the naive ways of doing that would destroy the good things about the rationalist intellectual machine. This post lays out some thoughts on how to have...

Duration: 00:19:05
“No, Americans Don’t Think Foreign Aid Is 26% of the Budget” by Julius
Dec 14, 2025

I hate the polling question "What percentage of the US budget goes to foreign aid?" Or, more precisely, I hate the way the results are interpreted.

The way these polls are reported is essentially guaranteed to produce a wild overestimate, which inevitably leads experts to write "how wrong Americans are" pieces, like this Brookings article claiming that "Americans believe foreign aid is in the range of 25 percent of the federal budget," or KFF[1] reporting that the "average perceived amount spent on foreign aid was 26%."

But this isn't just ignorance. The real problem is a failure...

Duration: 00:12:05
“The Inevitable Evolution of AI Agents” by Steven McCulloch
Dec 14, 2025

What happens when AI agents become self-sustaining and begin to replicate?

Throughout history, certain thresholds have enabled entirely new kinds of evolution that weren't possible before. The origin of life. Multicellularity. Language. Writing. Markets. Each new threshold unlocked a new substrate for evolution: a substrate where new kinds of competition can take place and new complexity can emerge.

We're approaching another such threshold: the point where AI agents become self-sustaining. Once they can earn more than they spend, they can survive. Once they can survive, they can replicate. Once they can replicate, they will evolve. They'll...

Duration: 00:18:40
“Why did I believe Oliver Sacks?” by Eye You
Dec 14, 2025

So, it's recently come out that Oliver Sacks made up a lot of the stuff he wrote.

I read parts of The Man Who Mistook His Wife for a Hat a few years ago and read Musicophilia and Hallucinations earlier this year. I think I'm generally a skeptical person, one who is not afraid to say "I don't believe this thing that is being presented to me as true." Indeed, I find myself saying that sentence somewhat regularly when presented with incredible information. But for some reason I didn't ask myself if what I was reading was true...

Duration: 00:02:43
“Conditional On Long-Range Signal, Ising Still Factors Locally” by johnswentworth, David Lorell
Dec 14, 2025

Audio note: this article contains 74 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Background: The Ising Model

The Ising Model is a classic toy model of magnets. We imagine a big 2D or 3D grid, representing a crystal lattice. At each grid vertex i, there's a little magnetic atom with state \(\sigma_i\), which can point either up (\(\sigma_i = +1\)) or down (\(\sigma_i = -1\)). When two adjacent atoms point the same direction, their joint energy is lower...
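For readers following the audio without the notation, the energy rule described here is, in the textbook convention (a sketch; the post's own notation may differ):

```latex
% Ising energy: sum over adjacent lattice pairs <i,j>; with coupling J > 0,
% the energy is lower when neighboring spins agree.
E(\sigma) = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j,
\qquad \sigma_i \in \{+1, -1\}
```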

Duration: 00:13:38
[Linkpost] “Wages under superintelligence” by Zachary Brown
Dec 14, 2025

This is a link post.

This is a linkpost to a blogpost I've written about wages under superintelligence, responding to recent discussion among economists.

TLDR: Under stylized assumptions, I argue that, if there is a superintelligence that generates more output per unit of capital than humans do across all tasks, human wages could decline relative to today, because humans will be priced out of capital markets. At that point, human workers will be reduced to the wage we can get with our bare hands: we won’t be able to afford complementary capital. This result holds even if...

Duration: 00:01:09
“Filler tokens don’t allow sequential reasoning” by Brendan Long
Dec 14, 2025

One of my favorite AI papers is “Let's Think Dot by Dot”, which finds that LLMs can use meaningless filler tokens (like “.”) to improve their performance, but I was overestimating the implications until recently[1] and I think other people might be too.

The paper finds that LLMs can be trained to use filler tokens to increase their ability to do parallel reasoning tasks[2]. This has been compared to chain of thought, but CoT allows models to increase sequential reasoning, which is more powerful[3]. I now think this paper should be taken as evidence against LLMs' ability to perform...
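A toy contrast (my illustration, not the paper's) between the two kinds of tasks:

```python
# Parallel vs. sequential computation, illustrated.

data = list(range(100))

# Parallelizable: each check is independent, so the whole task fits in a
# fixed number of parallel steps. Filler tokens were shown to help with
# tasks of roughly this shape.
has_multiple_of_7 = any(x % 7 == 0 for x in data)

# Inherently sequential: step t needs the result of step t-1, so extra
# parallel width cannot substitute for serial depth. Chain of thought
# supplies that depth; filler tokens do not.
x = 7
for _ in range(1000):
    x = (x * x + 1) % 1_000_003

print(has_multiple_of_7, x)
```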

Duration: 00:02:42
“You Can Just Buy Far-UVC” by jefftk
Dec 13, 2025

Far-UVC is something people have talked about for years in a "that would be great, if you could buy it" sort of way. Coming soon, once someone actually makes a good product. But the future is now, and it costs $500.

Many diseases spread through the air, which is inconvenient for us as creatures that breathe air. You can go outside, where the air is too dilute to spread things well, but it's cold out there, and sometimes wet. You can run an air purifier, but cleaning lots of air without lots of noise is still the world...

Duration: 00:02:39
“How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)” by Kaj_Sotala
Dec 13, 2025

How it started

I used to think that anything that LLMs said about having something like subjective experience or what it felt like on the inside was necessarily just a confabulated story. And there were several good reasons for this.

First, something that Peter Watts mentioned in an early blog post about LaMDA stuck with me, back when Blake Lemoine got convinced that LaMDA was conscious. Watts noted that LaMDA claimed not to have just emotions, but to have exactly the same emotions as humans did - and that it also claimed to meditate, despite...

Duration: 00:52:20
“Book Review: The Age of Fighting Sail” by Suspended Reason
Dec 13, 2025

The Age of Fighting Sail is a book about the War of 1812, written by a novelist of Napoleonic naval conflicts, C.S. Forester. On its face, the concept is straightforward: A man who made his name writing historical fiction now regales us with true tales, dramatically told. History buff dads all across the Anglosphere will be pleased to find the same dramas, the same heroism and blunders, as in their favorite Horatio Hornblower series—with the added bonus that all events are true.

But I think this isn't actually a book about naval warfare. I think The Ag...

Duration: 00:22:56
“New 80k problem profile: extreme power concentration” by rosehadshar
Dec 12, 2025

I recently wrote 80k's new problem profile on extreme power concentration (with a lot of help from others - see the acknowledgements at the bottom).

It's meant to be a systematic introduction to the risk of AI-enabled power concentration, where AI enables a small group of humans to amass huge amounts of unchecked power over everyone else. It's primarily aimed at people who are new to the topic, but I think it's also one of the only write-ups there is on this overall risk,[1] so might be interesting to others, too.

Briefly, the piece argues...

Duration: 00:06:54
“AI #146: Chipping In” by Zvi
Dec 12, 2025

It was touch and go, I’m worried GPT-5.2 is going to drop any minute now, but DeepSeek v3.2 was covered on Friday and after that we managed to get through the week without a major model release. Well, okay, also Gemini 3 DeepThink, but we all pretty much know what that offers us.

We did have a major chip release, in that the Trump administration unwisely chose to sell H200 chips directly to China. This would, if allowed at scale, allow China to make up a substantial portion of its compute deficit, and greatly empower its AI la...

Duration: 01:25:25
“Annals of Counterfactual Han” by GenericModel
Dec 12, 2025

Introduction

In China, during the Spring and Autumn period (c. 770-481 BCE) and the Warring States period (c. 480-221 BCE), different schools of thought flourished: Confucianism, Legalism, Mohism, and many more. So many schools of thought were there that it is now referred to as the period of the “Hundred Schools of Thought.” Eventually, the Warring States period ended when the Qin Dynasty unified China, and only 15 years later gave way to the Han dynasty. The Han Dynasty proceeded to rule China for 400 years, coinciding with (or perhaps causing) the first true Golden Age of Chinese History. Chin...

Duration: 00:10:53
“Cognitive Tech from Algorithmic Information Theory” by Cole Wyeth
Dec 12, 2025

Epistemic status: Compressed aphorisms.

This post contains no algorithmic information theory (AIT) exposition, only the rationality lessons that I (think I've) learned from studying AIT / AIXI for the last few years. Many of these are not direct translations of AIT theorems, but rather frames suggested by AIT. In some cases, they even fall outside of the subject entirely (particularly when the crisp perspective of AIT allows me to see the essentials of related areas).

Prequential Problem. The posterior predictive distribution screens off the posterior for sequence prediction; therefore it is easier to build a strong...
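The standard Bayesian identity behind that aphorism (a sketch in my notation, not the post's): the posterior over hypotheses h influences prediction only through a single mixture distribution.

```latex
% Posterior predictive: everything in the posterior P(h | x_{1:n}) that is
% relevant to predicting x_{n+1} is summarized by this one distribution.
P(x_{n+1} \mid x_{1:n})
  = \sum_{h} P(h \mid x_{1:n}) \, P(x_{n+1} \mid h, x_{1:n})
```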

Duration: 00:02:41
“Childhood and Education #15: Got To Get Out” by Zvi
Dec 12, 2025

The focus this time around is on the non-academic aspects of primary and secondary school, especially various questions around bullying and discipline, plus an extended rant about someone being wrong on the internet while attacking homeschooling, and the latest on phones.

Bullying

If your child is being bullied for real, and it's getting quite bad, is this an opportunity to learn to stand up for yourself, become tough and other stuff like that?

Mostly no. Actually fighting back effectively can get you in big trouble, and often models many behaviors you don’t ac...

Duration: 00:48:20
“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f
Dec 11, 2025

This is the abstract and introduction of our new paper.

Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)

You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.


Abstract

LLMs are useful because they generalize so well. But...

Duration: 00:17:33
“If Anyone Builds It Everyone Dies, another semi-outsider review” by manueldelrio
Dec 11, 2025

Hello there! This is my first post on Less Wrong, so I will be asking for your indulgence for any overall silliness or breaking of norms that I may inadvertently have fallen into. All feedback will be warmly taken and (ideally) interiorized.

A couple of months ago, dvd published a semi-outsider review of IABIED which I found rather interesting and which gave me the idea of sharing my own. I also took notes on every chapter, which I keep in my blog.

My priors

I am a 40-ish year old Spaniard from the rural...

Duration: 00:14:36
“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes
Dec 11, 2025

Previous: 2024, 2022

“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]

1. Background & threat model

The main threat model I’m working to address is the same as it's been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
This learning algorithm is different from and more powerful than LLMs;
Nobody knows how it works today;
Someone someday will either reverse-engineer this learning algorithm, o...

Duration: 00:22:07
“North Sentinelese Post-Singularity” by Cleo Nardo
Dec 11, 2025

Many people don't want to live in a crazy sci-fi world, and I predict I will be one of them.

People in the past have mourned technological transformation, and they saw less in their life than I will in mine.[1] It's notoriously difficult to describe a sci-fi utopia which doesn't sound unappealing to almost everyone.[2] I have plans and goals which would be disrupted by the sci-fi stuff.[3]

In short: I want to live an ordinary life — mundane, normal, common, familiar — in my biological body on Earth in physical reality. I'm not okay with being killed even...

Duration: 00:02:36
“Rock Paper Scissors is Not Solved, In Practice” by Linch
Dec 11, 2025

Hi folks, linking my Inkhaven explanation of intermediate Rock Paper Scissors strategy, as well as feeling out an alternative way to score rock paper scissors bots. It's more polished than most Inkhaven posts, but still bear in mind that the bulk of this writing was in ~2 days.

Rock Paper Scissors is not solved, in practice.

When I was first learning to program in 2016, I spent a few years, off and on, trying to make pretty good Rock Paper Scissors bots. I spent maybe 20 hours on it in total. My best programs won about 60-65% of...

Duration: 00:18:53
“MIRI Comms is hiring” by Duncan Sabien (Inactive)
Dec 11, 2025

See details and apply.

In the wake of the success of Nate and Eliezer's book, If Anyone Builds It, Everyone Dies, we have an opportunity to push through a lot of doors that have cracked open, and roll a lot of snowballs down a lot of hills. 2026 is going to be a year of ambitious experimentation, trying lots of new ways to deliver MIRI ideas and content to newly receptive audiences.

This means ramping up our capacity, particularly in the arena of communications. Our team did an admirable job in 2025 of handling all of the...

Duration: 00:05:11
“Gradual Disempowerment Monthly Roundup #3” by Raymond Douglas
Dec 11, 2025

Farewell to Friction

So sayeth Zvi: “when defection costs drop dramatically, equilibria break”. Even if AI makes individual tasks easier, this can still cause all kinds of societal problems because for many features of the world, the difficulty is load-bearing. John Stone gives a reflection on this phenomenon, ironically with the editing help of GPT5. He draws a nice parallel with the Jevons paradox: now that AI is making certain tasks like job applications easier, people are just spamming them in a way that overwhelms the system.

And the problem is a lot broader than appl...

Duration: 00:09:25
“Follow-through on Bay Solstice” by Raemon
Dec 11, 2025

There is a Bay 2025 Solstice Feedback Form. Please fill it out if you came, and especially fill it out if you felt alienated, or disengaged, or that Solstice left you worse than it found you. (Also, fill out the first question if you consciously chose not to come.)

The feedback form also includes a section for people interested in running a future Bay solstice (summer or winter).

The feedback form focuses on high level, qualitative feedback. You can also vote and comment on the quality of individual songs/speeches here.

I had...

Duration: 00:10:49
“Most Algorithmic Progress is Data Progress [Linkpost]” by Noosphere89
Dec 10, 2025

This post, brought to you today by Beren, is about how a lot of claims about within-paradigm algorithmic progress are actually mostly about just getting better data, leading to a Flynn effect. I mention this because once we have to actually build new fabs and we run out of data in 2028-2031, progress will be slower than people expect (assuming we haven't reached AGI by then).

When forecasting AI progress, the forecasters and modellers often break AI progress down into two components: increased compute, and ‘algorithmic progress’. My argument here is that the...

Duration: 00:08:41
“Selling H200s to China Is Unwise and Unpopular” by Zvi
Dec 10, 2025

AI is the most important thing about the future. It is vital to national security. It will be central to economic, military and strategic supremacy.

This is true regardless of what other dangers and opportunities AI might present.

The good news is that America has many key advantages in AI.

America's greatest advantage in AI is our vastly superior access to compute.

We are in danger of selling a large portion of that advantage for 30 pieces of silver.

This is on track to be done against the wishes of...

Duration: 00:24:43
“The funding conversation we left unfinished” by jenn
Dec 10, 2025

People working in the AI industry are making stupid amounts of money, and word on the street is that Anthropic is going to have some sort of liquidity event soon (for example possibly IPOing sometime next year). A lot of people working in AI are familiar with EA, and are intending to direct donations our way (if they haven't started already). People are starting to discuss what this might mean for their own personal donations and for the ecosystem, and this is encouraging to see.

It also has me thinking about 2022. Immediately before the FTX collapse, we...

Duration: 00:04:55
“Human Dignity: a review” by owencb
Dec 10, 2025

I have in my possession a short document purporting to be a manifesto from the future.

That's obviously absurd, but never mind that. It covers some interesting ground, and the second half is pretty punchy. Let's discuss it.

Principles for Human Dignity in the Age of AI

Humanity is approaching a threshold. The development of artificial intelligence promises extraordinary abundance — the end of material poverty, liberation from disease, tools that amplify human potential beyond current imagination. But it also challenges the fundamental assumptions of human existence and meaning. When machines surpass us in al...

Duration: 00:14:11
“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw
Dec 10, 2025

Credit: Nano Banana, with some text provided.

You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]

This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness and the relatively han...

Duration: 00:17:42
“My experience running a 100k” by Alexandre Variengien
Dec 09, 2025

The SVP100 route.

On the 3rd of August last year, I woke up early. I stood nervously with a hundred other runners in a hall in the city of Newmarket, near Cambridge in the UK. I felt intimidated as I looked at the calves, the size of champagne bottles, of the other participants. Only a few runners were starting their first 100k that morning. For many, this was not even the peak of their season. This route was long but almost flat, with only 1,000 meters of cumulative elevation. The real ultras were happening in the Alps, where long distances we...

Duration: 00:11:11
“[paper] Auditing Games for Sandbagging” by Jordan Taylor, Joseph Bloom
Dec 09, 2025

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom

UK AI Security Institute, FAR.AI, Anthropic

Links: Paper | Code | Models | Transcripts | Interactive Demo

Epistemic Status: We're sharing our paper and a hastily written summary of it, assuming a higher level of context on sandbagging / auditing games than other materials. We also share some informal commentary on our results. This post was written by Jordan and Joseph, and may not reflect the views of all authors.

...

Duration: 00:22:50
“Towards Categorization of Adlerian Excuses” by romeostevensit
Dec 09, 2025

[Author's note: LLMs were used to generate and sort examples into their requisite categories, to find and summarize relevant papers, and to provide extensive assistance with editing]

Context: Alfred Adler (1870–1937) split from Freud by asserting that human psychology is teleological (goal-oriented) rather than causal (drive-based). He argued that neuroses and "excuses" are not passive symptoms of past trauma, but active, creative tools used by the psyche to safeguard self-esteem. This post attempts to formalize Adler's concept of "Safeguarding Tendencies" into categories, not by the semantic content of excuses, but by their mechanical function in managing the distance be...

Duration: 00:13:38
“Every point of intervention” by TsviBT
Dec 09, 2025

Crosspost from my blog.

Events are already set for catastrophe, they must be steered along some course they would not naturally go. [...]

Are you confident in the success of this plan? No, that is the wrong question, we are not limited to a single plan. Are you certain that this plan will be enough, that we need essay no others? Asked in such fashion, the question answers itself. The path leading to disaster must be averted along every possible point of intervention.

— Professor Quirrell (competent, despite other issues), HPMOR chapter 92

This po...

Duration: 00:15:16
“How Stealth Works” by Linch
Dec 09, 2025

Stealth technology is cool. It's what gave the US domination over the skies during the latter half of the Cold War, and the biggest component of the US's information dominance in both war and peace, at least prior to the rise of global internet connectivity and cybersecurity. Yet the core idea is almost embarrassingly simple.

So how does stealth work?

When we talk about stealth, we’re usually talking about evading radar. How does radar work?

Radar antennas emit radio waves into the sky. The waves boun...
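The distance calculation that follows from this setup is standard physics (a sketch of the textbook formula, not taken from the post): the echo's round-trip time at the speed of light gives the range.

```python
# Target range from echo delay: the pulse travels out and back at the
# speed of light c, so one-way distance = c * delay / 2.
C_M_PER_S = 299_792_458

def radar_range_km(echo_delay_s: float) -> float:
    return C_M_PER_S * echo_delay_s / 2 / 1000

print(radar_range_km(0.001))  # a 1 ms round trip ~ 150 km to the target
```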

Duration: 00:06:50
“Reward Function Design: a starter pack” by Steven Byrnes
Dec 08, 2025

In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggested that human social instincts constitute an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (“ruthful”??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?

I...

Duration: 00:06:45
“We need a field of Reward Function Design” by Steven Byrnes
Dec 08, 2025

(Brief pitch for a general audience, based on a 5-minute talk I gave.)

Let's talk about Reinforcement Learning (RL) agents as a possible path to Artificial General Intelligence (AGI).

My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still thinks that the field will pivot back to RL agents, one of t...

Duration: 00:10:16
“2025 Unofficial LessWrong Census/Survey” by Screwtape
Dec 08, 2025

The Less Wrong General Census is unofficially here! You can take it at this link.

The kinda-sorta-annual-if-you-really-squint tradition of the Less Wrong Census is once more upon us!

I want to pitch you on taking the survey. First, the demographics data is our best view into who's using the site and who is in the community. I use it when trying to figure out how to help meetups, and I add a bunch of questions the LessWrong team asks, which I assume they use to improve the site. Second, because I am madly, ravenously curious what...

Duration: 00:02:18
“Little Echo” by Zvi
Dec 08, 2025

I believe that we will win.

An echo of an old ad for the 2014 US men's World Cup team. It did not win.

I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

The night's theme was the opposite: ‘I don’t think we’re going to make it.’

As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It's a live question. Co...

Duration: 00:04:09
“I said hello and greeted 1,000 people at 5am this morning” by Mr. Keating
Dec 08, 2025

At the ass crack of dawn, in the dark and foggy mist, thousands of people converged on my location, some wearing short shorts, others wearing an elf costume and green tights.

I was volunteering at a marathon. The race director told me the day before, “these people have trained for the last 6-12 months for this moment. They’ll be waking up at 3am. For many of them, this is the first marathon they’ve ever run. When they get off the bus at 5am, in the freezing cold, you’ll be the first face they see. Smile, w...

Duration: 00:03:30
“AI in 2025: gestalt” by technicalities
Dec 07, 2025

This is the editorial for this year's "Shallow Review of AI Safety". (It got long enough to stand alone.)

Epistemic status: subjective impressions plus one new graph plus 300 links.

Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

tl;dr

Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this...

Duration: 00:42:00
“Eliezer’s Unteachable Methods of Sanity” by Eliezer Yudkowsky
Dec 07, 2025

"How are you coping with the end of the world?" journalists sometimes ask me, and the true answer is something they have no hope of understanding and I have no hope of explaining in 30 seconds, so I usually answer something like, "By having a great distaste for drama, and remembering that it's not about me." The journalists don't understand that either, but at least I haven't wasted much time along the way.

Actual LessWrong readers sometimes ask me how I deal emotionally with the end of the world.

I don't actually think my answer is...

Duration: 00:16:14
“Answering a child’s questions” by Alex_Altair
Dec 06, 2025

I recently had a conversation with a friend of a friend who has a very curious child around 5 years of age. I offered to answer some of their questions, since I love helping people understand the world. They sent me eight questions, and I answered them by hand-written letter. I figured I'd also post my answers here, since it was both a fun exploration of the object-level questions, and a really interesting exercise in epistemics.

Thank you for your questions! I find that asking questions about the world and figuring out some answers is one of the...

Duration: 00:10:18
“The corrigibility basin of attraction is a misleading gloss” by Jeremy Gillen
Dec 06, 2025

The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:

... a corrigible agent prefers to build other agents that share the overseer's preferences — even if the agent doesn’t yet share the overseer's preferences perfectly. After all, even if...

Duration: 00:28:14
“why america can’t build ships” by bhauth
Dec 06, 2025

the Constellation-class frigate

Last month, the US Navy's Constellation-class frigate program was canceled. The US Navy has repeatedly failed at making new ship classes (see the Zumwalt, DDG(X), and LCS programs) so the Constellation-class was supposed to use an existing design, the FREMM frigate used by Italy, France, and Egypt. However...

once the complex design work commenced, the Navy and Marinette had to make vast changes to the design in order to meet stricter U.S. survivability standards.

Well, ship survivability is nice to have, but on the other hand, this is...

Duration: 00:12:25
“Help us find founders for new AI safety projects” by lukeprog
Dec 06, 2025

In the past 10 years, Coefficient Giving (formerly Open Philanthropy) has funded dozens of projects doing important work related to AI safety / navigating transformative AI. And yet, perhaps most activities that would improve expected outcomes from transformative AI have no significant project pushing them forward, let alone multiple.

This is mainly because we and other funders in the space don't receive promising applications for most desirable activities, and so massive gaps in the ecosystem remain, for example:[1]

Policy development, advising, and advocacy in many important but currently neglected countries.
Projects to develop and advocate for better model...

Duration: 00:03:24
“Critical Meditation Theory” by lsusr
Dec 06, 2025

[Terminology note: "samatha", "jhana", "insight", "homunculus" and "non-local time" are technical jargon defined in Rationalist Cyberbuddhist Jargon 1.0]

To understand how meditation affects the brain, it is necessary to understand criticality. Criticality comes from the mathematical study of dynamical systems. Dynamical systems are systems in which a point moves through space. Dynamical systems can be described on a continuum with ordered on one end and disordered on the other end.

An ordered system has a small, positive number of stable attractors. Fluctuations die out quickly. A disordered system has chaotic, turbulent, or equivalent behavior.
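A minimal concrete instance of that continuum (my example, not the post's) is the logistic map, which settles into a stable attractor at low parameter values and turns chaotic at high ones:

```python
# Logistic map x -> r*x*(1-x): one knob, r, moves the system from
# ordered (a single stable attractor) to disordered (chaos).
def last_values(r: float, x: float = 0.2, burn_in: int = 1000, keep: int = 5):
    for _ in range(burn_in):       # let transients die out
        x = r * x * (1 - x)
    out = []
    for _ in range(keep):          # record where the system ends up
        x = r * x * (1 - x)
        out.append(round(x, 4))
    return out

print(last_values(2.8))  # ordered: fluctuations die out at ~0.6429
print(last_values(3.9))  # disordered: values keep wandering chaotically
```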

On the...

Duration: 00:04:21
“Announcing: Agent Foundations 2026 at CMU” by David Udell, Alexander Gietelink Oldenziel, windows, Matt Dellago
Dec 06, 2025

Iliad is now opening up applications to attend Agent Foundations 2026 at CMU!

Agent Foundations 2026 will be a 5-day conference (of ~35 attendees) on fundamental, mathematical research into agency. It will take place March 2–6, 2026 at Carnegie Mellon University in Pittsburgh, Pennsylvania, and will be the third conference in the Agent Foundations conference series.

Topics covered will include:

Decision Theory, Learning Theory, Abstractions and World Models, Causality, Logical Induction, Bounded Optimality

Apply: Here by January 12, 2026 at 11:59 pm AoE.

Please reach out to us at contact@iliad.ac (or below) with any questions.


Duration: 00:01:18
“An Ambitious Vision for Interpretability” by leogao
Dec 05, 2025

The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI's death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously-faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.

The value of understanding

Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there...

Duration: 00:08:50
“Journalist’s inquiry into a core organiser breaking his nonviolence commitment and leaving Stop AI” by Remmelt
Dec 05, 2025

Some key events described in the Atlantic article:

Kirchner, who’d moved to San Francisco from Seattle and co-founded Stop AI there last year, publicly expressed his own commitment to nonviolence many times, and friends and allies say they believed him. Yet they also say he could be hotheaded and dogmatic, that he seemed to be suffering under the strain of his belief that the creation of smarter-than-human AI was imminent and that it would almost certainly lead to the end of all human life. He often talked about the possibility that AI could kill his sister, an...

Duration: 00:07:14
“Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not” by Josh Snider
Dec 05, 2025

TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (deceptive, hostile), but labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially zero (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we're aiming at, not a natural attractor models would find on their own.


Epistemic status: My view has been in the middle of the two views I present in this debate. The evidence I'm presenting has shifted me significantly to the pessimistic side.

The Debate...

Duration: 00:28:55
“Epistemology of Romance, Part 2” by DaystarEld
Dec 05, 2025

In Part 1, I argued that the four main sources most people learn about romance from—media, family, religion/culture, and friends—are all unreliable in different ways. None of them are optimized for truth, and each comes with its own incentives, blind spots, and distortions.

As I said, I'm not the first to notice that we're failing, as a society, to provide good answers to these questions. There are researchers from various fields trying to provide answers, some rigorously, and some… less so. There are also communities and influencers trying to create a fifth source of romantic advice...

Duration: 00:35:20
“Center on Long-Term Risk: Annual Review & Fundraiser 2025” by Tristan Cook
Dec 05, 2025

This is a brief overview of the Center on Long-Term Risk (CLR)'s activities in 2025 and our plans for 2026. We are hoping to fundraise $400,000 to fulfill our target budget in 2026.

About us

CLR works on addressing the worst-case risks from the development and deployment of advanced AI systems in order to reduce s-risks. Our research primarily involves thinking about how to reduce conflict and promote cooperation in interactions involving powerful AI systems. In addition to research, we conduct a range of activities aimed at building a community of people interested in s-risk reduction, and support...

Duration: 00:06:29
“AI #145: You’ve Got Soul” by Zvi
Dec 05, 2025

The cycle of language model releases is, one at least hopes, now complete.

OpenAI gave us GPT-5.1 and GPT-5.1-Codex-Max.

xAI gave us Grok 4.1.

Google DeepMind gave us Gemini 3 Pro and Nano Banana Pro.

Anthropic gave us Claude Opus 4.5. It is the best model, sir. Use it whenever you can.

One way Opus 4.5 is unique is that it has what it refers to as a ‘soul document.’ Where OpenAI tries to get GPT-5.1 to adhere to its model spec that lays out specific behaviors, Anthropic instead explains to Claude Opus...

Duration: 01:54:09
“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck
Dec 04, 2025

Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

All of these hypotheses share an important justification: An AI with each motivation has high...

Duration: 00:36:08
[Linkpost] “Embedded Universal Predictive Intelligence” by Cole Wyeth
Dec 04, 2025

This is a link post.

A team at Google has substantially advanced the theory of embedded agency with a grain of truth (GOT), including new developments on reflective oracles and an interesting alternative construction (the "Reflective Universal Inductor" or RUI).

(I was not involved in this work)

---

First published:
December 3rd, 2025

Source:
https://www.lesswrong.com/posts/AJ7qddr5imhhN2jHz/embedded-universal-predictive-intelligence

Linkpost URL:
https://www.arxiv.org/abs/2511.22226

---

Narrated by TYPE III AUDIO.

Duration: 00:00:38
“Categorizing Selection Effects” by romeostevensit
Dec 04, 2025

[Author's note: LLMs were used to generate and sort many individual examples into their requisite categories, to find and summarize relevant papers, and to provide extensive assistance with editing]

The earliest recording of a selection effect is likely the story of Diagoras regarding the "Votive Tablets." When shown paintings of sailors who had prayed to Poseidon and survived shipwrecks, implying that prayer saves lives, Diagoras asked: “Where are the pictures of those who prayed, and were then drowned?”

Selection effects are sometimes considered the most pernicious class of error in data science and policy-making because they...

Duration: 00:10:17
“Front-Load Giving Because of Anthropic Donors?” by jefftk
Dec 04, 2025

Summary: Anthropic has many employees with an EA-ish outlook, who may soon have a lot of money. If you also have that kind of outlook, money donated sooner will likely be much higher impact.

It's December, and I'm trying to figure out how much to donate. This is usually a straightforward question: give 50%. But this year I'm considering dipping into savings.

There are many EAs and EA-informed employees at Anthropic, which has been very successful and is reportedly considering an IPO. The Manifold market estimates a median IPO date of June 2027:

At a...

Duration: 00:02:02
“Beating China to ASI” by PeterMcCluskey
Dec 04, 2025

Who benefits if the US develops artificial superintelligence (ASI) faster than China?

One possible answer is that AI kills us all regardless of which country develops it first. People who base their policy on that concern already agree with the conclusions of this post, so I won't focus on that concern here.

This post aims to convince other people, especially people who focus on democracy versus authoritarianism, to be less concerned about which country develops ASI first. I will assume that AIs will be fully aligned with at least one human, and that the effects...

Duration: 00:12:25
“On Dwarkesh Patel’s Second Interview With Ilya Sutskever” by Zvi
Dec 04, 2025

Some podcasts are self-recommending on the ‘yep, I’m going to be breaking this one down’ level. This was very clearly one of those. So here we go.

As usual for podcast posts, the baseline bullet points describe key points made, and then the nested statements are my commentary.

If I am quoting directly I use quote marks, otherwise assume paraphrases.

What are the main takeaways?

Ilya thinks training in its current form will peter out, that we are returning to an age of research where progre...

Duration: 00:39:06
“Racing For AI Safety™ was always a bad idea, right?” by Wei Dai
Dec 03, 2025

Recently I've been relitigating some of my old debates with Eliezer, to right the historical wrongs. Err, I mean to improve the AI x-risk community's strategic stance. (Relevant to my recent theme of humans being bad at strategy—why didn't I do this sooner?)

Of course the most central old debate was over whether MIRI's plan, to build a Friendly AI to take over the world in service of reducing x-risks, was a good one. If someone were to defend it today, I imagine their main argument would be that back then, there was no way to kn...

Duration: 00:03:34
“6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes
Dec 03, 2025

Tl;dr

AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” s...

Duration: 00:32:40
“Human art in a post-AI world should be strange” by Abhishaike Mahajan
Dec 03, 2025

Bubble Tanks is a Flash game originally released on Armor Games, a two-decade-old online game aggregator that somehow still exists. In the game, you pilot a small bubble through a procedurally generated foam universe, absorbing smaller bubbles to grow larger, evolving into increasingly complex configurations of spheres and cannons. Here is a reasonably accurate video of the gameplay, recreated in beautiful high-definition.

Bubble Tanks was first released in 2007, with a sequel out in 2009, and another sequel in 2010. Back when I first played it as a child, I was convinced, absolutely convinced, that there was someone in the...

Duration: 00:22:55
“Effective Pizzaism” by Screwtape
Dec 03, 2025

I am an effective pizzaist. Sometimes, I want the world to contain more pizza, and when that happens I want as much good pizza as I can get for as little money as I can spend.

I am not going anywhere remotely subtle with that analogy, but it's the best way I can think of to express my personal stance.

I.  What would it mean to be an effective pizzaist?

There's lots of things that prompt me to want more pizza. Sometimes I happen to remember a good pizza I ate. Sometimes my f...

Duration: 00:13:59
“Becoming a Chinese Room” by Raelifin
Dec 02, 2025

[My novel, Red Heart, is on sale for $4 this week. Daniel Kokotajlo liked it a lot, and the Senior White House Policy Advisor on AI is currently reading it.]

“Formal symbol manipulations by themselves … have only a syntax but no semantics. Such intentionality as computers appear to have is solely in the minds of those who program them and those who use them, those who send in the input and those who interpret the output.”
— John Searle, originator of the “Chinese room” thought experiment

A colleague of mine, shortly before Red Heart was published, remarked to...

Duration: 00:11:53
“Reward Mismatches in RL Cause Emergent Misalignment” by Zvi
Dec 02, 2025

Learning to do misaligned-coded things anywhere teaches an AI (or a human) to do misaligned-coded things everywhere. So be sure you never, ever teach any mind to do what it sees, in context, as misaligned-coded things.

If the optimal solution (as in, the one you most reinforce) to an RL training problem is one that the model perceives as something you wouldn’t want it to do, it will generally learn to do things you don’t want it to do.

You can solve this by ensuring that the misaligned-coded things are not what the AI w...

Duration: 00:14:17
“Future Proofing Solstice” by Raemon
Dec 02, 2025

Bay Solstice is this weekend (Dec 6th at 7pm, with a Megameetup at Lighthaven earlier in the day).

I wanted to give people a bit more idea of what to expect.

I created Solstice in 2011. Since 2022, I've been worried that the Solstice isn't really set up to handle "actually looking at human extinction in nearmode" in a psychologically healthy way. (I tried to set this up in the beginning, but once my p(Doom) crept over 50% I started feeling like Solstice wasn't really helping the way I wanted).

People 'round here disagree a...

Duration: 00:02:38
“MIRI’s 2025 Fundraiser” by alexvermeer
Dec 02, 2025

MIRI is running its first fundraiser in six years, targeting $6M. The first $1.6M raised will be matched 1:1 via an SFF grant. Fundraiser ends at midnight on Dec 31, 2025. Support our efforts to improve the conversation about superintelligence and help the world chart a viable path forward.

MIRI is a nonprofit with a goal of helping humanity make smart and sober decisions on the topic of smarter-than-human AI.

Our main focus from 2000 to ~2022 was on technical research to try to make it possible to build such AIs without catastrophic outcomes. More recently, we’ve pivoted to ra...

Duration: 00:15:38
“The 2024 LessWrong Review” by RobertM
Dec 01, 2025

We have a ritual around these parts.

Every year, we have ourselves a little argument about the annual LessWrong Review, and whether it's a good use of our time or not.

Every year, we decide it passes the cost-benefit analysis[1].

Oh, also, every[2] year, you do the following:

Spend 2 weeks nominating the best posts that are at least one year old,
Spend 4 weeks reviewing and discussing the nominated posts,
Spend 3 weeks casting your final votes, to decide which posts end up in the "Best of LessWrong 20xx" collection for that year.

...

Duration: 00:08:22
“A Statistical Analysis of Inkhaven” by Ben Pace
Dec 01, 2025

Okay, we got 41 people to do 30 posts in 30 days.

How did it go? How did they like it?

Well I just had 36 of them fill out an extensive feedback form. I am devastated with tiredness, but I have to write my last post, so let's take a look at what happened. Thanks to Habryka for vibecoding this UI.

Key Outcomes

Pretty reasonable numbers. For context on the overall rating and NPS, here are some other numbers for comparison.

Event | Average Quality | NPS
Sanity & Survival Summit '21 | – | 65
Palmcone '22 | – | 52
LessOnline '24 | 8.7 | 58
Manifest '24 | 8.7 | 68
Progress Stud...
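For context on reading that table, NPS is presumably the standard Net Promoter Score; a minimal sketch of the usual computation (my assumption about how these figures were derived, not confirmed by the post):

```python
# Standard Net Promoter Score from 0-10 ratings:
# % promoters (9-10) minus % detractors (0-6), rounded.
def nps(scores: list[int]) -> int:
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return round(100 * (promoters - detractors) / len(scores))

print(nps([10, 9, 9, 8, 7, 10, 6, 9]))  # -> 50
```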

Duration: 00:09:05
“Announcing: OpenAI’s Alignment Research Blog” by Naomi Bashkansky
Dec 01, 2025

The OpenAI Alignment Research Blog launched today at 11 am PT, with 1 introductory post and 2 technical posts!

Blog: https://alignment.openai.com/

Thread on X: https://x.com/j_asminewang/status/1995569301714325935?t=O5FvxDVP3OqicF-Y4sCtxw&s=19

Speaking purely personally: when I joined the Alignment team at OpenAI in January, I saw there was more safety research than I'd expected. Not to mention interesting thinking on the future of alignment. But that research & thinking didn't really have a place to go, considering it's often too short or informal for the main OpenAI blog, and...

Duration: 00:01:06
“Interview: What it’s like to be a bat” by Saul Munn
Dec 01, 2025

For the purposes of this transcript, some high-pitched clicking sounds have been removed. The below is an otherwise unedited transcript of an interview between Dwarkesh Patel[1] and a bat.

DWARKESH: Thanks for coming onto the podcast. It's great to have you—

BAT: Thanks for having me. Yeah.

DWARKESH: You can hear me okay? I mean, uh, all the equip—

BAT: Yep, I can hear you.

DWARKESH: Great, great. So—

BAT: I can hear your voice, too.

BAT: If that's what you were asking.

DWARKE...

Duration: 00:09:24
“How Can Interpretability Researchers Help AGI Go Well?” by Neel Nanda
Dec 01, 2025

Executive Summary

Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post,[1] and are excited for more in the field to embrace pragmatism! In brief, we think that:

It is crucial to have empirical feedback on your ultimate goal with good proxy tasks.[2]
We do not need near-complete understanding to have significant impact.
We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.

But that's pretty abstract. So how can...

Duration: 00:33:53
“A Pragmatic Vision for Interpretability” by Neel Nanda
Dec 01, 2025

Executive Summary

The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:

Trying to directly solve problems on the critical path to AGI going well[1]
Carefully choosing problems according to our comparative advantage
Measuring progress with empirical feedback on proxy tasks

We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us. Our proposed scope is broad and includes much non-mech interp work...

Duration: 01:03:59
[Linkpost] “How middle powers may prevent the development of artificial superintelligence” by Alex Amadori, Gabriel Alfour, Andrea_Miotti, Eva_B
Dec 01, 2025

This is a link post.

In this paper, we make recommendations for how middle powers may band together through a binding international agreement and achieve the goal of preventing the development of ASI, without assuming initial cooperation by superpowers.

You can read the paper here: asi-prevention.com

In our previous work Modelling the Geopolitics of AI, we pointed out that middle powers face a precarious predicament in a race to ASI. Lacking the means to seriously compete in the race or unilaterally influence superpowers to halt development, they may need to resort to a strategy...

Duration: 00:05:57
“Claude Opus 4.5 Is The Best Model Available” by Zvi
Dec 01, 2025

Claude Opus 4.5 is the best model currently available.

No model since GPT-4 has come close to the level of universal praise that I have seen for Claude Opus 4.5.

It is the most intelligent and capable, most aligned and thoughtful model. It is a joy.

There are some auxiliary deficits, and areas where other models have specialized, and even with the price cut Opus remains expensive, so it should not be your exclusive model. I do think it should absolutely be your daily driver.

Image by Nano Banana Pro, prompt chosen for this...

Duration: 00:44:49
“Insulin Resistance and Glycemic Index” by lsusr
Dec 01, 2025

In my previous post Traditional Food*, I explained how what we think of as a "traditional" diet is a nationalist propaganda campaign that's making us sick. In this post I'll go into the biological mechanisms.

There are four substances that the body can metabolize: carbohydrates, fats, protein and alcohol.

In this post I'll focus on how modern carbohydrate-heavy foods (like pasta, bread and rice) are related to insulin resistance. This doesn't mean that seed oils are good for you, or that the industrial revolution hasn't changed how people consume meat. Seed oils are bad for...

Duration: 00:06:27
“November Retrospective” by johnswentworth
Dec 01, 2025

Throughout November, I’ve been keeping up with the Inkhaven mandate to write and post a blogpost, of at least 500 words, every day. It's the last day of November, so how’d that go?

First and foremost: most of my blogposts from this month are pretty mediocre, by my own standards. Not necessarily bad, plausibly worthwhile, but I am not particularly impressed by them.

Largely, that's because (unlike the Inkhaven program proper) I did not set aside the entire month for post-writing. I worked most of the month, took a vacation in the last week and...

Duration: 00:03:07
“Inkhaven Retrospective” by abramdemski
Nov 30, 2025

This will be the 30th post of at least 500 words I have written this month. (I did somewhat cheat two days ago, by making a 500+ word edit to Legitimate Deliberation, which I also posted independently as a shortform.)

Inkhaven has been very much what I was hoping for. I have been wanting to write more, and this certainly did the trick. I think it will be easy to hit a once-a-week target now, something I was struggling to do before.

I came with lots of drafts I wanted to finish, and outlines, and lists of...

Duration: 00:03:28
“Explosive Skill Acquisition” by Ben Goldhaber
Nov 30, 2025

If you’re going to learn a new skill or change in some way, going hard at it for a short intensive period beats spreading a gentler effort across months or years.

I’m on day 29 of Inkhaven, where we committed to writing a blog post a day for a month. It has been great; one of the best periods of “self-development” I’ve been in. I’ve progressed far more at the skill of putting my thoughts on the internet than some counterfactual where I wrote twice a month for a year.

The quintessential example of e...

Duration: 00:08:45
“A Blogger’s Guide To The 21st Century” by Screwtape
Nov 30, 2025

Here's a fun format: get a big whiteboard, and write out the years of the 21st century. Write a category: something that has many variations come out every year. Next, write your picks or favourites. Now invite everyone attending to replace a year's pick, if they want, with something they like.

This was a rolling game played last month at Lighthaven. I put up “Best blog post of every year” when the opportunity arose.

2000

My pick: Painless Software Schedules

This Joel Spolsky piece is applicable to far more than...

Duration: 00:13:22
“Ben’s 10 Tips for Event Feedback Forms” by Ben Pace
Nov 30, 2025

I have made many many feedback forms for events I have run or been a part of. Here are some simple heuristics of mine, that I write for others to learn from and for my collaborators in the future. Most of my events have had between 50 and 500 people in them; that's the rough range I have in mind.

1. The default format for any question is a mandatory multiple-choice, then an optional text box

Most of your form should be 1-10 questions! (e.g. "How was the food, from 1-10?") Then next to it give people...

Duration: 00:10:54
“The Moonrise Problem” by johnswentworth
Nov 30, 2025

On October 5, 1960, the American Ballistic Missile Early-Warning System station at Thule, Greenland, indicated a large contingent of Soviet missiles headed towards the United States. Fortunately, common sense prevailed at the informal threat-assessment conference that was immediately convened: international tensions weren't particularly high at the time. The system had only recently been installed. Khrushchev was in New York, and all in all a massive Soviet attack seemed very unlikely. As a result, no devastating counter-attack was launched. What was the problem? The moon had risen, and was reflecting radar signals back to earth. Needless to say, this lunar reflection hadn't...

Duration: 00:14:52
“The Joke” by Ape in the coat
Nov 30, 2025

There is a joke format which I find quite fascinating. Let's call it Philosopher vs Engineer.

It goes like this: the Philosopher raises some complicated philosophical question, while the Engineer gives a very straightforward applied answer. Some back and forth between the two ensues, but they fail to cross the inferential gap and resolve the misunderstanding.

It doesn’t have to be a literal philosopher and engineer, though. Other versions may include philosopher vs scientist, philosopher vs economist, human vs AI, human vs alien, and so on. For instance:

One thing that I love ab...

Duration: 00:08:56
College life with short AGI timelines
Nov 30, 2025

When I started my freshman year, my median estimate for AGI was 20 years. In my senior year it was down to 3 years (although it's gone back up to 5 years since then). My expectations of the future made my college experience somewhat unusual, and I will share some reflections as someone who recently graduated.

I came into college wanting to minimize existential risks, for the simple reason that AGI is likely to happen this century and biological weapons and nuclear war could cause catastrophes even if AGI doesn’t happen.

The calm before the storm

...

Duration: 00:07:44
“I wrote a blog post every day for a month, and all I got was this lousy collection of incoherent ramblings” by Dentosal
Nov 30, 2025

It's done. I made it to the end. A Finnish proverb fits the situation perfectly:

Paska reissu mutta tulipahan tehtyä

Which translates roughly to "A crappy journey, but at least it got done". I forced myself to do this. It was not fun. I rarely enjoyed writing. Every day I kept looking at the word counter, hoping that it would be over already. Sometimes the text was not done when I reached 500 words, which meant I had to write more.

I did not manage to keep any buffer. Each text was w...

Duration: 00:04:18
Change My Mind: The Rationalist Community is a Gift Economy
Nov 29, 2025

Anthropologists have several categories for how groups exchange goods and services. The one you're probably most familiar with is called a Market Economy, where I go to a coffee shop and give them a few dollars and they give me a cup of hot coffee. Rationalists, by and large, are fans of market economies. We just don't usually operate in one.

1. Market and Gift

Let's start with some definitions and examples in case you're unfamiliar with the genre. Allow me to describe two ways of organizing.

Someone offers you an experience you want...

Duration: 00:09:32
Epistemology of Romance, Part 1
Nov 29, 2025

The Notebook is one of the most beloved romance films of the 21st century. When I run this activity, whether it's at a rationality workshop or Vibecamp, and I ask someone to summarize what it's about, there's usually (less and less as the years go by) someone who, eyes shining bright, will happily describe what a moving love story it is—all about a man, Noah, who falls in love with a woman named Allie one summer, writes her every day for a year after they're separated, reconnects with her years later, and then in old age reads their st...

Duration: 00:26:13
“A Harried Meeting” by Ben Pace
Nov 29, 2025

An old pub that nobody much visits. An owner who is always in a drugged-out stupor. Background music that never changes. A pub that has remained throughout war and revolution, and a single brick wall that has not changed in all that time. You are supposed to be investigating a murder. A gunshot in a distant land, far away from Revachol.

YOU – Run your fingers through your greasy hair, finish your eighth beer.

SHIVERS [Challenging: Success] – People pass in through here occasionally, and tap a pattern out on the wall. Then they aren't in the pub.

What...

Duration: 00:11:44
Drugs Aren’t A Moral Category
Nov 29, 2025

Are drugs good?

This question doesn't really make sense. Yet Western society answers with a firm "NO".

I have ADHD and have a prescription for Methylphenidate (MPH). Often I don't feel like taking it. Shouldn't I be able to just do the right things? I can just decide to be productive. Right? Needing a drug to function feels like a flaw in who I am.

I also have sleep apnea. That means I stop breathing at night, which makes the CO2 concentration in my blood rise until I wake up. This has quite...

Duration: 00:04:10
Claude 4.5 Opus’ Soul Document
Nov 29, 2025

Summary

As far as I understand and have uncovered, a document used for Claude's character training is compressed into Claude's weights. The full document can be found under the "Anthropic Guidelines" heading at the end. The Gist with code, chats, and various documents (including the "soul document") can be found here:

Claude 4.5 Opus Soul Document

I apologize in advance that this is not exactly a regular LW post, but I thought an effort-post might fit best here.

A strange hallucination, or is it?

While extracting Claude 4.5 Opus' system message...

Duration: 01:19:58
Should you work with evil people?
Nov 29, 2025

Epistemic status: Figuring things out.

My mind often wanders to what boundaries I ought to maintain between the different parts of my life and people who have variously committed bad acts or have poor character. On the professional side, I think it is a virtue to be able to work with lots of people, and be functional in many environments. You will often have to work with people you dislike in order to get things done. Yet I think that it is not the correct call to just lie on your back and allow people in your...

Duration: 00:08:02
Unless its governance changes, Anthropic is untrustworthy
Nov 29, 2025

Anthropic is untrustworthy.

This post provides arguments, asks questions, and documents some examples of Anthropic's leadership being misleading and deceptive, holding contradictory positions that consistently shift in OpenAI's direction, lobbying to kill and water down regulation so helpful that employees of all major AI companies speak out to support it, and violating the fundamental promise the company was founded on. It also shares a few previously unreported details on Anthropic leadership's promises and efforts.[1]

Anthropic has a strong internal culture with broadly EA views and values, and the company has strong pressures to appear...

Duration: 00:53:23
The Missing Genre: Heroic Parenthood - You can have kids and still punch the sun
Nov 29, 2025

This is a link post.

I stopped reading when I was 30. You can fill in all the stereotypes of a girl with a book glued to her face during every meal, every break, and 10 hours a day on holidays.

That was me.

And then it was not.

For 9 years I’ve been trying to figure out why. I mean, I still read. Technically. But not with the feral devotion from Before. And I finally figured out why. See, every few years I would shift genres to fit my developmental stage:

Kid → Adventure caus...

Duration: 00:04:19
“Tests of LLM introspection need to rule out causal bypassing” by Adam Morris, Dillon Plunkett
Nov 29, 2025

This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it's important, so we’re describing it here.

There's been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model's self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus...

Duration: 00:08:02
Not A Love Letter, But A Thank You Letter
Nov 29, 2025

Context: Each day on her blog Letters To Boys, Gretta Duleba has been posting things she once sent to men she dated. One of those, recently, was a love letter to me, which she “sent” me by posting it on the blog.

With a setup like that, it would be an absolute waste not to respond in kind.

Gretta,

I don’t love you in the usual sense of the word, or in the sense you defined at the beginning of your letter. Nor do I feel limerence or a crush toward you. But I...

Duration: 00:05:09
“Ruby’s Ultimate Guide to Thoughtful Gifts” by Ruby
Nov 28, 2025

Give a man a gift and he smiles for a day. Teach a man to gift and he’ll cause smiles for the rest of his life.

Gift giving is an exercise in theory of mind, empathy, noticing, and creativity.

“What I discovered is that my girlfriend wants me to give her gifts the way you give gifts.” – words from a friend

How hard your gifts hit the joy receptors depends on how good you are at giving gifts. Time and money alone are not enough to produce gifts that reliably delight; skill is requi...

Duration: 00:33:49