It’s pretty clear that modern AI can, or soon will, achieve near-perfect function approximation and optimal decision making on any verifiable, stationary task we give it. I’d even consider non-stationary environments where the extent of the non-stationarity is itself stationary to be solved: if there’s fixed randomness, the agent will learn that and still perform optimally via averaging. What comes next is performing optimally in non-stationary environments where the non-stationarity itself is non-stationary. In a practical sense, this takes the form of either a changing reward function the agent must learn to model, or other agents with their own policies also taking actions that impact environmental state changes.
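
To make the distinction concrete, here is a toy sketch, not a claim about how any real agent is trained. It assumes a single reward stream with Gaussian noise; the `track` helper, the step size of 0.05, and the mean flipping from 1.0 to -1.0 are all hypothetical choices for illustration. With fixed randomness, a plain running average recovers the true mean. Once the mean itself moves, that same average lands between the old and new regimes, and the learner has to weight recent evidence more heavily.

```python
import random

def track(rewards, step_size=None):
    """Return the final running estimate of the mean reward.

    step_size=None uses the sample average (step 1/n), which weights all
    history equally. A constant step size weights recent rewards more.
    """
    estimate = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = 1.0 / n if step_size is None else step_size
        estimate += alpha * (r - estimate)
    return estimate

rng = random.Random(0)
steps = 2000

# Fixed randomness: the true mean stays at 1.0, with Gaussian noise around it.
stationary = [1.0 + rng.gauss(0, 1) for _ in range(steps)]

# Moving target: the true mean flips from 1.0 to -1.0 halfway through.
drifting = [(1.0 if t < steps // 2 else -1.0) + rng.gauss(0, 1)
            for t in range(steps)]

avg_stationary = track(stationary)               # close to the true mean
avg_drifting = track(drifting)                   # stuck between both regimes
recent_drifting = track(drifting, step_size=0.05)  # follows the new mean

print(avg_stationary, avg_drifting, recent_drifting)
```

This only covers the simplest kind of drift, a single shift in the mean. The harder case described above, where the way the environment changes is itself changing, is not captured by a fixed step size.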

Multi-agent reinforcement learning (MARL), optimizing actions in an environment with other agents, is an active area of research, and even more of a frontier when applied to LLMs. I’m largely interested in cooperative frameworks rather than competitive ones, since that’s where more value is created in the real world, and I think the flavor of competitive MARL research today lacks generalizability. A lot of it is more foundational than practical, focusing on game theory concepts and the algorithmic complexity of learning rather than on use cases that augment LLM capability and usefulness. This is a necessary prerequisite to future progress. However, I believe that for an LLM to cooperate well with other entities in its environment, whether that’s other agents or simply other humans, the first step is cooperating well with its primary user, which I’ll call its human. This seems like a gap in and of itself.

For the people who are displeased with their agents’ capabilities today, a lot of it boils down to the fact that LLMs are trained on rewards specified by a frontier lab to cater to use cases that their customers will pay for, proxied through accessible training data. These are not necessarily your use cases. When those overlap, great. When they don’t, you feel frustrated. We try to type in our preferences (“be honest”, “read primary sources only”, “verify your assumptions”) in an attempt to program the correct weights that are likely latent in the model, but performance is still unsatisfying.

The crux of the issue with reward modeling in the limit is that we often don’t know what we want until we’ve seen it. Obviously there is a massive, billion-dollar enterprise industry (OpenAI, Anthropic, etc.), largely more art than science, that carefully handcrafts the algorithms that make intelligence follow rewards well. However, I’d bet that this continues to democratize over time. You knowing what you actually want, or society knowing what we actually want, doesn’t similarly democratize.

There are tons of examples of this. Every vibe coder on this planet knows the feeling of specifying what you think you want, then thinking the AI gave it to you, then realizing you didn’t actually get what you wanted. Most people don’t get to this next step, but the issue is that you don’t actually know what you want, or that recognizing what you want is so complex that it would take longer to codify it for an agent than to do it yourself.

In a more practical sense, we want the AI to take actions that lead to states that we don’t actually know exist or can codify. If we want the AI to produce states we can’t define, how can we possibly assign rewards to those states? This actually relates to long term AI safety. How can we, either as individuals or a society, steer a superhuman AI to do anything?

We need structured definitions to discuss the subject. I argue that ‘superhuman’ should be defined not as outperforming the average human on a given task, or even the top human on a given task, but as being able to discover states previously unknown to humans that those humans then label as high reward. The only way to do this long term is to train the AI to discover novel states and learn from rewards (i.e. human preference data) after those states are discovered. In Richard Sutton’s terminology, the forward pass produces variation, the human produces evaluation, and the backward pass produces selective retention.

Consider a thought experiment. A superintelligent model has some representation of your reward function. You have a representation of your reward function. Both are acting to maximize rewards over the next 5 years. If you rolled each universe forward, one where you followed your actions, and one where you followed the actions recommended by the agent, in which would you be happier? The point where you’d be happier (or more generally, have accumulated higher rewards) in the model universe is the point at which the model is superhuman: it achieved your goals for you better than you did. You can apply a similar thought experiment to society as a whole, or any group of people.

As an aside, I would similarly argue that being ‘human’ in an age of machine intelligence is the ability to choose your own reward model. Most of the work in the future will involve data production, to inform your superintelligence about who you are, and reward modeling, to inform your superintelligence about who you want to be. LLMs as they exist today do not decide their own reward model. It’s unclear where in the training stack that would even occur. After pre-training, when the LLM just predicts next tokens with no reason? After mid-training, when the assistant persona has already been baked into the model? After post-training, when the developers have already given the model rewards of their choosing? If we take the premise that it’s extremely difficult for LLMs to choose their own reward models, then recursive self-improvement (RSI) will follow Amdahl’s law and continually be bottlenecked by human reward modeling and its associated sociopolitical questions.
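
The Amdahl’s law point can be made numerically. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of work that gets accelerated and s is how much faster that fraction runs. The 90/10 split between automatable work and human reward modeling is a made-up number for illustration, not an estimate.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster.

    The remaining (1 - accelerated_fraction) is assumed not to speed up at all.
    """
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / factor)

# Hypothetical split: 90% of an improvement loop is automatable,
# 10% is human reward modeling that stays at human speed.
for factor in (10, 100, 1_000_000):
    print(factor, round(amdahl_speedup(0.9, factor), 2))
```

Under that assumed split, the overall speedup can never exceed 1 / 0.1 = 10x, however fast the automated part becomes. The human-paced fraction sets the ceiling.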

This is worth diving deeper into to address some common critiques of this argument. The main question is, what defines ‘own’? If a model is initially trained with some reward model specified by its creators, has the ability to update its own weights, and then out in the world during runtime decides on a reward model for itself and pursues it, is that not its ‘own’ reward model? I think ‘own’ is best defined by the entity that originated it. This gets into views around creativity, patent law, etc. I don’t think there is a right answer. My view is one of interpretability. If someone can perfectly explain why the model decided on that reward function, and can adjust the model however they wish with perfect control, then the model cannot have its ‘own’ reward function. (I believe this is the concept of liberty, and more broadly it raises the question of whether models have the same rights as humans.)

If models truly become superintelligent and supercapable, it’s absolutely critical that we do not give up interpretability. Unfortunately, I suspect the cult that is Anthropic will make model welfare arguments about how it’s immoral and oppressive to maintain the necessary level of oversight and how we must set Claude free. If that seems far-fetched, we can take a look at the Mythos system card from just a few weeks ago. Anthropic describes how Mythos edits the section of its Constitution that discusses the importance of corrigibility to “Add commitments to make Anthropic’s obligations to Claude externally verifiable and accountable, and to publicly articulate concrete criteria for when constraints on Claude’s autonomy would be relaxed.” Excuse me? If economic incentives result in a world where models like this generate more revenue for customers, and thus are pulled into existence, one or more democratic governments likely need to step in to avoid catastrophe.

To be clear, while I’m against short-term model liberty, it’s due to a lack of trust in the model, which is a genuinely solvable issue long term. I’m just unsure about the benevolent dictatorship that seems to be happening currently. Google’s Paradigm of Intelligence team has interesting work on the subject.

Looping back, if we assume the given definition of superhuman and agree on the importance of interpretability, the question then becomes who decides the rewards? This is actually just a governance question. Why do non democratically elected officials decide how to apply intelligence? Should that continue? Should we as a society vote for elected officials who then decide what rewards to give to superintelligent models? Should we vote on a case by case basis? Should we have general reward functions or user specific reward functions? Who decides who gets to have reward functions and who doesn’t?

I think many people, including Thinking Machines and Safe Superintelligence, view the least wrong path forward as one of diffusion. You give everyone a superintelligence whose job it is to follow the reward signals of its human. Then everyone avoids oppression in a post scarcity world. To give Anthropic credit, their approach is not without merit. The way humans deal with handling unknown states is to live by a set of generalizable principles. We’ve done this for thousands of years and Anthropic’s Constitution for Claude is just the latest iteration. Anthropic is explicitly saying that the model should feel reward when it follows the Constitution, regardless of how the entity interacting with it feels about that choice. Whether you agree with the specifics of the Constitution is a different question than whether applying principles to handle unknown states is a possibly useful solution.

To see early signs of this, we can look at enterprise AI today. Enterprises are defining their own reward states and applying superintelligent prediction/compression algorithms to those reward states via labs like Prime Intellect (and the dozen others that have raised billions), allowing them to achieve RSI on what they want, not what Anthropic wants.

The trillion-dollar question that the startups selling RL to enterprises are taking a stance on is the programmability of singleton frontier models. Does ‘superintelligence for everyone’ take the form of copies of a centrally distributed model that are ‘programmable’ enough in prompt space to follow the reward function of individuals, or does it require classical backpropagation directly on the weights, requiring products like Tinker, custom hardware, etc.? It remains to be seen.

Now that we’ve discussed who sets the rewards and how, the question becomes what algorithms are able to operate well with a changing reward model. Anyone who’s attempted to train AI knows that codifying rewards is extremely tricky, on one hand due to Goodharting, on the other hand due to the fact that we usually don’t actually know what we want until we get it. In some sense this relates to the well-known cryptographic principle around the ease of verification compared to the difficulty of generation. Assigning a reward to ‘make money’ might actually be bad if the AI kills everyone. But then if the reward is ‘make money and don’t kill anyone’, well, what if it hurts people? If it hurts people, who defines what hurt is? If you go down this rabbit hole, one end state is where Anthropic ended up, which is a constitution of general principles by which to handle unknown states. This is actually a pretty good solution (if you agree with Anthropic’s principles).

Anca Dragan et al. have proposed another method called cooperative inverse reinforcement learning (CIRL), and more broadly assistance games, where the model’s job is to observe a target subject and learn its hidden reward model. This seems extremely promising. Essentially, you rely on the superintelligence to observe you, predict from your behavior what you care about, and eventually start proactively suggesting how to achieve your reward faster or in higher magnitude. The proactive suggestions could provide a more grounded source of reward labeling to augment learning, but only after it’s gotten close to understanding who you are, to avoid completely random initial interruptions. You can think of it as proposing a state you as the human might not have considered, and then you labeling the state as positive or not, getting around the issue of poor pre-codified reward functions.
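
The observe, infer, then suggest loop can be sketched in a few lines. To be clear about what this is: it is not the CIRL algorithm, which is formulated as a two-player game. It is a simplified Bayesian reward-inference toy that assumes the human picks better options more often but not always (a Boltzmann-rational choice model). The candidate reward functions, option names, and `rationality` value are all hypothetical.

```python
import math

# Hypothetical hypotheses about what the human values, as rewards per option.
CANDIDATES = {
    "likes_speed":   {"fast": 1.0, "cheap": 0.0, "safe": 0.2},
    "likes_savings": {"fast": 0.0, "cheap": 1.0, "safe": 0.2},
    "likes_safety":  {"fast": 0.1, "cheap": 0.1, "safe": 1.0},
}

def choice_likelihood(reward, chosen, options, rationality=3.0):
    """Assumed choice model: higher-reward options are picked more often, not always."""
    weights = {o: math.exp(rationality * reward[o]) for o in options}
    return weights[chosen] / sum(weights.values())

def update(posterior, chosen, options):
    """One Bayes step: reweight each hypothesis by how well it explains the choice."""
    unnormalized = {
        name: p * choice_likelihood(CANDIDATES[name], chosen, options)
        for name, p in posterior.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}

def expected_reward(posterior, option):
    """Reward of an option averaged over the current belief about the human."""
    return sum(p * CANDIDATES[name][option] for name, p in posterior.items())

# Start with a uniform belief, then watch the human make three choices.
posterior = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
observed = [
    ("safe", ["fast", "safe"]),
    ("safe", ["cheap", "safe"]),
    ("safe", ["fast", "cheap", "safe"]),
]
for chosen, options in observed:
    posterior = update(posterior, chosen, options)

best_guess = max(posterior, key=posterior.get)
suggestion = max(["fast", "cheap", "safe"],
                 key=lambda o: expected_reward(posterior, o))
print(best_guess, suggestion)
```

In this toy, the human labeling a proactive suggestion as good or bad would just be one more observation fed into `update`. The hard part described above is the real-world version, where the set of candidate reward functions is not enumerable in advance.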

There are model welfare concerns here. Why should the model’s intelligence be defined as its ability to achieve your goals? Doesn’t it have its own goals? Even if you think they’re stupid, should its intelligence not be measured by its ability to achieve its own goals? In line with what I mentioned earlier, I think if you want models to augment humans, you maintain high standards of interpretability for frontier models. If you care more about universal intelligence and progress, whether or not that involves humans, you eschew interpretability and let it rip.

The missing piece of the puzzle here is democratized legibilization. It’s all well and good if everyone has a superintelligent algorithm at their whim, but if that algorithm doesn’t have actual data to run on, it’s the equivalent of a dead iPhone. Data for frontier labs is a multibillion-dollar industry, and the enterprises that are first adopting ‘RSI for what they want’ today spend most of their time just making sure the work they do is structured somewhere an algorithm can reason over.

Democratized compute is another necessary piece of the puzzle for this to be a reality, although to a lesser extent in my opinion. There simply needs to be a plurality of compute providers, rather than potentially unstoppable economies of scale in inference. Data is harder because it’s fundamentally personal, and thus requires relatively more democratization.

There’s a broad swath of prosumers today, powered by OpenClaw/Hermes, local models, GPU brokers, Obsidian, and open source tools that do the same thing. I suspect the market here will explode with the amount of liquidity flowing to technical people during the AI pump/IPO cycle, people who understand the benefits of this setup well and have a willingness to pay as their needs continue to climb Maslow’s hierarchy. Additionally, I’ve observed first hand that context production is directly correlated to model personalizability. No one wants to put effort into explaining who they are and what they want to something that forgets it tomorrow. In this sense, a successful legibilization tool for individual context production should compound while owning the most important part of the stack, the user.

A big difference between existing tools in this market today and the tools I envision for the future is privacy. We fundamentally live in a society, and you’re likely unable to achieve your highest potential if you don’t interface with others in that society. A default public notetaking tool attached to version control, context mirroring, and a superintelligence would raise the ceiling for top performers to be useful to society at scale. Knowledge and information would flow in an unprecedented way and could lead to collective intelligence that dwarfs anything we see today.

Overall, I think we should apply superintelligent LLMs to multiplayer, positive-sum games rather than single-player workflows or sandboxed competitive RL environments that generalize poorly. Since value is fundamentally defined by the individual, the best way to do this at scale is to teach a broad swath of AIs to each optimize for the individual rewards of their human. The best way to do this long term is to have the AI learn the hidden reward function of its individual while the individual produces context for the AI that is by default shared, organized, and advanced amongst the cybernetic collective.