Thoughts

pretty clear that any collective intelligence frameworks are marred by groupthink which is a result of treating context as ground truth which may be related to lack of causality understanding and/or environmental prediction in frontier LLMs. Google Pi Team > ^b203be
Natural language autoencoders might be an early form of plurality. If we can see the internal states of the smartest model using weaker models, deception is much much harder.
https://gemini.google.com/app/33869e6590f71ffa
1. Reminds me of disagree and commit since infinite regress on a turn by turn basis leads to an unstable policy, so you need to commit to a strategy before hand and execute on it.
2. Also seems to be related to meta RL since you set up multiple strategies each with a set of actions in practice
3. Reward hacking as just mechanism design
4. Will theory of mind just emerge from self play? Maybe not when applied to coding/math, but applied to what? Relates to randall’s reward models thesis since the question is what do you reward? Related to benchmarking as the bottleneck, even harnesses with sensors/actuators are downstream of what the goal even is
How was alphago trained, or any adversarial self play, if theory of mind is recursive??
Campbell described the issue with trading as godel loops/halting problem. I framed it as world modeling. You basically need to predict how your choice of action impacts the actions of others in the environment. Basically the argument is that if agent 1 is predicting agent 2, and agent 2 has an accurate prediction of agent 1, then the algorithm for what to do never ends if agent 1 is trying to win. But what if agent 1 is trying to collaborate? Or what if agent 2’s prediction is inaccurate? For the latter question, it feels like the recursion would degrade over time until agent 1’s actual action differs from agent 2’s prediction, which would end the recursion.
1. Embedded agency paper by google is mechanism design on which algorithm/strategy to pursue rather than which actions to take?
2. Multi agent / embodied agent is a forcing function for which strategies to pursue rather than which actions to pursue, but can’t choose strategies unless you already can learn strategies well, which themselves are a set of actions? Kind of relates to meta RL lecture from chelsea finn
Is turn based -> time based similar to how multi agent RL necessitates an update to the state/action space? What exactly is that update? Does it make sense? The similarity is that you need to operate in a non stationary environment rather than a stationary environment. Presumably the time based paradigm helps. But how is their training data actually structured?
1. https://gemini.google.com/app/33869e6590f71ffa
What does ‘multi agent’ actually mean? What are the actual questions to ask in the context of Slate?
1. The VC negotiation example is the best. The way I explained it to Eli felt satisfying. Negotiations could take two orders of magnitude less time if agents were doing it. But my agent won’t properly represent my best interests. If 80% of its context is the VC arguing for a lower valuation, it will probably say “You’re right”. Prompt engineering could help, but it can’t be trusted to maintain information boundaries or properly negotiate.
Would a tool use dynamics model improve agentic behavior? Is this best plugged in as a model that the base LLM can call?
What is preventing everyone from having their own weights?
Continual learning as implemented by Trajectory feels like overfitting. Does overfitting solve a problem for people? In environments that are not very dynamic but specialized enough for there to be large room for improvement from base models. Comes back to whether fine tuning ever makes sense when frontier models exist. The space of frontier models operating poorly (OOD environments) + dynamic environments would need to be explored. Also relates to harness+model ‘cotraining’. Do environments exist where frontier open weight models that are able to be SDPO’d are cost effective at scale to serve to clients AND offer a 10x performance/cost/speed improvement for customers when compared to frontier models, which have their own trajectory of improvement?
Enterprises spend a ton of money on tokens. Any wedge there makes sense from a business perspective. Unlikely anyone will upfront give you their data to prove you can do better from a token perspective, in what cases will they? Also need to prove you can outscale or at least keep pace with frontier model improvement.
Slate feels like it came way easier/more naturally as an idea than current idea exploration. Why is that?
Chrome extension that tracks all API calls that go to LLMs, and implements tracing for your chats. Paid version hosts all the data for you. Extension to use the data to ____.
Alphago used PUCT (UCB variant), but its shown that posterior sampling > UCB on multi armed bandit, and posterior sampling approximates AIXI. Could you use posterior sampling on another agent, rather than a bandit, to determine which actions they will take given a state? Could you extend this to multiple agents?
Sutton states discovery is variation + evaluation + selective retention. Argues modern LLMs only have the first. Aligns with AIXI.
You can bootstrap an experiment by collaborating with the tools/platforms used in the experiment.
its interesting that if you first order try to learn something random, you’ll minimize error by predicting the mean, and your error will always be high. but if you second order try to learn something random, 5 models will agree on predicting the mean, which youll have learned, and you can focus on states where 5 learned models differ in their predictions, indicating actual uncertainty. Also relates to “starting up / cdev as prior updating / RL world modeling”, where single people get caught in first order noise, and ensembles can differentiate between higher error noise vs high error signal. Also overfitting and needing to keep, say, 9:1 ratio when updating online feels similar to avoiding chasing things that are hot as a startup failure mode https://gemini.google.com/app/c8127a6a633097eb
Seth Karten and Dajinar Hafner and Joey Hong are researchers to watch. https://jxihong.github.io/joeyhong/
Granola but for everything I watch + read + conversate with AI, with proactive, relevant search. Startup internalizes proactivity costs, so incentivized to optimize for outcomes.
If prompt space actually works, such as GEPA or PNLC, and memory is required for MARL, then maybe memory systems like Honcho actually make a ton of sense? I wrote up the post but did not want to give honcho full creds because clearly the concept of having some general world model and then giving in local context is different from tracking local context, even if the local context is about many players. But is PNLC just tracking human emotional states and responses in text space. But is that good enough?
Open question at the end of my memory post around continual learning and weight update efficacy
If anthropic is doing a ton of work on persona vectors and their causality, then they probably will be able to reason through the GEPA results (iterating on prompts to help or hurt performance) and then use that to improve the base model? But whats the track there exactly?
What is the compressed relationship between GEPA, which is iterating in prompt space, RLM, claude’s dynamic workflows, PNLC, etc?
Many valuable environments necessitate joint training, joint training is crucial for long term alignment and safety, the best joint training algorithm is V learning, which has the best sample efficiency when decentralized, therefore decentralized training will lead to frontier intelligence in the future. (obviously not rigorous CoT)
Intuitive explanation of test time scaling? Rich sutton says the two methods that arbitrarily scale with computation, and thus are most effective, are search and learning. He also says true superhuman intelligence is a function of discovery, which is variation, evaluation, and selective retention. If you consider chain of thought reasoning as building a block of output that can then be fed back into the model to produce another, progressing block of reasoning, then that implies there has been progress in reducing bias per block, and the goal is to reduce bias per block over time, and Noam Brown’s claim that intelligence per output token is the actual measure in that framework makes sense since it measures the performance of a single block, which theoretically could be scaled if given more compute, but there must be something to be said about why these per block superior systems arent overall the best if thats the case unless the team is specifically withholding compute. To be clear, it makes sense that train time scaling laws work due to subposition of different ‘components’ being able to spread out, and thus noise each other less, leading to clearer signal. I think another point here is that people with more neurons, and more connections between neurons, tend to be more intelligent. Is there some relation to this notion of subposition in test time scaling? And how does test time scaling between things like PUCT in alphago relate to LLMs? If PUCT is optimal exploration/exploitation, is CoT reasoning also exploration/exploitation?
1. There are other primitive examples like LLMs as monkeys where just putting a ton together improved performance. This is essentially test time compute. In some sense this is exploration/exploitation in real time to improve performance.
  1. https://docs.ag2.ai/latest/docs/blog/2025/04/16/Reasoning/#the-messy-reality-of-human-cognition “The process is iterative and often feels more like navigating a maze than walking a straight line.”
2. Noam seems to indicate that test time scaling was initially primarily a function of cost.
3. It seems like initial reasoning models came from GRPO. Is GRPO separate from RL?
4. Noam seems to indicate that a bottleneck they’re currently attempting to solve is parallelizing CoT, since it’s inherently serial so it’s slow. First order parallelization of CoT implies the same or worse intelligence per token but better intelligence per time, which only works if hardware supports it. Maybe feynman is about that? Or is rubin already about that? Interesting that he mentions serial as a problem but then focuses on compute/time? Those arent the same thing…
You want alignment since it implies long term game playing, which is good because a superintelligence playing short term games would defect/exploit.
Is mythos an indefinite optimist, definite optimist, indefinite pessimist, or definite pessimist?
1. Jakub posits that a lot of alignment work attempts to move from definite to indefinite, with a goal of indefinite optimism, with a fear of indefinite pessimism.
Post facto, the ability to learn a value function, or any function, given a ton of environmental data is pretty uninspiring/standard. Like yeah, you give it a ton of data and it maps the data well. Is that useful? For specific domains where the environment its trained on maps well, yeah. How it handles the edges is a point of contention. How does this apply to LLMs though?
Are solving intelligence and solving coding/math the same thing?
Philosophers consider reasoning an exploration, and bounded rationality considers the use of heuristics in the exploration space of reasoning, and heuristics are learned through experience, which is basically value functions/q functions.
1. “In contemporary philosophy of cognitive science, reasoning is increasingly compared to foraging behavior”
Bayesian reasoning seems to approximate AIXI. Maintain the possible states and explore/exploit accordingly
1. “use “Posterior Sampling” when dealing with complex Markov Decision Processes (MDPs) or Deep Reinforcement Learning, where you sample an entire world model or value function from a posterior.”
Pretraining still having juice aligns with the notion of subposition as the reason for empirical scaling laws
Where does exploration occur in AIXI?
Are ZKPs the new patents? (edited)
Embedded agency is related to world model is related to godel loops and the halting problem. You need to model your impact on the world, specifically how others react to your actions, to properly act on it (edited)
Is continual learning related to the chi Jin goedel take where it can relearn the tools but not well how to reuse them?
Since continual learning implies that people want the model to learn when to use learned heuristics, but goedel prover seems to show that knowing when to use tools doesn’t come back
Jane street needs to train DSLs (5000 data points per tick just on Google) since vanilla LLMs will def be slower and worse on it, reminds me on sudoku
Is predicting the mean in a Gaussian distribution related to compression issues in JEPA?
Page 161 describes decision theory frameworks and multi agent setting implications https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf
How would you TRAIN a proactive model rather than one you have to prompt?
Synchronous interaction alignment > asynchronous control alignment
open source training
inference economics
gap between super intelligence and distribution would be paid for (expanding triangle visualization)
question context when environment sampling shows different and they directly conflict?
Forecasting + RL in the wild: https://arxiv.org/abs/2605.12817
Gwern’s “almost thinks there’s no such thing as general intelligence. Humans and AIs just learn a large number of individual specialized tricks. In any given situation we’re doing search over special cases, nothing more. What matters is just the number of individual tricks that we can search over - which is mostly determined by compute.“ relates to our idea around agi being achieved by models creating their own (rl) models for environments they run into and updating them in real time
1. CODE WORLD MODELS as it relates to the above as well
  1. https://openreview.net/pdf?id=1UoB7IWiku
2. ECHO as directionally this https://arxiv.org/abs/2605.24517
3. https://gemini.google.com/app/fbf9b5b8608d37b9 (!)
I previously had a take that models would increasingly be trained on their serving harnesses, rather than stateless APIs, then I saw some invalidating data from METR around Claude/GPT performance outside of Code/Codex as well as the existence of Claude within a dozen ‘harnesses’ for Anthropic’s enterprise customers (Claude for Life Sciences, Claude for Chrome, etc). But Prime Intellect team intern discussing ‘Gemini Plays Pokemon’ comes to the same (initial) conclusion. More Prime Intellect content in the same direction
https://x.com/eliebakouch/status/2060301471019659274?s=20
Where was Andon Labs complaining about the assistant paradigm? Find link and watch. MARL seems related to that
the only people who care to run benchmarks are frontier labs, research PhDs, or people selling a product that benchmaxxes the eval
The bottleneck is human understanding/learning.
If they often say they’ve done something when they haven’t done it, its because their training rewards them for doing something when they haven’t done it
if your reward signal is approval from peers then the chance of you killing everyone decreases MARL > ^df3a9c
1. would an AI want to kill everyone if they get less terminal reward from it? i struggle to see scenarios where they increase terminal reward if they are well socialized
its fine to shit on the assistant paradigm but then its up to you to prove that either another widely distributable model is can be built and sold at scale or that you can provide infra at scale others would pay for to build they own non assistant paradigms
operating well in collaborative settings -> predicting others -> predicting yourself
predicting yourself -> predicting your (static) environment -> predicting your human (-> predicting your team?)
interesting discussion on the parallels between dissemination of writing and LLMs https://gemini.google.com/app/ed1bcebac9184bd3
sociality from google Pi and CIRL have replaced competitive game MARL after i ran into an issue of lack of specificity during the blog post and discovering how different what i wanted vs what i thought i wanted were Thoughts > ^963ac3. wanted more so cooperation than competition
why exactly do I care about process legibility vs outcome legibility? Google Pi Team Interaction
whats the difference between writing things down and loading them in versus updating weights, especially if context is shown to elicit activations anyways?
perhaps simple, but if you can capture a ton of chain of thought reasoning, then compress the reasoning steps, then do this forever, i dont see why it would ever taper off. i guess if the reward function starts to become unclear, which is along the lines of a reduction in understanding of whats even happening in the first place. there was work around auditability of agents / mutual trust here Google Pi Team > ^b203be
‘causality’ as an explanation of the issue
why do openai, and other smart people, assume agents will specialize? likely due to context window limitations. even if they grow, the sum of various context windows will always outperform in theory. in practice, issues around epistemics (trusting everything in your context) is an issue. i read somewhere about a comparison to not make contexts ‘programs to be run’ but something else, probably a google pi team paper. seems very relevant to slate agent issues Google Pi Team > ^b203be
maintenance of epistemic integrity during collaboration is a prerequisite for sociality. not sure why google MUPI paper focuses on embedded/coupled agency and prospective learning. this was the problem with Slate agents
are context windows ~= computer programs? why or why not?
1. in this analogy, people are essentially trying to code on computers that change in how they process instructions every few months. benchmarks are a way to measure how the instruction processing works.
2. the poetic (company) take is that models should be used to compile tacit knowledge into classical deterministic code that can be cheaply rather than tokenmaxx
the idea of social approval as reward signal relates to CIRL. its just that its doing it for one person instead of many. how did the primary methods of receiving reward signals change over time? i.e. with social media
1. interesting upshot of social approval is that you need to go sample your environment for reward so you might not want to spend too long doing asynchronous work
the emotional feeling against asynchronous agents is that i want to do work, and i cant if my brain is off and im waiting for an agent to do something. whether this is cope is an open question. but if i require understanding to remain competitive, i dont think it is
1. maybe downstream take is the majority of agent/human interaction within two years is unprompted Experiments > ^ec1e08 Interaction
related to epistemic integrity. you lie to the LLM about something objective, and you see whether it corrects you. as a benchmark
‘sociality’ benchmarks as a whole are lacking
Tshirts as incentive aligned outcomes from textile maxxing

World Models

Explorer

Thoughts

Graph View

Backlinks