MARL

Summarizing the RL children blog would illuminate similarities and differences in thinking, which would help frame the issues with multiplayer agents and why MARL is interesting. Basically, the agents couldn’t represent me, which Gwern dives into, and I was focused on MARL as a way to address the representation issue.
Interaction
1. interaction + MARL relates to collaboration vs competition
2. Gwern agrees that all input going in as a single context makes multiplayer impossible since there’s no privileged or differing response type
Difference between messaging each other to optimally explore or maximize rewards in states you couldn’t otherwise get to vs optimizing actions in the face of another agent that is also learning and changing its actions in a competitive environment
https://arxiv.org/pdf/2605.09998
Probably need a working taxonomy of multiplayer environments to more quickly reason about. Assistance games?
What type of game is Pokémon Showdown, and why was it chosen and well regarded as an RL environment?
https://arxiv.org/pdf/2606.02373 search harness, relevant to maintaining memory as a requirement of MARL
MARL relates to continual learning because you must continuously update something to even maintain performance
What are the properties of stochastic games vs multi agent games?
How are state spaces and action spaces actually composed in these environments?
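A minimal sketch of one common answer, assuming the standard Markov (stochastic) game setup where the state is shared, each agent has its own action set, and transitions depend on the joint action. The agent names and actions below are hypothetical, purely for illustration:

```python
from itertools import product

# Hypothetical per-agent action sets (illustrative only, not from any specific environment).
agent_actions = {
    "agent_a": ["left", "right"],
    "agent_b": ["stay", "up", "down"],
}

def joint_action_space(actions_by_agent):
    """Return every joint action as a dict mapping agent name -> chosen action."""
    names = list(actions_by_agent)
    combos = product(*(actions_by_agent[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

joint = joint_action_space(agent_actions)
print(len(joint))  # size is the product of the per-agent action counts
```

The joint action space grows multiplicatively with the number of agents, which is one reason decentralized methods that learn per-agent policies are attractive.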
Agents like Hermes get 10x better when they host html as response? In the vein of generated UI?
Thinking machines real time inference still important to understand. Does it mesh at all with models as models?
Things like game dev bench make more sense since it’s testing model ability to generate, which in the vein of models to models is how they’ll achieve everything?
How can LLMs perform self play in non verifiable domains?
Intuitive posterior distribution explanation https://gemini.google.com/app/6c8535eee8c99212
You win games/benchmarks if you have the most compute, during training or inference. This implies that intelligence per output token is the real measure of algorithmic progress, while intelligence per dollar is the measure of hardware progress, when independently calculated from each other
Why does test-time scaling work? For LLMs or for other algos. How is test-time scaling related to continual learning? Why does updating weights during inference result in instability but updating weights during training does not, if the data distributions are the same? The answer of i.i.d. isn’t really satisfying, since self-play isn’t i.i.d. If you just overfit to the last thing you saw during training, how is that different? If you clip gradient updates during training, why can that not occur during inference? Is it simply a hardware constraint, since updating weights is more compute and memory intensive and harder to scale to millions of users?
1. https://cs224r.stanford.edu/slides/10_cs224r_rl_for_llms_reasoning_2026.pdf
https://www.k-a.in/rl-algo.html
The term to look up and learn is “multi agent deep reinforcement learning”, not just multi agent or deep.
Why did AlphaGo not need to model Lee’s behavior specifically? Because there existed a policy that was optimal against all players? Which games or environments have this property and which don’t? Is it because he has a fixed policy vs an adaptive policy?
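One partial answer, sketched under assumptions: in two-player zero-sum games, a minimax strategy guarantees the game value against any opponent, fixed or adaptive, so no opponent-specific model is needed to avoid being exploited. The example below uses rock-paper-scissors as a stand-in (not Go) and compares the worst-case payoff of the uniform mix with that of a predictable strategy. Function names are illustrative.

```python
# Row player's payoff in rock-paper-scissors (zero-sum: the column player gets the negative).
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def expected_payoff(row_mix, col_mix):
    """Expected row-player payoff when both players use mixed strategies."""
    return sum(
        row_mix[i] * col_mix[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )

def worst_case(row_mix):
    """Payoff guaranteed by row_mix: the minimum over the opponent's pure replies.

    Checking pure replies is enough because the payoff is linear in the opponent's mix.
    """
    pure_replies = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    return min(expected_payoff(row_mix, reply) for reply in pure_replies)

uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(worst_case(uniform))      # guaranteed value of the minimax strategy
print(worst_case(always_rock))  # a predictable strategy can be fully exploited
```

This guarantee is specific to two-player zero-sum settings. In general-sum or many-player games there is no single strategy that is safe against everyone, which is where opponent modeling starts to matter.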
How are messaging protocols allowed on planes when internet isn’t? Do agents communicating through messaging protocols without internet unlock something?
Chi says, to a student asking whether they can define the state space as the history of actions, that you can, but then your state space is infinite, which “you don’t want”
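A quick sketch of why history-as-state blows up: with a fixed number of actions per step, the number of distinct action histories grows exponentially in the horizon, and is unbounded if the horizon is. The function name and the choice of 4 actions are illustrative assumptions.

```python
def num_histories(num_actions, horizon):
    """Count distinct action histories of length 0 through `horizon`."""
    return sum(num_actions ** t for t in range(horizon + 1))

# Hypothetical setting: 4 actions available at every step.
for horizon in (1, 5, 10, 20):
    print(horizon, num_histories(4, horizon))
```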
Orion 100B decentralized training
Innovators dilemma as exploitation vs exploration
Tesla autopilot doesn’t need MARL since it just has state transition dynamics? When is one needed over the other?
Maybe multiagent as I previously thought doesn’t make sense because if they’re better than you, you lose, and if they’re worse than you, you can just model them as yourself. At least in verifiable zero sum settings? Seems to be consistent with Levine’s use of more general human behavior prediction rather than user specific behavior prediction
So if anything you’re perhaps predicting their observable state but not their “logic”
Is what I’m searching for a best policy when multiple agents are adaptive? If so, then whoever is adaptive at the fastest rate wins?
Microsoft AI technical report thread
Epistemics as cdev vs pdev, because pdev implies not updating priors, which implies robust priors. Priors and posteriors relate to exploration vs exploitation as well. Investing is a lower-risk, lower-reward way to express a prior; building is a higher-risk, higher-reward way to express a prior
AlphaGo also used test-time scaling in its PUCT search. Does test-time scaling relate to exploration? Yes, they’re the same
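For reference, a minimal sketch of a PUCT selection rule in the form published for AlphaGo Zero: pick the action maximizing Q(s, a) + c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)). The function name and the default `c_puct` value are assumptions for illustration, not values taken from these notes.

```python
import math

def puct_select(q_values, priors, visit_counts, c_puct=1.5):
    """Return the index of the action with the highest PUCT score.

    q_values:     mean value estimate per action
    priors:       policy-network prior probability per action
    visit_counts: number of times each action was tried from this node
    c_puct:       exploration constant (hypothetical default)
    """
    total_visits = sum(visit_counts)
    scores = [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# A rarely visited action gets a large exploration bonus despite a lower value estimate.
print(puct_select([0.5, 0.4], [0.5, 0.5], [10, 0]))
```

The exploration term shrinks as an action accumulates visits, so more search budget at test time shifts the choice from prior-driven exploration toward the value estimates.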
V-learning and optimistic Nash VI are for zero-sum and general-sum tabular Markov (stochastic) games. There are decentralized V-learning algorithms that achieve better sample complexity https://arxiv.org/abs/2110.14555
Interesting that Zaharia also claimed that sample complexity was the bottleneck, not cost or intelligence
Optimism during exploration, pessimism during exploitation
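One reading of this, sketched under assumptions: when exploring, rank options by an upper confidence bound so uncertain options get tried; when exploiting, rank by a lower confidence bound so only well-supported options are trusted. The bonus form, the constant `c`, and the arm statistics below are illustrative, not taken from a specific paper.

```python
import math

def confidence_bounds(mean, count, total, c=1.0):
    """Return (lower, upper) bounds around an empirical mean; bonus shrinks with more samples."""
    bonus = c * math.sqrt(math.log(total) / count)
    return mean - bonus, mean + bonus

# Hypothetical arms: name -> (empirical mean reward, number of pulls).
arms = {"a": (0.6, 50), "b": (0.7, 2)}
total = sum(count for _, count in arms.values())

# Optimism: choose by upper bound, which favors the barely tested arm.
explore = max(arms, key=lambda k: confidence_bounds(*arms[k], total)[1])
# Pessimism: choose by lower bound, which favors the well-tested arm.
exploit = max(arms, key=lambda k: confidence_bounds(*arms[k], total)[0])

print(explore, exploit)
```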
It’s probably the case that large labs will not build harnesses for niche markets. Harnesses are anything that productizes intelligence. Intelligence is useless if not productized. Productizing intelligence means giving it necessary sensors and actuators to do a task, and proving so with evals.
Use thinking machines interaction model with nvidia minecraft voyager agent to be able to handle zombies?
https://arxiv.org/pdf/2603.12145
https://www.youtube.com/watch?v=oLkqZ2wBf44
https://gemini.google.com/app/9939846c0235d67a
https://arxiv.org/pdf/2103.01955
https://arxiv.org/abs/1706.02275
https://proceedings.iclr.cc/paper_files/paper/2025/file/40eff1670d6b08bb1bda48b0c5f30110-Paper-Conference.pdf
https://proceedings.neurips.cc/paper_files/paper/2022/file/743459dae9b2c5d2904e5432d5298128-Paper-Conference.pdf
https://arxiv.org/pdf/2508.03613
https://arxiv.org/pdf/2508.02948v1
Hierarchical MARL
https://arxiv.org/abs/1703.06182
https://gemini.google.com/app/ae3d30b177a5e890
https://arxiv.org/abs/1906.06725
https://www.youtube.com/watch?v=AK4EVrBr720
https://arxiv.org/abs/2505.18098
https://gemini.google.com/app/d05293325b6791f8
Modeling other players is crucial. It involves a world model/planning, though I’m unsure to what extent. Levine’s paper is the most interesting I’ve seen on this in practice
1. https://www.youtube.com/watch?v=TSwtrHQgjD8
Shared training is different from modeling others that may not be trained concurrently. Mostly care about modeling humans. The extent to which games approximate that is useful, and understanding joint training may be useful (like IPPO), but not the actual area of interest
1. https://gemini.google.com/app/f8593d849a4ace01
Understand AlphaStar. Was it successful or not? https://www.reddit.com/r/starcraft/comments/rjucss/was_alphastar_really_a_success/ Has it been expanded on or not?
Super cool https://rpg.ifi.uzh.ch/marl/
https://www.youtube.com/watch?v=rrtxyZ4Vnv8
This is actually simple multi agent experiment https://www.strangeloopcanon.com/p/why-smart-planners-lose-to-simple
https://arxiv.org/abs/2508.03613
https://goedelcodeprover.github.io/
1. Goedel prover best understood as work that makes non-verifiable complex coding tasks verifiable, thus making it extremely valuable?
https://jxihong.github.io/joeyhong/ researcher to watch
https://gemini.google.com/app/c045af221dfcce6a downstream of levine work
1. https://arxiv.org/pdf/2512.04601 also downstream from lead researcher Joey Hong
https://proceedings.neurips.cc/paper_files/paper/2017/file/7fe1f8abaad094e0b5cb1b01d712f708-Paper.pdf solving subgames as solving imperfect-information games, NeurIPS best paper 2017
https://gemini.google.com/app/576e572a98450c5f relation between the Google embedded agents AIXI paper and Levine’s work on modeling scenarios for better persuasion + Avalon, as a parallel track to the in-context learning done for Avalon in the other paper
1. https://www.lesswrong.com/posts/AJ7qddr5imhhN2jHz/embedded-universal-predictive-intelligence
RL at that stage was simply seeing when coding/math tasks failed, properly assigning rewards to what people wanted, do that in a loop. very simple.
Follow up with him
1. Thoughts on thinking machines interaction model?
Another take was that a larger risk than civilization-level misalignment is the local incentives of companies making AI falling prey to classic big-tech slop issues, focusing on benchmaxxing rather than real value
What data strongly improves an LLM’s ability to engage in game-theoretic situations like Mafia, Avalon, etc., thereby boosting EQ by learning theory of mind about the other parties?
Which RL tasks have negative transfer?
What is social intelligence, in an aloof sense? Seemingly, predicting the internal state and/or actions of others, conditioned on your action, the actions of others, or the environment
Social animals developed agency since reward signals were based on peer approval? How does societal us vs them relate to the issues with my writing around cooperative messaging vs competitive winning in MARL?
Can LLMs reason through imperfect information games? Can they just code up a solution? Is the ability to out of the box code up a solution what should be benchmarked? Is the ability to natively reason about it the benchmark instead?
https://openreview.net/forum?id=GCd5v3ehmr
https://openreview.net/forum?id=IdF6JqXWzx
https://arxiv.org/pdf/2602.01539
https://arxiv.org/html/2510.11062v1
https://github.com/FareedKhan-dev/all-rl-algorithms
