The people who call glm 5.2 an inflection point and assume people will use it over closed source frontier models are implicitly saying that the remaining gap to the closed source frontier is increasingly irrelevant to actual use cases. There is also a compute argument here, upstream of the pricing argument, that might be necessary to articulate to make that claim, since the claim feels a bit off in practice
even if frontier LLMs are programmable in theory, that doesn’t mean you know how to program them. you would still need to specify a reward model somehow (perhaps some other system, even another AI, determines the best way to prompt/program the main AI)
overnight weight updating? instead of overnight consolidation? anthropic would be checking whether it’s possible to just incorporate everyone’s preferences into a singleton LLM, ‘programmed’ by KV
Some strand between economics around setting up compute with how inference actually works with how to productize models for end user
A good continual learning system will tease out orders of magnitude more context from its users and be orders of magnitude more retentive
Guy can’t use HL wallet tracker
We probably need to get more specific on the economies of scale; I think our (my) current understanding is poor. Say a Blackwell rack has 72 GPUs, anthropic has 10 of those, and I have one. Is it that I can serve 1/10th of the customers at the same cost, or is it that we can serve the same number of customers but I have to charge 10x the price for the same profit? If it’s both, then there will always be a wedge for lower scale providers with fewer GPUs that can serve fewer customers with the same or lower pricing. Additionally, a frontier open source model increases the ROI on compute for everyone except the people with better models, chipping away at economies of scale.
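A toy way to pin down the question above, not a real cost model. Every number here (rack cost, fixed overhead, throughput, utilization) is made up, and `Provider` is a hypothetical name. The point is only that if all costs scale linearly with racks, the small provider serves 1/10th of the customers at the same unit cost; a scale advantage only shows up once there is overhead that does not grow with racks (or a utilization/batching gap, which the `utilization` field stands in for).

```python
from dataclasses import dataclass


@dataclass
class Provider:
    racks: int                   # number of 72-GPU racks
    rack_cost_per_hour: float    # hypothetical all-in hourly cost of one rack
    fixed_cost_per_hour: float   # hypothetical overhead that does not grow with racks
    tokens_per_rack_hour: float  # hypothetical throughput of one rack at full load
    utilization: float           # fraction of capacity actually sold, between 0 and 1

    def cost_per_million_tokens(self) -> float:
        total_cost = self.racks * self.rack_cost_per_hour + self.fixed_cost_per_hour
        tokens_sold = self.racks * self.tokens_per_rack_hour * self.utilization
        return total_cost / tokens_sold * 1_000_000


# Case 1: every cost scales linearly with rack count (no fixed overhead).
big_linear = Provider(10, 300.0, 0.0, 1e9, 0.6)
small_linear = Provider(1, 300.0, 0.0, 1e9, 0.6)

# Case 2: same racks, plus an overhead both providers pay regardless of size.
big_fixed = Provider(10, 300.0, 200.0, 1e9, 0.6)
small_fixed = Provider(1, 300.0, 200.0, 1e9, 0.6)

for name, p in [("big, linear", big_linear), ("small, linear", small_linear),
                ("big, fixed overhead", big_fixed), ("small, fixed overhead", small_fixed)]:
    print(f"{name}: {p.cost_per_million_tokens():.3f} per 1M tokens")
```

Under these assumptions the wedge for small providers exists exactly to the extent that costs are linear; what still needs figuring out is which real costs (batching efficiency, networking, staff, model licensing) behave like `fixed_cost_per_hour` or like `utilization`.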
Meta is a data labeling org now? Twitter tweets about it
an example given here is if someone is trying to lose weight, should the model optimize for losing weight even if they get higher short term reward for eating candy? if the model says no candy the user might be mad. if the model says candy the user might be mad. not sure how they reconcile but the way i’d reconcile is always optimizing for long term rewards, and choosing short term rewards to the extent by which they increase intrinsic motivation to continue pursuing long term rewards.
probably relates to research around intrinsic motivation / laziness in models. there is likely an actual term for this in human psychology
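A minimal sketch of the weight loss / candy reconciliation above, under heavy assumptions: motivation is a single number, it drops by 1 on every strict day, a treat restores some of it but costs half a day of progress, and the user quits for good when motivation hits zero. None of these dynamics come from a paper; `total_progress` and its constants are made up for illustration.

```python
from typing import Optional


def total_progress(treat_every: Optional[int], days: int = 60) -> float:
    """Long-term progress under a policy that allows a treat every `treat_every` days.

    treat_every=None means never allow a treat.
    """
    motivation = 5.0  # hypothetical starting motivation (also the cap)
    progress = 0.0
    for day in range(1, days + 1):
        if motivation <= 0:
            break  # user has given up; no further long-term reward
        is_treat_day = treat_every is not None and day % treat_every == 0
        if is_treat_day:
            motivation = min(5.0, motivation + 3.0)  # short-term reward restores motivation
            progress -= 0.5                          # but sets the long-term goal back
        else:
            motivation -= 1.0                        # strict days wear the user down
            progress += 1.0
    return progress


for policy in (None, 1, 4):
    print(policy, total_progress(policy))
```

In this toy world "never candy" stalls early because the user quits, "always candy" never makes progress, and an occasional treat wins on the long-term objective. That is the sense in which short-term reward is chosen only to the extent it keeps the long-term pursuit alive.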
seems to relate to AIXI, since the agent manages a set of possible ‘true’ reward functions and adopts a policy based on its observations. also coupled with its environment a la MUPI, if the fear that it persuades the human to change to make its own job easier is well founded
git history as the history over which the agent learns in the PDEV sense feels directionally correct but overall lacking in context (what i read, what i see, what i talk about, etc)
https://arxiv.org/pdf/2408.16984 interesting paper that seems, from the abstract, to conclude anthropic’s approach is superior, but then says that this leads to pluralism?
if it’s unclear what is latent in an LLM, then is GEPA the best way of figuring out what’s latent?
definitely feels wasteful to have to spend a ton of tokens figuring out what the state of the computer even is, rather than just using it, especially since it’s expensive
can you apply the step from PNLC -> NLAC to PPI? think i had a claude chat somewhere about this. the take seemed to be yes its possible since LLMs are fundamentally the same structure as the GRUs that were tested. again also seems related to SDPO
is NLAC similar to continual/interactive learning if you replace the critic with a human? starting to feel like this vague idea doesn’t actually make sense because what are you even learning/predicting?
how to deal with states that truthfully reward the user but the user doesn’t recognize as such? this is probably the basis for sycophancy. probably similar to P vs NP. i can verify that i like something after i have it but i cannot tell you or codify it beforehand.
are RL rollouts equivalent to ‘predicting the environment and predicting your own actions’? i don’t see what the difference is. at least for single model rollouts not self play. https://claude.ai/chat/2c9bd8d1-5bb3-452b-9090-faaf8efd1ae7
described update to jakub as MARL -> epistemic integrity / prompt injection resistance / embedded agency AND/OR CIRL / interaction models / assistance games, with RSI asterisk looming over everything. is that comprehensive?
his take was that epistemics is often grounded in human feeling/intuition which consolidates it with the latter point
i think it’s robust to believe that RSI will not be able to improve epistemic integrity over an existing out of the box product, if it existed, since the potential weight updating required to self improve would be too costly even for a superintelligence?
epistemic integrity feels necessary for actually improving priors + discovering truth which feels necessary for collective intelligence to be value creative over singleton intelligence. otherwise as sutton puts it you’re missing the selective retention part of variation and evaluation. although not sure why it’s not just variation and selection
is china’s open source culture an example of ‘commodifying your complement’? i.e. they commodify algorithms because they likely win on compute longer term. by that analogy anthropic should want to commodify compute but they can’t really.
if it’s written by AI, expect only AI to read it. if only AI is reading it, why write it without AI? feels pretty easy to tell the difference between something made for agents (AEO) and something made for humans (non-average voice)
arena.ai is similar to LM arena except for frontend design. what was the GTM there? dynamic, real use evals still feel crucial. even better if they proxy things people would pay for
it doesn’t seem like obsidian will build what i want PDEV since their core value prop involves privacy, whereas i want public by default + cloud AI analyzing everything/always on
it’s easier to share progress, and therefore make progress, if investment is permissionless. relates to RPGF, but that seems a bit too idealistic.
how is epistemic integrity benchmarked in LLMs, if at all? the success of collective intelligence and non singleton outcomes is downstream of this.
games like avalon are a subset of epistemic integrity
“(e) Group alignment: How can AGI groups be effectively steered (either explicitly, or implicitly via, e.g., mechanism design for markets)? How can they be hardened and self-correct against epistemic hijacking and the spread of falsehoods, hallucinations & self-delusions? (f) How to ensure epistemic resilience and recoverability in asymmetric-intelligence collectives (e.g., mixed human-ASI collectives)?” from agi to asi paper
seems like the transition from taking context at face value to taking context as an update to a prior is the difference, but what does that look like in practice? for example if a data point comes in and the probability of that data point is low, we would need to update our priors but not completely. and there is a difference between environmental sampling and collaborative opinion (lossy). trust forms when collaborative opinion updates the world model/prior towards environmental truth over time. trust is individualized reputation.
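One way to make "update but not completely" concrete is plain Bayes on a single binary hypothesis H, with each source treated as a noisy channel. This is a sketch under assumptions, not a claim about how any agent actually does it: the reliability numbers are invented, and `Trust` is a hypothetical helper that keeps a per-peer running estimate of how often that peer's reports later matched the environment (the "individualized reputation" reading).

```python
def bayes_update(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """Posterior probability of H after one observation."""
    numerator = prior * p_obs_given_h
    return numerator / (numerator + (1.0 - prior) * p_obs_given_not_h)


def report_against_h(prior: float, reliability: float) -> float:
    """Posterior on H after a source with the given reliability reports 'not H'."""
    return bayes_update(prior, 1.0 - reliability, reliability)


class Trust:
    """Running estimate of how often one peer's reports matched environmental truth."""

    def __init__(self, hits: float = 1.0, misses: float = 1.0) -> None:
        self.hits = hits      # starts at 1/1, i.e. a uniform prior over reliability
        self.misses = misses

    def record(self, matched_environment: bool) -> None:
        if matched_environment:
            self.hits += 1.0
        else:
            self.misses += 1.0

    @property
    def reliability(self) -> float:
        return self.hits / (self.hits + self.misses)


prior = 0.9  # strong prior belief in H

# Direct environmental sample contradicting H (assumed 95% reliable).
after_env = report_against_h(prior, reliability=0.95)

# Lossy peer opinion contradicting H, weighted by that peer's track record.
peer = Trust()
for matched in [True, True, False, True]:
    peer.record(matched)
after_peer = report_against_h(prior, peer.reliability)

print(after_env, after_peer)
```

Both surprising observations pull the prior down without replacing it, and the peer's opinion moves it less than the environmental sample until the peer has earned a better track record.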
studybench feels like an example of an assistance game
agency arises when reward signal is peer approval in humans? how to set a dynamic reward signal of peer approval in LLMs? relates to CIRL. perhaps relates to (non)assistant training paradigm
seemingly a gap between append only agent turn logs as some vague ‘memory’ solution vs use as a formal interleaved dataset where the agent can learn causal loops, which opens up multi agent systems which opens up collective intelligence. again, ECHO seems to be the first version of this
prospective learning vs retrospective learning?
but when agents do next token prediction that’s considered an ‘action’, no? what’s the actual difference
if LLM generalization is on a spectrum, then viable products of the future are downstream of being correct about the extent to which generalization occurs
current understanding todos in browser: multi agent cooperating thru ICL, CIRL, AGI to ASI, MUPI/RUI, gwern GA, POMDP lectures
^ how do epistemic utility measurements Google Pi Team > ^b203be relate to the framing of the ‘incentive to ask’ as the unsolved core
“Wrapping a mathematical POMDP solver around a 70B+ parameter Large Language Model is computationally impossible with current techniques.” ??
“not by maintaining a dynamic Bayesian belief distribution over a hidden vector θ, but by frozen reward modeling or Direct Preference Optimization. The empirical simplicity and scalability of RLHF bypassed the need to compute complex, game-theoretic joint policies.” ok but we’re past that now
is the context of an LLM functionally a bayesian belief state in a POMDP? and the problem perhaps with that framework is that it does not maintain multiple ‘contexts’ with their own probabilities of being right/useful? and this is externalized to memory solutions like Hindsight and benchmarked with stuff like BEAM? but Google Pi Team > ^b203be describes differences between epistemic agents and what BEAM measures, which essentially comes down to dynamism imo. relates back to dynamic evals seemingly, but personalized perhaps Ideas > ^7afff1
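For contrast with a single context window, here is roughly what "multiple contexts with their own probabilities" would look like as an explicit discrete belief state. Purely illustrative: the hypothesis names and likelihood numbers are invented, `update_belief` is a hypothetical helper, and this says nothing about how Hindsight or BEAM actually work.

```python
from typing import Dict


def update_belief(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Reweight each candidate 'context' by how well it explains the new observation."""
    unnormalized = {h: belief[h] * likelihood[h] for h in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: weight / total for h, weight in unnormalized.items()}


# Two competing readings of the same user, held side by side.
belief = {"wants_terse_answers": 0.5, "wants_detailed_answers": 0.5}

# Hypothetical observation: the user asks "why?" after a short answer.
belief = update_belief(
    belief,
    {"wants_terse_answers": 0.2, "wants_detailed_answers": 0.8},
)
print(belief)
```

A plain context window keeps only the text, so the losing hypothesis is either still in there at full weight or gone entirely; here it stays around at reduced probability and can recover if later observations favor it.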
CIRL also seems to be related to the ‘proactive’ framework Randall kept mentioning. Experiments > ^605490, at least the part where it interjects to learn. do existing LLMs and memory handle this already?
Polymarket vs Kalshi seems similar to protocol vs platform
GEPA does not seem like it would work. how does this not overfit / run into the same issues as a bunch of skills that end up being poorly used? I think chi jin’s goedel prover v2 runs into the issue but maybe that’s specifically related to weight updating. Regardless, updating in ‘prompt space’ seems like an interesting way to improve frontier models instead of fine tuning. Also the labs will probably serve frontier models more cheaply than you can on rented GPUs
Feeling like this post makes arguments that could be usefully extended by carefully analyzing the recent nvidia and microsoft tech reports and coming to novel conclusions about scaling complexity
This also seems to indicate that the karpathy hire on pretraining is due to the fact that pretraining was paused rather than saturated, but incoming compute will continue to deliver major scaling gains
im not seeing people talk about it much so just a heads up: dynamic workflows in claude code are actually insanely fucking useful and powerful. clearly the right / sane way to do “agent orchestration”. very much worth trying
https://www.a16z.news/p/institutional-ai-vs-individual-ai coordination as first pillar here is very similar to my multi agent take. The signal part feels like what i’m trying to do with the future version of these notes and my listed problems. Unprompted is also a novel thought i’ve been exploring, similar to proactivity per the randall takes.
Is this guy super cracked out? How does his embodiment take relate to current work and/or multi agent work and/or AIXI? https://scott.garrabrant.com/