In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
Effective FLOPs measure a new system in terms of the performance of another, reference architecture. They are natural for measuring changes in architecture at a similar scale, so they fit well in RSPs, but they are not suited to giving absolute measurements in standard units, or to defining standard units.
Absolute effective compute thresholds could just as well be defined directly in terms of loss.
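To make the loss/effective-compute correspondence concrete, here is a toy sketch. It assumes a Chinchilla-style power law relating training compute to loss for a reference architecture; the functional form and all constants below are illustrative assumptions, not fitted to any real model. "Effective compute" of a new system is then just the scaling law inverted at that system's measured loss.

```python
# Toy illustration: assume a reference architecture whose loss follows a
# Chinchilla-style power law in training compute C:
#     L(C) = L_inf + A * C**(-alpha)
# All constants below are made up for illustration.
L_INF, A, ALPHA = 1.69, 406.4, 0.34

def loss_at(compute: float) -> float:
    """Predicted loss of the reference architecture at a given compute."""
    return L_INF + A * compute ** (-ALPHA)

def effective_compute(loss: float) -> float:
    """Invert the scaling law: the compute at which the reference
    architecture would reach this loss. This is the 'effective FLOPs'
    credited to any system achieving that loss."""
    assert loss > L_INF, "loss is below the assumed irreducible loss"
    return (A / (loss - L_INF)) ** (1 / ALPHA)

# A new architecture reaching loss 2.0 with less raw compute is credited
# with the reference architecture's compute at that loss.
c = effective_compute(2.0)
```

This also shows why a threshold could be stated directly in loss: the two parameterizations are monotonically related, but the loss version does not depend on the fitted constants of any particular reference architecture.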
Join our Computational Mechanics Hackathon, organized with the support of APART, PIBBSS and Simplex.
This is an opportunity to learn more about Computational Mechanics and its applications to AI interpretability & safety, and to get your hands dirty working on a concrete project with a team, supported by Adam & Paul. Also, there will be cash prizes for the best projects!
Read more and sign up for the event here.
We’re excited about Computational Mechanics as a framework because it provides a rigorous notion of structure that can be applied to both data and model internals. In Transformers Represent Belief State Geometry in their Residual Stream, we validated that Computational Mechanics can help us understand fundamentally what computational structures transformers implement when trained on next-token prediction - a belief...
We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those who can't make it this time!
Logical decision theory was introduced (in part) to resolve problems such as Parfit's hitchhiker.
I heard an argument that there is no reason to introduce a new decision theory - one can just take causal decision theory and precommit to doing whatever is needed on such problems (e.g. pay the money once in the city).
This seems dubious given that people spent so much time on developing logical decision theory. However, I cannot formulate a counterargument. What is wrong with the claim that CDT with precommitment is the "right" decision theory?
Nope! Parfit's Hitchhiker is designed to show exactly this. A CDT agent will desperately wish for some way to actually commit to paying.
I think some of the confusion in this thread is about what "CDT with precommitment (or really, commitment)" actually means. It doesn't mean "intent" or "plan". It means "force" - throw the steering wheel out the window, so there IS NO later decision. Note also that humans aren't CDT agents, they're some weird crap that you need to squint pretty hard to call "rational" at all.
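The distinction above can be made concrete with a toy payoff model of Parfit's Hitchhiker. The payoff numbers are illustrative, and the "predictor" is idealized as perfectly reliable: the driver rescues you iff they predict you will pay once in the city. Plain CDT, reasoning causally once in the city, refuses to pay; an agent with a binding commitment pays and therefore gets rescued.

```python
# Toy model of Parfit's Hitchhiker (all numbers illustrative).
# Surviving is worth 1,000,000; paying the driver costs 1,000.
SURVIVE, PAYMENT = 1_000_000, 1_000

def cdt_policy(in_city: bool) -> bool:
    # In the city the rescue is already in the past: paying has no
    # causal effect on survival, so plain CDT refuses to pay.
    return False

def committed_policy(in_city: bool) -> bool:
    # An agent that bound itself in advance ("threw the steering wheel
    # out the window") pays unconditionally.
    return True

def play(policy) -> int:
    # The driver is a reliable predictor: rescue happens iff the agent
    # is predicted to pay once in the city.
    predicted_to_pay = policy(in_city=True)
    if not predicted_to_pay:
        return 0  # left in the desert
    pays = policy(in_city=True)
    return SURVIVE - (PAYMENT if pays else 0)
```

Under these assumptions, `play(cdt_policy)` yields 0 while `play(committed_policy)` yields 999,000, which is why a CDT agent would desperately want some commitment mechanism.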
My p(doom) is pretty high and I found myself repeating the same words to explain some parts of the intuitions behind it. I think there are hard parts of the alignment problem that we’re not on track to solve in time.[1] Alignment plans that I've heard[2] fail for reasons connected to these hard parts of the problem, so I decided to attempt to write my thoughts in a short post.
(Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.)
Modern machine learning uses a powerful search process to look for neural network parameters such that the network performs well on some objective.
There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure with the gradient of the...
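The "powerful search over parameters" described above can be sketched as a minimal gradient-descent loop. This is a toy stand-in, not any real training setup: the "network" is just a line fit by analytic mean-squared-error gradients.

```python
# Minimal sketch of search-by-gradient over parameters. The "model"
# here is a line y = w*x + b; real training differs only in scale.
def model(w, b, x):
    return w * x + b

def loss(w, b, data):
    # Mean squared error over (x, y) pairs.
    return sum((model(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train(data, lr=0.05, steps=2000):
    w = b = 0.0
    for _ in range(steps):
        # Analytic gradients of the MSE with respect to w and b.
        gw = sum(2 * (model(w, b, x) - y) * x for x, y in data) / len(data)
        gb = sum(2 * (model(w, b, x) - y) for x, y in data) / len(data)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Data generated by y = 3x + 1; the search recovers w ~ 3, b ~ 1.
data = [(x, 3 * x + 1) for x in range(-2, 3)]
w, b = train(data)
```

The point of the sketch is that nothing in the loop cares what the parameters mean; it only cares about pushing the objective down, which is the property the excerpt's argument turns on.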
Curious to hear what you have to say about this blog post ("Alignment likely generalizes further than capabilities").
Thanks for the comment, makes sense. Applying the boundary to AI systems likely leads to erroneous thinking (though it may be narrowly useful if you are careful, in my opinion).
It makes a lot of sense to imagine future AIs having learned behaviours for using their compute efficiently without relying on some outside entity.
I agree with the fragility example.
Disclaimer: if you are using a definition in a nonmathematical piece of writing, you are probably making a mistake; you should just get rid of the definition and instead use a few examples. This applies double to people who think they are being "rigorous" by defining things but are not actually doing any math. Nonetheless, definitions are still useful and necessary when one is ready to do math, and some pre-formal conceptual work is often needed to figure out which mathematical definitions to use; thus the usefulness of this post.
Suppose I’m negotiating with a landlord about a pet, and in the process I ask the landlord what counts as a “big dog”. The landlord replies “Well, any dog that’s not small”. I ask what counts as a...
You're the one who brought up the natural numbers, I'm just saying they're not relevant to the discussion because they don't satisfy the uniqueness thing that OP was talking about.
If you're reading this, it's possible you just found yourself switched to the Enriched tab. Congratulations! You were randomly assigned to be fed to the Shoggoth (that is, to a group of users automatically switched to the new posts list).
The Enriched posts list:
You can read...
I'm enjoying having old posts recommended to me. I like the enriched tab.
Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson
We have learned more since last week. It’s worse than we knew.
How much worse? In which ways? With what exceptions?
That’s what this post is about.
For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and with being blocked from selling it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including the NDA preventing anyone from revealing these clauses.
No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and...
It seems increasingly plausible that it would be in the public interest to ban non-disparagement clauses more generally going forward, or at least set limits on scope and length (although I think nullifying existing contracts is bad and the government should not do that and shouldn’t have done it for non-competes either.)
I concur.
It should be noted, though, that we can spend all day taking apart these contracts and applying pressure publicly, but real change will have to come from the courts. I await an official judgment to see the direction of this issue. Arguab...
I would like to pose a set of broad questions about a project called Beat AI: A contest using philosophical concepts (details below) with the LessWrong community. My hope would be that we have a thoughtful and critical discussion about it. (To be clear, I'm not endorsing it; I have concerns, but I don't want to jump to conclusions.)
Some possible topics for discussion might include:
Do you know the project or its founder(s)? How and to what extent are they thinking about AI safety, if at all?
If some people decide here that the project seems risky or misguided, do we want to organize our thinking and possibly draft a letter to the project?
Have you seen projects like the one below where a community is invited to compete against
'...and give us a license to use and distribute your submissions.'
For how many generations would humans be able to outwit the AI until it outwits us?