In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability. 

The goal of this post is to try to give that hat to more people.

My mainline prediction scenario for the next decades.

My mainline prediction*:

* LLMs will not scale to AGI. They will not spawn evil gremlins or mesa-optimizers. BUT scaling laws will continue to hold, and future LLMs will be very impressive and make a sizable impact on the real economy and on science over the next decade.
* There is a single innovation left to make AGI-in-the-Alex-sense work, i.e. coherent, long-term planning agents (LTPA) that are effective and efficient in data-sparse domains over long horizons.
* That innovation will be found within the next 10-15 years.
* It will be clear to the general public that these are dangerous.
* Governments will act quickly and (relatively) decisively to bring these agents under state control. National security concerns will dominate.
* Power will reside mostly with governments' AI safety institutes and national security agencies. Insofar as divisions of tech companies are able to create LTPAs, they will be effectively nationalized.
* International treaties will be made to constrain AI, outlawing the development of LTPAs by private companies. Great-power competition will mean the US and China continue developing LTPAs, possibly largely boxed. Treaties will try to constrain this development with only partial success (similar to nuclear treaties).
* LLMs will continue to exist and be used by the general public.
* Conditional on AI ruin, the closest analogy is probably something like the Cortez-Pizarro-Afonso takeovers. Unaligned AI will rely on human infrastructure and human allies for the earlier parts of takeover, but its inherent advantages in tech, coherence, decision-making and (artificial) plagues will be the deciding factor.
* The world may be mildly multi-polar.
  * This will involve conflict between AIs.
  * AIs may very possibly be able to cooperate in ways humans can't.
* The arrival of AGI will immediately inaugurate a scientific revolution. Sci-fi-sounding progress like advanced robotics, quantum magic, nanotech, life extension, laser weapons, large space engineering, and cures for many/most remaining diseases will become possible within two decades of AGI, possibly much faster.
* Military power will shift to automated manufacturing of drones and weaponized artificial plagues. Drones, mostly flying, will dominate the battlefield. Mass production of drones and their rapid and effective deployment in swarms will be key to victory.

Two points on which I differ with most commentators:

(i) I believe AGI is a real (mostly discrete) thing, not a vibe or a general increase of improved tools. I believe it is inherently agentic. I don't think spontaneous emergence of agents is impossible, but I think it is more plausible that agents will be built rather than grown.

(ii) I believe the EA/AI-safety community is in general overrating the importance of individual tech companies vis-à-vis broader trends and the power of governments. I strongly agree with Stefan Schubert's take here on the latent hidden power of government: https://stefanschubert.substack.com/p/crises-reveal-centralisation Consequently, the EA/AI-safety community often focuses myopically on boardroom politics that are relatively inconsequential in the grand scheme of things.

*Where by "mainline prediction" I mean the scenario that is the mode of what I expect: the single likeliest scenario. However, since it contains a large number of details, each of which could go differently, the probability of this specific scenario is still low.
There is a mystery which many applied mathematicians have asked themselves: Why is linear algebra so over-powered? An answer I like was given in Lloyd Trefethen's book An Applied Mathematician's Apology, in which he writes (my summary): Everything in the real world is described fully by non-linear analysis. In order to make such systems simpler, we can linearize (differentiate) them, and use a first or second order approximation, and in order to represent them on a computer, we can discretize them, which turns analytic techniques into algebraic ones. Therefore we've turned our non-linear analysis into linear algebra.
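A toy illustration of that "linearize + discretize" pipeline (my own example using NumPy, not Trefethen's): a nonlinear boundary-value problem becomes, after finite-difference discretization and Newton linearization, nothing but a sequence of ordinary linear solves.

```python
# Minimal sketch (my illustration, not from the book): solve the nonlinear
# boundary-value problem u''(x) + exp(u(x)) = 0 on [0, 1] with u(0) = u(1) = 0.
# Discretize u'' with finite differences, then linearize with Newton's method,
# so each iteration reduces to a plain linear solve.
import numpy as np

n = 100                      # interior grid points
h = 1.0 / (n + 1)            # grid spacing
u = np.zeros(n)              # initial guess at the interior points

# Discretization: tridiagonal second-difference matrix.
D2 = (np.diag(np.full(n, -2.0))
      + np.diag(np.ones(n - 1), 1)
      + np.diag(np.ones(n - 1), -1)) / h**2

for _ in range(20):
    F = D2 @ u + np.exp(u)          # residual of the discretized equation
    J = D2 + np.diag(np.exp(u))     # linearization: Jacobian of the residual
    step = np.linalg.solve(J, -F)   # all the real work is linear algebra
    u += step
    if np.linalg.norm(step) < 1e-12:
        break
```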
Elizabeth
EA organizations frequently ask people to run criticism by them ahead of time. I've been wary of the push for this norm. My big concerns were that orgs wouldn't comment until a post was nearly done, and that it would take a lot of time.

My recent post mentioned a lot of people and organizations, so it seemed like useful data. I reached out to 12 email addresses, plus one person in FB DMs and one open call for information on a particular topic. This doesn't quite match what you see in the post because some people/orgs were used more than once, and other mentions were cut. The post was in a fairly crude state when I sent it out.

Of those 14:

* 10 had replied by the start of the next day. More than half of those replied within a few hours. I expect this was faster than usual because no one had more than a few paragraphs relevant to them or their org, but it is still impressive.
* It's hard to say how sending an early draft changed things. One person got some extra anxiety because their paragraph was full of TODOs (because it was positive and I hadn't worked as hard fleshing out the positive mentions ahead of time). I could maybe have saved myself one stressful interaction if I'd realized I was going to cut an example ahead of time.
* Only 80,000 Hours, Anima International, and GiveDirectly failed to respond before publication (7 days after I emailed them).
* I didn't keep as close track of changes, but at a minimum the replies led to 2 examples being removed entirely, 2 clarifications, and some additional information that made the post better.

So overall I'm very glad I solicited comments, and found the process easier than expected.
Implicitly, SAEs are trying to model activations across a context window, shaped like (n_ctx, n_emb). But today's SAEs ignore the token axis, modeling such activations as lists of n_ctx IID samples from a distribution over n_emb-dim vectors. I suspect SAEs could be much better (at reconstruction, sparsity and feature interpretability) if they didn't make this modeling choice -- for instance by "looking back" at earlier tokens using causal attention (a rough sketch of one possible implementation appears after this list). Assorted arguments to this effect:

* There is probably a lot of compressible redundancy in LLM activations across the token axis, because most ways of combining textual features (given any intuitive notion of "textual feature") across successive tokens are either very unlikely under the data distribution, or actually impossible.
  * For example, you're never going to see a feature that means "we're in the middle of a lengthy monolingual English passage" at position j and then a feature that means "we're in the middle of a lengthy monolingual Chinese passage" at position j+1.
  * In other words, the intrinsic dimensionality of window activations is probably a lot less than n_ctx * n_emb, but SAEs don't exploit this.
* [Another phrasing of the previous point.] Imagine that someone handed you a dataset of (n_ctx, n_emb)-shaped matrices, without telling you where they're from. (But in fact they're LLM activations, the same ones we train SAEs on.) And they said "OK, your task is to autoencode these things."
  * I think you would quickly notice, just by doing exploratory data analysis, that the data has a ton of structure along the first axis: the rows of each matrix are very much not independent. And you'd design your autoencoder around this fact.
  * Now someone else comes to you and says "hey look at my cool autoencoder for this problem. It's actually just an autoencoder for individual rows, and then I autoencode a matrix by applying it separately to the rows one by one."
  * This would seem bizarre -- you'd want to ask this person what the heck they were thinking.
  * But this is what today's SAEs do.
* We want features that "make sense," "are interpretable." In general, such features will be properties of regions of text (phrases, sentences, passages, or the whole text at once) rather than individual tokens.
  * Intuitively, such a feature is equally present at every position within the region. An SAE has to pay a high L1 cost to activate the feature over and over at all those positions.
  * This could lead to an unnatural bias to capture features that are relatively localized, and not capture those that are less localized.
  * Or, less-localized features might be captured but with "spurious localization":
    * Conceptually, the feature is equally "true" of the whole region at once.
    * At some positions in the region, the balance between L1/reconstruction tips in favor of reconstruction, so the feature is active.
    * At other positions, the balance tips in favor of L1, and the feature is turned off.
    * To the interpreter, this looks like a feature that has a clear meaning at the whole-region level, yet flips on and off in a confusing and seemingly arbitrary pattern within the region.
  * The "spurious localization" story feels like a plausible explanation for the way current SAE features look.
    * Often if you look at the most-activating cases, there is some obvious property shared by the entire texts you are seeing, but the pattern of feature activation within each text is totally mysterious. Many of the features in the Anthropic Sonnet paper look like this to me.
    * Descriptions of SAE features typically round this off to a nice-sounding description at the whole-text level, ignoring the uninterpretable pattern over time. You're being sold an "AI agency feature" (or whatever), but what you actually get is a "feature that activates at seemingly random positions in AI-agency-related texts."
* An SAE that could "look back" at earlier positions might be able to avoid paying "more than one token's worth of L1" for a region-level feature, and this might have a very nice interpretation as "diffing" the text.
  * I'm imagining that a very non-localized (intuitive) feature, such as "this text is in English," would be active just once, at the first position where the property it's about becomes clearly true.
  * Ideally, at later positions, the SAE encoder would look back at earlier activations and suppress this feature here because it's "already been accounted for," thus saving some L1.
  * And the decoder would also look back (possibly reusing the same attention output or something) and treat the feature as though it had been active here (in the sense that "active here" would mean in today's SAEs), thus preserving reconstruction quality.
  * In this contextual SAE, the features now express only what is new or unpredictable-in-advance at the current position: a "diff" relative to all the features at earlier positions.
  * For example, if the language switches from English to Cantonese, we'd have one or more feature activations that "turn off" English and "turn on" Cantonese, at the position where the switch first becomes evident.
    * But within the contiguous, monolingual regions, the language would be implicit in the features at the most recent position where such a "language flag" was set. All the language-flag-setting features would be free to turn off inside these regions, freeing up L0/L1 room for stuff we don't already know about the text.
  * This seems like it would allow for vastly higher sparsity at any given level of reconstruction quality -- and also better interpretability at any given level of sparsity, because we don't have the "spurious localization" problem.
  * (I don't have any specific architecture for this in mind, though I've gestured towards one above. It's of course possible that this might just not work, or would be very tricky. One danger is that the added expressivity might make your autoencoder "too powerful," with opaque/polysemantic/etc. calculations chained across positions that recapitulate in miniature the original problem of interpreting models with attention; it may or may not be tough to avoid this in practice.)
* At any point before the last attention layer, LLM activations at individual positions are free to be "ambiguous" when taken out of context, in the sense that the same vector might mean different things in two different windows. The LLM can always disambiguate them as needed with attention, later.
  * This is meant as a counter to the following argument: "activations at individual positions are the right unit of analysis, because they are what the LLM internals can directly access (read off linearly, etc.)"
  * If we're using a notion of "what the LLM can directly access" which (taken literally) implies that it can't "directly access" other positions, that seems way too limiting -- we're effectively pretending the LLM is a bigram model.
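For concreteness, here is one rough guess at what such a "contextual SAE" could look like: the encoder sees a causal-attention summary of earlier positions alongside the current activation, so it can suppress features that are "already accounted for," while the decoder sees the same summary so it can still reconstruct them. This is my own sketch, not an architecture proposed in the quick take or in any published SAE work; all module names and hyperparameters are illustrative.

```python
# Speculative sketch of a "contextual SAE" with causal look-back. Everything
# here (layer choices, sizes, loss weights) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualSAE(nn.Module):
    def __init__(self, n_emb: int, n_features: int, n_heads: int = 4):
        super().__init__()
        self.context = nn.MultiheadAttention(n_emb, n_heads, batch_first=True)
        self.encode = nn.Linear(2 * n_emb, n_features)       # [current; context] -> sparse codes
        self.decode = nn.Linear(n_features + n_emb, n_emb)   # codes + context -> reconstruction

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_ctx, n_emb) -- a window of LLM activations.
        n_ctx = acts.shape[1]
        causal_mask = torch.triu(torch.ones(n_ctx, n_ctx, dtype=torch.bool), diagonal=1)
        ctx, _ = self.context(acts, acts, acts, attn_mask=causal_mask)  # look back only
        feats = F.relu(self.encode(torch.cat([acts, ctx], dim=-1)))     # sparse feature codes
        recon = self.decode(torch.cat([feats, ctx], dim=-1))
        return recon, feats


sae = ContextualSAE(n_emb=512, n_features=4096)
acts = torch.randn(8, 128, 512)                              # fake activations, for shape-checking
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
```

Whether this avoids the "too powerful autoencoder" failure mode flagged above is exactly the open question; the sketch only shows where the look-back would slot in.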

Popular Comments

Recent Discussion

Effective FLOPs measure a new system in terms of the performance of another architecture. They are natural for measuring changes in architecture while remaining at a similar scale, so they fit well in RSPs. But they are not suited for giving absolute measurements in standard units, or for defining standard units.

Absolute effective compute thresholds could as well be defined directly in terms of loss.
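For illustration, a threshold stated directly in terms of loss might look like the sketch below. The scaling-law form is the Chinchilla-style L(N, D) = E + A/N^a + B/D^b; the constants are roughly those reported by Hoffmann et al. (2022) and are used here purely as placeholders, not as a policy-grade fit.

```python
# Toy sketch: define an absolute threshold as "projected loss on a fixed
# reference distribution" rather than as effective FLOPs of a reference
# architecture. Constants approximate the published Chinchilla fit and are
# used only for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def projected_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def crosses_threshold(n_params: float, n_tokens: float, loss_threshold: float) -> bool:
    # Architecture-agnostic question: is this run projected to cross the
    # capability proxy (loss), however many raw FLOPs it takes to get there?
    return projected_loss(n_params, n_tokens) <= loss_threshold

print(crosses_threshold(n_params=7e10, n_tokens=1.4e12, loss_threshold=1.95))
```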

Join our Computational Mechanics Hackathon, organized with the support of APART, PIBBSS and Simplex


This is an opportunity to learn more about Computational Mechanics, its applications to AI interpretability & safety, and to get your hands dirty by working on a concrete project together with a team and supported by Adam & Paul. Also, there will be cash prizes for the best projects!
 

Read more and sign up for the event here

We’re excited about Computational Mechanics as a framework because it provides a rigorous notion of structure that can be applied to both data and model internals. In Transformers Represent Belief State Geometry in their Residual Stream, we validated that Computational Mechanics can help us understand fundamentally what computational structures transformers implement when trained on next-token prediction - a belief...

We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can't make it this time!

Logical decision theory was introduced (in part) to resolve problems such as Parfit's hitchhiker.

I heard an argument that there is no reason to introduce a new decision theory - one can just take causal decision theory and precommit to doing whatever is needed on such problems (e.g. pay the money once in the city).

This seems dubious given that people spent so much time on developing logical decision theory. However, I cannot formulate a counterargument. What is wrong with the claim that CDT with precommitment is the "right" decision theory?

Dagon

Nope!  Parfit's Hitchhiker is designed to show exactly this.  A CDT agent will desperately wish for some way to actually commit to paying.  

I think some of the confusion in this thread is about what "CDT with precommitment (or really, commitment)" actually means. It doesn't mean "intent" or "plan". It means "force" - throw the steering wheel out the window, so there IS NO later decision. Note also that humans aren't CDT agents; they're some weird crap that you need to squint pretty hard to call "rational" at all.
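To make the force of that distinction concrete, here is a toy payoff calculation for Parfit's Hitchhiker (my own illustrative numbers, not from the thread):

```python
# Toy payoffs for Parfit's Hitchhiker (illustrative numbers only).
RIDE_VALUE = 1_000_000   # value of being rescued from the desert
PAYMENT    = 100         # cost of paying the driver once in town

def outcome(will_pay_in_town: bool) -> int:
    """The driver perfectly predicts whether you will pay once in town."""
    if not will_pay_in_town:
        return 0                      # predicted non-payment: left in the desert
    return RIDE_VALUE - PAYMENT       # rescued, then you pay

# A CDT agent that merely "intends" to pay re-decides in town, where paying is
# a pure causal loss; its true policy is non-payment, so it never gets the ride.
print(outcome(will_pay_in_town=False))   # 0
# An agent with a binding commitment (or an LDT agent) pays and gets rescued.
print(outcome(will_pay_in_town=True))    # 999900
```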

My p(doom) is pretty high and I found myself repeating the same words to explain some parts of the intuitions behind it. I think there are hard parts of the alignment problem that we’re not on track to solve in time.[1] Alignment plans that I've heard[2] fail for reasons connected to these hard parts of the problem, so I decided to attempt to write my thoughts in a short post.

(Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.) 

Modern machine learning uses a powerful search process to look for neural network parameters such that the network performs well according to some objective function.

There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure with the gradient of the...

Mikhail Samin
My very short explanation: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization?commentId=CCnHsYdFoaP2e4tku

Curious to hear what you have to say about this blog post ("Alignment likely generalizes further than capabilities").

quetzal_rainbow
What tools do you mean? As far as I know, RLHF/RLAIF alignment tends to blow up under mildly unusual circumstances; see https://arxiv.org/abs/2311.07590 https://arxiv.org/abs/2405.01576 https://www.anthropic.com/research/many-shot-jailbreaking And, not to forget the classic: https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day

My median/average model of failure is "we don't know, lol, I can just reasonably extrapolate current mild preventable failures into future epic failures". My modal model is based on the simulator framework and it says the following:

1. LLMs work as simulators: they read the prompt, put a probability distribution over possible generators of this text, and then put a probability distribution over the possible next token.
2. RLHFed LLMs assign high prior probability to "the assistant answers in the most moral way when asked a morality-relevant question" after prompts of the form "user-assistant dialogue".
3. When you put LLMs into different conditions, like "you are a stock manager in a tense financial situation", they update away from "being a nice moral assistant" to "being an actual stock manager", which implies "you can use shady insider trading schemes".
4. If you try, say, to build a system of LLMs capable of designing nanotechnology, RLHF the various modules inside this system, run it for a while, and then ask it to shut down, the control module is going to look at the overall context, ask itself "is a system capable of producing all this highly sophisticated technical text the sort of system that shuts down when asked nicely?", and with some probability decide "no, I am modeling a strong independent agentic system; these humans can go to hell".
jacquesthibs
Something I've been thinking about lately: for "scarcity of compute" reasons, I think it's fairly likely we end up in a scaffolded-AI world where one highly intelligent model (that requires much more compute) will essentially delegate tasks to weaker models, as long as it knows that the weaker (maybe fine-tuned) model is capable of reliably doing that task. Like, let's say you have a weak doctor AI that can reliably answer most medical questions. However, it knows when it is less confident in a diagnosis, so it will reach out to the much more intelligent (and more compute-hungry) AI when it needs a second opinion.

Somewhat related, there's a worldview Noah Smith proposed: maybe human jobs don't actually end up automated, because there's an opportunity cost in giving up compute for a task a human can already do (even if the AI can do it for cheaper) when you could instead use that compute for something much more important. Imagine, "Should I use the AI to build a Dyson sphere, or should I spread that compute across tasks humans can already do?"
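A minimal sketch of that confidence-gated delegation pattern; `weak_model`, `strong_model`, and the threshold are hypothetical stand-ins rather than any particular API:

```python
# Toy sketch of compute-aware delegation: a cheap model handles a query unless
# its own confidence falls below a threshold, in which case the task escalates
# to the expensive model. The two models are hypothetical callables returning
# (answer, confidence in [0, 1]); nothing here is a real API.
from typing import Callable, Tuple

Model = Callable[[str], Tuple[str, float]]

def answer(query: str, weak_model: Model, strong_model: Model,
           confidence_threshold: float = 0.9) -> str:
    draft, confidence = weak_model(query)
    if confidence >= confidence_threshold:
        return draft                         # cheap path: weak model is confident
    second_opinion, _ = strong_model(query)  # expensive path: escalate
    return second_opinion
```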
gwern
This doesn't really seem like a meaningful question. Of course "AI" will be "scaffolded". But what is the "AI"? It's not a natural kind. It's just where you draw the boundaries for convenience. An "AI" which "reaches out to a more powerful AI" is not meaningful - one could say the same thing of your brain! Or a Mixture-of-Experts model, or speculative decoding (both already in widespread use). Some tasks are harder than others, and different amounts of computation get brought to bear by the system as a whole, and that's just part of the learned algorithms it embodies and where the smarts come from. Or one could say it of your computer: different things take different paths through your "computer", ping-ponging through a bunch of chips and parts of chips as appropriate. Do you muse about living in a world where for 'scarcity of compute' reasons your computer is a 'scaffolded computer world' where highly intelligent chips will essentially delegate tasks to weaker chips so long as it knows that the weaker (maybe highly specialized ASIC) chip is capable of reliably doing that task...? No. You don't care about that. That's just details of internal architecture which you treat as a black box. (And that argument doesn't protect humans for the same reason it didn't protect, say, chimpanzees or Neanderthals or horses. Comparative advantage is extremely fragile.)

Thanks for the comment; that makes sense. Applying that kind of boundary to AI systems likely leads to erroneous thinking (though it may be narrowly useful if you are careful, in my opinion).

It makes a lot of sense to imagine future AIs having learned behaviours for using their compute efficiently without relying on some outside entity.

I agree with the fragility example.

jacquesthibs
Great. Yeah, I also expect that it is hard to get current models to work well on this. However, I will mention that the DeepSeekMath model does seem to outperform GPT-4 despite having only 7B parameters. So, it may be possible to create a +70B fine-tune that basically destroys GPT-4 at math. The issue is whether it generalizes to the kind of math we'd commonly see in alignment research. Additionally, I expect at least a bit can be done with scaffolding, search, etc.

I think the issue with many prompting methods atm is that they are specifically trying to get the model to arrive at solutions on their own. And what I mean by that is that they are starting from the frame of "how can we get LLMs to solve x math task on their own," instead of "how do we augment the researcher's ability to arrive at (better) proofs more efficiently using LLMs." So, I think there's room for product building that does not involve "can you solve this math question from scratch," though I see the value in getting that to work as well.
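As a toy illustration of the "augment the researcher" framing (a sketch only; `propose_steps` is a hypothetical stand-in for whatever model or scaffold is used, not a real API):

```python
# Sketch of a researcher-in-the-loop proof assistant: the model only proposes
# candidate next steps, and the human picks, edits, or rejects them.
from typing import List

def propose_steps(problem: str, proof_so_far: List[str], n: int = 3) -> List[str]:
    # Hypothetical stand-in: in practice this would call whatever LLM/scaffold
    # you use; here it returns placeholder suggestions so the loop runs.
    return [f"(candidate step {i + 1} for: {problem})" for i in range(n)]

def assisted_proof(problem: str) -> List[str]:
    proof: List[str] = []
    while True:
        candidates = propose_steps(problem, proof)
        for i, step in enumerate(candidates):
            print(f"[{i}] {step}")
        choice = input("pick a step number, type your own step, or 'done': ")
        if choice == "done":
            return proof
        if choice.isdigit() and int(choice) < len(candidates):
            proof.append(candidates[int(choice)])
        elif choice:
            proof.append(choice)
```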

Disclaimer: if you are using a definition in a nonmathematical piece of writing, you are probably making a mistake; you should just get rid of the definition and instead use a few examples. This applies double to people who think they are being "rigorous" by defining things but are not actually doing any math. Nonetheless, definitions are still useful and necessary when one is ready to do math, and some pre-formal conceptual work is often needed to figure out which mathematical definitions to use; thus the usefulness of this post.

Suppose I’m negotiating with a landlord about a pet, and in the process I ask the landlord what counts as a “big dog”. The landlord replies “Well, any dog that’s not small”. I ask what counts as a...

cubefox
What do you mean with "all the true properties"?
tailcalled
The properties that hold in all models of the theory. That is, in logic, propositions are usually interpreted to be about some object, called the model. To pin down a model, you take some known facts about that model as axioms. Logic then allows you to derive additional propositions which are true of all the objects satisfying the initial axioms, and first-order logic is complete in the sense that if some proposition is true for all models of the axioms then it is provable in the logic.
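For reference, the completeness property being invoked can be stated compactly (standard Gödel completeness for first-order logic, with the left-to-right direction given by soundness):

```latex
% \models: \varphi holds in every model of the theory T;
% \vdash: \varphi is derivable from T in the proof system.
\[
    T \vdash \varphi \iff T \models \varphi
\]
```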
cubefox
I'm still not sure what you want to say. It's a necessary property of the natural numbers that each of them can be reached by iterating the successor function. That condition can't be expressed in first-order logic, so it can't be proved, and it holds in some models but not in others. It's like trying to define "cat" by stating that it's an animal: that is not a sufficient definition.

You're the one who brought up the natural numbers; I'm just saying they're not relevant to the discussion, because they don't satisfy the uniqueness property the OP was talking about.


Update: May 13th, 2024

If you're reading this, it's possible you just found yourself switched to the Enriched tab. Congratulations! You were randomly assigned ~~to be fed to the Shoggoth~~ to a group of users automatically switched to the new posts list.

The Enriched posts list:

  • Is 50% the same algorithm as Latest, and 50% posts selected for you by an ML algorithm based on your post interaction history.
  • The sparkle icon next to the post title marks which posts were the result of personalized recommendations.
  • You can switch back to the regular Latest tab at any time if you don't like the recommendations.
  • We changed the name "Recommended" to "Enriched" to better imply that it contains 50% of the regular Latest posts. (We will probably soon add a Recommended tab that is 100% recommendations.)

You can read...

I'm enjoying having old posts recommended to me. I like the enriched tab.

Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson

We have learned more since last week. It’s worse than we knew.

How much worse? In which ways? With what exceptions?

That’s what this post is about.

The Story So Far

For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and the inability to sell it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including an NDA preventing anyone from revealing the existence of these clauses.

No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and...

It seems increasingly plausible that it would be in the public interest to ban non-disparagement clauses more generally going forward, or at least set limits on their scope and length (although I think nullifying existing contracts is bad, and the government should not do that, nor should it have done so for non-competes).

I concur.

It should be noted, though: we can spend all day taking apart these contracts and applying pressure publicly, but real change will have to come from the courts. I await an official judgment to see the direction of this issue. Arguab... (read more)

Thomas Kwa
Does this work? Sounds like a good idea.
oumuamua
While I am not a lawyer, it appears that this concept might indeed hold some merit. A similar strategy is currently being employed by organizations focused on civil rights, known as a “warrant canary”. Essentially, it’s a method by which a communications service provider aims to implicitly inform its users that the provider has been served with a government subpoena, despite legal prohibitions on revealing the existence of the subpoena. The idea behind it is that there are very strong protections against compelled speech, especially against compelled untrue speech (e.g. updating the canary despite having received a subpoena). The Electronic Frontier Foundation (EFF) seems to believe that warrant canaries are legal.
Zvi
I think it works, yes. Indeed I have a canary on my Substack About page to this effect.

I would like to pose a set of broad questions about a project called Beat AI: A contest using philosophical concepts (details below) with the LessWrong community. My hope would be that we have a thoughtful and critical discussion about it. (To be clear, I'm not endorsing it; I have concerns, but I don't want to jump to conclusions.)

Some possible topics for discussion might include:

  • Do you know the project or its founder(s)? How and to what extent are they thinking about AI safety, if at all?

  • If some people decide here that the project seems risky or misguided, do we want to organize our thinking and possibly draft a letter to the project?

  • Have you seen projects like the one below where a community is invited to compete against

...
Answer by PhilosophicalSoul

'...and give us a license to use and distribute your submissions.'

For how many generations would humans be able to outwit the AI until it outwits us?
