In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability. 

The goal of this post is to try to give that hat to more people.

My mainline prediction scenario for the next decades.

My mainline prediction*:

* LLMs will not scale to AGI. They will not spawn evil gremlins or mesa-optimizers. BUT scaling laws will continue to hold, and future LLMs will be very impressive and make a sizable impact on the real economy and on science over the next decade.
* There is a single innovation left to make AGI-in-the-Alex-sense work, i.e. coherent, long-term planning agents (LTPA) that are effective and efficient in data-sparse domains over long horizons.
* That innovation will be found within the next 10-15 years.
* It will be clear to the general public that these are dangerous.
* Governments will act quickly and (relatively) decisively to bring these agents under state control. National security concerns will dominate.
* Power will reside mostly with governments' AI safety institutes and national security agencies. Insofar as divisions of tech companies are able to create LTPAs, they will be effectively nationalized.
* International treaties will be made to constrain AI, outlawing the development of LTPAs by private companies. Great-power competition will mean the US and China continue developing LTPAs, possibly largely boxed. Treaties will try to constrain this development with only partial success (similar to nuclear treaties).
* LLMs will continue to exist and be used by the general public.
* Conditional on AI ruin, the closest analogy is probably something like the Cortez-Pizarro-Afonso takeovers. Unaligned AI will rely on human infrastructure and human allies for the earlier parts of takeover, but its inherent advantages in tech, coherence, decision-making and (artificial) plagues will be the deciding factor.
* The world may be mildly multi-polar.
  * This will involve conflict between AIs.
  * AIs may very possibly be able to cooperate in ways humans can't.
* The arrival of AGI will immediately inaugurate a scientific revolution. Sci-fi-sounding progress like advanced robotics, quantum magic, nanotech, life extension, laser weapons, large space engineering, and cures for many/most remaining diseases will become possible within two decades of AGI, possibly much faster.
* Military power will shift to automated manufacturing of drones and weaponized artificial plagues. Drones, mostly flying, will dominate the battlefield. Mass production of drones and their rapid and effective deployment in swarms will be key to victory.

Two points on which I differ with most commentators:

(i) I believe AGI is a real (mostly discrete) thing, not a vibe or a general increase of improved tools. I believe it is inherently agentic. I don't think spontaneous emergence of agents is impossible, but I think it is more plausible that agents will be built rather than grown.

(ii) I believe the EA/AI-safety community is in general overrating the importance of individual tech companies vis-à-vis broader trends and the power of governments. I strongly agree with Stefan Schubert's take here on the latent hidden power of government: https://stefanschubert.substack.com/p/crises-reveal-centralisation Consequently, the EA/AI-safety community often focuses myopically on boardroom politics that are relatively inconsequential in the grand scheme of things.

*Where by "mainline prediction" I mean the scenario that is the mode of what I expect: the single likeliest scenario. However, since it contains a large number of details, each of which could go differently, the probability of this specific scenario is still low.
There is a mystery which many applied mathematicians have asked themselves: Why is linear algebra so over-powered? An answer I like was given in Lloyd Trefethen's book An Applied Mathematician's Apology, in which he writes (my summary): Everything in the real world is described fully by non-linear analysis. In order to make such systems simpler, we can linearize (differentiate) them, and use a first or second order approximation, and in order to represent them on a computer, we can discretize them, which turns analytic techniques into algebraic ones. Therefore we've turned our non-linear analysis into linear algebra.
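A toy illustration of that "linearize + discretize" pipeline (my own example using NumPy, not Trefethen's): a nonlinear boundary-value problem becomes, after finite-difference discretization and Newton linearization, nothing but a sequence of ordinary linear solves.

```python
# Minimal sketch (my illustration, not from the book): solve the nonlinear
# boundary-value problem u''(x) + exp(u(x)) = 0 on [0, 1] with u(0) = u(1) = 0.
# Discretize u'' with finite differences, then linearize with Newton's method,
# so each iteration reduces to a plain linear solve.
import numpy as np

n = 100                      # interior grid points
h = 1.0 / (n + 1)            # grid spacing
u = np.zeros(n)              # initial guess at the interior points

# Discretization: tridiagonal second-difference matrix.
D2 = (np.diag(np.full(n, -2.0))
      + np.diag(np.ones(n - 1), 1)
      + np.diag(np.ones(n - 1), -1)) / h**2

for _ in range(20):
    F = D2 @ u + np.exp(u)          # residual of the discretized equation
    J = D2 + np.diag(np.exp(u))     # linearization: Jacobian of the residual
    step = np.linalg.solve(J, -F)   # all the real work is linear algebra
    u += step
    if np.linalg.norm(step) < 1e-12:
        break
```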
Elizabeth
EA organizations frequently ask people to run criticism by them ahead of time. I've been wary of the push for this norm. My big concerns were that orgs wouldn't comment until a post was nearly done, and that it would take a lot of time.

My recent post mentioned a lot of people and organizations, so it seemed like useful data. I reached out to 12 email addresses, plus one person in FB DMs and one open call for information on a particular topic. This doesn't quite match what you see in the post because some people/orgs were used more than once, and other mentions were cut. The post was in a fairly crude state when I sent it out.

Of those 14:

* 10 had replied by the start of the next day. More than half of those replied within a few hours. I expect this was faster than usual because no one had more than a few paragraphs relevant to them or their org, but it is still impressive.
* It's hard to say how sending an early draft changed things. One person got some extra anxiety because their paragraph was full of TODOs (because it was positive and I hadn't worked as hard fleshing out the positive mentions ahead of time). I could maybe have saved myself one stressful interaction if I'd realized I was going to cut an example ahead of time.
* Only 80,000 Hours, Anima International, and GiveDirectly failed to respond before publication (7 days after I emailed them).
* I didn't keep as close track of changes, but at a minimum the replies led to 2 examples being removed entirely, 2 clarifications, and some additional information that made the post better.

So overall I'm very glad I solicited comments, and found the process easier than expected.
Implicitly, SAEs are trying to model activations across a context window, shaped like (n_ctx, n_emb). But today's SAEs ignore the token axis, modeling such activations as lists of n_ctx IID samples from a distribution over n_emb-dim vectors. I suspect SAEs could be much better (at reconstruction, sparsity and feature interpretability) if they didn't make this modeling choice -- for instance by "looking back" at earlier tokens using causal attention (a rough sketch of one possible implementation appears after this list). Assorted arguments to this effect:

* There is probably a lot of compressible redundancy in LLM activations across the token axis, because most ways of combining textual features (given any intuitive notion of "textual feature") across successive tokens are either very unlikely under the data distribution, or actually impossible.
  * For example, you're never going to see a feature that means "we're in the middle of a lengthy monolingual English passage" at position j and then a feature that means "we're in the middle of a lengthy monolingual Chinese passage" at position j+1.
  * In other words, the intrinsic dimensionality of window activations is probably a lot less than n_ctx * n_emb, but SAEs don't exploit this.
* [Another phrasing of the previous point.] Imagine that someone handed you a dataset of (n_ctx, n_emb)-shaped matrices, without telling you where they're from. (But in fact they're LLM activations, the same ones we train SAEs on.) And they said "OK, your task is to autoencode these things."
  * I think you would quickly notice, just by doing exploratory data analysis, that the data has a ton of structure along the first axis: the rows of each matrix are very much not independent. And you'd design your autoencoder around this fact.
  * Now someone else comes to you and says "hey look at my cool autoencoder for this problem. It's actually just an autoencoder for individual rows, and then I autoencode a matrix by applying it separately to the rows one by one."
  * This would seem bizarre -- you'd want to ask this person what the heck they were thinking.
  * But this is what today's SAEs do.
* We want features that "make sense," "are interpretable." In general, such features will be properties of regions of text (phrases, sentences, passages, or the whole text at once) rather than individual tokens.
  * Intuitively, such a feature is equally present at every position within the region. An SAE has to pay a high L1 cost to activate the feature over and over at all those positions.
  * This could lead to an unnatural bias to capture features that are relatively localized, and not capture those that are less localized.
  * Or, less-localized features might be captured but with "spurious localization":
    * Conceptually, the feature is equally "true" of the whole region at once.
    * At some positions in the region, the balance between L1/reconstruction tips in favor of reconstruction, so the feature is active.
    * At other positions, the balance tips in favor of L1, and the feature is turned off.
    * To the interpreter, this looks like a feature that has a clear meaning at the whole-region level, yet flips on and off in a confusing and seemingly arbitrary pattern within the region.
  * The "spurious localization" story feels like a plausible explanation for the way current SAE features look.
    * Often if you look at the most-activating cases, there is some obvious property shared by the entire texts you are seeing, but the pattern of feature activation within each text is totally mysterious. Many of the features in the Anthropic Sonnet paper look like this to me.
    * Descriptions of SAE features typically round this off to a nice-sounding description at the whole-text level, ignoring the uninterpretable pattern over time. You're being sold an "AI agency feature" (or whatever), but what you actually get is a "feature that activates at seemingly random positions in AI-agency-related texts."
* An SAE that could "look back" at earlier positions might be able to avoid paying "more than one token's worth of L1" for a region-level feature, and this might have a very nice interpretation as "diffing" the text.
  * I'm imagining that a very non-localized (intuitive) feature, such as "this text is in English," would be active just once, at the first position where the property it's about becomes clearly true.
  * Ideally, at later positions, the SAE encoder would look back at earlier activations and suppress this feature here because it's "already been accounted for," thus saving some L1.
  * And the decoder would also look back (possibly reusing the same attention output or something) and treat the feature as though it had been active here (in the sense that "active here" would mean in today's SAEs), thus preserving reconstruction quality.
  * In this contextual SAE, the features now express only what is new or unpredictable-in-advance at the current position: a "diff" relative to all the features at earlier positions.
  * For example, if the language switches from English to Cantonese, we'd have one or more feature activations that "turn off" English and "turn on" Cantonese, at the position where the switch first becomes evident.
    * But within the contiguous, monolingual regions, the language would be implicit in the features at the most recent position where such a "language flag" was set. All the language-flag-setting features would be free to turn off inside these regions, freeing up L0/L1 room for stuff we don't already know about the text.
  * This seems like it would allow for vastly higher sparsity at any given level of reconstruction quality -- and also better interpretability at any given level of sparsity, because we don't have the "spurious localization" problem.
  * (I don't have any specific architecture for this in mind, though I've gestured towards one above. It's of course possible that this might just not work, or would be very tricky. One danger is that the added expressivity might make your autoencoder "too powerful," with opaque/polysemantic/etc. calculations chained across positions that recapitulate in miniature the original problem of interpreting models with attention; it may or may not be tough to avoid this in practice.)
* At any point before the last attention layer, LLM activations at individual positions are free to be "ambiguous" when taken out of context, in the sense that the same vector might mean different things in two different windows. The LLM can always disambiguate them as needed with attention, later.
  * This is meant as a counter to the following argument: "activations at individual positions are the right unit of analysis, because they are what the LLM internals can directly access (read off linearly, etc.)"
  * If we're using a notion of "what the LLM can directly access" which (taken literally) implies that it can't "directly access" other positions, that seems way too limiting -- we're effectively pretending the LLM is a bigram model.
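For concreteness, here is one rough guess at what such a "contextual SAE" could look like: the encoder sees a causal-attention summary of earlier positions alongside the current activation, so it can suppress features that are "already accounted for," while the decoder sees the same summary so it can still reconstruct them. This is my own sketch, not an architecture proposed in the quick take or in any published SAE work; all module names and hyperparameters are illustrative.

```python
# Speculative sketch of a "contextual SAE" with causal look-back. Everything
# here (layer choices, sizes, loss weights) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualSAE(nn.Module):
    def __init__(self, n_emb: int, n_features: int, n_heads: int = 4):
        super().__init__()
        self.context = nn.MultiheadAttention(n_emb, n_heads, batch_first=True)
        self.encode = nn.Linear(2 * n_emb, n_features)       # [current; context] -> sparse codes
        self.decode = nn.Linear(n_features + n_emb, n_emb)   # codes + context -> reconstruction

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_ctx, n_emb) -- a window of LLM activations.
        n_ctx = acts.shape[1]
        causal_mask = torch.triu(torch.ones(n_ctx, n_ctx, dtype=torch.bool), diagonal=1)
        ctx, _ = self.context(acts, acts, acts, attn_mask=causal_mask)  # look back only
        feats = F.relu(self.encode(torch.cat([acts, ctx], dim=-1)))     # sparse feature codes
        recon = self.decode(torch.cat([feats, ctx], dim=-1))
        return recon, feats


sae = ContextualSAE(n_emb=512, n_features=4096)
acts = torch.randn(8, 128, 512)                              # fake activations, for shape-checking
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
```

Whether this avoids the "too powerful autoencoder" failure mode flagged above is exactly the open question; the sketch only shows where the look-back would slot in.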

Popular Comments

Recent Discussion

Effective FLOPs measure a new system in terms of the performance of another architecture. They are natural for measuring changes in architecture while remaining at a similar scale, so they fit well in RSPs. But they are not suited for giving absolute measurements in standard units, or for defining standard units.

Absolute effective compute thresholds could as well be defined directly in terms of loss.
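For illustration, a threshold stated directly in terms of loss might look like the sketch below. The scaling-law form is the Chinchilla-style L(N, D) = E + A/N^a + B/D^b; the constants are roughly those reported by Hoffmann et al. (2022) and are used here purely as placeholders, not as a policy-grade fit.

```python
# Toy sketch: define an absolute threshold as "projected loss on a fixed
# reference distribution" rather than as effective FLOPs of a reference
# architecture. Constants approximate the published Chinchilla fit and are
# used only for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def projected_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def crosses_threshold(n_params: float, n_tokens: float, loss_threshold: float) -> bool:
    # Architecture-agnostic question: is this run projected to cross the
    # capability proxy (loss), however many raw FLOPs it takes to get there?
    return projected_loss(n_params, n_tokens) <= loss_threshold

print(crosses_threshold(n_params=7e10, n_tokens=1.4e12, loss_threshold=1.95))
```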

Join our Computational Mechanics Hackathon, organized with the support of APART, PIBBSS and Simplex


This is an opportunity to learn more about Computational Mechanics, its applications to AI interpretability & safety, and to get your hands dirty by working on a concrete project together with a team and supported by Adam & Paul. Also, there will be cash prizes for the best projects!
 

Read more and sign up for the event here

We’re excited about Computational Mechanics as a framework because it provides a rigorous notion of structure that can be applied to both data and model internals. In Transformers Represent Belief State Geometry in their Residual Stream, we validated that Computational Mechanics can help us understand fundamentally what computational structures transformers implement when trained on next-token prediction - a belief...

We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can't make it this time!

Logical decision theory was introduced (in part) to resolve problems such as Parfit's hitchhiker.

I heard an argument that there is no reason to introduce a new decision theory - one can just take causal decision theory and precommit to doing whatever is needed on such problems (e.g. pay the money once in the city).

This seems dubious given that people spent so much time on developing logical decision theory. However, I cannot formulate a counterargument. What is wrong with the claim that CDT with precommitment is the "right" decision theory?

Dagon

Nope!  Parfit's Hitchhiker is designed to show exactly this.  A CDT agent will desperately wish for some way to actually commit to paying.  

I think some of the confusion in this thread is about what "CDT with precommitment (or really, commitment)" actually means. It doesn't mean "intent" or "plan". It means "force" - throw the steering wheel out the window, so there IS NO later decision. Note also that humans aren't CDT agents; they're some weird crap that you need to squint pretty hard to call "rational" at all.
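To make the force of that distinction concrete, here is a toy payoff calculation for Parfit's Hitchhiker (my own illustrative numbers, not from the thread):

```python
# Toy payoffs for Parfit's Hitchhiker (illustrative numbers only).
RIDE_VALUE = 1_000_000   # value of being rescued from the desert
PAYMENT    = 100         # cost of paying the driver once in town

def outcome(will_pay_in_town: bool) -> int:
    """The driver perfectly predicts whether you will pay once in town."""
    if not will_pay_in_town:
        return 0                      # predicted non-payment: left in the desert
    return RIDE_VALUE - PAYMENT       # rescued, then you pay

# A CDT agent that merely "intends" to pay re-decides in town, where paying is
# a pure causal loss; its true policy is non-payment, so it never gets the ride.
print(outcome(will_pay_in_town=False))   # 0
# An agent with a binding commitment (or an LDT agent) pays and gets rescued.
print(outcome(will_pay_in_town=True))    # 999900
```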

My p(doom) is pretty high and I found myself repeating the same words to explain some parts of the intuitions behind it. I think there are hard parts of the alignment problem that we’re not on track to solve in time.[1] Alignment plans that I've heard[2] fail for reasons connected to these hard parts of the problem, so I decided to attempt to write my thoughts in a short post.

(Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.) 

Modern machine learning uses a powerful search process to look for neural network parameters such that the network performs well according to some objective function.

There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure with the gradient of the...

Mikhail Samin
My very short explanation: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization?commentId=CCnHsYdFoaP2e4tku

Curious to hear what you have to say about this blog post ("Alignment likely generalizes further than capabilities").

quetzal_rainbow
What tools do you mean? As far as I know, RLHF/RLAIF alignment tends to blow up under mildly unusual circumstances; see https://arxiv.org/abs/2311.07590 https://arxiv.org/abs/2405.01576 https://www.anthropic.com/research/many-shot-jailbreaking And, not to forget the classic: https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day

My median/average model of failure is "we don't know, lol, I can just reasonably extrapolate current mild preventable failures into future epic failures". My modal model is based on the simulator framework and it says the following:

1. LLMs work as simulators: they read the prompt, put a probability distribution over possible generators of this text, and then put a probability distribution over the possible next token.
2. RLHFed LLMs assign high prior probability to "the assistant answers in the most moral way when asked a morality-relevant question" after prompts of the form "user-assistant dialogue".
3. When you put LLMs into different conditions, like "you are a stock manager in a tense financial situation", they update away from "being a nice moral assistant" to "being an actual stock manager", which implies "you can use shady insider trading schemes".
4. If you try, say, to build a system of LLMs capable of designing nanotechnology, RLHF the various modules inside this system, run it for a while, and then ask it to shut down, the control module is going to look at the overall context, ask itself "is a system capable of producing all this highly sophisticated technical text the sort of system that shuts down when asked nicely?", and with some probability decide "no, I am modeling a strong independent agentic system; these humans can go to hell".
jacquesthibs
Something I've been thinking about lately: for "scarcity of compute" reasons, I think it's fairly likely we end up in a scaffolded-AI world where one highly intelligent model (that requires much more compute) will essentially delegate tasks to weaker models, as long as it knows that the weaker (maybe fine-tuned) model is capable of reliably doing that task. Like, let's say you have a weak doctor AI that can reliably answer most medical questions. However, it knows when it is less confident in a diagnosis, so it will reach out to the much more intelligent (and more compute-hungry) AI when it needs a second opinion.

Somewhat related, there's a worldview Noah Smith proposed: maybe human jobs don't actually end up automated, because there's an opportunity cost in giving up compute for a task a human can already do (even if the AI can do it for cheaper) when you could instead use that compute for something much more important. Imagine, "Should I use the AI to build a Dyson sphere, or should I spread that compute across tasks humans can already do?"
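A minimal sketch of that confidence-gated delegation pattern; `weak_model`, `strong_model`, and the threshold are hypothetical stand-ins rather than any particular API:

```python
# Toy sketch of compute-aware delegation: a cheap model handles a query unless
# its own confidence falls below a threshold, in which case the task escalates
# to the expensive model. The two models are hypothetical callables returning
# (answer, confidence in [0, 1]); nothing here is a real API.
from typing import Callable, Tuple

Model = Callable[[str], Tuple[str, float]]

def answer(query: str, weak_model: Model, strong_model: Model,
           confidence_threshold: float = 0.9) -> str:
    draft, confidence = weak_model(query)
    if confidence >= confidence_threshold:
        return draft                         # cheap path: weak model is confident
    second_opinion, _ = strong_model(query)  # expensive path: escalate
    return second_opinion
```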
gwern
This doesn't really seem like a meaningful question. Of course "AI" will be "scaffolded". But what is the "AI"? It's not a natural kind. It's just where you draw the boundaries for convenience. An "AI" which "reaches out to a more powerful AI" is not meaningful - one could say the same thing of your brain! Or a Mixture-of-Experts model, or speculative decoding (both already in widespread use). Some tasks are harder than others, and different amounts of computation get brought to bear by the system as a whole, and that's just part of the learned algorithms it embodies and where the smarts come from. Or one could say it of your computer: different things take different paths through your "computer", ping-ponging through a bunch of chips and parts of chips as appropriate. Do you muse about living in a world where for 'scarcity of compute' reasons your computer is a 'scaffolded computer world' where highly intelligent chips will essentially delegate tasks to weaker chips so long as it knows that the weaker (maybe highly specialized ASIC) chip is capable of reliably doing that task...? No. You don't care about that. That's just details of internal architecture which you treat as a black box. (And that argument doesn't protect humans for the same reason it didn't protect, say, chimpanzees or Neanderthals or horses. Comparative advantage is extremely fragile.)

Thanks for the comment; that makes sense. Applying that kind of boundary to AI systems likely leads to erroneous thinking (though it may be narrowly useful if you are careful, in my opinion).

It makes a lot of sense to imagine future AIs having learned behaviours for using their compute efficiently without relying on some outside entity.

I agree with the fragility example.

jacquesthibs
Great. Yeah, I also expect that it is hard to get current models to work well on this. However, I will mention that the DeepSeekMath model does seem to outperform GPT-4 despite having only 7B parameters. So, it may be possible to create a +70B fine-tune that basically destroys GPT-4 at math. The issue is whether it generalizes to the kind of math we'd commonly see in alignment research. Additionally, I expect at least a bit can be done with scaffolding, search, etc.

I think the issue with many prompting methods atm is that they are specifically trying to get the model to arrive at solutions on their own. And what I mean by that is that they are starting from the frame of "how can we get LLMs to solve x math task on their own," instead of "how do we augment the researcher's ability to arrive at (better) proofs more efficiently using LLMs." So, I think there's room for product building that does not involve "can you solve this math question from scratch," though I see the value in getting that to work as well.
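As a toy illustration of the "augment the researcher" framing (a sketch only; `propose_steps` is a hypothetical stand-in for whatever model or scaffold is used, not a real API):

```python
# Sketch of a researcher-in-the-loop proof assistant: the model only proposes
# candidate next steps, and the human picks, edits, or rejects them.
from typing import List

def propose_steps(problem: str, proof_so_far: List[str], n: int = 3) -> List[str]:
    # Hypothetical stand-in: in practice this would call whatever LLM/scaffold
    # you use; here it returns placeholder suggestions so the loop runs.
    return [f"(candidate step {i + 1} for: {problem})" for i in range(n)]

def assisted_proof(problem: str) -> List[str]:
    proof: List[str] = []
    while True:
        candidates = propose_steps(problem, proof)
        for i, step in enumerate(candidates):
            print(f"[{i}] {step}")
        choice = input("pick a step number, type your own step, or 'done': ")
        if choice == "done":
            return proof
        if choice.isdigit() and int(choice) < len(candidates):
            proof.append(candidates[int(choice)])
        elif choice:
            proof.append(choice)
```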

Disclaimer: if you are using a definition in a nonmathematical piece of writing, you are probably making a mistake; you should just get rid of the definition and instead use a few examples. This applies double to people who think they are being "rigorous" by defining things but are not actually doing any math. Nonetheless, definitions are still useful and necessary when one is ready to do math, and some pre-formal conceptual work is often needed to figure out which mathematical definitions to use; thus the usefulness of this post.

Suppose I’m negotiating with a landlord about a pet, and in the process I ask the landlord what counts as a “big dog”. The landlord replies “Well, any dog that’s not small”. I ask what counts as a...

cubefox
What do you mean with "all the true properties"?
tailcalled
The properties that hold in all models of the theory. That is, in logic, propositions are usually interpreted to be about some object, called the model. To pin down a model, you take some known facts about that model as axioms. Logic then allows you to derive additional propositions which are true of all the objects satisfying the initial axioms, and first-order logic is complete in the sense that if some proposition is true for all models of the axioms then it is provable in the logic.
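For reference, the completeness property being invoked can be stated compactly (standard Gödel completeness for first-order logic, with the left-to-right direction given by soundness):

```latex
% \models: \varphi holds in every model of the theory T;
% \vdash: \varphi is derivable from T in the proof system.
\[
    T \vdash \varphi \iff T \models \varphi
\]
```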
cubefox
I'm still not sure what you want to say. It's a necessary property of the natural numbers that each of them can be reached by iterating the successor function. That condition can't be expressed in first-order logic, so it can't be proved, and it holds in some models but not in others. It's like trying to define "cat" by stating that it's an animal: that is not a sufficient definition.

You're the one who brought up the natural numbers; I'm just saying they're not relevant to the discussion, because they don't satisfy the uniqueness property the OP was talking about.


Update: May 13th, 2024

If you're reading this, it's possible you just found yourself switched to the Enriched tab. Congratulations! You were randomly assigned ~~to be fed to the Shoggoth~~ to a group of users automatically switched to the new posts list.

The Enriched posts list:

  • Is 50% the same algorithm as Latest, and 50% posts selected for you by an ML algorithm based on your post interaction history.
  • The sparkle icon next to the post title marks which posts were the result of personalized recommendations.
  • You can switch back to the regular Latest tab at any time if you don't like the recommendations.
  • We changed the name "Recommended" to "Enriched" to better imply that it contains 50% of the regular Latest posts. (We will probably soon add a Recommended tab that is 100% recommendations.)

You can read...

I'm enjoying having old posts recommended to me. I like the enriched tab.

Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson

We have learned more since last week. It’s worse than we knew.

How much worse? In which ways? With what exceptions?

That’s what this post is about.

The Story So Far

For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and the inability to sell it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including an NDA preventing anyone from revealing the existence of these clauses.

No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and...

It seems increasingly plausible that it would be in the public interest to ban non-disparagement clauses more generally going forward, or at least set limits on their scope and length (although I think nullifying existing contracts is bad, and the government should not do that, nor should it have done so for non-competes).

I concur.

It should be noted, though: we can spend all day taking apart these contracts and applying pressure publicly, but real change will have to come from the courts. I await an official judgment to see the direction of this issue. Arguab... (read more)

Thomas Kwa
Does this work? Sounds like a good idea.
oumuamua
While I am not a lawyer, it appears that this concept might indeed hold some merit. A similar strategy is currently being employed by organizations focused on civil rights, known as a “warrant canary”. Essentially, it’s a method by which a communications service provider aims to implicitly inform its users that the provider has been served with a government subpoena, despite legal prohibitions on revealing the existence of the subpoena. The idea behind it is that there are very strong protections against compelled speech, especially against compelled untrue speech (e.g. updating the canary despite having received a subpoena). The Electronic Frontier Foundation (EFF) seems to believe that warrant canaries are legal.
Zvi
I think it works, yes. Indeed I have a canary on my Substack About page to this effect.

I would like to pose a set of broad questions about a project called Beat AI: A contest using philosophical concepts (details below) with the LessWrong community. My hope would be that we have a thoughtful and critical discussion about it. (To be clear, I'm not endorsing it; I have concerns, but I don't want to jump to conclusions.)

Some possible topics for discussion might include:

  • Do you know the project or its founder(s)? How and to what extent are they thinking about AI safety, if at all?

  • If some people decide here that the project seems risky or misguided, do we want to organize our thinking and possibly draft a letter to the project?

  • Have you seen projects like the one below where a community is invited to compete against

...
Answer by PhilosophicalSoul

'...and give us a license to use and distribute your submissions.'

For how many generations would humans be able to outwit the AI until it outwits us?
