How does misalignment scale with model intelligence and task complexity?

(alignment.anthropic.com)

72 points | by salkahfi 1 hour ago

10 comments

jmtulloss 45 minutes ago
The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, it's a valuable piece of writing.
- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
CuriouslyC 1 hour ago
This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
[-]
- energy123 43 minutes ago
  Incoherence is not error.
  You can have a vanishingly small error and an incoherence at its max.
  That would be evidence of perfect alignment (zero bias) and very low variance.
- xanderlewis 1 hour ago
  What do 'domain valleys' and 'tunneling' mean in this context?
  [-]
  - esyir 50 minutes ago
    Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.
    [-]
- p-e-w 7 minutes ago
  > When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
  Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.
gopalv 1 hour ago
> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.
Coherence requires 2 opposing forces to hold coherence in one dimension and at least 3 of them in higher dimensions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.
So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.
We can escalate to higher authority and get out of that mess faster if we fail hard and early.
The knowledge of how exactly failure happened seems to be less useful to the higher reasoning model over the action biased models.
Splitting up the tactical and strategic sides of the problem, seems to work similarly to how Generals don't hold guns in a war.
[1] - https://arxiv.org/abs/2601.14351
smy20011 56 minutes ago
I think It's not because AI working on "misaligned" goals. The user never specify the goal clearly enough for AI system to work.
However, I think producing detailed enough specification requires same or even larger amount of work than writing code. We write rough specification and clarify these during the process of coding. I think there are minimal effort required to produce these specification, AI will not help you speed up these effort.
[-]
- crabmusket 49 minutes ago
  That makes me wonder about the "higher and higher-level language" escalator. When you're writing in assembly, is it more work to write the code than the spec? And the reverse is true if you can code up your system in Ruby? If so, does that imply anything about the "spec driven" workflow people are using with AIs? Are we right on the cusp where writing natural language specs and writing high level code are comparably productive?
  [-]
  - charcircuit 35 minutes ago
    If you are on the same wave length as someone you don't need to produce a full spec. You can trust that the other person has the same vision as you and will pick reasonable ways to implement things. This is one reason why personalized AI agents are important.
  - skydhash 7 minutes ago
    Programming languages can be a thinking tool for a lot of tasks. Very much like a lot of notation, like music sheet and map drawing. A condensed and somewhat formal manner of describing ideas can increase communication speed. It may lack nuance, but in some case, nuance is harmful.
    The nice thing about code compared to other notation is that it's useful on its. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two step of writing a spec and having an LLM translate it, then having to verify the output and alter it.
    Assembly and high level languages are equivalent in terms of semantics. The latter helps in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, struct and other data structures, ....) so that categories of problems are easily solved. There's no higher level of computing model unlocked. Just faster level of productivity unlocked by following proven patterns.
    Spec driven workflow is a mirage, because even the best specs will leave a lot of unspecified details. Which are crucial as most of programming is making the computer not do the various things it can do.
nayroclade 43 minutes ago
The models they tested are already way behind the current state-of-the-art. Would be interesting to see if their results hold up when repeated with the latest frontier models.
IgorPartola 1 hour ago
For some reason the article reads to me like “AI is not evil, it just has accidents when it loses coherence.” Sounds a lot like liability shifting.
cyanydeez 1 hour ago
Oh, the irony of thinking this refers to the investors and shell companies.
tsunamifury 1 hour ago
I don’t know why it seems so hard for these guys to understand you scorecard every step for new strategy to Close distance at goal and if you have multiple generated forward options with no good weight you spawn a new agent and multiple paths. Then you score all the terminal branches and prune.
LLMs aren’t constrained to linear logic like your average human.
throwpoaster 1 hour ago
Yudkowsky btfo.