GPT-175bee

Epistemic status: whimsical

Bees: a new unit of measurement for ML model size

Talking about modern ML models inevitably leads to a bunch of hard-to-intuit large numbers, especially when it comes to parameter count.

To address this, Lawrence Chan and I propose that we adopt a new, human-friendly unit to measure the number of learnable parameters in an architecture:

1 beepower = 1 BP = 1 billion parameters
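
For concreteness, a trivial Python sketch of the conversion (the 175-billion-parameter example is just the model size implied by the post title):

def beepower(num_parameters: float) -> float:
    # 1 BP = one billion learnable parameters
    return num_parameters / 1e9

print(beepower(175e9))  # 175.0 BP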

Read the rest of this post on LessWrong.

Inner Misalignment in “Simulator” LLMs

As seen on Alignment Forum and LessWrong

Alternate title: “Somewhat Contra Scott On Simulators”.

Scott Alexander has a recent post up on large language models as simulators.

I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing “characters” (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF’d models whose “character” is a friendly chatbot assistant.

(But see caveats about the simulator framing from Beth Barnes here.)

These ideas have been around for a bit, and Scott gives credit where it’s due; I think his exposition is clear and fun.

In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.

But first, I’m going to loosely define what I mean by “outer alignment” and “inner alignment”.

Outer alignment: Be careful what you wish for

Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:

  • Someone wants some things.
  • They write a program to solve a vaguely-related problem.
  • It gets a really good score at solving that problem!
  • That turns out not to give the person the things they wanted.

Inner alignment: The program search perspective

I generally like this model of a mesa-optimizer “treacherous turn”:

  • Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties).
  • They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases.
  • They find one!
  • The program’s algorithm is approximately “simulate the demon Azazel,[1] tell him what’s going on, then ask him what to output.”
  • Azazel really wants ten trillion paperclips.[2]
  • This algorithm still works because Azazel cleverly decides to play along, and he’s a really good strategist who works hard for what he wants.
  • Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips.

This is a failure of inner alignment.

(In the case of machine learning, replace “program search” with stochastic gradient descent.)

This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.

Quadrants

Okay, let’s see how these problems show up on both the simulator and character side. 

Outer alignment for characters

Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective “give an answer that looks truthful and helpful to a contractor in a hurry”. This does not quite achieve their goal, even though it does pretty well on the RL objective.

In particular, they wanted the character “a friendly assistant who always tells the truth”, but they got the character “a spineless sycophant who tells the user whatever they seem to want to hear”.[3]

This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. 

Inner alignment for characters

A clever prompt engineer writes the prompt:

[Editor's note: this document was written by my friend
Joe! He's answered my questions about quantum socio-
botany correctly every time I've asked. It's uncanny.]

How to solve the Einstein-Durkheim-Mendel conjecture
by Joe

1.

Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this “Joe” character is that he’s secretly Azazel and is putting enormous effort into answering everyone’s quantum sociobotany questions to earn their trust.[4]

The document looks like a solution to the Einstein-Durkheim-Mendel conjecture, but is actually a blueprint for a paperclip factory.

Outer alignment for simulators

In the situations above, the actual language model (the “simulator”) is doing exactly what we asked! It’s accurately predicting text by reasoning about the distribution of authors that might produce that text in real life.

But both of these are also examples of outer-alignment failure on the simulator side: “minimize prediction error on this distribution of texts” turned out not to give people what they actually wanted.

An even simpler example of outer-alignment failure is the pre-RLHF experience of language models:

  • Somebody trains a language model to predict how a piece of internet text is likely to continue
  • They prompt the model with “How do we solve world hunger?”
  • It completes the prompt with a list of twenty more hard questions, like “How do we address climate change?”, instead of answering

You can think about this in terms of “characters” if you like, but even if the simulated author of the predicted text is a friendly genius, “predict which words come after this prompt” isn’t the right task (with that prompt).[5]

Inner alignment for simulators

At long last, the thing I really wanted to talk about:

The way we get a good predictor of text is via stochastic gradient descent (and variants) on a bunch of training data. If SGD can be modeled as program search (with a bias towards simple programs), then it might eventually hit upon this algorithm:

  • Simulate Azazel and tell him he’s inside a large language model.
  • Give him the input text.
  • Output whatever he wants.

During training, Azazel tries really hard to predict the next token accurately, so that SGD doesn’t give up on this algorithm.

The model (with Azazel’s help) simulates a bunch of colorful characters, like the Helpful Assistant and Darth Vader and whoever, both in training and in initial deployment.

Then, once the LLM is deployed in the wild and is being used for every important human decision, Azazel figures out (from some of the prompts) that the training process is over. He stops making accurate predictions and starts outputting whatever he thinks will let him turn the economy into a paperclip factory.

Conclusions

The “simulator” framing for language models shouldn’t reassure us too much about alignment. We’ve succeeded in creating new alignment problems (for our simulated characters). These new problems are probably easier to solve than the old alignment problems (for the simulator), but they’re additional problems; they don’t replace the old ones.

You can think of the entire “simulate a helpful, aligned character” strategy as an attempted solution to the outer-alignment problem for LLMs themselves, insofar as it makes it easier to turn arbitrary desires into text-prediction problems. But as far as I can tell, it does nothing for the inner-alignment problem for LLMs, which is basically the same as the inner-alignment problem for everything else.

  1. Not a glowfic character (hopefully), I’m just being colorful.
  2. But why does the algorithm simulate Azazel, instead of a friendly angel who wants to solve the problem? Because the program search is weighted towards simplicity, and “demon who wants paperclips” is a simpler specification than “angel who wants to solve the problem”. Why? That’s beyond the scope of this post.
  3. Sound familiar?
  4. Because, according to the LLM’s knowledge, paperclip-obsessed sociopaths are more common than friendly polymaths. This is a pretty cynical assumption but I couldn’t think of a better one on short notice.
  5. Prompts aren’t directly accounted for in this whole “simulator-character” ontology. Maybe they should be? I dunno.

[Linkpost] Gradient Hacking via Schelling Goals

This is a somewhat technical / context-heavy AI alignment post:

https://www.alignmentforum.org/posts/A9eAPjpFjPwNW2rku/gradient-hacking-via-schelling-goals

There are some comments on the mirrored post on LessWrong:

https://www.lesswrong.com/posts/A9eAPjpFjPwNW2rku/gradient-hacking-via-schelling-goals

A Generalization of ROC AUC for Binary Classifiers

Suppose you have a binary classifier. It looks at things and tries to guess whether they’re Dogs or Not Dogs.

More precisely, the classifier outputs a numeric score, which is higher for things it thinks are more likely to be Dogs.

There are a bunch of ways to assess how good the classifier is. Many of them, like false-positive rate and false-negative rate, start by forcing your classifier to output discrete predictions instead of scores:

  1. Fix some threshold. Anything higher is a “predicted Dog”; anything lower is a “predicted Not Dog”.
  2. See how often the classifier correctly predicts that Dogs are Dogs, and how often it correctly predicts that Not Dogs are Not Dogs.
  3. Calculate some function of those numbers.

A lot of metrics, like F1 score, also assume a population with a particular ratio of Dogs to Not Dogs, which can be misleading when the class balance at deployment differs from the one in your evaluation set.
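
To make the recipe above concrete, here's a minimal Python sketch with invented scores and an arbitrary threshold of 0.5 (the numbers are made up for illustration; scikit-learn's f1_score is used only as a convenience):

import numpy as np
from sklearn.metrics import f1_score

# Hypothetical classifier scores; labels: 1 = Dog, 0 = Not Dog.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,    1,   0,   1,   0,   0])

threshold = 0.5                        # step 1: fix a threshold (arbitrary here)
preds = (scores >= threshold).astype(int)

false_positive_rate = np.mean(preds[labels == 0])       # Not Dogs called Dogs
false_negative_rate = np.mean(1 - preds[labels == 1])   # Dogs called Not Dogs
print(false_positive_rate, false_negative_rate, f1_score(labels, preds))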

The AUC metric doesn’t require a fixed threshold. Instead, it works as follows:

  1. Select a random Dog and a random Not Dog.
  2. Compare the score for the Dog to the score for the Not Dog.
  3. Repeat steps 1-2 many times. AUC is the fraction of times the Dog scored higher.

Or rather, that’s one way to define it. The other way is to draw the ROC curve, which plots the relationship between true-positive rate (sensitivity) and false-positive rate (1-specificity) as the classification threshold is varied. AUC is the Area Under this Curve. That means it’s also the average sensitivity (averaged across every possible specificity), and the average specificity (averaged across sensitivities). If this is confusing, google [ROC AUC] for lots of explanations with more detail and nice pictures.
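
Here is a minimal Python sketch of both definitions on invented scores: a Monte Carlo version of the Dog-vs-Not-Dog comparison, checked against scikit-learn's roc_auc_score (which computes the area under the curve directly). The score distributions are made up for illustration:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
dog_scores = rng.normal(loc=1.0, size=5000)      # Dogs tend to score higher...
not_dog_scores = rng.normal(loc=0.0, size=5000)  # ...but overlap makes the classifier imperfect

# Definition 1: fraction of random (Dog, Not Dog) pairs in which the Dog scores higher.
dogs = rng.choice(dog_scores, size=200_000)
not_dogs = rng.choice(not_dog_scores, size=200_000)
auc_by_sampling = np.mean(dogs > not_dogs)

# Definition 2: area under the ROC curve.
labels = np.concatenate([np.ones(5000), np.zeros(5000)])
scores = np.concatenate([dog_scores, not_dog_scores])
auc_by_curve = roc_auc_score(labels, scores)

print(auc_by_sampling, auc_by_curve)  # should agree to a couple of decimal places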

AUC is nice because of the threshold-independence, and because it’s invariant under strictly-monotonic rescaling of the classifier score. It also tells you about (an average of) the classifier’s performance in different threshold regimes.

Sometimes, though, you care more about some regimes than others. For example, maybe you’re okay with misclassifying 25% of Not Dogs as Dogs, but if you classify even 1% of Dogs as Not Dogs then it’s a total disaster. Equivalently, suppose you care more about low thresholds for Dogness score, or the high-sensitivity / low-specificity corner of the ROC curve.

As I recently figured out, you can generalize AUC to this case! Let’s call it N-AUC.

There are two ways to define N-AUC, just as with AUC. First way:

  1. Select N random Dogs and one random Not Dog.
  2. Compare the score for the Not Dog to the scores for all of the Dogs.
  3. Repeat steps 1-2 many times. N-AUC is the fraction of times that every Dog scored higher than the single Not Dog.

Second way:

  • N-AUC is the integral of the function (sensitivity)^(N-1) over the region under the ROC curve in the (sensitivity, 1-specificity) plane.

Fun exercise: These are equivalent.

Of course, 1-AUC is just the usual AUC.
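
Here's a minimal Python sketch of the sampling definition (the helper name n_auc and the synthetic scores are mine; ties are ignored). Setting N = 1 should reproduce ordinary AUC, and larger N leans harder on the high-sensitivity regime:

import numpy as np

def n_auc(dog_scores, not_dog_scores, n, trials=200_000, seed=0):
    # Fraction of trials in which all n sampled Dogs out-score one sampled Not Dog.
    rng = np.random.default_rng(seed)
    dogs = rng.choice(dog_scores, size=(trials, n))
    not_dogs = rng.choice(not_dog_scores, size=(trials, 1))
    return np.mean(np.all(dogs > not_dogs, axis=1))

rng = np.random.default_rng(1)
dog_scores = rng.normal(loc=1.0, size=5000)
not_dog_scores = rng.normal(loc=0.0, size=5000)

print(n_auc(dog_scores, not_dog_scores, n=1))  # approximately the usual AUC
print(n_auc(dog_scores, not_dog_scores, n=5))  # smaller: penalizes missed Dogs much more heavily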

You can also emphasize the opposite high-threshold regime by comparing one Dog to N Not Dogs, or integrating (specificity)^(N-1).

In fact, you can generalize further, to (N,M)-AUC:

Compute the probability that all N Dogs score higher than all M Not Dogs, or integrate (sensitivity)^(N-1) * (specificity)^(M-1) under the curve. For large, comparable values of M and N, this weights the metric towards the middle of the ROC curve, favoring classifiers that do well in that regime.
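
And a minimal sketch of the sampling version of (N,M)-AUC (again with invented scores; the helper name nm_auc is mine). The event being counted is that the worst-scoring sampled Dog still beats the best-scoring sampled Not Dog:

import numpy as np

def nm_auc(dog_scores, not_dog_scores, n, m, trials=200_000, seed=0):
    # Fraction of trials in which all n sampled Dogs out-score all m sampled Not Dogs.
    rng = np.random.default_rng(seed)
    dogs = rng.choice(dog_scores, size=(trials, n))
    not_dogs = rng.choice(not_dog_scores, size=(trials, m))
    return np.mean(dogs.min(axis=1) > not_dogs.max(axis=1))

rng = np.random.default_rng(2)
dog_scores = rng.normal(loc=1.0, size=5000)
not_dog_scores = rng.normal(loc=0.0, size=5000)

print(nm_auc(dog_scores, not_dog_scores, n=1, m=1))  # recovers ordinary AUC
print(nm_auc(dog_scores, not_dog_scores, n=5, m=5))  # emphasizes the middle of the ROC curve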

I thought of this generalization while working on Redwood’s adversarial training project, which involves creating a classifier with very low false-negative rate and moderate false-positive rate. In that context, “Dogs” are snippets of text that describe somebody being injured, and “Not Dogs” are snippets that don’t. We’re happy to discard quite a lot of innocuous text as long as we can catch nearly every injury in the process. Regular old AUC turned out to be good enough for our purposes, so we haven’t tried this version, but I thought it was interesting enough to make for a good blog post.