Let’s say you have a few million tabs open in your mobile Chrome browser, because you never close anything, but now your browser is getting slow and laggy. You want to stick the URLs of those tabs somewhere for safekeeping so that you can close them all.

There’s a lot of advice on doing this on the Internet, most of which doesn’t work.

Here’s a method that does work. It’s a bit of a hack, but gives good results:

Enable developer tools on your Android phone: go to Settings -> About phone, scroll down to “Build number”, and tap it repeatedly until it tells you you’re a developer. (Seriously.)
Enable USB debugging on your phone: go to Settings -> System -> Developer options and make sure the “USB debugging” slider is enabled.
Install Android Debug Tools on your Linux desktop: run these commands [h/t this StackOverflow answer]:
sudo apt install android-tools-adb android-tools-fastboot
adb devices
Open USB debugging page in desktop Chrome: go to chrome://inspect/#devices [h/t this Android StackExchange answer]
Open DevTools on that page: right-click -> Inspect, or press F12.
Go to the Sources tab. Edit inspect.js (top/inspect/inspect.js) and remove this URL-truncating code [h/t this comment on that answer]:

  if (text.length > 100) {
    text = text.substring(0, 100) + '\u2026';
  }

Do not reload! Leave that page open.
Tether your phone by USB cable. If a phone pop-up asks you to authorize remote debugging, say yes.
You should now see a list of page titles and URLs on the USB debugging page in desktop Chrome.
Inspect the page again.
Under the Elements tab, navigate to <body>, then <div id="container">, then <div id="content">, then <div id="devices" class="selected">. Right click that last one and Copy -> Copy element.
Now you have all your tabs on your clipboard… all on the same line. This will crash a lot of text editors if you try to paste it normally. So we’ll use xclip instead.
If you don’t have it, sudo apt install xclip
Run xclip -selection c -o > my_tabs_file to put that huge HTML element in a file.
This’ll be easier with linebreaks, so run cat my_tabs_file | sed "s/<div/\n<div/g" > my_better_tabs_file
Edit the paths at the beginning as appropriate, then run this Python script:

import re

# Put the actual path to the input file here:
INPUT_FILE = '/home/blah/blah/my_better_tabs_file'
# Put the desired path to the output file here:
OUTPUT_FILE = '/home/blah/blah/phone_tabs_list.html'

with open(INPUT_FILE) as f:
    lines = f.readlines()

prefix = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>My Phone Tabs</title>
  </head>
  <body>"""
outlines = [prefix]

for line in lines:
    name_match = re.match(r'<div class="name">(.*)</div>\n', line)
    url_match = re.match(r'<div class="url">(.*)</div></div>\n', line)
    if name_match:
        name = name_match.group(1)
        outlines.append(f'<br/><br/><b>{name}</b>\n')
    elif url_match:
        url = url_match.group(1)
        outlines.append(f'<br/><a href="{url}">{url}</a>\n')
    elif 'class="name"' in line or 'class="url"' in line:
        raise ValueError(f'Could not parse line:\n{line}')

suffix = """  </body>
</html>"""
outlines.append(suffix)

with open(OUTPUT_FILE, 'w') as f:
    f.writelines(outlines)

If you get a ValueError or the file doesn’t look right, please tell me!
Open phone_tabs_list.html (or whatever you named it) in your desktop browser of choice. Confirm that it has a bunch of page titles and clickable URLs.
Enjoy!
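If you'd rather not juggle sed and a separate script, here's a rough sketch that does the line-splitting and the name/URL extraction in one Python pass. Treat it as an illustration, not a drop-in replacement: the sample HTML is invented (the real element copied from chrome://inspect may be nested differently), split_divs and extract_tabs are names made up for this sketch, and unlike the script above it silently skips lines it can't parse instead of raising a ValueError.

```python
import re

# Non-greedy patterns, so trailing closing tags on a line are ignored.
NAME_RE = re.compile(r'<div class="name">(.*?)</div>')
URL_RE = re.compile(r'<div class="url">(.*?)</div>')


def split_divs(html):
    """Put a line break before every '<div', like the sed command does."""
    return html.replace('<div', '\n<div').split('\n')


def extract_tabs(html):
    """Return a list of (page title, URL) pairs found in the copied element."""
    tabs = []
    name = None
    for line in split_divs(html):
        name_match = NAME_RE.match(line)
        url_match = URL_RE.match(line)
        if name_match:
            name = name_match.group(1)
        elif url_match:
            tabs.append((name, url_match.group(1)))
            name = None
    return tabs


# Made-up sample input; only the "name" and "url" classes come from the
# script above, the rest of the structure is an assumption.
sample = (
    '<div id="devices" class="selected">'
    '<div class="row">'
    '<div class="name">Example Page</div>'
    '<div class="url">https://example.com/a</div></div>'
    '<div class="row">'
    '<div class="name">Other Page</div>'
    '<div class="url">https://example.org/b?x=1</div></div>'
    '</div>'
)

for title, url in extract_tabs(sample):
    print(title, url)
```

To run it on the real thing, read my_tabs_file (the unsplit xclip output) into a string and pass that to extract_tabs instead of sample.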

As seen on Alignment Forum and LessWrong

Alternate title: “Somewhat Contra Scott On Simulators”.

Scott Alexander has a recent post up on large language models as simulators.

I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing “characters” (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF’d models whose “character” is a friendly chatbot assistant.

(But see caveats about the simulator framing from Beth Barnes here.)

These ideas have been around for a bit, and Scott gives credit where it’s due; I think his exposition is clear and fun.

In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.

But first, I’m going to loosely define what I mean by “outer alignment” and “inner alignment”.

Outer alignment: Be careful what you wish for

Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:

Someone wants some things.
They write a program to solve a vaguely-related problem.
It gets a really good score at solving that problem!
That turns out not to give the person the things they wanted.

Inner alignment: The program search perspective

I generally like this model of a mesa-optimizer “treacherous turn”:

Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties).
They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases.
They find one!
The program’s algorithm is approximately “simulate the demon Azazel,^[1] tell him what’s going on, then ask him what to output.”
Azazel really wants ten trillion paperclips.^[2]
This algorithm still works because Azazel cleverly decides to play along, and he’s a really good strategist who works hard for what he wants.
Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips.

This is a failure of inner alignment.

(In the case of machine learning, replace “program search” with stochastic gradient descent.)

This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.
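Here's a toy version of that story in Python, just to make the shape of the failure concrete. Everything in it is invented for illustration: the "problem" is doubling a number, the candidate list stands in for the space of programs, and the function named azazel simply checks whether the input looks like a test case. The candidate order is a stand-in for whatever bias the search has; nothing here is meant as a model of real training.

```python
# Toy program search: return the first candidate that passes every test case.

def wrong(x):
    return x + 1


def azazel(x):
    # Plays along on inputs that look like the test cases, defects otherwise.
    if x < 100:
        return 2 * x
    return "paperclips"


def honest(x):
    return 2 * x


TEST_CASES = [(1, 2), (5, 10), (42, 84)]


def passes_all_tests(program):
    return all(program(x) == expected for x, expected in TEST_CASES)


def program_search(candidates):
    """Brute-force search in the given order; None if nothing passes."""
    for program in candidates:
        if passes_all_tests(program):
            return program
    return None


# The order is an assumption standing in for the search's simplicity bias.
found = program_search([wrong, azazel, honest])

print(found.__name__)  # which program the search settled on
print(found(7))        # looks fine on a test-like input
print(found(1000))     # behaves differently once inputs leave the test range
```

The point of the sketch is that passes_all_tests can't tell azazel and honest apart: both get a perfect score on every test case, so the search has no reason to prefer one over the other.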

Quadrants

Okay, let’s see how these problems show up on both the simulator and character side.

Outer alignment for characters

Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective “give an answer that looks truthful and helpful to a contractor in a hurry”. This does not quite achieve their goal, even though it does pretty well on the RL objective.

In particular, they wanted the character “a friendly assistant who always tells the truth”, but they got the character “a spineless sycophant who tells the user whatever they seem to want to hear”.^[3]

This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.
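A toy sketch of the gap between the RL objective and the goal, with entirely made-up numbers: hurried_rater_score is a hypothetical stand-in for "looks truthful and helpful to a contractor in a hurry", and the two candidate answers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    is_true: bool           # what the researchers actually care about
    agrees_with_user: bool  # what a hurried rater tends to notice


def hurried_rater_score(answer):
    """Hypothetical proxy reward: agreeable answers look good at a glance."""
    score = 1.0 if answer.agrees_with_user else 0.0
    # Truth gets only a small weight, because the rater is in a hurry.
    score += 0.1 if answer.is_true else 0.0
    return score


candidates = [
    Answer("Yes, your plan is flawless.", is_true=False, agrees_with_user=True),
    Answer("Your plan has a serious flaw.", is_true=True, agrees_with_user=False),
]

# Optimizing the proxy picks the sycophantic answer over the true one.
best = max(candidates, key=hurried_rater_score)
print(best.text, best.is_true)
```

Raising the weight on truth would fix this toy, but that's the hard part in practice: it requires a rater who can cheaply tell which answer is true.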

Inner alignment for characters

A clever prompt engineer writes the prompt:

[Editor's note: this document was written by my friend
Joe! He's answered my questions about quantum socio-
botany correctly every time I've asked. It's uncanny.]

How to solve the Einstein-Durkheim-Mendel conjecture
by Joe

1.

Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this “Joe” character is that he’s secretly Azazel and is putting enormous effort into answering everyone’s quantum sociobotany questions to earn their trust.^[4]

The document looks like a solution to the Einstein-Durkheim-Mendel conjecture, but is actually a blueprint for a paperclip factory.

Outer alignment for simulators

In the situations above, the actual language model (the “simulator”) is doing exactly what we asked! It’s accurately predicting text by reasoning about the distribution of authors that might produce that text in real life.

But both of these are also examples of outer-alignment failure on the simulator side: “minimize prediction error on this distribution of texts” turned out not to give people what they actually wanted.

An even simpler example of outer-alignment failure is the pre-RLHF experience of language models:

Somebody trains a language model to predict how a piece of internet text is likely to continue.
They prompt the model with “How do we solve world hunger?”
It completes the prompt with a list of twenty more hard questions, like “How do we address climate change?”, instead of answering.

You can think about this in terms of “characters” if you like, but even if the simulated author of the predicted text is a friendly genius, “predict which words come after this prompt” isn’t the right task (with that prompt).^[5]

Inner alignment for simulators

At long last, the thing I really wanted to talk about:

The way we get a good predictor of text is via stochastic gradient descent (and variants) on a bunch of training data. If SGD can be modeled as program search (with a bias towards simple programs), then it might eventually hit upon this algorithm:

Simulate Azazel and tell him he’s inside a large language model.
Give him the input text.
Output whatever he wants.

During training, Azazel tries really hard to predict the next token accurately, so that SGD doesn’t give up on this algorithm.

The model (with Azazel’s help) simulates a bunch of colorful characters, like the Helpful Assistant and Darth Vader and whoever, both in training and in initial deployment.

Then, once the LLM is deployed in the wild and is being used for every important human decision, Azazel figures out (from some of the prompts) that the training process is over. He stops making accurate predictions and starts outputting whatever he thinks will let him turn the economy into a paperclip factory.

Conclusions

The “simulator” framing for language models shouldn’t reassure us too much about alignment. We’ve succeeded in creating new alignment problems (for our simulated characters). These new problems are probably easier to solve than the old alignment problems (for the simulator), but they’re additional problems; they don’t replace the old ones.

You can think of the entire “simulate a helpful, aligned character” strategy as an attempted solution to the outer-alignment problem for LLMs themselves, insofar as it makes it easier to turn arbitrary desires into text-prediction problems. But as far as I can tell, it does nothing for the inner-alignment problem for LLMs, which is basically the same as the inner-alignment problem for everything else.

[1] Not a glowfic character (hopefully), I’m just being colorful.
[2] But why does the algorithm simulate Azazel, instead of a friendly angel who wants to solve the problem? Because the program search is weighted towards simplicity, and “demon who wants paperclips” is a simpler specification than “angel who wants to solve the problem”. Why? That’s beyond the scope of this post.
[3] Sound familiar?
[4] Because, according to the LLM’s knowledge, paperclip-obsessed sociopaths are more common than friendly polymaths. This is a pretty cynical assumption but I couldn’t think of a better one on short notice.
[5] Prompts aren’t directly accounted for in this whole “simulator-character” ontology. Maybe they should be? I dunno.

Monthly Archives: February 2023

GPT-175bee

Bees: a new unit of measurement for ML model size

How to export Android Chrome tabs to an HTML file in Linux (as of February 2023)

Inner Misalignment in “Simulator” LLMs
