In this note, we describe an approach for turning the bottom $n$ layers of an autoregressive model into a “scribe model”.

TODO add method description

Evaluating generative perplexity

We evaluate generative perplexity by sampling generations from the model and scoring them with gpt2-large, i.e., computing gpt2-large's perplexity on the generated text.
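For reference, here is a minimal sketch of how this kind of scoring can be done with HuggingFace transformers. This is illustrative only (the function name and batching choices are mine), not the exact evaluation script used for these runs.

```python
# Sketch: score generated text with gpt2-large and report perplexity.
# Assumes HuggingFace transformers; `generative_perplexity` is a hypothetical helper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
scorer = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")

@torch.no_grad()
def generative_perplexity(generations: list[str]) -> float:
    """Average per-token NLL of the generations under gpt2-large, exponentiated."""
    total_nll, total_tokens = 0.0, 0
    for text in generations:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        if ids.shape[1] < 2:
            continue  # need at least two tokens for a next-token loss
        # Passing labels=ids makes the model return the mean next-token cross-entropy.
        loss = scorer(ids, labels=ids).loss
        n = ids.shape[1] - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```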

Results below are for the run: hazy-research/olive/wb5fnacm

[Figures: generative perplexity results for run hazy-research/olive/wb5fnacm]

There are two surprising things about these results:

  1. The line is flat. Performance doesn't improve as we use less of the scribe model (and more of the full model). This is odd because it suggests there is no downside to using the small model, which I'm skeptical of.
  2. We're close to baseline. The perplexity is essentially the same as the baseline model's.

Both of these seem "too good to be true."

Why might we be observing this? My best guess is that the evaluation itself is too weak to distinguish the models. We probably need a harder evaluation.

Models

| Run ID | Size | Steps | Pretrained from | Num scribe layers |
| --- | --- | --- | --- | --- |
| hazy-research/olive/wb5fnacm | 410m | 40k | pythia-410m | 4 |