In this note, we describe an approach for turning the bottom $n$ layers of an autoregressive model into a “scribe model”.
TODO add method description
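Since the method description is still TODO, the sketch below is purely illustrative: it only shows how one might slice out the bottom four transformer blocks of pythia-410m (matching the run in the Models table below) with Hugging Face transformers. None of the names here come from the actual training code.

```python
# Illustrative only: the scribe-model construction itself is still TODO above.
# This just pulls out the bottom NUM_SCRIBE_LAYERS transformer blocks of
# pythia-410m; it is not the training code for the run reported below.
from transformers import AutoModelForCausalLM

NUM_SCRIBE_LAYERS = 4  # matches "Num scribe layers" in the Models table

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")

# Pythia is a GPT-NeoX model, so its transformer blocks live at
# model.gpt_neox.layers (an nn.ModuleList ordered bottom to top).
scribe_blocks = model.gpt_neox.layers[:NUM_SCRIBE_LAYERS]
print(f"scribe blocks: {len(scribe_blocks)} / {len(model.gpt_neox.layers)} total")
```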
We evaluate generative perplexity by running the model's generations through gpt2-large
and computing their perplexity under that model.
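As a concrete reference, here is a minimal sketch of that evaluation, assuming the generations are already available as plain strings. The function name and one-sample-at-a-time scoring are my own choices, not the actual eval script.

```python
# Sketch: score generated text under gpt2-large and report mean perplexity.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
scorer = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()

@torch.no_grad()
def generative_perplexity(generations: list[str]) -> float:
    """Mean token-level perplexity of the generations under gpt2-large."""
    total_nll, total_tokens = 0.0, 0
    for text in generations:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        if ids.shape[1] < 2:
            continue  # nothing to predict for single-token samples
        # With labels=ids the model shifts internally and returns the mean
        # NLL over the seq_len - 1 predicted tokens.
        loss = scorer(input_ids=ids, labels=ids).loss
        n_pred = ids.shape[1] - 1
        total_nll += loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```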
Results below are for the run: hazy-research/olive/wb5fnacm
There are two surprising things from these results:
Both of these seem “too good to be true.”
Why might we be observing this? My best guess is that the evaluation is simply too easy; we probably need a harder evaluation.
Models
Run ID | Size | Steps | Pretrained from | Num scribe layers |
---|---|---|---|---|
hazy-research/olive/wb5fnacm | 410m | 40k | pythia-410m | 4 |