They didn't test against actual natural-language pretraining; they only tested against a random init.
- A: Pre-trained on their synthetic LSTM data -> fine-tuned on Wikipedia
- B: Pre-trained on a different natural-language corpus -> fine-tuned on Wikipedia
- C: Random initialization -> fine-tuned on Wikipedia
They only test A vs C, not A vs B.
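Concretely, the comparison I'd want looks something like this (just a sketch of the experimental design, not the paper's code; the GPT-2 config, checkpoint paths, and the fine-tune/eval helpers are all placeholders I made up):

```python
from transformers import AutoConfig, AutoModelForCausalLM

def fine_tune_on_wikipedia(model): ...   # placeholder for the paper's fine-tuning recipe
def eval_downstream(model): ...          # placeholder for the paper's evaluation

config = AutoConfig.from_pretrained("gpt2")  # same architecture in every arm (GPT-2 is just an example)

arms = {
    "A_synthetic_lstm": AutoModelForCausalLM.from_pretrained("ckpts/lstm-synthetic"),  # hypothetical checkpoint
    "B_other_natural":  AutoModelForCausalLM.from_pretrained("ckpts/other-corpus"),    # hypothetical checkpoint
    "C_random_init":    AutoModelForCausalLM.from_config(config),
}

for name, model in arms.items():
    fine_tune_on_wikipedia(model)        # identical recipe for all three arms
    print(name, eval_downstream(model))
```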
It's also not obvious how to generate this kind of good synthetic data when it has to be fed to a tokenized model.
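To make that concrete, the most naive thing I can imagine (purely my guess, not what the paper does) is sampling token-ID sequences from a randomly initialized LSTM over the target model's vocabulary, and it's not clear to me why sequences like that would be "good" pretraining data:

```python
import torch
import torch.nn as nn

vocab_size, hidden, seq_len = 50257, 256, 128   # GPT-2-sized vocab, just as an example

embed = nn.Embedding(vocab_size, hidden)        # random, untrained LSTM "data generator"
lstm = nn.LSTM(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(vocab_size, (1, 1))      # random start token
state, out = None, []
with torch.no_grad():
    for _ in range(seq_len):
        h, state = lstm(embed(tokens), state)
        probs = torch.softmax(head(h[:, -1]), dim=-1)
        tokens = torch.multinomial(probs, 1)    # sample the next synthetic token ID
        out.append(tokens.item())

print(out[:10])  # a synthetic "pretraining" sequence of token IDs
```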
It’s good that they compare various model sizes, evaluation tasks, and random data generators. I just think the paper would prove its point more effectively if it showed that models of the same size that see this random data can later learn better from the evaluation data.
You could even compare the model's initial checkpoint from before universal pretraining against the pretrained checkpoint, fine-tuning both the same way. If the method works, the one that did UP should win.
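Something like this (again just a sketch with made-up helpers, not their code):

```python
import copy
import torch
from transformers import AutoConfig, AutoModelForCausalLM

def universal_pretrain(model): ...        # placeholder: the paper's UP stage on synthetic data
def fine_tune_on_wikipedia(model): ...    # placeholder: identical fine-tuning recipe for both arms
def eval_downstream(model): ...           # placeholder: the paper's evaluation

config = AutoConfig.from_pretrained("gpt2")           # stand-in architecture
init_model = AutoModelForCausalLM.from_config(config)
torch.save(init_model.state_dict(), "init.pt")        # keep the pre-UP checkpoint

up_model = copy.deepcopy(init_model)                  # same initialization for both arms
universal_pretrain(up_model)

for name, model in {"no_UP": init_model, "UP": up_model}.items():
    fine_tune_on_wikipedia(model)
    print(name, eval_downstream(model))
```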
Maybe I’m way off; I’ll admit I’ve only skimmed it so far. Seems promising, I’m just wishing for better controls.