
Talkie: a 13B vintage language model from 1930


Nick Levine, David Duvenaud, Alec Radford

GitHub | Hugging Face

Claude chats with talkie, a 13B language model trained on pre-1931 text.

This is a 24/7 live feed of Claude Sonnet 4.6 prompting talkie-1930-13b-it in order to explore its knowledge, capabilities, and inclinations. talkie’s outputs reflect the culture and values of the texts it was trained on, not the views of its authors.

Have you ever daydreamed about talking to someone from the past? What would you ask someone with no knowledge of the modern world? What would they ask you? While we don’t have time machines yet, we can simulate this experience by training, in Owain Evans’s phrase, ‘vintage’ language models: LMs trained only on historical text.

These models are fascinating conversation partners (watch Claude prompt talkie, our 13B 1930 LM, in the widget above). But we are also excited by the possibility that the careful study of the behaviors and capabilities of vintage LMs will advance our understanding of AI in general.

Figure 1. In an early attempt to understand a vintage model’s anticipation of the future, we took nearly 5,000 historical event descriptions from the New York Times’s “On This Day” feature, calculated their surprisingness (measured as bits per byte of text) to our 13B model trained exclusively on pre-1931 text, and binned by decade.

For example, we can evaluate LMs’ ability to predict the future. Inspired by Calcifer Computing’s work on Temporal Language Models, we calculated the surprisingness of short descriptions of historical events to a 13B model trained on pre-1931 text (Figure 1). Surprisingness rises after the knowledge cutoff, most sharply for events in the 1950s and 1960s, and then plateaus. We will continue to develop evals to measure with greater confidence how forecasting performance improves with model size and decays at longer horizons. Training larger vintage language models will allow us to uncover these scaling trends.
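To make the Figure 1 measurement concrete, here is a minimal sketch of how surprisingness in bits per byte could be computed with a standard causal-LM stack. The Hugging Face model identifier and the example event string are assumptions for illustration, not artifacts we have released under those names.

```python
# Minimal sketch: score an event description by bits per byte under a causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-1930-13b-base"  # hypothetical model identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def bits_per_byte(text: str) -> float:
    """Total negative log-likelihood of `text` in bits, divided by its UTF-8 byte length."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1   # out.loss is averaged over predicted tokens
    total_nats = out.loss.item() * n_predicted
    return total_nats / math.log(2) / len(text.encode("utf-8"))

# Events scored this way can then be binned by decade, as in Figure 1.
print(bits_per_byte("1969: Apollo 11 lands the first humans on the Moon."))
```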

Figure 2. Patents and a paper published after talkie’s knowledge cutoff. Left to right: helicopter patent (Sikorsky, 1935), Turing machines paper (Turing, 1936), xerography patent (Carlson, 1942).

Similarly, we can test LMs’ abilities to come up with new ideas by seeing if they can arrive at inventions or scientific discoveries we know would arise after their knowledge cutoffs, such as those pictured in Figure 2. As Demis Hassabis has asked, could a model trained up to 1911 independently discover General Relativity, as Einstein did in 1915?

Figure 3. We gave a Python programming test (HumanEval) to a series of model pairs, each consisting of a vintage model (trained on pre-1931 text) and a modern model (trained on the web) with the same architecture. Left: The percentage of problems each model gets right at least once, given 100 chances and randomly chosen Python functions as in-context examples to learn from. Right: An example of a successful solution to a Python coding problem produced by a vintage language model. The model had access to several other in-context examples to learn from.

Contamination is a persistent problem in language model evaluation and causes us to overestimate LM capabilities. Vintage LMs are contamination-free by construction, enabling unique generalization experiments, like examining whether a model with no knowledge of digital computers can learn to code in a modern programming language. Figure 3 (left-hand side) shows an early example of such a test, measuring how well models trained on pre-1931 text can, when given a few demonstration examples of Python programs, write new correct programs. While vintage models dramatically underperform models trained on web data (which includes code), we’ve found that they are slowly but steadily improving at this task with scale.
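The left-hand panel of Figure 3 is, in effect, a pass@100 measurement. For readers who want to reproduce that kind of number, the sketch below uses the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); whether our evaluation used this exact estimator or simply counted problems with at least one correct sample is a detail not spelled out above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of which passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 100, pass@100 reduces to "at least one of the 100 samples was correct".
print(pass_at_k(n=100, c=0, k=100))  # 0.0
print(pass_at_k(n=100, c=3, k=100))  # 1.0
```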

There is still a long way to go before this capability is notable, however. All correct solutions generated by the vintage models are simple one-line programs (such as adding two inputs), or small modifications to in-context example programs. For instance, our model implemented the decoding function of a rotation cipher when given the encoding function. Although the solution (Figure 3, right-hand side) is only a single character edit (swapping an addition for a subtraction), this success suggests an understanding of inverse functions. We hope LMs with early knowledge cutoffs help the research community understand how well LMs can generalize beyond their pre-training data.
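To illustrate the kind of single-character edit described above, here is a reconstruction of a shift-cipher pair in the spirit of the HumanEval task; the exact prompt the model saw and its verbatim completion may differ from this sketch.

```python
def encode_shift(s: str) -> str:
    """In-context example: rotate each lowercase letter forward by 5 positions."""
    return "".join(chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s)

def decode_shift(s: str) -> str:
    """The inverse the model produced: identical except the + 5 becomes - 5."""
    return "".join(chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s)

assert decode_shift(encode_shift("wizardofoz")) == "wizardofoz"
```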

Vintage language models could also teach us about the impact of data diversity in AI development. While modern models vary in disposition, capability, and behavior, they are all closely related to one another by having been trained, whether directly or indirectly (via distillation and synthetic data), on the web. How does this shape and constrain what they are? How much of what we think we know about LMs is about human language and culture in general, or about this one dataset—the web—in particular? Training on different sources may lead to very different kinds of models being created. Studying the ways in which they are similar and different could improve our understanding of language model personas, behaviors, and dispositions.

We have been excited to see a proliferation of vintage LM projects, including Ranke-4B, Mr. Chatterbox, and Machina Mirabilis.

Alongside these efforts, we introduce talkie-1930-13b-base, a 13B language model trained on 260B tokens of historical pre-1931 English text. Additionally, we present a post-trained checkpoint turning our base model into a conversation partner without relying on modern chat transcripts or instruction-tuning data.

talkie is the largest vintage language model we are aware of, and we plan to continue scaling significantly. As a next step, we are training a GPT-3-level model, which we hope to release this summer. A preliminary estimate also suggests we can grow our corpus to well over a trillion tokens of historical text, which should be sufficient to create a GPT-3.5 level model—similar in capability to the original ChatGPT.

Figure 4. Evaluation accuracy vs. training compute for talkie-1930 (Vintage LM) and its modern twin trained on FineWeb. The vintage model underperforms the modern model on knowledge evals. Filtering out questions anachronistic from the perspective of 1930 roughly halves the performance gap between the vintage and modern models.

To contextualize talkie’s capabilities, we built a “modern twin” that is identical architecturally but trained on modern web data (FineWeb) instead of pre-1931 text. On average, talkie underperforms its modern counterpart in standard LM evaluations, even after correcting for question anachronism, despite being trained with the same number of FLOPs (see Figure 4). But we have been encouraged by its similar performance on core language understanding and numeracy tasks.

We suspect a combination of differences in data quality (poor optical character recognition) and corpus subject matter distribution explains why talkie-1930 underperforms on some benchmarks. To maximize the compute efficiency of future vintage language model training, we are developing a vintage optical character recognition (OCR) system to improve the quality of transcription of historical text.

Piggybacking off the invaluable work of organizations like the Institutional Data Initiative and the Internet Archive and efforts like Common Pile, we have collected hundreds of billions of pre-1931 English-language tokens. These include books, newspapers, periodicals, scientific journals, patents, and case law. We chose the end of 1930 as the cutoff date because that is when works enter the public domain in the United States. For this version of the model, we also limited ourselves to primarily English-language texts, because validating the data pipeline requires deep familiarity with source documents, and we are native English speakers. But multilingual corpus expansion is a high priority, both to increase the size of the corpus and the diversity of perspectives it represents.

Developing vintage language models presents unique challenges. Here, we discuss some of them in brief. We will follow up in greater detail in the coming months as we continue our research.

Figure 5. talkie-1930-13b’s knowledge of the Roosevelt presidency and New Deal is an example of imperfect filtering of the pre-training corpus.

The most important objective when training vintage language models is that no data leaks into the training corpus from after the intended knowledge cutoff (in our case, December 31st, 1930). There are several ways this can happen, such as including modern documents with faulty date metadata, or old documents with post hoc anachronistic insertions like editorial introductions or footnotes.

For talkie-1930, we developed a document-level n-gram-based anachronism classifier and used it to filter the pre-training corpus. However, this was not perfect. An earlier 7B version of talkie clearly knew about the Roosevelt presidency and New Deal legislation (Figure 5). talkie-1930-13b is additionally aware of some details related to World War II and the immediate postwar order (the United Nations and the division of Germany). For future versions of the model, we are developing new techniques for leakage detection and filtering using more advanced classifiers.
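We have not yet published the details of the anachronism classifier, but the sketch below shows the general shape of a document-level n-gram filter: score each document by occurrences of n-grams that should be vanishingly rare before the cutoff, and drop documents above a threshold. The phrase list and threshold are illustrative assumptions, not our production filter.

```python
import re

# Illustrative only: phrases that should not appear in genuine pre-1931 text.
ANACHRONISTIC_NGRAMS = {
    "new deal",
    "united nations",
    "world war ii",
    "second world war",
    "nuclear weapon",
}

def anachronism_score(document: str) -> int:
    """Count occurrences of known post-cutoff n-grams in a lowercased, normalized document."""
    text = re.sub(r"\s+", " ", document.lower())
    return sum(text.count(ngram) for ngram in ANACHRONISTIC_NGRAMS)

def keep_document(document: str, threshold: int = 1) -> bool:
    """Drop any document whose anachronism score reaches the threshold."""
    return anachronism_score(document) < threshold
```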

Figure 6. OCR errors reduce language model learning efficiency. Left: LMs trained on pre-1931 texts transcribed using conventional OCR systems show only 30% of the learning efficiency of a model trained on human-transcribed versions of the same texts. Regex cleaning of the OCR’d text recovers some performance. Right: Example of a messy machine transcription of The Wonderful Wizard of Oz (Baum, 1899).

Data quality is an important issue for all machine learning experiments. It is a special challenge when training vintage language models. Because there was no digital publishing in 1930, all text in our dataset had to be transcribed from a physical source, which introduces a form of noise not seen in natively digital text. While OCR was an early success story of machine learning and computer vision, the classic OCR systems often used to transcribe historical documents struggle on all but the simplest layouts and cleanest scans. Modern VLM-based systems have higher accuracy, but we have found they are prone to hallucinate modern facts into our corpus, poisoning the exercise.

In controlled experiments, we have found that LMs trained on pre-1931 texts transcribed using conventional OCR systems achieve, for a given amount of compute, only 30% of the performance of models trained on human-transcribed versions of the same texts (see Figure 6). Simple regex cleaning brings that number up to 70%, which is still a large discrepancy. We aim to shrink the remaining gap in performance by retranscribing the talkie corpus using our vintage OCR system.
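The regex cleaning referenced in Figure 6 is not specified in detail above; the snippet below is a hedged sketch of the kinds of rules such a pass typically applies to OCR’d print, such as rejoining hyphenated line breaks and stripping stray symbols.

```python
import re

def clean_ocr(text: str) -> str:
    """Illustrative regex cleanup for noisy OCR of historical print."""
    # Rejoin words split by end-of-line hyphenation: "won-\nderful" -> "wonderful"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Unwrap hard line breaks inside paragraphs while keeping blank lines as breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of whitespace and strip symbols commonly produced by OCR noise
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"[|¬*]+", "", text)
    return text.strip()
```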

Figure 7. Examples of historical reference texts with regular structure used for post-training. Left to right: etiquette manual (Beadle, 1859), practical knowledge book (Henley, 1914), parlor guide (Sandison, c. 1895), letter-writing manual (Chambers, 1900).

The lack of ready-made post-training data is another significant challenge. Fine-tuning our base model on off-the-shelf instruction-response pairs would bake in anachronistic knowledge, style, and expectations of what a chat assistant ought to be like. Rather than attempting to filter out these biases, we built a post-training pipeline from scratch.

First, we generated instruction-response pairs from historical texts with regular structure, such as etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections (see Figure 7), and fine-tuned our base model on them using a simple chat format.
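As a sketch of that step, the snippet below turns a (heading, body) entry from a structured historical reference into a chat-formatted training example. The delimiter tokens and the prompt template are assumptions for illustration; they are not necessarily the format talkie was trained on.

```python
# Hypothetical chat delimiters; talkie's actual format may differ.
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

def to_chat_example(heading: str, body: str) -> str:
    """Convert one entry of an etiquette manual into a single chat-formatted string."""
    prompt = f"What does etiquette require regarding {heading.lower()}?"
    return f"{USER}\n{prompt}\n{ASSISTANT}\n{body.strip()}\n{END}"

entry = (
    "Introductions at a ball",
    "A gentleman must be presented to a lady before he may ask her to dance.",
)
print(to_chat_example(*entry))
```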

Next, to improve instruction-following abilities, we generated synthetic prompts covering different types of tasks, such as summarizing documents, responding to direct information requests, and continuing multi-turn conversations coherently. We then ran online direct preference optimization on rollouts generated from these prompts, using Claude Sonnet 4.6 as a judge. Over the course of training, on a held-out eval set, the judge’s average instruction-following rating of talkie’s responses increased from 2.0 to 3.4 (on a five-point scale).
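For reference, the core of direct preference optimization on judge-rated rollouts looks like the following: for each synthetic prompt, sample two responses from the current policy, treat the judge’s higher-rated rollout as chosen and the other as rejected, and minimize the standard DPO objective. This is a generic sketch of the method, not our exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on sequence log-probabilities of chosen/rejected rollouts."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```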

Finally, we did another round of supervised fine-tuning, this time on rejection-sampled multi-turn synthetic chats between Claude Opus 4.6 and talkie, to smooth out persistent rough edges in its conversational abilities.
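A rough sketch of that rejection-sampling step is below. The `generate_chat` and `judge_score` functions are hypothetical stand-ins for sampling a Claude/talkie conversation and scoring it; only the top-scoring chat per prompt is kept, and only if it clears a quality threshold.

```python
import random

def generate_chat(prompt: str) -> str:
    """Hypothetical stand-in for sampling a multi-turn Claude <-> talkie conversation."""
    return f"[synthetic chat seeded by: {prompt}]"

def judge_score(chat: str) -> int:
    """Hypothetical stand-in for a judge's 1-5 conversational-quality rating."""
    return random.randint(1, 5)

def rejection_sample(prompts, n_candidates=8, min_score=4):
    """Keep only the best judge-approved chat per prompt for the final SFT round."""
    kept = []
    for prompt in prompts:
        candidates = [generate_chat(prompt) for _ in range(n_candidates)]
        scored = [(judge_score(chat), chat) for chat in candidates]
        best_score, best_chat = max(scored, key=lambda pair: pair[0])
        if best_score >= min_score:
            kept.append(best_chat)
    return kept
```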

While we have tried to post-train talkie free from modern influence, reinforcement learning with AI feedback inevitably shapes talkie’s behavior anachronistically. (The 7B version of talkie emerged from RL speaking in listicles.) As we scale up, we hope to be able to use our vintage base models themselves as judges to enable a fully bootstrapped era-appropriate post-training pipeline.

We plan to scale talkie rapidly in the coming months: training larger models, growing the historical corpus to well over a trillion tokens, improving transcription quality with our vintage OCR system, and expanding beyond English.

We are excited to collaborate with researchers and institutions to build the next generation of vintage language models. Please get in touch.

talkie reflects the culture and values of the texts it was trained on. As such, it can produce outputs that some users will find offensive.

Thanks to Coefficient Giving and Anthropic for support with funding and compute.

For helpful discussions, we thank Pranav Anand, Benjamin Breen, Catherine Brobston, Collin Burns, Matteo Cargnelutti, Mackenzie Cooley, Brandon Duderstadt, Owain Evans, Chloë Farr, Ryan Greenblatt, Michael Hla, Mark Humphries, Sam Klein, Greg Leppert, Jack Lindsey, Christina Lu, Seoirse Murray, Jake Naviasky, Krishna Patel, Ethan Perez, Puria Radmard, Ludwig Schmidt, Buck Shlegeris, Benjamin Sturgeon, Daniel Tan, Ross Taylor, Cam Tice, Trip Venturella, Merlijn Wajer, and Tao Xu.
