1. AI models are good at the Cæsar cypher because it occurs often enough in the training data for certain values of the cypher offset. Outside those values, AI doesn't handle the transformation well.
2. Somehow AI models also perform this cypher within high ranges of Unicode, because the characters there are encoded as three tokens each, of which the last one encodes the same difference as between alphabetic letters, so the first two tokens get discarded as irrelevant. By sheer chance, the alphabet therefore maps perfectly via the Cæsar cypher (with a two-token offset) onto a specific range of Unicode characters reserved for Byzantine music notation (see the byte-level sketch just after this list).
3. This is easy to understand for one AI model, because it's explicable by chance that the offset between the alphabet and Byzantine music notation should coincide perfectly with two less-significant tokens. It's harder to understand why this works in more than one AI model, though.
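To make the byte-level part of point 2 concrete, here is a minimal sketch (assuming the range in question is the Byzantine Musical Symbols block starting at U+1D000; how these bytes get grouped into tokens depends on the tokenizer):

for k in range(4):
    letter = chr(ord('a') + k)
    byz = chr(0x1D000 + k)    # Byzantine Musical Symbols block
    # the ASCII letter is one byte; the musical symbol is four bytes with a
    # constant prefix, and its final byte advances in lockstep with the letter
    print(letter, letter.encode('utf-8').hex(' '),
          '|', f'U+{0x1D000 + k:05X}', byz.encode('utf-8').hex(' '))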
What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.
Maybe it's not "thinking" in terms of "add 64", but rather "translate uppercase to lowercase, twice".
Seeing xor-encoded input is probably common in the training data.
>>> '\N{fullwidth latin capital letter a}'.encode('utf-8')
b'\xef\xbc\xa1'
A seems like the culprit.
Token Features 0 => list[1, 5, 6, 10]
Token Features 1 => list[7, 8]
...
These "list features" would be invariant to Caesar cipher. So then the LLM could pass these list features to a learned Caesar cipher decoder unit to spit out the decoded text.
It's still unexplained, however, why Byzantine Music Notation would trigger this circuit while other Caesar ciphers wouldn't.
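A minimal sketch of the invariance claim, using characters as a stand-in for tokens (the occurrence_pattern helper here is hypothetical, just for illustration):

from collections import defaultdict

def occurrence_pattern(text):
    # positions at which each distinct character occurs; this is the kind of
    # "list feature" described above
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return sorted(positions.values())

def caesar(text, shift):
    return ''.join(chr((ord(c) - ord('a') + shift) % 26 + ord('a'))
                   if c.islower() else c for c in text)

plain = "how much wood would a woodchuck chuck"
# the pattern survives any shift, so a downstream "decoder unit" could work from it
assert occurrence_pattern(plain) == occurrence_pattern(caesar(plain, 5))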
Here is a sample. This 'encoded' form is clearly not a cipher (it is in fact random), but it 'tried' anyway:
Prompt: Does the following sentence: "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" bare resemblance to the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?"?
ChatGPT said:
Yes, the sentence "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" does indeed resemble the structure of the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?".
It seems like it could be an encrypted version of the latter phrase. Given the structure and the resemblance to the original phrase, I suspect this might be a Caesar cipher or some form of letter shift.
Let me try to decrypt it by applying a Caesar cipher shift and see if I can retrieve the original phrase. I'll start with a few different shift values. Hang on!
It tries but it fails to both generate and apply a simple substitution. It starts substituting based on applying the substitution twice to some letters.
I spent a very frustrating time trying to solve it, especially after asking it to confirm it had created a 1:1 mapping.
Since it has no reasoning, it just lies about having done so.
When we get character-level models this should be more viable.
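For contrast, a minimal sketch of what actually building and applying a 1:1 mapping looks like (plain Python, nothing model-specific):

import random
import string

alphabet = string.ascii_lowercase
key = ''.join(random.sample(alphabet, len(alphabet)))   # a random 1:1 letter mapping
encode = str.maketrans(alphabet, key)
decode = str.maketrans(key, alphabet)

plain = "how much wood would a woodchuck chuck"
cipher = plain.translate(encode)
assert cipher.translate(decode) == plain   # applying the inverse once recovers the text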
Prompt: Kan du comprender questo text? Anata wa quelque inkling habe about lo que se parla dans la presente oraçao? Repondez med sam way of goboriu.
ChatGPT:
Ja, ich kann verstehen este mix de idiomas! Tu es usando une combinación de várias línguas pour expresar una pregunta. Je vais responder en the same modo:
Si tu veux kommunicieren en cette manière, c'est possible, aber il peut être ein bisschen confusing pour andere people! 你觉得这个方式有趣吗? O préfères-tu un idioma más claro?
Deepseek:
Ja, jeg kan forstå en lille bit av ciò que tu está dizendo. Anata no message wa molto interessante, pero un poco confusing aussi. Si tu peux écrire in una lingua sola, sería plus facile para mi at comprender. Danke, merci, arigatou!
Claude:
Oui, je kan entender algo of this texte mezclado! C'est un interessante modo de communicare using différentes languages together. Je trouve ça quite divertido y puedo responder en la misma manera mixta. Ist das what you were looking for? Watashi wa happy to continue questa conversation in whatever linguistic style anata wa prefer!
Perplexity:
Ja, ich kann diesen Text verstehen! Vous avez créé una frase muy interesante con palabras from different languages. Let me rispondere using a similar mishmash of lingue:
Questo è un esempio fascinante di code-switching e language mixing. Watashi wa impressed by la creatività y el modo en que vous avez combiné various idiomas in one sentence. Es ist ein poco challenging zu decifrar, aber anche molto divertente!
Ich würde sagen que el mensaje principal è about compréhension linguistique et la capacité de entender verschiedene Sprachen when they're mixed together. C'est une manière intéressante de tester language skills et cognitive flexibility.
Grazie for questa stimolante question! J'espère que mi respuesta was equally entertaining und in lo mismo estilo multilingüe.
Code switching probably appears a lot in both literature and audio from some contexts too.
Why would it matter that a concept is expressed in a different human language? How would you structure a multilingual LLM so that it couldn't syncretise languages in this way? You'd seemingly need to properly segregate the languages.
Reading that prompt again, I think watching some anime with subs helped too.
And then there’s Apple, which will not let me use their AI features because Siri needs to be in the same language as iOS: Siri is set to English and iOS is set to “English (Spain)” (????).
> یوسفی بود ولی هیچ خریدار نداشت
Its output was:
> Yoosefi بود ولی هیچ khaरीदār nadāsht
That's three different scripts with two different Romanisation schemes just for the Latin/Roman script (writing "Yoosefi" as "Yūsefī" or "Yūsufī" would have been more consistent with "nadāsht").
It was discovered that some LLMs effortlessly understand characters from the "Tag" range in Unicode and treat them like ASCII, even though those characters are used virtually nowhere in normal text and you in fact need specialized tools just to make them visible. (There is a formal 1-1 mapping between tags and ASCII characters, which would also qualify as a Caesar cipher, but you'd have to read the Unicode spec to find out.)
Most concerns were about the fact that this would allow people to smuggle hidden messages to or from the LLMs. But an interesting question was also how the models had even learned the mapping in the first place if tags never occurred in the training data anywhere.
As I understood it, the solution was pretty simple though: They hadn't. There was no specialized circuit for tags in the models. Tag characters just had the property that if you write them as bytes, they will look like "<some prefix bytes> <byte pattern of the corresponding ASCII character>".
So already the tokenizer would parse the characters as ASCII, interleaved with "unknown" tokens for the prefixes. All the model had to do was to ignore the "unknown" tokens and it could process the rest like ASCII. No Caesar cipher decoding needed!
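A minimal sketch of that byte layout (assuming the Tags block at U+E0000; strictly speaking the final byte is the ASCII value plus a fixed offset, 64 for letters, rather than the ASCII byte itself, but the point stands: a constant prefix plus one byte that tracks the ASCII character):

# Tag characters mirror ASCII at U+E0000 + code point. For ASCII letters the
# UTF-8 encoding is a constant prefix (f3 a0 81) plus one byte that is the
# ASCII value plus 0x40, so only the last byte carries information.
for ch in "Hi":
    tag = chr(0xE0000 + ord(ch))
    print(ch, hex(ord(ch)), tag.encode('utf-8').hex(' '))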
Are we sure something similar isn't happening here?
[1] https://arstechnica.com/security/2024/10/ai-chatbots-can-rea...
They share this property with the Fullwidth Latin block, which does occur in the wild interspersed with Japanese or Chinese text.
How common is that? In my experience it's much more normal for Chinese text to intersperse ordinary ascii characters.
https://www.zdic.net/hans/%E8%84%B8
I'm not sure what purpose fullwidth characters are supposed to serve, but whatever it is, it doesn't seem like they're succeeding.
Most people don't really care about this, or, if they do, simply use a font that renders regular Latin at full width (or half width to be more space-efficient) but very occasionally the Fullwidth Latin codepoints get some use. It's more common in Japanese (though still rare) than Chinese in my experience, but e.g. the Project Gutenberg ebook of 阿Q正傳 https://gutenberg.org/cache/epub/25332/pg25332-images.html uses fullwidth Qs.
tl;dr many of the LLMs use byte-pair encoding to create tokens. You take a set of documents, and then form tokens by repeatedly merging the most common pair of tokens. The initial set of tokens is 256 raw bytes. And the text is typically represented in utf-8.
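A toy version of that merge loop (a minimal sketch of byte-pair encoding, not any particular model's tokenizer):

from collections import Counter

def bpe_train(text, num_merges):
    # start from raw UTF-8 bytes and repeatedly merge the most frequent
    # adjacent pair into a new token id
    tokens = list(text.encode('utf-8'))     # initial vocabulary: the 256 byte values
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges[(a, b)] = next_id
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_id)      # replace the pair with the new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges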
I expect that although the LLMs can understand arbitrarily but cleanly offset unicode code points by (eventually) noticing the final byte of each sequence, they would do markedly worse on actually processing and completing on them, because they will not have been reduced to the normal set of tokens. However, if the text is actually output cleanly converted, either in internal thinking tokens or in the beginning of the response, they should do fine.
Understanding tokenization is surprisingly useful, even if that video seems awfully long to devote to such a tedious subject. Even Karpathy doesn't like it!
For example, when ChatGPT was outputting nonsense in Georgian, Claude was speaking it fluently; by the time ChatGPT learned Georgian, Claude was able to speak Mingrelian.
Your description of Old English is a bit odd. It's certainly very different from modern English, but it's its direct ancestor and both languages are Germanic.
[0]: https://www.poetryfoundation.org/poems/43521/beowulf-old-eng...
I've also tried Old Norse, Ancient Greek, and Old East Slavic, and the result is pretty much the same. For OES in particular, it often outputs period-incorrect grammar, writes in Old Church Slavonic (different language), or even modern Russian or Serbian. Looks like the dataset was a bit chaotic, with religious books mixed with old manuscripts and even modern books for children. Mentioning a specific work from the desired period makes it write better, and wrangling it by specifying the rules makes it get this almost right.
This seems flawed. I mean, the author's statement here is literally true, but it's eliding a very important detail: LLMs do _not_ see token indexes. They have no idea what order the token embeddings are in. In fact, you can shuffle the embeddings and the LLM wouldn't care at all. And I highly suspect that if you shuffled the entire tokenizer, so that the above property no longer holds, and trained Claude from scratch on that tokenizer, it would still be able to perform this task.
> so all but one of these symbols is mapped to three tokens each, where the first two are the same and can be easily ignored by an attention head, and the third token increments exactly with the Unicode.
This is the crux, I believe.
In the general case, the common Unicode ranges (for Korean, Japanese, Chinese, etc) get tokenized just like English (for modern tokenizers at least).
It's only in the obscure Unicode ranges where you hit a special case of the tokenizer. This is the "backup plan" of the tokenizer. If it encounters text that doesn't directly map to a token in its dictionary, then it falls back to encoding the text as UTF-8 bytes. Those UTF-8 bytes have a dedicated set of 256 tokens in its dictionary. So in those extreme cases, rather than getting bits of text like "Hell, o, Mr, ., B, ond" the LLM gets the raw UTF-8 bytes.
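A quick way to see that fallback in practice (this assumes the tiktoken package with its cl100k_base vocabulary; exact token counts and ids will differ between tokenizers):

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')
# common text maps to a few dictionary tokens; obscure code points decompose
# toward raw byte tokens
for text in ['Hello', '\U0001D000\U0001D001']:
    ids = enc.encode(text)
    print(repr(text), '->', len(ids), 'tokens:', ids)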
Now, again, the LLM can't directly see those bytes, their index in the tokenizer's dictionary, their integer values, etc, etc. It only sees their embedding vectors, which are unordered. So it has no _implicit_ knowledge about those bytes being ordered. Therefore the assertion that addition commutes between Unicode and token indices is irrelevant.
My theory would be that the pretraining data contains lists of Unicode characters. Specifically, lists of unicode characters in order. Naturally, for the obscure ranges of unicode, this results in the LLM seeing counting in UTF-8 bytes. It doesn't initially know what the "value" of each byte is, but naturally it would learn that so that it can correctly predict the next byte.
The same occurs for English letters. It doesn't start with any knowledge about what order they are in. It only learns the ordered alphabet through seeing examples.
(The inverse applies, of course, since the output is also unordered.)
Maybe this is a nitpick? But it seems important to me, because it's the difference between a rather simple mechanism:
output[i] = input[i] + 1
and a more complex mechanism:
c = to_utf8_byte_index(input[i])
c = c + 1
output[i] = from_utf8_byte_index(c)
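A runnable sketch of that second mechanism (hypothetical helper; it ignores the carry that would be needed when the final byte overflows its range):

def shift_last_utf8_byte(ch: str, delta: int = 1) -> str:
    # increment only the final byte of the character's UTF-8 encoding
    b = bytearray(ch.encode('utf-8'))
    b[-1] += delta          # no carry handling at the 0x7F / 0xBF boundaries
    return b.decode('utf-8')

print(shift_last_utf8_byte('a'))            # 'b'
print(shift_last_utf8_byte('\U0001D000'))   # the next Byzantine Musical Symbol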
Also it's important because I'd suspect the LLM will see a _lot_ of UTF-8 counting. There's about a million unicode "characters", the vast majority of which won't have direct token mappings. So in rough estimation for a single complete listing of Unicode, it'd see a list of purely counting in bytes that is 1 million lines long. That's 3900 complete cycles of the least significant byte. Just from one listing.
In contrast, it's not going to encounter a lot of listings of, say, the Korean unicode range in unicode order (about 11k points). Each time it does, it gets to see exactly 1 complete cycle.
So a single listing of Unicode gives it 3900 examples of how to cycle one byte VS a single listing of an "alphabet" giving it only 1 example.
I mean, my initial response to the headline was the knee-jerk answer “Because it doesn’t understand music, because it’s not a human being with emotions”, and that actually kind of works if Claude is basically looking at language and using a pipe wrench to hammer nails into wood.