Why Does Claude Speak Byzantine Music Notation?

https://fi-le.net/byzantine/

Telemakhos
So, let me see what I think I understand here:

1. AI models are good at Cæsar-cypher transposition, because it occurs often enough in the training data for certain values of the cypher offset. Outside those values, AI doesn't handle the transformations well.

2. Somehow AI models perform this cypher also within high ranges of Unicode, because the characters there are encoded as three tokens each, of which the last one encodes the same difference as between alphabetic letters. The first two tokens can be ignored as irrelevant, meaning that by sheer chance the alphabet maps perfectly via Cæsar-cypher (with a two-token offset) onto a specific range of Unicode characters reserved for Byzantine music notation (a concrete byte-level sketch follows after this list).

3. This is easy to understand for one AI model, because it's explicable by chance that the offset between the alphabet and Byzantine music notation should coincide perfectly with two less-significant tokens. It's harder to understand why this works in more than one AI model, though.
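
A minimal sketch of the coincidence in point 2 (the exact Byzantine Musical Symbols codepoints below are my illustrative choice, not something restated in this thread): adding 64 to a lowercase ASCII byte reproduces the final UTF-8 continuation byte of a symbol in the block, and only that final byte varies from symbol to symbol.

    # Illustrative: 'a' + 64 equals the last UTF-8 byte of U+1D021, and the
    # correspondence continues letter by letter; the leading bytes stay fixed.
    for i, ch in enumerate("abc"):
        shifted = ord(ch) + 64                  # 'a' (0x61) -> 0xA1
        sym = chr(0x1D021 + i)                  # a Byzantine Musical Symbol
        utf8 = sym.encode("utf-8")              # f0 9d 80 xx -- only xx changes
        print(ch, hex(shifted), utf8.hex(" "))
        assert shifted == utf8[-1]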

yorwba
It's not that surprising that models encode Byzantine Music Notation characters using separate tokens for each UTF-8 byte, since they're unlikely to occur often enough for byte-pair encoding to allocate dedicated multi-byte tokens to them.

What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.

bmenrigh
Possibly because of XOR with 0xC0, which, for lowercase ASCII, has the same effect as adding 64.

XOR-encoded input is probably common in the training data.
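
For what it's worth, the identity is easy to check: lowercase ASCII has bit 0x40 set and bit 0x80 clear, so XOR with 0xC0 turns 0x80 on and 0x40 off, a net change of +64.

    # XOR with 0xC0 == add 64, for every lowercase ASCII byte.
    for b in range(ord("a"), ord("z") + 1):
        assert b ^ 0xC0 == b + 64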

csense
Translating uppercase to lowercase adds 32.

Maybe it's not "thinking" in terms of "add 64", but rather "translate uppercase to lowercase, twice".
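
As arithmetic, that reading looks like this (the ASCII case shift is +32, i.e. bit 0x20):

    # Upper -> lower is +32 in ASCII; two case-shift-sized steps make +64.
    A = ord("A")                      # 0x41
    assert ord("a") - A == 32         # one case shift
    assert A + 32 + 32 == A + 64      # "lowercase twice", taken as arithmetic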

the8472
> What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.

Maybe it has some general ASCII bit-twiddling capability for base64 and uuencode that helps here too.

fcatalan
Another thing that LLMs are (surprisingly?) good at is randomly mixed language conversations, here's something I tried a few days ago:

Prompt: Kan du comprender questo text? Anata wa quelque inkling habe about lo que se parla dans la presente oraçao? Repondez med sam way of goboriu.

ChatGPT:

Ja, ich kann verstehen este mix de idiomas! Tu es usando une combinación de várias línguas pour expresar una pregunta. Je vais responder en the same modo:

Si tu veux kommunicieren en cette manière, c'est possible, aber il peut être ein bisschen confusing pour andere people! 你觉得这个方式有趣吗? O préfères-tu un idioma más claro?

Deepseek:

Ja, jeg kan forstå en lille bit av ciò que tu está dizendo. Anata no message wa molto interessante, pero un poco confusing aussi. Si tu peux écrire in una lingua sola, sería plus facile para mi at comprender. Danke, merci, arigatou!

Claude:

Oui, je kan entender algo of this texte mezclado! C'est un interessante modo de communicare using différentes languages together. Je trouve ça quite divertido y puedo responder en la misma manera mixta. Ist das what you were looking for? Watashi wa happy to continue questa conversation in whatever linguistic style anata wa prefer!

Perplexity:

Ja, ich kann diesen Text verstehen! Vous avez créé una frase muy interesante con palabras from different languages. Let me rispondere using a similar mishmash of lingue:

Questo è un esempio fascinante di code-switching e language mixing. Watashi wa impressed by la creatività y el modo en que vous avez combiné various idiomas in one sentence. Es ist ein poco challenging zu decifrar, aber anche molto divertente!

Ich würde sagen que el mensaje principal è about compréhension linguistique et la capacité de entender verschiedene Sprachen when they're mixed together. C'est une manière intéressante de tester language skills et cognitive flexibility.

Grazie for questa stimolante question! J'espère que mi respuesta was equally entertaining und in lo mismo estilo multilingüe.

ashirviskas
I only took Duolingo French for a few months, a few years ago; I've heard my girlfriend practice her Italian and I've spent some time around German people. I had Russian lessons, and I have pretty good English and Lithuanian skills, though I'm only fluent in those last two languages. I still understood most of your prompt, so I don't think this is a good test.

Reading that prompt again, I think watching some anime with subs helped too.

kace91
Yup, LLMs are a polyglot’s dream interface, considering that multi-language support is a feature pretty much every company screws up in its own way.

And then there’s Apple, which will not let me use its AI features because Siri needs to be in the same language as iOS: Siri is set to English and iOS is set to “English (Spain)” (????).

pbhjpbhj
Word vectors for the same concept in different languages are very close; that's not really surprising. The whole `v_king - v_male + v_female = v_queen` idea leads to an intuition that `v_grazie - v_italien + v_deutsch = v_danke`.

Code switching probably appears a lot in both literature and audio from some contexts too.

Why would it matter that a concept is expressed in a different human language? How would you structure a multilingual LLM so that it couldn't syncretise languages in this way? You'd seemingly need to properly segregate the languages.
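
For illustration, the analogy arithmetic above can be reproduced with gensim's pretrained vectors (assuming gensim is installed; "word2vec-google-news-300" is the usual English-only demo set, so the grazie/danke version would need an aligned multilingual embedding instead):

    # v_king - v_man + v_woman ~= v_queen, as a nearest-neighbour lookup.
    import gensim.downloader as api

    vecs = api.load("word2vec-google-news-300")   # large one-time download
    print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))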

fcatalan
Yes, I understand the encodings will be close and that helps; I guess that's why they produce good translations. But I'm intrigued by the LLM having so much control over the switching without even explicit prompting, just a one-shot example. I also guess I'm easily impressed.
foolswisdom
I think the research Anthropic released recently showed that language is handled independently of the "concepts" being conveyed: first you get the concepts, then you get the translation into a particular language.
fpgaminer
> At least in most public tokenizers like o200k, addition in certain Unicode ranges commutes with addition in token space

This seems flawed. I mean, the author's statement here is literally true, but it's eliding a very important detail: LLMs do _not_ see token indexes. They have no idea what order the token embeddings are in. In fact, you can shuffle the embeddings and the LLM wouldn't care at all. And I highly suspect that if you shuffled the entire tokenizer, so that the above property no longer holds, and trained Claude from scratch on that tokenizer, it would still be able to perform this task.

> so all but one of these symbols is mapped to three tokens each, where the first two are the same and can be easily ignored by an attention head, and the third token increments exactly with the Unicode.

This is the crux, I believe.

In the general case, the common Unicode ranges (for Korean, Japanese, Chinese, etc) get tokenized just like English (for modern tokenizers at least).

It's only in the obscure Unicode ranges where you hit a special case of the tokenizer. This is the "backup plan" of the tokenizer. If it encounters text that doesn't directly map to a token in its dictionary, then it falls back to encoding the text as UTF-8 bytes. Those UTF-8 bytes have a dedicated set of 256 tokens in its dictionary. So in those extreme cases, rather than getting bits of text like "Hell, o, Mr, ., B, ond", the LLM gets the raw UTF-8 bytes.
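
That fallback is easy to poke at directly (assuming tiktoken is installed; how the bytes get grouped into tokens is whatever o200k's merge table happens to produce):

    # A rare symbol falls back to raw UTF-8 byte tokens; common text doesn't.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    sym = chr(0x1D021)                                # a Byzantine Musical Symbol
    print(sym.encode("utf-8").hex(" "))               # its four raw UTF-8 bytes
    print([enc.decode_single_token_bytes(t) for t in enc.encode(sym)])
    print([enc.decode_single_token_bytes(t) for t in enc.encode("Hello, Mr. Bond")])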

Now, again, the LLM can't directly see those bytes, their index in the tokenizer's dictionary, their integer values, etc, etc. It only sees their embedding vectors, which are unordered. So it has no _implicit_ knowledge about those bytes being ordered. Therefore the assertion that addition commutes between Unicode and token indices is irrelevant.

My theory would be that the pretraining data contains lists of Unicode characters. Specifically, lists of unicode characters in order. Naturally, for the obscure ranges of unicode, this results in the LLM seeing counting in UTF-8 bytes. It doesn't initially know what the "value" of each byte is, but naturally it would learn that so that it can correctly predict the next byte.

The same occurs for English letters. It doesn't start with any knowledge about what order they are in. It only learns the ordered alphabet through seeing examples.

(The inverse applies, of course, since the output is also unordered.)

Maybe this is a nitpick? But it seems important to me, because it's the difference between a rather simple mechanism:

    output[i] = input[i] + 1

and a more complex mechanism:

    c = to_utf8_byte_index(input[i])
    c = c + 1
    output[i] = from_utf8_byte_index(c)
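
Spelled out as a toy (the byte-token block position and the two lookup helpers are hypothetical stand-ins for what the model would have to learn):

    # Hypothetical layout: the 256 raw-byte tokens sit in one contiguous block.
    BYTE_TOKEN_BASE = 1000

    def to_utf8_byte_index(token):   # stand-in for a learned token -> byte lookup
        return token - BYTE_TOKEN_BASE

    def from_utf8_byte_index(byte):  # stand-in for the inverse learned lookup
        return BYTE_TOKEN_BASE + byte

    def shift_simple(tokens, k=1):
        # output[i] = input[i] + k: needs visible, contiguously ordered indices.
        return [t + k for t in tokens]

    def shift_complex(tokens, k=1):
        # Route through the byte value each token represents; this version
        # survives shuffling the token <-> byte assignment.
        return [from_utf8_byte_index(to_utf8_byte_index(t) + k) for t in tokens]

With this contiguous layout the two functions give identical outputs; the difference only shows up once the assignment is shuffled, which is exactly the distinction being drawn here.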

nestorD
I was also surprised to find out (roughly a year ago) that Claude is good at Old English (which, despite its misleading name, looks nothing like English and is more of a Germanic language) whereas ChatGPT would output pure hallucinations.
rdlw
Interesting. I was using ChatGPT to try to come up with a possible reconstruction of the Ketef Hinnom scrolls (I don't know Ancient Hebrew at all), with some mixed results. I had to prompt it with things like "What do you think that 'YHWH' bit could mean?", and then it sort of caught on. Maybe I'll see if Claude can do better.

Your description of Old English is a bit odd. It's certainly very different from modern English, but it's its direct ancestor and both languages are Germanic.

GaggiX
Claude is much better than ChatGPT at low-resource languages; at least it was a year ago. I haven't tested the newer models from OpenAI, but I believe that Claude still has an edge.

For example, when ChatGPT was outputting nonsense in Georgian, Claude was speaking it fluently, and by the time ChatGPT learned Georgian, Claude was able to speak Mingrelian.
