
All Languages Are NOT Created (Tokenized) Equal
A focus on OpenAI's tokenizers

“hello” translated into 52 different languages. The size of each word corresponds to the number of tokens needed to represent the message in that language. Image created by author.

The original article was posted on my blog.

Large language models such as ChatGPT process and generate text sequences by first splitting the text into smaller units called tokens. In the image below, each colored block represents a single token. Short or common words such as “you”, “say”, “loud”, and “always” are each a single token, whereas longer or less common words such as “atrocious”, “precocious”, and “supercalifragilisticexpialidocious” are broken into smaller subwords.

Visualization of the tokenization of a short text using OpenAI’s tokenizer website. Screenshot taken by author.

This process of tokenization is not uniform across languages, resulting in disparities in the number of tokens produced for equivalent expressions in different languages. For example, a sentence in Burmese or Amharic may require 10x more tokens than a similar message in English.

An example of the same message translated into five languages, and the corresponding number of tokens required to tokenize that message (using OpenAI’s tokenizer). The text comes from Amazon’s MASSIVE dataset.

In this article, I explore the tokenization process and how it varies across different languages:

  • Analysis of token distributions in a parallel dataset of short messages translated into 52 different languages
  • Some languages, such as Armenian or Burmese, require 9 to 10 times more tokens than English to tokenize comparable messages
  • The impact of this language disparity
  • This phenomenon is not new to AI; it is consistent with what we observe in Morse code and computer fonts

Try it yourself!

Check out the exploratory dashboard I made, available on HuggingFace Spaces. There, you can compare the token lengths for different languages and for different tokenizers (which was not explored in this article, but which I encourage the reader to do on their own).

Screenshot of the language tokenizers dashboard.

MASSIVE is a parallel dataset introduced by Amazon, consisting of 1 million realistic, parallel short texts translated across 52 languages and 18 domains. I used the dev split of the dataset, which consists of 2,033 texts translated into each of the languages. The dataset is available on HuggingFace and is licensed under the CC BY 4.0 license.

While many other language model tokenizers exist, this article mainly focuses on OpenAI’s Byte Pair Encoding (BPE) tokenizer (used by ChatGPT and GPT-4) for three main reasons:

  • First, Denys Linkov’s article compared several tokenizers and found that GPT-2’s tokenizer had the highest token length disparity among different languages. This prompted me to focus on OpenAI models, including GPT-2 and its successors.
  • Second, since we lack insight into ChatGPT’s full training dataset, investigating OpenAI’s black-box models and tokenizers helps us better understand their behaviors and outputs.
  • Finally, the widespread adoption of ChatGPT in various applications (from language learning platforms like Duolingo to social media apps like Snapchat) highlights the importance of understanding tokenization nuances to ensure equitable language processing across diverse linguistic communities.

To calculate the number of tokens a text contains, I use the cl100k_base tokenizer available in tiktoken, which is the BPE tokenizer used by OpenAI’s ChatGPT models (`gpt-3.5-turbo` and `gpt-4`).

Some languages consistently tokenize to longer lengths

The following distribution plot compares the distribution of token lengths for five languages. The curve for English is tall and narrow, meaning that English texts consistently tokenize to a smaller number of tokens. On the other hand, the curves for languages such as Hindi and Burmese are short and wide, meaning that these languages tokenize texts into many more tokens.

Distribution of token lengths for all 2,033 messages and 52 languages. Five of the languages are bolded and colored; the rest are shown in gray. Figure created by author.

English has the shortest median token length

For each language, I calculated the median token length across all texts in the dataset. The following chart compares a subset of the languages. English texts had the smallest median length of 7 tokens, and Burmese texts had the largest median length of 72 tokens. Romance languages such as Spanish, French, and Portuguese tended to result in a similar number of tokens as English.

A subset of the 52 languages and their median token lengths. Figure created by author.

As English had the shortest median token length, I calculated the ratio of each other language’s median token length to that of English. Languages such as Hindi and Bengali (over 800 million people speak one of these two languages) resulted in a median token length of about 5 times that of English. The ratio is 9 times that of English for Armenian and over 10 times for Burmese. In other words, to express the same sentiment, some languages require up to 10 times more tokens.

A subset of the 52 languages and the ratio of that language’s median token length to that of English. Figure created by author.
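The median-and-ratio calculation can be sketched as follows. The per-message token counts here are made up for illustration (they are not the real MASSIVE measurements, though the medians mimic the reported 7 and 72):

```python
from statistics import median

# Hypothetical per-message token counts (illustrative only)
token_counts = {
    "English": [5, 7, 7, 8, 12],
    "Spanish": [6, 8, 9, 9, 14],
    "Burmese": [48, 66, 72, 80, 110],
}

# Median token length per language
medians = {lang: median(counts) for lang, counts in token_counts.items()}

# Ratio of each language's median to the English median
ratios = {lang: m / medians["English"] for lang, m in medians.items()}

print(medians)  # {'English': 7, 'Spanish': 9, 'Burmese': 72}
print(ratios)   # Burmese comes out at over 10x English here
```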

Implications of tokenization language disparity

Overall, requiring more tokens (to tokenize the same message in a different language) means:

  • You’re limited in how much information you can put in the prompt (since the context window is fixed). As of March 2023, GPT-3 could take up to 4K tokens and GPT-4 up to 8K or 32K tokens in its input [1]
  • It costs more money
  • It takes longer to run

OpenAI’s models are increasingly being used in countries where English is not the dominant language. According to SimilarWeb.com, the United States only accounted for 10% of the traffic sent to ChatGPT in January–March 2023.

Top 5 countries sending the most traffic to chat.openai.com in January–March 2023. Sourced from similarweb.com on May 2, 2023. Screenshot taken by author.

Moreover, ChatGPT was used in Pakistan to grant bail in a juvenile kidnapping case and in Japan for administrative tasks. As ChatGPT and similar models become increasingly integrated into services worldwide, it is crucial to understand and address such inequalities.

Language Disparity in Natural Language Processing

This digital divide in natural language processing (NLP) is an active area of research. 70% of research papers published in a computational linguistics conference evaluated only English [2]. Multilingual models perform worse on several NLP tasks for low-resource languages than for high-resource languages such as English [3]. According to W3Techs (World Wide Web Technology Surveys), English dominates more than half (55.6%) of the content on the Web [4].

Percentages of websites using various content languages (as of April 30, 2023). Data source: https://w3techs.com/technologies/overview/content_language. Figure created by author.

Similarly, English makes up over 46% of the Common Crawl corpus (billions of webpages crawled from the Web over more than a decade), versions of which have been used to train many large language models such as Google’s T5 and OpenAI’s GPT-3 (and likely ChatGPT and GPT-4). Common Crawl made up 60% of GPT-3’s training data [5].

Addressing the digital divide in NLP is crucial to ensure equitable language representation and performance in AI-driven technologies. Bridging this gap calls for a concerted effort from researchers, developers, and linguists to prioritize and invest in the development of low-resource languages, fostering a more inclusive and diverse linguistic landscape in the realm of natural language processing.

Historical example: Representing Chinese Typography using Morse Code

This kind of disparity in technological costs for different languages is not new to AI, or even to computing.

Over 100 years ago, telegraphy, a revolutionary technology of its time (“the internet of its era”), faced language inequities similar to those we see in today’s large language models. Despite its promises of open exchange and collaboration, telegraphy exhibited discrepancies in speed and cost across languages. For example, encoding and transmitting a message in Chinese (compared to an equivalent message in English) was:

  • 2 times as expensive
  • 15–20 times slower

Sound familiar?

Telegraphy was “designed first and foremost for Western alphabetic languages, English above all” [6]. Morse code assigned different lengths and costs to dots and dashes, resulting in a cost-efficient system for English. However, the Chinese language, which relies on ideograms, faced challenges in telegraphy. A Frenchman named Viguier devised a system mapping Chinese characters to Morse code.

Essentially, each Chinese ideogram was mapped to a four-digit code, which then needed to be translated into Morse code. This took a long time, as the codes had to be looked up in a codebook (which lacked meaningful correlations), and it was more costly to transmit (as each character was represented by four digits, and a single digit was more expensive to transmit than a single letter). This practice put the Chinese language at a disadvantage compared to other languages in terms of telegraphic speed and cost.

Manuscript on the left from Zhang Deyi’s Dianxin xinfa 電信新法, 1873. Danish National Archives. http://www5.kb.dk/permalink/2006/manus/350/eng/32/. Red circle drawn in by author.

Another example: Inequity in representing fonts

Initially, I attempted to visualize all 52 languages in a single word cloud. I ended up with something like the image below, where a majority of the languages were not rendered properly.

Word cloud visualizing “hello” in 52 languages. Most of the languages (including Arabic, Hindi, and Korean) cannot be rendered using a single font (depicted is the default WordCloud font, DroidSansMono). Size corresponds to the number of tokens required to represent “hello” in that language. Figure created by author.

This led me down a rabbit hole of trying to find a font that could render all of the language scripts. I searched Google Fonts for this perfect font and found that it didn’t exist. Below is a screenshot showing how these 52 languages render in three different fonts from Google Fonts.

To generate the word cloud at the beginning of this article, I (ahem) manually downloaded the 17 font files necessary to render all of the language scripts and displayed the words one by one. While I got the desired effect, it was a lot more work than it would have been if, for instance, all of my languages used the same script (such as the Latin alphabet).

Byte-Pair Encoding Tokenization

In the realm of natural language processing, tokenizers play a crucial role in enabling language models to process and understand text. Different models use different methods for tokenizing a sentence, such as splitting it into words, into characters, or into parts of words (also known as subwords; e.g. splitting “constantly” into “constant” and “ly”).

One common tokenization method is called Byte-Pair Encoding (BPE). This is the encoding used by OpenAI for their ChatGPT models. BPE is designed to decompose rare words into meaningful subwords while keeping frequently used words intact. A comprehensive explanation of the BPE algorithm can be found in the HuggingFace Transformers course.
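The core of BPE training, repeatedly merging the most frequent adjacent pair of symbols, can be sketched in a few lines. This is a toy illustration (the corpus follows the spirit of the HuggingFace course example), not OpenAI’s actual implementation:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs (weighted by word frequency); return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair = most_frequent_pair(words)  # ("u", "g") appears 20 times
words = merge_pair(words, pair)   # "hug" is now the two symbols ("h", "ug")
```

Real tokenizers learn thousands of such merges from their training corpus, which is why character sequences common in that corpus (largely English) end up as single tokens while others fragment into many.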

Deeper Dive into Token Distribution for Languages

I augmented Amazon’s MASSIVE dataset with details about each of the 52 languages taken from the infobox section of that language’s Wikipedia page, such as the writing script (e.g. Latin or Arabic alphabet) and the main geographic region where the language is predominant (if relevant). I additionally used metadata from The World Atlas of Language Structures to obtain information such as language family (e.g. Indo-European, Sino-Tibetan) [7].

Note that the following analyses in this article rely on the assumptions made by Wikipedia, The World Atlas of Language Structures, and the Amazon MASSIVE dataset. Since I am not a linguistics expert, I had to assume that the information on Wikipedia and the World Atlas is canonically accepted as correct with regard to dominant geographic region or language family.

Also, there are debates about what constitutes a language versus a dialect. For example, while languages such as Chinese and Arabic have different forms that speakers may not mutually understand, they are still called single languages. On the other hand, Hindi and Urdu are very similar and are sometimes grouped together as one language called Hindustani. Because of these challenges, we must be careful when deciding what counts as a language or a dialect.

Breakdown by language. I selected the 12 most spoken languages (counting both first- and second-language speakers).

Token distribution by language. Figure created by author.

Breakdown by language family. Indo-European (e.g. Swedish, French), Austronesian (e.g. Indonesian, Tagalog), and Uralic (e.g. Hungarian, Finnish) languages resulted in shorter token lengths. Dravidian languages (e.g. Tamil, Kannada) tended to have longer token lengths.

Token distribution by language family. Figure created by author.

Breakdown by predominant geographic region. Not all languages are specific to a single geographic region (for example, Arabic, English, and Spanish are spread across many regions); these languages were removed from this section. Languages spoken mostly in Europe tended to be shorter in token length, while languages spoken mostly in the Middle East, Central Asia, and the Horn of Africa tended to be longer.

Token distribution by predominant geographic region. Figure created by author.

Breakdown by writing script. Aside from the Latin, Arabic, and Cyrillic alphabets, all other languages use their own unique script. While the “unique script” category combines many very different scripts (such as the Korean, Hebrew, and Georgian scripts), these scripts clearly tokenize to longer lengths, compared to Latin-based scripts, which tokenize to shorter lengths.

Token distribution by writing script. Figure created by author.

English almost all the time ranks #1

For each text in the dataset, I ranked all languages by number of tokens: the language with the fewest tokens was ranked #1 and the one with the most tokens was ranked #52. Then, I plotted the distribution of each language’s rank. Essentially, this shows how each language’s token length compares with the other languages in this dataset. In the figure below, I labeled a few of the languages (the other languages show up as gray lines in the background).

While there were a few cases where some languages’ token counts were lower than English’s (such as a few examples in Indonesian or Norwegian), English almost always ranked first. Does this come as a surprise to anyone? What surprised me most was that there was no clear #2 or #3: English texts consistently produce the shortest tokenizations, while the ranking fluctuates a bit more for the other languages.

Distribution of each language’s token-count rank with respect to the other languages. Figure created by author.
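The per-message ranking step can be sketched as follows, with hypothetical token counts for a single message (not the real MASSIVE numbers):

```python
# Hypothetical token counts for one message (illustrative only)
counts = {"English": 7, "Spanish": 9, "Indonesian": 8, "Hindi": 35, "Burmese": 72}

# Rank languages by token count: rank 1 = fewest tokens
ordered = sorted(counts, key=counts.get)
ranks = {lang: rank for rank, lang in enumerate(ordered, start=1)}

print(ranks)
# {'English': 1, 'Indonesian': 2, 'Spanish': 3, 'Hindi': 4, 'Burmese': 5}
```

Repeating this for every message in the dataset and collecting each language’s ranks yields the distributions plotted above.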

Quantifying token distribution differences using Earth Mover’s Distance

To quantify how different the token length distributions of two languages were, I calculated the earth mover’s distance (also known as the Wasserstein distance) between the two distributions. Essentially, this metric calculates the minimum amount of “work” required to transform one distribution into another. Larger values mean the distributions are farther apart (more different), while smaller values mean the distributions are quite similar.

Here is a small subset of languages. Note that the distance says nothing about the length of the tokens, just how similar the distributions of token lengths are for two languages. For example, Arabic and Russian have similar distributions even though the languages themselves are not similar in a linguistic sense.
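A sketch of this comparison, assuming SciPy is available (the sample token lengths below are made up for illustration, not the real measurements):

```python
from scipy.stats import wasserstein_distance

# Hypothetical token-length samples per language (illustrative only)
english = [5, 6, 7, 7, 8, 9, 12]
spanish = [6, 7, 8, 8, 9, 10, 13]
burmese = [50, 60, 70, 72, 80, 95, 110]

# Small distance = similar distributions; large distance = very different
print(wasserstein_distance(english, spanish))  # small
print(wasserstein_distance(english, burmese))  # large
```

Computing this pairwise for all 52 languages produces the heatmap below.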

Heatmap showing the Earth Mover’s Distance among a subset of languages. Figure created by author.

1. OpenAI. “Models”. OpenAI API. Archived from the original on March 17, 2023. Retrieved March 18, 2023.

2. Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2022. Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland. Association for Computational Linguistics.

3. Shijie Wu and Mark Dredze. 2020. Are All Languages Created Equal in Multilingual BERT?. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.

4. “Usage statistics of content languages for websites”. W3Techs. Archived from the original on 30 April 2023.

5. Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877–1901.

6. Jing Tsu. Kingdom of Characters: The Language Revolution That Made China Modern. New York: Riverhead Books, 2022 (p. 124).

7. Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. WALS Online (v2020.3) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7385533. Available online at https://wals.info. Accessed on 2023-04-30.
