A more efficient implementation of compression-based topic classification
The recently published paper “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors [1] has received quite some public attention in recent days. Its key finding is that it can in some cases outperform large language models like BERT using the simple idea that two text documents are more similar if they can be compressed to a smaller file size (although there is some controversy about the validity of the results, see this blog post and this discussion including the authors' reply).
The main idea behind their approach is that the “information distance” between two documents, as defined by Bennett et al. [2], is a good distance metric to use for text classification. Since the information distance itself is not computable, they approximate it with the Normalized Compression Distance (NCD) [3], which estimates the information distance using “real-life” data compressors like gzip. NCD has the property that better compressors (i.e. compressors with a higher compression ratio) allow for a better estimate of the true information distance.
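For reference, NCD is computed from the compressed lengths of the two individual documents and of their concatenation. Here is a minimal sketch in Python using gzip as the compressor C; joining the two documents with a space is my own choice and not necessarily what the paper's code does:

```python
# Normalized Compression Distance:
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
# where C(.) is the length of the compressed input.
import gzip

def c(text: str) -> int:
    """Compressed length in bytes, used as a proxy for information content."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    cx, cy = c(x), c(y)
    cxy = c(x + " " + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```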
It seems natural, then, to expect that a better compressor will achieve better classification performance. But this is not what they find experimentally: the best compressor considered in the paper, bz2, performs worse in terms of accuracy than gzip. They explain this as follows: “[…] the Burrows-Wheeler algorithm used by bz2 dismisses the information of character order by permuting characters during compression” [1, p. 6817]. This suggests that compression alone does not explain their findings, but that character order plays a role as well.
This made me wonder: how much of their results is due to compression, and how much is due to the comparison of strings between two documents?
To investigate this question, I compare their results with two alternatives: (1) a simple compressor that relies solely on replacing recurring strings, and (2) an algorithm that explicitly does substring matching between a query document and all documents belonging to a topic.
A first ablation: Will LZ4 do the job?
The compression algorithm gzip is based on DEFLATE, which in turn uses LZ77 and Huffman coding to compress the data. Let's look at these two algorithms in more detail and think about what they imply when applied to our use case.
During compression, LZ77 uses a sliding window over previously seen data. If a string of characters is repeated, a reference to its first occurrence is stored instead of the string itself. Hence, if we take the compressed length of two concatenated documents as a distance metric, documents will be closer together if they have more overlapping substrings within the size of the sliding window (typically 32 KB).
Huffman coding compresses the resulting text even further: instead of using 8 bits for every character, it represents frequently occurring letters with fewer bits and rarely occurring letters with more bits. If we apply Huffman coding to the concatenated documents, two documents will be smaller after compression if they use characters with similar frequencies.
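To make the substring-reuse effect tangible, here is a toy illustration with made-up sentences of my own: appending a document that shares substrings with the first one should add fewer bytes after compression than appending an unrelated document of similar length.

```python
import gzip

def clen(s: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(s.encode("utf-8")))

a = "the stock market fell sharply after the quarterly earnings report"
b = "the stock market recovered after the quarterly earnings call"    # overlaps with a
c = "a recipe for sourdough bread with a very long fermentation time" # unrelated to a

print(clen(a + " " + b) - clen(a))  # expected to be comparatively small
print(clen(a + " " + c) - clen(a))  # expected to be larger
```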
One would expect matching substrings to be much more important for topic classification than similar character frequencies. I therefore do an ablation study by looking at the performance when using the LZ4 algorithm [4] for compression (basically LZ77, but with a fast implementation available in Python). Since LZ4 has a much lower compression ratio than gzip, their explanation would suggest that LZ4 performs worse than gzip. If, however, the main thing going on is substring matching, LZ4 will perform just as well as gzip.
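Swapping in LZ4 then only changes the compressor in the distance computation. A minimal sketch, assuming the `lz4` Python package (`pip install lz4`); the function name and the space separator are my own choices:

```python
import lz4.frame

def lz4_len(text: str) -> int:
    """Length in bytes of the LZ4-compressed string."""
    return len(lz4.frame.compress(text.encode("utf-8")))

def ncd_lz4(x: str, y: str) -> float:
    cx, cy = lz4_len(x), lz4_len(y)
    cxy = lz4_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```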
A more explicit algorithm
To go even further, I implement a simple algorithm that explicitly does the substring matching: it assigns documents to the topic with the most similar substrings (substrings being character-level n-grams here). It works as follows (a Python sketch follows the steps below):
Text encoding:
1. Extract all character n-grams within the text with 5 ≤ n ≤ 50.
2. For the extracted n-grams, calculate an 8-digit hash code using `hash(n_gram) % int(10e8)` in Python (since I want to control how many different things to keep track of).
3. Keep track of these codes in sets (thus losing the information about how many times a given code occurs).
Training:
1. Calculate the set of hash codes for each text of a given topic.
2. Take the set union to obtain a set of hash codes that occur in the topic.
Inferencing:
1. For some query text, calculate the set of its hashed n-grams.
2. For every topic encountered during training, calculate the size of the intersection between the topic's set and the query set.
3. Assign the query text to the topic with the largest intersection.
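Below is a minimal sketch of this procedure. The n-gram range and the hashing scheme follow the steps above; the class name `HashedNgramClassifier` and the `fit`/`predict` interface are my own choices, not taken from any published code.

```python
from collections import defaultdict

N_MIN, N_MAX = 5, 50       # character n-gram lengths, as described above
NUM_BUCKETS = int(10e8)    # hash codes computed as hash(n_gram) % int(10e8)

def ngram_hashes(text: str) -> set[int]:
    """Set of hashed character n-grams (occurrence counts are deliberately discarded)."""
    codes = set()
    for n in range(N_MIN, N_MAX + 1):
        for i in range(len(text) - n + 1):
            codes.add(hash(text[i:i + n]) % NUM_BUCKETS)
    return codes

class HashedNgramClassifier:
    def __init__(self):
        # One set of hash codes per topic, built up during training.
        self.topic_codes: dict[str, set[int]] = defaultdict(set)

    def fit(self, texts: list[str], topics: list[str]) -> None:
        # Training: union of the hash code sets of all texts belonging to a topic.
        for text, topic in zip(texts, topics):
            self.topic_codes[topic] |= ngram_hashes(text)

    def predict(self, text: str) -> str:
        # Inference: assign the query to the topic with the largest intersection.
        query = ngram_hashes(text)
        return max(self.topic_codes, key=lambda topic: len(query & self.topic_codes[topic]))
```

One caveat of this sketch: Python's built-in `hash` for strings is randomized per process unless `PYTHONHASHSEED` is fixed, so the codes are only consistent within a single run; a persistent model would need a deterministic hash function.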
Experiment and Results
I compare the results of gzip, lz4 and the hashed n-gram algorithm in the 100-shot setting with 5 runs, as described in their paper. For all three methods I stick to their experimental setup in order to reproduce their reported results (again, leaving us with potentially inflated accuracy measures). The code can be found on GitHub.
You can see the performance on three data sets from torchtext (AG_NEWS [5], DBpedia [6] and YahooAnswers [5]) in the following table:
We see that lz4 and the hashed n-grams outperform gzip on all three considered data sets, with the hashed n-gram algorithm being best on 2 out of the 3 data sets. However, it still does not compete with BERT, which achieves an accuracy of 0.82 on AG_NEWS and close to 1 on DBpedia in the 100-shot setting, according to their paper.
These results have important practical implications: in our experiment, the lz4-based algorithm runs roughly 10x faster than the gzip-based algorithm. And even more importantly, the hashed n-gram algorithm improves the computational complexity at inference time: instead of having to compare a query document with every other document in the text corpus, you only need one comparison per topic set.
What do we learn from this?
My results suggest that the driving force behind the impressive reported performance of gzip can be attributed to the fact that their method implicitly compares character-level n-grams between documents. This finding allows for the use of faster compressors like lz4 without any performance loss. Moreover, one can even rewrite their algorithm to have constant complexity with respect to dataset size at inference time, bringing their method a big step closer to practical use on big data sets. If you do want to use it in practice, I have started an implementation of my proposed algorithm following the scikit-learn API here.
The one question that remains is: why does this outperform the TF-IDF approach, which also compares words between documents?
Perhaps considering character-level n-grams does a better job for some tasks than splitting the text into individual words. But probably more importantly, the method here gives equal weight to all n-grams, regardless of their occurrence count. This means that it gives a lot of importance to so-called long-tail (read: rare) information, which apparently is relevant for some text tasks such as topic detection. Note that transformer networks do not do too good a job of modelling such long-tail information (for evidence, see e.g. [5]), which is why these very simple approaches are an interesting benchmark to measure your million-parameter model against.
References
[1] Z. Jiang, M. Yang, M. Tsirlin, R. Tang, Y. Dai, J. Lin, “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors (2023), ACL 2023
[2] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek, Information Distance (1998), IEEE Transactions on Information Theory
[3] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi, The Similarity Metric (2004), IEEE Transactions on Information Theory
[4] N. Kandpal, H. Deng, A. Roberts, E. Wallace, C. Raffel, Large Language Models Struggle to Learn Long-Tail Knowledge (2022), arxiv.org/abs/2211.08411
Thanks to Joel Niklaus and Marcel Gygli for our discussions and your feedback.