Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors

Original paper · Jiang et al 2023

Let's take a break from the historical papers for a day and look at a fresh paper that has caused quite a stir on Twitter over the past few weeks. In a time when new LLMs are presented almost daily, this paper proposes a completely parameter-free classification method that's implemented in as little as 14 lines of code. It combines the staple gzip compression algorithm with a k-NN classifier to create an easy, lightweight and universal text classifier that is competitive with non-pretrained deep learning methods. The biggest talking point, however, was the authors' claim to outperform BERT on all five OOD datasets; those results have since been questioned, more on that later. Let's look at their methodology.

gzip + k-NN

The most elegant thing about this approach is its simplicity, both in theory and in practice. At a high level, the proposed algorithm leverages the fact that gzip, being a lossless compressor, aims to represent information using the fewest bits possible, effectively removing redundancy. As a consequence, compressing the concatenation of text 1 and text 2 adds only a little to the compressed size of text 1 alone when the two texts are similar. More formally, given three texts x_1, x_2 and x_3, where x_1 and x_2 belong to the same category, and a compressor C(), one expects that C(x_1 x_2) - C(x_1) < C(x_1 x_3) - C(x_1). This intuition is packaged into the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), a similarity measure between two strings based on the length of their compressed forms. Combining this distance with a k-NN classifier over the nearest neighbors of a text completes the entire algorithm, which the authors implement in just 14 lines of Python, shown further down.
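
To make the compression intuition concrete before the full listing, here is a tiny sketch; the example sentences are mine, not from the paper, and on very short strings gzip's header overhead can blur the effect, but the more similar pair should generally score a lower NCD.

import gzip

def C(s):
    return len(gzip.compress(s.encode()))

def ncd(a, b):
    # Normalized Compression Distance between two strings
    Cab = C(" ".join([a, b]))
    return (Cab - min(C(a), C(b))) / max(C(a), C(b))

# Toy sentences: x1 and x2 are about the same topic, x3 is not.
x1 = "the home team won the championship game last night"
x2 = "the away team lost the championship game last night"
x3 = "the new phone has a faster chip and a better camera"

print(ncd(x1, x2))  # expected to be smaller: similar texts compress well together
print(ncd(x1, x3))  # expected to be larger: dissimilar texts share little redundancy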

import gzip
import numpy as np

for (x1, _) in test_set:
    # Compressed length of the test document
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        # Compressed length of the training document
        Cx2 = len(gzip.compress(x2.encode()))
        # Compressed length of the concatenation
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        # Normalized Compression Distance
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    # Majority vote over the labels of the k nearest training documents
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = list(training_set[sorted_idx[:k], 1])
    predict_class = max(set(top_k_class), key=top_k_class.count)

Final thoughts

I find this paper to be a very elegant and intriguing research contribution even though the results are probably not as strong as reported. It is still a remarkably straightforward implementation that requires no GPU, no pretraining and no parameter tuning; it provides a simple baseline and opens up new avenues for interesting research. Nevertheless, it's unfortunate that the reported results are not correct, as pointed out by Ken Schutte in his blog post "Bad numbers in the 'gzip beats BERT' paper?". First, Ken points out how odd it is to use k=2 in this classification scenario, since there are only two possibilities: either the two nearest labels agree (and the result is the same as with k=1), or they disagree (where the common convention of taking the closest label again reduces to k=1). Going from k=1 to k=2 therefore adds no information to the classifier, except through the tie-breaking strategy, and this is where the authors seem to have done something odd: on a tie, they count the prediction as correct if either of the two tied labels matches the ground truth. Fixing this error drops performance significantly, in some cases from best to worst among the compared classifiers.
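
To make the scoring issue concrete, here is a rough sketch of the two evaluation strategies for k=2. This is my own illustration of the behavior described in Ken's post, not code from the paper or from his repository.

from collections import Counter

def strict_prediction(top_k_labels):
    # Conventional kNN: majority vote; on a tie, fall back to the closest neighbor's label.
    counts = Counter(top_k_labels)
    best = counts.most_common(1)[0][1]
    tied = [label for label, count in counts.items() if count == best]
    return top_k_labels[0] if len(tied) > 1 else tied[0]

def reported_as_correct(top_k_labels, true_label):
    # Scoring as described in Ken Schutte's post: on a tie, the prediction counts as
    # correct if either tied label matches the ground truth (effectively top-2 accuracy).
    counts = Counter(top_k_labels)
    best = counts.most_common(1)[0][1]
    tied = [label for label, count in counts.items() if count == best]
    return true_label in tied if len(tied) > 1 else tied[0] == true_label

# With k=2 and disagreeing neighbors the two strategies diverge:
top2 = ["tech", "sports"]                   # nearest neighbor first
print(strict_prediction(top2))              # "tech", marked wrong if the true label is "sports"
print(reported_as_correct(top2, "sports"))  # True under the reported scoring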