Today we are going to talk compression and biology. I will present you a new framework to see machine learning which is that learning is compression. We will explore together how we can solve a real-world problem by only zipping content. Yes! you heard me, we will zip data like what we do when we send an email. This will allow us to classify proteins.

🧠 Learning is compression

Comprehension is compression. (Gregory Chaitin, 2004)

I don’t know if you ever heard these statements. Let’s explain what they mean by considering clustering. In this task, we consider a dataset $X$ of $N$ points $x_j$. Each $x_j$ is supposed to belong to a class labelled as $y_j$ (e.g. blue/red/dark). The problem is to learn which label $y_j$ taken from the set of labels $Y$ corresponds to $x_j$.

Our clustering model would deduce the label of all points from those of its K cluster center {$p_k$}, in the hope that: $f(x_i)=y_i$

The decision function f just returns the label of the closest center. For instance, if the point is closer to center 3 (labeled “blue”) than to any other center, the decision function will return “blue”.

Now we just have to remember the centers and their label to retrieve all labels, hence the compression 😎.

🤯 Use zip to measure distance

I want to introduce you a new distance metric call Normalized Compression Distance (NCD). NCD is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. It measure how hard it is to compress $seq_1$ alone than compressing it concatenated with $seq_2$. We will measure how close two protein sequences.


$Z$ is a compressor like zip or gzip. $Z(x)$ measure number of bit it take to compress $x$ alone. $Z(xy)$ measure the number of bit used to compressed $x$ and $y$ together.

import zlib

# Define our three sequence
seq1 = "ABCDEFGHIJK".encode('latin-1')
seq2 = "AABBCCDDEEFF".encode('latin-1')

# compressing alone
z1 = len(zlib.compress(seq1))
z2 = len(zlib.compress(seq2))
print(f'Compressed size | Z(seq1)={z1}, Z(seq2)={z2}')

# co-compressing
z12 = len(zlib.compress(seq1 + seq2))
print(f'Compressed size | Z(seq1+seq2)={z12}')

ncd = (z12 - min(z1,z2)) / max(z1,z2)
print(f'NCD(seq1, seq2) = {ncd}')
Compressed size | Z(seq1)=19, Z(seq2)=20
Compressed size | Z(seq1+seq2)=31
NCD(seq1, seq2) = 0.6

🧬 Application to protein classification

Proteins can be made from 20 different kinds of amino acids, and the structure and function of each protein are determined by the kinds of amino acids used to make it and how they are arranged. Understanding this relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. On application of ML in biology is to classify the protein’s amino acid sequence to one of the protein family accession. In other words, the task is: given the amino acid sequence of the protein domain, predict which class it belongs to.

This is how an aligned protein sequence looks like: ....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE

We use this compression distance to look how protein within a family same are far from each other (intra family variation):

And how far they are from sequences coming from another family (extra family variation) :

As we see sequence from the same family are very class while those from other family are far. We obtain this cool result using only zip. Now we can train a kNN with this distance and classify proteins sequences.