Detecting changes in text using machine learning
High level goal
Today, I will explain how I implemented change detection in text using machine learning. This technique aims at producing a report highlighting what changed between two subsequent versions of the same text (i.e. versions with no major changes in structure or content).
Above, we can see what kind of output is expected from such an algorithm:
- On the left, there is the text of the first version of the document and on the right, the second one.
- The strikethrough words highlighted in red are those present in the first version but not in the second.
- The words highlighted in green are those present only in the second version.
- The words highlighted in blue are present in both versions but with a change of format (uppercase/lowercase).
- The remaining words are common to both versions.
Input / output
This algorithm expects two files in plain text format. Keep in mind that its purpose is to compare two subsequent versions of the same text, so we assume the two files are similar enough to produce a reasonable output.
The output is an HTML file that lays the differences out in a human-readable format.
How does it work?
The diagram above represents the whole solution process. There are three main steps:
- Section extraction: divide the text files into smaller sections.
- Matching: using the machine learning model BERT, we compute embeddings of the text sections. Embeddings are vectors that numerically represent the meaning of something that is usually not numerical, here text.
- Comparison: the comparison is done syntactically, on the words and how they appear (as opposed to semantically, on the meaning of the words and sentences), and is implemented with the longest common subsequence algorithm.
The first step is to divide the text into smaller sections. This is important because the comparison algorithm (longest common subsequence) struggles with very long texts.
This was implemented by exploiting the structure of the text: the sections and subsections in official documents usually follow a fixed syntax (e.g. “1.1.1. Introduction” followed by “1.1.2. Another section”). This syntax can be captured by a regular expression stating, essentially: a section title starts with any number of digit-and-dot pairs. By further exploiting the order in which the sections appear (e.g. section “1.” can only be followed by “2.” or “1.1”), we can reliably extract the relevant sections of the texts.
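A minimal sketch of such a regex-based extractor is shown below. The exact pattern and function names are illustrative assumptions, not the project's actual code, and the sketch omits the ordering check described above:

```python
import re

# Assumed numbering style: "1.", "1.1.", "1.1.1.", each followed by a title.
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.\s+(.+)$")

def extract_sections(text: str) -> list[tuple[str, str]]:
    """Split a text into (section_number, section_body) pairs."""
    sections: list[tuple[str, str]] = []
    current_number, current_lines = None, []
    for line in text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            # A new section title starts: flush the previous section.
            if current_number is not None:
                sections.append((current_number, "\n".join(current_lines)))
            current_number, current_lines = match.group(1), [line.strip()]
        elif current_number is not None:
            current_lines.append(line.strip())
    if current_number is not None:
        sections.append((current_number, "\n".join(current_lines)))
    return sections

doc = "1. Introduction\nSome intro text.\n1.1. Scope\nScope details.\n2. Methods\nMethod text."
sections = extract_sections(doc)
# Section numbers found: "1", "1.1", "2"
```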
Once the different sections of each text are extracted, we can match them to compare the relevant sections of the first version of the text, with the relevant sections of the second version.
To do so, we use a machine learning model (BERT) that encodes the meaning of a sentence as a vector of numbers (an embedding). This allows us to numerically compare the similarity of two texts using simple vector operations, such as cosine similarity.
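As an illustration, cosine similarity can be computed with plain Python. The three-dimensional vectors below are toy stand-ins for real BERT embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 for identical
    directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for BERT outputs.
v1 = [0.9, 0.1, 0.0]
v2 = [0.8, 0.2, 0.1]
v3 = [0.0, 0.1, 0.9]

cosine_similarity(v1, v2)  # close to 1: similar sections
cosine_similarity(v1, v3)  # close to 0: unrelated sections
```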
Once we have the embeddings, we can run a matching algorithm (greedy matching, linear sum assignment, …) that outputs a list of paired sections.
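The greedy variant can be sketched as follows: sort all candidate pairs by similarity and take the best pair whose sections are still free. The threshold value here is an assumption for illustration:

```python
def greedy_match(similarity: list[list[float]], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Greedily pair sections of version 1 (rows) with sections of
    version 2 (columns), highest similarity first."""
    candidates = sorted(
        ((similarity[i][j], i, j)
         for i in range(len(similarity))
         for j in range(len(similarity[i]))),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs are too dissimilar to match
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return pairs

# Example similarity matrix between 3 sections of each version.
sim = [
    [0.95, 0.10, 0.20],
    [0.15, 0.90, 0.30],
    [0.10, 0.25, 0.85],
]
greedy_match(sim)  # [(0, 0), (1, 1), (2, 2)]
```

For an optimal (rather than greedy) assignment, `scipy.optimize.linear_sum_assignment` solves the same pairing problem globally.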
In this last step, we compare the paired sections using the longest common subsequence algorithm. I will not go into the details of this algorithm; it basically extracts the longest subsequence common to both texts and considers everything else the result of a modification between the two versions. This algorithm is extensively used in many applications today (GitHub’s system for comparing code, patching tools on Linux, …).
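A word-level version of this step can be sketched with the classic dynamic-programming table; the label names here are illustrative (the real implementation also tracks format changes):

```python
def lcs_diff(words_a: list[str], words_b: list[str]) -> list[tuple[str, str]]:
    """Label each word as 'common', 'removed' (only in version 1)
    or 'added' (only in version 2) via the LCS dynamic-programming table."""
    m, n = len(words_a), len(words_b)
    # dp[i][j] = length of the LCS of words_a[:i] and words_b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if words_a[i] == words_b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack from the bottom-right corner to recover the labels.
    out, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and words_a[i - 1] == words_b[j - 1]:
            out.append(("common", words_a[i - 1])); i -= 1; j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            out.append(("added", words_b[j - 1])); j -= 1
        else:
            out.append(("removed", words_a[i - 1])); i -= 1
    return out[::-1]

lcs_diff("the quick fox".split(), "the slow fox".split())
# [('common', 'the'), ('removed', 'quick'), ('added', 'slow'), ('common', 'fox')]
```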
The result of this algorithm is a list of the words in the sequence, each labelled with whether it is present in the first version, the second version, or both. This lets us automatically format the differences into an HTML file that can be directly read by humans.
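Rendering the labelled words as HTML can then be as simple as the sketch below; the inline styles are assumptions mirroring the red/green colour scheme described earlier, not the project's actual stylesheet:

```python
import html

# Hypothetical mapping from a word's label to its HTML wrapper.
STYLES = {
    "removed": '<span style="color:red;text-decoration:line-through">{}</span>',
    "added": '<span style="color:green">{}</span>',
    "common": "{}",
}

def render_diff(labelled_words: list[tuple[str, str]]) -> str:
    """Turn (label, word) pairs into an HTML fragment."""
    return " ".join(
        STYLES[label].format(html.escape(word))
        for label, word in labelled_words
    )

render_diff([
    ("common", "the"), ("removed", "quick"),
    ("added", "slow"), ("common", "fox"),
])
```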