Interview with Semeris summer intern Stanislav Chistyakov (University College London)

As part of our cooperation with University College London, Stanislav Chistyakov wrote his dissertation project on a Semeris topic with our guidance. Stanislav was studying for an MSc in Data Science and Machine Learning. After his successful graduation, Agnes Huszti (our Semeris marketing assistant in San Francisco) had a chat with Stanislav about his time working with Semeris.

Why did you choose Semeris for your dissertation project?

From the wide range of fields where machine learning (ML) is actively applied, I am primarily interested in natural language processing (NLP), as numerous tasks can be automated using intelligent systems. There is an apparent demand for professionals who have experience applying ML in such tasks.

The project offered by Semeris stood out from a large number of alternatives since it was not concentrated on a narrow research objective. Instead, it was motivated by a practical goal: solving the Named-Entity Recognition (NER) task in financial contracts using ML.

I was attracted by the opportunity to work on a project in a startup, where both founders have a wealth of entrepreneurial experience and a strong understanding of the technicalities behind recent research. It was also a great experience that their recently launched X-Ray product includes pieces built on my work.

How would you summarise your project?

As mentioned before, the project was motivated by the NER tasks faced by Semeris. NER is a significant component of information extraction from text and can improve the whole software product offered by Semeris. The goal of an NER system is to label key entities within sentences correctly.

The entity categories can vary from more trivial ones, such as organization names, dates, and currencies, to harder ones, such as cross-references within the text and complex legal instruments.

This task can be approached using a simple rule-based system, where the most frequent patterns are specified. For example, if a portion of text matches the "day-month-year" structure, it is likely to be a date. However, this process is laborious and requires domain expertise (financial and legal in the case of the project).
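To illustrate the rule-based approach described above, here is a minimal Python sketch of a single rule for the "day-month-year" structure. The pattern, the month list, and the function name find_dates are assumptions made for illustration; they are not taken from the project or from any Semeris system.

```python
import re

# Hypothetical rule: day, then month (name or number), then a 4-digit year,
# separated by a space, slash, or hyphen. Real contracts need many more rules.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})[\s/-](" + MONTHS + r"|\d{1,2})[\s/-](\d{4})\b"
)


def find_dates(text):
    """Return (start, end, matched_text) for each span that looks like a date."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "The Issue Date is 12 March 2021 and the notes mature on 01-06-2030."
    for start, end, match in find_dates(sample):
        print(start, end, match)
```

Even this one rule misses formats such as "March 12, 2021" or "the twelfth day of March", which shows why maintaining such rules by hand becomes laborious.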

So, my goal was to use the recent advances in ML (neural networks in particular) to automate this process of recognizing entities. The project’s primary outcome is that the NER task can be solved reasonably well using neural networks and that a trained neural model successfully recognizes most entities.

In my research, I compared different neural architectures (namely Bi-LSTMs, CNNs, and Transformers) by analyzing their strengths and weaknesses. Without going into too much detail, it was interesting that Transformers, which show state-of-the-art results in most NLP benchmarks, also performed best on this task. The performance gap widens when recognizing more complex entities (e.g., entities consisting of multiple words and numbers or entities not previously seen in the training set).

However, using Transformers in practice poses an additional challenge: they need enough computational resources to keep text processing within a reasonable timeframe. Hence, future work remains in finding the optimal trade-off between model performance and computational efficiency.

What challenges did you encounter?

The three major challenges that I encountered in my work are probably similar to what most ML practitioners experience when working on a project: pre-processing the data to ensure that it is in the right format, running each experiment for hours on a remote GPU instance, and analyzing model mistakes to improve the quality of the dataset.

It was exciting to gain the experience of addressing the highlighted issues, as research projects often lack the real-world problems that you would encounter in the industry. Peter and Sam provided all the resources that I might need, including the thoroughly prepared dataset and an additional budget for running the experiments.

The major part of the project was working on improving the dataset quality by looking at what the trained models predict, as in many cases, they managed to correctly identify entities that were not present in the labeled dataset. The fact that a trained neural model can be successfully used to find mislabelled data points can be highlighted as another outcome of the project.
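One way to picture this review step is the following minimal Python sketch. It flags predicted entity spans that do not overlap any labelled span, so that a person can check whether the label is missing. The span representation, the entity type names, and the function name review_candidates are hypothetical; the interview does not describe the actual procedure used in the project.

```python
from typing import List, Set, Tuple

# Hypothetical span format: (start_char, end_char, entity_type)
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the character ranges of two spans intersect."""
    return a[0] < b[1] and b[0] < a[1]


def review_candidates(labelled: Set[Span], predicted: Set[Span]) -> List[Span]:
    """Predicted spans with no overlapping labelled span, sorted by position.

    These are only candidates: each one is either a model mistake or an
    entity missing from the labelled dataset, and a human has to decide.
    """
    return sorted(
        p for p in predicted if not any(overlaps(p, g) for g in labelled)
    )


if __name__ == "__main__":
    labelled = {(0, 10, "ORG")}
    predicted = {(0, 10, "ORG"), (25, 38, "DATE")}
    print(review_candidates(labelled, predicted))
```

The sketch deliberately ignores entity types when checking overlap, so a span labelled with a different type than the one predicted is not flagged; a stricter comparison would be a design choice for the reviewer.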

Which NLP models were better? Why do you think that is?

A neural NER system consists of multiple modular components that can have an impact on the overall performance. The main difference lies in how the input text's context is encoded. This is similar to what we do as humans when reading. We read the words that we see and keep track of the neighboring words and information from previous sentences to understand the current portion of the text.

Neural NLP architectures mimic this process in a way specific to a particular type of encoder. The project considered three major groups of encoders: Bidirectional Long Short-Term Memory (Bi-LSTM), Convolutional Neural Network (CNN), and Transformer. The last model is well-known for its strong performance across all major NLP tasks, and the project results followed this trend.

I trained five distinct versions of Transformers: DistilBERT-base-cased, DistilBERT-base-uncased, Legal-BERT-small, Legal-BERT-base, and ALBERT-base. They vary in the number of trainable parameters and in specific design intricacies, but they generally outperformed all other considered models in most comparisons. For example, the best Transformer model made 62% fewer mistakes than the best non-Transformer counterpart when recalling long entities (more than 50 characters), 30% fewer mistakes when recognizing challenging entities (cross-references, legal instruments, organizations), and 16% fewer mistakes when recognizing entities that were not previously seen anywhere in the training dataset.
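Figures such as "62% fewer mistakes" can be read as a relative error reduction. The short Python sketch below shows that arithmetic under this assumed reading; the error counts in the usage example are invented purely for illustration and are not results from the project.

```python
def relative_error_reduction(baseline_errors: float, new_errors: float) -> float:
    """Fraction of the baseline's mistakes that the new model avoids."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return (baseline_errors - new_errors) / baseline_errors


if __name__ == "__main__":
    # Hypothetical counts: 100 missed long entities versus 38.
    reduction = relative_error_reduction(100, 38)
    print(f"{reduction:.0%} fewer mistakes")
```

Under this reading, the percentage compares error counts rather than accuracy scores, so a large reduction is possible even when both models already score highly.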

This might be explained by the large number of trainable parameters in these models, as well as the power of pre-training. Pre-training is a technique in which a neural model is first trained on large volumes of text to solve a more general NLP task (for example, predicting the next word, i.e., language modeling) before being trained further on the task of interest (NER in this case).

What computing resources did you use during your project?

All considered models consisted of millions of trainable parameters, with some Transformer models having over 60 million parameters. This makes the training process quite computationally expensive on a typical computer, which motivates the use of graphics cards (GPUs) to accelerate it. Training and inference can be more than ten times faster on a GPU. I used three computational resources throughout the project: my PC with a GPU, Google Colaboratory, and Semeris’ servers at Amazon Web Services. The use of the last two was sponsored by Semeris, which allowed me to significantly speed up the experiments and run many of them concurrently.

What future work would you like to see in this field?

Transformers are usually pre-trained on many gigabytes of text, which provides them with what appears to be a reasonably good understanding of human language. The pre-training is usually done by major tech companies that make their trained models publicly available. It might be interesting to do further pre-training on a large volume of domain-specific text (legal and financial in this case) to see if this further improves the NER performance.