Inside Cultivate’s Unique NLP System: Part 4
In this five-part series, we explore Cultivate’s NLP system and how we’ve drawn on research from across machine learning to build it.
Over the last few posts, we’ve talked a lot about how we use embeddings and the useful properties they have. In this post, we’ll take a look at how we train these embedding models to ensure a seamless product.
As a refresher, our embedding space comes from an intermediate layer of a language model. Ideally, it has the property that similar sentences map to nearby points in the embedding space.
This raises the question: what counts as a similar sentence?
This is a difficult question to answer. It turns out that while out-of-the-box language model embeddings work well for many tasks, they perform poorly on semantic similarity, and the exact reason is still an active area of research. So for out-of-the-box language models, there is no satisfying answer.
Fig 4.1: Comparison of out-of-the-box BERT (BERT CLS-vector) semantic similarity scores. Note how BERT without fine-tuning performs worse than averaged GloVe word embeddings. Table adapted from Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv).
However, fine-tuned BERT embeddings do exhibit our desired property: similar sentences are encoded to similar locations in space. This fine-tuning process is exactly how we tell BERT which sentences should be considered similar. The tasks we give BERT effectively define what it means for two sentences to be similar: if we ask BERT to classify polite and impolite sentences, the embedding space is transformed so that polite sentences end up close to one another.
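To make this concrete, here is a minimal sketch of task-driven fine-tuning using the Hugging Face transformers library. The politeness examples, label scheme, and hyperparameters are illustrative assumptions, not our production setup.

```python
# A minimal sketch, assuming the public "bert-base-uncased" checkpoint and a
# hypothetical two-class politeness dataset; not our production setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = impolite, 1 = polite
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["Could you please take a look when you get a chance?",
         "Just fix it already."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # cross-entropy on the [CLS] head
loss.backward()
optimizer.step()

# After many such updates, the encoder's intermediate representations (the
# embedding space) shift so that polite sentences cluster together.
```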
Note that while we tend to think of these as task-specific embeddings, this approach also works to create more general embeddings. SentenceBERT, a commonly used embedding model, was fine-tuned to detect entailments and contradictions in natural language and yields a very good general-purpose embedding space.
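As a quick illustration of that general-purpose property, here is a hedged example using the open-source sentence-transformers package; the checkpoint name is one of its published models, not necessarily the one we use in production.

```python
# Illustrative only: compare two similar sentences against an unrelated one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "Great job on the launch!",
    "Nice work shipping the release.",
    "The meeting has been moved to 3pm.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# The first two sentences should score noticeably higher against each other
# than either does against the third.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```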
Here at Cultivate, we’ve leaned into this even further by adopting multi-task fine-tuning. Instead of fine-tuning a separate language model for each of our tasks, we share a single language model; each task keeps its own task head, training objective, and data. By grouping these tasks together, we find a single embedding space that behaves nicely for all of them.
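A simplified sketch of what such a shared-encoder, multi-head model can look like in PyTorch is below; the task names, label counts, and base checkpoint are placeholders rather than our actual tasks.

```python
# A sketch of a shared encoder with one lightweight head per task.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, task_num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared encoder
        hidden = self.encoder.config.hidden_size
        # One classification head per task, trained on that task's data.
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, task, **batch):
        cls = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding
        return self.heads[task](cls)                          # task-specific logits

# Hypothetical task set for illustration; our real tasks differ.
model = MultiTaskModel({"politeness": 2, "recognition": 2, "sentiment": 3})
```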
To train our models, we cycle through the tasks in a round-robin manner: after each gradient update, we switch to the next task and take another gradient descent step. We choose the tasks empirically, by reviewing past and planned features and looking for common patterns, and we’ve settled on six tasks. A simplified version of this schedule is sketched below.
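Continuing the sketch above, a round-robin schedule might look like the following; the loader format and the choice of loss are assumptions made for the example.

```python
# Round-robin multi-task training: one gradient step per task, then rotate.
import itertools
import torch
import torch.nn.functional as F

def train_round_robin(model, task_loaders, num_steps, lr=2e-5):
    """task_loaders maps task name -> iterable of (tokenized_batch, labels)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    iterators = {t: itertools.cycle(dl) for t, dl in task_loaders.items()}
    for _ in range(num_steps):
        for task, it in iterators.items():      # cycle through the tasks
            batch, labels = next(it)
            loss = F.cross_entropy(model(task, **batch), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                    # one update, then switch task
```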
Training a single multi-task model with a shared encoder is not something you would traditionally have done before BERT: you would expect a drop in performance due to noise from the other tasks and datasets. However, we find no discernible degradation in our unified models. BERT is capable of much more complex tasks, such as question answering and paraphrasing, so combining several text classifiers is well within the model’s capacity.
Having a shared encoder is advantageous for a multitude of reasons. First, the language model encoder contains the bulk of the model’s parameters and therefore requires the bulk of the compute. Sharing an encoder reduces the processing time needed to annotate user messages, since the encoding step is usually the bottleneck.
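Reusing the MultiTaskModel sketch from above, the inference-time payoff looks roughly like this: the expensive encoder runs once per message, and every task head reads from the same embedding.

```python
# One encoder pass per message; all task heads reuse the same [CLS] embedding.
@torch.no_grad()
def annotate(model, **batch):
    cls = model.encoder(**batch).last_hidden_state[:, 0]   # encode once
    return {task: head(cls).softmax(-1)                    # cheap per-task heads
            for task, head in model.heads.items()}
```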
Another key factor in our choice of a unified approach is that it makes it easy to change our entire NLP pipeline at once. Here at Cultivate, we wanted to create a system that is truly passive and in the flow of work. If you usually communicate with your coworkers in Urdu, we want to capture that seamlessly. To do that, we need to identify instances of recognition, or polite and impolite sentences, in a wide variety of languages.
Fig 4.2: Multilingual embedding space. Image adapted from Language-agnostic BERT Sentence Embedding.
By replacing our existing language model with a multilingual one, we can add support for 100+ languages to our whole pipeline in just a few lines of code. These multilingual models work by mapping similar sentences in different languages to the same point in the embedding space. This allows for cross-lingual training, where we train on English examples but still observe good test accuracy in other languages.
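As a hedged illustration of that cross-lingual behavior, here is what it looks like with a publicly available multilingual checkpoint from sentence-transformers (not necessarily the model we ship); the Urdu sentence is an approximate translation of the English one.

```python
# Translations of the same sentence should land close together in the shared space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
english = model.encode("Great work on the presentation!", convert_to_tensor=True)
urdu = model.encode("پریزنٹیشن پر بہت اچھا کام!", convert_to_tensor=True)

print(util.cos_sim(english, urdu))  # high similarity despite different languages
```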
We’ve now taken a look at how we use embeddings as well as how we train them. In our next post, we’ll look at how we approach these problems in a privacy-centric way, without requiring any customer plaintext to be stored on our servers. We’ll also explain a little about which problems we focus on when, and how the pieces tie together.