Inside Cultivate’s Unique NLP System: Part 3
In this five-part series, we’ll explore Cultivate’s NLP system and how we’ve capitalized on research from across machine learning to create it.
Our last two posts have introduced the concept of neural embeddings and some of their useful properties. In this post, we’ll start taking a look at some concrete applications of embeddings here at Cultivate.
As a startup, we embrace creativity and moonshot ideas. We often need to identify a particular subset of messages to quickly prototype features or try out new ideas. To this end, we’ve leveraged existing research in weak supervision in order to create text classifiers that only need a few examples to be trained.
One common application of neural embeddings is semantic search. Let’s say we have a set of documents, and we want to find the closest document to our query q.
Figure 3.1: Semantic Search. Image adapted from Multilingual Sentence & Image Embeddings with BERT – GitHub
We can find similar sentences by encoding the query and every sentence in the corpus, then computing the cosine distance between the query embedding and each corpus embedding.
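As a quick illustration, here is a minimal semantic-search sketch using the open-source sentence-transformers library; the model name, corpus, and query below are placeholder examples rather than our production setup.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence encoder works here; this model name is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The deploy finished without errors.",
    "Can we move our 1:1 to Thursday?",
    "Great work on the launch, team!",
]
query = "Is anyone free to reschedule our meeting?"

# Encode the corpus and the query into the same embedding space.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every sentence in the corpus;
# the highest-scoring sentence is the closest document to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```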
We can reuse the approach from semantic search to create a classifier: threshold the similarity score between a sentence and an anchor sentence. The anchor sentence is simply an example of the class we are trying to identify. Since similar sentences get mapped to nearby points in the embedding space, we expect other sentences of that class to be close to the anchor sentence.
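In code, the single-anchor version is just a thresholded similarity check. This is a minimal sketch; the anchor sentence, threshold value, and encoder are illustrative placeholders, and the threshold would be tuned on held-out data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

ANCHOR = "Congratulations on the big promotion!"  # example anchor for a "praise" class
THRESHOLD = 0.6                                   # illustrative value, not a tuned one

anchor_embedding = model.encode(ANCHOR, convert_to_tensor=True)

def is_in_class(sentence: str) -> bool:
    # Flag the sentence if it lands close enough to the anchor in embedding space.
    embedding = model.encode(sentence, convert_to_tensor=True)
    return util.cos_sim(embedding, anchor_embedding).item() >= THRESHOLD
```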
Fig 3.2: Some example anchors and their flagged messages.
However, using just a single anchor sentence usually leads to poor performance. To get better coverage, we need an anchor set: a set of anchor sentences. A straightforward way to use it is to take the mean embedding of the anchor set and find examples close to that mean. Since the embedding space is linear, this is equivalent to computing the vector of similarity scores against each anchor and then averaging them.
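Extending the sketch above, the anchor-set version scores sentences against the mean anchor embedding (again, the anchors and encoder are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

anchor_set = [
    "Congrats on shipping the new feature!",
    "Great job leading that meeting.",
    "Really appreciate your help on the report.",
]

# Mean embedding of the anchor set.
anchor_embeddings = model.encode(anchor_set, convert_to_tensor=True)
mean_anchor = anchor_embeddings.mean(dim=0, keepdim=True)

def anchor_set_score(sentence: str) -> float:
    # Similarity to the mean anchor; ranking by this score matches ranking by
    # the average per-anchor similarity (up to normalization).
    embedding = model.encode(sentence, convert_to_tensor=True)
    return util.cos_sim(embedding, mean_anchor).item()
```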
We’ve expanded on this idea by incorporating ideas from weak supervision and multi-task learning to combine these noisy per-anchor labels more effectively. Concretely, we’re looking for a function that takes in a vector of noisy labels L and outputs Y, a single probability.
To do so, we learn a generative model P_w(L, Y) that models the joint distribution of the noisy label matrix L and the true labels Y using maximum likelihood estimation.
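As a rough sketch of what this looks like in practice, the open-source Snorkel library implements a closely related generative label model. The example below illustrates the general technique rather than our exact implementation, and the label matrix is made up.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# L is an (n_examples x n_anchors) matrix of noisy votes:
# 1 = similarity to that anchor is above threshold, 0 = below, -1 = abstain.
L = np.array([
    [ 1,  1, -1],
    [ 0,  1,  1],
    [ 0,  0,  0],
    [ 1, -1,  1],
])

# Fit the generative model over the noisy labels; each anchor's accuracy
# is a parameter learned by maximum likelihood, without any ground truth.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=0)

# Probabilistic labels P(Y | L) for each example.
probs = label_model.predict_proba(L)
```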
Fig 3.3: Generative model architecture. Here Y is our probabilistic label and lambda 1–3 are our different anchors. Image adapted from Training Complex Models with Multi-Task Weak Supervision – arXiv
This generative model assumes that each anchor example has its own accuracy, and it learns these accuracy parameters from the data. Once we’ve learned the generative model, we can convert the joint distribution into a conditional distribution via Bayes’ rule:
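P_w(Y \mid L) = \frac{P_w(L, Y)}{P_w(L)} = \frac{P_w(L, Y)}{\sum_{Y'} P_w(L, Y')}

where the sum in the denominator runs over the possible label values, so each example ends up with a single probability as its label.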
The idea is that the generative model can look at where the anchors agree and disagree and combine that information into an overall decision, leading to better performance.
To compare our weak supervision approach against the mean-embedding baseline, we sampled a varying number of positive examples from the SST-2 dataset and used them to create a classifier, setting the threshold with a line search on the validation set. Our weak supervision label model achieved the highest overall accuracy, and its advantage grew as more examples were added.
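For reference, the line search over thresholds can be as simple as sweeping candidate values over the validation scores and keeping the one that gives the best accuracy; the helper below is an illustrative sketch, not our exact tuning code.

```python
import numpy as np

def best_threshold(val_scores: np.ndarray, val_labels: np.ndarray) -> float:
    """Pick the similarity threshold that maximizes validation accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in np.linspace(val_scores.min(), val_scores.max(), num=200):
        acc = ((val_scores >= t).astype(int) == val_labels).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```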
The key here is that neither of these approaches requires labeled training data, just anchor sets. We can often collect these by looking at heuristics outside of NLP, like message response time, interaction with our application, and sender/recipient information. By combining these signals with our neural embeddings, we’re able to expand our ability to identify the messages we care about.
Join us for Part 4, when we talk a little bit about how we train these embeddings!