In this five-part series, we’ll explore Cultivate’s NLP system and how we’ve capitalized on research from all across machine learning to create it.
Inside Cultivate’s Unique NLP System: Part 5
In our last couple of posts, we looked at how Cultivate trains and uses neural embeddings. In this post, we'll take a step back and see how those pieces fit into the bigger picture of Cultivate's NLP system. We'll also explore how our privacy-focused embedding architecture lets Cultivate avoid saving customer plaintext while still leaving room to experiment with new ideas.
To better understand Cultivate’s pipeline, let’s take a look at the journey of a single message from start to finish.
Fig 5.1: Cultivate’s Embedding Architecture
After we pull data down from a myriad of sources, every message first goes through our language model encoder, which converts it into an embedding. These embeddings are saved and then fed into different classifiers.
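As a rough sketch of this first step (the encoder library, model name, and file paths below are illustrative stand-ins, not Cultivate's production stack):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # illustrative encoder choice

# Load a sentence encoder (hypothetical model name).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

messages = [
    "Can we move our 1:1 to Thursday?",
    "Great work on the launch!",
]

# Convert each message into a fixed-size embedding vector.
embeddings = encoder.encode(messages, normalize_embeddings=True)

# Persist the embeddings (not the plaintext) for downstream classifiers.
np.save("message_embeddings.npy", embeddings)
```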
As a refresher, we use two different types of classifiers here at Cultivate. The first kind are weak supervision label models. These models require only a small anchor set to identify messages: they work by finding sentences whose embeddings are similar to those of the anchor set, which means they can be "trained" with as few as 10 examples. They are easy to set up, run quickly, and work great for exploring different features and product ideas. However, their performance is usually a tier or two below that of a properly trained model.
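A minimal sketch of what such an anchor-based flagger could look like, assuming L2-normalized embeddings; the cosine-similarity scoring and the 0.7 threshold are illustrative simplifications, not Cultivate's actual label model:

```python
import numpy as np

def flag_messages(corpus_emb: np.ndarray, anchor_emb: np.ndarray,
                  threshold: float = 0.7) -> np.ndarray:
    """Flag messages whose embedding is close to any anchor embedding.

    Assumes both matrices are L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = corpus_emb @ anchor_emb.T      # shape: (n_messages, n_anchors)
    # A message is flagged if its best anchor match clears the threshold.
    return sims.max(axis=1) >= threshold

# A handful of hand-picked anchor examples is enough to "train" this flagger.
```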
This is where our second kind of model comes in. When we're confident that a message flagger is surfacing useful information, we usually collect more data and add a task head to our multi-task model. This is our "exploitation" step: by training a dedicated task head, we make the model as accurate as possible and deliver the most value to the user.
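Conceptually, promoting a flagger to a trained task looks something like this PyTorch sketch; the class structure, names, and dimensions are our illustration, not Cultivate's actual architecture:

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """A shared encoder with one lightweight classification head per task."""

    def __init__(self, encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.encoder = encoder        # shared embedding model
        self.embed_dim = embed_dim
        self.heads = nn.ModuleDict()  # task name -> classifier head

    def add_task_head(self, task_name: str, num_classes: int):
        # Promoting a flagger to a "real" task just adds one linear head
        # on top of the shared embedding space.
        self.heads[task_name] = nn.Linear(self.embed_dim, num_classes)

    def forward(self, inputs, task_name: str):
        embedding = self.encoder(inputs)
        return self.heads[task_name](embedding)
```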
Since these two modeling approaches share the same embedding space, they have a symbiotic relationship. We use message flaggers to find good tasks for our multi-task model, and by incorporating a slew of these tasks, we produce a well-behaved embedding space. This creates a positive feedback loop: we can then use the new embedding space in our message flaggers, improving their ability to identify useful subsets of messages.
This embedding-based approach is also more respectful of user privacy. We take every step possible to ensure that the only person who sees your messages is you: not your manager, not a third-party data labeler, not even Cultivate. To this end, we've recently moved to a streaming solution, removing all customer plaintext from our system. Normally this would cripple our modeling capabilities: without stored plaintext, we would have no way to try out new models and features. However, we get around this by saving neural embeddings instead.
This works because, under this approach, a weak supervision model is defined solely by its anchor set, which means the corpus embeddings are identical across runs. We therefore need to compute the corpus embeddings only once, when we initially pull a message down from our integrations. We can then reuse the cached embeddings across different flaggers: all a new flagger needs is a new anchor set and the generative label model. This lets us retain our general modeling ability without plaintext, allowing us to explore new product ideas and features.
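In code, the caching pattern might look like the following, reusing the illustrative encoder and flag_messages helper from the sketches above; the file name is hypothetical:

```python
import numpy as np

# Computed once at ingestion time; the plaintext is discarded afterwards.
corpus_emb = np.load("message_embeddings.npy")

# Defining a brand-new flagger requires only a new anchor set.
# No plaintext access, no re-encoding of the corpus.
new_anchor_sentences = [
    "I'm feeling burned out lately.",
    "This workload is unsustainable.",
]
anchor_emb = encoder.encode(new_anchor_sentences, normalize_embeddings=True)

flags = flag_messages(corpus_emb, anchor_emb, threshold=0.7)
```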
This works great for message flaggers, but what about when we need to train a new task head? Ideally, we would dump customer plaintext and label it to train the new task. That gives us the most confidence in model generalization and avoids tricky domain shift problems. Unfortunately, this approach is out of the question, as it exposes user data.
To get around this, we've adopted a unique data labeling approach. While we cannot dump customer plaintext, we can dump the saved embeddings instead. We then find sentences similar to those embeddings by exploiting their semantic similarity property. To do this, we leverage the Cultivate Convo Corpus: a collection of over 1.4 million conversation snippets sourced from a variety of internal and public datasets. With our dumped embeddings, we find the most similar plaintext embeddings in the corpus, then dump and label those proxy plaintext sentences and train on them. This ensures your messages stay private, so that the only person who sees them is you.
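A sketch of that proxy retrieval step, again assuming L2-normalized embeddings (the top-k choice and variable names are illustrative):

```python
import numpy as np

def find_proxy_sentences(dumped_emb: np.ndarray, corpus_emb: np.ndarray,
                         corpus_texts: list[str], k: int = 5) -> list[list[str]]:
    """For each dumped customer embedding, return the k most similar
    corpus sentences. Assumes L2-normalized embeddings."""
    sims = dumped_emb @ corpus_emb.T          # shape: (n_dumped, n_corpus)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the best matches
    return [[corpus_texts[i] for i in row] for row in top_k]

# The returned proxy sentences, drawn from the Cultivate Convo Corpus
# rather than from customer data, are what labelers actually see.
```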
This concludes our blog post series on Cultivate's machine learning pipeline and the neural embeddings at its core. We've explored how embeddings let us build a flexible, comprehensive, and privacy-focused approach to natural language processing. Thank you for reading!