Inside Cultivate’s Unique NLP System: Part 1

In this five-part series, we’ll explore Cultivate’s NLP system and how we’ve drawn on research from across machine learning to build it.
Cultivate delivers novel AI-enabled solutions to improve the employee experience at some of the largest enterprises in the world. These intelligent solutions are driven by our ability to parse and interpret natural language. This unlocks the power of message content by allowing us to identify and encourage specific instances of manager behavior.
For example, a manager might send this message to a coworker: “Thank you for the hard work!”
That has a very different effect than a message like: “Need you to work overtime on Saturday.”
To do this, we need to be able to classify messages into different categories. Aside from the obvious difference in meaning, we also categorize messages based on softer qualities like affect or tone. This problem is called text classification and is one of the oldest problems in NLP.
Over the past few years, however, we’ve seen a paradigm shift in how we approach these problems. One of the biggest breakthroughs has been the rise of BERT and other large pretrained language models. These models have dramatically streamlined the NLP process: instead of building a model from scratch, one can simply fine-tune a large pretrained model.
This two-pronged approach is similar to an athlete preparing for their season during the offseason. Offseason training builds general athleticism, and that athleticism pays off when they compete in their specific sport (tennis, basketball, soccer).
A language model’s offseason is called pretraining, and it consists of training the model on a general-purpose NLP task, such as masking out random words and asking the model to reconstruct the sentence. Since this pretraining task requires no human-labeled data, it is possible to pretrain the model over a vast corpus of text, producing good pretrained representations, or embeddings.
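To make the masked-word objective concrete, here is a minimal sketch using the Hugging Face transformers library and a public BERT checkpoint. It illustrates the general idea, not necessarily the tooling we use in production:

```python
# Illustration of the masked-word pretraining objective using a publicly
# available BERT checkpoint (an assumption for this example, not Cultivate's
# production stack).
from transformers import pipeline

# Load a model that has already been pretrained on the masked-word task.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Mask out a word and ask the model to reconstruct the sentence.
for prediction in unmasker("Thank you for the [MASK] work!"):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")

# Likely completions include words like "hard" or "great" -- a hint that the
# pretrained representations capture useful linguistic regularities.
```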
Afterwards, we throw away the language-modeling head and replace it with a task-specific head, then continue training the model starting from the pretrained parameters. This process is called fine-tuning, and it requires far less data than training a model from scratch.
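Here is a hedged sketch of that fine-tuning step, again with the Hugging Face transformers library: the pretrained encoder is reused, a freshly initialized classification head is attached, and all parameters keep training on labeled examples. The model name, label scheme, and toy batch below are illustrative, not our production setup.

```python
# Sketch of fine-tuning: reuse pretrained weights, swap in a task-specific
# classification head, and continue training on labeled examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels attaches a freshly initialized classification head on top of the
# pretrained encoder; the language-modeling head is discarded.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A toy labeled batch (1 = appreciative, 0 = not) -- purely illustrative.
texts = ["Thank you for the hard work!", "Need you to work overtime on Saturday."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: every parameter starts from its pretrained value and
# is updated jointly with the new head.
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.3f}")
```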
There has been an ongoing arms race in academia and industry to train ever-larger language models. Facebook, OpenAI, and Google have each poured millions of dollars, and enough electricity to power small cities, into this effort. The result is very good pretrained representations that perform well on a variety of tasks, from summarization to sentiment analysis.
Fig 1.1 # of Parameters vs. Time (This figure was adapted from Microsoft and DistilBERT.)
At Cultivate, however, we’re not interested in training bigger and better language models, but in how best to use them. We believe this is our competitive advantage. Just as speedier wide receivers have helped NFL offenses score more points through the passing game, we expect that companies that successfully leverage improved language models will ship more products.
To this end, we’ve combined language models with research in representation learning, weak supervision, and multi-task training to address our unique mix of challenges:
Rapid Prototyping
We often need good-enough models to prototype new features. The primary focus here is on speed rather than accuracy – we want to be able to create these models without going through a long data-collection process.
Comprehensive Coverage
How you say things is just as important as what you say, so we need a wide variety of text classifiers to capture these aspects of communication. We also want a system that is truly passive and can scale across many regions for global organizations, so our NLP pipeline needs to support a wide array of languages. If your team communicates in Urdu, you don’t need to change your workflow to use Cultivate.
Plaintext Free
We guarantee that no other human views your data, and our product is built to enforce that guarantee. This makes dealing with domain-shift problems tricky, as we cannot simply export and label customer data. Instead, we’ve developed techniques that allow us to improve models in a privacy-focused manner.
To meet these needs, we’ve created an NLP system centered around neural embeddings: numerical representations of a piece of text taken from an intermediate layer of a language model. These embeddings are incredibly useful. By leveraging them, we are able to train classifiers from an initial set of as few as 20 examples.
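As a rough illustration of what this looks like, the sketch below encodes messages with an off-the-shelf sentence-transformers model and fits a lightweight scikit-learn classifier on top of the frozen embeddings. The model name and the tiny labeled set are assumptions for the example, not our actual models or data.

```python
# Embedding-centered classification sketch: encode each message into a
# fixed-size vector, then train a small classifier on a handful of examples.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# ~20 labeled examples would go here; two are shown for brevity.
texts = ["Thank you for the hard work!", "Need you to work overtime on Saturday."]
labels = [1, 0]  # 1 = appreciative, 0 = not

# The embeddings are the intermediate numerical representation of each message.
X = encoder.encode(texts)

# A simple classifier on top of frozen embeddings trains in milliseconds,
# which is what makes rapid prototyping from a small labeled set feasible.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(encoder.encode(["Great job on the launch!"])))
```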
Check out part 2, where we explore neural embeddings and some of their useful properties!