In this five-part series, we’ll explore Cultivate’s NLP system and how we’ve capitalized on research from all across machine learning to create it.
Inside Cultivate’s Unique NLP System: Part 2
In our last post, we introduced some of the unique problems we face at Cultivate, which we’ve solved by leveraging neural embeddings. In this post, we’ll explain the concept of neural embeddings and the useful properties they have.
For any given piece of text, its embedding is a vector, an array of d numbers. Embeddings give us a way to convert natural language into a numerical representation. Ideally this numerical representation should encode the semantic information of the text.
This is similar to a hash function, which converts a piece of text into a number. However, unlike a hash function, which is designed so that even a small change in the text leads to a very different hash, our embeddings are designed so that two sentences with similar meanings are mapped to similar vectors and end up “close” to each other.
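To make the contrast concrete, here is a minimal sketch of the hash side of the comparison. It uses SHA-256 from Python's standard library purely as an example hash function; the helper names are ours and are not part of Cultivate's system.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fraction_differing(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Two sentences with nearly identical meaning, differing by one character.
hash_a = sha256_hex("Can we meet at 6?")
hash_b = sha256_hex("Can we meet at 7?")

print(hash_a)
print(hash_b)
print(f"{fraction_differing(hash_a, hash_b):.0%} of hex characters differ")
```

Even though the two sentences mean almost the same thing, their hashes share essentially nothing. An embedding model is meant to behave in the opposite way.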
We usually define “closeness” as the cosine distance between two embeddings.
This seems like a simple property, but it’s very useful.
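As a rough illustration, cosine distance can be computed as one minus the cosine of the angle between two vectors. The 2-dimensional vectors below are made up for this example and do not come from any real model.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 for same direction, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical 2-D embeddings, invented for illustration only.
meet_at_6 = [0.9, 0.1]      # "Can we meet at 6?"
meet_at_7 = [0.8, 0.2]      # "Can we meet at 7?"
nice_weather = [0.1, 0.9]   # an unrelated sentence

near = cosine_distance(meet_at_6, meet_at_7)
far = cosine_distance(meet_at_6, nice_weather)
print(f"similar meaning:   {near:.3f}")
print(f"different meaning: {far:.3f}")
```

Because cosine distance depends only on the direction of the vectors, scaling an embedding up or down does not change how close it is to another one.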
To better understand embeddings, let’s look at a “good” and “bad” embedding space for four sentences in 2 dimensions.
Bad Embedding Space: A sentence’s meaning has no impact on its location.
Good Embedding Space: Sentences that are similar to each other are close to each other.
Note how in the good embedding space, sentences that are close to each other have similar meanings, while in the bad embedding space, a sentence’s meaning has no bearing on its location.
We prefer the good embedding space because it’s much easier to learn classifiers that separate categories there. We can easily draw a line to separate the two classes in the good embedding space, but that’s not the case in the bad embedding space.
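The “draw a line” argument can be checked with a small sketch. Below, a basic perceptron (our choice of linear classifier for illustration, not necessarily what we use in production) tries to separate two classes of two sentences each. The coordinates are invented: the good layout groups each class together, while the bad layout scatters the classes across opposite corners.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Try to learn a line w.x + b = 0 separating labels +1 and -1.

    Returns (weights, bias, converged), where converged is True only if
    a full pass over the data produced no misclassifications.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            score = w[0] * x + w[1] * y + b
            if label * score <= 0:  # wrong side of the line (or on it)
                w[0] += label * x
                w[1] += label * y
                b += label
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


labels = [1, 1, -1, -1]

# Hypothetical "good" space: each class occupies its own region.
good_points = [(1.0, 1.2), (1.2, 0.9), (-1.0, -0.8), (-1.1, -1.2)]
# Hypothetical "bad" space: classes sit on opposite corners (XOR layout).
bad_points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]

_, _, good_converged = train_perceptron(good_points, labels)
_, _, bad_converged = train_perceptron(bad_points, labels)
print("good space separable by a line:", good_converged)
print("bad space separable by a line: ", bad_converged)
```

No straight line can split the bad layout, so the perceptron never finishes a pass without a mistake, whereas the good layout is separated almost immediately.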
While it’s trivial to create a good embedding space for these four examples, modern embeddings research focuses on how to uphold this property for any two sentences you pass into the model. This makes it easy to train a classifier for any two categories. To do this, embedding models use a higher-dimensional space with hundreds of dimensions rather than just two.
We often visualize the embedding space to better understand our models and data. Below I have projected the embedding space of an internal Cultivate dataset of roughly 40,000 Slack messages down to two dimensions. Each example (dot) is colored according to its predicted intent class. The classes are pretty straightforward and represent the intent of the message. For example, a message like “Can we meet at 6?” should be categorized as “Request-Scheduling”.
Fig 1.2: Internal Dump Embedding. Each Slack message is a dot in this image. The color of each dot is the label predicted by our model.
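For readers who want to try a projection like this themselves, here is one possible sketch. It uses principal component analysis (PCA) computed with power iteration, which is an assumption on our part for illustration: this post does not depend on a specific projection method, and other techniques would work too. The 4-dimensional input vectors are invented stand-ins for real embeddings.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-dimension mean from every row."""
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / len(rows) for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def top_eigenvector(matrix, start, iterations=500):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    v = list(start)
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their two directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    dims = len(cov)
    first = top_eigenvector(cov, [1.0] * dims)
    eigenvalue = dot(first, [dot(row, first) for row in cov])
    # Deflation: remove the first direction, then find the next largest one.
    deflated = [[cov[i][j] - eigenvalue * first[i] * first[j]
                 for j in range(dims)] for i in range(dims)]
    second = top_eigenvector(deflated, [1.0 + i for i in range(dims)])
    projected = [[dot(row, first), dot(row, second)] for row in centered]
    return projected, first, second


# Hypothetical 4-D embeddings: two loose groups of three messages each.
rows = [
    [2.0, 0.5, 0.1, 0.0],
    [1.8, -0.4, 0.0, 0.1],
    [2.2, 0.1, -0.1, 0.0],
    [-2.0, 0.6, 0.0, -0.1],
    [-1.9, -0.5, 0.1, 0.0],
    [-2.1, -0.3, -0.1, 0.0],
]

projected, first, second = pca_2d(rows)
for x, y in projected:
    print(f"{x:+.2f} {y:+.2f}")
```

Each output row is a 2-D point that could be drawn as one dot in a scatter plot, colored by its predicted class.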
There’s some very interesting structure in this image. Note how each class corresponds to a particular area in the embedding space. There’s also a correlation between the intraclass variance and the size of the blob.
In other words, if a particular text class is very diverse, containing a wide range of sentences, we would expect it to be spread out over a larger area in the embedding space. Similarly, we expect blobs that are closely grouped together to represent very homogeneous classes.
In this classification problem, you would expect Null to be the most diverse class, since it’s the default extra class. This is supported by the embedding space, as the Null class is spread over the largest area. Say-Conventional is spread over the smallest area. This makes sense since Say-Conventional is defined as conversational pleasantries like “Hi, Hello, Nice to meet you!”, which are quite homogeneous.
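One simple way to put a number on how spread out a class is would be the mean distance from each point to the class centroid. This is a sketch with invented 2-D coordinates, and the choice of Euclidean distance here is an assumption made for illustration.

```python
import math


def centroid(vectors):
    """Per-dimension mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dims)]


def spread(vectors):
    """Mean Euclidean distance from each vector to the class centroid."""
    center = centroid(vectors)
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


# Hypothetical 2-D embeddings grouped by predicted class.
embeddings_by_class = {
    "Null": [[-3.0, 2.0], [2.5, -1.5], [0.0, 3.0], [1.0, -3.0]],
    "Say-Conventional": [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]],
}

spreads = {name: spread(vecs) for name, vecs in embeddings_by_class.items()}
for name, value in sorted(spreads.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

A diverse class such as Null should produce a large value, and a homogeneous class such as Say-Conventional a small one.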
Furthermore, note how all the Request classes are clustered together to the right and how Inform-Agree and Inform-Disagree are right next to each other. This tells us that the Request classes tend to be more similar to each other than to the other classes, which makes sense, since they are all requests.
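The same kind of observation can be made numerically by comparing the distances between class centroids. The sketch below uses made-up 2-D coordinates for three of the classes mentioned in this post and finds the pair of classes whose centroids are closest.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Per-dimension mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dims)]


# Hypothetical 2-D embeddings, invented for illustration only.
embeddings_by_class = {
    "Request-Scheduling": [[4.0, 0.2], [4.2, -0.1], [3.9, 0.0]],
    "Inform-Agree": [[-1.0, 1.0], [-1.2, 1.1], [-0.9, 0.8]],
    "Inform-Disagree": [[-1.1, 0.4], [-0.8, 0.5], [-1.0, 0.3]],
}

centroids = {name: centroid(vecs) for name, vecs in embeddings_by_class.items()}

# Distance between every pair of class centroids.
centroid_distances = {
    (a, b): math.dist(centroids[a], centroids[b])
    for a, b in combinations(centroids, 2)
}

closest_pair = min(centroid_distances, key=centroid_distances.get)
for pair, distance in sorted(centroid_distances.items(), key=lambda item: item[1]):
    print(f"{pair[0]} <-> {pair[1]}: {distance:.2f}")
print("closest pair:", closest_pair)
```

Classes whose centroids sit close together are the ones a model is most likely to confuse, so a table like this can be a quick sanity check on a label scheme.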
Model interpretability and exploration are just one of the many uses of embeddings at Cultivate.
Check out part 3, where we show how we use embeddings to quickly prototype text classifiers with just a few training examples.