Word Embeddings

A quick implementation for adding complexity to language data

S. T. Lanier
3 min read · Mar 8, 2021
Photo by Tamara Gak on Unsplash.

If you’ve played around with language data before (classifying news articles by category, predicting a paragraph’s author, classifying tweets by sentiment), you’re already familiar with some of the ways words (or sometimes characters) get turned into numbers, which can then be fed into some sort of classifier.

A simple solution would be a kind of one-hot encoding, or bag of words, over the lexicon. If there are a total of 10,000 unique words in the documents being classified, then each document gets a 10,000-dimensional vector, with each dimension a count of the number of times a given word appears in that document. For example, a tweet that says “I like cats, and I like birds,” with vector positions that correspond to {0: I, 1: like, 2: and, 3: cats, 4: dogs, 5: birds, 6: rabbits}, would be encoded as the vector [2, 2, 1, 1, 0, 1, 0]. That’s two incidences each of “I” and “like”; one incidence each of “and,” “cats,” and “birds”; and no incidence of “dogs” or “rabbits.” That vector lives in a lexicon of only 7 dimensions. You can imagine how unwieldy a 10,000-dimensional vector would be, having only 5 or so non-zero numbers and roughly 9,995 zeros. This is classically called the curse of dimensionality.
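As a minimal sketch in plain Python (the tokenization below is deliberately crude, just enough to reproduce the example), the tweet above can be counted into that 7-position vector like so:

from collections import Counter

# The 7-word lexicon from the example above, in vector-position order
lexicon = ['i', 'like', 'and', 'cats', 'dogs', 'birds', 'rabbits']

tweet = 'I like cats, and I like birds'
tokens = [word.strip(',').lower() for word in tweet.split()]

# One count per lexicon position -> [2, 2, 1, 1, 0, 1, 0]
counts = Counter(tokens)
vector = [counts[word] for word in lexicon]
print(vector)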

Another, more sophisticated approach is term frequency–inverse document frequency (TF-IDF) vectorization. It has the same dimensionality as the previous example, but it gives more “weight” to words that appear in fewer of the documents. For example, a document might only get 0.1 point for an appearance of “the,” a very common word, but 5.0 points for a rare word like “insidious.” With either of these methods, you would likely have to employ some sort of dimensionality reduction to whittle down the size of these vectors.
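For instance, scikit-learn’s TfidfVectorizer (assuming scikit-learn is installed; the three documents below are made up for illustration) turns a list of raw documents into these weighted vectors in a couple of lines:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the cat sat on the mat',
    'the insidious plot was foiled',
    'the dog sat on the log',
]

# One row per document; very common words like 'the' end up with low weights
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))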

Add to these one more approach: word embeddings. These famously capture relationships between words, and could be used to predict that, given a sentence like “The king is on his throne,” the missing word in the related sentence “The _____ is on her throne” is “queen.”

Word embeddings are actually created by a separate neural network, one given the task of predicting a word from the words that surround it (or, in the skip-gram variant of Word2Vec, predicting the surrounding words from the word itself). You don’t keep the neural network itself, just the weights from its hidden layer, which are a kind of feature encoding for each word. “King” and “queen” would score similarly along whatever feature captures sovereignty, but would score oppositely along whatever feature captures binary gender.
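As a toy sketch of what that buys you (the two “features” and every number below are made up purely for illustration; real embeddings have hundreds of unlabeled dimensions), the famous king/queen analogy falls out as simple vector arithmetic:

import numpy as np

# Made-up 2-feature vectors, [royalty, gender], purely for illustration
man = np.array([0.1, 0.9])
woman = np.array([0.1, -0.9])
king = np.array([0.9, 0.9])
queen = np.array([0.9, -0.9])

# The classic analogy: king - man + woman lands on the queen vector
print(king - man + woman)  # [ 0.9 -0.9], i.e. queen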

There are two major ways to implement an embedding in Python, assuming you aren’t making one from scratch. The first is to use a library like Gensim or a set of pretrained vectors like GloVe. Using Gensim’s Word2Vec model, you give it your training data as tokenized documents, and it learns a vector for every word in the vocabulary. From there, you can use it to produce a similarity score between words and to come up with a list of most (or least) similar words. A second option is GloVe, a collection of pretrained word vectors put out by Stanford. These word vectors are great for then passing the vectorized data off to a classifier that isn’t a neural network.

from gensim.models import Word2Vec

# docs is a list of tokenized documents, e.g. [['i', 'like', 'cats'], ...]
w2v_model = Word2Vec(sentences=docs, vector_size=100)
w2v_model.wv.most_similar('apple')
>>>[('shop', 0.9994764924049377),
>>> ('pop', 0.9984753131866455),
>>> ('open', 0.9978106617927551),
>>> ('temp', 0.9976313710212708),
>>> ('core', 0.9967163801193237),
>>> ('dt', 0.995648205280304),
>>> ('set', 0.995041012763977),
>>> ('congress', 0.9946262240409851),
>>> ('austin', 0.9938600063323975),
>>> ('sixth', 0.992957353591919)]
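If you’d rather use Stanford’s pretrained GloVe vectors than train on your own corpus, Gensim’s downloader API can fetch a set by name. This is a sketch assuming the glove-wiki-gigaword-100 package, which is a sizeable download on first use:

import gensim.downloader as api

# Loads pretrained 100-dimensional GloVe vectors (downloaded on first call)
glove = api.load('glove-wiki-gigaword-100')

glove.most_similar('king')         # nearest neighbours in the embedding space
glove.similarity('king', 'queen')  # a single cosine-similarity score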

The implementation is even easier for neural networks in Python using a library like Keras, where the embedding is often a single line of code as the first layer of your RNN:

import keras

model = keras.Sequential()
# input_dim = vocabulary size, output_dim = embedding dimension,
# input_len = length of each (padded) input sequence
model.add(keras.layers.Embedding(input_dim, output_dim, input_length=input_len))
model.add(keras.layers.LSTM(64))
model.add(keras.layers.Dense(1, activation='sigmoid'))
...

Word embeddings are a good idea if something like TF-IDF doesn’t seem to be capturing the full complexity of your data.

