Skip-Gram Word2Vec Algorithm Explained
A brief guide explaining how to apply a word-embedding model to any text you choose
Motivation
By no means Word2Vector is NLP’s bread and butter. It goes without saying that this concept is covered in a plethora of articles, courses, and tutorials. However, some of the explanations I came across seemed, in my opinion, to be too sophisticated or mathematically intricate for a blog article.
In this blog post, I intend to shed additional light on what appears to be less obvious. As such, I synthesized a number of sources and created a Colab notebook that offers a straightforward yet insightful explanation.
To demonstrate this concept, I used the transcript of the Pickle Rick episode from Rick and Morty. Enjoy!
Skip-Gram Word2Vec — Intuition
Word2Vec is a fundamental concept in NLP which aims to translate words into embedding vectors. These vectors are unique in the manner they capture the semantic relationships found in natural language.
Naturally, there are a variety of Word2Vec algorithms. The one I decided to present is known as Skip-Gram. In a nutshell, this algorithm maps words to vector representation based on their spatial proximity to other words. It is important to note that the underlying assumption here is that if words are synthetically positioned next to each other, it means that they are also semantically similar. Although it’s not always the case — as you would undoubtedly agree — this algorithm produces good outcomes.
The way this algorithm works is by iterating over all the words in our text input and optimizing their vector representations distance based on their context. The context is defined as a window of words surrounding a specific word. For instance, addressing the sentence:
“ I don’t do magic, Morty, I do science”
If the word ‘Morty’ is our target (center) word, and our context window size is set to 1, the neighboring words would be ‘magic’ and ‘I’. All these pairs of (target word, context word) can be then fed into the model and by that optimizing the word embeddings. This indicates that each pair of words adds the same amount to the model.
How Skip-Gram works?
(1) Model Structure:
The Skip-Gram model is basically a shallow neural network with an input layer, embedding layer and output layer. The model’s aim is to produce an output probability distribution vector given a target word input. This probability distribution vector (which sums to 1) reflects the probability for each word to appear in the target word’s context window. As one might expect, we are interested in locating the probability to be high for words who share the same context and low for words who don’t.
The intriguing part about this is that we just require the model’s weights once it has been trained.
You may wonder how the model accounts for the (target word, context word) pairings that were previously described. The explanation is that the target words function as the input (X_train), while the context words function as the output (y_train).
(2) The Loss Function:
To obtain useful vector embeddings, we have to optimize the weights in our model, which are initially set to be entirely random. The optimization process is carried out in order to minimize the loss function:
How to read this equation? It is merely a nested loop, which requires us to iterate over all (target word, context word) pairs and sum up the probabilities. The minus sign is just for convenience — in ML we prefer minimizing the loss function’s value instead of maximizing it.
(3) How are these probabilities calculated?
In order to calculate the probability distribution, we use the Softmax function which takes into account the dot product of the target embedding vector and embedding vectors of each and every word in our vocabulary:
Some notes and clarifications before we move on to the implementation:
1. Regarding the model purpose — As opposed to a traditional neural network that is used for classification, over here our objective is to train the network in order to extract the weights. In summary, there is no need to use it in the traditional manner of input -> model -> output.
2. Regarding the model’s weights — As opposed to a typical neural network, where the weights are used to handle any input, the skip-gram model assigns unique weights to each and every word. This means that the model’s weights cannot be generalized to handle new words or inputs in the same way. Instead, the weights serve as specific embeddings for each word.
3. Using one hot encoding is not required — According to various blog entries I’ve read, every word is converted into a one-hot encoding vector which we pass to the model (like ‘rick’ = [0,0,0,…1,…0]). Despite being the standard, using a sparse vector makes little sense. Instead, we can just pass the word’s index which is less memory consuming.
Implementation
Let’s follow the next code in order to implement the skip gram algorithm. You can also find the full script in the following Colab notebook:
0. The input
Before starting working on the Skip-Gram algorithm we rather have raw data. Basically, any type of data can be selected. I decided to train the model using the Pickle Rick transcript. A .txt of your choice can be created.
Below is the data I selected:
[Open on Morty combing his hair in the bathroom mirror. He is wearing an orange sweater vest, a yellow dress shirt, and an olive tie.]
Pickle Rick: (offscreen, in the distance) Morty.
Morty: Rick?
[Morty stops combing, looks around, then continues combing]
Pickle Rick: (offscreen, in the distance) Morty!
Morty: Rick?
Pickle Rick: (offscreen, in the distance) Hey, Mooorty!
Morty: Rick? Are you far away, or are you inside something?
...
...
- Imports — We wish to address libraries which supports tokenization, as well as model architecture and visualization.
# imports relevant for tokenization
import nltk
import string
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt') # punctuation words '(),[].:?'
nltk.download('stopwords') # irrelevant words such as “a” “an”
# imports relevant for skip-gram model
import torch
from torch import nn
import torch.optim as optim
# imports relevant for visualiztion
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns
2. Setting Hyper-parameters- Here we can modify the context window size and the embedding size and see how it affects the results.
# context-target pairs selection
context_window_size = 3
# minimum words appreance threshold = 2 (currently not implmented)
# Training hyperparameters
epochs = 50 # amount of iterations over the entire text (corpus)
learning_rate = 0.01
batch_size = 16
embedding_size = 300
Note: It is recommended to sub-sampling high-frequency and low-frequency words in order to speed up training. As of now, this option is not implemented.
3. Data Preprocessing & Tokenization - This step’s goal is to focus on token aggregation, and to support raw text processing while filtering out unnecessary words and punctuations.
text_file_path = '/content/pickle_rick_transcript.txt'
def preprocess_raw_text_file(text_file_path):
# load the text
with open(text_file_path, 'r') as file:
text = file.read().replace('\n', ' ')
# lowercase the text
text = text.lower()
# remove punctuations like '(),[].:?'
text = text.translate(str.maketrans("", "", string.punctuation))
# Ttokenization
tokens = word_tokenize(text)
# remove stop words ('at','so','an','or')
stop_words = set(stopwords.words('english'))
# collect all words which are not stop words and not numbers, keep the order
tokens_after_filtering = []
for token in tokens:
if token not in stop_words and token.isnumeric() == False:
tokens_after_filtering.append(token)
return tokens_after_filtering
tokens = preprocess_raw_text_file(text_file_path)
The tokens variable would contain 3211 different words:
[‘open’, ‘morty’, ‘combing’, ‘hair’, ‘bathroom’, ‘mirror’, ‘wearing’, ‘orange’, ‘sweater’, ‘vest’, ‘yellow’, ‘dress’, ‘shirt’, ‘olive’, ‘tie’, ‘pickle’, ‘rick’…]
4. Generate Context-Target pairs for training- In this section we iterate over all words and collect (target,context) pairs which will be useful in the upcoming training. We have to bear in mind that the pairs order is irrelevant, what matters is how many pairs we have and how many times each word appears. In theory, adding particular pairs should help us perform better throughout training.
def collect_context_target_pairs(tokens,context_window_size):
context_target_pairs = []
for i in range(context_window_size, len(tokens) - context_window_size):
# set target (center) word
target = tokens[i]
# extract sublist with context words (-3,-2,-1,target,1,2,3)
context = tokens[i-context_window_size:i+context_window_size+1]
# remove the target word from context
context.remove(target)
# iterate over words in window
for word in context:
context_target_pairs.append((target, word))
return context_target_pairs
context_target_pairs = collect_context_target_pairs(tokens, context_window_size)
The context_target_pairs variable would contain the following pairs:
[(‘hair’, ‘open’), (‘hair’, ‘morty’), (‘hair’, ‘combing’), (‘hair’, ‘bathroom’)…]
We have about 19230 pairs.
5. Arranging Vectors Representing — Over here tokens are mapped into indices and vice versa. For the training and visualization, this mapping would be employed.
def vector_representation(tokens):
# get the unique tokens
vocabulary = sorted(set(tokens))
# map word to it's corresponding index
word2index = {word: index for index, word in enumerate(vocabulary)}
# map index to it's corresponding word
index2word = {index: word for index, word in enumerate(vocabulary)}
return vocabulary, word2index, index2word
vocabulary, word2index, index2word = vector_representation(tokens)
Word2Index contains the following: {‘aaaah’: 0, ‘aaah’: 1, ‘aah’: 2, ‘abandoned’: 3, ‘able’: 4, ‘absolutely’: 5, ‘access’: 6…}
This index practically replaces the one-hot encoding by conducting the same function.
6. Prepare the data for training — Over here the word pairs are arranged into training data (the corresponding y is the context word, and the target word is x). As shown only the word indexes are saved.
X_train = [] # list for input (target) vectors
y_train = [] # list for output (context) vectors
for target, context in context_target_pairs:
X_train.append(word2index[target])
y_train.append(word2index[context])
# Convert to PyTorch tensors for the model
X_train = torch.LongTensor(X_train)
y_train = torch.LongTensor(y_train)
The X_train and y_train are the following tensors, which contains the corresponding target word — context word indexes:
tensor([442, 442, 442, …, 667, 667, 667])
tensor([ 717, 667, 171, …, 261, 417, 1048])
7. Define the Skip-Gram Model architecture — The discussed model has just two layers, making it a shallow neural network. Each word is mapped to a word embedding according to its index by the first layer. Predicting the context words using word embeddings is the responsibility of the second layer.
class Skip_Gram_Model(nn.Module):
def __init__(self, vocabulary_size, embedding_size):
super(Skip_Gram_Model, self).__init__()
self.embeddings = nn.Embedding(vocabulary_size, embedding_size) # an embedding layer which responsible for map each word to a dense vector
self.linear = nn.Linear(embedding_size, vocabulary_size) # a linear layer which responsible for taking those embeddings and predict the context word.
def forward(self, context_word):
output = self.embeddings(context_word)
output = self.linear(output)
return output
8. Model Training — Over here we initialize our model, optimizer, and loss function first, and then iterate through the data for the predetermined number of epochs.
# init the model instance
model = Skip_Gram_Model(len(vocabulary), embedding_size=embedding_size)
# init the loss and optimizer functions
loss_function = nn.CrossEntropyLoss() # calculates the error rate between the predicted value and the original value
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # a stochastic gradient descent
# visualize the word embeddings before training
visualize_words_embedding(model,"Before Training")
# start the training process
for epoch in range(epochs):
total_loss = 0 # restart loss to 0
# iterate over batch size
for i in range(0,len(X_train),batch_size):
x = X_train[i:i+batch_size]
y_true = y_train[i:i+batch_size]
optimizer.zero_grad()
y_pred = model(x)
loss = loss_function(y_pred, y_true.view(-1))
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f'Epoch num: {epoch+1}, loss value: {total_loss:.3f}')
visualize_words_embedding(model,epoch)
The model’s weights are as follows:
[[-1.6724254 -0.28123945 -0.26258376 … 0.65887684 0.8518337 -1.1281369 ] [ 1.0507011 -1.0241276 -0.8808593 … 0.5823421 -0.7058663 -0.53039455] [ 0.9580243 -1.1931438 0.44672075 … 1.4983618 -0.51639974 -1.378198 ] …
The weight’s shape is (1195, 300) which is logical and as expected
9. Word Embeddings Visualization — Finally, we would extract the word embeddings and visualize them in a 2d space. We should keep in mind that in this case, a dimension reduction is required to display 300d vectors as 2d vectors.
def visualize_words_embedding(model,epoch_number):
# select a small group of words
words_to_visualize = ['rick','ricks','mooorty', 'morty', 'mortys','pickle','antipickle','pickles','angry','great','well','car','beth','goldenfold','wong','dr','guns','animal','jaguar','kids','geez','die','summer','god','serum','family','therapy','fuck','mr','office']
# extract the model weights
word_vectors = model.embeddings.weight.data
# get the word embeddings
indices = [word2index[word] for word in words_to_visualize]
word_vectors = model.embeddings.weight.data[indices]
# fit a 2d t-SNE model to the embedding vectors
tsne = TSNE(n_components=2,perplexity=20)
word_vectors_2d = tsne.fit_transform(word_vectors)
# get a specific color for each dot
colors = sns.husl_palette(n_colors = len(words_to_visualize))
# create a scatter plot of the projection
plt.figure(figsize=(12,12))
# annotate the points on the graph
for i, word in enumerate(words_to_visualize):
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1],c=colors)
plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]))
plt.title(f'Pickle Rick Word Embeddings Visualization - {epoch_number}')
# Save the figure
plt.savefig(f'word_embeddings_epoch_{epoch_number}.png')
Results
Finally, once the training is complete, the outcomes can be visualized. The word embeddings before and after the training are shown in these two scatter plots.
Evaluation
There are various approaches to evaluate the word embeddings’ quality. Listed below 4 of them:
(1) Visualizing — useful for determining if the embedding makes sense. Bear in mind that it could be overwhelming to visualize every word at once, so you might want to choose a group of words that really caught your attention.
(2) Cosine similarity function — This function compares two word embeddings to see how similar they are. If the cosine similarity equals 1 it means the vectors are identical, where value -1 means that the vectors are exactly opposite, and value 0 indicates they are orthogonal.
(3) Word Similarity Task — An evaluation task that includes word pairs with human-annotated similarity scores. Here is one dataset — WordSim-353
(4) Analogies Tasks — this one involves completing word analogies of the form “vec(Pickle Rick) — vec(Rick) + vec(Morty) =>>> vec(pickle Morty)”.
Conclusion
This blogpost has covered the Skip-Gram algorithm architecture and implementation. I hope it was enjoyable!
References
Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
Resources
My Skip-Gram implementation:
Rick & Morty episode’s transcript taken from: https://rickandmorty.fandom.com/wiki/Pickle_Rick/Transcript
Cross Entropy function explained:
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html