Word similarity and analogy with Skip-Gram

In this post, we are going to show the word similarities and word analogies learned by 3 Skip-Gram models trained to learn word embeddings from a 3GB corpus of text scraped from Wikipedia pages.

Skip-Gram is an unsupervised learning model trained to predict the context words of a given target word. During training, Skip-Gram learns a vector representation, called an embedding, for every word in its vocabulary; the embedding size is determined by the width of its hidden layer's weight matrix. These embeddings can capture both semantic and syntactic relations among the learned words, as well as word analogies and similarities.
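As a rough illustration of where the embedding lives in the model, here is a minimal NumPy sketch (not the code used for these experiments; the vocabulary size and initialisation are arbitrary). The input-to-hidden weight matrix is exactly the embedding that gets extracted after training:

```python
import numpy as np

vocab_size, embed_size = 10_000, 200            # arbitrary sizes, for illustration only
rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.01, (vocab_size, embed_size))   # hidden-layer weights = word embeddings
W_out = rng.normal(0.0, 0.01, (embed_size, vocab_size))  # output weights, used only during training

def context_scores(target_index: int) -> np.ndarray:
    """Score every vocabulary word as a possible context word of the target."""
    hidden = W_in[target_index]   # embedding lookup acts as the hidden layer
    return hidden @ W_out         # one raw score per candidate context word
```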

The larger the embedding size, the more word relations Skip-Gram can encode, but training also takes much longer. Moreover, a bigger embedding needs more data to be trained properly and to avoid the risk of underfitting, i.e. not capturing enough useful information.

Thus, for this experiment three models have been trained that share the same structure except for their embedding sizes, which are 100, 200, and 300 respectively. For the rest of the post we will refer to them as Embed100, Embed200, and Embed300 according to their embedding size.

The three models have been trained with a negative sampling loss, using 5 negative words per sample and a distortion factor of 0.75, the same value used by the authors of the original paper. A window size of 5 words has been kept, meaning that up to 5 context words can appear both on the left and on the right of the target word.
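A quick sketch of how the distorted noise distribution for negative sampling can be built (the helper name and the toy counts below are mine, not taken from the post):

```python
import numpy as np

def negative_sampling_probs(word_counts, distortion=0.75):
    """Turn raw unigram counts into the distorted noise distribution
    used to draw negative words: counts ** 0.75, renormalised."""
    counts = np.asarray(word_counts, dtype=np.float64)
    distorted = counts ** distortion
    return distorted / distorted.sum()

probs = negative_sampling_probs([523, 120, 87, 12, 3])   # toy counts
rng = np.random.default_rng(0)
negatives = rng.choice(len(probs), size=5, p=probs)      # 5 negative word indices per sample
```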

For example, given a sentence such as “the quick brown fox jumps over the lazy dog”, each word in the sentence is in turn a target word, and the words within the window centered on that target word are called its context words.

The figure above shows the window sizes below the corresponding words, which have been mapped to indices according to their one-hot encoded representation in the vocabulary list. For example, with a window size of 2, quick and brown are context words of the target word the, while the, brown, and fox are context words of the target word quick.
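The pairing logic illustrated in the figure can be sketched in code as follows (a simplified illustration using the window size of 2 from the example, not the window of 5 used for training):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs using a symmetric window around each target word."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
# First pairs: ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...
print(list(skipgram_pairs(sentence, window=2))[:4])
```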

During training, a batch size of 256 has been used, with the learning rate gradually decreasing from 0.025 to 0.0001 by the end of the training process. As for corpus preprocessing, all rare words appearing fewer than 10 times in total have been removed, and a sub-sampling rate of 1e-3 has been set to sub-sample frequent words.
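For reference, the sub-sampling rule from the original word2vec paper, which I assume is the one meant here, discards an occurrence of a word with a probability that grows with its frequency:

```python
import math
import random

def keep_occurrence(count, total_words, t=1e-3):
    """Decide whether to keep one occurrence of a word (word2vec-style sub-sampling).
    A word with relative frequency f survives with probability min(1, sqrt(t / f))."""
    f = count / total_words
    return random.random() < min(1.0, math.sqrt(t / f))
```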

In addition, all the words in the corpus have been converted to lowercase to make it more uniform. Next, since Skip-Gram's task is to predict context words given the target word, the input matrix for a sentence is created by simply replicating the index of each target word to match the number of its context words.

Finally, the input matrix corresponding to a sentence (in practice, to a whole batch) has the following structure: each row pairs a replicated target-word index with one of its context-word indices.
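Reusing skipgram_pairs from the earlier sketch, the two matrices could be assembled roughly like this (word_to_index is an assumed vocabulary lookup; the actual preprocessing code may differ):

```python
def build_inputs_and_labels(tokens, word_to_index, window=5):
    """Return parallel lists: each target index is repeated once per context word."""
    inputs, labels = [], []
    for target, context in skipgram_pairs(tokens, window=window):
        inputs.append(word_to_index[target])    # replicated target-word index
        labels.append(word_to_index[context])   # matching context-word index
    return inputs, labels
```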

In summary, an input matrix and a label matrix are created from each raw input sentence, providing the inputs and labels for the prediction task.

Word similarity

To evaluate how well the learned word embeddings capture the similarity among the words in the vocabulary, 5 different word similarity datasets have been used:

  • WS353: contains 353 English word pairs along with human-assigned similarity judgments.
  • MEN: consists of 3,000 word pairs together with human-assigned similarity judgments.
  • SIMLEX999: a cleaner benchmark of similarity (as opposed to relatedness). Word pairs were chosen to cover different ranges of similarity, with either high or low association.
  • RW: the dataset has 2034 word pairs which are selected in a way to reflect words with low occurrence frequency in Wikipedia.
  • MTurk: a human-labeled dataset of word semantic relatedness.

Each row of these datasets contains 2 words and a similarity score between them, ranging from 0 (unrelated) to 10 (identical). This similarity score is usually computed as the average of the scores given by several human judges: for example, the pair (old, new) has a score of 1.58 on the MEN dataset since the two words are completely different, while (tiger, tiger) has a score of 10 since the two words are literally the same.

It's also important to mention that the scores were given in the absence of context, which, had it been present, might have influenced the final judgment.

Next, to determine the similarity score between two words as learned by the embedding models, the distance between their corresponding vectors has been computed using either Spearman correlation or cosine similarity. Then, to compare a predicted similarity score s1 with its ground-truth score s2, the absolute value of the difference between s1 and s2 has simply been taken (s2 is first scaled down to match the range of s1, which varies from 0 to 1).

Finally, to measure the correlation score of an embedding against an entire dataset d, the average of the similarity evaluation errors over all word pairs in the dataset has been computed and subtracted from 1:

S_d = 1 - \frac{1}{|d|}\sum_{i=1}^{|d|}\left|\,\mathrm{sim}(w_{i1}, w_{i2}) - s_i\,\right|

where |d| is the number of word pairs in the dataset, sim is the similarity function, w_{i1} and w_{i2} are the first and second word of pair i, and s_i is the ground-truth similarity score of pair i. Higher values therefore mean better results, since the predicted similarity scores are closer to the real ones.
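Put into code, the evaluation could look roughly like this (a sketch assuming the ground-truth scores have already been rescaled to [0, 1], and that embeddings and word_to_index stand for the trained embedding matrix and vocabulary lookup):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dataset_score(pairs, embeddings, word_to_index):
    """pairs: iterable of (word1, word2, gold) with gold already scaled to [0, 1].
    Returns S_d = 1 - mean(|sim(w1, w2) - gold|)."""
    errors = []
    for w1, w2, gold in pairs:
        sim = cosine_similarity(embeddings[word_to_index[w1]],
                                embeddings[word_to_index[w2]])
        errors.append(abs(sim - gold))
    return 1.0 - float(np.mean(errors))
```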

The following tables report the similarity scores computed on the 5 datasets for each of the 3 Skip-Gram models, according to the two different similarity functions. Note that the last three columns in each header row follow the format similarity_functionYYY, where YYY is the embedding size.

dataset     spear100  spear200  spear300
WS353       0.765     0.713     0.686
MEN         0.826     0.798     0.774
SIMLEX999   0.784     0.789     0.780
RW          0.715     0.687     0.670
MTurk       0.701     0.664     0.637

Word similarity on 5 different similarity datasets and 3 embedding sizes, computed with Spearman correlation
dataset     cosine100  cosine200  cosine300
WS353       0.773      0.713      0.694
MEN         0.833      0.806      0.781
SIMLEX999   0.783      0.791      0.784
RW          0.723      0.694      0.677
MTurk       0.708      0.667      0.643

Word similarity on 5 different similarity datasets and 3 embedding sizes, computed with cosine similarity

The results show that Embed100 performs best, followed by Embed200 and Embed300. This is probably due to the limited amount of training data: Embed200 and Embed300 likely underfit and would have achieved better scores if trained on a bigger corpus.

Regarding the employed metrics, no significant differences have been observed between the scores computed with Spearman correlation and those computed with cosine similarity.

Next, to better understand the extent of the mistakes made by the Skip-Gram model, the table below reports some word pairs extracted from the WS353 dataset along with their ground-truth similarity score and the corresponding prediction made by the Embed200 model.

word1       word2           similarity  predicted
tiger       cat             0.73        0.5
tiger       tiger           1           1
book        paper           0.74        0.46
computer    internet        0.75        0.48
plane       car             0.57        0.36
telephone   communication   0.75        0.39
television  radio           0.67        0.61
bread       butter          0.62        0.75
cucumber    potato          0.59        0.64

Word similarity on single word pairs from the WS353 dataset, as predicted by the Embed200 model

Analyzing the results reported in the table above, as well as many other word pairs, it has been observed that all three models perform better when predicting the similarity scores of uncorrelated words than those of correlated ones.

Similarity visualization

In this section, we are going to visualize some charts displaying interesting similarity properties among the words learned by the Embed200 model. To display them, PCA has simply been applied to reduce the embedding dimension from 200 to 2.

  • Present and past verb forms: each verb’s present form is close to its past simple tense.
  • Athlete and nationality: the model is able to associate some famous athletes with their native country.
  • Capital and nation: each capital is close to its corresponding nation; moreover, capitals and nations are also clustered by continent, with Asia at the top, North America on the right, and Europe on the left.
  • Male and female names: people’s names are separated into two groups according to their gender.
  • Country and cities: 12 cities are magically grouped into 3 clusters corresponding to their respective countries: Italy, China, and the US.
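As a rough sketch of how such 2D projections can be obtained (plain NumPy; the post does not show its plotting code, and the word list in the comment is just an example):

```python
import numpy as np

def pca_2d(vectors):
    """Project embedding vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=np.float64)
    X = X - X.mean(axis=0)                        # centre the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                           # shape: (n_words, 2)

# words = ["rome", "italy", "paris", "france"]
# points = pca_2d([embeddings[word_to_index[w]] for w in words])
# Each row of `points` can then be scattered and labelled with its word.
```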

Word analogy

Word analogy evaluation has been performed on the Google Analogy dataset, which contains 19,544 questions (8,869 semantic and 10,675 syntactic) covering 14 types of relations (9 morphological and 5 semantic).

A typical semantic question has the following form: rome is to italy as athens is to ?, where the correct answer is greece. Similarly, a syntactic question can be, for example: slow is to slowing as run is to ?, where the correct answer is clearly running.

In those examples we can also note that none of the words contain capital letters. This is because the models have been trained only on lower-cased words: two words spelled the same way, one with capital letters and one without, would be treated as different words (i.e. Rome is different from rome).

The second rule worth remarking on is that, in order to be considered right, the predicted answer must be exactly the same as the expected one, not just similar (i.e. if the correct answer is run and the Skip-Gram model predicts running, the answer is counted as wrong).

The algorithm used to predict the answer is very simple: each question is given in the form x is to y as z is to ?, and the model has to predict the answer w. To find it, we can simply subtract x from y and then add z (i.e. italy − rome + athens should be close to greece).

However, in practice, the above equation almost never yields a vector that exactly matches a vocabulary word; it returns a new vector v which should be very similar to the vector w representing the correct word. Thus, to find w, v is compared to all the other vectors in the embedding using cosine similarity, and the candidate answer is the word whose vector is closest to v.
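A minimal sketch of this procedure (excluding the three query words from the candidates is my assumption, a common convention the post does not state explicitly):

```python
import numpy as np

def answer_analogy(x, y, z, embeddings, word_to_index, index_to_word):
    """Answer 'x is to y as z is to ?' by the nearest cosine neighbour of y - x + z."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit vectors
    v = E[word_to_index[y]] - E[word_to_index[x]] + E[word_to_index[z]]
    scores = E @ (v / np.linalg.norm(v))          # cosine similarity to every word
    for w in (x, y, z):                           # assumption: drop the query words
        scores[word_to_index[w]] = -np.inf
    return index_to_word[int(np.argmax(scores))]

# answer_analogy("rome", "italy", "athens", embeddings, word_to_index, index_to_word)
# should ideally return "greece"
```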

To save computation time, the evaluation has been performed on a subset of 100 questions (60 semantic and 40 syntactic). The following table reports the number of questions correctly answered by each model, split into the two categories:

            embed100  embed200  embed300
Semantic    38/60     42/60     43/60
Syntactic   10/40     14/40     11/40
Total       48/100    56/100    54/100

Word analogy on the Google Analogy dataset with 3 different embedding sizes

We can see from the table above that the three models correctly answer roughly half of the questions, with Embed200 and Embed300 doing slightly better than Embed100. However, all of them are much better at answering semantic questions than syntactic ones.

To analyze this problem, let's have a look at some syntactic questions along with the corresponding wrong predictions made by Embed200:

  1. Question: slow is to slowing as go is to ? (Answer: going; Prediction: goes)
  2. Question: slow is to slowing as describe is to ? (Answer: describing; Prediction: characterize)
  3. Question: competitive is to uncompetitive as logical is to ? (Answer: illogical; Prediction: logic)
  4. Question: slow is to slowing as read is to ? (Answer: reading; Prediction: write)

Among these questions, in the first one the verb has been correctly guessed, but its form is not the desired one. The second question has been answered with a word whose meaning is similar to the right word, but not the same. The third question has been answered with the contrary of the right word, while in the last question the predicted word, despite being wrong, still has a semantic relation with the right word.

Thus, although the three models show a drop in performance on syntactic questions, they can still capture some semantic structure in those sentences, predicting words that, despite being different from the right answer, maintain a sort of semantic correlation with it.

As usual, it's very likely that more training data would have boosted the accuracy of the three Skip-Gram models; Embed300 in particular would have benefited the most, since its lack of accuracy is probably due to the limited amount of data it was trained on.

We can also have a look at some semantic questions to get an idea of what they look like, along with the corresponding model answers:

  1. Question: bamako is to mali as beijing is to ? (Answer: china; Prediction: china)
  2. Question: rome is to italy as yerevan is to ? (Answer: armenia; Prediction: azerbaijian)
  3. Question: rome is to italy as antananarivo is to ? (Answer: madagascar; Prediction: oyem)
  4. Question: baku is to azerbaijan as funafuti is to ? (Answer: tuvalu; Prediction: marquesas)

In this set of semantic questions, the model manages to correctly predict the answer to the first question but fails on the other three.

I have to admit that if the same questions had been asked to me, I would have had no idea what the right answers were; for the model, though, the usual rule still applies: more training data might have given it a chance to answer correctly.

In fact, it's very likely that the Skip-Gram model encountered those words only a few times in the corpus, and thus was not able to fully capture their semantic and syntactic properties and build an effective vector representation for them.
