What Are Word Embeddings?

An introduction to the AI that powers language understanding

Santiago M. Quintero
Geek Culture


Word embeddings are used in almost every commercial application that involves AI and human language. Some example applications include search engines, social media recommendation algorithms, language translation, speech recognition, market research, automated trading, and language generation.

Word embeddings are numerical representations of a word’s meaning. They are formed based on the assumption that meaning is contextual. That is, a word’s meaning is dependent on its neighbors:

A sliding window to find the word’s neighbors. [1]

For example, if the word “ice” usually appears next to “water”, one could infer that both words have a similar meaning.

Word embeddings are represented as mathematical vectors. This representation enables us to perform standard mathematical operations on words, like addition and subtraction.

These operations have interesting applications in language, like finding synonyms, classifying documents, or recommending content. Additionally, 2-dimensional vectors can be plotted to produce a visual understanding of a document or a person’s language.

Sample word embeddings and their respective graphical representation.

Application: Finding Synonyms

Finding synonyms is one of the simplest applications of word embeddings. A synonym is a word or phrase that means exactly or nearly the same as another word or phrase. And since word embeddings are numerical representations of a word’s meaning, finding a word’s synonyms reduces to finding the vectors that are closest to its vector.

Clusters of words based on their similarity.

The first step to finding synonyms is to select a distance metric to compare the closeness, or similarity, between two vectors. One of the most common metrics is the Euclidean distance, which takes the square root of the sum of the squared differences across every dimension of the two vectors:

Euclidean Distance Calculation
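
As a quick, hedged sketch (the euclidean function name is mine, not code from the story’s repository), this is how the Euclidean distance could be computed in TypeScript:

/* euclidean: compute the Euclidean distance between two vectors. */
const euclidean = (a: number[], b: number[]) => {
  // Only compute the distance if the vectors have the same length.
  if (a.length !== b.length) return Infinity
  // Sum the squared difference across every dimension.
  const sum = a.reduce((d, val, idx) => d + (val - b[idx]) ** 2, 0)
  // Take the square root of the summed squares.
  return Math.sqrt(sum)
}
euclidean([3, 4], [0, 0]) // Returns 5
// Math.sqrt((3 - 0) ** 2 + (4 - 0) ** 2) = Math.sqrt(9 + 16) = 5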

Another common metric is the absolute-value distance (also known as the Manhattan distance). In TypeScript, this is how to measure it:

/* Similarity: compute the absolute distance for two vectors. */
const similarity = (a: number[], b: number[]) => {
  // Only compute the distance if the vectors have the same length.
  if (a.length !== b.length) return Infinity
  // Sum the absolute value difference across every dimension.
  const delta = a.reduce((d, val, idx) => d + Math.abs(val - b[idx]), 0)
  // Return the distance as a proxy of a vector's similarity.
  return delta
}
similarity([3, 4], [1, 2]) // Returns 4
// Math.abs(3 - 1) + Math.abs(4 - 2) = 2 + 2 = 4
similarity([3, 4], [1, 6]) // Also returns 4
// Math.abs(3 - 1) + Math.abs(4 - 6) = 2 + 2 = 4
// 3D vectors:
similarity([3, 4, 5], [4, 6, 8]) // Returns 6
// Math.abs(3 - 4) + Math.abs(4 - 6) + Math.abs(5 - 8) = 1 + 2 + 3 = 6

Interestingly, it is also possible to find antonyms using word embeddings. The only difference is finding the vectors that maximize the distance to a word.
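
To make this concrete, here is a minimal sketch of a synonym search (the findSynonym name and the wordEmbeddings dictionary shape are assumptions for illustration; the similarity function is the one defined above):

/* findSynonym: return the word whose vector is closest to the target word. */
const findSynonym = (word: string, wordEmbeddings: Record<string, number[]>) => {
  const target = wordEmbeddings[word]
  let closest = ''
  let minDistance = Infinity
  for (const [candidate, vector] of Object.entries(wordEmbeddings)) {
    // Skip the word itself.
    if (candidate === word) continue
    // Reuse the absolute-value similarity function defined above.
    const d = similarity(target, vector)
    if (d < minDistance) {
      minDistance = d
      closest = candidate
    }
  }
  return closest
}
// Finding an antonym would be the same loop, keeping the maximum distance instead.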

Application: Topic Classification

Topics can be labeled using word clusters.

Word embeddings and distance metrics are also useful for labeling documents by topic. The process starts with a labeled dataset of documents classified by topic. Then, transform each document’s content into word embeddings and average the positions of the resulting vectors:

/*
 * getCenter: average the positions of a matrix of
 * word embeddings to find the "center" of a document.
 */
export const getCenter = (vectors: number[][]) => {
  const dimensions = vectors[0].length
  const dimensionArr = [...Array(dimensions)]
  // Iterate through each dimension.
  const center = dimensionArr.map((_, idx) => {
    // Sum the value of the dimension (idx) for each vector.
    const dimensionSum = vectors.reduce((d, vector) => d + vector[idx], 0)
    const dimensionAvg = dimensionSum / vectors.length
    // Return the average for each dimension.
    return dimensionAvg
  })
  // Return a vector with the same shape, holding the averaged values.
  return center
}
getCenter([[1,2], [3,4]]) // Returns [2,3]
// [(1+3)/2, (2+4)/2] = [4/2, 6/2] = [2,3]
getCenter([[2,3,3], [4,4,-1], [0,2,4]]) // Returns [2,3,2]
// [(2+4+0)/3, (3+4+2)/3, (3-1+4)/3] = [6/3, 9/3, 6/3] = [2,3,2]

We can think about the center of a document as the document’s embedding: it is the numerical representation of its content.

The next step is to derive each topic’s center: in a similar process, we find, for every topic, the average position of its documents’ embeddings. Finally, when we want to classify an unlabeled document, we transform its content into a vector representation and use the distance metric to find the closest topic.
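
As a rough sketch of that classification step (the classifyDocument name and the topicCenters dictionary are assumptions for illustration; tokenize is the helper defined later in this story, and getCenter and similarity are the functions above):

/* classifyDocument: assign an unlabeled document to the closest topic. */
const classifyDocument = (
  text: string,
  wordEmbeddings: Record<string, number[]>,
  topicCenters: Record<string, number[]>
) => {
  // Transform the document's content into word embeddings.
  const tokens = tokenize(text) || []
  const vectors = tokens
    .map(token => wordEmbeddings[token])
    .filter(vector => vector !== undefined)
  // Compute the document's embedding as the center of its word vectors.
  const docCenter = getCenter(vectors)
  // Find the topic whose center is closest to the document's center.
  let closestTopic = ''
  let minDistance = Infinity
  for (const [topic, center] of Object.entries(topicCenters)) {
    const d = similarity(docCenter, center)
    if (d < minDistance) {
      minDistance = d
      closestTopic = topic
    }
  }
  return closestTopic
}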

Reflection: How would you classify documents using exclusively unlabeled data? That is, with unsupervised learning.

How Are Word Embeddings Built?

As previously mentioned, the idea behind word embeddings is that the meaning of a word is related to its context. Consequently, word embeddings are built by mapping the words that frequently appear close to each other. The process consists of 3 steps:

  1. Tokenization: split, classify, and find the unique words in a corpus.
  2. Co-occurrence matrix: map the words that are close to each other.
  3. Dimensionality Reduction: compress the size of the co-occurrence matrix.

1. Tokenization

There is a long tradition in Natural Language Processing (NLP) of splitting and normalizing words, which includes stemming and lemmatization. To give a sense of the difficulties involved in splitting a text, consider the following[1]:

  • Words that have the same meaning, including plurals and conjugated verbs.
  • Pronouns, prepositions, and articles that appear frequently but contribute little additional meaning.
  • Abbreviations and compound words like N.Y.C. or New York.
  • Words with internal hyphens or apostrophes.
  • Numbers, symbols, and punctuation signs like parenthesis or ellipsis.
  • Orthographic errors.

To keep things simple, we are going to rely on a short RegEx:

/* @function tokenize: split the words in a text or document. */
const tokenize = (text: string) => text.match(/(\b[^ $]+\b)/g)
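
A quick usage example; note how the word-boundary match drops the trailing period:

tokenize('I like deep learning.') // Returns ["I", "like", "deep", "learning"]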

With the advent of Deep Learning, tokenization has partially lost relevance: in theory, the best AI models handle irregularities by themselves, and building tokenizers is a slow, manual process. In practice, however, using tokenizers or other input transformations can dramatically reduce training time and improve accuracy.

2. Co-occurrence Matrix

The co-occurrence matrix contains the frequency with which two words occur next to each other. Building the matrix is a four-step process:

  1. Define a set with all the unique words in the training dataset.
  2. Create a square matrix where each row and column represents a word.
  3. Count the occurrences of neighbor words based on an N-word window.
  4. Insert the count to the corresponding cell in the matrix.

The co-occurrence matrix for the following 3 sentences looks like this:

1. I like deep learning.
2. I enjoy flying.
3. I like NLP.

The above example worked with a 1-word window. To illustrate how this differs based on the window size, consider the following sentence:

// For a 3-word window:
const text = 'I enjoy learning about Natural Language Processing.'
// enjoy & learning are adjacent.
// enjoy & about are adjacent.
// enjoy & Natural are adjacent.
// enjoy & Language are NOT adjacent.

There are slight variations as to how to build a co-occurrence matrix. These include weighting counts based on closeness, asymmetric windows, and using punctuation signs to determine dynamically sized windows. You can look at the complete code to build a co-occurrence matrix in the story’s repository.
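
For illustration only, here is a minimal sketch of the basic idea with a symmetric 1-word window and no weighting (the buildCooccurrence name is mine; refer to the repository for the complete version):

/* buildCooccurrence: count neighboring words using a 1-word window. */
const buildCooccurrence = (sentences: string[]) => {
  // 1. Tokenize every sentence and collect the unique words.
  const tokenized = sentences.map(sentence => tokenize(sentence) || [])
  const vocabulary = [...new Set(tokenized.flat())]
  const index = new Map<string, number>()
  vocabulary.forEach((word, i) => index.set(word, i))
  // 2. Create a square matrix where each row and column represents a word.
  const matrix = vocabulary.map(() => vocabulary.map(() => 0))
  // 3. Count the occurrences of neighboring words (1-word window).
  for (const tokens of tokenized) {
    tokens.forEach((word, i) => {
      if (i + 1 >= tokens.length) return
      const row = index.get(word)!
      const col = index.get(tokens[i + 1])!
      // 4. Insert the count symmetrically into the corresponding cells.
      matrix[row][col] += 1
      matrix[col][row] += 1
    })
  }
  return { vocabulary, matrix }
}
buildCooccurrence(['I like deep learning.', 'I enjoy flying.', 'I like NLP.'])
// The cell for ("I", "like") contains 2, because they appear next to
// each other in two of the three sentences.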

3. Dimensionality Reduction

Theoretically, we are finished: we could use the rows of the co-occurrence matrix as word embeddings. But look again at the matrix and notice how sparse it is. There are several issues associated with this:

  • The vectors would take too much storage space.
  • Training models on top of these vectors would be slow.
  • Relations between words would be difficult to notice.

In that sense, the final step to building word embeddings is to reduce the dimensions of the co-occurrence matrix. A small text corpus can have tens of thousands of unique words, yet word embeddings tend to have fewer than 1,000 dimensions. For example, the Universal Sentence Encoder vectors in TensorflowJS have 512 dimensions.

There are sophisticated methods in deep learning to compress a matrix, but to keep things simple, we are going to use Principal Component Analysis (PCA):

import { PCA } from 'ml-pca'

const reduceDimensionality = (embeddings: number[][], dimensions: number) => {
  // Fit a PCA model on the word embedding (co-occurrence) matrix.
  const pca = new PCA(embeddings)
  // Keep only the requested number of principal components.
  const newSize = { nComponents: dimensions }
  // Project the embeddings onto the reduced space.
  const reducedVectors = pca.predict(embeddings, newSize)
  return reducedVectors
}
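
A quick usage sketch, assuming cooccurrenceMatrix holds the rows built in the previous section (ml-pca’s predict returns an ml-matrix Matrix, which to2DArray converts back to plain arrays):

// Reduce each row of the co-occurrence matrix to a 2-dimensional embedding.
const reduced = reduceDimensionality(cooccurrenceMatrix, 2)
// Convert the resulting Matrix back to a plain number[][] for plotting.
const embeddings2D = reduced.to2DArray()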

If we reduce the word embeddings to only 2 dimensions, we can plot them and gain a visual understanding of the relationships between different words. This chart shows selected word embeddings from a 60K-word corpus:

Sample word-embeddings map.

Application: Solving Analogies

Adding and subtracting word embeddings has an interesting and surprising application: solving analogies. Traditionally, analogies are used to measure the reasoning and language skills of students. Today, they also evaluate the accuracy of word embeddings. Consider the following analogy:

King is to man, as queen is to ______.

The idea for solving this problem is to find the word that is at the same distance from “queen” as “man” is from “king.” This is how it looks with vectors:

// Compute the element-wise difference (the offset) between 2 vectors.
const distance = (a: number[], b: number[]) =>
  a.map((val, idx) => val - b[idx])

// Get the word embedding vectors for king, man and queen.
const king = wordEmbeddings['king']
const man = wordEmbeddings['man']
const queen = wordEmbeddings['queen']
// Get the offset between king & man.
const delta = distance(king, man)
// The solution is located at the same offset starting from queen.
const solutionLocation = distance(queen, delta)
// Find the word embeddings closest to the solution's location.
const analogy = findClosest(solutionLocation)

The word embeddings dictionary can be located in a database or loaded from a package. The second option is common in Python; a third option is computing the word embeddings in real-time from the browser using TensorflowJS. For computing the distance, you can use the similarity function we derived in the synonyms section. And if you want to learn how to find the closest vectors, you may be interested in reading the tutorial: How to build a text recommendation engine.

Visually, this is how solving the analogy looks:

Graphically, the distance between queen and woman is similar to the distance from king to man.

Application: Detecting Biases

Unfortunately, we tend to make value judgments based on irrelevant or unfair attributes. AI can help automatically analyze, measure, and report these biases. The word cloud below shows gender biases in the jobs people hold. Jobs in healthcare tend to be associated with women, while the opposite happens for engineering jobs:

Word cloud of biases in jobs based on gender.

The process of detecting bias is also simple: find the word embeddings for the two concepts or groups you want to compare. Then, select the term against which to measure a potential bias. Finally, compute that term’s distance to each concept: the greater the difference between the two distances, the greater the bias.

// Arbitrary threshold to determine if there is a bias.
const biasThreshold = 2

// Evaluate if there is a gender bias for a particular job.
const jobBiasDetection = (job: string) => {
  // Find the word embeddings of woman, man and the input job.
  const woman = wordEmbeddings['woman']
  const man = wordEmbeddings['man']
  const jobEmbedding = wordEmbeddings[job]
  // Measure the distance of the job to both concepts, reusing the
  // similarity function derived in the synonyms section.
  const distanceToWoman = similarity(woman, jobEmbedding)
  const distanceToMan = similarity(man, jobEmbedding)
  // If the job is much closer to "woman", it is associated with women.
  if (distanceToMan / biasThreshold > distanceToWoman) return true
  // If the job is much closer to "man", it is associated with men.
  if (distanceToWoman / biasThreshold > distanceToMan) return true
  // There is no bias for the specified job and threshold.
  return false
}

Because AI models incorporate the biases that we hold as a society, how to train word embeddings without biases remains an unsolved problem. I invite you to reflect and share in the comments: what novel ideas do you have for training unbiased word embeddings? As Kant said:

“Truth is a predicate of whole judgments, not of partial representations.”

Advanced Topic: Deep Learning

The process explained above for building word embeddings is based on research from the 1980s known as Latent Semantic Analysis (LSA). But during the last decade, it has evolved to incorporate Neural Networks. After their successful application in Computer Vision, Neural Networks were quickly adopted by the Natural Language Processing academic community.

Four research papers have shaped how word embeddings are currently built:

  1. Word2Vec: Efficient Estimation of Word Representations in Vector Space.
  2. GloVe: Global Vectors for Word Representation.
  3. ELMo: Deep contextualized word representations.
  4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

For this introduction, I will focus only on the first two.

Word2Vec: Word Embeddings trained from predictions

Word2Vec, a paper by Google scientists, is based on the premise that word embeddings are more accurate if trained to predict a word’s occurrence. The paper introduced two complementary neural network architectures to make predictions and subsequently derive word embeddings.

Using the same N-word window concept, Word2Vec proposed the Continuous Bag of Words (CBOW) architecture to predict a word based on its neighbors. Analogously, the Skip-gram architecture attempted to predict the neighbors based on a specific word.

CBOW & Skip-Gram: Novel Neural Network Architectures

The paper was also innovative because of the size of the dataset used to train the neural network. It additionally introduced a test dataset that became the standard benchmark for the accuracy of new models. It was mostly composed of analogies and included two sections: a syntactic one and a semantic one.

GloVe: Finding hidden relationships between words

Only a year later (2014), GloVe was developed by Stanford researchers. It merged the best of both worlds: the subtle semantic relationships discovered by Latent Semantic Analysis and the syntactic accuracy of Word2Vec’s predictions.

GloVe was based on the powerful intuition that the true meaning of a word is derived from the difference between how often two words occur next to each other and how often they would be expected to. This reduced the noise introduced by commonly occurring words: if two words are relatively uncommon but frequently appear next to each other, that relationship carries more weight in determining the value of their word embeddings.

GloVe devised a clever method to identify, transform, and map those “high-value” relationships. As a consequence, the previously sparse co-occurrence matrix was transformed into a densely populated one. In a beautiful mathematical derivation, the cells of the new matrix were the result of the dot product between a row and a column of the original matrix.

By keeping the co-occurrence matrix small, it was possible to keep the training time short. More importantly, GloVe proved to be the only model at the time that benefited from an increase in the size of the training dataset (from 6 billion to 42 billion tokens).

GloVe’s state-of-the-art results.

Conclusion: The Future of Word Embeddings

I started this essay by mentioning that word embeddings permeate almost every application that involves AI and human language. But the future is even brighter: I see word embeddings in a similar place to where mobile development was 10 years ago. The technology holds endless opportunities for entrepreneurs and software developers over the next decade. I attribute this to 4 main reasons:

  1. Ease of development: contrary to common wisdom, integrating AI, and in particular word embeddings, into existing applications is easy. The field is sufficiently mature to use the technology without understanding the advanced math that runs underneath.
  2. Fast adoption: computer vision has enjoyed the spotlight during the last decade, but its products and applications require specialized hardware, like cameras and processors, that presents challenges to users’ privacy and slows adoption. TensorflowJS makes it seamless to integrate AI-NLP software using a standard mobile phone or desktop.
  3. Localization: a multilingual world creates interesting barriers to entry, where multiple participants can leverage the same application of the technology to serve distinct local markets.
  4. Growth: the field is under major transformation, drawing the attention of the brightest minds around the world. As recently as 2019, state-of-the-art models and breakthrough innovations have continued to reshape the limits and applications of the technology. Examples include BERT’s research paper, which drastically facilitates transfer learning, and OpenAI’s GPT-3 text generation API.

I invite you to meditate on the following: human language is a mechanism of compression. It enables the fast and efficient transmission of complex ideas. But it also has its limits: when speaking, we transmit only about 39 bits of information per second. Compare this with the 480 Mbps transferred via USB 2.0: that is more than ten million times more!

In the end, word embeddings will increase our productivity by augmenting the amount of information we can handle.

If you are interested in a didactic version of this content, I invite you to visit borgez.ml. It is an online interactive course where you can use these concepts to find synonyms and classify documents using TensorflowJS. You will train your own word embeddings, validate your knowledge with a quiz, and find charts with interesting insights.

Thank you for reading! I plan to write about Sentiment Analysis, Attention, and Transformers in the upcoming weeks. If you are interested, please consider giving me a follow and sharing this story. I wish you a great day, and your claps will be very much appreciated. 🙏

Sincerely,
Santiago M.
