How to Visualize Human Language

And Seven Startup Ideas to get you started!

Santiago M. Quintero
Analytics Vidhya

--

The green dot on the right represents the word “woman,” and the purple dots represent the top 100 paying jobs. Only 32 of those jobs are closer to “woman” than to “man.” Source: borgez.ml

A picture is worth a thousand words. What if, then, we could summarize a 1,000-word document in a single chart?

Charts and dashboards are used to model financial and numerical data. They are the preferred tool of analysts and investors to analyze, communicate, and strategize. But until a few years ago, using charts to visualize human language was not possible. Today, word embeddings make text documents faster to process, easier to aggregate, and more profitable to analyze.

Word embeddings are numerical representations of meaning. They are the result of a long tradition of Natural Language Processing research merged with recent advances in Deep Learning. Representing meaning with vectors enables data scientists to perform mathematical operations on words as they do with numbers, including the addition, subtraction, and comparison used in applications like Google’s search engine. My essay “What are word embeddings?” provides an in-depth introduction to the topic.
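To make those operations concrete, here is a minimal sketch using tiny made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are purely illustrative):

```typescript
// Hypothetical 3-dimensional embeddings, for illustration only.
const king  = [0.8, 0.6, 0.1]
const man   = [0.7, 0.2, 0.1]
const woman = [0.6, 0.2, 0.8]

// Addition and subtraction: the classic "king - man + woman" analogy.
const analogy = king.map((v, i) => v - man[i] + woman[i])

// Comparison: cosine similarity measures how close two meanings are.
const dot = (a:number[], b:number[]) => a.reduce((sum, v, i) => sum + v * b[i], 0)
const norm = (a:number[]) => Math.sqrt(dot(a, a))
const cosineSimilarity = (a:number[], b:number[]) => dot(a, b) / (norm(a) * norm(b))
```

A similarity of 1 means identical direction (identical meaning); values near 0 mean the words are unrelated.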

Word embedding visualization starts by transforming text into vectors, then reducing their dimensionality to plot them. This article goes through each of the four steps to plot word embeddings and provides sample code in TypeScript, using TensorflowJS, React, and recharts. I will illustrate the process with a topic classification example, and end by sharing seven startup ideas you can implement using this technology.

Photo by Compare Fibre on Unsplash

1. Tokenize your Dataset

It all starts with clean data. TensorflowJS’s Universal Sentence Encoder transforms words, sentences, and even short paragraphs into vectors. But if your dataset does not arrive in that format, you can use a regular expression to segment it:

// Split a text into words.
const wordTokenizer = (text:string) => text.match(/(\b[^ $]+\b)/g)
// Split a text into sentences.
const sentenceTokenizer = (text:string) => text.match(/[^.!?]+[.!?]+/g)
// Split a text into paragraphs.
const paragraphTokenizer = (text:string) => text.split("\n")
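A quick sanity check of the word and sentence tokenizers, on a made-up two-sentence string (the tokenizers are repeated so the snippet is self-contained):

```typescript
// Same tokenizers as above, repeated here so the snippet runs on its own.
const wordTokenizer = (text:string) => text.match(/(\b[^ $]+\b)/g)
const sentenceTokenizer = (text:string) => text.match(/[^.!?]+[.!?]+/g)

const sample = 'Word embeddings are vectors. They encode meaning!'
const words = wordTokenizer(sample)         // 7 word tokens
const sentences = sentenceTokenizer(sample) // 2 sentences
```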

2. Model it with TensorflowJS

Getting the word embeddings is easy. Start by importing both dependencies: Tensorflow and the Universal Sentence Encoder. Then, load the model; this triggers a network request and takes a couple of seconds. Finally, feed your tokenized dataset to the model as an array of strings, and transform the output tensor into a JavaScript array.

import { load } from '@tensorflow-models/universal-sentence-encoder'
import '@tensorflow/tfjs'

// Turn an array of texts into word embeddings.
const vectorize = async (texts:string[]) => {
    const model = await load()
    const tensors = await model.embed(texts)
    const embeddings = await tensors.array()
    return embeddings
}

3. Dimensionality Reduction

TensorflowJS outputs are 512-dimensional vectors; attempting to plot them is a chimeric endeavor. Fortunately, there are several dimensionality reduction techniques. A common, effective, and simple technique is Principal Component Analysis.

The npm module ml-pca is an excellent, lightweight choice to compress vectors in TypeScript. To use it, first fit a PCA model on the TensorflowJS output, then use that model to project the embeddings down to 2-dimensional vectors.

import { PCA } from 'ml-pca'

// Reduce the dimensionality of word embeddings to two components.
const reduceDimensionality = (embeddings:number[][]) => {
    const pca = new PCA(embeddings)
    const matrix = pca.predict(embeddings, { nComponents: 2 })
    return matrix.to2DArray()
}

4. Draw it with ReactJS

The last step is to plot the vectors using ReactJS and recharts. Start by mapping the tokenized dataset to the reduced vectors. Then import the scatter chart and add the corresponding colors, props, and styles to turn it into a React component. Finally, choose the title; do not underestimate this step: make sure it is clear what you want the plot to convey.
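As a sketch, the mapping from tokens and reduced vectors to the chart data might look like this (the tokens and coordinates below are made up for illustration):

```typescript
interface iData { name:string, x:number, y:number }

// Hypothetical tokens and their 2D vectors from the PCA step.
const tokens = ['economy', 'inflation', 'soccer']
const reduced = [[0.9, 0.1], [0.8, 0.2], [-0.5, 0.7]]

// Merge them into the shape the scatter chart expects.
const data:iData[] = tokens.map((name, i) => ({ name, x: reduced[i][0], y: reduced[i][1] }))
```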

import { ScatterChart, Scatter, XAxis, YAxis, Tooltip } from 'recharts'

interface iData { name:string, x:number, y:number }
interface iEmbeddingsPlot { title:string, data:iData[] }

const EmbeddingsPlot = ({ title, data }:iEmbeddingsPlot) => <div>
    <h1> { title } </h1>
    <ScatterChart width={600} height={400}>
        <XAxis dataKey='x' type='number' />
        <YAxis dataKey='y' type='number' />
        <Tooltip />
        <Scatter data={data} fill='#8884d8' />
    </ScatterChart>
</div>

Example: Document visualization by Topic

Sample topic classification chart. The further a topic is to the left, the more technical it is. The vertical dimension represents personal (top) versus societal (bottom) topics. Source: borgez.ml

To illustrate the procedure, consider mapping an array of documents labeled by topic. These are some of the challenges we could face:

  1. The document size is too large for the universal sentence encoder.
  2. The number of topics is too big to assign each one a unique color in the graph.
  3. We need to use the tool to classify unlabeled documents.

Shay Palachy wrote an outstanding 39-minute guide surveying the different techniques used to embed documents. For the example above, I opted for a more pragmatic approach: tokenizing texts into sentences and averaging their embeddings to obtain the final document embedding.

I used a similar procedure to draw each topic on the chart: taking the average of the vectors of every document labeled with the same topic. Averaging is a great example of the value derived from applying mathematical operations to text data.
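A minimal sketch of the averaging step, which works the same whether the inputs are the sentence embeddings of one document or the document embeddings of one topic:

```typescript
// Average an array of equal-length vectors, dimension by dimension.
const averageVectors = (vectors:number[][]) =>
    vectors[0].map((_, dim) =>
        vectors.reduce((sum, vector) => sum + vector[dim], 0) / vectors.length)

// E.g. two sentence embeddings collapse into a single document embedding.
const docEmbedding = averageVectors([[1, 2], [3, 4]])
```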

The final step is predicting the topic of unlabeled documents. We can apply a Machine Learning algorithm like a Support Vector Machine (SVM), train a neural network, or run a nearest-neighbor query to find the closest topic in the 2D vector space. For more information on this subject, please refer to my tutorial: “How to create a text recommendation engine.”
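The geometric option is the simplest of the three. A sketch, assuming each topic is represented by its averaged centroid in the 2D space (names and coordinates below are made up):

```typescript
interface iPoint { x:number, y:number }
interface iTopic extends iPoint { name:string }

// Assign a document to the topic with the nearest centroid (squared Euclidean distance).
const closestTopic = (doc:iPoint, topics:iTopic[]) =>
    topics.reduce((best, topic) => {
        const d = (topic.x - doc.x) ** 2 + (topic.y - doc.y) ** 2
        const bestD = (best.x - doc.x) ** 2 + (best.y - doc.y) ** 2
        return d < bestD ? topic : best
    })

// A document near (0.9, 0.8) lands on the hypothetical "tech" topic.
const topics = [{ name:'sports', x:0, y:0 }, { name:'tech', x:1, y:1 }]
const predicted = closestTopic({ x:0.9, y:0.8 }, topics)
```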

BONUS: 7 Startup Ideas based on Word Embeddings

  1. Kluster: Product review aggregation technology.
  2. Quality Leads: automatically qualify leads based on their website’s language.
  3. HR-Rec: recommendation engine to find the best prospects.
  4. Touristy: group restaurants, hotels, and destinations based on experiences, not location.
  5. ReadFlix: find the best content to read every night after work.
  6. Healtty: dashboard to monitor corporate communication and organizational culture.
  7. Matchy: join people based on ideas, not appearance.

All of these ideas share the same concept: analyze textual data and create visualizations that facilitate decision-making. I’m happy to expand on any of them; connect with me on LinkedIn to discuss an MVP. What do you think: which idea has the most upside?

“Whenever you see a successful business, someone once made a courageous decision.” ― Peter F. Drucker

Thank you for reading. I hope you enjoyed the story! For more content on Natural Language Processing, consider giving me a follow. I would love to hear your thoughts in the comments, and your claps will be very much appreciated. 🙏


Entrepreneur, Software Engineer & Writer specialized in building ideas to test Product Market Fit and NLP-AI user-facing applications.