NLP: Word Representation and Model Comparison Tree

The NLP (Natural Language Processing) landscape is evolving quickly, with new ways to represent text such as word embeddings.

I would like to summarize the different NLP models in a decision-tree-style diagram.

Before the deep learning era

Before deep learning became widespread, words were represented in the following ways.

One-hot encoding

  • Each word is assigned a unique vector in which exactly one element has a value of 1 and the rest have a value of 0 (thus the name one-hot)
  • The dimension of the one-hot encoded vector equals the number of words in the vocabulary (v)
  • The word vectors are mutually orthogonal in the v-dimensional vector space, so they carry no notion of similarity between words
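As a minimal sketch of the points above (the three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration), one-hot vectors and their orthogonality can be checked in a few lines of plain Python:

```python
def one_hot(word, vocab):
    """Return the one-hot vector for `word` over the ordered list `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vocab = ["apple", "car", "orange"]  # toy vocabulary, so v = 3
apple = one_hot("apple", vocab)
orange = one_hot("orange", vocab)

print(apple)               # [1, 0, 0]
print(dot(apple, orange))  # 0: two different words are always orthogonal
print(dot(apple, apple))   # 1
```

Because every pair of different words has a dot product of 0, "apple" is exactly as far from "orange" as it is from "car", which is the limitation that embeddings address later.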

Bag of words

  • Multiple words can be combined into a bag by simply adding up their one-hot encoded vectors from above
  • The resulting vector captures the fact that multiple words appear together in the same sentence/document
  • The information about word order is lost when the words are put into a bag
  • The result is a multi-hot vector (or a count vector, if a word appears more than once)
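A small sketch of this, assuming whitespace tokenization and a made-up four-word vocabulary. Counting the words is equivalent to summing their one-hot vectors:

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Count vector over `vocab`; same result as summing the one-hot vectors."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["cat", "dog", "bites", "the"]  # toy vocabulary
a = bag_of_words("the dog bites the cat".split(), vocab)
b = bag_of_words("the cat bites the dog".split(), vocab)
multi_hot = [min(c, 1) for c in a]  # clip counts to get a strict multi-hot vector

print(a)          # [1, 1, 1, 2]
print(a == b)     # True: the two sentences differ only in word order
print(multi_hot)  # [1, 1, 1, 1]
```

The two sentences mean very different things, yet they map to the same vector, which shows how the order information is lost.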


TF-IDF

  • It stands for Term Frequency–Inverse Document Frequency
  • The goal is to derive the importance of a keyword or phrase within a document
  • This importance can be used for querying and ranking documents by keyword
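There are several TF-IDF variants. The sketch below assumes one common choice: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The function name `tf_idf` and the toy documents are only for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    df = Counter()               # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"])                     # 0.0: "the" appears in every document
print(scores[0]["mat"] > scores[0]["cat"])  # True: "mat" is unique to document 0
```

Ranking documents for a keyword then amounts to sorting them by that keyword's score.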

The beginning of word embedding (2013)

This is where word embeddings came into use. The embedding vectors of these models are static: a word gets one fixed embedding vector even though it can have multiple meanings in different contexts.


Word2Vec

  • Uses a neural network to train an embedding matrix, which maps one-hot encoded word vectors into a dense representation with far fewer dimensions
  • The representation captures semantic similarity in the embedding space (e.g. “apple” is closer to “orange” than to “car”)
  • The embedding can be trained using the continuous bag-of-words (CBOW) model (predict the current word from a window of surrounding context words)
  • The embedding can also be trained using the skip-gram model (predict the surrounding words from the current word)
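The difference between the two training setups is easiest to see in the training examples they produce. This is only a sketch of the data preparation step (the function names and the window size are my own choices), not of the neural network itself:

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) pairs: predict each surrounding word from the current word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, center) examples: predict the current word from its window."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
print(skip_gram_pairs(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_examples(tokens, window=1)[1])
# (['the', 'brown'], 'quick')
```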


GloVe

  • Global Vectors for Word Representation
  • The training is performed on aggregated global word-word co-occurrence statistics from a corpus, using global matrix factorization
  • Showed interesting linear substructures in the embedding space (e.g. king - man + woman ≈ queen)
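The co-occurrence statistics are just counts of how often two words appear near each other. Here is a rough sketch, assuming a symmetric window and a 1/distance weighting so that closer pairs count more; the function name and toy sentence are illustrative, and the factorization step is not shown:

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Symmetric word-word co-occurrence counts, weighted by 1 / distance."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)          # nearer neighbours contribute more
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)

counts = cooccurrence("ice is solid".split(), window=2)
print(counts[("ice", "is")])     # 1.0 (adjacent words)
print(counts[("ice", "solid")])  # 0.5 (two positions apart)
```

On a real corpus these counts are accumulated over every sentence, and the embedding is then fit so that word vectors explain the resulting matrix.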


fastText

  • Created by Facebook
  • Similar to Word2Vec, but represents each word by its subwords (character n-grams) so it can handle out-of-vocabulary words
  • Pretrained vectors are available for many languages
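To see why subwords help with out-of-vocabulary words, here is a sketch of character n-gram extraction with `<` and `>` as word boundary markers. The helper name `char_ngrams` is made up, and the default range of 3 to 6 characters is an assumption on my part:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, with < and > marking the word boundaries."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares many n-grams with a known one.
shared = set(char_ngrams("orange")) & set(char_ngrams("oranges"))
print(sorted(shared)[:3])
```

A word vector is then built from the vectors of its n-grams, so an unseen word such as "oranges" still gets a sensible vector from the pieces it shares with "orange".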

Contextual Embedding

These models are typically pretrained on a large corpus using self-supervised learning (SSL), and then fine-tuned on a downstream prediction task. Another characteristic of these models is that the embedding vectors are context dependent: they compute the embedding of a word not in isolation, but together with its context (all the other words in the same sentence/sequence).


ELMo

  • Embeddings from Language Models
  • Uses a character-level convolutional neural network (CNN) to turn each word of a text string into a raw word vector
  • Uses forward and backward LSTMs to pass in the context from the words before and after the current word
  • The final representation (ELMo) is a weighted sum of the raw word vector and the hidden states of the two bidirectional LSTM layers, with the weights learned for each downstream task
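The weighted sum in the last bullet can be sketched as follows. I am assuming softmax-normalized layer weights and a single scaling factor `gamma`; the two-dimensional layer vectors are toy values, not real ELMo outputs:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layers, layer_logits, gamma=1.0):
    """Weighted sum of per-layer vectors for one token."""
    weights = softmax(layer_logits)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(dim)
    ]

# One token: raw word vector, LSTM layer 1 state, LSTM layer 2 state (toy values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = elmo_combine(layers, layer_logits=[0.0, 0.0, 0.0])
print(combined)  # equal logits give each layer weight 1/3, so roughly [0.667, 0.667]
```

During fine-tuning, `layer_logits` and `gamma` would be learned, letting each task decide which layers matter most.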


CoVe

  • Contextualized Word Vectors
  • Based on an attentional sequence-to-sequence model trained on a translation task
  • Encoder: feeds the GloVe embeddings into a standard two-layer bidirectional LSTM network
  • Decoder: computes a vector of attention weights from the hidden states of the encoder
  • Once trained on translation, the encoder (the MT-LSTM) is reused to produce the context vectors
  • The representation is fed to a Bi-attentive Classification Network (BCN), which can be used for any classification problem that has one input (like sentiment analysis) or two inputs (like paraphrase detection)

Transformer model (modern)

Transformer-based models are the current state of the art (SOTA). They range from small models that can run on a mobile device all the way to huge models that require multiple GPUs/TPUs to train.

The meaning of a word is captured in its hidden state, which is computed by applying attention over all the words in the context. As a result, these embeddings carry much richer information and can express how the meaning of a word changes from one sentence to another.
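Here is a heavily simplified sketch of that idea: scaled dot-product self-attention with no learned projections (queries, keys and values are all just the input vectors) and a single head. The two-dimensional word vectors are toy values I made up:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each output is an attention-weighted mix of all input vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out

bank, river, money = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]  # toy static embeddings
bank_by_river = self_attention([river, bank])[1]
bank_by_money = self_attention([money, bank])[1]
print(bank_by_river)
print(bank_by_money)  # same input word, different contextual vector
```

The static vector for "bank" is identical in both inputs, but its output differs because it is mixed with different neighbours. Real transformers add learned projections, multiple heads and many stacked layers on top of this.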


BERT

  • Bidirectional Encoder Representations from Transformers
  • Created by Google
  • The model that started the widespread adoption of transformer models in NLP
  • The innovation is the use of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to pretrain the model in a self-supervised way
  • Many later models are extensions of BERT
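To make the MLM idea concrete, here is a sketch of the masking step. It assumes the commonly described recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. The function name, toy vocabulary and whitespace tokens are illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels are None where no prediction is required."""
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                # kept unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The labels come from the text itself, which is why no human annotation is needed for pretraining.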


RoBERTa

  • A robustly optimized version of BERT (Robustly Optimized BERT Pretraining Approach), created by Facebook
  • Trained on more data and for longer with the Masked Language Model objective
  • Removed the Next Sentence Prediction task from training


DistilBERT

  • A smaller BERT model that is trained by distillation from the original BERT (a student of the BERT teacher model)
  • Smaller architecture with fewer encoder layers than BERT; it is faster to run and retains 97% of BERT's performance
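Distillation means the student learns to match the teacher's output distribution rather than only the hard labels. The sketch below shows just the soft-target term, as a cross-entropy between temperature-softened teacher and student distributions. The temperature value and the logits are made up, and a full training loss would typically combine this with other terms:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [2.0, 1.0, 0.0]
good_student = [2.0, 1.0, 0.0]
bad_student = [0.0, 1.0, 2.0]
print(distillation_loss(good_student, teacher) < distillation_loss(bad_student, teacher))  # True
```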


ALBERT

  • A lite version of BERT
  • Unlike DistilBERT, ALBERT is trained from scratch with a smaller architecture and a few innovations
  • To reduce the embedding matrix size: let V = vocabulary size, E = embedding dimension, H = hidden dimension. The V x H embedding matrix is decomposed into two smaller matrices of size V x E and E x H. BERT ties the embedding size to the hidden size (H = 768 for BERT-base), whereas ALBERT uses E = 128 with H as large as 4096 (ALBERT-xxlarge).
  • Shares parameters across the encoder layers (cross-layer parameter sharing). This reduces the parameter count and also serves as a form of regularization (similar to how a convolutional layer shares weights while a fully connected layer does not).
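The saving from the factorized embedding is simple arithmetic. The sketch below assumes a vocabulary of 30,000 tokens purely for illustration and compares V x H against (V x E) + (E x H):

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the embedding, optionally factorized through size E."""
    if embedding_size is None:
        return vocab_size * hidden_size  # single V x H matrix
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128  # V is an assumed vocabulary size
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)

print(full)        # 23040000
print(factorized)  # 3938304
print(embedding_params(V, 4096, E))  # 4364288: H can grow with little embedding cost
```

Because the V x E term dominates, the hidden size can be made much larger without the embedding table growing with it.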


XLM

  • It stands for Cross-lingual Language Model
  • It’s still based on BERT-style masked language modeling (MLM)
  • Extends BERT’s MLM to inputs in multiple languages

Zero-shot, one-shot, few-shot learning


GPT

  • Generative Pre-trained Transformer
  • The GPT-2 and GPT-3 models are developed by OpenAI
  • Size: GPT-2 has 1.5 billion parameters, and GPT-3 has 175 billion, more than 100x larger than GPT-2
  • Zero-shot learning means that the task or its classes can be specified at test time, without training the model on them. For example, GPT-2 can be prompted to do sentiment analysis without ever being fine-tuned on positive and negative classes. Adding one or a few labeled example sentences to the prompt turns this into one-shot or few-shot learning.
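In practice the difference between zero-shot and few-shot shows up only in the prompt text, not in the model weights. The prompt layout below is a made-up format for illustration, not a format required by any particular model:

```python
def build_prompt(task, text, examples=()):
    """Zero-shot if `examples` is empty, otherwise one-shot or few-shot."""
    parts = [task]
    for example_text, label in examples:
        parts.append(f"Text: {example_text}\nSentiment: {label}")
    parts.append(f"Text: {text}\nSentiment:")
    return "\n\n".join(parts)

task = "Classify the sentiment of the text as positive or negative."
zero_shot = build_prompt(task, "I loved this movie.")
few_shot = build_prompt(
    task,
    "I loved this movie.",
    examples=[("Great food and service.", "positive"),
              ("The battery died in a day.", "negative")],
)
print(zero_shot)
print("---")
print(few_shot)
```

In both cases the model is expected to continue the text after the final "Sentiment:" with the predicted class.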


T5

  • Text-to-Text Transfer Transformer
  • Reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings
  • Much smaller than GPT-3: the released T5 models range from about 60 million to 11 billion parameters
  • Works with fine-tuning or few-shot learning: requires very little data to adapt to a new problem
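The text-to-text idea can be sketched by writing very different tasks as plain (input string, target string) pairs, with a task prefix at the start of the input. The helper `to_text_to_text` is hypothetical, and the prefixes follow the style of examples commonly shown for T5:

```python
def to_text_to_text(task_prefix, text):
    """Turn any task input into a single prefixed text string."""
    return f"{task_prefix}: {text}"

# Translation and classification share the same (input text, target text) shape.
examples = [
    (to_text_to_text("translate English to German", "That is good."), "Das ist gut."),
    (to_text_to_text("cola sentence", "The course is jumping well."), "not acceptable"),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Even the classification label is emitted as text, so a single model and a single loss can serve every task.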
