NLP: Word Representation and Model Comparison Tree

The landscape of NLP (Natural Language Processing) is evolving quickly, with new ways to represent text such as word embeddings.

I would like to summarize the different NLP models in a decision-tree-style diagram.

Before the deep learning era

Before deep learning became widespread, we used word representations in the following ways.

One-hot encoding

  • Each word is assigned a unique vector in which only one element has a value of 1 and the rest are 0 (hence the name one-hot); see the sketch below
  • The number of words in the vocabulary equals the dimension of the one-hot encoded vector
  • Each word vector is orthogonal to every other word vector in the V-dimensional vector space
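As a minimal sketch (the toy vocabulary below is made up for illustration), one-hot encoding looks like this in Python:

```python
import numpy as np

# Hypothetical toy vocabulary; in practice it is built from a corpus.
vocab = ["apple", "orange", "car", "drives", "tastes", "sweet"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("apple"))   # [1. 0. 0. 0. 0. 0.]
print(one_hot("orange"))  # [0. 1. 0. 0. 0. 0.]
# Distinct one-hot vectors have a dot product of 0, i.e. they are orthogonal.
print(np.dot(one_hot("apple"), one_hot("orange")))  # 0.0
```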

Bag of words

  • Multiple words can be combined into a bag by simply adding up the one-hot encoded vectors above (see the sketch after this list)
  • The resulting vector captures the fact that multiple words appear together in the same sentence/document
  • The information about word order is lost when words are put into a bag
  • It becomes a multi-hot vector
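Continuing the toy example above (reusing the hypothetical one_hot helper), a bag-of-words vector is just the sum of one-hot vectors:

```python
def bag_of_words(sentence):
    """Sum the one-hot vectors of each word; word order is discarded."""
    return sum(one_hot(word) for word in sentence.split() if word in word_to_index)

print(bag_of_words("apple tastes sweet"))  # [1. 0. 0. 0. 1. 1.]
# "sweet tastes apple" produces exactly the same multi-hot vector -- order is lost.
```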

TF-IDF

  • It stands for Term Frequency–Inverse Document Frequency
  • The goal is to derive the importance of a keyword or phrase within a document
  • This importance can be used for querying and ranking documents by keyword (a small worked sketch follows this list)
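A minimal sketch of one common TF-IDF weighting, term frequency times log(N / document frequency); libraries such as scikit-learn use smoothed variants, and the toy corpus here is made up:

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices rose sharply today",
]

def tf_idf(term, doc, docs):
    """One common variant: term frequency * log(N / document frequency)."""
    words = doc.split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for d in docs if term in d.split())
    return tf * math.log(len(docs) / df) if df else 0.0

# "cat" appears in 2 of 3 documents, "stock" in only 1,
# so "stock" gets a higher weight within its own document.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("stock", docs[2], docs))
```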

The beginning of word embedding (2013)

This was the beginning of word embeddings, and the embedding vectors are static for these models. That is, a word gets a fixed embedding vector even though it can have multiple meanings in different contexts.

Word2vec

  • Uses a neural network to train an embedding matrix, which maps one-hot encoded word vectors into a dense representation with fewer dimensions
  • The representation has semantic similarity in the embedding space (i.e. “apple” is closer to “orange” than to “car” in the embedding space)
  • The embedding can be trained using the continuous-bag-of-words (CBOW) model (predict the current word from a window of surrounding context words)
  • The embedding can be trained using the skip-gram model (predict the surrounding words from the current word); a training sketch follows this list
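A minimal training sketch using the gensim library (assuming gensim 4.x; the toy corpus is far too small to learn meaningful similarities):

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus; real embeddings are trained on millions of sentences.
sentences = [
    ["apple", "orange", "banana", "fruit"],
    ["car", "truck", "vehicle", "road"],
    ["apple", "tastes", "sweet"],
]

# sg=1 selects the skip-gram objective; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["apple"].shape)         # a dense 50-dimensional vector
print(model.wv.most_similar("apple"))  # nearest words in the embedding space
```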

GloVe

  • Global Vectors for Word Representation
  • The training is performed on aggregated global word-word co-occurrence statistics from a corpus using global matrix factorization
  • Showed interesting linear substructures of the embedding space (e.g. king - man + woman ≈ queen), as in the sketch below
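The analogy can be checked with pretrained GloVe vectors, for example via gensim's downloader (a sketch assuming the glove-wiki-gigaword-50 vectors are available):

```python
import gensim.downloader as api

# Downloads the pretrained 50-dimensional GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-50")

# The classic linear substructure: king - man + woman is closest to queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```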

FastText

  • Created by Facebook
  • Similar to the models above, but uses subword (character n-gram) information to handle out-of-vocabulary words
  • Pretrained word vectors are available in multiple languages

Contextual Embedding

These models are typically pretrained on a large corpus using self-supervised learning (SSL) and then fine-tuned for downstream prediction tasks. Another characteristic of these models is that the embedding vectors are context dependent: they compute the embedding of a word along with its context (all the other words in the same sentence/sequence).

ELMo 

  • Embeddings from Language Models
  • Uses a character-level convolutional neural network (CNN) to turn the words of a text string into raw word vectors
  • Uses forward and backward LSTMs to pass in context from the words before and after the current word
  • The final representation (ELMo) is a weighted sum of the raw word vector and the two intermediate (forward and backward) LSTM layer outputs

CoVe

  • Contextualized Word Vectors
  • An attentional sequence-to-sequence model trained on a machine translation task
  • Encoder: uses GloVe embeddings fed into a standard, two-layer, bidirectional LSTM network
  • Decoder: computes a vector of attention weights from the hidden states of the encoder
  • The above forms the MT-LSTM network
  • The representation is fed to a Bi-attentive Classification Network (BCN), which can be used for any classification problem with one input (like sentiment analysis) or two inputs (like paraphrase detection)

Transformer model (modern)

Transformer-based models are the current SOTA. They range from smaller models that can run on a mobile device all the way to huge models that require multiple GPUs/TPUs to train.

The meaning of a word is captured in the hidden state by applying attention over all the words in the context. In the end, these embeddings provide much richer information to express the changing meaning of a word in different sentences.
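As an illustration (a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint), the same word receives different vectors depending on its sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the hidden state of the first subword token matching `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited cash at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
# The two "bank" vectors differ because the surrounding context differs.
print(torch.cosine_similarity(v1, v2, dim=0))
```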

BERT

  • Bidirectional Encoder Representations from Transformers
  • Created by Google
  • The model that kicked off the widespread adoption of transformer models in NLP
  • The innovation is the use of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to pretrain the model with a self-supervised technique (see the fill-mask sketch after this list)
  • Many models are extended from BERT
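To get a feel for the MLM objective, here is a short sketch using the Hugging Face fill-mask pipeline (this queries an already pretrained BERT rather than training one):

```python
from transformers import pipeline

# Predict the masked word, the same task BERT is pretrained on.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```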

RoBERTa

  • A more robustly optimized version of BERT, created by Facebook
  • Trained with more data and for a longer period on the masked language modeling objective
  • Removed the Next Sentence Prediction task from training

DistilBERT

  • A smaller BERT model that is trained by distillation from the original BERT (a student of the BERT model)
  • A smaller architecture with fewer encoder layers than BERT, faster to run, and it retains 97% of BERT's performance

ALBERT

  • A Lite version of BERT
  • Unlike DistilBERT, ALBERT is trained from scratch with a smaller architecture and a few innovations
  • To reduce the embedding matrix size: let V = vocabulary size, E = embedding dimensions, H = hidden dimensions. The embedding matrix of size V x H is decomposed into (V x E) + (E x H). While BERT ties the embedding size to H = 768, ALBERT uses E = 128 with H up to 4096 in its largest configuration (see the worked example after this list).
  • Enables cross-layer parameter sharing across the encoder blocks. This reduces the parameter count and also serves as a form of regularization (much like a convolutional layer vs. a fully-connected layer).
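A quick back-of-the-envelope check of the factorization, using an illustrative vocabulary size of 30,000 and BERT-base's H = 768:

```python
# Embedding parameters before and after ALBERT's factorization.
V, H, E = 30_000, 768, 128

bert_embedding_params = V * H             # 23,040,000
albert_embedding_params = V * E + E * H   # 3,840,000 + 98,304 = 3,938,304

print(bert_embedding_params, albert_embedding_params)
```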

XLM

  • It stands for Cross Lingual Model
  • It’s still based on BERT using masked language modeling (MLM)
  • Extends BERT’s MLM to multiple language inputs

Zero-shot, one-shot, few-shot learning

GPT

  • Generative Pre-Trained Transformer
  • GPT-2 and GPT-3 models were developed by OpenAI
  • Size: GPT-2 has 1.5 billion parameters, and GPT-3 has 175 billion parameters, more than 100x larger than GPT-2
  • Zero-shot learning means that the classes of a classification model can be given at test time, and the model is able to identify samples of those classes. For example, GPT-3 can perform sentiment analysis when a few example positive and negative sentences are given in the prompt, without training the model on these two classes (a prompt sketch follows this list).
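A few-shot prompt might look like the following (a generic illustration, not a specific API call; the labeled examples live entirely in the prompt and the model's weights are never updated):

```python
# Hypothetical few-shot prompt for sentiment classification.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The movie was a delight from start to finish.
Sentiment: Positive

Review: I walked out halfway through, it was that dull.
Sentiment: Negative

Review: The acting was superb and the plot kept me hooked.
Sentiment:"""

# `prompt` would be sent to a GPT-style language model, which is expected
# to continue the text with " Positive".
```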

T5

  • Text-to-Text-Transfer-Transformer
  • Reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings (see the sketch after this list)
  • Much smaller than GPT
  • Works with finetuning or few-shot learning: requires very little data to adapt to a new problem
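A minimal sketch of the text-to-text interface (assuming the Hugging Face transformers library and the t5-small checkpoint, where the task is named in a text prefix):

```python
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

# Every task is text in, text out; the prefix tells the model what to do.
print(t5("translate English to German: The house is wonderful."))
print(t5("summarize: NLP models have moved from static word vectors to "
         "contextual embeddings produced by large pretrained transformers."))
```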
