Introduction
Named Entity Recognition (NER) is a classic natural language processing (NLP) problem. The task is to identify the words in a sentence that represent named entities.
For example, if we are given a sentence: Joe went to Stanford University.
We expect to recognize the two named entities: (1) Joe and (2) Stanford University.
More specifically, we want to identify the type of each entity as well (i.e. whether it is a person or a location). We are going to use the Inside–outside–beginning (IOB) format for tagging entities (see the worked example after the tag list below).
The first part differentiates the beginning (B), inside (I), and outside (O) of entities.
The second part represents the type of entity:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
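For instance, the sample sentence above would be tagged as follows (a hand-constructed illustration, not model output):

example = [("Joe", "B-per"),
           ("went", "O"),
           ("to", "O"),
           ("Stanford", "B-org"),
           ("University", "I-org")]
for word, tag in example:
    print("{:12s} {}".format(word, tag))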
How it works (some math background)
Historically, this was done using a predefined list of words that are known to be named entities. For example, the word Joe is the name of a person, and Stanford University is the name of a school. However, this is difficult because a word can appear in different forms. For example,
- Roger is a tennis player.
- Roger that!
In these two sentences, the word Roger is a person in the first but an exclamation in the second. This is hard to handle with rules, as you would need ever more complex rules to classify it. Wouldn’t it be better if you could apply statistics to train a model that learns these patterns?
One such statistical method is to use a linear-chain conditional random field (CRF) to learn a probabilistic model that finds the most probable tagging for a sentence. In combination, we use a bidirectional long short-term memory (LSTM) model to process the input sequence into compressed hidden states.
Let’s look at the layer diagram from Neural Architectures for Named Entity Recognition.
- At the bottom layer, we have the word embedding. We can either train our own embedding or use a pretrained embedding layer. In this tutorial, we will train our own low-dimensional embedding, but in practice it might be worth using an existing pretrained word embedding.
- The next layer up is the Bi-LSTM, which can be viewed as an encoder that transforms the embeddings into another representation that takes context into account. The LSTM is a sequence model that passes hidden states along the sentence; the bidirectional wrapper runs it over the previous and next words in both directions and concatenates the two outputs into one vector per word. At this layer, it is already possible to build a tagging model by feeding the outputs into a softmax layer to predict the tags (see the sketch after this list). Such a model would effectively predict the tag for each word independently, without using any grammar. By grammar, I mean modelling the conditional probability among the tags. Thus, we try to do more with a CRF layer in this tutorial.
- At the output of the LSTM encoder, we have the encoded representation, which is passed to the CRF layer as input. You can also skip to the appendix at the end to see how the CRF layer works mathematically.
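As a reference point, here is a minimal sketch of the simpler softmax-only tagger mentioned above. This is an illustrative baseline, not the model we build later; the vocabulary size, layer sizes, and tag count here are placeholder values:

from keras import layers
from keras.models import Model

# A per-word softmax tagger: predicts each tag independently, with no CRF on top.
inp = layers.Input(shape=(None,))
emb = layers.Embedding(input_dim=10000, output_dim=14)(inp)  # placeholder vocab size
enc = layers.Bidirectional(layers.LSTM(50, return_sequences=True))(emb)
out = layers.TimeDistributed(layers.Dense(18, activation="softmax"))(enc)  # 18 = number of tags
softmax_tagger = Model(inp, out)
softmax_tagger.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])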
Coding
Now let’s get into how to do it in a Colab notebook.
First, I ran the pip install commands below in my Colab. What is needed really depends on your Colab environment. At the time of this article (2022), I had to pin specific tensorflow and keras versions to make sure the CRF layer works properly.
!pip install tensorflow==2.2
!pip install keras==2.3.1
!pip install plot-keras-history
!pip install git+https://www.github.com/keras-team/keras-contrib.git
!pip install tensorflow_addons
Import as usual…
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plot_keras_history import plot_history
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix, confusion_matrix
import tensorflow as tf
import keras
from keras import layers
from keras import optimizers
from keras.models import Model
from keras.models import Input
from keras_contrib.layers import CRF
from keras_contrib import losses
from keras_contrib import metrics
import seaborn as sn
from matplotlib.colors import LogNorm
Download data
Read the training data from CSV. You can download the CSV from Kaggle.
Download link: https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus
I have already downloaded the data, so I can just load it from my Google Drive:
from google.colab import drive
drive.mount('/content/drive')

file_path = "/content/drive/MyDrive/kaggle_ner_dataset/ner_dataset.csv"
df = pd.read_csv(file_path, encoding="iso-8859-1", header=0)
df.head()
Data Processing
- We see that the first column, “Sentence #”, denotes which sentence each word belongs to, but it is a string, so we need to convert the sentence number from string to int.
- Only the first word of each sentence has the sentence-number label, while the rest are NaN, so we use ffill (forward fill) to copy it down to all following records of that sentence.
df = df.fillna(method="ffill")
# Extract the substring after "Sentence: " and convert it to int
df["Sentence #"] = df["Sentence #"].apply(lambda s: int(s[9:]))
df.head()
For this task, we don’t need the POS tag, so let’s just remove this column.
Note: this POS data can be also useful for training a Hidden Markov Model for POS Tagging
df.drop('POS', axis=1, inplace=True)
df.head()
Exploratory Data Analysis
df["Tag"].value_counts().plot(kind="bar", figsize=(10,5));
df[df["Tag"]!="O"]["Tag"].value_counts().plot(kind="bar", figsize=(10,5))
We can see here that geographical names, time expressions, and organization names are the most frequent entities.
The tag notation has two parts:
The first part differentiates the beginning (B) and the inside (I) of entities.
The second part represents the type of entity:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
We also want to check the lengths of the sentences in the dataset, because we need to decide the max sequence length for our model later.
word_counts = df.groupby("Sentence #")["Word"].agg(["count"])
word_counts.hist(bins=50, figsize=(8,6));
As we can see, the average length is around 20 words, but the distribution has a long tail on the right.
max_length=word_counts.max()
To speed up the training, let’s only process the shorter sentences for demo purpose. You are free to skip this part of the code if you want to process all sentences.
max_length = 70
print("There are {} sentences over {} words.".format(np.sum(word_counts['count'] > max_length), max_length))
There are 4 sentences over 70 words.
keep_sentence_ids = word_counts[word_counts['count'] <= max_length].index
df = df[df['Sentence #'].isin(keep_sentence_ids)]
Converting words and TAGs to numerical values
We need to build dictionaries to convert between strings and numbers, because the Keras model we use only accepts numeric tensors.
all_words = list(df["Word"].unique())
all_tags = list(df["Tag"].unique())
Let’s build two dictionaries to convert between words (as strings) and indices.
word2index = {word: idx for idx, word in enumerate(all_words, 2)}
# Set up the reserved index 0 and index 1 for two special tokens: unknown word and padding
word2index["<_UNK_>"] = 0
word2index["<_PAD_>"] = 1
# Create the inverted dictionary
index2word = {idx: word for word, idx in word2index.items()}
We can take a quick look at the dictionaries:
- Get the word from an index
- Get the index from a word
- Confirm the index is the same after mapping through both index2word and word2index
for i in range(5):
    word = index2word[i]
    index = word2index[word]
    print("i={} word={:10s} index={}".format(i, word, index))
i=0 word=<_UNK_>    index=0
i=1 word=<_PAD_>    index=1
i=2 word=Thousands  index=2
i=3 word=of         index=3
i=4 word=demonstrators index=4
Next, build two dictionaries to convert between the TAG and its index.
tag2index = {tag: idx for idx, tag in enumerate(all_tags, 1)}
# Set up the reserved index 0 for padding in the TAG sequence
tag2index["<_PAD_TAG_>"] = 0
# Create the inverted dictionary
index2tag = {idx: word for word, idx in tag2index.items()}
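As with the word dictionaries, we can do a quick round-trip sanity check (an optional extra, not required for the rest of the tutorial):

# Optional sanity check: round-trip a few tags through the two dictionaries
for i in range(3):
    tag = index2tag[i]
    index = tag2index[tag]
    print("i={} tag={:12s} index={}".format(i, tag, index))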
Each word in the data set has a corresponding tag, but they live in separate columns. So we apply two operations to bring the data into the right shape:
- Group the data by sentence #
- Convert the Word and Tag columns into two parallel lists of lists, one list of words and one list of tags per sentence
listOfListOfWords = df.groupby("Sentence #")['Word'].apply(lambda x: x.values.tolist()).tolist()
listOfListOfTags = df.groupby("Sentence #")['Tag'].apply(lambda x: x.values.tolist()).tolist()
print(listOfListOfWords[0])
print(listOfListOfTags[0])
['Thousands', 'of', 'demonstrators', 'have', 'marched', ...]
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', ...]
# Convert the words and tags from strings to indices
X = [[word2index[word] for word in listOfWords] for listOfWords in listOfListOfWords]
y = [[tag2index[tag] for tag in listOfTags] for listOfTags in listOfListOfTags]
# Pad the word sequences to max length
X = [listOfWordIndices + [word2index["<_PAD_>"]] * (max_length - len(listOfWordIndices)) for listOfWordIndices in X]
# Pad the tag sequences to max length
y = [listOfTagIndices + [tag2index["<_PAD_TAG_>"]] * (max_length - len(listOfTagIndices)) for listOfTagIndices in y]
print(X[0])
print(y[0])
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 11, 17 ...]
[1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, ...]
Next, we need to turn the densely encoded tag labels into one-hot encoding.
num_tags = len(index2tag)
y = [np.eye(num_tags)[listOfTagIndices] for listOfTagIndices in y]
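If the np.eye indexing trick is unfamiliar: indexing an identity matrix with a list of tag indices selects one row per tag, which is exactly the one-hot vector for that tag. A tiny standalone illustration (the numbers here are made up for the example):

import numpy as np

tiny_num_tags = 4                 # pretend we only have 4 tags
tag_indices = [1, 0, 3]           # a made-up tag sequence
one_hot = np.eye(tiny_num_tags)[tag_indices]
print(one_hot)
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]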
Split the data into train and test sets and convert it to numpy arrays.
# Split into train and test sets (a ~10% test fraction matches the shapes printed below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)
X_train: (43159, 70)
X_test: (4796, 70)
y_train: (43159, 70, 18)
y_test: (4796, 70, 18)
Model
The model we want to build is the LSTM-CRF from Neural Architectures for Named Entity Recognition
vocab_size = len(index2word)
print(vocab_size, vocab_size ** 0.25)
35164 13.693818515221722
As a rule of thumb from Google, the dense embedding dimension can be set to roughly the fourth root of the vocabulary size, so we choose 14.
input_layer = layers.Input(shape=(max_length,))
model = layers.Embedding(vocab_size, 14, embeddings_initializer="uniform",
                         input_length=max_length)(input_layer)
# Recurrent dropout of 0.2 is used for extra robustness (the paper used 0.5).
# The LSTM hidden state has a dimension of 50 as suggested in the paper.
# return_sequences=True because we want to pass the output from all time steps to the next layer.
model = layers.Bidirectional(layers.LSTM(50, recurrent_dropout=0.2, return_sequences=True))(model)
# Connect a Dense layer to the output of the Bidirectional LSTM, with an output dimension of 100 as suggested in the paper.
model = layers.TimeDistributed(layers.Dense(100, activation="relu"))(model)
# The CRF layer is the output layer, and its output dimension must match the number of tags.
crf_layer = CRF(units=num_tags)
output_layer = crf_layer(model)
ner_model = Model(input_layer, output_layer)
# Use the CRF-specific loss objective and accuracy metric for the CRF layer
loss = losses.crf_loss
acc_metric = metrics.crf_accuracy
ner_model.compile(optimizer=tf.optimizers.Adam(lr=0.001), loss=loss, metrics=[acc_metric])
ner_model.summary()
Model: "model_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) (None, 70) 0 _________________________________________________________________ embedding_1 (Embedding) (None, 70, 14) 492296 _________________________________________________________________ bidirectional_1 (Bidirection (None, 70, 100) 26000 _________________________________________________________________ time_distributed_1 (TimeDist (None, 70, 100) 10100 _________________________________________________________________ crf_1 (CRF) (None, 70, 18) 2178 ================================================================= Total params: 530,574 Trainable params: 530,574 Non-trainable params: 0 _________________________________________________________________
history = ner_model.fit(X_train, y_train, batch_size=256, epochs=20, validation_split=0.1, verbose=2)
Train on 38843 samples, validate on 4316 samples
Epoch 1/20 - 54s - loss: 0.5942 - crf_accuracy: 0.8606 - val_loss: 0.2495 - val_crf_accuracy: 0.9499
Epoch 2/20 - 53s - loss: 0.2222 - crf_accuracy: 0.9509 - val_loss: 0.1734 - val_crf_accuracy: 0.9516
Epoch 3/20 - 48s - loss: 0.1259 - crf_accuracy: 0.9615 - val_loss: 0.1054 - val_crf_accuracy: 0.9679
Epoch 4/20 - 49s - loss: 0.0907 - crf_accuracy: 0.9732 - val_loss: 0.0842 - val_crf_accuracy: 0.9769
Epoch 5/20 - 48s - loss: 0.0683 - crf_accuracy: 0.9815 - val_loss: 0.0649 - val_crf_accuracy: 0.9833
Epoch 6/20 - 48s - loss: 0.0511 - crf_accuracy: 0.9866 - val_loss: 0.0532 - val_crf_accuracy: 0.9859
Epoch 7/20 - 49s - loss: 0.0417 - crf_accuracy: 0.9888 - val_loss: 0.0474 - val_crf_accuracy: 0.9868
Epoch 8/20 - 48s - loss: 0.0360 - crf_accuracy: 0.9899 - val_loss: 0.0440 - val_crf_accuracy: 0.9874
Epoch 9/20 - 55s - loss: 0.0319 - crf_accuracy: 0.9907 - val_loss: 0.0409 - val_crf_accuracy: 0.9879
Epoch 10/20 - 49s - loss: 0.0287 - crf_accuracy: 0.9913 - val_loss: 0.0389 - val_crf_accuracy: 0.9882
Epoch 11/20 - 48s - loss: 0.0260 - crf_accuracy: 0.9917 - val_loss: 0.0376 - val_crf_accuracy: 0.9884
Epoch 12/20 - 48s - loss: 0.0237 - crf_accuracy: 0.9922 - val_loss: 0.0361 - val_crf_accuracy: 0.9886
Epoch 13/20 - 49s - loss: 0.0215 - crf_accuracy: 0.9925 - val_loss: 0.0351 - val_crf_accuracy: 0.9886
Epoch 14/20 - 50s - loss: 0.0194 - crf_accuracy: 0.9929 - val_loss: 0.0336 - val_crf_accuracy: 0.9887
Epoch 15/20 - 51s - loss: 0.0175 - crf_accuracy: 0.9932 - val_loss: 0.0321 - val_crf_accuracy: 0.9888
Epoch 16/20 - 49s - loss: 0.0155 - crf_accuracy: 0.9935 - val_loss: 0.0315 - val_crf_accuracy: 0.9889
Epoch 17/20 - 49s - loss: 0.0136 - crf_accuracy: 0.9938 - val_loss: 0.0305 - val_crf_accuracy: 0.9891
Epoch 18/20 - 49s - loss: 0.0117 - crf_accuracy: 0.9941 - val_loss: 0.0293 - val_crf_accuracy: 0.9891
Epoch 19/20 - 49s - loss: 0.0099 - crf_accuracy: 0.9942 - val_loss: 0.0287 - val_crf_accuracy: 0.9891
Epoch 20/20 - 49s - loss: 0.0079 - crf_accuracy: 0.9944 - val_loss: 0.0280 - val_crf_accuracy: 0.9891
Predict and sanity test
Let’s try it on an example sentence: Joe went to Stanford University.
sentence = "Joe went to Stanford University" words = sentence.split() padded_words = words + [word2index["<_PAD_>"]] * (max_length - len(words)) padded_words_encoded = [word2index.get(w, 0) for w in padded_words] pred = ner_model.predict(np.array([padded_words_encoded])) # pred is in one-hot encoding, we need to convert it to dense encoding (by index) pred_dense = np.argmax(pred, axis=-1) retval = "" for w, p in zip(sentence, pred[0]): retval = retval + "{:15}: {:5}".format(w, index2tag[p]) + "\n" print(retval)
Joe            : I-per
went           : O
to             : O
Stanford       : B-org
University     : I-org
The output is as expected with Joe tagged as a person and Stanford University as an organization.
Evaluation
plot_history(history.history);
The performance is not bad, with accuracy approaching 99%, but it can definitely be higher with hyperparameter tuning (e.g. the dimension of the LSTM states, the embedding dimension, or using a pretrained embedding).
Furthermore, this 99% accuracy does not tell the full picture, because we should also look at per-class performance (e.g. how does the model do on B-org vs B-per?). To do this, we can plot the multi-class confusion matrix.
y_pred = ner_model.predict(X_test)
y_pred_dense = np.argmax(y_pred, axis=2)
y_test_dense = np.argmax(y_test, axis=2)

# First arg is the true label, second arg is the prediction.
# The result is a confusion matrix where each row represents a different true label
# and each column represents a different prediction.
cm = confusion_matrix(y_test_dense.flatten(), y_pred_dense.flatten())

plt.figure(figsize=(12,12))
tags = [index2tag[i] for i in range(num_tags)]
ax = sn.heatmap(cm+0.001, annot=False, square=True, norm=LogNorm(), xticklabels=tags, yticklabels=tags)
ax.set_title('Confusion Matrix of NER Tags')
ax.set_xlabel('Prediction')
ax.set_ylabel('Ground truth')
The confusion matrix is plotted as a heatmap in log scale. I had to use log scale because the dataset is dominated by the padding and O tags, which are not as interesting. One problem we can notice on the graph is that the B-art, I-art, B-nat, B-eve, I-eve, and I-nat columns are all black, which means our model never predicted these tags. This can be a problem. Let’s dig into it further by plotting the individual one-vs-all accuracy, precision, and recall (see Advanced Classification Metrics: Precision, Recall, and more).
cm2 = multilabel_confusion_matrix(y_test_dense.flatten(), y_pred_dense.flatten())
result = []
for i in range(num_tags):
    tag = index2tag[i]
    # Get the 2x2 confusion matrix for a particular tag:
    # [ true-neg,  false-pos ]
    # [ false-neg, true-pos  ]
    cm2b2 = cm2[i]
    result.append({
        "Tag": tag,
        "Accuracy": (cm2b2[0,0]+cm2b2[1,1])/np.sum(cm2b2),
        "Precision": cm2b2[1,1]/(cm2b2[0,1]+cm2b2[1,1]),
        "Recall": cm2b2[1,1]/(cm2b2[1,0]+cm2b2[1,1]),
    })
metrics_df = pd.DataFrame(result)
metrics_df
From the table, we can see a few problems:
- Recall is 0 for several tags. This agrees with our previous finding that the model is biased and does not produce any predictions for some of the tags.
- The NaN precision values are due to division by zero, because precision = True Positives / (TP + FP) and these tags were never predicted (see Advanced Classification Metrics: Precision, Recall, and more).
- Precision and recall are much worse than accuracy for B-per, I-geo, B-org, and I-org. This is a consequence of the imbalance in our data: it is much easier for the model to achieve high accuracy by predicting the popular tags and largely ignoring the rare ones.
Besides hyperparameter tuning, some other things we can try include:
- Using a larger data set
- Getting more balanced data for the rare labels
- Using a pretrained word embedding
Appendix (CRF layer)
The CRF layer models the conditional probability between adjacent tags, and the objective function is defined as below.
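As a sketch of the standard linear-chain CRF formulation (following Neural Architectures for Named Entity Recognition), the score of a tag sequence y for an input X and the resulting probability are:

s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)}

Here P_{i, y_i} is the score the network assigns to tag y_i for the i-th word, A_{y_i, y_{i+1}} is the transition score from tag y_i to tag y_{i+1}, and Y_X is the set of all possible tag sequences. Training maximizes the log-probability of the correct tag sequence.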
If you don’t understand all of this, it’s okay. The main point is that the CRF layer gives a way to optimize (1) the transition scores between tags and (2) the word-to-tag scores, so that the probability of the output tags (y) given the input word representations (x) is maximized.