With PyTorch code
In the rapidly evolving field of natural language processing, Transformers have emerged as the dominant models, demonstrating remarkable performance across a wide range of sequence modelling tasks, including part-of-speech tagging, named entity recognition, and chunking. Before the era of Transformers, Conditional Random Fields (CRFs) were the go-to tool for sequence modelling, in particular linear-chain CRFs, which model the label sequence as a linear chain, whereas CRFs in general can be defined over arbitrary graphs.
This text can be broken down as follows:
- Introduction
- Emission and Transition scores
- Loss function
- Efficient estimation of partition function through Forward Algorithm
- Viterbi Algorithm
- Full LSTM-CRF code
- Drawbacks and Conclusions
The implementation of CRFs in this article is based on this excellent tutorial. Please note that it is certainly not the most efficient implementation available, and it also lacks batching capability; however, it is relatively easy to read and understand, and since the aim of this tutorial is to get our heads around the inner workings of CRFs, it is perfectly suitable for us.
In sequence tagging problems, we deal with a sequence of input elements, such as the words in a sentence, where each element corresponds to a particular label or category. The primary objective is to assign the correct label to each individual element. In the LSTM-CRF model we can identify two key components for doing this: emission and transition probabilities. Note that we will actually work with scores in log space instead of probabilities, for numerical stability:
- Emission scores relate to the likelihood of observing a particular label for a given input element. In the context of named entity recognition, for instance, each word in a sequence is associated with one of three labels: beginning of an entity (B), intermediate word of an entity (I), or a word outside any entity (O). Emission probabilities quantify the probability of a particular word being associated with a particular label. This is expressed mathematically as P(y_i | x_i), where y_i denotes the label and x_i represents the input word. A minimal sketch of how such emission scores are typically produced is shown below.
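To make the emission side concrete, here is a minimal sketch (not the full tutorial model): a BiLSTM encodes the sentence and a linear layer projects each hidden state to one score per tag. The dimensions and variable names (`embedding_dim`, `hidden_dim`, `hidden_to_tag`, the toy sentence) are illustrative assumptions, not taken from the original code.

```python
import torch
import torch.nn as nn

tag_to_ix = {"B": 0, "I": 1, "O": 2}          # the three NER labels from the example above
vocab_size, embedding_dim, hidden_dim = 100, 8, 8  # toy sizes, assumed for illustration

embed = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim // 2, bidirectional=True)
hidden_to_tag = nn.Linear(hidden_dim, len(tag_to_ix))

sentence = torch.tensor([5, 12, 7])            # a toy 3-word sentence as word indices
embeds = embed(sentence).unsqueeze(1)          # (seq_len, batch=1, embedding_dim)
lstm_out, _ = lstm(embeds)                     # (seq_len, 1, hidden_dim)
emissions = hidden_to_tag(lstm_out.squeeze(1)) # (seq_len, num_tags)

# emissions[i][t] is the (log-space) emission score of tag t for word i
print(emissions.shape)                         # torch.Size([3, 3])
```

In the full LSTM-CRF, these per-word, per-tag scores are what the CRF layer combines with transition scores to score entire label sequences.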