Saccade: Sequence Labeling Parser

As mentioned in my previous article, résumé parsing is the automatic extraction of information from résumés and other types of candidate data. In the Natural Language Processing (NLP) field, the task of identifying and extracting relevant entities from text is known as Named Entity Recognition (NER). We can think of this task as the process of identifying relevant chunks in a text and assigning them a label. As an illustration:

2010 – 2013  –  Worked as a Software Developer at Avature Ltd.

For this example, let’s assume we are interested in the Period, Job Title, and Company. If we had to manually identify the relevant data in this string, we would mark “2010 – 2013” as the Period, “Software Developer” as the Job Title, and “Avature Ltd.” as the Company.
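That manual labeling can be sketched as a set of spans, one per category (a hypothetical representation for illustration, not a fixed format):

```python
# The example sentence and the spans we would mark by hand.
sentence = "2010 – 2013  –  Worked as a Software Developer at Avature Ltd."

labeled_spans = {
    "Period": "2010 – 2013",
    "Job Title": "Software Developer",
    "Company": "Avature Ltd.",
}

# Each labeled chunk is literally a substring of the sentence.
for label, chunk in labeled_spans.items():
    assert chunk in sentence
```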

And if we were given thousands of sentences containing similar information, we could probably find a pattern for each of those categories. Namely, “periods” usually contain digits, a start date, and an end date. Automatically recognizing patterns in data is a machine learning task known as Pattern Recognition. It is an effective approach to extracting entities from text and, hence, a good strategy for doing Named Entity Recognition.
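Before any learning happens, we could encode such a pattern as a hand-written rule. A minimal sketch, assuming periods look like “YYYY – YYYY” (a machine learning model would learn regularities like this from labeled data instead of having them spelled out):

```python
import re

# A hand-written pattern for the "period" category: a four-digit
# start year, a dash (hyphen, en dash, or em dash), and an end year.
PERIOD = re.compile(r"\b(\d{4})\s*[-–—]\s*(\d{4})\b")

match = PERIOD.search("2010 – 2013  –  Worked as a Software Developer at Avature Ltd.")
print(match.group(0))  # → 2010 – 2013
```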

But recognizing isolated patterns is not the same as recognizing patterns in a text in which the order of the information matters. If we were given the task of manually labeling “periods,” we might consider the whole string as a sequence that contains a start date, a dash or hyphen, and an end date. Labeling sequences turns out to be more efficient in tasks like résumé parsing because it can take advantage of the context in which the relevant information appears.
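Sequence labeling treats the sentence as a sequence of tokens, with one label per token. One common convention (an assumption here, not necessarily the scheme used in production) is BIO tagging, where B- marks the beginning of a chunk, I- its continuation, and O everything outside:

```python
# The example sentence, tokenized, with one BIO label per token.
tokens = ["2010", "–", "2013", "–", "Worked", "as", "a",
          "Software", "Developer", "at", "Avature", "Ltd."]
labels = ["B-PERIOD", "I-PERIOD", "I-PERIOD", "O", "O", "O", "O",
          "B-TITLE", "I-TITLE", "O", "B-COMPANY", "I-COMPANY"]

# Sequence labeling assigns exactly one label to each token.
assert len(tokens) == len(labels)
```

Note how the dash inside “2010 – 2013” gets an I-PERIOD label because of its context, while the second dash gets O: the same token receives different labels depending on its neighbors.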

So, when parsing résumés, we do not actually identify isolated keywords like “Ltd.” or “Engineer.” We care about the whole sequence of words that constitutes a company, an address, a degree, a job title, or any other field.

Now, the main question is: how does a machine learning algorithm correctly categorize each sequence of words? We need to feed it résumés whose strings have been manually labeled as addresses, phone numbers, e-mails, job titles, and so on, so that it has enough data to learn the pattern behind each category. Provided we have such a dataset, we can select an algorithm suitable for the task.

One of the standard algorithms used for sequence labeling is known as Conditional Random Fields (CRF). Briefly, this probabilistic algorithm takes a set of features from each word in a sentence and is trained to predict a sequence of labels for the full sentence. Features are characteristics that distinguish one word from another; we could say that features help identify words, much like hashtags help identify posts on Instagram or Twitter. In a nutshell, CRF predicts the probability of assigning a label Y to a word X in a particular context Z. For instance, the probability of assigning the label “company” to the sequence “My Company Inc.” within a Work Experience section.

Features are typically yes/no questions like “is a number” or “is lowercase.” Their binary answers are transformed into real-valued numbers, 1s and 0s, which are subsequently fed into the CRF algorithm, basically a box of math functions that receives numbers and performs several complex calculations on them.
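A minimal sketch of this transformation, with a toy set of yes/no features (the feature names are illustrative, not a real CRF library’s API):

```python
def word_features(word):
    """Binary yes/no features for a single word (a toy set)."""
    return {
        "is_number": word.isdigit(),
        "is_lowercase": word.islower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
    }

# The yes/no answers become 1s and 0s before reaching the algorithm.
vector = [int(answer) for answer in word_features("2010").values()]
print(vector)  # → [1, 0, 0, 1]
```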

We aim to find the most appropriate set of features to help the machine learning model predict the likelihood of every label for a given word, considering the previous and following words. To illustrate, given the sequence “Avature Ltd.,” the model is expected to identify that string and assign it the label “company.”
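Context enters the model through the features themselves: each token’s feature set also describes its neighbors. A sketch of one common CRF feature template (the names and sentinel values are assumptions for illustration):

```python
def token_features(tokens, i):
    """Features for token i, including its neighboring words."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # Context features: the surrounding words, with sentinels
        # for the sentence boundaries.
        "prev.word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = ["Worked", "at", "Avature", "Ltd."]
features = token_features(tokens, 2)  # features for "Avature"
```

Seeing “at” to the left and “Ltd.” to the right is strong evidence that “Avature” starts a company name, even if the model has never seen that word before.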

Even though this approach seems effective, it requires a large set of manual features to identify every sequence of words in a résumé with enough detail. In other words, to make sure that dates and zip codes are not confused, “is a digit” is not enough; we need features like “isZipCode.” And since zip codes vary from country to country, it becomes necessary to add even more granularity with features like “isUSzipCode.” Thus, the number of features can grow very rapidly.
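A sketch of that escalation, from a generic feature to a country-specific one (the US format of five digits plus an optional four-digit extension is an assumption of the example):

```python
import re

def is_digit_run(word):
    """Generic feature: true for both years and many zip codes."""
    return word.isdigit()

def is_us_zip_code(word):
    """More granular feature: 5 digits, optionally followed by -4 digits."""
    return re.fullmatch(r"\d{5}(-\d{4})?", word) is not None

# "is a digit" cannot tell a year from a zip code...
assert is_digit_run("2010") and is_digit_run("90210")
# ...but the country-specific feature can.
assert not is_us_zip_code("2010") and is_us_zip_code("90210")
```

Multiply this by every ambiguous category and every country, and the feature set balloons, which is exactly the maintenance burden described above.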

Not to mention that, in spite of all the complex math the algorithm does, it is still not good at crucial linguistic tasks such as modeling the semantic similarity between words. This matters because identifying that similarity implies knowing the meaning of a word, a key aspect of understanding and processing natural language.

Fortunately, there is one technique that kills two birds with one stone. On the one hand, it helps reduce the number of manual features; on the other, it is “surprisingly good at capturing syntactic and semantic regularities in language.”1 It generalizes well to unseen examples and thus helps improve sequence labeling accuracy. This technique has become one of the most successful and widely used in the NLP industry in recent years. But I’ll talk more about it in my next article.

1. Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations.