Developing a Résumé Parser

There’s no doubt AI is becoming the new sensation in human capital management software. There are plenty of processes that can be automated: information seeking, candidate and job matching, recommendation, and résumé parsing. In this article, we’ll focus on résumé parsing, an aspect of the hiring process that has become essential for the industry in recent years because it provides a noticeable competitive advantage.

In a nutshell, résumé parsing is the automatic extraction of information from résumés/CVs and other types of candidate data. More precisely, it is the processing of documents in a semi-structured format (such as résumés, snippets, or digital profiles) to organize their information in a form that’s more suitable for storage in a database. The software that does this is known as a parser.
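To make "suitable for storage in a database" a little more concrete, here is a minimal sketch of what a parser's output might look like. The class and field names (`CandidateProfile`, `WorkExperience`, and so on) are our own illustrative assumptions, not a standard schema or any particular product's data model.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class WorkExperience:
    # One job entry; every field except the company is optional because
    # résumés often omit or obscure some of them.
    company: str
    title: Optional[str] = None
    start: Optional[str] = None
    end: Optional[str] = None


@dataclass
class CandidateProfile:
    # Hypothetical target record a parser would fill in.
    name: Optional[str] = None
    email: Optional[str] = None
    experience: List[WorkExperience] = field(default_factory=list)
    education: List[str] = field(default_factory=list)
    languages: List[str] = field(default_factory=list)


# An illustrative record with one work-experience entry.
profile = CandidateProfile(
    experience=[WorkExperience(company="Avature", start="Apr 2010", end="Dec 2015")]
)

# asdict() turns the nested dataclasses into plain dicts and lists,
# which map naturally onto a database row or a JSON document.
record = asdict(profile)
```

The point of the sketch is only that free-form text ends up as named, typed fields; which fields exist is a design decision for each parser.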

But not all parsers extract the same kind of information for every language. The fields a résumé parser detects depend mainly on user preference and range from contact info and work experience to education history, languages, and beyond. The biggest advantage of this technology is that it saves people hours of manual work and gives them a well-organized database of candidate profiles. In the era of automation, résumé parsing is turning into a must for all recruiters.

This task may not seem too challenging at first sight. One might think that résumés are typically very clear, and information is presented in a simple way to catch the eye. But for a computer, a résumé is just a sequence of lines of text.

So, although reading and understanding a simple sentence like “I work as a Java Developer at Avature,” or “Company: Avature, (Apr 2010 – Dec 2015),” may be very straightforward to us, it’s actually a high-level cognitive task. And it turns out that processing statements like these algorithmically is challenging: as humans, we know that we understand them, but we cannot easily describe how. Because language is always creative and there are many ways to express the same information in text, the task of résumé parsing is far from trivial.

Early attempts were made using hand-crafted rules but, as we’ve said, language is infinitely varied and creative. So, in practice, rule-based systems can recognize numerous cases and patterns, but they are very hard to maintain. Once basic performance is achieved, the effort necessary to improve these résumé parsers is prohibitive.
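To see why hand-crafted rules are hard to maintain, consider a minimal sketch with two regular-expression rules, one for each phrasing of the kind quoted earlier. The rules and the `extract` helper are hypothetical and written only for illustration; a real rule-based parser would need far more of them.

```python
import re
from typing import Dict, Optional

# Each rule targets exactly one way of phrasing the information.
RULES = [
    # "I work as a Java Developer at Avature"
    re.compile(r"I work as an? (?P<title>.+?) at (?P<company>[A-Z]\w*)"),
    # "Company: Avature, (Apr 2010 – Dec 2015)"
    re.compile(
        r"Company:\s*(?P<company>[^,(]+?)\s*,?\s*"
        r"\((?P<start>[A-Za-z]{3} \d{4})\s*[–-]\s*(?P<end>[A-Za-z]{3} \d{4})\)"
    ),
]


def extract(line: str) -> Optional[Dict[str, str]]:
    """Return the fields captured by the first matching rule, or None."""
    for rule in RULES:
        match = rule.search(line)
        if match:
            return match.groupdict()
    return None


first = extract("I work as a Java Developer at Avature")
second = extract("Company: Avature, (Apr 2010 – Dec 2015)")
# Same information, different wording: no rule matches.
third = extract("Currently employed by Avature as a Java Developer")
```

Here `first` and `second` come back with the expected fields, while `third` is `None` even though it says the same thing as the first sentence. Covering it means writing yet another rule, and so on for every new phrasing.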

This is one of the reasons why this explicit, rule-based approach was slowly abandoned and AI and machine learning solutions began to gain popularity in the field. These implicit approaches use statistics to build computational models from a large quantity of annotated data. When applied to new, unseen résumés, these models are able to predict and extract relevant information.
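As a toy illustration of learning from annotated data, the sketch below trains a tiny Naive Bayes classifier that labels a résumé line as "experience" or "education". Naive Bayes is simply a convenient stand-in here, not a claim about what production parsers use, and the six annotated lines are made up for the example; real systems learn from far larger datasets.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercase alphabetic tokens only; deliberately simplistic.
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLineClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples: Iterable[Tuple[str, str]]) -> "NaiveBayesLineClassifier":
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text: str) -> str:
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical annotated lines (text, label).
ANNOTATED = [
    ("Java Developer at Avature", "experience"),
    ("Senior Engineer at Acme Corp", "experience"),
    ("Worked as a QA Analyst", "experience"),
    ("BSc in Computer Science", "education"),
    ("University of Buenos Aires", "education"),
    ("MSc in Linguistics", "education"),
]

model = NaiveBayesLineClassifier().fit(ANNOTATED)
unseen_job = model.predict("Python Developer at Initech")
unseen_degree = model.predict("PhD in Computer Science")
```

Neither unseen line appears in the training data, yet the model assigns them to "experience" and "education" respectively because of the words they share with the annotated examples. Nobody wrote a rule for "Python Developer"; the pattern was picked up from the data.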

Ideally, we’d like to have a perfect parser that can fully understand all the concepts and relationships within a résumé, but this is still too complex a problem. However, improvements are being made all the time. While you are reading this article, engineers and testers are working hard to perfect résumé parsers by doing research and using the most up-to-date technologies.

There’s still a long way to go, but the accuracy of résumé parsers is increasing very rapidly. Ultimately, we are talking about making a computer understand and interpret human language. We are talking about artificial intelligence.