January 01, 2013

Unstructured Data Analysis: From Mystery to Magic

Competing on analytics has produced its share of poster boys in the tech boom. Agile companies that adopted new technologies early and embraced innovation stood tall above the competition. Moore’s law made processing easier, cheaper and faster. Software costs came down, and a growing body of open source software continues to make analytics even more exciting. On this new frontier, unstructured data analysis has been the most prominent area. The field goes by many related, though sometimes loosely used, names: Natural Language Processing, Computational Linguistics, and text or web mining are but a few. In this article, we unravel the mystery and power of unstructured data analysis so that managers can implement it and use it to their advantage.

What is Unstructured Data in a Firm?

An enterprise generates unstructured data in multiple forms and formats: emails, contracts, policies, standard operating procedures, meeting minutes, SharePoint and shared drive documents, presentations, consulting reports, audit reports, archives, invoices and customer feedback. The list is endless. Structured data, by contrast, comes in the form of tables, which most of us know through spreadsheet packages like Excel. Many of us can “play” around with structured data using our Excel skills, but need the company’s analytics person to analyze feedback from multiple customers. We need additional skills and tools to wade through the maze. Meanwhile, companies cannot afford to ignore critical data in text form, yet they do not possess the capacity to analyze data beyond the immediate purpose for which it was generated.

Challenge of Unstructured Data

In many ways text is like data, but it is important to keep in mind that text is not data. Even where text resembles data, it is not ready for analysis as it stands; it needs some massaging and “structuring” to mold it into a shape fit for analysis.

To understand the challenge of unstructured data, think back to grade school. How different were Mathematics and English as subjects for you? Solving the puzzle really boils down to playing English by the rules of mathematics and statistics while still treating it as English, and therein lies the challenge. In other words, one needs to convert text to numbers so that powerful algorithms can be applied for meaningful analyses.
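One common way to turn text into numbers is a "bag of words": each document becomes a row of word counts. The sketch below is a minimal illustration of that idea using only the Python standard library. The function names and the two feedback snippets are invented for this example and are not taken from any particular tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); each matrix row counts the vocabulary words in one document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word seen, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical customer feedback snippets, invented for illustration.
feedback = [
    "Delivery was late and support was slow",
    "Great support, fast delivery",
]
vocabulary, matrix = bag_of_words(feedback)

print(vocabulary)
for row in matrix:
    print(row)
```

Once text is in this tabular form, it can be handled much like any other structured data. Note that word order and context are lost in the process, which is one reason the nuances of language remain hard to capture.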

Text data has rules of syntax, grammar and expression, so the same words can carry different meanings. “How true!” is not the same as “How is it true?” Interpretation is also domain-sensitive: the same text could acquire different meanings when used in media and entertainment or in, say, medical research. Likewise, there are dialect-specific or culture-specific nuances, sarcasm and emotions that alter meaning and that must be inferred from context rather than from mere words. All of this complicates analysis.

Complexity in analysis brings secondary problems of its own. One needs powerful algorithms and a model trained specifically for the task, since most text mining applications are context-specific, and then one needs large-scale processing. In fact, all serious users of text mining get to know the latest buzzword, “big data”, very quickly.

Another challenge is the accuracy achieved when predictive analytics or algorithms are used to classify text data. Often 60-70% accuracy is the best achievable. That may still be an asset compared to not using text mining at all, because the analysis possibly took seconds and one is at least no worse off for using it. The outlook for addressing this issue is optimistic, given recent improvements in research, accessibility and computational power.
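To make "accuracy" concrete, the sketch below computes it as the share of items labelled correctly and compares it with a naive baseline that always guesses the most common label. The labels are invented for illustration and do not come from any real project.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of items whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always guessing the most common true label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Hypothetical labels for ten feedback messages, invented for illustration.
actual = ["neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"]
predicted = ["neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg", "neg"]

print("model accuracy:", accuracy(predicted, actual))
print("baseline accuracy:", majority_baseline(actual))
```

A classifier is only worth deploying if it beats a baseline of this kind by a useful margin, which is why a modest-looking accuracy figure can still represent a real gain.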

The Good News

The challenge is steep, but we have come a long way. As mentioned earlier, advances in processing power, bandwidth and big data technology mean that storage and computing capacity once available only to nations and the biggest corporations is now available to lay users at a pittance. Growing research is encouraging adoption in an increasing number of areas. Open source tools are available, and sophisticated add-ons to them are updated frequently. Many popular analysis and statistical packages also have a text mining add-on or option, and many of these are very sophisticated.

A Case Study: Recruitment Streamlining through Algorithm-Based Candidate Profile Classification

In a pilot done by DCR Workforce, candidate profiles were used for machine learning: an algorithm was “trained” to pick up features from a sample data set of resumes of candidates applying for a technical job position. After its classification of candidates into Accept/Reject categories was validated, the algorithm was run on fresh resumes. Further refinement and a larger database improved outcomes and produced a score for each unseen resume. Recruiters could now review profiles and shortlist candidates for interview in decreasing order of priority. Valuable hours were saved for recruiters, and metrics such as resumes read per candidate selected, resumes maintained, resumes shortlisted and selectivity ratio all showed significant improvement.
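The article does not say which algorithm the pilot used. As a rough illustration of the general workflow (train on labelled resumes, then score and rank unseen ones), here is a minimal sketch using a naive Bayes classifier, one common choice for text classification. The resume snippets, labels and function names are all invented for this example, and the sketch assumes both labels appear in the training data.

```python
import math
import re
from collections import Counter

LABELS = ("accept", "reject")


def tokenize(text):
    """Lowercase the text and split it into word tokens (keeps + and # for terms like c++ or c#)."""
    return re.findall(r"[a-z+#]+", text.lower())


def train(labelled_resumes):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labelled_resumes:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocabulary = set(word_counts["accept"]) | set(word_counts["reject"])
    return word_counts, doc_counts, vocabulary


def accept_score(model, text):
    """Estimated probability of 'accept' under naive Bayes with add-one smoothing."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                lp += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        log_prob[label] = lp
    # Normalize the two log scores into a probability for "accept".
    top = max(log_prob.values())
    weights = {label: math.exp(lp - top) for label, lp in log_prob.items()}
    return weights["accept"] / (weights["accept"] + weights["reject"])


def rank(model, resumes):
    """Return (score, resume) pairs, highest score first."""
    return sorted(((accept_score(model, r), r) for r in resumes), reverse=True)


# Hypothetical training data, invented for illustration.
training = [
    ("python sql machine learning projects", "accept"),
    ("java python data pipelines", "accept"),
    ("retail sales cashier experience", "reject"),
    ("customer service cashier", "reject"),
]
model = train(training)
ranked = rank(model, ["sales cashier", "python data engineer"])
for score, resume in ranked:
    print(round(score, 2), resume)
```

A recruiter would read the ranked list from the top down. A real system would need far more training data, validation against human decisions, and care to avoid learning unfair patterns from past hiring.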

Going Forward

As you can imagine, many big data sources are unstructured, which is why the analysis of unstructured data has emerged as a popular topic for managers across a whole array of industries. Industry leaders are starting to recognize the importance of unlocking this valuable information by extracting it into a manageable size and format. Having insight into the “big picture”, rather than into a simple sample set, will allow companies to make key decisions that enhance business performance. The mystery of unstructured data can truly turn into a magical formula for companies in all sectors.

Next month, we continue our examination of utilizing unstructured data, particularly the methods for extraction.  Look out for an interesting follow-up article in which we discuss another application of text mining using customer feedback data.