The human-generated and people-oriented nature of unstructured data is both an unprecedented asset and a disruptive force. Its value lies in its ability to capture the desires, hopes, dreams, preferences, buying habits, likes and dislikes of everyday people, whether individually or in aggregate. The disruptive nature of this data stems from two attributes:
· It’s raw material. It requires processing to translate it into a format that machines, and therefore people, can understand and act upon at scale.
· It offers a window into human behavior and attitudes. When enriched with demographic and location information, this data can yield an unprecedented level of insight and, potentially, raise privacy concerns.
Unstructured data requires a number of processes and technologies to:
· Identify the appropriate sources
· Crawl and extract it
· Detect and interpret the language being used
· Filter it for spam
· Categorize it for relevance (e.g., “Gap store” versus “trade gap”)
· Analyze the content for context (sentiment, tone, intensity, keywords, location, demographic information)
· Classify it so the business can act on it (a customer service issue, a request for a product enhancement, a question, etc.)
Each of these steps is rife with nuances that require both sophisticated technologies and processes to address (see Figure 1).
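The later steps in the list above (language detection, spam filtering, relevance, classification) can be sketched as a small pipeline. This is a minimal illustration, not a description of any particular tool: the `Post` type, the function names, the keyword rules, the stopword lists and the sample posts are all hypothetical stand-ins for the trained models and curated sources a real system would rely on.

```python
from dataclasses import dataclass

# Hypothetical stopword lists; a real system would use a trained language detector.
STOPWORDS = {
    "en": {"the", "is", "and", "my", "at", "was", "a", "to"},
    "fr": {"le", "la", "les", "une", "est", "et", "de", "des"},
}


@dataclass
class Post:
    text: str
    language: str = "unknown"
    is_spam: bool = False
    relevant: bool = False
    category: str = "unclassified"


def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]


def detect_language(post):
    # Pick the language whose stopwords overlap most with the post.
    words = set(tokens(post.text))
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    post.language = best if scores[best] > 0 else "unknown"
    return post


def flag_spam(post):
    # Toy rule: one telltale phrase marks the post as spam.
    post.is_spam = "click here" in post.text.lower()
    return post


def score_relevance(post):
    # "Gap store" versus "trade gap": keep the brand, drop the economic term.
    mentions_gap = "gap" in tokens(post.text)
    post.relevant = mentions_gap and "trade gap" not in post.text.lower()
    return post


def classify(post):
    # Only relevant, non-spam posts are routed to the business.
    if post.is_spam or not post.relevant:
        return post
    text = post.text.lower()
    if "?" in text:
        post.category = "question"
    elif any(word in text for word in ("broken", "refund", "rude")):
        post.category = "customer service issue"
    elif "wish" in text:
        post.category = "product enhancement request"
    else:
        post.category = "other"
    return post


def run_pipeline(texts, supported_languages=("en",)):
    """Return (kept, skipped); posts in unsupported languages are skipped."""
    kept, skipped = [], []
    for text in texts:
        post = detect_language(Post(text))
        if post.language not in supported_languages:
            skipped.append(post)
            continue
        for step in (flag_spam, score_relevance, classify):
            post = step(post)
        kept.append(post)
    return kept, skipped


# Invented sample posts, for illustration only.
SAMPLE = [
    "The zipper on my Gap jacket was broken at the store.",
    "The trade gap is widening and exports fell.",
    "Click here to win a Gap gift card!",
    "La veste Gap est trop petite et le magasin est loin.",
]

if __name__ == "__main__":
    kept, skipped = run_pipeline(SAMPLE)
    for post in kept:
        print(post.language, post.is_spam, post.relevant, post.category)
    print("skipped:", [post.text for post in skipped])
```

With the default `supported_languages=("en",)`, the French post never reaches the later steps, which is one way a gap in language support can silently remove a signal from the results. Passing `supported_languages=("en", "fr")` lets it through.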
The above challenges add up to a host of risks: missed signals, inaccurate conclusions, bad decisions, high total cost of data and tool ownership, and an inability to scale, among others. Even a small misstep, such as a missing source, a disparity in filtering algorithms, or a lack of language support, can have a significant detrimental effect on the trustworthiness of the results.
A recent story in Foreign Policy magazine provides a timely example. “Why Big Data Missed the Early Warning Signs of Ebola” highlights the importance of an early media report published by Xinhua’s French-language newswire covering a press conference about an outbreak of an unidentified hemorrhagic fever in the Macenta prefecture in Guinea.
The Foreign Policy article debunks some of the hyperbole about the role of Big Data in identifying Ebola, not because the technology wasn’t available (it was) or because the indications weren’t there (they were), but because, as author Kalev Leetaru writes, “part of the problem is that the majority of media in Guinea is not published in English, while most monitoring systems today emphasize English-language material.”
About the Author:
Etlinger conducts independent research and advises global executives on data and analytics strategy. She is a TED speaker, is regularly asked to speak on data strategy and best practices, and has been quoted in media outlets such as The Wall Street Journal, Forbes, The New York Times and BBC. Find Etlinger on Twitter at @setlinger or on her blog at susanetlinger.com.