Text Analytics and Enterprise Content Management
By billy.cripe on Jun 05, 2009
Text Analytics is an interesting sub-genre of Business Intelligence. It does a lot, but if I had to condense it to layman's terms: it figures out the important bits of some text you give it, transforms those bits, stores them, and mashes them up with other bits for Business Intelligence analysis.
But if you think about it this way it gets vastly more exciting: TA systems read your document and tell you what it's about. Hey, COOL! Now it is not all HAL 9000. Most TA vendors came out of (or are currently part of) the Data Warehousing and Business Intelligence spaces. As such, they are used to talking about transforming inputs with ETL (extract, transform, load) processes, storing them and then running different reporting algorithms (OLAP etc) against the aggregate data sets. That works really well for structured data - stuff in databases, spreadsheets, and in surveys.
But what do they do with that bit at the end of the survey where it asks "is there anything else you would like us to know?" Do they have teams of college intern readers who manually index that free form data? No. That's where Text Analytics comes in. At least that was the overwhelming theme of the Text Analytics Summit I attended in Boston this week.
All of the vendors and case studies were about finding and accessing the "Voice of the Customer". Initially I thought, "Cool, they're getting into what the customers are saying not just what form data the customers are filling out". But it seems that this is ALL they are doing.
I was impressed with all of the vendors and technologies there (GATE, Attensity, Clarabridge, SPSS, Lexalytics, SAS, and Endeca to name a few). They are doing some very interesting things in the realms of document and topic categorization, grammar, dictionary, thesaurus and even sentiment analysis. (Is this free text customer review of my hotel good or bad? - COOL!)
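To make the hotel review question concrete, here is a minimal sketch of the simplest dictionary-based approach to sentiment. The word lists and the `sentiment` function are hypothetical, made up for illustration; this is not how any of the vendors above actually do it, and real products layer grammar rules and much larger dictionaries on top of something like this.

```python
import re

# Hypothetical toy lexicon, invented for this example.
POSITIVE = {"clean", "friendly", "great", "comfortable", "helpful"}
NEGATIVE = {"dirty", "rude", "noisy", "broken", "slow"}

def sentiment(text):
    """Label free text as good, bad, or neutral by counting lexicon hits."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Each positive word adds one to the score, each negative word subtracts one.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "neutral"

print(sentiment("The staff were friendly and the room was clean."))
```

A counter like this has no idea what to do with negation ("not clean") or sarcasm, which is exactly where the grammar and thesaurus work comes in.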
But I was disappointed that so many of them are focused on simple ETL for documents. It was as if they were imprisoned by a traditional data warehousing/bizintel myopia and didn't/couldn't/wouldn't worry about the vast amount of truly unstructured (as opposed to lightly or slightly structured) content that needs analyzing. I guess this makes sense, since most come from the BI and Data Warehousing world. Still, it seems to me that having a defined body of content in an ECM system would be a pretty exciting target for these companies. Instead they're all (ALL!) looking at things like survey responses. Don't get me wrong: being able to get at the free form text box at the end of a survey, pull out key words and phrases, and tell whether the review is good or bad is cool. But it is still lightly structured content and therefore still closer to structured data than a web page, product manual or contract is.
The one exception to this was Endeca's chief scientist Daniel Tunkelang, who writes over at The Noisy Channel. He seems to "get" the bigger picture and the potential that TA has in conjunction with vast content stores. The others gave me assurances that they "could" do TA against an ECM store (e.g. to output RDF triples describing the entities' relationships to each other and to the containing documents), and while I sincerely believe them, I was left with the impression that they all kept thinking, "but why in the world would you want to do that?"
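For anyone wondering what that RDF output might look like, here is a rough sketch that writes entity relationships out as N-Triples lines. Everything in it is an assumption made for illustration: the `http://example.org/` namespace, the `mentionedIn` predicate, the `to_ntriples` helper and the sample entities are all invented, and the entity extraction itself (the hard part) is assumed to have already happened.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary

def uri(kind, name):
    """Build an N-Triples URI reference such as <http://example.org/entity/Acme>."""
    return f"<{BASE}{kind}/{quote(name)}>"

def to_ntriples(doc_id, relations):
    """Turn (subject, predicate, object) entity relations found in one document
    into N-Triples lines, linking each entity back to the containing document."""
    lines = []
    seen = set()
    for subj, pred, obj in relations:
        # The relationship between the two entities.
        lines.append(f"{uri('entity', subj)} {uri('rel', pred)} {uri('entity', obj)} .")
        # One link per entity back to the document it was found in.
        for ent in (subj, obj):
            if ent not in seen:
                seen.add(ent)
                lines.append(
                    f"{uri('entity', ent)} {uri('rel', 'mentionedIn')} {uri('doc', doc_id)} ."
                )
    return lines

for line in to_ntriples("contract-42", [("Acme", "supplies", "Globex")]):
    print(line)
```

Run that across every contract, manual and web page in a content store and you get a graph you can query across documents, which is the part a survey-response dashboard never gives you.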
Sigh. I'll have to build it to prove it. I think ignoring ECM content is a mistake for the TA software vendors out there. Doing ETL against lightly-structured survey and feedback form data so it fits in with the rest of your BI OLAP cubes may be low-hanging fruit, but it still ignores the digital dark matter of the information universe. There needs to be some thought leadership from within the TA community, and looking for it in the micro-incrementalism of Data Warehousing improvements is keeping these folks from taking the entire "enterprise information management" sector by storm. Stop working on better dashboards and start thinking about what the data you can get at can do when extracted, mashed, combined, categorized, related and parsed in new and novel ways. Some orthogonal thinking is called for.