Text Analytics and Enterprise Content Management

Text Analytics is an interesting sub-genre of Business Intelligence. It does a lot, but if I had to condense it to layman's terms: it figures out the important bits of some text you give it, transforms those bits, stores them, and mashes them up with other bits for Business Intelligence analysis.


But if you think about it this way, it gets vastly more exciting: TA systems read your document and tell you what it's about. Hey, COOL! Now, it is not all HAL 9000. Most TA vendors came out of (or are currently part of) the Data Warehousing and Business Intelligence spaces. As such, they are used to talking about transforming inputs with ETL (extract, transform, load) processes, storing them, and then running different reporting algorithms (OLAP, etc.) against the aggregate data sets. That works really well for structured data - stuff in databases, spreadsheets, and surveys.

But what do they do with that bit at the end of the survey that asks, "Is there anything else you would like us to know?" Do they have teams of college intern readers who manually index that free-form data? No. That's where Text Analytics comes in. At least, that was the overwhelming and pervasive theme of the Text Analytics Summit I attended in Boston this week.
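To make the idea concrete, here is a minimal sketch of that "college intern" job in code: a toy transform step that pulls the most frequent non-stopword terms out of free-form survey answers so they can be loaded into a BI table alongside the structured fields. The stopword list and sample answers are invented for illustration; real TA engines use far richer linguistics than word counting.

```python
import re
from collections import Counter

# A deliberately tiny stopword list -- illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "to", "and", "of", "it",
             "very", "i", "my", "in", "for", "but", "not", "at", "out"}

def extract_keywords(response: str, top_n: int = 3) -> list[str]:
    """Tokenize a free-form survey answer and return the most frequent
    non-stopword terms -- the 'transform' step of a toy text-ETL pass."""
    tokens = re.findall(r"[a-z']+", response.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

answers = [
    "The checkout process was confusing and the checkout page kept timing out.",
    "Shipping was fast, but the packaging was damaged.",
]
# 'Load': pair each answer with its extracted terms, ready for a BI table.
rows = [(a, extract_keywords(a)) for a in answers]
for text, keywords in rows:
    print(keywords)
```

The point is not the (trivial) code but where the output lands: once the free-form answer is reduced to a few terms per row, it fits the same cubes and dashboards as the structured survey fields.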

All of the vendors and case studies were about finding and accessing the "Voice of the Customer". Initially I thought, "Cool, they're getting into what the customers are saying, not just what form data the customers are filling out." But it seems that this is ALL they are doing.

I was impressed with all of the vendors and technologies there (GATE, Attensity, Clarabridge, SPSS, Lexalytics, SAS, and Endeca, to name a few). They are doing some very interesting things in the realms of document and topic categorization, grammar, dictionaries, thesauri, and even sentiment analysis. (Is this free-text customer review of my hotel good or bad? COOL!)
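That hotel-review question can be sketched with nothing more than a word lexicon. This is a deliberately naive illustration (the word lists are made up, and the real products handle negation, grammar, and domain dictionaries far beyond this):

```python
# Toy sentiment lexicons -- invented for illustration.
POSITIVE = {"great", "clean", "friendly", "comfortable", "excellent", "good"}
NEGATIVE = {"dirty", "rude", "noisy", "broken", "terrible", "bad"}

def sentiment(review: str) -> str:
    """Score a review by counting lexicon hits: positive words add one,
    negative words subtract one, and the sign decides the label."""
    words = [w.strip(".,!?") for w in review.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The room was clean and the staff were friendly."))   # positive
print(sentiment("Noisy hallway, rude front desk, and a broken shower."))  # negative
```

Even this crude version shows why the vendors find it valuable: a pile of unreadable free text becomes a column you can chart.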

But I was disappointed that so many of them are focused on simple ETL for documents. It was like they were imprisoned by a traditional data warehousing/bizintel myopia and didn't/couldn't/wouldn't worry about the vast amount of unstructured (as opposed to lightly or slightly structured) content that is in need of being analyzed. I guess this makes sense, since most come from the BI and Data Warehousing world. Still, it seems to me that having a defined body of content in an ECM system would be a pretty exciting target for these companies. Instead they're all (ALL!) looking at things like survey responses. Don't get me wrong, being able to get at the free-form text box at the end of a survey to pull out key words and phrases, and whether or not the review is good or bad, is cool. But it is still lightly structured content, and therefore still closer to structured data than a web page, product manual, or contract.

The one exception to this was Endeca's chief scientist Daniel Tunkelang, who writes over at The Noisy Channel. He seems to "get" the bigger picture and the potential that TA has in conjunction with vast content stores. The others gave me assurances that they "could" do TA against an ECM store (e.g., to output RDF triples of the entity relationships to each other and to the containing documents), and while I sincerely believe them, I was left with the impression that they all kept thinking, "But why in the world would you want to do that?"
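For what it's worth, emitting those triples is not the hard part. Here is a hypothetical sketch that serializes extracted entity relations as N-Triples, with a "mentionedIn" link back to the containing document. The ex# namespace, predicate names, and entities are placeholders of my own, not any vendor's vocabulary; the genuinely hard work is the extraction that would feed this function.

```python
def to_ntriples(doc_uri: str, entities: list[tuple[str, str, str]]) -> list[str]:
    """Serialize (subject, predicate, object) entity relations as N-Triples
    lines, adding a link from each subject back to its source document."""
    ns = "http://example.org/ta#"  # placeholder namespace, not a real vocabulary
    lines = []
    for subj, pred, obj in entities:
        lines.append(f"<{ns}{subj}> <{ns}{pred}> <{ns}{obj}> .")
        lines.append(f"<{ns}{subj}> <{ns}mentionedIn> <{doc_uri}> .")
    return lines

# Hypothetical entities extracted from a contract in an ECM store.
triples = to_ntriples(
    "http://example.org/docs/contract-42",
    [("AcmeCorp", "partyTo", "Contract42"),
     ("Contract42", "expiresOn", "Term2010")],
)
for line in triples:
    print(line)
```

Once the relations exist as triples, they can be loaded into any standard RDF store and queried across the whole content repository, which is exactly the ECM-plus-TA combination argued for here.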

Sigh. I'll have to build it to prove it. I think this is a mistake for the TA software vendors out there. Doing ETL against lightly structured survey and feedback form data so it fits in with the rest of your BI OLAP cubes may be low-hanging fruit, but it is still ignoring the digital dark matter of the information universe. The TA community needs some thought leadership, and the micro-incrementalism of Data Warehousing improvements is keeping these folks from taking the entire "enterprise information management" sector by storm. Stop working on better dashboards and start thinking about what the data you can get at can do when extracted, mashed, combined, categorized, related, and parsed in new and novel ways. Some orthogonal thinking is called for.


Extraction is easy. Deciphering meaningful data that you can easily relate/compare/contrast/discover against is really, really hard. There's a LOT of big, really BIG picture thinking that has to happen there. I'm not sure they're making a mistake as in ignoring the situation; it is more like they are unable to take that step just yet. Someone will have to innovate to make the leap forward (one of your points) to really useful stuff, not just the low-hanging fruit you mentioned. Orthogonal thinking is expensive though. Scientific endeavors are expensive. They may be wanting to think about these things but just unable to fund the amount of thought required at this time. That makes them just like the rest of us, I guess.

Posted by Jason Stortz on June 05, 2009 at 06:43 AM CDT #

Billy, thanks for the shout-out. I was surprised not to see more of the TA vendors talking about this bigger picture--especially since I know that Endeca partners with some of them! In any case, I feel better knowing that the talk was well received, since I did feel like a bit of an outlier.

Posted by Daniel Tunkelang on June 05, 2009 at 11:18 AM CDT #

Billy, I can see some potential uses of the capabilities of TA (the knowledge uncovered by TA) for Enterprise 2.0 content. We can add another dimension to Enterprise 2.0 - that is, Universe 2.0. In short:

1. A powerful Universal TA engine crawls and analyses all the meaningful info/text on the web and offers its services to all those who are interested (subscribers) in the context of a specific need/topic/tag.

2. Enterprises have a local TA engine to crawl their entire Enterprise 2.0 content repository.

3. Now, all future Enterprise 2.0 communication/collaboration threads can be mashed up with this Universal and Enterprise knowledge uncovered by TA engines in the context of the present communication. Let's say that this is called recommendations. This will serve as a reference for the parties involved in collaboration and will effectively provide them with the "collective wisdom of the Universe" about the topic of discussion/collaboration.

4. The end result is: Enterprise 2.0 meets Universe 2.0. A perfect union. Faster and richer innovation. Higher productivity.

Posted by Vijay Prasad Gupta on June 07, 2009 at 05:04 AM CDT #

Vijay - I love it. You nailed the vision.

Posted by Billy on June 08, 2009 at 02:46 AM CDT #

You make excellent points! At the event Attensity focused on VOC because it provides a tangible, business-benefit example of what we are able to do with our Text Analytics capabilities. We also had two customers there talking about it.... Shame on us for not going into the other examples where we take our sophisticated semantic technologies and turn them into business value. We do this in the ECM space. We have a content system underneath our Customer Experience Management and E-Service offerings where we use our semantic technologies to enable business users (service agents, end customers, repair people, etc.) to access vast amounts of content (manuals, FAQs, videos, audio files, etc.) easily. Behind the scenes we are classifying, breaking down the content into doubles and triples, and looking at the roles and relationships between objects so that the user gets the right content, for the right issue, at the right time. The business benefit is clearly not ETL - and it goes beyond ECM: it is the combination of ECM and Text Analytics! Thank you for highlighting this.....

Posted by Michelle de Haaff on June 10, 2009 at 11:36 AM CDT #

Where Vijay hit the strategic vision, Michelle nails the tactical, available-right-now strategy. Michelle, your quote here is what nails it for me: "... we are classifying, breaking down the content ...and looking at the ...relationships between objects so that the user gets the right content, for the right issue, at the right time. The business benefit is clearly not ETL - and it goes beyond ECM: it is the combination of ECM and Text Analytics!" YES! TA provides the breakdown and the first pass at relationship definition. Search, ontologies, tracking, and usage-analytics data further interact to get the end user highly relevant and persuasive content - even if they never knew it was there. I'll have to look more into Attensity.

Posted by Billy on June 11, 2009 at 02:39 AM CDT #
