Ten Requirements for Achieving Collaboration #5: Data Portability & Referencing
By billy.cripe on Sep 21, 2009
We are in the midst of a series investigating collaboration. We previously wrote about the two types of collaboration - intentional and accidental.
INTENTIONAL: where we get together to achieve a goal and
ACCIDENTAL: where you interact with something of mine and I am never aware of your interaction
While intentional collaboration is good it is not where the bulk of untapped collaborative potential lies. Accidental collaboration is. But the challenge is to intentionally facilitate accidental collaboration. For the full list of 10 requirements see the original post. Last time I wrote about requirement #4: why we must be sure to enable the humans. While it is great if humans are empowered to consume, we must keep it easy for them to do so. Therefore enter requirement #5: the importance of data portability and ability to be referenced. After all, if we go to all that work to identify and extract data from content containers then it is a simple next step to ensure that the data is located where we want it when we need it.
Well, here we are. If you have followed along, we have identified data residing inside of documents, we have exploded content items, we have jailbroken data-and-relationship assertions, and we have made it easy for people to add their own experiences and expertise to those growing data sets. That is all well and good if and only if there is a consistent way to move that data around, to relate it to other data that may be (and usually is) in a different schema, and finally to address and obtain that data. This is practically taken for granted with traditional relational databases and on the web. After all, the primary key uniquely identifies a record, and the URL uniquely identifies a document on the web. We can get it, move it and relate it to other bits of information.
But isn't it curious that we reference items differently depending on where they're located? We need to know the structure of data first before we can even think about trying to reach it. Web pages have URLs, DB records have primary keys, filing cabinet files have a physical location, you and I have a postal address. That makes it awfully hard to combine data sets, perform meaningful comparisons or combine expertise with the confidence that we're not leaving out the most important information available simply because it is addressed differently.
Of course structured data practitioners solved this challenge ages ago (in web time). Data warehousing and data integration are all about bringing together differently classified, differently formatted data from disparate sources and schemas and doing cool things with that aggregated data set. In order for them to do those "cool things" though, they have to jump through some sophisticated transformations and data mappings.
Enter the ETL crews. ETL - extract, transform, load - is the process by which data is pulled from a source, changed into a common (often lowest common denominator style) format and loaded into the data warehouse (to be sure this is a gross oversimplification but you get the idea). While some variations on this theme exist, the principles are widely accepted. If you want to aggregate data you have to get it all into the same format and all into the same place.
This has several advantages:
1) even if you are not sure what you're looking for you have a good idea where it is (in the data warehouse)
2) if you know what you're looking for and you know where it is you don't have to worry about formatting mismatches or extra work smoothing data.
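As a rough illustration of the extract, transform, load flow and the two advantages listed above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the two source schemas (CRM_ROWS and WEB_ROWS), the Customer record and the dictionary standing in for a data warehouse are all hypothetical, and a real ETL tool does far more than this.

```python
from dataclasses import dataclass

# Two hypothetical sources describing customers in different schemas.
CRM_ROWS = [{"CustID": 7, "FullName": "Ada Lovelace", "Spend": "1,200.50"}]
WEB_ROWS = [{"id": "42", "name": "alan turing", "total_cents": 9900}]


@dataclass(frozen=True)
class Customer:
    """The common format every source is transformed into."""
    source: str
    key: str
    name: str
    spend: float


def extract():
    """Pull raw rows from each source, tagged with where they came from."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in WEB_ROWS:
        yield "web", row


def transform(source, row):
    """Map a source-specific row onto the common Customer format."""
    if source == "crm":
        spend = float(row["Spend"].replace(",", ""))
        return Customer("crm", str(row["CustID"]), row["FullName"].title(), spend)
    if source == "web":
        return Customer("web", row["id"], row["name"].title(), row["total_cents"] / 100)
    raise ValueError(f"unknown source: {source}")


def load(warehouse, records):
    """Store every record in one place, keyed by (source, key)."""
    for record in records:
        warehouse[(record.source, record.key)] = record
    return warehouse


# A plain dict stands in for the data warehouse in this sketch.
warehouse = load({}, (transform(source, row) for source, row in extract()))
```

Once the load step has run, every record lives in one place and in one shape, so a query such as summing spend across both sources needs no per-source special cases.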
To be sure, I exaggerate in order to make a point. Business Intelligence is good at what it does and continues to grow as a market because the intelligence it generates is reasonably complete and trustworthy. But maybe we should call it "small slice of operations automation intelligence" or "customer purchasing based on scan codes intelligence" rather than the grandiose and complete sounding BUSINESS INTELLIGENCE.
So what is an enterprise to do if it wants to do "cool things" with the information it generates - including unstructured information? Well Enterprise Content Management systems are one solution. After all, content (i.e. unstructured information) goes in, gets a persistent reference (URL/URI), gets some structured data describing it (metadata), and is indexed so that it can be found by a query later on. But what is missing is any reliable way of understanding what data exists in the content container.
If the document is an auto manual, you have no way of knowing what kind of engine the auto has without reading the manual. If the web page is about the movie District 9, you have no way of knowing if the picture there is of an actor, the director or the screenwriter without viewing it. Sure, full text indexing can tell you if certain words (tokens) exist. Metadata, if it is detailed enough, might tell you what you want to know. But metadata quickly reaches a point of diminishing returns. It is cumbersome and unreasonable to expect end users to fill out lengthy forms describing their document when all they want to do is stick it somewhere and forget about it (until they need it again). And besides, metadata that perfectly describes a content item simply replicates the object in database structure, in which case you have saved no time, processing power, or storage space.
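To make that gap concrete, here is a toy Python sketch of the content management pattern described above: content goes in, gets a persistent reference, gets some metadata, and is full-text indexed. The ContentRepository class, the urn:example:content: reference scheme and the sample documents are all made up for illustration and do not represent any particular product.

```python
import re
from collections import defaultdict


class ContentRepository:
    """Toy content store: persistent reference + metadata + full-text index."""

    def __init__(self, base_uri="urn:example:content:"):
        self.base_uri = base_uri          # hypothetical reference scheme
        self.items = {}                   # uri -> {"text": ..., "metadata": ...}
        self.index = defaultdict(set)     # token -> set of uris

    def check_in(self, text, **metadata):
        """Store the content, assign it a reference and index its tokens."""
        uri = f"{self.base_uri}{len(self.items) + 1}"
        self.items[uri] = {"text": text, "metadata": metadata}
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.index[token].add(uri)
        return uri

    def search(self, word):
        """Return the references of every item containing the token."""
        return sorted(self.index.get(word.lower(), set()))


repo = ContentRepository()
manual = repo.check_in(
    "This manual covers the 2.0 litre diesel engine and the gearbox.",
    title="Owner manual", doc_type="manual",
)
memo = repo.check_in(
    "Engine supplier meeting moved to Friday.",
    title="Memo", doc_type="memo",
)

# The index knows WHERE the token appears, not WHAT it means there.
hits = repo.search("engine")

# Nobody filled in an engine field, so the metadata cannot answer either.
engine_type = repo.items[manual]["metadata"].get("engine_type")
```

Both items come back for "engine", and the metadata has nothing to say about which engine the auto has. Answering that question still requires a person to open the manual and read it.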
Until now the only way to perform the necessary ETL operations on unstructured information to use it in context and concert with structured data was to consume it. Our brains are amazing at extracting and transforming data. Our brains became the data warehouse against which we created mental data cubes and drew conclusions. But such processes do not make the data portable. Data so extracted from unstructured information is locked within our brains. We may have vast data warehouses in our heads (Are you an expert? If yes, then you are the proud owner of a vast data warehouse in your head!) but finding and working with us can be difficult (especially if we are cranky!). So "expertise location" software was born. That still doesn't solve the data portability issue, and referencing it by saying, "Billy knows that answer" is not very scalable. So knowledge management software was born. But earlier versions of KM software (especially in the late '90s and early 2000s) took the wrong approach. They thought that the solution was to have people do the ETL operations on unstructured information as well as structured information (i.e. become experts) and then regurgitate that expertise (often by typing it) back into another system. Not only is this a horribly circuitous route, but it was nearly always "out-of-the-flow" of normal daily activities and so cumbersome to actually input that it quickly became an annoyance rather than a help. The end result was more unstructured information that was often so nuanced as to be quickly relegated to the corporate archives or so useless and generic that it went to the delete bin. The hopes of capturing institutional wisdom from those who had come before were on their death bed.
So ignoring 80% or more of organizational information is really not an option. Requiring people to digest information before systems can access it does not scale. What is an information savvy organization to do?
Wouldn't it be better if there were a way to automatically extract the concepts from unstructured information, provide context and relationships to those concepts, store them in a persistent, structured, referencable way that allows our systems to better "understand" what we want and need, find it using the sophisticated querying, clustering, faceting, slicing, cubing, and reporting technologies that we have developed and deliver it to us in compelling and persuasive ways?
If you are still wondering, the answer is "Yes".
If you are wondering if we can actually do that, read the previous entries in this series. The answer there is also, "Yes".
Text analytics, entity extractors, Natural Language Processors, and synopsis engines all exist today. Most will output structured RDF (portable, addressable, referencable). RDF can be stored and indexed in the Oracle Spatial 11.2 Database. That database will create semantically aware indices on those vast stores of RDF assertions (concepts contained in documents). We can search those indices with concept maps and relational graphs called "Ontologies" which are also structured, portable and referencable. Such searching, or content clustering, faceting, analyzing, and discovering becomes the basis for true, in-the-flow, "organic" knowledge management. The market is picking up on this and agreeing.
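For readers who have not seen RDF-style assertions before, the following Python sketch shows the general idea of storing extracted concepts as subject, predicate, object triples and querying them by pattern. It is only a stand-in: a plain Python set plays the role of the triple store, the match and engine_for_document helpers are hypothetical, every URI under http://example.com/ is invented, and none of this is the API of Oracle's database or of any RDF library.

```python
# Each assertion is a (subject, predicate, object) triple, as in RDF.
# All URIs below are made up for illustration.
EX = "http://example.com/"

TRIPLES = {
    (EX + "doc/manual-17", EX + "rel/describes", EX + "auto/roadster"),
    (EX + "auto/roadster", EX + "rel/hasEngine", EX + "engine/diesel-2.0"),
    (EX + "doc/memo-3", EX + "rel/mentions", EX + "engine/diesel-2.0"),
    (EX + "engine/diesel-2.0", EX + "rel/isA", EX + "class/DieselEngine"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def engine_for_document(triples, doc):
    """Follow describes -> hasEngine, so nobody has to read the document."""
    engines = []
    for _, _, auto in match(triples, s=doc, p=EX + "rel/describes"):
        for _, _, engine in match(triples, s=auto, p=EX + "rel/hasEngine"):
            engines.append(engine)
    return engines


engines = engine_for_document(TRIPLES, EX + "doc/manual-17")
related = match(TRIPLES, o=EX + "engine/diesel-2.0")
```

Because every subject, predicate and object is itself an address, the same assertion can be moved to another store or joined with assertions from another source without first agreeing on a table layout. In a real deployment the pattern matching would be done by the store's query language rather than by a hand-written loop.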
This is the cornerstone to the foundation of the Wise Enterprise.
Next time we will continue the series by investigating requirement #6, where the accessibility theme continues and we discuss how to keep the data extracted from content accessible by people as well as machines.