Structured and Unstructured Data - What Are They
By billy.cripe on Sep 13, 2007
A good discussion is making the rounds on the internal Oracle threads and discussion lists about what is meant by "Structured" and "Unstructured" data. We talk about these a lot in the ECM space.
When we talk to customers about their need to manage unstructured data, what exactly should/do we mean? This article has a certain take:
Duncan Pauly, founder and chief technology officer of Coppereye, adds eloquent insight to the conversation:
"The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure:
- The structure of the data itself.
- The structure of the container that hosts the data.
- The structure of the access method used to access the data.
These three dimensions are largely independent and one does not need to imply another. For example, it is absolutely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms."
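To make the independence of those three dimensions concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name `content`, the sample documents, and the `keyword_search` helper are all made up for illustration. It stores free-form text (unstructured data) in a database table (structured container) and finds it by scanning for a keyword (unstructured access).

```python
import sqlite3

# Hypothetical free-form documents: the text itself follows no fixed schema.
documents = [
    ("memo.txt", "Quarterly results were strong. Lunch is at noon on Friday."),
    ("notes.txt", "Call the vendor about the invoice. Also, the printer is broken."),
]

conn = sqlite3.connect(":memory:")
# Structured container: a table with named columns.
conn.execute("CREATE TABLE content (name TEXT PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO content VALUES (?, ?)", documents)

def keyword_search(conn, term):
    """Unstructured access: scan every body for a keyword, ignoring case."""
    rows = conn.execute("SELECT name, body FROM content ORDER BY name").fetchall()
    return [name for name, body in rows if term.lower() in body.lower()]

hits = keyword_search(conn, "invoice")
print(hits)  # ['notes.txt']
```

The database knows the name and the body of each item, but nothing about what is inside the body. A real deployment would more likely use a full-text index than a linear scan; the scan is only here to keep the sketch short.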
So when we say 80% of your data is unstructured, do we mean "not stored in a database"? Is XML-tagged data structured? (Yes.) Is it still structured if it is stored on the file system? What about a .pdf stored in a database and indexed via a search engine?
I have never been asked by a customer to clarify what I mean by unstructured data but I know it is coming.
One participant in the Oracle conversation has this take:
As per my experience, 'unstructured data' is data/information/content which doesn't have a specific structure/rule attached to it. For example, a word document or an HTML page can contain data/information/content in any structure. One can have any number of images, paragraph etc. Also, in most of the cases, there is no relation between the content(s). On the other hand, 'structured data' has structure/rules attached to it e.g. a product. A product will always have a code, manufacturer, category etc. and thus defines the structure of data.
Now, the above is business terms. So, you can store them the way you wish to have your technical solution- it could be Database, File System etc.
So this would basically be saying that it is the structure of the data itself that determines whether or not it is structured or unstructured.
However, within the ECM space, I tend to take a different tack, at least when explaining it to myself. I typically take a more simplistic approach. Structured vs Unstructured is cellular data vs non-cellular data. DB LOB types are special exception cases.
<disclaimer>Of course, I take this approach when presenting ECM, which deals primarily with content stored in non-DB table cell formats/locations.</disclaimer>
While XML data may be structured, it is contained in a content item (an XML document) that is itself unstructured. Were the XML data to be parsed and inserted into a table structure that mirrored the XML tag names (for example), at that point the data in the DB would be considered "structured", while the XML document and all the data it contained would still be considered "unstructured".
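As a rough sketch of that parse-and-insert step, the Python below reads a small XML document and loads it into a table whose columns mirror the tag names. The `products` document, the `product` table, and its columns are invented for the example (borrowing the code/manufacturer/category idea from the product description above); it uses only the standard-library xml.etree.ElementTree and sqlite3 modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document: as a content item, this is just one string.
xml_document = """<products>
  <product><code>A100</code><manufacturer>Acme</manufacturer><category>Tools</category></product>
  <product><code>B200</code><manufacturer>Bolt Co</manufacturer><category>Hardware</category></product>
</products>"""

conn = sqlite3.connect(":memory:")
# Table columns mirror the XML tag names.
conn.execute("CREATE TABLE product (code TEXT, manufacturer TEXT, category TEXT)")

# Parse the document and turn each <product> element into one row.
root = ET.fromstring(xml_document)
rows = [
    (p.findtext("code"), p.findtext("manufacturer"), p.findtext("category"))
    for p in root.iter("product")
]
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", rows)

# Once the data sits in table cells, plain SQL can filter on it.
tools = conn.execute(
    "SELECT code FROM product WHERE category = ?", ("Tools",)
).fetchall()
print(tools)  # [('A100',)]
```

The `xml_document` string never changes and stays a single opaque item; only the rows copied out of it become reachable with a WHERE clause.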
"Can I query it with SQL?" may be a good way of thinking about structured vs unstructured. When entire content items, and not just the data they contain, are stored in the database, I see these as exception cases. At that point the DB becomes not a structuring entity but a storage entity, providing approximately the same quantity of structure as, say, a file system.
Before you bring up exceptions see <disclaimer>.
But what do YOU think? How does this explanation strike you? How do you explain it?