Structured and Unstructured Data - What Are They

So a good discussion is raging on the internal Oracle threads and discussion lists about what is meant by "Structured" and "Unstructured" data.  We talk about these a lot in the ECM space.


When we talk to customers about their need to manage unstructured data, what exactly should/do we mean?  This article has a certain take:



Duncan Pauly, founder and chief technology officer of Coppereye, adds eloquent insight to the conversation:


"The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure:



  • The structure of the data itself.
  • The structure of the container that hosts the data.
  • The structure of the access method used to access the data.

These three dimensions are largely independent and one does not need to imply another. For example, it is absolutely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms."


So when we say 80% of your data is unstructured, do we mean "not stored in a database"?  Is XML-tagged data structured? (Yes.)  What if it is stored on the file system?  What about a .pdf stored in a database and indexed via a search engine?


I have never been asked by a customer to clarify what I mean by unstructured data but I know it is coming.


One participant in the Oracle conversation has this take:



As per my experience, 'unstructured data' is data/information/content which doesn't have a specific  structure/rule attached to it. For example, a word document or an HTML page can contain data/information/content in any structure. One can have any number of images, paragraph etc. Also, in most of the cases, there is no relation between the content(s). On the other hand, 'structured data' has structure/rules attached to it e.g. a product. A product will always have a code, manufacturer, category etc. and thus defines the structure of data.

Now, the above is business terms. So, you can store them the way you wish to have your technical solution- it could be Database, File System etc.


So this would basically be saying that it is the structure of the data itself that determines whether it is structured or unstructured.


However, within the ECM space, I tend to take a different tack, at least when explaining it to myself.  I typically take a more simplistic approach.  Structured vs Unstructured is cellular data vs non-cellular data.  DB LOB types are special exception cases.

<disclaimer>Of course, I take this approach when presenting ECM, which deals primarily with content stored in non-DB table cell formats/locations.</disclaimer>

While XML data may be structured, it is contained in a content item (XML Document) that is itself unstructured.  Were the XML data to be parsed and inserted into a table structure that mirrored the XML tag names (for example), at that point the data in the DB would be considered "structured" while the XML Document and all the data it contained would still be considered "unstructured".
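
To make that concrete, here is a minimal sketch of the "parse it and put it in a table" step. It is my own illustration, not anything from the Oracle discussion: it assumes SQLite and Python's standard library, and the sample document, the `contact` table, and its column names are all made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML content item; the tag names and values are illustrative only.
XML_DOCUMENT = """
<contacts>
  <contact><name>Ann Lee</name><city>Austin</city></contact>
  <contact><name>Raj Patel</name><city>Denver</city></contact>
</contacts>
"""

def load_contacts(xml_text):
    """Parse the XML and insert it into a table whose columns mirror the tag names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contact (name TEXT, city TEXT)")
    root = ET.fromstring(xml_text.strip())
    rows = [(c.findtext("name"), c.findtext("city")) for c in root.iter("contact")]
    conn.executemany("INSERT INTO contact (name, city) VALUES (?, ?)", rows)
    return conn

conn = load_contacts(XML_DOCUMENT)

# Once the values sit in table cells, plain SQL can address individual fields.
cities = [row[0] for row in conn.execute(
    "SELECT city FROM contact WHERE name = ?", ("Raj Patel",))]
print(cities)
```

The XML document itself has not changed; what changed is that a copy of its data now lives in cells that SQL can reach.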

"Can I query it with SQL?" may be a good way of thinking about structured vs unstructured. When entire content items, and not just the data they contain, are stored in the database, I see these as exception cases.  At that point the DB becomes not a structuring entity but a storage entity providing approximately the same quantity of structure as, say, a file system.
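
Here is a rough sketch of that exception case, again my own illustration under stated assumptions (SQLite, Python's standard library, and a hypothetical `content_item` table). The whole document goes into a single BLOB column, so SQL can find the item by its container columns but cannot see the fields inside it.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Hypothetical table: the database stores the whole content item, not its data.
conn.execute(
    "CREATE TABLE content_item (id INTEGER PRIMARY KEY, filename TEXT, body BLOB)")

xml_bytes = b"<contact><name>Ann Lee</name><city>Austin</city></contact>"
conn.execute(
    "INSERT INTO content_item (filename, body) VALUES (?, ?)",
    ("contact.xml", xml_bytes))

# SQL can locate the item by filename, much as a path locates a file...
(body,) = conn.execute(
    "SELECT body FROM content_item WHERE filename = ?", ("contact.xml",)).fetchone()

# ...but there is no "name" column, so SQL cannot address the field directly.
try:
    conn.execute("SELECT name FROM content_item")
    sql_sees_name = True
except sqlite3.OperationalError:
    sql_sees_name = False

# The field has to be dug out in application code after fetching the document.
name = ET.fromstring(body).findtext("name")
print(sql_sees_name, name)
```

In this sketch the database is doing roughly the job a directory would do: it hands back the item when asked for it by name and leaves the interpretation to the application.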



Before you bring up exceptions see <disclaimer>.


But what do YOU think?  How does this explanation strike you?  How do you explain it?

Comments:

I think this is a very difficult question to answer. Until now, reasoning from a practical perspective (is it feasible? will it perform?), I at least would say: don't mix models (oo <> xml <> relational <> etc). From a classical viewpoint one could reason: relational = structured, all other = unstructured.

Posted by Marco Gralike on September 13, 2007 at 05:32 AM CDT #

Interesting question... In working with Enterprise Visualization, primarily with both standalone and PLM / DMS usage, I've thought a lot about structured vs. unstructured data. My feeling is that in an enterprise context, the key is in how the end users can access the data - if it's provided to them in a structure that makes sense for their business use, it's "structured data" otherwise it's unstructured. I suspect the end users don't care whether the data is in a database, on a file system with automatically generated indices, or scribbled on file cards sorted by monkeys, as long as they can easily find the info they need, when they need it.

From a more technical standpoint, I don't think I agree that an XML document is any less structured than what you get if you parse the XML document and dump it in a database. The structure is certainly more accessible if you put in a DB, but the structure is still there in the original file. Compare two files:

File 1 | File 2 | Fred Jones, PO Box 1034, Philipston TX | <contact> <name> Fred Jones </name> | <address>PO Box 1034, Philipston, TX</address> </contact> | Bob Smith | <contact> <name> Bob Smith </name> 9482 Waterfall Cr. | <address>9482 Waterfall Cr., Lowavile, KC</address> </contact> Lowavile, KC |

To me, it seems pretty clear that "File 1" is totally unstructured, but "File 2" would qualify as structured...

As a side note - we've been acquired by Oracle and should be getting access to the Oracle systems as of Oct 1 - I'm quite curious to read up on some of the discussions you mention on the internal Oracle discussion lists --- can you give me a pointer to the right lists?

Posted by Warren Baird on September 26, 2007 at 08:56 AM CDT #

Whoops - I kinda assumed that internal formatting would be preserved... that's what I get for trying to be fancy, I guess. The two files I was trying to express looked like this:

File 1:

    Fred Jones, PO Box 1034, Philipston TX
    Bob Smith
    9482 Waterfall Cr.
    Lowavile, KC

----------------------------------------

File 2:

    <contact> <name> Fred Jones </name>
    <address>PO Box 1034, Philipston, TX</address> </contact>
    <contact> <name> Bob Smith </name>
    <address>9482 Waterfall Cr., Lowavile, KC</address> </contact>

I hope this works better. Any chance of a 'preview' button? :-)

Posted by Warren Baird on September 26, 2007 at 08:59 AM CDT #

Interesting discussion. My take on this? That the distinction is artificial since all content has some structure. This post, for instance, has structure as dictated by "sentence structure," which allows for things like programmatic extraction of meaning based upon the placement and proximity of its words. If this wasn't true of any written language, then Google wouldn't even exist as a company (I'm sure Oracle wouldn't cry about that). Look at "File 1" in the prior post. By some standards, the data there is highly structured. Put another way, it's easily parsed even without the tags. Given my argument, you need to describe content across a continuum of just how structured it is. A database table or XML document is highly structured. To make things simple, I'd place all other content into some other bucket, such as not-highly-structured. :-) Here's my definition for highly structured data: Highly structured content allows for the simple programmatic derivation of commonalities between different entities (e.g., rows, objects).

Posted by Greg Selvin on August 19, 2009 at 03:53 AM CDT #
