What is the difference between structured and unstructured data—and should you care? For many businesses and organizations, such distinctions may feel like they belong solely to the IT department dealing with big data. And while there is some truth to that, it’s worthwhile for everyone to understand the difference, because once you grasp the definition of structured data and unstructured data (along with where that data lives and how to process it), it’s possible to see how this can be used to improve any data-driven process.
And these days, nearly any workflow in any department is data-driven.
Sales, marketing, communications, operations, human resources: all of these produce data. Even the smallest of small businesses (say, a brick-and-mortar store with physical inventory and a local customer base) produces structured and unstructured data from things like email, credit card transactions, inventory purchases, and social media. Taking advantage of that data starts with understanding the two types and how they work together.
Structured data is data that uses a predefined and expected format. This can come from many different sources, but the common factor is that the fields are fixed, as is the way that it is stored (hence, structured). This predetermined data model enables easy entry, querying, and analysis. Here are two examples to illustrate this point.
First, consider transactional data from an online purchase. Each record will have a timestamp, purchase amount, associated account information (or guest account), item(s) purchased, payment information, and confirmation number. Because each field has a defined purpose, the data is easy to query manually (the equivalent of hitting CTRL+F on an Excel spreadsheet) and easy for machine learning algorithms to scan for patterns and, in many cases, for anomalies outside of those patterns.
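As a minimal sketch of what such a record could look like in code, the fields listed above can be modeled as a fixed type and then queried by field. The class name, field names, sample values, and the `purchases_over` helper are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import List, Tuple


@dataclass(frozen=True)
class Purchase:
    """One transactional record; every field has a fixed name and type."""
    timestamp: datetime
    amount: Decimal
    account_id: str            # "guest" marks a guest checkout (assumed convention)
    items: Tuple[str, ...]
    payment_method: str
    confirmation_number: str


# Made-up sample data for illustration only.
purchases = [
    Purchase(datetime(2024, 3, 1, 9, 15), Decimal("19.99"), "acct-1001", ("mug",), "card", "C-0001"),
    Purchase(datetime(2024, 3, 1, 9, 40), Decimal("249.00"), "guest", ("headphones",), "card", "C-0002"),
    Purchase(datetime(2024, 3, 2, 14, 5), Decimal("5.50"), "acct-1001", ("sticker", "pen"), "wallet", "C-0003"),
]


def purchases_over(records: List[Purchase], threshold: Decimal) -> List[Purchase]:
    """Return the records whose amount exceeds the threshold."""
    return [p for p in records if p.amount > threshold]


large = purchases_over(purchases, Decimal("100"))
for p in large:
    print(p.confirmation_number, p.amount)
```

Because every record exposes the same fields, a query is a one-line filter. The same predictability is what lets pattern-finding tools work on the data without first guessing what each value means.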
Another example is data coming from a medical device. Something as simple as a hospital EKG monitor produces structured data that comes down to two key fields: the electrical activity of a person’s heart and the associated timestamp. Those two fields are predefined and would easily fit into a relational or tabular database; machine learning algorithms could easily identify patterns and anomalies with just a few minutes’ worth of records.
Despite the vast difference in technical complexity between these examples, both show that structured data boils down to established and expected elements. Timestamps arrive in a defined format; a source won’t (or can’t) transmit a timestamp described in words because that falls outside the structure. A predefined format allows for easy scalability and processing, even when handled manually.
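The point about timestamps can be illustrated with a small validation sketch. It assumes the defined format is ISO 8601, which the article does not specify, and the function names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; raises ValueError for anything else."""
    return datetime.fromisoformat(value)


def is_valid_timestamp(value: str) -> bool:
    """Report whether a value fits the expected timestamp structure."""
    try:
        parse_timestamp(value)
        return True
    except ValueError:
        return False


print(is_valid_timestamp("2024-03-01T09:15:00"))   # fits the defined format
print(is_valid_timestamp("yesterday afternoon"))   # described in words, so rejected
```

A value described in words fails the check, which is exactly the sense in which it falls outside the structure.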
Structured data can be used for anything as long as the source defines the structure. Some of the most common uses in business include CRM forms, online transactions, stock data, corporate network monitoring data, and website forms.
Structured data comes with definition; unstructured data is the opposite. Rather than predefined fields in a purposeful format, unstructured data can come in all shapes and sizes. Though typically text (like an open text field in a form), unstructured data can come in many forms to be stored as objects: images, audio, video, document files, and other file formats. The common point with all types of unstructured data comes back to the idea of lacking definition. Unstructured data is more commonly available (more on that below), and its fields may not have the same character or space limits as structured data. Given the wide range of formats that make up unstructured data, it’s not surprising that this type is commonly estimated to account for about 80% of an organization’s data.
Let’s look at some examples of unstructured data.
First, a company’s social posts are a specific example of unstructured data. The metrics behind each social media post—likes, shares, views, hashtags, and so on—are structured, in that they are predefined and purposeful for each post. The actual posts, though, are unstructured. The posts archive into a repository, but searching or relating the posts with metrics or other insights requires effort. There is no way of knowing what each post specifically contains without actually examining it, whether it’s customer service or promotion or an organizational news update. Compare that to structured data, where the purpose of fields (e.g., dates, names, geospatial coordinates) is clear.
A second example comes from media files. Something like a podcast has no structure to its content. Searching for the podcast’s MP3 file is not easy by default; metadata such as file name, timestamp, and manually assigned tags may help the search, but the audio file itself lacks context without further analysis or relationships.
Another example comes from video files. Video assets are everywhere these days, from short clips on social media to larger files that show full webinars or discussions. As with podcast MP3 files, the content of this data lacks specificity outside of metadata. Without further processing, you simply can’t search a database for a specific video file based on its actual content.
In today’s data-driven business world, structured and unstructured data tend to go hand in hand. For most instances, using both is a good way to develop insight. Let’s go back to the example of a company’s social media posts, specifically posts with some form of media attachment. How can an organization develop insights on marketing engagement?
First, use structured data to sort social media posts by highest engagement, then filter out hashtags that aren’t related to marketing (for example, removing any high-engagement posts with a hashtag related to customer service). From there, the related unstructured data can be examined—the actual social media post content—looking at messaging, type of media, tone, and other elements that may give insight as to why the post generated engagement.
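The workflow above can be sketched in a few lines. The post records, the hashtag list, and the definition of engagement as likes plus shares are all assumptions made for illustration; a real platform export would have its own fields and metrics.

```python
# Made-up sample posts: structured metrics alongside unstructured text.
posts = [
    {"id": 1, "likes": 120, "shares": 30, "hashtags": ["newproduct"], "text": "Our spring line is here."},
    {"id": 2, "likes": 300, "shares": 80, "hashtags": ["support"], "text": "Sorry about the outage, we are on it."},
    {"id": 3, "likes": 45, "shares": 5, "hashtags": ["sale"], "text": "Weekend sale starts Friday."},
    {"id": 4, "likes": 210, "shares": 60, "hashtags": ["sale", "newproduct"], "text": "Launch week: 20% off everything new."},
]

# Hashtags treated as unrelated to marketing (hypothetical list).
NON_MARKETING_HASHTAGS = {"support", "help"}


def engagement(post):
    """Assumed engagement metric: likes plus shares."""
    return post["likes"] + post["shares"]


def top_marketing_posts(posts, limit=2):
    """Drop posts with non-marketing hashtags, then rank the rest by engagement."""
    marketing = [p for p in posts if not NON_MARKETING_HASHTAGS.intersection(p["hashtags"])]
    return sorted(marketing, key=engagement, reverse=True)[:limit]


# The structured fields narrow the set; the unstructured text is what gets reviewed.
for post in top_marketing_posts(posts):
    print(post["id"], engagement(post), post["text"])
```

The structured fields do the sorting and filtering, so only a short list of post texts is left for closer review of messaging, media type, and tone.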
This may sound like a lot of manual labor, and several years ago it was. However, advances in machine learning and artificial intelligence now allow much of this work to be automated. For example, if audio files are run through speech-to-text processing, the resulting text can be analyzed for keyword patterns or positive/negative messaging. Cutting-edge tools speed up these insights, and they are becoming increasingly important because big data keeps getting bigger and the majority of it is unstructured.
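To make the text-analysis step concrete, here is a deliberately simple sketch that counts keywords and scores positive versus negative wording in a transcript. It assumes the speech-to-text step has already produced the text, and it uses small hand-picked word lists; production tools typically rely on trained models instead. All names and word lists are hypothetical.

```python
import re
from collections import Counter

# Tiny hand-picked word lists, for illustration only.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "broken", "slow", "disappointed"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_counts(text, keywords):
    """Count how often each keyword of interest appears; omit keywords not found."""
    counts = Counter(tokenize(text))
    return {k: counts[k] for k in keywords if counts[k]}


def sentiment_score(text):
    """Positive word count minus negative word count."""
    tokens = tokenize(text)
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


# Stand-in for speech-to-text output from an audio file.
transcript = "I love the new app. Checkout is great, but search is slow."
print(keyword_counts(transcript, ["checkout", "search", "refund"]))
print(sentiment_score(transcript))
```

Once the audio has become text, the unstructured content can be summarized into structured values (counts and a score) that can be sorted and compared like any other field.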
In today’s business world, data comes in from multiple sources. Let’s look at a mid-size company with a standard ecommerce setup. In this case, data likely comes from the following areas:
In fact, the amount of data pulled by any company these days is staggering. You don’t have to be one of the world’s biggest corporations to be part of the big data revolution. But how you handle that data is key to being able to utilize it. The best solution in many cases is a data lake.
Data lakes are repositories that receive both structured and unstructured data. The ability to consolidate multiple data inputs into a single source makes data lakes an essential part of any big data infrastructure. When data goes into a data lake, it is stored as raw data with no schema imposed up front, which makes the lake easily scalable and flexible. When the data is read and processed, it is then given structure and schema as needed, balancing both volume and efficiency.
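Applying structure only when the data is read is often called schema-on-read. The sketch below illustrates the idea with an in-memory list standing in for the lake; the record layout, field names, and `read_orders` function are assumptions for illustration, not the API of any particular data lake product.

```python
import json
from datetime import datetime

# Raw records as they might land in the lake: stored as-is, nothing validated on write.
raw_lake = [
    '{"type": "order", "ts": "2024-03-01T09:15:00", "amount": "19.99"}',
    '{"type": "post", "ts": "2024-03-01T10:00:00", "text": "Spring line is here"}',
    '{"type": "order", "ts": "2024-03-02T14:05:00", "amount": "5.50"}',
]


def read_orders(raw_records):
    """Apply an 'order' schema at read time, skipping records of other types."""
    for line in raw_records:
        record = json.loads(line)
        if record.get("type") != "order":
            continue
        yield {
            "timestamp": datetime.fromisoformat(record["ts"]),
            "amount": float(record["amount"]),
        }


orders = list(read_orders(raw_lake))
total = sum(order["amount"] for order in orders)
print(len(orders), round(total, 2))
```

The same raw records could be read again later with a different schema, for example one that extracts only the post text, without changing how anything was stored.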
Efficient storage is key because scalability and flexibility make it possible to include more data sources and to apply more cutting-edge tools such as machine learning. This means that the foundation for receiving structured and unstructured data needs to be built for the present and the future, and the industry consensus points to moving data to the cloud.