The New Data Lake - You Need More Than HDFS
A data lake is a key element of any big data strategy and conventional wisdom has it that Hadoop/HDFS is the core of your lake. But conventional wisdom changes with new information (which is why we're no longer living on an earth presumed to be both flat and at the center of the universe), and in this case that new information is all about object storage.
Guest blogger Paul Miller, Big Data and Analytics Manager at Oracle, has this post on object storage as the foundation of the new data lake. And if you'd like to try building one yourself, head over to our New Data Lake Workshop (it's free!) which will guide you through the process. After a short time, you'll have a functioning, modern data lake, ready to go.
Object Store is the New Data Lake
There are many ways to persist data in cloud platforms today, such as object, block, file, SMB, database, queue and archive storage. As an overview, the primary storage solutions from Oracle, AWS and Azure fall into three categories:
Object Based Distributed Storage: Key/Content driven interface
File Based Distributed Storage: Nested file/folders interface
Block Based Storage: Raw, disk-like interface (1s and 0s)
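To make the difference between the three interfaces concrete, here is a minimal in-memory sketch. The class and method names (ObjectStore, FileTree, BlockDevice, put, get and so on) are hypothetical and chosen for illustration only; they do not correspond to the API of Oracle, AWS or Azure.

```python
# Toy, in-memory illustrations of the three interfaces (not any vendor's API).


class ObjectStore:
    """Flat namespace: a key maps to content plus optional metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, key, content, metadata=None):
        self._objects[key] = (bytes(content), dict(metadata or {}))

    def get(self, key):
        return self._objects[key][0]

    def list(self, prefix=""):
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


class FileTree:
    """Nested folders: a path is walked one component at a time."""

    def __init__(self):
        self._root = {}

    def write(self, path, content):
        *folders, name = path.strip("/").split("/")
        node = self._root
        for folder in folders:
            node = node.setdefault(folder, {})
        node[name] = bytes(content)

    def read(self, path):
        node = self._root
        for part in path.strip("/").split("/"):
            node = node[part]
        return node


class BlockDevice:
    """Numbered, fixed-size blocks of raw bytes: no names and no metadata."""

    def __init__(self, block_count, block_size=512):
        self.block_size = block_size
        self._blocks = [bytes(block_size) for _ in range(block_count)]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self._blocks[index] = bytes(data)

    def read_block(self, index):
        return self._blocks[index]
```

The object interface is the simplest of the three to expose over HTTP, since every operation is a key plus a payload, which is one reason it suits globally distributed access.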
Of the three persistence strategies outlined above, Object Based Distributed Storage is the centerpiece of public cloud platforms. Amazon established the mindset of cloud native application developers using an object store (AWS S3) as their persistent store. Object store is now the integration point where cloud and on-premises applications can easily persist and distribute data globally in a canonical way.
Oracle, recognizing this fact, made a massive investment in developing an object store that is fast and easy to use within the Oracle Public Cloud. When it comes to analytics, cloud native persistence and backup targets, Oracle Object Store is critical.
How Object Storage Works
Object storage is a scalable, redundant, foundational storage service. Objects and files are written to multiple disk drives spread across servers in the Oracle Public Cloud, with Oracle's software responsible for ensuring data replication and integrity across the cluster. Because Oracle uses provisioning logic to maintain availability locally and across different data centers, it is able to provide 11 9s (99.999999999%) of data durability. Should anything fail, Oracle handles the replication of the container's content from other active nodes to new locations in the Oracle Public Cloud ecosystem.
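The replicate-then-repair behavior described above can be sketched in a few lines. This is a toy model under stated assumptions, not Oracle's implementation: the ReplicatedStore class, the replica count of three, the least-loaded placement rule and the SHA-256 integrity check are all illustrative choices.

```python
import hashlib


class ReplicatedStore:
    """Toy cluster that keeps each object on `replicas` distinct nodes."""

    def __init__(self, node_names, replicas=3):
        if replicas > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.replicas = replicas
        self.nodes = {name: {} for name in node_names}  # node -> {key: bytes}
        self.checksums = {}  # key -> SHA-256 digest recorded at write time

    def _least_loaded(self, exclude, count):
        # Assumed placement rule: prefer nodes holding the fewest objects.
        candidates = [n for n in self.nodes if n not in exclude]
        candidates.sort(key=lambda n: (len(self.nodes[n]), n))
        return candidates[:max(0, count)]

    def put(self, key, content):
        # Drop any earlier version first so stale copies never linger.
        for objects in self.nodes.values():
            objects.pop(key, None)
        self.checksums[key] = hashlib.sha256(content).hexdigest()
        for name in self._least_loaded(exclude=(), count=self.replicas):
            self.nodes[name][key] = content

    def get(self, key):
        # Return the first copy whose checksum still matches (integrity check).
        for objects in self.nodes.values():
            content = objects.get(key)
            if content is None:
                continue
            if hashlib.sha256(content).hexdigest() == self.checksums[key]:
                return content
        raise KeyError(key)

    def replica_count(self, key):
        return sum(key in objects for objects in self.nodes.values())

    def fail_node(self, name):
        """Remove a node, then re-replicate its objects from surviving copies."""
        lost = self.nodes.pop(name)
        for key in lost:
            holders = [n for n, objs in self.nodes.items() if key in objs]
            if not holders:
                continue  # every copy was on the failed node; nothing to copy
            content = self.get(key)
            missing = self.replicas - len(holders)
            for target in self._least_loaded(exclude=holders, count=missing):
                self.nodes[target][key] = content
```

For example, with four nodes and three replicas, calling fail_node on a node that holds an object copies it to the spare node, so replica_count for that object returns to three without any action by the client.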
When it comes to using the latest and greatest tools for data science and fast data processing, object store enables agility, cost savings and faster deployment by:
1. Detaching compute from storage allowing for the environments to grow independently - check out what we are doing with Big Data Cloud Service CE or IoT Cloud Service
2. Persisting all the data in a low cost, globally distributed store that speeds processes up while making the data more durable
3. Maintaining a core, distribution based environment (Cloudera) while being able to use the latest and greatest Hadoop projects on demand (Apache)
The Benefits of Object Store
Hadoop HDFS's strategy of intrinsically tying storage to compute is increasingly becoming an inefficient use of resources when it comes to enterprise data lakes. Think of object store as the lowest tier in your storage hierarchy. Object store allows you to decouple storage from compute, giving organizations more flexibility, durability and cost savings. Store everything in object store and read only the data you need into the application or processing tier (Java CS, Node.js, Coherence Data Grid, DBaaS, Spark RDD, Essbase, etc.) on demand.
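As a rough illustration of "store everything, read only what you need", the sketch below uses a plain dictionary as a stand-in for an object store and pulls just one key prefix into the processing tier. The key names, the CSV layout and the helper functions are all hypothetical; a real job would use the object store's client library or a Spark connector instead.

```python
from collections import defaultdict

# Hypothetical stand-in for an object store: flat keys mapped to content.
OBJECT_STORE = {
    "sales/2017/01.csv": "region,amount\neast,100\nwest,250\n",
    "sales/2017/02.csv": "region,amount\neast,75\nwest,50\n",
    "sales/2016/12.csv": "region,amount\neast,999\n",
    "clickstream/2017/01.log": "not needed for this job\n",
}


def list_keys(store, prefix):
    """Return only the keys under the given prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def read_rows(store, key):
    """Fetch one object and yield its rows as dictionaries."""
    header, *lines = store[key].strip().splitlines()
    columns = header.split(",")
    for line in lines:
        yield dict(zip(columns, line.split(",")))


def total_by_region(store, prefix):
    """Compute tier: reads the objects under `prefix` and nothing else."""
    totals = defaultdict(int)
    for key in list_keys(store, prefix):
        for row in read_rows(store, key):
            totals[row["region"]] += int(row["amount"])
    return dict(totals)


print(total_by_region(OBJECT_STORE, "sales/2017/"))
# prints {'east': 175, 'west': 300}
```

The 2016 sales file and the clickstream log stay in the store untouched, and the compute side holds no data once the job finishes, so it can be resized or shut down independently of storage.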
At the end of the day, the cost of copying this data as needed is small compared with the cost savings and the increased flexibility. These key factors placed object store at the center of our Oracle Analytics and Big Data Reference Architecture: