5 Reasons Why Oracle Cloud Infrastructure Data Flow Optimizes Apache Spark

February 19, 2020 | 5 minute read

Apache Spark is the dominant force in big data, seamlessly combining scalable data processing, reporting, and machine learning in a convenient package. How widespread is Apache Spark usage? Consider this: since launching a decade ago as an open-source project at UC Berkeley, Apache Spark has powered AI and big data for some of the world’s largest tech companies, and in doing so, has changed the way the world consumes data. From driving user recommendations to understanding customer data, Apache Spark’s ability to unify complex data workflows and provide analytics at scale makes it the ideal foundation for big data projects. Its presence in the industry can’t be overstated – there’s a reason why, as an open-source project, it has a significant list of ongoing contributors.

Why is this so important? Simply put, big data has powered the shift to the digital era; today, everyone understands that they need big data, and the numbers back it up – big data is getting bigger and there’s no way around it. Apache Spark powers big data because of its ability to use cluster computing for data preparation and computational functions, allowing for powerful parallel processes that maximize the ability to handle big data projects. Despite that significant ability, along with the ability to scale by simply increasing the number of processors, Spark is still dependent on the infrastructure it’s built upon.
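To make the idea of parallel processing concrete, here is a minimal sketch of the split, process, and merge pattern that cluster computing relies on. It is not Spark code: it uses only the Python standard library, threads stand in for cluster nodes, and the function names (partition, count_words, parallel_word_count) are made up for illustration. Python threads will not speed up this CPU-bound work, so the sketch shows only the structure of the computation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into roughly equal chunks, one per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(lines):
    """Map step: count the words inside a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_partitions=4):
    """Process each partition independently, then merge the partial counts."""
    chunks = partition(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce step: add per-partition counts together
    return total


if __name__ == "__main__":
    sample = ["big data is big", "spark handles big data", "data at scale"]
    print(parallel_word_count(sample, num_partitions=2))
```

Because each partition is processed independently, adding more workers (or, in a real cluster, more processors) lets the same job cover more data, which is the scaling property described above.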

And because of that, industry research still shows that about 85 percent of big data projects fail. Why is that? The broad-stroke answer is that the enormous complexity of current big data solutions often causes projects to implode under their own weight. A finer examination shows that some of this stems from a user perspective (e.g., not having a clear scope or goal), but technology is just as involved. Big data usually involves juggling data from many sources, which can also include different security protocols and requirements. Bringing all of this together creates all sorts of headaches, especially when you consider the logistical difficulties involved with unifying legacy systems.

However, there’s a solution on the horizon for that.

Oracle is excited to launch Oracle Cloud Infrastructure Data Flow, a fully managed Apache Spark service that makes running Spark applications easier than ever before. With Oracle Cloud Infrastructure Data Flow, everything becomes simplified and streamlined. How does Oracle back up such a bold claim? Simple – Oracle Cloud Infrastructure Data Flow offers:

  • Spark without the infrastructure – everything exists in the cloud.
  • Always-on cloud-native security that is continuously updated to meet the latest protocols and regulations.
  • Minimal resources from IT thanks to zero need to configure, install, manage, or upgrade.

Five Reasons (And More) To Use Oracle Cloud Infrastructure Data Flow

All of this comes with an Oracle Cloud Infrastructure account. But let’s dive even deeper with five reasons Data Flow is better than your current Spark solution (and a bonus reason to boot!):

Reason 1: Managed infrastructure

Running multiple Spark jobs is hard enough to manage without having to worry about IT operations. However, as mentioned above, the underlying infrastructure is often a critical factor in whether or not a project can come to its conclusion – and even if it does finish, inefficient resource allocation can be a drain on the system as a whole. Oracle Cloud Infrastructure Data Flow handles this infrastructure provisioning, setting up networks, storage, and security, and tearing everything down when Spark jobs complete. This enables teams to focus solely on their Spark projects – the engine under the hood is all handled by Oracle.

Reason 2: Out-of-the-box security

Security concerns derail a lot of big data projects. Data Flow checks all the requisite security boxes: Authentication, Authorization, Encryption and Isolation, and all other critical points. This platform is built on the foundation of Oracle Cloud Infrastructure’s cloud-native identity and access management (IAM) security system. Data stored in Oracle Cloud Infrastructure’s object store is encrypted at rest and in motion, and is protected by IAM authorization policies. With Oracle Cloud Infrastructure Data Flow, security is automatic and not an extra step.

Reason 3: Consolidated operational insight

Big data often creates big problems for IT operations, no pun intended. Whether it's making sense of thousands of jobs or figuring out which jobs are consuming the most resources, getting a handle on utilization is a complicated task. Existing Spark solutions make it hard to get a complete and thorough picture of what all users are doing. Oracle Cloud Infrastructure Data Flow makes it easy by consolidating all this information into a single searchable, sortable interface. Want to know which job from last week cost the most? With a few clicks, this information can be requested and displayed.
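As a rough illustration of the kind of question a consolidated view answers, the sketch below finds the most expensive run in a date window. The records, field names, and the most_expensive_run helper are all hypothetical and do not reflect Data Flow's actual schema or API; the point is only that the query becomes trivial once every run is recorded in one place.

```python
from datetime import date

# Hypothetical run records; names and figures are invented for illustration.
RUNS = [
    {"name": "nightly_etl", "owner": "ana", "finished": date(2020, 2, 10), "cost_usd": 12.40},
    {"name": "churn_model", "owner": "raj", "finished": date(2020, 2, 12), "cost_usd": 31.75},
    {"name": "ad_hoc_sql", "owner": "li", "finished": date(2020, 2, 14), "cost_usd": 4.10},
    {"name": "monthly_report", "owner": "ana", "finished": date(2020, 1, 31), "cost_usd": 58.00},
]


def most_expensive_run(runs, start, end):
    """Return the costliest run that finished between start and end, inclusive."""
    in_window = [r for r in runs if start <= r["finished"] <= end]
    if not in_window:
        return None
    return max(in_window, key=lambda r: r["cost_usd"])


if __name__ == "__main__":
    top = most_expensive_run(RUNS, date(2020, 2, 10), date(2020, 2, 16))
    print(top["name"], top["owner"], top["cost_usd"])
```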

Reason 4: Simple troubleshooting

Tracking down the logs and tools necessary to troubleshoot a Spark job can take hours. Oracle Cloud Infrastructure Data Flow consolidates everything required into a single place, from Spark UI to Spark History Server and log output – all just a click away. In addition, administrators can easily load other users’ jobs when a persistent issue needs an expert eye for troubleshooting.

Reason 5: Fully managed job output

Getting the code to work is the first step. However, a project isn’t complete until the job output makes it to the target business users. Oracle Cloud Infrastructure Data Flow makes it easy to get analytics to the people who need them through an automated process that securely captures and stores a job’s output. This output is available either via the web interface or by calling REST APIs. This means that historical outputs are easily obtainable for any purpose. Need to know the output of a SQL job you ran last week? It's just one click or API call away.

Bonus Reason: The best value in big data

The five reasons above show why Oracle Cloud Infrastructure Data Flow is the easiest way to run Apache Spark. But there’s one more very practical benefit designed to help convince all of the decision makers and stakeholders along the way: with Oracle Cloud Infrastructure Data Flow, you only pay for the IaaS resources used while they’re being used. In short, there’s no additional charge despite the many features built into this platform. Combined with Oracle Cloud Infrastructure’s already industry-leading price for performance, it may be the best value in big data.
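The effect of paying only while resources are in use is easy to see with some back-of-the-envelope arithmetic. The sketch below compares that model with keeping the same capacity running around the clock. The hourly rate, core count, and job schedule are invented for illustration and are not Oracle pricing; substitute your own figures.

```python
def pay_per_use_cost(cores, hours_per_run, runs, rate_per_core_hour):
    """Cost when resources are billed only while jobs are running."""
    return cores * hours_per_run * runs * rate_per_core_hour


def always_on_cost(cores, days, rate_per_core_hour):
    """Cost when the same capacity is kept running 24 hours a day."""
    return cores * 24 * days * rate_per_core_hour


if __name__ == "__main__":
    RATE = 0.10  # hypothetical price per core-hour, not a real quote
    on_demand = pay_per_use_cost(cores=8, hours_per_run=2, runs=30, rate_per_core_hour=RATE)
    always_on = always_on_cost(cores=8, days=30, rate_per_core_hour=RATE)
    print(f"pay per use: {on_demand:.2f}")
    print(f"always on:   {always_on:.2f}")
    print(f"ratio:       {always_on / on_demand:.1f}x")
```

In this hypothetical case, a two-hour daily job uses the cores for 2 of every 24 hours, so the always-on cluster costs twelve times as much for the same work.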

The easiest way to see for yourself is to simply dive in with a test drive. All you need is an Oracle Cloud account. Follow along with our Data Flow Tutorial, which takes you through the process step by step and shows you just how simple big data can be with Data Flow. Check out Oracle’s big data management products and don't forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Guest Author
