
Announcing Python Virtual Environment Support for Spark Applications

Guest Author

More enterprises are building data lakes in the cloud to unlock the cloud’s benefits of scale and agility. When it comes to Big Data, Spark continues to dominate, which is why Oracle built Oracle Cloud Infrastructure Data Flow, Oracle’s fully managed Spark service that lets you run Spark applications with no administration required.

When it comes to developer productivity in Spark, PySpark can’t be beaten thanks to more than 240,000 freely available modules that cover everything from data preparation to analytics, machine learning and much more.

This freewheeling environment creates a problem: When many Python developers use the same big data cluster, you’ll quickly encounter version conflicts. One developer will use the latest version of a library while another relies on an older version for stability.

Python solves this problem using Virtual Environments – that is, private copies of the Python runtime that let each developer use the versions they want without interfering with anyone else. The problem is that big data environments historically have poor support for Virtual Environments, and community workarounds are reported to be unstable, forcing users to solve the problem through cluster proliferation.

Data Flow had this problem in mind from day one. Each job in Data Flow is a completely isolated cluster dedicated to just that job. No matter what you run or what you modify, it’s impossible for your job to interfere with someone else’s job. Now Data Flow takes it a step further by letting you provide a Python Virtual Environment for Data Flow to install before launching your job. With Virtual Environment support, Data Flow can tap the amazing Python ecosystem without the drawbacks.


How Does it Work?

Each Data Flow run creates a Spark cluster in our managed environment and executes your application. In order to add your Virtual Environment, it’s essential to use versions compatible with our managed environment. To make that easy, Data Flow provides a Docker container to automate packaging a compatible virtual environment into a Dependency Archive zip file, which can then be provided alongside your Spark code. All that’s needed is a standard requirements file. After that, one command downloads compatible versions of your dependencies, then packages them in the Dependency Archive.
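As a sketch, the requirements file is just a standard pip `requirements.txt`; the package pins below are illustrative, and the packager invocation is shown as a comment because the real container image name comes from the Data Flow documentation:

```shell
# Pin the packages your job needs (versions here are illustrative)
cat > requirements.txt <<'EOF'
unidecode==1.3.6
dateparser==1.1.8
openpyxl==3.1.2
EOF

# Hypothetical invocation of the dependency-packager container
# (substitute the real image name from the Data Flow documentation):
# docker run --rm -v "$(pwd):/opt/dataflow" <dependency-packager-image>
#
# The packager resolves compatible versions and emits the Dependency
# Archive zip file to upload alongside your Spark code.
```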

Sometimes you need additional Java JAR files or even other static content to make your applications work. These can also be added to your Dependency Archive via the same tool or you can add them yourself to the zip file after the fact. For step-by-step guidance on the process, refer to our documentation.

Before starting, we recommend reading Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to The Cloud to learn how to build and test PySpark applications on your laptop and then deploy them to Data Flow with no modifications.

9 Sample Use Cases

The possibilities are limitless thanks to Python’s extensive third-party libraries. Here are 9 ways PySpark makes it easier to tackle common problems.

  1. Machine Learning: Virtual Environment support puts advanced ML libraries like tensorflow, keras, scikit-learn, xgboost, lightgbm, torch, theano and more at your fingertips.
  2. Data Cleansing: Natural language processing (NLP) systems can get confused by a mixture of ASCII and Unicode data, as can legacy databases. Cleaning text with unidecode could save you a lot of frustration.
  3. Computer Vision / Video Processing: Computer Vision is a red-hot topic right now, and its computationally expensive nature builds a great case for on-demand computing. Virtual environment support enables inclusion of libraries like opencv for computer vision tasks or even tools like ffmpeg for general purpose video preprocessing.
  4. Control Oracle Cloud Infrastructure Services: The OCI Python SDK provides comprehensive access to Oracle Cloud Infrastructure services. For example you can read and write files from Oracle Cloud Infrastructure object store, interact with Oracle NoSQL, send messages and more. Even better, your Data Flow runtime includes a token that allows you to access any IAM-enabled Oracle Cloud Infrastructure service without requiring credentials.
  5. Databases and Other Data Sources: Need to talk to a MySQL database? Try mysqlclient. Need to read messages off a Kafka queue? Try kafka. Just about any major datasource will have some Python plugin.
  6. Connect to Oracle Databases: To talk to Oracle databases, include Oracle JDBC JARs in your Dependency Archive and interact using the Data Sources API.
  7. Advanced Data Preparation: Need to extract currency information reliably? Consider money-parser, which seamlessly handles multiple currencies, multiple separator standards, and other tricky topics like the Indian numbering system. In addition, you can use dateparser to convert non-ISO 8601 dates and timestamps into a Spark-friendly format.
  8. Extract and Transform XML: Python makes it simple to parse, extract, or convert XML. Consider Beautiful Soup for a refreshingly pleasant take on XML parsing.
  9. Tough ETL Challenges Made Simple: Need to extract data from a lot of Excel workbooks? Use openpyxl to make it easy.
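To make the data-cleansing case concrete, here is a minimal sketch. The function below is only a stdlib stand-in for unidecode (unidecode itself, once packaged in your Dependency Archive, also transliterates scripts like Cyrillic and CJK that this sketch simply discards), and the Spark wiring is shown in a comment:

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip accents via NFKD decomposition, then drop non-ASCII bytes.

    A stdlib approximation of unidecode's behavior for Latin text;
    unidecode handles far more scripts than this sketch.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# In a Data Flow job you would register it as a Spark UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   df = df.withColumn("clean", udf(ascii_fold, StringType())("raw_text"))
print(ascii_fold("Café déjà vu"))  # prints "Cafe deja vu"
```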
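For the Oracle JDBC case, a hedged sketch of what the Data Sources API call might look like. Every connection value below is a placeholder, and the read itself is commented out because it needs a live SparkSession plus the Oracle JDBC JARs in your Dependency Archive:

```python
# Placeholder connection options for Spark's JDBC data source;
# substitute your real host, service, table, and credentials.
jdbc_opts = {
    "url": "jdbc:oracle:thin:@//dbhost.example.com:1521/mypdb",
    "dbtable": "SALES.ORDERS",
    "user": "analytics_user",
    "password": "********",
    "driver": "oracle.jdbc.driver.OracleDriver",
}

# With a live SparkSession (spark), this would load the table:
# df = spark.read.format("jdbc").options(**jdbc_opts).load()
print(sorted(jdbc_opts))
```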

These nine examples are just a small sample of the problems you can tackle now with Data Flow, at any scale and with no administrative overhead.

Subscribe to the Oracle Big Data Blog to catch the latest on machine learning, all delivered straight to your inbox—and don’t forget to follow us on Twitter @OracleBigData.
