
Announcing Python Virtual Environment Support for Spark Applications

Guest Author

More enterprises are building data lakes in the cloud to unlock the cloud’s benefits of scale and agility. When it comes to Big Data, Spark continues to dominate, which is why Oracle built Oracle Cloud Infrastructure Data Flow, Oracle’s fully managed Spark service that lets you run Spark applications with no administration required.

When it comes to developer productivity in Spark, PySpark can’t be beaten thanks to more than 240,000 freely available modules that cover everything from data preparation to analytics, machine learning and much more.

This freewheeling environment creates a problem: When many Python developers use the same big data cluster, you’ll quickly encounter version conflicts. One developer will use the latest version of a library while another relies on an older version for stability.

Python solves this problem using Virtual Environments – that is, private copies of the Python runtime that let each developer use the versions they want without interfering with anyone else. The problem is that big data environments historically have poor support for Virtual Environments, and community workarounds are reported to be unstable, forcing users to solve the problem through cluster proliferation.

Data Flow had this problem in mind from day one. Each job in Data Flow is a completely isolated cluster dedicated to just that job. No matter what you run or what you modify, it’s impossible for your job to interfere with someone else’s job. Now Data Flow takes it a step further by letting you provide a Python Virtual Environment for Data Flow to install before launching your job. With Virtual Environment support, Data Flow can tap the amazing Python ecosystem without the drawbacks.


How Does it Work?

Each Data Flow run creates a Spark cluster in our managed environment and executes your application. In order to add your Virtual Environment, it’s essential to use versions compatible with our managed environment. To make that easy, Data Flow provides a Docker container to automate packaging a compatible virtual environment into a Dependency Archive zip file, which can then be provided alongside your Spark code. All that’s needed is a standard requirements file. After that, one command downloads compatible versions of your dependencies, then packages them in the Dependency Archive.
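As a sketch, the requirements file is just a standard pip `requirements.txt`; the package pins below are illustrative, and the packager invocation is shown as a comment because the real container image name comes from the Data Flow documentation:

```shell
# Pin the packages your job needs (versions here are illustrative)
cat > requirements.txt <<'EOF'
unidecode==1.3.6
dateparser==1.1.8
openpyxl==3.1.2
EOF

# Hypothetical invocation of the dependency-packager container
# (substitute the real image name from the Data Flow documentation):
# docker run --rm -v "$(pwd):/opt/dataflow" <dependency-packager-image>
#
# The packager resolves compatible versions and emits the Dependency
# Archive zip file to upload alongside your Spark code.
```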

Sometimes you need additional Java JAR files or even other static content to make your applications work. These can also be added to your Dependency Archive via the same tool or you can add them yourself to the zip file after the fact. For step-by-step guidance on the process, refer to our documentation.

Before starting, we recommend reading Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to The Cloud to learn how to build and test PySpark applications on your laptop and then deploy them to Data Flow with no modifications.

9 Sample Use Cases

The possibilities are limitless thanks to Python’s extensive third-party libraries. Here are 9 ways PySpark makes it easier to tackle common problems.

  1. Machine Learning: Virtual Environment support puts advanced ML libraries like tensorflow, keras, scikit-learn, xgboost, lightgbm, torch, theano and more at your fingertips.
  2. Data Cleansing: Natural language processing (NLP) systems can get confused by a mixture of ASCII and Unicode data, as can legacy databases. Cleaning text with unidecode could save you a lot of frustration.
  3. Computer Vision / Video Processing: Computer Vision is a red-hot topic right now, and its computationally expensive nature builds a great case for on-demand computing. Virtual environment support enables inclusion of libraries like opencv for computer vision tasks or even tools like ffmpeg for general purpose video preprocessing.
  4. Control Oracle Cloud Infrastructure Services: The OCI Python SDK provides comprehensive access to Oracle Cloud Infrastructure services. For example you can read and write files from Oracle Cloud Infrastructure object store, interact with Oracle NoSQL, send messages and more. Even better, your Data Flow runtime includes a token that allows you to access any IAM-enabled Oracle Cloud Infrastructure service without requiring credentials.
  5. Databases and Other Data Sources: Need to talk to a MySQL database? Try mysqlclient. Need to read messages off a Kafka queue? Try kafka. Just about any major datasource will have some Python plugin.
  6. Connect to Oracle Databases: To talk to Oracle databases, include Oracle JDBC JARs in your Dependency Archive and interact using the Data Sources API.
  7. Advanced Data Preparation: Need to extract currency information reliably? Consider money-parser, which seamlessly handles multiple currencies, multiple separator standards, and other tricky topics like the Indian numbering system. In addition, you can use dateparser to convert non-ISO 8601 dates and timestamps into a Spark-friendly format.
  8. Extract and Transform XML: Python makes it simple to parse, extract, or convert XML. Consider Beautiful Soup for a refreshingly pleasant take on XML parsing.
  9. Tough ETL Challenges Made Simple: Need to extract data from a lot of Excel workbooks? Use openpyxl to make it easy.
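To make the data-cleansing case concrete, here is a minimal sketch. The function below is only a stdlib stand-in for unidecode (unidecode itself, once packaged in your Dependency Archive, also transliterates scripts like Cyrillic and CJK that this sketch simply discards), and the Spark wiring is shown in a comment:

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip accents via NFKD decomposition, then drop non-ASCII bytes.

    A stdlib approximation of unidecode's behavior for Latin text;
    unidecode handles far more scripts than this sketch.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# In a Data Flow job you would register it as a Spark UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   df = df.withColumn("clean", udf(ascii_fold, StringType())("raw_text"))
print(ascii_fold("Café déjà vu"))  # prints "Cafe deja vu"
```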
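For the Oracle JDBC case, a hedged sketch of what the Data Sources API call might look like. Every connection value below is a placeholder, and the read itself is commented out because it needs a live SparkSession plus the Oracle JDBC JARs in your Dependency Archive:

```python
# Placeholder connection options for Spark's JDBC data source;
# substitute your real host, service, table, and credentials.
jdbc_opts = {
    "url": "jdbc:oracle:thin:@//dbhost.example.com:1521/mypdb",
    "dbtable": "SALES.ORDERS",
    "user": "analytics_user",
    "password": "********",
    "driver": "oracle.jdbc.driver.OracleDriver",
}

# With a live SparkSession (spark), this would load the table:
# df = spark.read.format("jdbc").options(**jdbc_opts).load()
print(sorted(jdbc_opts))
```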

These nine examples are just a small sample of the problems you can tackle now with Data Flow, at any scale and with no administrative overhead.

Subscribe to the Oracle Big Data Blog to catch the latest on machine learning, all delivered straight to your inbox—and don’t forget to follow us on Twitter @OracleBigData.
