
Data Science Trials: Everything You Need to Know

John Peach
Principal Data Scientist

The decision to move to a data science platform often arises when productivity and collaboration show signs of strain, machine learning models cannot be audited or reproduced, and models are not making it into production. Data integration has also become increasingly complicated, with organizations needing to connect disparate data sources and applications. 

If this sounds like your organization, then it is time to trial a data science platform. Identifying the best solution can be a painstaking process, so I have pulled together a checklist of must-have elements for a successful machine learning trial.

Start Your Data Science Trial Today

In short, your goal should be to find a data science platform that solves the everyday problems you experience as a data scientist so you can successfully drive business outcomes. This includes looking for a platform that offers a suite of tools that helps you get your work done faster while enabling that work to be shared, audited, reproduced, and scaled.

 

Know Which Data Science Problems You’re Trying to Solve

Because of the nature of a data scientist’s work, as you well know, some days you need very little compute and the next you need quite a lot. This bursty workload can be a challenge for IT, who may also have to address the strains you put on databases or your requests for access to higher security levels as you work in production environments. A good data science platform can alleviate this dependency on IT while improving productivity and efficiency for data scientists and their teams.

Other data-science-specific challenges include:

  • Data and model provenance
  • Managing code versions
  • Sharing notebooks
  • Speeding up workflow using pipelined processes
  • The ability to reproduce and audit models once they are in production
  • Storing and moving large amounts of data
  • Decoupling model deployment from engineering so that data scientists own their models end to end

When working through a data science platform trial, keep in mind that once you have made your selection, you will ultimately need to present it to IT. When you do, make sure you note that you will be able to work more efficiently without running up costs, hampering security, or requiring constant, exhaustive support. 

 

What Should You Evaluate in a Machine Learning Trial?

Rather than spending hours talking to provider representatives, look for free or low-cost machine learning trials that give you ample time (at least one month) to try different services. Some trials offer guidance from real teams, but opt for those that are automated and simple to use; there will be plenty of time to talk to a provider when you are ready to take the next steps.

Below is a checklist of the key items you will want to make sure you evaluate in a data science trial:

 

Data Science Service Setup

One of the first things you will naturally want to try is to set up the primary work environment and look at the resources that you have. Be sure to look for:

  • A data catalog service that finds and governs data using an organized inventory of data assets.
  • A rich variety of sample notebooks or tutorials that will quickly get you up to speed on the tooling. They must provide practical examples related to your workflow. 
  • The ability to seamlessly use multiple tools and libraries and share notebooks with your colleagues for improved productivity. 

 

Running Big Data Applications

Running Spark on-prem is often a challenge for data scientists because the systems are sized for production workloads and not the bursty ad hoc workloads that data scientists create. This is one of the top reasons to choose a cloud-based data science platform. In your data science trials, make sure the functionality for big data applications:

  • Is well-integrated with the notebook environment
  • Provides batch and ad hoc processing
  • Gives consolidated control and visibility over applications

For example, Oracle Cloud Infrastructure Data Flow, part of Oracle’s data science platform, supports Spark MLlib so you can develop models with industry-standard algorithms. It is serverless, which means data scientists can quickly provision just the resources they need to run a job and then destroy the cluster. As a data scientist, the focus of your work is bringing business insights and machine learning models into production; there is little value in patching, upgrading, or managing clusters. The serverless approach removes that burden and lets you focus on where you bring real value to the organization. I recommend trialing machine learning with PySpark to get an idea of Data Flow’s functionality and ease of use.
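If you want a quick feel for the MLlib workflow before submitting a job to Data Flow, a minimal sketch like the one below exercises the same API. The CSV path and column names are placeholders, and this assumes PySpark is installed locally:

    # Minimal PySpark MLlib sketch: train a logistic regression model.
    # The input path and column names are placeholders; adapt them to your data.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-smoke-test").getOrCreate()

    # Expect a CSV with a binary `label` column and two numeric features.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # MLlib estimators consume a single vector column of features.
    assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    print(model.coefficients)

    spark.stop()

The same script, pointed at data in object storage, is the kind of job you would hand to Data Flow to run serverlessly.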

 

Cloud Analytics & Autonomous Databases

Strong cloud analytics and access to autonomous databases are indicators of a mature data science platform. Make sure you look for:

  • The ability to quickly provision a temporary database
  • The ability to develop models by bringing compute to the data
  • Analytics tools that can work transparently with data in other data stores 
  • Scale-out processing that minimizes data movement
  • Machine learning tools built into the databases

On Oracle’s data science platform, I recommend connecting to Oracle Autonomous Database and experimenting with its data visualization capabilities. I also recommend setting up Oracle Autonomous Data Warehouse and using the sample data in the SH schema, or loading your own data, to test the ease of data movement. Finally, test Oracle Machine Learning to see how easily you can train, test, and tune machine learning models from the data science notebooks while the heavy lifting is done in the database.
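As a rough sketch of the bring-compute-to-the-data pattern, you can connect a notebook to the database with a Python driver and pull back only the aggregate you need. The credentials and connection alias below are placeholders, and this assumes the cx_Oracle driver and the database wallet are already configured:

    # Sketch: aggregate in Autonomous Database, return only the small result.
    # User, password, and DSN are placeholders; the wallet must be configured.
    import cx_Oracle
    import pandas as pd

    conn = cx_Oracle.connect(user="ADMIN", password="<password>", dsn="mydb_high")

    query = """
        SELECT prod_id, SUM(amount_sold) AS total_sold
        FROM sh.sales
        GROUP BY prod_id
    """
    df = pd.read_sql(query, con=conn)  # the GROUP BY runs in the database
    print(df.head())
    conn.close()

Because the aggregation happens in the database, only a few rows cross the network, which is exactly the scale-out, minimal-data-movement behavior you want to verify.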

 

Block Storage and Data Integration

Make sure your data science platform offers virtually limitless, inexpensive storage and easy integration between databases and other data sources. Items to add to your checklist include the following (a short storage sketch follows the list):

  • A platform where the underlying infrastructure is provisioned and maintained for you
  • Payment options that keep costs down by only requiring payment for infrastructure resources when you are using them
  • A strong pathway from data integration to block storage. Testing the speed and ease of use of the solution’s extract, transform, load (ETL) tooling will give you a good indication of this integration.
  • The ability to replicate large volumes of data and then dispose of them
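For a quick test of the storage pathway, a round trip through object storage with the OCI Python SDK gives you a feel for how easily data moves in and out. The bucket name and file are placeholders, and this assumes a standard ~/.oci/config file:

    # Sketch: upload an object and then delete it to test replicate-and-dispose.
    # The bucket and file names are placeholders.
    import oci

    config = oci.config.from_file()
    client = oci.object_storage.ObjectStorageClient(config)
    namespace = client.get_namespace().data

    with open("sample.csv", "rb") as f:
        client.put_object(namespace, "my-trial-bucket", "sample.csv", f)

    client.delete_object(namespace, "my-trial-bucket", "sample.csv")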

 

The Data Catalog

Your data science platform’s data catalog is a key element in your ability to find, organize, enrich, and trace data assets. Key functionality you will want to look for during your trials includes the following (a short scripted example follows the list):

  • Self-service solutions that help you find and govern data across the enterprise
  • Transparency and traceability that allows you to know where data came from to support governance and auditability
  • Automation of data management tasks that will help you improve productivity at scale
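If the platform exposes its catalog through an API, it is worth scripting a quick listing as part of the trial. Here is an illustrative sketch using the OCI Python SDK; the compartment OCID is a placeholder, and method names may differ across SDK versions:

    # Sketch: list the data catalogs in a compartment.
    # The compartment OCID is a placeholder.
    import oci

    config = oci.config.from_file()
    catalog_client = oci.data_catalog.DataCatalogClient(config)

    catalogs = catalog_client.list_catalogs(
        compartment_id="ocid1.compartment.oc1..<placeholder>").data
    for catalog in catalogs:
        print(catalog.display_name, catalog.lifecycle_state)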

 

Innovative New Data Science Tools

Each data science platform will feature innovative tools you may not have known were available. Take note of which solutions offer the innovations that best meet your needs and budget. These tools should improve your workflow by speeding up repetitive processes and expand the value that you bring to the organization.

 

Subscribe to the Oracle AI & Data Science Newsletter to get the latest AI, ML, and data science content sent straight to your inbox! 

 

Key Notebooks to Test Oracle’s Accelerated Data Science SDK

One of the unique tools on Oracle’s data science platform is the Accelerated Data Science SDK (ADS). ADS is a native Python library available within Oracle Cloud Infrastructure Data Science service that contains tools covering the end-to-end lifecycle of predictive machine learning models. This includes data acquisition, data visualization, data profiling, automated data transformation, feature engineering, model training, model evaluation, model explanation, and capturing the model artifact itself. 

The goal of ADS is to provide a set of powerful tools that help data scientists perform routine operations such as exploratory data analysis, model selection, and hyperparameter tuning. Once you have a model, ADS also provides machine learning explainability (MLX) features. MLX lets you understand what a black-box model is doing, both at a global level and for individual predictions. The explanations are agnostic to the model structure and give you an understanding of how the model works, so you can confirm that it has learned the correct things and check it for bias. Once you have done that, you can be confident that it will perform well in production.

When trialing ADS, I highly recommend testing the following notebooks:

1. Working with an ADSDataset Object (adsdataset_working_with.ipynb): One of the most important elements of any data science project is the data itself. This notebook demonstrates how to work with the ADSDataset class. The ADSDataset is like a data frame but with many additional features that will improve your workflow.

Why It Is Important: Having a powerful way of representing your data in the notebook will improve your performance. The ADSDataset allows the data scientist to work with data that is larger than what will fit into memory, yet manipulate it as if it were all in memory. It also has features that link the data to the type of problem you are working on: it lets you define the dependent (target) variable in a way the ADS model understands, and it helps in exploring the data.

2. Introduction to Loading Data with the Dataset Factory (datasetfactory_loading_data.ipynb): This notebook demonstrates how to use ADSDataset to read in data from a wide selection of standard formats. There is no need to learn a new package for each data source or format; the DatasetFactory.open() method does it all.

Why It Is Important: The ADS DatasetFactory is a powerful class that makes it easy to access data from all kinds of different sources, including such things as sample data in the class, web data, S3, Oracle Cloud Infrastructure Object Storage, and flat files. This single class standardizes the way to access data from a large number of sources.
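As a small sketch of the pattern these two notebooks walk through, assuming the classic ads.dataset.factory API this post describes (the object storage path and target column are placeholders):

    # Sketch: load a CSV from object storage into an ADSDataset.
    # The path and target column name are placeholders.
    from ads.dataset.factory import DatasetFactory

    ds = DatasetFactory.open("oci://my-bucket@my-namespace/adult.csv",
                             target="income")

    ds.show_in_notebook()  # summary, schema, and visualizations in one call

The same open() call works for local flat files, web data, and object storage, which is the point of the notebook.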

3. Introduction to Dataset Factory Transformations (transforming_data.ipynb): To get the best performance out of your model, it is imperative that data condition issues be detected and fixed. Depending on the class of model being used, different transformations should be applied. This notebook shows you how ADS helps you do this.

Why It Is Important: A lot of a data scientist’s time is spent cleaning up data condition issues. ADS makes it easy to find those issues and fix them, and it can even apply the fixes automatically for you.
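Continuing the sketch above, the classic ADS API exposes the recommendation and auto-transform steps roughly like this (method names may vary across SDK versions):

    # Sketch: detect data condition issues and apply the recommended fixes.
    # `ds` is the ADSDataset from the previous sketch.
    ds.get_recommendations()              # interactively review detected issues
    transformed_ds = ds.auto_transform()  # or apply the recommended fixes automatically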

4. Classification for Predicting Census Income with ADS (classification_adult.ipynb): In this notebook, you can build a classifier using the OracleAutoMLProvider tool for the public Census Income dataset. This is a binary classification problem, and more details about the dataset can be found at https://archive.ics.uci.edu/ml/datasets/Adult. You can explore the various options provided by the Oracle AutoML tool, which allows users to exercise control over the AutoML training process. Finally, you can evaluate the different models trained by Oracle AutoML.

Why It Is Important: The ADS SDK has powerful tools that are built on top of open-source libraries. This notebook provides a practical example of how to use AutoML to generate high quality models.
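A condensed version of what the notebook does, again assuming the classic ADS AutoML API (the time budget is a placeholder):

    # Sketch: train a classifier with Oracle AutoML through ADS.
    # `ds` is an ADSDataset with its target already set.
    from ads.automl.driver import AutoML
    from ads.automl.provider import OracleAutoMLProvider

    train, test = ds.train_test_split(test_size=0.2)
    automl = AutoML(train, provider=OracleAutoMLProvider())
    model, baseline = automl.train(time_budget=60)  # seconds; placeholder budget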

5. Introduction to Model Evaluation with ADSEvaluator (model_evaluation.ipynb): This notebook demonstrates the capabilities of ADSEvaluator, the model evaluation component of the Accelerated Data Science (ADS) SDK. You will see how it can be used to evaluate any general class of supervised machine learning models, as well as to compare models within the same class.

This notebook covers binary classification with an imbalanced data set, multi-class classification with a synthetically generated data set of three equally distributed classes, and lastly a regression problem. The models are trained with open-source libraries and then evaluated with ADSEvaluator. It demonstrates how the tools you know and love can be enhanced with ADSEvaluator.

Why It Is Important: The way that you evaluate models is fairly standardized. The ADSEvaluator speeds up the process by determining what metrics you need to look at and then computing them for you.
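In rough strokes, and again assuming the classic ADS API, wrapping an open-source model and evaluating it looks like this:

    # Sketch: evaluate a scikit-learn model with ADSEvaluator.
    # `train` and `test` are the ADS datasets from the AutoML sketch.
    from ads.common.model import ADSModel
    from ads.evaluations.evaluator import ADSEvaluator
    from sklearn.linear_model import LogisticRegression

    sk_model = LogisticRegression().fit(train.X, train.y)
    ads_model = ADSModel.from_estimator(sk_model)

    evaluator = ADSEvaluator(test, models=[ads_model], training_data=train)
    evaluator.show_in_notebook()  # renders the standard metric suite for the task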

6. Model Explanations for a Regression Use Case (mlx_regression_housing.ipynb): In this notebook, you will perform an exploratory data analysis (EDA) to understand the Boston housing dataset. The Boston housing dataset is a regression dataset that contains information about houses located in different neighborhoods and suburbs of Boston, Massachusetts. The target variable is a continuous value representing the monetary value of a house.

You will train a model to predict the house prices and then evaluate how well the model generalizes to the problem. Once you are satisfied with the model, you can look into how the model works, using model-agnostic explanation techniques. Specifically, you will learn how to generate global explanations (to help understand the general behavior of the model) and local explanations (to understand why the model made a specific prediction).

Why It Is Important: It can be a challenge to understand what a black box model is doing. It is also important to make sure that the model has learned the correct things and to check for bias. Machine learning explainability (MLX) empowers the data scientist to do that.
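ADS ships its own MLX tooling for this, but the underlying idea is easy to demonstrate with any model-agnostic technique. As a generic illustration, here is a global explanation via permutation importance in scikit-learn; the sketch uses the California housing data because newer scikit-learn releases removed the Boston dataset:

    # Sketch: a model-agnostic global explanation using permutation importance.
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Shuffle each feature and measure the drop in score: features whose
    # shuffling hurts the most matter most to the model globally.
    result = permutation_importance(model, X_test, y_test, n_repeats=5,
                                    random_state=0)
    for name, importance in zip(X.columns, result.importances_mean):
        print(f"{name}: {importance:.3f}")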

 

Start Experimenting with Data Science Today

There are many resources available on the web to help you find data science trials. Ultimately, your final selection should offer innovation, meet your budget, and solve the challenges you experience every day so that you can get your models into production and drive business outcomes.

Don’t forget, Oracle offers a machine learning trial that includes $300 of free credits and allows you to try all aspects of Oracle’s data science platform.

To start your trial of the Oracle Data Science Platform, click here.

To learn more about Oracle's data science products, visit the Oracle Data Science page, and follow us on Twitter @OracleDataSci.
 
