How to Build a Collaborative Data Science Environment

December 18, 2019 | 6 minute read
Anna Chystiakova
Senior Data Scientist

Data science keeps growing in terms of people, models, and communities across organizations. Today, the most popular communication channels and knowledge-sharing platforms are blogs, papers, GitHub, and data science meetups and workshops. Sometimes these channels are too theoretical, missing complete code or real-world examples, so you can't train and experiment on your own. In other cases, a channel provides data, code, and detailed model descriptions, but data scientists face version conflicts for some required libraries, or even the whole framework. This happens inside teams but is even more relevant for cross-team cooperation.

Thus, to get the same experience, including accuracy and processing time, data scientists should use the same platform, storage, data and model pipelines, and configuration. This can happen only if the community operates on the same cloud resources, available across the organization with the corresponding permissions.

Cooperation is highly important in big companies, where the data science team has many contributors and communicates insights to other teams. Luckily, the cloud is now available at an affordable cost and lets users build the required infrastructure and set up a platform for experiments, model training, and testing.

Data Science Environment

There are well-known tools that enable collaborative data science -- Databricks is a major player right now. However, what if you need to work in an existing cloud with strict rules about the customer's data policy, nonstandard tools, and customized configuration? If you build your own data science platform, it provides the following opportunities:

  • Developed models can be adjusted and reused for other predictions in the same type of environment in which they were developed and trained.
  • Input data, models, and results are available to all team members in a data lake with tightly controlled security, enabling better cooperation.
  • Customized data science tools and data sources in one place for faster and more accurate analysis.

Consider the data science environment as a platform for many types of analysis that data scientists, business analysts, developers, and managers can all benefit from. It consists of an entire data lake and a number of compute nodes organized into CPU and GPU clusters. Cooperation within this environment eliminates data export/import operations, since all teams use the latest reliable data from the data lake and connected storage. All team members get the same results for training, testing, and reporting. It also makes it easy to copy the latest model configuration and tune it toward different goals.

Let’s take a deeper look at the minimum environment architecture.


Designing and Deploying Your Environment

The design of a data lake architecture varies a lot, but let’s consider a simple one based on distributed file storage.

Apache Hadoop is an open-source framework for storing very large data sets across clusters of computers and processing them in parallel. It is composed of the Hadoop Distributed File System (HDFS™), which handles scalability and redundancy of data across nodes, and Hadoop YARN, a job-scheduling framework that executes data processing tasks on all nodes. The baseline for the required environment is a 3-node Hadoop cluster. Nodes can have mixed roles, such as NameNode, DataNode, ResourceManager, and NodeManager. If there is continuous data ingestion from different sources, streaming can be built with the Kafka stream-processing platform.
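As a minimal sketch of the storage layer, a 3-node cluster could replicate each HDFS block to all three DataNodes via `hdfs-site.xml` (the `dfs.replication` property is standard Hadoop configuration; the value here is illustrative for a cluster this size):

```xml
<!-- hdfs-site.xml: keep a copy of every block on each of the 3 DataNodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```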

Kafka Stream Processing

No heavy work is done in stream processing; it just transforms the original delimiter-separated values to parquet format. Compared to Hive tables, parquet files are more flexible and don’t require a predefined schema. If the streamed values differ completely from the standard format, either the stream processor applies a customized transformation, or the data is stored “as is” in HDFS. This stage is very important: since there is no dedicated project or analysis the data should be prepared for, the pipeline just makes the data available to data scientists without any information loss. Data sources range from log files to inputs from different types of services and systems, but all of this data goes to the data lake and is meant to be connected in designed use cases.

With the data lake ready, the next step is configuring the cluster(s) to deploy an environment for data scientists with the required tools. Here’s the required toolset:

  • Apache Spark cluster-computing framework installed on all nodes. The driver runs inside an application master process managed by YARN on the cluster; worker processes run on each DataNode.
  • Python: the same version installed on all cluster nodes, with all basic data science libraries
  • R installed on all cluster nodes (optional)
  • Jupyter Notebook installed on a couple of cluster nodes (recommended)
  • TensorFlow on top of Spark

Additional analytics tooling, such as KNIME, can be installed on one of the data nodes or on attached servers.

Let’s look at the role of each installed tool.

Apache Spark lets you work with files from HDFS, stored in parquet or another format, and prepare the data for specific analysis and modelling. You can either run prepared Spark jobs or develop the transformation pipeline as part of the model. The resulting pipelines let you use the latest ingested data for training and testing each time you run the model. The prepared structured data can be uploaded to a data mart or database for business analysis and visualization.

Python is the most popular programming language among data scientists today. Spark provides a Python API called PySpark, and many data science libraries, such as NumPy, scikit-learn, and pandas, are developed in Python. TensorFlow has an intuitive higher-level Python API, too.
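To give a feel for how these libraries fit together, here is a tiny, self-contained example (the data and function are invented for illustration): NumPy generates an array, pandas holds it as a table, and scikit-learn fits a model on it.

```python
# Toy illustration of the core Python data science stack working together;
# the dataset (y = 2x) and function name are purely illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_example() -> float:
    # Build a small tabular dataset with pandas on top of numpy arrays.
    df = pd.DataFrame({"x": np.arange(10.0)})
    df["y"] = 2.0 * df["x"]
    # Fit a linear model; it should recover the slope of 2.
    model = LinearRegression().fit(df[["x"]], df["y"])
    return float(model.coef_[0])
```

With the same library versions installed on every cluster node, a snippet like this produces identical results for every team member.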

R is an open-source implementation of the S/S-PLUS language, a scripting language popular mostly for data analysis, statistics, and graphics. R has a variety of scientific packages and is used to turn raw data into insight, knowledge, and understanding.

Jupyter Notebook is an interactive computational environment (a simple web browser view) where you can combine code execution, rich text, mathematics, plots, and rich media. Reusable pipelines for data transformation, modeling, and testing live in notebooks, so you save time on data preparation and understanding when moving from one target of analysis to another. Python is the primary language for developing models in Jupyter, but you can also run sections of the analysis in R by calling R scripts through the rpy2 library.

TensorFlow on top of Spark was announced just last year and can potentially bring a lot of wins to the data science workflow.


Benefits of a Flexible Data Science Environment

Eventually, your data science environment will provide cooperative access to data insights for all data science team members and other teams. 

Cooperative Access to Data Insights

There are clear benefits for team members, leaders, and managers in building a flexible data science environment:

  • A single source of data, models and analysis -- time savings
  • The latest versions of models and workflows for different projects -- efficiency increase
  • Shareable compute resources and optimized workflows -- efficiency increase
  • A variety of tools and shared knowledge -- efficiency increase
  • Data science community 

Finally, the environment works well for fast prototyping and demos for an executive audience. It becomes easy for developers and business analysts to run the models and validate expectations. Based on the business use case, data scientists define the set of data sources from the data lake and verify it through modelling and communication with the customer.

Modelling and Communication with the Customer

Model development to resolve each use case becomes an iterative process with documented goals and KPIs. Each model prototype goes through communication with the customer, verification and feedback using a list of criteria: 

  • Model meets the business goals
  • The model’s accuracy is acceptable or better
  • Explainability of the approach is satisfied
  • Frequency and format of delivery are approved

Often, data science teams face the issue of "blurry" initial business requirements or a lack of understanding of machine learning capabilities, so the iterative process really helps to understand, clarify, and specify business needs while providing data insights and potential opportunities.

Different business use case models based on the same data are the building blocks for a portfolio that provides a single pane of glass for your business. On the other hand, data scientists can reuse the data, model pipelines, code and models as features for new business use cases. Thus, having a variety of data, tools, knowledge, models and code in one data science environment allows your company to build a strong data science community for cross-team projects.

To learn more about AI and machine learning, visit the Oracle AI page. You can also try Oracle Cloud for free.


Anna Chystiakova

Senior Data Scientist

Anna Chystiakova is a Senior Data Scientist at Oracle. She designs approaches to empower Oracle SaaS Engineering with data-driven decisions in capacity and budget planning, using predictive analytics, dashboards, and machine learning models. Anna has a Master's in Computer Science and 12 years of experience, holds a patent in time-series prognostics, and is a member of the External Research Office at Oracle.
