Data science keeps growing across organizations in terms of people, models, and communities. Today, the most popular communication channels and knowledge-sharing platforms are blogs, papers, GitHub, and data science meetups and workshops. Sometimes these channels are too theoretical, missing complete code or real-world examples, so you can't train and experiment on your own. In other cases, they include data, code, and detailed model descriptions, but data scientists may face version conflicts for some required libraries or even the whole framework. This happens within teams but is even more relevant for cross-team cooperation.
Thus, to have the same experience, including accuracy and processing time, data scientists should use the same platform, storage, data and model pipelines, and configuration. This is feasible only if the community operates on the same cloud resources, available across the organization with the corresponding permissions.
Cooperation is especially important in big companies, where the data science team has many contributors and communicates insights to other teams. Luckily, the cloud is now available at an affordable cost and lets users build the required infrastructure and set up a platform for experiments, model training, and testing.
There are well-known tools that enable collaborative data science. However, what if you need to work in an existing cloud with strict rules about the customer's data policy and nonstandard tools along with customized configuration? If you build your own data science platform, it gives you exactly that flexibility.
Consider the data science environment as a platform for many types of analysis that data scientists, business analysts, developers, and managers can all benefit from. It consists of a data lake and a number of compute nodes organized into CPU and GPU clusters. Cooperation within this environment eliminates data export/import operations, since all teams use the latest reliable data from the data lake and connected storage. All team members get the same results for training, testing, and reporting. It also makes it possible to copy the latest model configuration and tune it toward different goals.
Let’s gain deeper insight into a minimal environment architecture.
Designing and Deploying Your Environment
The design of a data lake architecture varies a lot, but let’s consider a simple one based on distributed file storage.
Apache Hadoop is an open-source framework for storing very large data sets across clusters of computers and performing parallel processing. It’s composed of the Hadoop Distributed File System (HDFS™), which handles scalability and redundancy of data across nodes, and Hadoop YARN, a job-scheduling framework that executes data processing tasks on all nodes. The baseline for the required environment is a 3-node Hadoop cluster. Nodes can have mixed roles such as NameNode, DataNode, ResourceManager, and NodeManager. If there is continuous data ingestion from different sources, streaming can be built with the Kafka stream-processing platform.
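As an illustration, HDFS replication pairs naturally with a 3-node baseline: with a replication factor of 3, every block has a copy on each DataNode. Below is a minimal `hdfs-site.xml` fragment; `dfs.replication` is a standard Hadoop property, while the value shown is just an example for this cluster size, not a recommendation for every deployment.

```
<!-- hdfs-site.xml (illustrative fragment for a 3-node cluster) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```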
There is no complex work done in stream processing; it just transforms the original delimiter-separated values to Parquet format. Compared to Hive tables, Parquet files are more flexible and don’t require a predefined schema. If the streamed values differ entirely from the standard format, either the stream processing applies a customized transformation, or the data is stored “as is” in HDFS. This stage is very important: since there is no dedicated project or analysis the data must be prepared for, the pipeline just makes the data available to data scientists without any information loss. Data sources vary from log files to inputs from different types of services and systems, but all this data goes to the data lake and is meant to be connected in the designed use cases.
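The key property of this ingestion step is that no schema is imposed up front. A minimal sketch of that idea in plain Python (in a real pipeline this would be a Kafka/Spark streaming job writing Parquet; the sample data and function name here are made up for illustration):

```python
import csv
import io

def rows_to_records(raw_text, delimiter=","):
    """Parse delimiter-separated values into a list of dicts.

    A simplified stand-in for the streaming transform: the header row
    is taken as-is and no schema is predefined, so the data lands in
    the lake without any information loss.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    return [dict(row) for row in reader]

# Hypothetical log sample; column names come straight from the source.
raw = "ts,host,bytes\n1,web-1,512\n2,web-2,1024\n"
records = rows_to_records(raw)
```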
With the data lake ready, the next step is configuring the cluster(s) to deploy an environment for data scientists with the required tools and diverse capabilities. Here’s the required toolset:
Additional analytics tooling, such as KNIME, installed on one of the data nodes or attached servers.
Let’s understand the role of each installed tool.
Apache Spark allows you to work with files from HDFS stored in Parquet or another format and prepare the data for specific analysis and modelling. You can either run prepared Spark jobs or develop the transformation pipeline as part of the model. Such pipelines let you use the latest ingested data for training and testing each time you run the model. The prepared structured data can be uploaded to a data mart or database for business analysis and visualization.
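Conceptually, such a reusable pipeline is just an ordered chain of transformations applied to whatever data was last ingested. The sketch below shows that idea in plain Python; in practice each step would be a Spark DataFrame operation (e.g. `filter`, `withColumn`), and the step names and sample records here are illustrative assumptions.

```python
def filter_valid(records):
    # Drop records missing required fields (stand-in for df.filter / dropna).
    return [r for r in records if r.get("bytes") is not None]

def add_kilobytes(records):
    # Derive a new column from an existing one (stand-in for df.withColumn).
    return [{**r, "kb": int(r["bytes"]) / 1024} for r in records]

def run_pipeline(records, steps):
    # Apply each transformation in order, so the same pipeline can be
    # re-run on the latest ingested data before every training run.
    for step in steps:
        records = step(records)
    return records

data = [{"host": "web-1", "bytes": "2048"}, {"host": "web-2", "bytes": None}]
prepared = run_pipeline(data, [filter_valid, add_kilobytes])
```

Because the pipeline is data-independent, re-running it against the latest lake snapshot yields consistent, reproducible inputs for training and testing.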
Python is the most popular programming language among data scientists today. Spark provides a Python API called PySpark, and many data science libraries, such as numpy, scikit-learn, and pandas, are developed in Python. TensorFlow has an intuitive higher-level Python API, too.
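As a tiny taste of the kind of preprocessing these libraries handle, here is feature standardization using only the standard library; numpy or scikit-learn (`StandardScaler`) would do the same over whole arrays at once. The values are made up.

```python
from statistics import mean, stdev

# Toy feature column (made-up values).
values = [12.0, 15.0, 9.0, 18.0]

# Standardize: subtract the mean, divide by the standard deviation,
# a common preprocessing step before model training.
mu, sigma = mean(values), stdev(values)
standardized = [(v - mu) / sigma for v in values]
```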
R – an open-source implementation of the S/S-PLUS language (a scripting language mostly used for data analysis, statistics, and graphics). R has a variety of scientific packages and is used to turn raw data into insight, knowledge, and understanding.
Jupyter Notebook is an interactive computational environment (a simple web browser view) where you can combine code execution, rich text, mathematics, plots, and rich media. Reusable pipelines for data transformation, modeling, and testing live in notebooks, so you save time on data preparation and understanding when moving from one target of analysis to another. In Jupyter, Python is the primary language for developing models, but you can run sections of the analysis in R by calling R scripts with the rpy2 library.
TensorFlow on top of Spark was announced just last year and can potentially bring many wins to the data science workflow.
Benefits of a Flexible Data Science Environment
Eventually, your data science environment will provide cooperative access to data insights for all data science team members and other teams.
There are definite benefits for team members, leaders, and managers in building a flexible data science environment:
Finally, the environment works well for fast prototyping and demos for an executive audience. It becomes easy for developers and business analysts to run the models and validate the expectations. Based on the business use case, data scientists define the set of data sources from the data lake and verify it through modelling and communication with the customer.
Model development to resolve each use case becomes an iterative process with documented goals and KPIs. Each model prototype goes through communication with the customer, verification and feedback using a list of criteria:
Data science teams often face “blurry” initial business requirements or a lack of understanding of machine learning capabilities, so the iterative process really helps to understand, clarify, and specify business needs while providing data insights and revealing potential opportunities.
Different business use case models based on the same data are the building blocks for a portfolio that provides a single pane of glass for your business. On the other hand, data scientists can reuse the data, model pipelines, code and models as features for new business use cases. Thus, having a variety of data, tools, knowledge, models and code in one data science environment allows your company to build a strong data science community for cross-team projects.