Data science projects are increasingly pervasive in enterprises both large and small. Such projects rely on not only data, data preparation, visualization, and machine learning, but also data management and the ability to deploy solutions quickly and easily. Additionally, such projects are not only the realm of data scientists, but include a broader range of personas, or user roles, to realize solutions to business problems. These roles include business and data analysts, database administrators and information technology professionals, application and dashboard developers, as well as executives who sponsor key data science initiatives.
In this blog, we look at the capabilities that are important to individuals in these user roles.
Data Scientists
Let’s start with the capabilities data scientists rely on to achieve enterprise goals. Today, Python, R, and SQL are powerful and popular data science languages, enabling data scientists to explore, visualize, and prepare data, and to develop machine learning models. With the rich open source ecosystems of Python and R, data scientists benefit from thousands of packages and libraries that aid productivity, spare them from reinventing standard techniques, and make state-of-the-art techniques available.
However, two of the challenges facing data scientists are scalability and performance. In enterprise settings, data volumes and enterprise deployment requirements may conflict with the benefits of open source software. Many open source packages are not designed with large-scale performance in mind and rely on data residing in memory, which limits scalability. Solutions are needed that combine the best of open source with the scalability and performance of enterprise-level tools.
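To make the in-memory limitation concrete, here is a minimal sketch that contrasts pulling every row into the client with pushing the same aggregation into the database. It uses Python's built-in sqlite3 module purely as a stand-in for an enterprise database; the `sales` table, its columns, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import Dict


def build_demo_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `sales` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = ((("north", "south")[i % 2], float(i % 100)) for i in range(n_rows))
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def mean_client_side(conn: sqlite3.Connection) -> Dict[str, float]:
    """Fetch every row into client memory, then aggregate in Python."""
    totals: Dict[str, tuple] = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales").fetchall():
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / c for region, (s, c) in totals.items()}


def mean_in_database(conn: sqlite3.Connection) -> Dict[str, float]:
    """Let the database aggregate; only one row per region reaches the client."""
    query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


if __name__ == "__main__":
    conn = build_demo_db()
    print(mean_client_side(conn))
    print(mean_in_database(conn))
```

Both functions return the same averages, but the first moves the whole table to the client while the second moves only the result. On a table with billions of rows, that difference is what decides whether the analysis runs at all.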
Increasingly, data scientists are looking for tools that automate traditionally manual and repetitive machine learning tasks. These can range from data preparation and feature selection to algorithm selection and model tuning. Automation doesn’t mean replacing data scientists, but making them more productive and freeing up more time for solving additional business problems.
Collaboration is also an important capability as data scientists need to work with other enterprise users to realize data science solutions. Collaboration involves communication, but also environments that facilitate cooperative problem solving and provide easy and immediate access to work products like scripts, notebooks, models, and visualizations.
Since data is at the center of a data science solution, data scientists also require the ability to find, access, and integrate data across the enterprise. Platforms that provide data catalog functionality and ease of granting and managing data access can greatly benefit data science project outcomes.
Business and Data Analysts
Business and data analysts—once the key data analysis force in enterprises that relied mostly on deductive techniques involving spreadsheets, database queries, and business intelligence tools—are themselves expanding their analytical tool set with machine learning. One of the more recent developments enabling analysts is automated machine learning.
Automated machine learning is particularly valuable for analysts who may not yet have formally enhanced their skillset with machine learning methodologies and algorithms, but can apply their extensive domain knowledge to enterprise business problems. Automated machine learning—whether provided as a code-free user interface or programming interface—can deal with many of the machine learning algorithm-specific details, such as data preparation requirements, the algorithm(s) to be used, the predictors most suited to the algorithm, and how algorithm-specific parameters should be tuned to optimize predictive model performance.
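The core loop behind algorithm selection and parameter tuning can be sketched in a few lines. The example below is a toy illustration, not any product's implementation: it fits several candidate models (a majority-class baseline and k-nearest-neighbor classifiers with different values of k), scores each on held-out data, and keeps the best. All names, the candidate list, and the synthetic data are assumptions made for the sketch.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Model = Callable[[Point], int]
Fitter = Callable[[List[Point], List[int]], Model]
Dataset = Tuple[List[Point], List[int]]


def fit_majority(xs: List[Point], ys: List[int]) -> Model:
    """Baseline: always predict the most frequent training label."""
    label = Counter(ys).most_common(1)[0][0]
    return lambda _point: label


def make_knn_fitter(k: int) -> Fitter:
    """Return a fitter for a k-nearest-neighbor classifier with the given k."""
    def fit(xs: List[Point], ys: List[int]) -> Model:
        def predict(point: Point) -> int:
            by_distance = sorted(
                range(len(xs)),
                key=lambda i: (xs[i][0] - point[0]) ** 2 + (xs[i][1] - point[1]) ** 2,
            )
            votes = Counter(ys[i] for i in by_distance[:k])
            return votes.most_common(1)[0][0]
        return predict
    return fit


def accuracy(model: Model, xs: List[Point], ys: List[int]) -> float:
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def auto_select(candidates: Dict[str, Fitter], train: Dataset, valid: Dataset):
    """Fit every candidate on the training split and rank by validation accuracy."""
    scores = {
        name: accuracy(fit(*train), *valid) for name, fit in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def make_demo_data() -> Tuple[Dataset, Dataset]:
    """Two well-separated synthetic clusters; every fourth point is held out."""
    points = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
    points += [(2 + i * 0.1, 2 + j * 0.1) for i in range(5) for j in range(5)]
    labels = [0] * 25 + [1] * 25
    train = ([p for n, p in enumerate(points) if n % 4],
             [y for n, y in enumerate(labels) if n % 4])
    valid = ([p for n, p in enumerate(points) if n % 4 == 0],
             [y for n, y in enumerate(labels) if n % 4 == 0])
    return train, valid


# Hypothetical search space: one baseline plus three settings of k.
CANDIDATES: Dict[str, Fitter] = {
    "majority": fit_majority,
    "knn_k1": make_knn_fitter(1),
    "knn_k3": make_knn_fitter(3),
    "knn_k5": make_knn_fitter(5),
}

if __name__ == "__main__":
    best, scores = auto_select(CANDIDATES, *make_demo_data())
    print(best, scores)
```

Real automated machine learning tools search far larger spaces and typically use cross-validation rather than a single holdout split, but the shape is the same: the analyst supplies the data and the business question, and the tool handles the algorithm-specific choices.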
Business and data analysts also benefit from broader collaboration, for example by taking up scripts, prepared data, and models from data scientists to augment their work, or by making their own results available to developers for inclusion in applications and dashboards.
DBAs and IT Professionals
Most enterprises today view data management as a key business function, normally relying on database administrators and information technology professionals to support the data, software, hardware—and now Cloud—needs of the enterprise. DBAs and IT professionals strive to maximize the value enterprises derive from their technology investment.
They must also provide other users, like data scientists, analysts, and developers, with the scalability and performance needed to address enterprise-scale business problems. The availability of integrated, streamlined, converged, and automated infrastructure not only facilitates data science projects and reduces enterprise costs and complexity, but also frees up DBA and IT resources to tackle projects often deferred due to time-consuming maintenance activities, e.g., software upgrades and patching, system backups, and recovery from failures.
Today, many DBAs are expanding their skills in the area of data science, taking advantage of machine learning integrated with database management systems, whether through SQL or integration with Python and R. Often, their extensive knowledge of database technology combined with data manipulation using SQL serves as an excellent foundation for contributing to enterprise data science projects.
In support of other enterprise users, DBAs and IT professionals need to manage access to data in database and big data sources. Historically, and even in enterprises today, data access is provided ad hoc. Users ask DBAs to provide data extracts, often as flat files, and these requests can take several iterations to get the “right” data. This wastes both the data users’ time and that of the DBA. As data volumes have grown, such approaches simply don’t scale, either due to human resource limitations or due to the sheer volume of data involved. Tools that enable direct access to data, with proper security and data life cycle controls, can greatly aid DBAs and IT professionals.
Application and Dashboard Developers
Application and dashboard developers are often on the receiving end when it comes to machine learning solutions. They take results from data scientists and analysts and weave them into new or existing applications and dashboards for use across the enterprise or by external customers.
One of the major challenges facing enterprises is deploying machine learning solutions in production. Even when data science projects succeed in solving business problems, enterprises may not realize the benefit because of the difficulty of integrating with existing systems or meeting time-to-market requirements. Additionally, some enterprises, like those dealing with fraud, need to refresh and redeploy models very quickly, e.g., within hours of detecting a problem. A well-integrated software stack therefore helps deliver intelligent solutions faster, as does the ability to leverage R- and Python-based solutions easily in combination with enterprise software. Since most applications and dashboard tools interact with relational databases using SQL, the ability to invoke R and Python seamlessly using SQL greatly simplifies embedding machine learning into applications and dashboards.
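As a rough illustration of invoking Python from SQL, the sketch below registers a Python scoring function with a database connection so that an ordinary SQL query can call it. It uses Python's built-in sqlite3 module and its `create_function` method as a stand-in; it is not the interface of any particular enterprise database. The `transactions` table, the `fraud_score` function, and the model weights are all hypothetical.

```python
import math
import sqlite3
from typing import List

# Hypothetical logistic-regression coefficients, e.g. exported by a data scientist.
WEIGHTS = {"intercept": -4.0, "amount": 0.002, "foreign": 1.5}


def fraud_score(amount: float, is_foreign: int) -> float:
    """Return a fraud probability in (0, 1) from the hypothetical model."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["amount"] * amount
         + WEIGHTS["foreign"] * is_foreign)
    return 1.0 / (1.0 + math.exp(-z))


def open_scoring_db() -> sqlite3.Connection:
    """Create a demo database and expose fraud_score() to SQL."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function under a SQL-callable name with two arguments.
    conn.create_function("fraud_score", 2, fraud_score)
    conn.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, is_foreign INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 50.0, 0), (2, 3000.0, 1), (3, 800.0, 0)],
    )
    conn.commit()
    return conn


def flagged_transactions(conn: sqlite3.Connection, threshold: float = 0.5) -> List[int]:
    """The kind of query an application or dashboard might issue."""
    rows = conn.execute(
        "SELECT id FROM transactions "
        "WHERE fraud_score(amount, is_foreign) > ? ORDER BY id",
        (threshold,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(flagged_transactions(open_scoring_db()))  # prints [2]
```

The application only ever sees SQL. Refreshing the model means changing the weights or the registered function, while the query in the dashboard stays the same, which is what makes fast redeployment practical.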
Executives
Lastly, C-level executives recognize the importance of and need for world-class data management technology and support. Mission-critical applications demand scalability and reliability, whether on-premises, private cloud, cloud at customer, or public cloud. Increasingly, automation of many standard database activities is not a nice-to-have feature, but a must-have capability. Today, the ability to apply security patches to tens or hundreds of database systems quickly can mean the difference between a minor nuisance and a viral data breach news story.
Executives also see the value in empowering knowledge workers across the enterprise with machine learning technology to enable better data-driven decisions. In effect, this democratization of machine learning across the enterprise helps a broader range of users to look at data not just as static content to query and summarize, but as a new corporate asset that can be used to understand customers better, determine the root causes of problems, predict demand, and recommend actions, just to name a few.
Also critical for executives is the ability to take these new insights from across the enterprise and deploy them faster to realize return on their data science investment—reflected in data, people, software, hardware, and Cloud resources.
To meet the needs of data scientists, business and data analysts, DBAs and IT professionals, application and dashboard developers, and executives, Oracle provides an integrated, multi-model, converged data management platform, both in the cloud and on-premises, which includes machine learning. The Oracle Machine Learning product family enables scalable data science projects in which enterprises achieve data science project goals faster while taking full advantage of their Oracle platform. Oracle Machine Learning consists of complementary components supporting scalable machine learning algorithms for in-database and big data environments, notebook technology, and APIs for SQL, R, and, coming soon, Python. See https://oracle.com/machine-learning for more details and the companion video.