When will the query finish?
At 9:00pm, I was still in the office. I wanted to run a data quality report and print it out to show my boss the next day.
My wife called. It was the third time she had called; she was upset because I had promised to be home by 7:00pm. I canceled the query and submitted it one last time.
But this time, I had given up any hope of having data ready for statisticians one day ahead of schedule. Instead, I created a ticket for the database administrator to find out why my new query did not finish in time.
That was in 1995. I was a database analyst without full access to the database backend, and it was a mystery to me why my report did not finish in time. I wished I had the knowledge and power to analyze and control the situation myself.
I was lucky -- I became a database administrator not long after that. I began to manage multiple database platforms: Oracle, Sybase, and Microsoft SQL Server. It was fun to be the king of the enterprise database world, dictating what could or could not be run.
Oracle databases are usually the largest and most scalable databases, but they are also the easiest to tune -- I can gather information from the v$session and v$session_wait database views. These views expose an infrastructure called the Oracle Wait Interface, which takes the guesswork out of performance tuning.
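As a minimal sketch, a diagnostic query against these views might look like the following (column names reflect recent Oracle releases, where the wait columns were merged into v$session itself; on older versions you would join v$session_wait on SID instead):

```sql
-- Show which user sessions are waiting, and on what.
-- Requires SELECT privilege on the v$ views (typically a DBA account).
SELECT s.sid,
       s.username,
       s.sql_id,            -- the SQL statement the session is running
       s.event,             -- the wait event, e.g. 'db file sequential read'
       s.wait_class,        -- e.g. 'User I/O', 'Concurrency'
       s.seconds_in_wait
FROM   v$session s
WHERE  s.username IS NOT NULL       -- skip background processes
AND    s.wait_class <> 'Idle'
ORDER  BY s.seconds_in_wait DESC;
```

A query like this turns "why is it slow?" into a concrete answer -- the sessions at the top of the list tell you whether the bottleneck is I/O, locking, or something else.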
Even though it is easier to identify performance problems in an Oracle database than in Sybase or SQL Server, database performance is still unpredictable. Some in the industry have claimed that tuning is a nightmare and that auto-tuning is wishful thinking.
Finally, Oracle decided to take on the challenge and retire its rule-based optimizer -- the transition from a DBA manually tuning queries to a database automatically tuning SQL queries.
In the early days of upgrades to database versions that only had a cost-based optimizer, many Oracle DBAs and developers got frustrated trying to figure out the best way to collect database statistics and use Oracle tools to stabilize performance. (Even today, there are still database administrators who do not collect statistics or use the tools properly.)
Oracle documentation has a very good analogy for the cost-based optimizer: an online trip advisor.
Let’s say a cyclist wants to know the most efficient bicycle route from point A to point B. The advisor picks the most efficient (lowest cost) overall route based on user-specified goals and the available statistics about roads and traffic conditions. The more accurate the statistics, the better the advice. For example, if the advisor is not frequently notified of traffic jams, road closures, and poor road conditions, then the recommended route may turn out to be inefficient (high cost).
Collecting statistics to cover all scenarios is the main driver behind the success of a cost-based optimizer, and Oracle has been on this path since Oracle 7 in 1992.
The Oracle optimizer did not stop at the database tier. With the introduction of Exadata, Oracle optimization reaches deep into the storage and InfiniBand network layers. Smart Scan allows most SQL processing to happen in the storage tier instead of the database tier, which dramatically improves query performance: it reduces the volume of data sent to the database tier, thereby reducing CPU usage on the database nodes. Similarly, Oracle Big Data SQL has a Smart Scan optimization feature, which enables organizations to immediately analyze data across Apache Hadoop, Apache Kafka, NoSQL, and Oracle databases.
Compared with LIDAR for self-driving cars, Oracle's machine-learning-driven database does not make headline news. Yet this technology makes it possible for a database to perform well by learning from its own experience: running a SQL query, then deciding what to change.
Elon Musk recently commented, “LIDAR is a fool's errand. Anyone who relies on LIDAR is doomed. Expensive sensors are unnecessary.”
Tesla plans to use only cameras and radar, without LIDAR, for its self-driving cars. Whether Elon Musk will succeed remains to be seen, but his remarks do show that he and Larry Ellison share a similar vision -- that artificial intelligence and machine learning need not rely on fancy hardware, or even a sophisticated machine-learning framework, to be implemented.
Life was not easy for Oracle DBAs, and neither was it for the engineers testing self-driving cars. I have full respect for all the people who worked through the early days, when you could not use rule-based hints to force a SQL query to use an index lookup over a full table scan.
After more than two decades of improvements, the Autonomous Database is finally here -- unlike self-driving cars, which remain a dream for the future.
Fast-forward to OpenWorld 2017, right after Oracle unveiled the world's first Autonomous Database Cloud. Suddenly, many database vendors claimed that their managed database platforms were autonomous and truly self-driving as well. Whenever I talked about these other "autonomous databases" with Oracle database administrators, we all laughed...
Now back to the present day, 2019. I was talking to a developer about his data science project, in which he brought all the data into a Python pandas DataFrame. The real-time analytics and beautiful graphics caught my attention.
Because of my "occupational hazard" as a longtime DBA, I was curious how he processed everything on an AWS EC2 instance that was not even as powerful as a low-end laptop. He answered, "Because the source data is less than 100 megabytes."
That made me wonder, "What if the source data is more than 100 terabytes?"
While there are use cases where you only need a small dataset, the growing volume of data has caused data scientists to use sampling techniques that could reduce the accuracy of machine learning models.
Instead of risking significant sampling bias or information loss, why not bring the machine learning algorithm to the data?
Actually, Oracle has done that for years. It is called Oracle Advanced Analytics (OAA), and its components are Oracle Database, Oracle Data Mining, and Oracle R Enterprise.
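For illustration, here is roughly what "bringing the algorithm to the data" looks like with the Oracle Data Mining PL/SQL API -- the model is trained on a table that never leaves the database. The table, column, and model names below are hypothetical placeholders:

```sql
-- Train a classification model directly on in-database data.
-- CUSTOMERS, CUSTOMER_ID, and CHURNED are placeholder names;
-- with no settings table supplied, Oracle Data Mining uses its defaults.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CHURN_MODEL',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'CUSTOMERS',
    case_id_column_name => 'CUSTOMER_ID',
    target_column_name  => 'CHURNED');
END;
/
```

No sampling, no extracts, no data movement -- the algorithm runs where the full dataset already lives.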
The term "deep learning" is used in conjunction with neural networks. Starting with Oracle Database 18c, a Neural Network machine-learning model was introduced. You can use the model to label new data -- something typically required in your front-end applications, web-based applications, or back-end and batch processes. Any programming language or framework that can call SQL can now run the in-database Neural Network model with a simple SQL command.
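As a sketch, scoring new rows with such a model from plain SQL could look like this (the model and table names are hypothetical; the neural network algorithm itself is chosen in the model's settings when it is created):

```sql
-- Apply an in-database model to new rows, from any SQL-capable client.
-- CHURN_NN and NEW_CUSTOMERS are placeholder names.
SELECT customer_id,
       PREDICTION(churn_nn USING *)             AS predicted_label,
       PREDICTION_PROBABILITY(churn_nn USING *) AS confidence
FROM   new_customers;
```

Because it is just a SELECT statement, any application that can issue SQL -- a web front end, a batch job, a BI tool -- gets model scoring for free.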
One interesting experiment in the database/big data world right now is the use of innovative hardware accelerators such as graphics processing units (GPUs). They run SQL queries on multi-billion-row data sets and enable machine-learning algorithms to run directly on data within the database. These GPU databases, even if they mature and become available on Oracle Cloud Infrastructure, will only complement Oracle in-database analytics. Just like their slower yet similar predecessors (such as Amazon Redshift and Greenplum), they do not scale as OLTP database engines.
As a multi-model database, Oracle in-database machine learning can support the majority of use cases in every industry. With Hybrid Transaction/Analytical Processing (HTAP) capabilities in the same database, Oracle database can analyze data "on the spot.”
Oracle database, to this date, is still the ONLY database that excels at handling high-volume OLTP transactions while also offering an in-database analytics platform. Only the self-driving Oracle Autonomous Database can bring machine learning as close to the core business workflow as possible.
Now that Oracle has made it a breeze to create a production-ready autonomous database and run SQL queries with minimal user interaction, there is really no reason not to get started with Oracle Advanced Analytics.
To learn more, check out the Oracle documentation chapters on the machine-learning models that are already available as SQL functions in Oracle Database.
Even if your company is not ready to try in-database machine learning, Oracle database can jumpstart successful data science and machine learning projects.
Oracle is known to be far superior as an OLTP database. It was also positioned highest for ability to execute and completeness of vision in Gartner's 2019 "Magic Quadrant for Data Management Solutions for Analytics" report. Compared with the 2018 report, Oracle is expanding its lead over Microsoft and Amazon Web Services on Gartner's vertical axis.
This unique and mature platform is not easily imitated. Without rewriting their database engines from scratch, other platforms will find it very hard to handle high-volume OLTP transactions because of architectural defects such as auto-vacuum or "readers block writers, writers block readers" locking.
After several years of the "NoSQL revolution," more and more enterprise users and startups have realized that SQL databases hold a major advantage over NoSQL databases in many areas. A SQL database that is self-driving, self-securing, and self-repairing, as well as highly scalable both vertically and horizontally, is critical to any data science project.
Here are some reasons why the "Ready to Work" Oracle databases lead to successful data science projects: you can simplify data management, correlate different data sources for deeper insights, and enable your existing teams to innovate through new projects.
Andrew Ng maintains that you’re not really ready to be an AI company until you make strategic data acquisitions and gather all your data into a centralized data warehouse. Bring it on then – your AI company already has Oracle databases. Even better, they are becoming autonomous.
Even if Oracle’s Autonomous Database is not the best fit for your company’s data, there are plenty of Oracle offerings that can better fit your unique needs.
Most Oracle-based applications can live with millisecond response times. However, real-time applications require microsecond query response times, which is why Oracle acquired TimesTen, an in-memory database company, in 2005. While working with customers, I achieved query performance close to single-digit microseconds -- something unthinkable even with the latest release of Oracle database.
There was quite some excitement in the Oracle customer base that year, and I was called into numerous customer sessions about TimesTen use cases. In 2006, there were still not many well-known solutions as scalable as Oracle database in ANY use case, even in the startup world.
However, one customer's request stumped me. He wanted to embed Oracle database in an application server to store session-state information, with the data partitioned and session state replicated between the application servers.
Oracle database's footprint was too big for this use case; it would have been overkill as a local store for session state. TimesTen was a possible solution, but this was a new use case that needed investigation.
While I was still doing research, Oracle announced its acquisition of Sleepycat, the company behind an alternative NoSQL database, Berkeley DB. The software is embedded in open-source products such as the Linux and BSD Unix operating systems, the Apache web server, the OpenLDAP directory, and the OpenOffice application suite. Perfect!
As a matter of fact, there are more and more products in Oracle's ecosystem that are great for a specific use case where Oracle's self-tuning, highly available database might be overkill.
Over the years, I have tried to use Oracle database as a scalable solution for both structured and unstructured data. Although Oracle is a multi-model database, my experience tells me that it is not the best fit for every scenario -- but it should be the centerpiece of any enterprise data science effort.
There are many other Oracle solutions available. MySQL is designed for the web; it scales well and complements the Oracle RDBMS nicely. Oracle NoSQL Database was an early darling of many Silicon Valley startups.
Oracle is one of the largest producers of open-source software in the world, developing and providing contributions for projects including Apache NetBeans, Berkeley DB, Eclipse Jakarta, GraalVM, Kubernetes, Linux, MySQL, OpenJDK, PHP, VirtualBox, and Xen. Oracle Cloud Infrastructure’s core services are built on open-source technologies to support workloads for cloud native applications, data streams, eventing, and data transformation and processing.
The upcoming Data Science Cloud will feature open-source options. Built-in, cloud-hosted Jupyter notebooks enable teams to build and train models using Python. Popular open-source visualization tools like Plotly, Matplotlib, and Bokeh allow you to visualize and explore data. You can launch environments with popular machine learning frameworks like TensorFlow and scikit-learn, or easily install any other open-source package.
This solution will also use native support for most popular version control providers (Github, Gitlab, and Bitbucket) to ensure that all work is synced across the platform. Tight integration with OCI and Oracle Big Data Platform provides data scientists with self-service access to scalable compute, so that they can get to work quickly.
On Oracle Data Science Cloud, you can deploy a function or model as an API to encapsulate data science work and expose it to other team members and applications.
Looking for scalability and security? Spin up containers on Oracle Cloud Infrastructure to tackle analyses of any size leveraging Kubernetes. You can use secure credentials to safely access data in Oracle Object Storage Cloud.
Oracle Data Catalog enables users to discover, organize, enrich, and trace data assets from multiple sources. You can put your data to work to create a unified data warehouse, including "ready to go" data sources from cloud SaaS databases.
For analytic workloads over large and complex datasets, SNAP, an Apache Spark™-native business intelligence platform, brings sub-second query response times and can be deployed over enterprise data warehouses, data lakes, and IoT analytics.
Although Oracle database is the world's only real autonomous database for OLTP and analytics workloads, new cloud services are introduced all the time that give you the right hammer for your nail. This way, the focus is not on the tools you use but on how you can drive business innovation and revenue growth -- and that is PERFECT!