Amid mounting digital transformation pressures, the top two categories for increased IT spending in 2019 are cloud computing and on-premises data analytics, according to a recent Morgan Stanley survey of North American CIOs.
Why analytics? IT leaders in large organizations are investing heavily to extract more value from information they already have. For most of them, their critical applications aren’t in the cloud yet. And—for the time being—their center of gravity remains in the data center.
For most organizations, the process of gaining insight from data has long been the same: Extract data from one or more sources, Transform it to be suitable for analytics, and then Load it into a separate environment for analytic processing (ETL).
Every year, though, there’s more to process—more data, more sources, and more data types. And the time-tested ETL approach is starting to show limitations.
Even at modest scale, the ETL process can be complex, time-consuming, and costly. It’s also slow: Your people will always be looking at a snapshot of what was instead of what’s happening right now. ETL can also increase security risks, because potentially sensitive data is stored and used in more places.
Modern Databases: A Better Way?
These limitations raise an important question: Is there a way to minimize the cost, complexity, and risk of ETL, yet still extract value from data? The answer is yes.
Modern databases can flip the ETL model by running powerful analytics right inside the database itself, producing high-level results without the need to extract, transform, and load data into a separate environment. The result can be powerful insights, delivered faster, more cost-effectively, and more securely.
Many are surprised to find that modern databases have hundreds of powerful analytical functions built in, from simple statistics to modern machine learning. You can also write your own code in R or Python, if you prefer.
Answers arrive much faster, thanks to in-memory processing, optimized algorithms, and the elimination of the delays inherent in ETL. Organizations can safely use the same database for transactions and analytics at the same time, which removes the need to extract and load and provides real-time insights.
Any required data transformations can usually be done in-place via a separate database view, without affecting the underlying data. And all database security, auditing, and compliance protections remain in place at all times. The results can be transformative.
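To make the view idea concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for a production system, and the `orders` table, the `orders_clean` view, and the sample rows are all hypothetical. The cleanup rules live in a view, the stored rows are never modified, and only a small summary leaves the database.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (             -- hypothetical table
        customer_id INTEGER,
        amount      TEXT              -- stored as messy text on purpose
    );
    INSERT INTO orders VALUES
        (1, ' 42 '),
        (1, '15'),
        (2, ''),
        (2, '30');

    -- The "transformation" is only a view: trim, cast, and skip blanks.
    CREATE VIEW orders_clean AS
    SELECT customer_id,
           CAST(TRIM(amount) AS INTEGER) AS amount
    FROM orders
    WHERE TRIM(amount) <> '';
""")

# The aggregate runs where the data lives; only the summary is returned.
summary = conn.execute("""
    SELECT customer_id, COUNT(*) AS order_count, AVG(amount) AS avg_amount
    FROM orders_clean
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

# The underlying table still holds every original row, blanks included.
raw_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(summary)
print(raw_rows)
```

Because the view is evaluated at query time, a change to the cleanup rules takes effect on the next query, with no reload step.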
A Healthcare Example
One problem for emergency rooms in the US is identifying “frequent flyers,” patients who often visit emergency rooms for reasons other than the need for urgent medical care.
How useful would a model be that could predict the probability that a visiting patient fits that profile and might benefit from another course of treatment? An ideal model would look at factors beyond the frequency of ER visits and their outcomes.
Using a conventional ETL approach would require setting up a separate analytics environment, extracting all relevant data, masking and redacting sensitive information, then cleansing and transforming the data to be usable. You’d next begin the long process of building, training, and testing different machine learning models. Once you had something that worked, you’d have to find a way to get your model into production. And every time you wanted to retrain your model with current data, you’d have to do it all over again.
Let’s compare that approach with a modern in-database approach. Assuming all of the relevant data is in a modern database, the data would stay in place. Automatic data masking and redaction would identify sensitive bits and protect them with a full audit trail for compliance purposes. The rules for cleaning and transforming data could simply be an additional database view of existing data, with no separate processing required.
The database could help build your model, if you choose. It could automatically identify the most useful predictive features hidden among all the less useful ones. It would then automatically select the model that produces the best predictive results. And, finally, it would automatically tune the model parameters to make even better predictions.
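The three automated steps can be illustrated with a deliberately tiny, plain-Python sketch. Everything in it is an assumption made for illustration: the rows are made-up toy data, the feature names are hypothetical, and the "models" are single-threshold rules. It is not how any particular database implements these features, only the general shape of the idea: rank features, tune each candidate, then keep whichever does best on held-out rows.

```python
from statistics import mean

# Made-up toy rows: (er_visits_last_year, age, distance_km, is_frequent_flyer)
ROWS = [
    (1, 34, 5.0, 0), (2, 61, 2.5, 0), (9, 45, 1.0, 1), (0, 29, 12.0, 0),
    (7, 52, 3.0, 1), (3, 70, 8.0, 0), (11, 38, 2.0, 1), (1, 55, 6.5, 0),
    (8, 41, 4.0, 1), (2, 47, 9.0, 0), (10, 63, 1.5, 1), (0, 36, 7.0, 0),
]
FEATURES = ["er_visits_last_year", "age", "distance_km"]  # hypothetical names


def pearson(xs, ys):
    """Pearson correlation, or 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def accuracy(rows, col, threshold, sign):
    """sign=+1 predicts 1 above the threshold; sign=-1 predicts 1 below it."""
    hits = sum(int(sign * (r[col] - threshold) > 0) == r[-1] for r in rows)
    return hits / len(rows)


train, valid = ROWS[:8], ROWS[8:]
labels = [r[-1] for r in train]

# 1. Feature selection: rank columns by |correlation| with the label, keep two.
ranked = sorted(
    range(len(FEATURES)),
    key=lambda c: abs(pearson([r[c] for r in train], labels)),
    reverse=True,
)
top = ranked[:2]

# 2. Tuning: for each kept feature, pick the cut point and direction
#    that fit the training rows best.
candidates = []
for col in top:
    values = sorted({r[col] for r in train})
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    _, t, s = max((accuracy(train, col, t, s), t, s) for t in cuts for s in (1, -1))
    candidates.append((col, t, s))

# 3. Model selection: keep the candidate that scores best on held-out rows.
col, threshold, sign = max(candidates, key=lambda m: accuracy(valid, *m))
best_feature = FEATURES[col]
valid_accuracy = accuracy(valid, col, threshold, sign)

print(best_feature, threshold, valid_accuracy)
```

A real system would search far richer model families and use proper cross-validation, but the division of labor is the same.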
The finished model could be stored as a procedure in the database, making it easy to put into production: simply invoke it as needed. When the time comes to retrain your model, the process could take a few hours instead of weeks or longer.
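As a rough illustration of invoking a model from inside the database, the sketch below registers a scoring function with SQLite and calls it from plain SQL. The assumptions are worth stating plainly: SQLite has no stored procedures, so a user-defined function stands in for one; the `patients` table is hypothetical; and the coefficients are hand-picked placeholders, not a trained or clinically meaningful model.

```python
import math
import sqlite3

# Hypothetical, hand-picked coefficients standing in for a trained model.
INTERCEPT, W_VISITS = -5.0, 1.0


def frequent_flyer_score(er_visits_last_year):
    """Logistic score between 0 and 1; illustrative only."""
    z = INTERCEPT + W_VISITS * er_visits_last_year
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")

# Register the model under a SQL-callable name (a stand-in for a
# stored procedure in a full-featured database).
conn.create_function("frequent_flyer_score", 1, frequent_flyer_score)

conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER, er_visits_last_year INTEGER);
    INSERT INTO patients VALUES (1, 1), (2, 9), (3, 5);
""")

# "Production" use is an ordinary query: the score is computed per row,
# next to the data, and only the result set is returned.
scores = conn.execute("""
    SELECT patient_id,
           ROUND(frequent_flyer_score(er_visits_last_year), 3) AS score
    FROM patients
    ORDER BY patient_id
""").fetchall()

for patient_id, score in scores:
    print(patient_id, score)
```

Retraining in this setup amounts to replacing the registered function; the queries that call it do not change.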
Sometimes speedy decision-making is of the essence. For example, wireless communications providers work hard to combat fraudulent cell phone calls. The best time to spot fraud is as the call is being placed, not after the fact. Several wireless providers use in-memory data processing to run models that identify potentially fraudulent calls before they’re connected. Credit card providers are doing much the same to identify potentially fraudulent transactions.
Old Habits Die Hard
Tell most data analytics practitioners that much of their ETL effort might not be the best way to do things, and they’ll react with disbelief. After all, that’s the way they’ve always operated.
However, once they see the inherent advantages of bringing the algorithms to where the data lives, instead of bringing the data to the processing, their perspective changes quickly.
Indeed, as business pressures grow to extract ever more value from existing data, the appeal of using modern databases to deliver advanced analytical insights will only grow.