Welcome to the second post and the first step in our “Oracle Data Professional to Oracle Data Scientist” series. Last time, we covered how Oracle Autonomous Database is helping to change the DBA’s role and how DBAs and Oracle data professionals can make the transition to Oracle data scientist in six steps. If you haven’t read the previous blog post, I highly recommend doing so.
This time, we’re going to jump into the first step to becoming an Oracle data scientist: Business Understanding.
You can now also watch the Oracle Machine Learning Overview: from Oracle Data Professional to Oracle Data Scientist in 6 Weeks! on YouTube.
Why Business Understanding Is So Important
The biggest mistake data analysts make in applying machine learning algorithms is to minimize the importance of this critical step and jump to poorly formed problem statements. Forty percent of your mental energy on your machine learning project should be spent making sure that you have properly thought through the data science project’s goal and that everyone buys in.
Instead, most machine learning “rookies” too quickly accept goals from their managers that sound good, like “identify employees that leave” or “target our ‘best’ customers.” Later, these poorly formed business problem statements steer data science projects down time-wasting, unproductive paths. Managers become frustrated with the lack of useful results or, worse, take action on poorly defined business problem statements and the associated machine learning results. This waste of time and resources fools organizations into believing that machine learning is to blame, rather than realizing that they rushed into a data science project before fully thinking it through.
The Business Problem Statement
The key to creating business understanding is to start with a well-defined business problem statement.
These are the types of business problems you may be looking to solve:
On the surface, these seem like fine problems to investigate, but they’re way too general to base a machine learning project on. Instead, we need to take a note from Albert Einstein.
“If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and five minutes thinking about solutions.”
― Albert Einstein
Einstein knew that the more you think about the problem you’re trying to solve, the more specific you can be in your approach to solving that problem. The more specific you are in your approach, the more likely you are to get the results you’re looking for instead of chasing a bunch of irrelevant data.
For example, “identify employees who leave our organization” seems like a good business problem statement, but it isn’t specific enough. The easiest way to identify employees that will leave is to see which employees have submitted their resignation notices. But if you’re interested in employee retention, that’s not the data you’re really looking for.
You probably want to, say, identify the signs that an employee is likely to voluntarily leave three months before they submit their resignation notice, when the business can still do something about it.
But even that level of detail isn’t good enough. You’d still need to figure out what “voluntarily leave” means. Are you including people who retire, take family leave, or switch to part time?
Do you only require a prediction and a probability for each prediction, or do you also need to understand the model’s reasons for each prediction? A machine learning model that can predict an employee’s voluntary attrition three months ahead requires that you feed the algorithms data about both employees who leave and employees who stay.
A well-defined business problem statement for this case might be: Identify employees who will voluntarily attrite three months before they leave and provide reasons why. A variant of this could be to identify high-value employees who are likely to leave, three months before they leave, and provide reasons why. But that is either another model or a combination of a likely high-performer model with our voluntary attrition early warning model.
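To make that concrete, here is a minimal sketch of how such a problem statement could turn into a target attribute. Everything in it is an assumption for illustration: the exit reasons, the choice to count only resignations as “voluntary,” the 90-day approximation of “three months,” and the sample employees are all hypothetical, and your business definition may differ.

```python
from datetime import date, timedelta

# Hypothetical: which exit reasons count as "voluntary" is a business decision
# (retirement, family leave, and so on may or may not belong here).
VOLUNTARY_REASONS = {"resigned"}
HORIZON = timedelta(days=90)  # "three months" approximated as 90 days


def attrition_label(snapshot_date, exit_date, exit_reason):
    """Return 1 if the employee voluntarily leaves within HORIZON of the snapshot, else 0."""
    if exit_date is None or exit_reason not in VOLUNTARY_REASONS:
        return 0
    return 1 if snapshot_date < exit_date <= snapshot_date + HORIZON else 0


snapshot = date(2019, 1, 1)
employees = [
    ("E1", date(2019, 3, 15), "resigned"),  # leaves within 90 days
    ("E2", date(2019, 9, 1), "resigned"),   # leaves, but later than 90 days
    ("E3", date(2019, 2, 1), "retired"),    # excluded by this sketch's definition
    ("E4", None, None),                     # still employed
]

labels = {emp_id: attrition_label(snapshot, d, r) for emp_id, d, r in employees}
print(labels)
```

Notice that changing one line, the set of voluntary reasons, changes who counts as a positive example. That is exactly why the definition has to be agreed on before any model is built.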
How to Properly Define a Business Problem Statement
Thinking through the business problem statement in detail is what separates successful machine learning projects from those that fail and end in frustration.
The king of all ill-defined business problem statements is: “I’ve got all this data. Can you mine it for useful insights?”
This is a trap.
Just as in golf, where a poorly placed shot can leave you struggling in a sand trap, a poorly formed problem statement can start you on the wrong path, wasting time, resources, and management’s patience in the process.
For example, some time ago I supported a large consumer packaged goods (CPG) company that sold detergents, soaps, dental items, baby items, deodorants, and personal care items. One of the data analysts at the company was tasked by his manager to “apply clustering algorithms to our data from our retailer stores to find interesting patterns in our products sold.”
You can approach the problem this way, but it’s often a lot of work and unsatisfying when insights don’t just jump out at you. This stems from the difference between supervised and unsupervised machine learning.
Supervised learning is where you have “labeled data” (e.g. 0 or 1; Yes vs. No; A, B, or C; defaulted on loan vs. good credit; etc.) and use that as your “target attribute.” In our previous example, the target would be whether an employee has attrited or is still employed. Typical supervised learning algorithms include classification algorithms such as decision trees, generalized linear models, support vector machines, neural networks, and random forests, but more on this in Week 4.
Unsupervised learning is where you lack labeled data and a target field. It includes algorithms such as clustering, anomaly detection, and associations (or “market basket analysis”). Here you are asking the algorithm to discover hidden patterns, to identify subgroups or subpopulations that share common traits or attribute values, or to find commonly co-occurring items in a “basket.” More on unsupervised learning in Week 4.
We spent much effort accessing, assembling, preparing, joining, and transforming the data (more on this coming soon in our next two blog posts).
While we were reviewing our Oracle Machine Learning (OML) clusters of various retail stores looking to find something “interesting,” the CPG data analyst to whom I was teaching machine learning expressed some disappointment in our results.
“OK, but remember that clustering, or ‘unsupervised learning,’ can sometimes be frustrating because you may not find anything useful, something that you can act upon. What were you hoping to find?” I said.
“My boss wanted to understand the attributes (floor and shelf space allocations, store hours, store demographics, advertising spends, etc.) associated with well-performing and poorly performing retail stores that sell our CPG products,” he replied.
“That’s ‘supervised learning’!” I said.
We recast our problem statement to identify the top 20 percent highest performing stores versus non-high performers, a classification problem. We also set out to understand explanatory “rules” identified by the model.
Luckily, our data preparation efforts had left us with a good starting dataset. We changed our approach to a classification model and used a Classification Build node in OML’s Oracle Data Miner, an extension to SQL Developer. Five minutes later, we were reviewing Oracle Machine Learning’s Decision Tree results. Immediately, he found value in the patterns the machine learning identified for different retail store segments in both the well-performing and poorly performing stores.
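The recasting step, turning “find interesting patterns” into “top 20 percent of stores versus the rest,” is really just the creation of a target attribute. Here is a small sketch of that idea. It is not what we ran in Oracle Data Miner; the store IDs, sales figures, and function name are hypothetical, and ranking by a single sales number is an assumption made for illustration.

```python
def label_top_performers(sales_by_store, top_fraction=0.20):
    """Label the top `top_fraction` of stores by sales as 1 and all others as 0."""
    # Rank store IDs from highest to lowest sales.
    ranked = sorted(sales_by_store, key=sales_by_store.get, reverse=True)
    # Always flag at least one store, even for very small inputs.
    n_top = max(1, round(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    return {store: int(store in top) for store in sales_by_store}


# Hypothetical sales figures for ten stores.
store_sales = {
    "S01": 120, "S02": 95, "S03": 310, "S04": 80, "S05": 150,
    "S06": 60, "S07": 275, "S08": 110, "S09": 90, "S10": 130,
}

store_labels = label_top_performers(store_sales)
print(store_labels)
```

Once every store carries a 0 or 1, attributes such as shelf space, store hours, and advertising spend become inputs to a classification model instead of raw material for open-ended clustering.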
My point is that, like Albert Einstein, you need to think seriously through the entire scenario and develop a well-defined business problem statement—not pick which “algorithm” you want to use.
Mapping Your Problem Statement to Your Machine Learning Algorithm
Before we get too far along, it’s important to have a basic understanding of what machine learning is and the basic types of machine learning functions or “algorithms” available.
Here’s a slide I use to map common goals with the name for the machine learning algorithm. They’re grouped into supervised and unsupervised learning.
Let us now consider some poorly defined problem statements, see how we could define them more explicitly, and then understand how that naturally maps into specific machine learning functions, including classification, regression, anomaly detection, clustering, and association rules, to mention the more common ones.
I know this may still seem like a lot to understand, but as with learning any new skill, practice and additional reading help. Here is one of my favorite introductory books on machine learning (or data mining as we called it back in the early 2000s): Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Gordon S. Linoff and Michael J. A. Berry. It is chock full of examples that will help you learn where and how to apply machine learning algorithms.
Predicting Customer Purchases
Now, let’s walk through an example together. Recently, I was pulled into a customer proof of concept (POC) where the customer’s goal was to “Analyze this sales data and predict which customers are going to buy what, when.” It sounds great and like a logical problem statement for machine learning algorithms, but can machine learning really live up to these high expectations? OML’s Association Rules, aka “market basket analysis,” quickly sifts through point of sale (POS) transactional data to discover patterns in baskets and “rules.”
Example Market Basket Rule:
Customers who purchase items A + D also buy Item F with 88 percent confidence.
This rule is found in 10 percent of your total customers’ baskets.
So now, once OML has found customers who have purchased items A + D, we can recommend Item F.
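If the terms are new to you, the two numbers in a rule like this are usually called confidence (of the baskets containing A and D, what fraction also contain F?) and support (what fraction of all baskets contain A, D, and F together?). The sketch below computes both on a made-up set of ten baskets. The baskets and the function name are hypothetical, the resulting numbers will not match the 88 percent and 10 percent above, and this is plain Python counting, not how OML implements Association Rules.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every antecedent item.
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    # Baskets that contain every antecedent item and every consequent item.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical point-of-sale baskets.
baskets = [
    {"A", "D", "F"},
    {"A", "D", "F"},
    {"A", "B", "D", "F"},
    {"A", "D"},
    {"A", "B"},
    {"B", "C"},
    {"C", "F"},
    {"D", "E"},
    {"E"},
    {"B", "F"},
]

support, confidence = rule_metrics(baskets, {"A", "D"}, {"F"})
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Note that nothing in these counts involves time. The rule says what tends to be bought together, not when.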
But as far as when a customer will buy Item F, we can’t quite tell yet. Predicting when items are purchased typically amounts to forecasting sales over multiple future time periods and is performed at aggregate levels (not as individual predictions for a specific customer).
Alternatively, this problem statement could possibly be approached as multiple classification prediction problems. For example, these multiple business problem statements:
What is the likelihood a customer will purchase Item F in 30 days?
… in 60 days?
… in 90 days?
To make a proper prediction for all of the products sold, we would have to build a classification model for each product and for each of our time windows (30, 60, and 90 days). This complexity may not be what the manager envisioned, but with some work we could do it.
So, in conclusion, hopefully, we see the folly of rushing too quickly into machine learning projects with poorly formed business problems and the value of teamwork from all stakeholders. Collaborations between business managers, Oracle data professionals, and data scientists always yield better results. Delegating to any one individual (e.g. the data scientist or database developer) and expecting that individual to have all the knowledge required to formulate a well-defined business problem, assemble the “right data,” and then build and deploy machine learning models is simply expecting too much. It leaves gaps in assumptions and opens opportunities for failure.
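To get a feel for that complexity, here is a rough sketch of the bookkeeping involved: one target per product per time window. The product list, dates, and function name are hypothetical, and a real project would compute these targets from transaction history in the database.

```python
from datetime import date, timedelta
from itertools import product

WINDOWS_DAYS = (30, 60, 90)


def bought_within(purchase_dates, as_of, window_days):
    """Return 1 if any purchase falls within `window_days` after `as_of`, else 0."""
    end = as_of + timedelta(days=window_days)
    return int(any(as_of < d <= end for d in purchase_dates))


# Hypothetical three-item catalog: one model per (product, window) pair.
products = ["Item A", "Item D", "Item F"]
model_specs = list(product(products, WINDOWS_DAYS))
print(len(model_specs), "models to build")

# One customer's Item F purchases, 45 days after the reference date.
as_of = date(2019, 1, 1)
item_f_purchases = [date(2019, 2, 15)]
targets = {w: bought_within(item_f_purchases, as_of, w) for w in WINDOWS_DAYS}
print(targets)
```

Even this toy catalog of three products needs nine models. A real catalog with thousands of items is where the conversation with the manager about scope needs to happen.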
Working together to reach consensus on the explicit problem statement to be tackled using our machine learning “algorithms” is the single most important step you will take in this journey. After that, you can start looking at assembling the right data and deriving new “engineered features” that tease more information from the data to make it easier for the algorithms, but we’re getting ahead of ourselves!
Hope this helps! Good luck!
Editor: Amanda O’Callaghan