If you've ever trained a machine learning model that worked well in one context, only to see it perform poorly when given new data, chances are you've experienced some form of bias in your training data. Data bias can lead to machine learning models that are at best inaccurate and, more importantly, unfair or even harmful. Imagine a healthcare scenario where a model predicts how a patient will respond to a particular treatment. An under-represented subgroup of patients may be predicted to respond well when that prediction actually reflects the larger population rather than the subgroup.
To help our customers identify possible bias in data early in the machine learning lifecycle, and thereby avoid degrading model quality or inadvertently putting certain groups at a disadvantage, Oracle Machine Learning (OML) Services now offers the Data Bias Detector through a REST API.
Proactively addressing data bias has multiple benefits for your company. Companies concerned with regulatory compliance know that AI/ML models are under increasing scrutiny and regulation, so detecting and mitigating potential issues helps address relevant requirements and guidelines. Understanding bias can lead to more accurate and reliable decision-making, better customer experiences, and more effective marketing. Prioritizing bias detection and mitigation can also be a competitive differentiator.
Bias can exist in both datasets and machine learning models. In the data preparation step, the data may not adequately represent the population from which it was drawn, or it may contain biased labels that were incorrectly or subjectively assigned due to human error or social stereotypes. Predictions from machine learning models trained on biased data reflect that bias.
To prevent bias from being propagated or amplified in the later stages of model building, it is desirable to minimize or correct data bias during the data preparation step of the machine learning process. Reweighing is a popular, lightweight bias mitigation method. Some machine learning packages accept row (or sample) weights as a training parameter when building models. For example, in Oracle's DBMS_DATA_MINING package, users can set ODMS_ROW_WEIGHT_COLUMN_NAME in the model's global settings when training a generalized linear model (GLM). For classification algorithms that do not incorporate row weights in the training process, the weighting matrices can serve as guidance for data resampling.
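As a minimal sketch of how such weights can be derived, assuming a pandas DataFrame with hypothetical sensitive-attribute and binary-label columns (this illustrates the standard reweighing technique, not OML's internal implementation), each row's weight is the probability of its group-label combination expected under independence divided by the probability actually observed:

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Per-row reweighing weights: expected probability of each
    (group, label) combination under independence divided by the
    observed probability, so that group and label become independent
    in the weighted data."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)      # P(group)
    p_label = df[label_col].value_counts(normalize=True)      # P(label)
    p_joint = df.groupby([group_col, label_col]).size() / n   # P(group, label)

    def weight(row):
        expected = p_group[row[group_col]] * p_label[row[label_col]]
        observed = p_joint[(row[group_col], row[label_col])]
        return expected / observed

    return df.apply(weight, axis=1)

# Hypothetical usage: store the weights in a column of the training table;
# that column could then be named in ODMS_ROW_WEIGHT_COLUMN_NAME when
# training an in-database GLM.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "hired":  [0,   0,   1,   1,   1,   1,   0,   1],
})
df["row_weight"] = reweighing_weights(df, "gender", "hired")
print(df)
```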
Bias can result from a range of factors, including how data is selected, measured, and labeled. For example, several types of bias can be introduced simply through how we select data.
The OML Services Data Bias Detector provides REST endpoints for creating bias detector jobs. The Data Bias Detector calculates metrics for three common types of data bias: class imbalance (CI), statistical parity (SP), and conditional demographic disparity (CDD). Focusing on these three metrics simplifies the user experience while providing valuable insight into data bias. As you might expect, the Data Bias Detector provides insight and recommendations rather than prescribing specific actions. Let's explore each of these metrics in more detail.
Class imbalance occurs when you have many examples of certain types of objects and too few of others, yet you're trying to predict all of them with similar accuracy. As you might expect, models are less likely to accurately classify underrepresented objects. The class imbalance (CI) metric provides a simple way to understand potential issues; for example, consider a dataset containing two groups, one much larger than the other, as in the sketch below.
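As a hedged illustration (the normalized size difference below is a common formulation of CI; the exact formula OML Services uses may differ), the metric compares the sizes of the two groups:

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """Normalized difference in group sizes, ranging from -1 to 1.
    Values near 0 indicate balanced groups; values near +1 or -1
    indicate that one group dominates the dataset."""
    return (n_a - n_d) / (n_a + n_d)

# Example: 900 rows from group A and 100 rows from group D
print(class_imbalance(900, 100))   # 0.8 -- group A is heavily overrepresented
```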
To address class imbalance, you might acquire more examples of the underrepresented class, or perform stratified sampling to balance the data from each group.
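For instance, a minimal sketch of balancing groups by downsampling, assuming a pandas DataFrame with a hypothetical group column (in practice, stratified sampling or acquiring more data may be preferable to discarding rows):

```python
import pandas as pd

def balance_groups(df: pd.DataFrame, group_col: str, seed: int = 42) -> pd.DataFrame:
    """Downsample every group to the size of the smallest group."""
    smallest = df[group_col].value_counts().min()
    return df.groupby(group_col).sample(n=smallest, random_state=seed)
```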
Statistical parity involves evaluating whether the distribution of positive predicted outcomes matches the distribution of the underlying groups. Think about hiring: statistical parity means that the people hired from particular groups, such as gender or ethnicity groups, are represented in proportion to their share of the data. Statistical parity is also referred to as independence, group fairness, demographic parity, and disparate impact.
Like CI, the statistical parity (SP) metric provides a simple way to quantify potential issues.
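As a hedged sketch, a common way to measure SP is the difference in positive-outcome rates between two groups (OML Services' exact definition may differ):

```python
import pandas as pd

def statistical_parity_difference(df: pd.DataFrame, group_col: str, outcome_col: str,
                                  advantaged: str, disadvantaged: str) -> float:
    """Difference in positive-outcome rates between two groups.
    A value of 0 means both groups receive positive outcomes at the
    same rate; larger magnitudes indicate larger disparities."""
    rate_a = df.loc[df[group_col] == advantaged, outcome_col].mean()
    rate_d = df.loc[df[group_col] == disadvantaged, outcome_col].mean()
    return rate_a - rate_d

# Hypothetical hiring data: 1 = hired, 0 = not hired
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "hired":  [0,   0,   1,   0,   1,   1,   0,   1],
})
print(statistical_parity_difference(df, "gender", "hired", "M", "F"))  # 0.5
```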
Even though SP indicates certain disparities, it doesn’t prescribe a recommended action since there may be other factors that help explain any disparities.
Conditional demographic disparity helps you identify hidden biases and assess fairness across different subgroups within a larger group. Picture a trend that appears in subgroups of your data but disappears or reverses when the subgroups are combined.
This is known as Simpson's paradox, and CDD helps you discover this.
There are some classic cases of Simpson's paradox. For example, consider the UC Berkeley gender bias case from 1973, where male applicants had a higher acceptance rate than female applicants. At the department level, however, no single department was significantly biased against women; on the contrary, many departments favored women. The overall disparity arose because women tended to apply to more competitive departments with lower acceptance rates.
Interpreting the CDD metric is similarly simple; the sketch below shows one way it can be computed.
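As a hedged sketch (this follows a common formulation in which per-subgroup demographic disparity is averaged, weighted by subgroup size; OML Services' exact formula may differ), a hypothetical Berkeley-style example shows how conditioning on department can reverse the apparent disparity:

```python
import pandas as pd

def demographic_disparity(df, group_col, outcome_col, disadvantaged):
    """Share of rejections going to the disadvantaged group minus its share
    of acceptances. Positive values suggest the group is rejected
    disproportionately often."""
    rejected = df[df[outcome_col] == 0]
    accepted = df[df[outcome_col] == 1]
    return ((rejected[group_col] == disadvantaged).mean()
            - (accepted[group_col] == disadvantaged).mean())

def conditional_demographic_disparity(df, group_col, outcome_col,
                                      disadvantaged, condition_col):
    """Average of per-subgroup demographic disparity, weighted by subgroup size."""
    return sum(
        len(sub) / len(df) *
        demographic_disparity(sub, group_col, outcome_col, disadvantaged)
        for _, sub in df.groupby(condition_col)
    )

# Hypothetical toy data: overall, women are admitted less often, yet within
# each department women are admitted at an equal or higher rate than men.
df = pd.DataFrame({
    "gender":     ["F"] * 10 + ["M"] * 10,
    "department": ["hard"] * 8 + ["easy"] * 2 + ["hard"] * 2 + ["easy"] * 8,
    "admitted":   [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
})
print(demographic_disparity(df, "gender", "admitted", "F"))       # ~0.30: women appear disadvantaged overall
print(conditional_demographic_disparity(df, "gender", "admitted",
                                        "F", "department"))       # ~-0.24: not disadvantaged within departments
```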
The Data Bias Detector provides quantitative measures of data bias to help you detect and mitigate it. Setting a bias threshold is left to the user, since bias assessment depends on the specific data features and each project's objectives.
Try the new Oracle Machine Learning Services Data Bias Detector REST API through Oracle LiveLabs. Check out https://bit.ly/omlfundamentalshol for easy access to OML Services.
Mark Hornick is Senior Director, Machine Learning and AI Product Management. Mark has more than 20 years of experience integrating and leveraging machine learning with Oracle software as well as working with internal and external customers to apply Oracle’s machine learning technologies. Mark is Oracle’s representative to the R Consortium and is an Oracle Adviser and founding member of the Analytics and Data Oracle User Community. He has been issued seven US patents. Mark holds a bachelor’s degree from Rutgers University and a master’s degree from Brown University, both in computer science. Follow him on Twitter and connect on LinkedIn.