The general goal of machine learning is to build models that can learn from data without being explicitly programmed. Among the many subdomains of machine learning, the one that usually gets the most attention is what is known as supervised learning. It is the most accessible, especially for people new to the field, and provides a great introduction to the wider world of machine learning. The 'supervised' in supervised learning refers to the fact that each sample within the data being used to build the system contains an associated label. The goal is to build a model that can accurately predict the value of the label when presented with new data. More formally, if the data set contains features, denoted x, and labels, denoted Y, the supervised learning model takes the form
Y=f(x)
Where the label is assumed to be some general function of the input features. This function is general in the sense that it can be linear or nonlinear, parametric or nonparametric, etc.
Outline
Here's a broad outline of what we're going to cover:

The two main types of supervised learning: regression and classification

How to choose an appropriate model

The general tradeoff between model accuracy and interpretability

A regression example using the Boston Housing dataset

A classification example using the UCI ML Breast Cancer Wisconsin dataset
Regression Versus Classification
Supervised learning problems can be divided into two primary types, regression and classification. In regression problems, the labels are quantitative, or continuous in nature. Examples include:

Income in dollars

Weight in pounds

Distance in miles
In classification problems, the labels are qualitative, or categorical in nature, and can be grouped into two or more classes. Examples include:

Binary labels (Yes/No or 0/1)

Different brands of a product (A, B, C)

The weather on a given day (rainy, sunny, overcast)
In both cases, the features (x's) are different variables that we assume are related to the label in some way. For regression, if the label represents income, the features could be job title, years of experience, location, level of education, etc. For classification, if the label represents whether or not a passenger survived the sinking of the Titanic, the features could be age, gender, cabin class, etc. The exact form of the relationship between the features and label will depend on the type of model used. Regardless of the type of problem, the goal is to predict the value of the labels with an acceptable level of accuracy. The way to measure accuracy depends on whether the problem involves regression or classification, and the definition of an acceptable level of accuracy depends on the specific domain.
Choosing an Appropriate Model
Within the areas of regression and classification, there are a wide variety of models to choose from. Choosing an appropriate model depends on a number of factors, including:

The size of the data, as some models perform better on larger or smaller data sets

The distribution of the data, as some models assume the features within a dataset follow a specific statistical distribution

The relationship between the features and labels (linear or nonlinear, additive or multiplicative, etc.)

The format of the data
 Structured data, such as a comma-delimited text file, and whether the features are quantitative or qualitative
 Unstructured data such as audio, video, or image files

The primary goal of the analysis, which is typically either prediction or inference
Model Accuracy Versus Interpretability
The last bullet hints at an important distinction between different supervised learning models: the general tradeoff between accuracy and interpretability. Here, interpretability refers to the ability to see how a model arrived at a particular answer, or at a higher level, why the model made the decisions it did. This tradeoff can be viewed in terms of the overall flexibility of a model. Models that are less flexible tend to be less accurate, as they assume a somewhat rigid form of f(x), and can only produce a small range of estimates. Most real world phenomena do not follow such an explicit form, and thus the model will not be able to completely capture the underlying relationship between the features and label. However, because they are somewhat rigid in nature, these models provide a higher level of interpretability. Models that are more flexible tend to be more accurate, as they do not make explicit assumptions about the form of f(x), and can fit a wider variety of shapes to the data. Because they are more flexible, however, they often provide a lower level of interpretability.
Since this post is meant to serve as an introduction to supervised learning, our focus will be on interpretability when choosing a suitable model.
Examples Using Scikit-Learn With Python
Now that we have a general idea about what supervised learning is, it's time for some examples to solidify the concepts that have been introduced so far. Both regression and classification examples will be given, both will be done in Python 2.7, and both will use the scikit-learn and pandas packages. Scikit-learn is a free machine learning library that contains all of the functions we'll need for the examples, and pandas provides flexible data structures designed to make working with relational datasets easy. Finally, both examples will use datasets that come bundled with scikit-learn, so there is no need to visit an external source.
Scikit-learn: http://scikit-learn.org/stable/index.html
Pandas: https://pandas.pydata.org/
Regression Example
Our regression example will use the Boston Housing Prices dataset. Our goal is to predict the median price of a house in a suburb of the city given a set of features pertaining to the suburb. Because our goal is interpretability, we'll use linear regression as our model of choice. Despite being one of the oldest supervised learning methods, it is still useful, and quite widely used. In addition, understanding linear regression is essential to understanding more complex models like neural networks.
If we have a label Y and features X1 through Xp, the linear regression model is of the form
Y = β0 + β1X1 + β2X2 + ... + βpXp
Here, the β terms are unknown coefficients that will be determined by our specific data set. As a quick aside, a linear regression model assumes a linear relationship between the label and the coefficients of the features. This distinction is important because it is often wrongly assumed that the linear relationship must be between the label and the features themselves. However, it is perfectly acceptable, and often helpful, to use nonlinear features such as X1X2 or X1², if it improves the model. The resulting model is still linear in the coefficients, and all of the general rules regarding linear regression models apply.
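To make that aside concrete, here is a minimal sketch (toy data, not the Boston set) showing that a model built on the nonlinear features X1X2 and X1² is still fit by ordinary linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: the label depends on an interaction term and a squared term.
np.random.seed(0)
x1 = np.random.uniform(-2, 2, 200)
x2 = np.random.uniform(-2, 2, 200)
y = 3*x1 - 2*x2 + 1.5*x1*x2 + 0.5*x1**2

# Add the nonlinear features as extra columns; the model remains
# linear in its coefficients, so LinearRegression fits it directly.
X = np.column_stack([x1, x2, x1*x2, x1**2])
model = LinearRegression().fit(X, y)
print(model.coef_)  # approximately [ 3.  -2.   1.5  0.5]
```

Because the toy label was built exactly from these four terms, the fitted coefficients recover the generating values.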
Before we import our data, there are two questions we need to address.
How are the β's determined?
The coefficients selected are those that minimize a quantity known as the residual sum of squares, or RSS. If we denote a true label as Y, a predicted label as Ŷ, and have a total of n samples, the RSS is defined as
RSS = (Y1 − Ŷ1)² + (Y2 − Ŷ2)² + ... + (Yn − Ŷn)²
From the above equation, the minimum RSS is clearly achieved when the differences between the true and predicted labels are as small as possible. The selected β values will be those that achieve the smallest overall gap between the true and predicted labels.
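As a quick numeric sketch (toy numbers, not the Boston data), the RSS can be computed directly, and predictions closer to the truth give a smaller value:

```python
import numpy as np

# Toy true labels and two competing sets of predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred_a = np.array([2.5, 5.5, 7.0])  # close to the truth
y_pred_b = np.array([1.0, 8.0, 4.0])  # far from the truth

def rss(y, y_hat):
    # Residual sum of squares: sum of squared prediction errors.
    return np.sum((y - y_hat) ** 2)

print(rss(y_true, y_pred_a))  # 0.5
print(rss(y_true, y_pred_b))  # 22.0
```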
How do we measure the accuracy of our model?
There are many ways to measure the accuracy of a linear regression model. We're going to use what's known as the root mean squared error (RMSE), which is given by the equation
RMSE = √(RSS / n)
The RMSE can be thought of as the square root of the 'average' RSS per sample. One advantage of using RMSE is that it is in the same units as the label. As with RSS, smaller values are better, but there isn't a universal cutoff for what's considered a 'good' value. Such a threshold depends on the specifics of the problem.
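Continuing the toy sketch from above, the RMSE follows directly from the squared errors:

```python
import numpy as np

# Same toy labels and predictions as in the RSS sketch.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 7.0])

# RMSE: square root of the mean squared error, so it is in the
# same units as the label itself.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```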
Boston Housing Data
Now that we've defined our model, let's import the dataset.
from sklearn.datasets import load_boston
boston = load_boston()
Before we start to explore the data, let's turn it into a pandas data frame, which is a table-like data structure with labeled rows and columns. We'll label the columns using the 'feature_names' property of the dataset.
import pandas as pd
boston_data = pd.DataFrame(boston.data, columns=boston.feature_names)
We can use the shape attribute to see the size of the data frame.
boston_data.shape
(506, 13)
The shape lists rows, then columns. The way to interpret this is that each row represents a different suburb in the greater Boston area, and there are 506 suburbs in the dataset. Each column represents a different feature, and there are 13 features for each suburb.
We can look at the first few rows in the data frame using the head() function.
boston_data.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
As noted earlier, there are 13 features for each suburb. Some of the features are:

CRIM - Per capita crime rate by town

INDUS - Proportion of non-retail business acres per town

NOX - Nitric oxides concentration (parts per 10 million)

AGE - Proportion of houses built before 1940

PTRATIO - Pupil-teacher ratio by town
Note that the median price is not one of the features. It is actually stored separately, so let's go ahead and add it to the data set. We can add a new column to our data frame using the syntax below, and note that the price is given in thousands of dollars, so we'll convert it to dollars.
boston_data['PRICE'] = boston.target * 1000
Now that we have all of our data in the data frame, we can view some basic statistics using the describe() function.
boston_data.describe().transpose()

         count          mean          std         min           25%          50%           75%         max
CRIM     506.0      3.593761     8.596783     0.00632      0.082045      0.25651      3.647423     88.9762
ZN       506.0     11.363636    23.322453     0.00000      0.000000      0.00000     12.500000    100.0000
INDUS    506.0     11.136779     6.860353     0.46000      5.190000      9.69000     18.100000     27.7400
CHAS     506.0      0.069170     0.253994     0.00000      0.000000      0.00000      0.000000      1.0000
NOX      506.0      0.554695     0.115878     0.38500      0.449000      0.53800      0.624000      0.8710
RM       506.0      6.284634     0.702617     3.56100      5.885500      6.20850      6.623500      8.7800
AGE      506.0     68.574901    28.148861     2.90000     45.025000     77.50000     94.075000    100.0000
DIS      506.0      3.795043     2.105710     1.12960      2.100175      3.20745      5.188425     12.1265
RAD      506.0      9.549407     8.707259     1.00000      4.000000      5.00000     24.000000     24.0000
TAX      506.0    408.237154   168.537116   187.00000    279.000000    330.00000    666.000000    711.0000
PTRATIO  506.0     18.455534     2.164946    12.60000     17.400000     19.05000     20.200000     22.0000
B        506.0    356.674032    91.294864     0.32000    375.377500    391.44000    396.225000    396.9000
LSTAT    506.0     12.653063     7.141062     1.73000      6.950000     11.36000     16.955000     37.9700
PRICE    506.0  22532.806324  9197.104087  5000.00000  17025.000000  21200.00000  25000.000000  50000.0000
Note that many of the features have different scales. This is important to recognize because many machine learning models are sensitive to the relative scaling of each feature, and it is often necessary to rescale the features to the same range. The most common ways to do this are to normalize each feature so that it ranges from 0 to 1, or standardize each feature so that it has zero mean and a standard deviation of one. For our example, the final result will be the same whether we scale or not, but it will make the coefficients more interpretable if we do.
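As a brief illustration of the two options, scikit-learn provides MinMaxScaler for normalization and StandardScaler for standardization; here is a minimal sketch on a single toy feature (values loosely inspired by TAX):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single toy feature on a large scale.
X = np.array([[187.0], [296.0], [330.0], [666.0], [711.0]])

# Normalization: rescale the feature to the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale to zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel())             # smallest value maps to 0, largest to 1
print(X_std.mean(), X_std.std())  # mean ~0, standard deviation ~1
```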
Training Data Versus Test Data
Before we scale our data, we need to address one of the most important parts of supervised learning. We mentioned earlier that our goal is to predict the median house price using the data set, but we didn't say how we were going to go about doing that. The way we're going to do it is to split our data set into two groups, one for training our model, and one for testing it. It's important to set aside some data for testing because we need to get a sense of how our model will perform on data it has never seen before, which is what it would do if it were used in a real production environment. Because our model has already seen the training data, it would not be a good idea to predict prices using that same data. We would expect the model to perform well, and that would give us an overoptimistic estimate of our model's performance ability. The real test is to use data that is new, and that's the purpose of keeping a separate set of data specifically for testing. We want to keep our test data pristine, so we'll split it away from the training data before we do any scaling.
The first thing to do is split the data back apart into features (X) and labels (y). Then, we can use the 'train_test_split' function from scikit-learn to randomly split our data into training and testing sets. Note that this split should always be random, in case the data is ordered in some way. A common split is to allocate 70-80% for training, and the rest for testing. Also, because the split is random, we are highly likely to generate training and testing sets that both capture the same underlying relationship between the features and labels.
X = boston_data.iloc[:,:-1]
y = boston_data['PRICE']
from sklearn.model_selection import train_test_split
# Split the data into 80% training and 20% testing.
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)
print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape)
Training data size: (404,13)
Testing data size: (102,13)
Scaling the Features
Now we can use the 'StandardScaler' class from scikit-learn to scale the training data so that each feature has a mean of zero and unit standard deviation. We'll apply this same scale to the test data. Note that the test data should never be scaled using its own statistics (think about a scenario where you had to predict the price of a single suburb: how would you scale a single sample?).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
As a check, let's print the mean and standard deviation of the training data.
print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
Training set mean by feature:
[ 5.93584587e-17 4.17707673e-17 7.03507659e-17 4.83661516e-17
1.73678453e-16 3.03387678e-16 3.15479216e-16 0.00000000e+00
9.45338417e-17 4.39692287e-17 3.36364600e-16 3.14379985e-16
2.02258452e-16]
Training set standard deviation by feature:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
As expected, they are essentially equal to zero (to within floating point precision) and one, respectively.
Training Our Model
It's finally time to build our linear regression model using the training data. This is quite simple, and just involves creating a LinearRegression model object and one call to its 'fit' method.
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()
regression_model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Interpreting The Coefficients
It was mentioned above that the linear regression model assumes that the median home price is a linear combination of the various features, with coefficients determined by calling the 'fit' method. The LinearRegression model object stores those values for us, so let's take a look.
intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_, index=boston.feature_names, columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
Intercept = 22452.475248

         Coefficients
CRIM      -797.292486
ZN        1076.530798
INDUS      300.966686
CHAS       694.329854
NOX      -1729.254032
RM        2761.795061
AGE       -403.233095
DIS      -3223.486941
RAD       2720.184752
TAX      -1947.419925
PTRATIO  -1916.685459
B         1092.865681
LSTAT    -3325.234011
There is a lot to be learned by studying these coefficients. First, there's the intercept term (β0), which is equal to the mean home price among all suburbs in the training data set when all of the features are set equal to their mean values (which are all zero after standardization).
Another important detail is the sign of the coefficients. A positive coefficient means that the median home price increases as the corresponding feature increases. On the other hand, a negative coefficient means that the median home price decreases as the corresponding feature increases. As a first order check, let's see if some of these values make sense:

CRIM - An increase in the crime rate corresponds to a decrease in median home price

RM - An increase in the average number of rooms per home corresponds to an increase in median home price

AGE - An increase in the proportion of houses built before 1940 corresponds to a decrease in median home price

RAD - An increase in accessibility to radial highways corresponds to an increase in median home price

PTRATIO - An increase in the pupil-teacher ratio (meaning more students in each class) corresponds to a decrease in median home price
All of these trends make intuitive sense. In addition, because we scaled our data, each coefficient can be interpreted as the average effect on the median price given a one unit increase in the corresponding feature while holding all other features fixed. In that sense, we can see that factors like the number of rooms per home (RM) and access to highways (RAD) have the largest positive effect on median home price, while factors like the distance to local employment centers (DIS) and percent of the population that qualifies as 'lower status' (LSTAT) have the largest negative effect.
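One convenient way to read off this ordering is to sort the coefficients by value. The sketch below uses illustrative values that follow the magnitudes and signs discussed above (hand-entered, not re-fitted here):

```python
import pandas as pd

# Illustrative coefficients following the trends discussed in the text.
coef = pd.Series({
    'CRIM': -797.29, 'ZN': 1076.53, 'INDUS': 300.97, 'CHAS': 694.33,
    'NOX': -1729.25, 'RM': 2761.80, 'AGE': -403.23, 'DIS': -3223.49,
    'RAD': 2720.18, 'TAX': -1947.42, 'PTRATIO': -1916.69,
    'B': 1092.87, 'LSTAT': -3325.23,
})

# Sorting puts the strongest negative effects first and the
# strongest positive effects last.
print(coef.sort_values())
```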
Testing the Model on New Data
Now that we've built our model, we can check its performance on the test data set we set aside earlier. We do that by using the 'predict' function within the LinearRegression model class. In addition, let's compute the RMSE on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built-in function for that as well. In the code below, 'y_pred' contains the predicted home prices, and 'y_test' contains the true values (labels) from the test data set.
from sklearn.metrics import mean_squared_error
import numpy as np
y_pred = regression_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Test RMSE: %f' % test_rmse)
This value means that on average, the error in the predicted median price is approximately $4,800. Given that the home prices range from $5,000 to $50,000, this is a nontrivial difference. Two possible reasons for this difference include:

The relationship between the features and response is not perfectly linear (this is most certainly true).

Some of the features we included were not actually correlated with the median price. Adding additional complexity without improving the model can lead to what is known as overfitting, where the model performs well on the training data but does not generalize well to new data.
There are other potential sources of error, but those have to do with the specific assumptions regarding linear regression models (such a multicollinearity between features, the presence of heteroscedasticity, and the distribution of the residuals between the predicted and actual prices) and are beyond the scope of this tutorial. However, there is one plot we can make which will give us a sense of how well our model fit the data, and that is a plot of the predicted versus actual home prices. The red line has a slope of one, and represents the line where the predicted price would be identical to the actual price.
import matplotlib.pyplot as plt
plt.scatter(y_test,y_pred)
plt.plot([0,50000],[0,50000],'r',lw=2)
plt.xlabel('Actual Price (Dollars)')
plt.ylabel('Predicted Price (Dollars)')
plt.show()
If our model had a test RMSE of zero, we would expect every blue dot to land perfectly on the red line. This is not the case of course, and this plot tells us that our model tends to under predict home prices at the lower and higher ends of the price range, while prices in the middle are somewhat equally distributed above and below the perfect fit line.
Overall, our model did a satisfactory job of predicting the median home price, especially for a first effort. Plus, we learned which features are most influential, and which contribute to an increase or decrease in median home price, which can be just as valuable as being able to predict the price itself.
Classification Example
Our classification example will use the UCI ML Breast Cancer Wisconsin dataset, and our goal is to predict whether a mass is benign or malignant given a set of features based on a digital image of the mass. Because our goal is interpretability, we'll use logistic regression as our model of choice. As was the case with linear regression, despite being one of the older supervised learning methods, it is still useful, and quite widely used.
Given that the name of this model is similar to linear regression, you'd be right to think that there is some similarity between the two. In this case, rather than predicting a quantitative output, we are predicting a qualitative one, specifically whether a mass of cells is benign or malignant. If we designate each one of these labels as a class, with values of 0 for malignant and 1 for benign, what we'd really like is for our model to return the probability of each mass belonging to the benign class. That is, we'd like it to output
p(X) = Pr(Y = 1 | X)
Where the right hand side is the conditional probability that the value of the label is equal to 1 (i.e., benign) given the particular features of the sample.
If we once again have p features, we can try to use the linear regression model from the previous example, in which case we end up with
p(X) = β0 + β1X1 + β2X2 + ... + βpXp
The problem here is that we need our estimates to be valid probabilities (i.e., between 0 and 1), but as we saw, the right hand side outputs continuous values over a wide range. What we need is a function that will always return values between 0 and 1, and the logistic function does just that. The logistic function is defined as
f(x) = 1 / (1 + e^(−x))
A plot of it is shown below:
x = np.linspace(-20,20,100)
y = 1/(1+np.exp(-x))
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('The Logistic Function')
plt.show()
This function is perfect for us, as large negative values get mapped to zero, and large positive values get mapped to one. Given that, the logistic regression model is given as
p(X) = 1 / (1 + e^(−(β0 + β1X1 + ... + βpXp)))
As in the previous example, the β terms are unknown coefficients that will be determined by our specific data set.
The default threshold for classification is p(X) = 0.5, and note that in the graph above, y, or p(X), is 0.5 when x is equal to 0. This means that when the expression inside the exponent of the equation above is greater than 0 (corresponding to p(X) > 0.5), the mass is classified as benign (class 1). When it's less than zero (corresponding to p(X) < 0.5), the mass is classified as malignant (class 0). This default threshold may not always be appropriate, for example in this case you may want to classify masses as malignant using a lower threshold. For this example however, we're going to stick with the default.
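To illustrate how the threshold interacts with the predicted probabilities, here is a small sketch using hypothetical probabilities, as might be returned by scikit-learn's predict_proba method:

```python
import numpy as np

# Hypothetical benign-class probabilities p(X) for five masses,
# e.g., as returned by LogisticRegression.predict_proba(X)[:, 1].
p_benign = np.array([0.92, 0.55, 0.48, 0.10, 0.70])

# Default rule: classify as benign (1) when p(X) > 0.5.
default_pred = (p_benign > 0.5).astype(int)
print(default_pred)  # [1 1 0 0 1]

# A stricter threshold for 'benign' flags borderline masses
# as malignant (0) instead.
strict_pred = (p_benign > 0.75).astype(int)
print(strict_pred)   # [1 0 0 0 0]
```

Raising the threshold trades false negatives (missed malignancies) for false positives, which may be the safer direction in a medical setting.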
Before we import our data, let's address the same two questions we asked in the regression example.
How are the β's determined?
In this case, the coefficients are determined using a method called maximum likelihood. Although the details are beyond the scope of this tutorial, the method works by finding the values of the β's such that the output of the model is close to zero for all malignant class examples, and close to one for all benign class examples. The β's are chosen such that they maximize what is known as the likelihood function.
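Although we won't derive the method, the quantity being maximized can be sketched numerically. Below is a toy example of the Bernoulli log-likelihood under hypothetical predicted probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities p(X) and true classes
# (1 = benign, 0 = malignant) for four samples.
p = np.array([0.9, 0.8, 0.2, 0.1])
y = np.array([1, 1, 0, 0])

def log_likelihood(y, p):
    # Bernoulli log-likelihood: rewards p near 1 for class-1 samples
    # and p near 0 for class-0 samples.
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

good = log_likelihood(y, p)
bad = log_likelihood(y, np.full(4, 0.5))  # uninformative predictions
print(good > bad)  # True: the better probabilities score higher
```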
How do we measure the accuracy of our model?
It was relatively straightforward to determine the accuracy of our model in the regression example. For classification, things get a bit more complicated. There are many different ways to measure the accuracy of a classifier, and which metric to use depends on the specific problem. The most basic approach is to measure the error rate, which is simply the fraction of incorrect classifications
Error rate = (1/n) Σ I(Yi ≠ Ŷi), where the sum runs from i = 1 to n
Here, n again refers to the number of samples in our data set. The indicator function inside the summation counts the number of samples for which the class was incorrectly predicted. Dividing by the number of samples converts this count into a fraction; subtracting that fraction from one gives the accuracy of the classifier.
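A quick numeric sketch of the error rate calculation (toy labels, not our dataset):

```python
import numpy as np

# Toy true classes and predictions for ten samples.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

# (y_true != y_pred) plays the role of the indicator I(Yi != Yhat_i);
# its mean is the error rate.
error_rate = np.mean(y_true != y_pred)
accuracy = 1 - error_rate
print(error_rate, accuracy)  # 0.2 0.8
```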
Another way to measure accuracy requires a more specific definition of correct and incorrect predictions. Consider the following terms:

A true positive classification is one where we correctly predicted that a sample belonged to the positive class (in this case, we'll call the malignant class positive).

A true negative classification is one where we correctly predicted that a sample belonged to the negative class (in this case, we'll call the benign class negative).

A false positive classification is one where we incorrectly predicted that a sample belonged to the positive class (in this case, we said the mass was malignant when it was actually benign).

A false negative classification is one where we incorrectly predicted that a sample belonged to the negative class (in this case, we said the mass was benign when it was actually malignant).
Depending on the problem, you may be more concerned with tracking the number of false positives or false negatives, rather than the overall accuracy. The accuracy metric assumes that true positive and true negative classifications are equally important. In many cases, including fraud detection and cancer diagnoses, false negatives are much more dangerous than false positives.
With these new terms defined, we can compute what is known as the confusion matrix for our classifier. For a binary classifier such as the one we're going to create, the confusion matrix lists the total count of each of the four types of classifications after a set of predictions has been made. From there, a variety of metrics can be calculated depending on the problem.
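Scikit-learn provides a confusion_matrix function for exactly this; here is a small sketch on hypothetical predictions (note that scikit-learn orders rows and columns by label value, 0 then 1):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true classes and predictions (0 = malignant, 1 = benign).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 4]]
```

If we treat malignant (class 0) as the positive class, the top-left entry counts true positives, the top-right false negatives, the bottom-left false positives, and the bottom-right true negatives.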
UCI ML Breast Cancer Wisconsin Data
Now that we've defined our model, let's import the dataset.
from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()
As before, let's turn the data set into a pandas data frame, and label the columns using the 'feature_names' property of the dataset.
bc = pd.DataFrame(breast_cancer_data.data)
bc.columns = breast_cancer_data.feature_names
We can use the shape attribute to see the size of the data frame.
bc.shape
(569, 30)
For this data set, each row represents a different digital image of a mass, and there are 569 total images in the dataset. Each column represents a different feature, and there are 30 features for each mass. We can look at the first few rows in the data frame using the head() function.
bc.head()

   mean radius  mean texture  mean perimeter  mean area  mean smoothness
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter
0                 0.07871  ...         25.38          17.33           184.60
1                 0.05667  ...         24.99          23.41           158.80
2                 0.05999  ...         23.57          25.53           152.50
3                 0.09744  ...         14.91          26.50            98.87
4                 0.05883  ...         22.54          16.67           152.20

   worst area  worst smoothness  worst compactness  worst concavity
0      2019.0            0.1622             0.6656           0.7119
1      1956.0            0.1238             0.1866           0.2416
2      1709.0            0.1444             0.4245           0.4504
3       567.7            0.2098             0.8663           0.6869
4      1575.0            0.1374             0.2050           0.4000

   worst concave points  worst symmetry  worst fractal dimension
0                0.2654          0.4601                  0.11890
1                0.1860          0.2750                  0.08902
2                0.2430          0.3613                  0.08758
3                0.2575          0.6638                  0.17300
4                0.1625          0.2364                  0.07678
5 rows × 30 columns
As noted earlier, there are 30 features for each mass. These features relate to the description of the mass based on the digital image. Some of the features describe characteristics like:

Radius

Texture

Perimeter

Area

Smoothness

Symmetry
As before, the labels (class) are not one of the features. We can add a new column to our data frame using the same method as before.
bc['class'] = breast_cancer_data.target
Let's take a look at the class counts, which correspond to the number of benign and malignant masses. We can do this using the 'value_counts' function within pandas.
pd.value_counts(bc['class'])
1 357
0 212
Name: class, dtype: int64
There are 212 malignant masses (class 0), and 357 benign masses (class 1) in our data set. We need to be careful when calculating our confusion matrix, in terms of which class is considered positive and negative, but we'll address that when the time comes.
Let's look at some basic statistics, using the describe() function as before.
bc.describe().transpose()

                         count        mean         std         min         25%         50%          75%         max
mean radius              569.0   14.127292    3.524049    6.981000   11.700000   13.370000    15.780000    28.11000
mean texture             569.0   19.289649    4.301036    9.710000   16.170000   18.840000    21.800000    39.28000
mean perimeter           569.0   91.969033   24.298981   43.790000   75.170000   86.240000   104.100000   188.50000
mean area                569.0  654.889104  351.914129  143.500000  420.300000  551.100000   782.700000  2501.00000
mean smoothness          569.0    0.096360    0.014064    0.052630    0.086370    0.095870     0.105300     0.16340
mean compactness         569.0    0.104341    0.052813    0.019380    0.064920    0.092630     0.130400     0.34540
mean concavity           569.0    0.088799    0.079720    0.000000    0.029560    0.061540     0.130700     0.42680
mean concave points      569.0    0.048919    0.038803    0.000000    0.020310    0.033500     0.074000     0.20120
mean symmetry            569.0    0.181162    0.027414    0.106000    0.161900    0.179200     0.195700     0.30400
mean fractal dimension   569.0    0.062798    0.007060    0.049960    0.057700    0.061540     0.066120     0.09744
radius error             569.0    0.405172    0.277313    0.111500    0.232400    0.324200     0.478900     2.87300
texture error            569.0    1.216853    0.551648    0.360200    0.833900    1.108000     1.474000     4.88500
perimeter error          569.0    2.866059    2.021855    0.757000    1.606000    2.287000     3.357000    21.98000
area error               569.0   40.337079   45.491006    6.802000   17.850000   24.530000    45.190000   542.20000
smoothness error         569.0    0.007041    0.003003    0.001713    0.005169    0.006380     0.008146     0.03113
compactness error        569.0    0.025478    0.017908    0.002252    0.013080    0.020450     0.032450     0.13540
concavity error          569.0    0.031894    0.030186    0.000000    0.015090    0.025890     0.042050     0.39600
concave points error     569.0    0.011796    0.006170    0.000000    0.007638    0.010930     0.014710     0.05279
symmetry error           569.0    0.020542    0.008266    0.007882    0.015160    0.018730     0.023480     0.07895
fractal dimension error  569.0    0.003795    0.002646    0.000895    0.002248    0.003187     0.004558     0.02984
worst radius             569.0   16.269190    4.833242    7.930000   13.010000   14.970000    18.790000    36.04000
worst texture            569.0   25.677223    6.146258   12.020000   21.080000   25.410000    29.720000    49.54000
worst perimeter          569.0  107.261213   33.602542   50.410000   84.110000   97.660000   125.400000   251.20000
worst area               569.0  880.583128  569.356993  185.200000  515.300000  686.500000  1084.000000  4254.00000
worst smoothness         569.0    0.132369    0.022832    0.071170    0.116600    0.131300     0.146000     0.22260
worst compactness        569.0    0.254265    0.157336    0.027290    0.147200    0.211900     0.339100     1.05800
worst concavity          569.0    0.272188    0.208624    0.000000    0.114500    0.226700     0.382900     1.25200
worst concave points     569.0    0.114606    0.065732    0.000000    0.064930    0.099930     0.161400     0.29100
worst symmetry           569.0    0.290076    0.061867    0.156500    0.250400    0.282200     0.317900     0.66380
worst fractal dimension  569.0    0.083946    0.018061    0.055040    0.071460    0.080040     0.092080     0.20750
class                    569.0    0.627417    0.483918    0.000000    0.000000    1.000000     1.000000     1.00000
As with the regression example, many of the features have different scales. This time, it is very important that we scale our features. The reason has to do with the specifics of the logistic regression model in scikit-learn. The model performs regularization by default, which is sensitive to the relative magnitudes of the coefficients, and helps control overfitting, which was described during the regression section. Before we scale the features, let's briefly discuss classifier decision boundaries.
Classifier Decision Boundaries
As discussed above, the ultimate goal in classification is to correctly predict which class each sample belongs to. This is equivalent to defining a geometric boundary, with samples classified according to which side of the boundary they fall on. This can be made clearer using an example from our data set. Consider the figure below, which plots the 'mean radius' feature with each sample colored by class (benign samples are yellow, malignant are purple).
x = range(len(bc['mean radius']))
y = bc['mean radius']
plt.scatter(x,y,c=bc['class'])
plt.xlabel('sample')
plt.ylabel('mean radius')
plt.show()
If we were trying to classify a sample using just this feature, a good boundary could be a mean radius of 12.5. Any sample with a mean radius less than 12.5 is classified as benign, and any sample with a mean radius greater than 12.5 is classified as malignant. It's not a perfect classification, but it's a start. Since we're using not one but 30 features, our classifier will create an analogous boundary in a higher-dimensional space to separate the samples. It could be that not all features are useful in separating the classes, which is something we could investigate once our model has been built.
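A single-feature rule like this can be sketched in a few lines of Python. The mean radius values and labels below are hypothetical, made up purely for illustration; only the 12.5 threshold comes from the discussion above.

```python
# One-feature classifier: benign (1) below the threshold, malignant (0) at or above it.
# The sample values below are hypothetical, not drawn from the dataset.
THRESHOLD = 12.5

def classify_by_radius(mean_radius):
    return 1 if mean_radius < THRESHOLD else 0

# (mean radius, true class) pairs; 1 = benign, 0 = malignant
samples = [(11.4, 1), (13.0, 0), (20.6, 0), (9.5, 1), (12.8, 1)]
correct = sum(classify_by_radius(r) == label for r, label in samples)
accuracy = correct / len(samples)
print('Accuracy of the one-feature rule: %.0f%%' % (accuracy * 100))
```

As in the discussion above, the rule is imperfect: the hypothetical benign sample at 12.8 falls on the wrong side of the boundary.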
Training Data Versus Test Data
As before, we need to split the data back apart into features (X) and labels (y). Then, we can use the 'train_test_split' function from scikit-learn to randomly split our data into training and testing sets.
# All columns except the last ('class') are features.
X = bc.iloc[:,:-1]
y = bc['class']
# Split the data into 80% training and 20% testing.
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)
print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape)
Training data size: (455,30)
Testing data size: (114,30)
Scaling the Features
We'll scale our training data once again using 'StandardScaler' so that each feature has a mean of zero and unit standard deviation. We'll then apply the same scaling to the test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Let's double check to make sure the scaling worked as intended.
print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
Training set mean by feature:
[ 2.75628116e-15 4.59705534e-16 5.40227204e-16 6.52957542e-16
 6.25189766e-15 3.13644105e-15 2.56205313e-16 1.11607915e-15
 3.54783358e-15 3.99192279e-15 8.92387507e-16 4.62389589e-16
 1.18725237e-15 9.41005515e-16 1.29859493e-15 1.24088773e-15
 7.30795156e-16 1.70315532e-16 2.79580998e-15 3.99436284e-16
 1.29859493e-15 4.74785046e-15 8.00824608e-16 5.39495188e-16
 4.93427033e-15 1.86383265e-15 9.58451877e-16 1.65923441e-17
 2.49739179e-15 1.38546073e-15]
Training set standard deviation by feature:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
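Under the hood, 'StandardScaler' simply subtracts each feature's training-set mean and divides by its standard deviation (the population version, i.e. ddof = 0). A quick pure-Python sketch on a made-up column shows the effect:

```python
import math

# A hypothetical feature column (not from the dataset).
col = [14.2, 20.5, 11.8, 17.3, 13.1]

mean = sum(col) / len(col)
# Population standard deviation, matching StandardScaler's convention.
std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))

scaled = [(x - mean) / std for x in col]

new_mean = sum(scaled) / len(scaled)
new_std = math.sqrt(sum((x - new_mean) ** 2 for x in scaled) / len(scaled))
print('scaled mean = %.6f, scaled std = %.6f' % (new_mean, new_std))
```

The same (mean, std) pair learned from the training data is what 'transform' later applies to the test data.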
Training Our Model
It's finally time to build our logistic regression model using the training data. Similar to the linear regression model, this just involves creating a LogisticRegression model object and one call to its 'fit' method.
from sklearn.linear_model import LogisticRegression
regression_model = LogisticRegression()
regression_model.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Interpreting The Coefficients
The LogisticRegression model object also stores the coefficient values for us, so let's take a look.
intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_.transpose(),
index=breast_cancer_data.feature_names,
columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
Intercept = 0.493176
Coefficients
mean radius 0.620034
mean texture 0.504767
mean perimeter 0.564625
mean area 0.646708
mean smoothness 0.131364
mean compactness 0.460890
mean concavity 0.800368
mean concave points 0.771655
mean symmetry 0.168892
mean fractal dimension 0.553119
radius error 1.144201
texture error 0.337015
perimeter error 0.569919
area error 0.946128
smoothness error 0.405995
compactness error 0.610680
concavity error 0.154262
concave points error 0.078581
symmetry error 0.249923
fractal dimension error 0.731013
worst radius 1.150628
worst texture 1.192783
worst perimeter 0.880301
worst area 1.086891
worst smoothness 0.994301
worst compactness 0.007784
worst concavity 0.752380
worst concave points 0.752560
worst symmetry 0.703783
worst fractal dimension 0.449378
We have to be a bit more careful when interpreting the coefficients of a logistic regression model. We want to be able to articulate the effect of each coefficient on the output, and recall that our model outputs a probability of the form

p(X) = e^(β0 + β1X1 + ... + βpXp) / (1 + e^(β0 + β1X1 + ... + βpXp))
In this form, it is hard to specify the effect of each coefficient on the output probability. What we really want is an expression involving only the coefficients and features on the right-hand side. With a bit of manipulation, we can express the equation for our model as

p(X) / (1 − p(X)) = e^(β0 + β1X1 + ... + βpXp)
The quantity on the left-hand side is called the odds, and is defined as the ratio between the probability of a given event occurring and not occurring. Unlike probability, odds can range from 0 to infinity. Events with higher probabilities have higher odds, and vice versa. If we take the log of the above equation, we arrive at the following
log( p(X) / (1 − p(X)) ) = β0 + β1X1 + β2X2 + ... + βpXp
The expression on the left-hand side is called the log-odds, and this is what we were after, as the right-hand side involves only the coefficients and features. Now we see how to properly interpret the effect of each coefficient on the output. We can say that a one-unit increase in a given feature (while holding all other features fixed) increases the log-odds by an amount equal to its particular coefficient.
Although the actual change in p(X) caused by a one-unit increase in a particular feature is harder to quantify, we can still interpret the sign of the coefficient in the same manner as before. That is, a positive coefficient is associated with increasing the value of p(X) as X increases, and likewise a negative coefficient is associated with decreasing the value of p(X) as X increases. Recall that since malignant masses are coded as 0, and benign as 1, an increase in p(X) corresponds to a higher chance of the mass being benign. Given that, let's take a closer look at some of our coefficients:

Most coefficients related to measures of size (radius, perimeter, area) are negative, indicating that an increase in size is related to an increased chance of a mass being classified as malignant. This suggests that larger masses may more often be malignant, which makes sense.

The coefficient corresponding to compactness is positive, indicating that an increase in compactness is related to a decreased chance of a mass being classified as malignant. This also makes sense.
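The log-odds interpretation above is easy to verify numerically. In the sketch below, the coefficients are hypothetical (not the fitted values from our model); it shows that each one-unit increase in the feature adds exactly β1 to the log-odds, while the corresponding change in p(X) depends on where you start.

```python
import math

def p_of_x(beta0, beta1, x):
    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))

def log_odds(p):
    return math.log(p / (1 - p))

# Hypothetical single-feature model.
b0, b1 = -0.5, 0.8
for x in (0.0, 1.0, 2.0):
    p = p_of_x(b0, b1, x)
    print('x = %.0f: p(X) = %.4f, log-odds = %.4f' % (x, p, log_odds(p)))
# The log-odds column increases by exactly b1 = 0.8 per step of x,
# while the change in p(X) itself is not constant.
```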
Testing the Model on New Data
Now that we've built our model, we can check its performance on the test data set we set aside earlier. As before, we will do that using the 'predict' function. In addition, let's compute the accuracy on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built-in function for that as well. In the code below, 'y_pred' contains the predicted classes, and 'y_test' contains the true classes (labels) from the test data set.
from sklearn.metrics import accuracy_score
y_pred = regression_model.predict(X_test)
test_acc = accuracy_score(y_test,y_pred)*100
print('The test set accuracy is %4.2f%%' % test_acc)
The test set accuracy is 96.49%
This is an impressive result: over 95% accuracy using a relatively simple model with the default parameters. Recall our discussion earlier regarding the different ways of measuring accuracy in a classification model. We introduced something known as a confusion matrix, so let's go ahead and print it below using the 'confusion_matrix' function from scikit-learn's 'metrics' module.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test,y_pred,labels=[1,0])
print(conf_matrix)
[[66  1]
 [ 3 44]]
The confusion matrix is interpreted as follows:

The upper left term contains the number of true negatives in the test set. A true negative is where both the predicted and actual labels were negative (benign). In our case, there were 66 true negatives.

The lower right term contains the number of true positives in the test set. A true positive is where both the predicted and actual labels were positive (malignant). In our case, there were 44 true positives.

The lower left term contains the number of false negatives in the test set. A false negative is where the predicted label was negative (benign), but the actual label was positive (malignant). In our case, there were 3 false negatives, which are quite dangerous in this context.

The upper right term contains the number of false positives in the test set. A false positive is where the predicted label was positive (malignant), but the actual label was negative (benign). In our case, there was only 1 false positive. While stressful for a patient (until repeated tests can be performed), false positives are far less dangerous than false negatives.
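The four cells can also be counted directly from (actual, predicted) label pairs. Here is a minimal sketch using the same coding as our data (benign = 1 = negative, malignant = 0 = positive), with hypothetical labels rather than our actual test set:

```python
# Hypothetical true and predicted labels (1 = benign, 0 = malignant).
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

TN = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # benign correctly identified
TP = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # malignant correctly identified
FN = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # malignant missed
FP = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # benign flagged as malignant

# Same layout as confusion_matrix(y_true, y_pred, labels=[1,0]).
print([[TN, FP], [FN, TP]])
```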
Let's assign the various components to variables and calculate a few more metrics.
# True negatives
TN = conf_matrix[0][0]
# True positives
TP = conf_matrix[1][1]
# False negatives
FN = conf_matrix[1][0]
# False positives
FP = conf_matrix[0][1]
The sensitivity, also known as recall, or true positive rate (TPR), is given as
TPR = TP / (TP + FN)
A way to interpret the sensitivity is that, out of all positive results (a combination of true positive and false negatives), how many did we correctly predict? In our case, the sensitivity is equal to
TPR = float(TP)/(TP+FN)
print('TPR = %4.2f%%' % (TPR*100))
The specificity, also known as the true negative rate (TNR), is given as
TNR = TN / (TN + FP)
A way to interpret the specificity is that, out of all negative results (a combination of true negatives and false positives), how many did we correctly predict? In our case, the specificity is equal to
TNR = float(TN)/(TN+FP)
print('TNR = %4.2f%%' % (TNR*100))
The precision, or positive predictive value (PPV), is given as
PPV = TP / (TP + FP)
A way to interpret the precision is that, out of all the results we said were positive (a combination of true positives and false positives), how many did we correctly predict? In our case, the precision is equal to
PPV = float(TP)/(TP+FP)
print('PPV = %4.2f%%' % (PPV*100))
Finally, the negative predictive value (NPV), is given as
NPV = TN / (TN + FN)
A way to interpret the NPV is that, out of all the results we said were negative (a combination of true negatives and false negatives), how many did we correctly predict? In our case, the NPV is equal to
NPV = float(TN)/(TN+FN)
print('NPV = %4.2f%%' % (NPV*100))
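Plugging in the counts from our confusion matrix (TN = 66, TP = 44, FN = 3, FP = 1), we can check all four metrics, plus the overall accuracy, by hand:

```python
TN, TP, FN, FP = 66, 44, 3, 1  # counts from the test set above

accuracy = (TP + TN) / (TP + TN + FP + FN)
TPR = TP / (TP + FN)  # sensitivity / recall
TNR = TN / (TN + FP)  # specificity
PPV = TP / (TP + FP)  # precision
NPV = TN / (TN + FN)

print('accuracy = %.2f%%' % (accuracy * 100))
print('TPR = %.2f%%  TNR = %.2f%%' % (TPR * 100, TNR * 100))
print('PPV = %.2f%%  NPV = %.2f%%' % (PPV * 100, NPV * 100))
```

This reproduces the 96.49% accuracy reported earlier, with TPR (93.62%) and NPV (95.65%) falling below it, and TNR (98.51%) and PPV (97.78%) above it.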
As you can see, calculating the accuracy of the model doesn't always tell the whole story. Because we had three false negatives, our TPR and NPV were lower than our accuracy. On the other hand, because we had only one false positive, our TNR and PPV were higher. Some tweaking of the model could help adjust these values; for example, it may be desirable to decrease the number of false negatives. This could be done by adjusting the parameters of the logistic regression model (we used the defaults) or the decision threshold itself (recall that the default probability threshold of 0.5 may not be appropriate in this case).
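To make that last point concrete, here is a small sketch of how lowering the decision threshold reduces false negatives (at the cost of more false positives). The probabilities and labels are hypothetical, not actual model output, and for readability malignant is coded as 1 in this sketch:

```python
# Hypothetical predicted probabilities of malignancy and true labels
# (1 = malignant here; not actual model output).
probs  = [0.9, 0.6, 0.45, 0.3, 0.2, 0.55, 0.4, 0.1]
y_true = [1,   1,   1,    0,   0,   1,    0,   0]

def false_negatives(threshold):
    # Predict malignant whenever the probability exceeds the threshold.
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(t == 1 and pr == 0 for t, pr in zip(y_true, preds))

for thr in (0.5, 0.3):
    print('threshold %.1f -> %d false negative(s)' % (thr, false_negatives(thr)))
```

Lowering the threshold from 0.5 to 0.3 catches the borderline malignant sample at probability 0.45, while flagging one extra benign sample for follow-up.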
Conclusion
Hopefully this tutorial served as a good introduction to supervised learning with Python. We were introduced to the scikit-learn package, which provides a lot of very powerful machine-learning functionality and is a great place to start experimenting. We also saw the pandas package, which supports flexible data structures and is designed to make working with data sets easy. Along the way, we learned a bit of theory regarding model selection, some tradeoffs to consider when using different models, the importance of keeping separate training and testing sets, and why it's usually a good idea to scale your data. In addition, we got to see examples of both regression and classification problems, ways to evaluate the performance of each, and how the results could be improved.