By the end of the session, learners will be able to do the following:
Sometimes our classifier might make the wrong prediction.
A classifier does not need to be right 100% of the time to be useful, though we don’t want the classifier to make too many wrong predictions.
How do we measure how “good” our classifier is?
\[\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{58}{65} = 0.892\]
Prediction accuracy only tells us how often the classifier is right (or wrong) overall; it does not tell us anything about the kinds of mistakes the classifier makes.
The confusion matrix tells a more complete story.
| | Predicted Malignant | Predicted Benign |
|---|---|---|
| Actually Malignant | 1 | 3 |
| Actually Benign | 4 | 57 |
\[\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}\]
\[\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}}\]
| | Predicted Malignant | Predicted Benign |
|---|---|---|
| Actually Malignant | 1 | 3 |
| Actually Benign | 4 | 57 |
\[\mathrm{precision} = \frac{1}{1+4} = 0.20, \quad \mathrm{recall} = \frac{1}{1+3} = 0.25\]
So even with an accuracy of 89%, the precision and recall of the classifier were both relatively low. For this data analysis context, recall is particularly important: if someone has a malignant tumor, we certainly want to identify it. A recall of just 25% would likely be unacceptable!
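As a quick arithmetic check, here is a minimal Python sketch (the variable names are ours) that computes accuracy, precision, and recall directly from the four counts in the confusion matrix above:

```python
# Counts from the confusion matrix above (Malignant is the positive label)
true_pos = 1    # actually malignant, predicted malignant
false_neg = 3   # actually malignant, predicted benign
false_pos = 4   # actually benign, predicted malignant
true_neg = 57   # actually benign, predicted benign

total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total        # 58 / 65 ≈ 0.892
precision = true_pos / (true_pos + false_pos)   # 1 / 5 = 0.20
recall = true_pos / (true_pos + false_neg)      # 1 / 4 = 0.25

print(f"accuracy={accuracy:.3f}, precision={precision:.2f}, recall={recall:.2f}")
```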
Our data analyses will often involve the use of randomness
We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input (e.g., splitting into training and test sets).
However, the use of randomness runs counter to one of the main tenets of good data analysis practice: reproducibility…
The trick is that in Python—and other programming languages—randomness is not actually random! Instead, Python uses a random number generator that produces a sequence of numbers that are completely determined by a seed value.
Once you set the seed value, everything after that point may look random, but is actually totally reproducible.
Let’s say we want to make a Series object containing the integers from 0 to 9, and then randomly pick 10 numbers from that list, but in a way that is reproducible.
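A minimal sketch of what this could look like, assuming we use numpy’s global random seed together with pandas’ sample method (the variable names are ours):

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # fix the seed so everything that follows is reproducible

nums_0_to_9 = pd.Series(range(10))          # the integers 0 to 9
random_numbers = nums_0_to_9.sample(n=10)   # draw all 10 numbers in a random order
print(random_numbers.to_list())
```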
Reminder of the list of numbers we just generated:
If we run the sample method again, we will get a fresh batch of 10 numbers that also look random.
Let’s regenerate the original set of 10 random numbers by setting the seed to 1:
If we choose a different value for the seed—say, 4235—we obtain a different sequence of random numbers:
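Continuing the same assumptions as the sketch above, re-running it with the seed reset illustrates both behaviours:

```python
import numpy as np
import pandas as pd

nums_0_to_9 = pd.Series(range(10))

np.random.seed(1)                           # same seed as before...
print(nums_0_to_9.sample(n=10).to_list())   # ...reproduces the original sequence

np.random.seed(4235)                        # a different seed...
print(nums_0_to_9.sample(n=10).to_list())   # ...gives a different, but still reproducible, sequence
```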
Load packages and set seed:
Load the data and look at class distribution:
Setting normalize=True allows us to view the proportions.
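A sketch of these steps; the file name and the Class column label are assumptions based on the breast cancer example used throughout this section:

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # set the seed so the analysis below is reproducible

# Load the breast cancer data (the file name here is hypothetical)
cancer = pd.read_csv("data/wdbc.csv")

# Counts of each class label, then the same information as proportions
print(cancer["Class"].value_counts())
print(cancer["Class"].value_counts(normalize=True))
```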
Before fitting any models, or doing exploratory data analysis, it is critical that you split the data into training and test sets.
Typically, the training set is between 50% and 95% of the data, while the test set is the remaining 5% to 50%.
The train_test_split function from scikit-learn handles the procedure of splitting the data for us.
Use shuffle=True to remove the influence of order in the data set.
Set the stratify parameter to be the response variable to ensure the same proportion of each class ends up in both the training and testing sets.
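A sketch of the split, continuing from the loading step above; the 75%/25% split is an illustrative choice within the range mentioned earlier, and the Class column remains an assumption:

```python
from sklearn.model_selection import train_test_split

cancer_train, cancer_test = train_test_split(
    cancer,
    train_size=0.75,            # 75% training / 25% test, within the 50%-95% guideline
    shuffle=True,               # remove any influence of the row order
    stratify=cancer["Class"],   # preserve the class proportions in both splits
)
```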
Let’s look at the Class variable in the training and test splits (in practice you look at both).
We can see our class proportions were roughly preserved when we split the data.
Many machine learning models are sensitive to the scale of the predictors, and even for those that are not, comparing the importance of features after fitting requires the predictors to be on a common scale.
When preprocessing the data (scaling is part of this), it is critical that we use only the training set in creating the mathematical function to do this.
If this is not done, we will get overly optimistic test accuracy, as our test data will have influenced our model.
After creating the preprocessing function, we can then apply it separately to both the training and test data sets.
scikit-learn helps us handle this properly as long as we wrap our analysis steps in a Pipeline.
Specifically, we construct and prepare the preprocessor using make_column_transformer, specifying the type of transformation we want to apply, as well as which columns to apply it to:
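A sketch of this preprocessor, standardizing the two predictor columns used below; the column names Smoothness and Concavity are assumptions based on the predictors named in the next step:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Standardize (center and scale) the two predictor columns.
# The transformation is only *specified* here; it is learned later,
# inside the pipeline, from the training data alone.
cancer_preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"]),
)
```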
Now we can create our K-nearest neighbors classifier with only the training set.
For simplicity, we will just choose \(K\) = 3, and use only the concavity and smoothness predictors.
Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set:
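A sketch of fitting and predicting, continuing from the split and preprocessor sketched above (cancer_train, cancer_test, and cancer_preprocessor); the column and label names remain assumptions:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Chain the preprocessor and the K-NN model so that scaling is learned
# from the training data only
knn_pipeline = make_pipeline(cancer_preprocessor, KNeighborsClassifier(n_neighbors=3))

# Fit on the training set...
knn_pipeline.fit(
    X=cancer_train[["Smoothness", "Concavity"]],
    y=cancer_train["Class"],
)

# ...then predict class labels for the test set
cancer_test_predictions = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
```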
To evaluate the model, we will look at accuracy, precision, recall, and the confusion matrix.
All of these together will help us develop a fuller picture of how the model is performing, as opposed to evaluating it based on a single metric or table.
The Pandas crosstab function takes two arguments: the actual labels first, then the predicted labels second.
Note that crosstab orders its columns alphabetically, but the positive label is still Malignant, even if it is not in the top left corner as in the table shown earlier.
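Continuing the sketch above, the confusion matrix might be produced like this:

```python
import pandas as pd

# Actual labels first, predicted labels second
print(pd.crosstab(cancer_test["Class"], cancer_test_predictions))
```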
Is 90% accuracy, a precision of 83% and a recall of 91% good enough?
To get a sense of scale, we often compare our model to a baseline model. In the case of classification, this would be the majority classifier (always guesses the majority class label from the training data).
For the breast cancer training data, the baseline classifier’s accuracy would be 63%.
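Continuing the sketch, the baseline (majority classifier) accuracy is simply the proportion of the most common training label:

```python
# The majority classifier always predicts the most frequent label in the training set,
# so its accuracy equals that label's proportion
print(cancer_train["Class"].value_counts(normalize=True).max())
```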
So we do see that our model is doing a LOT better than the baseline, which is great, but considering our application domain is in cancer diagnosis, we still have a ways to go…
Analyzing model performance really depends on your application!
Most predictive models in statistics and machine learning have parameters: numbers you have to pick in advance that determine some aspect of how the model behaves.
For our working example, \(K\)-nearest neighbors classification algorithm, \(K\) is a parameter that we have to pick that determines how many neighbors participate in the class vote.
How do we choose \(K\), or any parameter for other models?
Data splitting!
Cannot use the test set to choose the parameter!
But we can split the training set into two partitions: a training set and a validation set.
For each parameter value we want to assess, we can fit on the training set, and evaluate on the validation set.
Then after we find the best value for our parameter, we can refit the model with the best parameter on the entire training set and then evaluate our model on the test set.
Depending on how we split the data into the training and validation sets, we might get a lucky split (or an unlucky one) that doesn’t give us a good estimate of the model’s true accuracy.
In many cases, we can do better by making many splits, and averaging the accuracy scores to get a better estimate.
We call this cross-validation.
Use the scikit-learn cross_validate function.
Need to specify:

- the Pipeline as the estimator argument,
- the number of cross-validation folds as the cv argument,
- the training data predictors as the X argument,
- the training data labels as the y argument.

Note that the cross_validate function handles stratifying the classes in each train and validation fold automatically.
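A sketch of such a call, continuing from the pipeline and training split above; 5 folds is an illustrative choice:

```python
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    estimator=knn_pipeline,                        # the preprocessing + K-NN pipeline
    cv=5,                                          # number of cross-validation folds
    X=cancer_train[["Smoothness", "Concavity"]],   # training predictors
    y=cancer_train["Class"],                       # training labels
)

# cross_validate returns a dict of arrays; the validation accuracies are in "test_score"
print(cv_results["test_score"].mean(), cv_results["test_score"].std())
```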
Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of our parameter, here \(K\), in a reasonable range.
Then we pick the value of \(K\) that gives us the best accuracy, and refit the model with our parameter on the training data, and then evaluate on the test data.
The scikit-learn package provides built-in functionality, named GridSearchCV, to automatically handle the details for us.
Now we use the fit method on the GridSearchCV object to begin the tuning process.
We can also obtain the number of neighbours with the highest accuracy programmatically by accessing the best_params_ attribute of the fit GridSearchCV object.
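A sketch of the tuning step, continuing from the pipeline above; the grid of candidate K values and the 10 folds are illustrative choices, and the kneighborsclassifier__n_neighbors parameter name follows from building the pipeline with make_pipeline:

```python
from sklearn.model_selection import GridSearchCV

# Candidate values of K: odd numbers from 1 to 99 (an illustrative range)
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 2),
}

cancer_tune_grid = GridSearchCV(
    estimator=knn_pipeline,
    param_grid=parameter_grid,
    cv=10,  # 10-fold cross-validation for each candidate K
)

# Run the whole tuning process on the training data
cancer_tune_grid.fit(
    cancer_train[["Smoothness", "Concavity"]],
    cancer_train["Class"],
)

# The value of K with the highest cross-validation accuracy
print(cancer_tune_grid.best_params_)
```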
Do we use \(K\) = 36?
Generally, when selecting a parameter, we are looking for a value that gives roughly optimal accuracy, and for which small changes to the value do not change the accuracy much.
Before we evaluate on the test set, we need to refit the model using the best parameter(s) on the entire training set.
Luckily, scikit-learn does it for us automatically!
To make predictions and assess the estimated accuracy of the best model on the test data, we can use the score and predict methods of the fit GridSearchCV object.
How well might our classifier do on unseen data?
To find out we can then pass those predictions to the precision, recall, and crosstab functions to assess the estimated precision and recall, and print a confusion matrix.
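A sketch of this final evaluation, continuing from the tuned GridSearchCV object above; we assume the precision and recall functions referred to are scikit-learn’s precision_score and recall_score, and that Malignant is the positive label:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

X_test = cancer_test[["Smoothness", "Concavity"]]
y_test = cancer_test["Class"]

# Estimated accuracy of the best (refit) model on the test set
print(cancer_tune_grid.score(X_test, y_test))

# Predictions from the best model, then precision, recall, and the confusion matrix
test_predictions = cancer_tune_grid.predict(X_test)
print(precision_score(y_test, test_predictions, pos_label="Malignant"))
print(recall_score(y_test, test_predictions, pos_label="Malignant"))
print(pd.crosstab(y_test, test_predictions))
```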
Strengths: K-nearest neighbors classification
Weaknesses: K-nearest neighbors classification
scikit-learn’s “Choosing the right estimator” documentation: https://scikit-learn.org/stable/machine_learning_map.html
The scikit-learn website is an excellent reference for more details on, and advanced usage of, the functions and packages covered in the past two chapters. It also offers many useful examples to get you started.