
Tiffany Timbers
Associate Professor of Teaching,
Dept. of Statistics, UBC

Katie Burak
Assistant Professor of Teaching,
Dept. of Statistics, UBC
By the end of the session, learners will be able to do the following:
scikit-learn framework to predict the class of a single new observation.
Classification: predicting a categorical class (sometimes called a label) for an observation given its other variables (sometimes called features)
Training set: observations with known classes that we use as a basis for prediction
How?
Predict observations based on other observations “close” to it
Data:
digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian
Each row:
Diagnosis for each image was conducted by physicians.
Formulate a predictive question:
Can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor?
these values have been standardized (centered and scaled)
if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis
We can consider several neighboring points, for example k=3
\[\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\]
k=5
3 of the 5 nearest neighbors to our new observation are malignant
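The majority vote among the k nearest neighbors can be sketched in plain Python. This is only an illustration: the points, their labels, and the new observation below are made-up standardized (perimeter, concavity) values, not rows from the actual data set, and the helper name `euclidean` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical standardized (perimeter, concavity) values with known diagnoses.
labeled_points = [
    ((0.2, 3.3), "Malignant"),
    ((0.5, 3.0), "Malignant"),
    ((-0.3, 2.9), "Benign"),
    ((0.1, 2.5), "Malignant"),
    ((-0.4, 3.6), "Benign"),
    ((-1.5, -1.0), "Benign"),
    ((2.5, 0.5), "Malignant"),
]
new_point = (0.0, 3.5)  # hypothetical new observation with unknown diagnosis
k = 5


def euclidean(a, b):
    """Straight-line distance between two points in the plane."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


# Sort the labeled points by distance to the new point and keep the k closest.
neighbors = sorted(labeled_points, key=lambda item: euclidean(item[0], new_point))[:k]

# Count the classes among the neighbors and predict the most common one.
votes = Counter(label for _, label in neighbors)
predicted = votes.most_common(1)[0][0]

print(votes)      # 3 Malignant, 2 Benign
print(predicted)  # Malignant
```

With these made-up points, 3 of the 5 nearest neighbors are malignant, so the majority vote predicts a malignant diagnosis.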
euclidean_distances
For observations with m variables, the distance formula becomes
\[\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.\]
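As a small check of the general formula, the sketch below computes the distance between two observations with m = 3 variables, once by hand with NumPy and once with `euclidean_distances` from `sklearn.metrics.pairwise`. The two observations are made-up numbers chosen so that the answer is easy to verify.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Two hypothetical observations, each with m = 3 variables.
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[4.0, 6.0, 3.0]])

# Direct translation of the formula: square the differences, sum, take the root.
manual_distance = np.sqrt(np.sum((a - b) ** 2))

# scikit-learn returns a matrix of pairwise distances (here 1 x 1).
sklearn_distance = euclidean_distances(a, b)[0, 0]

print(manual_distance)   # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
print(sklearn_distance)
```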
The K-nearest neighbors algorithm works as follows:
scikit-learn
Now we can get started with sklearn and KNeighborsClassifier()
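A minimal sketch of the usual workflow (create the model object, fit it, then predict) might look like the following. The data frame, its column names, and the new observation are hypothetical stand-ins for the standardized tumor measurements, and `n_neighbors=5` is just one possible choice of k.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical standardized training data (not the real data set).
cancer_train = pd.DataFrame({
    "Perimeter": [0.2, 0.5, -0.3, 0.1, -0.4, -1.5, 2.5],
    "Concavity": [3.3, 3.0, 2.9, 2.5, 3.6, -1.0, 0.5],
    "Class": ["Malignant", "Malignant", "Benign", "Malignant",
              "Benign", "Benign", "Malignant"],
})

# Create the model object with k = 5 neighbors.
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model: X holds the features (a table), y holds the classes (a column).
X = cancer_train[["Perimeter", "Concavity"]]
y = cancer_train["Class"]
knn.fit(X, y)

# Predict the class of a single new observation.
new_obs = pd.DataFrame({"Perimeter": [0.0], "Concavity": [3.5]})
prediction = knn.predict(new_obs)
print(prediction)
```

Passing the new observation as a data frame with the same column names as `X` keeps the feature order explicit.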
scikit-learn: Create Model Object
scikit-learn: Fit the model
Note
X and y (note the capitalization). This comes from matrix notation.
scikit-learn: Predict
For KNN:
Compare these 2 scenarios:
All have a distance of 2
Many other models:
center of each variable (e.g., its mean) matters as well
Does not matter as much in KNN:
Person A (200 lbs, 6ft tall) vs Person B (202 lbs, 6ft tall)
Person A (200 lbs, 6ft tall) vs Person B (200 lbs, 8ft tall)
Difference in weight is in the 10s, difference in height is fractions of a foot.
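The two scenarios can be checked with a few lines of Python. This sketch uses only the standard library; the variable names are illustrative.

```python
import math

# (weight in lbs, height in ft) for each person in the two scenarios.
person_a = (200, 6)
person_b_heavier = (202, 6)  # differs from A by 2 lbs
person_b_taller = (200, 8)   # differs from A by 2 ft

distance_weight = math.dist(person_a, person_b_heavier)
distance_height = math.dist(person_a, person_b_taller)

print(distance_weight)  # 2.0
print(distance_height)  # 2.0
```

Both distances are 2, even though a 2 lb difference is negligible and a 2 ft difference is enormous. On the raw scale, KNN would treat the two pairs as equally similar, which is why the variables are scaled first.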
scikit-learn: ColumnTransformer
scikit-learn has a preprocessing module
StandardScaler(): scale our data
make_column_transformer: creates a ColumnTransformer to select columns
scikit-learn: Select numeric columns
scikit-learn: transform
Scale the data
Compare unscaled vs scaled
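One way to standardize selected numeric columns is sketched below, assuming a small made-up data frame. The column names `Area` and `Smoothness` and all values are hypothetical; only `StandardScaler` and `make_column_transformer` come from the lesson.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled data: the two features are on very different scales.
unscaled = pd.DataFrame({
    "Area": [1000.0, 1300.0, 1200.0, 400.0],
    "Smoothness": [0.12, 0.08, 0.11, 0.14],
    "Class": ["Malignant", "Malignant", "Malignant", "Benign"],
})

# Apply StandardScaler to the numeric columns only; other columns are dropped
# because the default is remainder="drop".
preprocessor = make_column_transformer(
    (StandardScaler(), ["Area", "Smoothness"]),
)

# Learn each column's mean and standard deviation, then transform the data.
preprocessor.fit(unscaled)
scaled = pd.DataFrame(
    preprocessor.transform(unscaled),
    columns=["Area", "Smoothness"],
)

# Compare unscaled vs scaled.
print(unscaled[["Area", "Smoothness"]])
print(scaled)
```

After scaling, each column has mean 0 and standard deviation 1, so neither variable dominates the distance calculation simply because of its units.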
scikit-learn pipelines?
What if we have class imbalance, i.e., the response variable has a big difference in frequency counts between classes?


Rebalance the data by oversampling the rare class
.sample() method on the rare class data frame
.value_counts() method to see that our classes are now balanced
Set seed
Upsample the rare class
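Oversampling with pandas could be sketched as follows. The data frame and its values are made up for illustration, and fixing `random_state` plays the role of setting the seed so the resampling is reproducible.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 benign observations and only 2 malignant.
rare_cancer = pd.DataFrame({
    "Perimeter": [-1.2, -0.8, -0.5, -0.9, -1.1, -0.3, -0.7, -0.6, 1.4, 2.1],
    "Class": ["Benign"] * 8 + ["Malignant"] * 2,
})

malignant = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign = rare_cancer[rare_cancer["Class"] == "Benign"]

# Sample the rare class with replacement until it matches the common class.
malignant_upsampled = malignant.sample(
    n=benign.shape[0], replace=True, random_state=1
)

# Combine the upsampled rare class with the untouched common class.
upsampled_cancer = pd.concat([malignant_upsampled, benign])

# Check that the classes are now balanced.
counts = upsampled_cancer["Class"].value_counts()
print(counts)
```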
k=7
Assume we are only looking at “randomly missing” data
.dropna()
KNN computes distances across all the features, so it needs complete observations
SimpleImputer()
We can impute missing data (with the mean) if there are too many missing values
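The two options for handling missing values can be sketched side by side. The data frame below is made up, and its column names are hypothetical; `SimpleImputer` with `strategy="mean"` replaces each missing entry with the mean of its column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
missing_cancer = pd.DataFrame({
    "Radius": [1.0, np.nan, 3.0, 5.0],
    "Texture": [2.0, 4.0, np.nan, 6.0],
})

# Option 1: drop every row that contains a missing value.
complete_rows = missing_cancer.dropna()

# Option 2: fill each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(missing_cancer),
    columns=missing_cancer.columns,
)

print(complete_rows)  # only rows 0 and 3 remain
print(imputed)        # Radius NaN -> 3.0, Texture NaN -> 4.0
```

Dropping rows halves this small data set, whereas imputing keeps all four observations, which matters when many rows have at least one missing value.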
Model prediction area.
The scikit-learn website is an excellent reference for more details on, and advanced usage of, the functions and packages in this lesson. It also offers many useful tutorials to get you started.
Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108–122. 2013.
Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
Evelyn Fix and Joseph Hodges. Discriminatory analysis. nonparametric discrimination: consistency properties. Technical Report, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
William Nick Street, William Wolberg, and Olvi Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In International Symposium on Electronic Imaging: Science and Technology. 1993.
Stanford Health Care. What is cancer? 2021. URL: https://stanfordhealthcare.org/medical-conditions/cancer/cancer.html.