23 k-Nearest Neighbours
SLIDE DECKS
Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.
The k-Nearest Neighbours Algorithm
The k-Nearest Neighbours (k-NN) algorithm is a simple, versatile supervised machine learning technique used for both classification and regression. It rests on the idea that objects with similar features tend to belong to the same class or have similar target values. To classify a new data point, the algorithm finds the k closest training points in the feature space (its "nearest neighbours") and assigns the majority class among them; for regression, it averages their target values.
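To make the idea concrete, here is a minimal classification sketch in R. It uses knn() from the class package and the built-in iris data; the 70/30 split and k = 5 are illustrative choices, not prescriptions.

```r
library(class)   # provides knn()

set.seed(42)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # 70/30 train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# knn() predicts the majority class among the k closest training points
pred <- knn(train = train, test = test, cl = iris$Species[idx], k = 5)

mean(pred == iris$Species[-idx])  # test-set accuracy
```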
When Do We Use k-NN?
- Classification: k-NN can be used for tasks where data points need to be classified into distinct categories. For example, it can classify emails as spam or not spam, diagnose diseases, or recognize handwritten digits.
- Regression: It can also be applied to regression tasks to predict numerical values, such as estimating house prices from features like size, location, and number of bedrooms; in that case, the prediction is the average of the neighbours' values (a small sketch follows this list).
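For regression, one option in R is knn.reg() from the FNN package, which averages the responses of the k nearest neighbours. The simulated "house" data below is purely illustrative, as are the parameter values.

```r
library(FNN)   # provides knn.reg()

set.seed(1)
n     <- 200
size  <- runif(n, 50, 250)                # floor area (m^2), simulated
beds  <- sample(1:5, n, replace = TRUE)   # number of bedrooms, simulated
price <- 2000 * size + 15000 * beds + rnorm(n, sd = 20000)

x <- scale(cbind(size, beds))             # standardize before computing distances

# predict the price of a new house as the mean of its 5 nearest neighbours
new_x <- scale(cbind(120, 3),
               center = attr(x, "scaled:center"),
               scale  = attr(x, "scaled:scale"))
knn.reg(train = x, test = new_x, y = price, k = 5)$pred
```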
Best Practices
- Choosing the Right Value of k: Selecting an appropriate value for k is crucial. Small values of k produce noisy, highly variable decision boundaries (overfitting), while large values over-smooth them (underfitting). Cross-validation can help determine the optimal k. A common rule of thumb is to start with k near the square root of the number of observations in the data set (see the sketch after this list).
- Scaling: Put variables on the same scale so that no single variable has a disproportionate influence on the distance calculations. In other words, normalize or standardize your variables!
- Distance Metric: Choose a distance metric appropriate to the nature of your data (e.g., Euclidean, Manhattan, or more generally Minkowski distance; Euclidean and Manhattan are the Minkowski metrics with p = 2 and p = 1, respectively).
- Data Quality: High-quality, clean data is essential for k-NN. Outliers or irrelevant features can significantly distort the results. Take the time to understand your data before you apply this method.
- Computational Efficiency: On large data sets, naive k-NN is slow because every prediction requires computing the distance to every training point. Data structures such as k-d trees or ball trees can speed up the neighbour search considerably.
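The sketch below illustrates the first two practices together: the features are standardized with scale(), and leave-one-out cross-validation (via knn.cv() from the class package) is used to choose k, with the square-root rule as a starting point. The data set and the grid of candidate k values are illustrative.

```r
library(class)   # provides knn.cv()

x <- scale(iris[, 1:4])   # standardize so all features contribute equally
y <- iris$Species

round(sqrt(nrow(x)))      # square-root rule of thumb: about 12 here

# leave-one-out cross-validated accuracy for each candidate k
ks  <- 1:25
acc <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) == y))
ks[which.max(acc)]        # k with the highest cross-validated accuracy
```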
Extensions of the k-NN algorithm
- Weighted k-NN: Assign different weights to the nearest neighbours based on their distance, so that closer neighbours have more influence on the prediction (illustrated in the sketch after this list).
- Local Outlier Factor (LOF): LOF identifies outliers by comparing the local density around a point with the densities around its k nearest neighbours; it is an extension of k-NN used for anomaly detection (also illustrated below).
- k-NN with Feature Reduction: Combining k-NN with dimensionality-reduction techniques such as Principal Component Analysis (PCA), or with feature selection and feature engineering, can improve the algorithm's performance, particularly in high dimensions.
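As one possible illustration of the first two extensions, the sketch below uses two contributed packages: kknn, whose kernel argument implements distance-weighted k-NN (the "inv" kernel weights neighbours by inverse distance), and dbscan, whose lof() function computes the Local Outlier Factor. The package choices and all parameter values here are assumptions for demonstration purposes, not the only options.

```r
library(kknn)    # weighted k-NN
library(dbscan)  # Local Outlier Factor

set.seed(42)
idx <- sample(nrow(iris), 0.7 * nrow(iris))

# weighted k-NN: neighbours weighted by inverse distance ("inv" kernel);
# distance = 2 selects the Euclidean (Minkowski p = 2) metric
fit <- kknn(Species ~ ., train = iris[idx, ], test = iris[-idx, ],
            k = 7, distance = 2, kernel = "inv")
mean(fitted(fit) == iris$Species[-idx])   # test-set accuracy

# LOF: scores well above 1 suggest points in sparser regions (outliers)
scores <- lof(scale(iris[, 1:4]), minPts = 10)
head(order(scores, decreasing = TRUE))    # indices of the most outlying points
```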
k-NN is a straightforward algorithm and is often used as a baseline for classification tasks. Its simplicity and interpretability make it a valuable tool for many machine learning applications. However, it may not perform well on high-dimensional data (distances become less informative as the number of dimensions grows, the so-called curse of dimensionality) or where class boundaries are complex, so it is essential to consider the nature of the problem and the data when choosing k-NN.
Activity
Explore the k-NN algorithm using the following R script in RStudio.