April 8, 2022 • 3 min read

How to Implement a Fast Custom KNN in Sklearn Using Cython

Written by Reda Boumahdi

Scikit-learn is a great library for machine learning, but it can be quite slow on some problems, especially when you extend it with custom code such as custom metrics. k-NN, or KNN, is an intuitive algorithm for classification or regression. I will mainly use it for classification, but the same principle works for regression and for other algorithms that accept custom metrics.

Let’s dive into how you can implement a fast custom KNN in Scikit-learn.

A quick taste of Cython

The fundamental nature of Cython can be summed up as follows: Cython is Python with C data types.

Cython is essentially Python code that gets compiled into a C file and built as a library. Calls to this library are faster than calls to plain Python code. How fast? See for yourself!

First, let’s create a simple loop in Python, for instance like this:
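Something along these lines works as a small benchmark (the function name and the workload are illustrative, not the original snippet):

# loop.py -- plain Python version of the loop
def compute(n):
    result = 0
    for i in range(n):
        result += i * i
    return result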

Then, let’s do the same in Cython:
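A possible Cython version, saved in a .pyx file; the C type declarations are what make the difference:

# loop_cython.pyx -- the same loop with C type declarations
def compute(int n):
    cdef long long result = 0
    cdef int i
    for i in range(n):
        result += i * i
    return result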

To build the Cython library, the command line is:

python setup.py build_ext --inplace
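This assumes a setup.py along these lines (a minimal sketch, reusing the loop_cython.pyx file name from the snippet above):

# setup.py -- compiles the .pyx file into an importable extension module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("loop_cython.pyx"))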

Then we need to execute the main file:
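A minimal main file can time both versions with timeit (a sketch; the exact numbers will of course depend on your machine):

# main.py -- compares the pure Python and Cython versions of the loop
import timeit

py_time = timeit.timeit("compute(100000)", setup="from loop import compute", number=100)
cy_time = timeit.timeit("compute(100000)", setup="from loop_cython import compute", number=100)
print("Python: %.4fs, Cython: %.4fs, speed-up: x%.0f" % (py_time, cy_time, py_time / cy_time))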

Surprise… Cython is 1000 times faster!

What is KNN?

KNN is a great tool for interpreting the outputs without having to handle a black box. The idea behind this classification algorithm is to compare a new point (the green circle) to the K most similar points in the data set (the closest ones), and to give it the most represented label among them (square or triangle).

K-NN classification

In this example, if K = 3, the algorithm predicts a triangle; if K = 5, it predicts a square. The choice of K is therefore quite important.

The main computational cost is calculating the distances between the query point and all the remaining points. Let’s see how we can apply what we learned about Cython in the first part of this tutorial to K-NN.
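For a single query point, that computation boils down to something like the following NumPy sketch (just an illustration of the cost, not the code we benchmark later):

# Euclidean distances from one query point to every point in the data set
import numpy as np

def all_distances(query, points):
    # points has shape (n_samples, n_features), query has shape (n_features,)
    return np.sqrt(((points - query) ** 2).sum(axis=1))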

Why use a custom distance?

When it comes to data, the more features you can use, the better. Often, you will be faced with continuous and discrete features at the same time. In that case, a classic distance may fail to give good results, and a custom distance needs to be chosen. There is an infinity of distances to choose from, and you can combine them to create your own for better predictive power… Just like the Avengers, they become better when you combine them!

Here you can find a list of some common metrics that are implemented in Scikit-learn:

  • Euclidean : sqrt(sum((x - y)^2))
  • Manhattan : sum(|x - y|)
  • Chebyshev : max(|x - y|)
  • Minkowski with parameter p : sum(|x - y|^p)^(1/p)
  • W-Minkowski with parameters p,w : sum(|w * (x - y)|^p)^(1/p)
  • S-Euclidean with parameter V : sqrt(sum((x - y)^2 / V))
  • Mahalanobis with parameter V : sqrt((x - y)' V^-1 (x - y))

You can learn more about these distances in the Scikit-learn DistanceMetric documentation.
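As a quick illustration, the built-in metrics can be instantiated through the DistanceMetric class (it lives in sklearn.metrics in recent versions and in sklearn.neighbors in older ones; the sample values below are arbitrary):

# Pairwise Minkowski distances using a built-in metric
import numpy as np
from sklearn.metrics import DistanceMetric

X = np.array([[0.0, 0.0], [3.0, 4.0]])
dist = DistanceMetric.get_metric("minkowski", p=3)
print(dist.pairwise(X))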

As an example, custom metrics are useful for recommendation. Imagine Booking wanted to suggest hotels based on what you already booked in the past, using the price (price_i), the location (geoloc_i) and the stars (stars_i) as features. To apply K-NN we must be careful to compare features that live on comparable scales. One way to overcome this difficulty is to normalize; another one is to use the following distance:

α‖geoloc_1-geoloc_2‖² + β‖price_1-price_2‖² + γ‖stars_1-stars_2‖²

And to choose α, β and γ so that the predictive performance is better.
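In code, such a weighted distance might look like this (the feature layout and the default weights are assumptions made for the example):

# Weighted squared distance combining location, price and stars
def booking_distance(a, b, alpha=1.0, beta=1.0, gamma=1.0):
    # a and b are feature vectors laid out as [geoloc_x, geoloc_y, price, stars]
    geoloc = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    price = (a[2] - b[2]) ** 2
    stars = (a[3] - b[3]) ** 2
    return alpha * geoloc + beta * price + gamma * stars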

Custom distance syntax

The first step is the definition of our custom distance. We define the metric in a file called cython_metric.pyx; it takes two arrays and returns a double.
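Here is a sketch of what cython_metric.pyx can contain, written for a plain Euclidean distance (the original file is not reproduced, so take this as one possible implementation):

# cython_metric.pyx -- typed Euclidean distance between two 1-D arrays
from libc.math cimport sqrt

def mydist(double[:] x, double[:] y):
    cdef double d = 0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        d += (x[i] - y[i]) * (x[i] - y[i])
    return sqrt(d)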

In simple Python, it would look like this.
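Roughly like this, with the same logic but no type declarations:

# python_metric.py -- the same distance in plain Python
from math import sqrt

def mydist(x, y):
    d = 0.0
    for i in range(len(x)):
        d += (x[i] - y[i]) ** 2
    return sqrt(d)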

Then in another file, we load the data and call KNeighborsClassifier from Scikit-learn, and give it our custom distance.

We output the time taken by the algorithm as well as its accuracy.
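A sketch of that file, using the iris data set as a stand-in for your own data (the data set, the value of K and the file names are assumptions for the example):

# main_knn.py -- K-NN with the custom metric, with timing and accuracy
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from cython_metric import mydist  # swap in python_metric to compare

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree", metric=mydist)

start = time.time()
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print("time: %.3fs, accuracy: %.3f" % (time.time() - start, accuracy))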

Performance benchmark

For the same distance (Euclidean), we compare the performance of the Python code versus the Cython code by running the previous script several times for different numbers of observations. We can then plot the results as a graph.

Conclusion

Python is a great language for solving many problems. If you care about performance and want to speed up part of your code, you can move it into a Cython module. As we did with the distance calculation, your code will run much, much faster! I hope this tutorial helps your algorithms learn blazing fast!

Thanks to Flavian Hautbois, Adil Baaj, Adrien Lina, and Charles Bochet.

If you are looking for Machine Learning Experts, don't hesitate to contact us
