• D.YANG

The world is but a canvas to our imagination.

Scrappy KNN


Intro

In this post, I’m tryign to build my first machine learning classifier from scratch, which is simple and ‘scrappy’. However it will give me a broad impression of to solve a machine learning problem.

I will be using scikit-learn and scipy package, which provide me the functions I need as well as the dataset. I will be using the famous wiki: iris dataset. And the dataset is also given in the wiki page. There are four features that help us classify which species of iris flower it is.

There are major step in supervised learning. First is training data: let computer learn from the data. Second is fitting data: let computer predict from our inputed features or X.

  • Regression: the output variable takes continuous values.
  • Classification: the output variable takes class labels.

KNeighborsClassifier

We first building a k nearest neighbors using KNeighborsClassifier function from sklearn. In this algorithm we just simply look what are the labels of k nearest neighbors, and classify our training data base on majority vote of the neighbors.

Wiki: k-nearest neighbors algorithm

from sklearn import datasets
from sklearn.model_selection import train_test_split

# load iris dataset
iris = datasets.load_iris()

# X is the feature we use to predict result
X = iris.data
# y is the result
y = iris.target

# split our X, y to 50% as traning data, 50% as testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

sk_clf = KNeighborsClassifier()
# traning our model with X_train, y_train
sk_clf.fit(X_train, y_train)
# using X_test to test our model
sk_predictions = sk_clf.predict(X_test)
# compare prediction accuracy vs result
print('skilearn: {0:.2f} %'.format((accuracy_score(y_test, sk_predictions)*100)))

Scrappy 1NN

Building a simple, bare-born, or ‘scrappy’ classifier, I continue used method of KNN. However, to simplify, just label it as same as the closest neighbor. And to use euclidean distance to define what the closest neighbor is.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from scipy.spatial import distance

# calculate euclidean distance
def euc(a, b):
    return distance.euclidean(a, b)

class ScrappyKNN():
    # scrappy fit, just initial our class
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        prediction = []
        for row in X_test:
            # find the label of closest neighbor
            prediction.append(self.closest(row))
        return prediction

    def closest(self, row):
        # initial
        best_distance = euc(row, self.X_train[0])
        best_index = 0
        # range over all point to determine which is closest
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_distance:
                best_distance = dist
                # record index
                best_index = i
        # return the label of closest neighbor
        return self.y_train[best_index]

iris = datasets.load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

sk_clf = KNeighborsClassifier()
sk_clf.fit(X_train, y_train)
sk_predictions = sk_clf.predict(X_test)


sp_clf = ScrappyKNN()
sp_clf.fit(X_train, y_train)
sp_predictions = sp_clf.predict(X_test)


print('skilearn: {0:.2f} %'.format((accuracy_score(y_test, sk_predictions)*100)))
print('scrappyKNN: {0:.2f} %'.format((accuracy_score(y_test, sp_predictions)*100)))