Skip to content

Latest commit

 

History

History
116 lines (89 loc) · 2.53 KB

07_knn.md

File metadata and controls

116 lines (89 loc) · 2.53 KB

K近邻

应用场景

  • 产品推荐和推荐引擎
  • 自然语言处理 (NLP) 算法的相关性排名
  • 图像或视频的相似性搜索

过程

  • 计算训练样本和测试样本中每个样本之间的距离
  • 对距离进行排序
  • 选择距离最小的k个样本: K小,噪音多,K大,边界模糊
  • 根据k个样本的标签进行投票, 得到分类类别

距离

  • 欧几里得距离
  • 明可夫斯基距离
  • 曼哈顿距离
  • 余弦相似度

优化

  • NDTree
  • Approximate kNN 加速服务
    • Tree-based ANN
    • Locality-sensitive hashing (LSH)
    • Clustering based
  • faiss

问答

Pros:

  • Almost no assumption on f other than smoothness
  • High capacity/complexity
  • High accuracy given a large training set
  • Supports online training (by simply memorizing)
  • Readily extensible to multi-class and regression problems

Cons:

  • Storage demanding
  • Sensitive to outliers
  • Sensitive to irrelevant data features (vs. decision trees)
  • Needs to deal with missing data (e.g., special distances)
  • Computationally expensive: O(ND) time for making each prediction
  • Can speed up with index and/or approximation

code

# https://towardsdatascience.com/create-your-own-k-nearest-neighbors-algorithm-in-python-eb7093fc6339

import numpy as np

def eu_dist(x_train, x_test):
    distances = []
    for x_cur in x_train:
        distance = np.sqrt(np.sum((x_cur - x_test) ** 2))
        distances.append(distance)
    return distances


def KNN(x_train, y_train, x_test, k):
    # convert to numpy array
    x_train = np.array(x_train)
    y_train = np.array(y_train)
    x_test = np.array(x_test)

    # calculate the distance between x_train and x_test
    distances = eu_dist(x_train, x_test)
    # sort the distance and find the top-k in y_train
    dist_index = np.argsort(distances)[:k]
    topk_y = [y_train[index] for index in dist_index]

    # count the frequency in y_train
    dict = {}
    for y_cur in topk_y:
        y_cur = y_cur[0]
        dict[y_cur] = dict.get(y_cur, 0) + 1

    dict = list(dict.items())
    dict = sorted(dict, key=lambda x: x[1], reverse= True)

    # return the highest frequent value
    predict = dict[0][0]
    return predict

x_train = [[73.84, 241.89],
       [68.78, 162.31],
       [74.11, 212.74],
       [74.73, 220.04],
       [69.88, 206.34],
       [67.25, 152.21],
       [63.45, 156.39]
    ]


y_train = [
       [1],
       [1],
       [2],
       [1],
       [1],
       [2],
       [2],
       [1],
]


print(KNN(x_train, y_train, x_test, k))
import faiss