python 2.7 - Why doesn't scikit-learn's Nearest Neighbors seem to return proper cosine similarity distances?
I am trying to use scikit-learn's nearest neighbor implementation to find the column vectors closest to a given column vector, out of a matrix of random values.
This code is supposed to find the nearest neighbors of column 21 and then check the actual cosine similarity of those neighbors against column 21.
    from sklearn.neighbors import NearestNeighbors
    import sklearn.metrics.pairwise as smp
    import numpy as np

    test = np.random.randint(0, 5, (50, 50))
    nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
    distances, indices = nbrs.kneighbors(test)

    x = 21
    for idx, d in enumerate(indices[x]):
        sim2 = smp.cosine_similarity(test[:, x], test[:, d])
        print "sklearns cosine similarity ", sim2
        print 'sklearns reported distance is', distances[x][idx]
        print 'sklearns if distance cosine, similarity be: ', 1 - distances[x][idx]
The output looks like this:
    sklearns cosine similarity  [[ 0.66190748]]
    sklearns reported distance is 0.616586738214
    sklearns if distance cosine, similarity be:  0.383413261786
So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?
Also, as an aside, I thought sklearn's nearest neighbors implementation was not an approximate nearest neighbors approach, yet it doesn't seem to detect the actual best neighbors in my dataset, compared to the results I get if I iterate over the matrix and check the similarities of column 21 to all the other ones. Am I misunderstanding something basic here?
OK, the problem was with NearestNeighbors's .fit() method, which by default assumes that rows are samples and columns are features. I had to transpose the matrix before passing it to fit.
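For anyone hitting the same thing, here is a minimal sketch of the transposed call. It is not the exact code from the question: it assumes the built-in `metric='cosine'` string together with `algorithm='brute'` so that the reported distance can be checked directly, and the fixed seed and the `sims` list are only there for illustration. The print calls are written so they run under both Python 2.7 and Python 3.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)  # fixed seed, only so the run is repeatable
test = rng.randint(0, 5, (50, 50)).astype(float)

# fit() treats every ROW as one sample, so pass the transpose
# to make every COLUMN of `test` a sample.
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(test.T)
distances, indices = nbrs.kneighbors(test.T)

x = 21
sims = []
for idx, d in enumerate(indices[x]):
    # cosine_similarity expects 2-D input, hence the reshape to a single row
    sim = cosine_similarity(test[:, x].reshape(1, -1), test[:, d].reshape(1, -1))[0, 0]
    sims.append(sim)
    print("column %d: similarity %.6f, 1 - reported distance %.6f"
          % (d, sim, 1 - distances[x][idx]))
```

With the transpose in place, `1 - distances[x][idx]` should agree with the similarity computed by hand for each neighbor, and column 21 should come back as its own nearest neighbor.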
Edit: Also, another problem was that the callable passed as metric should be a distance callable, not a similarity callable. Otherwise you'll get the k farthest neighbors :/
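A rough sketch of what a distance callable could look like. `cosine_distance` is a helper name made up for this example, not something sklearn provides. It assumes the callable receives two 1-D vectors and returns a single number, and it uses `algorithm='brute'` because cosine distance does not satisfy the triangle inequality that tree-based searches rely on. The result is compared against a plain ranking of all columns by similarity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity


def cosine_distance(u, v):
    # Hypothetical helper: 1 - cosine similarity of two 1-D vectors,
    # so that SMALLER values mean MORE similar.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


rng = np.random.RandomState(0)  # fixed seed, only so the run is repeatable
test = rng.randint(0, 5, (50, 50)).astype(float)
samples = test.T  # columns of `test` become samples (rows)

nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute',
                        metric=cosine_distance).fit(samples)
distances, indices = nbrs.kneighbors(samples)

x = 21
# Check by hand: rank every column by similarity to column x, most similar first.
all_sims = cosine_similarity(samples)[x]
expected = np.argsort(-all_sims)[:5]
print("kneighbors:  %s" % sorted(indices[x].tolist()))
print("brute force: %s" % sorted(expected.tolist()))
```

Passing a similarity instead would make the search minimize similarity, which is why the original call returned the least similar columns.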