python - Recovering feature names of explained_variance_ratio_ in PCA with sklearn


I'm trying to recover, from a PCA done with scikit-learn, which features were selected as relevant.

A classic example is the iris dataset.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print pca.explained_variance_ratio_

This returns

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for these two values of explained variance in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

Edit: as others have commented, you may get the same values from the .components_ attribute.
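A minimal sketch of that route, using the fitted pca object from the code above (components_ holds one row of weights per component, so its transpose lines up one row per original feature):

>>> pca.components_      # shape (n_components, n_features)
>>> pca.components_.T    # one row per feature, one column per principal component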


Each principal component is a linear combination of the original variables:

PC = beta_1 * X_1 + beta_2 * X_2 + ... + beta_p * X_p

where the X_i are the original variables and the beta_i are the corresponding weights, the so-called coefficients.

To obtain the weights, you may pass an identity matrix to the transform method:

>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])
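As an aside, this identity-matrix trick works here because transform first subtracts the fitted mean (pca.mean_), which is essentially zero for the normalized data. A quick sketch confirming that the same weights come straight from the estimator's components_ attribute:

>>> np.allclose(coef, pca.components_.T)  # expected True, since pca.mean_ is ~0 here
True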

Each column of the coef matrix above shows the weights in the linear combination that produces the corresponding principal component:

>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]

For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight of 0.926 in absolute value.
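If you want that mapping explicitly, a small sketch (assuming the coef array and iris object defined above) picks the dominant feature of each component by absolute weight:

>>> top = np.abs(coef).argmax(axis=0)       # row index of the largest |weight| per column
>>> [iris.feature_names[int(i)] for i in top]
['petal length (cm)', 'sepal width (cm)']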

Since the data were normalized, you can confirm that the principal components having variance 1.0 is equivalent to each coefficient vector having norm 1.0:

>>> np.linalg.norm(coef, axis=0)
array([ 1.,  1.])
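The same unit norms can be read off the estimator itself: sklearn stores the principal axes as unit-norm rows of components_, so an equivalent check is:

>>> np.linalg.norm(pca.components_, axis=1)
array([ 1.,  1.])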

One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:

>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True

Note that we need to use numpy.allclose instead of the regular equality operator, because of floating point precision errors.

