python - Recovering feature names of explained_variance_ratio_ in PCA with scikit-learn
I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the iris dataset:
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print pca.explained_variance_ratio_
This returns:
In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])
How can I recover which two features produce these two explained variances in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?
In [47]: print iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Thanks in advance for your help.
Edit: as others have commented, you may get the same values from the .components_ attribute.
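For instance, a minimal check (a sketch, assuming numpy is imported as np and the fitted pca from the question's code; since df_norm has practically zero column means, transforming the identity matrix reduces to the transpose of components_, up to floating point):

>>> import numpy as np
>>> np.allclose(pca.components_.T, pca.transform(np.identity(df.shape[1])))
True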
Each principal component is a linear combination of the original variables:

pc-1 = beta_1 * x_1 + beta_2 * x_2 + ... + beta_n * x_n

where the x_i's are the original variables, and the beta_i's are the corresponding weights, also called coefficients.
To obtain the weights, you may pass an identity matrix to the transform method:
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])
Each column of the coef matrix above shows the weights in the linear combination which obtains the corresponding principal component:
>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]
For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight of 0.926 in absolute value.
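If you want to map each component to its dominant feature name programmatically, one way is to take, per column of coef, the index of the largest absolute weight (a sketch for illustration; keep in mind each component still mixes all four variables):

>>> df.columns[np.abs(coef).argmax(axis=0)].tolist()
['petal length (cm)', 'sepal width (cm)']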
Since the data were normalized, these coefficients match scikit-learn's principal axes, so you can confirm that each coefficient vector has norm 1.0:
>>> np.linalg.norm(coef, axis=0)
array([ 1.,  1.])
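This unit norm is a property of the principal axes themselves: the rows of scikit-learn's components_ are unit vectors, which can be checked directly (same setup as above):

>>> np.linalg.norm(pca.components_, axis=1)
array([ 1.,  1.])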
One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:
>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True
Note that we need to use numpy.allclose instead of the regular equality operator, because of floating point precision errors.