Sunday, April 24, 2016

How to get feature names selected by feature elimination in sklearn pipeline?


I am using recursive feature elimination (RFE) in my sklearn pipeline; the pipeline looks something like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

How can I get the names of the features selected by the RFE? RFE should select the best 500 features, but I really need to see which features have been selected.

EDIT:

I have a complex Pipeline which consists of multiple pipelines and feature unions, percentile feature selection, and, at the end, recursive feature elimination:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word',
                               sublinear_tf=True, use_idf=True, min_df=2, max_df=0.85, lowercase=True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features=1000, analyzer=u'word',
                                    min_df=2, max_df=0.85, sublinear_tf=True, use_idf=True, lowercase=False)

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[

            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([

                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),

                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),

                ])),
                ('percentile_feature_selection', fs_vect)
            ])),

            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([

                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),

                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ])),

                ])),
                ('percentile_feature_selection', fs),
                ('inner_scale', inner_scaler)
            ])),

        ],

        # weight components in FeatureUnion
        # n_jobs=6,
        transformer_weights={
            'vectorized_pipeline': 0.8,
            'custom_pipeline': 1.0
        },
    )),

    ('rfe_feature_selection', f5),
    ('clf', classifier),
])

I'll try to explain the steps. The first pipeline, 'vectorized_pipeline', consists of vectorizers, all of which have a get_feature_names method. The second, 'custom_pipeline', consists of my own features; I have implemented these with fit, transform and get_feature_names methods as well. When I use the suggestion of @Kevin, I get an error that 'union' (the name of the top-level step in my pipeline) does not have a get_feature_names method:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print np.array(feature_names)[support]

Also, when I try to get feature names from individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
print np.array(feature_names)[support]

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
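
A note on the KeyError: named_steps only exposes a Pipeline's top-level steps, and 'union_vectorizer' sits two levels down, so it has to be reached by walking the nesting explicitly. A minimal sketch, assuming the pipeline defined above:

# named_steps only knows top-level steps, so walk the nesting:
union = pipeline.named_steps['union']  # top-level FeatureUnion
vectorized = dict(union.transformer_list)['vectorized_pipeline']  # inner Pipeline
inner_union = vectorized.named_steps['union_vectorizer']  # nested FeatureUnion
vect_names = inner_union.get_feature_names()

Keep in mind that support_ indexes the concatenated output of the whole 'union' (after the percentile selectors), so these inner names alone will not line up with the RFE mask unless the percentile selectors' get_support() masks are applied first.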

1 Answer

Answer 1

You can access each step of the Pipeline with the attribute named_steps. Here's an example on the iris dataset, which selects only 2 features, but the solution will scale.

from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
])

pipeline.fit(X, y)

With named_steps you can access the attributes and methods of the transformer objects in the pipeline. The RFE attribute support_ (or the method get_support()) returns a boolean mask of the selected features:

support = pipeline.named_steps['rfe_feature_selection'].support_ 

Now that support is a boolean array, you can use it to efficiently extract the names of the selected features (columns). Make sure your feature names are in a numpy array, not a Python list.

import numpy as np

feature_names = np.array(iris.feature_names)  # convert the list to an array
feature_names[support]

array(['sepal width (cm)', 'petal width (cm)'],
      dtype='|S17')
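
If you prefer integer positions to a boolean mask, RFE (like the other sklearn feature selectors) also offers get_support(indices=True), which returns column indices and therefore works with a plain Python list as well:

idx = pipeline.named_steps['rfe_feature_selection'].get_support(indices=True)
selected = [iris.feature_names[i] for i in idx]  # plain list indexing, no numpy needed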

EDIT

Per my comment above, here is your example with the CustomFeatures() transformer removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
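
To keep CustomFeatures() in the union, it needs a get_feature_names method of its own, because FeatureUnion.get_feature_names simply asks each of its transformers for names and prefixes them with the transformer's name. A minimal sketch of such a transformer; the two features here are made up purely for illustration:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomFeatures(BaseEstimator, TransformerMixin):
    """Toy transformer emitting two hand-crafted features per document."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to fit

    def transform(self, X):
        # two illustrative features: character count and word count
        return np.array([[len(doc), len(doc.split())] for doc in X])

    def get_feature_names(self):
        # names must match the columns produced by transform()
        return ['char_count', 'word_count']

With this in place, pipeline.named_steps['features'].get_feature_names() returns the tfidf names followed by 'custom_features__char_count' and 'custom_features__word_count'.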