I am using recursive feature elimination (RFE) in my sklearn pipeline. The pipeline looks something like this:
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn import feature_selection
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
            ('custom_features', CustomFeatures())])),
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1),
    ])

    pipeline.fit(X, Y)
    y_pred = pipeline.predict(X_dev)
How can I get the feature names of features selected by the RFE? RFE should select the best 500 features, but I really need to take a look at what features have been selected.
EDIT:
I have a complex Pipeline which consists of multiple pipelines and feature unions, percentile feature selection and at the end Recursive Feature Elimination:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
    fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
    f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

    countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word',
                                   sublinear_tf=True, use_idf=True, min_df=2, max_df=0.85,
                                   lowercase=True)
    countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features=1000, analyzer=u'word',
                                        min_df=2, max_df=0.85, sublinear_tf=True, use_idf=True,
                                        lowercase=False)

    pipeline = Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('vectorized_pipeline', Pipeline([
                    ('union_vectorizer', FeatureUnion([
                        ('stem_text', Pipeline([
                            ('selector', ItemSelector(key='stem_text')),
                            ('stem_tfidf', countVecWord)
                        ])),
                        ('pos_text', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_tfidf', countVecWord_tags)
                        ])),
                    ])),
                    ('percentile_feature_selection', fs_vect)
                ])),
                ('custom_pipeline', Pipeline([
                    ('custom_features', FeatureUnion([
                        ('pos_cluster', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_cluster_inner', pos_cluster)
                        ])),
                        ('stylistic_features', Pipeline([
                            ('selector', ItemSelector(key='raw_text')),
                            ('stylistic_features_inner', stylistic_features)
                        ])),
                    ])),
                    ('percentile_feature_selection', fs),
                    ('inner_scale', inner_scaler)
                ])),
            ],
            # weight components in FeatureUnion
            # n_jobs=6,
            transformer_weights={
                'vectorized_pipeline': 0.8,  # 0.8,
                'custom_pipeline': 1.0  # 1.0
            },
        )),
        ('rfe_feature_selection', f5),
        ('clf', classifier),
    ])
I'll try to explain the steps. The first Pipeline consists of vectorizers and is called "vectorized_pipeline"; all of its vectorizers have a get_feature_names function. The second Pipeline consists of my own features, which I have implemented with fit, transform and get_feature_names functions as well. When I use the suggestion of @Kevin, I get an error saying that 'union' (the name of the top element in my pipeline) does not have a get_feature_names function:
    support = pipeline.named_steps['rfe_feature_selection'].support_
    feature_names = pipeline.named_steps['union'].get_feature_names()
    print(np.array(feature_names)[support])
Also, when I try to get feature names from individual FeatureUnions, like this:
    support = pipeline.named_steps['rfe_feature_selection'].support_
    feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
    print(np.array(feature_names)[support])
I get a key error:
    feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
    KeyError: 'union_vectorizer'
1 Answer
You can access each step of the Pipeline with the attribute named_steps. Here's an example on the iris dataset that selects only 2 features, but the solution will scale.
    from sklearn import datasets
    from sklearn import feature_selection
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

    pipeline = Pipeline([
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1)
    ])

    pipeline.fit(X, y)
With named_steps you can access the attributes and methods of the transform object in the pipeline. The RFE attribute support_ (or the method get_support()) will return a boolean mask of the selected features:
    support = pipeline.named_steps['rfe_feature_selection'].support_
Now support is a boolean array that you can use to efficiently extract the names of your selected features (columns). Make sure your feature names are in a numpy array, not a Python list.
    import numpy as np

    feature_names = np.array(iris.feature_names)  # convert the list to an array
    feature_names[support]

    array(['sepal width (cm)', 'petal width (cm)'], dtype='|S17')
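As a small extension of the iris example, the sketch below checks that get_support() returns the same mask as support_, shows that get_support(indices=True) gives integer column indices instead, and reads ranking_, where selected features have rank 1. The max_iter value and the variable names are assumptions added for this illustration; they are not part of the original snippet.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# max_iter is raised only to keep the solver from warning (assumption)
pipeline = Pipeline([
    ('rfe_feature_selection',
     RFE(estimator=LinearSVC(C=0.1, max_iter=10000), n_features_to_select=2, step=1)),
    ('clf', LinearSVC(C=0.1, max_iter=10000)),
])
pipeline.fit(X, y)

rfe = pipeline.named_steps['rfe_feature_selection']
feature_names = np.array(iris.feature_names)

mask = rfe.get_support()                  # boolean mask, same content as rfe.support_
indices = rfe.get_support(indices=True)   # integer positions of the kept columns

selected_by_mask = feature_names[mask]
selected_by_index = feature_names[indices]

# ranking_ holds 1 for every selected feature; larger values were eliminated earlier
ranking = dict(zip(iris.feature_names, rfe.ranking_))
print(selected_by_mask)
print(ranking)
```

Both indexing styles give the same names; the integer form can be handy when the names live in something other than a numpy array, such as the columns of a DataFrame.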
EDIT
Per my comment above, here is your example with the CustomFeatures() transformer removed:
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn import feature_selection
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    import numpy as np

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
    f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
        ('rfe_feature_selection', f5),
        ('clf', LinearSVC1),
    ])

    pipeline.fit(X, Y)
    y_pred = pipeline.predict(X_dev)

    # boolean mask of the columns RFE kept
    support = pipeline.named_steps['rfe_feature_selection'].support_
    # names of all columns produced by the FeatureUnion
    # (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out() there)
    feature_names = pipeline.named_steps['features'].get_feature_names()
    np.array(feature_names)[support]
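Regarding the KeyError in your EDIT: named_steps only contains the top-level steps of a Pipeline, so 'union_vectorizer' has to be reached by walking down through the outer FeatureUnion's transformer_list. Below is a rough sketch of that walk on a much smaller stand-in for your pipeline. The toy documents, the vectorizer settings and the names_of helper are all assumptions made for illustration, and only the vectorized branch is included. The idea is to apply each selection mask in the order the data passes through it: first the percentile selector, then RFE.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE, SelectPercentile, chi2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# toy data, purely illustrative
docs = [
    'the cat sat on the mat',
    'a dog chased the cat',
    'cats and dogs are pets',
    'stocks fell sharply today',
    'the market rallied on earnings',
    'investors sold shares and bonds',
]
labels = [0, 0, 0, 1, 1, 1]


def names_of(transformer):
    """Hypothetical helper: use whichever name method this scikit-learn version has."""
    if hasattr(transformer, 'get_feature_names_out'):
        return np.asarray(transformer.get_feature_names_out())
    return np.asarray(transformer.get_feature_names())


pipeline = Pipeline([
    ('union', FeatureUnion([
        ('vectorized_pipeline', Pipeline([
            ('union_vectorizer', FeatureUnion([
                ('word_tfidf', TfidfVectorizer(analyzer='word')),
                ('char_tfidf', TfidfVectorizer(analyzer='char', ngram_range=(2, 2))),
            ])),
            ('percentile_feature_selection', SelectPercentile(chi2, percentile=50)),
        ])),
    ])),
    ('rfe_feature_selection', RFE(LinearSVC(C=0.1), n_features_to_select=5, step=1)),
    ('clf', LinearSVC(C=0.1)),
])
pipeline.fit(docs, labels)

# Walk down level by level; each named_steps only knows its own direct steps.
outer_union = pipeline.named_steps['union']
inner_pipeline = dict(outer_union.transformer_list)['vectorized_pipeline']
inner_union = inner_pipeline.named_steps['union_vectorizer']
percentile = inner_pipeline.named_steps['percentile_feature_selection']
rfe = pipeline.named_steps['rfe_feature_selection']

all_names = names_of(inner_union)                        # every column the vectorizers emit
after_percentile = all_names[percentile.get_support()]   # columns that survive SelectPercentile
selected = after_percentile[rfe.get_support()]           # columns that survive RFE
print(selected)
```

With several branches in the outer FeatureUnion, as in your real pipeline, the same steps would be repeated per branch and the per-branch name arrays concatenated in transformer_list order before applying the RFE mask, since that is the column order the union produces. Steps that do not change the number of columns, such as a scaler, can be skipped when tracking names.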