Getting Started with Udacity Machine Learning - Feature Selection

Exercise: A New Enron Feature

poi_flag_email.py

    ### walk through the sender addresses of this email and flag it
    ### as soon as one of them belongs to a known POI
    if from_emails:
        ctr = 0
        while not from_poi and ctr < len(from_emails):
            if from_emails[ctr] in poi_email_list:
                from_poi = True
            ctr += 1


Exercise: Visualize New Features

studentCode.py

def computeFraction(poi_messages, all_messages):
    ### you fill in this code, so that it returns either
    ###     the fraction of all messages to this person that come from POIs
    ###     or
    ###     the fraction of all messages from this person that are sent to POIs
    ### the same code can be used to compute either quantity

    ### beware of "NaN" when there is no known email address (and so
    ### no filled email features), and integer division!
    ### in case of poi_messages or all_messages having "NaN" value, return 0.
    if poi_messages != 'NaN' and all_messages != 'NaN':
        fraction = float(poi_messages) / all_messages
    else:
        fraction = 0.

    return fraction
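
For context, here is a minimal sketch of how the finished function might be applied to the Enron data_dict to build the two new features. The dictionary keys "from_poi_to_this_person", "to_messages", "from_this_person_to_poi", and "from_messages" are the ones used elsewhere in the course dataset; treat the snippet as illustrative rather than the official solution.

    for name in data_dict:
        data_point = data_dict[name]

        ### fraction of the messages this person received that came from POIs
        fraction_from_poi = computeFraction(data_point["from_poi_to_this_person"],
                                            data_point["to_messages"])
        data_point["fraction_from_poi"] = fraction_from_poi

        ### fraction of the messages this person sent that went to POIs
        fraction_to_poi = computeFraction(data_point["from_this_person_to_poi"],
                                          data_point["from_messages"])
        data_point["fraction_to_poi"] = fraction_to_poi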

Beware of Feature Bugs:

Anyone can make mistakes - be skeptical of your results! Always be suspicious of 100% accuracy; extraordinary claims require extraordinary evidence. If a feature tracks your labels too closely, it is probably a bug! And if you are sure it is not a bug, you largely don't need machine learning - you can just use that feature to assign the labels.
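
A quick way to catch that kind of leak is to train a classifier on each feature by itself and flag any feature that predicts the labels almost perfectly on held-out data. This is a rough sketch, not part of the course code; the function name, the 0.95 threshold, and the assumption that features is a dense numpy array are all illustrative.

    ### flag features that, on their own, predict the labels suspiciously well
    from sklearn import cross_validation
    from sklearn.tree import DecisionTreeClassifier

    def find_leaky_features(features, labels, threshold=0.95):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            features, labels, test_size=0.3, random_state=42)
        suspicious = []
        for i in range(X_train.shape[1]):
            clf = DecisionTreeClassifier()
            clf.fit(X_train[:, [i]], y_train)
            ### near-perfect accuracy from a single feature usually means a leak
            if clf.score(X_test[:, [i]], y_test) > threshold:
                suspicious.append(i)
        return suspicious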

Remove features:

When to ignore a feature:



Features ≠ information. A feature is the actual number or characteristic attached to a data point; information is what you are trying to extract from those features.

For example: having a lot of features does not mean you have a lot of information - what matters is how much information those features actually carry. What we want is the smallest set of features that gives us as much information as possible, and if you think a feature is not giving you information, you should remove it.


There are various helper methods for automatic feature selection in sklearn. Most methods fall into the category of univariate feature selection, where each feature is treated independently and asked about its power in classification or regression.

There are two univariate feature selection tools in sklearn: SelectPercentile and SelectKBest . The difference between the two can be seen from the name: SelectPercentile selects the most powerful X% features (X is the parameter), while SelectKBest selects the K most powerful features (K is the parameter).
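
A minimal usage sketch of the two tools (the score function f_classif and the values k=10 and percentile=10 are just illustrative choices; features_train/labels_train stand for whatever training data you have):

    from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

    ### SelectKBest: keep the K strongest features
    selector = SelectKBest(f_classif, k=10)
    features_train_reduced = selector.fit_transform(features_train, labels_train)
    features_test_reduced  = selector.transform(features_test)

    ### SelectPercentile: keep the strongest X% of features
    selector = SelectPercentile(f_classif, percentile=10)
    features_train_reduced = selector.fit_transform(features_train, labels_train)
    features_test_reduced  = selector.transform(features_test)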


Classic high-bias situation: using too few features gives an oversimplified model that pays little attention to the data (high bias).


Classic high-variance situation: using too many features and over-tuning the parameters so the model is carefully optimized for the training data (high variance).


Balance point: use as few features as possible while still getting a large R-squared (or, in regression terms, a low residual sum of squared errors).


Too many features lead to high variance and poor generalization.
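
One way to look for that balance point in a regression setting (a sketch assuming features_train/labels_train and a held-out test set are dense numpy arrays; the list of k values is arbitrary): fit on the k best features for increasing k and watch where the test-set R-squared stops improving.

    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression

    ### test-set r-squared typically rises, flattens, then falls as
    ### extra features start adding variance instead of information
    for k in [1, 2, 5, 10, 20]:
        selector = SelectKBest(f_regression, k=k)
        X_train_k = selector.fit_transform(features_train, labels_train)
        X_test_k  = selector.transform(features_test)
        reg = LinearRegression().fit(X_train_k, labels_train)
        print k, reg.score(X_test_k, labels_test)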



A Regularized Regression: Lasso Regression

Ordinary linear regression minimizes the sum of squared errors of the fit (i.e. it shrinks the distance, or squared distance, between the fit and each data point). Lasso regression also minimizes the squared error, but in addition it penalizes the number of features used: it minimizes SSE + λ|β|, where λ is the penalty parameter and β are the regression coefficients (so the penalty term grows as more features are used). The idea behind the formula: using more features will give a smaller squared error and fit the points more precisely, but each extra feature carries a penalty, so a feature is only worth keeping if the gain in fit outweighs the penalty it adds. The formula therefore prescribes a trade-off between a smaller error and a simpler fit that uses fewer features.


Lasso Regression Exercises

.coef_ - print the coefficients

.predict([[2, 4]]) - make a prediction

.fit(features, labels) - fit the regression
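
Putting those three calls together, a minimal sketch of the sklearn Lasso workflow (features, labels, and the [[2, 4]] query point are placeholders):

    from sklearn.linear_model import Lasso

    regression = Lasso()
    regression.fit(features, labels)        ### fit
    print regression.coef_                  ### print coefficients
    print regression.predict([[2, 4]])      ### prediction for one point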



Feature selection mini-projects:

Decision trees are a classic example of an algorithm that is very easy to overfit, and the easiest way to get an overfit decision tree is to train it on a small training set with a large number of features.

1. If the decision tree is overfitted, do you expect the test set accuracy to be very high or fairly low? low

2. If the decision tree is overfitted, do you expect the training set accuracy to be high or low? high

3.

    A classic way to overfit an algorithm is to use a large number of features and a small amount of training data. You can find the starter code in feature_selection/find_signature.py. Set up the decision tree, train it on the training data, and print out the accuracy.

    According to the initial code, how many training points are there? 150

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

4.

What is the accuracy of the decision tree you just created? 0.950511945392

(Remember, we set up the decision tree for overfitting - ideally, we'd like to see relatively low test accuracy.)


5.

    Take the (overfit) decision tree and use the feature_importances_ attribute to get a list of the relative importance of every feature used (the list will be long because this is text data). We suggest iterating over this list and printing a feature's importance only if it exceeds a threshold (say 0.2 - remember, if all words were equally important, each one would have an importance well below 0.01).

    What is the importance of the most important feature? 0.764705882353 What is the number of this feature? 36584

(Because the Enron dataset from the text learning mini-project may differ, I did not get the officially expected answer, so the feature number given here is just my own result.)

6.

To figure out which word is causing the problem, go back to the TfIdf vectorizer and use the feature number you obtained in the previous part of the mini-project to look up the associated word. You can call get_feature_names() on the TfIdf vectorizer to get a list of all the words; pull out the word that is driving most of the decision tree's discrimination.

What is this word? Does it make sense that a word like this - essentially a signature uniquely associated with Chris Germany or Sara Shackleton - is acting as a discriminating feature?

sshacklensf

7.

    In a sense, this word looks like an outlier, so let's remove it and refit. Go back to text_learning/vectorize_text.py and remove this word from the emails the same way we removed "sara", "chris", etc. Rerun vectorize_text.py, and once it finishes, rerun find_signature.py.

    Does any other outlier jump out? What word is it? Does it look like a signature-type word? (As before, define an outlier as a feature with an importance greater than 0.2.) cgermannsf

8.

Update vectorize_text.py one more time and rerun it. Then rerun find_signature.py.

Are there any other important features (importance greater than 0.2)? How many? Do they look like "signature words", or more like "email content words" from the message body?

Yes, there is one more important word.

9. What is the accuracy of the decision tree now? 0.811149032992


find_signature.py

#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl"
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

words = vectorizer.get_feature_names()
### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]



### your code goes here
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)

### accuracy, method 1: score() on the test set
acc = clf.score(features_test, labels_test)
print acc

### accuracy, method 2: predict, then accuracy_score
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(labels_test, pred)


### print every feature whose importance exceeds the 0.2 threshold
print "Important features:"
for index, importance in enumerate(clf.feature_importances_):
    if importance > 0.2:
        print "feature no", index
        print "importance", importance
        print "word", words[index]


vectorize_text.py

        ### signature words to strip before the text goes into the TfIdf vectorizer;
        ### "sshacklensf" and "cgermannsf" are the outlier words found above
        stopwords = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
        for word in stopwords:
            words = words.replace(word, ' ')
        words = ' '.join(words.split())
