What can cause outliers? For each cause, you choose to ignore it or pay attention to it:
- Sensor failure (ignore)
- Data entry errors (ignore)
- External data (ignore)
- Unusual events (often needing attention, e.g. fraud detection)
Outlier detection/removal algorithm
1. Training
2. Outlier detection: find the points in the training set with the largest residual errors and remove them (typically about 10% of the points)
3. Retraining
(You may need to repeat steps 2 and 3 several times.)
Before deletion / after deletion: (scatter plots comparing the regression fit with the outliers present and after they are removed)
Summary of the outlier removal strategy. Note that in some applications (anomaly detection, fraud detection) the outliers are the points of interest: there you keep the outliers and set the "normal" data aside instead.
1. Train
2. Remove the points with the largest error (generally called the residual)
3. Retrain
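These three steps can be sketched end to end. Below is a minimal illustration on made-up data; the 10% fraction matches the notes above, but the two-round loop and the toy dataset are arbitrary choices for the demo:

```python
import numpy as np

def iterative_outlier_removal(x, y, n_rounds=2, frac=0.1):
    """Repeat: fit a line, drop the frac of points with the largest
    residuals, refit. Returns the final slope and the kept points."""
    for _ in range(n_rounds):
        # least-squares fit y = slope * x + intercept
        slope, intercept = np.polyfit(x, y, 1)
        residuals = np.abs(y - (slope * x + intercept))
        # keep the (1 - frac) share of points with the smallest residuals
        keep = residuals.argsort()[: int(np.ceil(len(x) * (1 - frac)))]
        x, y = x[keep], y[keep]
    slope, intercept = np.polyfit(x, y, 1)
    return slope, x, y

# toy data: a clean line plus two extreme outliers
x = np.arange(20, dtype=float)
y = 6.25 * x + 10.0
y[3] += 500.0   # inject outliers
y[15] -= 400.0

slope, x_clean, y_clean = iterative_outlier_removal(x, y)
print(slope)  # close to the true slope 6.25 once the outliers are gone
```

Each pass removes the worst 10% and refits, so the slope converges back toward the value used to generate the clean points.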
Outlier mini-project:
This project has two parts. In the first part, you will run a regression, identify and remove the 10% of points with the largest residuals, and then refit the regression on the cleaned dataset, as Sebastian suggests in the course video.
First part
1. Start by running the initial code ( outliers/outlier_removal_regression.py ) and visualizing the points. Some outliers should jump out. Deploy a linear regression where net worth is the target and the feature used to make the prediction is the person's age (remember to train on the training data!).
The correct slope for the main body of data points is 6.25 (which we know because we used that value to generate the data); the slope of your regression should be 5.07793064.
2. When using the regression to predict on the test data, you get a score of 0.878262470366.
3. You will find the skeleton of the outlierCleaner() function in outliers/outlier_cleaner.py ; fill it in with the cleaning algorithm. It takes three parameters: predictions is a list of the regression's predicted targets; ages is a list of the ages in the training set; net_worths is a list of the actual net worths in the training set. Each list should have 90 elements (since there are 90 points in the training set). Your job is to return a list called cleaned_data that has only 81 elements, i.e. the 81 training points (90 * 0.9 = 81) where the error between the predicted and actual net worths is smallest. The format of cleaned_data should be a list of tuples, where each tuple has the form (age, net_worth, error).
Once this cleaning function runs, you should see the regression results change. New slope: 6.36859481.
4. When using regression to predict on the test set, the new score is 0.983189455396
outlier_removal_regression.py
#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner

### load up some practice data with outliers in it
ages = pickle.load( open("practice_outliers_ages.pkl", "r") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "r") )

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

from sklearn.cross_validation import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like

### ========= answer =========
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)
print 'slope', reg.coef_
print 'r-square', reg.score(ages_test, net_worths_test)

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()

### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"

### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()

    ### ========= answer2 =========
    print 'slope', reg.coef_
    print 'r-square', reg.score(ages_test, net_worths_test)

else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"
outlier_cleaner.py
#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy as np
import math

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """
    cleaned_data = []

    ### your code goes here
    # flatten the (n, 1) column vectors into plain 1-D arrays
    ages = ages.reshape((1, len(ages)))[0]
    net_worths = net_worths.reshape((1, len(net_worths)))[0]
    predictions = predictions.reshape((1, len(predictions)))[0]

    # zip() packs the corresponding elements of its iterable arguments
    # into tuples and returns a list of those tuples
    cleaned_data = zip(ages, net_worths, abs(net_worths - predictions))

    # sort by error size, smallest first
    cleaned_data = sorted(cleaned_data, key=lambda x: x[2])

    # ceil() rounds up; count how many elements to delete (10%)
    cleaned_num = int(-1 * math.ceil(len(cleaned_data) * 0.1))

    # slice off the 10% with the largest errors
    cleaned_data = cleaned_data[:cleaned_num]

    return cleaned_data
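To sanity-check the cleaner, here is a quick toy run. This is a compact, self-contained restatement of the same logic (renamed outlier_cleaner to avoid confusion with the course file); the (n, 1) column-vector shapes mirror the project's inputs, but all numbers are made up:

```python
import numpy as np

def outlier_cleaner(predictions, ages, net_worths, frac=0.1):
    """Drop the frac of points with the largest |actual - predicted|
    error; return a list of (age, net_worth, error) tuples."""
    errors = np.abs(net_worths.ravel() - predictions.ravel())
    data = list(zip(ages.ravel(), net_worths.ravel(), errors))
    data.sort(key=lambda t: t[2])                  # smallest error first
    n_remove = int(np.ceil(len(data) * frac))      # how many to drop
    return data[:len(data) - n_remove]

# toy inputs shaped like the project's (n, 1) numpy arrays
ages = np.arange(10, dtype=float).reshape(10, 1)
net_worths = 6.25 * ages + 1.0
net_worths[7] += 100.0                  # one big injected outlier
predictions = 6.25 * ages + 1.0         # a pretend-perfect regression line

cleaned = outlier_cleaner(predictions, ages, net_worths)
print(len(cleaned))  # 9: the worst 10% (one point) was removed
```

The point with the injected 100-unit error sorts to the end and gets sliced off, just like the 9 worst points out of 90 in the real project.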
Second part
In the second part, you will become familiar with some of the outliers in Enron's financial data and learn whether/how to remove them.
1. Find the initial code in outliers/enron_outliers.py that reads in the data (in dictionary form) and converts it to a numpy array suitable for sklearn. Since two features ("salary" and "bonus") are extracted from the dictionary, the resulting numpy array dimension will be N x 2, where N is the number of data points and 2 is the number of features. This is perfect input for scatter plots; we'll use the matplotlib.pyplot module to draw the graphs. (In this course, we use pyplot for all visualizations.) Add these lines to the bottom of the script to plot the scatter:
for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
2. There is an outlier that should jump out immediately. The problem now is to identify its source. We found the original data source very helpful for identifying it; you can find the PDF at final_project/enron61702insiderpay.pdf .
What is the dictionary key name for this data point? (Example: If it's Ken Lay, then the answer is "LAY KENNETH L").
TOTAL
# ----- search for the anomaly -----
solve = data.reshape( (1, len(data) * len(data[0])) )[0]
max_value = sorted(solve, reverse=True)[0]
print max_value

import pprint
pp = pprint.PrettyPrinter(indent=4)
for item in data_dict:
    if data_dict[item]['bonus'] == max_value:
        print item   # the answer is crazy: TOTAL
3. Do you think this outlier should be removed, or left in as a data point?
- Clear it out: it's a spreadsheet quirk (a totals row, not a real person).
4. A quick way to delete a key-value pair from a dictionary is shown in the following line:
# remove the TOTAL outlier
data_dict.pop("TOTAL", 0)
Write a line of code like this to remove the outlier before calling featureFormat() . Then re-run the code, and your scatter plot will no longer contain this outlier.
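A toy illustration (made-up values) of why the second argument to pop() matters: pop(key, default) removes the key if present and returns its value, and returns the default instead of raising when the key is missing, so the removal line is safe to run more than once:

```python
data_dict = {"TOTAL": 1, "LAY KENNETH L": 2}

removed = data_dict.pop("TOTAL", 0)   # key exists: removed, value returned
missing = data_dict.pop("TOTAL", 0)   # key already gone: default 0 returned

print(removed, missing, sorted(data_dict))  # 1 0 ['LAY KENNETH L']
```

Without the default, a second call to pop("TOTAL") would raise a KeyError.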
But there are other outliers in the Enron data; it looks like about four more.
5. What are the names associated with the current Enron outliers? (give the name as written in the dictionary key value - eg: Phillip Allen would be ALLEN PHILLIP K)
These are the bosses:
- LAVORATO JOHN J (pink dot)
- LAY KENNETH L (upper orange dot)
- SKILLING JEFFREY K (red dot)
- FREVERT MARK A (lower orange dot)
# identify the outliers: two people earned bonuses of at least $5 million
# and salaries of more than $1 million
for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item
enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )

# answer: remove the TOTAL spreadsheet artifact before formatting features
data_dict.pop( 'TOTAL', 0 )

features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

### your code below

# answer: find the largest value and the key it belongs to
# solve = data.reshape( (1, len(data) * len(data[0])) )[0]
# max_value = sorted(solve, reverse=True)[0]
# print max_value
# import pprint
# pp = pprint.PrettyPrinter(indent=4)
# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item   # the answer is crazy: TOTAL

# answer: the remaining big earners
for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()