Getting Started with Udacity Machine Learning - Outliers

What can cause outliers? Depending on the cause, you either ignore them or pay close attention to them (as in fraud detection):

Sensor failure (ignore)

Data entry errors (ignore)

Unusual events (often worth paying attention to)



Outlier detection/removal algorithm

  1. Train

  2. Detect outliers: find the points in the training set with the largest residual errors and remove them (typically about 10% of the points)

  3. Retrain

(Steps 2 and 3 may need to be repeated several times.)
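The loop above can be sketched end to end with numpy alone (a minimal illustration on synthetic data; np.polyfit stands in for the course's sklearn regression, and the true slope of 6.25 is borrowed from the mini-project below):

```python
import numpy as np

rng = np.random.RandomState(42)

# synthetic data: true slope 6.25 (the value used in the mini-project below),
# plus five extreme outliers
ages = rng.uniform(20, 65, 100)
net_worths = 6.25 * ages + rng.normal(0, 10, 100)
net_worths[:5] += 500.0  # corrupt five points

# 1. train
slope_before, intercept = np.polyfit(ages, net_worths, 1)

# 2. detect outliers: drop the ~10% of points with the largest residuals
residuals = np.abs(net_worths - (slope_before * ages + intercept))
keep = residuals.argsort()[: int(len(ages) * 0.9)]

# 3. retrain on the cleaned points
slope_after, intercept_after = np.polyfit(ages[keep], net_worths[keep], 1)
```

With outliers this extreme, the refit slope lands close to the true 6.25; steps 2 and 3 can be repeated if suspicious points remain.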

Before removal: (scatter plot of the data with the outliers still included)

After removal: (scatter plot after the outliers have been cleaned out)

Summary of the outlier removal strategy. Note that in some applications (anomaly detection, fraud detection) the outliers are exactly the points of interest: there you would keep the outliers and discard the normal data instead.

1. Train

2. Remove the points with the largest errors (usually called residuals)

3. Retrain


Outlier mini-project:

This project has two parts. In the first part, you will run a regression, identify the points with the 10% largest residuals, remove those outliers from the dataset, and refit the regression, as Sebastian suggests in the course video.



First part

1. Start by running the starter code ( outliers/outlier_removal_regression.py ) and visualizing the points. Some outliers should jump out at you. Fit a linear regression in which net worth is the target and the feature used to make the prediction is the person's age (remember to train on the training data!).

    The correct slope for the body of data points is 6.25 (which we know because we used that value to generate the data); the slope of your regression is 5.07793064

2. When you use the regression to predict on the test data, you get a score of 0.878262470366


3. You will find the skeleton of the outlierCleaner() function in outliers/outlier_cleaner.py; populate it with the cleaning algorithm. It takes three arguments: predictions is a list of the regression's predicted targets; ages is a list of the ages in the training set; net_worths is a list of the actual net worths in the training set. Each list should have 90 elements (since there are 90 points in the training set). Your job is to return a list called cleaned_data that has only 81 elements, i.e. the 81 training points (90 * 0.9 = 81) where the error between the predicted and actual net worth is smallest. cleaned_data should be a list of tuples, where each tuple is of the form (age, net_worth, error).

Once this cleaning function runs, you should see the regression results change. The new slope is 6.36859481


4. When using regression to predict on the test set, the new score is 0.983189455396

outlier_removal_regression.py

#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


### load up some practice data with outliers in it
ages = pickle.load( open("practice_outliers_ages.pkl", "r") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "r") )



### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
from sklearn.cross_validation import train_test_split  # moved to sklearn.model_selection in newer versions
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
#=========answer======================
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train,net_worths_train)
print 'slope', reg.coef_
print 'r-square', reg.score(ages_test,net_worths_test)



try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()


### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"


### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()
    #========answer2========
    print 'slope', reg.coef_
    print 'r-square', reg.score(ages_test, net_worths_test)

else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"

outlier_cleaner.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np
import math

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """
    
    cleaned_data = []

    ### your code goes here

    ages = ages.reshape((1,len(ages)))[0]
    net_worths = net_worths.reshape((1,len(ages)))[0]
    predictions = predictions.reshape((1,len(ages)))[0]
    # zip() packs corresponding elements of the iterables into tuples and returns a list of those tuples (in Python 2)
    cleaned_data = zip(ages,net_worths,abs(net_worths-predictions))
    #sort by error size
    cleaned_data = sorted(cleaned_data , key=lambda x: (x[2]))
    # ceil() rounds up; count how many elements to delete
    cleaned_num = int(-1 * math.ceil(len(cleaned_data)* 0.1))
    #slice
    cleaned_data = cleaned_data[:cleaned_num]
    return cleaned_data
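As a sanity check, the same cleaning logic can be exercised on toy data (rewritten here in Python 3, since the course scripts above are Python 2; the ages and net worths below are synthetic):

```python
import math
import numpy as np

def outlier_cleaner(predictions, ages, net_worths):
    # same algorithm as outlierCleaner above: pair each point with its
    # absolute error, sort by error, and drop the worst 10%
    errors = np.abs(net_worths - predictions)
    cleaned = sorted(zip(ages, net_worths, errors), key=lambda t: t[2])
    n_drop = int(math.ceil(len(cleaned) * 0.1))
    return cleaned[: len(cleaned) - n_drop]

rng = np.random.RandomState(0)
ages = rng.uniform(20, 65, 90)
net_worths = 6.25 * ages + rng.normal(0, 10, 90)
predictions = 6.25 * ages  # stand-in for reg.predict() on the training set
cleaned = outlier_cleaner(predictions, ages, net_worths)
# 90 points in, 81 tuples of (age, net_worth, error) out
```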


Second part

In the second part, you will become familiar with some of the outliers in Enron's financial data and learn if/how to remove them.

1. Find the starter code in outliers/enron_outliers.py, which reads in the data (as a dictionary) and converts it into a numpy array suitable for sklearn. Since two features ("salary" and "bonus") are extracted from the dictionary, the resulting numpy array has dimension N x 2, where N is the number of data points and 2 is the number of features. This is perfect input for a scatter plot; we'll use the matplotlib.pyplot module to draw it. (We use pyplot for all visualizations in this course.) Add these lines to the bottom of the script to draw the scatter plot:

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

2. There is one outlier that should jump out immediately. The task now is to identify its source. We found the original data source very helpful for this; you can find the PDF at  final_project/enron61702insiderpay.pdf .

What is the dictionary key name for this data point? (Example: If it's Ken Lay, then the answer is "LAY KENNETH L").

TOTAL

#-----Search for anomalies--------------
solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
max_value = sorted(solve,reverse=True)[0]
print max_value

import pprint
pp = pprint.PrettyPrinter (indent = 4)

for item in data_dict:
	if data_dict[item]['bonus'] == max_value:
		print item  # prints the answer: TOTAL
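The reshape-and-sort trick above simply flattens the array and takes its largest element; on a toy array, numpy's data.max() gives the same answer in one call:

```python
import numpy as np

data = np.array([[100.0, 2.0e6], [9.0e7, 4.0e5]])  # toy salary/bonus rows
solve = data.reshape((1, data.size))[0]     # flatten the N x 2 array to 1-D
max_value = sorted(solve, reverse=True)[0]  # largest value overall
assert max_value == data.max()              # equivalent one-liner
```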

3. Do you think this outlier should be removed, or left in as a data point?

  • Clear it out, it's a spreadsheet bug

4. One quick way to remove a key-value pair from a dictionary is shown in the following line:

#Remove TOTAL outliers
data_dict.pop("TOTAL",0)

Write a line of code like this, removing the outlier before featureFormat() is called. Then re-run the code and your scatter plot will no longer contain this outlier.
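A small sketch of why pop with a default is convenient here (the dictionary below is a toy stand-in, not the real dataset):

```python
# toy stand-in for data_dict; the values are illustrative
data_dict = {"TOTAL": {"salary": 1}, "LAY KENNETH L": {"salary": 2}}

removed = data_dict.pop("TOTAL", 0)  # key removed, its value returned
missing = data_dict.pop("TOTAL", 0)  # key already gone: default 0 returned
```

Because of the default argument, the line is safe to run even if the key has already been removed; without it, the second pop would raise a KeyError.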


But there are other outliers in the Enron data: maybe four more.

5. What are the names associated with the current Enron outliers? (give the name as written in the dictionary key value - eg: Phillip Allen would be ALLEN PHILLIP K)

These are the bosses:

LAVORATO JOHN J (pink dot)

LAY KENNETH L (orange dot)

SKILLING JEFFREY K (red dot)

FREVERT MARK A (lower orange dot)

# Identify the outliers: two people made at least $5 million in bonuses and more than $1 million in salary

for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item
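The same filter, sketched in Python 3 on a toy dictionary (the names and figures below are illustrative, not the real Enron data). Note that the 'NaN' string checks must come first: in Python 3, comparing a string to a number raises a TypeError:

```python
# toy stand-in for data_dict; names and figures are illustrative
data_dict = {
    "BOSS A": {"bonus": 7_000_000, "salary": 1_100_000},
    "BOSS B": {"bonus": 5_600_000, "salary": 1_300_000},
    "EMPLOYEE C": {"bonus": "NaN", "salary": 250_000},
    "EMPLOYEE D": {"bonus": 400_000, "salary": 90_000},
}

big_earners = [
    name
    for name, row in data_dict.items()
    # the string checks short-circuit before the numeric comparisons run
    if row["bonus"] != "NaN" and row["salary"] != "NaN"
    and row["bonus"] > 5e6 and row["salary"] > 1e6
]
```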

enron_outliers.py

#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )

# answer
data_dict.pop( 'TOTAL', 0 )

features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

### your code below
# answer
# solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
# max_value = sorted(solve,reverse=True)[0]
# print max_value

# import pprint
# pp = pprint.PrettyPrinter (indent = 4)

# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item # the answer is crazy

# answer
for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
