Background

Last day we talk about Python Basics in Chinese. Today, we will do data analysis with python and explain in English(No zuo, no die. In this section, we will discuss prominent hypotheses that have been proposed to explain the EU referendum result and how we try to capture them in our empirical analysis.

In this topic, we try to find application of Regularization Linear Regression and Overfitting.

Overfitting: the model is already so complex that it fits the idiosyncrasies of our training data, idiosyncrasies which limit the model's ability to generalize (as measured by the testing error).

Refer to there to see more background.
Data can be download here

we will look at four broad groups of variables:

EU exposure through immigration, trade and structural funds;
local public service provision and fiscal consolidation;
demography and education;
economic structure, wages and unemployment.

Let's have a brief view of the data

We can use pandas package to load various data file, including stata file (ending with ".dta").

from pandas.io.stata import StataReader, StataWriter
import pandas as pd
stata_data = StataReader("referendum data.dta", convert_categoricals=False)

data = stata_data.read()# basic data
varlist = stata_data.varlist# variable list
value_labels = stata_data.value_labels() # labels/ description of data value
fmtlist = stata_data.fmtlist
variable_labels = stata_data.variable_labels()# labels/ description of the variables

var = [i for i in variable_labels]
var_label = [variable_labels[i] for i in variable_labels]
df_labels = pd.DataFrame({"variable": var, "variable_label": var_label})# we use DataFrame to see the formated table of variable labels

Then, we can see variables meaning each, and the labels include: Total votes of remain/leave, Region of the votes, Population 60 older growth (2001-2011), Population 60 older (2001), Median hourly pay (2005), Median hourly pay change (2005-2015), Non-EU migrant resident share (2001), Non-EU migrant resident growth (2001-2011), Change in low skilled labour force share (2001-2011), Unemployment rate (2015), Total economy EU dependence (2010), Total fiscal cuts (2010-2015), EU Structural Funds per capita (2013) and so on.

Variable selection analysis

Dependent variable(DV): we choose 'Pct_Remain' as our DV as it decide whether remain or leave
Independent variable(IV): we choose 'Region', 'pensionergrowth20012011', 'ResidentAge60plusshare',
'median_hourly_pay2005', 'median_hourly_pay_growth', 'NONEU_2001Migrantshare', 'NONEU_Migrantgrowth',
'unqualifiedsharechange', 'umemployment_rate_aps', 'Total_EconomyEU_dependence', 'TotalImpactFLWAAYR', 'eufundspercapitapayment13'
Four types of IV:
- EU exposure through immigration('NONEU_2001Migrantshare', 'NONEU_Migrantgrowth'), trade and structural funds('eufundspercapitapayment13');
- Fiscal consolidation('TotalImpactFLWAAYR');
- demography('Region', 'pensionergrowth20012011', 'ResidentAge60plusshare', 'unqualifiedsharechange');
- economic structure('Total_EconomyEU_dependence'), wages('median_hourly_pay2005', 'median_hourly_pay_growth') and unemployment('umemployment_rate_aps').
IVs = ['Region', 'pensionergrowth20012011', 'ResidentAge60plusshare', 'median_hourly_pay2005', 'median_hourly_pay_growth', 'NONEU_2001Migrantshare', 'NONEU_Migrantgrowth', 'unqualifiedsharechange', 'umemployment_rate_aps', 'Total_EconomyEU_dependence', 'TotalImpactFLWAAYR', 'eufundspercapitapayment13']

df1 = pd.read_stata("referendum data.dta")
df1 = df1.set_index("id")# load the stata data into DataFrame format data (df1) directly.
df1.shape

We can see the data has many attributes, and this lead problem to our model - overfitting and lacking generalization (generalize to new, unseen cases), we'll solve the problem next.

drop_list = list(set(var) - set(IVs) - {"id"} - {"Pct_Remain"})# to get our df including only selected attributes
df = df1.drop(drop_list, axis = 1)
df = df.dropna()
df.describe()# we can see the data has different magnitude(some less than 1, while some over 100, and one string type for "Region")

Encode the region

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

#Next, we use LabelEncoder().fit_transform to transform Region into norminal variable
from sklearn.preprocessing import LabelEncoder
df['Region'] = LabelEncoder().fit_transform(df['Region'])
df.describe()

Y = pd.DataFrame(df['Pct_Remain'])# we select x and Y out as the IV and DVs
x = df.drop('Pct_Remain', axis = 1)
x.head()# view the x

Standardise the X - varibales

#We see that magnitude of our variables are different, we need to standardise the X - variables 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x)
scaled_x = scaler.transform(x)
x_scaled_df = pd.DataFrame({x.columns[i]: scaled_x[:,i] for i in range(12)})
x_scaled_df.head()

Split data into train set and test set

from sklearn.model_selection import train_test_split
seed = 2019
test_size = 0.20

X_train, X_test, Y_train, Y_test = train_test_split(x_scaled_df, y, test_size = test_size, random_state = seed)

Train Model

# Import a range of sklearn functions
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, RidgeCV, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectFromModel

Fit an OLS linear model to the x and y

#Fit an OLS model to the data and view the coefficients
reg_OLS = LinearRegression(normalize=True).fit(X_train, Y_train)
coef = pd.Series(reg_OLS.coef_, index = x_scaled_df.columns)
print(coef)

To compare, we also choose Lasso model - A Regularization Technique

Fit a Lasso model - A Regularization Technique

Lasso model use alpha as the penalty parameter to solve overfitting problem.

LassoCV can be used to iteratively fit the alpha parameter. Try running this now and printing out the coeffs

reg_LassoCV = LassoCV(normalize=True)
reg_LassoCV.fit(X_train,Y_train)

print("Best alpha using built-in LassoCV: %f" % reg_LassoCV.alpha_)
print("Best score using built-in LassoCV: %f" %reg_LassoCV.score(X_train,Y_train))
coef = pd.Series(reg_LassoCV.coef_, index = x_scaled_df.columns)
print(coef)

%matplotlib inline
import matplotlib.pyplot as plt# let's see the coefficient in graph
imp_coef = coef.sort_values()
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")

View accuracy of the model

There, we assume when "Pct_Remain" (percentage of those who choose remain) is less than 0.5, people choose to leave while over 0.5, remain.

from sklearn.metrics import *
def predict( model, X_test = X_test, Y_test = Y_test):
    """
    Input: trained model, test_IVs(X) and test_DV(Y)
    Output: accuracy of the model(as the assumption defines)
    """
    Y_test_0_1 = [0 if i<50 else 1 for i in Y_test]
    y_pred = model.predict(X_test)
    predictions = [0 if i<50 else 1 for i in y_pred]
    accuracy = accuracy_score(Y_test_0_1, predictions)
    confusion_matrix1 = confusion_matrix(Y_test_0_1, predictions)
    print(model)
    print("Confusion matrix:\n"+str(confusion_matrix1))
    return ("Accuracy: %.2f%%" % (accuracy * 100.0))

predict(reg_OLS, X_test, Y_test)

predict(reg_LassoCV, X_test, Y_test)

We can see that regularized OLS gives same result as OLS model, mainly because we selecte 12 IVs (far more less than 367 observations) by observation or experience. Thus, the train model doesn't meet the overfitting problem.

What if we choose all variables?

df1_all = df1.dropna()
Y_all = df1_all["Pct_Remain"]
X_all = df1_all.drop("Pct_Remain", axis = 1)
for i in X_all.columns:
    X_all[i] = LabelEncoder().fit_transform(X_all[i])#Label Encoder

scaler = StandardScaler().fit(X_all)
scaled_x = scaler.transform(X_all)
x_all_scaled_df = pd.DataFrame({X_all.columns[i]: scaled_x[:,i] for i in range(len(X_all.columns))})#standardise X

seed = 2019
test_size = 0.20
X_train_all, X_test_all, Y_train_all, Y_test_all = train_test_split(x_all_scaled_df, Y_all, test_size = test_size, random_state = seed)

reg_OLS_all = LinearRegression(normalize=True).fit(X_train_all, Y_train_all)
reg_LassoCV_all = LassoCV(normalize=True)
reg_LassoCV_all.fit(X_train_all,Y_train_all)

predict(reg_OLS_all, X_test_all, Y_test_all)
predict(reg_LassoCV_all, X_test_all, Y_test_all)

We can see that regularized OLS gives better prediction accuracy than OLS model, mainly because we selecte all 108 IVs (nearly equals 140, the number of train sets). Thus, the trained model meet the overfitting problem, and LASSO solve it well.

Conclusion

When there are many attributes, be alarm as the traditional Regression model may meet overfitting problem(However, other models such as Decision Tree, Random Forest, Gradient Boosting and so on can solve it easily)
We have to strike a balance between variance and (inductive) bias: our model needs to have sufficient complexity to model the relationship between the predictors and the response, but it must not fit the idiosyncrasies of our training data.
Idiosyncrasies which limit the model's ability to generalize (as measured by the testing error).

Further more

We can try various models (such as XGBoost Model, Support Vector Machine, Random Forest, ...) on this dataset, and let's leave it as this blog's homework.

Predict Referendum by sklearn package in python