Learning Python programming: an in-depth analysis, based on the source code of shap.datasets.adult(), of how the outputs X, y differ from X_display, y_display

Table of contents

In-depth analysis of X, y and X_display, y_display in the source code of shap.datasets.adult()

Read the source code

Understand the source code

Comparison of data and raw_data

X.shape 

X_display.shape 


In-depth analysis of X, y and X_display, y_display in the source code of shap.datasets.adult()

import shap

X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)

Read the source code

def adult(display=False):
    """ Return the Adult census data in a nice package. """
    dtypes = [
        ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
        ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
        ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
        ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
        ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
    ]
    raw_data = pd.read_csv(
        cache(github_data_url + "adult.data"),
        names=[d[0] for d in dtypes],
        na_values="?",
        dtype=dict(dtypes)
    )
    data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
    filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
    data["Target"] = data["Target"] == " >50K"
    rcode = {
        "Not-in-family": 0,
        "Unmarried": 1,
        "Other-relative": 2,
        "Own-child": 3,
        "Husband": 4,
        "Wife": 5
    }
    for k, dtype in filt_dtypes:
        if dtype == "category":
            if k == "Relationship":
                data[k] = np.array([rcode[v.strip()] for v in data[k]])
            else:
                data[k] = data[k].cat.codes

    if display:
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
    return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
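
The encoding loop in the source above can be tried on a small scale. The following is a minimal sketch, assuming a tiny hand-made DataFrame (the three rows and the names raw and encoded are illustrative and not part of shap). It shows the two encoding paths: pandas category codes for an ordinary categorical column, and the fixed rcode mapping for Relationship. The values carry a leading space, as in adult.data, which is why the source calls strip() before the lookup.

```python
import numpy as np
import pandas as pd

# Tiny hand-made sample (an assumption for illustration, not the real dataset).
raw = pd.DataFrame({
    "Workclass": pd.Series([" State-gov", " Private", " Private"], dtype="category"),
    "Relationship": pd.Series([" Not-in-family", " Husband", " Wife"], dtype="category"),
})

# Same fixed mapping as in the shap source.
rcode = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}

encoded = raw.copy()
# Ordinary categorical column: replace each label with its integer category code.
encoded["Workclass"] = encoded["Workclass"].cat.codes
# Relationship: strip the leading space, then look the label up in rcode.
encoded["Relationship"] = np.array([rcode[v.strip()] for v in raw["Relationship"]])

print(encoded["Workclass"].tolist())     # [1, 0, 0]
print(encoded["Relationship"].tolist())  # [0, 4, 5]
print(raw["Workclass"].tolist())         # labels are unchanged in raw
```

Here raw plays the role of raw_data (what display=True returns) and encoded plays the role of data (what the default call returns).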

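Understand the source code

The function reads adult.data into raw_data with 15 named columns, treating "?" as a missing value. It then builds data by dropping Education (redundant with Education-Num), converting Target to a boolean (True where the label is " >50K"), and replacing every categorical column with integer codes: Relationship uses the fixed rcode mapping, and the other categorical columns use their pandas category codes. With display=True the function returns raw_data without the Education, Target and fnlwgt columns; otherwise it returns data without the Target and fnlwgt columns. In both cases the second return value is data["Target"].values.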

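Comparison of data and raw_data

Conclusion: data is a new DataFrame derived from raw_data, which is read from the CSV file. Three columns (Education, Target and fnlwgt) are dropped from it in total, and the categorical columns are encoded as integers. raw_data is the original data as read from the CSV file; with display=True the same three columns are dropped and everything else is returned as is.
Meaning: X and X_display have the same shape and the same column names, but X holds integer codes in its categorical columns while X_display keeps the original labels. y and y_display are the same boolean array, because both calls return data["Target"].values.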
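
One way to check which columns differ between the two return values is to compare them column by column. The sketch below assumes two small hand-made frames standing in for X and X_display; the helper name differing_columns is hypothetical and not part of shap.

```python
import pandas as pd

def differing_columns(X, X_display):
    """Return the names of columns whose printed values differ between the frames."""
    return [
        c for c in X.columns
        if not X[c].astype(str).equals(X_display[c].astype(str))
    ]

# Hand-made stand-ins (assumed values, not real output of shap.datasets.adult()).
X_small = pd.DataFrame({"Age": [39.0, 50.0], "Workclass": [1, 0]})
X_display_small = pd.DataFrame({"Age": [39.0, 50.0],
                                "Workclass": [" State-gov", " Private"]})

print(differing_columns(X_small, X_display_small))  # ['Workclass']
```

Run against the real X and X_display, a helper like this would be expected to list only the categorical columns, since the numeric columns are passed through without encoding.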

X.shape 

X.shape: (32561, 12)
        age         workclass  ...  hours-per-week native-country
0       39         State-gov  ...              40  United-States
1       50  Self-emp-not-inc  ...              13  United-States
2       38           Private  ...              40  United-States
3       53           Private  ...              40  United-States
4       28           Private  ...              40           Cuba
...    ...               ...  ...             ...            ...
32556   27           Private  ...              38  United-States
32557   40           Private  ...              40  United-States
32558   58           Private  ...              40  United-States
32559   22           Private  ...              20  United-States
32560   52      Self-emp-inc  ...              40  United-States

[32561 rows x 12 columns]
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

X_display.shape 

X_display.shape: (32561, 12)
        age         workclass  ...  hours-per-week native-country
0       39         State-gov  ...              40  United-States
1       50  Self-emp-not-inc  ...              13  United-States
2       38           Private  ...              40  United-States
3       53           Private  ...              40  United-States
4       28           Private  ...              40           Cuba
...    ...               ...  ...             ...            ...
32556   27           Private  ...              38  United-States
32557   40           Private  ...              40  United-States
32558   58           Private  ...              40  United-States
32559   22           Private  ...              20  United-States
32560   52      Self-emp-inc  ...              40  United-States

[32561 rows x 12 columns]
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

Origin blog.csdn.net/qq_41185868/article/details/125611687