Machine Learning -- Feature Engineering: Feature Generation and Feature Selection

  1. Overview. In the last section we said that feature engineering is one of the core parts of machine learning, and we covered some of its basics, such as missing value handling and categorical data encoding. But that content alone is not enough for many situations in practical work. For example, if our raw data has too many features, which ones should we choose as training features? If our features are too few, can we use the existing ones to create new features that are more closely related to our target? These are scenarios we run into all the time in feature engineering, and the techniques involved are ones every machine learning or data analysis engineer has to master. These techniques are commonly called feature generation and feature selection. Let's talk about these two technical points.
  2. Feature generation. For this technical point there is honestly no magic trick; it comes down to a deep understanding of what our data means, plus a touch of creativity. Sounds underwhelming, ha ha, is that all? Of course not, but there really is no single unified recipe here, and a certain amount of improvisation is involved. Still, by summing up experience we can distill some common patterns that are handy to refer to when the time comes. 2.1 Interaction. This essentially means crossing features: we can splice several features together directly to form an "interesting" new feature. Keep in mind that the combination must be meaningful, otherwise you not only gain nothing, you may even ruin data that was perfectly fine; don't do it just to show off, the cleverness should be invisible. So what is the point of interactions? First, they fold multiple columns into one, which makes the data easier to work with; second, for some specific data, such an interaction reflects the nature of the data better than the separate columns do. As for how to do it concretely, let's show it with a simple piece of code. Note that I am only excerpting part of my code and the data is assumed to be loaded already, so don't worry about where my variables and data come from; focus on the process and the ideas.
    interactions = data_raw["category"]+"_"+data_raw["country"]
    baseline_data = baseline_data.assign(category_country = label_encoder.fit_transform(interactions))
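    The excerpt above assumes that data_raw, baseline_data and label_encoder were created earlier in my notebook. For readers who want something they can run directly, here is a minimal self-contained sketch of the same pattern on made-up toy data (the column values are invented, not from my data set):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # hypothetical toy data standing in for data_raw
    data_raw = pd.DataFrame({
        "category": ["Games", "Music", "Games"],
        "country":  ["US", "GB", "DE"],
    })

    # splice the two categorical columns into one interaction feature ...
    interactions = data_raw["category"] + "_" + data_raw["country"]

    # ... then label-encode it so the model sees a single numeric column
    label_encoder = LabelEncoder()
    data_raw = data_raw.assign(category_country=label_encoder.fit_transform(interactions))
    print(data_raw)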

    In the interaction code above, the first line builds the interaction itself, and the second line label-encodes the interaction and adds it to our data set, nice and simple. That is how we connect category and country from the raw data into one new feature. 2.2 Numerical transforming. What does this mean? Some numerical columns have a very uneven distribution, or their values are far too large or too small, which is sometimes not suitable for training; it can lead to vanishing or exploding gradients. What exactly vanishing and exploding gradients are will be explained slowly in later content; for now just know that they are troublesome and can keep the model from training quickly and well. So how do we solve it? Mainly with some common mathematical transformations, for example log or sqrt. The following code shows how simple it is.

    np.sqrt(baseline_data['goal'])
    np.log(baseline_data['goal'])
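    If you want to keep the transformed values as actual columns of the data set, a small sketch could look like the following; this is just my illustration on the same baseline_data as above, and I use np.log1p (log(1 + x)) on the assumption that 'goal' might contain zeros, for which plain np.log would give -inf:

    import numpy as np

    # attach the transformed values back onto the data set
    baseline_data = baseline_data.assign(
        sqrt_goal=np.sqrt(baseline_data["goal"]),
        log_goal=np.log1p(baseline_data["goal"]),   # log1p is safe when goal == 0
    )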

    As we can see, here we mainly use the APIs numpy provides, very simple. Oh, by the way, I almost forgot one thing: numerical transforming is useless for tree-based models, because all tree-based models are scale invariant, i.e. a tree-based model does not care about the scale of the data distribution. 2.3 Rolling. This one is a bit more advanced than the previous two ways. First we need to understand the concept of rolling: rolling essentially slides a fixed-size window over our data (a series) and then performs some simple calculation over the data covered by the window, for example count, mean, sum and so on. If it still feels unclear, the official documentation has plenty of examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html . I'll also write a small example for your reference.

    # use the launch time as the index so we can roll over a time window
    launched = pd.Series(data_raw.index, data_raw.launched, name="count_7_days").sort_index()
    # count how many projects were launched in the 7 days up to and including each launch
    count_7_days = launched.rolling('7d').count()
    # restore the original index and align the result with the raw data
    count_7_days.index = launched.values
    count_7_days = count_7_days.reindex(data_raw.index)
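    Before walking through that snippet, here is a tiny self-contained toy sketch (the dates are made up, not from my data set) of what rolling('7d') does on a time-indexed series:

    import pandas as pd

    # four "events"; for each one, count the events in the 7 days up to and including it
    events = pd.Series(
        1,
        index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-06", "2024-01-15"]),
    )
    print(events.rolling("7d").count())
    # 2024-01-01    1.0
    # 2024-01-03    2.0
    # 2024-01-06    3.0   <- 01-01, 01-03 and 01-06 all fall inside the window
    # 2024-01-15    1.0   <- the earlier events are more than 7 days away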

    Now back to the data_raw snippet above. The first line turns the launch time into the index of a series; the second line uses rolling to count the launches over the last 7 days; the third and fourth lines restore the original index so the result can be joined back onto the original data set. So when is rolling generally useful? Typically when you have datetime data and the records before (or after) a point in time influence the outcome at that point. That is not a hard rule, though; it still takes some creativity. The example above counts how many apps were uploaded over the last seven days, in a scenario where we want to predict whether a particular app gets downloaded: generally, the more apps uploaded recently, the lower the chance that any one of them is downloaded, so the two are related and generating this new feature makes sense. 2.4 Time delta. As the name suggests, this one is also about time. Time delta carries a certain amount of arbitrariness: it is not always needed, it depends on the characteristics of the actual data, and even on the engineer's own judgment, somewhat like rolling above. To keep the explanation concrete, let me again go straight to an example and then walk through it.

    def time_since_last_project_same_category(series):
        # difference between neighboring launch times, in hours
        return series.diff().dt.total_seconds() / 3600

    df = data_raw[['category', 'launched']].sort_values('launched')
    group_category = df.groupby('category')
    # time delta between neighboring projects within the same category
    timedeltas = group_category.transform(time_since_last_project_same_category)
    # fill the missing first value of each group and realign with the original index
    timedeltas = timedeltas.fillna(timedeltas.mean()).reindex(baseline_data.index)

    In the snippet above, the function at the top computes the difference between neighboring datetime values and converts it to hours. The next line builds a dataframe with just category and launched, sorted by launch time; then we group that dataframe by category; the transform call applies our function within each group, i.e. it computes the time delta between neighboring projects of the same category and returns those values; and the last line fills the missing values (the first project of each category has no predecessor) with the mean and reindexes the result so it can be joined back onto the original data set. That completes the process. The scenario here is: for neighboring apps of the same category, how far apart were their upload times? That is a factor which can genuinely influence whether our app gets downloaded. So this is where our creativity comes in; real situations vary endlessly, and a generated feature must be grounded in the actual business need, otherwise you are just showing off and may achieve the opposite of what you wanted. Of course there are many more ways to generate features, such as taking the difference between two columns, taking a modulo, and so on. There is no uniform standard; the only real shortcut is to understand the actual business meaning of every column in the dataset, otherwise no amount of fast hardware will save you. To round off feature generation, the sketch below shows the groupby + diff mechanics once more on made-up data, and after that we move into the final part of this chapter: feature selection.
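    A toy sketch of that pattern (the data here is invented, purely for illustration):

    import pandas as pd

    # for each category, how many hours passed since the previous launch?
    df_toy = pd.DataFrame({
        "category": ["Games", "Music", "Games", "Music"],
        "launched": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00",
                                    "2024-01-02 00:00", "2024-01-03 06:00"]),
    }).sort_values("launched")

    hours_since_last = (
        df_toy.groupby("category")["launched"]
              .diff()                       # NaT for the first project of each category
              .dt.total_seconds() / 3600    # convert the timedelta to hours
    )
    print(hours_since_last)
    # 0     NaN
    # 1     NaN
    # 2    24.0   <- Games project, 24 hours after the previous Games project
    # 3    48.0   <- Music project, 48 hours after the previous Music project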

  3. Feature selection. Once missing values, categorical data and feature generation have all been handled, we arrive at the last step of feature engineering: feature selection. It simply means deciding which data we will ultimately use to train our model. Choose well and the model's generalization, efficiency and accuracy all improve; choose badly and our previous efforts may be wasted. Personally, I think feature selection is a combination of personal experience and a number of selection techniques used to pick the best set of features for training. Personal experience is the engineer's own understanding of the data: some features obviously have nothing to do with the target, for example a phone number has nothing to do with the price of a mobile device, so we simply delete the phone-number feature; other features obviously have a strong relationship with the target, for example memory size clearly correlates with a phone's price, so we will certainly keep the memory feature. But what about the many ambiguous features that personal experience cannot settle? As I said, we can use some techniques to choose. Here we introduce two commonly used feature selection techniques. 3.1 F-classification method. This computes, for each column separately, its association with the target, and then keeps the K most relevant columns, where K is a value we choose ourselves. We don't need to know the implementation details in great depth, because sklearn has already built the wheel for us; let's get a feel for its charm from the code below.
    from sklearn.feature_selection import SelectKBest, f_classif
    selector = SelectKBest(score_func = f_classif, k = 5)
    train,valid,test = get_data_splits(baseline_data, 0.1)
    feature_cols = train.columns.drop("outcome")
    X_new = selector.fit_transform(train[feature_cols],train["outcome"] )

    #get back to the features we kept
    features = pd.DataFrame(selector.inverse_transform(X_new), index = train.index, columns = feature_cols)
    #drop the columns that the values are all 0s
    feature_cols_final = features.columns[features.var()!=0]
    features_final = features[feature_cols_final]
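    One note before reading on: get_data_splits is a helper I defined elsewhere in my notebook and did not paste here. Purely as a hypothetical sketch of what such a helper might look like (assuming a simple ordered split where the last two chunks hold the validation and test rows), it could be something like:

    def get_data_splits(dataframe, valid_fraction=0.1):
        # ordered split: the last two blocks of `valid_fraction` rows
        # become the validation and test sets
        valid_rows = int(len(dataframe) * valid_fraction)
        train = dataframe[:-valid_rows * 2]
        valid = dataframe[-valid_rows * 2:-valid_rows]
        test = dataframe[-valid_rows:]
        return train, valid, test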

    A quick walk through the SelectKBest snippet: first we import the SelectKBest and f_classif submodules from sklearn.feature_selection; then we create a selector instance, where the parameter k controls how many features it eventually keeps and f_classif is the scoring function that decides which ones; finally we call fit_transform on the selector, passing it the features and the target, and it returns the k best features as a numpy array. The last few lines use inverse_transform to get back a dataframe with the original column names (the dropped columns come back filled with zeros) and then keep only the columns whose variance is not zero, i.e. the selected features. Note that this method only measures the linear dependency between each individual feature and the target; it does not evaluate all the features jointly in one pass. 3.2 L1 regularization. L1 regularization can take all the feature columns together with the target and compute them jointly in one pass: the stronger a feature's relevance, the larger its coefficient, and features whose coefficients are driven to zero are dropped. You do not have to specify how many features to keep; it selects them automatically from the L1 result. It is noticeably slower than the f_classif method above, but the result is generally better than f_classif, though not always; that is only the usual case.

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    train, valid, test = get_data_splits(baseline_data, 0.1)
    X, y = train[train.columns.drop("outcome")], train["outcome"]
    # penalty="l1" needs a solver that supports it, e.g. liblinear
    logistic_model = LogisticRegression(C=1, penalty="l1", solver="liblinear", random_state=7).fit(X, y)
    # keep only the features whose L1 coefficients are non-zero
    selector = SelectFromModel(logistic_model, prefit=True)
    X_new = selector.transform(X)
    # feature_cols was defined in the previous snippet
    features = pd.DataFrame(selector.inverse_transform(X_new), index=train.index, columns=feature_cols)
    feature_cols_final = features.columns[features.var() != 0]
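    A small addition: besides inverse_transform, SelectFromModel also exposes a boolean mask of the kept features via get_support(), so the names of the selected columns can be recovered directly; a one-line sketch on top of the snippet above:

    # alternative way to get the names of the selected features
    selected_cols = X.columns[selector.get_support()]   # same set as feature_cols_final above
    print(list(selected_cols))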

    Summary: the above covers the common methods of feature generation and feature selection in feature engineering. We went through several techniques; which one to use depends entirely on the actual situation, so don't apply them mechanically. Feature generation generally comes down to four approaches: interaction, numerical transforming, rolling and time delta. For feature selection, the commonly used techniques are f_classif (univariate selection) and L1 regularization.
