[Fastai] ML lecture1 note

Books

Python for Data Analysis, 2nd Edition
Introduction to Machine Learning with Python
Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition
Hands-On Machine Learning with Scikit-Learn and TensorFlow

autoreload magics

%load_ext autoreload
%autoreload 2

the first line loads the autoreload extension, which reloads modules before executing user code.
the second line sets the mode: 2 means all modules are reloaded (except those excluded with %aimport) every time before code is executed.
this documentation gives more details.

matplotlib magics

%matplotlib inline is a frequently used magic. It enables the inline backend of matplotlib so that figures are rendered as static images directly below the cell that produces them.
check more usage here.

tricks to trace functions

see where it comes from
    just run a cell containing only the function name (shift+enter)
see the doc
    type one ? before the function name
see the source
    type two ?? before the function name

run commands in cells

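PATH = "path/to/dir/"  # forward slashes: in "path\to\dir" the \t would be read as a tab
!ls {PATH}

inside a ! command, {} is replaced by the value of the Python expression, much like variable expansion in bash.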

target contest: Blue Book for Bulldozers

The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.

python 3.6 string: f prefix

f'{PATH}Train.csv'
a python 3.6 format string (f-string) evaluates the python expression inside {} and inserts the result into the string.
for example, in the above expression, the variable PATH is evaluated.
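A minimal sketch of the same idea outside the notebook. The directory name is hypothetical and only there for illustration; the point is that anything inside {} is evaluated as an expression, not just a variable name.

```python
PATH = "data/bulldozers/"  # hypothetical directory, used only for illustration

# The expression inside {} is evaluated when the string is built.
train_file = f"{PATH}Train.csv"
print(train_file)

# Any expression works, not only a bare variable name.
n_rows = 5
summary = f"showing {n_rows * 2} rows from {PATH.rstrip('/')}"
print(summary)
```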

pd.read_csv low_memory option

low_memory : boolean, default True
    Internally process the file in chunks, resulting in lower memory use
    while parsing, but possibly mixed type inference.  To ensure no mixed
    types either set False, or specify the type with the `dtype` parameter.
    Note that the entire file is read into a single DataFrame regardless,
    use the `chunksize` or `iterator` parameter to return the data in chunks.
    (Only valid with C parser)
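The two remedies named in the docstring can be sketched as follows. The CSV text is made up and stands in for Train.csv; the column names are assumptions for the example only.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (made-up values).
csv_text = "SalesID,MachineID,fiModelDesc\n1,10,310G\n2,11,416\n"

# Remedy 1: parse the file in one pass so each column gets one inferred type.
df_a = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Remedy 2: state the type explicitly for a column that mixes numbers and text.
df_b = pd.read_csv(io.StringIO(csv_text), dtype={"fiModelDesc": str})

print(df_a.dtypes)
print(df_b["fiModelDesc"].tolist())
```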

`display_all` for jupyter cell

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

this is needed because, by default, jupyter truncates the output when there are too many rows or columns.
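A small check of how pd.option_context behaves, as a sketch: the option is changed only inside the with block and goes back to its previous value afterwards, which is why display_all does not affect other cells.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 1000):
    # Inside the block the temporary value is active.
    inside = pd.get_option("display.max_rows")

# After the block the previous value is restored.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```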

transpose on df

display_all(df_raw.tail().T)
this is useful when there are many more columns than rows.
(You may ask how the number of rows can be so small. Remember that df.head() and df.tail() return only 5 rows by default.)

df.describe(include)

include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

    - 'all' : All columns of the input will be included in the output.
    - A list-like of dtypes : Limits the results to the
      provided data types.
      To limit the result to numeric types submit
      ``numpy.number``. To limit it instead to object columns submit
      the ``numpy.object`` data type. Strings
      can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      select pandas categorical columns, use ``'category'``
    - None (default) : The result will include all numeric columns.

the include argument controls which dtypes are summarized by .describe()
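A short illustration of the difference between the default and include='all', using a made-up two-column frame (the column names are assumptions for the example).

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0],
    "state": ["Ohio", "Texas", "Ohio"],
})

numeric_only = df.describe()              # default: numeric columns only
everything = df.describe(include="all")   # numeric and non-numeric columns

print(list(numeric_only.columns))
print(list(everything.columns))
```

With include='all', statistics that do not apply to a column (for example the mean of a string column) show up as NaN.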

`df_raw.isnull().sum().sort_index()`

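.isnull() returns a 2-dim DataFrame of booleans with the same shape as df_raw.
.sum() uses axis=0 by default, i.e., axis 0 is reduced: the sum is taken over the rows, giving the number of missing values in each column.
.sort_index() sorts the resulting Series by index (the column names) in alphabetical order.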
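The chain can be tried on a toy frame. The column names below are made up; they are chosen so that the alphabetical sort visibly reorders them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "b_col": [1.0, np.nan, 3.0],
    "a_col": [np.nan, np.nan, 1.0],
})

# Count of missing values per column, ordered by column name.
missing = df.isnull().sum().sort_index()
print(missing)

# Dividing by the row count turns the counts into the fraction missing.
print(missing / len(df))
```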

preprocessing pipeline

take the log of price
because the metric required by Kaggle is the RMSE of the log of the price
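To see what taking the log changes, here is a small sketch with made-up prices. Both predictions are 10% too high; on the raw scale the expensive item dominates the error, while on the log scale both rows contribute the same amount.

```python
import math

def rmse(pred, actual):
    """Root mean squared error over two equal-length lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

actual_price = [10000.0, 50000.0]   # made-up sale prices
pred_price = [11000.0, 55000.0]     # both predictions are 10% too high

log_actual = [math.log(v) for v in actual_price]
log_pred = [math.log(v) for v in pred_price]

print(rmse(pred_price, actual_price))  # dominated by the expensive item
print(rmse(log_pred, log_actual))      # equals log(1.1): same ratio, same error
```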

deal with dates

Signature: add_datepart(df, fldname, drop=True, time=False)
Docstring:
add_datepart converts a column of df from a datetime64 to many columns containing
the information from the date. This applies changes inplace.

Parameters:
-----------
df: A pandas data frame. df gain several new columns.
fldname: A string that is the name of the date column you wish to expand.
    If it is not a datetime64 series, it will be converted to one with pd.to_datetime.
drop: If true then the original date column will be removed.
time: If true time features: Hour, Minute, Second will be added.

typically, add_datepart(df, 'salesDate') adds new columns such as salesYear (a trailing [Dd]ate in the name is recognized and stripped before the suffix is added), and removes the original column.
after this step, we ensure all categorical variables are stored as strings.
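The idea behind add_datepart can be sketched with plain pandas. The helper below is a hypothetical, simplified stand-in written for this note, not the fastai implementation; it adds only four of the date parts.

```python
import re
import pandas as pd

def add_date_columns(df, fldname, drop=True):
    """Hypothetical, simplified stand-in for add_datepart (modifies df in place)."""
    fld = pd.to_datetime(df[fldname])
    prefix = re.sub("[Dd]ate$", "", fldname)   # 'salesDate' -> 'sales'
    df[prefix + "Year"] = fld.dt.year
    df[prefix + "Month"] = fld.dt.month
    df[prefix + "Day"] = fld.dt.day
    df[prefix + "Dayofweek"] = fld.dt.dayofweek
    if drop:
        df.drop(columns=[fldname], inplace=True)

df = pd.DataFrame({"salesDate": ["2011-11-02", "2012-01-15"]})  # made-up dates
add_date_columns(df, "salesDate")
print(df)
```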

convert categorical string values
    train_cats()
    (optional) specify the order to use for categorical variables if we wish
        df.some_attr.cat.categories shows the current order of the categories
        df.some_attr.cat.set_categories() sets a new order
        df.some_attr = df.some_attr.cat.codes replaces the values with their integer codes manually

store and reload the data
    feather is a fast binary format
    df.to_feather()
    df = pd.read_feather()

proc_df()
    replaces the categorical values with numerical codes,
    and deals with missing values.
    this function returns (features, y, na_positions)
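The categorical steps can be sketched with plain pandas. The column name and values are made up, and astype('category') is used here only to play the role that train_cats plays in the lecture.

```python
import pandas as pd

df = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium"]})  # made-up values

# Convert the string column to a pandas categorical.
df["UsageBand"] = df["UsageBand"].astype("category")
default_order = list(df.UsageBand.cat.categories)   # sorted alphabetically
print(default_order)

# Impose a meaningful order; ordered=True marks the categories as ordinal.
df["UsageBand"] = df.UsageBand.cat.set_categories(
    ["High", "Medium", "Low"], ordered=True
)

# Integer codes follow the category order; pandas uses -1 for missing values.
codes = df.UsageBand.cat.codes.tolist()
print(codes)
```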

RF pipeline

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)

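the score is $R^2$, which intuitively shows how much better the model is than a naive model that always predicts the mean: 1 is perfect, 0 is no better than the mean, and a negative value is worse than the mean.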
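A hand-rolled version of the score makes the comparison with the mean model explicit. This is a sketch of the usual definition, $R^2 = 1 - SS_{res}/SS_{tot}$, with made-up numbers.

```python
def r2(actual, pred):
    """1 - (squared error of the model) / (squared error of the mean model)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

print(r2(actual, actual))                 # 1.0: perfect predictions
print(r2(actual, [2.5] * 4))              # 0.0: always predicting the mean
print(r2(actual, [4.0, 3.0, 2.0, 1.0]))   # -3.0: worse than the mean
```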

validation set

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

the function is applied three times: 1. to the raw data frame 2. to the features X 3. to the target y. in each case the last n_valid rows become the val set.

evaluation metrics

import math

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

the lines above print the rmse and r-square on both the train set and the val set, plus the oob score when the model has one.

standard RF pipeline

m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

note the %time magic, which reports how long the statement takes to run.
the above follows the instantiate-fit-evaluate pipeline.

speeding up

subset
    proc_df(.., subset, na_dict)
    and then split off the val set.
    this technique limits the data the whole forest can see, which makes it different from the subsampling technique.
subsample
    set_rf_samples()
    this limits the data each tree can see, but the forest as a whole can still see all the data, because each tree draws its own random subset from the whole data.
    additional trees allow the model to see more data.
    reset_rf_samples() undoes the change.
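set_rf_samples() is a fastai helper. As an assumption on my part, recent scikit-learn versions (0.22 and later) offer a similar per-tree subsampling through the max_samples parameter; the sketch below uses that parameter on synthetic data and is not what the lecture runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for the example.
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = X[:, 0] + 0.1 * rng.rand(500)

# Each tree is fit on 100 rows drawn from all 500,
# so different trees see different rows.
m = RandomForestRegressor(n_estimators=20, max_samples=100, random_state=0)
m.fit(X, y)
print(m.score(X, y))
```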

RF options

n_estimators
max_depth
    you can use draw_tree() if you limit the tree depth, e.g.
    draw_tree(m.estimators_[0], df_trn, precision=3)
n_jobs
    keep this under 8
bootstrap
    rows are sampled at random from the train set with replacement, so different trees are trained on different subsets of the rows.
    here is more information.
oob_score
    see the section below
min_samples_leaf
    the minimum number of samples required in each leaf node
    if larger:
    - quicker (cuz shallower, fewer rules)
    - better generalization (cuz each individual tree is less predictive)
    - predictions now need to average more rows (cuz leaves are less specific)
      if too large, accuracy may suffer
max_features
    for each split, how large a subset of all features is considered
    e.g. 'sqrt', 'log2', 0.5
    fewer features per split may require more trees, but the trees are less correlated, which usually gives a better model
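The effect of min_samples_leaf on tree size can be checked directly. This is a sketch on synthetic data; the helper mean_leaves is a name made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for the example.
rng = np.random.RandomState(0)
X = rng.rand(400, 4)
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.randn(400)

def mean_leaves(model):
    """Average number of leaves per tree (hypothetical helper)."""
    return sum(tree.get_n_leaves() for tree in model.estimators_) / len(model.estimators_)

m_default = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
m_tuned = RandomForestRegressor(
    n_estimators=10, min_samples_leaf=5, max_features=0.5, random_state=0
).fit(X, y)

# Larger leaves mean fewer leaves per tree, i.e. smaller and faster trees.
print(mean_leaves(m_default), mean_leaves(m_tuned))
```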

RF attributes

.estimators_
a list of tree model instances

out-of-bag score

this is still r-square, but no separate val set is needed; in effect the train set serves as its own val set.
In the traditional setting, we compute the predictions for the samples in the val set, compare them with their ground truths, and get the r-square. Note that the prediction for each sample is the average of all trees' predictions.
With the out-of-bag score, the prediction for a specific sample is the average over only the subset of trees that were not trained on that sample.
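A small sketch of the difference, on synthetic data: scoring on the training rows is optimistic, while the out-of-bag score only uses, for each row, the trees that did not train on it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for the example.
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = X[:, 0] + 0.1 * rng.randn(300)

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X, y)

train_r2 = m.score(X, y)   # r-square on rows the trees trained on
oob_r2 = m.oob_score_      # r-square from trees that did not see each row
print(train_r2, oob_r2)
```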

useful gadgets

metrics.r2_score()
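One thing worth remembering when calling it: the ground truth goes first and the predictions second, and the result changes if they are swapped. A minimal sketch with made-up numbers:

```python
from sklearn import metrics

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

score = metrics.r2_score(y_true, y_pred)     # ground truth first
swapped = metrics.r2_score(y_pred, y_true)   # arguments swapped: a different value
print(score, swapped)
```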