Books
- Python for Data Analysis, 2nd Edition
- Introduction to Machine Learning with Python
- Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition
- Hands-On Machine Learning with Scikit-Learn and TensorFlow
autoreload magics
%load_ext autoreload
%autoreload 2
the first line loads the autoreload extension, which reloads modules before executing user code.
the second line sets the reload mode: with 2, all modules (except those excluded by %aimport) are reloaded every time before code is run.
the IPython autoreload documentation gives more details.
matplotlib magics
%matplotlib inline
is a frequently used magic. It enables the inline backend of matplotlib so that figures are rendered as static images directly below the cell that produced them.
see the IPython documentation of the %matplotlib magic for more usage.
tricks to trace functions
- see where it comes from
just run a cell containing only the function name (shift+enter)
- see the doc
type one ? after the function name, e.g. pd.read_csv?
- see the source
type two ?? after the function name, e.g. pd.read_csv??
run commands in cells
PATH = "path/to/dir/"
!ls {PATH}
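IPython replaces {PATH} with the value of the python variable before running the shell command, much like variable expansion in bash.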
target contest: Blue Book for Bulldozers
The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.
python 3.6 string: f prefix
f'{PATH}Train.csv'
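since python 3.6, any expression inside {} in an f-prefixed string is evaluated as python code and its result is inserted into the string.
for example, in the above expression, the variable PATH is evaluated and its value is placed in front of Train.csv.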
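A small sketch of the f-string evaluation described above. The directory name and the row count are made-up values for illustration only.

```python
# Hypothetical data directory; note the trailing slash.
PATH = "data/bulldozers/"

# The expression inside {} is evaluated and its result is inserted.
csv_path = f"{PATH}Train.csv"

# Any expression works inside the braces, not only a variable name.
n_rows = 5
summary = f"showing {n_rows * 2} rows from {csv_path.upper()}"

print(csv_path)
print(summary)
```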
pd.read_csv low_memory option
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)
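The two remedies named in the docstring can be sketched as follows. The in-memory CSV and its column names are invented for the example; a file this small would not trigger mixed-type inference by itself, so this only shows how the options are passed.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a real file (made-up data).
# The "model" column mixes numeric-looking and text values.
csv_text = "id,model\n1,310\n2,D6\n3,416\n"

# Option 1: parse the file in one pass instead of in chunks.
df_a = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the column type explicitly so nothing is inferred.
df_b = pd.read_csv(io.StringIO(csv_text), dtype={"model": str})

print(df_a.dtypes)
print(df_b["model"].tolist())
```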
display_all
for jupyter cell
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
because by default, jupyter does not show all columns and all rows if there are too many.
transpose on df
display_all(df_raw.tail().T)
this is useful when there are many more columns than rows.
(the number of rows is small here because df.head() and df.tail() return only 5 rows by default.)
df.describe(include)
include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored
for ``Series``. Here are the options:
- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the
provided data types.
To limit the result to numeric types submit
``numpy.number``. To limit it instead to object columns submit
the ``numpy.object`` data type. Strings
can also be used in the style of
``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
select pandas categorical columns, use ``'category'``
- None (default) : The result will include all numeric columns.
this relates to what dtypes will be involved in the computation of .describe()
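A minimal sketch of how the include argument changes which columns are summarized. The DataFrame is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one string column.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0],
    "state": ["TX", "CA", "TX"],
})

# Default (include=None): only numeric columns are summarized.
numeric_default = df.describe()

# Explicitly restricting to numeric dtypes gives the same columns here.
numeric_explicit = df.describe(include=[np.number])

# 'all': every column, with extra rows such as unique/top/freq.
everything = df.describe(include="all")

print(everything)
```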
df_raw.isnull().sum().sort_index()
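.isnull()
returns a 2-dimensional boolean DataFrame with the same shape as the input.
.sum()
uses axis=0 by default, i.e., axis 0 is reduced: the sum is taken down the rows, giving the number of missing values for each column.
.sort_index()
sorts the returned Series by index (the column names) in alphabetical order.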
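The chain above can be followed step by step on a tiny made-up DataFrame (the column names are invented):

```python
import numpy as np
import pandas as pd

# Made-up frame with some missing values.
df = pd.DataFrame({
    "usage": [1.0, np.nan, 3.0],
    "auctioneer": [np.nan, np.nan, 2.0],
    "price": [10.0, 12.0, 9.0],
})

mask = df.isnull()             # boolean DataFrame, same shape as df
counts = mask.sum()            # axis=0: one missing-value count per column
ordered = counts.sort_index()  # Series sorted by column name

print(ordered)
```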
preprocessing pipeline
- take the log of price
because the metric required by kaggle is rmse of log price
- deal with dates
Signature: add_datepart(df, fldname, drop=True, time=False)
Docstring:
add_datepart converts a column of df from a datetime64 to many columns containing
the information from the date. This applies changes inplace.
Parameters:
-----------
df: A pandas data frame. df gain several new columns.
fldname: A string that is the name of the date column you wish to expand.
    If it is not a datetime64 series, it will be converted to one with pd.to_datetime.
drop: If true then the original date column will be removed.
time: If true time features: Hour, Minute, Second will be added.
typically,
add_datepart(df, 'salesDate')
adds new attributes like salesYear
(with the trailing [Dd]ate$ of the column name recognized and removed), and removes the requested column.
after this step, we ensure all categorical variables are stored as strings.
- convert string columns to pandas categorical values
train_cats()
- (optional) specify the order to use for categorical variables if we wish
df.some_attr.cat.categories
shows the current order of the categories
df.some_attr.cat.set_categories()
sets a new order for the categories
df.some_attr = df.some_attr.cat.codes
manually replaces the values with their integer codes
- store and reload the data
feather is a fast binary format
df.to_feather()
df = pd.read_feather()
- proc_df()
replaces the categorical values with numerical codes,
and deals with missing values.
this function returns (features, y, na_positions)
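The ideas behind the first steps of the pipeline can be sketched with plain pandas. This is not the fastai implementation of add_datepart, train_cats or proc_df; the rows, the column names and the chosen category order are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are chosen for illustration only.
df = pd.DataFrame({
    "SalePrice": [10000.0, 25000.0, 18000.0],
    "saledate": ["2011-01-15", "2011-06-30", "2012-03-01"],
    "UsageBand": ["Low", "High", None],
})

# 1. Take the log of the price, since the metric is rmse of log price.
df["SalePrice"] = np.log(df["SalePrice"])

# 2. Expand the date into several parts, then drop the original column.
dates = pd.to_datetime(df["saledate"])
df["saleYear"] = dates.dt.year
df["saleMonth"] = dates.dt.month
df["saleDayofweek"] = dates.dt.dayofweek
df = df.drop(columns="saledate")

# 3. Strings -> categories with an explicit order -> integer codes.
#    pandas assigns the code -1 to missing values.
usage = df["UsageBand"].astype("category")
usage = usage.cat.set_categories(["High", "Medium", "Low"], ordered=True)
df["UsageBand"] = usage.cat.codes

print(df)
```

After these steps every column is numeric, which is what a random forest needs as input.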
RF pipeline
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)
the score is R² (r-square), which intuitively shows how much better the model is compared to a naive model that always predicts the mean.
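For a regressor, m.score returns the coefficient of determination R². A hand-written sketch of that quantity (the helper name r2 and the numbers are made up) shows why 1 means a perfect fit, 0 means no better than predicting the mean, and negative values mean worse than the mean:

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, written out with plain Python lists."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean-model error
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                # predictions equal the targets
naive = r2(y_true, [2.5] * 4)               # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than predicting the mean

print(perfect, naive, worse)
```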
validation set
def split_vals(a,n): return a[:n].copy(), a[n:].copy()
n_valid = 12000 # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape
the function is applied three times: 1. to the raw data (x and y together) 2. to x 3. to y
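A toy run of split_vals on a plain list, to make the behavior concrete. The list stands in for the rows of the data; note that the split is positional, so the last n_valid rows become the val set (which matters if, as assumed here, the rows are ordered in time).

```python
def split_vals(a, n): return a[:n].copy(), a[n:].copy()

rows = list(range(10))       # stand-in for 10 ordered rows
n_valid = 3                  # size of the val set
n_trn = len(rows) - n_valid  # everything else is used for training

train, valid = split_vals(rows, n_trn)

print(train)
print(valid)
```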
evaluation metrics
import math
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
def print_score(m):
    # order: train rmse, val rmse, train r-square, val r-square, (oob score)
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
the above lines print rmse and r-square on both the train set and the val set, followed by oob_score_ if the model has one
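Because y is the log of the price, this rmse measures relative rather than absolute errors. A pure-python sketch with made-up prices: both predictions below are off by a factor of 2, so they contribute equally to the rmse of log price even though the absolute errors differ a lot.

```python
import math

def rmse(x, y):
    """Root mean squared error for two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Made-up prices: a cheap and an expensive machine, both predicted 2x too high.
actual = [10000.0, 200000.0]
predicted = [20000.0, 400000.0]

log_actual = [math.log(v) for v in actual]
log_predicted = [math.log(v) for v in predicted]

raw_error = rmse(predicted, actual)          # dominated by the expensive machine
log_error = rmse(log_predicted, log_actual)  # each row contributes log(2)

print(raw_error, log_error)
```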
standard RF pipeline
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
note the %time magic, which reports how long the statement takes to run.
the above follows the instantiate-fit-evaluate pipeline
speeding up
- subset
proc_df(.., subset, na_dict)
and then split off the val set.
this technique limits the data the whole forest can see, and is different from the subsampling technique below.
- subsample
set_rf_samples()
this limits the data each tree can see, but the model as a whole still sees all the data, because each tree randomly samples its own subset from the whole data.
additional trees allow the model to see more data.
reset_rf_samples()
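The difference between the two techniques can be illustrated with a small simulation that only tracks row indices. It mimics the idea, not the fastai implementation of set_rf_samples; the sizes and the seed are arbitrary, and sampling without replacement is a simplification.

```python
import random

n_rows = 1000       # rows in the full train set (arbitrary)
subset_size = 100   # rows per tree (arbitrary)
n_trees = 20
rng = random.Random(0)

# subset: rows are picked once, and every tree is limited to them.
subset = set(rng.sample(range(n_rows), subset_size))

# subsample: each tree draws its own rows from the full data,
# so the forest as a whole covers more rows as trees are added.
seen = set()
for _ in range(n_trees):
    seen |= set(rng.sample(range(n_rows), subset_size))

print(len(subset), len(seen))
```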
RF options
- n_estimators
- max_depth
you can use draw_tree()
if you limit the tree depth, e.g.
draw_tree(m.estimators_[0], df_trn, precision=3)
- n_jobs
number of cores to use; keep this under 8
- bootstrap
if enabled, each tree is trained on a random sample drawn from the train set with replacement, so different trees use different subsets of samples.
- oob_score
see the section below
- min_samples_leaf
the minimum number of samples required in each leaf node
if larger:
  - quicker (cuz shallower, fewer rules)
  - better generalization (cuz each individual tree is less predictive, but the trees are also less correlated)
  - predictions now average more rows (cuz the leaves are less specific)
if too large, accuracy may suffer
- max_features
for each split, how large a random subset of all features to consider
e.g. 'sqrt', 'log2', 0.5
fewer features may require more trees but can give a better model
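A sketch of how these options are passed to scikit-learn. The synthetic data from make_regression and the particular option values are assumptions for illustration, not tuned settings from the course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the real features and target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

m = RandomForestRegressor(
    n_estimators=40,      # number of trees
    min_samples_leaf=3,   # each leaf must hold at least 3 samples
    max_features=0.5,     # each split considers half of the features
    bootstrap=True,       # each tree samples rows with replacement
    oob_score=True,       # compute the out-of-bag r-square during fit
    n_jobs=-1,            # use all available cores
    random_state=0,
)
m.fit(X, y)

print(len(m.estimators_), m.oob_score_)
```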
RF attributes
- .estimators_
a list of tree model instances
out-of-bag score
this is still r-square, but no separate val set is needed: for each train sample, the trees that did not train on it serve as its val model.
In the traditional setting, we compute the predictions for the samples in the val set, compare them with their ground truths, and finally get the r-square. Note that the prediction for each sample is the average of all trees' predictions.
With the oob score, the prediction for a specific train sample is the average over only the subset of trees that were not trained on that sample.
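The bookkeeping behind the out-of-bag idea can be simulated with row indices alone. This is a sketch under the assumption that each tree draws as many rows as the train set contains, with replacement; the sizes and the seed are arbitrary and no actual trees are trained.

```python
import random

n_samples = 200
n_trees = 50
rng = random.Random(0)

# Bootstrap: each tree draws n_samples row indices with replacement,
# so some rows are drawn several times and others not at all.
in_bag = [set(rng.choices(range(n_samples), k=n_samples)) for _ in range(n_trees)]

# For each row, only the trees that never saw it may predict it.
oob_trees = [[t for t in range(n_trees) if i not in in_bag[t]]
             for i in range(n_samples)]

# Average share of trees for which a given row is out of bag.
avg_oob_fraction = sum(len(ts) for ts in oob_trees) / (n_samples * n_trees)

print(round(avg_oob_fraction, 2))
```

Each row is left out by roughly a third of the trees, so with enough trees nearly every row gets an out-of-bag prediction.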
useful gadgets
- metrics.r2_score()