Functions Encountered on 波妞's Machine Learning and Data Analysis Journey (Updated)

  1. some.where() — used to locate and replace data at positions that do not satisfy a condition
housing.where(
    cond,
    other=nan,
    inplace=False,
    axis=None,
    level=None,
    errors='raise',
    try_cast=False,
    raise_on_error=None,
)

Docstring:
Replace values where the condition is **False**
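
A minimal usage sketch, assuming a small made-up DataFrame (the `housing` object in the signature above is the blog's own dataset):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# where() keeps entries for which the condition is True and replaces
# everything else with `other` (NaN by default, -1 here).
print(df.where(df > 15, other=-1))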
  2. pd.cut()
    Used to segment continuous values into discrete bins, e.g., so their counts can then be plotted as a bar chart.
Signature:
pd.cut(
    x,
    bins,
    right=True,
    labels=None,
    retbins=False,
    precision=3,
    include_lowest=False,
    duplicates='raise',
)
Docstring:
Bin values into discrete intervals.

Use `cut` when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, `cut` could convert ages to groups of
age ranges. Supports binning into an equal number of bins, or a
pre-specified array of bins
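
A minimal sketch on made-up ages, showing how cut() turns a continuous column into labelled bins whose counts are what you would plot as bars:

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 63, 80])

# Four bin edges define four intervals, one label each; the result
# is a Categorical series.
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=["child", "young", "adult", "senior"])
print(groups.value_counts())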
  3. StratifiedShuffleSplit() — used for stratified random sampling
StratifiedShuffleSplit(
    n_splits=10,
    test_size='default',
    train_size=None,
    random_state=None,
)
Docstring:     
Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a merge of StratifiedKFold and
ShuffleSplit, which returns stratified randomized folds. The folds
are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits
do not guarantee that all folds will be different, although this is
still very likely for sizeable datasets.

Read more in the :ref:`User Guide <cross_validation>`.
  4. StratifiedShuffleSplit(…,…,…).split()
    Used to generate the training-set and test-set indices (see the sketch after the docstring below).
Signature: split.split(X, y, groups=None)
Docstring:
Generate indices to split data into training and test set.

Parameters
----------
X : array-like, shape (n_samples, n_features)
    Training data, where n_samples is the number of samples
    and n_features is the number of features.

    Note that providing ``y`` is sufficient to generate the splits and
    hence ``np.zeros(n_samples)`` may be used as a placeholder for
    ``X`` instead of actual training data.

y : array-like, shape (n_samples,)
    The target variable for supervised learning problems.
    Stratification is done based on the y labels.

groups : object
    Always ignored, exists for compatibility.

Yields
------
train : ndarray
    The training set indices for that split.

test : ndarray
    The testing set indices for that split.

Notes
-----
Randomized CV splitters may return different results for each call of
split. You can make the results identical by setting ``random_state``
to an integer.
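
A minimal sketch combining the two steps on toy data (the arrays here are made up; in the blog's housing example the stratification label would be an income-category column):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)              # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # labels to stratify on

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields (train_indices, test_indices); each fold preserves
# the class proportions of y (here 50/50) in both subsets.
for train_idx, test_idx in split.split(X, y):
    print("train:", train_idx, "test:", test_idx)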
  5. SimpleImputer()
    Used to handle missing data.
Init signature:
SimpleImputer(
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose=0,
    copy=True,
)
Docstring:     
Imputation transformer for completing missing values.

Read more in the :ref:`User Guide <impute>`.

Parameters
----------
missing_values : number, string, np.nan (default) or None
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed.

strategy : string, optional (default="mean")
    The imputation strategy.

    - If "mean", then replace missing values using the mean along
      each column. Can only be used with numeric data.
    - If "median", then replace missing values using the median along
      each column. Can only be used with numeric data.
    - If "most_frequent", then replace missing using the most frequent
      value along each column. Can be used with strings or numeric data.
    - If "constant", then replace missing values with fill_value. Can be
      used with strings or numeric data.

    .. versionadded:: 0.20
       strategy="constant" for fixed value imputation.

fill_value : string or numerical value, optional (default=None)
    When strategy == "constant", fill_value is used to replace all
    occurrences of missing_values.
    If left to the default, fill_value will be 0 when imputing numerical
    data and "missing_value" for strings or object data types.

verbose : integer, optional (default=0)
    Controls the verbosity of the imputer.

copy : boolean, optional (default=True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, in the following cases,
    a new copy will always be made, even if `copy=False`:

    - If X is not an array of floating values;
    - If X is encoded as a CSR matrix.

Attributes
----------
statistics_ : array of shape (n_features,)
    The imputation fill value for each feature.

Examples
--------
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
... # doctest: +NORMALIZE_WHITESPACE
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='mean', verbose=0)
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
... # doctest: +NORMALIZE_WHITESPACE
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

Notes
-----
Columns which only contained missing values at `fit` are discarded upon
`transform` if strategy is not "constant".
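
Beyond the docstring's mean example, a short sketch of the median strategy on a made-up DataFrame:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"rooms": [2.0, np.nan, 4.0, 6.0],
                   "price": [100.0, 200.0, np.nan, 400.0]})

# fit() learns each column's median; transform() fills NaNs with it.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df)
print(imputer.statistics_)   # per-column medians: 4.0 and 200.0
print(filled)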
  6. OrdinalEncoder()
    Used to convert non-numeric (categorical) values into numbers so they can enter statistical computations.
Init signature: OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
Docstring:     
Encode categorical features as an integer array.

The input to this transformer should be an array-like of integers or
strings, denoting the values taken on by categorical (discrete) features.
The features are converted to ordinal integers. This results in
a single column of integers (0 to n_categories - 1) per feature.

Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

Parameters
----------
categories : 'auto' or a list of lists/arrays of values.
    Categories (unique values) per feature:

    - 'auto' : Determine categories automatically from the training data.
    - list : ``categories[i]`` holds the categories expected in the ith
      column. The passed categories should not mix strings and numeric
      values, and should be sorted in case of numeric values.

    The used categories can be found in the ``categories_`` attribute.

dtype : number type, default np.float64
    Desired dtype of output.

Attributes
----------
categories_ : list of arrays
    The categories of each feature determined during fitting
    (in order of the features in X and corresponding with the output
    of ``transform``).

Examples
--------
Given a dataset with two features, we let the encoder find the unique
values per feature and transform the data to an ordinal encoding.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
... # doctest: +ELLIPSIS
OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])

>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)

See also
--------
sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
  categorical features.
sklearn.preprocessing.LabelEncoder : encodes target labels with values
  between 0 and n_classes-1.
  
  7. OneHotEncoder()
    A better encoding than plain ordinal integers: categories are not treated as similar merely because their codes happen to be numerically close (see the comparison sketch after this docstring).

Init signature:
OneHotEncoder(
    n_values=None,
    categorical_features=None,
    categories=None,
    sparse=True,
    dtype=<class 'numpy.float64'>,
    handle_unknown='error',
)
Docstring:     
Encode categorical integer features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or
strings, denoting the values taken on by categorical (discrete) features.
The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
encoding scheme. This creates a binary column for each category and
returns a sparse matrix or dense array.

By default, the encoder derives the categories based on the unique values
in each feature. Alternatively, you can also specify the `categories`
manually.
The OneHotEncoder previously assumed that the input features take on
values in the range [0, max(values)). This behaviour is deprecated.

This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer
instead.

Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

Parameters
----------
categories : 'auto' or a list of lists/arrays of values, default='auto'.
    Categories (unique values) per feature:

    - 'auto' : Determine categories automatically from the training data.
    - list : ``categories[i]`` holds the categories expected in the ith
      column. The passed categories should not mix strings and numeric
      values within a single feature, and should be sorted in case of
      numeric values.

    The used categories can be found in the ``categories_`` attribute.

sparse : boolean, default=True
    Will return sparse matrix if set True else will return an array.

dtype : number type, default=np.float
    Desired dtype of output.

handle_unknown : 'error' or 'ignore', default='error'.
    Whether to raise an error or ignore if an unknown categorical feature
    is present during transform (default is to raise). When this parameter
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting one-hot encoded columns for this feature
    will be all zeros. In the inverse transform, an unknown category
    will be denoted as None.

n_values : 'auto', int or array of ints, default='auto'
    Number of values per feature.

    - 'auto' : determine value range from training data.
    - int : number of categorical values per feature.
            Each feature value should be in ``range(n_values)``
    - array : ``n_values[i]`` is the number of categorical values in
              ``X[:, i]``. Each feature value should be
              in ``range(n_values[i])``

    .. deprecated:: 0.20
        The `n_values` keyword was deprecated in version 0.20 and will
        be removed in 0.22. Use `categories` instead.

categorical_features : 'all' or array of indices or mask, default='all'
    Specify what features are treated as categorical.

    - 'all': All features are treated as categorical.
    - array of indices: Array of categorical feature indices.
    - mask: Array of length n_features and with dtype=bool.

    Non-categorical features are always stacked to the right of the matrix.

    .. deprecated:: 0.20
        The `categorical_features` keyword was deprecated in version
        0.20 and will be removed in 0.22.
        You can use the ``ColumnTransformer`` instead.

Attributes
----------
categories_ : list of arrays
    The categories of each feature determined during fitting
    (in order of the features in X and corresponding with the output
    of ``transform``).

active_features_ : array
    Indices for active features, meaning values that actually occur
    in the training set. Only available when n_values is ``'auto'``.

    .. deprecated:: 0.20
        The ``active_features_`` attribute was deprecated in version
        0.20 and will be removed in 0.22.

feature_indices_ : array of shape (n_features,)
    Indices to feature ranges.
    Feature ``i`` in the original data is mapped to features
    from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
    (and then potentially masked by ``active_features_`` afterwards)

    .. deprecated:: 0.20
        The ``feature_indices_`` attribute was deprecated in version
        0.20 and will be removed in 0.22.

n_values_ : array of shape (n_features,)
    Maximum number of values per feature.

    .. deprecated:: 0.20
        The ``n_values_`` attribute was deprecated in version
        0.20 and will be removed in 0.22.

Examples
--------
Given a dataset with two features, we let the encoder find the unique
values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
... # doctest: +ELLIPSIS
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names()
array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)

See also
--------
sklearn.preprocessing.OrdinalEncoder : performs an ordinal (integer)
  encoding of the categorical features.
sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
  dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
  encoding of dictionary items or strings.
sklearn.preprocessing.LabelBinarizer : binarizes labels in a one-vs-all
  fashion.
sklearn.preprocessing.MultiLabelBinarizer : transforms between iterable of
  iterables and a multilabel format, e.g. a (samples x classes) binary
  matrix indicating the presence of a class label.
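
To make the "numeric closeness" point concrete, a sketch comparing the two encoders on the same made-up categorical column:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["INLAND"], ["NEAR BAY"], ["NEAR OCEAN"], ["INLAND"]]

# Ordinal codes (0, 1, 2) imply NEAR BAY sits "between" the other two,
# which is meaningless for unordered categories.
print(OrdinalEncoder().fit_transform(X))

# One-hot gives one binary column per category; all categories are
# equally distant from one another.
print(OneHotEncoder().fit_transform(X).toarray())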
  8. sorted(iterable, /, *, key=None, reverse=False) — sorts an iterable and returns a new list
Signature: sorted(iterable, /, *, key=None, reverse=False)
Docstring:
Return a new list containing all items from the iterable in ascending order.

A custom key function can be supplied to customize the sort order, and the
reverse flag can be set to request the result in descending order.
Type:      builtin_function_or_method
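
A quick sketch of key and reverse:

words = ["pear", "fig", "banana"]

# Default: ascending lexicographic order.
print(sorted(words))                         # ['banana', 'fig', 'pear']

# key customizes the comparison; reverse flips the order.
print(sorted(words, key=len, reverse=True))  # ['banana', 'pear', 'fig']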
  9. fetch_openml() — used to fetch a dataset from OpenML
Signature:
fetch_openml(
    name=None,
    version='active',
    data_id=None,
    data_home=None,
    target_column='default-target',
    cache=True,
    return_X_y=False,
)
Docstring:
Fetch dataset from openml by name or dataset id.

Datasets are uniquely identified by either an integer ID or by a
combination of name and version (i.e. there might be multiple
versions of the 'iris' dataset). Please give either name or data_id
(not both). In case a name is given, a version can also be
provided.

Read more in the :ref:`User Guide <openml>`.

.. note:: EXPERIMENTAL

    The API is experimental in version 0.20 (particularly the return value
    structure), and might have small backward-incompatible changes in
    future releases.

Parameters
----------
name : str or None
    String identifier of the dataset. Note that OpenML can have multiple
    datasets with the same name.

version : integer or 'active', default='active'
    Version of the dataset. Can only be provided if also ``name`` is given.
    If 'active' the oldest version that's still active is used. Since
    there may be more than one active version of a dataset, and those
    versions may fundamentally be different from one another, setting an
    exact version is highly recommended.

data_id : int or None
    OpenML ID of the dataset. The most specific way of retrieving a
    dataset. If data_id is not given, name (and potential version) are
    used to obtain a dataset.

data_home : string or None, default None
    Specify another download and cache folder for the data sets. By default
    all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

target_column : string, list or None, default 'default-target'
    Specify the column name in the data to use as target. If
    'default-target', the standard target column a stored on the server
    is used. If ``None``, all columns are returned as data and the
    target is ``None``. If list (of strings), all columns with these names
    are returned as multi-target (Note: not all scikit-learn classifiers
    can handle all types of multi-output combinations)

cache : boolean, default=True
    Whether to cache downloaded datasets using joblib.

return_X_y : boolean, default=False.
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` objects.

Returns
-------

data : Bunch
    Dictionary-like object, with attributes:

    data : np.array or scipy.sparse.csr_matrix of floats
        The feature matrix. Categorical features are encoded as ordinals.
    target : np.array
        The regression target or classification labels, if applicable.
        Dtype is float if numeric, and object if categorical.
    DESCR : str
        The full description of the dataset
    feature_names : list
        The names of the dataset columns
    categories : dict
        Maps each categorical feature name to a list of values, such
        that the value encoded as i is ith in the list.
    details : dict
        More metadata from OpenML

(data, target) : tuple if ``return_X_y`` is True

    .. note:: EXPERIMENTAL

        This interface is **experimental** as at version 0.20 and
        subsequent releases may change attributes without notice
        (although there should only be minor changes to ``data``
        and ``target``).

    Missing values in the 'data' are represented as NaN's. Missing values
    in 'target' are represented as NaN's (numerical target) or None
    (categorical target)
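
A sketch of the usual call, assuming the public 'mnist_784' dataset on OpenML (downloaded and cached on first use):

from sklearn.datasets import fetch_openml

# Pin the version: dataset names on OpenML are not unique.
mnist = fetch_openml("mnist_784", version=1)
X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)   # (70000, 784) (70000,)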
  10. enumerate() — yields (count, value) pairs while iterating

Init signature: enumerate(iterable, start=0)
Docstring:     
Return an enumerate object.

  iterable
    an object supporting iteration

The enumerate object yields pairs containing a count (from start, which
defaults to zero) and a value yielded by the iterable argument.

enumerate is useful for obtaining an indexed list:
    (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
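
A quick sketch:

seasons = ["spring", "summer", "fall", "winter"]

# enumerate pairs each item with a running count, starting at `start`.
for i, name in enumerate(seasons, start=1):
    print(i, name)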

Reposted from blog.csdn.net/weixin_43702920/article/details/94696520