Data Science Library - Day 3

1 pandas Index

    For a Series, values can be accessed by index label directly, e.g.:

        s=pd.Series(np.array([1,2,3,4]),index=['a','b','a','b'])

        s['a']

    For a DataFrame, you need to use loc, e.g.:

        df=pd.DataFrame(np.array([[1,2],[3,4],[5,6]]),index=['a','b','a'])

        df.loc['b']
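A minimal runnable sketch of both access styles, using the same toy data as above:

```python
import numpy as np
import pandas as pd

# Series: label-based access works directly, even with repeated labels
s = pd.Series(np.array([1, 2, 3, 4]), index=['a', 'b', 'a', 'b'])
print(s['a'])          # both rows labelled 'a'

# DataFrame: use loc for label-based row access
df = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), index=['a', 'b', 'a'])
print(df.loc['b'])     # the row labelled 'b'
```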

2 Repeated index

    Detection: to determine whether the index contains duplicates, the is_unique attribute may be used (s.is_unique checks the values, s.index.is_unique the index), e.g.:

        s.index.is_unique

    To obtain the distinct values, the unique function may be used, e.g.:

        s1=s.unique()

    To aggregate over repeated index labels, groupby may be used, e.g.:

        s2=s.groupby(s.index).sum()
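Putting the three steps together on the repeated-index Series from section 1:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4]), index=['a', 'b', 'a', 'b'])

print(s.index.is_unique)        # False: 'a' and 'b' each appear twice

s1 = s.unique()                 # distinct values of the Series

s2 = s.groupby(s.index).sum()   # sum values sharing an index label
print(s2)                       # a -> 1+3, b -> 2+4
```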

3 Multi-level index

3.1 Series

A multi-level index can be created with the MultiIndex class, e.g.:

        a1=[['a','b','a','c','b','b'],[1,2,2,3,1,2]]

        a2=list(zip(*a1))

        index=pd.MultiIndex.from_tuples(a2,names=['level1','level2'])

3.2 DataFrame

    Slightly different from Series, e.g.:

df=pd.DataFrame(np.random.randint(1,10,(6,3)),index=[['a','b','a','c','b','b'],[1,2,2,3,1,2]],columns=[['one','two','one'],['red','blue','red']])

3.3 Swapping index levels

    df.swaplevel(0,1)

3.4 Sorting by index level

    df.sort_index(level=1)

    (The older df.sortlevel(1) has been removed from pandas.)

3.5 Aggregating by index level

    df.groupby(level=0).sum()

    (The older df.sum(level=0) spelling is deprecated in recent pandas.)

3.6 Converting columns to the index

    df.set_index([column name 1, column name 2, ...])

    The index can also be moved back out to columns, e.g. df.reset_index()
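A runnable sketch tying sections 3.3-3.6 together (the column names k1, k2, v are made up for illustration):

```python
import pandas as pd

# hypothetical columns, just for illustration
df = pd.DataFrame({'k1': ['x', 'x', 'y'],
                   'k2': [1, 2, 1],
                   'v':  [10, 20, 30]})

df2 = df.set_index(['k1', 'k2'])        # two columns become a MultiIndex
df4 = df2.swaplevel(0, 1).sort_index()  # swap the levels, then sort by them
print(df4.loc[(1, 'x'), 'v'])           # the value at k2=1, k1='x'

df3 = df2.reset_index()                 # move the index back out to columns
print(list(df3.columns))                # ['k1', 'k2', 'v']
```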

4 Grouped computation

    The basic process: split - apply - combine

4.1 Grouping by a list of columns

    Example: df.groupby([column name 1, column name 2, ...]).sum()

4.2 Grouping by dictionary

df=pd.DataFrame(np.random.randint(1,10,(6,4)),columns=['a','b','c','d'])

mapping={'a':'red','b':'blue','c':'red','d':'blue'}

df1=df.groupby(mapping,axis=1)
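A concrete sketch with fixed data; note that newer pandas deprecates axis=1 in groupby, so the transpose idiom below is used instead of df.groupby(mapping, axis=1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])
mapping = {'a': 'red', 'b': 'blue', 'c': 'red', 'd': 'blue'}

# group the columns by the colour each one maps to and sum within each group
df1 = df.T.groupby(mapping).sum().T
print(df1)   # blue = b + d, red = a + c
```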

4.3 Grouping by function

    The function is applied to each index value; its return value determines the group

    def _group(idx):

        return idx

    df.groupby(_group)

4.4 Grouping by index level

    A named level of a multi-level index can be used for grouping, e.g.:

    df.groupby(level=index name, axis=1)

5 Aggregation operations

5.1 Built-in aggregation functions

    sum,mean,min,max,describe

5.2 custom aggregation function

    def _group(s):

        return s.max()-s.min()

    df1.agg(_group)

    Note: df1 is the result of grouping
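A self-contained sketch of the custom aggregation (the key/data columns are made up):

```python
import pandas as pd

# hypothetical key/data columns
df = pd.DataFrame({'key':  ['a', 'a', 'b', 'b'],
                   'data': [1, 5, 2, 10]})

def _group(s):
    # peak-to-peak range: max minus min within each group
    return s.max() - s.min()

df1 = df.groupby('key')      # df1 is the grouping result
res = df1.agg(_group)
print(res)                   # a -> 5-1, b -> 10-2
```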

5.3 apply method

    def top(g,n=2,column='data1'):

        return g.sort_values(by=column,ascending=False)[:n]

    df2 = df.groupby(column name).apply(top)
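A runnable version of the top function (the key/data1 columns are hypothetical):

```python
import pandas as pd

# hypothetical columns
df = pd.DataFrame({'key':   ['a', 'a', 'a', 'b', 'b'],
                   'data1': [3, 9, 5, 1, 7]})

def top(g, n=2, column='data1'):
    # the n largest rows of the group, ordered by `column`
    return g.sort_values(by=column, ascending=False)[:n]

df2 = df.groupby('key').apply(top)
print(df2['data1'].tolist())   # top 2 of group a, then top 2 of group b
```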

6 Data Import and Export

6.1 Data Import

6.1.1 Import method

    pd.read_csv(file path)

    pd.read_table(file path, sep=',')

    pd.read_table is somewhat more flexible than pd.read_csv for general delimited files, since its sep parameter supports regular expressions.

6.1.2 Import settings

When reading a file, you can specify whether there is a header row and designate certain columns as row labels, e.g.:

        df = pd.read_csv(path, header=None, index_col=[column name 1, column name 2, ...])

6.1.3 Missing values

    pd.read_csv automatically treats empty fields, NA, and the like as missing values; custom sentinels can also be given, e.g.:

    df = pd.read_csv(path, na_values=['NA', 'NULL', 'foo'])

    A dictionary can even be used to set the sentinels for each column individually

6.1.4 Reading data in chunks

    Data can be read block by block via the chunksize parameter; for example, accumulating value counts for a key column:

    df = pd.read_csv(path, chunksize=1000)

    result=pd.Series(dtype='float64')

    for chunk in df:

        result=result.add(chunk[key].value_counts(),fill_value=0)
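The same pattern made self-contained, with an in-memory CSV standing in for a file on disk (the data is made up):

```python
import io
import pandas as pd

# io.StringIO stands in for a file path
csv_data = "key\n" + "\n".join(['a', 'b', 'a', 'c', 'a', 'b'] * 2)
reader = pd.read_csv(io.StringIO(csv_data), chunksize=4)

result = pd.Series(dtype='int64')
for chunk in reader:
    # accumulate per-chunk counts; fill_value=0 handles labels unseen so far
    result = result.add(chunk['key'].value_counts(), fill_value=0)

print(result.sort_index())   # a: 6, b: 4, c: 2
```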

6.2 Exporting data to disk

    df.to_csv(path)

    The index can be dropped, e.g.: df.to_csv(path, index=False)

7 Time Series

7.1 Two common imports

    from datetime import datetime

    from datetime import timedelta

7.2 Defining a datetime

t1=datetime(2019,9,23)

7.3 Time Conversion

    Datetime to string:

        t1=datetime(2019,7,23)

        t1.strftime('%Y/%m/%d %H:%M:%S')

    String to datetime:

        datetime.strptime('2019-7-12 9:20','%Y-%m-%d %H:%M')
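Both conversions, runnable:

```python
from datetime import datetime

t1 = datetime(2019, 7, 23, 9, 20)

# datetime -> string
s = t1.strftime('%Y/%m/%d %H:%M:%S')
print(s)   # 2019/07/23 09:20:00

# string -> datetime; %Y (upper case) matches a four-digit year
t2 = datetime.strptime('2019-7-12 9:20', '%Y-%m-%d %H:%M')
print(t2)  # 2019-07-12 09:20:00
```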

7.4 Generating a time series

    Use date_range:

        pd.date_range('20190620','20190628')

        pd.date_range('20190620',periods=10,freq='M')

    Use period_range:

        pd.period_range('2016-10',periods=10,freq='M')

7.5 Converting between timestamps and periods

    Timestamp to period:

        e.g.: s=pd.Series(np.random.randint(5),index=pd.date_range('2016-04-01',periods=5,freq='D'))

        Call s.to_period(); a frequency can also be passed, s.to_period(freq='M')

        Similarly, for period data s1, s1.to_timestamp() converts it back to timestamps

        Note: one is the parameter periods, the other is a period!
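A sketch of the round trip:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5),
              index=pd.date_range('2016-04-01', periods=5, freq='D'))

pm = s.to_period(freq='M')   # every daily stamp falls in the 2016-04 period
print(pm.index[0])           # 2016-04

ts = s.to_period().to_timestamp()  # periods back to timestamps
print(ts.index[0])           # 2016-04-01 00:00:00
```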

7.6 Time resampling

7.6.1 Downsampling: high frequency to low frequency

    ts.resample(sampling rule) followed by an aggregation, e.g.:

    ts.resample('5min',label='right').sum()

    (Older pandas wrote this as ts.resample('5min',how='sum',label='right'); the how argument has since been removed.)

    You can also use groupby function:

        ts.groupby(lambda x:x.month).sum()

        ts.groupby(ts.index.to_period('M')).sum()

7.6.2 Upsampling: low frequency to high frequency

    Timestamp: ts.resample('D').ffill()

    period: ts.resample('A-DEC').sum()
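A minimal sketch of both directions: minute data summed into 5-minute bins, then daily data forward-filled at a 12-hour frequency:

```python
import numpy as np
import pandas as pd

# downsampling: minute data summed into 5-minute bins
ts = pd.Series(np.arange(12),
               index=pd.date_range('2019-01-01', periods=12, freq='min'))
down = ts.resample('5min').sum()
print(down.tolist())   # [0+..+4, 5+..+9, 10+11]

# upsampling: daily data forward-filled at a 12-hour frequency
daily = pd.Series([1, 2], index=pd.date_range('2019-01-01', periods=2, freq='D'))
up = daily.resample('12h').ffill()
print(up.tolist())     # [1, 1, 2]
```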

7.7 Parsing dates when reading files

    df=pd.read_csv(path,parse_dates=True)

8 Data Visualization

    Line chart: ts.plot(figsize=tuple, style=color and line style, title=figure title)

    Scatter plot: df.plot.scatter(x='', y='')

    Bar chart: df.plot.bar(stacked=True)

    Histogram: ts.plot.hist(bins=20)

    Pie chart: s.plot.pie()
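A sketch of the line-chart call; the Agg backend is forced so it runs without a display (matplotlib must be installed):

```python
import matplotlib
matplotlib.use('Agg')   # render off-screen, no display required
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(100).cumsum(),
               index=pd.date_range('2019-01-01', periods=100))
ax = ts.plot(figsize=(8, 4), style='r--', title='random walk')
print(ax.get_title())   # random walk
```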


Origin www.cnblogs.com/zhuome/p/11618789.html