Introduction to Data Analysis: Pandas

Pandas

1. Pandas Overview

pandas is a powerful Python data analysis package built on NumPy. Largely because of pandas, Python has become one of the most widely used and most capable data analysis environments.

1.1 Main features of pandas:

- Provides the DataFrame and Series data structures and their functions
- Integrated time series functionality
- Rich mathematical operations
- Flexible handling of missing data

1.2 Installation: pip install pandas

Import convention: import pandas as pd

2. Series

A Series object is similar to a one-dimensional array. It consists of a set of data and an associated set of labels (the index).

2.1 Creating a Series

The first: pd.Series([4, 5, 6])  # default index

Result:
0    4
1    5
2    6
dtype: int64

The second: a custom index. The index is passed as a list of labels (strings here). Values can still be reached by integer position; in recent pandas versions this is best done through iloc.

pd.Series([4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e'])
Result:
a    4
b    5
c    6
d    7
e    8
dtype: int64

The third: from a dictionary, whose keys become the index

pd.Series({"a": 1, "b": 2})
Result:
a    1
b    2
dtype: int64

The fourth: a Series in which every value is 0

pd.Series(0, index=['a', 'b', 'c'])
Result:
a    0
b    0
c    0
dtype: int64

2.2 Missing Data

dropna(): filters out the entries whose value is NaN
fillna(): fills in the missing data
isnull(): returns a boolean array in which missing values are True
notnull(): returns a boolean array in which missing values are False
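
The four methods above can be sketched as follows. This is a minimal illustration; the Series contents and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A small Series with two missing entries (illustrative data).
sr = pd.Series([1.0, np.nan, 3.0, np.nan], index=['a', 'b', 'c', 'd'])

missing_mask = sr.isnull()    # True where the value is missing
present_mask = sr.notnull()   # False where the value is missing

dropped = sr.dropna()         # keeps only the entries 'a' and 'c'
filled = sr.fillna(0)         # replaces each NaN with 0

print(missing_mask.tolist())   # [False, True, False, True]
print(dropped.index.tolist())  # ['a', 'c']
print(filled.tolist())         # [1.0, 0.0, 3.0, 0.0]
```

Both dropna() and fillna() return a new Series here and leave sr unchanged.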

2.3 Series Features:

Create a Series from an ndarray: Series(arr)
Operations with a scalar (number): sr * 2
Operations between two Series: sr1 + sr2
Universal functions: np.abs(sr)
Boolean filtering: sr[sr > 0]
Statistical functions: mean(), sum(), cumsum()
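
A short sketch of these array-like features, assuming a small integer array chosen only for illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([-2, -1, 1, 2])
sr = pd.Series(arr)              # Series created from an ndarray

doubled = sr * 2                 # scalar operation, applied element by element
summed = sr + sr                 # operation between two Series
absolute = np.abs(sr)            # NumPy universal function; the result is still a Series
positive = sr[sr > 0]            # boolean filtering keeps the original labels

print(doubled.tolist())          # [-4, -2, 2, 4]
print(absolute.tolist())         # [2, 1, 1, 2]
print(positive.tolist())         # [1, 2]
print(sr.cumsum().tolist())      # [-2, -3, -2, 0]
```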

2.4 Dictionary-like features:

Create a Series from a dictionary: Series(dic)
The in operator: 'a' in sr (tests the index labels), for x in sr (iterates over the values)
Key indexing: sr['a'], sr[['a', 'b', 'd']]
Key slicing: sr['a':'c'] (the end label is included)
Other methods: get('a', default=0), etc.
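
A minimal sketch of the dictionary-like behaviour, using a made-up four-entry Series:

```python
import pandas as pd

sr = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

print('a' in sr)                     # True: `in` tests the index labels
print(list(sr))                      # [1, 2, 3, 4]: iteration yields the values
print(sr['a'])                       # 1
print(sr[['a', 'b', 'd']].tolist())  # [1, 2, 4]
print(sr['a':'c'].tolist())          # [1, 2, 3]: label slices include the end label
print(sr.get('z', default=0))        # 0: the label 'z' is absent
```

Note the difference from a plain dict: iterating over a Series gives values, not keys.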

2.5 Integer index

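When the index itself consists of integers, sr[...] is ambiguous between label and position, so use the explicit attributes:
loc attribute: select by label
iloc attribute: select by position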
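
The difference is easiest to see when the integer labels do not match the positions. A small sketch with made-up labels 10, 20, 30:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
sr = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

print(sr.loc[10])              # x: the entry whose label is 10
print(sr.iloc[0])              # x: the entry at position 0
print(sr.loc[10:20].tolist())  # ['x', 'y']: label slice, end label included
print(sr.iloc[0:1].tolist())   # ['x']: position slice, end position excluded
```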

Series data alignment
During arithmetic, pandas aligns the operands by index before computing. If the two operands have different indexes, the index of the result is the union of the two indexes, and labels present in only one operand produce NaN.
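
A minimal sketch of alignment with two made-up Series whose indexes only partly overlap. The add() method with fill_value is shown as one way to avoid the NaN results; it is an addition to what the text describes.

```python
import pandas as pd

sr1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
sr2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = sr1 + sr2
print(total.index.tolist())     # ['a', 'b', 'c', 'd']: union of both indexes
print(total.isnull().tolist())  # [True, False, False, True]: 'a' and 'd' have no partner

# Treat a missing partner as 0 instead of producing NaN.
filled_total = sr1.add(sr2, fill_value=0)
print(filled_total.tolist())    # [1.0, 12.0, 23.0, 30.0]
```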

3. DataFrame

A DataFrame is a tabular data structure, comparable to a two-dimensional array, that contains an ordered set of columns. It can be viewed as a dictionary of Series that share a common index.
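
The "dictionary of Series" view can be sketched directly. The column names 'one' and 'two' and the data are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([4, 5, 6], index=['a', 'b', 'c']),
})

print(df.shape)                    # (3, 2)
col = df['one']                    # looking up a key returns one column
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True: every column shares the row index
```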

3.1 Common attributes and methods:

index: gets the row index
columns: gets the column index
T: transpose
values: gets the values as a two-dimensional array
describe(): gets quick summary statistics
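
A quick sketch of these attributes on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

print(df.index.tolist())         # ['a', 'b', 'c']
print(df.columns.tolist())       # ['one', 'two']
print(df.T.shape)                # (2, 3): rows and columns swapped
print(df.values.tolist())        # [[1, 4], [2, 5], [3, 6]]

stats = df.describe()            # count, mean, std, min, quartiles, max per column
print(stats.loc['mean', 'one'])  # 2.0
```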

3.2 Indexing and slicing

A DataFrame has both a row index and a column index.
Like a Series, a DataFrame can be indexed and sliced in two ways: by label and by position.

Indexing and slicing a DataFrame:
Method 1: two pairs of brackets, taking the column first and then the row.
Method 2 (recommended): the loc / iloc attribute with one pair of brackets and a comma, taking the row first and then the column.
When assigning values to a DataFrame, use Method 2 only. Method 1 chains two indexing operations and may assign to a temporary copy.
The row part and the column part may each be a regular index, a slice, a boolean index, or a fancy index. (Note: when both parts are fancy indexes, the result may differ from what you expect.)
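
Both methods can be compared on a small example. This is a sketch with made-up data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

# Method 1: column first, then row.
print(df['one']['a'])        # 1

# Method 2: row first, then column, inside one pair of brackets.
print(df.loc['a', 'one'])    # 1 (by label)
print(df.iloc[0, 0])         # 1 (by position)

# Slices, lists of labels and boolean masks can be used in either part.
subset = df.loc['a':'b', ['two']]
big = df.loc[df['one'] > 1, 'two']
print(big.tolist())          # [5, 6]

# Assignment through loc modifies df itself.
df.loc['a', 'one'] = 100
print(df.loc['a', 'one'])    # 100
```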

4. Working with time objects

Time series types

Timestamp: a specific moment
Fixed period: for example, January 2019
Time interval: a start time and an end time
Python standard library: datetime
- date, time, datetime, timedelta
- dt.strftime()
- strptime()
Flexible parsing of time objects: the dateutil package
- dateutil.parser.parse()
Handling a whole set of time objects: pandas
- pd.to_datetime(['2018-01-01', '2019-02-02'])
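
The three layers listed above (datetime, dateutil, pandas) can be sketched side by side. The dates and format strings are illustrative choices, not taken from the original post.

```python
from datetime import datetime, timedelta

import pandas as pd
from dateutil import parser

dt = datetime(2019, 1, 15, 8, 30)
text = dt.strftime('%Y-%m-%d %H:%M')              # datetime -> string
back = datetime.strptime(text, '%Y-%m-%d %H:%M')  # string -> datetime, format given explicitly
next_day = dt + timedelta(days=1)                 # arithmetic with timedelta

parsed = parser.parse('2019-01-15')               # dateutil infers the format itself

stamps = pd.to_datetime(['2018-01-01', '2019-02-02'])  # converts a whole list at once
print(text)                   # 2019-01-15 08:30
print(type(stamps).__name__)  # DatetimeIndex
```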

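Generating an array of time objects: date_range
start: start time
end: end time
periods: number of time points to generate
freq: frequency, 'D' (day) by default; other options include H(our), W(eek), B(usiness), S(emi-)M(onth), (min)T(es), S(econd), A(year), ... Recent pandas versions prefer the lowercase aliases 'h', 'min' and 's', and 'YE' for year end.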
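
A minimal sketch of date_range with the parameters above. Only the 'D', 'W' and 'B' frequencies are used, since their aliases are the same across pandas versions; the dates are illustrative.

```python
import pandas as pd

# start + end, default frequency 'D' (one entry per calendar day)
days = pd.date_range(start='2019-01-01', end='2019-01-05')

# start + periods, weekly frequency (anchored on Sundays by default)
weeks = pd.date_range(start='2019-01-01', periods=3, freq='W')

# business days skip Saturday and Sunday; 2019-01-04 is a Friday
bdays = pd.date_range(start='2019-01-04', periods=3, freq='B')

print(len(days))                              # 5
print(weeks[0].strftime('%Y-%m-%d'))          # 2019-01-06
print(bdays.strftime('%Y-%m-%d').tolist())    # ['2019-01-04', '2019-01-07', '2019-01-08']
```
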
Time series
A time series is a Series or DataFrame whose index consists of time objects. An index made of datetime objects is stored as a DatetimeIndex.
Special features:

Slicing by passing a year or a date string
Slicing by passing a date range
Rich supporting functions: resample(), strftime(), ...
Batch conversion to datetime objects: to_pydatetime()
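
These features can be sketched on a six-day series that straddles a year boundary. The data and the two-day resampling interval are illustrative assumptions.

```python
from datetime import datetime

import pandas as pd

idx = pd.date_range(start='2018-12-30', periods=6, freq='D')
ts = pd.Series(range(6), index=idx)   # 2018-12-30 -> 0, ..., 2019-01-04 -> 5

only_2019 = ts.loc['2019']                     # select by year string
jan_range = ts.loc['2019-01-02':'2019-01-03']  # select by date range, both ends included
two_day = ts.resample('2D').sum()              # regroup into two-day bins and sum each bin

labels = ts.index.strftime('%m/%d')            # format every timestamp as a string
py_dates = ts.index.to_pydatetime()            # array of standard datetime objects

print(only_2019.tolist())                 # [2, 3, 4, 5]
print(jan_range.tolist())                 # [3, 4]
print(two_day.tolist())                   # [1, 5, 9]
print(isinstance(py_dates[0], datetime))  # True
```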

5. Data grouping and aggregation

In data analysis we often need to split the data into groups and then compute something on each group. These operations are usually an important part of the work.
This chapter covers:
grouping (the GroupBy mechanism)
aggregation (applying a function within a group)
apply
pivot tables and cross-tabulation

5.1 Grouping (the GroupBy mechanism)

A pandas object (a Series, a DataFrame, or something else) is split into several groups according to one or more keys that you provide. The split is performed along a particular axis of the object; a DataFrame, for example, can be grouped on its rows or on its columns. A function is then applied to each group to produce a new value. Finally, all the results are combined into the final result object.
Grouping keys can take several forms:
a list or array whose length matches the axis being grouped
a value indicating a column name of the DataFrame
a dictionary or Series giving the correspondence between the values on the axis being grouped and the group names
a function to be applied to the axis index or to the individual labels in the index
The last three are only shortcuts; in the end they all produce a set of values used to split the object.

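Additional notes:
1. The group keys may be any arrays of the right length.
2. When grouping, columns that do not hold numeric data (columns such as key1 and key2, for example) are excluded from the result of numeric aggregations. In pandas 2.0 and later this requires passing numeric_only=True.
3. The GroupBy size method returns a Series containing the size of each group.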
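
The four key forms and the size method can be sketched as follows. The DataFrame, the key1/key2/data1 column names, and the helper variables are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Key form: a column name of the DataFrame.
by_name = df.groupby('key1')['data1'].mean()

# Key form: an array with one entry per row.
years = np.array([2018, 2018, 2019, 2019, 2019])
by_array = df['data1'].groupby(years).sum()

# Key form: a dictionary mapping index labels to group names.
mapping = {0: 'x', 1: 'x', 2: 'y', 3: 'y', 4: 'y'}
by_dict = df['data1'].groupby(mapping).sum()

# Key form: a function called on each index label.
by_func = df['data1'].groupby(lambda label: label % 2).sum()

# size() counts the rows in each group.
sizes = df.groupby(['key1', 'key2']).size()

print(by_array.tolist())   # [3.0, 12.0]
print(by_func.tolist())    # [9.0, 6.0]
print(sizes.tolist())      # [2, 1, 1, 1]
```

Selecting the 'data1' column before aggregating sidesteps the question of how non-numeric columns are treated.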

5.2 Aggregation (applying a function to each group)

Aggregation refers to any data transformation that produces a scalar value from an array. As seen above, GroupBy does not return a visible result directly but an intermediate object; the result is obtained by calling a computation such as mean, count, or min on it. Other common ones are:
sum: sum of the non-NA values
median: arithmetic median of the non-NA values
std, var: unbiased (n-1 denominator) standard deviation and variance
prod: product of the non-NA values
first, last: first and last non-NA values
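
A small sketch of these aggregations, using made-up data that contains one NaN so the "non-NA" behaviour is visible. The agg call with a list of names is an extra shown for convenience.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 5.0],
})
grouped = df.groupby('key')['value']   # intermediate GroupBy object, nothing computed yet

print(grouped.sum().tolist())     # [1.0, 8.0]: the NaN in group 'a' is skipped
print(grouped.count().tolist())   # [1, 2]: number of non-NA values
print(grouped.median().tolist())  # [1.0, 4.0]
print(grouped.first().tolist())   # [1.0, 3.0]
print(grouped.var().loc['b'])     # 2.0: ((3-4)**2 + (5-4)**2) / (2 - 1)

# Several aggregations at once, one result column per function name.
summary = grouped.agg(['min', 'max', 'mean'])
print(summary.loc['b'].tolist())  # [3.0, 5.0, 4.0]
```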

5.3 apply

apply is the most flexible of the GroupBy methods. It splits the object being processed into pieces, calls the passed function on each piece, and finally combines the results.
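
A minimal sketch of split, apply and combine. The helper top_two is a hypothetical function written for this example; it returns more than one value per group, which plain aggregation cannot do.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b'],
    'value': [3, 1, 2, 5, 4],
})

def top_two(values):
    """Return the two largest entries of one group (hypothetical helper)."""
    return values.nlargest(2)

# Each group's 'value' column is passed to top_two; the pieces are then combined.
result = df.groupby('key')['value'].apply(top_two)

print(result.tolist())                            # [3, 2, 5, 4]
print(result.index.get_level_values(0).tolist())  # ['a', 'a', 'b', 'b']
```

The combined result carries the group key as the outer index level and the original row label as the inner one.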

6. Other commonly used methods

Common pandas methods (for Series and DataFrame)
mean(axis=0, skipna=False)
sum(axis=1)
sort_index(axis, ..., ascending)  # sort by row or column index
sort_values(by, axis, ascending)  # sort by values
apply(func, axis=0)  # apply a custom function to each row or column; func returns a scalar or a Series
applymap(func)  # apply the function to each element of a DataFrame (renamed DataFrame.map in pandas 2.1)
map(func)  # apply the function to each element of a Series
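
A short sketch of these methods on a made-up DataFrame with one missing value. applymap is left out because newer pandas versions renamed it to DataFrame.map, so its availability depends on the installed version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [3.0, 1.0, np.nan], 'y': [1.0, 2.0, 3.0]},
    index=['c', 'a', 'b'],
)

col_means = df.mean(axis=0)                   # NaN is skipped by default (skipna=True)
strict_means = df.mean(axis=0, skipna=False)  # a NaN makes that column's mean NaN
row_sums = df.sum(axis=1)                     # one sum per row

by_label = df.sort_index()                            # rows ordered a, b, c
by_value = df.sort_values(by='y', ascending=False)    # rows ordered by y, largest first

spans = df.apply(lambda col: col.max() - col.min(), axis=0)  # one scalar per column
doubled = df['y'].map(lambda v: v * 2)                       # element by element on a Series

print(col_means.tolist())        # [2.0, 2.0]
print(row_sums.tolist())         # [4.0, 3.0, 3.0]
print(by_label.index.tolist())   # ['a', 'b', 'c']
print(by_value.index.tolist())   # ['b', 'a', 'c']
print(doubled.tolist())          # [2.0, 4.0, 6.0]
```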

Origin www.cnblogs.com/allenchen168/p/12405307.html