Getting Acquainted with pandas: Series and DataFrame

import numpy as np
import pandas as pd

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops (array-oriented programming).

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. → pandas is at its best when processing tabular data.

Since becoming an open source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases. The developer community has grown to over 800 distinct contributors, who have been helping build the project as they have used it to solve their day-to-day data problems.

Throughout the rest of the book, I use the following import convention for pandas:

import pandas as pd
# from pandas import Series, DataFrame

Thus, whenever you see pd in code, it is referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are frequently used:

from pandas import Series, DataFrame

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data. → A Series is like a one-dimensional NumPy array with an index attached.

obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

obj.values
array([ 4,  7, -5,  3], dtype=int64)
obj.index  # like range(4)
RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

"Print the index"
obj2.index
d    4
b    7
a   -5
c    3
dtype: int64
'Print the index'
Index(['d', 'b', 'a', 'c'], dtype='object')

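Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

"Select a single element: [label]"
obj2['a']

"Modify an element by direct assignment; the change is made in place"
obj2['d'] = 'cj'

"Select several elements: [[labels]]. Here a missing label gives NaN plus a FutureWarning; pandas 1.0 and later raise KeyError instead"
obj2[['c', 'a', 'd', 'xx']]

'Select a single element: [label]'
-5
'Modify an element by direct assignment; the change is made in place'
'Select several elements: [[labels]]. Here a missing label gives NaN plus a FutureWarning; pandas 1.0 and later raise KeyError instead'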
c:\python\python36\lib\site-packages\pandas\core\series.py:851: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]
c       3
a      -5
d      cj
xx    NaN
dtype: object
"Assigning to an element modifies the Series in place by default"
obj2
'Assigning to an element modifies the Series in place by default'
d    cj
b     7
a    -5
c     3
dtype: object

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. → Several index labels are collected in a list, and that list is passed as the key.
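The FutureWarning in the output above points to .reindex() as the alternative for label lists that may contain missing labels. A minimal sketch of that suggestion, assuming a fresh copy of obj2 with its original integer values (the variable names subset and with_missing are only illustrative):

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Every label in this list exists, so plain list selection works.
subset = obj2[['c', 'a', 'd']]

# 'xx' is not in the index; reindex inserts NaN for it instead of raising.
with_missing = obj2.reindex(['c', 'a', 'd', 'xx'])
print(with_missing)
```

Because NaN is a float, the reindexed result is converted to a floating-point dtype.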

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

"Filter the Series for elements greater than 0, keeping their index labels"
"Restore the data first: a string cannot be compared with a number"
obj2['d'] = 4 

obj2[obj2 > 0]

"Scalar multiplication"
obj2 * 2

"Call a NumPy function"
"cj's note: pass .values to NumPy, otherwise an error is raised. The cause is that obj2 now has dtype object, not the index"
np.exp(obj.values)
'Filter the Series for elements greater than 0, keeping their index labels'
'Restore the data first: a string cannot be compared with a number'
d    4
b    7
c    3
dtype: object
'Scalar multiplication'
d      8
b     14
a    -10
c      6
dtype: object
'Call a NumPy function'
"cj's note: pass .values to NumPy, otherwise an error is raised. The cause is that obj2 now has dtype object, not the index"
array([5.45981500e+01, 1.09663316e+03, 6.73794700e-03, 2.00855369e+01])

"cj test"
obj2 > 0

np.exp(obj2)
'cj test'

d     True
b     True
a    False
c     True
dtype: bool

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-39-86002a981278> in <module>
      2 obj2 > 0
      3 
----> 4 np.exp(obj2)

AttributeError: 'int' object has no attribute 'exp'
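The traceback above comes from the object dtype that obj2 picked up when a string was assigned to it, not from the index. The sketch below, which assumes a freshly built obj2 and uses illustrative names (mixed, restored), shows that a NumPy function applied to a numeric Series keeps the labels, and that converting an object Series back with astype makes it usable again:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A NumPy function applied to a numeric Series returns a Series
# that carries the same index labels.
exp_vals = np.exp(obj2)

# Reproduce the notebook state: object dtype after a string assignment.
mixed = obj2.astype(object)
mixed['d'] = 'cj'
mixed['d'] = 4  # the value is numeric again, but the dtype is still object

# Convert back to a numeric dtype before calling NumPy functions.
restored = mixed.astype('int64')
exp_restored = np.exp(restored)
print(exp_restored)
```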

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

"Works like a dict: iteration and membership tests operate on the keys (the index) by default"

'b' in obj2
'xxx' in obj2


'Works like a dict: iteration and membership tests operate on the keys (the index) by default'

True

False
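A few more dict-style operations, as a small sketch on a fresh obj2. The methods get, items and to_dict are part of the Series API; the variable names are only for illustration:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

has_b = 'b' in obj2             # membership tests look at the index labels
fallback = obj2.get('xxx', 0)   # like dict.get: the default is returned for a missing label
pairs = list(obj2.items())      # (label, value) pairs in index order
as_dict = obj2.to_dict()        # back to a plain Python dict

print(has_b, fallback, pairs[0], as_dict)
```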

Should you have data contained in a Python dict, you can create a Series from it by passing the dict: → A Python dict converts directly to a Series, with the dict keys as the index.

sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}

"A dict can be converted to a Series directly"
obj3 = pd.Series(sdata)
obj3
'A dict can be converted to a Series directly'

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

# cj test

"Nested dicts work too, but only the top level becomes the index"

cj_data = {'Ohio':{'sex':1, 'age':18}, 'Texas':{'cj':123}}

pd.Series(cj_data)
'Nested dicts work too, but only the top level becomes the index'

Ohio     {'sex': 1, 'age': 18}
Texas              {'cj': 123}
dtype: object

When you are only passing a dict, the index in the resulting Series takes the dict's keys; older pandas releases sorted them, while the output above shows them in insertion order. You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

"Override the default index"

states = ['California', 'Ohio', 'Oregon', 'Texas']

"Labels found in the dict keep their values; labels not found show up as NaN"
obj4 = pd.Series(sdata, index=states)
obj4
'Override the default index'

'Labels found in the dict keep their values; labels not found show up as NaN'

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is used in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.

I will use the terms 'missing' and 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

"pd.isnull() and pd.notnull() detect missing values"
pd.isnull(obj4)

"The positive form"
pd.notnull(obj4)

"Series also has these as instance methods:"
obj4.notnull()
'pd.isnull() and pd.notnull() detect missing values'

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

'The positive form'

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

'Series also has these as instance methods:'

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

I discuss working with missing data in more detail in Chapter 7.
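Since isnull and notnull return boolean Series, they combine naturally with sum and with boolean indexing. A short sketch, rebuilding obj4 from the sdata and states defined earlier (n_missing and present are illustrative names):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(obj4.isnull().sum())

# The notnull mask keeps only the entries that have a value.
present = obj4[obj4.notnull()]

print(n_missing)
print(present)
```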

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. → Arithmetic between Series matches values by index label, so the index is what matters, not the position.

obj3
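obj4

"obj3 + obj4: values with the same index label are added; labels present in only one Series give NaN"
obj3 + obj4
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

'obj3 + obj4: values with the same index label are added; labels present in only one Series give NaN'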

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation. → Aligning data by label resembles joining tables on a key.
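When a label exists in only one operand, the + operator yields NaN. As a hedged sketch of one way around that, the Series.add method accepts a fill_value that stands in for the side that has no such label (total and filled are illustrative names):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# Plain addition: 'Utah' exists only in obj3, so the sum there is NaN.
total = obj3 + obj4

# With fill_value=0 the missing side is treated as 0 before adding.
# 'California' stays NaN because it has no value in either operand.
filled = obj3.add(obj4, fill_value=0)

print(filled)
```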

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

"Set the name of the Series: obj4.name = 'xxx'"
obj4.name = 'population'

"Set the name of the index: obj4.index.name = 'xxx'"
obj4.index.name = 'state'

obj4
"Set the name of the Series: obj4.name = 'xxx'"

"Set the name of the index: obj4.index.name = 'xxx'"

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment:

obj

"Assigning to obj.index changes the index in place; a length mismatch raises an error"

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

'Assigning to obj.index changes the index in place; a length mismatch raises an error'

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
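The length-mismatch case mentioned in the note can be sketched as follows. The three-label list is a made-up example, and the variable mismatch_error only exists to capture the exception for inspection:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

# Four values but only three labels: pandas refuses the assignment.
try:
    obj.index = ['only', 'three', 'labels']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(obj)
```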

DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are outside the scope of this book.

While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series, and the columns follow the order of the dict keys here (older pandas releases placed them in sorted order):

frame
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.

For large DataFrames, the head method selects only the first five rows: → df.head() shows the first 5 rows by default

frame.head()
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

"Arrange the columns in the given order"
pd.DataFrame(data, columns=['year', 'state', 'pop'])
'Arrange the columns in the given order'

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column that isn't contained in the dict, it will appear with missing values in the result:

frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

"A column that is not in the data is created and filled with NaN"
frame2

"An index that does not fit the data raises an error; frame2.columns shows the column index"
frame2.columns
'A column that is not in the data is created and filled with NaN'

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
'An index that does not fit the data raises an error; frame2.columns shows the column index'

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

"Bracket indexing: df[column_name]"
frame2['state']

"Attribute access: df.column_name"
frame2.state
'Bracket indexing: df[column_name]'

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

'Attribute access: df.column_name'

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython are provided as a convenience.
frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.

Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.
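Attribute access has a second limit beyond names that are not valid identifiers: it cannot reach a column whose name collides with an existing DataFrame attribute or method. The 'pop' column used in this chapter is such a case, because DataFrame already defines a pop method. A small sketch with a shortened, illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

by_key = frame['pop']   # bracket notation always returns the column
by_attr = frame.pop     # this is the DataFrame.pop method, not the column

print(type(by_key).__name__)   # Series
print(callable(by_attr))       # True

# For a name with no collision, both notations give the same Series.
same = frame.state.equals(frame['state'])
```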

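Rows can also be retrieved by position or name with the special loc attribute (much more on this later): → the loc attribute selects rows

"Select the row whose index label is three: loc[label]"
frame2.loc['three']

"Select the second and third rows with frame.loc[1:2] (label slices include both ends)"
frame.loc[1:2]
'Select the row whose index label is three: loc[label]'

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

'Select the second and third rows with frame.loc[1:2] (label slices include both ends)'

  state  year  pop
1  Ohio  2001  1.7
2  Ohio  2002  3.6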
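On the default integer index, label slicing and position slicing look alike but treat the end point differently. The sketch below compares loc with iloc, the position-based counterpart in pandas that is not used in the text above:

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

by_label = frame.loc[1:2]       # labels 1 and 2, end point included
by_position = frame.iloc[1:3]   # positions 1 and 2, end point excluded

print(by_label.equals(by_position))
```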

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values. Note that the code below assigns to 'debet', a name that does not exist yet, so a new column is created next to 'debt':

frame2['debet'] = 16.5

"A scalar fills every row of the column, in place"
frame2
'A scalar fills every row of the column, in place'

       year   state  pop debt  debet
one    2000    Ohio  1.5  NaN   16.5
two    2001    Ohio  1.7  NaN   16.5
three  2002    Ohio  3.6  NaN   16.5
four   2001  Nevada  2.4  NaN   16.5
five   2002  Nevada  2.9  NaN   16.5
six    2003  Nevada  3.2  NaN   16.5
"Modified in place; the array is matched to the rows by position"
frame2['debet'] = np.arange(6)

"Drop the debt column: axis=1 means columns, inplace=True deletes in place"
frame2.drop(labels='debt', axis=1, inplace=True)

frame2
'Modified in place; the array is matched to the rows by position'

'Drop the debt column: axis=1 means columns, inplace=True deletes in place'

       year   state  pop  debet
one    2000    Ohio  1.5      0
two    2001    Ohio  1.7      1
three  2002    Ohio  3.6      2
four   2001  Nevada  2.4      3
five   2002  Nevada  2.9      4
six    2003  Nevada  3.2      5
frame2.columns
Index(['year', 'state', 'pop', 'debet'], dtype='object')

frame2['debet']
one      0
two      1
three    2
four     3
five     4
six      5
Name: debet, dtype: int32

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

"Aligned automatically by index label"
frame2['debet'] = val

frame2
'Aligned automatically by index label'

       year   state  pop  debet
one    2000    Ohio  1.5    NaN
two    2001    Ohio  1.7   -1.2
three  2002    Ohio  3.6    NaN
four   2001  Nevada  2.4   -1.5
five   2002  Nevada  2.9   -1.7
six    2003  Nevada  3.2    NaN

Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict.

As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':

frame2['eastern'] = frame2.state == 'Ohio'

"First add a new column, eastern"
frame2

"Then delete the column with the del keyword"
del frame2['eastern']

"Show the column names: eastern is gone. The drop() method would work too"
frame2.columns
'First add a new column, eastern'

       year   state  pop  debet  eastern
one    2000    Ohio  1.5    NaN     True
two    2001    Ohio  1.7   -1.2     True
three  2002    Ohio  3.6    NaN     True
four   2001  Nevada  2.4   -1.5    False
five   2002  Nevada  2.9   -1.7    False
six    2003  Nevada  3.2    NaN    False
'Then delete the column with the del keyword'

'Show the column names: eastern is gone. The drop() method would work too'

Index(['year', 'state', 'pop', 'debet'], dtype='object')

The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method.
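A minimal sketch of the explicit copy, using a shortened, illustrative frame. Whatever the view semantics of the installed pandas version, a Series obtained through copy is independent of the DataFrame it came from:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Take an explicit copy of the column, then modify only the copy.
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

print(frame)      # the DataFrame still holds 1.5 in the first row
print(pop_copy)
```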

Another common form of data is a nested dict of dicts:

pop = {
    'Nevada': {2001:2.4, 2002:2.9},
    'Ohio': {2000:1.5, 2001:1.7, 2002:3.6}
}

If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:

frame3 = pd.DataFrame(pop)
"Outer dict keys become the columns, inner dict keys become the index"
frame3
'Outer dict keys become the columns, inner dict keys become the index'

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

"Transpose"
frame3.T
'Transpose'

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
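If the outer keys should become rows from the start, transposing is one option. Another, sketched here as an alternative that the text above does not use, is the DataFrame.from_dict constructor with orient='index'. The order of the resulting year labels can differ between pandas versions, so the example looks values up by label only:

```python
import pandas as pd

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

frame3 = pd.DataFrame(pop)   # outer keys become columns
transposed = frame3.T        # outer keys become rows

# Same row/column roles as the transpose, built directly from the dict.
by_row = pd.DataFrame.from_dict(pop, orient='index')

print(by_row.loc['Ohio', 2000])
print(transposed.loc['Ohio', 2000])
```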

The keys in the inner dicts are combined and sorted to form the index in the result (newer pandas releases may keep the keys in the order they are encountered instead of sorting them). This isn't true if an explicit index is specified:

# pd.DataFrame(pop, index=('a', 'b','c'))

Dicts of Series are treated in much the same way.

pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2]
}

pd.DataFrame(pdata)
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4

For a complete list of things you can pass to the DataFrame constructor, see Table 5-1.
If a DataFrame's index and columns have their name attributes set, these will also be displayed:

frame3.index.name = 'year'
frame3.columns.name = 'state'

frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

frame3.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns.

"A dtype that can hold every column's data is chosen automatically"
frame2.values
'A dtype that can hold every column's data is chosen automatically'

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
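How the common dtype is chosen can be seen on a small, illustrative frame: integer and float columns together give a float array, while adding a string column forces the object dtype.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

# int64 and float64 columns fit into a single float64 array.
numeric = frame[['year', 'pop']].values

# With the string column included, only object can hold everything.
mixed = frame.values

print(numeric.dtype, mixed.dtype)
```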

Table 5-1 Possible data inputs to DataFrame constructor

  • 2D ndarray A matrix of data, passing optional row and column labels
  • ... (the remaining entries are left for when they are needed)

Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

obj = pd.Series(range(3), index=['a', 'b', 'c'])

index = obj.index
index

index[1:]
obj
Index(['a', 'b', 'c'], dtype='object')

Index(['b', 'c'], dtype='object')

a    0
b    1
c    2
dtype: int64

Index objects are immutable and thus can't be modified by the user:

index[1] = 'd'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-14-a452e55ce13b> in <module>
----> 1 index[1] = 'd'

c:\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   2063 
   2064     def __setitem__(self, key, value):
-> 2065         raise TypeError("Index does not support mutable operations")
   2066 
   2067     def __getitem__(self, key):

TypeError: Index does not support mutable operations

"The index is immutable"
index
'The index is immutable'

Index(['a', 'b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64

obj2.index is labels
True

Unlike Python sets, a pandas Index can contain duplicate labels.

Selections with duplicate labels will select all occurrences of that label.
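A small sketch of duplicate labels in practice. The labels 'foo' and 'bar' and the variable names are made up for illustration:

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
obj = pd.Series(range(4), index=dup_labels)

# Selecting a duplicated label returns every matching entry as a Series.
foo_rows = obj['foo']

print(dup_labels.is_unique)   # False
print(foo_rows)
```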

Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2.

  • append Concatenate with additional Index objects, producing a new index
  • difference Compute set difference as Index
  • intersection Compute set intersection
  • union Compute set union
  • isin Compute boolean array indicating whether each value is contained in the passed collection
  • delete Compute new index with element at index i deleted
  • drop Compute new index by deleting passed values
  • insert Compute new index by inserting element at index i
  • is_unique Return True if the index has no duplicate values
  • unique Compute the array of unique values in the index.
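Several of the methods listed above can be tried on two small Index objects. The labels are arbitrary; each call returns a new object and leaves the original Index unchanged, consistent with Index being immutable:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

combined = left.union(right)        # labels found in either Index
shared = left.intersection(right)   # labels found in both
only_left = left.difference(right)  # labels found only in left
mask = left.isin(['a', 'd'])        # boolean array, one entry per label

inserted = left.insert(1, 'z')      # new Index with 'z' at position 1
without_first = left.delete(0)      # new Index without the element at position 0
dropped = left.drop(['c'])          # new Index without the passed labels

print(combined, shared, only_left, mask, sep='\n')
```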

Origin www.cnblogs.com/chenjieyouge/p/11869423.html