pandas Data Analysis Tutorial Series (1): Starting with Series


Starting today I am publishing a series of articles on the data analysis tool pandas. PyCharm with Python 3.6+ is the recommended setup. Whether you are a complete beginner or have already started with pandas, you should be able to pick up some practical material from this series.

Excerpt from Baidu Encyclopedia: pandas is a tool built on NumPy, created to solve data analysis tasks. pandas incorporates a number of libraries and some standard data models, and provides the tools needed to work efficiently with large data sets. It offers a large number of functions and methods that let us process data quickly and conveniently. You will soon find that it is one of the key reasons Python is a powerful and efficient data analysis environment.

Although pandas is built on NumPy, I do not intend to cover NumPy in detail before starting this series. NumPy focuses on the mathematics of multidimensional arrays and matrices, whereas pandas was designed from the start to solve practical problems, so I think we can go straight to pandas. Throughout the series I will assume readers have no NumPy background; where NumPy knowledge is needed, I will explain it on the spot. I will try to get through the whole series in the simplest language and with the fewest prerequisites.

As the opening article of the series, this one has a single goal: to make every reader familiar with the concept and basic operations of one pandas data structure, the Series.

A Series is similar to a one-dimensional array. It consists of a set of data (integers, floats, strings, or other Python objects) and an index (labels) of the same length. For example:

import pandas as pd
# label 1 indexes the value 'a', label 2 indexes the value 'b', ...
s = pd.Series(data=['a','b','c','d'],index=[1,2,3,4])
print(s)

Three ways to create a Series

For the constructor pd.Series(), the three parameters we care most about are data (the values), index (the labels), and dtype (the data type). They can be read back through the Series attributes values, index, and dtype, respectively.

# continues from the previous snippet; the same applies below
print(s.values)
print(s.index)
print(s.dtype)

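data supplies the values and is the one argument you always pass in practice. If index is omitted, it defaults to range(len(data)); had the code above not specified index, the labels would be [0, 1, 2, 3] instead of [1, 2, 3, 4]. If dtype is omitted, pandas infers it from the data; for the strings above the result was object.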
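
The defaults described above can be checked with a small sketch. The variable names (letters, ints, floats, as_float) are made up for this illustration and are not part of the tutorial's running example.

```python
import pandas as pd

# No index given: pandas falls back to positions 0..len(data)-1
letters = pd.Series(['a', 'b', 'c', 'd'])
print(list(letters.index))   # [0, 1, 2, 3]

# No dtype given: pandas infers it from the data
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5])
print(ints.dtype, floats.dtype)

# dtype given explicitly: the values are converted on construction
as_float = pd.Series([1, 2, 3], dtype='float64')
print(as_float.dtype)
```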

Creating from an array (list)

data = ['l','o','v','e']
s1 = pd.Series(data=data)
print(s1)

Creating from a dictionary

# all scores are integers so that the comparisons and sorting later on work
data = {'math':100,'english':94,'chinese':95}
s2 = pd.Series(data=data)
print(s2)

As you can see, the dictionary keys become the index and the dictionary values become the data of the new Series.
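
A minimal sketch of the key-to-label mapping follows. The second half is a variation the original text does not cover: passing index together with a dictionary selects and orders the keys. The names scores and picked, and the missing label 'music', are assumptions for illustration.

```python
import pandas as pd

scores = {'math': 100, 'english': 94, 'chinese': 95}

# Keys become labels, values become data
s = pd.Series(scores)
print(list(s.index))   # ['math', 'english', 'chinese']

# Variation: index selects and orders the keys;
# a label that is not in the dictionary ('music' here) gets NaN
picked = pd.Series(scores, index=['chinese', 'math', 'music'])
print(picked)
```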

Creating from a constant

When creating a Series this way, specify index so that pandas knows which labels to create; every label maps to the same value, namely the constant we supply.

s3 = pd.Series(1,index=[1,2,3,4,5])
print(s3)

Four ways to query a Series

Take the Series s2 created above as the example:

Slicing

A Series resembles a list and likewise supports slicing:

print(s2[1:3])

Two things to note about slicing. First, positions start at zero. Second, the interval is closed on the left and open on the right, so [1:3] covers only positions 1 and 2, i.e. the second and third items of the Series. Also note that slice positions have nothing to do with the index labels.
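
To make the difference between positions and labels explicit, here is a small sketch using .iloc (position-based) and .loc (label-based). These two accessors are not used in the original text; they are shown only as an unambiguous way to spell out each kind of slice.

```python
import pandas as pd

s = pd.Series({'math': 100, 'english': 94, 'chinese': 95})

# Positional slice: start included, stop excluded
by_position = s.iloc[1:3]
print(list(by_position.index))   # ['english', 'chinese']

# Label slice: both end labels are included
by_label = s.loc['math':'english']
print(list(by_label.index))      # ['math', 'english']
```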

Indexing by label

This behaves like looking up a value by key in a dictionary:

print('math',s2['math'])

s2.get('math') also returns 100. If you are not sure whether 'math' exists in s2, s2.get('math', 101) sets a default value of 101: when the label is missing, 101 is returned and no error is raised.
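
The difference between [] and get() for a missing label can be sketched as follows. The label 'physics' is a made-up example of a key that is not in the Series.

```python
import pandas as pd

s = pd.Series({'math': 100, 'english': 94, 'chinese': 95})

print(s.get('math'))           # 100
print(s.get('physics'))        # None (no default supplied)
print(s.get('physics', 101))   # 101

# Plain [] indexing raises for a missing label instead
try:
    s['physics']
except KeyError as exc:
    print('KeyError:', exc)
```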

head()/tail()

As the names suggest, head() takes the first few items and tail() takes the last few.

print(s2.head())
print(s2.head(2))

Both take five items by default; if the Series has fewer than five, all of them are returned.
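
A short sketch of both functions, on a made-up Series of seven numbers so that the default of five is visible:

```python
import pandas as pd

s = pd.Series(range(10, 17))   # 7 values: 10..16

print(list(s.head()))     # [10, 11, 12, 13, 14]
print(list(s.tail(2)))    # [15, 16]

# Fewer than five items: head() simply returns everything
short = pd.Series([1, 2, 3])
print(len(short.head()))  # 3
```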

Conditional queries

print("\nSubjects with a score above 95:\n",s2[s2>95])
print("\nSubjects with a score equal to 95:\n",s2[s2==95])
print("\nSubjects with a score of 95 or above:\n",s2[s2>=95])
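
What happens inside the square brackets can be sketched in two steps: the comparison first produces a Boolean Series (a mask), and indexing with that mask keeps only the rows marked True. The names mask, high, and mid are made up for this illustration, and combining conditions with & is an addition not covered in the original text.

```python
import pandas as pd

s = pd.Series({'math': 100, 'english': 94, 'chinese': 95})

# Step 1: the comparison yields one True/False per label
mask = s >= 95
print(mask)

# Step 2: indexing with the mask keeps the True rows
high = s[mask]
print(list(high.index))   # ['math', 'chinese']

# Conditions can be combined with & (and) or | (or); wrap each in parentheses
mid = s[(s > 94) & (s < 100)]
print(list(mid.index))    # ['chinese']
```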

Other commonly used functions

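Adding a row of data

Two functions used to handle this: append() and set_value(). append() only accepts Series objects and returns a new Series rather than modifying the original, so you must capture its return value; set_value() works more like adding an item to a built-in Python dictionary and edits in place. Both have since been removed from pandas (set_value() in 1.0, append() in 2.0), so the code below uses their replacements, pd.concat() and assignment by label.

# pd.concat() replaces the removed Series.append(); it also returns a new Series
s2 = pd.concat([s2, pd.Series({'music':98})])
print(s2)
# assignment by label replaces the removed set_value(); it edits s2 in place
s2['history'] = 99
print(s2)

In the pandas versions that still shipped set_value(), calling it printed a deprecation warning that recommended .at[] or .iat[] instead: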

s2.at['history'] = 93
s2.at['geo'] = 91
print(s2)

Experiments show that .at[] behaves almost the same as the label indexing described earlier: it can query, modify, and add entries. Plain [] indexing can likewise modify and add as well as query (.get() is read-only), so use whichever fits best.

.iat[] differs from .at[] by only the letter i and serves the same purpose. The i stands for integer: .iat[] accesses data by integer position only. For example, to change the math score to 99:

s2.iat[0] = 99
print(s2)
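
A side-by-side sketch of the two accessors, on a small made-up Series so that it runs on its own:

```python
import pandas as pd

s = pd.Series({'math': 100, 'english': 94})

# .at[] works with labels
s.at['math'] = 99      # modify an existing label
s.at['geo'] = 91       # a new label extends the Series
print(s.at['geo'])     # 91

# .iat[] works with integer positions
s.iat[1] = 90          # position 1 is 'english'
print(s.iat[0], s.at['english'])   # 99 90
```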

Deleting a row of data

Use the drop() function. Note that by default it does not modify the Series in place, so you need to capture the return value:

s2 = s2.drop('math')
print(s2)

Setting the parameter inplace=True makes the change in place. The following code has exactly the same effect as the code above (run one or the other, not both: dropping a label that is already gone raises a KeyError):

s2.drop('math',inplace=True)
print(s2)
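
The return-value behaviour can be sketched on a fresh Series. The errors='ignore' option and the list form of drop() go slightly beyond the original text and are shown here only as convenient variations.

```python
import pandas as pd

s = pd.Series({'math': 100, 'english': 94, 'chinese': 95})

# drop() returns a new Series; the original is untouched
smaller = s.drop('math')
print('math' in s, 'math' in smaller)   # True False

# errors='ignore' skips labels that are not present instead of raising
same = smaller.drop('math', errors='ignore')
print(len(same))                        # 2

# A list drops several labels at once
print(list(s.drop(['math', 'english']).index))   # ['chinese']
```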

Deduplication

If you only want the distinct values in the data, call unique(); it returns a NumPy array and leaves the Series itself unchanged.
If you want to remove the duplicate entries from the Series, use drop_duplicates(). It also has an inplace parameter; another important parameter is keep, which usually takes 'first' or 'last', i.e. whether to retain the first or the last of the duplicates.

s2['english'] = 95
print(s2.unique(),'\n')
print(s2,'\n')
s2.drop_duplicates(keep='last',inplace=True)
print(s2)
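
The effect of keep is easiest to see by comparing which labels survive. This sketch uses a small made-up Series in which 'english' and 'chinese' share the value 95; keep=False is an extra option not mentioned in the original text.

```python
import pandas as pd

s = pd.Series({'english': 95, 'chinese': 95, 'music': 98})

# unique() returns the distinct values as an array; s is unchanged
print(s.unique())

first = s.drop_duplicates(keep='first')
last = s.drop_duplicates(keep='last')
print(list(first.index))   # ['english', 'music']
print(list(last.index))    # ['chinese', 'music']

# keep=False drops every member of a duplicated group
none_kept = s.drop_duplicates(keep=False)
print(list(none_kept.index))   # ['music']
```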

Sorting

Sorting is done with sort_values(). The parameters to focus on are inplace and ascending (whether to sort in ascending order; the default is True, so the default order is ascending):

s2.sort_values(inplace=True,ascending=True)
print(s2)
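
A brief sketch of the ascending parameter on a made-up Series. sort_index(), which sorts by label rather than by value, is not covered in the original text and is included only for contrast.

```python
import pandas as pd

s = pd.Series({'english': 95, 'music': 98, 'geo': 91})

# Descending order by value; without inplace=True a new Series is returned
desc = s.sort_values(ascending=False)
print(list(desc.index))       # ['music', 'english', 'geo']

# Sorting by label instead of by value
by_label = s.sort_index()
print(list(by_label.index))   # ['english', 'geo', 'music']

# The original order is untouched
print(list(s.index))          # ['english', 'music', 'geo']
```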

Detecting missing values

isnull() and notnull() are a pair of opposite functions. As the names suggest, they detect missing values and return a Boolean Series of the same length as the data:

s2['bio'] = None
print(s2.isnull(),'\n')
print(s2.notnull())
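
Because the result is a Boolean Series, it can be counted or used as a filter, just like the conditional queries earlier. A minimal sketch on a made-up Series with one missing entry:

```python
import pandas as pd

s = pd.Series({'english': 95.0, 'music': 98.0, 'bio': None})

# One True per missing entry
missing = s.isnull()
print(missing)

# True counts as 1, so sum() gives the number of missing values
print(missing.sum())   # 1

# notnull() as a filter keeps only the entries that have a value
present = s[s.notnull()]
print(list(present.index))   # ['english', 'music']
```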

To sum up: Series is one of the two core data structures in pandas and the foundation of the other one, DataFrame. A Series is one-dimensional, whereas a DataFrame is a two-dimensional table. The next article covers DataFrame; please make sure you have digested Series before then.

Origin blog.csdn.net/ygdxt/article/details/104152401