Python has long been strong at data processing and preparation but weaker at data analysis and modeling. pandas helps fill this gap, letting you carry out an entire data analysis workflow in Python without switching to a more domain-specific language such as R. pandas is Python's core data analysis library: it provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. It is an essential tool for data analysis in Python.
The main data structures of pandas are Series (one-dimensional data) and DataFrame (two-dimensional data), which are sufficient to handle most use cases in finance, statistics, social science, engineering, and other fields. Data processing is generally divided into several stages: data wrangling and cleaning, data analysis and modeling, and data visualization and tabulation. pandas is well suited to all of them.
Data source and download: https://www.heywhale.com/mw/dataset/59e715b76d213335f38d4507
1. Create arrays and data frames
1.1 Series
When you create a Series from a list or array, pandas generates an integer index automatically by default; you can also specify an index explicitly.
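import numpy as np
import pandas as pd
from IPython.display import display  # display is also available without this import inside Jupyter

s1 = pd.Series(np.random.randint(1, 10, 5))  # integer index generated automatically by default
s2 = pd.Series(np.random.randint(1, 10, 5), index=list('abcde'))  # explicit row index
s3 = pd.Series({'a': 90, 'b': 80, 'c': 70})  # built from a dict; the keys become the row index
display(s1, s2, s3)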
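Because `np.random.randint` produces different values on every run, a deterministic variant can make the three index styles easier to compare. The following is a minimal sketch with made-up values; the numbers and the variable names `default_idx`, `custom_idx`, and `from_dict` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative values only; any list of numbers would do.
values = [3, 7, 1, 9, 4]

default_idx = pd.Series(values)                      # index is 0, 1, 2, 3, 4
custom_idx = pd.Series(values, index=list('abcde'))  # index is a, b, c, d, e
from_dict = pd.Series({'a': 90, 'b': 80, 'c': 70})   # dict keys become the index

print(list(default_idx.index))  # [0, 1, 2, 3, 4]
print(custom_idx['c'])          # 1, looked up by label
print(from_dict['b'])           # 80
```

With an explicit index, values are retrieved by label rather than by counting positions.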
1.2 DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to an Excel spreadsheet, a SQL table, or a dictionary of Series objects.
pokemon = pd.DataFrame({
    'evolution': ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
    'hp': [45, 39, 44, 45],
    'name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
    'pokedex': ['yes', 'no', 'yes', 'no'],
    'type': ['grass', 'fire', 'water', 'bug']})
# Rename the row index; rename() returns a new DataFrame and leaves pokemon unchanged
pokemon.rename(index={0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'})
# To rename column labels instead, pass columns= in place of index=
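Note that `rename` does not modify the DataFrame in place by default; it returns a renamed copy. A short sketch of this behavior, using a reduced two-column version of the table above (the shortened frame `small` and the new column label `hit_points` are assumptions made for illustration):

```python
import pandas as pd

small = pd.DataFrame({
    'name': ['Bulbasaur', 'Charmander'],
    'hp': [45, 39],
})

# rename() returns a new DataFrame; mapping keys absent from the index are ignored.
by_row = small.rename(index={0: 'A', 1: 'B', 4: 'E'})
by_col = small.rename(columns={'hp': 'hit_points'})

print(list(small.index))     # [0, 1], the original is unchanged
print(list(by_row.index))    # ['A', 'B']
print(list(by_col.columns))  # ['name', 'hit_points']
```

To keep the renamed version, assign the result back to a variable.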
2. Know your data
The sample data (chipotle.tsv) contains orders from a Chipotle fast food restaurant. The fields are described below:
Field name | Description |
---|---|
order_id | Order number |
quantity | Quantity |
item_name | Product name |
choice_description | Product description |
2.1 Data Input
import pandas as pd
chipo = pd.read_csv('chipotle.tsv', sep='\t', header=0)
# If the data has no header row, use header=None
# If the data has a row-index column, index_col=0 uses it as the row index
# For a .csv file the default separator is a comma, i.e. sep=','
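To try these options without downloading the file, the same call can be made against an in-memory string. This is only a sketch on made-up rows: the three sample rows are invented, and only the column names come from the field table above.

```python
import io
import pandas as pd

# Tab-separated text standing in for a file on disk (rows are invented).
raw = (
    "order_id\tquantity\titem_name\n"
    "1\t1\tChips\n"
    "1\t2\tIzze\n"
    "2\t1\tBottled Water\n"
)

with_header = pd.read_csv(io.StringIO(raw), sep='\t', header=0)
no_header = pd.read_csv(io.StringIO(raw), sep='\t', header=None)
indexed = pd.read_csv(io.StringIO(raw), sep='\t', index_col=0)

print(with_header.shape)      # (3, 3)
print(no_header.shape)        # (4, 3), the header line is read as data
print(list(indexed.columns))  # ['quantity', 'item_name']
```

With `header=None` the first line is treated as an ordinary data row, and with `index_col=0` the first column becomes the row index instead of a regular column.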
2.2 Data viewing
1. View the first 10 rows of data
chipo.head(10)
2. View the last 10 rows of data
chipo.tail(10)
3. View the shape of the data, returned as (number of rows, number of columns)
chipo.shape
4. View the row index, which runs from 0 up to (but not including) 4622 with a step of 1
chipo.index
5. View the column index, i.e. the name of each column
chipo.columns
6. View the underlying values as a two-dimensional NumPy ndarray
chipo.values
7. View summary statistics for the numeric columns: count, mean, std (standard deviation), min, 25% (first quartile), 50% (median), 75% (third quartile), and max
chipo.describe()
8. View the column index (Column), the number of non-null values (Non-Null Count), the data type (Dtype), and memory information (memory usage)
chipo.info()
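The inspection attributes above can be tried on a small hand-made frame, where the expected results are easy to verify by eye. A minimal sketch, assuming four invented rows (the frame `orders` is not the real dataset):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'quantity': [1, 2, 1, 4],
    'item_name': ['Chips', 'Izze', 'Bottled Water', 'Chips'],
})

print(orders.shape)          # (4, 3): 4 rows, 3 columns
print(orders.index)          # RangeIndex(start=0, stop=4, step=1)
print(list(orders.columns))  # ['order_id', 'quantity', 'item_name']
print(orders.values.shape)   # (4, 3), a two-dimensional ndarray

stats = orders.describe()    # summarizes the numeric columns only
print(list(stats.columns))   # ['order_id', 'quantity']
print(stats.loc['mean', 'quantity'])  # 2.0
print(stats.loc['50%', 'quantity'])   # 1.5
```

The text column `item_name` is left out of `describe()` by default because it is not numeric.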
2.3 Data selection
1. View the product name column; the result is a Series
chipo.item_name
chipo['item_name']
2. View the product name and quantity columns; the result is a DataFrame
chipo[['item_name','quantity']]
3. View the rows at positions 3 to 15 (exclusive)
chipo[3:15]
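Steps 1 to 3 differ mainly in what they return. The sketch below checks the return types and the exclusive end of a row slice on a small invented frame (the data and the names `one_col`, `two_cols`, and `rows` are assumptions for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    'quantity': [1, 2, 1, 4, 3],
    'item_name': ['Chips', 'Izze', 'Bottled Water', 'Chips', 'Izze'],
})

one_col = orders['item_name']                 # single label -> Series
two_cols = orders[['item_name', 'quantity']]  # list of labels -> DataFrame
rows = orders[1:3]                            # rows at positions 1 and 2; 3 is excluded

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
print(list(rows.index))         # [1, 2]
```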
4. View the orders in which the quantity sold is greater than 5
cond = chipo.quantity > 5  # a boolean Series
chipo[cond]  # the orders whose quantity is greater than 5
5. View the orders for 'Bottled Water' in which the quantity sold is greater than 5
cond = (chipo.quantity > 5) & (chipo.item_name == 'Bottled Water')  # element-wise AND; returns a boolean Series
chipo[cond]
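Both filters rely on boolean masks: a comparison on a column yields one True/False value per row, and indexing with that mask keeps the True rows. A minimal sketch on invented rows (the frame `orders` and its values are assumptions, not taken from chipotle.tsv):

```python
import pandas as pd

orders = pd.DataFrame({
    'quantity': [1, 10, 7, 2],
    'item_name': ['Bottled Water', 'Bottled Water', 'Chips', 'Izze'],
})

big = orders['quantity'] > 5  # boolean Series, one value per row
print(big.tolist())           # [False, True, True, False]

# Each comparison needs its own parentheses because & binds more tightly than > and ==.
cond = (orders['quantity'] > 5) & (orders['item_name'] == 'Bottled Water')
print(orders[cond].index.tolist())  # [1]
```

Use `&` and `|` for element-wise combination; Python's `and` and `or` raise an error on a Series.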
6. Select data by position
chipo.iloc[3]  # returns a Series: the row at position 3
chipo.iloc[3:5, 1:3]  # returns a DataFrame: rows 3 to 4, columns 1 to 2
chipo.iloc[[3, 5], [1, 3]]  # returns a DataFrame: rows 3 and 5, columns 1 and 3
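`iloc` selects by integer position, whereas `loc` selects by label; the two only look the same when the row index is the default 0, 1, 2, and so on. A small sketch with a non-default index makes the difference visible (the frame `df` and its labels are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'quantity': [1, 2, 3, 4], 'item_name': ['a', 'b', 'c', 'd']},
    index=[10, 20, 30, 40],
)

print(df.iloc[1]['quantity'])   # 2, the second row by position
print(df.loc[20, 'quantity'])   # 2, the row labelled 20
print(df.iloc[1:3, 0:1].shape)  # (2, 1), end positions are excluded
print(df.iloc[[0, 3], [1]].index.tolist())  # [10, 40]
```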
7. Add a remark column and assign values to its elements
chipo['remark'] = ''  # add a new column
chipo.loc[0, 'remark'] = '无'  # change a single element in the column ('无' means "none")
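The same pattern works on any DataFrame: assigning a scalar to a new column name fills every row, and `loc` with a row label and a column name changes one cell. A minimal sketch on an invented frame (the `double` column is a hypothetical extra, added only to show a column derived from an existing one):

```python
import pandas as pd

orders = pd.DataFrame({'quantity': [1, 2, 3]})

orders['remark'] = ''                      # the scalar is broadcast to every row
orders.loc[0, 'remark'] = 'none'           # set one cell by row label and column name
orders['double'] = orders['quantity'] * 2  # hypothetical derived column

print(orders['remark'].tolist())  # ['none', '', '']
print(orders['double'].tolist())  # [2, 4, 6]
```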
That's all for today. See you in the next installment O(∩_∩)O