Pandas data analysis practice (1) - exploring Chipotle fast food data

Python has always been good at data processing and preparation, but weaker at data analysis and modeling. pandas helps fill this gap, letting you carry out an entire data analysis workflow in Python without switching to a more domain-specific language such as R. pandas is Python's core data analysis library: it provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. It is an essential tool for data analysis in Python.

The main data structures of pandas are Series (one-dimensional data) and DataFrame (two-dimensional data), which are sufficient to handle most typical use cases in finance, statistics, social science, engineering, and other fields. Data processing is generally divided into several stages: data collation and cleaning, data analysis and modeling, and data visualization and tabulation. pandas is an ideal tool for processing data.

Data source and download: https://www.heywhale.com/mw/dataset/59e715b76d213335f38d4507

1. Create arrays and data frames

1.1 Series

When a Series is created from a list or an array, pandas generates an integer index automatically by default; you can also specify the index yourself.

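import numpy as np
import pandas as pd

s1 = pd.Series(np.random.randint(1, 10, 5))  # integer index generated automatically by default
s2 = pd.Series(np.random.randint(1, 10, 5), index=list('abcde'))  # row index specified explicitly
s3 = pd.Series({'a': 90, 'b': 80, 'c': 70})  # created from a dict; the keys become the row index
display(s1, s2, s3)  # display() is built into Jupyter notebooks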

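As a small sketch of how the three construction styles differ, the example below uses fixed values instead of random ones so the result is reproducible. The variable names and values are illustrative and not taken from the original notebook.

```python
import pandas as pd

# Default index: pandas assigns integer labels 0..n-1
s_default = pd.Series([4, 7, 1])

# Explicit index: one label per value, in the same order
s_labeled = pd.Series([4, 7, 1], index=list("abc"))

# From a dict: the keys become the index, the values become the data
s_dict = pd.Series({"a": 90, "b": 80, "c": 70})

print(s_default.index.tolist())  # integer labels generated by pandas
print(s_labeled["b"])            # lookup by label
print(s_dict.index.tolist())     # labels taken from the dict keys
```

Once a Series has labels, values can be looked up by label rather than by position, which is the main practical difference from a plain list.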

1.2 DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types. It is similar to an Excel spreadsheet, a SQL table, or a dictionary of Series objects.

pokemon = pd.DataFrame({
    'evolution': ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
    'hp': [45, 39, 44, 45],
    'name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
    'pokedex': ['yes', 'no', 'yes', 'no'],
    'type': ['grass', 'fire', 'water', 'bug']})

# Relabel the rows; rename returns a new DataFrame and ignores labels that are absent (4 here)
pokemon.rename(index={0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'})
# To relabel the columns instead, pass columns= in place of index=

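One detail that is easy to miss is that rename does not modify the DataFrame it is called on. The sketch below, using a shortened two-row frame and a hypothetical new column name hit_points, relabels rows and columns in one call and shows that the original is left untouched.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Bulbasaur", "Charmander"], "hp": [45, 39]})

# rename returns a new DataFrame; df keeps its original labels
renamed = df.rename(index={0: "A", 1: "B"}, columns={"hp": "hit_points"})

print(df.index.tolist(), df.columns.tolist())
print(renamed.index.tolist(), renamed.columns.tolist())
```

To keep the new labels, assign the result back to a variable, as done with renamed here.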

2. Know your data

The sample data (chipotle.tsv) contains orders from a Chipotle fast food restaurant. The fields are described below:

Field name            Description
order_id              Order number
quantity              Quantity
item_name             Product name
choice_description    Product description
2.1 Data Input

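import pandas as pd
chipo = pd.read_csv('chipotle.tsv', sep='\t', header=0)
# If the data has no header row, use header=None
# If the data has a row-index column, index_col=0 uses the first column as the row index
# If the file is a CSV, the separator is a comma (the default), i.e. sep=','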
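To see what the header and index_col options change without needing the real file, the following sketch reads a made-up two-row sample from an in-memory string. The sample rows are hypothetical and only mimic the tab-separated layout of chipotle.tsv.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout
sample = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(sample), sep="\t", header=0)

# header=None: every line is data and the columns are numbered 0..n-1
no_header = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# index_col=0: the first column becomes the row index
indexed = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0)

print(with_header.shape, no_header.shape)
print(indexed.index.name, indexed.columns.tolist())
```

With header=None the header line is counted as a data row, which is why that frame has one more row than the others.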

2.2 Data viewing

1. View the first 10 rows of data

chipo.head(10)

2. View the last 10 rows of data

chipo.tail(10)

3. View the shape of the data, returned as a tuple (number of rows, number of columns)

chipo.shape

4. View the row index: a range from 0 to 4622 (exclusive) with a step of 1

chipo.index

5. View the column index, i.e. the name of each column

chipo.columns

6. View the underlying values as a two-dimensional NumPy ndarray

chipo.values

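The structural attributes covered so far (shape, index, columns, values) can be tried on a tiny stand-in frame. This is a minimal sketch with made-up rows, not the Chipotle data itself.

```python
import pandas as pd

# Hypothetical three-row stand-in for the order data
orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_name": ["Izze", "Chicken Bowl", "Izze"],
})

n_rows, n_cols = orders.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(orders.index)             # a RangeIndex when no index was specified
print(orders.columns.tolist())  # column names in order

values = orders.values          # 2-D NumPy array; mixed column types give dtype object
print(type(values).__name__, values.shape, values.dtype)
```

Note that shape, index, columns, and values are attributes, so they are written without parentheses.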
7. View summary statistics for the numeric columns: count, mean, std (standard deviation), min (minimum), 25% (first quartile), 50% (median), 75% (third quartile), and max (maximum)

chipo.describe()

8. View the column index (Column), the number of non-missing values (Non-Null Count), the data type (Dtype), and memory information (memory usage)

chipo.info()

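As a rough illustration of what describe() and the Non-Null Count in info() report, the sketch below uses a hypothetical four-row frame in which one column has missing values. The column names follow the dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical frame: one numeric column, one text column with gaps
orders = pd.DataFrame({
    "quantity": [1, 2, 3, 2],
    "choice_description": ["[Clementine]", None, "[Rice]", None],
})

# By default describe() summarizes only the numeric columns
summary = orders.describe()
print(summary.loc[["count", "mean", "50%"], "quantity"])

# count() gives the same per-column numbers that info() lists as Non-Null Count
non_null = orders.count()
# The number of missing values is the complement of that count
missing = orders.isna().sum()
print(non_null)
print(missing)
```

Because info() shows non-null counts, the number of missing values in a column is the total row count minus the figure shown.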

2.3 Data selection

1. View the product name column; the result is a Series

chipo.item_name
chipo['item_name']

2. View the product name and quantity columns; the result is a DataFrame

chipo[['item_name','quantity']]

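The difference between single and double brackets decides whether you get a Series or a DataFrame back. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical two-row stand-in for the order data
orders = pd.DataFrame({
    "quantity": [1, 2],
    "item_name": ["Izze", "Chicken Bowl"],
})

one_col = orders["item_name"]                  # single label: Series
same_col = orders.item_name                    # attribute access: the same Series
two_cols = orders[["item_name", "quantity"]]   # list of labels: DataFrame
one_col_frame = orders[["item_name"]]          # one-element list: still a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(one_col_frame).__name__)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.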
3. View the rows at positions 3 to 15 (exclusive)

chipo[3:15]

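Slicing a DataFrame with integers selects rows by position and excludes the end point, just like slicing a Python list. The sketch below checks this on a hypothetical 20-row frame.

```python
import pandas as pd

# Hypothetical frame with 20 rows and the default integer index
orders = pd.DataFrame({"quantity": range(20)})

subset = orders[3:15]  # rows at positions 3, 4, ..., 14

print(len(subset))
print(subset.index[0], subset.index[-1])
print(subset.equals(orders.iloc[3:15]))  # same result as the explicit positional form
```

Writing orders.iloc[3:15] makes the positional intent explicit and gives the same rows.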
4. View the order information of products whose sales quantity is greater than 5

cond = chipo.quantity > 5  # a boolean Series, True where quantity > 5
chipo[cond]  # the order rows whose quantity is greater than 5

5. View the orders where the sales quantity is greater than 5 and the product name is 'Bottled Water'

cond = (chipo.quantity > 5) & (chipo.item_name == 'Bottled Water')  # element-wise AND; yields a boolean Series
chipo[cond]

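Boolean conditions can be combined with & (and), | (or), and ~ (not). The following sketch uses a made-up four-row frame to show each combination; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the order data
orders = pd.DataFrame({
    "quantity": [1, 10, 7, 2],
    "item_name": ["Bottled Water", "Bottled Water", "Chicken Bowl", "Izze"],
})

big = orders["quantity"] > 5                    # boolean Series
is_water = orders["item_name"] == "Bottled Water"

both = orders[big & is_water]     # rows where both conditions hold
either = orders[big | is_water]   # rows where at least one holds
not_big = orders[~big]            # rows where the first condition fails

print(both.index.tolist(), either.index.tolist(), not_big.index.tolist())
```

When the comparisons are written inline, each one needs its own parentheses, because & and | bind more tightly than > and == in Python.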
6. Select data by position

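chipo.iloc[3]  # Series: the row at position 3
chipo.iloc[3:5, 1:3]  # DataFrame: rows at positions 3-4, columns at positions 1-2
chipo.iloc[[3, 5], [1, 3]]  # DataFrame: rows at positions 3 and 5, columns at positions 1 and 3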

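With the default integer index, positions and labels coincide, which hides the difference between iloc (position) and loc (label). The sketch below uses a hypothetical frame with non-default labels 10, 20, 30 to make the difference visible.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
df = pd.DataFrame(
    {"quantity": [1, 2, 3], "item_name": ["Izze", "Chicken Bowl", "Bottled Water"]},
    index=[10, 20, 30],
)

by_position = df.iloc[0]        # first row, whatever its label is
by_label = df.loc[10]           # the row labeled 10 (the same row here)
block = df.iloc[0:2, 0:1]       # positional slices exclude the end point
picked = df.iloc[[0, 2], [1]]   # explicit lists of row and column positions
label_slice = df.loc[10:20]     # label slices include both end points

print(by_position.name, by_label["item_name"])
print(block.shape, picked["item_name"].tolist(), len(label_slice))
```

The last two lines show the one asymmetry worth remembering: iloc slices stop before the end position, while loc slices include the end label.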
7. Add a remark column and assign values to its elements

chipo['remark'] = ''  # add a new column filled with empty strings
chipo.loc[0, 'remark'] = '无'  # set a single element of the column ('无' means "none")

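Assigning to a new column name creates the column, and a scalar is broadcast to every row. As a small sketch on made-up data, the example also adds a derived column; the name double_quantity and the value "checked" are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the order data
orders = pd.DataFrame({"quantity": [1, 2, 3]})

orders["remark"] = ""                 # scalar value is repeated for every row
orders.loc[0, "remark"] = "checked"   # set one cell by row label and column name

# A new column can also be computed from existing ones
orders["double_quantity"] = orders["quantity"] * 2

print(orders.columns.tolist())
print(orders["remark"].tolist())
print(orders["double_quantity"].tolist())
```

Using loc with both a row label and a column name sets the value in a single step.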

That's all for today. See you in the next installment O(∩_∩)O

Putting this material together takes effort. If you found the content useful, please like and bookmark it! Thanks ♪(・ω・)ノ If you reprint this article, please credit the source.


Origin blog.csdn.net/zxxxlh123/article/details/115484415