作业一:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
Step 1.加载数据(datasets/users.csv)
users = pd.read_csv("datasets/users.csv",sep = '|')
users
|
user_id |
age |
gender |
occupation |
zip_code |
0 |
1 |
24 |
M |
technician |
85711 |
1 |
2 |
53 |
F |
other |
94043 |
2 |
3 |
23 |
M |
writer |
32067 |
3 |
4 |
24 |
M |
technician |
43537 |
4 |
5 |
33 |
F |
other |
15213 |
5 |
6 |
42 |
M |
executive |
98101 |
6 |
7 |
57 |
M |
administrator |
91344 |
7 |
8 |
36 |
M |
administrator |
05201 |
… |
… |
… |
… |
… |
… |
939 |
940 |
32 |
M |
administrator |
02215 |
940 |
941 |
20 |
M |
student |
97229 |
941 |
942 |
48 |
F |
librarian |
78209 |
942 |
943 |
22 |
M |
student |
77841 |
943 rows × 5 columns
Step 2. 以occupation分组,求每一种职业所有用户的平均年龄
#users.groupby('occupation').age.mean() #方法1
Ave_age = users["age"].groupby(users["occupation"]).mean() #方法2
Ave_age
occupation
administrator 38.746835
artist 31.392857
doctor 43.571429
educator 42.010526
engineer 36.388060
entertainment 29.222222
... ...
salesman 35.666667
scientist 35.548387
student 22.081633
technician 33.148148
writer 36.311111
Name: age, dtype: float64
Step 3. 求每一种职业男性的占比,作为新的一列(male_pct)添加到数据集中,并按照从低到高的顺序排列
def gender_to_numeric(x):
if x == 'M':
return 1
if x == 'F':
return 0
users['gender_n'] = users['gender'].apply(gender_to_numeric)
a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending = False)
# 输出: 各种职业男性的占比:降序
doctor 100.000000
engineer 97.014925
technician 96.296296
retired 92.857143
programmer 90.909091
executive 90.625000
scientist 90.322581
entertainment 88.888889
lawyer 83.333333
salesman 75.000000
educator 72.631579
student 69.387755
other 65.714286
marketing 61.538462
writer 57.777778
none 55.555556
administrator 54.430380
artist 53.571429
librarian 43.137255
healthcare 31.250000
homemaker 14.285714
dtype: float64
Step 4. 获取每一种职业对应的最大和最小的用户年龄
users.groupby('occupation').age.agg(['min', 'max'])
min max
occupation
administrator 21 70
artist 19 48
doctor 28 64
educator 23 63
engineer 22 70
... ... ...
retired 51 73
salesman 18 66
scientist 23 55
student 7 42
technician 21 55
writer 18 60
作业二:(数据过滤与排序)
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
Step 1. 导入数据并赋值给变量 chipo
chipo = pd.read_csv("./datasets/chipotle.csv",sep = "\t")
chipo
#导入网络数据
#url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
#chipo = pd.read_csv(url, sep = '\t')
|
order_id |
quantity |
item_name |
choice_description |
item_price |
0 |
1 |
1 |
Chips and Fresh Tomato Salsa |
NaN |
$2.39 |
1 |
1 |
1 |
Izze |
[Clementine] |
$3.39 |
2 |
1 |
1 |
Nantucket Nectar |
[Apple] |
$3.39 |
3 |
1 |
1 |
Chips and Tomatillo-Green Chili Salsa |
NaN |
$2.39 |
4 |
2 |
2 |
Chicken Bowl |
[Tomatillo-Red Chili Salsa (Hot), [Black Beans… |
$16.98 |
5 |
3 |
1 |
Chicken Bowl |
[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou… |
$10.98 |
6 |
3 |
1 |
Side of Chips |
NaN |
$1.69 |
7 |
4 |
1 |
Steak Burrito |
[Tomatillo Red Chili Salsa, [Fajita Vegetables… |
$11.75 |
8 |
4 |
1 |
Steak Soft Tacos |
[Tomatillo Green Chili Salsa, [Pinto Beans, Ch… |
$9.25 |
9 |
5 |
1 |
Steak Burrito |
[Fresh Tomato Salsa, [Rice, Black Beans, Pinto… |
$9.25 |
10 |
5 |
1 |
Chips and Guacamole |
NaN |
$4.45 |
11 |
6 |
1 |
Chicken Crispy Tacos |
[Roasted Chili Corn Salsa, [Fajita Vegetables,… |
$8.75 |
12 |
6 |
1 |
Chicken Soft Tacos |
[Roasted Chili Corn Salsa, [Rice, Black Beans,… |
$8.75 |
13 |
7 |
1 |
Chicken Bowl |
[Fresh Tomato Salsa, [Fajita Vegetables, Rice,… |
$11.25 |
14 |
7 |
1 |
Chips and Guacamole |
NaN |
$4.45 |
15 |
8 |
1 |
Chips and Tomatillo-Green Chili Salsa |
NaN |
$2.39 |
16 |
8 |
1 |
Chicken Burrito |
[Tomatillo-Green Chili Salsa (Medium), [Pinto … |
$8.49 |
17 |
9 |
1 |
Chicken Burrito |
[Fresh Tomato Salsa (Mild), [Black Beans, Rice… |
$8.49 |
18 |
9 |
2 |
Canned Soda |
[Sprite] |
$2.18 |
19 |
10 |
1 |
Chicken Bowl |
[Tomatillo Red Chili Salsa, [Fajita Vegetables… |
$8.75 |
… |
… |
… |
… |
… |
… |
4617 |
1833 |
1 |
Steak Burrito |
[Fresh Tomato Salsa, [Rice, Black Beans, Sour … |
$11.75 |
4618 |
1833 |
1 |
Steak Burrito |
[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese… |
$11.75 |
4619 |
1834 |
1 |
Chicken Salad Bowl |
[Fresh Tomato Salsa, [Fajita Vegetables, Pinto… |
$11.25 |
4620 |
1834 |
1 |
Chicken Salad Bowl |
[Fresh Tomato Salsa, [Fajita Vegetables, Lettu… |
$8.75 |
4621 |
1834 |
1 |
Chicken Salad Bowl |
[Fresh Tomato Salsa, [Fajita Vegetables, Pinto… |
$8.75 |
Step 2. 计算出有多商品大于10美元(去除列数据中特殊字符)
方法1:
#去除$符号并将object类型转成number类型
prices = [float(value[1 : -1]) for value in chipo.item_price]
chipo.item_price = prices
# 筛选大于10美元的商品
chipo10 = chipo[chipo['item_price'] > 10.00]
#计数
len(chipo10)
◆ 输出:
1130
——————————————————————————————————————————————————————————————
方法2:
#去除$符号
chipo["item_price"] = chipo["item_price"].str.split('$').str[1]
#将object类型转成number类型,计数
chipo["item_price"] = pd.to_numeric(chipo["item_price"])
(chipo["item_price"]>10).value_counts()
◆ 输出:
False 3492
True 1130
Name: item_price, dtype: int64
Step 3. 每个项目的价格是多少?[指定列去重,筛选,排序]
输出一个只包含[ item_name ]和 [item_price]两列的dataframe
# 删除项目名称和数量中的重复项
chipo_filtered = chipo.drop_duplicates(['item_name','quantity'])
# 只选择数量等于1的产品
chipo_one_prod = chipo_filtered[chipo_filtered.quantity == 1]
# 只选择 item_name 和 item_price 价格列
price_per_item = chipo_one_prod[['item_name', 'item_price']]
# 以item_price列降序数据为标准对 price_per_item 整个数据进行排序
price_per_item.sort_values(by = "item_price", ascending = False)
|
item_name |
item_price |
606 |
Steak Salad Bowl |
11.89 |
1229 |
Barbacoa Salad Bowl |
11.89 |
1132 |
Carnitas Salad Bowl |
11.89 |
7 |
Steak Burrito |
11.75 |
168 |
Barbacoa Crispy Tacos |
11.75 |
39 |
Barbacoa Bowl |
11.75 |
… |
… |
… |
0 |
Chips and Fresh Tomato Salsa |
2.39 |
40 |
Chips |
2.15 |
6 |
Side of Chips |
1.69 |
263 |
Canned Soft Drink |
1.25 |
28 |
Canned Soda |
1.09 |
34 |
Bottled Water |
1.09 |
Step 4 根据商品的价格对数据进行排序
方法1:
chipo.item_name.sort_values()
方法2:
chipo.sort_values(by = "item_name")
3389 6 Pack Soft Drink
341 6 Pack Soft Drink
1849 6 Pack Soft Drink
...
3141 6 Pack Soft Drink
639 6 Pack Soft Drink
1026 6 Pack Soft Drink
...
2996 Veggie Salad
3163 Veggie Salad
...
2756 Veggie Salad
4201 Veggie Salad Bowl
1884 Veggie Salad Bowl
...
2683 Veggie Salad Bowl
496 Veggie Salad Bowl
4109 Veggie Salad Bowl
738 Veggie Soft Tacos
3889 Veggie Soft Tacos
...
1699 Veggie Soft Tacos
1395 Veggie Soft Tacos
Name: item_name, dtype: object
Step 5.在所有商品订单中 最贵商品的数量(quantity)是多少?
chipo.sort_values(by = "item_price", ascending = False).head(1)
|
order_id |
quantity |
item_name |
choice_description |
item_price |
3598 |
1443 |
15 |
Chips and Fresh Tomato Salsa |
NaN |
44.25 |
Step 6. 商品订购单中,商品 Veggie Salad Bowl 的订单数目?
chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]
len(chipo_salad)
# 输出:
18
Step 7. 在所有订单中,购买商品Canned Soda数量大于1的订单数有几条?
chipo_drink_steak_bowl = chipo[(chipo.item_name == "Canned Soda")\&(chipo.quantity > 1)]
len(chipo_drink_steak_bowl)
# 输出:
20