First, requirements:
Clean the data in a CSV file produced by a crawler.
Tools used: pandas, regular expressions
Second, a simple analysis:
The file contains 176 records in total.
The analysis targets full-time positions, but some job titles are internship postings, so those rows need to be removed.
Data: the CSV stores every value as a string, so regular expressions are used to extract the numbers. Work experience is reduced to an average of the extracted values. For salary, in line with market conditions, the estimate is taken at the first 25% of the salary range, that is, lower bound + (upper bound - lower bound) / 4.
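The extraction rules above can be sketched on a few made-up rows. This is only a minimal illustration: the sample strings (such as "经验1-3年" and "10k-20k") and the helper names experience_midpoint and salary_estimate are assumptions, not taken from the real file, and the experience helper divides by the count of numbers found, which is one possible reading of "average".

```python
import re

import pandas as pd

# Sample rows are invented for illustration; the real export may differ.
df = pd.DataFrame({
    "工作经验": ["经验1-3年", "经验不限", "经验3-5年"],
    "工资": ["10k-20k", "8k-12k", "15k-25k"],
})


def experience_midpoint(text):
    """Average the numbers found in the string; 0 when there are none."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    return sum(nums) / len(nums) if nums else 0


def salary_estimate(text):
    """Lower bound plus 25% of the range.

    Assumption: if only one number is present, it is used as both bounds.
    A string with no digits at all raises IndexError.
    """
    nums = [int(n) for n in re.findall(r"\d+", text)]
    low, high = nums[0], nums[-1]
    return low + (high - low) / 4


df["工作经验"] = df["工作经验"].map(experience_midpoint)
df["工资"] = df["工资"].map(salary_estimate)
print(df)
```

For a range of 10k-20k the estimate is 10 + (20 - 10) / 4 = 12.5, which sits a quarter of the way into the range rather than at its midpoint.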
Third, the code:
import pandas as pd

df = pd.read_csv('lagou8.4jobs.csv', encoding='utf-8-sig')
# print(df.describe())
# 175 records in total, including internship postings that must be cleaned out
df.drop(df[df['职位名称'].str.contains('实习')].index, inplace=True)
# print(df.describe())
# 67 records
pattern = r'\d+'  # regular expression: match every run of digits
df['工作经验'] = df['工作经验'].str.findall(pattern)
# print(df['工作经验'])
avg_work_year = []
for i in df['工作经验']:
    if len(i) == 0:
        # no number in the string: treat as 0 years
        avg_work_year.append(0)
    else:
        num = [int(j) for j in i]
        avg = sum(num) / 2
        avg_work_year.append(avg)
# print(avg_work_year)
df['工作经验'] = avg_work_year
df['工资'] = df['工资'].str.findall(pattern)
# print(df['工资'])
avg_salary = []
for i in df['工资']:
    num = [int(j) for j in i]
    # print(num)
    # lower bound plus 25% of the salary range
    avg = num[0] + (num[1] - num[0]) / 4
    print(avg)
    avg_salary.append(avg)
df['工资'] = avg_salary
df.to_csv('clear_data.csv', index=False, encoding='utf-8-sig')
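Problems encountered:
At first the CSV file had a Chinese file name, and importing it failed with an encoding error: the 'utf-8' codec could not decode the file. After looking into it, the fix was to rename the file and save it with UTF-8 encoding.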
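The notes do not say which encoding the file was originally saved in. As a hedged sketch, assuming it was GBK (a common default for Chinese text on Windows), the re-save step could look like the following; the helper decode_csv_bytes and the sample header line are hypothetical.

```python
def decode_csv_bytes(raw, candidates=("utf-8-sig", "gbk")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Simulate a header line saved as GBK (assumed original encoding).
header = "职位名称,工作经验,工资\n"
gbk_bytes = header.encode("gbk")

# Decoding as UTF-8 fails, so the helper falls back to GBK.
text, used = decode_csv_bytes(gbk_bytes)

# Re-encode as UTF-8 with a BOM, matching encoding='utf-8-sig' in read_csv.
utf8_bytes = text.encode("utf-8-sig")
print(used, utf8_bytes[:3])
```

In practice the bytes would come from opening the file in binary mode and the re-encoded bytes would be written back to a new file with an ASCII name.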