CSV file data cleaning

First, the requirements:

Clean the data in a CSV file produced by a web crawler.

Tools used: pandas, regular expressions

Second, a simple analysis:

There are 176 records in total.

The analysis targets full-time positions, but some job titles contain internship postings, and those need to be removed.

About the data: the CSV file stores every value as a string, so regular expressions are used to extract the numbers. Work experience is reduced to the average of its range. For salary, given the market situation, the value taken is the lower bound plus 25% of the salary range.
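
To make the two rules concrete, here is a minimal sketch using only the standard `re` module. The function names and the sample strings ("经验1-3年", "经验不限", "10k-20k") are assumptions for illustration, not values taken from the actual file.

```python
import re

NUMBER = re.compile(r"\d+")  # matches every run of digits


def experience_midpoint(text):
    """Average the numbers found in a string such as '经验1-3年'; 0 if none."""
    nums = [int(n) for n in NUMBER.findall(text)]
    return sum(nums) / len(nums) if nums else 0


def salary_quarter_point(text):
    """Lower bound plus 25% of the range for a string such as '10k-20k'.

    Raises ValueError if the string holds fewer than two numbers.
    """
    low, high = (int(n) for n in NUMBER.findall(text)[:2])
    return low + (high - low) / 4


print(experience_midpoint("经验1-3年"))   # 2.0
print(experience_midpoint("经验不限"))    # 0
print(salary_quarter_point("10k-20k"))   # 12.5
```

For a salary of 10k-20k the range is 10, a quarter of it is 2.5, so the value kept is 12.5.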

 

Third, the code:

import pandas as pd

# Column names: 职位名称 = job title, 工作经验 = work experience, 工资 = salary
df = pd.read_csv('lagou8.4jobs.csv', encoding='utf-8-sig')
# print(df.describe())
# 175 records in total, including internship postings that must be cleaned out
df.drop(df[df['职位名称'].str.contains('实习')].index, inplace=True)
# print(df.describe())
# 67 records
pattern = r'\d+'  # regular expression matching every run of digits
df['工作经验'] = df['工作经验'].str.findall(pattern)
# print(df['工作经验'])
avg_work_year = []
for i in df['工作经验']:
    if len(i) == 0:
        # no digits in the field, so record 0 years
        avg_work_year.append(0)
    else:
        num = [int(j) for j in i]
        # midpoint of a two-number range such as 1-3; a lone number is halved
        avg = sum(num) / 2
        avg_work_year.append(avg)
# print(avg_work_year)
df['工作经验'] = avg_work_year

df['工资'] = df['工资'].str.findall(pattern)
# print(df['工资'])
avg_salary = []
for i in df['工资']:
    num = [int(j) for j in i]
    # print(num)
    # lower bound plus 25% of the range; assumes both bounds are present
    avg = num[0] + (num[1] - num[0]) / 4
    print(avg)
    avg_salary.append(avg)
df['工资'] = avg_salary

df.to_csv('clear_data.csv', index=False, encoding='utf-8-sig')
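
The same cleaning steps can also be written without explicit loops. The sketch below is one possible vectorized variant, not the script used for this post. It runs on a small hypothetical DataFrame rather than the real file, and it takes the plain mean of the experience numbers, so a field with a single number keeps that number instead of being halved.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the script above.
df = pd.DataFrame({
    "职位名称": ["数据分析师", "数据分析实习生", "Python开发"],
    "工作经验": ["经验1-3年", "经验不限", "经验不限"],
    "工资": ["10k-20k", "2k-4k", "15k-25k"],
})

# Keep rows whose title does not mention an internship (实习).
# na=False treats a missing title as "no match".
df = df[~df["职位名称"].str.contains("实习", na=False)].copy()

# Work experience: one row per matched number, then the mean per original row.
# Rows without any digits are absent from the result and are filled with 0.
nums = df["工作经验"].str.extractall(r"(\d+)")[0].astype(float)
df["工作经验"] = nums.groupby(level=0).mean().reindex(df.index, fill_value=0)

# Salary: capture the lower and upper bound, then take lower + 25% of the range.
bounds = df["工资"].str.extract(r"(\d+)\D+(\d+)").astype(float)
df["工资"] = bounds[0] + (bounds[1] - bounds[0]) / 4

print(df["工作经验"].tolist())  # [2.0, 0.0]
print(df["工资"].tolist())      # [12.5, 17.5]
```

A salary string that does not contain two numbers ends up as NaN here instead of raising an error, which makes such rows easy to find with `df["工资"].isna()`.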

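Problems encountered along the way:

At first the CSV file had a Chinese file name, and importing it failed with an encoding error: the 'utf-8' codec could not decode the file. After looking into it, renaming the file and saving it with UTF-8 encoding solved the problem.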
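
If re-saving the file is not an option, another approach is to check which encoding the file actually uses before reading it. The helper below is a hypothetical sketch: the name `pick_encoding` and the candidate list are assumptions, and it presumes the file was written in either UTF-8 or a legacy Chinese encoding such as GB18030.

```python
def pick_encoding(path, candidates=("utf-8-sig", "gb18030")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError(f"none of {candidates} can decode {path}")


# Possible usage with pandas:
# df = pd.read_csv(path, encoding=pick_encoding(path))
```

The order of the candidates matters: GB18030 accepts many byte sequences, so the stricter UTF-8 check goes first.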

 


Origin www.cnblogs.com/itljx/p/11297870.html