How to crawl Lagou job data with Python and visualize it

This article describes crawling Python job data from Lagou (拉勾网) and visualizing it: the crawled data is saved to a CSV file, the relevant fields of that CSV file are cleaned, and the data is then visualized, including bar charts and histograms. Readers interested in the topic can use it as a reference.
Foreword

We crawl the data for Python positions on Lagou, store everything crawled in a CSV file, clean the relevant fields of that CSV file, and then visualize the data, including a bar chart, a histogram, a word cloud and so on, for further analysis. Any remaining analysis and display is left for readers to extend on their own, for example with different kinds of analysis or different storage back ends.

1. Crawler and related dependencies

Python version: Python 3.6
requests: downloading pages
math: rounding up
time: pausing the process between requests
pandas: data analysis and saving to CSV
matplotlib: plotting
pyecharts: plotting
statsmodels: statistical modeling
wordcloud, scipy, jieba: generating the Chinese word cloud
pylab: configuring the plots to display Chinese
If a module fails to install or import, search for the error message and pick a version of the dependency that is compatible with your environment.
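If any of these are missing, a typical way to install them is with pip (version pins omitted; choose versions compatible with Python 3.6). imageio is added because the visualization code below imports it for reading the word-cloud background image:

pip install requests pandas matplotlib pyecharts statsmodels wordcloud scipy jieba imageio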
2. Page structure analysis
Search for 'python工程师' (Python engineer) in Chrome, then right-click and choose Inspect (or press F12) and use the developer tools to look at the page. When we click Next we notice that the URL in the browser's address bar does not change: Lagou has an anti-crawler mechanism, and the position information is not in the page source but is returned as JSON. So we request that JSON directly and read it as a dictionary to get the Python job information we want.
The Python engineer job information to be crawled looks like this: (screenshot)
To be able to crawl the data we want, we make the program simulate a browser viewing the page, so we add header information to the crawl. The headers come from analyzing the page: in the network panel we can see the request headers, the request body, and that the request is a POST request. With that we can send the request to the URL and get the data we want for further processing. (screenshot)
The code for crawling the page information is as follows:

import requests
 
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
 
 
def get_json(url, num):
 """
 从指定的url中通过requests请求携带请求头和请求体获取网页中的信息,
 :return:
 """
 url1 = 'https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput='
 headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
  'Host': 'www.lagou.com',
  'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=',
  'X-Anit-Forge-Code': '0',
  'X-Anit-Forge-Token': 'None',
  'X-Requested-With': 'XMLHttpRequest'
 }
 data = {
  'first': 'true',
  'pn': num,
  'kd': 'python工程师'}
 s = requests.Session()
 print('建立session:', s, '\n\n')
 s.get(url=url1, headers=headers, timeout=3)
 cookie = s.cookies
 print('获取cookie:', cookie, '\n\n')
 res = requests.post(url, headers=headers, data=data, cookies=cookie, timeout=3)
 res.raise_for_status()
 res.encoding = 'utf-8'
 page_data = res.json()
 print('请求响应结果:', page_data, '\n\n')
 return page_data
 
 
print(get_json(url, 1))

From searching we know that each page shows 15 positions and at most 30 pages are displayed. The total number of positions can be read from the JSON, and from that total and the number of positions per page we can compute the total number of pages. We then crawl page by page in a loop and finally write the collected position information to a file in CSV format.
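Concretely, the number of pages is the ceiling of the total count divided by 15, capped at 30; this is what the get_page_num helper in the complete crawler at the end of the article does:

import math

def get_page_num(count):
    # Lagou shows at most 30 result pages, with 15 positions per page
    page_num = math.ceil(count / 15)
    return min(page_num, 30)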

The running result is as follows: (screenshot)
All the crawled Python-related job information looks like this: (screenshot)
3. Data cleaning after storage
Data cleaning actually takes up a large share of the work; here we only do some simple cleaning after storing the data. Searching for Python-related positions on Lagou returns 18,988 results. You can choose which fields need to be stored and filter some of them further: for example, drop positions whose title contains 'intern', keep only positions whose district field is in an area you specify, or turn the salary field into an average, or into the minimum value plus a quarter of the range between minimum and maximum, and so on, as you see fit.
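For example, under the salary rule used below, a string like '10k-15k' becomes 10 + (15 - 10) / 4 = 11.25 (thousand RMB per month), and a work-experience range like '3-5年' is averaged to (3 + 5) / 2 = 4 years; the specific strings here are only illustrations, but these are exactly the values the cleaning code produces.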

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from wordcloud import WordCloud
# scipy.misc.imread was removed in newer SciPy releases; imageio.imread is the drop-in replacement
# from scipy.misc import imread
from imageio import imread
import jieba
from pylab import mpl
 
# 使用matplotlib能够显示中文
mpl.rcParams['font.sans-serif'] = ['SimHei'] # 指定默认字体
mpl.rcParams['axes.unicode_minus'] = False # 解决保存图像是负号'-'显示为方块的问题
# 读取数据
df = pd.read_csv('Python_development_engineer.csv', encoding='utf-8')
 
# 进行数据清洗,过滤掉实习岗位
# df.drop(df[df['职位名称'].str.contains('实习')].index, inplace=True)
# print(df.describe())
 
 
# 由于csv文件中的字符是字符串形式,先用正则表达式将字符串转化为列表,再取区间的均值
pattern = r'\d+'
# print(df['工作经验'], '\n\n\n')
# print(df['工作经验'].str.findall(pattern))
df['工作年限'] = df['工作经验'].str.findall(pattern)
print(type(df['工作年限']), '\n\n\n')
avg_work_year = []
count = 0
for i in df['工作年限']:
 # print('每个职位对应的工作年限',i)
 # 如果工作经验为'不限'或'应届毕业生',那么匹配值为空,工作年限为0
 if len(i) == 0:
  avg_work_year.append(0)
  # print('nihao')
  count += 1
 # 如果匹配值为一个数值,那么返回该数值
 elif len(i) == 1:
  # print('hello world')
  avg_work_year.append(int(''.join(i)))
  count += 1
 # 如果匹配为一个区间则取平均值
 else:
  num_list = [int(j) for j in i]
  avg_year = sum(num_list) / 2
  avg_work_year.append(avg_year)
  count += 1
print(count)
df['avg_work_year'] = avg_work_year
# 将字符串转化为列表,薪资取最低值加上区间值得25%,比较贴近现实
df['salary'] = df['薪资'].str.findall(pattern)
#
avg_salary_list = []
for k in df['salary']:
 int_list = [int(n) for n in k]
 avg_salary = int_list[0] + (int_list[1] - int_list[0]) / 4
 avg_salary_list.append(avg_salary)
df['月薪'] = avg_salary_list
# df.to_csv('python.csv', index=False)

4. Data visualization
Below is the visual display of the data. Only a few views are shown here; if you want to display other fields, or the same fields with different chart types, play with it yourself. Note: for the modules imported by the snippets below, see the complete code at the end.

1. Draw and save a frequency histogram of Python salaries
If we want to see which salary ranges Python engineer positions in the Internet industry generally fall into, and what proportion each range accounts for, we can use the matplotlib library to visualize the data saved in the CSV file, which gives a much more intuitive view of how the data is distributed.

# 绘制python薪资的频率直方图并保存
plt.hist(df['月薪'],bins=8,facecolor='#ff6700',edgecolor='blue') # bins是默认的条形数目
plt.xlabel('薪资(单位/千元)')
plt.ylabel('频数/频率')
plt.title('python薪资直方图')
plt.savefig('python薪资分布.jpg')
plt.show()
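A side note: plt.hist as used above plots raw counts (频数). If you want proportions (频率) instead, one option is to pass a weights array; a minimal sketch, assuming df['月薪'] from the cleaning step above:

import numpy as np

# each value contributes 1/n, so the bar heights sum to 1
weights = np.ones_like(df['月薪']) / len(df['月薪'])
plt.hist(df['月薪'], bins=8, weights=weights, facecolor='#ff6700', edgecolor='blue')
plt.ylabel('频率')
plt.show()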

The result is as follows: (screenshot)
2. Draw a pie chart of the geographic distribution of Python positions
By breaking the Python positions down by geographic location we can roughly see which cities the IT industry is concentrated in. Choosing a region accordingly makes job hunting easier and can bring more interview opportunities. You can tune the parameters yourself or add more as needed.

# 绘制饼状图并保存
city = df['城市'].value_counts()
print(type(city))
# print(len(city))
label = city.keys()
print(label)
city_list = []
count = 0
n = 1
distance = []
for i in city:
 
 city_list.append(i)
 print('列表长度', len(city_list))
 count += 1
 if count > 5:
  n += 0.1
  distance.append(n)
 else:
  distance.append(0)
plt.pie(city_list, labels=label, labeldistance=1.2, autopct='%2.1f%%', pctdistance=0.6, shadow=True, explode=distance)
plt.axis('equal') # 使饼图为正圆形
plt.legend(loc='upper left', bbox_to_anchor=(-0.1, 1))
plt.savefig('python地理位置分布图.jpg')
plt.show()

The result is as follows: (screenshot)
3. Draw a city-distribution bar chart based on pyecharts
pyecharts is a Python wrapper around ECharts, the JavaScript charting library developed by Baidu. It offers many more ways to visualize data graphically. The ECharts site provides examples such as line charts, bar charts, pie charts, maps, trees and so on: https://www.echartsjs.com/, and the official pyecharts documentation is at https://pyecharts.org/#/. There are also plenty of other resources online.

city = df['城市'].value_counts()
print(type(city))
print(city)
# print(len(city))
 
keys = city.index # 等价于keys = city.keys()
values = city.values
from pyecharts import Bar
 
bar = Bar("python职位的城市分布图")
bar.add("城市", keys, values)
bar.print_echarts_options() # 该行只为了打印配置项,方便调试时使用
bar.render(path='a.html')
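Note that the snippet above uses the pyecharts 0.x API (from pyecharts import Bar with a positional add). If you have pyecharts 1.x or later installed, the API is different; a minimal sketch of an equivalent chart, assuming the same keys and values:

from pyecharts.charts import Bar
from pyecharts import options as opts

bar = (
    Bar()
    .add_xaxis(list(keys))
    .add_yaxis('城市', [int(v) for v in values])  # convert numpy ints to plain ints for serialization
    .set_global_opts(title_opts=opts.TitleOpts(title='python职位的城市分布图'))
)
bar.render('a.html')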

The result is as follows: (screenshot)
4. Draw a word cloud of Python position benefits
A word cloud (词云) is a kind of "keyword rendering" that visually highlights the high-frequency keywords in text data on a cloud-shaped colored image. It filters out a large amount of textual detail so that a single glance conveys the main point of the text. We use jieba for word segmentation and WordCloud to generate the word cloud (the background image is customizable). Below, the benefits listed for Python positions are rendered as a word cloud, which gives a more intuitive view of which benefits most companies emphasize.

text = ''
for line in df['公司福利']:
 if len(eval(line)) == 0:
  continue
 else:
  for word in eval(line):
   # print(word)
   text += word
 
cut_word = ','.join(jieba.cut(text))
word_background = imread('公主.jpg')
cloud = WordCloud(
 font_path=r'C:\Windows\Fonts\simfang.ttf',
 background_color='black',
 mask=word_background,
 max_words=500,
 max_font_size=100,
 width=400,
 height=800
 
)
word_cloud = cloud.generate(cut_word)
word_cloud.to_file('福利待遇词云.png')
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
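A small safety note: each cell of df['公司福利'] is the string form of a Python list, and the loop above turns it back into a list with eval. ast.literal_eval parses the same literals without executing arbitrary code, so it is a safer drop-in choice here; a sketch of the same loop:

import ast

text = ''
for line in df['公司福利']:
    for word in ast.literal_eval(line):  # parses the list literal; an empty list is simply skipped
        text += word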

The result is as follows: (screenshot)
5. Complete code for the crawler and visualization
The complete code is below. All of it has been tested and runs correctly; interested readers can try it out and work through where each piece is used. If something fails to run or a module fails to install, leave a comment in the comments section and we will sort it out together.

If you found this helpful, please give it a like! This is original content; please credit the source when reposting!

1. Complete crawler code

To prevent our frequent requests from getting our IP blocked by the site, we sleep for a while after crawling each page; of course, you can also use other means, such as proxies.
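For example, requests can route traffic through a proxy via its proxies argument; a minimal sketch (the proxy address is a placeholder to be replaced with your own), using the same url, headers, data and cookie as in the crawler below:

proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder proxy address
    'https': 'http://127.0.0.1:8888',
}
res = requests.post(url, headers=headers, data=data, cookies=cookie, proxies=proxies, timeout=3)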

import requests
import math
import time
import pandas as pd
 
 
def get_json(url, num):
 """
 从指定的url中通过requests请求携带请求头和请求体获取网页中的信息,
 :return:
 """
 url1 = 'https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput='
 headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
  'Host': 'www.lagou.com',
  'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=',
  'X-Anit-Forge-Code': '0',
  'X-Anit-Forge-Token': 'None',
  'X-Requested-With': 'XMLHttpRequest'
 }
 data = {
  'first': 'true',
  'pn': num,
  'kd': 'python工程师'}
 s = requests.Session()
 print('建立session:', s, '\n\n')
 s.get(url=url1, headers=headers, timeout=3)
 cookie = s.cookies
 print('获取cookie:', cookie, '\n\n')
 res = requests.post(url, headers=headers, data=data, cookies=cookie, timeout=3)
 res.raise_for_status()
 res.encoding = 'utf-8'
 page_data = res.json()
 print('请求响应结果:', page_data, '\n\n')
 return page_data
 
 
def get_page_num(count):
 """
 计算要抓取的页数,通过在拉勾网输入关键字信息,可以发现最多显示30页信息,每页最多显示15个职位信息
 :return:
 """
 page_num = math.ceil(count / 15)
 if page_num > 30:
  return 30
 else:
  return page_num
 
 
def get_page_info(jobs_list):
 """
 获取职位
 :param jobs_list:
 :return:
 """
 page_info_list = []
 for i in jobs_list: # 循环每一页所有职位信息
  job_info = []
  job_info.append(i['companyFullName'])
  job_info.append(i['companyShortName'])
  job_info.append(i['companySize'])
  job_info.append(i['financeStage'])
  job_info.append(i['district'])
  job_info.append(i['positionName'])
  job_info.append(i['workYear'])
  job_info.append(i['education'])
  job_info.append(i['salary'])
  job_info.append(i['positionAdvantage'])
  job_info.append(i['industryField'])
  job_info.append(i['firstType'])
  job_info.append(i['companyLabelList'])
  job_info.append(i['secondType'])
  job_info.append(i['city'])
  page_info_list.append(job_info)
 return page_info_list
 
 
def main():
 url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
 first_page = get_json(url, 1)
 total_page_count = first_page['content']['positionResult']['totalCount']
 num = get_page_num(total_page_count)
 total_info = []
 time.sleep(10)
 print("python开发相关职位总数:{},总页数为:{}".format(total_page_count, num))
 for num in range(1, num + 1):
  # 获取每一页的职位相关的信息
  page_data = get_json(url, num) # 获取响应json
  jobs_list = page_data['content']['positionResult']['result'] # 获取每页的所有python相关的职位信息
  page_info = get_page_info(jobs_list)
  print("每一页python相关的职位信息:%s" % page_info, '\n\n')
  total_info += page_info
  print('已经爬取到第{}页,职位总数为{}'.format(num, len(total_info)))
  time.sleep(20)
  # 将总数据转化为data frame再输出,然后在写入到csv各式的文件中
  df = pd.DataFrame(data=total_info,
       columns=['公司全名', '公司简称', '公司规模', '融资阶段', '区域', '职位名称', '工作经验', '学历要求', '薪资', '职位福利', '经营范围',
         '职位类型', '公司福利', '第二职位类型', '城市'])
  # df.to_csv('Python_development_engineer.csv', index=False)
  print('python相关职位信息已保存')
 
 
if __name__ == '__main__':
 main()

2. Complete visualization code

The visualization uses matplotlib, jieba, wordcloud, pyecharts, pylab, scipy and other modules; readers can look up the usage of each module, and the various parameters involved, on their own.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from wordcloud import WordCloud
# from scipy.misc import imread  # removed in newer SciPy releases; use imageio instead
from imageio import imread
import jieba
from pylab import mpl
 
# 使用matplotlib能够显示中文
mpl.rcParams['font.sans-serif'] = ['SimHei'] # 指定默认字体
mpl.rcParams['axes.unicode_minus'] = False # 解决保存图像是负号'-'显示为方块的问题
# 读取数据
df = pd.read_csv('Python_development_engineer.csv', encoding='utf-8')
 
# 进行数据清洗,过滤掉实习岗位
# df.drop(df[df['职位名称'].str.contains('实习')].index, inplace=True)
# print(df.describe())
 
 
# 由于csv文件中的字符是字符串形式,先用正则表达式将字符串转化为列表,再取区间的均值
pattern = r'\d+'
# print(df['工作经验'], '\n\n\n')
# print(df['工作经验'].str.findall(pattern))
df['工作年限'] = df['工作经验'].str.findall(pattern)
print(type(df['工作年限']), '\n\n\n')
avg_work_year = []
count = 0
for i in df['工作年限']:
 # print('每个职位对应的工作年限',i)
 # 如果工作经验为'不限'或'应届毕业生',那么匹配值为空,工作年限为0
 if len(i) == 0:
  avg_work_year.append(0)
  # print('nihao')
  count += 1
 # 如果匹配值为一个数值,那么返回该数值
 elif len(i) == 1:
  # print('hello world')
  avg_work_year.append(int(''.join(i)))
  count += 1
 # 如果匹配为一个区间则取平均值
 else:
  num_list = [int(j) for j in i]
  avg_year = sum(num_list) / 2
  avg_work_year.append(avg_year)
  count += 1
print(count)
df['avg_work_year'] = avg_work_year
# 将字符串转化为列表,薪资取最低值加上区间值得25%,比较贴近现实
df['salary'] = df['薪资'].str.findall(pattern)
#
avg_salary_list = []
for k in df['salary']:
 int_list = [int(n) for n in k]
 avg_salary = int_list[0] + (int_list[1] - int_list[0]) / 4
 avg_salary_list.append(avg_salary)
df['月薪'] = avg_salary_list
# df.to_csv('python.csv', index=False)
 
 
"""1、绘制python薪资的频率直方图并保存"""
plt.hist(df['月薪'], bins=8, facecolor='#ff6700', edgecolor='blue') # bins是默认的条形数目
plt.xlabel('薪资(单位/千元)')
plt.ylabel('频数/频率')
plt.title('python薪资直方图')
plt.savefig('python薪资分布.jpg')
plt.show()
 
"""2、绘制饼状图并保存"""
city = df['城市'].value_counts()
print(type(city))
# print(len(city))
label = city.keys()
print(label)
city_list = []
count = 0
n = 1
distance = []
for i in city:
 
 city_list.append(i)
 print('列表长度', len(city_list))
 count += 1
 if count > 5:
  n += 0.1
  distance.append(n)
 else:
  distance.append(0)
plt.pie(city_list, labels=label, labeldistance=1.2, autopct='%2.1f%%', pctdistance=0.6, shadow=True, explode=distance)
plt.axis('equal') # 使饼图为正圆形
plt.legend(loc='upper left', bbox_to_anchor=(-0.1, 1))
plt.savefig('python地理位置分布图.jpg')
plt.show()
 
"""3、绘制福利待遇的词云"""
text = ''
for line in df['公司福利']:
 if len(eval(line)) == 0:
  continue
 else:
  for word in eval(line):
   # print(word)
   text += word
 
cut_word = ','.join(jieba.cut(text))
word_background = imread('公主.jpg')
cloud = WordCloud(
 font_path=r'C:\Windows\Fonts\simfang.ttf',
 background_color='black',
 mask=word_background,
 max_words=500,
 max_font_size=100,
 width=400,
 height=800
 
)
word_cloud = cloud.generate(cut_word)
word_cloud.to_file('福利待遇词云.png')
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
 
"""4、基于pyechart的柱状图"""
city = df['城市'].value_counts()
print(type(city))
print(city)
# print(len(city))
 
keys = city.index # 等价于keys = city.keys()
values = city.values
from pyecharts import Bar
 
bar = Bar("python职位的城市分布图")
bar.add("城市", keys, values)
bar.print_echarts_options() # 该行只为了打印配置项,方便调试时使用
bar.render(path='a.html')

That is all for this article; I hope it helps with your learning.


Source: blog.csdn.net/haoxun09/article/details/104828417