Web crawler data cleaning and visualization in practice: analysis of the employment situation

A study of the employment situation in Wuhu based on data collected from recruitment websites

1. Introduction

This report analyzes the local employment situation using data collected from recruitment websites with a web crawler. It covers salary, job location, experience requirements, academic requirements, company industry, company benefits, and company type and size.
Part of the content of this article comes from online collection and personal practice. If any information is incorrect, readers are welcome to criticize and correct it. This article is only for learning and communication, not for any commercial purposes.

2. Salary range distribution analysis

1. Salary range distribution histogram

In the salary range distribution histogram, we analyze the distribution of salary ranges. Here are the key results:


  • Chart content: A histogram including the minimum salary and the maximum salary, with the salary range as the x-axis and the number of recruitments as the y-axis. The minimum salary and the maximum salary are represented in different colors.
  • Local data storage: The salary ranges and the number of minimum and maximum salaries in each range are saved to the text file salary_distribution.txt, and the histogram image to salary_distribution.png.

The salary ranges and the number of postings in each range give the following picture:

  • Analysis :

    • Minimum salaries are concentrated in the 0-25,000 yuan range, which contains 521 positions.
    • Maximum salaries are also concentrated mainly in the 0-25,000 yuan range, with 444 positions.
    • Average salaries show the same pattern: the 0-25,000 yuan range has the largest count, 498 positions.
    • In every other salary range, the counts for minimum, maximum and average salary are all comparatively small.
  • Conclusion :

    • Most job postings fall in the lowest salary range (0-25,000 yuan), which is probably the typical salary level for most positions.
    • Far fewer positions fall in the higher salary ranges.
    • Salaries vary widely between positions, so job seekers need to choose based on their own circumstances and expectations.

This analysis can help job seekers understand hiring conditions in different salary ranges, helping them make more informed career decisions.
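
Comparing the minimum and maximum salary counts range by range only works if both series are counted against the same bin edges. The sketch below shows one way to do that in plain Python. The sample salaries and the helper name `count_in_bins` are illustrative assumptions, not values from the crawled data.

```python
def count_in_bins(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the shared range
        # The upper edge of the range belongs to the last bin
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical monthly salaries in yuan (not taken from the crawled data)
min_salary = [4000, 6000, 8000, 30000]
max_salary = [6000, 9000, 12000, 60000]

# One shared range and bin count for both series
low, high, n_bins = 0, 100000, 4
min_counts = count_in_bins(min_salary, low, high, n_bins)
max_counts = count_in_bins(max_salary, low, high, n_bins)

for i in range(n_bins):
    start = low + i * (high - low) / n_bins
    end = start + (high - low) / n_bins
    print(f"{start:.0f}-{end:.0f}\t{min_counts[i]}\t{max_counts[i]}")
```

Because both series use the same edges, each printed row describes a single salary range, so the two counts in it can be compared directly.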

3. Work location analysis

2. Bar chart of recruitment numbers in different locations

We analyzed recruitment numbers across different job locations. Here are the key results:

  • Chart content: The bar chart shows the number of recruitments in different work locations. The x-axis represents the work location and the y-axis represents the number of recruitments.

  • Local data storage: The locations and their recruitment counts are saved to the text file location_counts.txt, and the bar chart image to location_counts.png.

  • Analysis :

    • Wuhu is the most active recruiting location, with 279 job openings.
    • The districts and counties of Wuhu also have a number of postings: Jiujiang District (80 positions), Yijiang District (58 positions), Wuhu County (23 positions), Fanchang County (13 positions), Jinghu District (10 positions) and Sanshan District (10 positions).
    • Nanjing Jiangbei New District, Wuhu Wuwei City, Chuzhou, Wuhu Nanling County and other places also have some postings, but fewer.
    • Other cities and districts, such as Ma'anshan, Xuancheng, Qiannan, Hefei, Anhui Province, Suzhou Gusu District, Zhenjiang, Hefei Shushan District, Zhengzhou High-tech Zone, Shijiazhuang, Shanghai Jiading District, Nanjing Gulou District, Shenzhen Futian District, Huainan, Hefei High-tech Zone, Suzhou, Nanjing Lishui District, Shangqiu, Nanjing Gaochun District, Hangzhou Yuhang District, Shanghai Minhang District, Dalian Jinzhou District, Nanjing and Zhengzhou, have very few postings: only 1 or 2 positions per location.
  • Conclusion :

    • Wuhu is the location with the most concentrated recruitment activities and the largest number of recruitment positions.
    • In addition to Wuhu, there are also a certain number of recruitment positions in Jiangbei New District of Nanjing.
    • For job seekers, understanding the recruitment situation in different locations can help you choose a job location that suits you and better plan your job search strategy.
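
Because the location labels mix a bare city name with "city·district" labels, the city-level total is easier to see after rolling the districts up. The sketch below does this for the Wuhu counts reported above. The "·" separator is an assumption about how the site formats locations, and the English labels stand in for the original Chinese values.

```python
from collections import Counter

# Posting counts per location label, as reported in the analysis above.
# The "city·district" layout is an assumption about the scraped field.
location_counts = {
    "Wuhu": 279,
    "Wuhu·Jiujiang District": 80,
    "Wuhu·Yijiang District": 58,
    "Wuhu·Wuhu County": 23,
    "Wuhu·Fanchang County": 13,
    "Wuhu·Jinghu District": 10,
    "Wuhu·Sanshan District": 10,
}

city_totals = Counter()
for label, n in location_counts.items():
    city = label.split("·")[0]  # keep only the part before the separator
    city_totals[city] += n

print(city_totals.most_common())
```

With these seven labels combined, Wuhu accounts for 473 postings.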

3. Bar chart of average salary in different locations

We further analyzed average salaries by job location. Here are the key results:


  • Chart content: The bar chart shows the average salary in different work locations. The x-axis represents the work location and the y-axis represents the average salary.

  • Local data storage: The job locations and their average salaries are saved to the text file avg_salary_by_location.txt, and the bar chart image to avg_salary_by_location.png.

  • Analysis :

    • Zhengzhou High-tech Zone has the highest average salary, at 150,000 yuan.
    • Average salaries in Chuzhou, Hefei, Qiannan and Anhui Province are also relatively high, at 119,750 yuan, 82,500 yuan, 79,000 yuan and 53,000 yuan respectively.
    • Average salaries in Wuhu, Nanjing, Hangzhou, Shijiazhuang and other places are relatively low, not exceeding 15,000 yuan.
    • Average salaries in other locations also vary, showing how salaries vary by region.
  • Conclusion :

    • Average salaries differ considerably by work location, with some regions well above others.
    • The higher average salaries in Zhengzhou High-tech Zone, Chuzhou, Hefei and other places may reflect the level of economic development in these areas, but each of these locations has only a few postings in this data set, so their averages should be read with caution.
    • For areas with relatively low salaries, job seekers may want to weigh other factors, such as cost of living and the number of openings.

This analysis can help job seekers understand average salaries in different locations, helping them make more informed decisions when choosing a job location.
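
An average salary per location is easier to interpret when it is reported together with the number of postings behind it. The sketch below groups salaries by location and flags averages built on very few postings. The rows and the `MIN_POSTINGS` threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (location, average monthly salary in yuan) rows
rows = [
    ("Wuhu", 9000), ("Wuhu", 11000), ("Wuhu", 13000),
    ("Zhengzhou High-tech Zone", 150000),
]

by_location = defaultdict(list)
for location, salary in rows:
    by_location[location].append(salary)

MIN_POSTINGS = 3  # assumed threshold below which a mean is treated as unreliable

summary = {
    location: {
        "mean": mean(salaries),
        "n": len(salaries),
        "reliable": len(salaries) >= MIN_POSTINGS,
    }
    for location, salaries in by_location.items()
}

for location, stats in summary.items():
    print(location, stats)
```

A location with a single posting can easily top the ranking, which is why the sample size is kept next to each mean.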

4. Analysis of experience and academic requirements

4. Bar chart of recruitment numbers with different experience requirements and academic qualifications

We analyzed the recruitment numbers for different experience and education requirements. Here are the key results:

  • Chart content: Includes two sub-figures. The left sub-figure shows the number of recruitments with different experience requirements, and the right sub-figure shows the number of recruitments with different academic requirements.

  • Local data storage: The experience requirements and academic requirements, each with their recruitment counts, are saved to the text file experience_education_counts.txt, and the chart image to experience_education_counts.png.

  • Analysis of experience requirements :

    • The most common experience requirement is 3-4 years, with 180 positions requiring this experience.
    • This is followed by 2 years of experience with 102 positions.
    • There are similar numbers of positions with 1 year of experience and 5-7 years of experience, 101 and 88 positions respectively.
    • There are also a certain number of positions that do not require experience, which is 52 positions.
    • For higher level experience requirements, the number of positions for 10+ years and 8-9 years is relatively smaller at 10 and 9 positions respectively.
  • Analysis of academic requirements :

    • The most common educational requirement is a college (junior college) degree, required by 257 positions.
    • Positions requiring a bachelor's degree are also numerous, at 188.
    • Positions requiring technical secondary school, high school, and junior high school or below decrease in that order, at 52, 33 and 9 respectively.
    • Positions requiring a master's degree are the fewest, only 3.
  • Conclusion :

    • Most positions (383 of the 542 counted above, about 71%) require 1 to 4 years of experience, while 52 positions require no experience.
    • College and bachelor's degrees are the most common academic requirements, together accounting for 445 of the 542 positions (about 82%).
    • For job seekers, understanding the distribution of positions with different experience and academic requirements can help better match their background and goals to find suitable job opportunities.
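
The counts listed above can be turned into shares of all postings, which makes statements such as "most positions" concrete. This is a minimal sketch using only the figures reported in this section. The dictionary keys are English labels chosen for the example.

```python
# Counts as reported in the analysis above
experience_counts = {
    "3-4 years": 180, "2 years": 102, "1 year": 101, "5-7 years": 88,
    "no experience": 52, "10+ years": 10, "8-9 years": 9,
}
education_counts = {
    "college": 257, "bachelor": 188, "technical secondary": 52,
    "high school": 33, "junior high or below": 9, "master": 3,
}


def share(counts, keys):
    """Percentage of all postings that fall into the given categories."""
    total = sum(counts.values())
    return round(100 * sum(counts[k] for k in keys) / total, 1)


one_to_four_years = share(experience_counts, ["1 year", "2 years", "3-4 years"])
college_or_bachelor = share(education_counts, ["college", "bachelor"])

print(sum(experience_counts.values()), sum(education_counts.values()))
print(one_to_four_years, college_or_bachelor)
```

Both lists add up to the same total of 542 postings, which is a quick consistency check on the two counts.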

5. Company information analysis

5. Bar chart of recruitment numbers in different company industries

We analyzed hiring numbers across different company industries. Here are the key results:

  • Chart content: The bar chart shows the number of recruitments in different company industries. The x-axis represents the company industry and the y-axis represents the number of recruitments.

  • Local data storage: The company industries and their recruitment counts are saved to the text file industry_counts.txt, and the bar chart image to industry_counts.png.

  • Analysis :

    • The automotive industry is the industry with the largest number of job openings, with 142 positions.
    • The new energy and electronic technology/semiconductor/integrated circuit industries rank second and third with 59 and 56 positions respectively.
    • Machinery/equipment/heavy industry, auto parts, computer software and other industries also have a certain number of positions, 46, 39 and 28 respectively.
    • The number of recruitments in finance/investment/securities, diversified business group companies, environmental protection, instrumentation/industrial automation and other industries is relatively small, all below 10.
  • Conclusion :

    • Recruitment varies widely across company industries, with some industries having a large number of positions and other industries having relatively few.
    • The automotive industry, new energy and electronic technology/semiconductor/integrated circuit industries may be popular industries that job seekers focus on because they have more job opportunities.
    • It should be noted that the number of recruitments in industries such as finance/investment/securities and diversified business group companies is relatively small and competition may be fierce.

This analysis can help job seekers understand the recruitment situation in different company industries and help them choose suitable industries and positions based on their interests and majors.

6. Bar chart of average salary in different company industries

We further analyzed average salaries across different company industries. Here are the key results:

  • Chart content: The bar chart shows the average salary of different company industries. The x-axis represents the company industry and the y-axis represents the average salary.

  • Local data storage: The company industries and their average salaries are saved to the text file avg_salary_by_industry.txt, and the bar chart image to avg_salary_by_industry.png.

  • Analysis :

    • The medical/nursing/health industry has the highest average salary, at 116,000 yuan. This may be because the medical industry has higher salary standards for professionals.
    • The finance/investment/securities industry is also among the high-paying industries, with an average salary of 47,857 yuan.
    • Average salaries in the electronic technology/semiconductor/integrated circuit industry and the computer software industry are also relatively high, at 24,696 yuan and 19,071 yuan respectively.
    • Automobiles, fast-moving consumer goods (food, beverages, cosmetics), auto parts, and construction/building materials/engineering all have average salaries above 10,000 yuan.
    • Some industries, such as advertising, life services and intermediary services, have lower average salaries, all below 10,000 yuan.
  • Conclusion :

    • Average salaries vary widely across company industries, with some industries having relatively high average salaries and others having lower average salaries.
    • Job seekers can choose suitable industries and companies based on their career goals and salary expectations.
    • It should be noted that the average salary is only a reference indicator, and the actual salary will also be affected by various factors such as personal experience, job level, and region.

This analysis helps job seekers understand salary levels in different industries so they can make more informed career choices.

6. Analysis of company benefits, type and size

7. Bar chart of how often each company benefit appears

We analyzed the frequency of occurrence of different company benefit items. Here are the key results:

  • Chart content: The bar chart shows the number of occurrences of different company welfare items. The x-axis represents the company welfare items and the y-axis represents the number of occurrences.

  • Local data storage: The company benefit items and their number of occurrences are saved to the text file welfare_counts.txt, and the bar chart image to welfare_counts.png.

  • Analysis :

    • The bar chart shows that "five insurances and one fund" is the most common company benefit, indicating that companies generally provide it to meet employees' basic social security needs.

    • "Performance bonus" and "year-end bonus" follow closely, which shows that companies also attach great importance to employee performance evaluation and incentives.

    • Some other benefit items, such as "employee travel" and "free shuttle bus", appear less frequently, but still receive some attention.

    • Benefits such as "free employee meals" and "included food and accommodation" appear less frequently and may only be provided by a few companies.

    • Common benefit items: The most frequent benefit is "five insurances and one fund", which appears 372 times, showing that it is a standard benefit provided by most companies. Other common benefits include "performance bonus", "year-end bonus", "catering allowance" and "professional training".

    • Standard benefits: In addition to "five insurances and one fund", benefits such as "employee travel", "free shuttle bus", "transportation subsidy", "regular physical examination" and "communication subsidy" are also adopted by many companies. These benefits are a common choice for engaging employees and improving employee satisfaction.

    • Industry-related benefits: Some benefit items may be related to the company's industry, such as items related to the "automotive industry" and "auto parts", including "cars", "factory inspection reports" and "car repairs".

    • Personalized benefits: Some companies may also list more individual items, such as "Traditional Chinese Medicine Platform Promotion", "Stuttering Correction" and "Deep Learning". These may reflect the special needs of the company or the special requirements of its employees.

  • Conclusion :

    • Benefit improvements: By analyzing these benefit items, you can learn about the benefits offered by other companies, which will help you understand market competition and strategies to attract employees. You can use this information to consider whether your company's benefits policies need to be improved or added to better attract and retain talent.
    • These analyses can help you understand what different companies prefer when it comes to benefits and which benefit items may have a positive impact on employee recruitment and retention. Specific benefit choices may also be affected by factors such as company size, industry, and geographical location.
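
Counting benefit items comes down to splitting each posting's tag text into individual tags and tallying them. The sketch below does this with `collections.Counter`. The sample postings are hypothetical, and the newline separator is an assumption about how the tags appear in the scraped text.

```python
from collections import Counter

# Hypothetical tag text, one string per posting, tags separated by newlines
postings = [
    "five insurances and one fund\nperformance bonus\nyear-end bonus",
    "five insurances and one fund\ncatering allowance",
    "",  # a posting without any tags
    "five insurances and one fund\nyear-end bonus ",
]

tag_counts = Counter(
    tag.strip()
    for text in postings
    for tag in text.split("\n")
    if tag.strip()  # skip empty entries such as postings without tags
)

for tag, n in tag_counts.most_common():
    print(f"{tag}\t{n}")
```

Stripping whitespace before counting keeps "year-end bonus" and "year-end bonus " from being tallied as two different benefits.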

8. Bar chart of recruitment numbers by company type and size

We analyzed hiring numbers by company type and size. Here are the key results:

  • Chart content: The bar chart shows the number of recruitments of different company types and sizes. The x-axis represents the company type and size, and the y-axis represents the number of recruitments.

  • Local data storage: The company types and sizes and their recruitment counts are saved to the text file company_type_and_size_counts.txt, and the bar chart image to company_type_and_size_counts.png.

  • Analysis :

    • Private companies have the most openings across company sizes. Private companies with 150-500 employees have the highest demand, with 102 positions.
    • Private companies with 500-1,000 employees come next, with 66 positions.
    • Among state-owned enterprises, those with more than 10,000 employees have the highest demand, with 26 positions.
    • Demand from joint ventures is spread fairly evenly across sizes. Those with 500-1,000 employees have the most, with 22 positions.
    • Companies with no size given account for 16 positions. They may be small startups or other types of companies.
    • Demand from foreign-funded companies is small at every size and is concentrated in companies with 150-500 employees, with 16 positions.
    • Demand from listed companies is limited and is concentrated in companies with more than 10,000 employees and those with 500-1,000 employees.
  • Conclusion :

    • Private companies and state-owned enterprises have the highest recruitment demand, especially private companies with 150-500 employees.
    • Demand from joint ventures and foreign-funded companies is smaller, concentrated in companies with 500-1,000 and 150-500 employees respectively.
    • Recruitment demand from listed companies is relatively low and comes mostly from larger companies.
    • Job seekers can use the distribution of company types and sizes to select the positions and companies that suit them.

This analysis helps job seekers understand hiring conditions across company types and sizes so they can make more informed career decisions.
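
Since company type and size are scraped as a single field, separating them makes it possible to count by type alone as well as by the combination. The sketch below assumes the field looks like "type | size" and that the size may be missing. Both the layout and the sample values are assumptions, not taken from the crawled data.

```python
from collections import Counter

# Hypothetical field values; the "type | size" layout is an assumption
raw_values = [
    "Private | 150-500 people",
    "Private | 500-1000 people",
    "Private | 150-500 people",
    "State-owned | 10000+ people",
    "Joint venture",  # no size given
]


def split_type_size(value):
    """Split a combined field into (company type, company size)."""
    parts = [p.strip() for p in value.split("|")]
    company_type = parts[0]
    size = parts[1] if len(parts) > 1 else "unspecified"
    return company_type, size


pairs = [split_type_size(v) for v in raw_values]
pair_counts = Counter(pairs)                   # counts per (type, size)
type_counts = Counter(t for t, _ in pairs)     # counts per type only

print(pair_counts.most_common())
print(type_counts.most_common())
```

Keeping an explicit "unspecified" size preserves the postings that list no company size instead of silently dropping them.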

7. Keyword analysis

9. Company industry word cloud diagram

We generated a word cloud diagram of the company's industry to display the keywords of the company's industry. Here are the key results:

  • Chart content: The word cloud chart shows keywords of different company industries, generated based on word frequency, and the text color is random.

  • Local data storage: We saved company industry text data to local files industry_wordcloud.txt, as well as word cloud image files industry_wordcloud.png.

  • Analysis :

    • The word cloud chart shows keywords from different industries, among which the most frequently occurring words include "automobile", "electronics", "technology", "new energy", "medical care", etc.
    • "Automobile" is the word that appears most frequently, indicating that the automobile industry occupies an important position in the recruitment market.
    • "Electronics" and "technology" are also popular industry keywords, indicating that these two industries also have more recruitment opportunities.
  • Conclusion :

    • It can be seen from the word cloud chart that industries such as automobiles, electronics, and technology are popular areas in the current recruitment market, and job seekers can pay attention to career opportunities in these industries.
    • The word cloud chart provides a visual representation of the company's industry data, helping to intuitively understand the recruitment situation in different industries.

10. Company benefits word cloud

We also generated a word cloud of company benefits to display the keywords of company benefits. Here are the key results:

  • Chart content: The word cloud chart shows the keywords of different company benefits, generated based on word frequency, and the text color is random.

  • Local data storage: We saved the company benefits text data to local files welfare_wordcloud.txt, as well as word cloud image files welfare_wordcloud.png.
    The following is a word cloud and analysis generated from welfare text data:

  • Analysis :

    • The word cloud diagram shows the benefits provided by different companies. Among them, the most commonly seen benefits include "five insurances and one fund", "year-end bonus", "catering subsidy", "communication subsidy", "employee travel", etc.
    • "Five insurances and one housing fund" is one of the most frequently appearing benefits, indicating that most companies provide this basic social security benefit.
    • "Year-end bonus" is also a common benefit that may attract the attention of job seekers.
    • Benefits such as catering subsidies and communication subsidies have also received a certain degree of attention. These benefits can help improve the quality of life of employees.
  • Conclusion :

    • It can be seen from the word cloud that benefits such as five insurances and one housing fund, year-end bonuses, and catering subsidies are highlights that attract job seekers. Companies can highlight these benefits when recruiting.
    • The word cloud chart provides a visual representation of the company's welfare data, helping job seekers understand the welfare packages of different companies.

8. Summary and Outlook

Through multi-dimensional analysis and visual presentation of recruitment data, we derived a series of important information about the local employment situation. This information not only helps students understand the trends in the job market, but also provides schools and society with important references on the employment situation.

In the future, we can further expand and improve this research work, including the collection of more data sources, in-depth data analysis methods, and the establishment of more accurate prediction models to help more people make informed career choices.

9. Acknowledgments

Thank you for your attention and reading. I hope this report will be helpful to your career planning. If you have any questions or require further consultation, please feel free to contact us.
This report represents our in-depth study of the employment situation, and we hope that the results of this analysis will be helpful to everyone.

10. Related code

11. Data collection code

# -*- coding: utf-8 -*-
import csv
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(options=option)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
headers = ['职位名称', '薪资范围', '工作地点', '经验要求', '学历要求', '公司名称', '公司类型及规模',
           '公司行业', '公司福利']

jobs_list = []
count = 1
error_time = 0
try:
    driver.get("https://we.51job.com/pc/search?jobArea=150300&keyword=&searchType=2&sortType=0&metro=")
    driver.implicitly_wait(10)
    for j in range(1, 30):
        for i in range(1, 21):
            try:
                job = driver.find_element(By.CSS_SELECTOR, f'div.j_joblist > div:nth-child({i})')
                # print(job.get_attribute('innerHTML'))
                job_name = job.find_element(By.CSS_SELECTOR, '.el > div > span').text
                salary = job.find_element(By.CSS_SELECTOR, '.el > p.info > span.sal').text
                location = job.find_element(By.CSS_SELECTOR, '.el > p.info > span.d.at > span:nth-child(1)').text
                exp = job.find_element(By.CSS_SELECTOR, '.el > p.info > span.d.at > span:nth-child(3)').text
                edu = job.find_element(By.CSS_SELECTOR, '.el > p.info > span.d.at > span:nth-child(5)').text
                company = job.find_element(By.CSS_SELECTOR, '.er > a').text
                company_type_scale = job.find_element(By.CSS_SELECTOR, '.er > p.dc.at').text
                industry = job.find_element(By.CSS_SELECTOR, '.er > p.int.at').text
                try:
                    tag = job.find_element(By.CSS_SELECTOR, '.el > p.tags').text
                except Exception:
                    # Some postings have no benefit tags
                    tag = ''
            except Exception:
                print("error:" + str(i))
                error_time += 1
                if error_time >= 3:
                    input("The network may be disconnected; press Enter to continue")
                    error_time = 0
                continue
            job_item = {
                '职位名称': job_name,
                '薪资范围': salary,
                '工作地点': location,
                '经验要求': exp,
                '学历要求': edu,
                '公司名称': company,
                '公司类型及规模': company_type_scale,
                '公司行业': industry,
                '公司福利': tag
            }

            jobs_list.append(job_item)

            # Wait a random 2-6 seconds between postings to reduce the chance of being detected
            time.sleep(random.randint(2, 6))
            print(j, i)
        error_time = 0
        # Go to the next page of results
        driver.find_element(By.CSS_SELECTOR,
                            'div.bottom-page > div > div > div > button.btn-next').click()
        time.sleep(random.randint(2, 6))
except Exception as e:
    print("Error:", e)
    input("The network may be disconnected; press Enter to continue")


finally:
    with open('51job.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        for job_item in jobs_list:
            row = list(job_item.values())
            writer.writerow(row)

    driver.quit()
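
Writing dictionaries with `list(job_item.values())` relies on every dictionary having the same keys in the same order as the header row. A possible alternative is `csv.DictWriter`, which places each value under its named column. The sketch below writes to an in-memory buffer so that it runs without Selenium. The column names and sample rows are hypothetical.

```python
import csv
import io

fieldnames = ["job_title", "salary", "location"]  # hypothetical column names
jobs = [
    {"job_title": "Test engineer", "salary": "8千-1.2万", "location": "芜湖"},
    {"location": "芜湖", "job_title": "Operator"},  # no salary, different key order
]

# The buffer stands in for open('jobs.csv', 'w', newline='', encoding='utf-8-sig')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(jobs)

csv_text = buffer.getvalue()
print(csv_text)
```

Here `restval=""` fills in columns that a row lacks, so a posting with a missing field still produces a correctly aligned row.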



12. Data cleaning and analysis code

from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import jieba
from wordcloud import WordCloud
import re
# Configure a font that can render the Chinese labels
font = FontProperties(fname="C:/Windows/Fonts/simhei.ttf", size=12)
plt.rcParams['font.sans-serif'] = ['SimHei']
# Read the CSV file written by the crawler
df = pd.read_csv('51job.csv')
# Parse a salary string into (monthly minimum, monthly maximum, months of pay per year).
# Each entry: (pattern, unit of the lower bound, unit of the upper bound); 千 = 1,000 and 万 = 10,000
RANGE_PATTERNS = [
    (re.compile(r'([0-9]+\.?[0-9]*)千-([0-9]+\.?[0-9]*)万'), 1000, 10000),
    (re.compile(r'([0-9]+\.?[0-9]*)-([0-9]+\.?[0-9]*)千'), 1000, 1000),
    (re.compile(r'([0-9]+\.?[0-9]*)-([0-9]+\.?[0-9]*)万'), 10000, 10000),
]
# Optional suffix such as "·13薪" (13 months of pay per year)
MONTHS_PATTERN = re.compile(r'·([0-9]+\.?[0-9]*)薪')
# Daily rate such as "150元/天"
DAILY_PATTERN = re.compile(r'([0-9]+)元/天')


def extract_salary(s):
    s = str(s)
    months_match = MONTHS_PATTERN.search(s)
    months = float(months_match.group(1)) if months_match else 12.0
    for pattern, low_unit, high_unit in RANGE_PATTERNS:
        match = pattern.search(s)
        if match:
            low, high = (float(g) for g in match.groups())
            return low * low_unit, high * high_unit, months
    match = DAILY_PATTERN.search(s)
    if match:
        # Daily rate: a month is approximated as 30 days
        monthly = float(match.group(1)) * 30
        return monthly, monthly, 12.0
    print(s)  # report strings that match none of the known formats
    return None, None, None


# Build the salary columns; rows that cannot be parsed become NaN
parsed = pd.DataFrame(df['薪资范围'].apply(extract_salary).tolist(),
                      index=df.index, columns=['最低薪资', '最高薪资', '加成']).astype(float)
for col in parsed.columns:
    df[col] = parsed[col]
# Average salary: midpoint of the range, scaled by the months of pay per year
# (this interprets "13薪" as 13 months of pay, i.e. a factor of 13/12)
df['平均薪资'] = (df['最低薪资'] + df['最高薪资']) / 2 * df['加成'] / 12
# 1. Salary range distribution histogram
plt.figure(figsize=(20, 10))
# Share one set of bin edges across the three series so that each row of the
# text file below counts all three within the same salary range
salary_max = df[['最低薪资', '最高薪资', '平均薪资']].max().max()
bins = [salary_max * k / 20 for k in range(21)]
hist_low = plt.hist(df['最低薪资'].dropna(), bins=bins, alpha=0.5, color='blue', label='最低薪资')
hist_high = plt.hist(df['最高薪资'].dropna(), bins=bins, alpha=0.5, color='red', label='最高薪资')
hist_ave = plt.hist(df['平均薪资'].dropna(), bins=bins, alpha=0.5, color='yellow', label='平均薪资')
plt.xlabel('薪资范围 (RMB)')
plt.ylabel('招聘数量')
plt.legend()
plt.title('薪资范围分布')

# Save the per-range counts to a text file
with open('salary_distribution.txt', 'w', encoding='utf-8') as file:
    # Header row
    file.write('薪资范围\t最低薪资数量\t最高薪资数量\t平均薪资数量\n')
    # One row per histogram bin
    for i in range(len(hist_low[0])):
        # Label of the bin, e.g. "0.0-12500.0"
        bin_str = f'{hist_low[1][i]}-{hist_low[1][i + 1]}'
        # Number of minimum, maximum and average salaries in this bin
        low_count = int(hist_low[0][i])
        high_count = int(hist_high[0][i])
        ave_count = int(hist_ave[0][i])
        file.write(f'{bin_str}\t{low_count}\t{high_count}\t{ave_count}\n')
plt.savefig('salary_distribution.png')
# Show the chart
plt.show()

# 2. Bar chart of recruitment numbers in different locations
location_counts = df['工作地点'].value_counts()
# Draw the bar chart
plt.figure(figsize=(20, 10))
location_counts.plot(kind='bar', color='skyblue')
# Axis labels and title, rendered with the Chinese font
plt.xlabel('工作地点', fontproperties=font)
plt.ylabel('招聘数量', fontproperties=font)
plt.title('不同地点的招聘数量', fontproperties=font)
plt.xticks(rotation=90)
# Save the counts to a text file
with open('location_counts.txt', 'w', encoding='utf-8') as file:
    file.write('地点\t招聘数量\n')
    for location, count in location_counts.items():
        file.write(f'{location}\t{count}\n')
plt.savefig('location_counts.png')
# Show the chart
plt.show()

# 3. Bar chart of average salary in different locations
# Note: this averages the minimum-salary column ('最低薪资');
# use '平均薪资' here to average the midpoint salary instead
avg_salary_by_location = df.groupby('工作地点')['最低薪资'].mean()
# Sort locations from highest to lowest (all locations are kept)
top_avg_salary_by_location = avg_salary_by_location.sort_values(ascending=False)
# Draw the bar chart
plt.figure(figsize=(20, 10))
top_avg_salary_by_location.plot(kind='bar', color='lightgreen')
# Axis labels and title, rendered with the Chinese font
plt.xlabel('工作地点', fontproperties=font)
plt.ylabel('平均薪资 (RMB)', fontproperties=font)
plt.title('不同地点的平均薪资', fontproperties=font)
plt.xticks(rotation=90)
# Save the averages to a text file
with open('avg_salary_by_location.txt', 'w', encoding='utf-8') as file:
    file.write('工作地点\t平均薪资\n')
    for location, avg_salary in top_avg_salary_by_location.items():
        file.write(f'{location}\t{avg_salary}\n')
plt.savefig('avg_salary_by_location.png')
# Show the chart
plt.show()

# 4. Bar charts of recruitment numbers by experience and academic requirements
# Count postings per experience requirement
exp_counts = df['经验要求'].value_counts()
# Count postings per academic requirement
edu_counts = df['学历要求'].value_counts()
# Left subplot: recruitment numbers by experience requirement
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
exp_counts.plot(kind='bar', color='skyblue')
plt.xlabel('经验要求', fontproperties=font)
plt.ylabel('招聘数量', fontproperties=font)
plt.title('不同经验要求的招聘数量', fontproperties=font)
# Right subplot: recruitment numbers by academic requirement
plt.subplot(1, 2, 2)
edu_counts.plot(kind='bar', color='lightgreen')
plt.xlabel('学历要求', fontproperties=font)
plt.ylabel('招聘数量', fontproperties=font)
plt.title('不同学历要求的招聘数量', fontproperties=font)
plt.tight_layout()
# Save both sets of counts to one text file
with open('experience_education_counts.txt', 'w', encoding='utf-8') as file:
    file.write('经验要求\t招聘数量\n')
    for experience, count in exp_counts.items():
        file.write(f'{experience}\t{count}\n')

    file.write('学历要求\t招聘数量\n')
    for education, count in edu_counts.items():
        file.write(f'{education}\t{count}\n')
plt.savefig('experience_education_counts.png')
# Show the chart
plt.show()

# 5. Bar chart: number of postings per company industry
# Count postings per industry
industry_counts = df['公司行业'].value_counts()

# Plot the postings per industry
plt.figure(figsize=(20, 10))
industry_counts.plot(kind='bar', color='skyblue')
plt.xlabel('公司行业', fontproperties=font)
plt.ylabel('招聘数量', fontproperties=font)
plt.title('不同公司行业的招聘数量', fontproperties=font)
plt.xticks(rotation=90)

# Save the counts to a tab-separated text file
with open('industry_counts.txt', 'w', encoding='utf-8') as file:
    file.write('公司行业\t招聘数量\n')
    for industry, count in industry_counts.items():
        file.write(f'{industry}\t{count}\n')
plt.savefig('industry_counts.png')
# Display the chart
plt.show()

# 6. Bar chart: average salary per company industry
# Compute the mean of the '平均薪资' column for each industry
avg_salary_by_industry = df.groupby('公司行业')['平均薪资'].mean()
# Plot the average salary per industry
plt.figure(figsize=(20, 10))
avg_salary_by_industry.plot(kind='bar', color='lightgreen')
plt.xlabel('公司行业', fontproperties=font)
plt.ylabel('平均薪资 (k RMB)', fontproperties=font)
plt.title('不同公司行业的平均薪资', fontproperties=font)
plt.xticks(rotation=90)
# Save the averages to a tab-separated text file
with open('avg_salary_by_industry.txt', 'w', encoding='utf-8') as file:
    file.write('公司行业\t平均薪资\n')
    for industry, avg_salary in avg_salary_by_industry.items():
        file.write(f'{industry}\t{avg_salary}\n')
plt.savefig('avg_salary_by_industry.png')
# Display the chart
plt.show()


# 7. Bar chart: how often each company benefit appears
# Split a newline-separated benefits string into a list of items
def extract_welfare(welfare_str):
    if isinstance(welfare_str, str):
        return welfare_str.split('\n')
    else:
        # Missing values (NaN) become an empty list
        return []


# Replace each benefits string with a list of items
df['公司福利'] = df['公司福利'].apply(extract_welfare)
# One row per item, strip whitespace, then count occurrences
welfare_counts = df['公司福利'].explode().str.strip().value_counts()
# Relabel blank entries as '无福利' (no benefits listed)
if '' in welfare_counts:
    no_welfare_count = welfare_counts['']
    welfare_counts = welfare_counts.drop('')
    welfare_counts['无福利'] = no_welfare_count
# Plot the 20 most frequent benefit items
plt.figure(figsize=(20, 10))
welfare_counts.nlargest(20).plot(kind='bar', color='lightblue')
plt.xlabel('公司福利项', fontproperties=font)
plt.ylabel('出现次数', fontproperties=font)
plt.title('不同公司福利项的出现次数', fontproperties=font)
plt.xticks(rotation=90)
# Save all counts (not only the top 20) to a tab-separated text file
with open('welfare_counts.txt', 'w', encoding='utf-8') as file:
    file.write('公司福利项\t出现次数\n')
    for welfare, count in welfare_counts.items():
        file.write(f'{welfare}\t{count}\n')
plt.savefig('welfare_counts.png')
# Display the chart
plt.show()

# 8. Bar chart: number of postings per company type and size
# Count postings per company type and size
company_type_and_size_counts = df['公司类型及规模'].value_counts()
# Plot the postings per company type and size
plt.figure(figsize=(20, 10))
company_type_and_size_counts.plot(kind='bar', color='skyblue')
plt.xlabel('公司类型及规模', fontproperties=font)
plt.ylabel('招聘数量', fontproperties=font)
plt.title('不同公司类型及规模的招聘数量', fontproperties=font)
plt.xticks(rotation=90)
# Save the counts to a tab-separated text file
with open('company_type_and_size_counts.txt', 'w', encoding='utf-8') as file:
    file.write('公司类型及规模\t招聘数量\n')
    for company_type, count in company_type_and_size_counts.items():
        file.write(f'{company_type}\t{count}\n')
plt.savefig('company_type_and_size_counts.png')
# Display the chart
plt.show()

# 9. Word cloud of company industries
# Join all industry strings into one text
industry_text = ' '.join(df['公司行业'].dropna())
# Segment the text with jieba
seg_list = jieba.cut(industry_text)
# Build the word cloud (font_path is Windows-specific; use any CJK font on other systems)
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      font_path='C:/Windows/Fonts/simhei.ttf').generate(' '.join(seg_list))
# Render the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('公司行业词云', fontproperties=font)
# Save the word frequencies to a local file, skipping whitespace-only tokens
industry_words = [word for word in jieba.cut(industry_text) if word.strip()]
industry_word_counts = Counter(industry_words)
with open('industry_wordcloud.txt', 'w', encoding='utf-8') as file:
    for word, count in industry_word_counts.items():
        file.write(f'{word}: {count}\n')
# Save the word cloud image (without the matplotlib title)
wordcloud.to_file('industry_wordcloud.png')
# Display the word cloud
plt.show()

# 10. Word cloud of company benefits
# '公司福利' holds lists after step 7; flatten them and join into one text
welfare_text = ' '.join(df['公司福利'].explode().dropna())
# Segment the text with jieba
seg_list = jieba.cut(welfare_text)
# Build the word cloud (font_path is Windows-specific; use any CJK font on other systems)
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      font_path='C:/Windows/Fonts/simhei.ttf').generate(' '.join(seg_list))
# Render the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('公司福利词云', fontproperties=font)
# Save the word cloud image (without the matplotlib title)
wordcloud.to_file('welfare_wordcloud.png')
# Save the word frequencies to a local file, skipping whitespace-only tokens
welfare_words = [word for word in jieba.cut(welfare_text) if word.strip()]
welfare_word_counts = Counter(welfare_words)
with open('welfare_wordcloud.txt', 'w', encoding='utf-8') as file:
    for word, count in welfare_word_counts.items():
        file.write(f'{word}: {count}\n')
# Display the word cloud
plt.show()
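
The benefit tally in step 7 relies on pandas `explode` and `value_counts`. As a rough cross-check, the same counting logic can be sketched in plain Python and run on a handful of rows by hand. The function name `tally_welfare` and the sample rows below are hypothetical and not part of the original script; this is only a minimal sketch of the idea, assuming each cell is either a newline-separated string or a missing value.

```python
from collections import Counter


def tally_welfare(cells):
    """Count benefit items across rows, mirroring step 7 without pandas.

    `cells` is a hypothetical list of raw '公司福利' cell values: each one is
    either a newline-separated string or a missing value (None / NaN).
    """
    counts = Counter()
    for cell in cells:
        if not isinstance(cell, str):
            # Missing cell: contributes nothing, like an empty list after explode()
            continue
        for item in cell.split('\n'):
            counts[item.strip()] += 1
    # Blank entries are relabelled as '无福利' rather than dropped
    if '' in counts:
        counts['无福利'] = counts.pop('')
    return counts


# Made-up sample rows, for illustration only
sample = ['五险一金\n带薪年假', '五险一金\n 餐补 ', None, '']
welfare_tally = tally_welfare(sample)
for item, count in welfare_tally.most_common():
    print(f'{item}\t{count}')
```

Comparing such a hand-checkable tally against `welfare_counts.txt` for a few rows is one way to confirm that whitespace stripping and the '无福利' relabelling behave as intended.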

Origin blog.csdn.net/qq_42531954/article/details/132639697