Python visual analysis project

Today I would like to share a project built on the Python-based Django framework, combining a web crawler, data visualization, and a database. Overall the project turned out quite well. Below is a detailed introduction.

1: Technologies used in the project:

Project back-end language: Python

Project page layout and display: front end (HTML, CSS, JavaScript)

Project data visualization: echarts.js

Project data storage and operations: MySQL database

Project data acquisition: web crawler (Selenium, XPath)

2: Project functions:

After the data has been crawled and the project is started, the data is stored in the database (the database has three tables: a job information table, a user information table, and a job collection table). The user then reaches the project's login/registration page, where the account and password are verified and stored. After successful verification the user enters the home page:

2.1: If the registration information is empty, a custom 404 error page is shown.

2.2: After registering and logging in successfully, the user enters the home page:

The left side is the navigation sidebar. The personal center has sub-pages for modifying personal information and the password; data statistics has sub-pages for a data overview and job collection; and data visualization has pages for salary, company information, company size, corporate financing, and other information.

In the middle is a pie chart visualizing user creation times. To the right of the pie chart is a table of user information, and to the right of that are statistics about the crawled data (such as the total number of records, the number of users, the highest education level, the highest salary, popular cities, and the most salary months). Below that is a marquee of the technologies involved, and at the bottom is a table showing all of the crawled job data.

2.3: Personal information display:

2.4: Data overview: Django's built-in Paginator is used to paginate the data by rows.
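As a minimal sketch (view name, page size, and template name are assumptions, not the author's exact code), the paging described here might be wired up like this:

from django.core.paginator import Paginator
from django.shortcuts import render
from myApp.models import JobInfo

def tableData(request):
    jobs = JobInfo.objects.all()
    paginator = Paginator(jobs, 10)                # 10 rows per page (page size assumed)
    page_number = request.GET.get('page', 1)
    page_obj = paginator.get_page(page_number)     # get_page clamps invalid page numbers
    return render(request, 'tableData.html', {'page_obj': page_obj})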

  

2.5: Data visualization: echarts.js is used on the front end to perform various visual analyses of the data. We only need to use Django template syntax to render the database data into the data attribute of the JS series dictionary.

echarts.js official website: Examples - Apache ECharts

 

 

 

 Some items displayed inside:

3: Analysis of the technical process behind the project:

3.1: Data crawling:

The data is crawled from the Boss Zhipin (boss直聘) job site. The fields collected are: title (job name), address (job city), type (job type), educational (required degree), workExperience (work experience), workTag (job tags), salary, salaryMonth (number of salary months), companyTag (company benefits), hrWork (HR position), hrName (HR name), pratice (whether the position is an internship), companyTitle (company name), companyAvatar (company logo), companyNature (company type), companyStatus (financing status), companyPeople (company size), detailUrl (job detail link), companyUrl (company detail link), and dist (administrative district).

Let's first crawl two pages of data from the site; the screenshots are as follows:

The data above differs somewhat from the format we want to store in the database, so some of the fields need extra processing.

Take the salary information as an example, say 15-30K: the format we want in the database is [15000, 30000]. The first step is to determine whether the position is an internship, because internships are not quoted with a monthly salary. We use an XPath expression to get the text of the salary label, which looks like 15-30K·16薪. If the text contains 'K', the position is not an internship. We then use Python's split('·') to separate the salary from the multi-month part and call len() on the result: if the length is 1 there is no multi-month salary and we store '0薪'; otherwise the multi-month part is taken with a slice and assigned to the salaryMonth variable.

To convert the salary value from 15-30K to [15000, 30000], the code first takes the first element of the salaries list and replaces 'K' with an empty string. It then splits that element on '-' with split('-') to get a list, and finally uses map() to convert each element to an integer and multiply it by 1000, producing a list of integers. For example, if salaries[0] is '10K-20K', salary becomes [10000, 20000] after this processing.
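Condensed from the parsing logic described above (the full crawler code appears later in this post), the salary handling looks roughly like this:

salaries = '15-30K·16薪'                              # example text scraped from the page
pratice = 0
if 'K' in salaries:                                   # postings containing 'K' are not internships
    parts = salaries.split('·')
    salary = list(map(lambda x: int(x) * 1000, parts[0].replace('K', '').split('-')))
    salaryMonth = parts[1] if len(parts) > 1 else '0薪'
else:                                                 # internships are quoted as 元/天 (per day)
    salary = list(map(int, salaries.replace('元/天', '').split('-')))
    salaryMonth = '0薪'
    pratice = 1
print(salary, salaryMonth, pratice)                   # [15000, 30000] 16薪 0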

In addition, the address (city) and dist (administrative district) are assigned by taking the text of the address label with XPath, such as "Wuhan · Hongshan District · Optics Valley", and then splitting it with a simple Python operation.

The company's financing status also needs handling. Browsing the site, it is not hard to see that some companies show no financing status, while others show that they are listed or have received a round of funding. For example, in the screenshots below one posting has no financing information and the other does.

 

So this field needs some processing before it goes into the database. The process is: use Selenium's find_element method to get the tag elements in the company info block, check how many there are, and extract the company type, financing status, and company size accordingly. Specifically, if the web page contains tags for company type, financing status, and company size, their text is extracted in order; otherwise the financing status is treated as "未融资" (not funded) and only the company type and company size are extracted. The company size also needs further processing.

The company size needs to be converted as well. For example, the site shows it as 20-99人 (20-99 people), while we want [20, 99] in the database. The processing is similar to the above: use Selenium's find_element method to find the matching HTML element, process its text with split and replace, and convert the result into a list of two integers, companyPeople. If an exception occurs, for example the element cannot be found or its text is not in the expected format, the code falls into the except block and uses the default value [0, 10000] for companyPeople.

Company benefits also need to be processed before being written to the database, because some companies do not post any benefits, for example:

In that case we only need a simple existence check: if the content does not exist, a placeholder string ('无', i.e. none) is stored instead, which avoids the problem of MySQL rejecting a null value.

After crawling and processing, the fields are collected into a jobData list and written to a CSV file with the csv module's writerow method. The CSV data is then cleaned and written into the database with JobInfo.objects.create, so all the data crawled so far is stored in the database in the required format.

3.2: Login and registration:

The registration page itself is plain front-end work: create a form, put the input boxes inside it, and add a button with type submit. The back-end logic is written in views.py. If request.method == "GET", the registration page for that URL is rendered; if request.method == "POST", the form has been submitted, the values of the input boxes are read, and the data is written into the user table with User.objects.create. The password is hashed with MD5 plus a salt via md5 = hashlib.md5(); md5.update(pwd.encode()), and the user is then redirected to the login URL. (Basic validation of the account and password, with pop-up prompts or an error page, will not be covered here.)
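Here is a minimal sketch of that registration view, assuming the form field names, the login URL, and the salt value (these are not the author's exact code):

import hashlib
from django.shortcuts import render, redirect
from myApp.models import User

def register(request):
    if request.method == 'GET':
        # just show the registration form
        return render(request, 'register.html')
    if request.method == 'POST':
        username = request.POST.get('username')
        pwd = request.POST.get('password')
        md5 = hashlib.md5('some-salt'.encode())   # MD5 with a salt, as described above
        md5.update(pwd.encode())
        User.objects.create(username=username, password=md5.hexdigest())
        return redirect('/login/')                # jump to the login page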

3.3: Homepage:

3.3.1: Homepage - time & welcome message:

time.localtime() gets the current time, and year = timeFormat.tm_year, month = timeFormat.tm_mon, day = timeFormat.tm_mday extract the year, month, and day. Because the month is numeric and we want a format like "June 1, 2023", we create a list of English month names, monthList = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"], so the month we need is monthList[month - 1]. The username in the welcome message is rendered in the front-end HTML with the Django template syntax {{ username }}.

3.3.2: Home page - user creation time pie chart:

Find a suitable chart on the ECharts examples site, take its source code, and put the user creation time data (userTime) into the data attribute of the series list; to prevent escaping, append | safe. The options tooltip: { trigger: 'item' } and legend: { top: '5%', left: 'center' } provide the dynamic tooltip on hover and the legend at the top of the chart.

3.3.3: Homepage - display of the latest user information table:

The front-end table markup will not be explained in detail; the main point is how the user data from the database is displayed. First, User.objects.all() fetches all user objects from the database and assigns them to newUser. On the front-end page, the standard Django template loop {% for item in newUser %} iterates over them (remember to put the loop around the tr tag), and item.<field> reads the value of each field. The table shows the most recent users ordered by creation time: time.mktime(time.strptime(str(item.createTime), '%Y-%m-%d')) uses the strptime function of the time module to convert the time string into a struct_time tuple, and mktime then converts that tuple into a timestamp. A timestamp identifies a specific point in time, usually expressed as the time elapsed since a fixed starting point (such as 1970-01-01 00:00:00 UTC) in seconds, milliseconds, or microseconds; in computer systems it is usually stored as an integer or a floating-point number. Finally, list(sorted(users, key=sort_fn, reverse=True))[:6] sorts the users by timestamp in descending order and slices off the first few (the code keeps six).

3.3.4: Home page - display of the technologies involved in the project:

You only need to add a marquee text-scrolling tag inside the corresponding div on the front-end page, set attributes such as direction, speed, and font size, and change the displayed text to get the scrolling "marquee effect".

3.3.5: Home page - statistics on the right (total number of records, number of users, highest education, highest salary, top locations, most salary months, most common position type):

Total number of records and number of users: get all job objects and all user objects from the database and count them with len().

Highest education: create a dictionary educations = {"Doctor": 1, "Master": 2, "Undergraduate": 3, "College": 4, "High School": 5, "Technical Secondary School": 6, "Junior High School and below": 7, "Education not limited": 8}. Then traverse the jobs in the job table: educations[job.educational] looks up the rank of each job's required degree, and if educations[job.educational] < educations[educationsTop] then educationsTop = job.educational, so the entry with the smallest rank (the highest degree in the dictionary) wins. Finally, the highest-education variable is rendered on the front end with the Django template syntax {{ educationsTop }}.

Top locations: if address.get(job.address, -1) == -1: address[job.address] = 1, else: address[job.address] += 1 counts the addresses of the job objects. If a job's address is not yet in the dictionary, it is added with a value of 1; if it already exists, its value is incremented by 1, producing a dictionary of the form {"location a": count 1, "location b": count 2, ...}. Then the built-in items() method returns all key-value pairs, and addressStr = sorted(address.items(), key=lambda x: x[1], reverse=True)[:3] sorts them in descending order and keeps the top three addresses, whose joined names are assigned to an addressTop variable. Finally, write the Django template syntax {{ addressTop }} in the front-end layout. The highest salary and the position type are handled in the same way.

Highest salary: traverse each salary in the job table, compare them, and keep the maximum.

3.3.6: Data table display on the home page:

jobs = JobInfo.objects.all(); traverse the jobs and render the data into the table on the front-end page. The salary in the database has the form [x, y] and here we only want to display the top of the range, so i.salary = json.loads(i.salary)[1] uses json.loads to convert the stored string into a Python list and then takes the second (highest) value. Whether the position is an internship also needs a small conversion, because the database stores 0 or 1; a simple if/else statement handles it. The company size is stored as [a, b] and we want to display it as a人-b人, so json.loads(i.companyPeople) converts the stored value into a Python list, i.companyPeople = list(map(lambda x: str(x) + '人', ...)) appends '人' to each number, and '-'.join(i.companyPeople) joins the two numbers with '-' to get the required form.
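To tie the home-page pieces together, here is a minimal sketch of what the home view might look like (the module path, template name, and session handling are assumptions; the helper functions themselves are listed under "Home page back-end functions" at the end of this post):

from django.shortcuts import render
from myApp.utils.home import getNowTime, getTagData, getUserCreateTime, getUserTop5, getTableData  # assumed module path

def index(request):
    year, month, day = getNowTime()
    jobCount, userCount, educationsTop, salaryTop, salaryMonthTop, addressTop, praticeMax = getTagData()
    context = {
        'username': request.session.get('username'),   # shown in the welcome message
        'year': year, 'month': month, 'day': day,
        'userTime': getUserCreateTime(),                # pie-chart data, rendered with {{ userTime | safe }}
        'newUser': getUserTop5(),                       # latest-users table
        'tableData': getTableData(),                    # job table at the bottom of the page
        'jobCount': jobCount, 'userCount': userCount, 'educationsTop': educationsTop,
        'salaryTop': salaryTop, 'salaryMonthTop': salaryMonthTop, 'addressTop': addressTop,
        'praticeMax': praticeMax,
    }
    return render(request, 'index.html', context)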

3.4: Personal Center:

3.4.1: Personal Center - Personal Information:

The education, work experience, and intended position drop-down boxes are built by looping over the corresponding database field values inside the select tag: a for loop traverses each element e in the educations list, and an if/else statement checks whether the currently traversed element equals userInfo.educational; if it does, a selected option tag is output, otherwise an ordinary option tag. Here educations and userInfo.educational are data passed from the back end to the front end, and the result is a drop-down box that lets the user choose their education level. The educations list is the one we defined: ["Doctor", "Master", "Undergraduate", "College", "High School", "Technical Secondary/Technical", "Education not limited"].

userInfo.educational is the value of the education field in the database; the work experience and intended position selections work on the same principle.

The avatar is selected with a front-end file input (input type="file").

Then the form is submitted with POST and the back-end view calls the information-update function. This function accepts two parameters, newInfo and FileInfo: newInfo is a dictionary containing the user's new personal information, and FileInfo is a dictionary containing the uploaded file information. The function uses Django's ORM (Object Relational Mapping) to update the user's personal information. Specifically, it first obtains the User object for the given username with the get method, then updates the object's educational, workExpirence, address, and work attributes. If the avatar property in FileInfo is not None, it is assigned to the User object's avatar attribute. Finally, the updated User object is saved back to the database with the save method.
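A minimal sketch of that update function, with the function name and dictionary keys taken from the description above (treat the exact names as assumptions):

from myApp.models import User

def changeUserInfo(newInfo, FileInfo):
    # fetch the User object for the given username via the ORM
    user = User.objects.get(username=newInfo.get('username'))
    user.educational = newInfo.get('educational')
    user.workExpirence = newInfo.get('workExpirence')   # field name as spelled in the model
    user.address = newInfo.get('address')
    user.work = newInfo.get('work')
    if FileInfo.get('avatar') is not None:              # only replace the avatar if a file was uploaded
        user.avatar = FileInfo.get('avatar')
    user.save()                                         # write the changes back to the database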

3.4.2: Personal Center - Change Password:

The principle is the same as above.

3.5: Visualization charts:

3.5.1: Salary situation:

Traverse all non-internship jobs, add each job's type as a key in jobsType, and append the second item of its salary information (the top of the range) to the list for that type. The result has the form {Java: [1, 2, 4, 3]}, where the list collects the salaries of all Java positions.

Next, the code creates an empty dictionary barData, traverses each type in jobsType, groups its salary list into ranges, counts how many salaries fall into each range, and stores the result in barData. Finally, the code returns salaryList, barData, and the list of barData's keys. At this point barData has the form {Java: [1, 2, 3, 4]}, where the list now holds the number of positions in each salary bracket.
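A sketch of that grouping, with the salary brackets chosen here as an assumption (the author's exact ranges are not shown in the post):

import json
from myApp.models import JobInfo

def getSalaryBarData():
    salaryList = ['0-5k', '5-10k', '10-20k', '20-30k', '30k以上']   # assumed bracket labels
    bounds = [5000, 10000, 20000, 30000]
    jobsType = {}
    for job in JobInfo.objects.all():
        if job.pratice:                       # skip internships (their salary is per day)
            continue
        top = json.loads(job.salary)[1]       # second item: the high end of the salary range
        jobsType.setdefault(job.type, []).append(top)
    barData = {}
    for jobType, salaries in jobsType.items():
        counts = [0] * len(salaryList)
        for s in salaries:
            for i, b in enumerate(bounds):
                if s <= b:
                    counts[i] += 1
                    break
            else:
                counts[-1] += 1               # above the last bound
        barData[jobType] = counts
    return salaryList, barData, list(barData.keys())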

Then introduce echarts.js; in the JS code, the series is built with Django template syntax:

series: [{% for k, v in barData.items %}
    { name: '{{ k }}', type: 'bar', data: {{ v }} },{% endfor %}]

and the dynamic hover effect is configured through the tooltip field.

The filter on the input box sends the option selected on the front-end page to the back end, which then runs objects.filter(<condition>) on the job records in the database to get the required data and displays it with ECharts as described above.
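A sketch of that filtering step (view and template names are assumptions; whether the value arrives via GET or POST depends on the form):

from django.shortcuts import render
from myApp.models import JobInfo

def salary_view(request):
    selected = request.POST.get('type', '')    # the job type chosen in the front-end select box
    jobs = JobInfo.objects.filter(type=selected) if selected else JobInfo.objects.all()
    # ...build salaryList / barData from `jobs` as described above, then render the chart page
    return render(request, 'salary.html', {'jobs': jobs})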

Pie chart of the average salary of interns: use Django's ORM to get all JobInfo instances from the database. Then create an empty dictionary jobsType to store each job type and its corresponding average monthly salary, traverse all job records, keep only the internship positions (pratice=1), calculate their average monthly salary, and add it to jobsType.

Then create an empty list result to hold the aggregated average salary of each job type, traverse each key-value pair in jobsType, and use the custom addLis (averaging) function to aggregate the monthly salaries of that type. Finally, append each job type's name and its aggregated average salary to result and return it. At this point result has the form [{job type: average salary of that type}], and it is written into the series of the front-end JS code.
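A sketch of the intern pie-chart data described above; addLis is the author's custom helper, so the version here (a plain average) is an assumption about its exact behaviour:

import json
from myApp.models import JobInfo

def getPraticeSalaryPie():
    jobsType = {}
    for job in JobInfo.objects.all():
        if job.pratice != 1:                       # keep internship positions only
            continue
        low, high = json.loads(job.salary)         # e.g. [150, 200] for a 元/天 posting
        jobsType.setdefault(job.type, []).append((low + high) / 2)

    def addLis(values):                            # assumed: average the collected salaries
        return round(sum(values) / len(values), 2)

    # ECharts pie data: [{'name': job type, 'value': average salary of that type}, ...]
    return [{'name': k, 'value': addLis(v)} for k, v in jobsType.items()]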

As for the chart of multi-month salaries, process the JobInfo objects, keep the records whose number of salary months is greater than 0, count the occurrences of each salary-month value, and return a list and a dictionary: the list contains the salary-month strings, and the dictionary maps each salary-month value to its count. Then set {{ louDouData | safe }} in the data field of the imported ECharts series list.
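A sketch of those salary-month statistics (function name assumed); the view then builds the louDouData that the template renders with {{ louDouData | safe }}:

from myApp.models import JobInfo

def getSalaryMonthData():
    counts = {}
    for job in JobInfo.objects.all():
        if int(job.salaryMonth) > 0:               # keep postings that offer extra salary months
            key = job.salaryMonth + '薪'
            counts[key] = counts.get(key, 0) + 1
    # a list of salary-month labels plus a dict of label -> count, as described above
    return list(counts.keys()), counts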

For reasons of length, the remaining charts are abbreviated here.

Only a small part of the code is shown below; otherwise the post would be far too long:

Database creation:

from django.db import models

# Create your models here.

class JobInfo(models.Model):
    id = models.AutoField('id',primary_key=True)
    title = models.CharField('工作名',max_length=255,default='')
    address = models.CharField('地址',max_length=255,default='')
    type = models.CharField('类型',max_length=255,default='')
    educational = models.CharField('学历',max_length=255,default='')
    workExperience = models.CharField('工作经验',max_length=255,default='')
    workTag = models.CharField('工作标签',max_length=2555,default='')
    salary = models.CharField('薪资',max_length=255,default='')
    salaryMonth = models.CharField('年终奖',max_length=255,default='')
    companyTags = models.CharField('公司标签',max_length=2555,default='')
    hrWork = models.CharField('人事职位',max_length=255,default='')
    hrName = models.CharField('人事名字',max_length=255,default='')
    pratice = models.BooleanField('是否为实习单位',default=False)
    companyTitle = models.CharField('公司名称',max_length=255,default='')
    companyAvatar = models.CharField('公司头像',max_length=255,default='')
    companyNature = models.CharField('公司性质',max_length=255,default='')
    companyStatus = models.CharField('公司状态',max_length=255,default='')
    companyPeople = models.CharField('公司人数',max_length=255,default='')
    detailUrl = models.CharField('详情地址',max_length=2555,default='')
    companyUrl = models.CharField('公司详情地址',max_length=2555,default='')
    createTime = models.DateField('创建时间',auto_now_add=True)
    dist = models.CharField('行政区',max_length=255,default='')
    class Meta:
        db_table = "jobInfo"

class User(models.Model):
    id = models.AutoField('id',primary_key=True)
    username = models.CharField('用户名',max_length=255,default='')
    password = models.CharField('密码',max_length=255,default='')
    educational = models.CharField('学历',max_length=255,default='')
    workExpirence = models.CharField('工作经验',max_length=255,default='')
    address = models.CharField('意向城市',max_length=255,default='')
    work = models.CharField('意向岗位',max_length=255,default='')
    avatar = models.FileField("用户头像",upload_to="avatar",default="avatar/default.png")
    createTime = models.DateField("创建时间",auto_now_add=True)

    class Meta:
        db_table = "user"

class History(models.Model):
    id = models.AutoField('id',primary_key=True)
    job = models.ForeignKey(JobInfo,on_delete=models.CASCADE)
    user = models.ForeignKey(User,on_delete=models.CASCADE)
    count = models.IntegerField("点击次数",default=1)
    class Meta:
        db_table = "histroy"

Crawler code:

import json
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import csv
import pandas as pd
import os
import django
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'boss直聘数据可视化分析.settings')
django.setup()
# Two extra configuration lines are still needed at the top of the file so the program knows where to find the models.
from myApp.models import *

class spider(object):
    def __init__(self,type,page):
        self.type = type
        self.page = page
        self.spiderUrl = "https://www.zhipin.com/web/geek/job?query=%s&city=100010000&page=%s"

    def startBrower(self):
        option = webdriver.ChromeOptions()
        # option.add_experimental_option("debuggerAddress", "localhost:9222")
        option.add_experimental_option("excludeSwitches", ['enable-automation'])
        # executable_path was removed in Selenium 4; pass the driver path via a Service object
        s = Service("./chromedriver.exe")
        browser = webdriver.Chrome(service=s, options=option)
        return browser

    def main(self,**info):
        if info['page'] < self.page:return
        brower = self.startBrower()
        print('List page URL: ' + self.spiderUrl % (self.type, self.page))
        brower.get(self.spiderUrl % (self.type,self.page))
        time.sleep(15)
        # return
        job_list = brower.find_elements(by=By.XPATH, value="//ul[@class='job-list-box']/li")
        for index,job in enumerate(job_list):
            try:
                print("Crawling item %d" % (index + 1))
                jobData = []
                # title: job name
                title = job.find_element(by=By.XPATH,
                                         value=".//div[contains(@class,'job-title')]/span[@class='job-name']").text
                # address: job city
                addresses = job.find_element(by=By.XPATH,
                                           value=".//div[contains(@class,'job-title')]//span[@class='job-area']").text.split(
                    '·')
                address = addresses[0]
                # dist: administrative district
                if len(addresses) != 1:dist = addresses[1]
                else: dist = ''
                # type: job type
                type = self.type

                tag_list = job.find_elements(by=By.XPATH,
                                             value=".//div[contains(@class,'job-info')]/ul[@class='tag-list']/li")
                if len(tag_list) == 2:
                    educational = job.find_element(by=By.XPATH,
                                                   value=".//div[contains(@class,'job-info')]/ul[@class='tag-list']/li[2]").text
                    workExperience = job.find_element(by=By.XPATH,
                                                      value=".//div[contains(@class,'job-info')]/ul[@class='tag-list']/li[1]").text
                else:
                    educational = job.find_element(by=By.XPATH,
                                                   value=".//div[contains(@class,'job-info')]/ul[@class='tag-list']/li[3]").text
                    workExperience = job.find_element(by=By.XPATH,
                                                      value=".//div[contains(@class,'job-info')]/ul[@class='tag-list']/li[2]").text
                # hr
                hrWork = job.find_element(by=By.XPATH,
                                          value=".//div[contains(@class,'job-info')]/div[@class='info-public']/em").text
                hrName = job.find_element(by=By.XPATH,
                                          value=".//div[contains(@class,'job-info')]/div[@class='info-public']").text

                # workTag: job tags
                workTag = job.find_elements(by=By.XPATH,
                                            value="./div[contains(@class,'job-card-footer')]/ul[@class='tag-list']/li")
                workTag = json.dumps(list(map(lambda x: x.text, workTag)))

                # salary
                salaries = job.find_element(by=By.XPATH,
                                            value=".//div[contains(@class,'job-info')]/span[@class='salary']").text
                # pratice: whether the position is an internship
                pratice = 0
                if salaries.find('K') != -1:
                    salaries = salaries.split('·')
                    if len(salaries) == 1:
                        salary = list(map(lambda x: int(x) * 1000, salaries[0].replace('K', '').split('-')))
                        salaryMonth = '0薪'
                    else:
                        # salaryMonth: extra salary months at year end
                        salary = list(map(lambda x: int(x) * 1000, salaries[0].replace('K', '').split('-')))
                        salaryMonth = salaries[1]
                else:
                    salary = list(map(lambda x: int(x), salaries.replace('元/天', '').split('-')))
                    salaryMonth = '0薪'
                    pratice = 1

                # companyTitle: company name
                companyTitle = job.find_element(by=By.XPATH, value=".//h3[@class='company-name']/a").text
                # companyAvatar: company logo
                companyAvatar = job.find_element(by=By.XPATH,
                                                 value=".//div[contains(@class,'job-card-right')]//img").get_attribute(
                    "src")
                companyInfoList = job.find_elements(by=By.XPATH,
                                                    value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li")
                if len(companyInfoList) == 3:
                    companyNature = job.find_element(by=By.XPATH,
                                                     value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li[1]").text
                    companyStatus = job.find_element(by=By.XPATH,
                                                     value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li[2]").text
                    try:
                        companyPeople = list(map(lambda x: int(x), job.find_element(by=By.XPATH,
                                                                                    value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li[3]").text.replace(
                            '人', '').split('-')))
                    except:
                        companyPeople = [0, 10000]
                else:
                    companyNature = job.find_element(by=By.XPATH,
                                                     value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li[1]").text
                    companyStatus = "未融资"
                    try:
                        companyPeople = list(map(lambda x: int(x), job.find_element(by=By.XPATH,
                                                                                    value=".//div[contains(@class,'job-card-right')]//ul[@class='company-tag-list']/li[2]").text.replace(
                            '人', '').split('-')))
                    except:
                        companyPeople = [0, 10000]
                # companyTag: company benefits
                companyTag = job.find_element(by=By.XPATH,
                                              value="./div[contains(@class,'job-card-footer')]/div[@class='info-desc']").text
                if companyTag:
                    companyTag = json.dumps(companyTag.split(','))

                else:
                    companyTag = '无'

                # detailUrl: job detail link
                detailUrl = job.find_element(by=By.XPATH,
                                             value="./div[@class='job-card-body clearfix']/a").get_attribute('href')
                # companyUrl: company detail link
                companyUrl = job.find_element(by=By.XPATH, value="//h3[@class='company-name']/a").get_attribute('href')

                jobData.append(title)
                jobData.append(address)
                jobData.append(type)
                jobData.append(educational)
                jobData.append(workExperience)
                jobData.append(workTag)
                jobData.append(salary)
                jobData.append(salaryMonth)
                jobData.append(companyTag)
                jobData.append(hrWork)
                jobData.append(hrName)
                jobData.append(pratice)
                jobData.append(companyTitle)
                jobData.append(companyAvatar)
                jobData.append(companyNature)
                jobData.append(companyStatus)
                jobData.append(companyPeople)
                jobData.append(detailUrl)
                jobData.append(companyUrl)
                jobData.append(dist)

                self.save_to_csv(jobData)
            except:
                pass

        brower.quit()  # close this page's browser before moving on to the next page
        self.page += 1
        self.main(page=info['page'])

    def save_to_csv(self,rowData):
        with open('./temp.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(rowData)

    def clear_numTemp(self):
        with open('./numTemp.txt','w',encoding='utf-8') as f:
            f.write('')

    def init(self):
        if not os.path.exists('./temp.csv'):
            with open('./temp.csv','a',newline='',encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(["title","address","type","educational","workExperience","workTag","salary","salaryMonth",
                                 "companyTags","hrWork","hrName","pratice","companyTitle","companyAvatar","companyNature",
                                 "companyStatus","companyPeople","detailUrl","companyUrl","dist"])

    def save_to_sql(self):
        data = self.clearData()
        for job in data:
            JobInfo.objects.create(
                title=job[0],
                address = job[1],
                type = job[2],
                educational = job[3],
                workExperience = job[4],
                workTag = job[5],
                salary = job[6],
                salaryMonth = job[7],
                companyTags = job[8],
                hrWork = job[9],
                hrName = job[10],
                pratice = job[11],
                companyTitle = job[12],
                companyAvatar = job[13],
                companyNature = job[14],
                companyStatus = job[15],
                companyPeople = job[16],
                detailUrl = job[17],
                companyUrl = job[18],
                dist=job[19]
            )
        print("Successfully imported into the database")
        os.remove("./temp.csv")

    def clearData(self):
        df = pd.read_csv('./temp.csv')
        df.dropna(inplace=True)
        df.drop_duplicates(inplace=True)
        df['salaryMonth'] = df['salaryMonth'].map(lambda x:x.replace('薪',''))
        print("Total number of rows: %d" % df.shape[0])
        return df.values

if __name__ == '__main__':
    spiderObj = spider("go", 1)
    spiderObj.init()
    spiderObj.main(page=10)
    spiderObj.save_to_sql()

Home page back-end functions:

from myApp.models import User,JobInfo
from .publicData import *
import time
import json
# Home page: time + welcome message
def getNowTime():
    timeFormat = time.localtime()
    year = timeFormat.tm_year
    month = timeFormat.tm_mon
    day = timeFormat.tm_mday
    monthList = ["January","February","March","April","May","June","July","August","September","October","November","December"]
    return year,monthList[month - 1],day

# The 7 indicators on the right side of the home page
def getTagData():
    jobs = getAllJobInfo()
    users = getAllUser()
    educationsTop = "学历不限"
    salaryTop = 0
    salaryMonthTop = 0
    address = {}
    pratice = {}
    for job in jobs:
        if educations[job.educational] < educations[educationsTop]:
            educationsTop = job.educational
        # only for non-internship positions
        if not job.pratice:
            salary = json.loads(job.salary)[1]
            if salaryTop < salary:
                salaryTop = salary
        if int(job.salaryMonth) > salaryMonthTop:
            salaryMonthTop = int(job.salaryMonth)
        # Check whether the dictionary `address` already contains the key job.address.
        # If not, add it with a value of 1; otherwise increment its count.
        if address.get(job.address,-1) == -1:
            address[job.address] = 1
        else:
            address[job.address] += 1
        if pratice.get(job.pratice,-1) == -1:
            pratice[job.pratice] = 1
        else:
            pratice[job.pratice] += 1
    addressStr = sorted(address.items(),key=lambda x:x[1],reverse=True)[:3]
    addressTop = ""
    for i in addressStr:
        addressTop += i[0] + ","
    praticeMax = sorted(pratice.items(),key=lambda x:x[1],reverse=True)
    # praticeMax[0][0] == False means the most common category is regular positions, otherwise internships
    return len(jobs),len(users),educationsTop,salaryTop,salaryMonthTop,addressTop,praticeMax[0][0]

def getUserCreateTime():
    users = getAllUser()
    data = {}
    for u in users:
        if data.get(str(u.createTime),-1) == -1:
            data[str(u.createTime)] = 1
        else:
            data[str(u.createTime)] += 1
    result = []
    for k,v in data.items():
        result.append({
            'name':k,
            'value':v
        })
    return result

def getUserTop5():
    users = getAllUser()
    def sort_fn(item):
        return time.mktime(time.strptime(str(item.createTime),'%Y-%m-%d'))
    users = list(sorted(users,key=sort_fn,reverse=True))[:6]
    return users

def getAllJobsPBar():
    jobs = getAllJobInfo()
    tempData = {}
    for job in jobs:
        if tempData.get(str(job.createTime),-1) == -1:
            tempData[str(job.createTime)] = 1
        else:
            tempData[str(job.createTime)] += 1
    def sort_fn(item):
        item = list(item)
        return time.mktime(time.strptime(str(item[0]), '%Y-%m-%d'))
    result = list(sorted(tempData.items(),key=sort_fn,reverse=False))
    def map_fn(item):
        item = list(item)
        item.append(round(item[1] / len(jobs),3))
        return item
    result = list(map(map_fn,result))
    return result

def getTableData():
    jobs = getAllJobInfo()
    for i in jobs:
        i.workTag = '/'.join(json.loads(i.workTag))
        if i.companyTags != "无":
            i.companyTags = '/'.join(json.loads(i.companyTags))

        i.companyPeople = json.loads(i.companyPeople)
        i.companyPeople = list(map(lambda x:str(x) + '人',i.companyPeople))
        i.companyPeople = '-'.join(i.companyPeople)
        i.salary = json.loads(i.salary)[1]
    return jobs


Note: Because there is a lot of code, a fully detailed walkthrough would take a great deal of time and space. I have written a Word document describing the whole project; if needed, I can package the project files together with that document for you. In addition, if you cannot get the project running, I can operate it remotely for you to make sure it runs.


Origin blog.csdn.net/Abtxr/article/details/131039047