A Python crawler for IT job listings (requests, BeautifulSoup)

For my final Python homework, I wrote a crawler for IT job listings. I wanted to get a feel for the IT job market, and everyone is welcome to discuss it together. Without further ado, let's get to the good stuff.

First, decide what you want to crawl.
I want the job listing content, so open the page and press F12 to inspect it. From there you can see which URLs the site actually requests:

https://job.csdn.net/search/index?k=&t=1&f=1
https://job.csdn.net/search/index?k=&t=1&f=2

Compare the URLs page by page to work out the pagination pattern, then set up a loop to turn the pages:

page = 1  # starting page
while page <= 10:  # loop over pages 1 through 10
    url = f'https://job.csdn.net/search/index?k=&t=1&f={page}'

As you can see, the key to paging through the site is the f parameter. Next, look at the request headers under the Network tab in DevTools. We spoof the browser headers to get past the anti-crawling mechanism, which blocks requests that identify themselves as coming from Python.

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47"
}  # spoofed browser headers
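
As a quick sanity check (my own addition, not part of the original write-up), you can fetch the first page with this header and confirm the server responds normally before crawling in earnest:

import requests

# Sanity check (my addition): request page 1 with the spoofed User-Agent
# and confirm the server answers with HTTP 200 before crawling further.
check = requests.get('https://job.csdn.net/search/index?k=&t=1&f=1', headers=headers)
print(check.status_code)  # expect 200 if the header is accepted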

Next, import the modules and fetch the page content:

import requests
from bs4 import BeautifulSoup  # import modules

response = requests.get(url, headers=headers)  # send the request with the spoofed headers
newurl = response.text  # get the page HTML

Next, find the content you want: click the element-picker arrow (the blue icon in DevTools) and click the part of the page you're interested in to jump to its element. Then search recursively upward through the parent elements to see what we need. Once the target content is found, build CSS selectors to match the tags and pick out exactly what we want.
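
Here is a minimal sketch of that selection step, using the class names (.common_r_con, .position_list.clearfix, .employ_pos_name, .position_require) as they appear in the page's markup in the full code below:

soup = BeautifulSoup(newurl, 'html.parser')  # parse the fetched HTML

# Each .position_list.clearfix block under .common_r_con is one listing.
for block in soup.select('.common_r_con .position_list.clearfix'):
    name = block.select_one('.employ_pos_name').get_text(strip=True)
    require = block.select_one('.position_require').get_text(strip=True)
    print(name, require)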
Finally, save the results to a file, and the little project is complete.

with open(file='e:/练习.txt', mode='a+') as f:  # e:/练习.txt is a file I created on my machine; 'a+' appends without overwriting the existing content

Below is the complete code that everyone wants.

'''
    Author: ls富
    Date: 2020/12/1
Build the class
Build its methods
Methods:
    spoof the headers
    request the data
    filter the data
    write to a file
'''
import requests
from bs4 import BeautifulSoup  # import modules


class Position():

    def __init__(self, position_name, position_require):  # object attributes
        self.position_name = position_name
        self.position_require = position_require

    def __str__(self):
        return '%s%s\n' % (self.position_name, self.position_require)  # override __str__ to format the fields as a single string


class Xiang():
    def harder(self, url):
        headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47"
        }  # spoofed browser headers
        response = requests.get(url, headers=headers)  # send the request with the spoofed headers
        newurl = response.text  # get the page HTML
        soup = BeautifulSoup(newurl, 'html.parser')  # parse the page with BeautifulSoup
        soupl = soup.select(".common_r_con")  # first pass: select the listings container
        results = []  # list to store the results
        for e in soupl:
            biao = e.select('.position_list.clearfix')  # second pass: select each listing block
            for h in biao:
                # third pass: pick out the fields we need and wrap them in a Position
                p = Position(h.select_one('.employ_pos_name').get_text(strip=True),
                             h.select_one('.position_require').get_text(strip=True))
                results.append(p)
        return results  # return the collected listings

if __name__ == '__main__':
    a = Xiang()  # build the crawler object
    with open(file='e:/练习.txt', mode='a+') as f:  # e:/练习.txt is a file I created on my machine; 'a+' appends without overwriting the existing content
        page = 1  # starting page
        while page <= 10:  # loop over pages 1 through 10
            url = f'https://job.csdn.net/search/index?k=&t=1&f={page}'
            for item in a.harder(url):
                line = f'{item.position_name}\t{item.position_require}\n'
                f.write(line)  # write one listing per line
            page += 1
    print("Download complete")

The effect: the crawled job titles and requirements are appended to e:/练习.txt, one listing per line.

Origin: blog.csdn.net/weixin_47514459/article/details/110420969