A simple web crawler in Python 3.7: scraping data, with a line-by-line code tutorial

A brief introduction

Scraping data, simply put, means extracting the data you need from a web page.

How scraping works: it is the process of parsing HTML tags out of a page's source code, for example extracting the attributes or text content of HTML tags. Understanding the structure of the web page is therefore necessary for scraping data.

Selenium is essential for more advanced, interactive crawlers (clicking, opening links, turning pages, and so on). Interested readers can look into it on their own.

Beautiful Soup

Beautiful Soup (see its documentation) is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide the usual ways of navigating, searching, and modifying a document.
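
To make "navigating, searching, and modifying" a little more concrete, here is a minimal sketch. The HTML fragment and the example.com URLs are made up for illustration, and it uses Python's built-in html.parser so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this illustration.
html = '<div id="post"><a href="https://example.com/a1">First post</a><p>Short summary</p></div>'

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')          # search: the first <a> tag in the document
link_text = link.get_text()    # text content of the tag
link_href = link['href']       # value of the href attribute
parent_id = link.parent['id']  # navigation: step up to the enclosing <div>

link['href'] = 'https://example.com/changed'  # modification: rewrite an attribute

print(link_text, link_href, parent_id)
print(soup.a['href'])
```

Reading an attribute works like reading a dict key, and get_text() returns the text inside the tag. Those are the two operations the rest of this article relies on.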

Python has other libraries that can parse HTML; interested readers can search for them. This article uses Beautiful Soup.

Scraping blog data from the CSDN homepage

CSDN homepage: https://www.csdn.net

Data we need

  • Blog title
  • Blog address (URL)
  • Brief description of the blog (abstract)

Understand page structure

Open the developer tools in Google Chrome; anyone who does web development will be familiar with them.
[Screenshot: Chrome developer tools showing the CSDN homepage markup]

From the screenshot, we can see that the data we need sits in li tags, so extracting the li tags is the first step. The parent container of the li tags is a ul whose id is feedlist_id; remember this, because it will help in understanding the code later.

Now take a look at the internal structure of an li tag:
[Screenshot: internal structure of one li tag]

As the screenshot shows, the title is placed in an a tag and the brief description in a div, so the second step is to parse these two tags, and then we are done.
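
If you prefer to inspect the structure from Python rather than from the developer tools, a small helper can print an outline of the tag tree. This is only a sketch: the markup below is hand-written to imitate the structure described above (the real CSDN page may differ), and outline is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup that imitates the structure described in the text;
# the real CSDN page may differ.
html = """
<ul id="feedlist_id">
  <li data-type="blog">
    <div>
      <div><h2><a href="https://example.com/post-1">Post one</a></h2></div>
      <div class="summary">Summary one</div>
    </div>
  </li>
</ul>
"""

def outline(tag, depth=0):
    """Return one indented line per tag, showing its name, id and classes."""
    label = tag.name
    if tag.get('id'):
        label += '#' + tag['id']
    for cls in tag.get('class', []):
        label += '.' + cls
    lines = ['  ' * depth + label]
    for child in tag.children:
        if isinstance(child, Tag):  # skip text nodes and whitespace
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(html, 'html.parser')
tree = outline(soup.ul)
print('\n'.join(tree))
```

The printed outline reads much like a CSS selector path, which makes it easier to see where selectors such as 'div > div > h2 > a' come from.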

Start coding

  1. Get the page source code
import lxml
import requests
from bs4 import BeautifulSoup

# Fetch the page source
url = 'https://www.csdn.net'
html_source = requests.get(url)
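
The request above is the bare minimum. In practice you may want a timeout and an explicit check of the HTTP status. The sketch below shows one way to do that. fetch_html is a hypothetical helper name, and the User-Agent string and 10-second timeout are assumptions, not values the site is known to require.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Assumption: some sites reject requests that lack a browser-like User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0 (simple-crawler tutorial)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling fetch_html('https://www.csdn.net') would then return the same text as html_source.text, but a failed request raises an exception instead of silently passing an error page on to the parser.
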
  2. Extract the li tags
import lxml
import requests
from bs4 import BeautifulSoup

# Fetch the page source
url = 'https://www.csdn.net'
html_source = requests.get(url)

# Extract the `li` tags
html_parser = BeautifulSoup(html_source.text, 'lxml')  # create the page parser
results = html_parser.select('ul#feedlist_id > li')  # returns a list of the matching `li` tags

html_parser.select uses CSS selectors to find tags.
Class selectors are written with a dot (.), for example: ul.class > li.
ID selectors are written with a hash (#), for example: ul#id > li.
You can also do without class and ID selectors and instead ensure correctness by spelling out several levels of nodes, for example by replacing 'ul#feedlist_id > li' with 'div > div > main > ul > li'.
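
The three selector styles can be compared side by side on a small fragment. The markup, id, and class names below are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<main>
  <ul id="feed" class="posts">
    <li>one</li>
    <li>two</li>
  </ul>
  <ul class="ads">
    <li>ad</li>
  </ul>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')

by_id = [li.get_text() for li in soup.select('ul#feed > li')]      # ID selector
by_class = [li.get_text() for li in soup.select('ul.ads > li')]    # class selector
by_path = [li.get_text() for li in soup.select('main > ul > li')]  # tag names only

print(by_id)
print(by_class)
print(by_path)
```

Note that the tag-only path matches the li tags of both lists here, so a path built from tag names alone tends to need more levels before it is as precise as an ID selector.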

  3. Parse the tags inside each li tag
import lxml
import requests
from bs4 import BeautifulSoup

# Fetch the page source
url = 'https://www.csdn.net'
html_source = requests.get(url)

# Extract the `li` tags
html_parser = BeautifulSoup(html_source.text, 'lxml')  # create the page parser
results = html_parser.select('ul#feedlist_id > li')  # returns a list of the matching `li` tags

# Parse the tags inside each `li` tag and extract the data we need
for item in results:
    a_tag = item.select('div > div > h2 > a')[0]  # extract the a tag
    title = a_tag.get_text('|', strip=True)  # text of the a tag, i.e. the title; strip=True trims whitespace, and multiple pieces of text are joined with `|`
    href = a_tag['href']  # href attribute of the a tag, i.e. the blog URL
    desc = item.select('div > div.summary')[0].get_text(strip=True)  # extract the abstract

    print('title: ' + title)
    print('url: ' + href)
    print('desc: ' + desc, end='\n\n')
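
Indexing with [0] raises an IndexError as soon as one li tag lacks the expected inner tags. A more defensive variant, sketched below on hypothetical markup, uses select_one, which returns None when nothing matches. parse_item is a made-up helper name, and the second li imitates an entry without a title link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no title link, as an ad might.
html = """
<ul id="feedlist_id">
  <li><div><div><h2><a href="https://example.com/p1">
      Title <span>part two</span></a></h2></div>
      <div class="summary"> A summary </div></div></li>
  <li><div>Sponsored</div></li>
</ul>
"""

def parse_item(item):
    """Return a dict for one <li>, or None when the expected tags are missing."""
    a_tag = item.select_one('div > div > h2 > a')
    summary = item.select_one('div > div.summary')
    if a_tag is None or summary is None:
        return None
    return {
        'title': a_tag.get_text('|', strip=True),  # text pieces joined with '|'
        'url': a_tag.get('href', ''),
        'desc': summary.get_text(strip=True),
    }

soup = BeautifulSoup(html, 'html.parser')
posts = [parse_item(li) for li in soup.select('ul#feedlist_id > li')]
posts = [post for post in posts if post is not None]
print(posts)
```

The title in this fragment also shows what the '|' separator does: the text directly inside the a tag and the text inside the nested span are stripped separately and then joined.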

At this point the goal of scraping the data has been met, but in real development the data still needs to be filtered. For example, an li tag may hold an advertisement, which is not what we want; we only need blog data. So we filter the data to keep the script running correctly.

  4. Filter the data
import lxml
import requests
from bs4 import BeautifulSoup

# Fetch the page source
url = 'https://www.csdn.net'
html_source = requests.get(url)

# Extract the `li` tags
html_parser = BeautifulSoup(html_source.text, 'lxml')  # create the page parser
results = html_parser.select('ul#feedlist_id > li')  # returns a list of the matching `li` tags

# Parse the tags inside each `li` tag and extract the data we need
for item in results:
    if 'data-type' not in item.attrs:  # item.attrs is a dict of the li tag's attributes
        continue

    if item.attrs['data-type'] != 'blog':  # filter the data: keep blog entries only
        continue

    a_tag = item.select('div > div > h2 > a')[0]  # extract the a tag
    title = a_tag.get_text('|', strip=True)  # text of the a tag, i.e. the title; strip=True trims whitespace, and multiple pieces of text are joined with `|`
    href = a_tag['href']  # href attribute of the a tag, i.e. the blog URL
    desc = item.select('div > div.summary')[0].get_text(strip=True)  # extract the abstract

    print('title: ' + title)
    print('url: ' + href)
    print('desc: ' + desc, end='\n\n')
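
As an alternative to the two if checks, the filter can probably be pushed into the selector itself with a CSS attribute selector. The sketch below runs on hypothetical markup; the "ad" value is made up, and only "blog" comes from the script above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the data-type values other than "blog" are made up.
html = """
<ul id="feedlist_id">
  <li data-type="blog">blog A</li>
  <li data-type="ad">advert</li>
  <li>no attribute</li>
  <li data-type="blog">blog B</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# [data-type="blog"] keeps only <li> tags whose attribute equals "blog";
# tags without the attribute never match, so no separate check is needed.
blog_items = soup.select('ul#feedlist_id > li[data-type="blog"]')
texts = [li.get_text(strip=True) for li in blog_items]
print(texts)
```

Which form to use is a matter of taste: the explicit if checks are easier to step through line by line, while the attribute selector keeps the loop body focused on extraction.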

Script output

title: 荐|终于有人把域名和DNS服务器给写明白了
url: https://blog.csdn.net/qq_17623363/article/details/106037921
desc: 终于有人把域名和DNS服务器给写明白了

title: 荐|母亲节不能陪在妈妈身边,我用css和js给妈妈做了一个爱心飘落
url: https://blog.csdn.net/weixin_43570367/article/details/106018731
desc: 这篇博客做了一个爱心飘落的动图,给妈妈送去节日的祝福!

title: 程序猿版:溢出吧,后浪
url: https://blog.csdn.net/loongggdroid/article/details/106030143
desc: 【回复“1024”,送你一个特别推送】那些口口声声一届不如一届的程序猿,应该看着你们像我一样我看着你们满怀羡慕计算机发展积攒了几十年的财富层出不穷的不断迭代的技术,框架,算法和遗留的祖传...

title: Intellij IDEA 美化指南
url: https://blog.csdn.net/qq_35067322/article/details/105852521
desc: 经常有人问我,你的 IDEA 配色哪里搞的,我会告诉他我自己改的。作为生产力工具,不但要顺手而且更要顺眼。这样才能快乐编码,甚至降低 BUG 率。上次分享了一些 IDEA 有用的插件,反...

......

Script download

https://gitee.com/harvey520/simpleReptile

Origin blog.csdn.net/yao1500/article/details/106091466