Web crawler in practice: a link crawler

Suppose we want to extract all the link addresses from a web page; we can do this with a Python crawler.
Ideas

  1. Determine the entry link to be crawled
  2. Build a regular expression for link extraction according to the requirements
  3. Simulate a browser and fetch the corresponding web page
  4. Extract the links from the page using the regular expression from step 2
  5. Filter out duplicate links
  6. Perform follow-up operations, such as printing the links

Step 1: Entry link
Personal blog

URL

https://blog.csdn.net/KOBEYU652453?spm=1001.2101.3001.5343

Step 2: Define the regular expression

Link example

 href="https://blog.csdn.net/kobeyu652453/article/details/106355922

Regular expression usage tutorial: Python re module basic usage

So we can define the regular expression:

pat = r'(https?://[^\s)";]+\.(\w|/)*)'  # [^\s)";] matches any character that is not whitespace, ), " or ;  \w matches any letter or digit, * means zero or more

Because some URLs use http rather than https, we add a ? after the s to make it optional.
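As a quick check, here is a minimal sketch (the sample href strings are made up for illustration) showing that the ? after the s lets the pattern match both http and https links:

import re

# Hypothetical sample text containing one http and one https link
sample = 'href="http://example.com/a.html" href="https://example.com/b.html"'

pat = r'(https?://[^\s)";]+\.(\w|/)*)'  # ? makes the preceding "s" optional
for match in re.compile(pat).findall(sample):
    # findall returns a tuple per match because the pattern has two groups;
    # index 0 is the full URL captured by the outer group
    print(match[0])
# Expected output:
# http://example.com/a.html
# https://example.com/b.html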

Full code

import re
import urllib.request

def getlink(url):
    # Pretend to be a browser by sending a User-Agent header
    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally
    urllib.request.install_opener(opener)

    url_request = urllib.request.Request(url)
    html1 = urllib.request.urlopen(url_request, timeout=10)
    data = str(html1.read())
    # Define the regular expression as needed
    pat = r'(https?://[^\s)";]+\.(\w|/)*)'  # [^\s)";] any character that is not whitespace, ), " or ;  \w any letter or digit, * zero or more
    link = re.compile(pat).findall(data)
    # Remove duplicate elements
    link = set(link)
    return link

url = 'https://blog.csdn.net/KOBEYU652453?spm=1001.2101.3001.5343'
linklist = getlink(url)
for link in linklist:
    print(link[0])  # each match is a tuple of groups; index 0 is the full URL
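One possible follow-up operation (step 6 of the approach above) is to filter the deduplicated results before printing. A minimal sketch, continuing from linklist above; restricting to blog.csdn.net is just an illustrative assumption:

# Keep only links that point back to the same site (illustrative filter)
csdn_links = sorted(link[0] for link in linklist if "blog.csdn.net" in link[0])
for address in csdn_links:
    print(address)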

Author: Yu Dengwu (Electrical Engineering)


Origin: blog.csdn.net/kobeyu652453/article/details/112740874