Suppose we want to extract all the link addresses from a web page. We can do this with a Python crawler.
Approach
- Determine the entry link to crawl
- Build a regular expression for link extraction according to the requirements
- Pretend to be a browser and fetch the corresponding web page
- Extract the links from the page using the regular expression from step 2
- Filter out duplicate links
- Perform follow-up operations, such as printing the links
Step 1: Entry link
We use a personal blog URL as the entry point:
https://blog.csdn.net/KOBEYU652453?spm=1001.2101.3001.5343
Step 2: Define the regular expression
An example link from the page source:
href="https://blog.csdn.net/kobeyu652453/article/details/106355922
For the basics of regular expressions, see the tutorial: Python re module basic usage.
So we can define the regex pattern:
pat = r'(https?://[^\s)";]+\.(\w|/)*)'  # [^\s)";] matches any character that is not whitespace, ), " or ;  \w matches a letter, digit, or underscore; * means zero or more
Because some URLs use http rather than https, the ? after the s makes that character optional (it matches zero or one time).
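To sanity-check the pattern before crawling, here is a minimal sketch run against a made-up HTML snippet (the sample string is hypothetical, not taken from the blog page):

import re

pat = r'(https?://[^\s)";]+\.(\w|/)*)'
# Hypothetical snippet containing one http link and one https link
sample = 'href="http://example.com/a.html" href="https://example.com/b/index.html"'

# findall returns one tuple per match because the pattern has two
# capturing groups; the full URL is the first element of each tuple
for match in re.findall(pat, sample):
    print(match[0])
# http://example.com/a.html
# https://example.com/b/index.html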
Full code
import re
import urllib.request


def getlink(url):
    # Pretend to be a browser by sending a normal User-Agent header
    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally
    urllib.request.install_opener(opener)
    url_request = urllib.request.Request(url)
    html1 = urllib.request.urlopen(url_request, timeout=10)
    # Decode the response bytes into text
    data = html1.read().decode('utf-8', errors='ignore')
    # Define the regular expression as needed
    pat = r'(https?://[^\s)";]+\.(\w|/)*)'  # [^\s)";] any character that is not whitespace, ), " or ;  \w a letter, digit, or underscore; * zero or more
    link = re.compile(pat).findall(data)
    # Remove duplicate elements
    link = set(link)
    return link


url = 'https://blog.csdn.net/KOBEYU652453?spm=1001.2101.3001.5343'
linklist = getlink(url)
for link in linklist:
    print(link[0])  # each match is a tuple; element 0 is the full URL
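Note that findall returns tuples here because the pattern contains two capturing groups, which is why the loop prints link[0]. If the inner group is made non-capturing with (?: ) and the outer parentheses are dropped, findall returns the URL strings directly. A minimal variant, assuming data holds the decoded page text from getlink:

import re

# Same pattern, with the inner group made non-capturing and the
# outer parentheses removed, so findall returns plain strings
pat = r'https?://[^\s)";]+\.(?:\w|/)*'
links = set(re.findall(pat, data))
for link in links:
    print(link)  # no link[0] needed here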
Author: Yu Dengwu (Electrical Engineering)