Python Web Crawler Study Notes: Crawler Basics, Part 1

Disclaimer: This is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/bowei026/article/details/90147540

A crawler works in three basic steps: fetching the web page, parsing its content, and storing the data.

Preparation

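Install the third-party libraries the crawler needs: requests for fetching pages, bs4 (BeautifulSoup) for parsing, and lxml, the parser that the code below passes to BeautifulSoup.

pip install requests

pip install bs4

pip install lxml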
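
As a quick check that the installation worked, the sketch below reports which of the packages can be imported. The helper name missing_packages is hypothetical, and lxml is included only because the parsing code later passes "lxml" to BeautifulSoup.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages used in these notes; lxml is the parser backend for BeautifulSoup.
missing = missing_packages(["requests", "bs4", "lxml"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages are installed")
```

Any name listed as missing can be installed with pip as shown above.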

Fetching the page

# coding: UTF-8
import requests

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
r = requests.get(link, headers=headers)
print(r.text)

Running the program prints the HTML source of the page.
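
The request above assumes the site responds normally. A slightly more defensive variant might look like the sketch below. The function name fetch_html, the 10-second timeout, and the shortened User-Agent string are assumptions made for illustration and are not part of the original code.

```python
import requests

# Assumed, shortened User-Agent; the browser string used above works as well.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; study-notes-crawler)"}


def fetch_html(url, timeout=10):
    """Return the page's HTML text, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html("http://www.santostang.com/") would return the same text as r.text above, while a failed request returns None instead of raising.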

Parsing the page content

# coding: UTF-8
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

This extracts the title of the first article on the page. Output:

Chapter 4 - 4.3 Crawling by simulating a browser with selenium
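
Note that soup.find returns None when nothing matches, so the chained .a.text would raise an AttributeError if the page layout changed. The sketch below collects every matching title with find_all and skips headings without a link. It runs on an inline sample whose markup is an assumption modeled on the h1.post-title structure used above, and it uses Python's built-in "html.parser" so that it works without lxml.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the h1.post-title structure used above.
SAMPLE_HTML = """
<html><body>
  <h1 class="post-title"><a href="/post/1">First post</a></h1>
  <h1 class="post-title"><a href="/post/2">  Second post </a></h1>
  <h1 class="post-title">Heading without a link</h1>
</body></html>
"""


def extract_titles(html):
    """Return the link text of every h1.post-title that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for heading in soup.find_all("h1", class_="post-title"):
        link = heading.find("a")
        if link is not None:  # skip headings that have no <a> child
            titles.append(link.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

On a page with no matching headings the function returns an empty list rather than failing.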

Storing data

# coding: UTF-8
import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()

# Write as UTF-8 so that non-ASCII titles are saved correctly
with open('d:/title.txt', 'w', encoding='utf-8') as f:
    f.write(title)

After running the program, open d:/title.txt. It contains the title of the first article on the page, namely "Chapter 4 - 4.3 Crawling by simulating a browser with selenium".
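
When more than one title is collected, a structured format is usually easier to reuse than a plain text file. The sketch below stores a list of titles as CSV using only the standard library. The function names save_titles and load_titles, the column names, and the sample titles are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def save_titles(titles, path):
    """Write titles to a UTF-8 CSV file, one row per title."""
    path = Path(path)
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "title"])  # header row
        for index, title in enumerate(titles, start=1):
            writer.writerow([index, title])
    return path


def load_titles(path):
    """Read the titles back from a file written by save_titles."""
    with Path(path).open(encoding="utf-8", newline="") as f:
        return [row["title"] for row in csv.DictReader(f)]


# Demo in a temporary directory so that no real path is assumed.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = save_titles(["First post", "Second post, part 2"],
                           Path(tmp_dir) / "titles.csv")
    print(load_titles(csv_path))
```

The csv module quotes fields that contain commas, so titles survive a round trip unchanged.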

This completes the walkthrough of the three basic steps of a Python crawler and the code that implements them.

