Python Web Scraping Basics Tutorial

The Web Scraping Workflow

  1. Choose a URL to scrape
  2. Open that URL from Python (e.g. with urlopen)
  3. Read the page content
  4. Feed the content into BeautifulSoup
  5. Use BeautifulSoup to select tags and extract information (in place of regular expressions); see the sketch below
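
A minimal sketch that ties the five steps together, using the demo page that appears later in this tutorial:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# steps 1-3: choose a URL, open it, read and decode the response
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')

# step 4: hand the raw HTML to BeautifulSoup
soup = BeautifulSoup(html, features='lxml')

# step 5: select tags instead of writing regular expressions
print(soup.title.get_text())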

Understanding Web Page Structure

  • <html>, </html>: wrap the whole document (start and end)
  • <head>, </head>: the header; its contents are not displayed on the page
  • <body>, </body>: the main, visible content
<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>
  • Matching content in the page source with Python
from urllib.request import urlopen

# fetch the demo page and decode the response bytes as UTF-8
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)


import re

# the non-greedy group (.+?) captures just the text between the tags
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets '.' match newlines, so multi-line <p> blocks are caught
print("\nPage paragraph is: ", res[0])

res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)

Parsing Web Pages with BeautifulSoup: Basics

  • Installation (on Windows): pip install beautifulsoup4 (the lxml parser used below also requires pip install lxml)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
# print(html)

soup = BeautifulSoup(html, features='lxml')
# print(soup.h1)
# print('\n', soup.p)

all_href = soup.find_all('a')    # every <a> tag on the page
for link in all_href:
    print(link['href'])

(Output: the href of every <a> tag on the page)

Parsing Web Pages with BeautifulSoup: CSS


from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(
    "https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')
# print(html)

soup = BeautifulSoup(html, features='lxml')

# month = soup.find_all('li', {'class': 'month'})
# for m in month:
#     print(m.get_text())

# find the <ul> whose class is 'jan', then print each item inside it
jan = soup.find('ul', {'class': 'jan'})
print(jan)
d_jan = jan.find_all('li')
for d in d_jan:
    print(d.get_text())

(Output: the text of each <li> in the 'jan' list)
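
find and find_all with an attribute dict are one way to query by class; BeautifulSoup also accepts CSS selectors directly through select. A minimal equivalent of the query above, reusing the same soup object:

# CSS selector: every <li> inside a <ul> with class "jan"
for d in soup.select('ul.jan li'):
    print(d.get_text())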

Parsing Web Pages with BeautifulSoup: Regular Expressions

Crawling Baidu Baike


from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

base_url = "https://baike.baidu.com"    # no trailing slash: the hrefs in his already start with '/'
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

# take a 3-step random walk through Baidu Baike, starting from the seed page
for i in range(3):
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls: links to other percent-encoded /item/ pages
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()
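
To see what that href filter accepts, here is a quick standalone check of the pattern; the two test strings are made-up examples:

import re

# "/item/" followed by one or more percent-encoded bytes (% plus two characters),
# with nothing after them
pattern = re.compile("/item/(%.{2})+$")
print(bool(pattern.search("/item/%E7%88%AC%E8%99%AB")))  # True: percent-encoded title
print(bool(pattern.search("/item/Python/123")))          # False: plain path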

POST, Login, and Cookies (Requests)

  • When a web page is loaded, the request can be of several types, and the type is the key to how the page is opened. The most important methods are GET and POST (there are of course others, such as HEAD and DELETE). Readers new to web architecture may find this confusing: how do these request methods differ, and what is each one for?

  • We will cover the two important ones, GET and POST; 95% of the time you will be using one of these two to request a page.

  • POST
    logging in to an account
    submitting search content
    uploading an image
    uploading a file
    sending data to the server, etc.

  • GET
    opening a page normally
    sends no data to the server

import requests
import webbrowser

# GET with query parameters: requests encodes them into the URL
# param = {"wd": "莫烦Python"}
# r = requests.get('https://www.baidu.com/s', params=param)
# print(r.url)
# webbrowser.open(r.url)    # optionally open the resulting URL in a browser

# POST form data: the fields travel in the request body, not in the URL
# data = {'firstname': 'Guosheng', 'lastname': 'Zhang'}
# r = requests.post('http://pythonscraping.com/files/processing.php', data=data)
# print(r.text)

# POST a file upload ('./image.png' must exist locally)
# file = {'uploadFile': open('./image.png', 'rb')}
# r = requests.post('http://pythonscraping.com/files/processing2.php', files=file)
# print(r.text)
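
The heading also promises login and cookies, which the snippets above do not cover. A minimal sketch, assuming the demo login pages from the Web Scraping with Python examples at pythonscraping.com are still reachable; the username and password values are placeholders:

import requests

payload = {'username': 'Morvan', 'password': 'password'}

# POST the login form; the server sets a session cookie in the response
r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())

# send the cookie back to reach a login-protected page
r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
print(r.text)

# or let a Session carry cookies across requests automatically
session = requests.Session()
session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
r = session.get('http://pythonscraping.com/pages/cookies/profile.php')
print(r.text)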


Reposted from blog.csdn.net/weixin_43488958/article/details/104347096