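Notes on "Python Web Crawler: From Beginner to Practice" (《Python网络爬虫从入门到实践》)

Chapter 1: Getting started

1. The Python crawler workflow

1. Fetch the page  2. Parse the page (extract the data)  3. Store the data

Technical implementation: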

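Fetching pages. Basic: requests, urllib, selenium (simulates a browser). Advanced: multi-process and multi-threaded crawling, crawling behind a login, getting around IP bans, and crawling from a server.
Parsing pages. Basic: re (regular expressions), BeautifulSoup, and lxml. Advanced: fixing garbled Chinese text (encoding problems).
Storing data. Basic: writing to txt and csv files. Advanced: writing to MySQL and MongoDB databases.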
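
The three stages can be sketched end to end as follows. This is a minimal offline illustration, not the book's code: SAMPLE_HTML, get_page, parse_titles, save_titles, and the CSV layout are hypothetical, and get_page returns a hard-coded string where a real crawler would send an HTTP request.

```python
import csv
import re

# Hypothetical sample page standing in for a downloaded HTML document.
SAMPLE_HTML = """
<html><body>
<h1 class="post-title"><a href="/a">First post</a></h1>
<h1 class="post-title"><a href="/b">Second post</a></h1>
</body></html>
"""


def get_page():
    """Stage 1: fetch the page (stubbed with a fixed string so this runs offline)."""
    return SAMPLE_HTML


def parse_titles(html):
    """Stage 2: extract the link text inside each post-title heading with a regex."""
    pattern = r'<h1 class="post-title"><a[^>]*>(.*?)</a></h1>'
    return [title.strip() for title in re.findall(pattern, html)]


def save_titles(titles, path):
    """Stage 3: store one title per row in a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        for title in titles:
            writer.writerow([title])


titles = parse_titles(get_page())
print(titles)
```

Calling save_titles(titles, "titles.csv") would then write the result to disk. Keeping each stage in its own function means the stub in get_page can later be swapped for a real request without touching the parsing or storage code.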

Chapter 2: Python basics and a simple crawler

Basics:

List (list): an ordered sequence of items
Dictionary (dict): key-value pairs

# A dictionary maps each key to a value
namebook = {"Name": "Alex", "Age": 7, "Class": "First"}
# items() yields every (key, value) pair
for key, value in namebook.items():
    print(key, value)
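
As a rough sketch of the two data structures side by side (the classes list and the other variable names are made up for illustration):

```python
# A list keeps items in order and is indexed by position.
classes = ["First", "Second", "Third"]
classes.append("Fourth")       # add an item to the end
first_class = classes[0]       # indexing starts at 0

# A dictionary maps keys to values.
namebook = {"Name": "Alex", "Age": 7, "Class": "First"}
namebook["Age"] = 8            # update the value stored under a key
age = namebook.get("Age")      # look a value up by key

# items() yields (key, value) pairs in insertion order (Python 3.7+).
pairs = [f"{key}={value}" for key, value in namebook.items()]
print(first_class, age, pairs)
```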

The __init__() method is the class constructor. Note that the name has two underscores on each side.
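
A minimal sketch of __init__ in use. The Student class and its attributes are hypothetical and only serve to show when the method runs:

```python
class Student:
    """Hypothetical class used only to illustrate __init__."""

    def __init__(self, name, age):
        # Runs automatically when Student(...) is called.
        self.name = name
        self.age = age

    def describe(self):
        return f"{self.name} is {self.age} years old"


alex = Student("Alex", 7)   # __init__ receives "Alex" and 7
print(alex.describe())
```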

A simple crawler

Step 1: Fetch the page

#!/usr/bin/python
# coding: UTF-8

import requests

link = "http://www.santostang.com/"
# Browser-style User-Agent so the request looks like it comes from Firefox
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
print(r.text)

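The code above fetches the HTML of the blog's home page.

It first imports requests, then calls requests.get(link, headers=headers) to fetch the page.

The headers argument sets a browser User-Agent so that the request looks like it comes from a browser.

r is the Response object returned by requests.

r.text is the HTML source of the fetched page.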
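
Chapter 1 lists urllib as another basic way to fetch pages. For comparison, here is a rough standard-library sketch of the same idea, not the book's code: it builds the request with the same User-Agent header and inspects it without sending anything. The fetch helper and its timeout value are assumptions for illustration, and fetch needs network access when it is actually called.

```python
import urllib.request

link = "http://www.santostang.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}

# Build the request object; nothing is sent over the network yet.
req = urllib.request.Request(link, headers=headers)

# urllib stores header names capitalized as "User-agent".
user_agent = req.get_header("User-agent")
print(req.get_method(), req.full_url)


def fetch(request, timeout=10):
    """Send the request and return the decoded body (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch(req) plays the role of r.text in the requests version, except that the decoding step is explicit.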

Step 2: Extract the data you need

#!/usr/bin/python
# coding: UTF-8

import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 library

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")  # parse the HTML with BeautifulSoup
# Text of the link inside the first <h1 class="post-title">
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

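With the HTML in hand, the next step is to extract the title of the first article from the page.

The BeautifulSoup library parses the downloaded page.

Import the library first, then convert the HTML into a soup object.

soup.find("h1", class_="post-title").a.text.strip() returns the title: find locates the first h1 tag with the class post-title, .a selects the link inside it, .text takes the link text, and strip() removes the surrounding whitespace.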
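
To make the lookup chain concrete, here is a standard-library sketch of the same extraction using html.parser instead of BeautifulSoup. It is a swapped-in technique for illustration only, not how BeautifulSoup works internally. FirstTitleParser, first_title, and sample_html are hypothetical names. One point carries over to the original code: soup.find returns None when nothing matches, so the chained .a.text would raise an AttributeError on a page without that heading, and first_title returns None in the same situation.

```python
from html.parser import HTMLParser


class FirstTitleParser(HTMLParser):
    """Collects the text of the first <a> inside <h1 class="post-title">."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.in_link = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # only the first match is wanted
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "post-title" in classes:
            self.in_h1 = True
        elif tag == "a" and self.in_h1:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            self.done = True
        elif tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_link:
            self.parts.append(data)


def first_title(html):
    """Return the stripped title text, or None when no title is found."""
    parser = FirstTitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.done else None


sample_html = (
    '<h1 class="post-title"><a href="/p/1">\n  Hello crawler \n</a></h1>'
    '<h1 class="post-title"><a href="/p/2">Second</a></h1>'
)
print(first_title(sample_html))
```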

Step 3: Store the data

#!/usr/bin/python
# coding: UTF-8

import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 library

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")  # parse the HTML with BeautifulSoup
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

# Append the title to title.txt; the with block closes the file on exit,
# so no explicit f.close() call is needed
with open('title.txt', "a+", encoding="utf-8") as f:
    f.write(title)
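
A small offline sketch of the storage step on its own. The helper names append_title and read_titles and the sample titles are made up for illustration. This variant writes each title on its own line so that repeated runs stay readable, and it works in a temporary directory so no real file is touched.

```python
import os
import tempfile


def append_title(path, title):
    """Append one title per line; "a" mode creates the file if it is missing."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")
    # The with block has already closed the file at this point.


def read_titles(path):
    """Read the stored titles back as a list of lines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


# Demonstrate in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "title.txt")
append_title(demo_path, "First title")
append_title(demo_path, "Second title")
saved = read_titles(demo_path)
print(saved)
```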
