Day 2: Web Crawlers

Review

  • Problems:
    • IP gets blocked: use a proxy
    • Request parameter problems:
      • Dynamically changing request parameters
      • Encrypted request parameters
    • Problems with response data:
      • cookie
      • Request parameters
    • Encryption:
      • Reverse-engineer the JS
  • Key points
    • Dynamic parameters
      • data / params
    • Anti-crawling mechanisms:
      • robots.txt
      • UA detection
      • Dynamically loaded data
        • How to detect whether data is dynamically loaded (see the sketch after this list)
        • How to capture dynamically loaded data
      • How is dynamically loaded data generated?
        • ajax
        • js
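
A quick way to answer the "how to detect" question above is to fetch the raw page source with requests and check whether the data you can see in the browser actually appears in it. The sketch below is only an illustration: the url, headers, and keyword are placeholders, not part of the original notes.

    import requests

    # placeholder url and UA, for illustration only
    url = 'https://www.example.com/list'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    page_text = requests.get(url=url, headers=headers).text

    # if text that is visible in the browser is missing from the raw source,
    # the data is loaded dynamically (ajax/js): capture the real request
    # (usually an XHR returning JSON) from the browser's network panel instead
    if 'some text visible in the browser' in page_text:
        print('data is in the static page source')
    else:
        print('data is dynamically loaded -- look for the ajax/js request')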

Data parsing

  • Role: implements the focused crawler
  • Ways to implement it:
    • Regular expressions
    • bs4: key
    • xpath: key
    • pyquery: self-study
  • What is the general principle of data parsing?
    • The data to be parsed is stored in the HTML page
      • in the text of a tag
      • in the attribute values of a tag
    • Principle:
      • Locate the tag
      • Extract its text or its attribute value
  • Crawler workflow (a minimal sketch follows this list)
    • Specify the url
    • Send the request
    • Fetch the response data
    • Parse the data
    • Persistent storage
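
A minimal end-to-end sketch of the five steps above, assuming a placeholder url; the parsing step is only a naive stand-in for the regex / bs4 / xpath techniques covered below.

    import requests

    # 1. specify the url (placeholder)
    url = 'https://www.example.com/'
    # 2. send the request (with UA masquerading) and 3. fetch the response data
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    page_text = requests.get(url=url, headers=headers).text
    # 4. parse the data -- naive stand-in for regex / bs4 / xpath
    title = page_text.split('<title>')[-1].split('</title>')[0]
    # 5. persistent storage
    with open('./page_title.txt', 'w', encoding='utf-8') as fp:
        fp.write(title)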

In [2]:

import requests
import re
headers = {
    # request header used for UA masquerading
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

Regular expression parsing

Single characters:

    .  : any character except newline
    [] : character set, e.g. [aoe] or [a-w] matches any one character from the set
    \d : digit [0-9]
    \D : non-digit
    \w : digit, letter, underscore, or Chinese character
    \W : non-\w
    \s : any whitespace character, including spaces, tabs, form feeds, etc.; equivalent to [ \f\n\r\t\v]
    \S : non-whitespace

Quantifiers:

    *     : any number of times, >= 0
    +     : at least once, >= 1
    ?     : optional, 0 or 1 time
    {m}   : exactly m times, e.g. hello{3}
    {m,}  : at least m times
    {m,n} : m to n times

Boundaries:
    $ : ends with ...
    ^ : starts with ...

Groups:
    (ab)

Greedy mode: .*
Non-greedy (lazy) mode: .*?

re.I : ignore case
re.M : multi-line matching
re.S : single-line mode (. also matches newline)

re.sub(pattern, replacement, string)
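
A short self-contained demo of the modifiers, greedy vs. non-greedy matching, and re.sub listed above; the strings are made up for illustration.

    import re

    text = 'Python python PYTHON'
    print(re.findall('python', text, re.I))      # re.I ignores case -> 3 matches

    html = '<div>a</div><div>b</div>'
    print(re.findall('<div>.*</div>', html))     # greedy: one match spanning both divs
    print(re.findall('<div>(.*?)</div>', html))  # non-greedy: ['a', 'b']

    print(re.sub(r'\d+', '*', 'tel: 110, 120'))  # re.sub(pattern, replacement, string) -> 'tel: *, *'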

In [ ]:

import re
# extract 'python'
key="javapythonc++php"
#####################################################################
# extract 'hello world'
key="<html><h1>hello world<h1></html>"
#####################################################################
# extract '170'
string = '我喜欢身高为170的女孩'
#####################################################################
# extract 'http://' and 'https://'
key='http://www.baidu.com and https://boob.com'
#####################################################################
# extract 'hit.'
key='[email protected]'# want to match 'hit.'
#####################################################################
# match 'sas' and 'saas'
key='saas and sas and saaas'
#####################################################################

In [3]:

key="javapythonc++php"
re.findall('python',key)

Out[3]:

['python']

In [4]:

# extract 'hello world'
key="<html><h1>hello world<h1></html>"
re.findall('<h1>(.*?)<h1>',key)

Out[4]:

['hello world']

In [6]:

# extract '170'
string = '我喜欢身高为170的女孩'
re.findall(r'\d+',string)

Out[6]:

['170']

In [11]:

# extract 'http://' and 'https://'
key='http://www.baidu.com and https://boob.com'
re.findall('https?://',key)

Out[11]:

['http://', 'https://']

In [13]:

# extract 'hit.'
key='[email protected]'# want to match 'hit.'
re.findall(r'h.*?\.',key)

Out[13]:

['hit.']

In [14]:

# match 'sas' and 'saas'
key='saas and sas and saaas'
re.findall('sa{1,2}s',key)

Out[14]:

['saas', 'sas']

In [16]:

url = 'http://duanziwang.com/category/搞笑图/'
# the captured response data is a string
page_text = requests.get(url=url,headers=headers).text

# data parsing
ex = r'<div class="post-head">.*?<a href="http://duanziwang.com/\d+\.html">(.*?)</a></h1>'
re.findall(ex,page_text,re.S)# re.S is a must when using regex to parse crawled page source

# persistent storage

Out[16]:

['比较下老婆和老妈,一比吓一跳_段子网收录最新段子',
 '搞笑夫妻:烦恼里面找快乐_段子网收录最新段子',
 '夫妻界的搞笑奇葩_段子网收录最新段子',
 '超囧冷人小夫妻_段子网收录最新段子',
 '雷人夫妻:吃泡面、偷看日记和生日送花_段子网收录最新段子',
 '脸皮薄的人难以成功_段子网收录最新段子',
 '12秒的雷政富和21秒李小璐_段子网收录最新段子',
 '从前有只麋鹿,它在森林里玩儿,不小心走丢了。于是它给它的好朋友长颈鹿打电话:“喂…我迷路啦。”长颈鹿听见了回答说:“喂~我长颈鹿啦~”_段子网收录最新段子',
 '最萌挡车球_段子网收录最新段子',
 '再高贵冷艳的喵星人! 也总有一天有被吓得屁滚尿流的时候。_段子网收录最新段子']

bs4 parsing

  • Environment setup:
    • pip install bs4
    • pip install lxml
  • Parsing principle
    • 1. Instantiate a BeautifulSoup object and load the page source to be parsed into that object
    • 2. Call the BeautifulSoup object's properties and methods to locate tags and extract text or attribute data
  • How to instantiate BeautifulSoup
    • Mode 1: BeautifulSoup(fp,'lxml'): parses an HTML file stored locally
    • Mode 2: BeautifulSoup(page_text,'lxml'): parses data crawled directly from the Internet
  • Tag locating (a self-contained demo follows this list)
    • soup.tagName: locates the first occurrence of the tag tagName
    • Attribute-based locating: locates the tag with the specified attribute
      • soup.find('tagName',attrName='value')
      • soup.find_all('tagName',attrName='value')
    • Selector-based locating:
      • soup.select('selector')
      • Hierarchy selectors:
        • Greater-than sign (>): one level
        • Space: multiple levels
  • Extracting text
    • tag.string: only the tag's direct text
    • tag.text: all text under the tag
  • Extracting attributes
    • tag['attrName']
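
The cells below rely on a local test.html that is not included in these notes, so here is a self-contained sketch with made-up HTML that exercises each locating and extraction form listed above.

    from bs4 import BeautifulSoup

    # made-up HTML, only for illustrating the API above
    html = '''
    <div class="tang">
        <ul>
            <li><a href="http://www.haha.com" id="feng">qingming</a></li>
            <li><a href="http://www.xixi.com">dufu</a></li>
        </ul>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')

    print(soup.a)                              # first occurrence of the <a> tag
    print(soup.find('a', id='feng'))           # attribute-based locating
    print(soup.select('.tang > ul > li'))      # ">" : one level
    print(soup.select('.tang a'))              # space: multiple levels
    print(soup.a.string)                       # direct text of the tag
    print(soup.div.text)                       # all text under the tag
    print(soup.find('a', id='feng')['href'])   # attribute value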

In [18]:

from bs4 import BeautifulSoup
fp = open('./test.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
soup

In [28]:

soup.div
# soup.find('div',class_='song')
# soup.find('a',id='feng')
# soup.find_all('div',class_='song')
# soup.select('.tang')
# soup.select('#feng')
soup.select('.tang li')

In [38]:

soup.p.string
soup.p.text
soup.find('div',class_='tang').text
a_tag = soup.select('#feng')[0]
a_tag['href']

Out[38]:

'http://www.haha.com'

  • Crawl the full text of the novel Romance of the Three Kingdoms and store it persistently
    • http://www.shicimingju.com/book/sanguoyanyi.html

In [44]:

main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url,headers=headers).text

fp = open('./sanguo.txt','w',encoding='utf-8')

# data parsing
soup = BeautifulSoup(page_text,'lxml')
# parse out each chapter title & the url of its detail page
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string
    detail_url = 'http://www.shicimingju.com'+a['href']

    # fetch the chapter content
    page_text_detail = requests.get(detail_url,headers=headers).text# page source of the detail page
    # data parsing: chapter content
    detail_soup = BeautifulSoup(page_text_detail,'lxml')
    div_tag = detail_soup.find('div',class_='chapter_content')

    content = div_tag.text

    fp.write(title+':'+content+'\n')
    print(title,'downloaded successfully!!!')
fp.close()

  • Crawling image data
    • Based on requests
    • Based on urllib
    • Difference: the urllib approach (urlretrieve) cannot perform UA masquerading (a workaround sketch follows the two cells below)

In [47]:

# based on requests
url = 'http://pic.netbian.com/uploads/allimg/190902/152344-1567409024af8c.jpg'
img_data = requests.get(url=url,headers=headers).content# .content returns binary (bytes) data
with open('./123.png','wb') as fp:
    fp.write(img_data)

In [49]:

# based on urllib
from urllib import request
request.urlretrieve(url=url,filename='./456.jpg')

Out[49]:

('./456.jpg', <http.client.HTTPMessage at 0x217a8af3470>)
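
As noted above, urlretrieve itself cannot send custom headers. A minimal workaround sketch (not part of the original notebook) is to build the request by hand with urllib.request.Request, reusing the url and headers defined earlier, and write the bytes yourself; the file name './789.jpg' is arbitrary.

    from urllib import request

    # attach the UA via a Request object, then save the binary data manually
    req = request.Request(url=url, headers=headers)
    img_data = request.urlopen(req).read()  # binary image data
    with open('./789.jpg', 'wb') as fp:
        fp.write(img_data)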

xpath parsing

  • Environment setup:
    • pip install lxml
  • Parsing principle
    • Instantiate an etree object and load the page source to be parsed into that object
    • Call the etree object's xpath method with different xpath expressions to locate tags and extract data
  • How to instantiate the object:
    • etree.parse('filePath'): loads an HTML file stored locally into the instantiated etree object
    • etree.HTML(page_text)
  • xpath expressions (a self-contained demo follows this list)
    • Tag locating:
      • Leftmost /: locating must start from the root tag (almost never used)
      • Non-leftmost /: one level
      • Leftmost //: the tag can be located starting from any position (most common)
      • Non-leftmost //: multiple levels
      • Attribute-based locating: //tagName[@attrName="value"]
      • Index-based locating: //tagName[index], where the index starts from 1
      • //div[contains(@class, "ng")]
      • //div[starts-with(@class, "ta")]
    • Extracting text
      • /text(): only the tag's direct text
      • //text(): all text under the tag
    • Extracting attributes
      • /@attrName
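
As with bs4, the local test.html used in the next cells is not included in these notes, so here is a self-contained sketch with made-up HTML that runs each expression form listed above through etree.HTML.

    from lxml import etree

    # made-up HTML, only for illustrating the xpath forms above
    html = '''
    <div class="tang">
        <ul>
            <li><a href="http://www.haha.com" id="feng">qingming</a></li>
            <li><a href="http://www.xixi.com">dufu</a></li>
        </ul>
    </div>
    '''
    tree = etree.HTML(html)

    print(tree.xpath('//div[@class="tang"]//a'))        # leftmost // : start from anywhere
    print(tree.xpath('//li[1]/a/text()'))                # index starts from 1; /text() = direct text
    print(tree.xpath('//div[@class="tang"]//text()'))    # //text() = all text
    print(tree.xpath('//a[@id="feng"]/@href'))           # /@attrName = attribute value
    print(tree.xpath('//a[starts-with(@id, "fe")]'))     # starts-with()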

In [76]:

from lxml import etree
tree = etree.parse('./test.html')
tree.xpath('/html/head/meta')
tree.xpath('/html//meta')
tree.xpath('//meta')
tree.xpath('/meta')#error
tree.xpath('//p')
tree.xpath('//div[@class="tang"]')
tree.xpath('//li[1]')
tree.xpath('//a[@id="feng"]/text()')[0]
tree.xpath('//div[@class="tang"]//text()')
tree.xpath('//a[@id="feng"]/@href')

Out[76]:

['http://www.haha.com']

  • Requirement: use xpath to parse the image addresses and names from the page, then download the images and save them locally
    • http://pic.netbian.com/4kmeinv/

Key point: in local (partial) parsing, ./ refers to the current tag, i.e. the tag on which xpath() is called.

In [86]:

import os

In [88]:

url = 'http://pic.netbian.com/4kmeinv/'
page_text = requests.get(url=url,headers=headers).text

dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)

# data parsing
tree = etree.HTML(page_text)
# xpath here parses against the entire page source
img_list = tree.xpath('//div[@class="slist"]/ul/li/a/img')

# local (partial) data parsing
for img in img_list:
    # ./ refers to the current tag; the caller of xpath() is the current tag
    img_src = 'http://pic.netbian.com'+img.xpath('./@src')[0]
    img_name = img.xpath('./@alt')[0]+'.jpg'
    # fix mojibake: the page is gbk-encoded but requests decoded it as iso-8859-1
    img_name = img_name.encode('iso-8859-1').decode('gbk')

    filePath = './'+dirName+'/'+img_name
    request.urlretrieve(img_src,filename=filePath)
    print(img_name,'downloaded successfully!!!')

In [90]:

# crawl the whole site (all pages)
# 1. define a generic url template used to build the url of each page; the template itself does not change
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)

for pageNum in range(1,6):
    if pageNum == 1:
        page_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        page_url = url%pageNum

    page_text = requests.get(url=page_url,headers=headers).text

    # data parsing
    tree = etree.HTML(page_text)
    # xpath here parses against the entire page source
    img_list = tree.xpath('//div[@class="slist"]/ul/li/a/img')

    # local (partial) data parsing
    for img in img_list:
        # ./ refers to the current tag; the caller of xpath() is the current tag
        img_src = 'http://pic.netbian.com'+img.xpath('./@src')[0]
        img_name = img.xpath('./@alt')[0]+'.jpg'
        # fix mojibake: the page is gbk-encoded but requests decoded it as iso-8859-1
        img_name = img_name.encode('iso-8859-1').decode('gbk')

        filePath = './'+dirName+'/'+img_name
        request.urlretrieve(img_src,filename=filePath)
        print(img_name,'downloaded successfully!!!')
