Crawler learning record

How to convert a string to a dictionary:

# Dict comprehension: turn a cookie string into a dict
cookies="anonymid=j3jxk555-nrn0wh; _r01_=1; _ga=GA1.2.1274811859.1497951251; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; [email protected]; depovince=BJ; jebecookies=54f5d0fd-9299-4bb4-801c-eefa4fd3012b|||||; JSESSIONID=abcI6TfWH4N4t_aWJnvdw; ick_login=4be198ce-1f9c-4eab-971d-48abfda70a50; p=0cbee3304bce1ede82a56e901916d0949; first_login_flag=1; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/20171230/1635/main_JQzq_ae7b0000a8791986.jpg; t=79bdd322e760beae79c0b511b8c92a6b9; societyguester=79bdd322e760beae79c0b511b8c92a6b9; id=327550029; xnsid=2ac9a5d8; loginfrom=syshome; ch_id=10016; wp_fold=0"
# split("=", 1) keeps values that themselves contain "="; skip malformed entries
cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ") if "=" in i}
# Note: dicts are unordered (insertion order is only guaranteed since Python 3.7)
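
A quick usage sketch: the resulting dict can be passed straight to requests to reuse the logged-in session (the profile URL below is a hypothetical placeholder):

import requests

# Hypothetical URL; the cookies keyword accepts the parsed dict directly
response = requests.get("http://www.renren.com/profile", cookies=cookies)
print(response.status_code)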

# List comprehension: build one URL per results page (pn steps by 50)
self.url_temp = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
return [self.url_temp.format(i * 50) for i in range(1000)]

requests usage:

Tip: on a response object such as response.text, noun-like names are usually attributes, while verb-like names are usually methods and need parentheses.
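
A small sketch of the distinction (the URL is a hypothetical placeholder):

import requests

response = requests.get("http://www.example.com")  # hypothetical URL
print(response.status_code)   # noun -> attribute, no parentheses
print(response.text[:100])    # noun -> attribute
response.raise_for_status()   # verb -> method, parentheses required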

Byte-type data (strings prefixed with b') is decoded with decode():
e.g. b'<!DOCTYPE html>\.... </html>\r\n'
response.content.decode()

# Ways to handle encoding/decoding in requests
response.content.decode()        # bytes -> str, utf-8 by default
response.content.decode("gbk")   # specify the encoding explicitly
response.text                    # requests guesses the encoding

import json
json.loads(response.content.decode())  # parse a JSON response body into a Python object

Data extraction methods

JSON data extraction

  • Strings in JSON are always enclosed in double quotes
    • If they are not double-quoted (see the sketch below):
      • eval: can perform simple string-to-Python-type conversion
      • replace: replace the single quotes with double quotes, then parse as JSON
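
A minimal sketch of both approaches (the raw string is a made-up example; eval should only ever be used on trusted input):

import json

raw = "{'name': 'tom', 'age': 18}"   # single quotes: not valid JSON

data1 = eval(raw)                           # quick conversion, but unsafe on untrusted input
data2 = json.loads(raw.replace("'", '"'))   # make it valid JSON first
print(data1, data2)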

json.loads() and json.dumps()

html_str = parse_url(url)  # parse_url: a helper (defined elsewhere) that fetches the URL and returns the body
# json.loads converts a JSON string into a Python type (here, a dict)
ret1 = json.loads(html_str)

# json.dumps converts a Python type into a JSON string
# ensure_ascii=False and indent=4 make the saved data more readable
# encoding="utf-8" on open() is also necessary
with open("douban.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(ret1, ensure_ascii=False, indent=4))
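
For the reverse direction, json.load works directly on a file object (assuming the douban.json written above):

import json

with open("douban.json", "r", encoding="utf-8") as f:
    ret2 = json.load(f)  # read the JSON file back into a Python dict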

Regular expressions

Common regex methods:
	re.compile (compile a pattern)
	pattern.match (match from the start of the string)
	pattern.search (find the first match anywhere)
	pattern.findall (find all matches)
	pattern.sub (substitute)


Points to note:

  • re.findall("a(.*?)b", "str") returns only the content captured by the parentheses; the text around the parentheses serves to position and filter the match
  • Raw strings (the r prefix): when the string to match contains backslashes, r avoids the escaping behaviour of the backslash
  • By default, . does not match \n (re.S changes this)
  • \s matches whitespace characters: not only spaces but also \t, \r and \n
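
A small sketch illustrating these points (the HTML snippet is made up):

import re

html = "<div>price: 12.5</div>\n<div>price: 99</div>"

# Parentheses capture only the group; the surrounding text just anchors the match
prices = re.findall(r"price: (\d+\.?\d*)", html)      # ['12.5', '99']

# re.S lets . match \n as well
blocks = re.findall(r"<div>(.*?)</div>", html, re.S)  # ['price: 12.5', 'price: 99']

# \s matches any whitespace: space, \t, \r, \n
parts = re.split(r"\s+", "a\tb\r\nc")                 # ['a', 'b', 'c']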

Xpath learning


  • XPath Helper and Chrome's "Copy XPath" extract data from the rendered elements, but the crawler gets the raw response for the URL, which often differs from those elements
  • Get text
    • a/text() gets the text directly under a
    • a//text() gets the text of all tags under a
    • //a[text()='下一页'] selects based on the text content
  • Get attributes with the @ symbol
    • /html/head/link/
    • //ul[@id="detail-list"]
  • //
    • At the start of an xpath, it selects from any position in the current html
    • li//a selects any a element at any depth under li (with /, you would have to descend level by level, which is more tedious)
Example:
//ul[@id="detail-list"]/li/a[@class='image share-url']/@href
# /@href extracts the link held in the href attribute of the tag

Use xpath in the code

Requires the lxml library.

  • Getting started:
    • Import etree from the lxml library
      from lxml import etree
    • Use etree.HTML to convert a string into an Element object
      html = etree.HTML(text)
    • The Element object has an xpath method
from lxml import etree
text = ''' <div> <ul> 
        <li class="item-1"><a>first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a>  
        </ul> </div> '''

html = etree.HTML(text)
print(html)
# View the HTML string contained in the Element object
print(etree.tostring(html).decode())

# Get the href of the a under each li with class item-1
ret1 = html.xpath("//li[@class='item-1']/a/@href")

# Get the text of the a under each li with class item-1
ret2 = html.xpath("//li[@class='item-1']/a/text()")

# Group by li tag, then write relative xpath against each group
ret3 = html.xpath("//li[@class='item-1']")
print(ret3)
for i in ret3:
    item = {}
    item["title"] = i.xpath("./a/text()")[0] if len(i.xpath("./a/text()"))>0 else None
    item["href"] = i.xpath("./a/@href")[0] if len( i.xpath("./a/@href"))>0 else None
    print(item)

Scrapy

Basic concept

  • The difference between asynchronous and non-blocking
    Asynchronous: after the call is issued, it returns immediately, regardless of the result. [synchronous/asynchronous describes the process]
    Non-blocking: the focus is on the program's state while waiting for the call result (message, return value); before the result is available, the call does not block the current thread. [blocking/non-blocking describes the state before the return value arrives: you need not wait idly and can do other things meanwhile]
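
A minimal sketch of non-blocking waiting using Python's asyncio (asyncio.sleep stands in for network I/O):

import asyncio

async def fetch(n):
    await asyncio.sleep(1)   # non-blocking wait: the thread is free meanwhile
    return f"page {n}"

async def main():
    # Both "requests" run concurrently; neither blocks the other
    results = await asyncio.gather(fetch(1), fetch(2))
    print(results)           # done in ~1s, not 2s

asyncio.run(main())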

Scrapy process

(figure: Scrapy data flow diagram)

Getting started with Scrapy

  1. Create a scrapy project
scrapy startproject [project name]
  2. Generate a spider
# cd into the project directory first
cd myproject
scrapy genspider [spider name] [allowed domain]

# Run the spider
scrapy crawl itcast
  3. Extract data:
    perfect the spider, using xpath and other methods (a minimal sketch follows below)
  4. Save the data:
    in the pipeline
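
A minimal spider sketch tying the steps together (the itcast name matches the crawl command above; the start URL and the xpath selectors are hypothetical placeholders):

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/"]  # hypothetical start URL

    def parse(self, response):
        # Extract data with xpath; yielded dicts are handed on to the pipeline
        for li in response.xpath("//li[@class='item-1']"):  # hypothetical selector
            item = {}
            item["title"] = li.xpath("./a/text()").get()
            item["href"] = li.xpath("./a/@href").get()
            yield item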
