How to convert a string to a dictionary:
# Dict comprehension
cookies="anonymid=j3jxk555-nrn0wh; _r01_=1; _ga=GA1.2.1274811859.1497951251; _de=BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5; [email protected]; depovince=BJ; jebecookies=54f5d0fd-9299-4bb4-801c-eefa4fd3012b|||||; JSESSIONID=abcI6TfWH4N4t_aWJnvdw; ick_login=4be198ce-1f9c-4eab-971d-48abfda70a50; p=0cbee3304bce1ede82a56e901916d0949; first_login_flag=1; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/20171230/1635/main_JQzq_ae7b0000a8791986.jpg; t=79bdd322e760beae79c0b511b8c92a6b9; societyguester=79bdd322e760beae79c0b511b8c92a6b9; id=327550029; xnsid=2ac9a5d8; loginfrom=syshome; ch_id=10016; wp_fold=0"
cookies = {
i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ")}  # split("=", 1) keeps any "=" inside a value intact
# dicts are unordered (note: since Python 3.7 they preserve insertion order)
# List comprehension
self.url_temp = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
return [self.url_temp.format(i * 50) for i in range(1000)]
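The two lines above come from inside a spider class; here is a self-contained sketch of the same list comprehension (the function name and the sample tieba_name are illustrative):

```python
def get_url_list(tieba_name):
    # Build the page-url template, then fill in the pn offset for each page
    url_temp = "https://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
    # Tieba shows 50 posts per page, so page i starts at offset i * 50
    return [url_temp.format(i * 50) for i in range(1000)]

urls = get_url_list("python")
```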
requests usage:
tip: on the response object, nouns are generally attributes and verbs are generally methods, which need parentheses
bytes data (strings prefixed with b'), decode it with decode():
e.g. b'<!DOCTYPE html>\.... </html>\r\n'
response.content.decode()
# Ways to handle encoding/decoding in requests
response.content.decode()
response.content.decode("gbk")
response.text
import json
json.loads(response.content.decode())
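A minimal offline sketch of the decode patterns above; the bytes literals stand in for response.content, so no live request is made:

```python
import json

# Stand-in for response.content (requests returns bytes here)
content = '{"name": "张三"}'.encode("utf-8")

text = content.decode()              # decode() defaults to utf-8
gbk_bytes = "中文".encode("gbk")
gbk_text = gbk_bytes.decode("gbk")   # pass the encoding when it is not utf-8
data = json.loads(content.decode())  # decode, then parse the json
```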
Data extraction method
json data extraction
- The strings in json are all enclosed in double quotes
- If not double quotes:
- eval: can do simple string-to-Python-type conversion
- replace: replace the single quotes with double quotes
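A small sketch of the two workarounds above; ast.literal_eval is used here as the safe form of eval:

```python
import ast
import json

s = "{'a': 1}"  # single quotes: not valid json
try:
    json.loads(s)
except json.JSONDecodeError:
    pass  # json.loads rejects single-quoted strings

d1 = ast.literal_eval(s)              # eval-style conversion to a Python dict
d2 = json.loads(s.replace("'", '"'))  # replace the quotes, then parse as json
```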
json.loads() and json.dumps()
html_str = parse_url(url)
# json.loads converts a json string to a Python type (dict)
ret1 = json.loads(html_str)
# json.dumps converts a Python type to a json string
# ensure_ascii=False sets the encoding and indent=4 adds line breaks and indentation, both to make the data easier to read
# encoding="utf-8" is also necessary when opening the file
with open("douban.json","w",encoding="utf-8") as f:
f.write(json.dumps(ret1, ensure_ascii=False, indent=4))
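A runnable sketch (the sample dict is illustrative) showing what ensure_ascii and indent change:

```python
import json

ret1 = {"title": "豆瓣"}

escaped = json.dumps(ret1)                                 # non-ASCII is escaped: \u8c46\u74e3
readable = json.dumps(ret1, ensure_ascii=False, indent=4)  # keeps 豆瓣 readable, adds newlines/indent

# Both forms round-trip back to the same dict
assert json.loads(escaped) == json.loads(readable) == ret1
```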
Regular expression
Common regex methods:
re.compile (compile a pattern)
pattern.match (match one from the start)
pattern.search (find one anywhere)
pattern.findall (find all)
pattern.sub (substitute)
Points to note:
re.findall("a(.*?)b", "str")
- Returns the content inside the parentheses; the content before and after the parentheses only serves to position and filter
- Raw string r: when the string to match contains backslashes, r avoids the escape effect of the backslash
- The dot does not match \n by default
- \s matches whitespace, not only spaces but also \t \r \n
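These points can be checked with a short snippet (the sample strings are illustrative):

```python
import re

html = "a<b>first</b>\na<b>second</b>"
# The parentheses decide what findall returns; re.S lets . match \n as well
pattern = re.compile(r"a<b>(.*?)</b>", re.S)
titles = pattern.findall(html)

# \s matches any whitespace character, not only the space
spaces = re.findall(r"\s", "a b\tc\n")
```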
Xpath learning
- The data shown by xpath helper or "copy xpath" in Chrome is extracted from the rendered elements, but the crawler gets the raw response for the url, which often differs from those elements
- Get text
a/text()  gets the text directly under a
a//text()  gets the text contained in all tags under a
//a[text()='下一页']  selects a based on its text ('下一页' means "next page")
- Get attributes
the @ symbol
- /html/head/link/
//ul[@id="detail-list"]
//
- At the beginning of an xpath, means select from any position in the current html
li//a
Means an a tag at any depth under li (with / alone you must descend one level at a time, which is more tedious)
Example:
//ul[@id="detail-list"]/li/a[@class='image share-url']/@href
# /@href extracts the link (href attribute) from the tag
Use xpath in the code
Requires the lxml library
- Getting started:
- Import etree from the lxml library: from lxml import etree
- Use etree.HTML to convert a string into an Element object: html = etree.HTML(text)
- The Element object has an xpath method
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
html = etree.HTML(text)
print(html)
# Inspect the string contained in the Element object
print(etree.tostring(html).decode())
# Get the href of the a under each li whose class is item-1
ret1 = html.xpath("//li[@class='item-1']/a/@href")
# Get the text of the a under each li whose class is item-1
ret2 = html.xpath("//li[@class='item-1']/a/text()")
# Grouping: group by li tag, then continue the xpath within each group
ret3 = html.xpath("//li[@class='item-1']")
print(ret3)
for i in ret3:
    item = {}
    item["title"] = i.xpath("./a/text()")[0] if len(i.xpath("./a/text()")) > 0 else None
    item["href"] = i.xpath("./a/@href")[0] if len(i.xpath("./a/@href")) > 0 else None
    print(item)
Scrapy
Basic concept
- The difference between asynchronous and non-blocking
Asynchronous
: after the call is issued, it returns immediately, regardless of the result [synchronous vs asynchronous describes the process]
Non-blocking
: the focus is on the state of the program while waiting for the result of the call (message, return value); before the result can be obtained, the call does not block the current thread [blocking vs non-blocking describes the state before the return value arrives: no need to keep waiting, other work can be done]
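The distinction can be illustrated with concurrent.futures (a sketch of the general idea, not Scrapy's actual internals):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_call():
    time.sleep(0.1)
    return "result"

with ThreadPoolExecutor() as pool:
    # Asynchronous: submit() returns a Future immediately, not the result
    future = pool.submit(slow_call)
    # Non-blocking: the current thread is free to do other work meanwhile
    other_work = "did something else"
    # Only this call waits for the value
    result = future.result()
```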
Scrapy process
Getting started with Scrapy
- Create a scrapy project
scrapy startproject [project name]
- Spawn a crawler
# need to enter the project directory first
cd myproject
scrapy genspider [spider name] [domain]
# run the spider
scrapy crawl itcast
- Extract data
Complete the spider, using xpath and other methods
- Save data
In the pipeline