Crawler learning: scraping a novel from Biquge (笔趣阁) with requests + XPath

I have been into web crawling for a while now, so it's time to actually build something, hehe.

Note: this article assumes some Python background and basic familiarity with the Requests library, XPath syntax, and regular expressions.

1. About Requests and XPath

Requests

Requests is an HTTP library for Python, written on top of urllib and released under the Apache2 open source license.
If you have read the earlier article on the urllib library, you will have noticed that urllib is still rather inconvenient to use; Requests is more convenient than urllib and saves us a lot of work (once you get used to Requests, you will basically never want to touch urllib again). In short, Requests is the simplest and easiest-to-use HTTP library available for Python, and it is the recommended choice for writing crawlers.
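For readers who have not used Requests before, here is a minimal sketch of the fetch pattern used throughout this post (example.com is only a placeholder URL, not part of the project):

import requests

# Minimal Requests usage sketch; example.com is just a placeholder URL.
url = "https://www.example.com/"
response = requests.get(url, timeout=10)
response.encoding = response.apparent_encoding  # let requests re-guess the encoding from the body
print(response.status_code)   # 200 on success
print(response.text[:200])    # first 200 characters of the page source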

XPath

XPath (XML Path Language) is a language for locating and addressing parts of an XML document.
XPath is based on the tree structure of an XML document and provides the ability to navigate the tree and select nodes. The idea was originally proposed as a common syntax and behaviour model shared between XPointer and XSL, but XPath was quickly adopted by developers as a small query language in its own right.
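The crawler below uses the XPath support in lxml. A small illustrative sketch, run against a made-up HTML fragment that imitates the chapter index of the target site:

from lxml import etree

# Made-up HTML fragment imitating the structure of the novel's chapter index.
sample_html = '''
<dl>
  <dd><a href="/15_15566/8549159.html">Chapter 1</a></dd>
  <dd><a href="/15_15566/8549160.html">Chapter 2</a></dd>
</dl>
'''
tree = etree.HTML(sample_html)
hrefs = tree.xpath('//dd/a/@href')     # relative link of every chapter
titles = tree.xpath('//dd/a/text()')   # chapter titles
print(hrefs)   # ['/15_15566/8549159.html', '/15_15566/8549160.html']
print(titles)  # ['Chapter 1', 'Chapter 2']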

 

2. Code

# Version 1: regex + requests + XPath
from lxml import etree
import requests
import re
import warnings
import time

# requests is called with verify=False below, so suppress the InsecureRequestWarning
warnings.filterwarnings("ignore")
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # Fetch the chapter index page and extract the relative link of every chapter
    html = requests.get(URL, headers=headers, verify=False)
    html.encoding = 'gbk'  # the site is GBK-encoded
    tree = etree.HTML(html.text)
    results = tree.xpath('//dd/a/@href')
    return results

def get_items(result):
    # Fetch a single chapter page and pull out the title and body with a regex
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)
    title, body = re.findall(pattern, html.text)[0]
    items = '\n' * 2 + title + '\n' * 2 + body
    # Strip the HTML indentation entities and line-break tags
    items = items.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br />', '')
    return items

def save_to_file(items):
    # Append each chapter to the output text file
    with open("xiaoshuo1.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    for ii, result in enumerate(results, 1):
        items = get_items(result)
        save_to_file(items)
        print(str(ii) + ' of ' + str(len(results)))
#        time.sleep(1)  # optional throttle between requests

if __name__ == '__main__':
    start_1 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_1 = time.time()
    print('Crawler time 1:', end_1 - start_1)
        

Output:

 

# Version 2: requests + XPath only
from lxml import etree
import requests
import warnings
import time

warnings.filterwarnings("ignore")  # requests is called with verify=False, so ignore the InsecureRequestWarning
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # Fetch the chapter index page and extract the relative link of every chapter
    html = requests.get(URL, headers=headers, verify=False)
    html.encoding = 'gbk'  # the site is GBK-encoded
    tree = etree.HTML(html.text)
    results = tree.xpath('//dd/a/@href')
    return results

def get_items(result):
    # Fetch a single chapter page and extract the title and body with XPath
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    tree = etree.HTML(html.text)
    resultstitle = tree.xpath('//*[@class="bookname"]/h1/text()')
    resultsbody = tree.xpath('//*[@id="content"]/text()')
    # Join the text nodes, strip the non-breaking-space indentation, collapse extra blank lines
    body = ''.join(resultsbody).replace('\xa0\xa0\xa0\xa0', '').replace('\r\n\r\n', '\n\n')
    items = str(resultstitle[0]) + '\n' * 2 + body + '\n' * 2
    return items

def save_to_file(items):
    # Append each chapter to the output text file
    with open("xiaoshuo2.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    for ii, result in enumerate(results, 1):
        items = get_items(result)
        save_to_file(items)
        print(str(ii) + ' of ' + str(len(results)))
#        time.sleep(1)  # optional throttle between requests

if __name__ == '__main__':
    start_2 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_2 = time.time()
    print('Crawler time 2:', end_2 - start_2)

Output:

PS: the actual crawling speed depends on your machine and network connection. Also, the regex-based matching can sometimes take a long time, so the XPath version is recommended. A rough way to check this yourself is sketched below.
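To compare just the parsing step (network excluded), you can time both extraction methods on a locally cached chapter page. In the sketch below, cached_chapter.html is a hypothetical local copy of one chapter page; no concrete numbers are claimed here, since results depend on the page and the machine:

import re
import timeit
from lxml import etree

# Hypothetical local copy of one chapter page, saved beforehand.
html_text = open("cached_chapter.html", encoding="gbk").read()

pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)

def regex_extract():
    # Same regex as version 1 of the crawler
    return re.findall(pattern, html_text)[0]

def xpath_extract():
    # Same XPath expressions as version 2 of the crawler
    tree = etree.HTML(html_text)
    return tree.xpath('//*[@class="bookname"]/h1/text()'), tree.xpath('//*[@id="content"]/text()')

print('regex:', timeit.timeit(regex_extract, number=100))
print('xpath:', timeit.timeit(xpath_extract, number=100))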

 

Pitfalls encountered while writing the crawler:

1. Garbled Chinese characters in the crawled pages

 

Solution:

print(response.encoding)  # the encoding that requests guessed from the HTTP headers
print(requests.utils.get_encodings_from_content(response.text)[0])  # the encoding declared in the page's <meta> tag
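Putting the two together, a minimal sketch of the fix: read the encoding declared inside the page itself and assign it back to the response before using response.text (the URL is the chapter index used above; get_encodings_from_content parses the charset from the page's <meta> tag):

import requests

response = requests.get("https://www.biquyun.com/15_15566/", verify=False)
print(response.encoding)  # what requests guessed from the headers, often ISO-8859-1
declared = requests.utils.get_encodings_from_content(response.text)[0]
response.encoding = declared  # 'gbk' for this site; response.text now decodes correctly
print(response.text[:100])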

 

 


 
