Python Crawler Practice - 7. Scraping Mimi Net's ghost story collection (lxml xpath + requests)

The demo site used in the tutorial, Qiushibaike, has gone down (apparently over issues involving users' private information), so I had to find another site to practice on.

A few days ago I worked through the lxml section, mainly etree. After the 4.4.2 update, importing etree directly no longer resolved for me, so I pulled it in through the ElementInclude package instead. etree also has a parser whose parse method first instantiates an HTMLParser, though I didn't use it here. This toy crawler treats fetching the HTML and cleaning the data as two separate steps, and the demo doesn't use multithreading, so crawling 200 ghost stories, a few MB of txt in total, took embarrassingly long to fetch and write (lol). That's how much multithreading matters.
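For reference, a minimal sketch of the two import routes; as far as I can tell they point at the same module, since ElementInclude itself does "from lxml import etree" internally:

# the usual import, which stopped resolving for me on 4.4.2
# from lxml import etree

# workaround: grab the re-exported etree from ElementInclude
from lxml.ElementInclude import etree

# quick sanity check that xpath works through this import
html = etree.HTML("<html><body><p>hi</p></body></html>")
print(html.xpath("//p/text()"))  # ['hi']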

First, hit F12 and look over the page source to work out the content and list page URLs. Since what we want is plain text, displayed right on the page, the pattern is easy to spot:

The links to the ghost story content pages live in <a> tags under <h2> tags inside <article> tags, so XPath can locate them with "//article//h2/a/@href", which fetches the links to the 20 story content pages on the current list page.
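To illustrate on a made-up fragment (the hrefs below are stand-ins, not real story URLs):

from lxml.ElementInclude import etree

list_html = """
<article><h2><a href="/guigushi/story-a.html">Story A</a></h2></article>
<article><h2><a href="/guigushi/story-b.html">Story B</a></h2></article>
"""
html = etree.HTML(list_html)
print(html.xpath("//article//h2/a/@href"))
# ['/guigushi/story-a.html', '/guigushi/story-b.html']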

Then open one of the content detail pages.

It's easy to see that the text we want sits in <p> tags inside the div with id="single", which XPath can locate with "//div[@id='single']//p".
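Same idea on a made-up detail-page fragment (reusing the etree import from above); note that the real story body spans many <p> tags, which matters later when writing the file:

detail_html = """
<div id="single">
  <h2>Some Story Title</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
html2 = etree.HTML(detail_html)
print(html2.xpath("//div[@id='single']//h2")[0].text)  # Some Story Title
print([p.text for p in html2.xpath("//div[@id='single']//p")])
# ['First paragraph.', 'Second paragraph.']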

Pagination is the same as always: 20 content links per list page, and the page URL is just spliced together, so appending /page/i gets the job done.

I'm not grabbing the images, since they're clearly just unrelated decorative pictures.

Then comes the code implementation:

import requests
from lxml.ElementInclude import etree  # re-export of lxml.etree; works around the direct-import issue

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}

for i in range(1, 11):
    # list pages 1-10, 20 stories each
    url = "https://mimi.kunjuke.com/guigushi/page/" + str(i)
    res = requests.get(url, headers=headers).text

    html = etree.HTML(res)
    # links to the 20 story content pages on this list page
    url_result = html.xpath("//article//h2/a/@href")

    for site in url_result:
        res2 = requests.get(site, headers=headers).text
        html2 = etree.HTML(res2)
        # story body is every <p> under div#single; the title is its <h2>
        content_result = html2.xpath("//div[@id='single']//p")
        title_result = html2.xpath("//div[@id='single']//h2")
        story_name = "H:/GhostStory/" + title_result[0].text + ".txt"
        # join all paragraphs (skipping empty <p> whose .text is None) and write as utf-8 text
        content = "\n".join(p.text for p in content_result if p.text)
        with open(story_name, "w", encoding="utf-8") as f:
            f.write(content)

A very simple implementation. Ugh, crawling these few-KB txt files one at a time without multithreading is genuinely painful; looks like I still need to go learn multithreading properly.
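A rough sketch of how a thread pool version might look, with the per-story logic pulled into a download_story helper (the helper name and worker count are arbitrary choices here, not something settled):

from concurrent.futures import ThreadPoolExecutor

def download_story(site):
    # same fetch-parse-write logic as the inner loop above, for one story page
    res2 = requests.get(site, headers=headers).text
    html2 = etree.HTML(res2)
    content_result = html2.xpath("//div[@id='single']//p")
    title_result = html2.xpath("//div[@id='single']//h2")
    content = "\n".join(p.text for p in content_result if p.text)
    with open("H:/GhostStory/" + title_result[0].text + ".txt", "w", encoding="utf-8") as f:
        f.write(content)

# download up to 8 stories concurrently instead of strictly one after another
with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(download_story, url_result)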

Since I didn't add info-level logging, console output, or any exception handling (heh, lazy dog that I am), let's just verify by hand.
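For the record, a minimal sketch of what that logging plus exception handling could look like around the per-story request (the format string and timeout are arbitrary choices):

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

try:
    res2 = requests.get(site, headers=headers, timeout=10).text
    logging.info("fetched %s", site)
except requests.RequestException as e:
    # log and skip this story rather than letting one bad request kill the run
    logging.warning("failed to fetch %s: %s", site, e)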

After the crawl, check whether all ten pages' worth, two hundred ghost stories, actually came down.

Open one and take a look; if the I/O and encoding weren't botched, it should be fine.

Bingo. Eh, still pretty rubbish though: I wandered off and finished an entire pork floss pastry, and these 200 ghost stories still hadn't finished downloading. Such a shame.

Next time, next time, for sure next time I'll add the info logging, the exception handling, and the multithreading. Shamefully shedding fake tears, meow >_<!!


Origin: www.cnblogs.com/liuchaodada/p/12181566.html