Multi-threaded crawling of an e-book site

Multi-threaded crawling of an e-book site - Editorial

Recently I was looking for a few e-books to read. After browsing around, I found a site called 周读. It is particularly good: simple and clean, with lots of books, all of them directly downloadable from open Baidu netdisk links, and it updates quickly. So of course I had to crawl it. This article is for learning purposes only; please try not to hammer such a good sharing site and slow it down for other people: http://www.ireadweek.com/ . If you want the data, leave a comment below this blog post and I will send it to you; QQ or email, whatever works.


The site's logic is particularly simple. I flipped through a few book detail pages and they all look the same, so we just need to generate the links to these pages in a loop and then crawl them. For speed I used "multi-threading" (coroutines with aiohttp, really). Try it yourself, and once you have crawled the data you want, comment below this blog post instead of wrecking other people's server.

http://www.ireadweek.com/index.php/bookInfo/11393.html
http://www.ireadweek.com/index.php/bookInfo/11.html
....

Multi-threaded crawling of an e-book site - Analysis and Code

The code is very simple. The earlier tutorials have paved the way, so very little code is needed for the full functionality. The collected content is finally written into a csv file (if you don't know what csv is, a quick search will tell you). Since this is an I/O-bound task, we use the aiohttp module to write it.

Step 1

Build the URLs and start the coroutines.

import requests

# Import the coroutine modules
import asyncio
import aiohttp


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Host": "www.ireadweek.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}


async def get_content(url):
    print("Working on: {}".format(url))
    # Create a session to fetch the data
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers, timeout=3) as res:
            if res.status == 200:
                source = await res.text()  # Wait for the page text
                print(source)


if __name__ == '__main__':
    url_format = "http://www.ireadweek.com/index.php/bookInfo/{}.html"
    full_urllist = [url_format.format(i) for i in range(1, 11394)]  # 11394 book detail pages
    loop = asyncio.get_event_loop()
    tasks = [get_content(url) for url in full_urllist]
    results = loop.run_until_complete(asyncio.wait(tasks))

The code above fires off all of those requests at once, which can easily bring someone else's server to its knees, so we have to limit the concurrency. Look at the following code and try to put it in the right place yourself.

sema = asyncio.Semaphore(5)

# Throttle the crawler so it does not send too many requests at once
async def x_get_source(url):
    async with sema:
        await get_content(url)
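The post leaves placing this wrapper as an exercise. One reasonable spot (a sketch, not necessarily the author's exact placement) is to schedule x_get_source instead of get_content when building the task list, so at most 5 requests are in flight at any time:

if __name__ == '__main__':
    url_format = "http://www.ireadweek.com/index.php/bookInfo/{}.html"
    full_urllist = [url_format.format(i) for i in range(1, 11394)]
    loop = asyncio.get_event_loop()
    # Wrap every URL in the semaphore-guarded coroutine instead of calling get_content directly
    tasks = [x_get_source(url) for url in full_urllist]
    loop.run_until_complete(asyncio.wait(tasks))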

Step 2

Process the source code of the crawled pages and extract the elements we want. I added a method that uses lxml for the data extraction.

def async_content(tree):
    title = tree.xpath("//div[@class='hanghang-za-title']")[0].text
    # If the page has no information, just return
    if title == '':
        return
    else:
        try:
            description = tree.xpath("//div[@class='hanghang-shu-content-font']")
            author = description[0].xpath("p[1]/text()")[0].replace("作者:", "") if description[0].xpath("p[1]/text()")[0] is not None else None
            cate = description[0].xpath("p[2]/text()")[0].replace("分类:", "") if description[0].xpath("p[2]/text()")[0] is not None else None
            douban = description[0].xpath("p[3]/text()")[0].replace("豆瓣评分:", "") if description[0].xpath("p[3]/text()")[0] is not None else None
            # This part of the content is unclear, so it is not recorded
            # des = description[0].xpath("p[5]/text()")[0] if description[0].xpath("p[5]/text()")[0] is not None else None
            download = tree.xpath("//a[@class='downloads']")
        except Exception as e:
            print(title)
            return
        ls = [
            title, author, cate, douban, download[0].get('href')
        ]
        return ls
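The post does not show the glue between the fetched page source and this extraction function. A minimal sketch (the etree.HTML call and the data handling inside get_content are my additions, not the author's exact code) parses the text with lxml and passes the tree in:

from lxml import etree

async def get_content(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers, timeout=3) as res:
            if res.status == 200:
                source = await res.text()
                # Parse the HTML into an lxml tree and hand it to the extractor
                tree = etree.HTML(source)
                data = async_content(tree)
                if data is not None:
                    print(data)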

Step 3

After formatting the data, save it to a csv file and call it a day!

    print(data)
    with open('hang.csv', 'a+', encoding='utf-8') as fw:
        writer = csv.writer(fw)
        writer.writerow(data)
        print("Inserted successfully!")
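For the snippet above to run on its own, csv has to be imported, and opening the file with newline='' avoids blank rows on Windows. A small helper wrapping it (the name write_row is mine, not from the original post) could look like this:

import csv

def write_row(data):
    # Append one extracted record to the CSV file
    with open('hang.csv', 'a+', encoding='utf-8', newline='') as fw:
        writer = csv.writer(fw)
        writer.writerow(data)
    print("Inserted successfully!")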

Multi-threaded crawling of an e-book site - Run the Code and View the Results

(Screenshot of the crawl results.)

