Multi-threaded E-book Crawling - Editorial
I was recently hunting for a few e-books and, after much browsing, came across a site called 周读. The site is excellent: clean and simple, with a large catalog, everything downloadable directly from open Baidu netdisk links, and fast updates. So naturally, I had to crawl it. Treat this article as a learning exercise only; a sharing site this good should not be hammered by crawlers, or everyone's access slows down: http://www.ireadweek.com/
If you want the data, just leave a comment under this blog post and I will send it to you: QQ, e-mail, whatever works.
The site's page logic is dead simple. I flipped through the book detail pages and they all look alike, so we only need to generate links to these pages in a loop and then crawl them. For speed I used concurrent crawling (coroutines, despite "multi-threaded" in the title); give it a try yourself, and once you have crawled the data you want, comment below this post. Do not wreck other people's servers.
http://www.ireadweek.com/index.php/bookInfo/11393.html
http://www.ireadweek.com/index.php/bookInfo/11.html
....
Multi-threaded E-book Crawling - Analysis and Code
The code is very simple; the earlier tutorials laid the groundwork, so very little code gets us the full functionality. At the end, the collected content is written into a csv file (if you don't know what csv is, a quick search will tell you). This job is IO-bound, so we use the aiohttp module to write it.
Step 1
Build the URLs and launch the coroutines.
import requests
# import the coroutine modules
import asyncio
import aiohttp

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Host": "www.ireadweek.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}

async def get_content(url):
    print("Fetching: {}".format(url))
    # create a session to fetch the data
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers, timeout=3) as res:
            if res.status == 200:
                source = await res.text()  # await the response body
                print(source)

if __name__ == '__main__':
    url_format = "http://www.ireadweek.com/index.php/bookInfo/{}.html"
    full_urllist = [url_format.format(i) for i in range(1, 11394)]  # 11394 pages
    loop = asyncio.get_event_loop()
    tasks = [get_content(url) for url in full_urllist]
    results = loop.run_until_complete(asyncio.wait(tasks))
The code above fires off every coroutine at once, which can easily bring someone else's server to its knees, so we need to cap the concurrency. Look at the following code and try to slot it into the right place yourself (hint: build the task list from x_get_source instead of get_content).
sema = asyncio.Semaphore(5)

# throttle the crawler so it doesn't fire too many requests at once
async def x_get_source(url):
    async with sema:
        await get_content(url)
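For reference, a minimal runnable sketch of how the semaphore caps concurrency. Here get_content is a stand-in placeholder for the real aiohttp fetch from Step 1, and the semaphore is created inside the running loop, which is the safer pattern on modern Python:

```python
import asyncio

async def get_content(url):
    # placeholder for the real aiohttp fetch from Step 1
    await asyncio.sleep(0.01)
    return url

async def x_get_source(sema, url):
    async with sema:  # only 5 coroutines pass this point at a time
        return await get_content(url)

async def main(urls):
    sema = asyncio.Semaphore(5)  # create inside the running event loop
    tasks = [x_get_source(sema, u) for u in urls]
    return await asyncio.gather(*tasks)

urls = ["http://www.ireadweek.com/index.php/bookInfo/{}.html".format(i)
        for i in range(1, 11)]
results = asyncio.run(main(urls))
print(len(results))  # 10
```

asyncio.gather preserves input order, so results line up with the URL list even though completion order varies.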
Step 2
Process the crawled page source and extract the elements we want. I added a method that uses lxml for data extraction.
def async_content(tree):
    title = tree.xpath("//div[@class='hanghang-za-title']")[0].text
    # if the page carries no info, just return
    if title == '':
        return
    else:
        try:
            description = tree.xpath("//div[@class='hanghang-shu-content-font']")
            author = description[0].xpath("p[1]/text()")[0].replace("作者:", "") if description[0].xpath("p[1]/text()")[0] is not None else None
            cate = description[0].xpath("p[2]/text()")[0].replace("分类:", "") if description[0].xpath("p[2]/text()")[0] is not None else None
            douban = description[0].xpath("p[3]/text()")[0].replace("豆瓣评分:", "") if description[0].xpath("p[3]/text()")[0] is not None else None
            # this field is unclear, so we don't record it
            # des = description[0].xpath("p[5]/text()")[0] if description[0].xpath("p[5]/text()")[0] is not None else None
            download = tree.xpath("//a[@class='downloads']")
        except Exception as e:
            print(title)
            return
    ls = [
        title, author, cate, douban, download[0].get('href')
    ]
    return ls
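The function above expects an lxml element tree. A minimal sketch of how to build one from the fetched HTML and run the same XPath queries; the markup here is a made-up fragment that only mimics the site's class names, not real page data:

```python
from lxml import etree

# hypothetical sample mimicking the site's structure
html = """
<div class='hanghang-za-title'>Some Book</div>
<div class='hanghang-shu-content-font'>
  <p>作者:Someone</p>
  <p>分类:Fiction</p>
  <p>豆瓣评分:8.5</p>
</div>
<a class='downloads' href='http://pan.baidu.com/xyz'>download</a>
"""

tree = etree.HTML(html)  # parse the source into an element tree
title = tree.xpath("//div[@class='hanghang-za-title']")[0].text
author = tree.xpath("//div[@class='hanghang-shu-content-font']/p[1]/text()")[0].replace("作者:", "")
link = tree.xpath("//a[@class='downloads']")[0].get("href")
print(title, author, link)
```

In the real crawler you would call etree.HTML(source) on the text awaited in get_content, then pass the tree to async_content.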
Step 3
Format the data and save it to a csv file. Call it a day!
import csv

print(data)
with open('hang.csv', 'a+', encoding='utf-8', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerow(data)
    print("Row saved!")
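A self-contained sketch of the save step; the file name and sample row below are illustrative, and newline='' is passed so the csv module doesn't emit blank lines on Windows:

```python
import csv

def save_row(data, path):
    # append one parsed record to the csv file
    with open(path, "a+", encoding="utf-8", newline="") as fw:
        writer = csv.writer(fw)
        writer.writerow(data)

# hypothetical record in the [title, author, category, rating, link] shape
row = ["Some Book", "Someone", "Fiction", "8.5", "http://pan.baidu.com/xyz"]
save_row(row, "hang_demo.csv")

with open("hang_demo.csv", encoding="utf-8") as f:
    print(f.read().strip())
```

Opening in append mode means each parsed page adds one line, so the crawler can write rows as they arrive instead of buffering everything in memory.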
Multi-threaded E-book Crawling - Run the Code and Check the Results