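Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/qq_42658739/article/details/89812772
If anything here is inappropriate, please contact me and I will remove it promptly.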
For this scrape I used requests and XPath, since a job this small does not need a heavyweight framework.
import requests
from lxml import etree
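Approach:
1. The goal is to download the crawler tutorial repositories.
2. Analyze the pages and their structure, then use XPath to pull out the download URLs.
3. Download each one in a loop.
The code is as follows: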
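Step 2 can be tried offline before touching the real site. The snippet below is only a sketch: SAMPLE_HTML is made-up markup shaped like the listing page the XPath expects, the repository names RepoA and RepoB are placeholders, and extract_repo_urls is a hypothetical helper name, not part of the script in this post.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical markup standing in for the organization's repository list.
SAMPLE_HTML = """
<div id="org-repositories">
  <ul>
    <li><div class="d-inline-block mb-1"><a href="/Python3WebSpider/RepoA">RepoA</a></div></li>
    <li><div class="d-inline-block mb-1"><a href="/Python3WebSpider/RepoB">RepoB</a></div></li>
  </ul>
</div>
"""


def extract_repo_urls(html, base_url="https://github.com"):
    """Return absolute repository URLs found in the listing markup."""
    selector = etree.HTML(html)
    # Same XPath as the script uses for the organization page.
    hrefs = selector.xpath(
        '//div[@id="org-repositories"]//ul/li'
        '/div[@class="d-inline-block mb-1"]//a/@href'
    )
    # urljoin turns each relative href into an absolute URL.
    return [urljoin(base_url, href) for href in hrefs]


repo_urls = extract_repo_urls(SAMPLE_HTML)
print(repo_urls)
```

If the real page changes its class names, the XPath returns an empty list rather than raising an error, so it is worth checking the length of the result before looping over it.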
class github():
    def __init__(self):
        # Organization page that lists the tutorial repositories
        self.allowed_domains = 'https://github.com/Python3WebSpider'
        self.headers = {
            'User-Agent': '*****replace with your own User-Agent'
        }

    def spider_pipline(self):
        response1 = requests.get(self.allowed_domains, headers=self.headers, timeout=5)
        selector = etree.HTML(response1.text)
        # Relative links to each repository on the organization page
        main_hrefs = selector.xpath('//div[@id="org-repositories"]//ul/li/div[@class="d-inline-block mb-1"]//a/@href')
        for start_href in main_hrefs:
            href = 'https://github.com' + start_href
            response2 = requests.get(href, headers=self.headers, timeout=5)
            selector2 = etree.HTML(response2.text)
            # Second link in the repository's download panel (the archive download link)
            zip_hrefs = selector2.xpath('//main[@id="js-repo-pjax-container"]//div[@class="get-repo-modal-options"]/div[@class="mt-2"]/a[2]/@href')
            for item in zip_hrefs:
                item_new = 'https://github.com' + item
                # yield item_new
                # print(item_new)
                # Send the same headers and set a timeout so a stalled download cannot hang forever
                r = requests.get(item_new, headers=self.headers, timeout=30)
                # Drop the leading '/Python3WebSpider/' (18 characters) and flatten the rest into a file name
                item = item[18:].replace('/', '-')
                # print(item)
                with open(item, "wb") as git_zip:
                    git_zip.write(r.content)
                print('done-')


if __name__ == '__main__':
    git = github()
    git.spider_pipline()
    print('download OK')
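Two parts of the download loop could be made a little more robust. This is a sketch under my own assumptions, not a replacement for the script: archive_filename and download_archive are hypothetical helper names, RepoA is a placeholder repository, and the chunk size and timeout are arbitrary. The first helper builds the file name by dropping the owner segment instead of slicing a fixed 18 characters, so it does not depend on the length of the organization name. The second streams the archive to disk in chunks with requests rather than holding the whole file in memory.

```python
import requests


def archive_filename(href):
    """Turn '/owner/repo/archive/master.zip' into 'repo-archive-master.zip'."""
    parts = href.strip('/').split('/')
    # Drop the first segment (the owner) and join the rest with dashes.
    return '-'.join(parts[1:])


def download_archive(url, filename, headers=None, timeout=10):
    """Stream url into filename chunk by chunk."""
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        with open(filename, 'wb') as archive:
            for chunk in response.iter_content(chunk_size=8192):
                archive.write(chunk)


name = archive_filename('/Python3WebSpider/RepoA/archive/master.zip')
print(name)  # RepoA-archive-master.zip
```

Inside the loop, the call would look something like download_archive(item_new, archive_filename(item), headers=self.headers).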
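Finally, I recommend giving the author's GitHub repositories a star. That author is someone I look up to! His book is very good, and I recommend buying it to study from, since it helps you build a structured body of knowledge.
If anything here causes offense, please contact me and I will remove the relevant content.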