Python crawler tutorial for beginners: a Zhihu article image crawler

1. Zhihu article image crawler, part two: background

Yesterday I wrote part of the code for the Zhihu article image crawler, which fetched the JSON containing the answers to a Zhihu question. Some of that code was hard-coded. Today I have finished adjusting those parts and completed the image download code.

First, pick any Zhihu question. You only need to enter the question ID to get the relevant page information, the most important piece being the total number of answers to the question.
The question ID is the number at the end of the question URL (marked in red in the original post's screenshot).

Now for the code. The following code checks whether the user has entered a valid ID, then builds the question URL and fetches the total number of answers to that question.

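import requests
import re
import pymongo
import time

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
# Database.authenticate() was removed in PyMongo 4.0, so the credentials go to MongoClient
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT,
                             username="dba", password="dba", authSource=DATABASE_NAME)
db = client[DATABASE_NAME]
collection = db.zhihuone  # collection the crawled data will be inserted into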

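BASE_URL = "https://www.zhihu.com/question/{}"
def get_totle_answers(article_id):
    headers = {
        # Replace this with your own browser's full User-Agent string
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"
    }

    with requests.Session() as s:
        with s.get(BASE_URL.format(article_id), headers=headers, timeout=3) as rep:
            html = rep.text
            # The answer count is embedded in a <meta itemProp="answerCount"> tag
            pattern = re.compile(r'<meta itemProp="answerCount" content="(\d*?)"/>')
            match = pattern.search(html)
            if match is None or not match.group(1):
                # Tag missing or empty (for example, an invalid ID): report zero answers
                print("Found 0 answers")
                return "0"
            print("Found {} answers".format(match.group(1)))
            return match.group(1)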

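if __name__ == '__main__':

    # Keep prompting until the user enters a purely numeric ID
    article_id = ""
    while not article_id.isdigit():
        article_id = input("Enter the question ID: ")

    totle = get_totle_answers(article_id)
    if int(totle) > 0:
        # ZhihuOne is the crawler class; it is not defined in this snippet
        zhi = ZhihuOne(article_id, totle)
        zhi.run()
    else:
        print("No data found!")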
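
The answer-count lookup can also be tried without touching the network. The sketch below is a minimal offline version of the same idea: it validates the ID, builds the question URL, and pulls the count out of an HTML string. The helper names `build_question_url` and `parse_answer_count` and the sample HTML are made up for illustration. Only the URL template and the meta tag come from the code above, and the pattern here uses `\d+` so that an empty count is never matched.

```python
import re

BASE_URL = "https://www.zhihu.com/question/{}"
# Same meta tag the crawler looks for; \d+ requires at least one digit
ANSWER_COUNT_PATTERN = re.compile(r'<meta itemProp="answerCount" content="(\d+)"/>')


def build_question_url(question_id):
    """Hypothetical helper: return the question URL, rejecting non-numeric IDs."""
    if not question_id.isdigit():
        raise ValueError("question ID must be numeric: {!r}".format(question_id))
    return BASE_URL.format(question_id)


def parse_answer_count(html):
    """Hypothetical helper: return the answer count as an int, or 0 if the tag is missing."""
    match = ANSWER_COUNT_PATTERN.search(html)
    return int(match.group(1)) if match else 0


# Made-up HTML fragment standing in for a real question page
sample_html = '<head><meta itemProp="answerCount" content="128"/></head>'

print(build_question_url("12345678"))
print(parse_answer_count(sample_html))
print(parse_answer_count("<html></html>"))
```

Returning an integer and treating a missing tag as zero keeps the caller's `> 0` check simple. Whether a missing tag should instead be an error is a design choice this sketch does not settle.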

Next, the image download part. Inspecting the responses shows that the image links sit inside the content field of the JSON, so a simple regular expression is enough to extract them. The original post illustrated this with a screenshot.
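
As a rough illustration of that extraction step, here is a minimal sketch that runs on a made-up payload instead of a live response. The function name `extract_image_urls`, the sample data, and the example.com URLs are assumptions for demonstration. The two patterns are the ones used by the download_img method in this post.

```python
import re

NOSCRIPT_PATTERN = re.compile(r'<noscript>(.*?)</noscript>')
IMG_SRC_PATTERN = re.compile(r'<img src="(.*?)"')


def extract_image_urls(data):
    """Hypothetical helper: collect image URLs from every answer's "content" field."""
    urls = []
    for item in data["data"]:
        # Look only inside <noscript> blocks, then take the src of the <img> tag
        for block in NOSCRIPT_PATTERN.findall(item["content"]):
            match = IMG_SRC_PATTERN.search(block)
            if match:
                urls.append(match.group(1))
    return urls


# Made-up payload shaped like the answers JSON described above
sample = {
    "data": [
        {"content": '<p>text</p><noscript><img src="https://pic2.example.com/a.jpg" width="10"></noscript>'},
        {"content": "<p>no images here</p>"},
    ]
}

print(extract_image_urls(sample))
```

Regular expressions are fragile against changes in attribute order or quoting, so an HTML parser may be a sturdier choice if the page markup differs from this assumption.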


Now the code. Read the comments in the code below: there is a small bug in the middle, where pic3 has to be manually replaced with pic2. I have not found a clear reason for this yet; it may be caused by my local network. Also, please create an imgs folder in the project root directory to store the images.

    def download_img(self,data):
        # Download every image found in the answers
        for item in data["data"]:
            content = item["content"]
            # Image tags sit inside <noscript> blocks in the answer HTML
            pattern = re.compile('<noscript>(.*?)</noscript>')
            imgs = pattern.findall(content)
            for img in imgs:
                match = re.search('<img src="(.*?)"', img)
                if match is None:
                    continue  # no src attribute in this block, skip it
                download = match.group(1)
                download = download.replace("pic3", "pic2")  # small bug: images on pic3 fail to download

                print("Downloading {}".format(download), end="")
                try:
                    with requests.Session() as s:
                        with s.get(download) as img_down:
                            # Use the last path segment of the URL as the file name
                            file = download[download.rindex("/") + 1:]

                            with open("imgs/{}".format(file), "wb") as f:  # the imgs folder is hard-coded
                                f.write(img_down.content)

                            print(" done", end="\n")

                except Exception as e:
                    print(e.args)
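
The hard-coded folder is the part most likely to trip people up, since writing fails if imgs does not exist. The sketch below shows one possible way to derive the local path and create the folder on demand. The helpers `local_path_for` and `save_image` are hypothetical, and the demo writes fake bytes into a temporary directory rather than downloading anything.

```python
import os
import tempfile

IMG_DIR = "imgs"  # same folder name the crawler writes to


def local_path_for(url, img_dir=IMG_DIR):
    """Hypothetical helper: map an image URL to a file path inside img_dir."""
    # Everything after the last "/" becomes the file name, as in download_img
    file_name = url[url.rindex("/") + 1:]
    return os.path.join(img_dir, file_name)


def save_image(url, content, img_dir=IMG_DIR):
    """Hypothetical helper: write raw bytes to disk, creating img_dir if needed."""
    os.makedirs(img_dir, exist_ok=True)  # no error if the folder already exists
    path = local_path_for(url, img_dir)
    with open(path, "wb") as f:
        f.write(content)
    return path


# Demo with a made-up URL and fake bytes, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "imgs")
    saved = save_image("https://pic2.example.com/v2-abc.jpg", b"fake-bytes", img_dir=target_dir)
    print(os.path.basename(saved))
```

With a helper like this, the manual step of creating the folder becomes optional, though creating it by hand as described above works just as well.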


Run results


While browsing Zhihu along the way, I also came across a lot of good questions.


Origin blog.51cto.com/14510224/2438070