Batch-downloading hot-list answers with fifty lines of Python

Foreword

As part of fine-tuning a language model, we need to collect a large amount of text data from the Internet. A certain Q&A site is full of public, high-quality questions and answers that are well suited for training, so today we are going to download the questions and answers that made its hot list over the past two years.

Approach

First, open an answer under any question, click Share, and copy the link.
The link turns out to have the following structure:
https://www.zhihu.com/question/431730729/answer/1591942026
So each answer link is built from a question ID and an answer ID, which raises the following questions (a short sketch of assembling such a link follows the list):
1. The question ID identifies the question. How do we get all the question IDs that have appeared on the hot list over the past two years?
2. The answer ID identifies a single answer, and the share link only gives us one of them. How do we get the answer IDs of each question so we can loop over them?
3. Once we have these IDs, how do we extract the text from the pages they point to?
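
For reference, building such a link from the two IDs is plain string formatting. A minimal sketch, using the IDs from the example link above:

question_id = "431730729"
answer_id = "1591942026"
# the answer page URL is just the two IDs slotted into a fixed pattern
url = f"https://www.zhihu.com/question/{question_id}/answer/{answer_id}"
print(url)  # https://www.zhihu.com/question/431730729/answer/1591942026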

Preparation

1. A regularly crawled archive of the hot list can be found on GitHub:
https://github.com/justjavac/zhihu-trending-hot-questions
It goes back to 2020-11-24. Download and decompress it, and in the archive folder you will find md documents covering every day's hot questions.

2. Install the required libraries:

pip install requests
pip install beautifulsoup4
pip install lxml

3. Plan the code
Following the questions above, the script splits into three parts:
Part 1 extracts all the hot-list question IDs from the md documents.
Part 2 queries each question for its answer IDs and builds the answer links.
Part 3 fetches each answer page, built from a question ID plus an answer ID, and extracts its content.

Code

Note: by default the script expects to run inside the archive directory, where it reads the file names of all the md documents.
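
If you prefer not to run it from inside that folder, one alternative (the path below is an assumption; point it at wherever you unpacked the repository) is to list an explicit directory instead of the current one:

import os

# hypothetical location of the unpacked md archives
archive_dir = 'zhihu-trending-hot-questions-master/archive'
arr = [os.path.join(archive_dir, name) for name in os.listdir(archive_dir)]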

from bs4 import BeautifulSoup
import json
import requests
import os
import re
import time
# Part 1: collect question IDs from the md documents
headers = {
    'content-type': 'text/html; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/111.0.0.0 Safari/537.36'}
arr = os.listdir()
print(arr)
list_total_question = []
for x in arr:
    if x.endswith('.md'):
        with open(x, 'r', encoding='utf8') as hot_diary:
            # grab every question ID that appears in the day's hot list
            pattern = r'https://www\.zhihu\.com/question/(\d+)'
            list_total_question += re.findall(pattern, hot_diary.read())
print(len(list_total_question))
# Part 2: for each question, query the feeds API to collect its answer IDs
list_all_url = []
for x in list_total_question:
    a1 = time.time()
    url = 'https://www.zhihu.com/api/v4/questions/' + x + '/feeds?'
    datas = requests.get(url, headers=headers).json()
    for info in datas['data']:
        answerid = info['target']['id']
        final_url = 'https://www.zhihu.com/question/' + x + '/answer/' + str(answerid)
        list_all_url.append(final_url)
        # break  # five answers are returned per question by default; break here to keep only the first one in the default order
    a2 = time.time()
    print(a2 - a1)  # roughly 0.5 s per question
    break  # test with a single question; remove this break to crawl everything
print(list_all_url)
# Part 3: fetch each answer page and extract the question title and answer text
list_json = []
for x in list_all_url:
    html = requests.get(url=x, headers=headers)
    site = BeautifulSoup(html.text, 'lxml')
    # the question title lives in a <meta itemprop="name"> tag
    title = site.find('meta', attrs={'itemprop': 'name'})['content']
    # the answer body lives in the div with class "RichContent-inner"
    text = site.find('div', attrs={'class': 'RichContent-inner'})
    print(title)
    total_string = ""
    for i in text:
        # skip fragments that only contain leaked ".css" style rules
        if i.text.find('.css') == -1:
            total_string += i.text
    dict_json = {"instruction": title, "input": "", "output": total_string}
    list_json.append(dict_json)
print(list_json)

Crawling all 168,500 hot-list entries takes roughly 91,000 seconds at about 0.5 s per request, and the text-extraction step has not been tested at that scale yet. If you just want a quick look at the result, add a break so only a single item is processed. By default the code extracts five answers for the first question. The code for saving the results to a file is not included here; it is straightforward.
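
For completeness, a minimal sketch of that save step, writing the list built above to a single JSON file (the file name is just an example):

import json

# dump the instruction/input/output records collected in list_json
with open('hot_list_answers.json', 'w', encoding='utf8') as f:
    json.dump(list_json, f, ensure_ascii=False, indent=2)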

Result

(Screenshot of the printed output omitted here.)
The records are assembled as JSON with instruction/input/output fields so they can be fed straight into a fine-tuning library for large language models; drop that step if you don't need it.
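
For illustration, one record in the format built by the code above (the title and answer text are placeholders):

{"instruction": "<question title>", "input": "", "output": "<answer text>"}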

Origin: blog.csdn.net/weixin_43945848/article/details/130004546