A small exercise in crawling Ajax dynamic pages

# To learn how to simulate an Ajax request, this exercise crawls the activity feed of a Zhihu user.
# Open the user's activity page, right-click and choose Inspect, then select the XHR filter in the Network tab. Scroll the page so that new content loads, and new entries appear in the Name column:
# these are the Ajax requests that fetch the refreshed activity items. To analyze the parameters, click one of the requests to open its details; it is a GET request.
# The activity page is https://www.zhihu.com/people/sgai/activities, and this is the feed we crawl.
# The Ajax request link carries four parameters: limit, session_id, after_id and desktop. Only after_id changes between requests, and it changes irregularly, so as an exercise we crawl the content of a single Ajax request only.
# Click an Ajax request and inspect its response under the Preview tab; the main data sits in the data field of the response.
# data holds several elements, each containing some information from Zhihu. We pick a few of its nodes to crawl, for example action_text, comment_count, actor, content and excerpt (a sketch of this shape follows these notes).
# Each new Ajax request looks like https://www.zhihu.com/api/v4/members/sgai/activities?limit=7&session_id=1090949826551648256&after_id=1551863271&desktop=True
# The crawled content comes back as dictionaries, which are awkward to dump into a plain text file, so we save the data to MongoDB instead.
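
A minimal sketch of the response shape implied by the notes above and by the parsing code below. Only the field names come from this post; every value is a placeholder, and the keys inside actor are omitted because the notes do not list them.

# Illustrative shape only -- placeholder values, not real data
sample_response = {
    'data': [
        {
            'action_text': '...',   # e.g. a "liked an answer" style label
            'actor': {},            # nested object describing the user; keys omitted here
            'target': {
                'comment_count': 0,
                'content': '...',
                'excerpt': '...',
            },
        },
        # ...up to limit (7) items per request
    ],
}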

from urllib.parse import urlencode
from pymongo import MongoClient
import requests
import json
def get_page():
    # Fixed parameters plus the after_id taken from one observed request
    params = {
        'limit': '7',
        'session_id': '1090949826551648256',
        'after_id': 1547986974,
        'desktop': 'True'
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)
# First define base_url as the front half of the request URL, then build the parameter dictionary: limit, session_id and desktop are fixed parameters, while after_id is the variable one
def parse_page(json_data):
    # json_data instead of json, to avoid shadowing the json module
    if json_data:
        items = json_data.get('data')
        for item in items:
            target_item = item.get('target')
            if not target_item:  # some activity items carry no target
                continue
            yield {
                'action_text': item.get('action_text'),
                'comment_count': target_item.get('comment_count'),
                'actor': item.get('actor'),
                'content': target_item.get('content'),
                'excerpt': target_item.get('excerpt')
            }
def write_to_file(content):
    # Optional alternative to MongoDB; not called in __main__ below
    with open('zhihu.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


base_url = "https://www.zhihu.com/api/v4/members/sgai/activities?"
headers = {
    'headers': 'www.zhihu.com',

    'User-Agent': 'Mozilla/5.O (Macintosh;Intel Mac OS X 10_12_3) AppleWebKit/537.36(KHTML,like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    }
client = MongoClient()  # connect to the local MongoDB instance
db = client['zhihu']
collection = db['zhihu']
def save_to_mongo(result):
    # insert_one replaces the deprecated Collection.insert
    if collection.insert_one(result):
        print('Saved to Mongo')
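# A quick sanity check -- assumes a local MongoDB on the default port,
# as MongoClient() above implies; uncomment to print a few stored docs:
# for doc in collection.find().limit(3):
#     print(doc)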
if __name__ == '__main__':
    json_data = get_page()
    results = parse_page(json_data)
    for result in results:
        print(result)
        save_to_mongo(result)
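
The script above deliberately stops after one request, because after_id changes unpredictably. If you want to keep paging, Zhihu's v4 endpoints generally return a paging object next to data; the sketch below assumes such a field, with a next URL and an is_end flag (an assumption based on similar Zhihu v4 endpoints, not verified against this one), and simply follows it.

def crawl_feed(start_url, headers, max_pages=5):
    # Follow paging.next until is_end, or until the page budget runs out
    url = start_url
    for _ in range(max_pages):
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            break
        payload = response.json()
        for item in payload.get('data', []):
            yield item
        paging = payload.get('paging', {})  # assumed field; check it in Preview first
        if paging.get('is_end') or not paging.get('next'):
            break
        url = paging['next']

Each yielded item can then go through the same extraction and storage steps as parse_page and save_to_mongo above.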

Source: https://blog.csdn.net/wg5foc08/article/details/89422073