Fetching Instagram post data

1. Find the content you need: open DevTools with F12 and filter for XHR requests on the Network tab.

2. Analyze the links

链接1:https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEYzFqUFVzZ2tQWVNDekQ1TXBRLWdnWHNzU01UQmktZjV4Y2VablhPVDdrWWg0WDFmbEJOdE9ycnU0WFY3SXk5U3hRZjR2VllkOXdPVWxJbDNHT2t6VQ%3D%3D%22%7D

链接2:https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFDX0ZJdTN0SDNTY3dDcmJ5dmc5OWNCWkwxQXBWZmZ4SU56bFpOTTVqM1FtN29XQzBrT192aXdEclJWdXlHSk9YZGY3dWNYMTltTW9YOVhJbFBVUG5mMQ%3D%3D%22%7D


Comparing the two links shows that everything is identical except one segment; marking the differing part with {} gives the template:
https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22{}%3D%3D%22%7D

So the remaining task is to obtain the {} value for each page.
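To see what the `variables` parameter actually carries, it can be URL-decoded and parsed as JSON. A quick sketch using link 1 from above (standard-library only):

```python
from urllib.parse import urlparse, parse_qs
import json

# Link 1 from above
url = 'https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEYzFqUFVzZ2tQWVNDekQ1TXBRLWdnWHNzU01UQmktZjV4Y2VablhPVDdrWWg0WDFmbEJOdE9ycnU0WFY3SXk5U3hRZjR2VllkOXdPVWxJbDNHT2t6VQ%3D%3D%22%7D'

# parse_qs URL-decodes the query string, revealing the JSON payload
variables = json.loads(parse_qs(urlparse(url).query)['variables'][0])
print(variables['id'])     # the account's numeric id
print(variables['first'])  # page size: 12 posts per request
print(variables['after'])  # the pagination cursor (ends with "==")
```

This confirms the template: `id` and `first` stay fixed, and only the `after` cursor changes between pages.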

3. The key: how to obtain the {} value

Comparing the two requests shows that the {} slot in link 2 is exactly the end_cursor value returned in link 1's JSON response.
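As a sketch, the cursor sits at a fixed path inside the parsed response. The dict below is a simplified stand-in for the real GraphQL payload (the field names match the structure used in the full code further down, but the cursor value here is made up):

```python
# Simplified stand-in for the GraphQL response (real responses carry
# many more fields); the path to the cursor is the part that matters.
res = {
    "data": {"user": {"edge_owner_to_timeline_media": {
        "page_info": {
            "has_next_page": True,
            "end_cursor": "QVFDX0ZJdTN0SDNTY3dDcmJ5dmc5OWNCWkwx==",
        },
        "edges": [],
    }}}
}

page_info = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
end_cursor = page_info["end_cursor"]
print(end_cursor)  # this value (minus the trailing "==") fills the {} slot
```

`page_info` also exposes `has_next_page`, which is handy for knowing when to stop paginating.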

4. Summary: the URL structure is now known, so the crawl can be implemented.
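The URL construction can be wrapped in a small helper (a sketch; `query_hash` and `id` are hard-coded to the values observed in the links above):

```python
# URL template with the cursor slot; the trailing "==" of the cursor is
# stripped and re-added in percent-encoded form (%3D%3D), matching the
# observed links.
TEMPLATE = ('https://www.instagram.com/graphql/query/'
            '?query_hash=103056d32c2554def88228bc3fd9668a'
            '&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12'
            '%2C%22after%22%3A%22{}%3D%3D%22%7D')

def build_url(end_cursor):
    """Build the next-page URL from a page_info['end_cursor'] value."""
    return TEMPLATE.format(end_cursor[:-2])  # drop the trailing "=="

print(build_url('QVFDdemo=='))
```

Stripping the `==` and re-adding it as `%3D%3D` reproduces exactly the form of the captured links.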

The full implementation:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import json
import os
import time

options = Options()
options.add_argument('--lang=en')
options.add_argument('--start-maximized')
# Hide the automation flags so the session looks like a normal browser
options.add_experimental_option("excludeSwitches", ['enable-automation'])
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(r'--user-data-dir=C:\Users\Administrator\AppData\Local\Google\Chrome\User Data')

driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, 30)
os.makedirs('instagram', exist_ok=True)  # output directory for the JSON pages
# driver.get('https://www.instagram.com/bedsurehome/')
# Obtaining the first `after` parameter
"""
# Fetch the latest posts
driver.get('https://www.instagram.com/bedsurehome/?__a=1')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
cc = soup.select('pre')[0]
res = json.loads(cc.text)
with open('instagram/0.json', 'w') as f:
    json.dump(res, f)
"""
# URL construction
URL_FIRST = 'https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEbS1NcUxsa3VZN3JzbW5CdHZ5MXl1OEdYaElxY2Y2R1poWW05QlROSlhjaE53UjNsSHhER0c5TzJRSGw5bmNvQnF6M19DTlRZWFYtamVCbFFKTlNtTw%3D%3D%22%7D'
driver.get(URL_FIRST)

# 1. Save the JSON response (the browser renders it inside a <pre> tag)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
cc = soup.select('pre')[0]
res = json.loads(cc.text)
with open('instagram/1.json', 'w') as f:
    json.dump(res, f)
# Drop the trailing "==" -- the URL template re-adds it as %3D%3D
after = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]["end_cursor"][:-2]
count = 2
for i in range(0, 136):
    url = 'https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22{}%3D%3D%22%7D'.format(
        after)
    print(after)
    driver.get(url)

    # Save this page's JSON response
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    cc = soup.select('pre')[0]
    res = json.loads(cc.text)
    with open('instagram/{}.json'.format(count), 'w') as f:
        json.dump(res, f)
    count += 1
    page_info = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
    if not page_info["has_next_page"]:  # stop once the last page is reached
        break
    after = page_info["end_cursor"][:-2]
    time.sleep(1)

driver.close()
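Once the pages are saved, the per-post data can be read back from the JSON files. A sketch of pulling out the post shortcodes (the `edges`/`node`/`shortcode` path follows the GraphQL payload structure; the sample dict here is made up for illustration):

```python
def extract_shortcodes(res):
    """Collect the post shortcodes from one saved GraphQL page."""
    edges = res["data"]["user"]["edge_owner_to_timeline_media"]["edges"]
    return [edge["node"]["shortcode"] for edge in edges]

# Illustrative sample mirroring one saved page (values are made up):
sample = {"data": {"user": {"edge_owner_to_timeline_media": {
    "edges": [{"node": {"shortcode": "CLxAbCd1"}},
              {"node": {"shortcode": "CLxAbCd2"}}]}}}}
print(extract_shortcodes(sample))  # ['CLxAbCd1', 'CLxAbCd2']
```

Each shortcode identifies one post (`https://www.instagram.com/p/<shortcode>/`), so this gives a list of every scraped post.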

If anything is unclear, feel free to reach out to the author.

Reposted from blog.csdn.net/weixin_45631815/article/details/114578112