Boosting Blog View Counts with a Python Crawler

Version-2.0

An update done in spare time; changes in this version:
1. Added fetching of a user's articles across multiple listing pages
2. Added randomness to the click order
3. Added randomness to the click timing
4. Added randomness to which articles get clicked
5. Added multiple rounds of clicking

Shortcomings

1. No User-Agent set yet
2. No pool of proxy IPs
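Both gaps could be closed with `requests` itself: pass a `headers` dict carrying a rotated User-Agent, and a `proxies` dict pointing at a proxy pool. A minimal sketch, where the UA strings and the proxy address are placeholders rather than tested values:

```python
import random
import requests

# A small pool of desktop browser User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]

def fetch(url, proxies=None):
    """GET a URL with a randomly chosen User-Agent and optional proxies."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)

# Example proxy mapping (placeholder address, not a working proxy):
# proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
```

Each call then presents a different browser signature, and routing through a rotating proxy pool would vary the source IP as well.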

import requests
from bs4 import BeautifulSoup
import time
import random

# Fetch a user's article URLs across multiple listing pages
def get_writer_article_list(base_url, page_num):
    all_article_list = []
    for i in range(page_num):
        index = i + 1
        print('cur index is ' + str(index))
        cur_page_url = base_url + str(index)
        all_article_list = get_article_list(cur_page_url) + all_article_list
    return all_article_list
        

# Get the URLs of all articles on one listing page
def get_article_list(base_url):
    web_data = requests.get(base_url)
    soup = BeautifulSoup(web_data.text,'lxml')
    divs = soup.find_all('div', class_='article-item-box csdn-tracking-statistics')
    
    url_list = []
    for div in divs:
        label = div.find_all('a')
        url = label[0].get('href')
        url_list.append(url)
    return url_list

# Build a random subset of URLs for one round of clicks
def click_random(url_list, min_random_rate):
    new_url_list = []
    max_url_count = len(url_list)
    min_url_count = int(max_url_count * min_random_rate)
    term_url_count = random.randint(min_url_count, max_url_count)
    for i in range(term_url_count):
        random_index = random.randint(0, max_url_count - 1)
        new_url_list.append(url_list[random_index])
    return new_url_list


# Run multiple rounds of clicks
def click_article_url(term_num, click_random_start, click_random_end, term_random_start, term_random_end, all_list):
    for i in range(term_num):
        term_url_list = click_random(all_list, 0.7)
        for url in term_url_list:
            requests.get(url)
            print('click for ' + url)
            click_sleep_time = random.randint(click_random_start, click_random_end)
            time.sleep(click_sleep_time)
            print('sleep for ' + str(click_sleep_time))
        finished_terms = i + 1  # do not reuse term_num here: it is the loop bound
        print('finish the term of ' + str(finished_terms))
        term_sleep_time = random.randint(term_random_start, term_random_end)
        time.sleep(term_sleep_time)
        print('sleep for the term ' + str(term_sleep_time))

base_url1 = "https://blog.csdn.net/xxx1/article/list/"
base_url2 = "https://blog.csdn.net/xxx2/article/list/"

url_list_1 = get_writer_article_list(base_url1,2)
url_list_2 = get_writer_article_list(base_url2,2)

all_list = url_list_1 + url_list_2

click_article_url(200,8,50,30,60,all_list)

Version-1.0

Over a lunch break I used Python to write a small toy that inflates the view counts of blog articles. It is still immature; I will iterate on it soon.

The current overall approach: fetch the user's article listing page, parse the HTML, extract the URL of every article on that page into a list, then loop over the URLs in the list and request each one, which registers as a click.

import requests
from bs4 import BeautifulSoup
import time

# crawl the blog of my follower 周英俊
base_url = "https://blog.csdn.net/qq_38835878/article/list/1"

web_data = requests.get(base_url)
soup = BeautifulSoup(web_data.text,'lxml')
divs = soup.find_all('div', class_='article-item-box csdn-tracking-statistics')

url_list = []
for div in divs:
    label = div.find_all('a')
    url = label[0].get('href')
    url_list.append(url)

for url in url_list:
    requests.get(url)
    print('request for '+url)
    time.sleep(61)
    print('sleep for 61s')

Shortcomings

  1. No pagination support
  2. No proxy IPs
  3. No User-Agent set
  4. Behavior is too regular: session lengths never vary
  5. The click pattern is obvious: every article is visited in order.
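The last two defects need only the standard-library `random` module: shuffle the URL list before the loop, and draw each sleep from an interval instead of using a fixed 61 seconds (Version 2.0 above takes this further). A minimal sketch with placeholder URLs:

```python
import random

# Hypothetical article URLs for illustration.
urls = [
    "https://blog.csdn.net/example/article/details/1",
    "https://blog.csdn.net/example/article/details/2",
    "https://blog.csdn.net/example/article/details/3",
]

random.shuffle(urls)  # breaks the sequential access pattern (defect 5)

for url in urls:
    delay = random.uniform(30, 90)  # varies the wait instead of a fixed 61s (defect 4)
    # requests.get(url)  # the actual click, omitted in this sketch
    print('would visit %s, then sleep %.1fs' % (url, delay))
```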

Reposted from blog.csdn.net/Daverain/article/details/82909867