Blog Visitor Statistics

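I wrote this code a long time ago and am updating it again today.

While doing so I found a bug in how it matched numbers written with thousands separators.

In addition, the per-region data had not been scraped.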
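The thousands-separator problem comes down to turning strings such as `1,234` into integers. Below is a minimal sketch of one way to do that, assuming the counter only uses commas as group separators. The helper name `parse_count` is hypothetical and is not part of the original script.

```python
def parse_count(text):
    """Convert a count such as '987' or '1,234,567' to an int.

    Hypothetical helper: assumes commas appear only as thousands separators.
    """
    return int(text.strip().replace(',', ''))


if __name__ == '__main__':
    # Made-up sample values, only to show the conversion.
    for raw in ['987', '1,234', '1,234,567']:
        print(raw, '->', parse_count(raw))
```

Removing every comma before calling `int` handles any number of groups, whereas splitting on the first comma only works for values below one million.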

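Data source: http://s04.flagcounter.com/more7/XTPq/

Processing steps:

1. Scrape the data

2. Build the arrays: date, blog visits, flag views

3. Save the data to a file

4. Save a pickle file

5. Generate the visit line charts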
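The pickle step is listed but not shown in this post. The following is a rough sketch of what it could look like using the standard `pickle` module. The function names `save_pickle` and `load_pickle` and the dictionary layout are assumptions made for illustration, not taken from the repository.

```python
import os
import pickle
import tempfile


def save_pickle(date, visitors, flag_views, path):
    """Store the three parallel lists in one pickle file (assumed layout)."""
    record = {'date': date, 'visitors': visitors, 'flagViews': flag_views}
    with open(path, 'wb') as f:
        pickle.dump(record, f)


def load_pickle(path):
    """Read the lists back in the same order they were saved."""
    with open(path, 'rb') as f:
        record = pickle.load(f)
    return record['date'], record['visitors'], record['flagViews']


if __name__ == '__main__':
    # Made-up sample values, only to show the round trip.
    path = os.path.join(tempfile.mkdtemp(), 'blog_sample.pkl')
    save_pickle(['January 11, 2019', 'January 12, 2019'], [10, 12], [15, 20], path)
    print(load_pickle(path))
```

Keeping the parsed integers in a pickle means the charts can be regenerated later without scraping the pages again.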

The scraping code is shown below; the full code is open source at https://github.com/zpfbuaa/blogVisitors

# -*- coding: utf-8 -*-
# @Time    : 2018/5/25 1:15 PM
# @Author  : 伊甸一点
# @FileName: getHtml.py
# @Software: PyCharm
# @Blog    : http://zpfbuaa.github.io

import requests
import re
import time
import os


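# Raw strings keep \w, \d and \S as regex escapes rather than invalid string escapes.
date_pt = re.compile(r'<font face=arial size=-1>(\w+ \d+, \d+)')
# [\d,]+ also matches counts written with thousands separators, e.g. 1,234.
visitors_pt = re.compile(r'<font face=arial size=2>([\d,]+)</td><td>')
flagViews_pt = re.compile(r'<font face=arial size=2>(\S+)</font></td></tr>')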

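def getTotalBlog(url, pages):
    """Fetch pages 1..pages and collect the date, visitor and flag-view columns."""

    date = []
    visitors = []
    flagViews = []

    for page in range(1, pages+1):
        newUrl = url + str(page)
        print(newUrl)

        # A timeout stops a stalled request from hanging the whole run;
        # raise_for_status() surfaces HTTP errors instead of parsing an error page.
        response = requests.get(newUrl, timeout=10)
        response.raise_for_status()
        html = response.text
        item_date = date_pt.findall(html)
        item_visitors = visitors_pt.findall(html)
        item_flagViews = flagViews_pt.findall(html)

        date.extend(item_date)
        visitors.extend(item_visitors)
        flagViews.extend(item_flagViews)


    return date, visitors, flagViews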

def change_data(date, visitors, flagViews):
    # The three lists must line up row by row; print the lengths as a sanity check.
    print(len(visitors))
    print(len(flagViews))
    for i in range(0, len(date)):
        # Strip thousands separators so that '1,234' and '1,234,567' both parse.
        visitors[i] = int(str(visitors[i]).replace(',', ''))
        flagViews[i] = int(str(flagViews[i]).replace(',', ''))

    return date, visitors, flagViews

def printData(date, visitors, flagViews):
    print('Date    Visitors    Flag Counter Views')
    for i in range(0, len(date)):
        print(date[i],visitors[i],flagViews[i])

def writeToFile(date, visitors, flagViews, data_root='data/'):

    today = time.strftime('%Y%m%d', time.localtime(time.time()))
    # Create the output directory if it does not exist yet.
    os.makedirs(data_root, exist_ok=True)
    data_file = os.path.join(data_root, 'blog_' + today)

    # The with-block closes the file even if a write fails.
    with open(data_file, 'w') as f:
        f.write('Date\tVisitors\tFlag Counter Views\n')
        for i in range(0, len(date)):
            f.write(date[i] + '\t' + str(visitors[i]) + '\t' + str(flagViews[i]) + '\n')
    return 1


url = 'http://s04.flagcounter.com/more7/XTPq/'
pages = 23
date, visitors, flagViews = getTotalBlog(url, pages)

# printData(date, visitors, flagViews)

date, visitors, flagViews = change_data(date, visitors, flagViews)

# printData(date, visitors, flagViews)

flag = writeToFile(date, visitors, flagViews)

print('Data Prepare Done!')
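To see what the three patterns capture, they can be run against a small fragment without touching the network. The sketch below uses an invented HTML fragment shaped only to fit the patterns; it is not copied from the real Flag Counter page. The helper `extract_columns` is hypothetical, and the visitor pattern here is a variant that accepts commas.

```python
import re

# Patterns as in the script, written as raw strings; the visitor pattern
# is a variant that also captures comma-grouped counts.
date_pt = re.compile(r'<font face=arial size=-1>(\w+ \d+, \d+)')
visitors_pt = re.compile(r'<font face=arial size=2>([\d,]+)</td><td>')
flagViews_pt = re.compile(r'<font face=arial size=2>(\S+)</font></td></tr>')


def extract_columns(html):
    """Return the three parallel lists found in one page of HTML."""
    return date_pt.findall(html), visitors_pt.findall(html), flagViews_pt.findall(html)


# Invented fragment for illustration; the real page markup may differ.
SAMPLE_HTML = (
    '<tr><td><font face=arial size=-1>January 12, 2019</td><td>'
    '<font face=arial size=2>1,234</td><td>'
    '<font face=arial size=2>2,345</font></td></tr>\n'
    '<tr><td><font face=arial size=-1>January 11, 2019</td><td>'
    '<font face=arial size=2>987</td><td>'
    '<font face=arial size=2>1,002</font></td></tr>\n'
)

if __name__ == '__main__':
    date, visitors, flag_views = extract_columns(SAMPLE_HTML)
    # The three lists should have equal length, one entry per table row.
    print(date, visitors, flag_views)
```

If the three lists ever differ in length, one of the patterns is skipping rows, which is the symptom a pattern that rejects commas would produce.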

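The charts below show the visit counts up to January 12, 2019.

[Figure: line chart of blog visits]

Flag Counter view statistics

[Figure: Flag Counter views]

Difference between the two

[Figure: line chart of the difference in visits]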
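The plotting code is not included in this post. As a rough sketch, the difference series and the line charts could be produced along these lines, assuming matplotlib is installed. The function names `compute_diff` and `plot_lines` are hypothetical, and the direction of the difference (flag views minus visitors) is an assumption.

```python
def compute_diff(visitors, flag_views):
    """Per-day difference, flag views minus visitors (direction is an assumption)."""
    return [f - v for v, f in zip(visitors, flag_views)]


def plot_lines(date, series, out_file):
    """Draw one line per named series and save the figure to out_file."""
    # Imported here so compute_diff works without matplotlib installed.
    import matplotlib
    matplotlib.use('Agg')  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    x = list(range(len(date)))
    for label, values in series.items():
        ax.plot(x, values, label=label)
    # Label only a handful of ticks so the dates stay readable.
    step = max(1, len(date) // 8)
    ticks = x[::step]
    ax.set_xticks(ticks)
    ax.set_xticklabels([date[i] for i in ticks], rotation=45, ha='right')
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_file)
    plt.close(fig)
```

With the lists returned by `change_data`, a call such as `plot_lines(date, {'Visitors': visitors, 'Flag Counter Views': flagViews, 'Diff': compute_diff(visitors, flagViews)}, 'visits.png')` would write all three lines to one image; the output file name is only an example.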


Reposted from www.cnblogs.com/zpfbuaa/p/10261294.html