Designing Python crawlers to inflate blog view counts (views, likes, image scraping)

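Tools you need:

Python; download it from https://www.python.org/

The Fiddler packet-capture tool: http://blog.csdn.net/qq_21792169/article/details/51628123

How view-count boosting works: each time the page is opened, the blog's view count goes up by one. (Blogs on Sina, Sohu and similar platforms behave this way.)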

count.py

import webbrowser as web
import time
import os
import random

count = random.randint(1, 2)
j = 0
while j < count:
    i = 0
    while i <= 8:
        web.open_new_tab('http://blog.sina.com.cn/s/blog_552d7c620100aguu.html')  # replace with your URL
        i = i + 1
        time.sleep(3)  # seconds; adjust to how fast your machine is
    else:
        time.sleep(10)  # seconds; adjust to how fast your machine is
        os.system('taskkill /F /IM chrome.exe')  # Google Chrome; change the image name for other browsers
        #print 'time webbrower closed'
    j = j + 1

Boosting likes requires Fiddler to capture the request header data, such as Cookie, Host, Referer and User-Agent.

sina.py

import urllib.request
import sys

points = 2  # number of requests; the first command-line argument overrides it
if len(sys.argv) > 1:
    points = int(sys.argv[1])
aritcleUrl = ''
point_header = {
    'Accept': '*/*',
    'Cookie': '',  # fill in your cookie
    'Host': '',  # host
    'Referer': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
}
for i in range(points):
    point_request = urllib.request.Request(aritcleUrl, headers=point_header)
    point_response = urllib.request.urlopen(point_request)

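The header values above can be taken from the captured packets; the script only illustrates the idea.

Scraping images from a web page (this script and the later urllib2-based ones are written for Python 2):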

getimg.py

#coding=utf-8
import urllib
import urllib2
import re

def getHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(h.*?g)"'
    #reg = r'<img src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    print imglist
    x = 0
    for imgurl in imglist:
        # Save each image as 0.jpg, 1.jpg, ...
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1

html = getHtml("http://pic.yxdown.com/list/0_0_1.html")
getImg(html)

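1. The three symbols .*? together match any number of arbitrary characters, as few as possible (a non-greedy match).

2. \. escapes the '.', so it matches a literal dot in the HTML.

3. Parentheses ( ) mean that only the part inside them is captured; everything outside is dropped from the result.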
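The three regex rules above can be checked on a small string. The following is a minimal sketch in Python 3; the HTML fragment and the example.com URLs are made up for illustration and do not come from the page scraped by getimg.py.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
sample = '<img src="http://example.com/a.jpg"><img src="http://example.com/b.png">'

# Non-greedy .*? stops at the first 'g"' it can reach, so each URL is captured separately.
lazy = re.findall(r'src="(h.*?g)"', sample)

# Greedy .* runs on to the last 'g"', so both tags collapse into a single capture.
greedy = re.findall(r'src="(h.*g)"', sample)

# \. matches a literal dot, so only the URL ending in ".jpg" is captured.
jpg_only = re.findall(r'src="(.+?\.jpg)"', sample)

print(lazy)      # two separate URLs
print(greedy)    # one long string spanning both tags
print(jpg_only)  # only the .jpg URL
```

In every pattern the parentheses decide what findall returns: the surrounding src="..." text is matched but not included in the result.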

Scraping CSDN view counts: csdn.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import re

# Current page number of the blog's article list
page_num = 1
# Truthy while the current page is not the last list page
notLast = 1
fs = open('blogs.txt', 'w')
account = str(raw_input('Input csdn Account:'))
while notLast:
    # Blog home page
    baseUrl = 'http://blog.csdn.net/' + account
    # Append the page number to form the URL to fetch
    myUrl = baseUrl + '/article/list/' + str(page_num)
    # Pose as a browser; CSDN rejects direct requests
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    # Build the request
    req = urllib2.Request(myUrl, headers=headers)
    # Fetch the page
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
    # A page that still has a "尾页" (last page) link is not the last page
    notLast = re.findall('<a href=".*?">尾页</a>', myPage, re.S)
    print '-----------------------------Page %d---------------------------------' % (page_num,)
    fs.write('--------------------------------Page %d--------------------------------\n' % page_num)
    # Extract each article's href with a regular expression
    title_href = re.findall('<span class="link_title"><a href="(.*?)">', myPage, re.S)
    titleListhref = []
    for items in title_href:
        titleListhref.append(str(items).lstrip().rstrip())
    # Extract each article's title with a regular expression
    title = re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>', myPage, re.S)
    titleList = []
    for items in title:
        titleList.append(str(items).lstrip().rstrip())
    # Extract each article's view count with a regular expression
    view = re.findall('<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', myPage, re.S)
    viewList = []
    for items in view:
        viewList.append(str(items).lstrip().rstrip())
    # Output the results
    for n in range(len(titleList)):
        print 'views:%s href:%s title:%s' % (viewList[n].zfill(4), titleListhref[n], titleList[n])
        fs.write('views:%s\t\thref:%s\t\ttitle:%s\n' % (viewList[n].zfill(4), titleListhref[n], titleList[n]))
    # Move on to the next page
    page_num = page_num + 1
fs.close()

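This regular expression is not complete: if the blog has pinned articles, the captured titles carry an extra [置顶] ("pinned") marker, so a check should be added here. Readers can try this themselves.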
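One possible form of that check is sketched below in Python 3. The helper name strip_pinned is hypothetical, and the sketch assumes the marker shows up as plain text at the start of the captured title; if the real page wraps it in extra markup, the pattern would need adjusting.

```python
import re

def strip_pinned(title):
    """Remove a leading "[置顶]" (pinned) marker and surrounding whitespace."""
    return re.sub(r'^\s*\[置顶\]\s*', '', title).strip()

# Sample titles for illustration; the first one simulates a pinned article.
titles = ['[置顶] 手把手教你怎么创建自己的网站', '  结构体中定义函数指针 ']
cleaned = [strip_pinned(t) for t in titles]
print(cleaned)
```

Applied to each entry before it is appended to titleList, this would keep pinned and ordinary titles in the same format.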

Generating an IP list by hand, creat_ip:

#!/usr/bin/python
#-*- coding:utf-8 -*-
import time

time_start = time.time()

def get_ip(number=10, start='1.1.1.1'):
    file = open('ip_list.txt', 'w')
    starts = start.split('.')
    A = int(starts[0])
    B = int(starts[1])
    C = int(starts[2])
    D = int(starts[3])
    for A in range(A, 256):
        for B in range(B, 256):
            for C in range(C, 256):
                for D in range(D, 256):
                    ip = "%d.%d.%d.%d" % (A, B, C, D)
                    if number > 1:
                        file.write(ip + '\n')
                        number -= 1
                    elif number == 1:  # avoids a trailing newline after the last entry
                        file.write(ip)
                        number -= 1
                    else:
                        file.close()
                        print ip
                        return
                # Later octet ranges restart from 0 once the first pass is done
                D = 0
            C = 0
        B = 0

get_ip(100000, '101.23.228.102')
time_end = time.time()
elapsed = time_end - time_start
print 'Elapsed: %s s' % elapsed
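The same consecutive sequence can also be produced without nested loops. The sketch below uses Python 3 and the standard-library ipaddress module instead of the original four-level loop; the function name generate_ips is hypothetical.

```python
import ipaddress

def generate_ips(count, start='1.1.1.1'):
    """Yield `count` consecutive IPv4 addresses beginning at `start`."""
    first = ipaddress.IPv4Address(start)
    for offset in range(count):
        # Adding an int to an IPv4Address carries over into the next octet.
        yield str(first + offset)

ips = list(generate_ips(3, '101.23.228.254'))
print(ips)

# Joining with '\n' leaves no trailing newline after the last entry.
text = '\n'.join(ips)
```

Writing text to ip_list.txt gives the same one-address-per-line layout as the script above, and the carry from .255 to the next octet is handled by the address arithmetic.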

grab_ip.py scrapes a proxy-IP listing site and extracts the IPs and port numbers. How you use these IPs and ports depends on your own situation.

#!/usr/bin/python
#-*- coding:utf-8 -*-
import logging
import os
import random
import re
import time
import urllib
import urllib2

url = 'http://www.xicidaili.com/'
csdn_url = 'http://blog.csdn.net/qq_21792169/article/details/51628142'
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]

def getProxyHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def ipPortGain(html):
    # Capture (ip, port) pairs from adjacent table cells
    ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+\n.+>(\d{1,5})<')
    ip_port = re.findall(ip_re, html)
    return ip_port

def proxyIP(ip_port):
    # Convert to the form ['221.238.28.158:8081', '183.62.62.188:9999']
    proxyIP = []
    for i in range(0, len(ip_port)):
        proxyIP.append(':'.join(ip_port[i]))
        logging.info(proxyIP[i])
    # Convert to the form [{'http': 'http://221.238.28.158:8081'}, {'http': 'http://183.62.62.188:9999'}]
    proxy_list = []
    for i in range(0, len(proxyIP)):
        a0 = 'http://%s' % proxyIP[i]
        a1 = {'http': '%s' % a0}
        proxy_list.append(a1)
    return proxy_list

def csdn_Brush(ip):
    print ip

# Use ping to check whether an IP is alive
def ping_ip(ip):
    ping_cmd = 'ping -c 2 -w 5 %s' % ip
    ping_result = os.popen(ping_cmd).read()
    print 'ping_cmd : %s, ping_result : %r' % (ping_cmd, ping_result)
    if ping_result.find('100% packet loss') < 0:
        print 'ping %s ok' % ip
        return True
    else:
        print 'ping %s fail' % ip

fh = open('proxy_ip.txt', 'w')
html = getProxyHtml(url)
ip_port = ipPortGain(html)
proxy_list = proxyIP(ip_port)
for (ip, port), proxy_ip in zip(ip_port, proxy_list):
    ping_ip(ip)  # ping takes the bare IP, not the proxy dict
    fh.write('%s\n' % (proxy_ip,))
    res = urllib.urlopen(csdn_url, proxies=proxy_ip).read()
    # A for loop could be added here to request every article on the blog once
    # through this IP; total views = IPs * number of articles * number of processes.
    # (CSDN applies a time check with an interval of about half an hour, which is
    # why this script is paired with the C program.)
fh.close()

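That completes the view-count script. Run once, it is only a single process, and if that process hits a problem the whole program stops. A small C program takes care of restarting it.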

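#include <stdlib.h>

int main(int argc, char **argv)
{
    while (1)
    {
        /* Path of the Python script that boosts CSDN views */
        char *cmd = "python /home/book/csdn.py";
        /* Runs one process; if it fails, a new one starts immediately.
           One run of the script takes about half an hour, so CSDN's time
           check no longer has any effect.
           Daily views = IPs * number of articles * 24 * 2 */
        system(cmd);
    }
    return 0;
}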

csdn.py

import urllib2
import thread
import time

points = 200000
webstring = 'http://blog.csdn.net/qq_21792169/article/details/51461098'
aritcleUrl = webstring
point_header = {
    'Accept': '*/*',
    'Cookie': '',  # paste the Cookie value captured with Fiddler for your own session
    'Host': 'dc.csdn.net',
    'Referer': webstring,
    # 'Referer': 'http://blog.csdn.net/qq_21792169/article/details/51858371',
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
}

def test():
    for i in range(points):
        req = urllib2.Request(aritcleUrl, headers=point_header)
        page = urllib2.urlopen(req)
        print i

try:
    # The original listing repeated this call roughly fifty times;
    # the exact number is not significant.
    for _ in range(50):
        thread.start_new_thread(test, ())
except:
    print "Error: unable to start thread"

# Keep the main thread alive so the worker threads can run
while 1:
    time.sleep(1)
#html = page.read()
#print html

csdn_new.py

import urllib2
import thread
import re

points = 1
href = "href.html"
cnt = 0
point_header = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
}

def test1():
    input_file = open(href, "r")
    html = input_file.read()
    input_file.close()
    reg = r'href="(http://blog.csdn.net/qq_21792169/article/details/.+?)">'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        x = x + 1
        if (x > cnt):
            print "blog num %03d :%s" % (x, imgurl)
            for i in range(points):
                req = urllib2.Request(imgurl, headers=point_header)
                urllib2.urlopen(req)

try:
    thread.start_new_thread(test1, ())
except:
    print "Error: unable to start thread"

while 1:
    pass

href.html uses the following format:

<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50629515">手把手教你怎么创建自己的网站</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50596464">虚拟机 开发板 PC机 三者之间不能ping通的各种原因分析</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50503279">博客专栏HTML语言编写详解</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50465363">Linux驱动静态编译和动态编译方法详解</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50448639">多文件夹下编写Makefile详解</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50436089">结构体中定义函数指针</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50426701">交叉编译参数 -I -L -l 详解</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50420937">智能家居网络系统的设计(一)</a></li><p></p>
<li><a target="_blank" href="http://blog.csdn.net/qq_21792169/article/details/50418560">智能家居网络系统设计（二）</a></li><p></p>

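One last, more dependable approach: get hold of "肉鸡" (compromised remote machines) and run the script on them, which the author describes as safe and reliable.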

Sending QQ messages automatically: qq.vbs (copy the text you want to send, open the QQ chat window, then click this file)

Set WshShell= WScript.Createobject("WScript.Shell")
for i=1 to 100
WScript.Sleep 1000
WshShell.SendKeys"^v"
WshShell.SendKeys "%s"
next
