Web Scraping Basics (1): Scraping Static HTML Pages with Python

Environment:

Operating system: Windows 10 Pro, 64-bit  
Python: Anaconda2 (Python 2.7)  
Python packages: requests, beautifulsoup4 (imported as bs4), os (standard library)

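Newcomers to web scraping usually start with static HTML pages, since they are not hard to scrape and are easy to get started with. When you run into a function you have not seen before, look it up on Baidu, work out what the function does and what each parameter is for, and use it a few times; after that you will have a rough feel for how it works (that is how I got started as a beginner). Enough preamble, here is the code: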

# -*- coding: utf-8 -*-
"""
Created on Thu Apr 26 18:09:20 2018

@author: zww
"""

import requests
from bs4 import BeautifulSoup
import os

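# Proxy setting; with only an 'https' key, requests applies it to https:// URLs only.
# The address is just an example: replace it with a proxy that works for you, or drop the proxies argument.
proxies = {'https': 'http://41.118.132.69:4433'}
# Browser-like User-Agent header sent with every request
hd = {'User-Agent': "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"}
url = 'http://q.10jqka.com.cn/thshy/'   # page that lists the industries

req = requests.get(url, headers=hd, proxies=proxies)
#print req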

bs=BeautifulSoup(req.content,'html.parser')

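div_all=bs.find_all('div',attrs={'class':'cate_items'})  # find every div with class "cate_items"

a_all=[]
for infos in div_all:
    a_all.extend(infos.find_all('a'))  # collect the a tags from every div, not only the last one

for a in a_all[0:10]:         # only process the first 10 a tags
    ind_code=a.get('href')[-7:-1]  # slice the industry code out of the link (the 6 characters before the trailing '/')
    ind_name=a.text                 # the link text is the industry name
#    print ind_name
    ind_dir="E:\\test\\"+ind_name
    if not os.path.isdir(ind_dir):  # skip creation when the folder is already there from an earlier run
        os.makedirs(ind_dir)        # creates E:\test if needed, plus one folder named after the industry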



    for i in range(1,4):  # i runs from 1 to 3, so only the first 3 pages of industry news are scraped

        url1='http://news.10jqka.com.cn/list/field/'+ind_code+'/index_'+str(i)+'.shtml'
        req1=requests.get(url1,headers=hd,proxies=proxies)
        #print req1.text

        bs1=BeautifulSoup(req1.content,'html.parser')
        span_all=bs1.find_all('span',attrs={'class':'arc-title'})  # find every span with class "arc-title"
        
        for span in span_all:
            news_link=span.a.get('href')                             # news link from the first a tag in the span
            req2=requests.get(news_link,headers=hd,proxies=proxies)  # fetch the article page
            #print req2
            bs2=BeautifulSoup(req2.content,'html.parser')
            try:
                news_title=bs2.find('h2').text  # article title
            except AttributeError:              # no h2 on the page (find returned None): skip this article
                continue
            p_all=bs2.find_all('p')    # all p tags on the page
#            print news_title
            try:                      # open a text file named after the title inside the ind_name folder, for writing
                path1="E:\\test\\"+ind_name+"\\"+news_title+".txt"
                fo = open(path1, "w")
            except (IOError, OSError):  # e.g. the title contains characters not allowed in a file name: skip this article
                continue
            for p in p_all:
                news=p.text.encode('utf-8')   # paragraph text, encoded as UTF-8 bytes
                fo.write(news)  # write the paragraph to the text file
            fo.close()          # close the file
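
To get a feel for what `find_all`, `get`, and `.text` return before pointing them at a live site, it can help to run them on a small hand-written HTML string. The sketch below does that. The HTML snippet, the `example.com` links, and the industry names and codes in it are made up for illustration; they only imitate the shape the script above expects, namely an `href` that ends in a six-character code followed by `/`.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the structure the script looks for;
# the links, names, and codes are invented for this example.
html = """
<div class="cate_items">
  <a href="http://example.com/thshy/detail/code/881101/">Industry A</a>
  <a href="http://example.com/thshy/detail/code/881102/">Industry B</a>
</div>
<div class="other">
  <a href="http://example.com/ignored/">Ignored</a>
</div>
"""

bs = BeautifulSoup(html, 'html.parser')

industries = []
# find_all returns every matching tag; attrs filters on attribute values
for div in bs.find_all('div', attrs={'class': 'cate_items'}):
    for a in div.find_all('a'):
        href = a.get('href')   # value of the href attribute
        code = href[-7:-1]     # the six characters before the trailing '/'
        industries.append((code, a.text))

print(industries)
```

Each tuple pairs the sliced code with the link text, and the link inside the `other` div is left out because that div does not match the class filter.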

The code is commented in some detail. If anything is missing or wrong, please point it out.
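
One weak spot in the script is that an article is simply skipped when its title cannot be used as a file name. A possible alternative, sketched below, is to clean the title before building the path. The helper names `safe_filename` and `ensure_dir` are hypothetical and not part of the original script, the sample title is invented, and a temporary directory stands in for `E:\test` so the example can run anywhere. It does not cover reserved Windows device names such as `CON`.

```python
# -*- coding: utf-8 -*-
import os
import re
import tempfile

def safe_filename(title, replacement=u'_'):
    """Hypothetical helper: replace characters Windows forbids in file names."""
    # \ / : * ? " < > | and control characters are not allowed in Windows names
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Windows also rejects names that end in a space or a dot
    cleaned = cleaned.strip().rstrip(u'. ')
    return cleaned or u'untitled'

def ensure_dir(path):
    """Hypothetical helper: create path (and its parents) unless it already exists."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Usage: a temporary directory stands in for E:\test
base = ensure_dir(os.path.join(tempfile.mkdtemp(), u'test', u'Industry A'))
name = safe_filename(u'Q1 results: up 10%? "Yes"/no')
path = os.path.join(base, name + u'.txt')
with open(path, 'wb') as fo:   # binary mode, so the encoded bytes are written as-is
    fo.write(u'example body'.encode('utf-8'))
```

With a helper like this, the title would be passed through `safe_filename` before it is joined into `path1`, and the `try`/`except` around `open` would only be needed for other I/O failures.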
