[Practical Tools Crawler Series] Python Crawls Information Data

Series

1. [Practical Tools Crawler Series] Python Crawls Proxy IPs (Anti-'Anti-Crawler')
2. [Practical Tools Crawler Series] Python Crawls Information Data


Foreword

In a big data architecture, data collection and data storage occupy an extremely important position and can be considered the core foundation of big data. Crawler technology accounts for a large share of both of these core layers.

This article implements a simple and fast crawling method that uses proxy IPs. For how to obtain the proxy IPs, refer to my article [Practical Tools Crawler Series] Python Crawls Proxy IPs (Anti-'Anti-Crawler').



Proxy IP

Proxy IP website: xicidaili

For details, please refer to [Practical Tools Crawler Series] Python Crawls Proxy IPs (Anti-'Anti-Crawler').
The crawled proxy IP data is saved to 'proxy_ip.pkl'.
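
The crawl_proxy_ip module used below comes from that article. As a rough sketch only (the actual storage format is defined there), loading the pickled proxy list might look like this, assuming it holds a list of requests-style proxy dicts:

import pickle

def load_proxy_ip(path):
    # Assumed format: a pickled list of dicts such as
    # {'http': 'http://1.2.3.4:8080', 'https': 'https://1.2.3.4:8080'}
    with open(path, 'rb') as f:
        return pickle.load(f)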


Data Crawling Code

This article takes crawling a small amount of financial data as an example.

  • Website

    • Address: http://xxx

    • Fields to crawl
      url, title, click_number, html_content

    • Save the data as CSV; the format is as follows:
      url, title, click_number, html_content, crawl_time

  • Practice

    • Steps:
      1. Crawl the homepage and extract its URLs as the first layer
      2. Crawl the first-layer URLs and extract the second layer
      3. Crawl the second-layer URLs and extract the third layer
      4. End
  • Environment

    • pandas
    • python3
    • Ubuntu 16.04
    • requests

  • Code: crawl_finance_news.py

  • 1. Import dependencies
import crawl_proxy_ip  # local module from the proxy IP article; provides load_proxy_ip()
import pandas as pd
import re, time, sys, os, random
import telnetlib
import requests

  • 2. Global variables
global url_set
url_set = {}  # dict used as a set of already-seen URLs (values are unused)
  • 3. Core crawling code
def crawl_finance_news(start_url):
    
    # Extracted data format: url, title, click_number, html_content, crawl_time
    
    proxy_ip_list = crawl_proxy_ip.load_proxy_ip('proxy_ip.pkl')
    
    # Crawl the homepage
    start_html = crawl_web_data(start_url, proxy_ip_list)
    #open('tmp.txt', 'w').write(start_html)
    global url_set
    url_set[start_url] = 0
    
    # Extract the first-layer pages
    web_content_list = extract_web_content(start_html, proxy_ip_list)
    
    # Extract the second-layer pages
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
            if len(web_content_list) > 1000:  # for testing only
                break
    
    # Extract the third-layer pages
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
            if len(web_content_list) > 1000:  # for testing only
                break
            
    # Save the data
    columns = ['url', 'title', 'click_number', 'html_content', 'crawl_time']
    df = pd.DataFrame(columns = columns, data = web_content_list)
    df.to_csv('finance_data.csv', encoding='utf-8')
    print('data_len:', len(web_content_list))
    
    
def crawl_web_data(url, proxy_ip_list):

    # Bail out when no usable proxy remains, otherwise pick one at random
    if len(proxy_ip_list) == 0:
        return ''
    proxy_ip_dict = random.choice(proxy_ip_list)
    
    try:
        html = download_by_proxy(url, proxy_ip_dict)
        print(url, 'ok')
            
    except Exception as e:
        #print('50 e', e)
        # Remove the invalid proxy IP and retry with the remaining ones
        index = proxy_ip_list.index(proxy_ip_dict)
        proxy_ip_list.pop(index)
        print('proxy_ip_list', len(proxy_ip_list))
       	
        return crawl_web_data(url, proxy_ip_list)
        
    return html
    

def download_by_proxy(url, proxies):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.103 Safari/537.36', 'Connection':'close'}
    response = requests.get(url=url, proxies=proxies, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    html = response.text
    return html
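
# Note (assumption, not stated in the original article): requests expects the
# `proxies` argument to be a dict keyed by scheme, so each entry loaded from
# proxy_ip.pkl is presumably shaped like
#     {'http': 'http://1.2.3.4:8080', 'https': 'https://1.2.3.4:8080'}
# and download_by_proxy passes one such dict through unchanged.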
    

def extract_web_content(html, proxy_ip_list):

    # Extracted data format: url, title, click_number, html_content, crawl_time
    
    web_content_list = []
    
    html_content = html
    # Strip target="_blank" attributes and whitespace so the regex below matches reliably
    html = html.replace(' target ="_blank"', '')
    html = html.replace(' ', '')
    html = html.replace('\r', '')
    html = html.replace('\n', '')
    html = html.replace('\t', '')
    html = html.replace('"target="_blank', '')
    
    #<h3><a href="xxx/a/123.html" >证监会:拟对证券违法行为提高刑期上限</a></h3>
    res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)  # the URL must contain 'finance' (financial news only)
    while res is not None:
        url, title = res.groups()
        #print('url, title', url, title)
        global url_set
        if url in url_set:  # skip duplicate URLs
            html = html.replace('href="%s">%s<' %(url, title), '')
            res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
            continue
            
        else:
            url_set[url] = 0
        click_number = get_click_number(url, proxy_ip_list)
        #print('click_number', click_number)
        now_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        if len(click_number) == 0:  # only article pages (with a click count) keep the html content
            html_content = ''
        web_content_list.append([url, title, click_number, html_content, now_time])
        if len(web_content_list) > 200:  # test: crawl at most 200 items per page
            break
        
        html = html.replace('href="%s">%s<' %(url, title), '')
        res = re.search('href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
        
    return web_content_list
    
def get_click_number(url, proxy_ip_list):
    
    html = crawl_web_data(url, proxy_ip_list)
    #<span class="num ml5">4297</span>
    res = re.search(r'<span class="num ml5">(\d+)</span>', html)
    if res is not None:
        return res.groups()[0]
        
    return ''
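
# Optional alternative (an addition, not part of the original script): parsing the
# click count with BeautifulSoup is less brittle than the regex above, provided
# bs4 is installed (pip install beautifulsoup4).
def get_click_number_bs4(url, proxy_ip_list):
    from bs4 import BeautifulSoup
    html = crawl_web_data(url, proxy_ip_list)
    span = BeautifulSoup(html, 'html.parser').select_one('span.num.ml5')
    return span.get_text(strip=True) if span else ''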
    
  • 4. Test
if __name__ == '__main__':
    
    # Site: xxx/
    # Usage: python crawl_finance_news.py 'xxx/'
    if len(sys.argv) == 2:
        crawl_finance_news(sys.argv[1])
        
  • 5. Code description
    1. Crawl the proxy IPs first: python crawl_proxy_ip.py
    2. Crawl the financial news: python crawl_finance_news.py 'xxx/'
    3. This is just a test, so crawling stops after 1000 items
    4. The data is saved to finance_data.csv

  • 6. Crawled content
    (screenshot of the resulting finance_data.csv omitted)


Origin blog.csdn.net/zengNLP/article/details/126647866