Coin site log 1 - crawling blockchain news with a Python 3 crawler

Blockchain has been very hot lately, so I wanted to build something that crawls news and analysis media sites. But a media site always needs data sources, and where do those come from? Writing content myself is a topic for later; for now, let's crawl... Since this is all public information, no personal privacy is involved.
To start, I picked a few blockchain news sites:

  • ChainNews
  • 8btc
  • 55coin
  • Jinse Finance
  • Chainfor

The crawling rules are much the same for each of them, so I'll focus on one here; the following is the code for crawling Jinse Finance.
import urllib.request
import json
import _thread
import threading
import time
import mysql.connector
from pyquery import PyQuery as pq
import news_base

def url_open(url):
    # Fetch a URL with a browser-like User-Agent, retrying up to 10 times
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)
    return None  # all retries failed

def get_news(page_count, cb):
    time_utc = int(time.time())
    error_count = 0
    index = 0
    for i in range(1,page_count+1):
        #print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
        response = url_open("https://api.jinse.com/v6/information/list?catelogue_key=www&limit=23&information_id=%d&flag=down&version=9.9.9&_source=www"%(index))
        #print(response)
        json_data = json.loads(response)
        for item in json_data['list']:
            if item["type"] != 1 and item["type"] != 2:
                continue
            article_item = news_base.article_info(
                item["extra"]['author'],            # author
                int(item["extra"]["published_at"]), # publish time (UTC timestamp)
                item['title'],                      # title
                item["extra"]['summary'],           # summary
                'content',                          # placeholder, filled in below
                item["extra"]['topic_url'],         # source URL of the article
                "金色财金")                          # source media name
            source_response = url_open(article_item.source_addr)
            source_doc = pq(source_response)
            article_item.content = source_doc(".js-article-detail").html() if source_doc(".js-article-detail").html() else source_doc(".js-article").html()
            index = item['id']
            # count consecutive callback failures; stop after 5 in a row
            if not cb(article_item):
                error_count+=1
            else:
                error_count = 0
            if error_count >= 5:
                break
        if error_count >= 5:
            break
        #print(json_data['results'][0])
#def get_news(10)

#print(response)
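Before wiring in the database, the crawler can be smoke-tested with a callback that simply prints each article. This is just a sketch; it assumes the code above is saved as news_jinse.py (the module name used by the thread runner further down).

import news_jinse

def print_article(article):
    # print the article_info and report success so get_news moves on to the next one
    print(article)
    return True

if __name__ == "__main__":
    # crawl a single page of Jinse Finance and just print the results
    news_jinse.get_news(1, print_article)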

A few words about the libraries referenced here.
urllib.request is the tool used to fetch information over HTTP or HTTPS.
Because a single HTTP request has a fairly high chance of failing to open, I wrote a little wrapper function that I'm quite pleased with:

def url_open(url):
    # Fetch a URL with a browser-like User-Agent, retrying up to 10 times
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)
    return None  # all retries failed

It simply keeps retrying the request, up to 10 times in a row, so in practice almost every URL eventually opens, and it's very convenient to use.
PyQuery is a jQuery-like tool for parsing the fetched pages; a quick example of how it is used here follows.
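This is only a minimal sketch: it assumes the url_open wrapper above is in scope, and the article URL is just a placeholder, not a real link from the site.

from pyquery import PyQuery as pq

# fetch a (hypothetical) article page and pull out its body,
# the same way the Jinse crawler above does
html = url_open("https://www.jinse.com/blockchain/example.html")
doc = pq(html)
content = doc(".js-article-detail").html() or doc(".js-article").html()
print(content)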
mysql.connector is the tool used to persist the data to a MySQL database (a rough sketch of the db_base wrapper built on it appears after the thread-runner code below).
news_base exists because several sites are being crawled, so a common data structure is needed; it is shown below:

class article_info:
    # Common article structure shared by all the site crawlers
    def __init__(self, author, time_utc, title, desc, content, source_addr, source_media):
        self.author = author
        self.time_utc = time_utc
        self.title = title
        self.desc = desc
        self.content = content
        self.source_addr = source_addr
        self.source_media = source_media
    def __str__(self):
        # note: the literal string 'self.content' is printed here rather than the article body
        return("""==========================
author:%s
time_utc:%d
title:%s
desc:%s
content:%s
source_addr:%s
source_media:%s"""%(self.author, self.time_utc, self.title, self.desc, 'self.content', self.source_addr, self.source_media))

Crawling the news this way means making HTTP connections one after another and waiting for each result, which is very slow, so the crawlers need to run in multiple threads; since the work is almost entirely waiting on I/O, the speed goes up roughly in proportion to the number of threads. Here I open one thread per site; the code is shown below:

import db_base
import news_chainfor
import news_jinse
import news_8btc
import news_55coin
import news_chainnews
import threading

class myThread (threading.Thread):
    # Wrap a crawler function and its two arguments (page count, storage callback) in a thread
    def __init__(self, func, arg1, arg2):
        threading.Thread.__init__(self)
        self.func = func
        self.arg1 = arg1
        self.arg2 = arg2
    def run(self):
        print ("Starting thread: " + self.name)
        self.func(self.arg1, self.arg2)
        print ("Exiting thread: " + self.name)

def run():
    db_base.init_db()

    # one thread per site, each crawling 10 pages and storing articles via db_base.insert_article
    thread_list = [
        myThread(news_55coin.get_news, 10, db_base.insert_article),
        myThread(news_8btc.get_news, 10, db_base.insert_article),
        myThread(news_jinse.get_news, 10, db_base.insert_article),
        myThread(news_chainfor.get_news, 10, db_base.insert_article),
        myThread(news_chainnews.get_news, 10, db_base.insert_article)
        ]
    for t in thread_list:
        t.start()

    for t in thread_list:
        t.join()
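The db_base module used above isn't shown in the post, so here is only a rough sketch of what its init_db and insert_article could look like. The database name, table name, column layout, and credentials are all my own assumptions, not the author's actual schema (the real code lives in the GitHub repository linked at the end).

import threading
import mysql.connector

conn = None
db_lock = threading.Lock()  # a single shared connection, so serialize access across crawler threads

def init_db():
    # connect to a local MySQL instance; credentials and database name are placeholders
    global conn
    conn = mysql.connector.connect(user='root', password='secret',
                                   host='127.0.0.1', database='coin_news')

def insert_article(article):
    # store one article_info row; return True on success and False on failure,
    # which is what get_news uses to count consecutive failures
    with db_lock:
        try:
            cursor = conn.cursor()
            cursor.execute(
                "INSERT INTO article (author, time_utc, title, `desc`, content, source_addr, source_media) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s)",
                (article.author, article.time_utc, article.title, article.desc,
                 article.content, article.source_addr, article.source_media))
            conn.commit()
            cursor.close()
            return True
        except mysql.connector.Error as e:
            print("insert_article failed:", e)
            return False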

I hadn't used Python much before this, so I was learning as I wrote; the code may be a bit ugly, but please bear with my limited skills, hahaha.

The coin site is now online at www.bxiaozhan.com
All of the site's code (both front end and back end) is open source at https://github.com/lihn1987/CoinCollector
Feedback and pointers are welcome.
