Coin Station Log 1 - Crawling Blockchain News with Python 3
Blockchain is all the rage lately, so I wanted to crawl news and analysis from the relevant media sites. A media site always needs data sources, and where do those come from? You write the crawler yourself. What happens to the data afterwards is a story for later posts; first, the crawling. Since this is all public information, no personal privacy is involved.
To start, I picked a few blockchain news sites:
- ChainNews (链闻)
- 8btc (巴比特)
- 55coin (区势传媒)
- Jinse Finance (金色财经)
- ChainFor (链向财经)
The crawling logic is similar for all of them, so I'll focus on one here. Below is the code that crawls Jinse Finance:
```python
import urllib.request
import json
import time
from pyquery import PyQuery as pq

import news_base

def url_open(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                             'Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):  # retry up to 10 times before giving up
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)

def get_news(page_count, cb):
    error_count = 0
    index = 0  # id of the last article seen; 0 fetches the newest page
    for i in range(1, page_count + 1):
        response = url_open(
            "https://api.jinse.com/v6/information/list"
            "?catelogue_key=www&limit=23&information_id=%d"
            "&flag=down&version=9.9.9&_source=www" % index)
        json_data = json.loads(response)
        for item in json_data['list']:
            if item["type"] != 1 and item["type"] != 2:
                continue
            article_item = news_base.article_info(
                item["extra"]['author'],             # author
                int(item["extra"]["published_at"]),  # publish time (unix)
                item['title'],                       # title
                item["extra"]['summary'],            # summary
                'content',                           # placeholder, filled below
                item["extra"]['topic_url'],          # source URL
                "金色财金")                           # source media name
            source_responce = url_open(article_item.source_addr)
            source_doc = pq(source_responce)
            article_item.content = (source_doc(".js-article-detail").html()
                                    or source_doc(".js-article").html())
            index = item['id']
            if not cb(article_item):
                error_count += 1  # cb returns False when the insert fails
            else:
                error_count = 0
            if error_count >= 5:
                break
        if error_count >= 5:  # 5 consecutive failures: stop this site
            break
```
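The Jinse API above pages with `information_id` plus `flag=down`: each request passes the id of the last article seen, and `index` starts at 0 to fetch the newest page. A minimal sketch of that pagination loop, with a fake fetcher standing in for `url_open` (`paginate` and `fake_fetch` are my names, not part of the original code):

```python
import json

def paginate(fetch, page_count):
    """Walk a flag=down style feed: each page is requested with the
    id of the last item seen, starting from 0 (the newest items)."""
    index = 0
    seen = []
    for _ in range(page_count):
        # fetch stands in for url_open on the real API endpoint
        data = json.loads(fetch(index))
        if not data["list"]:
            break
        for item in data["list"]:
            seen.append(item["id"])
            index = item["id"]  # the next request continues below this id
    return seen

def fake_fetch(information_id):
    # imitates two pages of the feed, newest ids first
    pages = {0: [5, 4, 3], 3: [2, 1]}
    items = [{"id": i} for i in pages.get(information_id, [])]
    return json.dumps({"list": items})
```

Calling `paginate(fake_fetch, 3)` walks both fake pages and returns the ids `[5, 4, 3, 2, 1]`, which is the same traversal order the real crawler uses.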
A few notes on the libraries used.

urllib.request does the actual fetching over HTTP and HTTPS. Fetches fail often enough that I wrapped them in a helper function I'm quite happy with:
```python
def url_open(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
                             'Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):  # retry up to 10 times before giving up
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)
```
It simply keeps retrying, up to 10 times in a row, so in practice almost every URL eventually opens. Crude but effective. Note that if all 10 attempts fail it falls through and returns None, which callers need to be prepared for.
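The retry pattern generalizes beyond URL fetching. Here is a sketch of the same idea as a reusable wrapper (`with_retries` is my name, not from the original code), shown against a deliberately flaky function:

```python
import time

def with_retries(action, attempts=10, delay=0.0):
    """Retry action() up to `attempts` times; return its result on the
    first success, or None if every attempt raises (like url_open)."""
    for i in range(attempts):
        try:
            return action()
        except Exception as e:
            print("retry %d failed:" % (i + 1), e)
            time.sleep(delay)  # optional pause between attempts
    return None

calls = {"n": 0}
def flaky():
    # fails twice, then succeeds, imitating a shaky HTTP endpoint
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "ok"
```

`with_retries(flaky)` returns `"ok"` after the third attempt; a sleep between attempts (or exponential backoff) is gentler on the target site than the original tight loop.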
PyQuery is a jQuery-like tool for parsing and querying web pages.
mysql.connector handles persistent storage into the MySQL database.
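The `db_base.insert_article` callback used later is not shown in this post, so here is a hypothetical sketch of the shape it might take; the table name, column names, and `INSERT IGNORE` de-duplication are my assumptions, not from the original code:

```python
# Assumed schema: an `article` table with one column per article_info field.
INSERT_SQL = (
    "INSERT IGNORE INTO article "
    "(author, time_utc, title, `desc`, content, source_addr, source_media) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)

def article_params(a):
    """Turn an article_info-like object into the parameter tuple that
    cursor.execute(INSERT_SQL, params) expects."""
    return (a.author, a.time_utc, a.title, a.desc,
            a.content, a.source_addr, a.source_media)

# Actual use would be roughly:
#   conn = mysql.connector.connect(user=..., password=..., database=...)
#   cur = conn.cursor()
#   cur.execute(INSERT_SQL, article_params(article_item))
#   conn.commit()
```

Parameterized queries (`%s` placeholders) matter here: article titles and HTML bodies routinely contain quotes that would break naive string-formatted SQL.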
news_base exists because several sites are being crawled at once, so they need a common data structure, shown below:
```python
class article_info:
    def __init__(self, author, time_utc, title, desc, content, source_addr, source_media):
        self.author = author
        self.time_utc = time_utc
        self.title = title
        self.desc = desc
        self.content = content
        self.source_addr = source_addr
        self.source_media = source_media

    def __str__(self):
        # 'self.content' is printed as a literal on purpose: the full
        # HTML body would flood the log
        return ("""==========================
author:%s
time_utc:%d
title:%s
desc:%s
content:%s
source_addr:%s
source_media:%s""" % (self.author, self.time_utc, self.title, self.desc,
                      'self.content', self.source_addr, self.source_media))
```
Crawling like this means one HTTP request after another, each waiting for its result, which is very slow. Running the sites in parallel multiplies the throughput by roughly the number of sites, so I open one thread per site. The code is shown below:
```python
import threading

import db_base
import news_chainfor
import news_jinse
import news_8btc
import news_55coin
import news_chainnews

class myThread(threading.Thread):
    def __init__(self, func, arg1, arg2):
        threading.Thread.__init__(self)
        self.func = func
        self.arg1 = arg1
        self.arg2 = arg2

    def run(self):
        print("starting thread: " + self.name)
        self.func(self.arg1, self.arg2)
        print("exiting thread: " + self.name)

def run():
    db_base.init_db()
    thread_list = [
        myThread(news_55coin.get_news, 10, db_base.insert_article),
        myThread(news_8btc.get_news, 10, db_base.insert_article),
        myThread(news_jinse.get_news, 10, db_base.insert_article),
        myThread(news_chainfor.get_news, 10, db_base.insert_article),
        myThread(news_chainnews.get_news, 10, db_base.insert_article),
    ]
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()  # wait for every site crawler to finish
```
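The same start-all-then-join-all fan-out can also be written with the standard library's thread pool, which saves the Thread subclass. An alternative sketch, not the code the site actually runs; `get_news_stub` stands in for the real `news_*.get_news(page_count, cb)` functions:

```python
from concurrent.futures import ThreadPoolExecutor

def get_news_stub(site, pages, cb):
    # pretends to crawl `pages` pages of `site`, feeding each article
    # to the callback just like the real get_news functions do
    for page in range(pages):
        cb("%s-page-%d" % (site, page))

results = []
def collect(article):
    # stands in for db_base.insert_article; list.append is thread-safe
    results.append(article)
    return True

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(get_news_stub, site, 2, collect)
               for site in ["jinse", "8btc", "chainfor"]]
    for f in futures:
        f.result()  # propagates any exception, like join() plus error checking
```

One practical advantage over bare threads: `f.result()` re-raises an exception from inside a worker, whereas `Thread.join()` silently swallows it.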
I hadn't used Python much before this, so I'm learning as I go and the code is probably a bit ugly. Bear with me, haha.
Coin Station is now live at www.bxiaozhan.com
All of the site's code (front end and back end) is open source at https://github.com/lihn1987/CoinCollector
I hope you'll take a look.