Series
1. [Practical Crawler Tools Series] Python implements crawling proxy IPs (anti-"anti-crawler")
2. [Practical Crawler Tools Series] Python crawls information data
Foreword
In a big data architecture, data collection and data storage occupy extremely important positions; they can be called the core foundation of big data. Crawler technology accounts for a large share of both of these core layers.
This article implements a simple and fast crawling method that uses proxy IPs. The proxy IPs can be obtained by referring to my article [Practical Crawler Tools Series] Python implements crawling proxy IPs (anti-"anti-crawler").
Proxy IP
Proxy IP website: xicidaili
For details, please refer to [Practical Crawler Tools Series] Python implements crawling proxy IPs (anti-"anti-crawler").
The crawled proxy IP data is saved to 'proxy_ip.pkl'.
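Assuming 'proxy_ip.pkl' holds a pickled Python list of proxy dictionaries (as produced by the companion article's crawler), loading it back is short; a minimal sketch of what a `load_proxy_ip` helper could look like:

```python
import pickle

def load_proxy_ip(path='proxy_ip.pkl'):
    """Load the pickled list of proxy dicts saved by the proxy-IP crawler."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```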
Crawling code
This article takes crawling a small amount of financial data as an example.
- Website
  Address: http://xxx
- Crawled content
  url, title, click_number, html_content
- The data is saved as CSV in the following format:
  url, title, click_number, html_content, crawl_time
Practice
- Steps:
  1. Crawl the homepage and extract its URLs as the first layer
  2. Crawl the first-layer URLs to obtain the second layer
  3. Crawl the second-layer URLs to obtain the third layer
  4. End
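The layer-by-layer expansion above is essentially a breadth-first crawl with a depth limit. A generic sketch (the `fetch` and `extract` callables are placeholders for the downloading and parsing functions defined later):

```python
def crawl_layers(start_url, fetch, extract, max_depth=3, max_items=1000):
    """Breadth-first expansion: each layer crawls the URLs found in the previous one."""
    seen = {start_url}           # URLs already visited, to avoid duplicates
    items = []                   # rows of extracted content
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            # extract() yields (link, row) pairs found on the fetched page
            for link, row in extract(fetch(url)):
                if link not in seen:
                    seen.add(link)
                    items.append(row)
                    next_frontier.append(link)
                if len(items) >= max_items:  # stop early, as in the test run below
                    return items
        frontier = next_frontier
    return items
```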
Environment
- pandas
- python3
- Ubuntu 16.04
- requests
- Code: crawl_finance_news.py
- 1. Import dependencies

```python
import crawl_proxy_ip  # companion module from the proxy-IP article
import pandas as pd
import re, time, sys, os, random
import telnetlib
import requests
```
- 2. Global variables

```python
# url_set records every URL already crawled, to avoid duplicates
url_set = {}
```
- 3. Core crawling code

```python
def crawl_finance_news(start_url):
    # Extracted data format: url, title, click_number, html_content, crawl_time
    proxy_ip_list = crawl_proxy_ip.load_proxy_ip('proxy_ip.pkl')

    # Crawl the homepage
    start_html = crawl_web_data(start_url, proxy_ip_list)

    global url_set
    url_set[start_url] = 0

    # Extract the first layer of pages
    web_content_list = extract_web_content(start_html, proxy_ip_list)

    # Extract the second layer of pages
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
        if len(web_content_list) > 1000:  # test only: stop after 1000 items
            break

    # Extract the third layer of pages
    length = len(web_content_list)
    for i in range(length):
        if len(web_content_list[i][2]) == 0:
            html = crawl_web_data(web_content_list[i][0], proxy_ip_list)
            web_content_list += extract_web_content(html, proxy_ip_list)
        if len(web_content_list) > 1000:  # test only: stop after 1000 items
            break

    # Save the data
    columns = ['url', 'title', 'click_number', 'html_content', 'crawl_time']
    df = pd.DataFrame(columns=columns, data=web_content_list)
    df.to_csv('finance_data.csv', encoding='utf-8')
    print('data_len:', len(web_content_list))
```
```python
def crawl_web_data(url, proxy_ip_list):
    if len(proxy_ip_list) == 0:  # no usable proxies left
        return ''
    proxy_ip_dict = random.choice(proxy_ip_list)
    try:
        html = download_by_proxy(url, proxy_ip_dict)
        print(url, 'ok')
    except Exception as e:
        # Drop the invalid proxy IP and retry with the remaining ones
        index = proxy_ip_list.index(proxy_ip_dict)
        proxy_ip_list.pop(index)
        print('proxy_ip_list', len(proxy_ip_list))
        return crawl_web_data(url, proxy_ip_list)
    return html
```
```python
def download_by_proxy(url, proxies):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/55.0.2883.103 Safari/537.36',
        'Connection': 'close',
    }
    response = requests.get(url=url, proxies=proxies, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    html = response.text
    return html
```
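requests expects the `proxies` argument as a mapping from URL scheme to proxy address, and each `proxy_ip_dict` here is assumed to already be in that shape. A hypothetical helper (not part of the original code) that builds one from an IP and port:

```python
def make_proxies(ip, port):
    """Build the proxies mapping that requests.get(..., proxies=...) expects."""
    addr = 'http://%s:%s' % (ip, port)
    return {'http': addr, 'https': addr}
```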
```python
def extract_web_content(html, proxy_ip_list):
    # Extracted data format: url, title, click_number, html_content, crawl_time
    global url_set
    web_content_list = []
    html_content = html

    # Strip attributes and whitespace that would break the regex below
    html = html.replace(' target ="_blank"', '')
    html = html.replace(' ', '')
    html = html.replace('\r', '')
    html = html.replace('\n', '')
    html = html.replace('\t', '')
    html = html.replace('"target="_blank', '')

    # Example link: <h3><a href="xxx/a/123.html">CSRC: plans to raise the
    # sentencing cap for securities violations</a></h3>
    # 'finance' must appear in the URL so that only finance news is kept
    res = re.search(r'href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
    while res is not None:
        url, title = res.groups()
        if url in url_set:  # avoid duplicates
            html = html.replace('href="%s">%s<' % (url, title), '')
            res = re.search(r'href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
            continue
        else:
            url_set[url] = 0
        click_number = get_click_number(url, proxy_ip_list)
        now_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        if len(click_number) == 0:  # keep the page body only for real articles
            html_content = ''
        web_content_list.append([url, title, click_number, html_content, now_time])
        if len(web_content_list) > 200:  # test only: at most 200 items per page
            break
        html = html.replace('href="%s">%s<' % (url, title), '')
        res = re.search(r'href="(http[^"><]*finance[^"><]*)">([^<]*)<', html)
    return web_content_list
```
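The link-extraction regex can be checked in isolation. A small sketch against a sample anchor tag (the URL and headline below are made up for illustration):

```python
import re

# Same pattern as in extract_web_content: capture URL and link text,
# requiring 'finance' somewhere in the URL
LINK_PATTERN = r'href="(http[^"><]*finance[^"><]*)">([^<]*)<'

sample = '<h3><a href="http://example.com/finance/a/123.html">Some headline</a></h3>'
match = re.search(LINK_PATTERN, sample)
url, title = match.groups()
```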
```python
def get_click_number(url, proxy_ip_list):
    html = crawl_web_data(url, proxy_ip_list)
    # Example: <span class="num ml5">4297</span>
    res = re.search(r'<span class="num ml5">(\d+)</span>', html)
    if res is not None:
        return res.groups()[0]
    return ''
```
- 4. Test

```python
if __name__ == '__main__':
    # Site: xxx/
    # Usage: python crawl_finance_news.py 'xxx/'
    if len(sys.argv) == 2:
        crawl_finance_news(sys.argv[1])
```
- 5. Code description
  1. Crawl the proxy IPs first: python crawl_proxy_ip.py
  2. Crawl the financial news: python crawl_finance_news.py 'xxx/'
  3. This is just a test; crawling stops after 1000 items
  4. The data is saved to: finance_data.csv
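Note that `df.to_csv('finance_data.csv', encoding='utf-8')` is called without `index=False`, so the file carries the DataFrame index as an unnamed first column. A sketch of reading the result back (using an in-memory sample in the same schema rather than the real file):

```python
import io
import pandas as pd

# Two-line sample mimicking the layout finance_data.csv is written in
sample = io.StringIO(
    ',url,title,click_number,html_content,crawl_time\n'
    '0,http://example.com/finance/1.html,Title A,4297,,2020-01-01 12:00:00\n'
)
df = pd.read_csv(sample, index_col=0)
```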