This original article may be reprinted; please credit the source (IT人的故事). Thanks!
Original link:「docker实战篇」Python crawling (19) — grabbing Douyin web-side data. Why grab this data? Suppose a fresh-produce e-commerce company wants more exposure for its products: the boss decides to put part of the marketing budget into traffic advertising, and Douyin, with its very large data flow, looks like a good place to try running ads and see whether they bring profit. To do that, they analyze Douyin data — follower counts, nicknames, like counts and so on — to build portraits of Douyin users and match user groups against the company's products, so the videos promote the products to the users who actually like them. Some PR companies can also use this data to find internet celebrities and put together marketing packages. Source: https://github.com/limingios/dockerpython.git (douyin)
The Douyin share page
- Install Google's XPath Helper tool
Get the crx from the source repository.
Open chrome://extensions/ in the Google browser.
Drag xpath-helper.crx directly onto the chrome://extensions/ page.
After the installation succeeds,
the shortcut Ctrl+Shift+X toggles XPath Helper; it is normally used together with the F12 Google developer tools.
Start crawling the Douyin website data with Python
Analyze the share page https://www.douyin.com/share/user/76055758243
1. Douyin has an anti-crawling mechanism: the digits in the Douyin ID are rendered as icon-font characters, which have to be substituted back into plain digits.
``` python
{'name':['  ','  ','  '],'value':0},
{'name':['  ','  ','  '],'value':1},
{'name':['  ','  ','  '],'value':2},
{'name':['  ','  ','  '],'value':3},
{'name':['  ','  ','  '],'value':4},
{'name':['  ','  ','  '],'value':5},
{'name':['  ','  ','  '],'value':6},
{'name':['  ','  ','  '],'value':7},
{'name':['  ','  ','  '],'value':8},
{'name':['  ','  ','  '],'value':9},
```
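The substitution step behind this table can be sketched as follows. The glyph names here are hypothetical placeholders; the real pages use icon-font characters (elided in the table above) that vary per page:

```python
import re

# Hypothetical glyph-to-digit table; the real pages use icon-font
# characters instead of these placeholder names.
regex_list = [
    {'name': ['num_a', 'num_b'], 'value': 0},
    {'name': ['num_c', 'num_d'], 'value': 1},
]

def decode_digits(html):
    # Replace every glyph alias with its digit so the parser sees plain numbers.
    for entry in regex_list:
        for glyph in entry['name']:
            html = re.sub(glyph, str(entry['value']), html)
    return html

print(decode_digits('num_a num_c num_b'))  # -> 0 1 0
```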
2. Get the XPath of each required node.
```
# Nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()
# Douyin ID
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()
# Job
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()
# Description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()
# Location
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()
# Star sign
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()
# Following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()
# Fan count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()
# Like count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
```
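As a sanity check, these XPath expressions can be tried against a minimal HTML fragment that mimics the share page's class structure. The markup below is a simplified assumption for illustration, not the real page:

```python
from lxml import etree

# Simplified stand-in for the share page's personal-card markup.
html = etree.HTML("""
<div class='personal-card'>
  <div class='info1'><p class='nickname'>demo_user</p></div>
  <div class='info2'>
    <p class='signature'>hello</p>
    <p class='extra-info'><span>Beijing</span><span>Leo</span></p>
  </div>
</div>
""")

nickname = html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
location = html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
print(nickname, location)  # -> demo_user Beijing
```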
- Full code
``` python
import re
import requests
import time
from lxml import etree


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Icon-font glyphs that stand in for each digit (elided here).
    regex_list = [
        {'name':[' ',' ',' '],'value':0},
        {'name':[' ',' ',' '],'value':1},
        {'name':[' ',' ',' '],'value':2},
        {'name':[' ',' ',' '],'value':3},
        {'name':[' ',' ',' '],'value':4},
        {'name':[' ',' ',' '],'value':5},
        {'name':[' ',' ',' '],'value':6},
        {'name':[' ',' ',' '],'value':7},
        {'name':[' ',' ',' '],'value':8},
        {'name':[' ',' ',' '],'value':9},
    ]
    # Replace every glyph with its digit so the page contains plain numbers.
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # Not every account carries verify info.
        pass
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        # The last joined digit is the decimal place: 1235 -> 123.5w
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)


def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/' + share_id
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, share_id)


if __name__ == '__main__':
    while True:
        share_id = "76055758243"
        if share_id is None:
            print('Current task: %s' % share_id)
            break
        else:
            print('Current task: %s' % share_id)
            handle_douyin_web_share(share_id)
            time.sleep(2)
```
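The fans/likes handling above relies on the icon digits joining into one integer whose last digit is the decimal place: when the unit span ends in `w` (万, i.e. 10,000), dividing the joined value by 10 restores the displayed number. A minimal sketch of that step, with a hypothetical helper name:

```python
def format_count(joined_digits, unit):
    # joined_digits: icon-font digits concatenated, e.g. '1235' for '123.5w'
    # unit: trailing unit text scraped from the num span
    if unit.strip() == 'w':
        return str(int(joined_digits) / 10) + 'w'
    return joined_digits

print(format_count('1235', 'w'))  # -> 123.5w
```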
![](https://upload-images.jianshu.io/upload_images/11223715-651b910cb91c1c8d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
#### mongodb
>Create mongodb in a vagrant-generated virtual machine; for details see
「docker实战篇」python的docker爬虫技术-python脚本app抓取(13)
``` bash
su -
# password: vagrant
```
- docker
>https://hub.docker.com/r/bitnami/mongodb
>default port: 27017
``` bash
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
# stop the firewall
systemctl stop firewalld
```
- Working with mongodb
Read the txt file to get the userId values.
``` python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/1/30 19:35
# @Author : Aries
# @Site :
# @File : handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            task_id_collections.insert(task_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


# handle_init_task()
```
- Modify the python program to call it
``` python
import re
import requests
import time
from lxml import etree
from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin


def handle_decode(input_data, share_web_url, task):
    search_douyin_str = re.compile(r'抖音ID:')
    # Icon-font glyphs that stand in for each digit (elided here).
    regex_list = [
        {'name':[' ',' ',' '],'value':0},
        {'name':[' ',' ',' '],'value':1},
        {'name':[' ',' ',' '],'value':2},
        {'name':[' ',' ',' '],'value':3},
        {'name':[' ',' ',' '],'value':4},
        {'name':[' ',' ',' '],'value':5},
        {'name':[' ',' ',' '],'value':6},
        {'name':[' ',' ',' '],'value':7},
        {'name':[' ',' ',' '],'value':8},
        {'name':[' ',' ',' '],'value':9},
    ]
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # Not every account carries verify info.
        pass
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    douyin_info['from_url'] = share_web_url
    print(douyin_info)
    handle_insert_douyin(douyin_info)


def handle_douyin_web_share(task):
    share_web_url = 'https://www.douyin.com/share/user/' + task["share_id"]
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, task["share_id"])


if __name__ == '__main__':
    while True:
        task = handle_get_task()
        if task is None:
            # Stop once the task queue in mongodb is empty.
            break
        handle_douyin_web_share(task)
        time.sleep(2)
```
* mongodb fields
>handle_init_task stores the txt entries into mongodb
>handle_get_task fetches one record and deletes it; since the txt file still exists, deleting records is no problem
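The find_one_and_delete pattern turns the task_id collection into a simple work queue: each call atomically pops one pending task, and the crawler stops when nothing is left. The queue semantics can be illustrated without MongoDB using an in-memory stand-in (the names below are illustrative, not part of the real script):

```python
from collections import deque

# In-memory stand-in for the task_id collection.
tasks = deque([{'share_id': '76055758243'}, {'share_id': '88888888888'}])

def handle_get_task_demo():
    # Mirrors find_one_and_delete: return one task and remove it,
    # or None when the queue is empty.
    return tasks.popleft() if tasks else None

print(handle_get_task_demo()['share_id'])  # -> 76055758243
print(len(tasks))  # -> 1
```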
``` python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/1/30 19:35
# @Author : Aries
# @Site :
# @File : handle_mongo.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            task_id_collections.insert(task_info)


def handle_insert_douyin(douyin_info):
    task_id_collections = Collection(db, 'douyin_info')
    task_id_collections.insert(douyin_info)


def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


handle_init_task()
```
PS: The roughly 1,000 IDs in the txt file are far too little data. In practice the app side has to be crawled as well; crawling only the PC web side is not enough. The PC side is responsible for initialization, getting user IDs through fan lists, and then the cycle continues from there; otherwise a really large amount of data cannot be obtained.