BeautifulSoup basics in practice
Installation: pip install beautifulsoup4
Common operations:
from bs4 import BeautifulSoup as bs
import urllib.request
data=urllib.request.urlopen("https://www.cnblogs.com/mcq1999/").read().decode("utf-8","ignore")
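bs1=bs(data,"html.parser") #name the parser explicitly, otherwise bs4 guesses one and warns
print(bs1.prettify()) #pretty-printed output
print(bs1.title) #get the title tag: bs object.tag name
print(bs1.title.string) #get the text of the title tag
print(bs1.title.name) #get the tag name, e.g. title
print(bs1.a.attrs) #get the attributes of the first a tag as key-value pairs
print(bs1.a['name']) #get the value of one attribute (raises KeyError if the attribute is missing)
print(bs1.find_all('a')) #extract all nodes of one kind; the argument is the tag name
print('---------------------------------')
print(bs1.find_all(['a','ul'])) #a list matches any of the given tag names
k1=bs1.ul.contents #all direct children of the current node, returned as a list
k2=bs1.ul.children #the same children, returned as an iterator
allulc=[i for i in k2]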
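The calls above depend on whatever the live page contains at the moment. As a minimal offline sketch, the same operations can be tried on a small HTML string; the markup below is made up for illustration and is not taken from the blog page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample page, invented for this example only.
html = """
<html><head><title>Demo page</title></head>
<body>
<a name="top" href="/home">Home</a>
<a href="/about">About</a>
<ul><li>one</li><li>two</li></ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string          # text inside <title>
first_link_attrs = soup.a.attrs         # attributes of the first <a>, as a dict
first_link_name = soup.a.get("name")    # .get() returns None instead of raising
second_link_name = soup.find_all("a")[1].get("name")  # this <a> has no name
hrefs = [a.get("href") for a in soup.find_all("a")]

# .children yields every direct child node; keep only the tags.
items = [child.get_text() for child in soup.ul.children if isinstance(child, Tag)]

print(title_text, first_link_attrs, hrefs, items)
```

Using .get() rather than ['name'] is a design choice for pages where not every a tag carries the attribute.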
PhantomJS basics in practice
It is not very efficient, but it gets around many anti-crawling measures. It is essentially a browser without a graphical interface, driven from the command line (or from Python). Usually only the difficult parts are written with PhantomJS, and the data is then handed to urllib or Scrapy for further processing.
PhantomJS and Selenium have parted ways (Selenium no longer supports PhantomJS), so I will learn this later.
Docker basics for distributed crawlers
Image: its content cannot be changed
Container: its content can be changed; it is comparable to a virtual machine, and containers are isolated from each other by default
Advantages: lightweight, cost-saving, easy to deploy and migrate
Installation: yum -y install docker
Startup and Shutdown:
systemctl start docker
systemctl stop docker
If startup fails:
Refer to the blog post below. Following it worked for me, while the other methods I tried did not:
https://blog.csdn.net/w1316022737/article/details/83692701
It is also best to switch Docker to a registry mirror, otherwise pulls run very slowly:
https://blog.csdn.net/julien71/article/details/79760919
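The exact steps in the linked post are not reproduced here. A common way to set a mirror, and the one assumed in this sketch, is a "registry-mirrors" list in /etc/docker/daemon.json, followed by restarting Docker. The helper name and the URL below are placeholders, not values from the post.

```python
import json

def with_registry_mirror(config_text, mirror_url):
    """Return daemon.json text with mirror_url listed under "registry-mirrors".

    config_text is the current file content ("" if the file does not exist yet).
    Existing settings are kept, and the mirror is not added twice.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    mirrors = config.setdefault("registry-mirrors", [])
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    return json.dumps(config, indent=2)

# Placeholder URL, not a real mirror: substitute the one you actually use.
print(with_registry_mirror("", "https://mirror.example.com"))
```

The function only produces the text; writing it to /etc/docker/daemon.json and restarting the service (systemctl restart docker) is left as a manual step.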
List existing images: docker images
Download an image: docker pull
Create a container: docker run -tid
List containers: docker ps -a
Enter a container: docker attach
Exit a container: generally do not use exit, because it stops the container. Press Ctrl+P and then Ctrl+Q to detach instead.
Operations inside the container do not affect the host; it is like opening a virtual machine inside a virtual machine
Start container: docker start ...
Package a container as an image: docker commit 2d6 mytest:v1
Create a named container from an image: docker run -tid --name testabc a2a (creates a container named testabc from the image a2a)
docker run -tid --name h1 mytest:v1
docker run -tid --name h2 --link h1 mytest:v1 (links container h2 to h1, so that h2 can communicate with h1)
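For a crawler spread over several containers, typing these commands by hand gets repetitive. Below is a minimal sketch of scripting the two docker run commands above from Python with subprocess; the function names are made up for illustration, and the names h1, h2 and mytest:v1 are the ones used above.

```python
import subprocess

def docker_run_cmd(name, image, link=None):
    """Build the argument list for: docker run -tid --name NAME [--link LINK] IMAGE"""
    cmd = ["docker", "run", "-tid", "--name", name]
    if link:
        cmd += ["--link", link]
    cmd.append(image)
    return cmd

def start_linked_pair(image="mytest:v1", run=subprocess.run):
    """Start h1 first, then h2 linked to h1. Returns the commands that were run."""
    cmds = [docker_run_cmd("h1", image), docker_run_cmd("h2", image, link="h1")]
    for cmd in cmds:
        run(cmd, check=True)  # raises CalledProcessError if docker fails
    return cmds

# Only print the command here; call start_linked_pair() on a host with Docker.
print(" ".join(docker_run_cmd("h2", "mytest:v1", link="h1")))
```

h1 has to be started before h2, since --link refers to an existing container.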
With the ubuntu image I found that commands such as ping and yum were missing, so I switched to the centos image.
[root@hadoop106 mcq]# docker attach fe3
[root@fe3489945006 /]# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.6 c1 4c3dab0e013c
172.17.0.7 fe3489945006
[root@fe3489945006 /]# ping 172.17.0.6
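The hosts file above is where the link shows up: the linked container appears as the entry c1 next to its IP address. As a small sketch, a crawler running inside the container could look up that address by name instead of hard-coding it; parse_hosts is a hypothetical helper, and the sample text is copied from the output above.

```python
def parse_hosts(text):
    """Map every hostname or alias in /etc/hosts-style text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # the first entry for a name wins
    return table

sample = """\
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
172.17.0.6 c1 4c3dab0e013c
172.17.0.7 fe3489945006
"""

hosts = parse_hosts(sample)
print(hosts["c1"])
```

Inside a real container, the text would come from open("/etc/hosts").read() instead of the sample string.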
Save a Docker image to a file: docker save -o /mytest.tar c3e8