Python crawler (5) - BeautifulSoup & Docker basics

BeautifulSoup basics in practice

Installation: pip install beautifulsoup4

Common commands:

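from bs4 import BeautifulSoup as bs
import urllib.request

data = urllib.request.urlopen("https://www.cnblogs.com/mcq1999/").read().decode("utf-8", "ignore")
bs1 = bs(data, "html.parser")  # name the parser explicitly to avoid a warning
print(bs1.prettify())  # formatted output
print(bs1.title)  # get the title tag: <bs object>.<tag name>
print(bs1.title.string)  # get the text of the title tag
print(bs1.title.name)  # get the tag name, e.g. title
print(bs1.a.attrs)  # get the attributes of the first a tag as key-value pairs
print(bs1.a.get('name'))  # get the value of one attribute (None if it is absent)
print(bs1.find_all('a'))  # extract all nodes with the given tag name
print('---------------------------------')
print(bs1.find_all(['a', 'ul']))  # a list of tag names matches any of them
k1 = bs1.ul.contents  # all direct children of the current node, returned as a list
k2 = bs1.ul.children  # the same children, returned as an iterator
allulc = [i for i in k2]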
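
The same calls can be tried without network access. The following is a minimal sketch on a made-up HTML string (the markup and variable names are assumptions for illustration, not taken from the page above); it shows attribute lookup, find_all, and that .contents and .children walk the same child nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<a name="top" href="/first">First</a>
<ul><li>one</li><li>two</li></ul>
<a href="/second">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string        # text inside <title>
first_link_attrs = soup.a.attrs       # attribute dict of the first <a>
first_link_name = soup.a.get("name")  # .get() returns None instead of raising KeyError
hrefs = [a.get("href") for a in soup.find_all("a")]
list_items = [li.string for li in soup.ul.find_all("li")]

# .contents is a list; .children iterates over the same child nodes
same_children = list(soup.ul.children) == soup.ul.contents

print(title_text, first_link_attrs, first_link_name)
print(hrefs, list_items, same_children)
```

Using .get("name") rather than ['name'] is a small safety choice: many a tags have no name attribute, and indexing would raise KeyError on them.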

PhantomJS basics in practice

It is not very efficient, but it gets around a lot of anti-crawling measures. It is essentially a browser without a graphical interface, driven from the command line (or from Python). Generally the difficult part is handled with PhantomJS, and the data is then handed to urllib or scrapy for subsequent processing.

PhantomJS and Selenium have parted ways (Selenium no longer supports PhantomJS), so I will come back to this later.

Docker basics for a distributed crawler

Image: its content cannot be changed

Container: its content can be changed; roughly equivalent to a virtual machine; containers are isolated from each other by default

Advantages: lightweight, cost-saving, easy to deploy and migrate

Installation: yum -y install docker

Startup and Shutdown:

systemctl start docker

systemctl stop docker

If startup fails:

Refer to the blog post below. It worked for me, while the other methods I tried did not.

https://blog.csdn.net/w1316022737/article/details/83692701

It is also best to switch Docker to a registry mirror, otherwise pulling images is very slow:

https://blog.csdn.net/julien71/article/details/79760919

View existing images: docker images

Download an image: docker pull

Create a container: docker run -tid

View container: docker ps -a

Enter a container: docker attach

Exit a container: generally do not type exit, because that stops the container. Use ctrl + p + q to detach instead.

Operations inside the container do not affect the host; it is like opening another virtual machine inside a virtual machine

Start container: docker start ...

Package a container as an image: docker commit 2d6 mytest:v1

Create a named container from an image: docker run -tid --name testabc a2a (creates a container named testabc based on the image a2a)

docker run -tid --name h1 mytest:v1

docker run -tid --name h2 --link h1 mytest:v1 (links container h2 to h1, so that h2 can communicate with h1)
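
When a crawler is driven from Python, the two docker run commands above could be assembled in code instead of typed by hand. This is only a sketch: docker_run_args is a hypothetical helper name, and it builds the argument list without calling Docker, so it can be checked on a machine that has no Docker installed.

```python
import shlex


def docker_run_args(image, name=None, link=None):
    """Build the argument list for `docker run -tid`.

    Hypothetical helper for illustration; it only assembles the command
    and does not start anything.
    """
    args = ["docker", "run", "-tid"]
    if name:
        args += ["--name", name]
    if link:
        args += ["--link", link]
    args.append(image)
    return args


h1_cmd = docker_run_args("mytest:v1", name="h1")
h2_cmd = docker_run_args("mytest:v1", name="h2", link="h1")

print(shlex.join(h1_cmd))  # docker run -tid --name h1 mytest:v1
print(shlex.join(h2_cmd))  # docker run -tid --name h2 --link h1 mytest:v1
```

To actually start the containers, each list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems.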

I found that the ubuntu image does not include commands such as ping and yum, so I switched to the centos image.

[root@hadoop106 mcq]# docker attach fe3
[root@fe3489945006 /]# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.6 c1 4c3dab0e013c
172.17.0.7 fe3489945006
[root@fe3489945006 /]# ping 172.17.0.6
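
In the output above, the linked container shows up in /etc/hosts under its alias c1, so a crawler node could look up its peer by name instead of hard-coding the IP. A minimal sketch, assuming the file content is already available as a string (parse_hosts is a hypothetical helper, and the sample text is an excerpt of the transcript above):

```python
def parse_hosts(text):
    """Map every hostname in /etc/hosts-style text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue  # skip blank lines
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


# Excerpt of the /etc/hosts output shown in the transcript above
hosts_text = """\
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
172.17.0.6 c1 4c3dab0e013c
172.17.0.7 fe3489945006
"""

hosts = parse_hosts(hosts_text)
print(hosts["c1"])  # 172.17.0.6
```

Inside the container itself, the text could be read with open("/etc/hosts").read() before being passed to parse_hosts.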

Export an image to a file: docker save -o /mytest.tar c3e8

Origin www.cnblogs.com/mcq1999/p/11469119.html