Python crawler library 1: using BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML and XML files. In simple terms, it parses an HTML document into a tree structure (a web page is inherently a tree), from which you can then read specific tags and their attributes.

With Beautiful Soup, we can pass a class or id value as a parameter and directly obtain the data of the matching tags. It is one of the most commonly used libraries for Python crawlers. This article uses a Python 3 environment.

Content outline:

  1. Installation
  2. Call beautifulsoup4 (bs4)
  3. Page access. Fetch the page and convert it to a bs4 object
  4. Crawling. Get the elements of the bs4 object

The recommended environment is Anaconda + VS Code.

1. Install beautifulsoup4

In the VS Code terminal, run pip install beautifulsoup4. The urllib library is part of the Python 3 standard library, so it does not need to be installed with pip.

2. Call bs4

After the installation is complete, try importing the library:

from bs4 import BeautifulSoup

If there is no error, the library has been installed correctly.

3. Page access

This article uses the web page http://reeoo.com as its example, as shown in the figure below.

Import urllib.request first, build the request with Request, fetch the response with urlopen, and then pass the page content to the BeautifulSoup constructor:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://reeoo.com'

request = urllib.request.Request(url)

response = urllib.request.urlopen(request, timeout=20)

content = response.read()

soup = BeautifulSoup(content, 'html.parser')

Passing a document to the BeautifulSoup constructor returns a document object, which is Beautiful Soup's own object format. The document obtained by requesting the URL above looks like this:

" rel="EditURI" title="RSD" type="application/rsd+xml"/>
<link href="http://reeoo.com/wp-includes/wlwmanifest.xml" rel="wlwmanifest" type="application/wlwmanifest+xml"/>
<meta content="WordPress 4.9.8" name="generator"/>
</link></meta></meta></meta></meta></meta></meta></head>
<body>
<header id="header">
<div id="main_menu">
<div class="box">
<h1 id="logo"><a href="https://reeoo.com" title="Web design inspiration and gallery"><span class="icon-reeoo"></span></a></h1>
<ul>
<li class="active" id="link_web"><a href="https://reeoo.com" title="Web Design Gallery">Web Design</a></li>
<li id="link_iphone"><a href="https://iphone.reeoo.com" title="iPhone Patterns">iPhone App</a></li>
<li id="link_ipad"><a href="https://ipad.reeoo.com" title="iPad Patterns">iPad App</a></li>
<li id="link_icon"><a href="https://icon.reeoo.com" title="iOS Icon Design">Icon</a></li>
<li id="link_designer"><a href="https://designer.reeoo.com" title="Designer Show">Designer</a></li>
<li id="link_download"><a href="https://download.reeoo.com" title="Design resources download">Download</a></li>
</ul>
<div id="more">
<div id="search">
<span class="icon-search"></span>
<form action="https://reeoo.com" id="searchform" method="get">
<input id="s" name="s" placeholder="Search name or tag" required="" size="20" type="text" value=""/>
</form>
</div>
<div id="contact"><a href="http://weibo.com/reeoocom" target="_blank"><span class="icon-weibo"></span></a><a href="https://twitter.com/reeoocom" target="_blank"><span class="icon-twitter"></span></a><a href="mailto:[email protected]" target="_blank"><span class="icon-email"></span></a></div>
</div>
</div>
</div>
<div id="submenu">
<div class="box">
<div class="menu-color-menu-container"><ul class="menu" id="menu-color-menu"><li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3865" id="menu-item-3865"><a href="https://reeoo.com/category/black" title="Black Web Design">Black</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3866" id="menu-item-3866"><a href="https://reeoo.com/category/blue" title="Blue Web Design">Blue</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3867" id="menu-item-3867"><a href="https://reeoo.com/category/brown" title="Brown Web Design">Brown</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3869" id="menu-item-3869"><a href="https://reeoo.com/category/green" title="Green Web Design">Green</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3868" id="menu-item-3868"><a href="https://reeoo.com/category/gray" title="Gray Web Design">Gray</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3871" id="menu-item-3871"><a href="https://reeoo.com/category/orange" title="Orange Web Design">Orange</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3872" id="menu-item-3872"><a href="https://reeoo.com/category/purple" title="Purple Web Design">Purple</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-13232" id="menu-item-13232"><a href="https://reeoo.com/category/pink">Pink</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3873" id="menu-item-3873"><a href="https://reeoo.com/category/red" title="Red Web Design">Red</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3874" id="menu-item-3874"><a href="https://reeoo.com/category/white" title="White Web Design">White</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3875" id="menu-item-3875"><a href="https://reeoo.com/category/yellow" title="Yellow Web Design">Yellow</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3870" id="menu-item-3870"><a href="https://reeoo.com/category/multicolored" title="Multicolored Web Design">Multicolored</a></li>
</ul></div> <div class="filter">
<span class="icon-category"></span>
<div class="menu-header-menu-container"><ul class="menu" id="menu-header-menu"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item menu-item-11736" id="menu-item-11736"><a href="http://reeoo.com/">All</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11737" id="menu-item-11737"><a href="http://reeoo.com/?s=app">App</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11750" id="menu-item-11750"><a href="http://reeoo.com/tag/software">Software</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11754" id="menu-item-11754"><a href="http://reeoo.com/tag/icon">Icon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11747" id="menu-item-11747"><a href="http://reeoo.com/?s=agency">Agency</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11752" id="menu-item-11752"><a href="http://reeoo.com/tag/company">Company</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11740" id="menu-item-11740"><a href="http://reeoo.com/?s=studio">Studio</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11738" id="menu-item-11738"><a href="http://reeoo.com/tag/coming-soon">Coming Soon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11739" id="menu-item-11739"><a href="http://reeoo.com/tag/onepage">Onepage</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11751" id="menu-item-11751"><a href="http://reeoo.com/tag/cartoon">Cartoon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11764" id="menu-item-11764"><a href="http://reeoo.com/?s=animation">Animation</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11766" id="menu-item-11766"><a href="http://reeoo.com/?s=develop">Develop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11743" id="menu-item-11743"><a href="http://reeoo.com/tag/designer">Designer</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11741" id="menu-item-11741"><a href="http://reeoo.com/tag/food">Food</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11742" id="menu-item-11742"><a href="http://reeoo.com/tag/music">Music</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11749" id="menu-item-11749"><a href="http://reeoo.com/?s=movie">Movie</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11763" id="menu-item-11763"><a href="http://reeoo.com/?s=metting">Metting</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11744" id="menu-item-11744"><a href="http://reeoo.com/?s=shop">Shop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11756" id="menu-item-11756"><a href="http://reeoo.com/tag/fashion">Fashion</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11745" id="menu-item-11745"><a href="http://reeoo.com/?s=wordpress">WordPress</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11746" id="menu-item-11746"><a href="http://reeoo.com/?s=theme">Theme</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11748" id="menu-item-11748"><a href="http://reeoo.com/?s=official">Official</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11753" id="menu-item-11753"><a href="http://reeoo.com/tag/travel">Travel</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11757" id="menu-item-11757"><a href="http://reeoo.com/?s=tool">Tool</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11755" id="menu-item-11755"><a href="http://reeoo.com/tag/product">Product</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11758" id="menu-item-11758"><a href="http://reeoo.com/?s=bike">Bike</a></li>
</ul></div> </div>
</div>
</div>
</header>
<article class="box">
<div id="main">
<ul id="list">
<li class="sponsor">
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?serve=CKYIVKJ7&amp;placement=reeoocom" type="text/javascript"></script>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/loop">
<img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/loop">Loop</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/programatorio">
<img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/programatorio">Programatório</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/ultraviolet-way">
<img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/ultraviolet-way">Ultraviolet Way</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/misatoto-town">
<img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/misatoto-town">みさとと。</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/block-studio">
<img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/block-studio">Block Studio</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/composition-no-24">
<img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/composition-no-24">Composition No. 24</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/discovery-land-company">
<img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/discovery-land-company">Discovery Land Company</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/hardies">
<img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/hardies">Hardies</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/welchs-fruit-snacks">
<img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/welchs-fruit-snacks">Welch’s Fruit Snacks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/exeron">
<img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/exeron">EXERON</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/pop-weaver">
<img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/pop-weaver">Pop Weaver</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/edesign-interactive">
<img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/edesign-interactive">eDesign Interactive</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/obsolete">
<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/obsolete">OBSOLETE</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/minibricks">
<img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/minibricks">Minibricks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/your-sport-agent">
<img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/your-sport-agent">Your Sport Agent</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/modulz">
<img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/modulz">Modulz</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/shift-2">
<img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/shift-2">Shift</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rand">
<img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rand">Rand</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rappipay-2">
<img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rappipay-2">RappiPay</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/real-happiness-project-from-bbc-earth">
<img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/real-happiness-project-from-bbc-earth">Real Happiness Project from BBC Earth</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/opera">
<img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/opera">OPERA</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/kyoto-shin-nyo-do">
<img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/kyoto-shin-nyo-do">真如堂を楽しむ</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/bitbiome">
<img alt="bitBiome" class="lazy" data-original="https://reeoo.xnny.net/bitBiome.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="bitBiome" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/bitbiome">bitBiome</a></div>
</li>
</ul>
<!-- pb265 --><div class="pagebar"><span> </span><span class="this-page">1</span>
<a href="https://reeoo.com/page/2" title="Page 2">2</a>
<a href="https://reeoo.com/page/3" title="Page 3">3</a>
<a href="https://reeoo.com/page/4" title="Page 4">4</a>
<a href="https://reeoo.com/page/5" title="Page 5">5</a>
<a href="https://reeoo.com/page/6" title="Page 6">6</a>
<a href="https://reeoo.com/page/7" title="Page 7">7</a>
<a href="https://reeoo.com/page/8" title="Page 8">8</a>
<a href="https://reeoo.com/page/9" title="Page 9">9</a>
<span class="break">...</span>
<a href="https://reeoo.com/page/172" title="Page 172">172</a>
<a href="https://reeoo.com/page/173" title="Page 173">173</a>
<a href="https://reeoo.com/page/174" title="Page 174">174</a>
<a href="https://reeoo.com/page/175" title="Page 175">175</a>
<a href="https://reeoo.com/page/176" title="Page 176">176</a>
<a href="https://reeoo.com/page/177" title="Page 177">177</a>
<a href="https://reeoo.com/page/2" title="Page 2">&gt;</a>
</div></div>
</article>
<footer id="footer">
<div class="box">
<p>
<span class="link">
<a href="http://designlol.net" target="_blank" title="全球设计精华分享站">Design lol</a>
<a href="http://logojoy.com" target="_blank">Logojoy</a>
<a href="http://www.pplock.com/" target="_blank" title="分享艺术·设计·创意">PPLock</a>
<a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
<a href="http://www.ui.cn" target="_blank">UICN</a>
<a href="http://www.uisdc.com/" target="_blank" title="优秀网页设计联盟">UISDC</a>
<a href="http://zmingcx.com/" target="_blank" title="知更鸟">Zmingcx</a>
</span>
<span class="link">
<a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
<a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
<a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
				</span>
<a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
						<br/>
			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
</div>
</footer>
<script type="text/javascript">
/* <![CDATA[ */
var image_lazy_load = {"image_unveil_load":"0"};
/* ]]> */
</script>
<script src="http://reeoo.com/wp-content/plugins/image-lazy-load/js/min/frontend-min.js?ver=1.0.9" type="text/javascript"></script>
<script src="http://reeoo.com/wp-includes/js/wp-embed.min.js?ver=4.9.8" type="text/javascript"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-11594399-2', 'auto');
  ga('send', 'pageview');

</script>
</body>
</html>

The request above has no exception handling; we ignore that for now. In practice, the exceptions raised by urllib (urllib.error.URLError and its subclass HTTPError) are used to determine whether the request succeeded. The second parameter of the BeautifulSoup constructor (for example lxml or html.parser) selects the document parser; lxml must be installed separately, while html.parser ships with Python. If this parameter is omitted, Beautiful Soup picks the best parser that is installed, but it issues a warning. For details, please refer to the bs4 documentation ( https://www.crummy.com/software/BeautifulSoup/bs4/doc/ ).
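As a minimal sketch of the exception handling mentioned above, the request can be wrapped in a small helper. The function name fetch_soup and its behavior of returning None on failure are assumptions made for illustration; they are not part of bs4 or urllib.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=20):
    """Return a BeautifulSoup object for url, or None if the request fails.

    fetch_soup is a hypothetical helper written for this article.
    """
    request = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content = response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code)
        return None
    except urllib.error.URLError as err:
        # The resource could not be reached (DNS failure, refused connection, ...).
        print("Request failed:", err.reason)
        return None
    # Name the parser explicitly so that bs4 does not have to guess.
    return BeautifulSoup(content, "html.parser")
```

Calling fetch_soup('http://reeoo.com') would then give either a soup object or None, so the calling code can check the result before parsing.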

The object can also be initialized from a file handle. First save the HTML source as reo.html in the same directory as the script, then open the file and pass the handle to the constructor:

with open('reo.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

In this way, the pages can be downloaded once and analyzed afterwards, which avoids being blocked for visiting the website too many times during testing. The soup can be printed with print, and the output is the same as the HTML text. At this point it is a complex tree structure in which each node is a Python object.

4. Get the specified tag

The sample code below uses the soup object created above.

4.1 Tag

A Tag object corresponds to a tag in the original HTML document and can be accessed directly by its name:

tag = soup.title
print(tag)

Print result:

<title>Reeoo - web design inspiration and website gallery</title>

4.2 Name

The name of a tag can be obtained through the name attribute of the Tag object:

print(tag.name)

# title

4.3 Attributes

A tag may have many attributes, such as id and class. Tag attributes are read and modified in the same way as the entries of a dictionary.

For example, take the article tag that contains the thumbnail area of the page:

...

<article class="box">
   <div id="main">
   <ul id="list">
       <li id="sponsor"><div class="sponsor_tips"></div>
           <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" id="_carbonads_js"></script>
       </li>

...

Get the value of its class attribute:

tag = soup.article
c = tag['class']
print(c)

# ['box']

You can also get all attributes at once through .attrs:

tag = soup.article
attrs = tag.attrs
print(attrs)

# {'class': ['box']}

Note: because class is a multi-valued attribute, its value is a list.
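The dictionary-style access can be tried on a small inline snippet. The HTML string below is a shortened stand-in for one menu item of the page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A shortened, hand-written stand-in for one menu item of the page.
html = '<li class="menu-item active" id="link_web"><a href="https://reeoo.com">Web Design</a></li>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.li

print(tag["id"])          # single-valued attribute: the string link_web
print(tag["class"])       # multi-valued attribute: ['menu-item', 'active']
print(tag.get("title"))   # .get() returns None for a missing attribute
print("id" in tag.attrs)  # .attrs is an ordinary dictionary: True
```

Using .get() instead of square brackets avoids a KeyError when a tag may or may not carry the attribute.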

 

-1- String in a tag

Get the string contained in a tag through the .string attribute:

tag = soup.title

s = tag.string

print(s)

# Reeoo - web design inspiration and website gallery
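One detail worth knowing: .string only has a value when the tag contains a single string (directly, or through a single child tag). A small sketch on an inline snippet, which is a shortened stand-in for the page, shows the difference and uses get_text() as the alternative:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="title"><a href="https://reeoo.com/loop">Loop</a></div>'
    '<p>Powered by <a href="http://wordpress.org/">WordPress</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# The div has exactly one child (the a tag), so .string looks through it.
print(soup.div.string)    # Loop

# The p tag has several children, so .string is None.
print(soup.p.string)      # None

# get_text() joins all the strings found under the tag.
print(soup.p.get_text())  # Powered by WordPress.
```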

 

-2- Traversing the document tree

A tag may contain multiple strings or other tags, which are all child nodes of this tag. Beautiful Soup provides many operations and attributes for traversing child nodes.

Child nodes

A child tag can be obtained by using its name as an attribute of the parent Tag, and chaining these accesses walks further down the tree.

For example, to get the li inside the article tag:

tag = soup.article.div.ul.li

print(tag)

Print result:

<li id="sponsor"><div class="sponsor_tips"></div>
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" type="text/javascript"></script>
</li>

You can also omit some of the intermediate nodes; the result is the same:

tag = soup.article.li

Attribute access with a dot returns only the first matching tag. If you want to get all li tags, use the find_all() method:

ls = soup.article.div.ul.find_all('li')

What you get is a list of all li tags.
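The difference between dot access and find_all() can be seen on a reduced version of the thumbnail list. The HTML below is a hand-shortened stand-in for the page, not the real markup.

```python
from bs4 import BeautifulSoup

html = """
<article class="box"><div id="main"><ul id="list">
<li class="sponsor">ad</li>
<li><div class="title"><a href="https://reeoo.com/loop">Loop</a></div></li>
<li><div class="title"><a href="https://reeoo.com/rand">Rand</a></div></li>
</ul></div></article>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.article.li              # dot access: only the first li
items = soup.article.find_all("li")  # find_all: every li under article

print(first["class"])  # ['sponsor']
print(len(items))      # 3

# Collect the link text of every li that actually contains a link.
titles = [li.a.string for li in items if li.a is not None]
print(titles)          # ['Loop', 'Rand']
```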

The .contents attribute of the tag can output the child nodes of the tag as a list:

tag = soup.article.div.ul
contents = tag.contents
print(contents)
for i in contents:
    print(i)

If you print contents, you can see that the list contains not only the li tags but also the newline character '\n'. Printing the items one by one in the loop makes the individual nodes easier to tell apart.

Through the tag's .children iterator, you can loop over the child nodes of the tag:

tag = soup.article.div.ul
children = tag.children
print(children)
for child in children:
    print(child)

Printing children shows an iterator object rather than a list. Comparing the two for loops, you will find that they visit the same child nodes; the difference is that .contents is a list you can index, while .children can only be looped over.

The .contents and .children attributes contain only the direct children of a tag. To traverse the children of the children as well, use the .descendants attribute. It is used in the same way as the previous two.
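A short sketch comparing the three attributes on a tiny list; the HTML is a made-up stand-in, and Tag is imported only to filter out the string nodes.

```python
from bs4 import BeautifulSoup, Tag

html = "<ul><li><a href='#'>Loop</a></li>\n<li><a href='#'>Rand</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# Direct children only: li, the newline string, li.
print(len(ul.contents))        # 3
print(len(list(ul.children)))  # 3

# All descendants: li, a, 'Loop', '\n', li, a, 'Rand'.
print(len(list(ul.descendants)))  # 7

# Keep only the tags and drop the string nodes.
tag_names = [node.name for node in ul.descendants if isinstance(node, Tag)]
print(tag_names)  # ['li', 'a', 'li', 'a']
```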

-3- Parent node

Get the parent node of an element through the .parent attribute. The parent node of article is body.

tag = soup.article

print(tag.parent.name)

# body

Or traverse all ancestor nodes through the .parents attribute:

tag = soup.article
for p in tag.parents:
    print(p.name)
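To see what the loop over .parents produces, here is a sketch on a minimal document (a made-up stand-in for the page). The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><article class='box'><div id='main'></div></article></body></html>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.div
print(tag.parent.name)  # article

# Walk upwards from the div to the root of the tree.
names = [p.name for p in tag.parents]
print(names)  # ['article', 'body', 'html', '[document]']
```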

 

-4- Sibling nodes

The .next_sibling and .previous_sibling attributes are used to access the sibling nodes of a tag, that is, the nodes that share the same parent. Their usage is similar to the attributes above.
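A small sketch of sibling navigation on a made-up list. Note that the direct sibling of a tag is often a whitespace string, which is why find_next_sibling() and find_previous_sibling() are convenient when you want the neighboring tag.

```python
from bs4 import BeautifulSoup

html = "<ul><li id='a'>A</li>\n<li id='b'>B</li>\n<li id='c'>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")

b = soup.find("li", id="b")

# The immediate sibling is the newline between the two li tags.
print(repr(b.next_sibling))  # '\n'

# Skip over strings and go straight to the neighboring li tags.
next_li = b.find_next_sibling("li")
prev_li = b.find_previous_sibling("li")
print(next_li["id"])  # c
print(prev_li["id"])  # a
```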

 

-5- Searching the document tree

Searching a tree-structured document for specific nodes is the most common operation when crawling.

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)
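Before going through the parameters one by one, here is a rough sketch of how limit, recursive and string behave, using a made-up snippet. limit is a find_all() parameter that stops the search after the given number of results.

```python
from bs4 import BeautifulSoup

html = "<div id='main'><p>one</p><section><p>two</p></section><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
main = soup.div

# name: every p below the div, at any depth.
all_p = main.find_all("p")
print([p.string for p in all_p])      # ['one', 'two', 'three']

# limit: stop after the first two matches.
first_two = main.find_all("p", limit=2)
print([p.string for p in first_two])  # ['one', 'two']

# recursive=False: only direct children of the div are considered.
direct = main.find_all("p", recursive=False)
print([p.string for p in direct])     # ['one', 'three']

# string: search the text nodes instead of the tags.
by_text = main.find_all(string="two")
print(by_text)                        # ['two']
```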

4.4 name parameter

Find all tags with the given name:

soup.find_all('title')

# [<title>Reeoo - web design inspiration and website gallery</title>]

soup.find_all('footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]

4.5 Keyword parameters

If a keyword argument is not one of the built-in parameter names (name, attrs, recursive, string, limit), it is treated as a filter on the tag attribute of that name. If no tag name is given, tags of every name are searched.

For example, search for all tags whose id is footer:

soup.find_all(id='footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]
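The same idea on a self-contained snippet: any keyword that is not a built-in parameter name is matched against the attribute of that name. The HTML is a shortened stand-in for the page footer.

```python
from bs4 import BeautifulSoup

html = """<footer id="footer"><p>
<a href="http://wordpress.org/" target="_blank">WordPress</a>
<a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>
</p></footer>"""
soup = BeautifulSoup(html, "html.parser")

# id is not a built-in parameter, so it filters on the id attribute.
by_id = soup.find_all(id="footer")
print([t.name for t in by_id])       # ['footer']

# The same works for any other attribute, here title.
by_title = soup.find_all(title="Reeoo")
print([t.string for t in by_title])  # ['Reeoo.com']

# No match gives an empty list, not an error.
print(soup.find_all(id="missing"))   # []
```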

Combined with a tag name:

soup.find_all('footer', id='footer')

[<footer id="footer">
 <div class="box">
 <p>
 <span class="link">
 <a href="http://designlol.net" target="_blank" title="全球设计精华分享站">Design lol</a>
 <a href="http://logojoy.com" target="_blank">Logojoy</a>
 <a href="http://www.pplock.com/" target="_blank" title="分享艺术·设计·创意">PPLock</a>
 <a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
 <a href="http://www.ui.cn" target="_blank">UICN</a>
 <a href="http://www.uisdc.com/" target="_blank" title="优秀网页设计联盟">UISDC</a>
 <a href="http://zmingcx.com/" target="_blank" title="知更鸟">Zmingcx</a>
 </span>
 <span class="link">
 <a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
 <a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
 <a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
 				</span>
 <a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
 				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
 				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
 						<br/>
 			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
 </div>
 </footer>]

 

Get the div tags of all thumbnails, which are marked with the class thumb:

soup.find_all('div', class_='thumb')

One thing to note here: because class is a reserved keyword in Python, the parameter is written with a trailing underscore, as class_.
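A quick sketch of class_ on a shortened stand-in for the menu markup. When a tag has several classes, class_ matches if any one of them equals the given value.

```python
from bs4 import BeautifulSoup

html = """<ul>
<li class="menu-item menu-item-type-custom current-menu-item"><a href="http://reeoo.com/">All</a></li>
<li class="menu-item menu-item-type-custom"><a href="http://reeoo.com/?s=app">App</a></li>
</ul><div class="thumb"></div>"""
soup = BeautifulSoup(html, "html.parser")

# Both li tags carry the class menu-item among their classes.
menu_items = soup.find_all("li", class_="menu-item")
print(len(menu_items))  # 2

# Only the first li also has current-menu-item.
current = soup.find_all(class_="current-menu-item")
print(current[0].a.string)  # All

thumbs = soup.find_all("div", class_="thumb")
print(len(thumbs))  # 1
```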

The value of an attribute filter can be a string, a regular expression, a list, or True/False.

True/False

Whether the specified attribute exists.

Search for all tags that have a target attribute:

soup.find_all(target=True)

Search for all tags without the target attribute. (Look closely and the printed results will still contain tags with a target attribute: these are descendants of matched tags that themselves have no target, so take care when reading the output.)

soup.find_all(target=False)
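The other value types can be sketched in the same way. The links below are a reduced stand-in for the page, and the target="_self" value is added purely for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="https://reeoo.com/page/2" title="Page 2">2</a>
<a href="http://designlol.net" target="_blank">Design lol</a>
<a href="http://www.ui.cn" target="_self">UICN</a>
<span class="break">...</span>
"""
soup = BeautifulSoup(html, "html.parser")

# String: the attribute value must match exactly.
exact = soup.find_all(href="http://designlol.net")

# Regular expression: the pattern is searched for in the attribute value.
by_regex = soup.find_all(href=re.compile(r"^https://"))

# List: the attribute may equal any one of the items.
by_list = soup.find_all(target=["_blank", "_self"])

# True: the attribute only has to be present.
has_target = soup.find_all(target=True)

print(len(exact), len(by_regex), len(by_list), len(has_target))  # 1 1 2 2
```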

You can specify multiple parameters as filter conditions. For example, the markup of the thumbnail part of the page is as follows:

<li>
   <div class="thumb">
       <a href="http://reeoo.com/aim-creative-studios">![AIM Creative Studios](http://upload-images.jianshu.io/upload_images/1346917-f6281ffe1a8f0b18.gif?imageMogr2/auto-orient/strip)</a>
   </div>
   <div class="title">
       <a href="http://reeoo.com/aim-creative-studios">AIM Creative Studios</a>
   </div>
</li>

Search for tags whose src attribute contains the string reeoo.com and whose class is lazy:

Note: re is Python's regular expression module and must be imported first.

import re

soup.find_all(src=re.compile(r"reeoo\.com"), class_='lazy')

The search result is all the thumbnail img tags.

Some attributes cannot be used as keyword arguments, such as the HTML5 data-* attributes. In the example above, data-original cannot be passed as a keyword because it is not a valid Python identifier; doing so raises a SyntaxError (in older Python 3 versions the message is: keyword can't be an expression).

4.6 attrs parameter

Passing a dictionary through the attrs parameter searches for tags with the corresponding attributes, which works around the problem above of attributes that cannot be used as keyword arguments.
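A minimal sketch of the attrs approach on a single thumbnail, using a shortened stand-in for the page markup. The dictionary key is an ordinary string, so the hyphen in data-original is not a problem.

```python
from bs4 import BeautifulSoup

html = """<div class="thumb"><a href="https://reeoo.com/loop">
<img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" src="https://reeoo.com/assets/white.gif" title="Loop"/>
</a></div>"""
soup = BeautifulSoup(html, "html.parser")

# True means: the attribute only has to be present.
imgs = soup.find_all("img", attrs={"data-original": True})

# Read the real image address that the lazy loader would use.
urls = [img["data-original"] for img in imgs]
print(urls)  # ['https://reeoo.xnny.net/Loop.jpg!page']
```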

For example, search for tags that have a data-original attribute:

print(soup.find_all(attrs={'data-original': True}))

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]

Search for tags whose data-original attribute contains the string reeoo:

soup.find_all(attrs={'data-original':re.compile('reeoo')})

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]

Search for tags whose data-original attribute equals a specific value:

soup.find_all(attrs={'data-original': 'https://reeoo.xnny.net/OBSOLETE.png!page'})

[<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>]
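The three attrs queries above can be tried without network access. The following is a small sketch that assumes a made-up three-tag document in place of the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two tags carry data-original, one does not.
html = """
<img class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" alt="Loop"/>
<img class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" alt="Shift"/>
<img class="plain" alt="No data attribute"/>
"""
soup = BeautifulSoup(html, "html.parser")

# True matches any tag that has the attribute, whatever its value.
has_attr = soup.find_all(attrs={"data-original": True})

# A compiled regex is searched against the attribute value.
by_regex = soup.find_all(attrs={"data-original": re.compile(r"Shift")})

# A plain string must equal the attribute value exactly.
by_value = soup.find_all(
    attrs={"data-original": "https://reeoo.xnny.net/Loop.jpg!page"}
)

print(len(has_attr), by_regex[0]["alt"], by_value[0]["alt"])
```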

4.7, string parameters

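The string parameter works like the name parameter, but it matches against the text content of the document instead of tag names.

Search for strings containing Reeoo (the result is a list of the matching strings, not of tags):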

soup.find_all(string=re.compile("Reeoo"))
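A short sketch of what the string filter returns, assuming a made-up fragment rather than the real page. Each result is a string object, and its parent attribute leads back to the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two text nodes that mention "Reeoo".
html = (
    '<div><a href="https://reeoo.com">Reeoo</a>'
    "<p>About Reeoo gallery</p><p>Other text</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# The matches are text nodes (NavigableString), not Tag objects.
strings = soup.find_all(string=re.compile("Reeoo"))

# .parent gives the tag that directly contains each matching string.
parents = [s.parent.name for s in strings]

print(list(strings), parents)
```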

 

4.8, limit parameters

By default, find_all() searches the entire document and returns every match, which can be slow for a large document. The limit parameter caps the number of results: once that many matches have been found, the search stops and the results are returned.

Search for div tags whose class is thumb, returning at most 3 results:

soup.find_all('div', class_='thumb', limit=3)

The result is a list of 3 elements, even though more than 3 tags in the document match the conditions.
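The effect of limit can be checked on a small generated document. This is only a sketch: the five div tags are made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical document with five matching div tags.
html = "".join(f'<div class="thumb">item {i}</div>' for i in range(5))
soup = BeautifulSoup(html, "html.parser")

# Without limit, every match is returned.
all_thumbs = soup.find_all("div", class_="thumb")

# With limit=3, the search stops after the third match.
first_three = soup.find_all("div", class_="thumb", limit=3)
texts = [tag.get_text() for tag in first_three]

print(len(all_thumbs), texts)
```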

4.9, recursive parameters

By default, find_all() searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.
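The difference is easiest to see with nested lists. The sketch below assumes a made-up nested ul in place of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: the outer ul has two direct li children,
# and the first of them contains another ul with one li.
html = "<ul><li>one<ul><li>nested</li></ul></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.ul  # the first (outer) ul tag

# Default: every li descendant, at any depth.
all_li = outer.find_all("li")

# recursive=False: only li tags that are direct children of outer.
direct_li = outer.find_all("li", recursive=False)

print(len(all_li), len(direct_li))
```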

 

4.10, find()

find(name, attrs, recursive, string, **kwargs)

find() accepts essentially the same parameters as find_all(), except that it has no limit parameter. It returns only the first match, which makes it roughly equivalent to calling find_all() with limit=1:

soup.find_all('div', class_='thumb', limit=1)

soup.find('div', class_='thumb')

Both calls find the same tag. The only difference is that find_all() returns a list containing that one tag, while find() returns the tag itself.

When no tags that meet the criteria are found, find() returns None, and find_all() returns an empty list.
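Because find() may return None, code that uses its result usually needs a guard. A minimal sketch, assuming a made-up two-tag document and a hypothetical class name missing that matches nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical document with two matching div tags.
html = '<div class="thumb">first</div><div class="thumb">second</div>'
soup = BeautifulSoup(html, "html.parser")

as_list = soup.find_all("div", class_="thumb", limit=1)  # list with one Tag
as_tag = soup.find("div", class_="thumb")                # the Tag itself

# Nothing has class "missing": find() gives None, find_all() gives [].
missing_tag = soup.find("div", class_="missing")
missing_list = soup.find_all("div", class_="missing")

# Guard against None before calling methods on the result.
text = missing_tag.get_text() if missing_tag is not None else ""

print(as_tag.get_text(), missing_tag, list(missing_list), repr(text))
```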

 

4.11, CSS selector

Tag and BeautifulSoup objects have a select() method that takes a CSS selector string and returns the list of matching tags.

The semantics follow CSS. For example, search for li tags inside a ul tag inside an article tag:

print(soup.select('article ul li'))

Search by class name. The two lines below give the same result: all tags whose class is thumb.

soup.select('.thumb')

soup.select('[class~=thumb]')

Search by id: find the tag whose id is submenu.

soup.select('#submenu')

Search by whether an attribute exists: find li tags that have an id attribute.

soup.select('li[id]')

Search by attribute value: find li tags whose class attribute is exactly sponsor.

soup.select('li[class="sponsor"]')
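The selectors above can be compared side by side on one small document. The markup below is a made-up sketch that reuses the article, submenu, thumb and sponsor names from the examples; the featured class is an extra assumption added to show how exact-value matching differs from class matching.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the names from the examples above.
html = """
<article>
  <ul id="submenu">
    <li id="first" class="sponsor">A</li>
    <li class="thumb featured">B</li>
    <li>C</li>
  </ul>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Return the text of every tag matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]

descendants = texts("article ul li")       # li under ul under article
by_class = texts(".thumb")                 # class list contains thumb
by_class_attr = texts("[class~=thumb]")    # same test, attribute syntax
by_id = [tag.name for tag in soup.select("#submenu")]
with_id = texts("li[id]")                  # li that has an id attribute
exact_class = texts('li[class="sponsor"]')  # class attribute equals sponsor
exact_thumb = texts('li[class="thumb"]')    # no match: value is "thumb featured"

print(descendants, by_class, by_id, with_id, exact_class, exact_thumb)
```

Note that [class="..."] compares the whole attribute value, so it does not match a tag that carries additional classes, while .thumb and [class~=thumb] do.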

 

Other

Other search methods include:

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

Their parameters work much like those of find_all() and find(), so they are not covered in detail here. In practice, find_all() and find() cover most query needs.
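For reference, here is a brief sketch of how these navigation methods behave. It assumes a made-up three-item list, with hypothetical ids menu and mid, in place of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical list; the middle item is the starting point.
html = '<ul id="menu"><li>one</li><li id="mid">two</li><li>three</li></ul>'
soup = BeautifulSoup(html, "html.parser")
mid = soup.find("li", id="mid")

# Walk upwards to the nearest enclosing ul.
parent = mid.find_parent("ul")

# Walk sideways among the tags that share the same parent.
next_li = mid.find_next_sibling("li")
prev_li = mid.find_previous_sibling("li")

# The plural forms return a list of every match instead of the first one.
following = mid.find_next_siblings("li")

print(parent["id"], prev_li.get_text(), next_li.get_text(), len(following))
```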

The remaining methods modify the document tree. A crawler mostly reads page information and rarely needs to change the page source, so they are not covered here.

For specific details, please refer to the official documentation of the Beautiful Soup library.

 

Origin blog.csdn.net/u010472858/article/details/103483496