Python web scraping basics: using regular expressions with BeautifulSoup


Common regular expression symbols

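Symbol  Meaning                                                                 Example           Matches
*       Preceding character, subexpression, or group, 0 or more times           a*b*              aaabbbb
+       Preceding character, subexpression, or group, 1 or more times           a+b+              aabb
[]      Any single character from the bracketed set                             [A-Z]*            ABC
()      Grouping; the grouped rules are evaluated first                         (a*b)*            aaabbaab
{m,n}   Preceding character, subexpression, or group, m to n times (inclusive)  a{2,3}            aa, aaa
[^]     Any single character not in the bracketed set                           [^a-z]*           APPLE
|       Any one of the alternatives separated by the vertical bar               b(a|i)d           bad
.       Any single character (letter, digit, space, symbol, and so on)          b.d               bad
^       Character or subexpression at the start of the string                   ^a                apple
\       Escape character; the next character is taken literally                 \.                .
$       End of the string; the pattern must match up to the end                 [A-Z]*[a-z]*$     ABCabc
?!      Does not contain (negative lookahead)                                   ^((?![A-Z]).)*$   no-caps-here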
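
The rows of the table can be checked with Python's built-in re module. The following is a minimal sketch, not part of the original post: the patterns are written in standard re syntax (| for alternation, . for any single character), and the helper name fully_matches is made up for this illustration. It uses re.fullmatch, so a pattern only counts as matching when it covers the entire sample string.

```python
import re

# (pattern, sample text) pairs based on the symbol table.
samples = [
    (r"a*b*", "aaabbbb"),
    (r"a+b+", "aabb"),
    (r"[A-Z]*", "ABC"),
    (r"(a*b)*", "aaabbaab"),
    (r"b(a|i)d", "bad"),
    (r"b.d", "bad"),
    (r"[A-Z]*[a-z]*$", "ABCabc"),
    (r"^((?![A-Z]).)*$", "no-caps-here"),
]


def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


results = {pattern: fully_matches(pattern, text) for pattern, text in samples}
for pattern, ok in results.items():
    print(pattern, ok)

# The negative lookahead rejects any string that contains an uppercase letter.
print(fully_matches(r"^((?![A-Z]).)*$", "Has-Caps"))  # False
```

Every pair in samples should print True. Swapping re.fullmatch for re.search would also accept strings that merely contain a match somewhere inside them.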

BeautifulSoup and regular expressions

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
# Name the parser explicitly so BeautifulSoup does not have to guess one.
bsObj = BeautifulSoup(html.read(), "html.parser")
# Raw string: the backslashes reach the regex engine unchanged.
images = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

Output:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
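
To see why only the product images are printed and not, for example, the logo, the pattern can be tried on a few paths without any network access. This is a rough sketch: the candidate paths are invented for illustration, and it assumes BeautifulSoup tests an attribute value against a compiled pattern with search() semantics, which is how the library behaves as far as I know.

```python
import re

# Same pattern as in the call above, written as a raw string.
gift_image = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values, chosen to show what the pattern accepts and rejects.
candidates = [
    "../img/gifts/img1.jpg",   # product image
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/gifts/img6.png",   # wrong extension
    "img/gifts/img2.jpg",      # missing the leading "../"
]

matched = [src for src in candidates if gift_image.search(src)]
print(matched)  # ['../img/gifts/img1.jpg']
```

The escaped dots (\.) match literal periods only, while the unescaped .* stands for any run of characters between "img" and ".jpg".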

BeautifulSoup and lambda expressions

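A lambda expression is essentially a function that can be passed as an argument to another function. BeautifulSoup lets us pass such a function to find_all (spelled findAll in older code). The only restriction is that the function must take a tag object as its argument and return a Boolean. BeautifulSoup calls the function on every tag object it encounters, keeps the tags for which it returns True, and discards the rest.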

# Keep only the tags that have exactly two attributes.
eles = bsObj.find_all(lambda tag: len(tag.attrs) == 2)
for ele in eles:
    print(ele)

Output:
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
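
The same filter can also be written as a named function, which is easier to reuse and to test than a lambda. The sketch below is only an illustration: the HTML fragment is made up (loosely modeled on the gift table) so that it runs offline, and has_two_attributes is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modeled on the gift table.
html = """
<table>
  <tr class="gift" id="gift1"><td>Vegetable Basket</td>
    <td><img src="../img/gifts/img1.jpg"/></td></tr>
  <tr class="gift"><td>Mystery Box</td>
    <td><img src="../img/gifts/img6.jpg" alt="box"/></td></tr>
</table>
"""


def has_two_attributes(tag):
    """Filter function: takes a tag and returns a Boolean."""
    return len(tag.attrs) == 2


soup = BeautifulSoup(html, "html.parser")
# Only the first <tr> (class, id) and the second <img> (src, alt) qualify.
names = [tag.name for tag in soup.find_all(has_two_attributes)]
print(names)  # ['tr', 'img']
```

Because tag.attrs is a dictionary, len(tag.attrs) counts attribute names, so a class attribute with several values still counts once.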

Reposted from blog.csdn.net/hfutzhouyonghang/article/details/80676522