[Web Crawler] U11_A Detailed Guide to Extracting Data with the BeautifulSoup4 Library in Python 3


The following sections use part of the page source of 51job.com as a case study; all of the examples below extract data from this real-world markup with the BeautifulSoup4 library.
The part of the 51job.com source used below is as follows (in the code, the string "前程无忧" stands in for this source):

<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="120207510" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="python教师" href="https://jobs.51job.com/kunming-whq/120207510.html?s=01&t=0"  onmousedown="">
                python教师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南通识教育信息咨询有限公司" href="https://jobs.51job.com/all/co5751385.html">云南通识教育信息咨询有限公司</a></span>
    <span class="t3">昆明-五华区</span>
    <span class="t4">4.5-6千/月</span>
    <span class="t5">03-29</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="118417429" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Python工程师" href="https://jobs.51job.com/kunming-whq/118417429.html?s=01&t=0"  onmousedown="">
                Python工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南蓝典科技股份有限公司" href="https://jobs.51job.com/all/co4646964.html">云南蓝典科技股份有限公司</a></span>
    <span class="t3">昆明-五华区</span>
    <span class="t4">4-6千/月</span>
    <span class="t5">03-27</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="120703493" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="YX00-Python开发工程师" href="https://jobs.51job.com/kunming-gdq/120703493.html?s=01&t=0"  onmousedown="">
                YX00-Python开发工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南远信科技有限公司" href="https://jobs.51job.com/all/co2256249.html">云南远信科技有限公司</a></span>
    <span class="t3">昆明-官渡区</span>
    <span class="t4">4-8千/月</span>
    <span class="t5">03-27</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="117230454" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Python开发工程师" href="https://jobs.51job.com/kunming/117230454.html?s=01&t=0"  onmousedown="">
                Python开发工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南紫米科技有限公司" href="https://jobs.51job.com/all/co4672988.html">云南紫米科技有限公司</a></span>
    <span class="t3">昆明</span>
    <span class="t4">0.8-1万/月</span>
    <span class="t5">03-27</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="117148016" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Python高级开发工程师" href="https://jobs.51job.com/kunming-plq/117148016.html?s=01&t=0"  onmousedown="">
                Python高级开发工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="微加普惠金融服务(深圳)有限公司" href="https://jobs.51job.com/all/co5633133.html">微加普惠金融服务(深圳)有限公司...</a></span>
    <span class="t3">昆明-盘龙区</span>
    <span class="t4">1-2万/月</span>
    <span class="t5">03-27</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="118740280" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Java/大数据/python 讲师" href="https://jobs.51job.com/kunming/118740280.html?s=01&t=0"  onmousedown="">
                Java/大数据/python 讲师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南新华计算机中等专业学校" href="https://jobs.51job.com/all/co3757091.html">云南新华计算机中等专业学校</a></span>
    <span class="t3">昆明</span>
    <span class="t4">0.5-1万/月</span>
    <span class="t5">03-27</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="104297888" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Python开发工程师" href="https://jobs.51job.com/kunming-whq/104297888.html?s=01&t=0"  onmousedown="">
                Python开发工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云南创至互达网络科技有限公司" href="https://jobs.51job.com/all/co4670824.html">云南创至互达网络科技有限公司</a></span>
    <span class="t3">昆明-五华区</span>
    <span class="t4">0.6-1万/月</span>
    <span class="t5">03-19</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="120456484" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="Python开发工程师" href="https://jobs.51job.com/kunming/120456484.html?s=01&t=0"  onmousedown="">
                Python开发工程师                </a>
        </span>
                                                                </p>
    <span class="t2"><a target="_blank" title="云思华盛(北京)科技有限公司" href="https://jobs.51job.com/all/co2898169.html">云思华盛(北京)科技有限公司</a></span>
    <span class="t3">昆明</span>
    <span class="t4">10-15万/年</span>
    <span class="t5">03-14</span>
</div>

1. Get all the p tags

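from bs4 import BeautifulSoup

html = "前程无忧"  # placeholder for the 51job.com source shown above
soup = BeautifulSoup(html,'lxml')
ps = soup.find_all('p')  # find_all returns every matching tag
for p in ps:
    print(p)
    print("=" * 40)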

Each p printed by the code above is a Tag object (from bs4.element import Tag). Looking inside the Tag class, we find a __repr__ method; this is the method that renders the element as a string when it is printed.
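As a minimal sketch of that point, the snippet below checks the type of one result and compares its repr with its string form. The one-line `sample` markup and the example.com URL are hypothetical stand-ins for the 51job.com source, and `'html.parser'` is used instead of `'lxml'` only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-line sample standing in for the 51job.com source.
sample = '<p class="t1"><a href="https://example.com/job.html">Python Engineer</a></p>'

# 'html.parser' is an assumption made here to avoid needing lxml.
soup = BeautifulSoup(sample, 'html.parser')
p = soup.find_all('p')[0]

is_tag = isinstance(p, Tag)  # each item returned by find_all('p') is a Tag
rendered = repr(p)           # the markup string that print(p) displays
print(is_tag)
print(rendered)
```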

2. Get the second p tag

html = "前程无忧"
soup = BeautifulSoup(html,'lxml')
p = soup.find_all('p',limit=2)[1] # limit=2: extract at most 2 tags; [1] picks the second one
print(p)

3. Get all span tags whose class equals t3

html = "前程无忧"
soup = BeautifulSoup(html,'lxml')
spans = soup.find_all('span',class_='t3') # class_ is used here because class is a Python keyword
# The statement above can also be written with attrs: spans = soup.find_all('span',attrs={'class':"t3"})
for span in spans:
    print(span)
    print("=" * 40)

4. Get the em tags whose class equals check and whose name equals delivery_em

html = "前程无忧"
soup = BeautifulSoup(html,'lxml')
# Incorrect syntax: emList = soup.find_all('em', class_="check" ,name="delivery_em" )
emList = soup.find_all('em', attrs = {'class':"check" ,'name':"delivery_em"} )
for em in emList:
    print(em)
    print("=" * 40)

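If you instead write emList = soup.find_all('em', class_="check", name="delivery_em"), Python raises a TypeError, because name is already the name of find_all()'s first parameter (the tag name, here 'em') and therefore cannot also be passed as a keyword argument to filter on the HTML name attribute.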
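The sketch below reproduces both cases on a tiny sample: the keyword form raising a TypeError, and the attrs dictionary form matching only the intended tag. The two-tag `sample` string is hypothetical, and `'html.parser'` is an assumption used to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two em tags that differ only in their name attribute.
sample = ('<em class="check" name="delivery_em"></em>'
          '<em class="check" name="other"></em>')
soup = BeautifulSoup(sample, 'html.parser')

try:
    # 'em' already fills find_all's first parameter, which is itself called
    # `name`, so name='delivery_em' supplies that parameter a second time.
    soup.find_all('em', class_='check', name='delivery_em')
    raised = False
except TypeError:
    raised = True

# Passing the HTML attributes in a dictionary avoids the clash.
matches = soup.find_all('em', attrs={'class': 'check', 'name': 'delivery_em'})
print(raised, len(matches))
```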

5. Get the href attribute of every a tag inside the p tags whose class equals t1

html = "前程无忧"
soup = BeautifulSoup(html,'lxml')
pList = soup.find_all('p',class_='t1')
for p in pList:
    aList = p.find_all('a')
    for a in aList:
        # 1) By subscript (recommended: the syntax is concise and clear)
        href = a['href']
        print(href)
        print("=" * 40)

        # 2) Through the attrs attribute
        # href = a.attrs['href']
        # print(href)
        # print("=" * 40)

6. Get all the job information (text only)

html = "前程无忧"
soup = BeautifulSoup(html,'lxml')
divs = soup.find_all('div')[1:]  # [1:] skips the first div (presumably not a job entry on the full page)
infoSet = list()
for div in divs:
    info = {}
    infos = list(div.stripped_strings) # div.stripped_strings returns a generator
    info['job'] = infos[0]
    info['company'] = infos[1]
    info['address'] = infos[2]
    info['salary'] = infos[3]
    info['ReleaseDate'] = infos[4]
    infoSet.append(info)
print(infoSet)
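For readers who want to run this step without the full page, here is a minimal variant under stated assumptions: the two-entry `sample` is a shortened, hypothetical excerpt shaped like the source above, entries are selected with `class_='el'` rather than the `[1:]` slice, and `'html.parser'` replaces `'lxml'` to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Two shortened, hypothetical entries shaped like the 51job.com source above.
sample = """
<div class="el">
    <p class="t1 "><span><a href="https://example.com/1.html">
        Python Teacher        </a></span></p>
    <span class="t2"><a href="https://example.com/co1.html">Company A</a></span>
    <span class="t3">Kunming-Wuhua</span>
    <span class="t4">4.5-6k/month</span>
    <span class="t5">03-29</span>
</div>
<div class="el">
    <p class="t1 "><span><a href="https://example.com/2.html">
        Python Engineer        </a></span></p>
    <span class="t2"><a href="https://example.com/co2.html">Company B</a></span>
    <span class="t3">Kunming</span>
    <span class="t4">4-6k/month</span>
    <span class="t5">03-27</span>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
keys = ['job', 'company', 'address', 'salary', 'ReleaseDate']

info_set = []
for div in soup.find_all('div', class_='el'):
    # stripped_strings yields each text fragment with surrounding whitespace removed
    fields = list(div.stripped_strings)
    info_set.append(dict(zip(keys, fields)))

print(info_set)
```

Selecting by class keeps the first entry of this sample; the original `[1:]` slice would drop it.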

7. Summary

7.1 Using find_all

  • 1. When extracting tags, the first argument is the tag name. To filter by tag attributes as well, either pass each attribute name and its value as a keyword argument, or put all the attribute names and values into a dictionary and pass it as the attrs argument.
  • 2. The limit argument caps the number of tags extracted.
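A short sketch of a keyword filter combined with limit follows. The `sample` spans are hypothetical, and `'html.parser'` is an assumption made to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: three spans with class t3 and one with class t4.
sample = ('<span class="t3">Kunming</span>'
          '<span class="t4">4-6k/month</span>'
          '<span class="t3">Kunming-Wuhua</span>'
          '<span class="t3">Kunming-Guandu</span>')
soup = BeautifulSoup(sample, 'html.parser')

all_t3 = soup.find_all('span', class_='t3')              # every match
first_two = soup.find_all('span', class_='t3', limit=2)  # stop after two matches
first_two_text = [span.string for span in first_two]

print(len(all_t3), len(first_two))
print(first_two_text)
```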

7.2 The difference between find and find_all

  • find: returns the first tag that meets the conditions.
  • find_all: returns all tags that meet the conditions.
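The difference also shows up when nothing matches, as this small sketch illustrates: find gives back None, while find_all gives back an empty list. The `sample` markup and the unused class name `t9` are hypothetical, and `'html.parser'` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two matching spans.
sample = '<span class="t3">Kunming</span><span class="t3">Kunming-Wuhua</span>'
soup = BeautifulSoup(sample, 'html.parser')

first = soup.find('span', class_='t3')            # a single Tag: the first match
every = soup.find_all('span', class_='t3')        # a list of all matches

missing = soup.find('span', class_='t9')          # no match: None
none_found = soup.find_all('span', class_='t9')   # no match: empty list

print(first.string, len(every), missing, len(none_found))
```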

7.3 Filter conditions for find and find_all

  • Keyword arguments: pass the attribute name as the keyword and the attribute value as its value, for example: soup.find_all('p', class_='t1')
  • attrs argument: put the attribute conditions into a dictionary and pass it as attrs, for example: soup.find_all('p', attrs={'class': 't1'})
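The two filter styles can be checked against each other, as in the sketch below. The `sample` paragraphs are hypothetical (one keeps the trailing space seen in the source's `class="t1 "`), and `'html.parser'` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two p tags with class t1 and one with class t2.
sample = ('<p class="t1 ">Python Teacher</p>'
          '<p class="t2">Company A</p>'
          '<p class="t1">Python Engineer</p>')
soup = BeautifulSoup(sample, 'html.parser')

by_keyword = soup.find_all('p', class_='t1')
by_attrs = soup.find_all('p', attrs={'class': 't1'})

# Both searches walk the same tree, so they return the very same Tag objects.
same = (len(by_keyword) == len(by_attrs)
        and all(a is b for a, b in zip(by_keyword, by_attrs)))
print(len(by_keyword), same)
```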

7.4 Getting a tag's attributes

  • By subscript
href = a['href']
  • Through the attrs attribute
href = a.attrs['href']
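Both forms return the same value, as the sketch below shows on a hypothetical one-link `sample`. It also uses `Tag.get`, which is not covered in the text above but is a convenient way to read an attribute that may be absent, since a subscript on a missing attribute raises a KeyError. `'html.parser'` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample link; the URL is a placeholder.
sample = '<a target="_blank" href="https://example.com/job.html">Python Engineer</a>'
soup = BeautifulSoup(sample, 'html.parser')
a = soup.find('a')

by_subscript = a['href']      # subscript access
by_attrs = a.attrs['href']    # the attrs dictionary holds the same value
onclick = a.get('onclick')    # attribute not present: get returns None

print(by_subscript, by_attrs, onclick)
```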

7.5 The string, strings, and stripped_strings properties and the get_text method

  • string: gets the non-tag string directly under a tag; returns a string.
  • strings: gets all descendant non-tag strings under a tag; returns a generator.
  • stripped_strings: gets all descendant non-tag strings under a tag, with surrounding whitespace removed; returns a generator.
  • get_text: gets all descendant non-tag strings under a tag; returns a plain string, not a generator.
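The four can be compared side by side on one small tree, as sketched below. The `sample` markup is hypothetical, and `'html.parser'` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a div with two children, one padded with spaces.
sample = '<div><p>  Python Engineer  </p><span>Kunming</span></div>'
soup = BeautifulSoup(sample, 'html.parser')
div = soup.find('div')
p = soup.find('p')

p_string = p.string                    # the single string inside <p>, spaces kept
div_string = div.string                # None: <div> has more than one child
all_strings = list(div.strings)        # generator of every descendant string
stripped = list(div.stripped_strings)  # same, with surrounding whitespace removed
text = div.get_text()                  # one plain str joining all the strings

print(repr(p_string), div_string)
print(all_strings)
print(stripped)
print(repr(text))
```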

Origin www.cnblogs.com/OliverQin/p/12595647.html