Python crawlers (Part 2) (suitable for beginners)

Disclaimer: This is an original article by the blogger, published under the CC 4.0 BY-SA copyright agreement. When reposting, please include the original source link and this statement.
Original link: https://blog.csdn.net/weixin_43701019/article/details/98961995

- Personal Notes


  • Parsing the data (step 1)

    • BeautifulSoup is used to parse web pages and extract data from them. BeautifulSoup is not part of the Python standard library, so it needs to be installed separately (for example with pip install beautifulsoup4).

    • Usage: BeautifulSoup(text_to_parse, 'parser') returns a <class 'bs4.BeautifulSoup'> object.

      import requests
      from bs4 import BeautifulSoup  # import the parser
      res = requests.get('URL')
      # the second argument names the HTML parser; soup is a bs4.BeautifulSoup object
      soup = BeautifulSoup(res.text, 'html.parser')
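
    • A minimal sanity check (my own addition, not in the original post): confirm that the request succeeded and that parsing really produced a bs4.BeautifulSoup object. The URL https://example.com is only a placeholder.

      import requests
      from bs4 import BeautifulSoup

      res = requests.get('https://example.com')      # placeholder URL
      print(res.status_code)                         # 200 means the request succeeded
      soup = BeautifulSoup(res.text, 'html.parser')
      print(type(soup))                              # <class 'bs4.BeautifulSoup'>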
      
  • Extracting data (step 2)

    • After obtaining soup (a bs4.BeautifulSoup object), we can extract the data we want using find() and find_all().
    • Usage: taking find() as an example, find(tag, attributes) looks for elements that match the given tag and attributes; at least one of the two parameters must be supplied (there are more kinds of filters than tags and attributes, but those two are usually enough). The difference between the two functions is that find() returns only the first match, while find_all() returns all matches (a short find_all() sketch follows the example code below).
    import requests
    from bs4 import BeautifulSoup
    url = 'URL'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    item = soup.find('div')  # find the first <div> tag (and its contents) in the fetched HTML
    
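    • A minimal find_all() sketch (my own addition; the class name 'books' is an assumption carried over from the complete example below): find_all() returns a list-like result, so it is usually iterated.

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://example.com')        # placeholder URL
    soup = BeautifulSoup(res.text, 'html.parser')
    items = soup.find_all('div', class_='books')     # every <div class="books"> element
    print(len(items))                                # how many matches were found
    for item in items:
        print(item.text)                             # text content of each match
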
    • The job above is not quite finished yet O(∩_∩)O, because item still contains the tag itself and its attributes. item is a Tag object, and Tag objects have the same find() and find_all() methods, used exactly as above. The difference is that a Tag object also has a text attribute, which holds the tag's text content.
    • The complete code for the entire process:
    import requests                # import the requests library
    from bs4 import BeautifulSoup  # import the BeautifulSoup library
    res = requests.get('URL')
    # returns a Response object and assigns it to res; URL is the address, here a book-recommendation site is used as the example
    html = res.text
    # extract res as a string (the HTML source)
    soup = BeautifulSoup(html, 'html.parser')
    # parse the page into a BeautifulSoup object
    items = soup.find_all(class_='books')    # extract the elements we want by matching the attribute class='books'
    for item in items:                       # iterate over the list items
        kind = item.find('h2')               # within each element, extract data by matching the <h2> tag
        title = item.find(class_='title')    # within each element, extract data by matching the attribute class_='title'
        brief = item.find(class_='info')     # within each element, extract data by matching the attribute class_='info'
        print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)  # print the book's category, name, link and description
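
    • One practical note (my own addition, not from the original post): find() returns None when nothing matches, so calling .text on the result would raise an AttributeError. A hedged variant of the loop that skips incomplete entries might look like this; the class names 'books', 'title' and 'info' are taken from the example above.

    for item in soup.find_all(class_='books'):
        kind = item.find('h2')
        title = item.find(class_='title')
        brief = item.find(class_='info')
        if not (kind and title and brief):   # skip entries missing any of the three parts
            continue
        print(kind.text, '\n', title.text, '\n', title.get('href', ''), '\n', brief.text)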
    
  • Process flow chart

[Figure: flow chart of the crawling process]
