BeautifulSoup: Classify parent and children element

KitKit :

I have a question about BeautifulSoup in Python 3. I have spent a couple of hours on it but have not solved it yet.

This is my soup:

print(soup.prettify())
# REMEMBER THIS SOUP IS DYNAMIC
# <html>
#  <body>
#   <div class="title" itemtype="http://schema.org/FoodEstablishment">
#    <div class="address" itemtype="http://schema.org/PostalAddress">
#      <div class="address-inset">
#        <p itemprop="name">33 San Francisco</p>
#      </div>
#    </div>
#    <div class="image">
#      <img src=""/>
#      <span class="subtitle">image subtitle</span>
#    </div>
#    <a itemprop="name">The Dormouse's story</a>
#   </div>
#  </body>
# </html>

I have to extract the two texts marked with itemprop="name": The Dormouse's story and 33 San Francisco. But I also need a way to identify which parent (FoodEstablishment or PostalAddress) each one belongs to.

Expected output:

{
   "FoodEstablishment": "The Dormouse's story",
   "PostalAddress": "33 San Francisco"
}

Remember that the soup is always dynamic and has many child elements in it.

JL Peyret :

Why use soup.find when you can use soup.select, get help from all the CSS wiz kids and test your criteria in a browser first?

There's a [performance benchmark][1] on SO and select is faster, or at least not significantly slower, so that's not it. Habit, I guess.

(The selector works just as well without the <p> tag qualifier, i.e. just "[itemprop=name]", which also matches elements other than <p>, such as the <a>.)

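found = soup.select("p[itemprop=name]")

results = dict()

for node in found:
    # only the direct parent is checked here; "?" marks a missing itemtype
    itemtype = node.parent.attrs.get("itemtype", "?")
    # keep the last URL segment, e.g. "PostalAddress"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)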

output:

{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}

That is what you asked for, but if several nodes shared the FoodEstablishment type, the last one would win, because the results are stored in a plain dictionary. A defaultdict holding lists might work better; that is for you to judge.

step 1, before Python: rock that CSS!

[![enter image description here][2]][2]

And if you need to check higher-up ancestors for itemtype, it helps to have HTML where that actually happens:

    <div class="address" itemtype="http://schema.org/PostalAddress">
      <div>
        <p itemprop="name">33 San Francisco</p>
      </div>
    </div>

found = soup.select("[itemprop=name]")

results = dict()

for node in found:
    # walk up the ancestors until one carries an itemtype attribute
    itemtype = None
    parent = node.parent
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent

    # fall back to "?" when no ancestor declares an itemtype
    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

same output.
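
For comparison, here is a self-contained sketch of the same ancestor lookup that lets BeautifulSoup do the walking via find_parent(itemtype=True) instead of a manual while loop. It is only one possible way to write it: the HTML constant is modeled on the question's soup (with the stray </p> replaced by </span>), and the function name names_by_itemtype is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; not the asker's real page.
HTML = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <div class="image">
    <img src=""/>
    <span class="subtitle">image subtitle</span>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""


def names_by_itemtype(html):
    """Map each itemtype (last URL segment) to the text of its name node."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for node in soup.select("[itemprop=name]"):
        # Nearest ancestor that has an itemtype attribute, or None.
        owner = node.find_parent(itemtype=True)
        itemtype = owner["itemtype"] if owner is not None else "?"
        results[itemtype.split("/")[-1]] = node.get_text(strip=True)
    return results


print(names_by_itemtype(HTML))
```

As with the loop version, a plain dictionary keeps only the last node seen for each itemtype.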

using a defaultdict

Everything stays the same except for declaring results and putting data into it.

from collections import defaultdict
...
results = defaultdict(list)
...

results[itemtype].append(node.text)

#### output (after I added a sibling to 33 San Francisco):

defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
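
Put together, the defaultdict variant might look like the sketch below. The markup is a trimmed-down stand-in for the question's soup with an assumed "34 LA" sibling added, and group_names is a hypothetical helper name, so treat it as an illustration rather than the exact code that produced the output above.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Stand-in markup with a second name under PostalAddress (assumed sibling).
HTML = """
<div itemtype="http://schema.org/FoodEstablishment">
  <div itemtype="http://schema.org/PostalAddress">
    <div>
      <p itemprop="name">33 San Francisco</p>
      <p itemprop="name">34 LA</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""


def group_names(html):
    """Collect every name text under the itemtype of its nearest ancestor."""
    soup = BeautifulSoup(html, "html.parser")
    results = defaultdict(list)
    for node in soup.select("[itemprop=name]"):
        # Climb until an ancestor declares itemtype or the tree runs out.
        parent = node.parent
        while parent is not None and parent.get("itemtype") is None:
            parent = parent.parent
        itemtype = parent["itemtype"] if parent is not None else "?"
        # Lists keep every match instead of overwriting earlier ones.
        results[itemtype.split("/")[-1]].append(node.text)
    return dict(results)


print(group_names(HTML))
```

Converting back to a plain dict at the end is optional; it just keeps later lookups of unknown keys from silently creating empty lists.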

  [1]: https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x
  [2]: https://i.stack.imgur.com/hXckC.png
