I have a question about BeautifulSoup in Python 3. I have spent a couple of hours trying, but I have not solved it yet.
This is my soup:
print(soup.prettify())
# REMEMBER THIS SOUP IS DYNAMIC
# <html>
# <body>
# <div class="title" itemtype="http://schema.org/FoodEstablishment">
# <div class="address" itemtype="http://schema.org/PostalAddress">
# <div class="address-inset">
# <p itemprop="name">33 San Francisco</p>
# </div>
# </div>
# <div class="image">
# <img src=""/>
# <span class="subtitle">image subtitle</span>
# </div>
# <a itemprop="name">The Dormouse's story</a>
# </div>
# </body>
# </html>
I have to extract the two texts with `itemprop="name"`: `The Dormouse's story` and `33 San Francisco`.
But I also need a way to tell which parent each one belongs to.
Expected output:
{
"FoodEstablishment": "The Dormouse's story",
"PostalAddress": "33 San Francisco"
}
Remember that the soup is always dynamic and has many child elements in it.
Why use `soup.find` when you can use `soup.select`, get help from all the CSS wiz kids and test your criteria in a browser first?
There's a [performance benchmark][1] on SO and `select` is faster, or at least not significantly slower, so that's not it. Habit, I guess.
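To make the comparison concrete, here is a minimal sketch showing that both styles pick out the same nodes. The HTML is a trimmed-down stand-in for the question's markup, and `html.parser` is simply an assumed parser choice; neither comes from the original post.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's markup (assumption for illustration).
html = """
<div itemtype="http://schema.org/FoodEstablishment">
  <div itemtype="http://schema.org/PostalAddress">
    <p itemprop="name">33 San Francisco</p>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with an attribute filter ...
via_find = soup.find_all(attrs={"itemprop": "name"})
# ... and the equivalent CSS attribute selector.
via_select = soup.select("[itemprop=name]")

print([n.text for n in via_find])
print([n.text for n in via_select])
```

Both lists come back in document order, so the choice between the two is mostly about which query syntax you find easier to read and test.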
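(The `p` tag qualifier is optional: plain `"[itemprop=name]"` works just as well, and it also matches the `<a itemprop="name">` node, which `"p[itemprop=name]"` would skip.)
found = soup.select("[itemprop=name]")
results = dict()
for node in found:
    # read itemtype from the direct parent; "?" if the parent has none
    itemtype = node.parent.attrs.get("itemtype", "?")
    # keep only the last URL segment, e.g. "PostalAddress"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text
print(results)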
output (when each node's direct parent carries the `itemtype`):
{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}
This is what you asked for, but if several nodes shared the same itemtype, say FoodEstablishment, the last one would win, because the results go into a dictionary. A defaultdict with a list might work better; that's for you to judge.
step 1, before Python: rock that CSS!
[![enter image description here][2]][2]
And if you need to check ancestors higher up for `itemtype`, it would help to have HTML where that happens:
<div class="address" itemtype="http://schema.org/PostalAddress">
<div>
<p itemprop="name">33 San Francisco</p>
</div>
</div>
found = soup.select("[itemprop=name]")
results = dict()
for node in found:
    itemtype = None
    parent = node.parent
    # climb the tree until an ancestor with an itemtype is found
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent
    # no ancestor had an itemtype: fall back to "?"
    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text
print(results)
same output.
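As a possible alternative to the hand-written `while` loop, BeautifulSoup's `find_parent` can do the upward search. The sketch below is not from the original answer; it assumes `html.parser` as the parser and uses a trimmed copy of the question's markup, and the name `owner` is just an illustrative variable.

```python
from bs4 import BeautifulSoup

# The question's markup, trimmed to the relevant parts.
html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = {}
for node in soup.select("[itemprop=name]"):
    # Nearest ancestor that has an itemtype attribute, or None if there is none.
    owner = node.find_parent(attrs={"itemtype": True})
    itemtype = owner["itemtype"] if owner is not None else "?"
    # Keep only the last URL segment as the key.
    results[itemtype.split("/")[-1]] = node.text

print(results)
```

The loop version makes the traversal explicit, while this one leans on the library; which reads better is a matter of taste.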
Using a `defaultdict`, everything stays the same except for declaring `results` and putting data into it.
from collections import defaultdict
...
results = defaultdict(list)
...
    results[itemtype].append(node.text)
#### output (after I added a sibling to 33 San Francisco):
defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
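For reference, here is one way the pieces above might fit together into a single runnable script. The HTML, including the extra `34 LA` sibling, is reconstructed for illustration, `html.parser` is an assumed parser, and `nearest_itemtype` is a hypothetical helper name wrapping the ancestor walk.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Reconstructed markup with an extra sibling next to "33 San Francisco" (assumption).
html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
      <p itemprop="name">34 LA</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def nearest_itemtype(node):
    """Walk up the tree and return the first itemtype found, or "?" (hypothetical helper)."""
    parent = node.parent
    while parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is not None:
            return itemtype
        parent = parent.parent
    return "?"


results = defaultdict(list)
for node in soup.select("[itemprop=name]"):
    key = nearest_itemtype(node).split("/")[-1]
    # Append instead of overwrite, so repeated itemtypes keep every value.
    results[key].append(node.text)

print(dict(results))
```

Converting to a plain `dict` before printing only tidies the display; the `defaultdict` itself can be used directly.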
[1]: https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x
[2]: https://i.stack.imgur.com/hXckC.png