Use XPath to extract all the text under a tag

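HTML structure

The page source below is part of a Weibo search result. We want to extract the text of each post, but the text inside the tag is split into pieces by child elements (here, the <em> tags that highlight the search keywords). How do we deal with this situation?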

<div class="content" node-type="like">
                <div class="info">
                    <div class="menu s-fr">
                        <a href="javascript:void(0);" action-type="fl_menu"><i class="wbicon">c</i></a>
                        <ul style="display:none;" node-type="fl_menu_right">
                            <li><a onclick="javascript:window.open('//service.account.weibo.com/reportspam?rid=4488118096861246&amp;type=1&amp;from=10501&amp;url=&amp;bottomnav=1&amp;wvr=6', 'newwindow', 'height=700, width=550, toolbar =yes, menubar=no, scrollbars=yes, resizable=yes, location=no, status=no');" href="javascript:void(0);">投诉</a></li>
                                                    </ul>
                    </div>
                    <div>
                        <a class="name" href="//weibo.com/2864108830?refer_flag=1001030103_" target="_blank" suda-data="key=tblog_search_weibo&amp;value=seqid:158609447248102927726|type:1|t:0|pos:2-0|q:%E7%97%98%E7%97%98%E5%8E%8B%E5%8A%9B|ext:cate:31,mpos:19,click:user_name" nick-name="一Z_c一">一Z_c一</a>
                        <a title="微博达人" href="//club.weibo.com/intro" target="_blank"><i class="icon-vip icon-daren"></i></a>
                        <!--广告微博加关注按钮 -->
                                            </div>
                </div>
                <p class="txt" node-type="feed_list_content" nick-name="一Z_c一">
                    忌甜忌辣忌油忌熬夜否则就会长<em class="s-color-red">痘痘</em>变丑 忌咖啡忌可可忌巧克力忌熬夜忌<em class="s-color-red">压力</em>忌受刺激忌紧张忌生气否则就会偏头痛 我也太难了.. ​                </p>
                                                <p class="from">
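To make the problem concrete, here is a minimal sketch using lxml. The snippet is a shortened, hypothetical version of the `<p class="txt">` element above, not the full page. It shows that `text()` only returns the text nodes that are direct children of `<p>`, so the highlighted keyword inside `<em>` is lost and the rest arrives in pieces.

```python
from lxml import html

# Shortened stand-in for the <p class="txt"> element shown above (an assumption for illustration)
snippet = '<p class="txt">忌甜忌辣<em class="s-color-red">痘痘</em>变丑</p>'
p = html.fromstring(snippet)

# text() selects only the direct text-node children of <p>
fragments = p.xpath("text()")
print(fragments)

# The keyword lives in a separate text node under <em>
em_text = p.xpath("em/text()")
print(em_text)
```

Joining these lists by hand is possible, but it gets awkward once the nesting is deeper than one level.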

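XPath extraction method

Wrap the path in XPath's string() function, which returns the concatenated text of the selected element and all of its descendants. The specific code is as follows: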

blog_content = str(blog.xpath("string(div[@class = 'card']//div/div[2]/p)").strip())
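The effect of `string()` can be checked on a small example. The following is a sketch under stated assumptions: the markup is a simplified, hypothetical stand-in for one post block, and the path `.//p[@class='txt']` is chosen for this snippet rather than taken from the original code. It also shows two alternatives that give the same text here: XPath's `normalize-space()` and the `text_content()` method of lxml's HTML elements.

```python
from lxml import html

# Simplified stand-in for one post block (an assumption, not the real Weibo markup)
snippet = (
    '<div class="card"><p class="txt">\n'
    '    忌熬夜否则就会长<em class="s-color-red">痘痘</em>变丑   </p></div>'
)
card = html.fromstring(snippet)

# string() concatenates every descendant text node of the first matched element
full = card.xpath("string(.//p[@class='txt'])").strip()

# normalize-space() does the same, then trims and collapses whitespace inside XPath
normalized = card.xpath("normalize-space(.//p[@class='txt'])")

# text_content() is the lxml.html method equivalent, called on the element itself
via_method = card.xpath(".//p[@class='txt']")[0].text_content().strip()

print(full)
```

Note that `string()` applied to a node set only uses the first node in document order, so the path should be specific enough to select the intended `<p>`.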

Here, blog is a single post block taken from the list of extracted posts. The list is built as follows:

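from lxml import html

# response is assumed to be the HTTP response for the search page (e.g. from requests)
tree = html.fromstring(response.text)
blog_list = tree.xpath("//div[@class='card-wrap']")
print(len(blog_list))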
for blog in blog_list:
	......

Origin blog.csdn.net/weixin_43165512/article/details/105339147