Extracting all the text between two bold (<b>) tags using XPath

Vighnesh.P Vicky :

This is my HTML element:

<div class="abstract-content selected" id="en-abstract">
    <p>
        <b>Introduction.</b> 
         Against the backdrop of increasing resistance to conventional antibiotics, bacteriocins represent an attractive alternative, given their potent activity, novel modes of action and perceived lack of issues with resistance.
        <b>Aim.</b>
         In this study, the nature of the antibacterial activity of a clinical isolate of 
        <i>Streptococcus gallolyticus</i>
         was investigated.
        <b>Methods.</b>
         Optimization of the production of an inhibitor from strain AB39 was performed using different broth media and supplements. Purification was carried out using size exclusion, ion exchange and HPLC. Gel diffusion agar overlay, MS/MS, 
        <i>de novo</i>
         peptide sequencing and genome mining were used in a proteogenomics approach to facilitate identification of the genetic basis for production of the inhibitor.
        <b>Results.</b>
         Strain AB39 was identified as representing 
        <i>Streptococcus gallolyticus</i>
         subsp. 
        <i>pasteurianus</i>
         and the successful production and purification of the AB39 peptide, named nisin P, with a mass of 3133.78 Da, was achieved using BHI broth with 10 % serum. Nisin P showed antibacterial activity towards clinical isolates of drug-resistant bacteria, including methicillin-resistant 
        <i>Staphylococcus aureus</i>
         , vancomycin-resistant 
        <i>Enterococcus</i>
         and penicillin-resistant 
        <i>Streptococcus pneumoniae</i>
         . In addition, the peptide exhibited significant stability towards high temperature, wide pH and certain proteolytic enzymes and displayed very low toxicity towards sheep red blood cells and Vero cells.
        <b>Conclusion.</b>
         To the best of our knowledge, this study represents the first production, purification and characterization of nisin P. Further study of nisin P may reveal its potential for treating or preventing infections caused by antibiotic-resistant Gram-positive bacteria, or those evading vaccination regimens.
    </p>
</div>

Here I want to extract the "headings" from the <b> tags and, for each heading, the text that follows it up to the next heading.

Example: "AIM" : In this study, the nature of the antibacterial activity of a clinical isolate of Streptococcus gallolyticus was investigated.

Is there any way of achieving this using XPath? Note: I am using Scrapy for the extraction.

I used

response.xpath("//p//text()[normalize-space()][preceding-sibling::*/self::b]")

which gives all the heading values as separate chunks:

[u' Against the backdrop of increasing resistance to conventional antibiotics, bacteriocins represent an attractive alternative, given their potent activity, novel modes of action and perceived lack of issues with resistance.', u' In this study, the nature of the antibacterial activity of a clinical isolate of ', u' was investigated.', u' Optimization of the production of an inhibitor from strain AB39 was performed using different broth media and supplements. Purification was carried out using size exclusion, ion exchange and HPLC. Gel diffusion agar overlay, MS/MS, ', u' peptide sequencing and genome mining were used in a proteogenomics approach to facilitate identification of the genetic basis for production of the inhibitor.', u' Strain AB39 was identified as representing ', u' subsp. ', u' and the successful production and purification of the AB39 peptide, named nisin P, with a mass of 3133.78 Da, was achieved using BHI broth with 10 % serum. Nisin P showed antibacterial activity towards clinical isolates of drug-resistant bacteria, including methicillin-resistant ', u', vancomycin-resistant ', u' and penicillin-resistant ', u'. In addition, the peptide exhibited significant stability towards high temperature, wide pH and certain proteolytic enzymes and displayed very low toxicity towards sheep red blood cells and Vero cells.', u' To the best of our knowledge, this study represents the first production, purification and characterization of nisin P. Further study of nisin P may reveal its potential for treating or preventing infections caused by antibiotic-resistant Gram-positive bacteria, or those evading vaccination regimens.\n \n\n \n ']

Any guidance is helpful!

E.Wiest :

The fastest way would probably be to get all the content with string(//p) and split it on the headings with ordinary string operations.
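
As a rough illustration of that split-based approach, here is a minimal sketch that uses only the Python standard library instead of Scrapy. The `SAMPLE` markup is a shortened, hypothetical stand-in for the abstract above, and `split_sections` is an illustrative name, not part of any library. It assumes the fragment is well-formed enough to parse as XML and that no heading string also occurs inside the body text. In Scrapy the two inputs would presumably come from `response.xpath('//b/text()').getall()` and `response.xpath('string(//p)').get()`.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the abstract markup in the question.
SAMPLE = """<div id="en-abstract"><p>
<b>Aim.</b> The activity of <i>Streptococcus gallolyticus</i> was investigated.
<b>Methods.</b> Purification used HPLC and <i>de novo</i> sequencing.
</p></div>"""


def split_sections(markup):
    """Map each <b> heading to the text between it and the next heading."""
    p = ET.fromstring(markup).find(".//p")
    headings = [b.text for b in p.findall("b")]  # comparable to //b/text()
    full_text = "".join(p.itertext())            # comparable to string(//p)
    sections = {}
    for i, heading in enumerate(headings):
        # Keep everything after the current heading ...
        rest = full_text.split(heading, 1)[1]
        # ... and cut it off at the next heading, if there is one.
        if i + 1 < len(headings):
            rest = rest.split(headings[i + 1], 1)[0]
        # Collapse whitespace, as normalize-space() would.
        sections[heading.rstrip(".")] = " ".join(rest.split())
    return sections


if __name__ == "__main__":
    for name, text in split_sections(SAMPLE).items():
        print(name, ":", text)
```

Because the split works on the flattened string, the italic text stays inside its section, which is what the question asks for.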

With XPath, you can try:

Get all the titles (returns 5 elements):

//b/text()

Get the corresponding descriptions (including the text inside the italic tags) with these XPaths (each returns a single string):

normalize-space(substring-before(substring-after(string(//p),//b[.="Introduction."]),//b[.="Aim."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Aim."]),//b[.="Methods."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Methods."]),//b[.="Results."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Results."]),//b[.="Conclusion."]))
normalize-space(substring-after(string(//p),//b[.="Conclusion."]))

If you don't know the heading text in advance, you can index the headings by position (//b[1], //b[2], ...). Use count(//b) to get the number of headings.
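
The positional variant can be generated rather than written by hand. The sketch below only builds the XPath strings. `section_xpaths` is a hypothetical helper name, and the heading count is assumed to come from elsewhere, for example `len(response.xpath('//b'))` in Scrapy. It uses `(//b)[i]`, which picks the i-th `<b>` in document order. `//b[i]` behaves the same here only because all the `<b>` elements share one parent.

```python
def section_xpaths(heading_count):
    """Build one XPath 1.0 expression per heading, addressed by position."""
    xpaths = []
    for i in range(1, heading_count + 1):
        # Text of the paragraph that follows the i-th heading.
        after = "substring-after(string(//p), (//b)[%d])" % i
        if i < heading_count:
            # Stop at the next heading for every section but the last.
            expr = "substring-before(%s, (//b)[%d])" % (after, i + 1)
        else:
            expr = after
        xpaths.append("normalize-space(%s)" % expr)
    return xpaths


if __name__ == "__main__":
    for xp in section_xpaths(5):
        print(xp)
```

Each generated string could then be passed to `response.xpath(...).get()` and paired with the matching entry from `//b/text()`.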

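EDIT: Alternative XPaths. Note that in XPath 1.0, normalize-space() applied to a node-set uses only its first node, so each of these returns the text up to the first nested tag (such as <i>), if there is one: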

normalize-space(//text()[preceding::b="Introduction." and following::b="Aim."])
normalize-space(//text()[preceding::b="Aim." and following::b="Methods."])
normalize-space(//text()[preceding::b="Methods." and following::b="Results."])
normalize-space(//text()[preceding::b="Results." and following::b="Conclusion."])
normalize-space(//text()[preceding::b="Conclusion."])
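
Since the text of a section is spread over several nodes whenever it contains an <i> element, another option is to walk the children of the <p> and group everything under the most recent <b>. The following is a minimal standard-library sketch of that idea, not Scrapy code. `SAMPLE` is a shortened, hypothetical stand-in for the abstract, `group_by_heading` is an illustrative name, and the markup is assumed to parse as XML.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the abstract markup in the question.
SAMPLE = """<div id="en-abstract"><p>
<b>Aim.</b> The activity of <i>Streptococcus gallolyticus</i> was investigated.
<b>Methods.</b> Purification used HPLC and <i>de novo</i> sequencing.
</p></div>"""


def group_by_heading(markup):
    """Collect the text after each <b>, including nested tags such as <i>."""
    p = ET.fromstring(markup).find(".//p")
    chunks = {}
    current = None
    for child in p:
        if child.tag == "b":
            # A new heading starts a new section; its tail is the first chunk.
            current = child.text.rstrip(".")
            chunks[current] = [child.tail or ""]
        elif current is not None:
            # Any other element (e.g. <i>) contributes its own text and tail.
            chunks[current].append("".join(child.itertext()))
            chunks[current].append(child.tail or "")
    # Join the chunks and collapse whitespace, as normalize-space() would.
    return {name: " ".join("".join(parts).split())
            for name, parts in chunks.items()}


if __name__ == "__main__":
    for name, text in group_by_heading(SAMPLE).items():
        print(name, ":", text)
```

Unlike the substring-based expressions, this does not depend on knowing the heading strings and is not affected by a heading word reappearing in the body text.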
