Collecting information by scraping

still_learning :

I am trying to collect the names of Italian politicians by scraping Wikipedia. What I need is to scrape all the parties from this page: https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito, and then, for each party listed there, scrape all the names of the politicians within that party.

For example, the first party on the page is https://it.wikipedia.org/wiki/Categoria:Politici_di_Alleanza_Democratica, so I would need to scrape that page and get the following names:

Ferdinando Adornato
Giuseppe Ayala
Giorgio Benvenuto
Enzo Bianco
Giorgio Bogi
Willer Bordon
Franco Castellazzi
Fabio Ciani
Oscar Giannino
Giorgio La Malfa
Miriam Mafai
Pierluigi Mantini
Ferdinando Schettino
Mariotto Segni
Giulio Tremonti

I wrote the following code:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")
array1 = {}
possible_links = soup.find_all('a')
for link in possible_links:
    url = link.get("href", "")
    if "/wiki/Provenienza" in url: # It is incomplete, as I should scrape also links including word "Politici di/dei"
        res1=requests.get("https://it.wikipedia.org"+url)
        print("https://it.wikipedia.org"+url)
        soup = bs(res1, "html.parser")
        possible_links1 = soup.find_all('a')
        for link in possible_links1:
            url_1 = link.get("href", "")
            array1[link.text.strip()] = url_1

but it does not work: it collects all the parties from the Wikipedia page I mentioned above, but when it scrapes each party's page it does not collect the names of the politicians within that party.

I hope you can help me.

QHarr :

You could collect the URLs and party names from the first page, then loop over those URLs and add the list of associated politician names to a dict keyed by party name. You would gain efficiency from using a Session object, which re-uses the underlying TCP connection.

from bs4 import BeautifulSoup as bs
import requests

results = {}

with requests.Session() as s: # a Session re-uses the underlying TCP connection across requests
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito')
    soup = bs(r.content, 'lxml')
    party_info = {i.text: 'https://it.wikipedia.org' + i['href'] for i in soup.select('.CategoryTreeItem a')} # dict of party name -> category url; href already begins with "/"

    for party, link in party_info.items():
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        results[party] = [i.text for i in soup.select('.mw-content-ltr .mw-content-ltr a')] # get the politicians' names
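
Note that this only reads the first listing page of each category; very large categories are split across several pages (the "pagina successiva" link), which would need an extra loop to follow. As a quick check, assuming the first subcategory's link text is 'Politici di Alleanza Democratica' (the actual key depends on the exact text scraped from the category tree), you could print one entry:

# hypothetical key: whatever link text was scraped from the category tree above
print(results.get('Politici di Alleanza Democratica'))
# expected something like: ['Ferdinando Adornato', 'Giuseppe Ayala', ...]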
