Use Scrapy to build your own dataset

1. Description

        When I first started working in industry, one of the things I quickly realized was that sometimes you have to collect, organize, and clean your own data. In this tutorial, we will collect data from a crowdfunding site called FundRazr. Like many websites, it has its own structure and form and plenty of useful, accessible data, but it has no structured API, so getting data out of it is hard. We will therefore scrape the site, pulling its unstructured web data into an ordered form to build our own dataset.

        To crawl the website, we will use Scrapy. In short, Scrapy is a framework designed to make building web crawlers easier and to take the pain out of maintaining them. Essentially, it lets you focus on data extraction using CSS selectors and XPath expressions rather than the complicated internals of how a spider should work. This blog post goes a bit beyond the excellent official tutorial in the Scrapy docs, so that if you need to scrape something harder, you can hopefully do it yourself. With that out of the way, let's get started. If you get lost, I recommend opening the accompanying video in a separate tab.

2. Getting started with installation (prerequisites)

        If you already have Anaconda and Google Chrome (or Firefox), skip ahead to creating a new Scrapy project.

        1. Install Anaconda (Python) for your operating system. You can download Anaconda from the official website and install it yourself, or follow the Anaconda installation tutorial below.

    Installing Anaconda

        2. Install Scrapy (Anaconda may already include it, but install it anyway just in case). You can install it from the terminal (Mac/Linux) or command line (Windows) by entering the following:

conda install -c conda-forge scrapy 
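If you are not using Anaconda, Scrapy can also be installed with pip, Python's standard package installer:

pip install Scrapy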

        3. Make sure you have the Google Chrome or Firefox browser. For this tutorial, I am using Google Chrome; if you don't have it, you can install it from this link.

3. Create a new Scrapy project

        1. Open a terminal (Mac/Linux) or command line (Windows). Navigate to the folder where you want the project to live (see the image below if you need help) and type:

    scrapy startproject fundrazr 

This starts a Scrapy project named fundrazr and creates a fundrazr directory with the following contents:

Fundrazr project directory
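For reference, the layout produced by scrapy startproject looks roughly like this (the exact files may differ slightly depending on your Scrapy version):

fundrazr/
    scrapy.cfg            # deploy configuration file
    fundrazr/             # the project's Python module (where you will add code)
        __init__.py
        items.py          # item definitions (edited in section 7)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spider from section 8 will live
            __init__.py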

4. Use the inspector on Google Chrome (or Firefox) to find good starting URLs

        In Scrapy, start_urls is the list of URLs the spider starts crawling from when no particular URLs are specified. We'll use each element of the start_urls list as a way to get links to individual campaigns.

        1. The image below shows that, depending on the category you choose, you get a different starting URL. The sections highlighted in black are the campaign categories available to scrape.

find a good start_url

        For this tutorial, the first URL in the start_urls list is:

        https://fundrazr.com/find?category=Health

        2. This part is about collecting additional URLs for the start_urls list. We figure out how to get to the next page of results so that those page URLs can also be added to start_urls.

Get additional elements to put in the list start_urls by checking the "Next" button

        The second starting URL is: https://fundrazr.com/find?category=Health&page=2

        The code below will be used in the spider code later in this tutorial. All it does is build the start_urls list. The variable npages is the number of additional pages (after the first page) from which we want to collect campaign links.

start_urls = ["https://fundrazr.com/find?category=Health"]

npages = 2

# This mimics getting the later listing pages via the "Next" button.
for i in range(2, npages + 2):
    start_urls.append("https://fundrazr.com/find?category=Health&page=" + str(i))

        Generate code for additional starting URLs based on the current structure of the site
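With npages = 2, the loop adds pages 2 and 3, so printing start_urls after it runs should show three listing URLs in total:

['https://fundrazr.com/find?category=Health',
 'https://fundrazr.com/find?category=Health&page=2',
 'https://fundrazr.com/find?category=Health&page=3']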

5. Scrapy Shell for Finding Individual Campaign Links

        The best way to learn how to extract data with Scrapy is to use the Scrapy shell. We'll use XPaths, which can be used to select elements from an HTML document.

        The first thing we'll try to get an XPath for is the individual campaign links. First, we inspect the page to find roughly where the campaign links sit in the HTML.

Find links to individual campaigns

        We'll use XPath to extract the part enclosed in the red rectangle below.

        Attached are the partial URLs that we will isolate

        In terminal type (mac/linux):

scrapy shell 'https://fundrazr.com/find?category=Health'

In the command line type (windows):

scrapy shell "https://fundrazr.com/find?category=Health"

Type the following in the scrapy shell (for help understanding the code, watch the video):

response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href").extract()

        As the site is updated over time, you will most likely end up with different partial URLs.

        The code below is used to get all the campaign links for a given starting URL (more on that later, in the spider section):

for href in response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href"):
    # add the scheme, e.g. https:
    url = "https:" + href.extract()
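As a quick sanity check while still in the Scrapy shell, you can collect every full campaign URL into a list and inspect a few of them (an illustrative snippet; the spider will do essentially the same thing later inside its parse method):

campaign_urls = ["https:" + href.extract() for href in response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href")]
len(campaign_urls)
campaign_urls[:3]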

Exit the Scrapy shell by typing exit().

        Exit Scrapy Shell

6. Inspect individual campaigns

        The previous section covered finding links to individual campaigns; this section covers scraping the individual campaign pages themselves.

  1. Next, we go to an individual campaign page (see the link below) to scrape it (I should note that some of these campaigns are difficult to view).

Please help to save Yvonne by Yvonne Foong

    2. Using the same inspection process as before, we inspect the title on the page.

Check campaign title

3. Now we will use the Scrapy shell again, but this time on an individual campaign. We do this because we want to understand how individual campaigns are formatted (including how the title is extracted from the page).

In terminal type (mac/linux):

scrapy shell 'https://fundrazr.com/savemyarm'

In the command line type (windows):

scrapy shell "https://fundrazr.com/savemyarm"

The code to get the campaign title is

response.xpath("//div[contains(@id, 'campaign-title')]/descendant::text()").extract()[0]

4. We can do the same for the rest of the page.

Amount raised:

response.xpath("//span[contains(@class,'stat')]/span[contains(@class, 'amount-raised')]/descendant::text()").extract()

Target:

response.xpath("//div[contains(@class, 'stats-primary with-goal')]//span[contains(@class, 'stats-label hidden-phone')]/text()").extract()

Currency Type:

response.xpath("//div[contains(@class, 'stats-primary with-goal')]/@title").extract()

Event end date:

response.xpath("//div[contains(@id, 'campaign-stats')]//span[contains(@class,'stats-label hidden-phone')]/span[@class='nowrap']/text()").extract()

Number of contributors:

response.xpath("//div[contains(@class, 'stats-secondary with-goal')]//span[contains(@class, 'donation-count stat')]/text()").extract()

Story:

response.xpath("//div[contains(<a data-cke-saved-href="http://twitter.com/id" href="http://twitter.com/id" class="af ov">@id</a>, 'full-story')]/descendant::text()").extract() 

URL:

response.xpath("//meta[<a data-cke-saved-href="http://twitter.com/property" href="http://twitter.com/property" class="af ov">@property</a>='og:url']/@content").extract() 

5. Exit the Scrapy shell by typing:

 exit() 

7. Items

        The main goal of scraping is to extract structured data from unstructured sources (typically web pages). Scrapy spiders can return the extracted data as Python dictionaries. While convenient and familiar, Python dictionaries lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders (this is copied almost verbatim from the excellent official Scrapy docs).

The code for the file items.py we will be modifying is here .

Save it in the fundrazr/fundrazr directory (overwriting the original items.py file).

The item class used in this tutorial (basically how we store data before outputting it) is shown below.

items.py code
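Since the items.py code is only shown as an image above, here is a minimal sketch of what such a file could look like. The class and field names are assumptions based on the quantities extracted in section 6; the linked file is the authoritative version.

import scrapy

class FundrazrItem(scrapy.Item):
    # One field per quantity extracted in section 6 (names are assumed, not confirmed)
    campaignTitle = scrapy.Field()
    amountRaised = scrapy.Field()
    goal = scrapy.Field()
    currencyType = scrapy.Field()
    endDate = scrapy.Field()
    numberContributors = scrapy.Field()
    story = scrapy.Field()
    url = scrapy.Field()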

8. Spider

        A spider is a class that you define that Scrapy uses to scrape information from a website (or set of websites). The code for our spider is as follows.

        Fundrazr scraping code, download code here .

        Save it in a file called fundrazr_scrape.py in the fundrazr/spiders directory .
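        For orientation, here is a condensed sketch of what such a spider could look like. It assumes the FundrazrItem class sketched in section 7 and reuses the XPaths worked out in the Scrapy shell; the downloadable file linked above is the authoritative version. The spider name my_scraper matches the crawl command used in the next section.

import scrapy
from fundrazr.items import FundrazrItem

class Fundrazr(scrapy.Spider):
    # Name used with "scrapy crawl my_scraper"
    name = "my_scraper"

    # Listing pages to start from: the first page plus npages more
    start_urls = ["https://fundrazr.com/find?category=Health"]
    npages = 2
    for i in range(2, npages + 2):
        start_urls.append("https://fundrazr.com/find?category=Health&page=" + str(i))

    def parse(self, response):
        # Follow every campaign link found on a listing page
        for href in response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href"):
            url = "https:" + href.extract()
            yield scrapy.Request(url, callback=self.parse_campaign)

    def parse_campaign(self, response):
        # Fill one item per campaign using the XPaths from section 6
        item = FundrazrItem()
        item['campaignTitle'] = response.xpath("//div[contains(@id, 'campaign-title')]/descendant::text()").extract()[0]
        item['amountRaised'] = response.xpath("//span[contains(@class,'stat')]/span[contains(@class, 'amount-raised')]/descendant::text()").extract()
        item['goal'] = response.xpath("//div[contains(@class, 'stats-primary with-goal')]//span[contains(@class, 'stats-label hidden-phone')]/text()").extract()
        item['currencyType'] = response.xpath("//div[contains(@class, 'stats-primary with-goal')]/@title").extract()
        item['endDate'] = response.xpath("//div[contains(@id, 'campaign-stats')]//span[contains(@class,'stats-label hidden-phone')]/span[@class='nowrap']/text()").extract()
        item['numberContributors'] = response.xpath("//div[contains(@class, 'stats-secondary with-goal')]//span[contains(@class, 'donation-count stat')]/text()").extract()
        item['story'] = " ".join(response.xpath("//div[contains(@id, 'full-story')]/descendant::text()").extract())
        item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()
        yield item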

        The current project should now have the following:

The files we will create/add

9. Run the spider

  1. Go to the fundrazr/fundrazr directory and type:
scrapy crawl my_scraper -o MonthDay_Year.csv 


        2. The data should be exported into the fundrazr/fundrazr directory.

data output location

10. Our data

  1. The output data for this tutorial should look roughly like the image below. The individual campaigns scraped will vary as the site is constantly updated. Also, there may be blank rows between individual campaigns because of how Excel interprets the CSV file.

Data should roughly be in this format.

        2. If you want a bigger file (produced by changing npages = 2 to npages = 450 and adding download_delay = 2), you can download one from my GitHub; it contains approximately 6,000 campaigns. The file is called MiniMorningScrape.csv (it's a large file).
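For reference, those two changes would be made inside the spider class, roughly like this (a sketch; download_delay is Scrapy's per-spider setting for the delay between requests):

    npages = 450        # collect campaign links from 450 listing pages instead of 2
    download_delay = 2  # wait 2 seconds between requests so the site is not hammered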

About 6000 campaigns were crawled

11. Conclusion

        Creating datasets can be a lot of work and is an often overlooked part of learning data science. One thing we didn't cover: although we scraped a lot of data, we still haven't cleaned it enough to analyze it. That's another blog post, though.


Origin blog.csdn.net/gongdiwudu/article/details/132262815