Web scraping with just one prompt using the ChatGPT Code Interpreter (step-by-step tutorial)

Whether or not you have programming knowledge, web scraping can seem like a complex and demanding task. However, ChatGPT with the Code Interpreter plugin can save us many lines of code and a lot of hassle, since it extracts information from web pages in seconds with a single prompt.

Next, we will go through three examples of how to use ChatGPT to perform web scraping in a simple and practical way, each explained step by step.

image-20230729114148746

Let's start...

1) Walmart

We will be using the Shop All Back to School section of Walmart’s online store. I've provided direct links below:

Shop All Back to School Products at Back to School - Walmart.com

Shop all the back to school stores in Back to School. Shop products like JLab Audio JBuddies Studio Kids On-Ear...

www.walmart.com

Step 1: Define the fields to extract

We need to define the information we wish to extract. This is very important as it will help us build our prompts in ChatGPT later

image-20230729114214780

In this case we will grab the product name and price

image-20230729114238643

Step 2: Check the code

Here we need to find the HTML code for one product, which we will later give to ChatGPT as an example

But before we do that, here are a few things to keep in mind:

To open the inspect element feature in Chrome on Windows, there are two keyboard shortcuts:

a) Press Ctrl + Shift + C

or

b) Press Ctrl + Shift + I

If you are using macOS, use:

a) Option + Command + C

or

b) Option + Command + I

With this in mind, we can now check out the Walmart website. Let’s review the following sections:

i) Product name

First we need to find the product name in the page's HTML

image-20230729114307316

Let's copy it so we can include it in our prompt. To copy the span tag, we hover over the section, right-click, and the following will appear:

image-20230729114324582

For now we just copy it and keep it handy so we can include it in the prompt later

<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>

ii) Price

We will do the same thing with the price field

image-20230729114344077

image-20230729114401416

We will keep the copied element of the price field for later use

$14.92

If you need to extract more parts from the web page, you should repeat the same steps we did for the product name and price

Tip: To quickly find a field in the code, place your mouse over the field on the page, right-click, and choose the Inspect option.

image-20230729114418499
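To make it clearer why the copied element is useful, here is a minimal sketch of how a script could find every element that carries the same attribute as our example. It uses Python's standard `html.parser` rather than BeautifulSoup so that it runs without installing anything. The class name `AttrTextFinder` is my own, and the sample HTML is a tiny stand-in for the real page.

```python
from html.parser import HTMLParser


class AttrTextFinder(HTMLParser):
    """Collect the text of every element whose attribute equals a given value."""

    def __init__(self, attr_name, attr_value):
        super().__init__()
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.tag = None      # tag name of the element being captured
        self.depth = 0       # nesting depth of same-named tags
        self.buffer = []     # text pieces of the current element
        self.matches = []    # finished results

    def handle_starttag(self, tag, attrs):
        if self.tag is not None:
            if tag == self.tag:
                self.depth += 1
        elif dict(attrs).get(self.attr_name) == self.attr_value:
            self.tag, self.depth, self.buffer = tag, 1, []

    def handle_data(self, data):
        if self.tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.tag is not None and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.matches.append("".join(self.buffer).strip())
                self.tag = None


# Stand-in HTML: one element like the one we copied, plus an unrelated span.
sample = (
    '<div><span data-automation-id="product-title" class="normal dark-gray">'
    'Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>'
    "<span>Sponsored</span></div>"
)

finder = AttrTextFinder("data-automation-id", "product-title")
finder.feed(sample)
print(finder.matches)
```

The attribute `data-automation-id="product-title"` comes from the element we copied above; everything else in the sketch is illustrative.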

Step 3: Save the HTML file

Since we'll be using a code interpreter, we need to attach a file to it. So what we do is save the page we want to scrape as an HTML file.

Go back to the page and use the keyboard shortcut Ctrl + S (Windows) or Command + S (macOS)

image-20230729114438575

Keyboard shortcut: Ctrl + S (Command + S on macOS)

Next, save the file in a local folder in HTML format

image-20230729114452620
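Before uploading, it can be worth checking that the saved file really contains the markup we copied in step 2. The following is a small sketch of such a check; the function name `count_marker` and the file name `walmart_page1.html` are hypothetical, and the demo writes a tiny stand-in file so the example runs on its own.

```python
import tempfile
from pathlib import Path


def count_marker(html_path, marker):
    """Return how many times `marker` appears in the saved HTML file."""
    text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    return text.count(marker)


# Demo with a stand-in file; in practice, point this at the page you saved.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "walmart_page1.html"  # hypothetical file name
    demo.write_text(
        '<span data-automation-id="product-title">A</span>'
        '<span data-automation-id="product-title">B</span>',
        encoding="utf-8",
    )
    found = count_marker(demo, 'data-automation-id="product-title"')

print(found)
```

If the count is zero for your real file, the product elements were probably not saved, and the prompt in the next step will have nothing to extract.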

Step 4: Upload HTML file + generate prompt

Now that we have defined the fields to scrape and found their HTML on the page, let's construct the prompt in ChatGPT

If you haven't activated the Code Interpreter yet, follow the instructions below. Otherwise, I recommend you skip this part and go directly to building the prompt

i) Settings

image-20230729114505103

ii) Open the code interpreter

image-20230729114517612

image-20230729114528373

After activating the Code Interpreter in ChatGPT, let's upload the HTML file we saved in step 3

image-20230729114545523

image-20230729114556425

Now, let's build the prompt, using the product name and price fields and the HTML for each of them (if in doubt, see step 2)

image-20230729114608261

Prompt: From the HTML file, extract the product names and prices, put the data in a table and export it as a CSV file

This is a product element: <span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>

This is a price element:

$14.92

If the product price is missing, leave the price as empty data

The prompt has four parts.

In the first paragraph, I state that I have uploaded an HTML file and ask ChatGPT to extract the product names and prices, and then to export the data to a CSV file

In the second and third paragraphs, I give ChatGPT an example of the HTML for the product name and price fields. Here each product name is a span tag and each price is a div tag

In the last paragraph, I ask it to leave the price empty if a product has no price

Keep this prompt in mind, as the upcoming examples use the same structure and change only the fields and their HTML
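For readers curious about what happens behind the prompt, here is a rough sketch of the kind of script that could do the same job: pair each product name with its price, leave the price empty when it is missing, and produce CSV text. This is not the code ChatGPT actually generates. It uses the standard `html.parser` and `csv` modules instead of BeautifulSoup, it assumes the price element follows the title in the page, and the price marker `data-automation-id="product-price"` is an assumption of mine (the article only shows the price text).

```python
import csv
import io
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Pair each product title with the price that follows it, if any."""

    TITLE = ("data-automation-id", "product-title")  # from the copied span
    PRICE = ("data-automation-id", "product-price")  # ASSUMPTION: hypothetical marker

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per product
        self.field = None   # "name" or "price" while capturing text
        self.tag = None
        self.depth = 0
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if self.field:
            if tag == self.tag:
                self.depth += 1
            return
        attrs = dict(attrs)
        for field, (key, value) in (("name", self.TITLE), ("price", self.PRICE)):
            if attrs.get(key) == value:
                self.field, self.tag, self.depth, self.buffer = field, tag, 1, []
                break

    def handle_data(self, data):
        if self.field:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if not self.field or tag != self.tag:
            return
        self.depth -= 1
        if self.depth:
            return
        text = "".join(self.buffer).strip()
        if self.field == "name":
            # Price stays empty unless a price element shows up afterwards.
            self.rows.append({"name": text, "price": ""})
        elif self.rows and not self.rows[-1]["price"]:
            self.rows[-1]["price"] = text
        self.field = None


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Stand-in page: the second product has no price element.
html = """
<div><span data-automation-id="product-title">Backpack</span>
     <div data-automation-id="product-price">$14.92</div></div>
<div><span data-automation-id="product-title">Notebook</span></div>
"""

parser = ProductParser()
parser.feed(html)
csv_text = to_csv(parser.rows)
print(csv_text)
```

The product names in the stand-in page are shortened for readability; only the $14.92 price comes from the example above.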

Result:

image-20230729114646951

image-20230729114658693

Download and open the CSV file

image-20230729114722203

Finally, we successfully web scraped the products and their respective prices and then exported them to a CSV file as shown in the table image. Please note that the products we used as examples are included!

Bonus

The previous steps let us scrape the first page of the Walmart section. To extract data from the second page, we follow the same steps as before, but we must remember to pick a product from this new page and include it as the example in the prompt

Page 2 of the "Back to School" section on Walmart's website

i) Product name

image-20230729114743309

<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Minecraft Boys Cliff Goats Graphic T-Shirt, 2-Pack, Sizes 4–18 </span>

ii) Price

image-20230729114755686

$13.96

Just like the first page, we need to save the second page as an HTML file (see step 3 if you have any questions)

Prompt

From the HTML file, extract the product names and prices, put the data in a table and export it as a CSV file.

This is a product element: <span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Minecraft Boys Cliff Goats Graphic T-Shirt, 2-Pack, Sizes 4–18</span>

This is a price element:

$13.96

If the product price is missing, leave the price as empty data

image-20230729114811397

If you wish to merge two tables into one, you can ask ChatGPT to do the following:

image-20230729114842177
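If you prefer to merge the two downloaded CSV files yourself, the idea can be sketched in a few lines of Python with the standard `csv` module. The function name `merge_csv` is my own, the column names are an assumption about what the exported files contain, and the product names are shortened for the demo; the prices are the ones from the two pages above.

```python
import csv
import io


def merge_csv(*csv_texts):
    """Concatenate CSV tables that share the same header row."""
    merged, header = [], None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("CSV files have different columns")
        merged.extend(reader)
    return header, merged


# Stand-in contents for the page 1 and page 2 exports.
page1 = "name,price\nBackpack,$14.92\n"
page2 = "name,price\nT-Shirt 2-Pack,$13.96\n"

header, rows = merge_csv(page1, page2)
print(header, len(rows))
```

With real files, you would read each file's text first and pass it in; the check on the header guards against merging tables whose columns do not match.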

2) Target

In the second example, we will scrape the cell phones section of the Target website. We'll move faster this time; if in doubt, see the steps in the first example (Walmart)

Here is the direct link:

Mobile: Target

Shop Target for phones you will love at great low prices. Choose from same-day delivery, drive up or order pickup...

www.target.com

Step 1: Let’s identify the fields we want to extract

a) Product b) Brand c) Price

image-20230729114857810

Now, let's inspect the HTML for each target field (see step 2)

Keyboard shortcut for inspecting: Ctrl + Shift + C (Windows) or Option + Command + I (macOS)

Step 2: Check the code

i) Products

We find the HTML and its tags. We copy and save the HTML so we can include it in the ChatGPT prompt later (if in doubt, see step 2 of the first example, Walmart)

image-20230729114910782

<a href="/p/tracfone-prepaid-apple-iphone-se-2nd-gen-64gb-cdma-black/-/A-82040163#lnk=sametab" aria-label="Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black" class="styles__StyledLink-sc-vpsldm-0 styles__StyledTitleLink-sc-14ktig2-1 fajhWk gkIDAW h-display-block h-text-bold h-text-bs" data-test="product-title">Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black</a>

ii) Brand

image-20230729114951027

<a href="/b/apple/-/N-5y3ej" class="styles__StyledLink-sc-vpsldm-0 lnixiM h-text-sm h-text-grayDark" data-test="@web/ProductCard/ProductCardBrandAndRibbonMessage/brand">Apple</a>

iii) Price

image-20230729115002100

$189.99
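The elements we copied carry several attributes, and not all of them are equally helpful for identifying a field. As a rough illustration, the sketch below lists the attributes of a copied snippet and keeps only the `data-` ones. The class name `FirstTagAttrs` is my own, and the idea that class names such as `lnixiM` are auto-generated and may change is my assumption, not something the Target page states.

```python
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Remember the tag name and attributes of the first element seen."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if self.tag is None:
            self.tag, self.attrs = tag, dict(attrs)


# The brand element copied in step 2.
snippet = (
    '<a href="/b/apple/-/N-5y3ej" '
    'class="styles__StyledLink-sc-vpsldm-0 lnixiM h-text-sm h-text-grayDark" '
    'data-test="@web/ProductCard/ProductCardBrandAndRibbonMessage/brand">Apple</a>'
)

inspector = FirstTagAttrs()
inspector.feed(snippet)

# Keep only data-* attributes, which describe what the element is for.
stable = {k: v for k, v in inspector.attrs.items() if k.startswith("data-")}
print(inspector.tag, stable)
```

In practice we simply paste the whole element into the prompt and let ChatGPT decide which attribute to match on; the sketch only shows what information the element contains.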

Step 3: Save the HTML file

Save the page to scrape as an HTML file (see step 3 in the Walmart example )

Step 4: Upload HTML file + generate prompt

We'll construct the prompt, but unlike the previous example, we'll include a field for the phone brand (see step 4 of the Walmart example).

Upload the HTML file and add the HTML for each field to scrape (product name, brand and price)

image-20230729115019331

Prompt: From the HTML file, extract the product name, brand and price, put the data in a table and export it as a CSV file. Extract all products

This is a product element: <a href="/p/tracfone-prepaid-apple-iphone-se-2nd-gen-64gb-cdma-black/-/A-82040163#lnk=sametab" aria-label="Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black" class="styles__StyledLink-sc-vpsldm-0 styles__StyledTitleLink-sc-14ktig2-1 fajhWk gkIDAW h-display-block h-text-bold h-text-bs" data-test="product-title">Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black</a>

This is a brand element: <a href="/b/apple/-/N-5y3ej" class="styles__StyledLink-sc-vpsldm-0 lnixiM h-text-sm h-text-grayDark" data-test="@web/ProductCard/ProductCardBrandAndRibbonMessage/brand">Apple</a>

This is a price element:

$189.99

If the product price is missing, leave the price as empty data

Result

image-20230729115039728

Download and open the CSV file

image-20230729115054753

The results were great: we were able to scrape all the data from the Target page

image-20230729115105916

3) Amazon

In our final example, we will perform web scraping of Kindle books. It might be fun to see which books are the most popular and then use ChatGPT to create stories with different trending topics

Here is the link:

Amazon.com: Kindle eBooks


www.amazon.com

Step 1: Let’s identify the fields we want to extract

a) Product or title b) Author c) Price

image-20230729115118872

Step 2: Check the code

i) Product or title:

We find the HTML and its tags. We copy and keep the HTML so we can include it in the ChatGPT prompt later (if in doubt, see step 2 of the first example, Walmart)

The keyboard shortcuts for inspecting are Ctrl + Shift + C (Windows) or Option + Command + I (macOS). You can refer to step 2 for more details

image-20230729115130481

<span class="a-size-base-plus a-color-base a-text-normal">Lessons in Chemistry: A Novel</span>

ii) Author

image-20230729115141629

<a class="a-size-base a-link-normal s-underline-text s-underline-link-text s-link-style" href="/Bonnie-Garmus/e/B09964CPY4?ref=sr_ntt_srch_lnk_1&qid=1690568130&sr=8-1">Bonnie Garmus</a>

iii) Price

image-20230729115210781

Note that for this example we are only extracting the integer part of the price

14.
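Since we keep only the whole-number part of the price, here is a small sketch of how that part can be pulled out of a price string in Python. The function name `whole_price` is hypothetical, and the inputs other than `14.` are made-up examples.

```python
import re


def whole_price(text):
    """Return the digits before the decimal point, or "" if none are found."""
    match = re.search(r"\d[\d,]*", text)
    return match.group(0).replace(",", "") if match else ""


print(whole_price("14."))        # the element text copied above
print(whole_price("$1,299.00"))  # made-up example with a thousands separator
print(whole_price("Free"))       # made-up example with no digits
```

Returning an empty string when no digits are found matches the rule we use in the prompts: a missing price is left as empty data.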

Step 3: Save the HTML file

We save the web page to be scraped as an HTML file. To do this we use the shortcut Ctrl + S (Command + S on macOS) on the page we want to save. Remember to save the file in HTML format (see step 3 of the Walmart example for details)

Step 4: Upload HTML file + generate prompt

Now, let's build the prompt based on the fields we want to extract from Amazon's Kindle books section. In this example, we want to extract the title, author and price.

Next, we upload the HTML file and add the HTML for each required field (title, author and price)

image-20230729115223012

Prompt: From the HTML file, extract the product name, author and price, put the data in a table and export it as a CSV file.

This is a product element: <span class="a-size-base-plus a-color-base a-text-normal">Lessons in Chemistry: A Novel</span>

This is an author element: <a class="a-size-base a-link-normal s-underline-text s-underline-link-text s-link-style" href="/Bonnie-Garmus/e/B09964CPY4?ref=sr_ntt_srch_lnk_1&qid=1690568130&sr=8-1">Bonnie Garmus</a>

This is a price element: 14.

If the product price is missing, leave the price as empty data

Notice that the prompts in all three examples have the same structure
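Because the structure repeats, the prompt can be treated as a template. The sketch below builds it from a list of field names and one copied example per field; the function name `build_prompt` and the exact wording are my own paraphrase of the prompts used in this story, not a required format.

```python
def build_prompt(fields, examples):
    """Build a scraping prompt.

    fields:   list of field names, e.g. ["product name", "price"]
    examples: dict mapping each field name to the element copied from the page
    """
    parts = [
        "From the HTML file, extract the " + ", ".join(fields)
        + ", put the data in a table and export it as a CSV file."
    ]
    for field in fields:
        parts.append(f"This is an example {field} element: {examples[field]}")
    parts.append("If a value is missing, leave it as empty data.")
    return "\n\n".join(parts)


prompt = build_prompt(
    ["product name", "price"],
    {
        "product name": '<span class="a-size-base-plus a-color-base a-text-normal">'
                        "Lessons in Chemistry: A Novel</span>",
        "price": "14.",
    },
)
print(prompt)
```

To reuse it for another site, only the field names and the copied elements change, which is exactly what we did by hand for Walmart, Target and Amazon.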

Result

image-20230729115237256

We download the CSV file

image-20230729115248311

We succeeded!

image-20230729115300816

Summary and recommendations

  1. If we give ChatGPT the URL directly, it cannot scrape the page, even with the Code Interpreter activated. This is why we download the pages to be scraped as HTML files
  2. ChatGPT may initially fail to recognize the tags of the fields to extract and may return wrong information. When this happens, I recommend opening a new chat and running the prompt again
  3. Remember that the Code Interpreter uses Python and libraries such as BeautifulSoup for web scraping
  4. This method is not intended to replace traditional web scraping; however, it saves us time and lines of code
  5. The three web scraping examples in this story are intended both for people who work in programming and for people who know little or nothing about it
  6. There is a lot we can accomplish with web scraping: we can research products for dropshipping, create Kindle books informed by the best sellers, analyze competitor prices, track certain products, and more

This complete guide is for anyone who wants an alternative way to do web scraping, using ChatGPT. No prior programming knowledge is necessary, just curiosity and patience. See you in the next story, best wishes!


Origin blog.csdn.net/shupan/article/details/132009068