Whether or not you have programming experience, web scraping can seem like a complex and demanding task. However, ChatGPT with the Code Interpreter plugin can save us many lines of code and a lot of hassle, as it extracts information from web pages in seconds from a single prompt.
Next, we will walk through three examples of how to use ChatGPT to perform web scraping in a simple and practical way, each explained step by step.
Let's start...
1) Walmart
We will be using the Shop All Back to School section of Walmart's online store. I've provided the direct link below:
Shop All Back to School Products at Back to School - Walmart.com
Step 1: Define the fields to extract
We need to define the information we wish to extract. This is very important, as it will help us build our ChatGPT prompts later.
In this case we will grab the product name and price.
Step 2: Check the code
Here we need to find the HTML code for one product (as an example) so that we can paste it into ChatGPT later.
But before we do that, here are a few things to keep in mind:
To access the inspect element feature in Chrome, if you're using Windows, there are two keyboard shortcut options:
a) Press Ctrl + Shift + C
or
b) Press Ctrl + Shift + I
If you are using macOS, use:
a) Option + Command + C
or
b) Option + Command + I
With this in mind, we can now check out the Walmart website. Let’s review the following sections:
i) Product name
In this case we need to find the product name in the page's code.
Let's copy it and include it in our prompt. To copy the span tag, we hover over the element in the inspector, right-click, and choose the copy option from the menu that appears.
For now we're just copying it. For practical purposes, we'll keep it handy to include in the prompt later.
<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>
ii) Price
We will do the same thing with the price field
We will keep the copied element of the price field for later use
$14.92
If you need to extract more fields from the web page, repeat the same steps we followed for the product name and price.
Tip: To quickly find a field in the code panel, simply place your mouse over the field on the page, right-click, and select the Inspect option.
Step 3: Save the HTML file
Since we'll be using the code interpreter, we need to attach a file to it. So what we do is save the page we want to scrape as an HTML file.
Go back to the page and use the keyboard shortcut Ctrl + S (Windows) or Command + S (macOS).
Next, save the file in a local folder in HTML format.
Step 4: Upload HTML file + generate prompt
Now that we have defined the fields to scrape and found their HTML code on the page, let's construct the prompt in ChatGPT.
If you haven't activated the code interpreter yet, follow the instructions below. Otherwise, I recommend you skip this section and go directly to building the prompt.
i) Settings
ii) Open the code interpreter
After activating the code interpreter in ChatGPT, let's upload the HTML file we saved in step 3.
Now, let's build the prompt, taking into account the product name and price, and the code for each field (if in doubt, see step 2).
Prompt: From the HTML file, extract the product names and prices, put the data on a table and export as a CSV file
This is a product element: <span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>
Here are the elements of price:
$14.92
If the product price is missing, leave the price as empty data
The prompt has four sections.
In the first paragraph, I specify that I have uploaded an HTML file and ask it to grab the product names and prices. I then ask it to export the data to a CSV file.
In the second and third paragraphs, I provide ChatGPT with an example of the HTML structure for the product name and price fields. We see that each product name is in a span tag and each price is in a div tag.
In the last paragraph, I ask it to leave the price empty if it finds no price for a product.
Keep this prompt in mind, as the upcoming examples use the same structure and only change the fields and their code.
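For readers curious about what the code interpreter might do with a prompt like this, here is a minimal sketch using BeautifulSoup. It is not ChatGPT's actual generated code. The inline sample HTML, the `product-card` wrapper, and the `product-price` attribute value are assumptions made for illustration; only the `product-title` span comes from the page we inspected.

```python
import csv
import io

from bs4 import BeautifulSoup

# Tiny stand-in for the saved Walmart page. The wrapper <div class="product-card">
# and the "product-price" attribute value are assumptions made for this sketch.
SAMPLE_HTML = """
<div class="product-card">
  <span data-automation-id="product-title">Nintendo Kids Super Mario Bros. Mario World 17" Laptop Backpack</span>
  <div data-automation-id="product-price">$14.92</div>
</div>
<div class="product-card">
  <span data-automation-id="product-title">Product without a price</span>
</div>
"""


def extract_products(html):
    """Return a list of (name, price) tuples; price is "" when missing."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for title in soup.find_all("span", attrs={"data-automation-id": "product-title"}):
        # Assumed: the nearest enclosing <div> is the product card.
        card = title.find_parent("div")
        price_tag = None
        if card is not None:
            price_tag = card.find("div", attrs={"data-automation-id": "product-price"})
        price = price_tag.get_text(strip=True) if price_tag else ""
        rows.append((title.get_text(strip=True), price))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["product_name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


products = extract_products(SAMPLE_HTML)
print(to_csv(products))
```

With a real saved page, you would read the HTML file from disk instead of using `SAMPLE_HTML`, and the selectors would need to match whatever you copied in step 2.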
Result:
Download and open the CSV file
Finally, we successfully scraped the products and their respective prices and exported them to a CSV file, as shown in the table image. Please note that the products we used as examples are included!
Bonus
The previous steps enabled us to scrape the first page of the Walmart section. To extract data from the second page, we perform the same steps as before, but we must not forget to identify a product on this new page and include it as an example in the prompt.
Page 02 of the "Back to School" section on Walmart's website
i) Product name
<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Minecraft Boys Cliff Goats Graphic T-Shirt, 2-Pack, Sizes 4–18 </span>
ii) Price
$13.96
Just like the first page, we need to save the second page as an HTML file (check out step 3 if you have any questions).
Prompt:
From the HTML file, extract the names of the products and prices, put the data on a table and export it to a CSV file.
This is an element of a product: <span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">Minecraft Boys Cliff Goats Graphic T-Shirt, 2-Pack, Sizes 4–18</span>
Here are the elements of price:
$13.96
If the product price is missing, leave the price as empty data
If you wish to merge the two tables into one, you can ask ChatGPT to combine them in a follow-up prompt.
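If you would rather merge the two exported files yourself, the idea can be sketched with Python's standard `csv` module. This is only an illustration: the inline CSV strings stand in for the two downloaded files, the column names are assumptions, and ChatGPT may well use a different approach (for example pandas).

```python
import csv
import io

# Stand-ins for the two exported files; in practice these would be the CSVs
# downloaded for page 1 and page 2. Column names and product labels are assumed.
PAGE_1_CSV = "product_name,price\nLaptop Backpack,$14.92\n"
PAGE_2_CSV = "product_name,price\nGraphic T-Shirt 2-Pack,$13.96\n"


def merge_csv(*csv_texts):
    """Concatenate the rows of CSV tables that share the same header."""
    merged = []
    for text in csv_texts:
        merged.extend(csv.DictReader(io.StringIO(text)))
    return merged


rows = merge_csv(PAGE_1_CSV, PAGE_2_CSV)
for row in rows:
    print(row["product_name"], row["price"])
```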
2) Target
In the second example, we will scrape the cell phones section of the Target website. We'll go straight ahead, so if in doubt, see the steps in the first (Walmart) example.
Here is the direct link:
Mobile: Target
Step 1: Let’s identify the fields we want to extract
a) Product b) Brand c) Price
Now, let's inspect the code of each field on the Target page (see step 2).
Keyboard shortcut for inspecting: Ctrl + Shift + C (Windows) or Option + Command + I (macOS)
Step 2: Check the code
i) Products
We locate the tag and its code, then copy and save it so we can incorporate it into the ChatGPT prompt later (if in doubt, check out step 2 of the first Walmart example).
<a href="/p/tracfone-prepaid-apple-iphone-se-2nd-gen-64gb-cdma-black/-/A-82040163#lnk=sametab" aria-label="Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black" class="styles__StyledLink-sc-vpsldm-0 styles__StyledTitleLink-sc-14ktig2-1 fajhWk gkIDAW h-display-block h-text-bold h-text-bs" data-test="product-title">Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black</a>
ii) Brand
<a href="/b/apple/-/N-5y3ej" class="styles__StyledLink-sc-vpsldm-0 lnixiM h-text-sm h-text-grayDark" data-test="@web/ProductCard/ProductCardBrandAndRibbonMessage/brand">Apple</a>
iii) Price
$189.99
Step 3: Save the HTML file
Save the page to scrape as an HTML file (see step 3 in the Walmart example).
Step 4: Upload HTML file + generate prompt
We'll construct the prompt, but unlike the previous example, we'll include a field for the phone brand (see step 4 of the Walmart example).
Upload the HTML file and add the code for each field to scrape (product name, brand and price).
Prompt: From the HTML file, extract the product name, brand, price, put the data on a table and export as a CSV file. Extract all products
This is a product element: <a href="/p/tracfone-prepaid-apple-iphone-se-2nd-gen-64gb-cdma-black/-/A-82040163#lnk=sametab" aria-label="Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black" class="styles__StyledLink-sc-vpsldm-0 styles__StyledTitleLink-sc-14ktig2-1 fajhWk gkIDAW h-display-block h-text-bold h-text-bs" data-test="product-title">Tracfone Prepaid Apple iPhone SE 2nd Gen (64GB) CDMA - Black</a>
Here are the elements of the brand: <a href="/b/apple/-/N-5y3ej" class="styles__StyledLink-sc-vpsldm-0 lnixiM h-text-sm h-text-grayDark" data-test="@web/ProductCard/ProductCardBrandAndRibbonMessage/brand">Apple</a>
The following are the elements of price:
$189.99
If product price is missing, leave the price as empty data
Result:
Download and open the CSV file
The results were great: we were able to scrape all the data from the Target page.
3) Amazon
In our final example, we will perform web scraping of Kindle books. It might be fun to see which books are the most popular and then use ChatGPT to create stories with different trending topics
Here is the link:
Amazon.com: Kindle eBooks
Step 1: Let’s identify the fields we want to extract
a) Product or title b) Author c) Price
Step 2: Check the code
i) Product or title:
We locate the tag and its code, then copy and keep it so we can later incorporate it into the ChatGPT prompt (if in doubt, check out step 2 of the first Walmart example).
The keyboard shortcuts for inspecting are: Ctrl + Shift + C (Windows) or Option + Command + I (macOS). You can refer to step 2 for more details.
<span class="a-size-base-plus a-color-base a-text-normal">Lessons in Chemistry: A Novel</span>
ii) Author
<a class="a-size-base a-link-normal s-underline-text s-underline-link-text s-link-style" href="/Bonnie-Garmus/e/B09964CPY4?ref=sr_ntt_srch_lnk_1&qid=1690568130&sr=8-1">Bonnie Garmus</a>
iii) Price
Note that for this example we are only extracting the integer part of the price
14.
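Taking only the integer part of a price can be sketched with a small helper. This is an illustrative assumption about how the cleanup might be done, not what ChatGPT necessarily runs; the function name `integer_price` is hypothetical. It returns `None` when no price is found, which `csv.writer` writes as an empty cell, matching the "leave the price as empty data" rule used in the prompts.

```python
import re


def integer_price(raw):
    """Return the whole-number part of a price string, or None if there is none."""
    if not raw:
        return None
    # Match the first run of digits, allowing thousands separators such as "1,299".
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(integer_price("14."))
print(integer_price("$1,299.00"))
print(integer_price(None))
```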
Step 3: Save the HTML file
We save the web page to be scraped as an HTML file. To do this we use the shortcut Ctrl + S (Command + S on macOS) on the page we want to save. Let's not forget to save the file as HTML (check the details in step 3 of the Walmart example).
Step 4: Upload HTML file + generate prompt
Now, let's build the prompt based on the fields we want to extract from Amazon's Kindle books section. In this example, we want to extract the title, author and price.
Next, we upload the HTML file and add the code for each required field (title, author, and price).
Prompt: From the HTML file, extract the product name, author and price, put the data on a table and export as a CSV file.
This is an element of a product: <span class="a-size-base-plus a-color-base a-text-normal">Lessons in Chemistry: A Novel</span>
Here are the author's elements: <a class="a-size-base a-link-normal s-underline-text s-underline-link-text s-link-style" href="/Bonnie-Garmus/e/B09964CPY4?ref=sr_ntt_srch_lnk_1&qid=1690568130&sr=8-1">Bonnie Garmus</a>
The following are the price elements: 14.
If the product price is missing, leave the price as empty data
Note that the prompts in all three examples share the same structure.
Result:
We download the CSV file
We succeeded!
Summary and recommendations
- If we paste the URL directly into ChatGPT, it will not be able to perform web scraping, even with the code interpreter activated. For this reason we download the pages to be scraped as HTML files.
- ChatGPT may initially fail to recognize the tags of the fields to extract and may give us wrong information. In that case, I recommend opening another chat and running the prompt again.
- We should remember that the code interpreter uses Python and libraries like BeautifulSoup for web scraping.
- This method is not intended to replace traditional web scraping; however, it will save us time and lines of code.
- The three web scraping examples in this story are intended both for people who work in programming and for people who know little or nothing about the field.
- It's interesting what we can accomplish with web scraping: as I mentioned above, we can focus on dropshipping, create Kindle books, study best-selling books, analyze competitor prices, track certain products, and more.
This complete guide is for anyone who wants an alternative approach to web scraping using ChatGPT. No prior programming knowledge is necessary, just curiosity and patience. See you in the next story, best wishes!