Selenium & Beautiful Soup Returning Different len() Values on Same Website Scrape (Amazon)

jamesishere :

In the code example below, I get a different result from the print(len(block1)) call at the end of the code each time I run it. I cannot figure out what is causing this:

  • my code,
  • page loading with Selenium,
  • some sort of anti-scrape method that Amazon uses, or
  • a silly thing I am missing.

My ten most recent results were

LOG: 3/14/2020 - 2:30pm EDT 
Length Results for 10 separate runs:
0 / 20 / 55 / 25 / 57 / 55 / 6 / 59 / 54 / 39
# python version: 3.8.1
#Import necessary modules
from selenium import webdriver # version 3.141.0
from bs4 import BeautifulSoup # version 4.8.2

#set computer path and object to chrome browser
chrome_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)

# search Amazon for "bar+soap"
# use 'get' for URL request and set object to variable "source"
browser.get("https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2")
source = browser.page_source

#use Beautiful Soup to parse html
page_soup = BeautifulSoup(source, 'html.parser')

#set a variable "block1" to find all "a" tags that fit criteria
block1 = page_soup.findAll("a", {"class":"a-size-base"})

#print the number of tags pulled
print(len(block1))
Svetlana Levinsohn :

Your code looks correct. I modified it a little to double-check: I collected the tags with both Selenium and Beautiful Soup and counted them, and the two counts always match.

I was getting very different results at first, so I added a 7-second wait after the page load. That made things much more stable, so it is likely that some of the elements simply take longer to load and are not yet on the page when you take the count.
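If you want something less blunt than a fixed sleep, one rough sketch (the helper below is my own, not a standard Selenium call, and it uses the same Selenium 3 find_elements_by_css_selector API as the code later in this answer) is to poll until the number of matched elements stops changing between checks:

from time import sleep

def wait_for_stable_count(browser, css, checks=3, poll=1, timeout=30):
    """Poll until the number of elements matching `css` stops changing.

    Returns the count once `checks` consecutive polls agree, or whatever
    was last observed if `timeout` seconds pass first.
    """
    waited = 0
    last = -1
    stable = 0
    while waited < timeout:
        count = len(browser.find_elements_by_css_selector(css))
        if count == last:
            stable += 1
            if stable >= checks:
                return count
        else:
            stable = 0
            last = count
        sleep(poll)
        waited += poll
    return last

# usage, right after browser.get(...):
# wait_for_stable_count(browser, 'a.a-size-base')
# source = browser.page_source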

The fixed sleep didn't fully solve the issue, though. I am still getting different results: over 10 runs I got 64 (twice), 65 (six times), and 67 (twice). My recommendations would be to:

  1. try adding and increasing the sleep and see how the count behaves (or poll until the count stabilizes, as in the sketch above);
  2. actually print out the matched results and compare what differs between runs (see the sketch after this list);
  3. or simply use the result you get most often. A lot of websites run product A/B tests, so there can be multiple UI/content variants for the same page or for different components of it (very likely our case here). Every time the script runs it lands in one variant, or a combination of variants, which would explain the varying counts.
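For points 2 and 3 together, here is a rough sketch (the number of runs and the file names are arbitrary choices of mine) that repeats the scrape, writes out exactly which anchors were matched so two runs can be diffed, and tallies the counts with collections.Counter to see which result is most common:

from collections import Counter
from time import sleep

from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2"

counts = Counter()
for run in range(5):  # number of runs is an arbitrary choice
    browser = webdriver.Chrome()
    browser.get(URL)
    sleep(7)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    links = soup.find_all("a", {"class": "a-size-base"})
    browser.quit()

    counts[len(links)] += 1
    # dump what was actually matched so two runs can be diffed
    with open("run_{}.txt".format(run), "w", encoding="utf-8") as f:
        for a in links:
            f.write((a.get("href") or "") + "\t" + a.get_text(strip=True) + "\n")

print(counts)                              # e.g. Counter({65: 6, 64: 2, 67: 2})
print("most common count:", counts.most_common(1)[0][0])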

Just in case, here is my code:

#Import necessary modules
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

#create the Chrome browser object (assumes chromedriver is on PATH)
browser = webdriver.Chrome()

# use 'get' for URL request and set object to variable "source"
browser.get("https://www.amazon.com/s?k=soap+bar&ref=nb_sb_noss_2")
sleep(7)
source = browser.page_source

#use Beautiful Soup to parse html
page_soup = BeautifulSoup(source, 'html.parser')

#set a variable "block1" to find all "a" tags that fit criteria
block1 = page_soup.findAll("a", {"class":"a-size-base"})
#print the number of tags pulled
print('BS', len(block1))

# To be safe, let's also count with pure Selenium:
e = browser.find_elements_by_css_selector('a.a-size-base')
print('SEL', len(e))

Hope this helps, good luck.
