Python introductory example: Get real reviews of tourist attractions

Preface

TripAdvisor is a travel review website. Before crawling data from it, you need to understand the site's access rules (such as its robots.txt file) and any crawling restrictions it imposes.
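One way to check access rules programmatically is Python's built-in urllib.robotparser module. The snippet below is only a minimal sketch: the robots.txt text and the example.com URLs are invented for illustration and are not TripAdvisor's actual rules. For a real site you would point the parser at the live file with set_url() and read() instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each (made-up) URL.
allowed = parser.can_fetch("*", "https://example.com/Attractions")
blocked = parser.can_fetch("*", "https://example.com/private/page")

print("Attractions page allowed:", allowed)
print("Private page allowed:", blocked)
```

If can_fetch returns False for a page, the crawler should skip it.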

Code

For the TripAdvisor website, you can use the third-party Python library Selenium to drive a real browser and simulate user operations on the site, then parse the rendered page with BeautifulSoup. The following is a simple implementation process:

1. Install the necessary libraries: Selenium, BeautifulSoup, and pandas (used in step 5)

pip install selenium beautifulsoup4 pandas

2. Download the WebDriver that matches your browser version and start the browser with it

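# Example using the Chrome browser
# Download the matching ChromeDriver first
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_path = "/path/to/chromedriver"
options = webdriver.ChromeOptions()
# --no-sandbox is usually only needed when Chrome runs as root (e.g. in a container)
options.add_argument("--no-sandbox")
# Selenium 4 takes the driver path through a Service object instead of executable_path
browser = webdriver.Chrome(service=Service(driver_path), options=options)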

3. Load the target page in the browser and obtain its HTML

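from bs4 import BeautifulSoup

url = "https://www.tripadvisor.cn/Attractions-g186338-Activities-London_England.html#FILTERED_LIST"
browser.get(url)
html = browser.page_source  # HTML after the browser has rendered the page
browser.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")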

4. Parse the HTML page and obtain the required data

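results = []
# These class names reflect the page layout when this was written and may have changed
for element in soup.find_all("div", class_="listItem"):
    name = element.find("div", class_="listing_title").text.strip()
    # The second class is assumed to look like "bubble_45"; keep the digits and scale to 4.5
    bubble_class = element.find("span", class_="ui_bubble_rating")["class"][1]
    rating = int(bubble_class.split("_")[1]) / 10
    review_count = element.find("a", class_="review_count").text.split(" ")[0]
    results.append((name, rating, review_count))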
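Reading the rating by position in the class list is fragile, so it can help to isolate that step in a small function. The sketch below assumes the rating is encoded in a class name of the form bubble_NN (for example bubble_45 for 4.5), which is an assumption about the page markup rather than something the site guarantees. The function name parse_bubble_rating is made up for this example.

```python
import re


def parse_bubble_rating(class_names):
    """Return a rating such as 4.5 from a list of CSS class names.

    Assumes one class looks like "bubble_45"; returns None if none does.
    """
    for class_name in class_names:
        match = re.fullmatch(r"bubble_(\d+)", class_name)
        if match:
            # "45" means 4.5 out of 5, so divide by 10.
            return int(match.group(1)) / 10
    return None


# The class list is what BeautifulSoup returns for tag["class"].
print(parse_bubble_rating(["ui_bubble_rating", "bubble_45"]))
print(parse_bubble_rating(["ui_bubble_rating"]))
```

Returning None instead of raising lets the crawl loop skip a listing with no rating rather than stop.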

5. Collect data and save it for later processing and analysis

import pandas as pd

df = pd.DataFrame(results, columns=["name", "rating", "review_count"])
df.to_csv("tripadvisor_data.csv", index=False)
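Scraped review counts are strings and may contain thousands separators, so it can be worth converting them to integers before saving. The following is a minimal sketch using only the standard library's csv module; the sample rows and the clean_review_count helper are invented for illustration and are not real TripAdvisor data.

```python
import csv
import io


def clean_review_count(text):
    """Convert a string such as "1,234" to the integer 1234, or None if not numeric."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


# Made-up rows in the same (name, rating, review_count) shape as the scraped results.
sample_rows = [
    ("Attraction A", 4.5, "1,234"),
    ("Attraction B", 4.0, "n/a"),
]

# Write to an in-memory buffer here; use open("file.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "rating", "review_count"])
for name, rating, count in sample_rows:
    writer.writerow([name, rating, clean_review_count(count)])

csv_text = buffer.getvalue()
print(csv_text)
```

A value that cannot be converted is written as an empty cell, which pandas reads back as a missing value.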

Please note that the crawling process may need to change as the website changes, because the class names and page structure used above are not stable. Analyze the current page yourself and adjust the selectors accordingly. This is only a simple implementation process for reference.

Origin blog.csdn.net/m0_48405781/article/details/131069146