Master scraping and sentiment analysis using Python and machine learning

In this tutorial, we will scrape a website and use natural language processing to analyze its text data.

The end result will be a sentiment analysis of the website content. Here are the steps we will follow:

  1. Project scope
  2. Required libraries
  3. Understand web scraping
  4. Scrape the website
  5. Text cleaning and preprocessing
  6. Sentiment analysis using machine learning
  7. Final result

1. Project scope

The goal of this project is to scrape a website, preprocess the text, and then apply a machine learning algorithm to perform sentiment analysis on the website's content.

In other words, we want to determine whether textual content on a website has a positive, negative, or neutral sentiment.

To achieve this, we will use Python and some libraries to perform web scraping and machine learning.

2. Required libraries

This project requires the following libraries:

  • requests: make HTTP requests to the website
  • BeautifulSoup: parse HTML and XML documents
  • pandas: work with data frames
  • nltk: perform natural language processing
  • scikit-learn: train machine learning models

You can install these libraries using pip:

pip install requests beautifulsoup4 pandas nltk scikit-learn
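
In addition to the packages themselves, nltk needs a couple of one-time data downloads for the tokenizer and the stop-word list used later in this tutorial:

import nltk

# Download the tokenizer models and the stop-word list used in the preprocessing step
nltk.download("punkt")
nltk.download("stopwords")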

3. Understand web scraping

Web scraping is the process of extracting data from websites. This can be done manually, but is not practical for large amounts of data.

Therefore, we use software to automate the process. In Python, we use libraries like requests and BeautifulSoup to scrape websites.

There are two types of web scraping:

  • Static scraping: scraping pages whose content is delivered as fixed HTML
  • Dynamic scraping: scraping pages whose content is generated dynamically (for example, by JavaScript) or changes frequently

For this project, we will perform static scraping.

4. Scrape the website

First, we need to find a website to scrape. In this tutorial, we will scrape news articles from BBC News, specifically the Technology section of the site.

Here is the code to scrape the website:

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news/technology"
response = requests.get(url)

# Parse the HTML and collect all <article> elements on the page
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.find_all("article")

for article in articles:
    headline_tag = article.find("h3")
    summary_tag = article.find("p")
    # Skip articles that are missing a headline or a summary
    if headline_tag is None or summary_tag is None:
        continue
    headline = headline_tag.text.strip()
    summary = summary_tag.text.strip()
    print(headline)
    print(summary)
    print()

Let's break this code down:

  • We first import the requests and BeautifulSoup libraries
  • We define the URL of the website to crawl
  • We use requests.get() to make an HTTP request to the website and get the HTML content
  • We create a BeautifulSoup object from the HTML content
  • We use find_all() to get all articles on the page
  • We loop through each article and extract the headline and summary, skipping any article that is missing one of them
  • We print the headline and summary of each article

When we run this code, we should see the headlines and summaries of the articles printed to the console.
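
The later steps expect the scraped text in a pandas DataFrame. Here is a minimal sketch of how the headlines and summaries could be collected into a DataFrame named df with a "text" column; note that the "sentiment" labels used in section 6 are an assumption here and would have to come from manual labeling or an existing labeled dataset:

import pandas as pd

rows = []
for article in articles:
    headline_tag = article.find("h3")
    summary_tag = article.find("p")
    if headline_tag is None or summary_tag is None:
        continue
    # Combine headline and summary into one text document per article
    rows.append({"text": headline_tag.text.strip() + " " + summary_tag.text.strip()})

df = pd.DataFrame(rows)
# The "sentiment" column used later is assumed to be added here, for example
# from manual labeling: df["sentiment"] = [...]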

5. Text cleaning and preprocessing

Before performing sentiment analysis, we need to clean and preprocess text data. This involves the following steps:

  • Remove HTML tags
  • Convert all text to lowercase
  • Remove punctuation
  • Remove stop words (common words like "the", "a", "an", etc.)
  • Stemming or lemmatizing text (reducing words to their root forms)

Here is the code that performs text cleaning and preprocessing:

import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r"[^\w\s]", "", text)
    # Remove stopwords and stem words
    tokens = word_tokenize(text)
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    # Join tokens back into a string
    text = " ".join(tokens)
    return text

Let's break this code down:

  • We first import the regular expression library (re), the stopword corpus and SnowballStemmer from the NLTK library, and the word_tokenize function from the nltk.tokenize module.
  • We define a SnowballStemmer object and set the language to "english" which will be used for stemming
  • We define a set of stop words to be removed from text data
  • We define a function called clean_text() which accepts a text string
  • Inside the function we use a regular expression to remove any HTML tags
  • We use the lower() method to convert the text to lowercase
  • We use regular expression to remove punctuation
  • We use the word_tokenize() method from the nltk.tokenize module to tokenize the text into individual words.
  • We remove stop words and use the SnowballStemmer object to stem each remaining token.
  • Finally, we use the join() method to rejoin the stemmed tokens into a single string.
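
Assuming the scraped text lives in the DataFrame df built earlier, the cleaning function can be applied to the whole "text" column in one step:

# Clean every scraped document before vectorization
df["text"] = df["text"].apply(clean_text)
print(df["text"].head())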

6. Sentiment analysis using machine learning

Now that we have cleaned and preprocessed the text data, we can use machine learning for sentiment analysis.

We will use the scikit-learn library to perform sentiment analysis.

First, we need to split the data into training and test sets. This assumes the DataFrame df has a "sentiment" column containing a label (positive, negative, or neutral) for each text. We will use 80% of the data for training and 20% for testing.

Here is the code to split the data:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["sentiment"], test_size=0.2, random_state=42)

Let's break this code down:

  • We import the train_test_split() function from scikit-learn
  • We use the train_test_split() function to split text data (stored in the "text" column of the data frame) and sentiment data (stored in the "sentiment" column of the data frame) into training and test sets.
  • We use test_size of 0.2, which means 20% of the data will be used for testing, and random_state of 42 for repeatability.

Next, we need to convert the text data into a numerical vector that can be used as input to the machine learning algorithm.

We will use the TF-IDF vectorizer to do this.

Here is the code to convert the text data:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

Let's break this code down:

  • We import the TfidfVectorizer class from scikit-learn
  • We create a TfidfVectorizer object and set ngram_range to (1,2), which means we want to consider unigrams (single words) and bigrams (pairs of adjacent words) in the text data.
  • We fit the vectorizer on the training data using the fit_transform() method, which computes the TF-IDF score for each word in the corpus and transforms the text data into a sparse matrix of numerical features.
  • We transform the test data using the transform() method, which applies the same transformation to the test data using the vocabulary learned from the training data.
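
As a quick sanity check, you can inspect the shape of the resulting matrix and a few of the learned n-gram features (get_feature_names_out() is available in scikit-learn 1.0 and later):

print(X_train_vec.shape)
# Show a handful of the unigram and bigram features learned from the training data
print(vectorizer.get_feature_names_out()[:10])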

Now that we have converted text data into numerical features, we can train a machine learning model to predict the sentiment of text.

We will use the logistic regression algorithm, which is a popular algorithm for text classification tasks.

Here is the code to train the model:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

Let's break this code down:

  • We import the LogisticRegression class from scikit-learn
  • We create a LogisticRegression object and set max_iter to 1000, which means we allow the algorithm to run for up to 1000 iterations to converge.
  • We train the model on the training data using the fit() method, which learns model parameters that can be used to predict sentiment for new text data.

Finally, we can evaluate the model's performance on the test data by calculating the accuracy score, precision, recall, and F1 score.

Here is the code to evaluate the model:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = clf.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)

Let's break this code down:

  • We import the accuracy_score, precision_score, recall_score, and f1_score functions from scikit-learn
  • We use the predict() method of the LogisticRegression object to predict the sentiment of the test data
  • We calculate the accuracy, precision, recall and F1 score of the model using the corresponding functions in scikit-learn
  • We print performance metrics.
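
7. Final result

With the model trained and evaluated, we can use it to predict the sentiment of new, unseen text, such as a freshly scraped headline. Here is a minimal sketch; the example sentences below are made up for illustration:

new_texts = [
    "New chip promises faster and cheaper smartphones",
    "Tech giant fined over massive data breach",
]

# Apply the same cleaning used for the training data
new_texts_clean = [clean_text(text) for text in new_texts]

# Vectorize with the fitted TF-IDF vectorizer, then predict with the trained model
new_vec = vectorizer.transform(new_texts_clean)
predictions = clf.predict(new_vec)

for text, sentiment in zip(new_texts, predictions):
    print(text, "->", sentiment)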

That's it! We have successfully performed web scraping, text cleaning, preprocessing, and sentiment analysis using machine learning in Python.
