Tracking the development of news events and public opinion with Python crawlers

Table of contents

Implementation plan

1. Identify your target news sources:

2. Identify keywords:

3. Use web crawlers to obtain news content:

4. Extract and analyze news articles:

5. Track the development process of news events:

6. Monitor public opinion reflections:

7. Data visualization:

Complete code example

Precautions

1. Site Usage Policy and Compliance:

2. Web page analysis and data extraction:

3. Crawler frequency and data volume:

4. API usage and restrictions:

5. Data processing and storage:

6. Code robustness and exception handling:

7. Privacy and Copyright Issues:

Summary


Tracking how news events develop and how public opinion responds is crucial for understanding current affairs and public sentiment. With technologies such as Python crawlers and sentiment analysis, we can obtain news content, analyze emotional tendencies, and gauge public reactions more efficiently. So how can we use Python to track news events and public opinion?

 

Implementation plan

To track the development of news events and public opinion with a Python crawler, here is one possible implementation plan:

1. Identify your target news sources:

First, identify the news sources you want to follow. Combining multiple news sites, social media platforms, forums, and so on gives more comprehensive coverage.

target_news_sources = ['https://example.com/news', 'https://example2.com/news']

2. Identify keywords:

Filter for news and opinions related to a specific event by choosing keywords or phrases. These keywords should be closely related to the event and appear in coverage of it frequently.

keywords = ['event 1', 'public opinion reflection', 'keywords']

3. Use web crawlers to obtain news content:

Use a Python crawler library (such as BeautifulSoup or Scrapy) to fetch news articles related to the keywords from the news sites. Content can be extracted from sections such as the title, body, and tags.

import requests
from bs4 import BeautifulSoup

def crawl_news_content(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the news title, body, and publish time according to the page structure.
    # The selectors below are placeholders; adjust them to each target site.
    title = soup.find('h1').get_text()
    content = soup.find('div', class_='article-content').get_text()
    # Assumed: the publish time sits in a <time> element (needed for sorting in Step 5).
    time_tag = soup.find('time')
    publish_time = time_tag.get_text(strip=True) if time_tag else ''
    # Return the extracted news content
    return {
        'title': title,
        'content': content,
        'publish_time': publish_time
    }

news_content = []
for news_url in target_news_sources:
    news_content.append(crawl_news_content(news_url))

4. Extract and analyze news articles:

For each scraped news article, use a natural language processing tool (such as NLTK or spaCy) to extract key information such as the date, title, author, and abstract. Machine learning techniques such as sentiment analysis can then be used to identify the emotional tendencies reflected in public opinion; a simple abstract-extraction sketch follows the sentiment code below.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Note: VADER is designed for English text; use a different model for other languages.
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def analyze_news_sentiment(news_content):
    sentiment_scores = []
    for news in news_content:
        title = news['title']
        text = news['content']
        sentiment_score = sia.polarity_scores(text)
        sentiment_scores.append({
            'title': title,
            'sentiment_score': sentiment_score
        })
    return sentiment_scores

news_sentiment_scores = analyze_news_sentiment(news_content)
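
The code above covers sentiment only. For the abstract extraction mentioned in this step, here is a minimal sketch (assuming the news_content list from Step 3) that builds a naive summary from the first sentences of each article using NLTK's sentence tokenizer; the three-sentence cutoff is an arbitrary choice.

from nltk.tokenize import sent_tokenize

nltk.download('punkt')

def extract_abstract(text, max_sentences=3):
    # Naive extractive summary: join the first few sentences of the body.
    sentences = sent_tokenize(text)
    return ' '.join(sentences[:max_sentences])

for news in news_content:
    news['abstract'] = extract_abstract(news['content'])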

5. Track the development process of news events:

Sort the captured news by timestamp or date to track how the event develops. Items can be displayed in chronological order, along with a summary of key information such as the news title, link, and release time.

sorted_news_content = sorted(news_content, key=lambda x: x['publish_time'])

for news in sorted_news_content:
    title = news['title']
    publish_time = news['publish_time']
    print(f"新闻标题:{title}")
    print(f"发布时间:{publish_time}")
    print("---------------------------")

6. Monitor public opinion reflections:

Analyze comments on the captured news, discussions on social media, and posts in relevant forums and other public opinion channels to track and monitor public reactions. Text classification and clustering techniques can be used to group and summarize public opinions, as sketched after the code below.

import tweepy

def monitor_public_opinion(keyword):
    consumer_key = "your-consumer-key"
    consumer_secret = "your-consumer-secret"
    access_token = "your-access-token"
    access_token_secret = "your-access-token-secret"

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    # Tweepy v4 renamed API.search to API.search_tweets.
    tweets = api.search_tweets(q=keyword, tweet_mode='extended', count=10)

    opinions = []
    for tweet in tweets:
        opinions.append(tweet.full_text)
    
    return opinions

public_opinions = monitor_public_opinion(keywords[0])
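
For the clustering mentioned above, here is a minimal sketch, assuming scikit-learn is installed: it groups the fetched opinions by TF-IDF similarity with KMeans. The cluster count of 3 is an arbitrary assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_opinions(opinions, n_clusters=3):
    # Represent each opinion as a TF-IDF vector, then group similar ones with KMeans.
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(opinions)
    km = KMeans(n_clusters=min(n_clusters, len(opinions)), n_init=10)
    return km.fit_predict(matrix)

if public_opinions:
    for label, opinion in zip(cluster_opinions(public_opinions), public_opinions):
        print(f"Cluster {label}: {opinion[:80]}")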

7. Data visualization:

To better present the development of news events and public opinion, you can use Python data visualization libraries (such as Matplotlib or Plotly) to create charts and visual dashboards.

import matplotlib.pyplot as plt

def visualize_sentiment_scores(sentiment_scores):
    titles = [score['title'] for score in sentiment_scores]
    scores = [score['sentiment_score']['compound'] for score in sentiment_scores]

    plt.figure(figsize=(10, 6))
    plt.bar(titles, scores)
    plt.xlabel('News title')
    plt.ylabel('Sentiment score')
    plt.xticks(rotation=90)
    plt.title('News sentiment analysis')
    plt.tight_layout()
    plt.show()

visualize_sentiment_scores(news_sentiment_scores)

Note that scraping website information and handling public opinion data can raise legal and ethical issues. When crawling, make sure you comply with relevant laws and regulations and the website's terms of use, and do not violate others' privacy or crawl by malicious means.

 

Complete code example

The following is a complete code example showing how to use a Python crawler to track the development of news events and public opinion. Note that this is just a basic example and may need to be optimized and extended in practice.

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import tweepy
import matplotlib.pyplot as plt

# Step 1: Identify the target news sources
target_news_sources = ['https://example.com/news', 'https://example2.com/news']

# Step 2: Identify the keywords
keywords = ['event 1', 'public opinion reflection', 'keywords']

# Step 3: Use a web crawler to obtain the news content
def crawl_news_content(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the news title, body, and publish time according to the page structure.
    # The selectors below are placeholders; adjust them to each target site.
    title = soup.find('h1').get_text()
    content = soup.find('div', class_='article-content').get_text()
    # Assumed: the publish time sits in a <time> element (needed for sorting in Step 5).
    time_tag = soup.find('time')
    publish_time = time_tag.get_text(strip=True) if time_tag else ''
    return {
        'title': title,
        'content': content,
        'publish_time': publish_time
    }

news_content = []
for news_url in target_news_sources:
    news_content.append(crawl_news_content(news_url))

# Step 4: Extract and analyze the news articles
# Note: VADER is designed for English text; use a different model for other languages.
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def analyze_news_sentiment(news_content):
    sentiment_scores = []
    for news in news_content:
        title = news['title']
        text = news['content']
        sentiment_score = sia.polarity_scores(text)
        sentiment_scores.append({
            'title': title,
            'sentiment_score': sentiment_score
        })
    return sentiment_scores

news_sentiment_scores = analyze_news_sentiment(news_content)

# Step 5: Track the development of the news event
sorted_news_content = sorted(news_content, key=lambda x: x['publish_time'])

for news in sorted_news_content:
    title = news['title']
    publish_time = news['publish_time']
    print(f"Title: {title}")
    print(f"Published: {publish_time}")
    print("---------------------------")

# Step 6: Monitor public opinion
def monitor_public_opinion(keyword):
    consumer_key = "your-consumer-key"
    consumer_secret = "your-consumer-secret"
    access_token = "your-access-token"
    access_token_secret = "your-access-token-secret"

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    # Tweepy v4 renamed API.search to API.search_tweets.
    tweets = api.search_tweets(q=keyword, tweet_mode='extended', count=10)

    opinions = []
    for tweet in tweets:
        opinions.append(tweet.full_text)
    
    return opinions

public_opinions = monitor_public_opinion(keywords[0])

# Step 7: Data visualization
def visualize_sentiment_scores(sentiment_scores):
    titles = [score['title'] for score in sentiment_scores]
    scores = [score['sentiment_score']['compound'] for score in sentiment_scores]

    plt.figure(figsize=(10, 6))
    plt.bar(titles, scores)
    plt.xlabel('News title')
    plt.ylabel('Sentiment score')
    plt.xticks(rotation=90)
    plt.title('News sentiment analysis')
    plt.tight_layout()
    plt.show()

visualize_sentiment_scores(news_sentiment_scores)

Please note that in practical applications, the crawling, data analysis, opinion monitoring, and visualization code must be adapted and optimized for the structure of each specific website and data source. You also need to pay attention to issues such as website usage policies, crawler compliance, and API restrictions.

 

Precautions

When using Python crawlers to track news events and public opinion, pay attention to the following:

1. Site Usage Policy and Compliance:

Before crawling a news website, review its usage policy (including robots.txt) to ensure that your crawling complies with laws, regulations, and the site's rules. Some websites restrict or prohibit crawling; a minimal robots.txt check is sketched below.
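
A minimal sketch of such a check using the standard library's urllib.robotparser; the user agent string is a placeholder.

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent='my-news-crawler'):
    # Build the site's robots.txt URL and ask whether this URL may be fetched.
    parts = urlsplit(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

print(is_allowed('https://example.com/news'))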

2. Web page analysis and data extraction:

Use an appropriate parsing library (such as BeautifulSoup) to parse the HTML or XML of the target site and extract the required data. Note that different websites have different page structures, so the extraction logic must be adapted to each site; a defensive lookup is sketched below.
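
As a minimal sketch of handling varying page structures, the helper below tries several candidate selectors (all placeholders here) and falls back gracefully when one is missing.

def first_match(soup, selectors):
    # Try candidate CSS selectors in order and return the first hit's text.
    for selector in selectors:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    return None

# Placeholder selectors; each target site needs its own list.
# title = first_match(soup, ['h1.headline', 'h1', 'title'])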

3. Crawler frequency and data volume:

Control the crawl rate to avoid putting a heavy load on the website, and limit the amount of data you fetch to avoid excessive resource requests; see the sketch below.
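
A minimal sketch of throttling, reusing crawl_news_content from Step 3; the one-second delay and page cap are arbitrary choices.

import time

MAX_PAGES = 50          # arbitrary cap on pages fetched per run
DELAY_SECONDS = 1.0     # arbitrary politeness delay between requests

news_content = []
for news_url in target_news_sources[:MAX_PAGES]:
    news_content.append(crawl_news_content(news_url))
    time.sleep(DELAY_SECONDS)  # spread requests out to reduce server load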

4. API usage and restrictions:

If you use services such as the Twitter API, abide by the provider's usage policies and limits: do not exceed the rate limits, and follow any other relevant restrictions. Tweepy can handle rate limits for you, as sketched below.
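
For example, Tweepy's API client accepts a wait_on_rate_limit flag that pauses automatically when a limit is reached; a minimal sketch reusing the auth object from Step 6:

# Reusing the `auth` object from Step 6; with this flag Tweepy sleeps
# automatically when Twitter reports that a rate limit has been hit.
api = tweepy.API(auth, wait_on_rate_limit=True)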

5. Data processing and storage:

Process and store the crawled data according to your actual needs. The data may need to be cleaned, deduplicated, and denoised, and it can be stored in a database or file for later analysis; a sketch follows.
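
A minimal sketch of deduplicating by title and saving to a JSON file; the file name is a placeholder.

import json

def dedupe_and_save(news_content, path='news_content.json'):
    # Drop articles whose title was already seen, then write the rest to disk.
    seen, unique = set(), []
    for news in news_content:
        if news['title'] not in seen:
            seen.add(news['title'])
            unique.append(news)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(unique, f, ensure_ascii=False, indent=2)
    return unique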

6. Code robustness and exception handling:

Write robust code that handles likely failures, such as network errors or changes in page structure. Add appropriate exception handling to keep the program stable and reliable; a sketch follows.
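
A minimal sketch wrapping crawl_news_content from Step 3 so that a single failing page is skipped rather than crashing the crawler:

def safe_crawl(url):
    # Catch network errors and missing page elements so one bad page
    # does not abort the whole run.
    try:
        return crawl_news_content(url)
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
    except AttributeError:
        print(f"Unexpected page structure at {url}")
    return None

news_content = [n for n in map(safe_crawl, target_news_sources) if n]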

7. Privacy and Copyright Issues:

News and public opinion data may involve personal privacy and copyright. When using and processing the data, comply with relevant laws and regulations, and respect the privacy and intellectual property rights of others.

The above are just some common precautions; the specifics vary with the application scenario and data source. In practice, read the usage policies and compliance requirements of the relevant websites carefully, follow laws and regulations, and make sure your crawler behaves ethically.

Summary

By crawling news sites, analyzing sentiment, and monitoring social media, we can build a more comprehensive picture of events and public sentiment. We also covered points that need attention, such as compliance, data processing, and privacy. Hopefully this example provides a starting point and stimulates further ideas for practical scenarios. Mastering these techniques helps us grasp the dynamics of current events and understand the public's voice, so as to make better decisions and take better action.
