Unity C#: Fetching a Web Page's HTML over HTTP and Removing the HTML Markup and Related Information

Table of contents

1. Brief introduction

2. Implementation principle

3. Matters needing attention

4. Effect preview

5. Key code


1. Brief introduction

A collection of notes on Unity development topics.

This section briefly describes how to use HttpClient in Unity development to fetch a specified web page and then clean the data: the HTML formatting, tags, script functions, redundant whitespace, and similar information are removed, leaving only text close to what the page displays in a browser. Why do this? One use case is feeding web page content to GPT and letting GPT process and summarize it. If you know a better method, feel free to leave a comment. Thank you.

2. Implementation principle

1. HttpClient fetches the HTML of the specified web page.

2. HtmlAgilityPack parses the HTML, removes all <script> tags together with their content, and extracts the plain-text content; finally, redundant spaces and blank lines are removed.
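The last step does not depend on HtmlAgilityPack, so it can be sketched with the standard library alone. The helper below is hypothetical (the names TextCleanup and Normalize are not from the original article). It assumes the extracted text may still contain HTML entities such as &amp;amp; or &amp;nbsp;, decodes them with WebUtility.HtmlDecode, and then collapses runs of whitespace the same way the key code does.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original article.
public static class TextCleanup
{
    // Decodes HTML entities and collapses every run of whitespace
    // (spaces, tabs, line breaks, non-breaking spaces) into one space.
    public static string Normalize(string rawText)
    {
        if (string.IsNullOrEmpty(rawText))
        {
            return string.Empty;
        }

        // Turn entities such as &amp; and &nbsp; back into characters.
        string decoded = WebUtility.HtmlDecode(rawText);

        // \s also matches U+00A0, the character produced by &nbsp;.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, TextCleanup.Normalize("  Tom &amp;amp; Jerry\n\n&amp;nbsp; ok  ") yields "Tom & Jerry ok". Whether the decoding step is needed depends on the page being processed.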

3. Matters needing attention

1. When accessing a web page directly from code, it is best to set a User-Agent header; otherwise, the request may not succeed.

2. Remember to install the HtmlAgilityPack package via NuGet.
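As a minimal sketch of the first note, the User-Agent setup can be wrapped in a small factory. The class name BrowserLikeClient and the 10-second timeout are assumptions made for illustration; the User-Agent string is the one used in the key code. TryAddWithoutValidation is used here instead of Add so that an unusual User-Agent value does not raise a format exception.

```csharp
using System;
using System.Net.Http;

// Hypothetical factory for illustration; not part of the original article.
public static class BrowserLikeClient
{
    // Same User-Agent value as in the key code of this article.
    public const string UserAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3";

    public static HttpClient Create()
    {
        // The timeout value is an assumption; adjust it to the target site.
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

        // Attach the header to every request sent through this client.
        client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", UserAgent);
        return client;
    }
}
```

A caller would then write something like `using (HttpClient client = BrowserLikeClient.Create()) { ... }` and issue requests as usual.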

4. Effect preview

5. Key code

using HtmlAgilityPack;
using System;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;

namespace TestHtml
{
    class Program
    {
        static async System.Threading.Tasks.Task Main(string[] args)
        {
            //string url = "https://movie.douban.com/chart";
            //string url = "http://www.weather.com.cn/";
            //string url = "https://movie.douban.com/";
            //string url = "http://time.tianqi.com/";
            string url = "http://time.tianqi.com/shenzhen/";
            // Sample HTML used only as an initial value; it is overwritten by the downloaded page below
            string htmlContent = @"
            <html>
            <head>
            <title>Sample Page</title>
            <script>
            function myFunction() {
                alert(""Hello!"");
            }
            </script>
            </head>
            <body>
            <h1>Welcome to My Page</h1>
            <p>This is a sample page with some content.</p>
            </body>
            </html>";

            using (HttpClient client = new HttpClient())
            {
                // Set a request header so the request looks like it comes from a browser
                client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3");

                // Request the web page and read its HTML content
                htmlContent = await client.GetStringAsync(url);

                // Print the downloaded HTML (disabled)
                //Console.WriteLine(htmlContent);
            }

            // Create an HtmlDocument object and load the HTML content
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(htmlContent);

            // Remove all <script> tags and their content
            foreach (var script in doc.DocumentNode.DescendantsAndSelf("script").ToArray())
            {
                script.Remove();
            }

            // Get the plain-text content
            string text = doc.DocumentNode.InnerText;

            // Collapse redundant spaces and blank lines into single spaces
            text = Regex.Replace(text, @"\s+", " ").Trim();

            // Print the cleaned text
            Console.WriteLine(text);
        }
    }
}
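The key code lets any network failure propagate as an exception. One possible way to make the download step more tolerant is sketched below. The names SafeFetch and TryGetHtmlAsync are hypothetical, and returning null on failure is a design assumption rather than the article's approach.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original article.
public static class SafeFetch
{
    // Returns the page HTML, or null if the request fails, times out,
    // or the server answers with a non-success status code.
    public static async Task<string> TryGetHtmlAsync(HttpClient client, string url)
    {
        try
        {
            using (HttpResponseMessage response = await client.GetAsync(url))
            {
                if (!response.IsSuccessStatusCode)
                {
                    return null;
                }
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException)
        {
            // Connection refused, DNS failure, and similar transport errors.
            return null;
        }
        catch (OperationCanceledException)
        {
            // HttpClient reports a timeout as a canceled task.
            return null;
        }
    }
}
```

The caller can check for null before passing the result to HtmlDocument.LoadHtml, for instance to skip a page or retry later.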

Origin blog.csdn.net/u014361280/article/details/132258032