Recommended: an open source web crawler project built on .NET Core


When we need to collect data from the Internet, we have to write a crawler, which involves fetching web pages, parsing them, extracting data, and keeping crawl performance acceptable. Today I recommend an open source project that handles these concerns well, so you can focus on business development.

Project Description

This is an open source web crawler project based on .NET Core: a lightweight, high-performance, and simple framework. It integrates crawling, data parsing and extraction, and proxying, which helps us build crawlers quickly.

Technology Architecture

1. Cross-platform: built on .NET Core; supports Windows, Mono, Linux, Windows Azure, and Docker.

2. Runtime: supports .NET Core 2.2 and later.

3. Database: MySQL.

4. Message queue: RabbitMQ.

Framework Features

1. Basic functions: crawling web pages over HTTP, parsing page data (text, JSON, HTML), and storing the parsed data in a database.

2. Crawl scheduling: deduplicating requests and controlling crawl order, with breadth-first and depth-first modes.

3. Distributed deployment: multiple download servers can be deployed at the same time.

4. Download agent registration service: handles download agent registration and heartbeats; standalone mode starts a built-in registration service by default.

5. Statistics: tracks the status of each crawler and of the service center, such as the number of requests, successes, and failures.

6. Request configuration: for example, adding request signing.

7. Data flow: multiple parsers can be chained and are run in order.

8. Concurrency: supports message queues and pre-caches request data to improve crawl performance.
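To make the scheduling item above more concrete, here is a minimal sketch of how deduplication and crawl order can be combined in one structure. It uses only the .NET base class library. The UrlFrontier type and its members are hypothetical names chosen for illustration and are not DotnetSpider's scheduler API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration type; not part of DotnetSpider.
public sealed class UrlFrontier
{
    private readonly List<string> _pending = new List<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();
    private readonly bool _depthFirst;

    public UrlFrontier(bool depthFirst)
    {
        _depthFirst = depthFirst;
    }

    public int Count => _pending.Count;

    // Returns false when the URL was already scheduled once (deduplication).
    public bool Enqueue(string url)
    {
        if (!_seen.Add(url))
        {
            return false;
        }
        _pending.Add(url);
        return true;
    }

    // Breadth-first takes the oldest pending URL; depth-first takes the newest.
    // RemoveAt(0) is O(n), which is acceptable for a sketch but not for a large queue.
    public string Dequeue()
    {
        if (_pending.Count == 0)
        {
            throw new InvalidOperationException("No pending URLs.");
        }
        int index = _depthFirst ? _pending.Count - 1 : 0;
        string url = _pending[index];
        _pending.RemoveAt(index);
        return url;
    }
}
```

With depthFirst set to false, enqueueing http://axx.com/ and then http://bxx.com/ makes Dequeue return http://axx.com/ first; with depthFirst set to true it returns http://bxx.com/ first. In both modes a second Enqueue of http://axx.com/ returns false.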

Project Structure

[Image: project structure]

Usage Examples

Simple crawler example


public class TestSpider : Spider
{
  public static readonly HashSet<string> CompletedUrls = new();

  // Configuration: crawl speed and idle sleep time
  public static async Task RunAsync()
  {
    var builder = Builder.CreateDefaultBuilder<TestSpider>(x =>
    {
      x.Speed = 1;
      x.EmptySleepTime = 5;
    });
    builder.UseDownloader<HttpClientDownloader>();
    builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
    await builder.Build().RunAsync();
  }

  class MyDataParser : DataParser
  {
    protected override Task ParseAsync(DataFlowContext context)
    {
      var request = context.Request;

      lock (CompletedUrls)
      {
        // Filter: record the visited URL and follow up only from the start page
        var url = request.RequestUri.ToString();
        CompletedUrls.Add(url);
        if (url == "http://axx.com/")
        {
          context.AddFollowRequests(new[] { new Uri("http://bxx.com") });
        }
      }


      return Task.CompletedTask;
    }

    public override Task InitializeAsync()
    {
      return Task.CompletedTask;
    }
  }

  public TestSpider(IOptions<SpiderOptions> options, DependenceServices services,
    ILogger<Spider> logger) : base(
    options, services, logger)
  {
  }

  protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
  {
    await AddRequestsAsync(new Request("http://axx.com"));
    AddDataFlow(new MyDataParser());
  }
}

HTML data parsing

public async Task XpathFollow()
{
  var request = new Request("http://xxx.com");
  var dataContext =
    new DataFlowContext(null, new SpiderOptions(), request,
      new Response { Content = new ByteArrayContent(File.ReadAllBytes("cnblogs.html")) });

  var dataParser = new TestDataParser();
  dataParser.AddFollowRequestQuerier(Selectors.XPath(".//div[@class='pager']"));

  await dataParser.HandleAsync(dataContext);
  var requests = dataContext.FollowRequests;

  Assert.Equal(12, requests.Count);
  Assert.Contains(requests, r => r.RequestUri.ToString() == "http://cnblogs.com/sitehome/p/2");
}
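The test above relies on an XPath selector to find the pager block and turn its links into follow-up requests. As a rough illustration of that idea outside the framework, the sketch below does the same with the standard System.Xml.Linq and System.Xml.XPath APIs. The PagerLinks class is a hypothetical helper, and the sketch assumes the input is well-formed XML without namespaces, which real-world HTML often is not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper for illustration; not DotnetSpider's follow-request logic.
public static class PagerLinks
{
    // Returns absolute URLs for every <a href> found inside the elements
    // matched by containerXPath. Assumes well-formed, namespace-free markup.
    public static List<Uri> Extract(string xhtml, string containerXPath, Uri baseUri)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements(containerXPath)
            .SelectMany(container => container.Descendants("a"))
            .Select(a => a.Attribute("href")?.Value ?? "")
            .Where(href => href.Length > 0)
            .Select(href => new Uri(baseUri, href))
            .Distinct()
            .ToList();
    }
}
```

Called with the expression .//div[@class='pager'] and the base address http://cnblogs.com/, a relative link such as /sitehome/p/2 inside the pager resolves to http://cnblogs.com/sitehome/p/2, while links outside the pager block are ignored.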

Configuration-based parsing

private class N : EntityBase<N>
{
  [ValueSelector(Expression = "./div[@class='title']")]
  public string title { get; set; }

  [ValueSelector(Expression = "./div[@class='dotnetspider']")]
  public string dotnetspider { get; set; }
}
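The entity above declares, per property, which XPath expression supplies its value. How DotnetSpider processes these attributes internally is not shown in this article, so the following is only a sketch of the general technique: reading a selector attribute through reflection and filling the property from the first matching node. XPathValueAttribute, Article, and EntityMapper are hypothetical names, and the sketch again assumes well-formed, namespace-free markup.

```csharp
using System;
using System.Reflection;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical stand-in for a selector attribute; not DotnetSpider's ValueSelector.
[AttributeUsage(AttributeTargets.Property)]
public sealed class XPathValueAttribute : Attribute
{
    public string Expression { get; set; } = "";
}

// Hypothetical entity used only for this illustration.
public class Article
{
    [XPathValue(Expression = "./div[@class='title']")]
    public string Title { get; set; } = "";

    [XPathValue(Expression = "./div[@class='author']")]
    public string Author { get; set; } = "";
}

public static class EntityMapper
{
    // Fills each annotated string property with the trimmed text of the
    // first element its XPath expression matches, relative to 'item'.
    public static T Map<T>(XElement item) where T : class, new()
    {
        var entity = new T();
        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            var selector = property.GetCustomAttribute<XPathValueAttribute>();
            if (selector == null || property.PropertyType != typeof(string))
            {
                continue;
            }
            var match = item.XPathSelectElement(selector.Expression);
            if (match != null)
            {
                property.SetValue(entity, match.Value.Trim());
            }
        }
        return entity;
    }
}
```

Calling EntityMapper.Map<Article> on an element that contains a div with class title and a div with class author yields an Article whose Title and Author hold the text of those two elements; a property whose expression matches nothing keeps its default value.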

Project Address

https://github.com/dotnetcore/DotnetSpider


Origin blog.csdn.net/daremeself/article/details/129279072