For more open source projects, see our curated list of recommended .NET open source projects.
If you need to capture data from the Internet, you need to write a crawler, which involves web page downloading, page parsing, data extraction, and crawling performance. Today I recommend an open source project that solves these problems well, letting you focus on business development.
Project Description
This is an open source web crawler project based on .NET Core: a lightweight, high-performance, and simple framework. It integrates crawling, data parsing and extraction, and proxy support, helping you quickly build crawling functionality.
Technology Architecture
1. Cross-platform: based on .NET Core, it runs on Windows, Mono, Linux, Windows Azure, and Docker.
2. Supports .NET Core 2.2+.
3. Database: MySQL.
4. Component: RabbitMQ.
Framework Functions
1. Basic functions: HTTP web page crawling, parsing page data (text, JSON, HTML), and storing parsed data in a database.
2. Scheduling: request deduplication and crawl-order control, supporting breadth-first and depth-first modes.
3. Distributed deployment: multiple download servers can be deployed at the same time.
4. Download agent registration service: handles download-agent registration and heartbeats; standalone mode starts a built-in registration service by default.
5. Statistics: tracks the status of each crawler and of the service center, such as the number of requests, successes, and failures.
6. Request configuration: for example, adding signing configuration to requests.
7. Data flow: multiple rule parsers can be chained and run in order.
8. Concurrency: supports message queues and pre-caching request data to improve collection performance.
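To make the scheduling ideas above concrete, here is a minimal, framework-independent sketch of a breadth-first crawl frontier with HashSet-based deduplication, the same strategy the framework's BFS scheduler and duplicate remover implement. The `Crawl` method and the link-lookup delegate are illustrative assumptions for this sketch, not DotnetSpider APIs.

```csharp
using System;
using System.Collections.Generic;

public static class BfsFrontierDemo
{
    // Returns the order in which URLs would be visited, skipping duplicates.
    // getLinks stands in for downloading a page and extracting its links.
    public static List<string> Crawl(string seed, Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string>(); // deduplication of requests
        var frontier = new Queue<string>();  // FIFO queue gives breadth-first order
        frontier.Enqueue(seed);
        visited.Add(seed);

        var order = new List<string>();
        while (frontier.Count > 0)
        {
            var url = frontier.Dequeue();
            order.Add(url);
            foreach (var link in getLinks(url))
            {
                if (visited.Add(link)) // Add returns false for already-seen URLs
                    frontier.Enqueue(link);
            }
        }
        return order;
    }
}
```

Swapping the `Queue<string>` for a `Stack<string>` would give the depth-first mode instead; the deduplication set works the same either way.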
Project Structure
Usage Examples
Simple crawler example
public class TestSpider : Spider
{
    public static readonly HashSet<string> CompletedUrls = new();

    // Configuration: crawl speed and idle sleep time
    public static async Task RunAsync()
    {
        var builder = Builder.CreateDefaultBuilder<TestSpider>(x =>
        {
            x.Speed = 1;
            x.EmptySleepTime = 5;
        });
        builder.UseDownloader<HttpClientDownloader>();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build().RunAsync();
    }

    class MyDataParser : DataParser
    {
        protected override Task ParseAsync(DataFlowContext context)
        {
            var request = context.Request;
            lock (CompletedUrls)
            {
                // Filter: record the URL so it is only handled once
                var url = request.RequestUri.ToString();
                CompletedUrls.Add(url);
                if (url == "http://axx.com/")
                {
                    context.AddFollowRequests(new[] { new Uri("http://bxx.com") });
                }
            }
            return Task.CompletedTask;
        }

        public override Task InitializeAsync()
        {
            return Task.CompletedTask;
        }
    }

    public TestSpider(IOptions<SpiderOptions> options, DependenceServices services,
        ILogger<Spider> logger) : base(options, services, logger)
    {
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
    {
        await AddRequestsAsync(new Request("http://axx.com"));
        AddDataFlow(new MyDataParser());
    }
}
HTML data parsing
public async Task XpathFollow()
{
    var request = new Request("http://xxx.com");
    var dataContext = new DataFlowContext(null, new SpiderOptions(), request,
        new Response { Content = new ByteArrayContent(File.ReadAllBytes("cnblogs.html")) });
    var dataParser = new TestDataParser();
    dataParser.AddFollowRequestQuerier(Selectors.XPath(".//div[@class='pager']"));
    await dataParser.HandleAsync(dataContext);
    var requests = dataContext.FollowRequests;
    Assert.Equal(12, requests.Count);
    Assert.Contains(requests, r => r.RequestUri.ToString() == "http://cnblogs.com/sitehome/p/2");
}
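The core idea behind `AddFollowRequestQuerier` is selecting link nodes with an XPath query. Here is a minimal sketch of XPath-based link extraction using only the .NET standard library's `System.Xml.XPath`; note it requires a well-formed fragment, whereas the framework uses an HTML-aware parser. `XPathLinks` and `ExtractHrefs` are names invented for this sketch.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathLinks
{
    // Runs the XPath query against the fragment and collects each
    // matched element's href attribute.
    public static List<string> ExtractHrefs(string xml, string xpath)
    {
        var nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        var hrefs = new List<string>();
        foreach (XPathNavigator node in nav.Select(xpath))
            hrefs.Add(node.GetAttribute("href", ""));
        return hrefs;
    }
}
```

For example, `ExtractHrefs("<div class=\"pager\"><a href=\"/p/2\">2</a></div>", "//div[@class='pager']//a")` yields a single entry, `/p/2`, which a scheduler would then turn into a follow-up request.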
Configuration parsing
private class N : EntityBase<N>
{
    [ValueSelector(Expression = "./div[@class='title']")]
    public string title { get; set; }

    [ValueSelector(Expression = "./div[@class='dotnetspider']")]
    public string dotnetspider { get; set; }
}
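Under the hood, attribute-driven parsing like `ValueSelector` typically works by reflection: the mapper reads each property's attribute, evaluates the stored expression against the parsed page, and assigns the result. The sketch below shows that mechanism in simplified form; `SelectorAttribute`, `TinyMapper`, and the dictionary standing in for a parsed page are illustrative stand-ins, not DotnetSpider types.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Stand-in for ValueSelector: stores the selector expression per property.
[AttributeUsage(AttributeTargets.Property)]
public class SelectorAttribute : Attribute
{
    public string Expression { get; set; }
}

public static class TinyMapper
{
    // 'page' stands in for a parsed document: expression -> extracted text.
    public static T Map<T>(IReadOnlyDictionary<string, string> page) where T : new()
    {
        var item = new T();
        foreach (var prop in typeof(T).GetProperties())
        {
            var attr = prop.GetCustomAttribute<SelectorAttribute>();
            if (attr != null && page.TryGetValue(attr.Expression, out var value))
                prop.SetValue(item, value); // assign the extracted value
        }
        return item;
    }
}

public class Post
{
    [Selector(Expression = "./div[@class='title']")]
    public string Title { get; set; }
}
```

Given extracted values keyed by expression, `TinyMapper.Map<Post>` fills `Title` automatically; the real framework does the same, but evaluates the XPath expressions against the downloaded page instead of a lookup table.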
Project Address
https://github.com/dotnetcore/DotnetSpider
- End -
Recommended Reading
A recommended rapid development framework with front-end/back-end separation, built on .NET Core + Angular
The method and experience of reading the source code of open source projects
A powerful .Net image manipulation library that supports more than 100 formats