RuiJi.Net: an open-source .NET crawler framework

Project repositories

https://github.com/zhupingqi/RuiJi.Net

https://gitee.com/zhupingqi/RuiJi.Net

Documentation

http://www.ruijihg.com/archives/ruijinet/getting-started

RuiJi.Net crawler framework discussion group: 545931923

RuiJi.Net

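RuiJi.Net is a distributed crawling framework written in C#.

RuiJi.Net supports self-hosting and provides distributed crawling, extraction, and self-managed cookies.

RuiJi.Net supports rotating requests across server-side IP addresses and access through proxy servers (not yet finished).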

Notice

The project is still under development.

Features

Crawler

Feature            Support
webheader          custom
method             GET/POST
auto redirection   supported
cookie             managed/custom
service point IP   auto/custom bind
encoding           auto-detect/specified
response           raw/string
proxy              planned for a future release

Extractor

Feature             Support
selector            css/xpath/regex/json/text range/exclude text/clear
extract structure   block/tile/meta
jsonconvert         extractblock

About the extraction structure

Examples

Using RuiJi.Net.Core directly

        // Fetch the page and read the response body as a string
        var crawler = new IPCrawler();
        var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

        var response = crawler.Request(request);
        var content = response.Data.ToString();

        // Block: the region of the page to extract from
        var block = new ExtractBlock();
        block.Selectors = new List<ISelector>
        {
            new CssSelector(".entry-content",CssTypeEnum.InnerHtml)
        };

        // Tile: the repeated item inside the block
        block.TileSelector = new ExtractTile
        {
            Selectors = new List<ISelector>
            {
                new CssSelector(".pt-cv-content-item",CssTypeEnum.InnerHtml)
            }
        };

        // Metas: the fields extracted from each tile
        block.TileSelector.Metas.AddMeta("title",new List<ISelector> {
            new CssSelector(".pt-cv-title")
        });

        block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
            new CssSelector(".pt-cv-readmore","href")
        });

        // Run the extraction against the downloaded content
        var ext = new RuiJiExtracter();
        var r = ext.Extract(content, block);

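Using a cluster

  1. Download ZooKeeper:
    http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/
  2. In the conf folder, make a copy of zoo_sample.cfg and rename it to zoo.cfg. Change dataDir to a path of your own.
  3. Confirm that a Java runtime environment is available.
  4. Run bin/zkServer.cmd.
  5. Run RuiJi.cmd.exe as administrator.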
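
As a rough illustration of step 2, a minimal zoo.cfg might look like the fragment below. The values other than dataDir should match the defaults shipped in zoo_sample.cfg for ZooKeeper 3.4.x; the dataDir path is only a placeholder and is an assumption, not a path the project prescribes.

```properties
# conf/zoo.cfg, copied from conf/zoo_sample.cfg
tickTime=2000
initLimit=10
syncLimit=5
# Placeholder path: replace with a directory of your own
dataDir=C:/zookeeper/data
# Port that clients connect to
clientPort=2181
```

Forward slashes are used in the path because backslashes are treated as escape characters in this file format.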

Once startup completes, you will see output like the following:

Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!

Then run the following code:

using RuiJi.Net.NodeVisitor;

....

            // Request the page with the Crawler from RuiJi.Net.NodeVisitor
            var response = new Crawler().Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

            // Stop if the request did not succeed
            if (response.StatusCode != System.Net.HttpStatusCode.OK)
                return;

            var content = response.Data.ToString();

            // Block: the region of the page to extract from
            var block = new ExtractBlock();
            block.Selectors = new List<ISelector>
            {
                new CssSelector(".entry-content",CssTypeEnum.InnerHtml)
            };

            // Tile: the repeated item inside the block
            block.TileSelector = new ExtractTile
            {
                Selectors = new List<ISelector>
                {
                    new CssSelector(".pt-cv-content-item",CssTypeEnum.InnerHtml)
                }
            };

            // Metas: the fields extracted from each tile
            block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
                new CssSelector(".pt-cv-title")
            });

            block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
                new CssSelector(".pt-cv-readmore","href")
            });

            // Submit the content and the block as an extract request
            var r = Extracter.Extract(new ExtractRequest {
                Blocks = new List<ExtractFeatureBlock> {
                    new ExtractFeatureBlock
                    {
                        Block = block
                    }
                },
                Content = content
            });

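RuiJi expressions

RuiJi expressions are a way to add extraction rules for a page quickly, without hard-coding them. They are designed to be as simple and easy to understand as possible.

Selectors are the selectors.
Tiles are regions that need to be extracted repeatedly.
Metas are the metadata to extract.
Blocks are the sub-blocks to extract within a Block.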

To extract from http://www.ruijihg.com/开发, first examine the structure of the page.
You can inspect it with the browser developer tools (F12).
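
The URL above is the same page that the earlier code requests as http://www.ruijihg.com/%e5%bc%80%e5%8f%91/; the path segment is simply percent-encoded UTF-8. A minimal sketch of the conversion using the standard .NET Uri class follows. The UrlSketch class and its method names are hypothetical and are not part of RuiJi.Net.

```csharp
using System;

// Hypothetical helper for illustration only; not part of RuiJi.Net.
public static class UrlSketch
{
    // Decodes a percent-encoded path segment back to readable text.
    public static string DecodeSegment(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }

    // Percent-encodes a path segment as UTF-8.
    public static string EncodeSegment(string text)
    {
        return Uri.EscapeDataString(text);
    }
}
```

For example, UrlSketch.DecodeSegment("%e5%bc%80%e5%8f%91") returns "开发", and EncodeSegment("开发") produces the same byte sequence with the hex digits in upper case.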

First, make sure the Block selector yields a unique result.

A Block can be defined as follows:

#content
css .pt-cv-view:ohtml

Next, add a tile:

[tile]
\t#tiles
\tcss .pt-cv-content-item:ohtml

\t[meta]
\t#title
\tcss .pt-cv-title:text

\t#content
\tcss .pt-cv-content:html
\tex 阅读更多... -e

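You may have noticed the \t. Because both a block and a tile can contain meta, the tile's selector section and the tile's meta are prefixed with \t to mark them as belonging to the current tile.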
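
To make the scoping rule concrete, here is a minimal sketch of how a reader of such an expression could separate tile-scoped lines from block-scoped lines by the leading tab. It is not RuiJi.Net's actual parser: the TabScopeSketch class is hypothetical, and it assumes the \t shown in the text stands for a real tab character at the start of the line.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: NOT RuiJi.Net's parser. It shows how a leading tab can
// decide whether a line belongs to the tile or to the enclosing block.
public static class TabScopeSketch
{
    public static (List<string> BlockLines, List<string> TileLines) Split(string expression)
    {
        var blockLines = new List<string>();
        var tileLines = new List<string>();

        foreach (var raw in expression.Split('\n'))
        {
            var line = raw.TrimEnd('\r');
            if (line.Trim().Length == 0)
                continue; // skip blank separator lines

            if (line.StartsWith("\t"))
                tileLines.Add(line.Substring(1)); // tab-prefixed: tile scope
            else
                blockLines.Add(line);             // no tab: block scope
        }

        return (blockLines, tileLines);
    }
}
```

Under this sketch, a [meta] line that starts with a tab lands in the tile's lines, while an unindented [meta] line stays with the block, which is the distinction the tab prefix exists to express.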

The complete structure of a Block description is as follows:

[Block]
#blockname
selector

[blocks]
@subblockname1
@subblockname2

[tile]
\t#tilename
\ttile selector

\t[meta]
\t#meta1
\tselector
\t#meta2
\tselector

[meta]
#blockmeta1
selector

#blockmeta2
selector

Reposted from my.oschina.net/u/3875422/blog/1826317