C#: Extracting Part of a Web Page's Content

For example, suppose I have a file test.html with the following content:

<div id="contents">
        <ul class="firstClass">
        </ul>
        <ul class="onlyClass">
            <li><a href="http://www.baidu.com"><span class="only">Rabbit</span><br>
                The Rabbit and The Wolf’One day a rabbit was walking near the hill.<br>
                He heard someone crying,‘Help! Help!’It was a wolf. A big stone was on the wolfs
                back.
                <br>
                He cried, "Mr. Rabbit, take this big stone from my back, or I will die." </a>
            </li>
            <li><a href="http://www.baidu.com"><span class="only">Stone</span><br>
                The Rabbit moved the stone from the wolfs back. Then the wolf jumped and caught
                the rabbit. </a></li>
            <li><a href="http://www.baidu.com"><span class="only">Help</span><br>
                “If you kill me, I will never help you again.” Cried the rabbit .
                <br>
                “Ha,ha!You will not live, because I will kill you." said the wolf. </a></li>
            <li><a href="http://www.baidu.com"><span class="only">Kill</span><br>
                I helped you. How can you kill me? It’s unfair. You ask Mrs. Duck. She will say
                that you are wrong." </a></li>
            <li><a href="http://www.baidu.com"><span class="only">Duck</span><br>
                said the rabbit. “I will ask her,” said the wolf. So they went to ask Mrs. Duck.
                The duck listened to their story and said,” What stone? I must see it. Then I can
                know who is right. </a></li>
            <li><a href="http://www.baidu.com"><span class="only">Again</span><br>
                “So the wolf and the rabbit and the duck went to see the stone. "Now, put the stone
                back," said Mrs. Duck. So they put the stone back. Now the big stone is on the wolf’s
                back again. </a></li>
            <li><a href="http://www.baidu.com"><span class="only">Story</span><br>
                That’s all for my story. Thanks for listening. </a></li>
        </ul>
        <ul class="lastClass">
        </ul>
    </div>

I want to get the content of each span tag, and the body text of each a tag, under the ul whose class is onlyClass.

I currently know of three ways to do this.

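1. Use regular expressions (not covered in detail here).
2. If a unique identifying string can be found, split the text on it.
3. Use the external library HtmlAgilityPack.dll.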
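Method 1 is not covered in detail in this post. As a rough sketch only, and assuming the markup is as regular as the sample above (each item is a span with class="only" followed by text up to the closing a tag), a regular expression could pull out the same pairs. The names RegexExtractor, Extract and ItemPattern are made up for this illustration, and a pattern like this will likely break on markup that differs from the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original post.
public static class RegexExtractor
{
    // Captures the span text as "name" and everything up to </a> as "text".
    // Singleline lets '.' match across line breaks.
    private static readonly Regex ItemPattern = new Regex(
        "<span\\s+class=\"only\">(?<name>.*?)</span>(?<text>.*?)</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in ItemPattern.Matches(html))
        {
            string name = m.Groups["name"].Value.Trim();

            // Replace <br> tags with a space, then collapse runs of whitespace.
            string text = Regex.Replace(
                m.Groups["text"].Value, @"<br\s*/?>", " ", RegexOptions.IgnoreCase);
            text = Regex.Replace(text, @"\s+", " ").Trim();

            result.Add(new KeyValuePair<string, string>(name, text));
        }
        return result;
    }
}
```

Calling RegexExtractor.Extract(html) on the page source returns one pair per item, with the span text as Key and the cleaned body text as Value. Unlike methods 2 and 3 as written, this sketch also trims and collapses whitespace, which is a design choice of the example rather than a requirement.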

Implementation of method 2:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

// WebRequest.Create accepts a local file path as well as a URL.
WebRequest myWebRequest = WebRequest.Create(@"D:\test.htm");
string html;
using (WebResponse myWebResponse = myWebRequest.GetResponse())
using (Stream myStream = myWebResponse.GetResponseStream())
using (StreamReader myStreamReader = new StreamReader(myStream, Encoding.UTF8))
{
    html = myStreamReader.ReadToEnd();
}

List<string> nameList = new List<string>();
List<string> textList = new List<string>();

// Remove line breaks and tabs so that each item sits on a single line.
string s = html.Replace("\r", string.Empty)
               .Replace("\n", string.Empty)
               .Replace("\t", string.Empty);

// Each wanted item starts right after this marker.
string[] SPLIT_CLASS_NAME = new string[] { "class=\"only\">" };
string[] SPLIT_SPAN = new string[] { "</span>" };

string[] strArray = s.Split(SPLIT_CLASS_NAME, StringSplitOptions.None);

// strArray[0] is everything before the first marker, so start at index 1.
for (int i = 1; i < strArray.Length; i++)
{
    string tmp = strArray[i];

    // Keep only the part before the closing </a> tag.
    int tmpIndex = tmp.IndexOf("</a", StringComparison.Ordinal);
    if (tmpIndex < 0)
    {
        continue;
    }
    tmp = tmp.Substring(0, tmpIndex);

    // Left of </span> is the name, right of it is the body text.
    string[] strArray1 = tmp.Split(SPLIT_SPAN, StringSplitOptions.None);
    if (strArray1.Length == 2)
    {
        nameList.Add(strArray1[0]);
        textList.Add(strArray1[1].Replace("<br>", string.Empty));
    }
}

Implementation of method 3:

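Download HtmlAgilityPack.dll and add a reference to it in the project (the library is also available as the HtmlAgilityPack NuGet package).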

using System.Collections.Generic;
using HtmlAgilityPack;

HtmlWeb htmlWeb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("D:\\test.htm");

// All <a> elements inside the <li> items of the list whose class is onlyClass.
string xpath = "//ul[@class='onlyClass']//li/a";
HtmlNodeCollection collection = document.DocumentNode.SelectNodes(xpath);

List<string> nameList = new List<string>();
List<string> textList = new List<string>();

// SelectNodes returns null, not an empty collection, when nothing matches.
if (collection != null)
{
    foreach (HtmlNode item in collection)
    {
        // The leading "." makes the XPath relative to the current <a> node;
        // a bare "//span" would search the whole document.
        HtmlNode span = item.SelectSingleNode(".//span[@class='only']");
        if (span == null || string.IsNullOrEmpty(item.InnerText))
        {
            continue;
        }

        nameList.Add(span.InnerText);
        // InnerText of the <a> node also contains the span's text.
        textList.Add(item.InnerText);
    }
}
