Regular Expressions Made Simple

 Most developers have heard of regular expressions, but many people's first impression is that they are hard to learn: at first glance there seem to be no rules at all, just a jumble of special symbols that is completely incomprehensible.

In fact, that impression comes only from unfamiliarity. Once you understand regex, you will find that there are not many metacharacters in common use, and they are neither hard to remember nor hard to understand. The only real difficulty is that, once combined, they are hard to read. This article aims to give everyone a basic understanding of regex: enough to read simple regular expressions, write simple ones, and cover everyday development needs.

0\d{2}-\d{8}|0\d{3}-\d{7} Let us start with this regex. If you have never studied regular expressions, you probably have no idea what this string of characters means. That does not matter; this article explains the meaning of each character in detail.

 

1.1 What is a regular expression

     A regular expression is a special kind of string pattern used to match a set of strings. It works like a mold in manufacturing: the regex is the mold, it defines a rule, and it matches the characters that conform to that rule.

1.2 Common regex matching tools

     Online matching tools:

      1 http://www.regexpal.com/

      2 http://rubular.com/

     Regex matching software:

      McTracer

      After trying several tools, I still find this one the best. It can export the regex for a given language such as Java, C# or JS and escapes it for you, so you can copy it straight into your code. It also explains the regular expression, showing which part is a capturing group, which part is a greedy match, and so on. In short, it is a pleasure to use.

 

Two: Regex characters in brief

2.1 Introduction to metacharacters

   "^" : matches the starting position of a line or string; sometimes it also matches the starting position of the entire document.

   "$" : matches the end of a line or string.

    [Figure omitted] The matched string must begin with "This" (even a leading space breaks the match) and must end with "Regex" (no trailing spaces or other characters are allowed).

     

 

   "\b" : consumes no characters and matches only a position. It is often used to match a word boundary. For example, to match the standalone word "is" in the string "This is Regex", the regex is "\bis\b".

    "\b" does not match the characters on either side of the word; it only checks whether each side of the word is a boundary.

   "\d" : matches a digit.

    For example, to match a phone number in a fixed format (a leading 0, four digits before the hyphen and seven after), such as 0737-5686123, the regex is ^0\d\d\d-\d\d\d\d\d\d\d$. This is only meant to introduce "\d"; a better way to write it is described below.

   "\w" : matches a letter, a digit, or an underscore.

    For example, to match "a2345BCD__TTz" the regex is "\w+". Here "+" is a quantifier that specifies the number of repetitions; it is covered in detail later.

   "\s" : matches a whitespace character.

    For example, for the string "a b c" the regex is "\w\s\w\s\w": each character is followed by one whitespace character. If there can be several spaces between the characters, write "\s+" so that the whitespace may repeat.

   "." : matches any character except a newline.

    This can be seen as an enhanced version of "\w". "\w" cannot match a string that contains spaces, so it is limited when spaces are present. With "." the string "a23 4 5 BC D__TTz" can be matched by the regex ".+".

   "[abc]" : a character set; matches any one of the characters listed inside the brackets.

    This one is simple: it matches only the characters inside the brackets. It can also be written as a range such as [a-z], which matches the letters a to z and can be used, for example, to restrict input to English letters.
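
The metacharacters above can be tried out with a short C# sketch. The class name MetacharDemo, the helper FirstMatch, and the patterns "^This.*Regex$" and "^[a-z]+$" are illustrative assumptions (the original figure is not available); the remaining patterns and sample strings come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; patterns come from the text unless marked as assumed.
public static class MetacharDemo
{
    // Returns the first match of pattern in input, or null when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // ^ and $ (assumed pattern): must start with "This" and end with "Regex".
        Console.WriteLine(Regex.IsMatch("This is Regex", @"^This.*Regex$"));   // True
        Console.WriteLine(Regex.IsMatch(" This is Regex", @"^This.*Regex$"));  // False (leading space)

        // \b: only the standalone word "is" matches, not the "is" inside "This".
        Console.WriteLine(Regex.Match("This is Regex", @"\bis\b").Index);      // 5

        // \d: the fixed-format phone number from the text.
        Console.WriteLine(Regex.IsMatch("0737-5686123", @"^0\d\d\d-\d\d\d\d\d\d\d$")); // True

        // \w+ stops at the first space, while .+ runs to the end of the line.
        Console.WriteLine(FirstMatch("a23 4 5 BC D__TTz", @"\w+")); // a23
        Console.WriteLine(FirstMatch("a23 4 5 BC D__TTz", @".+"));  // a23 4 5 BC D__TTz

        // \s matches the whitespace between the word characters.
        Console.WriteLine(Regex.IsMatch("a b c", @"\w\s\w\s\w"));   // True

        // [a-z] (assumed anchored form): only lowercase English letters are accepted.
        Console.WriteLine(Regex.IsMatch("hello", @"^[a-z]+$"));     // True
        Console.WriteLine(Regex.IsMatch("hello1", @"^[a-z]+$"));    // False
    }
}
```

Note how "\w+" returns only "a23" for the spaced string, which is exactly the limitation that "." removes.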

 

2.2 Negated forms

  These are simple: for the backslash forms, switch the letter to uppercase and the meaning becomes the opposite of the original.

   "\W"    matches any character that is not a letter, digit, or underscore

   "\S"    matches any character that is not whitespace

   "\D"   matches any non-digit character

   "\B"   matches a position that is not the beginning or end of a word

   "[^abc]"   matches any character except a, b, and c
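
As a rough illustration of the negated forms, the sketch below removes everything a negated class matches and prints what is left. The class NegationDemo, the helper Strip, and the sample string "Tel: 0737-5686123" are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for the negated character classes.
public static class NegationDemo
{
    // Removes every character matched by pattern from input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string sample = "Tel: 0737-5686123"; // assumed sample text

        Console.WriteLine(Strip(sample, @"\D"));               // 07375686123 (non-digits removed)
        Console.WriteLine(Strip(sample, @"\W"));               // Tel07375686123 (non-word characters removed)
        Console.WriteLine("[" + Strip(sample, @"\S") + "]");   // [ ] (only the space is left)
        Console.WriteLine(Strip("abcde", @"[^abc]"));          // abc

        // \B requires a non-boundary, so this finds the "is" inside "This".
        Console.WriteLine(Regex.Match("This is Regex", @"\Bis\b").Index); // 2
    }
}
```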

 

 2.3 Quantifiers

  First, three important concepts related to quantifiers:

    Greedy, such as "*": a greedy quantifier first tries to match as much of the string as possible. If the overall match then fails, it gives back one character and tries again. This process is called backtracking; it gives back one character at a time until a match is found or there are no characters left to give back. Because of all this backtracking, greedy matching can be the most expensive of the three kinds.

   Lazy (reluctant), such as "*?": a lazy quantifier works the other way round. It starts at the beginning of the target, takes as few characters as possible, and checks whether the rest of the pattern matches; if not, it takes one more character and checks again, continuing until a match is found or the end of the string is reached.

   Possessive, such as "*+": a possessive quantifier grabs as much of the target string as it can and then tries to match, but it tries only once and never backtracks. It is like grabbing a handful of stones first and then picking the gold out of them. Not every regex engine supports this form.

   The three basic quantifiers "*", "+" and "?" are all greedy on their own:

     "*"    repeats zero or more times

       For example, for the string "aaaaaaaa" the regex "a*" matches all of the "a" characters.

     "+"    repeats one or more times

       For example, for the string "aaaaaaaa" the regex "a+" also matches all of the "a" characters. The difference between "a+" and "a*" is that "+" requires at least one occurrence, while "*" allows zero.

       The difference becomes visible later, when they are combined with the "?" character.

     "?"    repeats zero or one time

       For example, for the string "aaaaaaaa" the regex "a?" matches only once, so the result is a single character "a".

     "{n}"   repeats exactly n times

       For example, for the string "aaaaaaaa" the regex "a{3}" matches a repeated three times, and the result is the three characters "aaa".

     "{n,m}"   repeats n to m times

       For example, the regex "a{3,4}" matches a repeated three or four times, so both "aaa" and "aaaa" can be matched.

     "{n,}"   repeats n or more times

       Unlike {n,m}, there is no upper limit on the number of repetitions, but there must be at least n. For example, the regex "a{3,}" requires a to be repeated at least three times.
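
The quantifier examples can be checked with a small C# sketch. The class QuantifierDemo, its FirstMatch helper, and the "bbb" string are assumptions for illustration; the "aaaaaaaa" input and the patterns are the ones used in the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for the basic quantifiers.
public static class QuantifierDemo
{
    // Returns the text of the first match (an empty string if the match is empty or fails).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string s = "aaaaaaaa"; // eight a's, as in the text

        Console.WriteLine(FirstMatch(s, "a*"));      // aaaaaaaa
        Console.WriteLine(FirstMatch(s, "a+"));      // aaaaaaaa
        Console.WriteLine(FirstMatch(s, "a?"));      // a
        Console.WriteLine(FirstMatch(s, "a{3}"));    // aaa
        Console.WriteLine(FirstMatch(s, "a{3,4}"));  // aaaa (greedy, so it takes the upper bound)
        Console.WriteLine(FirstMatch(s, "a{3,}"));   // aaaaaaaa

        // "*" allows zero repetitions, so it succeeds with an empty match where "+" fails.
        Console.WriteLine(Regex.Match("bbb", "a*").Success); // True
        Console.WriteLine(Regex.Match("bbb", "a+").Success); // False
    }
}
```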

 Now that quantifiers are understood, the earlier phone number regex can be simplified: ^0\d\d\d-\d\d\d\d\d\d\d$ can be changed to "^0\d+-\d{7}$".

This is still not perfect, because the area code is not restricted: any number of digits would be accepted, while a real area code has only three or four digits.

So change it once more, to "^0\d{2,3}-\d{7}$". Now the area code part matches only three or four digits.
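
To see how the three versions of the phone number regex differ, here is a minimal C# sketch. The class PhoneDemo, its constant names, and every test number other than 0737-5686123 are assumptions invented for the comparison.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing the three phone number patterns from the text.
public static class PhoneDemo
{
    public const string Verbose = @"^0\d\d\d-\d\d\d\d\d\d\d$"; // exactly 4 + 7 digits
    public const string Loose   = @"^0\d+-\d{7}$";             // area code of any length
    public const string Bounded = @"^0\d{2,3}-\d{7}$";         // area code of 3 or 4 digits

    public static bool IsPhone(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Four-digit area code: accepted by all three patterns.
        Console.WriteLine(IsPhone("0737-5686123", Verbose)); // True
        Console.WriteLine(IsPhone("0737-5686123", Loose));   // True
        Console.WriteLine(IsPhone("0737-5686123", Bounded)); // True

        // Three-digit area code (made-up number): the verbose pattern rejects it.
        Console.WriteLine(IsPhone("073-5686123", Verbose));  // False
        Console.WriteLine(IsPhone("073-5686123", Bounded));  // True

        // Overlong area code (made-up number): only the loose pattern lets it through.
        Console.WriteLine(IsPhone("0737123456-5686123", Loose));   // True
        Console.WriteLine(IsPhone("0737123456-5686123", Bounded)); // False
    }
}
```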

 2.4 Lazy quantifiers

  "*?"    repeats any number of times, but as few times as possible

      For example, for "acbacb" the regex "a.*?b" matches only the first "acb". A greedy match would have taken the whole string, but with the lazy qualifier added it matches as few characters as possible, and the shortest possible match in "acbacb" is "acb".

  "+?"   repeats one or more times, but as few times as possible

      Same as above, except that there must be at least one repetition.

  "??"   repeats zero or one time, but as few times as possible

      For example, for "aaacb" the regex "a.??b" matches the last three characters, "acb".

  "{n,m}?"   repeats n to m times, but as few times as possible

      For "aaaaaaaa" the regex "a{0,m}?" matches zero times, so the result is empty.

  "{n,}?"     repeats n or more times, but as few times as possible

      For "aaaaaaa" the regex "a{1,}?" matches the minimum of one time, so the result is "a".
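
The difference between greedy and lazy matching is easiest to see side by side. The sketch below is only an illustration: the class LazyDemo and its helper are made-up names, and m is assumed to be 3 in "a{0,m}?".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class contrasting greedy and lazy quantifiers.
public static class LazyDemo
{
    // Returns the text of the first match (empty string for an empty match).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("acbacb", "a.*b"));   // acbacb (greedy takes everything)
        Console.WriteLine(FirstMatch("acbacb", "a.*?b"));  // acb    (lazy stops at the first b)
        Console.WriteLine(FirstMatch("acbacb", "a.+?b"));  // acb

        // "a.??b" cannot match at the first two a's, so the match is the last three characters.
        Match m = Regex.Match("aaacb", "a.??b");
        Console.WriteLine(m.Value + " at index " + m.Index); // acb at index 2

        // Lazy counted quantifiers take the minimum; m = 3 is an assumed value.
        Console.WriteLine("[" + FirstMatch("aaaaaaaa", "a{0,3}?") + "]"); // []
        Console.WriteLine("[" + FirstMatch("aaaaaaaa", "a{0,3}") + "]");  // [aaa]
        Console.WriteLine(FirstMatch("aaaaaaa", "a{1,}?"));               // a
    }
}
```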

 

Three: Advanced regex

     3.1 Capturing groups

  To use capturing, first understand the concept of a group in regex: it is simply the content inside a pair of parentheses. In "(\d)\d", the "(\d)" part is a capturing group. A capturing group can be back-referenced later in the expression (if the same content appears again later, you can refer to the previously defined group directly, which simplifies the expression), as in (\d)\d\1, where "\1" is a reference to "(\d)".

What is a capturing group good for? An example makes it clear.

For "zery zery" the regex is \b(\w+)\b\s\1\b. Here "\1" stands for the same text that (\w+) captured, which is "zery". To make groups more meaningful, you can give a group a name of your own:

"\b(?<name>\w+)\b\s\k<name>\b". "?<name>" assigns the group name, and a later back-reference to it must be written "\k<name>". With a named group, the captured value is stored under that group name.
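
The back-reference examples can be sketched in C# as follows. The class BackreferenceDemo, the helper RepeatedWord, and the inputs "zery blog", "121" and "122" are assumptions for illustration; the patterns are the ones from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for numbered and named back-references.
public static class BackreferenceDemo
{
    // Returns the word that appears twice in a row, or null if there is none.
    public static string RepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b(?<name>\w+)\b\s\k<name>\b");
        return m.Success ? m.Groups["name"].Value : null;
    }

    public static void Main()
    {
        // Numbered back-reference: \1 must repeat whatever (\w+) captured.
        Console.WriteLine(Regex.IsMatch("zery zery", @"\b(\w+)\b\s\1\b")); // True
        Console.WriteLine(Regex.IsMatch("zery blog", @"\b(\w+)\b\s\1\b")); // False

        // Named back-reference: the captured value is available under the group name.
        Console.WriteLine(RepeatedWord("zery zery")); // zery

        // (\d)\d\1: the third digit must equal the first.
        Console.WriteLine(Regex.IsMatch("121", @"(\d)\d\1")); // True
        Console.WriteLine(Regex.IsMatch("122", @"(\d)\d\1")); // False
    }
}
```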

Common forms of grouping are listed below:

 

"(exp)"     matches exp and captures the text into an automatically numbered group

"(?<name>exp)"    matches exp and captures the text into a group named name

"(?:exp)"   matches exp but does not capture the matched text and does not assign a group number to this group
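
A brief C# sketch can show how the three group forms behave. The class GroupKindsDemo, the helper GroupCount, and the date-like pattern are assumptions made up for this example; the date string itself is borrowed from the "Posted" line of the HTML sample further below.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for numbered, named and non-capturing groups.
public static class GroupKindsDemo
{
    // Number of groups in a pattern, including group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    public static void Main()
    {
        // One numbered group, one named group, one non-capturing group.
        Match m = Regex.Match("2013-11-23", @"(\d{4})-(?<month>\d{2})-(?:\d{2})");

        Console.WriteLine(m.Value);                  // 2013-11-23
        Console.WriteLine(m.Groups[1].Value);        // 2013
        Console.WriteLine(m.Groups["month"].Value);  // 11

        // (?:...) still takes part in the match but does not get a group number.
        Console.WriteLine(GroupCount(@"(\d)(\d)"));    // 3
        Console.WriteLine(GroupCount(@"(\d)(?:\d)"));  // 2
    }
}
```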

The following are zero-width assertions:

"(?=exp)"   matches a position that is followed by exp

  For "How are you doing" the regex "(?<txt>.+(?=ing))" takes all of the characters before "ing" and defines a capturing group named "txt", whose value is "How are you do".

"(?<=exp)"   matches a position that is preceded by exp

  For "How are you doing" the regex "(?<txt>(?<=How).+)" takes all of the characters after "How" and defines a capturing group named "txt", whose value is " are you doing" (including the leading space).

"(?!exp)"   matches a position that is not followed by exp

  For "123abc" the regex "\d{3}(?!\d)" matches three digits that are not followed by another digit.

"(?<!exp)"   matches a position that is not preceded by exp

  For "abc123" the regex "(?<![0-9])123" matches "123" only when it is not preceded by a digit; it can also be written "(?<!\d)123".
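
The four assertions can be exercised with the sketch below. The class LookaroundDemo, the helper Txt, and the extra inputs "1234abc" and "4123" are assumptions added to show the negative cases; the other patterns and strings are from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for zero-width assertions.
public static class LookaroundDemo
{
    // Returns the value of the group named "txt" from the first match.
    public static string Txt(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups["txt"].Value;
    }

    public static void Main()
    {
        string s = "How are you doing";

        // Lookahead: everything before "ing"; "ing" itself is not consumed.
        Console.WriteLine("[" + Txt(s, @"(?<txt>.+(?=ing))") + "]");   // [How are you do]

        // Lookbehind: everything after "How", including the space that follows it.
        Console.WriteLine("[" + Txt(s, @"(?<txt>(?<=How).+)") + "]");  // [ are you doing]

        // Negative lookahead: three digits not followed by another digit.
        Console.WriteLine(Regex.Match("123abc", @"\d{3}(?!\d)").Value);   // 123
        Console.WriteLine(Regex.Match("1234abc", @"\d{3}(?!\d)").Value);  // 234

        // Negative lookbehind: "123" not preceded by a digit.
        Console.WriteLine(Regex.IsMatch("abc123", @"(?<!\d)123"));  // True
        Console.WriteLine(Regex.IsMatch("4123", @"(?<!\d)123"));    // False
    }
}
```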

 

Four: Regex in practice

  Regex is very useful for validation and for filtering data, as anyone who has used it knows. Here we combine everything covered above in a real task: collecting data, using regex to filter HTML tags and extract the corresponding values.

We choose cnblogs as the battlefield. Suppose we want to collect the information of all the articles on the cnblogs home page, including the article title, author, blog link, summary, publish time, read count, comment count, and recommendation count.

 

First, look at the HTML format of an article entry on cnblogs:

<div class="post_item">
<div class="digg">
    <div class="diggit" onclick="DiggIt(3439076,120879,1)"> 
    <span class="diggnum" id="digg_count_3439076">4</span>
    </div>
    <div class="clear"></div>    
    <div id="digg_tip_3439076" class="digg_tip"></div>
</div>      
<div class="post_item_body">
    <h3><a class="titlelnk" href="http://www.cnblogs.com/swq6413/p/3439076.html" target="_blank">Sharing a complete project directory structure</a></h3>                   
    <p class="post_item_summary">
<a href="http://www.cnblogs.com/swq6413/" target="_blank"><img width="48" height="48" class="pfs" src="http://pic.cnitblog.com/face/142964/20131116170946.png" alt=""/></a> In project development, it is very important to keep the various data files of a project in order and to establish a clearly classified, easy-to-manage directory structure. Drawing on previous projects and the project structures of some friends, I put together what I think is a pretty good project directory structure. I am sharing it here and welcome your comments and suggestions. If you like it, please click "recommend", thanks!! The whole directory goes down to 4 levels of subdirectories, ... 
    </p>               
    <div class="post_item_foot">                     
    <a href="http://www.cnblogs.com/swq6413/" class="lightblue">seven Master</a> 
    Posted 2013-11-23 15:48  
    <span class="article_comment"><a href="http://www.cnblogs.com/swq6413/p/3439076.html#commentform" title="
        comments (4)</a></span><span class="article_view"><a href="http://www.cnblogs.com/swq6413/p/3439076.html"

 

 

 The process is to build an HTTP request to fetch the page and then process the returned data to extract the key information. The power of regex really shows when filtering HTML tags to get at the text.

Only basic regex knowledge is used, such as "\s", "\w+" and ".*?", together with capturing groups and zero-width assertions. Interested readers can try it themselves and see how to extract the corresponding data with regex. The regex in the code is very basic and simple, and its meaning and usage are explained in detail above.

 

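    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program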
    {
        static void Main(string[] args)
        {
         
            string content = HttpUtility.HttpGetHtml();
            HttpUtility.GetArticles(content);
        }
    }

    internal class HttpUtility
    {
        // Fetch the first page of data by default
        public static string HttpGetHtml()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/");
            request.Accept = "text/plain, */*; q=0.01";
            request.Method = "GET";
            request.Headers.Add("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
            request.ContentLength = 0;
           
            request.Host = "www.cnblogs.com";
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.1.3.5000 Chrome/26.0.1410.43 Safari/537.1";
            // Dispose the response, stream and reader once the page has been read
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream responStream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(responStream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }

        }

        public static List<Article> GetArticles(string htmlString)
        {
            List<Article> articleList = new List<Article>();
            Regex regex = null;
            Article article = null;
            regex = new Regex("<div class=\"post_item\">(?<content>.*?)(?=<div class=\"clear\">" + @"</div>\s*</div>)",
                              RegexOptions.Singleline);

            if (regex.IsMatch(htmlString))
            {
                MatchCollection aritcles = regex.Matches(htmlString);

                foreach (Match item in aritcles)
                {
                    article = new Article();
                    // Get the recommendation count
                    regex =
                        new Regex(
                            "<div class=\"digg\">.*<span.*>(?<digNum>.*)" + @"</span>" +
                            ".*<div class=\"post_item_body\">", RegexOptions.Singleline);
                    article.DiggNum = regex.Match(item.Value).Groups["digNum"].Value;

                    // Get the article title (escape characters need to be removed)
                    regex = new Regex("<h3>(?<a>.*)</h3>", RegexOptions.Singleline);
                    string a = regex.Match(item.Value).Groups["a"].Value;
                    regex = new Regex("<a\\s.*href=\"(?<href>.*?)\".*>(?<summary>.*)</a>", RegexOptions.Singleline);
                    article.AritcleUrl = regex.Match(a).Groups["href"].Value;
                    article.AritcleTitle = regex.Match(a).Groups["summary"].Value;

                    // get author pictures
                    regex = new Regex("<a.*>(?<img><img[^>].*>)</a>", RegexOptions.Singleline);
                    article.AuthorImg = regex.Match(item.Value).Groups["img"].Value;

                    // Get the author's blog URL and the link's target attribute
                    regex = new Regex("<a\\s*?href=\"(?<href>.*)\"\\s*?target=\"(?<target>.*?)\">.*</a>",
                                      RegexOptions.Singleline);
                    article.AuthorUrl = regex.Match(item.Value).Groups["href"].Value;
                    string urlTarget = regex.Match(item.Value).Groups["target"].Value;

                    // Get the article summary
                    // 1. First take all of the content inside the summary element
                    regex = new Regex("<p class=\"post_item_summary\">(?<summary>.*)</p>", RegexOptions.Singleline);
                    string summary = regex.Match(item.Value).Groups["summary"].Value;
                    // 2. Then take the summary text that follows the </a>
                    regex = new Regex("(?<indroduct>(?<=</a>).*)", RegexOptions.Singleline);
                    article.AritcleInto = regex.Match(summary).Groups["indroduct"].Value;


                    // Get the author name and the publish time
                    regex =
                        new Regex(
                            "<div class=\"post_item_foot\">\\s*<a.*?>(?<publishName>.*)</a>(?<publishTime>.*)<span class=\"article_comment\">",
                            RegexOptions.Singleline);
                    article.Author = regex.Match(item.Value).Groups["publishName"].Value;
                    article.PublishTime = regex.Match(item.Value).Groups["publishTime"].Value.Trim();

                    // Get the comment count
                    regex =
                        new Regex(
                            "<span class=\"article_comment\"><a.*>(?<comment>.*)</a></span><span class=\"article_view\">",
                            RegexOptions.Singleline);
                    article.CommentNum = regex.Match(item.Value).Groups["comment"].Value;

                    // Get the read count
                    regex = new Regex("<span\\s*class=\"article_view\"><a.*>(?<readNum>.*)</a>", RegexOptions.Singleline);
                    article.ReadNum = regex.Match(item.Value).Groups["readNum"].Value;
                    articleList.Add(article);
                }

            }
            return articleList;
        }



        public static string ClearSpecialTag(string htmlString)
        {

            string htmlStr = Regex.Replace(htmlString, "\n", "", RegexOptions.IgnoreCase);
            htmlStr = Regex.Replace(htmlStr, "\t", "", RegexOptions.IgnoreCase);
            htmlStr = Regex.Replace(htmlStr, "\r", "", RegexOptions.IgnoreCase);
            htmlStr = Regex.Replace(htmlStr, "\"", "'", RegexOptions.IgnoreCase);
            return htmlStr;
        }
    }

    public class Article
    {
        /// <summary>
        /// Article title
        /// </summary>
        public string AritcleTitle { get; set; }
        /// <summary>
        /// Article link
        /// </summary>
        public string AritcleUrl { get; set; }
        /// <summary>
        /// Article summary
        /// </summary>
        public string AritcleInto { get; set; }
        /// <summary>
        /// Author
        /// </summary>
        public string Author { get; set; }
        /// <summary>
        /// Author's blog address
        /// </summary>
        public string AuthorUrl { get; set; }
        /// <summary>
        /// Author picture
        /// </summary>
        public string AuthorImg { get; set; }
        /// <summary>
        /// Publish time
        /// </summary>
        public string PublishTime { get; set; }
        /// <summary>
        /// Recommendation count
        /// </summary>
        public string DiggNum { get; set; }

        /// <summary>
        /// Comment count
        /// </summary>
        public string CommentNum { get; set; }
        /// <summary>
        /// Read count
        /// </summary>
        public string ReadNum { get; set; }

    }

 The regex in this part may not be perfect, but at least it matches. I am new to regex myself, so for now I can only write fairly simple expressions. Please bear with me.
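
For readers who want to experiment without making an HTTP request, the following is a minimal offline sketch that applies the same ideas to two lines taken from the HTML sample. It is not the author's code: the class HtmlExtractDemo, its members, the placeholder title text, and the use of "[^>]*" with lazy quantifiers in place of the greedy ".*" are all assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical offline variation on the extraction step; not part of the original listing.
public static class HtmlExtractDemo
{
    // [^>]* keeps each match inside a single tag; .*? stops at the first closing quote or tag.
    public const string TitlePattern =
        "<h3><a\\s[^>]*href=\"(?<href>.*?)\"[^>]*>(?<title>.*?)</a></h3>";

    public const string DiggPattern =
        "<span class=\"diggnum\"[^>]*>(?<digNum>\\d+)</span>";

    // Returns the value of the named group from the first match (empty string if no match).
    public static string Extract(string html, string pattern, string groupName)
    {
        return Regex.Match(html, pattern, RegexOptions.Singleline).Groups[groupName].Value;
    }

    public static void Main()
    {
        // Two lines modelled on the HTML sample; "Sample title" is a placeholder.
        string html =
            "<span class=\"diggnum\" id=\"digg_count_3439076\">4</span>\n" +
            "<h3><a class=\"titlelnk\" href=\"http://www.cnblogs.com/swq6413/p/3439076.html\" " +
            "target=\"_blank\">Sample title</a></h3>";

        Console.WriteLine(Extract(html, DiggPattern, "digNum")); // 4
        Console.WriteLine(Extract(html, TitlePattern, "href"));  // http://www.cnblogs.com/swq6413/p/3439076.html
        Console.WriteLine(Extract(html, TitlePattern, "title")); // Sample title
    }
}
```

Restricting a wildcard to "[^>]*" is one common way to stop a greedy ".*" from running across several tags, which is the kind of pitfall that makes HTML matching fragile.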

 

 

Five: Summary

  Regex is not hard. Once you understand what each symbol means, try writing a few expressions yourself and it will come naturally. Regex is notorious for its pitfalls, though: leave out a single dot and the data no longer matches. I fell into plenty of pits myself, and climbing out of them one by one is how I gained experience.

This article is only a very basic introduction to regex. Many regex characters are not covered; only the more commonly used ones are. If anything is wrong, please point it out in the comments and I will fix it promptly.

 

Original Address  https://www.cnblogs.com/zery/p/3438845.html


Origin www.cnblogs.com/qhantime/p/11408234.html