Deleting specific content with regular expression replacement

    In the past, network restrictions at my company made it impossible to log in to certain websites with a username and password, even though the sites themselves opened fine in a browser. That caused a problem: users who are not logged in cannot copy the code in blog posts, which was rather inconvenient.

    Fortunately, the code can still be obtained by viewing the source of the web page. As shown below, this is the code on a CSDN page:

    How to get the content of an HTML element:

     1. Press F12 to open the browser's developer tools.

     2. Switch to the Elements tab.

     3. Use the mouse to select the element that contains the code.

     4. Copy the element.

    Looking at the copied content, the code is wrapped in HTML tags, so we need to remove the tags.

    HTML tags have one useful characteristic: they generally come in pairs around the content. If we remove every tag, only the content is left. With this in mind, we can get to work.

    Regular expression replacement is a natural fit here, since every tag looks like <div> or </div>. The obvious pattern <.*> does not work, though. Because .* is greedy, the whole of <div>xxx</div> matches <.*> as a single match, so the replacement wipes out the content as well instead of removing only the <div> and </div> on either side.

    What we need is a pattern in which the closing character ">" cannot appear between the "<" and the ">". In other words, each match covers exactly one tag, whether it is an opening tag such as <div> or a closing tag such as </div>.

     To meet this requirement, change the regular expression to <[^>]+>. The result of the replacement is as follows:

    After the replacement, only the code is left, which is exactly what we need.
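
    The difference between the two patterns can be sketched in Java. This is only an illustration: the class name TagStripDemo, the helper strip, and the sample HTML string are made up for this example and do not come from the original page.

```java
import java.util.regex.Pattern;

class TagStripDemo {
    // Greedy: ".*" runs to the LAST '>' on the line, so a single match
    // can swallow the text between the tags as well.
    static final Pattern GREEDY = Pattern.compile("<.*>");

    // Negated class: "[^>]+" cannot cross a '>', so each match is one tag.
    static final Pattern SINGLE_TAG = Pattern.compile("<[^>]+>");

    // Deletes every match of the given pattern from the input.
    static String strip(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from a real page.
        String html = "<div><span>int a = 1;</span></div>";

        System.out.println("[" + strip(GREEDY, html) + "]");     // prints []
        System.out.println("[" + strip(SINGLE_TAG, html) + "]"); // prints [int a = 1;]
    }
}
```

    The brackets in the output only make the empty result of the greedy pattern visible.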

    A side note: this was done in Notepad++, which supports regular expressions, so the Replace dialog looks like this:

    Set the search mode to "Regular expression". Since we want to delete the tags, the matched content is replaced with nothing, so leave "Replace with" empty. Finally, click Replace All.

    Here is another replacement example, this time implemented in Java code.

    Suppose we have a string like this:

[{"name":"buejee","id":101,"email":[],"mobile":"15909062001"},{"name":"lucky","id":102,"email":["[email protected]"],"mobile":"15909062002"}]

    Although this is in JSON format, it exists as a plain string. We need to delete the email part. The example above contains two email fields, "email":[], and "email":["[email protected]"], . To remove them all, the regular expression can be written like this:

"email":\[[^\]]*\], 

    Here, [^\]]* means that "]" cannot appear between the "[" and the "]". The * means the content can be any number of characters, including none, so the pattern matches both [] and ["[email protected]"]. In addition, because "[" and "]" are metacharacters in regular expressions, they must be escaped with a backslash. In a Java string literal the backslash itself has to be escaped, so each \ is written as two backslashes \\ .
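
    The escaping described above can be checked with a small sketch that lists what the pattern matches. The class name EmailPatternDemo, the helper findAll, and the address user@example.com are placeholders chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailPatternDemo {
    // Regex as typed in an editor:   "email":\[[^\]]*\],
    // In Java source every backslash is doubled and every double quote is escaped.
    static final Pattern EMAIL_FIELD = Pattern.compile("\"email\":\\[[^\\]]*\\],");

    // Collects every substring of the input that the pattern matches.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = EMAIL_FIELD.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // user@example.com is a placeholder address.
        String json = "{\"email\":[],\"id\":1},{\"email\":[\"user@example.com\"],\"id\":2}";
        for (String match : findAll(json)) {
            System.out.println(match);
        }
        // prints:
        // "email":[],
        // "email":["user@example.com"],
    }
}
```

    Note that the pattern includes the trailing comma, so an email field that is the last field of its object would not be matched.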

    The program code is as follows:

package com.xxx.reg;

import java.util.regex.Matcher;
import java.util.regex.Pattern;


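public class StringReplace {
    public static void main(String[] args) {
        // The JSON text is handled as a plain string, not parsed.
        String str = "[{\"name\":\"buejee\",\"id\":101,\"email\":[],\"mobile\":\"15909062001\"},{\"name\":\"lucky\",\"id\":102,\"email\":[\"[email protected]\"],\"mobile\":\"15909062002\"}]";
        // Regex "email":\[[^\]]*\], with backslashes and quotes escaped for Java.
        Pattern pattern = Pattern.compile("\"email\":\\[[^\\]]*\\],");
        Matcher matcher = pattern.matcher(str);
        // Print the original string for comparison.
        System.out.println(str);
        // Replace every match with the empty string, i.e. delete it.
        String result = matcher.replaceAll("");
        System.out.println(result);
    }
}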

    Output:

     In the printed result, the email parts have been deleted.

    These two examples have one thing in common: the content to delete must be bounded, and a greedy match such as .* must be avoided, otherwise the replacement removes too much. Use the negated character class syntax [^...] in the regular expression to keep a specific delimiter from appearing inside the match.
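
    For comparison, Java regular expressions also offer a reluctant quantifier .*? that stops at the first possible closing delimiter. The article does not use it; the sketch below only shows that it produces the same result on simple inputs like these. The class name ReluctantDemo, the method names, and the sample strings are hypothetical.

```java
class ReluctantDemo {
    // ".*?" matches as few characters as possible before the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Same idea for the email field: stop at the first "]," that follows.
    static String stripEmail(String json) {
        return json.replaceAll("\"email\":\\[.*?\\],", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div>xxx</div>")); // prints xxx

        // user@example.com is a placeholder address.
        String json = "[{\"id\":101,\"email\":[],\"mobile\":\"1\"},"
                + "{\"id\":102,\"email\":[\"user@example.com\"],\"mobile\":\"2\"}]";
        System.out.println(stripEmail(json));
        // prints [{"id":101,"mobile":"1"},{"id":102,"mobile":"2"}]
    }
}
```

    The negated character class remains the stricter choice: it can never cross the delimiter, whereas .*? will extend past a "]" whenever that is the only way to complete the match.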

Origin blog.csdn.net/feinifi/article/details/131236766