Web scraping -- 4 (countermeasures, and how they are broken)

This post covers the technique of sending HTTP requests from Java and parsing the returned HTML, which is mainly used to crawl website data.

Ideas:

    Java connects to the target URL through URLConnection. Once the connection succeeds, the returned HTML is read from the connection's InputStream. With the HTML in hand, you can extract data according to the structure of the page, either by regular-expression matching or with a third-party parsing tool. A bare-bones sketch of the idea follows.
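A minimal sketch of that idea: fetch a page with URLConnection and pull out its <title> with a regular expression. The URL here is just a placeholder; a real parser is more robust for nested markup.

package com.test;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestFetch {
    public static void main(String[] args) throws Exception {
        // Connect to the target URL; example.com is a placeholder.
        URLConnection conn = new URL("http://www.example.com/").openConnection();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append("\r\n");
        }
        reader.close();
        // Regular-expression matching, as described above.
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        if (m.find()) {
            System.out.println("Title: " + m.group(1).trim());
        }
    }
}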

 

Countermeasures:

    No site owner wants others to grab their data so easily, so how do you prevent it?

Generally, you configure a robots.txt file in the HTTP server's root directory (the exact syntax is easy to look up online). There are ways around it, though: robots.txt stops gentlemen, not villains. Big websites publish one too, but more as a legal means of prohibiting crawling than a technical barrier, since it is difficult to make crawling technically impossible; Taobao blocking Baidu is one example. A robots.txt file will turn away a large share of crawlers (a minimal example follows), but if you want to be stricter, the only option is to extend the HTTP server yourself, for example by developing an Apache module that blocks ill-behaved crawlers.
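For reference, a minimal robots.txt of the kind described above; the crawler name and paths are made-up examples:

# Served from the site root, e.g. http://www.example.com/robots.txt
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /data/

This is purely advisory: a well-behaved crawler reads it and stays out, while a villain simply ignores it, which is exactly the "gentlemen, not villains" limitation above.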
To counter other people's malicious crawling and analysis of your site's data, you can also use web-page encryption. The encrypted pages can still be browsed normally in Internet Explorer or Netscape Navigator, but the source code can no longer be viewed or edited in the usual way. The open questions are whether this affects page speed and whether it affects compatibility across browsers. -- to be verified

Sending an HTTP request from Java and parsing the returned HTML

Article classification: Java programming
Today is Monday, July 7, 2008, and I spent the afternoon at school working on a personal start page. Since I can't live without Google's translator, I wanted to integrate it into my start page, and I ran into a problem: how do you use a Java program to send an HTTP request, intercept the data the remote server returns, process it, and output it? On top of that, Google's translation page submits data via POST, which cannot be driven through the URL alone, so this also raises the question of how to POST data from Java.
After reading answers on Baidu Knows (it cost me 20 points), I found htmlparser, a jar component that is said to parse HTML efficiently. I downloaded it right away (see the attachment at the end of the article), tried it, and it worked well. Along the way, I also learned how to make Java interact with other websites, which is a very handy capability: with htmlparser you can lift information from other sites as you please!
Without further ado, here are the specific steps to use it.

First, sending a POST request to a website takes only a few simple steps.
Note that no third-party packages need to be imported here:
 
package com.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;

public class TestPost {

    public static void testPost() throws IOException {
        URL url = new URL("http://www.faircanton.com/message/check.asp");
        URLConnection connection = url.openConnection();

        // Writing to the connection's output stream is what turns the
        // request into a POST.
        connection.setDoOutput(true);

        OutputStreamWriter out = new OutputStreamWriter(
                connection.getOutputStream(), "8859_1");
        out.write("username=kevin&password=********"); // the key to the POST!
        // remember to clean up
        out.flush();
        out.close();

        // Once sent, read the server's response:
        String sCurrentLine = "";
        String sTotalString = "";
        InputStream l_urlStream = connection.getInputStream();
        BufferedReader l_reader = new BufferedReader(new InputStreamReader(
                l_urlStream));
        while ((sCurrentLine = l_reader.readLine()) != null) {
            sTotalString += sCurrentLine + "\r\n";
        }
        System.out.println(sTotalString);
    }

    public static void main(String[] args) throws IOException {
        testPost();
    }
}
 
Execution result (it really returns the HTML from the login check -- magic!):
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>Account has been frozen</title>
<style type="text/css">
<!--
.temp {
 font-family: Arial, Helvetica, sans-serif;
 font-size: 14px;
 font-weight: bold;
 color : #666666;
 margin: 10px;
 padding: 10px;
 border: 1px solid #999999;
}
.STYLE1 {color: #FF0000}
-->
</style>
</head>
<body>
<p> </p>
<p> </p>
<p> </p>
<table width="700" border="0" align="center" cellpadding=" 0" cellspacing="0" class="temp">
  <tr>
    <td width="135" height="192"><div align="center"><img src="images/err.jpg" width= "54" height="58"></div></td>
    <td width="563"><p><span class="STYLE1">登录失败</span><br>
        <br> 您
    的The account activity index is below the system limit, and your account has been temporarily frozen. <br>
    Please contact your Network Director or Personnel Director to reactivate your account. </p>
    </td>




 
Some websites use POST instead of GET because POST can carry more data and does not put the parameters in the URL, which looks cleaner. With the rough code listed above, Java can easily talk to these sites.
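For contrast, here is what the GET version of the same round trip looks like; a sketch with a placeholder URL, where the parameters ride on the URL itself:

package com.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class TestGet {
    public static void main(String[] args) throws IOException {
        // With GET the parameters are part of the URL; no output stream needed.
        URL url = new URL("http://www.example.com/search?q=test");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}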


Once you have the HTML, analyzing the content is relatively easy. This is where htmlparser comes in. Below is a simple sample program; I won't over-explain it, as the code speaks for itself!
 
package com.test;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;

public class TestHTMLParser {

  public static void testHtml() {
    try {
      String sCurrentLine = "";
      String sTotalString = "";
      java.io.InputStream l_urlStream;
      java.net.URL l_url = new java.net.URL("http://www.ideagrace.com/html/doc/2006/07/04/00929.html");
      java.net.HttpURLConnection l_connection = (java.net.HttpURLConnection) l_url.openConnection();
      l_connection.connect();
      l_urlStream = l_connection.getInputStream();
      java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(l_urlStream));
      while ((sCurrentLine = l_reader.readLine()) != null) {
        sTotalString += sCurrentLine + "\r\n";
        // System.out.println(sTotalString);
      }
      String testText = extractText(sTotalString);
      System.out.println(testText);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static String extractText(String inputHtml) throws Exception {
    StringBuffer text = new StringBuffer();
    // Re-encode the input as GBK before handing it to the parser.
    Parser parser = Parser.createParser(new String(inputHtml.getBytes(), "GBK"), "GBK");
    // Walk all nodes (the filter accepts everything).
    NodeList nodes = parser.extractAllNodesThatMatch(new NodeFilter() {
      public boolean accept(Node node) {
        return true;
      }
    });
    System.out.println(nodes.size()); // print the number of nodes
    for (int i = 0; i < nodes.size(); i++) {
      Node nodet = nodes.elementAt(i);
      // System.out.println(nodet.getText());
      text.append(new String(nodet.toPlainTextString().getBytes("GBK")) + "\r\n");
    }
    return text.toString();
  }

  public static void test5(String resource) throws Exception {
    Parser myParser = new Parser(resource);
    myParser.setEncoding("GBK");
    String filterStr = "table";
    NodeFilter filter = new TagNameFilter(filterStr);
    NodeList nodeList = myParser.extractAllNodesThatMatch(filter);
    // Grab the twelfth <table> on the page and print its text content.
    TableTag tabletag = (TableTag) nodeList.elementAt(11);
    System.out.println(tabletag.toPlainTextString());
  }

  public static void main(String[] args) throws Exception {
    // test5("http://www.ggdig.com");
    testHtml();
  }
}
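As one more illustration of the same jar, here is a sketch that extracts just the links from a page instead of all the text. It assumes the NodeClassFilter and LinkTag classes that ship with htmlparser; the URL is a placeholder:

package com.test;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class TestLinks {
    public static void main(String[] args) throws Exception {
        // The Parser can fetch the URL by itself.
        Parser parser = new Parser("http://www.example.com/");
        parser.setEncoding("GBK");
        // Keep only <a> tags.
        NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLinkText() + " -> " + link.getLink());
        }
    }
}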
