Crawling pages with front-end JavaScript

1. The problem

Same-origin policy

Under the same-origin policy, JavaScript can only read and access pages that share its own origin. Note that a script's origin has nothing to do with the site the script file was downloaded from; it is determined solely by the document in which the script is embedded. Consider the following sample code:

<!DOCTYPE HTML>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>This webpage comes from http://localhost:8000</title>
    <script src="//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
</head>
<body>
  <div id="test">123</div>
  <script type="text/javascript">
    console.log($('#test').text());
  </script>
</body>
</html>

This HTML document comes from http://localhost:8000, so its origin is http://localhost:8000 (the scheme and port are part of the origin, not just the host name). Although jQuery is loaded from ajax.googleapis.com, the jQuery code takes the origin of the HTML document that embeds it, so it can access that document's properties and the code above runs correctly.
Aside: the library is loaded this way because many developers reference popular JavaScript libraries (e.g., jQuery) at the same public URL. Once a user has loaded the file, later loads can be served from the browser cache, which speeds up page loading.
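
The origin comparison described above can be sketched with the standard URL class. This is only an illustration: isSameOrigin is a hypothetical helper name, and the port 3000 URL is an assumed example rather than something from the original question.

```javascript
// Hypothetical helper: two URLs share an origin only when scheme, host, and port all match.
function isSameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// Same scheme, host, and port: same origin.
console.log(isSameOrigin('http://localhost:8000/index.html', 'http://localhost:8000/other.html')); // true

// Same host but a different port (assumed example): different origin.
console.log(isSameOrigin('http://localhost:8000/', 'http://localhost:3000/')); // false

// The CDN that serves jQuery is a different origin from the page that embeds it.
console.log(isSameOrigin('http://localhost:8000/', 'https://ajax.googleapis.com/')); // false
```

The last comparison shows why the origin of the script file does not matter: the embedded jQuery code still runs with the origin of the page at http://localhost:8000.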

From this point of view, if by "remote" the questioner means any arbitrary page on the Internet, the desired functionality cannot be achieved from front-end JavaScript; if "remote" means a site the questioner controls, see "Relaxing the same-origin policy" below.

Relaxing the same-origin policy

  1. document.domain: applies to subdomains. When several windows are involved (for example, a page containing several iframes), setting document.domain to the same value in each of them lets JavaScript access the window of another subdomain;
  2. Cross-origin resource sharing (CORS): the server adds an Access-Control-Allow-Origin header to its response, naming the origin that is allowed to access it (or * for any origin). Supporting browsers then let JavaScript from that origin read the response;
  3. Cross-document messaging: this works regardless of origin. JavaScript in different documents can send messages to and receive messages from one another without restriction, but cannot actively read the other document's properties or call its methods.

If the questioner controls the remote page, the second method is worth trying.
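
A minimal sketch of the second method, assuming the remote site is served by a Node.js server that the questioner controls. The names ALLOWED_ORIGINS, corsHeaders, and createCorsServer are hypothetical, as is port 9000; only the Access-Control-Allow-Origin header itself comes from the text above.

```javascript
const http = require('http');

// Assumption: the origins allowed to read responses from this server.
const ALLOWED_ORIGINS = ['http://localhost:8000'];

// Build the CORS response headers for the value of a request's Origin header.
function corsHeaders(requestOrigin) {
  if (!ALLOWED_ORIGINS.includes(requestOrigin)) {
    return {}; // no CORS header: the browser keeps blocking cross-origin reads
  }
  return {
    'Access-Control-Allow-Origin': requestOrigin,
    'Vary': 'Origin', // the response differs per origin, so caches must key on it
  };
}

// Create (but do not start) a server that answers every request with the CORS headers.
function createCorsServer() {
  return http.createServer((req, res) => {
    res.writeHead(200, {
      'Content-Type': 'text/plain; charset=utf-8',
      ...corsHeaders(req.headers.origin),
    });
    res.end('hello from the remote server');
  });
}

// To run it on the hypothetical port: createCorsServer().listen(9000);
```

With such a server running, a script in the page at http://localhost:8000 should be able to request http://localhost:9000/ and read the response, while pages from other origins would still be blocked by the browser.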

Server-side fetching

Given what the questioner needs, the more viable option is to do the work on the server side. With PhantomJS ( http://phantomjs.org/ ), you can manipulate the DOM with JavaScript syntax on the server side and then process the result further with Node.js; of course, Python, PHP, Java, or another language can also handle the follow-up processing.
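
As a rough sketch of the server-side idea, the following uses plain Node.js rather than PhantomJS, so it fetches the raw HTML without running the page's scripts or building a DOM. It assumes Node.js 18 or later (for the global fetch); extractTitle, fetchTitle, and the example URL are hypothetical names chosen for illustration.

```javascript
// Crude title extraction for illustration only; a real crawler would use
// an HTML parser or a headless browser instead of a regular expression.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch a page on the server side, where the browser's same-origin policy does not apply.
async function fetchTitle(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractTitle(await response.text());
}

// Usage (not executed here; the URL is an assumed example):
// fetchTitle('http://example.com/').then(console.log).catch(console.error);
```

Because the request is made by the server rather than by a script in a browser page, no Access-Control-Allow-Origin header is needed on the remote site.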

 

 

In conclusion:

(1) cross-origin access to a page is restricted unless its server allows it;

(2) request the page from the server side instead.


Origin www.cnblogs.com/mengfangui/p/11543411.html