What Is a Web Crawler?


          
           A web crawler is a powerful thing. The industry benchmarks, Google and Baidu, speak for themselves, and needless to say a web crawler is, at bottom, a program that gathers sources of information. When I first touched this stuff I was forced to confront my own ignorance; I suspected that whatever these big companies had built was a blessing far above ordinary folk like me, best admired through careful study from a distance. Passion still drove me to buy a pile of books, but reading them left me in a fog with no real insight. Finally, in a corner of the library, a shrink-wrapped copy of a Java web-robot programming guide gave me the inspiration I needed.
           Now to the topic. A web crawler is really a robot program. What is a robot program? A program that does repetitive work in a human's place. For example: you land a very boring job. Every day your boss tells you to copy the home-page content of a rival company's website, word for word, and save it for him (who actually does this?). So you open the browser, type in the rival's address, open their home page, drag the mouse over the text, Ctrl+C, Ctrl+V, save. All is well. Then one day your company's list of competitors grows to 50, and you repeat the same process 50 times. With practice your copy-paste kung fu gets fairly slick, and you still finish before quitting time. But think about it: what if later there are hundreds, thousands of competitors? My God, you'd be worked to death! In despair you wish for a program that could free you from this tedium. Luckily, you are a programmer, so you write one to do the job for you.
           How do we design this program? First, save the 50 companies' website addresses; then loop over each URL, visit the site, find the content, and store it locally. That is a crawler. The name sounds grand, but a crawler really just takes these repetitive operations off your hands. So you write some code like the following:
                  main()
                  {
                        url[50];              // the 50 companies' addresses
                        loop (i < 50)         // 50 iterations; for, while, any loop will do
                        {
                              request(url[i]); // fetch the page
                              Ctrl + C;        // copy
                              Ctrl + V;        // paste (save)
                              i++;
                        }
                  }
                           
            My God, you call that code? But our programmer has hit some real questions. How do you actually fetch a page? How do you find the content you want and save it? And once there are hundreds or thousands of sites, do I really have to type all those URLs into the array by hand, over and over? Well, every problem has a solution.
             First, fetching a page. Normally we get the page we want by typing a URL into the browser, so how does the browser do it? You have surely had this experience: eagerly waiting for a TV series to update, promised for 7 p.m., and yet nothing appears, until you refresh the page and discover it had updated all along. If you don't refresh, you never see the update: as long as you send no request, your page does not change. Typing a URL, or hitting refresh, is really the browser telling the server program which page you want; the server program returns that page, and the browser displays it. The tool that carries this conversation is the HTTP protocol, so the next step is to bring that tool into our own program. Every object-oriented language today ships with HTTP support, so don't worry about finding it; what matters is knowing it can fetch pages for us. And why can it find exactly the page we want? That comes down to the URL: "A Uniform Resource Locator is a compact way to state the location of a resource on the Internet and the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which encodes where that file lives."
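The request/response cycle described above can be seen end to end with a small, self-contained sketch: we stand up a tiny local HTTP server (standing in for the website), then fetch a page from it, exactly as a browser would. Everything here (the sample page, the handler name) is illustrative, not from the original article; only the Python standard library is used.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

PAGE = b"<html><body><p>hello crawler</p></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server program's side of the conversation: return the page asked for.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet

def fetch(url):
    """Send an HTTP GET and return the body as text -- the browser's core job."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")

# Port 0 asks the OS for any free port, so the demo never collides with a real service.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

html = fetch(f"http://127.0.0.1:{server.server_port}/index.html")
print(html)
server.shutdown()
```

Swap the local address for a real website's URL and `fetch` works unchanged; that one function is the whole "get the page" step of the crawler.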
  So says the textbook definition of a URL (well, more or less).
           For example, in http://127.0.0.1:8080/index.html, "http" says we are using the HTTP transport protocol, 127.0.0.1 is the IP address of the target machine, 8080 is the port, and index.html is the name of the file we want. Given all that information, it is not hard to imagine how an HTTP library locates the file we want from the address. (An address like http://www.XXX.com/ is a domain name that gets translated into this normal URL format.)
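The decomposition just described can be done mechanically with the standard library; this short sketch splits the article's own example URL into the pieces named above.

```python
from urllib.parse import urlsplit

# The example URL from the text: protocol, target machine, port, file name.
parts = urlsplit("http://127.0.0.1:8080/index.html")
print(parts.scheme)    # -> http        (transport protocol in use)
print(parts.hostname)  # -> 127.0.0.1   (IP address of the target machine)
print(parts.port)      # -> 8080        (port)
print(parts.path)      # -> /index.html (the file we want)
```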
          Next: how do we find the content we want and save it? First, what is a page? It is a big collection of text, images, video, music, programs and so on, and such a complex collection needs some form of organization, or managing it would be exhausting. That organization is HTML, the hypertext markup language: in essence, a pile of tags wrapping the content we want (text, images, video, music, programs...). Open your browser, enter a URL, press F12, and you will see something like the figure below: the left side is the page as rendered, the right side is the page in HTML form. The right side looks awful, but that doesn't matter; we only care about a few tags (a tag is a structure of the form <div> </div>). You could also grab a page-authoring tool like DreamWeaver and write a page or two yourself.
 After pressing F12, the page looks like this (rendered page on the left, its HTML source on the right):
 

 

 

    
     When you enter a URL and request a page, what actually comes back is the HTML, that is, the raw thing on the right side, which the browser then cooks into the more harmonious thing on the left: it parses the format, matches each tag to its content, and places everything in its spot on the page. Compare the text on the right with the text on the left and you can see the correspondence. As the figure above suggests, as long as we parse out the <p> </p> tags we can get the text we want. HTML parsing tools are everywhere; just point one at the tags containing the text you are interested in, and once you have the text you can save it to a file.
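The "find the tags we care about" step can be sketched with the standard-library HTML parser: a small class that collects the text inside <p> tags. The sample HTML string is made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text that appears inside <p>...</p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True   # we just entered a paragraph tag

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False  # we just left it

    def handle_data(self, data):
        # Only keep text that sits inside a <p> tag.
        if self.in_p and data.strip():
            self.texts.append(data.strip())

page = "<html><body><div><p>first paragraph</p><p>second one</p></div></body></html>"
extractor = TextExtractor()
extractor.feed(page)
print(extractor.texts)  # -> ['first paragraph', 'second one']
```

The same pattern works for any tag: swap "p" for whatever label wraps the content you care about, then write `extractor.texts` to a file.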
     Not bad. One question remains, and though it looks small, it is the most critical. When the boss hands you URLs, you write down however many he gives you. But in reality there is no boss conveniently supplying addresses; the real-world task is "find the websites relevant to us and bring me their text." Fine, you're the king, whatever you say. But with no URLs given, how do I get the pages I want?
     When browsing, we click a title or a button on a page, and it automatically jumps to the next page.
For example, the figure below shows a movie page (text, category, title):
       
        Look closely and there is a tag carrying an attribute, href="....", whose shape tells us it is a URL. A jump (a hyperlink) is really just a new request sent to the server using this new URL, after which the new page is shown to you. It is exactly the same as typing the URL into the browser and pressing Enter, except this time you clicked (yes, I'm rambling). So let's imagine boldly: all the pages on the network are linked together this way. If you build your own site and want people to find it quickly, you usually place a link on a high-traffic website; visitors follow that link to your site and read your content, and your pages in turn contain links to other people's sites. As long as you can parse a URL out of one page, you can send a request and find the next page, and so on without end. Doing this is simple with an HTML parsing tool; this time, the tag we are interested in is the second kind:
 <a href="url">
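Collecting these hrefs is the same parsing exercise as before, this time watching <a> tags. Again the sample HTML is illustrative, and only the standard library is used.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; keep every non-empty href.
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<p>See <a href="http://example.com/a.html">A</a> and <a href="/b.html">B</a></p>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # -> ['http://example.com/a.html', '/b.html']
```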
         So let's rewrite the program:
         void main()
         {
                   Vector<string> urlArray;  // a Vector here: an array of unlimited length, so we
                                             // don't have to fix its size up front (not literally
                                             // unlimited, but that's for another day). It stores
                                             // our URLs; seed it with at least one before starting.
                   while (urlArray.count != 0)  // loop while the URL list is not empty; empty means done
                   {
                             string htmlPage = Request(urlArray[0]);  // request the page file (index.html,
                                                                      // xxx.html, ...), obtained as a string
                             string text = ParseHtmlText(htmlPage);   // parse htmlPage and get the text
                             string url  = ParseHtmlUrl(htmlPage);    // parse htmlPage and get the URL --
                                                                      // really a whole batch of URLs per
                                                                      // page; one is written for clarity
                             SaveFile(text);                          // save the text we just got to a local file
                             urlArray.add(url);                       // add the new URL to urlArray for later use
                             urlArray.remove(0);                      // remove the used URL, i.e. urlArray[0]
                    }
         }
  
     And with that, a breadth-first web crawler is done. Of course you'll want to roast it; there are too many weak points to count. Forgive my laziness: this is a 100% genuine web crawler. To walk through the process: at the top of each loop we check whether the URL list is empty; if not, we take the first URL, request it to obtain the page htmlPage, parse the page to get two things, the text and the URLs, save the text to a file, and append the new URLs to the list as pages waiting to be visited. The process is: use a URL to get a page, use the page to get more URLs. Which brings us back to that philosophical question: the chicken and the egg.
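The pseudocode above can be made runnable in a few lines of Python. To keep the sketch self-contained and offline, a dict stands in for the web and regexes stand in for a proper HTML parser; swap fetch() for a real HTTP request to crawl the actual network. All page names and helper names here are illustrative, not from the article.

```python
import re
from collections import deque

# A tiny fake "web": three pages that link to each other.
FAKE_WEB = {
    "a.html": '<p>page A</p><a href="b.html">B</a><a href="c.html">C</a>',
    "b.html": '<p>page B</p><a href="c.html">C</a>',
    "c.html": '<p>page C</p>',
}

def fetch(url):                 # Request(url): get the page as a string
    return FAKE_WEB.get(url, "")

def parse_text(html):           # ParseHtmlText: grab the <p> contents
    return re.findall(r"<p>(.*?)</p>", html)

def parse_urls(html):           # ParseHtmlUrl: grab every href
    return re.findall(r'href="(.*?)"', html)

def crawl(seed):
    queue = deque([seed])       # urlArray: URLs waiting to be visited
    seen = {seed}               # remember visited URLs so links back don't loop forever
    collected = []
    while queue:                # loop until no URLs are left
        url = queue.popleft()   # take urlArray[0]...
        html = fetch(url)
        collected.extend(parse_text(html))  # save the text
        for link in parse_urls(html):       # queue the newly found URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

print(crawl("a.html"))  # -> ['page A', 'page B', 'page C']
```

Note the `seen` set: the pseudocode quietly assumes pages never link back to each other, but on the real web they always do, so remembering visited URLs is the first fix any real crawler needs.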
 
    Which came first, the chicken or the egg? No matter who was first, our job is to keep the cycle going: let chickens lay eggs and eggs hatch chickens, descendants without end. So we supply the egg (a seed URL), and the gears of the world begin to turn. The principle of a crawler really is this simple, and you can implement it in any object-oriented language (C++, Java, C#, Python, Go, PHP). But although the story ends here, the work is far from over: once we programmers actually run this program, we will meet all kinds of problems, because the network is a dangerous place. We'll leave those for next time. A sloppy program like this surviving half an hour on today's network would already be a miracle, which is why there are so many kinds of crawlers and an endless variety of frameworks, formats, and development components; yet at the core it is still this same loop. Understand thoroughly what a crawler is first, and you can design your own far more flexibly. As for the tasks our programmer still can't finish, that's a story for the next installment.

 


Origin www.cnblogs.com/1208xu/p/11740340.html