A Student's Community Share on Web Crawling with R

Hello everyone. I'm honored, and a bit embarrassed, to give this share on web crawling with R, because I'm a programming rookie and many students in the community have far richer experience than I do. This share is very entry-level and is aimed at students who have never touched crawling but have some programming basics. It mainly covers the following areas: background, crawling methods, data processing and storage, and some experience and lessons I've picked up since I started learning programming.

 

Background One: What is a web crawler?

Simply put, it means writing a program that disguises itself as a browser, repeatedly visits a target site, and downloads the information on it in batches.

 

This figure shows information collected by a crawler from RUC Square News, an official WeChat public account run by the journalism school of Renmin University of China; after processing and analysis, this data carries much greater value. Crawling jobs range from easy to hard. Some sites provide a dedicated API interface that directly hands you the information you need, such as Qingbo Big Data in the figure; others do not, such as the Lianjia (Homelink) site in the figure, where you have to spend real effort to find the web page that actually contains the information you need. Below I will explain how to handle both cases.

Let's first deal with the harder case, where no API interface is provided. Taking Lianjia as the example, our goal is to crawl down the latitude and longitude of each residential community. This is the basic first step: visiting Lianjia's community listing page.

 

Before writing the actual crawler, we also need to understand some basics about web pages. Fortunately, a simple crawler does not require mastering complex web knowledge. A web page is made up of information; we just need to know where to find the information we want and what format that information is in.

On the first question, where is the information we want, take the Lianjia site as an example:

 

I zoomed in and marked five important positions:

 

Across the top of the developer tools are several tabs. The one marked 1 is the Network tab, which is the most important for a crawler; the others, such as Elements and Console, we do not need to look at.

The area marked 2 filters requests by data format. In this picture I selected All, meaning every file format is shown; on the sites I crawled, XHR, JS, and Doc files were the most common.

The area marked 3 is the timeline: every click you make shows up on it. For example, Lianjia's map has a zoom function, and every zoom brings up new community information. When the information you need only appears at certain moments, the timeline comes in handy: you can select just the time range corresponding to the zoom, as in my figure where only a small slice in the middle is selected, and the requests from that period will then appear in the file list marked 4.

That brings us to area 4. You can see it contains a lot of files with very long names, some beginning with qt and some beginning with callback; the file holding the community information we need is among those beginning with callback.

Area 5 displays the information contained in the selected file. In the picture the Preview tab is chosen, which, as the name suggests, lets us preview the file's contents.

So far we have answered the first question: where the information is.

The second question is what format the information is in.

Why is the format of the information so important? Because the format determines how we process it. There are two very important formats: XML and JSON. In the two figures below, the left one is XML and the right one is JSON. XML has a nested, staircase-like appearance and wraps content in angle-bracket tags, while JSON is based on curly braces.
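To make the difference concrete, here is a small illustration in R with made-up values (the record and its field names are mine, not from any real site): the same record written as JSON and as XML, and how each one is parsed.

library(rjson)
library(XML)

# The same toy record in both formats (made-up example values)
json_str <- '{"name": "Some Community", "lon": 114.34, "lat": 30.55}'
xml_str  <- "<community><name>Some Community</name><lon>114.34</lon><lat>30.55</lat></community>"

# JSON: curly braces; parses straight into an R list with numeric values
rec_json <- fromJSON(json_str)
rec_json$lon        # 114.34

# XML: nested angle-bracket tags; parse into a tree, then flatten to a list
rec_xml <- xmlToList(xmlParse(xml_str, asText = TRUE))
rec_xml$lon         # "114.34" -- note XML values come back as character strings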

 

With the background covered, let's start crawling. Some people crawl with Python and some with R. Crawling in R requires loading packages; the commonly used ones are RCurl, httr, rvest, rjson, and so on. On the Python side, BeautifulSoup is the one I know of that is more famous. The two languages differ in syntax and functions, but the overall framework of a crawler is the same; I will just use R and the Lianjia (Homelink) website as the example. Our goal is to crawl the latitude and longitude of residential communities in Wuhan. The first step of the crawler is to disguise itself as a browser. When we visit the Lianjia site by hand, we first enter the URL, then click "find homes on the map", then Wuchang District, and then under the Shuiguohu section we can see all the communities in that area. With a program, you can hit the community data interface directly, but only if you find the URL of that interface, and it is very likely not the same as the URL in your browser's address bar. Earlier we found the file that holds the community information, so the key is to find that file's URL. In the next picture on the slide, the Network panel has been switched to the Headers tab, and the Headers tab shows the URL of this callback file.

 

The following is the part of the code that disguises the crawler as a browser.

The getURL function requests information from a website. Here it uses the URL and the request headers we found above to fetch the callback file containing the community information and assigns the result to the variable webpage.
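Since the slide only showed this code as an image, here is a minimal sketch of what that step typically looks like with RCurl; the URL and header values below are placeholders, not the ones from the original share.

library(RCurl)

# Placeholder: in practice this is the callback file's URL found under the Headers tab
url <- "https://map.example.com/community/callback"

# Request headers that make the program look like an ordinary browser
headers <- c(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Referer"    = "https://wh.lianjia.com/"
)

# getURL() requests the page with these headers and returns the response text,
# which we store in the variable webpage
webpage <- getURL(url, httpheader = headers, .encoding = "UTF-8")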

 

This is what webpage contains: the information inside is very messy. The community latitude and longitude we want are buried in it, and we want to organize them into a clean form, like the two-dimensional table shown here. How do we do that?

This is where text processing makes its debut!

For the text processing we need to understand regular expressions and some basic text-processing functions. For learning text processing in R, I strongly recommend the article at this site, which explains it very clearly. I have only listed a few regular expressions and functions; if you are interested and want to know more, you can read that article. The steps for cleaning up the information in webpage are listed below...
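The original share listed these steps in screenshots; here is a rough sketch of the kind of processing involved, assuming the callback file is JSONP-wrapped JSON and using hypothetical field names (data, list, name, longitude, latitude) rather than Lianjia's real ones.

library(rjson)

# webpage holds raw callback text such as "callback({...})";
# strip the JSONP wrapper with regular expressions so only the JSON body remains
json_text <- gsub("^[^(]*\\(", "", webpage)   # drop the leading "callback("
json_text <- gsub("\\)\\s*$", "", json_text)  # drop the trailing ")"

# parse the JSON into an R list
parsed <- fromJSON(json_text)

# assemble a two-dimensional table of community name, longitude, and latitude
# (the field names here are assumptions for illustration)
communities <- parsed$data$list
result <- data.frame(
  name = sapply(communities, function(x) x$name),
  lon  = sapply(communities, function(x) x$longitude),
  lat  = sapply(communities, function(x) x$latitude),
  stringsAsFactors = FALSE
)
head(result)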


So far we have crawled down part of the community information for the Shuiguohu section of Wuchang District, and that is all I will explain for the Lianjia example for now. We have only dealt with fetching JSON-format information through GET requests; the POST method and XML-format information have not been covered. I will just put that code here, and interested students can look into it themselves.
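As for storage, which the outline at the start mentions, one simple option is to write the assembled table (result above) to a CSV file, for example:

# save the crawled community table for later analysis
write.csv(result, "shuiguohu_communities.csv", row.names = FALSE, fileEncoding = "UTF-8")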

Now look at the API example. Many individuals, government agencies, and others have built API data interfaces for COVID-19. For example, the governments of Shandong and Guizhou Provinces have open data platforms, and Qingbo Big Data, mentioned earlier, requires registering an account before you can apply for access, which is more of a hassle, so here I take an interface built by an individual as the example. Visiting this site, you can see it is explicitly marked as a GET request; when we crawled Lianjia just now, that was also a GET request for data.
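For an open GET interface like this, the request is simpler because no disguise is usually needed; here is a small sketch with httr and a placeholder URL, not the actual interface used in the share.

library(httr)
library(rjson)

# placeholder endpoint standing in for the personal COVID-19 API
api_url <- "https://example.com/covid19/api"

# a plain GET request -- an open interface usually needs no special headers
resp <- GET(api_url)

# take the response body as text and parse the JSON into an R list
body_text <- content(resp, as = "text", encoding = "UTF-8")
covid_data <- fromJSON(body_text)

str(covid_data)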

 

 

I will not go into detail about POST requests or the XML format here.
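Since the slides only include that code without explanation, here is a rough illustration of what a POST request returning XML could look like in R; the URL and form fields are placeholders, not the code from the slides.

library(httr)
library(XML)

# placeholder endpoint and form fields for a POST request
resp <- POST(
  "https://example.com/data/endpoint",
  body = list(city = "wuhan", page = 1),
  encode = "form"
)

# parse the XML body into an R tree and then into a list
xml_text <- content(resp, as = "text", encoding = "UTF-8")
doc <- xmlParse(xml_text, asText = TRUE)
records <- xmlToList(doc)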


Origin www.cnblogs.com/yuxuan320/p/12545466.html