Experiment: Using Unity to grab all the images from a web page at a specified url and download and save them

A whim: I sometimes find it tedious to save resources from a web page one by one. Is there a way to batch-fetch the resources on a page just by entering its url?

Things to think about:

1. How do we get the html source of the page at a given url?

2. How do we match the resource addresses we need out of the vast sea of html?

3. How do we batch-download the resources from the collected addresses?

4. The downloaded resources are generally file streams; how do we turn them into files of the specified type and save them?

 

Required knowledge:

1. Web crawler basics and how Http requests are sent

2. Regular expressions in C#, mainly to pick out the needed url addresses in the html

3. File stream downloads with the UnityWebRequest class

4. Basic file operations with the C# File class, Stream class, and so on

 

Now to implement this step by step:

I won't give an introduction to web crawlers here; there is plenty of other material online. In a nutshell, a crawler collects the information and data of web pages.

The first step is to send a Web request, which is to say an Http request.

This is basically similar to what happens when you open a browser, type in a url address, and press Enter: a page can display the correct information and data because every page has corresponding html source code. Many browsers, for example Google Chrome, support viewing a page's source. As an example, here is the <head> section of the html of the meow nest home page I often visit:

(screenshot of the <head> section of the page source)

The html source reveals a lot of hidden information and data about the current page, such as many links, style sheets, and other resources. Note that the html source can only be displayed and viewed after the whole page has finished loading, which means the Web request to the url address received a successful response. Besides success there are of course all kinds of failure cases; for example, we often get a 404 prompt after entering a url, which is an Http request error: 404 means the server could not find the requested page. There are many other error types as well. It is worth understanding this, because we then need a way to handle errors when sending Http requests, or to skip to the next task.

There are many ways to send an Http request, and Unity has also updated its way of making Web requests. (I'll just post screenshots of the code below; this blog editor can't insert code blocks, and the ragged lines really look uncomfortable.)
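In rough outline, the request can be sent in a coroutine like this (a minimal sketch, assuming a UnityAction<string> callback that receives the html source; the method name SendHttpRequest is illustrative):

IEnumerator SendHttpRequest(string url, UnityAction<string> onSuccess)
{
    using (UnityWebRequest request = UnityWebRequest.Get(url))
    {
        // wait until the server responds or the request fails
        yield return request.SendWebRequest();

        if (request.isNetworkError || request.isHttpError)
        {
            // e.g. a 404 error: log it so we can skip to the next task
            Debug.LogError(url + " -> " + request.error);
        }
        else if (onSuccess != null)
        {
            // hand the html source back through the callback
            onSuccess(request.downloadHandler.text);
        }
    }
}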


The main class used is UnityWebRequest, which is somewhat similar to Unity's earlier WWW class; it is mainly used to download and upload files.

The following namespaces need to be imported:
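UnityWebRequest lives in UnityEngine.Networking, and UnityAction in UnityEngine.Events:

using System.Collections;          // IEnumerator, for coroutines
using UnityEngine;
using UnityEngine.Networking;      // UnityWebRequest
using UnityEngine.Events;          // UnityAction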


UnityAction is mainly used here as a parameter so the html source can be returned automatically through a callback once the request completes. It is essentially a generic delegate:
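Its declarations in UnityEngine.Events look like this:

public delegate void UnityAction();
public delegate void UnityAction<T0>(T0 arg0);
public delegate void UnityAction<T0, T1>(T0 arg0, T1 arg1);
public delegate void UnityAction<T0, T1, T2>(T0 arg0, T1 arg1, T2 arg2);
public delegate void UnityAction<T0, T1, T2, T3>(T0 arg0, T1 arg1, T2 arg2, T3 arg3);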


It has overloads from no generic parameter up to several, and is a very useful type (in a coroutine it makes it especially easy to pass parameters back through a delayed callback).

Of course, besides Unity's built-in way of sending Web requests, C# itself also encapsulates several classes, and you can pick whichever you like, e.g. HttpWebRequest, WebClient, HttpClient, etc.:

Like this:
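For instance, a minimal sketch with WebClient (synchronous: it blocks until the html source comes back):

using System.Net;
using System.Text;

string GetHtmlSource(string url)
{
    using (WebClient client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;   // make sure the html decodes correctly
        return client.DownloadString(url); // returns the page's html source
    }
}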


If the Web request succeeds and returns the html source of the specified url address, we can move on to the next step.

The second step is to collect the data we need from the html; in this case, that means finding the link addresses of the images in the source code.

For example, the following situations may appear:
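Illustrative examples (not taken from a real page): an external link carrying a full url, an internal link with only a relative path, and an image url sitting outside any <img> tag:

<img src="https://example.com/images/001.jpg" />
<img src="/upload/images/002.png" alt="example">
... style="background-image: url(/skin/images/bg.jpg)" ...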


To summarize: matching the commonly used html tag <img> first will find most of the images, but some images are not inside these tags. And even for image addresses that are inside an <img> tag, there is still the difference between external links and internal links. An external link can be used directly as a legal url address, but an internal link has to be completed with the domain name first, so we also need a way to pick the correct domain name out of a url.

The most effective way to match the string contents mentioned above is regular expressions. The regular expressions this example needs are as follows:

1. Match the url domain address:

private const string URLRealmCheck = @"(http|https)://(www\.)?(\w+(\.)?)+";

2. Match a url address:

private const string URLStringCheck = @"((http|https)://)(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,4})*(/[a-zA-Z0-9\&%_\./-~-]*)?";

3. Match the url address inside an html <img> tag (case-insensitive; the group <imgUrl> captures the desired url address):

private const string imgLableCheck = @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>";

4. Match the url address of the href attribute inside an html <a> tag (case-insensitive, mainly used for the deep search; the group <url> captures the desired url address):

private const string hrefLinkCheck = @"(?i)<a\s[^>]*?href=(['""]?)(?!javascript|__doPostBack)(?<url>[^'""\s*#<>]+)[^>]*>";

5. Match the specified image types (mainly used for external links):

private const string jpg = @"\.jpg";
private const string png = @"\.png";

There are many tutorials online about how to use regular expression matching in detail, so I won't go into it here.

 

Given an html source, the images are matched from two directions. First match the external links, filtering for the specified file types here:
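A minimal sketch of this pass, assuming the page source sits in a string html and using the constants above (the list name imgLinks is illustrative):

List<string> imgLinks = new List<string>();

// match every full url in the source, then keep only the image types we want
foreach (Match match in Regex.Matches(html, URLStringCheck, RegexOptions.IgnoreCase))
{
    string url = match.Value;
    if (Regex.IsMatch(url, jpg, RegexOptions.IgnoreCase) ||
        Regex.IsMatch(url, png, RegexOptions.IgnoreCase))
    {
        if (!imgLinks.Contains(url))
            imgLinks.Add(url);
    }
}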

Next comes matching the internal links, for which we first have to match the domain address:
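For example, taken from the url address that was entered (a sketch; realmName is an illustrative name):

// URLRealmCheck pulls e.g. "https://www.example.com" out of the entered url
Match realmMatch = Regex.Match(url, URLRealmCheck, RegexOptions.IgnoreCase);
string realmName = realmMatch.Value;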


Once you have the domain address, the internal links can easily be matched and completed:
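A sketch using the <img> tag expression above (reusing realmName and imgLinks from the previous sketches):

foreach (Match match in Regex.Matches(html, imgLableCheck, RegexOptions.IgnoreCase))
{
    string src = match.Groups["imgUrl"].Value;

    // an internal link lacks the domain, so complete it first
    if (!src.StartsWith("http"))
        src = realmName + (src.StartsWith("/") ? src : "/" + src);

    if ((Regex.IsMatch(src, jpg, RegexOptions.IgnoreCase) ||
         Regex.IsMatch(src, png, RegexOptions.IgnoreCase)) &&
        !imgLinks.Contains(src))
    {
        imgLinks.Add(src);
    }
}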


To use regular expressions, the following namespaces need to be imported:
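At minimum:

using System.Text.RegularExpressions;  // Regex, Match, MatchCollection
using System.Collections.Generic;      // List<string> for the collected links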


After matching with all the regular expressions, every image link ends up in imgLinks, and they can then be downloaded in turn.

The third step is to send download requests for the valid image urls:
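One way is to run the downloads one after another in a coroutine, so the next task starts only after the previous one completes (a sketch; DownloadImage is given below):

IEnumerator DownloadAllImages(List<string> imgLinks, string saveDir)
{
    foreach (string url in imgLinks)
    {
        yield return StartCoroutine(DownloadImage(url, saveDir));
    }
    Debug.Log("all download tasks finished: " + imgLinks.Count);
}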


You could also send all these url download requests at the same time, but that may require controlling the maximum number of threads on top, and it also makes it harder to track the overall download progress.

The concrete download coroutine is as follows:
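A minimal sketch (SaveImage is the saving method from the final step; Complete stands for whatever per-file bookkeeping is needed, e.g. counting progress; both names are illustrative):

IEnumerator DownloadImage(string url, string saveDir)
{
    using (UnityWebRequest request = UnityWebRequest.Get(url))
    {
        yield return request.SendWebRequest();

        if (request.isNetworkError || request.isHttpError)
        {
            // skip this file and let the loop continue with the next one
            Debug.LogError("download failed: " + url + " -> " + request.error);
        }
        else
        {
            // the raw file stream of the image
            SaveImage(url, saveDir, request.downloadHandler.data);
        }

        Complete(); // called on success AND on error, see the note below
    }
}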


It is worth noting that the Complete method is not called only after a successful download: even if an error occurs, it still needs to be called. This avoids the situation where one error immediately halts the whole automatic download; under normal circumstances, even if an error occurs, we just skip to the next download task.

 

The final step is to turn the downloaded data stream into a file of the specified type and save it. There are many methods for this; one of them is given below:
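For example with File.WriteAllBytes from System.IO (a sketch; it assumes the url ends with the file name):

using System.IO;

void SaveImage(string url, string saveDir, byte[] data)
{
    if (!Directory.Exists(saveDir))
        Directory.CreateDirectory(saveDir);

    // assumes the url ends with the file name, e.g. .../pic/001.jpg
    string path = Path.Combine(saveDir, Path.GetFileName(url));
    File.WriteAllBytes(path, data);  // writes the stream out as a .jpg/.png file
}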


Extended:

Sometimes the image links in a single html cannot fully meet our needs, because the sub-links in that html may also lead to pages with the resource url addresses we want. In that case we can consider adding a deeper traversal: first match the link addresses in the html, then fetch the html source behind each sub-link, and match again, looping level by level.

The sub-links can be matched by searching the href attribute of the html <a> tags; the regular expression for that attribute was already given above. The one-level-deep matching here is only for reference:
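A sketch of one level of traversal (CollectImgLinks stands for the matching logic of the second step; it and the other names are illustrative):

IEnumerator DeepSearch(string html, string realmName)
{
    foreach (Match match in Regex.Matches(html, hrefLinkCheck, RegexOptions.IgnoreCase))
    {
        string subUrl = match.Groups["url"].Value;

        // complete internal links with the domain, as before
        if (!subUrl.StartsWith("http"))
            subUrl = realmName + (subUrl.StartsWith("/") ? subUrl : "/" + subUrl);

        // request the sub-page and run the image matching on its source too
        yield return StartCoroutine(SendHttpRequest(subUrl,
            subHtml => CollectImgLinks(subHtml, realmName)));
    }
}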


Test: here we crawl the meow nest home page with deep matching for image links in jpg format, download them, and save them to the D drive. (The UI was thrown together casually, never mind it.)

(screenshots of the test run and the downloaded images)


Origin www.cnblogs.com/koshio0219/p/11851507.html