11 ways to download files with Python, each more advanced than the last

Overview

Today, let's learn how to download files from the web using different Python modules. You'll download regular files, web pages, files stored on Amazon S3, and more.

Finally, you'll learn how to overcome various challenges you may encounter, such as downloading redirected files, downloading large files, performing multi-threaded downloads, and more.


1. Use requests

You can download files from a URL using the requests module.

Consider the following code:

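Here's a minimal sketch of the idea (the URL and output filename are placeholders):

    import requests

    url = 'https://example.com/file.pdf'  # placeholder URL

    myfile = requests.get(url)

    with open('file.pdf', 'wb') as f:
        f.write(myfile.content)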

You simply fetch the URL using the get method of the requests module and store the result in a variable called myfile. Then, you write the contents of this variable to a file.

2. Use wget


You can also download files from a URL using Python's wget module. Install it using pip as follows:
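    pip install wget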

Consider the following code, which we will use to download the logo image for Python.
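Something like this (the output filename is a placeholder):

    import wget

    url = 'https://www.python.org/static/img/python-logo.png'
    wget.download(url, 'python-logo.png')  # save the logo to the given path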

In this code, the URL and the path (where the image will be stored) are passed to the download method of the wget module.

3. Download the redirected file

In this section, you will learn how to use requests to download a file from a URL that redirects to another URL serving a .pdf file.


To download this PDF file, use the following code:

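A sketch of the approach; the URL below is a stand-in for the redirecting link used in the original post:

    import requests

    url = 'https://example.com/report'  # stand-in for a URL that redirects to a .pdf

    myfile = requests.get(url, allow_redirects=True)

    with open('report.pdf', 'wb') as f:
        f.write(myfile.content)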

In this code, our first step is to specify the URL. Then, we use the get method of the requests module to fetch the URL. In the get method, we set allow_redirects to True, which allows redirection, and the redirected content is assigned to the variable myfile.

Finally, we open a file to write the fetched content.

4. Download large files in chunks

Consider the following code:

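A minimal sketch (the URL is a placeholder):

    import requests

    url = 'https://example.com/PythonBook.pdf'  # placeholder URL

    r = requests.get(url, stream=True)

    with open('PythonBook.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):  # read 1024 bytes at a time
            if chunk:
                f.write(chunk)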

First, we use the get method of the requests module as before, but this time we set the stream parameter to True.

Next, we create a file called PythonBook.pdf in the current working directory and open it for writing.

We then specify the chunk size to download each time. We've set it to 1024 bytes; we iterate over each chunk and write the chunks to the file until the response is exhausted.

No visible feedback while it runs? Don't worry, we'll add a progress bar to the download process later.

5. Download multiple files (parallel/batch download)

To download multiple files at once, import the following modules:

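    import os
    import requests  # used by the download helper below
    from time import time
    from multiprocessing.pool import ThreadPool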

We import the os and time modules to check how long it takes to download the files. ThreadPool (from multiprocessing.pool) allows you to run multiple threads or processes using a pool.

Let's create a simple function that sends the response in chunks to a file:

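A sketch of such a helper; it takes a (path, URL) pair:

    def url_response(url):
        # Unpack the pair: where to save the file, and what to fetch
        path, url = url
        r = requests.get(url, stream=True)
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)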

The URLs are kept in a two-dimensional structure: a list of pairs, each specifying the path to save to and the URL of the page you want to download.

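For example (these python.org pages are just illustrative):

    urls = [
        ('Event1.html', 'https://www.python.org/events/python-events/'),
        ('Event2.html', 'https://www.python.org/downloads/'),
        ('Event3.html', 'https://www.python.org/doc/'),
    ]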

Just like we did in the previous section, we pass each URL to requests.get. Finally, we open the file (at the path specified in the pair) and write the page content to it.

Now, we can call this function for each URL individually, or we can call it for all URLs at the same time. Let's first call it for each URL separately in a for loop; note the timer:

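For example:

    start = time()
    for entry in urls:
        url_response(entry)
    print(f"Time to download: {time() - start}")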

 Now, replace the for loop with the following line of code:

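Keeping the same timer in place, the loop becomes a single ThreadPool call (the pool size of 9 threads is an arbitrary choice):

    start = time()
    ThreadPool(9).map(url_response, urls)  # download all URLs in parallel threads
    print(f"Time to download: {time() - start}")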

Run the script again and compare the two download times.

6. Use the progress bar to download

The progress bar is a UI component of the clint module. Enter the following command to install the clint module:

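    pip install clint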

 Consider the following code:

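A sketch that reuses the chunked download from section 4 (the URL is again a placeholder):

    import requests
    from clint.textui import progress

    url = 'https://example.com/PythonBook.pdf'  # placeholder URL
    r = requests.get(url, stream=True)

    with open('PythonBook.pdf', 'wb') as f:
        total_length = int(r.headers.get('content-length'))
        chunks = r.iter_content(chunk_size=1024)
        # progress.bar wraps the iterator and draws a bar as chunks arrive
        for chunk in progress.bar(chunks, expected_size=(total_length / 1024) + 1):
            if chunk:
                f.write(chunk)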

In this code, we first import the requests module and then the progress component from clint.textui. The only difference from section 4 is in the for loop: when writing the content to the file, we wrap the chunk iterator in the progress module's bar method.

7. Use urllib to download web pages

In this section, we will use urllib to download a web page.

The urllib library is part of Python's standard library, so you don't need to install it.

The following line of code can easily download a web page:

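    urllib.request.urlretrieve('url', 'path')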

Here you specify the URL of the file you want and the path where you want it stored.

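For example, to save the Python home page (the output filename is a placeholder):

    import urllib.request

    urllib.request.urlretrieve('https://www.python.org/', 'python.html')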

In this code, we use the urlretrieve method and pass it the URL of the file along with the path where it should be saved. Since we're saving a web page, the file extension will be .html.

8. Download via proxy

If you need to use a proxy to download your files, you can use the ProxyHandler of the urllib module. Please see the following code:

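A sketch; the proxy address is a placeholder for your own:

    from urllib import request

    proxy_handler = request.ProxyHandler({'http': '127.0.0.1:3128'})  # placeholder proxy
    opener = request.build_opener(proxy_handler)
    r = opener.open('http://www.python.org/')  # fetch the page through the proxy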

In this code, we create a proxy handler object, then build an opener by calling urllib's build_opener method and passing in that handler. Finally, we make a request through the opener to fetch the page.

In addition, you can also use the requests module as described in the official documentation:

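Again with a placeholder proxy address:

    import requests

    myproxy = {'http': 'http://127.0.0.1:3128'}  # placeholder proxy
    r = requests.get('http://www.python.org/', proxies=myproxy)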

You just need to import the requests module and create your proxy dictionary. Then, you can fetch the file.

9. Use urllib3

urllib3 is a third-party HTTP client that improves on the urllib module, with features such as connection pooling. You can download and install it using pip:

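    pip install urllib3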

 We will fetch a web page and store it in a text file by using urllib3.

Import the following modules:

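    import urllib3
    import shutil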

 When processing files, we use the shutil module.

Now, we initialize the URL string variable like this:

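    url = 'https://www.python.org/'  # placeholder page to fetch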

We then use urllib3's PoolManager, which keeps track of the necessary connection pools:
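    c = urllib3.PoolManager()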

Pick a file name to write to:

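    filename = 'test.txt'  # placeholder output file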

Finally, we send a GET request to fetch the URL, open the file, and write the response body into it:
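A sketch of that final step, assuming the c, url, and filename variables from above:

    # preload_content=False keeps the body as a stream instead of reading it all at once
    with c.request('GET', url, preload_content=False) as res, open(filename, 'wb') as out_file:
        shutil.copyfileobj(res, out_file)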

10. Download files from S3 using Boto3

To download files from Amazon S3, you can use the Python boto3 module.

Before starting, you need to install the awscli module using pip:

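    pip install awscli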

 For AWS configuration, run the following command:

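    aws configure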

 Now, enter your details as follows:

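    AWS Access Key ID [None]: <your access key>
    AWS Secret Access Key [None]: <your secret key>
    Default region name [None]: <your region>
    Default output format [None]: <optional, e.g. json>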

To download files from Amazon S3, you need to import boto3 and botocore. Boto3 is an Amazon SDK that allows Python to access Amazon web services (such as S3). Botocore is the lower-level library on which both boto3 and the AWS command-line tools are built.

Botocore comes with awscli. To install boto3, run the following command:

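    pip install boto3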

 Now, import these two modules:

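    import boto3
    import botocore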

 When downloading a file from Amazon, we need three parameters:

  • Bucket name 

  • The name of the file you need to download 

  • The local filename to save the file as once it's downloaded 

Now, we initialize a variable to use the session's resources. To do this, we'll call boto3's resource() method and pass in the service, which is s3:

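    s3 = boto3.resource('s3')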

Finally, download the file using the download_file method, passing in the bucket name, the object key, and the local filename:

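A sketch; the bucket name and object key are placeholders for your own:

    BUCKET_NAME = 'my-bucket'   # placeholder bucket name
    KEY = 'path/to/file.pdf'    # placeholder object key inside the bucket

    try:
        s3.Bucket(BUCKET_NAME).download_file(KEY, 'downloaded.pdf')
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404':
            print('The object does not exist.')
        else:
            raise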

11. Use asyncio

The asyncio module is built around an event loop: the loop waits for an event to occur and then reacts to it, typically by calling another function. This process is called event handling, and asyncio uses coroutines to do it.

To use asyncio event handling and coroutines, we will import the asyncio module:

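    import asyncio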

Now, define an asyncio coroutine like this:

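The general shape looks like this (some_awaitable is a stand-in for any awaitable operation):

    async def my_coroutine():
        # 'await' suspends the coroutine until the awaited operation finishes
        result = await some_awaitable()
        return result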

The keyword async indicates that this is a native asyncio coroutine. Inside the coroutine, the await keyword suspends execution until the awaited operation produces a value. We can also use the return keyword, just as in a normal function.

Now, let's use coroutines to write code that downloads files from a website:

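A sketch of the whole program. The URLs are placeholders, and the download itself uses a blocking urllib call to keep the focus on the coroutine machinery:

    import asyncio
    import urllib.request

    async def download_coroutine(url):
        r = urllib.request.urlopen(url)  # simple blocking download
        # Derive a local filename from the last path segment of the URL
        filename = url.rstrip('/').split('/')[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(r.read())
        return 'Successfully downloaded ' + filename

    async def main_func(urls):
        # Wrap each coroutine in a Task so the event loop can schedule them all
        tasks = [asyncio.ensure_future(download_coroutine(u)) for u in urls]
        done, pending = await asyncio.wait(tasks)  # wait until every download finishes
        for task in done:
            print(task.result())

    urls = ['https://www.python.org/',
            'https://www.python.org/downloads/']  # placeholder URLs

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main_func(urls))
    # On modern Python, asyncio.run(main_func(urls)) is the preferred entry point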

 In this code, we create an asynchronous coroutine that downloads our file and returns a message.

Then, in a second asynchronous coroutine, main_func, we build a queue of download tasks, one per URL. asyncio's wait function waits for all of these coroutines to complete.

Now, in order to start the coroutines, we get the event loop using asyncio's get_event_loop() method and finally, we run the coroutines to completion using the loop's run_until_complete() method.


That's all for today's sharing. If you found it useful, feel free to like, save, and share. Thank you!


Source: blog.csdn.net/Rocky006/article/details/132207529