Use Python to Download Multiple Files (or URLs) in Parallel

Get more data in less time

Photo by Wesley Tingey on Unsplash

We live in a world of big data. Often, big data is organized as a large collection of small datasets (i.e., one large dataset composed of multiple files). Obtaining these data is often frustrating because of the download (or acquisition) burden. Fortunately, with a little code, there are ways to automate and speed up file download and acquisition.

Automating file downloads can save plenty of time. There are several ways to automate file downloads with Python. The simplest option is a plain Python loop that iterates through a list of URLs and downloads each one. This serial approach can work well for a few small files, but if you are downloading many files or large files, you'll want a parallel approach to make the most of your computational resources.

With a parallel file download routine, you can make better use of your computer's resources to download multiple files simultaneously, saving time. This tutorial demonstrates how to develop a generic file download function in Python and apply it to download multiple files with serial and parallel approaches. The code in this tutorial uses the requests package and modules from the Python standard library, so the only installation you should need is requests.

Import modules

For this example, we only need the requests and multiprocessing modules to download files in parallel. The multiprocessing module is part of the Python standard library; requests is a third-party package that can be installed with pip install requests.

We'll also import the time module to keep track of how long it takes to download individual files and to compare performance between the serial and parallel download routines. The time module is also part of the Python standard library.

import requests
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

Define URLs and filenames

I'll demonstrate parallel file downloads in Python using gridMET NetCDF files that contain daily precipitation data for the United States.

Here, I specify the URLs to four files in a list. In other applications, you might programmatically generate the list of files to download (a sketch of that approach follows the list below).

urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1982.nc']
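
The gridMET precipitation files follow a simple pr_<year>.nc naming pattern, so a list like the one above could also be generated programmatically. This is a minimal sketch, assuming that pattern holds for every year you request:

base_url = 'https://www.northwestknowledge.net/metdata/data'

# Build one URL per year of interest; range(1979, 1983) covers 1979-1982.
urls = [f'{base_url}/pr_{year}.nc' for year in range(1979, 1983)]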

Each URL must be paired with its download location. Here, I'm downloading the files to the Windows 'Downloads' directory. I've hardcoded the filenames in a list for simplicity and transparency. Depending on your application, you may instead want to write code that parses the input URL and builds the download path for you (a sketch follows the list below).

fns = [r'C:\Users\konrad\Downloads\pr_1979.nc',
       r'C:\Users\konrad\Downloads\pr_1980.nc',
       r'C:\Users\konrad\Downloads\pr_1981.nc',
       r'C:\Users\konrad\Downloads\pr_1982.nc']
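
Rather than hardcoding each local path, you could also derive the filename from the URL itself. This is a sketch, not part of the original code; the download directory here is an assumption you would replace with your own:

from pathlib import Path

# Assumed download directory -- swap in whatever location you prefer.
download_dir = Path.home() / 'Downloads'

# Reuse the last component of each URL (e.g. 'pr_1979.nc') as the local filename.
fns = [str(download_dir / url.split('/')[-1]) for url in urls]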

Multiprocessing requires parallel functions to have only one argument (there are some workarounds, but we won't get into that here). To download a file we need to pass two arguments, a URL and a filename, so we'll zip the urls and fns lists together to get a list of tuples. Each tuple in the list will contain two elements: a URL and the download filename for that URL. This way we can pass a single argument (the tuple) that contains two pieces of information.

inputs = zip(urls, fns)
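
One caveat worth noting (my addition, not from the original article): in Python 3, zip returns a one-shot iterator, so inputs can only be looped over once. If you intend to run both the serial and the parallel examples below on the same inputs object, materialize it as a list first:

# zip() yields its pairs only once; a list can be reused for both
# the serial loop and the parallel routine.
inputs = list(zip(urls, fns))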

Function to download a URL

Now that we have specified the URLs to download and their associated filenames, we need a function to download the URLs (download_url).

We'll pass one argument (args) to download_url. This argument will be an iterable (list or tuple) where the first element is the URL to download (url) and the second element is the filename (fn). The elements are assigned to variables (url and fn) for readability.

Now create a try statement in which the URL is retrieved and written to the file after the file is created. When the file is written, the URL and download time are returned. If an exception occurs, a message is printed.

The download_url function is the meat of our code. It does the actual work of downloading and file creation. We can now use this function to download files in serial (using a loop) and in parallel. Let's go through those examples.

def download_url(args):
    # args is a (url, fn) tuple: the URL to download and the local filename.
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url():', e)
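
Note that download_url reads each entire response into memory (r.content) before writing it to disk, which is fine for modestly sized files. For very large files you might prefer a streaming variant; the following is a sketch under that assumption, not part of the original routine:

def download_url_streaming(args):
    # Same (url, fn) interface as download_url, but writes the response
    # in chunks instead of holding the whole file in memory.
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(fn, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url_streaming():', e)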

Download multiple files with a Python loop

To download the list of URLs to the associated files, loop through the iterable (inputs) that we created, passing each element to download_url. After each download is complete, we will print the downloaded URL and the time it took to download it.

The total time to download all URLs will print after all downloads have completed.

t0 = time.time()
for i in inputs:
    result = download_url(i)
    print('url:', result[0], 'time:', result[1])
print('Total time:', time.time() - t0)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time: 16.381176710128784 
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time: 11.475878953933716
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time: 13.059367179870605
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time: 12.232381582260132
Total time: 53.15849542617798

It took between 11 and 16 seconds to download the individual files. The total download time was a little less than one minute. Your download times will vary based on your specific network connection.

Let’s compare this serial (loop) approach to the parallel approach below.

Download multiple files in parallel with Python

To start, create a function (download_parallel) to handle the parallel download. The function (download_parallel) takes one argument: an iterable containing the URLs and associated filenames (the inputs variable we created earlier).

Next, get the number of CPUs available for processing. This will determine the number of threads to run in parallel.

Now use the multiprocessing ThreadPool to map the inputs to download_url. Here we use the imap_unordered method of ThreadPool and pass it the download_url function along with the input arguments to download_url (the inputs variable). The imap_unordered method will run download_url concurrently on the specified number of threads (i.e., parallel download).

Thus, if we have four files and four threads, all files can be downloaded at the same time instead of waiting for one download to finish before the next starts. This can save a considerable amount of processing time.

In the final part of the download_parallel function, the downloaded URLs and the time required to download each URL are printed.

def download_parallel(args):
    t0 = time.time()
    # Use one thread per CPU, minus one, to leave the machine responsive.
    cpus = cpu_count()
    results = ThreadPool(cpus - 1).imap_unordered(download_url, args)
    for result in results:
        print('url:', result[0], 'time (s):', result[1])
    # Report the total elapsed time (the 'Total time' line in the output below).
    print('Total time:', time.time() - t0)

Once inputs and download_parallel are defined, the files can be downloaded in parallel with a single line of code.

download_parallel(inputs)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time (s): 14.641696214675903 
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time (s): 14.789752960205078
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time (s): 15.052601337432861
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time (s): 23.287317752838135
Total time: 23.32273244857788

Notice that it took longer to download each individual file with the parallel approach. This may be a result of varying network speed, or of the overhead required to map the downloads to their respective threads. Although the individual files took longer to download, the parallel method resulted in more than a 50% decrease in total download time.

You can see how parallel processing can greatly reduce processing time for multiple files. As the number of files increases, you will save much more time by using a parallel download approach.
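
As an aside (not part of the original tutorial), the standard library's concurrent.futures module offers a higher-level interface that achieves the same thread-based parallelism. Here is a sketch that reuses the download_url function and the inputs pairs defined above:

from concurrent.futures import ThreadPoolExecutor, as_completed

def download_parallel_futures(args, max_workers=4):
    # Submit every (url, fn) pair to a thread pool and report each result
    # as its download completes (similar to imap_unordered above).
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(download_url, pair) for pair in args]
        for future in as_completed(futures):
            result = future.result()
            print('url:', result[0], 'time (s):', result[1])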

Conclusion

Automating file downloads in your development and analysis routines can save you plenty of time. As this tutorial demonstrates, implementing a parallel download routine can greatly decrease file acquisition time if you need many files or large files.
