Running concurrent file downloads?

I'm writing a script that downloads a bunch of binary data files from AWS so I can start investigating the data in an IPython notebook. I'm trying to use concurrent.futures.ThreadPoolExecutor to download the files concurrently, because there are ~200 of them and I want it to go fast. I've tested the non-concurrent parts of the code and they work fine.

import requests
import os
from concurrent import futures

def download_bin(link):
    print('Beginning download of {}'.format(link))
    base, fname = os.path.split(link)
    new_fn = './bins/' + fname
    r = requests.get(link)
    with open(new_fn, 'wb+') as fh:
        fh.write(r.content)
    print('Done downloading {}'.format(link))

def download_files(links):
    print('\nDownloading binary data...')
    from pprint import pprint
    pprint(links)
    workers = min(20, len(links))
    with futures.ThreadPoolExecutor(workers) as executor:
        executor.map(download_bin, links)

if __name__ == "__main__":
    links = [...]
    download_files(links)
    print('done..?')

The program calls download_files. I see it print "Downloading binary data...", then I see it pprint all the links, but it never prints "Beginning download of X". Instead, the program runs to the end with no errors, as if the with block didn't exist. What's going on here?

edit1: added __name__ == "__main__" section to show how the code is run

edit2: added relevant imports

Hi,

if you can use Python 3.5+ I would recommend using aiohttp instead of requests. It's blazingly fast and together with the new async/await syntax the code is actually readable/understandable. Here is a nice blog article about this: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

Cheers, Oliver

The problem is most likely that I was running it under Python 2.7 instead of 3.4. But it seemed odd that it wasn't giving me any errors.

That does sound odd. What happens if you do

print(executor.map(download_bin, links))

...? I'm wondering if the map is returning some kind of iterator that only triggers execution of the code when it's examined.

<generator object map at 0x7fa7d58a7370>

¯\_(ツ)_/¯

still no code execution

No, that's not very helpful, is it ;-)

How about explicitly iterating over it, e.g.

[_ for _ in executor.map(download_bin, links)]

That seemed to do it. Maybe in the 2.7 implementation you have to call it explicitly like that. Thanks for the help!

Excellent! Glad we could work it out.
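
For anyone landing here later, here is a minimal sketch of the same fix using list(), which drains the iterator in one call. The names fake_download and run_all are made up for illustration, and fake_download is only a stand-in for download_bin so the snippet runs without network access. As far as I understand it, in the setup from this thread nothing starts until the result of map is iterated, and even on Python 3 an exception raised in a worker is only re-raised when its result is pulled, so consuming the results seems worth doing either way.

```python
from concurrent import futures


def fake_download(link):
    # Hypothetical stand-in for download_bin: no network access,
    # it just reports the file name it would have saved.
    return 'saved ' + link.rsplit('/', 1)[-1]


def run_all(links, max_workers=20):
    # Keep at least one worker so an empty list of links does not raise.
    workers = max(1, min(max_workers, len(links)))
    with futures.ThreadPoolExecutor(workers) as executor:
        # list() consumes the iterator returned by map, so every task has
        # finished by the time it returns, and an exception raised in a
        # worker is re-raised here instead of being silently dropped.
        return list(executor.map(fake_download, links))


results = run_all(['https://example.com/a.bin', 'https://example.com/b.bin'])
print(results)
```

Swapping fake_download for the real download_bin should behave the same way, assuming the ./bins/ directory already exists.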