Core image scraping functions for creating deep learning datasets

Search filtering

The scrape/search functions can use the following enums as filters for searches. Filtering is normally pretty good, so by default the results should be square photos, as that's what's requested from DDG. Sometimes results may not be quite what you asked for (e.g. you may get a bit of clipart, or something more or less square but not exactly square). No checks are performed on what comes back.

ImgSize[source]

Enum = [Cached, Small, Medium, Large, Wallpaper]

An enumeration.

Using Cached as the image size (the default) returns the image cached by DuckDuckGo/Bing. This is a very decent size for deep learning purposes and is much more reliable to download from (no 404s, no hot-linking bans etc). Using any other size will return the original images from the source websites.

ImgLayout[source]

Enum = [All, Square, Tall, Wide]

An enumeration.

This defaults to Square everywhere because that's what your DL models want.

ImgType[source]

Enum = [All, Photo, Clipart, Gif, Transparent]

An enumeration.

This defaults to Photo everywhere.

ImgColor[source]

Enum = [All, Color, Monochrome, Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, Teal, White]

An enumeration.

Probably unlikely to be of much use to you, but it's part of the API so I include it. You never know...

Scraping URLs

duckduckgo_scrape_urls[source]

duckduckgo_scrape_urls(keywords:str, max_results:int, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>)

Scrapes URLs from DuckDuckGo image search. Returns list of URLs.

At the time of writing, this function will return up to 477 urls for a single search.

links = duckduckgo_scrape_urls("happy clowns", max_results=3)
links
['https://tse1.mm.bing.net/th?id=OIP.LR-2HW7P9ENbMGJ7cZTVGwHaHL&pid=Api',
 'https://tse4.mm.bing.net/th?id=OIP.jgAbDJb9lY-p0Q83Q2xsCgHaI0&pid=Api',
 'https://tse4.mm.bing.net/th?id=OIP.4g2txn6PXyuTbEXcJPI2qQHaIE&pid=Api']
display_img(links[0])

This is the kind of size you can expect by default. As you can see, it should normally be sufficient for your needs.

Since the parameters you use are likely to be the same across every image search within your dataset, if you plan on overriding the defaults, you can pass your parameters in using a dictionary like this:

params = {
    "max_results": 3,
    "img_size":    ImgSize.Medium, 
    "img_type":    ImgType.Photo,
    "img_layout":  ImgLayout.All,
    "img_color":   ImgColor.Purple
}

links = duckduckgo_scrape_urls("puppies", **params)
links
['https://cdn3.volusion.com/9nxdj.fchy5/v/vspfiles/photos/WR-13710-2T.jpg?1528880561',
 'http://4.bp.blogspot.com/-GKGVUan6I3w/UOQtWCzichI/AAAAAAAANs0/mxox-FdrnRA/s1600/019.jpg',
 'http://www.hahastop.com/thumbsb/The_All_Purple_Dog_b.jpg']
display_img(links[1])
# why? just why??

Downloading images

rmtree[source]

rmtree(path:Union[str, Path])

Recursively delete a directory tree

You can use rmtree() to scrub your downloaded images, either to create a new dataset or if you just want to "reset" and start over while experimenting.
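If you'd rather not pull in the library just for this, the same behaviour is easy to reproduce with the standard library. This is a minimal sketch, not the library's actual code, and the name rmtree_sketch is hypothetical:

```python
import shutil
from pathlib import Path
from typing import Union

def rmtree_sketch(path: Union[str, Path]) -> None:
    """Recursively delete a directory tree; do nothing if the path is missing."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
```

Calling it on your dataset root (e.g. `rmtree_sketch(root)`) gives you a clean slate before a fresh set of searches.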

download_urls[source]

download_urls(path:Union[str, Path], links:list, uuid_names:bool=True)

Downloads URLs to the given path. Returns a list of Path objects for the files downloaded to disc.

Files will be saved as 001.jpg, 002.jpg etc., but images already present will not be overwritten, so you can run multiple searches for the same label (e.g. different genres of orchid all under one 'orchid' label) and file numbering will carry on from the last one on disc.

If the uuid_names parameter is True, enough of a UUID is appended to each filename (e.g. 001_4cda4d95.jpg) to ensure filenames are unique across directories. This is for compatibility with tools like fastai.vision.widgets.ImageClassifierCleaner, which can move images between folders and hence cause name clashes. This is the default everywhere.
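The naming pattern itself can be sketched in a few lines. This is an illustration of the scheme, not the library's exact code, and make_filename is a hypothetical helper:

```python
import uuid

def make_filename(index: int, uuid_names: bool = True, ext: str = "jpg") -> str:
    """Build a filename like 001.jpg, or 001_4cda4d95.jpg when uuid_names is True."""
    stem = f"{index:03d}"
    if uuid_names:
        # eight hex chars of a uuid4 is plenty to avoid clashes across folders
        stem += f"_{uuid.uuid4().hex[:8]}"
    return f"{stem}.{ext}"
```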

Downloaded files are checked for validity, so you should never end up with corrupt images or truncated downloads. (Let me know if anything duff gets through.)

root = Path.cwd()/"images"
download_urls(root/"purple", links)
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\purple
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/001_0e3cc95b.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/002_6e8b3e7a.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/003_4fddf126.jpg')]

duckduckgo_search[source]

duckduckgo_search(path:Union[str, Path], label:str, keywords:str, max_results:int=100, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>, uuid_names:bool=True)

Run a DuckDuckGo search and download the images. Returns a list of Path objects for files downloaded to disc.

duckduckgo_search(root, "Nice", "nice clowns", max_results=3)
Duckduckgo search: nice clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/001_6b99919b.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/002_8c1451d7.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/003_d8a0fef5.jpg')]

If you want a list of all the images downloaded across multiple searches you can do it like this:

params = {
    "max_results": 3,
    "img_size":    ImgSize.Cached, 
    "img_type":    ImgType.Photo,
    "img_layout":  ImgLayout.Square,
    "img_color":   ImgColor.All,
    "uuid_names": True
}

imgs = []
imgs.extend(duckduckgo_search(root, "Nice", "nice clowns", **params))
imgs.extend(duckduckgo_search(root, "Scary", "scary clowns", **params))
imgs.extend(duckduckgo_search(root, "Mime", "mimes", **params))
imgs
Duckduckgo search: nice clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [3/3 00:00<00:00 Images downloaded]
Duckduckgo search: scary clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Scary
100.00% [3/3 00:00<00:00 Images downloaded]
Duckduckgo search: mimes
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Mime
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/007_6af54a70.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/008_c304d5ca.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/009_efb040f8.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/001_b63c3858.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/002_40398473.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/003_e801795a.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/001_f66174ed.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/002_ee152455.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/003_32d7762d.jpg')]

Creating a CSV dataset

If you want to create a very large dataset with a lot of images but don't want to store and distribute a very large file, you can create a CSV file containing URL/label pairs. Your users can then download the image files themselves.

save_urls_to_csv[source]

save_urls_to_csv(path:Union[str, Path], label:str, keywords:str, max_results:int=100, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>)

Run a search and append the URLs to a CSV file.

csv = root/"clowns.csv"
save_urls_to_csv(csv, "Nice", "nice clowns", max_results=5)
save_urls_to_csv(csv, "Scary", "scary clowns", max_results=5)
df = pd.read_csv(csv)
df
URL Label
0 https://tse4.mm.bing.net/th?id=OIP.uFX0ybAs0Hi... Nice
1 https://tse4.mm.bing.net/th?id=OIP.s3Ie8ax_Fa6... Nice
2 https://tse1.mm.bing.net/th?id=OIP.lwC5ho3Ta-T... Nice
3 https://tse4.mm.bing.net/th?id=OIP.glEf94S1eD0... Nice
4 https://tse3.mm.bing.net/th?id=OIP.n3504PAjzbN... Nice
5 https://tse3.mm.bing.net/th?id=OIP.zMsnePdSfSb... Scary
6 https://tse3.mm.bing.net/th?id=OIP.yhDrJ18seBC... Scary
7 https://tse1.mm.bing.net/th?id=OIP.y5tm55MMKcW... Scary
8 https://tse3.mm.bing.net/th?id=OIP.MWOP-aLPv8D... Scary
9 https://tse3.mm.bing.net/th?id=OIP.AZyYLBgzuTA... Scary
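The file produced is nothing fancy: a header row followed by URL/label pairs, with later searches appending to the same file. Producing the same layout by hand with just the standard library might look like this (append_urls_to_csv is a hypothetical helper, not the library's implementation):

```python
import csv
from pathlib import Path

def append_urls_to_csv(csv_path, label, urls):
    """Append (URL, label) rows, writing the header row only for a new file."""
    csv_path = Path(csv_path)
    new_file = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["URL", "Label"])
        writer.writerows([url, label] for url in urls)
```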

download_images_from_csv[source]

download_images_from_csv(path:Union[str, Path], csv:Union[str, Path], url_col:str='URL', label_col:str='Label', uuid_names:bool=True)

Download the URLs from a CSV file to the given path. Returns a list of Path objects for files downloaded to disc.

This will (you've guessed it) download the image files from the CSV file we've just created. You can also supply column names if you want to use it on a CSV file created elsewhere with different names.

download_images_from_csv(root, csv)
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [5/5 00:00<00:00 Images downloaded]
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Scary
100.00% [5/5 00:01<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/010_9cbdb8a7.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/011_2f35c643.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/012_5af5d807.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/013_30a96f50.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/014_b5eef117.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/004_7813f590.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/005_23d91904.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/006_50884c99.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/007_88334447.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/008_b3da1cce.jpg')]
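If your CSV came from elsewhere with different column names, the reading side of the job is simple enough to sketch with the standard library. This is an illustration only; urls_by_label is a hypothetical helper, not part of the library:

```python
import csv
from collections import defaultdict

def urls_by_label(csv_path, url_col="URL", label_col="Label"):
    """Group image URLs by label, honouring custom column names."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            groups[row[label_col]].append(row[url_col])
    return dict(groups)
```

Each label's URL list could then be handed to download_urls with a per-label subfolder, which is essentially what download_images_from_csv does for you.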