Core image scraping functions for creating deep learning datasets

Search filtering

The scrape/search functions can use the following enums as filters for searches. Filtering is normally pretty good, so by default the results should be square photos, as that's what's requested from DDG. Sometimes results may not be quite what you asked for (e.g. you may get a bit of clipart, or something more or less square but not exactly square). No checks are performed on what comes back.

ImgSize[source]

Enum = [Cached, Small, Medium, Large, Wallpaper]

An enumeration.

Using Cached as the image size (the default) returns the image cached by DuckDuckGo/Bing. This is a very decent size for deep learning purposes and is much more reliable to download from (no 404s, no hot-linking bans etc). Using any other size will return the original images from the source websites.

ImgLayout[source]

Enum = [All, Square, Tall, Wide]

An enumeration.

This defaults to Square everywhere because that's what your DL models want.

ImgType[source]

Enum = [All, Photo, Clipart, Gif, Transparent]

An enumeration.

This defaults to Photo everywhere.

ImgColor[source]

Enum = [All, Color, Monochrome, Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, Teal, White]

An enumeration.

Probably unlikely to be of much use to you, but it's part of the API so I include it. You never know...

Scraping URLs

duckduckgo_scrape_urls[source]

duckduckgo_scrape_urls(keywords:str, max_results:int, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>)

Scrapes URLs from DuckDuckGo image search. Returns list of URLs.

At the time of writing, this function will return up to 477 urls for a single search.

links = duckduckgo_scrape_urls("happy clowns", max_results=3)
links
['https://tse1.mm.bing.net/th?id=OIP.LR-2HW7P9ENbMGJ7cZTVGwHaHL&pid=Api',
 'https://tse4.mm.bing.net/th?id=OIP.jgAbDJb9lY-p0Q83Q2xsCgHaI0&pid=Api',
 'https://tse4.mm.bing.net/th?id=OIP.4g2txn6PXyuTbEXcJPI2qQHaIE&pid=Api']
display_img(links[0])

This is the kind of size you can expect by default. As you can see, it should normally be sufficient for your needs.

Since the parameters you use are likely to be the same across every image search within your dataset, if you plan on overriding the defaults, you can pass your parameters in using a dictionary like this:

params = {
    "max_results": 3,
    "img_size":    ImgSize.Medium, 
    "img_type":    ImgType.Photo,
    "img_layout":  ImgLayout.All,
    "img_color":   ImgColor.Purple
}

links = duckduckgo_scrape_urls("puppies", **params)
links
['https://cdn3.volusion.com/9nxdj.fchy5/v/vspfiles/photos/WR-13710-2T.jpg?1528880561',
 'http://4.bp.blogspot.com/-GKGVUan6I3w/UOQtWCzichI/AAAAAAAANs0/mxox-FdrnRA/s1600/019.jpg',
 'http://www.hahastop.com/thumbsb/The_All_Purple_Dog_b.jpg']
display_img(links[1])
# why? just why??

Downloading images

rmtree[source]

rmtree(path:Union[str, Path])

Recursively delete a directory tree

You can use rmtree() to scrub your downloaded images, either to create a new dataset or if you just want to "reset" and start over while experimenting.
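If you'd rather not pull in the library just for this, the same behaviour is easy to reproduce with the standard library. This is a minimal sketch, not the library's actual code, and the name rmtree_sketch is hypothetical:

```python
import shutil
from pathlib import Path
from typing import Union

def rmtree_sketch(path: Union[str, Path]) -> None:
    """Recursively delete a directory tree; do nothing if the path is missing."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
```

Calling it on your dataset root (e.g. `rmtree_sketch(root)`) gives you a clean slate before a fresh set of searches.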

download_urls[source]

download_urls(path:Union[str, Path], links:list, uuid_names:bool=True)

Downloads URLs to the given path. Returns a list of Path objects for the files downloaded to disc.

Files will be saved as 001.jpg, 002.jpg etc., but images already present will not be overwritten, so you can run multiple searches for the same label (e.g. different genres of orchid all under one 'orchid' label) and file numbering will carry on from the last one on disc.

If the uuid_names parameter is True, enough of a UUID is appended to each filename (e.g. 001_4cda4d95.jpg) to ensure filenames are unique across directories. This is for compatibility with tools like fastai.vision.widgets.ImageClassifierCleaner, which can move images between folders and hence cause name clashes. This is the default everywhere.
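The naming pattern itself can be sketched in a few lines. This is an illustration of the scheme, not the library's exact code, and make_filename is a hypothetical helper:

```python
import uuid

def make_filename(index: int, uuid_names: bool = True, ext: str = "jpg") -> str:
    """Build a filename like 001.jpg, or 001_4cda4d95.jpg when uuid_names is True."""
    stem = f"{index:03d}"
    if uuid_names:
        # eight hex chars of a uuid4 is plenty to avoid clashes across folders
        stem += f"_{uuid.uuid4().hex[:8]}"
    return f"{stem}.{ext}"
```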

Downloaded files are checked for validity, so you should never end up with corrupt images or truncated downloads. (Let me know if anything duff gets through.)

root = Path.cwd()/"images"
download_urls(root/"purple", links)
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\purple
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/001_0e3cc95b.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/002_6e8b3e7a.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/purple/003_4fddf126.jpg')]

duckduckgo_search[source]

duckduckgo_search(path:Union[str, Path], label:str, keywords:str, max_results:int=100, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>, uuid_names:bool=True)

Run a DuckDuckGo search and download the images. Returns a list of Path objects for files downloaded to disc.

duckduckgo_search(root, "Nice", "nice clowns", max_results=3)
Duckduckgo search: nice clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/001_6b99919b.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/002_8c1451d7.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/003_d8a0fef5.jpg')]

If you want a list of all the images downloaded across multiple searches you can do it like this:

params = {
    "max_results": 3,
    "img_size":    ImgSize.Cached, 
    "img_type":    ImgType.Photo,
    "img_layout":  ImgLayout.Square,
    "img_color":   ImgColor.All,
    "uuid_names": True
}

imgs = []
imgs.extend(duckduckgo_search(root, "Nice", "nice clowns", **params))
imgs.extend(duckduckgo_search(root, "Scary", "scary clowns", **params))
imgs.extend(duckduckgo_search(root, "Mime", "mimes", **params))
imgs
Duckduckgo search: nice clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [3/3 00:00<00:00 Images downloaded]
Duckduckgo search: scary clowns
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Scary
100.00% [3/3 00:00<00:00 Images downloaded]
Duckduckgo search: mimes
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Mime
100.00% [3/3 00:00<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/007_6af54a70.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/008_c304d5ca.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/009_efb040f8.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/001_b63c3858.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/002_40398473.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/003_e801795a.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/001_f66174ed.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/002_ee152455.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Mime/003_32d7762d.jpg')]

Creating a CSV dataset

If you want to create a very large dataset with a lot of images but don't want to store and distribute a very large file, you can create a CSV file containing URL/label pairs. Your users can then download the image files themselves.

save_urls_to_csv[source]

save_urls_to_csv(path:Union[str, Path], label:str, keywords:str, max_results:int=100, img_size:ImgSize=<ImgSize.Cached: ''>, img_type:ImgType=<ImgType.Photo: 'photo'>, img_layout:ImgLayout=<ImgLayout.Square: 'Square'>, img_color:ImgColor=<ImgColor.All: ''>)

Run a search and append the URLs to a CSV file.

csv = root/"clowns.csv"
save_urls_to_csv(csv, "Nice", "nice clowns", max_results=5)
save_urls_to_csv(csv, "Scary", "scary clowns", max_results=5)
df = pd.read_csv(csv)
df
URL Label
0 https://tse4.mm.bing.net/th?id=OIP.uFX0ybAs0Hi... Nice
1 https://tse4.mm.bing.net/th?id=OIP.s3Ie8ax_Fa6... Nice
2 https://tse1.mm.bing.net/th?id=OIP.lwC5ho3Ta-T... Nice
3 https://tse4.mm.bing.net/th?id=OIP.glEf94S1eD0... Nice
4 https://tse3.mm.bing.net/th?id=OIP.n3504PAjzbN... Nice
5 https://tse3.mm.bing.net/th?id=OIP.zMsnePdSfSb... Scary
6 https://tse3.mm.bing.net/th?id=OIP.yhDrJ18seBC... Scary
7 https://tse1.mm.bing.net/th?id=OIP.y5tm55MMKcW... Scary
8 https://tse3.mm.bing.net/th?id=OIP.MWOP-aLPv8D... Scary
9 https://tse3.mm.bing.net/th?id=OIP.AZyYLBgzuTA... Scary
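The file produced is nothing fancy: a header row followed by URL/label pairs, with later searches appending to the same file. Producing the same layout by hand with just the standard library might look like this (append_urls_to_csv is a hypothetical helper, not the library's implementation):

```python
import csv
from pathlib import Path

def append_urls_to_csv(csv_path, label, urls):
    """Append (URL, label) rows, writing the header row only for a new file."""
    csv_path = Path(csv_path)
    new_file = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["URL", "Label"])
        writer.writerows([url, label] for url in urls)
```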

download_images_from_csv[source]

download_images_from_csv(path:Union[str, Path], csv:Union[str, Path], url_col:str='URL', label_col:str='Label', uuid_names:bool=True)

Download the URLs from a CSV file to the given path. Returns a list of Path objects for files downloaded to disc.

This will (you've guessed it) download the image files from the CSV file we've just created. You can also supply column names if you want to use it on a CSV file created elsewhere with different names.

download_images_from_csv(root, csv)
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Nice
100.00% [5/5 00:00<00:00 Images downloaded]
Downloading results into C:\Users\Joe\Documents\GitHub\jmd_imagescraper\images\Scary
100.00% [5/5 00:01<00:00 Images downloaded]
[Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/010_9cbdb8a7.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/011_2f35c643.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/012_5af5d807.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/013_30a96f50.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Nice/014_b5eef117.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/004_7813f590.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/005_23d91904.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/006_50884c99.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/007_88334447.jpg'),
 Path('C:/Users/Joe/Documents/GitHub/jmd_imagescraper/images/Scary/008_b3da1cce.jpg')]
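If your CSV came from elsewhere with different column names, the reading side of the job is simple enough to sketch with the standard library. This is an illustration only; urls_by_label is a hypothetical helper, not part of the library:

```python
import csv
from collections import defaultdict

def urls_by_label(csv_path, url_col="URL", label_col="Label"):
    """Group image URLs by label, honouring custom column names."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            groups[row[label_col]].append(row[url_col])
    return dict(groups)
```

Each label's URL list could then be handed to download_urls with a per-label subfolder, which is essentially what download_images_from_csv does for you.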