Search filtering
The scrape/search functions accept the following enums as search filters. Filtering is normally pretty good, so by default the results should be square photos, as this is what's requested from DDG. Occasionally results may not be quite what you asked for (eg: a bit of clipart, or an image that's roughly but not exactly square). No checks are actually performed on what comes back.
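Since no checks are performed, you could add your own sanity check after downloading if exact squareness matters. Here's a minimal sketch; is_roughly_square is a hypothetical helper (not part of the library), and in real use you'd get the width and height from something like PIL's Image.open(path).size:

```python
# Hypothetical post-download check: the scraper itself does no filtering
# of what comes back, so verify the aspect ratio yourself if it matters.

def is_roughly_square(width: int, height: int, tolerance: float = 0.05) -> bool:
    """Return True if the aspect ratio is within `tolerance` of 1:1."""
    if width <= 0 or height <= 0:
        return False
    return abs(width / height - 1.0) <= tolerance

print(is_roughly_square(400, 400))  # True  (a square cached thumbnail)
print(is_roughly_square(640, 480))  # False (a standard 4:3 photo)
```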
Using Cached as the image size (the default) returns the image cached by DuckDuckGo/Bing. This is a very decent size for deep learning purposes and is much more reliable to download (no 404s, no hot-linking bans, etc). Using any other size will return the original images from the source websites.
This defaults to Square everywhere because that's what your DL models want.
Defaults to Photo everywhere.
Probably of little use to you, but it's part of the API so I include it. You never know...
At the time of writing, this function will return up to 477 urls for a single search.
links = duckduckgo_scrape_urls("happy clowns", max_results=3)
links
display_img(links[0])
This is the kind of size you can expect by default. As you can see, it should normally be sufficient for your needs.
Since the parameters you use are likely to be the same across every image search within your dataset, if you plan on overriding the defaults, you can pass your parameters in using a dictionary like this:
params = {
"max_results": 3,
"img_size": ImgSize.Medium,
"img_type": ImgType.Photo,
"img_layout": ImgLayout.All,
"img_color": ImgColor.Purple
}
links = duckduckgo_scrape_urls("puppies", **params)
links
display_img(links[1])
# why? just why??
You can use rmtree() to scrub your downloaded images, either to create a new dataset or if you just want to "reset" and start over while experimenting.
Files will be saved as 001.jpg, 002.jpg, etc, but images already present will not be overwritten, so you can run multiple searches for the same label (eg: different genres of orchid, all under one 'orchid' label) and the file numbering will carry on from the last one on disk.
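The carry-on numbering described above can be sketched like this; next_file_number is an illustrative helper, not the library's own code:

```python
import re
from pathlib import Path

def next_file_number(folder: Path) -> int:
    """Find the highest existing NNN.jpg in `folder` and continue from there."""
    pattern = re.compile(r"^(\d{3})")
    numbers = [int(m.group(1))
               for p in folder.glob("*.jpg")
               if (m := pattern.match(p.stem))]
    return max(numbers, default=0) + 1  # 1 for an empty folder
```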
If the uuid_names parameter is True, enough of a uuid is appended to the name of the file (like 001_4cda4d95.jpg) to ensure filenames are unique across directories. This is for compatibility with tools like fastai.vision.widgets.ImageClassifierCleaner, which can move images between folders and hence cause name clashes. This is the default everywhere.
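A naming scheme like 001_4cda4d95.jpg can be produced by taking the first few hex characters of a random UUID. This is a sketch, not the library's actual implementation; make_uuid_name is a hypothetical helper:

```python
import uuid

def make_uuid_name(seq: int, ext: str = "jpg") -> str:
    """Build a name like 001_4cda4d95.jpg: a 3-digit sequence number
    plus 8 hex chars of a random UUID, unique across directories."""
    suffix = uuid.uuid4().hex[:8]
    return f"{seq:03d}_{suffix}.{ext}"

print(make_uuid_name(1))  # something like 001_4cda4d95.jpg
```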
Downloaded files are checked for validity, so you should never end up with corrupt images or truncated downloads. (Let me know if anything duff gets through.)
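One cheap way to catch obvious junk (an HTML error page saved as a .jpg, say) is to check the file's magic bytes. This is only a sketch of the idea; a proper validity check, and likely what the library does, would fully decode the image. looks_like_image is a hypothetical helper:

```python
# Magic-byte signatures for common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def looks_like_image(data: bytes) -> bool:
    """Return True if `data` starts with a known image signature."""
    return any(data.startswith(sig) for sig in SIGNATURES)

print(looks_like_image(b"\xff\xd8\xff\xe0...rest of jpeg..."))  # True
print(looks_like_image(b"<html>404 Not Found</html>"))          # False
```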
root = Path.cwd()/"images"
download_urls(root/"purple", links)
duckduckgo_search(root, "Nice", "nice clowns", max_results=3)
If you want a list of all the images downloaded across multiple searches you can do it like this:
params = {
"max_results": 3,
"img_size": ImgSize.Cached,
"img_type": ImgType.Photo,
"img_layout": ImgLayout.Square,
"img_color": ImgColor.All,
"uuid_names": True
}
imgs = []
imgs.extend(duckduckgo_search(root, "Nice", "nice clowns", **params))
imgs.extend(duckduckgo_search(root, "Scary", "scary clowns", **params))
imgs.extend(duckduckgo_search(root, "Mime", "mimes", **params))
imgs
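If you didn't keep the return values, you can recover the same list from disk with a glob, assuming the root/label/NNN_xxxx.jpg layout created by the searches above:

```python
from pathlib import Path

# Collect every downloaded jpg, one directory level below root
# (root/Nice/*.jpg, root/Scary/*.jpg, etc).
root = Path.cwd() / "images"
imgs = sorted(root.glob("*/*.jpg"))
```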
csv = root/"clowns.csv"
save_urls_to_csv(csv, "Nice", "nice clowns", max_results=5)
save_urls_to_csv(csv, "Scary", "scary clowns", max_results=5)
import pandas as pd

df = pd.read_csv(csv)
df
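For a rough idea of the file layout, here's a self-contained sketch of a two-column label/url CSV of the kind produced above, read back with the stdlib csv module. The "Label" and "URL" column names and the example URLs are assumptions for illustration only. (Note that the csv variable above shadows the csv module name, so run this in a fresh cell.)

```python
import csv
import io

# Build an in-memory CSV shaped like the one saved above
# (column names are an assumption, not the library's guaranteed output).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["Label", "URL"])
writer.writerow(["Nice", "https://example.com/clown1.jpg"])
writer.writerow(["Scary", "https://example.com/clown2.jpg"])

sample.seek(0)
rows = list(csv.DictReader(sample))
print(rows[0]["Label"])  # Nice
```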
This will (you've guessed it) download the image files from the CSV file we've just created. You can also supply column names if you want to use it on a CSV file created elsewhere with different names.
download_images_from_csv(root, csv)