Somebody asked at some point if I could turn my image scraping notebook into an installable library, and I've been meaning to have a look at nbdev for a while anyway.

Nbdev is great because it lets you develop a software library in Jupyter notebooks, and from those notebooks it generates the library, the documentation and the install packages. You can also include tests in your notebooks. These are run as part of the CI process and, if you choose, shown in the documentation, which helps people understand the behaviour they can expect from your code.

If you've been developing software for more than 5 minutes in a professional environment, you'll know that even with the best will in the world, code & docs & tests are often not in sync. When everything is generated from one place, you'd actually have to put effort into getting this wrong.

I can't recommend it highly enough, and it's well worth watching the YouTube video on the tutorial page to get an idea of what it does.

As a quick example, the following cell in my source notebook:

#export
def rmtree(path: Union[str, Path]):
    '''Recursively delete a directory tree'''
    path = Path(path); assert path.is_dir()
    for p in reversed(list(path.glob('**/*'))):
        if p.is_file():  p.unlink()
        elif p.is_dir(): p.rmdir()
    path.rmdir()

becomes this part of the docs, and this part of the library, just because I put #export at the top of the cell.

If you look at the docs you'll see that any cells (markdown or code) which follow a function export become part of the docs unless I choose to hide them, so I can add explanation, thought process, and runnable code examples. It's the ultimate self-documenting code.
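To make that concrete, here's a rough sketch of the kind of cell that could sit right after the rmtree export and double as both a test and a doc example. The throwaway directory layout and the names base and tree are made up for illustration, and it uses plain assert statements rather than any nbdev-specific helper. rmtree is repeated from above only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Union

def rmtree(path: Union[str, Path]):
    '''Recursively delete a directory tree'''
    path = Path(path); assert path.is_dir()
    for p in reversed(list(path.glob('**/*'))):
        if p.is_file():  p.unlink()
        elif p.is_dir(): p.rmdir()
    path.rmdir()

# Build a small throwaway tree: demo/a.txt and demo/sub/b.txt
base = Path(tempfile.mkdtemp())
tree = base/"demo"
(tree/"sub").mkdir(parents=True)
(tree/"a.txt").write_text("a")
(tree/"sub"/"b.txt").write_text("b")

rmtree(tree)
assert not tree.exists()  # the whole tree is gone
assert base.exists()      # the parent directory is left alone
base.rmdir()              # tidy up the temporary parent
```

A cell like this, left visible, would show readers exactly what the function does to a nested folder, and it would fail loudly if that behaviour ever changed.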

So now you can use my image scraper like this (which is far more convenient, let's be honest):

!pip install -q jmd_imagescraper
from pathlib import Path
root = Path().cwd()/"images"

from jmd_imagescraper.core import * # don't worry, it's designed to work with import *

duckduckgo_search(root, "Cats", "cute kittens", max_results=10)
duckduckgo_search(root, "Dogs", "cute puppies", max_results=10)
duckduckgo_search(root, "Birds", "cute baby ducks and chickens", max_results=10)
Duckduckgo search: cute kittens
Downloading results into /content/images/Cats
100.00% [10/10 00:01<00:00 Images downloaded]
Duckduckgo search: cute puppies
Downloading results into /content/images/Dogs
100.00% [10/10 00:00<00:00 Images downloaded]
Duckduckgo search: cute baby ducks and chickens
Downloading results into /content/images/Birds
100.00% [10/10 00:00<00:00 Images downloaded]
[PosixPath('/content/images/Birds/001_3eecd1d6.jpg'),
 PosixPath('/content/images/Birds/002_0773b37c.jpg'),
 PosixPath('/content/images/Birds/003_fded0010.jpg'),
 PosixPath('/content/images/Birds/004_4d8df6ba.jpg'),
 PosixPath('/content/images/Birds/005_380ee0c3.jpg'),
 PosixPath('/content/images/Birds/006_f3bf1d26.jpg'),
 PosixPath('/content/images/Birds/007_173bf825.jpg'),
 PosixPath('/content/images/Birds/008_8bf49659.jpg'),
 PosixPath('/content/images/Birds/009_f9bac3df.jpg'),
 PosixPath('/content/images/Birds/010_49c2fd7a.jpg')]

Filenames are unique across folders, so it's compatible with fastai.vision.widgets.ImageClassifierCleaner (thanks to @butchland).

from jmd_imagescraper.imagecleaner import *

display_image_cleaner(root)

Boom.
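If you want to check the unique-filenames point for yourself, something along these lines should do it. This is only a sketch: duplicate_filenames is a hypothetical helper, not part of jmd_imagescraper, and the folder layout below uses made-up file names so it runs without downloading anything. My assumption is that uniqueness matters because cleaning typically involves moving images between category folders, where a name clash would be a problem.

```python
import tempfile
from collections import Counter
from pathlib import Path

def duplicate_filenames(root):
    '''Return file names that occur more than once anywhere under root (hypothetical helper)'''
    counts = Counter(p.name for p in Path(root).rglob('*') if p.is_file())
    return sorted(name for name, n in counts.items() if n > 1)

# Stand-in for a real images folder, with made-up names
demo_root = Path(tempfile.mkdtemp())/"images"
for folder, name in [("Cats", "001_aaaa1111.jpg"), ("Dogs", "001_bbbb2222.jpg")]:
    (demo_root/folder).mkdir(parents=True, exist_ok=True)
    (demo_root/folder/name).touch()

print(duplicate_filenames(demo_root))  # an empty list means no clashes
```

Point it at your own root instead of demo_root to check a real download.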

The repo contains quite a few files but the meat of it is basically in 00_core.ipynb, 01_imagecleaner.ipynb (which are the two modules in my lib) and index.ipynb (which is the homepage for my docs).

I'll write more about this later, I'm sure, but for now I just wanted to let people know it's available as a package.