Diffstat (limited to 'backend/hist_data')
-rw-r--r--  backend/hist_data/README.md                4
-rwxr-xr-x  backend/hist_data/enwiki/download_imgs.py  2
2 files changed, 4 insertions, 2 deletions
diff --git a/backend/hist_data/README.md b/backend/hist_data/README.md
index 50108e0..9fe2d0e 100644
--- a/backend/hist_data/README.md
+++ b/backend/hist_data/README.md
@@ -69,7 +69,9 @@ Some of the scripts use third-party packages:
 script variable to identify yourself to the online API (this is expected
 [best practice](https://www.mediawiki.org/wiki/API:Etiquette)).
 1. In enwiki/, run `download_imgs.py`, which downloads images into enwiki/imgs/. Setting the
-USER_AGENT variable applies here as well.
+USER_AGENT variable applies here as well. <br>
+In some rare cases, the download won't produce an image file, but a text file containing
+'File not found: ...'. These can simply be deleted.
 1. Run `gen_imgs.py`, which creates resized/cropped images in img/, from images in enwiki/imgs/.
 Adds the `imgs` and `event_imgs` tables. <br>
 The output images might need additional manual changes:
diff --git a/backend/hist_data/enwiki/download_imgs.py b/backend/hist_data/enwiki/download_imgs.py
index 378de7f..df40bae 100755
--- a/backend/hist_data/enwiki/download_imgs.py
+++ b/backend/hist_data/enwiki/download_imgs.py
@@ -24,7 +24,7 @@ USER_AGENT = 'terryt.dev (terry06890@gmail.com)'
 TIMEOUT = 1
 # https://en.wikipedia.org/wiki/Wikipedia:Database_download says to 'throttle to 1 cache miss per sec'
 # It's unclear how to properly check for cache misses, so we just aim for 1 per sec
-EXP_BACKOFF = False # If True, double the timeout each time a download error occurs (otherwise just exit)
+EXP_BACKOFF = True # If True, double the timeout each time a download error occurs (otherwise just exit)
 
 def downloadImgs(imgDb: str, outDir: str, timeout: int) -> None:
     if not os.path.exists(outDir):
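Note on the two changes above: the sketch below illustrates, under stated assumptions, what "double the timeout each time a download error occurs" and "these can simply be deleted" might look like in practice. It is not the code in download_imgs.py. The names download_with_backoff, find_not_found_files, fetch, max_tries and marker are hypothetical, and the check assumes the stray text files begin with the literal text 'File not found'.

```python
import os
import time
from typing import Callable, List


def download_with_backoff(fetch: Callable[[], bytes], timeout: float,
                          max_tries: int = 5,
                          sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(); after each failure wait `timeout` seconds, then double it.

    `fetch` is a hypothetical zero-argument callable that performs one download.
    The last failure is re-raised once `max_tries` attempts have been used.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise
            sleep(timeout)
            timeout *= 2  # exponential backoff
    raise ValueError('max_tries must be at least 1')


def find_not_found_files(img_dir: str, marker: bytes = b'File not found') -> List[str]:
    """List files in img_dir whose content starts with `marker` (assumed format)."""
    matches = []
    for name in sorted(os.listdir(img_dir)):
        path = os.path.join(img_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            # Only the first bytes are read, so large images are not loaded fully
            if f.read(256).lstrip().startswith(marker):
                matches.append(path)
    return matches
```

Reviewing the list returned by find_not_found_files('enwiki/imgs') before passing each path to os.remove keeps the deletion step explicit. The sleep parameter exists only so the waiting can be replaced in tests.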
