author Terry Truong <terry06890@gmail.com> 2023-01-18 20:21:22 +1100
committer Terry Truong <terry06890@gmail.com> 2023-01-18 20:21:22 +1100
commit 4cb5ec14bcfb2db4574c0b0b0d4d4aff59e24c8a
tree 290df19a175d41b8bb4206af14eb179c122442af
parent f3e08d5e636849f9c503d65cee8953f183a3dc2a
Adjust backend docs after another db regeneration
-rw-r--r-- backend/hist_data/README.md 4
-rwxr-xr-x backend/hist_data/enwiki/download_imgs.py 2
2 files changed, 4 insertions, 2 deletions
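
The README change in this commit notes that a download occasionally produces a text file containing 'File not found: ...' instead of an image, and that such files can be deleted. A minimal sketch of how they might be located is shown here. It is not part of the commit: the helper name `find_not_found_files`, the `max_bytes` size cutoff, and the assumption that the marker text appears verbatim near the start of the file are all illustrative.

```python
import os

# Marker text taken from the README wording; the exact bytes written by the
# server are an assumption.
MARKER = b'File not found'

def find_not_found_files(img_dir: str, max_bytes: int = 1024) -> list[str]:
    """Return paths of small files in img_dir whose content contains MARKER.

    Hypothetical helper: only files of at most max_bytes are inspected, on the
    assumption that an error page is far smaller than a real image.
    """
    matches = []
    for name in sorted(os.listdir(img_dir)):
        path = os.path.join(img_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) > max_bytes:
            continue
        with open(path, 'rb') as f:
            if MARKER in f.read(max_bytes):
                matches.append(path)
    return matches
```

Run from backend/hist_data/, something like `for p in find_not_found_files('enwiki/imgs'): os.remove(p)` would then delete the matches. Printing the list first is the safer choice, since the size cutoff is only a guess.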
diff --git a/backend/hist_data/README.md b/backend/hist_data/README.md
index 50108e0..9fe2d0e 100644
--- a/backend/hist_data/README.md
+++ b/backend/hist_data/README.md
@@ -69,7 +69,9 @@ Some of the scripts use third-party packages:
script variable to identify yourself to the online API (this is expected
[best practice](https://www.mediawiki.org/wiki/API:Etiquette)).
1. In enwiki/, run `download_imgs.py`, which downloads images into enwiki/imgs/. Setting the
- USER_AGENT variable applies here as well.
+ USER_AGENT variable applies here as well. <br>
+ In some rare cases, the download won't produce an image file, but a text file containing
+ 'File not found: ...'. These can simply be deleted.
1. Run `gen_imgs.py`, which creates resized/cropped images in img/, from images in enwiki/imgs/.
Adds the `imgs` and `event_imgs` tables. <br>
The output images might need additional manual changes:
diff --git a/backend/hist_data/enwiki/download_imgs.py b/backend/hist_data/enwiki/download_imgs.py
index 378de7f..df40bae 100755
--- a/backend/hist_data/enwiki/download_imgs.py
+++ b/backend/hist_data/enwiki/download_imgs.py
@@ -24,7 +24,7 @@ USER_AGENT = 'terryt.dev (terry06890@gmail.com)'
TIMEOUT = 1
# https://en.wikipedia.org/wiki/Wikipedia:Database_download says to 'throttle to 1 cache miss per sec'
# It's unclear how to properly check for cache misses, so we just aim for 1 per sec
-EXP_BACKOFF = False # If True, double the timeout each time a download error occurs (otherwise just exit)
+EXP_BACKOFF = True # If True, double the timeout each time a download error occurs (otherwise just exit)
def downloadImgs(imgDb: str, outDir: str, timeout: int) -> None:
if not os.path.exists(outDir):