path: root/backend/tol_data/enwiki/README.md
author: Terry Truong <terry06890@gmail.com> 2023-01-23 18:00:43 +1100
committer: Terry Truong <terry06890@gmail.com> 2023-01-23 18:01:13 +1100
commit: 94a8ad9b067e5a2c442ce47ce72d1a53eb444160
tree: 2056373ee56b8b2f8269ac3e94d40f8f0e6eec0d /backend/tol_data/enwiki/README.md
parent: 796c4e5660b1006575b8f2af9d99e2ce592c767a
Clean up some docs and naming inconsistencies
Diffstat (limited to 'backend/tol_data/enwiki/README.md')
-rw-r--r-- backend/tol_data/enwiki/README.md 22
1 files changed, 11 insertions, 11 deletions
diff --git a/backend/tol_data/enwiki/README.md b/backend/tol_data/enwiki/README.md
index ba1de33..6f27d7f 100644
--- a/backend/tol_data/enwiki/README.md
+++ b/backend/tol_data/enwiki/README.md
@@ -14,12 +14,12 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
# Dump-Index Files
- `gen_dump_index_db.py` <br>
Creates a database version of the enwiki-dump index file.
-- `dumpIndex.db` <br>
+- `dump_index.db` <br>
Generated by `gen_dump_index_db.py`. <br>
Tables: <br>
- `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
-# Description Database Files
+# Description Files
- `gen_desc_data.py` <br>
Reads through pages in the dump file, and adds short-description info to a database.
- `desc_data.db` <br>
@@ -29,20 +29,20 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
- `redirects`: `id INT PRIMARY KEY, target TEXT`
- `descs`: `id INT PRIMARY KEY, desc TEXT`
-# Image Database Files
+# Image Files
- `gen_img_data.py` <br>
- Used to find infobox image names for page IDs, storing them into a database.
-- `downloadImgLicenseInfo.py` <br>
- Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database.
+ Used to find infobox image names for page IDs, and store them into a database.
+- `download_img_license_info.py` <br>
+ Used to download licensing metadata for image names, via wikipedia's online API, and store them into a database.
- `img_data.db` <br>
- Used to hold metadata about infobox images for a set of pageIDs.
+ Used to hold metadata about infobox images for a set of page IDs.
Generated using `get_enwiki_img_data.py` and `download_img_license_info.py`. <br>
Tables: <br>
- `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br>
- `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
+ `img_name` may be null, which means 'none found', and is used to avoid re-processing page IDs.
- `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
-- `downloadImgs.py` <br>
+- `download_imgs.py` <br>
Used to download image files into imgs/.
# Page View Files
@@ -51,7 +51,7 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
Obtained via <https://dumps.wikimedia.org/other/pageview_complete/monthly/>.
Some format info was available from <https://dumps.wikimedia.org/other/pageview_complete/readme.html>.
- `gen_pageview_data.py` <br>
- Reads pageview/*, and creates a database holding average monthly pageview counts.
+ Reads pageview/* and `dump_index.db`, and creates a database holding average monthly pageview counts.
- `pageview_data.db` <br>
Generated using `gen_pageview_data.py`. <br>
Tables: <br>
@@ -60,4 +60,4 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
# Other Files
- `lookup_page.py` <br>
Running `lookup_page.py title1` looks in the dump for a page with a given title,
- and prints the contents to stdout. Uses dumpIndex.db.
+ and prints the contents to stdout. Uses dump_index.db.
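
To illustrate the `offsets` table that the README lists for `dump_index.db`, here is a minimal sketch of a title lookup using Python's standard `sqlite3` module. The column list is taken from the README; the helper name `lookup_offsets`, the in-memory database, and the sample rows are hypothetical and only serve the demonstration. How `lookup_page.py` actually uses these offsets is not shown in this diff, so this is not a reproduction of that script.

```python
import sqlite3

# Column list as given in the README for the `offsets` table.
# "offset" is double-quoted because OFFSET is also an SQL keyword.
OFFSETS_SCHEMA = (
    'CREATE TABLE offsets ('
    'title TEXT PRIMARY KEY, id INT UNIQUE, "offset" INT, next_offset INT)'
)

def lookup_offsets(con, title):
    """Return (id, offset, next_offset) for a page title, or None if absent.

    Hypothetical helper, not part of the repository.
    """
    return con.execute(
        'SELECT id, "offset", next_offset FROM offsets WHERE title = ?',
        (title,),
    ).fetchone()

# Demonstration against an in-memory database with made-up rows.
con = sqlite3.connect(":memory:")
con.execute(OFFSETS_SCHEMA)
con.executemany(
    "INSERT INTO offsets VALUES (?, ?, ?, ?)",
    [
        ("Example page", 1, 100, 250),
        ("Another page", 2, 250, 400),
    ],
)

print(lookup_offsets(con, "Example page"))
print(lookup_offsets(con, "Missing page"))
```

Against the real file, the same query would presumably be run after `sqlite3.connect("dump_index.db")` in place of the in-memory connection.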