| author | Terry Truong <terry06890@gmail.com> | 2023-01-23 18:00:43 +1100 |
|---|---|---|
| committer | Terry Truong <terry06890@gmail.com> | 2023-01-23 18:01:13 +1100 |
| commit | 94a8ad9b067e5a2c442ce47ce72d1a53eb444160 (patch) | |
| tree | 2056373ee56b8b2f8269ac3e94d40f8f0e6eec0d /backend/tol_data/enwiki/README.md | |
| parent | 796c4e5660b1006575b8f2af9d99e2ce592c767a (diff) | |
Clean up some docs and naming inconsistencies
Diffstat (limited to 'backend/tol_data/enwiki/README.md')
| -rw-r--r-- | backend/tol_data/enwiki/README.md | 22 |
1 files changed, 11 insertions, 11 deletions
diff --git a/backend/tol_data/enwiki/README.md b/backend/tol_data/enwiki/README.md
index ba1de33..6f27d7f 100644
--- a/backend/tol_data/enwiki/README.md
+++ b/backend/tol_data/enwiki/README.md
@@ -14,12 +14,12 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Dump-Index Files
 - `gen_dump_index_db.py` <br>
   Creates a database version of the enwiki-dump index file.
-- `dumpIndex.db` <br>
+- `dump_index.db` <br>
   Generated by `gen_dump_index_db.py`. <br>
   Tables: <br>
   - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
 
-# Description Database Files
+# Description Files
 - `gen_desc_data.py` <br>
   Reads through pages in the dump file, and adds short-description info to a database.
 - `desc_data.db` <br>
@@ -29,20 +29,20 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
   - `redirects`: `id INT PRIMARY KEY, target TEXT`
   - `descs`: `id INT PRIMARY KEY, desc TEXT`
 
-# Image Database Files
+# Image Files
 - `gen_img_data.py` <br>
-  Used to find infobox image names for page IDs, storing them into a database.
-- `downloadImgLicenseInfo.py` <br>
-  Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database.
+  Used to find infobox image names for page IDs, and store them into a database.
+- `download_img_license_info.py` <br>
+  Used to download licensing metadata for image names, via wikipedia's online API, and store them into a database.
 - `img_data.db` <br>
-  Used to hold metadata about infobox images for a set of pageIDs.
+  Used to hold metadata about infobox images for a set of page IDs.
   Generated using `get_enwiki_img_data.py` and `download_img_license_info.py`. <br>
   Tables: <br>
   - `page_imgs`: `page_id INT PRIMAY KEY, img_name TEXT` <br>
-    `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
+    `img_name` may be null, which means 'none found', and is used to avoid re-processing page IDs.
   - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
     Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
-- `downloadImgs.py` <br>
+- `download_imgs.py` <br>
   Used to download image files into imgs/.
 
 # Page View Files
@@ -51,7 +51,7 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
   Obtained via <https://dumps.wikimedia.org/other/pageview_complete/monthly/>.
   Some format info was available from <https://dumps.wikimedia.org/other/pageview_complete/readme.html>.
 - `gen_pageview_data.py` <br>
-  Reads pageview/*, and creates a database holding average monthly pageview counts.
+  Reads pageview/* and `dump_index.db`, and creates a database holding average monthly pageview counts.
 - `pageview_data.db` <br>
   Generated using `gen_pageview_data.py`. <br>
   Tables: <br>
@@ -60,4 +60,4 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Other Files
 - `lookup_page.py` <br>
   Running `lookup_page.py title1` looks in the dump for a page with a given title,
-  and prints the contents to stdout. Uses dumpIndex.db.
+  and prints the contents to stdout. Uses dump_index.db.
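The README text in this diff documents the `offsets` table of `dump_index.db` and says that `lookup_page.py` uses that database to find a page by title. As a rough illustration of how such a lookup might work, here is a minimal Python sketch. Only the table and column names come from the README. The helper names `find_offsets` and `make_demo_db`, the in-memory database, and the sample rows are all hypothetical, and the real scripts (not shown in this diff) may work differently.

```python
import sqlite3

# Column list copied from the README's description of the `offsets` table.
# "offset" is double-quoted only because OFFSET is also an SQL keyword.
OFFSETS_SCHEMA = (
    'CREATE TABLE offsets ('
    'title TEXT PRIMARY KEY, id INT UNIQUE, "offset" INT, next_offset INT)'
)


def find_offsets(con, title):
    """Return (id, offset, next_offset) for a page title, or None if absent.

    Hypothetical helper: the query used by the real lookup_page.py is not
    shown in the diff.
    """
    cur = con.execute(
        'SELECT id, "offset", next_offset FROM offsets WHERE title = ?',
        (title,),
    )
    return cur.fetchone()


def make_demo_db():
    """Build an in-memory stand-in for dump_index.db with made-up rows."""
    con = sqlite3.connect(":memory:")
    con.execute(OFFSETS_SCHEMA)
    con.executemany(
        "INSERT INTO offsets VALUES (?, ?, ?, ?)",
        [
            ("Example page A", 101, 1000, 2500),  # illustrative values only
            ("Example page B", 102, 2500, 4000),
        ],
    )
    return con


if __name__ == "__main__":
    con = make_demo_db()
    print(find_offsets(con, "Example page A"))
    print(find_offsets(con, "No such page"))
```

Because `title` is the primary key, a lookup by title returns at most one row, so `fetchone()` is sufficient here.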
