From 90a5e15bb824b84e5bb60337d6a57a1394090dc6 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Wed, 4 May 2022 01:17:06 +1000
Subject: Add scripts for obtaining/sending/displaying Wikipedia descriptions

Add a backend/data/enwiki/ directory containing scripts and instructive
READMEs. Adjust some other scripts to generate an 'eol_ids' sqlite table
separate from 'names'. Make the server respond to /data/desc requests, and
have the client TileInfo component display the response data. Also adjust
.gitignore entries to be root-relative.
---
 backend/data/enwiki/README.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
 create mode 100644 backend/data/enwiki/README.md

(limited to 'backend/data/enwiki/README.md')

diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
new file mode 100644
index 0000000..8e748c9
--- /dev/null
+++ b/backend/data/enwiki/README.md
@@ -0,0 +1,35 @@
+Downloaded Files
+================
+- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz:
+  Obtained via https://dumps.wikimedia.org/backup-index.html (the site suggests downloading from a mirror).
+  Contains text content and metadata for pages in English Wikipedia (current revisions only; talk pages excluded).
+  Information on file content and format is available at
+  https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
+- enwiki-20220420-page.sql.gz:
+  Obtained as above. Contains page-table information, including page id, namespace, title, etc.
+  Format information is available at https://www.mediawiki.org/wiki/Manual:Page_table.
+- enwiki-20220420-redirect.sql.gz:
+  Obtained as above. Contains page-redirection information.
+  Format information is available at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
+
+Generated Files
+===============
+- enwiki\_content/enwiki-*.xml and enwiki-*.sql:
+  Uncompressed versions of the downloaded files.
+- enwikiData.db:
+  An SQLite database representing data from the enwiki dump files.
+  Generation:
+  1. Install Python and the packages mwsql, mwxml, and mwparserfromhell. For example:
+     1. On Ubuntu, install python3, python3-pip, and python3-venv via `apt-get update; apt-get ...`.
+     2. Create a virtual environment in which to install the packages via `python3 -m venv .venv`.
+     3. Activate the virtual environment via `source .venv/bin/activate`.
+     4. Install mwsql, mwxml, and mwparserfromhell via `pip install mwsql mwxml mwparserfromhell`.
+  2. Run genPageData.py (still within the virtual environment), which creates the database,
+     reads the page dump, and builds a 'pages' table.
+  3. Run genRedirectData.py, which creates a 'redirects' table from the redirect dump,
+     using page ids from the 'pages' table.
+  4. Run genDescData.py, which reads the page-content xml dumps along with the 'pages' and
+     'redirects' tables, associates page ids with (potentially redirect-resolved) pages, and
+     attempts to parse the wikitext of those pages to extract the first descriptive paragraph, with markup removed.
+- .venv:
+  Provides a Python virtual environment for the packages needed to generate data.
--
cgit v1.2.3
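
The generation scripts themselves are not included in this view (the patch above is limited to README.md). As a rough sketch only, steps 2 and 3 could look something like the Python below. It assumes mwsql's `Dump.from_file()`/`rows()` reader and the standard MediaWiki `page`/`redirect` dump columns; the 'pages' and 'redirects' schemas used here (id to title, and id to target title) are assumptions, not the actual scripts' schemas.

```python
# Sketch of genPageData.py / genRedirectData.py (hypothetical reconstruction).
# Assumes the .sql.gz dumps sit in the current directory and that the
# 'pages'/'redirects' schemas below match what the real scripts use.
import sqlite3

from mwsql import Dump

db = sqlite3.connect('enwikiData.db')
db.execute('CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, title TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS redirects (id INTEGER PRIMARY KEY, target TEXT)')

# 'pages': keep article-namespace (0) pages, keyed by page id.
pages = Dump.from_file('enwiki-20220420-page.sql.gz')
col = {name: i for i, name in enumerate(pages.col_names)}
for row in pages.rows(convert_dtypes=True):
    if row[col['page_namespace']] == 0:
        db.execute('INSERT OR IGNORE INTO pages VALUES (?, ?)',
                   (row[col['page_id']], row[col['page_title']]))

# 'redirects': map a redirecting page's id to its target title.
redirects = Dump.from_file('enwiki-20220420-redirect.sql.gz')
col = {name: i for i, name in enumerate(redirects.col_names)}
for row in redirects.rows(convert_dtypes=True):
    if row[col['rd_namespace']] == 0:
        db.execute('INSERT OR IGNORE INTO redirects VALUES (?, ?)',
                   (row[col['rd_from']], row[col['rd_title']]))

db.commit()
db.close()
```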
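
Step 4 is the more involved one. A minimal sketch, assuming mwxml's `Dump` iterator and mwparserfromhell's `strip_code()`; the 'descs' table, the `first_paragraph` helper, and the length heuristic are all illustrative, and the real genDescData.py may organize the id/title association differently.

```python
# Sketch of genDescData.py (hypothetical reconstruction).
import glob
import sqlite3

import mwparserfromhell
import mwxml


def first_paragraph(wikitext):
    """Strip wiki markup, then return the first substantial paragraph."""
    plain = mwparserfromhell.parse(wikitext).strip_code()
    for block in plain.split('\n\n'):
        block = block.strip()
        if len(block) > 50:  # skip short fragments left behind by stripped templates
            return block
    return None


db = sqlite3.connect('enwikiData.db')
db.execute('CREATE TABLE IF NOT EXISTS descs (title TEXT PRIMARY KEY, description TEXT)')

for path in glob.glob('enwiki_content/enwiki-20220420-pages-articles-*.xml'):
    for page in mwxml.Dump.from_file(open(path)):
        # Skip non-articles; redirect pages carry no prose of their own and
        # are resolved through the 'redirects' table at lookup time instead.
        if page.namespace != 0 or page.redirect is not None:
            continue
        for revision in page:  # this dump carries one (current) revision per page
            desc = first_paragraph(revision.text or '')
            if desc:
                db.execute('INSERT OR REPLACE INTO descs VALUES (?, ?)',
                           (page.title, desc))

db.commit()
db.close()
```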
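
With those (assumed) schemas in place, answering a /data/desc request reduces to an id-to-title-to-description lookup that follows at most one redirect. The helper below is hypothetical, written in Python for continuity with the sketches above; the actual server code is not part of this patch.

```python
def desc_for_page_id(db, page_id):
    """Resolve a page id to a description, following a redirect if one exists."""
    row = db.execute('SELECT target FROM redirects WHERE id = ?', (page_id,)).fetchone()
    if row is None:
        row = db.execute('SELECT title FROM pages WHERE id = ?', (page_id,)).fetchone()
    if row is None:
        return None  # unknown page id
    title = row[0]
    row = db.execute('SELECT description FROM descs WHERE title = ?', (title,)).fetchone()
    return row[0] if row else None
```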