Diffstat (limited to 'backend/data/enwiki/README.md')
| -rw-r--r-- | backend/data/enwiki/README.md | 51 |
1 file changed, 22 insertions, 29 deletions
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index e4e1aae..cdabf50 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,35 +1,28 @@
 Downloaded Files
 ================
-- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz <br>
-  Obtained via https://dumps.wikimedia.org/backup-index.html (site suggests downloading from a mirror).
-  Contains text content and metadata for pages in English Wikipedia (current revision only, excludes talk pages).
-  Some file content and format information was available from
-  https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
-- enwiki-20220420-page.sql.gz <br>
-  Obtained like above. Contains page-table information including page id, namespace, title, etc.
-  Format information was found at https://www.mediawiki.org/wiki/Manual:Page_table.
-- enwiki-20220420-redirect.sql.gz <br>
-  Obtained like above. Contains page-redirection info.
-  Format information was found at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
+- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
+  Obtained via <https://dumps.wikimedia.org/backup-index.html>
+  (site suggests downloading from a mirror). Contains text
+  content and metadata for pages in English Wikipedia
+  (current revision only, excludes talk pages). Some file
+  content and format information was available from
+  <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
+- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
+  Obtained like above. Holds lines of the form offset1:pageId1:title1,
+  providing offsets, for each page, into the dump file, of a chunk of
+  100 pages that includes it.
 
 Generated Files
 ===============
-- enwiki\_content/enwiki-*.xml and enwiki-*.sql <br>
-  Uncompressed versions of downloaded files.
+- dumpIndex.db <br>
+  Holds data from the enwiki dump index file. Generated by
+  genDumpIndexDb.py, and used by lookupPage.py to get content for a
+  given page title.
 - enwikiData.db <br>
-  An sqlite database representing data from the enwiki dump files.
-  Generation:
-  1 Install python, and packages mwsql, mwxml, and mwparsefromhell. Example:
-    1 On Ubuntu, install python3, python3-pip, and python3-venv via `apt-get update; apt-get ...`.
-    2 Create a virtual environment in which to install packages via `python3 -m venv .venv`.
-    3 Activate the virtual environment via `source .venv/bin/activate`.
-    4 Install mwsql, mwxml, and mwparsefromhell via `pip install mwsql mwxml mwparsefromhell`.
-  2 Run genPageData.py (still under the virtual environment), which creates the database,
-    reads from the page dump, and creates a 'pages' table.
-  3 Run genRedirectData.py, which creates a 'redirects' table, using information in the redirects dump,
-    and page ids from the 'pages' table.
-  4 Run genDescData.py, which reads the page-content xml dumps, and the 'pages' and 'redirects' tables,
-    and associates page ids with (potentially redirect-resolved) pages, and attempts to parse some
-    wikitext within those pages to obtain the first descriptive paragraph, with markup removed.
-- .venv <br>
-  Provides a python virtual environment for packages needed to generate data.
+  Holds data obtained from the enwiki dump file, in 'pages',
+  'redirects', and 'descs' tables. Generated by genData.py, which uses
+  python packages mwxml and mwparserfromhell. <br>
+  Tables: <br>
+  - pages: id INT PRIMARY KEY, title TEXT UNIQUE
+  - redirects: id INT PRIMARY KEY, target TEXT
+  - descs: id INT PRIMARY KEY, desc TEXT
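The multistream index described in the new README text (lines of the form offset1:pageId1:title1, giving the byte offset of the ~100-page bz2 stream containing each page) allows extracting one chunk of pages without decompressing the whole dump, since every chunk is an independently compressed bz2 stream. A minimal sketch of that technique; the helper names here are illustrative, not taken from the repo's scripts:

```python
import bz2

def parse_index_line(line):
    # Each index line is "offset:pageId:title"; titles may themselves
    # contain ':', so split on the first two colons only.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

def stream_offset_for(title, index_lines):
    # Return the byte offset of the compressed ~100-page chunk that
    # contains `title`, or None if the title is not listed.
    for line in index_lines:
        offset, _page_id, t = parse_index_line(line)
        if t == title:
            return offset
    return None

def read_stream(dump_path, offset):
    # Decompress a single bz2 stream starting at `offset` in the
    # multistream dump. BZ2Decompressor stops at the end of that
    # stream (decomp.eof), so the rest of the file is never read.
    decomp = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The decompressed bytes are a sequence of `<page>...</page>` XML fragments; finding the wanted title within the chunk still requires parsing that XML.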

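Given the three enwikiData.db tables listed in the diff (pages, redirects, descs, all keyed by page id), a title lookup has to resolve a possible redirect before fetching the description. The sketch below builds the stated schema in an in-memory sqlite database; the `desc_for_title` helper is a hypothetical illustration of the join logic, not the repo's lookupPage.py, and `desc` is quoted because DESC is an SQL keyword:

```python
import sqlite3

# Schema as listed in the README diff, built in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT);
CREATE TABLE descs (id INT PRIMARY KEY, "desc" TEXT);
""")

def desc_for_title(conn, title):
    # Hypothetical helper: map a title to its page id, follow one
    # level of redirect if a redirects row exists for that id, then
    # fetch the description for the resolved id.
    row = conn.execute("SELECT id FROM pages WHERE title = ?", (title,)).fetchone()
    if row is None:
        return None
    page_id = row[0]
    r = conn.execute("SELECT target FROM redirects WHERE id = ?", (page_id,)).fetchone()
    if r is not None:
        row = conn.execute("SELECT id FROM pages WHERE title = ?", (r[0],)).fetchone()
        if row is None:
            return None
        page_id = row[0]
    d = conn.execute('SELECT "desc" FROM descs WHERE id = ?', (page_id,)).fetchone()
    return d[0] if d else None
```

One level of resolution matches the schema's shape (a redirect row stores a target title, not a target id); chained redirects would need a loop with a visited-set guard.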