path: root/backend/data/enwiki/README.md
author Terry Truong <terry06890@gmail.com> 2022-05-17 10:41:12 +1000
committer Terry Truong <terry06890@gmail.com> 2022-05-17 10:41:12 +1000
commit 29940d51eb8b6b220d53940ecbc212cea78159ae
tree bfa698c17525de7876b80ad37d8f7777b9505ba0 /backend/data/enwiki/README.md
parent a840a16c6bd5aef906bd5cbce8293fc863cb5a5d
Improve enwiki description extraction
Adjust enwiki code to handle single dump file, and add scripts for 'convenient' page-content lookup.
Diffstat (limited to 'backend/data/enwiki/README.md')
-rw-r--r-- backend/data/enwiki/README.md | 51
1 files changed, 22 insertions, 29 deletions
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index e4e1aae..cdabf50 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,35 +1,28 @@
 Downloaded Files
 ================
-- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz <br>
- Obtained via https://dumps.wikimedia.org/backup-index.html (site suggests downloading from a mirror).
- Contains text content and metadata for pages in English Wikipedia (current revision only, excludes talk pages).
- Some file content and format information was available from
- https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
-- enwiki-20220420-page.sql.gz <br>
- Obtained like above. Contains page-table information including page id, namespace, title, etc.
- Format information was found at https://www.mediawiki.org/wiki/Manual:Page_table.
-- enwiki-20220420-redirect.sql.gz <br>
- Obtained like above. Contains page-redirection info.
- Format information was found at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
+- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
+ Obtained via <https://dumps.wikimedia.org/backup-index.html>
+ (site suggests downloading from a mirror). Contains text
+ content and metadata for pages in English Wikipedia
+ (current revision only, excludes talk pages). Some file
+ content and format information was available from
+ <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
+- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
+ Obtained like above. Holds lines of the form offset1:pageId1:title1,
+ providing offsets, for each page, into the dump file, of a chunk of
+ 100 pages that includes it.
 Generated Files
 ===============
-- enwiki\_content/enwiki-*.xml and enwiki-*.sql <br>
- Uncompressed versions of downloaded files.
+- dumpIndex.db <br>
+ Holds data from the enwiki dump index file. Generated by
+ genDumpIndexDb.py, and used by lookupPage.py to get content for a
+ given page title.
 - enwikiData.db <br>
- An sqlite database representing data from the enwiki dump files.
- Generation:
- 1 Install python, and packages mwsql, mwxml, and mwparsefromhell. Example:
- 1 On Ubuntu, install python3, python3-pip, and python3-venv via `apt-get update; apt-get ...`.
- 2 Create a virtual environment in which to install packages via `python3 -m venv .venv`.
- 3 Activate the virtual environment via `source .venv/bin/activate`.
- 4 Install mwsql, mwxml, and mwparsefromhell via `pip install mwsql mwxml mwparsefromhell`.
- 2 Run genPageData.py (still under the virtual environment), which creates the database,
- reads from the page dump, and creates a 'pages' table.
- 3 Run genRedirectData.py, which creates a 'redirects' table, using information in the redirects dump,
- and page ids from the 'pages' table.
- 4 Run genDescData.py, which reads the page-content xml dumps, and the 'pages' and 'redirects' tables,
- and associates page ids with (potentially redirect-resolved) pages, and attempts to parse some
- wikitext within those pages to obtain the first descriptive paragraph, with markup removed.
-- .venv <br>
- Provides a python virtual environment for packages needed to generate data.
+ Holds data obtained from the enwiki dump file, in 'pages',
+ 'redirects', and 'descs' tables. Generated by genData.py, which uses
+ python packages mwxml and mwparserfromhell. <br>
+ Tables: <br>
+ - pages: id INT PRIMARY KEY, title TEXT UNIQUE
+ - redirects: id INT PRIMARY KEY, target TEXT
+ - descs: id INT PRIMARY KEY, desc TEXT
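
The index-file format described in the new README text (lines of the form offset:pageId:title, each offset pointing at a chunk of 100 pages in the dump) can be illustrated with a minimal sketch. This is not the repository's genDumpIndexDb.py or lookupPage.py; the function names `parse_index_line` and `read_chunk` are hypothetical. It assumes that a title may itself contain colons, and that each offset marks the start of an independent bz2 stream inside the multistream dump file.

```python
import bz2
from typing import Optional, Tuple


def parse_index_line(line: str) -> Tuple[int, int, str]:
    """Split one index line into (offset, page id, title).

    Only the first two colons are separators, so titles such as
    'Category:Foo' survive intact.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_chunk(dump_path: str, offset: int, end_offset: Optional[int] = None) -> str:
    """Decompress the single bz2 stream that starts at `offset`.

    `end_offset` (hypothetical parameter) is the offset of the next chunk,
    if known; without it, the rest of the file is read into memory and only
    the first stream is decompressed.
    """
    with open(dump_path, "rb") as f:
        f.seek(offset)
        data = f.read() if end_offset is None else f.read(end_offset - offset)
    # A BZ2Decompressor stops at the end of the first stream; any bytes
    # after it are left in its `unused_data` attribute.
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")
```

Under these assumptions, a lookup by title would find the matching index line, call `read_chunk` with its offset, and then search the returned XML fragment for the `<page>` element with that title.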