Diffstat (limited to 'backend/data/enwiki/README.md')
-rw-r--r-- backend/data/enwiki/README.md | 12
1 files changed, 6 insertions, 6 deletions
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index 8e748c9..e4e1aae 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,22 +1,22 @@
Downloaded Files
================
-- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz:
+- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz <br>
Obtained via https://dumps.wikimedia.org/backup-index.html (site suggests downloading from a mirror).
Contains text content and metadata for pages in English Wikipedia (current revision only, excludes talk pages).
Some file content and format information was available from
https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
-- enwiki-20220420-page.sql.gz:
+- enwiki-20220420-page.sql.gz <br>
Obtained like above. Contains page-table information including page id, namespace, title, etc.
Format information was found at https://www.mediawiki.org/wiki/Manual:Page_table.
-- enwiki-20220420-redirect.sql.gz:
+- enwiki-20220420-redirect.sql.gz <br>
Obtained like above. Contains page-redirection info.
Format information was found at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
Generated Files
===============
-- enwiki\_content/enwiki-*.xml and enwiki-*.sql:
+- enwiki\_content/enwiki-*.xml and enwiki-*.sql <br>
Uncompressed versions of downloaded files.
-- enwikiData.db:
+- enwikiData.db <br>
An sqlite database representing data from the enwiki dump files.
Generation:
1 Install python, and packages mwsql, mwxml, and mwparserfromhell. Example:
@@ -31,5 +31,5 @@ Generated Files
4 Run genDescData.py, which reads the page-content xml dumps, and the 'pages' and 'redirects' tables,
and associates page ids with (potentially redirect-resolved) pages, and attempts to parse some
wikitext within those pages to obtain the first descriptive paragraph, with markup removed.
-- .venv:
+- .venv <br>
Provides a python virtual environment for packages needed to generate data.
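
Step 4 of the README describes associating page ids with "potentially redirect-resolved" pages. The sketch below illustrates one way that lookup could work against an sqlite database. It is not taken from genDescData.py: the column names (`id`, `title`, `target`), the function `resolve_page_id`, and the `max_hops` limit are all assumptions made for illustration, and only the table names 'pages' and 'redirects' come from the README.

```python
import sqlite3

def resolve_page_id(conn, title, max_hops=5):
    """Follow redirects from `title` and return the final page id, or None.

    Hypothetical schema (not confirmed by the README):
      pages(id, title)       one row per page
      redirects(id, target)  id of a redirecting page and the title it points to
    """
    seen = set()
    for _ in range(max_hops + 1):
        row = conn.execute(
            "SELECT id FROM pages WHERE title = ?", (title,)
        ).fetchone()
        if row is None:
            return None  # no page with this title
        page_id = row[0]
        target = conn.execute(
            "SELECT target FROM redirects WHERE id = ?", (page_id,)
        ).fetchone()
        if target is None:
            return page_id  # not a redirect: this is the resolved page
        if page_id in seen:
            return None  # redirect cycle
        seen.add(page_id)
        title = target[0]
    return None  # too many hops

# Small in-memory database standing in for enwikiData.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE redirects (id INTEGER PRIMARY KEY, target TEXT);
INSERT INTO pages VALUES (1, 'Domestic cat'), (2, 'Cat'), (3, 'Loop A'), (4, 'Loop B');
INSERT INTO redirects VALUES (1, 'Cat'), (3, 'Loop B'), (4, 'Loop A');
""")

print(resolve_page_id(conn, "Domestic cat"))
print(resolve_page_id(conn, "Loop A"))
```

The hop limit and the `seen` set guard against redirect chains and cycles, which can appear in real dump data; whether the actual script handles them this way is not stated in the README.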