From ad82c9dc1eb35036c4078b9cd36ae0924e1ff0d2 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Sat, 7 May 2022 11:09:03 +1000
Subject: Update README line breaks

---
 backend/data/enwiki/README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'backend/data/enwiki/README.md')

diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index 8e748c9..e4e1aae 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,22 +1,22 @@
 Downloaded Files
 ================
-- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz:
+- enwiki\_content/enwiki-20220420-pages-articles-*.xml.gz
 Obtained via https://dumps.wikimedia.org/backup-index.html (site suggests downloading from a mirror). Contains text content and metadata for pages in English Wikipedia (current revision only, excludes talk pages). Some file content and format information was available from https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
-- enwiki-20220420-page.sql.gz:
+- enwiki-20220420-page.sql.gz
 Obtained like above. Contains page-table information including page id, namespace, title, etc. Format information was found at https://www.mediawiki.org/wiki/Manual:Page_table.
-- enwiki-20220420-redirect.sql.gz:
+- enwiki-20220420-redirect.sql.gz
 Obtained like above. Contains page-redirection info. Format information was found at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download.
 Generated Files
 ===============
-- enwiki\_content/enwiki-*.xml and enwiki-*.sql:
+- enwiki\_content/enwiki-*.xml and enwiki-*.sql
 Uncompressed versions of downloaded files.
-- enwikiData.db:
+- enwikiData.db
 An sqlite database representing data from the enwiki dump files.
 Generation:
 1 Install python, and packages mwsql, mwxml, and mwparsefromhell. Example:
@@ -31,5 +31,5 @@ Generated Files
 4 Run genDescData.py, which reads the page-content xml dumps, and the 'pages' and 'redirects' tables, and associates page ids with (potentially redirect-resolved) pages, and attempts to parse some wikitext within those pages to obtain the first descriptive paragraph, with markup removed.
-- .venv:
+- .venv
 Provides a python virtual environment for packages needed to generate data.