Downloaded Files
================
- enwiki-20220501-pages-articles-multistream.xml.bz2
  Obtained via (site suggests downloading from a mirror). Contains the text
  content and metadata of pages in English Wikipedia (current revisions only;
  talk pages excluded). Some file content and format information was
  available from .
- enwiki-20220501-pages-articles-multistream-index.txt.bz2
  Obtained like the above. Holds lines of the form offset:pageId:title: for
  each page, the offset is the position in the dump file of the 100-page
  chunk that contains it. (A sketch of using the index to extract a page
  follows this list.)
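  Because the dump is multistream, each 100-page chunk is an independent bz2
  stream, so a single page's chunk can be decompressed without reading the
  whole file. A minimal sketch of using the two files together (a linear scan
  of the index, for illustration; dumpIndex.db below avoids that scan):

      import bz2

      INDEX_FILE = "enwiki-20220501-pages-articles-multistream-index.txt.bz2"
      DUMP_FILE = "enwiki-20220501-pages-articles-multistream.xml.bz2"

      def find_offset(title):
          """Scan the index for 'title'; return its chunk's byte offset, or None."""
          with bz2.open(INDEX_FILE, "rt", encoding="utf-8") as f:
              for line in f:
                  offset, page_id, page_title = line.rstrip("\n").split(":", 2)
                  if page_title == title:
                      return int(offset)
          return None

      def read_chunk(offset):
          """Decompress the single bz2 stream starting at 'offset' in the dump."""
          with open(DUMP_FILE, "rb") as f:
              f.seek(offset)
              decomp = bz2.BZ2Decompressor()
              parts = []
              while not decomp.eof:
                  data = f.read(65536)
                  if not data:
                      break
                  parts.append(decomp.decompress(data))
          return b"".join(parts).decode("utf-8")  # XML for ~100 <page> elements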
Generated Files
===============
- dumpIndex.db
  Holds data from the enwiki dump index file. Generated by genDumpIndexDb.py,
  and used by lookupPage.py to get the content for a given page title (a
  lookup sketch follows the table list below).
  Tables:
  - offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT
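  A hedged sketch of the lookup this table enables, assuming next_offset is
  the offset of the following chunk (the real lookupPage.py may well differ):

      import bz2
      import sqlite3

      def lookup_chunk(title, db_path="dumpIndex.db",
                       dump_path="enwiki-20220501-pages-articles-multistream.xml.bz2"):
          """Return the decompressed XML of the chunk containing 'title'."""
          conn = sqlite3.connect(db_path)
          row = conn.execute(
              "SELECT offset, next_offset FROM offsets WHERE title = ?",
              (title,)).fetchone()
          conn.close()
          if row is None:
              return None
          offset, next_offset = row  # the last chunk would need the file size
          with open(dump_path, "rb") as f:
              f.seek(offset)
              compressed = f.read(next_offset - offset)
          # One complete bz2 stream holding ~100 <page> elements; the caller
          # still has to pick out the page whose <title> matches.
          return bz2.decompress(compressed).decode("utf-8")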
- enwikiData.db
  Holds data obtained from the enwiki dump file, in 'pages', 'redirects', and
  'descs' tables. Generated by genData.py, which uses the Python packages
  mwxml and mwparserfromhell (a rough sketch of such a pass follows the table
  list below).
  Tables:
  - pages: id INT PRIMARY KEY, title TEXT UNIQUE
  - redirects: id INT PRIMARY KEY, target TEXT
  - descs: id INT PRIMARY KEY, desc TEXT
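  A rough sketch of the kind of streaming pass genData.py presumably makes
  over the dump (assumed shape only; the database writes are elided):

      import bz2
      import mwxml
      import mwparserfromhell

      def iter_pages(dump_path="enwiki-20220501-pages-articles-multistream.xml.bz2"):
          """Yield (id, title, redirect_target, parsed_wikicode) per page."""
          with bz2.open(dump_path, "rb") as f:
              for page in mwxml.Dump.from_file(f):
                  if page.redirect is not None:
                      yield page.id, page.title, page.redirect, None  # -> redirects
                      continue
                  for revision in page:  # pages-articles dumps have one revision
                      wikicode = mwparserfromhell.parse(revision.text or "")
                      # -> pages; a short description for descs could be
                      #    derived from 'wikicode' here
                      yield page.id, page.title, None, wikicode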
- enwikiImgs.db
  Holds infobox-image data obtained for some set of wiki page-ids. Generated
  by running getEnwikiImgData.py, which uses the enwiki dump file and
  dumpIndex.db (an illustrative infobox-image extractor follows the table
  list below).
  Tables:
  - page_imgs: page_id INT PRIMARY KEY, img_name TEXT
    (img_name may be null, which marks the page-id as processed so it is not
    re-processed on a second pass)
  - imgs: name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT,
    restrictions TEXT, url TEXT
    (may lack rows for some 'img_name' values in 'page_imgs', due to an
    inability to get license info)
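  For illustration, an infobox image name can be pulled from wikitext with
  mwparserfromhell roughly as below; the 'image' parameter name is an
  assumption (infobox templates vary), and this is not taken from
  getEnwikiImgData.py itself:

      import mwparserfromhell

      def infobox_image(wikitext):
          """Return the first infobox's image parameter value, or None."""
          wikicode = mwparserfromhell.parse(wikitext)
          for template in wikicode.filter_templates():
              if str(template.name).strip().lower().startswith("infobox"):
                  # 'image' is a common parameter name, but not universal
                  if template.has("image"):
                      value = str(template.get("image").value).strip()
                      return value or None
          return None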