Downloaded Files
================
- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
Obtained via <https://dumps.wikimedia.org/backup-index.html>
(the site suggests downloading from a mirror). Contains the text
content and metadata of English Wikipedia pages (current
revisions only; talk pages excluded). Details on the file's
content and format are available at
<https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
Obtained as above. Holds lines of the form offset:pageId:title;
the offset locates, within the dump file, the compressed chunk of
up to 100 pages that contains the given page (see the sketch below).
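
These offsets work because the multistream dump is a concatenation of
independent bz2 streams, each holding up to 100 pages, so a single chunk
can be decompressed without touching the rest of the file. A minimal
sketch of reading one chunk (the function name and its `next_offset`
parameter are illustrative, not taken from this repository's scripts):

```python
import bz2

def read_chunk(dump_path, offset, next_offset=None):
    """Decompress the bz2 stream of up to 100 pages starting at `offset`.

    `next_offset` is where the following stream starts (the next distinct
    offset in the index file); None means the stream runs to the end of
    the dump. Returns the chunk's XML as a string.
    """
    with open(dump_path, "rb") as f:
        f.seek(offset)
        length = -1 if next_offset is None else next_offset - offset
        data = f.read(length)
    return bz2.decompress(data).decode("utf-8")
```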
Generated Files
===============
- dumpIndex.db <br>
Holds data from the enwiki dump index file. Generated by
genDumpIndexDb.py, and used by lookupPage.py to get the content for a
given page title (a build sketch follows this list). <br>
Tables: <br>
- offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next\_offset INT
- enwikiData.db <br>
Holds data obtained from the enwiki dump file, in 'pages',
'redirects', and 'descs' tables. Generated by genData.py, which uses
the Python packages mwxml and mwparserfromhell (an extraction sketch
follows this list). <br>
Tables: <br>
- pages: id INT PRIMARY KEY, title TEXT UNIQUE
- redirects: id INT PRIMARY KEY, target TEXT
- descs: id INT PRIMARY KEY, desc TEXT
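
For dumpIndex.db, here is a sketch of how the offsets table could be
populated from the index file. The actual genDumpIndexDb.py may work
differently; in particular, deriving next_offset as the next distinct
offset in the index is an assumption read off the schema above:

```python
import bz2
import sqlite3

conn = sqlite3.connect("dumpIndex.db")
conn.execute("CREATE TABLE offsets (title TEXT PRIMARY KEY, id INT UNIQUE,"
             " offset INT, next_offset INT)")

rows = []
with bz2.open("enwiki-20220501-pages-articles-multistream-index.txt.bz2",
              "rt", encoding="utf-8") as f:
    for line in f:
        # Titles may themselves contain ':', so split at most twice.
        offset, page_id, title = line.rstrip("\n").split(":", 2)
        rows.append((title, int(page_id), int(offset)))

# Assumption: next_offset records the start of the following chunk,
# with None (NULL) marking the final chunk in the dump.
distinct = sorted({o for _, _, o in rows})
next_of = dict(zip(distinct, distinct[1:] + [None]))
conn.executemany("INSERT OR IGNORE INTO offsets VALUES (?, ?, ?, ?)",
                 ((t, i, o, next_of[o]) for t, i, o in rows))
conn.commit()
conn.close()
```

lookupPage.py can then resolve a title with a single query, e.g.
`SELECT offset, next_offset FROM offsets WHERE title = ?`, and hand the
resulting pair to a reader like `read_chunk` above.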
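
For enwikiData.db, here is a sketch of an extraction pass over the dump
using the two named packages. How genData.py actually derives desc is
not stated here, so taking the first line of the stripped wikitext is
purely an assumption:

```python
import bz2
import sqlite3

import mwparserfromhell
import mwxml

conn = sqlite3.connect("enwikiData.db")
conn.executescript("""
    CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT);
    CREATE TABLE descs (id INT PRIMARY KEY, desc TEXT);
""")

dump = mwxml.Dump.from_file(
    bz2.open("enwiki-20220501-pages-articles-multistream.xml.bz2", "rb"))
for page in dump:
    conn.execute("INSERT OR IGNORE INTO pages VALUES (?, ?)",
                 (page.id, page.title))
    if page.redirect is not None:
        # Redirect pages carry a target title instead of article text.
        conn.execute("INSERT OR IGNORE INTO redirects VALUES (?, ?)",
                     (page.id, page.redirect))
        continue
    revision = next(iter(page))  # the dump holds current revisions only
    text = mwparserfromhell.parse(revision.text or "").strip_code()
    # Assumption: desc is the first line of the plain (de-markup'd) text.
    desc = text.strip().split("\n", 1)[0]
    conn.execute("INSERT OR IGNORE INTO descs VALUES (?, ?)",
                 (page.id, desc))
conn.commit()
conn.close()
```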