Downloaded Files
================
-   enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
    Obtained via <https://dumps.wikimedia.org/backup-index.html>
    (site suggests downloading from a mirror).  Contains text
    content and metadata for pages in English Wikipedia
    (current revision only, excludes talk pages).  Some file
    content and format information was available from
    <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
-   enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
    Obtained as above. Each line has the form offset1:pageId1:title1,
    giving, for each page, the offset into the dump file of the chunk
    of 100 pages that contains it.
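
As a rough illustration of how the index file relates to the dump file, the
sketch below parses one index line and decompresses the chunk that starts at a
given offset. The names `parse_index_line` and `read_chunk` are hypothetical
and are not taken from the scripts in this directory. It assumes the offsets
are byte offsets of independent bz2 streams, and that a chunk ends at the next
distinct offset listed in the index.

```python
import bz2
from typing import BinaryIO, Optional


def parse_index_line(line: str) -> tuple[int, int, str]:
    """Split an 'offset:pageId:title' line into its three fields."""
    # Titles may themselves contain ':', so split at most twice.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_chunk(dump: BinaryIO, offset: int,
               next_offset: Optional[int] = None) -> str:
    """Decompress the chunk that starts at 'offset' in an open dump file.

    If 'next_offset' is None (an assumption for the final chunk), everything
    up to the end of the file is read; bz2.decompress() accepts several
    concatenated streams, so any trailing stream is decompressed as well.
    """
    dump.seek(offset)
    if next_offset is None:
        data = dump.read()
    else:
        data = dump.read(next_offset - offset)
    return bz2.decompress(data).decode("utf-8")
```

With the dump file opened in binary mode (`open(..., "rb")`),
`read_chunk(dump, offset, next_offset)` should return the XML text of the
chunk, which can then be searched for the wanted page.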

Generated Files
===============
-   dumpIndex.db <br>
    Holds data from the enwiki dump index file. Generated by
    genDumpIndexDb.py, and used by lookupPage.py to get content for a
    given page title. <br>
    Tables: <br>
    -   offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next\_offset INT
-   enwikiData.db <br>
    Holds data obtained from the enwiki dump file, in 'pages',
    'redirects', and 'descs' tables. Generated by genData.py, which uses
    the Python packages mwxml and mwparserfromhell. <br>
    Tables: <br>
    -   pages:     id INT PRIMARY KEY, title TEXT UNIQUE
    -   redirects: id INT PRIMARY KEY, target TEXT
    -   descs:     id INT PRIMARY KEY, desc TEXT
-   enwikiImgs.db <br>
    Holds infobox-images obtained for some set of wiki page-ids.
    Generated by running getEnwikiImgData.py, which uses the enwiki dump
    file and dumpIndex.db. <br>
    Tables: <br>
    -   page\_imgs: page\_id INT PRIMARY KEY, img\_name TEXT
        (img\_name may be null, which is used to avoid re-processing the page-id on a second pass)
    -   imgs: name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT
        (may lack entries for some 'img\_name' values in 'page\_imgs', when license info could not be obtained)
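
To make the 'offsets' table of dumpIndex.db more concrete, here is a minimal
sketch that creates a table with the columns listed above, fills it from
parsed index entries, and looks up a title. It is not necessarily how
genDumpIndexDb.py or lookupPage.py work. The helper names are hypothetical,
and storing NULL as the next\_offset of the final chunk is an assumption.

```python
import sqlite3


def create_offsets_table(con: sqlite3.Connection) -> None:
    # Same columns as the 'offsets' table described above.
    con.execute(
        "CREATE TABLE offsets ("
        "title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT)"
    )


def add_index_entries(con: sqlite3.Connection,
                      entries: list[tuple[int, int, str]]) -> None:
    """Insert (offset, page_id, title) entries, deriving next_offset."""
    # Map each distinct chunk offset to the offset of the following chunk.
    offsets = sorted({offset for offset, _, _ in entries})
    next_of = dict(zip(offsets, offsets[1:]))
    rows = [
        # Assumption: the last chunk has no successor, so it gets NULL.
        (title, page_id, offset, next_of.get(offset))
        for offset, page_id, title in entries
    ]
    con.executemany("INSERT INTO offsets VALUES (?, ?, ?, ?)", rows)
    con.commit()


def lookup(con: sqlite3.Connection, title: str):
    """Return (id, offset, next_offset) for a title, or None if absent."""
    cur = con.execute(
        "SELECT id, offset, next_offset FROM offsets WHERE title = ?",
        (title,),
    )
    return cur.fetchone()
```

The pair (offset, next\_offset) returned by `lookup` delimits the compressed
chunk of the dump file that holds the page.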