File Generation Process
=======================

1   Tree Structure Data
    1   Obtain data in otol/, as specified in its README.
    2   Run genOtolData.py, which creates data.db, and adds
        'nodes' and 'edges' tables using data in otol/*.
2   Name Data for Search
    1   Obtain data in eol/, as specified in its README.
    2   Run genEolNameData.py, which adds 'names' and 'eol\_ids' tables to data.db,
        using data in eol/vernacularNames.csv and the 'nodes' table.
3   Image Data
    1   Run downloadImgsForReview.py to download EOL images into imgsForReview/.
        It uses data in eol/imagesList.db, and the 'eol\_ids' table.
    2   Run reviewImgs.py to filter images in imgsForReview/ into EOL-id-unique
        images in imgsReviewed/ (uses 'names' and 'eol\_ids' to display extra info).
    3   Run genImgsForWeb.py to create cropped/resized images in img/, using
        images in imgsReviewed/, and also to add an 'images' table to data.db.
    4   Run genLinkedImgs.py, which uses the 'nodes', 'edges', 'eol\_ids', and
        'images' tables to add a 'linked\_imgs' table to data.db, associating
        nodes without images with child images.
4   Node Description Data
    -   Using DBpedia
        1   Obtain data in dbpedia/, as specified in its README.
        2   Run genDbpData.py, which adds a 'descs' table to data.db, using
            data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
    -   Supplementing with Wikipedia dump
        1   Obtain data in enwiki/, as specified in its README.
        2   Run genEnwikiData.py, which adds to the 'descs' table, using data in
            enwiki/enwikiData.db, and the 'nodes' table.
5   Reduced Tree Structure Data
    1   Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
        data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
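
The script order above can be sketched as a small driver. This is only an
illustration: it assumes each script runs with no arguments from this
directory, which the steps above do not confirm, and `PIPELINE` and
`run_pipeline` are hypothetical names. The data in otol/, eol/, dbpedia/, and
enwiki/ must still be obtained first, and reviewImgs.py is a manual review
step, so in practice it needs a person at the keyboard.

```python
import subprocess
import sys

# Script order taken from the steps above. Running each script with no
# arguments is an assumption, not something this README states.
PIPELINE = [
    "genOtolData.py",            # creates data.db with 'nodes' and 'edges'
    "genEolNameData.py",         # adds 'names' and 'eol_ids'
    "downloadImgsForReview.py",  # fills imgsForReview/
    "reviewImgs.py",             # manual filtering into imgsReviewed/
    "genImgsForWeb.py",          # fills img/ and adds 'images'
    "genLinkedImgs.py",          # adds 'linked_imgs'
    "genDbpData.py",             # adds 'descs'
    "genEnwikiData.py",          # adds more rows to 'descs'
    "genReducedTreeData.py",     # adds 'r_nodes' and 'r_edges'
]


def run_pipeline(steps=PIPELINE, runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code."""
    completed = []
    for script in steps:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Stopping at the first failure matters here because later scripts read tables
that earlier scripts create.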

data.db Tables
==============
-   nodes:        name TEXT PRIMARY KEY, tips INT
-   edges:        node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-   names:        name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
-   eol\_ids:     id INT PRIMARY KEY, name TEXT
-   images:       eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
-   linked\_imgs: name TEXT PRIMARY KEY, eol\_id INT, eol\_id2 INT
-   descs:        name TEXT PRIMARY KEY, desc TEXT, redirected INT, wiki\_id INT, from\_dbp INT
-   r\_nodes:     name TEXT PRIMARY KEY, tips INT
-   r\_edges:     node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
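
As a minimal sketch of how these tables fit together, the example below
builds an in-memory database with the 'nodes' and 'edges' schemas listed
above and looks up a node's children. The rows are made up for illustration,
and `make_demo_db` and `children_of` are hypothetical helpers, not part of
the scripts in this directory.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database using the 'nodes' and 'edges' schemas."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, tips INT)")
    con.execute(
        "CREATE TABLE edges (node TEXT, child TEXT, p_support INT,"
        " PRIMARY KEY (node, child))"
    )
    # Made-up rows for illustration only; not real tree data.
    con.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        [("parent", 2), ("child a", 1), ("child b", 1)],
    )
    con.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [("parent", "child a", 1), ("parent", "child b", 0)],
    )
    return con


def children_of(con, name):
    """Return (child, tips) pairs for a node, ordered by child name."""
    query = (
        "SELECT e.child, n.tips FROM edges e"
        " JOIN nodes n ON n.name = e.child"
        " WHERE e.node = ? ORDER BY e.child"
    )
    return con.execute(query, (name,)).fetchall()
```

Against the real data.db, the same query should work by replacing
`make_demo_db()` with `sqlite3.connect("data.db")`, assuming the file has
been generated as described earlier.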

Other Files
===========
-   dbpPickedLabels.txt <br>
    Contains DBpedia labels, one per line. Used by genDbpData.py to help
    resolve conflicts when associating tree-of-life node names with
    DBpedia node labels. It was generated by manually editing the output
    of genDbpConflicts.py.
-   genDbpConflicts.py <br>
    Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
    and looks for potential conflicts that would arise when genDbpData.py
    tries to associate tree-of-life node names with DBpedia node labels. It
    writes data about them to conflicts.txt, which can be manually edited
    to resolve them.