path: root/backend/data/README.md
File Generation Process
=======================
1   Tree Structure Data
    1   Obtain data in otol/, as specified in its README.
    2   Run genOtolData.py, which creates data.db, and adds
        'nodes' and 'edges' tables using data in otol/*, as well as
        genOtolNamesToKeep.txt, if present.
2   Name Data for Search
    1   Obtain data in eol/, as specified in its README.
    2   Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
        using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
        genEolNameDataPickedIds.txt.
3   Node Description Data
    1   Obtain data in dbpedia/ and enwiki/, as specified in their README files.
    2   Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
        using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
        genDescNamesToSkip.txt and dbpPickedLabels.txt.
    3   Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
        using data in enwiki/enwikiData.db, and the 'nodes' table.
        Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
        skipping/resolving some name-page associations.
4   Image Data
    1   In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
        It uses data in eol/imagesList.db, and the 'eol_ids' table.
    2   In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
        images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
    3   In enwiki/, run getEnwikiImgData.py, which generates a list of
        tol-node images, and creates enwiki/enwikiImgs.db to store it.
        Uses the 'wiki_ids' table to get tol-node wiki-ids.
    4   In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
        information for images listed in enwiki/enwikiImgs.db, and stores
        it in that db.
    5   In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
        images listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
    6   Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
        and enables choosing, for each tol-node, which image should be used, if any,
        and outputs choice information into mergedImgList.txt. Uses the 'nodes',
        'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
    7   Run genImgsForWeb.py, which creates cropped/resized images in img/,
        using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
        'node_imgs' tables to data.db. <br>
        Smartcrop's outputs might need to be manually created/adjusted: <br>
        -   An input image might have no output produced, possibly due to
            data incompatibilities, memory limits, etc. A few input image files
            might actually be html files, containing a 'file not found' page.
        -   An input x.gif might produce x-1.jpg, x-2.jpg, etc., instead of x.jpg.
        -   An input image might produce output with unexpected dimensions.
            This seems to happen when the image is very large, and triggers a
            decompression bomb warning.
        The result might have as many as 150k images, with about 2/3 of them
        being from Wikipedia.
    8   Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
        which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
        nodes without images to child images.
5   Reduced Tree Structure Data
    1   Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
        data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
6   Other
    -   Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
        using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
    -   Optionally run addPickedNames.py, which adds manually-picked names to
        the 'names' table, as specified in pickedNames.txt.
    -   Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
        for the sake of performance and result relevance. Without this, rendering after
        a jump to certain nodes within the fungi and moths can take over a minute.
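
The child-image fallback in step 4.8 can be illustrated with a toy sketch: a breadth-first walk down a simplified 'edges' table until a node with an image is found. The schemas below are trimmed from the real ones, the rows and the function are invented for illustration, and genLinkedImgs.py itself may work differently:

```python
import sqlite3

# Toy database with trimmed versions of the 'edges' and 'node_imgs' tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
    CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT);
""")
db.executemany("INSERT INTO edges VALUES (?, ?, 1)",
               [("carnivora", "felidae"), ("felidae", "felis catus")])
db.execute("INSERT INTO node_imgs VALUES ('felis catus', 10, 'eol')")

def nearest_imaged_descendants(name):
    """Breadth-first search down 'edges' for the closest nodes that have an image."""
    frontier = [name]
    while frontier:
        found = [n for n in frontier
                 if db.execute("SELECT 1 FROM node_imgs WHERE name = ?", (n,)).fetchone()]
        if found:
            return found  # nodes at this depth have images; use them
        # Descend one level: gather all children of the current frontier.
        frontier = [r[0] for r in db.execute(
            "SELECT child FROM edges WHERE node IN (%s)" % ",".join("?" * len(frontier)),
            frontier)]
    return []

print(nearest_imaged_descendants("carnivora"))  # -> ['felis catus']
```

A real run would operate on data.db and write the results into 'linked_imgs' rather than printing them.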

data.db Tables
==============
-   nodes:        name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-   edges:        node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-   eol\_ids:     id INT PRIMARY KEY, name TEXT
-   names:        name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-   wiki\_ids:    name TEXT PRIMARY KEY, id INT, redirected INT
-   descs:        wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-   node\_imgs:   name TEXT PRIMARY KEY, img\_id INT, src TEXT
-   images:       id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-   linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-   r\_nodes:     name TEXT PRIMARY KEY, tips INT
-   r\_edges:     node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
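
As a rough illustration of how the search-related tables connect, the sketch below resolves a vernacular name to its canonical node via a join. The schemas are copied from the list above; the rows and query are invented for the example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
    CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT, src TEXT,
                        PRIMARY KEY (name, alt_name));
""")
db.execute("INSERT INTO nodes VALUES ('canis lupus', 'ott1', 1)")  # toy id
db.execute("INSERT INTO names VALUES ('canis lupus', 'gray wolf', 1, 'eol')")

# Resolve a vernacular name to its canonical node, as a search feature might.
row = db.execute("""
    SELECT nodes.name, nodes.tips
    FROM names JOIN nodes ON names.name = nodes.name
    WHERE names.alt_name = ?
""", ("gray wolf",)).fetchone()
print(row)  # -> ('canis lupus', 1)
```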

Other Files
===========
-   dbpPickedLabels.txt <br>
    Contains DBpedia labels, one per line. Used by genDbpData.py to help
    resolve conflicts when associating tree-of-life node names with
    DBpedia node labels.
-   genOtolNamesToKeep.txt <br>
    Contains names to avoid trimming from the tree data generated by
    genOtolData.py. Usage is optional, but without it, a large number
    of possibly significant nodes are removed by a short-sighted
    heuristic. <br>
    One way to generate this list is to generate the files as usual,
    then collect the node names that have an associated image, description,
    or presence in r_nodes. Then run genOtolData.py and genEolNameData.py
    again (after deleting the tables they created).
-   genEnwikiDescNamesToSkip.txt <br>
    Contains names of nodes for which genEnwikiDescData.py should skip
    adding a description. Usage is optional, but without it, some nodes will
    probably get descriptions that don't match (e.g. the bee genus Osiris
    might be described as an Egyptian god). <br>
    This file was generated by running genEnwikiDescData.py, listing
    the names it added into a file along with their descriptions, and
    manually removing those that seemed to match their nodes (this yielded
    about 30k lines, with roughly 1 in 30 descriptions non-matching).
    After genEnwikiDescTitlesToUse.txt was created, names shared with
    that file were also removed.
-   genEnwikiDescTitlesToUse.txt <br>
    Contains enwiki titles of the form 'name1 (category1)', which
    genEnwikiDescData.py uses to resolve nodes matching name1.
    Usage is optional, but it adds some descriptions that would otherwise
    be skipped. <br>
    This file was generated by taking the names remaining after the
    manual filtering step above, then, for each name name1, getting
    page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
    This was followed by manually removing lines, keeping those that
    seemed to match the corresponding node (the app was used to help
    with this).
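
The 'name1 (category1)' title convention used above can be split with a small regex. This is a sketch only: name1/category1 are this README's placeholders, and the real scripts may parse titles differently:

```python
import re

# Matches titles like "Osiris (bee)"; titles without a parenthesized
# category fall through and get a category of None.
TITLE_RE = re.compile(r"^(?P<name>.+?) \((?P<category>[^()]+)\)$")

def split_title(title):
    """Split an enwiki title into (name, category-or-None)."""
    m = TITLE_RE.match(title)
    if m:
        return m.group("name"), m.group("category")
    return title, None

print(split_title("Osiris (bee)"))  # -> ('Osiris', 'bee')
print(split_title("Osiris"))        # -> ('Osiris', None)
```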