File Generation Process
=======================
1. Tree Structure Data
   1. Obtain data in otol/, as specified in its README.
   2. Run genOtolData.py, which creates data.db and adds 'nodes' and 'edges' tables, using data in otol/*, as well as genOtolNamesToKeep.txt, if present.
2. Name Data for Search
   1. Obtain data in eol/, as specified in its README.
   2. Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db, using data in eol/vernacularNames.csv and the 'nodes' table.
3. Image Data
   1. In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/. It uses data in eol/imagesList.db and the 'eol_ids' table.
   2. In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
   3. Run genImgsForWeb.py to create cropped/resized images in img/, using images in eol/imgsReviewed/, and also to add an 'images' table to data.db.
   4. Run genLinkedImgs.py to add a 'linked_imgs' table to data.db, using 'nodes', 'edges', 'eol_ids', and 'images' to associate nodes without images with child images.
4. Node Description Data
   1. Obtain data in dbpedia/, as specified in its README.
   2. Run genDbpData.py, which adds a 'descs' table to data.db, using data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
5. Supplementary Name/Description Data
   1. Obtain data in enwiki/, as specified in its README.
   2. Run genEnwikiDescData.py, which adds to the 'descs' table, using data in enwiki/enwikiData.db and the 'nodes' table. It also uses genEnwikiDesc*.txt files to skip or resolve some name-page associations.
   3. Run genEnwikiNameData.py, which adds to the 'names' table, using data in enwiki/enwikiData.db and the 'names' and 'descs' tables.
6. Reduced Tree Structure Data
   1. Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to data.db, using reducedTol/names.txt and the 'nodes' and 'names' tables.
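The first step reduces to creating data.db and its two tree tables. A minimal sketch, using the schemas listed under 'data.db Tables' (the actual parsing of otol/* is omitted, and the sample rows are purely illustrative):

```python
import sqlite3

def create_tree_tables(db_path):
    """Create the 'nodes' and 'edges' tables used by the pipeline.

    Schemas match the 'data.db Tables' section; loading real otol/ data
    is omitted here.
    """
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, tips INT)")
    cur.execute(
        "CREATE TABLE edges (node TEXT, child TEXT, p_support INT, "
        "PRIMARY KEY (node, child))")
    # Tiny example tree: one root node with two tip children.
    cur.execute("INSERT INTO nodes VALUES ('cellular organisms', 2)")
    cur.executemany("INSERT INTO nodes VALUES (?, 1)",
                    [("Bacteria",), ("Archaea",)])
    cur.executemany("INSERT INTO edges VALUES ('cellular organisms', ?, 1)",
                    [("Bacteria",), ("Archaea",)])
    con.commit()
    return con
```

Later steps attach their tables ('names', 'eol_ids', 'images', and so on) to the same data.db file.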
data.db Tables
==============
- nodes: name TEXT PRIMARY KEY, tips INT
- edges: node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)
- names: name TEXT, alt_name TEXT, pref_alt INT, PRIMARY KEY (name, alt_name)
- eol_ids: id INT PRIMARY KEY, name TEXT
- images: eol_id INT PRIMARY KEY, source_url TEXT, license TEXT, copyright_owner TEXT
- linked_imgs: name TEXT PRIMARY KEY, eol_id INT, eol_id2 INT
- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT, wiki_id INT, from_dbp INT
- r_nodes: name TEXT PRIMARY KEY, tips INT
- r_edges: node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)

Other Files
===========
- dbpPickedLabels.txt
  Contains DBpedia labels, one per line. Used by genDbpData.py to help resolve conflicts when associating tree-of-life node names with DBpedia node labels. It was generated by manually editing the output of genDbpConflicts.py.
- genDbpConflicts.py
  Reads data from dbpedia/dbpData.db and the 'nodes' table of data.db, looking for potential conflicts that would arise when genDbpData.py tries to associate tree-of-life node names with DBpedia node labels. It writes data about them to conflicts.txt, which can be manually edited to resolve them.
- genOtolNamesToKeep.txt
  Contains names that genOtolData.py should avoid trimming from the tree data it generates. Usage is optional, but without it, a large number of possibly-significant nodes are removed by a short-sighted heuristic.
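  The role the keep-list plays can be sketched as follows. The threshold heuristic and all names below are assumptions for illustration; the actual trimming heuristic in genOtolData.py is not specified here.

```python
def trim_nodes(tips, names_to_keep, min_tips=10):
    """Decide which nodes survive trimming: a node is kept if its tip
    count passes a (hypothetical) threshold heuristic, or if it is
    explicitly listed in names_to_keep, which is the role that
    genOtolNamesToKeep.txt plays.

    tips: dict mapping node name -> tip count.
    """
    return {
        name for name, n in tips.items()
        if n >= min_tips or name in names_to_keep
    }

# Example: 'Tardigrada' would be trimmed by the heuristic alone,
# but an entry in the keep-list preserves it.
tips = {"Chordata": 50000, "Tardigrada": 3}
kept = trim_nodes(tips, names_to_keep={"Tardigrada"}, min_tips=10)
```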
  One way to generate this list is to generate the files as usual, then collect the node names that have an associated image, a description, or a presence in r_nodes. Then run genOtolData.py and genEolNameData.py again (after deleting the tables they created).
- genEnwikiDescNamesToSkip.txt
  Contains names of nodes for which genEnwikiDescData.py should skip adding a description. Usage is optional, but without it, some nodes will likely get descriptions that don't match (e.g. the bee genus Osiris might be described as an Egyptian god).
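  Consulting such a skip-list might look like the sketch below (function names are hypothetical; the real candidate descriptions come from enwiki/enwikiData.db).

```python
def load_skip_names(path):
    """Read one name per line, ignoring blank lines (assumed file format)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def pick_descs(candidates, skip_names):
    """Drop candidate name->description pairs whose enwiki page likely
    describes something other than the node (e.g. 'Osiris' the god
    rather than the bee genus)."""
    return {n: d for n, d in candidates.items() if n not in skip_names}
```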
  This file was generated by running genEnwikiDescData.py, listing the names it added (along with their descriptions) into a file, and manually removing those that seemed to match their node (this gave about 30k lines, with roughly 1 in 30 descriptions non-matching). After genEnwikiDescTitlesToUse.txt was created, names shared with that file were also removed.
- genEnwikiDescTitlesToUse.txt
  Contains enwiki titles of the form 'name1 (category1)', which genEnwikiDescData.py uses to resolve nodes named name1. Usage is optional, but it adds some descriptions that would otherwise be skipped.
  This file was generated by taking the names produced by the manual filtering step above, and, for each name name1, getting page titles from dbpedia/dbpData.db that match 'name1 (category1)'. Lines were then manually removed, keeping those that seemed to match the corresponding node (the app helped with this).
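  The title-matching part of that generation step can be sketched as below. The function name is hypothetical, and since dbpData.db's actual table layout isn't assumed here, the page titles are passed in as an in-memory list.

```python
import re

def find_disambiguated_titles(names, titles):
    """For each name, collect page titles of the form 'name (category)'.

    Returns a dict mapping each name to its matching titles, omitting
    names with no matches.
    """
    out = {}
    for name in names:
        # Match the exact name followed by a single parenthesized category.
        pat = re.compile(re.escape(name) + r" \([^)]+\)")
        matches = [t for t in titles if pat.fullmatch(t)]
        if matches:
            out[name] = matches
    return out
```

For example, given the titles 'Osiris (bee)', 'Osiris', and 'Osiris (god)', looking up 'Osiris' would return the two parenthesized titles, leaving the final keep/drop decision to manual review.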