File Generation Process
=======================
1 Tree Structure Data
1 Obtain data in otol/, as specified in its README.
2 Run genOtolData.py, which creates data.db, and adds
'nodes' and 'edges' tables using data in otol/*.
2 Name Data for Search
1 Obtain data in eol/, as specified in its README.
2 Run genEolNameData.py, which adds 'names' and 'eol\_ids' tables to data.db,
using data in eol/vernacularNames.csv and the 'nodes' table.
3 Image Data
1 Run downloadImgsForReview.py to download EOL images into imgsForReview/.
It uses data in eol/imagesList.db, and the 'eol\_ids' table.
2 Run reviewImgs.py to filter images in imgsForReview/ into EOL-id-unique
images in imgsReviewed/ (uses 'names' and 'eol\_ids' to display extra info).
3 Run genImgsForWeb.py to create cropped/resized images in img/, using
images in imgsReviewed/, and also to add an 'images' table to data.db.
4 Run genLinkedImgs.py, which adds a 'linked_imgs' table to data.db.
It uses 'nodes', 'edges', 'eol_ids', and 'images' to associate
nodes that have no image with images of their child nodes.
4 Node Description Data
- Using DBpedia
1 Obtain data in dbpedia/, as specified in its README.
2 Run genDbpData.py, which adds a 'descs' table to data.db, using
data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
- Supplementing with Wikipedia dump
1 Obtain data in enwiki/, as specified in its README.
2 Run genEnwikiData.py, which adds to the 'descs' table, using data in
enwiki/enwikiData.db, and the 'nodes' table.
5 Reduced Tree Structure Data
1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
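The ordering of the steps above can be sketched as a small driver script. This is only an illustration: the script names and their order come from the list above, but running each script with no arguments through the current Python interpreter is an assumption, and `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names. Obtaining the data in otol/, eol/, dbpedia/, and enwiki/ still has to be done by hand first, and reviewImgs.py is a manual review step.

```python
import subprocess
import sys

# Script order taken from the numbered steps above. Invoking each script
# with no arguments is an assumption of this sketch.
PIPELINE = [
    ("Tree structure data", ["genOtolData.py"]),
    ("Name data for search", ["genEolNameData.py"]),
    ("Image data", ["downloadImgsForReview.py", "reviewImgs.py",
                    "genImgsForWeb.py", "genLinkedImgs.py"]),
    ("Node description data", ["genDbpData.py", "genEnwikiData.py"]),
    ("Reduced tree structure data", ["genReducedTreeData.py"]),
]

def build_commands(python=sys.executable):
    """Return one command per script, in the order the steps are listed."""
    return [[python, script] for _, scripts in PIPELINE for script in scripts]

def run_pipeline(dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on a non-zero exit, so later steps never
            # run against a data.db that an earlier step failed to update.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` would actually execute them in sequence.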
data.db Tables
==============
- nodes: name TEXT PRIMARY KEY, tips INT
- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
- names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
- eol\_ids: id INT PRIMARY KEY, name TEXT
- images: eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
- linked\_imgs: name TEXT PRIMARY KEY, eol\_id INT, eol\_id2 INT
- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT, wiki\_id INT, from\_dbp INT
- r\_nodes: name TEXT PRIMARY KEY, tips INT
- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
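As a rough illustration of the layout above, the following sketch recreates the tables in an in-memory SQLite database and runs one lookup against 'edges'. The column definitions are copied from the list; the `children_of` helper and the sample rows are made up for the example and are not taken from the real data.db.

```python
import sqlite3

# Table definitions copied from the list above. "desc" is quoted because
# DESC is also an SQL keyword.
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, tips INT);
CREATE TABLE edges (node TEXT, child TEXT, p_support INT,
                    PRIMARY KEY (node, child));
CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT,
                    PRIMARY KEY (name, alt_name));
CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
CREATE TABLE images (eol_id INT PRIMARY KEY, source_url TEXT,
                     license TEXT, copyright_owner TEXT);
CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, eol_id INT, eol_id2 INT);
CREATE TABLE descs (name TEXT PRIMARY KEY, "desc" TEXT, redirected INT,
                    wiki_id INT, from_dbp INT);
CREATE TABLE r_nodes (name TEXT PRIMARY KEY, tips INT);
CREATE TABLE r_edges (node TEXT, child TEXT, p_support INT,
                      PRIMARY KEY (node, child));
"""

def children_of(con, name):
    """Return the names of a node's children, sorted alphabetically."""
    rows = con.execute(
        "SELECT child FROM edges WHERE node = ? ORDER BY child", (name,))
    return [row[0] for row in rows]

con = sqlite3.connect(":memory:")
con.executescript(SCHEMA)

# Hypothetical sample rows, for illustration only.
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [("felidae", 2), ("felis", 1), ("panthera", 1)])
con.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                [("felidae", "panthera", 1), ("felidae", "felis", 1)])

print(children_of(con, "felidae"))
```

The same query shape should apply to 'r_edges', since it shares its columns with 'edges'.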
Other Files
===========
- dbpPickedLabels.txt
Contains DBpedia labels, one per line. Used by genDbpData.py to help
resolve conflicts when associating tree-of-life node names with
DBpedia node labels. It was generated by manually editing the output
of genDbpConflicts.py.
- genDbpConflicts.py
Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
and looks for potential conflicts that would arise when genDbpData.py
tries to associate tree-of-life node names with DBpedia node labels. It
writes data about them to conflicts.txt, which can be manually edited
to resolve them.
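To make the role of dbpPickedLabels.txt more concrete, here is a minimal sketch of reading a one-label-per-line file and using it as a tie-breaker. Only the file format (one DBpedia label per line) comes from the description above. The `parse_picked_labels` and `choose_label` helpers, the tie-breaking rule, and the sample labels are assumptions made for illustration; the actual logic in genDbpData.py may differ.

```python
def parse_picked_labels(text):
    """Return the set of labels in the text, one per non-empty line."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def choose_label(candidates, picked):
    """Hypothetical tie-breaker between several candidate labels.

    A single candidate is accepted as-is. With several candidates, the
    one that appears in the picked set wins. Returns None if the
    conflict is still unresolved.
    """
    if len(candidates) == 1:
        return candidates[0]
    chosen = [label for label in candidates if label in picked]
    return chosen[0] if len(chosen) == 1 else None

# Illustrative content standing in for dbpPickedLabels.txt.
picked = parse_picked_labels("Pieris (butterfly)\n\nJaguar\n")

print(choose_label(["Pieris (plant)", "Pieris (butterfly)"], picked))
print(choose_label(["Morus (plant)", "Morus (bird)"], picked))
```

With the real file, the text would come from something like `open("dbpPickedLabels.txt").read()`.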