This directory holds files used to generate data.db, which contains tree-of-life data.
# Tables
## Tree Structure data
- `nodes`
Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
Represents a tree-of-life node. `tips` represents the number of no-child descendants.
- `edges`
Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)`
`p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise
## Node name data
- `eol_ids`
Format: `id INT PRIMARY KEY, name TEXT`
Associates an EOL ID with a node's name.
- `names`
Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
Associates a node with alternative names.
`pref_alt` is 1 if the alt-name is the most 'preferred' one.
`src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked').
## Node description data
- `wiki_ids`
Format: `name TEXT PRIMARY KEY, id INT, redirected INT`
Associates a node with a wikipedia page ID.
`redirected` is 1 if the node was associated with a different page that redirected to this one.
- `descs`
Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
Associates a wikipedia page ID with a short-description.
`from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise.
## Node image data
- `node_imgs`
Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
Associates a node with an image.
- `images`
Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
Represents an image, identified by a source ('eol', 'enwiki', or 'picked'), and a source-specific ID.
- `linked_imgs`
Format: `name TEXT PRIMARY KEY, otol_ids TEXT`
Associates a node with an image from another node.
`otol_ids` can be an otol ID, or two comma-separated otol IDs or empty strings.
The latter is used for compound nodes.
## Reduced-tree data
- `r_nodes`
Format: `name TEXT PRIMARY KEY, tips INT`
Like `nodes`, but for a reduced tree.
- `r_edges`
Format: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
Like `edges` but for a reduced tree.
# Generating the Database
For the most part, these steps should be done in order.
As a warning, the whole process takes a lot of time and file space. The tree will probably
have about 2.5 billion nodes. Downloading the images will take several days, and occupy over
200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
## Environment
The scripts are written in python and bash.
Some of the python scripts require third-party packages:
- jsonpickle: For encoding class objects as JSON.
- requests: For downloading data.
- PIL: For image processing.
- tkinter: For providing a basic GUI to review images.
- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
## Generate tree structure data
1. Obtain files in otol/, as specified in it's README.
2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
using data in otol/. It also uses these files, if they exist:
- pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3], etc.
This file can be used to manually specify which node should be named 'name1'.
## Generate node name data
1. Obtain 'name data files' in eol/, as specified in it's README.
2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
eol/ and the `nodes` table. It also uses these files, if they exist:
- pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
Specifies node names that should have a particular EOL ID, or no ID.
Quite a few taxons have ambiguous names, and may need manual correction.
For example, Viola may resolve to a taxon of butterflies or of plants.
- pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
Specifies that a node's alt-name set should exclude altName1.
## Generate node description data
### Get data from DBpedia
1. Obtain files in dbpedia/, as specified in it's README.
2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
dbpedia/ and the `nodes` table. It also uses these files, if they exist:
- pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
no description should be obtained. Many node names have a same-name
wikipedia page that describes something different (eg: Osiris).
- pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
Specifies node names that should have a particular associated page label.
### Get data from Wikipedia
1. Obtain 'description database files' in enwiki/, as specified in it's README.
2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
using data in enwiki/ and the `nodes` table.
It also uses these files, if they exist:
- pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
- pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
## Generate node image data
### Get images from EOL
1. Obtain 'image metadata files' in eol/, as specified in it's README.
2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
each node, providing the choice of which to use, moving them to eol/imgs/.
Uses `names` and `eol_ids` to display extra info.
### Get images from Wikipedia
1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
using the `wiki_ids` table, and stores them in a database.
2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
those images, using wikipedia's online API.
3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
images into enwiki/imgs/.
### Merge the image sets
1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
and enables choosing, for each node, which image should be used, if any,
and outputs choice information into imgList.txt. Uses the `nodes`,
`eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
`images` tables. If pickedImgs/ is present, images within it are also used.
The outputs might need to be manually created/adjusted:
- An input image might have no output produced, possibly due to
data incompatibilities, memory limits, etc. A few input image files
might actually be html files, containing a 'file not found' page.
- An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- An input image might produce output with unexpected dimensions.
This seems to happen when the image is very large, and triggers a
decompression bomb warning.
The result might have as many as 150k images, with about 2/3 of them
being from wikipedia.
### Add more image associations
1. Run genLinkedImgs.py, which tries to associate nodes without images to
images of it's children. Adds the `linked_imgs` table, and uses the
`nodes`, `edges`, and `node_imgs` tables.
## Do some post-processing
1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
using data in enwiki/, and the `names` and `wiki_ids` tables.
2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
the `names` table, as specified in pickedNames.txt.
- pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
A line like `name1|name1|1` causes a node to have no preferred alt-name.
3. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
4. Optionally run trimTree.py, which tries to remove some 'low significance' nodes,
for the sake of performance and content-relevance. Otherwise, some nodes may have
over 10k children, which can take a while to render (took over a minute in testing).