Diffstat (limited to 'backend/tolData/README.md')
-rw-r--r-- backend/tolData/README.md | 58
1 files changed, 27 insertions, 31 deletions
diff --git a/backend/tolData/README.md b/backend/tolData/README.md
index ba64114..75731ae 100644
--- a/backend/tolData/README.md
+++ b/backend/tolData/README.md
@@ -1,14 +1,14 @@
-This directory holds files used to generate data.db, which contains tree-of-life data.
+This directory holds files used to generate the tree-of-life database data.db.
-# Tables
-## Tree Structure data
+# Database Tables
+## Tree Structure
- `nodes` <br>
Format: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
- Represents a tree-of-life node. `tips` represents the number of no-child descendants.
+ Represents a tree-of-life node. `tips` holds the number of no-child descendants.
- `edges` <br>
Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
`p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise.
-## Node name data
+## Node Names
- `eol_ids` <br>
Format: `id INT PRIMARY KEY, name TEXT` <br>
Associates an EOL ID with a node's name.
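
As a rough illustration of the table formats listed above, the following sketch creates `nodes`, `edges`, and `eol_ids` in an in-memory SQLite database and looks up a node's children. The sample rows, the ID strings, and the `get_children` helper are assumptions made for this example; they are not taken from the scripts in this directory.

```python
import sqlite3

# In-memory database, so the sketch never touches a real data.db
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
    CREATE TABLE edges (parent TEXT, child TEXT, p_support INT,
                        PRIMARY KEY (parent, child));
    CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
""")

# Hypothetical sample rows: names, IDs, and tips values are placeholders
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                [("root", "ott1", 2), ("a", "ott2", 0), ("b", "ott3", 0)])
con.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                [("root", "a", 1), ("root", "b", 0)])

def get_children(con, name):
    """Return (child, tips, p_support) rows for a node, ordered by child name."""
    query = """
        SELECT edges.child, nodes.tips, edges.p_support
        FROM edges INNER JOIN nodes ON edges.child = nodes.name
        WHERE edges.parent = ?
        ORDER BY edges.child"""
    return con.execute(query, (name,)).fetchall()

children = get_children(con, "root")
```

The join on `edges.child = nodes.name` relies on `name` being the primary key of `nodes`, as the format above states.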
@@ -17,7 +17,7 @@ This directory holds files used to generate data.db, which contains tree-of-life
Associates a node with alternative names.
`pref_alt` is 1 if the alt-name is the most 'preferred' one.
`src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked').
-## Node description data
+## Node Descriptions
- `wiki_ids` <br>
Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br>
Associates a node with a wikipedia page ID.
@@ -26,7 +26,7 @@ This directory holds files used to generate data.db, which contains tree-of-life
Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
Associates a wikipedia page ID with a short-description.
`from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise.
-## Node image data
+## Node Images
- `node_imgs` <br>
Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br>
Associates a node with an image.
@@ -36,11 +36,10 @@ This directory holds files used to generate data.db, which contains tree-of-life
- `linked_imgs` <br>
Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br>
Associates a node with an image from another node.
- `otol_ids` can be an otol ID, or two comma-separated otol IDs or empty strings.
- The latter is used for compound nodes.
-## Reduced tree data
+ `otol_ids` can be an otol ID, or (for compound nodes) two comma-separated strings that may be otol IDs or empty.
+## Reduced Trees
- `nodes_t`, `nodes_i`, `nodes_p` <br>
- These are like `nodes`, but describe the nodes for various reduced trees.
+ These are like `nodes`, but describe nodes of reduced trees.
- `edges_t`, `edges_i`, `edges_p` <br>
Like `edges` but for reduced trees.
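
The `otol_ids` format described for `linked_imgs` (a single otol ID, or two comma-separated strings that may be empty) could be decoded along these lines. This is a minimal sketch: the `parse_otol_ids` helper and the example ID strings are hypothetical, and the real scripts may handle the field differently.

```python
from typing import List, Optional

def parse_otol_ids(otol_ids: str) -> List[Optional[str]]:
    """Split a linked_imgs.otol_ids value into its IDs.

    A plain ID yields a one-element list. A comma-separated pair (used for
    compound nodes) yields two elements, with None standing in for an
    empty slot.
    """
    if "," not in otol_ids:
        return [otol_ids]
    parts = otol_ids.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected at most two IDs, got: {otol_ids!r}")
    return [part if part else None for part in parts]
```

For example, `parse_otol_ids("ott1,")` gives `["ott1", None]`, meaning only the first half of a compound node has a linked image.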
@@ -53,24 +52,23 @@ have about 2.5 billion nodes. Downloading the images takes several days, and occ
200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
## Environment
-The scripts are written in python and bash.
-Some of the python scripts require third-party packages:
+Some of the scripts use third-party packages:
- jsonpickle: For encoding class objects as JSON.
- requests: For downloading data.
- PIL: For image processing.
- tkinter: For providing a basic GUI to review images.
- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
-## Generate tree structure data
+## Generate Tree Structure Data
1. Obtain files in otol/, as specified in its README.
2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
using data in otol/. It also uses these files, if they exist:
- - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
- tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
- Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3], etc.
- This file can be used to manually specify which node should be named 'name1'.
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. When nodes in the
+ tree have the same name (eg: Pholidota can refer to pangolins or orchids),
+ they get the names 'name1', 'name1 [2]', 'name1 [3]', etc. This file is used to
+ forcibly specify which node should be named 'name1'.
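
The naming scheme for same-named nodes, and the effect of pickedOtolNames.txt, can be sketched as follows. This is only an illustration under assumptions: `parse_picked` and `assign_names` are hypothetical helpers, the IDs are invented, and the order in which unpicked duplicates receive their suffixes (input order here) may differ from what genOtolData.py does.

```python
def parse_picked(lines):
    """Parse lines of the form 'name1|otolId1' into a {name: otol_id} dict."""
    picked = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        name, otol_id = line.split("|")
        picked[name] = otol_id
    return picked

def assign_names(nodes, picked=None):
    """Map each otol ID to a unique name.

    nodes: list of (otol_id, name) pairs, possibly with repeated names.
    picked: optional {name: otol_id} dict naming the node that keeps the
            unsuffixed name.
    """
    picked = picked or {}
    ids_by_name = {}
    for otol_id, name in nodes:
        ids_by_name.setdefault(name, []).append(otol_id)
    result = {}
    for name, ids in ids_by_name.items():
        chosen = picked.get(name)
        if chosen in ids:
            # Move the picked node to the front so it gets the plain name
            ids = [chosen] + [i for i in ids if i != chosen]
        for n, otol_id in enumerate(ids, start=1):
            result[otol_id] = name if n == 1 else f"{name} [{n}]"
    return result

nodes = [("ott1", "pholidota"), ("ott2", "pholidota"), ("ott3", "felis")]
default_names = assign_names(nodes)
picked_names = assign_names(nodes, parse_picked(["pholidota|ott2"]))
```

With the picked entry, `ott2` keeps the plain name 'pholidota' and `ott1` becomes 'pholidota [2]'.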
-## Generate node name data
+## Generate Node Names Data
1. Obtain 'name data files' in eol/, as specified in its README.
2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
eol/ and the `nodes` table. It also uses these files, if they exist:
@@ -81,8 +79,8 @@ Some of the python scripts require third-party packages:
- pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
Specifies that a node's alt-name set should exclude altName1.
-## Generate node description data
-### Get data from DBpedia
+## Generate Node Description Data
+### Get Data from DBpedia
1. Obtain files in dbpedia/, as specified in its README.
2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
dbpedia/ and the `nodes` table. It also uses these files, if they exist:
@@ -91,7 +89,7 @@ Some of the python scripts require third-party packages:
wikipedia page that describes something different (eg: Osiris).
- pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
Specifies node names that should have a particular associated page label.
-### Get data from Wikipedia
+### Get Data from Wikipedia
1. Obtain 'description database files' in enwiki/, as specified in its README.
2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
using data in enwiki/ and the `nodes` table.
@@ -99,7 +97,7 @@ Some of the python scripts require third-party packages:
- pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
- pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
-## Generate node image data
+## Generate Node Images Data
### Get images from EOL
1. Obtain 'image metadata files' in eol/, as specified in its README.
2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
@@ -107,14 +105,14 @@ Some of the python scripts require third-party packages:
3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
each node, providing the choice of which to use, moving them to eol/imgs/.
Uses `names` and `eol_ids` to display extra info.
-### Get images from Wikipedia
+### Get Images from Wikipedia
1. In enwiki/, run genImgData.py, which looks for Wikipedia image names for each node,
using the `wiki_ids` table, and stores them in a database.
2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
those images, using Wikipedia's online API.
3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
images into enwiki/imgs/.
-### Merge the image sets
+### Merge the Image Sets
1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
and enables choosing, for each node, which image should be used, if any,
and outputs choice information into imgList.txt. Uses the `nodes`,
@@ -130,14 +128,14 @@ Some of the python scripts require third-party packages:
- An input image might produce output with unexpected dimensions.
This seems to happen when the image is very large, and triggers a
decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
-### Add more image associations
+ In testing, this resulted in about 150k images, with about 2/3 of them
+ being from Wikipedia.
+### Add More Image Associations
1. Run genLinkedImgs.py, which tries to associate nodes without images to
images of its children. Adds the `linked_imgs` table, and uses the
`nodes`, `edges`, and `node_imgs` tables.
-## Do some post-processing
+## Do Some Post-Processing
1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
using data in enwiki/, and the `names` and `wiki_ids` tables.
2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
@@ -148,5 +146,3 @@ Some of the python scripts require third-party packages:
3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads from
pickedNodes.txt, which lists names of nodes that must be included (1 per line).
- The original tree isn't used for web-queries, as some nodes would have over
- 10k children, which can take a while to render (took over a minute in testing).