Diffstat (limited to 'backend/tolData/README.md')
-rw-r--r-- backend/tolData/README.md 110
1 files changed, 51 insertions, 59 deletions
diff --git a/backend/tolData/README.md b/backend/tolData/README.md
index 21c02ab..1248098 100644
--- a/backend/tolData/README.md
+++ b/backend/tolData/README.md
@@ -4,24 +4,24 @@ This directory holds files used to generate the tree-of-life database data.db.
## Tree Structure
- `nodes` <br>
Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
- Represents a tree-of-life node. `tips` holds the number of no-child descendants.
+ Represents a tree-of-life node. `tips` holds the number of no-child descendants
- `edges` <br>
Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
`p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise
-## Node Names
+## Node Mappings
- `eol_ids` <br>
- Format: `id INT PRIMARY KEY, name TEXT` <br>
- Associates an EOL ID with a node's name.
+ Format: `name TEXT PRIMARY KEY, id INT` <br>
+ Associates nodes with EOL IDs
+- `wiki_ids` <br>
+ Format: `name TEXT PRIMARY KEY, id INT` <br>
+ Associates nodes with wikipedia page IDs
+## Node Vernacular Names
- `names` <br>
Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` <br>
Associates a node with alternative names.
`pref_alt` is 1 if the alt-name is the most 'preferred' one.
`src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked').
## Node Descriptions
-- `wiki_ids` <br>
- Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br>
- Associates a node with a wikipedia page ID.
- `redirected` is 1 if the node was associated with a different page that redirected to this one.
- `descs` <br>
Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
Associates a wikipedia page ID with a short-description.
@@ -42,61 +42,62 @@ This directory holds files used to generate the tree-of-life database data.db.
These are like `nodes`, but describe nodes of reduced trees.
- `edges_t`, `edges_i`, `edges_p` <br>
Like `edges` but for reduced trees.
+## Other
+- `node_iucn` <br>
+ Format: `name TEXT PRIMARY KEY, iucn TEXT` <br>
+ Associates nodes with IUCN conservation status strings (eg: 'endangered')
# Generating the Database
-For the most part, these steps should be done in order.
-
-As a warning, the whole process takes a lot of time and file space. The tree will probably
-have about 2.5 billion nodes. Downloading the images takes several days, and occupies over
-200 GB. And if you want good data, you'll likely need to make additional corrections,
-which can take several weeks.
+As a warning, the whole process takes a lot of time and file space. The
+tree will probably have about 2.6 million nodes. Downloading the images
+takes several days, and occupies over 200 GB.
## Environment
Some of the scripts use third-party packages:
-- jsonpickle: For encoding class objects as JSON.
-- requests: For downloading data.
-- PIL: For image processing.
-- tkinter: For providing a basic GUI to review images.
-- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+- `indexed_bzip2`: For parallelised bzip2 processing.
+- `jsonpickle`: For encoding class objects as JSON.
+- `requests`: For downloading data.
+- `PIL`: For image processing.
+- `tkinter`: For providing a basic GUI to review images.
+- `mwxml`, `mwparserfromhell`: For parsing Wikipedia dumps.
## Generate Tree Structure Data
-1. Obtain files in otol/, as specified in it's README.
+1. Obtain 'tree data files' in otol/, as specified in its README.
2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
using data in otol/. It also uses these files, if they exist:
- - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. When nodes in the
- tree have the same name (eg: Pholidota can refer to pangolins or orchids),
- they get the names 'name1', 'name1 [2]', 'name1 [3], etc. This file is used to
- forcibly specify which node should be named 'name1'.
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`.
+ Can be used to override numeric suffixes added to same-name nodes.
+
+## Generate Dataset Mappings
+1. Obtain 'taxonomy data files' in otol/, 'mapping files' in eol/,
+ files in wikidata/, and 'dump-index files' in enwiki/, as specified
+ in their READMEs.
+2. Run genMappingData.py, which adds the `eol_ids` and `wiki_ids` tables,
+ using the files obtained above, and the `nodes` table. It also uses
+ 'picked mappings' files, if they exist.
+ - pickedEolIds.txt contains lines like `3785967|405349`, specifying
+ an otol ID and an eol ID to map it to. The eol ID can be empty,
+ in which case the otol ID won't be mapped.
+ - pickedWikiIds.txt and pickedWikiIdsRough.txt contain lines like
+ `5341349|Human`, specifying an otol ID and an enwiki title,
+ which may contain spaces. The title can be empty.
-## Generate Node Names Data
-1. Obtain 'name data files' in eol/, as specified in it's README.
-2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
- eol/ and the `nodes` table. It also uses these files, if they exist:
- - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
- Specifies node names that should have a particular EOL ID, or no ID.
- Quite a few taxons have ambiguous names, and may need manual correction.
- For example, Viola may resolve to a taxon of butterflies or of plants.
- - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
- Specifies that a node's alt-name set should exclude altName1.
+## Generate Node Name Data
+1. Obtain 'name data files' in eol/, and 'description database files' in enwiki/,
+ as specified in their READMEs.
+2. Run genNameData.py, which adds the `names` table, using data in eol/ and enwiki/,
+ along with the `nodes`, `eol_ids`, and `wiki_ids` tables. <br>
+ It also uses pickedNames.txt, if it exists. This file can hold lines like
+ `embryophyta|land plant|1`, specifying a node name, an alt-name to add for it,
+ and a 1 or 0 indicating whether it is a 'preferred' alt-name. The last field
+ can be empty, which indicates that the alt-name should be removed, or, if the
+ alt-name is the same as the node name, that no alt-name should be preferred.
## Generate Node Description Data
-### Get Data from DBpedia
1. Obtain files in dbpedia/, as specified in its README.
-2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
- dbpedia/ and the `nodes` table. It also uses these files, if they exist:
- - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
- no description should be obtained. Many node names have a same-name
- wikipedia page that describes something different (eg: Osiris).
- - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
- Specifies node names that should have a particular associated page label.
-### Get Data from Wikipedia
-1. Obtain 'description database files' in enwiki/, as specified in it's README.
-2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
- using data in enwiki/ and the `nodes` table.
- It also uses these files, if they exist:
- - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
- - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+2. Run genDescData.py, which adds the `descs` table, using data in dbpedia/ and
+ enwiki/, and the `nodes` table.
## Generate Node Images Data
### Get images from EOL
@@ -129,21 +130,12 @@ Some of the scripts use third-party packages:
- An input image might produce output with unexpected dimensions.
This seems to happen when the image is very large, and triggers a
decompression bomb warning.
- In testing, this resulted in about 150k images, with about 2/3 of them
- being from Wikipedia.
### Add more Image Associations
1. Run genLinkedImgs.py, which tries to associate nodes without images to
images of its children. Adds the `linked_imgs` table, and uses the
`nodes`, `edges`, and `node_imgs` tables.
## Do some Post-Processing
-1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
- using data in enwiki/, and the `names` and `wiki_ids` tables.
-2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
- the `names` table, as specified in pickedNames.txt.
- - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
- These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
- A line like `name1|name1|1` causes a node to have no preferred alt-name.
-3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
+1. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads from
pickedNodes.txt, which lists names of nodes that must be included (1 per line).
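
Note on the 'picked' file formats introduced by this change: the added text describes pickedEolIds.txt lines like `3785967|405349` (the eol ID may be empty) and pickedNames.txt lines like `embryophyta|land plant|1` (the last field may be empty). The following is a minimal sketch of how such lines could be interpreted, assuming one record per line and `|` as the only separator. The function names are hypothetical and are not taken from genMappingData.py or genNameData.py, whose actual parsing code is not shown in this diff.

```python
from typing import Optional, Tuple


def parse_picked_eol_id(line: str) -> Tuple[str, Optional[int]]:
    """Parse an 'otolId|eolId' line (hypothetical helper).

    The otol ID is kept as a string, since `nodes.id` is declared TEXT.
    An empty eol ID is returned as None, meaning 'do not map this otol ID'.
    """
    otol_id, eol_id = line.rstrip("\n").split("|")
    return otol_id, (int(eol_id) if eol_id else None)


def parse_picked_name(line: str) -> Tuple[str, str, Optional[int]]:
    """Parse a 'nodeName|altName|prefAlt' line (hypothetical helper).

    An empty last field is returned as None.
    """
    name, alt_name, pref_alt = line.rstrip("\n").split("|")
    return name, alt_name, (int(pref_alt) if pref_alt else None)


def picked_name_action(name: str, alt_name: str, pref_alt: Optional[int]) -> str:
    """Classify a parsed pickedNames.txt entry, following the README text.

    - pref_alt given (1 or 0): add the alt-name with that preference flag.
    - pref_alt empty, alt-name equals node name: no alt-name is preferred.
    - pref_alt empty otherwise: the alt-name should be removed.
    """
    if pref_alt is not None:
        return "add"
    if alt_name == name:
        return "no_preferred"
    return "remove"
```

For example, `parse_picked_name("embryophyta|land plant|1")` yields a name, an alt-name, and a preference flag that `picked_name_action` classifies as an addition, while a line such as `embryophyta|land plant|` would be classified as a removal.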