From e8e58a3bb9dc233dacf573973457c5b48d369503 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Tue, 30 Aug 2022 12:27:42 +1000
Subject: Add scripts for generating eol/enwiki mappings

- New data sources: OTOL taxonomy, EOL provider-ids, Wikidata dump
- Add 'node_iucn' table
- Remove 'redirected' field from 'wiki_ids' table
- Make 'eol_ids' table have 'name' as the primary key
- Combine name-generation scripts into genNameData.py
- Combine description-generation scripts into genDescData.py
---
 backend/tolData/README.md | 110 +++++++++++++++++++++-------------------------
 1 file changed, 51 insertions(+), 59 deletions(-)

diff --git a/backend/tolData/README.md b/backend/tolData/README.md
index 21c02ab..1248098 100644
--- a/backend/tolData/README.md
+++ b/backend/tolData/README.md
@@ -4,24 +4,24 @@ This directory holds files used to generate the tree-of-life database data.db.
 ## Tree Structure
 - `nodes`
   Format: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
-  Represents a tree-of-life node. `tips` holds the number of no-child descendants.
+  Represents a tree-of-life node. `tips` holds the number of no-child descendants
 - `edges`
   Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)`
   `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise
-## Node Names
+## Node Mappings
 - `eol_ids`
-  Format: `id INT PRIMARY KEY, name TEXT`
-  Associates an EOL ID with a node's name.
+  Format: `name TEXT PRIMARY KEY, id INT`
+  Associates nodes with EOL IDs
+- `wiki_ids`
+  Format: `name TEXT PRIMARY KEY, id INT`
+  Associates nodes with wikipedia page IDs
+## Node Vernacular Names
 - `names`
   Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
   Associates a node with alternative names. `pref_alt` is 1 if the alt-name is the most
   'preferred' one. `src` indicates the dataset the alt-name was obtained from
   (can be 'eol', 'enwiki', or 'picked').
 ## Node Descriptions
-- `wiki_ids`
-  Format: `name TEXT PRIMARY KEY, id INT, redirected INT`
-  Associates a node with a wikipedia page ID.
-  `redirected` is 1 if the node was associated with a different page that redirected to this one.
 - `descs`
   Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
   Associates a wikipedia page ID with a short-description.
@@ -42,61 +42,62 @@ This directory holds files used to generate the tree-of-life database data.db.
   These are like `nodes`, but describe nodes of reduced trees.
 - `edges_t`, `edges_i`, `edges_p`
   Like `edges` but for reduced trees.
+## Other
+- `node_iucn`
+  Format: `name TEXT PRIMARY KEY, iucn TEXT`
+  Associates nodes with IUCN conservation status strings (eg: 'endangered')
 
 # Generating the Database
-For the most part, these steps should be done in order.
-
-As a warning, the whole process takes a lot of time and file space. The tree will probably
-have about 2.5 billion nodes. Downloading the images takes several days, and occupies over
-200 GB. And if you want good data, you'll likely need to make additional corrections,
-which can take several weeks.
+As a warning, the whole process takes a lot of time and file space. The
+tree will probably have about 2.6 million nodes. Downloading the images
+takes several days, and occupies over 200 GB.
 
 ## Environment
 Some of the scripts use third-party packages:
-- jsonpickle: For encoding class objects as JSON.
-- requests: For downloading data.
-- PIL: For image processing.
-- tkinter: For providing a basic GUI to review images.
-- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+- `indexed_bzip2`: For parallelised bzip2 processing.
+- `jsonpickle`: For encoding class objects as JSON.
+- `requests`: For downloading data.
+- `PIL`: For image processing.
+- `tkinter`: For providing a basic GUI to review images.
+- `mwxml`, `mwparserfromhell`: For parsing Wikipedia dumps.
 
 ## Generate Tree Structure Data
-1. Obtain files in otol/, as specified in it's README.
+1. Obtain 'tree data files' in otol/, as specified in its README.
 2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges`
    tables, using data in otol/. It also uses these files, if they exist:
-   - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. When nodes in the
-     tree have the same name (eg: Pholidota can refer to pangolins or orchids),
-     they get the names 'name1', 'name1 [2]', 'name1 [3]', etc. This file is used to
-     forcibly specify which node should be named 'name1'.
+   - pickedOtolNames.txt: Has lines of the form `name1|otolId1`.
+     Can be used to override numeric suffixes added to same-name nodes.
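For illustration, the `nodes` and `edges` tables that this step produces can be sketched with Python's sqlite3 module. The column definitions follow the table descriptions at the top of this README; the sample rows, tip counts, and ID strings are made-up placeholders, not the scripts' actual output:

```python
import sqlite3

# Schema as documented in the 'Tree Structure' section of this README.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
conn.execute(
    "CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, "
    "PRIMARY KEY (parent, child))")

# Placeholder rows (names, IDs, and tip counts are illustrative only).
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("cellular organisms", "ott93302", 2),
    ("Metazoa", "ott691846", 1),
    ("Fungi", "ott352914", 1),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("cellular organisms", "Metazoa", 1),
    ("cellular organisms", "Fungi", 1),
])

def tip_sum(parent):
    """Sum the `tips` values of a node's children, via the `edges` table."""
    rows = conn.execute(
        "SELECT n.tips FROM edges e JOIN nodes n ON n.name = e.child "
        "WHERE e.parent = ?", (parent,)).fetchall()
    return sum(t for (t,) in rows)
```

Since `tips` counts no-child descendants, a parent's `tips` value should equal the sum over its children, as `tip_sum` checks here.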
+
+## Generate Dataset Mappings
+1. Obtain 'taxonomy data files' in otol/, 'mapping files' in eol/,
+   files in wikidata/, and 'dump-index files' in enwiki/, as specified
+   in their READMEs.
+2. Run genMappingData.py, which adds the `eol_ids` and `wiki_ids` tables,
+   using the files obtained above, and the `nodes` table. It also uses
+   'picked mappings' files, if they exist.
+   - pickedEolIds.txt contains lines like `3785967|405349`, specifying
+     an otol ID and an eol ID to map it to. The eol ID can be empty,
+     in which case the otol ID won't be mapped.
+   - pickedWikiIds.txt and pickedWikiIdsRough.txt contain lines like
+     `5341349|Human`, specifying an otol ID and an enwiki title,
+     which may contain spaces. The title can be empty.
 
-## Generate Node Names Data
-1. Obtain 'name data files' in eol/, as specified in it's README.
-2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
-   eol/ and the `nodes` table. It also uses these files, if they exist:
-   - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
-     Specifies node names that should have a particular EOL ID, or no ID.
-     Quite a few taxons have ambiguous names, and may need manual correction.
-     For example, Viola may resolve to a taxon of butterflies or of plants.
-   - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
-     Specifies that a node's alt-name set should exclude altName1.
+## Generate Node Name Data
+1. Obtain 'name data files' in eol/, and 'description database files' in enwiki/,
+   as specified in their READMEs.
+2. Run genNameData.py, which adds the `names` table, using data in eol/ and enwiki/,
+   along with the `nodes`, `eol_ids`, and `wiki_ids` tables.
+   It also uses pickedNames.txt, if it exists. This file can hold lines like
+   `embryophyta|land plant|1`, specifying a node name, an alt-name to add for it,
+   and a 1 or 0 indicating whether it is a 'preferred' alt-name. The last field
+   can be empty, which indicates that the alt-name should be removed, or, if the
+   alt-name is the same as the node name, that no alt-name should be preferred.
 
 ## Generate Node Description Data
-### Get Data from DBpedia
 1. Obtain files in dbpedia/, as specified in its README.
-2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
-   dbpedia/ and the `nodes` table. It also uses these files, if they exist:
-   - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
-     no description should be obtained. Many node names have a same-name
-     wikipedia page that describes something different (eg: Osiris).
-   - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
-     Specifies node names that should have a particular associated page label.
-### Get Data from Wikipedia
-1. Obtain 'description database files' in enwiki/, as specified in it's README.
-2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
-   using data in enwiki/ and the `nodes` table.
-   It also uses these files, if they exist:
-   - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
-   - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+2. Run genDescData.py, which adds the `descs` table, using data in dbpedia/ and
+   enwiki/, and the `nodes` table.
 
 ## Generate Node Images Data
 ### Get images from EOL
@@ -129,21 +130,12 @@ Some of the scripts use third-party packages:
   - An input image might produce output with unexpected dimensions. This seems to
     happen when the image is very large, and triggers a decompression bomb warning.
-  In testing, this resulted in about 150k images, with about 2/3 of them
-  being from Wikipedia.
 ### Add more Image Associations
 1. Run genLinkedImgs.py, which tries to associate nodes without images to images of
    its children. Adds the `linked_imgs` table, and uses the `nodes`, `edges`,
    and `node_imgs` tables.
 
 ## Do some Post-Processing
-1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
-   using data in enwiki/, and the `names` and `wiki_ids` tables.
-2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
-   the `names` table, as specified in pickedNames.txt.
-   - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
-     These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
-     A line like `name1|name1|1` causes a node to have no preferred alt-name.
-3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
+1. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
    adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`.
    Reads from pickedNodes.txt, which lists names of nodes that must be included (1 per line).
-- 
cgit v1.2.3
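The various 'picked' files used by the steps in this patch share a simple pipe-separated format. A rough sketch of parsing a pickedNames.txt-style line, based only on the format described in the README (this is illustrative, not the scripts' actual parsing code):

```python
def parse_picked_name(line):
    """Parse a pickedNames.txt-style line: `name|alt_name|pref_alt`.

    Per the README, the last field is '1', '0', or empty; empty means
    the alt-name should be removed (or, when alt_name equals name, that
    no alt-name should be preferred). An empty field is returned as None.
    """
    name, alt_name, pref = line.rstrip("\n").split("|")
    return name, alt_name, (None if pref == "" else int(pref))
```

For example, `parse_picked_name("embryophyta|land plant|1")` yields `("embryophyta", "land plant", 1)`.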