author    Terry Truong <terry06890@gmail.com>  2022-06-22 01:42:41 +1000
committer Terry Truong <terry06890@gmail.com>  2022-06-22 09:39:44 +1000
commit    e78c4df403e5f98afa08f7a0841ff233d5f6d05b (patch)
tree      f13dbf91228550075644be9766b4546eb20f1e1f /backend/data/README.md
parent    ae1467d2ab35a03eb2d7bf3e5ca1cf4634b23443 (diff)
Update backend READMEs, rename some files for consistency
Diffstat (limited to 'backend/data/README.md')
-rw-r--r--  backend/data/README.md | 232
1 file changed, 119 insertions(+), 113 deletions(-)
diff --git a/backend/data/README.md b/backend/data/README.md
index d4a6196..7d1adad 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -1,115 +1,121 @@
-File Generation Process
-=======================
-1 Tree Structure Data
- 1 Obtain data in otol/, as specified in it's README.
- 2 Run genOtolData.py, which creates data.db, and adds
- 'nodes' and 'edges' tables using data in otol/*, as well as
- genOtolNamesToKeep.txt, if present.
-2 Name Data for Search
- 1 Obtain data in eol/, as specified in it's README.
- 2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
- using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
- genEolNameDataPickedIds.txt.
-3 Node Description Data
- 1 Obtain data in dbpedia/ and enwiki/, as specified in their README files.
- 2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
- using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
- genDescNamesToSkip.txt and dbpPickedLabels.txt.
- 3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
- Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
- skipping/resolving some name-page associations.
-4 Image Data
- 1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
- It uses data in eol/imagesList.db, and the 'eol_ids' table.
- 2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
- images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
- 3 In enwiki/, run getEnwikiImgData.py, which generates a list of
- tol-node images, and creates enwiki/enwikiImgs.db to store it.
- Uses the 'wiki_ids' table to get tol-node wiki-ids.
- 4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
- information for images listed in enwiki/enwikiImgs.db, and stores
- it in that db.
- 5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
- images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
- 6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
- and enables choosing, for each tol-node, which image should be used, if any,
- and outputs choice information into mergedImgList.txt. Uses the 'nodes',
- 'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
- 7 Run genImgsForWeb.py, which creates cropped/resized images in img/,
- using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
- 'node_imgs' tables to data.db. <br>
- Smartcrop's outputs might need to be manually created/adjusted: <br>
- - An input image might have no output produced, possibly due to
- data incompatibilities, memory limits, etc. A few input image files
- might actually be html files, containing a 'file not found' page.
- - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- - An input image might produce output with unexpected dimensions.
- This seems to happen when the image is very large, and triggers a
- decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
- 8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
- which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
- nodes without images to child images.
-5 Reduced Tree Structure Data
- 1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
- data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-6 Other
- - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
- using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
- - Optionally run addPickedNames.py, which adds manually-picked names to
- the 'names' table, as specified in pickedNames.txt.
- - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
- for the sake of performance and result-relevance. Without this, jumping to certain
- nodes within the fungi and moths can take over a minute to render.
+This directory holds files used to generate data.db, which contains tree-of-life data.
-data.db Tables
-==============
-- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-- eol\_ids: id INT PRIMARY KEY, name TEXT
-- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT
-- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT
-- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-- r\_nodes: name TEXT PRIMARY KEY, tips INT
-- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
+# Tables:
+- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
+- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
+- `eol_ids`: `id INT PRIMARY KEY, name TEXT`
+- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
+- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT`
+- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
+- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
+- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
+- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT`
+- `r_nodes`: `name TEXT PRIMARY KEY, tips INT`
+- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
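
Assuming data.db is an SQLite database (consistent with the scripts producing a single .db file), the table list above corresponds to a schema along these lines. This is a sketch for reference, not code taken from the generation scripts:

```python
import sqlite3

# Schema sketch mirroring the table list above (column types as documented)
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
CREATE TABLE edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY (name, alt_name));
CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT);
CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT);
CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT);
CREATE TABLE images (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src));
CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT);
CREATE TABLE r_nodes (name TEXT PRIMARY KEY, tips INT);
CREATE TABLE r_edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
"""

def create_db(path=":memory:"):
    """Create a database with the documented tables."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```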
-Other Files
-===========
-- dbpPickedLabels.txt <br>
- Contains DBpedia labels, one per line. Used by genDbpData.py to help
- resolve conflicts when associating tree-of-life node names with
- DBpedia node labels.
-- genOtolNamesToKeep.txt <br>
- Contains names to avoid trimming off the tree data generated by
- genOtolData.py. Usage is optional, but, without it, a large amount
- of possibly-significant nodes are removed, using a short-sighted
- heuristic. <br>
- One way to generate this list is to generate the files as usual,
- then get node names that have an associated image, description, or
- presence in r_nodes. Then run the genOtolData.py and genEolNameData.py
- scripts again (after deleting their created tables).
-- genEnwikiDescNamesToSkip.txt <br>
- Contains names for nodes that genEnwikiNameData.py should skip adding
- a description for. Usage is optional, but without it, some nodes will
- probably get descriptions that don't match (eg: the bee genus Osiris
- might be described as an egyptian god). <br>
- This file was generated by running genEnwikiNameData.py, then listing
- the names that it added into a file, along with descriptions, and
- manually removing those that seemed node-matching (got about 30k lines,
- with about 1 in 30 descriptions non-matching). And, after creating
- genEnwikiDescTitlesToUse.txt, names shared with that file were removed.
-- genEnwikiDescTitlesToUse.txt <br>
- Contains enwiki titles with the form 'name1 (category1)' for
- genEnwikiNameData.py to use to resolve nodes matching name name1.
- Usage is optional, but it adds some descriptions that would otherwise
- be skipped. <br>
- This file was generated by taking the content of genEnwikiNameData.py,
- after the manual filtering step, then, for each name name,1 getting
- page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
- This was followed by manually removing lines, keeping those that
- seemed to match the corresponding node (used the app to help with this).
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and disk space. The tree will probably
+have about 2.5 million nodes. Downloading the images will take several days, and occupy over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in Python and Bash.
+Some of the Python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- Pillow (imported as `PIL`): For image processing.
+- tkinter: For providing a basic GUI to review images (part of the standard
+  library, but may need a separate OS package).
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+ using data in otol/. It also uses these files, if they exist:
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+ tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+      Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+ This file can be used to manually specify which node should be named 'name1'.
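
The naming scheme above might be implemented roughly like this; `assign_display_names` is an illustrative helper, not code from genOtolData.py:

```python
def assign_display_names(nodes, picked=None):
    """Give duplicate names 'name1', 'name1 [2]', 'name1 [3]', etc.

    nodes: list of (name, otol_id) pairs, in tree order.
    picked: optional dict name -> otol_id, as might be parsed from
        pickedOtolNames.txt, naming which node keeps the bare 'name1'.
    Returns a dict otol_id -> display name.
    """
    picked = picked or {}
    counts = {}
    result = {}
    # First pass: a picked node always gets the bare name
    for name, otol_id in nodes:
        if picked.get(name) == otol_id:
            result[otol_id] = name
            counts[name] = 1
    # Second pass: remaining duplicates get bracketed suffixes
    for name, otol_id in nodes:
        if otol_id in result:
            continue
        n = counts.get(name, 0) + 1
        counts[name] = n
        result[otol_id] = name if n == 1 else f"{name} [{n}]"
    return result
```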
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+ eol/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+ Specifies node names that should have a particular EOL ID, or no ID.
+      Quite a few taxa have ambiguous names, and may need manual correction.
+ For example, Viola may resolve to a taxon of butterflies or of plants.
+ - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+ Specifies that a node's alt-name set should exclude altName1.
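
The picked files above share a simple `name|value` line format. A sketch of a parser (the function name is illustrative, not taken from the scripts):

```python
def parse_picked_file(lines):
    """Parse lines like 'nodeName1|eolId1' or 'nodeName1|' (meaning: no ID).

    Returns a dict mapping node names to a value string, or None for 'no value'.
    """
    picked = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition("|")
        picked[name] = value or None
    return picked
```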
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+ dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+   - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+     no description should be obtained. Many node names have a same-name
+     Wikipedia page that describes something different (eg: Osiris is a bee
+     genus, but also an Egyptian god).
+ - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+ Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in its README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+ using data in enwiki/ and the `nodes` table.
+ It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+ - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in it's README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per
+   node) into eol/imgsForReview/, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images
+   for each node, lets you choose which to use, and moves chosen images to
+   eol/imgs/. Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
+ using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+ those images, using wikipedia's online API.
+3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+ images into enwiki/imgs/.
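
The licensing filter might look like the following. The set of accepted licenses here is an assumption for illustration, not the scripts' actual list:

```python
# Assumed set of 'permissive' licenses; the real scripts may accept a
# different list (eg: specific CC versions, GFDL, etc.)
PERMISSIVE_LICENSES = {"cc0", "cc-by", "cc-by-sa", "public domain"}

def is_permissive(license_str):
    """Check an image's license string against the accepted set."""
    return license_str.strip().lower() in PERMISSIVE_LICENSES

def filter_images(img_rows):
    """Keep only (title, license) rows whose license is permissive."""
    return [(title, lic) for title, lic in img_rows if is_permissive(lic)]
```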
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+   lets you choose which image (if any) each node should use, and writes the
+   choices to imgList.txt. Uses the `nodes`, `eol_ids`, and `wiki_ids` tables
+   (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+ imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+ `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+ The outputs might need to be manually created/adjusted:
+ - An input image might have no output produced, possibly due to
+ data incompatibilities, memory limits, etc. A few input image files
+ might actually be html files, containing a 'file not found' page.
+ - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+ - An input image might produce output with unexpected dimensions.
+ This seems to happen when the image is very large, and triggers a
+ decompression bomb warning.
+ The result might have as many as 150k images, with about 2/3 of them
+ being from wikipedia.
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate each node that lacks an image
+   with images of its children. Adds the `linked_imgs` table, and uses the
+   `nodes`, `edges`, and `node_imgs` tables.
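
The child-image lookup can be sketched as a breadth-first search over the tree; this is an illustrative rendition, not genLinkedImgs.py itself:

```python
from collections import deque

def find_linked_img(name, children, has_img):
    """Find the nearest descendant of 'name' that has its own image.

    children: dict node name -> list of child names (as in the `edges` table).
    has_img: set of node names that have images (as in `node_imgs`).
    Returns the descendant's name, or None if no descendant has an image.
    """
    queue = deque(children.get(name, []))
    while queue:
        node = queue.popleft()
        if node in has_img:
            return node
        queue.extend(children.get(node, []))
    return None
```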
+
+## Do some post-processing
+1. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
+ adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
+ pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
+2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
+ for the sake of performance and result-relevance. Otherwise, some nodes may have
+ over 10k children, which can take a while to render (over a minute in my testing).
+   You might want to back up the untrimmed tree first, as this operation is not
+   easily reversible.
+3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table,
+ using data in enwiki/, and the `names` and `wiki_ids` tables.
+4. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+ the `names` table, as specified in pickedNames.txt.
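
A possible shape for the trimming step above; the 'significance' criterion used here (child tip counts) and the threshold are assumptions, not necessarily what trimTree.py actually does:

```python
def trim_children(children, tips, max_children=500):
    """Cap each node's child count, keeping the children with the most tips.

    children: dict node name -> list of child names.
    tips: dict node name -> tip count (as in the `nodes` table).
    Returns (new_children, dropped), where dropped is the set of removed
    child names (their own subtrees would need a follow-up cleanup pass).
    """
    trimmed = {}
    dropped = set()
    for node, kids in children.items():
        kept = sorted(kids, key=lambda c: tips.get(c, 0), reverse=True)[:max_children]
        kept_set = set(kept)
        dropped.update(k for k in kids if k not in kept_set)
        trimmed[node] = [k for k in kids if k in kept_set]  # preserve order
    return trimmed, dropped
```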