author     Terry Truong <terry06890@gmail.com>  2022-06-22 01:42:41 +1000
committer  Terry Truong <terry06890@gmail.com>  2022-06-22 09:39:44 +1000
commit     e78c4df403e5f98afa08f7a0841ff233d5f6d05b (patch)
tree       f13dbf91228550075644be9766b4546eb20f1e1f /backend
parent     ae1467d2ab35a03eb2d7bf3e5ca1cf4634b23443 (diff)
Update backend READMEs, rename some files for consistency
Diffstat (limited to 'backend')
-rw-r--r--  backend/README.md  4
-rw-r--r--  backend/data/README.md  232
-rw-r--r--  backend/data/dbpedia/README.md  45
-rwxr-xr-x  backend/data/dbpedia/genDescData.py (renamed from backend/data/dbpedia/genData.py)  2
-rw-r--r--  backend/data/enwiki/README.md  73
-rwxr-xr-x  backend/data/enwiki/downloadEnwikiImgs.py  2
-rwxr-xr-x  backend/data/enwiki/downloadImgLicenseInfo.py  2
-rwxr-xr-x  backend/data/enwiki/genDescData.py (renamed from backend/data/enwiki/genData.py)  2
-rwxr-xr-x  backend/data/enwiki/genImgData.py (renamed from backend/data/enwiki/getEnwikiImgData.py)  2
-rw-r--r--  backend/data/eol/README.md  33
-rwxr-xr-x  backend/data/eol/reviewImgs.py  2
-rwxr-xr-x  backend/data/genDbpData.py  6
-rwxr-xr-x  backend/data/genEnwikiDescData.py  6
-rwxr-xr-x  backend/data/genEnwikiNameData.py  2
-rwxr-xr-x  backend/data/genEolNameData.py  4
-rwxr-xr-x  backend/data/genImgs.py (renamed from backend/data/genImgsForWeb.py)  10
-rwxr-xr-x  backend/data/genOtolData.py  15
-rwxr-xr-x  backend/data/genReducedTreeData.py  2
-rw-r--r--  backend/data/otol/README.md  14
-rw-r--r--  backend/data/pickedImgs/README.md  16
-rw-r--r--  backend/data/reducedTol/README.md  4
-rwxr-xr-x  backend/data/reviewImgsToGen.py (renamed from backend/data/reviewImgsToMerge.py)  6
22 files changed, 257 insertions, 227 deletions
diff --git a/backend/README.md b/backend/README.md
new file mode 100644
index 0000000..331e7f4
--- /dev/null
+++ b/backend/README.md
@@ -0,0 +1,4 @@
+Files
+=====
+- server.py: Runs the server
+- data/: For generating the server's tree-of-life database
diff --git a/backend/data/README.md b/backend/data/README.md
index d4a6196..7d1adad 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -1,115 +1,121 @@
-File Generation Process
-=======================
-1 Tree Structure Data
- 1 Obtain data in otol/, as specified in it's README.
- 2 Run genOtolData.py, which creates data.db, and adds
- 'nodes' and 'edges' tables using data in otol/*, as well as
- genOtolNamesToKeep.txt, if present.
-2 Name Data for Search
- 1 Obtain data in eol/, as specified in it's README.
- 2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
- using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
- genEolNameDataPickedIds.txt.
-3 Node Description Data
- 1 Obtain data in dbpedia/ and enwiki/, as specified in their README files.
- 2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
- using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
- genDescNamesToSkip.txt and dbpPickedLabels.txt.
- 3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
- Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
- skipping/resolving some name-page associations.
-4 Image Data
- 1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
- It uses data in eol/imagesList.db, and the 'eol_ids' table.
- 2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
- images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
- 3 In enwiki/, run getEnwikiImgData.py, which generates a list of
- tol-node images, and creates enwiki/enwikiImgs.db to store it.
- Uses the 'wiki_ids' table to get tol-node wiki-ids.
- 4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
- information for images listed in enwiki/enwikiImgs.db, and stores
- it in that db.
- 5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
- images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
- 6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
- and enables choosing, for each tol-node, which image should be used, if any,
- and outputs choice information into mergedImgList.txt. Uses the 'nodes',
- 'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
- 7 Run genImgsForWeb.py, which creates cropped/resized images in img/,
- using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
- 'node_imgs' tables to data.db. <br>
- Smartcrop's outputs might need to be manually created/adjusted: <br>
- - An input image might have no output produced, possibly due to
- data incompatibilities, memory limits, etc. A few input image files
- might actually be html files, containing a 'file not found' page.
- - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- - An input image might produce output with unexpected dimensions.
- This seems to happen when the image is very large, and triggers a
- decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
- 8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
- which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
- nodes without images to child images.
-5 Reduced Tree Structure Data
- 1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
- data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-6 Other
- - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
- using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
- - Optionally run addPickedNames.py, which adds manually-picked names to
- the 'names' table, as specified in pickedNames.txt.
- - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
- for the sake of performance and result-relevance. Without this, jumping to certain
- nodes within the fungi and moths can take over a minute to render.
+This directory holds files used to generate data.db, which contains tree-of-life data.
-data.db Tables
-==============
-- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-- eol\_ids: id INT PRIMARY KEY, name TEXT
-- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT
-- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT
-- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-- r\_nodes: name TEXT PRIMARY KEY, tips INT
-- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
+# Tables:
+- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
+- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
+- `eol_ids`: `id INT PRIMARY KEY, name TEXT`
+- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
+- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT`
+- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
+- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
+- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
+- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT`
+- `r_nodes`: `name TEXT PRIMARY KEY, tips INT`
+- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
-Other Files
-===========
-- dbpPickedLabels.txt <br>
- Contains DBpedia labels, one per line. Used by genDbpData.py to help
- resolve conflicts when associating tree-of-life node names with
- DBpedia node labels.
-- genOtolNamesToKeep.txt <br>
- Contains names to avoid trimming off the tree data generated by
- genOtolData.py. Usage is optional, but, without it, a large amount
- of possibly-significant nodes are removed, using a short-sighted
- heuristic. <br>
- One way to generate this list is to generate the files as usual,
- then get node names that have an associated image, description, or
- presence in r_nodes. Then run the genOtolData.py and genEolNameData.py
- scripts again (after deleting their created tables).
-- genEnwikiDescNamesToSkip.txt <br>
- Contains names for nodes that genEnwikiNameData.py should skip adding
- a description for. Usage is optional, but without it, some nodes will
- probably get descriptions that don't match (eg: the bee genus Osiris
- might be described as an egyptian god). <br>
- This file was generated by running genEnwikiNameData.py, then listing
- the names that it added into a file, along with descriptions, and
- manually removing those that seemed node-matching (got about 30k lines,
- with about 1 in 30 descriptions non-matching). And, after creating
- genEnwikiDescTitlesToUse.txt, names shared with that file were removed.
-- genEnwikiDescTitlesToUse.txt <br>
- Contains enwiki titles with the form 'name1 (category1)' for
- genEnwikiNameData.py to use to resolve nodes matching name name1.
- Usage is optional, but it adds some descriptions that would otherwise
- be skipped. <br>
- This file was generated by taking the content of genEnwikiNameData.py,
- after the manual filtering step, then, for each name name,1 getting
- page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
- This was followed by manually removing lines, keeping those that
- seemed to match the corresponding node (used the app to help with this).
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and disk space. The tree will probably
+have about 2.5 million nodes. Downloading the images will take several days, and occupy over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in python and bash.
+Some of the python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- PIL: For image processing.
+- tkinter: For providing a basic GUI to review images.
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+ using data in otol/. It also uses these files, if they exist:
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+ tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+ Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+ This file can be used to manually specify which node should be named 'name1'.
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+ eol/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+ Specifies node names that should have a particular EOL ID, or no ID.
+ Quite a few taxa have ambiguous names, and may need manual correction.
+ For example, Viola may resolve to a taxon of butterflies or of plants.
+ - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+ Specifies that a node's alt-name set should exclude altName1.
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+ dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+ no description should be obtained. Many node names have a same-name
+ wikipedia page that describes something different (eg: Osiris).
+ - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+ Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in its README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+ using data in enwiki/ and the `nodes` table.
+ It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+ - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in it's README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
+ into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
+ each node, lets you choose which to use, and moves the chosen ones to eol/imgs/.
+ Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
+ using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+ those images, using wikipedia's online API.
+3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+ images into enwiki/imgs/.
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+ enables choosing, for each node, which image (if any) should be used, and
+ writes the choices to imgList.txt. Uses the `nodes`,
+ `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+ imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+ `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+ The outputs might need to be manually created/adjusted:
+ - An input image might have no output produced, possibly due to
+ data incompatibilities, memory limits, etc. A few input image files
+ might actually be html files, containing a 'file not found' page.
+ - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+ - An input image might produce output with unexpected dimensions.
+ This seems to happen when the image is very large, and triggers a
+ decompression bomb warning.
+ The result might have as many as 150k images, with about 2/3 of them
+ being from wikipedia.
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate nodes without images with
+ images of their children. Adds the `linked_imgs` table, and uses the
+ `nodes`, `edges`, and `node_imgs` tables.
+
+## Do some post-processing
+1. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
+ adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
+ pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
+2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
+ for the sake of performance and result-relevance. Otherwise, some nodes may have
+ over 10k children, which can take a while to render (over a minute in my testing).
+ You might want to back up the untrimmed tree first, as this operation is not easily
+ reversible.
+3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table,
+ using data in enwiki/, and the `names` and `wiki_ids` tables.
+4. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+ the `names` table, as specified in pickedNames.txt.
diff --git a/backend/data/dbpedia/README.md b/backend/data/dbpedia/README.md
index 78e2a90..8a08f20 100644
--- a/backend/data/dbpedia/README.md
+++ b/backend/data/dbpedia/README.md
@@ -1,28 +1,29 @@
-Downloaded Files
-================
-- labels\_lang=en.ttl.bz2 <br>
- Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core,
- using the link <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
-- page\_lang=en\_ids.ttl.bz2 <br>
+This directory holds files obtained from/using [DBpedia](https://www.dbpedia.org).
+
+# Downloaded Files
+- `labels_lang=en.ttl.bz2` <br>
+ Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core.
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
+- `page_lang=en_ids.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2>
-- redirects\_lang=en\_transitive.ttl.bz2 <br>
+- `redirects_lang=en_transitive.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>.
-- disambiguations\_lang=en.ttl.bz2 <br>
+- `disambiguations_lang=en.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>.
-- instance-types\_lang=en\_specific.ttl.bz2 <br>
+- `instance-types_lang=en_specific.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>.
-- short-abstracts\_lang=en.ttl.bz2 <br>
+- `short-abstracts_lang=en.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>.
-Generated Files
-===============
-- dbpData.db <br>
- An sqlite database representing data from the ttl files.
- Generated by running genData.py.
- Tables
- - labels: iri TEXT PRIMARY KEY, label TEXT
- - ids: iri TEXT PRIMARY KEY, id INT
- - redirects: iri TEXT PRIMARY KEY, target TEXT
- - disambiguations: iri TEXT PRIMARY KEY
- - types: iri TEXT, type TEXT
- - abstracts: iri TEXT PRIMARY KEY, abstract TEXT
+# Other Files
+- genDescData.py <br>
+ Used to generate a database representing data from the ttl files.
+- descData.db <br>
+ Generated by genDescData.py. <br>
+ Tables: <br>
+ - `labels`: `iri TEXT PRIMARY KEY, label TEXT`
+ - `ids`: `iri TEXT PRIMARY KEY, id INT`
+ - `redirects`: `iri TEXT PRIMARY KEY, target TEXT`
+ - `disambiguations`: `iri TEXT PRIMARY KEY`
+ - `types`: `iri TEXT, type TEXT`
+ - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT`
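The ttl files are essentially N-Triples: one `<subject> <predicate> "object" .` statement per line. As a rough sketch of what extracting a label triple might look like (the regex below is an assumption for illustration, not genDescData.py's actual parser):

```python
import re

# Matches lines like:
# <http://dbpedia.org/resource/Lion> <...rdf-schema#label> "Lion"@en .
labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(.*)"@en \.')

def parseLabelLine(line):
    """Return (iri, label) for an English label triple, or None if no match."""
    m = labelLineRegex.match(line)
    return (m.group(1), m.group(2)) if m else None
```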
diff --git a/backend/data/dbpedia/genData.py b/backend/data/dbpedia/genDescData.py
index 41c48a8..bba3ff5 100755
--- a/backend/data/dbpedia/genData.py
+++ b/backend/data/dbpedia/genDescData.py
@@ -16,7 +16,7 @@ redirectsFile = "redirects_lang=en_transitive.ttl.bz2"
disambigFile = "disambiguations_lang=en.ttl.bz2"
typesFile = "instance-types_lang=en_specific.ttl.bz2"
abstractsFile = "short-abstracts_lang=en.ttl.bz2"
-dbFile = "dbpData.db"
+dbFile = "descData.db"
# Open db
dbCon = sqlite3.connect(dbFile)
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index 6462d7d..1c16a2e 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,39 +1,52 @@
-Downloaded Files
-================
+This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
+
+# Downloaded Files
- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
- Obtained via <https://dumps.wikimedia.org/backup-index.html>
- (site suggests downloading from a mirror). Contains text
- content and metadata for pages in English Wikipedia
- (current revision only, excludes talk pages). Some file
- content and format information was available from
- <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
+ Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
+ Contains text content and metadata for pages in enwiki.
+ Some file content and format information was available from
+ <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
Obtained like above. Holds lines of the form offset1:pageId1:title1,
- providing offsets, for each page, into the dump file, of a chunk of
+ providing, for each page, an offset into the dump file of a chunk of
100 pages that includes it.
-Generated Files
-===============
+# Generated Dump-Index Files
+- genDumpIndexDb.py <br>
+ Creates an sqlite-database version of the enwiki-dump index file.
- dumpIndex.db <br>
- Holds data from the enwiki dump index file. Generated by
- genDumpIndexDb.py, and used by lookupPage.py to get content for a
- given page title. <br>
+ Generated by genDumpIndexDb.py. <br>
Tables: <br>
- - offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next\_offset INT
-- enwikiData.db <br>
- Holds data obtained from the enwiki dump file, in 'pages',
- 'redirects', and 'descs' tables. Generated by genData.py, which uses
- python packages mwxml and mwparserfromhell. <br>
+ - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
+
+# Description Database Files
+- genDescData.py <br>
+ Reads through pages in the dump file, and adds short-description info to a database.
+- descData.db <br>
+ Generated by genDescData.py. <br>
Tables: <br>
- - pages: id INT PRIMARY KEY, title TEXT UNIQUE
- - redirects: id INT PRIMARY KEY, target TEXT
- - descs: id INT PRIMARY KEY, desc TEXT
-- enwikiImgs.db <br>
- Holds infobox-images obtained for some set of wiki page-ids.
- Generated by running getEnwikiImgData.py, which uses the enwiki dump
- file and dumpIndex.db. <br>
+ - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE`
+ - `redirects`: `id INT PRIMARY KEY, target TEXT`
+ - `descs`: `id INT PRIMARY KEY, desc TEXT`
+
+# Image Database Files
+- genImgData.py <br>
+ Used to find infobox image names for page IDs, storing them in a database.
+- downloadImgLicenseInfo.py <br>
+ Used to download licensing metadata for those images, via wikipedia's online API, storing it in the same database.
+- imgData.db <br>
+ Used to hold metadata about infobox images for a set of page IDs.
+ Generated using genImgData.py and downloadImgLicenseInfo.py. <br>
Tables: <br>
- - page\_imgs: page\_id INT PRIMAY KEY, img\_name TEXT
- (img\_name may be null, which is used to avoid re-processing the page-id on a second pass)
- - imgs: name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT
- (might lack some matches for 'img_name' in 'page_imgs', due to inability to get license info)
+ - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br>
+ `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
+ - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
+ Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
+- downloadEnwikiImgs.py <br>
+ Used to download image files into imgs/.
+
+# Other Files
+- lookupPage.py <br>
+ Running `lookupPage.py title1` looks in the dump for a page with a given title,
+ and prints the contents to stdout. Uses dumpIndex.db.
+
diff --git a/backend/data/enwiki/downloadEnwikiImgs.py b/backend/data/enwiki/downloadEnwikiImgs.py
index de9b862..2929a0d 100755
--- a/backend/data/enwiki/downloadEnwikiImgs.py
+++ b/backend/data/enwiki/downloadEnwikiImgs.py
@@ -16,7 +16,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgDb = "enwikiImgs.db" # About 130k image names
+imgDb = "imgData.db" # About 130k image names
outDir = "imgs"
licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE)
diff --git a/backend/data/enwiki/downloadImgLicenseInfo.py b/backend/data/enwiki/downloadImgLicenseInfo.py
index 8231fbb..097304b 100755
--- a/backend/data/enwiki/downloadImgLicenseInfo.py
+++ b/backend/data/enwiki/downloadImgLicenseInfo.py
@@ -16,7 +16,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgDb = "enwikiImgs.db" # About 130k image names
+imgDb = "imgData.db" # About 130k image names
apiUrl = "https://en.wikipedia.org/w/api.php"
batchSz = 50 # Max 50
tagRegex = re.compile(r"<[^<]+>")
diff --git a/backend/data/enwiki/genData.py b/backend/data/enwiki/genDescData.py
index 3e60bb5..032dbed 100755
--- a/backend/data/enwiki/genData.py
+++ b/backend/data/enwiki/genDescData.py
@@ -13,7 +13,7 @@ if len(sys.argv) > 1:
sys.exit(1)
dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # 22,034,540 pages
-enwikiDb = "enwikiData.db"
+enwikiDb = "descData.db"
# Some regexps and functions for parsing wikitext
descLineRegex = re.compile("^ *[A-Z'\"]")
diff --git a/backend/data/enwiki/getEnwikiImgData.py b/backend/data/enwiki/genImgData.py
index f8bb2ee..9bd28f4 100755
--- a/backend/data/enwiki/getEnwikiImgData.py
+++ b/backend/data/enwiki/genImgData.py
@@ -21,7 +21,7 @@ def getInputPageIds():
return pageIds
dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
indexDb = "dumpIndex.db"
-imgDb = "enwikiImgs.db" # Output db
+imgDb = "imgData.db" # Output db
idLineRegex = re.compile(r"<id>(.*)</id>")
imageLineRegex = re.compile(r".*\| *image *= *([^|]*)")
bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]")
diff --git a/backend/data/eol/README.md b/backend/data/eol/README.md
index 8338be0..fbb008d 100644
--- a/backend/data/eol/README.md
+++ b/backend/data/eol/README.md
@@ -1,18 +1,25 @@
-Downloaded Files
-================
-- imagesList.tgz <br>
- Obtained from https://opendata.eol.org/dataset/images-list on 24/04/2022.
- Listed as being last updated on 05/02/2020.
+This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/).
+
+# Name Data Files
- vernacularNames.csv <br>
- Obtained from https://opendata.eol.org/dataset/vernacular-names on 24/04/2022.
- Listed as being last updated on 27/10/2020.
+ Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020).
+ Contains alternative-name data from EOL.
-Generated Files
-===============
+# Image Metadata Files
+- imagesList.tgz <br>
+ Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020).
+ Contains metadata for images from EOL.
- imagesList/ <br>
- Obtained by extracting imagesList.tgz.
+ Extracted from imagesList.tgz.
- imagesList.db <br>
- Represents data from eol/imagesList/*, and is created by genImagesListDb.sh. <br>
+ Contains data from imagesList/.
+ Created by running genImagesListDb.sh, which simply imports csv files into a database. <br>
Tables: <br>
- - images:
- content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT
+ - `images`:
+ `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT`
+
+# Image Generation Files
+- downloadImgs.py <br>
+ Used to download image files into imgsForReview/.
+- reviewImgs.py <br>
+ Used to review images in imgsForReview/, moving acceptable ones into imgs/.
diff --git a/backend/data/eol/reviewImgs.py b/backend/data/eol/reviewImgs.py
index 4fea1c4..5290f9e 100755
--- a/backend/data/eol/reviewImgs.py
+++ b/backend/data/eol/reviewImgs.py
@@ -17,7 +17,7 @@ if len(sys.argv) > 1:
sys.exit(1)
imgDir = "imgsForReview/"
-outDir = "imgsReviewed/"
+outDir = "imgs/"
extraInfoDbCon = sqlite3.connect("../data.db")
extraInfoDbCur = extraInfoDbCon.cursor()
def getExtraInfo(eolId):
diff --git a/backend/data/genDbpData.py b/backend/data/genDbpData.py
index e921b6c..afe1e17 100755
--- a/backend/data/genDbpData.py
+++ b/backend/data/genDbpData.py
@@ -12,9 +12,9 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-dbpediaDb = "dbpedia/dbpData.db"
-namesToSkipFile = "genDescNamesToSkip.txt"
-pickedLabelsFile = "dbpPickedLabels.txt"
+dbpediaDb = "dbpedia/descData.db"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedDbpLabels.txt"
dbFile = "data.db"
# Open dbs
diff --git a/backend/data/genEnwikiDescData.py b/backend/data/genEnwikiDescData.py
index 2396540..dbc8d6b 100755
--- a/backend/data/genEnwikiDescData.py
+++ b/backend/data/genEnwikiDescData.py
@@ -11,10 +11,10 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-enwikiDb = "enwiki/enwikiData.db"
+enwikiDb = "enwiki/descData.db"
dbFile = "data.db"
-namesToSkipFile = "genDescNamesToSkip.txt"
-pickedLabelsFile = "enwikiPickedLabels.txt"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedEnwikiLabels.txt"
# Open dbs
enwikiCon = sqlite3.connect(enwikiDb)
diff --git a/backend/data/genEnwikiNameData.py b/backend/data/genEnwikiNameData.py
index 71960a5..8285a40 100755
--- a/backend/data/genEnwikiNameData.py
+++ b/backend/data/genEnwikiNameData.py
@@ -10,7 +10,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-enwikiDb = "enwiki/enwikiData.db"
+enwikiDb = "enwiki/descData.db"
dbFile = "data.db"
altNameRegex = re.compile(r"[a-zA-Z]+")
# Avoids names like 'Evolution of Elephants', 'Banana fiber', 'Fish (zoology)',
diff --git a/backend/data/genEolNameData.py b/backend/data/genEolNameData.py
index aa3905e..d852751 100755
--- a/backend/data/genEolNameData.py
+++ b/backend/data/genEolNameData.py
@@ -18,8 +18,8 @@ if len(sys.argv) > 1:
vnamesFile = "eol/vernacularNames.csv"
dbFile = "data.db"
NAMES_TO_SKIP = {"unknown", "unknown species", "unidentified species"}
-pickedIdsFile = "genEolNameDataPickedIds.txt"
-badAltsFile = "genEolNameDataBadAlts.txt"
+pickedIdsFile = "pickedEolIds.txt"
+badAltsFile = "pickedEolAltsToSkip.txt"
# Read in vernacular-names data
# Note: Canonical-names may have multiple pids
diff --git a/backend/data/genImgsForWeb.py b/backend/data/genImgs.py
index 3c299bb..097959f 100755
--- a/backend/data/genImgsForWeb.py
+++ b/backend/data/genImgs.py
@@ -15,12 +15,12 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgListFile = "mergedImgList.txt"
+imgListFile = "imgList.txt"
outDir = "img/"
eolImgDb = "eol/imagesList.db"
-enwikiImgDb = "enwiki/enwikiImgs.db"
+enwikiImgDb = "enwiki/imgData.db"
pickedImgsDir = "pickedImgs/"
-pickedImgsFile = "metadata.txt"
+pickedImgsFilename = "imgData.txt"
dbFile = "data.db"
IMG_OUT_SZ = 200
genImgFiles = True
@@ -37,9 +37,9 @@ enwikiCon = sqlite3.connect(enwikiImgDb)
enwikiCur = enwikiCon.cursor()
# Get 'picked images' info
nodeToPickedImg = {}
-if os.path.exists(pickedImgsDir + pickedImgsFile):
+if os.path.exists(pickedImgsDir + pickedImgsFilename):
lineNum = 0
- with open(pickedImgsDir + pickedImgsFile) as file:
+ with open(pickedImgsDir + pickedImgsFilename) as file:
for line in file:
lineNum += 1
(filename, url, license, artist, credit) = line.rstrip().split("|")
diff --git a/backend/data/genOtolData.py b/backend/data/genOtolData.py
index cfb5bed..87b35c3 100755
--- a/backend/data/genOtolData.py
+++ b/backend/data/genOtolData.py
@@ -1,6 +1,6 @@
#!/usr/bin/python3
-import sys, re
+import sys, re, os
import json, sqlite3
usageInfo = f"usage: {sys.argv[0]}\n"
@@ -30,8 +30,8 @@ annFile = "otol/annotations.json"
dbFile = "data.db"
nodeMap = {} # Maps node IDs to node objects
nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs)
-dupNameToIds = {} # Maps names of nodes with multiple IDs to those node IDs
-pickedDupsFile = "genOtolDataPickedDups.txt"
+dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs
+pickedNamesFile = "pickedOtolNames.txt"
# Parse treeFile
print("Parsing tree file")
@@ -142,10 +142,11 @@ rootId = parseNewick()
# Resolve duplicate names
print("Resolving duplicates")
nameToPickedId = {}
-with open(pickedDupsFile) as file:
- for line in file:
- (name, _, otolId) = line.rstrip().partition("|")
- nameToPickedId[name] = otolId
+if os.path.exists(pickedNamesFile):
+ with open(pickedNamesFile) as file:
+ for line in file:
+ (name, _, otolId) = line.rstrip().partition("|")
+ nameToPickedId[name] = otolId
for [dupName, ids] in dupNameToIds.items():
# Check for picked id
if dupName in nameToPickedId:
diff --git a/backend/data/genReducedTreeData.py b/backend/data/genReducedTreeData.py
index 208c937..b475794 100755
--- a/backend/data/genReducedTreeData.py
+++ b/backend/data/genReducedTreeData.py
@@ -10,7 +10,7 @@ if len(sys.argv) > 1:
sys.exit(1)
dbFile = "data.db"
-nodeNamesFile = "reducedTol/names.txt"
+nodeNamesFile = "pickedReducedNodes.txt"
minimalNames = set()
nodeMap = {} # Maps node names to node objects
PREF_NUM_CHILDREN = 3 # Attempt inclusion of children up to this limit
diff --git a/backend/data/otol/README.md b/backend/data/otol/README.md
index a6f13c2..4be2fd2 100644
--- a/backend/data/otol/README.md
+++ b/backend/data/otol/README.md
@@ -1,6 +1,10 @@
-Downloaded Files
-================
+Files
+=====
+- opentree13.4tree.tgz <br>
+ Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>.
+ Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life).
- labelled\_supertree\_ottnames.tre <br>
- Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4.
-- annotations.json <br>
- Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4.
+ Extracted from the .tgz file. Describes the structure of the tree.
+- annotations.json <br>
+ Extracted from the .tgz file. Contains additional attributes of tree
+ nodes. Used for finding out which nodes have 'phylogenetic support'.
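The .tre file is in Newick format. As a toy illustration of that format only (a deliberately simplified sketch; genOtolData.py's real parser also handles ott IDs, support values, and quoting):

```python
def parseNewick(s):
    """Parse a tiny Newick subset like '(a,(b,c)d)e;' into
    nested (name, children) tuples. Assumes a trailing ';'."""
    pos = 0
    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":  # siblings
                pos += 1
                children.append(node())
            pos += 1  # skip ')'
        start = pos
        while s[pos] not in "(),;":  # read this node's name
            pos += 1
        return (s[start:pos], children)
    return node()
```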
diff --git a/backend/data/pickedImgs/README.md b/backend/data/pickedImgs/README.md
index 52fc608..dfe192b 100644
--- a/backend/data/pickedImgs/README.md
+++ b/backend/data/pickedImgs/README.md
@@ -1,12 +1,10 @@
-This directory is used for adding additional, manually-picked images,
-to the server's dataset, overriding any from eol and enwiki. If used,
-it is expected to contain image files, and a metadata.txt file that
-holds metadata.
+This directory holds additional image files to use for tree-of-life nodes,
+on top of those from EOL and Wikipedia.
Possible Files
==============
-- Image files
-- metadata.txt <br>
- Contains lines with the format filename|url|license|artist|credit.
- The filename should be a tree-of-life node name, with an image
- extension. Other fields correspond to those in the 'images' table.
+- (Image files)
+- imgData.txt <br>
+ Contains lines with the format `filename|url|license|artist|credit`.
+ The filename should consist of a node name, with an image extension.
+ Other fields correspond to those in the `images` table (see ../README.md).
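An imgData.txt line could be unpacked the way genImgs.py does (`line.rstrip().split("|")`); the helper below is hypothetical, with made-up field values:

```python
def parseImgDataLine(line):
    """Split a 'filename|url|license|artist|credit' line into a dict."""
    (filename, url, license, artist, credit) = line.rstrip("\n").split("|")
    return {"filename": filename, "url": url, "license": license,
            "artist": artist, "credit": credit}
```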
diff --git a/backend/data/reducedTol/README.md b/backend/data/reducedTol/README.md
deleted file mode 100644
index 103bffc..0000000
--- a/backend/data/reducedTol/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-Files
-=====
-- names.txt <br>
- Contains names of nodes to be kept in a reduced Tree of Life.
diff --git a/backend/data/reviewImgsToMerge.py b/backend/data/reviewImgsToGen.py
index d177a5e..4d970ba 100755
--- a/backend/data/reviewImgsToMerge.py
+++ b/backend/data/reviewImgsToGen.py
@@ -20,13 +20,13 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-eolImgDir = "eol/imgsReviewed/"
+eolImgDir = "eol/imgs/"
enwikiImgDir = "enwiki/imgs/"
dbFile = "data.db"
-outFile = "mergedImgList.txt"
+outFile = "imgList.txt"
IMG_DISPLAY_SZ = 400
PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135))
-onlyReviewPairs = False
+onlyReviewPairs = True
# Open db
dbCon = sqlite3.connect(dbFile)