author    Terry Truong <terry06890@gmail.com>  2022-06-22 01:42:41 +1000
committer Terry Truong <terry06890@gmail.com>  2022-06-22 09:39:44 +1000
commit    e78c4df403e5f98afa08f7a0841ff233d5f6d05b (patch)
tree      f13dbf91228550075644be9766b4546eb20f1e1f /backend/data/README.md
parent    ae1467d2ab35a03eb2d7bf3e5ca1cf4634b23443 (diff)
Update backend READMEs, rename some files for consistency
Diffstat (limited to 'backend/data/README.md')
-rw-r--r--  backend/data/README.md | 232
1 file changed, 119 insertions(+), 113 deletions(-)
diff --git a/backend/data/README.md b/backend/data/README.md
index d4a6196..7d1adad 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -1,115 +1,121 @@
-File Generation Process
-=======================
-1 Tree Structure Data
- 1 Obtain data in otol/, as specified in it's README.
- 2 Run genOtolData.py, which creates data.db, and adds
- 'nodes' and 'edges' tables using data in otol/*, as well as
- genOtolNamesToKeep.txt, if present.
-2 Name Data for Search
- 1 Obtain data in eol/, as specified in it's README.
- 2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
- using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
- genEolNameDataPickedIds.txt.
-3 Node Description Data
- 1 Obtain data in dbpedia/ and enwiki/, as specified in their README files.
- 2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
- using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
- genDescNamesToSkip.txt and dbpPickedLabels.txt.
- 3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
- Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
- skipping/resolving some name-page associations.
-4 Image Data
- 1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
- It uses data in eol/imagesList.db, and the 'eol_ids' table.
- 2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
- images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
- 3 In enwiki/, run getEnwikiImgData.py, which generates a list of
- tol-node images, and creates enwiki/enwikiImgs.db to store it.
- Uses the 'wiki_ids' table to get tol-node wiki-ids.
- 4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
- information for images listed in enwiki/enwikiImgs.db, and stores
- it in that db.
- 5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
- images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
- 6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
- and enables choosing, for each tol-node, which image should be used, if any,
- and outputs choice information into mergedImgList.txt. Uses the 'nodes',
- 'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
- 7 Run genImgsForWeb.py, which creates cropped/resized images in img/,
- using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
- 'node_imgs' tables to data.db. <br>
- Smartcrop's outputs might need to be manually created/adjusted: <br>
- - An input image might have no output produced, possibly due to
- data incompatibilities, memory limits, etc. A few input image files
- might actually be html files, containing a 'file not found' page.
- - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- - An input image might produce output with unexpected dimensions.
- This seems to happen when the image is very large, and triggers a
- decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
- 8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
- which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
- nodes without images to child images.
-5 Reduced Tree Structure Data
- 1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
- data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-6 Other
- - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
- using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
- - Optionally run addPickedNames.py, which adds manually-picked names to
- the 'names' table, as specified in pickedNames.txt.
- - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
- for the sake of performance and result-relevance. Without this, jumping to certain
- nodes within the fungi and moths can take over a minute to render.
+This directory holds files used to generate data.db, which contains tree-of-life data.
-data.db Tables
-==============
-- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-- eol\_ids: id INT PRIMARY KEY, name TEXT
-- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT
-- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT
-- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-- r\_nodes: name TEXT PRIMARY KEY, tips INT
-- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
+# Tables:
+- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
+- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
+- `eol_ids`: `id INT PRIMARY KEY, name TEXT`
+- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
+- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT`
+- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
+- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
+- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
+- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT`
+- `r_nodes`: `name TEXT PRIMARY KEY, tips INT`
+- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
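
Assuming data.db is an SQLite database (consistent with the scripts producing a single .db file), the table list above corresponds to a schema along these lines. This is a sketch for reference, not code taken from the generation scripts:

```python
import sqlite3

# Schema sketch mirroring the table list above (column types as documented)
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
CREATE TABLE edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY (name, alt_name));
CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT);
CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT);
CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT);
CREATE TABLE images (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src));
CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT);
CREATE TABLE r_nodes (name TEXT PRIMARY KEY, tips INT);
CREATE TABLE r_edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
"""

def create_db(path=":memory:"):
    """Create a database with the documented tables."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```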
-Other Files
-===========
-- dbpPickedLabels.txt <br>
- Contains DBpedia labels, one per line. Used by genDbpData.py to help
- resolve conflicts when associating tree-of-life node names with
- DBpedia node labels.
-- genOtolNamesToKeep.txt <br>
- Contains names to avoid trimming off the tree data generated by
- genOtolData.py. Usage is optional, but, without it, a large amount
- of possibly-significant nodes are removed, using a short-sighted
- heuristic. <br>
- One way to generate this list is to generate the files as usual,
- then get node names that have an associated image, description, or
- presence in r_nodes. Then run the genOtolData.py and genEolNameData.py
- scripts again (after deleting their created tables).
-- genEnwikiDescNamesToSkip.txt <br>
- Contains names for nodes that genEnwikiNameData.py should skip adding
- a description for. Usage is optional, but without it, some nodes will
- probably get descriptions that don't match (eg: the bee genus Osiris
- might be described as an egyptian god). <br>
- This file was generated by running genEnwikiNameData.py, then listing
- the names that it added into a file, along with descriptions, and
- manually removing those that seemed node-matching (got about 30k lines,
- with about 1 in 30 descriptions non-matching). And, after creating
- genEnwikiDescTitlesToUse.txt, names shared with that file were removed.
-- genEnwikiDescTitlesToUse.txt <br>
- Contains enwiki titles with the form 'name1 (category1)' for
- genEnwikiNameData.py to use to resolve nodes matching name name1.
- Usage is optional, but it adds some descriptions that would otherwise
- be skipped. <br>
- This file was generated by taking the content of genEnwikiNameData.py,
- after the manual filtering step, then, for each name name,1 getting
- page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
- This was followed by manually removing lines, keeping those that
- seemed to match the corresponding node (used the app to help with this).
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and disk space. The tree will probably
+have about 2.5 million nodes. Downloading the images will take several days, and occupy over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in Python and Bash.
+Some of the Python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- Pillow (imported as `PIL`): For image processing.
+- tkinter: For providing a basic GUI to review images (part of the standard
+  library, but may need a separate OS package).
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+ using data in otol/. It also uses these files, if they exist:
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+ tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+      Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+ This file can be used to manually specify which node should be named 'name1'.
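
The naming scheme above might be implemented roughly like this; `assign_display_names` is an illustrative helper, not code from genOtolData.py:

```python
def assign_display_names(nodes, picked=None):
    """Give duplicate names 'name1', 'name1 [2]', 'name1 [3]', etc.

    nodes: list of (name, otol_id) pairs, in tree order.
    picked: optional dict name -> otol_id, as might be parsed from
        pickedOtolNames.txt, naming which node keeps the bare 'name1'.
    Returns a dict otol_id -> display name.
    """
    picked = picked or {}
    counts = {}
    result = {}
    # First pass: a picked node always gets the bare name
    for name, otol_id in nodes:
        if picked.get(name) == otol_id:
            result[otol_id] = name
            counts[name] = 1
    # Second pass: remaining duplicates get bracketed suffixes
    for name, otol_id in nodes:
        if otol_id in result:
            continue
        n = counts.get(name, 0) + 1
        counts[name] = n
        result[otol_id] = name if n == 1 else f"{name} [{n}]"
    return result
```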
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+ eol/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+ Specifies node names that should have a particular EOL ID, or no ID.
+      Quite a few taxa have ambiguous names, and may need manual correction.
+ For example, Viola may resolve to a taxon of butterflies or of plants.
+ - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+ Specifies that a node's alt-name set should exclude altName1.
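
The picked files above share a simple `name|value` line format. A sketch of a parser (the function name is illustrative, not taken from the scripts):

```python
def parse_picked_file(lines):
    """Parse lines like 'nodeName1|eolId1' or 'nodeName1|' (meaning: no ID).

    Returns a dict mapping node names to a value string, or None for 'no value'.
    """
    picked = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition("|")
        picked[name] = value or None
    return picked
```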
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+ dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+   - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+     no description should be obtained. Many node names have a same-name
+     Wikipedia page that describes something different (eg: Osiris is a bee
+     genus, but also an Egyptian god).
+ - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+ Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in its README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+ using data in enwiki/ and the `nodes` table.
+ It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+ - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in it's README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per
+   node) into eol/imgsForReview/, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images
+   for each node, lets you choose which to use, and moves chosen images to
+   eol/imgs/. Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
+ using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+ those images, using wikipedia's online API.
+3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+ images into enwiki/imgs/.
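
The licensing filter might look like the following. The set of accepted licenses here is an assumption for illustration, not the scripts' actual list:

```python
# Assumed set of 'permissive' licenses; the real scripts may accept a
# different list (eg: specific CC versions, GFDL, etc.)
PERMISSIVE_LICENSES = {"cc0", "cc-by", "cc-by-sa", "public domain"}

def is_permissive(license_str):
    """Check an image's license string against the accepted set."""
    return license_str.strip().lower() in PERMISSIVE_LICENSES

def filter_images(img_rows):
    """Keep only (title, license) rows whose license is permissive."""
    return [(title, lic) for title, lic in img_rows if is_permissive(lic)]
```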
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+   lets you choose which image (if any) each node should use, and writes the
+   choices to imgList.txt. Uses the `nodes`, `eol_ids`, and `wiki_ids` tables
+   (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+ imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+ `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+ The outputs might need to be manually created/adjusted:
+ - An input image might have no output produced, possibly due to
+ data incompatibilities, memory limits, etc. A few input image files
+ might actually be html files, containing a 'file not found' page.
+ - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+ - An input image might produce output with unexpected dimensions.
+ This seems to happen when the image is very large, and triggers a
+ decompression bomb warning.
+ The result might have as many as 150k images, with about 2/3 of them
+ being from wikipedia.
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate each node that lacks an image
+   with images of its children. Adds the `linked_imgs` table, and uses the
+   `nodes`, `edges`, and `node_imgs` tables.
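
The child-image lookup can be sketched as a breadth-first search over the tree; this is an illustrative rendition, not genLinkedImgs.py itself:

```python
from collections import deque

def find_linked_img(name, children, has_img):
    """Find the nearest descendant of 'name' that has its own image.

    children: dict node name -> list of child names (as in the `edges` table).
    has_img: set of node names that have images (as in `node_imgs`).
    Returns the descendant's name, or None if no descendant has an image.
    """
    queue = deque(children.get(name, []))
    while queue:
        node = queue.popleft()
        if node in has_img:
            return node
        queue.extend(children.get(node, []))
    return None
```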
+
+## Do some post-processing
+1. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
+ adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
+ pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
+2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
+ for the sake of performance and result-relevance. Otherwise, some nodes may have
+ over 10k children, which can take a while to render (over a minute in my testing).
+   You might want to back up the untrimmed tree first, as this operation is not
+   easily reversible.
+3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table,
+ using data in enwiki/, and the `names` and `wiki_ids` tables.
+4. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+ the `names` table, as specified in pickedNames.txt.
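
A possible shape for the trimming step above; the 'significance' criterion used here (child tip counts) and the threshold are assumptions, not necessarily what trimTree.py actually does:

```python
def trim_children(children, tips, max_children=500):
    """Cap each node's child count, keeping the children with the most tips.

    children: dict node name -> list of child names.
    tips: dict node name -> tip count (as in the `nodes` table).
    Returns (new_children, dropped), where dropped is the set of removed
    child names (their own subtrees would need a follow-up cleanup pass).
    """
    trimmed = {}
    dropped = set()
    for node, kids in children.items():
        kept = sorted(kids, key=lambda c: tips.get(c, 0), reverse=True)[:max_children]
        kept_set = set(kept)
        dropped.update(k for k in kids if k not in kept_set)
        trimmed[node] = [k for k in kids if k in kept_set]  # preserve order
    return trimmed, dropped
```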