| field | value | date |
|---|---|---|
| author | Terry Truong <terry06890@gmail.com> | 2022-06-22 01:42:41 +1000 |
| committer | Terry Truong <terry06890@gmail.com> | 2022-06-22 09:39:44 +1000 |
| commit | e78c4df403e5f98afa08f7a0841ff233d5f6d05b | |
| tree | f13dbf91228550075644be9766b4546eb20f1e1f /backend/data/README.md | |
| parent | ae1467d2ab35a03eb2d7bf3e5ca1cf4634b23443 | |
Update backend READMEs, rename some files for consistency
Diffstat (limited to 'backend/data/README.md')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | backend/data/README.md | 232 |

1 file changed, 119 insertions, 113 deletions
```diff
diff --git a/backend/data/README.md b/backend/data/README.md
index d4a6196..7d1adad 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -1,115 +1,121 @@
-File Generation Process
-=======================
-1 Tree Structure Data
-    1 Obtain data in otol/, as specified in it's README.
-    2 Run genOtolData.py, which creates data.db, and adds
-        'nodes' and 'edges' tables using data in otol/*, as well as
-        genOtolNamesToKeep.txt, if present.
-2 Name Data for Search
-    1 Obtain data in eol/, as specified in it's README.
-    2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
-        using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
-        genEolNameDataPickedIds.txt.
-3 Node Description Data
-    1 Obtain data in dbpedia/ and enwiki/, as specified in their README files.
-    2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
-        using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
-        genDescNamesToSkip.txt and dbpPickedLabels.txt.
-    3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
-        using data in enwiki/enwikiData.db, and the 'nodes' table.
-        Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
-        skipping/resolving some name-page associations.
-4 Image Data
-    1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
-        It uses data in eol/imagesList.db, and the 'eol_ids' table.
-    2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
-        images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
-    3 In enwiki/, run getEnwikiImgData.py, which generates a list of
-        tol-node images, and creates enwiki/enwikiImgs.db to store it.
-        Uses the 'wiki_ids' table to get tol-node wiki-ids.
-    4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
-        information for images listed in enwiki/enwikiImgs.db, and stores
-        it in that db.
-    5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
-        images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
-    6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
-        and enables choosing, for each tol-node, which image should be used, if any,
-        and outputs choice information into mergedImgList.txt. Uses the 'nodes',
-        'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
-    7 Run genImgsForWeb.py, which creates cropped/resized images in img/,
-        using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
-        'node_imgs' tables to data.db. <br>
-        Smartcrop's outputs might need to be manually created/adjusted: <br>
-        - An input image might have no output produced, possibly due to
-            data incompatibilities, memory limits, etc. A few input image files
-            might actually be html files, containing a 'file not found' page.
-        - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
-        - An input image might produce output with unexpected dimensions.
-            This seems to happen when the image is very large, and triggers a
-            decompression bomb warning.
-        The result might have as many as 150k images, with about 2/3 of them
-        being from wikipedia.
-    8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
-        which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
-        nodes without images to child images.
-5 Reduced Tree Structure Data
-    1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
-        data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-6 Other
-    - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
-        using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
-    - Optionally run addPickedNames.py, which adds manually-picked names to
-        the 'names' table, as specified in pickedNames.txt.
-    - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
-        for the sake of performance and result-relevance. Without this, jumping to certain
-        nodes within the fungi and moths can take over a minute to render.
+This directory holds files used to generate data.db, which contains tree-of-life data.
-
-data.db Tables
-==============
-- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-- eol\_ids: id INT PRIMARY KEY, name TEXT
-- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT
-- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT
-- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-- r\_nodes: name TEXT PRIMARY KEY, tips INT
-- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
+
+# Tables:
+- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
+- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
+- `eol_ids`: `id INT PRIMARY KEY, name TEXT`
+- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
+- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT`
+- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
+- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
+- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
+- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT`
+- `r_nodes`: `name TEXT PRIMARY KEY, tips INT`
+- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
-
-Other Files
-===========
-- dbpPickedLabels.txt <br>
-    Contains DBpedia labels, one per line. Used by genDbpData.py to help
-    resolve conflicts when associating tree-of-life node names with
-    DBpedia node labels.
-- genOtolNamesToKeep.txt <br>
-    Contains names to avoid trimming off the tree data generated by
-    genOtolData.py. Usage is optional, but, without it, a large amount
-    of possibly-significant nodes are removed, using a short-sighted
-    heuristic. <br>
-    One way to generate this list is to generate the files as usual,
-    then get node names that have an associated image, description, or
-    presence in r_nodes. Then run the genOtolData.py and genEolNameData.py
-    scripts again (after deleting their created tables).
-- genEnwikiDescNamesToSkip.txt <br>
-    Contains names for nodes that genEnwikiNameData.py should skip adding
-    a description for. Usage is optional, but without it, some nodes will
-    probably get descriptions that don't match (eg: the bee genus Osiris
-    might be described as an egyptian god). <br>
-    This file was generated by running genEnwikiNameData.py, then listing
-    the names that it added into a file, along with descriptions, and
-    manually removing those that seemed node-matching (got about 30k lines,
-    with about 1 in 30 descriptions non-matching). And, after creating
-    genEnwikiDescTitlesToUse.txt, names shared with that file were removed.
-- genEnwikiDescTitlesToUse.txt <br>
-    Contains enwiki titles with the form 'name1 (category1)' for
-    genEnwikiNameData.py to use to resolve nodes matching name name1.
-    Usage is optional, but it adds some descriptions that would otherwise
-    be skipped. <br>
-    This file was generated by taking the content of genEnwikiNameData.py,
-    after the manual filtering step, then, for each name name1, getting
-    page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
-    This was followed by manually removing lines, keeping those that
-    seemed to match the corresponding node (used the app to help with this).
+
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and file space. The tree will probably
+have about 2.5 million nodes. Downloading the images will take several days, and occupy over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in Python and Bash.
+Some of the Python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- PIL: For image processing.
+- tkinter: For providing a basic GUI to review images.
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+   using data in otol/. It also uses these files, if they exist:
+    - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+      tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+      Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+      This file can be used to manually specify which node should be named 'name1'.
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+   eol/ and the `nodes` table. It also uses these files, if they exist:
+    - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+      Specifies node names that should have a particular EOL ID, or no ID.
+      Quite a few taxa have ambiguous names, and may need manual correction.
+      For example, Viola may resolve to a taxon of butterflies or of plants.
+    - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+      Specifies that a node's alt-name set should exclude altName1.
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+   dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+    - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+      no description should be obtained. Many node names have a same-name
+      Wikipedia page that describes something different (eg: Osiris).
+    - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+      Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in its README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+   using data in enwiki/ and the `nodes` table.
+   It also uses these files, if they exist:
+    - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+    - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in its README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node)
+   into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
+   each node, providing the choice of which to use, moving them to eol/imgs/.
+   Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for Wikipedia image names for each node,
+   using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+   those images, using Wikipedia's online API.
+3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+   images into enwiki/imgs/.
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+   and enables choosing, for each node, which image should be used, if any,
+   and outputs choice information into imgList.txt. Uses the `nodes`,
+   `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+   imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+   `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+   The outputs might need to be manually created/adjusted:
+    - An input image might have no output produced, possibly due to
+      data incompatibilities, memory limits, etc. A few input image files
+      might actually be HTML files, containing a 'file not found' page.
+    - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+    - An input image might produce output with unexpected dimensions.
+      This seems to happen when the image is very large, and triggers a
+      decompression bomb warning.
+   The result might have as many as 150k images, with about 2/3 of them
+   being from Wikipedia.
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate nodes without images with
+   images of their children. Adds the `linked_imgs` table, and uses the
+   `nodes`, `edges`, and `node_imgs` tables.
+
+## Do some post-processing
+1. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
+   adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
+   pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
+2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
+   for the sake of performance and result-relevance. Otherwise, some nodes may have
+   over 10k children, which can take a while to render (over a minute in my testing).
+   You might want to back up the untrimmed tree first, as this operation is not easily
+   reversible.
+3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table,
+   using data in enwiki/, and the `names` and `wiki_ids` tables.
+4. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+   the `names` table, as specified in pickedNames.txt.
```
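The table layouts listed in the new README's Tables section can be sketched directly as SQLite DDL. This is only an illustration of the resulting schema; the actual scripts create these tables incrementally, one group per script, rather than all at once:

```python
import sqlite3

# Schema as listed in the README's Tables section.
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
CREATE TABLE edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY (name, alt_name));
CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT);
CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT);
CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT);
CREATE TABLE images (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src));
CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT);
CREATE TABLE r_nodes (name TEXT PRIMARY KEY, tips INT);
CREATE TABLE r_edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child));
"""

def create_db(path: str) -> sqlite3.Connection:
    """Create a data.db with all tables listed in the README."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```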
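Several of the picked* override files described above (pickedOtolNames.txt, pickedEolIds.txt, pickedDbpLabels.txt) share a `name1|value1` line format, with an empty value (as in `nodeName1|`) meaning "no association". A hypothetical parser for that format, assuming one mapping per line, might look like:

```python
def parse_picked_file(lines):
    """Parse `name|value` override lines into a dict.

    An empty value (e.g. 'Osiris|') maps to None, meaning the
    name should get no association at all.
    """
    picked = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition("|")
        picked[name] = value or None
    return picked
```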
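genImgs.py produces cropped/resized images for the web. A minimal sketch of that kind of operation using PIL (one of the listed dependencies); the centre-crop strategy and the 200x200 target size are assumptions for illustration, not taken from the actual script:

```python
from PIL import Image

def make_square_thumb(img: Image.Image, size: int = 200) -> Image.Image:
    """Centre-crop an image to a square, then resize it.

    The 200px output size is an assumed value, not the script's actual setting.
    """
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2  # centre the crop box
    return img.crop((left, top, left + side, top + side)).resize((size, size))
```

Note that, as the README warns, very large inputs may trigger PIL's decompression-bomb warning, and some downloaded "images" may actually be HTML error pages that fail to open at all.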
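The idea behind genLinkedImgs.py, associating an image-less node with images from its children, can be sketched as a breadth-first search down the tree for the nearest descendant that has an image. This is a simplified illustration (the real script works against the `edges` and `node_imgs` tables, and stores results in `linked_imgs`):

```python
from collections import deque

def find_linked_img(node, children, has_img):
    """Return the nearest descendant of `node` that has an image, or None.

    `children` maps a node name to its child names (like the `edges` table);
    `has_img` is the set of node names present in `node_imgs`.
    """
    queue = deque(children.get(node, []))
    while queue:
        cur = queue.popleft()
        if cur in has_img:
            return cur
        queue.extend(children.get(cur, []))
    return None
```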
