From e8e58a3bb9dc233dacf573973457c5b48d369503 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Tue, 30 Aug 2022 12:27:42 +1000
Subject: Add scripts for generating eol/enwiki mappings

- New data sources: OTOL taxonomy, EOL provider-ids, Wikidata dump
- Add 'node_iucn' table
- Remove 'redirected' field from 'wiki_ids' table
- Make 'eol_ids' table have 'name' as the primary key
- Combine name-generation scripts into genNameData.py
- Combine description-generation scripts into genDescData.py
---
 backend/tolData/README.md | 110 +++++++++++++++++++++-------------------------
 1 file changed, 51 insertions(+), 59 deletions(-)

diff --git a/backend/tolData/README.md b/backend/tolData/README.md
index 21c02ab..1248098 100644
--- a/backend/tolData/README.md
+++ b/backend/tolData/README.md
@@ -4,24 +4,24 @@ This directory holds files used to generate the tree-of-life database data.db.
 ## Tree Structure
 - `nodes`
   Format: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
-  Represents a tree-of-life node. `tips` holds the number of no-child descendants.
+  Represents a tree-of-life node. `tips` holds the number of no-child descendants
 - `edges`
   Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)`
   `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise
-## Node Names
+## Node Mappings
 - `eol_ids`
-  Format: `id INT PRIMARY KEY, name TEXT`
-  Associates an EOL ID with a node's name.
+  Format: `name TEXT PRIMARY KEY, id INT`
+  Associates nodes with EOL IDs
+- `wiki_ids`
+  Format: `name TEXT PRIMARY KEY, id INT`
+  Associates nodes with wikipedia page IDs
+## Node Vernacular Names
 - `names`
   Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
   Associates a node with alternative names. `pref_alt` is 1 if the alt-name is the most
   'preferred' one. `src` indicates the dataset the alt-name was obtained from
   (can be 'eol', 'enwiki', or 'picked').
 ## Node Descriptions
-- `wiki_ids`
-  Format: `name TEXT PRIMARY KEY, id INT, redirected INT`
-  Associates a node with a wikipedia page ID.
-  `redirected` is 1 if the node was associated with a different page that redirected to this one.
 - `descs`
   Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
   Associates a wikipedia page ID with a short-description.
@@ -42,61 +42,62 @@ This directory holds files used to generate the tree-of-life database data.db.
   These are like `nodes`, but describe nodes of reduced trees.
 - `edges_t`, `edges_i`, `edges_p`
   Like `edges` but for reduced trees.
+## Other
+- `node_iucn`
+  Format: `name TEXT PRIMARY KEY, iucn TEXT`
+  Associates nodes with IUCN conservation status strings (eg: 'endangered')
 
 # Generating the Database
-For the most part, these steps should be done in order.
-
-As a warning, the whole process takes a lot of time and file space. The tree will probably
-have about 2.5 billion nodes. Downloading the images takes several days, and occupies over
-200 GB. And if you want good data, you'll likely need to make additional corrections,
-which can take several weeks.
+As a warning, the whole process takes a lot of time and file space. The
+tree will probably have about 2.6 million nodes. Downloading the images
+takes several days, and occupies over 200 GB.
 
 ## Environment
 Some of the scripts use third-party packages:
-- jsonpickle: For encoding class objects as JSON.
-- requests: For downloading data.
-- PIL: For image processing.
-- tkinter: For providing a basic GUI to review images.
-- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+- `indexed_bzip2`: For parallelised bzip2 processing.
+- `jsonpickle`: For encoding class objects as JSON.
+- `requests`: For downloading data.
+- `PIL`: For image processing.
+- `tkinter`: For providing a basic GUI to review images.
+- `mwxml`, `mwparserfromhell`: For parsing Wikipedia dumps.
 
 ## Generate Tree Structure Data
-1. Obtain files in otol/, as specified in it's README.
+1. Obtain 'tree data files' in otol/, as specified in its README.
 2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges`
    tables, using data in otol/. It also uses these files, if they exist:
-   - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. When nodes in the
-     tree have the same name (eg: Pholidota can refer to pangolins or orchids),
-     they get the names 'name1', 'name1 [2]', 'name1 [3]', etc. This file is used to
-     forcibly specify which node should be named 'name1'.
+   - pickedOtolNames.txt: Has lines of the form `name1|otolId1`.
+     Can be used to override numeric suffixes added to same-name nodes.
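For illustration, the `nodes` and `edges` tables that this step produces can be sketched with Python's sqlite3 module. The column definitions follow the table descriptions at the top of this README; the sample rows, tip counts, and ID strings are made-up placeholders, not the scripts' actual output:

```python
import sqlite3

# Schema as documented in the 'Tree Structure' section of this README.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
conn.execute(
    "CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, "
    "PRIMARY KEY (parent, child))")

# Placeholder rows (names, IDs, and tip counts are illustrative only).
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("cellular organisms", "ott93302", 2),
    ("Metazoa", "ott691846", 1),
    ("Fungi", "ott352914", 1),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("cellular organisms", "Metazoa", 1),
    ("cellular organisms", "Fungi", 1),
])

def tip_sum(parent):
    """Sum the `tips` values of a node's children, via the `edges` table."""
    rows = conn.execute(
        "SELECT n.tips FROM edges e JOIN nodes n ON n.name = e.child "
        "WHERE e.parent = ?", (parent,)).fetchall()
    return sum(t for (t,) in rows)
```

Since `tips` counts no-child descendants, a parent's `tips` value should equal the sum over its children, as `tip_sum` checks here.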
+
+## Generate Dataset Mappings
+1. Obtain 'taxonomy data files' in otol/, 'mapping files' in eol/,
+   files in wikidata/, and 'dump-index files' in enwiki/, as specified
+   in their READMEs.
+2. Run genMappingData.py, which adds the `eol_ids` and `wiki_ids` tables,
+   using the files obtained above, and the `nodes` table. It also uses
+   'picked mappings' files, if they exist.
+   - pickedEolIds.txt contains lines like `3785967|405349`, specifying
+     an otol ID and an eol ID to map it to. The eol ID can be empty,
+     in which case the otol ID won't be mapped.
+   - pickedWikiIds.txt and pickedWikiIdsRough.txt contain lines like
+     `5341349|Human`, specifying an otol ID and an enwiki title,
+     which may contain spaces. The title can be empty.
 
-## Generate Node Names Data
-1. Obtain 'name data files' in eol/, as specified in it's README.
-2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
-   eol/ and the `nodes` table. It also uses these files, if they exist:
-   - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
-     Specifies node names that should have a particular EOL ID, or no ID.
-     Quite a few taxons have ambiguous names, and may need manual correction.
-     For example, Viola may resolve to a taxon of butterflies or of plants.
-   - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
-     Specifies that a node's alt-name set should exclude altName1.
+## Generate Node Name Data
+1. Obtain 'name data files' in eol/, and 'description database files' in enwiki/,
+   as specified in their READMEs.
+2. Run genNameData.py, which adds the `names` table, using data in eol/ and enwiki/,
+   along with the `nodes`, `eol_ids`, and `wiki_ids` tables.
+   It also uses pickedNames.txt, if it exists. This file can hold lines like
+   `embryophyta|land plant|1`, specifying a node name, an alt-name to add for it,
+   and a 1 or 0 indicating whether it is a 'preferred' alt-name. The last field
+   can be empty, which indicates that the alt-name should be removed, or, if the
+   alt-name is the same as the node name, that no alt-name should be preferred.
 
 ## Generate Node Description Data
-### Get Data from DBpedia
 1. Obtain files in dbpedia/, as specified in its README.
-2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
-   dbpedia/ and the `nodes` table. It also uses these files, if they exist:
-   - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
-     no description should be obtained. Many node names have a same-name
-     wikipedia page that describes something different (eg: Osiris).
-   - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
-     Specifies node names that should have a particular associated page label.
-### Get Data from Wikipedia
-1. Obtain 'description database files' in enwiki/, as specified in it's README.
-2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
-   using data in enwiki/ and the `nodes` table.
-   It also uses these files, if they exist:
-   - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
-   - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+2. Run genDescData.py, which adds the `descs` table, using data in dbpedia/ and
+   enwiki/, and the `nodes` table.
 
 ## Generate Node Images Data
 ### Get images from EOL
@@ -129,21 +130,12 @@ Some of the scripts use third-party packages:
   - An input image might produce output with unexpected dimensions. This seems to
     happen when the image is very large, and triggers a decompression bomb warning.
-  In testing, this resulted in about 150k images, with about 2/3 of them
-  being from Wikipedia.
 ### Add more Image Associations
 1. Run genLinkedImgs.py, which tries to associate nodes without images to images of
    its children. Adds the `linked_imgs` table, and uses the `nodes`, `edges`,
    and `node_imgs` tables.
 
 ## Do some Post-Processing
-1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
-   using data in enwiki/, and the `names` and `wiki_ids` tables.
-2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
-   the `names` table, as specified in pickedNames.txt.
-   - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
-     These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
-     A line like `name1|name1|1` causes a node to have no preferred alt-name.
-3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
+1. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
    adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`.
    Reads from pickedNodes.txt, which lists names of nodes that must be included (1 per line).
-- 
cgit v1.2.3
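The various 'picked' files used by the steps in this patch share a simple pipe-separated format. A rough sketch of parsing a pickedNames.txt-style line, based only on the format described in the README (this is illustrative, not the scripts' actual parsing code):

```python
def parse_picked_name(line):
    """Parse a pickedNames.txt-style line: `name|alt_name|pref_alt`.

    Per the README, the last field is '1', '0', or empty; empty means
    the alt-name should be removed (or, when alt_name equals name, that
    no alt-name should be preferred). An empty field is returned as None.
    """
    name, alt_name, pref = line.rstrip("\n").split("|")
    return name, alt_name, (None if pref == "" else int(pref))
```

For example, `parse_picked_name("embryophyta|land plant|1")` yields `("embryophyta", "land plant", 1)`.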