This directory holds files used to generate the tree-of-life database `data.db`.

# Database Tables
## Tree Structure
- `nodes` <br>
  Format: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
  Represents a tree-of-life node. `tips` holds the number of descendants
  that have no children.
- `edges` <br>
  Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
  `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise.
## Node Mappings
- `eol_ids` <br>
  Format: `name TEXT PRIMARY KEY, id INT` <br>
  Associates nodes with EOL IDs.
- `wiki_ids` <br>
  Format: `name TEXT PRIMARY KEY, id INT` <br>
  Associates nodes with Wikipedia page IDs.
## Node Vernacular Names
- `names` <br>
  Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY (name, alt_name)` <br>
  Associates a node with alternative names.
  `pref_alt` is 1 if the alt-name is the most 'preferred' one.
  `src` indicates the dataset the alt-name was obtained from
  (can be 'eol', 'enwiki', or 'picked').
## Node Descriptions
- `descs` <br>
  Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
  Associates a Wikipedia page ID with a short description.
  `from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise.
## Node Images
- `node_imgs` <br>
  Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br>
  Associates a node with an image.
- `images` <br>
  Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` <br>
  Represents an image, identified by a source ('eol', 'enwiki', or 'picked')
  and a source-specific ID.
- `linked_imgs` <br>
  Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br>
  Associates a node with an image from another node.
  `otol_ids` can be an otol ID, or (for compound nodes) two comma-separated
  strings that may each be an otol ID or empty.
## Reduced Trees
- `nodes_t`, `nodes_i`, `nodes_p` <br>
  These are like `nodes`, but describe nodes of reduced trees.
- `edges_t`, `edges_i`, `edges_p` <br>
  Like `edges`, but for reduced trees.
## Other
- `node_iucn` <br>
  Format: `name TEXT PRIMARY KEY, iucn TEXT` <br>
  Associates nodes with IUCN conservation status strings (e.g. 'endangered').
- `node_pop` <br>
  Format: `name TEXT PRIMARY KEY, pop INT` <br>
  Associates nodes with popularity values (higher means more popular).

# Generating the Database

As a warning, the whole process takes a lot of time and disk space. The
tree will probably have about 2.6 million nodes, and downloading the images
takes several days and occupies over 200 GB.

## Environment
Some of the scripts use third-party packages:
- `indexed_bzip2`: For parallelised bzip2 processing.
- `jsonpickle`: For encoding class objects as JSON.
- `requests`: For downloading data.
- `PIL`: For image processing.
- `tkinter`: For providing a basic GUI to review images.
- `mwxml`, `mwparserfromhell`: For parsing Wikipedia dumps.

## Generate Tree Structure Data
1. Obtain 'tree data files' in otol/, as specified in its README.
2. Run genOtolData.py, which creates data.db and adds the `nodes` and `edges`
   tables, using data in otol/. It also uses these files, if they exist:
   - pickedOtolNames.txt: Has lines of the form `name1|otolId1`.
     Can be used to override numeric suffixes added to same-name nodes.

## Generate Dataset Mappings
1. Obtain 'taxonomy data files' in otol/, 'mapping files' in eol/,
   files in wikidata/, and 'dump-index files' in enwiki/, as specified
   in their READMEs.
2. Run genMappingData.py, which adds the `eol_ids` and `wiki_ids` tables,
   as well as `node_iucn`. It uses the files obtained above, the `nodes` table,
   and 'picked mappings' files, if they exist.
   - pickedEolIds.txt contains lines like `3785967|405349`, specifying
     an otol ID and an eol ID to map it to. The eol ID can be empty,
     in which case the otol ID won't be mapped.
   - pickedWikiIds.txt and pickedWikiIdsRough.txt contain lines like
     `5341349|Human`, specifying an otol ID and an enwiki title,
     which may contain spaces. The title can be empty.

## Generate Node Name Data
1. Obtain 'name data files' in eol/, and 'description database files' in
   enwiki/, as specified in their READMEs.
2. Run genNameData.py, which adds the `names` table, using data in eol/ and
   enwiki/, along with the `nodes`, `eol_ids`, and `wiki_ids` tables. <br>
   It also uses pickedNames.txt, if it exists. This file can hold lines like
   `embryophyta|land plant|1`, specifying a node name, an alt-name to add for
   it, and a 1 or 0 indicating whether it is a 'preferred' alt-name. The last
   field can be empty, which indicates that the alt-name should be removed,
   or, if the alt-name is the same as the node name, that no alt-name should
   be preferred.

## Generate Node Description Data
1. Obtain files in dbpedia/, as specified in its README.
2. Run genDescData.py, which adds the `descs` table, using data in dbpedia/
   and enwiki/, and the `nodes` table.

## Generate Node Images Data
### Get Images from EOL
1. Obtain 'image metadata files' in eol/, as specified in its README.
2. In eol/, run downloadImgs.py, which downloads images (possibly multiple
   per node) into eol/imgsForReview, using data in eol/, as well as the
   `eol_ids` table.
3. In eol/, run reviewImgs.py, which interactively displays the downloaded
   images for each node, provides the choice of which to use, and moves the
   chosen ones to eol/imgs/. Uses `names` and `eol_ids` to display extra info.
### Get Images from Wikipedia
1. In enwiki/, run genImgData.py, which looks for Wikipedia image names for
   each node, using the `wiki_ids` table, and stores them in a database.
2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
   information for those images, using Wikipedia's online API.
3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
   images into enwiki/imgs/.
### Merge the Image Sets
1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and
   enwiki/imgs/, enables choosing, for each node, which image (if any) should
   be used, and outputs the choices to imgList.txt. Uses the `nodes`,
   `eol_ids`, and `wiki_ids` tables (as well as `names`, to display extra info).
2. Run genImgs.py, which creates cropped/resized images in img/, from files
   listed in imgList.txt and located in eol/ and enwiki/, and creates the
   `node_imgs` and `images` tables. If pickedImgs/ is present, images within
   it are also used. <br>
   The outputs might need to be manually created/adjusted:
   - An input image might have no output produced, possibly due to
     data incompatibilities, memory limits, etc. A few input image files
     might actually be HTML files, containing a 'file not found' page.
   - An input x.gif might produce x-1.jpg, x-2.jpg, etc., instead of x.jpg.
   - An input image might produce output with unexpected dimensions.
     This seems to happen when the image is very large and triggers a
     decompression-bomb warning.
### Add More Image Associations
1. Run genLinkedImgs.py, which tries to associate nodes that lack images
   with images of their children. Adds the `linked_imgs` table, and uses the
   `nodes`, `edges`, and `node_imgs` tables.

## Generate Reduced Trees
1. Run genReducedTrees.py, which generates multiple reduced versions of the
   tree, adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`.
   Reads from pickedNodes.txt, which lists names of nodes that must be
   included (one per line).

## Generate Node Popularity Data
1. Obtain 'page view files' in enwiki/, as specified in its README.
2. Run genPopData.py, which adds the `node_pop` table, using data in enwiki/
   and the `wiki_ids` table.
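
# Example: Querying the Tree Structure

The core tree tables described under 'Database Tables' can be sketched with
Python's built-in `sqlite3` module. The table and column definitions below
follow the formats listed above; the inserted rows and otol IDs are purely
hypothetical, for illustration only (the real `data.db` is built by the
scripts described in this README):

```python
import sqlite3

# In-memory sketch of the `nodes` and `edges` schema described above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT);
CREATE TABLE edges (parent TEXT, child TEXT, p_support INT,
    PRIMARY KEY (parent, child));
""")

# Hypothetical sample rows (names, IDs, and tip counts are made up).
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("life",     "ott805080", 2),
    ("archaea",  "ott996421", 1),
    ("bacteria", "ott844192", 1),
])
db.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("life", "archaea",  1),  # edge with phylogenetic support
    ("life", "bacteria", 0),  # edge without
])

# Find children of a node whose edge has phylogenetic support.
rows = db.execute(
    "SELECT child FROM edges WHERE parent = ? AND p_support = 1",
    ("life",)).fetchall()
print([r[0] for r in rows])  # ['archaea']
```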

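# Example: Parsing a Picked File

Several steps above read pipe-separated 'picked' override files. As an
illustration, here is a hypothetical parser for the pickedNames.txt format
(`parse_picked_name` is not one of the repository's scripts; the
`name|altName|prefFlag` line format and the meaning of an empty last field
are taken from the description above):

```python
# Hypothetical helper for pickedNames.txt lines, e.g. `embryophyta|land plant|1`.
def parse_picked_name(line):
    name, alt_name, pref = line.rstrip("\n").split("|")
    # An empty last field indicates the alt-name should be removed (or, if the
    # alt-name equals the node name, that no alt-name should be preferred).
    pref_alt = int(pref) if pref != "" else None
    return name, alt_name, pref_alt

print(parse_picked_name("embryophyta|land plant|1"))
# ('embryophyta', 'land plant', 1)
print(parse_picked_name("embryophyta|embryophyta|"))
# ('embryophyta', 'embryophyta', None)
```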