| author | Terry Truong <terry06890@gmail.com> | 2022-07-11 13:19:18 +1000 |
|---|---|---|
| committer | Terry Truong <terry06890@gmail.com> | 2022-07-11 13:19:18 +1000 |
| commit | 7a28e15874796b3becf97c0193575d906d0cfd01 (patch) | |
| tree | 20c679fb7167c18009a697f0d3db7bed1d1b409c /backend/tolData | |
| parent | 5fe71ea7b9d9a5d2dc6e8e5ce5b9193629eed74d (diff) | |
Update backend documentation
Diffstat (limited to 'backend/tolData')
| -rw-r--r-- | backend/tolData/README.md | 58 |
| -rw-r--r-- | backend/tolData/dbpedia/README.md | 2 |
| -rw-r--r-- | backend/tolData/enwiki/README.md | 3 |
| -rw-r--r-- | backend/tolData/eol/README.md | 6 |
| -rwxr-xr-x | backend/tolData/eol/genImagesListDb.py | 36 |
| -rwxr-xr-x | backend/tolData/eol/genImagesListDb.sh | 12 |
| -rwxr-xr-x | backend/tolData/genDbpData.py | 3 |
| -rwxr-xr-x | backend/tolData/genEnwikiDescData.py | 4 |
| -rwxr-xr-x | backend/tolData/genEolNameData.py | 4 |
| -rwxr-xr-x | backend/tolData/genLinkedImgs.py | 2 |
| -rwxr-xr-x | backend/tolData/genOtolData.py | 2 |
11 files changed, 76 insertions(+), 56 deletions(-)
diff --git a/backend/tolData/README.md b/backend/tolData/README.md
index ba64114..75731ae 100644
--- a/backend/tolData/README.md
+++ b/backend/tolData/README.md
@@ -1,14 +1,14 @@
-This directory holds files used to generate data.db, which contains tree-of-life data.
+This directory holds files used to generate the tree-of-life database data.db.
 
-# Tables
-## Tree Structure data
+# Database Tables
+## Tree Structure
 - `nodes` <br>
   Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
-  Represents a tree-of-life node. `tips` represents the number of no-child descendants.
+  Represents a tree-of-life node. `tips` holds the number of no-child descendants.
 - `edges` <br>
   Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
   `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise
-## Node name data
+## Node Names
 - `eol_ids` <br>
   Format: `id INT PRIMARY KEY, name TEXT` <br>
   Associates an EOL ID with a node's name.
@@ -17,7 +17,7 @@ This directory holds files used to generate data.db, which contains tree-of-life
   Associates a node with alternative names. `pref_alt` is 1 if the alt-name is
   the most 'preferred' one. `src` indicates the dataset the alt-name was
   obtained from (can be 'eol', 'enwiki', or 'picked').
-## Node description data
+## Node Descriptions
 - `wiki_ids` <br>
   Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br>
   Associates a node with a wikipedia page ID.
@@ -26,7 +26,7 @@ This directory holds files used to generate data.db, which contains tree-of-life
   Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
   Associates a wikipedia page ID with a short-description. `from_dbp` is 1 if
   the description was obtained from DBpedia, and 0 otherwise.
-## Node image data
+## Node Images
 - `node_imgs` <br>
   Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br>
   Associates a node with an image.
@@ -36,11 +36,10 @@ This directory holds files used to generate data.db, which contains tree-of-life
 - `linked_imgs` <br>
   Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br>
   Associates a node with an image from another node.
-  `otol_ids` can be an otol ID, or two comma-separated otol IDs or empty strings.
-  The latter is used for compound nodes.
-## Reduced tree data
+  `otol_ids` can be an otol ID, or (for compound nodes) two comma-separated
+  strings that may be otol IDs or empty.
+## Reduced Trees
 - `nodes_t`, `nodes_i`, `nodes_p` <br>
-  These are like `nodes`, but describe the nodes for various reduced trees.
+  These are like `nodes`, but describe nodes of reduced trees.
 - `edges_t`, `edges_i`, `edges_p` <br>
   Like `edges` but for reduced trees.
@@ -53,24 +52,23 @@ have about 2.5 billion nodes. Downloading the images takes several days, and occ
 200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
 
 ## Environment
-The scripts are written in python and bash.
-Some of the python scripts require third-party packages:
+Some of the scripts use third-party packages:
 - jsonpickle: For encoding class objects as JSON.
 - requests: For downloading data.
 - PIL: For image processing.
 - tkinter: For providing a basic GUI to review images.
 - mwxml, mwparserfromhell: For parsing Wikipedia dumps.
 
-## Generate tree structure data
+## Generate Tree Structure Data
 1. Obtain files in otol/, as specified in it's README.
 2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
    using data in otol/. It also uses these files, if they exist:
-    - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
-      tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
-      Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3], etc.
-      This file can be used to manually specify which node should be named 'name1'.
+    - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. When nodes in the
+      tree have the same name (eg: Pholidota can refer to pangolins or orchids),
+      they get the names 'name1', 'name1 [2]', 'name1 [3], etc. This file is used to
+      forcibly specify which node should be named 'name1'.
 
-## Generate node name data
+## Generate Node Names Data
 1. Obtain 'name data files' in eol/, as specified in it's README.
 2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data
    in eol/ and the `nodes` table. It also uses these files, if they exist:
@@ -81,8 +79,8 @@ Some of the python scripts require third-party packages:
    - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`. Specifies
      that a node's alt-name set should exclude altName1.
 
-## Generate node description data
-### Get data from DBpedia
+## Generate Node Description Data
+### Get Data from DBpedia
 1. Obtain files in dbpedia/, as specified in it's README.
 2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data
    in dbpedia/ and the `nodes` table. It also uses these files, if they exist:
@@ -91,7 +89,7 @@ Some of the python scripts require third-party packages:
      wikipedia page that describes something different (eg: Osiris).
    - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`. Specifies
      node names that should have a particular associated page label.
-### Get data from Wikipedia
+### Get Data from Wikipedia
 1. Obtain 'description database files' in enwiki/, as specified in it's README.
 2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
    using data in enwiki/ and the `nodes` table.
@@ -99,7 +97,7 @@ Some of the python scripts require third-party packages:
    - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
    - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
 
-## Generate node image data
+## Generate Node Images Data
 ### Get images from EOL
 1. Obtain 'image metadata files' in eol/, as specified in it's README.
 2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
@@ -107,14 +105,14 @@ Some of the python scripts require third-party packages:
 3. In eol/, run reviewImgs.py, which interactively displays the downloaded images
    for each node, providing the choice of which to use, moving them to eol/imgs/.
    Uses `names` and `eol_ids` to display extra info.
-### Get images from Wikipedia
+### Get Images from Wikipedia
 1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each
    node, using the `wiki_ids` table, and stores them in a database.
 2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
    information for those images, using wikipedia's online API.
 3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
   images into enwiki/imgs/.
-### Merge the image sets
+### Merge the Image Sets
 1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and
    enwiki/imgs/, and enables choosing, for each node, which image should be
    used, if any, and outputs choice information into imgList.txt. Uses the `nodes`,
@@ -130,14 +128,14 @@ Some of the python scripts require third-party packages:
 - An input image might produce output with unexpected dimensions. This seems to
   happen when the image is very large, and triggers a decompression bomb warning.
-  The result might have as many as 150k images, with about 2/3 of them
-  being from wikipedia.
-### Add more image associations
+  In testing, this resulted in about 150k images, with about 2/3 of them
+  being from Wikipedia.
+### Add more Image Associations
 1. Run genLinkedImgs.py, which tries to associate nodes without images to
    images of it's children. Adds the `linked_imgs` table, and uses the
    `nodes`, `edges`, and `node_imgs` tables.
 
-## Do some post-processing
+## Do some Post-Processing
 1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
    using data in enwiki/, and the `names` and `wiki_ids` tables.
 2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
@@ -148,5 +146,3 @@ Some of the python scripts require third-party packages:
 3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
    adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads
    from pickedNodes.txt, which lists names of nodes that must be included (1 per line).
-   The original tree isn't used for web-queries, as some nodes would have over
-   10k children, which can take a while to render (took over a minute in testing).
diff --git a/backend/tolData/dbpedia/README.md b/backend/tolData/dbpedia/README.md
index 8a08f20..dd9bda7 100644
--- a/backend/tolData/dbpedia/README.md
+++ b/backend/tolData/dbpedia/README.md
@@ -1,4 +1,4 @@
-This directory holds files obtained from/using [Dbpedia](https://www.dbpedia.org).
+This directory holds files obtained/derived from [Dbpedia](https://www.dbpedia.org).
 
 # Downloaded Files
 - `labels_lang=en.ttl.bz2` <br>
diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md
index 90d16c7..dfced94 100644
--- a/backend/tolData/enwiki/README.md
+++ b/backend/tolData/enwiki/README.md
@@ -1,4 +1,4 @@
-This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
+This directory holds files obtained/derived from [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
 
 # Downloaded Files
 - enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
@@ -49,4 +49,3 @@ This directory holds files obtained from/using [English Wikipedia](https://en.wi
 - lookupPage.py <br>
   Running `lookupPage.py title1` looks in the dump for a page with a given title,
   and prints the contents to stdout. Uses dumpIndex.db.
-
diff --git a/backend/tolData/eol/README.md b/backend/tolData/eol/README.md
index 8c527a8..1a9dbdf 100644
--- a/backend/tolData/eol/README.md
+++ b/backend/tolData/eol/README.md
@@ -3,7 +3,7 @@ This directory holds files obtained from/using the [Encyclopedia of Life](https:
 # Name Data Files
 - vernacularNames.csv <br>
   Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020).
-  Contains alternative-name data from EOL.
+  Contains alternative-node-names data from EOL.
 
 # Image Metadata Files
 - imagesList.tgz <br>
@@ -11,10 +11,10 @@ This directory holds files obtained from/using the [Encyclopedia of Life](https:
   Contains metadata for images from EOL.
 - imagesList/ <br>
   Extracted from imagesList.tgz.
-- genImagesListDb.sh <br>
+- genImagesListDb.py <br>
   Creates a database, and imports imagesList/*.csv files into it.
 - imagesList.db <br>
-  Created by running genImagesListDb.sh <br>
+  Created by running genImagesListDb.py <br>
   Tables: <br>
   - `images`: `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT`
diff --git a/backend/tolData/eol/genImagesListDb.py b/backend/tolData/eol/genImagesListDb.py
new file mode 100755
index 0000000..32df10a
--- /dev/null
+++ b/backend/tolData/eol/genImagesListDb.py
@@ -0,0 +1,36 @@
+#!/usr/bin/python3
+
+import sys, os, re
+import csv
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Generates a sqlite db from a directory of CSV files holding EOL image data
+"""
+if len(sys.argv) > 1:
+    print(usageInfo, file=sys.stderr)
+    sys.exit(1)
+
+imagesListDir = "imagesList/"
+dbFile = "imagesList.db"
+
+print("Creating database")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE images" \
+    " (content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT)")
+print("Reading CSV files")
+csvFilenames = os.listdir(imagesListDir)
+for filename in csvFilenames:
+    print(f"Processing {imagesListDir}{filename}")
+    with open(imagesListDir + filename, newline="") as file:
+        for (contentId, pageId, sourceUrl, copyUrl, license, owner) in csv.reader(file):
+            if re.match(r"^[a-zA-Z]", contentId): # Skip header line
+                continue
+            dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
+                (int(contentId), int(pageId), sourceUrl, copyUrl, license, owner))
+print("Closing database")
+dbCon.commit()
+dbCon.close()
diff --git a/backend/tolData/eol/genImagesListDb.sh b/backend/tolData/eol/genImagesListDb.sh
deleted file mode 100755
index 87dd840..0000000
--- a/backend/tolData/eol/genImagesListDb.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/bin/bash
-set -e
-
-# Combine CSV files into one, skipping header lines
-cat imagesList/media_*_{1..58}.csv | tail -n +2 > imagesList.csv
-# Create database, and import the CSV file
-sqlite3 imagesList.db <<END
-CREATE TABLE images (
-    content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT);
-.mode csv
-.import 'imagesList.csv' images
-END
diff --git a/backend/tolData/genDbpData.py b/backend/tolData/genDbpData.py
index df3a6be..606ffcb 100755
--- a/backend/tolData/genDbpData.py
+++ b/backend/tolData/genDbpData.py
@@ -7,7 +7,8 @@ usageInfo = f"""
 Usage: {sys.argv[0]}
 
 Reads a database containing data from DBpedia, and tries to associate
-DBpedia IRIs with nodes in a database, adding short-descriptions for them.
+DBpedia IRIs with nodes in the tree-of-life database, adding
+short-descriptions for them.
 """
 if len(sys.argv) > 1:
     print(usageInfo, file=sys.stderr)
diff --git a/backend/tolData/genEnwikiDescData.py b/backend/tolData/genEnwikiDescData.py
index d3f93ed..0e86fd5 100755
--- a/backend/tolData/genEnwikiDescData.py
+++ b/backend/tolData/genEnwikiDescData.py
@@ -7,8 +7,8 @@ usageInfo = f"""
 Usage: {sys.argv[0]}
 
 Reads a database containing data from Wikipedia, and tries to associate
-wiki pages with nodes in the database, and add descriptions for nodes
-that don't have them.
+wiki pages with nodes in the tree-of-life database, and add descriptions for
+nodes that don't have them.
 """
 if len(sys.argv) > 1:
     print(usageInfo, file=sys.stderr)
diff --git a/backend/tolData/genEolNameData.py b/backend/tolData/genEolNameData.py
index dd33ee0..1b19a47 100755
--- a/backend/tolData/genEolNameData.py
+++ b/backend/tolData/genEolNameData.py
@@ -7,8 +7,8 @@ usageInfo = f"""
 Usage: {sys.argv[0]}
 
 Reads files describing name data from the 'Encyclopedia of Life' site,
-tries to associate names with nodes in the database, and adds tables
-to represent associated names.
+tries to associate names with nodes in the tree-of-life database,
+and adds tables to represent associated names.
 
 Reads a vernacularNames.csv file:
 Starts with a header line containing:
diff --git a/backend/tolData/genLinkedImgs.py b/backend/tolData/genLinkedImgs.py
index a8e1322..c9cc622 100755
--- a/backend/tolData/genLinkedImgs.py
+++ b/backend/tolData/genLinkedImgs.py
@@ -32,7 +32,7 @@ print(f"Found {len(resolvedNodes)}")
 print("Iterating through nodes, trying to resolve images for ancestors")
 nodesToResolve = {} # Maps a node name to a list of objects that represent possible child images
 processedNodes = {} # Map a node name to an OTOL ID, representing a child node whose image is to be used
-parentToChosenTips = {} # used to prefer images from children with more tips
+parentToChosenTips = {} # Used to prefer images from children with more tips
 iterNum = 0
 while len(resolvedNodes) > 0:
     iterNum += 1
diff --git a/backend/tolData/genOtolData.py b/backend/tolData/genOtolData.py
index b5e0055..4236999 100755
--- a/backend/tolData/genOtolData.py
+++ b/backend/tolData/genOtolData.py
@@ -7,7 +7,7 @@ usageInfo = f"""
 Usage: {sys.argv[0]}
 
 Reads files describing a tree-of-life from an 'Open Tree of Life' release,
-and stores tree information in a database.
+and stores tree info in a database.
 
 Reads a labelled_supertree_ottnames.tre file, which is assumed to have this format:
     The tree-of-life is represented in Newick format, which looks like: (n1,n2,(n3,n4)n5)n6

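The new eol/genImagesListDb.py above follows a common CSV-to-SQLite import pattern: create the table, then insert each row, skipping the header. A self-contained sketch of the same pattern, using an in-memory database and hypothetical sample rows (the real script reads imagesList/*.csv into imagesList.db):

```python
# Self-contained sketch of the CSV-to-SQLite pattern used by genImagesListDb.py.
# The sample rows below are made up for illustration; the schema matches the
# `images` table described in eol/README.md.
import csv, io, sqlite3

sample_csv = """content_id,page_id,source_url,copy_url,license,copyright_owner
1,101,http://example.org/a,http://example.org/a.jpg,cc-by,Alice
2,102,http://example.org/b,http://example.org/b.jpg,cc0,Bob
"""

dbCon = sqlite3.connect(":memory:")
dbCur = dbCon.cursor()
dbCur.execute(
    "CREATE TABLE images (content_id INT PRIMARY KEY, page_id INT,"
    " source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT)")

for contentId, pageId, sourceUrl, copyUrl, license, owner in csv.reader(io.StringIO(sample_csv)):
    if not contentId.isdigit():  # Skip the header line
        continue
    dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
                  (int(contentId), int(pageId), sourceUrl, copyUrl, license, owner))
dbCon.commit()

print(dbCur.execute("SELECT COUNT(*) FROM images").fetchone()[0])  # → 2
```

Parameterized `?` placeholders (rather than string formatting) let sqlite3 handle quoting, which matters here since EOL URLs and owner names can contain commas and quotes.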