path: root/backend/tolData
Diffstat (limited to 'backend/tolData')
-rw-r--r--  backend/tolData/README.md  152
-rwxr-xr-x  backend/tolData/addPickedNames.py  57
-rw-r--r--  backend/tolData/dbpedia/README.md  29
-rwxr-xr-x  backend/tolData/dbpedia/genDescData.py  130
-rw-r--r--  backend/tolData/enwiki/README.md  52
-rwxr-xr-x  backend/tolData/enwiki/downloadImgLicenseInfo.py  150
-rwxr-xr-x  backend/tolData/enwiki/downloadImgs.py  91
-rwxr-xr-x  backend/tolData/enwiki/genDescData.py  127
-rwxr-xr-x  backend/tolData/enwiki/genDumpIndexDb.py  58
-rwxr-xr-x  backend/tolData/enwiki/genImgData.py  190
-rwxr-xr-x  backend/tolData/enwiki/lookupPage.py  68
-rw-r--r--  backend/tolData/eol/README.md  26
-rwxr-xr-x  backend/tolData/eol/downloadImgs.py  147
-rwxr-xr-x  backend/tolData/eol/genImagesListDb.sh  12
-rwxr-xr-x  backend/tolData/eol/reviewImgs.py  205
-rwxr-xr-x  backend/tolData/genDbpData.py  247
-rwxr-xr-x  backend/tolData/genEnwikiDescData.py  102
-rwxr-xr-x  backend/tolData/genEnwikiNameData.py  76
-rwxr-xr-x  backend/tolData/genEolNameData.py  184
-rwxr-xr-x  backend/tolData/genImgs.py  191
-rwxr-xr-x  backend/tolData/genLinkedImgs.py  125
-rwxr-xr-x  backend/tolData/genOtolData.py  250
-rwxr-xr-x  backend/tolData/genReducedTrees.py  329
-rw-r--r--  backend/tolData/otol/README.md  10
-rw-r--r--  backend/tolData/pickedImgs/README.md  10
-rwxr-xr-x  backend/tolData/reviewImgsToGen.py  225
26 files changed, 3243 insertions, 0 deletions
diff --git a/backend/tolData/README.md b/backend/tolData/README.md
new file mode 100644
index 0000000..ba64114
--- /dev/null
+++ b/backend/tolData/README.md
@@ -0,0 +1,152 @@
+This directory holds files used to generate data.db, which contains tree-of-life data.
+
+# Tables
+## Tree Structure data
+- `nodes` <br>
+ Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
+ Represents a tree-of-life node. `tips` is the number of childless (leaf) descendants.
+- `edges` <br>
+ Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
+ `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise.
+## Node name data
+- `eol_ids` <br>
+ Format: `id INT PRIMARY KEY, name TEXT` <br>
+ Associates an EOL ID with a node's name.
+- `names` <br>
+ Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` <br>
+ Associates a node with alternative names.
+ `pref_alt` is 1 if the alt-name is the most 'preferred' one.
+ `src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked').
+## Node description data
+- `wiki_ids` <br>
+ Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br>
+ Associates a node with a wikipedia page ID.
+ `redirected` is 1 if the node was associated with a different page that redirected to this one.
+- `descs` <br>
+ Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
+ Associates a wikipedia page ID with a short-description.
+ `from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise.
+## Node image data
+- `node_imgs` <br>
+ Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br>
+ Associates a node with an image.
+- `images` <br>
+ Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` <br>
+ Represents an image, identified by a source ('eol', 'enwiki', or 'picked'), and a source-specific ID.
+- `linked_imgs` <br>
+ Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br>
+ Associates a node with an image from another node.
+ `otol_ids` holds a single otol ID or, for compound nodes, two comma-separated
+ values, each an otol ID or an empty string.
+## Reduced tree data
+- `nodes_t`, `nodes_i`, `nodes_p` <br>
+ These are like `nodes`, but describe the nodes for various reduced trees.
+- `edges_t`, `edges_i`, `edges_p` <br>
+ Like `edges` but for reduced trees.
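For reference, the structure tables above can be exercised with Python's sqlite3. This is a toy in-memory sketch: the table formats match the README, but the data and the `children` helper are made up for illustration.

```python
import sqlite3

# Toy in-memory database using the nodes/edges formats described above
dbCon = sqlite3.connect(":memory:")
dbCur = dbCon.cursor()
dbCur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
dbCur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
dbCur.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
	[("root", "ott1", 2), ("a", "ott2", 1), ("b", "ott3", 1)])
dbCur.executemany("INSERT INTO edges VALUES (?, ?, ?)",
	[("root", "a", 1), ("root", "b", 0)])

def children(name):
	"""Return (child, p_support) pairs for a node's out-edges"""
	query = "SELECT child, p_support FROM edges WHERE parent = ? ORDER BY child"
	return dbCur.execute(query, (name,)).fetchall()

print(children("root"))  # [('a', 1), ('b', 0)]
```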
+
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and disk space. The tree will probably
+have about 2.5 million nodes. Downloading the images takes several days, and occupies over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in python and bash.
+Some of the python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- PIL: For image processing.
+- tkinter: For providing a basic GUI to review images.
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+ using data in otol/. It also uses these files, if they exist:
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+ tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+ Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+ This file can be used to manually specify which node should be named 'name1'.
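The numbering scheme can be sketched as follows. This is a simplified stand-in for the actual genOtolData.py logic; `assignNames` is a hypothetical helper, not a function from the scripts.

```python
def assignNames(nodes, picked):
	"""nodes: (name, otolId) pairs; picked: {name: otolId}, as read from
	pickedOtolNames.txt lines of the form 'name1|otolId1'.
	Returns {otolId: uniqueName}: the picked (or first-seen) node gets the
	plain name, and the rest get 'name [2]', 'name [3]', etc."""
	byName = {}
	for name, otolId in nodes:
		byName.setdefault(name, []).append(otolId)
	result = {}
	for name, ids in byName.items():
		# Put the picked ID (if any) first, so it gets the undecorated name
		if name in picked and picked[name] in ids:
			ids.remove(picked[name])
			ids.insert(0, picked[name])
		for n, otolId in enumerate(ids, start=1):
			result[otolId] = name if n == 1 else f"{name} [{n}]"
	return result

print(assignNames([("pholidota", "ott1"), ("pholidota", "ott2")], {"pholidota": "ott2"}))
# {'ott2': 'pholidota', 'ott1': 'pholidota [2]'}
```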
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+ eol/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+ Specifies node names that should have a particular EOL ID, or no ID.
+ Quite a few taxa have ambiguous names, and may need manual correction.
+ For example, Viola may resolve to a taxon of butterflies or of plants.
+ - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+ Specifies that a node's alt-name set should exclude altName1.
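A picked-IDs file in the above format could be parsed like this (illustrative only; `readPickedEolIds` is not one of the actual scripts). An empty ID field maps to None, meaning 'this node gets no EOL ID'.

```python
def readPickedEolIds(lines):
	"""Parse lines of the form 'nodeName1|eolId1' or 'nodeName1|'.
	Returns {nodeName: eolId}, with None meaning 'no EOL ID'."""
	picked = {}
	for line in lines:
		name, idStr = line.rstrip("\n").split("|")
		picked[name] = int(idStr) if idStr else None
	return picked

print(readPickedEolIds(["viola|52954\n", "osiris|\n"]))  # {'viola': 52954, 'osiris': None}
```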
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+ dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+ no description should be obtained. Many node names have a same-name
+ wikipedia page that describes something different (eg: Osiris).
+ - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+ Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in it's README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+ using data in enwiki/ and the `nodes` table.
+ It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+ - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate node image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in it's README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
+ into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
+ each node, providing the choice of which to use, moving them to eol/imgs/.
+ Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
+ using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+ those images, using wikipedia's online API.
+3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
+ images into enwiki/imgs/.
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+ lets you choose, for each node, which image (if any) should be used, and
+ outputs the choices to imgList.txt. Uses the `nodes`,
+ `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+ imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+ `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+ The outputs might need to be manually created/adjusted:
+ - An input image might have no output produced, possibly due to
+ data incompatibilities, memory limits, etc. A few input image files
+ might actually be html files, containing a 'file not found' page.
+ - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+ - An input image might produce output with unexpected dimensions.
+ This seems to happen when the image is very large, and triggers a
+ decompression bomb warning.
+ The result might have as many as 150k images, with about 2/3 of them
+ being from wikipedia.
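The crop/resize step can be sketched with PIL, which the scripts already depend on. This is a center-crop-to-square helper under assumed settings; the actual output size and crop policy of genImgs.py may differ.

```python
from PIL import Image

def cropResize(inPath, outPath, size=200):
	"""Center-crop an image to a square, resize it, and save as JPEG.
	Very large inputs may trigger PIL's decompression-bomb check
	(see Image.MAX_IMAGE_PIXELS), one cause of unexpected outputs."""
	img = Image.open(inPath).convert("RGB")
	w, h = img.size
	side = min(w, h)
	left, top = (w - side) // 2, (h - side) // 2
	img = img.crop((left, top, left + side, top + side))
	img = img.resize((size, size))
	img.save(outPath, "JPEG")
```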
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate nodes without images to
+ images of its children. Adds the `linked_imgs` table, and uses the
+ `nodes`, `edges`, and `node_imgs` tables.
+
+## Do some post-processing
+1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
+ using data in enwiki/, and the `names` and `wiki_ids` tables.
+2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+ the `names` table, as specified in pickedNames.txt.
+ - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
+ These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
+ A line like `name1|name1|1` causes a node to have no preferred alt-name.
+3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
+ adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads from
+ pickedNodes.txt, which lists names of nodes that must be included (1 per line).
+ The original tree isn't used for web-queries, as some nodes would have over
+ 10k children, which can take a while to render (took over a minute in testing).
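The core of the reduction can be sketched as keeping the picked nodes plus all of their ancestors. This is a simplification: genReducedTrees.py also rewires edges and recomputes tip counts, which is omitted here.

```python
def reduceTree(parentMap, keep):
	"""parentMap: {child: parent} for the full tree (the root has no entry).
	keep: names of nodes that must be included, as from pickedNodes.txt.
	Returns the reduced node set: the picked nodes plus all their ancestors."""
	result = set()
	for name in keep:
		# Walk upward until the root, or a node already marked as kept
		while name is not None and name not in result:
			result.add(name)
			name = parentMap.get(name)
	return result

parentMap = {"a": "root", "b": "root", "c": "a"}
print(sorted(reduceTree(parentMap, {"c"})))  # ['a', 'c', 'root']
```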
diff --git a/backend/tolData/addPickedNames.py b/backend/tolData/addPickedNames.py
new file mode 100755
index 0000000..d56a0cb
--- /dev/null
+++ b/backend/tolData/addPickedNames.py
@@ -0,0 +1,57 @@
+#!/usr/bin/python3
+
+import sys
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads alt-name data from a file, and adds it to the database's 'names' table.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+dbFile = "data.db"
+pickedNamesFile = "pickedNames.txt"
+
+print("Opening database")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+print("Iterating through picked-names file")
+with open(pickedNamesFile) as file:
+ for line in file:
+ # Get record data
+ nodeName, altName, prefAlt = line.lower().rstrip().split("|")
+ prefAlt = int(prefAlt)
+ # Check whether there exists a node with the name
+ row = dbCur.execute("SELECT name from nodes where name = ?", (nodeName,)).fetchone()
+ if row == None:
+ print(f"ERROR: No node with name \"{nodeName}\" exists")
+ break
+ # Remove any existing preferred-alt status
+ if prefAlt == 1:
+ query = "SELECT name, alt_name FROM names WHERE name = ? AND pref_alt = 1"
+ row = dbCur.execute(query, (nodeName,)).fetchone()
+ if row != None and row[1] != altName:
+ print(f"Removing pref-alt status from alt-name {row[1]} for {nodeName}")
+ dbCur.execute("UPDATE names SET pref_alt = 0 WHERE name = ? AND alt_name = ?", row)
+ # Check for an existing record
+ if nodeName == altName:
+ continue
+ query = "SELECT name, alt_name, pref_alt FROM names WHERE name = ? AND alt_name = ?"
+ row = dbCur.execute(query, (nodeName, altName)).fetchone()
+ if row == None:
+ print(f"Adding record for alt-name {altName} for {nodeName}")
+ dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'picked')", (nodeName, altName, prefAlt))
+ else:
+ # Update existing record
+ if row[2] != prefAlt:
+ print(f"Updating record for alt-name {altName} for {nodeName}")
+ dbCur.execute("UPDATE names SET pref_alt = ?, src = 'picked' WHERE name = ? AND alt_name = ?",
+ (prefAlt, nodeName, altName))
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
diff --git a/backend/tolData/dbpedia/README.md b/backend/tolData/dbpedia/README.md
new file mode 100644
index 0000000..8a08f20
--- /dev/null
+++ b/backend/tolData/dbpedia/README.md
@@ -0,0 +1,29 @@
+This directory holds files obtained from/using [Dbpedia](https://www.dbpedia.org).
+
+# Downloaded Files
+- `labels_lang=en.ttl.bz2` <br>
+ Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core.
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
+- `page_lang=en_ids.ttl.bz2` <br>
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2>
+- `redirects_lang=en_transitive.ttl.bz2` <br>
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>.
+- `disambiguations_lang=en.ttl.bz2` <br>
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>.
+- `instance-types_lang=en_specific.ttl.bz2` <br>
+ Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>.
+- `short-abstracts_lang=en.ttl.bz2` <br>
+ Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>.
+
+# Other Files
+- genDescData.py <br>
+ Used to generate a database representing data from the ttl files.
+- descData.db <br>
+ Generated by genDescData.py. <br>
+ Tables: <br>
+ - `labels`: `iri TEXT PRIMARY KEY, label TEXT `
+ - `ids`: `iri TEXT PRIMARY KEY, id INT`
+ - `redirects`: `iri TEXT PRIMARY KEY, target TEXT`
+ - `disambiguations`: `iri TEXT PRIMARY KEY`
+ - `types`: `iri TEXT, type TEXT`
+ - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT`
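A typical lookup against descData.db resolves a label to an IRI, follows any redirect, and fetches the abstract. The sketch below is self-contained with toy rows (the real IRIs and abstracts come from the ttl files); `abstractForLabel` is an illustrative helper, not part of the scripts.

```python
import sqlite3

# Toy rows standing in for data parsed from the ttl files
dbCon = sqlite3.connect(":memory:")
dbCur = dbCon.cursor()
dbCur.execute("CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)")
dbCur.execute("CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)")
dbCur.execute("CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)")
dbCur.execute("INSERT INTO labels VALUES ('http://dbpedia.org/resource/Formicidae', 'Formicidae')")
dbCur.execute("INSERT INTO redirects VALUES "
	"('http://dbpedia.org/resource/Formicidae', 'http://dbpedia.org/resource/Ant')")
dbCur.execute("INSERT INTO abstracts VALUES "
	"('http://dbpedia.org/resource/Ant', 'Ants are eusocial insects.')")

def abstractForLabel(label):
	"""Resolve a label to an IRI, follow any redirect, and return the abstract"""
	row = dbCur.execute("SELECT iri FROM labels WHERE label = ?", (label,)).fetchone()
	if row is None:
		return None
	iri = row[0]
	redir = dbCur.execute("SELECT target FROM redirects WHERE iri = ?", (iri,)).fetchone()
	if redir is not None:
		iri = redir[0]
	row = dbCur.execute("SELECT abstract FROM abstracts WHERE iri = ?", (iri,)).fetchone()
	return None if row is None else row[0]

print(abstractForLabel("Formicidae"))  # Ants are eusocial insects.
```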
diff --git a/backend/tolData/dbpedia/genDescData.py b/backend/tolData/dbpedia/genDescData.py
new file mode 100755
index 0000000..d9e8a80
--- /dev/null
+++ b/backend/tolData/dbpedia/genDescData.py
@@ -0,0 +1,130 @@
+#!/usr/bin/python3
+
+import sys, re
+import bz2, sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Adds DBpedia labels/types/abstracts/etc data into a database.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+labelsFile = "labels_lang=en.ttl.bz2" # Had about 16e6 entries
+idsFile = "page_lang=en_ids.ttl.bz2"
+redirectsFile = "redirects_lang=en_transitive.ttl.bz2"
+disambigFile = "disambiguations_lang=en.ttl.bz2"
+typesFile = "instance-types_lang=en_specific.ttl.bz2"
+abstractsFile = "short-abstracts_lang=en.ttl.bz2"
+dbFile = "descData.db"
+# In testing, this script took a few hours to run, and generated about 10GB
+
+print("Creating database")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+print("Reading/storing label data")
+dbCur.execute("CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)")
+dbCur.execute("CREATE INDEX labels_idx ON labels(label)")
+dbCur.execute("CREATE INDEX labels_idx_nc ON labels(label COLLATE NOCASE)")
+labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "((?:[^"]|\\")+)"@en \.\n')
+lineNum = 0
+with bz2.open(labelsFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = labelLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ dbCur.execute("INSERT INTO labels VALUES (?, ?)", (match.group(1), match.group(2)))
+
+print("Reading/storing wiki page ids")
+dbCur.execute("CREATE TABLE ids (iri TEXT PRIMARY KEY, id INT)")
+idLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(\d+)".*\n')
+lineNum = 0
+with bz2.open(idsFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = idLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ try:
+ dbCur.execute("INSERT INTO ids VALUES (?, ?)", (match.group(1), int(match.group(2))))
+ except sqlite3.IntegrityError as e:
+ # Accounts for certain lines that have the same IRI
+ print(f"WARNING: Failed to add entry with IRI \"{match.group(1)}\": {e}")
+
+print("Reading/storing redirection data")
+dbCur.execute("CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)")
+redirLineRegex = re.compile(r'<([^>]+)> <[^>]+> <([^>]+)> \.\n')
+lineNum = 0
+with bz2.open(redirectsFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = redirLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (match.group(1), match.group(2)))
+
+print("Reading/storing diambiguation-page data")
+dbCur.execute("CREATE TABLE disambiguations (iri TEXT PRIMARY KEY)")
+disambigLineRegex = redirLineRegex
+lineNum = 0
+with bz2.open(disambigFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = disambigLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ dbCur.execute("INSERT OR IGNORE INTO disambiguations VALUES (?)", (match.group(1),))
+
+print("Reading/storing instance-type data")
+dbCur.execute("CREATE TABLE types (iri TEXT, type TEXT)")
+dbCur.execute("CREATE INDEX types_iri_idx ON types(iri)")
+typeLineRegex = redirLineRegex
+lineNum = 0
+with bz2.open(typesFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = typeLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ dbCur.execute("INSERT INTO types VALUES (?, ?)", (match.group(1), match.group(2)))
+
+print("Reading/storing abstracts")
+dbCur.execute("CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)")
+descLineRegex = labelLineRegex
+lineNum = 0
+with bz2.open(abstractsFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ if line[0] == "#":
+ continue
+ match = descLineRegex.fullmatch(line)
+ if match == None:
+ raise Exception(f"ERROR: Line {lineNum} has unexpected format")
+ dbCur.execute("INSERT INTO abstracts VALUES (?, ?)",
+ (match.group(1), match.group(2).replace(r'\"', '"')))
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md
new file mode 100644
index 0000000..90d16c7
--- /dev/null
+++ b/backend/tolData/enwiki/README.md
@@ -0,0 +1,52 @@
+This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
+
+# Downloaded Files
+- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
+ Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
+ Contains text content and metadata for pages in enwiki.
+ Some file content and format information was available from
+ <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
+- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
+ Obtained like above. Holds lines of the form offset1:pageId1:title1,
+ providing, for each page, an offset into the dump file of a chunk of
+ 100 pages that includes it.
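The index lines can be parsed into rows resembling the `offsets` table described below (a simplified stand-in for genDumpIndexDb.py; `next_offset` is the start of the following chunk, which bounds the bz2 block containing a page).

```python
def parseIndex(lines):
	"""Parse 'offset:pageId:title' lines into [title, id, offset, next_offset]
	rows. next_offset is the next distinct chunk offset, or None for the last chunk."""
	entries = []
	for line in lines:
		# maxsplit=2 keeps any colons that appear inside the title
		offsetStr, idStr, title = line.rstrip("\n").split(":", maxsplit=2)
		entries.append([title, int(idStr), int(offsetStr), None])
	chunkStarts = sorted({e[2] for e in entries})
	nextOf = {a: b for a, b in zip(chunkStarts, chunkStarts[1:])}
	for e in entries:
		e[3] = nextOf.get(e[2])
	return entries

print(parseIndex(["550:10:AccessibleComputing\n", "550:12:Anarchism\n", "9000:593:Algeria\n"]))
```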
+
+# Generated Dump-Index Files
+- genDumpIndexDb.py <br>
+ Creates an sqlite-database version of the enwiki-dump index file.
+- dumpIndex.db <br>
+ Generated by genDumpIndexDb.py. <br>
+ Tables: <br>
+ - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
+
+# Description Database Files
+- genDescData.py <br>
+ Reads through pages in the dump file, and adds short-description info to a database.
+- descData.db <br>
+ Generated by genDescData.py. <br>
+ Tables: <br>
+ - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE`
+ - `redirects`: `id INT PRIMARY KEY, target TEXT`
+ - `descs`: `id INT PRIMARY KEY, desc TEXT`
+
+# Image Database Files
+- genImgData.py <br>
+ Used to find infobox image names for page IDs, storing them into a database.
+- downloadImgLicenseInfo.py <br>
+ Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database.
+- imgData.db <br>
+ Used to hold metadata about infobox images for a set of pageIDs.
+ Generated using genImgData.py and downloadImgLicenseInfo.py. <br>
+ Tables: <br>
+ - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br>
+ `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
+ - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
+ Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
+- downloadImgs.py <br>
+ Used to download image files into imgs/.
+
+# Other Files
+- lookupPage.py <br>
+ Running `lookupPage.py title1` looks in the dump for a page with a given title,
+ and prints the contents to stdout. Uses dumpIndex.db.
+
diff --git a/backend/tolData/enwiki/downloadImgLicenseInfo.py b/backend/tolData/enwiki/downloadImgLicenseInfo.py
new file mode 100755
index 0000000..399922e
--- /dev/null
+++ b/backend/tolData/enwiki/downloadImgLicenseInfo.py
@@ -0,0 +1,150 @@
+#!/usr/bin/python3
+
+import sys, re
+import sqlite3, urllib.parse, html
+import requests
+import time, signal
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads image names from a database, and uses enwiki's online API to obtain
+licensing information for them, adding the info to the database.
+
+SIGINT causes the program to finish an ongoing download and exit.
+The program can be re-run to continue downloading, and looks
+at already-processed names to decide what to skip.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+imgDb = "imgData.db"
+apiUrl = "https://en.wikipedia.org/w/api.php"
+userAgent = "terryt.dev (terry06890@gmail.com)"
+batchSz = 50 # Max 50
+tagRegex = re.compile(r"<[^<]+>")
+whitespaceRegex = re.compile(r"\s+")
+
+print("Opening database")
+dbCon = sqlite3.connect(imgDb)
+dbCur = dbCon.cursor()
+dbCur2 = dbCon.cursor()
+print("Checking for table")
+if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='imgs'").fetchone() == None:
+ dbCur.execute("CREATE TABLE imgs(" \
+ "name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT)")
+
+print("Reading image names")
+imgNames = set()
+for (imgName,) in dbCur.execute("SELECT DISTINCT img_name FROM page_imgs WHERE img_name NOT NULL"):
+ imgNames.add(imgName)
+print(f"Found {len(imgNames)}")
+
+print("Checking for already-processed images")
+oldSz = len(imgNames)
+for (imgName,) in dbCur.execute("SELECT name FROM imgs"):
+ imgNames.discard(imgName)
+print(f"Found {oldSz - len(imgNames)}")
+
+# Set SIGINT handler
+interrupted = False
+oldHandler = None
+def onSigint(sig, frame):
+ global interrupted
+ interrupted = True
+ signal.signal(signal.SIGINT, oldHandler)
+oldHandler = signal.signal(signal.SIGINT, onSigint)
+
+print("Iterating through image names")
+imgNames = list(imgNames)
+iterNum = 0
+for i in range(0, len(imgNames), batchSz):
+ iterNum += 1
+ if iterNum % 1 == 0:
+ print(f"At iteration {iterNum} (after {(iterNum - 1) * batchSz} images)")
+ if interrupted:
+ print(f"Exiting loop at iteration {iterNum}")
+ break
+ # Get batch
+ imgBatch = imgNames[i:i+batchSz]
+ imgBatch = ["File:" + x for x in imgBatch]
+ # Make request
+ headers = {
+ "user-agent": userAgent,
+ "accept-encoding": "gzip",
+ }
+ params = {
+ "action": "query",
+ "format": "json",
+ "prop": "imageinfo",
+ "iiprop": "extmetadata|url",
+ "maxlag": "5",
+ "titles": "|".join(imgBatch),
+ "iiextmetadatafilter": "Artist|Credit|LicenseShortName|Restrictions",
+ }
+ responseObj = None
+ try:
+ response = requests.get(apiUrl, params=params, headers=headers)
+ responseObj = response.json()
+ except Exception as e:
+ print(f"ERROR: Exception while downloading info: {e}")
+ print(f"\tImage batch: " + "|".join(imgBatch))
+ continue
+ # Parse response-object
+ if "query" not in responseObj or "pages" not in responseObj["query"]:
+ print("WARNING: Response object for doesn't have page data")
+ print("\tImage batch: " + "|".join(imgBatch))
+ if "error" in responseObj:
+ errorCode = responseObj["error"]["code"]
+ print(f"\tError code: {errorCode}")
+ if errorCode == "maxlag":
+ time.sleep(5)
+ continue
+ pages = responseObj["query"]["pages"]
+ normalisedToInput = {}
+ if "normalized" in responseObj["query"]:
+ for entry in responseObj["query"]["normalized"]:
+ normalisedToInput[entry["to"]] = entry["from"]
+ for (_, page) in pages.items():
+ # Some fields // More info at https://www.mediawiki.org/wiki/Extension:CommonsMetadata#Returned_data
+ # LicenseShortName: short human-readable license name, apparently more reliable than 'License',
+ # Artist: author name (might contain complex html, multiple authors, etc)
+ # Credit: 'source'
+ # For image-map-like images, can be quite large/complex html, crediting each sub-image
+ # May be <a href="text1">text2</a>, where the text2 might be non-indicative
+ # Restrictions: specifies non-copyright legal restrictions
+ title = page["title"]
+ if title in normalisedToInput:
+ title = normalisedToInput[title]
+ title = title[5:] # Remove 'File:'
+ if title not in imgNames:
+ print(f"WARNING: Got title \"{title}\" not in image-name list")
+ continue
+ if "imageinfo" not in page:
+ print(f"WARNING: No imageinfo section for page \"{title}\"")
+ continue
+ metadata = page["imageinfo"][0]["extmetadata"]
+ url = page["imageinfo"][0]["url"]
+ license = metadata['LicenseShortName']['value'] if 'LicenseShortName' in metadata else None
+ artist = metadata['Artist']['value'] if 'Artist' in metadata else None
+ credit = metadata['Credit']['value'] if 'Credit' in metadata else None
+ restrictions = metadata['Restrictions']['value'] if 'Restrictions' in metadata else None
+ # Remove markup
+ if artist != None:
+ artist = tagRegex.sub(" ", artist)
+ artist = whitespaceRegex.sub(" ", artist)
+ artist = html.unescape(artist)
+ artist = urllib.parse.unquote(artist)
+ if credit != None:
+ credit = tagRegex.sub(" ", credit)
+ credit = whitespaceRegex.sub(" ", credit)
+ credit = html.unescape(credit)
+ credit = urllib.parse.unquote(credit)
+ # Add to db
+ dbCur2.execute("INSERT INTO imgs VALUES (?, ?, ?, ?, ?, ?)",
+ (title, license, artist, credit, restrictions, url))
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
diff --git a/backend/tolData/enwiki/downloadImgs.py b/backend/tolData/enwiki/downloadImgs.py
new file mode 100755
index 0000000..8fb605f
--- /dev/null
+++ b/backend/tolData/enwiki/downloadImgs.py
@@ -0,0 +1,91 @@
+#!/usr/bin/python3
+
+import sys, re, os
+import sqlite3
+import urllib.parse, requests
+import time, signal
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Downloads images from URLs in an image database, into an output directory,
+with names of the form 'pageId1.ext1'.
+
+SIGINT causes the program to finish an ongoing download and exit.
+The program can be re-run to continue downloading, and looks
+in the output directory to decide what to skip.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+imgDb = "imgData.db" # About 130k image names
+outDir = "imgs"
+licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE)
+# In testing, this downloaded about 100k images, over several days
+
+if not os.path.exists(outDir):
+ os.mkdir(outDir)
+print("Checking for already-downloaded images")
+fileList = os.listdir(outDir)
+pageIdsDone = set()
+for filename in fileList:
+ (basename, extension) = os.path.splitext(filename)
+ pageIdsDone.add(int(basename))
+print(f"Found {len(pageIdsDone)}")
+
+# Set SIGINT handler
+interrupted = False
+oldHandler = None
+def onSigint(sig, frame):
+ global interrupted
+ interrupted = True
+ signal.signal(signal.SIGINT, oldHandler)
+oldHandler = signal.signal(signal.SIGINT, onSigint)
+
+print("Opening database")
+dbCon = sqlite3.connect(imgDb)
+dbCur = dbCon.cursor()
+print("Starting downloads")
+iterNum = 0
+query = "SELECT page_id, license, artist, credit, restrictions, url FROM" \
+ " imgs INNER JOIN page_imgs ON imgs.name = page_imgs.img_name"
+for (pageId, license, artist, credit, restrictions, url) in dbCur.execute(query):
+ if pageId in pageIdsDone:
+ continue
+ if interrupted:
+ print(f"Exiting loop")
+ break
+ # Check for problematic attributes
+ if license == None or licenseRegex.fullmatch(license) == None:
+ continue
+ if artist == None or artist == "" or len(artist) > 100 or re.match(r"(\d\. )?File:", artist) != None:
+ continue
+ if credit == None or len(credit) > 300 or re.match(r"File:", credit) != None:
+ continue
+ if restrictions != None and restrictions != "":
+ continue
+ # Download image
+ iterNum += 1
+ print(f"Iteration {iterNum}: Downloading for page-id {pageId}")
+ urlParts = urllib.parse.urlparse(url)
+ extension = os.path.splitext(urlParts.path)[1]
+ if len(extension) <= 1:
+ print(f"WARNING: No filename extension found in URL {url}")
+ sys.exit(1)
+ outFile = f"{outDir}/{pageId}{extension}"
+ headers = {
+ "user-agent": "terryt.dev (terry06890@gmail.com)",
+ "accept-encoding": "gzip",
+ }
+ try:
+ response = requests.get(url, headers=headers)
+ with open(outFile, 'wb') as file:
+ file.write(response.content)
+ time.sleep(1)
+ # https://en.wikipedia.org/wiki/Wikipedia:Database_download says to "throttle self to 1 cache miss per sec"
+ # It's unclear how to properly check for cache misses, so this just aims for 1 per sec
+ except Exception as e:
+ print(f"Error while downloading to {outFile}: {e}")
+print("Closing database")
+dbCon.close()
diff --git a/backend/tolData/enwiki/genDescData.py b/backend/tolData/enwiki/genDescData.py
new file mode 100755
index 0000000..b0ca272
--- /dev/null
+++ b/backend/tolData/enwiki/genDescData.py
@@ -0,0 +1,127 @@
+#!/usr/bin/python3
+
+import sys, os, re
+import bz2
+import html, mwxml, mwparserfromhell
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads through the wiki dump, and attempts to
+parse short-descriptions, and add them to a database.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # Had about 22e6 pages
+enwikiDb = "descData.db"
+# In testing, this script took over 10 hours to run, and generated about 5GB
+
+descLineRegex = re.compile("^ *[A-Z'\"]")
+embeddedHtmlRegex = re.compile(r"<[^<]+/>|<!--[^<]+-->|<[^</]+>([^<]*|[^<]*<[^<]+>[^<]*)</[^<]+>|<[^<]+$")
+ # Recognises a self-closing HTML tag, a tag with 0 children, tag with 1 child with 0 children, or unclosed tag
+convertTemplateRegex = re.compile(r"{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}")
+def convertTemplateReplace(match):
+ if match.group(2) == None:
+ return f"{match.group(1)} {match.group(4)}"
+ else:
+ return f"{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}"
+parensGroupRegex = re.compile(r" \([^()]*\)")
+leftoverBraceRegex = re.compile(r"(?:{\||{{).*")
+
+def parseDesc(text):
+ # Find first matching line outside {{...}}, [[...]], and block-html-comment constructs,
+ # and then accumulate lines until a blank one.
+ # Some cases not accounted for include: disambiguation pages, abstracts with sentences split-across-lines,
+ # nested embedded html, 'content significant' embedded-html, markup not removable with mwparsefromhell,
+ lines = []
+ openBraceCount = 0
+ openBracketCount = 0
+ inComment = False
+ skip = False
+ for line in text.splitlines():
+ line = line.strip()
+ if len(lines) == 0:
+ if len(line) > 0:
+ if openBraceCount > 0 or line[0] == "{":
+ openBraceCount += line.count("{")
+ openBraceCount -= line.count("}")
+ skip = True
+ if openBracketCount > 0 or line[0] == "[":
+ openBracketCount += line.count("[")
+ openBracketCount -= line.count("]")
+ skip = True
+ if inComment or line.find("<!--") != -1:
+ if line.find("-->") != -1:
+ if inComment:
+ inComment = False
+ skip = True
+ else:
+ inComment = True
+ skip = True
+ if skip:
+ skip = False
+ continue
+ if line[-1] == ":": # Seems to help avoid disambiguation pages
+ return None
+ if descLineRegex.match(line) != None:
+ lines.append(line)
+ else:
+ if len(line) == 0:
+ return removeMarkup(" ".join(lines))
+ lines.append(line)
+ if len(lines) > 0:
+ return removeMarkup(" ".join(lines))
+ return None
+def removeMarkup(content):
+ content = embeddedHtmlRegex.sub("", content)
+ content = convertTemplateRegex.sub(convertTemplateReplace, content)
+ content = mwparserfromhell.parse(content).strip_code() # Remove wikitext markup
+ content = parensGroupRegex.sub("", content)
+ content = leftoverBraceRegex.sub("", content)
+ return content
+def convertTitle(title):
+ return html.unescape(title).replace("_", " ")
+
+print("Creating database")
+if os.path.exists(enwikiDb):
+ raise Exception(f"ERROR: Existing {enwikiDb}")
+dbCon = sqlite3.connect(enwikiDb)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE)")
+dbCur.execute("CREATE INDEX pages_title_idx ON pages(title COLLATE NOCASE)")
+dbCur.execute("CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT)")
+dbCur.execute("CREATE INDEX redirects_idx ON redirects(target)")
+dbCur.execute("CREATE TABLE descs (id INT PRIMARY KEY, desc TEXT)")
+
+print("Iterating through dump file")
+with bz2.open(dumpFile, mode='rt') as file:
+ dump = mwxml.Dump.from_file(file)
+ pageNum = 0
+ for page in dump:
+ pageNum += 1
+ if pageNum % 1e4 == 0:
+ print(f"At page {pageNum}")
+ # Parse page
+ if page.namespace == 0:
+ try:
+ dbCur.execute("INSERT INTO pages VALUES (?, ?)", (page.id, convertTitle(page.title)))
+ except sqlite3.IntegrityError as e:
+ # Accounts for certain pages that have the same title
+ print(f"Failed to add page with title \"{page.title}\": {e}", file=sys.stderr)
+ continue
+ if page.redirect != None:
+ dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (page.id, convertTitle(page.redirect)))
+ else:
+ revision = next(page)
+ desc = parseDesc(revision.text)
+ if desc != None:
+ dbCur.execute("INSERT INTO descs VALUES (?, ?)", (page.id, desc))
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
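The `{{convert|...}}` handling above can be sketched in isolation, using the same regex and replacement function applied to made-up wikitext:

```python
import re

# Same pattern and replacement as in genDescData.py
convertTemplateRegex = re.compile(r"{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}")

def convertTemplateReplace(match):
	# Render a convert template as plain text, with or without a range
	if match.group(2) is None:
		return f"{match.group(1)} {match.group(4)}"
	return f"{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}"

simple = convertTemplateRegex.sub(convertTemplateReplace, "It grows to {{convert|3|kg|lb}}.")
ranged = convertTemplateRegex.sub(convertTemplateReplace, "Shells span {{convert|5|to|10|cm|in}}.")
# simple → "It grows to 3 kg.", ranged → "Shells span 5 to 10 cm."
```

Note that trailing template parameters (like the `|lb` target unit) are consumed by the `[^}]*` before the closing braces, so only the first quantity and unit survive.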
diff --git a/backend/tolData/enwiki/genDumpIndexDb.py b/backend/tolData/enwiki/genDumpIndexDb.py
new file mode 100755
index 0000000..3955885
--- /dev/null
+++ b/backend/tolData/enwiki/genDumpIndexDb.py
@@ -0,0 +1,58 @@
+#!/usr/bin/python3
+
+import sys, os, re
+import bz2
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Adds data from the wiki dump index-file into a database.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+indexFile = "enwiki-20220501-pages-articles-multistream-index.txt.bz2" # Had about 22e6 lines
+indexDb = "dumpIndex.db"
+
+if os.path.exists(indexDb):
+ raise Exception(f"ERROR: Existing {indexDb}")
+print("Creating database")
+dbCon = sqlite3.connect(indexDb)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE offsets (title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT)")
+
+print("Iterating through index file")
+lineRegex = re.compile(r"([^:]+):([^:]+):(.*)")
+lastOffset = 0
+lineNum = 0
+entriesToAdd = []
+with bz2.open(indexFile, mode='rt') as file:
+ for line in file:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ #
+ match = lineRegex.fullmatch(line.rstrip())
+ (offset, pageId, title) = match.group(1,2,3)
+ offset = int(offset)
+ if offset > lastOffset:
+ for (t, p) in entriesToAdd:
+ try:
+ dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (t, p, lastOffset, offset))
+ except sqlite3.IntegrityError as e:
+ # Accounts for certain entries in the file that have the same title
+ print(f"Failed on title \"{t}\": {e}", file=sys.stderr)
+ entriesToAdd = []
+ lastOffset = offset
+ entriesToAdd.append([title, pageId])
+for (title, pageId) in entriesToAdd:
+ try:
+ dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (title, pageId, lastOffset, -1))
+ except sqlite3.IntegrityError as e:
+		print(f"Failed on title \"{title}\": {e}", file=sys.stderr)
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
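The index-line parsing relies on the last group being greedy, which is what lets page titles themselves contain colons. A minimal sketch with an invented index line:

```python
import re

# Same pattern as in genDumpIndexDb.py; index lines have the form 'byteOffset:pageId:title'
lineRegex = re.compile(r"([^:]+):([^:]+):(.*)")

# The trailing (.*) absorbs any further colons, so titles like this parse correctly
match = lineRegex.fullmatch("600:12:Star Trek: The Next Generation")
offset, pageId, title = int(match.group(1)), int(match.group(2)), match.group(3)
# → (600, 12, "Star Trek: The Next Generation")
```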
diff --git a/backend/tolData/enwiki/genImgData.py b/backend/tolData/enwiki/genImgData.py
new file mode 100755
index 0000000..dedfe14
--- /dev/null
+++ b/backend/tolData/enwiki/genImgData.py
@@ -0,0 +1,190 @@
+#!/usr/bin/python3
+
+import sys, re
+import bz2, html, urllib.parse
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+For some set of page IDs, looks up their content in the wiki dump,
+and tries to parse infobox image names, storing them into a database.
+
+The program can be re-run with an updated set of page IDs, and
+will skip already-processed page IDs.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+def getInputPageIds():
+ pageIds = set()
+ dbCon = sqlite3.connect("../data.db")
+ dbCur = dbCon.cursor()
+ for (pageId,) in dbCur.execute("SELECT id from wiki_ids"):
+ pageIds.add(pageId)
+ dbCon.close()
+ return pageIds
+dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
+indexDb = "dumpIndex.db"
+imgDb = "imgData.db" # The database to create
+idLineRegex = re.compile(r"<id>(.*)</id>")
+imageLineRegex = re.compile(r".*\| *image *= *([^|]*)")
+bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]")
+imageNameRegex = re.compile(r".*\.(jpg|jpeg|png|gif|tiff|tif)", flags=re.IGNORECASE)
+cssImgCropRegex = re.compile(r"{{css image crop\|image *= *(.*)", flags=re.IGNORECASE)
+# In testing, got about 360k image names
+
+print("Getting input page-ids")
+pageIds = getInputPageIds()
+print(f"Found {len(pageIds)}")
+
+print("Opening databases")
+indexDbCon = sqlite3.connect(indexDb)
+indexDbCur = indexDbCon.cursor()
+imgDbCon = sqlite3.connect(imgDb)
+imgDbCur = imgDbCon.cursor()
+print("Checking tables")
+if imgDbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='page_imgs'").fetchone() == None:
+ # Create tables if not present
+ imgDbCur.execute("CREATE TABLE page_imgs (page_id INT PRIMARY KEY, img_name TEXT)") # img_name may be NULL
+ imgDbCur.execute("CREATE INDEX page_imgs_idx ON page_imgs(img_name)")
+else:
+ # Check for already-processed page IDs
+ numSkipped = 0
+ for (pid,) in imgDbCur.execute("SELECT page_id FROM page_imgs"):
+ if pid in pageIds:
+ pageIds.remove(pid)
+ numSkipped += 1
+ else:
+ print(f"WARNING: Found already-processed page ID {pid} which was not in input set")
+ print(f"Will skip {numSkipped} already-processed page IDs")
+
+print("Getting dump-file offsets")
+offsetToPageids = {}
+offsetToEnd = {} # Maps chunk-start offsets to their chunk-end offsets
+iterNum = 0
+for pageId in pageIds:
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ query = "SELECT offset, next_offset FROM offsets WHERE id = ?"
+ row = indexDbCur.execute(query, (pageId,)).fetchone()
+ if row == None:
+ print(f"WARNING: Page ID {pageId} not found")
+ continue
+ (chunkOffset, endOffset) = row
+ offsetToEnd[chunkOffset] = endOffset
+ if chunkOffset not in offsetToPageids:
+ offsetToPageids[chunkOffset] = []
+ offsetToPageids[chunkOffset].append(pageId)
+print(f"Found {len(offsetToEnd)} chunks to check")
+
+print("Iterating through chunks in dump file")
+def getImageName(content):
+	" Given an array of text-content lines, tries to return an infobox image name, or None "
+	# Doesn't try to find images in outside-infobox [[File:...]] and <imagemap> sections
+ for line in content:
+ match = imageLineRegex.match(line)
+ if match != None:
+ imageName = match.group(1).strip()
+ if imageName == "":
+ return None
+ imageName = html.unescape(imageName)
+ # Account for {{...
+ if imageName.startswith("{"):
+ match = cssImgCropRegex.match(imageName)
+ if match == None:
+ return None
+ imageName = match.group(1)
+ # Account for [[File:...|...]]
+ if imageName.startswith("["):
+ match = bracketImageRegex.match(imageName)
+ if match == None:
+ return None
+ imageName = match.group(1)
+ # Account for <!--
+ if imageName.find("<!--") != -1:
+ return None
+ # Remove an initial 'File:'
+ if imageName.startswith("File:"):
+ imageName = imageName[5:]
+ # Remove an initial 'Image:'
+ if imageName.startswith("Image:"):
+ imageName = imageName[6:]
+ # Check for extension
+ match = imageNameRegex.match(imageName)
+ if match != None:
+ imageName = match.group(0)
+ imageName = urllib.parse.unquote(imageName)
+ imageName = html.unescape(imageName) # Intentionally unescaping again (handles some odd cases)
+ imageName = imageName.replace("_", " ")
+ return imageName
+ # Exclude lines like: | image = &lt;imagemap&gt;
+ return None
+ return None
+with open(dumpFile, mode='rb') as file:
+ iterNum = 0
+ for (pageOffset, endOffset) in offsetToEnd.items():
+ iterNum += 1
+ if iterNum % 100 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ pageIds = offsetToPageids[pageOffset]
+ # Jump to chunk
+ file.seek(pageOffset)
+ compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset)
+ data = bz2.BZ2Decompressor().decompress(compressedData).decode()
+ # Look in chunk for pages
+ lines = data.splitlines()
+ lineIdx = 0
+ while lineIdx < len(lines):
+ # Look for <page>
+ if lines[lineIdx].lstrip() != "<page>":
+ lineIdx += 1
+ continue
+ # Check page id
+ lineIdx += 3
+ idLine = lines[lineIdx].lstrip()
+ match = idLineRegex.fullmatch(idLine)
+ if match == None or int(match.group(1)) not in pageIds:
+ lineIdx += 1
+ continue
+ pageId = int(match.group(1))
+ lineIdx += 1
+ # Look for <text> in <page>
+ foundText = False
+ while lineIdx < len(lines):
+ if not lines[lineIdx].lstrip().startswith("<text "):
+ lineIdx += 1
+ continue
+ foundText = True
+ # Get text content
+ content = []
+ line = lines[lineIdx]
+ content.append(line[line.find(">") + 1:])
+ lineIdx += 1
+ foundTextEnd = False
+ while lineIdx < len(lines):
+ line = lines[lineIdx]
+ if not line.endswith("</text>"):
+ content.append(line)
+ lineIdx += 1
+ continue
+ foundTextEnd = True
+ content.append(line[:line.rfind("</text>")])
+ # Look for image-filename
+ imageName = getImageName(content)
+ imgDbCur.execute("INSERT into page_imgs VALUES (?, ?)", (pageId, imageName))
+ break
+ if not foundTextEnd:
+ print(f"WARNING: Did not find </text> for page id {pageId}")
+ break
+ if not foundText:
+ print(f"WARNING: Did not find <text> for page id {pageId}")
+
+print("Closing databases")
+indexDbCon.close()
+imgDbCon.commit()
+imgDbCon.close()
diff --git a/backend/tolData/enwiki/lookupPage.py b/backend/tolData/enwiki/lookupPage.py
new file mode 100755
index 0000000..1a90851
--- /dev/null
+++ b/backend/tolData/enwiki/lookupPage.py
@@ -0,0 +1,68 @@
+#!/usr/bin/python3
+
+import sys, re
+import bz2
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]} title1
+
+Looks up a page with title title1 in the wiki dump, using
+the dump-index db, and prints the corresponding <page>.
+"""
+if len(sys.argv) != 2:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
+indexDb = "dumpIndex.db"
+pageTitle = sys.argv[1].replace("_", " ")
+
+print("Looking up offset in index db")
+dbCon = sqlite3.connect(indexDb)
+dbCur = dbCon.cursor()
+query = "SELECT title, offset, next_offset FROM offsets WHERE title = ?"
+row = dbCur.execute(query, (pageTitle,)).fetchone()
+if row == None:
+ print("Title not found")
+ sys.exit(0)
+_, pageOffset, endOffset = row
+dbCon.close()
+print(f"Found chunk at offset {pageOffset}")
+
+print("Reading from wiki dump")
+content = []
+with open(dumpFile, mode='rb') as file:
+ # Get uncompressed chunk
+ file.seek(pageOffset)
+ compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset)
+ data = bz2.BZ2Decompressor().decompress(compressedData).decode()
+ # Look in chunk for page
+ lines = data.splitlines()
+ lineIdx = 0
+ found = False
+ pageNum = 0
+ while not found:
+ line = lines[lineIdx]
+ if line.lstrip() == "<page>":
+ pageNum += 1
+ if pageNum > 100:
+ print("ERROR: Did not find title after 100 pages")
+ break
+ lineIdx += 1
+ titleLine = lines[lineIdx]
+ if titleLine.lstrip() == '<title>' + pageTitle + '</title>':
+ found = True
+ print(f"Found title in chunk as page {pageNum}")
+ content.append(line)
+ content.append(titleLine)
+ while True:
+ lineIdx += 1
+ line = lines[lineIdx]
+ content.append(line)
+ if line.lstrip() == "</page>":
+ break
+ lineIdx += 1
+
+print("Content: ")
+print("\n".join(content))
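The chunk-reading pattern used in genImgData.py and lookupPage.py (seek to a stream offset, then decompress only that stream) works because a multistream dump is just independent bz2 streams concatenated, and `BZ2Decompressor` stops at the end of a single stream. A self-contained sketch with an artificial two-stream buffer (in the real scripts, offsets come from dumpIndex.db):

```python
import bz2, io

# Build an artificial 'multistream' file: two independent bz2 streams concatenated
streams = [bz2.compress(b"<page>first</page>"), bz2.compress(b"<page>second</page>")]
data = io.BytesIO(b"".join(streams))

# Seek to the second stream's offset and decompress just that stream
data.seek(len(streams[0]))
chunk = bz2.BZ2Decompressor().decompress(data.read())
# chunk → b"<page>second</page>"
```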
diff --git a/backend/tolData/eol/README.md b/backend/tolData/eol/README.md
new file mode 100644
index 0000000..8c527a8
--- /dev/null
+++ b/backend/tolData/eol/README.md
@@ -0,0 +1,26 @@
+This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/).
+
+# Name Data Files
+- vernacularNames.csv <br>
+ Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020).
+ Contains alternative-name data from EOL.
+
+# Image Metadata Files
+- imagesList.tgz <br>
+ Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020).
+ Contains metadata for images from EOL.
+- imagesList/ <br>
+ Extracted from imagesList.tgz.
+- genImagesListDb.sh <br>
+ Creates a database, and imports imagesList/*.csv files into it.
+- imagesList.db <br>
+ Created by running genImagesListDb.sh <br>
+ Tables: <br>
+ - `images`:
+ `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT`
+
+# Image Generation Files
+- downloadImgs.py <br>
+ Used to download image files into imgsForReview/.
+- reviewImgs.py <br>
+ Used to review images in imgsForReview/, moving acceptable ones into imgs/.
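For reference, the `images` table described above can be queried as below. This is a sketch using an in-memory database with the same schema and an invented row, since imagesList.db itself is large:

```python
import sqlite3

# Recreate the schema from genImagesListDb.sh in memory (illustrative data only)
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
	"CREATE TABLE images (content_id INT PRIMARY KEY, page_id INT,"
	" source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT)")
cur.execute("INSERT INTO images VALUES (1, 1045608, 'https://example.org/a', 'data/a.jpg', 'cc-by-4.0', 'Someone')")
rows = cur.execute("SELECT copy_url, license FROM images WHERE page_id = ?", (1045608,)).fetchall()
con.close()
# rows → [('data/a.jpg', 'cc-by-4.0')]
```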
diff --git a/backend/tolData/eol/downloadImgs.py b/backend/tolData/eol/downloadImgs.py
new file mode 100755
index 0000000..96bc085
--- /dev/null
+++ b/backend/tolData/eol/downloadImgs.py
@@ -0,0 +1,147 @@
+#!/usr/bin/python3
+
+import sys, re, os, random
+import sqlite3
+import urllib.parse, requests
+import time
+from threading import Thread
+import signal
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+For some set of EOL IDs, downloads associated images from URLs in
+an image-list database. Uses multiple downloading threads.
+
+May obtain multiple images per ID. The images will get names
+with the form 'eolId1 contentId1.ext1'.
+
+SIGINT causes the program to finish ongoing downloads and exit.
+The program can be re-run to continue downloading. It looks for
+already-downloaded files, and continues after the one with
+highest EOL ID.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+# In testing, this downloaded about 70k images, over a few days
+
+imagesListDb = "imagesList.db"
+def getInputEolIds():
+ eolIds = set()
+ dbCon = sqlite3.connect("../data.db")
+ dbCur = dbCon.cursor()
+ for (id,) in dbCur.execute("SELECT id FROM eol_ids"):
+ eolIds.add(id)
+ dbCon.close()
+ return eolIds
+outDir = "imgsForReview/"
+MAX_IMGS_PER_ID = 3
+MAX_THREADS = 5
+POST_DL_DELAY_MIN = 2 # Minimum delay in seconds to pause after download before starting another (for each thread)
+POST_DL_DELAY_MAX = 3
+LICENSE_REGEX = r"cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain"
+
+print("Getting input EOL IDs")
+eolIds = getInputEolIds()
+print("Getting EOL IDs to download for")
+# Get IDs from images-list db
+imgDbCon = sqlite3.connect(imagesListDb)
+imgCur = imgDbCon.cursor()
+imgListIds = set()
+for (pageId,) in imgCur.execute("SELECT DISTINCT page_id FROM images"):
+ imgListIds.add(pageId)
+# Get set intersection, and sort into list
+eolIds = eolIds.intersection(imgListIds)
+eolIds = sorted(eolIds)
+print(f"Result: {len(eolIds)} EOL IDs")
+
+print("Checking output directory")
+if not os.path.exists(outDir):
+ os.mkdir(outDir)
+print("Finding next ID to download for")
+nextIdx = 0
+fileList = os.listdir(outDir)
+ids = [int(filename.split(" ")[0]) for filename in fileList]
+if len(ids) > 0:
+ ids.sort()
+ nextIdx = eolIds.index(ids[-1]) + 1
+if nextIdx == len(eolIds):
+ print("No IDs left. Exiting...")
+ sys.exit(0)
+
+print("Starting download threads")
+numThreads = 0
+threadException = None # Used for ending main thread after a non-main thread exception
+# Handle SIGINT signals
+interrupted = False
+oldHandler = None
+def onSigint(sig, frame):
+ global interrupted
+ interrupted = True
+ signal.signal(signal.SIGINT, oldHandler)
+oldHandler = signal.signal(signal.SIGINT, onSigint)
+# Function for threads to execute
+def downloadImg(url, outFile):
+ global numThreads, threadException
+ try:
+ data = requests.get(url)
+ with open(outFile, 'wb') as file:
+ file.write(data.content)
+ time.sleep(random.random() * (POST_DL_DELAY_MAX - POST_DL_DELAY_MIN) + POST_DL_DELAY_MIN)
+ except Exception as e:
+ print(f"Error while downloading to {outFile}: {str(e)}", file=sys.stderr)
+ threadException = e
+ numThreads -= 1
+# Manage downloading
+for idx in range(nextIdx, len(eolIds)):
+ eolId = eolIds[idx]
+ # Get image urls
+ imgDataList = []
+ ownerSet = set() # Used to get images from different owners, for variety
+ exitLoop = False
+ query = "SELECT content_id, copy_url, license, copyright_owner FROM images WHERE page_id = ?"
+ for (contentId, url, license, copyrightOwner) in imgCur.execute(query, (eolId,)):
+ if url.startswith("data/"):
+ url = "https://content.eol.org/" + url
+ urlParts = urllib.parse.urlparse(url)
+ extension = os.path.splitext(urlParts.path)[1]
+ if len(extension) <= 1:
+ print(f"WARNING: No filename extension found in URL {url}", file=sys.stderr)
+ continue
+ # Check image-quantity limit
+ if len(ownerSet) == MAX_IMGS_PER_ID:
+ break
+ # Check for skip conditions
+ if re.fullmatch(LICENSE_REGEX, license) == None:
+ continue
+ if len(copyrightOwner) > 100: # Avoid certain copyrightOwner fields that seem long and problematic
+ continue
+ if copyrightOwner in ownerSet:
+ continue
+ ownerSet.add(copyrightOwner)
+ # Determine output filename
+ outPath = f"{outDir}{eolId} {contentId}{extension}"
+ if os.path.exists(outPath):
+ print(f"WARNING: {outPath} already exists. Skipping download.")
+ continue
+ # Check thread limit
+ while numThreads == MAX_THREADS:
+ time.sleep(1)
+ # Wait for threads after an interrupt or thread-exception
+ if interrupted or threadException != None:
+ print("Waiting for existing threads to end")
+ while numThreads > 0:
+ time.sleep(1)
+ exitLoop = True
+ break
+ # Perform download
+ print(f"Downloading image to {outPath}")
+ numThreads += 1
+ thread = Thread(target=downloadImg, args=(url, outPath), daemon=True)
+ thread.start()
+ if exitLoop:
+ break
+# Close images-list db
+print("Finished downloading")
+imgDbCon.close()
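The license filter above accepts CC-BY/CC-BY-NC/CC-BY-SA variants and public-domain markers while rejecting anything else (notably CC-BY-ND). Its behaviour can be checked in isolation:

```python
import re

# Same pattern as LICENSE_REGEX in downloadImgs.py
LICENSE_REGEX = r"cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain"

candidates = ["cc-by-4.0", "cc-by-nc-sa-3.0", "public domain", "cc-by-nd-4.0", "all rights reserved"]
accepted = [s for s in candidates if re.fullmatch(LICENSE_REGEX, s) is not None]
# accepted → ["cc-by-4.0", "cc-by-nc-sa-3.0", "public domain"]
```

Because `fullmatch` is used, a license string like `cc-by-nd-4.0` fails: the `cc-by` branch cannot consume the `-nd` suffix, so the whole string doesn't match.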
diff --git a/backend/tolData/eol/genImagesListDb.sh b/backend/tolData/eol/genImagesListDb.sh
new file mode 100755
index 0000000..87dd840
--- /dev/null
+++ b/backend/tolData/eol/genImagesListDb.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+set -e
+
+# Combine CSV files into one, skipping header lines
+cat imagesList/media_*_{1..58}.csv | tail -n +2 > imagesList.csv
+# Create database, and import the CSV file
+sqlite3 imagesList.db <<END
+CREATE TABLE images (
+ content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT);
+.mode csv
+.import 'imagesList.csv' images
+END
diff --git a/backend/tolData/eol/reviewImgs.py b/backend/tolData/eol/reviewImgs.py
new file mode 100755
index 0000000..ecdf7ab
--- /dev/null
+++ b/backend/tolData/eol/reviewImgs.py
@@ -0,0 +1,205 @@
+#!/usr/bin/python3
+
+import sys, re, os, time
+import sqlite3
+import tkinter as tki
+from tkinter import ttk
+import PIL
+from PIL import ImageTk, Image, ImageOps
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Provides a GUI for reviewing images. Looks in a for-review directory for
+images named 'eolId1 contentId1.ext1', and, for each EOL ID, enables the user to
+choose an image to keep, or reject all. Also provides image rotation.
+Chosen images are placed in another directory, and rejected ones are deleted.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+imgDir = "imgsForReview/"
+outDir = "imgs/"
+extraInfoDbCon = sqlite3.connect("../data.db")
+extraInfoDbCur = extraInfoDbCon.cursor()
+def getExtraInfo(eolId):
+ global extraInfoDbCur
+ query = "SELECT names.alt_name FROM" \
+ " names INNER JOIN eol_ids ON eol_ids.name = names.name" \
+ " WHERE id = ? and pref_alt = 1"
+ row = extraInfoDbCur.execute(query, (eolId,)).fetchone()
+ if row != None:
+ return f"Reviewing EOL ID {eolId}, aka \"{row[0]}\""
+ else:
+ return f"Reviewing EOL ID {eolId}"
+IMG_DISPLAY_SZ = 400
+MAX_IMGS_PER_ID = 3
+IMG_BG_COLOR = (88, 28, 135)
+PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), IMG_BG_COLOR)
+
+print("Checking output directory")
+if not os.path.exists(outDir):
+ os.mkdir(outDir)
+print("Getting input image list")
+imgList = os.listdir(imgDir)
+imgList.sort(key=lambda s: int(s.split(" ")[0]))
+if len(imgList) == 0:
+ print("No input images found")
+ sys.exit(0)
+
+class EolImgReviewer:
+ " Provides the GUI for reviewing images "
+ def __init__(self, root, imgList):
+ self.root = root
+ root.title("EOL Image Reviewer")
+ # Setup main frame
+ mainFrame = ttk.Frame(root, padding="5 5 5 5")
+ mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S))
+ root.columnconfigure(0, weight=1)
+ root.rowconfigure(0, weight=1)
+ # Set up images-to-be-reviewed frames
+ self.imgs = [PLACEHOLDER_IMG] * MAX_IMGS_PER_ID # Stored as fields for use in rotation
+ self.photoImgs = list(map(lambda img: ImageTk.PhotoImage(img), self.imgs)) # Image objects usable by tkinter
+		# A persistent reference must be kept: tkinter doesn't hold one, and a garbage-collected PhotoImage doesn't display
+ self.labels = []
+ for i in range(MAX_IMGS_PER_ID):
+ frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ)
+ frame.grid(column=i, row=0)
+ label = ttk.Label(frame, image=self.photoImgs[i])
+ label.grid(column=0, row=0)
+ self.labels.append(label)
+ # Add padding
+ for child in mainFrame.winfo_children():
+ child.grid_configure(padx=5, pady=5)
+ # Add keyboard bindings
+ root.bind("<q>", self.quit)
+ root.bind("<Key-j>", lambda evt: self.accept(0))
+ root.bind("<Key-k>", lambda evt: self.accept(1))
+ root.bind("<Key-l>", lambda evt: self.accept(2))
+ root.bind("<Key-i>", lambda evt: self.reject())
+ root.bind("<Key-a>", lambda evt: self.rotate(0))
+ root.bind("<Key-s>", lambda evt: self.rotate(1))
+ root.bind("<Key-d>", lambda evt: self.rotate(2))
+ root.bind("<Key-A>", lambda evt: self.rotate(0, True))
+ root.bind("<Key-S>", lambda evt: self.rotate(1, True))
+ root.bind("<Key-D>", lambda evt: self.rotate(2, True))
+ # Initialise images to review
+ self.imgList = imgList
+ self.imgListIdx = 0
+ self.nextEolId = 0
+ self.nextImgNames = []
+ self.rotations = []
+ self.getNextImgs()
+ # For displaying extra info
+ self.numReviewed = 0
+ self.startTime = time.time()
+ def getNextImgs(self):
+ " Updates display with new images to review, or ends program "
+ # Gather names of next images to review
+ for i in range(MAX_IMGS_PER_ID):
+ if self.imgListIdx == len(self.imgList):
+ if i == 0:
+ self.quit()
+ return
+ break
+ imgName = self.imgList[self.imgListIdx]
+ eolId = int(re.match(r"(\d+) (\d+)", imgName).group(1))
+ if i == 0:
+ self.nextEolId = eolId
+ self.nextImgNames = [imgName]
+ self.rotations = [0]
+ else:
+ if self.nextEolId != eolId:
+ break
+ self.nextImgNames.append(imgName)
+ self.rotations.append(0)
+ self.imgListIdx += 1
+ # Update displayed images
+ idx = 0
+ while idx < MAX_IMGS_PER_ID:
+ if idx < len(self.nextImgNames):
+ try:
+ img = Image.open(imgDir + self.nextImgNames[idx])
+ img = ImageOps.exif_transpose(img)
+ except PIL.UnidentifiedImageError:
+ os.remove(imgDir + self.nextImgNames[idx])
+ del self.nextImgNames[idx]
+ del self.rotations[idx]
+ continue
+ self.imgs[idx] = self.resizeImgForDisplay(img)
+ else:
+ self.imgs[idx] = PLACEHOLDER_IMG
+ self.photoImgs[idx] = ImageTk.PhotoImage(self.imgs[idx])
+ self.labels[idx].config(image=self.photoImgs[idx])
+ idx += 1
+ # Restart if all image files non-recognisable
+ if len(self.nextImgNames) == 0:
+ self.getNextImgs()
+ return
+ # Update title
+ firstImgIdx = self.imgListIdx - len(self.nextImgNames) + 1
+ lastImgIdx = self.imgListIdx
+ title = getExtraInfo(self.nextEolId)
+ title += f" (imgs {firstImgIdx} to {lastImgIdx} out of {len(self.imgList)})"
+ self.root.title(title)
+ def accept(self, imgIdx):
+ " React to a user selecting an image "
+ if imgIdx >= len(self.nextImgNames):
+ print("Invalid selection")
+ return
+ for i in range(len(self.nextImgNames)):
+ inFile = imgDir + self.nextImgNames[i]
+ if i == imgIdx: # Move accepted image, rotating if needed
+ outFile = outDir + self.nextImgNames[i]
+ img = Image.open(inFile)
+ img = ImageOps.exif_transpose(img)
+ if self.rotations[i] != 0:
+ img = img.rotate(self.rotations[i], expand=True)
+ img.save(outFile)
+ os.remove(inFile)
+ else: # Delete non-accepted image
+ os.remove(inFile)
+ self.numReviewed += 1
+ self.getNextImgs()
+ def reject(self):
+ " React to a user rejecting all images of a set "
+ for i in range(len(self.nextImgNames)):
+ os.remove(imgDir + self.nextImgNames[i])
+ self.numReviewed += 1
+ self.getNextImgs()
+ def rotate(self, imgIdx, anticlockwise = False):
+ " Respond to a user rotating an image "
+ deg = -90 if not anticlockwise else 90
+ self.imgs[imgIdx] = self.imgs[imgIdx].rotate(deg)
+ self.photoImgs[imgIdx] = ImageTk.PhotoImage(self.imgs[imgIdx])
+ self.labels[imgIdx].config(image=self.photoImgs[imgIdx])
+ self.rotations[imgIdx] = (self.rotations[imgIdx] + deg) % 360
+ def quit(self, e = None):
+ global extraInfoDbCon
+ print(f"Number reviewed: {self.numReviewed}")
+ timeElapsed = time.time() - self.startTime
+ print(f"Time elapsed: {timeElapsed:.2f} seconds")
+ if self.numReviewed > 0:
+ print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds")
+ extraInfoDbCon.close()
+ self.root.destroy()
+ def resizeImgForDisplay(self, img):
+		" Returns a copy of an image, shrunk to fit in its frame (keeps aspect ratio), and with a background "
+ if max(img.width, img.height) > IMG_DISPLAY_SZ:
+ if (img.width > img.height):
+ newHeight = int(img.height * IMG_DISPLAY_SZ/img.width)
+ img = img.resize((IMG_DISPLAY_SZ, newHeight))
+ else:
+ newWidth = int(img.width * IMG_DISPLAY_SZ / img.height)
+ img = img.resize((newWidth, IMG_DISPLAY_SZ))
+ bgImg = PLACEHOLDER_IMG.copy()
+ bgImg.paste(img, box=(
+ int((IMG_DISPLAY_SZ - img.width) / 2),
+ int((IMG_DISPLAY_SZ - img.height) / 2)))
+ return bgImg
+# Create GUI and defer control
+print("Starting GUI")
+root = tki.Tk()
+EolImgReviewer(root, imgList)
+root.mainloop()
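The aspect-preserving shrink in `resizeImgForDisplay` can be sketched as pure arithmetic, independent of PIL (a hypothetical `fitSize` helper, not part of the script):

```python
IMG_DISPLAY_SZ = 400

def fitSize(width, height, maxSz=IMG_DISPLAY_SZ):
	# Same aspect-preserving shrink as resizeImgForDisplay: scale the longer
	# side down to maxSz, and the shorter side proportionally; never upscale
	if max(width, height) <= maxSz:
		return (width, height)
	if width > height:
		return (maxSz, int(height * maxSz / width))
	return (int(width * maxSz / height), maxSz)

# fitSize(800, 600) → (400, 300); fitSize(200, 100) → (200, 100)
```

The resulting image is then pasted centred onto a `IMG_DISPLAY_SZ`-square background, so every review slot has a uniform size.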
diff --git a/backend/tolData/genDbpData.py b/backend/tolData/genDbpData.py
new file mode 100755
index 0000000..df3a6be
--- /dev/null
+++ b/backend/tolData/genDbpData.py
@@ -0,0 +1,247 @@
+#!/usr/bin/python3
+
+import sys, os, re
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads a database containing data from DBpedia, and tries to associate
+DBpedia IRIs with nodes in a database, adding short-descriptions for them.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+dbpediaDb = "dbpedia/descData.db"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedDbpLabels.txt"
+dbFile = "data.db"
+rootNodeName = "cellular organisms"
+rootLabel = "organism" # Will be associated with root node
+# Got about 400k descriptions when testing
+
+print("Opening databases")
+dbpCon = sqlite3.connect(dbpediaDb)
+dbpCur = dbpCon.cursor()
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+print("Getting node names")
+nodeNames = set()
+for (name,) in dbCur.execute("SELECT name from nodes"):
+ nodeNames.add(name)
+
+print("Checking for names to skip")
+oldSz = len(nodeNames)
+if os.path.exists(namesToSkipFile):
+ with open(namesToSkipFile) as file:
+ for line in file:
+ nodeNames.remove(line.rstrip())
+print(f"Skipping {oldSz - len(nodeNames)} nodes")
+
+print("Reading disambiguation-page labels")
+disambigLabels = set()
+query = "SELECT labels.iri from labels INNER JOIN disambiguations ON labels.iri = disambiguations.iri"
+for (label,) in dbpCur.execute(query):
+ disambigLabels.add(label)
+
+print("Trying to associate nodes with DBpedia labels")
+nodeToLabel = {}
+nameVariantRegex = re.compile(r"(.*) \(([^)]+)\)") # Used to recognise labels like 'Thor (shrimp)'
+nameToVariants = {} # Maps node names to lists of matching labels
+iterNum = 0
+for (label,) in dbpCur.execute("SELECT label from labels"):
+ iterNum += 1
+ if iterNum % 1e5 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ if label in disambigLabels:
+ continue
+ name = label.lower()
+ if name in nodeNames:
+ if name not in nameToVariants:
+ nameToVariants[name] = [label]
+ elif label not in nameToVariants[name]:
+ nameToVariants[name].append(label)
+ else:
+ match = nameVariantRegex.fullmatch(name)
+ if match != None:
+ subName = match.group(1)
+ if subName in nodeNames and match.group(2) != "disambiguation":
+ if subName not in nameToVariants:
+ nameToVariants[subName] = [label]
+ elif name not in nameToVariants[subName]:
+ nameToVariants[subName].append(label)
+# Associate labels without conflicts
+for (name, variants) in nameToVariants.items():
+ if len(variants) == 1:
+ nodeToLabel[name] = variants[0]
+for name in nodeToLabel:
+ del nameToVariants[name]
+# Special case for root node
+nodeToLabel[rootNodeName] = rootLabel
+if rootNodeName in nameToVariants:
+	del nameToVariants[rootNodeName]
+
+print(f"Trying to resolve {len(nameToVariants)} conflicts")
+def resolveWithPickedLabels():
+ " Attempts to resolve conflicts using a picked-names file "
+ with open(pickedLabelsFile) as file:
+ for line in file:
+ (name, _, label) = line.rstrip().partition("|")
+ if name not in nameToVariants:
+ print(f"WARNING: No conflict found for name \"{name}\"", file=sys.stderr)
+ continue
+ if label == "":
+ del nameToVariants[name]
+ else:
+ if label not in nameToVariants[name]:
+ print(f"INFO: Picked label \"{label}\" for name \"{name}\" outside choice set", file=sys.stderr)
+ nodeToLabel[name] = label
+ del nameToVariants[name]
+def resolveWithCategoryList():
+ """
+ Attempts to resolve conflicts by looking for labels like 'name1 (category1)',
+ and choosing those with a category1 that seems 'biological'.
+ Does two passes, using more generic categories first. This helps avoid stuff like
+ Pan being classified as a horse instead of an ape.
+ """
+ generalCategories = {
+ "species", "genus",
+ "plant", "fungus", "animal",
+ "annelid", "mollusc", "arthropod", "crustacean", "insect", "bug",
+ "fish", "amphibian", "reptile", "bird", "mammal",
+ }
+ specificCategories = {
+ "protist", "alveolate", "dinoflagellates",
+ "orchid", "poaceae", "fern", "moss", "alga",
+ "bryozoan", "hydrozoan",
+ "sponge", "cnidarian", "coral", "polychaete", "echinoderm",
+ "bivalve", "gastropod", "chiton",
+ "shrimp", "decapod", "crab", "barnacle", "copepod",
+ "arachnid", "spider", "harvestman", "mite",
+ "dragonfly", "mantis", "cicada", "grasshopper", "planthopper",
+ "beetle", "fly", "butterfly", "moth", "wasp",
+ "catfish",
+ "frog",
+ "lizard",
+ "horse", "sheep", "cattle", "mouse",
+ }
+ namesToRemove = set()
+ for (name, variants) in nameToVariants.items():
+ found = False
+ for label in variants:
+ match = nameVariantRegex.match(label)
+ if match != None and match.group(2) in generalCategories:
+ nodeToLabel[name] = label
+ namesToRemove.add(name)
+ found = True
+ break
+ if not found:
+ for label in variants:
+ match = nameVariantRegex.match(label)
+ if match != None and match.group(2) in specificCategories:
+ nodeToLabel[name] = label
+ namesToRemove.add(name)
+ break
+ for name in namesToRemove:
+ del nameToVariants[name]
+def resolveWithTypeData():
+ " Attempts to resolve conflicts using DBpedia's type data "
+ taxonTypes = { # Obtained from the DBpedia ontology
+ "http://dbpedia.org/ontology/Species",
+ "http://dbpedia.org/ontology/Archaea",
+ "http://dbpedia.org/ontology/Bacteria",
+ "http://dbpedia.org/ontology/Eukaryote",
+ "http://dbpedia.org/ontology/Plant",
+ "http://dbpedia.org/ontology/ClubMoss",
+ "http://dbpedia.org/ontology/Conifer",
+ "http://dbpedia.org/ontology/CultivatedVariety",
+ "http://dbpedia.org/ontology/Cycad",
+ "http://dbpedia.org/ontology/Fern",
+ "http://dbpedia.org/ontology/FloweringPlant",
+ "http://dbpedia.org/ontology/Grape",
+ "http://dbpedia.org/ontology/Ginkgo",
+ "http://dbpedia.org/ontology/Gnetophytes",
+ "http://dbpedia.org/ontology/GreenAlga",
+ "http://dbpedia.org/ontology/Moss",
+ "http://dbpedia.org/ontology/Fungus",
+ "http://dbpedia.org/ontology/Animal",
+ "http://dbpedia.org/ontology/Fish",
+ "http://dbpedia.org/ontology/Crustacean",
+ "http://dbpedia.org/ontology/Mollusca",
+ "http://dbpedia.org/ontology/Insect",
+ "http://dbpedia.org/ontology/Arachnid",
+ "http://dbpedia.org/ontology/Amphibian",
+ "http://dbpedia.org/ontology/Reptile",
+ "http://dbpedia.org/ontology/Bird",
+ "http://dbpedia.org/ontology/Mammal",
+ "http://dbpedia.org/ontology/Cat",
+ "http://dbpedia.org/ontology/Dog",
+ "http://dbpedia.org/ontology/Horse",
+ }
+ iterNum = 0
+ for (label, type) in dbpCur.execute("SELECT label, type from labels INNER JOIN types on labels.iri = types.iri"):
+ iterNum += 1
+ if iterNum % 1e5 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ if type in taxonTypes:
+ name = label.lower()
+ if name in nameToVariants:
+ nodeToLabel[name] = label
+ del nameToVariants[name]
+ else:
+ match = nameVariantRegex.fullmatch(name)
+ if match != None:
+ name = match.group(1)
+ if name in nameToVariants:
+ nodeToLabel[name] = label
+ del nameToVariants[name]
+#resolveWithTypeData()
+#resolveWithCategoryList()
+resolveWithPickedLabels()
+print(f"Remaining number of conflicts: {len(nameToVariants)}")
+
+print("Getting node IRIs")
+nodeToIri = {}
+for (name, label) in nodeToLabel.items():
+ (iri,) = dbpCur.execute("SELECT iri FROM labels where label = ? COLLATE NOCASE", (label,)).fetchone()
+ nodeToIri[name] = iri
+
+print("Resolving redirects")
+redirectingIriSet = set()
+iterNum = 0
+for (name, iri) in nodeToIri.items():
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ row = dbpCur.execute("SELECT target FROM redirects where iri = ?", (iri,)).fetchone()
+ if row != None:
+ nodeToIri[name] = row[0]
+ redirectingIriSet.add(name)
+
+print("Adding description tables")
+dbCur.execute("CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT)")
+dbCur.execute("CREATE INDEX wiki_id_idx ON wiki_ids(id)")
+dbCur.execute("CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT)")
+iterNum = 0
+for (name, iri) in nodeToIri.items():
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ query = "SELECT abstract, id FROM abstracts INNER JOIN ids ON abstracts.iri = ids.iri WHERE ids.iri = ?"
+ row = dbpCur.execute(query, (iri,)).fetchone()
+ if row != None:
+ desc, wikiId = row
+ dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, wikiId, 1 if name in redirectingIriSet else 0))
+ dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (wikiId, desc, 1))
+
+print("Closing databases")
+dbCon.commit()
+dbCon.close()
+dbpCon.commit()
+dbpCon.close()
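The picked-labels file read by resolveWithPickedLabels above is a plain-text list of `name|label` lines, where an empty label means the name should be dropped rather than resolved. A minimal standalone sketch of that line format (`parse_picked_line` is a hypothetical helper, not part of the repo):

```python
def parse_picked_line(line):
	# Split a "name|label" line on the first "|".
	# An empty label means "drop this name" (mirroring the
	# `del nameToVariants[name]` branch in the script above).
	name, _, label = line.rstrip("\n").partition("|")
	return name, (label or None)

assert parse_picked_line("pan|Pan (genus)\n") == ("pan", "Pan (genus)")
assert parse_picked_line("ambiguous|\n") == ("ambiguous", None)
```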
diff --git a/backend/tolData/genEnwikiDescData.py b/backend/tolData/genEnwikiDescData.py
new file mode 100755
index 0000000..d3f93ed
--- /dev/null
+++ b/backend/tolData/genEnwikiDescData.py
@@ -0,0 +1,102 @@
+#!/usr/bin/python3
+
+import sys, re, os
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads a database containing Wikipedia data, tries to associate wiki
+pages with nodes in the database, and adds descriptions for nodes
+that don't have them.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+enwikiDb = "enwiki/descData.db"
+dbFile = "data.db"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedEnwikiLabels.txt"
+# Got about 25k descriptions when testing
+
+print("Opening databases")
+enwikiCon = sqlite3.connect(enwikiDb)
+enwikiCur = enwikiCon.cursor()
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+print("Checking for names to skip")
+namesToSkip = set()
+if os.path.exists(namesToSkipFile):
+ with open(namesToSkipFile) as file:
+ for line in file:
+ namesToSkip.add(line.rstrip())
+print(f"Found {len(namesToSkip)}")
+print("Checking for picked-titles")
+nameToPickedTitle = {}
+if os.path.exists(pickedLabelsFile):
+ with open(pickedLabelsFile) as file:
+ for line in file:
+ (name, _, title) = line.rstrip().partition("|")
+ nameToPickedTitle[name.lower()] = title
+print(f"Found {len(nameToPickedTitle)}")
+
+print("Getting names of nodes without descriptions")
+nodeNames = set()
+query = "SELECT nodes.name FROM nodes LEFT JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id IS NULL"
+for (name,) in dbCur.execute(query):
+ nodeNames.add(name)
+print(f"Found {len(nodeNames)}")
+nodeNames.difference_update(namesToSkip)
+
+print("Associating nodes with page IDs")
+nodeToPageId = {}
+iterNum = 0
+for name in nodeNames:
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ if name not in nameToPickedTitle:
+ row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ? COLLATE NOCASE", (name,)).fetchone()
+ if row != None:
+ nodeToPageId[name] = row[0]
+ else:
+ title = nameToPickedTitle[name]
+ row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ?", (title,)).fetchone()
+ if row != None:
+ nodeToPageId[name] = row[0]
+ else:
+			print(f"WARNING: Picked title \"{title}\" not found", file=sys.stderr)
+
+print("Resolving redirects")
+redirectingNames = set()
+iterNum = 0
+for (name, pageId) in nodeToPageId.items():
+ iterNum += 1
+ if iterNum % 1e3 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ query = "SELECT pages.id FROM redirects INNER JOIN pages ON redirects.target = pages.title WHERE redirects.id = ?"
+ row = enwikiCur.execute(query, (pageId,)).fetchone()
+ if row != None:
+ nodeToPageId[name] = row[0]
+ redirectingNames.add(name)
+
+print("Adding description data")
+iterNum = 0
+for (name, pageId) in nodeToPageId.items():
+ iterNum += 1
+ if iterNum % 1e3 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ row = enwikiCur.execute("SELECT desc FROM descs where descs.id = ?", (pageId,)).fetchone()
+ if row != None:
+ dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, pageId, 1 if name in redirectingNames else 0))
+ dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (pageId, row[0], 0))
+
+print("Closing databases")
+dbCon.commit()
+dbCon.close()
+enwikiCon.close()
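The redirect step in genEnwikiDescData.py relies on a single join from redirects to pages. A self-contained sketch against an in-memory stand-in for enwiki/descData.db (same table and column names as the script uses; the rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE pages (id INT PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT)")
cur.executemany("INSERT INTO pages VALUES (?, ?)", [(1, "Puma"), (2, "Cougar")])
cur.execute("INSERT INTO redirects VALUES (1, 'Cougar')")  # page 1 redirects to 'Cougar'

# Same join shape as the script: map a redirecting page's ID to its target's ID
query = ("SELECT pages.id FROM redirects"
	" INNER JOIN pages ON redirects.target = pages.title"
	" WHERE redirects.id = ?")
row = cur.execute(query, (1,)).fetchone()
assert row == (2,)
```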
diff --git a/backend/tolData/genEnwikiNameData.py b/backend/tolData/genEnwikiNameData.py
new file mode 100755
index 0000000..7ad61d1
--- /dev/null
+++ b/backend/tolData/genEnwikiNameData.py
@@ -0,0 +1,76 @@
+#!/usr/bin/python3
+
+import sys, re
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads from a database containing Wikipedia data, along with node
+and wiki-ID information from the main database, and uses Wikipedia
+page-redirect information to add additional alt-name data.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+enwikiDb = "enwiki/descData.db"
+dbFile = "data.db"
+altNameRegex = re.compile(r"[a-zA-Z]+")
+	# Avoids names like 'Evolution of Elephants', 'Banana fiber', and 'Fish (zoology)'
+
+print("Opening databases")
+enwikiCon = sqlite3.connect(enwikiDb)
+enwikiCur = enwikiCon.cursor()
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+print("Getting nodes with wiki IDs")
+nodeToWikiId = {}
+for (nodeName, wikiId) in dbCur.execute("SELECT name, id from wiki_ids"):
+ nodeToWikiId[nodeName] = wikiId
+print(f"Found {len(nodeToWikiId)}")
+
+print("Iterating through nodes, finding names that redirect to them")
+nodeToAltNames = {}
+numAltNames = 0
+iterNum = 0
+for (nodeName, wikiId) in nodeToWikiId.items():
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ nodeToAltNames[nodeName] = set()
+ query = "SELECT p1.title FROM pages p1" \
+ " INNER JOIN redirects r1 ON p1.id = r1.id" \
+ " INNER JOIN pages p2 ON r1.target = p2.title WHERE p2.id = ?"
+ for (name,) in enwikiCur.execute(query, (wikiId,)):
+ if altNameRegex.fullmatch(name) != None and name.lower() != nodeName:
+ nodeToAltNames[nodeName].add(name.lower())
+ numAltNames += 1
+print(f"Found {numAltNames} alt-names")
+
+print("Excluding existing alt-names from the set")
+query = "SELECT alt_name FROM names WHERE alt_name IN ({})"
+iterNum = 0
+for (nodeName, altNames) in nodeToAltNames.items():
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ existingNames = set()
+ for (name,) in dbCur.execute(query.format(",".join(["?"] * len(altNames))), list(altNames)):
+ existingNames.add(name)
+ numAltNames -= len(existingNames)
+ altNames.difference_update(existingNames)
+print(f"Left with {numAltNames} alt-names")
+
+print("Adding alt-names to database")
+for (nodeName, altNames) in nodeToAltNames.items():
+ for altName in altNames:
+ dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'enwiki')", (nodeName, altName, 0))
+
+print("Closing databases")
+dbCon.commit()
+dbCon.close()
+enwikiCon.close()
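The "existing alt-names" check above builds its IN clause with one `?` placeholder per value, which keeps the query parameterized for arbitrarily many names. A toy stand-in for the names table (note that SQLite, unlike most engines, also accepts an empty `IN ()` list, so the zero-name case is safe):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE names (alt_name TEXT)")
cur.executemany("INSERT INTO names VALUES (?)", [("puma",), ("cougar",)])

altNames = {"puma", "mountain lion"}
# One placeholder per value, as in the script above
query = "SELECT alt_name FROM names WHERE alt_name IN ({})".format(
	",".join(["?"] * len(altNames)))
existing = {n for (n,) in cur.execute(query, list(altNames))}
assert existing == {"puma"}
```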
diff --git a/backend/tolData/genEolNameData.py b/backend/tolData/genEolNameData.py
new file mode 100755
index 0000000..dd33ee0
--- /dev/null
+++ b/backend/tolData/genEolNameData.py
@@ -0,0 +1,184 @@
+#!/usr/bin/python3
+
+import sys, re, os
+import html, csv, sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads files describing name data from the 'Encyclopedia of Life' site,
+tries to associate names with nodes in the database, and adds tables
+to represent associated names.
+
+Reads a vernacularNames.csv file:
+ Starts with a header line containing:
+ page_id, canonical_form, vernacular_string, language_code,
+ resource_name, is_preferred_by_resource, is_preferred_by_eol
+ The canonical_form and vernacular_string fields contain names
+ associated with the page ID. Names are not always unique to
+ particular page IDs.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+vnamesFile = "eol/vernacularNames.csv" # Had about 2.8e6 entries
+dbFile = "data.db"
+namesToSkip = {"unknown", "unknown species", "unidentified species"}
+pickedIdsFile = "pickedEolIds.txt"
+altsToSkipFile = "pickedEolAltsToSkip.txt"
+
+print("Reading in vernacular-names data")
+nameToPids = {} # 'pid' means 'Page ID'
+canonicalNameToPids = {}
+pidToNames = {}
+pidToPreferred = {} # Maps pids to 'preferred' names
+def updateMaps(name, pid, canonical, preferredAlt):
+ global namesToSkip, nameToPids, canonicalNameToPids, pidToNames, pidToPreferred
+ if name in namesToSkip:
+ return
+ if name not in nameToPids:
+ nameToPids[name] = {pid}
+ else:
+ nameToPids[name].add(pid)
+ if canonical:
+ if name not in canonicalNameToPids:
+ canonicalNameToPids[name] = {pid}
+ else:
+ canonicalNameToPids[name].add(pid)
+ if pid not in pidToNames:
+ pidToNames[pid] = {name}
+ else:
+ pidToNames[pid].add(name)
+ if preferredAlt:
+ pidToPreferred[pid] = name
+with open(vnamesFile, newline="") as csvfile:
+ reader = csv.reader(csvfile)
+ lineNum = 0
+ for row in reader:
+ lineNum += 1
+ if lineNum % 1e5 == 0:
+ print(f"At line {lineNum}")
+ # Skip header line
+ if lineNum == 1:
+ continue
+ # Parse line
+ pid = int(row[0])
+ name1 = re.sub(r"<[^>]+>", "", row[1].lower()) # Remove tags
+ name2 = html.unescape(row[2]).lower()
+ lang = row[3]
+ preferred = row[6] == "preferred"
+ # Add to maps
+ updateMaps(name1, pid, True, False)
+ if lang == "eng" and name2 != "":
+ updateMaps(name2, pid, False, preferred)
+
+print("Checking for manually-picked pids")
+nameToPickedPid = {}
+if os.path.exists(pickedIdsFile):
+ with open(pickedIdsFile) as file:
+ for line in file:
+ (name, _, eolId) = line.rstrip().partition("|")
+ nameToPickedPid[name] = None if eolId == "" else int(eolId)
+print(f"Found {len(nameToPickedPid)}")
+
+print("Checking for alt-names to skip")
+nameToAltsToSkip = {}
+numToSkip = 0
+if os.path.exists(altsToSkipFile):
+ with open(altsToSkipFile) as file:
+ for line in file:
+ (name, _, altName) = line.rstrip().partition("|")
+ if name not in nameToAltsToSkip:
+ nameToAltsToSkip[name] = [altName]
+ else:
+ nameToAltsToSkip[name].append(altName)
+ numToSkip += 1
+print(f"Found {numToSkip} alt-names to skip")
+
+print("Creating database tables")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE names(name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name))")
+dbCur.execute("CREATE INDEX names_idx ON names(name)")
+dbCur.execute("CREATE INDEX names_alt_idx ON names(alt_name)")
+dbCur.execute("CREATE INDEX names_alt_idx_nc ON names(alt_name COLLATE NOCASE)")
+dbCur.execute("CREATE TABLE eol_ids(id INT PRIMARY KEY, name TEXT)")
+dbCur.execute("CREATE INDEX eol_name_idx ON eol_ids(name)")
+
+print("Associating nodes with names")
+usedPids = set()
+unresolvedNodeNames = set()
+dbCur2 = dbCon.cursor()
+def addToDb(nodeName, pidToUse):
+ " Adds page-ID-associated name data to a node in the database "
+ global dbCur, pidToPreferred
+ dbCur.execute("INSERT INTO eol_ids VALUES (?, ?)", (pidToUse, nodeName))
+ # Get alt-names
+ altNames = set()
+ for n in pidToNames[pidToUse]:
+ # Avoid alt-names with >3 words
+ if len(n.split(" ")) > 3:
+ continue
+ # Avoid alt-names that already name a node in the database
+ if dbCur.execute("SELECT name FROM nodes WHERE name = ?", (n,)).fetchone() != None:
+ continue
+ # Check for picked alt-name-to-skip
+ if nodeName in nameToAltsToSkip and n in nameToAltsToSkip[nodeName]:
+ print(f"Excluding alt-name {n} for node {nodeName}")
+ continue
+ #
+ altNames.add(n)
+ # Add alt-names to db
+ preferredName = pidToPreferred[pidToUse] if (pidToUse in pidToPreferred) else None
+ for n in altNames:
+ isPreferred = 1 if (n == preferredName) else 0
+ dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'eol')", (nodeName, n, isPreferred))
+print("Adding picked IDs")
+for (name, pid) in nameToPickedPid.items():
+ if pid != None:
+ addToDb(name, pid)
+ usedPids.add(pid)
+print("Associating nodes with canonical names")
+iterNum = 0
+for (nodeName,) in dbCur2.execute("SELECT name FROM nodes"):
+ iterNum += 1
+ if iterNum % 1e5 == 0:
+ print(f"At iteration {iterNum}")
+ if nodeName in nameToPickedPid:
+ continue
+ # Check for matching canonical name
+ if nodeName in canonicalNameToPids:
+ pidToUse = None
+ # Pick an associated page ID
+ for pid in canonicalNameToPids[nodeName]:
+ hasLowerPrio = pid not in pidToPreferred and pidToUse in pidToPreferred
+ hasHigherPrio = pid in pidToPreferred and pidToUse not in pidToPreferred
+ if hasLowerPrio:
+ continue
+ if pid not in usedPids and (pidToUse == None or pid < pidToUse or hasHigherPrio):
+ pidToUse = pid
+ if pidToUse != None:
+ addToDb(nodeName, pidToUse)
+ usedPids.add(pidToUse)
+ elif nodeName in nameToPids:
+ unresolvedNodeNames.add(nodeName)
+print("Associating leftover nodes with other names")
+iterNum = 0
+for nodeName in unresolvedNodeNames:
+ iterNum += 1
+ if iterNum % 100 == 0:
+ print(f"At iteration {iterNum}")
+ # Check for matching name
+ pidToUse = None
+ for pid in nameToPids[nodeName]:
+ # Pick an associated page ID
+ if pid not in usedPids and (pidToUse == None or pid < pidToUse):
+ pidToUse = pid
+ if pidToUse != None:
+ addToDb(nodeName, pidToUse)
+ usedPids.add(pidToUse)
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
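The CSV-reading loop above cleans names in two ways: stripping markup tags from canonical forms and decoding HTML entities in vernacular strings. Combined into one hypothetical helper for illustration (the script itself applies the two steps to different columns):

```python
import html, re

def normalizeEolName(raw):
	# Strip markup tags, decode HTML entities, and lowercase
	return html.unescape(re.sub(r"<[^>]+>", "", raw)).lower()

assert normalizeEolName("<i>Homo sapiens</i>") == "homo sapiens"
assert normalizeEolName("Darwin&#39;s finch") == "darwin's finch"
```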
diff --git a/backend/tolData/genImgs.py b/backend/tolData/genImgs.py
new file mode 100755
index 0000000..ecca8e0
--- /dev/null
+++ b/backend/tolData/genImgs.py
@@ -0,0 +1,191 @@
+#!/usr/bin/python3
+
+import sys, os, subprocess
+import sqlite3, urllib.parse
+import signal
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads node IDs and image paths from a file, and possibly from a directory,
+and generates cropped/resized versions of those images into a directory,
+with names of the form 'nodeId1.jpg'. Also adds image metadata to the
+database.
+
+SIGINT can be used to stop, and the program can be re-run to continue
+processing. It uses already-existing database entries to decide what
+to skip.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+imgListFile = "imgList.txt"
+outDir = "img/"
+eolImgDb = "eol/imagesList.db"
+enwikiImgDb = "enwiki/imgData.db"
+pickedImgsDir = "pickedImgs/"
+pickedImgsFilename = "imgData.txt"
+dbFile = "data.db"
+IMG_OUT_SZ = 200
+genImgFiles = True # Usable for debugging
+
+if not os.path.exists(outDir):
+ os.mkdir(outDir)
+
+print("Opening databases")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+eolCon = sqlite3.connect(eolImgDb)
+eolCur = eolCon.cursor()
+enwikiCon = sqlite3.connect(enwikiImgDb)
+enwikiCur = enwikiCon.cursor()
+print("Checking for picked-images")
+nodeToPickedImg = {}
+if os.path.exists(pickedImgsDir + pickedImgsFilename):
+ lineNum = 0
+ with open(pickedImgsDir + pickedImgsFilename) as file:
+ for line in file:
+ lineNum += 1
+ (filename, url, license, artist, credit) = line.rstrip().split("|")
+ nodeName = os.path.splitext(filename)[0] # Remove extension
+ (otolId,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (nodeName,)).fetchone()
+ nodeToPickedImg[otolId] = {
+ "nodeName": nodeName, "id": lineNum,
+ "filename": filename, "url": url, "license": license, "artist": artist, "credit": credit,
+ }
+
+print("Checking for image tables")
+nodesDone = set()
+imgsDone = set()
+if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='node_imgs'").fetchone() == None:
+ # Add image tables if not present
+ dbCur.execute("CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT)")
+ dbCur.execute("CREATE TABLE images" \
+ " (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src))")
+else:
+ # Get existing image-associated nodes
+ for (otolId,) in dbCur.execute("SELECT nodes.id FROM node_imgs INNER JOIN nodes ON node_imgs.name = nodes.name"):
+ nodesDone.add(otolId)
+ # Get existing node-associated images
+ for (imgId, imgSrc) in dbCur.execute("SELECT id, src from images"):
+ imgsDone.add((imgId, imgSrc))
+ print(f"Found {len(nodesDone)} nodes and {len(imgsDone)} images to skip")
+
+# Set SIGINT handler
+interrupted = False
+def onSigint(sig, frame):
+ global interrupted
+ interrupted = True
+signal.signal(signal.SIGINT, onSigint)
+
+print("Iterating through input images")
+def quit():
+ print("Closing databases")
+ dbCon.commit()
+ dbCon.close()
+ eolCon.close()
+ enwikiCon.close()
+ sys.exit(0)
+def convertImage(imgPath, outPath):
+ print(f"Converting {imgPath} to {outPath}")
+ if os.path.exists(outPath):
+ print(f"ERROR: Output image already exists")
+ return False
+ try:
+ completedProcess = subprocess.run(
+ ['npx', 'smartcrop-cli', '--width', str(IMG_OUT_SZ), '--height', str(IMG_OUT_SZ), imgPath, outPath],
+ stdout=subprocess.DEVNULL
+ )
+ except Exception as e:
+ print(f"ERROR: Exception while attempting to run smartcrop: {e}")
+ return False
+ if completedProcess.returncode != 0:
+ print(f"ERROR: smartcrop had exit status {completedProcess.returncode}")
+ return False
+ return True
+print("Processing picked-images")
+for (otolId, imgData) in nodeToPickedImg.items():
+ # Check for SIGINT event
+ if interrupted:
+ print("Exiting")
+ quit()
+ # Skip if already processed
+ if otolId in nodesDone:
+ continue
+ # Convert image
+ if genImgFiles:
+ success = convertImage(pickedImgsDir + imgData["filename"], outDir + otolId + ".jpg")
+ if not success:
+ quit()
+ else:
+ print(f"Processing {imgData['nodeName']}: {otolId}.jpg")
+ # Add entry to db
+ if (imgData["id"], "picked") not in imgsDone:
+ dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
+ (imgData["id"], "picked", imgData["url"], imgData["license"], imgData["artist"], imgData["credit"]))
+ imgsDone.add((imgData["id"], "picked"))
+ dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (imgData["nodeName"], imgData["id"], "picked"))
+ nodesDone.add(otolId)
+print("Processing images from eol and enwiki")
+iterNum = 0
+with open(imgListFile) as file:
+ for line in file:
+ iterNum += 1
+ # Check for SIGINT event
+ if interrupted:
+ print("Exiting")
+ break
+ # Skip lines without an image path
+ if line.find(" ") == -1:
+ continue
+ # Get filenames
+ (otolId, _, imgPath) = line.rstrip().partition(" ")
+ # Skip if already processed
+ if otolId in nodesDone:
+ continue
+ # Convert image
+ if genImgFiles:
+ success = convertImage(imgPath, outDir + otolId + ".jpg")
+ if not success:
+ break
+ else:
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ # Add entry to db
+ (nodeName,) = dbCur.execute("SELECT name FROM nodes WHERE id = ?", (otolId,)).fetchone()
+ fromEol = imgPath.startswith("eol/")
+ imgName = os.path.basename(os.path.normpath(imgPath)) # Get last path component
+ imgName = os.path.splitext(imgName)[0] # Remove extension
+ if fromEol:
+ eolId, _, contentId = imgName.partition(" ")
+ eolId, contentId = (int(eolId), int(contentId))
+ if (eolId, "eol") not in imgsDone:
+ query = "SELECT source_url, license, copyright_owner FROM images WHERE content_id = ?"
+ row = eolCur.execute(query, (contentId,)).fetchone()
+ if row == None:
+ print(f"ERROR: No image record for EOL ID {eolId}, content ID {contentId}")
+ break
+ (url, license, owner) = row
+ dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
+ (eolId, "eol", url, license, owner, ""))
+ imgsDone.add((eolId, "eol"))
+ dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, eolId, "eol"))
+ else:
+ enwikiId = int(imgName)
+ if (enwikiId, "enwiki") not in imgsDone:
+ query = "SELECT name, license, artist, credit FROM" \
+ " page_imgs INNER JOIN imgs ON page_imgs.img_name = imgs.name" \
+ " WHERE page_imgs.page_id = ?"
+ row = enwikiCur.execute(query, (enwikiId,)).fetchone()
+ if row == None:
+ print(f"ERROR: No image record for enwiki ID {enwikiId}")
+ break
+ (name, license, artist, credit) = row
+ url = "https://en.wikipedia.org/wiki/File:" + urllib.parse.quote(name)
+ dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
+ (enwikiId, "enwiki", url, license, artist, credit))
+ imgsDone.add((enwikiId, "enwiki"))
+ dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, enwikiId, "enwiki"))
+# Close dbs
+quit()
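The interrupt handling in genImgs.py follows a common pattern: SIGINT only sets a flag, and the main loop checks the flag between items so the database is never left mid-write. The pattern in isolation (the loop body here is a placeholder):

```python
import signal

interrupted = False
def onSigint(sig, frame):
	# Record the interrupt; the loop exits at the next safe point
	global interrupted
	interrupted = True
signal.signal(signal.SIGINT, onSigint)

processed = []
for item in range(5):
	if interrupted:
		break  # commit/close would happen here
	processed.append(item)
assert processed == [0, 1, 2, 3, 4]  # no SIGINT delivered in this run
```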
diff --git a/backend/tolData/genLinkedImgs.py b/backend/tolData/genLinkedImgs.py
new file mode 100755
index 0000000..a8e1322
--- /dev/null
+++ b/backend/tolData/genLinkedImgs.py
@@ -0,0 +1,125 @@
+#!/usr/bin/python3
+
+import sys, re
+import sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Looks for nodes without images in the database, and tries to
+associate them with images from their children.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+dbFile = "data.db"
+compoundNameRegex = re.compile(r"\[(.+) \+ (.+)]")
+upPropagateCompoundImgs = False
+
+print("Opening databases")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT)")
+
+print("Getting nodes with images")
+resolvedNodes = {} # Will map node names to otol IDs with a usable image
+query = "SELECT nodes.name, nodes.id FROM nodes INNER JOIN node_imgs ON nodes.name = node_imgs.name"
+for (name, otolId) in dbCur.execute(query):
+ resolvedNodes[name] = otolId
+print(f"Found {len(resolvedNodes)}")
+
+print("Iterating through nodes, trying to resolve images for ancestors")
+nodesToResolve = {} # Maps a node name to a list of objects that represent possible child images
+processedNodes = {} # Map a node name to an OTOL ID, representing a child node whose image is to be used
+parentToChosenTips = {} # used to prefer images from children with more tips
+iterNum = 0
+while len(resolvedNodes) > 0:
+ iterNum += 1
+ if iterNum % 1e3 == 0:
+ print(f"At iteration {iterNum}")
+ # Get next node
+ (nodeName, otolId) = resolvedNodes.popitem()
+ processedNodes[nodeName] = otolId
+ # Traverse upwards, resolving ancestors if able
+ while True:
+ # Get parent
+ row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone()
+ if row == None or row[0] in processedNodes or row[0] in resolvedNodes:
+ break
+ parent = row[0]
+ # Get parent data
+ if parent not in nodesToResolve:
+ childNames = [row[0] for row in dbCur.execute("SELECT child FROM edges WHERE parent = ?", (parent,))]
+ query = "SELECT name, tips FROM nodes WHERE name IN ({})".format(",".join(["?"] * len(childNames)))
+ childObjs = [{"name": row[0], "tips": row[1], "otolId": None} for row in dbCur.execute(query, childNames)]
+ childObjs.sort(key=lambda x: x["tips"], reverse=True)
+ nodesToResolve[parent] = childObjs
+ else:
+ childObjs = nodesToResolve[parent]
+ # Check if highest-tips child
+ if (childObjs[0]["name"] == nodeName):
+ # Resolve parent, and continue from it
+ dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (parent, otolId))
+ del nodesToResolve[parent]
+ processedNodes[parent] = otolId
+ parentToChosenTips[parent] = childObjs[0]["tips"]
+ nodeName = parent
+ continue
+ else:
+ # Mark child as a potential choice
+ childObj = next(c for c in childObjs if c["name"] == nodeName)
+ childObj["otolId"] = otolId
+ break
+ # When out of resolved nodes, resolve nodesToResolve nodes, possibly adding more nodes to resolve
+ if len(resolvedNodes) == 0:
+ for (name, childObjs) in nodesToResolve.items():
+ childObj = next(c for c in childObjs if c["otolId"] != None)
+ resolvedNodes[name] = childObj["otolId"]
+ parentToChosenTips[name] = childObj["tips"]
+ dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (name, childObj["otolId"]))
+ nodesToResolve.clear()
+
+print("Replacing linked-images for compound nodes")
+iterNum = 0
+for nodeName in processedNodes.keys():
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ match = compoundNameRegex.fullmatch(nodeName)
+ if match != None:
+ # Replace associated image with subname images
+ (subName1, subName2) = match.group(1,2)
+ otolIdPair = ["", ""]
+ if subName1 in processedNodes:
+ otolIdPair[0] = processedNodes[subName1]
+ if subName2 in processedNodes:
+ otolIdPair[1] = processedNodes[subName2]
+ # Use no image if both subimages not found
+ if otolIdPair[0] == "" and otolIdPair[1] == "":
+ dbCur.execute("DELETE FROM linked_imgs WHERE name = ?", (nodeName,))
+ continue
+ # Add to db
+ dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?",
+ (otolIdPair[0] + "," + otolIdPair[1], nodeName))
+ # Possibly repeat operation upon parent/ancestors
+ if upPropagateCompoundImgs:
+ while True:
+ # Get parent
+ row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone()
+ if row != None:
+ parent = row[0]
+ # Check num tips
+ (numTips,) = dbCur.execute("SELECT tips from nodes WHERE name = ?", (nodeName,)).fetchone()
+ if parent in parentToChosenTips and parentToChosenTips[parent] <= numTips:
+ # Replace associated image
+ dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?",
+ (otolIdPair[0] + "," + otolIdPair[1], parent))
+ nodeName = parent
+ continue
+ break
+
+print("Closing databases")
+dbCon.commit()
+dbCon.close()
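The linking rule implemented above (an imageless node borrows the image of the child with the most tips, recursively) can be shown on a toy in-memory tree; the names and IDs below are invented:

```python
children = {"root": ["a", "b"]}        # parent -> child names
tips = {"a": 3, "b": 5, "root": 8}     # leaf-descendant counts
childImg = {"a": "ott1", "b": "ott2"}  # nodes that have their own image

def linkedImg(node):
	# Prefer the node's own image; otherwise recurse into the
	# child with the most tips, as genLinkedImgs.py does
	if node in childImg:
		return childImg[node]
	best = max(children[node], key=lambda c: tips[c])
	return linkedImg(best)

assert linkedImg("root") == "ott2"  # "b" outweighs "a" (5 tips vs 3)
```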
diff --git a/backend/tolData/genOtolData.py b/backend/tolData/genOtolData.py
new file mode 100755
index 0000000..b5e0055
--- /dev/null
+++ b/backend/tolData/genOtolData.py
@@ -0,0 +1,250 @@
+#!/usr/bin/python3
+
+import sys, re, os
+import json, sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Reads files describing a tree-of-life from an 'Open Tree of Life' release,
+and stores tree information in a database.
+
+Reads a labelled_supertree_ottnames.tre file, which is assumed to have this format:
+ The tree-of-life is represented in Newick format, which looks like: (n1,n2,(n3,n4)n5)n6
+ The root node is named n6, and has children n1, n2, and n5.
+	Name examples include: Homo_sapiens_ott770315, mrcaott6ott22687, and 'Oxalis san-miguelii ott5748753'.
+ 'ott770315' and 'mrcaott6ott22687' are node IDs. The latter is for a 'compound node'.
+ The node with ID 'ott770315' will get the name 'homo sapiens'.
+	A compound node will get a name composed from its sub-nodes (eg: [name1 + name2]).
+ It is possible for multiple nodes to have the same name.
+ In these cases, extra nodes will be named sequentially, as 'name1 [2]', 'name1 [3]', etc.
+Reads an annotations.json file, which is assumed to have this format:
+ Holds a JSON object, whose 'nodes' property maps node IDs to objects holding information about that node,
+ such as the properties 'supported_by' and 'conflicts_with', which list phylogenetic trees that
+ support/conflict with the node's placement.
+Reads from a picked-names file, if present, which specifies name and node ID pairs.
+ These help resolve cases where multiple nodes share the same name.
+"""
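The name forms described above (underscore names, quoted names with '' as an escaped quote, and mrca compound IDs) can be split with the same regexes parseNewickName uses; `splitOtolName` below is a hypothetical standalone version for illustration:

```python
import re

def splitOtolName(token):
	# Convert a raw Newick label into a (name, id) pair
	token = token.lower()
	if token.startswith("mrca"):
		return (token, token)  # a compound-node ID doubles as its name
	m = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", token)
	if m:  # quoted form, with '' as an escaped quote
		return (m.group(1).replace("''", "'"), m.group(2))
	m = re.fullmatch(r"([^\\\"]+)_(ott\d+)", token)
	return (m.group(1).replace("_", " "), m.group(2))

assert splitOtolName("Homo_sapiens_ott770315") == ("homo sapiens", "ott770315")
assert splitOtolName("mrcaott6ott22687") == ("mrcaott6ott22687", "mrcaott6ott22687")
assert splitOtolName("'Oxalis san-miguelii ott5748753'") == ("oxalis san-miguelii", "ott5748753")
```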
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+treeFile = "otol/labelled_supertree_ottnames.tre" # Had about 2.5e6 nodes
+annFile = "otol/annotations.json"
+dbFile = "data.db"
+nodeMap = {} # Maps node IDs to node objects
+nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs)
+dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs
+pickedNamesFile = "pickedOtolNames.txt"
+
+class Node:
+ " Represents a tree-of-life node "
+ def __init__(self, name, childIds, parentId, tips, pSupport):
+ self.name = name
+ self.childIds = childIds
+ self.parentId = parentId
+ self.tips = tips
+ self.pSupport = pSupport
+
+print("Parsing tree file")
+# Read file
+data = None
+with open(treeFile) as file:
+ data = file.read()
+dataIdx = 0
+# Parse content
+iterNum = 0
+def parseNewick():
+ " Parses a node using 'data' and 'dataIdx', updates nodeMap accordingly, and returns the node's ID "
+ global data, dataIdx, iterNum
+ iterNum += 1
+ if iterNum % 1e5 == 0:
+ print(f"At iteration {iterNum}")
+ # Check for EOF
+ if dataIdx == len(data):
+ raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
+ # Check for node
+ if data[dataIdx] == "(": # parse inner node
+ dataIdx += 1
+ childIds = []
+ while True:
+ # Read child
+ childId = parseNewick()
+ childIds.append(childId)
+ if (dataIdx == len(data)):
+ raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
+ # Check for next child
+ if (data[dataIdx] == ","):
+ dataIdx += 1
+ continue
+ else:
+ # Get node name and id
+ dataIdx += 1 # Consume an expected ')'
+ name, id = parseNewickName()
+ updateNameMaps(name, id)
+ # Get child num-tips total
+ tips = 0
+ for childId in childIds:
+ tips += nodeMap[childId].tips
+ # Add node to nodeMap
+ nodeMap[id] = Node(name, childIds, None, tips, False)
+ # Update childrens' parent reference
+ for childId in childIds:
+ nodeMap[childId].parentId = id
+ return id
+ else: # Parse node name
+ name, id = parseNewickName()
+ updateNameMaps(name, id)
+ nodeMap[id] = Node(name, [], None, 1, False)
+ return id
+def parseNewickName():
+ " Parses a node name using 'data' and 'dataIdx', and returns a (name, id) pair "
+ global data, dataIdx
+ name = None
+ end = dataIdx
+ # Get name
+ if (end < len(data) and data[end] == "'"): # Check for quoted name
+ end += 1
+ inQuote = True
+ while end < len(data):
+ if (data[end] == "'"):
+ if end + 1 < len(data) and data[end + 1] == "'": # Account for '' as escaped-quote
+ end += 2
+ continue
+ else:
+ end += 1
+ inQuote = False
+ break
+ end += 1
+ if inQuote:
+ raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
+ name = data[dataIdx:end]
+ dataIdx = end
+ else:
+ while end < len(data) and not re.match(r"[(),]", data[end]):
+ end += 1
+ if (end == dataIdx):
+ raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
+ name = data[dataIdx:end].rstrip()
+ if end == len(data): # Ignore trailing input semicolon
+ name = name[:-1]
+ dataIdx = end
+ # Convert to (name, id)
+ name = name.lower()
+ if name.startswith("mrca"):
+ return (name, name)
+ elif name[0] == "'":
+ match = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", name)
+ if match == None:
+ raise Exception(f"ERROR: invalid name \"{name}\"")
+ name = match.group(1).replace("''", "'")
+ return (name, match.group(2))
+ else:
+ match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", name)
+ if match == None:
+ raise Exception(f"ERROR: invalid name \"{name}\"")
+ return (match.group(1).replace("_", " "), match.group(2))
+def updateNameMaps(name, id):
+ global nameToFirstId, dupNameToIds
+ if name not in nameToFirstId:
+ nameToFirstId[name] = id
+ else:
+ if name not in dupNameToIds:
+ dupNameToIds[name] = [nameToFirstId[name], id]
+ else:
+ dupNameToIds[name].append(id)
+rootId = parseNewick()
+
+print("Resolving duplicate names")
+# Read picked-names file
+nameToPickedId = {}
+if os.path.exists(pickedNamesFile):
+ with open(pickedNamesFile) as file:
+ for line in file:
+ (name, _, otolId) = line.rstrip().partition("|")
+ nameToPickedId[name] = otolId
+# Resolve duplicates
+for (dupName, ids) in dupNameToIds.items():
+ # Check for picked id
+ if dupName in nameToPickedId:
+ idToUse = nameToPickedId[dupName]
+ else:
+ # Get conflicting node with most tips
+ tipNums = [nodeMap[id].tips for id in ids]
+ maxIdx = tipNums.index(max(tipNums))
+ idToUse = ids[maxIdx]
+ # Adjust name of other conflicting nodes
+ counter = 2
+ for id in ids:
+ if id != idToUse:
+ nodeMap[id].name += f" [{counter}]"
+ counter += 1
+
+print("Changing mrca* names")
+def convertMrcaName(id):
+ node = nodeMap[id]
+ name = node.name
+ childIds = node.childIds
+ if len(childIds) < 2:
+ print(f"WARNING: MRCA node \"{name}\" has fewer than 2 children")
+ return
+ # Get 2 children with most tips
+ childTips = [nodeMap[id].tips for id in childIds]
+ maxIdx1 = childTips.index(max(childTips))
+ childTips[maxIdx1] = 0
+ maxIdx2 = childTips.index(max(childTips))
+ childId1 = childIds[maxIdx1]
+ childId2 = childIds[maxIdx2]
+ childName1 = nodeMap[childId1].name
+ childName2 = nodeMap[childId2].name
+ # Check for mrca* child names
+ if childName1.startswith("mrca"):
+ childName1 = convertMrcaName(childId1)
+ if childName2.startswith("mrca"):
+ childName2 = convertMrcaName(childId2)
+ # Check for composite names
+ match = re.fullmatch(r"\[(.+) \+ (.+)]", childName1)
+ if match != None:
+ childName1 = match.group(1)
+ match = re.fullmatch(r"\[(.+) \+ (.+)]", childName2)
+ if match != None:
+ childName2 = match.group(1)
+ # Create composite name
+ node.name = f"[{childName1} + {childName2}]"
+ return childName1
+for (id, node) in nodeMap.items():
+ if node.name.startswith("mrca"):
+ convertMrcaName(id)
+
+print("Parsing annotations file")
+# Read file
+data = None
+with open(annFile) as file:
+ data = file.read()
+obj = json.loads(data)
+nodeAnnsMap = obj["nodes"]
+# Find relevant annotations
+for (id, node) in nodeMap.items():
+ # Set has-support value using annotations
+ if id in nodeAnnsMap:
+ nodeAnns = nodeAnnsMap[id]
+ supportQty = len(nodeAnns["supported_by"]) if "supported_by" in nodeAnns else 0
+ conflictQty = len(nodeAnns["conflicts_with"]) if "conflicts_with" in nodeAnns else 0
+ node.pSupport = supportQty > 0 and conflictQty == 0
+
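The support rule applied above is small enough to check in isolation. A minimal sketch with made-up annotation entries (`has_phylo_support` is a hypothetical helper, not part of the script):

```python
def has_phylo_support(node_anns):
    """A node has 'phylogenetic support' if at least one source tree
    supports it and none conflict with it (mirrors the check above)."""
    supported = len(node_anns.get("supported_by", {}))
    conflicts = len(node_anns.get("conflicts_with", {}))
    return supported > 0 and conflicts == 0

# Hypothetical annotation entries
print(has_phylo_support({"supported_by": {"tree1": "n5"}}))  # True
print(has_phylo_support({"supported_by": {"tree1": "n5"},
                         "conflicts_with": {"tree2": "n9"}}))  # False
```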
+print("Creating nodes and edges tables")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+dbCur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
+dbCur.execute("CREATE INDEX nodes_idx_nc ON nodes(name COLLATE NOCASE)")
+dbCur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
+dbCur.execute("CREATE INDEX edges_child_idx ON edges(child)")
+for (otolId, node) in nodeMap.items():
+ dbCur.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node.name, otolId, node.tips))
+ for childId in node.childIds:
+ childNode = nodeMap[childId]
+ dbCur.execute("INSERT INTO edges VALUES (?, ?, ?)",
+ (node.name, childNode.name, 1 if childNode.pSupport else 0))
+print("Closing database")
+dbCon.commit()
+dbCon.close()
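The two OTT label forms handled by `parseNewickName` can be exercised standalone. A sketch with illustrative labels and a hypothetical `parse_ott_label` helper:

```python
import re

def parse_ott_label(label):
    """Split a raw Newick label into a (name, ott-id) pair, handling the
    quoted form 'Name ott123' (with '' as an escaped quote) and the
    underscore form Name_ott123, as parseNewickName does."""
    label = label.lower()
    if label.startswith("'"):
        match = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", label)
        return (match.group(1).replace("''", "'"), match.group(2))
    match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", label)
    return (match.group(1).replace("_", " "), match.group(2))

# Illustrative labels (the ids are made up)
print(parse_ott_label("Canis_lupus_ott247333"))    # ('canis lupus', 'ott247333')
print(parse_ott_label("'canis lupus ott247333'"))  # ('canis lupus', 'ott247333')
```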
diff --git a/backend/tolData/genReducedTrees.py b/backend/tolData/genReducedTrees.py
new file mode 100755
index 0000000..a921be4
--- /dev/null
+++ b/backend/tolData/genReducedTrees.py
@@ -0,0 +1,329 @@
+#!/usr/bin/python3
+
+import sys, os.path, re
+import json, sqlite3
+
+usageInfo = f"""
+Usage: {sys.argv[0]} [tree1]
+
+Creates reduced versions of the tree in the database:
+- A 'picked nodes' tree:
+ Created from a minimal set of node names read from a file,
+ possibly with some extra randomly-picked children.
+- An 'images only' tree:
+ Created by removing nodes without an image or presence in the
+ 'picked' tree.
+- A 'weakly trimmed' tree:
+ Created by removing nodes that lack an image, a description, and
+ presence in the 'picked' tree. Then, for nodes with 'many' children,
+ some additional children are removed, regardless of descriptions.
+
+If tree1 is specified as 'picked', 'images', or 'trimmed', only that
+tree is generated.
+"""
+if len(sys.argv) > 2 or len(sys.argv) == 2 and re.fullmatch(r"picked|images|trimmed", sys.argv[1]) == None:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+tree = sys.argv[1] if len(sys.argv) > 1 else None
+dbFile = "data.db"
+pickedNodesFile = "pickedNodes.txt"
+COMP_NAME_REGEX = re.compile(r"\[.+ \+ .+]") # Used to recognise composite nodes
+
+class Node:
+ def __init__(self, id, children, parent, tips, pSupport):
+ self.id = id
+ self.children = children
+ self.parent = parent
+ self.tips = tips
+ self.pSupport = pSupport
+
+print("Opening database")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+def genPickedNodeTree(dbCur, pickedNames, rootName):
+ global COMP_NAME_REGEX
+ PREF_NUM_CHILDREN = 3 # Include extra children up to this limit
+ nodeMap = {} # Maps node names to Nodes
+ print("Getting ancestors")
+ nodeMap = genNodeMap(dbCur, pickedNames, 100)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Removing composite nodes")
+ removedNames = removeCompositeNodes(nodeMap)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Removing 'collapsible' nodes")
+ temp = removeCollapsibleNodes(nodeMap, pickedNames)
+ removedNames.update(temp)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Adding some additional nearby children")
+ namesToAdd = []
+ iterNum = 0
+ for (name, node) in nodeMap.items():
+ iterNum += 1
+ if iterNum % 100 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ numChildren = len(node.children)
+ if numChildren < PREF_NUM_CHILDREN:
+ children = [row[0] for row in dbCur.execute("SELECT child FROM edges where parent = ?", (name,))]
+ newChildren = []
+ for n in children:
+ if n in nodeMap or n in removedNames:
+ continue
+ if COMP_NAME_REGEX.fullmatch(n) != None:
+ continue
+ if dbCur.execute("SELECT name from node_imgs WHERE name = ?", (n,)).fetchone() == None and \
+ dbCur.execute("SELECT name from linked_imgs WHERE name = ?", (n,)).fetchone() == None:
+ continue
+ newChildren.append(n)
+ newChildNames = newChildren[:(PREF_NUM_CHILDREN - numChildren)]
+ node.children.extend(newChildNames)
+ namesToAdd.extend(newChildNames)
+ for name in namesToAdd:
+ parent, pSupport = dbCur.execute("SELECT parent, p_support from edges WHERE child = ?", (name,)).fetchone()
+ (id,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (name,)).fetchone()
+ parent = None if parent == "" else parent
+ nodeMap[name] = Node(id, [], parent, 0, pSupport == 1)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Updating 'tips' values")
+ updateTips(rootName, nodeMap)
+ print("Creating table")
+ addTreeTables(nodeMap, dbCur, "p")
+def genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName):
+ print("Getting ancestors")
+ nodeMap = genNodeMap(dbCur, nodesWithImgOrPicked, 1e4)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Removing composite nodes")
+ removeCompositeNodes(nodeMap)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Removing 'collapsible' nodes")
+ removeCollapsibleNodes(nodeMap, {})
+ print(f"Result has {len(nodeMap)} nodes")
+ print(f"Updating 'tips' values") # Needed for next trimming step
+ updateTips(rootName, nodeMap)
+ print(f"Trimming from nodes with 'many' children")
+ trimIfManyChildren(nodeMap, rootName, 300, pickedNames)
+ print(f"Result has {len(nodeMap)} nodes")
+ print(f"Updating 'tips' values")
+ updateTips(rootName, nodeMap)
+ print("Creating table")
+ addTreeTables(nodeMap, dbCur, "i")
+def genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName):
+ print("Getting ancestors")
+ nodeMap = genNodeMap(dbCur, nodesWithImgDescOrPicked, 1e5)
+ print(f"Result has {len(nodeMap)} nodes")
+ print("Getting nodes to 'strongly keep'")
+ iterNum = 0
+ nodesFromImgOrPicked = set()
+ for name in nodesWithImgOrPicked:
+ iterNum += 1
+ if iterNum % 1e4 == 0:
+ print(f"At iteration {iterNum}")
+ #
+ while name != None:
+ if name not in nodesFromImgOrPicked:
+ nodesFromImgOrPicked.add(name)
+ name = nodeMap[name].parent
+ else:
+ break
+ print(f"Node set has {len(nodesFromImgOrPicked)} nodes")
+ print("Removing 'collapsible' nodes")
+ removeCollapsibleNodes(nodeMap, nodesWithImgDescOrPicked)
+ print(f"Result has {len(nodeMap)} nodes")
+ print(f"Updating 'tips' values") # Needed for next trimming step
+ updateTips(rootName, nodeMap)
+ print(f"Trimming from nodes with 'many' children")
+ trimIfManyChildren(nodeMap, rootName, 600, nodesFromImgOrPicked)
+ print(f"Result has {len(nodeMap)} nodes")
+ print(f"Updating 'tips' values")
+ updateTips(rootName, nodeMap)
+ print("Creating table")
+ addTreeTables(nodeMap, dbCur, "t")
+# Helper functions
+def genNodeMap(dbCur, nameSet, itersBeforePrint = 1):
+ " Returns a subtree that includes nodes in 'nameSet', as a name-to-Node map "
+ nodeMap = {}
+ iterNum = 0
+ for name in nameSet:
+ iterNum += 1
+ if iterNum % itersBeforePrint == 0:
+ print(f"At iteration {iterNum}")
+ #
+ prevName = None
+ while name != None:
+ if name not in nodeMap:
+ # Add node
+ (id, tips) = dbCur.execute("SELECT id, tips from nodes where name = ?", (name,)).fetchone()
+ row = dbCur.execute("SELECT parent, p_support from edges where child = ?", (name,)).fetchone()
+ parent = None if row == None or row[0] == "" else row[0]
+ pSupport = row == None or row[1] == 1
+ children = [] if prevName == None else [prevName]
+ nodeMap[name] = Node(id, children, parent, 0, pSupport)
+ # Iterate to parent
+ prevName = name
+ name = parent
+ else:
+ # Just add as child
+ if prevName != None:
+ nodeMap[name].children.append(prevName)
+ break
+ return nodeMap
+def removeCompositeNodes(nodeMap):
+ " Given a tree, removes composite-name nodes, and returns the removed nodes' names "
+ global COMP_NAME_REGEX
+ namesToRemove = set()
+ for (name, node) in nodeMap.items():
+ parent = node.parent
+ if parent != None and COMP_NAME_REGEX.fullmatch(name) != None:
+ # Connect children to parent
+ nodeMap[parent].children.remove(name)
+ nodeMap[parent].children.extend(node.children)
+ for n in node.children:
+ nodeMap[n].parent = parent
+ nodeMap[n].pSupport &= node.pSupport
+ # Remember for removal
+ namesToRemove.add(name)
+ for name in namesToRemove:
+ del nodeMap[name]
+ return namesToRemove
+def removeCollapsibleNodes(nodeMap, nodesToKeep = {}):
+ """ Given a tree, removes single-child parents, then only-childs,
+ with given exceptions, and returns the set of removed nodes' names """
+ namesToRemove = set()
+ # Remove single-child parents
+ for (name, node) in nodeMap.items():
+ if len(node.children) == 1 and node.parent != None and name not in nodesToKeep:
+ # Connect parent and children
+ parent = node.parent
+ child = node.children[0]
+ nodeMap[parent].children.remove(name)
+ nodeMap[parent].children.append(child)
+ nodeMap[child].parent = parent
+ nodeMap[child].pSupport &= node.pSupport
+ # Remember for removal
+ namesToRemove.add(name)
+ for name in namesToRemove:
+ del nodeMap[name]
+ # Remove only-childs (not redundant because 'nodesToKeep' can cause single-child parents to be kept)
+ namesToRemove.clear()
+ for (name, node) in nodeMap.items():
+ isOnlyChild = node.parent != None and len(nodeMap[node.parent].children) == 1
+ if isOnlyChild and name not in nodesToKeep:
+ # Connect parent and children
+ parent = node.parent
+ nodeMap[parent].children = node.children
+ for n in node.children:
+ nodeMap[n].parent = parent
+ nodeMap[n].pSupport &= node.pSupport
+ # Remember for removal
+ namesToRemove.add(name)
+ for name in namesToRemove:
+ del nodeMap[name]
+ #
+ return namesToRemove
+def trimIfManyChildren(nodeMap, rootName, childThreshold, nodesToKeep = {}):
+ namesToRemove = set()
+ def findTrimmables(nodeName):
+ nonlocal nodeMap, nodesToKeep
+ node = nodeMap[nodeName]
+ if len(node.children) > childThreshold:
+ numToTrim = len(node.children) - childThreshold
+ # Try removing nodes, preferring those with less tips
+ candidatesToTrim = [n for n in node.children if n not in nodesToKeep]
+ childToTips = {n: nodeMap[n].tips for n in candidatesToTrim}
+ candidatesToTrim.sort(key=lambda n: childToTips[n], reverse=True)
+ childrenToRemove = set(candidatesToTrim[-numToTrim:])
+ node.children = [n for n in node.children if n not in childrenToRemove]
+ # Mark nodes for deletion
+ for n in childrenToRemove:
+ markForRemoval(n)
+ # Recurse on children
+ for n in node.children:
+ findTrimmables(n)
+ def markForRemoval(nodeName):
+ nonlocal nodeMap, namesToRemove
+ namesToRemove.add(nodeName)
+ for child in nodeMap[nodeName].children:
+ markForRemoval(child)
+ findTrimmables(rootName)
+ for nodeName in namesToRemove:
+ del nodeMap[nodeName]
+def updateTips(nodeName, nodeMap):
+ " Updates the 'tips' values for a node and its descendants, returning the node's new 'tips' value "
+ node = nodeMap[nodeName]
+ tips = sum([updateTips(childName, nodeMap) for childName in node.children])
+ tips = max(1, tips)
+ node.tips = tips
+ return tips
+def addTreeTables(nodeMap, dbCur, suffix):
+ " Adds a tree to the database, as tables nodes_X and edges_X, where X is the given suffix "
+ nodesTbl = f"nodes_{suffix}"
+ edgesTbl = f"edges_{suffix}"
+ dbCur.execute(f"CREATE TABLE {nodesTbl} (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
+ dbCur.execute(f"CREATE INDEX {nodesTbl}_idx_nc ON {nodesTbl}(name COLLATE NOCASE)")
+ dbCur.execute(f"CREATE TABLE {edgesTbl} (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
+ dbCur.execute(f"CREATE INDEX {edgesTbl}_child_idx ON {edgesTbl}(child)")
+ for (name, node) in nodeMap.items():
+ dbCur.execute(f"INSERT INTO {nodesTbl} VALUES (?, ?, ?)", (name, node.id, node.tips))
+ for childName in node.children:
+ pSupport = 1 if nodeMap[childName].pSupport else 0
+ dbCur.execute(f"INSERT INTO {edgesTbl} VALUES (?, ?, ?)", (name, childName, pSupport))
+
+print(f"Finding root node")
+query = "SELECT name FROM nodes LEFT JOIN edges ON nodes.name = edges.child WHERE edges.parent IS NULL LIMIT 1"
+(rootName,) = dbCur.execute(query).fetchone()
+print(f"Found \"{rootName}\"")
+
+print('=== Getting picked-nodes ===')
+pickedNames = set()
+pickedTreeExists = False
+if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='nodes_p'").fetchone() == None:
+ print(f"Reading from {pickedNodesFile}")
+ with open(pickedNodesFile) as file:
+ for line in file:
+ name = line.rstrip()
+ row = dbCur.execute("SELECT name from nodes WHERE name = ?", (name,)).fetchone()
+ if row == None:
+ row = dbCur.execute("SELECT name from names WHERE alt_name = ?", (name,)).fetchone()
+ if row != None:
+ pickedNames.add(row[0])
+ if len(pickedNames) == 0:
+ raise Exception("ERROR: No picked names found")
+else:
+ pickedTreeExists = True
+ print("Picked-node tree already exists")
+ if tree == 'picked':
+ sys.exit()
+ for (name,) in dbCur.execute("SELECT name FROM nodes_p"):
+ pickedNames.add(name)
+print(f"Found {len(pickedNames)} names")
+
+if (tree == 'picked' or tree == None) and not pickedTreeExists:
+ print("=== Generating picked-nodes tree ===")
+ genPickedNodeTree(dbCur, pickedNames, rootName)
+if tree != 'picked':
+ print("=== Finding 'non-low significance' nodes ===")
+ nodesWithImgOrPicked = set()
+ nodesWithImgDescOrPicked = set()
+ print("Finding nodes with descs")
+ for (name,) in dbCur.execute("SELECT name FROM wiki_ids"): # Can assume the wiki_id has a desc
+ nodesWithImgDescOrPicked.add(name)
+ print("Finding nodes with images")
+ for (name,) in dbCur.execute("SELECT name FROM node_imgs"):
+ nodesWithImgDescOrPicked.add(name)
+ nodesWithImgOrPicked.add(name)
+ print("Adding picked nodes")
+ for name in pickedNames:
+ nodesWithImgDescOrPicked.add(name)
+ nodesWithImgOrPicked.add(name)
+ if tree == 'images' or tree == None:
+ print("=== Generating images-only tree ===")
+ genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName)
+ if tree == 'trimmed' or tree == None:
+ print("=== Generating weakly-trimmed tree ===")
+ genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName)
+
+print("Closing database")
+dbCon.commit()
+dbCon.close()
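The single-child splice performed by `removeCollapsibleNodes` can be sketched on a toy child-list map (the node names and the `collapse_single_child_parents` helper are hypothetical):

```python
def collapse_single_child_parents(children, root):
    """Splice every non-root node with exactly one child out of the tree,
    reconnecting that child to the removed node's parent."""
    # Build a child -> parent map from the child-list map
    parent = {c: p for p, kids in children.items() for c in kids}
    for name in [n for n in children if n != root and len(children[n]) == 1]:
        p, child = parent[name], children[name][0]
        children[p] = [child if x == name else x for x in children[p]]
        parent[child] = p
        del children[name]
    return children

tree = {"root": ["a"], "a": ["b"], "b": ["c", "d"], "c": [], "d": []}
print(collapse_single_child_parents(tree, "root"))
# {'root': ['b'], 'b': ['c', 'd'], 'c': [], 'd': []}
```

Chains of single-child nodes collapse fully, because the parent map is updated as each node is spliced out.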
diff --git a/backend/tolData/otol/README.md b/backend/tolData/otol/README.md
new file mode 100644
index 0000000..4be2fd2
--- /dev/null
+++ b/backend/tolData/otol/README.md
@@ -0,0 +1,10 @@
+Files
+=====
+- opentree13.4tree.tgz <br>
+ Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>.
+ Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life).
+- labelled\_supertree\_ottnames.tre <br>
+ Extracted from the .tgz file. Describes the structure of the tree.
+- annotations.json
+ Extracted from the .tgz file. Contains additional attributes of tree
+ nodes. Used for finding out which nodes have 'phylogenetic support'.
diff --git a/backend/tolData/pickedImgs/README.md b/backend/tolData/pickedImgs/README.md
new file mode 100644
index 0000000..dfe192b
--- /dev/null
+++ b/backend/tolData/pickedImgs/README.md
@@ -0,0 +1,10 @@
+This directory holds additional image files to use for tree-of-life nodes,
+on top of those from EOL and Wikipedia.
+
+Possible Files
+==============
+- (Image files)
+- imgData.txt <br>
+ Contains lines with the format `filename|url|license|artist|credit`.
+ The filename should consist of a node name, with an image extension.
+ Other fields correspond to those in the `images` table (see ../README.md).
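A line of `imgData.txt` splits into its five fields on `|`. A minimal sketch with made-up field values:

```python
# Hypothetical imgData.txt entry (all field values are made up)
line = "Canis lupus.jpg|https://example.org/wolf.jpg|cc-by-sa-3.0|Jane Doe|via Wikimedia Commons"
filename, url, img_license, artist, credit = line.rstrip("\n").split("|")
print(filename)  # Canis lupus.jpg
print(artist)    # Jane Doe
```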
diff --git a/backend/tolData/reviewImgsToGen.py b/backend/tolData/reviewImgsToGen.py
new file mode 100755
index 0000000..de592f5
--- /dev/null
+++ b/backend/tolData/reviewImgsToGen.py
@@ -0,0 +1,225 @@
+#!/usr/bin/python3
+
+import sys, re, os, time
+import sqlite3
+import tkinter as tki
+from tkinter import ttk
+import PIL
+from PIL import ImageTk, Image, ImageOps
+
+usageInfo = f"""
+Usage: {sys.argv[0]}
+
+Provides a GUI that displays, for each node in the database, associated
+images from EOL and Wikipedia, and allows choosing which to use. Writes
+choice data to a text file with lines of the form 'otolId1 imgPath1', or
+'otolId1', where no path indicates a choice of no image.
+
+The program can be closed, and run again to continue from the last choice.
+The program looks for an existing output file to determine what choices
+have already been made.
+"""
+if len(sys.argv) > 1:
+ print(usageInfo, file=sys.stderr)
+ sys.exit(1)
+
+eolImgDir = "eol/imgs/"
+enwikiImgDir = "enwiki/imgs/"
+dbFile = "data.db"
+outFile = "imgList.txt"
+IMG_DISPLAY_SZ = 400
+PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135))
+onlyReviewPairs = True
+
+print("Opening database")
+dbCon = sqlite3.connect(dbFile)
+dbCur = dbCon.cursor()
+
+nodeToImgs = {} # Maps otol-ids to arrays of image paths
+print("Iterating through images from EOL")
+if os.path.exists(eolImgDir):
+ for filename in os.listdir(eolImgDir):
+ # Get associated EOL ID
+ eolId, _, _ = filename.partition(" ")
+ query = "SELECT nodes.id FROM nodes INNER JOIN eol_ids ON nodes.name = eol_ids.name WHERE eol_ids.id = ?"
+ # Get associated node IDs
+ found = False
+ for (otolId,) in dbCur.execute(query, (int(eolId),)):
+ if otolId not in nodeToImgs:
+ nodeToImgs[otolId] = []
+ nodeToImgs[otolId].append(eolImgDir + filename)
+ found = True
+ if not found:
+ print(f"WARNING: No node found for {eolImgDir}{filename}")
+print(f"Result: {len(nodeToImgs)} nodes with images")
+print("Iterating through images from Wikipedia")
+if os.path.exists(enwikiImgDir):
+ for filename in os.listdir(enwikiImgDir):
+ # Get associated page ID
+ (wikiId, _, _) = filename.partition(".")
+ # Get associated node IDs
+ query = "SELECT nodes.id FROM nodes INNER JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id = ?"
+ found = False
+ for (otolId,) in dbCur.execute(query, (int(wikiId),)):
+ if otolId not in nodeToImgs:
+ nodeToImgs[otolId] = []
+ nodeToImgs[otolId].append(enwikiImgDir + filename)
+ found = True
+ if not found:
+ print(f"WARNING: No node found for {enwikiImgDir}{filename}")
+print(f"Result: {len(nodeToImgs)} nodes with images")
+print("Filtering out already-made image choices")
+oldSz = len(nodeToImgs)
+if os.path.exists(outFile):
+ with open(outFile) as file:
+ for line in file:
+ line = line.rstrip()
+ if " " in line:
+ line = line[:line.find(" ")]
+ nodeToImgs.pop(line, None) # Tolerate ids no longer present in the image dirs
+print(f"Filtered out {oldSz - len(nodeToImgs)} entries")
+
+class ImgReviewer:
+ " Provides the GUI for reviewing images "
+ def __init__(self, root, nodeToImgs):
+ self.root = root
+ root.title("Image Reviewer")
+ # Setup main frame
+ mainFrame = ttk.Frame(root, padding="5 5 5 5")
+ mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S))
+ root.columnconfigure(0, weight=1)
+ root.rowconfigure(0, weight=1)
+ # Set up images-to-be-reviewed frames
+ self.eolImg = ImageTk.PhotoImage(PLACEHOLDER_IMG)
+ self.enwikiImg = ImageTk.PhotoImage(PLACEHOLDER_IMG)
+ self.labels = []
+ for i in (0, 1):
+ frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ)
+ frame.grid(column=i, row=0)
+ label = ttk.Label(frame, image=self.eolImg if i == 0 else self.enwikiImg)
+ label.grid(column=0, row=0)
+ self.labels.append(label)
+ # Add padding
+ for child in mainFrame.winfo_children():
+ child.grid_configure(padx=5, pady=5)
+ # Add keyboard bindings
+ root.bind("<q>", self.quit)
+ root.bind("<Key-j>", lambda evt: self.accept(0))
+ root.bind("<Key-k>", lambda evt: self.accept(1))
+ root.bind("<Key-l>", lambda evt: self.reject())
+ # Set fields
+ self.nodeImgsList = list(nodeToImgs.items())
+ self.listIdx = -1
+ self.otolId = None
+ self.eolImgPath = None
+ self.enwikiImgPath = None
+ self.numReviewed = 0
+ self.startTime = time.time()
+ # Initialise images to review
+ self.getNextImgs()
+ def getNextImgs(self):
+ " Updates display with new images to review, or ends program "
+ # Get next image paths
+ while True:
+ self.listIdx += 1
+ if self.listIdx == len(self.nodeImgsList):
+ print("No more images to review. Exiting program.")
+ self.quit()
+ return
+ self.otolId, imgPaths = self.nodeImgsList[self.listIdx]
+ # Potentially skip user choice
+ if onlyReviewPairs and len(imgPaths) == 1:
+ with open(outFile, 'a') as file:
+ file.write(f"{self.otolId} {imgPaths[0]}\n")
+ continue
+ break
+ # Update displayed images
+ self.eolImgPath = self.enwikiImgPath = None
+ imageOpenError = False
+ for imgPath in imgPaths:
+ img = None
+ try:
+ img = Image.open(imgPath)
+ img = ImageOps.exif_transpose(img)
+ except PIL.UnidentifiedImageError:
+ print(f"UnidentifiedImageError for {imgPath}")
+ imageOpenError = True
+ continue
+ if imgPath.startswith("eol/"):
+ self.eolImgPath = imgPath
+ self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img))
+ elif imgPath.startswith("enwiki/"):
+ self.enwikiImgPath = imgPath
+ self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img))
+ else:
+ print(f"Unexpected image path {imgPath}")
+ self.quit()
+ return
+ # Re-iterate if all image paths invalid
+ if self.eolImgPath == None and self.enwikiImgPath == None:
+ if imageOpenError:
+ self.reject()
+ self.getNextImgs()
+ return
+ # Add placeholder images
+ if self.eolImgPath == None:
+ self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG))
+ elif self.enwikiImgPath == None:
+ self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG))
+ # Update image-frames
+ self.labels[0].config(image=self.eolImg)
+ self.labels[1].config(image=self.enwikiImg)
+ # Update title
+ title = f"Images for otol ID {self.otolId}"
+ query = "SELECT names.alt_name FROM" \
+ " nodes INNER JOIN names ON nodes.name = names.name" \
+ " WHERE nodes.id = ? and pref_alt = 1"
+ row = dbCur.execute(query, (self.otolId,)).fetchone()
+ if row != None:
+ title += f", aka {row[0]}"
+ title += f" ({self.listIdx + 1} out of {len(self.nodeImgsList)})"
+ self.root.title(title)
+ def accept(self, imgIdx):
+ " React to a user selecting an image "
+ imgPath = self.eolImgPath if imgIdx == 0 else self.enwikiImgPath
+ if imgPath == None:
+ print("Invalid selection")
+ return
+ with open(outFile, 'a') as file:
+ file.write(f"{self.otolId} {imgPath}\n")
+ self.numReviewed += 1
+ self.getNextImgs()
+ def reject(self):
+ " React to a user rejecting all images of a set "
+ with open(outFile, 'a') as file:
+ file.write(f"{self.otolId}\n")
+ self.numReviewed += 1
+ self.getNextImgs()
+ def quit(self, e = None):
+ global dbCon
+ print(f"Number reviewed: {self.numReviewed}")
+ timeElapsed = time.time() - self.startTime
+ print(f"Time elapsed: {timeElapsed:.2f} seconds")
+ if self.numReviewed > 0:
+ print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds")
+ dbCon.close()
+ self.root.destroy()
+ def resizeImgForDisplay(self, img):
+ " Returns a copy of an image, shrunk to fit its frame (keeps aspect ratio), and with a background "
+ if max(img.width, img.height) > IMG_DISPLAY_SZ:
+ if (img.width > img.height):
+ newHeight = int(img.height * IMG_DISPLAY_SZ/img.width)
+ img = img.resize((IMG_DISPLAY_SZ, newHeight))
+ else:
+ newWidth = int(img.width * IMG_DISPLAY_SZ / img.height)
+ img = img.resize((newWidth, IMG_DISPLAY_SZ))
+ bgImg = PLACEHOLDER_IMG.copy()
+ bgImg.paste(img, box=(
+ int((IMG_DISPLAY_SZ - img.width) / 2),
+ int((IMG_DISPLAY_SZ - img.height) / 2)))
+ return bgImg
+# Create GUI and defer control
+print("Starting GUI")
+root = tki.Tk()
+ImgReviewer(root, nodeToImgs)
+root.mainloop()
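The shrink-to-fit arithmetic in `resizeImgForDisplay` can be checked standalone. A sketch assuming the 400-pixel display size used above (`fit_within` is a hypothetical helper):

```python
def fit_within(width, height, box=400):
    """Return the (width, height) an image is scaled to so it fits in a
    box-by-box square, preserving aspect ratio; smaller images are untouched."""
    if max(width, height) <= box:
        return (width, height)
    if width > height:
        return (box, int(height * box / width))
    return (int(width * box / height), box)

print(fit_within(800, 400))  # (400, 200)
print(fit_within(300, 600))  # (200, 400)
print(fit_within(200, 100))  # (200, 100)
```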