Diffstat (limited to 'backend/tolData')
26 files changed, 3243 insertions, 0 deletions
diff --git a/backend/tolData/README.md b/backend/tolData/README.md new file mode 100644 index 0000000..ba64114 --- /dev/null +++ b/backend/tolData/README.md @@ -0,0 +1,152 @@ +This directory holds files used to generate data.db, which contains tree-of-life data. + +# Tables +## Tree Structure data +- `nodes` <br> + Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br> + Represents a tree-of-life node. `tips` represents the number of no-child descendants. +- `edges` <br> + Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br> + `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise +## Node name data +- `eol_ids` <br> + Format: `id INT PRIMARY KEY, name TEXT` <br> + Associates an EOL ID with a node's name. +- `names` <br> + Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` <br> + Associates a node with alternative names. + `pref_alt` is 1 if the alt-name is the most 'preferred' one. + `src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked'). +## Node description data +- `wiki_ids` <br> + Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br> + Associates a node with a wikipedia page ID. + `redirected` is 1 if the node was associated with a different page that redirected to this one. +- `descs` <br> + Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br> + Associates a wikipedia page ID with a short-description. + `from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise. +## Node image data +- `node_imgs` <br> + Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br> + Associates a node with an image. +- `images` <br> + Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` <br> + Represents an image, identified by a source ('eol', 'enwiki', or 'picked'), and a source-specific ID. 
+- `linked_imgs` <br> + Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br> + Associates a node with an image from another node. + `otol_ids` can be an otol ID, or two comma-separated otol IDs or empty strings. + The latter is used for compound nodes. +## Reduced tree data +- `nodes_t`, `nodes_i`, `nodes_p` <br> + These are like `nodes`, but describe the nodes for various reduced trees. +- `edges_t`, `edges_i`, `edges_p` <br> + Like `edges` but for reduced trees. + +# Generating the Database + +For the most part, these steps should be done in order. + +As a warning, the whole process takes a lot of time and file space. The tree will probably +have about 2.5 million nodes. Downloading the images takes several days, and occupies over +200 GB. And if you want good data, you'll need to do some manual review, which can take weeks. + +## Environment +The scripts are written in python and bash. +Some of the python scripts require third-party packages: +- jsonpickle: For encoding class objects as JSON. +- requests: For downloading data. +- PIL: For image processing. +- tkinter: For providing a basic GUI to review images. +- mwxml, mwparserfromhell: For parsing Wikipedia dumps. + +## Generate tree structure data +1. Obtain files in otol/, as specified in its README. +2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables, + using data in otol/. It also uses these files, if they exist: + - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the + tree may have the same name (eg: Pholidota can refer to pangolins or orchids). + Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc. + This file can be used to manually specify which node should be named 'name1'. + +## Generate node name data +1. Obtain 'name data files' in eol/, as specified in its README. +2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in + eol/ and the `nodes` table. 
It also uses these files, if they exist: + - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`. + Specifies node names that should have a particular EOL ID, or no ID. + Quite a few taxa have ambiguous names, and may need manual correction. + For example, Viola may resolve to a taxon of butterflies or of plants. + - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`. + Specifies that a node's alt-name set should exclude altName1. + +## Generate node description data +### Get data from DBpedia +1. Obtain files in dbpedia/, as specified in its README. +2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in + dbpedia/ and the `nodes` table. It also uses these files, if they exist: + - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which + no description should be obtained. Many node names have a same-name + wikipedia page that describes something different (eg: Osiris). + - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`. + Specifies node names that should have a particular associated page label. +### Get data from Wikipedia +1. Obtain 'description database files' in enwiki/, as specified in its README. +2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables, + using data in enwiki/ and the `nodes` table. + It also uses these files, if they exist: + - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py. + - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt. + +## Generate node image data +### Get images from EOL +1. Obtain 'image metadata files' in eol/, as specified in its README. +2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node), + into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table. +3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for + each node, providing the choice of which to use, moving them to eol/imgs/. 
+ Uses `names` and `eol_ids` to display extra info. +### Get images from Wikipedia +1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node, + using the `wiki_ids` table, and stores them in a database. +2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for + those images, using wikipedia's online API. +3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed' + images into enwiki/imgs/. +### Merge the image sets +1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/, + and enables choosing, for each node, which image should be used, if any, + and outputs choice information into imgList.txt. Uses the `nodes`, + `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info). +2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in + imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and + `images` tables. If pickedImgs/ is present, images within it are also used. <br> + The outputs might need to be manually created/adjusted: + - An input image might have no output produced, possibly due to + data incompatibilities, memory limits, etc. A few input image files + might actually be html files, containing a 'file not found' page. + - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg. + - An input image might produce output with unexpected dimensions. + This seems to happen when the image is very large, and triggers a + decompression bomb warning. + The result might have as many as 150k images, with about 2/3 of them + being from wikipedia. +### Add more image associations +1. Run genLinkedImgs.py, which tries to associate nodes without images to + images of their children. Adds the `linked_imgs` table, and uses the + `nodes`, `edges`, and `node_imgs` tables. + +## Do some post-processing +1. 
Run genEnwikiNameData.py, which adds more entries to the `names` table, + using data in enwiki/, and the `names` and `wiki_ids` tables. +2. Optionally run addPickedNames.py, which allows adding manually-selected name data to + the `names` table, as specified in pickedNames.txt. + - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`. + These correspond to entries in the `names` table. `prefAlt` should be 1 or 0. + A line like `name1|name1|1` causes a node to have no preferred alt-name. +3. Run genReducedTrees.py, which generates multiple reduced versions of the tree, + adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads from + pickedNodes.txt, which lists names of nodes that must be included (1 per line). + The original tree isn't used for web-queries, as some nodes would have over + 10k children, which can take a while to render (took over a minute in testing). diff --git a/backend/tolData/addPickedNames.py b/backend/tolData/addPickedNames.py new file mode 100755 index 0000000..d56a0cb --- /dev/null +++ b/backend/tolData/addPickedNames.py @@ -0,0 +1,57 @@ +#!/usr/bin/python3 + +import sys +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads alt-name data from a file, and adds it to the database's 'names' table. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +dbFile = "data.db" +pickedNamesFile = "pickedNames.txt" + +print("Opening database") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +print("Iterating through picked-names file") +with open(pickedNamesFile) as file: + for line in file: + # Get record data + nodeName, altName, prefAlt = line.lower().rstrip().split("|") + prefAlt = int(prefAlt) + # Check whether there exists a node with the name + row = dbCur.execute("SELECT name from nodes where name = ?", (nodeName,)).fetchone() + if row == None: + print(f"ERROR: No node with name \"{nodeName}\" exists") + break + # Remove any existing preferred-alt status + if prefAlt == 1: + query = "SELECT name, alt_name FROM names WHERE name = ? AND pref_alt = 1" + row = dbCur.execute(query, (nodeName,)).fetchone() + if row != None and row[1] != altName: + print(f"Removing pref-alt status from alt-name {row[1]} for {nodeName}") + dbCur.execute("UPDATE names SET pref_alt = 0 WHERE name = ? AND alt_name = ?", row) + # Check for an existing record + if nodeName == altName: + continue + query = "SELECT name, alt_name, pref_alt FROM names WHERE name = ? AND alt_name = ?" + row = dbCur.execute(query, (nodeName, altName)).fetchone() + if row == None: + print(f"Adding record for alt-name {altName} for {nodeName}") + dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'picked')", (nodeName, altName, prefAlt)) + else: + # Update existing record + if row[2] != prefAlt: + print(f"Updating record for alt-name {altName} for {nodeName}") + dbCur.execute("UPDATE names SET pref_alt = ?, src = 'picked' WHERE name = ? 
AND alt_name = ?", + (prefAlt, nodeName, altName)) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/dbpedia/README.md b/backend/tolData/dbpedia/README.md new file mode 100644 index 0000000..8a08f20 --- /dev/null +++ b/backend/tolData/dbpedia/README.md @@ -0,0 +1,29 @@ +This directory holds files obtained from/using [Dbpedia](https://www.dbpedia.org). + +# Downloaded Files +- `labels_lang=en.ttl.bz2` <br> + Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core. + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>. +- `page_lang=en_ids.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2> +- `redirects_lang=en_transitive.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>. +- `disambiguations_lang=en.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>. +- `instance-types_lang=en_specific.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>. +- `short-abstracts_lang=en.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>. + +# Other Files +- genDescData.py <br> + Used to generate a database representing data from the ttl files. +- descData.db <br> + Generated by genDescData.py. 
<br> + Tables: <br> + - `labels`: `iri TEXT PRIMARY KEY, label TEXT ` + - `ids`: `iri TEXT PRIMARY KEY, id INT` + - `redirects`: `iri TEXT PRIMARY KEY, target TEXT` + - `disambiguations`: `iri TEXT PRIMARY KEY` + - `types`: `iri TEXT, type TEXT` + - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT` diff --git a/backend/tolData/dbpedia/genDescData.py b/backend/tolData/dbpedia/genDescData.py new file mode 100755 index 0000000..d9e8a80 --- /dev/null +++ b/backend/tolData/dbpedia/genDescData.py @@ -0,0 +1,130 @@ +#!/usr/bin/python3 + +import sys, re +import bz2, sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Adds DBpedia labels/types/abstracts/etc data into a database. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +labelsFile = "labels_lang=en.ttl.bz2" # Had about 16e6 entries +idsFile = "page_lang=en_ids.ttl.bz2" +redirectsFile = "redirects_lang=en_transitive.ttl.bz2" +disambigFile = "disambiguations_lang=en.ttl.bz2" +typesFile = "instance-types_lang=en_specific.ttl.bz2" +abstractsFile = "short-abstracts_lang=en.ttl.bz2" +dbFile = "descData.db" +# In testing, this script took a few hours to run, and generated about 10GB + +print("Creating database") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +print("Reading/storing label data") +dbCur.execute("CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)") +dbCur.execute("CREATE INDEX labels_idx ON labels(label)") +dbCur.execute("CREATE INDEX labels_idx_nc ON labels(label COLLATE NOCASE)") +labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "((?:[^"]|\\")+)"@en \.\n') +lineNum = 0 +with bz2.open(labelsFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = labelLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + dbCur.execute("INSERT INTO labels VALUES (?, ?)", (match.group(1), match.group(2))) + +print("Reading/storing wiki page ids") 
+dbCur.execute("CREATE TABLE ids (iri TEXT PRIMARY KEY, id INT)") +idLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(\d+)".*\n') +lineNum = 0 +with bz2.open(idsFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = idLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + try: + dbCur.execute("INSERT INTO ids VALUES (?, ?)", (match.group(1), int(match.group(2)))) + except sqlite3.IntegrityError as e: + # Accounts for certain lines that have the same IRI + print(f"WARNING: Failed to add entry with IRI \"{match.group(1)}\": {e}") + +print("Reading/storing redirection data") +dbCur.execute("CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)") +redirLineRegex = re.compile(r'<([^>]+)> <[^>]+> <([^>]+)> \.\n') +lineNum = 0 +with bz2.open(redirectsFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = redirLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (match.group(1), match.group(2))) + +print("Reading/storing diambiguation-page data") +dbCur.execute("CREATE TABLE disambiguations (iri TEXT PRIMARY KEY)") +disambigLineRegex = redirLineRegex +lineNum = 0 +with bz2.open(disambigFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = disambigLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + dbCur.execute("INSERT OR IGNORE INTO disambiguations VALUES (?)", (match.group(1),)) + +print("Reading/storing instance-type data") +dbCur.execute("CREATE TABLE types (iri TEXT, type TEXT)") +dbCur.execute("CREATE INDEX types_iri_idx ON types(iri)") +typeLineRegex = redirLineRegex +lineNum = 0 +with bz2.open(typesFile, 
mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = typeLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + dbCur.execute("INSERT INTO types VALUES (?, ?)", (match.group(1), match.group(2))) + +print("Reading/storing abstracts") +dbCur.execute("CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)") +descLineRegex = labelLineRegex +lineNum = 0 +with bz2.open(abstractsFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + if line[0] == "#": + continue + match = descLineRegex.fullmatch(line) + if match == None: + raise Exception(f"ERROR: Line {lineNum} has unexpected format") + dbCur.execute("INSERT INTO abstracts VALUES (?, ?)", + (match.group(1), match.group(2).replace(r'\"', '"'))) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md new file mode 100644 index 0000000..90d16c7 --- /dev/null +++ b/backend/tolData/enwiki/README.md @@ -0,0 +1,52 @@ +This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page). + +# Downloaded Files +- enwiki-20220501-pages-articles-multistream.xml.bz2 <br> + Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror). + Contains text content and metadata for pages in enwiki. + Some file content and format information was available from + <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>. +- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br> + Obtained like above. Holds lines of the form offset1:pageId1:title1, + providing, for each page, an offset into the dump file of a chunk of + 100 pages that includes it. 
+ +# Generated Dump-Index Files +- genDumpIndexDb.py <br> + Creates an sqlite-database version of the enwiki-dump index file. +- dumpIndex.db <br> + Generated by genDumpIndexDb.py. <br> + Tables: <br> + - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT` + +# Description Database Files +- genDescData.py <br> + Reads through pages in the dump file, and adds short-description info to a database. +- descData.db <br> + Generated by genDescData.py. <br> + Tables: <br> + - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE` + - `redirects`: `id INT PRIMARY KEY, target TEXT` + - `descs`: `id INT PRIMARY KEY, desc TEXT` + +# Image Database Files +- genImgData.py <br> + Used to find infobox image names for page IDs, storing them into a database. +- downloadImgLicenseInfo.py <br> + Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database. +- imgData.db <br> + Used to hold metadata about infobox images for a set of pageIDs. + Generated using genImgData.py and downloadImgLicenseInfo.py. <br> + Tables: <br> + - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br> + `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids. + - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br> + Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability. +- downloadImgs.py <br> + Used to download image files into imgs/. + +# Other Files +- lookupPage.py <br> + Running `lookupPage.py title1` looks in the dump for a page with a given title, + and prints the contents to stdout. Uses dumpIndex.db. 
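The relationship between the `offsets` table and the multistream dump file can be sketched as follows. This is a minimal illustration and not the contents of lookupPage.py, which is not shown here. The function name `read_dump_chunk` and its parameters are hypothetical. It assumes that each chunk of the dump is an independent bz2 stream beginning at `offset`, and that `next_offset` is -1 for the final chunk, matching the sentinel that genDumpIndexDb.py stores.

```python
import bz2

def read_dump_chunk(dump_path, offset, next_offset=-1):
    """Return the decompressed text of the bz2 stream that starts at `offset`.

    `offset` and `next_offset` are assumed to come from a row of the
    `offsets` table; a negative `next_offset` means 'read to end of file'.
    """
    with open(dump_path, "rb") as f:
        f.seek(offset)
        if next_offset < 0:
            data = f.read()
        else:
            data = f.read(next_offset - offset)
    # A fresh decompressor stops at the end of the first stream it sees,
    # so any trailing bytes after that stream are ignored.
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")
```

The returned text would hold the XML for the whole chunk of pages, so a caller would still need to search it for the `<page>` element with the wanted title or ID.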
+ diff --git a/backend/tolData/enwiki/downloadImgLicenseInfo.py b/backend/tolData/enwiki/downloadImgLicenseInfo.py new file mode 100755 index 0000000..399922e --- /dev/null +++ b/backend/tolData/enwiki/downloadImgLicenseInfo.py @@ -0,0 +1,150 @@ +#!/usr/bin/python3 + +import sys, re +import sqlite3, urllib.parse, html +import requests +import time, signal + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads image names from a database, and uses enwiki's online API to obtain +licensing information for them, adding the info to the database. + +SIGINT causes the program to finish an ongoing download and exit. +The program can be re-run to continue downloading, and looks +at already-processed names to decide what to skip. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +imgDb = "imgData.db" +apiUrl = "https://en.wikipedia.org/w/api.php" +userAgent = "terryt.dev (terry06890@gmail.com)" +batchSz = 50 # Max 50 +tagRegex = re.compile(r"<[^<]+>") +whitespaceRegex = re.compile(r"\s+") + +print("Opening database") +dbCon = sqlite3.connect(imgDb) +dbCur = dbCon.cursor() +dbCur2 = dbCon.cursor() +print("Checking for table") +if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='imgs'").fetchone() == None: + dbCur.execute("CREATE TABLE imgs(" \ + "name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT)") + +print("Reading image names") +imgNames = set() +for (imgName,) in dbCur.execute("SELECT DISTINCT img_name FROM page_imgs WHERE img_name NOT NULL"): + imgNames.add(imgName) +print(f"Found {len(imgNames)}") + +print("Checking for already-processed images") +oldSz = len(imgNames) +for (imgName,) in dbCur.execute("SELECT name FROM imgs"): + imgNames.discard(imgName) +print(f"Found {oldSz - len(imgNames)}") + +# Set SIGINT handler +interrupted = False +oldHandler = None +def onSigint(sig, frame): + global interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) +oldHandler 
= signal.signal(signal.SIGINT, onSigint) + +print("Iterating through image names") +imgNames = list(imgNames) +iterNum = 0 +for i in range(0, len(imgNames), batchSz): + iterNum += 1 + if iterNum % 1 == 0: + print(f"At iteration {iterNum} (after {(iterNum - 1) * batchSz} images)") + if interrupted: + print(f"Exiting loop at iteration {iterNum}") + break + # Get batch + imgBatch = imgNames[i:i+batchSz] + imgBatch = ["File:" + x for x in imgBatch] + # Make request + headers = { + "user-agent": userAgent, + "accept-encoding": "gzip", + } + params = { + "action": "query", + "format": "json", + "prop": "imageinfo", + "iiprop": "extmetadata|url", + "maxlag": "5", + "titles": "|".join(imgBatch), + "iiextmetadatafilter": "Artist|Credit|LicenseShortName|Restrictions", + } + responseObj = None + try: + response = requests.get(apiUrl, params=params, headers=headers) + responseObj = response.json() + except Exception as e: + print(f"ERROR: Exception while downloading info: {e}") + print(f"\tImage batch: " + "|".join(imgBatch)) + continue + # Parse response-object + if "query" not in responseObj or "pages" not in responseObj["query"]: + print("WARNING: Response object for doesn't have page data") + print("\tImage batch: " + "|".join(imgBatch)) + if "error" in responseObj: + errorCode = responseObj["error"]["code"] + print(f"\tError code: {errorCode}") + if errorCode == "maxlag": + time.sleep(5) + continue + pages = responseObj["query"]["pages"] + normalisedToInput = {} + if "normalized" in responseObj["query"]: + for entry in responseObj["query"]["normalized"]: + normalisedToInput[entry["to"]] = entry["from"] + for (_, page) in pages.items(): + # Some fields // More info at https://www.mediawiki.org/wiki/Extension:CommonsMetadata#Returned_data + # LicenseShortName: short human-readable license name, apparently more reliable than 'License', + # Artist: author name (might contain complex html, multiple authors, etc) + # Credit: 'source' + # For image-map-like images, can be quite 
large/complex html, creditng each sub-image + # May be <a href="text1">text2</a>, where the text2 might be non-indicative + # Restrictions: specifies non-copyright legal restrictions + title = page["title"] + if title in normalisedToInput: + title = normalisedToInput[title] + title = title[5:] # Remove 'File:' + if title not in imgNames: + print(f"WARNING: Got title \"{title}\" not in image-name list") + continue + if "imageinfo" not in page: + print(f"WARNING: No imageinfo section for page \"{title}\"") + continue + metadata = page["imageinfo"][0]["extmetadata"] + url = page["imageinfo"][0]["url"] + license = metadata['LicenseShortName']['value'] if 'LicenseShortName' in metadata else None + artist = metadata['Artist']['value'] if 'Artist' in metadata else None + credit = metadata['Credit']['value'] if 'Credit' in metadata else None + restrictions = metadata['Restrictions']['value'] if 'Restrictions' in metadata else None + # Remove markup + if artist != None: + artist = tagRegex.sub(" ", artist) + artist = whitespaceRegex.sub(" ", artist) + artist = html.unescape(artist) + artist = urllib.parse.unquote(artist) + if credit != None: + credit = tagRegex.sub(" ", credit) + credit = whitespaceRegex.sub(" ", credit) + credit = html.unescape(credit) + credit = urllib.parse.unquote(credit) + # Add to db + dbCur2.execute("INSERT INTO imgs VALUES (?, ?, ?, ?, ?, ?)", + (title, license, artist, credit, restrictions, url)) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/enwiki/downloadImgs.py b/backend/tolData/enwiki/downloadImgs.py new file mode 100755 index 0000000..8fb605f --- /dev/null +++ b/backend/tolData/enwiki/downloadImgs.py @@ -0,0 +1,91 @@ +#!/usr/bin/python3 + +import sys, re, os +import sqlite3 +import urllib.parse, requests +import time, signal + +usageInfo = f""" +Usage: {sys.argv[0]} + +Downloads images from URLs in an image database, into an output directory, +with names of the form 'pageId1.ext1'. 
+ +SIGINT causes the program to finish an ongoing download and exit. +The program can be re-run to continue downloading, and looks +in the output directory do decide what to skip. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +imgDb = "imgData.db" # About 130k image names +outDir = "imgs" +licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE) +# In testing, this downloaded about 100k images, over several days + +if not os.path.exists(outDir): + os.mkdir(outDir) +print("Checking for already-downloaded images") +fileList = os.listdir(outDir) +pageIdsDone = set() +for filename in fileList: + (basename, extension) = os.path.splitext(filename) + pageIdsDone.add(int(basename)) +print(f"Found {len(pageIdsDone)}") + +# Set SIGINT handler +interrupted = False +oldHandler = None +def onSigint(sig, frame): + global interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) +oldHandler = signal.signal(signal.SIGINT, onSigint) + +print("Opening database") +dbCon = sqlite3.connect(imgDb) +dbCur = dbCon.cursor() +print("Starting downloads") +iterNum = 0 +query = "SELECT page_id, license, artist, credit, restrictions, url FROM" \ + " imgs INNER JOIN page_imgs ON imgs.name = page_imgs.img_name" +for (pageId, license, artist, credit, restrictions, url) in dbCur.execute(query): + if pageId in pageIdsDone: + continue + if interrupted: + print(f"Exiting loop") + break + # Check for problematic attributes + if license == None or licenseRegex.fullmatch(license) == None: + continue + if artist == None or artist == "" or len(artist) > 100 or re.match(r"(\d\. 
)?File:", artist) != None: + continue + if credit == None or len(credit) > 300 or re.match(r"File:", credit) != None: + continue + if restrictions != None and restrictions != "": + continue + # Download image + iterNum += 1 + print(f"Iteration {iterNum}: Downloading for page-id {pageId}") + urlParts = urllib.parse.urlparse(url) + extension = os.path.splitext(urlParts.path)[1] + if len(extension) <= 1: + print(f"WARNING: No filename extension found in URL {url}") + sys.exit(1) + outFile = f"{outDir}/{pageId}{extension}" + headers = { + "user-agent": "terryt.dev (terry06890@gmail.com)", + "accept-encoding": "gzip", + } + try: + response = requests.get(url, headers=headers) + with open(outFile, 'wb') as file: + file.write(response.content) + time.sleep(1) + # https://en.wikipedia.org/wiki/Wikipedia:Database_download says to "throttle self to 1 cache miss per sec" + # It's unclear how to properly check for cache misses, so this just aims for 1 per sec + except Exception as e: + print(f"Error while downloading to {outFile}: {e}") +print("Closing database") +dbCon.close() diff --git a/backend/tolData/enwiki/genDescData.py b/backend/tolData/enwiki/genDescData.py new file mode 100755 index 0000000..b0ca272 --- /dev/null +++ b/backend/tolData/enwiki/genDescData.py @@ -0,0 +1,127 @@ +#!/usr/bin/python3 + +import sys, os, re +import bz2 +import html, mwxml, mwparserfromhell +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads through the wiki dump, and attempts to +parse short-descriptions, and add them to a database. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # Had about 22e6 pages +enwikiDb = "descData.db" +# In testing, this script took over 10 hours to run, and generated about 5GB + +descLineRegex = re.compile("^ *[A-Z'\"]") +embeddedHtmlRegex = re.compile(r"<[^<]+/>|<!--[^<]+-->|<[^</]+>([^<]*|[^<]*<[^<]+>[^<]*)</[^<]+>|<[^<]+$") + # Recognises a self-closing HTML tag, a tag with 0 children, tag with 1 child with 0 children, or unclosed tag +convertTemplateRegex = re.compile(r"{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}") +def convertTemplateReplace(match): + if match.group(2) == None: + return f"{match.group(1)} {match.group(4)}" + else: + return f"{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}" +parensGroupRegex = re.compile(r" \([^()]*\)") +leftoverBraceRegex = re.compile(r"(?:{\||{{).*") + +def parseDesc(text): + # Find first matching line outside {{...}}, [[...]], and block-html-comment constructs, + # and then accumulate lines until a blank one. 
+ # Some cases not accounted for include: disambiguation pages, abstracts with sentences split-across-lines, + # nested embedded html, 'content significant' embedded-html, markup not removable with mwparsefromhell, + lines = [] + openBraceCount = 0 + openBracketCount = 0 + inComment = False + skip = False + for line in text.splitlines(): + line = line.strip() + if len(lines) == 0: + if len(line) > 0: + if openBraceCount > 0 or line[0] == "{": + openBraceCount += line.count("{") + openBraceCount -= line.count("}") + skip = True + if openBracketCount > 0 or line[0] == "[": + openBracketCount += line.count("[") + openBracketCount -= line.count("]") + skip = True + if inComment or line.find("<!--") != -1: + if line.find("-->") != -1: + if inComment: + inComment = False + skip = True + else: + inComment = True + skip = True + if skip: + skip = False + continue + if line[-1] == ":": # Seems to help avoid disambiguation pages + return None + if descLineRegex.match(line) != None: + lines.append(line) + else: + if len(line) == 0: + return removeMarkup(" ".join(lines)) + lines.append(line) + if len(lines) > 0: + return removeMarkup(" ".join(lines)) + return None +def removeMarkup(content): + content = embeddedHtmlRegex.sub("", content) + content = convertTemplateRegex.sub(convertTemplateReplace, content) + content = mwparserfromhell.parse(content).strip_code() # Remove wikitext markup + content = parensGroupRegex.sub("", content) + content = leftoverBraceRegex.sub("", content) + return content +def convertTitle(title): + return html.unescape(title).replace("_", " ") + +print("Creating database") +if os.path.exists(enwikiDb): + raise Exception(f"ERROR: Existing {enwikiDb}") +dbCon = sqlite3.connect(enwikiDb) +dbCur = dbCon.cursor() +dbCur.execute("CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE)") +dbCur.execute("CREATE INDEX pages_title_idx ON pages(title COLLATE NOCASE)") +dbCur.execute("CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT)") 
+dbCur.execute("CREATE INDEX redirects_idx ON redirects(target)") +dbCur.execute("CREATE TABLE descs (id INT PRIMARY KEY, desc TEXT)") + +print("Iterating through dump file") +with bz2.open(dumpFile, mode='rt') as file: + dump = mwxml.Dump.from_file(file) + pageNum = 0 + for page in dump: + pageNum += 1 + if pageNum % 1e4 == 0: + print(f"At page {pageNum}") + if pageNum > 3e4: + break + # Parse page + if page.namespace == 0: + try: + dbCur.execute("INSERT INTO pages VALUES (?, ?)", (page.id, convertTitle(page.title))) + except sqlite3.IntegrityError as e: + # Accounts for certain pages that have the same title + print(f"Failed to add page with title \"{page.title}\": {e}", file=sys.stderr) + continue + if page.redirect != None: + dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (page.id, convertTitle(page.redirect))) + else: + revision = next(page) + desc = parseDesc(revision.text) + if desc != None: + dbCur.execute("INSERT INTO descs VALUES (?, ?)", (page.id, desc)) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/enwiki/genDumpIndexDb.py b/backend/tolData/enwiki/genDumpIndexDb.py new file mode 100755 index 0000000..3955885 --- /dev/null +++ b/backend/tolData/enwiki/genDumpIndexDb.py @@ -0,0 +1,58 @@ +#!/usr/bin/python3 + +import sys, os, re +import bz2 +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Adds data from the wiki dump index-file into a database. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +indexFile = "enwiki-20220501-pages-articles-multistream-index.txt.bz2" # Had about 22e6 lines +indexDb = "dumpIndex.db" + +if os.path.exists(indexDb): + raise Exception(f"ERROR: Existing {indexDb}") +print("Creating database") +dbCon = sqlite3.connect(indexDb) +dbCur = dbCon.cursor() +dbCur.execute("CREATE TABLE offsets (title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT)") + +print("Iterating through index file") +lineRegex = re.compile(r"([^:]+):([^:]+):(.*)") +lastOffset = 0 +lineNum = 0 +entriesToAdd = [] +with bz2.open(indexFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # + match = lineRegex.fullmatch(line.rstrip()) + (offset, pageId, title) = match.group(1,2,3) + offset = int(offset) + if offset > lastOffset: + for (t, p) in entriesToAdd: + try: + dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (t, p, lastOffset, offset)) + except sqlite3.IntegrityError as e: + # Accounts for certain entries in the file that have the same title + print(f"Failed on title \"{t}\": {e}", file=sys.stderr) + entriesToAdd = [] + lastOffset = offset + entriesToAdd.append([title, pageId]) +for (title, pageId) in entriesToAdd: + try: + dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (title, pageId, lastOffset, -1)) + except sqlite3.IntegrityError as e: + print(f"Failed on title \"{title}\": {e}", file=sys.stderr) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/enwiki/genImgData.py b/backend/tolData/enwiki/genImgData.py new file mode 100755 index 0000000..dedfe14 --- /dev/null +++ b/backend/tolData/enwiki/genImgData.py @@ -0,0 +1,190 @@ +#!/usr/bin/python3 + +import sys, re +import bz2, html, urllib.parse +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +For some set of page IDs, looks up their content in the wiki dump, +and tries to parse infobox image
names, storing them into a database. + +The program can be re-run with an updated set of page IDs, and +will skip already-processed page IDs. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +def getInputPageIds(): + pageIds = set() + dbCon = sqlite3.connect("../data.db") + dbCur = dbCon.cursor() + for (pageId,) in dbCur.execute("SELECT id from wiki_ids"): + pageIds.add(pageId) + dbCon.close() + return pageIds +dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" +indexDb = "dumpIndex.db" +imgDb = "imgData.db" # The database to create +idLineRegex = re.compile(r"<id>(.*)</id>") +imageLineRegex = re.compile(r".*\| *image *= *([^|]*)") +bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]") +imageNameRegex = re.compile(r".*\.(jpg|jpeg|png|gif|tiff|tif)", flags=re.IGNORECASE) +cssImgCropRegex = re.compile(r"{{css image crop\|image *= *(.*)", flags=re.IGNORECASE) +# In testing, got about 360k image names + +print("Getting input page-ids") +pageIds = getInputPageIds() +print(f"Found {len(pageIds)}") + +print("Opening databases") +indexDbCon = sqlite3.connect(indexDb) +indexDbCur = indexDbCon.cursor() +imgDbCon = sqlite3.connect(imgDb) +imgDbCur = imgDbCon.cursor() +print("Checking tables") +if imgDbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='page_imgs'").fetchone() == None: + # Create tables if not present + imgDbCur.execute("CREATE TABLE page_imgs (page_id INT PRIMARY KEY, img_name TEXT)") # img_name may be NULL + imgDbCur.execute("CREATE INDEX page_imgs_idx ON page_imgs(img_name)") +else: + # Check for already-processed page IDs + numSkipped = 0 + for (pid,) in imgDbCur.execute("SELECT page_id FROM page_imgs"): + if pid in pageIds: + pageIds.remove(pid) + numSkipped += 1 + else: + print(f"WARNING: Found already-processed page ID {pid} which was not in input set") + print(f"Will skip {numSkipped} already-processed page IDs") + +print("Getting dump-file offsets") +offsetToPageids = {} +offsetToEnd = 
{} # Maps chunk-start offsets to their chunk-end offsets +iterNum = 0 +for pageId in pageIds: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + query = "SELECT offset, next_offset FROM offsets WHERE id = ?" + row = indexDbCur.execute(query, (pageId,)).fetchone() + if row == None: + print(f"WARNING: Page ID {pageId} not found") + continue + (chunkOffset, endOffset) = row + offsetToEnd[chunkOffset] = endOffset + if chunkOffset not in offsetToPageids: + offsetToPageids[chunkOffset] = [] + offsetToPageids[chunkOffset].append(pageId) +print(f"Found {len(offsetToEnd)} chunks to check") + +print("Iterating through chunks in dump file") +def getImageName(content): + " Given an array of text-content lines, tries to return an infobox image name, or None " + # Doesn't try to find images in outside-infobox [[File:...]] and <imagemap> sections + for line in content: + match = imageLineRegex.match(line) + if match != None: + imageName = match.group(1).strip() + if imageName == "": + return None + imageName = html.unescape(imageName) + # Account for {{... 
+ if imageName.startswith("{"): + match = cssImgCropRegex.match(imageName) + if match == None: + return None + imageName = match.group(1) + # Account for [[File:...|...]] + if imageName.startswith("["): + match = bracketImageRegex.match(imageName) + if match == None: + return None + imageName = match.group(1) + # Account for <!-- + if imageName.find("<!--") != -1: + return None + # Remove an initial 'File:' + if imageName.startswith("File:"): + imageName = imageName[5:] + # Remove an initial 'Image:' + if imageName.startswith("Image:"): + imageName = imageName[6:] + # Check for extension + match = imageNameRegex.match(imageName) + if match != None: + imageName = match.group(0) + imageName = urllib.parse.unquote(imageName) + imageName = html.unescape(imageName) # Intentionally unescaping again (handles some odd cases) + imageName = imageName.replace("_", " ") + return imageName + # Exclude lines like: | image = <imagemap> + return None + return None +with open(dumpFile, mode='rb') as file: + iterNum = 0 + for (pageOffset, endOffset) in offsetToEnd.items(): + iterNum += 1 + if iterNum % 100 == 0: + print(f"At iteration {iterNum}") + # + pageIds = offsetToPageids[pageOffset] + # Jump to chunk + file.seek(pageOffset) + compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset) + data = bz2.BZ2Decompressor().decompress(compressedData).decode() + # Look in chunk for pages + lines = data.splitlines() + lineIdx = 0 + while lineIdx < len(lines): + # Look for <page> + if lines[lineIdx].lstrip() != "<page>": + lineIdx += 1 + continue + # Check page id + lineIdx += 3 + idLine = lines[lineIdx].lstrip() + match = idLineRegex.fullmatch(idLine) + if match == None or int(match.group(1)) not in pageIds: + lineIdx += 1 + continue + pageId = int(match.group(1)) + lineIdx += 1 + # Look for <text> in <page> + foundText = False + while lineIdx < len(lines): + if not lines[lineIdx].lstrip().startswith("<text "): + lineIdx += 1 + continue + foundText = True + # Get 
text content + content = [] + line = lines[lineIdx] + content.append(line[line.find(">") + 1:]) + lineIdx += 1 + foundTextEnd = False + while lineIdx < len(lines): + line = lines[lineIdx] + if not line.endswith("</text>"): + content.append(line) + lineIdx += 1 + continue + foundTextEnd = True + content.append(line[:line.rfind("</text>")]) + # Look for image-filename + imageName = getImageName(content) + imgDbCur.execute("INSERT into page_imgs VALUES (?, ?)", (pageId, imageName)) + break + if not foundTextEnd: + print(f"WARNING: Did not find </text> for page id {pageId}") + break + if not foundText: + print(f"WARNING: Did not find <text> for page id {pageId}") + +print("Closing databases") +indexDbCon.close() +imgDbCon.commit() +imgDbCon.close() diff --git a/backend/tolData/enwiki/lookupPage.py b/backend/tolData/enwiki/lookupPage.py new file mode 100755 index 0000000..1a90851 --- /dev/null +++ b/backend/tolData/enwiki/lookupPage.py @@ -0,0 +1,68 @@ +#!/usr/bin/python3 + +import sys, re +import bz2 +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} title1 + +Looks up a page with title title1 in the wiki dump, using +the dump-index db, and prints the corresponding <page>. +""" +if len(sys.argv) != 2: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" +indexDb = "dumpIndex.db" +pageTitle = sys.argv[1].replace("_", " ") + +print("Looking up offset in index db") +dbCon = sqlite3.connect(indexDb) +dbCur = dbCon.cursor() +query = "SELECT title, offset, next_offset FROM offsets WHERE title = ?" 
+row = dbCur.execute(query, (pageTitle,)).fetchone() +if row == None: + print("Title not found") + sys.exit(0) +_, pageOffset, endOffset = row +dbCon.close() +print(f"Found chunk at offset {pageOffset}") + +print("Reading from wiki dump") +content = [] +with open(dumpFile, mode='rb') as file: + # Get uncompressed chunk + file.seek(pageOffset) + compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset) + data = bz2.BZ2Decompressor().decompress(compressedData).decode() + # Look in chunk for page + lines = data.splitlines() + lineIdx = 0 + found = False + pageNum = 0 + while not found: + line = lines[lineIdx] + if line.lstrip() == "<page>": + pageNum += 1 + if pageNum > 100: + print("ERROR: Did not find title after 100 pages") + break + lineIdx += 1 + titleLine = lines[lineIdx] + if titleLine.lstrip() == '<title>' + pageTitle + '</title>': + found = True + print(f"Found title in chunk as page {pageNum}") + content.append(line) + content.append(titleLine) + while True: + lineIdx += 1 + line = lines[lineIdx] + content.append(line) + if line.lstrip() == "</page>": + break + lineIdx += 1 + +print("Content: ") +print("\n".join(content)) diff --git a/backend/tolData/eol/README.md b/backend/tolData/eol/README.md new file mode 100644 index 0000000..8c527a8 --- /dev/null +++ b/backend/tolData/eol/README.md @@ -0,0 +1,26 @@ +This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/). + +# Name Data Files +- vernacularNames.csv <br> + Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020). + Contains alternative-name data from EOL. + +# Image Metadata Files +- imagesList.tgz <br> + Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020). + Contains metadata for images from EOL. +- imagesList/ <br> + Extracted from imagesList.tgz. +- genImagesListDb.sh <br> + Creates a database, and imports imagesList/*.csv files into it. 
+- imagesList.db <br> + Created by running genImagesListDb.sh <br> + Tables: <br> + - `images`: + `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT` + +# Image Generation Files +- downloadImgs.py <br> + Used to download image files into imgsForReview/. +- reviewImgs.py <br> + Used to review images in imgsForReview/, moving acceptable ones into imgs/. diff --git a/backend/tolData/eol/downloadImgs.py b/backend/tolData/eol/downloadImgs.py new file mode 100755 index 0000000..96bc085 --- /dev/null +++ b/backend/tolData/eol/downloadImgs.py @@ -0,0 +1,147 @@ +#!/usr/bin/python3 + +import sys, re, os, random +import sqlite3 +import urllib.parse, requests +import time +from threading import Thread +import signal + +usageInfo = f""" +Usage: {sys.argv[0]} + +For some set of EOL IDs, downloads associated images from URLs in +an image-list database. Uses multiple downloading threads. + +May obtain multiple images per ID. The images will get names +with the form 'eolId1 contentId1.ext1'. + +SIGINT causes the program to finish ongoing downloads and exit. +The program can be re-run to continue downloading. It looks for +already-downloaded files, and continues after the one with +highest EOL ID. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) +# In testing, this downloaded about 70k images, over a few days + +imagesListDb = "imagesList.db" +def getInputEolIds(): + eolIds = set() + dbCon = sqlite3.connect("../data.db") + dbCur = dbCon.cursor() + for (id,) in dbCur.execute("SELECT id FROM eol_ids"): + eolIds.add(id) + dbCon.close() + return eolIds +outDir = "imgsForReview/" +MAX_IMGS_PER_ID = 3 +MAX_THREADS = 5 +POST_DL_DELAY_MIN = 2 # Minimum delay in seconds to pause after download before starting another (for each thread) +POST_DL_DELAY_MAX = 3 +LICENSE_REGEX = r"cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain" + +print("Getting input EOL IDs") +eolIds = getInputEolIds() +print("Getting EOL IDs to download for") +# Get IDs from images-list db +imgDbCon = sqlite3.connect(imagesListDb) +imgCur = imgDbCon.cursor() +imgListIds = set() +for (pageId,) in imgCur.execute("SELECT DISTINCT page_id FROM images"): + imgListIds.add(pageId) +# Get set intersection, and sort into list +eolIds = eolIds.intersection(imgListIds) +eolIds = sorted(eolIds) +print(f"Result: {len(eolIds)} EOL IDs") + +print("Checking output directory") +if not os.path.exists(outDir): + os.mkdir(outDir) +print("Finding next ID to download for") +nextIdx = 0 +fileList = os.listdir(outDir) +ids = [int(filename.split(" ")[0]) for filename in fileList] +if len(ids) > 0: + ids.sort() + nextIdx = eolIds.index(ids[-1]) + 1 +if nextIdx == len(eolIds): + print("No IDs left. 
Exiting...") + sys.exit(0) + +print("Starting download threads") +numThreads = 0 +threadException = None # Used for ending main thread after a non-main thread exception +# Handle SIGINT signals +interrupted = False +oldHandler = None +def onSigint(sig, frame): + global interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) +oldHandler = signal.signal(signal.SIGINT, onSigint) +# Function for threads to execute +def downloadImg(url, outFile): + global numThreads, threadException + try: + data = requests.get(url) + with open(outFile, 'wb') as file: + file.write(data.content) + time.sleep(random.random() * (POST_DL_DELAY_MAX - POST_DL_DELAY_MIN) + POST_DL_DELAY_MIN) + except Exception as e: + print(f"Error while downloading to {outFile}: {str(e)}", file=sys.stderr) + threadException = e + numThreads -= 1 +# Manage downloading +for idx in range(nextIdx, len(eolIds)): + eolId = eolIds[idx] + # Get image urls + imgDataList = [] + ownerSet = set() # Used to get images from different owners, for variety + exitLoop = False + query = "SELECT content_id, copy_url, license, copyright_owner FROM images WHERE page_id = ?" 
+ for (contentId, url, license, copyrightOwner) in imgCur.execute(query, (eolId,)): + if url.startswith("data/"): + url = "https://content.eol.org/" + url + urlParts = urllib.parse.urlparse(url) + extension = os.path.splitext(urlParts.path)[1] + if len(extension) <= 1: + print(f"WARNING: No filename extension found in URL {url}", file=sys.stderr) + continue + # Check image-quantity limit + if len(ownerSet) == MAX_IMGS_PER_ID: + break + # Check for skip conditions + if re.fullmatch(LICENSE_REGEX, license) == None: + continue + if len(copyrightOwner) > 100: # Avoid certain copyrightOwner fields that seem long and problematic + continue + if copyrightOwner in ownerSet: + continue + ownerSet.add(copyrightOwner) + # Determine output filename + outPath = f"{outDir}{eolId} {contentId}{extension}" + if os.path.exists(outPath): + print(f"WARNING: {outPath} already exists. Skipping download.") + continue + # Check thread limit + while numThreads == MAX_THREADS: + time.sleep(1) + # Wait for threads after an interrupt or thread-exception + if interrupted or threadException != None: + print("Waiting for existing threads to end") + while numThreads > 0: + time.sleep(1) + exitLoop = True + break + # Perform download + print(f"Downloading image to {outPath}") + numThreads += 1 + thread = Thread(target=downloadImg, args=(url, outPath), daemon=True) + thread.start() + if exitLoop: + break +# Close images-list db +print("Finished downloading") +imgDbCon.close() diff --git a/backend/tolData/eol/genImagesListDb.sh b/backend/tolData/eol/genImagesListDb.sh new file mode 100755 index 0000000..87dd840 --- /dev/null +++ b/backend/tolData/eol/genImagesListDb.sh @@ -0,0 +1,12 @@ +#!/bin/bash +set -e + +# Combine CSV files into one, skipping header lines +cat imagesList/media_*_{1..58}.csv | tail -n +2 > imagesList.csv +# Create database, and import the CSV file +sqlite3 imagesList.db <<END +CREATE TABLE images ( + content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, 
license TEXT, copyright_owner TEXT); +.mode csv +.import 'imagesList.csv' images +END diff --git a/backend/tolData/eol/reviewImgs.py b/backend/tolData/eol/reviewImgs.py new file mode 100755 index 0000000..ecdf7ab --- /dev/null +++ b/backend/tolData/eol/reviewImgs.py @@ -0,0 +1,205 @@ +#!/usr/bin/python3 + +import sys, re, os, time +import sqlite3 +import tkinter as tki +from tkinter import ttk +import PIL +from PIL import ImageTk, Image, ImageOps + +usageInfo = f""" +Usage: {sys.argv[0]} + +Provides a GUI for reviewing images. Looks in a for-review directory for +images named 'eolId1 contentId1.ext1', and, for each EOL ID, enables the user to +choose an image to keep, or reject all. Also provides image rotation. +Chosen images are placed in another directory, and rejected ones are deleted. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +imgDir = "imgsForReview/" +outDir = "imgs/" +extraInfoDbCon = sqlite3.connect("../data.db") +extraInfoDbCur = extraInfoDbCon.cursor() +def getExtraInfo(eolId): + global extraInfoDbCur + query = "SELECT names.alt_name FROM" \ + " names INNER JOIN eol_ids ON eol_ids.name = names.name" \ + " WHERE id = ? 
and pref_alt = 1" + row = extraInfoDbCur.execute(query, (eolId,)).fetchone() + if row != None: + return f"Reviewing EOL ID {eolId}, aka \"{row[0]}\"" + else: + return f"Reviewing EOL ID {eolId}" +IMG_DISPLAY_SZ = 400 +MAX_IMGS_PER_ID = 3 +IMG_BG_COLOR = (88, 28, 135) +PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), IMG_BG_COLOR) + +print("Checking output directory") +if not os.path.exists(outDir): + os.mkdir(outDir) +print("Getting input image list") +imgList = os.listdir(imgDir) +imgList.sort(key=lambda s: int(s.split(" ")[0])) +if len(imgList) == 0: + print("No input images found") + sys.exit(0) + +class EolImgReviewer: + " Provides the GUI for reviewing images " + def __init__(self, root, imgList): + self.root = root + root.title("EOL Image Reviewer") + # Setup main frame + mainFrame = ttk.Frame(root, padding="5 5 5 5") + mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S)) + root.columnconfigure(0, weight=1) + root.rowconfigure(0, weight=1) + # Set up images-to-be-reviewed frames + self.imgs = [PLACEHOLDER_IMG] * MAX_IMGS_PER_ID # Stored as fields for use in rotation + self.photoImgs = list(map(lambda img: ImageTk.PhotoImage(img), self.imgs)) # Image objects usable by tkinter + # These need a persistent reference for some reason (doesn't display otherwise) + self.labels = [] + for i in range(MAX_IMGS_PER_ID): + frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ) + frame.grid(column=i, row=0) + label = ttk.Label(frame, image=self.photoImgs[i]) + label.grid(column=0, row=0) + self.labels.append(label) + # Add padding + for child in mainFrame.winfo_children(): + child.grid_configure(padx=5, pady=5) + # Add keyboard bindings + root.bind("<q>", self.quit) + root.bind("<Key-j>", lambda evt: self.accept(0)) + root.bind("<Key-k>", lambda evt: self.accept(1)) + root.bind("<Key-l>", lambda evt: self.accept(2)) + root.bind("<Key-i>", lambda evt: self.reject()) + root.bind("<Key-a>", lambda evt: self.rotate(0)) + 
root.bind("<Key-s>", lambda evt: self.rotate(1)) + root.bind("<Key-d>", lambda evt: self.rotate(2)) + root.bind("<Key-A>", lambda evt: self.rotate(0, True)) + root.bind("<Key-S>", lambda evt: self.rotate(1, True)) + root.bind("<Key-D>", lambda evt: self.rotate(2, True)) + # Initialise images to review + self.imgList = imgList + self.imgListIdx = 0 + self.nextEolId = 0 + self.nextImgNames = [] + self.rotations = [] + self.getNextImgs() + # For displaying extra info + self.numReviewed = 0 + self.startTime = time.time() + def getNextImgs(self): + " Updates display with new images to review, or ends program " + # Gather names of next images to review + for i in range(MAX_IMGS_PER_ID): + if self.imgListIdx == len(self.imgList): + if i == 0: + self.quit() + return + break + imgName = self.imgList[self.imgListIdx] + eolId = int(re.match(r"(\d+) (\d+)", imgName).group(1)) + if i == 0: + self.nextEolId = eolId + self.nextImgNames = [imgName] + self.rotations = [0] + else: + if self.nextEolId != eolId: + break + self.nextImgNames.append(imgName) + self.rotations.append(0) + self.imgListIdx += 1 + # Update displayed images + idx = 0 + while idx < MAX_IMGS_PER_ID: + if idx < len(self.nextImgNames): + try: + img = Image.open(imgDir + self.nextImgNames[idx]) + img = ImageOps.exif_transpose(img) + except PIL.UnidentifiedImageError: + os.remove(imgDir + self.nextImgNames[idx]) + del self.nextImgNames[idx] + del self.rotations[idx] + continue + self.imgs[idx] = self.resizeImgForDisplay(img) + else: + self.imgs[idx] = PLACEHOLDER_IMG + self.photoImgs[idx] = ImageTk.PhotoImage(self.imgs[idx]) + self.labels[idx].config(image=self.photoImgs[idx]) + idx += 1 + # Restart if all image files non-recognisable + if len(self.nextImgNames) == 0: + self.getNextImgs() + return + # Update title + firstImgIdx = self.imgListIdx - len(self.nextImgNames) + 1 + lastImgIdx = self.imgListIdx + title = getExtraInfo(self.nextEolId) + title += f" (imgs {firstImgIdx} to {lastImgIdx} out of 
{len(self.imgList)})" + self.root.title(title) + def accept(self, imgIdx): + " React to a user selecting an image " + if imgIdx >= len(self.nextImgNames): + print("Invalid selection") + return + for i in range(len(self.nextImgNames)): + inFile = imgDir + self.nextImgNames[i] + if i == imgIdx: # Move accepted image, rotating if needed + outFile = outDir + self.nextImgNames[i] + img = Image.open(inFile) + img = ImageOps.exif_transpose(img) + if self.rotations[i] != 0: + img = img.rotate(self.rotations[i], expand=True) + img.save(outFile) + os.remove(inFile) + else: # Delete non-accepted image + os.remove(inFile) + self.numReviewed += 1 + self.getNextImgs() + def reject(self): + " React to a user rejecting all images of a set " + for i in range(len(self.nextImgNames)): + os.remove(imgDir + self.nextImgNames[i]) + self.numReviewed += 1 + self.getNextImgs() + def rotate(self, imgIdx, anticlockwise = False): + " Respond to a user rotating an image " + deg = -90 if not anticlockwise else 90 + self.imgs[imgIdx] = self.imgs[imgIdx].rotate(deg) + self.photoImgs[imgIdx] = ImageTk.PhotoImage(self.imgs[imgIdx]) + self.labels[imgIdx].config(image=self.photoImgs[imgIdx]) + self.rotations[imgIdx] = (self.rotations[imgIdx] + deg) % 360 + def quit(self, e = None): + global extraInfoDbCon + print(f"Number reviewed: {self.numReviewed}") + timeElapsed = time.time() - self.startTime + print(f"Time elapsed: {timeElapsed:.2f} seconds") + if self.numReviewed > 0: + print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds") + extraInfoDbCon.close() + self.root.destroy() + def resizeImgForDisplay(self, img): + " Returns a copy of an image, shrunk to fit in its frame (keeps aspect ratio), and with a background " + if max(img.width, img.height) > IMG_DISPLAY_SZ: + if (img.width > img.height): + newHeight = int(img.height * IMG_DISPLAY_SZ/img.width) + img = img.resize((IMG_DISPLAY_SZ, newHeight)) + else: + newWidth = int(img.width * IMG_DISPLAY_SZ / img.height) + img =
img.resize((newWidth, IMG_DISPLAY_SZ)) + bgImg = PLACEHOLDER_IMG.copy() + bgImg.paste(img, box=( + int((IMG_DISPLAY_SZ - img.width) / 2), + int((IMG_DISPLAY_SZ - img.height) / 2))) + return bgImg +# Create GUI and defer control +print("Starting GUI") +root = tki.Tk() +EolImgReviewer(root, imgList) +root.mainloop() diff --git a/backend/tolData/genDbpData.py b/backend/tolData/genDbpData.py new file mode 100755 index 0000000..df3a6be --- /dev/null +++ b/backend/tolData/genDbpData.py @@ -0,0 +1,247 @@ +#!/usr/bin/python3 + +import sys, os, re +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads a database containing data from DBpedia, and tries to associate +DBpedia IRIs with nodes in a database, adding short-descriptions for them. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +dbpediaDb = "dbpedia/descData.db" +namesToSkipFile = "pickedEnwikiNamesToSkip.txt" +pickedLabelsFile = "pickedDbpLabels.txt" +dbFile = "data.db" +rootNodeName = "cellular organisms" +rootLabel = "organism" # Will be associated with root node +# Got about 400k descriptions when testing + +print("Opening databases") +dbpCon = sqlite3.connect(dbpediaDb) +dbpCur = dbpCon.cursor() +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +print("Getting node names") +nodeNames = set() +for (name,) in dbCur.execute("SELECT name from nodes"): + nodeNames.add(name) + +print("Checking for names to skip") +oldSz = len(nodeNames) +if os.path.exists(namesToSkipFile): + with open(namesToSkipFile) as file: + for line in file: + nodeNames.remove(line.rstrip()) +print(f"Skipping {oldSz - len(nodeNames)} nodes") + +print("Reading disambiguation-page labels") +disambigLabels = set() +query = "SELECT labels.iri from labels INNER JOIN disambiguations ON labels.iri = disambiguations.iri" +for (label,) in dbpCur.execute(query): + disambigLabels.add(label) + +print("Trying to associate nodes with DBpedia labels") +nodeToLabel = {} +nameVariantRegex = re.compile(r"(.*) 
\(([^)]+)\)") # Used to recognise labels like 'Thor (shrimp)' +nameToVariants = {} # Maps node names to lists of matching labels +iterNum = 0 +for (label,) in dbpCur.execute("SELECT label from labels"): + iterNum += 1 + if iterNum % 1e5 == 0: + print(f"At iteration {iterNum}") + # + if label in disambigLabels: + continue + name = label.lower() + if name in nodeNames: + if name not in nameToVariants: + nameToVariants[name] = [label] + elif label not in nameToVariants[name]: + nameToVariants[name].append(label) + else: + match = nameVariantRegex.fullmatch(name) + if match != None: + subName = match.group(1) + if subName in nodeNames and match.group(2) != "disambiguation": + if subName not in nameToVariants: + nameToVariants[subName] = [label] + elif label not in nameToVariants[subName]: + nameToVariants[subName].append(label) +# Associate labels without conflicts +for (name, variants) in nameToVariants.items(): + if len(variants) == 1: + nodeToLabel[name] = variants[0] +for name in nodeToLabel: + del nameToVariants[name] +# Special case for root node +nodeToLabel[rootNodeName] = rootLabel +if rootNodeName in nameToVariants: + del nameToVariants[rootNodeName] + +print(f"Trying to resolve {len(nameToVariants)} conflicts") +def resolveWithPickedLabels(): + " Attempts to resolve conflicts using a picked-names file " + with open(pickedLabelsFile) as file: + for line in file: + (name, _, label) = line.rstrip().partition("|") + if name not in nameToVariants: + print(f"WARNING: No conflict found for name \"{name}\"", file=sys.stderr) + continue + if label == "": + del nameToVariants[name] + else: + if label not in nameToVariants[name]: + print(f"INFO: Picked label \"{label}\" for name \"{name}\" outside choice set", file=sys.stderr) + nodeToLabel[name] = label + del nameToVariants[name] +def resolveWithCategoryList(): + """ + Attempts to resolve conflicts by looking for labels like 'name1 (category1)', + and choosing those with a category1 that seems 'biological'.
+ Does two passes, using more generic categories first. This helps avoid stuff like + Pan being classified as a horse instead of an ape. + """ + generalCategories = { + "species", "genus", + "plant", "fungus", "animal", + "annelid", "mollusc", "arthropod", "crustacean", "insect", "bug", + "fish", "amphibian", "reptile", "bird", "mammal", + } + specificCategories = { + "protist", "alveolate", "dinoflagellates", + "orchid", "poaceae", "fern", "moss", "alga", + "bryozoan", "hydrozoan", + "sponge", "cnidarian", "coral", "polychaete", "echinoderm", + "bivalve", "gastropod", "chiton", + "shrimp", "decapod", "crab", "barnacle", "copepod", + "arachnid", "spider", "harvestman", "mite", + "dragonfly", "mantis", "cicada", "grasshopper", "planthopper", + "beetle", "fly", "butterfly", "moth", "wasp", + "catfish", + "frog", + "lizard", + "horse", "sheep", "cattle", "mouse", + } + namesToRemove = set() + for (name, variants) in nameToVariants.items(): + found = False + for label in variants: + match = nameVariantRegex.match(label) + if match != None and match.group(2) in generalCategories: + nodeToLabel[name] = label + namesToRemove.add(name) + found = True + break + if not found: + for label in variants: + match = nameVariantRegex.match(label) + if match != None and match.group(2) in specificCategories: + nodeToLabel[name] = label + namesToRemove.add(name) + break + for name in namesToRemove: + del nameToVariants[name] +def resolveWithTypeData(): + " Attempts to resolve conflicts using DBpedia's type data " + taxonTypes = { # Obtained from the DBpedia ontology + "http://dbpedia.org/ontology/Species", + "http://dbpedia.org/ontology/Archaea", + "http://dbpedia.org/ontology/Bacteria", + "http://dbpedia.org/ontology/Eukaryote", + "http://dbpedia.org/ontology/Plant", + "http://dbpedia.org/ontology/ClubMoss", + "http://dbpedia.org/ontology/Conifer", + "http://dbpedia.org/ontology/CultivatedVariety", + "http://dbpedia.org/ontology/Cycad", + "http://dbpedia.org/ontology/Fern", + 
"http://dbpedia.org/ontology/FloweringPlant", + "http://dbpedia.org/ontology/Grape", + "http://dbpedia.org/ontology/Ginkgo", + "http://dbpedia.org/ontology/Gnetophytes", + "http://dbpedia.org/ontology/GreenAlga", + "http://dbpedia.org/ontology/Moss", + "http://dbpedia.org/ontology/Fungus", + "http://dbpedia.org/ontology/Animal", + "http://dbpedia.org/ontology/Fish", + "http://dbpedia.org/ontology/Crustacean", + "http://dbpedia.org/ontology/Mollusca", + "http://dbpedia.org/ontology/Insect", + "http://dbpedia.org/ontology/Arachnid", + "http://dbpedia.org/ontology/Amphibian", + "http://dbpedia.org/ontology/Reptile", + "http://dbpedia.org/ontology/Bird", + "http://dbpedia.org/ontology/Mammal", + "http://dbpedia.org/ontology/Cat", + "http://dbpedia.org/ontology/Dog", + "http://dbpedia.org/ontology/Horse", + } + iterNum = 0 + for (label, type) in dbpCur.execute("SELECT label, type from labels INNER JOIN types on labels.iri = types.iri"): + iterNum += 1 + if iterNum % 1e5 == 0: + print(f"At iteration {iterNum}") + # + if type in taxonTypes: + name = label.lower() + if name in nameToVariants: + nodeToLabel[name] = label + del nameToVariants[name] + else: + match = nameVariantRegex.fullmatch(name) + if match != None: + name = match.group(1) + if name in nameToVariants: + nodeToLabel[name] = label + del nameToVariants[name] +#resolveWithTypeData() +#resolveWithCategoryList() +resolveWithPickedLabels() +print(f"Remaining number of conflicts: {len(nameToVariants)}") + +print("Getting node IRIs") +nodeToIri = {} +for (name, label) in nodeToLabel.items(): + (iri,) = dbpCur.execute("SELECT iri FROM labels where label = ? 
COLLATE NOCASE", (label,)).fetchone() + nodeToIri[name] = iri + +print("Resolving redirects") +redirectingIriSet = set() +iterNum = 0 +for (name, iri) in nodeToIri.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + row = dbpCur.execute("SELECT target FROM redirects where iri = ?", (iri,)).fetchone() + if row != None: + nodeToIri[name] = row[0] + redirectingIriSet.add(name) + +print("Adding description tables") +dbCur.execute("CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT)") +dbCur.execute("CREATE INDEX wiki_id_idx ON wiki_ids(id)") +dbCur.execute("CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT)") +iterNum = 0 +for (name, iri) in nodeToIri.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + query = "SELECT abstract, id FROM abstracts INNER JOIN ids ON abstracts.iri = ids.iri WHERE ids.iri = ?" + row = dbpCur.execute(query, (iri,)).fetchone() + if row != None: + desc, wikiId = row + dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, wikiId, 1 if name in redirectingIriSet else 0)) + dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (wikiId, desc, 1)) + +print("Closing databases") +dbCon.commit() +dbCon.close() +dbpCon.commit() +dbpCon.close() diff --git a/backend/tolData/genEnwikiDescData.py b/backend/tolData/genEnwikiDescData.py new file mode 100755 index 0000000..d3f93ed --- /dev/null +++ b/backend/tolData/genEnwikiDescData.py @@ -0,0 +1,102 @@ +#!/usr/bin/python3 + +import sys, re, os +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads a database containing data from Wikipedia, and tries to associate +wiki pages with nodes in the database, and add descriptions for nodes +that don't have them. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +enwikiDb = "enwiki/descData.db" +dbFile = "data.db" +namesToSkipFile = "pickedEnwikiNamesToSkip.txt" +pickedLabelsFile = "pickedEnwikiLabels.txt" +# Got about 25k descriptions when testing + +print("Opening databases") +enwikiCon = sqlite3.connect(enwikiDb) +enwikiCur = enwikiCon.cursor() +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +print("Checking for names to skip") +namesToSkip = set() +if os.path.exists(namesToSkipFile): + with open(namesToSkipFile) as file: + for line in file: + namesToSkip.add(line.rstrip()) + print(f"Found {len(namesToSkip)}") +print("Checking for picked-titles") +nameToPickedTitle = {} +if os.path.exists(pickedLabelsFile): + with open(pickedLabelsFile) as file: + for line in file: + (name, _, title) = line.rstrip().partition("|") + nameToPickedTitle[name.lower()] = title +print(f"Found {len(nameToPickedTitle)}") + +print("Getting names of nodes without descriptions") +nodeNames = set() +query = "SELECT nodes.name FROM nodes LEFT JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id IS NULL" +for (name,) in dbCur.execute(query): + nodeNames.add(name) +print(f"Found {len(nodeNames)}") +nodeNames.difference_update(namesToSkip) + +print("Associating nodes with page IDs") +nodeToPageId = {} +iterNum = 0 +for name in nodeNames: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + if name not in nameToPickedTitle: + row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ? 
COLLATE NOCASE", (name,)).fetchone() + if row != None: + nodeToPageId[name] = row[0] + else: + title = nameToPickedTitle[name] + row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ?", (title,)).fetchone() + if row != None: + nodeToPageId[name] = row[0] + else: + print(f"WARNING: Picked title {title} not found", file=sys.stderr) + +print("Resolving redirects") +redirectingNames = set() +iterNum = 0 +for (name, pageId) in nodeToPageId.items(): + iterNum += 1 + if iterNum % 1e3 == 0: + print(f"At iteration {iterNum}") + # + query = "SELECT pages.id FROM redirects INNER JOIN pages ON redirects.target = pages.title WHERE redirects.id = ?" + row = enwikiCur.execute(query, (pageId,)).fetchone() + if row != None: + nodeToPageId[name] = row[0] + redirectingNames.add(name) + +print("Adding description data") +iterNum = 0 +for (name, pageId) in nodeToPageId.items(): + iterNum += 1 + if iterNum % 1e3 == 0: + print(f"At iteration {iterNum}") + # + row = enwikiCur.execute("SELECT desc FROM descs where descs.id = ?", (pageId,)).fetchone() + if row != None: + dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, pageId, 1 if name in redirectingNames else 0)) + dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (pageId, row[0], 0)) + +print("Closing databases") +dbCon.commit() +dbCon.close() +enwikiCon.close() diff --git a/backend/tolData/genEnwikiNameData.py b/backend/tolData/genEnwikiNameData.py new file mode 100755 index 0000000..7ad61d1 --- /dev/null +++ b/backend/tolData/genEnwikiNameData.py @@ -0,0 +1,76 @@ +#!/usr/bin/python3 + +import sys, re +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads from a database containing data from Wikipedia, along with +node and wiki-id information from the database, and uses wikipedia +page-redirect information to add additional alt-name data.
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +enwikiDb = "enwiki/descData.db" +dbFile = "data.db" +altNameRegex = re.compile(r"[a-zA-Z]+") + # Avoids names like 'Evolution of Elephants', 'Banana fiber', 'Fish (zoology)', + +print("Opening databases") +enwikiCon = sqlite3.connect(enwikiDb) +enwikiCur = enwikiCon.cursor() +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +print("Getting nodes with wiki IDs") +nodeToWikiId = {} +for (nodeName, wikiId) in dbCur.execute("SELECT name, id from wiki_ids"): + nodeToWikiId[nodeName] = wikiId +print(f"Found {len(nodeToWikiId)}") + +print("Iterating through nodes, finding names that redirect to them") +nodeToAltNames = {} +numAltNames = 0 +iterNum = 0 +for (nodeName, wikiId) in nodeToWikiId.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + nodeToAltNames[nodeName] = set() + query = "SELECT p1.title FROM pages p1" \ + " INNER JOIN redirects r1 ON p1.id = r1.id" \ + " INNER JOIN pages p2 ON r1.target = p2.title WHERE p2.id = ?" 
+ for (name,) in enwikiCur.execute(query, (wikiId,)): + if altNameRegex.fullmatch(name) != None and name.lower() != nodeName: + nodeToAltNames[nodeName].add(name.lower()) + numAltNames += 1 +print(f"Found {numAltNames} alt-names") + +print("Excluding existing alt-names from the set") +query = "SELECT alt_name FROM names WHERE alt_name IN ({})" +iterNum = 0 +for (nodeName, altNames) in nodeToAltNames.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + existingNames = set() + for (name,) in dbCur.execute(query.format(",".join(["?"] * len(altNames))), list(altNames)): + existingNames.add(name) + numAltNames -= len(existingNames) + altNames.difference_update(existingNames) +print(f"Left with {numAltNames} alt-names") + +print("Adding alt-names to database") +for (nodeName, altNames) in nodeToAltNames.items(): + for altName in altNames: + dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'enwiki')", (nodeName, altName, 0)) + +print("Closing databases") +dbCon.commit() +dbCon.close() +enwikiCon.close() diff --git a/backend/tolData/genEolNameData.py b/backend/tolData/genEolNameData.py new file mode 100755 index 0000000..dd33ee0 --- /dev/null +++ b/backend/tolData/genEolNameData.py @@ -0,0 +1,184 @@ +#!/usr/bin/python3 + +import sys, re, os +import html, csv, sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads files describing name data from the 'Encyclopedia of Life' site, +tries to associate names with nodes in the database, and adds tables +to represent associated names. + +Reads a vernacularNames.csv file: + Starts with a header line containing: + page_id, canonical_form, vernacular_string, language_code, + resource_name, is_preferred_by_resource, is_preferred_by_eol + The canonical_form and vernacular_string fields contain names + associated with the page ID. Names are not always unique to + particular page IDs. 
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +vnamesFile = "eol/vernacularNames.csv" # Had about 2.8e6 entries +dbFile = "data.db" +namesToSkip = {"unknown", "unknown species", "unidentified species"} +pickedIdsFile = "pickedEolIds.txt" +altsToSkipFile = "pickedEolAltsToSkip.txt" + +print("Reading in vernacular-names data") +nameToPids = {} # 'pid' means 'Page ID' +canonicalNameToPids = {} +pidToNames = {} +pidToPreferred = {} # Maps pids to 'preferred' names +def updateMaps(name, pid, canonical, preferredAlt): + global namesToSkip, nameToPids, canonicalNameToPids, pidToNames, pidToPreferred + if name in namesToSkip: + return + if name not in nameToPids: + nameToPids[name] = {pid} + else: + nameToPids[name].add(pid) + if canonical: + if name not in canonicalNameToPids: + canonicalNameToPids[name] = {pid} + else: + canonicalNameToPids[name].add(pid) + if pid not in pidToNames: + pidToNames[pid] = {name} + else: + pidToNames[pid].add(name) + if preferredAlt: + pidToPreferred[pid] = name +with open(vnamesFile, newline="") as csvfile: + reader = csv.reader(csvfile) + lineNum = 0 + for row in reader: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f"At line {lineNum}") + # Skip header line + if lineNum == 1: + continue + # Parse line + pid = int(row[0]) + name1 = re.sub(r"<[^>]+>", "", row[1].lower()) # Remove tags + name2 = html.unescape(row[2]).lower() + lang = row[3] + preferred = row[6] == "preferred" + # Add to maps + updateMaps(name1, pid, True, False) + if lang == "eng" and name2 != "": + updateMaps(name2, pid, False, preferred) + +print("Checking for manually-picked pids") +nameToPickedPid = {} +if os.path.exists(pickedIdsFile): + with open(pickedIdsFile) as file: + for line in file: + (name, _, eolId) = line.rstrip().partition("|") + nameToPickedPid[name] = None if eolId == "" else int(eolId) +print(f"Found {len(nameToPickedPid)}") + +print("Checking for alt-names to skip") +nameToAltsToSkip = {} +numToSkip = 0 +if 
os.path.exists(altsToSkipFile): + with open(altsToSkipFile) as file: + for line in file: + (name, _, altName) = line.rstrip().partition("|") + if name not in nameToAltsToSkip: + nameToAltsToSkip[name] = [altName] + else: + nameToAltsToSkip[name].append(altName) + numToSkip += 1 +print(f"Found {numToSkip} alt-names to skip") + +print("Creating database tables") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() +dbCur.execute("CREATE TABLE names(name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name))") +dbCur.execute("CREATE INDEX names_idx ON names(name)") +dbCur.execute("CREATE INDEX names_alt_idx ON names(alt_name)") +dbCur.execute("CREATE INDEX names_alt_idx_nc ON names(alt_name COLLATE NOCASE)") +dbCur.execute("CREATE TABLE eol_ids(id INT PRIMARY KEY, name TEXT)") +dbCur.execute("CREATE INDEX eol_name_idx ON eol_ids(name)") + +print("Associating nodes with names") +usedPids = set() +unresolvedNodeNames = set() +dbCur2 = dbCon.cursor() +def addToDb(nodeName, pidToUse): + " Adds page-ID-associated name data to a node in the database " + global dbCur, pidToPreferred + dbCur.execute("INSERT INTO eol_ids VALUES (?, ?)", (pidToUse, nodeName)) + # Get alt-names + altNames = set() + for n in pidToNames[pidToUse]: + # Avoid alt-names with >3 words + if len(n.split(" ")) > 3: + continue + # Avoid alt-names that already name a node in the database + if dbCur.execute("SELECT name FROM nodes WHERE name = ?", (n,)).fetchone() != None: + continue + # Check for picked alt-name-to-skip + if nodeName in nameToAltsToSkip and n in nameToAltsToSkip[nodeName]: + print(f"Excluding alt-name {n} for node {nodeName}") + continue + # + altNames.add(n) + # Add alt-names to db + preferredName = pidToPreferred[pidToUse] if (pidToUse in pidToPreferred) else None + for n in altNames: + isPreferred = 1 if (n == preferredName) else 0 + dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'eol')", (nodeName, n, isPreferred)) +print("Adding picked IDs") +for (name, pid) 
in nameToPickedPid.items(): + if pid != None: + addToDb(name, pid) + usedPids.add(pid) +print("Associating nodes with canonical names") +iterNum = 0 +for (nodeName,) in dbCur2.execute("SELECT name FROM nodes"): + iterNum += 1 + if iterNum % 1e5 == 0: + print(f"At iteration {iterNum}") + if nodeName in nameToPickedPid: + continue + # Check for matching canonical name + if nodeName in canonicalNameToPids: + pidToUse = None + # Pick an associated page ID + for pid in canonicalNameToPids[nodeName]: + hasLowerPrio = pid not in pidToPreferred and pidToUse in pidToPreferred + hasHigherPrio = pid in pidToPreferred and pidToUse not in pidToPreferred + if hasLowerPrio: + continue + if pid not in usedPids and (pidToUse == None or pid < pidToUse or hasHigherPrio): + pidToUse = pid + if pidToUse != None: + addToDb(nodeName, pidToUse) + usedPids.add(pidToUse) + elif nodeName in nameToPids: + unresolvedNodeNames.add(nodeName) +print("Associating leftover nodes with other names") +iterNum = 0 +for nodeName in unresolvedNodeNames: + iterNum += 1 + if iterNum % 100 == 0: + print(f"At iteration {iterNum}") + # Check for matching name + pidToUse = None + for pid in nameToPids[nodeName]: + # Pick an associated page ID + if pid not in usedPids and (pidToUse == None or pid < pidToUse): + pidToUse = pid + if pidToUse != None: + addToDb(nodeName, pidToUse) + usedPids.add(pidToUse) + +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/genImgs.py b/backend/tolData/genImgs.py new file mode 100755 index 0000000..ecca8e0 --- /dev/null +++ b/backend/tolData/genImgs.py @@ -0,0 +1,191 @@ +#!/usr/bin/python3 + +import sys, os, subprocess +import sqlite3, urllib.parse +import signal + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads node IDs and image paths from a file, and possibly from a directory, +and generates cropped/resized versions of those images into a directory, +with names of the form 'nodeId1.jpg'. Also adds image metadata to the +database. 
+ +SIGINT can be used to stop, and the program can be re-run to continue +processing. It uses already-existing database entries to decide what +to skip. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +imgListFile = "imgList.txt" +outDir = "img/" +eolImgDb = "eol/imagesList.db" +enwikiImgDb = "enwiki/imgData.db" +pickedImgsDir = "pickedImgs/" +pickedImgsFilename = "imgData.txt" +dbFile = "data.db" +IMG_OUT_SZ = 200 +genImgFiles = True # Usable for debugging + +if not os.path.exists(outDir): + os.mkdir(outDir) + +print("Opening databases") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() +eolCon = sqlite3.connect(eolImgDb) +eolCur = eolCon.cursor() +enwikiCon = sqlite3.connect(enwikiImgDb) +enwikiCur = enwikiCon.cursor() +print("Checking for picked-images") +nodeToPickedImg = {} +if os.path.exists(pickedImgsDir + pickedImgsFilename): + lineNum = 0 + with open(pickedImgsDir + pickedImgsFilename) as file: + for line in file: + lineNum += 1 + (filename, url, license, artist, credit) = line.rstrip().split("|") + nodeName = os.path.splitext(filename)[0] # Remove extension + (otolId,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (nodeName,)).fetchone() + nodeToPickedImg[otolId] = { + "nodeName": nodeName, "id": lineNum, + "filename": filename, "url": url, "license": license, "artist": artist, "credit": credit, + } + +print("Checking for image tables") +nodesDone = set() +imgsDone = set() +if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='node_imgs'").fetchone() == None: + # Add image tables if not present + dbCur.execute("CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT)") + dbCur.execute("CREATE TABLE images" \ + " (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src))") +else: + # Get existing image-associated nodes + for (otolId,) in dbCur.execute("SELECT nodes.id FROM node_imgs INNER JOIN nodes ON node_imgs.name = nodes.name"): + 
nodesDone.add(otolId) + # Get existing node-associated images + for (imgId, imgSrc) in dbCur.execute("SELECT id, src from images"): + imgsDone.add((imgId, imgSrc)) + print(f"Found {len(nodesDone)} nodes and {len(imgsDone)} images to skip") + +# Set SIGINT handler +interrupted = False +def onSigint(sig, frame): + global interrupted + interrupted = True +signal.signal(signal.SIGINT, onSigint) + +print("Iterating through input images") +def quit(): + print("Closing databases") + dbCon.commit() + dbCon.close() + eolCon.close() + enwikiCon.close() + sys.exit(0) +def convertImage(imgPath, outPath): + print(f"Converting {imgPath} to {outPath}") + if os.path.exists(outPath): + print(f"ERROR: Output image already exists") + return False + try: + completedProcess = subprocess.run( + ['npx', 'smartcrop-cli', '--width', str(IMG_OUT_SZ), '--height', str(IMG_OUT_SZ), imgPath, outPath], + stdout=subprocess.DEVNULL + ) + except Exception as e: + print(f"ERROR: Exception while attempting to run smartcrop: {e}") + return False + if completedProcess.returncode != 0: + print(f"ERROR: smartcrop had exit status {completedProcess.returncode}") + return False + return True +print("Processing picked-images") +for (otolId, imgData) in nodeToPickedImg.items(): + # Check for SIGINT event + if interrupted: + print("Exiting") + quit() + # Skip if already processed + if otolId in nodesDone: + continue + # Convert image + if genImgFiles: + success = convertImage(pickedImgsDir + imgData["filename"], outDir + otolId + ".jpg") + if not success: + quit() + else: + print(f"Processing {imgData['nodeName']}: {otolId}.jpg") + # Add entry to db + if (imgData["id"], "picked") not in imgsDone: + dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)", + (imgData["id"], "picked", imgData["url"], imgData["license"], imgData["artist"], imgData["credit"])) + imgsDone.add((imgData["id"], "picked")) + dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (imgData["nodeName"], imgData["id"], "picked")) + 
nodesDone.add(otolId) +print("Processing images from eol and enwiki") +iterNum = 0 +with open(imgListFile) as file: + for line in file: + iterNum += 1 + # Check for SIGINT event + if interrupted: + print("Exiting") + break + # Skip lines without an image path + if line.find(" ") == -1: + continue + # Get filenames + (otolId, _, imgPath) = line.rstrip().partition(" ") + # Skip if already processed + if otolId in nodesDone: + continue + # Convert image + if genImgFiles: + success = convertImage(imgPath, outDir + otolId + ".jpg") + if not success: + break + else: + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # Add entry to db + (nodeName,) = dbCur.execute("SELECT name FROM nodes WHERE id = ?", (otolId,)).fetchone() + fromEol = imgPath.startswith("eol/") + imgName = os.path.basename(os.path.normpath(imgPath)) # Get last path component + imgName = os.path.splitext(imgName)[0] # Remove extension + if fromEol: + eolId, _, contentId = imgName.partition(" ") + eolId, contentId = (int(eolId), int(contentId)) + if (eolId, "eol") not in imgsDone: + query = "SELECT source_url, license, copyright_owner FROM images WHERE content_id = ?" + row = eolCur.execute(query, (contentId,)).fetchone() + if row == None: + print(f"ERROR: No image record for EOL ID {eolId}, content ID {contentId}") + break + (url, license, owner) = row + dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)", + (eolId, "eol", url, license, owner, "")) + imgsDone.add((eolId, "eol")) + dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, eolId, "eol")) + else: + enwikiId = int(imgName) + if (enwikiId, "enwiki") not in imgsDone: + query = "SELECT name, license, artist, credit FROM" \ + " page_imgs INNER JOIN imgs ON page_imgs.img_name = imgs.name" \ + " WHERE page_imgs.page_id = ?" 
+ row = enwikiCur.execute(query, (enwikiId,)).fetchone() + if row == None: + print(f"ERROR: No image record for enwiki ID {enwikiId}") + break + (name, license, artist, credit) = row + url = "https://en.wikipedia.org/wiki/File:" + urllib.parse.quote(name) + dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)", + (enwikiId, "enwiki", url, license, artist, credit)) + imgsDone.add((enwikiId, "enwiki")) + dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, enwikiId, "enwiki")) +# Close dbs +quit() diff --git a/backend/tolData/genLinkedImgs.py b/backend/tolData/genLinkedImgs.py new file mode 100755 index 0000000..a8e1322 --- /dev/null +++ b/backend/tolData/genLinkedImgs.py @@ -0,0 +1,125 @@ +#!/usr/bin/python3 + +import sys, re +import sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Looks for nodes without images in the database, and tries to +associate them with images from their children. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +dbFile = "data.db" +compoundNameRegex = re.compile(r"\[(.+) \+ (.+)]") +upPropagateCompoundImgs = False + +print("Opening databases") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() +dbCur.execute("CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT)") + +print("Getting nodes with images") +resolvedNodes = {} # Will map node names to otol IDs with a usable image +query = "SELECT nodes.name, nodes.id FROM nodes INNER JOIN node_imgs ON nodes.name = node_imgs.name" +for (name, otolId) in dbCur.execute(query): + resolvedNodes[name] = otolId +print(f"Found {len(resolvedNodes)}") + +print("Iterating through nodes, trying to resolve images for ancestors") +nodesToResolve = {} # Maps a node name to a list of objects that represent possible child images +processedNodes = {} # Maps a node name to an OTOL ID, representing a child node whose image is to be used +parentToChosenTips = {} # Used to prefer images from children with more tips +iterNum = 0 +while
len(resolvedNodes) > 0: + iterNum += 1 + if iterNum % 1e3 == 0: + print(f"At iteration {iterNum}") + # Get next node + (nodeName, otolId) = resolvedNodes.popitem() + processedNodes[nodeName] = otolId + # Traverse upwards, resolving ancestors if able + while True: + # Get parent + row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone() + if row == None or row[0] in processedNodes or row[0] in resolvedNodes: + break + parent = row[0] + # Get parent data + if parent not in nodesToResolve: + childNames = [row[0] for row in dbCur.execute("SELECT child FROM edges WHERE parent = ?", (parent,))] + query = "SELECT name, tips FROM nodes WHERE name IN ({})".format(",".join(["?"] * len(childNames))) + childObjs = [{"name": row[0], "tips": row[1], "otolId": None} for row in dbCur.execute(query, childNames)] + childObjs.sort(key=lambda x: x["tips"], reverse=True) + nodesToResolve[parent] = childObjs + else: + childObjs = nodesToResolve[parent] + # Check if highest-tips child + if (childObjs[0]["name"] == nodeName): + # Resolve parent, and continue from it + dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (parent, otolId)) + del nodesToResolve[parent] + processedNodes[parent] = otolId + parentToChosenTips[parent] = childObjs[0]["tips"] + nodeName = parent + continue + else: + # Mark child as a potential choice + childObj = next(c for c in childObjs if c["name"] == nodeName) + childObj["otolId"] = otolId + break + # When out of resolved nodes, resolve nodesToResolve nodes, possibly adding more nodes to resolve + if len(resolvedNodes) == 0: + for (name, childObjs) in nodesToResolve.items(): + childObj = next(c for c in childObjs if c["otolId"] != None) + resolvedNodes[name] = childObj["otolId"] + parentToChosenTips[name] = childObj["tips"] + dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (name, childObj["otolId"])) + nodesToResolve.clear() + +print("Replacing linked-images for compound nodes") +iterNum = 0 +for nodeName in 
processedNodes.keys(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + match = compoundNameRegex.fullmatch(nodeName) + if match != None: + # Replace associated image with subname images + (subName1, subName2) = match.group(1,2) + otolIdPair = ["", ""] + if subName1 in processedNodes: + otolIdPair[0] = processedNodes[subName1] + if subName2 in processedNodes: + otolIdPair[1] = processedNodes[subName2] + # Use no image if both subimages not found + if otolIdPair[0] == "" and otolIdPair[1] == "": + dbCur.execute("DELETE FROM linked_imgs WHERE name = ?", (nodeName,)) + continue + # Add to db + dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?", + (otolIdPair[0] + "," + otolIdPair[1], nodeName)) + # Possibly repeat operation upon parent/ancestors + if upPropagateCompoundImgs: + while True: + # Get parent + row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone() + if row != None: + parent = row[0] + # Check num tips + (numTips,) = dbCur.execute("SELECT tips from nodes WHERE name = ?", (nodeName,)).fetchone() + if parent in parentToChosenTips and parentToChosenTips[parent] <= numTips: + # Replace associated image + dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?", + (otolIdPair[0] + "," + otolIdPair[1], parent)) + nodeName = parent + continue + break + +print("Closing databases") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/genOtolData.py b/backend/tolData/genOtolData.py new file mode 100755 index 0000000..b5e0055 --- /dev/null +++ b/backend/tolData/genOtolData.py @@ -0,0 +1,250 @@ +#!/usr/bin/python3 + +import sys, re, os +import json, sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} + +Reads files describing a tree-of-life from an 'Open Tree of Life' release, +and stores tree information in a database. 
+ +Reads a labelled_supertree_ottnames.tre file, which is assumed to have this format: + The tree-of-life is represented in Newick format, which looks like: (n1,n2,(n3,n4)n5)n6 + The root node is named n6, and has children n1, n2, and n5. + Name examples include: Homo_sapiens_ott770315, mrcaott6ott22687, 'Oxalis san-miguelii ott5748753'. + 'ott770315' and 'mrcaott6ott22687' are node IDs. The latter is for a 'compound node'. + The node with ID 'ott770315' will get the name 'homo sapiens'. + A compound node will get a name composed from its sub-nodes (eg: [name1 + name2]). + It is possible for multiple nodes to have the same name. + In these cases, extra nodes will be named sequentially, as 'name1 [2]', 'name1 [3]', etc. +Reads an annotations.json file, which is assumed to have this format: + Holds a JSON object, whose 'nodes' property maps node IDs to objects holding information about that node, + such as the properties 'supported_by' and 'conflicts_with', which list phylogenetic trees that + support/conflict with the node's placement. +Reads from a picked-names file, if present, which specifies name and node ID pairs. + These help resolve cases where multiple nodes share the same name.
+""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +treeFile = "otol/labelled_supertree_ottnames.tre" # Had about 2.5e9 nodes +annFile = "otol/annotations.json" +dbFile = "data.db" +nodeMap = {} # Maps node IDs to node objects +nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs) +dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs +pickedNamesFile = "pickedOtolNames.txt" + +class Node: + " Represents a tree-of-life node " + def __init__(self, name, childIds, parentId, tips, pSupport): + self.name = name + self.childIds = childIds + self.parentId = parentId + self.tips = tips + self.pSupport = pSupport + +print("Parsing tree file") +# Read file +data = None +with open(treeFile) as file: + data = file.read() +dataIdx = 0 +# Parse content +iterNum = 0 +def parseNewick(): + " Parses a node using 'data' and 'dataIdx', updates nodeMap accordingly, and returns the node's ID " + global data, dataIdx, iterNum + iterNum += 1 + if iterNum % 1e5 == 0: + print(f"At iteration {iterNum}") + # Check for EOF + if dataIdx == len(data): + raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}") + # Check for node + if data[dataIdx] == "(": # parse inner node + dataIdx += 1 + childIds = [] + while True: + # Read child + childId = parseNewick() + childIds.append(childId) + if (dataIdx == len(data)): + raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}") + # Check for next child + if (data[dataIdx] == ","): + dataIdx += 1 + continue + else: + # Get node name and id + dataIdx += 1 # Consume an expected ')' + name, id = parseNewickName() + updateNameMaps(name, id) + # Get child num-tips total + tips = 0 + for childId in childIds: + tips += nodeMap[childId].tips + # Add node to nodeMap + nodeMap[id] = Node(name, childIds, None, tips, False) + # Update childrens' parent reference + for childId in childIds: + nodeMap[childId].parentId = id + return id + else: # Parse node name + name, id = 
parseNewickName() + updateNameMaps(name, id) + nodeMap[id] = Node(name, [], None, 1, False) + return id +def parseNewickName(): + " Parses a node name using 'data' and 'dataIdx', and returns a (name, id) pair " + global data, dataIdx + name = None + end = dataIdx + # Get name + if (end < len(data) and data[end] == "'"): # Check for quoted name + end += 1 + inQuote = True + while end < len(data): + if (data[end] == "'"): + if end + 1 < len(data) and data[end + 1] == "'": # Account for '' as escaped-quote + end += 2 + continue + else: + end += 1 + inQuote = False + break + end += 1 + if inQuote: + raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}") + name = data[dataIdx:end] + dataIdx = end + else: + while end < len(data) and not re.match(r"[(),]", data[end]): + end += 1 + if (end == dataIdx): + raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}") + name = data[dataIdx:end].rstrip() + if end == len(data): # Ignore trailing input semicolon + name = name[:-1] + dataIdx = end + # Convert to (name, id) + name = name.lower() + if name.startswith("mrca"): + return (name, name) + elif name[0] == "'": + match = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", name) + if match == None: + raise Exception(f"ERROR: invalid name \"{name}\"") + name = match.group(1).replace("''", "'") + return (name, match.group(2)) + else: + match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", name) + if match == None: + raise Exception(f"ERROR: invalid name \"{name}\"") + return (match.group(1).replace("_", " "), match.group(2)) +def updateNameMaps(name, id): + global nameToFirstId, dupNameToIds + if name not in nameToFirstId: + nameToFirstId[name] = id + else: + if name not in dupNameToIds: + dupNameToIds[name] = [nameToFirstId[name], id] + else: + dupNameToIds[name].append(id) +rootId = parseNewick() + +print("Resolving duplicate names") +# Read picked-names file +nameToPickedId = {} +if os.path.exists(pickedNamesFile): + with open(pickedNamesFile) as file: + for line in file: + (name, _, 
otolId) = line.rstrip().partition("|") + nameToPickedId[name] = otolId +# Resolve duplicates +for (dupName, ids) in dupNameToIds.items(): + # Check for picked id + if dupName in nameToPickedId: + idToUse = nameToPickedId[dupName] + else: + # Get conflicting node with most tips + tipNums = [nodeMap[id].tips for id in ids] + maxIdx = tipNums.index(max(tipNums)) + idToUse = ids[maxIdx] + # Adjust name of other conflicting nodes + counter = 2 + for id in ids: + if id != idToUse: + nodeMap[id].name += f" [{counter}]" + counter += 1 + +print("Changing mrca* names") +def convertMrcaName(id): + node = nodeMap[id] + name = node.name + childIds = node.childIds + if len(childIds) < 2: + print(f"WARNING: MRCA node \"{name}\" has less than 2 children") + return + # Get 2 children with most tips + childTips = [nodeMap[id].tips for id in childIds] + maxIdx1 = childTips.index(max(childTips)) + childTips[maxIdx1] = 0 + maxIdx2 = childTips.index(max(childTips)) + childId1 = childIds[maxIdx1] + childId2 = childIds[maxIdx2] + childName1 = nodeMap[childId1].name + childName2 = nodeMap[childId2].name + # Check for mrca* child names + if childName1.startswith("mrca"): + childName1 = convertMrcaName(childId1) + if childName2.startswith("mrca"): + childName2 = convertMrcaName(childId2) + # Check for composite names + match = re.fullmatch(r"\[(.+) \+ (.+)]", childName1) + if match != None: + childName1 = match.group(1) + match = re.fullmatch(r"\[(.+) \+ (.+)]", childName2) + if match != None: + childName2 = match.group(1) + # Create composite name + node.name = f"[{childName1} + {childName2}]" + return childName1 +for (id, node) in nodeMap.items(): + if node.name.startswith("mrca"): + convertMrcaName(id) + +print("Parsing annotations file") +# Read file +data = None +with open(annFile) as file: + data = file.read() +obj = json.loads(data) +nodeAnnsMap = obj["nodes"] +# Find relevant annotations +for (id, node) in nodeMap.items(): + # Set has-support value using annotations + if id in 
nodeAnnsMap: + nodeAnns = nodeAnnsMap[id] + supportQty = len(nodeAnns["supported_by"]) if "supported_by" in nodeAnns else 0 + conflictQty = len(nodeAnns["conflicts_with"]) if "conflicts_with" in nodeAnns else 0 + node.pSupport = supportQty > 0 and conflictQty == 0 + +print("Creating nodes and edges tables") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() +dbCur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)") +dbCur.execute("CREATE INDEX nodes_idx_nc ON nodes(name COLLATE NOCASE)") +dbCur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))") +dbCur.execute("CREATE INDEX edges_child_idx ON edges(child)") +for (otolId, node) in nodeMap.items(): + dbCur.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node.name, otolId, node.tips)) + for childId in node.childIds: + childNode = nodeMap[childId] + dbCur.execute("INSERT INTO edges VALUES (?, ?, ?)", + (node.name, childNode.name, 1 if childNode.pSupport else 0)) +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/genReducedTrees.py b/backend/tolData/genReducedTrees.py new file mode 100755 index 0000000..a921be4 --- /dev/null +++ b/backend/tolData/genReducedTrees.py @@ -0,0 +1,329 @@ +#!/usr/bin/python3 + +import sys, os.path, re +import json, sqlite3 + +usageInfo = f""" +Usage: {sys.argv[0]} [tree1] + +Creates reduced versions of the tree in the database: +- A 'picked nodes' tree: + Created from a minimal set of node names read from a file, + possibly with some extra randomly-picked children. +- An 'images only' tree: + Created by removing nodes without an image or presence in the + 'picked' tree. +- A 'weakly trimmed' tree: + Created by removing nodes that lack an image or description, or + presence in the 'picked' tree. And, for nodes with 'many' children, + removing some more, despite any node descriptions. + +If tree1 is specified, as 'picked', 'images', or 'trimmed', only that +tree is generated. 
+""" +if len(sys.argv) > 2 or len(sys.argv) == 2 and re.fullmatch(r"picked|images|trimmed", sys.argv[1]) == None: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +tree = sys.argv[1] if len(sys.argv) > 1 else None +dbFile = "data.db" +pickedNodesFile = "pickedNodes.txt" +COMP_NAME_REGEX = re.compile(r"\[.+ \+ .+]") # Used to recognise composite nodes + +class Node: + def __init__(self, id, children, parent, tips, pSupport): + self.id = id + self.children = children + self.parent = parent + self.tips = tips + self.pSupport = pSupport + +print("Opening database") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +def genPickedNodeTree(dbCur, pickedNames, rootName): + global COMP_NAME_REGEX + PREF_NUM_CHILDREN = 3 # Include extra children up to this limit + nodeMap = {} # Maps node names to Nodes + print("Getting ancestors") + nodeMap = genNodeMap(dbCur, pickedNames, 100) + print(f"Result has {len(nodeMap)} nodes") + print("Removing composite nodes") + removedNames = removeCompositeNodes(nodeMap) + print(f"Result has {len(nodeMap)} nodes") + print("Removing 'collapsible' nodes") + temp = removeCollapsibleNodes(nodeMap, pickedNames) + removedNames.update(temp) + print(f"Result has {len(nodeMap)} nodes") + print("Adding some additional nearby children") + namesToAdd = [] + iterNum = 0 + for (name, node) in nodeMap.items(): + iterNum += 1 + if iterNum % 100 == 0: + print(f"At iteration {iterNum}") + # + numChildren = len(node.children) + if numChildren < PREF_NUM_CHILDREN: + children = [row[0] for row in dbCur.execute("SELECT child FROM edges where parent = ?", (name,))] + newChildren = [] + for n in children: + if n in nodeMap or n in removedNames: + continue + if COMP_NAME_REGEX.fullmatch(n) != None: + continue + if dbCur.execute("SELECT name from node_imgs WHERE name = ?", (n,)).fetchone() == None and \ + dbCur.execute("SELECT name from linked_imgs WHERE name = ?", (n,)).fetchone() == None: + continue + newChildren.append(n) + newChildNames = 
newChildren[:(PREF_NUM_CHILDREN - numChildren)] + node.children.extend(newChildNames) + namesToAdd.extend(newChildNames) + for name in namesToAdd: + parent, pSupport = dbCur.execute("SELECT parent, p_support from edges WHERE child = ?", (name,)).fetchone() + (id,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (name,)).fetchone() + parent = None if parent == "" else parent + nodeMap[name] = Node(id, [], parent, 0, pSupport == 1) + print(f"Result has {len(nodeMap)} nodes") + print("Updating 'tips' values") + updateTips(rootName, nodeMap) + print("Creating table") + addTreeTables(nodeMap, dbCur, "p") +def genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName): + print("Getting ancestors") + nodeMap = genNodeMap(dbCur, nodesWithImgOrPicked, 1e4) + print(f"Result has {len(nodeMap)} nodes") + print("Removing composite nodes") + removeCompositeNodes(nodeMap) + print(f"Result has {len(nodeMap)} nodes") + print("Removing 'collapsible' nodes") + removeCollapsibleNodes(nodeMap, {}) + print(f"Result has {len(nodeMap)} nodes") + print(f"Updating 'tips' values") # Needed for next trimming step + updateTips(rootName, nodeMap) + print(f"Trimming from nodes with 'many' children") + trimIfManyChildren(nodeMap, rootName, 300, pickedNames) + print(f"Result has {len(nodeMap)} nodes") + print(f"Updating 'tips' values") + updateTips(rootName, nodeMap) + print("Creating table") + addTreeTables(nodeMap, dbCur, "i") +def genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName): + print("Getting ancestors") + nodeMap = genNodeMap(dbCur, nodesWithImgDescOrPicked, 1e5) + print(f"Result has {len(nodeMap)} nodes") + print("Getting nodes to 'strongly keep'") + iterNum = 0 + nodesFromImgOrPicked = set() + for name in nodesWithImgOrPicked: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f"At iteration {iterNum}") + # + while name != None: + if name not in nodesFromImgOrPicked: + nodesFromImgOrPicked.add(name) + name = nodeMap[name].parent + else: 
+ break + print(f"Node set has {len(nodesFromImgOrPicked)} nodes") + print("Removing 'collapsible' nodes") + removeCollapsibleNodes(nodeMap, nodesWithImgDescOrPicked) + print(f"Result has {len(nodeMap)} nodes") + print(f"Updating 'tips' values") # Needed for next trimming step + updateTips(rootName, nodeMap) + print(f"Trimming from nodes with 'many' children") + trimIfManyChildren(nodeMap, rootName, 600, nodesFromImgOrPicked) + print(f"Result has {len(nodeMap)} nodes") + print(f"Updating 'tips' values") + updateTips(rootName, nodeMap) + print("Creating table") + addTreeTables(nodeMap, dbCur, "t") +# Helper functions +def genNodeMap(dbCur, nameSet, itersBeforePrint = 1): + " Returns a subtree that includes nodes in 'nameSet', as a name-to-Node map " + nodeMap = {} + iterNum = 0 + for name in nameSet: + iterNum += 1 + if iterNum % itersBeforePrint == 0: + print(f"At iteration {iterNum}") + # + prevName = None + while name != None: + if name not in nodeMap: + # Add node + (id, tips) = dbCur.execute("SELECT id, tips from nodes where name = ?", (name,)).fetchone() + row = dbCur.execute("SELECT parent, p_support from edges where child = ?", (name,)).fetchone() + parent = None if row == None or row[0] == "" else row[0] + pSupport = row == None or row[1] == 1 + children = [] if prevName == None else [prevName] + nodeMap[name] = Node(id, children, parent, 0, pSupport) + # Iterate to parent + prevName = name + name = parent + else: + # Just add as child + if prevName != None: + nodeMap[name].children.append(prevName) + break + return nodeMap +def removeCompositeNodes(nodeMap): + " Given a tree, removes composite-name nodes, and returns the removed nodes' names " + global COMP_NAME_REGEX + namesToRemove = set() + for (name, node) in nodeMap.items(): + parent = node.parent + if parent != None and COMP_NAME_REGEX.fullmatch(name) != None: + # Connect children to parent + nodeMap[parent].children.remove(name) + nodeMap[parent].children.extend(node.children) + for n in 
node.children: + nodeMap[n].parent = parent + nodeMap[n].pSupport &= node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + return namesToRemove +def removeCollapsibleNodes(nodeMap, nodesToKeep = {}): + """ Given a tree, removes single-child parents, then only-childs, + with given exceptions, and returns the set of removed nodes' names """ + namesToRemove = set() + # Remove single-child parents + for (name, node) in nodeMap.items(): + if len(node.children) == 1 and node.parent != None and name not in nodesToKeep: + # Connect parent and children + parent = node.parent + child = node.children[0] + nodeMap[parent].children.remove(name) + nodeMap[parent].children.append(child) + nodeMap[child].parent = parent + nodeMap[child].pSupport &= node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + # Remove only-childs (not redundant because 'nodesToKeep' can cause single-child parents to be kept) + namesToRemove.clear() + for (name, node) in nodeMap.items(): + isOnlyChild = node.parent != None and len(nodeMap[node.parent].children) == 1 + if isOnlyChild and name not in nodesToKeep: + # Connect parent and children + parent = node.parent + nodeMap[parent].children = node.children + for n in node.children: + nodeMap[n].parent = parent + nodeMap[n].pSupport &= node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + # + return namesToRemove +def trimIfManyChildren(nodeMap, rootName, childThreshold, nodesToKeep = {}): + namesToRemove = set() + def findTrimmables(nodeName): + nonlocal nodeMap, nodesToKeep + node = nodeMap[nodeName] + if len(node.children) > childThreshold: + numToTrim = len(node.children) - childThreshold + # Try removing nodes, preferring those with less tips + candidatesToTrim = [n for n in node.children if n not in nodesToKeep] + childToTips = {n: nodeMap[n].tips for n in 
candidatesToTrim} + candidatesToTrim.sort(key=lambda n: childToTips[n], reverse=True) + childrenToRemove = set(candidatesToTrim[-numToTrim:]) + node.children = [n for n in node.children if n not in childrenToRemove] + # Mark nodes for deletion + for n in childrenToRemove: + markForRemoval(n) + # Recurse on children + for n in node.children: + findTrimmables(n) + def markForRemoval(nodeName): + nonlocal nodeMap, namesToRemove + namesToRemove.add(nodeName) + for child in nodeMap[nodeName].children: + markForRemoval(child) + findTrimmables(rootName) + for nodeName in namesToRemove: + del nodeMap[nodeName] +def updateTips(nodeName, nodeMap): + " Updates the 'tips' values for a node and its descendants, returning the node's new 'tips' value " + node = nodeMap[nodeName] + tips = sum([updateTips(childName, nodeMap) for childName in node.children]) + tips = max(1, tips) + node.tips = tips + return tips +def addTreeTables(nodeMap, dbCur, suffix): + " Adds a tree to the database, as tables nodes_X and edges_X, where X is the given suffix " + nodesTbl = f"nodes_{suffix}" + edgesTbl = f"edges_{suffix}" + dbCur.execute(f"CREATE TABLE {nodesTbl} (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)") + dbCur.execute(f"CREATE INDEX {nodesTbl}_idx_nc ON {nodesTbl}(name COLLATE NOCASE)") + dbCur.execute(f"CREATE TABLE {edgesTbl} (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))") + dbCur.execute(f"CREATE INDEX {edgesTbl}_child_idx ON {edgesTbl}(child)") + for (name, node) in nodeMap.items(): + dbCur.execute(f"INSERT INTO {nodesTbl} VALUES (?, ?, ?)", (name, node.id, node.tips)) + for childName in node.children: + pSupport = 1 if nodeMap[childName].pSupport else 0 + dbCur.execute(f"INSERT INTO {edgesTbl} VALUES (?, ?, ?)", (name, childName, pSupport)) + +print(f"Finding root node") +query = "SELECT name FROM nodes LEFT JOIN edges ON nodes.name = edges.child WHERE edges.parent IS NULL LIMIT 1" +(rootName,) = dbCur.execute(query).fetchone() +print(f"Found 
\"{rootName}\"") + +print('=== Getting picked-nodes ===') +pickedNames = set() +pickedTreeExists = False +if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='nodes_p'").fetchone() == None: + print(f"Reading from {pickedNodesFile}") + with open(pickedNodesFile) as file: + for line in file: + name = line.rstrip() + row = dbCur.execute("SELECT name from nodes WHERE name = ?", (name,)).fetchone() + if row == None: + row = dbCur.execute("SELECT name from names WHERE alt_name = ?", (name,)).fetchone() + if row != None: + pickedNames.add(row[0]) + if len(pickedNames) == 0: + raise Exception("ERROR: No picked names found") +else: + pickedTreeExists = True + print("Picked-node tree already exists") + if tree == 'picked': + sys.exit() + for (name,) in dbCur.execute("SELECT name FROM nodes_p"): + pickedNames.add(name) +print(f"Found {len(pickedNames)} names") + +if (tree == 'picked' or tree == None) and not pickedTreeExists: + print("=== Generating picked-nodes tree ===") + genPickedNodeTree(dbCur, pickedNames, rootName) +if tree != 'picked': + print("=== Finding 'non-low significance' nodes ===") + nodesWithImgOrPicked = set() + nodesWithImgDescOrPicked = set() + print("Finding nodes with descs") + for (name,) in dbCur.execute("SELECT name FROM wiki_ids"): # Can assume the wiki_id has a desc + nodesWithImgDescOrPicked.add(name) + print("Finding nodes with images") + for (name,) in dbCur.execute("SELECT name FROM node_imgs"): + nodesWithImgDescOrPicked.add(name) + nodesWithImgOrPicked.add(name) + print("Adding picked nodes") + for name in pickedNames: + nodesWithImgDescOrPicked.add(name) + nodesWithImgOrPicked.add(name) + if tree == 'images' or tree == None: + print("=== Generating images-only tree ===") + genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName) + if tree == 'trimmed' or tree == None: + print("=== Generating weakly-trimmed tree ===") + genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName) 
+ +print("Closing database") +dbCon.commit() +dbCon.close() diff --git a/backend/tolData/otol/README.md b/backend/tolData/otol/README.md new file mode 100644 index 0000000..4be2fd2 --- /dev/null +++ b/backend/tolData/otol/README.md @@ -0,0 +1,10 @@ +Files +===== +- opentree13.4tree.tgz <br> + Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>. + Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life). +- labelled\_supertree\_ottnames.tre <br> + Extracted from the .tgz file. Describes the structure of the tree. +- annotations.json + Extracted from the .tgz file. Contains additional attributes of tree + nodes. Used for finding out which nodes have 'phylogenetic support'. diff --git a/backend/tolData/pickedImgs/README.md b/backend/tolData/pickedImgs/README.md new file mode 100644 index 0000000..dfe192b --- /dev/null +++ b/backend/tolData/pickedImgs/README.md @@ -0,0 +1,10 @@ +This directory holds additional image files to use for tree-of-life nodes, +on top of those from EOL and Wikipedia. + +Possible Files +============== +- (Image files) +- imgData.txt <br> + Contains lines with the format `filename|url|license|artist|credit`. + The filename should consist of a node name, with an image extension. + Other fields correspond to those in the `images` table (see ../README.md). diff --git a/backend/tolData/reviewImgsToGen.py b/backend/tolData/reviewImgsToGen.py new file mode 100755 index 0000000..de592f5 --- /dev/null +++ b/backend/tolData/reviewImgsToGen.py @@ -0,0 +1,225 @@ +#!/usr/bin/python3 + +import sys, re, os, time +import sqlite3 +import tkinter as tki +from tkinter import ttk +import PIL +from PIL import ImageTk, Image, ImageOps + +usageInfo = f""" +Usage: {sys.argv[0]} + +Provides a GUI that displays, for each node in the database, associated +images from EOL and Wikipedia, and allows choosing which to use. 
Writes +choice data to a text file with lines of the form 'otolId1 imgPath1', or +'otolId1', where no path indicates a choice of no image. + +The program can be closed, and run again to continue from the last choice. +The program looks for an existing output file to determine what choices +have already been made. +""" +if len(sys.argv) > 1: + print(usageInfo, file=sys.stderr) + sys.exit(1) + +eolImgDir = "eol/imgs/" +enwikiImgDir = "enwiki/imgs/" +dbFile = "data.db" +outFile = "imgList.txt" +IMG_DISPLAY_SZ = 400 +PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135)) +onlyReviewPairs = True + +print("Opening database") +dbCon = sqlite3.connect(dbFile) +dbCur = dbCon.cursor() + +nodeToImgs = {} # Maps otol-ids to arrays of image paths +print("Iterating through images from EOL") +if os.path.exists(eolImgDir): + for filename in os.listdir(eolImgDir): + # Get associated EOL ID + eolId, _, _ = filename.partition(" ") + query = "SELECT nodes.id FROM nodes INNER JOIN eol_ids ON nodes.name = eol_ids.name WHERE eol_ids.id = ?" + # Get associated node IDs + found = False + for (otolId,) in dbCur.execute(query, (int(eolId),)): + if otolId not in nodeToImgs: + nodeToImgs[otolId] = [] + nodeToImgs[otolId].append(eolImgDir + filename) + found = True + if not found: + print(f"WARNING: No node found for {eolImgDir}{filename}") +print(f"Result: {len(nodeToImgs)} nodes with images") +print("Iterating through images from Wikipedia") +if os.path.exists(enwikiImgDir): + for filename in os.listdir(enwikiImgDir): + # Get associated page ID + (wikiId, _, _) = filename.partition(".") + # Get associated node IDs + query = "SELECT nodes.id FROM nodes INNER JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id = ?" 
+ found = False + for (otolId,) in dbCur.execute(query, (int(wikiId),)): + if otolId not in nodeToImgs: + nodeToImgs[otolId] = [] + nodeToImgs[otolId].append(enwikiImgDir + filename) + found = True + if not found: + print(f"WARNING: No node found for {enwikiImgDir}{filename}") +print(f"Result: {len(nodeToImgs)} nodes with images") +print("Filtering out already-made image choices") +oldSz = len(nodeToImgs) +if os.path.exists(outFile): + with open(outFile) as file: + for line in file: + line = line.rstrip() + if " " in line: + line = line[:line.find(" ")] + del nodeToImgs[line] +print(f"Filtered out {oldSz - len(nodeToImgs)} entries") + +class ImgReviewer: + " Provides the GUI for reviewing images " + def __init__(self, root, nodeToImgs): + self.root = root + root.title("Image Reviewer") + # Setup main frame + mainFrame = ttk.Frame(root, padding="5 5 5 5") + mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S)) + root.columnconfigure(0, weight=1) + root.rowconfigure(0, weight=1) + # Set up images-to-be-reviewed frames + self.eolImg = ImageTk.PhotoImage(PLACEHOLDER_IMG) + self.enwikiImg = ImageTk.PhotoImage(PLACEHOLDER_IMG) + self.labels = [] + for i in (0, 1): + frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ) + frame.grid(column=i, row=0) + label = ttk.Label(frame, image=self.eolImg if i == 0 else self.enwikiImg) + label.grid(column=0, row=0) + self.labels.append(label) + # Add padding + for child in mainFrame.winfo_children(): + child.grid_configure(padx=5, pady=5) + # Add keyboard bindings + root.bind("<q>", self.quit) + root.bind("<Key-j>", lambda evt: self.accept(0)) + root.bind("<Key-k>", lambda evt: self.accept(1)) + root.bind("<Key-l>", lambda evt: self.reject()) + # Set fields + self.nodeImgsList = list(nodeToImgs.items()) + self.listIdx = -1 + self.otolId = None + self.eolImgPath = None + self.enwikiImgPath = None + self.numReviewed = 0 + self.startTime = time.time() + # Initialise images to review + 
self.getNextImgs() + def getNextImgs(self): + " Updates display with new images to review, or ends program " + # Get next image paths + while True: + self.listIdx += 1 + if self.listIdx == len(self.nodeImgsList): + print("No more images to review. Exiting program.") + self.quit() + return + self.otolId, imgPaths = self.nodeImgsList[self.listIdx] + # Potentially skip user choice + if onlyReviewPairs and len(imgPaths) == 1: + with open(outFile, 'a') as file: + file.write(f"{self.otolId} {imgPaths[0]}\n") + continue + break + # Update displayed images + self.eolImgPath = self.enwikiImgPath = None + imageOpenError = False + for imgPath in imgPaths: + img = None + try: + img = Image.open(imgPath) + img = ImageOps.exif_transpose(img) + except PIL.UnidentifiedImageError: + print(f"UnidentifiedImageError for {imgPath}") + imageOpenError = True + continue + if imgPath.startswith("eol/"): + self.eolImgPath = imgPath + self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img)) + elif imgPath.startswith("enwiki/"): + self.enwikiImgPath = imgPath + self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img)) + else: + print(f"Unexpected image path {imgPath}") + self.quit() + return + # Re-iterate if all image paths invalid + if self.eolImgPath == None and self.enwikiImgPath == None: + if imageOpenError: + self.reject() + self.getNextImgs() + return + # Add placeholder images + if self.eolImgPath == None: + self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG)) + elif self.enwikiImgPath == None: + self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG)) + # Update image-frames + self.labels[0].config(image=self.eolImg) + self.labels[1].config(image=self.enwikiImg) + # Update title + title = f"Images for otol ID {self.otolId}" + query = "SELECT names.alt_name FROM" \ + " nodes INNER JOIN names ON nodes.name = names.name" \ + " WHERE nodes.id = ? 
and pref_alt = 1" + row = dbCur.execute(query, (self.otolId,)).fetchone() + if row != None: + title += f", aka {row[0]}" + title += f" ({self.listIdx + 1} out of {len(self.nodeImgsList)})" + self.root.title(title) + def accept(self, imgIdx): + " React to a user selecting an image " + imgPath = self.eolImgPath if imgIdx == 0 else self.enwikiImgPath + if imgPath == None: + print("Invalid selection") + return + with open(outFile, 'a') as file: + file.write(f"{self.otolId} {imgPath}\n") + self.numReviewed += 1 + self.getNextImgs() + def reject(self): + " React to a user rejecting all images of a set " + with open(outFile, 'a') as file: + file.write(f"{self.otolId}\n") + self.numReviewed += 1 + self.getNextImgs() + def quit(self, e = None): + global dbCon + print(f"Number reviewed: {self.numReviewed}") + timeElapsed = time.time() - self.startTime + print(f"Time elapsed: {timeElapsed:.2f} seconds") + if self.numReviewed > 0: + print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds") + dbCon.close() + self.root.destroy() + def resizeImgForDisplay(self, img): + " Returns a copy of an image, shrunk to fit its frame (keeps aspect ratio), and with a background " + if max(img.width, img.height) > IMG_DISPLAY_SZ: + if (img.width > img.height): + newHeight = int(img.height * IMG_DISPLAY_SZ/img.width) + img = img.resize((IMG_DISPLAY_SZ, newHeight)) + else: + newWidth = int(img.width * IMG_DISPLAY_SZ / img.height) + img = img.resize((newWidth, IMG_DISPLAY_SZ)) + bgImg = PLACEHOLDER_IMG.copy() + bgImg.paste(img, box=( + int((IMG_DISPLAY_SZ - img.width) / 2), + int((IMG_DISPLAY_SZ - img.height) / 2))) + return bgImg +# Create GUI and defer control +print("Starting GUI") +root = tki.Tk() +ImgReviewer(root, nodeToImgs) +root.mainloop()
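The sizing arithmetic in `resizeImgForDisplay` above can be checked without PIL or a GUI. The following is a minimal sketch, not part of the commit: `fit_in_square` is a hypothetical helper name, and it only mirrors the integer arithmetic of the method (shrink the longer side to `IMG_DISPLAY_SZ`, scale the other side proportionally, then centre on the square background).

```python
IMG_DISPLAY_SZ = 400  # same value as the constant in reviewImgsToGen.py

def fit_in_square(width, height, size=IMG_DISPLAY_SZ):
    """Hypothetical helper: returns ((new_width, new_height), (x_offset, y_offset)).

    Mirrors resizeImgForDisplay: images larger than the square are shrunk
    so the longer side equals 'size' (aspect ratio kept); smaller images
    are left as-is. The offset is where the image would be pasted so that
    it sits centred on a size-by-size background.
    """
    if max(width, height) > size:
        if width > height:
            # Landscape: width becomes 'size', height scales down
            width, height = size, int(height * size / width)
        else:
            # Portrait or square: height becomes 'size', width scales down
            width, height = int(width * size / height), size
    offset = (int((size - width) / 2), int((size - height) / 2))
    return (width, height), offset

if __name__ == "__main__":
    print(fit_in_square(800, 600))   # landscape image, shrunk
    print(fit_in_square(200, 100))   # already fits, only centred
```

Keeping the arithmetic in a pure function like this would make it testable separately from the Tk widgets, though the commit itself keeps it as a method of `ImgReviewer`.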
