Diffstat (limited to 'backend/data')
-rw-r--r--  backend/data/README.md                        | 152
-rwxr-xr-x  backend/data/addPickedNames.py                |  57
-rw-r--r--  backend/data/dbpedia/README.md                |  29
-rwxr-xr-x  backend/data/dbpedia/genDescData.py           | 130
-rw-r--r--  backend/data/enwiki/README.md                 |  52
-rwxr-xr-x  backend/data/enwiki/downloadImgLicenseInfo.py | 150
-rwxr-xr-x  backend/data/enwiki/downloadImgs.py           |  91
-rwxr-xr-x  backend/data/enwiki/genDescData.py            | 127
-rwxr-xr-x  backend/data/enwiki/genDumpIndexDb.py         |  58
-rwxr-xr-x  backend/data/enwiki/genImgData.py             | 190
-rwxr-xr-x  backend/data/enwiki/lookupPage.py             |  68
-rw-r--r--  backend/data/eol/README.md                    |  26
-rwxr-xr-x  backend/data/eol/downloadImgs.py              | 147
-rwxr-xr-x  backend/data/eol/genImagesListDb.sh           |  12
-rwxr-xr-x  backend/data/eol/reviewImgs.py                | 205
-rwxr-xr-x  backend/data/genDbpData.py                    | 247
-rwxr-xr-x  backend/data/genEnwikiDescData.py             | 102
-rwxr-xr-x  backend/data/genEnwikiNameData.py             |  76
-rwxr-xr-x  backend/data/genEolNameData.py                | 184
-rwxr-xr-x  backend/data/genImgs.py                       | 191
-rwxr-xr-x  backend/data/genLinkedImgs.py                 | 125
-rwxr-xr-x  backend/data/genOtolData.py                   | 250
-rwxr-xr-x  backend/data/genReducedTrees.py               | 329
-rw-r--r--  backend/data/otol/README.md                   |  10
-rw-r--r--  backend/data/pickedImgs/README.md             |  10
-rwxr-xr-x  backend/data/reviewImgsToGen.py               | 225
26 files changed, 0 insertions, 3243 deletions
diff --git a/backend/data/README.md b/backend/data/README.md
deleted file mode 100644
index ba64114..0000000
--- a/backend/data/README.md
+++ /dev/null
@@ -1,152 +0,0 @@
-This directory holds files used to generate data.db, which contains tree-of-life data.
-
-# Tables
-## Tree structure data
-- `nodes` <br>
- Format: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br>
- Represents a tree-of-life node. `tips` is the number of leaf (childless) descendants.
-- `edges` <br>
- Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br>
- `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise.
-## Node name data
-- `eol_ids` <br>
- Format: `id INT PRIMARY KEY, name TEXT` <br>
- Associates an EOL ID with a node's name.
-- `names` <br>
- Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` <br>
- Associates a node with alternative names.
- `pref_alt` is 1 if the alt-name is the most 'preferred' one.
- `src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked').
-## Node description data
-- `wiki_ids` <br>
- Format: `name TEXT PRIMARY KEY, id INT, redirected INT` <br>
- Associates a node with a wikipedia page ID.
- `redirected` is 1 if the node was associated with a different page that redirected to this one.
-- `descs` <br>
- Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br>
- Associates a wikipedia page ID with a short-description.
- `from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise.
-## Node image data
-- `node_imgs` <br>
- Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br>
- Associates a node with an image.
-- `images` <br>
- Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` <br>
- Represents an image, identified by a source ('eol', 'enwiki', or 'picked'), and a source-specific ID.
-- `linked_imgs` <br>
- Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br>
- Associates a node with an image from another node.
- `otol_ids` holds a single otol ID or, for compound nodes, two comma-separated
- values, each of which is an otol ID or an empty string.
-## Reduced tree data
-- `nodes_t`, `nodes_i`, `nodes_p` <br>
- These are like `nodes`, but describe the nodes for various reduced trees.
-- `edges_t`, `edges_i`, `edges_p` <br>
- Like `edges` but for reduced trees.
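As an illustration of the schemas above, the core structure tables can be created and queried with Python's sqlite3 module. The sample rows here are made up for demonstration, not real otol data:

```python
import sqlite3

# Build an in-memory database with the core tree-structure tables,
# using the column definitions quoted above.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
cur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")

# Illustrative sample rows (IDs are placeholders)
cur.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("cellular organisms", "ott1", 3),
    ("Eukaryota", "ott2", 2),
    ("Bacteria", "ott3", 1),
])
cur.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("cellular organisms", "Eukaryota", 1),
    ("cellular organisms", "Bacteria", 1),
])

# List a node's children, along with their tip counts
children = cur.execute(
    "SELECT child, tips FROM edges JOIN nodes ON edges.child = nodes.name"
    " WHERE parent = ? ORDER BY child", ("cellular organisms",)).fetchall()
print(children)  # [('Bacteria', 1), ('Eukaryota', 2)]
```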
-
-# Generating the Database
-
-For the most part, these steps should be done in order.
-
-As a warning, the whole process takes a lot of time and file space. The tree will probably
-have about 2.5 million nodes. Downloading the images takes several days, and occupies over
-200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
-
-## Environment
-The scripts are written in python and bash.
-Some of the python scripts require third-party packages:
-- jsonpickle: For encoding class objects as JSON.
-- requests: For downloading data.
-- Pillow (PIL): For image processing.
-- tkinter: For providing a basic GUI to review images (part of the standard library, but may need a separate OS package).
-- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
-
-## Generate tree structure data
-1. Obtain files in otol/, as specified in its README.
-2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
- using data in otol/. It also uses these files, if they exist:
- - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
- tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
- Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
- This file can be used to manually specify which node should be named 'name1'.
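A sketch of parsing this `name1|otolId1` format (the helper name is hypothetical; genOtolData.py may handle it differently):

```python
def parsePickedNames(lines):
    """Parse lines of the form 'name1|otolId1' into a name -> otol-ID dict."""
    nameToId = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # Skip blank lines
            continue
        name, otolId = line.split("|")
        nameToId[name] = otolId
    return nameToId

# Example input line (the ID is a placeholder)
print(parsePickedNames(["pholidota|ott1234"]))  # {'pholidota': 'ott1234'}
```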
-
-## Generate node name data
-1. Obtain 'name data files' in eol/, as specified in its README.
-2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
- eol/ and the `nodes` table. It also uses these files, if they exist:
- - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
- Specifies node names that should have a particular EOL ID, or no ID.
- Quite a few taxa have ambiguous names, and may need manual correction.
- For example, Viola may resolve to a taxon of butterflies or of plants.
- - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
- Specifies that a node's alt-name set should exclude altName1.
-
-## Generate node description data
-### Get data from DBpedia
-1. Obtain files in dbpedia/, as specified in its README.
-2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
- dbpedia/ and the `nodes` table. It also uses these files, if they exist:
- - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
- no description should be obtained. Many node names have a same-name
- wikipedia page that describes something different (eg: Osiris).
- - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
- Specifies node names that should have a particular associated page label.
-### Get data from Wikipedia
-1. Obtain 'description database files' in enwiki/, as specified in it's README.
-2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
- using data in enwiki/ and the `nodes` table.
- It also uses these files, if they exist:
- - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
- - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
-
-## Generate node image data
-### Get images from EOL
-1. Obtain 'image metadata files' in eol/, as specified in it's README.
-2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
- into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
-3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
- each node, offers a choice of which to use, and moves the chosen ones to eol/imgs/.
- Uses `names` and `eol_ids` to display extra info.
-### Get images from Wikipedia
-1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
- using the `wiki_ids` table, and stores them in a database.
-2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
- those images, using wikipedia's online API.
-3. In enwiki/, run downloadImgs.py, which downloads 'permissively-licensed'
- images into enwiki/imgs/.
-### Merge the image sets
-1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
- enables choosing, for each node, which image (if any) should be used,
- and outputs the choices to imgList.txt. Uses the `nodes`,
- `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
-2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
- imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
- `images` tables. If pickedImgs/ is present, images within it are also used. <br>
- The outputs might need to be manually created/adjusted:
- - An input image might have no output produced, possibly due to
- data incompatibilities, memory limits, etc. A few input image files
- might actually be html files, containing a 'file not found' page.
- - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- - An input image might produce output with unexpected dimensions.
- This seems to happen when the image is very large, and triggers a
- decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
-### Add more image associations
-1. Run genLinkedImgs.py, which tries to associate nodes that lack images with
- images of their children. Adds the `linked_imgs` table, and uses the
- `nodes`, `edges`, and `node_imgs` tables.
-
-## Do some post-processing
-1. Run genEnwikiNameData.py, which adds more entries to the `names` table,
- using data in enwiki/, and the `names` and `wiki_ids` tables.
-2. Optionally run addPickedNames.py, which allows adding manually-selected name data to
- the `names` table, as specified in pickedNames.txt.
- - pickedNames.txt: Has lines of the form `nodeName1|altName1|prefAlt1`.
- These correspond to entries in the `names` table. `prefAlt` should be 1 or 0.
- A line like `name1|name1|1` causes a node to have no preferred alt-name.
-3. Run genReducedTrees.py, which generates multiple reduced versions of the tree,
- adding the `nodes_*` and `edges_*` tables, using `nodes` and `names`. Reads from
- pickedNodes.txt, which lists names of nodes that must be included (1 per line).
- The original tree isn't used for web-queries, as some nodes would have over
- 10k children, which can take a while to render (took over a minute in testing).
diff --git a/backend/data/addPickedNames.py b/backend/data/addPickedNames.py
deleted file mode 100755
index d56a0cb..0000000
--- a/backend/data/addPickedNames.py
+++ /dev/null
@@ -1,57 +0,0 @@
-#!/usr/bin/python3
-
-import sys
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads alt-name data from a file, and adds it to the database's 'names' table.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-dbFile = "data.db"
-pickedNamesFile = "pickedNames.txt"
-
-print("Opening database")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-print("Iterating through picked-names file")
-with open(pickedNamesFile) as file:
- for line in file:
- # Get record data
- nodeName, altName, prefAlt = line.lower().rstrip().split("|")
- prefAlt = int(prefAlt)
- # Check whether there exists a node with the name
- row = dbCur.execute("SELECT name FROM nodes WHERE name = ?", (nodeName,)).fetchone()
- if row == None:
- print(f"ERROR: No node with name \"{nodeName}\" exists")
- break
- # Remove any existing preferred-alt status
- if prefAlt == 1:
- query = "SELECT name, alt_name FROM names WHERE name = ? AND pref_alt = 1"
- row = dbCur.execute(query, (nodeName,)).fetchone()
- if row != None and row[1] != altName:
- print(f"Removing pref-alt status from alt-name {row[1]} for {nodeName}")
- dbCur.execute("UPDATE names SET pref_alt = 0 WHERE name = ? AND alt_name = ?", row)
- # Check for an existing record
- if nodeName == altName:
- continue
- query = "SELECT name, alt_name, pref_alt FROM names WHERE name = ? AND alt_name = ?"
- row = dbCur.execute(query, (nodeName, altName)).fetchone()
- if row == None:
- print(f"Adding record for alt-name {altName} for {nodeName}")
- dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'picked')", (nodeName, altName, prefAlt))
- else:
- # Update existing record
- if row[2] != prefAlt:
- print(f"Updating record for alt-name {altName} for {nodeName}")
- dbCur.execute("UPDATE names SET pref_alt = ?, src = 'picked' WHERE name = ? AND alt_name = ?",
- (prefAlt, nodeName, altName))
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
diff --git a/backend/data/dbpedia/README.md b/backend/data/dbpedia/README.md
deleted file mode 100644
index 8a08f20..0000000
--- a/backend/data/dbpedia/README.md
+++ /dev/null
@@ -1,29 +0,0 @@
-This directory holds files obtained from/using [DBpedia](https://www.dbpedia.org).
-
-# Downloaded Files
-- `labels_lang=en.ttl.bz2` <br>
- Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core.
- Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
-- `page_lang=en_ids.ttl.bz2` <br>
- Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2>
-- `redirects_lang=en_transitive.ttl.bz2` <br>
- Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>.
-- `disambiguations_lang=en.ttl.bz2` <br>
- Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>.
-- `instance-types_lang=en_specific.ttl.bz2` <br>
- Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>.
-- `short-abstracts_lang=en.ttl.bz2` <br>
- Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>.
-
-# Other Files
-- genDescData.py <br>
- Used to generate a database representing data from the ttl files.
-- descData.db <br>
- Generated by genDescData.py. <br>
- Tables: <br>
- `labels`: `iri TEXT PRIMARY KEY, label TEXT`
- - `ids`: `iri TEXT PRIMARY KEY, id INT`
- - `redirects`: `iri TEXT PRIMARY KEY, target TEXT`
- - `disambiguations`: `iri TEXT PRIMARY KEY`
- - `types`: `iri TEXT, type TEXT`
- - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT`
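A sketch of how these tables might be combined to resolve a page label to a short description: look up the IRI by label, follow any transitive redirect, skip disambiguation pages, then fetch the abstract. The lookup logic and the sample rows are illustrative assumptions, not necessarily what genDbpData.py does:

```python
import sqlite3

def lookupAbstract(cur, label):
    """Resolve a page label to a short abstract via labels -> redirects -> abstracts."""
    row = cur.execute("SELECT iri FROM labels WHERE label = ?", (label,)).fetchone()
    if row is None:
        return None
    iri = row[0]
    # Follow a transitive redirect, if any
    row = cur.execute("SELECT target FROM redirects WHERE iri = ?", (iri,)).fetchone()
    if row is not None:
        iri = row[0]
    # Skip disambiguation pages
    if cur.execute("SELECT iri FROM disambiguations WHERE iri = ?", (iri,)).fetchone():
        return None
    row = cur.execute("SELECT abstract FROM abstracts WHERE iri = ?", (iri,)).fetchone()
    return row[0] if row else None

# Demo with an in-memory database and made-up rows
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)")
cur.execute("CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)")
cur.execute("CREATE TABLE disambiguations (iri TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)")
cur.execute("INSERT INTO labels VALUES ('dbp:Lion_page', 'Lion')")
cur.execute("INSERT INTO redirects VALUES ('dbp:Lion_page', 'dbp:Lion')")
cur.execute("INSERT INTO abstracts VALUES ('dbp:Lion', 'The lion is a large cat.')")
print(lookupAbstract(cur, "Lion"))  # The lion is a large cat.
```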
diff --git a/backend/data/dbpedia/genDescData.py b/backend/data/dbpedia/genDescData.py
deleted file mode 100755
index d9e8a80..0000000
--- a/backend/data/dbpedia/genDescData.py
+++ /dev/null
@@ -1,130 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import bz2, sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Adds DBpedia labels/types/abstracts/etc data into a database.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-labelsFile = "labels_lang=en.ttl.bz2" # Had about 16e6 entries
-idsFile = "page_lang=en_ids.ttl.bz2"
-redirectsFile = "redirects_lang=en_transitive.ttl.bz2"
-disambigFile = "disambiguations_lang=en.ttl.bz2"
-typesFile = "instance-types_lang=en_specific.ttl.bz2"
-abstractsFile = "short-abstracts_lang=en.ttl.bz2"
-dbFile = "descData.db"
-# In testing, this script took a few hours to run, and generated about 10GB
-
-print("Creating database")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-print("Reading/storing label data")
-dbCur.execute("CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)")
-dbCur.execute("CREATE INDEX labels_idx ON labels(label)")
-dbCur.execute("CREATE INDEX labels_idx_nc ON labels(label COLLATE NOCASE)")
-labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "((?:[^"]|\\")+)"@en \.\n')
-lineNum = 0
-with bz2.open(labelsFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = labelLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- dbCur.execute("INSERT INTO labels VALUES (?, ?)", (match.group(1), match.group(2)))
-
-print("Reading/storing wiki page ids")
-dbCur.execute("CREATE TABLE ids (iri TEXT PRIMARY KEY, id INT)")
-idLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(\d+)".*\n')
-lineNum = 0
-with bz2.open(idsFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = idLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- try:
- dbCur.execute("INSERT INTO ids VALUES (?, ?)", (match.group(1), int(match.group(2))))
- except sqlite3.IntegrityError as e:
- # Accounts for certain lines that have the same IRI
- print(f"WARNING: Failed to add entry with IRI \"{match.group(1)}\": {e}")
-
-print("Reading/storing redirection data")
-dbCur.execute("CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)")
-redirLineRegex = re.compile(r'<([^>]+)> <[^>]+> <([^>]+)> \.\n')
-lineNum = 0
-with bz2.open(redirectsFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = redirLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (match.group(1), match.group(2)))
-
-print("Reading/storing disambiguation-page data")
-dbCur.execute("CREATE TABLE disambiguations (iri TEXT PRIMARY KEY)")
-disambigLineRegex = redirLineRegex
-lineNum = 0
-with bz2.open(disambigFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = disambigLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- dbCur.execute("INSERT OR IGNORE INTO disambiguations VALUES (?)", (match.group(1),))
-
-print("Reading/storing instance-type data")
-dbCur.execute("CREATE TABLE types (iri TEXT, type TEXT)")
-dbCur.execute("CREATE INDEX types_iri_idx ON types(iri)")
-typeLineRegex = redirLineRegex
-lineNum = 0
-with bz2.open(typesFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = typeLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- dbCur.execute("INSERT INTO types VALUES (?, ?)", (match.group(1), match.group(2)))
-
-print("Reading/storing abstracts")
-dbCur.execute("CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)")
-descLineRegex = labelLineRegex
-lineNum = 0
-with bz2.open(abstractsFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- if line[0] == "#":
- continue
- match = descLineRegex.fullmatch(line)
- if match == None:
- raise Exception(f"ERROR: Line {lineNum} has unexpected format")
- dbCur.execute("INSERT INTO abstracts VALUES (?, ?)",
- (match.group(1), match.group(2).replace(r'\"', '"')))
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
deleted file mode 100644
index 90d16c7..0000000
--- a/backend/data/enwiki/README.md
+++ /dev/null
@@ -1,52 +0,0 @@
-This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
-
-# Downloaded Files
-- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
- Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
- Contains text content and metadata for pages in enwiki.
- Some file content and format information was available from
- <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
-- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
- Obtained as above. Holds lines of the form `offset1:pageId1:title1`,
- providing, for each page, the offset into the dump file of the
- 100-page chunk that contains it.
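Since page titles can themselves contain colons, index lines should be split on the first two colons only; a sketch (the helper name is hypothetical):

```python
def parseIndexLine(line):
    """Split an 'offset:pageId:title' index line on the first two colons only,
    since page titles may themselves contain colons."""
    offset, pageId, title = line.rstrip("\n").split(":", maxsplit=2)
    return int(offset), int(pageId), title

print(parseIndexLine("616:10:AccessibleComputing"))
# (616, 10, 'AccessibleComputing')
print(parseIndexLine("616:12:History: a page title with a colon"))
```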
-
-# Generated Dump-Index Files
-- genDumpIndexDb.py <br>
- Creates an sqlite-database version of the enwiki-dump index file.
-- dumpIndex.db <br>
- Generated by genDumpIndexDb.py. <br>
- Tables: <br>
- - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
-
-# Description Database Files
-- genDescData.py <br>
- Reads through pages in the dump file, and adds short-description info to a database.
-- descData.db <br>
- Generated by genDescData.py. <br>
- Tables: <br>
- - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE`
- - `redirects`: `id INT PRIMARY KEY, target TEXT`
- - `descs`: `id INT PRIMARY KEY, desc TEXT`
-
-# Image Database Files
-- genImgData.py <br>
- Used to find infobox image names for page IDs, storing them into a database.
-- downloadImgLicenseInfo.py <br>
- Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database.
-- imgData.db <br>
- Used to hold metadata about infobox images for a set of pageIDs.
- Generated using genImgData.py and downloadImgLicenseInfo.py. <br>
- Tables: <br>
- `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br>
- `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
- - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
- Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
-- downloadImgs.py <br>
- Used to download image files into imgs/.
-
-# Other Files
-- lookupPage.py <br>
- Running `lookupPage.py title1` looks in the dump for a page with a given title,
- and prints the contents to stdout. Uses dumpIndex.db.
-
diff --git a/backend/data/enwiki/downloadImgLicenseInfo.py b/backend/data/enwiki/downloadImgLicenseInfo.py
deleted file mode 100755
index 399922e..0000000
--- a/backend/data/enwiki/downloadImgLicenseInfo.py
+++ /dev/null
@@ -1,150 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import sqlite3, urllib.parse, html
-import requests
-import time, signal
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads image names from a database, and uses enwiki's online API to obtain
-licensing information for them, adding the info to the database.
-
-SIGINT causes the program to finish an ongoing download and exit.
-The program can be re-run to continue downloading, and looks
-at already-processed names to decide what to skip.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-imgDb = "imgData.db"
-apiUrl = "https://en.wikipedia.org/w/api.php"
-userAgent = "terryt.dev (terry06890@gmail.com)"
-batchSz = 50 # Max 50
-tagRegex = re.compile(r"<[^<]+>")
-whitespaceRegex = re.compile(r"\s+")
-
-print("Opening database")
-dbCon = sqlite3.connect(imgDb)
-dbCur = dbCon.cursor()
-dbCur2 = dbCon.cursor()
-print("Checking for table")
-if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='imgs'").fetchone() == None:
- dbCur.execute("CREATE TABLE imgs(" \
- "name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT)")
-
-print("Reading image names")
-imgNames = set()
-for (imgName,) in dbCur.execute("SELECT DISTINCT img_name FROM page_imgs WHERE img_name NOT NULL"):
- imgNames.add(imgName)
-print(f"Found {len(imgNames)}")
-
-print("Checking for already-processed images")
-oldSz = len(imgNames)
-for (imgName,) in dbCur.execute("SELECT name FROM imgs"):
- imgNames.discard(imgName)
-print(f"Found {oldSz - len(imgNames)}")
-
-# Set SIGINT handler
-interrupted = False
-oldHandler = None
-def onSigint(sig, frame):
- global interrupted
- interrupted = True
- signal.signal(signal.SIGINT, oldHandler)
-oldHandler = signal.signal(signal.SIGINT, onSigint)
-
-print("Iterating through image names")
-imgNames = list(imgNames)
-iterNum = 0
-for i in range(0, len(imgNames), batchSz):
- iterNum += 1
- if iterNum % 1 == 0:
- print(f"At iteration {iterNum} (after {(iterNum - 1) * batchSz} images)")
- if interrupted:
- print(f"Exiting loop at iteration {iterNum}")
- break
- # Get batch
- imgBatch = imgNames[i:i+batchSz]
- imgBatch = ["File:" + x for x in imgBatch]
- # Make request
- headers = {
- "user-agent": userAgent,
- "accept-encoding": "gzip",
- }
- params = {
- "action": "query",
- "format": "json",
- "prop": "imageinfo",
- "iiprop": "extmetadata|url",
- "maxlag": "5",
- "titles": "|".join(imgBatch),
- "iiextmetadatafilter": "Artist|Credit|LicenseShortName|Restrictions",
- }
- responseObj = None
- try:
- response = requests.get(apiUrl, params=params, headers=headers)
- responseObj = response.json()
- except Exception as e:
- print(f"ERROR: Exception while downloading info: {e}")
- print(f"\tImage batch: " + "|".join(imgBatch))
- continue
- # Parse response-object
- if "query" not in responseObj or "pages" not in responseObj["query"]:
- print("WARNING: Response object doesn't have page data")
- print("\tImage batch: " + "|".join(imgBatch))
- if "error" in responseObj:
- errorCode = responseObj["error"]["code"]
- print(f"\tError code: {errorCode}")
- if errorCode == "maxlag":
- time.sleep(5)
- continue
- pages = responseObj["query"]["pages"]
- normalisedToInput = {}
- if "normalized" in responseObj["query"]:
- for entry in responseObj["query"]["normalized"]:
- normalisedToInput[entry["to"]] = entry["from"]
- for (_, page) in pages.items():
- # Some fields // More info at https://www.mediawiki.org/wiki/Extension:CommonsMetadata#Returned_data
- # LicenseShortName: short human-readable license name, apparently more reliable than 'License',
- # Artist: author name (might contain complex html, multiple authors, etc)
- # Credit: 'source'
- # For image-map-like images, can be quite large/complex html, crediting each sub-image
- # May be <a href="text1">text2</a>, where the text2 might be non-indicative
- # Restrictions: specifies non-copyright legal restrictions
- title = page["title"]
- if title in normalisedToInput:
- title = normalisedToInput[title]
- title = title[5:] # Remove 'File:'
- if title not in imgNames:
- print(f"WARNING: Got title \"{title}\" not in image-name list")
- continue
- if "imageinfo" not in page:
- print(f"WARNING: No imageinfo section for page \"{title}\"")
- continue
- metadata = page["imageinfo"][0]["extmetadata"]
- url = page["imageinfo"][0]["url"]
- license = metadata['LicenseShortName']['value'] if 'LicenseShortName' in metadata else None
- artist = metadata['Artist']['value'] if 'Artist' in metadata else None
- credit = metadata['Credit']['value'] if 'Credit' in metadata else None
- restrictions = metadata['Restrictions']['value'] if 'Restrictions' in metadata else None
- # Remove markup
- if artist != None:
- artist = tagRegex.sub(" ", artist)
- artist = whitespaceRegex.sub(" ", artist)
- artist = html.unescape(artist)
- artist = urllib.parse.unquote(artist)
- if credit != None:
- credit = tagRegex.sub(" ", credit)
- credit = whitespaceRegex.sub(" ", credit)
- credit = html.unescape(credit)
- credit = urllib.parse.unquote(credit)
- # Add to db
- dbCur2.execute("INSERT INTO imgs VALUES (?, ?, ?, ?, ?, ?)",
- (title, license, artist, credit, restrictions, url))
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
diff --git a/backend/data/enwiki/downloadImgs.py b/backend/data/enwiki/downloadImgs.py
deleted file mode 100755
index 8fb605f..0000000
--- a/backend/data/enwiki/downloadImgs.py
+++ /dev/null
@@ -1,91 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os
-import sqlite3
-import urllib.parse, requests
-import time, signal
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Downloads images from URLs in an image database, into an output directory,
-with names of the form 'pageId1.ext1'.
-
-SIGINT causes the program to finish an ongoing download and exit.
-The program can be re-run to continue downloading, and looks
-in the output directory to decide what to skip.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-imgDb = "imgData.db" # About 130k image names
-outDir = "imgs"
-licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE)
-# In testing, this downloaded about 100k images, over several days
-
-if not os.path.exists(outDir):
- os.mkdir(outDir)
-print("Checking for already-downloaded images")
-fileList = os.listdir(outDir)
-pageIdsDone = set()
-for filename in fileList:
- (basename, extension) = os.path.splitext(filename)
- pageIdsDone.add(int(basename))
-print(f"Found {len(pageIdsDone)}")
-
-# Set SIGINT handler
-interrupted = False
-oldHandler = None
-def onSigint(sig, frame):
- global interrupted
- interrupted = True
- signal.signal(signal.SIGINT, oldHandler)
-oldHandler = signal.signal(signal.SIGINT, onSigint)
-
-print("Opening database")
-dbCon = sqlite3.connect(imgDb)
-dbCur = dbCon.cursor()
-print("Starting downloads")
-iterNum = 0
-query = "SELECT page_id, license, artist, credit, restrictions, url FROM" \
- " imgs INNER JOIN page_imgs ON imgs.name = page_imgs.img_name"
-for (pageId, license, artist, credit, restrictions, url) in dbCur.execute(query):
- if pageId in pageIdsDone:
- continue
- if interrupted:
- print(f"Exiting loop")
- break
- # Check for problematic attributes
- if license == None or licenseRegex.fullmatch(license) == None:
- continue
- if artist == None or artist == "" or len(artist) > 100 or re.match(r"(\d\. )?File:", artist) != None:
- continue
- if credit == None or len(credit) > 300 or re.match(r"File:", credit) != None:
- continue
- if restrictions != None and restrictions != "":
- continue
- # Download image
- iterNum += 1
- print(f"Iteration {iterNum}: Downloading for page-id {pageId}")
- urlParts = urllib.parse.urlparse(url)
- extension = os.path.splitext(urlParts.path)[1]
- if len(extension) <= 1:
- print(f"ERROR: No filename extension found in URL {url}")
- sys.exit(1)
- outFile = f"{outDir}/{pageId}{extension}"
- headers = {
- "user-agent": "terryt.dev (terry06890@gmail.com)",
- "accept-encoding": "gzip",
- }
- try:
- response = requests.get(url, headers=headers)
- with open(outFile, 'wb') as file:
- file.write(response.content)
- time.sleep(1)
- # https://en.wikipedia.org/wiki/Wikipedia:Database_download says to "throttle self to 1 cache miss per sec"
- # It's unclear how to properly check for cache misses, so this just aims for 1 per sec
- except Exception as e:
- print(f"Error while downloading to {outFile}: {e}")
-print("Closing database")
-dbCon.close()
diff --git a/backend/data/enwiki/genDescData.py b/backend/data/enwiki/genDescData.py
deleted file mode 100755
index b0ca272..0000000
--- a/backend/data/enwiki/genDescData.py
+++ /dev/null
@@ -1,127 +0,0 @@
-#!/usr/bin/python3
-
-import sys, os, re
-import bz2
-import html, mwxml, mwparserfromhell
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads through the wiki dump, and attempts to
-parse short-descriptions, and add them to a database.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # Had about 22e6 pages
-enwikiDb = "descData.db"
-# In testing, this script took over 10 hours to run, and generated about 5GB
-
-descLineRegex = re.compile("^ *[A-Z'\"]")
-embeddedHtmlRegex = re.compile(r"<[^<]+/>|<!--[^<]+-->|<[^</]+>([^<]*|[^<]*<[^<]+>[^<]*)</[^<]+>|<[^<]+$")
- # Recognises a self-closing HTML tag, a tag with 0 children, tag with 1 child with 0 children, or unclosed tag
-convertTemplateRegex = re.compile(r"{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}")
-def convertTemplateReplace(match):
- if match.group(2) == None:
- return f"{match.group(1)} {match.group(4)}"
- else:
- return f"{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}"
-parensGroupRegex = re.compile(r" \([^()]*\)")
-leftoverBraceRegex = re.compile(r"(?:{\||{{).*")
-
-def parseDesc(text):
- # Find first matching line outside {{...}}, [[...]], and block-html-comment constructs,
- # and then accumulate lines until a blank one.
- # Some cases not accounted for include: disambiguation pages, abstracts with sentences split-across-lines,
- # nested embedded html, 'content significant' embedded-html, and markup not removable with mwparserfromhell.
- lines = []
- openBraceCount = 0
- openBracketCount = 0
- inComment = False
- skip = False
- for line in text.splitlines():
- line = line.strip()
- if len(lines) == 0:
- if len(line) > 0:
- if openBraceCount > 0 or line[0] == "{":
- openBraceCount += line.count("{")
- openBraceCount -= line.count("}")
- skip = True
- if openBracketCount > 0 or line[0] == "[":
- openBracketCount += line.count("[")
- openBracketCount -= line.count("]")
- skip = True
- if inComment or line.find("<!--") != -1:
- if line.find("-->") != -1:
- if inComment:
- inComment = False
- skip = True
- else:
- inComment = True
- skip = True
- if skip:
- skip = False
- continue
- if line[-1] == ":": # Seems to help avoid disambiguation pages
- return None
- if descLineRegex.match(line) != None:
- lines.append(line)
- else:
- if len(line) == 0:
- return removeMarkup(" ".join(lines))
- lines.append(line)
- if len(lines) > 0:
- return removeMarkup(" ".join(lines))
- return None
-def removeMarkup(content):
- content = embeddedHtmlRegex.sub("", content)
- content = convertTemplateRegex.sub(convertTemplateReplace, content)
- content = mwparserfromhell.parse(content).strip_code() # Remove wikitext markup
- content = parensGroupRegex.sub("", content)
- content = leftoverBraceRegex.sub("", content)
- return content
-def convertTitle(title):
- return html.unescape(title).replace("_", " ")
-
-print("Creating database")
-if os.path.exists(enwikiDb):
- raise Exception(f"ERROR: Existing {enwikiDb}")
-dbCon = sqlite3.connect(enwikiDb)
-dbCur = dbCon.cursor()
-dbCur.execute("CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE)")
-dbCur.execute("CREATE INDEX pages_title_idx ON pages(title COLLATE NOCASE)")
-dbCur.execute("CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT)")
-dbCur.execute("CREATE INDEX redirects_idx ON redirects(target)")
-dbCur.execute("CREATE TABLE descs (id INT PRIMARY KEY, desc TEXT)")
-
-print("Iterating through dump file")
-with bz2.open(dumpFile, mode='rt') as file:
- dump = mwxml.Dump.from_file(file)
- pageNum = 0
- for page in dump:
- pageNum += 1
- if pageNum % 1e4 == 0:
- print(f"At page {pageNum}")
-		if pageNum > 3e4: # Testing cap; remove this break to process the full dump
-			break
- # Parse page
- if page.namespace == 0:
- try:
- dbCur.execute("INSERT INTO pages VALUES (?, ?)", (page.id, convertTitle(page.title)))
- except sqlite3.IntegrityError as e:
- # Accounts for certain pages that have the same title
- print(f"Failed to add page with title \"{page.title}\": {e}", file=sys.stderr)
- continue
- if page.redirect != None:
- dbCur.execute("INSERT INTO redirects VALUES (?, ?)", (page.id, convertTitle(page.redirect)))
- else:
- revision = next(page)
- desc = parseDesc(revision.text)
- if desc != None:
- dbCur.execute("INSERT INTO descs VALUES (?, ?)", (page.id, desc))
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
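The `{{convert|...}}` handling in the deleted script above can be sketched in isolation. This reproduces its two regex branches (a plain value-plus-unit, and a range joined by `to` or `-`), applied to made-up wikitext:

```python
import re

# Same pattern as in genDescData.py: captures the value, an optional
# "to"/"-" range, and the unit, ignoring trailing template arguments
convertTemplateRegex = re.compile(
	r"{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}")

def convertTemplateReplace(match):
	if match.group(2) is None:
		return f"{match.group(1)} {match.group(4)}"
	return f"{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}"

text = "Grows to {{convert|5|to|10|cm|in}} long"
print(convertTemplateRegex.sub(convertTemplateReplace, text))
# Grows to 5 to 10 cm long
```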
diff --git a/backend/data/enwiki/genDumpIndexDb.py b/backend/data/enwiki/genDumpIndexDb.py
deleted file mode 100755
index 3955885..0000000
--- a/backend/data/enwiki/genDumpIndexDb.py
+++ /dev/null
@@ -1,58 +0,0 @@
-#!/usr/bin/python3
-
-import sys, os, re
-import bz2
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Adds data from the wiki dump index-file into a database.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-indexFile = "enwiki-20220501-pages-articles-multistream-index.txt.bz2" # Had about 22e6 lines
-indexDb = "dumpIndex.db"
-
-if os.path.exists(indexDb):
- raise Exception(f"ERROR: Existing {indexDb}")
-print("Creating database")
-dbCon = sqlite3.connect(indexDb)
-dbCur = dbCon.cursor()
-dbCur.execute("CREATE TABLE offsets (title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT)")
-
-print("Iterating through index file")
-lineRegex = re.compile(r"([^:]+):([^:]+):(.*)")
-lastOffset = 0
-lineNum = 0
-entriesToAdd = []
-with bz2.open(indexFile, mode='rt') as file:
- for line in file:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- #
- match = lineRegex.fullmatch(line.rstrip())
- (offset, pageId, title) = match.group(1,2,3)
- offset = int(offset)
- if offset > lastOffset:
- for (t, p) in entriesToAdd:
- try:
- dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (t, p, lastOffset, offset))
- except sqlite3.IntegrityError as e:
- # Accounts for certain entries in the file that have the same title
- print(f"Failed on title \"{t}\": {e}", file=sys.stderr)
- entriesToAdd = []
- lastOffset = offset
- entriesToAdd.append([title, pageId])
-for (title, pageId) in entriesToAdd:
- try:
- dbCur.execute("INSERT INTO offsets VALUES (?, ?, ?, ?)", (title, pageId, lastOffset, -1))
- except sqlite3.IntegrityError as e:
-		print(f"Failed on title \"{title}\": {e}", file=sys.stderr)
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
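Each line of the multistream index file has the form `offset:pageId:title`; the first two fields cannot contain colons, while the title can, which is why `lineRegex` uses a greedy tail group. A minimal sketch with a made-up line:

```python
import re

# Same pattern as in genDumpIndexDb.py
lineRegex = re.compile(r"([^:]+):([^:]+):(.*)")

# Hypothetical index line; real lines come from the bz2 index file
match = lineRegex.fullmatch("600:12:History: A Very Short Introduction")
offset, pageId, title = match.group(1, 2, 3)
print(int(offset), int(pageId), title)
```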
diff --git a/backend/data/enwiki/genImgData.py b/backend/data/enwiki/genImgData.py
deleted file mode 100755
index dedfe14..0000000
--- a/backend/data/enwiki/genImgData.py
+++ /dev/null
@@ -1,190 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import bz2, html, urllib.parse
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-For some set of page IDs, looks up their content in the wiki dump,
-and tries to parse infobox image names, storing them into a database.
-
-The program can be re-run with an updated set of page IDs, and
-will skip already-processed page IDs.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-def getInputPageIds():
- pageIds = set()
- dbCon = sqlite3.connect("../data.db")
- dbCur = dbCon.cursor()
- for (pageId,) in dbCur.execute("SELECT id from wiki_ids"):
- pageIds.add(pageId)
- dbCon.close()
- return pageIds
-dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
-indexDb = "dumpIndex.db"
-imgDb = "imgData.db" # The database to create
-idLineRegex = re.compile(r"<id>(.*)</id>")
-imageLineRegex = re.compile(r".*\| *image *= *([^|]*)")
-bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]")
-imageNameRegex = re.compile(r".*\.(jpg|jpeg|png|gif|tiff|tif)", flags=re.IGNORECASE)
-cssImgCropRegex = re.compile(r"{{css image crop\|image *= *(.*)", flags=re.IGNORECASE)
-# In testing, got about 360k image names
-
-print("Getting input page-ids")
-pageIds = getInputPageIds()
-print(f"Found {len(pageIds)}")
-
-print("Opening databases")
-indexDbCon = sqlite3.connect(indexDb)
-indexDbCur = indexDbCon.cursor()
-imgDbCon = sqlite3.connect(imgDb)
-imgDbCur = imgDbCon.cursor()
-print("Checking tables")
-if imgDbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='page_imgs'").fetchone() == None:
- # Create tables if not present
- imgDbCur.execute("CREATE TABLE page_imgs (page_id INT PRIMARY KEY, img_name TEXT)") # img_name may be NULL
- imgDbCur.execute("CREATE INDEX page_imgs_idx ON page_imgs(img_name)")
-else:
- # Check for already-processed page IDs
- numSkipped = 0
- for (pid,) in imgDbCur.execute("SELECT page_id FROM page_imgs"):
- if pid in pageIds:
- pageIds.remove(pid)
- numSkipped += 1
- else:
- print(f"WARNING: Found already-processed page ID {pid} which was not in input set")
- print(f"Will skip {numSkipped} already-processed page IDs")
-
-print("Getting dump-file offsets")
-offsetToPageids = {}
-offsetToEnd = {} # Maps chunk-start offsets to their chunk-end offsets
-iterNum = 0
-for pageId in pageIds:
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- query = "SELECT offset, next_offset FROM offsets WHERE id = ?"
- row = indexDbCur.execute(query, (pageId,)).fetchone()
- if row == None:
- print(f"WARNING: Page ID {pageId} not found")
- continue
- (chunkOffset, endOffset) = row
- offsetToEnd[chunkOffset] = endOffset
- if chunkOffset not in offsetToPageids:
- offsetToPageids[chunkOffset] = []
- offsetToPageids[chunkOffset].append(pageId)
-print(f"Found {len(offsetToEnd)} chunks to check")
-
-print("Iterating through chunks in dump file")
-def getImageName(content):
-	" Given an array of text-content lines, tries to return an infobox image name, or None "
-	# Doesn't try to find images in [[File:...]] and <imagemap> sections outside the infobox
- for line in content:
- match = imageLineRegex.match(line)
- if match != None:
- imageName = match.group(1).strip()
- if imageName == "":
- return None
- imageName = html.unescape(imageName)
- # Account for {{...
- if imageName.startswith("{"):
- match = cssImgCropRegex.match(imageName)
- if match == None:
- return None
- imageName = match.group(1)
- # Account for [[File:...|...]]
- if imageName.startswith("["):
- match = bracketImageRegex.match(imageName)
- if match == None:
- return None
- imageName = match.group(1)
- # Account for <!--
- if imageName.find("<!--") != -1:
- return None
- # Remove an initial 'File:'
- if imageName.startswith("File:"):
- imageName = imageName[5:]
- # Remove an initial 'Image:'
- if imageName.startswith("Image:"):
- imageName = imageName[6:]
- # Check for extension
- match = imageNameRegex.match(imageName)
- if match != None:
- imageName = match.group(0)
- imageName = urllib.parse.unquote(imageName)
- imageName = html.unescape(imageName) # Intentionally unescaping again (handles some odd cases)
- imageName = imageName.replace("_", " ")
- return imageName
- # Exclude lines like: | image = &lt;imagemap&gt;
- return None
- return None
-with open(dumpFile, mode='rb') as file:
- iterNum = 0
- for (pageOffset, endOffset) in offsetToEnd.items():
- iterNum += 1
- if iterNum % 100 == 0:
- print(f"At iteration {iterNum}")
- #
- pageIds = offsetToPageids[pageOffset]
- # Jump to chunk
- file.seek(pageOffset)
- compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset)
- data = bz2.BZ2Decompressor().decompress(compressedData).decode()
- # Look in chunk for pages
- lines = data.splitlines()
- lineIdx = 0
- while lineIdx < len(lines):
- # Look for <page>
- if lines[lineIdx].lstrip() != "<page>":
- lineIdx += 1
- continue
- # Check page id
- lineIdx += 3
- idLine = lines[lineIdx].lstrip()
- match = idLineRegex.fullmatch(idLine)
- if match == None or int(match.group(1)) not in pageIds:
- lineIdx += 1
- continue
- pageId = int(match.group(1))
- lineIdx += 1
- # Look for <text> in <page>
- foundText = False
- while lineIdx < len(lines):
- if not lines[lineIdx].lstrip().startswith("<text "):
- lineIdx += 1
- continue
- foundText = True
- # Get text content
- content = []
- line = lines[lineIdx]
- content.append(line[line.find(">") + 1:])
- lineIdx += 1
- foundTextEnd = False
- while lineIdx < len(lines):
- line = lines[lineIdx]
- if not line.endswith("</text>"):
- content.append(line)
- lineIdx += 1
- continue
- foundTextEnd = True
- content.append(line[:line.rfind("</text>")])
- # Look for image-filename
- imageName = getImageName(content)
- imgDbCur.execute("INSERT into page_imgs VALUES (?, ?)", (pageId, imageName))
- break
- if not foundTextEnd:
- print(f"WARNING: Did not find </text> for page id {pageId}")
- break
- if not foundText:
- print(f"WARNING: Did not find <text> for page id {pageId}")
-
-print("Closing databases")
-indexDbCon.close()
-imgDbCon.commit()
-imgDbCon.close()
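The chunk handling above relies on the dump being a concatenation of independent bz2 streams: seeking to an index offset lands on a stream boundary, and a fresh `BZ2Decompressor` stops at the end of that one stream. A self-contained sketch with toy data (not a real dump):

```python
import bz2

# Two independent bz2 streams concatenated, like the multistream dump
chunk1 = bz2.compress(b"<page>first</page>")
chunk2 = bz2.compress(b"<page>second</page>")
dump = chunk1 + chunk2

# Equivalent of file.seek(chunkOffset) plus a bounded read: slicing at the
# second stream's offset and decompressing yields only that stream's pages
offset = len(chunk1)
data = bz2.BZ2Decompressor().decompress(dump[offset:])
print(data.decode())  # <page>second</page>
```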
diff --git a/backend/data/enwiki/lookupPage.py b/backend/data/enwiki/lookupPage.py
deleted file mode 100755
index 1a90851..0000000
--- a/backend/data/enwiki/lookupPage.py
+++ /dev/null
@@ -1,68 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import bz2
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]} title1
-
-Looks up a page with title title1 in the wiki dump, using
-the dump-index db, and prints the corresponding <page>.
-"""
-if len(sys.argv) != 2:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
-indexDb = "dumpIndex.db"
-pageTitle = sys.argv[1].replace("_", " ")
-
-print("Looking up offset in index db")
-dbCon = sqlite3.connect(indexDb)
-dbCur = dbCon.cursor()
-query = "SELECT title, offset, next_offset FROM offsets WHERE title = ?"
-row = dbCur.execute(query, (pageTitle,)).fetchone()
-if row == None:
- print("Title not found")
- sys.exit(0)
-_, pageOffset, endOffset = row
-dbCon.close()
-print(f"Found chunk at offset {pageOffset}")
-
-print("Reading from wiki dump")
-content = []
-with open(dumpFile, mode='rb') as file:
- # Get uncompressed chunk
- file.seek(pageOffset)
- compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset)
- data = bz2.BZ2Decompressor().decompress(compressedData).decode()
- # Look in chunk for page
- lines = data.splitlines()
- lineIdx = 0
- found = False
- pageNum = 0
- while not found:
- line = lines[lineIdx]
- if line.lstrip() == "<page>":
- pageNum += 1
- if pageNum > 100:
- print("ERROR: Did not find title after 100 pages")
- break
- lineIdx += 1
- titleLine = lines[lineIdx]
- if titleLine.lstrip() == '<title>' + pageTitle + '</title>':
- found = True
- print(f"Found title in chunk as page {pageNum}")
- content.append(line)
- content.append(titleLine)
- while True:
- lineIdx += 1
- line = lines[lineIdx]
- content.append(line)
- if line.lstrip() == "</page>":
- break
- lineIdx += 1
-
-print("Content: ")
-print("\n".join(content))
diff --git a/backend/data/eol/README.md b/backend/data/eol/README.md
deleted file mode 100644
index 8c527a8..0000000
--- a/backend/data/eol/README.md
+++ /dev/null
@@ -1,26 +0,0 @@
-This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/).
-
-# Name Data Files
-- vernacularNames.csv <br>
- Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020).
- Contains alternative-name data from EOL.
-
-# Image Metadata Files
-- imagesList.tgz <br>
- Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020).
- Contains metadata for images from EOL.
-- imagesList/ <br>
- Extracted from imagesList.tgz.
-- genImagesListDb.sh <br>
- Creates a database, and imports imagesList/*.csv files into it.
-- imagesList.db <br>
- Created by running genImagesListDb.sh <br>
- Tables: <br>
- - `images`:
- `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT`
-
-# Image Generation Files
-- downloadImgs.py <br>
- Used to download image files into imgsForReview/.
-- reviewImgs.py <br>
- Used to review images in imgsForReview/, moving acceptable ones into imgs/.
diff --git a/backend/data/eol/downloadImgs.py b/backend/data/eol/downloadImgs.py
deleted file mode 100755
index 96bc085..0000000
--- a/backend/data/eol/downloadImgs.py
+++ /dev/null
@@ -1,147 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os, random
-import sqlite3
-import urllib.parse, requests
-import time
-from threading import Thread
-import signal
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-For some set of EOL IDs, downloads associated images from URLs in
-an image-list database. Uses multiple downloading threads.
-
-May obtain multiple images per ID. The images will get names
-with the form 'eolId1 contentId1.ext1'.
-
-SIGINT causes the program to finish ongoing downloads and exit.
-The program can be re-run to continue downloading. It looks for
-already-downloaded files, and continues after the one with
-highest EOL ID.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-# In testing, this downloaded about 70k images, over a few days
-
-imagesListDb = "imagesList.db"
-def getInputEolIds():
- eolIds = set()
- dbCon = sqlite3.connect("../data.db")
- dbCur = dbCon.cursor()
- for (id,) in dbCur.execute("SELECT id FROM eol_ids"):
- eolIds.add(id)
- dbCon.close()
- return eolIds
-outDir = "imgsForReview/"
-MAX_IMGS_PER_ID = 3
-MAX_THREADS = 5
-POST_DL_DELAY_MIN = 2 # Minimum delay in seconds to pause after download before starting another (for each thread)
-POST_DL_DELAY_MAX = 3
-LICENSE_REGEX = r"cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain"
-
-print("Getting input EOL IDs")
-eolIds = getInputEolIds()
-print("Getting EOL IDs to download for")
-# Get IDs from images-list db
-imgDbCon = sqlite3.connect(imagesListDb)
-imgCur = imgDbCon.cursor()
-imgListIds = set()
-for (pageId,) in imgCur.execute("SELECT DISTINCT page_id FROM images"):
- imgListIds.add(pageId)
-# Get set intersection, and sort into list
-eolIds = eolIds.intersection(imgListIds)
-eolIds = sorted(eolIds)
-print(f"Result: {len(eolIds)} EOL IDs")
-
-print("Checking output directory")
-if not os.path.exists(outDir):
- os.mkdir(outDir)
-print("Finding next ID to download for")
-nextIdx = 0
-fileList = os.listdir(outDir)
-ids = [int(filename.split(" ")[0]) for filename in fileList]
-if len(ids) > 0:
- ids.sort()
- nextIdx = eolIds.index(ids[-1]) + 1
-if nextIdx == len(eolIds):
- print("No IDs left. Exiting...")
- sys.exit(0)
-
-print("Starting download threads")
-numThreads = 0
-threadException = None # Used for ending main thread after a non-main thread exception
-# Handle SIGINT signals
-interrupted = False
-oldHandler = None
-def onSigint(sig, frame):
- global interrupted
- interrupted = True
- signal.signal(signal.SIGINT, oldHandler)
-oldHandler = signal.signal(signal.SIGINT, onSigint)
-# Function for threads to execute
-def downloadImg(url, outFile):
- global numThreads, threadException
- try:
- data = requests.get(url)
- with open(outFile, 'wb') as file:
- file.write(data.content)
- time.sleep(random.random() * (POST_DL_DELAY_MAX - POST_DL_DELAY_MIN) + POST_DL_DELAY_MIN)
- except Exception as e:
- print(f"Error while downloading to {outFile}: {str(e)}", file=sys.stderr)
- threadException = e
- numThreads -= 1
-# Manage downloading
-for idx in range(nextIdx, len(eolIds)):
- eolId = eolIds[idx]
- # Get image urls
- imgDataList = []
- ownerSet = set() # Used to get images from different owners, for variety
- exitLoop = False
- query = "SELECT content_id, copy_url, license, copyright_owner FROM images WHERE page_id = ?"
- for (contentId, url, license, copyrightOwner) in imgCur.execute(query, (eolId,)):
- if url.startswith("data/"):
- url = "https://content.eol.org/" + url
- urlParts = urllib.parse.urlparse(url)
- extension = os.path.splitext(urlParts.path)[1]
- if len(extension) <= 1:
- print(f"WARNING: No filename extension found in URL {url}", file=sys.stderr)
- continue
- # Check image-quantity limit
- if len(ownerSet) == MAX_IMGS_PER_ID:
- break
- # Check for skip conditions
- if re.fullmatch(LICENSE_REGEX, license) == None:
- continue
- if len(copyrightOwner) > 100: # Avoid certain copyrightOwner fields that seem long and problematic
- continue
- if copyrightOwner in ownerSet:
- continue
- ownerSet.add(copyrightOwner)
- # Determine output filename
- outPath = f"{outDir}{eolId} {contentId}{extension}"
- if os.path.exists(outPath):
- print(f"WARNING: {outPath} already exists. Skipping download.")
- continue
- # Check thread limit
- while numThreads == MAX_THREADS:
- time.sleep(1)
- # Wait for threads after an interrupt or thread-exception
- if interrupted or threadException != None:
- print("Waiting for existing threads to end")
- while numThreads > 0:
- time.sleep(1)
- exitLoop = True
- break
- # Perform download
- print(f"Downloading image to {outPath}")
- numThreads += 1
- thread = Thread(target=downloadImg, args=(url, outPath), daemon=True)
- thread.start()
- if exitLoop:
- break
-# Close images-list db
-print("Finished downloading")
-imgDbCon.close()
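The `LICENSE_REGEX` filter above accepts CC-BY variants (with optional `-nc`, `-sa`, and version suffixes) plus public-domain markers, and skips everything else. A sketch over a few sample license strings (the samples are illustrative, not drawn from the EOL data):

```python
import re

# Same pattern as in downloadImgs.py
LICENSE_REGEX = r"cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain"

samples = ["cc-by-sa-4.0", "cc-by-nc", "public domain", "all rights reserved"]
for lic in samples:
	# fullmatch, so partial matches like "cc-by-..." inside a longer string don't pass
	print(lic, re.fullmatch(LICENSE_REGEX, lic) is not None)
```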
diff --git a/backend/data/eol/genImagesListDb.sh b/backend/data/eol/genImagesListDb.sh
deleted file mode 100755
index 87dd840..0000000
--- a/backend/data/eol/genImagesListDb.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/bin/bash
-set -e
-
-# Combine CSV files into one, skipping header lines
-cat imagesList/media_*_{1..58}.csv | tail -n +2 > imagesList.csv
-# Create database, and import the CSV file
-sqlite3 imagesList.db <<END
-CREATE TABLE images (
- content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT);
-.mode csv
-.import 'imagesList.csv' images
-END
diff --git a/backend/data/eol/reviewImgs.py b/backend/data/eol/reviewImgs.py
deleted file mode 100755
index ecdf7ab..0000000
--- a/backend/data/eol/reviewImgs.py
+++ /dev/null
@@ -1,205 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os, time
-import sqlite3
-import tkinter as tki
-from tkinter import ttk
-import PIL
-from PIL import ImageTk, Image, ImageOps
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Provides a GUI for reviewing images. Looks in a for-review directory for
-images named 'eolId1 contentId1.ext1', and, for each EOL ID, enables the user to
-choose an image to keep, or reject all. Also provides image rotation.
-Chosen images are placed in another directory, and rejected ones are deleted.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-imgDir = "imgsForReview/"
-outDir = "imgs/"
-extraInfoDbCon = sqlite3.connect("../data.db")
-extraInfoDbCur = extraInfoDbCon.cursor()
-def getExtraInfo(eolId):
- global extraInfoDbCur
- query = "SELECT names.alt_name FROM" \
- " names INNER JOIN eol_ids ON eol_ids.name = names.name" \
- " WHERE id = ? and pref_alt = 1"
- row = extraInfoDbCur.execute(query, (eolId,)).fetchone()
- if row != None:
- return f"Reviewing EOL ID {eolId}, aka \"{row[0]}\""
- else:
- return f"Reviewing EOL ID {eolId}"
-IMG_DISPLAY_SZ = 400
-MAX_IMGS_PER_ID = 3
-IMG_BG_COLOR = (88, 28, 135)
-PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), IMG_BG_COLOR)
-
-print("Checking output directory")
-if not os.path.exists(outDir):
- os.mkdir(outDir)
-print("Getting input image list")
-imgList = os.listdir(imgDir)
-imgList.sort(key=lambda s: int(s.split(" ")[0]))
-if len(imgList) == 0:
- print("No input images found")
- sys.exit(0)
-
-class EolImgReviewer:
- " Provides the GUI for reviewing images "
- def __init__(self, root, imgList):
- self.root = root
- root.title("EOL Image Reviewer")
- # Setup main frame
- mainFrame = ttk.Frame(root, padding="5 5 5 5")
- mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S))
- root.columnconfigure(0, weight=1)
- root.rowconfigure(0, weight=1)
- # Set up images-to-be-reviewed frames
- self.imgs = [PLACEHOLDER_IMG] * MAX_IMGS_PER_ID # Stored as fields for use in rotation
- self.photoImgs = list(map(lambda img: ImageTk.PhotoImage(img), self.imgs)) # Image objects usable by tkinter
-		# Persistent references are needed; tkinter does not keep its own, and the images are otherwise garbage-collected and fail to display
- self.labels = []
- for i in range(MAX_IMGS_PER_ID):
- frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ)
- frame.grid(column=i, row=0)
- label = ttk.Label(frame, image=self.photoImgs[i])
- label.grid(column=0, row=0)
- self.labels.append(label)
- # Add padding
- for child in mainFrame.winfo_children():
- child.grid_configure(padx=5, pady=5)
- # Add keyboard bindings
- root.bind("<q>", self.quit)
- root.bind("<Key-j>", lambda evt: self.accept(0))
- root.bind("<Key-k>", lambda evt: self.accept(1))
- root.bind("<Key-l>", lambda evt: self.accept(2))
- root.bind("<Key-i>", lambda evt: self.reject())
- root.bind("<Key-a>", lambda evt: self.rotate(0))
- root.bind("<Key-s>", lambda evt: self.rotate(1))
- root.bind("<Key-d>", lambda evt: self.rotate(2))
- root.bind("<Key-A>", lambda evt: self.rotate(0, True))
- root.bind("<Key-S>", lambda evt: self.rotate(1, True))
- root.bind("<Key-D>", lambda evt: self.rotate(2, True))
- # Initialise images to review
- self.imgList = imgList
- self.imgListIdx = 0
- self.nextEolId = 0
- self.nextImgNames = []
- self.rotations = []
- self.getNextImgs()
- # For displaying extra info
- self.numReviewed = 0
- self.startTime = time.time()
- def getNextImgs(self):
- " Updates display with new images to review, or ends program "
- # Gather names of next images to review
- for i in range(MAX_IMGS_PER_ID):
- if self.imgListIdx == len(self.imgList):
- if i == 0:
- self.quit()
- return
- break
- imgName = self.imgList[self.imgListIdx]
- eolId = int(re.match(r"(\d+) (\d+)", imgName).group(1))
- if i == 0:
- self.nextEolId = eolId
- self.nextImgNames = [imgName]
- self.rotations = [0]
- else:
- if self.nextEolId != eolId:
- break
- self.nextImgNames.append(imgName)
- self.rotations.append(0)
- self.imgListIdx += 1
- # Update displayed images
- idx = 0
- while idx < MAX_IMGS_PER_ID:
- if idx < len(self.nextImgNames):
- try:
- img = Image.open(imgDir + self.nextImgNames[idx])
- img = ImageOps.exif_transpose(img)
- except PIL.UnidentifiedImageError:
- os.remove(imgDir + self.nextImgNames[idx])
- del self.nextImgNames[idx]
- del self.rotations[idx]
- continue
- self.imgs[idx] = self.resizeImgForDisplay(img)
- else:
- self.imgs[idx] = PLACEHOLDER_IMG
- self.photoImgs[idx] = ImageTk.PhotoImage(self.imgs[idx])
- self.labels[idx].config(image=self.photoImgs[idx])
- idx += 1
- # Restart if all image files non-recognisable
- if len(self.nextImgNames) == 0:
- self.getNextImgs()
- return
- # Update title
- firstImgIdx = self.imgListIdx - len(self.nextImgNames) + 1
- lastImgIdx = self.imgListIdx
- title = getExtraInfo(self.nextEolId)
- title += f" (imgs {firstImgIdx} to {lastImgIdx} out of {len(self.imgList)})"
- self.root.title(title)
- def accept(self, imgIdx):
- " React to a user selecting an image "
- if imgIdx >= len(self.nextImgNames):
- print("Invalid selection")
- return
- for i in range(len(self.nextImgNames)):
- inFile = imgDir + self.nextImgNames[i]
- if i == imgIdx: # Move accepted image, rotating if needed
- outFile = outDir + self.nextImgNames[i]
- img = Image.open(inFile)
- img = ImageOps.exif_transpose(img)
- if self.rotations[i] != 0:
- img = img.rotate(self.rotations[i], expand=True)
- img.save(outFile)
- os.remove(inFile)
- else: # Delete non-accepted image
- os.remove(inFile)
- self.numReviewed += 1
- self.getNextImgs()
- def reject(self):
- " React to a user rejecting all images of a set "
- for i in range(len(self.nextImgNames)):
- os.remove(imgDir + self.nextImgNames[i])
- self.numReviewed += 1
- self.getNextImgs()
- def rotate(self, imgIdx, anticlockwise = False):
- " Respond to a user rotating an image "
- deg = -90 if not anticlockwise else 90
- self.imgs[imgIdx] = self.imgs[imgIdx].rotate(deg)
- self.photoImgs[imgIdx] = ImageTk.PhotoImage(self.imgs[imgIdx])
- self.labels[imgIdx].config(image=self.photoImgs[imgIdx])
- self.rotations[imgIdx] = (self.rotations[imgIdx] + deg) % 360
- def quit(self, e = None):
- global extraInfoDbCon
- print(f"Number reviewed: {self.numReviewed}")
- timeElapsed = time.time() - self.startTime
- print(f"Time elapsed: {timeElapsed:.2f} seconds")
- if self.numReviewed > 0:
- print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds")
- extraInfoDbCon.close()
- self.root.destroy()
- def resizeImgForDisplay(self, img):
-		" Returns a copy of an image, shrunk to fit its frame (keeps aspect ratio), and with a background "
- if max(img.width, img.height) > IMG_DISPLAY_SZ:
- if (img.width > img.height):
- newHeight = int(img.height * IMG_DISPLAY_SZ/img.width)
- img = img.resize((IMG_DISPLAY_SZ, newHeight))
- else:
- newWidth = int(img.width * IMG_DISPLAY_SZ / img.height)
- img = img.resize((newWidth, IMG_DISPLAY_SZ))
- bgImg = PLACEHOLDER_IMG.copy()
- bgImg.paste(img, box=(
- int((IMG_DISPLAY_SZ - img.width) / 2),
- int((IMG_DISPLAY_SZ - img.height) / 2)))
- return bgImg
-# Create GUI and defer control
-print("Starting GUI")
-root = tki.Tk()
-EolImgReviewer(root, imgList)
-root.mainloop()
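The `resizeImgForDisplay` logic reduces to a small pure function: shrink to fit a square of side `sz`, preserving aspect ratio and never upscaling. Extracted as a sketch (`fitDimensions` is a name introduced here, not in the script):

```python
def fitDimensions(width, height, sz):
	# Shrink (width, height) to fit a sz-by-sz square, keeping aspect ratio;
	# images already within the square are left unchanged
	if max(width, height) <= sz:
		return (width, height)
	if width > height:
		return (sz, int(height * sz / width))
	return (int(width * sz / height), sz)

print(fitDimensions(800, 600, 400))  # (400, 300)
```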
diff --git a/backend/data/genDbpData.py b/backend/data/genDbpData.py
deleted file mode 100755
index df3a6be..0000000
--- a/backend/data/genDbpData.py
+++ /dev/null
@@ -1,247 +0,0 @@
-#!/usr/bin/python3
-
-import sys, os, re
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads a database containing data from DBpedia, and tries to associate
-DBpedia IRIs with nodes in a database, adding short-descriptions for them.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-dbpediaDb = "dbpedia/descData.db"
-namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
-pickedLabelsFile = "pickedDbpLabels.txt"
-dbFile = "data.db"
-rootNodeName = "cellular organisms"
-rootLabel = "organism" # Will be associated with root node
-# Got about 400k descriptions when testing
-
-print("Opening databases")
-dbpCon = sqlite3.connect(dbpediaDb)
-dbpCur = dbpCon.cursor()
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-print("Getting node names")
-nodeNames = set()
-for (name,) in dbCur.execute("SELECT name from nodes"):
- nodeNames.add(name)
-
-print("Checking for names to skip")
-oldSz = len(nodeNames)
-if os.path.exists(namesToSkipFile):
- with open(namesToSkipFile) as file:
- for line in file:
- nodeNames.remove(line.rstrip())
-print(f"Skipping {oldSz - len(nodeNames)} nodes")
-
-print("Reading disambiguation-page labels")
-disambigLabels = set()
-query = "SELECT labels.iri from labels INNER JOIN disambiguations ON labels.iri = disambiguations.iri"
-for (label,) in dbpCur.execute(query):
- disambigLabels.add(label)
-
-print("Trying to associate nodes with DBpedia labels")
-nodeToLabel = {}
-nameVariantRegex = re.compile(r"(.*) \(([^)]+)\)") # Used to recognise labels like 'Thor (shrimp)'
-nameToVariants = {} # Maps node names to lists of matching labels
-iterNum = 0
-for (label,) in dbpCur.execute("SELECT label from labels"):
- iterNum += 1
- if iterNum % 1e5 == 0:
- print(f"At iteration {iterNum}")
- #
- if label in disambigLabels:
- continue
- name = label.lower()
- if name in nodeNames:
- if name not in nameToVariants:
- nameToVariants[name] = [label]
- elif label not in nameToVariants[name]:
- nameToVariants[name].append(label)
- else:
- match = nameVariantRegex.fullmatch(name)
- if match != None:
- subName = match.group(1)
- if subName in nodeNames and match.group(2) != "disambiguation":
- if subName not in nameToVariants:
- nameToVariants[subName] = [label]
-					elif label not in nameToVariants[subName]:
-						nameToVariants[subName].append(label)
-# Associate labels without conflicts
-for (name, variants) in nameToVariants.items():
- if len(variants) == 1:
- nodeToLabel[name] = variants[0]
-for name in nodeToLabel:
- del nameToVariants[name]
-# Special case for root node
-nodeToLabel[rootNodeName] = rootLabel
-if rootNodeName in nameToVariants:
-	del nameToVariants[rootNodeName]
-
-print(f"Trying to resolve {len(nameToVariants)} conflicts")
-def resolveWithPickedLabels():
- " Attempts to resolve conflicts using a picked-names file "
- with open(pickedLabelsFile) as file:
- for line in file:
- (name, _, label) = line.rstrip().partition("|")
- if name not in nameToVariants:
- print(f"WARNING: No conflict found for name \"{name}\"", file=sys.stderr)
- continue
- if label == "":
- del nameToVariants[name]
- else:
- if label not in nameToVariants[name]:
- print(f"INFO: Picked label \"{label}\" for name \"{name}\" outside choice set", file=sys.stderr)
- nodeToLabel[name] = label
- del nameToVariants[name]
-def resolveWithCategoryList():
- """
- Attempts to resolve conflicts by looking for labels like 'name1 (category1)',
- and choosing those with a category1 that seems 'biological'.
-	Does two passes, using more generic categories first. This helps avoid
-	cases like Pan being classified as a horse instead of an ape.
- """
- generalCategories = {
- "species", "genus",
- "plant", "fungus", "animal",
- "annelid", "mollusc", "arthropod", "crustacean", "insect", "bug",
- "fish", "amphibian", "reptile", "bird", "mammal",
- }
- specificCategories = {
- "protist", "alveolate", "dinoflagellates",
- "orchid", "poaceae", "fern", "moss", "alga",
- "bryozoan", "hydrozoan",
- "sponge", "cnidarian", "coral", "polychaete", "echinoderm",
- "bivalve", "gastropod", "chiton",
- "shrimp", "decapod", "crab", "barnacle", "copepod",
- "arachnid", "spider", "harvestman", "mite",
- "dragonfly", "mantis", "cicada", "grasshopper", "planthopper",
- "beetle", "fly", "butterfly", "moth", "wasp",
- "catfish",
- "frog",
- "lizard",
- "horse", "sheep", "cattle", "mouse",
- }
- namesToRemove = set()
- for (name, variants) in nameToVariants.items():
- found = False
- for label in variants:
- match = nameVariantRegex.match(label)
- if match != None and match.group(2) in generalCategories:
- nodeToLabel[name] = label
- namesToRemove.add(name)
- found = True
- break
- if not found:
- for label in variants:
- match = nameVariantRegex.match(label)
- if match != None and match.group(2) in specificCategories:
- nodeToLabel[name] = label
- namesToRemove.add(name)
- break
- for name in namesToRemove:
- del nameToVariants[name]
-def resolveWithTypeData():
- " Attempts to resolve conflicts using DBpedia's type data "
- taxonTypes = { # Obtained from the DBpedia ontology
- "http://dbpedia.org/ontology/Species",
- "http://dbpedia.org/ontology/Archaea",
- "http://dbpedia.org/ontology/Bacteria",
- "http://dbpedia.org/ontology/Eukaryote",
- "http://dbpedia.org/ontology/Plant",
- "http://dbpedia.org/ontology/ClubMoss",
- "http://dbpedia.org/ontology/Conifer",
- "http://dbpedia.org/ontology/CultivatedVariety",
- "http://dbpedia.org/ontology/Cycad",
- "http://dbpedia.org/ontology/Fern",
- "http://dbpedia.org/ontology/FloweringPlant",
- "http://dbpedia.org/ontology/Grape",
- "http://dbpedia.org/ontology/Ginkgo",
- "http://dbpedia.org/ontology/Gnetophytes",
- "http://dbpedia.org/ontology/GreenAlga",
- "http://dbpedia.org/ontology/Moss",
- "http://dbpedia.org/ontology/Fungus",
- "http://dbpedia.org/ontology/Animal",
- "http://dbpedia.org/ontology/Fish",
- "http://dbpedia.org/ontology/Crustacean",
- "http://dbpedia.org/ontology/Mollusca",
- "http://dbpedia.org/ontology/Insect",
- "http://dbpedia.org/ontology/Arachnid",
- "http://dbpedia.org/ontology/Amphibian",
- "http://dbpedia.org/ontology/Reptile",
- "http://dbpedia.org/ontology/Bird",
- "http://dbpedia.org/ontology/Mammal",
- "http://dbpedia.org/ontology/Cat",
- "http://dbpedia.org/ontology/Dog",
- "http://dbpedia.org/ontology/Horse",
- }
- iterNum = 0
-for (label, dbpType) in dbpCur.execute("SELECT label, type from labels INNER JOIN types on labels.iri = types.iri"):
- iterNum += 1
- if iterNum % 1e5 == 0:
-  print(f"At iteration {iterNum}")
- #
- if dbpType in taxonTypes:
- name = label.lower()
- if name in nameToVariants:
- nodeToLabel[name] = label
- del nameToVariants[name]
- else:
- match = nameVariantRegex.fullmatch(name)
- if match != None:
- name = match.group(1)
- if name in nameToVariants:
- nodeToLabel[name] = label
- del nameToVariants[name]
-#resolveWithTypeData()
-#resolveWithCategoryList()
-resolveWithPickedLabels()
-print(f"Remaining number of conflicts: {len(nameToVariants)}")
-
-print("Getting node IRIs")
-nodeToIri = {}
-for (name, label) in nodeToLabel.items():
- (iri,) = dbpCur.execute("SELECT iri FROM labels where label = ? COLLATE NOCASE", (label,)).fetchone()
- nodeToIri[name] = iri
-
-print("Resolving redirects")
-redirectingIriSet = set()
-iterNum = 0
-for (name, iri) in nodeToIri.items():
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- row = dbpCur.execute("SELECT target FROM redirects where iri = ?", (iri,)).fetchone()
- if row != None:
- nodeToIri[name] = row[0]
- redirectingIriSet.add(name)
-
-print("Adding description tables")
-dbCur.execute("CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT, redirected INT)")
-dbCur.execute("CREATE INDEX wiki_id_idx ON wiki_ids(id)")
-dbCur.execute("CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT)")
-iterNum = 0
-for (name, iri) in nodeToIri.items():
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- query = "SELECT abstract, id FROM abstracts INNER JOIN ids ON abstracts.iri = ids.iri WHERE ids.iri = ?"
- row = dbpCur.execute(query, (iri,)).fetchone()
- if row != None:
- desc, wikiId = row
- dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, wikiId, 1 if name in redirectingIriSet else 0))
- dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (wikiId, desc, 1))
-
-print("Closing databases")
-dbCon.commit()
-dbCon.close()
-dbpCon.commit()
-dbpCon.close()
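The redirect pass in the script above takes a single hop, so a chain like A → B → C would leave a node pointing at B. A minimal sketch of a chained resolver; the table and column names (`redirects`, `iri`, `target`) match the script's queries, but the hop limit and cycle guard are our own assumptions:

```python
def resolve_redirect(cur, iri, max_hops=5):
    """Follow redirects until reaching a non-redirecting IRI, a cycle, or the hop limit."""
    seen = {iri}
    while max_hops > 0:
        row = cur.execute("SELECT target FROM redirects WHERE iri = ?", (iri,)).fetchone()
        if row is None or row[0] in seen:
            return iri  # final target, or bail out before looping
        iri = row[0]
        seen.add(iri)
        max_hops -= 1
    return iri
```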
diff --git a/backend/data/genEnwikiDescData.py b/backend/data/genEnwikiDescData.py
deleted file mode 100755
index d3f93ed..0000000
--- a/backend/data/genEnwikiDescData.py
+++ /dev/null
@@ -1,102 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads a database containing data from Wikipedia, tries to associate
-wiki pages with nodes in the database, and adds descriptions for nodes
-that don't have them.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-enwikiDb = "enwiki/descData.db"
-dbFile = "data.db"
-namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
-pickedLabelsFile = "pickedEnwikiLabels.txt"
-# Got about 25k descriptions when testing
-
-print("Opening databases")
-enwikiCon = sqlite3.connect(enwikiDb)
-enwikiCur = enwikiCon.cursor()
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-print("Checking for names to skip")
-namesToSkip = set()
-if os.path.exists(namesToSkipFile):
- with open(namesToSkipFile) as file:
- for line in file:
- namesToSkip.add(line.rstrip())
- print(f"Found {len(namesToSkip)}")
-print("Checking for picked-titles")
-nameToPickedTitle = {}
-if os.path.exists(pickedLabelsFile):
- with open(pickedLabelsFile) as file:
- for line in file:
- (name, _, title) = line.rstrip().partition("|")
- nameToPickedTitle[name.lower()] = title
-print(f"Found {len(nameToPickedTitle)}")
-
-print("Getting names of nodes without descriptions")
-nodeNames = set()
-query = "SELECT nodes.name FROM nodes LEFT JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id IS NULL"
-for (name,) in dbCur.execute(query):
- nodeNames.add(name)
-print(f"Found {len(nodeNames)}")
-nodeNames.difference_update(namesToSkip)
-
-print("Associating nodes with page IDs")
-nodeToPageId = {}
-iterNum = 0
-for name in nodeNames:
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- if name not in nameToPickedTitle:
- row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ? COLLATE NOCASE", (name,)).fetchone()
- if row != None:
- nodeToPageId[name] = row[0]
- else:
- title = nameToPickedTitle[name]
- row = enwikiCur.execute("SELECT id FROM pages WHERE pages.title = ?", (title,)).fetchone()
- if row != None:
- nodeToPageId[name] = row[0]
- else:
-   print(f"WARNING: Picked title \"{title}\" not found", file=sys.stderr)
-
-print("Resolving redirects")
-redirectingNames = set()
-iterNum = 0
-for (name, pageId) in nodeToPageId.items():
- iterNum += 1
- if iterNum % 1e3 == 0:
- print(f"At iteration {iterNum}")
- #
- query = "SELECT pages.id FROM redirects INNER JOIN pages ON redirects.target = pages.title WHERE redirects.id = ?"
- row = enwikiCur.execute(query, (pageId,)).fetchone()
- if row != None:
- nodeToPageId[name] = row[0]
- redirectingNames.add(name)
-
-print("Adding description data")
-iterNum = 0
-for (name, pageId) in nodeToPageId.items():
- iterNum += 1
- if iterNum % 1e3 == 0:
- print(f"At iteration {iterNum}")
- #
- row = enwikiCur.execute("SELECT desc FROM descs where descs.id = ?", (pageId,)).fetchone()
- if row != None:
- dbCur.execute("INSERT INTO wiki_ids VALUES (?, ?, ?)", (name, pageId, 1 if name in redirectingNames else 0))
- dbCur.execute("INSERT OR IGNORE INTO descs VALUES (?, ?, ?)", (pageId, row[0], 0))
-
-print("Closing databases")
-dbCon.commit()
-dbCon.close()
-enwikiCon.close()
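The per-name lookups above compare `pages.title` with `COLLATE NOCASE`, which a default (BINARY-collated) index on `title` cannot serve. A small sketch, assuming a simplified stand-in for the enwiki `pages` schema, of adding a NOCASE index so each lookup can stay an index search:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified stand-in for the enwiki pages table (the real schema isn't shown here)
cur.execute("CREATE TABLE pages (id INT PRIMARY KEY, title TEXT)")
# A NOCASE-collated index lets "title = ? COLLATE NOCASE" use an index search
cur.execute("CREATE INDEX pages_title_nc_idx ON pages(title COLLATE NOCASE)")
cur.execute("INSERT INTO pages VALUES (1, 'Homo sapiens')")
row = cur.execute(
    "SELECT id FROM pages WHERE title = ? COLLATE NOCASE", ("homo sapiens",)
).fetchone()
```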
diff --git a/backend/data/genEnwikiNameData.py b/backend/data/genEnwikiNameData.py
deleted file mode 100755
index 7ad61d1..0000000
--- a/backend/data/genEnwikiNameData.py
+++ /dev/null
@@ -1,76 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads from a database containing data from Wikipedia, along with
-node and wiki-ID information from the main database, and uses
-Wikipedia page-redirect information to add additional alt-name data.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-enwikiDb = "enwiki/descData.db"
-dbFile = "data.db"
-altNameRegex = re.compile(r"[a-zA-Z]+")
- # Avoids names like 'Evolution of Elephants', 'Banana fiber', and 'Fish (zoology)'
-
-print("Opening databases")
-enwikiCon = sqlite3.connect(enwikiDb)
-enwikiCur = enwikiCon.cursor()
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-print("Getting nodes with wiki IDs")
-nodeToWikiId = {}
-for (nodeName, wikiId) in dbCur.execute("SELECT name, id from wiki_ids"):
- nodeToWikiId[nodeName] = wikiId
-print(f"Found {len(nodeToWikiId)}")
-
-print("Iterating through nodes, finding names that redirect to them")
-nodeToAltNames = {}
-numAltNames = 0
-iterNum = 0
-for (nodeName, wikiId) in nodeToWikiId.items():
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- nodeToAltNames[nodeName] = set()
- query = "SELECT p1.title FROM pages p1" \
- " INNER JOIN redirects r1 ON p1.id = r1.id" \
- " INNER JOIN pages p2 ON r1.target = p2.title WHERE p2.id = ?"
- for (name,) in enwikiCur.execute(query, (wikiId,)):
- if altNameRegex.fullmatch(name) != None and name.lower() != nodeName:
- nodeToAltNames[nodeName].add(name.lower())
- numAltNames += 1
-print(f"Found {numAltNames} alt-names")
-
-print("Excluding existing alt-names from the set")
-query = "SELECT alt_name FROM names WHERE alt_name IN ({})"
-iterNum = 0
-for (nodeName, altNames) in nodeToAltNames.items():
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- existingNames = set()
- for (name,) in dbCur.execute(query.format(",".join(["?"] * len(altNames))), list(altNames)):
- existingNames.add(name)
- numAltNames -= len(existingNames)
- altNames.difference_update(existingNames)
-print(f"Left with {numAltNames} alt-names")
-
-print("Adding alt-names to database")
-for (nodeName, altNames) in nodeToAltNames.items():
- for altName in altNames:
- dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'enwiki')", (nodeName, altName, 0))
-
-print("Closing databases")
-dbCon.commit()
-dbCon.close()
-enwikiCon.close()
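The script above builds `IN (?,?,…)` clauses sized to each alt-name set. A hedged sketch of the same pattern with chunking, since SQLite builds can cap bound parameters (999 by default in older versions). The helper name and chunk size are our own, and table/column must be trusted identifiers, since they are interpolated into the SQL text:

```python
def select_existing(cur, table, column, values, chunk=500):
    """Return the subset of `values` already present in `table.column`."""
    found = set()
    vals = list(values)
    for i in range(0, len(vals), chunk):
        part = vals[i:i + chunk]
        placeholders = ",".join("?" * len(part))
        # table/column are trusted identifiers here, not user input
        sql = f"SELECT {column} FROM {table} WHERE {column} IN ({placeholders})"
        found.update(row[0] for row in cur.execute(sql, part))
    return found
```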
diff --git a/backend/data/genEolNameData.py b/backend/data/genEolNameData.py
deleted file mode 100755
index dd33ee0..0000000
--- a/backend/data/genEolNameData.py
+++ /dev/null
@@ -1,184 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os
-import html, csv, sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads files describing name data from the 'Encyclopedia of Life' site,
-tries to associate names with nodes in the database, and adds tables
-to represent associated names.
-
-Reads a vernacularNames.csv file:
- Starts with a header line containing:
- page_id, canonical_form, vernacular_string, language_code,
- resource_name, is_preferred_by_resource, is_preferred_by_eol
- The canonical_form and vernacular_string fields contain names
- associated with the page ID. Names are not always unique to
- particular page IDs.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-vnamesFile = "eol/vernacularNames.csv" # Had about 2.8e6 entries
-dbFile = "data.db"
-namesToSkip = {"unknown", "unknown species", "unidentified species"}
-pickedIdsFile = "pickedEolIds.txt"
-altsToSkipFile = "pickedEolAltsToSkip.txt"
-
-print("Reading in vernacular-names data")
-nameToPids = {} # 'pid' means 'Page ID'
-canonicalNameToPids = {}
-pidToNames = {}
-pidToPreferred = {} # Maps pids to 'preferred' names
-def updateMaps(name, pid, canonical, preferredAlt):
- global namesToSkip, nameToPids, canonicalNameToPids, pidToNames, pidToPreferred
- if name in namesToSkip:
- return
- if name not in nameToPids:
- nameToPids[name] = {pid}
- else:
- nameToPids[name].add(pid)
- if canonical:
- if name not in canonicalNameToPids:
- canonicalNameToPids[name] = {pid}
- else:
- canonicalNameToPids[name].add(pid)
- if pid not in pidToNames:
- pidToNames[pid] = {name}
- else:
- pidToNames[pid].add(name)
- if preferredAlt:
- pidToPreferred[pid] = name
-with open(vnamesFile, newline="") as csvfile:
- reader = csv.reader(csvfile)
- lineNum = 0
- for row in reader:
- lineNum += 1
- if lineNum % 1e5 == 0:
- print(f"At line {lineNum}")
- # Skip header line
- if lineNum == 1:
- continue
- # Parse line
- pid = int(row[0])
- name1 = re.sub(r"<[^>]+>", "", row[1].lower()) # Remove tags
- name2 = html.unescape(row[2]).lower()
- lang = row[3]
- preferred = row[6] == "preferred"
- # Add to maps
- updateMaps(name1, pid, True, False)
- if lang == "eng" and name2 != "":
- updateMaps(name2, pid, False, preferred)
-
-print("Checking for manually-picked pids")
-nameToPickedPid = {}
-if os.path.exists(pickedIdsFile):
- with open(pickedIdsFile) as file:
- for line in file:
- (name, _, eolId) = line.rstrip().partition("|")
- nameToPickedPid[name] = None if eolId == "" else int(eolId)
-print(f"Found {len(nameToPickedPid)}")
-
-print("Checking for alt-names to skip")
-nameToAltsToSkip = {}
-numToSkip = 0
-if os.path.exists(altsToSkipFile):
- with open(altsToSkipFile) as file:
- for line in file:
- (name, _, altName) = line.rstrip().partition("|")
- if name not in nameToAltsToSkip:
- nameToAltsToSkip[name] = [altName]
- else:
- nameToAltsToSkip[name].append(altName)
- numToSkip += 1
-print(f"Found {numToSkip} alt-names to skip")
-
-print("Creating database tables")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-dbCur.execute("CREATE TABLE names(name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name))")
-dbCur.execute("CREATE INDEX names_idx ON names(name)")
-dbCur.execute("CREATE INDEX names_alt_idx ON names(alt_name)")
-dbCur.execute("CREATE INDEX names_alt_idx_nc ON names(alt_name COLLATE NOCASE)")
-dbCur.execute("CREATE TABLE eol_ids(id INT PRIMARY KEY, name TEXT)")
-dbCur.execute("CREATE INDEX eol_name_idx ON eol_ids(name)")
-
-print("Associating nodes with names")
-usedPids = set()
-unresolvedNodeNames = set()
-dbCur2 = dbCon.cursor()
-def addToDb(nodeName, pidToUse):
- " Adds page-ID-associated name data to a node in the database "
- global dbCur, pidToPreferred
- dbCur.execute("INSERT INTO eol_ids VALUES (?, ?)", (pidToUse, nodeName))
- # Get alt-names
- altNames = set()
- for n in pidToNames[pidToUse]:
- # Avoid alt-names with >3 words
- if len(n.split(" ")) > 3:
- continue
- # Avoid alt-names that already name a node in the database
- if dbCur.execute("SELECT name FROM nodes WHERE name = ?", (n,)).fetchone() != None:
- continue
- # Check for picked alt-name-to-skip
- if nodeName in nameToAltsToSkip and n in nameToAltsToSkip[nodeName]:
- print(f"Excluding alt-name {n} for node {nodeName}")
- continue
- #
- altNames.add(n)
- # Add alt-names to db
- preferredName = pidToPreferred[pidToUse] if (pidToUse in pidToPreferred) else None
- for n in altNames:
- isPreferred = 1 if (n == preferredName) else 0
- dbCur.execute("INSERT INTO names VALUES (?, ?, ?, 'eol')", (nodeName, n, isPreferred))
-print("Adding picked IDs")
-for (name, pid) in nameToPickedPid.items():
- if pid != None:
- addToDb(name, pid)
- usedPids.add(pid)
-print("Associating nodes with canonical names")
-iterNum = 0
-for (nodeName,) in dbCur2.execute("SELECT name FROM nodes"):
- iterNum += 1
- if iterNum % 1e5 == 0:
- print(f"At iteration {iterNum}")
- if nodeName in nameToPickedPid:
- continue
- # Check for matching canonical name
- if nodeName in canonicalNameToPids:
- pidToUse = None
- # Pick an associated page ID
- for pid in canonicalNameToPids[nodeName]:
- hasLowerPrio = pid not in pidToPreferred and pidToUse in pidToPreferred
- hasHigherPrio = pid in pidToPreferred and pidToUse not in pidToPreferred
- if hasLowerPrio:
- continue
- if pid not in usedPids and (pidToUse == None or pid < pidToUse or hasHigherPrio):
- pidToUse = pid
- if pidToUse != None:
- addToDb(nodeName, pidToUse)
- usedPids.add(pidToUse)
- elif nodeName in nameToPids:
- unresolvedNodeNames.add(nodeName)
-print("Associating leftover nodes with other names")
-iterNum = 0
-for nodeName in unresolvedNodeNames:
- iterNum += 1
- if iterNum % 100 == 0:
- print(f"At iteration {iterNum}")
- # Check for matching name
- pidToUse = None
- for pid in nameToPids[nodeName]:
- # Pick an associated page ID
- if pid not in usedPids and (pidToUse == None or pid < pidToUse):
- pidToUse = pid
- if pidToUse != None:
- addToDb(nodeName, pidToUse)
- usedPids.add(pidToUse)
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
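The page-ID selection loop above (unused pids only; preferred pids win; ties break toward the lowest pid) can be sketched as a single `min()` with a composite key. The function name is ours; this illustrates the selection rule rather than replacing the loop:

```python
def pick_pid(candidate_pids, used_pids, pid_to_preferred):
    """Return the pid to use for a node, or None if all candidates are taken."""
    free = [pid for pid in candidate_pids if pid not in used_pids]
    if not free:
        return None
    # Sort key: preferred pids first (False < True), then numerically lowest
    return min(free, key=lambda pid: (pid not in pid_to_preferred, pid))
```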
diff --git a/backend/data/genImgs.py b/backend/data/genImgs.py
deleted file mode 100755
index ecca8e0..0000000
--- a/backend/data/genImgs.py
+++ /dev/null
@@ -1,191 +0,0 @@
-#!/usr/bin/python3
-
-import sys, os, subprocess
-import sqlite3, urllib.parse
-import signal
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads node IDs and image paths from a file, and possibly from a directory,
-and generates cropped/resized versions of those images into a directory,
-with names of the form 'nodeId1.jpg'. Also adds image metadata to the
-database.
-
-SIGINT can be used to stop, and the program can be re-run to continue
-processing. It uses already-existing database entries to decide what
-to skip.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-imgListFile = "imgList.txt"
-outDir = "img/"
-eolImgDb = "eol/imagesList.db"
-enwikiImgDb = "enwiki/imgData.db"
-pickedImgsDir = "pickedImgs/"
-pickedImgsFilename = "imgData.txt"
-dbFile = "data.db"
-IMG_OUT_SZ = 200
-genImgFiles = True # Set to False to skip writing image files (useful for debugging)
-
-if not os.path.exists(outDir):
- os.mkdir(outDir)
-
-print("Opening databases")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-eolCon = sqlite3.connect(eolImgDb)
-eolCur = eolCon.cursor()
-enwikiCon = sqlite3.connect(enwikiImgDb)
-enwikiCur = enwikiCon.cursor()
-print("Checking for picked-images")
-nodeToPickedImg = {}
-if os.path.exists(pickedImgsDir + pickedImgsFilename):
- lineNum = 0
- with open(pickedImgsDir + pickedImgsFilename) as file:
- for line in file:
- lineNum += 1
- (filename, url, license, artist, credit) = line.rstrip().split("|")
- nodeName = os.path.splitext(filename)[0] # Remove extension
- (otolId,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (nodeName,)).fetchone()
- nodeToPickedImg[otolId] = {
- "nodeName": nodeName, "id": lineNum,
- "filename": filename, "url": url, "license": license, "artist": artist, "credit": credit,
- }
-
-print("Checking for image tables")
-nodesDone = set()
-imgsDone = set()
-if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='node_imgs'").fetchone() == None:
- # Add image tables if not present
- dbCur.execute("CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT)")
- dbCur.execute("CREATE TABLE images" \
- " (id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src))")
-else:
- # Get existing image-associated nodes
- for (otolId,) in dbCur.execute("SELECT nodes.id FROM node_imgs INNER JOIN nodes ON node_imgs.name = nodes.name"):
- nodesDone.add(otolId)
- # Get existing node-associated images
- for (imgId, imgSrc) in dbCur.execute("SELECT id, src from images"):
- imgsDone.add((imgId, imgSrc))
- print(f"Found {len(nodesDone)} nodes and {len(imgsDone)} images to skip")
-
-# Set SIGINT handler
-interrupted = False
-def onSigint(sig, frame):
- global interrupted
- interrupted = True
-signal.signal(signal.SIGINT, onSigint)
-
-print("Iterating through input images")
-def quit():
- print("Closing databases")
- dbCon.commit()
- dbCon.close()
- eolCon.close()
- enwikiCon.close()
- sys.exit(0)
-def convertImage(imgPath, outPath):
- print(f"Converting {imgPath} to {outPath}")
- if os.path.exists(outPath):
- print(f"ERROR: Output image already exists")
- return False
- try:
- completedProcess = subprocess.run(
- ['npx', 'smartcrop-cli', '--width', str(IMG_OUT_SZ), '--height', str(IMG_OUT_SZ), imgPath, outPath],
- stdout=subprocess.DEVNULL
- )
- except Exception as e:
- print(f"ERROR: Exception while attempting to run smartcrop: {e}")
- return False
- if completedProcess.returncode != 0:
- print(f"ERROR: smartcrop had exit status {completedProcess.returncode}")
- return False
- return True
-print("Processing picked-images")
-for (otolId, imgData) in nodeToPickedImg.items():
- # Check for SIGINT event
- if interrupted:
- print("Exiting")
- quit()
- # Skip if already processed
- if otolId in nodesDone:
- continue
- # Convert image
- if genImgFiles:
- success = convertImage(pickedImgsDir + imgData["filename"], outDir + otolId + ".jpg")
- if not success:
- quit()
- else:
- print(f"Processing {imgData['nodeName']}: {otolId}.jpg")
- # Add entry to db
- if (imgData["id"], "picked") not in imgsDone:
- dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
- (imgData["id"], "picked", imgData["url"], imgData["license"], imgData["artist"], imgData["credit"]))
- imgsDone.add((imgData["id"], "picked"))
- dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (imgData["nodeName"], imgData["id"], "picked"))
- nodesDone.add(otolId)
-print("Processing images from eol and enwiki")
-iterNum = 0
-with open(imgListFile) as file:
- for line in file:
- iterNum += 1
- # Check for SIGINT event
- if interrupted:
- print("Exiting")
- break
- # Skip lines without an image path
- if line.find(" ") == -1:
- continue
- # Get filenames
- (otolId, _, imgPath) = line.rstrip().partition(" ")
- # Skip if already processed
- if otolId in nodesDone:
- continue
- # Convert image
- if genImgFiles:
- success = convertImage(imgPath, outDir + otolId + ".jpg")
- if not success:
- break
- else:
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- # Add entry to db
- (nodeName,) = dbCur.execute("SELECT name FROM nodes WHERE id = ?", (otolId,)).fetchone()
- fromEol = imgPath.startswith("eol/")
- imgName = os.path.basename(os.path.normpath(imgPath)) # Get last path component
- imgName = os.path.splitext(imgName)[0] # Remove extension
- if fromEol:
- eolId, _, contentId = imgName.partition(" ")
- eolId, contentId = (int(eolId), int(contentId))
- if (eolId, "eol") not in imgsDone:
- query = "SELECT source_url, license, copyright_owner FROM images WHERE content_id = ?"
- row = eolCur.execute(query, (contentId,)).fetchone()
- if row == None:
- print(f"ERROR: No image record for EOL ID {eolId}, content ID {contentId}")
- break
- (url, license, owner) = row
- dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
- (eolId, "eol", url, license, owner, ""))
- imgsDone.add((eolId, "eol"))
- dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, eolId, "eol"))
- else:
- enwikiId = int(imgName)
- if (enwikiId, "enwiki") not in imgsDone:
- query = "SELECT name, license, artist, credit FROM" \
- " page_imgs INNER JOIN imgs ON page_imgs.img_name = imgs.name" \
- " WHERE page_imgs.page_id = ?"
- row = enwikiCur.execute(query, (enwikiId,)).fetchone()
- if row == None:
- print(f"ERROR: No image record for enwiki ID {enwikiId}")
- break
- (name, license, artist, credit) = row
- url = "https://en.wikipedia.org/wiki/File:" + urllib.parse.quote(name)
- dbCur.execute("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
- (enwikiId, "enwiki", url, license, artist, credit))
- imgsDone.add((enwikiId, "enwiki"))
- dbCur.execute("INSERT INTO node_imgs VALUES (?, ?, ?)", (nodeName, enwikiId, "enwiki"))
-# Close dbs
-quit()
diff --git a/backend/data/genLinkedImgs.py b/backend/data/genLinkedImgs.py
deleted file mode 100755
index a8e1322..0000000
--- a/backend/data/genLinkedImgs.py
+++ /dev/null
@@ -1,125 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re
-import sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Looks for nodes without images in the database, and tries to
-associate them with images from their children.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-dbFile = "data.db"
-compoundNameRegex = re.compile(r"\[(.+) \+ (.+)]")
-upPropagateCompoundImgs = False
-
-print("Opening databases")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-dbCur.execute("CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT)")
-
-print("Getting nodes with images")
-resolvedNodes = {} # Will map node names to otol IDs with a usable image
-query = "SELECT nodes.name, nodes.id FROM nodes INNER JOIN node_imgs ON nodes.name = node_imgs.name"
-for (name, otolId) in dbCur.execute(query):
- resolvedNodes[name] = otolId
-print(f"Found {len(resolvedNodes)}")
-
-print("Iterating through nodes, trying to resolve images for ancestors")
-nodesToResolve = {} # Maps a node name to a list of objects that represent possible child images
-processedNodes = {} # Map a node name to an OTOL ID, representing a child node whose image is to be used
-parentToChosenTips = {} # used to prefer images from children with more tips
-iterNum = 0
-while len(resolvedNodes) > 0:
- iterNum += 1
- if iterNum % 1e3 == 0:
- print(f"At iteration {iterNum}")
- # Get next node
- (nodeName, otolId) = resolvedNodes.popitem()
- processedNodes[nodeName] = otolId
- # Traverse upwards, resolving ancestors if able
- while True:
- # Get parent
- row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone()
- if row == None or row[0] in processedNodes or row[0] in resolvedNodes:
- break
- parent = row[0]
- # Get parent data
- if parent not in nodesToResolve:
- childNames = [row[0] for row in dbCur.execute("SELECT child FROM edges WHERE parent = ?", (parent,))]
- query = "SELECT name, tips FROM nodes WHERE name IN ({})".format(",".join(["?"] * len(childNames)))
- childObjs = [{"name": row[0], "tips": row[1], "otolId": None} for row in dbCur.execute(query, childNames)]
- childObjs.sort(key=lambda x: x["tips"], reverse=True)
- nodesToResolve[parent] = childObjs
- else:
- childObjs = nodesToResolve[parent]
- # Check if highest-tips child
- if (childObjs[0]["name"] == nodeName):
- # Resolve parent, and continue from it
- dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (parent, otolId))
- del nodesToResolve[parent]
- processedNodes[parent] = otolId
- parentToChosenTips[parent] = childObjs[0]["tips"]
- nodeName = parent
- continue
- else:
- # Mark child as a potential choice
- childObj = next(c for c in childObjs if c["name"] == nodeName)
- childObj["otolId"] = otolId
- break
- # When out of resolved nodes, resolve nodesToResolve nodes, possibly adding more nodes to resolve
- if len(resolvedNodes) == 0:
- for (name, childObjs) in nodesToResolve.items():
- childObj = next(c for c in childObjs if c["otolId"] != None)
- resolvedNodes[name] = childObj["otolId"]
- parentToChosenTips[name] = childObj["tips"]
- dbCur.execute("INSERT INTO linked_imgs VALUES (?, ?)", (name, childObj["otolId"]))
- nodesToResolve.clear()
-
-print("Replacing linked-images for compound nodes")
-iterNum = 0
-for nodeName in processedNodes.keys():
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- match = compoundNameRegex.fullmatch(nodeName)
- if match != None:
- # Replace associated image with subname images
- (subName1, subName2) = match.group(1,2)
- otolIdPair = ["", ""]
- if subName1 in processedNodes:
- otolIdPair[0] = processedNodes[subName1]
- if subName2 in processedNodes:
- otolIdPair[1] = processedNodes[subName2]
- # Use no image if both subimages not found
- if otolIdPair[0] == "" and otolIdPair[1] == "":
- dbCur.execute("DELETE FROM linked_imgs WHERE name = ?", (nodeName,))
- continue
- # Add to db
- dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?",
- (otolIdPair[0] + "," + otolIdPair[1], nodeName))
- # Possibly repeat operation upon parent/ancestors
- if upPropagateCompoundImgs:
- while True:
- # Get parent
- row = dbCur.execute("SELECT parent FROM edges WHERE child = ?", (nodeName,)).fetchone()
- if row != None:
- parent = row[0]
- # Check num tips
- (numTips,) = dbCur.execute("SELECT tips from nodes WHERE name = ?", (nodeName,)).fetchone()
- if parent in parentToChosenTips and parentToChosenTips[parent] <= numTips:
- # Replace associated image
- dbCur.execute("UPDATE linked_imgs SET otol_ids = ? WHERE name = ?",
- (otolIdPair[0] + "," + otolIdPair[1], parent))
- nodeName = parent
- continue
- break
-
-print("Closing databases")
-dbCon.commit()
-dbCon.close()
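The bottom-up linking idea above can be sketched on a toy in-memory tree: a node without an image borrows one from a descendant, preferring the child with the most tips. The names and dict-based tree here are our own simplification of the script's database-driven traversal:

```python
def link_images(children, tips, has_image):
    """Map each node to the node whose image it should display (or None).

    Nodes that have their own image map to themselves.
    """
    linked = {}
    def resolve(node):
        if node in linked:
            return linked[node]
        if has_image(node):
            linked[node] = node
            return node
        best = None
        # Try children in descending-tips order, taking the first resolvable one
        for child in sorted(children.get(node, []), key=lambda c: -tips[c]):
            src = resolve(child)
            if src is not None:
                best = src
                break
        linked[node] = best
        return best
    for node in children:
        resolve(node)
    return linked
```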
diff --git a/backend/data/genOtolData.py b/backend/data/genOtolData.py
deleted file mode 100755
index b5e0055..0000000
--- a/backend/data/genOtolData.py
+++ /dev/null
@@ -1,250 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os
-import json, sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Reads files describing a tree-of-life from an 'Open Tree of Life' release,
-and stores tree information in a database.
-
-Reads a labelled_supertree_ottnames.tre file, which is assumed to have this format:
- The tree-of-life is represented in Newick format, which looks like: (n1,n2,(n3,n4)n5)n6
- The root node is named n6, and has children n1, n2, and n5.
- Name examples include: Homo_sapiens_ott770315, mrcaott6ott22687, and 'Oxalis san-miguelii ott5748753'.
- 'ott770315' and 'mrcaott6ott22687' are node IDs. The latter is for a 'compound node'.
- The node with ID 'ott770315' will get the name 'homo sapiens'.
- A compound node will get a name composed from its sub-nodes (e.g. [name1 + name2]).
- It is possible for multiple nodes to have the same name.
- In these cases, extra nodes will be named sequentially, as 'name1 [2]', 'name1 [3]', etc.
-Reads an annotations.json file, which is assumed to have this format:
- Holds a JSON object, whose 'nodes' property maps node IDs to objects holding information about that node,
- such as the properties 'supported_by' and 'conflicts_with', which list phylogenetic trees that
- support/conflict with the node's placement.
-Reads from a picked-names file, if present, which specifies name and node ID pairs.
- These help resolve cases where multiple nodes share the same name.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-treeFile = "otol/labelled_supertree_ottnames.tre" # Had about 2.5e6 nodes
-annFile = "otol/annotations.json"
-dbFile = "data.db"
-nodeMap = {} # Maps node IDs to node objects
-nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs)
-dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs
-pickedNamesFile = "pickedOtolNames.txt"
-
-class Node:
- " Represents a tree-of-life node "
- def __init__(self, name, childIds, parentId, tips, pSupport):
- self.name = name
- self.childIds = childIds
- self.parentId = parentId
- self.tips = tips
- self.pSupport = pSupport
-
-print("Parsing tree file")
-# Read file
-data = None
-with open(treeFile) as file:
- data = file.read()
-dataIdx = 0
-# Parse content
-iterNum = 0
-def parseNewick():
- " Parses a node using 'data' and 'dataIdx', updates nodeMap accordingly, and returns the node's ID "
- global data, dataIdx, iterNum
- iterNum += 1
- if iterNum % 1e5 == 0:
- print(f"At iteration {iterNum}")
- # Check for EOF
- if dataIdx == len(data):
- raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
- # Check for node
- if data[dataIdx] == "(": # parse inner node
- dataIdx += 1
- childIds = []
- while True:
- # Read child
- childId = parseNewick()
- childIds.append(childId)
- if (dataIdx == len(data)):
- raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
- # Check for next child
- if (data[dataIdx] == ","):
- dataIdx += 1
- continue
- else:
- # Get node name and id
- dataIdx += 1 # Consume an expected ')'
- name, id = parseNewickName()
- updateNameMaps(name, id)
- # Get child num-tips total
- tips = 0
- for childId in childIds:
- tips += nodeMap[childId].tips
- # Add node to nodeMap
- nodeMap[id] = Node(name, childIds, None, tips, False)
- # Update childrens' parent reference
- for childId in childIds:
- nodeMap[childId].parentId = id
- return id
- else: # Parse node name
- name, id = parseNewickName()
- updateNameMaps(name, id)
- nodeMap[id] = Node(name, [], None, 1, False)
- return id
-def parseNewickName():
- " Parses a node name using 'data' and 'dataIdx', and returns a (name, id) pair "
- global data, dataIdx
- name = None
- end = dataIdx
- # Get name
- if (end < len(data) and data[end] == "'"): # Check for quoted name
- end += 1
- inQuote = True
- while end < len(data):
- if (data[end] == "'"):
- if end + 1 < len(data) and data[end + 1] == "'": # Account for '' as escaped-quote
- end += 2
- continue
- else:
- end += 1
- inQuote = False
- break
- end += 1
- if inQuote:
- raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
- name = data[dataIdx:end]
- dataIdx = end
- else:
- while end < len(data) and not re.match(r"[(),]", data[end]):
- end += 1
- if (end == dataIdx):
- raise Exception(f"ERROR: Unexpected EOF at index {dataIdx}")
- name = data[dataIdx:end].rstrip()
- if end == len(data): # Ignore trailing input semicolon
- name = name[:-1]
- dataIdx = end
- # Convert to (name, id)
- name = name.lower()
- if name.startswith("mrca"):
- return (name, name)
- elif name[0] == "'":
- match = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", name)
- if match == None:
- raise Exception(f"ERROR: invalid name \"{name}\"")
- name = match.group(1).replace("''", "'")
- return (name, match.group(2))
- else:
- match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", name)
- if match == None:
- raise Exception(f"ERROR: invalid name \"{name}\"")
- return (match.group(1).replace("_", " "), match.group(2))
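As a minimal illustration of the unquoted label format handled above (a sketch reusing only the regex shown, with a made-up species label), names like `homo_sapiens_ott770315` split into a space-separated name and an OTT id:

```python
import re

def splitOttName(name: str):
    """Split a lowercased Newick label of the form 'name_ottNNN' into (name, id)."""
    match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", name)
    if match is None:
        raise ValueError(f"invalid name {name!r}")
    return (match.group(1).replace("_", " "), match.group(2))

print(splitOttName("homo_sapiens_ott770315"))  # → ('homo sapiens', 'ott770315')
```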
-def updateNameMaps(name, id):
- global nameToFirstId, dupNameToIds
- if name not in nameToFirstId:
- nameToFirstId[name] = id
- else:
- if name not in dupNameToIds:
- dupNameToIds[name] = [nameToFirstId[name], id]
- else:
- dupNameToIds[name].append(id)
-rootId = parseNewick()
-
-print("Resolving duplicate names")
-# Read picked-names file
-nameToPickedId = {}
-if os.path.exists(pickedNamesFile):
- with open(pickedNamesFile) as file:
- for line in file:
- (name, _, otolId) = line.rstrip().partition("|")
- nameToPickedId[name] = otolId
-# Resolve duplicates
-for (dupName, ids) in dupNameToIds.items():
- # Check for picked id
- if dupName in nameToPickedId:
- idToUse = nameToPickedId[dupName]
- else:
- # Get conflicting node with most tips
- tipNums = [nodeMap[id].tips for id in ids]
- maxIdx = tipNums.index(max(tipNums))
- idToUse = ids[maxIdx]
- # Adjust name of other conflicting nodes
- counter = 2
- for id in ids:
- if id != idToUse:
- nodeMap[id].name += f" [{counter}]"
- counter += 1
-
-print("Changing mrca* names")
-def convertMrcaName(id):
- node = nodeMap[id]
- name = node.name
- childIds = node.childIds
- if len(childIds) < 2:
- print(f"WARNING: MRCA node \"{name}\" has fewer than 2 children")
- return name # Fall back to the unconverted name instead of returning None
- # Get 2 children with most tips
- childTips = [nodeMap[id].tips for id in childIds]
- maxIdx1 = childTips.index(max(childTips))
- childTips[maxIdx1] = 0
- maxIdx2 = childTips.index(max(childTips))
- childId1 = childIds[maxIdx1]
- childId2 = childIds[maxIdx2]
- childName1 = nodeMap[childId1].name
- childName2 = nodeMap[childId2].name
- # Check for mrca* child names
- if childName1.startswith("mrca"):
- childName1 = convertMrcaName(childId1)
- if childName2.startswith("mrca"):
- childName2 = convertMrcaName(childId2)
- # Check for composite names
- match = re.fullmatch(r"\[(.+) \+ (.+)]", childName1)
- if match != None:
- childName1 = match.group(1)
- match = re.fullmatch(r"\[(.+) \+ (.+)]", childName2)
- if match != None:
- childName2 = match.group(1)
- # Create composite name
- node.name = f"[{childName1} + {childName2}]"
- return childName1
-for (id, node) in nodeMap.items():
- if node.name.startswith("mrca"):
- convertMrcaName(id)
-
-print("Parsing annotations file")
-# Read file
-data = None
-with open(annFile) as file:
- data = file.read()
-obj = json.loads(data)
-nodeAnnsMap = obj["nodes"]
-# Find relevant annotations
-for (id, node) in nodeMap.items():
- # Set has-support value using annotations
- if id in nodeAnnsMap:
- nodeAnns = nodeAnnsMap[id]
- supportQty = len(nodeAnns["supported_by"]) if "supported_by" in nodeAnns else 0
- conflictQty = len(nodeAnns["conflicts_with"]) if "conflicts_with" in nodeAnns else 0
- node.pSupport = supportQty > 0 and conflictQty == 0
-
-print("Creating nodes and edges tables")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-dbCur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
-dbCur.execute("CREATE INDEX nodes_idx_nc ON nodes(name COLLATE NOCASE)")
-dbCur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
-dbCur.execute("CREATE INDEX edges_child_idx ON edges(child)")
-for (otolId, node) in nodeMap.items():
- dbCur.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node.name, otolId, node.tips))
- for childId in node.childIds:
- childNode = nodeMap[childId]
- dbCur.execute("INSERT INTO edges VALUES (?, ?, ?)",
- (node.name, childNode.name, 1 if childNode.pSupport else 0))
-print("Closing database")
-dbCon.commit()
-dbCon.close()
diff --git a/backend/data/genReducedTrees.py b/backend/data/genReducedTrees.py
deleted file mode 100755
index a921be4..0000000
--- a/backend/data/genReducedTrees.py
+++ /dev/null
@@ -1,329 +0,0 @@
-#!/usr/bin/python3
-
-import sys, os.path, re
-import json, sqlite3
-
-usageInfo = f"""
-Usage: {sys.argv[0]} [tree1]
-
-Creates reduced versions of the tree in the database:
-- A 'picked nodes' tree:
- Created from a minimal set of node names read from a file,
- possibly with some extra randomly-picked children.
-- An 'images only' tree:
- Created by removing nodes without an image or presence in the
- 'picked' tree.
-- A 'weakly trimmed' tree:
- Created by removing nodes that lack an image, a description, and
- presence in the 'picked' tree, and then, for nodes with 'many'
- children, removing some of those children even if they have descriptions.
-
-If tree1 is specified, as 'picked', 'images', or 'trimmed', only that
-tree is generated.
-"""
-if len(sys.argv) > 2 or len(sys.argv) == 2 and re.fullmatch(r"picked|images|trimmed", sys.argv[1]) == None:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-tree = sys.argv[1] if len(sys.argv) > 1 else None
-dbFile = "data.db"
-pickedNodesFile = "pickedNodes.txt"
-COMP_NAME_REGEX = re.compile(r"\[.+ \+ .+]") # Used to recognise composite nodes
-
-class Node:
- def __init__(self, id, children, parent, tips, pSupport):
- self.id = id
- self.children = children
- self.parent = parent
- self.tips = tips
- self.pSupport = pSupport
-
-print("Opening database")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-def genPickedNodeTree(dbCur, pickedNames, rootName):
- global COMP_NAME_REGEX
- PREF_NUM_CHILDREN = 3 # Include extra children up to this limit
- print("Getting ancestors")
- nodeMap = genNodeMap(dbCur, pickedNames, 100) # Maps node names to Nodes
- print(f"Result has {len(nodeMap)} nodes")
- print("Removing composite nodes")
- removedNames = removeCompositeNodes(nodeMap)
- print(f"Result has {len(nodeMap)} nodes")
- print("Removing 'collapsible' nodes")
- temp = removeCollapsibleNodes(nodeMap, pickedNames)
- removedNames.update(temp)
- print(f"Result has {len(nodeMap)} nodes")
- print("Adding some additional nearby children")
- namesToAdd = []
- iterNum = 0
- for (name, node) in nodeMap.items():
- iterNum += 1
- if iterNum % 100 == 0:
- print(f"At iteration {iterNum}")
- #
- numChildren = len(node.children)
- if numChildren < PREF_NUM_CHILDREN:
- children = [row[0] for row in dbCur.execute("SELECT child FROM edges where parent = ?", (name,))]
- newChildren = []
- for n in children:
- if n in nodeMap or n in removedNames:
- continue
- if COMP_NAME_REGEX.fullmatch(n) != None:
- continue
- if dbCur.execute("SELECT name from node_imgs WHERE name = ?", (n,)).fetchone() == None and \
- dbCur.execute("SELECT name from linked_imgs WHERE name = ?", (n,)).fetchone() == None:
- continue
- newChildren.append(n)
- newChildNames = newChildren[:(PREF_NUM_CHILDREN - numChildren)]
- node.children.extend(newChildNames)
- namesToAdd.extend(newChildNames)
- for name in namesToAdd:
- parent, pSupport = dbCur.execute("SELECT parent, p_support from edges WHERE child = ?", (name,)).fetchone()
- (id,) = dbCur.execute("SELECT id FROM nodes WHERE name = ?", (name,)).fetchone()
- parent = None if parent == "" else parent
- nodeMap[name] = Node(id, [], parent, 0, pSupport == 1)
- print(f"Result has {len(nodeMap)} nodes")
- print("Updating 'tips' values")
- updateTips(rootName, nodeMap)
- print("Creating table")
- addTreeTables(nodeMap, dbCur, "p")
-def genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName):
- print("Getting ancestors")
- nodeMap = genNodeMap(dbCur, nodesWithImgOrPicked, 1e4)
- print(f"Result has {len(nodeMap)} nodes")
- print("Removing composite nodes")
- removeCompositeNodes(nodeMap)
- print(f"Result has {len(nodeMap)} nodes")
- print("Removing 'collapsible' nodes")
- removeCollapsibleNodes(nodeMap, {})
- print(f"Result has {len(nodeMap)} nodes")
- print(f"Updating 'tips' values") # Needed for next trimming step
- updateTips(rootName, nodeMap)
- print(f"Trimming from nodes with 'many' children")
- trimIfManyChildren(nodeMap, rootName, 300, pickedNames)
- print(f"Result has {len(nodeMap)} nodes")
- print(f"Updating 'tips' values")
- updateTips(rootName, nodeMap)
- print("Creating table")
- addTreeTables(nodeMap, dbCur, "i")
-def genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName):
- print("Getting ancestors")
- nodeMap = genNodeMap(dbCur, nodesWithImgDescOrPicked, 1e5)
- print(f"Result has {len(nodeMap)} nodes")
- print("Getting nodes to 'strongly keep'")
- iterNum = 0
- nodesFromImgOrPicked = set()
- for name in nodesWithImgOrPicked:
- iterNum += 1
- if iterNum % 1e4 == 0:
- print(f"At iteration {iterNum}")
- #
- while name != None:
- if name not in nodesFromImgOrPicked:
- nodesFromImgOrPicked.add(name)
- name = nodeMap[name].parent
- else:
- break
- print(f"Node set has {len(nodesFromImgOrPicked)} nodes")
- print("Removing 'collapsible' nodes")
- removeCollapsibleNodes(nodeMap, nodesWithImgDescOrPicked)
- print(f"Result has {len(nodeMap)} nodes")
- print(f"Updating 'tips' values") # Needed for next trimming step
- updateTips(rootName, nodeMap)
- print(f"Trimming from nodes with 'many' children")
- trimIfManyChildren(nodeMap, rootName, 600, nodesFromImgOrPicked)
- print(f"Result has {len(nodeMap)} nodes")
- print(f"Updating 'tips' values")
- updateTips(rootName, nodeMap)
- print("Creating table")
- addTreeTables(nodeMap, dbCur, "t")
-# Helper functions
-def genNodeMap(dbCur, nameSet, itersBeforePrint = 1):
- " Returns a subtree that includes nodes in 'nameSet', as a name-to-Node map "
- nodeMap = {}
- iterNum = 0
- for name in nameSet:
- iterNum += 1
- if iterNum % itersBeforePrint == 0:
- print(f"At iteration {iterNum}")
- #
- prevName = None
- while name != None:
- if name not in nodeMap:
- # Add node
- (id, tips) = dbCur.execute("SELECT id, tips from nodes where name = ?", (name,)).fetchone()
- row = dbCur.execute("SELECT parent, p_support from edges where child = ?", (name,)).fetchone()
- parent = None if row == None or row[0] == "" else row[0]
- pSupport = row == None or row[1] == 1
- children = [] if prevName == None else [prevName]
- nodeMap[name] = Node(id, children, parent, 0, pSupport)
- # Iterate to parent
- prevName = name
- name = parent
- else:
- # Just add as child
- if prevName != None:
- nodeMap[name].children.append(prevName)
- break
- return nodeMap
-def removeCompositeNodes(nodeMap):
- " Given a tree, removes composite-name nodes, and returns the removed nodes' names "
- global COMP_NAME_REGEX
- namesToRemove = set()
- for (name, node) in nodeMap.items():
- parent = node.parent
- if parent != None and COMP_NAME_REGEX.fullmatch(name) != None:
- # Connect children to parent
- nodeMap[parent].children.remove(name)
- nodeMap[parent].children.extend(node.children)
- for n in node.children:
- nodeMap[n].parent = parent
- nodeMap[n].pSupport &= node.pSupport
- # Remember for removal
- namesToRemove.add(name)
- for name in namesToRemove:
- del nodeMap[name]
- return namesToRemove
-def removeCollapsibleNodes(nodeMap, nodesToKeep = {}):
- """ Given a tree, removes single-child parents, then only children,
- with the given exceptions, and returns the set of removed nodes' names """
- namesToRemove = set()
- # Remove single-child parents
- for (name, node) in nodeMap.items():
- if len(node.children) == 1 and node.parent != None and name not in nodesToKeep:
- # Connect parent and children
- parent = node.parent
- child = node.children[0]
- nodeMap[parent].children.remove(name)
- nodeMap[parent].children.append(child)
- nodeMap[child].parent = parent
- nodeMap[child].pSupport &= node.pSupport
- # Remember for removal
- namesToRemove.add(name)
- for name in namesToRemove:
- del nodeMap[name]
- # Remove only children (not redundant because 'nodesToKeep' can cause single-child parents to be kept)
- namesToRemove.clear()
- for (name, node) in nodeMap.items():
- isOnlyChild = node.parent != None and len(nodeMap[node.parent].children) == 1
- if isOnlyChild and name not in nodesToKeep:
- # Connect parent and children
- parent = node.parent
- nodeMap[parent].children = node.children
- for n in node.children:
- nodeMap[n].parent = parent
- nodeMap[n].pSupport &= node.pSupport
- # Remember for removal
- namesToRemove.add(name)
- for name in namesToRemove:
- del nodeMap[name]
- #
- return namesToRemove
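The single-child-parent collapse above can be shown on a hypothetical three-node chain (`root` → `mid` → `leaf`) with a stripped-down node class; `mid` has exactly one child and gets spliced out:

```python
class N:
    def __init__(self, children, parent):
        self.children, self.parent = children, parent

nodeMap = {
    "root": N(["mid"], None),
    "mid": N(["leaf"], "root"),
    "leaf": N([], "mid"),
}
for name, node in list(nodeMap.items()):
    if len(node.children) == 1 and node.parent is not None:
        # Connect parent and child, then drop the collapsible node
        parent, child = node.parent, node.children[0]
        nodeMap[parent].children.remove(name)
        nodeMap[parent].children.append(child)
        nodeMap[child].parent = parent
        del nodeMap[name]

print(sorted(nodeMap))  # → ['leaf', 'root']
```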
-def trimIfManyChildren(nodeMap, rootName, childThreshold, nodesToKeep = {}):
- namesToRemove = set()
- def findTrimmables(nodeName):
- nonlocal nodeMap, nodesToKeep
- node = nodeMap[nodeName]
- if len(node.children) > childThreshold:
- numToTrim = len(node.children) - childThreshold
- # Try removing nodes, preferring those with less tips
- candidatesToTrim = [n for n in node.children if n not in nodesToKeep]
- childToTips = {n: nodeMap[n].tips for n in candidatesToTrim}
- candidatesToTrim.sort(key=lambda n: childToTips[n], reverse=True)
- childrenToRemove = set(candidatesToTrim[-numToTrim:])
- node.children = [n for n in node.children if n not in childrenToRemove]
- # Mark nodes for deletion
- for n in childrenToRemove:
- markForRemoval(n)
- # Recurse on children
- for n in node.children:
- findTrimmables(n)
- def markForRemoval(nodeName):
- nonlocal nodeMap, namesToRemove
- namesToRemove.add(nodeName)
- for child in nodeMap[nodeName].children:
- markForRemoval(child)
- findTrimmables(rootName)
- for nodeName in namesToRemove:
- del nodeMap[nodeName]
-def updateTips(nodeName, nodeMap):
- " Updates the 'tips' values for a node and its descendants, returning the node's new 'tips' value "
- node = nodeMap[nodeName]
- tips = sum([updateTips(childName, nodeMap) for childName in node.children])
- tips = max(1, tips)
- node.tips = tips
- return tips
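The bottom-up 'tips' computation can be sketched on a hypothetical four-node tree, where each leaf counts as one tip and each inner node sums its children:

```python
class TNode:
    def __init__(self, children):
        self.children, self.tips = children, 0

tree = {"root": TNode(["a", "b"]), "a": TNode([]), "b": TNode(["c"]), "c": TNode([])}

def updateTipsSketch(name, tree):
    node = tree[name]
    tips = sum(updateTipsSketch(c, tree) for c in node.children)
    node.tips = max(1, tips)  # a leaf counts as one tip
    return node.tips

print(updateTipsSketch("root", tree))  # → 2
```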
-def addTreeTables(nodeMap, dbCur, suffix):
- " Adds a tree to the database, as tables nodes_X and edges_X, where X is the given suffix "
- nodesTbl = f"nodes_{suffix}"
- edgesTbl = f"edges_{suffix}"
- dbCur.execute(f"CREATE TABLE {nodesTbl} (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
- dbCur.execute(f"CREATE INDEX {nodesTbl}_idx_nc ON {nodesTbl}(name COLLATE NOCASE)")
- dbCur.execute(f"CREATE TABLE {edgesTbl} (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
- dbCur.execute(f"CREATE INDEX {edgesTbl}_child_idx ON {edgesTbl}(child)")
- for (name, node) in nodeMap.items():
- dbCur.execute(f"INSERT INTO {nodesTbl} VALUES (?, ?, ?)", (name, node.id, node.tips))
- for childName in node.children:
- pSupport = 1 if nodeMap[childName].pSupport else 0
- dbCur.execute(f"INSERT INTO {edgesTbl} VALUES (?, ?, ?)", (name, childName, pSupport))
-
-print(f"Finding root node")
-query = "SELECT name FROM nodes LEFT JOIN edges ON nodes.name = edges.child WHERE edges.parent IS NULL LIMIT 1"
-(rootName,) = dbCur.execute(query).fetchone()
-print(f"Found \"{rootName}\"")
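The root-lookup query above can be exercised against a throwaway in-memory database (a minimal sketch; the table rows and OTT ids are made up). The root is the only node that never appears as a child in `edges`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
cur.execute("CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))")
cur.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
    [("life", "ott1", 2), ("bacteria", "ott2", 1), ("archaea", "ott3", 1)])
cur.executemany("INSERT INTO edges VALUES (?, ?, ?)",
    [("life", "bacteria", 1), ("life", "archaea", 1)])
# LEFT JOIN leaves edges.parent NULL exactly for the node with no incoming edge
query = "SELECT name FROM nodes LEFT JOIN edges ON nodes.name = edges.child WHERE edges.parent IS NULL LIMIT 1"
(rootName,) = cur.execute(query).fetchone()
print(rootName)  # → life
```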
-
-print('=== Getting picked-nodes ===')
-pickedNames = set()
-pickedTreeExists = False
-if dbCur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='nodes_p'").fetchone() == None:
- print(f"Reading from {pickedNodesFile}")
- with open(pickedNodesFile) as file:
- for line in file:
- name = line.rstrip()
- row = dbCur.execute("SELECT name from nodes WHERE name = ?", (name,)).fetchone()
- if row == None:
- row = dbCur.execute("SELECT name from names WHERE alt_name = ?", (name,)).fetchone()
- if row != None:
- pickedNames.add(row[0])
- if len(pickedNames) == 0:
- raise Exception("ERROR: No picked names found")
-else:
- pickedTreeExists = True
- print("Picked-node tree already exists")
- if tree == 'picked':
- sys.exit()
- for (name,) in dbCur.execute("SELECT name FROM nodes_p"):
- pickedNames.add(name)
-print(f"Found {len(pickedNames)} names")
-
-if (tree == 'picked' or tree == None) and not pickedTreeExists:
- print("=== Generating picked-nodes tree ===")
- genPickedNodeTree(dbCur, pickedNames, rootName)
-if tree != 'picked':
- print("=== Finding 'non-low significance' nodes ===")
- nodesWithImgOrPicked = set()
- nodesWithImgDescOrPicked = set()
- print("Finding nodes with descs")
- for (name,) in dbCur.execute("SELECT name FROM wiki_ids"): # Can assume the wiki_id has a desc
- nodesWithImgDescOrPicked.add(name)
- print("Finding nodes with images")
- for (name,) in dbCur.execute("SELECT name FROM node_imgs"):
- nodesWithImgDescOrPicked.add(name)
- nodesWithImgOrPicked.add(name)
- print("Adding picked nodes")
- for name in pickedNames:
- nodesWithImgDescOrPicked.add(name)
- nodesWithImgOrPicked.add(name)
- if tree == 'images' or tree == None:
- print("=== Generating images-only tree ===")
- genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName)
- if tree == 'trimmed' or tree == None:
- print("=== Generating weakly-trimmed tree ===")
- genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName)
-
-print("Closing database")
-dbCon.commit()
-dbCon.close()
diff --git a/backend/data/otol/README.md b/backend/data/otol/README.md
deleted file mode 100644
index 4be2fd2..0000000
--- a/backend/data/otol/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-Files
-=====
-- opentree13.4tree.tgz <br>
- Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>.
- Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life).
-- labelled\_supertree\_ottnames.tre <br>
- Extracted from the .tgz file. Describes the structure of the tree.
-- annotations.json <br>
- Extracted from the .tgz file. Contains additional attributes of tree
- nodes. Used for finding out which nodes have 'phylogenetic support'.
diff --git a/backend/data/pickedImgs/README.md b/backend/data/pickedImgs/README.md
deleted file mode 100644
index dfe192b..0000000
--- a/backend/data/pickedImgs/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-This directory holds additional image files to use for tree-of-life nodes,
-on top of those from EOL and Wikipedia.
-
-Possible Files
-==============
-- (Image files)
-- imgData.txt <br>
- Contains lines with the format `filename|url|license|artist|credit`.
- The filename should consist of a node name, with an image extension.
- Other fields correspond to those in the `images` table (see ../README.md).
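The `filename|url|license|artist|credit` format above can be parsed with a plain split; a minimal sketch (the sample line is invented for illustration):

```python
line = "sea otter.jpg|https://example.org/img.jpg|cc-by-sa|A. Artist|Via Example"
filename, url, lic, artist, credit = line.rstrip().split("|")
nodeName = filename.rsplit(".", 1)[0]  # node name is the filename minus its extension
print(nodeName)  # → sea otter
```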
diff --git a/backend/data/reviewImgsToGen.py b/backend/data/reviewImgsToGen.py
deleted file mode 100755
index de592f5..0000000
--- a/backend/data/reviewImgsToGen.py
+++ /dev/null
@@ -1,225 +0,0 @@
-#!/usr/bin/python3
-
-import sys, re, os, time
-import sqlite3
-import tkinter as tki
-from tkinter import ttk
-import PIL
-from PIL import ImageTk, Image, ImageOps
-
-usageInfo = f"""
-Usage: {sys.argv[0]}
-
-Provides a GUI that displays, for each node in the database, associated
-images from EOL and Wikipedia, and allows choosing which to use. Writes
-choice data to a text file with lines of the form 'otolId1 imgPath1', or
-'otolId1', where no path indicates a choice of no image.
-
-The program can be closed, and run again to continue from the last choice.
-The program looks for an existing output file to determine what choices
-have already been made.
-"""
-if len(sys.argv) > 1:
- print(usageInfo, file=sys.stderr)
- sys.exit(1)
-
-eolImgDir = "eol/imgs/"
-enwikiImgDir = "enwiki/imgs/"
-dbFile = "data.db"
-outFile = "imgList.txt"
-IMG_DISPLAY_SZ = 400
-PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135))
-onlyReviewPairs = True
-
-print("Opening database")
-dbCon = sqlite3.connect(dbFile)
-dbCur = dbCon.cursor()
-
-nodeToImgs = {} # Maps otol-ids to arrays of image paths
-print("Iterating through images from EOL")
-if os.path.exists(eolImgDir):
- for filename in os.listdir(eolImgDir):
- # Get associated EOL ID
- eolId, _, _ = filename.partition(" ")
- query = "SELECT nodes.id FROM nodes INNER JOIN eol_ids ON nodes.name = eol_ids.name WHERE eol_ids.id = ?"
- # Get associated node IDs
- found = False
- for (otolId,) in dbCur.execute(query, (int(eolId),)):
- if otolId not in nodeToImgs:
- nodeToImgs[otolId] = []
- nodeToImgs[otolId].append(eolImgDir + filename)
- found = True
- if not found:
- print(f"WARNING: No node found for {eolImgDir}{filename}")
-print(f"Result: {len(nodeToImgs)} nodes with images")
-print("Iterating through images from Wikipedia")
-if os.path.exists(enwikiImgDir):
- for filename in os.listdir(enwikiImgDir):
- # Get associated page ID
- (wikiId, _, _) = filename.partition(".")
- # Get associated node IDs
- query = "SELECT nodes.id FROM nodes INNER JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id = ?"
- found = False
- for (otolId,) in dbCur.execute(query, (int(wikiId),)):
- if otolId not in nodeToImgs:
- nodeToImgs[otolId] = []
- nodeToImgs[otolId].append(enwikiImgDir + filename)
- found = True
- if not found:
- print(f"WARNING: No node found for {enwikiImgDir}{filename}")
-print(f"Result: {len(nodeToImgs)} nodes with images")
-print("Filtering out already-made image choices")
-oldSz = len(nodeToImgs)
-if os.path.exists(outFile):
- with open(outFile) as file:
- for line in file:
- line = line.rstrip()
- otolId = line.split(" ", 1)[0]
- nodeToImgs.pop(otolId, None) # Tolerate entries whose images are no longer present
-print(f"Filtered out {oldSz - len(nodeToImgs)} entries")
-
-class ImgReviewer:
- " Provides the GUI for reviewing images "
- def __init__(self, root, nodeToImgs):
- self.root = root
- root.title("Image Reviewer")
- # Setup main frame
- mainFrame = ttk.Frame(root, padding="5 5 5 5")
- mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S))
- root.columnconfigure(0, weight=1)
- root.rowconfigure(0, weight=1)
- # Set up images-to-be-reviewed frames
- self.eolImg = ImageTk.PhotoImage(PLACEHOLDER_IMG)
- self.enwikiImg = ImageTk.PhotoImage(PLACEHOLDER_IMG)
- self.labels = []
- for i in (0, 1):
- frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ)
- frame.grid(column=i, row=0)
- label = ttk.Label(frame, image=self.eolImg if i == 0 else self.enwikiImg)
- label.grid(column=0, row=0)
- self.labels.append(label)
- # Add padding
- for child in mainFrame.winfo_children():
- child.grid_configure(padx=5, pady=5)
- # Add keyboard bindings
- root.bind("<q>", self.quit)
- root.bind("<Key-j>", lambda evt: self.accept(0))
- root.bind("<Key-k>", lambda evt: self.accept(1))
- root.bind("<Key-l>", lambda evt: self.reject())
- # Set fields
- self.nodeImgsList = list(nodeToImgs.items())
- self.listIdx = -1
- self.otolId = None
- self.eolImgPath = None
- self.enwikiImgPath = None
- self.numReviewed = 0
- self.startTime = time.time()
- # Initialise images to review
- self.getNextImgs()
- def getNextImgs(self):
- " Updates display with new images to review, or ends program "
- # Get next image paths
- while True:
- self.listIdx += 1
- if self.listIdx == len(self.nodeImgsList):
- print("No more images to review. Exiting program.")
- self.quit()
- return
- self.otolId, imgPaths = self.nodeImgsList[self.listIdx]
- # Potentially skip user choice
- if onlyReviewPairs and len(imgPaths) == 1:
- with open(outFile, 'a') as file:
- file.write(f"{self.otolId} {imgPaths[0]}\n")
- continue
- break
- # Update displayed images
- self.eolImgPath = self.enwikiImgPath = None
- imageOpenError = False
- for imgPath in imgPaths:
- img = None
- try:
- img = Image.open(imgPath)
- img = ImageOps.exif_transpose(img)
- except PIL.UnidentifiedImageError:
- print(f"UnidentifiedImageError for {imgPath}")
- imageOpenError = True
- continue
- if imgPath.startswith("eol/"):
- self.eolImgPath = imgPath
- self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img))
- elif imgPath.startswith("enwiki/"):
- self.enwikiImgPath = imgPath
- self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img))
- else:
- print(f"Unexpected image path {imgPath}")
- self.quit()
- return
- # Re-iterate if all image paths invalid
- if self.eolImgPath == None and self.enwikiImgPath == None:
- if imageOpenError:
- self.reject() # Records the rejection and advances to the next images
- else:
- self.getNextImgs()
- return
- # Add placeholder images
- if self.eolImgPath == None:
- self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG))
- elif self.enwikiImgPath == None:
- self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG))
- # Update image-frames
- self.labels[0].config(image=self.eolImg)
- self.labels[1].config(image=self.enwikiImg)
- # Update title
- title = f"Images for otol ID {self.otolId}"
- query = "SELECT names.alt_name FROM" \
- " nodes INNER JOIN names ON nodes.name = names.name" \
- " WHERE nodes.id = ? and pref_alt = 1"
- row = dbCur.execute(query, (self.otolId,)).fetchone()
- if row != None:
- title += f", aka {row[0]}"
- title += f" ({self.listIdx + 1} out of {len(self.nodeImgsList)})"
- self.root.title(title)
- def accept(self, imgIdx):
- " React to a user selecting an image "
- imgPath = self.eolImgPath if imgIdx == 0 else self.enwikiImgPath
- if imgPath == None:
- print("Invalid selection")
- return
- with open(outFile, 'a') as file:
- file.write(f"{self.otolId} {imgPath}\n")
- self.numReviewed += 1
- self.getNextImgs()
- def reject(self):
- " React to a user rejecting all images of a set "
- with open(outFile, 'a') as file:
- file.write(f"{self.otolId}\n")
- self.numReviewed += 1
- self.getNextImgs()
- def quit(self, e = None):
- global dbCon
- print(f"Number reviewed: {self.numReviewed}")
- timeElapsed = time.time() - self.startTime
- print(f"Time elapsed: {timeElapsed:.2f} seconds")
- if self.numReviewed > 0:
- print(f"Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds")
- dbCon.close()
- self.root.destroy()
- def resizeImgForDisplay(self, img):
- " Returns a copy of an image, shrunk to fit its frame (keeps aspect ratio), and with a background "
- if max(img.width, img.height) > IMG_DISPLAY_SZ:
- if (img.width > img.height):
- newHeight = int(img.height * IMG_DISPLAY_SZ/img.width)
- img = img.resize((IMG_DISPLAY_SZ, newHeight))
- else:
- newWidth = int(img.width * IMG_DISPLAY_SZ / img.height)
- img = img.resize((newWidth, IMG_DISPLAY_SZ))
- bgImg = PLACEHOLDER_IMG.copy()
- bgImg.paste(img, box=(
- int((IMG_DISPLAY_SZ - img.width) / 2),
- int((IMG_DISPLAY_SZ - img.height) / 2)))
- return bgImg
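The aspect-ratio arithmetic in the resize step can be shown without PIL (a sketch of just the size computation, using the same 400-pixel frame constant):

```python
IMG_DISPLAY_SZ = 400

def fitWithin(width, height, sz=IMG_DISPLAY_SZ):
    """Scale (width, height) down to fit a sz-by-sz frame, preserving aspect ratio."""
    if max(width, height) <= sz:
        return (width, height)  # already fits; leave unscaled
    if width > height:
        return (sz, int(height * sz / width))
    return (int(width * sz / height), sz)

print(fitWithin(800, 600))  # → (400, 300)
```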
-# Create GUI and defer control
-print("Starting GUI")
-root = tki.Tk()
-ImgReviewer(root, nodeToImgs)
-root.mainloop()