| author | Terry Truong <terry06890@gmail.com> | 2022-09-11 14:55:42 +1000 |
|---|---|---|
| committer | Terry Truong <terry06890@gmail.com> | 2022-09-11 15:04:14 +1000 |
| commit | 5de5fb93e50fe9006221b30ac4a66f1be0db82e7 | |
| tree | 2567c25c902dbb40d44419805cebb38171df47fa /backend/tol_data | |
| parent | daccbbd9c73a5292ea9d6746560d7009e5aa666d | |
Add backend unit tests
- Add unit testing code in backend/tests/
- Change to snake-case for script/file/directory names
- Use os.path.join() instead of '/'
- Refactor script code into function defs and a main-guard
- Make global vars all-caps
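The refactor bullets above follow a common Python script pattern; a minimal hypothetical sketch (names invented for illustration, not taken from this repository) of a script after applying them:

```python
import os

# Global vars are ALL_CAPS module-level constants
DB_FILE = 'data.db'
OUT_DIR = 'imgs'

def out_path(filename: str) -> str:
    # os.path.join() instead of manual '/' concatenation
    return os.path.join(OUT_DIR, filename)

def main() -> None:
    # Script logic lives in function defs rather than at module scope
    print(out_path(DB_FILE))

if __name__ == '__main__':  # main-guard: importing the module runs nothing
    main()
```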
Some fixes:
- For getting descriptions, some wiki redirects weren't properly resolved
- Linked images were sub-optimally propagated
- Generation of reduced trees assumed a wiki-id association implied a description
- Tilo.py had potential null dereferences by not always using a reduced node set
- EOL image downloading didn't properly wait for all threads to end when finishing
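The thread-shutdown fix in the last bullet can be illustrated with a minimal sketch (hypothetical names, not the repository's actual downloader): a `ThreadPoolExecutor` used as a context manager calls `shutdown(wait=True)` on exit, so no worker is abandoned mid-download when the script finishes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url: str) -> str:
    # Stand-in for a real image download
    return f'data for {url}'

def download_all(urls: list[str]) -> dict[str, str]:
    results: dict[str, str] = {}
    # Leaving the 'with' block implicitly calls shutdown(wait=True),
    # so every submitted task completes before the function returns
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```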
Diffstat (limited to 'backend/tol_data')
33 files changed, 3571 insertions, 0 deletions
diff --git a/backend/tol_data/README.md b/backend/tol_data/README.md new file mode 100644 index 0000000..a21418b --- /dev/null +++ b/backend/tol_data/README.md @@ -0,0 +1,155 @@ +This directory holds files used to generate the tree-of-life database data.db. + +# Database Tables +## Tree Structure +- `nodes` <br> + Format : `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` <br> + Represents a tree-of-life node. `tips` holds the number of no-child descendants +- `edges` <br> + Format: `parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child)` <br> + `p_support` is 1 if the edge has 'phylogenetic support', and 0 otherwise +## Node Mappings +- `eol_ids` <br> + Format: `name TEXT PRIMARY KEY, id INT` <br> + Associates nodes with EOL IDs +- `wiki_ids` <br> + Format: `name TEXT PRIMARY KEY, id INT` <br> + Associates nodes with wikipedia page IDs +## Node Vernacular Names +- `names` <br> + Format: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` <br> + Associates a node with alternative names. + `pref_alt` is 1 if the alt-name is the most 'preferred' one. + `src` indicates the dataset the alt-name was obtained from (can be 'eol', 'enwiki', or 'picked'). +## Node Descriptions +- `descs` <br> + Format: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` <br> + Associates a wikipedia page ID with a short-description. + `from_dbp` is 1 if the description was obtained from DBpedia, and 0 otherwise. +## Node Images +- `node_imgs` <br> + Format: `name TEXT PRIMARY KEY, img_id INT, src TEXT` <br> + Associates a node with an image. +- `images` <br> + Format: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` <br> + Represents an image, identified by a source ('eol', 'enwiki', or 'picked'), and a source-specific ID. +- `linked_imgs` <br> + Format: `name TEXT PRIMARY KEY, otol_ids TEXT` <br> + Associates a node with an image from another node. 
+ `otol_ids` can be an otol ID, or (for compound nodes) two comma-separated strings that may be otol IDs or empty. +## Reduced Trees +- `nodes_t`, `nodes_i`, `nodes_p` <br> + These are like `nodes`, but describe nodes of reduced trees. +- `edges_t`, `edges_i`, `edges_p` <br> + Like `edges` but for reduced trees. +## Other +- `node_iucn` <br> + Format: `name TEXT PRIMARY KEY, iucn TEXT` <br> + Associates nodes with IUCN conservation status strings (e.g. 'endangered') +- `node_pop` <br> + Format: `name TEXT PRIMARY KEY, pop INT` <br> + Associates nodes with popularity values (higher means more popular) + +# Generating the Database + +As a warning, the whole process takes a lot of time and file space. The +tree will probably have about 2.6 million nodes. Downloading the images +takes several days, and occupies over 200 GB. + +## Environment +Some of the scripts use third-party packages: +- `indexed_bzip2`: For parallelised bzip2 processing. +- `jsonpickle`: For encoding class objects as JSON. +- `requests`: For downloading data. +- `PIL`: For image processing. +- `tkinter`: For providing a basic GUI to review images. +- `mwxml`, `mwparserfromhell`: For parsing Wikipedia dumps. + +## Generate Tree Structure Data +1. Obtain 'tree data files' in otol/, as specified in its README. +2. Run `gen_otol_data.py`, which creates data.db, and adds the `nodes` and `edges` tables, + using data in otol/. It also uses these files, if they exist: + - `picked_otol_names.txt`: Has lines of the form `name1|otolId1`. + Can be used to override numeric suffixes added to same-name nodes. + +## Generate Dataset Mappings +1. Obtain 'taxonomy data files' in otol/, 'mapping files' in eol/, + files in wikidata/, and 'dump-index files' in enwiki/, as specified + in their READMEs. +2. Run `gen_mapping_data.py`, which adds the `eol_ids` and `wiki_ids` tables, + as well as `node_iucn`. It uses the files obtained above, the `nodes` table, + and 'picked mappings' files, if they exist.
+ - `picked_eol_ids.txt` contains lines like `3785967|405349`, specifying + an otol ID and an eol ID to map it to. The eol ID can be empty, + in which case the otol ID won't be mapped. + - `picked_wiki_ids.txt` and `picked_wiki_ids_rough.txt` contain lines like + `5341349|Human`, specifying an otol ID and an enwiki title, + which may contain spaces. The title can be empty. + +## Generate Node Name Data +1. Obtain 'name data files' in eol/, and 'description database files' in enwiki/, + as specified in their READMEs. +2. Run `gen_name_data.py`, which adds the `names` table, using data in eol/ and enwiki/, + along with the `nodes`, `eol_ids`, and `wiki_ids` tables. <br> + It also uses `picked_names.txt`, if it exists. This file can hold lines like + `embryophyta|land plant|1`, specifying a node name, an alt-name to add for it, + and a 1 or 0 indicating whether it is a 'preferred' alt-name. The last field + can be empty, which indicates that the alt-name should be removed, or, if the + alt-name is the same as the node name, that no alt-name should be preferred. + +## Generate Node Description Data +1. Obtain files in dbpedia/, as specified in its README. +2. Run `gen_desc_data.py`, which adds the `descs` table, using data in dbpedia/ and + enwiki/, and the `nodes` table. + +## Generate Node Images Data +### Get images from EOL +1. Obtain 'image metadata files' in eol/, as specified in its README. +2. In eol/, run `download_imgs.py`, which downloads images (possibly multiple per node), + into eol/imgs_for_review, using data in eol/, as well as the `eol_ids` table. + By default, more images than needed are downloaded for review. To skip this, set + the script's MAX_IMGS_PER_ID to 1. +3. In eol/, run `review_imgs.py`, which interactively displays the downloaded images for + each node, providing the choice of which (if any) to use, moving them to eol/imgs/. + Uses `names` and `eol_ids` to display extra info.
If MAX_IMGS_PER_ID was set to 1 in + the previous step, you can skip review by renaming the image folder. +### Get Images from Wikipedia +1. In enwiki/, run `gen_img_data.py`, which looks for wikipedia image names for each node, + using the `wiki_ids` table, and stores them in a database. +2. In enwiki/, run `download_img_license_info.py`, which downloads licensing information for + those images, using wikipedia's online API. +3. In enwiki/, run `download_imgs.py`, which downloads 'permissively-licensed' + images into enwiki/imgs/. +### Merge the Image Sets +1. Run `review_imgs_to_gen.py`, which displays images from eol/imgs/ and enwiki/imgs/, + and enables choosing, for each node, which image should be used, if any, + and outputs choice information into `img_list.txt`. Uses the `nodes`, + `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info). + To skip manual review, set REVIEW to 'none' in the script (the script will select any + image, preferring ones from Wikipedia). +2. Run `gen_imgs.py`, which creates cropped/resized images in img/, from files listed in + `img_list.txt` and located in eol/ and enwiki/, and creates the `node_imgs` and + `images` tables. If `picked_imgs/` is present, images within it are also used. <br> + The outputs might need to be manually created/adjusted: + - An input image might have no output produced, possibly due to + data incompatibilities, memory limits, etc. A few input image files + might actually be html files, containing a 'file not found' page. + - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg. + - An input image might produce output with unexpected dimensions. + This seems to happen when the image is very large, and triggers a + decompression bomb warning. +### Add more Image Associations +1. Run `gen_linked_imgs.py`, which tries to associate nodes without images with + images of their children. Adds the `linked_imgs` table, and uses the + `nodes`, `edges`, and `node_imgs` tables.
+ +## Generate Reduced Trees +1. Run `gen_reduced_trees.py`, which generates multiple reduced versions of the tree, + adding the `nodes_*` and `edges_*` tables, using `nodes`, `edges`, `wiki_ids`, + `node_imgs`, `linked_imgs`, and `names`. Reads from `picked_nodes.txt`, which lists + names of nodes that must be included (1 per line). + +## Generate Node Popularity Data +1. Obtain 'page view files' in enwiki/, as specified in its README. +2. Run `gen_pop_data.py`, which adds the `node_pop` table, using data in enwiki/, + and the `wiki_ids` table. diff --git a/backend/tol_data/__init__.py b/backend/tol_data/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/backend/tol_data/__init__.py diff --git a/backend/tol_data/dbpedia/README.md b/backend/tol_data/dbpedia/README.md new file mode 100644 index 0000000..a708122 --- /dev/null +++ b/backend/tol_data/dbpedia/README.md @@ -0,0 +1,29 @@ +This directory holds files obtained/derived from [DBpedia](https://www.dbpedia.org). + +# Downloaded Files +- `labels_lang=en.ttl.bz2` <br> + Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core. + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>. +- `page_lang=en_ids.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2>. +- `redirects_lang=en_transitive.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>. +- `disambiguations_lang=en.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>. +- `instance-types_lang=en_specific.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>.
+- `short-abstracts_lang=en.ttl.bz2` <br> + Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>. + +# Other Files +- `gen_desc_data.py` <br> + Used to generate a database representing data from the ttl files. +- `desc_data.db` <br> + Generated by `gen_desc_data.py`. <br> + Tables: <br> + - `labels`: `iri TEXT PRIMARY KEY, label TEXT ` + - `ids`: `iri TEXT PRIMARY KEY, id INT` + - `redirects`: `iri TEXT PRIMARY KEY, target TEXT` + - `disambiguations`: `iri TEXT PRIMARY KEY` + - `types`: `iri TEXT, type TEXT` + - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT` diff --git a/backend/tol_data/dbpedia/__init__.py b/backend/tol_data/dbpedia/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/backend/tol_data/dbpedia/__init__.py diff --git a/backend/tol_data/dbpedia/gen_desc_data.py b/backend/tol_data/dbpedia/gen_desc_data.py new file mode 100755 index 0000000..50418e0 --- /dev/null +++ b/backend/tol_data/dbpedia/gen_desc_data.py @@ -0,0 +1,120 @@ +#!/usr/bin/python3 + +""" +Adds DBpedia labels/types/abstracts/etc data into a database +""" + +# In testing, this script took a few hours to run, and generated about 10GB + +import re +import bz2, sqlite3 + +LABELS_FILE = 'labels_lang=en.ttl.bz2' # Had about 16e6 entries +IDS_FILE = 'page_lang=en_ids.ttl.bz2' +REDIRECTS_FILE = 'redirects_lang=en_transitive.ttl.bz2' +DISAMBIG_FILE = 'disambiguations_lang=en.ttl.bz2' +TYPES_FILE = 'instance-types_lang=en_specific.ttl.bz2' +ABSTRACTS_FILE = 'short-abstracts_lang=en.ttl.bz2' +DB_FILE = 'desc_data.db' + +def genData( + labelsFile: str, idsFile: str, redirectsFile: str, disambigFile: str, + typesFile: str, abstractsFile: str, dbFile: str) -> None: + """ Reads the files and writes to db """ + print('Creating database') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + # + print('Reading/storing label data') + dbCur.execute('CREATE TABLE labels (iri TEXT PRIMARY KEY, label TEXT)') + 
dbCur.execute('CREATE INDEX labels_idx ON labels(label)') + dbCur.execute('CREATE INDEX labels_idx_nc ON labels(label COLLATE NOCASE)') + labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "((?:[^"]|\\")+)"@en \.\n') + with bz2.open(labelsFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + match = labelLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + dbCur.execute('INSERT INTO labels VALUES (?, ?)', (match.group(1), match.group(2))) + # + print('Reading/storing wiki page ids') + dbCur.execute('CREATE TABLE ids (iri TEXT PRIMARY KEY, id INT)') + dbCur.execute('CREATE INDEX ids_idx ON ids(id)') + idLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(\d+)".*\n') + with bz2.open(idsFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + match = idLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + try: + dbCur.execute('INSERT INTO ids VALUES (?, ?)', (match.group(1), int(match.group(2)))) + except sqlite3.IntegrityError as e: + # Accounts for certain lines that have the same IRI + print(f'WARNING: Failed to add entry with IRI "{match.group(1)}": {e}') + # + print('Reading/storing redirection data') + dbCur.execute('CREATE TABLE redirects (iri TEXT PRIMARY KEY, target TEXT)') + redirLineRegex = re.compile(r'<([^>]+)> <[^>]+> <([^>]+)> \.\n') + with bz2.open(redirectsFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + match = redirLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + dbCur.execute('INSERT INTO redirects VALUES (?, ?)', (match.group(1), match.group(2))) + # + print('Reading/storing disambiguation-page data') + dbCur.execute('CREATE TABLE disambiguations (iri TEXT
PRIMARY KEY)') + disambigLineRegex = redirLineRegex + with bz2.open(disambigFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + match = disambigLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + dbCur.execute('INSERT OR IGNORE INTO disambiguations VALUES (?)', (match.group(1),)) + # + print('Reading/storing instance-type data') + dbCur.execute('CREATE TABLE types (iri TEXT, type TEXT)') + dbCur.execute('CREATE INDEX types_iri_idx ON types(iri)') + typeLineRegex = redirLineRegex + with bz2.open(typesFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + match = typeLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + dbCur.execute('INSERT INTO types VALUES (?, ?)', (match.group(1), match.group(2))) + # + print('Reading/storing abstracts') + dbCur.execute('CREATE TABLE abstracts (iri TEXT PRIMARY KEY, abstract TEXT)') + descLineRegex = labelLineRegex + with bz2.open(abstractsFile, mode='rt') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + if line[0] == '#': + continue + match = descLineRegex.fullmatch(line) + if match is None: + raise Exception(f'ERROR: Line {lineNum} has unexpected format') + dbCur.execute('INSERT INTO abstracts VALUES (?, ?)', + (match.group(1), match.group(2).replace(r'\"', '"'))) + # + print('Closing database') + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(LABELS_FILE, IDS_FILE, REDIRECTS_FILE, DISAMBIG_FILE, TYPES_FILE, ABSTRACTS_FILE, DB_FILE) diff --git a/backend/tol_data/enwiki/README.md b/backend/tol_data/enwiki/README.md new file mode 100644
index 0000000..ba1de33 --- /dev/null +++ b/backend/tol_data/enwiki/README.md @@ -0,0 +1,63 @@ +This directory holds files obtained/derived from [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page). + +# Downloaded Files +- `enwiki-20220501-pages-articles-multistream.xml.bz2` <br> + Contains text content and metadata for pages in enwiki. + Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror). + Some file content and format information was available from + <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>. +- `enwiki-20220501-pages-articles-multistream-index.txt.bz2` <br> + Obtained as above. Holds lines of the form offset1:pageId1:title1, + providing, for each page, an offset into the dump file of a chunk of + 100 pages that includes it. + +# Dump-Index Files +- `gen_dump_index_db.py` <br> + Creates a database version of the enwiki-dump index file. +- `dumpIndex.db` <br> + Generated by `gen_dump_index_db.py`. <br> + Tables: <br> + - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT` + +# Description Database Files +- `gen_desc_data.py` <br> + Reads through pages in the dump file, and adds short-description info to a database. +- `desc_data.db` <br> + Generated by `gen_desc_data.py`. <br> + Tables: <br> + - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE` + - `redirects`: `id INT PRIMARY KEY, target TEXT` + - `descs`: `id INT PRIMARY KEY, desc TEXT` + +# Image Database Files +- `gen_img_data.py` <br> + Used to find infobox image names for page IDs, storing them into a database. +- `download_img_license_info.py` <br> + Used to download licensing metadata for image names, via wikipedia's online API, storing them into a database. +- `img_data.db` <br> + Used to hold metadata about infobox images for a set of pageIDs. + Generated using `gen_img_data.py` and `download_img_license_info.py`.
<br> + Tables: <br> + - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br> + `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids. + - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br> + Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability. +- `download_imgs.py` <br> + Used to download image files into imgs/. + +# Page View Files +- `pageviews/pageviews-*-user.bz2` + Each holds wikimedia article page view data for some month. + Obtained via <https://dumps.wikimedia.org/other/pageview_complete/monthly/>. + Some format info was available from <https://dumps.wikimedia.org/other/pageview_complete/readme.html>. +- `gen_pageview_data.py` <br> + Reads pageviews/*, and creates a database holding average monthly pageview counts. +- `pageview_data.db` <br> + Generated using `gen_pageview_data.py`. <br> + Tables: <br> + - `views`: `title TEXT PRIMARY KEY, id INT, views INT` + +# Other Files +- `lookup_page.py` <br> + Running `lookup_page.py title1` looks in the dump for a page with a given title, + and prints the contents to stdout. Uses dumpIndex.db. diff --git a/backend/tol_data/enwiki/__init__.py b/backend/tol_data/enwiki/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/backend/tol_data/enwiki/__init__.py diff --git a/backend/tol_data/enwiki/download_img_license_info.py b/backend/tol_data/enwiki/download_img_license_info.py new file mode 100755 index 0000000..0a809ac --- /dev/null +++ b/backend/tol_data/enwiki/download_img_license_info.py @@ -0,0 +1,154 @@ +#!/usr/bin/python3 + +""" +Reads image names from a database, and uses enwiki's online API to obtain +licensing information for them, adding the info to the database. + +SIGINT causes the program to finish an ongoing download and exit. +The program can be re-run to continue downloading, and looks +at already-processed names to decide what to skip.
+""" + +import re +import sqlite3, urllib.parse, html +import requests +import time, signal + +IMG_DB = 'img_data.db' +# +API_URL = 'https://en.wikipedia.org/w/api.php' +USER_AGENT = 'terryt.dev (terry06890@gmail.com)' +BATCH_SZ = 50 # Max 50 +TAG_REGEX = re.compile(r'<[^<]+>') +WHITESPACE_REGEX = re.compile(r'\s+') + +def downloadInfo(imgDb: str) -> None: + print('Opening database') + dbCon = sqlite3.connect(imgDb) + dbCur = dbCon.cursor() + print('Checking for table') + if dbCur.execute('SELECT name FROM sqlite_master WHERE type="table" AND name="imgs"').fetchone() is None: + dbCur.execute('CREATE TABLE imgs (' \ + 'name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT)') + # + print('Reading image names') + imgNames: set[str] = set() + for (imgName,) in dbCur.execute('SELECT DISTINCT img_name FROM page_imgs WHERE img_name NOT NULL'): + imgNames.add(imgName) + print(f'Found {len(imgNames)}') + # + print('Checking for already-processed images') + oldSz = len(imgNames) + for (imgName,) in dbCur.execute('SELECT name FROM imgs'): + imgNames.discard(imgName) + print(f'Found {oldSz - len(imgNames)}') + # + # Set SIGINT handler + interrupted = False + oldHandler = None + def onSigint(sig, frame): + nonlocal interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) + oldHandler = signal.signal(signal.SIGINT, onSigint) + # + print('Iterating through image names') + imgNameList = list(imgNames) + iterNum = 0 + for i in range(0, len(imgNameList), BATCH_SZ): + iterNum += 1 + if iterNum % 1 == 0: + print(f'At iteration {iterNum} (after {(iterNum - 1) * BATCH_SZ} images)') + if interrupted: + print(f'Exiting loop at iteration {iterNum}') + break + # Get batch + imgBatch = imgNameList[i:i+BATCH_SZ] + imgBatch = ['File:' + x for x in imgBatch] + # Make request + headers = { + 'user-agent': USER_AGENT, + 'accept-encoding': 'gzip', + } + params = { + 'action': 'query', + 'format': 'json', + 'prop': 'imageinfo', + 'iiprop': 
'extmetadata|url', + 'maxlag': '5', + 'titles': '|'.join(imgBatch), + 'iiextmetadatafilter': 'Artist|Credit|LicenseShortName|Restrictions', + } + responseObj = None + try: + response = requests.get(API_URL, params=params, headers=headers) + responseObj = response.json() + except Exception as e: + print(f'ERROR: Exception while downloading info: {e}') + print('\tImage batch: ' + '|'.join(imgBatch)) + continue + # Parse response-object + if 'query' not in responseObj or 'pages' not in responseObj['query']: + print('WARNING: Response object doesn\'t have page data') + print('\tImage batch: ' + '|'.join(imgBatch)) + if 'error' in responseObj: + errorCode = responseObj['error']['code'] + print(f'\tError code: {errorCode}') + if errorCode == 'maxlag': + time.sleep(5) + continue + pages = responseObj['query']['pages'] + normalisedToInput: dict[str, str] = {} + if 'normalized' in responseObj['query']: + for entry in responseObj['query']['normalized']: + normalisedToInput[entry['to']] = entry['from'] + for page in pages.values(): + # Some fields // More info at https://www.mediawiki.org/wiki/Extension:CommonsMetadata#Returned_data + # LicenseShortName: short human-readable license name, apparently more reliable than 'License', + # Artist: author name (might contain complex html, multiple authors, etc) + # Credit: 'source' + # For image-map-like images, can be quite large/complex html, crediting each sub-image + # May be <a href='text1'>text2</a>, where the text2 might be non-indicative + # Restrictions: specifies non-copyright legal restrictions + title: str = page['title'] + if title in normalisedToInput: + title = normalisedToInput[title] + title = title[5:] # Remove 'File:' + if title not in imgNames: + print(f'WARNING: Got title "{title}" not in image-name list') + continue + if 'imageinfo' not in page: + print(f'WARNING: No imageinfo section for page "{title}"') + continue + metadata = page['imageinfo'][0]['extmetadata'] + url: str = page['imageinfo'][0]['url'] +
license: str | None = metadata['LicenseShortName']['value'] if 'LicenseShortName' in metadata else None + artist: str | None = metadata['Artist']['value'] if 'Artist' in metadata else None + credit: str | None = metadata['Credit']['value'] if 'Credit' in metadata else None + restrictions: str | None = metadata['Restrictions']['value'] if 'Restrictions' in metadata else None + # Remove markup + if artist is not None: + artist = TAG_REGEX.sub(' ', artist).strip() + artist = WHITESPACE_REGEX.sub(' ', artist) + artist = html.unescape(artist) + artist = urllib.parse.unquote(artist) + if credit is not None: + credit = TAG_REGEX.sub(' ', credit).strip() + credit = WHITESPACE_REGEX.sub(' ', credit) + credit = html.unescape(credit) + credit = urllib.parse.unquote(credit) + # Add to db + print((title, license, artist, credit, restrictions, url)) + dbCur.execute('INSERT INTO imgs VALUES (?, ?, ?, ?, ?, ?)', + (title, license, artist, credit, restrictions, url)) + # + print('Closing database') + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + downloadInfo(IMG_DB) diff --git a/backend/tol_data/enwiki/download_imgs.py b/backend/tol_data/enwiki/download_imgs.py new file mode 100755 index 0000000..ba874e1 --- /dev/null +++ b/backend/tol_data/enwiki/download_imgs.py @@ -0,0 +1,99 @@ +#!/usr/bin/python3 + +""" +Downloads images from URLs in an image database, into an output directory, +with names of the form 'pageId1.ext1'. + +SIGINT causes the program to finish an ongoing download and exit. +The program can be re-run to continue downloading, and looks +in the output directory to decide what to skip.
+""" + +# In testing, this downloaded about 100k images, over several days + +import re, os +import sqlite3 +import urllib.parse, requests +import time, signal + +IMG_DB = 'img_data.db' # About 130k image names +OUT_DIR = 'imgs' +# +LICENSE_REGEX = re.compile(r'cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?', flags=re.IGNORECASE) +USER_AGENT = 'terryt.dev (terry06890@gmail.com)' +TIMEOUT = 1 + # https://en.wikipedia.org/wiki/Wikipedia:Database_download says to 'throttle to 1 cache miss per sec' + # It's unclear how to properly check for cache misses, so we just aim for 1 per sec + +def downloadImgs(imgDb: str, outDir: str, timeout: int) -> None: + if not os.path.exists(outDir): + os.mkdir(outDir) + print('Checking for already-downloaded images') + fileList = os.listdir(outDir) + pageIdsDone: set[int] = set() + for filename in fileList: + pageIdsDone.add(int(os.path.splitext(filename)[0])) + print(f'Found {len(pageIdsDone)}') + # + # Set SIGINT handler + interrupted = False + oldHandler = None + def onSigint(sig, frame): + nonlocal interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) + oldHandler = signal.signal(signal.SIGINT, onSigint) + # + print('Opening database') + dbCon = sqlite3.connect(imgDb) + dbCur = dbCon.cursor() + print('Starting downloads') + iterNum = 0 + query = 'SELECT page_id, license, artist, credit, restrictions, url FROM' \ + ' imgs INNER JOIN page_imgs ON imgs.name = page_imgs.img_name' + for pageId, license, artist, credit, restrictions, url in dbCur.execute(query): + if pageId in pageIdsDone: + continue + if interrupted: + print('Exiting loop') + break + # Check for problematic attributes + if license is None or LICENSE_REGEX.fullmatch(license) is None: + continue + if artist is None or artist == '' or len(artist) > 100 or re.match(r'(\d\. 
)?File:', artist) is not None: + continue + if credit is None or len(credit) > 300 or re.match(r'File:', credit) is not None: + continue + if restrictions is not None and restrictions != '': + continue + # Download image + iterNum += 1 + print(f'Iteration {iterNum}: Downloading for page-id {pageId}') + urlParts = urllib.parse.urlparse(url) + extension = os.path.splitext(urlParts.path)[1] + if len(extension) <= 1: + print(f'WARNING: No filename extension found in URL {url}') + continue + outFile = os.path.join(outDir, f'{pageId}{extension}') + print(outFile) + headers = { + 'user-agent': USER_AGENT, + 'accept-encoding': 'gzip', + } + try: + response = requests.get(url, headers=headers) + with open(outFile, 'wb') as file: + file.write(response.content) + time.sleep(timeout) + except Exception as e: + print(f'Error while downloading to {outFile}: {e}') + return + print('Closing database') + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + downloadImgs(IMG_DB, OUT_DIR, TIMEOUT) diff --git a/backend/tol_data/enwiki/gen_desc_data.py b/backend/tol_data/enwiki/gen_desc_data.py new file mode 100755 index 0000000..0dca16b --- /dev/null +++ b/backend/tol_data/enwiki/gen_desc_data.py @@ -0,0 +1,126 @@ +#!/usr/bin/python3 + +""" +Reads through the wiki dump, and attempts to parse short-descriptions, +and add them to a database +""" + +# In testing, this script took over 10 hours to run, and generated about 5GB + +import sys, os, re +import bz2 +import html, mwxml, mwparserfromhell +import sqlite3 + +DUMP_FILE = 'enwiki-20220501-pages-articles-multistream.xml.bz2' # Had about 22e6 pages +DB_FILE = 'desc_data.db' + +DESC_LINE_REGEX = re.compile('^ *[A-Z\'"]') +EMBEDDED_HTML_REGEX = re.compile(r'<[^<]+/>|<!--[^<]+-->|<[^</]+>([^<]*|[^<]*<[^<]+>[^<]*)</[^<]+>|<[^<]+$') + # Recognises a self-closing HTML tag, a tag with 0 children, 
tag with 1 child with 0 children, or unclosed tag +CONVERT_TEMPLATE_REGEX = re.compile(r'{{convert\|(\d[^|]*)\|(?:(to|-)\|(\d[^|]*)\|)?([a-z][^|}]*)[^}]*}}') +def convertTemplateReplace(match): + """ Used in regex-substitution with CONVERT_TEMPLATE_REGEX """ + if match.group(2) is None: + return f'{match.group(1)} {match.group(4)}' + else: + return f'{match.group(1)} {match.group(2)} {match.group(3)} {match.group(4)}' +PARENS_GROUP_REGEX = re.compile(r' \([^()]*\)') +LEFTOVER_BRACE_REGEX = re.compile(r'(?:{\||{{).*') + +def genData(dumpFile: str, dbFile: str) -> None: + print('Creating database') + if os.path.exists(dbFile): + raise Exception(f'ERROR: Existing {dbFile}') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE pages (id INT PRIMARY KEY, title TEXT UNIQUE)') + dbCur.execute('CREATE INDEX pages_title_idx ON pages(title COLLATE NOCASE)') + dbCur.execute('CREATE TABLE redirects (id INT PRIMARY KEY, target TEXT)') + dbCur.execute('CREATE INDEX redirects_idx ON redirects(target)') + dbCur.execute('CREATE TABLE descs (id INT PRIMARY KEY, desc TEXT)') + # + print('Iterating through dump file') + with bz2.open(dumpFile, mode='rt') as file: + for pageNum, page in enumerate(mwxml.Dump.from_file(file), 1): + if pageNum % 1e4 == 0: + print(f'At page {pageNum}') + # Parse page + if page.namespace == 0: + try: + dbCur.execute('INSERT INTO pages VALUES (?, ?)', (page.id, convertTitle(page.title))) + except sqlite3.IntegrityError as e: + # Accounts for certain pages that have the same title + print(f'Failed to add page with title "{page.title}": {e}', file=sys.stderr) + continue + if page.redirect is not None: + dbCur.execute('INSERT INTO redirects VALUES (?, ?)', (page.id, convertTitle(page.redirect))) + else: + revision = next(page) + desc = parseDesc(revision.text) + if desc is not None: + dbCur.execute('INSERT INTO descs VALUES (?, ?)', (page.id, desc)) + # + print('Closing database') + dbCon.commit() + dbCon.close() +def 
parseDesc(text: str) -> str | None: + # Find first matching line outside {{...}}, [[...]], and block-html-comment constructs, + # and then accumulate lines until a blank one. + # Some cases not accounted for include: disambiguation pages, abstracts with sentences split-across-lines, + # nested embedded html, 'content significant' embedded-html, markup not removable with mwparserfromhell. + lines: list[str] = [] + openBraceCount = 0 + openBracketCount = 0 + inComment = False + skip = False + for line in text.splitlines(): + line = line.strip() + if not lines: + if line: + if openBraceCount > 0 or line[0] == '{': + openBraceCount += line.count('{') + openBraceCount -= line.count('}') + skip = True + if openBracketCount > 0 or line[0] == '[': + openBracketCount += line.count('[') + openBracketCount -= line.count(']') + skip = True + if inComment or line.find('<!--') != -1: + if line.find('-->') != -1: + if inComment: + inComment = False + skip = True + else: + inComment = True + skip = True + if skip: + skip = False + continue + if line[-1] == ':': # Seems to help avoid disambiguation pages + return None + if DESC_LINE_REGEX.match(line) is not None: + lines.append(line) + else: + if not line: + return removeMarkup(' '.join(lines)) + lines.append(line) + if lines: + return removeMarkup(' '.join(lines)) + return None +def removeMarkup(content: str) -> str: + content = EMBEDDED_HTML_REGEX.sub('', content) + content = CONVERT_TEMPLATE_REGEX.sub(convertTemplateReplace, content) + content = mwparserfromhell.parse(content).strip_code() # Remove wikitext markup + content = PARENS_GROUP_REGEX.sub('', content) + content = LEFTOVER_BRACE_REGEX.sub('', content) + return content +def convertTitle(title: str) -> str: + return html.unescape(title).replace('_', ' ') + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(DUMP_FILE, DB_FILE) diff
--git a/backend/tol_data/enwiki/gen_dump_index_db.py b/backend/tol_data/enwiki/gen_dump_index_db.py new file mode 100755 index 0000000..5f21c9b --- /dev/null +++ b/backend/tol_data/enwiki/gen_dump_index_db.py @@ -0,0 +1,60 @@ +#!/usr/bin/python3 + +""" +Adds data from the wiki dump index-file into a database +""" +import sys, os, re +import bz2 +import sqlite3 + +INDEX_FILE = 'enwiki-20220501-pages-articles-multistream-index.txt.bz2' # Had about 22e6 lines +DB_FILE = 'dumpIndex.db' + +def genData(indexFile: str, dbFile: str) -> None: + """ Reads the index file and creates the db """ + if os.path.exists(dbFile): + raise Exception(f'ERROR: Existing {dbFile}') + print('Creating database') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE offsets (title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT)') + print('Iterating through index file') + lineRegex = re.compile(r'([^:]+):([^:]+):(.*)') + lastOffset = 0 + lineNum = 0 + entriesToAdd: list[tuple[str, str]] = [] + with bz2.open(indexFile, mode='rt') as file: + for line in file: + lineNum += 1 + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + # + match = lineRegex.fullmatch(line.rstrip()) + assert match is not None + offsetStr, pageId, title = match.group(1,2,3) + offset = int(offsetStr) + if offset > lastOffset: + for t, p in entriesToAdd: + try: + dbCur.execute('INSERT INTO offsets VALUES (?, ?, ?, ?)', (t, int(p), lastOffset, offset)) + except sqlite3.IntegrityError as e: + # Accounts for certain entries in the file that have the same title + print(f'Failed on title "{t}": {e}', file=sys.stderr) + entriesToAdd = [] + lastOffset = offset + entriesToAdd.append((title, pageId)) + for title, pageId in entriesToAdd: + try: + dbCur.execute('INSERT INTO offsets VALUES (?, ?, ?, ?)', (title, int(pageId), lastOffset, -1)) + except sqlite3.IntegrityError as e: + print(f'Failed on title "{title}": {e}', file=sys.stderr) + print('Closing database') + dbCon.commit() + 
dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(INDEX_FILE, DB_FILE) diff --git a/backend/tol_data/enwiki/gen_img_data.py b/backend/tol_data/enwiki/gen_img_data.py new file mode 100755 index 0000000..d4696f0 --- /dev/null +++ b/backend/tol_data/enwiki/gen_img_data.py @@ -0,0 +1,193 @@ +#!/usr/bin/python3 + +""" +For some set of page IDs, looks up their content in the wiki dump, +and tries to parse infobox image names, storing them into a database. + +The program can be re-run with an updated set of page IDs, and +will skip already-processed page IDs. +""" + +import re +import os, bz2, html, urllib.parse +import sqlite3 + +DUMP_FILE = 'enwiki-20220501-pages-articles-multistream.xml.bz2' +INDEX_DB = 'dumpIndex.db' +IMG_DB = 'img_data.db' # The database to create +DB_FILE = os.path.join('..', 'data.db') +# +ID_LINE_REGEX = re.compile(r'<id>(.*)</id>') +IMG_LINE_REGEX = re.compile(r'.*\| *image *= *([^|]*)') +BRACKET_IMG_REGEX = re.compile(r'\[\[(File:[^|]*).*]]') +IMG_NAME_REGEX = re.compile(r'.*\.(jpg|jpeg|png|gif|tiff|tif)', flags=re.IGNORECASE) +CSS_IMG_CROP_REGEX = re.compile(r'{{css image crop\|image *= *(.*)', flags=re.IGNORECASE) + +def genData(pageIds: set[int], dumpFile: str, indexDb: str, imgDb: str) -> None: + print('Opening databases') + indexDbCon = sqlite3.connect(indexDb) + indexDbCur = indexDbCon.cursor() + imgDbCon = sqlite3.connect(imgDb) + imgDbCur = imgDbCon.cursor() + print('Checking tables') + if imgDbCur.execute('SELECT name FROM sqlite_master WHERE type="table" AND name="page_imgs"').fetchone() is None: + # Create tables if not present + imgDbCur.execute('CREATE TABLE page_imgs (page_id INT PRIMARY KEY, img_name TEXT)') # img_name may be NULL + imgDbCur.execute('CREATE INDEX page_imgs_idx ON page_imgs(img_name)') + else: + # Check for already-processed page IDs + numSkipped = 0 + for 
(pid,) in imgDbCur.execute('SELECT page_id FROM page_imgs'): + if pid in pageIds: + pageIds.remove(pid) + numSkipped += 1 + else: + print(f'Found already-processed page ID {pid} which was not in input set') + print(f'Will skip {numSkipped} already-processed page IDs') + # + print('Getting dump-file offsets') + offsetToPageids: dict[int, list[int]] = {} + offsetToEnd: dict[int, int] = {} # Maps chunk-start offsets to their chunk-end offsets + iterNum = 0 + for pageId in pageIds: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') + # + query = 'SELECT offset, next_offset FROM offsets WHERE id = ?' + row: tuple[int, int] | None = indexDbCur.execute(query, (pageId,)).fetchone() + if row is None: + print(f'WARNING: Page ID {pageId} not found') + continue + chunkOffset, endOffset = row + offsetToEnd[chunkOffset] = endOffset + if chunkOffset not in offsetToPageids: + offsetToPageids[chunkOffset] = [] + offsetToPageids[chunkOffset].append(pageId) + print(f'Found {len(offsetToEnd)} chunks to check') + # + print('Iterating through chunks in dump file') + with open(dumpFile, mode='rb') as file: + iterNum = 0 + for pageOffset, endOffset in offsetToEnd.items(): + iterNum += 1 + if iterNum % 100 == 0: + print(f'At iteration {iterNum}') + # + chunkPageIds = offsetToPageids[pageOffset] + # Jump to chunk + file.seek(pageOffset) + compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset) + data = bz2.BZ2Decompressor().decompress(compressedData).decode() + # Look in chunk for pages + lines = data.splitlines() + lineIdx = 0 + while lineIdx < len(lines): + # Look for <page> + if lines[lineIdx].lstrip() != '<page>': + lineIdx += 1 + continue + # Check page id + lineIdx += 3 + idLine = lines[lineIdx].lstrip() + match = ID_LINE_REGEX.fullmatch(idLine) + if match is None or int(match.group(1)) not in chunkPageIds: + lineIdx += 1 + continue + pageId = int(match.group(1)) + lineIdx += 1 + # Look for <text> in <page> + foundText = False + while 
lineIdx < len(lines): + if not lines[lineIdx].lstrip().startswith('<text '): + lineIdx += 1 + continue + foundText = True + # Get text content + content: list[str] = [] + line = lines[lineIdx] + content.append(line[line.find('>') + 1:]) + lineIdx += 1 + foundTextEnd = False + while lineIdx < len(lines): + line = lines[lineIdx] + if not line.endswith('</text>'): + content.append(line) + lineIdx += 1 + continue + foundTextEnd = True + content.append(line[:line.rfind('</text>')]) + # Look for image-filename + imageName = getImageName(content) + imgDbCur.execute('INSERT into page_imgs VALUES (?, ?)', (pageId, imageName)) + break + if not foundTextEnd: + print(f'WARNING: Did not find </text> for page id {pageId}') + break + if not foundText: + print(f'WARNING: Did not find <text> for page id {pageId}') + # + print('Closing databases') + indexDbCon.close() + imgDbCon.commit() + imgDbCon.close() +def getImageName(content: list[str]) -> str | None: + """ Given an array of text-content lines, tries to return an infobox image name, or None """ + # Doesn't try to find images in outside-infobox [[File:...]] and <imagemap> sections + for line in content: + match = IMG_LINE_REGEX.match(line) + if match is not None: + imageName = match.group(1).strip() + if imageName == '': + return None + imageName = html.unescape(imageName) + # Account for {{...
+ if imageName.startswith('{'): + match = CSS_IMG_CROP_REGEX.match(imageName) + if match is None: + return None + imageName = match.group(1) + # Account for [[File:...|...]] + if imageName.startswith('['): + match = BRACKET_IMG_REGEX.match(imageName) + if match is None: + return None + imageName = match.group(1) + # Account for <!-- + if imageName.find('<!--') != -1: + return None + # Remove an initial 'File:' + if imageName.startswith('File:'): + imageName = imageName[5:] + # Remove an initial 'Image:' + if imageName.startswith('Image:'): + imageName = imageName[6:] + # Check for extension + match = IMG_NAME_REGEX.match(imageName) + if match is not None: + imageName = match.group(0) + imageName = urllib.parse.unquote(imageName) + imageName = html.unescape(imageName) # Intentionally unescaping again (handles some odd cases) + imageName = imageName.replace('_', ' ') + return imageName + # Exclude lines like: | image = <imagemap> + return None + return None + +def getInputPageIdsFromDb(dbFile: str) -> set[int]: + print('Getting input page-ids') + pageIds: set[int] = set() + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + for (pageId,) in dbCur.execute('SELECT id from wiki_ids'): + pageIds.add(pageId) + dbCon.close() + print(f'Found {len(pageIds)}') + return pageIds +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + pageIds = getInputPageIdsFromDb(DB_FILE) + genData(pageIds, DUMP_FILE, INDEX_DB, IMG_DB) diff --git a/backend/tol_data/enwiki/gen_pageview_data.py b/backend/tol_data/enwiki/gen_pageview_data.py new file mode 100755 index 0000000..ce3b674 --- /dev/null +++ b/backend/tol_data/enwiki/gen_pageview_data.py @@ -0,0 +1,68 @@ +#!/usr/bin/python3 + +""" +Reads through wikimedia files containing pageview counts, +computes average counts, and adds them to a database +""" + +# Took about 15min per file (each had about 180e6 
lines) + +import sys, os, glob, math, re +from collections import defaultdict +import bz2, sqlite3 + +PAGEVIEW_FILES = glob.glob('./pageviews/pageviews-*-user.bz2') +DUMP_INDEX_DB = 'dumpIndex.db' +DB_FILE = 'pageview_data.db' + +def genData(pageviewFiles: list[str], dumpIndexDb: str, dbFile: str) -> None: + # Each pageview file has lines that seem to hold these space-separated fields: + # wiki code (eg: en.wikipedia), article title, page ID (may be: null), + # platform (eg: mobile-web), monthly view count, + # hourly count string (eg: A1B2 means 1 view on day 1 and 2 views on day 2) + if os.path.exists(dbFile): + print('ERROR: Database already exists') + sys.exit(1) + # + namespaceRegex = re.compile(r'[a-zA-Z]+:') + titleToViews: dict[str, int] = defaultdict(int) + linePrefix = b'en.wikipedia ' + for filename in pageviewFiles: + print(f'Reading from {filename}') + with bz2.open(filename, 'rb') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e6 == 0: + print(f'At line {lineNum}') + if not line.startswith(linePrefix): + continue + # Get second and second-last fields + line = line[len(linePrefix):line.rfind(b' ')] # Remove first and last fields + title = line[:line.find(b' ')].decode('utf-8') + viewCount = int(line[line.rfind(b' ')+1:]) + if namespaceRegex.match(title) is not None: + continue + # Update map + titleToViews[title] += viewCount + print(f'Found {len(titleToViews)} titles') + # + print('Writing to db') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + idbCon = sqlite3.connect(dumpIndexDb) + idbCur = idbCon.cursor() + dbCur.execute('CREATE TABLE views (title TEXT PRIMARY KEY, id INT, views INT)') + for title, views in titleToViews.items(): + row = idbCur.execute('SELECT id FROM offsets WHERE title = ?', (title,)).fetchone() + if row is not None: + wikiId = int(row[0]) + dbCur.execute('INSERT INTO views VALUES (?, ?, ?)', (title, wikiId, math.floor(views / len(pageviewFiles)))) + dbCon.commit() + dbCon.close() + idbCon.close()
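The space-separated line layout described in the comment at the top of genData() can be illustrated with a minimal slicing sketch. The sample line below is hypothetical, but follows the listed field order (wiki code, title, page ID, platform, monthly count, hourly count string), and the slicing mirrors what genData() does:

```python
# Minimal sketch of how genData() above slices one pageview line.
# The sample line is hypothetical.
line = b'en.wikipedia Cat 6678 desktop 42 A1B2\n'
linePrefix = b'en.wikipedia '

# Drop the first field (wiki code) and the trailing hourly-count string
trimmed = line[len(linePrefix):line.rfind(b' ')]
title = trimmed[:trimmed.find(b' ')].decode('utf-8')
viewCount = int(trimmed[trimmed.rfind(b' ') + 1:])
print(title, viewCount)  # -> Cat 42
```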
+ +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + genData(PAGEVIEW_FILES, DUMP_INDEX_DB, DB_FILE) diff --git a/backend/tol_data/enwiki/lookup_page.py b/backend/tol_data/enwiki/lookup_page.py new file mode 100755 index 0000000..8ef1229 --- /dev/null +++ b/backend/tol_data/enwiki/lookup_page.py @@ -0,0 +1,71 @@ +#!/usr/bin/python3 + +""" +Looks up a page with a given title in the wiki dump, using the dump-index +db, and prints the corresponding <page>. +""" + +import sys +import bz2 +import sqlite3 + +DUMP_FILE = 'enwiki-20220501-pages-articles-multistream.xml.bz2' +INDEX_DB = 'dumpIndex.db' + +def lookupPage(dumpFile: str, indexDb: str, pageTitle: str) -> None: + print('Looking up offset in index db') + dbCon = sqlite3.connect(indexDb) + dbCur = dbCon.cursor() + query = 'SELECT title, offset, next_offset FROM offsets WHERE title = ?' + row = dbCur.execute(query, (pageTitle,)).fetchone() + if row is None: + print('Title not found') + sys.exit(0) + _, pageOffset, endOffset = row + dbCon.close() + print(f'Found chunk at offset {pageOffset}') + # + print('Reading from wiki dump') + content: list[str] = [] + with open(dumpFile, mode='rb') as file: + # Get uncompressed chunk + file.seek(pageOffset) + compressedData = file.read(None if endOffset == -1 else endOffset - pageOffset) + data = bz2.BZ2Decompressor().decompress(compressedData).decode() + # Look in chunk for page + lines = data.splitlines() + lineIdx = 0 + found = False + pageNum = 0 + while not found: + line = lines[lineIdx] + if line.lstrip() == '<page>': + pageNum += 1 + if pageNum > 100: + print('ERROR: Did not find title after 100 pages') + break + lineIdx += 1 + titleLine = lines[lineIdx] + if titleLine.lstrip() == '<title>' + pageTitle + '</title>': + found = True + print(f'Found title in chunk as page {pageNum}') + content.append(line) + content.append(titleLine) + 
while True: + lineIdx += 1 + line = lines[lineIdx] + content.append(line) + if line.lstrip() == '</page>': + break + lineIdx += 1 + # + print('Content: ') + print('\n'.join(content)) + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument('title', help='The title to look up') + args = parser.parse_args() + # + lookupPage(DUMP_FILE, INDEX_DB, args.title.replace('_', ' ')) diff --git a/backend/tol_data/eol/README.md b/backend/tol_data/eol/README.md new file mode 100644 index 0000000..580310d --- /dev/null +++ b/backend/tol_data/eol/README.md @@ -0,0 +1,31 @@ +This directory holds files obtained via the [Encyclopedia of Life](https://eol.org/). + +# Mapping Files +- `provider_ids.csv.gz` <br> + Obtained from <https://opendata.eol.org/dataset/identifier-map> on 22/08/22 (says last updated 27/07/22). + Associates EOL IDs with taxon IDs from sources like NCBI and Index Fungorum. + +# Name Data Files +- `vernacularNames.csv` <br> + Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020). + Contains alternative-node-names data from EOL. + +# Image Metadata Files +- `imagesList.tgz` <br> + Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020). + Contains metadata for images from EOL. +- `imagesList/` <br> + Extracted from `imagesList.tgz`. +- `gen_images_list_db.py` <br> + Creates a database, and imports imagesList/*.csv files into it. +- `images_list.db` <br> + Created by running gen_images_list_db.py <br> + Tables: <br> + - `images`: + `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT` + +# Image Generation Files +- `download_imgs.py` <br> + Used to download image files into imgs_for_review/. +- `review_imgs.py` <br> + Used to review images in imgs_for_review/, moving acceptable ones into imgs/. 
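The `images` table layout listed above can be exercised with a small sqlite3 sketch. This uses an in-memory database and a hypothetical sample row (all field values are made up), since the real images_list.db may not be present:

```python
import sqlite3

# Sketch of the images_list.db `images` table described above,
# built in memory with one hypothetical sample row.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute(
    'CREATE TABLE images'
    ' (content_id INT PRIMARY KEY, page_id INT, source_url TEXT,'
    ' copy_url TEXT, license TEXT, copyright_owner TEXT)')
cur.execute('INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)',
    (1, 327955, 'https://example.org/src', 'data/img.jpg', 'cc-by-2.0', 'Someone'))
# download_imgs.py starts by collecting the distinct EOL page IDs like this:
pageIds = {pid for (pid,) in cur.execute('SELECT DISTINCT page_id FROM images')}
print(pageIds)  # -> {327955}
con.close()
```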
diff --git a/backend/tol_data/eol/__init__.py b/backend/tol_data/eol/__init__.py new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/backend/tol_data/eol/__init__.py diff --git a/backend/tol_data/eol/download_imgs.py b/backend/tol_data/eol/download_imgs.py new file mode 100755 index 0000000..8454a35 --- /dev/null +++ b/backend/tol_data/eol/download_imgs.py @@ -0,0 +1,152 @@ +#!/usr/bin/python3 + +""" +For some set of EOL IDs, downloads associated images from URLs in +an image-list database. Uses multiple downloading threads. + +May obtain multiple images per ID. The images will get names +with the form 'eolId1 contentId1.ext1'. + +SIGINT causes the program to finish ongoing downloads and exit. +The program can be re-run to continue downloading. It looks for +already-downloaded files, and continues after the one with +highest EOL ID. +""" + +import sys, re, os, random +import sqlite3 +import urllib.parse, requests +import time +from threading import Thread +import signal + +IMAGES_LIST_DB = 'images_list.db' +OUT_DIR = 'imgs_for_review' +DB_FILE = os.path.join('..', 'data.db') +# +MAX_IMGS_PER_ID = 3 +MAX_THREADS = 5 +POST_DL_DELAY_MIN = 2 # Minimum delay in seconds to pause after download before starting another (for each thread) +POST_DL_DELAY_MAX = 3 +LICENSE_REGEX = r'cc-by((-nc)?(-sa)?(-[234]\.[05])?)|cc-publicdomain|cc-0-1\.0|public domain' + +def downloadImgs(eolIds, imagesListDb, outDir): + print('Getting EOL IDs to download for') + # Get IDs from images-list db + imgDbCon = sqlite3.connect(imagesListDb) + imgCur = imgDbCon.cursor() + imgListIds: set[int] = set() + for (pageId,) in imgCur.execute('SELECT DISTINCT page_id FROM images'): + imgListIds.add(pageId) + # Get set intersection, and sort into list + eolIds = eolIds.intersection(imgListIds) + eolIdList = sorted(eolIds) + nextIdx = 0 + print(f'Result: {len(eolIdList)} EOL IDs') + # + print('Checking output directory') + if not os.path.exists(outDir): + os.mkdir(outDir) + else: + 
print('Finding next ID to download for') + fileList = os.listdir(outDir) + ids = [int(filename.split(' ')[0]) for filename in fileList] + if ids: + ids.sort() + nextIdx = eolIdList.index(ids[-1]) + 1 + if nextIdx == len(eolIdList): + print('No IDs left. Exiting...') + return + # + print('Starting download threads') + numThreads = 0 + threadException: Exception | None = None # Used for ending main thread after a non-main thread exception + # Handle SIGINT signals + interrupted = False + oldHandler = None + def onSigint(sig, frame): + nonlocal interrupted + interrupted = True + signal.signal(signal.SIGINT, oldHandler) + oldHandler = signal.signal(signal.SIGINT, onSigint) + # Function for threads to execute + def downloadImg(url, outFile): + nonlocal numThreads, threadException + try: + data = requests.get(url) + with open(outFile, 'wb') as file: + file.write(data.content) + time.sleep(random.random() * (POST_DL_DELAY_MAX - POST_DL_DELAY_MIN) + POST_DL_DELAY_MIN) + except Exception as e: + print(f'Error while downloading to {outFile}: {str(e)}', file=sys.stderr) + threadException = e + numThreads -= 1 + # Manage downloading + for idx in range(nextIdx, len(eolIdList)): + eolId = eolIdList[idx] + # Get image urls + ownerSet: set[str] = set() # Used to get images from different owners, for variety + exitLoop = False + query = 'SELECT content_id, copy_url, license, copyright_owner FROM images WHERE page_id = ?' 
+ for contentId, url, license, copyrightOwner in imgCur.execute(query, (eolId,)): + if url.startswith('data/'): + url = 'https://content.eol.org/' + url + urlParts = urllib.parse.urlparse(url) + extension = os.path.splitext(urlParts.path)[1] + if len(extension) <= 1: + print(f'WARNING: No filename extension found in URL {url}', file=sys.stderr) + continue + # Check image-quantity limit + if len(ownerSet) == MAX_IMGS_PER_ID: + break + # Check for skip conditions + if re.fullmatch(LICENSE_REGEX, license) is None: + continue + if len(copyrightOwner) > 100: # Avoid certain copyrightOwner fields that seem long and problematic + continue + if copyrightOwner in ownerSet: + continue + ownerSet.add(copyrightOwner) + # Determine output filename + outPath = os.path.join(outDir, f'{eolId} {contentId}{extension}') + if os.path.exists(outPath): + print(f'WARNING: {outPath} already exists. Skipping download.') + continue + # Check thread limit + while numThreads == MAX_THREADS: + time.sleep(1) + # Wait for threads after an interrupt or thread-exception + if interrupted or threadException is not None: + print('Waiting for existing threads to end') + while numThreads > 0: + time.sleep(1) + exitLoop = True + break + # Perform download + print(f'Downloading image to {outPath}') + numThreads += 1 + thread = Thread(target=downloadImg, args=(url, outPath), daemon=True) + thread.start() + if exitLoop: + break + # Close images-list db + while numThreads > 0: + time.sleep(1) + print('Finished downloading') + imgDbCon.close() + +def getEolIdsFromDb(dbFile) -> set[int]: + eolIds: set[int] = set() + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + for (id,) in dbCur.execute('SELECT id FROM eol_ids'): + eolIds.add(id) + dbCon.close() + return eolIds +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + eolIds = getEolIdsFromDb(DB_FILE) + 
downloadImgs(eolIds, IMAGES_LIST_DB, OUT_DIR) diff --git a/backend/tol_data/eol/gen_images_list_db.py b/backend/tol_data/eol/gen_images_list_db.py new file mode 100755 index 0000000..ee57ac6 --- /dev/null +++ b/backend/tol_data/eol/gen_images_list_db.py @@ -0,0 +1,39 @@ +#!/usr/bin/python3 + +""" +Generates a sqlite db from a directory of CSV files holding EOL image data +""" + +import os, glob +import csv, re, sqlite3 + +IMAGE_LISTS_GLOB = os.path.join('imagesList', '*.csv') +DB_FILE = 'images_list.db' + +def genData(imageListsGlob: str, dbFile: str) -> None: + print('Creating database') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE images' \ + ' (content_id INT PRIMARY KEY, page_id INT, source_url TEXT,' \ + ' copy_url TEXT, license TEXT, copyright_owner TEXT)') + dbCur.execute('CREATE INDEX images_pid_idx ON images(page_id)') + print('Reading CSV files') + for filename in glob.glob(imageListsGlob): + print(f'Processing {filename}') + with open(filename, newline='') as file: + for contentId, pageId, sourceUrl, copyUrl, license, owner in csv.reader(file): + if re.match(r'^[a-zA-Z]', contentId): # Skip header line (not in all files) + continue + dbCur.execute('INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)', + (int(contentId), int(pageId), sourceUrl, copyUrl, license, owner)) + print('Closing database') + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(IMAGE_LISTS_GLOB, DB_FILE) diff --git a/backend/tol_data/eol/review_imgs.py b/backend/tol_data/eol/review_imgs.py new file mode 100755 index 0000000..9fb462c --- /dev/null +++ b/backend/tol_data/eol/review_imgs.py @@ -0,0 +1,213 @@ +#!/usr/bin/python3 + +""" +Provides a GUI for reviewing images. 
Looks in a for-review directory for +images named 'eolId1 contentId1.ext1', and, for each EOL ID, enables the user to +choose an image to keep, or reject all. Also provides image rotation. +Chosen images are placed in another directory, and rejected ones are deleted. +""" + +import sys, re, os, time +import sqlite3 +import tkinter as tki +from tkinter import ttk +import PIL +from PIL import ImageTk, Image, ImageOps + +IMG_DIR = 'imgs_for_review' +OUT_DIR = 'imgs' +EXTRA_INFO_DB = os.path.join('..', 'data.db') +# +IMG_DISPLAY_SZ = 400 +MAX_IMGS_PER_ID = 3 +IMG_BG_COLOR = (88, 28, 135) +PLACEHOLDER_IMG = Image.new('RGB', (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), IMG_BG_COLOR) + +class EolImgReviewer: + """ Provides the GUI for reviewing images """ + def __init__(self, root, imgDir, imgList, extraInfoDb, outDir): + self.root = root + root.title('EOL Image Reviewer') + # Setup main frame + mainFrame = ttk.Frame(root, padding='5 5 5 5') + mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S)) + root.columnconfigure(0, weight=1) + root.rowconfigure(0, weight=1) + # Set up images-to-be-reviewed frames + self.imgs = [PLACEHOLDER_IMG] * MAX_IMGS_PER_ID # Stored as fields for use in rotation + self.photoImgs = list(map(lambda img: ImageTk.PhotoImage(img), self.imgs)) # Image objects usable by tkinter + # These need a persistent reference for some reason (doesn't display otherwise) + self.labels: list[ttk.Label] = [] + for i in range(MAX_IMGS_PER_ID): + frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ) + frame.grid(column=i, row=0) + label = ttk.Label(frame, image=self.photoImgs[i]) + label.grid(column=0, row=0) + self.labels.append(label) + # Add padding + for child in mainFrame.winfo_children(): + child.grid_configure(padx=5, pady=5) + # Add keyboard bindings + root.bind('<q>', self.quit) + root.bind('<Key-j>', lambda evt: self.accept(0)) + root.bind('<Key-k>', lambda evt: self.accept(1)) + root.bind('<Key-l>', lambda evt: self.accept(2)) + 
root.bind('<Key-i>', lambda evt: self.reject()) + root.bind('<Key-a>', lambda evt: self.rotate(0)) + root.bind('<Key-s>', lambda evt: self.rotate(1)) + root.bind('<Key-d>', lambda evt: self.rotate(2)) + root.bind('<Key-A>', lambda evt: self.rotate(0, True)) + root.bind('<Key-S>', lambda evt: self.rotate(1, True)) + root.bind('<Key-D>', lambda evt: self.rotate(2, True)) + # Initialise fields + self.imgDir = imgDir + self.imgList = imgList + self.outDir = outDir + self.imgListIdx = 0 + self.nextEolId = 0 + self.nextImgNames: list[str] = [] + self.rotations: list[int] = [] + # For displaying extra info + self.extraInfoDbCon = sqlite3.connect(extraInfoDb) + self.extraInfoDbCur = self.extraInfoDbCon.cursor() + self.numReviewed = 0 + self.startTime = time.time() + # + self.getNextImgs() + def getNextImgs(self): + """ Updates display with new images to review, or ends program """ + # Gather names of next images to review + for i in range(MAX_IMGS_PER_ID): + if self.imgListIdx == len(self.imgList): + if i == 0: + self.quit() + return + break + imgName = self.imgList[self.imgListIdx] + eolId = int(re.match(r'(\d+) (\d+)', imgName).group(1)) + if i == 0: + self.nextEolId = eolId + self.nextImgNames = [imgName] + self.rotations = [0] + else: + if self.nextEolId != eolId: + break + self.nextImgNames.append(imgName) + self.rotations.append(0) + self.imgListIdx += 1 + # Update displayed images + idx = 0 + while idx < MAX_IMGS_PER_ID: + if idx < len(self.nextImgNames): + try: + img = Image.open(os.path.join(self.imgDir, self.nextImgNames[idx])) + img = ImageOps.exif_transpose(img) + except PIL.UnidentifiedImageError: + os.remove(os.path.join(self.imgDir, self.nextImgNames[idx])) + del self.nextImgNames[idx] + del self.rotations[idx] + continue + self.imgs[idx] = self.resizeImgForDisplay(img) + else: + self.imgs[idx] = PLACEHOLDER_IMG + self.photoImgs[idx] = ImageTk.PhotoImage(self.imgs[idx]) + self.labels[idx].config(image=self.photoImgs[idx]) + idx += 1 + # Restart if all image 
files non-recognisable + if not self.nextImgNames: + self.getNextImgs() + return + # Update title + firstImgIdx = self.imgListIdx - len(self.nextImgNames) + 1 + lastImgIdx = self.imgListIdx + title = self.getExtraInfo(self.nextEolId) + title += f' (imgs {firstImgIdx} to {lastImgIdx} out of {len(self.imgList)})' + self.root.title(title) + def accept(self, imgIdx): + """ React to a user selecting an image """ + if imgIdx >= len(self.nextImgNames): + print('Invalid selection') + return + for i in range(len(self.nextImgNames)): + inFile = os.path.join(self.imgDir, self.nextImgNames[i]) + if i == imgIdx: # Move accepted image, rotating if needed + outFile = os.path.join(self.outDir, self.nextImgNames[i]) + img = Image.open(inFile) + img = ImageOps.exif_transpose(img) + if self.rotations[i] != 0: + img = img.rotate(self.rotations[i], expand=True) + img.save(outFile) + os.remove(inFile) + else: # Delete non-accepted image + os.remove(inFile) + self.numReviewed += 1 + self.getNextImgs() + def reject(self): + """ React to a user rejecting all images of a set """ + for i in range(len(self.nextImgNames)): + os.remove(os.path.join(self.imgDir, self.nextImgNames[i])) + self.numReviewed += 1 + self.getNextImgs() + def rotate(self, imgIdx, anticlockwise = False): + """ Respond to a user rotating an image """ + deg = -90 if not anticlockwise else 90 + self.imgs[imgIdx] = self.imgs[imgIdx].rotate(deg) + self.photoImgs[imgIdx] = ImageTk.PhotoImage(self.imgs[imgIdx]) + self.labels[imgIdx].config(image=self.photoImgs[imgIdx]) + self.rotations[imgIdx] = (self.rotations[imgIdx] + deg) % 360 + def quit(self, e = None): + print(f'Number reviewed: {self.numReviewed}') + timeElapsed = time.time() - self.startTime + print(f'Time elapsed: {timeElapsed:.2f} seconds') + if self.numReviewed > 0: + print(f'Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds') + self.extraInfoDbCon.close() + self.root.destroy() + # + def resizeImgForDisplay(self, img): + """ Returns a copy of an image, 
shrunk to fit in its frame (keeps aspect ratio), and with a background """ + if max(img.width, img.height) > IMG_DISPLAY_SZ: + if (img.width > img.height): + newHeight = int(img.height * IMG_DISPLAY_SZ/img.width) + img = img.resize((IMG_DISPLAY_SZ, newHeight)) + else: + newWidth = int(img.width * IMG_DISPLAY_SZ / img.height) + img = img.resize((newWidth, IMG_DISPLAY_SZ)) + bgImg = PLACEHOLDER_IMG.copy() + bgImg.paste(img, box=( + int((IMG_DISPLAY_SZ - img.width) / 2), + int((IMG_DISPLAY_SZ - img.height) / 2))) + return bgImg + def getExtraInfo(self, eolId: int) -> str: + """ Used to display extra EOL ID info """ + query = 'SELECT names.alt_name FROM' \ + ' names INNER JOIN eol_ids ON eol_ids.name = names.name' \ + ' WHERE id = ? and pref_alt = 1' + row = self.extraInfoDbCur.execute(query, (eolId,)).fetchone() + if row is not None: + return f'Reviewing EOL ID {eolId}, aka "{row[0]}"' + else: + return f'Reviewing EOL ID {eolId}' + +def reviewImgs(imgDir: str, outDir: str, extraInfoDb: str): + print('Checking output directory') + if not os.path.exists(outDir): + os.mkdir(outDir) + print('Getting input image list') + imgList = os.listdir(imgDir) + imgList.sort(key=lambda s: int(s.split(' ')[0])) + if not imgList: + print('No input images found') + sys.exit(0) + # Create GUI and defer control + print('Starting GUI') + root = tki.Tk() + EolImgReviewer(root, imgDir, imgList, extraInfoDb, outDir) + root.mainloop() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + reviewImgs(IMG_DIR, OUT_DIR, EXTRA_INFO_DB) diff --git a/backend/tol_data/gen_desc_data.py b/backend/tol_data/gen_desc_data.py new file mode 100755 index 0000000..fa08a8c --- /dev/null +++ b/backend/tol_data/gen_desc_data.py @@ -0,0 +1,92 @@ +#!/usr/bin/python3 + +""" +Maps nodes to short descriptions, using data from DBpedia and +Wikipedia, and stores results in the database. 
+""" + +import os, sqlite3 + +DBPEDIA_DB = os.path.join('dbpedia', 'desc_data.db') +ENWIKI_DB = os.path.join('enwiki', 'desc_data.db') +DB_FILE = 'data.db' + +def genData(dbpediaDb: str, enwikiDb: str, dbFile: str) -> None: + print('Creating table') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE descs (wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT)') + # + print('Getting node mappings') + nodeToWikiId: dict[str, int] = {} + for name, wikiId in dbCur.execute('SELECT name, id from wiki_ids'): + nodeToWikiId[name] = wikiId + # + print('Reading data from DBpedia') + dbpCon = sqlite3.connect(dbpediaDb) + dbpCur = dbpCon.cursor() + print('Getting node IRIs') + nodeToIri: dict[str, str] = {} + iterNum = 0 + for name, wikiId in nodeToWikiId.items(): + iterNum += 1 + if iterNum % 1e5 == 0: + print(f'At iteration {iterNum}') + # + row = dbpCur.execute('SELECT iri FROM ids where id = ?', (wikiId,)).fetchone() + if row is not None: + nodeToIri[name] = row[0] + print('Resolving redirects') + iterNum = 0 + for name, iri in nodeToIri.items(): + iterNum += 1 + if iterNum % 1e5 == 0: + print(f'At iteration {iterNum}') + # + row = dbpCur.execute('SELECT target FROM redirects where iri = ?', (iri,)).fetchone() + if row is not None: + nodeToIri[name] = row[0] + print('Adding descriptions') + iterNum = 0 + for name, iri in nodeToIri.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') + # + row = dbpCur.execute('SELECT abstract FROM abstracts WHERE iri = ?', (iri,)).fetchone() + if row is not None: + dbCur.execute('INSERT OR IGNORE INTO descs VALUES (?, ?, ?)', (nodeToWikiId[name], row[0], 1)) + del nodeToWikiId[name] + dbpCon.close() + # + print('Reading data from Wikipedia') + enwikiCon = sqlite3.connect(enwikiDb) + enwikiCur = enwikiCon.cursor() + print('Adding descriptions') + iterNum = 0 + for name, wikiId in nodeToWikiId.items(): + iterNum += 1 + if iterNum % 1e3 == 0: + print(f'At iteration {iterNum}') + 
# Check for redirect + wikiIdToGet = wikiId + query = 'SELECT pages.id FROM redirects INNER JOIN pages ON redirects.target = pages.title' \ + ' WHERE redirects.id = ?' + row = enwikiCur.execute(query, (wikiId,)).fetchone() + if row is not None: + wikiIdToGet = row[0] + # + row = enwikiCur.execute('SELECT desc FROM descs where id = ?', (wikiIdToGet,)).fetchone() + if row is not None: + dbCur.execute('INSERT OR IGNORE INTO descs VALUES (?, ?, ?)', (wikiId, row[0], 0)) + # + print('Closing databases') + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + genData(DBPEDIA_DB, ENWIKI_DB, DB_FILE) diff --git a/backend/tol_data/gen_imgs.py b/backend/tol_data/gen_imgs.py new file mode 100755 index 0000000..6d54e4d --- /dev/null +++ b/backend/tol_data/gen_imgs.py @@ -0,0 +1,214 @@ +#!/usr/bin/python3 + +""" +Reads node IDs and image paths from a file, and possibly from a directory, +and generates cropped/resized versions of those images into a directory, +with names of the form 'nodeId1.jpg'. Also adds image metadata to the +database. + +SIGINT can be used to stop, and the program can be re-run to continue +processing. It uses already-existing database entries to decide what +to skip. 
+""" + +import os, subprocess +import sqlite3, urllib.parse +import signal + +IMG_LIST_FILE = 'img_list.txt' +EOL_IMG_DIR = os.path.join('eol', 'imgs') # Used to decide which IMG_LIST_FILE lines denote chosen EOL images +OUT_DIR = 'img' +EOL_IMG_DB = os.path.join('eol', 'images_list.db') +ENWIKI_IMG_DB = os.path.join('enwiki', 'img_data.db') +PICKED_IMGS_DIR = 'picked_imgs' +PICKED_IMGS_FILE = 'img_data.txt' +DB_FILE = 'data.db' +# +IMG_OUT_SZ = 200 + +ImgId = tuple[int, str] # Holds an int ID and a source string (eg: 'eol') +class PickedImg: + """ Represents a picked-image from pickedImgsDir """ + def __init__(self, nodeName: str, id: int, filename: str, url: str, license: str, artist: str, credit: str): + self.nodeName = nodeName + self.id = id + self.filename = filename + self.url = url + self.license = license + self.artist = artist + self.credit = credit + +def genImgs( + imgListFile: str, eolImgDir: str, outDir: str, eolImgDb: str, enwikiImgDb: str, + pickedImgsDir: str, pickedImgsFile: str, dbFile): + """ Reads the image-list file, generates images, and updates db """ + if not os.path.exists(outDir): + os.mkdir(outDir) + # + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + print('Checking for image tables') + nodesDone: set[str] = set() + imgsDone: set[ImgId] = set() + if dbCur.execute('SELECT name FROM sqlite_master WHERE type="table" AND name="node_imgs"').fetchone() is None: + # Add image tables if not present + dbCur.execute('CREATE TABLE node_imgs (name TEXT PRIMARY KEY, img_id INT, src TEXT)') + dbCur.execute('CREATE TABLE images (' \ + 'id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src))') + else: + # Get existing image-associated nodes + for (otolId,) in dbCur.execute('SELECT nodes.id FROM node_imgs INNER JOIN nodes ON node_imgs.name = nodes.name'): + nodesDone.add(otolId) + # Get existing node-associated images + for imgId, imgSrc in dbCur.execute('SELECT id, src from images'): + imgsDone.add((imgId, 
imgSrc)) + print(f'Found {len(nodesDone)} nodes and {len(imgsDone)} images to skip') + # + print('Processing picked-images') + success = processPickedImgs(pickedImgsDir, pickedImgsFile, nodesDone, imgsDone, outDir, dbCur) + if success: + print('Processing images from eol and enwiki') + processImgs(imgListFile, eolImgDir, eolImgDb, enwikiImgDb, nodesDone, imgsDone, outDir, dbCur) + # Close db + dbCon.commit() + dbCon.close() +def processPickedImgs( + pickedImgsDir: str, pickedImgsFile: str, nodesDone: set[str], imgsDone: set[ImgId], + outDir: str, dbCur: sqlite3.Cursor) -> bool: + """ Converts picked-images and updates db, returning False upon interruption or failure """ + # Read picked-image data + nodeToPickedImg: dict[str, PickedImg] = {} + if os.path.exists(os.path.join(pickedImgsDir, pickedImgsFile)): + with open(os.path.join(pickedImgsDir, pickedImgsFile)) as file: + for lineNum, line in enumerate(file, 1): + filename, url, license, artist, credit = line.rstrip().split('|') + nodeName = os.path.splitext(filename)[0] # Remove extension + (otolId,) = dbCur.execute('SELECT id FROM nodes WHERE name = ?', (nodeName,)).fetchone() + nodeToPickedImg[otolId] = PickedImg(nodeName, lineNum, filename, url, license, artist, credit) + # Set SIGINT handler + interrupted = False + def onSigint(sig, frame): + nonlocal interrupted + interrupted = True + signal.signal(signal.SIGINT, onSigint) + # Convert images + for otolId, imgData in nodeToPickedImg.items(): + # Check for SIGINT event + if interrupted: + print('Exiting') + return False + # Skip if already processed + if otolId in nodesDone: + continue + # Convert image + success = convertImage(os.path.join(pickedImgsDir, imgData.filename), os.path.join(outDir, otolId + '.jpg')) + if not success: + return False + # Add entry to db + if (imgData.id, 'picked') not in imgsDone: + dbCur.execute('INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)', + (imgData.id, 'picked', imgData.url, imgData.license, imgData.artist, imgData.credit)) + 
+			imgsDone.add((imgData.id, 'picked'))
+		dbCur.execute('INSERT INTO node_imgs VALUES (?, ?, ?)', (imgData.nodeName, imgData.id, 'picked'))
+		nodesDone.add(otolId)
+	return True
+def processImgs(
+		imgListFile: str, eolImgDir: str, eolImgDb: str, enwikiImgDb: str,
+		nodesDone: set[str], imgsDone: set[ImgId], outDir: str, dbCur: sqlite3.Cursor) -> bool:
+	""" Converts EOL and enwiki images, and updates db, returning False upon interruption or failure """
+	eolCon = sqlite3.connect(eolImgDb)
+	eolCur = eolCon.cursor()
+	enwikiCon = sqlite3.connect(enwikiImgDb)
+	enwikiCur = enwikiCon.cursor()
+	# Set SIGINT handler
+	interrupted = False
+	def onSigint(sig, frame):
+		nonlocal interrupted
+		interrupted = True
+	signal.signal(signal.SIGINT, onSigint)
+	# Convert images
+	flag = False  # Set to True upon interruption or failure
+	with open(imgListFile) as file:
+		for line in file:
+			# Check for SIGINT event
+			if interrupted:
+				print('Exiting')
+				flag = True
+				break
+			# Skip lines without an image path
+			if line.find(' ') == -1:
+				continue
+			# Get filenames
+			otolId, _, imgPath = line.rstrip().partition(' ')
+			# Skip if already processed
+			if otolId in nodesDone:
+				continue
+			# Convert image
+			success = convertImage(imgPath, os.path.join(outDir, otolId + '.jpg'))
+			if not success:
+				flag = True
+				break
+			# Add entry to db
+			(nodeName,) = dbCur.execute('SELECT name FROM nodes WHERE id = ?', (otolId,)).fetchone()
+			fromEol = imgPath.startswith(eolImgDir)
+			imgName = os.path.basename(os.path.normpath(imgPath))  # Get last path component
+			imgName = os.path.splitext(imgName)[0]  # Remove extension
+			if fromEol:
+				eolIdStr, _, contentIdStr = imgName.partition(' ')
+				eolId, contentId = int(eolIdStr), int(contentIdStr)
+				if (eolId, 'eol') not in imgsDone:
+					query = 'SELECT source_url, license, copyright_owner FROM images WHERE content_id = ?'
+ row = eolCur.execute(query, (contentId,)).fetchone() + if row is None: + print(f'ERROR: No image record for EOL ID {eolId}, content ID {contentId}') + flag = True + break + url, license, owner = row + dbCur.execute('INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)', + (eolId, 'eol', url, license, owner, '')) + imgsDone.add((eolId, 'eol')) + dbCur.execute('INSERT INTO node_imgs VALUES (?, ?, ?)', (nodeName, eolId, 'eol')) + else: + enwikiId = int(imgName) + if (enwikiId, 'enwiki') not in imgsDone: + query = 'SELECT name, license, artist, credit FROM' \ + ' page_imgs INNER JOIN imgs ON page_imgs.img_name = imgs.name' \ + ' WHERE page_imgs.page_id = ?' + row = enwikiCur.execute(query, (enwikiId,)).fetchone() + if row is None: + print(f'ERROR: No image record for enwiki ID {enwikiId}') + flag = True + break + name, license, artist, credit = row + url = 'https://en.wikipedia.org/wiki/File:' + urllib.parse.quote(name) + dbCur.execute('INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)', + (enwikiId, 'enwiki', url, license, artist, credit)) + imgsDone.add((enwikiId, 'enwiki')) + dbCur.execute('INSERT INTO node_imgs VALUES (?, ?, ?)', (nodeName, enwikiId, 'enwiki')) + eolCon.close() + enwikiCon.close() + return not flag +def convertImage(imgPath: str, outPath: str): + print(f'Converting {imgPath} to {outPath}') + if os.path.exists(outPath): + print('ERROR: Output image already exists') + return False + try: + completedProcess = subprocess.run( + ['npx', 'smartcrop-cli', '--width', str(IMG_OUT_SZ), '--height', str(IMG_OUT_SZ), imgPath, outPath], + stdout=subprocess.DEVNULL + ) + except Exception as e: + print(f'ERROR: Exception while attempting to run smartcrop: {e}') + return False + if completedProcess.returncode != 0: + print(f'ERROR: smartcrop had exit status {completedProcess.returncode}') + return False + return True + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + 
+	parser.parse_args()
+	#
+	genImgs(IMG_LIST_FILE, EOL_IMG_DIR, OUT_DIR, EOL_IMG_DB, ENWIKI_IMG_DB, PICKED_IMGS_DIR, PICKED_IMGS_FILE, DB_FILE)
diff --git a/backend/tol_data/gen_linked_imgs.py b/backend/tol_data/gen_linked_imgs.py
new file mode 100755
index 0000000..7002e92
--- /dev/null
+++ b/backend/tol_data/gen_linked_imgs.py
@@ -0,0 +1,117 @@
+#!/usr/bin/python3
+
+"""
+Looks for nodes without images in the database, and tries to
+associate them with images from their children.
+"""
+
+import re
+import sqlite3

+DB_FILE = 'data.db'
+#
+COMPOUND_NAME_REGEX = re.compile(r'\[(.+) \+ (.+)]')
+UP_PROPAGATE_COMPOUND_IMGS = False
+
+def genData(dbFile: str) -> None:
+	print('Opening database')
+	dbCon = sqlite3.connect(dbFile)
+	dbCur = dbCon.cursor()
+	dbCur.execute('CREATE TABLE linked_imgs (name TEXT PRIMARY KEY, otol_ids TEXT)')
+	#
+	print('Getting nodes with images')
+	nodeToUsedId: dict[str, str] = {}  # Maps name of node to otol ID of node to use image for
+	query = 'SELECT nodes.name, nodes.id FROM nodes INNER JOIN node_imgs ON nodes.name = node_imgs.name'
+	for name, otolId in dbCur.execute(query):
+		nodeToUsedId[name] = otolId
+	print(f'Found {len(nodeToUsedId)}')
+	#
+	print('Getting node depths')
+	nodeToDepth: dict[str, int] = {}
+	maxDepth = 0
+	nodeToParent: dict[str, str | None] = {}  # Maps name of node to name of parent
+	for nodeName in nodeToUsedId.keys():
+		nodeChain = [nodeName]
+		lastDepth = 0
+		# Add ancestors
+		while True:
+			row = dbCur.execute('SELECT parent FROM edges WHERE child = ?', (nodeName,)).fetchone()
+			if row is None:
+				nodeToParent[nodeName] = None
+				break
+			nodeToParent[nodeName] = row[0]
+			nodeName = row[0]
+			nodeChain.append(nodeName)
+			if nodeName in nodeToDepth:
+				lastDepth = nodeToDepth[nodeName]
+				break
+		# Add depths
+		for i in range(len(nodeChain)):
+			nodeToDepth[nodeChain[-i-1]] = i + lastDepth
+		maxDepth = max(maxDepth, lastDepth + len(nodeChain) - 1)
+	#
+	print('Finding ancestors to give linked images')
+	depthToNodes: 
dict[int, list[str]] = {depth: [] for depth in range(maxDepth + 1)} + for nodeName, depth in nodeToDepth.items(): + depthToNodes[depth].append(nodeName) + parentToCandidate: dict[str, tuple[str, int]] = {} # Maps parent node name to candidate child name and tips-val + iterNum = 0 + for depth in range(maxDepth, -1, -1): + for node in depthToNodes[depth]: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') + # + if node in parentToCandidate: + nodeToUsedId[node] = nodeToUsedId[parentToCandidate[node][0]] + dbCur.execute('INSERT INTO linked_imgs VALUES (?, ?)', (node, nodeToUsedId[node])) + parent = nodeToParent[node] + if parent is not None and parent not in nodeToUsedId: + (tips,) = dbCur.execute('SELECT tips FROM nodes WHERE name == ?', (node,)).fetchone() + if parent not in parentToCandidate or parentToCandidate[parent][1] < tips: + parentToCandidate[parent] = (node, tips) + # + print('Replacing linked-images for compound nodes') + for iterNum, node in enumerate(parentToCandidate.keys(), 1): + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') + # + match = COMPOUND_NAME_REGEX.fullmatch(node) + if match is not None: + # Replace associated image with subname images + subName1, subName2 = match.group(1,2) + otolIdPair = ['', ''] + if subName1 in nodeToUsedId: + otolIdPair[0] = nodeToUsedId[subName1] + if subName2 in nodeToUsedId: + otolIdPair[1] = nodeToUsedId[subName2] + # Use no image if both subimages not found + if otolIdPair[0] == '' and otolIdPair[1] == '': + dbCur.execute('DELETE FROM linked_imgs WHERE name = ?', (node,)) + continue + # Add to db + dbCur.execute('UPDATE linked_imgs SET otol_ids = ? 
WHERE name = ?', (','.join(otolIdPair), node)) + # Possibly repeat operation upon parent/ancestors + if UP_PROPAGATE_COMPOUND_IMGS: + while True: + parent = nodeToParent[node] + if parent is not None: + (tips,) = dbCur.execute('SELECT tips from nodes WHERE name = ?', (node,)).fetchone() + if parent in parentToCandidate and parentToCandidate[parent][1] <= tips: + # Replace associated image + dbCur.execute( + 'UPDATE linked_imgs SET otol_ids = ? WHERE name = ?', (','.join(otolIdPair), parent)) + node = parent + continue + break + # + print('Closing database') + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(DB_FILE) diff --git a/backend/tol_data/gen_mapping_data.py b/backend/tol_data/gen_mapping_data.py new file mode 100755 index 0000000..95e930b --- /dev/null +++ b/backend/tol_data/gen_mapping_data.py @@ -0,0 +1,271 @@ +#!/usr/bin/python3 + +""" +Maps otol IDs to EOL and enwiki titles, using IDs from various +other sources (like NCBI). + +Reads otol taxonomy data to get source IDs for otol IDs, +then looks up those IDs in an EOL provider_ids file, +and in a wikidata dump, and stores results in the database. + +Based on code from https://github.com/OneZoom/OZtree, located in +OZprivate/ServerScripts/TaxonMappingAndPopularity/ (22 Aug 2022). 
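As an illustration of the taxonomy-file format described above, a single line
can be parsed like this (a sketch of the field handling only; the sample line
in the test is constructed from the docstring's examples):

```python
OTOL_SRCS = ['ncbi', 'if', 'worms', 'irmng', 'gbif']  # Earlier sources get higher priority

def parseTaxonomyLine(line: str) -> tuple[int, dict[str, int]]:
	""" Parse one taxonomy.tsv line into an otol ID and a {source: source ID} map """
	# Fields are separated by tab-pipe-tab: uid, parent_uid, name, rank, sourceinfo, ...
	fields = line.split('\t|\t')
	otolId = int(fields[0])
	srcIds: dict[str, int] = {}
	for srcPair in fields[4].split(','):  # sourceinfo, eg: 'ncbi:2952,gbif:3207147'
		src, _, srcIdStr = srcPair.partition(':')
		if src in OTOL_SRCS and srcIdStr.isdecimal() and src not in srcIds:
			srcIds[src] = int(srcIdStr)
	return otolId, srcIds
```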
+""" + +import os +from collections import defaultdict +import gzip, csv, sqlite3 + +TAXONOMY_FILE = os.path.join('otol', 'taxonomy.tsv') +EOL_IDS_FILE = os.path.join('eol', 'provider_ids.csv.gz') +WIKIDATA_DB = os.path.join('wikidata', 'taxon_srcs.db') +ENWIKI_DUMP_INDEX_DB = os.path.join('enwiki', 'dumpIndex.db') +PICKED_MAPPINGS = { + 'eol': ['picked_eol_ids.txt'], + 'enwiki': ['picked_wiki_ids.txt', 'picked_wiki_ids_rough.txt'] +} +DB_FILE = 'data.db' + +OTOL_SRCS = ['ncbi', 'if', 'worms', 'irmng', 'gbif'] # Earlier sources will get higher priority +EOL_SRCS = {676: 'ncbi', 459: 'worms', 767: 'gbif'} # Maps external-source int-identifiers to names + +def genData( + taxonomyFile: str, + eolIdsFile: str, + wikidataDb: str, + pickedMappings: dict[str, list[str]], + enwikiDumpIndexDb: str, + dbFile: str) -> None: + """ Reads the files and enwiki db and creates the db """ + nodeToSrcIds: dict[int, dict[str, int]] = {} # Maps otol ID to {src1: id1, src2: id2, ...} + usedSrcIds: set[tuple[str, int]] = set() # {(src1, id1), ...} (used to avoid storing IDs that won't be used) + nodeToEolId: dict[int, int] = {} # Maps otol ID to eol ID + nodeToWikiTitle: dict[int, str] = {} # Maps otol ID to wikipedia title + titleToIucnStatus: dict[str, str] = {} # Maps wikipedia title to IUCN string + titleToPageId: dict[str, int] = {} # Maps wikipedia title to page ID + # Get mappings from data input + readTaxonomyFile(taxonomyFile, nodeToSrcIds, usedSrcIds) + readEolIdsFile(eolIdsFile, nodeToSrcIds, usedSrcIds, nodeToEolId) + readWikidataDb(wikidataDb, nodeToSrcIds, usedSrcIds, nodeToWikiTitle, titleToIucnStatus, nodeToEolId) + readPickedMappings(pickedMappings, nodeToEolId, nodeToWikiTitle) + getEnwikiPageIds(enwikiDumpIndexDb, nodeToWikiTitle, titleToPageId) + # + print('Writing to db') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + # Get otol id-to-name map + otolIdToName: dict[int, str] = {} + for nodeName, nodeId in dbCur.execute('SELECT name, id from nodes'): + if 
nodeId.startswith('ott'): + otolIdToName[int(nodeId[3:])] = nodeName + # Add eol mappings + dbCur.execute('CREATE TABLE eol_ids (name TEXT PRIMARY KEY, id INT)') + dbCur.execute('CREATE INDEX eol_id_idx ON eol_ids(id)') + for otolId, eolId in nodeToEolId.items(): + if otolId in otolIdToName: + dbCur.execute('INSERT INTO eol_ids VALUES (?, ?)', (otolIdToName[otolId], eolId)) + # Add enwiki mappings + dbCur.execute('CREATE TABLE wiki_ids (name TEXT PRIMARY KEY, id INT)') + dbCur.execute('CREATE INDEX wiki_id_idx ON wiki_ids(id)') + dbCur.execute('CREATE TABLE node_iucn (name TEXT PRIMARY KEY, iucn TEXT)') + for otolId, title in nodeToWikiTitle.items(): + if otolId in otolIdToName and title in titleToPageId: + dbCur.execute('INSERT INTO wiki_ids VALUES (?, ?)', (otolIdToName[otolId], titleToPageId[title])) + if title in titleToIucnStatus: + dbCur.execute('INSERT INTO node_iucn VALUES (?, ?)', (otolIdToName[otolId], titleToIucnStatus[title])) + dbCon.commit() + dbCon.close() +def readTaxonomyFile( + taxonomyFile: str, + nodeToSrcIds: dict[int, dict[str, int]], + usedSrcIds: set[tuple[str, int]]) -> None: + """ Reads taxonomy file, and maps OTOL node IDs to external-source IDs """ + # The file has a header line, then lines that hold these fields (each is followed by a tab-pipe-tab sequence): + # uid (otol-id, eg: 93302), parent_uid, name, rank, + # sourceinfo (comma-separated source specifiers, eg: ncbi:2952,gbif:3207147), uniqueName, flags + print('Reading taxonomy file') + with open(taxonomyFile) as file: # Had about 4.5e6 lines + for lineNum, line in enumerate(file, 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') + # Skip header line + if lineNum == 1: + continue + # Parse line + fields = line.split('\t|\t') + try: + otolId = int(fields[0]) + except ValueError: + print(f'Skipping non-integral ID {fields[0]} on line {lineNum}') + continue + srcsField = fields[4] + # Add source IDs + for srcPair in srcsField.split(','): + src, srcIdStr = srcPair.split(':', 
1) + if srcIdStr.isdecimal() and src in OTOL_SRCS: + if otolId not in nodeToSrcIds: + nodeToSrcIds[otolId] = {} + elif src in nodeToSrcIds[otolId]: + continue + srcId = int(srcIdStr) + nodeToSrcIds[otolId][src] = srcId + usedSrcIds.add((src, srcId)) + print(f'- Result has {sum([len(v) for v in nodeToSrcIds.values()]):,} entries') # Was about 6.7e6 +def readEolIdsFile( + eolIdsFile: str, + nodeToSrcIds: dict[int, dict[str, int]], + usedSrcIds: set[tuple[str, int]], + nodeToEolId: dict[int, int]) -> None: + """ Reads EOL provider IDs file, and maps EOL IDs to external-source IDs """ + # The file is a CSV with a header line, then lines that hold these fields: + # node_id, resource_pk (ID from external source), resource_id (int denoting external-source), + # page_id (eol ID), preferred_canonical_for_page + print('Reading EOL provider IDs file') + srcToEolId: dict[str, dict[int, int]] = {src: {} for src in EOL_SRCS.values()} # Maps src1 to {id1: eolId1, ...} + with gzip.open(eolIdsFile, mode='rt') as file: # Had about 13e6 lines + for lineNum, row in enumerate(csv.reader(file), 1): + if lineNum % 1e6 == 0: + print(f'At line {lineNum}') + # Skip header line + if lineNum == 1: + continue + # Parse line + eolId = int(row[3]) + srcInt = int(row[2]) + srcIdStr = row[1] + if srcIdStr.isdecimal() and srcInt in EOL_SRCS: + srcId = int(srcIdStr) + src = EOL_SRCS[srcInt] + if (src, srcId) not in usedSrcIds: + continue + if srcId in srcToEolId[src]: + print(f'Found {src} ID {srcId} with multiple EOL IDs {srcToEolId[src][srcId]} and {eolId}') + continue + srcToEolId[src][srcId] = eolId + print(f'- Result has {sum([len(v) for v in srcToEolId.values()]):,} entries') + # Was about 3.5e6 (4.2e6 without usedSrcIds) + # + print('Resolving candidate EOL IDs') + # For each otol ID, find eol IDs with matching sources, and choose the 'best' one + for otolId, srcInfo in nodeToSrcIds.items(): + eolIdToCount: dict[int, int] = defaultdict(int) + for src, srcId in srcInfo.items(): + if src in 
srcToEolId and srcId in srcToEolId[src]: + eolId = srcToEolId[src][srcId] + eolIdToCount[eolId] += 1 + if len(eolIdToCount) == 1: + nodeToEolId[otolId] = list(eolIdToCount)[0] + elif len(eolIdToCount) > 1: + # For multiple candidates, prefer those with most sources, and break ties by picking the lowest + maxCount = max(eolIdToCount.values()) + eolIds = [eolId for eolId, count in eolIdToCount.items() if count == maxCount] + nodeToEolId[otolId] = min(eolIds) + print(f'- Result has {len(nodeToEolId):,} entries') # Was about 2.7e6 +def readWikidataDb( + wikidataDb: str, + nodeToSrcIds: dict[int, dict[str, int]], + usedSrcIds: set[tuple[str, int]], + nodeToWikiTitle: dict[int, str], + titleToIucnStatus: dict[str, str], + nodeToEolId: dict[int, int]) -> None: + """ Reads db holding ID and IUCN mappings from wikidata, and maps otol IDs to Wikipedia titles and EOL IDs """ + print('Reading from Wikidata db') + srcToWikiTitle: dict[str, dict[int, str]] = defaultdict(dict) # Maps 'eol'/etc to {srcId1: title1, ...} + wikiTitles = set() + dbCon = sqlite3.connect(wikidataDb) + dbCur = dbCon.cursor() + for src, srcId, title in dbCur.execute('SELECT src, id, title from src_id_to_title'): + if (src, srcId) in usedSrcIds or src == 'eol': # Keep EOL IDs for later use + srcToWikiTitle[src][srcId] = title + wikiTitles.add(title) + for title, status in dbCur.execute('SELECT title, status from title_iucn'): + if title in wikiTitles: + titleToIucnStatus[title] = status + print(f'- Source-to-title map has {sum([len(v) for v in srcToWikiTitle.values()]):,} entries') + # Was about 1.1e6 (1.2e6 without usedSrcIds) + print(f'- IUCN map has {len(titleToIucnStatus):,} entries') # Was about 7e4 (7.2e4 without usedSrcIds) + dbCon.close() + # + print('Resolving candidate Wikidata items') + # For each otol ID, find wikidata titles with matching sources, and choose the 'best' one + for otolId, srcInfo in nodeToSrcIds.items(): + titleToSrcs: dict[str, list[str]] = defaultdict(list) # Maps candidate 
titles to list of sources + for src, srcId in srcInfo.items(): + if src in srcToWikiTitle and srcId in srcToWikiTitle[src]: + title = srcToWikiTitle[src][srcId] + titleToSrcs[title].append(src) + # Choose title to use + if len(titleToSrcs) == 1: + nodeToWikiTitle[otolId] = list(titleToSrcs)[0] + elif len(titleToSrcs) > 1: # Test example: otol ID 621052 + # Get titles with most sources + maxSrcCnt = max([len(srcs) for srcs in titleToSrcs.values()]) + titleToSrcs = {t: s for t, s in titleToSrcs.items() if len(s) == maxSrcCnt} + if len(titleToSrcs) == 1: + nodeToWikiTitle[otolId] = list(titleToSrcs)[0] + else: + # Get a title with a source with highest priority + srcToTitle = {s: t for t in titleToSrcs for s in titleToSrcs[t]} + for src in OTOL_SRCS: + if src in srcToTitle: + nodeToWikiTitle[otolId] = srcToTitle[src] + break + print(f'- Result has {len(nodeToWikiTitle):,} entries') # Was about 4e5 + # + print('Adding extra EOL mappings from Wikidata') + wikiTitleToNode = {title: node for node, title in nodeToWikiTitle.items()} + addedEntries: dict[int, int] = {} + for eolId, title in srcToWikiTitle['eol'].items(): + if title in wikiTitleToNode: + otolId = wikiTitleToNode[title] + if otolId not in nodeToEolId: # Only add if the otol ID has no EOL ID + nodeToEolId[otolId] = eolId + addedEntries[otolId] = eolId + print(f'- Added {len(addedEntries):,} entries') # Was about 3e3 +def readPickedMappings( + pickedMappings: dict[str, list[str]], + nodeToEolId: dict[int, int], + nodeToWikiTitle: dict[int, str]) -> None: + """ Read mappings from OTOL IDs to EOL IDs and Wikipedia titles """ + print('Reading picked mappings') + for src in pickedMappings: + for filename in pickedMappings[src]: + if not os.path.exists(filename): + continue + with open(filename) as file: + for line in file: + otolIdStr, mappedVal = line.rstrip().split('|') + otolId = int(otolIdStr) + if src == 'eol': + if mappedVal: + nodeToEolId[otolId] = int(mappedVal) + else: + if otolId in nodeToEolId: + del 
nodeToEolId[otolId] + else: # src == 'enwiki' + if mappedVal: + nodeToWikiTitle[otolId] = mappedVal + else: + if otolId in nodeToWikiTitle: + del nodeToWikiTitle[otolId] +def getEnwikiPageIds(enwikiDumpIndexDb: str, nodeToWikiTitle: dict[int, str], titleToPageId: dict[str, int]) -> None: + """ Read a db for mappings from enwiki titles to page IDs """ + print('Getting enwiki page IDs') + numNotFound = 0 + dbCon = sqlite3.connect(enwikiDumpIndexDb) + dbCur = dbCon.cursor() + for title in nodeToWikiTitle.values(): + record = dbCur.execute('SELECT id FROM offsets WHERE title = ?', (title,)).fetchone() + if record != None: + titleToPageId[title] = record[0] + else: + numNotFound += 1 + dbCon.close() + print(f'Unable to find IDs for {numNotFound} titles') # Was 2913 + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + genData(TAXONOMY_FILE, EOL_IDS_FILE, WIKIDATA_DB, PICKED_MAPPINGS, ENWIKI_DUMP_INDEX_DB, DB_FILE) diff --git a/backend/tol_data/gen_name_data.py b/backend/tol_data/gen_name_data.py new file mode 100755 index 0000000..2e92c20 --- /dev/null +++ b/backend/tol_data/gen_name_data.py @@ -0,0 +1,128 @@ +#!/usr/bin/python3 + +""" +Maps nodes to vernacular names, using data from EOL, enwiki, and a +picked-names file, and stores results in the database. 
+""" + +import re, os +import html, csv, sqlite3 + +EOL_NAMES_FILE = os.path.join('eol', 'vernacularNames.csv') +ENWIKI_DB = os.path.join('enwiki', 'desc_data.db') +PICKED_NAMES_FILE = 'picked_names.txt' +DB_FILE = 'data.db' + +def genData(eolNamesFile: str, enwikiDb: str, pickedNamesFile: str, dbFile: str) -> None: + """ Reads the files and adds to db """ + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + # + print('Creating table') + dbCur.execute('CREATE TABLE names(name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name))') + dbCur.execute('CREATE INDEX names_idx ON names(name)') + dbCur.execute('CREATE INDEX names_alt_idx ON names(alt_name)') + dbCur.execute('CREATE INDEX names_alt_idx_nc ON names(alt_name COLLATE NOCASE)') + # + print('Getting node mappings') + nodeToTips: dict[str, int] = {} + for name, tips in dbCur.execute('SELECT name, tips from nodes'): + nodeToTips[name] = tips + # + addEolNames(eolNamesFile, nodeToTips, dbCur) + addEnwikiNames(enwikiDb, nodeToTips, dbCur) + addPickedNames(pickedNamesFile, nodeToTips, dbCur) + # + print('Closing database') + dbCon.commit() + dbCon.close() +def addEolNames(eolNamesFile: str, nodeToTips: dict[str, int], dbCur: sqlite3.Cursor) -> None: + """ Reads EOL names, associates them with otol nodes, and writes to db """ + # The CSV file has a header line, then lines with these fields: + # page_id, canonical_form (canonical name, not always unique to page ID), + # vernacular_string (vernacular name), language_code, + # resource_name, is_preferred_by_resource, is_preferred_by_eol + print('Getting EOL mappings') + eolIdToNode: dict[int, str] = {} # Maps eol ID to node name (if there are multiple, choose one with most tips) + for name, eolId in dbCur.execute('SELECT name, id from eol_ids'): + if eolId not in eolIdToNode or nodeToTips[eolIdToNode[eolId]] < nodeToTips[name]: + eolIdToNode[eolId] = name + print('Adding names from EOL') + namesToSkip = {'unknown', 'unknown species', 
'unidentified species'} + with open(eolNamesFile, newline='') as file: + for lineNum, fields in enumerate(csv.reader(file), 1): + if lineNum % 1e5 == 0: + print(f'At line {lineNum}') # Reached about 2.8e6 + # Skip header line + if lineNum == 1: + continue + # Parse line + eolId = int(fields[0]) + name = html.unescape(fields[2]).lower() + lang = fields[3] + isPreferred = 1 if fields[6] == 'preferred' else 0 + # Add to db + if eolId in eolIdToNode and name not in namesToSkip and name not in nodeToTips \ + and lang == 'eng' and len(name.split(' ')) <= 3: # Ignore names with >3 words + cmd = 'INSERT OR IGNORE INTO names VALUES (?, ?, ?, \'eol\')' + # The 'OR IGNORE' accounts for duplicate lines + dbCur.execute(cmd, (eolIdToNode[eolId], name, isPreferred)) +def addEnwikiNames(enwikiDb: str, nodeToTips: dict[str, int], dbCur: sqlite3.Cursor) -> None: + """ Reads enwiki names, associates them with otol nodes, and writes to db """ + print('Getting enwiki mappings') + wikiIdToNode: dict[int, str] = {} + for name, wikiId in dbCur.execute('SELECT name, id from wiki_ids'): + if wikiId not in wikiIdToNode or nodeToTips[wikiIdToNode[wikiId]] < nodeToTips[name]: + wikiIdToNode[wikiId] = name + print('Adding names from enwiki') + altNameRegex = re.compile(r'[a-z]+') # Avoids names like 'evolution of elephants', 'banana fiber', 'fish (zoology)', + enwikiCon = sqlite3.connect(enwikiDb) + enwikiCur = enwikiCon.cursor() + iterNum = 0 + for wikiId, nodeName in wikiIdToNode.items(): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') # Reached about 3.6e5 + # + query = 'SELECT p1.title FROM pages p1' \ + ' INNER JOIN redirects r1 ON p1.id = r1.id' \ + ' INNER JOIN pages p2 ON r1.target = p2.title WHERE p2.id = ?' 
+ for (name,) in enwikiCur.execute(query, (wikiId,)): + name = name.lower() + if altNameRegex.fullmatch(name) is not None and name != nodeName and name not in nodeToTips: + dbCur.execute('INSERT OR IGNORE INTO names VALUES (?, ?, ?, \'enwiki\')', (nodeName, name, 0)) +def addPickedNames(pickedNamesFile: str, nodeToTips: dict[str, int], dbCur: sqlite3.Cursor) -> None: + # File format: + # nodename1|altName1|isPreferred1 -> Add an alt-name + # nodename1|altName1| -> Remove an alt-name + # nodename1|nodeName1| -> Remove any preferred-alt status + if os.path.exists(pickedNamesFile): + print('Getting picked names') + with open(pickedNamesFile) as file: + for line in file: + nodeName, altName, isPreferredStr = line.lower().rstrip().split('|') + if nodeName not in nodeToTips: + print(f'Skipping "{nodeName}", as no such node exists') + continue + if isPreferredStr: + isPreferred = 1 if isPreferredStr == '1' else 0 + if isPreferred == 1: + # Remove any existing preferred-alt status + cmd = 'UPDATE names SET pref_alt = 0 WHERE name = ? AND alt_name = ? AND pref_alt = 1' + dbCur.execute(cmd, (nodeName, altName)) + # Remove any existing record + dbCur.execute('DELETE FROM names WHERE name = ? AND alt_name = ?', (nodeName, altName)) + # Add record + dbCur.execute('INSERT INTO names VALUES (?, ?, ?, "picked")', (nodeName, altName, isPreferred)) + elif nodeName != altName: # Remove any matching record + dbCur.execute('DELETE FROM names WHERE name = ? AND alt_name = ?', (nodeName, altName)) + else: # Remove any preferred-alt status + cmd = 'UPDATE names SET pref_alt = 0 WHERE name = ? 
AND pref_alt = 1' + dbCur.execute(cmd, (nodeName,)) + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + genData(EOL_NAMES_FILE, ENWIKI_DB, PICKED_NAMES_FILE, DB_FILE) diff --git a/backend/tol_data/gen_otol_data.py b/backend/tol_data/gen_otol_data.py new file mode 100755 index 0000000..eba8779 --- /dev/null +++ b/backend/tol_data/gen_otol_data.py @@ -0,0 +1,267 @@ +#!/usr/bin/python3 + +""" +Reads files describing a tree-of-life from an 'Open Tree of Life' release, +and stores tree info in a database. + +Reads a labelled_supertree_ottnames.tre file, which is assumed to have this format: + The tree-of-life is represented in Newick format, which looks like: (n1,n2,(n3,n4)n5)n6 + The root node is named n6, and has children n1, n2, and n5. + Name examples include: Homo_sapiens_ott770315, mrcaott6ott22687, 'Oxalis san-miguelii ott5748753'. + 'ott770315' and 'mrcaott6ott22687' are node IDs. The latter is for a 'compound node'. + The node with ID 'ott770315' will get the name 'homo sapiens'. + A compound node will get a name composed from its sub-nodes (eg: [name1 + name2]). + It is possible for multiple nodes to have the same name. + In these cases, extra nodes will be named sequentially, as 'name1 [2]', 'name1 [3]', etc. +Reads an annotations.json file, which is assumed to have this format: + Holds a JSON object, whose 'nodes' property maps node IDs to objects holding information about that node, + such as the properties 'supported_by' and 'conflicts_with', which list phylogenetic trees that + support/conflict with the node's placement. +Reads from a picked-names file, if present, which specifies name and node ID pairs. + These help resolve cases where multiple nodes share the same name.
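The node-label formats described above can be illustrated with a small standalone splitter. This is a simplified sketch (the script's own `parseNewickName` additionally handles streams and escaped quotes):

```python
import re

def splitLabel(label: str) -> tuple[str, str]:
    """Split an ottnames-style label into a lowercase display name and an 'ott' ID"""
    match = re.fullmatch(r"'(.+) (ott\d+)'", label)  # Quoted form
    if match is not None:
        return (match.group(1).lower(), match.group(2))
    match = re.fullmatch(r'(.+)_(ott\d+)', label)  # Underscore form
    if match is None:
        raise ValueError(f'Invalid label: {label}')
    return (match.group(1).replace('_', ' ').lower(), match.group(2))

print(splitLabel('Homo_sapiens_ott770315'))            # ('homo sapiens', 'ott770315')
print(splitLabel("'Oxalis san-miguelii ott5748753'"))  # ('oxalis san-miguelii', 'ott5748753')
```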
+""" + +import re, os +import json, sqlite3 + +TREE_FILE = os.path.join('otol', 'labelled_supertree_ottnames.tre') # Had about 2.5e9 nodes +ANN_FILE = os.path.join('otol', 'annotations.json') +DB_FILE = 'data.db' +PICKED_NAMES_FILE = 'picked_otol_names.txt' + +class Node: + """ Represents a tree-of-life node """ + def __init__(self, name, childIds, parentId, tips, pSupport): + self.name = name + self.childIds = childIds + self.parentId = parentId + self.tips = tips + self.pSupport = pSupport +class BasicStream: + """ Represents a basic data stream, using a string and index. Used for parsing text with lookahead. """ + def __init__(self, data, idx=0): + self.data = data + self.idx = idx + def hasNext(self) -> bool: + return self.idx < len(self.data) + def next(self) -> str: + if self.hasNext(): + char = self.data[self.idx] + self.idx += 1 + return char; + else: + return ''; + def peek(self) -> str: + if self.hasNext(): + return self.data[self.idx] + else: + return ''; + def skipWhitespace(self) -> None: + while self.hasNext() and self.data[self.idx].isspace(): + self.idx += 1 + def progress(self) -> float: + return (self.idx / len(self.data)) + +def genData(treeFile: str, annFile: str, pickedNamesFile: str, dbFile: str) -> None: + """ Reads the files and stores the tree info """ + nodeMap: dict[str, Node] = {} # Maps node IDs to node objects + nameToFirstId: dict[str, str] = {} # Maps node names to first found ID (names might have multiple IDs) + dupNameToIds: dict[str, list[str]] = {} # Maps names of nodes with multiple IDs to those IDs + # + print('Parsing tree file') + treeStream: BasicStream + with open(treeFile) as file: + treeStream = BasicStream(file.read()) + # Parse content + parseNewick(treeStream, nodeMap, nameToFirstId, dupNameToIds) + print('Resolving duplicate names') + # Read picked-names file + nameToPickedId: dict[str, str] = {} + if os.path.exists(pickedNamesFile): + with open(pickedNamesFile) as file: + for line in file: + name, _, otolId = 
line.strip().partition('|') + nameToPickedId[name] = otolId + # Resolve duplicates + for dupName, ids in dupNameToIds.items(): + # Check for picked id + if dupName in nameToPickedId: + idToUse = nameToPickedId[dupName] + else: + # Get conflicting node with most tips + tipNums = [nodeMap[id].tips for id in ids] + maxIdx = tipNums.index(max(tipNums)) + idToUse = ids[maxIdx] + # Adjust name of other conflicting nodes + counter = 2 + for id in ids: + if id != idToUse: + nodeMap[id].name += f' [{counter}]' + counter += 1 + print('Changing mrca* names') + for id, node in nodeMap.items(): + if node.name.startswith('mrca'): + convertMrcaName(id, nodeMap) + print('Parsing annotations file') + # Read file + with open(annFile) as file: + data = file.read() + obj = json.loads(data) + nodeAnnsMap = obj['nodes'] + # Find relevant annotations + for id, node in nodeMap.items(): + # Set has-support value using annotations + if id in nodeAnnsMap: + nodeAnns = nodeAnnsMap[id] + supportQty = len(nodeAnns['supported_by']) if 'supported_by' in nodeAnns else 0 + conflictQty = len(nodeAnns['conflicts_with']) if 'conflicts_with' in nodeAnns else 0 + node.pSupport = supportQty > 0 and conflictQty == 0 + print('Creating nodes and edges tables') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)') + dbCur.execute('CREATE INDEX nodes_idx_nc ON nodes(name COLLATE NOCASE)') + dbCur.execute('CREATE TABLE edges (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))') + dbCur.execute('CREATE INDEX edges_child_idx ON edges(child)') + for otolId, node in nodeMap.items(): + dbCur.execute('INSERT INTO nodes VALUES (?, ?, ?)', (node.name, otolId, node.tips)) + for childId in node.childIds: + childNode = nodeMap[childId] + dbCur.execute('INSERT INTO edges VALUES (?, ?, ?)', + (node.name, childNode.name, 1 if childNode.pSupport else 0)) + print('Closing database') + dbCon.commit() + dbCon.close() 
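The duplicate-resolution step in genData can be distilled into a standalone helper. `renameDuplicates` is a hypothetical illustration, not part of the script: the duplicate with the most tips keeps the plain name, and the rest are suffixed ' [2]', ' [3]', and so on:

```python
def renameDuplicates(idToName: dict, idToTips: dict, dupIds: list) -> None:
    """Keep the duplicate with the most tips; rename the others 'name [2]', ..."""
    # max() keeps the first maximum on ties, like the list.index() approach above
    idToUse = max(dupIds, key=lambda i: idToTips[i])
    counter = 2
    for i in dupIds:
        if i != idToUse:
            idToName[i] += f' [{counter}]'
            counter += 1

idToName = {'ott1': 'mus', 'ott2': 'mus', 'ott3': 'mus'}
idToTips = {'ott1': 5, 'ott2': 90, 'ott3': 12}
renameDuplicates(idToName, idToTips, ['ott1', 'ott2', 'ott3'])
print(idToName)  # {'ott1': 'mus [2]', 'ott2': 'mus', 'ott3': 'mus [3]'}
```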
+def parseNewick( + stream: BasicStream, + nodeMap: dict[str, Node], + nameToFirstId: dict[str, str], + dupNameToIds: dict[str, list[str]]) -> str: + """ Parses a node using 'data' and 'dataIdx', updates nodeMap accordingly, and returns the node's ID """ + if stream.idx % 1e5 == 0: + print(f'Progress: {stream.progress() * 100:.2f}%') + # Find node + stream.skipWhitespace() + if stream.peek() == '': + raise Exception(f'ERROR: Unexpected EOF at index {stream.idx}') + elif stream.peek() == '(': # Start of inner node + stream.next() + childIds: list[str] = [] + while True: + # Read child + childId = parseNewick(stream, nodeMap, nameToFirstId, dupNameToIds) + childIds.append(childId) + # Check for next child or end of node + stream.skipWhitespace() + if stream.peek() == '': + raise Exception(f'ERROR: Unexpected EOF at index {stream.idx}') + elif stream.peek() == ',': # Expect another child + stream.next() + continue + else: # End of child list + # Get node name and id + stream.next() # Consume an expected ')' + stream.skipWhitespace() + name, id = parseNewickName(stream) + updateNameMaps(name, id, nameToFirstId, dupNameToIds) + # Get child num-tips total + tips = 0 + for childId in childIds: + tips += nodeMap[childId].tips + # Add node to nodeMap + nodeMap[id] = Node(name, childIds, None, tips, False) + # Update childrens' parent reference + for childId in childIds: + nodeMap[childId].parentId = id + return id + else: # Parse node name + name, id = parseNewickName(stream) + updateNameMaps(name, id, nameToFirstId, dupNameToIds) + nodeMap[id] = Node(name, [], None, 1, False) + return id +def parseNewickName(stream: BasicStream) -> tuple[str, str]: + """ Parses a node name from 'stream', and returns a (name, id) pair """ + name: str + nameChars = [] + if stream.peek() == '': + raise Exception(f'ERROR: Unexpected EOF at index {stream.idx}') + elif stream.peek() == "'": # Quoted name + nameChars.append(stream.next()) + while True: + if stream.peek() == '': + raise 
Exception(f'ERROR: Unexpected EOF at index {stream.idx}') + elif stream.peek() == "'": + nameChars.append(stream.next()) + if stream.peek() == "'": # '' is escaped-quote + nameChars.append(stream.next()) + continue + break + nameChars.append(stream.next()) + else: + while stream.hasNext() and not re.match(r'[(),;]', stream.peek()): + nameChars.append(stream.next()) + if stream.peek() == ';': # Ignore trailing input semicolon + stream.next() + # Convert to (name, id) + name = ''.join(nameChars).rstrip().lower() + if name.startswith('mrca'): + return (name, name) + elif name[0] == "'": + match = re.fullmatch(r"'([^\\\"]+) (ott\d+)'", name) + if match is None: + raise Exception(f'ERROR: invalid name \'{name}\'') + name = match.group(1).replace("''", "'") + return (name, match.group(2)) + else: + match = re.fullmatch(r"([^\\\"]+)_(ott\d+)", name) + if match is None: + raise Exception(f'ERROR: invalid name \'{name}\'') + return (match.group(1).replace('_', ' '), match.group(2)) +def updateNameMaps(name: str, id: str, nameToFirstId: dict[str, str], dupNameToIds: dict[str, list[str]]) -> None: + """ Update maps upon a newly parsed name """ + if name not in nameToFirstId: + nameToFirstId[name] = id + else: + if name not in dupNameToIds: + dupNameToIds[name] = [nameToFirstId[name], id] + else: + dupNameToIds[name].append(id) +def convertMrcaName(id: str, nodeMap: dict[str, Node]) -> str: + """ Update a node in a tree to be named after 2 descendants. + Returns the name of one such descendant, for use during recursion. 
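The `updateNameMaps` helper above can be exercised in isolation to see how duplicate names migrate into `dupNameToIds` (the function is adapted from the script; the driver loop is illustrative):

```python
def updateNameMaps(name, id, nameToFirstId, dupNameToIds):
    """Record the first ID seen for a name; track any later IDs as duplicates"""
    if name not in nameToFirstId:
        nameToFirstId[name] = id
    elif name not in dupNameToIds:
        dupNameToIds[name] = [nameToFirstId[name], id]
    else:
        dupNameToIds[name].append(id)

nameToFirstId, dupNameToIds = {}, {}
for name, id in [('rosa', 'ott1'), ('malus', 'ott2'), ('rosa', 'ott3')]:
    updateNameMaps(name, id, nameToFirstId, dupNameToIds)
print(nameToFirstId)  # {'rosa': 'ott1', 'malus': 'ott2'}
print(dupNameToIds)   # {'rosa': ['ott1', 'ott3']}
```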
""" + node = nodeMap[id] + name = node.name + childIds = node.childIds + if len(childIds) < 2: + raise Exception(f'ERROR: MRCA node \'{name}\' has less than 2 children') + # Get 2 children with most tips + childTips = [nodeMap[id].tips for id in childIds] + maxIdx1 = childTips.index(max(childTips)) + childTips[maxIdx1] = 0 + maxIdx2 = childTips.index(max(childTips)) + childId1 = childIds[maxIdx1] + childId2 = childIds[maxIdx2] + childName1 = nodeMap[childId1].name + childName2 = nodeMap[childId2].name + # Check for mrca* child names + if childName1.startswith('mrca'): + childName1 = convertMrcaName(childId1, nodeMap) + if childName2.startswith('mrca'): + childName2 = convertMrcaName(childId2, nodeMap) + # Check for composite names + match = re.fullmatch(r'\[(.+) \+ (.+)]', childName1) + if match is not None: + childName1 = match.group(1) + match = re.fullmatch(r'\[(.+) \+ (.+)]', childName2) + if match is not None: + childName2 = match.group(1) + # Create composite name + node.name = f'[{childName1} + {childName2}]' + return childName1 + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.parse_args() + # + genData(TREE_FILE, ANN_FILE, PICKED_NAMES_FILE, DB_FILE) diff --git a/backend/tol_data/gen_pop_data.py b/backend/tol_data/gen_pop_data.py new file mode 100755 index 0000000..e6a646e --- /dev/null +++ b/backend/tol_data/gen_pop_data.py @@ -0,0 +1,45 @@ +#!/usr/bin/python3 + +""" +Reads enwiki page view info from a database, and stores it +as node popularity values in the database. 
+""" + +import os, sqlite3 + +PAGEVIEWS_DB = os.path.join('enwiki', 'pageview_data.db') +DB_FILE = 'data.db' + +def genData(pageviewsDb: str, dbFile: str) -> None: + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + # + print('Getting view counts') + pdbCon = sqlite3.connect(pageviewsDb) + pdbCur = pdbCon.cursor() + nodeToViews: dict[str, int] = {} # Maps node names to counts + iterNum = 0 + for wikiId, views in pdbCur.execute('SELECT id, views from views'): + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') # Reached 1.6e6 + # + row = dbCur.execute('SELECT name FROM wiki_ids WHERE id = ?', (wikiId,)).fetchone() + if row is not None: + nodeToViews[row[0]] = views + pdbCon.close() + # + print(f'Writing {len(nodeToViews)} entries to db') + dbCur.execute('CREATE TABLE node_pop (name TEXT PRIMARY KEY, pop INT)') + for nodeName, views in nodeToViews.items(): + dbCur.execute('INSERT INTO node_pop VALUES (?, ?)', (nodeName, views)) + # + dbCon.commit() + dbCon.close() + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + genData(PAGEVIEWS_DB, DB_FILE) diff --git a/backend/tol_data/gen_reduced_trees.py b/backend/tol_data/gen_reduced_trees.py new file mode 100755 index 0000000..3742544 --- /dev/null +++ b/backend/tol_data/gen_reduced_trees.py @@ -0,0 +1,337 @@ +#!/usr/bin/python3 + +""" +Creates reduced versions of the tree in the database: +- A 'picked nodes' tree: + Created from a minimal set of node names read from a file, + possibly with some extra randomly-picked children. +- An 'images only' tree: + Created by removing nodes without an image or presence in the + 'picked' tree. +- A 'weakly trimmed' tree: + Created by removing nodes that lack an image or description, or + presence in the 'picked' tree. Additionally, for nodes with 'many' + children, some children are removed even if they have descriptions.
+""" + +import sys, re +import sqlite3 + +DB_FILE = 'data.db' +PICKED_NODES_FILE = 'picked_nodes.txt' +# +COMP_NAME_REGEX = re.compile(r'\[.+ \+ .+]') # Used to recognise composite nodes + +class Node: + def __init__(self, id, children, parent, tips, pSupport): + self.id = id + self.children = children + self.parent = parent + self.tips = tips + self.pSupport = pSupport + +def genData(tree: str, dbFile: str, pickedNodesFile: str) -> None: + print('Opening database') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + # + print('Finding root node') + query = 'SELECT name FROM nodes LEFT JOIN edges ON nodes.name = edges.child WHERE edges.parent IS NULL LIMIT 1' + (rootName,) = dbCur.execute(query).fetchone() + print(f'Found \'{rootName}\'') + # + print('=== Getting picked-nodes ===') + pickedNames: set[str] = set() + pickedTreeExists = False + if dbCur.execute('SELECT name FROM sqlite_master WHERE type="table" AND name="nodes_p"').fetchone() is None: + print(f'Reading from {pickedNodesFile}') + with open(pickedNodesFile) as file: + for line in file: + name = line.rstrip() + row = dbCur.execute('SELECT name from nodes WHERE name = ?', (name,)).fetchone() + if row is None: + row = dbCur.execute('SELECT name from names WHERE alt_name = ?', (name,)).fetchone() + if row is not None: + pickedNames.add(row[0]) + if not pickedNames: + raise Exception('ERROR: No picked names found') + else: + pickedTreeExists = True + print('Picked-node tree already exists') + if tree == 'picked': + sys.exit() + for (name,) in dbCur.execute('SELECT name FROM nodes_p'): + pickedNames.add(name) + print(f'Found {len(pickedNames)} names') + # + if (tree == 'picked' or tree is None) and not pickedTreeExists: + print('=== Generating picked-nodes tree ===') + genPickedNodeTree(dbCur, pickedNames, rootName) + if tree != 'picked': + print('=== Finding \'non-low significance\' nodes ===') + nodesWithImgOrPicked: set[str] = set() + nodesWithImgDescOrPicked: set[str] = set() + print('Finding 
nodes with descs') + for (name,) in dbCur.execute('SELECT name FROM wiki_ids INNER JOIN descs ON wiki_ids.id = descs.wiki_id'): + nodesWithImgDescOrPicked.add(name) + print('Finding nodes with images') + for (name,) in dbCur.execute('SELECT name FROM node_imgs'): + nodesWithImgDescOrPicked.add(name) + nodesWithImgOrPicked.add(name) + print('Adding picked nodes') + for name in pickedNames: + nodesWithImgDescOrPicked.add(name) + nodesWithImgOrPicked.add(name) + if tree == 'images' or tree is None: + print('=== Generating images-only tree ===') + genImagesOnlyTree(dbCur, nodesWithImgOrPicked, pickedNames, rootName) + if tree == 'trimmed' or tree is None: + print('=== Generating weakly-trimmed tree ===') + genWeaklyTrimmedTree(dbCur, nodesWithImgDescOrPicked, nodesWithImgOrPicked, rootName) + # + print('Closing database') + dbCon.commit() + dbCon.close() +def genPickedNodeTree(dbCur: sqlite3.Cursor, pickedNames: set[str], rootName: str) -> None: + PREF_NUM_CHILDREN = 3 # Include extra children up to this limit + print('Getting ancestors') + nodeMap = genNodeMap(dbCur, pickedNames, 100) + print(f'Result has {len(nodeMap)} nodes') + print('Removing composite nodes') + removedNames = removeCompositeNodes(nodeMap) + print(f'Result has {len(nodeMap)} nodes') + print('Removing \'collapsible\' nodes') + temp = removeCollapsibleNodes(nodeMap, pickedNames) + removedNames.update(temp) + print(f'Result has {len(nodeMap)} nodes') + print('Adding some additional nearby children') + namesToAdd: list[str] = [] + iterNum = 0 + for name, node in nodeMap.items(): + iterNum += 1 + if iterNum % 100 == 0: + print(f'At iteration {iterNum}') + # + numChildren = len(node.children) + if numChildren < PREF_NUM_CHILDREN: + children = [row[0] for row in dbCur.execute('SELECT child FROM edges where parent = ?', (name,))] + newChildren: list[str] = [] + for n in children: + if n in nodeMap or n in removedNames: + continue + if COMP_NAME_REGEX.fullmatch(n) is not None: + continue + if 
dbCur.execute('SELECT name from node_imgs WHERE name = ?', (n,)).fetchone() is None and \ + dbCur.execute('SELECT name from linked_imgs WHERE name = ?', (n,)).fetchone() is None: + continue + newChildren.append(n) + newChildNames = newChildren[:(PREF_NUM_CHILDREN - numChildren)] + node.children.extend(newChildNames) + namesToAdd.extend(newChildNames) + for name in namesToAdd: + parent, pSupport = dbCur.execute('SELECT parent, p_support from edges WHERE child = ?', (name,)).fetchone() + (id,) = dbCur.execute('SELECT id FROM nodes WHERE name = ?', (name,)).fetchone() + parent = None if parent == '' else parent + nodeMap[name] = Node(id, [], parent, 0, pSupport == 1) + print(f'Result has {len(nodeMap)} nodes') + print('Updating \'tips\' values') + updateTips(rootName, nodeMap) + print('Creating table') + addTreeTables(nodeMap, dbCur, 'p') +def genImagesOnlyTree( + dbCur: sqlite3.Cursor, + nodesWithImgOrPicked: set[str], + pickedNames: set[str], + rootName: str) -> None: + print('Getting ancestors') + nodeMap = genNodeMap(dbCur, nodesWithImgOrPicked, 1e4) + print(f'Result has {len(nodeMap)} nodes') + print('Removing composite nodes') + removeCompositeNodes(nodeMap) + print(f'Result has {len(nodeMap)} nodes') + print('Removing \'collapsible\' nodes') + removeCollapsibleNodes(nodeMap, pickedNames) + print(f'Result has {len(nodeMap)} nodes') + print('Updating \'tips\' values') # Needed for next trimming step + updateTips(rootName, nodeMap) + print('Trimming from nodes with \'many\' children') + trimIfManyChildren(nodeMap, rootName, 300, pickedNames) + print(f'Result has {len(nodeMap)} nodes') + print('Updating \'tips\' values') + updateTips(rootName, nodeMap) + print('Creating table') + addTreeTables(nodeMap, dbCur, 'i') +def genWeaklyTrimmedTree( + dbCur: sqlite3.Cursor, + nodesWithImgDescOrPicked: set[str], + nodesWithImgOrPicked: set[str], + rootName: str) -> None: + print('Getting ancestors') + nodeMap = genNodeMap(dbCur, nodesWithImgDescOrPicked, 1e5) + 
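genNodeMap (defined below) grows a subtree by walking parent links upward from each kept node, stopping at the first ancestor already collected. The core walk can be sketched with plain dicts (the names here are illustrative):

```python
def collectWithAncestors(parentOf: dict, keepNames: set) -> set:
    """Collect each kept node plus all its ancestors, skipping already-visited ones"""
    kept = set()
    for name in keepNames:
        while name is not None and name not in kept:
            kept.add(name)
            name = parentOf.get(name)  # None at the root
    return kept

parentOf = {
    'homo': 'hominidae', 'pan': 'hominidae',
    'hominidae': 'primates', 'primates': None,
    'rosa': 'rosaceae', 'rosaceae': None,
}
print(sorted(collectWithAncestors(parentOf, {'homo', 'pan'})))
# ['hominidae', 'homo', 'pan', 'primates']
```

Note that nodes outside the kept set's ancestry ('rosa', 'rosaceae') are dropped, which is what shrinks the tree.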
print(f'Result has {len(nodeMap)} nodes') + print('Getting nodes to \'strongly keep\'') + iterNum = 0 + nodesFromImgOrPicked: set[str] = set() + for name in nodesWithImgOrPicked: + iterNum += 1 + if iterNum % 1e4 == 0: + print(f'At iteration {iterNum}') + # + while name is not None: + if name not in nodesFromImgOrPicked: + nodesFromImgOrPicked.add(name) + name = nodeMap[name].parent + else: + break + print(f'Node set has {len(nodesFromImgOrPicked)} nodes') + print('Removing \'collapsible\' nodes') + removeCollapsibleNodes(nodeMap, nodesWithImgDescOrPicked) + print(f'Result has {len(nodeMap)} nodes') + print('Updating \'tips\' values') # Needed for next trimming step + updateTips(rootName, nodeMap) + print('Trimming from nodes with \'many\' children') + trimIfManyChildren(nodeMap, rootName, 600, nodesFromImgOrPicked) + print(f'Result has {len(nodeMap)} nodes') + print('Updating \'tips\' values') + updateTips(rootName, nodeMap) + print('Creating table') + addTreeTables(nodeMap, dbCur, 't') +# Helper functions +def genNodeMap(dbCur: sqlite3.Cursor, nameSet: set[str], itersBeforePrint = 1) -> dict[str, Node]: + """ Returns a subtree that includes nodes in 'nameSet', as a name-to-Node map """ + nodeMap: dict[str, Node] = {} + iterNum = 0 + name: str | None + for name in nameSet: + iterNum += 1 + if iterNum % itersBeforePrint == 0: + print(f'At iteration {iterNum}') + # + prevName: str | None = None + while name is not None: + if name not in nodeMap: + # Add node + id, tips = dbCur.execute('SELECT id, tips from nodes where name = ?', (name,)).fetchone() + row: None | tuple[str, int] = dbCur.execute( + 'SELECT parent, p_support from edges where child = ?', (name,)).fetchone() + parent = None if row is None or row[0] == '' else row[0] + pSupport = row is None or row[1] == 1 + children = [] if prevName is None else [prevName] + nodeMap[name] = Node(id, children, parent, 0, pSupport) + # Iterate to parent + prevName = name + name = parent + else: + # Just add as child + if 
prevName is not None: + nodeMap[name].children.append(prevName) + break + return nodeMap +def removeCompositeNodes(nodeMap: dict[str, Node]) -> set[str]: + """ Given a tree, removes composite-name nodes, and returns the removed nodes' names """ + namesToRemove: set[str] = set() + for name, node in nodeMap.items(): + parent = node.parent + if parent is not None and COMP_NAME_REGEX.fullmatch(name) is not None: + # Connect children to parent + nodeMap[parent].children.remove(name) + nodeMap[parent].children.extend(node.children) + for n in node.children: + nodeMap[n].parent = parent + nodeMap[n].pSupport &= node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + return namesToRemove +def removeCollapsibleNodes(nodeMap: dict[str, Node], nodesToKeep: set[str] = set()) -> set[str]: + """ Given a tree, removes single-child parents, then only-childs, + with given exceptions, and returns the set of removed nodes' names """ + namesToRemove: set[str] = set() + # Remove single-child parents + for name, node in nodeMap.items(): + if len(node.children) == 1 and node.parent is not None and name not in nodesToKeep: + # Connect parent and children + parent = node.parent + child = node.children[0] + nodeMap[parent].children.remove(name) + nodeMap[parent].children.append(child) + nodeMap[child].parent = parent + nodeMap[child].pSupport &= node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + # Remove only-childs (not redundant because 'nodesToKeep' can cause single-child parents to be kept) + namesToRemove.clear() + for name, node in nodeMap.items(): + isOnlyChild = node.parent is not None and len(nodeMap[node.parent].children) == 1 + if isOnlyChild and name not in nodesToKeep: + # Connect parent and children + parent = node.parent + nodeMap[parent].children = node.children + for n in node.children: + nodeMap[n].parent = parent + nodeMap[n].pSupport &= 
node.pSupport + # Remember for removal + namesToRemove.add(name) + for name in namesToRemove: + del nodeMap[name] + # + return namesToRemove +def trimIfManyChildren( + nodeMap: dict[str, Node], rootName: str, childThreshold: int, nodesToKeep: set[str] = set()) -> None: + namesToRemove: set[str] = set() + def findTrimmables(nodeName: str) -> None: + nonlocal nodeMap, nodesToKeep + node = nodeMap[nodeName] + if len(node.children) > childThreshold: + numToTrim = len(node.children) - childThreshold + # Try removing nodes, preferring those with fewer tips + candidatesToTrim = [n for n in node.children if n not in nodesToKeep] + childToTips = {n: nodeMap[n].tips for n in candidatesToTrim} + candidatesToTrim.sort(key=lambda n: childToTips[n], reverse=True) + childrenToRemove = set(candidatesToTrim[-numToTrim:]) + node.children = [n for n in node.children if n not in childrenToRemove] + # Mark nodes for deletion + for n in childrenToRemove: + markForRemoval(n) + # Recurse on children + for n in node.children: + findTrimmables(n) + def markForRemoval(nodeName: str) -> None: + nonlocal nodeMap, namesToRemove + namesToRemove.add(nodeName) + for child in nodeMap[nodeName].children: + markForRemoval(child) + findTrimmables(rootName) + for nodeName in namesToRemove: + del nodeMap[nodeName] +def updateTips(nodeName: str, nodeMap: dict[str, Node]) -> int: + """ Updates the 'tips' values for a node and its descendants, returning the node's new 'tips' value """ + node = nodeMap[nodeName] + tips = sum([updateTips(childName, nodeMap) for childName in node.children]) + tips = max(1, tips) + node.tips = tips + return tips +def addTreeTables(nodeMap: dict[str, Node], dbCur: sqlite3.Cursor, suffix: str): + """ Adds a tree to the database, as tables nodes_X and edges_X, where X is the given suffix """ + nodesTbl = f'nodes_{suffix}' + edgesTbl = f'edges_{suffix}' + dbCur.execute(f'CREATE TABLE {nodesTbl} (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)') + dbCur.execute(f'CREATE INDEX
{nodesTbl}_idx_nc ON {nodesTbl}(name COLLATE NOCASE)') + dbCur.execute(f'CREATE TABLE {edgesTbl} (parent TEXT, child TEXT, p_support INT, PRIMARY KEY (parent, child))') + dbCur.execute(f'CREATE INDEX {edgesTbl}_child_idx ON {edgesTbl}(child)') + for name, node in nodeMap.items(): + dbCur.execute(f'INSERT INTO {nodesTbl} VALUES (?, ?, ?)', (name, node.id, node.tips)) + for childName in node.children: + pSupport = 1 if nodeMap[childName].pSupport else 0 + dbCur.execute(f'INSERT INTO {edgesTbl} VALUES (?, ?, ?)', (name, childName, pSupport)) + +if __name__ == '__main__': + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument('--tree', choices=['picked', 'images', 'trimmed'], help='Only generate the specified tree') + args = parser.parse_args() + # + genData(args.tree, DB_FILE, PICKED_NODES_FILE) diff --git a/backend/tol_data/otol/README.md b/backend/tol_data/otol/README.md new file mode 100644 index 0000000..e018369 --- /dev/null +++ b/backend/tol_data/otol/README.md @@ -0,0 +1,19 @@ +This directory holds files obtained via the +[Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life). + +# Tree Data Files +- `opentree13.4tree.tgz` <br> + Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>. + Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life). +- `labelled_supertree_ottnames.tre` <br> + Extracted from the .tgz file. Describes the structure of the tree. +- `annotations.json` <br> + Extracted from the .tgz file. Contains additional attributes of tree + nodes. Used for finding out which nodes have 'phylogenetic support'. + +# Taxonomy Data Files +- `ott3.3.tgz` <br> + Obtained from <https://tree.opentreeoflife.org/about/taxonomy-version/ott3.3>. + Contains taxonomy data from the Open Tree of Life. +- `otol/taxonomy.tsv` <br> + Extracted from the .tgz file. 
Holds taxon IDs from sources like NCBI, used to map between datasets. diff --git a/backend/tol_data/picked_imgs/README.md b/backend/tol_data/picked_imgs/README.md new file mode 100644 index 0000000..1edd951 --- /dev/null +++ b/backend/tol_data/picked_imgs/README.md @@ -0,0 +1,10 @@ +This directory holds additional image files to use for tree-of-life nodes, +on top of those from EOL and Wikipedia. + +Possible Files +============== +- (Image files) +- img_data.txt <br> + Contains lines with the format `filename|url|license|artist|credit`. + The filename should consist of a node name, with an image extension. + Other fields correspond to those in the `images` table (see ../README.md). diff --git a/backend/tol_data/review_imgs_to_gen.py b/backend/tol_data/review_imgs_to_gen.py new file mode 100755 index 0000000..2283ed7 --- /dev/null +++ b/backend/tol_data/review_imgs_to_gen.py @@ -0,0 +1,241 @@ +#!/usr/bin/python3 + +""" +Provides a GUI that displays, for each node in the database, associated +images from EOL and Wikipedia, and allows choosing which to use. Writes +choice data to a text file with lines of the form 'otolId1 imgPath1', or +'otolId1', where no path indicates a choice of no image. + +The program can be closed, and run again to continue from the last choice. +The program looks for an existing output file to determine what choices +have already been made. 
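The output format described above can be round-tripped with a small reader. `parseChoices` is a hypothetical helper (the script itself only appends lines), showing that a bare otol ID line means 'no image chosen':

```python
def parseChoices(lines: list) -> dict:
    """Map each otol ID to its chosen image path, or None for 'no image'"""
    choices = {}
    for line in lines:
        otolId, _, imgPath = line.rstrip('\n').partition(' ')
        choices[otolId] = imgPath if imgPath else None  # Empty remainder means no image
    return choices

lines = ['ott770315 eol/imgs/123 1.jpg\n', 'ott5748753\n']
print(parseChoices(lines))
# {'ott770315': 'eol/imgs/123 1.jpg', 'ott5748753': None}
```

Splitting on only the first space matters here, because EOL image filenames themselves contain spaces.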
+""" + +import os, time +import sqlite3 +import tkinter as tki +from tkinter import ttk +import PIL +from PIL import ImageTk, Image, ImageOps + +EOL_IMG_DIR = os.path.join('eol', 'imgs') +ENWIKI_IMG_DIR = os.path.join('enwiki', 'imgs') +DB_FILE = 'data.db' +OUT_FILE = 'img_list.txt' +# +IMG_DISPLAY_SZ = 400 +PLACEHOLDER_IMG = Image.new('RGB', (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135)) +REVIEW = 'only pairs' # Can be: 'all', 'only pairs', 'none' + +class ImgReviewer: + """ Provides the GUI for reviewing images """ + def __init__(self, root, nodeToImgs, eolImgDir, enwikiImgDir, outFile, dbCon, review): + self.root = root + root.title('Image Reviewer') + # Setup main frame + mainFrame = ttk.Frame(root, padding='5 5 5 5') + mainFrame.grid(column=0, row=0, sticky=(tki.N, tki.W, tki.E, tki.S)) + root.columnconfigure(0, weight=1) + root.rowconfigure(0, weight=1) + # Set up images-to-be-reviewed frames + self.eolImg = ImageTk.PhotoImage(PLACEHOLDER_IMG) + self.enwikiImg = ImageTk.PhotoImage(PLACEHOLDER_IMG) + self.labels: list[ttk.Label] = [] + for i in (0, 1): + frame = ttk.Frame(mainFrame, width=IMG_DISPLAY_SZ, height=IMG_DISPLAY_SZ) + frame.grid(column=i, row=0) + label = ttk.Label(frame, image=self.eolImg if i == 0 else self.enwikiImg) + label.grid(column=0, row=0) + self.labels.append(label) + # Add padding + for child in mainFrame.winfo_children(): + child.grid_configure(padx=5, pady=5) + # Add keyboard bindings + root.bind('<q>', self.quit) + root.bind('<Key-j>', lambda evt: self.accept(0)) + root.bind('<Key-k>', lambda evt: self.accept(1)) + root.bind('<Key-l>', lambda evt: self.reject()) + # Set fields + self.nodeImgsList = list(nodeToImgs.items()) + self.listIdx = -1 + self.eolImgDir = eolImgDir + self.enwikiImgDir = enwikiImgDir + self.outFile = outFile + self.review = review + self.dbCon = dbCon + self.dbCur = dbCon.cursor() + self.otolId = None + self.eolImgPath = None + self.enwikiImgPath = None + self.numReviewed = 0 + self.startTime = time.time() + # 
Initialise images to review + self.getNextImgs() + def getNextImgs(self): + """ Updates display with new images to review, or ends program """ + # Get next image paths + while True: + self.listIdx += 1 + if self.listIdx == len(self.nodeImgsList): + print('No more images to review. Exiting program.') + self.quit() + return + self.otolId, imgPaths = self.nodeImgsList[self.listIdx] + # Potentially skip user choice + if len(imgPaths) == 1 and (self.review == 'only pairs' or self.review == 'none'): + with open(self.outFile, 'a') as file: + file.write(f'{self.otolId} {imgPaths[0]}\n') + continue + elif self.review == 'none': + with open(self.outFile, 'a') as file: + file.write(f'{self.otolId} {imgPaths[-1]}\n') # Prefer enwiki image + continue + break + # Update displayed images + self.eolImgPath = self.enwikiImgPath = None + imageOpenError = False + for imgPath in imgPaths: + img: Image + try: + img = Image.open(imgPath) + img = ImageOps.exif_transpose(img) + except PIL.UnidentifiedImageError: + print(f'UnidentifiedImageError for {imgPath}') + imageOpenError = True + continue + if imgPath.startswith(self.eolImgDir): + self.eolImgPath = imgPath + self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img)) + elif imgPath.startswith(self.enwikiImgDir): + self.enwikiImgPath = imgPath + self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(img)) + else: + print(f'Unexpected image path {imgPath}') + self.quit() + return + # Re-iterate if all image paths invalid + if self.eolImgPath is None and self.enwikiImgPath is None: + if imageOpenError: + self.reject() + self.getNextImgs() + return + # Add placeholder images + if self.eolImgPath is None: + self.eolImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG)) + elif self.enwikiImgPath is None: + self.enwikiImg = ImageTk.PhotoImage(self.resizeImgForDisplay(PLACEHOLDER_IMG)) + # Update image-frames + self.labels[0].config(image=self.eolImg) + self.labels[1].config(image=self.enwikiImg) + # Update title + 
title = f'Images for otol ID {self.otolId}'
+ query = 'SELECT names.alt_name FROM' \
' nodes INNER JOIN names ON nodes.name = names.name' \
' WHERE nodes.id = ? AND pref_alt = 1'
+ row = self.dbCur.execute(query, (self.otolId,)).fetchone()
+ if row is not None:
+ title += f', aka {row[0]}'
+ title += f' ({self.listIdx + 1} out of {len(self.nodeImgsList)})'
+ self.root.title(title)
+ def accept(self, imgIdx):
+ """ React to a user selecting an image """
+ imgPath = self.eolImgPath if imgIdx == 0 else self.enwikiImgPath
+ if imgPath is None:
+ print('Invalid selection')
+ return
+ with open(self.outFile, 'a') as file:
+ file.write(f'{self.otolId} {imgPath}\n')
+ self.numReviewed += 1
+ self.getNextImgs()
+ def reject(self):
+ """ React to a user rejecting all images of a set """
+ with open(self.outFile, 'a') as file:
+ file.write(f'{self.otolId}\n')
+ self.numReviewed += 1
+ self.getNextImgs()
+ def quit(self, e = None):
+ print(f'Number reviewed: {self.numReviewed}')
+ timeElapsed = time.time() - self.startTime
+ print(f'Time elapsed: {timeElapsed:.2f} seconds')
+ if self.numReviewed > 0:
+ print(f'Avg time per review: {timeElapsed/self.numReviewed:.2f} seconds')
+ self.dbCon.close()
+ self.root.destroy()
+ def resizeImgForDisplay(self, img):
+ """ Returns a copy of an image, shrunk to fit its frame (keeps aspect ratio), and with a background """
+ if max(img.width, img.height) > IMG_DISPLAY_SZ:
+ if (img.width > img.height):
+ newHeight = int(img.height * IMG_DISPLAY_SZ/img.width)
+ img = img.resize((IMG_DISPLAY_SZ, newHeight))
+ else:
+ newWidth = int(img.width * IMG_DISPLAY_SZ / img.height)
+ img = img.resize((newWidth, IMG_DISPLAY_SZ))
+ bgImg = PLACEHOLDER_IMG.copy()
+ bgImg.paste(img, box=(
+ int((IMG_DISPLAY_SZ - img.width) / 2),
+ int((IMG_DISPLAY_SZ - img.height) / 2)))
+ return bgImg
+
+def reviewImgs(eolImgDir: str, enwikiImgDir: str, dbFile: str, outFile: str, review: str) -> None:
+ print('Opening database')
+ dbCon = sqlite3.connect(dbFile)
+ 
dbCur = dbCon.cursor() + # + nodeToImgs: dict[str, list[str]] = {} # Maps otol-ids to arrays of image paths + print('Iterating through images from EOL') + if os.path.exists(eolImgDir): + for filename in os.listdir(eolImgDir): + # Get associated EOL ID + eolId, _, _ = filename.partition(' ') + query = 'SELECT nodes.id FROM nodes INNER JOIN eol_ids ON nodes.name = eol_ids.name WHERE eol_ids.id = ?' + # Get associated node IDs + found = False + for (otolId,) in dbCur.execute(query, (int(eolId),)): + if otolId not in nodeToImgs: + nodeToImgs[otolId] = [] + nodeToImgs[otolId].append(os.path.join(eolImgDir, filename)) + found = True + if not found: + print(f'WARNING: No node found for {os.path.join(eolImgDir, filename)}') + print(f'Result: {len(nodeToImgs)} nodes with images') + print('Iterating through images from Wikipedia') + if os.path.exists(enwikiImgDir): + for filename in os.listdir(enwikiImgDir): + # Get associated page ID + wikiId, _, _ = filename.partition('.') + # Get associated node IDs + query = 'SELECT nodes.id FROM nodes INNER JOIN wiki_ids ON nodes.name = wiki_ids.name WHERE wiki_ids.id = ?' 
+ found = False
+ for (otolId,) in dbCur.execute(query, (int(wikiId),)):
+ if otolId not in nodeToImgs:
+ nodeToImgs[otolId] = []
+ nodeToImgs[otolId].append(os.path.join(enwikiImgDir, filename))
+ found = True
+ if not found:
+ print(f'WARNING: No node found for {os.path.join(enwikiImgDir, filename)}')
+ print(f'Result: {len(nodeToImgs)} nodes with images')
+ #
+ print('Filtering out already-made image choices')
+ oldSz = len(nodeToImgs)
+ if os.path.exists(outFile):
+ with open(outFile) as file:
+ for line in file:
+ line = line.rstrip()
+ if ' ' in line:
+ line = line[:line.find(' ')]
+ if line in nodeToImgs: # Guard against entries whose images have since been removed
+ del nodeToImgs[line]
+ print(f'Filtered out {oldSz - len(nodeToImgs)} entries')
+ #
+ # Create GUI and defer control
+ print('Starting GUI')
+ root = tki.Tk()
+ ImgReviewer(root, nodeToImgs, eolImgDir, enwikiImgDir, outFile, dbCon, review)
+ root.mainloop()
+ dbCon.close()
+
+if __name__ == '__main__':
+ import argparse
+ parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+ parser.parse_args()
+ #
+ reviewImgs(EOL_IMG_DIR, ENWIKI_IMG_DIR, DB_FILE, OUT_FILE, REVIEW)
diff --git a/backend/tol_data/wikidata/README.md b/backend/tol_data/wikidata/README.md
new file mode 100644
index 0000000..7b3105e
--- /dev/null
+++ b/backend/tol_data/wikidata/README.md
@@ -0,0 +1,18 @@
+This directory holds files obtained via [Wikidata](https://www.wikidata.org/).
+
+# Downloaded Files
+- `latest-all.json.bz2` <br>
+ Obtained from <https://dumps.wikimedia.org/wikidatawiki/entities/> (on 23/08/22).
+ Format info can be found at <https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html>.
+
+# Other Files
+- `gen_taxon_src_data.py` <br>
+ Used to generate a database holding taxon information from the dump.
+- `offsets.dat` <br>
+ Holds bzip2 block offsets for the dump. Generated and used by
+ `gen_taxon_src_data.py` for parallel processing of the dump.
+- `taxon_srcs.db` <br>
+ Generated by `gen_taxon_src_data.py`. 
<br>
+ Tables: <br>
+ - `src_id_to_title`: `src TEXT, id INT, title TEXT, PRIMARY KEY(src, id)`
+ - `title_iucn`: `title TEXT PRIMARY KEY, status TEXT`
diff --git a/backend/tol_data/wikidata/__init__.py b/backend/tol_data/wikidata/__init__.py
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/backend/tol_data/wikidata/__init__.py
diff --git a/backend/tol_data/wikidata/gen_taxon_src_data.py b/backend/tol_data/wikidata/gen_taxon_src_data.py
new file mode 100755
index 0000000..50ed917
--- /dev/null
+++ b/backend/tol_data/wikidata/gen_taxon_src_data.py
@@ -0,0 +1,239 @@
+#!/usr/bin/python3
+
+"""
+Reads a Wikidata JSON dump, looking for enwiki taxon items, and associated
+IDs from sources like GBIF/etc, and IUCN conservation status. Writes results
+into a database.
+
+The JSON dump contains an array of objects, each of which describes a
+Wikidata item item1, and takes up its own line.
+- Getting item1's Wikidata ID: item1['id'] (eg: "Q144")
+- Checking if item1 is a taxon: item1['claims']['P31'][idx1]['mainsnak']['datavalue']['value']['id'] == id1
+ 'idx1' indexes an array of statements
+ 'id1' is a Wikidata ID denoting a taxon item type (eg: Q310890 means 'monotypic taxon')
+- Checking if item1 is a taxon-alt: item1['claims']['P31'][idx1]['mainsnak']['datavalue']['value']['id'] == id1
+ 'id1' denotes a common-name-alternative item type (eg: Q55983715 means 'organisms known by a particular common name')
+ Getting the ID of the item that item1 is an alternative for:
+ item1['claims']['P31'][idx1]['qualifiers']['P642'][idx2]['datavalue']['value']['numeric-id']
+- Checking for an EOL/NCBI/etc ID: item['claims'][prop1][idx1]['mainsnak']['datavalue']['value'] (eg: "328672")
+ 'prop1' denotes a 'has ID from source *' property (eg: 'P830' means 'has EOL ID')
+- Checking for an IUCN status: item['claims']['P141'][idx1]['mainsnak']['datavalue']['value']['id'] (eg: "Q219127")
+
+Based on code from https://github.com/OneZoom/OZtree, located in 
+OZprivate/ServerScripts/TaxonMappingAndPopularity/ (22 Aug 2022). +""" + +# On Linux, running on the full dataset caused the processes to hang after processing. This was resolved by: +# - Storing subprocess results in temp files. Apparently passing large objects through pipes can cause deadlock. +# - Using set_start_method('spawn'). Apparently 'fork' can cause unexpected copying of lock/semaphore/etc state. +# Related: https://bugs.python.org/issue6721 +# - Using pool.map() instead of pool.imap_unordered(), which seems to hang in some cases (was using python 3.8). +# Possibly related: https://github.com/python/cpython/issues/72882 + +import sys, os, re, math, io +from collections import defaultdict +import bz2, json, sqlite3 +import multiprocessing, indexed_bzip2, pickle, tempfile + +WIKIDATA_FILE = 'latest-all.json.bz2' +OFFSETS_FILE = 'offsets.dat' +DB_FILE = 'taxon_srcs.db' +N_PROCS = 6 # Took about 3 hours with N_PROCS=6 + +# Wikidata entity IDs +TAXON_IDS = ['Q16521', 'Q310890', 'Q23038290', 'Q713623'] # 'taxon', 'monotypic taxon', 'fossil taxon', 'clade' +TAXON_ALT_IDS = ['Q55983715', 'Q502895'] # 'organisms known by a particular common name', 'common name' +SRC_PROP_IDS = {'P830': 'eol', 'P685': 'ncbi', 'P1391': 'if', 'P850': 'worms', 'P5055': 'irmng', 'P846': 'gbif'} +IUCN_STATUS_IDS = { + 'Q211005': 'least concern', 'Q719675': 'near threatened', 'Q278113': 'vulnerable', + 'Q11394': 'endangered', 'Q219127': 'critically endangered', 'Q239509': 'extinct in the wild', + 'Q237350': 'extinct species', 'Q3245245': 'data deficient' +} +# For filtering lines before parsing JSON +LINE_REGEX = re.compile(('"id":(?:"' + '"|"'.join([s for s in TAXON_IDS + TAXON_ALT_IDS]) + '")\D').encode()) + +def genData(wikidataFile: str, offsetsFile: str, dbFile: str, nProcs: int) -> None: + """ Reads the dump and writes source/iucn info to db """ + # Maps to populate + srcIdToId: dict[str, dict[int, int]] = defaultdict(dict) # Maps 'eol'/etc to {srcId1: wikidataId1, ...} + 
idToTitle: dict[int, str] = {} # Maps wikidata ID to enwiki title + idToAltId: dict[int, int] = {} # Maps taxon-item wikidata ID to taxon-alt ID (eg: 'canis lupus familiaris' -> 'dog') + idToIucnStatus: dict[int, str] = {} # Maps wikidata ID to iucn-status string ('least concern', etc) + # Check db + if os.path.exists(dbFile): + print('ERROR: Database already exists') + sys.exit(1) + # Read dump + if nProcs == 1: + with bz2.open(wikidataFile, mode='rb') as file: + for lineNum, line in enumerate(file, 1): + if lineNum % 1e4 == 0: + print(f'At line {lineNum}') + readDumpLine(line, srcIdToId, idToTitle, idToAltId, idToIucnStatus) + else: + if not os.path.exists(offsetsFile): + print('Creating offsets file') # For indexed access for multiprocessing (creation took about 6.7 hours) + with indexed_bzip2.open(wikidataFile) as file: + with open(offsetsFile, 'wb') as file2: + pickle.dump(file.block_offsets(), file2) + print('Allocating file into chunks') + fileSz: int # About 1.4 TB + with indexed_bzip2.open(wikidataFile) as file: + with open(offsetsFile, 'rb') as file2: + file.set_block_offsets(pickle.load(file2)) + fileSz = file.seek(0, io.SEEK_END) + chunkSz = math.floor(fileSz / nProcs) + chunkIdxs = [-1] + [chunkSz * i for i in range(1, nProcs)] + [fileSz-1] + # Each adjacent pair specifies a start+end byte index for readDumpChunk() + print(f'- Chunk size: {chunkSz:,}') + print('Starting processes to read dump') + with tempfile.TemporaryDirectory() as tempDirName: + # Using maxtasksperchild=1 to free resources on task completion + with multiprocessing.Pool(processes=nProcs, maxtasksperchild=1) as pool: + for outFilename in pool.map( + readDumpChunkOneParam, + ((i, wikidataFile, offsetsFile, chunkIdxs[i], chunkIdxs[i+1], + os.path.join(tempDirName, f'{i}.pickle')) for i in range(nProcs))): + # Get map data from subprocess output file + with open(outFilename, 'rb') as file: + maps = pickle.load(file) + # Add to maps + for src, idMap in maps[0].items(): + 
srcIdToId[src].update(idMap) + idToTitle.update(maps[1]) + idToAltId.update(maps[2]) + idToIucnStatus.update(maps[3]) + # + print('Writing to db') + dbCon = sqlite3.connect(dbFile) + dbCur = dbCon.cursor() + dbCur.execute('CREATE TABLE src_id_to_title (src TEXT, id INT, title TEXT, PRIMARY KEY(src, id))') + for src, submap in srcIdToId.items(): + for srcId, wId in submap.items(): + if wId not in idToTitle: # Check for a title, possibly via an alt-taxon + if wId in idToAltId: + wId = idToAltId[wId] + else: + continue + dbCur.execute('INSERT INTO src_id_to_title VALUES (?, ?, ?)', (src, srcId, idToTitle[wId])) + dbCur.execute('CREATE TABLE title_iucn (title TEXT PRIMARY KEY, status TEXT)') + for wId, status in idToIucnStatus.items(): + if wId not in idToTitle: # Check for a title, possibly via an alt-taxon + if wId in idToAltId and idToAltId[wId] not in idToIucnStatus: + wId = idToAltId[wId] + else: + continue + dbCur.execute('INSERT OR IGNORE INTO title_iucn VALUES (?, ?)', (idToTitle[wId], status)) + # The 'OR IGNORE' allows for multiple taxons using the same alt + dbCon.commit() + dbCon.close() +def readDumpLine( + lineBytes: bytes, + srcIdToId: dict[str, dict[int, int]], + idToTitle: dict[int, str], + idToAltId: dict[int, int], + idToIucnStatus: dict[int, str]) -> None: + # Check if taxon item + if LINE_REGEX.search(lineBytes) is None: + return + try: + line = lineBytes.decode('utf-8').rstrip().rstrip(',') + jsonItem = json.loads(line) + except json.JSONDecodeError: + print(f'Unable to parse line {line} as JSON') + return + isTaxon = False + altTaxa: list[int] = [] # For a taxon-alt item, holds associated taxon-item IDs + claims = None + try: + claims = jsonItem['claims'] + for statement in claims['P31']: # Check for 'instance of' statements + typeId: str = statement['mainsnak']['datavalue']['value']['id'] + if typeId in TAXON_IDS: + isTaxon = True + break + elif typeId in TAXON_ALT_IDS: + snaks = statement['qualifiers']['P642'] # Check for 'of' qualifiers + 
altTaxa.extend([int(s['datavalue']['value']['numeric-id']) for s in snaks]) + break + except (KeyError, ValueError): + return + if not isTaxon and not altTaxa: + return + # Get wikidata ID and enwiki title + itemId: int | None = None + itemTitle: str | None = None + try: + itemId = int(jsonItem['id'][1:]) # Skips initial 'Q' + itemTitle = jsonItem['sitelinks']['enwiki']['title'] + except KeyError: + # Allow taxon-items without titles (they might get one via a taxon-alt) + if itemId is not None and isTaxon: + itemTitle = None + else: + return + # Update maps + if itemTitle is not None: + idToTitle[itemId] = itemTitle + for altId in altTaxa: + idToAltId[altId] = itemId + # Check for source IDs + for srcPropId, src in SRC_PROP_IDS.items(): + if srcPropId in claims: + try: + srcId = int(claims[srcPropId][0]['mainsnak']['datavalue']['value']) + srcIdToId[src][srcId] = itemId + except (KeyError, ValueError): + continue + # Check for IUCN status + if 'P141' in claims: # Check for 'iucn conservation status' statement + try: + iucnStatusId: str = claims['P141'][0]['mainsnak']['datavalue']['value']['id'] + idToIucnStatus[itemId] = IUCN_STATUS_IDS[iucnStatusId] + except KeyError: + pass +def readDumpChunkOneParam(params: tuple[int, str, str, int, int, str]) -> str: + """ Forwards to readDumpChunk(), for use with pool.map() """ + return readDumpChunk(*params) +def readDumpChunk( + procId: int, wikidataFile: str, offsetsFile: str, startByte: int, endByte: int, outFilename: str) -> str: + """ Reads lines in the dump that begin after a start-byte, and not after an end byte. + If startByte is -1, start at the first line. 
""" + # Maps to populate + maps: tuple[ + dict[str, dict[int, int]], + dict[int, str], + dict[int, int], + dict[int, str]] = (defaultdict(dict), {}, {}, {}) + # Read dump + with indexed_bzip2.open(wikidataFile) as file: + # Load offsets file + with open(offsetsFile, 'rb') as file2: + offsets = pickle.load(file2) + file.set_block_offsets(offsets) + # Seek to chunk + if startByte != -1: + file.seek(startByte) + file.readline() + else: + startByte = 0 # Used for progress calculation + # Read lines + count = 0 + while file.tell() <= endByte: + count += 1 + if count % 1e4 == 0: + perc = (file.tell() - startByte) / (endByte - startByte) * 100 + print(f'Thread {procId}: {perc:.2f}%') + readDumpLine(file.readline(), *maps) + # Output results into file + with open(outFilename, 'wb') as file: + pickle.dump(maps, file) + return outFilename + +if __name__ == '__main__': # Guard needed for multiprocessing + import argparse + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + args = parser.parse_args() + # + multiprocessing.set_start_method('spawn') + genData(WIKIDATA_FILE, OFFSETS_FILE, DB_FILE, N_PROCS) |
