Diffstat (limited to 'backend/data')
21 files changed, 253 insertions, 227 deletions
diff --git a/backend/data/README.md b/backend/data/README.md index d4a6196..7d1adad 100644 --- a/backend/data/README.md +++ b/backend/data/README.md @@ -1,115 +1,121 @@ -File Generation Process -======================= -1 Tree Structure Data - 1 Obtain data in otol/, as specified in it's README. - 2 Run genOtolData.py, which creates data.db, and adds - 'nodes' and 'edges' tables using data in otol/*, as well as - genOtolNamesToKeep.txt, if present. -2 Name Data for Search - 1 Obtain data in eol/, as specified in it's README. - 2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db, - using data in eol/vernacularNames.csv and the 'nodes' table, and possibly - genEolNameDataPickedIds.txt. -3 Node Description Data - 1 Obtain data in dbpedia/ and enwiki/, as specified in their README files. - 2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db, - using data in dbpedia/dbpData.db, the 'nodes' table, and possibly - genDescNamesToSkip.txt and dbpPickedLabels.txt. - 3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables, - using data in enwiki/enwikiData.db, and the 'nodes' table. - Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for - skipping/resolving some name-page associations. -4 Image Data - 1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/. - It uses data in eol/imagesList.db, and the 'eol_ids' table. - 2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique - images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info). - 3 In enwiki/, run getEnwikiImgData.py, which generates a list of - tol-node images, and creates enwiki/enwikiImgs.db to store it. - Uses the 'wiki_ids' table to get tol-node wiki-ids. - 4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing - information for images listed in enwiki/enwikiImgs.db, and stores - it in that db. 
- 5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed' - images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/. - 6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/, - and enables choosing, for each tol-node, which image should be used, if any, - and outputs choice information into mergedImgList.txt. Uses the 'nodes', - 'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display). - 7 Run genImgsForWeb.py, which creates cropped/resized images in img/, - using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and - 'node_imgs' tables to data.db. <br> - Smartcrop's outputs might need to be manually created/adjusted: <br> - - An input image might have no output produced, possibly due to - data incompatibilities, memory limits, etc. A few input image files - might actually be html files, containing a 'file not found' page. - - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg. - - An input image might produce output with unexpected dimensions. - This seems to happen when the image is very large, and triggers a - decompression bomb warning. - The result might have as many as 150k images, with about 2/3 of them - being from wikipedia. - 8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db, - which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate - nodes without images to child images. -5 Reduced Tree Structure Data - 1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to - data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables. -6 Other - - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table, - using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables. - - Optionally run addPickedNames.py, which adds manually-picked names to - the 'names' table, as specified in pickedNames.txt. 
- - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes, - for the sake of performance and result-relevance. Without this, jumping to certain - nodes within the fungi and moths can take over a minute to render. +This directory holds files used to generate data.db, which contains tree-of-life data. -data.db Tables -============== -- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT -- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child) -- eol\_ids: id INT PRIMARY KEY, name TEXT -- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name) -- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT -- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT -- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT -- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src) -- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT -- r\_nodes: name TEXT PRIMARY KEY, tips INT -- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child) +# Tables: +- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT` +- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)` +- `eol_ids`: `id INT PRIMARY KEY, name TEXT` +- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)` +- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT` +- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT` +- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT` +- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)` +- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT` +- `r_nodes`: `name TEXT PRIMARY KEY, tips INT` +- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)` -Other Files -=========== -- dbpPickedLabels.txt <br> - Contains DBpedia labels, one per line. 
Used by genDbpData.py to help - resolve conflicts when associating tree-of-life node names with - DBpedia node labels. -- genOtolNamesToKeep.txt <br> - Contains names to avoid trimming off the tree data generated by - genOtolData.py. Usage is optional, but, without it, a large amount - of possibly-significant nodes are removed, using a short-sighted - heuristic. <br> - One way to generate this list is to generate the files as usual, - then get node names that have an associated image, description, or - presence in r_nodes. Then run the genOtolData.py and genEolNameData.py - scripts again (after deleting their created tables). -- genEnwikiDescNamesToSkip.txt <br> - Contains names for nodes that genEnwikiNameData.py should skip adding - a description for. Usage is optional, but without it, some nodes will - probably get descriptions that don't match (eg: the bee genus Osiris - might be described as an egyptian god). <br> - This file was generated by running genEnwikiNameData.py, then listing - the names that it added into a file, along with descriptions, and - manually removing those that seemed node-matching (got about 30k lines, - with about 1 in 30 descriptions non-matching). And, after creating - genEnwikiDescTitlesToUse.txt, names shared with that file were removed. -- genEnwikiDescTitlesToUse.txt <br> - Contains enwiki titles with the form 'name1 (category1)' for - genEnwikiNameData.py to use to resolve nodes matching name name1. - Usage is optional, but it adds some descriptions that would otherwise - be skipped. <br> - This file was generated by taking the content of genEnwikiNameData.py, - after the manual filtering step, then, for each name name,1 getting - page titles from dbpedia/dbpData.db that match 'name1 (category1)'. - This was followed by manually removing lines, keeping those that - seemed to match the corresponding node (used the app to help with this). +# Generating the Database + +For the most part, these steps should be done in order. 
+ +As a warning, the whole process takes a lot of time and file space. The tree will probably +have about 2.5 million nodes. Downloading the images will take several days, and occupy over +200 GB. And if you want good data, you'll need to do some manual review, which can take weeks. + +## Environment +The scripts are written in Python and Bash. +Some of the Python scripts require third-party packages: +- jsonpickle: For encoding class objects as JSON. +- requests: For downloading data. +- PIL: For image processing. +- tkinter: For providing a basic GUI to review images. +- mwxml, mwparserfromhell: For parsing Wikipedia dumps. + +## Generate tree structure data +1. Obtain files in otol/, as specified in its README. +2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables, + using data in otol/. It also uses these files, if they exist: + - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the + tree may have the same name (eg: Pholidota can refer to pangolins or orchids). + Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc. + This file can be used to manually specify which node should be named 'name1'. + +## Generate node name data +1. Obtain 'name data files' in eol/, as specified in its README. +2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in + eol/ and the `nodes` table. It also uses these files, if they exist: + - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`. + Specifies node names that should have a particular EOL ID, or no ID. + Quite a few taxa have ambiguous names, and may need manual correction. + For example, Viola may resolve to a taxon of butterflies or of plants. + - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`. + Specifies that a node's alt-name set should exclude altName1. + +## Generate node description data +### Get data from DBpedia +1. 
Obtain files in dbpedia/, as specified in its README. +2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in + dbpedia/ and the `nodes` table. It also uses these files, if they exist: + - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which + no description should be obtained. Many node names have a same-name + Wikipedia page that describes something different (eg: Osiris). + - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`. + Specifies node names that should have a particular associated page label. +### Get data from Wikipedia +1. Obtain 'description database files' in enwiki/, as specified in its README. +2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables, + using data in enwiki/ and the `nodes` table. + It also uses these files, if they exist: + - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py. + - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt. + +## Generate image data +### Get images from EOL +1. Obtain 'image metadata files' in eol/, as specified in its README. +2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node), + into eol/imgsForReview/, using data in eol/, as well as the `eol_ids` table. +3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for + each node, providing the choice of which to use, moving them to eol/imgs/. + Uses `names` and `eol_ids` to display extra info. +### Get images from Wikipedia +1. In enwiki/, run genImgData.py, which looks for Wikipedia image names for each node, + using the `wiki_ids` table, and stores them in a database. +2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for + those images, using Wikipedia's online API. +3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed' + images into enwiki/imgs/. +### Merge the image sets +1. 
Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/, + and enables choosing, for each node, which image should be used, if any, + and outputs choice information into imgList.txt. Uses the `nodes`, + `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info). +2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in + imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and + `images` tables. If pickedImgs/ is present, images within it are also used. <br> + The outputs might need to be manually created/adjusted: + - An input image might have no output produced, possibly due to + data incompatibilities, memory limits, etc. A few input image files + might actually be HTML files, containing a 'file not found' page. + - An input x.gif might produce x-1.jpg, x-2.jpg, etc., instead of x.jpg. + - An input image might produce output with unexpected dimensions. + This seems to happen when the image is very large, and triggers a + decompression bomb warning. + The result might have as many as 150k images, with about 2/3 of them + being from Wikipedia. +### Add more image associations +1. Run genLinkedImgs.py, which tries to associate nodes without images to + images of their children. Adds the `linked_imgs` table, and uses the + `nodes`, `edges`, and `node_imgs` tables. + +## Do some post-processing +1. Run genReducedTreeData.py, which generates a second, reduced version of the tree, + adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from + reducedTreeNodes.txt, which lists names of nodes that must be included (1 per line). +2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes, + for the sake of performance and result-relevance. Otherwise, some nodes may have + over 10k children, which can take a while to render (over a minute in my testing). 
You might want to back up the untrimmed tree first, as this operation is not easily + reversible. +3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table, + using data in enwiki/, and the `names` and `wiki_ids` tables. +4. Optionally run addPickedNames.py, which allows adding manually-selected name data to + the `names` table, as specified in pickedNames.txt. diff --git a/backend/data/dbpedia/README.md b/backend/data/dbpedia/README.md index 78e2a90..8a08f20 100644 --- a/backend/data/dbpedia/README.md +++ b/backend/data/dbpedia/README.md @@ -1,28 +1,29 @@ -Downloaded Files -================ -- labels\_lang=en.ttl.bz2 <br> - Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core, - using the link <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>. -- page\_lang=en\_ids.ttl.bz2 <br> +This directory holds files obtained from/using [DBpedia](https://www.dbpedia.org). + +# Downloaded Files +- `labels_lang=en.ttl.bz2` <br> + Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core. + Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>. +- `page_lang=en_ids.ttl.bz2` <br> Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2> -- redirects\_lang=en\_transitive.ttl.bz2 <br> +- `redirects_lang=en_transitive.ttl.bz2` <br> Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>. -- disambiguations\_lang=en.ttl.bz2 <br> +- `disambiguations_lang=en.ttl.bz2` <br> Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>. -- instance-types\_lang=en\_specific.ttl.bz2 <br> +- `instance-types_lang=en_specific.ttl.bz2` <br> Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>. 
-- short-abstracts\_lang=en.ttl.bz2 <br> +- `short-abstracts_lang=en.ttl.bz2` <br> Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>. -Generated Files -=============== -- dbpData.db <br> - An sqlite database representing data from the ttl files. - Generated by running genData.py. - Tables - - labels: iri TEXT PRIMARY KEY, label TEXT - - ids: iri TEXT PRIMARY KEY, id INT - - redirects: iri TEXT PRIMARY KEY, target TEXT - - disambiguations: iri TEXT PRIMARY KEY - - types: iri TEXT, type TEXT - - abstracts: iri TEXT PRIMARY KEY, abstract TEXT +# Other Files +- genDescData.py <br> + Used to generate a database representing data from the ttl files. +- descData.db <br> + Generated by genDescData.py. <br> + Tables: <br> + - `labels`: `iri TEXT PRIMARY KEY, label TEXT` + - `ids`: `iri TEXT PRIMARY KEY, id INT` + - `redirects`: `iri TEXT PRIMARY KEY, target TEXT` + - `disambiguations`: `iri TEXT PRIMARY KEY` + - `types`: `iri TEXT, type TEXT` + - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT` diff --git a/backend/data/dbpedia/genData.py b/backend/data/dbpedia/genDescData.py index 41c48a8..bba3ff5 100755 --- a/backend/data/dbpedia/genData.py +++ b/backend/data/dbpedia/genDescData.py @@ -16,7 +16,7 @@ redirectsFile = "redirects_lang=en_transitive.ttl.bz2" disambigFile = "disambiguations_lang=en.ttl.bz2" typesFile = "instance-types_lang=en_specific.ttl.bz2" abstractsFile = "short-abstracts_lang=en.ttl.bz2" -dbFile = "dbpData.db" +dbFile = "descData.db" # Open db dbCon = sqlite3.connect(dbFile) diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md index 6462d7d..1c16a2e 100644 --- a/backend/data/enwiki/README.md +++ b/backend/data/enwiki/README.md @@ -1,39 +1,52 @@ -Downloaded Files -================ +This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page). 
+ +# Downloaded Files - enwiki-20220501-pages-articles-multistream.xml.bz2 <br> - Obtained via <https://dumps.wikimedia.org/backup-index.html> - (site suggests downloading from a mirror). Contains text - content and metadata for pages in English Wikipedia - (current revision only, excludes talk pages). Some file - content and format information was available from - <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>. + Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror). + Contains text content and metadata for pages in enwiki. + Some file content and format information was available from + <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>. - enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br> Obtained like above. Holds lines of the form offset1:pageId1:title1, - providing offsets, for each page, into the dump file, of a chunk of + providing, for each page, an offset into the dump file of a chunk of 100 pages that includes it. -Generated Files -=============== +# Generated Dump-Index Files +- genDumpIndexDb.py <br> + Creates an sqlite-database version of the enwiki-dump index file. - dumpIndex.db <br> - Holds data from the enwiki dump index file. Generated by - genDumpIndexDb.py, and used by lookupPage.py to get content for a - given page title. <br> + Generated by genDumpIndexDb.py. <br> Tables: <br> - - offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next\_offset INT -- enwikiData.db <br> - Holds data obtained from the enwiki dump file, in 'pages', - 'redirects', and 'descs' tables. Generated by genData.py, which uses - python packages mwxml and mwparserfromhell. <br> + - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT` + +# Description Database Files +- genDescData.py <br> + Reads through pages in the dump file, and adds short-description info to a database. +- descData.db <br> + Generated by genDescData.py. 
<br> Tables: <br> - - pages: id INT PRIMARY KEY, title TEXT UNIQUE - - redirects: id INT PRIMARY KEY, target TEXT - - descs: id INT PRIMARY KEY, desc TEXT -- enwikiImgs.db <br> - Holds infobox-images obtained for some set of wiki page-ids. - Generated by running getEnwikiImgData.py, which uses the enwiki dump - file and dumpIndex.db. <br> + - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE` + - `redirects`: `id INT PRIMARY KEY, target TEXT` + - `descs`: `id INT PRIMARY KEY, desc TEXT` + +# Image Database Files +- genImgData.py <br> + Used to find infobox image names for page IDs, storing them into a database. +- downloadImgLicenseInfo.py <br> + Used to download licensing metadata for image names, via Wikipedia's online API, storing them into a database. +- imgData.db <br> + Used to hold metadata about infobox images for a set of page IDs. + Generated using genImgData.py and downloadImgLicenseInfo.py. <br> Tables: <br> - - page\_imgs: page\_id INT PRIMAY KEY, img\_name TEXT - (img\_name may be null, which is used to avoid re-processing the page-id on a second pass) - - imgs: name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT - (might lack some matches for 'img_name' in 'page_imgs', due to inability to get license info) + - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br> + `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids. + - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br> + Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability. +- downloadEnwikiImgs.py <br> + Used to download image files into imgs/. + +# Other Files +- lookupPage.py <br> + Running `lookupPage.py title1` looks in the dump for a page with a given title, + and prints the contents to stdout. Uses dumpIndex.db. 
+ diff --git a/backend/data/enwiki/downloadEnwikiImgs.py b/backend/data/enwiki/downloadEnwikiImgs.py index de9b862..2929a0d 100755 --- a/backend/data/enwiki/downloadEnwikiImgs.py +++ b/backend/data/enwiki/downloadEnwikiImgs.py @@ -16,7 +16,7 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -imgDb = "enwikiImgs.db" # About 130k image names +imgDb = "imgData.db" # About 130k image names outDir = "imgs" licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE) diff --git a/backend/data/enwiki/downloadImgLicenseInfo.py b/backend/data/enwiki/downloadImgLicenseInfo.py index 8231fbb..097304b 100755 --- a/backend/data/enwiki/downloadImgLicenseInfo.py +++ b/backend/data/enwiki/downloadImgLicenseInfo.py @@ -16,7 +16,7 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -imgDb = "enwikiImgs.db" # About 130k image names +imgDb = "imgData.db" # About 130k image names apiUrl = "https://en.wikipedia.org/w/api.php" batchSz = 50 # Max 50 tagRegex = re.compile(r"<[^<]+>") diff --git a/backend/data/enwiki/genData.py b/backend/data/enwiki/genDescData.py index 3e60bb5..032dbed 100755 --- a/backend/data/enwiki/genData.py +++ b/backend/data/enwiki/genDescData.py @@ -13,7 +13,7 @@ if len(sys.argv) > 1: sys.exit(1) dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # 22,034,540 pages -enwikiDb = "enwikiData.db" +enwikiDb = "descData.db" # Some regexps and functions for parsing wikitext descLineRegex = re.compile("^ *[A-Z'\"]") diff --git a/backend/data/enwiki/getEnwikiImgData.py b/backend/data/enwiki/genImgData.py index f8bb2ee..9bd28f4 100755 --- a/backend/data/enwiki/getEnwikiImgData.py +++ b/backend/data/enwiki/genImgData.py @@ -21,7 +21,7 @@ def getInputPageIds(): return pageIds dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" indexDb = "dumpIndex.db" -imgDb = "enwikiImgs.db" # Output db +imgDb = "imgData.db" # Output db idLineRegex = re.compile(r"<id>(.*)</id>") 
imageLineRegex = re.compile(r".*\| *image *= *([^|]*)") bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]") diff --git a/backend/data/eol/README.md b/backend/data/eol/README.md index 8338be0..fbb008d 100644 --- a/backend/data/eol/README.md +++ b/backend/data/eol/README.md @@ -1,18 +1,25 @@ -Downloaded Files -================ -- imagesList.tgz <br> - Obtained from https://opendata.eol.org/dataset/images-list on 24/04/2022. - Listed as being last updated on 05/02/2020. +This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/). + +# Name Data Files - vernacularNames.csv <br> - Obtained from https://opendata.eol.org/dataset/vernacular-names on 24/04/2022. - Listed as being last updated on 27/10/2020. + Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020). + Contains alternative-name data from EOL. -Generated Files -=============== +# Image Metadata Files +- imagesList.tgz <br> + Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020). + Contains metadata for images from EOL. - imagesList/ <br> - Obtained by extracting imagesList.tgz. + Extracted from imagesList.tgz. - imagesList.db <br> - Represents data from eol/imagesList/*, and is created by genImagesListDb.sh. <br> + Contains data from imagesList/. + Created by running genImagesListDb.sh, which simply imports csv files into a database. <br> Tables: <br> - - images: - content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT + - `images`: + `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT` + +# Image Generation Files +- downloadImgs.py <br> + Used to download image files into imgsForReview/. +- reviewImgs.py <br> + Used to review images in imgsForReview/, moving acceptable ones into imgs/. 
diff --git a/backend/data/eol/reviewImgs.py b/backend/data/eol/reviewImgs.py index 4fea1c4..5290f9e 100755 --- a/backend/data/eol/reviewImgs.py +++ b/backend/data/eol/reviewImgs.py @@ -17,7 +17,7 @@ if len(sys.argv) > 1: sys.exit(1) imgDir = "imgsForReview/" -outDir = "imgsReviewed/" +outDir = "imgs/" extraInfoDbCon = sqlite3.connect("../data.db") extraInfoDbCur = extraInfoDbCon.cursor() def getExtraInfo(eolId): diff --git a/backend/data/genDbpData.py b/backend/data/genDbpData.py index e921b6c..afe1e17 100755 --- a/backend/data/genDbpData.py +++ b/backend/data/genDbpData.py @@ -12,9 +12,9 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -dbpediaDb = "dbpedia/dbpData.db" -namesToSkipFile = "genDescNamesToSkip.txt" -pickedLabelsFile = "dbpPickedLabels.txt" +dbpediaDb = "dbpedia/descData.db" +namesToSkipFile = "pickedEnwikiNamesToSkip.txt" +pickedLabelsFile = "pickedDbpLabels.txt" dbFile = "data.db" # Open dbs diff --git a/backend/data/genEnwikiDescData.py b/backend/data/genEnwikiDescData.py index 2396540..dbc8d6b 100755 --- a/backend/data/genEnwikiDescData.py +++ b/backend/data/genEnwikiDescData.py @@ -11,10 +11,10 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -enwikiDb = "enwiki/enwikiData.db" +enwikiDb = "enwiki/descData.db" dbFile = "data.db" -namesToSkipFile = "genDescNamesToSkip.txt" -pickedLabelsFile = "enwikiPickedLabels.txt" +namesToSkipFile = "pickedEnwikiNamesToSkip.txt" +pickedLabelsFile = "pickedEnwikiLabels.txt" # Open dbs enwikiCon = sqlite3.connect(enwikiDb) diff --git a/backend/data/genEnwikiNameData.py b/backend/data/genEnwikiNameData.py index 71960a5..8285a40 100755 --- a/backend/data/genEnwikiNameData.py +++ b/backend/data/genEnwikiNameData.py @@ -10,7 +10,7 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -enwikiDb = "enwiki/enwikiData.db" +enwikiDb = "enwiki/descData.db" dbFile = "data.db" altNameRegex = re.compile(r"[a-zA-Z]+") # Avoids names like 'Evolution of Elephants', 
'Banana fiber', 'Fish (zoology)', diff --git a/backend/data/genEolNameData.py b/backend/data/genEolNameData.py index aa3905e..d852751 100755 --- a/backend/data/genEolNameData.py +++ b/backend/data/genEolNameData.py @@ -18,8 +18,8 @@ if len(sys.argv) > 1: vnamesFile = "eol/vernacularNames.csv" dbFile = "data.db" NAMES_TO_SKIP = {"unknown", "unknown species", "unidentified species"} -pickedIdsFile = "genEolNameDataPickedIds.txt" -badAltsFile = "genEolNameDataBadAlts.txt" +pickedIdsFile = "pickedEolIds.txt" +badAltsFile = "pickedEolAltsToSkip.txt" # Read in vernacular-names data # Note: Canonical-names may have multiple pids diff --git a/backend/data/genImgsForWeb.py b/backend/data/genImgs.py index 3c299bb..097959f 100755 --- a/backend/data/genImgsForWeb.py +++ b/backend/data/genImgs.py @@ -15,12 +15,12 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -imgListFile = "mergedImgList.txt" +imgListFile = "imgList.txt" outDir = "img/" eolImgDb = "eol/imagesList.db" -enwikiImgDb = "enwiki/enwikiImgs.db" +enwikiImgDb = "enwiki/imgData.db" pickedImgsDir = "pickedImgs/" -pickedImgsFile = "metadata.txt" +pickedImgsFilename = "imgData.txt" dbFile = "data.db" IMG_OUT_SZ = 200 genImgFiles = True @@ -37,9 +37,9 @@ enwikiCon = sqlite3.connect(enwikiImgDb) enwikiCur = enwikiCon.cursor() # Get 'picked images' info nodeToPickedImg = {} -if os.path.exists(pickedImgsDir + pickedImgsFile): +if os.path.exists(pickedImgsDir + pickedImgsFilename): lineNum = 0 - with open(pickedImgsDir + pickedImgsFile) as file: + with open(pickedImgsDir + pickedImgsFilename) as file: for line in file: lineNum += 1 (filename, url, license, artist, credit) = line.rstrip().split("|") diff --git a/backend/data/genOtolData.py b/backend/data/genOtolData.py index cfb5bed..87b35c3 100755 --- a/backend/data/genOtolData.py +++ b/backend/data/genOtolData.py @@ -1,6 +1,6 @@ #!/usr/bin/python3 -import sys, re +import sys, re, os import json, sqlite3 usageInfo = f"usage: {sys.argv[0]}\n" @@ -30,8 
+30,8 @@ annFile = "otol/annotations.json" dbFile = "data.db" nodeMap = {} # Maps node IDs to node objects nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs) -dupNameToIds = {} # Maps names of nodes with multiple IDs to those node IDs -pickedDupsFile = "genOtolDataPickedDups.txt" +dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs +pickedNamesFile = "pickedOtolNames.txt" # Parse treeFile print("Parsing tree file") @@ -142,10 +142,11 @@ rootId = parseNewick() # Resolve duplicate names print("Resolving duplicates") nameToPickedId = {} -with open(pickedDupsFile) as file: - for line in file: - (name, _, otolId) = line.rstrip().partition("|") - nameToPickedId[name] = otolId +if os.path.exists(pickedNamesFile): + with open(pickedNamesFile) as file: + for line in file: + (name, _, otolId) = line.rstrip().partition("|") + nameToPickedId[name] = otolId for [dupName, ids] in dupNameToIds.items(): # Check for picked id if dupName in nameToPickedId: diff --git a/backend/data/genReducedTreeData.py b/backend/data/genReducedTreeData.py index 208c937..b475794 100755 --- a/backend/data/genReducedTreeData.py +++ b/backend/data/genReducedTreeData.py @@ -10,7 +10,7 @@ if len(sys.argv) > 1: sys.exit(1) dbFile = "data.db" -nodeNamesFile = "reducedTol/names.txt" +nodeNamesFile = "reducedTreeNodes.txt" minimalNames = set() nodeMap = {} # Maps node names to node objects PREF_NUM_CHILDREN = 3 # Attempt inclusion of children up to this limit diff --git a/backend/data/otol/README.md b/backend/data/otol/README.md index a6f13c2..4be2fd2 100644 --- a/backend/data/otol/README.md +++ b/backend/data/otol/README.md @@ -1,6 +1,10 @@ -Downloaded Files -================ +Files +===== +- opentree13.4tree.tgz <br> + Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>. + Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life). 
- labelled\_supertree\_ottnames.tre <br> - Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4. -- annotations.json <br> - Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4. + Extracted from the .tgz file. Describes the structure of the tree. +- annotations.json <br> + Extracted from the .tgz file. Contains additional attributes of tree + nodes. Used for finding out which nodes have 'phylogenetic support'. diff --git a/backend/data/pickedImgs/README.md b/backend/data/pickedImgs/README.md index 52fc608..dfe192b 100644 --- a/backend/data/pickedImgs/README.md +++ b/backend/data/pickedImgs/README.md @@ -1,12 +1,10 @@ -This directory is used for adding additional, manually-picked images, -to the server's dataset, overriding any from eol and enwiki. If used, -it is expected to contain image files, and a metadata.txt file that -holds metadata. +This directory holds additional image files to use for tree-of-life nodes, +on top of those from EOL and Wikipedia. Possible Files ============== -- Image files -- metadata.txt <br> - Contains lines with the format filename|url|license|artist|credit. - The filename should be a tree-of-life node name, with an image - extension. Other fields correspond to those in the 'images' table. +- (Image files) +- imgData.txt <br> + Contains lines with the format `filename|url|license|artist|credit`. + The filename should consist of a node name, with an image extension. + Other fields correspond to those in the `images` table (see ../README.md). diff --git a/backend/data/reducedTol/README.md b/backend/data/reducedTol/README.md deleted file mode 100644 index 103bffc..0000000 --- a/backend/data/reducedTol/README.md +++ /dev/null @@ -1,4 +0,0 @@ -Files -===== -- names.txt <br> - Contains names of nodes to be kept in a reduced Tree of Life. 
diff --git a/backend/data/reviewImgsToMerge.py b/backend/data/reviewImgsToGen.py index d177a5e..4d970ba 100755 --- a/backend/data/reviewImgsToMerge.py +++ b/backend/data/reviewImgsToGen.py @@ -20,13 +20,13 @@ if len(sys.argv) > 1: print(usageInfo, file=sys.stderr) sys.exit(1) -eolImgDir = "eol/imgsReviewed/" +eolImgDir = "eol/imgs/" enwikiImgDir = "enwiki/imgs/" dbFile = "data.db" -outFile = "mergedImgList.txt" +outFile = "imgList.txt" IMG_DISPLAY_SZ = 400 PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135)) -onlyReviewPairs = False +onlyReviewPairs = True # Open db dbCon = sqlite3.connect(dbFile)
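As a supplement to the diff above: the new backend/data/README.md documents the data.db table layout, and that layout can be exercised with a short sketch. The table definitions below are taken from the README's `nodes`/`edges` schemas; the node names and IDs are hypothetical examples, not real tree data, and `children` is an illustrative helper, not a function from the repository.

```python
import sqlite3

# Schemas copied from the 'Tables' section of backend/data/README.md;
# the inserted rows are made-up examples, not real Open Tree of Life data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT)")
db.execute("CREATE TABLE edges (node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child))")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("carnivora", "ott1", 2),  # hypothetical IDs
    ("felidae", "ott2", 1),
    ("canidae", "ott3", 1),
])
db.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("carnivora", "felidae", 1),
    ("carnivora", "canidae", 1),
])

def children(name):
    # Look up a node's children and their phylogenetic-support flags,
    # roughly the kind of read a script like genLinkedImgs.py would do.
    return db.execute(
        "SELECT child, p_support FROM edges WHERE node = ? ORDER BY child",
        (name,)).fetchall()

print(children("carnivora"))  # [('canidae', 1), ('felidae', 1)]
```

A script walking the tree would repeat this lookup recursively over `edges`, which is why `(node, child)` is the table's primary key.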

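Several of the renamed 'picked' files in this commit (pickedOtolNames.txt, pickedEolIds.txt, pickedDbpLabels.txt) share the same pipe-delimited line format, which the genOtolData.py hunk parses with str.partition. A small sketch of that parsing; the sample entries are hypothetical, and `parsePickedFile` is an illustrative name, not a function from the repository.

```python
def parsePickedFile(lines):
    # Each line has the form "name1|value1"; partition keeps everything after
    # the first '|', so "name1|" maps name1 to "" (which pickedEolIds.txt
    # uses to mean 'no ID'). Mirrors the loop in the genOtolData.py hunk.
    picked = {}
    for line in lines:
        (name, _, value) = line.rstrip().partition("|")
        picked[name] = value
    return picked

# Hypothetical entries; Pholidota is the ambiguous-name example from the README
print(parsePickedFile(["Pholidota|ott247341\n", "Viola|\n"]))
# {'Pholidota': 'ott247341', 'Viola': ''}
```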