author     Terry Truong <terry06890@gmail.com>  2022-06-22 01:42:41 +1000
committer  Terry Truong <terry06890@gmail.com>  2022-06-22 09:39:44 +1000
commit     e78c4df403e5f98afa08f7a0841ff233d5f6d05b (patch)
tree       f13dbf91228550075644be9766b4546eb20f1e1f /backend
parent     ae1467d2ab35a03eb2d7bf3e5ca1cf4634b23443 (diff)
Update backend READMEs, rename some files for consistency
Diffstat (limited to 'backend')
-rw-r--r--  backend/README.md  4
-rw-r--r--  backend/data/README.md  232
-rw-r--r--  backend/data/dbpedia/README.md  45
-rwxr-xr-x  backend/data/dbpedia/genDescData.py (renamed from backend/data/dbpedia/genData.py)  2
-rw-r--r--  backend/data/enwiki/README.md  73
-rwxr-xr-x  backend/data/enwiki/downloadEnwikiImgs.py  2
-rwxr-xr-x  backend/data/enwiki/downloadImgLicenseInfo.py  2
-rwxr-xr-x  backend/data/enwiki/genDescData.py (renamed from backend/data/enwiki/genData.py)  2
-rwxr-xr-x  backend/data/enwiki/genImgData.py (renamed from backend/data/enwiki/getEnwikiImgData.py)  2
-rw-r--r--  backend/data/eol/README.md  33
-rwxr-xr-x  backend/data/eol/reviewImgs.py  2
-rwxr-xr-x  backend/data/genDbpData.py  6
-rwxr-xr-x  backend/data/genEnwikiDescData.py  6
-rwxr-xr-x  backend/data/genEnwikiNameData.py  2
-rwxr-xr-x  backend/data/genEolNameData.py  4
-rwxr-xr-x  backend/data/genImgs.py (renamed from backend/data/genImgsForWeb.py)  10
-rwxr-xr-x  backend/data/genOtolData.py  15
-rwxr-xr-x  backend/data/genReducedTreeData.py  2
-rw-r--r--  backend/data/otol/README.md  14
-rw-r--r--  backend/data/pickedImgs/README.md  16
-rw-r--r--  backend/data/reducedTol/README.md  4
-rwxr-xr-x  backend/data/reviewImgsToGen.py (renamed from backend/data/reviewImgsToMerge.py)  6
22 files changed, 257 insertions, 227 deletions
diff --git a/backend/README.md b/backend/README.md
new file mode 100644
index 0000000..331e7f4
--- /dev/null
+++ b/backend/README.md
@@ -0,0 +1,4 @@
+Files
+=====
+- server.py: Runs the server
+- data/: For generating the server's tree-of-life database
diff --git a/backend/data/README.md b/backend/data/README.md
index d4a6196..7d1adad 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -1,115 +1,121 @@
-File Generation Process
-=======================
-1 Tree Structure Data
- 1 Obtain data in otol/, as specified in it's README.
- 2 Run genOtolData.py, which creates data.db, and adds
- 'nodes' and 'edges' tables using data in otol/*, as well as
- genOtolNamesToKeep.txt, if present.
-2 Name Data for Search
- 1 Obtain data in eol/, as specified in it's README.
- 2 Run genEolNameData.py, which adds 'names' and 'eol_ids' tables to data.db,
- using data in eol/vernacularNames.csv and the 'nodes' table, and possibly
- genEolNameDataPickedIds.txt.
-3 Node Description Data
- 1 Obtain data in dbpedia/ and enwiki/, as specified in their README files.
- 2 Run genDbpData.py, which adds 'wiki_ids' and 'descs' tables to data.db,
- using data in dbpedia/dbpData.db, the 'nodes' table, and possibly
- genDescNamesToSkip.txt and dbpPickedLabels.txt.
- 3 Run genEnwikiDescData.py, which adds to the 'wiki_ids' and 'descs' tables,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
- Also uses genDescNamesToSkip.txt and genEnwikiDescTitlesToUse.txt for
- skipping/resolving some name-page associations.
-4 Image Data
- 1 In eol/, run downloadImgs.py to download EOL images into eol/imgsForReview/.
- It uses data in eol/imagesList.db, and the 'eol_ids' table.
- 2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
- images in eol/imgsReviewed/ (uses 'names' and 'eol_ids' to display extra info).
- 3 In enwiki/, run getEnwikiImgData.py, which generates a list of
- tol-node images, and creates enwiki/enwikiImgs.db to store it.
- Uses the 'wiki_ids' table to get tol-node wiki-ids.
- 4 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
- information for images listed in enwiki/enwikiImgs.db, and stores
- it in that db.
- 5 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
- images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
- 6 Run reviewImgsToMerge.py, which displays images from eol/ and enwiki/,
- and enables choosing, for each tol-node, which image should be used, if any,
- and outputs choice information into mergedImgList.txt. Uses the 'nodes',
- 'eol_ids', and 'wiki_ids' tables (as well as 'names' for info-display).
- 7 Run genImgsForWeb.py, which creates cropped/resized images in img/,
- using mergedImgList.txt, and possibly pickedImgs/, and adds 'images' and
- 'node_imgs' tables to data.db. <br>
- Smartcrop's outputs might need to be manually created/adjusted: <br>
- - An input image might have no output produced, possibly due to
- data incompatibilities, memory limits, etc. A few input image files
- might actually be html files, containing a 'file not found' page.
- - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
- - An input image might produce output with unexpected dimensions.
- This seems to happen when the image is very large, and triggers a
- decompression bomb warning.
- The result might have as many as 150k images, with about 2/3 of them
- being from wikipedia.
- 8 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db,
- which uses 'nodes', 'edges', 'eol_ids', and 'node_imgs', to associate
- nodes without images to child images.
-5 Reduced Tree Structure Data
- 1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
- data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-6 Other
- - Optionally run genEnwikiNameData.py, which adds more entries to the 'names' table,
- using data in enwiki/enwikiData.db, and the 'names' and 'wiki_ids' tables.
- - Optionally run addPickedNames.py, which adds manually-picked names to
- the 'names' table, as specified in pickedNames.txt.
- - Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
- for the sake of performance and result-relevance. Without this, jumping to certain
- nodes within the fungi and moths can take over a minute to render.
+This directory holds files used to generate data.db, which contains tree-of-life data.
-data.db Tables
-==============
-- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
-- edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
-- eol\_ids: id INT PRIMARY KEY, name TEXT
-- names: name TEXT, alt\_name TEXT, pref\_alt INT, src TEXT, PRIMARY KEY(name, alt\_name)
-- wiki\_ids: name TEXT PRIMARY KEY, id INT, redirected INT
-- descs: wiki\_id INT PRIMARY KEY, desc TEXT, from\_dbp INT
-- node\_imgs: name TEXT PRIMARY KEY, img\_id INT, src TEXT
-- images: id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)
-- linked\_imgs: name TEXT PRIMARY KEY, otol\_ids TEXT
-- r\_nodes: name TEXT PRIMARY KEY, tips INT
-- r\_edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
+# Tables:
+- `nodes`: `name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT`
+- `edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
+- `eol_ids`: `id INT PRIMARY KEY, name TEXT`
+- `names`: `name TEXT, alt_name TEXT, pref_alt INT, src TEXT, PRIMARY KEY(name, alt_name)`
+- `wiki_ids`: `name TEXT PRIMARY KEY, id INT, redirected INT`
+- `descs`: `wiki_id INT PRIMARY KEY, desc TEXT, from_dbp INT`
+- `node_imgs`: `name TEXT PRIMARY KEY, img_id INT, src TEXT`
+- `images`: `id INT, src TEXT, url TEXT, license TEXT, artist TEXT, credit TEXT, PRIMARY KEY (id, src)`
+- `linked_imgs`: `name TEXT PRIMARY KEY, otol_ids TEXT`
+- `r_nodes`: `name TEXT PRIMARY KEY, tips INT`
+- `r_edges`: `node TEXT, child TEXT, p_support INT, PRIMARY KEY (node, child)`
-Other Files
-===========
-- dbpPickedLabels.txt <br>
- Contains DBpedia labels, one per line. Used by genDbpData.py to help
- resolve conflicts when associating tree-of-life node names with
- DBpedia node labels.
-- genOtolNamesToKeep.txt <br>
- Contains names to avoid trimming off the tree data generated by
- genOtolData.py. Usage is optional, but, without it, a large amount
- of possibly-significant nodes are removed, using a short-sighted
- heuristic. <br>
- One way to generate this list is to generate the files as usual,
- then get node names that have an associated image, description, or
- presence in r_nodes. Then run the genOtolData.py and genEolNameData.py
- scripts again (after deleting their created tables).
-- genEnwikiDescNamesToSkip.txt <br>
- Contains names for nodes that genEnwikiNameData.py should skip adding
- a description for. Usage is optional, but without it, some nodes will
- probably get descriptions that don't match (eg: the bee genus Osiris
- might be described as an egyptian god). <br>
- This file was generated by running genEnwikiNameData.py, then listing
- the names that it added into a file, along with descriptions, and
- manually removing those that seemed node-matching (got about 30k lines,
- with about 1 in 30 descriptions non-matching). And, after creating
- genEnwikiDescTitlesToUse.txt, names shared with that file were removed.
-- genEnwikiDescTitlesToUse.txt <br>
- Contains enwiki titles with the form 'name1 (category1)' for
- genEnwikiNameData.py to use to resolve nodes matching name name1.
- Usage is optional, but it adds some descriptions that would otherwise
- be skipped. <br>
- This file was generated by taking the content of genEnwikiNameData.py,
- after the manual filtering step, then, for each name name,1 getting
- page titles from dbpedia/dbpData.db that match 'name1 (category1)'.
- This was followed by manually removing lines, keeping those that
- seemed to match the corresponding node (used the app to help with this).
+# Generating the Database
+
+For the most part, these steps should be done in order.
+
+As a warning, the whole process takes a lot of time and disk space. The tree will probably
+have about 2.5 million nodes. Downloading the images will take several days, and occupy over
+200 GB. And if you want good data, you'll need to do some manual review, which can take weeks.
+
+## Environment
+The scripts are written in python and bash.
+Some of the python scripts require third-party packages:
+- jsonpickle: For encoding class objects as JSON.
+- requests: For downloading data.
+- PIL: For image processing.
+- tkinter: For providing a basic GUI to review images.
+- mwxml, mwparserfromhell: For parsing Wikipedia dumps.
+
+## Generate tree structure data
+1. Obtain files in otol/, as specified in its README.
+2. Run genOtolData.py, which creates data.db, and adds the `nodes` and `edges` tables,
+ using data in otol/. It also uses these files, if they exist:
+ - pickedOtolNames.txt: Has lines of the form `name1|otolId1`. Some nodes in the
+ tree may have the same name (eg: Pholidota can refer to pangolins or orchids).
+ Normally, such nodes will get the names 'name1', 'name1 [2]', 'name1 [3]', etc.
+ This file can be used to manually specify which node should be named 'name1'.
+
+## Generate node name data
+1. Obtain 'name data files' in eol/, as specified in its README.
+2. Run genEolNameData.py, which adds the `names` and `eol_ids` tables, using data in
+ eol/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEolIds.txt: Has lines of the form `nodeName1|eolId1` or `nodeName1|`.
+ Specifies node names that should have a particular EOL ID, or no ID.
+ Quite a few taxa have ambiguous names, and may need manual correction.
+ For example, Viola may resolve to a taxon of butterflies or of plants.
+ - pickedEolAltsToSkip.txt: Has lines of the form `nodeName1|altName1`.
+ Specifies that a node's alt-name set should exclude altName1.
+
+## Generate node description data
+### Get data from DBpedia
+1. Obtain files in dbpedia/, as specified in its README.
+2. Run genDbpData.py, which adds the `wiki_ids` and `descs` tables, using data in
+ dbpedia/ and the `nodes` table. It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Each line holds the name of a node for which
+ no description should be obtained. Many node names have a same-name
+ wikipedia page that describes something different (eg: Osiris).
+ - pickedDbpLabels.txt: Has lines of the form `nodeName1|label1`.
+ Specifies node names that should have a particular associated page label.
+### Get data from Wikipedia
+1. Obtain 'description database files' in enwiki/, as specified in its README.
+2. Run genEnwikiDescData.py, which adds to the `wiki_ids` and `descs` tables,
+ using data in enwiki/ and the `nodes` table.
+ It also uses these files, if they exist:
+ - pickedEnwikiNamesToSkip.txt: Same as with genDbpData.py.
+ - pickedEnwikiLabels.txt: Similar to pickedDbpLabels.txt.
+
+## Generate image data
+### Get images from EOL
+1. Obtain 'image metadata files' in eol/, as specified in it's README.
+2. In eol/, run downloadImgs.py, which downloads images (possibly multiple per node),
+ into eol/imgsForReview, using data in eol/, as well as the `eol_ids` table.
+3. In eol/, run reviewImgs.py, which interactively displays the downloaded images for
+ each node, lets you choose which to use, and moves the chosen ones to eol/imgs/.
+ Uses `names` and `eol_ids` to display extra info.
+### Get images from Wikipedia
+1. In enwiki/, run genImgData.py, which looks for wikipedia image names for each node,
+ using the `wiki_ids` table, and stores them in a database.
+2. In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing information for
+ those images, using wikipedia's online API.
+3. In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+ images into enwiki/imgs/.
+### Merge the image sets
+1. Run reviewImgsToGen.py, which displays images from eol/imgs/ and enwiki/imgs/,
+ enables choosing, for each node, which image (if any) should be used, and
+ writes the choices to imgList.txt. Uses the `nodes`,
+ `eol_ids`, and `wiki_ids` tables (as well as `names` to display extra info).
+2. Run genImgs.py, which creates cropped/resized images in img/, from files listed in
+ imgList.txt and located in eol/ and enwiki/, and creates the `node_imgs` and
+ `images` tables. If pickedImgs/ is present, images within it are also used. <br>
+ The outputs might need to be manually created/adjusted:
+ - An input image might have no output produced, possibly due to
+ data incompatibilities, memory limits, etc. A few input image files
+ might actually be html files, containing a 'file not found' page.
+ - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
+ - An input image might produce output with unexpected dimensions.
+ This seems to happen when the image is very large, and triggers a
+ decompression bomb warning.
+ The result might have as many as 150k images, with about 2/3 of them
+ being from wikipedia.
+### Add more image associations
+1. Run genLinkedImgs.py, which tries to associate nodes without images with
+ images of their children. Adds the `linked_imgs` table, and uses the
+ `nodes`, `edges`, and `node_imgs` tables.
+
+## Do some post-processing
+1. Run genReducedTreeData.py, which generates a second, reduced version of the tree,
+ adding the `r_nodes` and `r_edges` tables, using `nodes` and `names`. Reads from
+ pickedReducedNodes.txt, which lists names of nodes that must be included (1 per line).
+2. Optionally run trimTree.py, which tries to remove some 'low-significance' nodes,
+ for the sake of performance and result-relevance. Otherwise, some nodes may have
+ over 10k children, which can take a while to render (over a minute in my testing).
+ You might want to back up the untrimmed tree first, as this operation is not easily
+ reversible.
+3. Optionally run genEnwikiNameData.py, which adds more entries to the `names` table,
+ using data in enwiki/, and the `names` and `wiki_ids` tables.
+4. Optionally run addPickedNames.py, which allows adding manually-selected name data to
+ the `names` table, as specified in pickedNames.txt.
diff --git a/backend/data/dbpedia/README.md b/backend/data/dbpedia/README.md
index 78e2a90..8a08f20 100644
--- a/backend/data/dbpedia/README.md
+++ b/backend/data/dbpedia/README.md
@@ -1,28 +1,29 @@
-Downloaded Files
-================
-- labels\_lang=en.ttl.bz2 <br>
- Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core,
- using the link <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
-- page\_lang=en\_ids.ttl.bz2 <br>
+This directory holds files obtained from/using [DBpedia](https://www.dbpedia.org).
+
+# Downloaded Files
+- `labels_lang=en.ttl.bz2` <br>
+ Obtained via https://databus.dbpedia.org/dbpedia/collections/latest-core.
+ Downloaded from <https://databus.dbpedia.org/dbpedia/generic/labels/2022.03.01/labels_lang=en.ttl.bz2>.
+- `page_lang=en_ids.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/page/2022.03.01/page_lang=en_ids.ttl.bz2>
-- redirects\_lang=en\_transitive.ttl.bz2 <br>
+- `redirects_lang=en_transitive.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/redirects/2022.03.01/redirects_lang=en_transitive.ttl.bz2>.
-- disambiguations\_lang=en.ttl.bz2 <br>
+- `disambiguations_lang=en.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/generic/disambiguations/2022.03.01/disambiguations_lang=en.ttl.bz2>.
-- instance-types\_lang=en\_specific.ttl.bz2 <br>
+- `instance-types_lang=en_specific.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/dbpedia/mappings/instance-types/2022.03.01/instance-types_lang=en_specific.ttl.bz2>.
-- short-abstracts\_lang=en.ttl.bz2 <br>
+- `short-abstracts_lang=en.ttl.bz2` <br>
Downloaded from <https://databus.dbpedia.org/vehnem/text/short-abstracts/2021.05.01/short-abstracts_lang=en.ttl.bz2>.
-Generated Files
-===============
-- dbpData.db <br>
- An sqlite database representing data from the ttl files.
- Generated by running genData.py.
- Tables
- - labels: iri TEXT PRIMARY KEY, label TEXT
- - ids: iri TEXT PRIMARY KEY, id INT
- - redirects: iri TEXT PRIMARY KEY, target TEXT
- - disambiguations: iri TEXT PRIMARY KEY
- - types: iri TEXT, type TEXT
- - abstracts: iri TEXT PRIMARY KEY, abstract TEXT
+# Other Files
+- genDescData.py <br>
+ Used to generate a database representing data from the ttl files.
+- descData.db <br>
+ Generated by genDescData.py. <br>
+ Tables: <br>
+ - `labels`: `iri TEXT PRIMARY KEY, label TEXT`
+ - `ids`: `iri TEXT PRIMARY KEY, id INT`
+ - `redirects`: `iri TEXT PRIMARY KEY, target TEXT`
+ - `disambiguations`: `iri TEXT PRIMARY KEY`
+ - `types`: `iri TEXT, type TEXT`
+ - `abstracts`: `iri TEXT PRIMARY KEY, abstract TEXT`
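The ttl files are essentially N-Triples: one `<subject> <predicate> "object" .` statement per line. As a rough sketch of what extracting a label triple might look like (the regex below is an assumption for illustration, not genDescData.py's actual parser):

```python
import re

# Matches lines like:
# <http://dbpedia.org/resource/Lion> <...rdf-schema#label> "Lion"@en .
labelLineRegex = re.compile(r'<([^>]+)> <[^>]+> "(.*)"@en \.')

def parseLabelLine(line):
    """Return (iri, label) for an English label triple, or None if no match."""
    m = labelLineRegex.match(line)
    return (m.group(1), m.group(2)) if m else None
```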
diff --git a/backend/data/dbpedia/genData.py b/backend/data/dbpedia/genDescData.py
index 41c48a8..bba3ff5 100755
--- a/backend/data/dbpedia/genData.py
+++ b/backend/data/dbpedia/genDescData.py
@@ -16,7 +16,7 @@ redirectsFile = "redirects_lang=en_transitive.ttl.bz2"
disambigFile = "disambiguations_lang=en.ttl.bz2"
typesFile = "instance-types_lang=en_specific.ttl.bz2"
abstractsFile = "short-abstracts_lang=en.ttl.bz2"
-dbFile = "dbpData.db"
+dbFile = "descData.db"
# Open db
dbCon = sqlite3.connect(dbFile)
diff --git a/backend/data/enwiki/README.md b/backend/data/enwiki/README.md
index 6462d7d..1c16a2e 100644
--- a/backend/data/enwiki/README.md
+++ b/backend/data/enwiki/README.md
@@ -1,39 +1,52 @@
-Downloaded Files
-================
+This directory holds files obtained from/using [English Wikipedia](https://en.wikipedia.org/wiki/Main_Page).
+
+# Downloaded Files
- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
- Obtained via <https://dumps.wikimedia.org/backup-index.html>
- (site suggests downloading from a mirror). Contains text
- content and metadata for pages in English Wikipedia
- (current revision only, excludes talk pages). Some file
- content and format information was available from
- <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
+ Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
+ Contains text content and metadata for pages in enwiki.
+ Some file content and format information was available from
+ <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
Obtained like above. Holds lines of the form offset1:pageId1:title1,
- providing offsets, for each page, into the dump file, of a chunk of
+ providing, for each page, an offset into the dump file of a chunk of
100 pages that includes it.
-Generated Files
-===============
+# Generated Dump-Index Files
+- genDumpIndexDb.py <br>
+ Creates an sqlite-database version of the enwiki-dump index file.
- dumpIndex.db <br>
- Holds data from the enwiki dump index file. Generated by
- genDumpIndexDb.py, and used by lookupPage.py to get content for a
- given page title. <br>
+ Generated by genDumpIndexDb.py. <br>
Tables: <br>
- - offsets: title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next\_offset INT
-- enwikiData.db <br>
- Holds data obtained from the enwiki dump file, in 'pages',
- 'redirects', and 'descs' tables. Generated by genData.py, which uses
- python packages mwxml and mwparserfromhell. <br>
+ - `offsets`: `title TEXT PRIMARY KEY, id INT UNIQUE, offset INT, next_offset INT`
+
+# Description Database Files
+- genDescData.py <br>
+ Reads through pages in the dump file, and adds short-description info to a database.
+- descData.db <br>
+ Generated by genDescData.py. <br>
Tables: <br>
- - pages: id INT PRIMARY KEY, title TEXT UNIQUE
- - redirects: id INT PRIMARY KEY, target TEXT
- - descs: id INT PRIMARY KEY, desc TEXT
-- enwikiImgs.db <br>
- Holds infobox-images obtained for some set of wiki page-ids.
- Generated by running getEnwikiImgData.py, which uses the enwiki dump
- file and dumpIndex.db. <br>
+ - `pages`: `id INT PRIMARY KEY, title TEXT UNIQUE`
+ - `redirects`: `id INT PRIMARY KEY, target TEXT`
+ - `descs`: `id INT PRIMARY KEY, desc TEXT`
+
+# Image Database Files
+- genImgData.py <br>
+ Used to find infobox image names for page IDs, storing them in a database.
+- downloadImgLicenseInfo.py <br>
+ Used to download licensing metadata for those images, via wikipedia's online API, storing it in the same database.
+- imgData.db <br>
+ Used to hold metadata about infobox images for a set of page IDs.
+ Generated using genImgData.py and downloadImgLicenseInfo.py. <br>
Tables: <br>
- - page\_imgs: page\_id INT PRIMAY KEY, img\_name TEXT
- (img\_name may be null, which is used to avoid re-processing the page-id on a second pass)
- - imgs: name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT
- (might lack some matches for 'img_name' in 'page_imgs', due to inability to get license info)
+ - `page_imgs`: `page_id INT PRIMARY KEY, img_name TEXT` <br>
+ `img_name` may be null, which means 'none found', and is used to avoid re-processing page-ids.
+ - `imgs`: `name TEXT PRIMARY KEY, license TEXT, artist TEXT, credit TEXT, restrictions TEXT, url TEXT` <br>
+ Might lack some matches for `img_name` in `page_imgs`, due to licensing info unavailability.
+- downloadEnwikiImgs.py <br>
+ Used to download image files into imgs/.
+
+# Other Files
+- lookupPage.py <br>
+ Running `lookupPage.py title1` looks in the dump for a page with a given title,
+ and prints the contents to stdout. Uses dumpIndex.db.
+
diff --git a/backend/data/enwiki/downloadEnwikiImgs.py b/backend/data/enwiki/downloadEnwikiImgs.py
index de9b862..2929a0d 100755
--- a/backend/data/enwiki/downloadEnwikiImgs.py
+++ b/backend/data/enwiki/downloadEnwikiImgs.py
@@ -16,7 +16,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgDb = "enwikiImgs.db" # About 130k image names
+imgDb = "imgData.db" # About 130k image names
outDir = "imgs"
licenseRegex = re.compile(r"cc0|cc([ -]by)?([ -]sa)?([ -][1234]\.[05])?( \w\w\w?)?", flags=re.IGNORECASE)
diff --git a/backend/data/enwiki/downloadImgLicenseInfo.py b/backend/data/enwiki/downloadImgLicenseInfo.py
index 8231fbb..097304b 100755
--- a/backend/data/enwiki/downloadImgLicenseInfo.py
+++ b/backend/data/enwiki/downloadImgLicenseInfo.py
@@ -16,7 +16,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgDb = "enwikiImgs.db" # About 130k image names
+imgDb = "imgData.db" # About 130k image names
apiUrl = "https://en.wikipedia.org/w/api.php"
batchSz = 50 # Max 50
tagRegex = re.compile(r"<[^<]+>")
diff --git a/backend/data/enwiki/genData.py b/backend/data/enwiki/genDescData.py
index 3e60bb5..032dbed 100755
--- a/backend/data/enwiki/genData.py
+++ b/backend/data/enwiki/genDescData.py
@@ -13,7 +13,7 @@ if len(sys.argv) > 1:
sys.exit(1)
dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2" # 22,034,540 pages
-enwikiDb = "enwikiData.db"
+enwikiDb = "descData.db"
# Some regexps and functions for parsing wikitext
descLineRegex = re.compile("^ *[A-Z'\"]")
diff --git a/backend/data/enwiki/getEnwikiImgData.py b/backend/data/enwiki/genImgData.py
index f8bb2ee..9bd28f4 100755
--- a/backend/data/enwiki/getEnwikiImgData.py
+++ b/backend/data/enwiki/genImgData.py
@@ -21,7 +21,7 @@ def getInputPageIds():
return pageIds
dumpFile = "enwiki-20220501-pages-articles-multistream.xml.bz2"
indexDb = "dumpIndex.db"
-imgDb = "enwikiImgs.db" # Output db
+imgDb = "imgData.db" # Output db
idLineRegex = re.compile(r"<id>(.*)</id>")
imageLineRegex = re.compile(r".*\| *image *= *([^|]*)")
bracketImageRegex = re.compile(r"\[\[(File:[^|]*).*]]")
diff --git a/backend/data/eol/README.md b/backend/data/eol/README.md
index 8338be0..fbb008d 100644
--- a/backend/data/eol/README.md
+++ b/backend/data/eol/README.md
@@ -1,18 +1,25 @@
-Downloaded Files
-================
-- imagesList.tgz <br>
- Obtained from https://opendata.eol.org/dataset/images-list on 24/04/2022.
- Listed as being last updated on 05/02/2020.
+This directory holds files obtained from/using the [Encyclopedia of Life](https://eol.org/).
+
+# Name Data Files
- vernacularNames.csv <br>
- Obtained from https://opendata.eol.org/dataset/vernacular-names on 24/04/2022.
- Listed as being last updated on 27/10/2020.
+ Obtained from <https://opendata.eol.org/dataset/vernacular-names> on 24/04/2022 (last updated on 27/10/2020).
+ Contains alternative-name data from EOL.
-Generated Files
-===============
+# Image Metadata Files
+- imagesList.tgz <br>
+ Obtained from <https://opendata.eol.org/dataset/images-list> on 24/04/2022 (last updated on 05/02/2020).
+ Contains metadata for images from EOL.
- imagesList/ <br>
- Obtained by extracting imagesList.tgz.
+ Extracted from imagesList.tgz.
- imagesList.db <br>
- Represents data from eol/imagesList/*, and is created by genImagesListDb.sh. <br>
+ Contains data from imagesList/.
+ Created by running genImagesListDb.sh, which simply imports csv files into a database. <br>
Tables: <br>
- - images:
- content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT
+ - `images`:
+ `content_id INT PRIMARY KEY, page_id INT, source_url TEXT, copy_url TEXT, license TEXT, copyright_owner TEXT`
+
+# Image Generation Files
+- downloadImgs.py <br>
+ Used to download image files into imgsForReview/.
+- reviewImgs.py <br>
+ Used to review images in imgsForReview/, moving acceptable ones into imgs/.
diff --git a/backend/data/eol/reviewImgs.py b/backend/data/eol/reviewImgs.py
index 4fea1c4..5290f9e 100755
--- a/backend/data/eol/reviewImgs.py
+++ b/backend/data/eol/reviewImgs.py
@@ -17,7 +17,7 @@ if len(sys.argv) > 1:
sys.exit(1)
imgDir = "imgsForReview/"
-outDir = "imgsReviewed/"
+outDir = "imgs/"
extraInfoDbCon = sqlite3.connect("../data.db")
extraInfoDbCur = extraInfoDbCon.cursor()
def getExtraInfo(eolId):
diff --git a/backend/data/genDbpData.py b/backend/data/genDbpData.py
index e921b6c..afe1e17 100755
--- a/backend/data/genDbpData.py
+++ b/backend/data/genDbpData.py
@@ -12,9 +12,9 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-dbpediaDb = "dbpedia/dbpData.db"
-namesToSkipFile = "genDescNamesToSkip.txt"
-pickedLabelsFile = "dbpPickedLabels.txt"
+dbpediaDb = "dbpedia/descData.db"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedDbpLabels.txt"
dbFile = "data.db"
# Open dbs
diff --git a/backend/data/genEnwikiDescData.py b/backend/data/genEnwikiDescData.py
index 2396540..dbc8d6b 100755
--- a/backend/data/genEnwikiDescData.py
+++ b/backend/data/genEnwikiDescData.py
@@ -11,10 +11,10 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-enwikiDb = "enwiki/enwikiData.db"
+enwikiDb = "enwiki/descData.db"
dbFile = "data.db"
-namesToSkipFile = "genDescNamesToSkip.txt"
-pickedLabelsFile = "enwikiPickedLabels.txt"
+namesToSkipFile = "pickedEnwikiNamesToSkip.txt"
+pickedLabelsFile = "pickedEnwikiLabels.txt"
# Open dbs
enwikiCon = sqlite3.connect(enwikiDb)
diff --git a/backend/data/genEnwikiNameData.py b/backend/data/genEnwikiNameData.py
index 71960a5..8285a40 100755
--- a/backend/data/genEnwikiNameData.py
+++ b/backend/data/genEnwikiNameData.py
@@ -10,7 +10,7 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-enwikiDb = "enwiki/enwikiData.db"
+enwikiDb = "enwiki/descData.db"
dbFile = "data.db"
altNameRegex = re.compile(r"[a-zA-Z]+")
# Avoids names like 'Evolution of Elephants', 'Banana fiber', 'Fish (zoology)',
diff --git a/backend/data/genEolNameData.py b/backend/data/genEolNameData.py
index aa3905e..d852751 100755
--- a/backend/data/genEolNameData.py
+++ b/backend/data/genEolNameData.py
@@ -18,8 +18,8 @@ if len(sys.argv) > 1:
vnamesFile = "eol/vernacularNames.csv"
dbFile = "data.db"
NAMES_TO_SKIP = {"unknown", "unknown species", "unidentified species"}
-pickedIdsFile = "genEolNameDataPickedIds.txt"
-badAltsFile = "genEolNameDataBadAlts.txt"
+pickedIdsFile = "pickedEolIds.txt"
+badAltsFile = "pickedEolAltsToSkip.txt"
# Read in vernacular-names data
# Note: Canonical-names may have multiple pids
diff --git a/backend/data/genImgsForWeb.py b/backend/data/genImgs.py
index 3c299bb..097959f 100755
--- a/backend/data/genImgsForWeb.py
+++ b/backend/data/genImgs.py
@@ -15,12 +15,12 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-imgListFile = "mergedImgList.txt"
+imgListFile = "imgList.txt"
outDir = "img/"
eolImgDb = "eol/imagesList.db"
-enwikiImgDb = "enwiki/enwikiImgs.db"
+enwikiImgDb = "enwiki/imgData.db"
pickedImgsDir = "pickedImgs/"
-pickedImgsFile = "metadata.txt"
+pickedImgsFilename = "imgData.txt"
dbFile = "data.db"
IMG_OUT_SZ = 200
genImgFiles = True
@@ -37,9 +37,9 @@ enwikiCon = sqlite3.connect(enwikiImgDb)
enwikiCur = enwikiCon.cursor()
# Get 'picked images' info
nodeToPickedImg = {}
-if os.path.exists(pickedImgsDir + pickedImgsFile):
+if os.path.exists(pickedImgsDir + pickedImgsFilename):
lineNum = 0
- with open(pickedImgsDir + pickedImgsFile) as file:
+ with open(pickedImgsDir + pickedImgsFilename) as file:
for line in file:
lineNum += 1
(filename, url, license, artist, credit) = line.rstrip().split("|")
diff --git a/backend/data/genOtolData.py b/backend/data/genOtolData.py
index cfb5bed..87b35c3 100755
--- a/backend/data/genOtolData.py
+++ b/backend/data/genOtolData.py
@@ -1,6 +1,6 @@
#!/usr/bin/python3
-import sys, re
+import sys, re, os
import json, sqlite3
usageInfo = f"usage: {sys.argv[0]}\n"
@@ -30,8 +30,8 @@ annFile = "otol/annotations.json"
dbFile = "data.db"
nodeMap = {} # Maps node IDs to node objects
nameToFirstId = {} # Maps node names to first found ID (names might have multiple IDs)
-dupNameToIds = {} # Maps names of nodes with multiple IDs to those node IDs
-pickedDupsFile = "genOtolDataPickedDups.txt"
+dupNameToIds = {} # Maps names of nodes with multiple IDs to those IDs
+pickedNamesFile = "pickedOtolNames.txt"
# Parse treeFile
print("Parsing tree file")
@@ -142,10 +142,11 @@ rootId = parseNewick()
# Resolve duplicate names
print("Resolving duplicates")
nameToPickedId = {}
-with open(pickedDupsFile) as file:
- for line in file:
- (name, _, otolId) = line.rstrip().partition("|")
- nameToPickedId[name] = otolId
+if os.path.exists(pickedNamesFile):
+ with open(pickedNamesFile) as file:
+ for line in file:
+ (name, _, otolId) = line.rstrip().partition("|")
+ nameToPickedId[name] = otolId
for [dupName, ids] in dupNameToIds.items():
# Check for picked id
if dupName in nameToPickedId:
diff --git a/backend/data/genReducedTreeData.py b/backend/data/genReducedTreeData.py
index 208c937..b475794 100755
--- a/backend/data/genReducedTreeData.py
+++ b/backend/data/genReducedTreeData.py
@@ -10,7 +10,7 @@ if len(sys.argv) > 1:
sys.exit(1)
dbFile = "data.db"
-nodeNamesFile = "reducedTol/names.txt"
+nodeNamesFile = "pickedReducedNodes.txt"
minimalNames = set()
nodeMap = {} # Maps node names to node objects
PREF_NUM_CHILDREN = 3 # Attempt inclusion of children up to this limit
diff --git a/backend/data/otol/README.md b/backend/data/otol/README.md
index a6f13c2..4be2fd2 100644
--- a/backend/data/otol/README.md
+++ b/backend/data/otol/README.md
@@ -1,6 +1,10 @@
-Downloaded Files
-================
+Files
+=====
+- opentree13.4tree.tgz <br>
+ Obtained from <https://tree.opentreeoflife.org/about/synthesis-release/v13.4>.
+ Contains tree data from the [Open Tree of Life](https://tree.opentreeoflife.org/about/open-tree-of-life).
- labelled\_supertree\_ottnames.tre <br>
- Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4.
-- annotations.json <br>
- Obtained from https://tree.opentreeoflife.org/about/synthesis-release/v13.4.
+ Extracted from the .tgz file. Describes the structure of the tree.
+- annotations.json <br>
+ Extracted from the .tgz file. Contains additional attributes of tree
+ nodes. Used for finding out which nodes have 'phylogenetic support'.
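The .tre file is in Newick format. As a toy illustration of that format only (a deliberately simplified sketch; genOtolData.py's real parser also handles ott IDs, support values, and quoting):

```python
def parseNewick(s):
    """Parse a tiny Newick subset like '(a,(b,c)d)e;' into
    nested (name, children) tuples. Assumes a trailing ';'."""
    pos = 0
    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":  # siblings
                pos += 1
                children.append(node())
            pos += 1  # skip ')'
        start = pos
        while s[pos] not in "(),;":  # read this node's name
            pos += 1
        return (s[start:pos], children)
    return node()
```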
diff --git a/backend/data/pickedImgs/README.md b/backend/data/pickedImgs/README.md
index 52fc608..dfe192b 100644
--- a/backend/data/pickedImgs/README.md
+++ b/backend/data/pickedImgs/README.md
@@ -1,12 +1,10 @@
-This directory is used for adding additional, manually-picked images,
-to the server's dataset, overriding any from eol and enwiki. If used,
-it is expected to contain image files, and a metadata.txt file that
-holds metadata.
+This directory holds additional image files to use for tree-of-life nodes,
+on top of those from EOL and Wikipedia.
Possible Files
==============
-- Image files
-- metadata.txt <br>
- Contains lines with the format filename|url|license|artist|credit.
- The filename should be a tree-of-life node name, with an image
- extension. Other fields correspond to those in the 'images' table.
+- (Image files)
+- imgData.txt <br>
+ Contains lines with the format `filename|url|license|artist|credit`.
+ The filename should consist of a node name, with an image extension.
+ Other fields correspond to those in the `images` table (see ../README.md).
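An imgData.txt line could be unpacked the way genImgs.py does (`line.rstrip().split("|")`); the helper below is hypothetical, with made-up field values:

```python
def parseImgDataLine(line):
    """Split a 'filename|url|license|artist|credit' line into a dict."""
    (filename, url, license, artist, credit) = line.rstrip("\n").split("|")
    return {"filename": filename, "url": url, "license": license,
            "artist": artist, "credit": credit}
```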
diff --git a/backend/data/reducedTol/README.md b/backend/data/reducedTol/README.md
deleted file mode 100644
index 103bffc..0000000
--- a/backend/data/reducedTol/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-Files
-=====
-- names.txt <br>
- Contains names of nodes to be kept in a reduced Tree of Life.
diff --git a/backend/data/reviewImgsToMerge.py b/backend/data/reviewImgsToGen.py
index d177a5e..4d970ba 100755
--- a/backend/data/reviewImgsToMerge.py
+++ b/backend/data/reviewImgsToGen.py
@@ -20,13 +20,13 @@ if len(sys.argv) > 1:
print(usageInfo, file=sys.stderr)
sys.exit(1)
-eolImgDir = "eol/imgsReviewed/"
+eolImgDir = "eol/imgs/"
enwikiImgDir = "enwiki/imgs/"
dbFile = "data.db"
-outFile = "mergedImgList.txt"
+outFile = "imgList.txt"
IMG_DISPLAY_SZ = 400
PLACEHOLDER_IMG = Image.new("RGB", (IMG_DISPLAY_SZ, IMG_DISPLAY_SZ), (88, 28, 135))
-onlyReviewPairs = False
+onlyReviewPairs = True
# Open db
dbCon = sqlite3.connect(dbFile)