From f8fa9ae3dd1571fa2912067b6eed010ea5d928e9 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Wed, 8 Jun 2022 12:34:57 +1000
Subject: Update READMEs, refactor getEnwikiImgData.py

---
 backend/data/README.md | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

(limited to 'backend/data/README.md')

diff --git a/backend/data/README.md b/backend/data/README.md
index 17484f4..18daa99 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -14,7 +14,8 @@ File Generation Process
       It uses data in eol/imagesList.db, and the 'eol\_ids' table.
     2 In eol/, run reviewImgs.py to filter images in eol/imgsForReview/ into EOL-id-unique
       images in eol/imgsReviewed/ (uses 'names' and 'eol\_ids' to display extra info).
-    3 Run genImgsForWeb.py to create cropped/resized images in img/, using
+    3 // UPDATE
+      Run genImgsForWeb.py to create cropped/resized images in img/, using
       images in eol/imgsReviewed/, and also to add an 'images' table to data.db.
     4 Run genLinkedImgs.py to add a 'linked_imgs' table to data.db, which uses
       'nodes', 'edges', 'eol\_ids', and 'images', to associate
@@ -22,21 +23,31 @@ File Generation Process
 4 Node Description Data
     1 Obtain data in dbpedia/, as specified in it's README.
     2 Run genDbpData.py, which adds a 'descs' table to data.db, using
-      data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
-5 Supplementary Name/Description Data
+      data in dbpedia/dbpData.db, the 'nodes' table, and possibly
+      dbpNamesToSkip.txt and dbpPickedLabels.txt.
+5 Supplementary Name/Description/Image Data
     1 Obtain data in enwiki/, as specified in it's README.
     2 Run genEnwikiDescData.py, which adds to the 'descs' table, using data in
       enwiki/enwikiData.db, and the 'nodes' table. Also uses genEnwikiDesc*.txt files
       for skipping/resolving some name-page associations.
-    3 Run genEnwikiNameData.py, which adds to the 'names' table, using data in
-      enwiki/enwikiData.db, and the 'names' and 'descs' tables.
+    3 Optionally run genEnwikiNameData.py, which adds to the 'names' table,
+      using data in enwiki/enwikiData.db, and the 'names' and 'descs' tables.
+    4 In enwiki/, run getEnwikiImgData.py, which generates a list of
+      tol-node images, and creates enwiki/enwikiImgs.db to store it.
+      Uses the 'descs' table to get tol-node wiki-ids.
+    5 In enwiki/, run downloadImgLicenseInfo.py, which downloads licensing
+      information for images listed in enwiki/enwikiImgs.db, and stores
+      it in that db.
+    6 In enwiki/, run downloadEnwikiImgs.py, which downloads 'permissively-licensed'
+      images in listed in enwiki/enwikiImgs.db, storing them in enwiki/imgs/.
+    7 // ADD
 5 Reduced Tree Structure Data
     1 Run genReducedTreeData.py, which adds 'r_nodes' and 'r_edges' tables to
       data.db, using reducedTol/names.txt, and the 'nodes' and 'names' tables.
 
 data.db Tables
 ==============
-- nodes: name TEXT PRIMARY KEY, tips INT
+- nodes: name TEXT PRIMARY KEY, id TEXT UNIQUE, tips INT
 - edges: node TEXT, child TEXT, p\_support INT, PRIMARY KEY (node, child)
 - names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
 - eol\_ids: id INT PRIMARY KEY, name TEXT
@@ -51,14 +62,7 @@ Other Files
 - dbpPickedLabels.txt
     Contains DBpedia labels, one per line. Used by genDbpData.py to help
     resolve conflicts when associating tree-of-life node names with
-    DBpedia node labels. Was generated by manually editing the output
-    of genDbpConflicts.py.
-- genDbpConflicts.py
-    Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
-    and looks for potential conflicts that would arise when genDbpData.db
-    tries to associate tree-of-life node names wth DBpedia node labels. It
-    writes data about them to conflicts.txt, which can be manually edited
-    to resolve them.
+    DBpedia node labels.
 - genOtolNamesToKeep.txt
     Contains names to avoid trimming off the tree data generated by
     genOtolData.py. Usage is optional, but, without it, a large amount
-- 
cgit v1.2.3