From c97acf8852e2017fd4776d65069f707121405f43 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Sat, 14 May 2022 19:30:43 +1000
Subject: Use DBpedia data for node descriptions

Add backend/data/dbpedia/ directory containing scripts and README for
obtaining DBpedia data, storing it in a db, converting/adding
description data to data.db, and for resolving tol-node DBpedia-node
association conflicts (via DBpedia relations, manual listing, etc).

Resulted in fewer (about 3/4 as many) descriptions than with enwiki,
but with notably fewer mis-associations (eg: node Thor is now described
as a shrimp instead of a god).
---
 backend/data/README.md | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

(limited to 'backend/data/README.md')

diff --git a/backend/data/README.md b/backend/data/README.md
index c4c46ba..b568f90 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -17,24 +17,37 @@ File Generation Process
     3 Use genImgsForWeb.py to create cropped/resized images in img/, using
       images in imgsReviewed, and also to add an 'images' table to data.db.
 4 Node Description Data
-    1 Obtain data in enwiki/, as specified in it's README.
-    2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
-      using data in enwiki/enwikiData.db, and the 'nodes' table.
+    - Using DBpedia
+        1 Obtain data in dbpedia/, as specified in its README.
+        2 Run genDbpData.py, which adds a 'descs' table to data.db, using
+          data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
+    - Using Wikipedia dump (old method)
+        1 Obtain data in enwiki/, as specified in its README.
+        2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
+          using data in enwiki/enwikiData.db, and the 'nodes' table.
 5 Reduced Tree Structure Data
     1 Run genReducedTreeData.py, which adds a 'reduced_nodes' table to data.db,
       using reducedTol/names.txt, and the 'nodes' and 'names' tables.
 
-data.db tables
+data.db Tables
 ==============
-- nodes
-    name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
-- names
-    name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
-- eol\_ids
-    id INT PRIMARY KEY, name TEXT
-- images
-    eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
-- descs
-    name TEXT PRIMARY KEY, desc TEXT, redirected INT
-- reduced\_nodes
-    name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT
+- nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+- names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
+- eol\_ids: id INT PRIMARY KEY, name TEXT
+- images: eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
+- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT
+- reduced\_nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+
+Other Files
+===========
+- dbpPickedLabels.txt
+    Contains DBpedia labels, one per line. Used by genDbpData.py to help
+    resolve conflicts when associating tree-of-life node names with
+    DBpedia node labels. Was generated by manually editing the output
+    of genDbpConflicts.py.
+- genDbpConflicts.py
+    Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
+    and looks for potential conflicts that would arise when genDbpData.py
+    tries to associate tree-of-life node names with DBpedia node labels. It
+    writes data about them to conflicts.txt, which can be manually edited
+    to resolve them.
--
cgit v1.2.3
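
Note on the 'descs' schema in the patched README: the sketch below is one way the table could be created and queried, and is not taken from genDbpData.py. It assumes an in-memory SQLite database standing in for data.db, a placeholder row, and a hypothetical helper named `lookup_desc`. The `desc` column is double-quoted because DESC is also an SQL keyword.

```python
import sqlite3

# Assumption: an in-memory database stands in for data.db.
con = sqlite3.connect(":memory:")

# Schema as listed in the README; "desc" is quoted since DESC is an SQL keyword.
con.execute(
    'CREATE TABLE descs (name TEXT PRIMARY KEY, "desc" TEXT, redirected INT)'
)

# Placeholder row for illustration only; real rows come from genDbpData.py.
con.execute(
    "INSERT INTO descs VALUES (?, ?, ?)",
    ("thor", "Thor is a genus of shrimp.", 0),
)


def lookup_desc(con, name):
    """Hypothetical helper: return (desc, redirected) for a node name, or None."""
    return con.execute(
        'SELECT "desc", redirected FROM descs WHERE name = ?', (name,)
    ).fetchone()


print(lookup_desc(con, "thor"))
print(lookup_desc(con, "no such node"))
```

Because `name` is the primary key, a lookup returns at most one row, so a missing description shows up as `None` and needs no extra existence check.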