| author | Terry Truong <terry06890@gmail.com> | 2022-05-14 19:30:43 +1000 |
|---|---|---|
| committer | Terry Truong <terry06890@gmail.com> | 2022-05-14 19:39:10 +1000 |
| commit | c97acf8852e2017fd4776d65069f707121405f43 (patch) | |
| tree | 1c0d725b6ae496239036b0f1d1c4a2caadf209cf /backend/data/README.md | |
| parent | 7003ef7f92f3a8fed059dab2b37c0e203c000dba (diff) | |
Use DBpedia data for node descriptions
Add backend/data/dbpedia/ directory containing scripts and README
for obtaining DBpedia data, storing it into a db, converting/adding
description data to data.db, and for resolving tol-node DBpedia-node
association conflicts (via DBpedia relations, manual listing, etc).
Resulted in fewer descriptions (about 3/4 as many) than using enwiki,
but with notably fewer mis-associations (eg: node Thor is now described
as a shrimp instead of a god).
Diffstat (limited to 'backend/data/README.md')
| -rw-r--r-- | backend/data/README.md | 45 |
1 files changed, 29 insertions, 16 deletions
diff --git a/backend/data/README.md b/backend/data/README.md
index c4c46ba..b568f90 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -17,24 +17,37 @@ File Generation Process
 3 Use genImgsForWeb.py to create cropped/resized images in img/, using
 images in imgsReviewed, and also to add an 'images' table to data.db.
 4 Node Description Data
- 1 Obtain data in enwiki/, as specified in it's README.
- 2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
+ - Using DBpedia
+ 1 Obtain data in dbpedia/, as specified in it's README.
+ 2 Run genDbpData.py, which adds a 'descs' table to data.db, using
+ data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
+ - Using wikipedia dump (old method)
+ 1 Obtain data in enwiki/, as specified in it's README.
+ 2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
+ using data in enwiki/enwikiData.db, and the 'nodes' table.
 5 Reduced Tree Structure Data
 1 Run genReducedTreeData.py, which adds a 'reduced_nodes' table to data.db,
 using reducedTol/names.txt, and the 'nodes' and 'names' tables.

-data.db tables
+data.db Tables
 ==============
-- nodes <br>
- name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
-- names <br>
- name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
-- eol\_ids <br>
- id INT PRIMARY KEY, name TEXT
-- images <br>
- eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
-- descs <br>
- name TEXT PRIMARY KEY, desc TEXT, redirected INT
-- reduced\_nodes <br>
- name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT
+- nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+- names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
+- eol\_ids: id INT PRIMARY KEY, name TEXT
+- images: eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
+- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT
+- reduced\_nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+
+Other Files
+===========
+- dbpPickedLabels.txt <br>
+ Contains DBpedia labels, one per line. Used by genDbpData.py to help
+ resolve conflicts when associating tree-of-life node names with
+ DBpedia node labels. Was generated by manually editing the output
+ of genDbpConflicts.py.
+- genDbpConflicts.py <br>
+ Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
+ and looks for potential conflicts that would arise when genDbpData.db
+ tries to associate tree-of-life node names wth DBpedia node labels. It
+ writes data about them to conflicts.txt, which can be manually edited
+ to resolve them.
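
The 'nodes' and 'descs' entries in the table listing of the diff above can be read as SQLite column definitions. The following is a minimal sketch of that reading, assuming Python's standard sqlite3 module and an in-memory database instead of the real data.db. The helpers create_tables and add_desc, the rule that a description is only stored for a name present in 'nodes', and all sample row values are hypothetical; none of this is taken from genDbpData.py.

```python
import sqlite3


def create_tables(con):
    """Create the 'nodes' and 'descs' tables using the columns listed in the README."""
    con.execute(
        "CREATE TABLE nodes (name TEXT PRIMARY KEY, children TEXT, "
        "parent TEXT, tips INT, p_support INT)"
    )
    # 'desc' is also an SQL keyword, so it is quoted here to be safe.
    con.execute(
        'CREATE TABLE descs (name TEXT PRIMARY KEY, "desc" TEXT, redirected INT)'
    )


def add_desc(con, name, desc, redirected=0):
    """Store a description if 'name' exists in 'nodes' (hypothetical rule)."""
    found = con.execute("SELECT 1 FROM nodes WHERE name = ?", (name,)).fetchone()
    if found is None:
        return False
    con.execute(
        "INSERT OR REPLACE INTO descs VALUES (?, ?, ?)", (name, desc, redirected)
    )
    return True


con = sqlite3.connect(":memory:")
create_tables(con)
# Sample row; the encoding of 'children' and the parent name are assumptions.
con.execute(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
    ("thor", "[]", "example-parent", 1, 0),
)
added = add_desc(con, "thor", "Thor is a genus of shrimp.")
skipped = add_desc(con, "not-a-node", "No matching node.")
stored = con.execute('SELECT "desc" FROM descs WHERE name = ?', ("thor",)).fetchone()
```

Because 'name' is the primary key of 'descs', each node holds at most one description, which is why an ambiguous label has to be resolved to a single DBpedia entry before the table is filled.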
