author Terry Truong <terry06890@gmail.com> 2022-05-14 19:30:43 +1000
committer Terry Truong <terry06890@gmail.com> 2022-05-14 19:39:10 +1000
commit c97acf8852e2017fd4776d65069f707121405f43
tree 1c0d725b6ae496239036b0f1d1c4a2caadf209cf /backend/data/README.md
parent 7003ef7f92f3a8fed059dab2b37c0e203c000dba
Use DBpedia data for node descriptions
Add backend/data/dbpedia/ directory containing scripts and a README for obtaining DBpedia data, storing it in a db, converting/adding description data to data.db, and resolving tol-node DBpedia-node association conflicts (via DBpedia relations, manual listing, etc). This yields fewer descriptions than using enwiki (about 3/4 as many), but notably fewer mis-associations (e.g. the node Thor is now described as a shrimp instead of a god).
Diffstat (limited to 'backend/data/README.md')
-rw-r--r-- backend/data/README.md | 45
1 files changed, 29 insertions, 16 deletions
diff --git a/backend/data/README.md b/backend/data/README.md
index c4c46ba..b568f90 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -17,24 +17,37 @@ File Generation Process
3 Use genImgsForWeb.py to create cropped/resized images in img/, using
images in imgsReviewed, and also to add an 'images' table to data.db.
4 Node Description Data
- 1 Obtain data in enwiki/, as specified in it's README.
- 2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
- using data in enwiki/enwikiData.db, and the 'nodes' table.
+ - Using DBpedia
+ 1 Obtain data in dbpedia/, as specified in its README.
+ 2 Run genDbpData.py, which adds a 'descs' table to data.db, using
+ data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
+ - Using wikipedia dump (old method)
+ 1 Obtain data in enwiki/, as specified in its README.
+ 2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
+ using data in enwiki/enwikiData.db, and the 'nodes' table.
5 Reduced Tree Structure Data
1 Run genReducedTreeData.py, which adds a 'reduced_nodes' table to data.db,
using reducedTol/names.txt, and the 'nodes' and 'names' tables.
-data.db tables
+data.db Tables
==============
-- nodes <br>
- name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
-- names <br>
- name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
-- eol\_ids <br>
- id INT PRIMARY KEY, name TEXT
-- images <br>
- eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
-- descs <br>
- name TEXT PRIMARY KEY, desc TEXT, redirected INT
-- reduced\_nodes <br>
- name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT
+- nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+- names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
+- eol\_ids: id INT PRIMARY KEY, name TEXT
+- images: eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
+- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT
+- reduced\_nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+
+Other Files
+===========
+- dbpPickedLabels.txt <br>
+ Contains DBpedia labels, one per line. Used by genDbpData.py to help
+ resolve conflicts when associating tree-of-life node names with
+ DBpedia node labels. Was generated by manually editing the output
+ of genDbpConflicts.py.
+- genDbpConflicts.py <br>
+ Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
+ and looks for potential conflicts that would arise when genDbpData.py
+ tries to associate tree-of-life node names with DBpedia node labels. It
+ writes data about them to conflicts.txt, which can be manually edited
+ to resolve them.
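
For reference, the table definitions listed under "data.db Tables" in the new README can be tried out with a minimal sketch like the one below. It assumes the column lists map directly onto SQLite `CREATE TABLE` statements; the real generation scripts may create the tables differently. The helper names `create_example_db` and `nodes_without_desc` and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Column lists copied from the README's "data.db Tables" section.
# "desc" is quoted because DESC is also an SQL keyword.
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT);
CREATE TABLE names (name TEXT, alt_name TEXT, pref_alt INT, PRIMARY KEY (name, alt_name));
CREATE TABLE eol_ids (id INT PRIMARY KEY, name TEXT);
CREATE TABLE images (eol_id INT PRIMARY KEY, source_url TEXT, license TEXT, copyright_owner TEXT);
CREATE TABLE descs (name TEXT PRIMARY KEY, "desc" TEXT, redirected INT);
CREATE TABLE reduced_nodes (name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT);
"""


def create_example_db():
    """Return an in-memory database with the tables above (hypothetical helper)."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    return con


def nodes_without_desc(con):
    """List node names that have no row in 'descs' (hypothetical helper)."""
    rows = con.execute(
        "SELECT nodes.name FROM nodes "
        "LEFT JOIN descs ON nodes.name = descs.name "
        "WHERE descs.name IS NULL ORDER BY nodes.name"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    con = create_example_db()
    # Placeholder rows, not real project data.
    con.executemany("INSERT INTO nodes (name) VALUES (?)", [("node a",), ("node b",)])
    con.execute(
        'INSERT INTO descs (name, "desc", redirected) VALUES (?, ?, ?)',
        ("node a", "placeholder description", 0),
    )
    print(nodes_without_desc(con))
```

A query of this kind is one way to compare how many nodes end up with a description under the DBpedia and enwiki methods, since both write to the same 'descs' table.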