Diffstat (limited to 'backend/data/README.md')
| -rw-r--r-- | backend/data/README.md | 45 |
1 files changed, 29 insertions, 16 deletions
diff --git a/backend/data/README.md b/backend/data/README.md
index c4c46ba..b568f90 100644
--- a/backend/data/README.md
+++ b/backend/data/README.md
@@ -17,24 +17,37 @@ File Generation Process
     3 Use genImgsForWeb.py to create cropped/resized images in img/, using
       images in imgsReviewed, and also to add an 'images' table to data.db.
 4 Node Description Data
-    1 Obtain data in enwiki/, as specified in it's README.
-    2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
-      using data in enwiki/enwikiData.db, and the 'nodes' table.
+    - Using DBpedia
+        1 Obtain data in dbpedia/, as specified in it's README.
+        2 Run genDbpData.py, which adds a 'descs' table to data.db, using
+          data in dbpedia/dbpData.db, dbpPickedLabels.txt, and the 'nodes' table.
+    - Using wikipedia dump (old method)
+        1 Obtain data in enwiki/, as specified in it's README.
+        2 Run genEnwikiData.py, which adds a 'descs' table to data.db,
+          using data in enwiki/enwikiData.db, and the 'nodes' table.
 5 Reduced Tree Structure Data
     1 Run genReducedTreeData.py, which adds a 'reduced_nodes' table to data.db,
       using reducedTol/names.txt, and the 'nodes' and 'names' tables.
 
-data.db tables
+data.db Tables
 ==============
-- nodes <br>
-  name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
-- names <br>
-  name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
-- eol\_ids <br>
-  id INT PRIMARY KEY, name TEXT
-- images <br>
-  eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
-- descs <br>
-  name TEXT PRIMARY KEY, desc TEXT, redirected INT
-- reduced\_nodes <br>
-  name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT
+- nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+- names: name TEXT, alt\_name TEXT, pref\_alt INT, PRIMARY KEY(name, alt\_name)
+- eol\_ids: id INT PRIMARY KEY, name TEXT
+- images: eol\_id INT PRIMARY KEY, source\_url TEXT, license TEXT, copyright\_owner TEXT
+- descs: name TEXT PRIMARY KEY, desc TEXT, redirected INT
+- reduced\_nodes: name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p\_support INT
+
+Other Files
+===========
+- dbpPickedLabels.txt <br>
+  Contains DBpedia labels, one per line. Used by genDbpData.py to help
+  resolve conflicts when associating tree-of-life node names with
+  DBpedia node labels. Was generated by manually editing the output
+  of genDbpConflicts.py.
+- genDbpConflicts.py <br>
+  Reads data from dbpedia/dbpData.db, and the 'nodes' table of data.db,
+  and looks for potential conflicts that would arise when genDbpData.db
+  tries to associate tree-of-life node names wth DBpedia node labels. It
+  writes data about them to conflicts.txt, which can be manually edited
+  to resolve them.
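
The 'data.db Tables' entries in the new README read as SQLite column definitions, so they can be checked by creating the tables directly. The following is a minimal sketch, not the project's own code: it assumes the escaped names in the README (such as p\_support) correspond to plain underscores in the database, and `create_tables`, `get_desc`, and the sample row are hypothetical names and data used only for illustration.

```python
import sqlite3

# Column definitions copied from the 'data.db Tables' list in the README,
# with Markdown escapes removed (assumption: '\_' stands for a plain '_').
SCHEMA = {
    "nodes": "name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT",
    "names": "name TEXT, alt_name TEXT, pref_alt INT, PRIMARY KEY(name, alt_name)",
    "eol_ids": "id INT PRIMARY KEY, name TEXT",
    "images": "eol_id INT PRIMARY KEY, source_url TEXT, license TEXT, copyright_owner TEXT",
    "descs": "name TEXT PRIMARY KEY, desc TEXT, redirected INT",
    "reduced_nodes": "name TEXT PRIMARY KEY, children TEXT, parent TEXT, tips INT, p_support INT",
}


def create_tables(conn):
    """Create every table listed in SCHEMA if it does not already exist."""
    for table, columns in SCHEMA.items():
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    conn.commit()


def get_desc(conn, name):
    """Return (desc, redirected) for a node name, or None if there is no row.

    'desc' is also an SQL keyword, so it is quoted in the query.
    """
    query = 'SELECT "desc", redirected FROM descs WHERE name = ?'
    return conn.execute(query, (name,)).fetchone()


if __name__ == "__main__":
    # An in-memory database stands in for data.db; the row is placeholder data.
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    conn.execute(
        "INSERT INTO descs VALUES (?, ?, ?)",
        ("example node", "placeholder description", 0),
    )
    print(get_desc(conn, "example node"))
```

In the actual workflow the tables are populated by the generation scripts named in the README (for example genDbpData.py for 'descs'), so a helper like this would only be used to inspect or sanity-check an existing data.db.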
