Diffstat (limited to 'backend/hist_data/README.md')
| -rw-r--r-- | backend/hist_data/README.md | 28 |
1 files changed, 12 insertions, 16 deletions
diff --git a/backend/hist_data/README.md b/backend/hist_data/README.md
index b557b14..d05016c 100644
--- a/backend/hist_data/README.md
+++ b/backend/hist_data/README.md
@@ -21,6 +21,12 @@ This directory holds files used to generate the history database data.db.
 - `pop`: <br>
     Format: `id INT PRIMARY KEY, pop INT` <br>
     Associates each event with a popularity measure (currently an average monthly viewcount)
+- `dist`: <br>
+    Format: `scale INT, unit INT, count INT, PRIMARY KEY (scale, unit)` <br>
+    Maps scale units to counts of events in them.
+- `event_disp`: <br>
+    Format: `id INT, scale INT, PRIMARY KEY (id, scale)` <br>
+    Maps events to scales they are 'displayable' on (used to make displayed events more uniform across time).
 - `images`: <br>
     Format: `id INT PRIMARY KEY, url TEXT, license TEXT, artist TEXT, credit TEXT` <br>
     Holds metadata for available images
@@ -30,12 +36,6 @@ This directory holds files used to generate the history database data.db.
 - `descs`: <br>
     Format: `id INT PRIMARY KEY, wiki_id INT, desc TEXT` <br>
     Associates an event's enwiki title with a short description.
-- `dist`: <br>
-    Format: `scale INT, unit INT, count INT, PRIMARY KEY (scale, unit)` <br>
-    Maps scale units to event counts.
-- `event_disp`: <br>
-    Format: `id INT, scale INT, PRIMARY KEY (id, scale)` <br>
-    Maps events to scales they are 'displayable' on (used to make displayed events more uniform across time).
 
 # Generating the Database
 
@@ -51,13 +51,15 @@ Some of the scripts use third-party packages:
 1. Run `gen_events_data.py`, which creates `data.db`, and adds the `events` table.
 
 ## Generate Popularity Data
-1. Obtain 'page view files' in enwiki/, as specified in it's README.
+1. Obtain an enwiki dump and 'page view files' in enwiki/, as specified in the README.
 1. Run `gen_pop_data.py`, which adds the `pop` table, using data in enwiki/ and the `events` table.
 
+## Generate Event Display Data, and Reduce Dataset
+1. Run `gen_disp_data.py`, which adds the `dist` and `event_disp` tables, and removes events not in `event_disp`.
+
 ## Generate Image Data and Popularity Data
 1. In enwiki/, run `gen_img_data.py` which looks at pages in the dump that match entries in
     `events`, looks for infobox image names, and stores them in an image database.
-    Uses popularity data in enwiki/ to find the top N events in each event category.
 1. In enwiki/, run `download_img_license_info.py`, which downloads licensing info for found
     images, and adds them to the image database.
 1. In enwiki/, run `download_imgs.py`, which downloads images into enwiki/imgs/.
@@ -69,11 +71,8 @@ Some of the scripts use third-party packages:
     - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.
 
 ## Generate Description Data
-1. Obtain an enwiki dump in enwiki/, as specified in the README.
-1. In enwiki/, run `gen_dump_index.db.py`, which generates a database for indexing the dump.
 1. In enwiki/, run `gen_desc_data.py`, which extracts page descriptions into a database.
-1. Run `gen_desc_data.py`, which adds the `descs` table, using data in enwiki/,
-    and the `events` and `images` tables (only adds descriptions for events with images).
+1. Run `gen_desc_data.py`, which adds the `descs` table, using data in enwiki/, and the `events` table.
 
 ## Optionally Add Extra Event Data
 1. Additional events can be described in `picked/events.json`, with images for them put
@@ -81,7 +80,4 @@ Some of the scripts use third-party packages:
 1. Can run `gen_picked_data.py` to add those described events to the database.
 
 ## Remove Events Without Images/Descs
-1. Run `reduce_event_data.py` to remove data for events that have no image/description.
-
-## Generate Distribution and Displayability Data
-1. Run `gen_disp_data.py`, which add the `dist` and `event_disp` tables.
+1. Run `reduce_event_data.py` to remove data for events that have no image.
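
For reference, the `dist` and `event_disp` formats that this change moves within the README can be tried out with Python's `sqlite3` module. This is a minimal sketch, not the logic of `gen_disp_data.py`: the helper name `create_disp_tables`, the in-memory database, and the sample rows (scale 100, unit 19, event ids 1 and 2) are all assumptions made up for illustration.

```python
import sqlite3


def create_disp_tables(con: sqlite3.Connection) -> None:
    """Create empty tables using the column formats quoted in the README."""
    # dist: one row per (scale, unit) pair, holding the number of events in that unit.
    con.execute(
        'CREATE TABLE dist (scale INT, unit INT, count INT, PRIMARY KEY (scale, unit))')
    # event_disp: one row per (event id, scale) pair on which the event is displayable.
    con.execute(
        'CREATE TABLE event_disp (id INT, scale INT, PRIMARY KEY (id, scale))')


# Hypothetical data: two events counted in unit 19 of scale 100.
con = sqlite3.connect(':memory:')
create_disp_tables(con)
con.execute('INSERT INTO dist VALUES (?, ?, ?)', (100, 19, 2))
con.executemany('INSERT INTO event_disp VALUES (?, ?)', [(1, 100), (2, 100)])

# Look up which events are displayable on scale 100.
displayable = con.execute(
    'SELECT id FROM event_disp WHERE scale = ? ORDER BY id', (100,)).fetchall()
print(displayable)
```

The composite primary keys mean a given (scale, unit) or (id, scale) pair can appear only once, so inserting a duplicate pair raises `sqlite3.IntegrityError`.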
