From 0e5e46cedaaeacf59cfd0f2e30c1ae6923466870 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Fri, 30 Dec 2022 23:28:09 +1100
Subject: Generate event_disp data before image-generation

Make gen_disp_data.py delete non-displayable events
Make reduce_event_data.py also delete from 'dist' and 'event_disp'
Remove MAX_IMGS_PER_CTG from enwiki/gen_img_data.py
Make gen_desc_data.py include events without images
---
 backend/hist_data/README.md | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

(limited to 'backend/hist_data/README.md')

diff --git a/backend/hist_data/README.md b/backend/hist_data/README.md
index b557b14..d05016c 100644
--- a/backend/hist_data/README.md
+++ b/backend/hist_data/README.md
@@ -21,6 +21,12 @@ This directory holds files used to generate the history database data.db.
 - `pop`:
  Format: `id INT PRIMARY KEY, pop INT`
  Associates each event with a popularity measure (currently an average monthly viewcount)
+- `dist`:
+ Format: `scale INT, unit INT, count INT, PRIMARY KEY (scale, unit)`
+ Maps scale units to counts of events in them.
+- `event_disp`:
+ Format: `id INT, scale INT, PRIMARY KEY (id, scale)`
+ Maps events to scales they are 'displayable' on (used to make displayed events more uniform across time).
 - `images`:
  Format: `id INT PRIMARY KEY, url TEXT, license TEXT, artist TEXT, credit TEXT`
  Holds metadata for available images
@@ -30,12 +36,6 @@ This directory holds files used to generate the history database data.db.
 - `descs`:
  Format: `id INT PRIMARY KEY, wiki_id INT, desc TEXT`
  Associates an event's enwiki title with a short description.
-- `dist`:
- Format: `scale INT, unit INT, count INT, PRIMARY KEY (scale, unit)`
- Maps scale units to event counts.
-- `event_disp`:
- Format: `id INT, scale INT, PRIMARY KEY (id, scale)`
- Maps events to scales they are 'displayable' on (used to make displayed events more uniform across time).

 # Generating the Database

@@ -51,13 +51,15 @@ Some of the scripts use third-party packages:
 1. Run `gen_events_data.py`, which creates `data.db`, and adds the `events` table.

 ## Generate Popularity Data
-1. Obtain 'page view files' in enwiki/, as specified in it's README.
+1. Obtain an enwiki dump and 'page view files' in enwiki/, as specified in the README.
 1. Run `gen_pop_data.py`, which adds the `pop` table, using data in enwiki/ and the `events` table.

+## Generate Event Display Data, and Reduce Dataset
+1. Run `gen_disp_data.py`, which adds the `dist` and `event_disp` tables, and removes events not in `event_disp`.
+
 ## Generate Image Data and Popularity Data
 1. In enwiki/, run `gen_img_data.py` which looks at pages in the dump that match entries in `events`, looks for infobox image names, and stores them in an image database.
- Uses popularity data in enwiki/ to find the top N events in each event category.
 1. In enwiki/, run `download_img_license_info.py`, which downloads licensing info for found images, and adds them to the image database.
 1. In enwiki/, run `download_imgs.py`, which downloads images into enwiki/imgs/.
@@ -69,11 +71,8 @@ Some of the scripts use third-party packages:
 - An input x.gif might produce x-1.jpg, x-2.jpg, etc, instead of x.jpg.

 ## Generate Description Data
-1. Obtain an enwiki dump in enwiki/, as specified in the README.
-1. In enwiki/, run `gen_dump_index.db.py`, which generates a database for indexing the dump.
 1. In enwiki/, run `gen_desc_data.py`, which extracts page descriptions into a database.
-1. Run `gen_desc_data.py`, which adds the `descs` table, using data in enwiki/,
- and the `events` and `images` tables (only adds descriptions for events with images).
+1. Run `gen_desc_data.py`, which adds the `descs` table, using data in enwiki/, and the `events` table.

 ## Optionally Add Extra Event Data
 1. Additional events can be described in `picked/events.json`, with images for them put
@@ -81,7 +80,4 @@ Some of the scripts use third-party packages:
 1. Can run `gen_picked_data.py` to add those described events to the database.

 ## Remove Events Without Images/Descs
-1. Run `reduce_event_data.py` to remove data for events that have no image/description.
-
-## Generate Distribution and Displayability Data
-1. Run `gen_disp_data.py`, which add the `dist` and `event_disp` tables.
+1. Run `reduce_event_data.py` to remove data for events that have no image.
-- 
cgit v1.2.3
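
The new "Generate Event Display Data, and Reduce Dataset" step can be illustrated with a minimal sketch, which is not the actual `gen_disp_data.py`. It uses Python's standard `sqlite3` module and an in-memory database. The `dist` and `event_disp` schemas are taken from the README text in the patch. The two-column `events` table, the sample rows, and the function names `build_demo_db` and `remove_non_displayable` are assumptions made only for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with illustrative sample rows."""
    con = sqlite3.connect(":memory:")
    # Hypothetical minimal 'events' table; the real schema is not shown in the patch.
    con.execute("CREATE TABLE events (id INT PRIMARY KEY, title TEXT)")
    # Schemas as documented in the README hunk.
    con.execute(
        "CREATE TABLE dist (scale INT, unit INT, count INT, PRIMARY KEY (scale, unit))"
    )
    con.execute("CREATE TABLE event_disp (id INT, scale INT, PRIMARY KEY (id, scale))")
    # Made-up sample data: event 2 is not displayable on any scale.
    con.executemany(
        "INSERT INTO events VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")]
    )
    con.executemany(
        "INSERT INTO event_disp VALUES (?, ?)", [(1, 100), (1, 10), (3, 100)]
    )
    con.execute("INSERT INTO dist VALUES (?, ?, ?)", (100, 20, 2))
    con.commit()
    return con


def remove_non_displayable(con: sqlite3.Connection) -> int:
    """Delete events that have no row in event_disp; return the number removed."""
    cur = con.execute(
        "DELETE FROM events WHERE id NOT IN (SELECT id FROM event_disp)"
    )
    con.commit()
    return cur.rowcount


con = build_demo_db()
removed = remove_non_displayable(con)
remaining = [row[0] for row in con.execute("SELECT id FROM events ORDER BY id")]
print(removed, remaining)
```

Running the reduction before image generation, as the commit subject describes, means later steps only see events that remain in `events`.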