| field | value | date |
|---|---|---|
| author | Terry Truong <terry06890@gmail.com> | 2022-08-30 17:54:10 +1000 |
| committer | Terry Truong <terry06890@gmail.com> | 2022-08-30 17:54:10 +1000 |
| commit | 0cd58b3c1a8c5297579ea7a24a14d82ae8fed169 | |
| tree | 17c02e7578a0f7b09461f3bca0fa785301292744 /backend/tolData/enwiki/README.md | |
| parent | 0f39be89c3d5620b8187b1d7621b7680800c268b | |
Add node-popularity data for search-sugg ordering
Add Wikipedia pageview dumps to enwiki/pageview/
Add scripts to generate viewcount averages
Update backend to sort search suggestions by popularity
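The last point above (ordering search suggestions by popularity) can be sketched roughly as follows. This is a hypothetical illustration, not the actual backend code: the names `get_suggestions` and `VIEW_COUNTS` are invented, and the real backend presumably reads its counts from the pageviewData.db database added in this commit rather than from an in-memory dict.

```python
# Hypothetical sketch of popularity-ordered search suggestions.
# VIEW_COUNTS stands in for the average-monthly-views data that the
# real backend would read from pageviewData.db; values are made up.
VIEW_COUNTS = {
    "Felis catus": 450000,
    "Felidae": 120000,
    "Felis": 30000,
}

def get_suggestions(prefix: str, titles: list[str], limit: int = 5) -> list[str]:
    """Return titles matching `prefix`, most-viewed first."""
    matches = [t for t in titles if t.lower().startswith(prefix.lower())]
    # Sort by view count, descending; unknown titles count as 0 views.
    matches.sort(key=lambda t: VIEW_COUNTS.get(t, 0), reverse=True)
    return matches[:limit]

print(get_suggestions("fel", list(VIEW_COUNTS)))
# -> ['Felis catus', 'Felidae', 'Felis']
```

The key design point is that matching and ranking are separate steps: prefix matching selects candidates, and the popularity data only decides their order.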
Diffstat (limited to 'backend/tolData/enwiki/README.md')
| mode | file | changes |
|---|---|---|
| -rw-r--r-- | backend/tolData/enwiki/README.md | 16 |

1 file changed, 14 insertions(+), 2 deletions(-)
```diff
diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md
index 7df21c9..76f9ee5 100644
--- a/backend/tolData/enwiki/README.md
+++ b/backend/tolData/enwiki/README.md
@@ -2,8 +2,8 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Downloaded Files
 - enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
-  Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror). Contains text content and metadata for pages in enwiki.
+  Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror). Some file content and format information was available from <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
 - enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
@@ -13,7 +13,7 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Dump-Index Files
 - genDumpIndexDb.py <br>
-  Creates an sqlite-database version of the enwiki-dump index file.
+  Creates a database version of the enwiki-dump index file.
 - dumpIndex.db <br>
   Generated by genDumpIndexDb.py. <br>
   Tables: <br>
@@ -45,6 +45,18 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 - downloadImgs.py <br>
   Used to download image files into imgs/.
+# Page View Files
+- pageviews/pageviews-*-user.bz2
+  Each holds wikimedia article page view data for some month.
+  Obtained via <https://dumps.wikimedia.org/other/pageview_complete/monthly/>.
+  Some format info was available from <https://dumps.wikimedia.org/other/pageview_complete/readme.html>.
+- genPageviewData.py <br>
+  Reads pageview/*, and creates a database holding average monthly pageview counts.
+- pageviewData.db <br>
+  Generated using genPageviewData.py. <br>
+  Tables: <br>
+  - `views`: `title TEXT PRIMARY KEY, id INT, views INT`
+
 # Other Files
 - lookupPage.py <br>
   Running `lookupPage.py title1` looks in the dump for a page with a given title,
```
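The genPageviewData.py script itself is not part of this diff, but the averaging step the README describes (read per-month pageview dumps, write a `views` table of average monthly counts) could look roughly like the sketch below. The input line format is an assumption based on the pageview_complete readme the README links to (`wiki title page_id access_method count hourly_breakdown`), and `build_views_db` is an invented name, not the script's actual code.

```python
# Minimal sketch: average per-month view counts into a `views` table
# matching the schema in the README (title TEXT PRIMARY KEY, id INT, views INT).
# Assumes each input line looks like:
#   "en.wikipedia Cat 6678 desktop 100 A100"
import sqlite3

def build_views_db(monthly_lines: list[list[str]], db_path: str = ":memory:") -> sqlite3.Connection:
    """One inner list of lines per monthly dump; returns an open connection."""
    totals: dict[str, int] = {}
    ids: dict[str, int] = {}
    for lines in monthly_lines:
        for line in lines:
            wiki, title, page_id, _method, count, _hourly = line.split(" ")
            if wiki != "en.wikipedia":  # keep enwiki rows only
                continue
            totals[title] = totals.get(title, 0) + int(count)
            ids[title] = int(page_id)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE views (title TEXT PRIMARY KEY, id INT, views INT)")
    num_months = len(monthly_lines)
    for title, total in totals.items():
        # Integer average of views across the months provided.
        conn.execute("INSERT INTO views VALUES (?, ?, ?)",
                     (title, ids[title], total // num_months))
    conn.commit()
    return conn
```

In practice the real script would stream-decompress the `pageviews-*-user.bz2` files (e.g. with `bz2.open`) rather than take lists of lines, but the aggregation logic would be similar.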
