From 0cd58b3c1a8c5297579ea7a24a14d82ae8fed169 Mon Sep 17 00:00:00 2001
From: Terry Truong
Date: Tue, 30 Aug 2022 17:54:10 +1000
Subject: Add node-popularity data for search-sugg ordering

Add Wikipedia pageview dumps to enwiki/pageview/
Add scripts to generate viewcount averages
Update backend to sort search suggestions by popularity
---
 backend/tolData/enwiki/README.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

(limited to 'backend/tolData/enwiki/README.md')

diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md
index 7df21c9..76f9ee5 100644
--- a/backend/tolData/enwiki/README.md
+++ b/backend/tolData/enwiki/README.md
@@ -2,8 +2,8 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Downloaded Files
 - enwiki-20220501-pages-articles-multistream.xml.bz2
-    Obtained via (site suggests downloading from a mirror). Contains text content and metadata for pages in enwiki.
+    Obtained via (site suggests downloading from a mirror). Some file content and format information was available from .
 - enwiki-20220501-pages-articles-multistream-index.txt.bz2
@@ -13,7 +13,7 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 # Dump-Index Files
 - genDumpIndexDb.py
-    Creates an sqlite-database version of the enwiki-dump index file.
+    Creates a database version of the enwiki-dump index file.
 - dumpIndex.db
     Generated by genDumpIndexDb.py.
     Tables:
@@ -45,6 +45,18 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
 - downloadImgs.py
     Used to download image files into imgs/.
+# Page View Files
+- pageviews/pageviews-*-user.bz2
+    Each holds wikimedia article page view data for some month.
+    Obtained via .
+    Some format info was available from .
+- genPageviewData.py
+    Reads pageview/*, and creates a database holding average monthly pageview counts.
+- pageviewData.db
+    Generated using genPageviewData.py.
+    Tables:
+        - `views`: `title TEXT PRIMARY KEY, id INT, views INT`
+
 # Other Files
 - lookupPage.py
     Running `lookupPage.py title1` looks in the dump for a page with a given title,
-- 
cgit v1.2.3
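
The `views` table described in this patch could be populated roughly as follows. This is a minimal sketch, not the contents of genPageviewData.py. It assumes an SQLite database (the README only says "database"), takes already-parsed per-month counts instead of reading the .bz2 dumps, and treats a page missing from a month as having zero views that month. `build_views_db` and its parameters are hypothetical names, and the page IDs in the usage lines are made up.

```python
import sqlite3
from collections import defaultdict


def build_views_db(monthly_counts, db_path=":memory:"):
    """Create a `views` table holding average monthly view counts.

    monthly_counts: list with one dict per month, each mapping
    page title -> (page_id, views_in_that_month). Hypothetical input
    shape, chosen for illustration only.
    """
    totals = defaultdict(int)  # title -> summed views across all months
    ids = {}                   # title -> page id (last one seen wins)
    for month in monthly_counts:
        for title, (page_id, views) in month.items():
            totals[title] += views
            ids[title] = page_id

    num_months = len(monthly_counts)
    con = sqlite3.connect(db_path)
    # Same schema as listed in the README hunk.
    con.execute("CREATE TABLE views (title TEXT PRIMARY KEY, id INT, views INT)")
    con.executemany(
        "INSERT INTO views VALUES (?, ?, ?)",
        # Integer average over all months supplied (assumption).
        [(title, ids[title], totals[title] // num_months) for title in totals],
    )
    con.commit()
    return con


# Usage: two months of made-up data, then read back most-viewed first.
months = [{"Cat": (1, 300), "Dog": (2, 100)}, {"Cat": (1, 500)}]
con = build_views_db(months)
rows = con.execute("SELECT title, views FROM views ORDER BY views DESC").fetchall()
```

Ordering by the `views` column, as in the last query, is one way a backend could rank search suggestions by popularity, which is what the commit message describes.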