author Terry Truong <terry06890@gmail.com> 2022-08-30 17:54:10 +1000
committer Terry Truong <terry06890@gmail.com> 2022-08-30 17:54:10 +1000
commit 0cd58b3c1a8c5297579ea7a24a14d82ae8fed169
tree 17c02e7578a0f7b09461f3bca0fa785301292744
parent 0f39be89c3d5620b8187b1d7621b7680800c268b
Add node-popularity data for search-sugg ordering

Add Wikipedia pageview dumps to enwiki/pageview/
Add scripts to generate viewcount averages
Update backend to sort search suggestions by popularity

Diffstat (limited to 'backend/tolData/enwiki/README.md')
-rw-r--r-- backend/tolData/enwiki/README.md | 16
1 files changed, 14 insertions, 2 deletions
diff --git a/backend/tolData/enwiki/README.md b/backend/tolData/enwiki/README.md
index 7df21c9..76f9ee5 100644
--- a/backend/tolData/enwiki/README.md
+++ b/backend/tolData/enwiki/README.md
@@ -2,8 +2,8 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
# Downloaded Files
- enwiki-20220501-pages-articles-multistream.xml.bz2 <br>
- Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
Contains text content and metadata for pages in enwiki.
+ Obtained via <https://dumps.wikimedia.org/backup-index.html> (site suggests downloading from a mirror).
Some file content and format information was available from
<https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download>.
- enwiki-20220501-pages-articles-multistream-index.txt.bz2 <br>
@@ -13,7 +13,7 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
# Dump-Index Files
- genDumpIndexDb.py <br>
- Creates an sqlite-database version of the enwiki-dump index file.
+ Creates a database version of the enwiki-dump index file.
- dumpIndex.db <br>
Generated by genDumpIndexDb.py. <br>
Tables: <br>
@@ -45,6 +45,18 @@ This directory holds files obtained/derived from [English Wikipedia](https://en.
- downloadImgs.py <br>
Used to download image files into imgs/.
+# Page View Files
+- pageviews/pageviews-*-user.bz2
+ Each holds wikimedia article page view data for some month.
+ Obtained via <https://dumps.wikimedia.org/other/pageview_complete/monthly/>.
+ Some format info was available from <https://dumps.wikimedia.org/other/pageview_complete/readme.html>.
+- genPageviewData.py <br>
+ Reads pageview/*, and creates a database holding average monthly pageview counts.
+- pageviewData.db <br>
+ Generated using genPageviewData.py. <br>
+ Tables: <br>
+ - `views`: `title TEXT PRIMARY KEY, id INT, views INT`
+
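
The averaging step described for genPageviewData.py could look roughly like the sketch below. This is not the script from the commit; it is a minimal illustration under stated assumptions: each line of a monthly dump is taken to be space-separated as `wiki_code page_title page_id access_method monthly_total ...`, only `en.wikipedia` rows are kept, and the average is the integer quotient of the summed views by the number of months read. The helper names (`read_lines`, `sum_month`, `average_views`, `write_db`) are hypothetical; only the `views` table schema comes from the README.

```python
import bz2
import sqlite3


def read_lines(path):
    """Yield text lines from one bz2-compressed monthly pageview file."""
    with bz2.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        yield from f


def sum_month(lines, wiki='en.wikipedia'):
    """Return {title: (page_id, views)} for one month.

    Assumed line layout (not confirmed by the README):
    wiki_code page_title page_id access_method monthly_total ...
    Counts are summed across access methods (desktop, mobile-web, ...).
    """
    totals = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != wiki:
            continue
        title, page_id, count = fields[1], fields[2], fields[4]
        if not (page_id.isdigit() and count.isdigit()):
            continue  # skip rows with a missing page ID or malformed count
        _, prev = totals.get(title, (0, 0))
        totals[title] = (int(page_id), prev + int(count))
    return totals


def average_views(months):
    """Average per-title views over a list of per-month dicts."""
    summed = {}
    for month in months:
        for title, (page_id, count) in month.items():
            _, prev = summed.get(title, (0, 0))
            summed[title] = (page_id, prev + count)
    n = max(len(months), 1)
    # Integer average; a title absent from a month counts as zero that month.
    return {t: (pid, total // n) for t, (pid, total) in summed.items()}


def write_db(conn, averages):
    """Store the averages using the `views` schema listed in the README."""
    conn.execute(
        'CREATE TABLE views (title TEXT PRIMARY KEY, id INT, views INT)')
    conn.executemany(
        'INSERT INTO views VALUES (?, ?, ?)',
        ((t, pid, v) for t, (pid, v) in averages.items()))
    conn.commit()
```

With real data, each month's dict would come from `sum_month(read_lines(path))`. Ordering search suggestions by popularity then reduces to a query such as `SELECT title FROM views ORDER BY views DESC`.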
# Other Files
- lookupPage.py <br>
Running `lookupPage.py title1` looks in the dump for a page with a given title,