Processing 10TB of Wikipedia Page Views — Part 1

Data Engineering

My Wikipedia, circa the late '60s

Introduction

Page views aren’t enough

Wikidata to the rescue

What problem are we trying to solve?

The naive approach (and why it’s not the best plan)

Can we script it?

# Independent variables
USAGE="$0 [-d] [all|year|month|day]"
DEBUG=0
EOL="\n" # line terminator for progress output (echo -en)
PROJECT=bigquery-public-data-staging
BUCKET=wiki-staging
DOMAIN=dumps.wikimedia.org
SRC_BASE=https://$DOMAIN
DST_BASE=gs://$BUCKET
SRC_VIEW_PATH=other/pageviews
DST_VIEW_PATH=$DOMAIN/$SRC_VIEW_PATH
SRC_VIEW_URL=$SRC_BASE/$SRC_VIEW_PATH
DST_VIEW_URL=$DST_BASE/$DST_VIEW_PATH
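# With the values above, SRC_VIEW_URL resolves to
#   https://dumps.wikimedia.org/other/pageviews
# and DST_VIEW_URL to
#   gs://wiki-staging/dumps.wikimedia.org/other/pageviews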
# Dependent variables
WINDOW="${1:-day}"
HOUR=$(date +%H)
TODAY=$(date '+%s')
YYYY="${YYYY:-$(date --date=@$TODAY +%Y)}"
MM="${MM:-$(date --date=@$TODAY +%m)}"
DD="${DD:-$(date --date=@$TODAY +%d)}"
case "$WINDOW" in
all)
S1=/; S2=*/*/; S3=pageviews-*.gz;;
year)
S1=/$YYYY/; S2=*/; S3=pageviews-$YYYY*.gz;;
month)
S1=/$YYYY/$YYYY-$MM/; S2=; S3=pageviews-$YYYY$MM*.gz;;
day)
S1=/$YYYY/$YYYY-$MM/; S2=; S3=pageviews-$YYYY$MM$DD-*.gz;;
*)
echo "$USAGE"
exit 1;;
esac
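# Example: for WINDOW=day on 2020-03-15 (an assumed date), the case above yields
#   S1=/2020/2020-03/  S2=(empty)  S3=pageviews-20200315-*.gz
# so wget spiders https://dumps.wikimedia.org/other/pageviews/2020/2020-03/
# and gsutil matches gs://wiki-staging/dumps.wikimedia.org/other/pageviews/2020/2020-03/pageviews-20200315-*.gz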
# Assemble list of every pageview log file and size on website.
wget --no-parent -nv --spider -S -r -A "$S3" $SRC_VIEW_URL$S1 2>&1 |
awk 'function base(file, a, n) {n = split(file,a,"/"); return a[n]} \
$1 == "Content-Length:" {len=$2} $3 == "URL:" {print base($4), len}' |
sort >src-files.txt
# Assemble list of every pageview log file and size in cloud storage.
if gsutil stat $DST_VIEW_URL$S1$S2$S3 >/dev/null 2>&1
then
gsutil ls -l -r $DST_VIEW_URL$S1$S2$S3 2>/dev/null | grep -v ":$" |
awk 'function base(file, a, n) {n = split(file,a,"/"); return a[n]} \
$1 != "TOTAL:" {print base($3), $1}' |
sort >dst-files.txt
else
# No matching objects in cloud storage yet - start with an empty list.
: >dst-files.txt
fi

# One-sided diff - find every web file that either doesn't exist or
# doesn't match size of corresponding object in cloud storage.
WORK_TO_DO=0
comm -23 src-files.txt dst-files.txt >diffs.txt
while read FILE SIZE
do
DIR=`echo $FILE | awk '{y=substr($1,11,4);m=substr($1,15,2);
printf("%s/%s-%s",y,y,m)}'`
echo -en "$SRC_VIEW_URL/$DIR/$FILE$EOL"
wget -q $SRC_VIEW_URL/$DIR/$FILE
gsutil cp $FILE $DST_VIEW_URL/$DIR/$FILE
rm -f $FILE
WORK_TO_DO=1
done <diffs.txt
# If we found any missing or changed files, update our collection.
if [ "$WORK_TO_DO" = "1" ]
then
./update.sh $YYYY $MM $DD
fi
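# update.sh - called above once new files are staged: creates the yearly
# BigQuery table if necessary, then loads any hours not already present.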
if [ "$1" = "" -o "$2" = "" -o "$3" = "" ]
then
echo "usage: $0 year month day"
exit 1
fi
YEAR=$1
MONTH=$2
DAY=$3
QUERY=$(cat <<EOF
CREATE TABLE IF NOT EXISTS \`bigquery-public-data.wikipedia.pageviews_$YEAR\`
(datehour TIMESTAMP, wiki STRING, title STRING, views INT64)
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
OPTIONS(
description = 'Wikipedia pageviews from http://dumps.wikimedia.your.org/other/pageviews/, partitioned by date, clustered by (wiki, title)',
require_partition_filter = true
)
EOF
)
echo "creating BigQuery table (if necessary) for $YEAR..."
bq query -q --use_legacy_sql=false "$QUERY"
QUERY=$(cat <<EOF
INSERT INTO \`bigquery-public-data.wikipedia.pageviews_$YEAR\`
WITH already_loaded as (
SELECT DISTINCT datehour FROM \`bigquery-public-data.wikipedia.pageviews_$YEAR\`
WHERE datehour >= '$YEAR-$MONTH-$DAY')
SELECT datehour, wiki, SUBSTR(title, 0, 300) title, views
FROM \`bigquery-public-data-staging.wikipedia_pipeline.view_parsed\` t1
WHERE BYTE_LENGTH(wiki)+ BYTE_LENGTH(title) < 1024
AND BYTE_LENGTH(title) < 300
AND EXTRACT(YEAR FROM datehour)=$YEAR
AND EXTRACT(MONTH FROM datehour)=$MONTH
AND EXTRACT(DAY FROM datehour)=$DAY
AND NOT EXISTS (SELECT * FROM already_loaded t2 WHERE t2.datehour = t1.datehour)
EOF
)
echo "transferring data from GCS to BQ table for $YEAR-$MONTH-$DAY..."
bq query -q --use_legacy_sql=false "$QUERY"
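With the two scripts in place, keeping the collection current is just a matter of running the sync on a schedule. A minimal sketch, assuming the first script above is saved as sync.sh with update.sh beside it; the path, log file, and schedule are illustrative, not part of the original setup:

# Hypothetical crontab entry: pick up the current day's new pageview files
# once an hour and append progress output to a log.
10 * * * * /home/wiki/sync.sh day >>/var/log/wiki-sync.log 2>&1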

Wikidata decompression — We’re gonna need a bigger boat

BUCKET=wiki-staging
DOMAIN=dumps.wikimedia.org
SRC_BASE=https://$DOMAIN
DST_BASE=gs://$BUCKET
SRC_DATA_PATH=wikidatawiki/entities/latest-all.json.bz2
DST_DATA_PATH=$DOMAIN/$SRC_DATA_PATH
SRC_DATA_URL=$SRC_BASE/$SRC_DATA_PATH
DST_DATA_URL=$DST_BASE/$DST_DATA_PATH
EOL="\n" # line terminator for progress output (echo -en)
# Get name and size of latest wikidata file on website.
read SFILE SSIZE \
<<<$(wget -nv --spider -S -r -A ".bz2" -I $SRC_DATA_PATH $SRC_DATA_URL 2>&1 |
awk 'function base(file, a, n) {n = split(file,a,"/"); return a[n]} \
$1 == "Content-Length:" {len=$2} $3 == "URL:" {print base($4), len}')
# Get name and size of latest wikidata file copy we have in Cloud Storage.
read DFILE DSIZE \
<<<$(gsutil ls -l -r $DST_DATA_URL |
awk 'function base(file, a, n) {n = split(file,a,"/"); return a[n]} \
$1 != "TOTAL:" {print base($3), $1}')
# If new file available, display it for further processing.
if [ "$SFILE" != "$DFILE" -o "$SSIZE" != "$DSIZE" ]
then
echo -en "$SRC_DATA_URL$EOL"
fi
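On its own, this check only prints the dump URL when the upstream copy differs from what's in Cloud Storage, which makes it a cheap trigger for the heavier work below. A hypothetical wrapper, where check-wikidata.sh and create-vm.sh stand in for the listing above and the VM-creation commands that follow:

# Only spin up the large VM when the check script reports a new dump.
if [ -n "$(./check-wikidata.sh)" ]
then
./create-vm.sh
fi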

Let’s productionize this thing

VMNAME=wikiload
PROJECT=bigquery-public-data-staging
ZONE=us-central1-c
SCOPES="https://www.googleapis.com/auth/cloud-platform"
EOL="\n" # line terminator for progress output (echo -en)
echo -en "creating VM to migrate wikidata...$EOL"
gcloud beta compute instances create $VMNAME \
--zone=$ZONE \
--machine-type=m1-ultramem-80 \
--subnet=default \
--network-tier=PREMIUM \
--no-restart-on-failure \
--maintenance-policy=TERMINATE \
--scopes=$SCOPES \
--image=debian-10-buster-v20191014 \
--image-project=debian-cloud \
--boot-disk-size=1000GB \
--boot-disk-type=pd-ssd \
--boot-disk-device-name=wikiload \
--reservation-affinity=any \
--metadata-from-file startup-script=startup.sh \
--preemptible
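The real work happens in the startup-script the VM boots with; that file isn't reproduced here. As a rough sketch of the kind of job it has to run, the following streams the dump, decompresses it on the fly, and stages the result back to Cloud Storage in roughly 4 GB chunks. The use of lbzip2, the chunk size, and the gs://wiki-staging/wikidata/ destination are assumptions for illustration, not the contents of the author's startup.sh.

#!/bin/bash
# Hypothetical startup script - illustrative only, not the original startup.sh.
set -e
apt-get update -y && apt-get install -y lbzip2   # parallel bzip2 decompressor
# Stream the compressed dump, decompress it, and write it straight back to
# Cloud Storage in ~4 GB pieces without holding the whole file on local disk.
curl -sL https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 |
lbzip2 -dc |
split -C 4G --filter='gsutil cp - gs://wiki-staging/wikidata/$FILE.json' - chunk-

Because the instance is created with --preemptible, a run like this can be cut short at any time, so in practice the job needs to be restartable or the flag dropped for a one-off migration.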

Conclusion

Resources and Acknowledgements
