In this post, I will show you two different approaches to finding the Top 100 most requested images, videos, and audio files on Wikipedia during the past three years (2015-2017), based on the media counts data. Before we start, you should know that the data we download here does not exactly represent user requests: the counts only indicate how often an image, video, or audio file has been transferred to users. Modern browsers may preload content on a webpage to deliver a better user experience, and users will not necessarily see every preloaded resource. In other words, "transferred" means that Wikipedia, as the data collector, cannot tell whether users actually saw the content. On the Wikipedia Analytics site, there are two types of data available:
- The count of all transferred media files.
- The count of the top 1000 transferred media files.
We will use the second type; downloading all the data would be pointless since our goal is only the Top 100. The average size of a one-day data file is around 1.2MB, so three years of data add up to roughly 1.3GB (about 1.2MB × 1,096 days).
The Python Approach
The first step is to download the data. Each one-day zip archive contains 24 files. They hold the same data but are sorted by 24 different keys. We will use the file sorted by the third key, which is the count of total requests. Each csv file also contains extra comments and explanations. Instead of removing the unwanted lines, we can use grep to match the lines we want, since they all start with /wikipedia, and pipe the filtered result into counts_data.csv. In the end there is only one csv file to handle.
#!/usr/bin/env bash
for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            wget https://dumps.wikimedia.org/other/mediacounts/daily/$year/mediacounts.top1000.$year-$month-$day.v00.csv.zip;
            # unzip it
            unzip mediacounts.top1000.$year-$month-$day.v00.csv.zip;
            # keep only the lines starting with /wikipedia
            grep -e '^\/wikipedia' mediacounts.$year-$month-$day.v00.sorted_key03.csv >> counts_data.csv
        done
    done
done
What we get is counts_data.csv, a 147MB file:
/wikipedia/en/4/4a/Commons-logo.svg,77559286851,41416246...
/wikipedia/commons/4/4a/Commons-logo.svg,56555175913,36930238...
/wikipedia/commons/f/fa/Wikiquote-logo.svg,54244595857,28138886...
/wikipedia/commons/e/ee/Kit_socks.svg,59780124,116229...
/wikipedia/commons/2/28/AQMI_Flag_asymmetric.svg,440687158,116156...
/wikipedia/commons/2/2d/Flag_of_the_Basque_Country.svg,112401513,115978...
Since the data was not as big as I expected, I decided to give Python a try despite its relative inefficiency. We can aggregate the duplicate entries into a dictionary in O(n) time, and then use a priority queue (the heapq module) to extract the Top 100 results, all within 5 seconds.
import csv
import heapq

d = dict()
with open("counts_data.csv", "r") as source:
    rdr = csv.reader(source)
    for r in rdr:
        # r[0] is the resource name, r[2] is the total request count
        if r[0] in d:
            d[r[0]] += int(r[2])
        else:
            d[r[0]] = int(r[2])

# the 100 entries with the largest request counts
largest = heapq.nlargest(100, d, key=d.get)
for i in largest:
    print('{}\t{}'.format(d[i], i))
python count.py 4.23s user 0.10s system 99% cpu 4.359 total
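As a side note, here is a minimal sketch of an equivalent approach using collections.Counter from the standard library; in CPython, most_common() is itself built on heapq.nlargest, so the result and the running time should be comparable. This is not the script timed above, and it assumes the same counts_data.csv produced by the download step.

import csv
from collections import Counter

counts = Counter()
with open("counts_data.csv", "r") as source:
    for row in csv.reader(source):
        # row[0] is the resource name, row[2] is the total request count
        counts[row[0]] += int(row[2])

# the 100 entries with the highest totals, largest first
for name, total in counts.most_common(100):
    print('{}\t{}'.format(total, name))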
The first column of the output is the request count and the second column is the name of the media resource. To view an image, just put the base URL https://upload.wikimedia.org in front of its name, e.g. https://upload.wikimedia.org/wikipedia/commons/4/4a/Commons-logo.svg.
37125421447 /wikipedia/commons/4/4a/Commons-logo.svg
34875952535 /wikipedia/en/4/4a/Commons-logo.svg
23343701679 /wikipedia/commons/f/fa/Wikiquote-logo.svg
20227017943 /wikipedia/commons/2/23/Icons-mini-file_acrobat.gif
16756712131 /wikipedia/commons/4/4c/Wikisource-logo.svg
16167552367 /wikipedia/foundation/2/20/CloseWindow19x19.png
15579691508 /wikipedia/en/9/99/Question_book-new.svg
14298349706 /wikipedia/commons/a/a4/Flag_of_the_United_States.svg
11111183473 /wikipedia/commons/9/9e/Flag_of_Japan.svg
10982117380 /wikipedia/en/4/48/Folder_Hexagonal_Icon.svg
The output looks reasonable. To verify the result, let's try the hard way: using Hadoop.
The Hadoop Approach
The first step is again to download the data. The most significant difference here is that we don't clean the downloaded data locally; instead, we load the raw data directly into the Hadoop distributed file system (HDFS).
#!/usr/bin/env bash
for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            wget https://dumps.wikimedia.org/other/mediacounts/daily/$year/mediacounts.top1000.$year-$month-$day.v00.csv.zip;
        done
    done
done
hdfs dfs -mkdir mediacounts
hdfs dfs -copyFromLocal raw_data mediacounts/raw_data
We also need to strip the comment lines from the raw data. Instead of running grep in bash, we use grep as the mapper and cat as the reducer in Hadoop; the result is written to a new folder. hadoop-streaming.jar is a utility that allows you to create and run MapReduce jobs with any executable or script as the mapper or the reducer.
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -input mediacounts/raw_data -output cleaned_data -mapper 'grep -e "^\/wikipedia"' -reducer cat
The mapper reads standard input and prints the first column (the media name) and the third column (the request count).
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split(',')
    # emit: media name <TAB> total request count
    print('{}\t{}'.format(words[0], words[2]))
The reducer reads the mapper's output and sums the counts for each resource.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently discard this line
        continue
    # Hadoop sorts the mapper output by key, so all lines for the
    # same resource arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word

# print the last word
if current_word == word:
    print("{}\t{}".format(current_word, current_count))
A quick local test before running Hadoop can be helpful. Note that we use sort as the shuffle step here only for testing, because Hadoop automatically sorts the data between the map and reduce phases.
$ echo "foo,1,1,2,2\nbar,2,2\nbar,2,2\nfoo,1,5" | python mapper.py | sort -k1,1 | python reducer.py
bar 4
foo 6
Looks good. Let’s run mapper and reducer in Hadoop.
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -files mapper.py,reducer.py -input cleaned_data -output count_result -mapper 'mapper.py' -reducer 'reducer.py'
No surprise. We got the same result.
$ hdfs dfs -cat 'count_result/part-*' | sort -r -n -k2 | less
/wikipedia/commons/4/4a/Commons-logo.svg 37125421447
/wikipedia/en/4/4a/Commons-logo.svg 34875952535
/wikipedia/commons/f/fa/Wikiquote-logo.svg 23343701679
/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif 20227017943
/wikipedia/commons/4/4c/Wikisource-logo.svg 16756712131
/wikipedia/foundation/2/20/CloseWindow19x19.png 16167552367
/wikipedia/en/9/99/Question_book-new.svg 15579691508
/wikipedia/commons/a/a4/Flag_of_the_United_States.svg 14298349706
/wikipedia/commons/9/9e/Flag_of_Japan.svg 11111183473
/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg 10982117380
...
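To cap the listing at exactly the Top 100 instead of paging through everything, you can replace less with head in the same pipeline, for example:

$ hdfs dfs -cat 'count_result/part-*' | sort -r -n -k2 | head -n 100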
Summary
The result indicates that at least half of the entries in the Top 100 list are icons used by Wikipedia itself.
We can also see lots of national flags in the ranking. The Top 5 national flag files by transfer count are:
- Flag of the United States
- Flag of Japan
- Flag of France
- Flag of the People’s Republic of China
- Flag of Germany
The only jpg file in the Top 100 list is, somehow, a picture of Örebro Castle, a place I had never heard of.