Girai Kaku


Analysis of Wikipedia Media Counts Data
Jun 12, 2018
5 minutes read

In this post, I will show two different approaches to finding the Top 100 most transferred images, videos, and audio files on Wikipedia during the past three years (2015-2017), based on the media counts data. Before we start, you should know that the data we download here cannot exactly represent user requests, because the counts only indicate how often an image, video, or audio file has been transferred to users. Modern browsers may preload content on a webpage to deliver a better user experience, so users will not necessarily see every transferred resource. In other words, Wikipedia, as the data collector, cannot tell whether users actually saw a transferred file or whether it was merely preloaded. On the Wikipedia Analytics site, there are two types of data available:

  1. The count of all transferred media files.
  2. The count of the top 1000 transferred media files.

We will use the second type; downloading the complete counts would be pointless since our goal is only the Top 100 results. The average size of a one-day data file is around 1.2 MB, so three years of data comes to around 1.3 GB.
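The size estimate above is easy to verify with a quick back-of-the-envelope calculation (a sketch, assuming an average of 1.2 MB per daily file):

```python
# quick sanity check on the download size estimate
mb_per_day = 1.2            # average size of one day's top-1000 file
days = 3 * 365              # 2015-2017, ignoring the leap day
total_gb = mb_per_day * days / 1024
print(round(total_gb, 2))   # → 1.28
```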

The Python Approach

The first step is to download the data. Each one-day zip archive holds 24 files. They contain the same data, sorted by 24 different keys; we will use the file sorted by the third key, which is the count of total requests. Each csv file also carries extra comments and explanations. Instead of removing the unwanted lines, we can use grep to match the lines we want, since they all start with /wikipedia, and pipe the filtered result into counts_data.csv. Eventually, there is only one csv file we need to handle.

#!/usr/bin/env bash

for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            # unzip it
            unzip mediacounts.top1000.$year-$month-$day.zip
            # find all lines starting with /wikipedia
            grep -e '^\/wikipedia' mediacounts.$year-$month-$day.v00.sorted_key03.csv >> counts_data.csv
        done
    done
done

What we got here is counts_data.csv, a 147MB file.


Since the data was not as big as I expected, I decided to give Python a try despite its relative inefficiency. We can easily aggregate duplicate elements into a dictionary in O(n) time. Then, using a priority queue (Python's heapq module), we can extract the Top 100 results within 5 seconds.

import csv
import heapq

d = dict()
with open("counts_data.csv", "r") as source:
    rdr = csv.reader(source)
    for r in rdr:
        # column 0 is the media name, column 2 the total request count
        if r[0] in d:
            d[r[0]] += int(r[2])
        else:
            d[r[0]] = int(r[2])

largest = heapq.nlargest(100, d, key=d.get)

for i in largest:
    print('{}\t{}'.format(d[i], i))
python  4.23s user 0.10s system 99% cpu 4.359 total
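As a sanity check on the dictionary-plus-heapq approach, the same aggregation can be written with collections.Counter, whose most_common() method also uses a heap for top-k extraction. This is a minimal sketch with a hypothetical helper name and a tiny inline sample standing in for the counts_data.csv rows:

```python
from collections import Counter

def top_media(rows, k):
    """Aggregate (name, count) rows and return the k most requested."""
    totals = Counter()
    for name, count in rows:
        totals[name] += count
    # most_common uses a heap internally, much like heapq.nlargest
    return totals.most_common(k)

# tiny inline sample standing in for rows of counts_data.csv
sample = [("a.svg", 5), ("b.png", 3), ("a.svg", 2), ("c.jpg", 1)]
print(top_media(sample, 2))  # → [('a.svg', 7), ('b.png', 3)]
```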

The first column of our output shows the request count and the second column is the name of the media resource. To view an image, just put the base URL in front of its name.

37125421447     /wikipedia/commons/4/4a/Commons-logo.svg
34875952535     /wikipedia/en/4/4a/Commons-logo.svg
23343701679     /wikipedia/commons/f/fa/Wikiquote-logo.svg
20227017943     /wikipedia/commons/2/23/Icons-mini-file_acrobat.gif
16756712131     /wikipedia/commons/4/4c/Wikisource-logo.svg
16167552367     /wikipedia/foundation/2/20/CloseWindow19x19.png
15579691508     /wikipedia/en/9/99/Question_book-new.svg
14298349706     /wikipedia/commons/a/a4/Flag_of_the_United_States.svg
11111183473     /wikipedia/commons/9/9e/Flag_of_Japan.svg
10982117380     /wikipedia/en/4/48/Folder_Hexagonal_Icon.svg

The output looks fair. To verify the result, let’s try the hard way – using Hadoop.

The Hadoop Approach

The first step is also downloading data. However, the most significant difference here is that we don’t cleanse the downloaded data. Instead, we directly load the raw data into the Hadoop distributed file system.

#!/usr/bin/env bash

for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            :
        done
    done
done

hdfs dfs -mkdir mediacounts
hdfs dfs -copyFromLocal raw_data mediacounts/raw_data

We still need to strip the comment lines from the raw data. Instead of running grep in bash, we use grep as a mapper and cat as a reducer in Hadoop, writing the result into a new folder. hadoop-streaming.jar is a utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper or the reducer.

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -input mediacounts/raw_data -output cleaned_data -mapper 'grep -e "^\/wikipedia"' -reducer cat

The mapper, saved here as mapper.py, reads standard input and prints the first column (the name of the media file) along with the third column (the request count).

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split(',')
    # emit: media name, request count
    print('{}\t{}'.format(words[0], words[2]))

The reducer, saved here as reducer.py, reads the mapper's output and sums the request counts for each resource.

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the input we got from the mapper
    word = line.split('\t')[0]
    count = line.split('\t')[1]
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently discard this line
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word

# print last word
if current_word == word:
    print("{}\t{}".format(current_word, current_count))

A test before running Hadoop can be helpful. Note that we use sort as the shuffler; that is only for local testing, because Hadoop automatically sorts the data after the map phase.

$ echo "foo,1,1,2,2\nbar,2,2\nbar,2,2\nfoo,1,5" | python mapper.py | sort -k1,1 | python reducer.py
bar     4
foo     6

Looks good. Let’s run mapper and reducer in Hadoop.

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -files mapper.py,reducer.py -input cleaned_data -output count_result -mapper 'python mapper.py' -reducer 'python reducer.py'

No surprise. We got the same result.

$ hdfs dfs -cat 'count_result/part-*' | sort -r -n -k2 | less
/wikipedia/commons/4/4a/Commons-logo.svg                37125421447
/wikipedia/en/4/4a/Commons-logo.svg                     34875952535
/wikipedia/commons/f/fa/Wikiquote-logo.svg              23343701679
/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif     20227017943
/wikipedia/commons/4/4c/Wikisource-logo.svg             16756712131
/wikipedia/foundation/2/20/CloseWindow19x19.png         16167552367
/wikipedia/en/9/99/Question_book-new.svg                15579691508
/wikipedia/commons/a/a4/Flag_of_the_United_States.svg   14298349706
/wikipedia/commons/9/9e/Flag_of_Japan.svg               11111183473
/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg            10982117380


The result indicates that at least half of the elements in the Top 100 list are icons used by Wikipedia.


We can also see lots of national flags in the ranking. The Top 5 most transferred national flag files are:

  1. Flag of the United States
  2. Flag of Japan
  3. Flag of France
  4. Flag of the People’s Republic of China
  5. Flag of Germany

Somehow, the only jpg file in the Top 100 list is a picture of Örebro Castle, although I had never heard of that place.

