In this post, I will show you two different approaches to finding the Top 100 most requested images, videos, and audio files on Wikipedia during the past three years (2015-2017), based on the media counts data. Before we start, you should know that the data we download here does not exactly represent user requests: the counts only indicate how often an image, video, or audio file has been transferred to users. Modern browsers may preload content on a webpage to deliver a better user experience, and users will not necessarily see every preloaded resource. In other words, "transferred" means that Wikipedia, as the data collector, cannot tell whether users actually saw the content. On the Wikipedia Analytics site, there are two types of data available:
- The count of all transferred media files.
- The count of the top 1000 transferred media files.
We will use the second type; downloading all the data would be pointless since our goal is only the Top 100. The average size of a one-day data file is around 1.2MB, so three years of data add up to roughly 1.3GB (about 1.2MB × 1,096 days).
The Python Approach
The first step is to download the data. Each one-day zip archive contains 24 files. They hold the same data but are sorted by 24 different keys. We will use the file sorted by the third key, which is the count of total requests. Each csv file also contains extra comments and explanations. Instead of removing the unwanted lines, we can use grep to match the lines we want, since they all start with /wikipedia, and pipe the filtered result into counts_data.csv. In the end there is only one csv file to handle.
#!/usr/bin/env bash
for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            wget https://dumps.wikimedia.org/other/mediacounts/daily/$year/mediacounts.top1000.$year-$month-$day.v00.csv.zip;
            # unzip it
            unzip mediacounts.top1000.$year-$month-$day.v00.csv.zip;
            # keep only the lines starting with /wikipedia
            grep -e '^\/wikipedia' mediacounts.$year-$month-$day.v00.sorted_key03.csv >> counts_data.csv
        done
    done
done
What we get is counts_data.csv, a 147MB file:
/wikipedia/en/4/4a/Commons-logo.svg,77559286851,41416246...
/wikipedia/commons/4/4a/Commons-logo.svg,56555175913,36930238...
/wikipedia/commons/f/fa/Wikiquote-logo.svg,54244595857,28138886...
/wikipedia/commons/e/ee/Kit_socks.svg,59780124,116229...
/wikipedia/commons/2/28/AQMI_Flag_asymmetric.svg,440687158,116156...
/wikipedia/commons/2/2d/Flag_of_the_Basque_Country.svg,112401513,115978...
Since the data was not as big as I expected, I decided to give Python a try despite its relative inefficiency. We can aggregate the duplicate entries into a dictionary in O(n) time, and then use a priority queue (the heapq module) to extract the Top 100 results, all within 5 seconds.
import csv
import heapq

d = dict()
with open("counts_data.csv", "r") as source:
    rdr = csv.reader(source)
    for r in rdr:
        # r[0] is the resource name, r[2] is the total request count
        if r[0] in d:
            d[r[0]] += int(r[2])
        else:
            d[r[0]] = int(r[2])

# the 100 entries with the largest request counts
largest = heapq.nlargest(100, d, key=d.get)
for i in largest:
    print('{}\t{}'.format(d[i], i))
python count.py 4.23s user 0.10s system 99% cpu 4.359 total
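As a side note, here is a minimal sketch of an equivalent approach using collections.Counter from the standard library; in CPython, most_common() is itself built on heapq.nlargest, so the result and the running time should be comparable. This is not the script timed above, and it assumes the same counts_data.csv produced by the download step.

import csv
from collections import Counter

counts = Counter()
with open("counts_data.csv", "r") as source:
    for row in csv.reader(source):
        # row[0] is the resource name, row[2] is the total request count
        counts[row[0]] += int(row[2])

# the 100 entries with the highest totals, largest first
for name, total in counts.most_common(100):
    print('{}\t{}'.format(total, name))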
The first column of the output is the request count and the second column is the name of the media resource. To view an image, just put the base URL https://upload.wikimedia.org in front of its name, e.g. https://upload.wikimedia.org/wikipedia/commons/4/4a/Commons-logo.svg.
37125421447 /wikipedia/commons/4/4a/Commons-logo.svg
34875952535 /wikipedia/en/4/4a/Commons-logo.svg
23343701679 /wikipedia/commons/f/fa/Wikiquote-logo.svg
20227017943 /wikipedia/commons/2/23/Icons-mini-file_acrobat.gif
16756712131 /wikipedia/commons/4/4c/Wikisource-logo.svg
16167552367 /wikipedia/foundation/2/20/CloseWindow19x19.png
15579691508 /wikipedia/en/9/99/Question_book-new.svg
14298349706 /wikipedia/commons/a/a4/Flag_of_the_United_States.svg
11111183473 /wikipedia/commons/9/9e/Flag_of_Japan.svg
10982117380 /wikipedia/en/4/48/Folder_Hexagonal_Icon.svg
The output looks reasonable. To verify the result, let's try the hard way: using Hadoop.
The Hadoop Approach
The first step is again to download the data. The most significant difference here is that we don't clean the downloaded data locally; instead, we load the raw data directly into the Hadoop distributed file system (HDFS).
#!/usr/bin/env bash
for year in {2015..2017} ; do
    for month in 0{1..9} {10..12} ; do
        for day in 0{1..9} {10..31} ; do
            # download data
            wget https://dumps.wikimedia.org/other/mediacounts/daily/$year/mediacounts.top1000.$year-$month-$day.v00.csv.zip;
        done
    done
done
hdfs dfs -mkdir mediacounts
hdfs dfs -copyFromLocal raw_data mediacounts/raw_data
We also need to strip the comment lines from the raw data. Instead of running grep in bash, we use grep as the mapper and cat as the reducer in Hadoop; the result is written to a new folder. hadoop-streaming.jar is a utility that allows you to create and run MapReduce jobs with any executable or script as the mapper or the reducer.
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -input mediacounts/raw_data -output cleaned_data -mapper 'grep -e "^\/wikipedia"' -reducer cat
The mapper reads standard input and prints the first column (the media name) and the third column (the request count).
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split(',')
    # emit: media name <TAB> total request count
    print('{}\t{}'.format(words[0], words[2]))
The reducer reads the mapper's output and sums the counts for each resource.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently discard this line
        continue
    # Hadoop sorts the mapper output by key, so all lines for the
    # same resource arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word

# print the last word
if current_word == word:
    print("{}\t{}".format(current_word, current_count))
A quick local test before running Hadoop can be helpful. Note that we use sort as the shuffle step here only for testing, because Hadoop automatically sorts the data between the map and reduce phases.
$ echo "foo,1,1,2,2\nbar,2,2\nbar,2,2\nfoo,1,5" | python mapper.py | sort -k1,1 | python reducer.py
bar 4
foo 6
Looks good. Let’s run mapper and reducer in Hadoop.
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -files mapper.py,reducer.py -input cleaned_data -output count_result -mapper 'mapper.py' -reducer 'reducer.py'
No surprise. We got the same result.
$ hdfs dfs -cat 'count_result/part-*' | sort -r -n -k2 | less
/wikipedia/commons/4/4a/Commons-logo.svg 37125421447
/wikipedia/en/4/4a/Commons-logo.svg 34875952535
/wikipedia/commons/f/fa/Wikiquote-logo.svg 23343701679
/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif 20227017943
/wikipedia/commons/4/4c/Wikisource-logo.svg 16756712131
/wikipedia/foundation/2/20/CloseWindow19x19.png 16167552367
/wikipedia/en/9/99/Question_book-new.svg 15579691508
/wikipedia/commons/a/a4/Flag_of_the_United_States.svg 14298349706
/wikipedia/commons/9/9e/Flag_of_Japan.svg 11111183473
/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg 10982117380
...
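To cap the listing at exactly the Top 100 instead of paging through everything, you can replace less with head in the same pipeline, for example:

$ hdfs dfs -cat 'count_result/part-*' | sort -r -n -k2 | head -n 100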
Summary
The result indicates that at least half of the entries in the Top 100 list are icons used by Wikipedia itself.
We can also see lots of national flags in the ranking. The Top 5 national flag files by transfer count are:
- Flag of the United States
- Flag of Japan
- Flag of France
- Flag of the People’s Republic of China
- Flag of Germany
The only jpg file in the Top 100 list is, somehow, a picture of Örebro Castle, a place I had never heard of.