Compare page metrics
This tutorial explains how to use Python to load page viewership and edit data from three endpoints, process it, and display it on a single diagram.
A diagram like this could be useful to you, for example, when trying to identify potential correlation or causation between different metrics.
Most sections on this page contain Python code snippets without comments. The full script, with comments, is available at the end of the page.
Prerequisites
To follow this tutorial, you should be familiar with Python, have a basic understanding of data structures, and know what it means to make API requests.
If you aren't familiar with Python, you can learn about it from the official website, Python books available on Wikibooks, or other sources available online.
Before requesting data from the API, be sure to read the access policy.
Software
This tutorial requires that you have access to a Python development environment. This can be a system-wide Python installation, a virtual environment, or a Jupyter Notebook (such as PAWS, the instance hosted by the Wikimedia Foundation).
To run the code on this page, you need Python version 3.9 or higher. You also need to install the following extra libraries:
- Matplotlib
- Pandas
- Requests
To install these libraries, run the following command (or similar, depending on your Python distribution):
sh
pip install matplotlib pandas requests
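To confirm that the environment is ready before continuing, a quick check along the following lines may help. This is a minimal sketch: the helper name missing_libraries is hypothetical, and the check only tells you whether each library can be found, not which version is installed.

```python
import importlib.util
import sys

REQUIRED = ["matplotlib", "pandas", "requests"]


def missing_libraries(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The tutorial states that Python 3.9 or higher is required
if sys.version_info < (3, 9):
    print("This tutorial needs Python 3.9 or higher.")

missing = missing_libraries(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```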
Setting up and requesting data from the API
Start by importing the libraries installed earlier:
- Pyplot, provided by Matplotlib, will allow you to create plots of the page metrics.
- Pandas will allow you to prepare the data for displaying.
- Requests will allow you to request data from the API.
py
import matplotlib.pyplot as plt
import pandas as pd
import requests as rq
Next, specify the parameters for API requests. This means setting:
- user agent, as described in the access policy
- request URLs. A URL defines what data you will request. You might want to see the documentation for the following endpoints to understand how to construct these: bytes changed on a page, edits to a page, and page views for a page.
In this example, you will request:
- daily edit, view, and absolute difference in page size data,
- for the Land page on English Wikipedia,
- for every day between 2022.04.01 and 2022.12.31.
py
headers = {
    "User-Agent": "Wikimedia Analytics API Tutorial (<your wiki username>) compare-page-metrics.py",
}
diff_url = """https://wikimedia.org/api/rest_v1/metrics/bytes-difference/\
absolute/per-page/en.wikipedia.org/Land/all-editor-types/\
daily/20220401/20221231"""
view_url = """https://wikimedia.org/api/rest_v1/metrics/pageviews/\
per-article/en.wikipedia.org/all-access/all-agents/\
Land/daily/20220401/20221231"""
edit_url = """https://wikimedia.org/api/rest_v1/metrics/edits/\
per-page/en.wikipedia.org/Land/all-editor-types/\
daily/20220401/20221231"""
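If you plan to request the same metrics for other pages or date ranges, it may be more convenient to build the URLs from parameters. The following is a sketch, not part of the original script: the names BASE_URL and build_urls are hypothetical, and replacing spaces with underscores and percent-encoding the title are assumptions that you should check against the endpoint documentation.

```python
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1/metrics"


def build_urls(project, page, start, end, granularity="daily"):
    """Build the three request URLs used in this tutorial for one page."""
    # Assumption: titles use underscores for spaces and must be percent-encoded
    title = quote(page.replace(" ", "_"), safe="")
    period = f"{granularity}/{start}/{end}"
    return {
        "diff": f"{BASE_URL}/bytes-difference/absolute/per-page/"
                f"{project}/{title}/all-editor-types/{period}",
        "view": f"{BASE_URL}/pageviews/per-article/"
                f"{project}/all-access/all-agents/{title}/{period}",
        "edit": f"{BASE_URL}/edits/per-page/"
                f"{project}/{title}/all-editor-types/{period}",
    }


urls = build_urls("en.wikipedia.org", "Land", "20220401", "20221231")
for name, url in urls.items():
    print(name, url)
```

With the arguments shown, the function produces the same three URLs as the ones written out by hand in this section.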
Next, request the data and parse the JSON responses sent by the API.
py
diff_response = rq.get(diff_url, headers=headers).json()
view_response = rq.get(view_url, headers=headers).json()
edit_response = rq.get(edit_url, headers=headers).json()
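The calls above assume that every request succeeds. If you want the script to stop with a clear error when the API returns an error status, a small wrapper along these lines is one option. This is a sketch: the function name fetch_json, the get parameter (included only so that the function can be tried without network access), and the 30-second timeout are all assumptions, not part of the original script.

```python
import requests as rq


def fetch_json(url, headers, get=rq.get, timeout=30):
    """Request `url` and return the parsed JSON body, or raise on an HTTP error."""
    # strip() removes stray whitespace, such as a leading newline in a
    # triple-quoted URL string
    response = get(url.strip(), headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.json()


# Possible usage with the variables defined in this tutorial:
# diff_response = fetch_json(diff_url, headers)
# view_response = fetch_json(view_url, headers)
# edit_response = fetch_json(edit_url, headers)
```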
Preparing data for plotting
Data returned in the JSON response has a tree-like structure. For numerical calculations and comparisons, it's better to have data in the form of a table or matrix.
In Python, a common table-like structure for working with data is a DataFrame (or mutable table) available in the Pandas library.
Code in this section prepares the data for displaying by normalizing it and placing it in a single DataFrame. For each endpoint:
- Create a DataFrame. Make sure to point to the correct location of records in the JSON response. See the raw response data for structure details: bytes changed, views, edits. Notice that the structure of the views response is slightly different from other responses.
- Convert the timestamp column to datetime, a dedicated type for time data. Notice that the format of this column in the views endpoint differs from the other endpoints: time zone information is missing. To fix that, set the time zone during conversion by calling dt.tz_localize("UTC").
- Use the timestamp column as the index of the DataFrame. An index is a set of labels that uniquely identifies each row.
- Remove unnecessary columns returned in the views response.
py
# bytes changed endpoint #
diff_df = pd.DataFrame.from_records(diff_response["items"][0]["results"])
diff_df["timestamp"] = pd.to_datetime(diff_df["timestamp"])
diff_df = diff_df.set_index("timestamp")
# views endpoint #
view_df = pd.DataFrame.from_records(view_response["items"])
view_df["timestamp"] = pd.to_datetime(
    view_df["timestamp"], format="%Y%m%d%H"
).dt.tz_localize("UTC")
view_df = view_df.set_index("timestamp")
view_df = view_df.drop(columns=["project", "article", "granularity", "access", "agent"])
# edits endpoint #
edit_df = pd.DataFrame.from_records(edit_response["items"][0]["results"])
edit_df["timestamp"] = pd.to_datetime(edit_df["timestamp"])
edit_df = edit_df.set_index("timestamp")
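To see why the views response needs different handling, it can help to try the same steps on a small hand-written sample. The dictionaries below are trimmed, made-up stand-ins for real responses: their shape is assumed from the paths and column names used in the code above, and the ISO-style timestamp shown for the bytes changed sample is also an assumption.

```python
import pandas as pd

# Made-up sample shaped like the bytes changed response: records are nested
# under items[0]["results"]
diff_sample = {
    "items": [
        {
            "results": [
                {"timestamp": "2022-04-01T00:00:00.000Z", "abs_bytes_diff": 120},
                {"timestamp": "2022-04-02T00:00:00.000Z", "abs_bytes_diff": 45},
            ]
        }
    ]
}

# Made-up sample shaped like the views response: records sit directly in items
view_sample = {
    "items": [
        {"project": "en.wikipedia", "article": "Land", "granularity": "daily",
         "timestamp": "2022040100", "access": "all-access",
         "agent": "all-agents", "views": 1500},
        {"project": "en.wikipedia", "article": "Land", "granularity": "daily",
         "timestamp": "2022040200", "access": "all-access",
         "agent": "all-agents", "views": 1620},
    ]
}

diff_df = pd.DataFrame.from_records(diff_sample["items"][0]["results"])
# The trailing "Z" marks UTC, so the parsed values are time zone aware
diff_df["timestamp"] = pd.to_datetime(diff_df["timestamp"])

view_df = pd.DataFrame.from_records(view_sample["items"])
# No time zone in the text, so it is set explicitly after parsing
view_df["timestamp"] = pd.to_datetime(
    view_df["timestamp"], format="%Y%m%d%H"
).dt.tz_localize("UTC")

# After conversion, both columns describe the same moments in time
same = (diff_df["timestamp"] == view_df["timestamp"]).all()
print(same)
```

Once both columns are time zone aware, timestamps from different endpoints can be matched against each other, which the merge in the next step depends on.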
All your DataFrames now have a correct index that you can use to merge them into a single DataFrame. In the parameters of the merge operation, specify:
- on="timestamp" to merge the data based on the timestamp. When rows have the same timestamp, they are joined into a single row.
- how="outer" to preserve all rows, even if one of the DataFrames doesn't contain data for a given timestamp
After the second merge, make sure to fill in empty values in the DataFrame with zeros. To do that, call fillna(0).astype(int). The astype(int) call converts the columns back to integers, because columns that contain missing values are stored as floating-point numbers.
For more information about the way DataFrame merge works in Pandas, see merging in Pandas.
py
r = pd.merge(diff_df, edit_df, on="timestamp", how="outer")
r = pd.merge(r, view_df, on="timestamp", how="outer").fillna(0).astype(int)
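The effect of the how parameter is easier to see on a small example. The following sketch uses two made-up DataFrames, indexed by timestamp like the ones in this tutorial, in which the edit data has no row for 2 April.

```python
import pandas as pd

# Made-up sample values; only the shape mirrors the tutorial's DataFrames
edit_df = pd.DataFrame(
    {"edits": [2, 5]},
    index=pd.to_datetime(["2022-04-01", "2022-04-03"], utc=True).rename("timestamp"),
)
view_df = pd.DataFrame(
    {"views": [100, 120, 90]},
    index=pd.to_datetime(
        ["2022-04-01", "2022-04-02", "2022-04-03"], utc=True
    ).rename("timestamp"),
)

# Outer merge keeps 2 April; the missing edit count becomes 0 after fillna
outer = (
    pd.merge(edit_df, view_df, on="timestamp", how="outer")
    .fillna(0)
    .astype(int)
    .sort_index()
)

# Inner merge keeps only the days present in both DataFrames
inner = pd.merge(edit_df, view_df, on="timestamp", how="inner")

print(outer)
print(len(outer), len(inner))
```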
Empty values in the dataset
Zeros are often omitted in Wikimedia Analytics API responses. This means that when you request edit data for a page for a specific seven-day period, you might receive a response that contains fewer days, with days without edits skipped. For this reason, the code presented earlier:
- preserves rows for timestamps that occur in at least one of the DataFrames. This ensures that when you have page view data for a given day, but no edits occurred on that day, the day isn't removed from the dataset (which would happen if you set how to "inner").
- fills in missing data by calling fillna(0). This fills in missing values with zeros, which is typically what a skipped value means.
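Note that an outer merge can only keep days that appear in at least one response. If a day were missing from all three responses, it would also be missing from the merged DataFrame. One possible way to guard against that is to reindex the result against the full range of requested days. This is a sketch on made-up data, and the variable names all_days and r_full are hypothetical.

```python
import pandas as pd

# Made-up merged data with no rows at all for 2 April and 4 April
r = pd.DataFrame(
    {"edits": [1, 3, 2], "views": [100, 90, 110]},
    index=pd.to_datetime(
        ["2022-04-01", "2022-04-03", "2022-04-05"], utc=True
    ).rename("timestamp"),
)

# Every day in the requested period, in UTC to match the index of r
all_days = pd.date_range(
    "2022-04-01", "2022-04-05", freq="D", tz="UTC", name="timestamp"
)

# Days absent from r are added as new rows filled with zeros
r_full = r.reindex(all_days, fill_value=0)
print(r_full)
```

Whether zero is the right fill value depends on the metric. For edits, a skipped day typically means no edits, but a day without any view data might instead point to a gap in the data.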
Displaying the plot
With the DataFrame prepared, you can now display the data.
Start by specifying the plot style, in this case bmh. You can learn more about the available plot styles by reading Matplotlib's style sheets reference.
py
plt.style.use("bmh")
Create a subplot layout and prepare to display it. To produce a plot that's readable, set the following parameters in the plt.subplots() call:
- nrows=3 and ncols=1 to create three subplot slots in a column, one under another
- sharex=True to ensure the subplots align based on the timestamp
- sharey=False to ensure the subplots don't adhere to a single scale. This makes them more readable.
- figsize=(15,10) to ensure the diagram is wide enough to read comfortably
Configure the subplots, each based on one of the columns from your DataFrame, by specifying the following parameters:
- ax=axes[X] to assign the subplot to a specific slot in the layout. ax=axes[0] assigns the plot to the first slot, axes[1] to the second, and axes[2] to the third.
- color to set the color of the subplot. In the example, "c" means cyan, "m" means magenta, and "y" means yellow. For more information, see Specifying colors.
- title to set the title of the subplot
With the configuration in place, you can display the plot.
py
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=False, figsize=(15,10))
r["views"].plot(ax=axes[0], color="c", title="Views")
r["edits"].plot(ax=axes[1], color="m", title="Edits")
r["abs_bytes_diff"].plot(ax=axes[2], color="y", title="Absolute change (bytes)")
plt.show()
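If you run the script somewhere without a display, for example as a scheduled job, you might prefer to write the figure to an image instead of calling plt.show(). The following sketch assumes the non-interactive Agg backend and uses a made-up stand-in for the merged DataFrame. It writes to an in-memory buffer, but you can pass a file name to savefig() to write to disk.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend that renders without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the merged DataFrame r
r = pd.DataFrame(
    {"views": [100, 120, 90], "edits": [2, 0, 5], "abs_bytes_diff": [300, 0, 850]},
    index=pd.date_range("2022-04-01", periods=3, freq="D", tz="UTC", name="timestamp"),
)

fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=False, figsize=(15, 10))
r["views"].plot(ax=axes[0], color="c", title="Views")
r["edits"].plot(ax=axes[1], color="m", title="Edits")
r["abs_bytes_diff"].plot(ax=axes[2], color="y", title="Absolute change (bytes)")

# Render the figure as PNG data instead of opening a window
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print(len(buffer.getvalue()), "bytes of PNG data")
```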
Next steps
To better understand the libraries and data used in this tutorial, be sure to experiment with different parameters in function calls and the request URL.
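As one possible experiment, you could put a number on how closely the metrics move together by computing correlation coefficients with the DataFrame's corr() method, which uses Pearson correlation by default. The data below is made up for illustration. With the tutorial's data, you would call corr() on the merged DataFrame r.

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame r
r = pd.DataFrame(
    {
        "views": [100, 120, 90, 150, 130],
        "edits": [2, 3, 1, 5, 2],
        "abs_bytes_diff": [300, 410, 50, 900, 120],
    },
    index=pd.date_range("2022-04-01", periods=5, freq="D", tz="UTC", name="timestamp"),
)

# Pairwise correlation between all columns, as a square table
correlations = r.corr()
print(correlations.round(2))
```

A high coefficient only indicates that two metrics moved together in the sample. It doesn't show that one of them caused the other.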
To learn more about the ecosystem of Python tools and libraries used in data science, explore the links listed in Useful resources.
To see what other endpoints are available to you in the Analytics API, check the API reference pages listed in the menu.
Full script
The full script should look like the following.
py
"""Displays viewership and edit data for a page."""
import matplotlib.pyplot as plt
import pandas as pd
import requests as rq
# Prepare and make request #
# Specify user agent
# Be sure to customize this variable according to the access policy
# before running this script
headers = {
    "User-Agent": "GitLab CI automated test (/generated-data-platform/aqs/analytics-api) compare-page-metrics.py",
}
# URL for absolute article size difference data
diff_url = """https://wikimedia.org/api/rest_v1/metrics/bytes-difference/\
absolute/per-page/en.wikipedia.org/Land/all-editor-types/\
daily/20220401/20221231"""
# URL for viewership data
view_url = """https://wikimedia.org/api/rest_v1/metrics/pageviews/\
per-article/en.wikipedia.org/all-access/all-agents/\
Land/daily/20220401/20221231"""
# URL for edit number data
edit_url = """https://wikimedia.org/api/rest_v1/metrics/edits/\
per-page/en.wikipedia.org/Land/all-editor-types/\
daily/20220401/20221231"""
# Request all data from APIs and parse responses as JSON
diff_response = rq.get(diff_url, headers=headers).json()
view_response = rq.get(view_url, headers=headers).json()
edit_response = rq.get(edit_url, headers=headers).json()
# Create Pandas DataFrame for bytes difference data
diff_df = pd.DataFrame.from_records(diff_response["items"][0]["results"])
# Parse timestamp and use it as index
diff_df["timestamp"] = pd.to_datetime(diff_df["timestamp"])
diff_df = diff_df.set_index("timestamp")
# Create Pandas DataFrame for viewership data
view_df = pd.DataFrame.from_records(view_response["items"])
# Parse timestamp and use it as index
# Note that timestamps for page views are in a different format
# and do not include time zone information. The tz_localize call
# sets the time zone to UTC
view_df["timestamp"] = pd.to_datetime(
    view_df["timestamp"], format="%Y%m%d%H"
).dt.tz_localize("UTC")
view_df = view_df.set_index("timestamp")
# Remove unnecessary columns included in the page view response
view_df = view_df.drop(columns=["project", "article", "granularity", "access", "agent"])
# Create Pandas DataFrame for edit data
edit_df = pd.DataFrame.from_records(edit_response["items"][0]["results"])
# Parse timestamp and use it as index
edit_df["timestamp"] = pd.to_datetime(edit_df["timestamp"])
edit_df = edit_df.set_index("timestamp")
# Merge the three DataFrames into one based on timestamp
# Note that the joins are defined as "outer" to retain all timestamps even
# if they are missing from any DataFrame. Also note the `.fillna(0).astype(int)`
# which fills missing data with zeros interpreted as integers
r = pd.merge(diff_df, edit_df, on="timestamp", how="outer")
r = pd.merge(r, view_df, on="timestamp", how="outer").fillna(0).astype(int)
# Set plot style
plt.style.use("bmh")
# Create a subplot layout
# Note that these plots do not share the Y axis (value), but share the X axis (timestamp)
# This is because these subplots have very different values. For example, the number of
# edits would be completely invisible if the Y axis scale was optimized for page view data
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=False, figsize=(15,10))
# Configure the subplot for views
r["views"].plot(ax=axes[0], color="c", title="Views")
# Configure the subplot for edits
r["edits"].plot(ax=axes[1], color="m", title="Edits")
# Configure the subplot for absolute change in bytes
r["abs_bytes_diff"].plot(ax=axes[2], color="y", title="Absolute change (bytes)")
# Display all subplots
plt.show()
Useful resources
- Python home page
- Conda, package and environment manager popular among data scientists. It's often used to install and manage Python and R packages.
- Matplotlib documentation
- Pandas documentation
- Requests documentation
- JupyterLab and Jupyter Notebook home page
- PAWS, the Jupyter Notebook instance hosted by the Wikimedia Foundation