Compare editor numbers, year over year
This tutorial explains how to use Python to load editor data from the editors endpoint, process it, and display a diagram with a year-over-year comparison of editor numbers.
A diagram like this could be useful to you, for example, in an analysis of seasonal trends in editor numbers.
Most sections on this page contain Python code snippets without comments. The full script, with comments, is available at the end of the page.
Prerequisites
To follow this tutorial, you should be familiar with Python, have a basic understanding of data structures, and know what it means to make API requests.
If you aren't familiar with Python, you can learn about it from the official website, Python books available on Wikibooks, or other sources available online.
Before requesting data from the API, be sure to read the access policy.
Software
This tutorial requires that you have access to a Python development environment. This can be a system-wide Python installation, a virtual environment, or a Jupyter Notebook (such as PAWS, the instance hosted by the Wikimedia Foundation).
To run the code on this page, you need Python version 3.9 or higher. You also need to install the following extra libraries:
- Matplotlib
- Pandas
- Requests
To install these libraries, run the following command (or similar, depending on your Python distribution):
sh
pip install matplotlib pandas requests
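If you want to confirm that your environment meets these requirements before continuing, a quick check along the following lines may help. This is only a sketch: the names `REQUIRED`, `python_ok`, and `missing` are made up for illustration, and the check only verifies that the packages can be found, not that their versions are compatible.

```python
import importlib.util
import sys

# Import names of the libraries this tutorial relies on
REQUIRED = ("matplotlib", "pandas", "requests")

# True when the running interpreter is Python 3.9 or newer
python_ok = sys.version_info >= (3, 9)

# Collect the libraries that cannot be found in the current environment
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

print("Python version OK:", python_ok)
print("Missing libraries:", missing if missing else "none")
```

If the list of missing libraries is not empty, rerun the installation command in the same environment that you use to run the script.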
Setting up and requesting data from the API
Start by importing the libraries installed earlier:
- Pyplot, provided by Matplotlib, will allow you to create a plot of editor numbers.
- Pandas will allow you to prepare the editor data for displaying.
- Requests will allow you to request editor data from the API.
py
import matplotlib.pyplot as plt
import pandas as pd
import requests as rq
Next, specify the parameters for the API request. This means setting:
- user agent, as described in the access policy
- request URL. The URL defines what data you will request. You might want to see the documentation of the editors endpoint to understand how to construct it. In this example, you will request monthly editor numbers from the beginning of 2016 to the end of 2023.
py
headers = {
    "User-Agent": "Wikimedia Analytics API Tutorial (<your wiki username>) compare-editor-numbers.py"
}
url = """https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/\
en.wikipedia.org/user/content/all-activity-levels/\
monthly/20160101/20240101"""
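If you plan to experiment with different projects or date ranges, it may be more convenient to assemble the URL from its parts. The following is a minimal sketch rather than an official helper: `build_editors_url` and its parameter names are hypothetical labels for the path segments used in the URL above, so check the documentation of the editors endpoint for the values each segment accepts.

```python
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/editors/aggregate"


def build_editors_url(project, editor_type, page_type, activity_level,
                      granularity, start, end):
    """Join the path segments of an editors request into a single URL.

    The parameter names are illustrative labels for the segments,
    listed in the order in which they appear in the tutorial URL.
    """
    segments = [BASE_URL, project, editor_type, page_type,
                activity_level, granularity, start, end]
    return "/".join(segments)


# Rebuild the URL used in this tutorial from its segments
url = build_editors_url(
    "en.wikipedia.org", "user", "content", "all-activity-levels",
    "monthly", "20160101", "20240101",
)
print(url)
```

With the arguments shown, the function produces the same URL as the triple-quoted string above.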
Next, request the data and parse the JSON response sent by the API.
py
response = rq.get(url, headers=headers)
editor_numbers = response.json()
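The snippet above assumes that the request succeeds and that the response has the expected shape. As a hedged sketch of a more defensive approach, the hypothetical helper `extract_results` below (not part of Requests or the API) pulls the monthly records out of an already parsed response and fails with a clear message if the expected keys are missing. The sample payload is hand-written with made-up numbers; it only mirrors the `items` and `results` nesting used later in this tutorial, and the timestamp format is an assumption.

```python
def extract_results(payload):
    """Return the list of monthly records from a parsed editors response.

    Raises ValueError when the payload lacks the expected nesting.
    """
    try:
        return payload["items"][0]["results"]
    except (KeyError, IndexError, TypeError) as err:
        raise ValueError("Unexpected response structure") from err


# Hand-written sample with made-up numbers, not real API data
sample_payload = {
    "items": [
        {
            "results": [
                {"timestamp": "2016-01-01T00:00:00.000Z", "editors": 100},
                {"timestamp": "2016-02-01T00:00:00.000Z", "editors": 110},
            ]
        }
    ]
}

rows = extract_results(sample_payload)
print(len(rows), "records found")
```

With a live response, you could also call `response.raise_for_status()` before `response.json()`, so that Requests raises an exception when the API returns an HTTP error status.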
Preparing data for plotting
Data returned in the JSON response has a tree-like structure. For numerical calculations and comparisons, it's better to have data in the form of a table or matrix.
In Python, a common table-like structure for working with data is a DataFrame (or mutable table) available in the Pandas library.
The following code snippet prepares the data for displaying by:
- Creating a DataFrame from the records returned by the API. Note that the rows of data are available under the ["items"][0]["results"] path of the JSON response, where [0] represents the first element of the "items" list. You can preview the structure of the JSON response by opening the request link in your browser.
- Converting the data in the timestamp column to datetime, a dedicated data type for time data. This is one way of extracting month and year information from the returned records.
- Creating new columns for month and year data based on the timestamp column.
- Removing the timestamp column, as it's no longer needed.
- Pivoting the table so that data in the DataFrame is indexed by month, with columns representing years. This makes it easier to perform a year-over-year analysis.
py
editors_df = pd.DataFrame.from_records(editor_numbers["items"][0]["results"])
date = pd.to_datetime(editors_df["timestamp"])
editors_df["month"] = pd.DatetimeIndex(date).month
editors_df["year"] = pd.DatetimeIndex(date).year
editors_df = editors_df.drop(columns=["timestamp"])
editor_df = editors_df.pivot(index="month", columns="year", values="editors")
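To see what the pivot step produces, it can help to run the same transformations on a tiny data set. The sketch below uses four made-up records (not real editor numbers) with an assumed timestamp format, and uses the `.dt` accessor as an alternative to `pd.DatetimeIndex`. After pivoting, subtracting one year column from another gives a simple year-over-year difference for each month.

```python
import pandas as pd

# Made-up records for illustration only; these are not real editor numbers
sample_rows = [
    {"timestamp": "2016-01-01T00:00:00.000Z", "editors": 100},
    {"timestamp": "2016-02-01T00:00:00.000Z", "editors": 110},
    {"timestamp": "2017-01-01T00:00:00.000Z", "editors": 120},
    {"timestamp": "2017-02-01T00:00:00.000Z", "editors": 105},
]

sample_df = pd.DataFrame.from_records(sample_rows)

# Convert the string timestamps to datetime and extract month and year
date = pd.to_datetime(sample_df["timestamp"])
sample_df["month"] = date.dt.month
sample_df["year"] = date.dt.year
sample_df = sample_df.drop(columns=["timestamp"])

# One row per month, one column per year
pivoted = sample_df.pivot(index="month", columns="year", values="editors")

# Year-over-year difference for each month
change = pivoted[2017] - pivoted[2016]

print(pivoted)
print(change)
```

In the pivoted table, each row is a month and each column is a year, which is the layout the plotting step relies on.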
Displaying the plot
With the DataFrame prepared, you can now display the data.
Start by specifying the plot style, in this case bmh. You can learn more about the available plot styles by reading Matplotlib's style sheets reference.
py
plt.style.use("bmh")
Create a set of subplots and prepare to display them as a single plot based on the DataFrame. The only parameter necessary in the editor_df.plot() call is ax, as it tells the plot function to use the set of subplots created earlier. You can experiment with the other parameters, but the values in the example are:
- subplots set to False to display plots for all years on the same diagram
- figsize set to 20,10 to make sure the plot is wide enough to read comfortably
- colormap set to Accent, which is one of the colormaps available in Matplotlib. A colormap defines the colors used to present data on the plot. For more information, see Choosing colormaps in Matplotlib.
With the configuration in place, you can display the plot.
py
fig, ax = plt.subplots()
editor_df.plot(
    subplots=False, figsize=(20, 10), ax=ax, colormap="Accent"
)
plt.show()
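If you run the script in an environment without a display, or want to keep the diagram, one option is to label the axes and save the figure to a file instead of showing it. The following sketch assumes a small pivoted table with made-up values; the axis labels, the title, and the file name `editors_by_year.png` are illustrative choices, not part of the tutorial script.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Made-up pivoted data: rows are months, columns are years
sample = pd.DataFrame(
    {2016: [100, 110, 105], 2017: [120, 105, 115]},
    index=pd.Index([1, 2, 3], name="month"),
)

# Set the figure size when creating the figure and axes
fig, ax = plt.subplots(figsize=(20, 10))

# Draw one line per year on the same axes
sample.plot(ax=ax, colormap="Accent")

# Add labels so the diagram is readable on its own
ax.set_xlabel("Month")
ax.set_ylabel("Editors")
ax.set_title("Editors per month, by year (made-up sample data)")

# Write the figure to a PNG file in the current directory
fig.savefig("editors_by_year.png")
```

Here the figure size is passed to `plt.subplots()` rather than to the plot call; either approach should give a figure of the same size.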
Next steps
To better understand the libraries and data used in this tutorial, be sure to experiment with different parameters in function calls and the request URL.
To learn how to combine data from multiple endpoints to display a single diagram, read the Compare page metrics tutorial.
To learn more about the ecosystem of Python tools and libraries used in data science, explore the links listed in Useful resources.
To see what other endpoints are available to you in the Analytics API, check the API reference pages listed in the menu.
Full script
The full script should look like the following.
py
"""Displays the editors by year plot."""
import matplotlib.pyplot as plt
import pandas as pd
import requests as rq
# Prepare and make a request #
# Specify the user agent
# Be sure to customize this variable according to the access policy
# before running this script
headers = {
    "User-Agent": "GitLab CI automated test (/generated-data-platform/aqs/analytics-api) compare-editor-numbers.py"
}
# Define the request URL
url = """https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/\
en.wikipedia.org/user/content/all-activity-levels/\
monthly/20160101/20240101"""
# Request data from the API
response = rq.get(url, headers=headers)
# Parse the JSON response
editor_numbers = response.json()
# Prepare data #
# Create a pandas DataFrame
editors_df = pd.DataFrame.from_records(editor_numbers["items"][0]["results"])
# Convert the string timestamp to datetime
date = pd.to_datetime(editors_df["timestamp"])
# Create a new column for months
editors_df["month"] = pd.DatetimeIndex(date).month
# Create a new column for years
editors_df["year"] = pd.DatetimeIndex(date).year
# Remove the timestamp column as it's not needed anymore
editors_df = editors_df.drop(columns=["timestamp"])
# Pivot the data frame
editor_df = editors_df.pivot(index="month", columns="year", values="editors")
# Display results #
# Set a plot style
plt.style.use("bmh")
# Create a subplot set
fig, ax = plt.subplots()
# Configure the plot for the DataFrame
editor_df.plot(subplots=False, figsize=(20, 10), ax=ax, colormap="Accent")
# Display the plot
plt.show()
Useful resources
- Python home page
- Conda, package and environment manager popular among data scientists. It's often used to install and manage Python and R packages.
- Matplotlib documentation
- Pandas documentation
- Requests documentation
- JupyterLab and Jupyter Notebook home page
- PAWS, the Jupyter Notebook instance hosted by the Wikimedia Foundation