AO3 and its first data release

Archiveofourown.org (AO3) is a fan-created, fan-run, nonprofit, noncommercial archive for transformative fanworks. At the time of this writing, it has more than 42,750 fandoms, 3,547,000 users, and 7,428,000 works.

On 2021-03-21, the Archive released a selective data dump for fan statisticians. The data comes in two CSV files, described as follows:

The first includes information about works:

  • creation date
  • language
  • word count
  • restricted or not
  • complete or not
  • associated tag IDs
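
The associated tag IDs come as one string per work, with the IDs joined by `+` (this is visible in the preview of the works file further down). A minimal sketch of turning that string into one row per tag could look like the following. The frame is a toy stand-in and its tag IDs are invented for illustration; only the `word_count` and `tags` column names come from the preview.

```python
import pandas as pd

# Toy stand-in for the works file; the tag IDs are invented for illustration.
works = pd.DataFrame({
    "word_count": [388, 1638],
    "tags": ["10+101+102", "11+103"],
})

# Split each "+"-separated string into a list, then give every tag its own row.
work_tags = (
    works.assign(tag_id=works["tags"].str.split("+"))
    .explode("tag_id")
    .astype({"tag_id": "int64"})
)

# The index still points at the original work row, so tags can be counted per work.
tags_per_work = work_tags.groupby(level=0)["tag_id"].size()
print(work_tags[["word_count", "tag_id"]])
print(tags_per_work)
```

Keeping the original row index after `explode` makes it easy to link each tag back to its work later on.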

The second provides the key to the tag IDs:

  • tag ID
  • tag type (e.g. Warning, Fandom, Relationship)
  • tag name (unless the tag has fewer than 5 uses)
  • canonical or not
  • an approximate number of uses
  • merger ID (i.e. the tag's canonical version, if it has one)
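
Because a non-canonical tag points to its canonical version through the merger ID, it can help to resolve every tag to a canonical name before counting. The sketch below shows one way to do that. The column names (`id`, `type`, `name`, `canonical`, `merger_id`) and all rows are assumptions made for illustration, not values taken from the real file.

```python
import pandas as pd

# Toy stand-in for the tags file; column names and rows are hypothetical.
tags = pd.DataFrame({
    "id": [10, 101, 102, 103],
    "type": ["Rating", "Fandom", "Freeform", "Freeform"],
    "name": ["Rating A", "Fandom A", "Fluff", "Fluffy"],
    "canonical": [True, True, True, False],
    "merger_id": [float("nan"), float("nan"), float("nan"), 102],
})

# A tag without a merger ID is treated as its own canonical version.
tags["canonical_id"] = tags["merger_id"].fillna(tags["id"]).astype("int64")

# Look up the name of the canonical tag for every row.
id_to_name = tags.set_index("id")["name"]
tags["canonical_name"] = tags["canonical_id"].map(id_to_name)

print(tags[["id", "name", "canonical_id", "canonical_name"]])
```

In this toy data, "Fluffy" resolves to "Fluff", so both spellings would be counted together.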

This data set opens up many possibilities. We can:

  • Find the most popular languages
  • Visualize language trends
  • Analyze users' posting habits
  • Look into the seasonality of users' writing habits
  • Run text mining and sentiment analysis
  • Explore word frequency, topic modeling, etc.
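
As a rough illustration of the first two ideas, the sketch below counts works per language and then per month and language. It uses a small invented frame rather than the real data; the `creation date` and `language` column names follow the preview of the works file.

```python
import pandas as pd

# Toy stand-in for the works file; the rows are invented for illustration.
works = pd.DataFrame({
    "creation date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-10", "2021-02-26"]
    ),
    "language": ["en", "zh", "en", "en", "es"],
})

# Most popular languages: number of works per language, largest first.
language_counts = works["language"].value_counts()

# Language trend: number of works per month and language, one column per language.
monthly = (
    works.groupby([works["creation date"].dt.to_period("M"), "language"])
    .size()
    .unstack(fill_value=0)
)

print(language_counts)
print(monthly)
```

The `monthly` table is already in a shape that `monthly.plot()` could draw as one line per language.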

Let's start with today's topic: language trends.

The following code is written in Python and was run in a Jupyter Notebook. You can follow along or check out the GitHub repository for this project.

```python
# Load Python libraries
import pandas as pd

# Sneak peek at the data we are going to work on
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)
```
   creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6
0  2021-02-26     en        False       True      388         10+414093+1001939+4577144+1499536+110+4682892+...  NaN
1  2021-02-26     en        False       True      1638        10+20350917+34816907+23666027+23269305+2326930...  NaN
2  2021-02-26     en        False       True      1502        10+10613413+9780526+3763877+3741104+7657229+30...  NaN
3  2021-02-26     en        False       True      100         10+15322+54862755+20595867+32994286+663+471751...  NaN
4  2021-02-26     en        False       True      994         11+721553+54604+1439500+3938423+53483274+54862...  NaN
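
The all-NaN `Unnamed: 6` column is what pandas typically produces when every line of a CSV ends with a trailing comma. That is an assumption about this file, not something confirmed here. Under that assumption, a minimal sketch for skipping the empty column and parsing the dates at load time could look like this. An in-memory sample with invented tag IDs stands in for the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the works file, with a trailing comma on every line.
# The tag IDs are invented for illustration.
sample_csv = io.StringIO(
    "creation date,language,restricted,complete,word_count,tags,\n"
    "2021-02-26,en,False,True,388,10+11+12,\n"
    "2021-02-26,en,False,True,1638,10+13,\n"
)

works = pd.read_csv(
    sample_csv,
    # Skip the empty column that pandas names "Unnamed: 6".
    usecols=lambda name: not name.startswith("Unnamed"),
    # Parse the creation date into a datetime column for later grouping.
    parse_dates=["creation date"],
)

print(works.dtypes)
```

The same `usecols` and `parse_dates` arguments can be passed when reading the real file by path.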