AO3 and its first data release

Archive of Our Own (AO3, archiveofourown.org) is a fan-created, fan-run, nonprofit, noncommercial archive for transformative fanworks. At the time of writing, it hosts more than 42,750 fandoms, 3,547,000 users, and 7,428,000 works.

On 2021-03-21, the Archive released a selective data dump for fan statisticians. The data comes in two CSV files, described as follows:

The first includes information about works:

creation date
language
word count
restricted or not
complete or not
associated tag IDs

The second provides the key to the tag IDs:

tag ID
tag type (e.g. Warning, Fandom, Relationship)
tag name (unless the tag has fewer than 5 uses)
canonical or not
an approximate number of uses
merger ID (i.e. the tag's canonical version, if it has one)
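
To make the relationship between the two files concrete, here is a minimal sketch of how a work's tag IDs might be resolved through the key. Everything in it is hypothetical: the sample rows, the field names (type, name, canonical, merger_id), and the helper resolve_tag are chosen to mirror the descriptions above, not the actual CSV headers. It also assumes a work's tag IDs are joined by "+", as the sample output later in this post suggests.

```python
# Hypothetical tag key: tag ID -> the fields described in the list above.
# Field names and values are made up for illustration only.
tags = {
    1: {"type": "Fandom", "name": "Example Fandom", "canonical": True, "merger_id": None},
    2: {"type": "Fandom", "name": "Example Fandom (alt)", "canonical": False, "merger_id": 1},
    # A rarely used tag: its name is withheld in the data, so it is None here.
    3: {"type": "Relationship", "name": None, "canonical": False, "merger_id": None},
}


def resolve_tag(tag_id, tags):
    """Return the ID of the tag's canonical version if it has one, else its own ID."""
    merger_id = tags[tag_id]["merger_id"]
    return merger_id if merger_id is not None else tag_id


# A work's tag field, assumed to be tag IDs joined by "+".
work_tags = "2+3"
tag_ids = [int(t) for t in work_tags.split("+")]

# Map each ID to its canonical version where one exists, then look up names.
resolved = [resolve_tag(t, tags) for t in tag_ids]
names = [tags[t]["name"] for t in resolved]
print(resolved, names)
```

Resolving through the merger ID first means that synonymous tags are counted together under their canonical version, which is likely what most tag-level statistics want.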

There are endless possibilities with this data set. We can:

Find the most popular languages
Visualize language trends
Analyze users' posting habits
Look into the seasonality of users' writing habits
Apply text mining and sentiment analysis
Explore word frequency, topic modeling, etc.

Let's start with today's topic: language trends.

The following code is written in Python and executed in a Jupyter Notebook. You can follow along, or check out the GitHub repository for this project.

# Load python libraries
import pandas as pd
# Sneak peek of the data that we're going to work on
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)

	creation date	language	restricted	complete	word_count	tags	Unnamed: 6
0	2021-02-26	en	False	True	388	10+414093+1001939+4577144+1499536+110+4682892+...	NaN
1	2021-02-26	en	False	True	1638	10+20350917+34816907+23666027+23269305+2326930...	NaN
2	2021-02-26	en	False	True	1502	10+10613413+9780526+3763877+3741104+7657229+30...	NaN
3	2021-02-26	en	False	True	100	10+15322+54862755+20595867+32994286+663+471751...	NaN
4	2021-02-26	en	False	True	994	11+721553+54604+1439500+3938423+53483274+54862...	NaN
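
As a rough sketch of where this preview can lead, the snippet below splits the "+"-joined tags field into lists of integer tag IDs and counts works per language. It runs on a tiny made-up sample that only mimics the column names shown above (the rows, including the "zh" one, are not real AO3 data), and the tag_ids column name is an assumption introduced for illustration.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the columns shown above (not real AO3 rows).
sample_csv = io.StringIO(
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-02-26,en,False,True,388,10+414093+110\n"
    "2021-02-25,en,False,True,1638,10+20350917\n"
    "2021-02-25,zh,False,False,1502,11+721553\n"
)
works = pd.read_csv(sample_csv, parse_dates=["creation date"])

# Split the "+"-joined tag field into a list of integer tag IDs per work.
works["tag_ids"] = works["tags"].str.split("+").apply(
    lambda ids: [int(i) for i in ids]
)

# Count works per language, most common first.
language_counts = works["language"].value_counts()
print(language_counts)
```

On the real file, the same steps should apply after swapping sample_csv for the path to the works CSV, though the full data set is large enough that reading only the needed columns may be worthwhile.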
