AO3 and its first data release
Archiveofourown.org (AO3) is a fan-created, fan-run, nonprofit, noncommercial archive for transformative fanworks. At the time of this writing, it has more than 42,750 fandoms, 3,547,000 users, and 7,428,000 works.
On 2021-03-21, the Archive released a selective data dump for fan statisticians. The data comes in two CSV files, described as follows:
The first includes information about works:
- creation date
- language
- word count
- restricted or not
- complete or not
- associated tag IDs
The second provides the key to the tag IDs:
- tag ID
- tag type (e.g. Warning, Fandom, Relationship)
- tag name (unless the tag has fewer than 5 uses)
- canonical or not
- an approximate number of uses
- merger ID (i.e. the tag's canonical version, if it has one)
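To make the relationship between the two files concrete, here is a minimal pandas sketch of resolving a work's tag IDs against the key. It assumes the tag IDs for each work are stored in a single `+`-separated string, as in the released works file. The tags-file column names (`id`, `type`, `name`) and all sample values are assumptions made up for illustration, not taken from the dump.

```python
import pandas as pd

# Tiny stand-in for the works file; only the "tags" column matters here.
works = pd.DataFrame({
    "language": ["en", "fr"],
    "tags": ["10+414093", "11+414093+110"],
})

# Hypothetical stand-in for the tags file (column names and values are assumed).
tags = pd.DataFrame({
    "id": [10, 11, 110, 414093],
    "type": ["Rating", "Rating", "Category", "Fandom"],
    "name": ["Rating A", "Rating B", "Category A", "Fandom A"],
})

# One row per (work, tag ID): split on the literal "+", then explode the lists.
work_tags = (
    works.assign(tag_id=works["tags"].str.split("+"))
    .explode("tag_id")
    .astype({"tag_id": "int64"})
    .rename_axis("work_row")
    .reset_index()
)

# Look up each tag ID in the key; a left join keeps IDs that have no match.
labelled = work_tags.merge(tags, left_on="tag_id", right_on="id", how="left")
print(labelled[["work_row", "tag_id", "type", "name"]])
```

On the real files, works with an empty tag field would need to be dropped or filled before the integer conversion, and tags with fewer than 5 uses would come back with a missing name.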
This data set opens up many possibilities. We can:
- Find the most popular languages
- Visualize language trends
- Analyze users' posting habits
- Look into the seasonality of users' writing habits
- Apply text mining and sentiment analysis
- Explore word frequency, topic modeling, etc.
Let's start with today's topic: language trends.
The following code is written in Python and executed in Jupyter Notebook. You can follow along, or check out the GitHub repository for this project.
```python
# Load Python libraries
import pandas as pd

# Sneak peek of the data that we're going to work on
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)
```
| | creation date | language | restricted | complete | word_count | tags | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 2021-02-26 | en | False | True | 388 | 10+414093+1001939+4577144+1499536+110+4682892+... | NaN |
| 1 | 2021-02-26 | en | False | True | 1638 | 10+20350917+34816907+23666027+23269305+2326930... | NaN |
| 2 | 2021-02-26 | en | False | True | 1502 | 10+10613413+9780526+3763877+3741104+7657229+30... | NaN |
| 3 | 2021-02-26 | en | False | True | 100 | 10+15322+54862755+20595867+32994286+663+471751... | NaN |
| 4 | 2021-02-26 | en | False | True | 994 | 11+721553+54604+1439500+3938423+53483274+54862... | NaN |
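With the columns in view, one possible way to get at a language trend is to count works per month and per language. The sketch below is only an illustration under stated assumptions: it uses the column names from the preview above (`creation date`, `language`) and a few made-up rows in place of the real file, so the numbers mean nothing.

```python
import io

import pandas as pd

# Made-up sample rows that mimic the layout of the works file.
sample_csv = io.StringIO(
    "creation date,language,restricted,complete,word_count,tags\n"
    "2021-01-15,en,False,True,388,10+414093\n"
    "2021-01-20,fr,False,True,1638,10+110\n"
    "2021-02-03,en,False,True,1502,11+414093\n"
    "2021-02-26,en,False,True,100,10+110\n"
    "2021-02-26,zh,False,False,994,11+110\n"
)

# Read only the two columns needed and parse the dates while loading.
works = pd.read_csv(
    sample_csv,
    usecols=["creation date", "language"],
    parse_dates=["creation date"],
)

# Count works per (month, language), then pivot languages into columns.
monthly = (
    works.groupby([works["creation date"].dt.to_period("M"), "language"])
    .size()
    .unstack(fill_value=0)
)
print(monthly)
```

Passing `usecols` also leaves out the empty `Unnamed: 6` column seen in the preview, which keeps memory use down on the full file. To run this on the real data, swap `sample_csv` for the path to the works CSV.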