Skip to content

5 Most Popular Languages on AO3: Trend

tags: Python Pandas matplotlib datetime

In this section, we'll work with the date and language columns in the data set, and briefly touch base with the datetime object in Pandas.

Loading File

python
# Load Python library
import pandas as pd

# Load file
path="/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)
python
# Select two columns that we're interested in
lang_date = works[['creation date','language']].copy()

# Drop NA values
lang_date = lang_date.dropna()

# Preview of the DataFrame
lang_date
creation datelanguage
02021-02-26en
12021-02-26en
22021-02-26en
32021-02-26en
42021-02-26en
.........
72696882008-09-13en
72696892008-09-13en
72696902008-09-13en
72696912008-09-13en
72696922008-09-13en

7269603 rows × 2 columns

Datetime in Pandas

Now we want to do some data manipulation. Fistly, we want to find the number of works posted in each language every month from the year of 2008 to 2021. To do this, we are going to group the DataFrame by month and by language.

python
# Make sure the date column is in datetime format
lang_date['creation date'] = pd.to_datetime(lang_date['creation date'])
python
# Group the date column by "month" (freq="1M")
# Pd.Grouper(key, freq) is used instead of pd.Series.dt.year
# because it does not aggregate month over multiple years
# Group the language column by counting each language

lang_count = lang_date.groupby([pd.Grouper(key='creation date',freq='1M'),'language']).size().reset_index()
lang_count.columns = ['date','language','count']
lang_count
datelanguagecount
02008-09-30en928
12008-10-31de1
22008-10-31en480
32008-11-30en337
42008-12-31en239
............
34232021-02-28uk28
34242021-02-28vi131
34252021-02-28wuu16
34262021-02-28yue41
34272021-02-28zh7261

3428 rows × 3 columns

Reshaping Dataframe with Pivot_table()

Continue with data manipulation. Here we're only interested in the trend of 5 most popular languages on AO3. From previous post, the 5 languages are 'en','zh','ru','es', and 'fr'.

To make our selection process easier and to pave the road for visualization, we use pivot_table() in Pandas to reshape the DataFrame, using date as the index, and the languages as columns.

Let me show you how the new DataFrame looks like.

python
# Use pivot_table() for easy visualization
# 5 most popular languages as subset
# The list of language is obtained from previous analysis

subset = lang_count.pivot_table(index='date',columns='language',values='count')[['en','zh','ru','es','fr']]
subset
languageenzhruesfr
date
2008-09-30928.0NaNNaNNaNNaN
2008-10-31480.0NaNNaNNaNNaN
2008-11-30337.0NaNNaNNaNNaN
2008-12-31239.0NaNNaNNaNNaN
2009-01-31499.0NaNNaNNaNNaN
..................
2020-10-31141120.06359.02092.02116.0673.0
2020-11-30122796.05201.01591.01791.0770.0
2020-12-31154417.05724.01576.01860.0739.0
2021-01-31147813.06567.03995.02519.0655.0
2021-02-28137125.07261.04265.01947.0803.0

150 rows × 5 columns

Plotting The Trend in Language

Having prepared our data, we can now move on to visualization.

python
# Import libraries
# Top line is Jupyter Notebook specific

%matplotlib inline

import matplotlib.pyplot as plt
python
# Plotting the graph with Pandas
# The .plot() method is a simple wrapper around plt.plot()
# Use figsize to control the size of the graph

ax = subset.plot(subplots=True,figsize=(12,10),grid=False,title="Language Trend on AO3 \n 2008-2021")

png

Note that the graph displays monthly postings in each language from 2008 to 2021. You can see from the graph when users started to post in particular language, and when the interest in that language peaked throughout the years.