
Columns, N/A Values, and Simple Data Cleaning

tags: Python Pandas

In this section, we focus on navigating the data set and cleaning missing values.

Loading the File

For details on how to load large csv files in Python, check out Loading CSV Files in Python.

python
# Load Python library
import pandas as pd

# Load file
path="/home/pi/Downloads/works-20210226.csv"
chunker = pd.read_csv(path, chunksize=10000)
works = pd.concat(chunker, ignore_index=True)

Shape Of The Dataframe

We can use shape to get the number of rows and columns in the dataset. Specifically, shape[0] displays the number of rows, and shape[1] displays the number of columns.

python
# The number of rows and columns in the dataframe
works.shape
(7269693, 7)
python
# Number of rows
works.shape[0]
7269693
python
# Number of cols
works.shape[1]
7

As we can see from above, there are 7269693 rows and 7 columns in the dataframe.

Column Names and Selecting Columns

To preview the data set and its columns, we can print out the first few rows. However, sometimes a data set has so many columns that they are difficult to display on the screen. Instead, we can print just the column names.

python
# View col names
works.columns
Index(['creation date', 'language', 'restricted', 'complete', 'word_count',
       'tags', 'Unnamed: 6'],
      dtype='object')

Often, we only need certain columns to work with. There are several ways to select a column; let's select the language column as an example.

python
# Select a single column 
works.language
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object
python
works['language']
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object
python
works.loc[:,'language']
0          en
1          en
2          en
3          en
4          en
           ..
7269688    en
7269689    en
7269690    en
7269691    en
7269692    en
Name: language, Length: 7269693, dtype: object

As we can see from above, the three methods yield exactly the same result. Which one to use is a matter of preference.
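The equivalence of the three selection styles can be checked directly with Series.equals. Here is a minimal sketch using a small stand-in dataframe (the data below is hypothetical, just mirroring the column names in works):

```python
import pandas as pd

# Small stand-in for the works dataframe (hypothetical values)
df = pd.DataFrame({'language': ['en', 'en', 'fr'],
                   'word_count': [388, 1638, 1502]})

# The three selection styles from above
a = df.language            # attribute access
b = df['language']         # bracket indexing
c = df.loc[:, 'language']  # label-based .loc

print(a.equals(b) and b.equals(c))  # True
```

One caveat: attribute access only works when the column name is a valid Python identifier. A column like creation date contains a space, so it must be selected with brackets or .loc.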

Let's select multiple columns at the same time.

python
# Select multiple columns
works[['creation date','language']]
        creation date language
0          2021-02-26       en
1          2021-02-26       en
2          2021-02-26       en
3          2021-02-26       en
4          2021-02-26       en
...               ...      ...
7269688    2008-09-13       en
7269689    2008-09-13       en
7269690    2008-09-13       en
7269691    2008-09-13       en
7269692    2008-09-13       en

[7269693 rows × 2 columns]

python
works.loc[:,['creation date', 'language']]
        creation date language
0          2021-02-26       en
1          2021-02-26       en
2          2021-02-26       en
3          2021-02-26       en
4          2021-02-26       en
...               ...      ...
7269688    2008-09-13       en
7269689    2008-09-13       en
7269690    2008-09-13       en
7269691    2008-09-13       en
7269692    2008-09-13       en

[7269693 rows × 2 columns]

With the two columns on hand, we can proceed with more in-depth analysis. For example, we could:

  • Find the total number of languages on AO3
  • Find the top 5 most popular languages on AO3
  • Visualize language trends over time
  • Analyze users' posting habits

and so on.
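As a taste of what the first two analyses might look like, here is a hedged sketch using nunique and value_counts on a tiny stand-in dataframe (the language values below are made up; the real works frame has millions of rows):

```python
import pandas as pd

# Toy stand-in for the works dataframe (hypothetical data)
works = pd.DataFrame({'language': ['en', 'en', 'zh', 'fr', 'en', 'zh', 'ru']})

# Total number of distinct languages
n_languages = works['language'].nunique()
print(n_languages)  # 4

# Most popular languages, by number of works
top = works['language'].value_counts().head(5)
print(top)
```

value_counts sorts in descending order by default, so head(5) gives the top 5 languages directly.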

But before we can do any analysis, we need to clean the data set.

N/A values

In the real world, a data set may contain missing values (which show up as NaN). To prepare the data for further analysis, we need to detect the null values, single out the affected rows, and eventually drop them.

For more details behind the scenes, check out Handling Missing Data.

python
# Detect null values in data
works.isna().any()
creation date    False
language          True
restricted       False
complete         False
word_count        True
tags              True
Unnamed: 6        True
dtype: bool
python
# Alternatively using isnull(), same results
works.isnull().any()
creation date    False
language          True
restricted       False
complete         False
word_count        True
tags              True
Unnamed: 6        True
dtype: bool

Here it shows that the language, word_count, tags, and Unnamed: 6 columns all contain null values. Let's check what they look like before taking any action.
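Beyond the True/False flags above, isna().sum() counts how many nulls each column has, which tells us how much data we would lose by dropping them. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd
import numpy as np

# Toy frame with a few missing values, mimicking two of the real columns
works = pd.DataFrame({'language': ['en', None, 'en'],
                      'word_count': [388.0, 1638.0, np.nan]})

# Count the null values in each column
null_counts = works.isna().sum()
print(null_counts)
```

This works because isna() returns a boolean frame and sum() treats True as 1, giving a per-column null count.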

python
# Select language column, display only rows containing null values
works['language'][works['language'].isnull()]
73119      NaN
95222      NaN
184633     NaN
211955     NaN
266702     NaN
          ... 
6197226    NaN
6210472    NaN
6216530    NaN
6266535    NaN
6272792    NaN
Name: language, Length: 90, dtype: object
python
# Same method, word_count column
works['word_count'][works['word_count'].isnull()]
1404      NaN
3846      NaN
3976      NaN
5458      NaN
6170      NaN
           ..
6531822   NaN
6559452   NaN
6755516   NaN
6847505   NaN
6897542   NaN
Name: word_count, Length: 2268, dtype: float64

If we decide to drop the rows that have a null value in the language column, we could:

python
# Drop rows with null values in the language column
works.dropna(subset=['language'])
        creation date language  restricted  complete  word_count                                               tags  Unnamed: 6
0          2021-02-26       en       False      True       388.0  10+414093+1001939+4577144+1499536+110+4682892+...         NaN
1          2021-02-26       en       False      True      1638.0  10+20350917+34816907+23666027+23269305+2326930...         NaN
2          2021-02-26       en       False      True      1502.0  10+10613413+9780526+3763877+3741104+7657229+30...         NaN
3          2021-02-26       en       False      True       100.0  10+15322+54862755+20595867+32994286+663+471751...         NaN
4          2021-02-26       en       False      True       994.0  11+721553+54604+1439500+3938423+53483274+54862...         NaN
...               ...      ...         ...       ...         ...                                                ...         ...
7269688    2008-09-13       en        True      True       705.0          78+77+84+101+104+105+106+23+13+16+70+933         NaN
7269689    2008-09-13       en       False      True      1392.0                   78+77+84+107+23+10+16+70+933+616         NaN
7269690    2008-09-13       en       False      True      1755.0  77+78+69+108+109+62+110+23+9+111+16+70+10128+4858         NaN
7269691    2008-09-13       en       False      True      1338.0  112+113+13+114+16+115+101+117+118+119+120+116+...         NaN
7269692    2008-09-13       en       False      True      1836.0  123+124+125+127+128+13+129+14+130+131+132+133+...         NaN

[7269603 rows × 7 columns]

Compared to the 7269693 rows in the original data set, we dropped the 90 rows that contain missing values in the language column, which is exactly what we intended.
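One thing to keep in mind: dropna returns a new dataframe and leaves the original untouched, so the result must be assigned back (or dropped in place with inplace=True) for the cleaning to stick. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy frame with one null language and one null word_count
works = pd.DataFrame({'language': ['en', None, 'fr'],
                      'word_count': [388.0, 1638.0, None]})

# dropna returns a new dataframe; the original still has 3 rows
cleaned = works.dropna(subset=['language'])
print(len(works), len(cleaned))  # 3 2

# To keep the result, assign it back
works = works.dropna(subset=['language'])
```

Note that subset=['language'] only drops rows whose language is null; the row with a missing word_count survives.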