Loading CSV Files in Python

tags: Python Pandas

We have two fairly large CSV files on hand, 554 MB and 923 MB respectively. Loading either file in full takes a significant amount of time.

We can pass nrows=5 to load only the first 5 rows of a file, just to get an idea of what the data looks like.

Loading First 5 Rows

python
# Import pandas
import pandas as pd
# Load only the first 5 rows to preview the data
pd.read_csv("/home/pi/Downloads/works-20210226.csv", nrows=5)
   creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6
0  2021-02-26     en        False       True      388         10+414093+1001939+4577144+1499536+110+4682892+...  NaN
1  2021-02-26     en        False       True      1638        10+20350917+34816907+23666027+23269305+2326930...  NaN
2  2021-02-26     en        False       True      1502        10+10613413+9780526+3763877+3741104+7657229+30...  NaN
3  2021-02-26     en        False       True      100         10+15322+54862755+20595867+32994286+663+471751...  NaN
4  2021-02-26     en        False       True      994         11+721553+54604+1439500+3938423+53483274+54862...  NaN
python
pd.read_csv("/home/pi/Downloads/tags-20210226.csv", nrows=5)
   id  type   name                                canonical  cached_count  merger_id
0  1   Media  TV Shows                            True       910           NaN
1  2   Media  Movies                              True       1164          NaN
2  3   Media  Books & Literature                  True       134           NaN
3  4   Media  Cartoons & Comics & Graphic Novels  True       166           NaN
4  5   Media  Anime & Manga                       True       501           NaN
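
To try nrows without a large file on disk, here is a minimal sketch that reads from a small in-memory CSV. The csv_text sample and its values are made up for illustration and are not taken from the files above. The column types it prints are inferred from the preview rows only, so they may differ once the whole file is read.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file (100 data rows).
csv_text = "id,type,name\n" + "".join(f"{i},Media,tag{i}\n" for i in range(1, 101))

# nrows=5 tells read_csv to return only the first five data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=5)

print(preview.shape)   # (rows, columns) of the preview
print(preview.dtypes)  # types inferred from these five rows only
```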

Loading Entire File

A few extra steps can reduce memory use while loading and may speed the process up. Reading the file in Jupyter Notebook takes about 54 seconds on my machine, so expect a noticeable wait.

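We'll pass chunksize=10000 so that pandas reads the file 10,000 rows at a time instead of parsing it all in one go, then use pd.concat() to combine the chunks into a single DataFrame. Note that the combined DataFrame still has to fit in memory; chunking only limits how much is parsed at once.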

python
# The file is large, so read it in chunks of 10,000 rows to limit memory use.
# With chunksize set, read_csv returns an iterator of DataFrames.
chunker = pd.read_csv("/home/pi/Downloads/works-20210226.csv", chunksize=10000)
python
# Combine chunks into a dataframe
works = pd.concat(chunker, ignore_index=True)
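
As a quick sanity check of the chunk-then-concatenate pattern, the sketch below runs it on a tiny in-memory CSV and compares the result with a one-shot read. The csv_text data and the chunk size of 10 are assumptions chosen to keep the example small.

```python
import io

import pandas as pd

# Hypothetical sample data: 25 rows, so chunks of 10 give sizes 10, 10 and 5.
csv_text = "id,word_count\n" + "".join(f"{i},{i * 10}\n" for i in range(25))

# With chunksize set, read_csv yields one DataFrame per chunk.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=10)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(chunks, ignore_index=True)

# Read the same data in one go for comparison.
whole = pd.read_csv(io.StringIO(csv_text))
print(len(combined), combined.equals(whole))
```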
python
# First 5 rows
works.head()
   creation date  language  restricted  complete  word_count  tags                                               Unnamed: 6
0  2021-02-26     en        False       True      388.0       10+414093+1001939+4577144+1499536+110+4682892+...  NaN
1  2021-02-26     en        False       True      1638.0      10+20350917+34816907+23666027+23269305+2326930...  NaN
2  2021-02-26     en        False       True      1502.0      10+10613413+9780526+3763877+3741104+7657229+30...  NaN
3  2021-02-26     en        False       True      100.0       10+15322+54862755+20595867+32994286+663+471751...  NaN
4  2021-02-26     en        False       True      994.0       11+721553+54604+1439500+3938423+53483274+54862...  NaN
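
If only a summary is needed rather than the full table, the chunks can also be processed one at a time without concatenating them, which keeps just one chunk in memory at any moment. This is a variation on the approach above, not what was done here. The sketch below assumes a small made-up CSV with language and word_count columns and sums word counts per language.

```python
import io

import pandas as pd

# Hypothetical sample data with two languages.
csv_text = "language,word_count\n" + "".join(
    f"{'en' if i % 3 else 'fr'},{100 + i}\n" for i in range(30)
)

# Running totals per language, updated one chunk at a time.
totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    partial = chunk.groupby("language")["word_count"].sum()
    for language, words in partial.items():
        totals[language] = totals.get(language, 0) + int(words)

print(totals)
```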