Tips on Working with Datetime Index in pandas

As you may understand from the title it is not a complete guide on Time Series or Datetime data type in Python. So if you expect to get in-depth explanation from A to Z it’s a wrong place. Seriously. There is a fantastic article on this topic, well explained, detailed and quite straightforward. Don’t waste your time on this one.

For those who have reached this part I will tell that you will find something useful here for sure. Again, seriously. I found my notes on Time Series and decided to organize it into a little article with general tips, which are aplicable, I guess, in 80 to 90% of times you work with dates. So it’s worth sharing, isn’t it?

I have a dataset with air pollutants measurements for every hour since 2016 in Madrid, so I will use it as an example.

Importing data

By default pandas will use the first column as index while importing csv file with read_csv() , so if your datetime column isn’t first you will need to specify it explicitly index_col='date'.

The beauty of pandas is that it can preprocess your datetime data during import. By specifying parse_dates=True pandas will try parsing the index, if we pass list of ints or names e.g. if [1, 2, 3] – it will try parsing columns 1, 2, 3 each as a separate date column, list of lists e.g. if [[1, 3]] – combine columns 1 and 3 and parse as a single date column, dict, e.g. {‘foo’ : [1, 3]} – parse columns 1, 3 as date and call result ‘foo’. If you are using other method to import data you can always use pd.to_datetime after it.

I have imported my data using the following code:

import pandas as pd
import glob

pattern = 'data/madrid*.csv'
csv_files = glob.glob(pattern)

frames = []

for csv in csv_files:
    df = pd.read_csv(csv, index_col='date', parse_dates=True)
    frames.append(df)

df = pd.concat(frames)
df.head()

Out[4]: 
                     BEN  CH4   CO  EBE  NMHC     NO   NO_2  NOx  O_3  PM10  
date                                                                          
2016-11-01 01:00:00  NaN  NaN  0.7  NaN   NaN  153.0   77.0  NaN  NaN   NaN   
2016-11-01 01:00:00  3.1  NaN  1.1  2.0  0.53  260.0  144.0  NaN  4.0  46.0   
2016-11-01 01:00:00  5.9  NaN  NaN  7.5   NaN  297.0  139.0  NaN  NaN   NaN   
2016-11-01 01:00:00  NaN  NaN  1.0  NaN   NaN  154.0  113.0  NaN  2.0   NaN   
2016-11-01 01:00:00  NaN  NaN  NaN  NaN   NaN  275.0  127.0  NaN  2.0   NaN

The data is gathered from 24 different stations about 14 different pollutants. We are not going to analyze this data, and to make it little bit simpler we will choose only one station, two pollutants and remove all NaN values (DANGER! please, do not repeat it at home).

df_time = df[['O_3', 'PM10']][df['station'] == 28079008].dropna()
df_time.head()
Out[9]: 
                     O_3  PM10