Test For Normal Distribution Of Data With Python
One of the first steps in exploratory data analysis is to characterize the data, including how it is distributed. This example shows how to check whether your data is normally distributed in Python, using both a visualization and a calculation from the SciPy library.
Attached, find a CSV file with 130 records of human body temperature readings derived from the Journal of Statistics Education (Shoemaker 1996).
Start by loading the CSV to your site (instructions here). In this example, we'll construct an empirical cumulative distribution function (ECDF) to visualize the distribution of the data.
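For reference, the ECDF has a standard textbook definition: for n measurements x_1, ..., x_n, its value at x is the fraction of measurements that are less than or equal to x. The notation below is the usual one and is not taken from the original article:

```latex
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}
```

Evaluated at the sorted data points, this function rises in steps of 1/n, from 1/n at the smallest measurement to 1 at the largest.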
Most of the work will be done in Python, so for the SQL code, use the following:
select * from [human_body_temperature]
In Python 3.6, start by importing packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
pandas will handle the dataframe; NumPy will calculate a few key statistics, such as the median and standard deviation, and draw random samples from the dataset; matplotlib.pyplot and seaborn will together generate the plot; and SciPy will perform the normality calculation.
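As a rough illustration of the NumPy and SciPy roles described above, here is a minimal sketch. It uses synthetic data rather than the attached CSV, the location and scale values are arbitrary, and the choice of `scipy.stats.normaltest` (the D'Agostino-Pearson test) is an assumption, since the article does not say at this point which SciPy function it uses.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the temperature readings (assumption: NOT the
# article's CSV; loc and scale are arbitrary illustrative values).
rng = np.random.default_rng(42)
sample = rng.normal(loc=98.2, scale=0.7, size=130)

# Key summary statistics computed with NumPy.
median = np.median(sample)
std = np.std(sample, ddof=1)  # ddof=1 gives the sample standard deviation

# Draw a random sample (with replacement) from the dataset.
resample = rng.choice(sample, size=sample.size, replace=True)

# D'Agostino-Pearson normality test; a small p-value suggests the data
# are unlikely to come from a normal distribution.
statistic, p_value = stats.normaltest(sample)

print(f"median={median:.2f}, std={std:.2f}")
print(f"normaltest statistic={statistic:.3f}, p-value={p_value:.3f}")
```

With real data, `sample` would instead be the temperature column of the dataframe; the column name depends on the CSV and is not assumed here.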
Next, let's define a function that will generate plottable points:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n