equal plot matplotlib |
https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html |
We’ll now take an in-depth look at the Matplotlib tool for visualization in Python. Matplotlib is a multiplatform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython command line. IPython’s creator, Fernando Perez, was at the time scrambling to finish his PhD, and let John know he wouldn’t have time to review the patch for several months. John took this as a cue to set out on his own, and the Matplotlib package was born, with version 0.1 released in 2003. It received an early boost when it was adopted as the plotting package of choice of the Space Telescope Science Institute (the folks behind the Hubble Telescope), which financially supported Matplotlib’s development and greatly expanded its capabilities.
One of Matplotlib’s most important features is its ability to play well with many operating systems and graphics backends. Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you wish. This cross-platform, everything-to-everyone approach has been one of the great strengths of Matplotlib. It has led to a large userbase, which in turn has led to an active developer base and Matplotlib’s powerful tools and ubiquity within the scientific Python world.
In recent years, however, the interface and style of Matplotlib have begun to show their age. Newer tools like ggplot and ggvis in the R language, along with web visualization toolkits based on D3js and HTML5 canvas, often make Matplotlib feel clunky and old-fashioned. Still, I’m of the opinion that we cannot ignore Matplotlib’s strength as a well-tested, cross-platform graphics engine. Recent Matplotlib versions make it relatively easy to set new global plotting styles (see “Customizing Matplotlib: Configurations and Stylesheets”), and people have been developing new packages that build on its powerful internals to drive Matplotlib via cleaner, more modern APIs: for example, Seaborn (discussed in “Visualization with Seaborn”), ggplot, HoloViews, Altair, and even Pandas itself can be used as wrappers around Matplotlib’s API. Even with wrappers like these, it is still often useful to dive into Matplotlib’s syntax to adjust the final plot output. For this reason, I believe that Matplotlib itself will remain a vital piece of the data visualization stack, even if new tools mean the community gradually moves away from using the Matplotlib API directly.
Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:
In [1]: import matplotlib as mpl
        import matplotlib.pyplot as plt
The plt interface is what we will use most often, as we’ll see throughout this chapter.
We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:
In [2]: plt.style.use('classic')
Throughout this section, we will adjust this style as needed. Note that the stylesheets used here are supported as of Matplotlib version 1.5; if you are using an earlier version of Matplotlib, only the default style is available. For more information on stylesheets, see “Customizing Matplotlib: Configurations and Stylesheets” .
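As a small illustration of the stylesheet mechanism, the following sketch lists the style names installed with Matplotlib and applies the classic style temporarily. It assumes a Matplotlib version with stylesheet support; the Agg backend line is an assumption made only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

# Names of the stylesheets shipped with this Matplotlib installation
available_styles = sorted(plt.style.available)
print(available_styles[:5])

x = np.linspace(0, 10, 100)

# Apply the 'classic' style only inside this block;
# the global style settings are restored afterwards
with plt.style.context("classic"):
    fig, ax = plt.subplots()
    ax.plot(x, np.sin(x))
    classic_facecolor = fig.get_facecolor()
```

Using plt.style.context rather than plt.style.use keeps the change local, which can be convenient when only one figure should differ from the rest.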
A visualization you can’t see won’t be of much use, but just how you view your Matplotlib plots depends on the context. The best use of Matplotlib differs depending on how you are using it; roughly, the three applicable contexts are using Matplotlib in a script, in an IPython terminal, or in an IPython notebook.
If you are using Matplotlib from within a script, the function plt.show() is your friend. plt.show() starts an event loop, looks for all currently active figure objects, and opens one or more interactive windows that display your figure or figures.
So, for example, you may have a file called myplot.py containing the following:
# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))

plt.show()
You can then run this script from the command-line prompt, which will result in a window opening with your figure displayed:
$ python myplot.py
The plt.show() command does a lot under the hood, as it must interact with your system’s interactive graphical backend. The details of this operation can vary greatly from system to system and even installation to installation, but Matplotlib does its best to hide all these details from you.
One thing to be aware of: the plt.show() command should be used only once per Python session, and is most often seen at the very end of the script. Multiple show() commands can lead to unpredictable backend-dependent behavior, and should mostly be avoided.
It can be very convenient to use Matplotlib interactively within an IPython shell (see Chapter 1). IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this mode, you can use the %matplotlib magic command after starting ipython:
In [1]: %matplotlib
Using matplotlib backend: TkAgg
In [2]: import matplotlib.pyplot as plt
At this point, any plt plot command will cause a figure window to open, and further commands can be run to update the plot. Some changes (such as modifying properties of lines that are already drawn) will not draw automatically; to force an update, use plt.draw(). Using plt.show() in Matplotlib mode is not required.
The IPython notebook is a browser-based interactive data analysis tool that can combine narrative, code, graphics, HTML elements, and much more into a single executable document (see Chapter 1 ).
Plotting interactively within an IPython notebook can be done with the %matplotlib command, and works in a similar way to the IPython shell. In the IPython notebook, you also have the option of embedding graphics directly in the notebook, with two possible options:
%matplotlib notebook will lead to interactive plots embedded within the notebook
%matplotlib inline will lead to static images of your plot embedded in the notebook
For this book, we will generally opt for %matplotlib inline:
In [3]: %matplotlib inline
After you run this command (it needs to be done only once per kernel/session), any cell within the notebook that creates a plot will embed a PNG image of the resulting graphic ( Figure 4-1 ):
In [4]: import numpy as np
        x = np.linspace(0, 10, 100)

        fig = plt.figure()
        plt.plot(x, np.sin(x), '-')
        plt.plot(x, np.cos(x), '--');
One nice feature of Matplotlib is the ability to save figures in a wide variety of formats. You can save a figure using the savefig() command. For example, to save the previous figure as a PNG file, you can run this:
In [5]: fig.savefig('my_figure.png')
We now have a file called my_figure.png in the current working directory:
In [6]: !ls -lh my_figure.png
-rw-r--r-- 1 jakevdp staff 16K Aug 11 10:59 my_figure.png
To confirm that it contains what we think it contains, let’s use the IPython Image object to display the contents of this file (Figure 4-2):
In [7]: from IPython.display import Image
        Image('my_figure.png')
In savefig(), the file format is inferred from the extension of the given filename. Depending on what backends you have installed, many different file formats are available. You can find the list of supported file types for your system by using the following method of the figure canvas object:
In [8]: fig.canvas.get_supported_filetypes()
Out[8]: {'eps': 'Encapsulated Postscript', 'jpeg': 'Joint Photographic Experts Group', 'jpg': 'Joint Photographic Experts Group', 'pdf': 'Portable Document Format', 'pgf': 'PGF code for LaTeX', 'png': 'Portable Network Graphics', 'ps': 'Postscript', 'raw': 'Raw RGBA bitmap', 'rgba': 'Raw RGBA bitmap', 'svg': 'Scalable Vector Graphics', 'svgz': 'Scalable Vector Graphics', 'tif': 'Tagged Image File Format', 'tiff': 'Tagged Image File Format'}
Note that when saving your figure, it’s not necessary to use plt.show() or related commands discussed earlier.
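To illustrate how savefig() infers the format from the extension, here is a minimal sketch that writes the same figure as PNG, PDF, and SVG without ever calling plt.show(). The temporary output directory, the dpi value, and the Agg backend are assumptions chosen for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), "-")
ax.plot(x, np.cos(x), "--")

# Hypothetical output location; any writable directory would do
out_dir = tempfile.mkdtemp()

saved_paths = []
for ext in ("png", "pdf", "svg"):
    path = os.path.join(out_dir, "my_figure." + ext)
    fig.savefig(path, dpi=150)  # format is inferred from the extension
    saved_paths.append(path)

# Formats the active backend reports as supported
supported = fig.canvas.get_supported_filetypes()
```

The dpi argument only affects raster output such as PNG; vector formats such as PDF and SVG remain resolution independent.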
A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We’ll quickly highlight the differences between the two here.
Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact. The MATLAB-style tools are contained in the pyplot (plt) interface. For example, the following code will probably look quite familiar to MATLAB users (Figure 4-3):
In [9]: plt.figure()  # create a plot figure

        # create the first of two panels and set current axis
        plt.subplot(2, 1, 1)  # (rows, columns, panel number)
        plt.plot(x, np.sin(x))

        # create the second panel and set current axis
        plt.subplot(2, 1, 2)
        plt.plot(x, np.cos(x));
It’s important to note that this interface is stateful: it keeps track of the “current” figure and axes, which are where all plt commands are applied. You can get a reference to these using the plt.gcf() (get current figure) and plt.gca() (get current axes) routines.
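To make the notion of a “current” figure and axes concrete, the sketch below checks what plt.gcf() and plt.gca() return after two panels are created, and uses plt.sca() to make the first panel current again. The variable names and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig = plt.figure()
top = plt.subplot(2, 1, 1)      # first panel becomes the current axes
plt.plot(x, np.sin(x))
bottom = plt.subplot(2, 1, 2)   # second panel is now the current axes
plt.plot(x, np.cos(x))

# The "current" objects are whatever was created or selected last
current_is_bottom = plt.gca() is bottom
same_figure = plt.gcf() is fig

# Select the first panel again, then add another line to it
plt.sca(top)
plt.plot(x, -np.sin(x), "--")
lines_in_top = len(top.lines)
```

Keeping references such as top and bottom around is already a step toward the object-oriented style, where the target axes is named explicitly in every call.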
While this stateful interface is fast and convenient for simple plots, it is easy to run into problems. For example, once the second panel is created, how can we go back and add something to the first? This is possible within the MATLAB-style interface, but a bit clunky. Fortunately, there is a better way.
The object-oriented interface is available for these more complicated situations, and for when you want more control over your figure. Rather than depending on some notion of an “active” figure or axes, in the object-oriented interface the plotting functions are methods of explicit Figure and Axes objects. To re-create the previous plot using this style of plotting, you might do the following (Figure 4-4):
In [10]: # First create a grid of plots
         # ax will be an array of two Axes objects
         fig, ax = plt.subplots(2)

         # Call plot() method on the appropriate object
         ax[0].plot(x, np.sin(x))
         ax[1].plot(x, np.cos(x));
For simpler plots, the choice of which style to use is largely a matter of preference, but the object-oriented approach can become a necessity as plots become more complicated. Throughout this chapter, we will switch between the MATLAB-style and object-oriented interfaces, depending on what is most convenient. In most cases, the difference is as small as switching plt.plot() to ax.plot(), but there are a few gotchas that we will highlight as they come up in the following sections.
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        # in Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
        plt.style.use('seaborn-whitegrid')
        import numpy as np
For all Matplotlib plots, we start by creating a figure and an axes. In their simplest form, a figure and axes can be created as follows ( Figure 4-5 ):
In [2]: fig = plt.figure()
        ax = plt.axes()
In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. Throughout this book, we’ll commonly use the variable name fig to refer to a figure instance, and ax to refer to an axes instance or group of axes instances.
Once we have created an axes, we can use the ax.plot function to plot some data. Let’s start with a simple sinusoid (Figure 4-6):
In [3]: fig = plt.figure()
        ax = plt.axes()

        x = np.linspace(0, 10, 1000)
        ax.plot(x, np.sin(x));
Alternatively, we can use the pyplot interface and let the figure and axes be created for us in the background (Figure 4-7; see “Two Interfaces for the Price of One” for a discussion of these two interfaces):
In [4]: plt.plot(x, np.sin(x));
If we want to create a single figure with multiple lines, we can simply call the plot function multiple times (Figure 4-8):
In [5]: plt.plot(x, np.sin(x))
        plt.plot(x, np.cos(x));
That’s all there is to plotting simple functions in Matplotlib! We’ll now dive into some more details about how to control the appearance of the axes and lines.
The first adjustment you might wish to make to a plot is to control the line colors and styles. The plt.plot() function takes additional arguments that can be used to specify these. To adjust the color, you can use the color keyword, which accepts a string argument representing virtually any imaginable color. The color can be specified in a variety of ways (Figure 4-9):
In [6]: plt.plot(x, np.sin(x - 0), color='blue')           # specify color by name
        plt.plot(x, np.sin(x - 1), color='g')              # short color code (rgbcmyk)
        plt.plot(x, np.sin(x - 2), color='0.75')           # Grayscale between 0 and 1
        plt.plot(x, np.sin(x - 3), color='#FFDD44')        # Hex code (RRGGBB from 00 to FF)
        plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))  # RGB tuple, values 0 to 1
        plt.plot(x, np.sin(x - 5), color='chartreuse');    # all HTML color names supported
If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines.
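As a rough check on how these different color specifications relate to one another, the following sketch converts each of them to an RGB tuple and a hex string using helpers from matplotlib.colors. The list of specifications simply mirrors the ones used above; the variable names are illustrative.

```python
from matplotlib.colors import to_hex, to_rgb

# The same kinds of color specification that plt.plot() accepts
specs = ["blue", "g", "0.75", "#FFDD44", (1.0, 0.2, 0.3), "chartreuse"]

for spec in specs:
    # to_rgb returns an (r, g, b) tuple with components between 0 and 1
    print(repr(spec), "->", to_rgb(spec), to_hex(spec))

gray_rgb = to_rgb("0.75")     # grayscale string: equal r, g, b components
hex_rgb = to_rgb("#FFDD44")   # hex string: each pair is one 0-255 channel
blue_hex = to_hex("blue")
```

Whatever form is passed to the color keyword, Matplotlib resolves it to the same kind of numeric tuple internally, which is why the forms can be mixed freely in one plot.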
Similarly, you can adjust the line style using the linestyle keyword (Figure 4-10):
In [7]: plt.plot(x, x + 0, linestyle='solid')
        plt.plot(x, x + 1, linestyle='dashed')
        plt.plot(x, x + 2, linestyle='dashdot')
        plt.plot(x, x + 3, linestyle='dotted');

        # For short, you can use the following codes:
        plt.plot(x, x + 4, linestyle='-')   # solid
        plt.plot(x, x + 5, linestyle='--')  # dashed
        plt.plot(x, x + 6, linestyle='-.')  # dashdot
        plt.plot(x, x + 7, linestyle=':');  # dotted
If you would like to be extremely terse, these linestyle and color codes can be combined into a single nonkeyword argument to the plt.plot() function (Figure 4-11):
In [8]: plt.plot(x, x + 0, '-g')   # solid green
        plt.plot(x, x + 1, '--c')  # dashed cyan
        plt.plot(x, x + 2, '-.k')  # dashdot black
        plt.plot(x, x + 3, ':r');  # dotted red
These single-character color codes reflect the standard abbreviations in the RGB (Red/Green/Blue) and CMYK (Cyan/Magenta/Yellow/blacK) color systems, commonly used for digital color graphics.
There are many other keyword arguments that can be used to fine-tune the appearance of the plot; for more details, I’d suggest viewing the docstring of the plt.plot() function using IPython’s help tools (see “Help and Documentation in IPython”).
Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it’s nice to have finer control. The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods (Figure 4-12):
In [9]: plt.plot(x, np.sin(x))

        plt.xlim(-1, 11)
        plt.ylim(-1.5, 1.5);
If for some reason you’d like either axis to be displayed in reverse, you can simply reverse the order of the arguments ( Figure 4-13 ):
In [10]: plt.plot(x, np.sin(x))

         plt.xlim(10, 0)
         plt.ylim(1.2, -1.2);
A useful related method is plt.axis() (note here the potential confusion between axes with an e, and axis with an i). The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies [xmin, xmax, ymin, ymax] (Figure 4-14):
In [11]: plt.plot(x, np.sin(x))
         plt.axis([-1, 11, -1.5, 1.5]);
The plt.axis() method goes even beyond this, allowing you to do things like automatically tighten the bounds around the current plot (Figure 4-15):
In [12]: plt.plot(x, np.sin(x))
         plt.axis('tight');
It allows even higher-level specifications, such as ensuring an equal aspect ratio so that on your screen, one unit in x is equal to one unit in y (Figure 4-16):
In [13]: plt.plot(x, np.sin(x))
         plt.axis('equal');
For more information on axis limits and the other capabilities of the plt.axis() method, refer to the plt.axis() docstring.
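The same limit adjustments can be made on explicit axes objects. The sketch below, which assumes three stacked panels and a headless Agg backend purely for illustration, sets limits, reverses an axis by passing the limits in descending order, and requests equal scaling.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, (ax_limits, ax_reversed, ax_equal) = plt.subplots(3, 1, figsize=(6, 9))

# Explicit limits, equivalent to plt.xlim() / plt.ylim()
ax_limits.plot(x, np.sin(x))
ax_limits.set_xlim(-1, 11)
ax_limits.set_ylim(-1.5, 1.5)

# Passing the limits in descending order displays the axis in reverse
ax_reversed.plot(x, np.sin(x))
ax_reversed.set_xlim(10, 0)

# One unit in x is drawn the same length as one unit in y
ax_equal.plot(x, np.sin(x))
ax_equal.axis("equal")

xlim_set = tuple(float(v) for v in ax_limits.get_xlim())
is_reversed = ax_reversed.xaxis_inverted()
```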
As the last piece of this section, we’ll briefly look at the labeling of plots: titles, axis labels, and simple legends.
Titles and axis labels are the simplest such labels—there are methods that can be used to quickly set them ( Figure 4-17 ):
In [14]: plt.plot(x, np.sin(x))
         plt.title("A Sine Curve")
         plt.xlabel("x")
         plt.ylabel("sin(x)");
You can adjust the position, size, and style of these labels using optional arguments to the function. For more information, see the Matplotlib documentation and the docstrings of each of these functions.
When multiple lines are being shown within a single axes, it can be useful to create a plot legend that labels each line type. Again, Matplotlib has a built-in way of quickly creating such a legend. It is done via the (you guessed it) plt.legend() method. Though there are several valid ways of using this, I find it easiest to specify the label of each line using the label keyword of the plot function (Figure 4-18):
In [15]: plt.plot(x, np.sin(x), '-g', label='sin(x)')
         plt.plot(x, np.cos(x), ':b', label='cos(x)')
         plt.axis('equal')

         plt.legend();
As you can see, the plt.legend() function keeps track of the line style and color, and matches these with the correct label. More information on specifying and formatting plot legends can be found in the plt.legend() docstring; additionally, we will cover some more advanced legend options in “Customizing Plot Legends”.
While most plt functions translate directly to ax methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands. In particular, functions to set limits, labels, and titles are slightly modified. For transitioning between MATLAB-style functions and object-oriented methods, make the following changes:
plt.xlabel() → ax.set_xlabel()
plt.ylabel() → ax.set_ylabel()
plt.xlim() → ax.set_xlim()
plt.ylim() → ax.set_ylim()
plt.title() → ax.set_title()
In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() method to set all these properties at once (Figure 4-19):
In [16]: ax = plt.axes()
         ax.plot(x, np.sin(x))
         ax.set(xlim=(0, 10), ylim=(-2, 2),
                xlabel='x', ylabel='sin(x)',
                title='A Simple Plot');
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape. We’ll start by setting up the notebook for plotting and importing the functions we will use:
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
In the previous section, we looked at plt.plot/ax.plot to produce line plots. It turns out that this same function can produce scatter plots as well (Figure 4-20):
In [2]: x = np.linspace(0, 10, 30)
        y = np.sin(x)

        plt.plot(x, y, 'o', color='black');
The third argument in the function call is a character that represents the type of symbol used for the plotting. Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of short string codes. The full list of available symbols can be seen in the documentation of plt.plot, or in Matplotlib’s online documentation. Most of the possibilities are fairly intuitive, and we’ll show a number of the more common ones here (Figure 4-21):
In [3]: rng = np.random.RandomState(0)
        for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
            plt.plot(rng.rand(5), rng.rand(5), marker,
                     label="marker='{0}'".format(marker))
        plt.legend(numpoints=1)
        plt.xlim(0, 1.8);
For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them ( Figure 4-22 ):
In [4]: plt.plot(x, y, '-ok');  # line (-), circle marker (o), black (k)
Additional keyword arguments to plt.plot specify a wide range of properties of the lines and markers (Figure 4-23):
In [5]: plt.plot(x, y, '-p', color='gray',
                 markersize=15, linewidth=4,
                 markerfacecolor='white',
                 markeredgecolor='gray',
                 markeredgewidth=2)
        plt.ylim(-1.2, 1.2);
This type of flexibility in the plt.plot function allows for a wide variety of possible visualization options. For a full description of the options available, refer to the plt.plot documentation.
A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function (Figure 4-24):
In [6]: plt.scatter(x, y, marker='o');
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.
Let’s show this by creating a random scatter plot with points of many colors and sizes. In order to better see the overlapping results, we’ll also use the alpha keyword to adjust the transparency level (Figure 4-25):
In [7]: rng = np.random.RandomState(0)
        x = rng.randn(100)
        y = rng.randn(100)
        colors = rng.rand(100)
        sizes = 1000 * rng.rand(100)

        plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
                    cmap='viridis')
        plt.colorbar();  # show color scale
Notice that the color argument is automatically mapped to a color scale (shown here by the colorbar() command), and the size argument is given as the marker area in points squared (a point is 1/72 of an inch), not in pixels. In this way, the color and size of points can be used to convey information in the visualization, in order to illustrate multidimensional data.
For example, we might use the Iris data from Scikit-Learn, where each sample is one of three types of flowers that has had the size of its petals and sepals carefully measured ( Figure 4-26 ):
In [8]: from sklearn.datasets import load_iris
        iris = load_iris()
        features = iris.data.T

        plt.scatter(features[0], features[1], alpha=0.2,
                    s=100*features[3], c=iris.target, cmap='viridis')
        plt.xlabel(iris.feature_names[0])
        plt.ylabel(iris.feature_names[1]);
We can see that this scatter plot has given us the ability to simultaneously explore four different dimensions of the data: the (x, y) location of each point corresponds to the sepal length and width, the size of the point is related to the petal width, and the color is related to the particular species of flower. Multicolor and multifeature scatter plots like this can be useful for both exploration and presentation of data.
Aside from the different features available in plt.plot and plt.scatter, why might you choose to use one over the other? While it doesn’t matter as much for small amounts of data, as datasets get larger than a few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The reason is that plt.scatter has the capability to render a different size and/or color for each point, so the renderer must do the extra work of constructing each point individually. In plt.plot, on the other hand, the points are always essentially clones of each other, so the work of determining the appearance of the points is done only once for the entire set of data. For large datasets, the difference between these two can lead to vastly different performance, and for this reason, plt.plot should be preferred over plt.scatter for large datasets.
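One rough way to explore this difference on your own machine is to time how long each function takes to render the same points. The sketch below is only an informal measurement: the point count, marker sizes, the helper name time_draw, and the Agg backend are all assumptions, and the timings will vary by system and Matplotlib version.

```python
import time

import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
n_points = 20000  # hypothetical dataset size
x = rng.randn(n_points)
y = rng.randn(n_points)

def time_draw(draw_points):
    """Return the seconds needed to add the points and render the figure."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    draw_points(ax)
    fig.canvas.draw()  # force the actual rendering work
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed

t_plot = time_draw(lambda ax: ax.plot(x, y, "o", markersize=3))
t_scatter = time_draw(lambda ax: ax.scatter(x, y, s=9))

print("plt.plot:    {:.3f} s".format(t_plot))
print("plt.scatter: {:.3f} s".format(t_scatter))
```

A single run like this is noisy; repeating the measurement several times gives a more trustworthy comparison.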
For any scientific measurement, accurate accounting for errors is nearly as important, if not more important, than accurate reporting of the number itself. For example, imagine that I am using some astrophysical observations to estimate the Hubble Constant, the local measurement of the expansion rate of the universe. I know that the current literature suggests a value of around 71 (km/s)/Mpc, and I measure a value of 74 (km/s)/Mpc with my method. Are the values consistent? The only correct answer, given this information, is this: there is no way to know.
Suppose I augment this information with reported uncertainties: the current literature suggests a value of around 71
In visualization of data and results, showing these errors effectively can make a plot convey much more complete information.
A basic errorbar can be created with a single Matplotlib function call ( Figure 4-27 ):
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
In [2]: x = np.linspace(0, 10, 50)
        dy = 0.8
        y = np.sin(x) + dy * np.random.randn(50)

        plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in plt.plot, outlined in “Simple Line Plots” and “Simple Scatter Plots”.
In addition to these basic options, the errorbar function has many options to fine-tune the outputs. Using these additional options you can easily customize the aesthetics of your errorbar plot. I often find it helpful, especially in crowded plots, to make the errorbars lighter than the points themselves (Figure 4-28):
In [3]: plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
                     ecolor='lightgray', elinewidth=3, capsize=0);
In addition to these options, you can also specify horizontal errorbars (xerr), one-sided errorbars, and many other variants. For more information on the options available, refer to the docstring of plt.errorbar.
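As one example of these variants, the sketch below draws asymmetric vertical errorbars together with a constant horizontal error. The data, the error magnitudes, and the Agg backend are invented for illustration; the key point is that yerr can be a two-row sequence holding the lower and upper errors.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 20)
y = np.sin(x) + 0.2 * rng.randn(20)

# Hypothetical asymmetric uncertainties: one value below and one above each point
lower = 0.1 + 0.2 * rng.rand(20)
upper = 0.1 + 0.4 * rng.rand(20)
dx = 0.15  # constant horizontal uncertainty

fig, ax = plt.subplots()
container = ax.errorbar(x, y,
                        yerr=[lower, upper],  # row 0: lower, row 1: upper
                        xerr=dx,
                        fmt='o', color='black',
                        ecolor='lightgray', elinewidth=3, capsize=0)

yerr_shape = np.asarray([lower, upper]).shape
```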
Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API (see “Introducing Scikit-Learn” for details). This is a method of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty. We won’t delve into the details of Gaussian process regression at this point, but will focus instead on how you might visualize such a continuous error measurement:
In [4]: from sklearn.gaussian_process import GaussianProcess
        # Note: the GaussianProcess class was removed in scikit-learn 0.20;
        # later releases provide GaussianProcessRegressor instead

        # define the model and draw some data
        model = lambda x: x * np.sin(x)
        xdata = np.array([1, 3, 5, 6, 8])
        ydata = model(xdata)

        # Compute the Gaussian process fit
        gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
                             random_start=100)
        gp.fit(xdata[:, np.newaxis], ydata)

        xfit = np.linspace(0, 10, 1000)
        yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
        dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region
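The GaussianProcess class used in this listing is not available in current scikit-learn releases. A minimal sketch of the same fit with GaussianProcessRegressor might look like the following; the RBF kernel and its starting length scale are assumptions that stand in for the original cubic correlation settings, so the fitted curve will not be identical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# define the model and draw some data (same as in the listing above)
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Assumption: an RBF kernel replaces the original 'cubic' correlation model
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
# return_std=True yields the predictive standard deviation directly
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # 2*sigma ~ 95% confidence region
```

The names xfit, yfit, and dyfit are kept the same so that the visualization code that follows can be used unchanged.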
We now have xfit, yfit, and dyfit, which sample the continuous fit to our data. We could pass these to the plt.errorbar function as above, but we don’t really want to plot 1,000 points with 1,000 errorbars. Instead, we can use the plt.fill_between function with a light color to visualize this continuous error (Figure 4-29):
In [5]: # Visualize the result
        plt.plot(xdata, ydata, 'or')
        plt.plot(xfit, yfit, '-', color='gray')

        plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                         color='gray', alpha=0.2)
        plt.xlim(0, 10);
Note what we’ve done here with the fill_between function: we pass an x value, then the lower y-bound, then the upper y-bound, and the result is that the area between these regions is filled.
The resulting figure gives a very intuitive view into what the Gaussian process regression algorithm is doing: in regions near a measured data point, the model is strongly constrained and this is reflected in the small model errors. In regions far from a measured data point, the model is not strongly constrained, and the model errors increase.
For more information on the options available in plt.fill_between() (and the closely related plt.fill() function), see the function docstring or the Matplotlib documentation.
Finally, if this seems a bit too low level for your taste, refer to “Visualization with Seaborn” , where we discuss the Seaborn package, which has a more streamlined API for visualizing this type of continuous errorbar.
Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-coded regions. There are three Matplotlib functions that can be helpful for this task: plt.contour for contour plots, plt.contourf for filled contour plots, and plt.imshow for showing images. This section looks at several examples of using these. We’ll start by setting up the notebook for plotting and importing the functions we will use:
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-white')
        import numpy as np
We’ll start by demonstrating a contour plot using a function z = f(x, y), using the following particular choice for f (we’ve seen this before in “Computation on Arrays: Broadcasting”, when we used it as a motivating example for array broadcasting):
In [2]: def f(x, y):
            return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
In [3]: x = np.linspace(0, 5, 50)
        y = np.linspace(0, 5, 40)

        X, Y = np.meshgrid(x, y)
        Z = f(X, Y)
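It can help to look at the shapes that np.meshgrid produces before plotting. The short sketch below repeats the grid construction and compares it with a broadcasting-based evaluation; the name Z_broadcast is introduced here only for illustration.

```python
import numpy as np

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)   # 50 positions along x
y = np.linspace(0, 5, 40)   # 40 positions along y

# meshgrid returns 2D arrays with one row per y value and one column per x value
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
print(X.shape, Y.shape, Z.shape)

# The same grid of values via broadcasting, without building X and Y explicitly
Z_broadcast = f(x[np.newaxis, :], y[:, np.newaxis])
```

Every row of X is a copy of x and every column of Y is a copy of y, so Z[i, j] holds the function value at (x[j], y[i]).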
Now let’s look at this with a standard line-only contour plot ( Figure 4-30 ):
In [4]: plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented by dashed lines, and positive values by solid lines. Alternatively, you can color-code the lines by specifying a colormap with the cmap argument. Here, we’ll also specify that we want more lines to be drawn: 20 equally spaced intervals within the data range (Figure 4-31):
In [5]: plt.contour(X, Y, Z, 20, cmap='RdGy');
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice for centered data. Matplotlib has a wide range of colormaps available, which you can easily browse in IPython by doing a tab completion on the plt.cm module:
plt.cm.<TAB>
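Outside of an interactive session, a colormap can also be inspected programmatically. The sketch below looks up RdGy by name with plt.get_cmap and samples it at the two ends and the middle of its range; the variable names and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no GUI window needed
import matplotlib.pyplot as plt

# Look up a colormap by name
cmap = plt.get_cmap("RdGy")

# A colormap maps numbers in [0, 1] to RGBA tuples
low_rgba = cmap(0.0)    # red end of the map
mid_rgba = cmap(0.5)    # center of the map
high_rgba = cmap(1.0)   # gray end of the map

print(cmap.name, low_rgba, mid_rgba, high_rgba)
```

Sampling a colormap this way is a quick check of which end corresponds to low values and which to high values before applying it to data.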
Our plot is looking nicer, but the spaces between the lines may be a bit distracting. We can change this by switching to a filled contour plot using the plt.contourf() function (notice the f at the end), which uses largely the same syntax as plt.contour(). Additionally, we’ll add a plt.colorbar() command, which automatically creates an additional axis with labeled color information for the plot (Figure 4-32):
In [6]: plt.contourf(X, Y, Z, 20, cmap='RdGy')
        plt.colorbar();
The colorbar makes it clear that the black regions are “peaks,” while the red regions are “valleys.”
One potential issue with this plot is that it is a bit “splotchy.” That
is, the color steps are discrete rather than continuous, which is not
always what is desired. You could remedy this by setting the number of
contours to a very high number, but this results in a rather inefficient
plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of data as an image. Figure 4-33 shows the result of the following code:
In [7]:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy')
plt.colorbar()
plt.axis('image');
There are a few potential gotchas with imshow(), however:
plt.imshow() doesn’t accept an x and y grid, so you must manually specify the extent [xmin, xmax, ymin, ymax] of the image on the plot.
plt.imshow() by default follows the standard image array definition where the origin is in the upper left, not in the lower left as in most contour plots. This must be changed when showing gridded data.
plt.imshow() will automatically adjust the axis aspect ratio to match the input data; you can change this by setting, for example, plt.axis('image') to make x and y units match.
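The three points above can be sketched with the object-oriented interface. This is an illustrative variant rather than the cell used for the figure: it rebuilds the same Z grid, derives the extent from the coordinate arrays, and uses ax.set_aspect('equal') as one way to make x and y units match.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) ** 10 + np.cos(10 + Y * X) * np.cos(X)

fig, ax = plt.subplots()
# extent maps the array edges to data coordinates [xmin, xmax, ymin, ymax];
# origin='lower' puts row 0 of Z at the bottom, matching contour plots
im = ax.imshow(Z, extent=[x.min(), x.max(), y.min(), y.max()],
               origin='lower', cmap='RdGy')
ax.set_aspect('equal')   # one x unit is drawn as long as one y unit
fig.colorbar(im, ax=ax)
```

Deriving the extent from the coordinate arrays keeps the image aligned with the grid if the sampling range changes later.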
Finally, it can sometimes be useful to combine contour plots and image plots. For example, to create the effect shown in Figure 4-34, we’ll use a partially transparent background image (with transparency set via the alpha parameter) and over-plot contours with labels on the contours themselves (using the plt.clabel() function):
In [8]:
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)

plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy', alpha=0.5)
plt.colorbar();
The combination of these three functions (plt.contour, plt.contourf, and plt.imshow) gives nearly limitless possibilities for displaying this sort of three-dimensional data within a two-dimensional plot. For more information on the options available in these functions, refer to their docstrings. If you are interested in three-dimensional visualizations of this type of data, see “Three-Dimensional Plotting in Matplotlib”.
A simple histogram can be a great first step in understanding a dataset. Earlier, we saw a preview of Matplotlib’s histogram function (see “Comparisons, Masks, and Boolean Logic” ), which creates a basic histogram in one line, once the normal boilerplate imports are done ( Figure 4-35 ):
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# this style is named 'seaborn-v0_8-white' in Matplotlib 3.6 and later
plt.style.use('seaborn-white')

data = np.random.randn(1000)
In [2]:
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here’s an example of a more customized histogram (Figure 4-36):
In [3]:
# density=True replaces the older normed=True, which current Matplotlib no longer accepts
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination of histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of several distributions (Figure 4-37):
In [4]:
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

# density=True replaces the older normed=True
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)

plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
If you would like to simply compute the histogram (that is, count the number of points in a given bin) and not display it, the np.histogram() function is available:
In [5]:
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[ 12 190 468 301 29]
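As a quick check of what np.histogram returns, the sketch below uses its own seeded sample (an assumption made only so the result is repeatable, so the counts differ from the output above). It shows that the bin edges bracket every point, and that density=True rescales the counts so the bar areas sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only so the sketch is repeatable
data = rng.standard_normal(1000)

counts, bin_edges = np.histogram(data, bins=5)
# five bins are delimited by six edges, and every point lands in some bin
print(len(bin_edges), counts.sum())

# density=True rescales the counts so the bar areas sum to 1
density, _ = np.histogram(data, bins=5, density=True)
area = (density * np.diff(bin_edges)).sum()
print(area)
```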
Just as we create histograms in one dimension by dividing the number line into bins, we can also create histograms in two dimensions by dividing points among two-dimensional bins. We’ll take a brief look at several ways to do this here. We’ll start by defining some data: an x and y array drawn from a multivariate Gaussian distribution:
In [6]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
One straightforward way to plot a two-dimensional histogram is to use Matplotlib’s plt.hist2d function (Figure 4-38):
In [12]:
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Just as with plt.hist, plt.hist2d has a number of extra options to fine-tune the plot and the binning, which are nicely outlined in the function docstring. Further, just as plt.hist has a counterpart in np.histogram, plt.hist2d has a counterpart in np.histogram2d, which can be used as follows:
In [8]:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
For the generalization of this histogram binning in dimensions higher than two, see the np.histogramdd function.
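A minimal sketch of np.histogramdd, assuming a hypothetical three-dimensional Gaussian sample that does not appear in the text above:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical sample: 5,000 points in three dimensions, shape (Nsamples, Ndim)
sample = rng.standard_normal((5000, 3))

counts, edges = np.histogramdd(sample, bins=5)
print(counts.shape)               # one axis per dimension
print([len(e) for e in edges])    # bins + 1 edges along each dimension
```

Note that np.histogramdd expects the samples along the first axis, with one column per dimension, and returns the edges as a list with one array per dimension.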
The two-dimensional histogram creates a tessellation of squares across the axes. Another natural shape for such a tessellation is the regular hexagon. For this purpose, Matplotlib provides the plt.hexbin routine, which represents a two-dimensional dataset binned within a grid of hexagons (Figure 4-39):
In [9]:
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
plt.hexbin has a number of interesting options, including the ability to specify weights for each point, and to change the output in each bin to any NumPy aggregate (mean of weights, standard deviation of weights, etc.).
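The weighting options can be sketched as follows. The per-point value w is a hypothetical quantity invented for illustration. The C argument supplies the values, and reduce_C_function selects the aggregate computed in each hexagon.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T
# hypothetical per-point value to aggregate within each hexagon
w = x + y

fig, ax = plt.subplots()
# C supplies the values; reduce_C_function chooses the aggregate per bin
hb = ax.hexbin(x, y, C=w, reduce_C_function=np.mean,
               gridsize=30, cmap='Blues')
fig.colorbar(hb, ax=ax, label='mean of w in bin')
```

Swapping np.mean for np.std or np.sum changes what the color encodes without changing the binning.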
Another common method of evaluating densities in multiple dimensions is kernel density estimation (KDE). This will be discussed more fully in “In-Depth: Kernel Density Estimation”, but for now we’ll simply mention that KDE can be thought of as a way to “smear out” the points in space and add up the result to obtain a smooth function. One extremely quick and simple KDE implementation exists in the scipy.stats package. Here is a quick example of using the KDE on this data (Figure 4-40):
In [10]:
from scipy.stats import gaussian_kde

# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)

# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))

# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
KDE has a smoothing length that effectively slides the knob between detail and smoothness (one example of the ubiquitous bias–variance trade-off). The literature on choosing an appropriate smoothing length is vast: gaussian_kde uses a rule of thumb to attempt to find a nearly optimal smoothing length for the input data.
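To see the knob in action, the sketch below fits two estimators to a hypothetical one-dimensional sample: one with the default rule of thumb and one with a smaller factor passed through the bw_method argument of gaussian_kde. The value 0.1 is arbitrary and chosen only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)

kde_default = gaussian_kde(samples)                # rule-of-thumb bandwidth
kde_narrow = gaussian_kde(samples, bw_method=0.1)  # smaller factor: more detail

# .factor is the multiplier applied to the data's standard deviation
print(kde_default.factor, kde_narrow.factor)

grid = np.linspace(-4, 4, 200)
smooth = kde_default(grid)
wiggly = kde_narrow(grid)
```

Plotting smooth and wiggly against grid shows the trade-off directly: the narrow kernel follows sampling noise, while the default gives a smoother curve.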
Other KDE implementations are available within the SciPy ecosystem, each with its own various strengths and weaknesses; see, for example, sklearn.neighbors.KernelDensity and statsmodels.nonparametric.kernel_density.KDEMultivariate. For visualizations based on KDE, using Matplotlib tends to be overly verbose. The Seaborn library, discussed in “Visualization with Seaborn”, provides a much more terse API for creating KDE-based visualizations.
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw how to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the legend in Matplotlib.
The simplest legend can be created with the plt.legend() command, which automatically creates a legend for any labeled plot elements (Figure 4-41):
In [1]:
import matplotlib.pyplot as plt
plt.style.use('classic')
In [2]:
%matplotlib inline
import numpy as np
In [3]:
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.axis('equal')
leg = ax.legend();
But there are many ways we might want to customize such a legend. For example, we can specify the location and turn off the frame ( Figure 4-42 ):
In [4]:
ax.legend(loc='upper left', frameon=False)
Figure 4-42. A customized plot legend
We can use the ncol argument to specify the number of columns in the legend (Figure 4-43):
In [5]:
ax.legend(frameon=False, loc='lower center', ncol=2)
Figure 4-43. A two-column plot legend
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or change the padding around the text (Figure 4-44):
In [6]:
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
Figure 4-44. A fancybox plot legend
For more information on available legend options, see the plt.legend docstring.
Choosing Elements for the Legend
As we’ve already seen, the legend includes all labeled elements by default. If this is not what is desired, we can fine-tune which elements and labels appear in the legend by using the objects returned by plot commands. The plt.plot() command is able to create multiple lines at once, and returns a list of created line instances. Passing any of these to plt.legend() will tell it which to identify, along with the labels we’d like to specify (Figure 4-45):
In [7]:
y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5))
lines = plt.plot(x, y)

# lines is a list of plt.Line2D instances
plt.legend(lines[:2], ['first', 'second']);
Figure 4-45. Customization of legend elements
I generally find in practice that it is clearer to use the first method, applying labels to the plot elements you’d like to show on the legend (Figure 4-46):
In [8]:
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);
Figure 4-46. Alternative method of customizing legend elements
Notice that by default, the legend ignores all elements without a label attribute set.
Sometimes the legend defaults are not sufficient for the given visualization. For example, perhaps you’re using the size of points to mark certain features of the data, and want to create a legend reflecting this. Here is an example where we’ll use the size of points to indicate populations of California cities. We’d like a legend that specifies the scale of the sizes of the points, and we’ll accomplish this by plotting some labeled data with no entries (Figure 4-47):
In [9]:
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')

# Extract the data we're interested in
lat, lon = cities['latd'], cities['longd']
population, area = cities['population_total'], cities['area_total_km2']

# Scatter the points, using size and color but no label
plt.scatter(lon, lat, label=None,
            c=np.log10(population), cmap='viridis',
            s=area, linewidth=0, alpha=0.5)
plt.axis('equal')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.colorbar(label='log$_{10}$(population)')
plt.clim(3, 7)

# Here we create a legend:
# we'll plot empty lists with the desired size and label
for area in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.3, s=area,
                label=str(area) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
           labelspacing=1, title='City Area')

plt.title('California Cities: Area and Population');
Figure 4-47. Location, geographic size, and population of California cities
The legend will always reference some object that is on the plot, so if we’d like to display a particular shape we need to plot it. In this case, the objects we want (gray circles) are not on the plot, so we fake them by plotting empty lists. Notice too that the legend only lists plot elements that have a label specified.
By plotting empty lists, we create labeled plot objects that are picked up by the legend, and now our legend tells us some useful information. This strategy can be useful for creating more sophisticated visualizations.
Finally, note that for geographic data like this, it would be clearer if we could show state boundaries or other map-specific elements. For this, an excellent choice of tool is Matplotlib’s Basemap add-on toolkit, which we’ll explore in “Geographic Data with Basemap”.
Sometimes when designing a plot you’d like to add multiple legends to the same axes. Unfortunately, Matplotlib does not make this easy: via the standard legend interface, it is only possible to create a single legend for the entire plot. If you try to create a second legend using plt.legend() or ax.legend(), it will simply override the first one. We can work around this by creating a new legend artist from scratch, and then using the lower-level ax.add_artist() method to manually add the second artist to the plot (Figure 4-48):
In [10]:
fig, ax = plt.subplots()

lines = []
styles = ['-', '--', '-.', ':']
x = np.linspace(0, 10, 1000)

for i in range(4):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2),
                     styles[i], color='black')
ax.axis('equal')

# specify the lines and labels of the first legend
ax.legend(lines[:2], ['line A', 'line B'],
          loc='upper right', frameon=False)

# Create the second legend and add the artist manually.
from matplotlib.legend import Legend
leg = Legend(ax, lines[2:], ['line C', 'line D'],
             loc='lower right', frameon=False)
ax.add_artist(leg);
Figure 4-48. A split plot legend
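A variation on the same workaround, sketched here as an alternative rather than as the approach used for the figure, lets ax.legend() build both legends. The first is re-added as a plain artist so the second call does not replace it.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend import Legend

fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)
styles = ['-', '--', '-.', ':']
lines = []
for i, style in enumerate(styles):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), style, color='black')

# create the first legend, then re-add it as a plain artist so that
# the next ax.legend() call does not discard it
first = ax.legend(lines[:2], ['line A', 'line B'], loc='upper right')
ax.add_artist(first)
ax.legend(lines[2:], ['line C', 'line D'], loc='lower right')

n_legends = sum(isinstance(child, Legend) for child in ax.get_children())
print(n_legends)
```

Either way the mechanism is the same: only one legend lives in the axes’ legend slot, and any additional legend has to be attached as an ordinary artist.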
This is a peek into the low-level artist objects that compose any Matplotlib plot. If you examine the source code of ax.legend() (recall that you can do this within the IPython notebook using ax.legend??) you’ll see that the function simply consists of some logic to create a suitable Legend artist, which is then saved in the legend_ attribute and added to the figure when the plot is drawn.
Customizing Colorbars
Plot legends identify discrete labels of discrete points. For continuous labels based on the color of points, lines, or regions, a labeled colorbar can be a great tool. In Matplotlib, a colorbar is a separate axes that can provide a key for the meaning of colors in a plot. Because the book is printed in black and white, this section has an accompanying online appendix where you can view the figures in full color (https://github.com/jakevdp/PythonDataScienceHandbook). We’ll start by setting up the notebook for plotting and importing the functions we will use:
In [1]:
import matplotlib.pyplot as plt
plt.style.use('classic')
In [2]:
%matplotlib inline
import numpy as np
As we have seen several times throughout this section, the simplest colorbar can be created with the plt.colorbar function (Figure 4-49):
In [3]:
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

plt.imshow(I)
plt.colorbar();
Figure 4-49. A simple colorbar legend
We’ll now discuss a few ideas for customizing these colorbars and using them effectively in various situations.
We can specify the colormap using the cmap argument to the plotting function that is creating the visualization (Figure 4-50):
In [4]:
plt.imshow(I, cmap='gray');
Figure 4-50. A grayscale colormap
All the available colormaps are in the plt.cm namespace; using IPython’s tab-completion feature will give you a full list of built-in possibilities:
plt.cm.<TAB>
But being able to choose a colormap is just the first step: more important is how to decide among the possibilities! The choice turns out to be much more subtle than you might initially expect.
A full treatment of color choice within visualization is beyond the scope of this book, but for entertaining reading on this subject and others, see the article “Ten Simple Rules for Better Figures”. Matplotlib’s online documentation also has an interesting discussion of colormap choice.
Broadly, you should be aware of three different categories of colormaps:
Sequential colormaps: These consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps: These usually contain two distinct colors, which show positive and negative deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps: These mix colors with no particular sequence (e.g., rainbow or jet).
Qualitative colormaps often lack a uniform progression in brightness as the scale increases. We can see this by converting the jet colorbar into black and white (Figure 4-51):
In [5]:
from matplotlib.colors import LinearSegmentedColormap

def grayscale_cmap(cmap):
    """Return a grayscale version of the given colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    # convert RGBA to perceived grayscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]

    return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)


def view_colormap(cmap):
    """Plot a colormap with its grayscale equivalent"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    cmap = grayscale_cmap(cmap)
    grayscale = cmap(np.arange(cmap.N))

    fig, ax = plt.subplots(2, figsize=(6, 2),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow([colors], extent=[0, 10, 0, 1])
    ax[1].imshow([grayscale], extent=[0, 10, 0, 1])
In [6]:
view_colormap('jet')
Notice the bright stripes in the grayscale image. Even in full color, this uneven brightness means that the eye will be drawn to certain portions of the color range, which will potentially emphasize unimportant parts of the dataset. It’s better to use a colormap such as viridis (the default as of Matplotlib 2.0), which is specifically constructed to have an even brightness variation across the range. Thus, it not only plays well with our color perception, but also will translate well to grayscale printing (Figure 4-52):
In [7]:
view_colormap('viridis')
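The same comparison can be made numerically. The sketch below introduces a hypothetical helper, luminance, that applies the same brightness weighting to a named colormap and samples it at the start, middle, and end of the range. It uses plt.get_cmap for the lookup.

```python
import numpy as np
import matplotlib.pyplot as plt

def luminance(name):
    """Perceived brightness of each color in the named colormap."""
    cmap = plt.get_cmap(name)
    rgb = cmap(np.arange(cmap.N))[:, :3]
    # same HSP weighting used in grayscale_cmap above
    return np.sqrt(np.dot(rgb ** 2, [0.299, 0.587, 0.114]))

for name in ['jet', 'viridis']:
    lum = luminance(name)
    # brightness at the start, middle, and end of the colormap
    print(name, np.round(lum[[0, len(lum) // 2, -1]], 2))
```

For jet the middle of the range is brighter than either end, whereas for viridis the three samples increase steadily, which is the property that lets it survive conversion to grayscale.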
If you favor rainbow schemes, another good option for continuous data is the cubehelix colormap (Figure 4-53):
In [8]:
view_colormap('cubehelix')
For other situations, such as showing positive and negative deviations from some mean, dual-color colorbars such as RdBu (short for Red-Blue) can be useful. However, as you can see in Figure 4-54, it’s important to note that the positive-negative information will be lost upon translation to grayscale!
In [9]:
view_colormap('RdBu')
We’ll see examples of using some of these color maps as we continue.
There are a large number of colormaps available in Matplotlib; to see a list of them, you can use IPython to explore the plt.cm submodule. For a more principled approach to colors in Python, you can refer to the tools and documentation within the Seaborn library (see “Visualization with Seaborn”).
Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable. The colorbar has some interesting flexibility; for example, we can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top and bottom by setting the extend property. This might come in handy, for example, if you’re displaying an image that is subject to noise (Figure 4-55):
In [10]:
# make noise in 1% of the image pixels
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))

plt.figure(figsize=(10, 3.5))

plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
plt.colorbar()

plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
Notice that in the left panel, the default color limits respond to the noisy pixels, and the range of the noise completely washes out the pattern we are interested in. In the right panel, we manually set the color limits, and add extensions to indicate values that are above or below those limits. The result is a much more useful visualization of our data.
Colormaps are by default continuous, but sometimes you’d like to represent discrete values. The easiest way to do this is to use the plt.cm.get_cmap() function, and pass the name of a suitable colormap along with the number of desired bins (Figure 4-56):
In [11]:
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
The discrete version of a colormap can be used just like any other colormap.
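Depending on your Matplotlib version, plt.cm.get_cmap may be deprecated or unavailable. In that case plt.get_cmap accepts the same name and bin-count arguments. A minimal sketch, rebuilding the image array I from the earlier cell without the added noise:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

# request a 6-color version of the 'Blues' colormap
discrete = plt.get_cmap('Blues', 6)
print(discrete.N)

fig, ax = plt.subplots()
# vmin/vmax set the color limits directly, as plt.clim(-1, 1) does
im = ax.imshow(I, cmap=discrete, vmin=-1, vmax=1)
fig.colorbar(im, ax=ax)
```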
For an example of where this might be useful, let’s look at an interesting visualization of some handwritten digits data. This data is included in Scikit-Learn, and consists of nearly 2,000 8×8 thumbnails showing various handwritten digits.
For now, let’s start by loading the digits data and visualizing several of the example images with plt.imshow() (Figure 4-57):
In [12]:
# load images of the digits 0 through 5 and visualize several of them
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)

fig, ax = plt.subplots(8, 8, figsize=(6, 6))
for i, axi in enumerate(ax.flat):
    axi.imshow(digits.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])
Because each digit is defined by the hue of its 64 pixels, we can consider each digit to be a point lying in 64-dimensional space: each dimension represents the brightness of one pixel. But visualizing relationships in such high-dimensional spaces can be extremely difficult. One way to approach this is to use a dimensionality reduction technique such as manifold learning to reduce the dimensionality of the data while maintaining the relationships of interest. Dimensionality reduction is an example of unsupervised machine learning, and we will discuss it in more detail in “What Is Machine Learning?”.
Deferring the discussion of these details, let’s take a look at a two-dimensional manifold learning projection of this digits data (see “In-Depth: Manifold Learning” for details):
In [13]:
# project the digits into 2 dimensions using IsoMap
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
projection = iso.fit_transform(digits.data)
We’ll use our discrete colormap to view the results, setting the ticks and clim to improve the aesthetics of the resulting colorbar (Figure 4-58):
In [14]:
# plot the results
plt.scatter(projection[:, 0], projection[:, 1], lw=0.1,
            c=digits.target, cmap=plt.cm.get_cmap('cubehelix', 6))
plt.colorbar(ticks=range(6), label='digit value')
plt.clim(-0.5, 5.5)
The projection also gives us some interesting insights on the relationships within the dataset: for example, the ranges of 5 and 3 nearly overlap in this projection, indicating that some handwritten fives and threes are difficult to distinguish, and therefore more likely to be confused by an automated classification algorithm. Other values, like 0 and 1, are more distantly separated, and therefore much less likely to be confused. This observation agrees with our intuition, because 5 and 3 look much more similar than do 0 and 1.
We’ll return to manifold learning and digit classification in Chapter 5.
Sometimes it is helpful to compare different views of data side by side. To this end, Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure. These subplots might be insets, grids of plots, or other more complicated layouts. In this section, we’ll explore four routines for creating subplots in Matplotlib. We’ll start by setting up the notebook for plotting and importing the functions we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
# this style is named 'seaborn-v0_8-white' in Matplotlib 3.6 and later
plt.style.use('seaborn-white')
import numpy as np
The most basic method of creating an axes is to use the plt.axes function. As we’ve seen previously, by default this creates a standard axes object that fills the entire figure. plt.axes also takes an optional argument that is a list of four numbers in the figure coordinate system. These numbers represent [left, bottom, width, height], where the figure coordinate system ranges from 0 at the bottom left of the figure to 1 at the top right of the figure.
For example, we might create an inset axes at the top-right corner of another axes by setting the x and y position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure). Figure 4-59 shows the result of this code:
In [2]:
ax1 = plt.axes()  # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
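To confirm how the four numbers are interpreted, this sketch creates the same inset with fig.add_axes and reads the box back with get_position(). The names main and inset are chosen here for illustration.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
main = fig.add_axes([0.1, 0.1, 0.8, 0.8])        # [left, bottom, width, height]
inset = fig.add_axes([0.65, 0.65, 0.2, 0.2])     # starts at 65% across, 65% up

# get_position() reports the same box in figure coordinates
box = inset.get_position()
print(box.x0, box.y0, box.width, box.height)
```

Because these are fractions of the figure rather than data values, the inset keeps its relative position when the figure is resized.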
The equivalent of this command within the object-oriented interface is fig.add_axes(). Let’s use this to create two vertically stacked axes (Figure 4-60):
In [3]:
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
                   xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
                   ylim=(-1.2, 1.2))

x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));
We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper panel (at position 0.5) matches the top of the lower panel (at position 0.1 + 0.4).
Aligned columns or rows of subplots are a common enough need that Matplotlib has several convenience routines that make them easy to create. The lowest level of these is plt.subplot(), which creates a single subplot within a grid. As you can see, this command takes three integer arguments: the number of rows, the number of columns, and the index of the plot to be created in this scheme, which runs from the upper left to the bottom right (Figure 4-61):
In [4]:
for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)),
             fontsize=18, ha='center')
The command plt.subplots_adjust can be used to adjust the spacing between these plots. The following code (the result of which is shown in Figure 4-62) uses the equivalent object-oriented command, fig.add_subplot():
In [5]:
fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.5, 0.5, str((2, 3, i)),
            fontsize=18, ha='center')
We’ve used the hspace and wspace arguments of plt.subplots_adjust, which specify the spacing along the height and width of the figure, in units of the subplot size (in this case, the space is 40% of the subplot width and height).
The approach just described can become quite tedious when you’re creating a large grid of subplots, especially if you’d like to hide the x- and y-axis labels on the inner plots. For this purpose, plt.subplots() is the easier tool to use (note the s at the end of subplots). Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array. The arguments are the number of rows and number of columns, along with optional keywords sharex and sharey, which allow you to specify the relationships between different axes.
Here we’ll create a 2×3 grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale (Figure 4-63):
In [6]:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
Note that by specifying sharex and sharey, we’ve automatically removed inner labels on the grid to make the plot cleaner. The resulting grid of axes instances is returned within a NumPy array, allowing for convenient specification of the desired axes using standard array indexing notation (Figure 4-64):
In [7]:
# axes are in a two-dimensional array, indexed by [row, col]
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)),
                      fontsize=18, ha='center')
Figure 4-64. Identifying plots in a subplot grid
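The sharing relationships set up by sharex='col' and sharey='row' can also be inspected programmatically. The sketch below is an illustration using get_shared_x_axes() and get_shared_y_axes(). It checks that axes in one column share an x-axis while axes in different columns do not.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
print(ax.shape)   # axes array is indexed [row, col]

# axes in the same column share x; axes in the same row share y
same_col = ax[0, 0].get_shared_x_axes().joined(ax[0, 0], ax[1, 0])
same_row = ax[0, 0].get_shared_y_axes().joined(ax[0, 0], ax[0, 1])
other_col = ax[0, 0].get_shared_x_axes().joined(ax[0, 0], ax[0, 1])
print(same_col, same_row, other_col)
```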
In comparison to plt.subplot(), plt.subplots() is more consistent with Python’s conventional 0-based indexing.
plt.GridSpec: More Complicated Arrangements
To go beyond a regular grid to subplots that span multiple rows and columns, plt.GridSpec() is the best tool. The plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is recognized by the plt.subplot() command. For example, a gridspec for a grid of two rows and three columns with some specified width and height space looks like this:
In [8]:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
From this we can specify subplot locations and extents using the familiar Python slicing syntax (Figure 4-65):
In [9]:
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);
Figure 4-65. Irregular subplots with plt.GridSpec
This type of flexible grid alignment has a wide range of uses. I most often use it when creating multi-axes histogram plots like the one shown here (Figure 4-66):
In [10]:
# Create some normally distributed data
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T

# Set up the axes with gridspec
fig = plt.figure(figsize=(6, 6))
grid = plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)

# scatter points on the main axes
main_ax.plot(x, y, 'ok', markersize=3, alpha=0.2)

# histogram on the attached axes
x_hist.hist(x, 40, histtype='stepfilled',
            orientation='vertical', color='gray')
x_hist.invert_yaxis()

y_hist.hist(y, 40, histtype='stepfilled',
            orientation='horizontal', color='gray')
y_hist.invert_xaxis()
Figure 4-66. Visualizing multidimensional distributions with plt.GridSpec
This type of distribution plotted alongside its margins is common enough that it has its own plotting API in the Seaborn package; see “Visualization with Seaborn” for more details.
Creating a good visualization involves guiding the reader so that the figure tells a story. In some cases, this story can be told in an entirely visual manner, without the need for added text, but in others, small textual cues and labels are necessary. Perhaps the most basic types of annotations you will use are axes labels and titles, but the options go beyond this. Let’s take a look at some data and how we might visualize and annotate it to help convey interesting information. We’ll start by setting up the notebook for plotting and importing the functions we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
# this style is named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
Example: Effect of Holidays on US Births
Let’s return to some data we worked with earlier in “Example: Birthrate Data”, where we generated a plot of average births over the course of the calendar year; as already mentioned, this data can be downloaded at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv.
We’ll start with the same cleaning procedure we used there, and plot the results (Figure 4-67):
In [2]:
births = pd.read_csv('births.csv')

quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

births['day'] = births['day'].astype(int)

births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])
# pd.Timestamp stands in for pd.datetime, which newer pandas releases no longer provide
births_by_date.index = [pd.Timestamp(2012, month, day)
                        for (month, day) in births_by_date.index]
In [3]: fig, ax = plt.subplots(figsize=(12, 4))
        births_by_date.plot(ax=ax);
Figure 4-67. Average daily births by date
When we’re communicating data like this, it is often useful to annotate certain features of the plot to draw the reader’s attention. This can be done manually with the plt.text/ax.text command, which will place text at a particular x/y value (Figure 4-68):
In [4]: fig, ax = plt.subplots(figsize=(12, 4))
        births_by_date.plot(ax=ax)

        # Add labels to the plot
        style = dict(size=10, color='gray')

        ax.text('2012-1-1', 3950, "New Year's Day", **style)
        ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
        ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
        ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
        ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
        ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)

        # Label the axes
        ax.set(title='USA births by day of year (1969-1988)',
               ylabel='average daily births')

        # Format the x axis with centered month labels
        ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
        ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
        ax.xaxis.set_major_formatter(plt.NullFormatter())
        ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
Figure 4-68. Annotated average daily births by date
The ax.text method takes an x position, a y position, a string, and then optional keywords specifying the color, size, style, alignment, and other properties of the text. Here we used ha='right' and ha='center', where ha is short for horizontal alignment. See the docstring of plt.text() and of mpl.text.Text() for more information on available options.
Transforms and Text Position
In the previous example, we anchored our text annotations to data locations. Sometimes it’s preferable to anchor the text to a position on the axes or figure, independent of the data. In Matplotlib, we do this by modifying the transform.
Any graphics display framework needs some scheme for translating between coordinate systems. For example, a data point at (x, y) = (1, 1) needs to somehow be represented at a certain location on the figure, which in turn needs to be represented in pixels on the screen. Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has a well-developed set of tools that it uses internally to perform them (the tools can be explored in the matplotlib.transforms submodule).
The average user rarely needs to worry about the details of these transforms, but it is helpful knowledge to have when considering the placement of text on a figure. There are three predefined transforms that can be useful in this situation:
ax.transData
Transform associated with data coordinates
ax.transAxes
Transform associated with the axes (in units of axes dimensions)
fig.transFigure
Transform associated with the figure (in units of figure dimensions)
Here let’s look at an example of drawing text at various locations using these transforms (Figure 4-69):
In [5]: fig, ax = plt.subplots(facecolor='lightgray')
        ax.axis([0, 10, 0, 10])

        # transform=ax.transData is the default, but we'll specify it anyway
        ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
        ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
        ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
Figure 4-69. Comparing Matplotlib’s coordinate systems
Note that by default, the text is left-aligned and sits on its baseline at the specified coordinates, so it appears above and to the right of them; here the “.” at the beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a fraction of the axes size. The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the others remain stationary (Figure 4-70):
In [6]: ax.set_xlim(0, 2)
        ax.set_ylim(-6, 6)
Figure 4-70. Comparing Matplotlib’s coordinate systems
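The same behavior can be checked numerically rather than by eye. The following is a minimal sketch, assuming a non-interactive Agg backend (my choice, not part of the original example): it asks each transform for the pixel position of a fixed point before and after the limits change.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.axis([0, 10, 0, 10])

# Each transform maps its own coordinate system to display (pixel) coordinates
data_before = ax.transData.transform((1, 5))
axes_before = ax.transAxes.transform((0.5, 0.1))

# Change the axes limits, exactly as in the example above
ax.set_xlim(0, 2)
ax.set_ylim(-6, 6)

data_after = ax.transData.transform((1, 5))
axes_after = ax.transAxes.transform((0.5, 0.1))

print("data (1, 5) in pixels:     ", data_before, "->", data_after)
print("axes (0.5, 0.1) in pixels: ", axes_before, "->", axes_after)
print("axes position unchanged:", np.allclose(axes_before, axes_after))
```

The data-coordinate point moves on screen when the limits change, while the axes-coordinate point stays put, which is why transAxes is usually the more convenient anchor for labels that should not follow the data.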
You can see this behavior more clearly by changing the axes limits interactively; if you are executing this code in a notebook, you can make that happen by changing %matplotlib inline to %matplotlib notebook and using each plot’s menu to interact with the plot.
Arrows and Annotation
Along with tick marks and text, another useful annotation mark is the simple arrow.
Drawing arrows in Matplotlib is often much harder than you might hope. While there is a plt.arrow() function available, I wouldn’t suggest using it; the arrows it creates are SVG objects that will be subject to the varying aspect ratio of your plots, and the result is rarely what the user intended. Instead, I’d suggest using the plt.annotate() function. This function creates some text and an arrow, and the arrows can be very flexibly specified.
Here we’ll use annotate with several of its options (Figure 4-71):
In [7]: %matplotlib inline

        fig, ax = plt.subplots()

        x = np.linspace(0, 20, 1000)
        ax.plot(x, np.cos(x))
        ax.axis('equal')

        ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
                    arrowprops=dict(facecolor='black', shrink=0.05))

        ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
                    arrowprops=dict(arrowstyle="->",
                                    connectionstyle="angle3,angleA=0,angleB=-90"));
Figure 4-71. Annotation examples
The arrow style is controlled through the arrowprops dictionary, which has numerous options available. These options are fairly well documented in Matplotlib’s online documentation, so rather than repeating them here I’ll quickly show some of the possibilities. Let’s demonstrate several of the possible options using the birthrate plot from before (Figure 4-72):
In [8]: fig, ax = plt.subplots(figsize=(12, 4))
        births_by_date.plot(ax=ax)

        # Add labels to the plot
        ax.annotate("New Year's Day", xy=('2012-1-1', 4100), xycoords='data',
                    xytext=(50, -30), textcoords='offset points',
                    arrowprops=dict(arrowstyle="->",
                                    connectionstyle="arc3,rad=-0.2"))

        ax.annotate("Independence Day", xy=('2012-7-4', 4250), xycoords='data',
                    bbox=dict(boxstyle="round", fc="none", ec="gray"),
                    xytext=(10, -40), textcoords='offset points', ha='center',
                    arrowprops=dict(arrowstyle="->"))

        ax.annotate('Labor Day', xy=('2012-9-4', 4850), xycoords='data',
                    ha='center', xytext=(0, -20), textcoords='offset points')
        ax.annotate('', xy=('2012-9-1', 4850), xytext=('2012-9-7', 4850),
                    xycoords='data', textcoords='data',
                    arrowprops={'arrowstyle': '|-|,widthA=0.2,widthB=0.2', })

        ax.annotate('Halloween', xy=('2012-10-31', 4600), xycoords='data',
                    xytext=(-80, -40), textcoords='offset points',
                    arrowprops=dict(arrowstyle="fancy",
                                    fc="0.6", ec="none",
                                    connectionstyle="angle3,angleA=0,angleB=-90"))

        ax.annotate('Thanksgiving', xy=('2012-11-25', 4500), xycoords='data',
                    xytext=(-120, -60), textcoords='offset points',
                    bbox=dict(boxstyle="round4,pad=.5", fc="0.9"),
                    arrowprops=dict(arrowstyle="->",
                                    connectionstyle="angle,angleA=0,angleB=80,rad=20"))

        ax.annotate('Christmas', xy=('2012-12-25', 3850), xycoords='data',
                    xytext=(-30, 0), textcoords='offset points',
                    size=13, ha='right', va="center",
                    bbox=dict(boxstyle="round", alpha=0.1),
                    arrowprops=dict(arrowstyle="wedge,tail_width=0.5", alpha=0.1));

        # Label the axes
        ax.set(title='USA births by day of year (1969-1988)',
               ylabel='average daily births')

        # Format the x axis with centered month labels
        ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
        ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
        ax.xaxis.set_major_formatter(plt.NullFormatter())
        ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));

        ax.set_ylim(3600, 5400);
Figure 4-72. Annotated average birth rates by day
You’ll notice that the specifications of the arrows and text boxes are very detailed: this gives you the power to create nearly any arrow style you wish. Unfortunately, it also means that these sorts of features often must be manually tweaked, a process that can be very time-consuming when one is producing publication-quality graphics! Finally, I’ll note that the preceding mix of styles is by no means best practice for presenting data, but rather included as a demonstration of some of the available options.
More discussion and examples of available arrow and annotation styles can be found in the Matplotlib gallery, in particular http://matplotlib.org/examples/pylab_examples/annotation_demo2.html.
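Because hand-tuning each annotation is tedious, it can help to compute the annotated point from the data and place the label with an offset in points, so that the label keeps its distance from the arrow tip even if the axes limits change. The following is only a sketch under my own assumptions: the search window (3 < x < 9), the offset of (40, 30) points, and the Agg backend are illustrative choices, not values from the examples above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20, 1000)
y = np.cos(x)

# Locate the maximum inside an (arbitrarily chosen) window instead of typing it in
window = (x > 3) & (x < 9)
i_peak = np.argmax(np.where(window, y, -np.inf))

fig, ax = plt.subplots()
ax.plot(x, y)

# xy is given in data coordinates; xytext is an offset from xy measured in points
ann = ax.annotate('local maximum',
                  xy=(x[i_peak], y[i_peak]), xycoords='data',
                  xytext=(40, 30), textcoords='offset points',
                  arrowprops=dict(arrowstyle="->",
                                  connectionstyle="arc3,rad=-0.2"))

print("annotated point:", x[i_peak], y[i_peak])
```

The design choice here is the same one used in the birthrate figure: the arrow tip lives in data coordinates, while the label position is expressed relative to it.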
Matplotlib’s default tick locators and formatters are designed to be generally sufficient in many common situations, but are in no way optimal for every plot. This section will give several examples of adjusting the tick locations and formatting for the particular plot type you’re interested in.
Before we go into examples, it will be best for us to understand further the object hierarchy of Matplotlib plots. Matplotlib aims to have a Python object representing everything that appears on the plot: for example, recall that the figure is the bounding box within which plot elements appear. Each Matplotlib object can also act as a container of sub-objects; for example, each figure can contain one or more axes objects, each of which in turn contain other objects representing plot contents.
The tick marks are no exception. Each axes has attributes xaxis and yaxis, which in turn have attributes that contain all the properties of the lines, ticks, and labels that make up the axes.
Within each axis, there is the concept of a major tick mark and a minor tick mark. As the names would imply, major ticks are usually bigger or more pronounced, while minor ticks are usually smaller. By default, Matplotlib rarely makes use of minor ticks, but one place you can see them is within logarithmic plots (Figure 4-73):
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
In [2]: ax = plt.axes(xscale='log', yscale='log')
Figure 4-73. Example of logarithmic scales and labels
We see here that each major tick shows a large tick mark and a label, while each minor tick shows a smaller tick mark with no label.
We can customize these tick properties (that is, locations and labels) by setting the formatter and locator objects of each axis. Let’s examine these for the x axis of the plot just shown:
In [3]: print(ax.xaxis.get_major_locator())
        print(ax.xaxis.get_minor_locator())
<matplotlib.ticker.LogLocator object at 0x107530cc0>
<matplotlib.ticker.LogLocator object at 0x107530198>
In [4]: print(ax.xaxis.get_major_formatter())
        print(ax.xaxis.get_minor_formatter())
<matplotlib.ticker.LogFormatterMathtext object at 0x107512780>
<matplotlib.ticker.NullFormatter object at 0x10752dc18>
We see that both major and minor tick labels have their locations specified by a LogLocator (which makes sense for a logarithmic plot). Minor ticks, though, have their labels formatted by a NullFormatter; this says that no labels will be shown.
We’ll now show a few examples of setting these locators and formatters for various plots.
Hiding Ticks or Labels
Perhaps the most common tick/label formatting operation is the act of hiding ticks or labels. We can do this using plt.NullLocator() and plt.NullFormatter(), as shown here (Figure 4-74):
In [5]: ax = plt.axes()
        ax.plot(np.random.rand(50))

        ax.yaxis.set_major_locator(plt.NullLocator())
        ax.xaxis.set_major_formatter(plt.NullFormatter())
Figure 4-74. Plot with hidden tick labels (x-axis) and hidden ticks (y-axis)
Notice that we’ve removed the labels (but kept the ticks/gridlines) from the x axis, and removed the ticks (and thus the labels as well) from the y axis. Having no ticks at all can be useful in many situations—for example, when you want to show a grid of images. For instance, consider Figure 4-75, which includes images of different faces, an example often used in supervised machine learning problems (for more information, see “In-Depth: Support Vector Machines”):
In [6]: fig, ax = plt.subplots(5, 5, figsize=(5, 5))
        fig.subplots_adjust(hspace=0, wspace=0)

        # Get some face data from scikit-learn
        from sklearn.datasets import fetch_olivetti_faces
        faces = fetch_olivetti_faces().images

        for i in range(5):
            for j in range(5):
                ax[i, j].xaxis.set_major_locator(plt.NullLocator())
                ax[i, j].yaxis.set_major_locator(plt.NullLocator())
                ax[i, j].imshow(faces[10 * i + j], cmap="bone")
Figure 4-75. Hiding ticks within image plots
Notice that each image has its own axes, and we’ve set the locators to null because the tick values (pixel number in this case) do not convey relevant information for this particular visualization.
Reducing or Increasing the Number of Ticks
One common problem with the default settings is that smaller subplots can end up with crowded labels. We can see this in the plot grid shown in Figure 4-76:
In [7]: fig, ax = plt.subplots(4, 4, sharex=True, sharey=True)
Figure 4-76. A default plot with crowded ticks
Particularly for the x ticks, the numbers nearly overlap, making them quite difficult to decipher. We can fix this with the plt.MaxNLocator(), which allows us to specify the maximum number of ticks that will be displayed. Given this maximum number, Matplotlib will use internal logic to choose the particular tick locations (Figure 4-77):
In [8]: # For every axis, set the x and y major locator
        for axi in ax.flat:
            axi.xaxis.set_major_locator(plt.MaxNLocator(3))
            axi.yaxis.set_major_locator(plt.MaxNLocator(3))
Figure 4-77. Customizing the number of ticks
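To get a feel for what a locator will do before attaching it to an axis, you can ask it directly for the tick values it would pick on a given interval. This is a small sketch, assuming the locator classes are imported from matplotlib.ticker (the same classes that plt.MaxNLocator and plt.MultipleLocator refer to); the interval and the 0.25 spacing are arbitrary choices for illustration.

```python
from matplotlib import ticker

# A locator can be queried on its own, without any figure or axes
max_n = ticker.MaxNLocator(3)            # at most 3 intervals between ticks
max_n_ticks = max_n.tick_values(0, 1)

multiple = ticker.MultipleLocator(0.25)  # ticks at integer multiples of 0.25
multiple_ticks = multiple.tick_values(0, 1)

print("MaxNLocator(3) on [0, 1]:       ", max_n_ticks)
print("MultipleLocator(0.25) on [0, 1]:", multiple_ticks)
```

Note that a locator may return a tick or two just outside the requested interval; Matplotlib clips those when it draws the axis.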
This makes things much cleaner. If you want even more control over the locations of regularly spaced ticks, you might also use plt.MultipleLocator, which we’ll discuss in the following section.
Matplotlib’s default tick formatting can leave a lot to be desired; it works well as a broad default, but sometimes you’d like to do something more. Consider the plot shown in Figure 4-78, a sine and a cosine:
In [9]: # Plot a sine and cosine curve
        fig, ax = plt.subplots()
        x = np.linspace(0, 3 * np.pi, 1000)
        ax.plot(x, np.sin(x), lw=3, label='Sine')
        ax.plot(x, np.cos(x), lw=3, label='Cosine')

        # Set up grid, legend, and limits
        ax.grid(True)
        ax.legend(frameon=False)
        ax.axis('equal')
        ax.set_xlim(0, 3 * np.pi);
Figure 4-78. A default plot with integer ticks
There are a couple of changes we might like to make. First, it’s more natural for this data to space the ticks and grid lines in multiples of π. We can do this by setting a MultipleLocator, which locates ticks at a multiple of the number you provide. For good measure, we’ll add both major and minor ticks, at multiples of π/2 and π/4 respectively (Figure 4-79):
In [10]: ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
         ax.xaxis.set_minor_locator(plt.MultipleLocator(np.pi / 4))
Figure 4-79. Ticks at multiples of pi/2
But now these tick labels look a little bit silly: we can see that they are multiples of π, but the decimal representation does not immediately convey this. To fix this, we can change the tick formatter. There’s no built-in formatter for what we want to do, so we’ll instead use plt.FuncFormatter, which accepts a user-defined function giving fine-grained control over the tick outputs (Figure 4-80):
In [11]: def format_func(value, tick_number):
             # find number of multiples of pi/2
             N = int(np.round(2 * value / np.pi))
             if N == 0:
                 return "0"
             elif N == 1:
                 return r"$\pi/2$"
             elif N == 2:
                 return r"$\pi$"
             elif N % 2 > 0:
                 return r"${0}\pi/2$".format(N)
             else:
                 return r"${0}\pi$".format(N // 2)

         ax.xaxis.set_major_formatter(plt.FuncFormatter(format_func))
Figure 4-80. Ticks with custom labels
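If you need labels for denominators other than 2, the same idea generalizes. The sketch below is one possible variant, not the formatter used for the figure above: pi_fraction_label and its max_denominator parameter are hypothetical names of my own, and the function relies on the standard-library Fraction type to recover a small rational multiple of π from the tick value.

```python
import math
from fractions import Fraction


def pi_fraction_label(value, tick_number=None, max_denominator=12):
    """Return a LaTeX label for `value` as a rational multiple of pi.

    Hypothetical helper: the name and `max_denominator` are illustrative.
    """
    # Approximate value / pi by the closest fraction with a small denominator
    frac = Fraction(float(value) / math.pi).limit_denominator(max_denominator)
    num, den = frac.numerator, frac.denominator
    if num == 0:
        return "0"
    sign = "-" if num < 0 else ""
    num = abs(num)
    coeff = "" if num == 1 else str(num)  # write "pi" rather than "1pi"
    if den == 1:
        return r"${}{}\pi$".format(sign, coeff)
    return r"${}{}\pi/{}$".format(sign, coeff, den)


for v in (0, math.pi / 4, math.pi / 2, math.pi, 3 * math.pi / 2, -math.pi / 4):
    print(v, "->", pi_fraction_label(v))
```

The function takes the same (value, tick_number) arguments as format_func, so it could be passed to plt.FuncFormatter in the same way; the extra keyword argument simply keeps its default.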
This is much better! Notice that we’ve made use of Matplotlib’s LaTeX support, specified by enclosing the string within dollar signs. This is very convenient for display of mathematical symbols and formulae; in this case, "$\pi$" is rendered as the Greek character π.
The plt.FuncFormatter() offers extremely fine-grained control over the appearance of your plot ticks, and comes in very handy when you’re preparing plots for presentation or publication.
Summary of Formatters and Locators
We’ve mentioned a couple of the available formatters and locators. We’ll conclude this section by briefly listing all the built-in locator and formatter options. For more information on any of these, refer to the docstrings or to the Matplotlib online documentation. Each of the following is available in the plt namespace:
Customizing Matplotlib: Configurations and Stylesheets
Matplotlib’s default plot settings are often the subject of complaint among its users. While much is slated to change in the 2.0 Matplotlib release, the ability to customize default settings helps bring the package in line with your own aesthetic preferences.
Here we’ll walk through some of Matplotlib’s runtime configuration (rc) options, and take a look at the newer stylesheets feature, which contains some nice sets of default configurations.
Throughout this chapter, we’ve seen how it is possible to tweak individual plot settings to end up with something that looks a little bit nicer than the default. It’s possible to do these customizations for each individual plot. For example, here is a fairly drab default histogram (Figure 4-81):
In [1]: import matplotlib.pyplot as plt
        plt.style.use('classic')
        import numpy as np

        %matplotlib inline
In [2]: x = np.random.randn(1000)
        plt.hist(x);
Figure 4-81. A histogram in Matplotlib’s default style
We can adjust this by hand to make it a much more visually pleasing plot, shown in Figure 4-82:
In [3]: # use a gray background
        # (the axisbg keyword was removed from Matplotlib; facecolor replaces it)
        ax = plt.axes(facecolor='#E6E6E6')
        ax.set_axisbelow(True)

        # draw solid white grid lines
        plt.grid(color='w', linestyle='solid')

        # hide axis spines
        for spine in ax.spines.values():
            spine.set_visible(False)

        # hide top and right ticks
        ax.xaxis.tick_bottom()
        ax.yaxis.tick_left()

        # lighten ticks and labels
        ax.tick_params(colors='gray', direction='out')
        for tick in ax.get_xticklabels():
            tick.set_color('gray')
        for tick in ax.get_yticklabels():
            tick.set_color('gray')

        # control face and edge color of histogram
        ax.hist(x, edgecolor='#E6E6E6', color='#EE6666');
Figure 4-82. A histogram with manual customizations
This looks better, and you may recognize the look as inspired by the look of the R language’s ggplot visualization package. But this took a whole lot of effort! We definitely do not want to have to do all that tweaking each time we create a plot. Fortunately, there is a way to adjust these defaults once in a way that will work for all plots.
Changing the Defaults: rcParams
Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default styles for every plot element you create. You can adjust this configuration at any time using the plt.rc convenience routine. Let’s see what it looks like to modify the rc parameters so that our default plot will look similar to what we did before.
We’ll start by saving a copy of the current rcParams dictionary, so we can easily reset these changes in the current session:
In [4]: IPython_default = plt.rcParams.copy()
Now we can use the plt.rc function to change some of these settings:
In [5]: from matplotlib import cycler
        colors = cycler('color',
                        ['#EE6666', '#3388BB', '#9988DD',
                         '#EECC55', '#88BB44', '#FFBBBB'])
        plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
               axisbelow=True, grid=True, prop_cycle=colors)
        plt.rc('grid', color='w', linestyle='solid')
        plt.rc('xtick', direction='out', color='gray')
        plt.rc('ytick', direction='out', color='gray')
        plt.rc('patch', edgecolor='#E6E6E6')
        plt.rc('lines', linewidth=2)
With these settings defined, we can now create a plot and see our settings in action (Figure 4-83):
In [6]: plt.hist(x);
Figure 4-83. A customized histogram using rc settings
Let’s see what simple line plots look like with these rc parameters (Figure 4-84):
In [7]: for i in range(4):
            plt.plot(np.random.rand(10))
Figure 4-84. A line plot with customized styles
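Calls to plt.rc change the settings for the rest of the session. When you only want to override a few rc parameters for a single plot, one option is the plt.rc_context context manager, which is not used elsewhere in this section; the sketch below assumes the Agg backend and uses arbitrary example values.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

width_before = plt.rcParams['lines.linewidth']

# Overrides apply only inside the with-block and are undone on exit
with plt.rc_context({'lines.linewidth': 5, 'axes.grid': True}):
    width_inside = plt.rcParams['lines.linewidth']
    fig, ax = plt.subplots()
    line, = ax.plot([0, 1, 2], [0, 1, 0])  # picks up the temporary line width

width_after = plt.rcParams['lines.linewidth']

print("before:", width_before, "inside:", width_inside, "after:", width_after)
print("width of the plotted line:", line.get_linewidth())
```

Artists created inside the block keep the temporary settings, while anything created afterwards uses the session defaults again.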
I find this much more aesthetically pleasing than the default styling. If you disagree with my aesthetic sense, the good news is that you can adjust the rc parameters to suit your own tastes! These settings can be saved in a .matplotlibrc file, which you can read about in the Matplotlib documentation. That said, I prefer to customize Matplotlib using its stylesheets instead.
The version 1.4 release of Matplotlib in August 2014 added a very convenient style module, which includes a number of new default stylesheets, as well as the ability to create and package your own styles. These stylesheets are formatted similarly to the .matplotlibrc files mentioned earlier, but must be named with a .mplstyle extension.
Even if you don’t create your own style, the stylesheets included by default are extremely useful. The available styles are listed in plt.style.available; here I’ll list only the first five for brevity:
In [8]: plt.style.available[:5]
Out[8]: ['fivethirtyeight', 'seaborn-pastel', 'seaborn-whitegrid', 'ggplot', 'grayscale']
The basic way to switch to a stylesheet is to call:
    plt.style.use('stylename')
But keep in mind that this will change the style for the rest of the session! Alternatively, you can use the style context manager, which sets a style temporarily:
    with plt.style.context('stylename'):
        make_a_plot()
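The same context manager also accepts a path to your own stylesheet. As a rough sketch of how that might look, the code below writes a tiny .mplstyle file to a temporary directory and applies it temporarily; the file name mystyle.mplstyle and its two settings are hypothetical, and the Agg backend is my assumption.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical stylesheet contents: one "key: value" rc setting per line
style_text = "lines.linewidth: 3\naxes.grid: True\n"

with tempfile.TemporaryDirectory() as tmpdir:
    style_path = os.path.join(tmpdir, "mystyle.mplstyle")  # hypothetical name
    with open(style_path, "w") as f:
        f.write(style_text)

    default_width = plt.rcParams["lines.linewidth"]
    with plt.style.context(style_path):
        styled_width = plt.rcParams["lines.linewidth"]   # value from the file
    restored_width = plt.rcParams["lines.linewidth"]     # back to the default

print(default_width, styled_width, restored_width)
```

In practice you would keep the .mplstyle file somewhere permanent rather than in a temporary directory; the point here is only that the settings apply inside the with-block and are restored afterwards.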
Let’s create a function that will make two basic types of plot:
In [9]: def hist_and_lines():
            np.random.seed(0)
            fig, ax = plt.subplots(1, 2, figsize=(11, 4))
            ax[0].hist(np.random.randn(1000))
            for i in range(3):
                ax[1].plot(np.random.rand(10))
            ax[1].legend(['a', 'b', 'c'], loc='lower left')
We’ll use this to explore how these plots look using the various built-in styles.
The default style is what we’ve been seeing so far throughout the book; we’ll start with that. First, let’s reset our runtime configuration to the notebook default:
In [10]: # reset rcParams
         plt.rcParams.update(IPython_default);
Now let’s see how it looks (Figure 4-85):
In [11]: hist_and_lines()
Figure 4-85. Matplotlib’s default style
The FiveThirtyEight style mimics the graphics found on the popular FiveThirtyEight website. As you can see in Figure 4-86, it is typified by bold colors, thick lines, and transparent axes.
In [12]: with plt.style.context('fivethirtyeight'):
             hist_and_lines()
Figure 4-86. The FiveThirtyEight style
The ggplot package in the R language is a very popular visualization tool. Matplotlib’s ggplot style mimics the default styles from that package (Figure 4-87):
In [13]: with plt.style.context('ggplot'):
             hist_and_lines()
Figure 4-87. The ggplot style
There is a very nice short online book called Probabilistic Programming and Bayesian Methods for Hackers; it features figures created with Matplotlib, and uses a nice set of rc parameters to create a consistent and visually appealing style throughout the book. This style is reproduced in the bmh stylesheet (Figure 4-88):
In [14]: with plt.style.context('bmh'):
             hist_and_lines()
Figure 4-88. The bmh style
For figures used within presentations, it is often useful to have a dark rather than light background. The dark_background style provides this (Figure 4-89):
In [15]: with plt.style.context('dark_background'):
             hist_and_lines()
Figure 4-89. The dark_background style
Sometimes you might find yourself preparing figures for a print publication that does not accept color figures. For this, the grayscale style, shown in Figure 4-90, can be very useful:
In [16]: with plt.style.context('grayscale'):
             hist_and_lines()
Figure 4-90. The grayscale style
Matplotlib also has stylesheets inspired by the Seaborn library (discussed more fully in “Visualization with Seaborn”). As we will see, these styles are loaded automatically when Seaborn is imported into a notebook. I’ve found these settings to be very nice, and tend to use them as defaults in my own data exploration (see Figure 4-91):
In [17]: import seaborn
         hist_and_lines()
Figure 4-91. Seaborn’s plotting style
With all of these built-in options for various plot styles, Matplotlib becomes much more useful for both interactive visualization and creation of figures for publication. Throughout this book, I will generally use one or more of these style conventions when creating plots.
Three-Dimensional Plotting in Matplotlib
Matplotlib was initially designed with only two-dimensional plotting in mind. Around the time of the 1.0 release, some three-dimensional plotting utilities were built on top of Matplotlib’s two-dimensional display, and the result is a convenient (if somewhat limited) set of tools for three-dimensional data visualization. We enable three-dimensional plots by importing the mplot3d toolkit, included with the main Matplotlib installation (Figure 4-92):
In [1]: from mpl_toolkits import mplot3d
Once this submodule is imported, we can create a three-dimensional axes by passing the keyword projection='3d' to any of the normal axes creation routines:
In [2]: %matplotlib inline
        import numpy as np
        import matplotlib.pyplot as plt
In [3]: fig = plt.figure()
        ax = plt.axes(projection='3d')
Figure 4-92. An empty three-dimensional axes
With this 3D axes enabled, we can now plot a variety of three-dimensional plot types. Three-dimensional plotting is one of the functionalities that benefits immensely from viewing figures interactively rather than statically in the notebook; recall that to use interactive figures, you can use %matplotlib notebook rather than %matplotlib inline when running this code.
Three-Dimensional Points and Lines
The most basic three-dimensional plot is a line or scatter plot created from sets of (x, y, z) triples. In analogy with the more common two-dimensional plots discussed earlier, we can create these using the ax.plot3D and ax.scatter3D functions. The call signature for these is nearly identical to that of their two-dimensional counterparts, so you can refer to “Simple Line Plots” and “Simple Scatter Plots” for more information on controlling the output. Here we’ll plot a trigonometric spiral, along with some points drawn randomly near the line (Figure 4-93):
In [4]: ax = plt.axes(projection='3d')

        # Data for a three-dimensional line
        zline = np.linspace(0, 15, 1000)
        xline = np.sin(zline)
        yline = np.cos(zline)
        ax.plot3D(xline, yline, zline, 'gray')

        # Data for three-dimensional scattered points
        zdata = 15 * np.random.random(100)
        xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
        ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
        ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
Figure 4-93. Points and lines in three dimensions
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page. While the three-dimensional effect is sometimes difficult to see within a static image, an interactive view can lead to some nice intuition about the layout of the points.
Analogous to the contour plots we explored in “Density and Contour Plots”, mplot3d contains tools to create three-dimensional relief plots using the same inputs. Like two-dimensional ax.contour plots, ax.contour3D requires all the input data to be in the form of two-dimensional regular grids, with the Z data evaluated at each point. Here we’ll show a three-dimensional contour diagram of a three-dimensional sinusoidal function (Figure 4-94):
In [5]: def f(x, y):
            return np.sin(np.sqrt(x ** 2 + y ** 2))

        x = np.linspace(-6, 6, 30)
        y = np.linspace(-6, 6, 30)

        X, Y = np.meshgrid(x, y)
        Z = f(X, Y)
In [6]: fig = plt.figure()
        ax = plt.axes(projection='3d')
        ax.contour3D(X, Y, Z, 50, cmap='binary')
        ax.set_xlabel('x')
        ax.set_ylabel('y')
        ax.set_zlabel('z');
Figure 4-94. A three-dimensional contour plot
Sometimes the default viewing angle is not optimal, in which case we can use the view_init method to set the elevation and azimuthal angles. In this example (the result of which is shown in Figure 4-95), we’ll use an elevation of 60 degrees (that is, 60 degrees above the x-y plane) and an azimuth of 35 degrees (that is, rotated 35 degrees counter-clockwise about the z-axis):
In [7]: ax.view_init(60, 35)
Figure 4-95. Adjusting the view angle for a three-dimensional plot
Again, note that we can accomplish this type of rotation interactively by clicking and dragging when using one of Matplotlib’s interactive backends.
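When no interactive backend is available, a scripted alternative is to step through a few viewing angles in a loop. The sketch below assumes the Agg backend and an arbitrary set of azimuths; it only records the angles through the axes’ elev and azim attributes, but the same loop could save each view to a file.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # noqa: F401 (registers the 3D projection)

fig = plt.figure()
ax = plt.axes(projection='3d')

views = []
for azim in (0, 35, 90):  # arbitrary example azimuths, in degrees
    ax.view_init(elev=60, azim=azim)
    # The current view is stored on the axes as ax.elev and ax.azim
    views.append((ax.elev, ax.azim))

print(views)
```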
Wireframes and Surface Plots
Two other types of three-dimensional plots that work on gridded data are wireframes and surface plots. These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualize. Here’s an example using a wireframe (Figure 4-96):
In [8]: fig = plt.figure()
        ax = plt.axes(projection='3d')
        ax.plot_wireframe(X, Y, Z, color='black')
        ax.set_title('wireframe');
Figure 4-96. A wireframe plot
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized (Figure 4-97):
In [9]: ax = plt.axes(projection='3d')
        ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                        cmap='viridis', edgecolor='none')
        ax.set_title('surface');
Figure 4-97. A three-dimensional surface plot
Note that though the grid of values for a surface plot needs to be two-dimensional, it need not be rectilinear. Here is an example of creating a partial polar grid, which when used with plot_surface can give us a slice into the function we’re visualizing (Figure 4-98):
In [10]: r = np.linspace(0, 6, 20)
         theta = np.linspace(-0.9 * np.pi, 0.8 * np.pi, 40)
         r, theta = np.meshgrid(r, theta)

         X = r * np.sin(theta)
         Y = r * np.cos(theta)
         Z = f(X, Y)

         ax = plt.axes(projection='3d')
         ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                         cmap='viridis', edgecolor='none');
Figure 4-98. A polar surface plot
Surface Triangulations
For some applications, the evenly sampled grids required by the preceding routines are overly restrictive and inconvenient. In these situations, the triangulation-based plots can be very useful. What if rather than an even draw from a Cartesian or a polar grid, we instead have a set of random draws?
In [11]: theta = 2 * np.pi * np.random.random(1000)
         r = 6 * np.random.random(1000)
         x = np.ravel(r * np.sin(theta))
         y = np.ravel(r * np.cos(theta))
         z = f(x, y)
We could create a scatter plot of the points to get an idea of the surface we’re sampling from (Figure 4-99):
In [12]:
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);
Figure 4-99. A three-dimensional sampled surface
This leaves a lot to be desired. The function that will help us in this case is ax.plot_trisurf, which creates a surface by first finding a set of triangles formed between adjacent points (the result is shown in Figure 4-100; remember that x, y, and z here are one-dimensional arrays):
In [13]:
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z, cmap='viridis', edgecolor='none');
Figure 4-100. A triangulated surface plot
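To make the "set of triangles formed between adjacent points" concrete, here is a small sketch that builds the triangulation directly with `matplotlib.tri.Triangulation`, which computes a Delaunay triangulation when no triangles are supplied. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from matplotlib.tri import Triangulation

rng = np.random.RandomState(42)

# Random draws from a disk, in the same spirit as the sampled surface above
theta = 2 * np.pi * rng.random_sample(200)
r = 6 * rng.random_sample(200)
x = r * np.sin(theta)
y = r * np.cos(theta)

# With only x and y given, Triangulation computes a Delaunay triangulation
tri = Triangulation(x, y)

# Each row holds the indices of the three points forming one triangle
print(tri.triangles.shape)
```

Each row of `tri.triangles` indexes into `x` and `y`, so the same array can be reused to triangulate any quantity sampled at those points.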
The result is certainly not as clean as when it is plotted with a grid, but the flexibility of such a triangulation allows for some really interesting three-dimensional plots. For example, it is actually possible to plot a three-dimensional Möbius strip using this, as we’ll see next.
Example: Visualizing a Möbius strip
A Möbius strip is similar to a strip of paper glued into a loop with a half-twist. Topologically, it’s quite interesting because despite appearances it has only a single side! Here we will visualize such an object using Matplotlib’s three-dimensional tools. The key to creating the Möbius strip is to think about its parameterization: it’s a two-dimensional strip, so we need two intrinsic dimensions. Let’s call them θ, which ranges from 0 to 2π around the loop, and w, which ranges from –1 to 1 across the width of the strip (the sampling below uses the narrower range –0.25 to 0.25):
In [14]:
theta = np.linspace(0, 2 * np.pi, 30)
w = np.linspace(-0.25, 0.25, 8)
w, theta = np.meshgrid(w, theta)
Now from this parameterization, we must determine the (x, y, z) positions of the embedded strip.
Thinking about it, we might realize that there are two rotations happening: one is the position of the loop about its center (what we’ve called θ), while the other is the twisting of the strip about its axis (we’ll call this φ). For a Möbius strip, we must have the strip make half a twist during a full loop, or Δφ = Δθ/2:
In [15]:
phi = 0.5 * theta
Now we use our recollection of trigonometry to derive the three-dimensional embedding. We’ll define r, the distance of each point from the center, and use this to find the embedded (x, y, z) coordinates:
In [16]:
# radius in x-y plane
r = 1 + w * np.cos(phi)

x = np.ravel(r * np.cos(theta))
y = np.ravel(r * np.sin(theta))
z = np.ravel(w * np.sin(phi))
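One way to convince yourself that this embedding really has a half-twist is to check numerically that the strip closes on itself with its edges swapped: the point at θ = 0 with offset w should coincide with the point at θ = 2π with offset –w. The helper `mobius_point` below is a hypothetical name introduced only for this sketch; it repeats the formulas above for a single (θ, w) pair.

```python
import numpy as np

def mobius_point(theta, w):
    """Embed one (theta, w) pair using the same formulas as above."""
    phi = 0.5 * theta                 # half a twist per full loop
    r = 1 + w * np.cos(phi)           # radius in the x-y plane
    return np.array([r * np.cos(theta),
                     r * np.sin(theta),
                     w * np.sin(phi)])

w = 0.25
start = mobius_point(0.0, w)             # one edge at the start of the loop
end_flipped = mobius_point(2 * np.pi, -w)  # opposite edge after a full loop
end_same = mobius_point(2 * np.pi, w)      # same edge after a full loop

print(np.allclose(start, end_flipped))
print(np.allclose(start, end_same))
```

The first comparison holds and the second does not, which is the single-sided property expressed in coordinates.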
Finally, to plot the object, we must make sure the triangulation is correct. The best way to do this is to define the triangulation within the underlying parameterization, and then let Matplotlib project this triangulation into the three-dimensional space of the Möbius strip. This can be accomplished as follows (Figure 4-101):
In [17]:
# triangulate in the underlying parameterization
from matplotlib.tri import Triangulation
tri = Triangulation(np.ravel(w), np.ravel(theta))

ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z, triangles=tri.triangles,
                cmap='viridis', linewidths=0.2);

ax.set_xlim(-1, 1); ax.set_ylim(-1, 1); ax.set_zlim(-1, 1);
Figure 4-101. Visualizing a Möbius strip
Combining all of these techniques, it is possible to create and display a wide variety of three-dimensional objects and patterns in Matplotlib.
Geographic Data with Basemap
One common type of visualization in data science is that of geographic data. Matplotlib’s main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and often even simple visualizations take much longer to render than you might hope. More modern solutions, such as leaflet or the Google Maps API, may be a better choice for more intensive map visualizations. Still, Basemap is a useful tool for Python users to have in their virtual toolbelts. In this section, we’ll show several examples of the type of map visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you’re using conda you can type this and the package will be downloaded:
$ conda install basemap
We add just a single new import to our standard boilerplate:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines away (the graphics in Figure 4-102 also require the PIL package in Python 2, or the pillow package in Python 3):
In [2]:
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
Figure 4-102. A “bluemarble” projection of the Earth
The meaning of the arguments to Basemap will be discussed momentarily.
The useful thing is that the globe shown here is not a mere image; it is a fully functioning Matplotlib axes that understands spherical coordinates and allows us to easily over-plot data on the map! For example, we can use a different map projection, zoom in to North America, and plot the location of Seattle. We’ll use an etopo image (which shows topographical features both on land and under the ocean) as the map background (Figure 4-103):
In [3]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
Figure 4-103. Plotting data and labels on the map
This gives you a brief glimpse into the sort of geographic visualizations that are possible with just a few lines of Python. We’ll now discuss the features of Basemap in more depth, and provide several examples of visualizing map data. Using these brief examples as building blocks, you should be able to create nearly any map visualization that you desire.
Map Projections
The first thing to decide when you are using maps is which projection to use. You’re probably familiar with the fact that it is impossible to project a spherical map, such as that of the Earth, onto a flat surface without somehow distorting it or breaking its continuity. These projections have been developed over the course of human history, and there are a lot of choices! Depending on the intended use of the map projection, there are certain map features (e.g., direction, area, distance, shape, or other considerations) that are useful to maintain.
The Basemap package implements several dozen such projections, all referenced by a short format code. Here we’ll briefly demonstrate some of the more common ones.
We’ll start by defining a convenience routine to draw our world map along with the longitude and latitude lines:
In [4]:
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # the dictionary values contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude are mapped to horizontal and vertical lines, respectively. This type of mapping represents equatorial regions quite well, but results in extreme distortions near the poles. The spacing of latitude lines varies between different cylindrical projections, leading to different conservation properties, and different distortion near the poles. In Figure 4-104, we show an example of the equidistant cylindrical projection, which chooses a latitude scaling that preserves distances along meridians. Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area (projection='cea') projections.
In [5]:
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Figure 4-104. Equidistant cylindrical projection
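The statement that cylindrical projections differ only in how they space latitude lines can be illustrated with the textbook formulas for a unit sphere. This is a sketch of the underlying math, not of Basemap’s internals; the variable names are introduced here for illustration.

```python
import numpy as np

# A few sample latitudes in degrees, converted to radians
lat = np.array([0.0, 30.0, 60.0, 80.0])
phi = np.radians(lat)

# Vertical map coordinate on a unit sphere for three cylindrical projections
y_equidistant = phi                                   # preserves distance along meridians
y_mercator = np.log(np.tan(np.pi / 4 + phi / 2))      # conformal; stretches toward the poles
y_equal_area = np.sin(phi)                            # preserves area; compresses toward the poles

for row in zip(lat, y_equidistant, y_mercator, y_equal_area):
    print("lat=%4.0f  equidistant=%.3f  mercator=%.3f  equal-area=%.3f" % row)
```

All three agree at the equator; away from it the Mercator coordinate grows fastest and the equal-area coordinate slowest, which is the source of their different distortions near the poles.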
The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain vertical; this can give better properties near the poles of the projection. The Mollweide projection (projection='moll') is one common example of this, in which all meridians are elliptical arcs (Figure 4-105). It is constructed so as to preserve area across the map: though there are distortions near the poles, the area of small patches reflects the true area. Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson (projection='robin') projections.
In [6]:
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
            lat_0=0, lon_0=0)
draw_map(m)
Figure 4-105. The Mollweide projection
The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0) for the desired map.
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, as if you had photographed the Earth from a particular point in space (a point which, for some projections, technically lies within the Earth!). One common example is the orthographic projection (projection='ortho'), which shows one side of the globe as seen from a viewer at a very long distance. Thus, it can show only half the globe at a time. Other perspective-based projections include the gnomonic projection (projection='gnom') and stereographic projection (projection='stere'). These are often the most useful for showing small portions of the map.
Here is an example of the orthographic projection (Figure 4-106):
In [7]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=50, lon_0=0)
draw_map(m);
Figure 4-106. The orthographic projection
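The claim that an orthographic view shows only half the globe follows directly from the standard spherical formulas for this projection. The sketch below assumes a unit sphere and a hypothetical helper named `orthographic`; Basemap’s own implementation uses different scale and origin conventions, so treat this only as an illustration of the geometry.

```python
import numpy as np

def orthographic(lon, lat, lon_0=0.0, lat_0=50.0, radius=1.0):
    """Project (lon, lat) in degrees onto a plane seen from infinitely far away."""
    lam, phi = np.radians(lon), np.radians(lat)
    lam0, phi0 = np.radians(lon_0), np.radians(lat_0)

    x = radius * np.cos(phi) * np.sin(lam - lam0)
    y = radius * (np.cos(phi0) * np.sin(phi)
                  - np.sin(phi0) * np.cos(phi) * np.cos(lam - lam0))

    # Cosine of the angular distance from the map center;
    # negative values lie on the far side of the globe
    cos_c = (np.sin(phi0) * np.sin(phi)
             + np.cos(phi0) * np.cos(phi) * np.cos(lam - lam0))
    return x, y, cos_c >= 0

center = orthographic(0.0, 50.0)       # the point directly below the viewer
antipode = orthographic(180.0, -50.0)  # the point on the opposite side

print(center)
print(antipode)
```

The map center lands at the origin and is visible, while its antipode is flagged as hidden.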
Conic projections
A conic projection projects the map onto a single cone, which is then unrolled. This can lead to very good local properties, but regions far from the focus point of the cone may become very distorted. One example of this is the Lambert conformal conic projection (projection='lcc'), which we saw earlier in the map of North America and which is shown in Figure 4-107. It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing outside of them. Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area (projection='aea') projections. Conic projections, like perspective projections, tend to be good choices for representing small to medium patches of the globe.
In [8]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            lon_0=0, lat_0=50, lat_1=45, lat_2=55,
            width=1.6E7, height=1.2E7)
draw_map(m)
Figure 4-107. The Lambert conformal conic projection
If you’re going to do much with map-based visualizations, I encourage you to read up on other available projections, along with their properties, advantages, and disadvantages. Most likely, they are available in the Basemap package. If you dig deep enough into this topic, you’ll find an incredible subculture of geo-viz geeks who will be ready to argue fervently in support of their favorite projection for any given application!
Earlier we saw the bluemarble() and shadedrelief() methods for projecting global images on the map, as well as the drawparallels() and drawmeridians() methods for drawing lines of constant latitude and longitude. The Basemap package contains a range of useful functions for drawing borders of physical features like continents, oceans, lakes, and rivers, as well as political boundaries such as countries and US states and counties. The following are some of the available drawing functions that you may wish to explore using IPython’s help features:
Physical boundaries and bodies of water
drawcoastlines()
Draw continental coast lines
drawlsmask()
Draw a mask between the land and sea, for use with projecting images on one or the other
drawmapboundary()
Draw the map boundary, including the fill color for oceans
drawrivers()
Draw rivers on the map
fillcontinents()
Fill the continents with a given color; optionally fill lakes with another color
For the boundary-based features, you must set the desired resolution when creating a Basemap image. The resolution argument of the Basemap class sets the level of detail in boundaries, either 'c' (crude), 'l' (low), 'i' (intermediate), 'h' (high), 'f' (full), or None if no boundaries will be used. This choice is important: setting high-resolution boundaries on a global map, for example, can be very slow.
Here’s an example of drawing land/sea boundaries, and the effect of the resolution parameter. We’ll create both a low- and high-resolution map of Scotland’s beautiful Isle of Skye. It’s located at 57.3°N, 6.2°W, and a map of 90,000×120,000 meters (90×120 kilometers) shows it well (Figure 4-108):
In [9]:
fig, ax = plt.subplots(1, 2, figsize=(12, 8))

for i, res in enumerate(['l', 'h']):
    m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2,
                width=90000, height=120000, resolution=res, ax=ax[i])
    m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
    m.drawmapboundary(fill_color="#DDEEFF")
    m.drawcoastlines()
    ax[i].set_title("resolution='{0}'".format(res));
Figure 4-108. Map boundaries at low and high resolution
Notice that the low-resolution coastlines are not suitable for this level of zoom, while high-resolution works just fine. The low level would work just fine for a global view, however, and would be much faster than loading the high-resolution border data for the entire globe! It might require some experimentation to find the correct resolution parameter for a given view; the best route is to start with a fast, low-resolution plot and increase the resolution as needed.
Plotting Data on Maps
Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a map background. For simple plotting and text, any plt function works on the map; you can use the Basemap instance to project latitude and longitude coordinates to (x, y) coordinates for plotting with plt, as we saw earlier in the Seattle example.
In addition to this, there are many map-specific functions available as methods of the Basemap instance. These work very similarly to their standard Matplotlib counterparts, but have an additional Boolean argument latlon, which if set to True allows you to pass raw latitudes and longitudes to the method, rather than projected (x, y) coordinates.
Some of these map-specific methods are:
contour()/contourf()
Draw contour lines or filled contours
imshow()
Draw an image
pcolor()/pcolormesh()
Draw a pseudocolor plot for irregular/regular meshes
plot()
Draw lines and/or markers
scatter()
Draw points with markers
quiver()
Draw vectors
barbs()
Draw wind barbs
drawgreatcircle()
Draw a great circle
We’ll see examples of a few of these as we continue. For more information on these functions, including several example plots, see the online Basemap documentation.
Example: California Cities
Recall that in “Customizing Plot Legends”, we demonstrated the use of size and color in a scatter plot to convey information about the location, size, and population of California cities. Here, we’ll create this plot again, but using Basemap to put the data in context.
We start with loading the data, as we did before:
In [10]:
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')

# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
Next, we set up the map projection, scatter the data, and then create a colorbar and legend (Figure 4-109):
In [11]:
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h',
            lat_0=37.5, lon_0=-119,
            width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
          c=np.log10(population), s=area,
          cmap='Reds', alpha=0.5)

# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)

# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a,
                label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
           labelspacing=1, loc='lower left');
Figure 4-109. Scatter plot over a map background
This shows us roughly where larger populations of people have settled in California: they are clustered near the coast in the Los Angeles and San Francisco areas, stretched along the highways in the flat central valley, and avoiding almost completely the mountainous regions along the borders of the state.
Example: Surface Temperature Data
As an example of visualizing some more continuous geographic data, let’s consider the “polar vortex” that hit the eastern half of the United States in January 2014. A great source for any sort of climatic data is NASA’s Goddard Institute for Space Studies. Here we’ll use the GIS 250 temperature data, which we can download using shell commands (these commands may have to be modified on Windows machines). The data used here was downloaded on 6/12/2016, and the file size is approximately 9 MB:
In [12]:
# !curl -O http://data.giss.nasa.gov/pub/gistemp/gistemp250.nc.gz
# !gunzip gistemp250.nc.gz
The data comes in NetCDF format, which can be read in Python by the netCDF4 library. You can install this library as shown here:
$ conda install netcdf4
We read the data as follows:
In [13]:
from netCDF4 import Dataset
data = Dataset('gistemp250.nc')
The file contains many global temperature readings on a variety of dates; we need to select the index of the date we’re interested in—in this case, January 15, 2014:
In [14]:
from netCDF4 import date2index
from datetime import datetime
timeindex = date2index(datetime(2014, 1, 15),
                       data.variables['time'])
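Conceptually, selecting "the index of the date we’re interested in" is a lookup into a sorted time axis. The sketch below reproduces that idea with plain NumPy on a synthetic monthly axis; the start year, end date, and mid-month timestamps are assumptions made for illustration and are not read from the actual GISTEMP file.

```python
import numpy as np

# Synthetic time axis: one timestamp on the 15th of every month (assumed layout)
months = np.arange('1880-01', '2016-06', dtype='datetime64[M]')
times = months.astype('datetime64[D]') + np.timedelta64(14, 'D')

target = np.datetime64('2014-01-15')

# searchsorted finds the insertion position in the sorted axis;
# we then confirm that the date is actually present at that position
timeindex = int(np.searchsorted(times, target))
assert times[timeindex] == target

print(timeindex)
```

With the real file, `date2index` plays this role while also handling the calendar and units stored in the NetCDF time variable.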
Now we can load the latitude and longitude data, as well as the temperature anomaly for this index:
In [15]:
lat = data.variables['lat'][:]
lon = data.variables['lon'][:]
lon, lat = np.meshgrid(lon, lat)
temp_anomaly = data.variables['tempanomaly'][timeindex]
Finally, we’ll use the pcolormesh() method to draw a color mesh of the data. We’ll look at North America, and use a shaded relief map in the background. Note that for this data we specifically chose a divergent colormap, which has a neutral color at zero and two contrasting colors at negative and positive values (Figure 4-110). We’ll also lightly draw the coastlines over the colors for reference:
In [16]:
fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='lcc', resolution='c',
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.shadedrelief(scale=0.5)
m.pcolormesh(lon, lat, temp_anomaly,
             latlon=True, cmap='RdBu_r')
plt.clim(-8, 8)
m.drawcoastlines(color='lightgray')

plt.title('January 2014 Temperature Anomaly')
plt.colorbar(label='temperature anomaly (°C)');
The data paints a picture of the localized, extreme temperature anomalies that happened during that month. The eastern half of the United States was much colder than normal, while the western half and Alaska were much warmer. Regions with no recorded temperature show the map background.
Figure 4-110. The temperature anomaly in January 2014
Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. There are several valid complaints about Matplotlib that often come up:
Prior to version 2.0, Matplotlib’s defaults were not exactly the best choices. They were based on MATLAB circa 1999, and this often shows.
Matplotlib’s API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a lot of boilerplate code.
Matplotlib predates Pandas by several years, and thus is not designed for use with Pandas DataFrames. In order to visualize data from a Pandas DataFrame, you must extract each Series and often concatenate them together into the right format. It would be nicer to have a plotting library that can intelligently use the DataFrame labels in a plot.
An answer to these problems is Seaborn. Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.
To be fair, the Matplotlib team is addressing this: it has added the plt.style tools (discussed in “Customizing Matplotlib: Configurations and Stylesheets”), and is starting to handle Pandas data more seamlessly. The 2.0 release of the library introduces a new default stylesheet that improves on the earlier status quo. But for all the reasons just discussed, Seaborn remains an extremely useful add-on.
Seaborn Versus Matplotlib
Here is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors. We start with the typical imports:
In [1]:
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
import pandas as pd
Now we create some random walk data:
In [2]:
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)
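The random-walk construction rests on one idea: a cumulative sum of independent steps along the first axis. A short sketch, with the step array kept in a separate (illustrative) variable so the relationship can be checked:

```python
import numpy as np

rng = np.random.RandomState(0)

# 500 time steps for each of 6 independent walkers
steps = rng.randn(500, 6)

# Accumulating along axis 0 turns each column of steps into a walk
walks = np.cumsum(steps, axis=0)

# The first position equals the first step, and the last position
# equals the total of all steps taken by that walker
print(np.allclose(walks[0], steps[0]))
print(np.allclose(walks[-1], steps.sum(axis=0)))
```

Because each column is an independent walk, passing the two-dimensional array to `plt.plot` draws six lines at once.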
And do a simple plot (Figure 4-111):
In [3]:
# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
Figure 4-111. Data in Matplotlib’s default style
Although the result contains all the information we’d like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit old-fashioned in the context of 21st-century data visualization.
Now let’s take a look at how it works with Seaborn. As we will see, Seaborn has many of its own high-level plotting routines, but it can also overwrite Matplotlib’s default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn’s set() method. By convention, Seaborn is imported as sns:
In [4]:
import seaborn as sns
sns.set()
Now let’s rerun the same two lines as before (Figure 4-112):
In [5]:
# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
Figure 4-112. Data in Seaborn’s default style
Ah, much better!
Exploring Seaborn Plots
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
Let’s take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood), but the Seaborn API is much more convenient.
Histograms, KDE, and densities
Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib (Figure 4-113):
In [6]:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

for col in 'xy':
    # density=True replaces the normed=True argument of older Matplotlib releases
    plt.hist(data[col], density=True, alpha=0.5)
Figure 4-113. Histograms for visualizing distributions
Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with sns.kdeplot (Figure 4-114):
In [7]:
for col in 'xy':
    sns.kdeplot(data[col], shade=True)
Figure 4-114. Kernel density estimates for visualizing distributions
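For readers who want to see what a kernel density estimate computes, here is a hand-rolled one-dimensional version in NumPy: one Gaussian bump per sample, averaged over all samples. The bandwidth uses Scott’s rule of thumb, which is an assumption of this sketch rather than a statement about what `sns.kdeplot` does internally.

```python
import numpy as np

rng = np.random.RandomState(0)
samples = rng.randn(500)

# Scott's rule of thumb for the kernel width (an assumption of this sketch)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Evaluate the estimate on a regular grid wide enough to cover the samples
grid = np.linspace(-6, 6, 601)

# Standardized distance from every grid point to every sample
z = (grid[:, None] - samples[None, :]) / bandwidth

# Sum of Gaussian kernels, normalized so the result integrates to one
density = np.exp(-0.5 * z ** 2).sum(axis=1)
density /= len(samples) * bandwidth * np.sqrt(2 * np.pi)

# Simple Riemann-sum check of the normalization
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, which is the main tuning decision in any KDE.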
Histograms and KDE can be combined using distplot (Figure 4-115):
In [8]:
sns.distplot(data['x'])
sns.distplot(data['y']);
Figure 4-115. Kernel density and histograms plotted together
If we pass the full two-dimensional dataset to kdeplot, we will get a two-dimensional visualization of the data (Figure 4-116):
In [9]:
sns.kdeplot(data);
Figure 4-116. A two-dimensional kernel density plot
We can see the joint distribution and the marginal distributions together using sns.jointplot. For this plot, we’ll set the style to a white background (Figure 4-117):
In [10]:
with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='kde');
Figure 4-117. A joint distribution plot with a two-dimensional kernel density estimate
There are other parameters that can be passed to jointplot; for example, we can use a hexagonally based histogram instead (Figure 4-118):
In [11]:
with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='hex')
Figure 4-118. A joint distribution plot with a hexagonal bin representation
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multidimensional data, when you’d like to plot all pairs of values against each other.
We’ll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:
In [12]:
iris = sns.load_dataset("iris")
iris.head()
Out[12]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Visualizing the multidimensional relationships among the samples is as easy as calling sns.pairplot (Figure 4-119):
In [13]:
sns.pairplot(iris, hue='species', size=2.5);
Figure 4-119. A pair plot showing the relationships between four variables
Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this extremely simple. We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data (Figure 4-120):
In [14]:
tips = sns.load_dataset('tips')
tips.head()
Out[14]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [15]:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
Figure 4-120. An example of a faceted histogram
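What a faceted histogram does can be sketched without any plotting: split the table by the facet variables, then histogram each subset with shared bins. The example below uses a small synthetic table rather than the real tips dataset; the column values and sample size are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-in for the tips table (values are invented)
tips = pd.DataFrame({
    "sex": rng.choice(["Male", "Female"], size=n),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
    "tip_pct": rng.uniform(5, 30, size=n),
})

# Shared bin edges, as in the faceted plot above
bins = np.linspace(0, 40, 15)

# One histogram per (row, column) facet
facets = {}
for (sex, time), group in tips.groupby(["sex", "time"]):
    counts, _ = np.histogram(group["tip_pct"], bins=bins)
    facets[(sex, time)] = counts

for key, counts in facets.items():
    print(key, counts.sum())
```

Using the same bin edges in every facet is what makes the panels directly comparable.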
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a parameter within bins defined by any other parameter (Figure 4-121):
In [16]:
with sns.axes_style(style='ticks'):
    g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");
Figure 4-121. An example of a factor plot, comparing distributions given various discrete factors
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between different datasets, along with the associated marginal distributions (Figure 4-122):
In [17]:
with sns.axes_style('white'):
    sns.jointplot("total_bill", "tip", data=tips, kind='hex')
Figure 4-122. A joint distribution plot
The joint plot can even do some automatic kernel density estimation and regression (Figure 4-123):
In [18]:
sns.jointplot("total_bill", "tip", data=tips, kind='reg');
Figure 4-123. A joint distribution plot with a regression fit
Bar plots
Time series can be plotted with sns.factorplot. In the following example (visualized in Figure 4-124), we’ll use the Planets data that we first saw in “Aggregation and Grouping”:
In [19]:
planets = sns.load_dataset('planets')
planets.head()
Out[19]:
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
In [20]:
with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=2,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=5)
Figure 4-124. A histogram as a special case of a factor plot
We can learn more by looking at the method of discovery of each of these planets, as illustrated in Figure 4-125:
In [21]:
with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=4.0, kind='count',
                       hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')
Figure 4-125. Number of planets discovered by year and type (see the online appendix for a full-scale figure)
For more information on plotting with Seaborn, see the Seaborn documentation, a tutorial, and the Seaborn gallery.
Example: Exploring Marathon Finishing Times
Here we’ll look at using Seaborn to help visualize and understand finishing results from a marathon. I’ve scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded (if you are interested in using Python for web scraping, I would recommend Web Scraping with Python by Ryan Mitchell). We will start by downloading the data from the Web, and loading it into Pandas:
In [22]:
# !curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/
# master/marathon-data.csv
In [23]:
data = pd.read_csv('marathon-data.csv')
data.head()
Out[23]:
   age gender     split     final
0   33      M  01:05:38  02:08:51
1   32      M  01:06:26  02:09:28
2   31      M  01:06:49  02:10:42
3   38      M  01:06:16  02:13:45
4   31      M  01:06:32  02:13:59
By default, Pandas loaded the time columns as Python strings (type object); we can see this by looking at the dtypes attribute of the DataFrame:
In [24]:
data.dtypes
Out[24]:
age        int64
gender    object
split     object
final     object
dtype: object
Let’s fix this by providing a converter for the times:
In [25]:
from datetime import timedelta

def convert_time(s):
    # parse an 'HH:MM:SS' string into a timedelta
    # (the standard-library timedelta stands in for pd.datetools.timedelta,
    # which recent Pandas releases no longer provide)
    h, m, s = map(int, s.split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

data = pd.read_csv('marathon-data.csv',
                   converters={'split': convert_time, 'final': convert_time})
data.head()
Out[25]:
   age gender     split     final
0   33      M  01:05:38  02:08:51
1   32      M  01:06:26  02:09:28
2   31      M  01:06:49  02:10:42
3   38      M  01:06:16  02:13:45
4   31      M  01:06:32  02:13:59
In [26]:
data.dtypes
Out[26]:
age                 int64
gender             object
split     timedelta64[ns]
final     timedelta64[ns]
dtype: object
That looks much better. For the purpose of our Seaborn plotting utilities, let’s next add columns that give the times in seconds:
In [27]: data['split_sec'] = data['split'].astype(int) / 1E9
         data['final_sec'] = data['final'].astype(int) / 1E9
         data.head()
Out[27]:    age gender     split     final  split_sec  final_sec
         0   33      M  01:05:38  02:08:51     3938.0     7731.0
         1   32      M  01:06:26  02:09:28     3986.0     7768.0
         2   31      M  01:06:49  02:10:42     4009.0     7842.0
         3   38      M  01:06:16  02:13:45     3976.0     8025.0
         4   31      M  01:06:32  02:13:59     3992.0     8039.0
To get an idea of what the data looks like, we can plot a jointplot over the data (Figure 4-126):
In [28]: with sns.axes_style('white'):
             g = sns.jointplot(x="split_sec", y="final_sec", data=data, kind='hex')
             g.ax_joint.plot(np.linspace(4000, 16000),
                             np.linspace(8000, 32000), ':k')
Figure 4-126. The relationship between the split for the first half-marathon and the finishing time for the full marathon
The dotted line shows where someone’s time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon. If you have run competitively, you’ll know that those who do the opposite—run faster during the second half of the race—are said to have “negative-split” the race.
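As a rough cross-check of the preprocessing and of the steady-pace line, the sketch below rebuilds the seconds columns for the five rows shown in the `data.head()` output and tests which of them lie above the line final = 2 × split. It uses `pd.to_timedelta` and `.dt.total_seconds()` in place of the converter and integer cast used in the text, and the names `sample` and `slowed_down` are introduced here only for illustration.

```python
import pandas as pd

# Five rows copied from the data.head() output shown earlier.
sample = pd.DataFrame({
    "split": ["01:05:38", "01:06:26", "01:06:49", "01:06:16", "01:06:32"],
    "final": ["02:08:51", "02:09:28", "02:10:42", "02:13:45", "02:13:59"],
})

# pd.to_timedelta parses 'HH:MM:SS' strings into timedelta values.
split = pd.to_timedelta(sample["split"])
final = pd.to_timedelta(sample["final"])

# .dt.total_seconds() gives float seconds without an integer cast.
sample["split_sec"] = split.dt.total_seconds()
sample["final_sec"] = final.dt.total_seconds()

# A perfectly steady pace means final == 2 * split, so a runner above
# the dotted line (final > 2 * split) slowed down in the second half.
sample["slowed_down"] = sample["final_sec"] > 2 * sample["split_sec"]
print(sample)
```

The seconds columns should match the `split_sec` and `final_sec` values in the table above. Only the last two of these five elite runners sit above the line.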
Let’s create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
In [29]: data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
         data.head()
Out[29]:    age gender     split     final  split_sec  final_sec  split_frac
         0   33      M  01:05:38  02:08:51     3938.0     7731.0   -0.018756
         1   32      M  01:06:26  02:09:28     3986.0     7768.0   -0.026262
         2   31      M  01:06:49  02:10:42     4009.0     7842.0   -0.022443
         3   38      M  01:06:16  02:13:45     3976.0     8025.0    0.009097
         4   31      M  01:06:32  02:13:59     3992.0     8039.0    0.006842
Where this split fraction is less than zero, the person negative-split the race by that fraction. Let’s do a distribution plot of this split fraction (Figure 4-127):
In [30]: sns.distplot(data['split_frac'], kde=False);
         plt.axvline(0, color="k", linestyle="--");
Figure 4-127. The distribution of split fractions; 0.0 indicates a runner who completed the first and second halves in identical times
In [31]: sum(data.split_frac < 0)
Out[31]: 251
Out of nearly 40,000 participants, only 251 people negative-split their marathon.
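To make the split-fraction arithmetic concrete, here is a small sketch that recomputes it for the five rows shown in the `data.head()` output and counts the negative splits with the vectorized `.sum()` method. The names `sample`, `n_negative`, and `share_negative` are illustrative and do not appear in the original analysis.

```python
import pandas as pd

# split_sec and final_sec for the five rows shown in data.head().
sample = pd.DataFrame({
    "split_sec": [3938.0, 3986.0, 4009.0, 3976.0, 3992.0],
    "final_sec": [7731.0, 7768.0, 7842.0, 8025.0, 8039.0],
})

# split_frac = 1 - 2 * split / final:
#   < 0  -> first half took more than half the total time (negative split)
#   = 0  -> both halves took the same time
#   > 0  -> second half was slower (positive split)
sample["split_frac"] = 1 - 2 * sample["split_sec"] / sample["final_sec"]

# A boolean Series sums to the number of True values,
# and its mean is the share of True values.
is_negative = sample["split_frac"] < 0
n_negative = int(is_negative.sum())
share_negative = float(is_negative.mean())

print(sample)
print(n_negative, share_negative)
```

On the full dataset, `(data.split_frac < 0).sum()` should give the same count as the built-in `sum` used above, while staying inside Pandas' vectorized machinery.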
Let’s see whether there is any correlation between this split fraction and other variables. We’ll do this using a PairGrid, which draws plots of all these correlations (Figure 4-128):
In [32]: g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                          hue='gender', palette='RdBu_r')
         g.map(plt.scatter, alpha=0.8)
         g.add_legend();
Figure 4-128. The relationship between quantities within the marathon dataset
It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. (We see here that Seaborn is no panacea for Matplotlib’s ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in “Customizing Ticks” can be used to adjust such things if desired.)
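A numerical complement to reading correlations off the grid is a Pearson correlation matrix from `DataFrame.corr()`. The sketch below only demonstrates the call on the five rows shown in the `data.head()` output. Five elite finishers say nothing reliable about the full field, so treat the resulting numbers as a smoke test, not as a finding. The name `sample` is illustrative.

```python
import pandas as pd

# The five rows shown in data.head(); far too few for real conclusions.
sample = pd.DataFrame({
    "age": [33, 32, 31, 38, 31],
    "split_sec": [3938.0, 3986.0, 4009.0, 3976.0, 3992.0],
    "final_sec": [7731.0, 7768.0, 7842.0, 8025.0, 8039.0],
})
sample["split_frac"] = 1 - 2 * sample["split_sec"] / sample["final_sec"]

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = sample.corr()

# One column of the matrix: how each variable co-varies with split_frac.
print(corr["split_frac"].round(2))
```

Run on the full `data` frame with the same four columns, the `split_frac` column of this matrix gives one number per panel in the bottom row of the grid.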
The difference between men and women here is interesting. Let’s look at kernel density estimates of the split fractions for these two groups (Figure 4-129):
In [33]: sns.kdeplot(data.split_frac[data.gender=='M'], label='men', shade=True)
         sns.kdeplot(data.split_frac[data.gender=='W'], label='women', shade=True)
         plt.xlabel('split_frac');
Figure 4-129. The distribution of split fractions by gender
The interesting thing here is that there are many more men than women who are running close to an even split! This almost looks like some kind of bimodal distribution among the men and women. Let’s see if we can suss out what’s going on by looking at the distributions as a function of age.
A nice way to compare distributions is to use a violin plot (Figure 4-130):
In [34]: sns.violinplot(x="gender", y="split_frac", data=data,
                        palette=["lightblue", "lightpink"]);
Figure 4-130. A violin plot showing the split fraction by gender
This is yet another way to compare the distributions between men and women.
Let’s look a little deeper, and compare these violin plots as a function of age. We’ll start by creating a new column in the DataFrame that specifies the decade of age that each person is in (Figure 4-131):
In [35]: data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))
         data.head()
Out[35]:    age gender     split     final  split_sec  final_sec  split_frac  age_dec
         0   33      M  01:05:38  02:08:51     3938.0     7731.0   -0.018756       30
         1   32      M  01:06:26  02:09:28     3986.0     7768.0   -0.026262       30
         2   31      M  01:06:49  02:10:42     4009.0     7842.0   -0.022443       30
         3   38      M  01:06:16  02:13:45     3976.0     8025.0    0.009097       30
         4   31      M  01:06:32  02:13:59     3992.0     8039.0    0.006842       30
In [36]: men = (data.gender == 'M')
         women = (data.gender == 'W')

         with sns.axes_style(style=None):
             sns.violinplot(x="age_dec", y="split_frac", hue="gender", data=data,
                            split=True, inner="quartile",
                            palette=["lightblue", "lightpink"]);
Figure 4-131. A violin plot showing the split fraction by gender and age
Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
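The decade binning behind `age_dec` can also be written as a vectorized expression, and a `groupby` count shows how many runners fall into each half of each violin. The sketch below uses a handful of made-up ages and genders, which are hypothetical values and not rows from the marathon data. The names `people` and `counts` are likewise illustrative.

```python
import pandas as pd

# Hypothetical ages and genders, invented only to exercise the binning.
people = pd.DataFrame({
    "age": [33, 32, 47, 58, 61, 29, 80, 85],
    "gender": ["M", "M", "W", "M", "W", "W", "W", "M"],
})

# Floor-divide by 10, then scale back up: 33 -> 30, 47 -> 40, 85 -> 80.
# This matches data.age.map(lambda age: 10 * (age // 10)) without the lambda.
people["age_dec"] = 10 * (people["age"] // 10)

# Rows per (decade, gender) group; sparse groups produce unreliable
# density estimates in a violin plot.
counts = people.groupby(["age_dec", "gender"]).size()
print(counts)
```

Checking group sizes like this before reading a violin plot helps flag bins where the estimated shape rests on only a few runners.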
Also surprisingly, the 80-year-old women seem to outperform everyone in terms of their split time. This is probably due to the fact that we’re estimating the distribution from small numbers, as there are only a handful of runners in that range:
In [38]: (data.age > 80).sum()
Out[38]: 7
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We’ll use lmplot, which will automatically fit a linear regression to the data (Figure 4-132):
In [37]: g = sns.lmplot(x='final_sec', y='split_frac', col='gender', data=data,
                        markers=".", scatter_kws=dict(color='c'))
         g.map(plt.axhline, y=0.1, color="k", ls=":");
Figure 4-132. Split fraction versus finishing time by gender
Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.
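The line in each panel is a linear regression of split fraction on finishing time. A minimal sketch of such a fit, using ordinary least squares through `np.polyfit` on the five rows shown in the `data.head()` output, looks like the following. It produces only the line and no confidence band, and the sample is far too small to say anything about the real trend.

```python
import numpy as np

# final_sec and split_sec for the five rows shown in data.head().
final_sec = np.array([7731.0, 7768.0, 7842.0, 8025.0, 8039.0])
split_sec = np.array([3938.0, 3986.0, 4009.0, 3976.0, 3992.0])
split_frac = 1 - 2 * split_sec / final_sec

# Degree-1 least-squares fit: split_frac ~ slope * final_sec + intercept.
slope, intercept = np.polyfit(final_sec, split_frac, deg=1)

# Fitted values and residuals for the same runners.
predicted = slope * final_sec + intercept
residuals = split_frac - predicted

print(slope, intercept)
```

To mirror the two panels on the full dataset, the same fit could be run separately on the rows for each gender.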
A single chapter in a book can never hope to cover all the features and plot types available in Matplotlib. As with other packages we’ve seen, liberal use of IPython’s tab-completion and help functions (see “Help and Documentation in IPython”) can be very helpful when you’re exploring Matplotlib’s API. In addition, Matplotlib’s online documentation can be a helpful reference. See in particular the Matplotlib gallery linked on that page: it shows thumbnails of hundreds of different plot types, each one linked to a page with the Python code snippet used to generate it. In this way, you can visually inspect and learn about a wide range of different plotting styles and visualization techniques.
For a book-length treatment of Matplotlib, I would recommend Interactive Applications Using Matplotlib, written by Matplotlib core developer Ben Root.
Other Python Graphics Libraries
Although Matplotlib is the most prominent Python visualization library, there are other more modern tools that are worth exploring as well. I’ll mention a few of them briefly here:
Bokeh is a JavaScript visualization library with a Python frontend that creates highly interactive visualizations capable of handling very large and/or streaming datasets. The Python frontend outputs a JSON data structure that can be interpreted by the Bokeh JS engine.
Plotly is the eponymous open source product of the Plotly company, and is similar in spirit to Bokeh. Because Plotly is the main product of a startup, it is receiving a high level of development effort. Use of the library is entirely free.
Vispy is an actively developed project focused on dynamic visualizations of very large datasets. Because it is built to target OpenGL and make use of efficient graphics processors in your computer, it is able to render some quite large and stunning visualizations.
Vega and Vega-Lite are declarative graphics representations, and are the product of years of research into the fundamental language of data visualization. The reference rendering implementation is JavaScript, but the API is language agnostic. There is a Python API under development in the Altair package. Though it’s not mature yet, I’m quite excited for the possibilities of this project to provide a common reference point for visualization in Python and other languages.
The visualization space in the Python community is very dynamic, and I fully expect this list to be out of date as soon as it is published. Keep an eye out for what’s coming in the future!