In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
...: 'bar': ['a', 'b', 'c', 'd'],
...: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
...: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
...: index=pd.Index(range(4), name='idx'))
In [2]: df
Out[2]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [3]: df.dtypes
Out[3]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
In [4]: df.to_json('test.json', orient='table')
In [5]: new_df = pd.read_json('test.json', orient='table')
In [6]: new_df
Out[6]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [7]: new_df.dtypes
Out[7]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
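The round trip above can also be done in memory. The following is a minimal sketch, assuming a pandas version where read_json accepts a file-like object; the StringIO wrapper and the variable names are illustrative choices, not part of the original example.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {
        "foo": [1, 2, 3, 4],
        "qux": pd.Categorical(["a", "b", "c", "c"]),
    },
    index=pd.Index(range(4), name="idx"),
)

# Serialize with the table schema embedded alongside the data.
payload = df.to_json(orient="table")

# Read the JSON back from an in-memory buffer instead of a file on disk.
new_df = pd.read_json(StringIO(payload), orient="table")

print(new_df.dtypes)
print(new_df.index.name)
```

The schema stored in the payload is what lets the categorical column and the index name survive the round trip.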
Note that the string 'index' is not supported as an index name with the round-trip format, because to_json uses it by default to indicate a missing index name.
In [8]: df.index.name = 'index'
In [9]: df.to_json('test.json', orient='table')
In [10]: new_df = pd.read_json('test.json', orient='table')
In [11]: new_df
Out[11]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [12]: new_df.dtypes
Out[12]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
Method .assign() accepts dependent arguments
DataFrame.assign() now accepts dependent keyword arguments for Python 3.6 and later (see also PEP 468). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
DataFrame.assign() documentation for details (GH14207)
In [13]: df = pd.DataFrame({'A': [1, 2, 3]})
In [14]: df
Out[14]:
A
0 1
1 2
2 3
[3 rows x 1 columns]
In [15]: df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
Out[15]:
A B C
0 1 1 2
1 2 2 4
2 3 3 6
[3 rows x 3 columns]
Warning
This may subtly change the behavior of your code when you’re
using .assign() to update an existing column. Previously, callables
referring to other variables being updated would get the “old” values.
Previous behavior:
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
A C
0 2 -1
1 3 -2
2 4 -3
New behavior:
In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
A C
0 2 -2
1 3 -3
2 4 -4
[3 rows x 2 columns]
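A minimal sketch of both points, assuming Python 3.6 or later so that keyword order is preserved. The column names and lambdas are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# B is created first; the callable for C then sees the new column B.
chained = df.assign(
    B=lambda x: x["A"] * 2,
    C=lambda x: x["A"] + x["B"],
)

# Updating an existing column: the callable for C sees the updated A.
updated = df.assign(
    A=lambda x: x["A"] + 1,
    C=lambda x: x["A"] * -1,
)

print(chained)
print(updated)
```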
Merging on a combination of columns and index levels
Strings passed to DataFrame.merge() as the on, left_on, and right_on
parameters may now refer to either column names or index level names.
This enables merging DataFrame instances on a combination of index levels
and columns without resetting indexes. See the Merge on columns and
levels documentation section.
(GH14355)
In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
In [18]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
....: 'B': ['B0', 'B1', 'B2', 'B3'],
....: 'key2': ['K0', 'K1', 'K0', 'K1']},
....: index=left_index)
....:
In [19]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
In [20]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
....: 'D': ['D0', 'D1', 'D2', 'D3'],
....: 'key2': ['K0', 'K0', 'K0', 'K1']},
....: index=right_index)
....:
In [21]: left.merge(right, on=['key1', 'key2'])
Out[21]:
A B key2 C D
K0 A0 B0 K0 C0 D0
K1 A2 B2 K0 C1 D1
K2 A3 B3 K1 C3 D3
[3 rows x 5 columns]
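For comparison, the sketch below puts the direct merge next to the reset-and-restore workaround it replaces. It uses trimmed-down copies of the frames above. The names direct and manual are illustrative.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "key2": ["K0", "K1", "K0", "K1"]},
    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "key2": ["K0", "K0", "K0", "K1"]},
    index=pd.Index(["K0", "K1", "K2", "K2"], name="key1"),
)

# 'key1' names an index level, 'key2' names a column.
direct = left.merge(right, on=["key1", "key2"])

# The workaround: move the index into columns, merge, then restore it.
manual = (
    left.reset_index()
    .merge(right.reset_index(), on=["key1", "key2"])
    .set_index("key1")
)

print(direct)
print(manual)
```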
Sorting by a combination of columns and index levels
Strings passed to DataFrame.sort_values() as the by parameter may
now refer to either column names or index level names. This enables sorting
DataFrame instances by a combination of index levels and columns without
resetting indexes. See the Sorting by Indexes and Values documentation section.
(GH14353)
# Build MultiIndex
In [22]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
....: ('b', 2), ('b', 1), ('b', 1)])
....:
In [23]: idx.names = ['first', 'second']
# Build DataFrame
In [24]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
....: index=idx)
....:
In [25]: df_multi
Out[25]:
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
[6 rows x 1 columns]
# Sort by 'second' (index) and 'A' (column)
In [26]: df_multi.sort_values(by=['second', 'A'])
Out[26]:
first second
b 1 1
1 2
a 1 6
b 2 3
a 2 4
2 5
[6 rows x 1 columns]
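Index levels and columns can also take different sort directions. A short sketch reusing the frame above, where the ascending list is an illustrative choice:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)],
    names=["first", "second"],
)
df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

# Ascending on the 'second' index level, descending on column 'A'.
result = df_multi.sort_values(by=["second", "A"], ascending=[True, False])
print(result)
```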
Extending pandas with custom types (experimental)
pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy
arrays as columns in a DataFrame or values in a Series. This allows third-party
libraries to implement extensions to NumPy’s types, similar to how pandas
implemented categoricals, datetimes with timezones, periods, and intervals.
As a demonstration, we’ll use cyberpandas, which provides an IPArray type
for storing IP addresses.
In [1]: from cyberpandas import IPArray
In [2]: values = IPArray([
...: 0,
...: 3232235777,
...: 42540766452641154071740215577757643572
...: ])
IPArray isn’t a normal 1-D NumPy array, but because it’s a pandas
ExtensionArray, it can be stored properly inside pandas’ containers.
In [3]: ser = pd.Series(values)
In [4]: ser
Out[4]:
0 0.0.0.0
1 192.168.1.1
2 2001:db8:85a3::8a2e:370:7334
dtype: ip
Notice that the dtype is ip. The missing value semantics of the underlying
array are respected:
In [5]: ser.isna()
Out[5]:
0 True
1 False
2 False
dtype: bool
For more, see the extension types
documentation. If you build an extension array, publicize it on our
ecosystem page.
New observed keyword for excluding unobserved categories in GroupBy
Grouping by a categorical includes the unobserved categories in the output.
When grouping by multiple categorical columns, this means you get the cartesian product of all the
categories, including combinations where there are no observations, which can result in a large
number of groups. We have added a keyword observed to control this behavior; it defaults to
observed=False for backward-compatibility. (GH14942, GH8138, GH15217, GH17594, GH8669, GH20583, GH20902)
In [27]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [28]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [29]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [30]: df['C'] = ['foo', 'bar'] * 2
In [31]: df
Out[31]:
A B values C
0 a c 1 foo
1 a d 2 bar
2 b c 3 foo
3 b d 4 bar
[4 rows x 4 columns]
To show all values, the previous behavior:
In [32]: df.groupby(['A', 'B', 'C'], observed=False).count()
Out[32]:
values
A B C
a c bar 0
foo 1
d bar 1
foo 0
y bar 0
... ...
z c foo 0
d bar 0
foo 0
y bar 0
foo 0
[18 rows x 1 columns]
To show only observed values:
In [33]: df.groupby(['A', 'B', 'C'], observed=True).count()
Out[33]:
values
A B C
a c foo 1
d bar 1
b c foo 1
d bar 1
[4 rows x 1 columns]
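The same contrast with a single categorical key, as a minimal sketch. Passing observed explicitly keeps the result independent of the default.

```python
import pandas as pd

cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"])
df = pd.DataFrame({"A": cat, "values": [1, 2, 3, 4]})

# Every declared category appears, including the unobserved 'z'.
all_groups = df.groupby("A", observed=False)["values"].sum()

# Only categories that actually occur in the data appear.
seen_groups = df.groupby("A", observed=True)["values"].sum()

print(all_groups)
print(seen_groups)
```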
For pivoting operations, this behavior is already controlled by the dropna keyword:
In [34]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [35]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [36]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [37]: df
Out[37]:
A B values
0 a c 1
1 a d 2
2 b c 3
3 b d 4
[4 rows x 3 columns]
In [38]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=True)
....:
Out[38]:
values
a c 1
d 2
b c 3
d 4
[4 rows x 1 columns]
In [39]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=False)
....:
Out[39]:
values
a c 1.0
d 2.0
y NaN
b c 3.0
d 4.0
y NaN
z c NaN
d NaN
y NaN
[9 rows x 1 columns]
Rolling/Expanding.apply() accepts raw=False to pass a Series to the function
Series.rolling().apply(), DataFrame.rolling().apply(),
Series.expanding().apply(), and DataFrame.expanding().apply() have gained a raw=None parameter.
This is similar to DataFrame.apply(). If raw=True, an np.ndarray is passed to the applied function; if raw=False, a Series is passed. The
default is None, which preserves backward compatibility by behaving like True and passing an np.ndarray.
In a future version the default will be changed to False, sending a Series. (GH5071, GH20584)
In [40]: s = pd.Series(np.arange(5), np.arange(5) + 1)
In [41]: s
Out[41]:
1 0
2 1
3 2
4 3
5 4
Length: 5, dtype: int64
Pass a Series:
In [42]: s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
Out[42]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
Length: 5, dtype: float64
Mimic the original behavior of passing an ndarray:
In [43]: s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)
Out[43]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
Length: 5, dtype: float64
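To see what the applied function actually receives, the sketch below records the type of each window. The helper window_type and the seen list are illustrative names, not pandas API.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5) + 1)
seen = []


def window_type(window):
    # Record what kind of object pandas hands to the function.
    seen.append(type(window).__name__)
    return window.sum()


as_series = s.rolling(2, min_periods=1).apply(window_type, raw=False)
series_types = set(seen)

seen.clear()
as_array = s.rolling(2, min_periods=1).apply(window_type, raw=True)
array_types = set(seen)

print(series_types, array_types)
```

With raw=False the function can use the window's index labels. With raw=True it gets a bare array.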
DataFrame.interpolate has gained the limit_area kwarg
DataFrame.interpolate() has gained a limit_area parameter to allow further control of which NaNs are replaced.
Use limit_area='inside' to fill only NaNs surrounded by valid values, or use limit_area='outside' to fill only NaNs
outside the existing valid values while preserving those inside. (GH16284) See the full documentation here.
In [44]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
....: np.nan, 13, np.nan, np.nan])
....:
In [45]: ser
Out[45]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill one consecutive inside value in both directions
In [46]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
Out[46]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
5 11.0
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill all consecutive outside values backward
In [47]: ser.interpolate(limit_direction='backward', limit_area='outside')
Out[47]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill all consecutive outside values in both directions
In [48]: ser.interpolate(limit_direction='both', limit_area='outside')
Out[48]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 13.0
8 13.0
Length: 9, dtype: float64
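A further sketch using limit_area on its own, relying on the default linear method and forward direction. The names inside and outside are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

# Fill only the gap between the two valid values; leading and trailing
# NaNs are left untouched.
inside = ser.interpolate(limit_area="inside")

# Fill only the trailing NaNs (forward direction); the interior gap and
# the leading NaNs are left untouched.
outside = ser.interpolate(limit_area="outside")

print(inside)
print(outside)
```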
Function get_dummies now supports dtype argument
get_dummies() now accepts a dtype argument, which specifies a dtype for the new columns. The default remains uint8. (GH18330)
In [49]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
In [50]: pd.get_dummies(df, columns=['c']).dtypes
Out[50]:
a int64
b int64
c_5 uint8
c_6 uint8
Length: 4, dtype: object
In [51]: pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
Out[51]:
a int64
b int64
c_5 bool
c_6 bool
Length: 4, dtype: object
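Any NumPy-compatible dtype can be requested. A small sketch asking for floating-point indicator columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Request floating-point indicator columns instead of the default dtype.
dummies = pd.get_dummies(df, columns=["c"], dtype=float)
print(dummies.dtypes)
```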
Timedelta mod method
mod (%) and divmod operations are now defined on Timedelta objects
when operating with either timedelta-like or with numeric arguments.
See the documentation here. (GH19365)
In [52]: td = pd.Timedelta(hours=37)
In [53]: td % pd.Timedelta(minutes=45)
Out[53]: Timedelta('0 days 00:15:00')
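The divmod form returns the whole number of intervals and the remainder in one call. A minimal sketch with illustrative variable names:

```python
import pandas as pd

td = pd.Timedelta(hours=37)

# 37 hours is 49 whole 45-minute blocks with 15 minutes left over.
quotient, remainder = divmod(td, pd.Timedelta(minutes=45))

# The same operator gives the time elapsed since the last whole day.
into_day = td % pd.Timedelta(days=1)

print(quotient, remainder, into_day)
```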
Method .rank() handles inf values when NaN are present
In previous versions, .rank() would assign inf elements NaN as their ranks. Now ranks are calculated properly. (GH6945)
In [54]: s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
In [55]: s
Out[55]:
0 -inf
1 0.0
2 1.0
3 NaN
4 inf
Length: 5, dtype: float64
Previous behavior:
In [11]: s.rank()
Out[11]:
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
Current behavior:
In [56]: s.rank()
Out[56]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
Length: 5, dtype: float64
Furthermore, previously, if you ranked inf or -inf values together with NaN values, the calculation did not distinguish NaN from infinity when using the ‘top’ or ‘bottom’ argument.
In [57]: s = pd.Series([np.nan, np.nan, -np.inf, -np.inf])
In [58]: s
Out[58]:
0 NaN
1 NaN
2 -inf
3 -inf
Length: 4, dtype: float64
Previous behavior:
In [15]: s.rank(na_option='top')
Out[15]:
0 2.5
1 2.5
2 2.5
3 2.5
dtype: float64
Current behavior:
In [59]: s.rank(na_option='top')
Out[59]:
0 1.5
1 1.5
2 3.5
3 3.5
Length: 4, dtype: float64
These bugs were squashed:
Bug in DataFrame.rank() and Series.rank() when method='dense' and pct=True in which percentile ranks were not computed relative to the number of distinct observations (GH15630)
Bug in Series.rank() and DataFrame.rank() when ascending=False failed to return correct ranks for infinity if NaN were present (GH19538)
Bug in DataFrameGroupBy.rank() where ranks were incorrect when both infinity and NaN were present (GH20561)
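The corrected ranking can be checked with a short sketch. The na_option='bottom' case is the mirror image of the 'top' example above.

```python
import numpy as np
import pandas as pd

s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
# inf now receives a proper rank; NaN stays unranked by default.
default_ranks = s.rank()

t = pd.Series([np.nan, np.nan, -np.inf, -np.inf])
# NaN values are ranked after the -inf values instead of being tied with them.
bottom_ranks = t.rank(na_option="bottom")

print(default_ranks)
print(bottom_ranks)
```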
Series.str.cat has gained the join kwarg
Previously, Series.str.cat() did not, in contrast to most of pandas, align Series on their index before concatenation (see GH18657).
The method has now gained a keyword join to control the manner of alignment; see the examples below.
In v0.23, join will default to None (meaning no alignment), but this default will change to 'left' in a future version of pandas.
In [60]: s = pd.Series(['a', 'b', 'c', 'd'])
In [61]: t = pd.Series(['b', 'd', 'e', 'c'], index=[1, 3, 4, 2])
In [62]: s.str.cat(t)
Out[62]:
0 ab
1 bd
2 ce
3 dc
Length: 4, dtype: object
In [63]: s.str.cat(t, join='left', na_rep='-')
Out[63]:
0 a-
1 bb
2 cc
3 dd
Length: 4, dtype: object
Furthermore, Series.str.cat() now works for CategoricalIndex as well (previously raised a ValueError; see GH20842).
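The other join modes follow the usual alignment semantics. A sketch with join passed explicitly, so the result does not depend on the default. The sort_index() calls are only there to make the printed order predictable.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
t = pd.Series(["b", "d", "e", "c"], index=[1, 3, 4, 2])

# Union of both indexes; missing values on either side become '-'.
outer = s.str.cat(t, join="outer", na_rep="-").sort_index()

# Only labels present in both Series are kept.
inner = s.str.cat(t, join="inner").sort_index()

print(outer)
print(inner)
```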
DataFrame.astype performs column-wise conversion to Categorical
DataFrame.astype() can now perform column-wise conversion to Categorical by supplying the string 'category' or
a CategoricalDtype. Previously, attempting this would raise a NotImplementedError. See the
Object creation section of the documentation for more details and examples. (GH12860, GH18099)
Supplying the string 'category' performs column-wise conversion, with only labels appearing in a given column set as categories:
In [64]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [65]: df = df.astype('category')
In [66]: df['A'].dtype
Out[66]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
In [67]: df['B'].dtype
Out[67]: CategoricalDtype(categories=['b', 'c', 'd'], ordered=False)
Supplying a CategoricalDtype will make the categories in each column consistent with the supplied dtype:
In [68]: from pandas.api.types import CategoricalDtype
In [69]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [70]: cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
In [71]: df = df.astype(cdt)
In [72]: df['A'].dtype
Out[72]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
In [73]: df['B'].dtype
Out[73]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
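One practical consequence of a shared, ordered dtype is that codes are comparable across columns and ordering comparisons work. A minimal sketch with illustrative variable names:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
cdt = CategoricalDtype(categories=list("abcd"), ordered=True)
df = df.astype(cdt)

# Both columns share one set of categories, so the codes line up.
codes_a = df["A"].cat.codes.tolist()
codes_b = df["B"].cat.codes.tolist()

# Ordered categoricals support comparisons against a category label.
below_c = (df["A"] < "c").tolist()

print(codes_a, codes_b, below_c)
```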
Other enhancements
Unary + now permitted for Series and DataFrame as numeric operator (GH16073)
Better support for to_excel() output with the xlsxwriter engine. (GH16149)
pandas.tseries.frequencies.to_offset() now accepts leading ‘+’ signs e.g. ‘+1h’. (GH18171)
MultiIndex.unique() now supports the level= argument, to get unique values from a specific index level (GH17896)
pandas.io.formats.style.Styler now has method hide_index() to determine whether the index will be rendered in output (GH14194)
pandas.io.formats.style.Styler now has method hide_columns() to determine whether columns will be hidden in output (GH14194)
Improved wording of ValueError raised in to_datetime() when unit= is passed with a non-convertible value (GH14350)
Series.fillna() now accepts a Series or a dict as a value for a categorical dtype (GH17033)
pandas.read_clipboard() updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (GH17722)
Improved wording of ValueError raised in read_csv() when the usecols argument cannot match all columns. (GH17301)
DataFrame.corrwith() now silently drops non-numeric columns when passed a Series. Before, an exception was raised (GH18570).
IntervalIndex now supports time zone aware Interval objects (GH18537, GH18538)
Series() / DataFrame() tab completion also returns identifiers in the first level of a MultiIndex(). (GH16326)
read_excel() has gained the nrows parameter (GH16645)
DataFrame.append() can now in more cases preserve the type of the calling dataframe’s columns (e.g. if both are CategoricalIndex) (GH18359)
DataFrame.to_json() and Series.to_json() now accept an index argument which allows the user to exclude the index from the JSON output (GH17394)
IntervalIndex.to_tuples() has gained the na_tuple parameter to control whether NA is returned as a tuple of NA, or NA itself (GH18756)
Categorical.rename_categories, CategoricalIndex.rename_categories and Series.cat.rename_categories can now take a callable as their argument (GH18862)
Interval and IntervalIndex have gained a length attribute (GH18789)
Resampler objects now have a functioning pipe method. Previously, calls to pipe were diverted to the mean method (GH17905).
is_scalar() now returns True for DateOffset objects (GH18943).
DataFrame.pivot() now accepts a list for the values= kwarg (GH17160).
Added pandas.api.extensions.register_dataframe_accessor(), pandas.api.extensions.register_series_accessor(), and pandas.api.extensions.register_index_accessor(), which allow libraries downstream of pandas to register custom accessors like .cat on pandas objects. See Registering Custom Accessors for more (GH14781).
IntervalIndex.astype now supports conversions between subtypes when passed an IntervalDtype (GH19197)
IntervalIndex and its associated constructor methods (from_arrays, from_breaks, from_tuples) have gained a dtype parameter (GH19262)
Added pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing() and pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing() (GH17015)
For subclassed DataFrames, DataFrame.apply() will now preserve the Series subclass (if defined) when passing the data to the applied function (GH19822)
DataFrame.from_dict() now accepts a columns argument that can be used to specify the column names when orient='index' is used (GH18529)
Added option display.html.use_mathjax so MathJax can be disabled when rendering tables in Jupyter notebooks (GH19856, GH19824)
DataFrame.replace() now supports the method parameter, which can be used to specify the replacement method when to_replace is a scalar, list or tuple and value is None (GH19632)
Timestamp.month_name(), DatetimeIndex.month_name(), and Series.dt.month_name() are now available (GH12805)
Timestamp.day_name() and DatetimeIndex.day_name() are now available to return day names with a specified locale (GH12806)
DataFrame.to_sql() now performs a multi-value insert if the underlying connection supports it, rather than inserting row by row.
SQLAlchemy dialects supporting multi-value inserts include: mysql, postgresql, sqlite and any dialect with supports_multivalues_insert. (GH14315, GH8953)
read_html() now accepts a displayed_only keyword argument to control whether or not hidden elements are parsed (True by default) (GH20027)
read_html() now reads all <tbody> elements in a <table>, not just the first. (GH20690)
Rolling.quantile() and Expanding.quantile() now accept the interpolation keyword, linear by default (GH20497)
zip compression is supported via compression='zip' in DataFrame.to_pickle(), Series.to_pickle(), DataFrame.to_csv(), Series.to_csv(), DataFrame.to_json(), Series.to_json(). (GH17778)
WeekOfMonth constructor now supports n=0 (GH20517).
DataFrame and Series now support matrix multiplication (@) operator (GH10259) for Python>=3.5
Updated DataFrame.to_gbq() and pandas.read_gbq() signature and documentation to reflect changes from the pandas-gbq library version 0.4.0. Adds intersphinx mapping to pandas-gbq library. (GH20564)
Added new writer for exporting Stata dta files in version 117, StataWriter117. This format supports exporting strings with lengths up to 2,000,000 characters (GH16450)
to_hdf() and read_hdf() now accept an errors keyword argument to control encoding error handling (GH20835)
cut() has gained the duplicates='raise'|'drop' option to control whether to raise on duplicated edges (GH20947)
date_range(), timedelta_range(), and interval_range() now return a linearly spaced index if start, end, and periods are specified, but freq is not. (GH20808, GH20983, GH20976)
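A few of the enhancements listed above can be tried with the sketch below. The sample data and variable names are illustrative, and the name lookups assume the default English locale.

```python
import pandas as pd

# MultiIndex.unique() restricted to one level.
mi = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
first_level = mi.unique(level=0).tolist()

# Month and day names from a Timestamp.
ts = pd.Timestamp("2018-05-15")
names = (ts.month_name(), ts.day_name())

# Matrix multiplication with the @ operator.
frame = pd.DataFrame([[1, 2], [3, 4]])
product = (frame @ pd.Series([1, 1])).tolist()

# Linearly spaced dates when start, end, and periods are given without freq.
dates = pd.date_range("2018-01-01", "2018-01-05", periods=3)

# Column names for DataFrame.from_dict with orient='index'.
from_rows = pd.DataFrame.from_dict(
    {"r1": [1, 2], "r2": [3, 4]}, orient="index", columns=["x", "y"]
)

print(first_level, names, product)
print(dates)
print(from_rows)
```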
Dependencies have increased minimum versions
We have updated our minimum supported versions of dependencies (GH15184).
If installed, we now require:
Instantiation from dicts preserves dict insertion order for Python 3.6+
Until Python 3.6, dicts in Python had no formally defined ordering. For Python
version 3.6 and later, dicts are ordered by insertion order, see
PEP 468.
When creating a Series or DataFrame from a dict on Python 3.6 or
higher, pandas will use the dict’s insertion order. (GH19884)
Previous behavior (and current behavior if on Python < 3.6):
In [16]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300})
Out[16]:
Expenses -1500
Income 2000
Net result 300
Taxes -200
dtype: int64
Note the Series above is ordered alphabetically by the index values.
New behavior (for Python >= 3.6):
In [74]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300})
....:
Out[74]:
Income 2000
Expenses -1500
Taxes -200
Net result 300
Length: 4, dtype: int64
Notice that the Series is now ordered by insertion order. This new behavior is
used for all relevant pandas types (Series, DataFrame, SparseSeries
and SparseDataFrame).
If you wish to retain the old behavior while using Python >= 3.6, you can use
.sort_index():
In [75]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300}).sort_index()
....:
Out[75]:
Expenses -1500
Income 2000
Net result 300
Taxes -200
Length: 4, dtype: int64
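The same applies to DataFrame columns. A minimal sketch, assuming Python 3.6 or later:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1, 2], "a": [3, 4]})

# Columns follow the dict's insertion order.
columns_in_order = frame.columns.tolist()

# Sorting the column axis recovers the alphabetical layout.
sorted_columns = frame.sort_index(axis=1).columns.tolist()

print(columns_in_order, sorted_columns)
```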
Deprecate Panel
Panel was deprecated in the 0.20.x release, showing as a DeprecationWarning. Using Panel will now show a FutureWarning. The recommended way to represent 3-D data is
with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. pandas
provides a to_xarray() method to automate this conversion (GH13563, GH18324).
In [75]: import pandas._testing as tm
In [76]: p = tm.makePanel()
In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
Convert to a MultiIndex DataFrame
In [78]: p.to_frame()
Out[78]:
ItemA ItemB ItemC
major minor
2000-01-03 A 0.469112 0.721555 0.404705
B -1.135632 0.271860 -1.039268
C 0.119209 0.276232 -1.344312
D -2.104569 0.113648 -0.109050
2000-01-04 A -0.282863 -0.706771 0.577046
B 1.212112 -0.424972 -0.370647
C -1.044236 -1.087401 0.844885
D -0.494929 -1.478427 1.643563
2000-01-05 A -1.509059 -1.039575 -1.715002
B -0.173215 0.567020 -1.157892
C -0.861849 -0.673690 1.075770
D 1.071804 0.524988 -1.469388
[12 rows x 3 columns]
Convert to an xarray DataArray
In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.469112, -1.135632, 0.119209, -2.104569],
[-0.282863, 1.212112, -1.044236, -0.494929],
[-1.509059, -0.173215, -0.861849, 1.071804]],
[[ 0.721555, 0.27186 , 0.276232, 0.113648],
[-0.706771, -0.424972, -1.087401, -1.478427],
[-1.039575, 0.56702 , -0.67369 , 0.524988]],
[[ 0.404705, -1.039268, -1.344312, -0.10905 ],
[ 0.577046, -0.370647, 0.844885, 1.643563],
[-1.715002, -1.157892, 1.07577 , -1.469388]]])
Coordinates:
* items (items) object 'ItemA' 'ItemB' 'ItemC'
* major_axis (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
* minor_axis (minor_axis) object 'A' 'B' 'C' 'D'
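For code that should not depend on Panel at all, the same layout can be built directly from a 3-D NumPy array. This is a sketch under stated assumptions: the array contents, the item names, and the axis labels are made up for illustration, and the result only mirrors the shape of the to_frame() output above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: items x major_axis x minor_axis.
data = np.arange(24).reshape(2, 3, 4)
items = ["ItemA", "ItemB"]
major = pd.date_range("2000-01-03", periods=3)
minor = list("ABCD")

# One row per (major, minor) pair, one column per item.
index = pd.MultiIndex.from_product([major, minor], names=["major", "minor"])
frame = pd.DataFrame(
    {item: data[i].ravel() for i, item in enumerate(items)},
    index=index,
)

print(frame)
```

Flattening each item's 2-D slice in row-major order matches the ordering produced by MultiIndex.from_product, so each value lands on its own (major, minor) label.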
pandas.core.common removals
The following error and warning classes have been removed from pandas.core.common (GH13634, GH19769):
PerformanceWarning
UnsupportedFunctionCall
UnsortedIndexError
AbstractMethodError
These are available for import from pandas.errors (since 0.19.0).
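A minimal sketch of the replacement import, with one illustrative use: silencing PerformanceWarning inside a limited block. The warning message is made up.

```python
import warnings

from pandas.errors import (
    AbstractMethodError,
    PerformanceWarning,
    UnsortedIndexError,
    UnsupportedFunctionCall,
)

# Silence PerformanceWarning for this block only.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=PerformanceWarning)
    warnings.warn("illustrative slow-path message", PerformanceWarning)
```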
Changes to make output of DataFrame.apply consistent
DataFrame.apply() was inconsistent when applying an arbitrary user-defined function that returned a list-like with axis=1. Several bugs and inconsistencies
are resolved. If the applied function returns a Series, then pandas will return a DataFrame; otherwise a Series will be returned. This includes the case
where a list-like (e.g. a tuple or list) is returned (GH16353, GH17437, GH17970, GH17348, GH17892, GH18573,
GH17602, GH18775, GH18901, GH18919).
In [76]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1,
....: columns=['A', 'B', 'C'])
....:
In [77]: df
Out[77]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Previous behavior: if the returned shape happened to match the length of original columns, this would return a DataFrame.
If the return shape did not match, a Series with lists was returned.
In [3]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[3]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
In [4]: df.apply(lambda x: [1, 2], axis=1)
Out[4]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
dtype: object
New behavior: When the applied function returns a list-like, this will now always return a Series.
In [78]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[78]:
0 [1, 2, 3]
1 [1, 2, 3]
2 [1, 2, 3]
3 [1, 2, 3]
4 [1, 2, 3]
5 [1, 2, 3]
Length: 6, dtype: object
In [79]: df.apply(lambda x: [1, 2], axis=1)
Out[79]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
Length: 6, dtype: object
To have expanded columns, you can use result_type='expand':
In [80]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')
Out[80]:
0 1 2
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
To broadcast the result across the original columns (the old behaviour for
list-likes of the correct length), you can use result_type='broadcast'.
The shape must match the original columns.
In [81]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[81]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Returning a Series allows one to control the exact return structure and column names:
In [82]: df.apply(lambda x: pd.Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)
Out[82]:
D E F
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
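The options can be compared on a function whose result has a different length from the input row. The helper summarize and the column names in this sketch are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=["A", "B", "C"]
)


def summarize(row):
    # Returns a plain list, so the result is list-like rather than a Series.
    return [row["A"] + row["B"], row["C"] * 2]


# Default: one list per row, collected into a Series.
as_series = df.apply(summarize, axis=1)

# Expanded: each list element becomes its own column.
expanded = df.apply(summarize, axis=1, result_type="expand")
expanded.columns = ["a_plus_b", "double_c"]

print(as_series)
print(expanded)
```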
Concatenation will no longer sort
In a future version of pandas, pandas.concat() will no longer sort the non-concatenation axis when it is not already aligned.
The current behavior is the same as the previous (sorting), but now a warning is issued when sort is not specified and the non-concatenation axis is not aligned (GH4588).
In [83]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
In [84]: df2 = pd.DataFrame({"a": [4, 5]})
In [85]: pd.concat([df1, df2])
Out[85]:
0 1.0 1
1 2.0 2
0 NaN 4
1 NaN 5
[4 rows x 2 columns]
To keep the previous behavior (sorting) and silence the warning, pass sort=True
In [86]: pd.concat([df1, df2], sort=True)
Out[86]:
a b
0 1 1.0
1 2 2.0
0 4 NaN
1 5 NaN
[4 rows x 2 columns]
To accept the future behavior (no sorting), pass sort=False.
Note that this change also applies to DataFrame.append(), which has also received a sort keyword for controlling this behavior.
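Passing sort explicitly makes the column order independent of the default. A minimal sketch comparing both settings on the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=["b", "a"])
df2 = pd.DataFrame({"a": [4, 5]})

# Keep the columns in order of first appearance.
unsorted_cols = pd.concat([df1, df2], sort=False).columns.tolist()

# Sort the non-concatenation axis.
sorted_cols = pd.concat([df1, df2], sort=True).columns.tolist()

print(unsorted_cols, sorted_cols)
```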
Build changes
Building pandas for development now requires cython >= 0.24 (GH18613)
Building from source now explicitly requires setuptools in setup.py (GH18113)
Updated conda recipe to be in compliance with conda-build 3.0+ (GH18002)
Index division by zero fills correctly
Division operations on Index and subclasses will now fill division of positive numbers by zero with np.inf, division of negative numbers by zero with -np.inf and 0 / 0 with np.nan. This matches existing Series behavior. (GH19322, GH19347)
Previous behavior:
In [6]: index = pd.Int64Index([-1, 0, 1])
In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')
# Previous behavior yielded different results depending on the type of zero in the divisor
In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')
In [9]: index = pd.UInt64Index([0, 1])
In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')
In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero
Current behavior:
In [12]: index = pd.Int64Index([-1, 0, 1])
# division by zero gives -infinity where negative,
# +infinity where positive, and NaN for 0 / 0
In [13]: index / 0
# The result of division by zero should not depend on
# whether the zero is int or float
In [14]: index / 0.0
In [15]: index = pd.UInt64Index([0, 1])
In [16]: index / np.array([0, 0], dtype=np.uint64)
In [17]: pd.RangeIndex(1, 5) / 0
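The filling rules can be checked with the sketch below. It builds the index with the generic pd.Index constructor rather than the specific integer index classes shown above; that substitution is an assumption made so the example does not depend on those class names.

```python
import numpy as np
import pandas as pd

index = pd.Index([-1, 0, 1])

# Integer zero and float zero should give the same result.
by_int_zero = (index / 0).values
by_float_zero = (index / 0.0).values

# Every element of the range is positive, so every quotient is +inf.
range_result = (pd.RangeIndex(1, 5) / 0).values

print(by_int_zero, by_float_zero, range_result)
```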