This is a major release from 0.22.0 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- Merging / sorting on a combination of columns and index levels
- Changes to make output shape of DataFrame.apply consistent
Check the API Changes and deprecations before updating.
Warning
Starting January 1, 2019, pandas feature releases will support Python 3 only. See Dropping Python 2.7 for more.
What’s new in v0.23.0
- New observed keyword for excluding unobserved categories in GroupBy
- Rolling/Expanding.apply() accepts raw=False to pass a Series to the function
- DataFrame.astype performs column-wise conversion to Categorical
- Instantiation from dicts preserves dict insertion order for Python 3.6+
JSON read/write round-trippable with orient='table'#
A DataFrame can now be written to and subsequently read back via JSON while preserving metadata through usage of the orient='table' argument (see GH 18912 and GH 9146). Previously, none of the available orient values guaranteed the preservation of dtypes and index names, amongst other metadata.
In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
...: 'bar': ['a', 'b', 'c', 'd'],
...: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
...: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
...: index=pd.Index(range(4), name='idx'))
In [2]: df
Out[2]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c
[4 rows x 4 columns]
In [3]: df.dtypes
Out[3]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
In [4]: df.to_json('test.json', orient='table')
In [5]: new_df = pd.read_json('test.json', orient='table')
In [6]: new_df
Out[6]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c
[4 rows x 4 columns]
In [7]: new_df.dtypes
Out[7]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
Please note that the string index is not supported with the round-trip format, as it is used by default in write_json to indicate a missing index name.
In [8]: df.index.name = 'index'
In [9]: df.to_json('test.json', orient='table')
In [10]: new_df = pd.read_json('test.json', orient='table')
In [11]: new_df
Out[11]:
foo bar baz qux
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
[4 rows x 4 columns]
In [12]: new_df.dtypes
Out[12]:
foo int64
bar object
baz datetime64[ns]
qux category
Length: 4, dtype: object
Method .assign() accepts dependent arguments#
DataFrame.assign() now accepts dependent keyword arguments for Python 3.6 and later (see also PEP 468). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the documentation here. (GH 14207)
In [13]: df = pd.DataFrame({'A': [1, 2, 3]})
In [14]: df
Out[14]:
   A
0  1
1  2
2  3
[3 rows x 1 columns]
In [15]: df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
Out[15]:
A B C
0 1 1 2
1 2 2 4
2 3 3 6
[3 rows x 3 columns]
Warning
This may subtly change the behavior of your code when you're using .assign() to update an existing column. Previously, callables referring to other variables being updated would get the "old" values.
Previous behavior:
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
   A  C
0  2 -1
1  3 -2
2  4 -3
New behavior:
In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
   A  C
0  2 -2
1  3 -3
2  4 -4
[3 rows x 2 columns]
Merging on a combination of columns and index levels#
Strings passed to DataFrame.merge() as the on, left_on, and right_on parameters may now refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes. See the Merge on columns and levels documentation section. (GH 14355)
In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
In [18]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
....: 'B': ['B0', 'B1', 'B2', 'B3'],
....: 'key2': ['K0', 'K1', 'K0', 'K1']},
....: index=left_index)
....:
In [19]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
In [20]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
....: 'D': ['D0', 'D1', 'D2', 'D3'],
....: 'key2': ['K0', 'K0', 'K0', 'K1']},
....: index=right_index)
....:
In [21]: left.merge(right, on=['key1', 'key2'])
Out[21]:
       A   B key2   C   D
key1
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3
[3 rows x 5 columns]
Sorting by a combination of columns and index levels#
Strings passed to DataFrame.sort_values() as the by parameter may now refer to either column names or index level names. This enables sorting DataFrame instances by a combination of index levels and columns without resetting indexes. See the Sorting by Indexes and Values documentation section. (GH 14353)
# Build MultiIndex
In [22]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
....: ('b', 2), ('b', 1), ('b', 1)])
....:
In [23]: idx.names = ['first', 'second']
# Build DataFrame
In [24]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
....: index=idx)
....:
In [25]: df_multi
Out[25]:
              A
first second
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1
[6 rows x 1 columns]
# Sort by 'second' (index) and 'A' (column)
In [26]: df_multi.sort_values(by=['second', 'A'])
Out[26]:
              A
first second
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5
[6 rows x 1 columns]
Extending pandas with custom types (experimental)#
pandas now supports storing array-like objects that aren’t necessarily 1-D NumPy
arrays as columns in a DataFrame or values in a Series. This allows third-party
libraries to implement extensions to NumPy’s types, similar to how pandas
implemented categoricals, datetimes with timezones, periods, and intervals.
As a demonstration, we'll use cyberpandas, which provides an IPArray type for storing IP addresses.
In [1]: from cyberpandas import IPArray
In [2]: values = IPArray([
...: 0,
...: 3232235777,
...: 42540766452641154071740215577757643572
...: ])
IPArray isn't a normal 1-D NumPy array, but because it's a pandas ExtensionArray, it can be stored properly inside pandas' containers.
In [3]: ser = pd.Series(values)
In [4]: ser
Out[4]:
0 0.0.0.0
1 192.168.1.1
2 2001:db8:85a3::8a2e:370:7334
dtype: ip
Notice that the dtype is ip
. The missing value semantics of the underlying
array are respected:
In [5]: ser.isna()
Out[5]:
0 True
1 False
2 False
dtype: bool
For more, see the extension types
documentation. If you build an extension array, publicize it on the ecosystem page.
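The interface is a regular Python class hierarchy. Below is a minimal skeleton, not taken from the original page: the names IPDtype and MyIPArray are hypothetical, and a real implementation must fill in several more required methods.
from pandas.api.extensions import ExtensionArray, ExtensionDtype

class IPDtype(ExtensionDtype):
    # every extension dtype declares at least a name and a scalar type
    name = 'ip'
    type = int

class MyIPArray(ExtensionArray):
    # the full interface also requires _from_sequence, __getitem__,
    # __len__, isna, take, copy, _concat_same_type, and more
    pass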
New observed keyword for excluding unobserved categories in GroupBy#
Grouping by a categorical includes the unobserved categories in the output. When grouping by multiple categorical columns, this means you get the cartesian product of all the categories, including combinations where there are no observations, which can result in a large number of groups. We have added a keyword observed to control this behavior; it defaults to observed=False for backward compatibility. (GH 14942, GH 8138, GH 15217, GH 17594, GH 8669, GH 20583, GH 20902)
In [27]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [28]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [29]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [30]: df['C'] = ['foo', 'bar'] * 2
In [31]: df
Out[31]:
A B values C
0 a c 1 foo
1 a d 2 bar
2 b c 3 foo
3 b d 4 bar
[4 rows x 4 columns]
To show all values, the previous behavior:
In [32]: df.groupby(['A', 'B', 'C'], observed=False).count()
Out[32]:
values
A B C
a c bar 0
foo 1
d bar 1
foo 0
y bar 0
... ...
z c foo 0
d bar 0
foo 0
y bar 0
foo 0
[18 rows x 1 columns]
To show only observed values:
In [33]: df.groupby(['A', 'B', 'C'], observed=True).count()
Out[33]:
values
A B C
a c foo 1
d bar 1
b c foo 1
d bar 1
[4 rows x 1 columns]
For pivoting operations, this behavior is already controlled by the dropna keyword:
In [34]: cat1 = pd.Categorical(["a", "a", "b", "b"],
....: categories=["a", "b", "z"], ordered=True)
....:
In [35]: cat2 = pd.Categorical(["c", "d", "c", "d"],
....: categories=["c", "d", "y"], ordered=True)
....:
In [36]: df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
In [37]: df
Out[37]:
A B values
0 a c 1
1 a d 2
2 b c 3
3 b d 4
[4 rows x 3 columns]
In [38]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=True)
....:
Out[38]:
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0
[4 rows x 1 columns]
In [39]: pd.pivot_table(df, values='values', index=['A', 'B'],
....: dropna=False)
....:
Out[39]:
     values
A B
a c     1.0
  d     2.0
  y     NaN
b c     3.0
  d     4.0
  y     NaN
z c     NaN
  d     NaN
  y     NaN
[9 rows x 1 columns]
Rolling/Expanding.apply() accepts raw=False to pass a Series to the function#
Series.rolling().apply(), DataFrame.rolling().apply(), Series.expanding().apply(), and DataFrame.expanding().apply() have gained a raw=None parameter. This is similar to DataFrame.apply(). This parameter, if True, allows one to send a np.ndarray to the applied function; if False, a Series will be passed. The default is None, which preserves backward compatibility, so this will default to True, sending an np.ndarray. In a future version the default will be changed to False, sending a Series. (GH 5071, GH 20584)
In [40]: s = pd.Series(np.arange(5), np.arange(5) + 1)
In [41]: s
Out[41]:
1 0
2 1
3 2
4 3
5 4
Length: 5, dtype: int64
Pass a Series:
In [42]: s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
Out[42]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
Length: 5, dtype: float64
Mimic the original behavior of passing an ndarray:
In [43]: s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)
Out[43]:
1 0.0
2 1.0
3 2.0
4 3.0
5 4.0
Length: 5, dtype: float64
DataFrame.interpolate has gained the limit_area kwarg#
DataFrame.interpolate() has gained a limit_area parameter to allow further control of which NaNs are replaced. Use limit_area='inside' to fill only NaNs surrounded by valid values, or use limit_area='outside' to fill only NaNs outside the existing valid values while preserving those inside. See the full documentation here. (GH 16284)
In [44]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
....: np.nan, 13, np.nan, np.nan])
....:
In [45]: ser
Out[45]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill one consecutive inside value in both directions
In [46]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
Out[46]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
5 11.0
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill all consecutive outside values backward
In [47]: ser.interpolate(limit_direction='backward', limit_area='outside')
Out[47]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
Length: 9, dtype: float64
Fill all consecutive outside values in both directions
In [48]: ser.interpolate(limit_direction='both', limit_area='outside')
Out[48]:
0 5.0
1 5.0
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 13.0
8 13.0
Length: 9, dtype: float64
Function get_dummies now supports dtype argument#
get_dummies() now accepts a dtype argument, which specifies a dtype for the new columns. The default remains uint8. (GH 18330)
In [49]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
In [50]: pd.get_dummies(df, columns=['c']).dtypes
Out[50]:
a      int64
b      int64
c_5    uint8
c_6    uint8
Length: 4, dtype: object
In [51]: pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
Out[51]:
a int64
b int64
c_5 bool
c_6 bool
Length: 4, dtype: object
Timedelta mod method#
mod (%) and divmod operations are now defined on Timedelta objects when operating with either timedelta-like or with numeric arguments. See the documentation here. (GH 19365)
In [52]: td = pd.Timedelta(hours=37)
In [53]: td % pd.Timedelta(minutes=45)
Out[53]: Timedelta('0 days 00:15:00')
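divmod works analogously, returning the floor-division quotient together with the remainder. A minimal sketch using the td defined above; the commented result is worked out by hand (37 hours is 49 * 45 minutes plus 15 minutes), not taken from the original page:
# divmod(a, b) is equivalent to (a // b, a % b)
divmod(td, pd.Timedelta(minutes=45))
# (49, Timedelta('0 days 00:15:00'))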
Method .rank() handles inf values when NaN are present#
In previous versions, .rank() would assign inf elements NaN as their ranks. Now ranks are calculated properly. (GH 6945)
In [54]: s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
In [55]: s
Out[55]:
0 -inf
1 0.0
2 1.0
3 NaN
4 inf
Length: 5, dtype: float64
Previous behavior:
In [11]: s.rank()
Out[11]:
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
Current behavior:
In [56]: s.rank()
Out[56]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
Length: 5, dtype: float64
Furthermore, previously if you ranked inf or -inf values together with NaN values, the calculation wouldn't distinguish NaN from infinity when using the 'top' or 'bottom' argument.
In [57]: s = pd.Series([np.nan, np.nan, -np.inf, -np.inf])
In [58]: s
Out[58]:
0 NaN
1 NaN
2 -inf
3 -inf
Length: 4, dtype: float64
Previous behavior:
In [15]: s.rank(na_option='top')
Out[15]:
0 2.5
1 2.5
2 2.5
3 2.5
dtype: float64
Current behavior:
In [59]: s.rank(na_option='top')
Out[59]:
0 1.5
1 1.5
2 3.5
3 3.5
Length: 4, dtype: float64
These bugs were squashed:
- Bug in DataFrame.rank() and Series.rank() when method='dense' and pct=True in which percentile ranks were not being used with the number of distinct observations (GH 15630)
- Bug in Series.rank() and DataFrame.rank() when ascending='False' failed to return correct ranks for infinity if NaN were present (GH 19538)
- Bug in DataFrameGroupBy.rank() where ranks were incorrect when both infinity and NaN were present (GH 20561)
Series.str.cat has gained the join kwarg#
Previously, Series.str.cat() did not – in contrast to most of pandas – align Series on their index before concatenation (see GH 18657). The method has now gained a keyword join to control the manner of alignment; see the examples below and here. In v0.23 join will default to None (meaning no alignment), but this default will change to 'left' in a future version of pandas.
In [60]: s = pd.Series(['a', 'b', 'c', 'd'])
In [61]: t = pd.Series(['b', 'd', 'e', 'c'], index=[1, 3, 4, 2])
In [62]: s.str.cat(t)
Out[62]:
0 NaN
1 bb
2 cc
3 dd
Length: 4, dtype: object
In [63]: s.str.cat(t, join='left', na_rep='-')
Out[63]:
0 a-
1 bb
2 cc
3 dd
Length: 4, dtype: object
Furthermore, Series.str.cat() now works for CategoricalIndex as well (previously raised a ValueError; see GH 20842).
DataFrame.astype performs column-wise conversion to Categorical#
DataFrame.astype() can now perform column-wise conversion to Categorical by supplying the string 'category' or a CategoricalDtype. Previously, attempting this would raise a NotImplementedError. See the Object creation section of the documentation for more details and examples. (GH 12860, GH 18099)
Supplying the string 'category' performs column-wise conversion, with only labels appearing in a given column set as categories:
In [64]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [65]: df = df.astype('category')
In [66]: df['A'].dtype
Out[66]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)
In [67]: df['B'].dtype
Out[67]: CategoricalDtype(categories=['b', 'c', 'd'], ordered=False, categories_dtype=object)
Supplying a CategoricalDtype will make the categories in each column consistent with the supplied dtype:
In [68]: from pandas.api.types import CategoricalDtype
In [69]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
In [70]: cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
In [71]: df = df.astype(cdt)
In [72]: df['A'].dtype
Out[72]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True, categories_dtype=object)
In [73]: df['B'].dtype
Out[73]: CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True, categories_dtype=object)
Other enhancements#
- Unary + now permitted for Series and DataFrame as a numeric operator (GH 16073)
- Better support for to_excel() output with the xlsxwriter engine (GH 16149)
- pandas.tseries.frequencies.to_offset() now accepts leading '+' signs, e.g. '+1h' (GH 18171)
- MultiIndex.unique() now supports the level= argument, to get unique values from a specific index level (GH 17896)
- pandas.io.formats.style.Styler now has method hide_index() to determine whether the index will be rendered in output (GH 14194)
- pandas.io.formats.style.Styler now has method hide_columns() to determine whether columns will be hidden in output (GH 14194)
- Improved wording of ValueError raised in to_datetime() when unit= is passed with a non-convertible value (GH 14350)
- Series.fillna() now accepts a Series or a dict as a value for a categorical dtype (GH 17033)
- pandas.read_clipboard() updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python 3 and multiple python-qt bindings (GH 17722)
- Improved wording of ValueError raised in read_csv() when the usecols argument cannot match all columns (GH 17301)
- DataFrame.corrwith() now silently drops non-numeric columns when passed a Series; before, an exception was raised (GH 18570)
- IntervalIndex now supports time zone aware Interval objects (GH 18537, GH 18538)
- Series() / DataFrame() tab completion also returns identifiers in the first level of a MultiIndex() (GH 16326)
- read_excel() has gained the nrows parameter (GH 16645)
- DataFrame.append() can now in more cases preserve the type of the calling dataframe's columns (e.g. if both are CategoricalIndex) (GH 18359)
- DataFrame.to_json() and Series.to_json() now accept an index argument which allows the user to exclude the index from the JSON output (GH 17394)
- IntervalIndex.to_tuples() has gained the na_tuple parameter to control whether NA is returned as a tuple of NA, or NA itself (GH 18756)
- Categorical.rename_categories, CategoricalIndex.rename_categories and Series.cat.rename_categories can now take a callable as their argument (GH 18862)
- Interval and IntervalIndex have gained a length attribute (GH 18789)
- Resampler objects now have a functioning pipe method; previously, calls to pipe were diverted to the mean method (GH 17905)
- is_scalar() now returns True for DateOffset objects (GH 18943)
- DataFrame.pivot() now accepts a list for the values= kwarg (GH 17160)
- Added pandas.api.extensions.register_dataframe_accessor(), pandas.api.extensions.register_series_accessor(), and pandas.api.extensions.register_index_accessor(), accessors for libraries downstream of pandas to register custom accessors like .cat on pandas objects; see Registering Custom Accessors for more (GH 14781)
- IntervalIndex.astype now supports conversions between subtypes when passed an IntervalDtype (GH 19197)
- IntervalIndex and its associated constructor methods (from_arrays, from_breaks, from_tuples) have gained a dtype parameter (GH 19262)
- Added pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing() and pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing() (GH 17015)
- For subclassed DataFrames, DataFrame.apply() will now preserve the Series subclass (if defined) when passing the data to the applied function (GH 19822)
- DataFrame.from_dict() now accepts a columns argument that can be used to specify the column names when orient='index' is used (GH 18529)
- Added option display.html.use_mathjax so MathJax can be disabled when rendering tables in Jupyter notebooks (GH 19856, GH 19824)
- DataFrame.replace() now supports the method parameter, which can be used to specify the replacement method when to_replace is a scalar, list or tuple and value is None (GH 19632)
- Timestamp.month_name(), DatetimeIndex.month_name(), and Series.dt.month_name() are now available (GH 12805)
- Timestamp.day_name() and DatetimeIndex.day_name() are now available to return day names with a specified locale (GH 12806)
- DataFrame.to_sql() now performs a multi-value insert if the underlying connection supports it, rather than inserting row by row. SQLAlchemy dialects supporting multi-value inserts include: mysql, postgresql, sqlite and any dialect with supports_multivalues_insert (GH 14315, GH 8953)
- read_html() now accepts a displayed_only keyword argument to control whether or not hidden elements are parsed (True by default) (GH 20027)
- read_html() now reads all <tbody> elements in a <table>, not just the first (GH 20690)
- quantile() now accepts the interpolation keyword, linear by default (GH 20497)
- Zip compression is supported via compression='zip' in DataFrame.to_pickle(), Series.to_pickle(), DataFrame.to_csv(), Series.to_csv(), DataFrame.to_json(), Series.to_json() (GH 17778)
- WeekOfMonth constructor now supports n=0 (GH 20517)
- DataFrame and Series now support the matrix multiplication (@) operator for Python >= 3.5 (GH 10259); see the sketch after this list
- Updated DataFrame.to_gbq() and pandas.read_gbq() signature and documentation to reflect changes from the pandas-gbq library version 0.4.0; adds intersphinx mapping to the pandas-gbq library (GH 20564)
- Added new writer for exporting Stata dta files in version 117, StataWriter117; this format supports exporting strings with lengths up to 2,000,000 characters (GH 16450)
- to_hdf() and read_hdf() now accept an errors keyword argument to control encoding error handling (GH 20835)
- cut() has gained the duplicates='raise'|'drop' option to control whether to raise on duplicated edges (GH 20947)
- date_range(), timedelta_range(), and interval_range() now return a linearly spaced index if start, stop, and periods are specified, but freq is not (GH 20808, GH 20983, GH 20976)
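As a quick illustration of the new matrix multiplication operator mentioned above, here is a minimal sketch (the frames are invented for demonstration; @ behaves like DataFrame.dot, aligning the left operand's columns with the right operand's index):
import pandas as pd

# '@' aligns left's columns with right's index before multiplying,
# so this is equivalent to left.dot(right)
left = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
right = pd.DataFrame([[5, 6], [7, 8]], index=['x', 'y'])
print(left @ right)
#     0   1
# 0  19  22
# 1  43  50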
Dependencies have increased minimum versions#
We have updated our minimum supported versions of dependencies (GH 15184). If installed, we now require:

Package            Minimum Version
---------------    ---------------
python-dateutil    2.5.0
openpyxl           2.4.0
beautifulsoup4     4.2.1
setuptools         24.2.0
Instantiation from dicts preserves dict insertion order for Python 3.6+#
Until Python 3.6, dicts in Python had no formally defined ordering. For Python version 3.6 and later, dicts are ordered by insertion order, see PEP 468. pandas will use the dict's insertion order when creating a Series or DataFrame from a dict, if you're using Python version 3.6 or higher (GH 19884).
Previous behavior (and current behavior if on Python < 3.6):
In [16]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300})
Out[16]:
Expenses -1500
Income 2000
Net result 300
Taxes -200
dtype: int64
Note the Series above is ordered alphabetically by the index values.
New behavior (for Python >= 3.6):
In [74]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300})
....:
Out[74]:
Income 2000
Expenses -1500
Taxes -200
Net result 300
Length: 4, dtype: int64
Notice that the Series is now ordered by insertion order. This new behavior is used for all relevant pandas types (Series, DataFrame, SparseSeries and SparseDataFrame).
If you wish to retain the old behavior while using Python >= 3.6, you can use .sort_index():
In [75]: pd.Series({'Income': 2000,
....: 'Expenses': -1500,
....: 'Taxes': -200,
....: 'Net result': 300}).sort_index()
....:
Out[75]:
Expenses -1500
Income 2000
Net result 300
Taxes -200
Length: 4, dtype: int64
Deprecate Panel#
Panel was deprecated in the 0.20.x release, showing as a DeprecationWarning. Using Panel will now show a FutureWarning. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. pandas provides a to_xarray() method to automate this conversion (GH 13563, GH 18324).
In [75]: import pandas._testing as tm
In [76]: p = tm.makePanel()
In [77]: p
Out[77]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
Convert to a MultiIndex DataFrame
In [78]: p.to_frame()
Out[78]:
ItemA ItemB ItemC
major minor
2000-01-03 A 0.469112 0.721555 0.404705
B -1.135632 0.271860 -1.039268
C 0.119209 0.276232 -1.344312
D -2.104569 0.113648 -0.109050
2000-01-04 A -0.282863 -0.706771 0.577046
B 1.212112 -0.424972 -0.370647
C -1.044236 -1.087401 0.844885
D -0.494929 -1.478427 1.643563
2000-01-05 A -1.509059 -1.039575 -1.715002
B -0.173215 0.567020 -1.157892
C -0.861849 -0.673690 1.075770
D 1.071804 0.524988 -1.469388
[12 rows x 3 columns]
Convert to an xarray DataArray
In [79]: p.to_xarray()
Out[79]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.469112, -1.135632, 0.119209, -2.104569],
[-0.282863, 1.212112, -1.044236, -0.494929],
[-1.509059, -0.173215, -0.861849, 1.071804]],
[[ 0.721555, 0.27186 , 0.276232, 0.113648],
[-0.706771, -0.424972, -1.087401, -1.478427],
[-1.039575, 0.56702 , -0.67369 , 0.524988]],
[[ 0.404705, -1.039268, -1.344312, -0.10905 ],
[ 0.577046, -0.370647, 0.844885, 1.643563],
[-1.715002, -1.157892, 1.07577 , -1.469388]]])
Coordinates:
* items (items) object 'ItemA' 'ItemB' 'ItemC'
* major_axis (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
* minor_axis (minor_axis) object 'A' 'B' 'C' 'D'
pandas.core.common removals#
The following error & warning messages are removed from pandas.core.common (GH 13634, GH 19769):
- PerformanceWarning
- UnsupportedFunctionCall
- UnsortedIndexError
- AbstractMethodError
These are available to import from pandas.errors (available there since 0.19.0).
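A minimal import check showing the new location (the class names are exactly those listed above):
# these names now live in pandas.errors rather than pandas.core.common
from pandas.errors import (
    AbstractMethodError,
    PerformanceWarning,
    UnsortedIndexError,
    UnsupportedFunctionCall,
)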
Changes to make output of DataFrame.apply consistent#
DataFrame.apply() was inconsistent when applying an arbitrary user-defined function that returned a list-like with axis=1. Several bugs and inconsistencies are resolved. If the applied function returns a Series, then pandas will return a DataFrame; otherwise a Series will be returned, including the case where a list-like (e.g. a tuple or list) is returned (GH 16353, GH 17437, GH 17970, GH 17348, GH 17892, GH 18573, GH 17602, GH 18775, GH 18901, GH 18919).
In [76]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1,
....: columns=['A', 'B', 'C'])
....:
In [77]: df
Out[77]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Previous behavior: if the returned shape happened to match the length of original columns, this would return a DataFrame. If the return shape did not match, a Series with lists was returned.
In [3]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[3]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
In [4]: df.apply(lambda x: [1, 2], axis=1)
Out[4]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
dtype: object
New behavior: When the applied function returns a list-like, this will now always return a Series.
In [78]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[78]:
0 [1, 2, 3]
1 [1, 2, 3]
2 [1, 2, 3]
3 [1, 2, 3]
4 [1, 2, 3]
5 [1, 2, 3]
Length: 6, dtype: object
In [79]: df.apply(lambda x: [1, 2], axis=1)
Out[79]:
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
4 [1, 2]
5 [1, 2]
Length: 6, dtype: object
To have expanded columns, you can use result_type='expand'
In [80]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')
Out[80]:
0 1 2
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
To broadcast the result across the original columns (the old behaviour for list-likes of the correct length), you can use result_type='broadcast'. The shape must match the original columns.
In [81]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[81]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Returning a Series allows one to control the exact return structure and column names:
In [82]: df.apply(lambda x: pd.Series([1, 2, 3], index=['D', 'E', 'F']), axis=1)
Out[82]:
D E F
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
[6 rows x 3 columns]
Concatenation will no longer sort#
In a future version of pandas, pandas.concat() will no longer sort the non-concatenation axis when it is not already aligned. The current behavior is the same as the previous (sorting), but now a warning is issued when sort is not specified and the non-concatenation axis is not aligned (GH 4588).
In [83]: df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=['b', 'a'])
In [84]: df2 = pd.DataFrame({"a": [4, 5]})
In [85]: pd.concat([df1, df2])
Out[85]:
   a    b
0  1  1.0
1  2  2.0
0  4  NaN
1  5  NaN
[4 rows x 2 columns]
To keep the previous behavior (sorting) and silence the warning, pass sort=True
In [86]: pd.concat([df1, df2], sort=True)
Out[86]:
a b
0 1 1.0
1 2 2.0
0 4 NaN
1 5 NaN
[4 rows x 2 columns]
To accept the future behavior (no sorting), pass sort=False, as sketched below.
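A minimal sketch; the commented output is worked out by hand for the no-sort behavior, not taken from the original page:
# columns keep their order of appearance: df1's ['b', 'a'] comes first
pd.concat([df1, df2], sort=False)
#      b  a
# 0  1.0  1
# 1  2.0  2
# 0  NaN  4
# 1  NaN  5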
Note that this change also applies to DataFrame.append(), which has also received a sort keyword for controlling this behavior.
Build changes#
- Building pandas for development now requires cython >= 0.24 (GH 18613)
- Building from source now explicitly requires setuptools in setup.py (GH 18113)
- Updated conda recipe to be in compliance with conda-build 3.0+ (GH 18002)
Index division by zero fills correctly#
Division operations on Index and subclasses will now fill division of positive numbers by zero with np.inf, division of negative numbers by zero with -np.inf and 0 / 0 with np.nan. This matches existing Series behavior. (GH 19322, GH 19347)
Previous behavior:
In [6]: index = pd.Int64Index([-1, 0, 1])
In [7]: index / 0
Out[7]: Int64Index([0, 0, 0], dtype='int64')
# Previous behavior yielded different results depending on the type of zero in the divisor
In [8]: index / 0.0
Out[8]: Float64Index([-inf, nan, inf], dtype='float64')
In [9]: index = pd.UInt64Index([0, 1])
In [10]: index / np.array([0, 0], dtype=np.uint64)
Out[10]: UInt64Index([0, 0], dtype='uint64')
In [11]: pd.RangeIndex(1, 5) / 0
ZeroDivisionError: integer division or modulo by zero
Current behavior:
In [12]: index = pd.Int64Index([-1, 0, 1])
# division by zero gives -infinity where negative,
# +infinity where positive, and NaN for 0 / 0
In [13]: index / 0
Out[13]: Float64Index([-inf, nan, inf], dtype='float64')
# The result of division by zero should not depend on
# whether the zero is int or float
In [14]: index / 0.0
Out[14]: Float64Index([-inf, nan, inf], dtype='float64')
In [15]: index = pd.UInt64Index([0, 1])
In [16]: index / np.array([0, 0], dtype=np.uint64)
Out[16]: Float64Index([nan, inf], dtype='float64')
In [17]: pd.RangeIndex(1, 5) / 0
Out[17]: Float64Index([inf, inf, inf, inf], dtype='float64')
Extraction of matching patterns from strings#
By default, extracting matching patterns from strings with str.extract() used to return a Series if a single group was being extracted (a DataFrame if more than one group was extracted). As of pandas 0.23.0 str.extract() always returns a DataFrame, unless expand is set to False. Finally, None was an accepted value for the expand parameter (which was equivalent to False), but now raises a ValueError. (GH 11386)
Previous behavior:
In [1]: s = pd.Series(['number 10', '12 eggs'])
In [2]: extracted = s.str.extract(r'.*(\d\d).*')
In [3]: extracted
Out[3]:
0    10
1    12
dtype: object
In [4]: type(extracted)
Out[4]: pandas.core.series.Series
New behavior:
In [87]: s = pd.Series(['number 10', '12 eggs'])
In [88]: extracted = s.str.extract(r'.*(\d\d).*')
In [89]: extracted
Out[89]:
0 10
1 12
[2 rows x 1 columns]
In [90]: type(extracted)
Out[90]: pandas.core.frame.DataFrame
To restore previous behavior, simply set expand to False:
In [91]: s = pd.Series(['number 10', '12 eggs'])
In [92]: extracted = s.str.extract(r'.*(\d\d).*', expand=False)
In [93]: extracted
Out[93]:
0 10
1 12
Length: 2, dtype: object
In [94]: type(extracted)
Out[94]: pandas.core.series.Series
Default value for the ordered parameter of CategoricalDtype#
The default value of the ordered parameter for CategoricalDtype has changed from False to None to allow updating of categories without impacting ordered. Behavior should remain consistent for downstream objects, such as Categorical (GH 18790).
In previous versions, the default value for the ordered parameter was False. This could potentially lead to the ordered parameter unintentionally being changed from True to False when users attempted to update categories if ordered was not explicitly specified, as it would silently default to False. The new behavior for ordered=None is to retain the existing value of ordered.
New behavior:
In [2]: from pandas.api.types import CategoricalDtype
In [3]: cat = pd.Categorical(list('abcaba'), ordered=True, categories=list('cba'))
In [4]: cat
Out[4]:
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]
In [5]: cdt = CategoricalDtype(categories=list('cbad'))
In [6]: cat.astype(cdt)
Out[6]:
[a, b, c, a, b, a]
Categories (4, object): [c < b < a < d]
Notice in the example above that the converted Categorical has retained ordered=True. Had the default value for ordered remained as False, the converted Categorical would have become unordered, despite ordered=False never being explicitly specified. To change the value of ordered, explicitly pass it to the new dtype, e.g. CategoricalDtype(categories=list('cbad'), ordered=False).
Note that the unintentional conversion of ordered discussed above did not arise in previous versions due to separate bugs that prevented astype from doing any type of category-to-category conversion (GH 10696, GH 18593). These bugs have been fixed in this release, and motivated changing the default value of ordered.
Better pretty-printing of DataFrames in a terminal#
Previously, the default value for the maximum number of columns was pd.options.display.max_columns=20. This meant that relatively wide data frames would not fit within the terminal width, and pandas would introduce line breaks to display these 20 columns. This resulted in an output that was relatively difficult to read.
If Python runs in a terminal, the maximum number of columns is now determined automatically so that the printed data frame fits within the current terminal width (pd.options.display.max_columns=0) (GH 17023). If Python runs as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook, as well as in many IDEs), this value cannot be inferred automatically and is thus set to 20 as in previous versions. In a terminal, this results in a much nicer output.
Note that if you don't like the new default, you can always set this option yourself. To revert to the old setting, you can run this line:
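# restore the fixed pre-0.23 default of 20 columns
pd.set_option('display.max_columns', 20)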