python - Concatenate strings from several rows using Pandas groupby

link管理
链接快照平台
输入网页链接，自动生成快照
标签化管理网页链接
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I want to apply some sort of concatenation of the strings in a column using groupby.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
# load string as stream into dataframe
df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to look like this:
You can groupby the 'name' and 'month' columns, then call transform which will return data aligned to the original df and apply a lambda where we join the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
    name         text  month
0  name1       hej,du     11
2  name1        aj,oj     12
4  name2     fin,katt     11
6  name2  mycket,lite     12
I sub the original df by passing a list of the columns of interest df[['name','text','month']] here and then call drop_duplicates
EDIT actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
update
the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]: 
    name  month         text
0  name1     11           du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
                In pandas < 1.0, .drop_duplicates() ignores the index, which may give unexpected results. You can avoid this by using .agg(lambda x: ','.join(x)) instead of .transform().drop_duplicates().
– Matthias Fripp
                May 30, 2020 at 2:41
                drop_duplicates() might not work if you do not include parameter drop_duplicates(inplace=True) or just rewrite the line of code as  df = df[['name','text','month']].drop_duplicates()
– IAmBotmaker
                Sep 23, 2020 at 11:46
                What ensures that the text e.g. in the first column is actually "hej du" and not "du hej"? Is there an implicit sort somewhere? How can I make this explicit, e.g. sort by the date column?
– Thomas
                Aug 4, 2021 at 13:55
We can groupby the 'name' and 'month' columns, then call agg() functions of Panda’s DataFrame objects.
The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation.
df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
                f = lambda x: func(x, *args, **kwargs) TypeError: sequence item 45: expected str instance, NoneType found on NULL or None values in the database
– Andrew
                Jul 21, 2022 at 16:43
                This also allows you to keep additional columns, for example by adding , 'othercol': 'last' into the agg dict
– fantabolous
                Sep 13, 2022 at 5:53
The answer by EdChum provides you with a lot of flexibility but if you just want to concateate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
                Man, you've just saved me a lot of time. Thank you. This is the best way to assemble the chronological lists of registrations/user ids into 'cohorts' that I know of. Thank you once again.
– Alex Fedotov
                Jun 28, 2020 at 2:37
                This solution worked for me very well for getting the unique appearances too. I just used “set” instead of “list” and then daisy chained a join and presto. Note that it doesn’t work if there are nan values, so I had to use fillna() on the text field first.  In my case the command ended: df.groupby(['doc_id'])['author'].apply(set).apply(", ".join).reset_index()
– whydoesntwork
                Apr 11, 2022 at 12:52
For me the above solutions were close but added some unwanted /n's and dtype:object, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Although, this is an old question. But just in case. I used the below code and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns.
df = pandas.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'],
                       'B' : ['i', 'j', 'k', 'i', 'j'],
                       'X' : [1, 2, 2, 1, 3]})
  A  B  X
  a  i  1
  a  j  2
  b  k  2
  c  i  1
  c  j  3
df.groupby("X", as_index=False)["A"].agg(' '.join)
  X    A
  1  a c
  2  a b
  3    c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
  X    A    B
  1  a c  i i
  2  a b  j k
  3    c    j
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.