How to select one row for each distinct value for a particular column and merge to form a new dataframe in Python?

Vaidehi :

The dataset I am using looks like this. It is a video captioning data set with captions under the column 'Description'.

Video_ID       Description
mv89psg6zh4    A bird is bathing in a sink.
mv89psg6zh4    A faucet is running while a bird stands.
mv89psg6zh4    A bird gets washed.
mv89psg6zh4    A parakeet is taking a shower in a sink.
mv89psg6zh4    The bird is taking a bath under the faucet.
mv89psg6zh4    A bird is standing in a sink drinking water.
R2DvpPTfl-E    PLAYING GAME ON LAPTOP.
R2DvpPTfl-E    THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU    A woman is pouring ingredients into a bowl.
l7x8uIdg2XU    A woman is adding milk to some pasta.
l7x8uIdg2XU    A person adds ingredients to pasta. 
l7x8uIdg2XU    the girls are doing the cooking.

However, the number of captions varies from video to video.

I want to extract one row for each unique Video_ID and combine these rows into a new dataframe. I also want to delete those same rows from the existing dataframe.

The result I want should look like this:

Dataframe 1-

Video_ID       Description
mv89psg6zh4    A faucet is running while a bird stands.
mv89psg6zh4    A bird gets washed.
mv89psg6zh4    A parakeet is taking a shower in a sink.
mv89psg6zh4    The bird is taking a bath under the faucet.
mv89psg6zh4    A bird is standing in a sink drinking water.
R2DvpPTfl-E    THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU    A woman is adding milk to some pasta.
l7x8uIdg2XU    A person adds ingredients to pasta. 
l7x8uIdg2XU    the girls are doing the cooking.

Dataframe 2-

Video_ID       Description
mv89psg6zh4    A bird is bathing in a sink.
R2DvpPTfl-E    PLAYING GAME ON LAPTOP.
l7x8uIdg2XU    A woman is pouring ingredients into a bowl.

In other words, the selected rows are moved out of the existing dataframe and into a new one.

Quang Hoang :

You can use groupby() on the index to sample one index label per Video_ID:

s = df.index.to_series().groupby(df['Video_ID']).apply(lambda x: x.sample(n=1))

# one randomly chosen row per Video_ID
df.loc[s]

# the remaining rows (returns a new dataframe; df itself is unchanged)
df.drop(s)
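
Note that the expected output in the question keeps the first caption of each video rather than a random one. A minimal sketch of that deterministic variant follows, assuming the dataframe has a unique index; the sample data is abbreviated from the question, and the names picked, rest and picked_random are only illustrative. The random variant shown at the end relies on groupby(...).sample, which is available in pandas 1.1 and later.

```python
import pandas as pd

# Abbreviated sample data taken from the question.
df = pd.DataFrame({
    "Video_ID": [
        "mv89psg6zh4", "mv89psg6zh4", "mv89psg6zh4",
        "R2DvpPTfl-E", "R2DvpPTfl-E",
        "l7x8uIdg2XU", "l7x8uIdg2XU",
    ],
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands.",
        "A bird gets washed.",
        "PLAYING GAME ON LAPTOP.",
        "THE MAN IS WATCHING LAPTOP.",
        "A woman is pouring ingredients into a bowl.",
        "A woman is adding milk to some pasta.",
    ],
})

# Deterministic variant: keep the first row seen for each Video_ID.
picked = df.drop_duplicates(subset="Video_ID", keep="first")

# The remaining rows: drop the picked rows by index label.
# This assumes the index labels are unique.
rest = df.drop(picked.index)

# Random variant (pandas >= 1.1): one sampled row per Video_ID.
# random_state is fixed here only to make the sketch reproducible.
picked_random = df.groupby("Video_ID").sample(n=1, random_state=0)
rest_random = df.drop(picked_random.index)
```

Both drop_duplicates and drop return new dataframes, so to "move" the rows you would reassign, for example df = rest.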

Origin http://43.154.161.224:23101/article/api/json?id=360211&siteId=1