How to aggregate rows based on a condition in PySpark?
I am trying to aggregate some rows in my PySpark DataFrame based on a condition. Here is my DataFrame:
| customer | total_purchased | location |
|---|---|---|
| john | 4 | Maine |
| john | 3 | Nevada |
| john | 5 | null |
| Mary | 4 | Maine |
| Mary | 4 | Florida |
| Mary | 4 | null |
The result I'm looking to get will look like this:
| customer | total_purchased | location |
|---|---|---|
| john | 9 | Maine |
| john | 8 | Nevada |
| Mary | 8 | Maine |
| Mary | 8 | Florida |
The rows that had a null location are removed, and the total_purchased from the null-location rows is added to the total for each of that customer's non-null locations.
Is there a way to do this in PySpark without too many steps?
I found a very interesting idea in this post (written by pault): Combine two rows in Pyspark if a condition is met
But I wasn't able to implement it because there isn't an obvious column to group by in this scenario.
Filter the records with a null location, group by "customer", and sum "total_purchased". This works even if a customer has multiple null-location records.
Then filter the non-null records, join them with the aggregated dataframe, and add the summed null total to each row's "total_purchased".
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    data=[["john", 4, "Maine"], ["john", 3, "Nevada"], ["john", 5, None],
          ["Mary", 4, "Maine"], ["Mary", 4, "Florida"], ["Mary", 4, None]],
    schema=["customer", "total_purchased", "location"])

# Sum the null-location purchases per customer.
df_null = df.filter(F.col("location").isNull()) \
    .groupBy("customer") \
    .agg(F.sum("total_purchased").alias("total_for_null"))

# Keep the non-null rows, join the per-customer null totals, and add them in.
# coalesce guards against customers that have no null-location rows at all.
df = df.filter(F.col("location").isNotNull()) \
    .join(df_null, on="customer", how="left") \
    .withColumn("total_purchased", F.col("total_purchased") + F.coalesce(F.col("total_for_null"), F.lit(0))) \
    .drop("total_for_null")
Output:
+--------+---------------+--------+
|customer|total_purchased|location|
+--------+---------------+--------+
|Mary |8 |Maine |
|Mary |8 |Florida |
|john |9 |Maine |
|john |8 |Nevada |
+--------+---------------+--------+
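If you'd rather avoid the self-join, a window partitioned by "customer" can do the same thing in one pass. This is only a sketch of the same idea, not part of the original answer; it assumes df is the input DataFrame built above.

from pyspark.sql import Window, functions as F

# Per-customer total of the null-location rows, computed as a window aggregate.
w = Window.partitionBy("customer")
null_total = F.sum(
    F.when(F.col("location").isNull(), F.col("total_purchased")).otherwise(0)
).over(w)

# Add that total to every row, then drop the null-location rows themselves.
result = (
    df.withColumn("total_purchased", F.col("total_purchased") + null_total)
      .filter(F.col("location").isNotNull())
)
result.show(truncate=False)

This should give the same output as above, and customers with no null-location rows are left unchanged because their window sum is 0.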