How to aggregate rows based on a condition in PySpark?
I am trying to aggregate some rows in my PySpark DataFrame based on a condition. Here is my DataFrame:
| customer | total_purchased | location |
|---|---|---|
| john | 4 | Maine |
| john | 3 | Nevada |
| john | 5 | null |
| Mary | 4 | Maine |
| Mary | 4 | Florida |
| Mary | 4 | null |
The result I'm looking to get will look like this:
| customer | total_purchased | location |
|---|---|---|
| john | 9 | Maine |
| john | 8 | Nevada |
| Mary | 8 | Maine |
| Mary | 8 | Florida |
The rows that had a null location are removed, and the total_purchased from the null-location rows is added to the total for each of that customer's non-null locations.
Is there a way to do this in PySpark without too many steps?
I found a very interesting idea in this post (written by pault): Combine two rows in Pyspark if a condition is met
But I wasn't able to implement it because there isn't an obvious column to group by in this scenario.
Filter the records with a null location, group by "customer", and sum "total_purchased". This works even if a customer has multiple null-location records.
Then filter the non-null records, join them with the aggregated dataframe, and add the summed null total to each row's "total_purchased".
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    data=[["john", 4, "Maine"], ["john", 3, "Nevada"], ["john", 5, None],
          ["Mary", 4, "Maine"], ["Mary", 4, "Florida"], ["Mary", 4, None]],
    schema=["customer", "total_purchased", "location"])

# Sum the null-location purchases per customer.
df_null = df.filter(F.col("location").isNull()) \
    .groupBy("customer") \
    .agg(F.sum("total_purchased").alias("total_for_null"))

# Keep the non-null rows, join the per-customer null totals, and add them in.
# coalesce guards against customers that have no null-location rows at all.
df = df.filter(F.col("location").isNotNull()) \
    .join(df_null, on="customer", how="left") \
    .withColumn("total_purchased", F.col("total_purchased") + F.coalesce(F.col("total_for_null"), F.lit(0))) \
    .drop("total_for_null")
Output:
+--------+---------------+--------+
|customer|total_purchased|location|
+--------+---------------+--------+
|Mary |8 |Maine |
|Mary |8 |Florida |
|john |9 |Maine |
|john |8 |Nevada |
+--------+---------------+--------+
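If you'd rather avoid the self-join, a window partitioned by "customer" can do the same thing in one pass. This is only a sketch of the same idea, not part of the original answer; it assumes df is the input DataFrame built above.

from pyspark.sql import Window, functions as F

# Per-customer total of the null-location rows, computed as a window aggregate.
w = Window.partitionBy("customer")
null_total = F.sum(
    F.when(F.col("location").isNull(), F.col("total_purchased")).otherwise(0)
).over(w)

# Add that total to every row, then drop the null-location rows themselves.
result = (
    df.withColumn("total_purchased", F.col("total_purchased") + null_total)
      .filter(F.col("location").isNotNull())
)
result.show(truncate=False)

This should give the same output as above, and customers with no null-location rows are left unchanged because their window sum is 0.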