Spark SQL alternative to MySQL's GROUP_CONCAT aggregate function

This post shows how to mimic MySQL's GROUP_CONCAT function in Spark SQL, collapsing all friends of each username into a single concatenated string per row. It walks through an implementation based on UserDefinedAggregateFunction, and also mentions an approach that can be faster in practice: extracting the RDD, using groupByKey and mkString, and rebuilding the DataFrame.


I have a table with two string-type columns (username, friend), and for each username I want to collect all of its friends on one row, concatenated as a string. For example: ('username1', 'friend1, friend2, friend3')

I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?

Solution

Before you proceed: this operation is essentially yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.

It is not an especially concise or efficient solution, but you can use the UserDefinedAggregateFunction introduced in Spark 1.5.0:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

import scala.collection.mutable.ArrayBuffer

object GroupConcat extends UserDefinedAggregateFunction {
  // Input is a single string column.
  def inputSchema: StructType = new StructType().add("x", StringType)

  // The aggregation buffer holds the strings collected so far.
  def bufferSchema: StructType = new StructType().add("buff", ArrayType(StringType))

  // The final result is a single concatenated string.
  def dataType: DataType = StringType

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  // Append each non-null input value to the buffer.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  // Combine two partial buffers by concatenating their sequences.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  // Produce the final comma-separated string (UTF8String is Spark's internal string type).
  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}

Example usage:

val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")

df.groupBy($"username").agg(GroupConcat($"friend").alias("friends")).show

## +---------+---------------+
## | username|        friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+

In practice it can be faster to extract the RDD, groupByKey, mkString, and rebuild the DataFrame, as sketched below.
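A minimal sketch of that RDD-based approach, assuming the same df as above and that the toDF implicits are already in scope (as in the earlier example); the column names are illustrative, not part of the original answer:

df.rdd
  .map(row => (row.getString(0), row.getString(1)))  // (username, friend) pairs
  .groupByKey()                                      // gather all friends per username
  .mapValues(_.mkString(","))                        // concatenate into a single string
  .toDF("username", "friends")
  .show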

You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
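The same combination can also be expressed in plain SQL. A hedged sketch, assuming Spark 2.x with a SparkSession named spark; the temporary view name is chosen here for illustration (on Spark 1.6 you would use registerTempTable instead, and collect_list requires a HiveContext):

df.createOrReplaceTempView("friends_table")

spark.sql(
  """SELECT username, concat_ws(',', collect_list(friend)) AS friends
    |FROM friends_table
    |GROUP BY username""".stripMargin
).show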
