Spark SQL alternative to MySQL's GROUP_CONCAT aggregate function

This post shows how to mimic MySQL's GROUP_CONCAT function in Spark SQL, collapsing all friends of each username into a single concatenated string per row. It walks through an implementation based on UserDefinedAggregateFunction, and also mentions an approach that can be faster in practice: extracting the RDD, using groupByKey and mkString, and rebuilding the DataFrame.


I have a table with two string-type columns (username, friend), and for each username I want to collect all of its friends on one row, concatenated as a string. For example: ('username1', 'friend1, friend2, friend3')

I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?

Solution

Before you proceed: this operation is essentially yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.

It is not an especially concise or efficient solution, but you can use the UserDefinedAggregateFunction introduced in Spark 1.5.0:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

import scala.collection.mutable.ArrayBuffer

object GroupConcat extends UserDefinedAggregateFunction {
  // Input is a single string column.
  def inputSchema: StructType = new StructType().add("x", StringType)

  // The aggregation buffer holds the strings collected so far.
  def bufferSchema: StructType = new StructType().add("buff", ArrayType(StringType))

  // The final result is a single concatenated string.
  def dataType: DataType = StringType

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  // Append each non-null input value to the buffer.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  // Combine two partial buffers by concatenating their sequences.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  // Produce the final comma-separated string (UTF8String is Spark's internal string type).
  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}

Example usage:

val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")

df.groupBy($"username").agg(GroupConcat($"friend").alias("friends")).show

## +---------+---------------+
## | username|        friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+

In practice it can be faster to extract the RDD, groupByKey, mkString, and rebuild the DataFrame, as sketched below.
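A minimal sketch of that RDD-based approach, assuming the same df as above and that the toDF implicits are already in scope (as in the earlier example); the column names are illustrative, not part of the original answer:

df.rdd
  .map(row => (row.getString(0), row.getString(1)))  // (username, friend) pairs
  .groupByKey()                                      // gather all friends per username
  .mapValues(_.mkString(","))                        // concatenate into a single string
  .toDF("username", "friends")
  .show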

You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
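The same combination can also be expressed in plain SQL. A hedged sketch, assuming Spark 2.x with a SparkSession named spark; the temporary view name is chosen here for illustration (on Spark 1.6 you would use registerTempTable instead, and collect_list requires a HiveContext):

df.createOrReplaceTempView("friends_table")

spark.sql(
  """SELECT username, concat_ws(',', collect_list(friend)) AS friends
    |FROM friends_table
    |GROUP BY username""".stripMargin
).show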
