Spark SQL: DataFrame Operations in Detail

This article walks through the main ways of working with DataFrames in Apache Spark: action operations such as collect, count, first, head and take; basic functions such as toDF, cache, persist, unpersist, schema, printSchema and createOrReplaceTempView; and integrated-language (DSL) functions such as select, filter, groupBy, limit, sort, distinct and col. Each operation is demonstrated with a worked example, from small JSON files up to a multi-million-row CSV.

一、Action Operations
Operation | Description | Return type
collect | Returns all Rows of the DataFrame as an Array | Array[Row]
count | Returns the number of Rows in the DataFrame | Long
first | Returns the first row | Row
head | Returns the first row, equivalent to first | Row
show | Displays the first 20 rows of the DataFrame in tabular form | Unit
take(n) | Returns the first n rows of the DataFrame | Array[Row]
(show has no numbered example below; a short sketch follows the take example.)
1.collect
  • Example
  def collect1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //collect returns all rows to the driver as Array[Row]; use it only on small results
    df.collect().foreach(println)
  }
  • Result
[30,Andy]
[null,Michael]
[19,Justin]
2.count
  • Example
  def count1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    println(df.count())
  }
  • Result
3
3.first
  • Example
  def first1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    println(df.first)
  }
  • Result
[30,Andy]
4.head
  • Example
  def head1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    println(df.head())
  }
  • Result
[30,Andy]
5.take
  • Example
  def take1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.take(2).foreach(println)
  }
  • Result
[30,Andy]
[null,Michael]
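The table above also lists show, which has no numbered example of its own; a minimal sketch, assuming the same people.json file:

  def show1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //prints the first 20 rows (here all 3) as an ASCII table, truncating long values by default
    df.show()
    //show(numRows, truncate) controls how many rows are printed and whether cells are truncated
    df.show(5, truncate = false)
  }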
二、Basic Functions
Function | Description | Return type
toDF | Returns a new DataFrame with the specified column names | DataFrame
cache | Caches the DataFrame | DataFrame
persist | Persists the DataFrame at the given StorageLevel | DataFrame
unpersist | Removes the DataFrame from the persisted/cached state | DataFrame
schema | Returns the schema of the DataFrame | StructType
printSchema | Prints the schema in tree format | Unit
columns | Returns all column names as an Array | Array[String]
createOrReplaceTempView | Registers the DataFrame as a temporary view for SQL queries | Unit
(unpersist has no numbered example below; a short sketch follows the cache example.)
1.toDF
  • Example 1
  //RDD to DataFrame
  def toDF1(ss:SparkSession):Unit={
    val rdd=ss.sparkContext.textFile("E:\\data\\spark\\rdd\\test\\read\\app1.log")
    val mapRdd=rdd.map(line=>line.split(",")).map{arr=>(arr(0),arr(1),arr(2),arr(3),arr(4),arr(5))}
    import ss.implicits._
    val df=mapRdd.toDF("Date","Name","APP","DownLoad","Area","Version")
    df.show()
  }
  • Result 1
+----------+----+----+---------+----+-------+
|      Date|Name| APP| DownLoad|Area|Version|
+----------+----+----+---------+----+-------+
|2020-05-15|张三|王者荣耀| 华为应用|北京|v2.0|
|2020-05-15|李四|王者荣耀| 应用宝  |北京|v1.2|
|2020-05-15|李四|王者荣耀| 应用宝  |北京|v1.5|
|2020-05-15|王五| 阴阳师 |app store|上海|v2.9|
+----------+----+----+---------+----+-------+
  • Example 2
  //Dataset to DataFrame
  def toDF2(ss:SparkSession):Unit={
    import ss.implicits._
    val ds=ss.createDataset(Seq(
      ("张三",21,11111.11),
      ("李四",22,22222.22),
      ("王五",23,33333.33),
      ("赵六",24,44444.44)
    ))
    val df=ds.toDF("Name","Age","Salary")
    df.show()
  }
  • Result 2
+----+---+--------+
|Name|Age|  Salary|
+----+---+--------+
| 张三|21|11111.11|
| 李四|22|22222.22|
| 王五|23|33333.33|
| 赵六|24|44444.44|
+----+---+--------+
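toDF can also be called with no arguments when the source already carries field names, for example a Seq of case classes; a minimal sketch (the Person case class below is made up for illustration):

  case class Person(name:String, age:Int)

  def toDF3(ss:SparkSession):Unit={
    import ss.implicits._
    //the case class field names become the column names, so toDF needs no arguments
    val df=Seq(Person("Andy",30),Person("Justin",19)).toDF()
    df.show()
  }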
2.columns
  • Example 1
  def columns1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.columns.foreach(println)
  }
  • Result 1
age
name
3.persist
  • Example
  def persist1(ss:SparkSession):Unit={
    //requires: import org.apache.spark.storage.StorageLevel
    val df=ss.read.option("header","true").csv("E:\\data\\spark\\rdd\\test\\read\\ml-25m\\genome-scores.csv")
    df.persist(StorageLevel.MEMORY_AND_DISK)

    //the first count reads the CSV and populates the cache
    val start1=System.currentTimeMillis()
    val rows1=df.count()
    val end1=System.currentTimeMillis()
    println("rows: "+rows1+" , "+(end1-start1)+" ms")
    //the second count is served from the persisted data
    val start2=System.currentTimeMillis()
    val rows2=df.count()
    val end2=System.currentTimeMillis()
    println("rows: "+rows2+" , "+(end2-start2)+" ms")
  }
  • Result
rows: 15584448 , 31027 ms
rows: 15584448 , 186 ms
4.cache
  • Example
  def cache1(ss:SparkSession):Unit={
    val df=ss.read.option("header","true").csv("E:\\data\\spark\\rdd\\test\\read\\ml-25m\\genome-scores.csv")
    //cache() is equivalent to persist() with the default storage level
    df.cache()

    //the first count materializes the cache; the second one reads from it
    val start1=System.currentTimeMillis()
    val rows1=df.count()
    val end1=System.currentTimeMillis()
    println("rows: "+rows1+" , "+(end1-start1)+" ms")
    val start2=System.currentTimeMillis()
    val rows2=df.count()
    val end2=System.currentTimeMillis()
    println("rows: "+rows2+" , "+(end2-start2)+" ms")
  }
  • Result
rows: 15584448 , 33180 ms
rows: 15584448 , 192 ms
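unpersist appears in the table above but has no numbered example; a minimal sketch that builds a cache like the one above and then releases it:

  def unpersist1(ss:SparkSession):Unit={
    val df=ss.read.option("header","true").csv("E:\\data\\spark\\rdd\\test\\read\\ml-25m\\genome-scores.csv")
    df.cache()
    df.count()        //materializes the cache
    df.unpersist()    //removes the cached blocks; pass blocking = true to wait until deletion finishes
  }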
5.schema
  • Example
  def schema1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    println(df.schema)
  }
  • Result
StructType(StructField(age,LongType,true), StructField(name,StringType,true))
6.printSchema
  • Example
  def printSchema1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.printSchema()
  }
  • Result
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
7.createOrReplaceTempView
  • Example
  def createOrReplaceTempView1(ss:SparkSession):Unit={
    val df=ss.read.option("header","true").csv("E:\\data\\spark\\rdd\\test\\read\\ml-25m\\movies.csv")
    df.createOrReplaceTempView("t_movie")
    val selectDF=ss.sql("select movieId,title from t_movie where movieId>=5 and movieId<=10")
    selectDF.show()
  }
  • Result
+-------+-------------------+
|movieId|              title|
+-------+-------------------+
|      6|        Heat (1995)|
|      7|     Sabrina (1995)|
|      8|Tom and Huck (1995)|
|      9|Sudden Death (1995)|
+-------+-------------------+
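The registered view lives only for the current SparkSession; when it is no longer needed it can be removed through the catalog, a minimal sketch:

    ss.catalog.dropTempView("t_movie")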
三、Integrated-Language Functions
Function | Description | Return type
select | Selects a set of columns | DataFrame
filter | Filters rows using the given SQL expression | DataFrame
where | Filters rows using the given expression | DataFrame
groupBy | Groups the DataFrame by the given columns so that aggregations can be run | GroupedData
limit | Returns the first n rows as a new DataFrame | DataFrame
sort | Returns a new DataFrame sorted by the given expressions | DataFrame
orderBy | Sorts by the given expressions; alias for sort | DataFrame
distinct | Returns a new DataFrame containing only the unique rows | DataFrame
col | Selects a column by name and returns it as a Column | Column
agg | Aggregates over the whole DataFrame without grouping | DataFrame
drop | Drops a column and returns the resulting new DataFrame | DataFrame
(col, agg and drop have no numbered examples below; short sketches appear after the related subsections.)
1.select
  • Example 1
  def select1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.select("name","age").show()
  }
  • Result 1
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
|Michael|null|
| Justin|  19|
+-------+----+
  • Example 2
  def select2(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.select(df("name"), df("age")+1).show()
  }
  • Result 2
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
|Michael|     null|
| Justin|       20|
+-------+---------+
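col from the table at the start of this part has no numbered example; a minimal sketch of the same select written with Column objects (same people.json):

  import org.apache.spark.sql.functions.col

  def select3(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //df.col("name") and the standalone col("age") both return a Column
    df.select(df.col("name"), col("age")+1).show()
  }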
2.filter
  • Example 1
  def filter1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //equivalent Column-based form: df.filter(df("age")<25).show()
    df.filter("age<25").show()
  }
  • Result 1
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+
  • Example 2
  def filter2(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.filter("name=='Andy'").show()
  }
  • Result 2
+---+----+
|age|name|
+---+----+
| 30|Andy|
| 30|Andy|
+---+----+
3.groupBy
  • Example 1
  def groupBy1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.groupBy("name").count().show()
  }
  • Result 1
+-------+-----+
|   name|count|
+-------+-----+
|Michael|    1|
|   Andy|    2|
| Justin|    1|
+-------+-----+
  • Example 2
  def groupBy2(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.groupBy("name","age").count().show()
  }
  • Result 2
+-------+----+-----+
|   name| age|count|
+-------+----+-----+
|   Andy|  30|    2|
|Michael|null|    1|
| Justin|  19|    1|
+-------+----+-----+
  • Example 3
  def groupBy3(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.createOrReplaceTempView("people")  //register a temporary view
    val groupByDf=ss.sql("select name,count(name) as num from people group by name")
    groupByDf.show()
  }
  • Result 3
+-------+---+
|   name|num|
+-------+---+
|Michael|  1|
|   Andy|  2|
| Justin|  1|
+-------+---+
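agg from the table has no numbered example; a minimal sketch showing both the grouped and the whole-DataFrame forms (same people1.json):

  import org.apache.spark.sql.functions.{avg, max}

  def agg1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    //per-group aggregation after groupBy
    df.groupBy("name").agg(avg("age"), max("age")).show()
    //aggregation over the whole DataFrame, without grouping
    df.agg(avg("age"), max("age")).show()
  }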
4.distinct
  • Example
  def distinct1(ss: SparkSession): Unit = {
    val df = ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.distinct().show()
  }
  • Result
The duplicate Andy row in people1.json is removed, so only the unique rows remain.
5.where
  • Example 1
  def where1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.where("age<20").show()
  }
  • Result 1
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+
  • Example 2
  def where2(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.where("name=='Justin'").show()
  }
  • Result 2
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+
6.limit
  • Example
  def limit1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people1.json")
    df.limit(2).show()
  }
  • Result
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|null|Michael|
+----+-------+
7.sort
  • Example 1
  def sort1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.sort("age").show()
  }
  • Result 1
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+
  • Example 2
  def sort2(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //equivalent column-name form: df.sort("name","age").show()
    df.sort(df("name"),df("age")).show()
  }
  • Result 2
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
  • Example 3
  def sort3(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.sort(df("name").desc,df("age").asc).show()
  }
  • Result 3
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+
8.orderBy
  • Example 1
  def orderBy1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    df.orderBy(df("name"),df("age").asc).show()
  }
  • Result 1
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
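drop from the table has no numbered example either; a minimal sketch (same people.json):

  def drop1(ss:SparkSession):Unit={
    val df=ss.read.json("E:\\data\\spark\\dataframe\\test\\read\\people.json")
    //returns a new DataFrame without the age column; the original df is unchanged
    df.drop("age").show()
  }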