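This article is part of the column "1000个问题搞定大数据技术体系" (1000 Questions to Master the Big Data Technology Stack). The column is the author's original work; please cite the source when quoting it. If you find omissions or errors, please point them out in the comments. Thank you!

Contents

For the column's table of contents and references, see 1000个问题搞定大数据技术体系.

Spark SQL functions.scala source code analysis (1): Sort functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (2): Aggregate functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (3): Window functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (4): Non-aggregate functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (5): Math functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (6): Misc functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (7): String functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (8): DateTime functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (9): Collection functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (10): Partition transform functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (11): Scala UDF functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (12): Java UDF functions (based on Spark 3.3.0)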
Main text

ascii

    def ascii(e: Column): Column = withExpr { Ascii(e.expr) }

Usage

    ========== df.select(ascii($"a"), ascii($"b"), ascii($"c")).show() ==========
    +--------+--------+--------+
    |ascii(a)|ascii(b)|ascii(c)|
    +--------+--------+--------+
    |      97|      97|       0|
    +--------+--------+--------+

base64

    def base64(e: Column): Column = withExpr { Base64(e.expr) }

Usage

    ========== df.select(base64($"a"), base64($"b"), base64($"c")).show() ==========
    +---------+---------+---------+
    |base64(a)|base64(b)|base64(c)|
    +---------+---------+---------+
    |     YWJj| YWFhQmI=|         |
    +---------+---------+---------+

bit_length

    def bit_length(e: Column): Column = withExpr { BitLength(e.expr) }

concat_ws

    @scala.annotation.varargs
    def concat_ws(sep: String, exprs: Column*): Column = withExpr {
      ConcatWs(Literal.create(sep, StringType) +: exprs.map(_.expr))
    }

Usage

    ========== df.select(concat_ws(";", $"a", $"b", $"c")).show() ==========
    +---------------------+
    |concat_ws(;, a, b, c)|
    +---------------------+
    |           abc;aaaBb;|
    +---------------------+

decode/encode

    def decode(value: Column, charset: String): Column = withExpr {
      StringDecode(value.expr, lit(charset).expr)
    }

    def encode(value: Column, charset: String): Column = withExpr {
      Encode(value.expr, lit(charset).expr)
    }

Usage

    ========== df.select(decode($"a", "utf-8")).show() ==========
    +----------------------+
    |stringdecode(a, utf-8)|
    +----------------------+
    |                   abc|
    +----------------------+

    ========== df.select(encode($"a", "utf-8")).show() ==========
    +----------------+
    |encode(a, utf-8)|
    +----------------+
    |      [61 62 63]|
    +----------------+

format_number/format_string

    def format_number(x: Column, d: Int): Column = withExpr {
      FormatNumber(x.expr, lit(d).expr)
    }

    @scala.annotation.varargs
    def format_string(format: String, arguments: Column*): Column = withExpr {
      FormatString((lit(format) +: arguments).map(_.expr): _*)
    }

Usage

format_number rounds with the HALF_EVEN rounding mode: a value is rounded toward its nearest neighbor, and when the two neighbors are equidistant it is rounded toward the even one.
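
The tie-breaking rule is easier to see on a few concrete values. The following is a minimal sketch that uses the JDK's `java.math.BigDecimal` with `RoundingMode.HALF_EVEN`; the object name `HalfEvenDemo` and the helper `roundHalfEven` are hypothetical names chosen for illustration, and the snippet only demonstrates the rounding mode itself, not Spark's internal `FormatNumber` code path.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object HalfEvenDemo {
  // Hypothetical helper: round a decimal literal to `scale` fractional digits
  // with HALF_EVEN and return it as a plain (non-scientific) string.
  def roundHalfEven(value: String, scale: Int): String =
    new JBigDecimal(value).setScale(scale, RoundingMode.HALF_EVEN).toPlainString

  def main(args: Array[String]): Unit = {
    println(roundHalfEven("2.5", 0))      // tie between 2 and 3, even neighbor wins: 2
    println(roundHalfEven("3.5", 0))      // tie between 3 and 4, even neighbor wins: 4
    println(roundHalfEven("7.128381", 4)) // not a tie, nearest neighbor: 7.1284
    println(roundHalfEven("6.48173", 4))  // not a tie, nearest neighbor: 6.4817
  }
}
```

Only exact ties are affected by the "even" rule; every other value simply goes to its nearest neighbor, which is why the outputs for 6.48173 and 7.128381 look like ordinary rounding.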

    ========== df.select(format_number(lit(5L), 4)).show() ==========
    +-------------------+
    |format_number(5, 4)|
    +-------------------+
    |             5.0000|
    +-------------------+

    ========== df.select(format_number(lit(1.toByte), 4)).show() ==========
    +-------------------+
    |format_number(1, 4)|
    +-------------------+
    |             1.0000|
    +-------------------+

    ========== df.select(format_number(lit(2.toShort), 4)).show() ==========
    +-------------------+
    |format_number(2, 4)|
    +-------------------+
    |             2.0000|
    +-------------------+

    ========== df.select(format_number(lit(3.1322.toFloat), 4)).show() ==========
    +------------------------+
    |format_number(3.1322, 4)|
    +------------------------+
    |                  3.1322|
    +------------------------+

    ========== df.select(format_number(lit(4), 4)).show() ==========
    +-------------------+
    |format_number(4, 4)|
    +-------------------+
    |             4.0000|
    +-------------------+

    ========== df.select(format_number(lit(5L), 4)).show() ==========
    +-------------------+
    |format_number(5, 4)|
    +-------------------+
    |             5.0000|
    +-------------------+

    ========== df.select(format_number(lit(6.48173), 4)).show() ==========
    +-------------------------+
    |format_number(6.48173, 4)|
    +-------------------------+
    |                   6.4817|
    +-------------------------+

    ========== df.select(format_number(lit(BigDecimal("7.128381")), 4)).show() ==========
    +--------------------------+
    |format_number(7.128381, 4)|
    +--------------------------+
    |                    7.1284|
    +--------------------------+

    ========== df.select(format_string("aa%d%s", lit(123), lit("cc"))).show() ==========
    +------------------------------+
    |format_string(aa%d%s, 123, cc)|
    +------------------------------+
    |                       aa123cc|
    +------------------------------+

initcap

    def initcap(e: Column): Column = withExpr { InitCap(e.expr) }

Usage

    ========== df.select(initcap($"a"), initcap($"b"), initcap($"c")).show() ==========
    +----------+----------+----------+
    |initcap(a)|initcap(b)|initcap(c)|
    +----------+----------+----------+
    |       Abc|     Aaabb|          |
    +----------+----------+----------+

instr

    def instr(str: Column, substring: String): Column = withExpr {
      StringInstr(str.expr, lit(substring).expr)
    }

Usage

    ========== df.select(instr($"b", "aa")).show() ==========
    +------------+
    |instr(b, aa)|
    +------------+
    |           1|
    +------------+

length

    def length(e: Column): Column = withExpr { Length(e.expr) }

Usage

    ========== df.select(length($"a"), length($"b"), length($"c")).show() ==========
    +---------+---------+---------+
    |length(a)|length(b)|length(c)|
    +---------+---------+---------+
    |        3|        5|        0|
    +---------+---------+---------+

lower

    def lower(e: Column): Column = withExpr { Lower(e.expr) }

Usage

    ========== df.select(lower($"b")).show() ==========
    +--------+
    |lower(b)|
    +--------+
    |   aaabb|
    +--------+

levenshtein
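
    def levenshtein(l: Column, r: Column): Column = withExpr { Levenshtein(l.expr, r.expr) }

Usage

The Levenshtein distance is a kind of edit distance: the minimum number of single-character edit operations needed to turn one string into the other. The permitted operations are substituting one character for another, inserting a character, and deleting a character.

For example, turning kitten into sitting takes three edits:

sitten (k→s)
sittin (e→i)
sitting (→g)

The Russian scientist Vladimir Levenshtein introduced the concept in 1965.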
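
The definition above maps directly onto a small dynamic-programming routine. The sketch below is a textbook two-row implementation in plain Scala, written only to illustrate the three edit operations; the object name `LevenshteinDemo` is hypothetical, and this is not the code Spark's `Levenshtein` expression runs internally.

```scala
object LevenshteinDemo {
  // Edit distance between s and t, keeping only two rows of the DP table.
  def levenshtein(s: String, t: String): Int = {
    // prev(j) = distance between the empty prefix of s and the first j chars of t
    var prev = Array.range(0, t.length + 1)
    for (i <- 1 to s.length) {
      val curr = new Array[Int](t.length + 1)
      curr(0) = i // deleting the first i chars of s yields the empty string
      for (j <- 1 to t.length) {
        val substitution = prev(j - 1) + (if (s(i - 1) == t(j - 1)) 0 else 1)
        val insertion = curr(j - 1) + 1
        val deletion = prev(j) + 1
        curr(j) = math.min(substitution, math.min(insertion, deletion))
      }
      prev = curr
    }
    prev(t.length)
  }

  def main(args: Array[String]): Unit = {
    println(levenshtein("kitten", "sitting")) // 3
    println(levenshtein("abc", "aaaBb"))      // 4
  }
}
```

The second call uses the same values as columns a and b of the sample DataFrame in this article. The comparison is case-sensitive, so B and b count as different characters.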

    ========== df.select(levenshtein($"a", $"b")).show() ==========
    +-----------------+
    |levenshtein(a, b)|
    +-----------------+
    |                4|
    +-----------------+

locate

    def locate(substr: String, str: Column): Column = withExpr {
      new StringLocate(lit(substr).expr, str.expr)
    }

    def locate(substr: String, str: Column, pos: Int): Column = withExpr {
      StringLocate(lit(substr).expr, str.expr, lit(pos).expr)
    }

Usage

    ========== df.select(locate("aa", $"b")).show() ==========
    +----------------+
    |locate(aa, b, 1)|
    +----------------+
    |               1|
    +----------------+

    ========== df.select(locate("aa", $"b", 2)).show() ==========
    +----------------+
    |locate(aa, b, 2)|
    +----------------+
    |               2|
    +----------------+

lpad

    def lpad(str: Column, len: Int, pad: String): Column = withExpr {
      StringLPad(str.expr, lit(len).expr, lit(pad).expr)
    }

    def lpad(str: Column, len: Int, pad: Array[Byte]): Column = withExpr {
      new BinaryLPad(str.expr, lit(len).expr, lit(pad).expr)
    }

Usage

    ========== df.select(lpad($"a", 10, " ")).show() ==========
    +--------------+
    |lpad(a, 10,  )|
    +--------------+
    |           abc|
    +--------------+

ltrim

    def ltrim(e: Column): Column = withExpr { StringTrimLeft(e.expr) }

    def ltrim(e: Column, trimString: String): Column = withExpr {
      StringTrimLeft(e.expr, Literal(trimString))
    }

Usage

    ========== df.select(ltrim(lit("   123"))).show() ==========
    +-------------+
    |ltrim(   123)|
    +-------------+
    |          123|
    +-------------+

    ========== df.select(ltrim(lit("aaa123"), "a")).show() ==========
    +---------------------------+
    |TRIM(LEADING a FROM aaa123)|
    +---------------------------+
    |                        123|
    +---------------------------+

octet_length

    def octet_length(e: Column): Column = withExpr { OctetLength(e.expr) }

regexp_extract/regexp_replace

    def regexp_extract(e: Column, exp: String, groupIdx: Int): Column = withExpr {
      RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
    }

    def regexp_replace(e: Column, pattern: String, replacement: String): Column = withExpr {
      RegExpReplace(e.expr, lit(pattern).expr, lit(replacement).expr)
    }

    def regexp_replace(e: Column, pattern: Column, replacement: Column): Column = withExpr {
      RegExpReplace(e.expr, pattern.expr, replacement.expr)
    }

Usage

    ========== df.select(regexp_extract(lit("abc123"), "(\\d+)", 1)).show() ==========
    +--------------------------------+
    |regexp_extract(abc123, (\d+), 1)|
    +--------------------------------+
    |                             123|
    +--------------------------------+

    ========== df.select(regexp_replace(lit("abc123"), "(\\d+)", "num")).show() ==========
    +-------------------------------------+
    |regexp_replace(abc123, (\d+), num, 1)|
    +-------------------------------------+
    |                               abcnum|
    +-------------------------------------+

    ========== df.select(regexp_replace(lit("abc123"), lit("(\\d+)"), lit("num"))).show() ==========
    +-------------------------------------+
    |regexp_replace(abc123, (\d+), num, 1)|
    +-------------------------------------+
    |                               abcnum|
    +-------------------------------------+

unbase64

    def unbase64(e: Column): Column = withExpr { UnBase64(e.expr) }

Usage

    ========== df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show() ==========
    +---------------------+
    |unbase64(X'01020304')|
    +---------------------+
    |                   []|
    +---------------------+

rpad

    def rpad(str: Column, len: Int, pad: String): Column = withExpr {
      StringRPad(str.expr, lit(len).expr, lit(pad).expr)
    }

    def rpad(str: Column, len: Int, pad: Array[Byte]): Column = withExpr {
      new BinaryRPad(str.expr, lit(len).expr, lit(pad).expr)
    }

Usage

    ========== df.select(rpad($"a", 10, " ")).show() ==========
    +--------------+
    |rpad(a, 10,  )|
    +--------------+
    |    abc       |
    +--------------+

repeat

    def repeat(str: Column, n: Int): Column = withExpr {
      StringRepeat(str.expr, lit(n).expr)
    }

Usage

    ========== df.select(repeat($"a", 3)).show() ==========
    +------------+
    |repeat(a, 3)|
    +------------+
    |   abcabcabc|
    +------------+

rtrim

    def rtrim(e: Column): Column = withExpr { StringTrimRight(e.expr) }

    def rtrim(e: Column, trimString: String): Column = withExpr {
      StringTrimRight(e.expr, Literal(trimString))
    }

Usage

    ========== df.select(rtrim(lit("123   "))).show() ==========
    +-------------+
    |rtrim(123   )|
    +-------------+
    |          123|
    +-------------+

    ========== df.select(rtrim(lit("123aaa"), "a")).show() ==========
    +----------------------------+
    |TRIM(TRAILING a FROM 123aaa)|
    +----------------------------+
    |                         123|
    +----------------------------+

soundex
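
    def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }

Usage

Soundex is an algorithm that converts a text string into an alphanumeric pattern describing how it sounds. Because it takes similar-sounding characters and syllables into account, strings can be compared by pronunciation rather than by spelling.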
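
To make the "alphanumeric pattern" concrete, here is a minimal sketch of the classic American Soundex rules in plain Scala: keep the first letter, map the following consonants to digits, skip repeated codes, and pad with zeros to four characters. The object name `SoundexDemo` and its helpers are hypothetical, and the handling of edge cases (non-letter first character, non-letters inside the string) is an assumption of this sketch rather than a statement about Spark's `SoundEx` expression.

```scala
object SoundexDemo {
  // Digit for each letter A..Z: '0' = not coded (vowels, Y), '7' = H and W (ignored).
  private val Codes = "01230127022455012623017202"

  private def isAsciiLetter(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

  def soundex(s: String): String =
    if (s.isEmpty || !isAsciiLetter(s.head)) s // assumption: leave such input unchanged
    else {
      val out = new StringBuilder
      out.append(s.head.toUpper)
      var last = Codes(s.head.toUpper - 'A') // code of the previously seen letter
      var i = 1
      while (i < s.length && out.length < 4) {
        val ch = s(i)
        if (!isAsciiLetter(ch)) {
          last = '0' // a non-letter separates equal codes
        } else {
          val code = Codes(ch.toUpper - 'A')
          if (code != '7') { // H and W do not reset the previous code
            if (code != '0' && code != last) out.append(code)
            last = code
          }
        }
        i += 1
      }
      out.toString + "0" * (4 - out.length) // pad to exactly four characters
    }

  def main(args: Array[String]): Unit = {
    println(soundex("abc"))    // A120
    println(soundex("aaaBb"))  // A100
    println(soundex("Robert")) // R163
  }
}
```

The first two calls use the values of columns a and b from the sample DataFrame in this article. In "aaaBb" the two adjacent b sounds collapse into a single 1, which is why the code ends in two padding zeros.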

    ========== df.select(soundex($"a"), soundex($"b")).show() ==========
    +----------+----------+
    |soundex(a)|soundex(b)|
    +----------+----------+
    |      A120|      A100|
    +----------+----------+

split

    def split(str: Column, pattern: String): Column = withExpr {
      StringSplit(str.expr, Literal(pattern), Literal(-1))
    }

    def split(str: Column, pattern: String, limit: Int): Column = withExpr {
      StringSplit(str.expr, Literal(pattern), Literal(limit))
    }

Usage

    ========== df.select(split(lit("a;b;c"), ";")).show() ==========
    +-------------------+
    |split(a;b;c, ;, -1)|
    +-------------------+
    |          [a, b, c]|
    +-------------------+

    ========== df.select(split(lit("a;b;c"), ";", 2)).show() ==========
    +------------------+
    |split(a;b;c, ;, 2)|
    +------------------+
    |          [a, b;c]|
    +------------------+

    ========== df.select(split(lit("a;b;c"), ";", 0)).show() ==========
    +------------------+
    |split(a;b;c, ;, 0)|
    +------------------+
    |         [a, b, c]|
    +------------------+

    ========== df.select(split(lit("a;b;c"), ";", -1)).show() ==========
    +-------------------+
    |split(a;b;c, ;, -1)|
    +-------------------+
    |          [a, b, c]|
    +-------------------+

substring/substring_index

    def substring(str: Column, pos: Int, len: Int): Column = withExpr {
      Substring(str.expr, lit(pos).expr, lit(len).expr)
    }

    def substring_index(str: Column, delim: String, count: Int): Column = withExpr {
      SubstringIndex(str.expr, lit(delim).expr, lit(count).expr)
    }

Usage

    ========== df.select(substring(lit("abcdef"), 2, 5)).show() ==========
    +-----------------------+
    |substring(abcdef, 2, 5)|
    +-----------------------+
    |                  bcdef|
    +-----------------------+

    ========== df.select(substring_index(lit("www.shockang.com"), ".", 2)).show() ==========
    +---------------------------------------+
    |substring_index(www.shockang.com, ., 2)|
    +---------------------------------------+
    |                           www.shockang|
    +---------------------------------------+

overlay

    def overlay(src: Column, replace: Column, pos: Column, len: Column): Column = withExpr {
      Overlay(src.expr, replace.expr, pos.expr, len.expr)
    }

    def overlay(src: Column, replace: Column, pos: Column): Column = withExpr {
      new Overlay(src.expr, replace.expr, pos.expr)
    }

Usage

    ========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show() ==========
    +--------------------------+
    |overlay(abcdef, abc, 4, 1)|
    +--------------------------+
    |                  abcabcef|
    +--------------------------+

    ========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show() ==========
    +---------------------------+
    |overlay(abcdef, abc, 4, -1)|
    +---------------------------+
    |                     abcabc|
    +---------------------------+

sentences

    def sentences(string: Column, language: Column, country: Column): Column = withExpr {
      Sentences(string.expr, language.expr, country.expr)
    }

    def sentences(string: Column): Column = withExpr { Sentences(string.expr) }

Usage

    ========== df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show() ==========
    +---------------------------------------------+
    |sentences(我们都有一个家,名字叫中国, zh, CN)|
    +---------------------------------------------+
    |               [[我们都有一个家, 名字叫中国]]|
    +---------------------------------------------+

    ========== df.select(sentences(lit("我们都有一个家,名字叫中国"))).show() ==========
    +-----------------------------------------+
    |sentences(我们都有一个家,名字叫中国, , )|
    +-----------------------------------------+
    |           [[我们都有一个家, 名字叫中国]]|
    +-----------------------------------------+

translate

    def translate(src: Column, matchingString: String, replaceString: String): Column = withExpr {
      StringTranslate(src.expr, lit(matchingString).expr, lit(replaceString).expr)
    }

Usage

    ========== df.select(translate(lit("abcdef"), "def", "123")).show() ==========
    +---------------------------+
    |translate(abcdef, def, 123)|
    +---------------------------+
    |                     abc123|
    +---------------------------+

trim

    def trim(e: Column): Column = withExpr { StringTrim(e.expr) }

    def trim(e: Column, trimString: String): Column = withExpr {
      StringTrim(e.expr, Literal(trimString))
    }

Usage

    ========== df.select(trim(lit("   abc   "))).show() ==========
    +---------------+
    |trim(   abc   )|
    +---------------+
    |            abc|
    +---------------+

    ========== df.select(trim(lit("aaabcaaaa"), "a")).show() ==========
    +---------------------------+
    |TRIM(BOTH a FROM aaabcaaaa)|
    +---------------------------+
    |                         bc|
    +---------------------------+

upper

    def upper(e: Column): Column = withExpr { Upper(e.expr) }

Usage

    ========== df.select(upper($"b")).show() ==========
    +--------+
    |upper(b)|
    +--------+
    |   AAABB|
    +--------+

Practice

Code

    package com.shockang.study.spark.sql.functions

    import com.shockang.study.spark.util.Utils.formatPrint
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object StringFunctionsExample {
      def main(args: Array[String]): Unit = {
        // Silence Spark's own logging so that only the example output is printed
        Logger.getLogger("org").setLevel(Level.OFF)
        val spark = SparkSession.builder().appName("StringFunctionsExample").master("local[*]").getOrCreate()

        import spark.implicits._

        // One-row sample DataFrame used by most of the examples below
        val df = Seq(("abc", "aaaBb", "")).toDF("a", "b", "c")

        // ascii
        formatPrint("""df.select(ascii($"a"), ascii($"b"), ascii($"c")).show()""")
        df.select(ascii($"a"), ascii($"b"), ascii($"c")).show()

        // base64
        formatPrint("""df.select(base64($"a"), base64($"b"), base64($"c")).show()""")
        df.select(base64($"a"), base64($"b"), base64($"c")).show()

        // concat_ws
        formatPrint("""df.select(concat_ws(";", $"a", $"b", $"c")).show()""")
        df.select(concat_ws(";", $"a", $"b", $"c")).show()

        // decode/encode
        formatPrint("""df.select(decode($"a", "utf-8")).show()""")
        df.select(decode($"a", "utf-8")).show()

        formatPrint("""df.select(encode($"a", "utf-8")).show()""")
        df.select(encode($"a", "utf-8")).show()

        // format_number/format_string
        formatPrint("""df.select(format_number(lit(5L), 4)).show()""")
        df.select(format_number(lit(5L), 4)).show()

        formatPrint("""df.select(format_number(lit(1.toByte), 4)).show()""")
        df.select(format_number(lit(1.toByte), 4)).show()

        formatPrint("""df.select(format_number(lit(2.toShort), 4)).show()""")
        df.select(format_number(lit(2.toShort), 4)).show()

        formatPrint("""df.select(format_number(lit(3.1322.toFloat), 4)).show()""")
        df.select(format_number(lit(3.1322.toFloat), 4)).show()

        formatPrint("""df.select(format_number(lit(4), 4)).show()""")
        df.select(format_number(lit(4), 4)).show()

        formatPrint("""df.select(format_number(lit(5L), 4)).show()""")
        df.select(format_number(lit(5L), 4)).show()

        formatPrint("""df.select(format_number(lit(6.48173), 4)).show()""")
        df.select(format_number(lit(6.48173), 4)).show()

        formatPrint("""df.select(format_number(lit(BigDecimal("7.128381")), 4)).show()""")
        df.select(format_number(lit(BigDecimal("7.128381")), 4)).show()

        formatPrint("""df.select(format_string("aa%d%s", lit(123), lit("cc"))).show()""")
        df.select(format_string("aa%d%s", lit(123), lit("cc"))).show()

        // initcap
        formatPrint("""df.select(initcap($"a"), initcap($"b"), initcap($"c")).show()""")
        df.select(initcap($"a"), initcap($"b"), initcap($"c")).show()

        // instr
        formatPrint("""df.select(instr($"b", "aa")).show()""")
        df.select(instr($"b", "aa")).show()

        // length
        formatPrint("""df.select(length($"a"), length($"b"), length($"c")).show()""")
        df.select(length($"a"), length($"b"), length($"c")).show()

        // lower
        formatPrint("""df.select(lower($"b")).show()""")
        df.select(lower($"b")).show()

        // levenshtein
        formatPrint("""df.select(levenshtein($"a", $"b")).show()""")
        df.select(levenshtein($"a", $"b")).show()

        // locate
        formatPrint("""df.select(locate("aa", $"b")).show()""")
        df.select(locate("aa", $"b")).show()

        formatPrint("""df.select(locate("aa", $"b", 2)).show()""")
        df.select(locate("aa", $"b", 2)).show()

        // lpad
        formatPrint("""df.select(lpad($"a", 10, " ")).show()""")
        df.select(lpad($"a", 10, " ")).show()

        // ltrim
        formatPrint("""df.select(ltrim(lit("   123"))).show()""")
        df.select(ltrim(lit("   123"))).show()

        formatPrint("""df.select(ltrim(lit("aaa123"), "a")).show()""")
        df.select(ltrim(lit("aaa123"), "a")).show()

        // regexp_extract/regexp_replace
        // The backslash must be doubled inside an ordinary Scala string literal
        formatPrint("""df.select(regexp_extract(lit("abc123"), "(\\d+)", 1)).show()""")
        df.select(regexp_extract(lit("abc123"), "(\\d+)", 1)).show()

        formatPrint("""df.select(regexp_replace(lit("abc123"), "(\\d+)", "num")).show()""")
        df.select(regexp_replace(lit("abc123"), "(\\d+)", "num")).show()

        formatPrint("""df.select(regexp_replace(lit("abc123"), lit("(\\d+)"), lit("num"))).show()""")
        df.select(regexp_replace(lit("abc123"), lit("(\\d+)"), lit("num"))).show()

        // unbase64
        formatPrint("""df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show()""")
        df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show()

        // rpad
        formatPrint("""df.select(rpad($"a", 10, " ")).show()""")
        df.select(rpad($"a", 10, " ")).show()

        // repeat
        formatPrint("""df.select(repeat($"a", 3)).show()""")
        df.select(repeat($"a", 3)).show()

        // rtrim
        formatPrint("""df.select(rtrim(lit("123   "))).show()""")
        df.select(rtrim(lit("123   "))).show()

        formatPrint("""df.select(rtrim(lit("123aaa"), "a")).show()""")
        df.select(rtrim(lit("123aaa"), "a")).show()

        // soundex
        formatPrint("""df.select(soundex($"a"), soundex($"b")).show()""")
        df.select(soundex($"a"), soundex($"b")).show()

        // split
        formatPrint("""df.select(split(lit("a;b;c"), ";")).show()""")
        df.select(split(lit("a;b;c"), ";")).show()

        formatPrint("""df.select(split(lit("a;b;c"), ";", 2)).show()""")
        df.select(split(lit("a;b;c"), ";", 2)).show()

        formatPrint("""df.select(split(lit("a;b;c"), ";", 0)).show()""")
        df.select(split(lit("a;b;c"), ";", 0)).show()

        formatPrint("""df.select(split(lit("a;b;c"), ";", -1)).show()""")
        df.select(split(lit("a;b;c"), ";", -1)).show()

        // substring/substring_index
        formatPrint("""df.select(substring(lit("abcdef"), 2, 5)).show()""")
        df.select(substring(lit("abcdef"), 2, 5)).show()

        formatPrint("""df.select(substring_index(lit("www.shockang.com"), ".", 2)).show()""")
        df.select(substring_index(lit("www.shockang.com"), ".", 2)).show()

        // overlay
        formatPrint("""df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show()""")
        df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show()

        formatPrint("""df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show()""")
        df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show()

        // sentences
        formatPrint("""df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show()""")
        df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show()

        formatPrint("""df.select(sentences(lit("我们都有一个家,名字叫中国"))).show()""")
        df.select(sentences(lit("我们都有一个家,名字叫中国"))).show()

        // translate
        formatPrint("""df.select(translate(lit("abcdef"), "def", "123")).show()""")
        df.select(translate(lit("abcdef"), "def", "123")).show()

        // trim
        formatPrint("""df.select(trim(lit("   abc   "))).show()""")
        df.select(trim(lit("   abc   "))).show()

        formatPrint("""df.select(trim(lit("aaabcaaaa"), "a")).show()""")
        df.select(trim(lit("aaabcaaaa"), "a")).show()

        // upper
        formatPrint("""df.select(upper($"b")).show()""")
        df.select(upper($"b")).show()
      }
    }

Output

Running the program prints, for each call and in the same order, the banner line and result table already listed under the corresponding function above.