sparksql语法(sparksql语法分析器 原理)
Spark SQL(十):Hive On Spark
Hive是目前大数据领域,事实上的SQL标准。其底层默认是基于MapReduce实现的,但是由于MapReduce速度实在比较慢,因此这几年,陆续出来了新的SQL查询引擎,包括Spark SQL,Hive On Tez,Hive On Spark等。
Spark SQL与Hive On Spark是不一样的。Spark SQL是Spark自己研发出来的针对各种数据源,包括Hive、JSON、Parquet、JDBC、RDD等都可以执行查询的,一套基于Spark计算引擎的查询引擎。因此它是Spark的一个项目,只不过提供了针对Hive执行查询的工功能而已,适合在一些使用Spark技术栈的大数据应用类系统中使用。
而Hive On Spark,是Hive的一个项目,它是将Spark作为底层的查询引擎(不通过MapReduce作为唯一的查询引擎)。Hive On Spark,只适用于Hive,在可预见的未来,很有可能Hive默认的底层引擎就从MapReduce切换为Spark了;适合于将原有的Hive数据仓库以及数据统计分析替换为Spark引擎,作为全公司通用的大数据统计分析引擎。
Hive On Spark做了一些优化:
1、Map Join
Spark SQL默认对join是支持使用broadcast机制将小表广播到各个节点上,以进行join的。但是问题是,这会给Driver和Worker带来很大的内存开销。因为广播的数据要一直保留在Driver内存中。所以目前采取的是,类似乎MapReduce的Distributed Cache机制,即提高HDFS replica factor的复制因子,以让数据在每个计算节点上都有一个备份,从而可以在本地进行数据读取。
2、Cache Table
对于某些需要对一张表执行多次操作的场景,Hive On Spark内部做了优化,即将要多次操作的表cache到内存中,以便于提升性能。但是这里要注意,并不是对所有的情况都会自动进行cache。所以说,Hive On Spark还有很多不完善的地方。
Hive QL语句 =
语法分析 = AST =
生成逻辑执行计划 = Operator Tree =
优化逻辑执行计划 = Optimized Operator Tree =
生成物理执行计划 = Task Tree =
优化物理执行计划 = Optimized Task Tree =
执行优化后的Optimized Task Tree
spark sql语法整理
add_months
add_months(start_date, num_months) - Returns the date that is num_months after start_date.
Examples:
cast
cast(expr AS type) - Casts the value expr to the target data type type.
Examples:
concat
concat(col1, col2, ..., colN) - Returns the concatenation of col1, col2, ..., colN.
Examples:
concat_ws
concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep.
Examples:
current_date
current_date() - Returns the current date at the start of query evaluation.
current_timestamp
current_timestamp() - Returns the current timestamp at the start of query evaluation.
date
date(expr) - Casts the value expr to the target data type date.
date_add
date_add(start_date, num_days) - Returns the date that is num_days after start_date.
Examples:
date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.
Examples:
date_sub
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
Examples:
datediff
datediff(endDate, startDate) - Returns the number of days from startDate to endDate.
Examples:
day
day(date) - Returns the day of month of the date/timestamp.
Examples:
dayofmonth
dayofmonth(date) - Returns the day of month of the date/timestamp.
Examples:
dayofweek
dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday).
Examples:
dayofyear
dayofyear(date) - Returns the day of year of the date/timestamp.
Examples:
explode
explode(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr into multiple rows and columns.
Examples:
from_unixtime
from_unixtime(unix_time, format) - Returns unix_time in the specified format.
Examples:
if
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.
Examples:
left
left(str, len) - Returns the leftmost len(len can be string type) characters from the string str,if len is less or equal than 0 the result is an empty string.
Examples:
length
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
map
map(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.
Examples:
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
nvl2
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Examples:
rank
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
regexp_extract
regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp.
Examples:
regexp_replace
regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.
Examples:
replace
replace(str, search[, replace]) - Replaces all occurrences of search with replace.
Arguments:
str - a string expression
search - a string expression. If search is not found in str, str is returned unchanged.
replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.
Examples:
row_number
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
split
split(str, regex) - Splits str around occurrences that match regex.
Examples:
substr
substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
Examples:
to_date
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.
Examples:
to_timestamp
to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the fmt is omitted.
Examples:
unix_timestamp
unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.
Examples:
weekday
weekday(date) - Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, ..., 6 = Sunday).
Examples:
weekofyear
weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with 3 days.
Examples:
spark SQL和hive到底什么关系
Hive是一种基于HDFS的数据仓库,并且提供了基于SQL模型的,针对存储了大数据的数据仓库,进行分布式交互查询的查询引擎。
SparkSQL并不能完全替代Hive,它替代的是Hive的查询引擎,SparkSQL由于其底层基于Spark自身的基于内存的特点,因此速度是Hive查询引擎的数倍以上,Spark本身是不提供存储的,所以不可能替代Hive作为数据仓库的这个功能。
SparkSQL相较于Hive的另外一个优点,是支持大量不同的数据源,包括hive、json、parquet、jdbc等等。SparkSQL由于身处Spark技术堆栈内,基于RDD来工作,因此可以与Spark的其他组件无缝整合使用,配合起来实现许多复杂的功能。比如SparkSQL支持可以直接针对hdfs文件执行sql语句。