Spark SQL Syntax (Spark SQL Parser Principles)

http://www.itjxue.com  2023-03-18 08:47

Spark SQL (Part 10): Hive on Spark

Hive is the de facto SQL standard in the big data world. Under the hood it runs on MapReduce by default, but because MapReduce is slow, a series of newer SQL query engines has appeared in recent years, including Spark SQL, Hive on Tez, and Hive on Spark.

Spark SQL and Hive on Spark are not the same thing. Spark SQL is a query engine that Spark developed itself on top of the Spark execution engine; it can run queries against many data sources, including Hive, JSON, Parquet, JDBC, and RDDs. It is therefore a Spark project that merely provides, among other things, the ability to query Hive, and it is a good fit for big data applications already built on the Spark stack.

Hive on Spark, by contrast, is a Hive project: it plugs Spark in as Hive's underlying query engine, so that MapReduce is no longer the only option. Hive on Spark applies only to Hive, and in the foreseeable future Hive's default engine may well switch from MapReduce to Spark. It suits organizations that want to swap the engine underneath an existing Hive data warehouse and its statistical analysis jobs for Spark, making it a company-wide engine for big data analytics.

Hive on Spark includes several optimizations:

1. Map Join

By default, Spark SQL supports joins that use the broadcast mechanism to ship a small table to every node, where the join is then performed. The problem is that this puts heavy memory pressure on the Driver and the Workers, because the broadcast data must be kept in Driver memory throughout. Hive on Spark therefore currently takes an approach similar to MapReduce's Distributed Cache instead: it raises the HDFS replication factor for the small table so that every compute node holds a copy and can read the data locally.
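For comparison, plain Spark SQL lets you request the broadcast behavior explicitly through a join hint. A minimal sketch, assuming a large fact_orders table and a small dim_stores table (both names are hypothetical):

-- Hint Spark SQL to broadcast the small dimension table to every executor
SELECT /*+ BROADCAST(d) */ f.order_id, d.region
FROM fact_orders f
JOIN dim_stores d
  ON f.store_id = d.store_id;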

2. Cache Table

For scenarios where one table is operated on many times, Hive on Spark optimizes internally by caching that table in memory so the repeated operations run faster. Note, though, that the caching is not applied automatically in every case, so in this respect Hive on Spark is still rough around the edges.
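In Spark SQL itself, the same effect can be requested explicitly. A minimal sketch, assuming a hypothetical user_events table that two queries scan:

-- Pin the table in memory, run the repeated scans, then release it
CACHE TABLE user_events;
SELECT count(*) FROM user_events WHERE event_type = 'click';
SELECT count(DISTINCT user_id) FROM user_events;
UNCACHE TABLE user_events;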

A Hive QL statement is compiled and executed through the following stages:

Hive QL statement →

syntax analysis → AST →

logical plan generation → Operator Tree →

logical plan optimization → Optimized Operator Tree →

physical plan generation → Task Tree →

physical plan optimization → Optimized Task Tree →

execution of the Optimized Task Tree

Spark SQL syntax reference

add_months

add_months(start_date, num_months) - Returns the date that is num_months after start_date.

Examples:
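> SELECT add_months('2016-08-31', 1);
 2016-09-30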

cast

cast(expr AS type) - Casts the value expr to the target data type type.

Examples:
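> SELECT cast('10' AS int);
 10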

concat

concat(col1, col2, ..., colN) - Returns the concatenation of col1, col2, ..., colN.

Examples:
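> SELECT concat('Spark', 'SQL');
 SparkSQL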

concat_ws

concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep.

Examples:
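> SELECT concat_ws(' ', 'Spark', 'SQL');
 Spark SQL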

current_date

current_date() - Returns the current date at the start of query evaluation.

current_timestamp

current_timestamp() - Returns the current timestamp at the start of query evaluation.
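For example (the values shown are illustrative; both reflect the moment the query is evaluated):

> SELECT current_date(), current_timestamp();
 2023-03-18  2023-03-18 08:47:21.35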

date

date(expr) - Casts the value expr to the target data type date.
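For example:

> SELECT date('2009-07-30');
 2009-07-30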

date_add

date_add(start_date, num_days) - Returns the date that is num_days after start_date.

Examples:
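> SELECT date_add('2016-07-30', 1);
 2016-07-31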

date_format

date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.

Examples:
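> SELECT date_format('2016-04-08', 'y');
 2016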

date_sub

date_sub(start_date, num_days) - Returns the date that is num_days before start_date.

Examples:
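> SELECT date_sub('2016-07-30', 1);
 2016-07-29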

datediff

datediff(endDate, startDate) - Returns the number of days from startDate to endDate.

Examples:
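> SELECT datediff('2009-07-31', '2009-07-30');
 1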

day

day(date) - Returns the day of month of the date/timestamp.

Examples:
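> SELECT day('2009-07-30');
 30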

dayofmonth

dayofmonth(date) - Returns the day of month of the date/timestamp.

Examples:
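> SELECT dayofmonth('2009-07-30');
 30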

dayofweek

dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday).

Examples:
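> SELECT dayofweek('2009-07-30');
 5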

dayofyear

dayofyear(date) - Returns the day of year of the date/timestamp.

Examples:
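> SELECT dayofyear('2016-04-09');
 100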

explode

explode(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr into multiple rows and columns.

Examples:
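> SELECT explode(array(10, 20));
 10
 20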

from_unixtime

from_unixtime(unix_time, format) - Returns unix_time in the specified format.

Examples:
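> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
 1970-01-01 00:00:00

(the rendered time depends on the session time zone; this output assumes UTC)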

if

if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3.

Examples:
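> SELECT if(1 < 2, 'a', 'b');
 a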

left

left(str, len) - Returns the leftmost len characters from the string str (len can be string type). If len is less than or equal to 0, the result is an empty string.

Examples:
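> SELECT left('Spark SQL', 3);
 Spa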

length

length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.

Examples:
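> SELECT length('Spark SQL ');
 10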

map

map(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.

Examples:
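> SELECT map(1.0, '2', 3.0, '4');
 {1.0:"2",3.0:"4"}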

nvl

nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:
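> SELECT nvl(NULL, array('2'));
 ["2"]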

nvl2

nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.

Examples:
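> SELECT nvl2(NULL, 2, 1);
 1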

rank

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
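For example, note the gap in the sequence after the tie (the inline relation tab(a) is a hypothetical stand-in for a real table):

> SELECT a, rank() OVER (ORDER BY a) AS rnk FROM VALUES (1), (2), (2), (3) AS tab(a);
 1  1
 2  2
 2  2
 3  4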

regexp_extract

regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp.

Examples:
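> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
 100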

regexp_replace

regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep.

Examples:
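> SELECT regexp_replace('100-200', '(\\d+)', 'num');
 num-num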

replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

str - a string expression

search - a string expression. If search is not found in str, str is returned unchanged.

replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

Examples:
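> SELECT replace('ABCabc', 'abc', 'DEF');
 ABCDEF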

row_number

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
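For example, contrast with rank(): tied values still receive distinct, consecutive numbers (the inline relation tab(a) is a hypothetical stand-in for a real table):

> SELECT a, row_number() OVER (ORDER BY a) AS rn FROM VALUES (1), (2), (2), (3) AS tab(a);
 1  1
 2  2
 2  3
 3  4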

split

split(str, regex) - Splits str around occurrences that match regex.

Examples:
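> SELECT split('oneAtwoBthreeC', '[ABC]');
 ["one","two","three",""]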

substr

substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

Examples:
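> SELECT substr('Spark SQL', 5);
 k SQL
> SELECT substr('Spark SQL', 5, 1);
 k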

to_date

to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.

Examples:
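> SELECT to_date('2009-07-30 04:17:52');
 2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
 2016-12-31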

to_timestamp

to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the fmt is omitted.

Examples:
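> SELECT to_timestamp('2016-12-31 00:12:00');
 2016-12-31 00:12:00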

unix_timestamp

unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.

Examples:
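> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
 1460041200

(the value depends on the session time zone)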

weekday

weekday(date) - Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, ..., 6 = Sunday).

Examples:
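> SELECT weekday('2009-07-30');
 3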

weekofyear

weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with more than 3 days.

Examples:
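> SELECT weekofyear('2008-02-20');
 8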

What exactly is the relationship between Spark SQL and Hive?

Hive is a data warehouse built on HDFS. It provides a SQL-based query engine for running distributed, interactive queries against the big data stored in that warehouse.

Spark SQL cannot fully replace Hive; what it replaces is Hive's query engine. Because Spark SQL is built on Spark's in-memory execution model, it is several times faster than Hive's query engine. But Spark itself provides no storage, so Spark SQL cannot take over Hive's role as a data warehouse.

The other advantage of Spark SQL over Hive is its support for a large number of different data sources, including Hive, JSON, Parquet, JDBC, and more. And because Spark SQL sits inside the Spark stack and works on RDDs, it integrates seamlessly with the other Spark components to build complex functionality. For example, Spark SQL can run SQL statements directly against files in HDFS, as sketched below.
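A minimal sketch of querying an HDFS file in place; the Parquet path and column names are hypothetical:

-- Query a Parquet file on HDFS directly, without registering a table first
SELECT user_id, event_type
FROM parquet.`hdfs://namenode:8020/data/events.parquet`
WHERE event_type = 'click';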
