Spark Distributed SQL Engine in Practice

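I. Overview
Besides the interactive environment started by the spark-sql command, Spark SQL can also run as a distributed query engine through its JDBC/ODBC or command-line interface. In this mode, end users and applications interact with Spark SQL directly by issuing SQL queries, without writing any Scala code.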

II. Using the Thrift JDBC server

Spark version: 1.4.0

YARN version: CDH5.4.0

1. Preparation

Copy or symlink hive-site.xml into $SPARK_HOME/conf.
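The linking step can be sketched as a small shell function. The name link_hive_site is a hypothetical helper, not part of Spark, and the source path in the usage note is an assumption about where hive-site.xml lives on the node.

```shell
# link_hive_site SRC SPARK_HOME_DIR
# Symlink an existing hive-site.xml into the Spark conf directory so that
# Spark SQL reads the same Hive settings. (Hypothetical helper.)
link_hive_site() {
  src=$1
  spark_home=$2
  if [ ! -f "$src" ]; then
    echo "hive-site.xml not found: $src" >&2
    return 1
  fi
  mkdir -p "$spark_home/conf"
  # -s: create a symbolic link, -f: replace a stale link if one exists
  ln -sf "$src" "$spark_home/conf/hive-site.xml"
}
```

Usage, assuming the Hive configuration sits under /etc/hive/conf: link_hive_site /etc/hive/conf/hive-site.xml "$SPARK_HOME"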

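2. Start the Hive Thrift server with the script in the Spark installation directory. With no arguments it starts in local mode and occupies a single JVM process on the local machine.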

sbin/start-thriftserver.sh

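3. Start in yarn-client mode. In this environment the server listens on port 10001; Spark's own default is 10000 unless hive.server2.thrift.port (or the HIVE_SERVER2_THRIFT_PORT environment variable) overrides it.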

sbin/start-thriftserver.sh --master yarn
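To pin the port explicitly instead of relying on whatever hive-site.xml provides, the Spark documentation describes a --hiveconf option for this script. The sketch below only prints the command so it can be reviewed before running; thrift_start_cmd is a hypothetical helper name.

```shell
# thrift_start_cmd MASTER PORT
# Print (do not run) a start-thriftserver.sh command line for the given
# master and Thrift port. (Hypothetical helper.)
thrift_start_cmd() {
  master=$1
  port=$2
  printf 'sbin/start-thriftserver.sh --master %s --hiveconf hive.server2.thrift.port=%s\n' \
    "$master" "$port"
}
```

For example, thrift_start_cmd yarn 10001 prints: sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=10001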

Next, look at the YARN web UI: 25 containers have been started.

 

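Why does starting a single JDBC service take up so many resources? Because conf/spark-env.sh sets SPARK_EXECUTOR_INSTANCES to 24, and YARN allocates one more container for the application master. In yarn-client mode the driver itself runs inside the Thrift server process on the client machine, not in a container.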

export SPARK_EXECUTOR_INSTANCES=24
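The container count seen in the YARN UI follows from simple arithmetic: one container per executor plus one for the application master. A minimal sketch, with expected_containers as a hypothetical helper name:

```shell
# expected_containers EXECUTORS
# In yarn-client mode YARN allocates one container per executor plus one
# for the ApplicationMaster. (Hypothetical helper.)
expected_containers() {
  echo $(( $1 + 1 ))
}

expected_containers 24   # prints 25
```

Lowering SPARK_EXECUTOR_INSTANCES in conf/spark-env.sh before starting the server should shrink the footprint accordingly.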

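On each YARN NodeManager node, the Thrift server keeps a resident process named org.apache.spark.executor.CoarseGrainedExecutorBackend, ready to launch tasks for later SQL jobs. The benefit is that Spark SQL queries no longer pay the cost of starting containers. The price is that while the Thrift server is idle, these containers stay allocated and are not released to other Spark or MapReduce jobs.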
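One way to check for these resident executors on a NodeManager host is to count the matching JVMs in a process listing. The sketch below reads a listing from stdin; count_executor_jvms is a hypothetical helper name, and feeding it jps -l output assumes the JDK's jps tool is available and run as a user allowed to see the executor processes.

```shell
# count_executor_jvms
# Read a JVM listing (such as `jps -l` output) on stdin and print how many
# lines name the Spark executor backend class. (Hypothetical helper.)
count_executor_jvms() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the function's exit status clean in that case.
  grep -c 'org\.apache\.spark\.executor\.CoarseGrainedExecutorBackend' || true
}
```

Usage: jps -l | count_executor_jvms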

4. Connect to the Spark SQL interactive engine with beeline

bin/beeline -u jdbc:hive2://localhost:10001 -n root -p root
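The same connection can be used non-interactively through beeline's -e option, which appears in the beeline help listing. A minimal sketch for building the JDBC URL; hive2_url is a hypothetical helper name, and the host, port and credentials in the usage note simply repeat the values from this walkthrough.

```shell
# hive2_url HOST PORT [DATABASE]
# Print a HiveServer2-style JDBC URL as used by beeline's -u option.
# (Hypothetical helper.)
hive2_url() {
  host=$1
  port=$2
  db=${3:-}
  if [ -n "$db" ]; then
    echo "jdbc:hive2://$host:$port/$db"
  else
    echo "jdbc:hive2://$host:$port"
  fi
}
```

Usage: bin/beeline -u "$(hive2_url localhost 10001)" -n root -p root -e 'SHOW TABLES;'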

III. Command-line help

1. Thrift server

Running sbin/start-thriftserver.sh --help prints the complete list of options. The script accepts every bin/spark-submit command-line option, plus --hiveconf property=value for setting Hive properties.

2. beeline


  -u <database url>              the JDBC URL to connect to
  -n <username>                  the username to connect as
  -p <password>                  the password to connect as
  -d <driver class>              the driver class to use
  -e <query>                      query that should be executed
  -f <file>                      script file that should be executed
  --hiveconf property=value      Use value for given property
  --hivevar name=value            hive variable name and value
                                  This is Hive specific settings in which variables
                                  can be set at session level and referenced in Hive
                                  commands or queries.
  --color=[true/false]            control whether color is used for display
  --showHeader=[true/false]      show column names in query results
  --headerInterval=ROWS;          the interval between which heades are displayed
  --fastConnect=[true/false]      skip building table/column list for tab-completion
  --autoCommit=[true/false]      enable/disable automatic transaction commit
  --verbose=[true/false]          show verbose error messages and debug info
  --showWarnings=[true/false]    display connection warnings
  --showNestedErrs=[true/false]  display nested errors
  --numberFormat=[pattern]        format numbers using DecimalFormat pattern
  --force=[true/false]            continue running script even after errors
  --maxWidth=MAXWIDTH            the maximum width of the terminal
  --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
  --silent=[true/false]          be more silent
  --autosave=[true/false]        automatically save preferences
  --outputformat=[table/vertical/csv/tsv]  format mode for result display
  --isolation=LEVEL              set the transaction isolation level
  --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
  --help                          display this message

Reposted from www.linuxidc.com/Linux/2015-08/121143.htm