Spark series - learning SparkSQL programming from scratch (Part 2)

    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.32</version>
</dependency>
<!-- Spark SQL dependency -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
</dependencies>

6. Operating a DataFrame in code

1. The DataFrame is the core API of SparkSQL; here it is built from data read through a SparkContext. First create a SparkSession and obtain the SparkContext from it:

    // 1. Create a SparkSession, specifying the appName and the master to which the task is submitted
    val spark: SparkSession = SparkSession.builder()
      .appName("CaseClassSchema")
      .master("local[2]")
      .getOrCreate()
    // 2. Get the SparkContext; the following SparkSQL operations all need this context
    val sc: SparkContext = spark.sparkContext

In the code above, master specifies the execution environment of SparkSQL, which can be a cluster or local. Here local[2] selects local standalone mode with two threads to execute the tasks; note that "local" must be lowercase. SparkSession is the upgraded entry point: it unifies the functionality previously provided by SQLContext and HiveContext.

2. With the SparkContext we can load the data and convert it into the corresponding DataFrame:

    // 3. Read each line into an RDD, then convert the RDD into a DF through the schema
    //    of the case class People(name: String, age: Int)
    val lineRdd: RDD[Array[String]] = sc.textFile("hdfs://node01:8020/spark_res/people.txt").map(_.split(","))
    val peopleRdd: RDD[People] = lineRdd.map(x => People(x(0), x(1).toInt))
    import spark.implicits._
    val peopleDF: DataFrame = peopleRdd.toDF
    // 4. Operate on the DF
    peopleDF.printSchema()
    peopleDF.show()
    println(peopleDF.head())
    println(peopleDF.count())
    peopleDF.columns.foreach(println)

To use toDF, the implicit conversions must be imported with import spark.implicits._; otherwise the toDF method is not available on the RDD.

3. A DataFrame can be queried in two styles, DSL and SQL:

    // DSL
    peopleDF.select("name", "age").show()
    peopleDF.filter($"age" > 20).groupBy("name").count().show()
    // SQL
    peopleDF.createOrReplaceTempView("t_people")
    spark.sql("select * from t_people order by age desc").show()

4. After the SQL operations are complete, the SparkContext and SparkSession must be closed, with sc.stop() and spark.stop().

7. Saving results to MySQL

The tail of the save-to-MySQL example sets the connection properties and writes the result DataFrame out over JDBC:

    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    resultDF.write.jdbc("jdbc:mysql://192.168.52.105:3306/iplocation", "spark_save_result", properties)
    // close the SparkContext and SparkSession

Note that resultDF.write returns a DataFrameWriter.

1. In this way the result can be saved from SparkSQL, and thanks to the convenience of the API it can be stored in multiple formats, such as text, json, orc, csv, jdbc and so on.
2. For saving data, the system offers several save modes, which can be specified with mode(String):
overwrite: rewrite the data in the file
append: add the content to the end of the file
ignore: if the file already exists, ignore the operation
error: the default option; if the file already exists, throw an exception
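As a quick sketch of how the formats and save modes listed above combine on the DataFrameWriter, the snippet below writes a result DataFrame in a few different ways. This is only an illustration under assumptions: the output paths and the extra table name are placeholders, not values from the article, and resultDF stands for the DataFrame computed earlier.

    import java.util.Properties
    import org.apache.spark.sql.{DataFrame, SaveMode}

    // minimal sketch: resultDF stands for the DataFrame computed earlier in the example
    def saveResult(resultDF: DataFrame): Unit = {
      // save as json, overwriting any existing output (mode accepts a string or a SaveMode constant)
      resultDF.write.mode("overwrite").json("hdfs://node01:8020/spark_res/result_json")
      // save as csv, appending to any existing output
      resultDF.write.mode(SaveMode.Append).csv("hdfs://node01:8020/spark_res/result_csv")
      // save over jdbc, skipping the write if the target table already exists
      val properties = new Properties()
      properties.setProperty("user", "root")
      properties.setProperty("password", "123456")
      resultDF.write.mode(SaveMode.Ignore)
        .jdbc("jdbc:mysql://192.168.52.105:3306/iplocation", "spark_save_result_copy", properties)
    }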
8. Conclusion

1. In this SparkSQL series we first introduced the core SparkSQL API, the DataFrame. Internally, a DataFrame consists of a distributed dataset based on an RDD plus schema meta-information. Before execution, SQL code written against a DataFrame is optimized by Catalyst into efficient processing code. We then showed how to operate on a DataFrame from both clients, the spark-shell window and the code-level API.
2. There are two ways to create a DataFrame: 1. convert an RDD directly into a DataFrame with rdd.toDF; 2. read various data formats directly with spark.read.
3. There are two ways to inspect a DataFrame: 1. view the data structure with df.printSchema; 2. view the data content with df.show.
4. A DataFrame provides two styles for manipulating data, DSL and SQL. For the DSL style there are common methods such as select(), filter() and so on.
5. The second half of this article described how SparkSQL interacts with MySQL; in addition, SparkSQL also supports interacting with Parquet, ORC, JSON, Hive, JDBC and Avro data.
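As a small illustration of point 2 above, a DataFrame can also be created directly with spark.read rather than through rdd.toDF. This is a minimal sketch under assumptions: the file paths and app name below are placeholders, not files from the article.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // minimal sketch: creating DataFrames directly from files with spark.read
    val spark: SparkSession = SparkSession.builder()
      .appName("SparkReadExamples")
      .master("local[2]")
      .getOrCreate()

    val jsonDF: DataFrame = spark.read.json("hdfs://node01:8020/spark_res/people.json")
    val parquetDF: DataFrame = spark.read.parquet("hdfs://node01:8020/spark_res/people.parquet")
    val csvDF: DataFrame = spark.read.option("header", "true").csv("hdfs://node01:8020/spark_res/people.csv")

    jsonDF.printSchema()
    jsonDF.show()

    spark.stop()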


This article comes from the official WeChat account of the itheima (黑马程序员) Guangzhou Center (itheimagz); follow the account for more resources.





Origin blog.51cto.com/14500648/2430115