Reading and saving data

1 Motivation

Spark supports a variety of input and output sources. Three common categories of data sources are:

  • File formats and file systems: for data stored in a local file system or a distributed file system (such as NFS, HDFS, or Amazon S3), Spark can access many different file formats, including text files, JSON, SequenceFile, and protocol buffers.
  • Structured Data Sources in Spark SQL: Spark provides a set of concise and efficient APIs to handle structured data sources including JSON and Apache Hive.
  • Databases and key-value stores: Spark’s own libraries and some third-party libraries can be used to connect to HBase, Elasticsearch, JDBC sources, and so on.

2 File formats

Spark makes it easy to read and save many file formats, from unstructured files such as plain text, to semi-structured files such as JSON, to structured files such as SequenceFile. Spark chooses the appropriate handling based on the file extension; this process is encapsulated and transparent to the user. The common formats are covered below:

(Table of common file formats omitted; the subsections below cover text files, JSON, CSV, and SequenceFile.)

2.1 Text files

When we read a text file as an RDD, each line of the input becomes an element of the RDD. It is also possible to read multiple complete text files at once as a pair RDD, where the key is the file name and the value is the file content. To read a text file, simply call the textFile() function on the SparkContext with the file path as an argument, as follows:

input = sc.textFile("file:///home/spark/README.md")

If multiple input files come in the form of a directory containing all the parts of the data, this can be handled in two ways. You can still use the textFile() function, passing the directory as the argument; it will read all the parts into the RDD. Sometimes it is necessary to know which file each part of the data came from (for example, time data where the key is in the file name), and sometimes it is desirable to process an entire file at once. If the files are small enough, you can use the SparkContext.wholeTextFiles() method, which returns a pair RDD where the keys are the file names of the input files, as in the sketch below.
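
A minimal sketch of this second approach using wholeTextFiles() (the directory path and the per-file computation are made up for illustration):

# read every file in the directory as (filename, content) pairs
files = sc.wholeTextFiles("file:///home/spark/salesFiles")

# compute the average of the numbers in each file, keyed by file name
def averageOfFile(content):
    nums = [float(x) for x in content.split()]
    return sum(nums) / len(nums) if nums else 0.0

averages = files.mapValues(averageOfFile)
print(averages.collect())
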
Outputting a text file is also fairly simple. The saveAsTextFile() method takes a path and writes the contents of the RDD to files under that path; Spark treats the path as a directory and outputs multiple files in it. In this way, Spark can write out in parallel from multiple nodes. With this method we cannot control which part of the data ends up in which file, although some output formats do allow such control. Saving a text file looks like this:

result.saveAsTextFile(outputFile)
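
Each partition of the RDD is written as its own part-NNNNN file inside that directory, so the number of output files can be reduced by repartitioning before saving (though which records land in which file is still up to Spark). A small sketch with a hypothetical output path:

# force everything into a single partition so the directory contains one part file
# (can be slow for large data, since all records pass through one task)
result.coalesce(1).saveAsTextFile("file:///home/spark/single-output")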

2.2 JSON

JSON is a widely used semi-structured data format, and the easiest way to read JSON data is to load it as a text file and then parse each line. In Python this looks as follows:

import json
data = input.map(lambda x: json.loads(x))
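
Writing JSON back out works the same way in reverse: serialize each record to a string and save it as a text file. A small sketch (the filter condition and the output path are hypothetical):

# keep only records that have a "name" field, turn them back into JSON text, and save
data.filter(lambda rec: "name" in rec) \
    .map(lambda rec: json.dumps(rec)) \
    .saveAsTextFile("file:///home/spark/json-output")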

2.3 CSV

In comma-separated values (CSV) files, each line has a fixed number of fields, separated by commas. As with JSON, there are many different libraries for CSV, but we will use only one per language; for Python, we will use the built-in csv library, as in the sketch below.
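
A minimal sketch of reading CSV with the csv library, parsing one record per line (the input path and field names are hypothetical):

import csv
from io import StringIO

def loadRecord(line):
    # parse a single CSV line into a dict keyed by the given field names
    reader = csv.DictReader(StringIO(line), fieldnames=["name", "favouriteAnimal"])
    return next(reader)

records = sc.textFile("file:///home/spark/favourites.csv").map(loadRecord)

Note that this line-by-line approach only works if no field contains embedded newlines; otherwise each file has to be read as a whole (for example with wholeTextFiles()) and parsed as a unit.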

2.4 SequenceFile

SequenceFile is a common Hadoop format consisting of flat files of key-value pairs with no further relational structure. Because Hadoop uses its own serialization framework, the elements of a SequenceFile are classes that implement Hadoop's Writable interface. The following shows reading a SequenceFile in Python:

data = sc.sequenceFile(inFile, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
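
Saving a SequenceFile from Python is symmetric: a pair RDD can be written with saveAsSequenceFile(), and PySpark converts the Python keys and values to Writable types. A small sketch with made-up data and a hypothetical output path:

# build a small pair RDD and write it out as a SequenceFile
pairs = sc.parallelize([("panda", 3), ("pink", 1), ("pirate", 2)])
pairs.saveAsSequenceFile("file:///home/spark/sequence-output")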

3 Structured Data in Spark SQL

Spark SQL is a component added in Spark 1.0 that has quickly become one of the more popular ways of working with structured and semi-structured data in Spark. In each case, we give Spark SQL a SQL query to run against a data source (selecting some fields or applying functions to them) and get back an RDD of Row objects, where each Row represents one record. In Python, the elements of a Row can be accessed as row[column_number] or row.column_name.

3.1 Apache Hive

Apache Hive is a common source of structured data on Hadoop. Hive can store tables in various formats, in HDFS or on other storage systems. To connect Spark SQL to an existing Hive installation, you need to provide Hive's configuration: copy the hive-site.xml file into Spark's ./conf/ directory. After that, create a HiveContext object, which is the entry point to Spark SQL; you can then use the Hive query language (HQL) to query your tables and get the results back as an RDD of rows. In Python, creating a HiveContext and querying data looks like this:

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)  # the entry point for Spark SQL with Hive support
rows = hiveCtx.sql("select name, age from users")
firstRow = rows.first()
print(firstRow.name)  # Row fields can be accessed by name

3.2 JSON

If you have JSON data whose records share a consistent structure, Spark SQL can infer the schema automatically and load the data as records, which makes extracting fields very simple. Use the HiveContext.jsonFile() method to get an RDD of Row objects from the whole file. Suppose the content of the JSON file looks as follows:
(The sample JSON shown in the original image is omitted; its records contain a nested user object with a name field and a top-level txt field.)
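
As a hypothetical stand-in for that sample, matching the fields used in the query below, each line of the file might look something like:

{"user": {"name": "Holden", "location": "San Francisco"}, "txt": "Nice day out today"}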

Reading the JSON file in Python then looks like this:

data = hiveCtx.jsonFile("file.json")  # infer the schema and load the records
data.registerTempTable("table_tmp")  # register the data so it can be queried with SQL
results = hiveCtx.sql("select user.name, txt from table_tmp")
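
The query again returns rows whose fields can be read by position or by name, as described earlier. A small usage sketch (assuming the table has at least one record):

firstResult = results.first()
print(firstResult[0], firstResult[1])  # the nested user.name and the txt field, by position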
