1. Reading various data sources with pyspark
Various data sources can be read through the methods of a pyspark.sql.DataFrameReader object.
First, create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("mysqlusername", "alarm") \
    .getOrCreate()
A SparkSession's read property returns a pyspark.sql.DataFrameReader.
1) Reading from a MySQL database
df = spark.read.jdbc(
    url='jdbc:mysql://192.168.88.60:3306/alarm',
    table='test',
    properties={'user': 'alarm', 'password': '123456'}
)
2) Reading JSON, CSV, and text files
df = spark.read.csv('school.csv', header=True)
df = spark.read.json('test0307_t.json')
df = spark.read.text('python/test_support/sql/text-test.txt')
2. SparkSession's createDataFrame method
A DataFrame can be created from an RDD, a list, or a pandas.DataFrame.
sc = spark.sparkContext
l = [('Alice', 1)]
rdd = sc.parallelize(l)
spark.createDataFrame(rdd).collect()

d = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(d).collect()
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])

import pandas as pd
pdf = pd.DataFrame([[1], [2]])
sparkdf = spark.createDataFrame(pdf)