pyspark instructions

PySpark 

PySpark is Spark's Python API. The pyspark script it provides to developers is located in the $SPARK_HOME/bin directory, and it is very easy to use: run pyspark to enter the shell and it is ready for use.

Sub-modules

pyspark.sql module

pyspark.streaming module

pyspark.ml package

pyspark.mllib package

Classes provided by PySpark

pyspark.SparkConf

The pyspark.SparkConf class provides methods for configuring a Spark application; the various Spark parameters are set as key-value pairs.
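
As a minimal sketch (the application name, master URL, and memory setting below are illustrative placeholders), configuration values are set as key-value pairs on a SparkConf object:

from pyspark import SparkConf

# Build a configuration; the app name and master URL are placeholders
conf = SparkConf().setAppName("demo-app").setMaster("local[2]")
conf.set("spark.executor.memory", "1g")  # set a parameter as a key-value pair

print(conf.toDebugString())  # print all configured key-value pairs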

pyspark.SparkContext

The pyspark.SparkContext class provides the main entry point of a Spark application and represents a connection to a Spark cluster; through this connection the application can create RDDs and broadcast variables (pyspark.Broadcast) on the cluster.
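
A minimal sketch of this entry point in a standalone script (inside the pyspark shell the context is already provided as sc, so this is not needed there); the app name is a placeholder:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo-app")
sc = SparkContext(conf=conf)            # connection to the Spark cluster

rdd = sc.parallelize([1, 2, 3, 4])      # create an RDD on the cluster
factor = sc.broadcast(10)               # create a broadcast variable (pyspark.Broadcast)
print(rdd.map(lambda x: x * factor.value).collect())

sc.stop()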

pyspark.SparkFiles

SparkFiles contains only class methods; developers should not create instances of the SparkFiles class.
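
A small sketch of this class-method style, assuming the pyspark shell (sc already exists) and an illustrative file path:

from pyspark import SparkFiles

sc.addFile("/tmp/example.txt")           # illustrative path; distributes the file to the workers

# Only class methods are called; no SparkFiles instance is created
print(SparkFiles.get("example.txt"))     # absolute path of the file on the local node
print(SparkFiles.getRootDirectory())     # root directory holding files added via addFile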

pyspark.RDD

This class provides the basic methods for operating on RDDs in PySpark.

first() - provided by pyspark.RDD, returns the first element of the RDD.

aggregate() - aggregates the elements of each partition, and then the results of all partitions, using the given combine functions and a neutral "zero value".

cache() - persists this RDD with the default storage level (MEMORY_ONLY).

collect() - returns a list containing all of the elements in this RDD.
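
Assuming the pyspark shell (sc already exists), a short sketch of the four methods listed above:

rdd = sc.parallelize([1, 2, 3, 4, 5], 2)

print(rdd.first())      # 1 - the first element of the RDD
# aggregate: zero value 0, sum elements within each partition, then sum the partition results
print(rdd.aggregate(0, lambda acc, x: acc + x, lambda a, b: a + b))  # 15
rdd.cache()             # persist with the default storage level (MEMORY_ONLY)
print(rdd.collect())    # [1, 2, 3, 4, 5]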

pyspark.Accumulator

A kind of "only allows you to add," the shared variable, the Spark  task can only add to its value.

pyspark.Broadcast

Spark offers two kinds of shared variables: broadcast variables and accumulators. The pyspark.Broadcast class provides methods for operating on broadcast variables.

pyspark.Accumulator

pyspark.Accumulator provides methods for operating on accumulator variables.

An accumulator is a variable that is only "added to" through an associative operation, and it can therefore be efficiently supported in parallel.
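
A minimal sketch of the two shared variables, again assuming the pyspark shell (sc already exists):

# Broadcast variable: read-only lookup shared with the workers
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: tasks can only add to it; only the driver reads its value
counter = sc.accumulator(0)

def handle(x):
    counter.add(1)                   # add-only from within a task
    return lookup.value.get(x, 0)    # read the broadcast value

print(sc.parallelize(["a", "b", "c"]).map(handle).collect())  # [1, 2, 0]
print(counter.value)                 # 3, read back on the driver after the action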

Testing process

Start the pyspark shell as the hdfs user:

/usr/hdp/2.6.0.3-8/spark/bin/pyspark

Reference example: http://spark.apache.org/docs/1.6.3/mllib-statistics.html

1. Basic Statistics

Summary statistics

colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, number of nonzeros, and the total count.

This example follows the Spark 2.2.0 documentation; the 1.6.3 documentation is incorrect here. The whole block below can be copied as-is.

import numpy as np
 
from pyspark.mllib.stat import Statistics
 
mat = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([3.0, 30.0, 300.0])]
)  # an RDD of Vectors
 
# Compute column summary statistics.
summary = Statistics.colStats(mat)
print(summary.mean())  # a dense vector containing the mean value for each column
print(summary.variance())  # column-wise variance
print(summary.numNonzeros())  # number of nonzeros in each column

2, Correlation Analysis (Correlations)

Calculating the correlation between two series of data is a common operation in statistics. spark.mllib provides the flexibility to calculate pairwise correlations among many series. The supported correlation methods are currently Pearson's and Spearman's correlation.

Statistics provides methods to calculate correlations between series. Depending on the type of input, two RDD[Double]s or one RDD[Vector], the output is a Double or the correlation Matrix, respectively.

This example follows the Spark 2.2.0 documentation; the 1.6.3 documentation is incomplete here. The whole block below can be copied as-is.

import numpy as np
from pyspark.mllib.stat import Statistics
 
seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])  # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])
 
# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))
 
data = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([5.0, 33.0, 366.0])]
)  # an RDD of Vectors
 
# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))

3, stratified sampling (Stratified sampling)

Unlike the other statistical functions in spark.mllib, sampleByKey and sampleByKeyExact can perform stratified sampling on key-value pair RDDs. For stratified sampling, the keys can be thought of as labels and the values as specific attributes. For example, the key can be man or woman, or a document ID, and the respective values can be the list of ages of the people or the list of words in the document. The sampleByKey method works roughly like a coin flip to decide whether an observation is sampled; it requires only one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but it provides the exact sample size with 99.99% confidence. sampleByKeyExact is currently not supported in Python.

data = sc.parallelize([(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')])
fractions = {1: 0.1, 2: 0.6, 3: 0.3}
approxSample = data.sampleByKey(False, fractions)
for each in approxSample.collect(): print(each)  # press Enter twice here in the pyspark shell

4, hypothesis testing (Hypothesis testing)

Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically significant or whether it occurred by chance. spark.mllib currently supports Pearson's chi-squared (χ²) tests for goodness of fit and for independence. The input data type determines which test is conducted: the goodness-of-fit test requires an input of type Vector, whereas the independence test requires a Matrix as input.

Statistics provides methods to run Pearson's chi-squared tests. The following example demonstrates how to run and interpret the hypothesis tests.

from pyspark.mllib.linalg import Matrices, Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics
 
vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)  # a vector composed of the frequencies of events
 
# compute the goodness of fit. If a second vector to test against
# is not supplied as a parameter, the test runs against a uniform distribution.
goodnessOfFitTestResult = Statistics.chiSqTest(vec)
 
# summary of the test including the p-value, degrees of freedom,
# test statistic, the method used, and the null hypothesis.
print("%s\n" % goodnessOfFitTestResult)
 
mat = Matrices.dense(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])  # a contingency matrix
 
# conduct Pearson's independence test on the input contingency matrix
independenceTestResult = Statistics.chiSqTest(mat)
 
# summary of the test including the p-value, degrees of freedom,
# test statistic, the method used, and the null hypothesis.
print("%s\n" % independenceTestResult)
 
obs = sc.parallelize(
    [LabeledPoint(1.0, [1.0, 0.0, 3.0]),
     LabeledPoint(1.0, [1.0, 2.0, 0.0]),
     LabeledPoint(1.0, [-1.0, 0.0, -0.5])]
)  # LabeledPoint(label, features)
 
# The contingency table is constructed from an RDD of LabeledPoint and used to conduct
# the independence test. Returns an array containing the ChiSquaredTestResult for every feature
# against the label.
featureTestResults = Statistics.chiSqTest(obs)
 
for i, result in enumerate(featureTestResults):
    print("Column %d:\n%s" % (i + 1, result))  //此处需敲2次回车

In addition, spark.mllib provides a one-sample, two-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions. By providing the name of a theoretical distribution (currently only the normal distribution is supported) and its parameters, or a function to compute the cumulative distribution according to a given theoretical distribution, the user can test the null hypothesis that the sample is drawn from that distribution. If the user tests against the normal distribution (distName = "norm") without providing the distribution parameters, the test is initialized to the standard normal distribution and an appropriate message is logged.

Statistics provides methods to run a one-sample, two-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis test.

from pyspark.mllib.stat import Statistics
 
parallelData = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
 
# run a KS test for the sample versus a standard normal distribution
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
# summary of the test including the p-value, test statistic, and null hypothesis
# if our p-value indicates significance, we can reject the null hypothesis
# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
# a lambda to calculate the CDF is not made available in the Python API
print(testResult)

5, Streaming Significance Testing

There is no Python example for this in Spark 2.2.0.

6, Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing. spark.mllib supports generating random RDDs with i.i.d. values drawn from a given distribution: uniform, standard normal, or Poisson.

RandomRDDs provides factory methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), which can then be mapped to N(1, 4).

The example on the official site is incomplete; this sample comes from https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/random_rdd_generation.py

from pyspark.mllib.random import RandomRDDs

numExamples = 10000  # number of examples to generate
fraction = 0.1  # fraction of data to sample

# Example: RandomRDDs.normalRDD
normalRDD = RandomRDDs.normalRDD(sc, numExamples)
print('Generated RDD of %d examples sampled from the standard normal distribution' % normalRDD.count())
print('  First 5 samples:')
for sample in normalRDD.take(5): print('    ' + str(sample))  # press Enter twice here in the pyspark shell
print()

# Example: RandomRDDs.normalVectorRDD
normalVectorRDD = RandomRDDs.normalVectorRDD(sc, numRows=numExamples, numCols=2)
print('Generated RDD of %d examples of length-2 vectors.' % normalVectorRDD.count())
print('  First 5 samples:')
for sample in normalVectorRDD.take(5): print('    ' + str(sample))  # press Enter twice here in the pyspark shell
print()

7, Kernel density estimation

Kernel density estimation is a technique for visualizing empirical probability distributions without requiring assumptions about the particular distribution the observed samples are drawn from. It computes an estimate of the probability density function of a random variable, evaluated at a given set of points. It achieves this by expressing the PDF of the empirical distribution at a particular point as the mean of the PDFs of normal distributions centered around each of the samples.

KernelDensity provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so.

The example on the official site is incomplete; this sample comes from https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/kernel_density_estimation_example.py

from pyspark.mllib.stat import KernelDensity
 
# an RDD of sample data
data = sc.parallelize([1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0])
 
# Construct the density estimator with the sample data and a standard deviation for the Gaussian
# kernels
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)
 
# Find density estimates for the given values
densities = kd.estimate([-1.0, 2.0, 5.0])
print(densities)

8, Classification and Regression

The spark.mllib package supports various methods for binary classification, multiclass classification, and regression analysis.

The supported algorithms for each type of problem are listed in a table in the reference below.

More details on these methods can be found at http://spark.apache.org/docs/1.6.3/mllib-linear-methods.html

a) Classification

The goal of classification is to divide data items into different categories. Binary classification, where there are only two categories (positive and negative), is the most common type of classification. If there are more than two categories, it is multiclass classification. spark.mllib provides two linear methods for classification: linear Support Vector Machines (SVMs) and logistic regression. Linear SVMs support only binary classification, while logistic regression supports both binary and multiclass classification. For both methods, spark.mllib provides variants with L1 and L2 regularization. In MLlib, the training data set is represented by an RDD of LabeledPoint, where the labels are class indices starting from zero: 0, 1, 2, ...

Linear Support Vector Machines (SVMs)

Prerequisite: upload the file sample_svm_data.txt to HDFS under /user/hdfs/data/mllib/

The following example shows how to load the sample data set, build an SVM model, and use the model to make predictions and compute the training error.

Example source: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/svm_with_sgd_example.py

from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
 
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
 
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
 
# Build the model
model = SVMWithSGD.train(parsedData, iterations=100)
 
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
 
# Save and load model
model.save(sc, "target/tmp/pythonSVMWithSGDModel")
sameModel = SVMModel.load(sc, "target/tmp/pythonSVMWithSGDModel")

Logistic regression

The following example shows how to load the sample data set, build a logistic regression model, and use the model to make predictions and compute the training error.

Note: according to the Spark 2.2.0 documentation, the Python API does not yet support multiclass classification and model save/load for this method, but it will in the future.

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
 
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
 
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
 
# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)
 
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
 
# Save and load model
model.save(sc, "target/tmp/pythonLogisticRegressionWithLBFGSModel")
sameModel = LogisticRegressionModel.load(sc,"target/tmp/pythonLogisticRegressionWithLBFGSModel")
 

b) Regression

Prerequisite: upload the file lpsa.data to HDFS under /user/hdfs/data/mllib/ridge-data/

Linear least squares, Lasso, and ridge regression

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
 
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
 
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
 
# Build the model
model = LinearRegressionWithSGD.train(parsedData, iterations=100, step=0.00000001)
 
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds \
    .map(lambda vp: (vp[0] - vp[1])**2) \
    .reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
 
# Save and load model
model.save(sc, "target/tmp/pythonLinearRegressionWithSGDModel")
sameModel = LinearRegressionModel.load(sc, "target/tmp/pythonLinearRegressionWithSGDModel")

 
