Spark study notes 1

Experimental environment (versions on the lab platform): Java 8, Python 2.7, Scala 2.11.8, Hadoop 2.7.3, Spark 2.4.4

Learning Content

Basic concepts

Spark is a cluster computing framework developed by the UC Berkeley AMP Lab. It is similar to Hadoop but differs in many ways. Its biggest optimization is that the intermediate results of computing tasks can be kept in memory instead of being written back to HDFS every time, which gives much better performance for applications that need iterative MapReduce-style algorithms. For example, in one sorting benchmark on 100 TB of data, Spark was three times faster than Hadoop while using only one tenth of the machines. Spark clusters can reach 8,000 nodes and process data at the PB level, and Spark is widely used by Internet companies.
The core of Hadoop is the distributed file system HDFS and the computing framework MapReduce. Spark can replace MapReduce while staying compatible with HDFS, Hive and other distributed storage layers, so it integrates well into the Hadoop ecosystem.
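As a rough illustration of why keeping intermediate results in memory matters, here is a minimal spark-shell sketch; the input path and the iteration logic are made up for illustration, the point is only that cache() lets each iteration reuse the parsed data instead of re-reading it from HDFS:

```scala
// In spark-shell the SparkContext is already available as `sc`.
// Hypothetical input path; one numeric value per line.
val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble)

// cache() keeps the parsed partitions in memory, so the iterations
// below reuse them instead of re-reading the file from HDFS each time.
points.cache()

var threshold = 0.0
for (_ <- 1 to 10) {
  threshold = points.filter(_ > threshold).mean()
}
println(s"final threshold = $threshold")
```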
Spark execution characteristics
Intermediate result output: Spark abstracts the overall execution workflow into a directed acyclic graph (DAG) execution plan, which allows the tasks of different stages to run in parallel or in series.
Data format and memory layout: Spark abstracts distributed memory into the Resilient Distributed Dataset (RDD) storage structure; data can be partitioned across different nodes in a controlled way, and users can define their own partitioning strategy (see the sketch below).
Task scheduling overhead: Spark uses the event-driven library Akka to launch tasks and reuses threads from a thread pool, avoiding the overhead of process startup and switching.
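A minimal spark-shell sketch (with made-up key/value data) of controlling the number of partitions and the partitioning strategy of an RDD:

```scala
import org.apache.spark.HashPartitioner

// In spark-shell the SparkContext is available as `sc`.
// Create an RDD of hypothetical (key, value) pairs spread over 4 partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)
println(pairs.getNumPartitions)   // 4

// Repartition by key with a HashPartitioner so that all records with the
// same key end up in the same partition.
val byKey = pairs.partitionBy(new HashPartitioner(2))
println(byKey.getNumPartitions)   // 2
```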
Advantages of Spark
Speed: Spark can run workloads up to 100 times faster. It achieves high performance for both batch and streaming data with a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Ease of use: applications can be written quickly in Java, Scala, Python, R, and SQL. Spark offers more than 80 high-level operators that make it easy to build parallel applications, and it can be used interactively from the Scala, Python, R, and SQL shells.
Generality: Spark combines SQL, streaming, and complex analytics. It provides a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and these libraries can be combined seamlessly in the same application.
Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources. You can run it in standalone mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes, and access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
The Spark ecosystem (BDAS)
Spark has grown into a big-data computing platform containing many subprojects. BDAS (Berkeley Data Analytics Stack) is the Spark-based data analytics stack proposed by Berkeley. Its core is the Spark framework, and on top of it sit subprojects such as Spark SQL, a query engine that supports SQL queries and analysis; the MLBase machine learning system together with its underlying distributed machine learning library MLlib; the graph-parallel computing framework GraphX; the stream computing framework Spark Streaming; the approximate query engine BlinkDB; the in-memory distributed file system Tachyon; the resource management framework Mesos; and other subprojects. These subprojects provide higher-level, richer computing paradigms on top of Spark (see the sketch below for a small example).
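As a small illustration of combining these libraries, here is a minimal Spark 2.4 spark-shell sketch (with made-up inline data) that uses Spark SQL and DataFrames together with the core RDD API in the same session:

```scala
// In spark-shell a SparkSession is already available as `spark`.
import spark.implicits._

// Build a DataFrame from a made-up in-memory dataset.
val df = Seq(("tcp", 6), ("udp", 17), ("icmp", 1)).toDF("protocol", "number")

// Query it with Spark SQL...
df.createOrReplaceTempView("protocols")
spark.sql("SELECT protocol FROM protocols WHERE number > 5").show()

// ...and drop down to the underlying RDD API when needed.
val names = df.rdd.map(row => row.getString(0)).collect()
```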

Key terms

  • HDFS: Hadoop Distributed File System
  • MapReduce: a computing model, framework, and platform for large-scale parallel processing
  • Iterative MapReduce algorithms: map can be understood as applying a function over a list of data, and reduce as aggregating the results; my current understanding is that map extracts features from the data list and reduce collapses them into the final, regular data (see the sketch after this list)
  • RDD: Resilient Distributed Dataset
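As a plain-Scala sketch of the map/reduce idea above (ordinary collections, no Spark, made-up data):

```scala
// map: transform each element of a list.
val lengths = List("spark", "hadoop", "hdfs").map(word => word.length)
// lengths == List(5, 6, 4)

// reduce: combine the transformed elements into a single result.
val total = lengths.reduce((a, b) => a + b)
// total == 15
```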

Operations

  • spark-shell
    Typing spark-shell on the command line opens an interactive shell, much like typing python opens the Python interpreter; spark-shell uses the Scala language.

    file is an RDD object created from /etc/protocols; file.count() returns the number of lines in the RDD, and file.first() returns the content of the first line.

    Filtering can then be used to count the lines that contain the strings tcp and udp.

    The word count example uses MapReduce-style operators, which I have not fully figured out yet.
    My current guess is that each line is split on spaces, every token between two spaces is a word, an RDD object wordcount is generated from these words, and its number of rows is the word count (see the sketch after this list).
  • pyspark
    The Python equivalent of spark-shell, i.e. a way to operate Spark through Python.

    I still have not fully figured out the concept of MapReduce.
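A minimal spark-shell sketch of the operations described above (the exact results depend on the contents of /etc/protocols, so the comments are illustrative only):

```scala
// In spark-shell the SparkContext is already available as `sc`.
val file = sc.textFile("file:///etc/protocols")

file.count()   // number of lines in the file
file.first()   // content of the first line

// Lines containing "tcp" or "udp" (use && instead of || for lines containing both).
file.filter(line => line.contains("tcp") || line.contains("udp")).count()

// Word count: split each line on spaces, map each word to (word, 1),
// then sum the counts per word.
val wordcount = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordcount.count()   // number of distinct words
wordcount.take(5)   // a few (word, count) pairs
```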


Off to bed; I will continue learning tomorrow.

Source: www.cnblogs.com/ltl0501/p/12099459.html