Understanding the Two Core Technologies of Big Data in One Article

  Today, the editor will share with you the two core technologies of big data: Hadoop and Spark. As the saying goes, know yourself and know your opponent, and you will win a hundred battles. The same holds for learning big data technology: you first need a clear picture of the field before you can devote yourself to studying it.

  What is Hadoop?

  Hadoop started out as a Yahoo project in 2006 and was later promoted to a top-level Apache open source project. It is a general-purpose distributed computing infrastructure with several core components: the Hadoop Distributed File System (HDFS), which stores files in Hadoop's native format and parallelizes them across the cluster; YARN, the scheduler that coordinates application runtimes; and MapReduce, the engine that actually processes the data in parallel. Hadoop is built in Java, but applications on it can be written in many other languages; MapReduce code can be written in Python, for example, through a Thrift client or the Hadoop Streaming utility.
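  As a rough illustration of how MapReduce jobs can be written outside Java, here is a minimal word-count mapper and reducer sketched in Python for use with the Hadoop Streaming utility. The file names are illustrative assumptions; Streaming simply pipes records through the scripts on stdin/stdout.

```python
# --- mapper.py ---------------------------------------------------------
# Emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# --- reducer.py --------------------------------------------------------
# Sums the counts per word; Hadoop Streaming sorts records by key before
# they reach the reducer, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

  Scripts like these are typically submitted with the hadoop-streaming jar that ships with Hadoop, passed via the -mapper and -reducer options along with -input and -output paths; the exact jar location depends on the installation.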

  


  In addition to these basic components, Hadoop also includes Sqoop, which moves relational data into HDFS; Hive, a SQL-like interface that allows users to run queries over data in HDFS; and Mahout, a machine learning library. Besides using HDFS for file storage, Hadoop can now also be configured to use Amazon S3 buckets or Azure blobs as input.
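  To give a flavor of Hive's SQL-like interface, the sketch below runs a query from Python through the third-party PyHive client. The host, port, and the page_views table are assumptions made for illustration; the same query could equally be run from the hive or beeline command line.

```python
# A minimal sketch of querying Hive via the third-party PyHive client.
# Assumes a HiveServer2 instance at example-host:10000 and a hypothetical
# table named page_views in the default database.
from pyhive import hive

conn = hive.connect(host="example-host", port=10000, database="default")
cursor = conn.cursor()

# HiveQL reads like ordinary SQL but is executed as jobs over data in HDFS.
cursor.execute(
    "SELECT url, COUNT(*) AS hits FROM page_views "
    "GROUP BY url ORDER BY hits DESC LIMIT 10"
)
for url, hits in cursor.fetchall():
    print(url, hits)

cursor.close()
conn.close()
```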

  It is available as open source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or Hortonworks.

  What is Spark?

  Spark is a relatively new project, born in 2012 at the AMPLab at the University of California, Berkeley. It, too, is a top-level Apache project focused on processing data in parallel across a cluster, with one big difference: it works in memory.

  Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run either in standalone mode, with a Hadoop cluster serving as the data source, or together with Mesos. In the latter case, the Mesos master replaces the Spark master or YARN for scheduling purposes.
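  As a rough sketch of the RDD abstraction, the following PySpark snippet reads a text file from HDFS (the path and application name are assumptions), runs a word count through RDD transformations, and keeps the result in memory with cache() so later actions reuse it:

```python
# Minimal PySpark RDD sketch: a word count whose result stays in RAM.
# The HDFS path and the app name are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-word-count")

lines = sc.textFile("hdfs:///data/input.txt")      # RDD backed by a file in HDFS
words = lines.flatMap(lambda line: line.split())   # one record per word
counts = (words.map(lambda w: (w, 1))              # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b)    # summed per word, in parallel
               .cache())                           # keep the result in memory for reuse

print(counts.take(10))   # triggers the computation and prints a sample
sc.stop()
```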

  


  Spark is built around Spark Core, the engine that drives scheduling, optimizations, and the RDD abstraction, and that connects Spark to the right filesystem or data store (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries run on top of Spark Core: Spark SQL, which lets users run SQL-like commands on distributed datasets; MLlib for machine learning; GraphX for graph problems; and Spark Streaming, which allows the input of continuously streamed log data.
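  For the Spark SQL library mentioned above, a minimal sketch might look like the following; the JSON path and the column names (user, bytes) are assumptions chosen for illustration. A DataFrame is registered as a temporary view and then queried with ordinary SQL:

```python
# Minimal Spark SQL sketch: query a distributed dataset with SQL.
# The JSON path and the column names (user, bytes) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

logs = spark.read.json("hdfs:///data/logs.json")   # DataFrame built on Spark Core
logs.createOrReplaceTempView("logs")

top_users = spark.sql("""
    SELECT user, SUM(bytes) AS total_bytes
    FROM logs
    GROUP BY user
    ORDER BY total_bytes DESC
    LIMIT 5
""")
top_users.show()

spark.stop()
```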

  Spark has several APIs. The original interface was written in Scala; Python and R interfaces were added later, driven by heavy use among data scientists. Java is another option for writing Spark jobs.

  Databricks, a company co-founded by Spark creator Matei Zaharia, now leads Spark development and offers a Spark distribution to customers.

  That concludes this basic introduction to Hadoop and Spark, the two core technologies of big data. If you want to sharpen your skills or break new ground in your own technical field, you are welcome to get in touch with the editor; a full set of big data study materials has been prepared for you!
