Big Data Concepts and HDFS

Big Data

  • Outline

    • Big Data is a new processing model that provides greater decision-making power, insight discovery, and process-optimization capability for data assets that are massive, fast-growing, and highly diverse.

  • The problems big data poses

    • How to store huge amounts of data (KB, MB, GB, TB, PB, EB, ZB)
    • How to reduce noise when processing the data (the data has to be cleaned, turning raw "waste" into usable data: useful data is extracted and unnecessary data is discarded to free up resource space)
  • Solutions

    • Hadoop is a distributed storage and computing framework that addresses both problems: HDFS solves the data-storage problem, and MapReduce solves the data-processing (computation) problem
  • What is Hadoop?

    • Where does it come from?

      • Hadoop is based on three papers published by Google:

      1. Google File System

      2. Google Bigtable

      3. Google MapReduce

      These three papers inspired Doug Cutting, the father of Hadoop, to tackle the problems of big data using the Java language.

    • Outline
      • Hadoop is an open-source distributed infrastructure from the Apache Foundation. It provides high fault tolerance, high throughput, and low cost, and because it is written in Java it runs very reliably on Linux. Hadoop's core components are HDFS, MapReduce, and HBase, which correspond in turn to the three Google papers and address the problems posed by big data:
        • HDFS: a distributed file storage system
        • MapReduce: a distributed computing framework; only a small amount of Java code is needed to implement a distributed computation
        • HBase: a column-oriented NoSQL store built on top of HDFS
    • HDFS
      • A distributed file storage system made up of a NameNode, DataNodes, and blocks. The NameNode is responsible for managing the DataNodes and stores the metadata, including the mapping between files, blocks, and DataNodes. The DataNodes receive read and write requests in coordination with the NameNode and are responsible for creating and replicating blocks (a small client sketch follows this list).
    • NameNode: stores the metadata (data that describes the data) and is responsible for managing and coordinating the DataNodes
    • DataNode: the node that stores the data blocks; it serves read and write requests in coordination with the NameNode and reports its block information to the NameNode
    • Block: the smallest unit of storage in HDFS, 128 MB by default, with three replicas by default
    • Rack: rack awareness decides which rack a storage node's replicas are placed on, improving fault tolerance and throughput and optimizing storage and computation
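
A minimal client-side sketch of these roles, assuming a reachable HDFS whose core-site.xml/hdfs-site.xml are on the classpath (the file path passed in is just an example). Asking for a file's status and block locations is a metadata query answered by the NameNode; the hosts listed for each block are the DataNodes holding its replicas:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // e.g. /data/input.txt (example path)

        // Metadata query answered by the NameNode; no file data is read here.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("length: " + status.getLen()
                + ", block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        // For each block: its offset, its length, and the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " len " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```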
  • The relationship between the NameNode and the SecondaryNameNode

    fsimage: a snapshot of the metadata that is loaded into memory when the NameNode starts

    edits: a log file that records the write operations (changes to the namespace)

   The NameNode loads fsimage and edits at startup. These two files do not appear out of thin air, which is why the NameNode has to be formatted first: formatting creates the initial fsimage and edits

   As the user keeps operating on files, edits keeps growing and the NameNode gets slower and slower, which is why the SecondaryNameNode exists. It is a helper for the NameNode rather than a hot standby: when a checkpoint is reached (by default in HDFS, every hour or whenever the operation log reaches 1,000,000 entries), the SecondaryNameNode fetches fsimage and edits and merges them. Any read and write requests that arrive while the merge is running are recorded in a file called edits-inprogress. Once edits has been merged into fsimage to produce a new fsimage, edits-inprogress is renamed to edits
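
The two checkpoint triggers mentioned above are controlled by configuration. A small sketch for reading them on a Hadoop 2.x client, assuming hdfs-site.xml is available on the classpath; the fallback values are the usual stock defaults, so verify them against your own cluster:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");   // pick up any client-side overrides of these keys

        // Checkpoint triggers described above; the fallback values passed here are the
        // usual Hadoop 2.x defaults (assumed), so verify them against your own cluster.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);   // one hour
        long txnThreshold  = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);  // 1,000,000 operations

        System.out.println("checkpoint period (s): " + periodSeconds);
        System.out.println("checkpoint txn threshold: " + txnThreshold);
    }
}
```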

    • Small question: why is the default size of a block 128 MB?
      • In Hadoop 1.x the default block size was 64 MB; as hard disks became faster, Hadoop 2.x raised it to 128 MB. The rule of thumb is that the addressing (seek) time should be about 1/100 of the transfer time: with a seek of roughly 10 ms and a transfer rate of roughly 100 MB/s, one second of transfer moves about 100 MB, which is rounded up to the power-of-two value 128 MB
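
A tiny worked version of that rule of thumb; the 10 ms seek time and 100 MB/s transfer rate are assumed, typical figures rather than values measured on any particular cluster:

```java
public class BlockSizeRuleOfThumb {
    public static void main(String[] args) {
        // Assumed, typical figures -- not measured values from any particular cluster.
        double seekTimeMs = 10.0;        // time to locate the start of a block on disk
        double transferRateMBs = 100.0;  // sequential read throughput of the disk

        // Keep the seek overhead at about 1% of the total time spent on a block:
        double transferTimeMs = seekTimeMs / 0.01;                       // 1000 ms of transfer per seek
        double idealBlockMB = transferRateMBs * (transferTimeMs / 1000); // ~100 MB

        System.out.println("ideal block size ~" + idealBlockMB
                + " MB, rounded up to the power of two 128 MB");
    }
}
```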
  • MapReduce
      • Concept: a distributed computing framework for computation over large-scale data. It uses parallel computing and makes full use of where the data is physically stored on the DataNodes. Its model is Map (mapping) followed by Reduce (reduction), which lets programmers who know nothing about distributed parallel programming run their programs on a distributed system. The idea is that the map phase tags every record with a key, and the reduce phase gathers all the values that share the same key so that each group can be summarized as a whole (see the WordCount sketch after this list)
    • What MapReduce does best is divide and conquer:
      • "Divide" splits a large, complex task into several simple tasks, where "simple" means three things:
      1. The amount of data each task handles is greatly reduced compared with the original
      2. All tasks are computed in parallel and do not interfere with each other
      3. Computation happens close to where the data is stored (data locality)
      • Reduce is responsible for gathering and summarizing the results computed by the map tasks
      • To run MapReduce, a resource scheduling platform is needed first: YARN
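
To make the Map and Reduce roles concrete, here is the classic WordCount job written as a minimal sketch against the Hadoop 2.x MapReduce API; the class names and input/output paths are illustrative. The mapper emits a (word, 1) pair for every token, and the reducer receives all the counts for one word together and sums them:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in this task's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same key arrive together; sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would typically be packaged into a jar and launched with something like `hadoop jar wordcount.jar WordCount <input> <output>`, at which point YARN (next section) schedules the map and reduce tasks across the cluster.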
  • Yarn
      • Concept: YARN is a resource scheduling platform. Its top-level manager, the ResourceManager, is responsible for the overall allocation of resources and for managing every node. Each NodeManager reports its node's resource status to the ResourceManager. For every job an MRAppMaster runs under a NodeManager; it applies to the ResourceManager for computing resources, coordinates the execution of the computing tasks, and monitors the tasks together with the NodeManagers (a small client sketch follows this list)
      1. ResourceManager: manages the cluster's overall resources and does the overall planning for computation
      2. NodeManager: manages a single compute host in the cluster and reports its status information to the ResourceManager
      3. MRAppMaster: applies to the ResourceManager for resources and coordinates the computing tasks
      4. YarnChild: the process that does the actual computing work
      5. Container: the abstract unit of computing resources (memory and CPU)
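
A minimal read-only sketch of talking to these components from a client, assuming a Hadoop 2.x cluster with yarn-site.xml on the classpath; it only reads reports and submits nothing. The node reports are what the NodeManagers have heartbeated to the ResourceManager, and the application reports show what the ResourceManager is currently tracking:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterOverview {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());   // reads yarn-site.xml to find the ResourceManager
        client.start();

        // Each NodeReport comes from NodeManager heartbeats aggregated by the ResourceManager.
        List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
        }

        // Applications whose resources (containers) the ResourceManager is tracking.
        List<ApplicationReport> apps = client.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " state: " + app.getYarnApplicationState());
        }

        client.stop();
    }
}
```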

