Big data security controls and scenario analysis

1. How to achieve security in Hadoop
    1.1 Sharing a Hadoop cluster:
        a. Administrators divide developers into several queues; each queue has a fixed share of resources, and each user or user group may only use the resources of its assigned queue.
        b. Data on HDFS is classified as public, private, or encrypted, so that different users can access different data according to their permissions.
    1.2 HDFS security
        After the client passes initial authentication with the NameNode (using Kerberos), it obtains a delegation token, which can be used to authenticate subsequent HDFS accesses or job submissions.
        Reading a block works the same way, using a token obtained from the NameNode (a block access token).
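
        For illustration (not from the original article), here is a minimal Java sketch of this flow, assuming a Kerberos-enabled cluster; the principal, keytab path, and renewer name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class HdfsKerberosExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster requires Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Initial authentication with Kerberos (principal and keytab path are placeholders).
        UserGroupInformation.loginUserFromKeytab("alice@EXAMPLE.COM",
                "/etc/security/keytabs/alice.keytab");

        FileSystem fs = FileSystem.get(conf);
        // After the Kerberos handshake, the client can fetch a delegation token
        // that authenticates later HDFS accesses without going back to the KDC.
        Token<?> token = fs.getDelegationToken("alice");
        System.out.println("Got delegation token of kind: " + token.getKind());
        fs.close();
    }
}
```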
    1.3 MapReduce security
        All job submissions and job-status tracking go through RPC with Kerberos authentication. When an authorized user submits a job, the JobTracker generates a delegation token for it;
        the token is stored on HDFS as part of the job and distributed to each TaskTracker via RPC. Once the job finishes running, the token becomes invalid.
    1.4 DistributedCache security
        The DistributedCache comes in two kinds: shared, which all jobs can share, and private, which can only be used by the jobs of the owning user.
    1.5 RPC security
        Authentication and authorization mechanisms were added to Hadoop RPC. When a user makes an RPC call, the user's login name is carried in the RPC header to the server; the server then uses the Simple Authentication and Security Layer (SASL) to negotiate the authentication protocol (Kerberos and DIGEST-MD5 are supported), and RPC completes the authorization.

2. Usage scenarios for HDFS and HBase
    HDFS:
        1. Write-once, read-many workloads.
        2. Guarantees data consistency.
        3. Mainly deployed on many low-cost machines; reliability comes from multiple replicas, which provide fault tolerance and recovery mechanisms.
    HBase:
        1. Large volumes of data written in real time, in scenarios a traditional database cannot support or can only support at high cost.
        2. Data that needs to be kept long term and whose volume keeps growing to a large scale.
        3. Not suitable for data models that need joins, multi-level indexes, or complex table relationships.
        4. Very large data sets (hundreds of TB) with a need for fast random access.
        5. Capacity must grow gracefully as data grows; the system must support dynamic capacity expansion, e.g. a webPage DB.
        6. Simple business scenarios that do not need most relational-database features (cross-column/cross-table operations, transactions, joins, etc.).
        7. Optimization: design the rowkey carefully, because HBase is queried by rowkey and its design directly affects HBase performance (see the sketch below).
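
    As an illustration of rowkey design (an addition, not from the original article), the sketch below salts a naturally sequential key so that writes spread across regions instead of hotspotting on one; the bucket count and key layout are assumptions.

```java
import java.nio.charset.StandardCharsets;

public class SaltedRowKey {
    private static final int BUCKETS = 16; // assumed number of salt buckets / pre-split regions

    // Prefix the natural key with a salt derived from its hash so that
    // sequential keys (e.g. timestamp-based ones) do not all land on the same region.
    public static byte[] build(String naturalKey) {
        int salt = Math.abs(naturalKey.hashCode() % BUCKETS);
        String rowKey = String.format("%02d_%s", salt, naturalKey);
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example with a hypothetical time-ordered key.
        System.out.println(new String(build("20200112093000_user42"), StandardCharsets.UTF_8));
    }
}
```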

3. HBase design issues

  a. What is an appropriate value for hbase.hregion.max.filesize?
    The default is 256 MB, the maximum HStoreFile size. If any HStoreFile of a column family (HStore) exceeds this value, the HRegion it belongs to is split into two.
    As is well known, HBase data is first written to the memstore; when the memstore fills to 64 MB, it is flushed to disk and becomes a storefile.
    When the number of storefiles exceeds 3, a compaction process starts and merges them into one storefile. This process also removes data with outdated timestamps, such as values that have since been updated.
    When the merged storefile grows larger than the configured maximum hfile size, a split is triggered and the region is cut into two regions (a configuration sketch follows below).
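
    As a hedged illustration, these parameters can be set through the HBase client Configuration; the exact defaults differ between HBase versions, and the values below simply mirror the numbers mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionSizingConfig {
    public static Configuration build() {
        Configuration conf = HBaseConfiguration.create();
        // Maximum HStoreFile size before the region is split in two (256 MB here, as in the text).
        conf.setLong("hbase.hregion.max.filesize", 256L * 1024 * 1024);
        // Memstore size at which the in-memory write buffer is flushed to a storefile (64 MB here).
        conf.setLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);
        // Number of storefiles in a store that triggers a compaction.
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        return conf;
    }
}
```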
b. The impact of autoflush = false
    Both the official documentation and many blogs advocate setting autoflush = false in application code to speed up HBase writes. The author's view is that this setting should be applied with care in online applications, for the following reasons:
        1. The principle of autoflush = false is that when the client issues a delete or put request, the request is cached on the client until the buffered data exceeds 2 MB (controlled by hbase.client.write.buffer)
        or the user calls flushCommits(), at which point the buffered requests are submitted to the regionserver. So even if htable.put() returns successfully, it does not mean the request has actually succeeded.
        If the client crashes before the buffer is flushed, that portion of data is lost because it never reached the regionserver. For online services with zero tolerance for data loss, this is unacceptable.
        2. With autoflush = true the write speed drops by a factor of 2-3, but for many online applications it must still be enabled, which is also why true is HBase's default.
        When the value is true, every request is sent to the regionserver, and the first thing the regionserver does after receiving a request is write the HLog. The demand on I/O is therefore very high; to improve HBase write speed,
        one should raise I/O throughput as much as possible, for example by adding disks, using a RAID card, or reducing the replication factor (a client sketch follows below).
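
    A minimal sketch of the client-side buffering discussed above, using the older HTable API that this article refers to (setAutoFlush/flushCommits were later deprecated in favor of BufferedMutator); the table, family, and qualifier names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AutoFlushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Client-side write buffer: puts are cached until this many bytes accumulate (2 MB).
        conf.setLong("hbase.client.write.buffer", 2L * 1024 * 1024);

        HTable table = new HTable(conf, "web_page");   // placeholder table name
        table.setAutoFlush(false);                     // buffer puts on the client side
        try {
            for (int i = 0; i < 10000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                table.put(put);                        // may return before data reaches the regionserver
            }
        } finally {
            table.flushCommits();                      // push anything still buffered
            table.close();                             // data not flushed before a crash would be lost
        }
    }
}
```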

c. When a table in a traditional relational database is remodeled for HBase, how should column families and qualifiers be set up from a performance point of view?
    Take the two extremes: (1) every column gets its own family, or (2) the table has only one family and every column is a qualifier under it. What difference does it make?
    From the reading point of view:
        The more families there are, the more obvious the advantage when fetching the data of a single cell, because I/O and network traffic are both reduced.
        If all the data sits in a single family, every read fetches all the data for the current rowkey, so there is some loss on network and I/O.
        Of course, if the set of columns you want to fetch is fixed, writing those columns into one family is better than splitting them across several families, because
        a single request can then bring back all the data you asked for.
    From the writing point of view:
        First, on the memory side, for each Region a Store is allocated per family of the table, and each Store
        is allocated a MemStore, so more families consume more memory.
        Second, regarding flush and compaction, in the current version of HBase both operate in units of a region:
        when any one family reaches the flush condition, the memstores of all the families belonging to that region are flushed together,
        even if the others hold only a little data, which produces small files. This increases the probability of compactions,
        and since compaction is also per region, it easily leads to compaction storms that reduce overall system throughput.
    Third, from the split point of view, because hfiles are kept per family, with multiple families the data is spread across more hfiles, which reduces the probability of splits.
    This is a double-edged sword. Fewer splits mean the regions grow relatively large, and since balancing is done by region count rather than region size, this can make balancing ineffective.
    On the plus side, fewer splits make the system more stable and better able to provide online service. The downside can be avoided by manually splitting and balancing during low-traffic periods.
    So for a write-heavy system: if it is offline, just use a single family as far as possible; but if it is an online application, families should be allocated sensibly according to the application (a table-creation sketch follows below).
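
    As an illustrative sketch only (not the article's own code), here is how a table with a small number of families might be created with the HBase 1.x-era admin API; the table and family names are placeholders.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("user_profile")); // placeholder
            // Keep the number of families small: columns that are read together share one family.
            table.addFamily(new HColumnDescriptor("base"));    // frequently read base attributes
            table.addFamily(new HColumnDescriptor("detail"));  // rarely read detail attributes
            admin.createTable(table);
        }
    }
}
```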

4. The principle of distributed Redis. How is read/write separation done, which algorithms are used in the process, and what are the benefits?
    Memcache can only be called a simple key-value in-memory store, whereas the data types Redis supports are rich. Redis's own cluster mechanism was only implemented later, in Redis 3.0.
    Before that, Redis clusters were mainly implemented with consistent hashing for sharding: different keys are assigned to different Redis servers, achieving the goal of horizontal scaling (a consistent-hashing sketch follows at the end of this section).
    With consistent-hash sharding, keys are distributed across different Redis servers. When capacity needs to grow, a machine must be added to the shard list; at that point the same key may be hashed onto a different machine than before,
    so a lookup for an existing value can come back empty. For this situation Redis proposed an approach called Pre-Sharding.
    Problems with using Redis in this cluster mode:
        a. Capacity expansion:
        With Pre-Sharding, each physical machine runs several Redis instances as separate shards. If there are three physical machines, each actually running three Redis instances,
        the shard list in fact contains nine Redis instances. To expand by adding one physical machine, the steps are:
            1. Run a Redis-Server on the new physical machine;
            2. Make this Redis-Server a slave (slaveof) of one of the Redis-Servers in the shard list (call it RedisA);
            3. After master-slave replication completes, change RedisA's IP and port in the client's shard list to the IP and port of the Redis-Server on the new physical machine;
            4. Stop RedisA.
        b. Single point of failure:
            Move the failed Redis-Server's shard to another machine. Pre-Sharding is effectively a way of doing online expansion, but it still depends heavily on Redis's own replication:
            if the master's snapshot file is too large, the replication process takes a long time and puts pressure on the master.
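
    To make the sharding idea above concrete, here is a minimal, self-contained consistent-hashing sketch (an illustration, not Redis's or any client library's actual implementation); the server addresses and virtual-node count are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // virtual nodes per server, to even out the key distribution

    public void addServer(String server) throws Exception {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // A key is served by the first server clockwise from its hash position on the ring.
    public String serverFor(String key) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        // Use the first 8 bytes of the MD5 digest as a 64-bit position on the ring.
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (digest[i] & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) throws Exception {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addServer("redis-a:6379");   // placeholder shard addresses
        ring.addServer("redis-b:6379");
        ring.addServer("redis-c:6379");
        System.out.println("user:42 -> " + ring.serverFor("user:42"));
    }
}
```

    Because only the keys that map near the added or removed server move, adding a shard disturbs far fewer keys than naive modulo hashing, which is the benefit the section above alludes to.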
