Big data and cloud computing knowledge

Chapter I
1. The four characteristics of big data (4V)
(1) Volume: the amount of data is huge, having jumped from the TB level to the PB level.
(2) Variety: data is divided into structured data (10%) and unstructured data (90%), where unstructured data also covers semi-structured data. Structured data is the data stored in relational databases; the latter comes in many forms, including e-mail, audio, video, WeChat, Weibo, location information, link information, phone call records, web logs, and so on.
(3) Velocity: data must be analyzed in real time, with responses within seconds.
(4) Value: the value density is low, far below that of the data already stored in traditional relational databases.
2. Big data computing modes
(1) Batch computing:
a: MapReduce: batch processing of very large data sets; a large data-processing task is split up and executed in parallel over the data set.
b: Spark: a low-latency, cluster-based distributed computing framework for large data sets, much faster than MapReduce.
(2) Stream computing:
Stream data is an unbounded collection of data that arrives dynamically over time; its value decreases as time passes, so it must be processed as it arrives, with responses given within seconds.
Stream computing processes continuous data streams arriving in real time from different data sources, analyzes them in real time, and produces valuable results.
3. Cloud computing
1. Concepts of cloud computing
(1) Cloud computing delivers scalable, low-cost distributed computing power over the network.
(2) Cloud computing has three service models:
a: IaaS (Infrastructure as a Service): infrastructure, i.e. computing resources (CPU, memory) and storage (disk), is rented out as a service
b: PaaS (Platform as a Service): the platform is rented out as a service
c: SaaS (Software as a Service): software is rented out as a service
(3) Cloud computing has three deployment types:
a: public cloud: provides services to all users
b: private cloud: provides services only to specific users
c: hybrid cloud: combines the characteristics of public and private clouds (some companies need to keep their data in a private cloud for security reasons, yet also want to use public cloud computing resources, so they use the two together)
2. Key technologies of cloud computing

(1) Virtualization (key features of virtualization: compatibility, isolation, encapsulation)
Virtualization is the cornerstone of the cloud computing architecture. It means running multiple logical computers simultaneously on a single physical computer; each logical computer can run a different operating system, and applications run in mutually independent spaces without affecting one another, which significantly improves the utilization of the physical machine.
(2) Distributed storage
HDFS (the Hadoop distributed file system) uses a simple "write once, read many" file model: once a file has been created, written, and closed, it can only be read; it cannot be modified. HDFS is implemented in Java.
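As a minimal sketch of this access model, the snippet below uses the Hadoop FileSystem Java API to create a file, write it once, close it, and then read it back. The path and contents are made-up examples, and it assumes the client configuration points at a running HDFS.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo.txt");      // hypothetical path

        // Write once: create, write, close. After close() the file cannot be
        // modified in place; it can only be read (or deleted).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any number of readers can open the closed file.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}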
(3) Distributed computing
MapReduce (a parallel programming model) lets anyone obtain massive computing power in a short time. It allows developers with no experience in distributed programming to write distributed parallel programs, run them on hundreds of machines, and complete computations over massive data sets quickly.
MapReduce abstracts the complex parallel computation running on a large-scale cluster into two functions, Map and Reduce. A large data set is cut into many small pieces that are processed in parallel on different machines, which greatly improves data processing speed.
3. Cloud computing data centers
A cloud computing data center is a complex set of facilities, including blade servers, broadband network connections, environmental control equipment, monitoring equipment, security devices, and so on.
Data centers are an important carrier of cloud computing: they supply the computing, storage, bandwidth, and other hardware resources for cloud computing, and provide the runtime environment for all kinds of platforms and applications.

Chapter II (Hadoop)
1. Hadoop
(1) Concept:
Hadoop is an open-source distributed computing platform from the Apache Software Foundation. It provides users with a distributed infrastructure whose low-level system details are transparent, and it is recognized by industry as the open-source standard for big data.
Hadoop is developed in Java and has good cross-platform portability.
(2) The core of Hadoop is the distributed file system HDFS and MapReduce.
(3) Hadoop features:
high reliability, high efficiency, high scalability, high fault tolerance, low cost; it runs on the Linux platform and supports multiple programming languages.
(4) The three core technologies of Hadoop:
a: HDFS (a distributed file system that can run on clusters of inexpensive commodity servers; low cost, high reliability, high throughput)
b: HBase (a column-oriented distributed database providing high reliability, high performance, scalability, and real-time reads and writes)
c: MapReduce (a distributed programming model for writing parallel programs)

The corresponding three Google technologies are GFS, MapReduce, and Bigtable.
2. SSH login
For pseudo-distributed and fully distributed Hadoop, the Hadoop name node (NameNode) needs to start the daemons on all machines in the cluster, and it does so by logging in to them over SSH. Hadoop provides no way to enter SSH passwords interactively, so for the login to succeed on every machine, all machines must be configured so that the name node can log in to them without a password.
3. Hadoop installation
(1) Stand-alone mode: the default, non-distributed mode (local mode) of Hadoop; it runs without any extra configuration as a single, non-distributed Java process, which is convenient for debugging.
(2) Pseudo-distributed mode: Hadoop runs in a distributed fashion on a single node; each Hadoop daemon runs as a separate Java process. The node acts both as NameNode and as DataNode, and the files it reads are stored in HDFS.
(3) Distributed mode: Hadoop runs on a cluster made up of multiple nodes.

Chapter III (HDFS)
1. HDFS concepts
The block is the basic storage unit in HDFS; the default block size is 64 MB (128 MB from Hadoop 2.x onward). Each block is stored as an independent unit.
The main HDFS components and their functions:
NameNode: stores metadata; the metadata is kept in memory; it maintains the mapping between files, blocks, and DataNodes.
DataNode: stores file contents; the contents are saved to disk; it maintains the mapping from block IDs to the local files on the DataNode.

The HDFS namespace comprises directories, files, and blocks.
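The file-to-block-to-DataNode mapping kept by the NameNode can be observed from a client. The sketch below uses the Hadoop FileSystem Java API to ask for the block locations of a file; the path is a made-up example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big.log");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its in-memory metadata:
        // which blocks make up the file, and which DataNodes hold each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}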
2. HDFS data replication
The NameNode has full control over data block replication. It periodically receives a heartbeat and a block report from every DataNode in the cluster: receiving a heartbeat means the DataNode is working properly, and the block report lists all the data blocks stored on that DataNode.
Data block replication:
(1) The NameNode finds that the number of replicas of some blocks of a file does not meet the minimum replication factor, or that some DataNodes have failed.
(2) It tells the DataNodes to copy the affected blocks.
(3) The DataNodes start copying the blocks to one another.
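One way to see this mechanism from the client side is to change a file's target replication factor: the NameNode then notices blocks that are under-replicated and schedules DataNodes to copy them, following the steps above. A minimal sketch with the FileSystem API (the path and factor are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big.log");   // hypothetical file

        // Raise the target replication factor for this file to 3.
        // The NameNode detects blocks that now have fewer than 3 replicas
        // and instructs DataNodes to copy them to each other.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}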
3. Commonly used HDFS commands
1. List files in HDFS:
hadoop fs -ls
2. Create one or more directories:
hadoop fs -mkdir
3. Upload a file to HDFS:
hadoop dfs -put test1 test    /* upload the local file test1 to HDFS under the name test */
4. Copy a file from HDFS to the local system:
hadoop dfs -get test0 test00    /* copy test0 from HDFS to the local system under the name test00 */
5. Delete a file in HDFS:
hadoop dfs -rmr test00    /* delete the file named test00 from HDFS */
6. View a file in HDFS:
hadoop dfs -cat ttt    /* view the contents of the file ttt in HDFS */
7. Report basic HDFS statistics:
hadoop dfsadmin -report
8. Leave safe mode:
hadoop dfsadmin -safemode leave
9. Enter safe mode:
hadoop dfsadmin -safemode enter

Chapter IV (HBase)
1. Concept
HBase is a highly reliable, high-performance, column-oriented, scalable distributed database, mainly used to store loosely structured data (semi-structured and unstructured data).
2. Differences between HBase and traditional relational databases
Data types: HBase uses a simple data model in which all data is stored as uninterpreted strings; structured and unstructured data alike are serialized into strings and stored. A relational database uses the relational model and has rich data types and storage formats.
Data operations: in HBase there are no relationships between tables, and only simple insert, query, delete, and clear operations are available. A relational database supports insert, delete, update, and query, including multi-table joins.
Storage mode: HBase storage is column-based; relational databases store data row by row.
Data indexes: HBase has only one index, the row key; a relational database can build multiple complex indexes on different columns.
Data maintenance: when HBase performs an update it does not delete the old version of the data but creates a new version, and the old version is retained. When a relational database updates a value, the new value replaces the old one, and the old value no longer exists after it is overwritten.
Scalability: HBase can be scaled out flexibly by adding or removing machines in the cluster; for a relational database, horizontal scaling is difficult to achieve and the room for vertical scaling is also relatively limited.
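The "everything is an uninterpreted byte string keyed by row key, column family, and column qualifier" model shows up directly in the Java client API. The sketch below is only an illustration: it assumes a reachable HBase cluster and a hypothetical table named user with a column family info.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {   // hypothetical table

            // Values are stored as plain byte strings under
            // row key -> column family -> column qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the cell back by row key, family, and qualifier.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}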

Chapter VII (MapReduce)
1. The design philosophy of MapReduce
A core design idea of MapReduce is "move the computation to the data" rather than "move the data to the computation", because moving data requires a great deal of network transfer overhead. In a large-scale data environment this overhead is especially alarming, so moving the computation is more economical than moving the data.
2. MapReduce at three levels
(1) How to handle big data: divide and conquer
(2) Rising to an abstract model: Mapper and Reducer
(3) Rising to the architecture level: a unified framework that hides system details from the programmer and makes the low-level details transparent (the WordCount sketch below illustrates both the abstraction and the framework)
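The canonical example of the Mapper/Reducer abstraction is WordCount. The sketch below follows the standard Hadoop MapReduce API; splitting the input, shuffling intermediate pairs, and scheduling tasks across the cluster are handled by the framework, and the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: each input split is processed in parallel; emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups values by key; sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}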

Chapter IX (Spark)
1. Concept
Spark is a big data parallel computing framework based on in-memory computing; it can be used to build large-scale, low-latency data analysis applications.
2. Application scenarios
(1) complex batch processing of data
(2) interactive queries over historical data
(3) data processing based on real-time data streams
3. The Spark design philosophy
Spark's design follows the idea of "one software stack satisfying different application scenarios". It has gradually formed a complete ecosystem that provides an in-memory computing framework and also supports SQL ad hoc queries, machine learning, and other kinds of computation.
4. The two types of Spark operations (illustrated in the sketch below)
Transformations: map, filter, join, union, sample
Actions: first, top, count, collect
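A small sketch with the Spark Java RDD API, run locally on made-up data, showing that transformations are lazy while actions trigger the actual computation:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOpsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ops-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformations (map, filter, union, ...) are lazy: they only
            // build a new RDD lineage, nothing is computed yet.
            JavaRDD<Integer> doubled = nums.map(x -> x * 2);
            JavaRDD<Integer> big = doubled.filter(x -> x > 4);

            // Actions (count, collect, first, top, ...) trigger the in-memory
            // computation and return a result to the driver.
            long howMany = big.count();
            List<Integer> values = big.collect();
            System.out.println(howMany + " -> " + values);
        }
    }
}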

Chapter X (Hive)
Hive is a data warehousing tool built on top of Hadoop. It can be used to organize, store, and run ad hoc queries and analysis over the data files in Hadoop, and it provides HiveQL (QL), a query language similar to the SQL of relational databases.
Some Hive code:
Requirement 1: analyze whether users comment from a mobile device or from a PC, and the proportion of mobile and PC users
SELECT CASE
         WHEN isMobile = '1' THEN 'true'
         WHEN isMobile = '0' THEN 'false'
       END AS isMobile,
       COUNT(1) AS num
FROM pinglun
GROUP BY isMobile;

Requirement 2: analyze when users comment (after receiving the goods, how long do they generally wait before commenting)
SELECT days,
       COUNT(1) AS num
FROM pinglun
GROUP BY days
ORDER BY num DESC;
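Queries like these can also be issued from a Java program through the HiveServer2 JDBC driver. The sketch below is assumption-laden: it presumes HiveServer2 is running on localhost:10000, that the default database holds the pinglun table used above, and that the hive-jdbc dependency is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT isMobile, COUNT(1) AS num FROM pinglun GROUP BY isMobile")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}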

Source: blog.csdn.net/weixin_44039347/article/details/91602294