Big Data Common Terms (Repost)

Today, while searching for information on big data, I came across some common big data concepts at https://blog.csdn.net/dashujudaka/article/details/82980532 that cleared up doubts I had held for a long time, so I am reposting them here; hopefully that doesn't count as infringement.

Linux: Because big data software runs on Linux, it pays to learn Linux solidly. A good grasp of Linux is very helpful for picking up big data technologies quickly: it lets you better understand the operating environment and network configuration of software such as Hadoop, Hive, HBase, and Spark, and helps you avoid many pitfalls. Once you can read and write shell scripts, understanding and configuring big data clusters becomes much easier, and your learning curve for new big data technologies in the future will be much faster too.
 
 
Hadoop: This is the popular big data processing platform and has become almost synonymous with big data, so it is a must-learn. Hadoop includes several components: HDFS, MapReduce, and YARN. HDFS is where data is stored, much like files on a computer's hard disk, while MapReduce processes and computes the data. MapReduce's defining trait is that no matter how large the data, given enough time it will finish the job; it just may not be fast, which is why it is called batch processing.
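To make the batch-processing idea concrete, here is a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API; the input and output HDFS paths are placeholders you would pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```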
YARN is an important component that embodies the idea of Hadoop as a platform: with it, other software in the big data ecosystem can run on Hadoop, letting us take better advantage of HDFS's large storage and save resources. For example, we don't need to build a separate Spark cluster; we can just run Spark directly on the existing Hadoop YARN. In fact, once you have learned these Hadoop components you can already process big data, though you may still not have a very clear picture of what "big data" is. Don't get hung up on that. Once you are working, you will run into plenty of scenarios with tens or hundreds of terabytes of data; at that point you won't find big data so appealing, because the bigger it gets, the bigger your headache. Still, don't be afraid of large-scale data processing, because this is exactly where your value lies: let those doing JavaEE, PHP, and HTML5, and the DBAs, envy you.
 
 
 
Zookeeper: This is a cure-all: installing Hadoop in HA mode will use it, and HBase will use it later too. It is generally used to store coordination information shared between services, and that information is small, generally no more than 1 MB. The software that depends on ZooKeeper knows how to use it; for us personally, it is enough to install it correctly and keep it running normally.
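As a rough illustration of what "small coordination data" means, here is a minimal sketch using the standard ZooKeeper Java client; the ensemble address, znode path, and stored value are assumptions for this sketch.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Ensemble address is an assumption for this sketch.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await(); // wait until the session is established

    // Store a tiny piece of coordination data (well under the ~1 MB limit).
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "hbase-master=node1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back, as a dependent service such as HBase would.
    System.out.println(new String(zk.getData(path, false, null)));
    zk.close();
  }
}
```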
 
Mysql: We have covered processing big data; next we learn MySQL, a tool for handling small data, because it is needed when installing Hive. To what degree do you need to know MySQL? Enough to install it on Linux, get it running, configure simple permissions, change the root password, and create a database. The main thing here is to learn SQL syntax, because Hive's syntax is very similar to it.
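If you want to practice those basics from code rather than the mysql shell, here is a minimal JDBC sketch; the host, user, password, and table are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MysqlDemo {
  public static void main(String[] args) throws Exception {
    // Connection details are assumptions for this sketch.
    String url = "jdbc:mysql://localhost:3306/?useSSL=false";
    try (Connection conn = DriverManager.getConnection(url, "root", "your_password");
         Statement st = conn.createStatement()) {
      // The basics the text mentions: create a database and a table, then query it.
      st.execute("CREATE DATABASE IF NOT EXISTS demo");
      st.execute("CREATE TABLE IF NOT EXISTS demo.users (id INT PRIMARY KEY, name VARCHAR(50))");
      st.execute("INSERT INTO demo.users VALUES (1, 'alice') ON DUPLICATE KEY UPDATE name = 'alice'");
      try (ResultSet rs = st.executeQuery("SELECT id, name FROM demo.users")) {
        while (rs.next()) {
          System.out.println(rs.getInt("id") + " " + rs.getString("name"));
        }
      }
    }
  }
}
```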
 
Sqoop: This is used to import data from MySQL into Hadoop. Of course, you can skip it and instead export MySQL tables to files and then put them on HDFS yourself; the result is the same. Either way, be mindful of the load this puts on MySQL in a production environment.
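Sqoop itself is driven from the command line, but the "export to a file and put it on HDFS" alternative mentioned above looks roughly like this with the HDFS Java API; the namenode address and both file paths are assumptions for this sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Namenode address is an assumption for this sketch.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    try (FileSystem fs = FileSystem.get(conf)) {
      // Upload a local dump of a MySQL table onto HDFS.
      fs.copyFromLocalFile(new Path("/tmp/users.csv"),
                           new Path("/data/users.csv"));
      System.out.println("uploaded: " + fs.exists(new Path("/data/users.csv")));
    }
  }
}
```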
 
Hive: For anyone who knows SQL syntax, this thing is a godsend. It makes processing big data very simple and spares you the effort of writing MapReduce programs. Some people ask about Pig? Pig is much the same; mastering either one of them is enough.
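To get a feel for how Hive lets plain SQL stand in for handwritten MapReduce, here is a minimal sketch using the HiveServer2 JDBC driver; the server address, user, and the `words` table are assumptions for this sketch.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
  public static void main(String[] args) throws Exception {
    // Older driver jars may need explicit loading.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 address is an assumption for this sketch.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement st = conn.createStatement()) {
      // Plain SQL; under the hood Hive compiles this into batch jobs for you.
      ResultSet rs = st.executeQuery(
          "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word ORDER BY cnt DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```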
 
Oozie: Now that you have learned Hive, I am sure you will need this thing. It can manage your Hive, MapReduce, and Spark scripts, check whether your programs ran correctly, alert you and retry on failure, and, most importantly, configure dependencies between tasks. I am sure you will love it; otherwise, staring at a pile of scripts and a dense wall of cron entries will make you want to tear your hair out.
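Oozie workflows themselves are defined in XML, but as a rough sketch of kicking one off programmatically, the Oozie Java client looks like this; the server URL, HDFS application path, and the nameNode/jobTracker parameters are assumptions for this sketch.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieDemo {
  public static void main(String[] args) throws Exception {
    // Oozie server URL is an assumption for this sketch.
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

    // Point at a workflow app (workflow.xml) already uploaded to HDFS.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/me/wf-app");
    conf.setProperty("nameNode", "hdfs://localhost:9000");
    conf.setProperty("jobTracker", "localhost:8032");

    // Submit and start the workflow, then check its status.
    String jobId = oozie.run(conf);
    System.out.println("workflow job: " + jobId
        + " status: " + oozie.getJobInfo(jobId).getStatus());
  }
}
```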
 
Hbase: This is the NoSQL database in the Hadoop ecosystem. Its data is stored as key-value pairs and each key is unique, so it can be used to deduplicate data, and it can store far more data than MySQL. It is therefore often used as the destination for big data once processing is complete.
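Here is a minimal sketch of the key-value model with the HBase Java client; the table name `events`, the column family `d`, and the row key format are assumptions, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseDemo {
  public static void main(String[] args) throws Exception {
    // Assumes a running HBase with a table 'events' having column family 'd'.
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("events"))) {

      // Writing the same row key twice simply overwrites it:
      // this uniqueness is what makes HBase handy for deduplication.
      Put put = new Put(Bytes.toBytes("user123#2019-06-01"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("42"));
      table.put(put);

      Result r = table.get(new Get(Bytes.toBytes("user123#2019-06-01")));
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"))));
    }
  }
}
```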
 
Kafka: This is a fairly easy-to-use queueing tool. Why a queue? You queue to buy tickets, and large amounts of data likewise need to queue up for processing, so the colleagues you collaborate with won't scream: why are you giving me so much data (say, files of several hundred GB), how am I supposed to handle it? Don't blame him; he doesn't work in big data. Instead you can tell him: I put the data in the queue, take one item at a time when you use it. Then he will stop complaining and immediately go off to optimize his program, because handling it is now his business rather than a problem you caused. Of course, we can also use this tool to store online real-time data or stream it into HDFS; in that case it is used together with a tool called Flume, which is dedicated to providing simple processing of data and writing it to various receivers (such as Kafka).
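The "put it in the queue, take one item at a time" handoff looks roughly like this on the producing side, using the standard Kafka Java client; the broker address and topic name are assumptions for this sketch.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address is an assumption for this sketch.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Instead of dumping hundreds of GB on a colleague at once,
    // feed records into a topic and let the consumer pull at its own pace.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < 1000; i++) {
        producer.send(new ProducerRecord<>("raw-logs", "line-" + i));
      }
    }
  }
}
```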
 
Spark: This is used to make up for the speed shortcomings of MapReduce-based data processing. Its hallmark is loading data into memory for computation rather than reading from dead-slow hard disks, which makes it especially well suited to iterative computation, so algorithm people are particularly fond of it. Spark is written in Scala, but you can work with it from either Scala or Java, since both run on the JVM.
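Here is a minimal sketch of Spark's in-memory style from the Java API; the master `local[*]` runs in-process for this sketch (on a real cluster you would submit to YARN, as described earlier), and the tiny inline dataset is an assumption.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkDemo {
  public static void main(String[] args) {
    // 'local[*]' runs in-process; on a cluster you would use YARN as the master instead.
    SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.parallelize(Arrays.asList("a b", "b c", "a a"));

      // cache() keeps the dataset in memory, which is what makes iterative
      // algorithms so much faster than re-reading from disk each pass.
      JavaRDD<String> words =
          lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()).cache();

      words.mapToPair(w -> new Tuple2<>(w, 1))
           .reduceByKey(Integer::sum)
           .collect()
           .forEach(t -> System.out.println(t._1() + " " + t._2()));
    }
  }
}
```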
 
Machine Learning (ML): This is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and more. It is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across every area of AI. It mainly uses induction and synthesis rather than deduction. Machine learning algorithms are relatively fixed and comparatively easy to learn.
 
Deep Learning (DL): The concept originates from research on artificial neural networks and has developed rapidly in recent years. Application examples of deep learning include AlphaGo, face recognition, and image detection. Such talent is scarce both at home and abroad, but deep learning is relatively hard to pick up, and its algorithms are updated quickly; it is best learned by following an experienced teacher.

Source: www.cnblogs.com/yjh123/p/10972351.html