The learning path from complete beginner to big data technical expert

How do you learn big data processing technology? First of all, you need to learn the Java language and the Linux operating system. These two are the foundation for learning big data, and you can learn them in either order.

  • [1] Java: Everyone knows that Java has three directions: JavaSE, JavaEE, and JavaME. Which one should you learn for big data?
    You only need to learn the standard edition, JavaSE. Things like Servlet, JSP, Tomcat, Struts, Spring, Hibernate, and MyBatis belong to the JavaEE direction; they are not used much in big data work, so a general understanding is enough. Of course, you do need to know how to connect to a database from Java, so JDBC is a must (a minimal sketch follows this list). Some students say: Hibernate or MyBatis can also connect to a database, so why not learn them? It would cost you a lot of time, and in the end they are rarely used on the job; I have never seen anyone use either of them in big data processing. That said, if you have energy to spare, you can study the principles behind Hibernate or MyBatis rather than just their APIs, which will deepen your understanding of how Java operates a database, because the core of those two technologies is Java reflection plus the various uses of JDBC.
  • [2] Linux: Because big data-related software runs on Linux, you need a solid grounding in Linux. Learning Linux well will help you master big data technologies faster and let you better understand the runtime and network environment of big data software such as Hadoop, Hive, HBase, and Spark, so you can avoid many pitfalls. Learning the shell and understanding scripts makes it much easier to read and configure a big data cluster, and it will also make new big data technologies easier to pick up in the future.
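
Since JDBC is the must-master piece above, here is a minimal sketch of querying a database from Java. It assumes a local MySQL with the Connector/J driver on the classpath; the database name, table, and credentials are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details -- replace host, database, user, password.
        String url = "jdbc:mysql://localhost:3306/testdb";
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM users WHERE id > ?")) {
            ps.setInt(1, 0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}
```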

Now that the basics are covered, let's talk about which big data technologies you need to learn. You can study them in the order I list them below.

  • [3] Zookeeper: This is an all-purpose tool. You will need it when installing Hadoop's HA (high availability), and HBase will use it later as well. It is generally used to store small pieces of coordination information, usually no more than about 1 MB each, and the software built on it depends on it being available. For us, it is enough to install it correctly and keep it running. (A small client sketch appears after this list.)
  • Mysql: Next, let's learn MySQL, a tool for processing small data, because it will be needed when installing Hive. What level of MySQL do you need? Enough to install it on Linux, get it running, configure simple permissions, change the root password, and create a database. The main thing here is to learn SQL syntax, because Hive's syntax is very similar to it. (The JDBC sketch above already shows how to query MySQL from Java.)
  • Sqoop: This is used to import data from MySQL into Hadoop. You can also do without it: exporting a MySQL table to a file and putting that file on HDFS achieves the same thing. Either way, pay attention to the load on MySQL when using it in a production environment.
  • Hive: For anyone who knows SQL syntax, this thing is a godsend. It makes processing big data easy, and you no longer have to write MapReduce programs. What about Pig, some people ask? Pig does roughly the same job; mastering one of the two is enough. (See the Hive query sketch after this list.)
  • Oozie: Now that you have learned Hive, I believe you will want this. It can manage your Hive, MapReduce, and Spark scripts, check whether your programs executed correctly, alert you and retry them when they fail, and, most importantly, configure the dependencies between your tasks. I believe you will like it; otherwise, staring at that pile of scripts and a dense wall of crontab entries will make you feel like shit.
  • Hbase: This is the NoSQL database of the Hadoop ecosystem. Its data is stored in key-value form, and keys are unique, so it can be used to de-duplicate data. Compared with MySQL, it can store a far larger amount of data, so it is often used as the storage destination once big data processing is complete. (A put/get sketch follows this list.)
  • Kafka: This is a rather easy-to-use queuing tool. What is a queue for? Lining up to buy tickets, you know? When there is too much data, it also needs to queue up to be processed, so the other students collaborating with you won't scream: why are you handing me so much data (say, hundreds of gigabytes of files), how am I supposed to handle it? Don't blame him for not doing big data; you can tell him: I put the data in the queue, take it record by record as you use it. Then he stops complaining and goes straight off to optimize his program, because keeping up is now his business rather than a problem you handed him. Of course, we can also use this tool to store online real-time data or land it into HDFS. For that you can pair it with a tool called Flume, which is built for simple data processing and for writing to various data receivers (such as Kafka). (A minimal producer sketch follows this list.)
  • Spark: It exists to make up for the slow speed of MapReduce-based data processing. Its key characteristic is loading data into memory for computation instead of reading from painfully slow disks, which makes it especially suitable for iterative computation, so people who work on algorithms are particularly fond of it. It is written in Scala, but either Java or Scala can drive it, since both run on the JVM. (See the word-count sketch below.)
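
For Zookeeper, "install it correctly and keep it running" is the goal, but a tiny client sketch shows what storing a small piece of coordination information looks like. This assumes a local ZooKeeper on the default port 2181 and the org.apache.zookeeper client library on the classpath; the znode path is made up:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local ZooKeeper; the watcher lambda just ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Store a small piece of coordination data (well under the ~1 MB limit).
        String path = "/demo-config"; // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "hello".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```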
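Because Hive speaks a SQL dialect and ships a JDBC driver, the MySQL/SQL skills above carry over almost directly. A minimal query sketch, assuming a HiveServer2 on the default port 10000, the Hive JDBC driver on the classpath, and a hypothetical logs table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, credentials, and table are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Plain SQL -- Hive compiles it into jobs so you never write MapReduce.
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + ": " + rs.getLong("cnt"));
            }
        }
    }
}
```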
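For Hbase, the key-value model above looks like this through the standard Java client. A sketch assuming a running HBase reachable via hbase-site.xml and a pre-created, hypothetical results table with a d column family:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("results"))) {

            // Write: the row key is unique, so re-writing the same key
            // overwrites rather than duplicates -- handy for de-duplication.
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("score"), Bytes.toBytes("42"));
            table.put(put);

            // Read the value back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user-001")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("score"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```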
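The queue idea behind Kafka becomes concrete with a few lines of producer code. A sketch assuming a broker on localhost:9092 and a hypothetical events topic; consumers on the other side pull records at their own pace:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // The producer just drops records into the queue; whoever consumes the
        // topic takes them one batch at a time, at whatever speed they can handle.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("events", "key-" + i, "value-" + i));
            }
        }
    }
}
```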
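And since Spark can be driven from Java as well as Scala, here is the classic in-memory word count as a sketch. It runs Spark locally inside the JVM; the input path is made up:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM -- fine for a first experiment.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("/tmp/input.txt"); // made-up path
            // The data stays in memory across these transformations -- this is
            // exactly where Spark beats disk-bound MapReduce.
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                 .mapToPair(word -> new Tuple2<>(word, 1))
                 .reduceByKey(Integer::sum)
                 .collect()
                 .forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```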

Once you know these things, you can work as a professional big data development engineer, and a monthly salary of 20K RMB is nothing special.

[4] Follow-up improvement: Of course, there is still further to go. You can learn Python, for example, and use it to write web crawlers; that way you can produce your own data, pulling all kinds of data from the web down to your cluster for processing. Finally, learn the principles behind algorithms such as recommendation and classification, so that you can communicate better with algorithm engineers. After that, your company will find it even harder to do without you.

Big data video tutorial: https://v.qq.com/x/cover/p6qfuc14lejffod.html


