Big data learning roadmap, time to take off~

Roll up your sleeves, let's get started!

I think most of us would agree that data is one of the core assets of an enterprise. After all, even XXX has said that this is an era where data is king, and whoever holds the data controls the future!

No wonder everyone is so eager to get into big data.

Note: This article is included in the GitHub open-source project github.com/hansonwang99/JavaCollection , which contains detailed self-study roadmaps, interview questions and interview experiences, programming materials, and series of technical articles for all major programming directions. The resources are continuously updated.


Fundamentals of Big Data Development

Learning a programming language is often the first big step in the journey. Many frameworks in the big data field are built in Java, and most of them also expose Java APIs as their interfaces for use and operation, so there is no escaping Java. Beyond that, Scala is worth learning when needed; it is still used a lot in big data development. Scala is highly expressive, with a very high signal-to-noise ratio in code, and many big data frameworks provide Scala development interfaces as well. Moreover, Scala runs on the JVM and interoperates with Java programs, so it integrates well with big-data-related systems.

In addition, the classic fundamentals of data structures and algorithms, computer networks, operating systems, databases, and design patterns are essential general computer-science foundations for programmers. Not only big data engineers but also back-end developers must master them, and they come up constantly in job interviews, so it is worth spending plenty of time to let this part sink in.

Finally, I want to mention the requirements around the Linux operating system. Here we mainly focus on using Linux, because big data systems are basically developed and deployed on Linux environments. Mastering common commands, configuration, network and system management, and basic shell programming is of great benefit to later learning.


Basic development tools

Common development tools and software in the big data field are basically the same as those for back-end development: a common Linux distribution, a handy set of SSH and FTP/SFTP tools, a good integrated development environment, plus mainstream version control and build tools.

Next we move into the actual workflow of big data development, which we will split into several chunks. The first is data collection.


Data collection

Since big data systems deal with massive amounts of data, the first questions are: what kinds of data are we dealing with, and where does it come from?

The range of input data types a big data system handles is wide, and their structures differ: there is traditional structured data, semi-structured data such as XML and JSON, and even unstructured data such as documents, audio and video.

The data sources are even more diverse: data straight from existing back-end databases, data from back-end logging systems, all sorts of data from third-party services, and even data crawled from the Internet.

Once the data sources are identified, the subsequent work of data collection and transmission becomes very important.

Take the most common case of back-end log data as an example. Since service systems are now mostly deployed as clusters, collecting and transmitting log data across a distributed cluster is a real problem. Flume is a commonly used distributed framework for data collection and aggregation, and its most typical application is log collection. It lets you customize various data senders to aggregate data, provides simple processing of that data, and writes it out to various data receivers, completing the transmission.

There is also Logstash, an open-source data collection engine you may have heard of, which is likewise quite commonly used.

Of course, there is another scenario that usually needs to be considered in the data collection step: data migration (import/export) between different storage systems or databases. For example, we often need to move data back and forth between a traditional relational database (such as MySQL) and a big data warehouse (such as Hive). Sqoop is a very commonly used tool for this kind of data collection and transfer, and Taobao's open-source DataX is a tool of the same type.


Data storage

Once data collection is done, the data needs to be stored. This step in the pipeline is equally clear-cut.

When it comes to data storage, the first thing that comes to mind is of course databases, including the most common relational databases such as MySQL and SQL Server, as well as non-relational (NoSQL) databases such as Redis, MongoDB and HBase.
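To make the relational side concrete, here is a minimal sketch of querying MySQL from Java through the standard JDBC API. The connection details and the demo_db/users table are hypothetical placeholders for illustration, so substitute your own.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcQueryDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database and credentials; replace with your own.
        String url = "jdbc:mysql://localhost:3306/demo_db?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "demo_user", "demo_pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM users WHERE id = ?")) {
            ps.setInt(1, 1); // query the user whose id is 1
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " -> " + rs.getString("name"));
                }
            }
        }
    }
}
```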

Elasticsearch deserves a separate mention here. Although it can to some extent be regarded as a database, its more important identity is that of an excellent full-text search engine. It handles tasks that traditional relational databases and NoSQL databases cannot do efficiently, such as full-text search, structured search, and even data analysis, which is why more and more companies are adopting it.
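For a concrete feel, below is a minimal sketch of indexing and then searching a document with the Elasticsearch Java high-level REST client. It assumes a 7.x version of the client and a node at localhost:9200; the articles index and its fields are made up for illustration.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class EsDemo {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index one document into the hypothetical "articles" index.
            IndexRequest index = new IndexRequest("articles").id("1")
                    .source("{\"title\":\"hello\",\"body\":\"full text search demo\"}",
                            XContentType.JSON);
            client.index(index, RequestOptions.DEFAULT);

            // Full-text match query on the "body" field.
            SearchRequest search = new SearchRequest("articles");
            search.source(new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("body", "search")));
            SearchResponse response = client.search(search, RequestOptions.DEFAULT);
            System.out.println("hits: " + response.getHits().getTotalHits().value);
        }
    }
}
```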

Beyond traditional databases, widely used storage technologies in the big data field also include distributed file systems and distributed databases. The most famous distributed file system is HDFS, which is not only a basic data storage platform but also a piece of big data infrastructure. A representative distributed database is HBase, which is built on top of HDFS and is suited to storing massive amounts of data.
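As a taste of the HDFS Java API, here is a minimal sketch that uploads a local log file into an HDFS directory. The NameNode address hdfs://namenode:8020 and the file paths are assumptions; point them at your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsUploadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            Path localFile = new Path("/tmp/access.log"); // local file to upload
            Path hdfsDir = new Path("/data/logs/");       // target directory on HDFS
            fs.mkdirs(hdfsDir);                           // create the directory if it does not exist
            fs.copyFromLocalFile(localFile, hdfsDir);     // copy the local file into HDFS
            System.out.println("Uploaded to " + hdfsDir);
        }
    }
}
```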

Besides distributed databases and distributed file systems, another term you often hear in the big data field is the data warehouse, with Hive as its representative. A data warehouse can be understood as a logical concept whose underlying storage is often a file system. Taking Hive as an example, it mainly lets developers process and conveniently operate on data stored in HDFS by writing SQL, which suits offline batch processing, is beginner-friendly, and lowers the entry barrier.
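To show what "operating on HDFS data through SQL" looks like in practice, here is a minimal sketch that runs a HiveQL aggregation through HiveServer2 using Hive's JDBC driver. The server address, the database, and the access_log table are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive JDBC driver
        // Hypothetical HiveServer2 address and database; adjust to your environment.
        String url = "jdbc:hive2://hiveserver2:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // A plain SQL aggregation; Hive turns it into batch jobs over data stored on HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dt, COUNT(*) AS pv FROM access_log GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + " : " + rs.getLong("pv"));
            }
        }
    }
}
```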

A phased summary of this part of the content can be drawn as follows:


Data processing

Once the data has landed in storage, what comes next? Naturally, we want to fully mine the value it contains; put bluntly, we run all kinds of queries, analysis and computation on it so that the data is empowered and generates value.

The earliest example is MapReduce, the distributed computing framework provided by Hadoop. It can be used for statistics and analysis over massive data on HDFS and suits offline batch processing that is not latency-sensitive. The later in-memory computing framework Spark is better suited to iterative computation, so it has also become very popular. These frameworks are widely used in scenarios that do not require real-time results, but where offline analysis cannot meet the need, such as financial risk control or real-time recommendation, online (streaming) computation becomes essential, and that is the home turf of excellent real-time computing frameworks such as Storm and Flink. Flink in particular has been extremely popular over the past few years, and processing engines built on top of it keep appearing.
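To ground this, here is a sketch of the classic WordCount job written against the Hadoop MapReduce Java API: the map phase emits (word, 1) pairs and the reduce phase sums them per word. The input and output paths are passed as program arguments and are assumed to be HDFS directories.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS directories are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```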


Data value and application

The ultimate task of a big data system is to serve the business and create real value for production. Such application scenarios include, but are not limited to, statistical reports, product recommendation, data visualization, business analysis, decision support, and so on.


Big data peripheral technologies

At this point, the content above has basically covered the main pipeline of a big data engine. A real big data system, however, still needs the support of many peripheral technologies, and a number of additional frameworks have grown up around it.

Because of the limits and bottlenecks of single-machine performance, many components of a big data system are deployed as clusters. Tools for deploying, managing and monitoring these clusters therefore become indispensable, such as the widely used Ambari and Cloudera Manager.

Once you have a cluster, managing its resources and scheduling the various tasks running on it becomes a complex and thorny problem. This is where the resource management framework YARN and workflow schedulers such as Azkaban and Oozie show their strengths.

Meanwhile, to ensure the high availability of distributed clusters, distributed coordination services like ZooKeeper are a great help; tasks such as Master election, cluster membership management, and distributed coordination and notification all rely on it.
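As a small example of what such coordination looks like, here is a minimal sketch using the ZooKeeper Java client to register an ephemeral node, the building block behind service registration and Master election. The ensemble address, the /services path, and the worker address are hypothetical.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.util.concurrent.CountDownLatch;

public class ZkRegisterDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical ensemble address; adjust to your ZooKeeper cluster.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown(); // the session is established
            }
        });
        connected.await();

        // Make sure the (hypothetical) parent node exists.
        if (zk.exists("/services", false) == null) {
            zk.create("/services", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // An ephemeral sequential node disappears when the session ends,
        // which is what patterns like service registration and Master election build on.
        String path = zk.create("/services/worker-", "192.168.1.10:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered at: " + path);
        zk.close();
    }
}
```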

Finally, there is one famous piece of middleware that has to be mentioned: Kafka. It is more than a high-throughput messaging system; with it come system decoupling, peak-load buffering, and efficient stream processing, which make it a darling of both back-end and big data developers.
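To give a flavour of how simple it is to get data into Kafka, here is a minimal sketch of a Java producer sending one message. The broker addresses and the user-events topic are made-up placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker list; adjust to your cluster.
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record goes to a topic; downstream consumers (stream jobs, storage
            // sinks, etc.) read it independently, which is what decouples the systems.
            producer.send(new ProducerRecord<>("user-events", "user-42", "{\"action\":\"click\"}"));
            producer.flush();
        }
    }
}
```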


Summary

Finally, we also attach the full version of the mind map covering all of the content above. The map is quite large, so it may not be easy to view here.


A few topics to discuss

How closely related are big data development and back-end development?

It's fair to say that many technical points and frameworks overlap. The general programming fundamentals are exactly the same, and the extremely mainstream frameworks such as Redis, ZooKeeper, Kafka and Elasticsearch also appeared when we laid out the Java back-end route, so the intersection between the two is large. In fact, many big data engineers transitioned from back-end development, which is quite natural, because so many of the technologies are related or even identical.

Do you have to learn so many frameworks?

There are so many frameworks in the big data field that you can hardly count them all; the mind map above alone mentions at least 30 or 40. Do you need to learn every one? When we laid out the route we often listed more than one mainstream framework of the same type, but generally speaking, once you understand one of them, picking up the other technologies of the same type is not hard. Learning to draw inferences from one case is very important. Also, try to learn the mainstream, classic frameworks first; that is usually a safe bet. For example, the distributed file system HDFS is a classic that is used everywhere, and Flink is currently very hot in stream processing, so you can consider those when you study the corresponding parts on your own.

How should you learn a specific framework (technology)?

Finally, we come down to how to learn a specific technology (framework). I think the approach is clear. The first step is to figure out what the framework does, what problems and pain points it solves, and what similar "competing products" exist; this step has already been done for you in the detailed mind map above. The second big step is hands-on use, to build a sense of accomplishment. How? The idea is also clear: first install and deploy the environment and get it running, then experiment on that environment, run the demos, write something yourself and run it, going from simple to complex until you gradually become proficient. You will certainly step into pits along the way, so keeping records and notes, and writing up the pits you hit and how you solved them, is very important; take it step by step. The last big step is to study the principles behind its key internal mechanisms; whatever you learn there is pure gain. So overall, it is these three steps.


Postscript

Hardcore content like this is not easy to produce, so please don't just read and run; I hope you'll give it your triple support (like, favorite, share).

Finally, special thanks to my senior brother Yun, the head of Juhuayun, for the guidance and help in putting this route together. I'd like to call him the Zhou Fatlun of the KTV room and a leading figure of the big data industry.


See you next time.
