What is big data, and how should big data be learned?

1. How to read this article

Yet another technical article... Fan A: So this is why you delayed the article and video series??? Me: Exactly! Fan B: Gone for this long? Me: Cough... I've been tied up with chores lately; I hope you can forgive me! Here is how to read this article:

  1. This article is for you if you don't know much about big data yet, or if you are still unsure whether to learn it: it walks you through the industry's needs and the related positions. It also suits you if you have just stepped into the big data field. Feel free to bookmark it and share it with friends around you.
  2. The author has been engaged in big data development and training for many years, has refined complete big data curriculum systems for several institutions, has designed and implemented big data training programs for several universities, and has run many teacher trainings and learning exchanges for faculty at key universities. I hope this little bit of knowledge can help everyone.
  3. This article does not present big data as an omnipotent tool that solves every problem; it objectively explains its role and the kinds of problems it can solve. I hope to introduce the field as completely as possible. How you choose is up to your actual situation. If you have questions, leave a message in the comment area or join the fan group to talk with me directly.

2. Basic concepts of big data

1. What is big data

As for what big data is, I think everyone already knows something about it; many real cases have made their way into our daily lives. Big data is characterized by large data volume, rich and complex data types, and fast data growth. Data analysis only makes sense when it is based on real data sets, and data quality itself is one of the important factors that affect the results of big data analysis.

As learners, what we should care about is what problems big data can solve, which fields it applies to, what content to learn, and where to focus. Simply put, what we need to learn is a series of big data ecosystem components, plus the analysis methods and thinking that run through the whole data analysis process, and the thinking is the more important part! Only by clarifying the analysis scenario and process can we determine which big data components need to be integrated to solve a given problem. Next, let's open the door to this field together~

2. How data is generated and collected

The first step of big data analysis is collecting and managing data, so we first need to understand how data is generated and how it is captured. Can that seemingly messy data really be analyzed?

  • Active data generation and user behavior data collection

Actively generated data is easy to understand: when we use the Internet or various applications, data is produced as we fill in and submit forms. Likewise, offline activities such as opening a bank account or filling in paper forms eventually become electronic data and flow into systems. We usually classify this kind of behavior as user registration, which is often the starting point for data generation. (Of course, the analysis we do may not care about the user's own information.) In addition, when using platform features, users upload and publish all kinds of data, such as text, audio, and video; these are all ways data is generated and accumulated.
More user behavior data comes from in-app instrumentation (embedded tracking points) and capture, because users must interact with the interface through mouse clicks or finger touches to use an application. Take a web application (website) as an example: essentially all mouse behavior can be captured through event listeners, including how long the mouse stays in an area and whether it clicks, and from this behavior data we can even draw a heat map of the whole page.

In different application scenarios, we can further split dimensions such as behavior type, function module, and user information for deeper analysis. A toy sketch of the heat-map idea follows.
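To make the heat map concrete, here is a minimal Python sketch. It assumes a hypothetical log of click coordinates already captured by front-end event listeners; the sample data and bucket size are invented for illustration only.

```python
from collections import Counter

# Hypothetical click events captured by front-end event listeners:
# each record is (page_x, page_y) in pixels.
click_events = [(102, 340), (110, 335), (480, 90), (105, 338)]

GRID = 50  # bucket size in pixels (an arbitrary choice for the sketch)

def to_bucket(x, y, grid=GRID):
    """Map a pixel coordinate to a coarse grid cell for heat-map counting."""
    return (x // grid, y // grid)

heat = Counter(to_bucket(x, y) for x, y in click_events)

# Cells with the most clicks correspond to the "hottest" page regions.
for cell, count in heat.most_common(3):
    print(f"grid cell {cell}: {count} clicks")
```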

  • Structured and unstructured data

The most common structured data is data stored in relational databases such as MySQL and Oracle. Such data has one characteristic: it is highly standardized. Relational databases use schema-on-write, meaning data that does not match the predefined types and constraints fails validation and cannot be stored. Besides data already in a database, data files that can be imported directly into a database, such as CSV files, can also be considered structured; they usually need a uniform column separator, a uniform row separator, a uniform date format, and so on.
Unstructured data refers to everything outside the structured category. It usually has no predefined organization and is stored in non-relational (NoSQL) databases such as Redis and MongoDB; it may also be non-text data, which requires specialized means to process and analyze.
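As a small illustration of what "standardized" means in practice, here is a hedged Python sketch that reads a hypothetical CSV file with a uniform separator and date format, and imitates schema-on-write by rejecting rows that fail validation. The file name and column names are assumptions, not from the original.

```python
import csv
from datetime import datetime

# Hypothetical export file: comma-separated columns, one record per line,
# dates in a single agreed-upon format -- the "rules" that make it structured.
with open("users.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        # Imitate schema-on-write: reject rows whose fields do not parse.
        try:
            user_id = int(row["id"])
            signup = datetime.strptime(row["signup_date"], "%Y-%m-%d")
        except (KeyError, ValueError):
            continue  # fails validation, would not enter the database
        print(user_id, signup.date())
```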

3. Can big data really predict?

Rather than asking whether big data can predict, it is better to talk about how big data predicts. Combined with the field of artificial intelligence it gets complicated, so let's take a relatively simple scenario: using statistical analysis to assist decision-making, or training a model with classic data mining algorithms. Since it is a prediction, it may or may not be accurate. What the analyst needs to do is use the various data dimensions sensibly and, with the appropriate algorithms or statistical methods, train or fit an underlying pattern. The process is like being given three points (1, 1), (2, 2), (3, 3) and guessing that the function is probably y = x. The real analysis process is far more complicated, of course; after all, many functions pass through these three points, so which one is the pattern I want? This requires equal weight on theoretical knowledge and industry experience, and only continuous polishing and tuning yields a reliable model.
What we can be sure of is that big data predictions and recommendations are algorithm-based, mathematical, and scientific, but they are not 100% accurate.
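The three-point example can be checked in a few lines of Python. This is only an illustration using NumPy's least-squares fit, not how a production model would be built:

```python
import numpy as np

# The three sample points from the text: (1, 1), (2, 2), (3, 3).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # ~1.0 and ~0.0, i.e. y = x

# A higher-degree fit also passes through the points -- the "many functions
# satisfy three points" problem; model choice needs theory plus domain sense.
coeffs = np.polyfit(x, y, deg=2)
print(np.round(coeffs, 6))  # the quadratic term collapses to ~0 here
```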

3. What is big data development

After understanding what big data is, let's introduce the job of big data development. We'll go straight to some job descriptions (JD: Job Description) to give everyone a taste, then explain a big data development engineer's main work, and finally summarize the skills that need to be mastered.

  • JD.com Big Data Development Engineer JD:

(screenshot omitted)

  • Xiaomi Big Data Development Engineer JD:

  • Didi Big Data Development Engineer JD:

  • The main work

From the job descriptions above, we can see that big data development engineers generally work close to the business: either doing targeted data processing for a particular scenario, or building a big data product. Here we should also correct a small misconception: some people think that only companies with massive data and large user bases need big data roles. Not so; besides analyzing a company's own business data, one can also build general-purpose big data products. Although there are fewer big data positions than ordinary development positions, the demand is real.
When analyzing the company's own business data, the focus is usually on using big data components and algorithm libraries to build a workable data analysis pipeline. As you may have noticed, few big data jobs now involve no algorithms at all. "Algorithm" here does not mean data structures; it means machine learning libraries and data-mining-related algorithms. At the very least you must know how to control an algorithm's input and output and which problems it can solve; it may not involve building models yourself. This is covered in detail in the big data analysis section.
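A minimal sketch of what "controlling the input and output of the algorithm" can look like, taking scikit-learn as an example library; the toy features and labels below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration: features in, predictions out, no hand-written math.
X = np.array([[0.5], [1.5], [2.5], [3.5]])  # hypothetical feature column
y = np.array([0, 0, 1, 1])                  # hypothetical labels

model = LogisticRegression()
model.fit(X, y)                      # input: feature matrix + labels
print(model.predict([[2.0]]))        # output: predicted class for new data
print(model.predict_proba([[2.0]]))  # output: class probabilities
```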

If the job is developing a big data product, such as a modeling platform, or a solution dedicated to data collection and data visualization, then it suits people transitioning from ordinary development engineer to big data development engineer: it is essentially adding big data components to the lower layers of an application. This requires us not only to understand the usual server-side frameworks, but also to be comfortable with the big data development APIs.

  • Skills to master

The skills needed for big data development can be summarized as follows:

  1. Operating system: Linux (basic operations, software maintenance, permission management, scheduled tasks, simple shell scripting, etc.)
  2. Programming language: Java (mainly), Scala, Python, etc.
  3. Data collection components and middleware: Flume, Sqoop, Kafka, Logstash, Splunk, etc.
  4. Core components of big data clusters: Hadoop, Hive, Impala, HBase, Spark (Core, SQL, Streaming, MLlib), Flink, ZooKeeper, etc.
  5. Educational background: a computer science or big data related major

4. What is big data analysis

As for data analysts, this is not the focus of this article, because the threshold is relatively high and the role leans toward mathematics and statistics: it is more about dealing with data and algorithms, and the product of the programming is usually not an application but an algorithm model. Let's look at some related JDs first:

  • Little Red Book Data Analyst JD:

(screenshot omitted)

  • JD.com Data Analyst JD:

  • Sina Weibo data analyst:

(screenshot omitted)

  • The main work:

If big data development job requirements read as terse bullet points... then data analyst job requirements tend to read as whole paragraphs... As you can see from the postings above, each position describes its business scenario in detail. After all, one of a data analyst's main tasks is to build algorithm models, which means deep cultivation of a vertical domain. Usually we cannot use existing algorithms off the shelf; we must evaluate, optimize, or combine them. On top of that, you need business experience in the field to do this well.

  • Skills to master:

The skills that data analysts and algorithm engineers need to master can be summarized as follows:

  1. Programming language: Python, R, SQL, etc.
  2. Modeling tools: MATLAB, Mathematica, etc.
  3. Familiar with machine learning libraries and classic data mining algorithms
  4. Educational background: mathematics, statistics, or a computer-related major, with sensitivity to data

5. How to learn big data

The two main big-data-related jobs have been introduced above; in fact there are many more. For example, ETL engineer can be considered a derived role, because as data volume keeps growing, both banks internally and big data service companies are transitioning from traditional ETL tools to big data clusters.
With so many technical points involved, how do we learn efficiently? The entry point is big data development. There is not much to say about the Linux and programming language parts: don't skip things just because they look useless; programming mindset and problem-solving methods matter just as much, and whatever the textbook covers must be solid. The big-data-related components can look overwhelming, and many learners dive deep into each component's usage, operators, functions, and APIs. There is nothing wrong with that, but don't lose sight of the main thread buried within, namely: a complete data analysis process. While learning, we must understand each component's characteristics, its differences from the others, and the data scenarios it suits.

  • Offline computing

Offline computing works on historical data, that is, data that will no longer change: once the data source is fixed, nothing is added or updated. It suits scenarios with low real-time requirements; in most cases a certain indicator is computed periodically or a job is executed on a schedule, and the run time can basically be kept at the minute level. A typical component chain looks like this (a small PySpark sketch follows the list):

  1. Data source: data files, data in the database, etc.
  2. Data collection: Sqoop, HDFS data upload, Hive data import, etc.
  3. Data storage: HDFS
  4. Data analysis: MapReduce, HiveQL
  5. Calculation result: a Hive result table (queried via Hive JDBC), or exported to a relational database
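
Below is a minimal PySpark sketch of such an offline job, under the assumption of a CSV dataset on HDFS; the paths and the `city` and `amount` columns are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A minimal offline job: read a historical dataset from HDFS, aggregate,
# and write the result back out. Paths and column names are hypothetical.
spark = SparkSession.builder.appName("daily-order-stats").getOrCreate()

orders = spark.read.csv("hdfs:///data/orders/2020-09-01.csv",
                        header=True, inferSchema=True)

# Periodic indicator: order count and total amount per city.
stats = (orders.groupBy("city")
               .agg(F.count("*").alias("order_cnt"),
                    F.sum("amount").alias("total_amount")))

stats.write.mode("overwrite").parquet("hdfs:///result/daily_order_stats")
spark.stop()
```
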
  • Real-time computing

Real-time computing faces data that keeps flowing in, and we need suitable components to process it as it arrives. Sometimes more data flows in per unit time than can be consumed; sometimes less flows in and consumption is fast. So during collection we must make sure no data is lost, which requires middleware to buffer and manage it. For the computation itself we can use micro-batching or other approaches, and at the same time we must handle merging the computation results so the latest figures can be displayed in real time. A typical chain (with a streaming sketch after the list):

  1. Data source: incremental monitoring of log files, etc.
  2. Data collection: Flume
  3. Middleware: Kafka
  4. Data analysis: Spark Streaming, Flink, etc.
  5. Calculation result: HBase
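
And a minimal Spark Structured Streaming sketch of the real-time chain, reading from Kafka and re-emitting merged counts; it assumes the Spark Kafka connector package is available, and the broker address, topic name, and window size are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal streaming sketch: consume a Kafka topic, count events per minute,
# and keep re-emitting the merged totals. Broker and topic are hypothetical.
spark = SparkSession.builder.appName("realtime-event-count").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka1:9092")
          .option("subscribe", "app_events")
          .load())

# Kafka delivers key/value as binary; decode the value and window by time.
counts = (events.selectExpr("CAST(value AS STRING) AS msg", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# "complete" output mode re-emits the merged totals on each trigger -- one
# way to handle the result-merging problem mentioned above.
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```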

The above simply lists some component-integration schemes that implement data flows in different scenarios. What I want to tell you is: be good at discovering and summarizing the characteristics of different components and putting the right component in the right place; this is exactly the kind of scenario question interviewers like to ask. The usage and API calls of each component are not that complicated in themselves. The focus is the process, the integration and connection between components, and continuously absorbing and strengthening data analysis and processing thinking, until you can directly translate a requirement into a data analysis plan. That is the real focus of learning.
Everything in this article is just my own shallow understanding, and is only suitable as a reference for those just starting to learn or just starting related work. Please bear with me if anything is wrong.

Thanks to the original blog for introducing me to big data. I am actually a big data student myself, but I still don't know much. The original blog: https://blog.csdn.net/u012039040/article/details/108589729

Source: https://blog.csdn.net/My_daily_life/article/details/108820540