Big data system

1. Data Scientist

Since the rise of big data, data science has become a hot topic in the data field, and "Data Scientist" now appears as a job title in many postings. So what exactly is data science? How are big data and data science related, and what role does big data play in data science? This article is meant as a general introduction. It aims to give people who already work with data, or are about to, a comprehensive picture of data science work, and to give those who are considering a move into big data a sense of the industry before they commit to it. Data science is an interdisciplinary field (as shown in the figure below). To be a well-rounded data scientist, you need solid knowledge of mathematics and computer science, as well as expertise in a particular domain, and all of the work revolves around data. Since data volumes exploded, big data has come to be regarded as a branch of data science.
[Figure: Big data system]

2. Big data system

Big data has actually been around for many years, but with ubiquitous sensors and ubiquitous event tracking, data has become easier to acquire, larger in volume, and more diverse. As a result, the traditional data field has had to move to new platforms that can process and make use of the growing volume of data. The following two points elaborate:

  • A point raised by Dr. Wu Jun: existing industry + new technology = new industry. Big data follows this principle too, but what it creates is not just a new industry but a complete industry chain: original data field + new big data technology = big data industry chain;
  • The scope of data use. Traditional data applications mainly drew a sample from the existing data, then applied data mining and analysis to discover patterns in the data for prediction or decision-making. Sampling, however, always discards part of the data, so some potential patterns and value are lost. As data volume and content continue to accumulate, enterprises increasingly prefer to work with the full data set, covering as many potential patterns as possible in order to find value, including value nobody thought to look for.
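The point about sampling can be illustrated with a small, deliberately contrived sketch. The data set, the positions of the rare records, and the 1% systematic sampling scheme are all assumptions made up for this example; real sampling schemes and real data will behave differently.

```python
# Hypothetical data set: 100,000 records, three of which carry a rare pattern.
RARE_IDS = {12345, 54321, 99999}  # made-up positions, chosen for illustration

records = [{"id": i, "rare": i in RARE_IDS} for i in range(100_000)]


def count_rare(rows):
    """Count the records that carry the rare pattern."""
    return sum(1 for row in rows if row["rare"])


# A 1% systematic sample: keep every 100th record.
sample = records[::100]

full_count = count_rare(records)
sample_count = count_rare(sample)

print(f"full data: {full_count} rare records out of {len(records)}")
print(f"1% sample: {sample_count} rare records out of {len(sample)}")
```

In this toy case the sample contains none of the rare records, while a scan of the full data finds all three. That is the kind of loss the full-data approach is meant to avoid.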

    Big data is a chain, or pipeline, built on the flow of data. Where data comes from and where it goes is not just a philosophical question; it is also worth asking in day-to-day data work. As shown in the figure below, the big data field can be divided into the following main directions, each of which corresponds to certain job positions:

[Figure: Big data system]

1. Data platform

Data Platform: build and maintain a stable, secure big data platform; design big data architecture on demand; research and select big data technology products and solutions; and deploy them to production. This work requires some familiarity with most of the technologies used in the big data field, along with the mindset and skills of distributed systems.

Corresponding positions: big data architect, data platform engineer

2. Data collection

Data Collection: obtain data from channels such as the web, sensors, and RDBMSs, and provide data sources for the big data platform.

Corresponding positions: crawler engineer, data acquisition engineer

3. Data warehouse

Data Warehouse: the work is similar to traditional data warehousing (designing the warehouse's layered structure, ETL, and data modeling), but it runs on a different platform. In the big data era, most data warehouses are built on big data technology, for example a Hive data warehouse on top of Hadoop.
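As a rough illustration of a layered warehouse, the sketch below loads a raw layer and derives a summary layer from it. It uses Python's built-in sqlite3 module as a stand-in for Hive purely so that it runs anywhere; the table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real warehouse

# Raw layer: events loaded as-is from the source systems.
conn.execute("CREATE TABLE raw_orders (user_id TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("u1", 30.0, "2020-12-01"),
        ("u1", 20.0, "2020-12-01"),
        ("u2", 15.0, "2020-12-01"),
        ("u2", 5.0, "2020-12-02"),
    ],
)

# Summary layer: one row per user per day, derived from the raw layer.
conn.execute(
    """
    CREATE TABLE daily_user_spend AS
    SELECT user_id, order_date, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM raw_orders
    GROUP BY user_id, order_date
    """
)

summary = conn.execute(
    "SELECT user_id, order_date, total_amount, order_count "
    "FROM daily_user_spend ORDER BY user_id, order_date"
).fetchall()

for row in summary:
    print(row)
```

The same raw-to-summary pattern applies on Hive, where the transformation step would typically be written in HiveQL and scheduled as an ETL job.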

Corresponding positions: ETL engineer, data warehouse engineer

4. Data processing

Data Processing: carry out the processing or data cleaning that specific requirements call for. In small teams this role is often merged with data warehousing. In the past, ETL tools were typically used to configure filters and transformations directly, with little code involved. Today, data processing on a big data platform relies more on code, which allows more varied processing. The required technologies include Hive, Hadoop, and Spark. Do not underestimate data processing: the quality of subsequent data analysis and data mining depends on it, so it holds a particularly important place in the whole pipeline.
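A minimal sketch of code-based data cleaning follows. It is written in plain Python rather than Spark or Hive so that it is self-contained; the field names, the sample rows, and the cleaning rules (trim whitespace, drop rows without a numeric age, keep the first record per user) are assumptions chosen for illustration.

```python
# Hypothetical raw rows as they might arrive from a collection job.
raw_rows = [
    {"user_id": " U1 ", "age": "34", "city": "Beijing"},
    {"user_id": "u2", "age": "", "city": "Shanghai"},      # missing age
    {"user_id": "u1", "age": "34", "city": "Beijing"},     # duplicate of the first row
    {"user_id": "u3", "age": "abc", "city": "Shenzhen"},   # unparseable age
    {"user_id": "u4", "age": "27", "city": " Guangzhou"},
]


def clean(rows):
    """Normalize fields, drop invalid rows, and remove duplicate users."""
    seen = set()
    cleaned = []
    for row in rows:
        user_id = row["user_id"].strip().lower()
        age_text = row["age"].strip()
        if not user_id or not age_text.isdigit():
            continue  # drop rows with a missing user ID or a non-numeric age
        if user_id in seen:
            continue  # keep only the first record per user
        seen.add(user_id)
        cleaned.append(
            {"user_id": user_id, "age": int(age_text), "city": row["city"].strip()}
        )
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

On a big data platform the same rules would usually be expressed as distributed transformations, but the logic of each rule stays the same.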

Corresponding positions: Hadoop engineer, Spark engineer

5. Data analysis

Data Analysis: analyze data using statistical methods such as regression analysis, analysis of variance, and correlation analysis. On the big data side, technologies for ad hoc interactive analysis and SQL on Hadoop include Hive, Impala, Presto, and Spark SQL, while Kylin supports OLAP.
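Two of the statistical methods named above, correlation analysis and simple linear regression, can be sketched in a few lines of plain Python. The `ad_spend` and `sales` figures are invented for this example, and the helper functions are illustrative rather than taken from any particular library.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical observations: advertising spend versus sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

r = pearson_r(ad_spend, sales)
slope, intercept = fit_line(ad_spend, sales)
print(f"correlation: {r:.4f}")
print(f"fitted line: sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

In practice an analyst would more likely run such calculations through a statistics package or a SQL engine, but the underlying formulas are the ones shown here.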

Corresponding position: data analyst

6. Data mining

Data Mining is a fairly broad concept; put simply, it means finding useful information in large amounts of data. In big data, data mining mainly means designing and implementing data mining algorithms on big data platforms, such as classification, clustering, and association analysis.
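As one small example of association analysis, the sketch below counts how often pairs of items appear together in a set of transactions, which is the support measure that frequent-itemset methods build on. The baskets and the 0.5 support threshold are assumptions for illustration; it is not a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "butter"},
]


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}


support = pair_support(baskets)

# Keep the pairs that appear in at least half of the baskets.
frequent_pairs = {pair: s for pair, s in support.items() if s >= 0.5}
print(frequent_pairs)
```

On a big data platform the counting step would be distributed across machines, since the number of candidate pairs grows quickly with the number of distinct items.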

Corresponding position: data mining engineer

7. Machine learning

Machine Learning is often discussed together with data mining, and the two are sometimes even treated as the same thing. Machine learning is an interdisciplinary field spanning computer science and statistics. Its basic goal is to learn a function (mapping) x -> y for classification or regression. It is so often paired with data mining because much data mining work is now done with algorithms and tools from machine learning. Personalized recommendation is one example: machine learning algorithms analyze the purchase, browsing, and favorites logs on a platform to produce a recommendation model that predicts which products you will like.
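To make the idea of learning a mapping x -> y concrete, here is a minimal sketch of a 1-nearest-neighbor classifier, one of the simplest classification methods. The features (page views, items added to favorites), the labels, and the training examples are hypothetical, and a real recommendation model would be far more elaborate.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_1nn(train_x, train_y, x):
    """Return the label of the training example closest to x."""
    nearest = min(
        range(len(train_x)), key=lambda i: squared_distance(train_x[i], x)
    )
    return train_y[nearest]


# Hypothetical features: (product page views, items added to favorites).
train_x = [(1, 0), (2, 0), (8, 3), (9, 4)]
train_y = ["no_purchase", "no_purchase", "purchase", "purchase"]

prediction_a = predict_1nn(train_x, train_y, (7, 2))
prediction_b = predict_1nn(train_x, train_y, (1, 1))
print(prediction_a, prediction_b)
```

Here the "learned function" is simply the stored training set plus the distance rule: a user who behaves like past buyers is predicted to buy, and one who behaves like past non-buyers is not.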

Corresponding positions: algorithm engineer, researcher

8. Deep learning

Deep Learning is a topic, and a very popular one, within machine learning. In terms of content, it is derived from neural network algorithms. It has achieved very good results in classifying and recognizing images, speech, and natural language. In practice, most of the work is parameter tuning.

Corresponding positions: algorithm engineer, researcher

9. Data visualization

Data Visualization: present the high-value data that comes out of analysis and mining to managers, customers, and users in a more attractive and flexible form. The work leans toward front-end development and may call for some design sense. The goal is to present the value of the data in the most suitable way, taking user preferences into account.

Corresponding positions: data engineer, BI engineer

10. Data application

Data Application: applications that can be built on each of the areas above, such as targeted advertising, personalized recommendation, and user profiling.

Corresponding position: data engineer

Origin blog.51cto.com/12824426/2560977