How to Systematically Learn Big Data in 2019: Knowledge + Content + Tutorials

Big data has become one of the most popular technologies of 2019 and attracts more and more attention. For friends hoping to enter the big data field, what they most want to know is: what exactly should I learn? Today I would like to share an article introducing the big data learning content system.


The big data technology system is extremely complex. The foundational technologies cover data acquisition, data preprocessing, distributed storage, NoSQL databases, multiple computation modes (offline batch processing, real-time stream processing, in-memory processing), multi-modal computation (image, text, video, audio), data warehousing, data mining, machine learning, artificial intelligence, deep learning, parallel computing, visualization, and other technical areas at different levels. Moreover, big data applications are broad, and the technologies differ considerably across fields. It is difficult to master the theory and techniques of so many areas in a short time, so I recommend cutting in from the application side: start from a practical, in-demand use case, get one technique down solidly at a single point, and then, once you have a certain skill, expand outward step by step. The learning effect will be much better.

On big data technology

In the past few years and still today, the so-called cutting-edge information technologies of the big data era (mobile Internet, Internet of Things, cloud computing, artificial intelligence, robotics, and so on) have caught fire one after another. But what exactly is big data, and what does big data technology actually cover? I suspect many people are like the blind men and the elephant, familiar only with the part their own field happens to touch.


Below, from the DT (Data Technology) perspective, I will systematically introduce the general techniques of big data: which core technologies it includes, and how the related fields relate to one another.

First, machine learning. Machine learning is an interdisciplinary field of computer science and statistics. Its core goal is learning a mapping function: from training data, a family of algorithms searches for an optimal solution and evaluates the resulting model, giving the computer the ability to classify and predict automatically from data. Machine learning covers many classes of intelligent processing algorithms (classification, clustering, regression, association analysis), and each class contains many concrete algorithms, such as SVM, neural networks, logistic regression, decision trees, EM, HMM, Bayesian networks, random forests, LDA, and so on. Whatever "top ten" or "top twenty" algorithm ranking you find online is only the tip of the iceberg. In short, to make computers intelligent, machine learning is the core of the core: it is the central technical concept behind deep learning, data mining, business intelligence, artificial intelligence, and big data. Machine learning applied to image processing and recognition is machine vision; machine learning applied to modeling human language is natural language processing; machine vision and natural language processing are in turn the core supporting technologies of artificial intelligence. Machine learning applied to general data analysis is data mining, and data mining is the core technology of business intelligence.
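As a minimal sketch of the training-data-to-prediction loop described above, here is a classification example using scikit-learn and logistic regression (one of the algorithms named in the paragraph; the library and dataset are my own illustrative choices, not something the article prescribes):

    # Learn a mapping from features to labels, then evaluate on held-out data.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000)  # one algorithm among the many listed above
    model.fit(X_train, y_train)                # search for a good fit on the training data

    # Model evaluation: accuracy on data the model has never seen.
    print("accuracy:", model.score(X_test, y_test))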

Deep learning is currently one of the hottest sub-fields of machine learning. It refers to a family of neural network algorithm variants that has been studied for decades; because it has achieved very good recognition results in image classification, speech recognition, and other fields under big-data conditions, it is expected to become the core technology behind breakthroughs in artificial intelligence, so major research institutions and IT giants have invested heavily in related research and development.

Data mining is a very broad concept. It is analogous to mining: digging a few gems out of a large quantity of rock, or here, mining valuable, regular information out of massive data. The core technologies of data mining come from machine learning; deep learning, as a currently popular family of machine learning algorithms, can of course also be used for data mining. Traditional business intelligence (BI) also includes data mining: OLAP multidimensional analysis can be used for mining analysis, and even basic statistical analysis in Excel can count as mining. The key is whether you can really dig out useful information, and whether that information can guide your decisions; if so, you have entered the door of data mining.
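To give the gems-out-of-rock metaphor one concrete form, here is a clustering sketch (clustering is one of the algorithm classes listed earlier) using scikit-learn's KMeans; the data is synthetic and purely illustrative:

    # Recover group structure hidden in unlabeled points.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two hidden "veins" of data, centered near (0, 0) and (5, 5).
    points = np.vstack([rng.normal(0, 1, size=(50, 2)),
                        rng.normal(5, 1, size=(50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.cluster_centers_)  # the recovered centers are the "gems"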

AI (artificial intelligence) is a grand concept whose ultimate goal is a human-like intelligent machine: the human brain, on only tens of watts of power, can handle all kinds of complex problems, which is an amazing thing however you look at it. Although machines far exceed humans in raw computing power, in comprehension, perceptual reasoning, memory and imagination, psychology, and other capacities they are no match for us, so making machines human-like is genuinely difficult from a technical point of view. Artificial intelligence and machine learning are closely related: a considerable portion of their technologies and algorithms overlap. Deep learning has achieved great success in areas such as computer vision, for example Google's system automatically recognizing cats, and more recently Google's AlphaGo defeating top professional human Go players. But deep learning at this stage cannot achieve brain-like computation; at most it reaches a bionic level. Emotion, memory, cognition, experience, and other uniquely human abilities will be hard for machines to achieve in the short term.

Finally, big data. Big data is essentially a methodology; summarized in one sentence, it is assisting decision-making by analyzing and mining massive amounts of whole, non-sampled data. The techniques above were originally computed over small-scale data; in the big data era the data becomes huge, so the core technology remains inseparable from machine learning and data mining, but it must also address the distributed storage and management of massive data and the parallelization of machine learning algorithms. In short, the concept of big data is a big box that can hold almost anything: if the data is collected from sensors, it cannot be separated from the Internet of Things; if it is collected from smartphones, it cannot be separated from the mobile Internet; storing massive data cannot be separated from highly scalable cloud computing; analyzing big data with traditional machine learning and data mining techniques will be relatively slow, so they must be extended with parallel and distributed computing; and big data visualization cannot be separated from interactive presentation. On top of the analysis, big data combines with traditional business intelligence: financial big data analysis, transportation big data analysis, medical big data analysis, telecom big data analysis, e-commerce big data analysis, social big data analysis, text big data, image big data, video big data... and so on; the box really is very big. In short, the big data framework is vast, and its ultimate goal is to use the series of core technologies above to achieve deep human insight and intelligent decision-making under conditions of massive data. This is not only the ultimate goal of the development of information technology, but also a core technical driving force for the intelligent management of human society.
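The paragraph above stresses that at big-data scale, analysis has to be parallelized. As a minimal single-machine illustration of that idea (a stand-in for real distributed frameworks such as the Hadoop stack described later, not a substitute for them), Python's standard multiprocessing module can spread work across CPU cores:

    # Partition a job, process the parts in parallel, then combine the results.
    from multiprocessing import Pool

    def count_evens(chunk):
        return sum(1 for n in chunk if n % 2 == 0)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]       # 4 partitions
        with Pool(processes=4) as pool:
            partials = pool.map(count_evens, chunks)  # parallel "map" phase
        print(sum(partials))                          # "reduce" phase: combine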

The data analyst's capability system is as follows.

Mathematical knowledge

Mathematical knowledge is a data analyst's foundational knowledge.

For a junior data analyst, understanding the basics of descriptive statistics and having some ability to compute with formulas is enough; understanding common statistical models and algorithms is a plus.

For a senior data analyst, knowledge of statistical models is an essential capability, and it is best to have some understanding of linear algebra (mainly knowledge related to matrix computation).

For a data mining engineer, beyond statistics you also need to be able to use all kinds of algorithms skillfully; the requirement for mathematics is the highest.
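Before moving on to tools, here is a minimal sketch of the descriptive statistics expected of a junior analyst, using pandas (the library is my own illustrative choice; the figures are hypothetical):

    import pandas as pd

    daily_sales = pd.Series([120, 98, 143, 87, 160, 132, 101])  # hypothetical data

    print(daily_sales.mean())                  # average
    print(daily_sales.median())                # middle value, robust to outliers
    print(daily_sales.std())                   # spread around the mean
    print(daily_sales.quantile([0.25, 0.75]))  # quartiles
    print(daily_sales.describe())              # all of the above in one call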

Analysis tools

For a junior data analyst, mastering Excel is a must: pivot tables and common formulas have to be familiar, and VBA is a plus. In addition, learn one statistical analysis tool; SPSS is a good entry point.

For a senior data analyst, analysis tools are a core competency: VBA is basically required, at least one of SPSS/SAS/R should be mastered, and other analysis tools (e.g., Matlab) can be picked up as needed.

For a data mining engineer... well, being able to use Excel is fine; the main work mostly depends on writing code to solve problems.
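Since the pivot table is named above as the must-have Excel skill, here is the same kind of analysis sketched in pandas (an illustrative choice; the data and column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "South"],
        "product": ["A", "B", "A", "A", "B"],
        "revenue": [100, 150, 80, 120, 90],
    })

    # Rows = region, columns = product, cells = summed revenue,
    # exactly what an Excel pivot table would produce.
    pivot = pd.pivot_table(df, index="region", columns="product",
                           values="revenue", aggfunc="sum", fill_value=0)
    print(pivot)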

Programming language

For a junior data analyst, being able to write SQL queries, and if necessary some Hadoop and Hive queries, is basically enough.

For a senior data analyst, in addition to SQL you need to learn Python, in order to acquire and process data more efficiently. Other programming languages are of course also possible.

For a data mining engineer, familiarity with Hadoop is expected, at least one of Python/Java/C++ must be mastered, and shell has to be usable... In short, programming languages are absolutely the core capability of a data mining engineer.
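To make the SQL-plus-Python combination above concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever database you actually query (the table and columns are hypothetical):

    import sqlite3

    # An in-memory database standing in for a real warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 9.9), (1, 20.0), (2, 5.0)])

    # The kind of aggregation query a junior analyst writes every day.
    query = ("SELECT user_id, COUNT(*) AS n_orders, SUM(amount) AS total "
             "FROM orders GROUP BY user_id")
    for row in conn.execute(query):
        print(row)

    conn.close()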

Business understanding

Business understanding is the foundation of all of a data analyst's work; that is no exaggeration. The design of data collection, the selection of metrics, and the insight of the final conclusions all depend on the analyst's understanding of the business itself.

For a junior data analyst, the main work is extracting data and producing some simple charts, plus a small number of insight conclusions; a basic understanding of the business is enough.

For a senior data analyst, a deeper understanding of the business is needed, so that you can extract effective viewpoints from the data and genuinely help the business.

For a data mining engineer, a basic understanding of the business is enough; the focus should still be placed on exercising technical ability.

Logical thinking

This is an ability I mentioned less in previous articles, so this time I am pulling it out to discuss on its own.

For a junior data analyst, logical thinking is mainly reflected in knowing the purpose of each step of the analysis process: what means to use, and what goal to reach.

For a senior data analyst, logical thinking is mainly reflected in building a complete and effective analysis framework, understanding the relationships between the objects being analyzed, and being clear about the cause and effect of each indicator's changes and what they will bring to the business.

For a data mining engineer, beyond the logical thinking involved in business analysis work, this also includes algorithmic logic, programming logic, and so on, so the requirement for logical thinking is the highest.

Data visualization

Data visualization sounds grand, but in fact it covers a wide range of things; even a data chart placed in a PPT counts as data visualization. So I think this is a universally needed ability.

For a junior data analyst, being able to use Excel and PPT to make basic charts and reports that present the data clearly already achieves the goal.

For a senior data analyst, you need to explore better data visualization methods and more effective data visualization tools, and produce visualizations, simple or complex as actual needs dictate, that suit the viewing audience.

For a data mining engineer, understanding some data visualization tools is necessary, and at times you will need to build complex visualizations as required, but you usually do not need to worry much about polish.
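As a sketch of the simple-but-clear charting discussed above, assuming matplotlib as the tool (the article does not prescribe one, and the numbers are hypothetical):

    import matplotlib.pyplot as plt

    # Hypothetical monthly active users, the kind of metric a report shows.
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    mau = [1200, 1350, 1280, 1500, 1620, 1710]

    plt.figure(figsize=(6, 3))
    plt.plot(months, mau, marker="o")
    plt.title("Monthly active users")
    plt.ylabel("MAU")
    plt.tight_layout()
    plt.savefig("mau.png")  # or plt.show() in an interactive session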

Coordination and communication

For a junior data analyst, who has to understand the business, find data, and explain reports, dealing with people from different departments is unavoidable, so communication skills are very important.

For a senior data analyst, you begin to lead independent projects or collaborate with product teams, so in addition to communication skills you also need some project coordination ability.

For a data mining engineer, communication is mostly with technical colleagues and relatively rarely with business staff, so the requirements for communication and coordination are comparatively low.

Fast learning

No matter which direction of data analysis you take, and whether junior or senior, you need the ability to learn quickly: learning business logic, learning industry knowledge, learning technical tools, learning analysis frameworks... there is no end to what must be learned in this field, so always keep a learner's mindset.

Data analyst tool system

[Figure: diagram of the data analyst tool system]

As can be seen from the figure, Python has very high generality in data analysis: it can be used at every stage of the process. So if, as a data analyst, you need to learn one programming language, I strongly recommend Python.

Technical descriptions of the Hadoop family:

Apache Hadoop: an open-source distributed computing framework from the Apache open-source organization. It provides a distributed file system (the HDFS subproject) and the MapReduce software framework supporting distributed computing. (A small Python sketch of the MapReduce model follows at the end of this list.)

Apache Hive: a Hadoop-based data warehouse tool that maps structured data files to database tables and lets you run simple statistics quickly through SQL-like statements, which it translates into MapReduce jobs, with no need to develop dedicated MapReduce applications. It is very well suited to statistical analysis of data warehouses.

Apache Pig: a Hadoop-based tool for large-scale data analysis. It provides an SQL-like language called Pig Latin, whose compiler translates analysis requests into a series of optimized MapReduce operations.

Apache HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase you can build large-scale structured storage clusters on inexpensive PC servers.

Apache Sqoop: a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into Hadoop's HDFS, and also export data from HDFS back into a relational database.

Apache Zookeeper: an open-source coordination service designed for distributed applications. It is mainly used to solve data-management problems frequently encountered in distributed applications, simplifying the difficulty of coordinating and managing them while providing high-performance distributed services.

Apache Mahout: a Hadoop-based framework for distributed machine learning and data mining. Mahout implements a number of data mining algorithms with MapReduce, solving the problem of parallel mining.

Apache Cassandra: an open-source distributed NoSQL database system. Originally developed by Facebook to store simple-format data, it combines Google BigTable's data model with Amazon Dynamo's fully distributed architecture.

Apache Avro: a data serialization system designed to support data-intensive, high-volume data exchange between applications. Avro is a new form of data serialization and transfer tooling that will gradually replace Hadoop's existing IPC mechanism.

Apache Ambari: a web-based tool that supports the provisioning, management, and monitoring of Hadoop clusters.

Apache Chukwa: an open data collection system for monitoring large distributed systems. It can save all kinds of collected data as Hadoop files in HDFS for MapReduce to process.

Apache Hama: a BSP (Bulk Synchronous Parallel) computing framework built on HDFS, which can be used for large-scale big data computation, including graph, matrix, and network algorithms.

Apache Flume: a distributed, reliable, highly available system for aggregating massive logs; it can be used to collect, process, and transmit log data.

Apache Giraph: a scalable, distributed, iterative graph processing system built on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.

Apache Oozie: a workflow engine server for managing and coordinating tasks that run on the Hadoop platform (HDFS, Pig, and MapReduce).

Apache Crunch: a Java library, based on Google's FlumeJava library, for writing programs that create MapReduce pipelines. Like Hive and Pig, Crunch provides pattern libraries for common tasks such as joining data, performing aggregations, and sorting records.

Apache Whirr: a set of libraries for running cloud services (including Hadoop), offering a high degree of complementarity. Whirr supports Amazon EC2 and Rackspace services.

Apache Bigtop: a tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.

Apache HCatalog: a Hadoop-based table and storage management service that provides centralized management of metadata and schemas across Hadoop and RDBMSs, with relational views offered through Pig and Hive.

Cloudera Hue: a web-based monitoring and management system that provides web-based operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.
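To make the MapReduce model that runs through this list concrete, here is a word count, the "hello world" of Hadoop, written as a local Python simulation (on a real cluster the map and reduce steps would run as separate distributed tasks, for example via Hadoop Streaming; this sketch only mirrors the programming model):

    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word.
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def reducer(pairs):
        # Hadoop sorts mapper output by key before the reduce phase;
        # sorted() plays that role in this local simulation.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield (word, sum(count for _, count in group))

    text = ["big data needs big tools", "big data needs patience"]
    for word, count in reducer(mapper(text)):
        print(word, count)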


Source: blog.51cto.com/14296550/2438271