The most complete introductory guide to big data!

First, the five basic aspects of big data analysis

1. Visualization analysis

Users of big data analytics range from specialist analysts to ordinary business users, but the most basic requirement both groups share is visualization. Visual analysis presents the characteristics of large data sets intuitively and is easily absorbed by readers, making the results as plain and direct as describing a picture.

2. Data mining algorithms

Data mining algorithms are the theoretical core of big data analysis. Different algorithms, suited to different data types and formats, present the inherent characteristics of the data more scientifically, and it is precisely because these statistical methods are recognized by statisticians worldwide (you could call them established truths) that we can go deep inside the data and dig out its recognized value. The other aspect is speed: data mining algorithms must process big data quickly enough to be useful. If an algorithm took several years to reach a conclusion, there would be no point in talking about the value of big data.

3. Predictive analytics

One of the ultimate applications of big data analytics is prediction: mine characteristics out of the data, establish a model scientifically, and then feed new data into that model to forecast future data.

4. Semantic engines

Big data analysis is widely applied to network data mining. A semantic engine analyzes a user's search keywords, tags, or other input to determine the user's needs, enabling a better user experience and better-matched advertising.

5. Data quality and data management

Big data analysis is inseparable from data quality and data management. High-quality data and effective data management are what guarantee true and valuable analysis results, whether in academic research or in commercial applications. These five aspects are the foundation of big data analysis; of course, going deeper, there are many more characteristics of more in-depth, more professional big data analysis beyond them.

Second, how to choose the right data analysis tool

To understand what to analyze, start with the data itself. The big data you may want to analyze falls into four main types:

1. Transaction data (TRANSACTION DATA)

A big data platform can capture structured transaction data of larger time span and greater volume, which lets you analyze a wider range of transaction types: not only POS and e-commerce purchase data, but also behavioral transaction data, such as the clickstream data in a web server's Internet access logs.
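
Clickstream records usually start life as raw web-server log lines. A minimal sketch of turning one such line into a structured record, assuming (for illustration only) the Apache common log format; the sample line is invented:

```python
import re

# Pattern for an Apache "common log"-style line (an assumed format for
# this sketch; real platforms handle many log dialects).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_log_line(line):
    """Parse one access-log line into a dict, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

record = parse_log_line(
    '203.0.113.9 - - [10/Oct/2024:13:55:36 +0000] "GET /cart HTTP/1.1" 200 2326'
)
```

Once lines are parsed into records like this, they can be loaded into the platform and queried alongside POS or e-commerce transactions.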

2. Human-generated data (HUMAN-GENERATED DATA)

Unstructured data exists widely in email, documents, images, audio, and video, as well as in the streams generated through blogs, wikis, and especially social media. When analyzed with text analytics, this data provides a rich source of information.

3. Mobile data (MOBILE DATA)

Smartphones and tablets with Internet access are increasingly common. The apps on these mobile devices can track and communicate countless events, from in-app transaction data (such as recording a product search) to personal-profile or status-report events (such as a location change reporting a new geocode).

4. Machine and sensor data (MACHINE AND SENSOR DATA)

This category includes data created or generated by functional devices such as smart electricity meters, smart thermostats, factory machinery, and Internet-connected household appliances. These devices can be configured to communicate with other nodes in a network and can also transmit data automatically to a central server so that the data can be analyzed. Machine and sensor data are a prime example of data arising from the emerging Internet of Things (IoT). IoT data can be used to build models, to continuously monitor behavior for prediction (for example, identifying when sensor values indicate a problem), and to provide prescribed instructions (for example, warning technicians to inspect equipment before it actually fails).
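
The continuous-monitoring idea can be reduced to a very small sketch: stream readings through a range check and flag the ones that indicate a problem. The thresholds and readings below are invented for illustration.

```python
# A minimal sketch of sensor monitoring: collect the readings that fall
# outside an expected range. Device names and limits are invented.
def flag_anomalies(readings, low, high):
    """Return (timestamp, value) pairs that fall outside [low, high]."""
    return [(ts, v) for ts, v in readings if not (low <= v <= high)]

meter_readings = [(0, 21.5), (1, 22.0), (2, 95.3), (3, 21.8)]  # (minute, temperature)
alerts = flag_anomalies(meter_readings, low=10.0, high=40.0)
# alerts now holds the out-of-range reading(s) a technician should inspect
```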

What requirements and purposes should a data analysis tool fulfill?

  • Provide analysis applications and models with advanced analytical algorithms

  • Serve as an engine for big data platforms such as Hadoop or other high-performance analysis systems

  • Be applicable to a variety of data sources, both structured and unstructured

  • Scale as the volume of data and the number of analysis models grow

  • Be integrable, or already integrated, into a data visualizer

  • Be able to integrate with other technologies

In addition, the tool must include certain essential features, among them integrated algorithms and support for data mining techniques, including (but not limited to):

1. Clustering and segmentation:

Dividing a large entity into small groups that share common characteristics. For example, analyzing collected customer data to determine finer-grained target-market segments.
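
As a toy illustration of the mechanics (not any particular product's algorithm), the sketch below splits synthetic 2-D "customer" points into two groups by nearest centroid. Real tools use smarter initialization such as k-means++, more clusters, and far more data.

```python
# A minimal two-cluster k-means sketch over invented points.
def two_means(points, iters=10):
    centroids = [points[0], points[-1]]            # naive seeding for the sketch
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)    # assign to nearest centroid
        centroids = [                              # recompute each centroid
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# e.g. low spenders vs. high spenders on two invented features
low_group, high_group = two_means([(1, 2), (2, 1), (1, 1), (9, 9), (10, 8), (9, 10)])
```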

2. Classification:

Organizing data into predetermined categories, for example deciding how changing customers should be classified into segments according to a model.
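
A minimal sketch of the idea, assuming invented "age, monthly spend" features: assign a new customer to whichever labeled segment's average profile (centroid) it is closest to. Production classifiers (decision trees, logistic regression, and so on) are far more sophisticated.

```python
# Nearest-centroid classification over invented, labeled training data.
def centroid(rows):
    """Average each feature column of a list of feature tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def classify(x, labeled):
    """labeled: dict mapping class name -> list of feature tuples."""
    centers = {name: centroid(rows) for name, rows in labeled.items()}
    return min(centers, key=lambda n: sum((a - b) ** 2 for a, b in zip(x, centers[n])))

training = {                      # (age, monthly spend), invented numbers
    "budget":  [(25, 30), (30, 40), (22, 35)],
    "premium": [(45, 300), (50, 280), (40, 320)],
}
label = classify((48, 290), training)
```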

3. Regression:

Used to discover the relationship between a dependent variable and one or more independent variables, and to help determine how the dependent variable changes as the independent variables change. For example, using geographic data, net income, and an area's average summer temperature to predict the future value of property.
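
For a single independent variable, the technique reduces to fitting a line by ordinary least squares. The "temperature versus fan sales" numbers below are invented for illustration:

```python
# Ordinary least squares for y = a + b*x, then a forecast from the fit.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

temps = [20, 25, 30, 35]       # average summer temperature (invented)
sales = [100, 150, 200, 250]   # fan units sold (invented)
a, b = fit_line(temps, sales)
predicted = a + b * 40         # forecast sales for a 40-degree summer
```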

4. Association and itemset mining:

Finding correlations between variables in large data sets. For example, it can help call-center representatives provide more accurate information based on a caller's customer segment, relationship, and type of complaint.
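
At its core, itemset mining starts from co-occurrence counts. A minimal sketch over invented shopping baskets (real miners such as Apriori build on exactly these counts, with pruning so they scale):

```python
from collections import Counter
from itertools import combinations

# Count how often each pair of items appears together across transactions.
baskets = [                      # invented basket data
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"bread", "butter"},
    {"beer", "chips"},
]
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
top_pair, support = pair_counts.most_common(1)[0]   # most frequent pair
```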

5. Similarity and linkage:

Used for undirected clustering. Similarity-scoring algorithms can be used to determine the similarity of entities placed in candidate clusters.

6. Neural networks:

Used for undirected analysis with machine learning.

What do different users want to learn through data analysis tools?

  • Data scientists, who want to run more complex analyses on more complex data types, and who understand how to design and apply the underlying models to assess inherent tendencies or biases.

  • Business analysts, who are closer to casual users and want to use the data for proactive data discovery, for visualizing the available information, and for some predictive analysis.

  • Business managers, who want to understand the models and their conclusions.

  • IT developers, who support all of the above categories of users.

How to choose the most suitable big data analysis software

  • The analysts' expertise and skills.

Some tools target novice users, some target professional data analysts, and others are designed for both audiences.

  • Analytical diversity.

Depending on the use case and application, business users may need support for different types of analysis and for specific types of modeling (such as regression, clustering, segmentation, behavioral modeling, and decision trees). Some tools already support a broad range of high-level modeling in different forms, while some vendors have spent decades tuning different versions of their algorithms and adding more advanced features. Understanding which models are most relevant to the business problem, and evaluating products by how well they meet the actual needs of business users, are both very important.

  • Scope of the data to be analyzed.

The scope of the data to be analyzed involves many aspects: structured and unstructured information, traditional local databases and data warehouses, cloud-based data sources, and data managed on big data platforms (such as Hadoop). However, different products provide very uneven support for data managed in non-traditional data lakes (in Hadoop, or in NoSQL data management systems used at scale). When choosing a product, an enterprise must consider the specific volumes and types of data it needs to acquire and process.

  • Collaboration.

The larger the enterprise, the more likely that analysis, models, and applications must be shared across departments and among many analysts. If a company has many analysts distributed across departments, it may need additional methods for model sharing and collaboration so that results and analyses are interpreted consistently.

  • Licensing and maintenance budget.

Almost all vendors split their products into different editions, with varying purchase prices and operating costs. License fees are usually proportional to features and functions, the number of nodes, the amount of data analyzed, or other usage limits of the product.

  • Ease of use.

Can a business analyst without a statistics background easily develop analyses and applications? Determine whether the product provides visual methods that ease development and analysis.

  • Handling of unstructured data.

Confirm that the product can work with different types of unstructured data (documents, email, images, video, presentations, social media, and other information channels), and that it can parse and make use of the information it receives.

  • Extensibility and scalability.

As data volumes keep growing and data management platforms keep expanding, assess how well different analysis products keep pace with increases in processing and storage capacity.

Third, how to tell apart the three hottest big data jobs: data scientist, data engineer, and data analyst

As big data grows hotter and hotter, big-data-related careers have become popular and brought many opportunities for talent development. Data scientist, data engineer, and data analyst are now the industry's most sought-after big data jobs. How is each defined? What work does each actually do? What skills does each need? Let's take a look.

How are these three careers positioned?

  • What kind of role is the data scientist?

A data scientist is an engineer or expert (as distinct from a statistician or analyst) who uses scientific methods and data mining tools to digitally reproduce and understand large, complex volumes of numbers, symbols, text, URLs, audio, or video, and to find new insights in them.

  • How is a data engineer defined?

A data engineer is usually defined as "a software engineer with a deep understanding of the discipline of statistics." If you are agonizing over a business problem, a data engineer is who you need. Their core value lies in their ability to build data pipelines out of clean data. A full understanding of file systems, distributed computing, and databases is a necessary skill for becoming an excellent data engineer.

A data engineer also has a fairly good grasp of algorithms and should therefore be able to run basic data models. High-end business needs, however, give rise to highly complex computational requirements that in many cases exceed what a data engineer has mastered; that is when you need to call a data scientist for help.

  • How to understand the data analyst

A data analyst is a professional in a specific industry who specializes in collecting, organizing, and analyzing industry data, and who makes industry research judgments, assessments, and forecasts based on that data. Data analysts know how to ask the right questions and are very good at data analysis, data visualization, and data presentation.

What are the specific duties of these three careers?

  • Duties of the data scientist

Data scientists tend to look at the world around them through the lens of exploring data. They turn large amounts of scattered data into structured sets ready for analysis, find rich data sources, integrate them with other, possibly incomplete data sources, and clean the resulting data set. In a competitive environment where the challenges keep changing and new data keeps flowing in, data scientists need to help decision makers shuttle among various analyses, moving from ad hoc data analysis to ongoing data analysis. When they discover something, they communicate their findings and suggest new business directions. They present information visually with great creativity, and make the patterns they find clear and convincing. They recommend the rules implied in the data to their bosses, and thereby influence products, processes, and decisions.

  • Duties of the data engineer

Analyzing history, predicting the future, and optimizing choices are the three most important tasks for a big data engineer "playing with data." Through work in these three directions, they help companies make better business decisions.

A very important task of the big data engineer is to identify the characteristics of past events by analyzing data. For example, Tencent's data team is building a data warehouse that organizes the large volumes of irregular data from all of its network platforms and summarizes queryable features, to support the company's data needs across its various businesses, including advertising delivery, game development, social networking, and so on.

The biggest benefit of identifying the characteristics of past events is helping companies understand their consumers better. By analyzing a person's past behavior track, we can come to know that person and predict their behavior.

By introducing key external factors, big data engineers can predict future consumption trends. On the Alimama marketing platform, engineers are trying to help Taobao sellers do business by introducing meteorological data. For example, if this summer is not hot, some products that sold well last year may not sell this year; besides air conditioners and fans, tank tops, bathing suits, and the like may be affected. Engineers therefore establish the relationship between meteorological data and sales data, find the related categories, and warn sellers in advance to turn over their inventory.
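
The "establish the relationship" step can be illustrated with a Pearson correlation between two invented series; a coefficient near 1.0 would support stocking up ahead of a hot spell. This is a sketch of the general technique, not Alimama's actual method.

```python
import math

# Pearson correlation between weekly temperature and category sales
# (all numbers invented for illustration).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

weekly_temp = [22, 24, 28, 31, 33]         # temperature, invented
swimsuit_sales = [80, 95, 140, 180, 200]   # units sold, invented
r = pearson(weekly_temp, swimsuit_sales)
# r close to 1.0 suggests the category moves with the weather
```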

Depending on the nature of the business, big data engineers can use data analysis to achieve different goals. At Tencent, the most straightforward example of their work is option testing (A/B testing): helping product managers choose between alternatives A and B. In the past, decision makers could only judge from experience, but now big data engineers can run wide-ranging tests in real time. For a social networking product, for example, they let half the users see interface A and the other half use interface B, then observe the statistics on click-through and conversion rates over a period of time to help the marketing department make the final choice.
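
A minimal sketch of the statistics behind such a test, with invented counts: compare the two interfaces' click-through rates with a two-proportion z-score. This is one common approach; real A/B platforms also fix sample sizes and significance levels in advance.

```python
import math

# Two-proportion z-test for comparing click-through rates of A and B.
def ab_z_score(clicks_a, views_a, clicks_b, views_b):
    pa, pb = clicks_a / views_a, clicks_b / views_b
    p = (clicks_a + clicks_b) / (views_a + views_b)        # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    return (pb - pa) / se

# Invented counts: interface B gets more clicks from the same traffic.
z = ab_z_score(clicks_a=200, views_a=5000, clicks_b=260, views_b=5000)
significant = abs(z) > 1.96    # |z| > 1.96: significant at the usual 5% level
```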

  • Duties of the data analyst

The Internet itself is digital and interactive, attributes that have brought revolutionary breakthroughs to data collection, organization, and research. In the past, in the "world of atoms," analysts had to pay a much higher cost (in money, resources, and time) to obtain supporting data for research, and the richness, comprehensiveness, continuity, and timeliness of that data were far worse than in the Internet age.

Compared with traditional data analysts, what data analysts in the Internet age face is not a lack of data but an excess of it. They must therefore learn to use technical means to process data efficiently. More importantly, they must keep innovating and making breakthroughs in data research methodology.

Across industries, the value of the data analyst is similar. In the press and publishing industry, for instance, in any era, whether media operators can grasp their audience's situation and changes accurately, thoroughly, and in a timely way is the key to the media's success.

In addition, for the content side of press and publishing, it is even more crucial that data analysts can apply their skills to analyzing content-consumption data, which is key to how press and publishing organizations improve their customer services.

What skills do you need to pursue these three occupations?

A. Skills a data scientist must master

1. Computer science

Generally speaking, most data scientists are required to have a programming or computer-science-related professional background. Simply put, this means the massively parallel processing technologies needed for handling big data, such as Hadoop and Mahout, plus machine-learning skills.

2. Mathematics, statistics, and data mining

Besides mathematical and statistical literacy, a data scientist also needs skill with mainstream statistical analysis software such as SPSS and SAS. Among these tools, "R", an open-source programming language and runtime environment for statistical analysis, has attracted much attention recently. R's strengths lie not only in its rich statistical analysis libraries but also in producing high-quality visual charts of the results, all runnable with simple commands. It also has a package extension mechanism called CRAN (The Comprehensive R Archive Network): by installing extension packages, you can use functions and data sets not included in the standard distribution.

3. Data visualization

The quality of information depends heavily on how it is expressed. Analyzing the meaning hidden in lists of numbers, developing Web prototypes, and using external APIs to unify charts, maps, dashboards, and other services so that the results of analysis are visualized: this is one of the most important skills for a data scientist.
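
The principle, mapping values to visual lengths, can be shown even without a charting library; the sketch below renders invented quarterly figures as a text bar chart. Real dashboards use charting libraries, but the idea is the same.

```python
# Render {label: value} data as scaled text bars (invented figures).
def bar_chart(data, width=20):
    peak = max(data.values())
    return [
        f"{label:<8} {'#' * round(value / peak * width)} {value}"
        for label, value in data.items()
    ]

lines = bar_chart({"Q1": 120, "Q2": 180, "Q3": 240})
for line in lines:
    print(line)
```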

B. Skills a data engineer must master

1. A background in mathematics and statistics

What enterprises hope for in a big data engineer is a master's or doctoral background in statistics or mathematics. Data workers who lack this theoretical grounding slip more easily into a skills "danger zone": given a pile of numbers, different data models and algorithms can always crank out some results, but if you do not know what they mean, they are not truly meaningful results, and they can easily mislead you. Only with a certain theoretical knowledge can you understand models, reuse them, and even invent new models to solve practical problems.

2. Programming ability

Practical development ability and large-scale data processing ability are essential for a big data engineer, because much of the value of data comes from the mining process: you have to use your own hands to discover the gold. For example, many of the records people now generate on social networks are unstructured data, and extracting meaningful information from these disorderly text, voice, image, and even video streams is something a big data engineer must do personally. Even on teams where the big data engineer's duties are mainly business analysis, he or she should still be familiar with how computers process big data.

3. Knowledge of a particular field or industry

The big data engineer's role matters precisely because big data cannot exist apart from the market: it produces value only when combined with applications in a specific field. Experience in one or more vertical industries therefore lets a candidate accumulate industry knowledge that will greatly help a later career as a big data engineer, and it is a convincing plus when applying for such jobs.

C. Skills a data analyst must master

1. Understand the business. The premise of data analysis is understanding the business: being familiar with industry knowledge and business processes, preferably with insights of one's own. Cut off from industry knowledge and business background, the results of analysis are a kite with a broken string, of little value.

2. Understand management. On one hand, management knowledge is needed to build the data analysis framework: determining the line of analysis requires guidance from marketing and management theory, and without familiarity with management theory it is hard to build the framework or to carry the subsequent analysis forward. On the other hand, management knowledge is what lets you turn the conclusions of the analysis into instructive recommendations.

3. Understand analysis. This means mastering the basic principles of data analysis and a number of effective methods, and applying them flexibly in practical work so that the analysis is carried out effectively. Basic methods include comparative analysis, group analysis, cross analysis, structural analysis, funnel-chart analysis, comprehensive-evaluation analysis, factor analysis, and matrix correlation analysis. Advanced methods include correlation analysis, regression analysis, cluster analysis, discriminant analysis, principal component analysis, factor analysis, correspondence analysis, and time-series analysis.

4. Understand tools. This means mastering the common tools of data analysis. Methods are the theory of data analysis; tools are what put that theory into practice. Facing ever-growing volumes of data, we cannot rely on a calculator; we must rely on powerful data analysis tools to complete the work.

5. Understand design. This means using charts to express the analyst's viewpoint effectively, so that the analysis results are clear at a glance. Chart design involves considerable learning, such as choosing the graphic, designing the layout, and matching colors, all of which require certain design principles.

Fourth, a nine-step program for growing from rookie to data scientist

First, every company defines "data scientist" differently, and there is currently no unified definition. Generally speaking, though, a data scientist combines the skills of a software engineer and a statistician, and invests a great deal of effort in knowledge of the industry he or she wishes to work in.

About 90% of data scientists have at least a college education, many up to the doctoral level, though the fields in which they earned their degrees are very broad. Some recruiters even find that people with humanities backgrounds have the required creativity and can be taught some of the key skills.

So, setting aside data science degree programs (which well-known universities around the world are launching like mushrooms after rain), what steps do you need to take to become a data scientist?

  • Review your math and statistics skills.

A good data scientist must be able to understand what the data is telling you, and to do that you must have solid basic linear algebra, an understanding of algorithms, and statistical skills. Advanced mathematics may be required on certain occasions, but this is a good place to start.

  • Understand the concepts of machine learning.

Machine learning is an emerging buzzword, but one inextricably linked to big data. Machine learning uses artificial intelligence algorithms to turn data into value without explicit programming.

  • Learn to code.

Data scientists must know how to tweak code in order to tell the computer how to analyze the data. Start with an open-source language such as Python.
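
As a taste of how little code a first analysis takes, the sketch below loads a small inline CSV (invented figures) and summarizes it with nothing but the Python standard library:

```python
import csv
import io

# Inline CSV so the example is self-contained; figures are invented.
raw = io.StringIO("city,revenue\nBeijing,120\nShanghai,150\nShenzhen,90\n")
rows = list(csv.DictReader(raw))            # each row becomes a dict
total = sum(int(r["revenue"]) for r in rows)
average = total / len(rows)
```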

  • Understand databases, data lakes, and distributed storage.

Data lives in databases, data lakes, or distributed networks, and how you build that data repository determines how you can access, use, and analyze it. If you build your data storage without an overall architecture or forward planning, the downstream consequences will be far-reaching.

  • Learn data munging and data cleansing techniques.

Data munging converts raw data into another format that is easier to access and analyze. Data cleansing helps eliminate duplicate and "bad" data. Both are indispensable tools in a data scientist's toolbox.
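
A minimal sketch of both techniques together, with invented field names and validity rules: normalize each record into a consistent shape, then drop invalid rows and duplicates.

```python
# Munge + cleanse: normalize, validate, and de-duplicate invented records.
def clean(records):
    seen, out = set(), []
    for rec in records:
        email = rec.get("email", "").strip().lower()   # munge into one format
        age = rec.get("age")
        if not email or age is None or not (0 < age < 120):
            continue                                   # drop "bad" data
        if email in seen:
            continue                                   # drop duplicates
        seen.add(email)
        out.append({"email": email, "age": age})
    return out

raw_records = [
    {"email": "Ann@Example.com ", "age": 34},
    {"email": "ann@example.com", "age": 34},   # duplicate once normalized
    {"email": "bob@example.com", "age": 999},  # invalid age
    {"email": "cat@example.com", "age": 28},
]
cleaned = clean(raw_records)
```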

  • Learn the basics of good data visualization and reporting.

You do not have to become a graphic designer, but you do need to be well versed in creating data reports that a layperson, such as your manager or CEO, can easily understand.

  • Add more tools to your toolbox.

Once you have mastered the techniques above, it is time to expand your data science toolbox to include Hadoop, R, and Spark. Experience with and knowledge of these tools will put you ahead of the crowd of data science job seekers.



Source: blog.csdn.net/mnbvxiaoxin/article/details/104868948