Common Data Mining and Analysis Tools for Big Data Engineers

Data science integrates multiple disciplines and builds on their theory and techniques, including mathematics, probability, statistics, machine learning, data warehousing, and visualization. In practice it covers the whole iterative life cycle of scientific data: collection, cleaning, analysis, visualization, and application, ultimately helping organizations make sound decisions. Its practitioners are called data scientists. Data scientists have their own characteristic working methods and common tools. This article gives a comprehensive survey of the toolkit used by data analysts and data scientists, spanning open-source technology platforms and mining and analysis tools: hundreds of common tools in dozens of categories, with URLs for some of them.

Data scientists are versatile talents with broad horizons: they combine a solid scientific foundation in mathematics, statistics, and computer science with extensive business knowledge and experience, and they apply deep technical expertise to solve complex data problems, developing big data plans and strategies for different decision makers. The tools used by data analysts and data scientists are also covered in online MOOCs, such as the Johns Hopkins University Data Science specialization on Coursera (February 1, 2016). Below is a comprehensive overview of the basic ideas behind these tools and the problems they address.

The data scientist's and big data engineer's toolkit falls into three parts: A. the best big data technology platform tools of 2015; B. a summary of open-source big data processing tools; C. common data mining and analysis tools.

C. Common data mining and analysis tools

1. Five tools for data scientists, compiled by Dynelle Abeyta (2015-09-29):

dedupe - dedupe is a Python library that uses machine learning to perform fast deduplication and entity resolution on structured data. Data scientists often find themselves wishing they could just run SELECT DISTINCT * FROM my_messy_dataset; unfortunately, real-world data sets tend to be messier than that. Whether you are merging multiple data sources or simply collecting data, you need deduplication before any meaningful analysis can start. As you can imagine, there are endless rules for merging records and endless ways to define what "equivalent" means in your data. Do two restaurants at the same address belong to the same company? Are two records with the same first and last name the same person? Luckily, dedupe can save the day. Built on recent computer science research, it uses machine learning (more precisely, active learning) to learn from human feedback on pairs of possibly ambiguous records, working out exactly what makes two records "similar." Even more conveniently, it has a graphical user interface (GUI), so anyone can use it.
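The kind of fuzzy matching that dedupe automates can be sketched with the standard library. This toy example (the 0.85 threshold and the record fields are illustrative assumptions, not dedupe's API) uses difflib to flag near-duplicate restaurant records:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    # Ratio of matching characters between two strings, in [0.0, 1.0]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

records = [
    {"name": "Joe's Pizza", "address": "7 Carmine St"},
    {"name": "Joes Pizza",  "address": "7 Carmine Street"},
    {"name": "Lombardi's",  "address": "32 Spring St"},
]

# Compare every pair of records by name; dedupe instead *learns* which
# fields matter and how similar is "similar" from human feedback
duplicates = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if similar(records[i]["name"], records[j]["name"])
]
print(duplicates)  # the first two records are flagged as likely duplicates
```

A hard-coded threshold like this breaks down quickly on real data, which is exactly the gap dedupe's active learning is meant to fill.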

Theano - Theano is a Python library that lets you efficiently define, optimize, and evaluate mathematical expressions involving multidimensional arrays. Theano's features:

· Tight integration with NumPy - use numpy.ndarray in Theano-compiled functions.

· Transparent use of the GPU - up to 140 times faster than the CPU for data-intensive computation (measured with float32).

· Speed and stability optimizations - get the right answer for log(1 + x) even when x is very small.

· Dynamic C code generation - evaluate expressions faster.

· Extensive unit testing and self-verification - detect and diagnose many kinds of errors.
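The numerical-stability point is easy to demonstrate. In ordinary floating point, 1 + x rounds to 1.0 when x is tiny, so log(1 + x) comes out as 0. Theano rewrites such expressions internally; the same stable form happens to be available in Python's standard library as math.log1p, which this sketch uses instead of Theano itself:

```python
import math

x = 1e-20

naive = math.log(1 + x)  # 1 + 1e-20 rounds to exactly 1.0, so this is 0.0
stable = math.log1p(x)   # computed directly from x, correct to ~1e-20

print(naive)   # 0.0
print(stable)  # 1e-20
```

The point is not the library call but the rewrite: a symbolic system like Theano can spot the pattern log(1 + x) in your expression graph and substitute the stable computation automatically.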

StarCluster - StarCluster is designed to automate and simplify the creation, configuration, and management of virtual machine clusters on Amazon's EC2 cloud. It lets anyone easily build a cluster computing environment in the cloud, suited to distributed and parallel applications and systems, so you can work interactively with data of virtually unlimited size. (Contributed by Alessandro Gagliardi, data science instructor at Galvanize.)

graph-tool - Among the growing set of Python libraries for network and graph analysis, graph-tool shows great promise. Tools such as NetworkX and Gephi still have their place, but for anyone who wants advanced analytics on larger graphs, whether social networks, road networks, or biological networks, they often prove inadequate. NetworkX has long been the most popular Python tool for network analysis thanks to its rich API and low barrier to entry, but once you start working with larger graphs, the drawbacks of a pure-Python implementation really begin to show. Gephi, meanwhile, is an excellent interactive tool for visualizing and exploring graphs, but its awkward scripting interface makes it hard to drive programmatically. graph-tool tries to learn from its predecessors and bring the best of both to data scientists. Implemented in C++ (with parallel execution) and wrapped in Python, it pairs an easy-to-use API with blazing speed, without sacrificing usability.
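What these libraries provide on top of a graph can be illustrated with a minimal pure-Python adjacency structure (a toy sketch only; NetworkX exposes the same ideas through a much richer API, and graph-tool runs them at C++ speed):

```python
from collections import defaultdict

# Undirected graph stored as a dict mapping each node to its neighbor set
graph = defaultdict(set)

def add_edge(u, v):
    graph[u].add(v)
    graph[v].add(u)

for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    add_edge(u, v)

# Degree of each node: the kind of per-node statistic graph libraries compute,
# but over millions of nodes rather than four
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees["c"])  # c touches a, b, and d -> 3
```

The pure-Python version above is fine at this scale; the performance gap the paragraph describes appears when every edge visit pays Python's interpreter overhead, which is what graph-tool's C++ core avoids.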

Plotly - Plotly is an interactive graphics library for R, Python, MATLAB, JavaScript, and Excel. It is also a platform for analyzing and sharing data and figures. How is Plotly different? Like Google Docs and GitHub, it lets you coordinate and control your data; files can be set public, private, or secretly shared. The free Plotly public cloud, Plotly offline, and on-premises deployment are all available options. There are three main ways to fit Plotly into your workflow:

Integrating with your other data science tools. Plotly's APIs for R, Python, and MATLAB let you build interactive, updatable dashboards and figures. Plotly integrates with IPython Notebooks, NetworkX, Shiny, ggplot2, matplotlib, pandas, reporting tools, and databases. For example, a figure produced with ggplot2 can be embedded in a blog post: hover to see the data values, then click and drag to zoom.

Creating interactive maps. Plotly's graphics library is built on top of D3.js. For geographic data, Plotly supports choropleth maps, scatter plots, bubble charts, and block diagrams. You can build a map and, as with R and Python, embed it in blogs, applications, and dashboards.

Building a full range of visualizations. Plotly can meet any visualization requirement: maps, 2D, 3D, and streaming charts. Click and drag to rotate a diagram, hover to watch the data values, or toggle zoom.

2. Six open-source data mining tools: Eighty percent of data is unstructured, so we need programs and methods to extract useful information and convert it into an understandable, usable structured form. The data mining process has plenty of tools to choose from, applying artificial intelligence, machine learning, and other techniques to extract information from data. Here are six recommended open-source data mining tools:

1) WEKA - The original non-Java version of WEKA was mainly developed for analyzing agricultural data. The Java-based version is a very sophisticated tool used in many different applications, including visualization, data analysis algorithms, and predictive modeling. Its advantage over RapidMiner is that it is free under the GNU General Public License, so users can customize it however they like. WEKA supports a range of standard data mining tasks, including data preprocessing, clustering, classification, regression analysis, visualization, and feature selection. WEKA would be even more powerful with sequence modeling, which is not currently included.

2) RapidMiner - This tool is written in Java and provides advanced analytics through a template-based framework. Its biggest benefit is that users do not need to write any code. It is offered as a service rather than as local software. Notably, it tops the data mining tool rankings. Besides data mining, RapidMiner also provides data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. Even more powerfully, it provides learning schemes, models, and algorithms from WEKA (an intelligent analysis environment) and from R scripts. RapidMiner is distributed under the AGPL open-source license and can be downloaded from SourceForge. SourceForge is a centralized hub where developers build and manage a large number of open-source projects, including MediaWiki, the software used by Wikipedia.

3) NLTK - When it comes to language processing tasks, nothing beats NLTK. NLTK provides a suite of language processing tools, covering data mining, machine learning, data scraping, sentiment analysis, and other language processing tasks. All you need to do is install NLTK, pull in a package for your favorite task, and you are ready to go. Because it is written in Python, you can build applications on top of it and customize it for small tasks.
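At its simplest, the tokenize-and-count workflow that NLTK streamlines looks like this (a bare-bones sketch using only the standard library; NLTK's own tokenizers handle punctuation, stemming, and many languages far more robustly):

```python
import re
from collections import Counter

text = "NLTK makes language processing easy. Language processing covers many tasks."

# Naive word tokenization: lowercase the text and keep runs of letters
tokens = re.findall(r"[a-z]+", text.lower())

# Word frequencies, the starting point for most text mining
freq = Counter(tokens)
print(freq.most_common(2))  # [('language', 2), ('processing', 2)]
```

Every step here has a more careful NLTK counterpart (tokenizers, stopword lists, stemmers), which is why the library is the usual choice once text gets messy.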

4) Orange - Python is popular because it is easy to learn yet powerful. If you are a Python developer looking for a tool for this kind of work, look no further than Orange. It is a powerful, Python-based open-source tool suitable for novices and experts alike. You will also love its visual programming interface and Python scripting. Beyond its machine learning components, it offers bioinformatics and text mining add-ons; it is packed with features for data analysis.

5) KNIME - Data processing has three main components: extraction, transformation, and loading, and KNIME can do all three. KNIME gives you a graphical user interface for wiring together data-processing nodes. It is an open-source platform for data analytics, reporting, and integration that, through its modular data pipelining concept, brings together components for machine learning and data mining, and it has attracted attention in financial business intelligence and data analysis. KNIME is based on Eclipse, written in Java, and easy to extend with plug-ins. Additional features can be added at any time, and many of its data integration modules are already included in the core version.
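The extract-transform-load pattern that KNIME wires up graphically can be sketched in a few lines of plain Python (the data, the score filter, and the in-memory "files" are illustrative stand-ins, not anything KNIME-specific):

```python
import csv
import io

# Extract: read rows from a CSV source (StringIO stands in for a real file)
source = io.StringIO("name,score\nalice,80\nbob,95\n")
rows = list(csv.DictReader(source))

# Transform: keep only high scores and normalize the names
transformed = [
    {"name": r["name"].title(), "score": int(r["score"])}
    for r in rows
    if int(r["score"]) >= 90
]

# Load: write the result to the destination
dest = io.StringIO()
writer = csv.DictWriter(dest, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(transformed)
print(dest.getvalue())
```

In KNIME each of these three comments would be a node on the canvas, with the data flowing along the edges between them.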

6) R-Programming - What would you think if I told you that the R Project, a GNU project, is written largely in R itself (R-programming is referred to below simply as R)? Mainly written in C and Fortran, with many modules written in R, it is a free software programming language and environment for statistical computing and graphics. The R language is widely used in data mining, statistical software development, and data analysis. In recent years, its ease of use and extensibility have greatly raised R's profile. Besides data mining, it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time series analysis, classification, clustering, and more.

3. Three language-based data analysis tools: As data science analysis tools have developed, they have, on the one hand, solved a series of challenges such as failing algorithms and ultra-large-scale data visualization; on the other hand, each has its own characteristics, strengths, and weaknesses. For example, Mahout has excellent large-scale data processing capability in both speed and volume, but poor visualization. Below we pick three mainstream data science analysis tools, the R language, RapidMiner, and Mahout, outline their main features, and compare them, as follows.

1) The R language is a programming language and environment for statistical computing and graphics. It is operated from the command line and distributed free of charge under the GNU license; its source code is freely available for download and use. CRAN, R's package archive, provides a large number of third-party packages covering economics, sociology, statistics, bioinformatics, and many other fields, which is a major reason why more and more people in all walks of life like R. Researchers have worked on integrating R with Hadoop to address the poor extensibility of traditional analysis software and the weak analysis capability of Hadoop itself. Through a deep integration of the open-source R statistical software with Hadoop's parallel data processing, Hadoop gains powerful deep analysis capabilities.

2) RapidMiner, formerly known as YALE, is an open-source computing environment for data mining, machine learning, and business forecasting analysis. Large-scale processes can be run either with a simple scripting language or through its Java API or GUI. Because it has a GUI, it is easier for data mining beginners to get started. RapidMiner 6 has a friendly and powerful toolbox that delivers fast, stable analysis, allowing good prototypes to be built in a short time so that the key decisions in a data mining project can be made as early as possible. It helps with reducing customer churn, sentiment analysis, predictive maintenance, direct marketing, and so on.

3) Apache Mahout originated in 2008. Its main goal is to build a scalable library of machine learning algorithms; it provides some classic machine learning algorithms and is designed to help developers create intelligent applications more quickly and easily. Currently, the Mahout project covers frequent itemset mining, classification, clustering, and recommendation engines (collaborative filtering).

4. Five data mining tools: Intelligent Miner, SAS Enterprise Miner, SPSS Clementine, the Markowitz analysis system, and GDM, described in the following sections.

1) Intelligent Miner summary: IBM's Intelligent Miner is easy to use and a good way to get started with data mining. It can handle large volumes of data, and its general-purpose functionality may be enough to meet most requirements, but it has no data exploration capability. Unlike other software, it can connect only to DB2; for databases other than DB2, such as Oracle, SAS, or SPSS, DataJoiner must be installed as intermediate software. Deployment is difficult. The results look attractive but are equally difficult to interpret.

2) SAS Enterprise Miner summary: SAS rests on complete statistical theory, is powerful, and has full data exploration capabilities. But it is difficult to master, requires professionals versed in advanced statistical analysis, and its results are hard to interpret. The price is extremely high, and it is sold on a leasing model. Basics: it is backed by SAS's statistical modules, which give it outstanding power and influence, and many of those modules also strengthen its data mining algorithms. SAS uses its SEMMA methodology and provides a wide range of mining tools, with data model support including association, clustering, decision trees, neural networks, and statistical regression.

3) SPSS (Statistical Product and Service Solutions) summary: The software was originally called the "Statistical Package for the Social Sciences." As SPSS's products and services expanded in scope and depth, in 2000 the company formally changed the English name to "Statistical Product and Service Solutions." It is used in many fields and industries and is the world's most widely used professional statistical software.


Origin blog.csdn.net/chengxvsyu/article/details/93235531