Eighteen big data analysis tools

  Big data analysis is a broad term that refers to data sets so large and complex that they require specially designed hardware and software tools to process. These data sets are typically on the order of terabytes to exabytes in size. They are collected from a variety of sources: sensors, climate information, and public information such as magazines, newspapers, and articles. Other examples of big data include purchase transaction records, web logs, medical records, military surveillance, video and image archives, and large-scale e-commerce data.

  Companies are keenly interested in the impact of big data analysis. Big data analysis is the process of studying large amounts of data to find patterns, correlations, and other useful information that can help companies adapt to change and make better-informed decisions.

  One, Hadoop

  Hadoop is an open source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using a simple programming model. It is designed to scale from a single server to thousands of machines, each providing local computation and storage.

  Hadoop is a software framework for the distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable: it maintains multiple copies of working data, so even when computing elements or storage fail it can redistribute processing for the failed nodes. Hadoop is efficient: it works in parallel, speeding up processing through parallel computation. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop runs on inexpensive commodity hardware and is maintained by an open source community, so its cost is relatively low and anyone can use it.

  


 

  Hadoop is a distributed computing platform that users can easily set up and use. Users can easily develop and run applications that process massive amounts of data on Hadoop. It has the following advantages:

  1. High reliability. Hadoop's ability to store and process data bit by bit is trustworthy.

  2. High scalability. Hadoop distributes data and completes computing tasks among the available computer clusters. These clusters can be easily expanded to thousands of nodes.

  3. High efficiency. Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.

  4. High fault tolerance. Hadoop can automatically save multiple copies of data and automatically redistribute failed tasks.

  Hadoop's framework is written in Java, so it is ideally run on a Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.
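
  As a concrete illustration of Hadoop's programming model, below is a minimal sketch of the classic MapReduce "word count" job in Java, close to the example shipped with Hadoop's own documentation. It assumes the standard org.apache.hadoop.mapreduce API is on the classpath; the input and output paths are hypothetical command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit a (word, 1) pair for every word in an input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

  The mapper emits a (word, 1) pair for every word, Hadoop groups the pairs by key across the cluster, and the reducer sums the counts. This split into independent map and reduce steps is what lets the same program scale from a single machine to thousands of nodes.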

  Two, HPCC

  HPCC is the abbreviation of High Performance Computing and Communications. In 1993, the Federal Coordinating Council for Science, Engineering, and Technology of the United States submitted to Congress a report on the "Grand Challenges: High Performance Computing and Communications" project, also known as the HPCC plan report, the US President's Science Strategy Project. Its purpose is to solve a number of important scientific and technological challenges by strengthening research and development. HPCC is the United States' plan for implementing the information superhighway, and carrying it out will cost tens of billions of dollars. Its main goals are to develop scalable computing systems and supporting software to sustain terabit-level network transmission performance, to develop gigabit network technology, and to expand the network connection capabilities of research and educational institutions.

  


 

  The project is mainly composed of five parts:

  1. High-performance computer systems (HPCS), including research on future generations of computer systems, system design tools, advanced typical systems, and the evaluation of existing systems;

  2. Advanced Software Technology and Algorithms (ASTA), covering software support for grand-challenge problems, new algorithm design, software branches and tools, computational science, and high-performance computing research centers;

  3. National Research and Education Network (NREN), covering the research and development of relay stations and gigabit-level transmission;

  4. Basic Research and Human Resources (BRHR), covering basic research, training, education, and course materials. It is designed to reward investigator-initiated, long-term research in scalable high-performance computing in order to increase the stream of innovative ideas, to enlarge the pool of skilled and well-trained personnel through improved education and training in high-performance computing and communications, and to provide the infrastructure needed to support these investigations and research activities;

  5. Information Infrastructure Technology and Application (IITA), aimed at ensuring the United States' leading position in the development of advanced information technology.

  Three, Storm

  Storm is a free, open source, distributed, highly fault-tolerant real-time computing system. Storm makes continuous stream computing easy and fills the real-time gap that Hadoop's batch processing cannot cover. Storm is often used for real-time analytics, online machine learning, continuous computation, distributed remote procedure calls, and ETL. Deploying and managing Storm is very simple, and among similar stream computing tools its performance is outstanding.

  


 

  Storm is free and open source software: a distributed, fault-tolerant real-time computing system. Storm can process huge data streams very reliably and is often used where Hadoop's batch processing falls short. Storm is simple, supports many programming languages, and is fun to use. Storm was open sourced by Twitter; other well-known companies using it include Groupon, Taobao, Alipay, Alibaba, Happy Elements, AdMaster, and so on.

  Storm has many application areas: real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call, a protocol for requesting services from a program on a remote computer over the network), ETL (Extraction-Transformation-Loading, that is, data extraction, transformation, and loading), and so on. Storm's processing speed is remarkable: in tests, each node processed one million data tuples per second. Storm is scalable, fault-tolerant, and easy to set up and operate.
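
  As a rough sketch of what continuous stream computing looks like in code, the topology below counts words using the Storm 1.x Java API (package org.apache.storm). The random-word spout is a stand-in for a real stream source such as a message queue, the word list is made up, and local mode is used only for experimentation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {

    // Spout: endlessly emits random words, standing in for a real stream source
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "stream", "tuple", "bolt"};
        private final Random random = new Random();

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: keeps a running count per word -- the "continuous computation" part
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, counts.get(word)));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        // fieldsGrouping routes every occurrence of the same word to the same bolt task
        builder.setBolt("counter", new CountBolt(), 2).fieldsGrouping("words", new Fields("word"));

        LocalCluster cluster = new LocalCluster();   // local mode, for experimentation only
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(10_000);                        // let the topology run for a while
        cluster.shutdown();
    }
}
```

  The spout keeps emitting tuples and the bolt keeps updating its counts for as long as the topology runs, which is exactly the continuous-computation scenario described above.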

  Four, Apache Drill

  To help business users find more effective ways to speed up queries over Hadoop data, the Apache Software Foundation launched an open source project called "Drill". Apache Drill implements Google's Dremel. "Drill" has been run as an Apache incubator project and continues to be promoted to software engineers worldwide.

  


 

  The project will create an open source version of Google's Dremel tool for the Hadoop ecosystem (Google uses this tool to speed up the Internet applications of its data analysis). "Drill" will help Hadoop users query massive data sets more quickly.

  The "Drill" project is actually inspired by Google’s Dremel project: This project helps Google realize the analysis and processing of massive data sets, including analyzing and crawling web documents, tracking application data installed on the Android Market, analyzing spam, and analyzing Test results on Google's distributed build system, etc.

  By developing the "Drill" Apache open source project, organizations are expected to establish Drill's API interface and a flexible and powerful architecture to help support a wide range of data sources, data formats and query languages.

  Five, RapidMiner

  RapidMiner provides machine learning procedures. Its data mining capabilities include data visualization, data processing, statistical modeling, and predictive analytics.

  RapidMiner is a world-leading data mining solution that makes extensive use of advanced technology. Its data mining tasks cover a wide range, and it can simplify the design and evaluation of data mining processes.

  


 

  Functions and features

  Provides data mining technology and libraries for free; written 100% in Java, so it runs on any operating system; the data mining process is simple, powerful, and intuitive; internally uses XML as a standardized format to express and exchange data mining processes; large-scale processes can be automated with a simple scripting language; multi-level data views ensure that data is valid and transparent; interactive prototyping through a graphical user interface; command-line (batch mode) operation for automated large-scale applications; a Java API (application programming interface); a simple plug-in and extension mechanism; a powerful visualization engine with many cutting-edge models for visualizing high-dimensional data; support for more than 400 data mining operators; it (formerly known as YALE) has been successfully applied in many different application areas, including text mining, multimedia mining, feature engineering, data stream mining, integrated development methods, and distributed data mining.

  Limitations of RapidMiner: it has a size limit on the number of rows, and it requires more hardware resources than ODM and SAS.

  Six, Pentaho BI

  The Pentaho BI platform is different from traditional BI products. It is a process-centric, solution-oriented framework whose purpose is to integrate a series of enterprise-level BI products, open source software, APIs, and other components to facilitate the development of business intelligence applications. Its appearance enables a series of independent business intelligence products, such as JFree, Quartz, and others, to be integrated together into a complex and complete business intelligence solution.

  


 

  The Pentaho BI platform, the core architecture and foundation of the Pentaho Open BI suite, is process-centric because its central controller is a workflow engine. The workflow engine uses process definitions to define the business intelligence processes executed on the BI platform. Processes can be easily customized, and new processes can be added. The BI platform contains components and reports for analyzing the performance of these processes. Currently, Pentaho's main components include report generation, analysis, data mining, and workflow management. These components are integrated into the Pentaho platform through technologies such as J2EE, WebService, SOAP, HTTP, Java, JavaScript, and Portals. Pentaho is distributed mainly in the form of the Pentaho SDK.

  The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, a standalone Pentaho platform, a Pentaho solution example, and a pre-configured Pentaho web server. Among them, the Pentaho platform is the most important part of the SDK, containing the main body of the Pentaho platform source code. The Pentaho database provides the data services the platform needs to run normally, including configuration information, solution-related information, and so on; it is not strictly necessary and can be replaced with another database service through configuration. The standalone Pentaho platform is an example of the platform's standalone operation mode, demonstrating how to run the Pentaho platform independently without the support of an application server.

  The Pentaho solution example is an Eclipse project that demonstrates how to develop related business intelligence solutions for the Pentaho platform.

  The Pentaho BI platform is built on a foundation of servers, engines, and components, which provide the system's J2EE server, security, portal, workflow, rule engine, charting, collaboration, content management, data integration, analysis, and modeling capabilities. Most of these components are standards-based and can be replaced with other products.

  Seven, Druid

  Two well-known open source projects share the name Druid: Apache Druid, a real-time analytics data store, and Alibaba Druid, widely regarded as the best database connection pool in the Java language. The latter provides powerful monitoring and extension features.
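
  For the connection-pool side (Alibaba Druid), the following is a minimal sketch of configuring a DruidDataSource in Java. The MySQL URL, user name, and password are hypothetical placeholders, and the "stat" filter is what feeds Druid's built-in statistics and monitoring.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import com.alibaba.druid.pool.DruidDataSource;

public class DruidPoolExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL connection settings; replace with your own database
        DruidDataSource dataSource = new DruidDataSource();
        dataSource.setUrl("jdbc:mysql://localhost:3306/test");
        dataSource.setUsername("root");
        dataSource.setPassword("secret");
        dataSource.setInitialSize(5);   // connections created up front
        dataSource.setMaxActive(20);    // upper bound on pooled connections
        dataSource.setFilters("stat");  // enables Druid's built-in statistics/monitoring filter

        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println("query returned: " + rs.getInt(1));
            }
        } finally {
            dataSource.close();
        }
    }
}
```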

  


 

  Eight, Ambari

  Ambari is a tool for building and monitoring big data platforms; CDH is a similar offering.

  1. Provision a Hadoop cluster

  Ambari provides a step-by-step guide for installing Hadoop services on any number of hosts.

  Ambari handles the configuration of cluster Hadoop services.

  2. Manage Hadoop cluster

  Ambari provides central management for starting, stopping and reconfiguring Hadoop services for the entire cluster.

  3. Monitor the Hadoop cluster

  Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster; the same information is also available programmatically through Ambari's REST API (a minimal sketch follows).
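
  As a small sketch of how that monitoring information can be pulled programmatically, the Java snippet below calls Ambari's REST API to list the clusters it manages. The server address and the default admin/admin credentials are assumptions that would need to be adjusted for a real installation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClusterList {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ambari server address and default credentials; adjust for your cluster
        URL url = new URL("http://ambari-server:8080/api/v1/clusters");
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Basic " + auth);
        // Header Ambari requires on modifying requests; harmless on a read like this
        conn.setRequestProperty("X-Requested-By", "ambari");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing of the clusters this Ambari instance manages
            }
        } finally {
            conn.disconnect();
        }
    }
}
```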

  


 

  Nine, Spark

  Spark is a large-scale data processing framework that can cope with three common enterprise data processing scenarios: complex batch data processing, interactive queries based on historical data, and data processing based on real-time data streams. A related infrastructure piece, Ceph, is a distributed file system for Linux.
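
  A minimal sketch of the batch and interactive-query side in Java is shown below, assuming Spark's spark-sql module is on the classpath; the orders.csv file and its customer_id column are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkBatchExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BatchQueryExample")
                .master("local[*]")   // run locally; use a cluster URL in production
                .getOrCreate();

        // Hypothetical input file: a CSV of historical orders
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("orders.csv");

        // Interactive-style aggregation over historical data
        orders.groupBy("customer_id").count().show();

        spark.stop();
    }
}
```

  The same Dataset/DataFrame API also underlies Spark's Structured Streaming, which is how one engine covers batch, interactive, and streaming workloads.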

  


 

  Ten, Tableau Public

  1. What is Tableau Public-Big Data Analysis Tool

  Tableau Public is a simple and intuitive tool that delivers interesting insights through data visualization. Despite Tableau Public's million-row limit, it is easier to use than most other players in the data analysis market. With Tableau's visualizations, you can investigate a hypothesis, explore the data, and cross-check your insights.

  2. Use of Tableau Public

  You can publish interactive data visualizations to the web for free; no programming skills are required; visualizations published to Tableau Public can be embedded in blogs and shared by email or social media; the shared content can also be downloaded. This makes it one of the best big data analysis tools.

  3. Limitations of Tableau Public

  All data is public, and the options for restricting access are very limited; the size of the data is limited; it cannot be connected to R; the only way to read data is through OData sources, Excel, or txt files.

  Eleven, OpenRefine

  1. What is OpenRefine-data analysis tool

  OpenRefine is data cleaning software formerly known as Google Refine. It helps you clean up data for analysis. It operates on rows of data, with cells placed under columns, very much like a relational database table.

  2. Use of OpenRefine

  Cleaning up messy data; transforming data; parsing data from websites; adding data to a data set by fetching it from web services. For example, OpenRefine can be used to geocode addresses into geographic coordinates.

  3. Limitations of OpenRefine

  OpenRefine is not suitable for large data sets; refining does not work well on big data.

  Twelve, KNIME

  1. What is KNIME-data analysis tool

  KNIME helps you manipulate, analyze and model data through visual programming. It is used to integrate various components for data mining and machine learning.

  2. The purpose of KNIME

  Rather than writing blocks of code, you drag and drop connection points between activities; the data analysis tool supports multiple programming languages; in fact, it can be extended to run analyses of chemical data, text mining, Python, and R.

  3. Limitations of KNIME

  Poor data visualization

  Thirteen, Google Fusion Tables

  1. What is Google Fusion Tables

  As data tools go, Google Fusion Tables is a cooler, bigger version of Google Spreadsheets: an incredible tool for analyzing, mapping, and visualizing large data sets. Google Fusion Tables can also be added to the list of business analysis tools, and it is one of the best big data analysis tools as well.

  2. Use Google Fusion Tables

  Visualize larger tables of data online; filter and summarize across hundreds of thousands of rows; combine tables with other data on the web; merge two or three tables to generate a single visualization that includes both data sets.

  3. Limitations of Google Fusion Tables

  Only the first 100,000 rows of data in the table are included in the query results or have been mapped; the total size of the data sent in one API call cannot exceed 1MB.

  Fourteen, NodeXL

  1. What is NodeXL

  NodeXL is visualization and analysis software for relationships and networks. It provides exact calculations, is free (in its non-professional edition) and open source, and is one of the best statistical tools for data analysis. It includes advanced network metrics, as well as importers for social media network data and automation.

  2. The purpose of NodeXL

  This is a data analysis tool in Excel that can help achieve the following aspects:

  Data import; graph visualization; graph analysis; data representation. The software integrates into Microsoft Excel 2007, 2010, 2013, and 2016, opening as a workbook with a variety of worksheets that contain the elements of a graph structure, such as nodes and edges. The software can import various graph formats, such as adjacency matrices, Pajek (.net), UCINet (.dl), GraphML, and edge lists.

  3. Limitations of NodeXL

  You may need to use multiple seed terms for a particular problem and run data extractions at slightly different times.

  Fifteen, Wolfram Alpha

  1. What is Wolfram Alpha

  It is a computational knowledge engine or response engine created by Stephen Wolfram.

  2. Use of Wolfram Alpha

  It is an add-on for Apple's Siri; it provides detailed responses to technical searches and solves calculus problems; it helps business users obtain infographics and charts, and helps create topic overviews, product information, and advanced pricing histories.

  3. Limitations of Wolfram Alpha

  Wolfram Alpha can only handle publicly known figures and facts, not opinions, and it limits the computation time for each query.

  Sixteen, Google search operators

  1. What is a Google search operator

  Google search operators are a powerful resource that helps you filter Google results so that you immediately get the most relevant and useful information.

  2. The use of Google search operators

  Filter Google search results faster; Google's powerful data analysis tool can help discover new information. For example, operators such as site:, filetype:, and quoted exact phrases narrow results to a single site, a specific file type, or an exact match.

  Seventeen, Excel solver

  1. What is an Excel solver

  The Solver add-in is a Microsoft Office Excel add-in program that becomes available when you install Microsoft Excel or Office. It is a linear programming and optimization tool in Excel that lets you set constraints, and it is an advanced optimization tool that helps solve problems quickly.

  2. Use of solver

  The final values found by Solver are a solution to the interrelated decision variables; it uses a variety of methods, from nonlinear optimization and linear programming to evolutionary and genetic algorithms, to find solutions.
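
  As a small, made-up illustration of the kind of linear program Solver handles, consider choosing production quantities x and y of two products under two resource limits (all numbers hypothetical):

```latex
\begin{aligned}
\text{maximize}\quad   & 20x + 30y          && \text{(total profit)} \\
\text{subject to}\quad & 2x + 4y \le 100    && \text{(machine hours)} \\
                       & x + y \le 40       && \text{(labour hours)} \\
                       & x \ge 0,\; y \ge 0
\end{aligned}
```

  In Excel, the profit formula would sit in the objective cell, x and y in the changing cells, and each resource limit would be entered as a constraint before running Solver.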

  3. Limitations of the solver

  Poor scaling is one of the areas where Excel Solver falls short; this can affect solution time and quality; Solver also affects the intrinsic solvability of your model.

  Eighteen, Dataiku DSS

  1. What is Dataiku DSS

  Dataiku DSS is a collaborative data science software platform. It helps teams build, prototype, and explore, and lets them deliver their own data products more efficiently.

  2. Use of Dataiku DSS

  Dataiku DSS provides an interactive visual interface, so users can build by pointing and clicking or by using languages such as SQL.

  3. Limitations of Dataiku DSS

  Limited visualization capabilities; UI obstacles such as having to reload code and data sets; no easy way to compile all of the code into a single document or notebook; and it still needs to be integrated with Spark.

  The tools above are only some of those used for big data analysis; I will not list them one by one. The purposes of some of these tools are categorized below:

  1. Front-end display

  The front-end open source tools used for presentation analysis include JasperSoft, Pentaho, Spagobi, Openi, Birt, etc.

  Commercial analysis tools for display analysis include Style Intelligence, RapidMiner Radoop, Cognos, BO, Microsoft Power BI, Oracle, Microstrategy, QlikView, and Tableau.

  Domestically, there are BDP, National Cloud Data (magic mirror of big data analysis), Sematic, FineBI and so on.

  2. Data warehouse

  There are Teradata AsterData, EMC GreenPlum, HP Vertica, etc.

  3. Data Mart

  There are QlikView, Tableau, Style Intelligence and so on.

 

Retrieved from: https://www.aaa-cg.com.cn/data/1756.html


Origin blog.csdn.net/yuuEva/article/details/108363024