The 75 Big Data Terms You Should Know

Recently, Ramesh Dontha published two articles on DataConomy that give a concise and comprehensive introduction to 75 core big data terms. They make good introductory material for big data beginners, and practitioners can use them as a refresher to reach the next level. The articles are split into Part I (25 terms) and Part II (50 terms).

Part I (25 terms)

If you are new to big data, the field can seem hard to understand and you may not know where to start. The list of 25 big data terms below is a good place to begin, so let's get started.

Algorithm: an algorithm can be understood as a mathematical or statistical procedure applied during data analysis. So what does "algorithm" have to do with big data? Even though algorithm is a generic term, in this era of popular big data analytics, algorithms come up constantly and have become more and more fashionable.

Analytics: Imagine a very likely situation: your credit card company emails you a record of the year's transactions, and you take that statement and start seriously studying what percentage you spent on food, clothing, entertainment, and so on. You are doing analytics: you are digging useful information out of your raw data (information that can help you decide your own spending for the coming year). Now, what if you did the same kind of thing with the Twitter and Facebook posts of an entire city's population? That is what we call big data analytics. Big data analytics is about drawing inferences from large amounts of data and extracting the useful information they contain. There are three distinct types of analytics, so let's look at them one by one.

Descriptive Analytics: If you simply tell yourself about last year's credit card spending — 25% on food, 35% on clothing, 20% on entertainment, and the remaining 20% on miscellaneous expenses — that is descriptive analytics. Of course, you can dig into far more detail.
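
As a toy illustration (all numbers made up, and pandas is just one convenient tool for this), that percentage breakdown is a few lines of code:

```python
# A minimal descriptive-analytics sketch: summarize a year of (made-up)
# credit card spending as percentages per category.
import pandas as pd

spend = pd.DataFrame({
    "category": ["food", "clothing", "entertainment", "misc"],
    "amount":   [2500, 3500, 2000, 2000],   # hypothetical totals
})
spend["pct_of_total"] = 100 * spend["amount"] / spend["amount"].sum()
print(spend)   # food 25%, clothing 35%, entertainment 20%, misc 20%
```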

Predictive Analytics: If you analyzed your credit card history for the past five years and found the annual spending follows an essentially consistent trend, you could predict with high probability that next year's spending will look much like past years'. This does not mean we are predicting the future; rather, it should be understood as forecasting what may happen, with probabilities attached. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and sophisticated statistical methods (both discussed later) to predict the weather, economic changes, and so on.
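
A minimal sketch of the idea, assuming made-up yearly totals and a plain linear trend (real predictive analytics would use far richer models):

```python
# Fit a linear trend to five years of hypothetical annual spending and
# extrapolate one year ahead.
import numpy as np

years = np.array([2015, 2016, 2017, 2018, 2019])
spend = np.array([9200, 9500, 9800, 10100, 10400])  # made-up totals

slope, intercept = np.polyfit(years, spend, deg=1)
print(f"Predicted 2020 spend: {slope * 2020 + intercept:.0f}")  # ~10700
```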

Prescriptive Analytics: Still using the credit card example: suppose you want to find out which kind of your own spending (food, entertainment, clothing, and so on) has the biggest effect on your overall spending. Prescriptive analytics builds on predictive analytics by introducing "actions" (such as reducing food, clothing, or entertainment spending) and analyzing the resulting outcomes, in order to prescribe the spending category you can best cut to reduce the overall total. You can extend this to big data and imagine an executive examining the effects of a variety of actions before making decisions — so-called "data-driven" decision-making.

Batch Processing: Although batch data processing has existed since the mainframe era, it gained extra significance once big data presented huge volumes to process. Batch processing is an efficient way to process large amounts of data (e.g., a pile of transactions collected over some period of time). Hadoop, the distributed computing framework discussed later, is specifically focused on batch data processing.

Cassandra is a very popular open-source distributed database management system, developed and maintained by the Apache Software Foundation. Apache controls a lot of big data processing technology; Cassandra is its system designed specifically to handle large amounts of data across distributed servers.

Cloud Computing: The term cloud computing is now a household name and hardly needs explanation, but for the completeness of this chapter an entry is included here. Essentially, when software and/or data run on remote servers and those resources can be accessed from anywhere on the network, that can be called cloud computing.

Cluster Computing: a fancy term for computing with the pooled resources of multiple servers in a "cluster". At a more technical level, in the context of clusters we may discuss nodes, the cluster management layer, load balancing, parallel processing, and so on.

Dark Data: This is a coinage which, in my opinion, exists to scare people and make senior management sound obscure. Basically, dark data is all the data a company gathers and processes but never actually puts to any meaningful use; in that sense it is "dark", and it may never be analyzed. This data can be social network feeds, call center logs, meeting minutes, and so on. Many estimates suggest that somewhere between 60% and 90% of all company data may be dark, but in truth no one knows.

Data Lake: When I first heard this term, I genuinely thought it was an April Fool's joke. But it really is a term. A data lake is a repository that keeps a large amount of enterprise-level data in its raw, original format. While we are here, let's introduce the data warehouse, a concept similar to the data lake; the difference is that a warehouse stores structured data that has been cleaned and integrated with other sources. Data warehouses are often used for general-purpose data (though not necessarily). The common view is that a data lake makes it easier for people to reach the data they actually need, and also easier to process it and use it effectively.

Data Mining: data mining is about using sophisticated pattern-recognition techniques to identify meaningful patterns in large sets of data and derive relevant insights. It is closely related to the "analytics" described earlier: you first mine the data, then analyze the results. To obtain meaningful patterns, data miners use statistics (a classic old discipline), machine learning algorithms, and artificial intelligence.

Data Scientist: data scientist is a very sexy profession these days. It refers to the group of people who can derive insights by taking raw data (from the data lake we just mentioned), then extracting, processing, and analyzing it. Some of the skills a data scientist is said to need are practically superhuman: analytics, statistics, computer science, creativity, storytelling, and the ability to understand business context. No wonder these folks are so well paid.

Distributed File System: big data is too voluminous to store on a single system; a distributed file system is a file system that stores large amounts of data across multiple storage devices, which can significantly reduce the cost and complexity of storing big data.

ETL: ETL stands for extract, transform, and load. It refers to the process of "extracting" raw data, "transforming" it into a form "fit for use" through cleaning and enrichment, and "loading" it into the appropriate repository for systems to use. Even though the ETL concept comes from data warehousing, the process also applies whenever data is ingested, e.g., when a big data system takes in data from external sources.
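
Here is a minimal ETL sketch in Python; the CSV file name, column names, and SQLite table are all illustrative assumptions:

```python
# Extract rows from a raw CSV, transform them (clean types, skip bad
# records), and load them into a small SQLite "warehouse" table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        try:  # normalize names, coerce amounts; drop malformed records
            yield (row["id"], row["name"].strip().title(), float(row["amount"]))
        except (KeyError, ValueError):
            continue

def load(records):
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("raw_sales.csv")))  # illustrative file name
```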

Hadoop: When people think of big data, they immediately think of Hadoop. Hadoop is an open-source software framework (its logo is a cute elephant) built on the Hadoop Distributed File System (HDFS); it enables storing, abstracting, and analyzing big data on distributed hardware. If you really want to impress someone, tell them about YARN (Yet Another Resource Negotiator), which, as the name says, is yet another resource scheduler. I am genuinely impressed by whoever comes up with these names. The Apache Foundation, which proposed Hadoop, is also responsible for Pig, Hive, and Spark (yes, those are all names of software). Don't these names amaze you?

In-Memory Computing: It is generally accepted that any computation that avoids I/O access will be faster. In-memory computing is a technique of moving the entire working data set into the cluster's collective memory, avoiding writes of intermediate results to disk during computation. Apache Spark is an in-memory computing system, and this gives it a huge advantage over I/O-bound systems such as MapReduce.

Internet of Things (IoT): The latest buzzword is the Internet of Things (IoT). IoT is the interconnection, via the internet, of computing devices embedded in everyday objects (such as sensors, wearables, vehicles, refrigerators, and so on), which lets them send and receive data. IoT generates vast amounts of data, which brings plenty of opportunities for big data analytics.

Machine Learning: machine learning is a method of designing systems that learn, adjust, and improve based on the data fed to them. Using predictive and statistical algorithms, such systems keep converging toward "correct" behavior and insights, and they improve further as more data flows into the system.
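
As a small, self-contained illustration (using scikit-learn and synthetic data; any ML library would do), here a model is fit on labeled examples and then scored on examples it has never seen:

```python
# Train a classifier on synthetic labeled data and measure how well it
# generalizes to held-out examples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```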

MapReduce: MapReduce can be a bit difficult to understand, so let me try to explain it. MapReduce is a programming model, best understood by noting that Map and Reduce are two distinct steps. In MapReduce, the model first splits a large data set into small pieces (the technical jargon for these is "tuples", but in this article I will try to avoid obscure terminology), then distributes those pieces to different computers in different locations (that is, the cluster described earlier); this is the Map step. The model then collects all the results and "reduces" them into one answer. MapReduce's data processing model and the Hadoop Distributed File System go hand in hand.
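
The model is easiest to see in miniature. This toy word count mimics the two steps in plain Python (on a real cluster, each chunk would be mapped on a different machine):

```python
# Map: each chunk independently emits (word, 1) pairs.
# Reduce: pairs are grouped by word and the counts are summed.
from collections import defaultdict

chunks = ["big data is big", "data needs processing", "big processing"]

mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

counts = defaultdict(int)
for word, n in mapped:          # "shuffle" + reduce
    counts[word] += n
print(dict(counts))             # {'big': 3, 'data': 2, ...}
```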

NoSQL: The name sounds almost like a protest against "SQL, the Structured Query Language" that traditional relational database management systems (RDBMS) require, but NoSQL actually stands for "Not Only SQL". NoSQL refers to database management systems designed to handle large volumes of data that has no structure (or no "schema"). NoSQL suits big data systems well, because large unstructured, distributed databases need exactly the flexibility that NoSQL databases offer as a first-class feature.

R Language: Could anyone have given a programming language a worse name? R is that language. That said, R is a very good language for statistical work. If you don't know R, don't claim to be a data scientist, because R is one of the most popular programming languages in data science.

Apache Spark: Apache Spark is a fast, in-memory data processing engine that can efficiently execute streaming, machine learning, and SQL workloads requiring fast iterative access to data sets. Spark is usually much faster than the MapReduce we discussed above.
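
A minimal PySpark sketch of the same word count (this assumes pyspark is installed and a local Spark runtime is available); intermediate results stay in memory rather than being written to disk between stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["big data is big", "data in memory"])
counts = (lines.flatMap(lambda line: line.split())   # map phase
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # reduce phase
print(counts.collect())
spark.stop()
```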

Stream Processing: stream processing is designed to handle continuous streams of data. Combined with streaming analytics (the ability to compute numeric and statistical analyses continuously), stream processing is especially suited to the real-time processing of large-scale data.
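
A toy sketch of how this differs from batch: here a generator stands in for an unbounded stream, and a running statistic is updated per event instead of waiting for a complete data set:

```python
import random

def sensor_stream(n_events):            # stand-in for an endless source
    for _ in range(n_events):
        yield random.gauss(20.0, 2.0)   # e.g., temperature readings

count, total = 0, 0.0
for reading in sensor_stream(1000):
    count += 1
    total += reading
    if count % 250 == 0:                # emit continuous analytics
        print(f"events={count} running_mean={total / count:.2f}")
```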

Structured vs Unstructured Data: this is one of the big "versus" debates in big data. Structured data is basically anything that can be put into a relational database and organized so that it can be related to other data through tables. Unstructured data is anything that cannot — for example, email messages, social media status updates, human speech, and so on.

Part II (50 terms)

This article is a continuation of the previous one; since that article drew such an enthusiastic response, I decided to introduce 50 more related terms. Let's briefly recap the terms covered in Part I: algorithm, analytics, descriptive analytics, prescriptive analytics, predictive analytics, batch processing, Cassandra (a large-scale distributed data storage system), cloud computing, cluster computing, dark data, data lake, data mining, data scientist, distributed file system, ETL, Hadoop (a platform for developing and running large-scale data processing software), in-memory computing, the Internet of Things, machine learning, MapReduce (a core Hadoop component), NoSQL (non-relational databases), R, Spark (a computing engine), stream processing, structured vs unstructured data.

Now let's get to know the other 50 big data terms.

The Apache Software Foundation (ASF) provides many open-source big data projects — currently more than 350. Explaining them all would take a very long time, so I have picked out just a few popular terms.

Apache Kafka: Named after the Czech writer Franz Kafka, Kafka is used to build real-time data pipelines and streaming applications. It is popular because it can store, manage, and process data streams in a fault-tolerant way and is reportedly very fast. Given that social network environments involve processing huge streams of data, Kafka is currently very much in demand.
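
For flavor, here is a minimal sketch using the third-party kafka-python client; the broker address and the "events" topic are illustrative assumptions (a broker must actually be running for this to work):

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one event to the "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": 42, "action": "click"}')
producer.flush()

# Read events back from the beginning of the topic.
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break   # a real consumer would keep reading the stream
```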

Apache Mahout: Mahout provides a library of prebuilt algorithms for machine learning and data mining, and it can also serve as an environment for creating more algorithms. In other words, the ideal environment for machine learning geeks.

Apache Oozie: In any programming environment, you need a workflow system to schedule and run work through predefined dependencies. Oozie provides exactly that for big data jobs written in languages such as Pig, MapReduce, and Hive.

Apache Drill, Apache Impala, Apache Spark SQL: all three open-source projects offer fast, interactive SQL interaction with Apache Hadoop data. These are very useful if you already know SQL and your data is stored in big data formats (i.e., HBase or HDFS). Apologies for the jargon here.

Apache Hive: Know SQL? If so, you will get along very well with Hive. Hive facilitates reading, writing, and managing large data sets residing in distributed storage using SQL.

Apache Pig: Pig is a platform for creating, querying, and executing routines over large distributed data sets. Its scripting language is called Pig Latin (I am absolutely not making that up, believe me). Pig is said to be easy to understand and learn. But I wonder: how much Pig Latin can one really learn?

Apache Sqoop: a tool for moving data between Hadoop and non-Hadoop data stores (e.g., relational databases and data warehouses).

Apache Storm: a free, open-source, real-time distributed computing system. It makes it easier to process unstructured data continuously and in real time, where Hadoop is used for batch processing.

Artificial Intelligence (AI): Why does AI appear here? Isn't it a separate field, you might ask? All of these technology trends are closely linked, so we might as well pause and keep learning, right? AI develops intelligent machines and software through a combination of hardware and software that can perceive the environment, take the required actions, and keep learning from those actions. Sounds a lot like machine learning, doesn't it? Join me in being "confused".

Behavioral Analytics: Have you ever wondered how Google manages to serve ads for exactly the products and services you need? Behavioral analytics focuses on understanding what consumers and applications do, and how and why they behave in certain ways. It involves making sense of our web browsing patterns, social media interactions, and online shopping activity (shopping carts and so on), connecting these seemingly unrelated data points and attempting to predict outcomes. As an example, right after I searched for a hotel and abandoned my shopping cart, I got a call from a resort vacation line. Need I say more?

Brontobytes: a 1 followed by 27 zeros — the size of the storage units of the digital universe of tomorrow. While we are here, let me also mention Terabyte (TB), Petabyte (PB), Exabyte (EB), Zettabyte (ZB), Yottabyte (YB), and Brontobyte (BB). You will have to read the whole article to appreciate these terms.

Business Intelligence (BI): I will reuse Gartner's definition of BI, as it explains the concept very well: business intelligence is an umbrella term covering the applications, infrastructure, tools, and best practices that enable access to and analysis of information in order to improve and optimize decisions and performance.

Biometrics: This is James Bond-ish technology paired with analytics to identify people by one or more of the human body's physical traits, such as face recognition, iris recognition, or fingerprint recognition.

Clickstream Analytics: the analysis of users' online click data as they browse the web. Ever wondered why certain Google ads linger even after you switch sites? It's because the Google overlords know what you click on.

Cluster Analysis: an exploratory analysis that tries to identify structure within data; it is also called segmentation analysis or taxonomy analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, respondents. Cluster analysis is used to identify groups of cases when the grouping is not previously known. Because it is exploratory, it makes no distinction between dependent and independent variables. The various cluster analysis methods offered by SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
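
A small illustration with scikit-learn on synthetic data (SPSS would be the point-and-click equivalent): k-means assigns each observation to a group without ever seeing labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 synthetic observations drawn from three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster assignment of the first ten observations
```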

Comparative Analytics: Since analysis is the key to big data, in this article I will dig deeper into what analysis means. As the name suggests, comparative analytics uses statistical techniques such as pattern analysis, filtering, and decision-tree analytics to compare multiple processes, data sets, or other objects. I know this is getting less and less technical, but I still cannot entirely avoid the jargon. Comparative analytics can be used in healthcare: by comparing vast numbers of medical records, documents, images, and so on, it enables more effective and more accurate medical diagnoses.

Connection Analytics: You have surely seen those spiderweb-like charts that connect people with topics to identify the influencers on a particular subject. Connection analytics helps discover these related connections and influences between people, products, and systems within a network, or even between combined data from multiple networks.

Data Analyst: the data analyst is an extremely important and popular job; besides preparing reports, it involves collecting, manipulating, and analyzing data. I will write a more detailed article about data analysts.

Data Cleansing: As the name suggests, data cleansing involves detecting and correcting or removing inaccurate data and records from a database. Remember "dirty data"? (It gets its own entry below.) Using automated or manual tools and algorithms, data analysts correct and further enrich data to improve its quality. Remember: dirty data leads to wrong analysis and bad decisions.
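
A tiny pandas sketch of the idea, on a made-up customer table with the usual problems (inconsistent casing, duplicates, impossible values):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", "Carol"],
    "age":  [34, 34, -1, 28],                 # -1 is clearly dirty
})

df["name"] = df["name"].str.title()           # normalize casing
df = df.drop_duplicates()                     # remove repeated records
df = df[df["age"].between(0, 120)]            # drop impossible ages
print(df)
```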

Data as a Service (DaaS): We had software as a service (SaaS) and platform as a service (PaaS), and now we have DaaS: data as a service. By giving users on-demand access to data in the cloud, DaaS providers can help us get high-quality data quickly.

Data Virtualization: a data management approach that lets an application retrieve and manipulate data without needing to know the technical details (e.g., where the data is stored or in what format). For example, social networks use this approach to store our photos.

Dirty Data: Once big data became attractive, people started attaching other adjectives to data to coin new terms: dark data, dirty data, small data, and now smart data. Dirty data is data that is not clean — in other words, data that is inaccurate, duplicated, or inconsistent. Obviously you don't want to get mixed up with dirty data, so fix it as soon as possible.

Fuzzy Logic: How often are we certain about anything, say 100% sure? Very rarely! Our brains aggregate data into partial facts, which are further abstracted into thresholds sufficient to determine our decisions. Fuzzy logic is a kind of computing designed to mimic the human brain in just this way: in contrast to the hard "0" and "1" of Boolean algebra and its kin, it works by gradually admitting partial truth.
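
A minimal sketch of the idea: instead of a hard yes/no, membership in the set "hot" is a degree between 0 and 1 (the 15° and 30° thresholds below are arbitrary choices):

```python
def hot_membership(temp_c, cold=15.0, hot=30.0):
    """Degree to which a temperature belongs to the fuzzy set 'hot'."""
    if temp_c <= cold:
        return 0.0
    if temp_c >= hot:
        return 1.0
    return (temp_c - cold) / (hot - cold)   # linear ramp in between

for t in (10, 20, 25, 35):
    print(t, "->", round(hot_membership(t), 2))   # 0.0, 0.33, 0.67, 1.0
```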

Gamification: In a typical game you have scoring elements, competition with others, and explicit rules of play. Gamification in big data uses those concepts to motivate data collection, analysis, or participation.

Graph Databases: graph databases use concepts such as nodes and edges to represent people or businesses and the relationships between them, and to mine social media data. Ever wondered how Amazon can tell you what other people are buying when you buy a product? Yep, that's a graph database.

Hadoop User Experience (Hue): Hue is an open-source interface that makes Apache Hadoop easier to use. It is a web-based application; it has a file browser for the distributed file system, a job designer for MapReduce, a workflow scheduler using the Oozie framework, a shell, Impala and Hive UIs, and a set of Hadoop APIs.

HANA (High-performance Analytic Appliance): SAP's in-memory software/hardware platform designed for high-volume data transactions and analytics.

HBase: a distributed, column-oriented database. It uses HDFS as its underlying storage and supports both batch computation with MapReduce and transactional interactive processing.

Load Balancing: distributing workload across multiple computers or servers to achieve the best results and fullest use of the system.

Metadata: metadata is data that describes other data. Metadata summarizes basic information about data, making specific instances of data easier to find and work with. For example, author, creation date, modification date, and file size are basic document metadata. Beyond document files, metadata is also used for images, videos, spreadsheets, and web pages.
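
File systems are the everyday example; this snippet writes a throwaway file and reads back some of its metadata:

```python
import datetime
import os

with open("example.txt", "w") as f:   # throwaway file for illustration
    f.write("hello metadata")

info = os.stat("example.txt")         # the metadata, not the contents
print("size (bytes):", info.st_size)
print("modified:    ", datetime.datetime.fromtimestamp(info.st_mtime))
```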

MongoDB: MongoDB is a cross-platform, open-source database built on a document-oriented data model rather than the traditional table-based relational model. This kind of database design mainly aims to make integrating structured and unstructured data faster and easier in certain types of applications.
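
A minimal sketch with the pymongo driver (this assumes a MongoDB server running on localhost; the database and collection names are illustrative). Note that the two documents need not share a schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

users.insert_one({"name": "Alice", "tags": ["admin", "beta"]})
users.insert_one({"name": "Bob", "signup": {"year": 2020}})  # different shape

print(users.find_one({"name": "Alice"}))
```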

Mashup: Fortunately this term has a meaning close to the "mashup" of everyday speech — a mix and match. In essence, a mashup is a way of merging different data sets into a single application (for example, combining real estate data with geographic or demographic data). It really does make for cool visualizations.

Multi-Dimensional Databases: a database optimized for online analytical processing (OLAP) and data warehousing. In case you are wondering what a data warehouse is, it's nothing exotic: just a centralized store of data from multiple data sources.

MultiValue Databases: a multivalue database is a type of non-relational database that can understand three-dimensional data directly, which is good for manipulating HTML and XML strings directly.

Natural Language Processing (NLP): software algorithms designed to let computers understand everyday human language more accurately, enabling humans to interact with computers more naturally and effectively.

Neural Networks: According to http://neuralnetworksanddeeplearning.com/, a neural network is "a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data". It has been a long time since anyone called a programming paradigm beautiful. Neural networks are, in fact, models inspired by the biology of real-life brains, and they are closely related to deep learning: deep learning is a collection of techniques for learning in neural networks.

Pattern Recognition: pattern recognition happens when an algorithm identifies recurring regularities within a large data set or across different data sets. It is closely linked to machine learning and data mining, and is even considered synonymous with them. This visibility can help researchers uncover deep insights — or reach conclusions that might otherwise seem absurd.

Radio Frequency Identification (RFID): RFID is a type of contactless sensor that uses radio-frequency electromagnetic fields to transfer data. With the growth of the Internet of Things, RFID tags can be embedded inside every possible "thing", generating a great deal of data that needs to be analyzed. Welcome to the data world.

Software as a Service (SaaS): Software as a Service enables providers to host applications over the internet; SaaS providers deliver their services in the cloud.

Semi-Structured Data: semi-structured data is data that is not formatted in conventional ways, such as the data fields or data models used by traditional databases. It is neither completely raw nor completely unstructured; it may contain some tables, tags, or other structural elements. Examples of semi-structured data include graphs, tables, XML documents, and email. Semi-structured data is very common on the web and is often found in object-oriented databases.

Sentiment Analysis: sentiment analysis involves capturing, tracking, and analyzing the emotions and opinions consumers express across all kinds of interactions and documents — social media, customer service calls, surveys, and the like. Text analytics and natural language processing are typical techniques used in the sentiment analysis process. The goal of sentiment analysis is to identify or evaluate the attitudes and feelings held toward a company, product, service, person, or event.
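
As a deliberately crude sketch, here is lexicon-based scoring with tiny, made-up word lists; production systems use NLP models rather than word counting:

```python
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "slow"}

def sentiment(text):
    words = text.lower().replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product, the service is excellent"))  # positive
print(sentiment("terrible support and slow shipping"))              # negative
```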

Spatial Analysis: spatial analysis refers to analyzing spatial data — data with geometric or topological properties — to identify the patterns and regularities distributed across geometric space.

Stream Processing: stream processing is designed to run "continuous" queries against real-time "streaming data". Given the clear need on social networks for continuous, real-time numeric and statistical analysis of massive data streams at very high speed, stream processing is in strong demand.

Smart Data: data that is actionable and useful after being processed by some algorithm.

Terabyte: a fairly large unit of data; 1 TB equals 1,000 GB. It is estimated that 10 TB could hold the entire printed collection of the Library of Congress, while 1 TB could hold the whole Encyclopaedia Britannica.

Visualization: with sound visualizations, raw data can be put to use. Of course, visualization here means more than ordinary charts; it includes complex charts that can incorporate many variables of data while staying readable and understandable.

Yottabytes: approximately 1,000 zettabytes, or roughly 250 trillion DVDs. The entire digital universe today amounts to about 1 yottabyte, and that figure is said to double every 18 months.

Zettabytes: approximately 1,000 exabytes, or 1 billion terabytes.
