25 Big Data Terms Every Beginner Needs to Know

1. Algorithm. How is "algorithm" related to big data? Although "algorithm" is a general term, big data analytics has made it more popular and fashionable than ever.

2. Analysis. Imagine the year-end report of all your transactions that your credit card company sends you. What if you go further and analyze your own spending on food, clothing, entertainment, and the rest? Then you are doing "analysis": learning from a pile of raw data to help you make decisions about next year's spending. Now do the same exercise for the Twitter or Facebook posts of an entire city's population, and you are talking about big data analysis. The essence of big data analysis is using large amounts of data to draw inferences and tell a story. Big data analytics comes in three distinct types, which we discuss in turn below.

3. Descriptive analysis. If you just tell me that 25% of last year's credit card bill went to food, 35% to clothing, 20% to entertainment, and the rest to a hodgepodge of items, that is descriptive analysis. Of course, you can drill into far more detail.
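
As a minimal sketch of descriptive analysis, the Python snippet below summarizes a pile of transactions as a percentage breakdown by category; the category names and amounts are made-up illustration data:

```python
# Made-up transactions for illustration: (category, amount).
transactions = [
    ("food", 120.0), ("clothing", 80.0), ("food", 45.5),
    ("entertainment", 60.0), ("food", 30.0), ("clothing", 95.0),
]

# Total spending per category.
totals = {}
for category, amount in transactions:
    totals[category] = totals.get(category, 0.0) + amount

# Report each category as a share of overall spending.
grand_total = sum(totals.values())
for category, amount in sorted(totals.items()):
    print(f"{category}: {amount / grand_total:.1%} of total spending")
```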

4. Predictive analysis. If you analyze your credit card records for the past five years and the split shows a certain continuity, you can predict with high probability that next year will look much like the past few years. The detail to note here is that this is not "predicting the future"; it is only about the "probability" of what may happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and advanced statistical processing (both terms are introduced later) to predict the weather, economic changes, and so on.
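
Sticking with the credit card example, here is a toy sketch of the idea, not a real forecasting model: if the split has been stable for a few years, a naive prediction for next year is simply the historical average per category (all figures are invented):

```python
# Invented history: spending per category for the past few years.
history = {
    2019: {"food": 2500, "clothing": 1800, "entertainment": 1200},
    2020: {"food": 2600, "clothing": 1700, "entertainment": 1300},
    2021: {"food": 2550, "clothing": 1750, "entertainment": 1250},
}

# Naive "prediction": the average of the historical years per category.
categories = ["food", "clothing", "entertainment"]
forecast = {
    c: sum(year[c] for year in history.values()) / len(history)
    for c in categories
}
print("Likely spending next year:", forecast)
```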

5. Prescriptive analysis. Continuing with the credit card example: you may want to find out which spending category (food, clothing, entertainment, etc.) has the biggest impact on your overall spending. Prescriptive analysis builds on predictive analysis by including records of "actions" (such as reducing food, clothing, or entertainment expenses) and analyzing the outcomes in order to "prescribe" the best category to cut to lower overall spending. You can try extrapolating this to big data, and imagine how executives could make data-driven decisions by looking at the impact of various possible actions.
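
A minimal prescriptive sketch, continuing the toy example above; the candidate actions and their savings rates are entirely hypothetical:

```python
# Forecast from the predictive step; actions and savings rates are
# hypothetical estimates of how much of a category could be cut.
forecast = {"food": 2550.0, "clothing": 1750.0, "entertainment": 1250.0}
actions = {
    "cook at home more": ("food", 0.15),
    "buy fewer new clothes": ("clothing", 0.25),
    "choose cheaper entertainment": ("entertainment", 0.20),
}

# "Prescribe" the action with the largest predicted saving.
savings = {
    name: forecast[category] * rate
    for name, (category, rate) in actions.items()
}
best = max(savings, key=savings.get)
print(f"Recommended action: {best} (saves about {savings[best]:.0f})")
```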

6. Batch processing. Although batch data processing existed as far back as the mainframe era, it gained extra significance once big data handed it even larger data sets to process. Batch data processing is an efficient way to handle large volumes of data by collecting a set of transactions over a period of time and processing them together. Hadoop, introduced later, focuses on batch data processing.

7. Cassandra. Cassandra is a popular open source database management system managed by the Apache Software Foundation. Many big data technologies owe a debt to Apache, and Cassandra is one of them: it was designed to handle large amounts of data across distributed servers.

8. Cloud computing. Cloud computing is now so ubiquitous that it probably goes without saying, but it is included here for completeness. The essence of cloud computing is software and/or data hosted and running on remote servers, accessible from anywhere on the Internet.

9. Cluster computing. This is a fancy term for computing that uses the pooled resources of multiple servers in a "cluster". Once we get more technical, we may also talk about nodes, cluster management, load balancing, and parallel processing.

10. Dark data. In my view, this term applies to any senior manager it scares out of their wits. Fundamentally, dark data is data that companies collect and process but never use for any meaningful purpose, hence "dark": it may stay buried forever. It could be social network feeds, call center logs, meeting notes, and so on. Many estimates suggest that 60-90% of all enterprise data may be "dark data", but no one really knows.

11. Data lake. When I first heard this term, I honestly thought it was an April Fool's joke. But it really is a term! A data lake is a large repository of enterprise data kept in its raw format. While we are discussing data lakes, it is worth also mentioning data warehouses, because the two are conceptually similar: both are enterprise-wide data repositories. They differ in that a warehouse holds data that has been cleaned, integrated from other sources, and structured, and warehouses are commonly (though not exclusively) used for conventional data. A data lake is said to give users easy access to enterprise data; users just need to know what they are looking for, how to process it, and how to use it intelligently.

12. Data mining. Data mining is the use of sophisticated pattern recognition techniques to find meaningful patterns in large amounts of data and extract insights. It is closely related to the term "analysis" discussed earlier, where we analyzed personal spending data. To extract meaningful patterns, data mining uses statistics (yes, good old math), machine learning algorithms, and artificial intelligence.
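
As one small illustration, a classic mined pattern is items that frequently occur together. The sketch below, in pure Python with invented shopping baskets, counts co-occurring pairs, a very simple form of association analysis:

```python
from itertools import combinations
from collections import Counter

# Invented shopping baskets.
baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidate "patterns".
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```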

13. Data scientist. Now there is a hot profession! A data scientist can extract raw data (dare I say, from the data lake mentioned above?), process it, and then come up with new insights. The skills a data scientist needs are not far from Superman's: analysis, statistics, computer science, creativity, storytelling, and an understanding of the business context. No wonder they command such high salaries.

14. Distributed file system. Since big data is too large to store on a single system, a distributed file system is a data storage system that makes it convenient to store large amounts of data across multiple storage devices, helping to reduce the cost and complexity of storing massive volumes of data.

15. ETL. ETL is the acronym for extract, transform, load. It refers to the whole process of "extracting" raw data, "transforming" it into a fit-for-use form by cleaning and enriching it, and "loading" it into the appropriate repository for the system to use. Although the concept originated with data warehouses, ETL also applies to other scenarios, such as ingesting data from external sources into a big data system.
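
Here is a bare-bones ETL sketch in Python: extract raw CSV rows, transform (clean and reshape) them, and load them into an in-memory SQLite table standing in for the target repository. The file name and schema are invented for illustration:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV export (hypothetical file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: drop rows with a missing amount, normalize the category.
    for row in rows:
        if not row.get("amount"):
            continue
        yield (row["category"].strip().lower(), float(row["amount"]))

def load(records, conn):
    # Load: write the cleaned records into the target repository
    # (an in-memory SQLite table stands in for it here).
    conn.execute("CREATE TABLE IF NOT EXISTS spend (category TEXT, amount REAL)")
    conn.executemany("INSERT INTO spend VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("transactions.csv")), conn)
```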

16. Hadoop. When people think of big data, Hadoop immediately comes to mind. Hadoop (which has a cute elephant logo) is an open source software framework whose major component is the Hadoop Distributed File System (HDFS); Hadoop supports the storage, retrieval, and analysis of large data sets deployed on distributed hardware. If you really want to impress someone, you can also mention YARN (Yet Another Resource Negotiator), which, as its name suggests, is yet another resource scheduler. I sincerely admire the people who name these projects. The Apache Foundation, which named Hadoop, also came up with Pig, Hive, and Spark (yes, these are all names of software). Don't these names impress you?

17. In-memory computing. In general, any computation that avoids touching I/O is expected to be faster than one that has to access I/O. In-memory computing is a technique for moving working data sets entirely into a cluster's collective memory, avoiding writing intermediate results to disk. Apache Spark is an in-memory computing system, which gives it a huge advantage over I/O-bound systems such as Hadoop MapReduce.

18. IoT. The latest buzzword is the Internet of Things (IoT). IoT is the interconnection, over the Internet, of computing devices embedded in everyday objects (sensors, wearables, cars, refrigerators, etc.) that can send and receive data. IoT generates enormous amounts of data, presenting many more opportunities for big data analysis.

19. Machine learning. Machine learning is a method of designing systems that can continuously learn, adjust, and improve based on the data they are fed. Machine learning uses predictive and statistical algorithms and focuses on learning the "correct" behaviors and insights; as more and more data flows into the system, it keeps being optimized and improved. Typical applications include fraud detection and personalized online recommendations.
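
A tiny supervised learning sketch in the spirit of the fraud detection example; it assumes scikit-learn is available (the article names no specific library), and the transactions and labels are synthetic stand-ins:

```python
from sklearn.linear_model import LogisticRegression

# Synthetic transactions: [amount, hour_of_day]; label 1 = fraud.
X = [[20, 14], [15, 10], [900, 3], [25, 16],
     [1200, 2], [30, 12], [800, 4], [18, 9]]
y = [0, 0, 1, 0, 1, 0, 1, 0]

# Fit a simple classifier, then score two unseen transactions.
model = LogisticRegression().fit(X, y)
print(model.predict([[1000, 3], [22, 13]]))  # expected: [1 0]
```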

20. MapReduce. MapReduce can be a bit confusing, but let me try. MapReduce is a programming model best understood by treating Map and Reduce as two separate steps. First, the model splits a large data set into pieces (the technical term is "tuples", but this article does not want to get too technical) that can be deployed on different machines in different locations (i.e., the cluster computing described earlier); that is essentially the Map part. Next, the model collects all the results and "reduces" them into a single report. MapReduce's data processing model and Hadoop's distributed file system complement each other.
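
The classic illustration is word counting. The sketch below runs both phases in a single Python process for clarity; a real framework such as Hadoop would distribute the map and reduce work across a cluster:

```python
from collections import defaultdict

documents = ["big data is big", "data beats opinions"]

# Map phase: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: collapse each group to a single (word, total) result.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)  # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}
```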

21. NoSQL. At first this may sound like a protest against the SQL (Structured Query Language) of traditional relational database management systems (RDBMS), but NoSQL actually stands for NOT ONLY SQL, meaning "not only SQL". NoSQL refers to database management systems designed to handle large volumes of unstructured data, or data that does not fit the "tables" of a relational database. NoSQL databases are generally well suited to large-scale data systems, thanks to their flexibility and the distributed architecture that large unstructured databases require.
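
To make the contrast concrete, here is a sketch of the document-oriented NoSQL idea, using plain Python dicts as stand-ins for a real document store: records in one collection need not share a fixed schema, unlike rows in a relational table:

```python
# Plain dicts stand in for documents in a real document store.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Lin", "followers": 5200, "tags": ["ml", "iot"]},
]

# Documents need not share a schema; "query" by field presence.
with_tags = [u for u in users if "tags" in u]
print(with_tags)
```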

22. R. Can anyone think of a worse name for a programming language? Yes, 'R' is a programming language that does exceptionally well at statistical computing. If you do not know 'R', you are not a data scientist. (And if you do not know 'R', please do not send me that bad code.) R is one of the most popular languages in data science.

23. Spark (Apache Spark). Apache Spark is a fast in-memory data processing engine that can efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to data sets. Spark is usually much faster than the MapReduce we discussed earlier.
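
A minimal PySpark sketch of the same word count from the MapReduce entry, assuming pyspark is installed and a local Spark runtime is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Two toy lines of text as a one-column DataFrame.
df = spark.createDataFrame(
    [("big data is big",), ("data beats opinions",)], ["line"]
)

# Split each line into words, then count occurrences per word.
words = df.select(explode(split(df.line, " ")).alias("word"))
words.groupBy("word").count().show()

spark.stop()
```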

24. Stream processing. Stream processing is designed for "continuous", real-time querying and manipulation of streams of data. Combined with streaming analytics (i.e., the ability to continuously compute mathematical or statistical analyses as the data flows by), stream processing solutions are built to handle very large volumes of data in real time.
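
As a toy sketch of the idea, the snippet below consumes a simulated stream of sensor readings and keeps a continuously updated running average, the kind of rolling computation streaming analytics performs over live data:

```python
import random

def sensor_stream(n):
    # Stand-in for an unbounded real-time source of readings.
    for _ in range(n):
        yield random.uniform(10.0, 30.0)

# Continuously maintain a running average as readings arrive.
count, total = 0, 0.0
for reading in sensor_stream(1000):
    count += 1
    total += reading
    if count % 250 == 0:
        print(f"after {count} readings, running average = {total / count:.2f}")
```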

25. Structured and unstructured data. This is the "Variety" (diversity) among big data's 5 Vs. Structured data is essentially anything that can be put into a relational database and organized into tables that can be related to other data. Unstructured data is everything that cannot be stored directly in a relational database, such as emails, social media posts, and voice recordings.
