South China Agricultural University 2021 Spring "Cloud Computing and Big Data" Final Exam Review Paper

Foreword

It is my great honor that you are reading this article. This is a set of review papers I put together from past years' exam papers and test points, and it can also be used as a mock exam. It is meant to help students who are already on internships and short on review time get through the exam. Experience has shown that this set of papers works well: more than half of the original questions on the real exam were predicted correctly. With the help of this paper, a classmate of mine finished reviewing in one day and ended up passing with a score above 70.
Because simply reviewing the test points can be boring, I think that by working through questions you grasp the key points faster, activate your brain, and strengthen your memory. This paper is meant to help you review; it does not mean you can pass the exam just by doing this paper.

1. Fill in the blanks

  1. In the three-tier model, cloud computing is usually divided into Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
  2. The virtualization technologies of cloud computing: server virtualization, storage virtualization, network virtualization
  3. Classification of commonly used high-dimensional data visualization techniques: scatter-plot matrix, parallel coordinates, dimensionality-reduction projection, radar chart
  4. Sources of big data: real-world measurements, human records, and computer-generated data
  5. Dimensions of data quality: accuracy, consistency, completeness, timeliness, entity identity
  6. Methods for filling missing values: deletion, unified filling, statistical filling, predictive filling (a small sketch follows this list)
    [all of the above appeared in the exam; feature selection and feature extraction also appeared as fill-in-the-blank answers]
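For reference, here is a minimal sketch of the four filling strategies named in item 6; the toy "height"/"age" columns are invented, and pandas plus scikit-learn are assumed to be available.

```python
# A minimal sketch of deletion, unified, statistical and predictive filling.
# The "height"/"age" columns are made up purely for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"height": [160, 172, 181, 158, 175],
                   "age":    [23,  None, 31,  None, 45]})

deleted     = df.dropna(subset=["age"])               # deletion: drop rows with missing values
unified     = df["age"].fillna(0)                     # unified filling: one fixed value
statistical = df["age"].fillna(df["age"].mean())      # statistical filling: mean/median/mode

# Predictive filling: fit a simple model on rows where "age" is known,
# then predict it for the rows where it is missing.
known, missing = df[df["age"].notna()], df[df["age"].isna()]
model = LinearRegression().fit(known[["height"]], known["age"])
predictive = df.copy()
predictive.loc[df["age"].isna(), "age"] = model.predict(missing[["height"]])
print(predictive)
```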

2. Conceptual questions

  1. The concept of big data; list its 4V (or 5V) characteristics
    Big data refers to data sets so large that current mainstream computer systems cannot acquire, store, manage, process, and refine them into information that helps users make decisions within a reasonable time.
    Volume: the amount of data is large;
    Variety: the types and sources are diverse;
    Value: the value density of the data is relatively low;
    Velocity: the data grows quickly;
    (Optional) Veracity: the accuracy and trustworthiness of the data, i.e. data quality.
  2. The concept and characteristics of cloud computing
    Definition: cloud computing is a business computing model. It distributes computing tasks over a resource pool made up of a large number of computers, so that application systems can obtain computing power, storage space, and information services as needed. In short, cloud computing provides dynamically scalable, inexpensive computing services on demand over the network.
    Features: ultra-large scale, virtualization, high reliability, versatility, high scalability, very low cost, on-demand service
    [the definition of the Vector Space Model was also tested in the exam; a quick sketch of the idea follows]
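Since the note above mentions the Vector Space Model, here is a rough sketch of the idea: documents are represented as vectors of term weights (plain counts here, TF-IDF in practice) and compared by cosine similarity. The terms and counts are invented for illustration.

```python
# Rough Vector Space Model sketch with invented term counts.
import numpy as np

terms = ["cloud", "computing", "data"]          # the shared vocabulary
doc1 = np.array([2, 1, 0])                      # term counts of document 1
doc2 = np.array([1, 1, 3])                      # term counts of document 2

# Cosine similarity between the two document vectors.
similarity = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(similarity)
```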

3. Calculation questions

  1. [Problem given as an image in the original post; not reproduced here]

  2. [Problem given as an image in the original post; not reproduced here]
    Assume that r(u,i) = 1 if and only if user u and item i are associated, and r(u,i) = 0 otherwise;
    solve the problem with either a UserBased or an ItemBased approach (a small UserBased sketch follows this section).
  3. Calculation of precision, recall, and F1 score (also sketched below).
    [There was also a big Naive Bayes calculation question in the exam; I had not reviewed it at all, so the teacher really caught me out there]
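For question 2 above, a minimal UserBased sketch under the r(u,i) ∈ {0,1} assumption: compute the cosine similarity between the target user and every other user, then score items by a similarity-weighted sum. The association matrix below is invented, not the one from the original image.

```python
# Toy user-based collaborative filtering on a 0/1 association matrix.
import numpy as np

# rows = users, columns = items; r[u][i] = 1 means user u is associated with item i
r = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0
sims = np.array([cosine(r[target], r[v]) if v != target else 0.0
                 for v in range(len(r))])

scores = sims @ r                 # similarity-weighted interest in each item
scores[r[target] == 1] = 0        # ignore items the target user already has
print(scores.argmax())            # index of the recommended item
```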
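For question 3, the definitions with made-up counts, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.

```python
# Precision, recall and F1 from invented confusion-matrix counts.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                            # 8 / 10 = 0.8
recall    = tp / (tp + fn)                            # 8 / 12 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.727
print(precision, recall, f1)
```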

4. Answer and analysis questions

  1. Introduce HDFS and its characteristics
    HDFS is Hadoop's distributed file system; its functions are data storage, management, and error handling. It is an open-source counterpart of GFS, designed to store large-scale data sets reliably and to improve the efficiency with which users access the data.
    Features: suitable for storing and processing big data;
    the cluster size can be expanded dynamically;
    data consistency is effectively guaranteed;
    data throughput is high, and cross-platform portability is good.

  2. What are the four categories of NoSQL databases? Briefly describe their characteristics.
    They are divided into four categories: key-value, column-family, document, and graph databases.
    Key-value databases, represented by Redis, are mainly used to handle high access loads on large amounts of data; lookups are fast, but the data is unstructured.
    Column-family databases, represented by HBase, store data by column family, so data in the same column is stored together; lookups are fast and scalability is strong, but the functionality is relatively limited.
    Document databases, represented by MongoDB, are used in web applications; they impose few requirements on data structure, but the flexible table structure also leads to lower query performance and the lack of a unified query syntax.
    Graph databases, represented by Neo4j, are mainly used in social networks and recommendation systems, focusing on building relationship graphs with graph structure models and algorithms. The drawback is that computations need to run over the whole graph, so they are hard to distribute across a cluster.

  3. List typical distributed file systems and describe them briefly.
    HDFS is Hadoop's distributed file system; its functions are data management, storage, and error handling. HDFS is suited to storing large files, the cluster can be expanded dynamically, data consistency is effectively guaranteed, data throughput is high, and cross-platform portability is good.
    Ceph is a highly available, easy-to-manage, open-source distributed storage system that provides object, block, and file storage services. Its advantages include unified storage, scalability, reliability, and automated maintenance. Compared with HDFS, which leans toward offline batch processing, Ceph leans toward a highly scalable, highly available, high-performance real-time distributed system, and it supports writes, especially random writes, better.
    GlusterFS is an open-source distributed file system with strong horizontal scaling capability; by scaling out it can store petabytes of data and serve thousands of clients.

  4. Design a public-key cryptography model that provides both secrecy and authentication
    [The model was given as a diagram in the original post and is not reproduced here; a hedged sketch of one standard answer follows.]
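A minimal sketch, assuming the question means combining secrecy (encrypt with the receiver's public key) with authentication (sign with the sender's private key); it uses the third-party cryptography package, and the message is invented.

```python
# Sign-then-encrypt sketch with RSA (requires the `cryptography` package).
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

sender = rsa.generate_private_key(public_exponent=65537, key_size=2048)
receiver = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"exam answer"

# Sender: sign with own private key (authentication) ...
signature = sender.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
# ... then encrypt with the receiver's public key (secrecy).
ciphertext = receiver.public_key().encrypt(
    message,
    padding.OAEP(mgf=padding.MGF1(hashes.SHA256()), algorithm=hashes.SHA256(), label=None),
)

# Receiver: decrypt with own private key, then verify with the sender's public key.
plaintext = receiver.decrypt(
    ciphertext,
    padding.OAEP(mgf=padding.MGF1(hashes.SHA256()), algorithm=hashes.SHA256(), label=None),
)
sender.public_key().verify(
    signature,
    plaintext,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("decrypted and verified:", plaintext)
```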

  5. Briefly describe the basic principle of the BSP model and the main steps of a BSP computation.
    Basic principle:
    The BSP (bulk synchronous parallel) model is an asynchronous MIMD-DM model: within a superstep, computation proceeds asynchronously in parallel, while supersteps are separated by explicit synchronization.
    Main steps of a computation (a toy sketch follows this item):
    Viewed vertically, a computation consists of a sequence of serial supersteps, similar to the structure of a serial program.
    Viewed horizontally, within each superstep all processes perform local computation in parallel.
    Local computation: each processor computes only on the data stored in its local memory.
    Global communication: the processors exchange data with one another; one side initiates put and get operations.
    Barrier synchronization: when a processor reaches the barrier, it waits until all the other processors have arrived as well.
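A toy in-process sketch of the BSP superstep pattern (local computation, communication, barrier), simulated with Python threads purely for illustration; the values and message pattern are invented.

```python
# Toy BSP simulation: each worker computes locally, sends one message,
# then waits at a barrier before the next superstep.
import threading

NUM_WORKERS = 4
NUM_SUPERSTEPS = 3
barrier = threading.Barrier(NUM_WORKERS)            # the synchronization fence
inbox = [[] for _ in range(NUM_WORKERS)]            # messages received per worker
values = list(range(NUM_WORKERS))                   # local state per worker

def worker(wid: int) -> None:
    for _ in range(NUM_SUPERSTEPS):
        # 1) local computation: use only locally held data and received messages
        values[wid] += sum(inbox[wid])
        inbox[wid] = []
        barrier.wait()                               # everyone has consumed its inbox
        # 2) global communication: put a message into the next worker's inbox
        inbox[(wid + 1) % NUM_WORKERS].append(values[wid])
        # 3) barrier synchronization: nobody starts the next superstep early
        barrier.wait()

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(values)                                        # final local results
```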

  6. The concepts and meaning of batch computing, stream computing, graph computing, etc.
    Batch computing mainly targets offline computing scenarios. The data being computed is static: it has already been collected and stored before the computation starts and does not change while it runs. Real-time requirements are low, so a computation may run for a while without returning results immediately. A batch big-data system usually consists of a computing-request input interface, a computation control node, and several computation execution nodes. The typical example is MapReduce (a toy word-count sketch follows this item).

    Stream computing mainly targets online computing scenarios. The data being computed is dynamic: it keeps arriving while the computation runs, its arrival time and order cannot be predicted beforehand, and it cannot all be stored in advance. Real-time requirements are high, so stream computing analyzes streaming data in real time to extract valuable real-time information.

    Graph computing: a technique that studies the relationships between objects and describes, computes, and analyzes them as a whole.
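A toy word-count sketch of the batch MapReduce idea, simulated in plain Python rather than on a Hadoop cluster; the input documents are invented.

```python
# Word count in the map -> shuffle -> reduce style of batch MapReduce.
from collections import defaultdict

documents = ["big data and cloud computing", "cloud computing is a computing model"]

# Map: emit (word, 1) pairs from each static input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```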

  7. YARN design ideas (architecture)
    [The architecture was given as a diagram in the original post and is not reproduced here.]
    Memorize the diagram and walk through it when answering; roughly, the parts to cover are the cluster-wide ResourceManager, the per-node NodeManager, the per-application ApplicationMaster, and the Container resource abstraction.

Summary

There are a lot of knowledge points, and it is genuinely hard to get a high score; you basically have to memorize everything the teacher has said. This also shows how important attending lectures is. Some people think attending lectures in college is boring, but it is the most efficient and cost-effective way to learn. I cannot guarantee that you will understand everything right after class or still remember every point at the end of the term, but reviewing will certainly be easier for you than for others, because the knowledge has been in your head for a long time: you can quickly recall the key points instead of starting from a pile of unfamiliar terms.
Finally, I wish you a good result in the exam. Good luck!

Source: https://blog.csdn.net/weixin_43594279/article/details/118072883