Big Data Era

MapReduce is a programming model for parallel operations on large datasets (typically larger than 1 TB).
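As a rough illustration of the model (a toy, single-machine word count in Python rather than a real distributed job), the user supplies a map function and a reduce function, and the framework takes care of grouping the intermediate keys:

```python
# Toy word count in the MapReduce style: the user writes map and reduce
# functions, and the "framework" shuffles/groups the intermediate keys.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit an intermediate (word, 1) pair for every word in one split
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: aggregate all values emitted for one key
    return key, sum(values)

documents = ["big data era", "data data everywhere"]   # stand-in input splits
intermediate = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 1, 'data': 3, 'era': 1, 'everywhere': 1}
```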

Thrift is a software framework for developing scalable, cross-language services. It combines a software stack with a code-generation engine to build seamless, efficient services that interoperate across C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.
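For a sense of what this looks like from the client side, here is a hedged Python sketch. The `example` package and its `ExampleService` (defined in an IDL such as `service ExampleService { string ping() }`) are invented for illustration and would be produced by the Thrift compiler; only the transport and protocol classes come from Thrift's Python library.

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from example import ExampleService  # hypothetical code generated by the thrift compiler

# Wrap a TCP socket in a buffered transport and the binary wire protocol.
transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ExampleService.Client(protocol)  # generated client stub

transport.open()
print(client.ping())  # the server on the other end could be written in C++, Java, Erlang, ...
transport.close()
```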

In a sense, both WebService and REST are implementations of RPC, so how did RPC develop? This article draws on Wikipedia for a brief summary of RPC.

RPC (Remote Procedure Call) is a form of inter-process communication (IPC), and generally refers to communication between processes on different machines. In the era of procedural languages such as C, an RPC was a call to a "subroutine" on the server side, hence the name "procedure call". After object-oriented programming appeared, RPC also came to be called remote method invocation (RMI) or remote invocation.

An RPC call can be synchronous or asynchronous. Synchronous mode: the client sends a request to the server and blocks waiting; the server executes the subroutine and sends back a response; the client then continues. Asynchronous mode works like an asynchronous HTTP request (e.g. an XMLHttpRequest call).

 

The RPC calling process (the term "stub" was probably borrowed from Java RMI):

  1. The client sends a request (call) to the client stub.
  2. The client stub packs the request parameters (marshalling), issues a system call, and the OS sends the message to the server.
  3. When the message arrives, the server passes it to the server stub, which unpacks it (unmarshalling).
  4. The server stub calls the subroutine on the server side. After processing, the result is sent back to the client in the same way.

Note: the server stub is also called a skeleton.
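As a minimal sketch of the flow above (an illustration using Python's built-in xmlrpc modules, not a description of any particular framework mentioned here), the proxy object plays the role of the client stub and the XML-RPC server plays the server stub / skeleton; the host, port, and add() subroutine are invented for the example:

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):                       # the remote "subroutine" on the server side
    return a + b

# Server side: register the subroutine and serve requests in the background.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy is the client stub.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))               # -> 5, computed on the server
```

The client calls proxy.add() as if it were a local method; marshalling, the network round trip, and unmarshalling are all hidden inside the stubs.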

 

What is a stub?

A stub is a piece of code that converts the parameters passed during an RPC call. Among other things, it handles byte-order (endianness) differences between operating systems. By convention, the client-side piece is simply called the stub, while the server-side piece is called the skeleton.

How stubs are produced: 1) written by hand, which is tedious; 2) generated automatically, using an IDL (Interface Description Language) to define the client/server interface.

Interaction standard: an IDL is generally used, and a tool such as rpcgen generates the stubs from the IDL.
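To make the byte-order point above concrete, here is a hand-rolled marshalling sketch in Python, a toy version of what generated stub code does for you; the two-integer message layout is invented for this illustration:

```python
import struct

def marshal_call(proc_id, arg):
    # Client stub: pack the procedure number and argument as big-endian
    # 32-bit integers ('>' forces network byte order, so both ends agree
    # regardless of the host CPU's native endianness).
    return struct.pack(">ii", proc_id, arg)

def unmarshal_call(message):
    # Server stub: reverse the packing before invoking the subroutine.
    return struct.unpack(">ii", message)

wire = marshal_call(7, 42)
print(wire.hex())            # 000000070000002a
print(unmarshal_call(wire))  # (7, 42)
```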

 

RPC-related implementations

  • Java RMI
  • XML-RPC, which uses XML over HTTP to make calls between machines
  • JSON-RPC
  • SOAP, an upgraded version of XML-RPC
  • Facebook Thrift
  • CORBA
  • AMF (Adobe Flex)
  • Libevent, a framework for building RPC servers and clients
  • WCF, from Microsoft
  • .NET Remoting, gradually replaced by WCF

Most companies are familiar with the concept of big data, but few people understand the technologies behind it. On the one hand, much of the software involved is simple and specialized, and operators only need to know how to run it. With a business-intelligence tool such as FineBI, for instance, the operator does not need to understand the internals or think about how to model the data; only some basic communication is required, which saves time during project implementation and brings more benefit to the enterprise.

On the other hand, many managers are not well versed in this area, and understanding it has little practical value for them. For people who work with such software regularly, however, understanding the technology behind it is a real help in their work. So, what technologies are used behind big data?

1. NoSQL database

In today's environment, new technologies do not take long to be adopted; many are in widespread use within a month of appearing. Broadly speaking, NoSQL databases themselves cover many technologies. They target workloads where relational database engines reach their limits, such as indexing, streaming media, and high-traffic web services, and it is in these areas that NoSQL databases are used most often.

2. Hadoop MapReduce

This technology can handle the challenges posed by big data analysis; it is not only widely applied but also has unique advantages in processing. Many companies consider data platforms built on Hadoop MapReduce the best to use, which shows that the technology can indeed bring unexpected benefits to enterprises.

3. In-memory analytics

Memory was expensive when it first appeared, but as the technology advanced, capacities grew and prices fell again and again while performance kept improving, which is why in-memory technology has become so popular.

Moreover, practitioners point out that low-cost memory in big data centers offers real-time performance and efficiency, and can improve big data insight, giving enterprises better data analysis and mining.

4. Integrated appliances

Business intelligence and big data analysis really took off after data warehouse appliances appeared. Using data warehouse technology to strengthen their competitive advantage and stay ahead of competitors has appealed to many companies. Integrated appliances offer many other capabilities as well; the one enterprises use most is enhancing traditional database systems. Appliances have also become an important tool for coping with data challenges, which is why the technology attracts so much attention.

 

Erlang is a general-purpose concurrent programming language developed by the CS-Lab of the Swedish telecom equipment manufacturer Ericsson to create a programming language and runtime environment capable of handling large-scale concurrent activity. Erlang appeared in 1987, and after ten years of development an open-source version was released in 1998. Erlang is an interpreted language that runs on a virtual machine, but it now also includes a native-code compiler developed by the High Performance Erlang (HiPE) project at Uppsala University, and since release R11B-4 Erlang has also supported a script interpreter. In terms of programming paradigm, Erlang is a multi-paradigm language covering functional, concurrent, and distributed programming. Sequential Erlang is an eagerly evaluated, single-assignment, dynamically typed functional language.
Erlang is a structured, dynamically typed programming language with built-in support for parallel computation. Originally designed by Ericsson for communication applications, such as controlling switches or converting protocols, it is well suited to building distributed, soft real-time, parallel computing systems. An application written in Erlang typically consists of thousands of lightweight processes that communicate by message passing. Context switching between Erlang processes is extremely cheap, far more efficient than thread switching in a C program.
Writing distributed applications is much simpler in Erlang, because its distribution mechanism is transparent: the program does not need to know that it is running in a distributed manner. The Erlang runtime environment is a virtual machine, a bit like the Java virtual machine, so once code is compiled it can run anywhere. Its runtime system even allows code to be updated without interruption. In addition, bytecode can be compiled to native code when more performance is needed.
Spark is a general parallel framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark has the advantages of Hadoop MapReduce; but unlike MapReduce, a job's intermediate output can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark superior for certain workloads: in addition to providing interactive queries, Spark can optimize iterative workloads.
Spark is implemented in Scala, which it uses as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, and Scala can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system, a behavior supported through a third-party cluster framework called Mesos. Spark was developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab) to build large-scale, low-latency data analytics applications.
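As a hedged sketch of the in-memory advantage described above, the PySpark snippet below caches a dataset so that the second and later actions reuse the in-memory partitions instead of recomputing them (or re-reading HDFS); the local master, dataset, and numbers are placeholders:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cache-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# In a real job this might come from sc.textFile("hdfs://...").
data = sc.parallelize(range(1, 1_000_001)).map(float).cache()

total = data.sum()    # first action materializes and caches the partitions
count = data.count()  # second action reads the cached partitions from memory
print(total / count)  # 500000.5

sc.stop()
```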
