[Big Data day07_1] - Introduction to the Big Data course: basic introductions to servers, storage disks, switches, network cards, IDC data centers, and disk arrays (for general understanding)

1. Introduction to the Big Data Course

1.1, The concept of Big Data

Big data refers to collections of data that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.

The smallest basic unit is the bit. The units, in ascending order, are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB.

1 Byte = 8 bit    1 KB = 1024 Byte    1 MB = 1024 KB    1 GB = 1024 MB

1 TB = 1024 GB    1 PB = 1024 TB    1 EB = 1024 PB    1 ZB = 1024 EB

1 YB = 1024 ZB    1 BB = 1024 YB    1 NB = 1024 BB    1 DB = 1024 NB
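The conversion ladder above can be sketched in a few lines of Python. This is only an illustration: the names `UNITS`, `convert`, and `to_bits` are made up for this example and do not come from any library.

```python
# Units from the list above, in ascending order; each step is a factor of 1024.
UNITS = ["Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB", "NB", "DB"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between two units on the 1024-based ladder."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1024 ** steps


def to_bits(value, unit):
    """Express a quantity in bits (1 Byte = 8 bit)."""
    return convert(value, unit, "Byte") * 8


print(convert(1, "GB", "MB"))     # 1024
print(convert(0.02, "EB", "TB"))  # about 20971.52, i.e. roughly 21,000 TB
print(to_bits(1, "KB"))           # 8192
```

The second call reproduces the "0.02 EB is about 21,000 TB" figure used in the text that follows.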


In 1986, the world held only about 0.02 EB of data, roughly 21,000 TB. By 2007 the figure had reached 280 EB, roughly 300,000,000 TB, a 14,000-fold increase.
In recent years, with the emergence of the mobile Internet and the Internet of Things, the connection of all kinds of terminal devices, and the spread of new business models, the world's data volume has been doubling about every 40 months. As a concrete example, in 2012 about 2.5 EB of data were produced every day. According to an IDC report, the global data volume was predicted to jump from 4.4 ZB to 44 ZB between 2013 and 2020, and to reach 163 ZB by 2025.

The world's data volume has exploded, and traditional relational databases simply cannot handle data at this scale.
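To get a feel for these growth figures, the sketch below turns them into doubling times. It assumes steady exponential growth, which is a simplification the original figures do not claim; the function names are hypothetical.

```python
import math


def doubling_time_months(start, end, months):
    """Months needed to double, assuming steady exponential growth."""
    return months * math.log(2) / math.log(end / start)


def growth_factor(months, doubling_months):
    """How many times larger the data becomes after `months`."""
    return 2 ** (months / doubling_months)


# 4.4 ZB (2013) to 44 ZB (2020): tenfold growth over 7 years = 84 months.
print(round(doubling_time_months(4.4, 44, 84), 1))  # 25.3

# Doubling every 40 months means 8x growth over 10 years (120 months).
print(growth_factor(120, 40))  # 8.0
```

Under this assumption, the IDC forecast implies a doubling time of roughly two years, which is faster than the 40-month figure quoted for the earlier period.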

1.2, Characteristics of Big Data

1) Volume (large volume):

To date, all printed materials produced by humanity amount to about 200 PB, and everything humans have ever said amounts to about 5 EB. Today a typical personal computer's hard drive holds on the order of terabytes, while the data volume of some large enterprises is approaching the exabyte level.

2) Velocity (high speed):

This is the most significant feature that distinguishes big data from traditional data mining. According to IDC's "Digital Universe" report, global data usage was expected to reach 35.2 ZB by 2020. Faced with such a huge volume of data, processing efficiency is the lifeblood of an enterprise.

Tmall Double 11: in 2016, Tmall's turnover passed 10 billion yuan in 6 minutes and 58 seconds.


3) Variety (diversity):

This diversity of types means that data is divided into structured and unstructured data. Compared with the text-centric structured data that was easy to store in the past, there is more and more unstructured data, including web logs, audio, video, images, and location information, and these many data types place higher demands on data processing capabilities.

4) Value (low value density):

Value density is inversely proportional to the total volume of data. For example, in a full day of surveillance video we may care only about the one minute in which Mr. Song is working out. How to quickly "purify" the valuable data has become a pressing problem in the big data context.

1.3, What Big Data can do

1) O2O: Baidu's big data platform, with its advanced online and offline connection technology and passenger flow analysis capabilities, helps businesses refine their operations and boost sales.


2) Retail: explore user value and provide personalized service solutions; integrate with the physical retail network to create the best possible experience. The classic case is diapers + beer.


3) Commercial ad recommendation: recommend to users commercial ads of the types they have visited.


4) Real estate: big data supports the real estate industry in making precise investment and marketing strategies: choosing more suitable land, building more suitable buildings, and selling them to more suitable people.

5) Insurance: mining massive data and predicting risk helps the insurance industry market precisely and improves its fine-grained pricing capabilities.

6) Finance: multi-dimensional data reflects user characteristics, helping financial institutions identify high-quality customers and guard against fraud risk.


7) Artificial Intelligence


1.4, Development prospects of Big Data

1) The Fifth Plenary Session of the 18th CPC Central Committee proposed to "implement the national big data strategy", and the State Council issued the "Action Outline for Promoting the Development of Big Data". With breakthrough innovations in big data technology and applications and an explosion in domestic market demand, China's big data industry faces important development opportunities.

2) International Data Corporation (IDC) predicts that by 2020 enterprise spending on big data analytics computing platforms will exceed US$500 billion. At present, China has only about 460,000 big data professionals, and the talent gap over the next 3-5 years will exceed 1.5 million.


1.5, Business process analysis of an enterprise data department


2. Basic Introduction to Servers

A server is a device that provides computing services. Because a server has to respond to service requests and process them, it should in general be able to bear the service load and guarantee the service.

In a network environment, servers are divided by the type of service they provide into file servers, database servers, application servers, web servers, and so on.

A server consists of a processor, hard disks, memory, system buses, and so on, and its architecture is similar to that of a general-purpose computer. However, because it must provide highly reliable service, the requirements for processing power, stability, reliability, security, scalability, and manageability are higher.

Put simply, a server is just a computer, but with larger hard disks, a faster CPU, and a faster network card than an ordinary PC.


3. Basic Introduction to Storage Disks

Servers need to store data, so they must be backed by disks. A disk is a class of storage medium dedicated to storing all kinds of data. Disks can be classified into many types by interface, so let's take a quick look at the basic characteristics of disks with different interfaces.

3.1, SCSI Interface Hard Disk Drive

SCSI is the transmission interface of older, traditional servers, with spindle speeds of 10k and 15k rpm. However, because of limitations of the cable, the array card, and the transfer protocol, these disks have to be installed in a fixed way: for example, disks must be inserted in order starting from the interface end, and positions without a disk must be fitted with a terminator or similar. These disks are no longer on sale. They came only in a 3.5-inch version. Common speed: 10,000 rpm.

3.2, SAS Interface Hard Disk Drive

SAS disks come in two protocol versions, SAS 1.0 and SAS 2.0. The SAS 1.0 interface has a transfer bandwidth of 3.0 Gbit/s and spindle speeds of 7.2k, 10k, and 15k rpm; these disks have been superseded by SAS 2.0 disks. The disks come in 2.5-inch and 3.5-inch sizes. The SAS 2.0 interface has a transfer bandwidth of 6.0 Gbit/s and spindle speeds of 10k and 15k rpm, with common capacities of 73.6 GB, 146 GB, 300 GB, 600 GB, and 900 GB. Common speed: 15,000 rpm.
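SAS link speeds are quoted as line rates in gigabits per second, not gigabytes. SAS 1.0 and SAS 2.0 links use 8b/10b encoding, which sends 10 line bits for every 8 data bits, so the usable payload rate is lower than the headline number. The sketch below shows the arithmetic; the function name is hypothetical, and protocol overhead beyond the encoding is ignored.

```python
def payload_mb_per_s(line_rate_gbit):
    """Approximate per-link payload rate in MB/s (decimal) under 8b/10b encoding."""
    line_bits = line_rate_gbit * 1_000_000_000  # line bits per second
    data_bits = line_bits * 8 / 10              # 8 data bits per 10 line bits
    return data_bits / 8 / 1_000_000            # bits -> bytes -> MB


print(payload_mb_per_s(3.0))  # 300.0 (SAS 1.0)
print(payload_mb_per_s(6.0))  # 600.0 (SAS 2.0)
```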

3.3, FDE/SED Interface Disk Drive

FDE/SED disks: the former (FDE) is a hardware-encrypted SAS hard disk developed by IBM. The disk itself performs on par with a SAS disk, but its hardware encryption system keeps confidential data from being leaked. These disks are used in high-end 2.5-inch storage and in machines with 2.5-inch disk interfaces. SED disks are the same kind of disk, just from different manufacturers.

3.4, SATA Interface Hard Disk Drive

SATA hard disk: a hard disk with a SATA interface is also called a serial hard disk and is the mainstream direction for future PC hard disks. It has stronger error-correction capability and can automatically correct errors once they are detected, which greatly improves the safety of data transfer. SATA uses a new differential signaling system that can effectively filter noise out of the normal signal. This good noise filtering lets SATA operate at low voltage: compared with the 5 V signaling of Parallel ATA, SATA needs only a 0.5 V (500 mV) peak-to-peak voltage (more precisely, a peak-to-peak differential-mode voltage) to operate at higher speeds. Common speed: 7,200 rpm.

3.5, SSD Hard Disk Drive

These disks are solid-state drives (SSDs). Unlike the SSDs in personal PCs, they use a solid-state disk detection system and the SAS 2.0 transfer protocol, and their performance is several times that of retail consumer SSDs.

4. Basic Introduction to Switches

Basic introduction: a switch is a network device that forwards electrical (optical) signals. It provides a dedicated electrical signal path between any two network nodes connected to it. The most common switch is the Ethernet switch; other common types include telephone voice switches and optical fiber switches.

Main functions: the main functions of a switch include physical addressing, network topology, error checking, frame sequencing, and flow control. Modern switches also have newer features, such as support for VLANs (virtual LANs) and link aggregation, and some even have firewall capabilities.

5. Basic Introduction to Network Cards

A NIC (Network Interface Card) is the hardware that physically connects a computer to a network; it is the interface through which a computer communicates directly with the LAN transmission medium. Network cards differ according to network technology, for example the well-known ATM cards, Token Ring cards, and Ethernet cards. According to statistics, about 80% of local area networks use Ethernet technology.

Interface

Network cards from mainstream manufacturers are available for all the bus interfaces commonly found in today's desktop and notebook computers. Note, however, that 100M cards with an ISA interface are hard to find on the market. Since 1994 the PCI bus architecture has increasingly become the preferred bus for network cards, and it is now firmly established in servers and high-end desktops. This change is expected to extend to more desktops. PCI Ethernet cards offer high performance, ease of use, and enhanced reliability, which has led to their wide adoption in standard Ethernet networks with the support of the PC industry.

Technical direction

Currently, Ethernet NICs come in 10M, 100M, 10M/100M, and Gigabit versions. For networks carrying large volumes of data, servers should use Gigabit Ethernet cards; these cards are used for the connection between the server and the switch to improve the response speed of the whole system.

For typical applications such as file sharing, a 10M card is sufficient, but for voice, video, and other applications that may come later, a 100M card is better suited to real-time transmission.

6. Basic Introduction to LANs

A LAN (Local Area Network) is a group of computers interconnected within a certain area, generally within a radius of a few kilometers. A LAN can provide file management, application sharing, printer sharing, scheduling within workgroups, e-mail, fax communication services, and other functions. A LAN is closed; it may consist of two computers in an office or of thousands of computers within a company.

7. Basic Introduction to Racks

To make a large number of servers easier to manage and maintain, and to locate and fix problems quickly when a server fails, we can use racks and gather many servers into one rack. Communication between racks can be handled by using switches to organize them into a local area network.


8. Introduction to IDC Data Centers

An Internet Data Center (IDC) is a facility in which a telecommunications operator uses its existing Internet communication lines and bandwidth resources to build a standardized, professional-grade telecom machine room environment and provide enterprises and governments with server hosting, rental, and related value-added services.

The main applications of IDC hosting are website publishing, virtual hosting, and e-commerce. In website publishing, an organization hosts a server and, once the telecom operator has allocated it a static Internet IP address, can publish its own www site and widely promote its products or services over the Internet. In virtual hosting, an organization rents out the large hard disk space of its hosted server to provide virtual hosting services for other customers, becoming an ICP service provider. In e-commerce, an organization builds its own e-commerce system on its hosted server and uses this business platform to provide better services to suppliers, wholesalers, resellers, and end users.

IDC stands for Internet Data Center. It has developed rapidly along with the growing demand for the Internet and has become an important part of China's Internet industry in the new century. It provides Internet content providers (ICPs), enterprises, media, and all kinds of websites with large-scale, high-quality, safe, and reliable professional server hosting, space rental, wholesale network bandwidth, ASP, EC, and other services.

An IDC is where the web server farms of hosting enterprises, merchants, or websites are housed. It is the infrastructure on which the various modes of e-commerce rely to operate safely, and it is also the platform that supports enterprises and their business alliances (distributors, suppliers, customers, and so on) in implementing value chain management.

The IDC originated from ICPs' demand for high-speed Internet access, and the United States is still the world leader in this field. In the United States, operators set the bandwidth of network interconnections very low to protect their own interests, so users had to place a server with each service provider. IDCs emerged to solve this problem, ensuring that the servers customers host can be reached from every network without access speed becoming a bottleneck.

An IDC is not only a center for storing data but also a center for circulating it, and it should be located where data exchange on the Internet is most concentrated. It arose as people placed higher demands on server hosting and virtual hosting services; in a sense, it evolved from the ISP's server hosting machine room. Specifically, with the rapid development of the Internet, the growing demands of website systems for bandwidth, management, and maintenance posed a serious challenge to many enterprises. As a result, enterprises began to hand everything related to website hosting over to IDCs, which specialize in providing network services, and to focus their own energy on strengthening their core business competitiveness. The IDC is thus a product of the increasingly fine division of labor among Internet businesses.

(Figure: IDC machine room)

At present, China's large machine rooms are mainly in Beijing, Shanghai, Guangzhou, Tangshan, and other places.

9. Disk Arrays

Basic introduction to RAID:

In 1988, Professor D. A. Patterson and colleagues at the University of California, Berkeley first proposed the RAID concept in the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" [1]. At that time large-capacity disks were relatively expensive, and the basic idea of RAID was to combine several smaller, relatively inexpensive disks so as to obtain, at lower cost, capacity, performance, and reliability comparable to an expensive large-capacity disk. As disk costs and prices kept falling, RAID could use most kinds of disks, and "inexpensive" lost its meaning. The RAID Advisory Board (RAB) therefore decided to replace "inexpensive" with "independent", and RAID became Redundant Array of Independent Disks. This was only a change of name; the substance did not change.

9.1, RAID0 basic introduction

RAID0 is simple data striping without parity. Strictly speaking it is not really RAID, because it provides no form of redundancy. RAID0 divides the storage space of its disks into stripes (Fig. 2) and spreads data across all disks, which can then be read and written independently and concurrently. Because I/O operations run concurrently, bus bandwidth is fully used. With no data checking required either, RAID0 has the highest performance of all RAID levels. In theory, a RAID0 made up of n disks has n times the read/write performance of a single disk, but because of constraints such as bus bandwidth the actual performance is lower than the theoretical value.

RAID0 offers low cost, high read/write performance, and 100% storage space utilization, but it provides no data redundancy protection: once data is corrupted, it cannot be restored. RAID0 is therefore generally suited to applications with strict performance requirements but low requirements for data security and reliability, such as video and audio storage and temporary data cache space.
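A minimal sketch of striping may make the idea concrete. It models each disk as a Python list of blocks and deals blocks out round-robin; the function names and the tiny block size are assumptions chosen for illustration, not how any particular controller works.

```python
def stripe(data: bytes, n_disks: int, block_size: int):
    """Split data into blocks and deal them round-robin across n_disks."""
    disks = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % n_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original bytes by reading the blocks in stripe order."""
    out = []
    depth = max(len(d) for d in disks)
    for row in range(depth):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", n_disks=3, block_size=2)
print(disks)            # [[b'AB', b'GH'], [b'CD', b'IJ'], [b'EF']]
print(unstripe(disks))  # b'ABCDEFGHIJ'
```

Because consecutive blocks land on different disks, a real array can service them in parallel. Losing any one of the lists, however, leaves holes that nothing else in the array can fill.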


9.2, RAID1 basic introduction

RAID1 is called mirroring. It writes exactly the same data to a working disk and a mirror disk, so its disk space utilization is 50%. With RAID1, response time is affected when writing data but not when reading. RAID1 provides the best data protection: if the working disk fails, the system automatically reads the data from the mirror disk, and the user's work is not affected.

RAID1 is the opposite of RAID0: to enhance data security, it keeps the data on two disks as a complete mirror, which gives good security, simple technology, and easy management. RAID1 is fully fault tolerant, but its implementation cost is high. RAID1 suits applications that require high sequential read/write performance and attach great importance to data protection, such as protecting the data of a mail system.
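The mirroring behavior described above can be sketched with two in-memory "disks". The `Mirror` class and its methods are hypothetical names for this illustration; real implementations also handle resynchronization after a disk is replaced, which is left out here.

```python
class Mirror:
    """Two in-memory 'disks' that hold identical copies of every block."""

    def __init__(self):
        self.disks = [{}, {}]          # block number -> bytes
        self.failed = [False, False]

    def write(self, block_no, data):
        # Every write goes to all healthy copies, which is why writes cost more.
        for disk, failed in zip(self.disks, self.failed):
            if not failed:
                disk[block_no] = data

    def read(self, block_no):
        # Read from the first healthy disk; the other copy is the fallback.
        for disk, failed in zip(self.disks, self.failed):
            if not failed:
                return disk[block_no]
        raise IOError("both copies have failed")

    def fail(self, index):
        # Simulate a disk failure: its contents are gone.
        self.failed[index] = True
        self.disks[index] = {}


m = Mirror()
m.write(0, b"mail spool")
m.fail(0)         # the working disk dies
print(m.read(0))  # b'mail spool', served from the mirror disk
```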


9.3, RAID2 basic introduction

RAID2 is called the error-correcting Hamming code disk array; its design uses Hamming codes for data-check redundancy. A Hamming code is an error detection and correction coding technique that adds a number of check bits to the original data, where the bits at positions 2^n (1, 2, 4, 8, ...) are check bits and the other positions hold data bits. In RAID2, data is therefore stored bit by bit, with each disk storing one bit of the encoded data. The number of disks depends on the configured data width, which is set by the user. Fig. 4 shows a RAID2 with a data width of 4, which requires 4 data disks and 3 check disks. With a 64-bit data width, 64 data disks and 7 check disks are required. As can be seen, the larger the RAID2 data width, the higher the storage space utilization, but the more disks are needed.

Hamming codes have built-in error-correction capability, so RAID2 can correct errors when data errors occur and keep the data safe. Its data transfer performance is very high, and its design complexity is lower than that of RAID3, RAID4, and RAID5, described later.

However, the Hamming code's data redundancy overhead is large, and RAID2's data output performance is limited by the slowest disk drive in the array. In addition, the Hamming code operates bitwise, so RAID2 data reconstruction is very time-consuming. Because of these significant drawbacks, and because most disk drives now have their own error correction, RAID2 is rarely used in practice: there are no commercial products, and mainstream storage arrays do not support it.
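The disk counts quoted above follow from the standard condition for a single-error-correcting Hamming code: with m data bits, the number of check bits r must satisfy 2^r >= m + r + 1. The sketch below, with hypothetical function names, reproduces the 4+3 and 64+7 figures and the resulting space utilization.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def utilization(data_bits: int) -> float:
    """Fraction of disks that hold data rather than check bits."""
    return data_bits / (data_bits + check_bits_needed(data_bits))


for width in (4, 64):
    print(width, check_bits_needed(width), round(utilization(width), 3))
# 4 3 0.571
# 64 7 0.901
```

The jump from about 57% to about 90% utilization is the "larger width, higher utilization" effect the paragraph describes, at the price of 71 disks instead of 7.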


9.4, RAID3 basic introduction

RAID3 (Fig. 5) is a parallel-access array with a dedicated parity disk: one disk serves as the dedicated parity disk and the remaining disks serve as data disks, with data interleaved across the data disks bit by bit or byte by byte. RAID3 requires at least three disks. The data in the same stripe on different disks is XORed, and the parity value is written to the parity disk. When all disks are healthy, RAID3's read performance is the same as RAID0's: data is read in parallel from the stripes on multiple disks, so performance is very high, while fault tolerance is also provided. When writing data to RAID3, the parity value for all data blocks in the same stripe must be calculated and the new value written to the parity disk. A single write operation involves writing the data block, reading the data blocks in the same stripe, calculating the parity value, and writing the parity value, so the system overhead is very large and performance is lower.

If one disk in a RAID3 fails, reading is generally not affected: the lost data can be rebuilt from the parity data and the other intact data. If the data block to be read lies on the failed disk, the system must read all the data blocks in the same stripe and rebuild the lost data from the parity value, which affects system performance. When the failed disk is replaced, the system rebuilds the failed disk's data onto the new disk in the same way.

RAID3 needs only one parity disk, so array storage utilization is high. Combined with its parallel access, it can provide high-bandwidth, high-performance reading and writing of large amounts of data, and it suits applications that access large amounts of data sequentially, such as image processing and streaming media services. At present, the RAID5 algorithm keeps improving and can emulate RAID3 when reading large amounts of data, and RAID3 performance drops sharply when a disk goes bad, so RAID5 is often used in place of RAID3 to run applications characterized by continuous, high-bandwidth, large-volume reads and writes.
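XOR parity and the rebuild step can be shown in a few lines. The sketch below treats each disk's share of one stripe as a short byte string; `xor_blocks` and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks, plus its parity block.
data = [b"disk", b"DATA", b"1234"]
parity = xor_blocks(data)

# Disk 1 fails: XOR the surviving data blocks with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

This works because XOR is its own inverse: XORing the parity with every surviving block cancels them out and leaves exactly the missing block. It also shows why a degraded read is slower, since every surviving disk in the stripe has to be read.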


9.5, RAID4 basic introduction

RAID4 works on essentially the same principle as RAID3; the difference lies in how data is striped. RAID4 (Fig. 6) organizes data in blocks, so a write operation involves only two disks, the data disk being written and the parity disk, and multiple I/O requests can be processed at the same time, improving system performance. By storing data block by block, RAID4 keeps each block intact on a single disk and avoids the adverse effects of the other disks in the same stripe.

RAID4 XORs the data blocks at the same position on different disks and stores the result on the parity disk. When data is written, RAID4 computes the parity value in this way and writes it to the parity disk; when data is read, it is checked on the fly. Thus, when a data block on one disk is damaged, RAID4 can reconstruct it from the parity value and the corresponding data blocks on the other disks.

RAID4 provides very good read performance, but the single parity disk often becomes the bottleneck of system performance. For write operations, RAID4 can write to only one disk at a time and must also write the parity data, so write performance is poor. As the number of member disks grows, the parity disk bottleneck becomes more pronounced. Because of these limitations, RAID4 is rare in practice, and mainstream storage products seldom use RAID4 protection.
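A write that touches only the data disk and the parity disk is commonly done as a read-modify-write: the new parity is the old parity XOR the old data XOR the new data. The sketch below assumes that approach (the text does not spell out the exact procedure) and checks it against a full recomputation; the helper names are hypothetical.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data, new_data, old_parity):
    """New parity after overwriting one block, using only that block and the parity."""
    return xor(xor(old_parity, old_data), new_data)


stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])

# Overwrite the block on disk 1 without reading disks 0 and 2.
new_block = b"zzzz"
updated = small_write_parity(stripe[1], new_block, parity)

# Same result as recomputing parity over the whole stripe.
full = xor(xor(stripe[0], new_block), stripe[2])
print(updated == full)  # True
```

Every such write still has to update the one parity disk, so concurrent writes to different data disks queue up behind it.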


9.6, RAID5 basic introduction

RAID5 is probably the most common RAID level. Its principle is similar to RAID4's, except that the parity data is distributed across all disks in the array rather than kept on a dedicated parity disk. Writes of data and of parity data can therefore take place simultaneously on different disks, so RAID5 does not have RAID4's parity-disk bottleneck for concurrent writes. RAID5 also scales well: as the number of disks in the array increases, so does its capacity for parallel operations. It can support more disks than RAID4, giving higher capacity and higher performance.

In RAID5 (Fig. 7), each disk stores both data and parity data, and a data block and its corresponding parity information are stored on different disks. When one data disk is damaged, the system can rebuild the damaged data from the other data blocks and the corresponding parity data in the same stripe. As with other RAID levels, RAID5 performance is significantly affected during data reconstruction.

RAID5 balances storage performance, data security, and storage cost. It can be seen as a compromise between RAID0 and RAID1 and is the data protection scheme with the best overall performance. RAID5 meets the storage requirements of most applications, and most data centers use it as the protection scheme for application data.
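The distributed-parity idea can be sketched by printing which disk holds parity in each stripe. The rotation used here (start on the last disk and move one disk to the left per stripe) is only one possible scheme, chosen for illustration; actual controllers differ in layout, and the function names are hypothetical.

```python
def parity_disk(stripe_no: int, n_disks: int) -> int:
    """Index of the disk holding parity for a stripe (one possible rotation)."""
    return (n_disks - 1 - stripe_no) % n_disks


def layout(n_stripes: int, n_disks: int):
    """Return a table of labels: P<s> for parity, D<s>.<k> for data blocks."""
    rows = []
    for s in range(n_stripes):
        p = parity_disk(s, n_disks)
        row, k = [], 0
        for disk in range(n_disks):
            if disk == p:
                row.append(f"P{s}")
            else:
                row.append(f"D{s}.{k}")
                k += 1
        rows.append(row)
    return rows


for row in layout(4, 4):
    print("  ".join(row))
# D0.0  D0.1  D0.2  P0
# D1.0  D1.1  P1  D1.2
# D2.0  P2  D2.1  D2.2
# P3  D3.0  D3.1  D3.2
```

Since the parity block of each stripe sits on a different disk, two writes to different stripes usually update different parity disks and can proceed in parallel.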
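The cost side of this trade-off can be compared with a small calculation of usable capacity. The sketch assumes equal-sized disks, a two-disk mirror for RAID1 as described earlier, and one disk's worth of parity for RAID3, RAID4, and RAID5; the function name is hypothetical.

```python
def usable_fraction(level: int, n_disks: int) -> float:
    """Fraction of raw capacity available for data, for equal-sized disks."""
    if level == 0:
        return 1.0                        # no redundancy
    if level == 1:
        return 0.5                        # every block is stored twice
    if level in (3, 4, 5):
        return (n_disks - 1) / n_disks    # one disk's worth of parity
    raise ValueError(f"level {level} is not covered by this sketch")


for level, n in [(0, 4), (1, 2), (5, 4), (5, 8)]:
    print(f"RAID{level} with {n} disks: {usable_fraction(level, n):.0%}")
# RAID0 with 4 disks: 100%
# RAID1 with 2 disks: 50%
# RAID5 with 4 disks: 75%
# RAID5 with 8 disks: 88%
```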




Origin blog.csdn.net/qq_38454176/article/details/104759441