The evolution of image service architecture and the advantages of cloud storage

1. What are the advantages of building an independent image server?
2. What are the advantages of using cloud storage services?
3. How can image hotlinking (the "anti-leech" problem) be prevented?

 

Today almost every website, web app, and mobile app needs to display images, so the image service matters from the bottom of the stack to the top. The image server has to be planned with some foresight, and the upload and download speed of images is critical. This does not mean the architecture must be impressive from day one, but it should at least have a degree of scalability and stability. There are all kinds of architectural designs; here I only share some of my personal thoughts.


For an image server, I/O is by far the heaviest resource consumer. For web applications, the image service needs to be separated out to some degree, otherwise the application may be dragged down by the I/O load of serving images. Therefore, especially for large websites and applications, it is necessary to split the image servers from the application servers and build an independent image server cluster. Its main advantages:
1) It shares the I/O load of the web server - moving the resource-hungry image service onto its own machines improves the performance and stability of both.
2) The image servers can be optimized specifically - a caching scheme targeted at image traffic reduces bandwidth and network costs and improves access speed.
3) It improves the scalability of the site - image throughput can be raised simply by adding more image servers.

From the Web 1.0 of the traditional Internet, through the Web 2.0 era, to today's Web 3.0, the architecture of the image server has kept evolving as the scale of image storage has grown. The following discusses that evolution in three stages.

The initial stage

Before introducing the small image server architecture typical of the initial stage, let us first look at NFS. NFS is short for Network File System, a project developed by Sun for sharing files over the network between different machines and operating systems. An NFS server can be regarded as a file server used to share files among UNIX-like systems: a remote export can simply be mounted onto a local directory and then operated on as conveniently as a local file.


 




If you do not want to replicate every image onto every image server, NFS is the simplest way to share files. NFS is a distributed client/server file system; its essence is sharing between machines: a user can mount a share from another computer and access its files as if they were on a local hard disk. The concrete implementation idea is:

1) All front-end web servers mount the directories exported by the three image servers via NFS so that they can write uploaded images into them. Image server 1 in turn mounts the export directories of the other two image servers locally and serves all images to the outside world through Apache.
2) Image upload
The user submits the upload through a page; the web server processes the image and then copies it into the corresponding mounted local directory (a minimal sketch of this step follows the list).
3) Image access
When a user requests an image, image server 1 reads it from the corresponding mounted directory and returns it.
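
A minimal sketch of step 2, assuming the three NFS exports are mounted at the hypothetical paths /mnt/img1, /mnt/img2 and /mnt/img3 and that the web server simply spreads uploads across them:

import os
import shutil
import zlib

NFS_MOUNTS = ["/mnt/img1", "/mnt/img2", "/mnt/img3"]  # assumed mount points

def store_upload(tmp_path, filename):
    # Pick a mount point deterministically from the file name.
    target_dir = NFS_MOUNTS[zlib.crc32(filename.encode("utf-8")) % len(NFS_MOUNTS)]
    target_path = os.path.join(target_dir, filename)
    shutil.copy(tmp_path, target_path)  # behaves like a local copy thanks to NFS
    return target_path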

Problems with the above architecture:
1) Performance: the architecture relies too heavily on NFS. When an image server's NFS service has a problem, it can drag down the front-end web servers. NFS's main problem is locking: it is prone to deadlocks that can only be cleared by rebooting the machine, and once the number of images reaches a certain scale NFS develops serious performance problems.
2) High availability: only one image server serves downloads, so it is an obvious single point of failure.
3) Scalability: the image servers depend too much on one another, leaving little room for horizontal scaling.
4) Storage: upload hotspots on the web servers are uncontrollable, so disk usage across the image servers becomes uneven.
5) Security: anyone who has the web server's credentials can modify the contents of the NFS share at will, so the security level is low.

Of course, image synchronization between image servers does not have to use NFS; FTP or rsync will also work. With FTP, each image server keeps a full copy of every image, which doubles as a backup, but pushing images to every server over FTP is time-consuming. If the synchronization is done asynchronously there will be some delay, which is usually acceptable for small image files. With rsync, once the data reaches a certain size each scan takes a long time and likewise introduces delay.
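
A minimal sketch of such asynchronous rsync-based synchronization, assuming hypothetical peer image servers and paths:

import subprocess
import time

PEER_SERVERS = ["img2.xxx.com", "img3.xxx.com"]   # hypothetical peer image servers

def sync_once(local_dir="/data/images/"):
    for peer in PEER_SERVERS:
        # -a preserves attributes, -z compresses during transfer; at large scale
        # the scan itself is what becomes slow.
        subprocess.run(["rsync", "-az", local_dir, peer + ":/data/images/"], check=True)

while True:
    sync_once()
    time.sleep(60)   # periodic runs mean newly uploaded images appear with some delay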


The development stage

[Figure: image server architecture of the development stage, with distributed image storage behind load balancing]

When the website reaches a certain scale and starts to demand more from the image service in terms of performance and stability, the NFS-based architecture above faces serious challenges and the overall structure has to be upgraded. This is where the image server architecture shown above comes in, and distributed image storage appears.

The specific implementation idea is as follows:
1) After the user uploads an image to the web server, the web server processes it and then POSTs it to one of the image servers [image server 1], [image server 2] ... [image server N]. The chosen image server receives the POSTed image, writes it to its local disk, and returns a status code. Based on that status code the front-end web server decides what to do next: on success it generates thumbnails of various sizes, applies watermarks, and writes the image server's ID and the image path into the database.
2) Upload control
When we need to redirect uploads, we only have to change the ID of the destination image server that the web server POSTs to, which controls which storage server receives the image. The storage server itself only needs nginx plus a small Python or PHP service that receives and saves images (a sketch of such a service follows this list); if you prefer not to run Python or PHP, you can also write an nginx extension module instead.
3) Image access
When the user opens a page, the browser requests each image from the corresponding image server according to the image's URL, for example:

http://imgN.xxx.com/image1.jpg
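
As referenced above, a minimal sketch of such a receive-and-save service, using only the Python standard library; the /data/images path, the listening port, and the convention of POSTing the raw image body to /upload/<filename> are all assumptions for illustration:

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

IMAGE_ROOT = "/data/images"   # assumed local storage directory on the image server

class UploadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # e.g. the web server POSTs the raw image body to /upload/image1.jpg
        filename = os.path.basename(self.path)
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        with open(os.path.join(IMAGE_ROOT, filename), "wb") as f:
            f.write(data)                       # write the image to the local disk
        self.send_response(200)                 # status code checked by the web server
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), UploadHandler).serve_forever()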




The image server architecture at this stage adds load balancing and distributed image storage, which to some extent solves the problems of high concurrent access and large-scale storage. For load balancing you can use F5 hardware if the budget allows, or the open-source LVS software load balancer (with its caching enabled as well). Concurrency improves greatly, and more servers can be deployed whenever the situation requires. There is still a flaw: the same image may end up on several Squid instances, because the first request may be routed to squid1 and, after the LVS session expires, a later request may land on squid2 or another node. Relative to the concurrency problem being solved, this small amount of redundancy is entirely acceptable. In this architecture the second-level cache can be Squid, Varnish, or Traffic Server. When choosing open-source cache software, the following points

should be considered: 1) Performance: with its "Visual Page Cache" technology, Varnish uses memory more efficiently than Squid and avoids Squid's frequent swapping of files between memory and disk, so its performance is higher than Squid's. Varnish does not cache to the local hard disk. It also has a powerful management port through which parts of the cache can be purged quickly and in batches using regular expressions. nginx caches via the third-party ncache module, and its performance is basically up to Varnish's, but in most architectures nginx is used as the reverse proxy (a great deal of static content is served by nginx nowadays, and it can sustain 20,000+ concurrent connections). In a purely static architecture where the front end faces the CDN directly and there is a layer-4 load balancer in front, the nginx cache alone is entirely sufficient.

2) Avoid file-system caching: when the amount of file data is very large, file-system-based caches such as Squid and nginx's proxy_store and proxy_cache perform poorly, and once the cached data grows past a certain size they can no longer keep up with the requests. Caching directly on raw disk, as the open-source Traffic Server does, is a good choice. Its limited large-scale adoption in China (mainly Taobao) is not due to poor performance but because it was open-sourced late. Traffic Server had been used inside Yahoo for more than four years, mainly for CDN services; the CDN distributes specific HTTP content, usually static content such as images, JavaScript and CSS. Caching with something like LevelDB would, I estimate, also give good results.

3) Stability: Squid, as a long-established veteran cache, is the most reliable. According to feedback from users around me, Varnish occasionally crashes. Traffic Server has had no known data corruption during its use at Yahoo and is fairly stable. Looking ahead, I actually expect Traffic Server to gain more users in China.

The image service architecture above eliminates the early NFS dependency and single point of failure, balances space usage across the image servers, and improves security. If you intend to store images on ordinary hard disks, first consider the disks' actual throughput: 7,200 rpm and 15,000 rpm drives differ greatly in real performance. As for whether to choose xfs, ext3, ext4 or ReiserFS as the file system, some benchmarking is needed; according to some published test data, ReiserFS is better suited to storing small image files. When creating the file system, also plan for inodes and choose an appropriate inode size. Linux assigns each file a number called an inode, which can be thought of as a pointer that always points to the file's physical storage location. A file system allows only a limited number of inodes; if there are too many files, then even if every file is a 0-byte empty file the system will eventually be unable to create new files because the inode space is exhausted. So you have to trade capacity against speed and build a sensible file directory index.
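
Because inode exhaustion is easy to overlook, it is worth checking. A minimal sketch, assuming a Linux host and a hypothetical mount point, that reports inode usage via os.statvfs:

import os

def inode_usage(path):
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree      # total inodes minus free inodes
    return used, st.f_files

used, total = inode_usage("/data/images")   # hypothetical image mount point
print("inodes used: %d / %d (%.1f%%)" % (used, total, 100.0 * used / total))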



The cloud storage stage

[Figure: simplified architecture diagram of Alibaba Cloud OSS]

At the Baidu Union Summit in 2011, Robin Li said that the era of reading images on the Internet had arrived. Image services have become a large part of Internet applications, handling images well has become a basic skill for companies and developers, and the speed of uploading and downloading images matters more than ever. To handle images well, three main problems must be faced: heavy traffic, high concurrency, and massive storage.

Alibaba Cloud Storage Service (Open Storage Service, OSS for short) is a massive, secure, low-cost, highly reliable cloud storage service provided by Alibaba Cloud. Users can upload and download data anytime and anywhere through a simple REST interface, and can also manage their data through web pages. OSS provides Java, Python, and PHP SDKs to simplify programming. On top of OSS, users can build services that depend on large amounts of data, such as multimedia sharing sites, network drives, and personal or enterprise data backup. The discussion below mainly uses Alibaba Cloud OSS as the entry point for introducing cloud storage; the figure above is a simplified architecture diagram of OSS.

The true "cloud storage" is not storage but provides cloud services. The main advantages of using cloud storage services are as follows:
1) Users do not need to know the type, interface, storage medium, etc. of the storage device.
2) No need to care about the storage path of the data.
3) There is no need to manage and maintain storage devices.
4) No need to consider data backup and disaster recovery
5) Simply access cloud storage and enjoy storage services.


Architecture module composition
1) KV Engine
Object source information and data files in OSS are stored on the KV Engine. In version 6.15, the KV Engine will use version 0.8.6 together with the OSSFileClient provided for OSS.

2) Quota
This module records the correspondence between buckets and users, and the usage of bucket resources at minute granularity. Quota also provides an HTTP interface for the Boss system to query.

3) Security module
The security module mainly records the ID and Key corresponding to the User, and provides the user authentication function for OSS access.

OSS terms and vocabulary
1) Access Key ID & Access Key Secret (API key)
When a user registers for OSS, the system assigns the user a pair of Access Key ID and Access Key Secret, called an ID pair, which identifies the user and is used for signature verification when accessing OSS.

2) Service
The virtual storage space that OSS provides to users. Within this virtual space, each user can have one or more buckets.

3) Bucket
Bucket is a namespace on OSS; the bucket name is globally unique in the entire OSS and cannot be modified; each object stored on OSS must be included in a bucket. An application, such as a photo sharing website, can correspond to one or more buckets. A user can create up to 10 buckets, but there is no limit to the total number and size of objects stored in each bucket, and users do not need to consider data scalability.
4) Object
In OSS, each user file is an Object, and each file must be smaller than 5 TB. An Object consists of key, data and user meta: the key is the object's name, the data is the object's content, and the user meta is the user's description of the object.
Its usage is very simple; with the Java SDK:
OSSClient ossClient = new OSSClient(accessKeyId, accessKeySecret);  // create the client with the user's ID pair
PutObjectResult result = ossClient.putObject(bucketname, bucketKey, inStream, new ObjectMetadata());  // upload the image stream as an object
Executing the above code uploads the image stream to the OSS server.
Accessing the image is just as simple; its URL is: http://bucketname.oss.aliyuncs.com/bucketKey
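
For comparison, a rough Python equivalent of the upload call is sketched below; it assumes the oss2 Python SDK (the article only shows the Java SDK) and a hypothetical endpoint and file name:

import oss2

auth = oss2.Auth("your-access-key-id", "your-access-key-secret")            # the ID pair
bucket = oss2.Bucket(auth, "http://oss-cn-hangzhou.aliyuncs.com", "bucketname")
with open("image1.jpg", "rb") as f:
    bucket.put_object("bucketKey", f)           # upload the image stream to the bucket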


Distributed file system
Distributed storage has several advantages. It automatically provides redundancy, so we do not need to worry about backups or data safety ourselves; when the number of files is especially large, backing up is painful and a single rsync scan may take hours. Another advantage is that distributed storage is easy to scale out dynamically. Among other domestic file systems, TFS (http://code.taobao.org/p/tfs/src/) and FastDFS also have some users: TFS's strength is storing small files and it is mainly used by Taobao, while FastDFS has performance problems above roughly 300 concurrent writes and its stability is not great. OSS uses Pangu, a highly available and highly reliable distributed file system developed in-house by Alibaba Cloud on the Feitian 5K platform. Pangu is similar to Google's GFS. Its architecture is master-slave: the master manages metadata, while the slaves, called Chunk Servers, handle read and write requests. The master itself is a multi-master design based on Paxos: if one master dies, another can take over quickly, and failover generally completes within a minute. Files are stored in chunks, each chunk is replicated three times onto different racks, and end-to-end data checksums are provided.


HAProxy load balancing
The load-balancing layer is based on an automatic hashing architecture built with HAProxy: a newer cache architecture with nginx at the front proxying to the cache machines. Behind nginx sits a group of cache servers, and nginx hashes the URL to decide which cache machine a request goes to.
This architecture makes it easy to upgrade a pure Squid cache tier, since nginx can be installed directly on the Squid machines. nginx itself can cache, so a few heavily accessed objects, such as favicon.ico and the site logo, can be cached directly on nginx without the extra proxy hop, which helps keep the image service highly available and fast. The load balancer handles all OSS requests, and if a back-end HTTP server fails it is switched out automatically, so the OSS service is never interrupted.
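
A minimal sketch of the URL-hashing idea, with a hypothetical list of cache machines; it only shows how hashing the URL makes every request for the same image land on the same cache node:

import hashlib

CACHE_NODES = ["cache1:80", "cache2:80", "cache3:80"]   # hypothetical cache machines

def pick_cache_node(url):
    # Hash the URL so requests for the same image always hit the same cache node.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return CACHE_NODES[int(digest, 16) % len(CACHE_NODES)]

print(pick_cache_node("http://img1.xxx.com/image1.jpg"))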


CDN
Alibaba Cloud's CDN service is a distributed caching system spread across the country. It caches website files (such as images or JavaScript code files) on servers in data centers in many cities, so when a user visits your website the data is fetched from a server in a city near that user, and end users reach your service very quickly.
Alibaba Cloud CDN deploys more than 100 nodes nationwide and delivers an excellent acceleration effect. When the site's traffic suddenly grows, you do not need to rush to expand your network bandwidth; the CDN copes with it easily. As with OSS, to use the CDN you first activate the CDN service on aliyun.com, and then create a distribution (a distribution channel) in the management console. Each distribution consists of two required parts: a distribution ID and an origin address.
Using Alibaba Cloud OSS together with the CDN makes it easy to accelerate content for each bucket, because each bucket maps to its own second-level domain name and CDN cache purging can be done per file. This simply and economically solves the storage and network problem; after all, for most websites and applications, images and videos consume most of the storage and bandwidth.
Looking at the wider industry, cloud storage for individual users, such as Dropbox and Box.net abroad, has become very popular recently. Among domestic cloud storage providers, Qiniu and Youpai (Upyun) are currently the better-known ones.


Upload and download: divide and conquer
On an image server, downloads far outnumber uploads, and the business logic for the two differs markedly: the upload server renames the image and records its storage information, while the download server adds watermarks, resizes, and performs other dynamic processing. From a high-availability point of view we can tolerate the occasional failed download, but we must never lose an upload, because a failed upload means lost data. Separating upload from download ensures that uploads are not affected by download pressure. The load-balancing strategies for the two entry points also differ: uploads need extra logic such as the Quota Server recording the relationship between users and images, while download requests that miss the front-end cache and penetrate to the back-end business logic need to fetch the image path information from OSS. Alibaba Cloud is about to launch nearest-CDN-node uploading, which automatically picks the CDN node closest to the user so that both upload and download speed are optimized; compared with a traditional IDC, access is several times faster.


Image hotlink protection
If the service does not implement hotlink protection, stolen traffic will cause bandwidth and server-load problems. A common solution is to add a Referer ACL check in the nginx or Squid reverse proxy. OSS also provides Referer-based hotlink protection, as well as a more advanced URL-signature scheme, which works as follows:

First, make sure the bucket's permission is private, so that every request to the bucket must be authenticated to be considered legitimate. Then dynamically generate a signed URL based on the operation type, the bucket to be accessed, the object to be accessed, and a timeout. With this signed URL, your authorized users can perform the corresponding operation until the URL expires.

The Python code for computing the signature is as follows (Python 3):
import base64, hashlib, hmac, urllib.parse
h = hmac.new(b"OtxrzxIsfpFjA7SwPzILwy8Bw21TLhquhboDYROV",
             b"GET\n\n\n1141889120\n/oss-example/oss-api.jpg", hashlib.sha1)
signature = urllib.parse.quote_plus(base64.b64encode(h.digest()).strip())

Here the method can be any of PUT, GET, HEAD or DELETE, and the expiration value (1141889120 above) is the timeout expressed in seconds. A signed URL computed in this way looks like:
http://oss-example.oss-cn-hangzh ... Ga%2ByT272YEAiv4%3D

Dynamically computing signed URLs in this way effectively protects the data on OSS and prevents it from being hotlinked or stolen by others.
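
To make the whole flow concrete, the sketch below assembles a complete signed URL from the same pieces; the query parameter names (OSSAccessKeyId, Expires, Signature) follow OSS's URL-signature convention, and the key, bucket and object are the example values used above:

import base64, hashlib, hmac, urllib.parse

def signed_url(access_key_id, access_key_secret, bucket, key, expires, method="GET"):
    # Canonical string: method, content-md5, content-type, expires, resource (md5/type empty here).
    string_to_sign = "%s\n\n\n%d\n/%s/%s" % (method, expires, bucket, key)
    digest = hmac.new(access_key_secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest).strip())
    return ("http://%s.oss.aliyuncs.com/%s?OSSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key_id, expires, signature))

print(signed_url("your-access-key-id", "OtxrzxIsfpFjA7SwPzILwy8Bw21TLhquhboDYROV",
                 "oss-example", "oss-api.jpg", 1141889120))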


Image editing and processing API
For online image editing and processing, GraphicsMagick (http://www.graphicsmagick.org/) should be familiar to anyone working in Internet development. GraphicsMagick was forked from ImageMagick 5.5.2 and has since become more stable and capable: GM is smaller and easier to install, more efficient, and its manual is very thorough. GraphicsMagick's commands are basically the same as ImageMagick's.

GraphicsMagick offers a very rich API covering cropping, scaling, compositing, watermarking, format conversion, filling, and more. Its SDKs are equally rich, with bindings for Java (im4java), C, C++, Perl, PHP, Tcl, Ruby and others, and it supports more than 88 image formats, including DPX, GIF, JPEG, JPEG-2000, PNG, PDF, PNM and TIFF. GraphicsMagick runs on most platforms; Linux, Mac and Windows are all fine. However, building your own image-processing service on these libraries places fairly heavy I/O demands on the server, and these open-source processing libraries are not always stable: while using GraphicsMagick the author ran into Tomcat process crashes that required restarting the Tomcat service manually.
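
As a simple illustration of the processing described above, here is a minimal sketch that shells out to the gm command-line tool to generate a thumbnail; GraphicsMagick is assumed to be installed and the file names are hypothetical:

import subprocess

def make_thumbnail(src="image1.jpg", dst="image1_200x200.jpg", size="200x200"):
    # "gm convert -resize" fits the image inside the given box while keeping the aspect ratio.
    subprocess.run(["gm", "convert", src, "-resize", size, dst], check=True)

make_thumbnail()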

Alibaba Cloud has also opened its image processing API to the outside world, covering the most common needs: thumbnails, image watermarks, text watermarks, styles, pipelines and so on. Developers can use these image processing features very easily, and I hope more and more developers will build excellent products on top of OSS.
