Reading notes on the paper "Parallel and Distributed Dimensionality Reduction of Hyperspectral Data on Cloud Computing Architectures"

Disclaimer: This is an original blogger article, licensed under the CC 4.0 BY-SA copyright agreement. Please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/lianggyu/article/details/100151472

Paper: "Parallel and Distributed Dimensionality Reduction of Hyperspectral Data on Cloud Computing Architectures"

Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Volume 9, Issue 6, June 2016)

Parallel and Distributed Dimensionality Reduction of Hyperspectral Data on Cloud Computing Architectures

Abstract

Cloud computing offers the possibility of storing and processing massive volumes of hyperspectral remote sensing data in a distributed fashion. In hyperspectral imaging, dimensionality reduction is an important task because hyperspectral data often contain redundancy that can be removed before the data in the repository are analyzed. In this regard, developing dimensionality reduction techniques in cloud computing environments can provide both efficient storage and preprocessing of the data. In this paper, we develop a parallel and distributed implementation of a widely used technique for hyperspectral dimensionality reduction, principal component analysis (PCA), on cloud computing architectures. Our implementation uses the Hadoop Distributed File System (HDFS) for distributed storage and Apache Spark as the computing engine, and is developed on the map-reduce parallel model, taking full advantage of the high-throughput access and high-performance distributed computing capabilities of cloud environments. We first optimize the traditional PCA algorithm to make it highly suitable for parallel and distributed computing, and then implement it on a real cloud computing architecture. Experimental results on several hyperspectral datasets show that the proposed parallel and distributed implementation delivers very good performance.

Keywords: cloud computing, dimensionality reduction, Hadoop, hyperspectral imaging, principal component analysis (PCA), Spark

1 Introduction

Hyperspectral images contain hundreds of contiguous spectral bands and therefore impose significant requirements in terms of storage and data processing. It is important to reduce the dimensionality of hyperspectral images by extracting the main features from their hundreds of highly correlated spectral bands. Since most of the high-dimensional space is empty and the hyperspectral data are concentrated in a subspace, the data can usually be reduced to that subspace without degrading their quality.

In recent decades, many techniques have been used to transform hyperspectral data into lower-dimensional spaces. Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Since adjacent bands of a hyperspectral image are often correlated, PCA can efficiently transform the raw data and remove the correlation between bands. However, the PCA algorithm is computationally intensive. To overcome this problem, some researchers have adopted multicore central processing units (CPUs) and graphics processing units (GPUs) to accelerate the PCA transformation of high-volume hyperspectral data.
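As a quick single-machine illustration of this idea (not the paper's distributed implementation), PCA of a hyperspectral cube can be sketched with NumPy; the cube shape and the number of retained components below are made-up values:

import numpy as np

# Hypothetical hyperspectral cube: rows x cols x bands (shape is illustrative).
cube = np.random.rand(100, 100, 200)
X = cube.reshape(-1, cube.shape[-1])            # N pixels x B bands

X_centered = X - X.mean(axis=0)                 # remove the band means
cov = X_centered.T @ X_centered / (X.shape[0] - 1)   # B x B covariance matrix

# Eigendecomposition; the eigenvectors with the largest eigenvalues
# are the principal components.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:10]]             # keep the first 10 components

X_reduced = X_centered @ components             # N pixels x 10 features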

However, with recent advances in satellite and airborne remote sensing technology, the volume of remote sensing image data is growing exponentially, and these data are being collected and stored in hyperspectral data repositories. The availability of new hyperspectral missions, which generate huge amounts of data every day, raises important challenges for the scalable and efficient processing of hyperspectral data in different application areas (such as dimensionality reduction, which is required for data interpretation).

For example, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) of NASA's Jet Propulsion Laboratory has a data acquisition rate of 2.55 MB/s, which means it can collect nearly 9 GB of data in 1 hour (2.55 MB/s × 3600 s ≈ 9180 MB). The Chinese Pushbroom Hyperspectral Imager (PHI) has a data acquisition rate of 7.2 MB/s and can collect more than 25 GB of data in 1 hour. The spaceborne Hyperion instrument collects a hyperspectral data cube of 256 × 6925 pixels, with 242 spectral bands and 12-bit radiometric resolution, in 30 seconds, amounting to approximately 71.9 GB of data per hour (over 1.6 TB per day). Other satellite missions that will soon be put into operation, such as the Environmental Mapping and Analysis Program (EnMAP), offer similar data collection rates. Since hyperspectral data repositories keep growing in volume and are distributed across several geographic locations, it is difficult to meet the large storage and computing demands of hyperspectral processing applications without resorting to distributed computing facilities.

Distributed computing technology is highly desirable for these dynamic, large-scale hyperspectral processing demands. In the past, commodity clusters, grids, and cloud computing platforms have been explored for remote sensing data processing. Recently, cloud computing has become the standard for distributed computing, owing to its advanced capabilities for service-oriented computing and high-performance computing. It offers the potential to process large data workloads on parallel and distributed architectures. Cloud computing can be seen as the evolution of grid computing, since it stems from and relies on the principles of grid infrastructure while maintaining high-performance distributed computing capabilities. Therefore, using cloud computing to analyze large hyperspectral data repositories is a natural solution and an evolution of platforms previously developed for other kinds of computation. Nevertheless, to the best of our knowledge (and although the demand for large-scale data processing in the hyperspectral imaging field is increasing), very few efforts in the literature have tried to exploit cloud computing infrastructure for hyperspectral imaging techniques, and in particular for dimensionality reduction algorithms.

In this paper, we introduce a new cloud-based framework for the parallel and distributed processing of massive hyperspectral images. Specifically, we use dimensionality reduction as a case study to demonstrate the applicability of cloud computing technology for the efficient distributed parallel processing and acceleration of hyperspectral data computations. To this end, we use mature technologies such as the Hadoop Distributed File System (HDFS) and Apache Spark, together with the map-reduce approach, to develop a parallel and distributed PCA algorithm in a cloud environment. We assess the efficiency of our implementation by comparing the parallel and distributed version against a serial PCA on a single CPU and against a Hadoop-based PCA implementation, in terms of both accuracy and parallel performance.

The rest of the paper is organized as follows. Section 2 describes the proposed parallel and distributed framework. Section 3 introduces the PCA algorithm and its parallel and distributed implementation on a cloud computing architecture. Section 4 experimentally evaluates the accuracy and performance of the proposed method. Finally, Section 5 concludes the paper and outlines some plausible future research directions.

2 Parallel and Distributed Framework Design

In order to develop the parallel and distributed PCA framework on a cloud computing architecture, three main issues need to be addressed: 1) the distributed programming model; 2) the computing engine; and 3) how to achieve dynamic storage.

For distributed programming, we use the map-reduce model, which processes data in parallel across a cluster of computers and takes advantage of the high-performance capabilities that cloud computing architectures provide. In this model, a task is handled by two distributed operations: map and reduce. Data are organized as sets of key/value pairs; a map function processes key/value pairs to generate a set of intermediate key/value pairs, splitting the task into several independent sub-tasks that run in parallel. The reduce function is responsible for processing all intermediate values associated with the same intermediate key and for gathering the results of the sub-tasks into the result of the whole task. A minimal sketch of this pattern is shown below.
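This is a minimal PySpark sketch of the map/reduce pattern, using a toy word count rather than the paper's code; the application name and input strings are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="MapReduceSketch")

lines = sc.parallelize(["a b a", "b c"])

# Map phase: each element is turned into intermediate (key, value) pairs;
# the resulting sub-tasks run in parallel across the cluster.
pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

# Reduce phase: all intermediate values with the same key are merged.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('a', 2), ('b', 2), ('c', 1)]
sc.stop()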

Regarding the distributed computing engine, one possible solution is Apache Hadoop, given its reliability and scalability as well as its fully open-source nature. It has been successfully applied by IBM, Yahoo, and Facebook. However, Apache Hadoop only supports simple single-pass computations (e.g., aggregations or SQL-style database queries) and is generally not suitable for multi-pass algorithms. Apache Spark is a newly developed large-scale data processing engine for cloud computing architectures; it provides a fault-tolerant abstraction for in-memory cluster computing and enables fast, general data processing on large clusters. It supports not only simple single-pass computations but can also be extended to the more complex multi-pass algorithms required by data analysis.

Spark extends the MapReduce model with a primitive for data sharing, called resilient distributed datasets (RDDs), and provides an API based on coarse-grained transformations that allows data to be recovered efficiently using their lineage. Apache Spark features an advanced directed acyclic graph (DAG) execution engine that supports cyclic data flow and in-memory computation, and it can run up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. To use Apache Spark, developers write a driver program that defines one or more RDDs, invokes operations on them, and tracks the RDDs' lineage. The driver typically connects to a cluster of workers, which are long-lived processes that can keep RDD partitions in random access memory (RAM) across operations. At run time, the user's driver launches multiple workers, which read data blocks from a distributed file system and store the computed RDD partitions in memory. A sketch of this driver/worker workflow is shown below.
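To make the driver, worker, and RDD concepts concrete, here is a minimal sketch of a PySpark driver program; the application name, data, and partition count are made-up illustrations, not the paper's code:

from pyspark import SparkContext, StorageLevel

# The driver: a long-lived process that defines RDDs, tracks their
# lineage, and launches operations on the workers.
sc = SparkContext(appName="RDDSketch")

data = sc.parallelize(range(1_000_000), numSlices=8)

# A coarse-grained transformation; Spark records it in the lineage graph
# (a DAG), so a lost partition can be recomputed rather than replicated.
squares = data.map(lambda x: x * x)

# Ask the workers to keep the computed partitions in RAM across operations.
squares.persist(StorageLevel.MEMORY_ONLY)

total = squares.reduce(lambda a, b: a + b)   # the first action runs the DAG
print(total)
sc.stop()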

Our application needs to be able to dynamically allocate data storage across multiple locations. Fortunately, HDFS is designed to be deployed on low-cost hardware and provides high-throughput access to application data, which makes it especially suitable for large datasets. By using HDFS files as the input RDDs in Apache Spark, we can exploit a common interface. For example, the function partitions() returns one partition per block of the file (with the block offset stored in each Partition object); similarly, the function preferredLocations() gives the list of nodes that contain a block, and the function iterator() can be used to read the block. In view of the issues above, the design of our parallel and distributed hyperspectral processing framework using Apache Spark and HDFS is illustrated in the figure.
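As a hedged illustration of using an HDFS file as the input RDD (the HDFS URI, NameNode port, and the file layout of one pixel vector per line are all assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="HDFSInputSketch")

# Each HDFS block of the file becomes one partition of the RDD, and Spark
# schedules tasks on nodes that already hold the block (data locality).
lines = sc.textFile("hdfs://namenode:9000/data/hyperspectral.txt")
print(lines.getNumPartitions())

# Hypothetical layout: one pixel per line, band values separated by spaces.
pixels = lines.map(lambda line: [float(v) for v in line.split()])
print(pixels.first()[:5])
sc.stop()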

3 Distributed Parallel Implementation of PCA

3.1 PCA Algorithm

3.2 Optimizing the Algorithm for Parallel and Distributed Computing

When computing the covariance matrix Σ, each pixel vector now only needs to be multiplied by itself, and there are no dependences between pixel vectors, which makes the computation well suited to parallel and distributed processing. Furthermore, the pixel vectors can be read sequentially row by row, which gives the program good data locality and allows the cache memory to be used more efficiently. A sketch of this computation follows.
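Assuming the pixels are available as an RDD of NumPy vectors of length B (a made-up setup, not the paper's exact code), the map-reduce covariance computation might look as follows: the map phase emits one outer product per pixel, and the reduce phase sums the B × B partial results.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="CovSketch")

# Hypothetical input: an RDD of pixel vectors (length B = 200 here).
pixels = sc.parallelize([np.random.rand(200) for _ in range(10_000)])
n = pixels.count()

# Band means, computed with one map-reduce pass.
mean = pixels.reduce(lambda a, b: a + b) / n

# Map: each centered pixel yields only its own outer product; no pixel
# depends on any other, so the tasks are fully parallel.
# Reduce: sum the B x B partial matrices.
cov = pixels.map(lambda p: np.outer(p - mean, p - mean)) \
            .reduce(lambda a, b: a + b) / (n - 1)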

3.3 Parallel and Distributed Implementation on Spark

In this section, we describe the different stages of the parallel and distributed implementation of the PCA algorithm, and we further describe the architecture-related optimizations made while developing the distributed and parallel implementation.

As mentioned above, computing the covariance matrix and its decomposition is the most critical and time-consuming step of the PCA algorithm and the key target for optimization. In order to take full advantage of the high-performance features provided by a cloud computing architecture, we first optimize it using the map-reduce model on Spark, as shown in the algorithm.

Taking into account the parallel and distributed framework and the map-reduce model described in Section 2, the PCA algorithm can be implemented in a parallel and distributed form through the following steps, as shown in the figure. An end-to-end sketch is given below.
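Putting the pieces together, one possible end-to-end sketch of the distributed PCA pipeline is as follows; the HDFS paths, file layout, and the choice of k are illustrative assumptions. The small B × B eigendecomposition runs on the driver, while all per-pixel work runs in parallel on the workers.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="DistributedPCASketch")
k = 10  # number of principal components to keep (illustrative)

# Step 1: load the pixel vectors (hypothetical path and layout:
# one pixel per line, band values separated by spaces).
pixels = sc.textFile("hdfs://namenode:9000/data/hyperspectral.txt") \
           .map(lambda line: np.array([float(v) for v in line.split()])) \
           .cache()
n = pixels.count()

# Step 2: band means via map-reduce.
mean = pixels.reduce(lambda a, b: a + b) / n

# Step 3: covariance matrix via map (outer products) and reduce (sum).
cov = pixels.map(lambda p: np.outer(p - mean, p - mean)) \
            .reduce(lambda a, b: a + b) / (n - 1)

# Step 4: eigendecomposition of the small B x B matrix on the driver.
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

# Step 5: project every pixel onto the leading components, in parallel,
# and write the reduced data back to HDFS (hypothetical output path).
reduced = pixels.map(lambda p: (p - mean) @ components)
reduced.map(lambda v: " ".join(map(str, v))) \
       .saveAsTextFile("hdfs://namenode:9000/out/pca")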

4 Experiments

5 Conclusion
