This article is based on a talk shared by Xia (nickname Lei Biao), a senior product specialist for EMR on the Alibaba computing platform.
He first worked with big data in 2014 and later did internal big data development at Alibaba. He is currently responsible for EMR, Alibaba Cloud's open-source big data platform product, and for building the open-source ecosystem on the cloud.
Product Description
The overall architecture of Alibaba Cloud EMR is as follows:
Management and O&M capabilities
- Cluster management, job management, and scheduling
- Web console, SDK, and API
Fully compatible with the open-source ecosystem, with enhancements on top
- Hadoop and Spark performance optimization
- Enhanced monitoring capabilities that can be integrated with other systems
Evolving together with the community ecosystem
- Components follow the open-source community for version upgrades
- Open source is linked with the Alibaba Cloud platform to make full use of the cloud ecosystem's capabilities
- Integration with cloud products (OSS, SLS, MaxCompute, etc.)
- Integration with cloud capabilities such as elasticity (handling failures of local-disk instances; elastic scaling with support for bid instances)
Global deployment (15 regions worldwide)
- Quickly replicate diverse, enterprise-grade open-source big data solutions
A complete, integrated enterprise-grade platform
- Packaged computing platform capabilities
- Out-of-the-box experience
Commonly used combinations:
The components that big data platform applications typically use include:
General Hadoop
- Open-source big data scenarios: offline, real-time, and ad hoc query
- Built on the open-source Hadoop ecosystem, with YARN for cluster resource management. It provides Hive and Spark for offline large-scale distributed storage and computing; Spark Streaming, Flink, and Storm for stream computing; Presto and Impala for interactive queries; and other Hadoop ecosystem components such as Oozie and Pig. It supports OSS storage, Kerberos authentication, and data encryption.
Kafka
- An open, high-throughput, scalable messaging system
- E-MapReduce Kafka provides a complete service monitoring system and metadata management. It is widely used for log collection and monitoring-data aggregation, and supports offline or streaming data processing and real-time data analysis.
DataScience
- Big data + AI scenarios
- For big data + AI scenarios, the Data Science cluster provides Hive and Spark for offline big data ETL and TensorFlow for model training. Users can choose a heterogeneous CPU + GPU computing framework and use NVIDIA GPUs to run some deep learning algorithms with high performance.
Druid
- Real-time interactive analytics scenarios
- Druid provides millisecond-latency queries over large data sets and supports multiple data ingestion methods. It can be combined with services such as E-MapReduce Hadoop, E-MapReduce Spark, Alibaba Cloud OSS, and Alibaba Cloud RDS to build flexible, robust real-time query solutions.
Zookeeper
- Distributed locks
- A standalone cluster that provides a distributed, consistent lock service for large-scale Hadoop, HBase, and Kafka clusters.
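To make the lock service idea concrete, here is a minimal single-process sketch in Python of the usual ZooKeeper-style lock recipe: each client takes a sequence number under a lock path, the lowest number holds the lock, and removing a client's entry hands the lock to the next one. The class `InMemoryLockService`, its method names, and the lock path are all hypothetical; this is a toy simulation, not the ZooKeeper client API or anything EMR-specific.

```python
import itertools


class InMemoryLockService:
    """Toy, in-memory stand-in for a ZooKeeper-style lock queue (hypothetical)."""

    def __init__(self):
        self._counter = itertools.count()  # monotonically increasing sequence
        self._waiters = {}  # lock path -> {sequence number: client id}

    def enqueue(self, path, client_id):
        """Roughly like creating an ephemeral sequential node under `path`."""
        seq = next(self._counter)
        self._waiters.setdefault(path, {})[seq] = client_id
        return seq

    def holder(self, path):
        """The client with the lowest sequence number holds the lock."""
        queue = self._waiters.get(path)
        if not queue:
            return None
        return queue[min(queue)]

    def release(self, path, seq):
        """Roughly like deleting the node, or a dead client's session expiring."""
        self._waiters.get(path, {}).pop(seq, None)


# Two hypothetical nodes compete for the same lock path.
service = InMemoryLockService()
first = service.enqueue("/locks/active-master", "node-a")
second = service.enqueue("/locks/active-master", "node-b")
```

In this sketch `node-a` holds the lock until `service.release("/locks/active-master", first)` is called, after which `node-b` becomes the holder. A real deployment would rely on ZooKeeper itself for ordering and failure detection.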
Product Features
Visual cluster management console
Built-in scheduling system
- Project-level permission management
- DAG support
- Better integration with elastic resources
- Convenient management of many kinds of jobs
- Comprehensive alerting and monitoring
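DAG support in a scheduler means that a job runs only after all of its upstream jobs have finished. The ordering can be sketched in Python with the standard library's `graphlib` module. The function `execution_waves` and the job names are hypothetical illustrations, not the EMR scheduler's API.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves; jobs in the same wave could run in parallel.

    `dependencies` maps each job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose upstream jobs are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream jobs
    return waves


# Hypothetical workflow: two ingest jobs feed an ETL job, which feeds a report.
workflow = {
    "ingest_logs": set(),
    "ingest_orders": set(),
    "spark_etl": {"ingest_logs", "ingest_orders"},
    "hive_report": {"spark_etl"},
}
```

Calling `execution_waves(workflow)` groups the two ingest jobs into the first wave, followed by the ETL job and then the report. A workflow with a circular dependency is rejected before anything runs.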
Machine Learning Support
Deep learning and AI have become buzzwords. The EMR deep learning cluster combines deep learning with open-source big data technologies to provide an integrated big data + deep learning service. With a single cluster, you can build an enterprise data lake and run machine learning and deep learning at the same time:
- Supports ECS GPU instance types, with GPU cluster resources scheduled by Hadoop YARN for Spark and ML workloads
- Supports TensorFlow, Horovod, and other computing frameworks
- Uses PS, MPI, and other data communication modes
- Supports Docker and Standalone run modes
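The difference between the PS (parameter server) and MPI-style communication modes can be illustrated with plain arithmetic. The Python sketch below is a single-process toy under stated assumptions: the function names are hypothetical, gradients are plain lists, and no real networking, TensorFlow, or Horovod calls are involved.

```python
def average_gradients(worker_grads):
    """Element-wise mean of the gradients computed by each worker."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]


def parameter_server_step(weights, worker_grads, lr):
    """PS mode: workers push gradients to a central server, which averages
    them and updates the single authoritative copy of the weights."""
    avg = average_gradients(worker_grads)
    return [w - lr * a for w, a in zip(weights, avg)]


def allreduce_step(worker_weights, worker_grads, lr):
    """MPI-style mode: there is no central server. After an all-reduce,
    every worker has the same averaged gradient and updates its own replica."""
    avg = average_gradients(worker_grads)
    return [[w - lr * a for w, a in zip(weights, avg)]
            for weights in worker_weights]


# Two hypothetical workers, each with a gradient for a two-parameter model.
grads = [[0.5, 1.0], [1.5, 3.0]]
ps_weights = parameter_server_step([1.0, 2.0], grads, lr=0.5)
replicas = allreduce_step([[1.0, 2.0], [1.0, 2.0]], grads, lr=0.5)
```

Both modes arrive at the same weights here. What differs in practice is where the averaging happens and how much traffic a single node has to handle.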