Data warehouse architecture based on Alibaba Cloud
Product comparison
Alibaba Cloud product | Similar products | Brief introduction |
---|---|---|
RDS | MySQL, PostgreSQL | Relational Database Service, the managed database offering of Alibaba Cloud, available in several editions such as MySQL, PostgreSQL, and SQL Server |
DTS | Canal, DataX, Sqoop, Flume | Data Transmission Service, a feature-rich service covering data migration, data subscription, and real-time data synchronization for RDBMS, NoSQL, big data, and other products |
DataHub | Kafka | Data bus; its main function is similar to Kafka's, but it offers more interfaces and features |
MaxCompute | Hadoop | General-purpose offline computing platform (formerly known as ODPS) that supports SQL, MapReduce, UDF, Graph, Spark on MaxCompute, and other computing models. Fuxi is its scheduling system and Pangu is its storage system |
RealtimeCompute | Spark, Flink | Real-time computing framework (formerly StreamCompute), built on Blink |
DataWorks | - | One-stop visual big data workshop covering data integration, development, management, service, quality, security, and other functions, designed to make MaxCompute and RealtimeCompute easier to use |
AnalyticDB | GreenPlum, LibrA | Analytical database based on the MPP architecture, available in a MySQL edition and a PostgreSQL edition |
DataV | Tableau, PowerBI | Data visualization tool, mainly for large-screen displays |
QuickBI | Tableau, PowerBI | More flexible than DataV; mainly for data analysis, and used more by operations staff and analysts |
Offline data warehouse
- Architecture design
- Explanation
  - Raw data comes mainly from two sources
    - User behavior logs generated by the servers
    - Data generated by the business databases
    - Other kinds of data can also be imported, such as web crawler data or purchased market data
  - Data import
    - Log data is imported into DataHub with Flume (TailDirSource + MemoryChannel + DataHubSink)
    - Business data is synchronized directly into MaxCompute using the platform's own data synchronization
  - Data warehouse construction, which is divided into multiple layers
    - ODS (raw data layer): raw data, with only the simplest format checks and data compression applied
    - DWD (detail data layer): detail-level data; ETL is needed for cleaning, extraction, splitting, and dimension reduction, producing entity tables, dimension tables, and fact tables
    - DWS (summary data layer): light aggregation of the detail data and a preliminary summary of the statistical indicators, so that the application layer can use them directly
    - ADS (application data layer): the final results, including every indicator that is ultimately needed; they must also be imported into a relational database so that the web front end can query them
  - Analysis database
    - AnalyticDB, RDS, or a self-built relational database can be used here; the main goal is to make queries from downstream systems convenient
    - If the data volume is small and the analysis requirements are few, RDS or a self-built relational database is enough
    - If business needs cause the analysis requirements to change frequently, AnalyticDB is recommended
  - Data presentation
    - Depending on requirements, use Alibaba Cloud QuickBI or build a custom web interface to display the data
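The layering described in this section (ODS, DWD, DWS, ADS) can be illustrated with a minimal Python sketch. It is not MaxCompute code: the log format, the field names (`user_id`, `event`, `ts`), and the indicators are assumptions made up for illustration, and plain in-memory lists and dicts stand in for warehouse tables.

```python
import json
from collections import defaultdict

# Hypothetical raw log lines as they might land in the ODS layer.
ODS_LOGS = [
    '{"user_id": "u1", "event": "view", "ts": "2020-01-01 10:00:00"}',
    '{"user_id": "u1", "event": "buy", "ts": "2020-01-01 10:05:00"}',
    '{"user_id": "u2", "event": "view", "ts": "2020-01-01 11:00:00"}',
    'not-json',  # malformed row, dropped when building DWD
    '{"user_id": "u2", "event": "view", "ts": "2020-01-02 09:00:00"}',
]

REQUIRED_FIELDS = {"user_id", "event", "ts"}


def build_dwd(ods_rows):
    """DWD: parse and clean raw rows into detail records."""
    detail = []
    for line in ods_rows:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip rows that are not valid JSON
        if not isinstance(rec, dict) or not REQUIRED_FIELDS <= rec.keys():
            continue  # skip rows with missing fields
        detail.append({
            "user_id": rec["user_id"],
            "event": rec["event"],
            "dt": rec["ts"][:10],  # keep only the date part
        })
    return detail


def build_dws(detail):
    """DWS: light aggregation, event counts per (dt, user_id, event)."""
    counts = defaultdict(int)
    for rec in detail:
        counts[(rec["dt"], rec["user_id"], rec["event"])] += 1
    return dict(counts)


def build_ads(dws):
    """ADS: final daily indicators (active users and purchase count)."""
    users = defaultdict(set)
    purchases = defaultdict(int)
    for (dt, user_id, event), n in dws.items():
        users[dt].add(user_id)
        if event == "buy":
            purchases[dt] += n
    return {
        dt: {"active_users": len(users[dt]), "purchases": purchases[dt]}
        for dt in sorted(users)
    }


if __name__ == "__main__":
    ads = build_ads(build_dws(build_dwd(ODS_LOGS)))
    for dt, indicators in ads.items():
        print(dt, indicators)
```

Each function reads only the layer directly below it, which mirrors the intent of the layering: the ADS result is small enough to be exported to a relational database for the web front end to query.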
Real-time data warehouse
- Architecture design
- Explanation
  - Raw data comes mainly from two sources
    - User behavior logs generated by the servers
    - Data generated by the business databases
  - Data import
    - Log data is imported into DataHub with Flume (TailDirSource + MemoryChannel + DataHubSink)
    - Real-time business data is imported into DataHub with DTS
  - Data warehouse construction, for which the Kappa architecture can be used (it reduces the two pipelines of the traditional Lambda architecture to one, which lowers maintenance cost)
    - Raw data first enters DataHub; RealtimeCompute then cleans and joins it to produce real-time detail data
    - The real-time detail data enters DataHub; RealtimeCompute then aggregates it, lightly or heavily, to produce real-time summary data
    - The real-time summary data enters DataHub (it can also go directly to the analysis database) and is then imported into AnalyticDB
  - Analysis database (the recommendations are the same as in the offline part, but AnalyticDB is recommended even more strongly here)
    - This part obtains the summary data from the upstream DataHub
    - It can then generate application-layer statistical results that are displayed directly
    - Or it can hand the data to downstream applications for self-service analysis (for cases where the analysis needs keep changing)
  - Data presentation
    - This part is the same as in the offline data warehouse, but the real-time part is usually shown on a large screen with various statistical indicators, for which Alibaba Cloud DataV can be used directly
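The single-pipeline flow described in this section (raw data, then detail data, then summary data, then the analysis database) can be sketched in plain Python. This is only an illustration under stated assumptions: the `Topic` class is an in-memory stand-in for a DataHub topic and not the DataHub API, the stage functions stand in for RealtimeCompute jobs, and the event fields (`user_id`, `amount`) and the `USER_CITY` dimension table are hypothetical.

```python
from collections import defaultdict, deque


class Topic:
    """In-memory stand-in for a DataHub topic (an assumption, not the real API)."""

    def __init__(self):
        self._records = deque()

    def put(self, record):
        self._records.append(record)

    def drain(self):
        # Yield and remove every record currently in the topic.
        while self._records:
            yield self._records.popleft()


# Hypothetical dimension data joined onto each event.
USER_CITY = {"u1": "Hangzhou", "u2": "Beijing"}


def clean_and_join(raw, detail):
    """Stage 1: drop invalid events and attach the user's city (detail data)."""
    for ev in raw.drain():
        if ev.get("user_id") in USER_CITY and ev.get("amount", 0) > 0:
            detail.put({**ev, "city": USER_CITY[ev["user_id"]]})


def aggregate(detail, summary, state):
    """Stage 2: keep a running total per city and emit each update (summary data)."""
    for ev in detail.drain():
        state[ev["city"]] += ev["amount"]
        summary.put({"city": ev["city"], "total": state[ev["city"]]})


def sink(summary, table):
    """Stage 3: upsert the latest totals into the analysis table (a dict here)."""
    for row in summary.drain():
        table[row["city"]] = row["total"]


def run_pipeline(events):
    """Push events through the three stages one at a time, as a stream would."""
    raw, detail, summary = Topic(), Topic(), Topic()
    state, table = defaultdict(int), {}
    for ev in events:
        raw.put(ev)
        clean_and_join(raw, detail)
        aggregate(detail, summary, state)
        sink(summary, table)
    return table


if __name__ == "__main__":
    sample = [
        {"user_id": "u1", "amount": 10},
        {"user_id": "u2", "amount": 5},
        {"user_id": "u1", "amount": 7},
        {"user_id": "u3", "amount": 3},   # unknown user, dropped in stage 1
        {"user_id": "u2", "amount": -1},  # invalid amount, dropped in stage 1
    ]
    print(run_pipeline(sample))
```

Because every stage reads from one topic and writes to the next, reprocessing means replaying the raw topic through the same code, which is the maintenance advantage the Kappa architecture has over running separate batch and streaming pipelines.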