Hearing That Teradata Had Withdrawn from China Reminded Me of a Data Warehouse Project I Once Worked On

Yesterday Teradata announced its withdrawal from China, and it reminded me of a data warehouse project I did 20 years ago. Back then, Teradata was practically synonymous with "data warehouse," the way "Baidu" is synonymous with "search" for many people in China.

Unfortunately, I never actually used Teradata. In 2002, I built a so-called decision support system on SQL Server's data warehouse and business intelligence stack. "Decision Support System" (DSS) — that name was very popular in those days.

(1)

Twenty years ago the sources of data were much like today's:

  • Much of it comes from Excel, so it gets pulled in via ETL

  • Some data you want to collect has no application software behind it, so you build a simple entry form with an OA/no-code tool, then extract the data with ETL

  • Some data accumulates in dedicated application software and is extracted with ETL

So the first thing you need is an ETL tool. Before 2013 I used the SQL Server suite; as I remember it, its ETL tool was named Integration Services.
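The extract-and-clean step such a tool performs can be sketched in a few lines. This is a minimal, hypothetical example — the field names and data are invented — showing the typical pattern: read a departmental export, normalize the messy values, and stage the rows for loading.

```python
import csv
import io

# Hypothetical raw export from a departmental system (a stand-in for an Excel dump).
raw = """dept,item,amount
Sales,Widget A,"1,200"
Sales,Widget B,"950"
"""

def extract_transform(text):
    """Extract rows and clean the numeric fields (a typical small ETL step)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rec["amount"] = int(rec["amount"].replace(",", ""))  # strip thousands separators
        rows.append(rec)
    return rows

staging = extract_transform(raw)
print(staging[0])  # {'dept': 'Sales', 'item': 'Widget A', 'amount': 1200}
```

A real ETL job adds scheduling, error handling, and a load step into the staging database, but the shape is the same.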

(2)

Data extracted from the various sources comes from different systems, and some of the shared data among them is really master data. Each department's software serves only that department: the sales system serves the sales staff, and the purchasing system serves the purchasing staff. So in practice the master data is not uniform — the same thing has different names, different codes, and different fields in different systems. That causes no problem in day-to-day departmental use, but when you build an enterprise-wide decision support system and show the data to the boss, it must be unified. So you need a master data management system — this is MDM (Master Data Management).

(3)

Master data must be manually defined and standardized, then cleaned, integrated, and unified — or mapped across systems with one designated as the master. That involves the replication, distribution, or synchronization of master data. I remember SQL Server had a dedicated Replication service for this. In the new generation of big data technology, people more often use Kafka.
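The "mapped across systems" idea reduces to a cross-reference table. Here is a minimal sketch with invented system names and codes: each source system's local code for the same product resolves to a single master ("golden") record.

```python
# Hypothetical cross-reference table: each source system's local code for the
# same product, keyed to one unified master-data code.
XREF = {
    ("sales_sys", "WGT-A"): "MD-0001",
    ("purchasing_sys", "P/1001"): "MD-0001",  # same product, different local code
    ("sales_sys", "WGT-B"): "MD-0002",
}

def to_master(system, local_code):
    """Resolve a system-local code to the unified master-data code."""
    return XREF[(system, local_code)]

# The sales system's "WGT-A" and purchasing's "P/1001" are the same thing:
print(to_master("sales_sys", "WGT-A") == to_master("purchasing_sys", "P/1001"))  # True
```

An MDM product wraps this table in governance — who may edit it, how conflicts are resolved, how changes propagate — but the mapping itself is the core.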

(4)

Besides the master data that goes into the MDM, the business data is ETL-cleaned and placed into fact tables. This involves an ODS (Operational Data Store). The next step is to build the model in the data warehouse — define the dimensions — then extract the data from the ODS into the warehouse, storing and retrieving it by dimension so that multidimensional analysis can be done later.
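"Storing and retrieving by dimension" is the star schema: dimension tables describe the business entities, and fact rows carry only foreign keys plus measures. A minimal sketch with invented tables and values:

```python
# Minimal star-schema sketch: dimension tables plus a fact table whose rows
# hold only dimension keys and measures. All names and numbers are illustrative.
dim_product = {1: "Widget A", 2: "Widget B"}
dim_region  = {10: "North", 20: "South"}

# fact rows: (product_key, region_key, quantity, revenue)
fact_sales = [
    (1, 10, 5, 500.0),
    (1, 20, 3, 300.0),
    (2, 10, 2, 400.0),
]

def total_by(dim_index, dim_table):
    """Aggregate the revenue measure along one dimension."""
    out = {}
    for row in fact_sales:
        name = dim_table[row[dim_index]]
        out[name] = out.get(name, 0.0) + row[3]
    return out

print(total_by(0, dim_product))  # {'Widget A': 800.0, 'Widget B': 400.0}
print(total_by(1, dim_region))   # {'North': 900.0, 'South': 300.0}
```

The same facts can be rolled up along any dimension, which is exactly what makes later multidimensional analysis possible.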

Enterprise data is mostly text-based and structured, which is exactly what the traditional ODS was best at handling. Internet companies' data is more diverse — blog posts, emails, IM messages, documents, pictures, videos — so Hadoop was invented to act as a data lake. But the Hadoop data lake, while good at unstructured and multimedia data, is weak at text and structured data, which is why people are now exploring warehouse-on-lake and lakehouse integration, with projects such as Delta, Hudi, and Iceberg.

(5)

This next step is the data warehouse itself. The data warehouses I learned were all columnar, multidimensional warehouses. But many people now say a data warehouse is a virtual concept — that a multidimensional warehouse is not strictly necessary, and an ordinary row-based relational database can serve as one. That confuses me, because it differs from my experience. At the very least, I think you should use an OLAP-type database, not an OLTP-type one.

For Chinese customers, the reality today is mainly producing complex two-dimensional reports rather than multidimensional analysis. My suggestion is not to build multidimensional data warehouses or deploy true data warehouse products, but to use an OLAP database — I would recommend something like Greenplum, ClickHouse, or Apache Doris. But I firmly oppose using an OLTP relational database as a data warehouse. Some people build their so-called data warehouse directly on a row-oriented OLTP database such as SQL Server or MySQL, and blur "data warehouse," "reporting," and "business intelligence" into one muddle, calling it whichever name suits them. That is passing one thing off as another.
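Why does row-oriented OLTP storage serve analysis poorly? An analytical query usually touches one or two columns out of many, and a columnar layout lets it scan just those columns contiguously. A conceptual sketch (invented data; in Python both forms are in memory, so this illustrates the layout, not the real I/O savings):

```python
# Row store vs. column store, conceptually.
rows = [  # row-oriented: each record stored whole, as OLTP databases do
    {"id": 1, "region": "North", "revenue": 500.0},
    {"id": 2, "region": "South", "revenue": 300.0},
    {"id": 3, "region": "North", "revenue": 400.0},
]

columns = {  # column-oriented: each attribute stored as its own array
    "id": [1, 2, 3],
    "region": ["North", "South", "North"],
    "revenue": [500.0, 300.0, 400.0],
}

# The same query both ways: SELECT SUM(revenue)
row_total = sum(r["revenue"] for r in rows)  # must walk every full record
col_total = sum(columns["revenue"])          # scans one contiguous array
print(row_total == col_total == 1200.0)      # True
```

On disk the difference is decisive: the column store reads only the `revenue` column, while the row store reads every byte of every record.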

(6)

By field, by subject, by model, by dimension — that is how the data moves from the ODS into the warehouse. But there is a small detour along the way: some complex analytical indicators require heavy computation, and the results must be saved for future historical comparison.

So a dedicated multidimensional computation language is needed to calculate certain indicators, and the computed results are then written into the data warehouse. In SQL Server, MDX (Multidimensional Expressions) serves this purpose. In the new generation of open-source big data technology, Flink and Spark actually do this work.
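The "compute once, store for history" pattern is simple in essence. A minimal sketch (invented figures and key names) of pre-computing a year-over-year growth indicator and persisting it alongside the raw facts:

```python
# Hypothetical pre-computation of a derived indicator (year-over-year growth),
# stored back so future reports can compare against history.
revenue_by_year = {2001: 800.0, 2002: 1000.0}

def yoy_growth(this_year, last_year):
    """Year-over-year growth ratio: (current - prior) / prior."""
    prior = revenue_by_year[last_year]
    return (revenue_by_year[this_year] - prior) / prior

# Persist the computed indicator (sketched here as an in-memory store).
indicator_store = {("yoy_growth", 2002): yoy_growth(2002, 2001)}
print(indicator_store[("yoy_growth", 2002)])  # 0.25
```

Once stored, next year's report can compare 2003's growth against this saved 2002 figure without recomputing from raw data.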

(7)

Once all the data is finally in the warehouse, organized by dimension, the most common task is producing complex analysis reports, where many related indicators must appear together in one report. This is where SQL Server's Reporting Services comes in.

Many people skip the multidimensional warehouse and produce complex comparative analysis reports directly on the OLTP relational database. I have seen someone write a stored procedure of more than 1,000 lines to generate a single report — hard to read, hard to understand, hard to modify, hard to debug, hard to trace.

In the 1990s, when we produced complex reports with PowerBuilder, there was a cross-tab tool — we called it Cross Table. In Excel I believe the equivalent is the PivotTable. These are all common tools for complex, comprehensive analysis in flat reports.
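A cross-tab is conceptually tiny: flat rows go in, a two-dimensional grid comes out. A minimal sketch with invented data:

```python
# A minimal cross-tab (pivot): flat (row_label, col_label, value) triples in,
# a region-by-product grid out.
flat = [
    ("North", "Widget A", 5),
    ("North", "Widget B", 2),
    ("South", "Widget A", 3),
]

def crosstab(rows):
    """Pivot value triples into a nested dict: grid[row][col] = summed value."""
    grid = {}
    for r, c, v in rows:
        cells = grid.setdefault(r, {})
        cells[c] = cells.get(c, 0) + v
    return grid

print(crosstab(flat))
# {'North': {'Widget A': 5, 'Widget B': 2}, 'South': {'Widget A': 3}}
```

Everything a PivotTable adds — subtotals, sorting, formatting — is layered on top of this basic regrouping.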

(8)

There is also a "report" that is not really a report at all, though people call it one. I would call it a query: the results are simply displayed as a table in a grid. This kind of thing does not need a multidimensional data warehouse — fetching from the ODS fact table is enough. In the new generation of big data technology, big data query engines like Presto mainly handle it.
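To make the distinction concrete, such a "report" is just a filter over the fact rows, with no dimensional modeling involved. A sketch with invented order data:

```python
# A 'report' that is really just a query: filter ODS fact rows and show them
# in a grid. Field names and rows are illustrative.
ods_orders = [
    {"order_id": 101, "customer": "Acme", "status": "shipped", "total": 250.0},
    {"order_id": 102, "customer": "Beta", "status": "open",    "total": 120.0},
    {"order_id": 103, "customer": "Acme", "status": "open",    "total": 90.0},
]

def query(rows, **filters):
    """Return the rows matching all equality filters, ready for a table/grid."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]

open_orders = query(ods_orders, status="open")
print(len(open_orders))  # 2
```

No dimensions, no cube — just selection, which is why an ODS or a query engine suffices.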

The SQL Server business intelligence suite also included an indexing service for full-text search. In the new generation of big data technology, big data search engines such as Elasticsearch mainly handle this.

(9)

There is also a more complex kind of visual analysis, with both visualization and analytical features. We call it the Cube.

I used Cubes in both the SQL Server business intelligence suite and the IBM Cognos suite. A cube supports drilling up and down, rotating (pivoting), and slicing. In the new generation of big data technology, I see Kylin focusing on this.
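Those cube operations can be sketched over a plain dictionary of cells. This is a toy model with invented data — real engines pre-aggregate and index heavily — but slice and roll-up are exactly these two moves:

```python
# Tiny cube: cells keyed by (year, region, product) with a sales measure.
cube = {
    (2002, "North", "Widget A"): 5,
    (2002, "North", "Widget B"): 2,
    (2002, "South", "Widget A"): 3,
    (2003, "North", "Widget A"): 4,
}

def slice_cube(cube, axis, value):
    """Slice: fix one dimension at a value, keeping the other dimensions."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def rollup(cube, axis):
    """Roll up: aggregate the measure across one dimension, removing it."""
    out = {}
    for key, v in cube.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

print(slice_cube(cube, 0, 2002))  # only the 2002 cells
print(rollup(cube, 2))            # sales by (year, region), product rolled away
```

Drill-down is the reverse of roll-up (expanding a summarized dimension back out), and rotation is just choosing which dimensions face the rows and columns of the display.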

(10)

Beyond the Cube, which combines visualization and analysis, there is real analysis — called Analysis Services in SQL Server, and also known as data mining.

With Analysis Services I used classification, clustering, decision tree, linear regression, and time series algorithms. At that time Microsoft did not provide a neural network algorithm. In the new generation of big data technology, the Spark suite includes MLlib, a machine learning algorithm library covering much the same ground. And today's AI platforms — TensorFlow and PyTorch — provide the various algorithms and models of deep learning, which is a different path from the classic machine learning libraries.
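To give a flavor of the simplest of those algorithms, here is a minimal ordinary-least-squares linear regression, stdlib only. This is an illustrative sketch, not how any of those suites implement it internally:

```python
# Minimal ordinary least squares: fit y = a*x + b to paired observations.
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# A perfectly linear toy series recovers its slope and intercept exactly.
a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)  # 2.0 0.0
```

Classification, clustering, and the rest are each a few steps up in complexity, but the workflow in those suites was the same: pick an algorithm, feed it warehouse data, read back a model.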

(11)

I have used almost every product in the SQL Server business intelligence suite, but the results were:

  • It did not sell well. It looked impressive in marketing and in bids, but very few copies actually sold.

  • Implementation was complex and involved a great deal of SQL writing. Implementation consultants back then understood database schemas and SQL; today's consultants can only configure the functional interface. Even with the many built-in business analysis templates, there was always work to do — from display changes to calculation changes to ETL extraction changes.

  • In actual use, the many built-in complex ratio-indicator reports and chart visualizations went to waste: the customers' business analysis skills were relatively weak, and they could not understand such complex comprehensive reports.

Whatever level Party A (the customer) is really at, Party B (the vendor) has to meet them there.

(12)

You may ask: with so many new-generation big data technologies produced over the past decade, what problems have they actually solved?

What I want to say is: these new-generation big data technologies mainly suit Internet companies, which deal with multi-modal data at genuinely massive scale. The internal applications of Chinese enterprises deal mainly with structured text data, and the volume is "fake massive" (they cannot even generate 100,000 important business records a day). So my view is: for internal enterprise data analysis in China, it is still advisable to use the mature commercial suites of 20-plus years ago rather than chase the trend. The new-generation big data technology is a poor fit, more complicated, and not really necessary.

But I also know that saying this is pointless. Party B has to tell new stories and sell new products — whether you need new-generation big data technology or not, they will push it. Money earned is money earned, however it's made.



Origin blog.csdn.net/david_lv/article/details/129095628