Alibaba computing platform senior technical expert "Wave" on big data technology trends and changes in recent years: strongly recommended reading [Gang makes a product!]

Introduction: The separation of computing and storage has been a major trend in big data architecture in recent years. When summarizing the technology behind the 2019 Tmall Double 11, Alibaba CTO Xingdian also made special mention of Alibaba's progress on separating computing and storage. When big data first took off, mainstream network bandwidth was only 100 Mb/s, and accessing data remotely over the network was too slow. To solve the problem of fast data access, Google creatively proposed an architecture that couples computing and storage, and Hadoop carried this architecture forward with great success. A decade later, network bandwidth has grown a hundredfold to more than 10 Gb/s, and the bottleneck of big data is no longer I/O but computation.
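To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. The 1 GB data size is an assumption chosen purely for illustration, and the calculation ignores protocol overhead, latency, and disk speed; only the 100 Mb/s and 10 Gb/s figures come from the text above.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Ideal time to move size_bytes over a link, ignoring overhead and latency."""
    return size_bytes * 8 / link_bits_per_second


ONE_GB = 10**9               # hypothetical amount of data read remotely
OLD_LINK = 100 * 10**6       # 100 Mb/s, the early mainstream bandwidth
NEW_LINK = 10 * 10**9        # 10 Gb/s, the bandwidth cited for today

old_time = transfer_seconds(ONE_GB, OLD_LINK)   # on the order of a minute
new_time = transfer_seconds(ONE_GB, NEW_LINK)   # under a second
speedup = old_time / new_time                   # roughly a hundredfold

print(f"100 Mb/s: {old_time:.1f} s, 10 Gb/s: {new_time:.1f} s, speedup: {speedup:.0f}x")
```

Under these assumptions, a remote read that once took over a minute now takes less than a second, which is why keeping data on the same machine as the computation matters far less than it used to.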

Recently, InfoQ interviewed Huyue Jun (nickname "Wave"), a senior technical expert on Alibaba's computing platform, about the technical evolution of Alibaba's search and advertising engines, Alibaba's new-generation interactive analysis engine, and the technology trends and changes in the big data field in recent years. In the interview, Huyue Jun said: "The separation of storage and computing lets storage and computing resources scale according to their own needs and saves more cost, but it also brings many challenges to the design and implementation of efficient engines."

InfoQ: You have been responsible for search and advertising engines across several different Alibaba business lines. Could you walk us through the overall technical evolution of Alibaba's search and advertising engines in recent years? For example, what stages can it be divided into, and how does the technical focus differ between stages?

Huyue Jun: In recent years, driven by the growing number of products on the e-commerce platform, the rise of intelligent operations based on real-time recommendation, and the continuous optimization of the shopping experience to promote transactions, search and recommendation engine technology has gone through three stages. The first stage focused mainly on improving the engine's retrieval performance; at that time we did a lot of work on index building, query process optimization, and scoring components to raise the engine's QPS. In the second stage, as the business need for real-time capability became increasingly urgent, we did a great deal of work on both the online and offline sides: the online engine implemented in-memory indexes and associated auxiliary tables, while on the offline side we incubated the Blink stream computing engine, based on Flink, and the Porsche online machine learning platform. This greatly reduced end-to-end processing latency and dramatically improved the real-time search and recommendation shopping experience. The third stage focused on supporting efficient algorithm iteration and continuously improving the precision of search and recommendation: we separated the engine's recall and scoring, abstracted out RankingService to support recall scoring uniformly across search and recommendation scenarios, and added support for online deep learning computation, to further improve the shopping experience and guide transactions.

InfoQ: What is the background behind the birth of Alibaba Cloud's new-generation interactive analysis product? Alibaba's computing platform already has the open source big data platform EMR, the big data computing platform MaxCompute, and the real-time computing platform, so why build yet another interactive analysis engine? What problem does it solve?

Huyue Jun: Alibaba Cloud's interactive analysis engine, an efficient computing and storage platform, began development in 2016. Its initial goal was to solve HBase's stability and performance problems. With storage-compute separation and a purely asynchronous runtime, we implemented a high-performance storage engine; after going online, its performance was 3 to 10 times that of the original HBase-based system. Later, driven by business needs, it became compatible with the PostgreSQL (PG) ecosystem and evolved into a real-time data warehouse system for big data.
It is positioned differently from Alibaba's other big data platforms. MaxCompute is Alibaba's self-developed, efficient offline data warehouse system, focused on high-throughput batch processing. The EMR platform mainly makes it convenient for customers on the public cloud to quickly build their own open source big data solutions. The real-time computing platform focuses on stream processing. Interactive analysis mainly solves the problem of efficient real-time data storage, ad hoc queries, and OLAP analysis, while also accelerating direct queries against offline data warehouse data in MaxCompute.
These platforms are typically combined to give customers a complete big data solution. A typical scenario: data goes through ETL processing in Flink/Blink and is written in real time into the interactive analysis storage system, where users then run various ad hoc queries; if users need to run a batch job, the data is imported into MaxCompute for processing; in addition, for data that is already in MaxCompute, interactive analysis can be used to accelerate queries directly.
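The division of labor described above can be summarized in a small illustrative sketch. This is not any Alibaba API: the `Workload` enum and the `choose_platform` function are hypothetical names introduced only to restate, in code form, which platform the interview assigns to which kind of work.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, named after the scenario in the interview."""
    STREAM_ETL = "stream_etl"
    AD_HOC_QUERY = "ad_hoc_query"
    BATCH = "batch"
    QUERY_ON_OFFLINE_DATA = "query_on_offline_data"


# Mapping taken from the interview's description; the strings are labels, not product APIs.
PLATFORM_FOR = {
    Workload.STREAM_ETL: "Flink/Blink",
    Workload.AD_HOC_QUERY: "interactive analysis",
    Workload.BATCH: "MaxCompute",
    # Data already stored in MaxCompute can be queried directly with acceleration.
    Workload.QUERY_ON_OFFLINE_DATA: "interactive analysis",
}


def choose_platform(workload: Workload) -> str:
    """Return the platform that handles the given kind of workload."""
    return PLATFORM_FOR[workload]


for w in Workload:
    print(f"{w.value} -> {choose_platform(w)}")
```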

InfoQ: Are there commercial or open source products in the industry that Alibaba Cloud's interactive analysis can be benchmarked against? If so, compared with those products, what are its technical differences and highlights?

Huyue Jun: Products in the industry comparable to Alibaba Cloud's interactive analysis include Redshift, Snowflake, GaussDB, and Hermes. The technical highlights of Alibaba Cloud's interactive analysis are: storage-compute separation, efficient hybrid row-column storage, an Orca-based optimizer with federated query support, a purely asynchronous high-performance query engine, and compatibility with the PG11 ecosystem, among other features.

InfoQ: For the last three years you have worked mainly on the design and development of storage and computing engines. From the perspective of the big data storage layer and the computing engine, which new technologies or projects of the past three years do you think are worth mentioning? What changes have there been in technology trends?

Huyue Jun: I personally believe that the most meaningful new development in big data storage and computing over the past three years is the rise of storage-compute separation, as in Snowflake. It lets storage and computing resources scale according to their own needs and saves more cost. Of course, it also brings many challenges to the design and implementation of efficient engines. For example, how do you design the storage model and implement efficient I/O? How do you optimize network connections? How do you handle low-latency queries when I/O latency on the compute nodes may increase?
On technology trends, one trend I personally see is growing attention to the storage layer, for example Databricks open-sourcing Delta Lake. For Alibaba Cloud's interactive analysis engine, the underlying storage engine is likewise a very important competitive strength. In fact, only by building the storage engine well and managing data in a unified way can the computation on top be made more efficient and unified.
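The cost argument for storage-compute separation can be illustrated with a minimal sizing sketch. All numbers below (the workload size and the per-node disk and core counts) are hypothetical assumptions chosen for illustration; they do not describe any real cluster or product.

```python
import math


def coupled_nodes(storage_tb: int, cores: int, tb_per_node: int, cores_per_node: int) -> int:
    """Coupled architecture: every node brings both disk and CPU, so the
    cluster is sized by whichever resource demands more nodes."""
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores / cores_per_node)
    return max(nodes_for_storage, nodes_for_compute)


def separated_compute_nodes(cores: int, cores_per_node: int) -> int:
    """Separated architecture: compute nodes are sized by CPU demand alone,
    while storage capacity is provisioned independently."""
    return math.ceil(cores / cores_per_node)


# Hypothetical storage-heavy workload: 1000 TB of data, but only 320 cores needed.
STORAGE_TB, CORES = 1000, 320
TB_PER_NODE, CORES_PER_NODE = 10, 32

nodes = coupled_nodes(STORAGE_TB, CORES, TB_PER_NODE, CORES_PER_NODE)
idle_cores = nodes * CORES_PER_NODE - CORES
compute_nodes = separated_compute_nodes(CORES, CORES_PER_NODE)

print(f"coupled: {nodes} nodes with {idle_cores} idle cores")
print(f"separated: {compute_nodes} compute nodes plus {STORAGE_TB} TB of storage")
```

In this made-up case the coupled cluster must buy 100 full nodes just to hold the data, leaving most of their cores idle, while the separated design needs only 10 compute nodes. The price, as the interview notes, is that every read now crosses the network, which is where the I/O and latency challenges come from.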

InfoQ: There is a view that "2017 and 2018 were the hot years for computing engines, and the field is now already a red ocean." Do you agree? What stage of development do you think big data computing engines are at? Is the market already saturated? Going forward, which technical directions for computing engines deserve attention?

Huyue Jun: This year, various open source computing engines have indeed developed rapidly, for example Flink SQL's unified stream and batch processing, the maturing of Spark Structured Streaming, and the release of version 6.0 of the MPP engine Greenplum. But the field is still far from a red ocean. According to our investigation, many companies' big data solutions are still based on Hadoop/Hive, and the market penetration of the new engines is still at an early stage.
As for computing engines themselves, I think graph computing and efficient support for image and video processing may be technical directions worth watching. With the rise of recommendation, credit, and security needs, storing and processing relationships is becoming more and more important. Support for graph computing in the various engines is still at an early stage, and its further development deserves attention. Vector computing, driven by image and video processing, is also being applied more and more widely, and several related technologies have been open-sourced one after another.

InfoQ: Beyond computing engines, the big data storage layer has seen many hot topics this year, such as the data lake and the real-time data warehouse. How do you view the popularity of the real-time data warehouse and the data lake this year?

Huyue Jun: The popularity of the real-time data warehouse is ultimately driven by the business. Today, intelligent recommendation, precision operations, and similar businesses depend on fast, real-time data mining. For many businesses, there is no going back to data analysis at the level of hours or days.
As for the data lake: today's data warehouses generally store data that has been cleaned by ETL, so some information from the raw data is lost. Some people therefore advocate also storing the raw data to allow more flexible analysis. The data lake is one such solution. It provides unified mechanisms for synchronizing, storing, and managing data, as well as for submitting and scheduling computing tasks, and it emphasizes more comprehensive and systematic data management and application. In my personal understanding, the data lake is a concept, just like the data warehouse, but it advocates keeping more of the raw data and strengthening data management and control. The underlying technology should still be based on current storage and computing technology, without much revolutionary change.

InfoQ: In June 2019, Google acquired the data analytics company Looker for $2.6 billion. In the same month, Salesforce announced its $15.7 billion acquisition of the enterprise BI company Tableau. In September 2019, Cloudera announced the acquisition of Arcadia Data, a vendor of real-time analytics and business intelligence. What do these acquisitions mean for the big data field? Will unified data analytics platforms be the next technology breakout point in big data?

Huyue Jun: My personal understanding is that these acquisitions reflect the top companies' push into, and control over, big data business systems and data analysis. Such integration should give users a better analysis system experience, for example through cloud data analysis services, and thereby allow these companies to better capture the PaaS and SaaS markets.
I personally think it is a natural outcome that data analysis platforms will unify data storage and management as well as the scheduling and execution of various analysis tasks, avoiding the overhead of moving data while providing a unified user experience.

Interview source: "Senior Alibaba technical expert Huyue Jun: The big data decade, the technological changes and trends I have seen". Author: Cai Fangfang


Finally, based on what I know, let me share a few of my own views.

I remember a conversation I once had with a manager at UF (and no, you guessed wrong, I was there for an interview). He said that big data technology has matured in recent years, and he personally encouraged me to take the Java route and learn more about overall business architecture. He said that is how he got to where he is, too.

Of course, I cannot say my senior is wrong; he has his reasons. And me? I am a rather stubborn person. I just refuse to believe it.

Personally, I would still rather go for a data position, even though I currently do backend development. (I will ramble about that another time.)
Back to the topic. In the interview above, Brother Wave said this: According to our investigation, many companies' big data solutions are still based on Hadoop/Hive, and the market penetration of the new engines is still at an early stage.

Hmm, I do not know whether this counts as a leak, but some real-time business inside JD (Jingdong) still uses Storm. Compared with Spark Streaming and the new darling Flink, it does seem rather old.

Of course, within a company, choosing a technology does not mean the newest is the best. The most suitable is the best, just like love.

So the prospects for big data are not that pessimistic. I think they are actually quite friendly.

The interview also covers the real-time data warehouse and the data lake, and explains them in a way that is easy to understand. If you skipped that part, scroll back up and read it!!!

Of course, the interview covers more than just these points, and there are parts of it that I do not understand because my own knowledge is still shallow.

But the whole interview is well worth a careful read!

We have no way to predict the future. Could you have imagined that this year masks would outdo pork?
So what we have to do is seize the present, roll up our sleeves, and get to work!

Finally, many thanks to Cai Fangfang for the interview article.
And stay strong, Wuhan (today's count rose by 15,000, which scared me), and stay strong, China. If we cannot go to work yet, at least let us get a haircut!!!

Origin blog.csdn.net/qq_41946557/article/details/104302704