The data middle platform is the next big thing. Will it upend the work of data engineers?

AI Frontline editor's note: The data middle platform is billed as the next stop for big data. The concept rose to prominence at Alibaba, and its core idea is data sharing. In 2018 it became a focus of discussion again because of the "Tencent data middle platform debate". At the ThoughtWorks Technology Radar Summit on March 15, the topic also drew enthusiastic attention from attendees. Today everyone seems to be talking about the data middle platform, but not everyone knows exactly what it means. Is it a lofty concept that only large companies need to consider? Should ordinary companies build one? Will its emergence bring disruptive challenges to existing data practitioners? With these questions in mind, InfoQ interviewed Shi Kai, director of data and intelligence at ThoughtWorks, at the Technology Radar Summit to hear his views on the data middle platform.

The data middle platform is not a big data platform!

First of all, it is not a platform, nor is it a system. If a vendor says it has a data middle platform to sell you, sorry, that vendor is a fraud.

To answer what a data middle platform is, we first need to explore what a middle platform is. There is no clear definition, but speaking as straightforward engineers, we can first regard a middle platform as an intermediate layer. And since it is an intermediate layer, "middle platform" is a thoroughly technical term that can be explored in purely technical terms.

Gartner's Pace Layer model helps explain why an intermediate layer is needed, and so clarifies the positioning and value of the middle platform. The model says that systems can be layered by how quickly they change, so that reasonable boundaries and services can be designed for each layer.

In data development, the core data model changes relatively slowly, and the data maintenance workload is very large. Business innovation, however, moves quickly, and so do the data requirements it generates.

The data middle platform emerged precisely to bridge the gap between data development and application development, where the mismatch in development speed leaves data unable to keep up with applications.

The problems the data middle platform solves can be summarized in three points:

  1. Efficiency: Why does adding one report take ten days of development? Why can't a user recommendation list be produced in real time? When business staff have a small doubt about the data, it takes a long time to discover that the data source has changed, which ultimately delays the release.

  2. Collaboration: When developing a business application, the requirements may be roughly the same as another project's, but because that project is maintained by another team, the data still has to be developed all over again.

  3. Capability: Data processing and maintenance is a relatively independent technical discipline that requires specialists, but companies often have many application developers and very few data developers.

All three kinds of problems slow down application development teams. This is the key to the middle platform: it keeps the development speed of front-end teams from being held back by back-end data development.

Shi Kai's summary: "The data middle platform is a logical concept that aggregates and governs cross-domain data, abstracts and packages the data as services, and delivers business value to the front end."

As shown below:

The Data API is the core of the data middle platform. It is the bridge between the front end and the back end: data is provided to the front end as API-based data services rather than by handing over the database directly and leaving front-end developers to work with the data themselves. How the Data API is produced, how to produce it faster, how to make it clearer, and how to improve its data quality are the capabilities a data middle platform is built around.
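To make the idea of a Data API concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `orders` table, the `DataApi` class, and the `customer_spend` method are hypothetical names, and an in-memory SQLite database stands in for whatever back-end store an enterprise actually uses. The point is only that the front end calls a business-level function and never touches the tables.

```python
import sqlite3


def build_source_db() -> sqlite3.Connection:
    """Stand-in for a back-end data store (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 120.0), ("bob", 80.0), ("alice", 30.0)],
    )
    return conn


class DataApi:
    """Hypothetical Data API: callers get business-level answers, not tables."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def customer_spend(self, customer: str) -> dict:
        # The SQL and the schema stay hidden behind the service boundary.
        row = self._conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        return {"customer": customer, "total_spend": row[0]}


api = DataApi(build_source_db())
result = api.customer_spend("alice")
```

If the storage behind `DataApi` later moves to a different database or schema, front-end code that calls `customer_spend` would not need to change, which is the decoupling the paragraph above describes.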

The key differences between the data middle platform, the data warehouse, and the data platform

This is a question the industry discusses often: what exactly is the difference between a data warehouse, a data platform, and a data middle platform?

In short, the three differ in the following respects:

  1. The data middle platform is an enterprise-level logical concept that embodies the enterprise's D2V (Data to Value) capability; it serves the business mainly through Data APIs;

  2. A data warehouse is a relatively concrete functional concept: a collection that stores and manages data for one or more subject areas; it serves the business mainly through analytical reports;

  3. A data platform is a foundational data platform built on big data technology that integrates structured and unstructured data; it serves the business mainly by providing data sets directly;

  4. The data middle platform is closer to the business and can serve it faster;

  5. A data warehouse supports management decision analysis, whereas the data middle platform turns data into services for business systems, covering not only analytical scenarios but also transactional ones;

  6. The data middle platform can be built on top of a data warehouse and a data platform, as an intermediate layer that accelerates the path from enterprise data to business value.

A data warehouse holds historical data, most of it structured. It does not contain all of an enterprise's data; data is extracted selectively according to need. Its business value therefore lies in reports of all kinds, but these reports cannot be generated in real time. Reports serve one part of the business; they provide value but do not directly affect business operations.

The data platform emerged to solve the data warehouse's inability to handle unstructured data and its long report development cycles. It sets business requirements aside at first and extracts all of the enterprise's data, structured and unstructured, into one large data set. When the business raises a requirement, the few small data sets it needs are extracted separately and provided to data applications as data sets.

The data middle platform builds on the data warehouse and the data platform, turning data into Data API services so that the business is served more efficiently.

What capabilities should a data middle platform have?

Since big data and artificial intelligence took off over the past few years, many people have repeated the saying "data is the new oil". Shi Kai sees it somewhat differently: in his view, data is not the same as data assets, and without planning the data from a business perspective, no amount of data will create value.

Shi Kai considers the data asset catalog a key core component of the data middle platform. "We believe that for an enterprise to make the most of its data, a very important prerequisite is that its data structures and data asset catalog are open to the whole enterprise. With the catalog, everyone can see what categories of data assets the company has, what attributes they contain, and who manages each data source, and so can quickly work out whether the data is what they need. The data itself, however, need not be open, because data is subject to privacy and security classifications."

In large enterprises with many lines of business, different lines may hold duplicate data. Building a data asset catalog means de-duplicating, standardizing, and organizing data models into a tree that does not map directly onto database fields. Take air cargo as an example: the data assets may include freighters and the belly holds of passenger aircraft. The freighter is a node in the data asset catalog, and its attributes (such as model, hold capacity, and age) form the data model under that node. The catalog starts at the business level to set data standards and extracts the data assets that matter to the enterprise's business model; what database stores them, in what structure, and whether they are structured or unstructured does not matter at this stage. In effect, the enterprise's business is combed through at the data level, and the business model is expressed in the language of data. Once the catalog is done, the technical work that follows is to map the catalog to the places where the data is actually extracted.

Besides being open, the data asset catalog should carry tags and descriptions and be searchable, so that the people who actually use the data can find what they need as quickly as possible.
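A data asset catalog of this kind can be sketched as a small tree of nodes with tags, descriptions, and a search function. The sketch below is illustrative only: the `AssetNode` class, its fields, and the sample air-cargo nodes and owners are assumptions, not a description of any real catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class AssetNode:
    """One node in a hypothetical data asset catalog tree."""

    name: str
    description: str = ""
    tags: set = field(default_factory=set)
    attributes: list = field(default_factory=list)  # business-level attributes
    owner: str = ""                                  # who manages the source
    children: list = field(default_factory=list)

    def add(self, child: "AssetNode") -> "AssetNode":
        self.children.append(child)
        return child

    def search(self, keyword: str) -> list:
        """Return names of nodes whose name, description, or tags match."""
        kw = keyword.lower()
        hits = []
        if (
            kw in self.name.lower()
            or kw in self.description.lower()
            or any(kw in tag.lower() for tag in self.tags)
        ):
            hits.append(self.name)
        for child in self.children:
            hits.extend(child.search(keyword))
        return hits


catalog = AssetNode("air cargo", "root of the catalog")
catalog.add(AssetNode(
    "freighter", "dedicated cargo aircraft",
    tags={"capacity", "fleet"},
    attributes=["model", "hold_capacity", "age"],
    owner="fleet team",  # hypothetical owner
))
catalog.add(AssetNode(
    "passenger belly hold", "cargo space on passenger aircraft",
    tags={"capacity"},
    attributes=["flight", "hold_capacity"],
))

capacity_assets = catalog.search("capacity")
fleet_assets = catalog.search("fleet")
```

Note that the nodes hold business-level attributes and an owner, not table or column names; mapping each node to physical storage would be a separate step.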

In the lean data innovation framework that ThoughtWorks has proposed, the data capabilities an enterprise needs are summarized in the following six areas. With these six capabilities, an enterprise has the foundation to become data-driven and intelligent, and the platform that carries these capabilities is the data middle platform:

  1. Planning and governing data assets

Before building a middle platform, you first need to know where the business value lies and work out, from a business perspective, what the enterprise's data assets are. Data assets are not the same as data; only data that can generate value for the business is an asset. Different business units may care about completely different metrics in the same pile of data. To bring cross-domain business units to a unified standard, the enterprise needs to plan a business data panorama that maps out all data that may be used and has potential value, and finally distill the enterprise's data asset catalog. At this point there is no need to consider whether a system or the data already exists; the only concern is which data is valuable to the business. This layer should not be too fine-grained, because an overly detailed catalog is hard to standardize and cannot be applied across scenarios. Data governance is a very important part of the data field. ThoughtWorks believes that now that business boundaries are disappearing and requirements change rapidly, enterprises need Lean Data Governance: traditional centralized, up-front, control-oriented governance has to give way to decentralized, after-the-fact, service-oriented governance.

  2. Acquiring and storing data assets

The data middle platform needs to give the enterprise strong capabilities for acquiring and storing data assets.

  3. Data sharing and collaboration

An enterprise's data middle platform must be cross-domain, and everyone needs to know where the data asset catalog is. Data security concerns are not a reason to hide from people what data the enterprise has. Without sharing and openness, data cannot flow, and if it does not flow, it generates value very slowly. So, with data security as the baseline, the enterprise data asset catalog should be open to stakeholders and value creators, so that business staff can work in a self-service way.

  4. Exploring and analyzing business value

The data middle platform should not only give access to source data; it also needs to provide tools and capabilities for analyzing data, to help business staff explore and discover the data's business value. A good data middle platform offers personalized data exploration and analysis tools for users in different roles and, on that basis, can generate Data APIs with one click and provide them to front-end systems in varied forms.

  5. Building and governing data services

The data middle platform needs to guarantee the performance and stability of data services as well as their data quality and accuracy, and it needs strong service governance capabilities. A data middle platform is an ecosystem platform on which all kinds of data services keep growing, so building good data service governance from the outset is very important: data services need to be recorded, traceable, auditable, and monitorable.

  6. Measuring and operating data services

If the data middle platform merely hands data over to the business in the end, it is nothing more than a porter. It also needs the ability to measure and operate data services: to track and record the data services on the platform and the behavior around them, including which departments use which data services and how often, and to measure the business value of each data service in this way.

Shi Kai believes the data middle platform is a profit-center platform that needs to be run with an Internet mindset. It needs business analysts who analyze its usage: why did someone in the finance department call the platform ten times this morning and not at all in the afternoon, why did he call that data service, and which other data services are usually called along with it? All of this has to be recorded and logged for analysis, and the data services should be operated the way an e-commerce platform is. This real-time operational data then prompts data service providers to adjust, change, and optimize their services. Only a data middle platform run this way can give the business the fastest support and response.
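The kind of usage metering described here can be sketched in a few lines of Python. The `ServiceMeter` class, the `metered` decorator, and the `revenue_report` service are hypothetical names invented for this example; a real platform would persist the log and capture far more context.

```python
from collections import Counter
from datetime import datetime, timezone


class ServiceMeter:
    """Hypothetical in-memory call log for data services."""

    def __init__(self) -> None:
        self.log = []

    def record(self, service: str, department: str) -> None:
        self.log.append({
            "service": service,
            "department": department,
            "at": datetime.now(timezone.utc),
        })

    def calls_by_department(self, service: str) -> Counter:
        return Counter(
            entry["department"] for entry in self.log
            if entry["service"] == service
        )


def metered(meter: ServiceMeter, service: str):
    """Wrap a data service so every call is logged with the calling department."""
    def wrap(fn):
        def inner(department, *args, **kwargs):
            meter.record(service, department)
            return fn(*args, **kwargs)
        return inner
    return wrap


meter = ServiceMeter()


@metered(meter, "revenue_report")
def revenue_report(month: str) -> dict:
    return {"month": month, "revenue": 0}  # placeholder payload


for _ in range(10):
    revenue_report("finance", "2019-03")
revenue_report("sales", "2019-03")

usage = meter.calls_by_department("revenue_report")
```

With the timestamps in the log, the same records could be grouped by hour to answer questions such as why a department called a service in the morning but not in the afternoon.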

Why does everyone need a data middle platform?

The data middle platform is not a lofty thing that only big companies need.

Since 2017, ThoughtWorks has helped a number of large enterprises in China and abroad build data middle platforms, ranging from enterprise-level middle platforms of very large scale to small ones.

"In the future, the core business of every enterprise will be processing data, and the data middle platform is the factory that produces value from data. So every enterprise needs data middle platform capabilities, and the data middle platform is bound to become standard equipment for every enterprise."

In Shi Kai's view, a data middle platform does not mean a "big" data platform. Depending on the size of the enterprise and its business, data middle platforms vary in size and complexity, but the kind of business value they produce is the same.

What should an enterprise consider when evaluating whether to build a data middle platform? Shi Kai thinks that, strategically, every enterprise needs its own data middle platform; tactically, an enterprise should consider building one when it finds that the speed of data development and utilization no longer matches the speed of application development.

In the past, when building an application system, many enterprises went straight to a monolithic architecture without much thought: stand up a database first, then build the application on top. ThoughtWorks now advises enterprises that even if they are not building a data middle platform, or have no project called "data middle platform", they should split any application into three layers when building it: a service layer, a data middle platform layer, and a source data layer. These three layers should be abstracted out from the very beginning.
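As a rough illustration of that three-layer split, the sketch below separates a source data layer, a data middle layer, and a service layer in plain Python. The class names, the order records, and the "VIP customer" rule are all assumptions made for the example; the article does not prescribe any particular code structure.

```python
class SourceDataLayer:
    """Raw records as the source systems hold them (hypothetical sample)."""

    def __init__(self, rows: list) -> None:
        self._rows = rows

    def read_orders(self) -> list:
        return list(self._rows)


class DataMiddleLayer:
    """Turns raw records into reusable, business-level data services."""

    def __init__(self, source: SourceDataLayer) -> None:
        self._source = source

    def order_total_by_customer(self) -> dict:
        totals = {}
        for row in self._source.read_orders():
            totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
        return totals


class ServiceLayer:
    """Application logic; depends only on the middle layer, never on raw rows."""

    def __init__(self, data: DataMiddleLayer) -> None:
        self._data = data

    def vip_customers(self, threshold: float) -> list:
        totals = self._data.order_total_by_customer()
        return sorted(c for c, total in totals.items() if total >= threshold)


source = SourceDataLayer([
    {"customer": "alice", "amount": 120},
    {"customer": "bob", "amount": 80},
    {"customer": "alice", "amount": 30},
])
app = ServiceLayer(DataMiddleLayer(source))
vips = app.vip_customers(100)
```

Because `ServiceLayer` only knows about `DataMiddleLayer`, a second application could reuse `order_total_by_customer` without redeveloping it, and the source layer could change without touching application code.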

Does poor data quality mean a data middle platform cannot be built? No!

Data quality problems left over from history often make us question the use and value of data. In 2018, the sentence Shi Kai heard most often when talking with enterprises was: "We are not yet at the stage of utilizing data, because the data quality (of our application systems) is too poor."

Every time he hears this, Shi Kai seems to hear another sentence in his mind: "It's not yet time to educate the child; the child is too young."

Poor data quality is no reason not to use data. On the contrary, data quality is poor precisely because the downstream work of using the data has not been done. Nor can data quality problems be solved completely in isolation from business scenarios; trying to do so produces data work that neither supports the business nor generates business value. ThoughtWorks therefore proposes a targeted approach: build for a specific use and a specific business need, and resolve the data quality issues along the way.

Shi Kai believes the root cause of data quality problems is the lack of overall data planning and data thinking when applications are first built. When the original process-oriented applications were built, the only concern was getting the process to run. Nobody analyzed where the application sat in the enterprise-wide data panorama (Data Landscape), and nobody optimized data storage and flow at the source so that the data could be aligned with other systems in definitions and vocabulary. The process problem should first be abstracted into a domain model, and the domain model then abstracted into a data model.

Challenges in building a data middle platform

In the early stage, the biggest challenge in building a data middle platform is to work out, at the business level, whether there are scenarios with clear business value, and to map the data panorama. The challenge is not only the technical build that comes later.

 

The challenges in building a data middle platform include:

  • Mapping business scenarios: working out how the data middle platform creates value for the business.

  • Setting construction priorities: there will probably be many large requirements, but a large, all-encompassing middle platform cannot be built in one go; requirements should be prioritized by business importance.

  • Data governance: data governance run as a standalone effort rarely succeeds. There should be a high-level data standard (the data asset catalog), from which common dimensions and common business models are extracted, and governance on that basis must be closely tied to business scenarios.

Both parties in a data middle platform project need strategic patience

The data middle platform exists to speed up the path from data to business value, but building it still takes time and involves a great deal of complicated work. So both the investors in a data middle platform and its builders need corresponding strategic patience.

  • Investors need to fully understand both the value and the limitations of data middle platform projects. Given current organizational structures and technical maturity, the data middle platform is still a technology platform that accelerates the process of generating business value. Business demands on data will not shrink because a middle platform exists, and the platform is not Doraemon: it cannot conjure up whatever business service is wanted. It remains a construction project that needs overall planning, agile iteration, and systematic evolution, so expectations have to be managed and some strategic patience is required.

  • Builders need to fully understand how complex a data middle platform is to build; do not rush, and do not expect to finish in one step. Shi Kai's advice is to start with a small middle platform built around specific, valuable business scenarios, and to avoid, as far as possible, long-cycle, all-encompassing, purely tool-oriented platforms that are divorced from any scenario.

A data middle platform can be small and beautiful

There are two key considerations when building a data middle platform.

First, the data middle platform must be aligned with business value. The most important thing in building one is not technology, nor whether data quality is good or bad, but data thinking and data culture. Data thinking means building a habit of looking at problems from the data's point of view; data culture means treating data and the business as a whole rather than treating data merely as a supporting tool. Thinking through what the business wants from data is the first step in building a data middle platform. Even if the details cannot be worked out yet, the thinking must be done; do not start building before it is clear.

Do not build a large, all-encompassing data platform and store every piece of data in it while the business scenarios are unclear, the priorities are unclear, and no value measurement system exists. Enterprises care about return on investment, and an all-encompassing data platform often ends up in an embarrassing position: it has a pile of features that look useful and ought to be usable but have no application scenario, and when a scenario does turn up, the features do not work out of the box and need heavy customization.

Second, a data middle platform should start small, from a small scenario.

A data middle platform is scenario-oriented rather than technology-oriented. It is closely tied to the customer's business, its stage of IT development, and its existing infrastructure, so it is hard to solve everything once and for all by buying a large, all-encompassing product.

The following figure explains the principle for building a data middle platform:

At the start, top-level design is needed: an overall middle platform plan driven by the business vision, and a complete panoramic blueprint of data innovation. This is the black-framed part on the left of the figure. The business vision drives the exploration of all business scenarios, from which the panoramic architecture and the supporting technology of the data middle platform are derived.

Implementation, however, starts from a specific business scenario. Begin with a high-value scenario, then cut vertically along that scenario to find the one or more data sets it touches in the data panorama, and land the small scenario on that data so that the value can be verified quickly. Think big and keep the whole picture connected to avoid data silos later, but cut in small, starting from a high-value scenario that can actually be delivered. Each time a scenario is completed, business value is delivered and the corresponding middle platform capability is built at the same time.

Overall, the principle is: "go horizontal in the design stage, cut vertically in the delivery stage."

The data middle platform team and technology selection

A data middle platform team typically includes the following roles:

  • Business expert team: understands the business, maps business scenarios, determines the correspondence between data assets and business scenarios, sets scenario priorities, and provides the basis for building the data middle platform.

  • Data engineering team: builds and maintains the data middle platform, including ETL and data acquisition, guarantees its performance and stability, and uses the platform's tools to collect, store, and process data.

  • Data analysis team: analyzes the value of data, explores scenarios, and produces more data services.

  • Data governance team: defines data standards and data security and privacy rules, and uses open-source data governance tools (such as Atlas and WhereHows) to solve data quality and security problems around business scenarios.

  • Intelligent algorithm team: provides algorithmic tools for data analysis and business exploration.

Together, this team forms a data production line: a data service factory that runs from data to business services. The factory has a production workshop (Data Pipeline), an R&D center (data lab), a management office (data governance), and a product center (data service store).

The data factory is a logical concept, not a large, all-encompassing product. Drawing on its data practice over the past few years, ThoughtWorks offers a reference architecture for selecting the components of a data factory. Many of the recommended architectures and components have appeared in past editions of the ThoughtWorks Technology Radar. The details are as follows:

Challenges the data middle platform poses for existing data teams

As mentioned, the data middle platform is the enterprise's Data API factory: it turns data into business value in a more efficient and more collaborative way and responds to the business faster. Data therefore moves closer to the business. For traditional enterprise data operations this is a significant change, and it poses a great challenge to existing data teams.

 1. Higher demands on data staff's understanding of the business

In a traditional enterprise, data work and business work are clearly divided, with clear boundaries. Business staff raise business requirements and break business problems down into well-defined data problems, and data engineers and data analysts then solve those well-defined problems.

With a data middle platform, things change. The platform is an enabling platform that accumulates and offers many data analysis tools and data services, so business staff without specialist data skills can do simple data analysis and generate business insight themselves. This means relatively simple, well-defined business problems will be handled by business staff, and the problems passed on to data professionals will be the more complex ones. Data staff therefore need a stronger grasp of the business: they must be able to understand it quickly in order to demonstrate their professionalism and added value.

 2. Higher demands on data staff's engineering skills

Data analysis used to be individual work. Each data scientist or data analyst was an independent unit: the business posed a problem, and they produced results with the tools and methods they knew. With a data middle platform, on the one hand they gain more weapons and tools for data analysis and can build on earlier work to improve efficiency and accuracy; on the other hand, they need to master more platform-based analysis tools, such as Jupyter Notebook, and they need the engineering ability to turn their analysis results into data services that are accumulated in the middle platform.

 3. Data teams need more of a business perspective

The data analysis team used to be a functional team that acted mostly as a data think tank. In most cases it sat relatively far from the business, let alone took responsibility for business results. With a data middle platform, data moves ever closer to the business and may even take a direct part in running it. The data team will gradually shift from think tank to front line, taking direct responsibility for individual data services that participate in the business and generate business value. This change of position requires the data team to have more of a business perspective, to focus more on business value, and to align its work directly with business goals.

The arrival of the data middle platform is therefore not just a matter of a technology platform; for an enterprise it is a systematic undertaking, and the processes, responsibilities, and division of labor around enterprise data all need to be adjusted accordingly to achieve the overall goal.

Data middle platform vs. data privacy

Data privacy and security are a very important issue for the data middle platform. Many readers will remember Ma Huateng's recent response to the "Tencent data middle platform debate". In last year's organizational restructuring, Tencent connected its technology internally but remained cautious about connecting its data. At the World Internet Conference in November 2018, Ma responded to the debate: "Tencent cannot copy the practice of many other companies and open up data directly and freely, because much of what is on our platform is communication and social behavior data between people. If that data could be freely shared across the company, or given to business units and external customers, the consequences would be disastrous. We need to be more cautious here and give priority to protecting personal information and data from the user's point of view." Many people interpreted this as Tencent not building a data middle platform. Shi Kai disagrees.

In his view, Tencent's response does not say it will not build a data middle platform; it stresses that more work is needed on data privacy. In fact, all data security and privacy requirements come from scenarios. Shi Kai believes that "data privacy cannot be viewed purely from the data's perspective; data privacy cannot be separated from scenarios." Managing data privacy purely at the data level, rather than at the level of business scenarios, leads to one of two problems: either the data is locked down so tightly that it cannot produce business value, or the privacy management has loopholes.

Shi Kai gives an example: take user transaction data. If it is not linked to users' basic information, the transaction data by itself poses no privacy risk to users, because it cannot be tied to any individual. The transaction data can therefore be desensitized and then analyzed and used.
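One common way to desensitize transaction records is to drop the direct identifiers and replace the user ID with a keyed hash. The sketch below shows that idea using Python's standard `hmac` and `hashlib` modules; the field names, the `SECRET_KEY` value, and the 16-character token length are assumptions chosen for the example, not something the article specifies.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored and rotated securely.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(user_id: str) -> str:
    """Map a user ID to a stable token that cannot be reversed without the key."""
    digest = hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def desensitize(txn: dict) -> dict:
    """Keep only the fields needed for analysis; drop direct identifiers."""
    return {
        "user": pseudonymize(txn["user_id"]),
        "amount": txn["amount"],
        "category": txn["category"],
    }


raw = {
    "user_id": "u1001",
    "name": "Zhang San",      # sample value, dropped by desensitize()
    "phone": "13800000000",   # sample value, dropped by desensitize()
    "amount": 59.9,
    "category": "books",
}
clean = desensitize(raw)
```

Because the same user always maps to the same token, per-user analysis still works on the desensitized records. Pseudonymization of this kind is not a guarantee of anonymity on its own: whether a record can be re-identified depends on what other data it can be joined with.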

On the other hand, discussing privacy without reference to scenarios can also cause potential security problems to be overlooked. Sometimes two pieces of data seem harmless when viewed apart from any scenario, yet an outsider who links the two can extract value from them. This is why it is essential at the outset to enumerate and analyze all scenarios as completely as possible.

In addition, permission settings, data classification and auditing, and database-level data desensitization are all ways to strengthen data security. A modern data middle platform must have a mechanism that monitors and records data calls, which in turn strengthens the protection of data security and privacy.
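A minimal sketch of classification-based permissions with an audit trail might look like the following. The three security levels, the `read_dataset` function, and the in-memory `audit_log` are hypothetical; they only illustrate that every data call, allowed or denied, leaves a record.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical data classification levels."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


audit_log = []  # every call is recorded, whether or not it is allowed


def read_dataset(user: str, clearance: Level, dataset: str, level: Level) -> str:
    allowed = clearance >= level
    audit_log.append({"user": user, "dataset": dataset, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    return f"contents of {dataset}"  # placeholder for the real data


ok = read_dataset("analyst", Level.INTERNAL, "sales_summary", Level.INTERNAL)

try:
    read_dataset("analyst", Level.INTERNAL, "chat_records", Level.CONFIDENTIAL)
    denied = False
except PermissionError:
    denied = True
```

Recording the denied attempt as well as the successful one is what makes the log useful for auditing.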

The next step for the data middle platform

Many companies in China and abroad have started investing in data middle platforms. The better-known ones include Alibaba, Huawei, Lenovo, Hainan Airlines, SAIC, and Shell.

In Shi Kai's view, the data middle platform is currently on the rise. It may not be called a data middle platform in the future, but it will become an essential basic component of every enterprise.

The world is moving from informatization to digitalization. Informatization means most of the work is done in the physical world and a small part of the problems is then solved with computers in the digital world. Digitalization means moving people from the physical world into the digital world. From this perspective, the data middle platform will become the place where an enterprise's business in the physical world is reproduced in the digital world.

The data middle platform was originally meant to separate computing from storage; in the narrow sense, the true core of a data middle platform does not include storage. In the current situation, however, the data middle platform in the broad sense will for some time still include storage components such as data warehouses and data lakes, and the "data factory" concept may fit the present stage better. As data middle platforms develop, the data lake may well no longer be needed in the future.

Finally, Shi Kai also mentioned the other middle platform in Alibaba's middle platform strategy: the "business middle platform". He said: "The business middle platform today leans toward real-time transactions and is distilled from the business. The data middle platform today leans toward analysis, insight, and decision-making, and provides T+N and T+0 data services to the business. Going forward, the data middle platform will gradually integrate more closely with the business middle platform. As computing power keeps growing and microservice architecture develops further, the business middle platform and the data middle platform may merge in the future."

 

—   The End    —

 


Origin blog.csdn.net/sinat_26811377/article/details/104570933