How to "build" a big data modeling platform with open source components

A few words up front: the blogger is a "little pig" who moved from hands-on development into training. The nickname comes from "Pumbaa" (Peng Peng) in The Lion King, who always faces everything around him with an optimistic, positive attitude. My technical path has run from Java full-stack engineer all the way into big data development and data mining, and I have had some modest success. I would like to share what I have picked up along the way, and I hope it helps you on your own learning road. Through this attempt I also want to build up a complete technical library: any pitfalls, errors, and points of attention related to the technical topics of an article will be listed at the end, and contributions through any channel are welcome.

  • If you find errors in the article, please point them out and I will correct them promptly.
  • If there is anything you would like to discuss or learn about, contact me at: [email protected].
  • The writing style varies from column to column, and each column is self-contained; corrections for any shortcomings are welcome.

How to "build" a big data modeling platform with open source components

Keywords in this article: open source components, big data modeling, project architecture, technology stack

1. How to read this article

At last I get to post a casual, chatty piece... oh no, I mean a technical essay. There's an official event? Then I have to join in! Ahem~ enough gossip, here are the ground rules:

  1. This article is not clickbait. It introduces the technologies involved in building a web-based big data modeling platform and the relationships between the components. This is a real enterprise project that has been commercialized, and as one of the core developers I witnessed the growth of the whole platform.
  2. Because of limited space, no specific project code is included; instead, the relationships between the components, the business scenarios, and the data processing flow are explained as fully as possible, with related big data knowledge interspersed along the way.
  3. Having spent years in the training field, I ask your forgiveness for any rambling or shallowness. If you are still a learner, or a developer who has just stepped into big data, this article is worth bookmarking.

2. Project background

The project dates back three or four years. While Alibaba's Data Plus platform was still in free trial, the big data modeling platform we built had already been commercialized and was taking orders; we had reached cooperation agreements with Huawei and China Unicom and entered Unicom's WoChuang space. First, a picture to give you a feel for it:

At this point some passerby is bound to stand up and say: you only managed to commercialize because the big players hadn't yet built a product to crush you~ To which I can only reply: emmm, you... are right!
In reality, though, many factors decide whether a product can be commercialized. The big players have obvious advantages in many respects, but that does not mean other products have no opportunities. Beyond technical strength, team size, project funding, product positioning, and the market environment matter just as much.
When I took over, the project was already half-finished. The so-called big data modeling platform is really positioned as a general-purpose product; it is more an integration of functions, and can be considered fairly standard big data development. The team consisted mainly of developers, plus of course data analysts.

The core function of the product is to realize the complete pipeline of data collection, data source management, data cleaning, statistical analysis, machine learning, and data visualization. The difficulty lies in forming a data flow that is controllable and easy to manage. Even years later I still feel that, although the project involved no complex scenarios or sophisticated data analysis optimizations, it was enormously helpful to me: it let me truly understand and operate every link of data analysis, and you could say it opened up my "Ren and Du meridians". Whatever you do afterwards is really just the optimization of one link, or a fixed data flow in a specific scenario. After all, once the general case has been built, is computing some fixed metric or training some model still a problem?
Whenever this project came up in later interviews, the other side would say: this young man landed a good project. Of course the project itself is one part; my own summary of it matters too. Around graduation I forced myself to understand the project thoroughly, not only the technology but also the product, the design, and so on, and I turned it into my master's thesis, after obtaining the software copyright of course (I later found out that for a graduation thesis this didn't really matter).

That ends the background; the main feature begins below. If this is indeed what you were looking for, please like, follow, and show a little support after reading. Bookmarks and thoughts in the comment section are also welcome.

3. Meet the technology stack

To make it easier to introduce the scenarios and the technologies that handle them, the discussion is split by functional module. First, the complete project architecture diagram:

Looking at the diagram now, the architecture is a bit dated, but I think history should be presented as it was. Back then the "big front-end" wave had only just taken off, and the project had already been under development for a while before I took over, so this is what it looked like at the time, and it represents that period of hard work.
Thinking back, there were really few big data learning materials or project case studies then; most of what existed was PPT bragging, and the core big data technologies inside the big companies were out of reach for ordinary developers, so we were essentially feeling our way forward.

1. Functional module framework

Given the limited space, only part of the platform's functionality is introduced. If you are interested in the techniques behind particular steps, you can scan the QR code at the bottom of the article to join the WeChat group (an official WeChat group CSDN provides for content partners), where live interactions with readers are held regularly.

In a real project, being an enterprise-level application, there are inevitably permission management features for departments, employees, and so on. This article focuses only on the big data processing flow, so those less relevant parts are omitted.

2. Data source management

In the data source management part, all data to be analyzed is stored on HDFS. Since the platform mainly serves statistical analysis, everything processed is structured, offline data: it can either be pulled from a relational database or uploaded by the user, and once loaded it exists as a Hive table. The platform records only the data source's name, its owner, and the corresponding Hive table; subsequent data flows never modify the original data, so the same copy of data may be used in multiple flows, and an existing Hive table can be declared as multiple data sources. In effect, multiple association records are created, and everything shown to the user in the modeling platform is a data source node.
When pulling from a relational database, the Sqoop component is used: the database connection parameters filled in by the user are spliced into a complete command and executed on the server. For data files uploaded by users, the column names, column types, and column and row separators must be specified, and a Hive table with the corresponding structure is created automatically from that information so the data can be read correctly after import.
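As an illustration only, here is a minimal Java sketch of how such a command and DDL might be spliced together; the class, method, table, and parameter names are all hypothetical, not the project's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class IngestHelper {

    // Hypothetical sketch: splice the user's connection parameters into a
    // "sqoop import" command that lands the source table in Hive.
    public static List<String> sqoopImportCommand(String jdbcUrl, String user, String password,
                                                  String srcTable, String hiveTable) {
        return Arrays.asList(
                "sqoop", "import",
                "--connect", jdbcUrl,          // e.g. jdbc:mysql://host:3306/sales
                "--username", user,
                "--password", password,
                "--table", srcTable,
                "--hive-import",               // write directly into a Hive table
                "--hive-table", hiveTable,
                "--num-mappers", "1");
    }

    // Hypothetical sketch: build a CREATE TABLE statement from the column names,
    // column types, and field separator the user declared for an uploaded file.
    public static String createTableDdl(String hiveTable, String[] colNames,
                                        String[] colTypes, String fieldSep) {
        StringBuilder cols = new StringBuilder();
        for (int i = 0; i < colNames.length; i++) {
            if (i > 0) cols.append(", ");
            cols.append(colNames[i]).append(' ').append(colTypes[i]);
        }
        return "CREATE TABLE " + hiveTable + " (" + cols + ") "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '" + fieldSep + "' "
                + "LINES TERMINATED BY '\\n' STORED AS TEXTFILE";
    }
}
```

The spliced command would then be run on the server (for example via ProcessBuilder), and the DDL executed against Hive.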

3. Data processing flow

For the modeling platform, the most basic capability is letting users define their own data flows, whether in an enterprise setting or in university teaching. Our approach was to encapsulate common statistical analysis functions and complete machine learning libraries into functional nodes (implemented mainly with Hive QL, Spark MLlib, and RHive); each node has its own configuration parameters, and all the user has to do is drag, combine, configure, and run.
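For instance, a clustering node could be a thin wrapper over Spark MLlib. The sketch below is only illustrative, with assumed class, table, and parameter names, and is not the platform's actual implementation:

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KMeansNode {

    // Hypothetical sketch of one "functional node": read the upstream Hive table,
    // train with the user's configured parameters, write a result table for the next node.
    public static void run(String inputTable, String outputTable, String[] featureCols, int k) {
        SparkSession spark = SparkSession.builder()
                .appName("kmeans-node")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> input = spark.table(inputTable);

        // Assemble the numeric feature columns chosen by the user into one vector column
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(featureCols)
                .setOutputCol("features");
        Dataset<Row> assembled = assembler.transform(input);

        KMeansModel model = new KMeans().setK(k).setFeaturesCol("features").fit(assembled);

        // The prediction column becomes part of the node's result table
        model.transform(assembled)
                .drop("features")
                .write()
                .saveAsTable(outputTable);

        spark.stop();
    }
}
```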
For the front-end flow designer UI we chose GooFlow. A data flow can be saved and modified; in the database it is really one big JSON document that records the line directions, node configurations, and so on, so that the canvas can be restored when the flow is reopened. The configuration of every node in the flow has to be saved along with it.
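Purely for illustration, a saved flow might be handled along these lines; the JSON shape shown here is a simplified stand-in, not GooFlow's exact schema:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Iterator;

public class FlowDefinition {

    // Simplified stand-in for a saved flow document; GooFlow's real schema differs.
    static final String SAMPLE = "{"
            + "\"nodes\": {"
            + "  \"n1\": {\"name\": \"data source\", \"type\": \"source\"},"
            + "  \"n2\": {\"name\": \"clean nulls\", \"type\": \"clean\"},"
            + "  \"n3\": {\"name\": \"kmeans\", \"type\": \"ml\", \"conf\": {\"k\": 3}}"
            + "},"
            + "\"lines\": {"
            + "  \"l1\": {\"from\": \"n1\", \"to\": \"n2\"},"
            + "  \"l2\": {\"from\": \"n2\", \"to\": \"n3\"}"
            + "}}";

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode flow = mapper.readTree(SAMPLE);

        // Walk the lines to recover the order in which nodes feed each other
        Iterator<JsonNode> lines = flow.get("lines").elements();
        while (lines.hasNext()) {
            JsonNode line = lines.next();
            String from = line.get("from").asText();
            String to = line.get("to").asText();
            System.out.println(flow.get("nodes").get(from).get("name").asText()
                    + " -> " + flow.get("nodes").get(to).get("name").asText());
        }
    }
}
```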

Once a flow is started, each step generates a result table that serves as the data source for the next operation. The final run produces a result table as well, which can be displayed directly as a table, downloaded, or shown through a drag-and-drop visualization component after configuration.
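As a rough illustration of "each step produces a result table", a statistics node could materialize its output with a CREATE TABLE ... AS SELECT executed over a Hive connection; all names below are made up:

```java
import java.sql.Connection;
import java.sql.Statement;

public class HiveQlNode {

    // Hypothetical sketch: a statistics node writes its output as a new Hive table,
    // which then becomes the data source for the next node in the flow.
    public static void run(Connection hive, String inputTable, String outputTable,
                           String groupCol, String valueCol) throws Exception {
        String ctas = String.format(
                "CREATE TABLE %s AS SELECT %s, AVG(%s) AS avg_value FROM %s GROUP BY %s",
                outputTable, groupCol, valueCol, inputTable, groupCol);
        try (Statement stmt = hive.createStatement()) {
            stmt.execute(ctas);
        }
    }
}
```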
Note that GooFlow is a component that requires a paid license; you can also choose other components to replace it. What circulates publicly on the Internet is only a trial version, or pirated copies bundled with mining programs, so if you want to use it, contact the author directly.

4. Other functional modules

The remaining functions are more conventional web application development. For example, the visualization part is a wrapper around the Echarts option configuration, letting users set up chart effects through the interface; the simpler way to query the data to be displayed from Hive is HiveJDBC.
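A minimal sketch of the HiveJDBC path might look like this (connection details, table, and column names are placeholders, not the project's configuration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class HiveQueryForChart {

    // Hypothetical sketch: pull the rows of a result table over HiveJDBC so the
    // front end can feed them into an Echarts option.
    public static List<Object[]> fetch(String table) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        List<Object[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT category, avg_value FROM " + table)) {
            while (rs.next()) {
                rows.add(new Object[]{rs.getString(1), rs.getDouble(2)});
            }
        }
        return rows;
    }
}
```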
Data interaction between front and back end uses a combination of JSP tags and Ajax, which is rather dated. The persistence layer uses Hibernate; although I personally prefer MyBatis, a refactor was not something I could pull off on my own, so I simply got on with it.
Since this is an article about the technology stack, I have not used a great deal of text; I think architecture diagrams and flowcharts make things clearer and more direct. If there is anything you would like to discuss, leave a message in the comment section~

