An Alibaba-alumnus entrepreneur shares his company's BI selection journey: I have hit the pitfalls of both in-house development and open source

1. Business background and requirements review

Our annual plan called for rolling out BI this year, but the epidemic pushed it back again and again, and in the meantime demand exploded: all kinds of business questions, all kinds of data, all kinds of analysis. We realized that our current approach could not carry the load, so I opened a project and started researching BI options.

I have organized my research notes into this article to share; it represents only my personal views.

The company's current data requirements are mainly divided into two categories:

  • Ad hoc requirements: the business suddenly wants to see the effect of a particular campaign (metric definitions may be changed or added at any time).
  • Fixed requirements: data that is reviewed every week and every month (metric definitions are very clear).

For these two types of needs, our current solution is:

  • For ad hoc requirements: write HQL, run it against Hive, export the result to Excel, and send it to the requester.
  • For fixed requirements: write scripts that run the calculation on Hive, write the results into the corresponding database, and then summarize and display them with a third-party open source BI tool.
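As a rough illustration of the fixed-requirement flow above, here is a minimal sketch. It assumes an in-memory SQLite database standing in for both Hive and the result database; the `orders` and `weekly_orders` tables and the `run_weekly_report` function are hypothetical names made up for this example, not our actual scripts.

```python
import sqlite3


def run_weekly_report(source: sqlite3.Connection, report_db: sqlite3.Connection) -> int:
    """Aggregate orders per week in the source and store the result for the BI tool."""
    rows = source.execute(
        "SELECT strftime('%Y-%W', order_date) AS week, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY week ORDER BY week"
    ).fetchall()
    report_db.execute(
        "CREATE TABLE IF NOT EXISTS weekly_orders "
        "(week TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL)"
    )
    # INSERT OR REPLACE keeps one row per week even if the script is re-run.
    report_db.executemany("INSERT OR REPLACE INTO weekly_orders VALUES (?, ?, ?)", rows)
    report_db.commit()
    return len(rows)


# Hypothetical sample data: SQLite stands in for Hive here.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2020-10-05", 10.0), ("2020-10-06", 5.0), ("2020-10-13", 7.5)],
)
report_db = sqlite3.connect(":memory:")
weeks_written = run_weekly_report(source, report_db)
weekly_rows = report_db.execute("SELECT * FROM weekly_orders ORDER BY week").fetchall()
```

Every new report needs another script like this one, which is exactly where the development cost comes from.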

This approach is simple and clear, but it has obvious problems:

  • High development cost: every requirement, ad hoc or long-term, needs custom development, which ties up a lot of our manpower.
  • Inflexible use: a report can only be displayed; it has no analysis function. To analyze the data you have to copy it into Excel and process it there, and our data users may not have that skill.
  • Wasted resources: reports developed by different people often repeat the same calculations.
  • A gripe about the experience: queries in Hue are very slow. Even a simple SELECT FROM query takes more than a minute (the underlying Tez engine is too slow!)

Given this, what we really want is a multi-dimensional analysis platform that lets the business side access the data directly, preferably without SQL, because most of our business colleagues do not know SQL (we have run SQL training, but there is still a barrier to entry).

With that in mind, we surveyed BI products.

2. Product trials and analysis

Based on our internal discussions and the BI tools recommended by peers in several groups, after a round of screening we selected the following products for in-depth evaluation: Superset, Metabase, and FineBI.

1. Superset

Overall, after trying it myself, my impressions are as follows:

  • Installation is painful, as painful as installing MySQL.
  • Only single-table access is supported; table joins are not. Computation speed depends on the speed of your database.
  • The visualization options are very rich, including several geographic visualization schemes based on latitude and longitude.
  • Permission control is very fine-grained, down to each function button.
  • Unfortunately, the biggest problem is that the experience for business analysts is not very good: building a visualization means setting chart-specific parameters for each chart type, and permission control is also very complicated.

The details of each aspect are as follows:

1) Data source and data management

  • Database support is very rich: Druid, Hive, Impala, Kylin, Spark SQL, BigQuery, Pinot, ClickHouse, Google Sheets, Greenplum, IBM Db2, MySQL, Oracle, PostgreSQL, Presto, Snowflake, SQLite, SQL Server, Teradata, Vertica. Uploading local CSV files is also supported.
  • Data table model management: for each field you can set its type and whether it is a dimension, filterable, or usable as the time column; you can also define derived fields and statistical metrics.
  • Tables used in charts have to be added from the database one by one (the SQL toolbox can see all of them), which is not very convenient.
  • Deep support for Druid.

2) Single charts & dashboards

  • Single-chart workflow: select a data source (table or view) -> select a chart type -> set chart parameters (metrics/dimensions/filters). The data source can only be chosen from the table list page and cannot be changed once you are in the analysis page. Switching chart types means filling in the parameters again for the new chart type, which is inconvenient for self-service analysis.
  • The range of visualization types is very rich, with 48 visualization schemes.
  • Dashboard filtering is very weak. There is not even a basic date filter component; filtering is done through a filter component built as a single chart, which can only be made for a single table and then applied to the dashboard. This is also very inconvenient.
  • Simple chart drill-down and exploration is provided (it jumps straight into the single chart), but chart linkage is not supported.
  • Dashboards cannot be copied or cloned directly. To copy one, you have to select and arrange the single charts all over again.
  • Dashboards support automatic refresh, with a minimum refresh interval of 10 seconds.


3) SQL query

  • Autocompletes field and table names.
  • Supports cross-database join queries.
  • A multi-tab environment for working on several queries at once.
  • To visualize a query result, you have to save it as a view and then jump to the chart page, and you also have to grant permissions on the view (the process is very inconvenient).
  • Query history is searchable.
  • Supports templating with the Jinja template language, which allows macros to be used in SQL code.

4) Permission management

  • Permissions are set on roles, and users are assigned roles.
  • Permission granularity is very fine: functional permissions are supported (modifying a table can be split into delete and create operations), as are permissions on menus, data sources, data tables, fields, charts, and dashboards.
  • Configuring permissions is very complicated and tedious.
  • Row-level data permissions are not supported.
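To make the role-based model concrete, here is a minimal sketch of the idea: permissions hang on roles, and users only hold roles. This is not Superset's actual implementation; the `Role` and `User` classes and the permission strings are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Role:
    name: str
    # Each permission is an (action, object) pair, e.g. ("delete", "table:orders").
    permissions: Set[Tuple[str, str]] = field(default_factory=set)


@dataclass
class User:
    name: str
    roles: List[Role] = field(default_factory=list)

    def can(self, action: str, obj: str) -> bool:
        """A user may act if any of their roles carries the permission."""
        return any((action, obj) in role.permissions for role in self.roles)


# Fine-grained functional permissions: "create" and "delete" are granted separately.
analyst = Role("analyst", {("read", "table:orders"), ("read", "dashboard:sales")})
editor = Role("editor", {("read", "table:orders"), ("create", "table:orders")})

alice = User("alice", [analyst])
bob = User("bob", [analyst, editor])
```

With this shape, `bob.can("create", "table:orders")` is true while `bob.can("delete", "table:orders")` is false. Every (action, object) pair has to be granted explicitly, which hints at why a very fine-grained scheme becomes tedious to configure.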

5) Secondary development

  • Technology stack: Python + Flask + React + Redux + SQLAlchemy
  • Originally an open source project from Airbnb, so a large company team stands behind its maintenance, version updates, bug fixes, and secondary development.
  • Supports a RESTful API

2. Metabase (open source, GitHub stars: 15,670)

Overall, after trying it myself, the highlights are as follows:

  • The interaction is very friendly to business users. A global search across dashboards and single charts creates an "ask a question" experience: you ask through the search box and the system gives you the answer. The whole product interface is simple and clear.
  • Building a single chart is very simple. It is data-centric: you pick a chart type for the data (types that do not apply are automatically grayed out), and a single-chart analysis takes about half a minute.
  • The biggest shortcoming is that permission management is too weak: there is only coarse-grained editable/visible control, with no separate control over, for example, whether a table can be deleted.

The details of each aspect are as follows:

1) Data source and data management

  • Database support is relatively limited: Postgres, MySQL, SQL Server, Redshift, SQLite, Google BigQuery, H2, Oracle, Vertica, Snowflake, MongoDB, Druid, Presto, SparkSQL
    (Special note: the Druid version is 2.0, so SQL queries are not supported, which greatly reduces its power; in addition, Hive and Kylin are not supported.)
  • A unified entry point for data model management: after adding tables/views, you set dimension and measure fields (this part is very detailed, and the available field types are extensive).
  • Provides scheduled tasks and database synchronization (hourly).
  • Self-service profiling of table fields, smart exploration, automatically generated dashboards, and automatic display of related data distributions (a cool extra).

2) Single charts & dashboards

  • The single-chart workflow is very simple: select a data source -> select filters -> select metrics -> select grouping dimensions -> select a visualization type.
  • The visualization types cover only basic needs: 14 visualization schemes (including funnels, numbers with trend, and maps).
  • Some charts allow detailed control, such as conditional row colors in tables, reordering fields, mini color bars, and prefix/suffix settings.
  • Basic filters are supported, including date ranges (a filter is linked to a field in each single chart).
  • Simple chart drill-down is provided, but chart linkage is not supported.
  • Existing dashboards can be copied with one click.
  • Automatic data refresh, with a minimum interval of 1 minute.
  • Sharing: public links, public embeds (for example in a blog page), and embedding in applications.
  • Pulses send data to Slack (a chat tool used outside China) or email on a schedule.
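The single-chart workflow at the top of this list (data source -> filters -> metrics -> grouping dimensions) maps naturally onto one aggregate query. The sketch below is only my guess at the general idea, not how Metabase actually builds queries; `build_question` is a hypothetical helper, and SQLite with a made-up `orders` table stands in for a real data source.

```python
import sqlite3
from typing import Optional


def build_question(table: str, metric: str, group_by: str, where: Optional[str] = None) -> str:
    """Turn the point-and-click choices into one aggregate SQL statement.

    The arguments are interpolated directly, so in this sketch they must come
    from trusted menu choices, never from free-form user input.
    """
    sql = f"SELECT {group_by}, {metric} AS value FROM {table}"
    if where:
        sql += f" WHERE {where}"
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return sql


# Hypothetical data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (channel TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("app", 10.0, "paid"), ("web", 4.0, "paid"), ("app", 6.0, "paid"), ("web", 9.0, "refunded")],
)

# Data source: orders; filter: paid orders; metric: total amount; dimension: channel.
sql = build_question("orders", "SUM(amount)", "channel", "status = 'paid'")
result = conn.execute(sql).fetchall()
```

The chart type is chosen last because it only changes how `result` is drawn, not the query itself, which is probably why switching chart types is so cheap in this workflow.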

3) SQL query

  • Autocompletes field and table names.
  • SQL query results can be switched directly between visualization types.
  • Cross-database join queries are not supported.
  • Variables in native queries let filter components or URL parameters dynamically replace values in the query.
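Metabase writes such variables as `{{variable_name}}` inside the SQL text. To show what the substitution amounts to, here is a minimal sketch that turns those placeholders into bound parameters. It is my own illustration, not Metabase's implementation; `bind_variables`, the `orders` table, and the `url_params` dictionary are all hypothetical.

```python
import re
import sqlite3
from typing import Dict, Tuple

VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def bind_variables(sql: str, values: Dict[str, object]) -> Tuple[str, Dict[str, object]]:
    """Rewrite {{name}} placeholders as :name parameters and collect their values."""
    names = VARIABLE.findall(sql)
    missing = [name for name in names if name not in values]
    if missing:
        raise KeyError(f"no value supplied for: {missing}")
    # Values are passed as bound parameters, never pasted into the SQL string.
    bound_sql = VARIABLE.sub(r":\1", sql)
    return bound_sql, {name: values[name] for name in names}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Hangzhou", 10.0), ("Hangzhou", 3.0), ("Beijing", 8.0)],
)

query = "SELECT COUNT(*) FROM orders WHERE city = {{city}} AND amount >= {{min_amount}}"
url_params = {"city": "Hangzhou", "min_amount": 5}  # e.g. parsed from ?city=Hangzhou&min_amount=5
bound_sql, params = bind_variables(query, url_params)
count = conn.execute(bound_sql, params).fetchone()[0]
```

Binding the values instead of concatenating strings is a design choice of this sketch: it keeps a value arriving through a URL from being interpreted as SQL.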

4) Permission management

  • Permissions are set on roles, and users are assigned roles.
  • Permission settings are very weak: you can only set whether something is accessible (and accessible data may be deleted directly).
  • The objects that permissions apply to are fairly shallow: only data sources, data tables, charts, and collections of analyses, not data rows.
  • Individual fields can be set visible or invisible (for sensitive fields), but this cannot be managed per role.

5) Secondary development

  • Technology stack: Clojure + React + Redux
  • Provides complete API documentation; much secondary development can be done with its rich API and docs.

3. FineBI (commercial)

  • Data analysis in 5 minutes with zero code: click and drag to complete an analysis, and build a data report within half an hour.
  • Meets diverse analysis needs: data processing, exploratory OLAP analysis, and self-service data analysis.
  • Self-service datasets are the core feature: ordinary business users can filter, split, sort, and aggregate data to get the results they want flexibly and on their own.
  • One-click data sharing and control: fine-grained, accurate data permissions; data and reports can be shared company-wide, and shared results update in real time.
  • Supports analysis of large data volumes: columnar storage, efficient computation, and strong data compression enable fast analysis on the front end.

Overall, after trying it myself, the highlights are as follows:

  • Getting started takes some adjustment to the workflow: first configure the data, then process it into self-service datasets, then build charts and dashboards. This is a bit confusing at first.
  • Building visualizations is very simple. The interaction is similar to Tableau: drag data fields into the dimension box and the visualization appears immediately; dashboards are then assembled from these visualization components.
  • There is linked drill-down, and it is smart: components are linked automatically on their common fields.
  • Data processing is very powerful: self-service datasets offer many processing functions, including grouping and aggregation, modifying fields, merging tables, and so on.
  • The biggest highlight is that permission control is very fine-grained and practical. It can be broken down to raw data sources, processed datasets, and dashboards, and it manages users by role, position, department, and so on, with separate administrator and user permissions. It feels a bit like an OA system, and it is the most powerful of the three.

The details of each aspect are as follows:

1) Data source and data management

  • Database support is very rich (the original post shows the full list in a figure).
  • A business package feature groups data sources, for example by department or by business need.
  • Visual management of data tables: data preview, data lineage analysis, relationship views, and so on.

2) Single charts & dashboards

  • Visualization workflow: connect a data source (a database or an imported Excel file) -> process the data in a self-service dataset -> build chart components on the data -> assemble a visual dashboard.
  • There are many chart types: more than 50 basic charts, and by overlaying charts you can produce more than 100 styles.
  • There are three table types: grouped, cross, and detail tables. Very complex tables are not supported; for those there is a separate reporting product, FineReport.
  • Dashboards have filter components for time, text, values, and queries, as well as custom filter conditions, so filtering is relatively powerful.
  • Drill-down, linkage, and jump functions are provided; you can interact directly on a dashboard and across dashboards, and chart linkage is supported.
  • Components and dashboards can be reused directly by copying them.
  • Feature-rich: while building a visualization you can filter, aggregate, and sort the data a second time, and write formula calculations on it.
  • Dashboards support scheduled refresh for a single dashboard, multiple dashboards, or a single component. The refresh frequency has to be set by writing JS.
  • Sharing: a dashboard can be shared through a public link, mounted in FineBI's decision-making system, or embedded in a web page.
  • Dashboards can also be shared directly with other users, in addition to creating a public link or mounting them.

3) SQL query

Connects directly to the database through JDBC.

  • Supports cross-database join queries.
  • Supports SQL datasets, so data can be fetched by writing SQL.
  • Supports visual data preview: after adding or updating a table in a business package, a preview area in the business package editor shows the table data.
  • Visual table relationships and data lineage analysis.
  • Provides two calculation modes: real-time data and extracted data.

4) Permission management

  • Permissions can be assigned by role; permission holders can be departments, roles, positions, or users.
  • Permissions cover personnel management, directories, system management, data connections, data (data tables), sharing, scheduled tasks, and more, so the settings are rich.
  • The objects that permissions apply to go deeper: down to the component or data row level.
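Row-level permission means that two users opening the same table see different rows. The sketch below shows one common way to get that effect, by attaching a filter condition to each role. It is only an assumption about the general technique, not FineBI's implementation; `ROW_FILTERS`, `query_for_role`, and the `sales` table are hypothetical.

```python
import sqlite3

# Hypothetical row-level rules: each role maps to a WHERE condition plus its
# bound values; None means the role may see every row.
ROW_FILTERS = {
    "east_sales": ("region = ?", ("east",)),
    "admin": None,
}


def query_for_role(conn: sqlite3.Connection, role: str, table: str) -> list:
    """Return only the rows of `table` that the given role is allowed to see."""
    if role not in ROW_FILTERS:
        raise PermissionError(f"role {role!r} has no access to {table}")
    rule = ROW_FILTERS[role]
    # The table name is interpolated, so it must come from trusted configuration.
    sql = f"SELECT * FROM {table}"
    if rule is None:
        return conn.execute(f"{sql} ORDER BY rowid").fetchall()
    condition, values = rule
    return conn.execute(f"{sql} WHERE {condition} ORDER BY rowid", values).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 5.0)],
)

east_rows = query_for_role(conn, "east_sales", "sales")
all_rows = query_for_role(conn, "admin", "sales")
```

Because the filter is applied where the query is built, every chart and dashboard on top of the table inherits it automatically.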

5) Secondary development

  • Developed in pure Java; for users it is basically a zero-code tool.
  • Supports some secondary development, with API documentation.

Summary

Finally, taking everything into account, this BI selection came down to Metabase and FineBI. The former is open source and the latter is commercial.

Open source has its drawbacks: the permission features are too weak, there are no platform operations and maintenance features, and the interface is all in English. Some of our departments with development skills could live with that, but the company may roll BI out more widely in the future, so the tool has to be easy enough to learn that the business side will accept it, and product stability needs to be backed by technical support and service.

In any case we would rather buy a mature platform, so within an acceptable budget we lean toward the latter. It meets about 90% of our functional requirements; actual performance still has to be tested. The price ranges from about 200,000 to several hundred thousand, depending on concurrency, service items, and so on.

Origin blog.csdn.net/yuanziok/article/details/109120955