Great tool for exploring data visualization

Editor's note: This article was contributed by a Grape City customer, the front-end technology team at Zhengcai Cloud. Building on world-leading cloud computing, big data, artificial intelligence, and other digital technologies, Zhengcai Cloud built the country's first government procurement cloud service platform, the Zhengcai Cloud Platform. Today it is a cross-regional, cross-level, cross-field integrated procurement cloud service platform with the broadest service coverage, the largest number of users, and the most active transactions in the industry.

Foreword

Data visualization has three branches: scientific visualization, information visualization, and visual analytics.

1. Scientific visualization is primarily concerned with the visualization of three-dimensional phenomena, such as various systems in architecture, meteorology, medicine, or biology. The focus is on realistic rendering of volumes, surfaces, lights, etc., perhaps even including some kind of dynamic component.

2. Information visualization combines data and design into a picture. It is a form of data representation that helps individuals or organizations convey information to an audience quickly and effectively.

3. Visual analytics is defined as the science of analytical reasoning supported by interactive visual interfaces. It integrates graphics, data mining, and human-computer interaction so that human intelligence and machine intelligence complement and reinforce each other.

Visual reports are the most important part of visual analytics: they can display large amounts of data quickly and support flexible data operations, including filtering, association, linkage, drill-down, copy, query, replacement, and style settings. Conditional formatting adds multi-color scales, icon sets, data bars, and duplicate-value highlighting, alongside formula insertion, cross-table linkage, and so on. SpreadJS stands out most for visual analysis reports, so the rest of this article discusses only the role of SpreadJS in visual analysis.

Difficulties in report visualization

The Internet e-commerce service industry usually processes a large amount of business information and user information. Customer service and data analysts are the main users of reports.

Customer service staff handle a large volume of work every day: filling in work orders, registering customer complaints, importing raw data from third-party platforms, statistical summaries, review and approval, electronic signing, distribution, and so on. Most of this information is carried in Excel, so the server has to process a large number of documents daily. Excel documents bring several problems: their data is hard to extract and store, updated templates cannot be pushed to operators immediately, and the documents are hard to integrate into web pages.

Data analysts need to pull the data together, calculate figures such as sales per product brand, maximum, minimum, and average, pick out the valuable data, and report it to management.

From these scenarios, the difficulties of report visualization can be summarized as follows:

1. Concurrency

The company has a large number of customer service staff, and thousands of people work intensively online at the same time. Business cycles are short and data volumes are large, so concurrency puts a heavy load on the server. Apache POI can be used on the back end to extract and modify Excel data and to evaluate the formulas in it, but this runs into two performance bottlenecks:

1) Files have to be uploaded and downloaded frequently, which puts server bandwidth under great pressure;

2) All Excel parsing and extraction happens on the server, and the frequent I/O overwhelms it.

These two bottlenecks are hard to break through under the current architecture, which makes this one of the most challenging requirements of the refactoring. Simply adding server hardware is one option, but it does not solve the other problems and adds to the operations burden.

2. High requirements for Excel operation and compatibility

If users cannot pick up the new system quickly, the training cost will be unaffordable given the number of users in this project. Existing Excel report templates must also be importable directly; redeveloping or redesigning every Excel report from scratch is not acceptable.

3. The report format is flexible and changeable

Report templates vary endlessly across business scenarios, so it is especially important that operators can design and fill in reports on the page without R&D involvement.

4. Support formula calculation

Because the system covers modules such as products, orders, cost accounting, and financial statistics, it places high demands on both the range and the performance of formulas.

5. Data Documentation in Workflow

In the previous system's workflows, whenever an Excel report was involved, either the server first merged the data into the Excel template, or the system located the Excel file on the file server by path and then passed it to the next step. Some newer business modules could only transfer files by email.

This process generates a large number of files, puts heavy pressure on the file server, and forces the back end to split and maintain the data in batches on a regular basis. The upgraded system needs to solve this problem.

How to choose

The first step in selection is finding out what is available. Many products on the market can be integrated into a system and support this kind of online spreadsheet editing. I roughly divide them into two categories.

1. Cloud document type products

There are many such products, for example WPS, Shimo Docs, and Office Online. They are highly polished and already provide almost every feature, including online collaboration; some even support a degree of secondary development and private deployment. The problem is that such products are usually fairly closed, customizing them is relatively difficult, and they are not lightweight enough. Most are licensed by time, concurrency, number of users, and so on, which is expensive and not a good fit for our needs.

2. Control type products

Controls such as LuckySheet, Handsontable, and SpreadJS are all pure front-end spreadsheet controls, and all support Excel-like features and JSON data binding.

LuckySheet is MIT-licensed open-source software from China that can be used commercially. But when I evaluated it, it had only been released for a month or two, and it did not have the backing of a major company the way React does, so we could not use it in production projects. A year on, community channels such as QQ groups and forums have been launched one after another, but they are still thin.

Handsontable is a commercial table control from overseas. Secondary development is said to have many pitfalls, but the biggest problem for us is that it has no Chinese support team.

SpreadJS is a commercial Excel-like spreadsheet control from Grape City. Interestingly, in the comments under the LuckySheet thread on V2EX, LuckySheet's own author called SpreadJS the industry benchmark. It can import most Excel features, including formulas, charts, styles, and conditional formatting (macros are not supported). Most surprising of all, its interface is a complete Excel-style interface written entirely in pure JS, with JSON used for templates and data exchange. SpreadJS also comes with an after-sales support team: technical questions can be raised by phone or on the forum at any time on working days, and the supporting material (videos, documentation, examples, and API manuals) is extensive. You can even invite their technical consultants on site for training. For a project team like ours, with a short schedule and a heavy development load, this saves a lot of effort and reduces risk.

Image source: SpreadJS Online Excel Editor

So what are controls? Why use controls?

Quoting Wikipedia:
In computer programming, a control (also called a component or widget) is a graphical user interface element whose arrangement of displayed information can be changed by the user, such as a window or a text box. A control is characterized by providing a single point of interaction for the direct manipulation of a given kind of data. Controls are the basic visual building blocks which, combined in an application, hold all the data processed by the application and the available interactions on that data.

As I understand it, a control is a functional module that provides only basic functionality and supports secondary development. Controls are light on dependencies and highly adaptable, and they come with development documentation and APIs. In effect a control is a basic feature package for developers that is easy to customize as needed.

How SpreadJS meets the requirements, and the benefits

1. Concurrency

Because SpreadJS separates data from templates, the person filling in a report only needs to complete it on the page. On submission, only the filled-in data needs to be sent as JSON, and the server no longer has to parse every Excel file centrally. Bandwidth consumption was also cut in half.
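A minimal sketch of this separation, in plain JavaScript and without the SpreadJS API. The `template` object, the `extractData` helper, and the field names are all hypothetical; they only illustrate why a submission shrinks when the template stays on the client and just the bound values travel to the server.

```javascript
// Hypothetical template: cell layout plus binding paths. It is downloaded
// once and never sent back to the server.
const template = {
  name: "work-order",
  cells: [
    { row: 0, col: 0, label: "Customer", bindingPath: "customer" },
    { row: 1, col: 0, label: "Order No.", bindingPath: "orderNo" },
    { row: 2, col: 0, label: "Complaint", bindingPath: "complaint" },
  ],
};

// Collect only the bound values the operator filled in.
// `filledCells` maps "row,col" to the value typed into that cell.
function extractData(template, filledCells) {
  const data = {};
  for (const cell of template.cells) {
    const key = `${cell.row},${cell.col}`;
    if (key in filledCells) {
      data[cell.bindingPath] = filledCells[key];
    }
  }
  return data;
}

const filledCells = { "0,0": "Alice", "1,0": "SO-1001" };
const payload = extractData(template, filledCells);

// Only this small JSON object would be submitted, not the whole document.
console.log(JSON.stringify(payload));
```

The server then stores or validates plain JSON fields and never has to open a spreadsheet file for a routine submission.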

2. High requirements for Excel operation and compatibility

During the internal trial, colleagues in finance and customer service reported that the experience is almost identical to Excel and that no special training is needed. In addition, a large number of our existing Excel reports can be imported directly (batch and remote import are also possible after some secondary development), bringing charts, formulas, table styles, and other elements straight into the online version.

3. The report format is flexible and changeable

Designers can design directly online, or bring a report designed in Excel to the web, set up data binding, and save it in JSON format (SpreadJS's ssjson format contains all the information in the Excel document).

4. Support formula calculation

It supports more than 450 formula functions (Excel has 480 in total), and you can develop custom formulas yourself, which is enough for finance. It also supports Excel's reference operations, such as cross-sheet references, absolute references, and references to defined names.
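As a hedged sketch of what a custom formula might look like: the calculation itself is an ordinary function, and registration goes through what I believe are the SpreadJS extension points, `GC.Spread.CalcEngine.Functions.Function` and `sheet.addCustomFunction`. The `GROSSMARGIN` name and the `grossMargin` helper are hypothetical, and the exact API should be checked against the manual for your version.

```javascript
// Pure calculation logic, kept separate so it can be tested on its own.
function grossMargin(revenue, cost) {
  if (revenue === 0) return 0; // avoid dividing by zero
  return (revenue - cost) / revenue;
}

// Register the logic as a custom formula when SpreadJS is loaded on the page.
// Assumption: GC.Spread.CalcEngine.Functions.Function is the base class and
// sheet.addCustomFunction registers an instance on one sheet.
function registerGrossMargin(sheet) {
  if (typeof GC === "undefined") return false; // SpreadJS not loaded

  function GrossMarginFunction() {
    this.name = "GROSSMARGIN";
    this.minArgs = 2;
    this.maxArgs = 2;
  }
  GrossMarginFunction.prototype = new GC.Spread.CalcEngine.Functions.Function();
  GrossMarginFunction.prototype.evaluate = function (revenue, cost) {
    return grossMargin(revenue, cost);
  };

  sheet.addCustomFunction(new GrossMarginFunction());
  return true;
}

console.log(grossMargin(200, 150));
```

After registration, a cell formula such as `=GROSSMARGIN(B2, C2)` would be expected to behave like a built-in function.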

5. Data Documentation in Workflow

This largely removes the dependence on files. All process state and related data can be recorded in the database, and the file server only needs to keep a small number of template documents (with few templates they could go straight into the database, but we already had a file server). This saved 90% of the storage on our file server, and our operations colleagues are laughing in their sleep.

Diving into SpreadJS

Now for the important part. As a front-end developer, what interests me most is the underlying design of SpreadJS and how it balances memory and performance. I did a lot of research on this. Fortunately the material is not hard to find; much of it is in the open class section of the official Grape City forum (https://gcdn.grapecity.com.cn/forum.php?mod=forumdisplay&fid=225&filter=typeid&typeid=274&fileGuid=QKgTJRrrCD96PXwh).

Rendering performance

Performance is the biggest concern for any heavy user of table controls. Our data volume often reaches thousands of records, and Excel-style sheets do not lend themselves to paging (front-end formula calculation and summaries are involved), so I was quite worried during selection. It turned out I was overthinking it: SpreadJS loads 500,000 records easily, in about 200 ms (the performance demo on the official website only goes up to 50,000; we took it and measured 500,000 ourselves). After digging deeper, I learned that their approach is as follows:

  • Real-time rendering + double buffering:

The table area is rendered with Canvas, and only the part the user can actually see is drawn. As a result, loading 1,000 rows and loading 100,000 rows are both fast, with little difference in performance.

Double buffering addresses the smoothness of continuous rendering and can further improve rendering speed. Few people may have heard the term, but everyone has experienced it. In graphics, double buffering means that drawing commands are first executed against an off-screen buffer, where drawing is very fast; once a frame is complete, it is shown on screen in one step by swapping buffers. This avoids displaying half-drawn frames and is very efficient. A related technique is common in games: as the main character runs across the map, the engine loads and renders the map in real time according to the direction of movement, which avoids the long wait of loading a large map all at once.

Image source: Grape City Open Class [SpreadJS Performance Optimization]

(https://gcdn.grapecity.com.cn/forum.php?mod=viewthread&tid=86035&extra=page=1&filter=typeid&typeid=274&fileGuid=QKgTJRrrCD96PXwh)
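The two ideas can be sketched in plain JavaScript. This is an illustration under simplifying assumptions (fixed row height, an array standing in for a canvas), not SpreadJS's actual implementation; `visibleRowRange`, `DoubleBuffer`, and `renderFrame` are hypothetical names.

```javascript
// Compute which rows fall inside the viewport, assuming a fixed row height.
function visibleRowRange(scrollTop, viewportHeight, rowHeight, rowCount) {
  const first = Math.max(0, Math.floor(scrollTop / rowHeight));
  const last = Math.min(
    rowCount - 1,
    Math.ceil((scrollTop + viewportHeight) / rowHeight) - 1
  );
  return { first, last };
}

// A toy double buffer: drawing goes to the back buffer, and swap() exposes
// the finished frame in one step so a half-drawn frame is never shown.
class DoubleBuffer {
  constructor() {
    this.front = [];
    this.back = [];
  }
  draw(line) {
    this.back.push(line);
  }
  swap() {
    this.front = this.back;
    this.back = [];
    return this.front;
  }
}

// Draw only the visible rows into the back buffer, then swap.
function renderFrame(buffer, rows, scrollTop, viewportHeight, rowHeight) {
  const { first, last } = visibleRowRange(
    scrollTop, viewportHeight, rowHeight, rows.length
  );
  for (let r = first; r <= last; r++) {
    buffer.draw(`row ${r}: ${rows[r]}`);
  }
  return buffer.swap();
}

const rows = Array.from({ length: 100000 }, (_, i) => `value ${i}`);
const frame = renderFrame(new DoubleBuffer(), rows, 2000, 200, 20);
console.log(frame.length); // 10 rows drawn out of 100,000
```

The cost of a frame depends on the viewport size, not on the total row count, which is why 1,000 and 100,000 rows feel about the same. In a browser the back buffer would typically be an off-screen canvas copied to the visible one.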


  • Sparse arrays:

SpreadJS stores tabular data in a sparse array structure to optimize memory. Sparse arrays are often used to reduce the memory footprint of two-dimensional arrays (chessboards, maps, and so on), but they have an inherent drawback: slower access.

So at the time I ran a stress test on exactly this point: a traversal on the order of a million cells took a little over 200 ms, which meets our needs.
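One common way to build such a structure is a map of rows, each holding a map of columns. The `SparseGrid` class below is a hypothetical sketch of that idea and makes no claim about how SpreadJS stores cells internally.

```javascript
// Sparse cell store: only cells that hold a value take up memory.
class SparseGrid {
  constructor() {
    this.rows = new Map(); // row index -> Map(column index -> value)
  }

  set(row, col, value) {
    if (!this.rows.has(row)) this.rows.set(row, new Map());
    this.rows.get(row).set(col, value);
  }

  // Two lookups per access: this indirection is the access cost
  // that a dense two-dimensional array does not pay.
  get(row, col) {
    const cols = this.rows.get(row);
    return cols ? cols.get(col) : undefined;
  }

  // Visit only the populated cells; empty cells cost nothing.
  forEach(callback) {
    for (const [row, cols] of this.rows) {
      for (const [col, value] of cols) callback(row, col, value);
    }
  }

  // Number of populated cells.
  get size() {
    let count = 0;
    for (const cols of this.rows.values()) count += cols.size;
    return count;
  }
}

const grid = new SparseGrid();
grid.set(0, 0, "header");
grid.set(999999, 3, 42);
console.log(grid.size); // 2 cells stored, although the sheet spans a million rows
```

A dense array covering the same area would allocate millions of empty slots; here memory grows with the number of populated cells only.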

Calculation engine

According to the official introduction, the formula engine consists of two major parts: the calculation logic system and the reference system.

  • Reference system

Formulas in Excel depend on source data: C1 refers to B1, B1 refers to A1, and so on. SpreadJS already encapsulates this natively, and developers do not need to worry about it at all (unless there are special requirements such as tracing references back).

Excel has direct references, cross-sheet references, relative/absolute references, references to defined names, structured references to table rows and columns, cross-workbook references, and more (the list is not exhaustive; interested readers can look them up). SpreadJS runs inside a web page, so do not expect cross-workbook references; at least at present they are not supported.
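To make the reference chain concrete, here is a small sketch of dependency tracking in plain JavaScript. The `precedents` map and both helpers are hypothetical; the sketch assumes there are no circular references and is not a description of SpreadJS internals.

```javascript
// Map each cell to the cells its formula reads (its precedents).
const precedents = {
  B1: ["A1"],       // B1 = A1 * 2
  C1: ["B1"],       // C1 = B1 + 1
  D1: ["A1", "C1"], // D1 = A1 + C1
};

// Invert the map: for each cell, which cells depend on it directly?
function buildDependents(precedents) {
  const dependents = {};
  for (const [cell, refs] of Object.entries(precedents)) {
    for (const ref of refs) {
      (dependents[ref] = dependents[ref] || []).push(cell);
    }
  }
  return dependents;
}

// Return the cells to recalculate after `changed` is edited, ordered so that
// every cell comes after all of its precedents (a topological order).
// Assumes the reference graph has no cycles.
function recalcOrder(changed, precedents) {
  const dependents = buildDependents(precedents);
  const visited = new Set();
  const order = [];

  function visit(cell) {
    if (visited.has(cell)) return;
    visited.add(cell);
    for (const dep of dependents[cell] || []) visit(dep);
    order.push(cell); // post-order: a cell is pushed after all its dependents
  }

  visit(changed);
  return order.reverse().slice(1); // drop the edited cell itself
}

console.log(recalcOrder("A1", precedents)); // [ 'B1', 'C1', 'D1' ]
```

Editing A1 triggers B1, then C1, then D1, so D1 is evaluated only once and only after both of its inputs are up to date.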

  • Calculation logic

Formulas that can be typed into cells, such as SUM, IF, MATCH, and VLOOKUP, are like small "logic packages". SpreadJS currently has 460+ native formula functions, against 490+ in Excel, and it supports custom formulas that behave just like native ones.

As for the underlying implementation, after many versions of iteration these formulas are no longer independent "logic islands". Their implementation shares a good deal of abstraction and reuse at the bottom layer. Reportedly the new version improves performance while significantly reducing the amount of code compared with the old one, which is also friendlier to front-end build tooling.

To evaluate nested formulas, SpreadJS builds an AST (abstract syntax tree) underneath to analyze the calculation logic of the formula the user entered. Judging from the official example code, the formula is represented by a set of Expression objects, with a corresponding public interface for accessing them, as shown in the figure:

Image source: [SpreadJS formula structure tree display]

https://gcdn.grapecity.com.cn/showtopic-79188-1-1.html?fileGuid=QKgTJRrrCD96PXwh
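The idea of turning a formula into a tree and then evaluating it can be shown with a toy parser. Everything here (`tokenize`, `parse`, `evaluate`, the node shapes) is hypothetical and far simpler than a real formula engine: it handles numbers, cell references, the four arithmetic operators, parentheses, and two functions, with no error handling.

```javascript
// Split a formula into tokens: numbers, names (cell references or function
// names), operators, parentheses, and commas.
function tokenize(formula) {
  return formula.replace(/^=/, "")
    .match(/\d+(?:\.\d+)?|[A-Za-z]+\d*|[-+*\/(),]/g) || [];
}

// Recursive-descent parser that produces a small expression tree.
function parse(formula) {
  const tokens = tokenize(formula);
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function parseExpression() { // handles + and -
    let node = parseTerm();
    while (peek() === "+" || peek() === "-") {
      node = { type: "operator", op: next(), left: node, right: parseTerm() };
    }
    return node;
  }
  function parseTerm() { // handles * and /
    let node = parsePrimary();
    while (peek() === "*" || peek() === "/") {
      node = { type: "operator", op: next(), left: node, right: parsePrimary() };
    }
    return node;
  }
  function parsePrimary() {
    const token = next();
    if (token === "(") {
      const inner = parseExpression();
      next(); // consume ")"
      return inner;
    }
    if (/^\d/.test(token)) return { type: "number", value: Number(token) };
    if (peek() === "(") { // function call such as SUM(...)
      next(); // consume "("
      const args = [];
      while (peek() !== ")") {
        args.push(parseExpression());
        if (peek() === ",") next();
      }
      next(); // consume ")"
      return { type: "function", name: token.toUpperCase(), args };
    }
    return { type: "reference", cell: token.toUpperCase() };
  }
  return parseExpression();
}

// Walk the tree; `cells` maps cell names to values, missing cells count as 0.
function evaluate(node, cells) {
  switch (node.type) {
    case "number": return node.value;
    case "reference": return node.cell in cells ? cells[node.cell] : 0;
    case "function": {
      const values = node.args.map((arg) => evaluate(arg, cells));
      if (node.name === "SUM") return values.reduce((a, b) => a + b, 0);
      if (node.name === "MAX") return Math.max(...values);
      throw new Error(`Unsupported function: ${node.name}`);
    }
    default: { // operator
      const l = evaluate(node.left, cells);
      const r = evaluate(node.right, cells);
      return node.op === "+" ? l + r : node.op === "-" ? l - r
        : node.op === "*" ? l * r : l / r;
    }
  }
}

const ast = parse("=SUM(A1, B1*2) + 1");
console.log(evaluate(ast, { A1: 10, B1: 5 })); // 21
```

Because nested calls are just nested nodes, the same recursive walk handles `SUM(A1, MAX(B1, 3))` without any special casing.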

  • Performance

First, since this is a front-end technology, we can analyze the likely performance bottlenecks and their impact from the technical requirements of formula calculation. During development we made heavy use of user events, dirty-data tracking, linkage, and similar features. An important precondition for all of them to work correctly is that the correct calculation result is available at any moment. The most direct way to guarantee that is to run formulas synchronously at high priority.

Everyone knows that multi-threading can share the computing load. But leaving aside the difficulty of design and implementation, even with Web Workers JavaScript is, strictly speaking, still a single-threaded language: worker threads are fully controlled by the main thread, and the main thread cannot block and wait on them. So even with Web Workers, the synchronous execution described above cannot be guaranteed.

From this analysis, the ceiling on formula performance is the computing power of JavaScript itself. I found a related chart that gives an intuitive picture of the computing power of Node.js (Node.js runs on V8, widely regarded as the fastest JS engine).

The picture is quoted from "Deeply Explaining Node.js"

According to our tests, SpreadJS's calculation performance is close to that of native JS, so its optimization in this area is already near the practical limit. For our current scenarios this performance is sufficient. We cannot rule out much larger data and formula workloads in the future, and the vendor has also provided solutions for such cases.

Reportedly the vendor is also working on caching so that formula calculation can be cached in blocks: even if a value on the reference chain changes, the formulas along the whole chain would not all need to be recalculated. It sounds powerful and the idea is sound; I hope it ships soon.

Style system

Excel's style system is very complex. Every feature, such as borders, fonts, alignment, data formats, and conditional formats, has a flexible and sizeable implementation. When I first looked at SpreadJS, its Style class stunned me too. Besides the "visible" things I expected, such as borders, backgrounds, fonts, and alignment, it also covers cell types, data formats, cell buttons, drop-downs, and watermarks. I could not help thinking that Style is too heavy, and that customizing a large number of cell styles would surely hurt memory and performance. In practice, however, we found no bottleneck. It turns out that styles are designed as a layered structure, as shown in the figure:

Image source: Grape City Open Class [SpreadJS Performance Optimization]
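A layered style lookup can be sketched as follows. The layer order used here (sheet default, then column, then row, then cell) and the `resolveStyle` helper are assumptions for illustration; the article's figure may describe the real layers differently.

```javascript
// Resolve a cell's effective style by merging layers from least to most
// specific. Layers that are not defined for this cell are skipped.
function resolveStyle(sheet, row, col) {
  const layers = [
    sheet.defaultStyle,
    sheet.columnStyles[col],
    sheet.rowStyles[row],
    sheet.cellStyles[`${row},${col}`],
  ];
  // Later (more specific) layers override earlier ones.
  return Object.assign({}, ...layers.filter(Boolean));
}

// Only the differences are stored at each layer, so a sheet with a million
// cells and one shared default style stores a single style object.
const sheet = {
  defaultStyle: { font: "12px Arial", backColor: "white" },
  columnStyles: { 2: { hAlign: "right" } },
  rowStyles: { 0: { backColor: "lightgray" } },
  cellStyles: { "0,2": { foreColor: "red" } },
};

console.log(resolveStyle(sheet, 0, 2));
console.log(resolveStyle(sheet, 5, 5));
```

The first call combines all four layers; the second falls back to the default style alone, because nothing more specific is defined for that cell.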

How to use SpreadJS?

1. Render the table

Figure 6.1-1 Binding data and formulas

First, get the global spread object, which is the body of the whole workbook; a spread is divided into multiple sheets. SpreadJS returns the spread object once initialization completes.

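  • Getting the spread object in Vue

    <gc-spread-sheets @workbookInitialized="spreadInitHandle($event)" />

    methods: {
      spreadInitHandle: function (spread) {
        // Keep the workbook and its active sheet for the methods below
        this.spread = spread;
        this.activeSheet = spread.getActiveSheet();
      },
    }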
  • Bind data and formulas

    tableDataBind() {
      // Pause painting while the table is built; resumed at the end
      this.spread.suspendPaint();

      // Data source; in practice this can be requested from the back end
      var dataSource = {
        // Note the extra bindPath layer, which maps to the table's binding path
        bindPath_table: [
          { c1: 100, c2: 90, c3: 30, c4: 40 },
          { c1: 88, c2: 66, c3: 55, c4: 100 },
          { c1: 30, c2: 89, c3: 100, c4: 40 },
          { c1: 40, c2: 66, c3: 88, c4: 40 }
        ]
      };

      // Table binding and cell binding both need the data source wrapped
      // in SpreadJS's CellBindingSource
      var spreadNS = GC.Spread.Sheets;
      var dataSource1 = new spreadNS.Bindings.CellBindingSource(dataSource);

      var table2 = this.activeSheet.tables.add("tableName", 0, 0, 1, 5, spreadNS.Tables.TableThemes.light6);
      table2.showFooter(true);
      table2.autoGenerateColumns(false);

      var c1 = new spreadNS.Tables.TableColumn(1);
      c1.name("Chinese");
      c1.dataField("c1");

      var c2 = new spreadNS.Tables.TableColumn(2);
      c2.name("Math");
      c2.dataField("c2");

      var c3 = new spreadNS.Tables.TableColumn(3);
      c3.name("English");
      c3.dataField("c3");

      var c4 = new spreadNS.Tables.TableColumn(4);
      c4.name("Chemistry");
      c4.dataField("c4");

      // The fifth column has no data field; it is filled by a formula
      var c5 = new spreadNS.Tables.TableColumn(5);
      c5.name("Total");

      table2.bindColumns([c1, c2, c3, c4, c5]);
      table2.bindingPath("bindPath_table");

      // Set formulas: a per-row total and a subtotal in the footer
      table2.setColumnDataFormula(4, "=[@Chinese]+[@Math]+[@English]+[@Chemistry]");
      table2.setColumnFormula(4, "=SUBTOTAL(109,[Total])");

      // Allow cell content to overflow the cell (unrelated to binding)
      this.activeSheet.options.allowCellOverflow = true;

      // Bind the data source
      this.activeSheet.setDataSource(dataSource1);

      this.spread.resumePaint();
    }

Figure 6.1-2 Function name and function code mapping table
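For reference, the standard Excel function codes accepted by SUBTOTAL can be written as a small lookup; the footer formula in the binding code relies on 109, which means SUM. The `SUBTOTAL_CODES` object and the `subtotalFormula` helper are hypothetical conveniences, and the column name `Total` in the usage line is only an example.

```javascript
// Function codes for Excel's SUBTOTAL that ignore manually hidden rows.
// Codes 1 to 11 select the same functions but include hidden rows.
const SUBTOTAL_CODES = {
  AVERAGE: 101,
  COUNT: 102,
  COUNTA: 103,
  MAX: 104,
  MIN: 105,
  PRODUCT: 106,
  STDEV: 107,
  STDEVP: 108,
  SUM: 109,
  VAR: 110,
  VARP: 111,
};

// Build a footer formula for a table column, for example to pass to
// setColumnFormula on a table.
function subtotalFormula(functionName, columnName) {
  const code = SUBTOTAL_CODES[functionName.toUpperCase()];
  if (code === undefined) {
    throw new Error(`Unknown SUBTOTAL function: ${functionName}`);
  }
  return `=SUBTOTAL(${code},[${columnName}])`;
}

console.log(subtotalFormula("sum", "Total")); // =SUBTOTAL(109,[Total])
```

Swapping the function name is then enough to show an average or a maximum in the footer instead of a sum.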

2. Render conditional formatting

Once the data is rendered it displays correctly, but that alone does not meet the needs of data analysts. Meaningful data also has to stand out: maximum and minimum values marked in red, data bars showing progress, icons showing a rise or fall, two-color and three-color scales, and so on. How is that done?

  • Icon set: the effect is as shown in the figure

  • Implementation code

    iconset() {
      var activeSheet = this.activeSheet;
      var iconSetRule = new GC.Spread.Sheets.ConditionalFormatting.IconSetRule();

      // The range is hard-coded for this demo
      iconSetRule.ranges([new GC.Spread.Sheets.Range(0, 0, 5, 5)]);

      // Icon set type: arrows, circles, and so on. These match Excel:
      // whatever icon sets Excel offers are supported here
      iconSetRule.iconSetType(GC.Spread.Sheets.ConditionalFormatting.IconSetType.threeArrowsColored);

      var iconCriteria = iconSetRule.iconCriteria();

      // Values below 60
      iconCriteria[0] = new GC.Spread.Sheets.ConditionalFormatting.IconCriterion(
        true,
        GC.Spread.Sheets.ConditionalFormatting.IconValueType.number,
        60
      );

      // Values from 60 up to, but not including, 90
      iconCriteria[1] = new GC.Spread.Sheets.ConditionalFormatting.IconCriterion(
        true,
        GC.Spread.Sheets.ConditionalFormatting.IconValueType.number,
        90
      );

      // Values of 90 and above
      iconCriteria[2] = new GC.Spread.Sheets.ConditionalFormatting.IconCriterion(
        true,
        GC.Spread.Sheets.ConditionalFormatting.IconValueType.number,
        90
      );

      iconSetRule.reverseIconOrder(false);
      iconSetRule.showIconOnly(false);

      activeSheet.conditionalFormats.addRule(iconSetRule);
    }

  • Data bar (progress bar): the effect is as shown in the figure

  • Implementation code

    dataBar() {
      var activeSheet = this.activeSheet;

      activeSheet.conditionalFormats.addDataBarRule(
        GC.Spread.Sheets.ConditionalFormatting.ScaleValueType.number, 0,   // minimum value
        GC.Spread.Sheets.ConditionalFormatting.ScaleValueType.number, 100, // maximum value
        "orange",                                                          // bar color
        [new GC.Spread.Sheets.Range(0, 0, 5, 4)]
      );
    },

  • Duplicate values: the effect is as shown in the figure

  • Implementation code

    duplicateValue() {
      var activeSheet = this.activeSheet;

      // Highlight duplicates with red text on a yellow background
      var style = new GC.Spread.Sheets.Style();
      style.backColor = "yellow";
      style.foreColor = "red";

      var ranges = [new GC.Spread.Sheets.Range(0, 0, 5, 4)];
      activeSheet.conditionalFormats.addDuplicateRule(style, ranges);
    }

  • Cells containing the text "6": the effect is as shown in the figure

  • Implementation code

    includeText() {
      var activeSheet = this.activeSheet;

      var style = new GC.Spread.Sheets.Style();
      style.backColor = "red";

      var ranges = [new GC.Spread.Sheets.Range(0, 0, 5, 5)];
      activeSheet.conditionalFormats.addSpecificTextRule(
        GC.Spread.Sheets.ConditionalFormatting.TextComparisonOperators.contains, "6", style, ranges
      );
    }

  • Combining all of the above gives the result shown in the figure.

Closing remarks

This article mainly describes some of my own exploration in data visualization. I hope it helps readers who are preparing to build market reports, email subscription reports, online collaboration, or visual analysis.

The article is long and covers many concepts, so mistakes are inevitable. Corrections are welcome. Thank you!

============================

A word from the editor: many thanks to the front-end technology team at Zhengcai Cloud for their recognition of Grape City's products and for providing this content. If you also have experience using Grape City products, you are welcome to contribute; just message us through the WeChat official account.


Origin blog.csdn.net/powertoolsteam/article/details/131815676