Technical Practice in Depth | From Workflow to Workflow

About the author: Scallion Pancake, front-end engineer at Guanyuan Data, works on putting the team's development standards into practice and improving development quality and speed, and is committed to building easier-to-use ABI products.

Background

Let me start with a simple example. For work, you may have to extract data from a database every day, build a report from it, and finally email the report to the relevant managers. Each manager may need to see something different, so you have to filter and process the data before building the report. Can this daily, repetitive process be abstracted into a concrete workflow, with each step turned into a visual functional node, the nodes chained together as tasks, the whole thing displayed as a DAG, and then simply run on a schedule every day? To do that, we need a workflow to standardize and automate the process.

So what is a workflow? What is a DAG? Let's get into today's content.

Foreword

This article explains the two concepts of workflow and DAG as used in Universe (one of Guanyuan's three major product lines: Guanyuan Data's intelligent data development platform), and then introduces some related content. It is divided into four parts:

  1. Workflow in the development platform;
  2. How to abstract and implement a DAG;
  3. An introduction to other workflows;
  4. Summary and thoughts on workflow and DAG.

Let's get started~

1. Workflow

First, a brief introduction to the workflow in Universe:

It lets you design the dependencies and scheduling order of tasks in a visual, low-code way, and quickly configure task nodes to handle a series of data tasks. A workflow can run at an agreed time or once the events it depends on are satisfied, invoking each task node in order and completing the data processing automatically. Its advantages are ease of use, high reliability, and high scalability.

According to this description, we can briefly summarize the two core capabilities of the workflow:

  1. Scheduling;
  2. Configuration (node).

These two core capabilities are described in detail below.

1.1 Scheduling

The development platform supports timed scheduling based on Cron expressions and event scheduling based on dependencies on input source data. Timed scheduling uses the quartz distributed scheduler. It has the following characteristics:

  • Ease of use
    • Task nodes are arranged visually through a DAG, with no complicated platform language to learn, and task scheduling works out of the box;
    • Supports configuring scheduling relationships such as sequential scheduling, scheduling on success, and scheduling on failure, so scheduling strategies can be adjusted flexibly;
    • Supports running workflows on a schedule (daily, weekly, monthly, etc.), and run results can be pushed quickly to platforms such as DingTalk and Enterprise WeChat. Configure once, and it keeps working.
  • High reliability: A decentralized multi-Master, multi-Worker distributed architecture avoids a single point of failure and improves system reliability.
  • High scalability: Custom task types and processes can be developed with the SDK and plugged in seamlessly.
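The scheduling relationships listed above can be pictured as conditions attached to the edges between task nodes. The sketch below is only an illustration under assumed names: `condition`, `nextNodes`, and the status strings are hypothetical, not the platform's API, and nodes with several upstream tasks are ignored.

```javascript
// Hypothetical edge conditions; names and status strings are illustrative only.
const CONDITIONS = {
  always: () => true, // sequential: run once the upstream node has finished
  success: status => status === 'SUCCESS', // run only if the upstream node succeeded
  failure: status => status === 'FAILED', // run only if the upstream node failed
}

// Given a finished node and its status, return the downstream nodes to trigger.
function nextNodes (edges, finishedNodeId, status) {
  return edges
    .filter(edge => edge.from === finishedNodeId)
    .filter(edge => CONDITIONS[edge.condition || 'always'](status))
    .map(edge => edge.to)
}

const exampleEdges = [
  { from: 'extract', to: 'report', condition: 'success' },
  { from: 'extract', to: 'alert', condition: 'failure' },
  { from: 'extract', to: 'cleanup', condition: 'always' },
]

console.log(nextNodes(exampleEdges, 'extract', 'SUCCESS')) // [ 'report', 'cleanup' ]
console.log(nextNodes(exampleEdges, 'extract', 'FAILED')) // [ 'alert', 'cleanup' ]
```

Keeping the condition on the edge rather than on the node means the same node can feed both a "success" branch and a "failure" branch.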

1.1.1 Timed Scheduling

Schedules can be set daily/weekly/monthly/yearly, accurate to the minute, or as a fixed interval (in hours or minutes).

For example: if I want the workflow to run at 7:00 in the morning and 21:00 in the evening every day, I can choose "every day", hours 7 and 21, minute 00. Alternatively, I can set a minute or hour interval to run on.
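As a rough illustration of how such a choice could map onto a Cron expression, the sketch below builds Quartz-style expressions (fields: second, minute, hour, day-of-month, month, day-of-week). The helper names `dailyCron` and `everyNHoursCron` are made up for this example and are not the platform's API.

```javascript
// Build a Quartz-style cron expression that fires every day at the given hours.
function dailyCron (hours, minute = 0) {
  const hoursValid = hours.length > 0 &&
    hours.every(h => Number.isInteger(h) && h >= 0 && h <= 23)
  if (!hoursValid) throw new RangeError('hours must be integers in 0..23')
  if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError('minute must be an integer in 0..59')
  }
  // De-duplicate and sort so the expression is stable regardless of input order.
  const sorted = [ ...new Set(hours) ].sort((a, b) => a - b)
  return `0 ${minute} ${sorted.join(',')} * * ?`
}

// Build an expression that fires every n hours, on the hour, starting at 00:00.
function everyNHoursCron (n) {
  if (!Number.isInteger(n) || n < 1 || n > 23) throw new RangeError('n must be in 1..23')
  return `0 0 0/${n} * * ?`
}

console.log(dailyCron([ 7, 21 ])) // 0 0 7,21 * * ?
console.log(everyNHoursCron(2)) // 0 0 0/2 * * ?
```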

1.1.2 Event Scheduling

Workflows generally depend on data sources, such as datasets or databases. Once all the data sources a workflow depends on have been updated, the workflow runs automatically once.
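The "run once when every dependency has been updated" rule can be sketched as a small counter over the dependent data sources. This is a simplified illustration, not the platform's implementation: `createDependencyTrigger`, `notifyUpdated`, and the data source names are hypothetical.

```javascript
// Track which data sources still need an update in the current round.
function createDependencyTrigger (dependencyIds, runWorkflow) {
  const pending = new Set(dependencyIds)
  return {
    // Call when a data source has been updated; returns true if the workflow ran.
    notifyUpdated (id) {
      // Ignore unknown sources and sources already counted in this round.
      if (!pending.has(id)) return false
      pending.delete(id)
      if (pending.size > 0) return false
      // Every dependency has been updated: start a new round and run once.
      dependencyIds.forEach(dep => pending.add(dep))
      runWorkflow()
      return true
    },
  }
}

let runs = 0
const trigger = createDependencyTrigger([ 'orders', 'users' ], () => { runs += 1 })
trigger.notifyUpdated('orders') // still waiting for 'users'
trigger.notifyUpdated('users') // both updated, the workflow runs
console.log(runs) // 1
```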

1.2 Configuration

Based on an agreed configuration description, an interactive UI is generated for building the target object.

The purpose of scheduling is to run the workflow, and running the workflow depends on the configuration of its task nodes. Different configurations inevitably require different UI components. How can we assemble a visual UI from a known data structure? The answer is configuration.

We read a configuration description (an object), render the corresponding components according to it, and collect the components' values into a single configuration object. This completes the path from description, to UI, to target object. Below I give three simple examples to illustrate the power and charm of configuration.

1.2.1 Basic capabilities

If we need to construct a target object as follows:

{
    name: '',
    description: '',
}

Then we will have the following configuration description:

[
    {
        fieldName: 'name',
        label: 'Name',
        type: 'STRING',
        defaultValue: '',
    },
    {
        fieldName: 'description',
        label: 'Description',
        type: 'TEXT',
        defaultValue: '',
    },
]
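Going from such a description to the initial target object could look roughly like the sketch below. `buildTarget` and `setField` are hypothetical helper names used only for illustration, and the actual rendering of components is left out.

```javascript
// Create the initial target object from the defaults in the configuration description.
function buildTarget (definitions) {
  return definitions.reduce((target, definition) => {
    target[definition.fieldName] = definition.defaultValue
    return target
  }, {})
}

// Return a new target object with one field changed (what a component's onChange would do).
function setField (target, fieldName, value) {
  return { ...target, [fieldName]: value }
}

const basicDefinitions = [
  { fieldName: 'name', label: 'Name', type: 'STRING', defaultValue: '' },
  { fieldName: 'description', label: 'Description', type: 'TEXT', defaultValue: '' },
]

const initial = buildTarget(basicDefinitions)
console.log(initial) // { name: '', description: '' }
console.log(setField(initial, 'name', 'daily report')) // { name: 'daily report', description: '' }
```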

The resulting UI looks like this:

1.2.2 Dynamic capabilities

Often we need to build a target object dynamically. What does that mean? Depending on the value chosen for one attribute, different attributes are combined into the target object. In the UI, this means different components are displayed for different attribute values. The basic capabilities above clearly cannot achieve this.

For example, suppose I want to calculate the area of a shape: a square needs a side-length attribute and a circle needs a radius attribute. The target object and UI then become:

  • When square is selected

{
    shape: 'square',
    side: 8,
}

  • When circle is selected

{
    shape: 'circle',
    radius: 4,
}

We can see that side and radius appear dynamically depending on shape, so we simply modify the configuration description:

    {
        fieldName: 'shape',
        label: 'Shape',
        type: 'MODEL',
        model: {
            modelType: 'SELECT',
            labels: [ 'Circle', 'Square' ],
            values: [ 'circle', 'square' ],
        },
    },
    {
        fieldName: 'radius',
        label: 'Radius',
        type: 'NUMBER',
        dependsOnMap: {
            shape: [ 'circle' ],
        },
        defaultValue: 4,
    },
    {
        fieldName: 'side',
        label: 'Side length',
        type: 'NUMBER',
        dependsOnMap: {
            shape: [ 'square' ],
        },
        defaultValue: 8,
    },

As you can see, we only added the dependsOnMap attribute and made a small adaptation in the internal rendering and object-building logic, and now different components are displayed for different attribute values.

A brief note on the dependsOnMap attribute: each key is a fieldName, and each value is an array, which makes it easy to allow multiple values. The current value is looked up by fieldName and compared with the values in the configuration. If it matches one of them, the component is displayed. The core logic is as follows:

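import _get from 'lodash/get' // assumption: _get is lodash's get

// Returns true when every field listed in dependsOnMap currently holds one of its allowed values
function isDependsOnMap (dependsOnMap, config) {
  const fieldNames = Object.keys(dependsOnMap || {})
  // No dependencies declared: always display the component
  if (fieldNames.length === 0) return true
  return fieldNames.every(fieldName => {
    const values = dependsOnMap[fieldName] || []
    return values.indexOf(_get(config, fieldName)) > -1
  })
}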
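To show the effect on the target object, here is a self-contained sketch that applies the same visibility rule when building it. It assumes flat field names (plain property access instead of a path lookup), and `isVisible` and `buildDynamicTarget` are illustrative names rather than the library's own.

```javascript
// A field is visible when every dependency currently holds one of its allowed values.
function isVisible (definition, config) {
  const dependsOnMap = definition.dependsOnMap || {}
  return Object.keys(dependsOnMap).every(
    fieldName => dependsOnMap[fieldName].includes(config[fieldName])
  )
}

// Keep only visible fields; fall back to defaultValue when the user has not set a value.
function buildDynamicTarget (definitions, config) {
  const target = {}
  for (const definition of definitions) {
    if (!isVisible(definition, config)) continue
    const { fieldName, defaultValue } = definition
    target[fieldName] = fieldName in config ? config[fieldName] : defaultValue
  }
  return target
}

const shapeDefinitions = [
  { fieldName: 'shape', type: 'MODEL' },
  { fieldName: 'radius', type: 'NUMBER', dependsOnMap: { shape: [ 'circle' ] }, defaultValue: 4 },
  { fieldName: 'side', type: 'NUMBER', dependsOnMap: { shape: [ 'square' ] }, defaultValue: 8 },
]

console.log(buildDynamicTarget(shapeDefinitions, { shape: 'square' })) // { shape: 'square', side: 8 }
// A stale `side` value is dropped once the shape changes to a circle.
console.log(buildDynamicTarget(shapeDefinitions, { shape: 'circle', side: 8 })) // { shape: 'circle', radius: 4 }
```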

1.2.3 Complex Capabilities

In day-to-day development, components may also need to pass data to one another. Because of the constraints of the configuration description, each component is rendered independently and there is no connection between components. To solve this, we only need to implement a data sharing layer at the top level: component 3 puts the data it needs to pass into the data sharing layer, and component 1, which needs the data, reads it directly from there.

The configuration is as follows:

    {
        fieldName: 'fieldName1',
        label: 'Component 1',
        type: 'MODEL',
        model: {
            modelType: 'SELECT',
            labels: [ 'Circle', 'Square' ],
            values: [ 'circle', 'square' ],
            from: { fieldName: 'disabledFieldName' }, // depends on the setting in component 3 to decide whether this component should be disabled
        },
    },
    {
        fieldName: 'fieldName2',
        label: 'Component 2',
        type: 'NUMBER',
    },
    {
        fieldName: 'fieldName3',
        label: 'Component 3',
        type: 'MODEL',
        model: {
            modelType: 'BOOLEAN',
            targetSharedFieldName: 'disabledFieldName', // the field used to write data into the data sharing layer
        },
    },

The key configuration attributes are model.targetSharedFieldName in component 3 and model.from in component 1, and the two correspond to each other. A rough implementation is as follows:

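import React, { useContext, useEffect, useState } from 'react'

const SharedContext = React.createContext({
  updateFieldValue: () => {}, // update a shared field's value
  getFieldValue: () => {}, // read a shared field's value
})

// Component 3 in the configuration: writes its value into the data sharing layer
function Comp3 ({ definition }) {
  const { targetSharedFieldName } = definition.model
  const { updateFieldValue } = useContext(SharedContext)
  const [ value ] = useState(false) // the BOOLEAN control's own value (setter and rendering omitted)

  useEffect(() => {
    updateFieldValue(targetSharedFieldName, value)
  }, [ updateFieldValue, targetSharedFieldName, value ])
  return null
}

// Component 1 in the configuration: reads the shared value to decide whether it is disabled
function Comp1 ({ definition }) {
  const { from } = definition.model
  const { getFieldValue } = useContext(SharedContext)
  const disabled = getFieldValue(from.fieldName)
  return null // rendering omitted; `disabled` would be passed to the SELECT control
}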
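What sits behind such a context could be as small as a keyed store with change notifications. The following framework-free sketch is an assumption about one possible shape: `createSharedStore` and `subscribe` are hypothetical names, while `updateFieldValue` and `getFieldValue` mirror the context above.

```javascript
// A minimal data sharing layer: a keyed store that notifies listeners on change.
function createSharedStore () {
  const values = new Map()
  const listeners = new Set()
  return {
    getFieldValue: fieldName => values.get(fieldName),
    updateFieldValue (fieldName, value) {
      // Skip the write and the notification when nothing actually changed.
      if (Object.is(values.get(fieldName), value)) return
      values.set(fieldName, value)
      listeners.forEach(listener => listener(fieldName, value))
    },
    // Register a listener; the returned function removes it again.
    subscribe (listener) {
      listeners.add(listener)
      return () => listeners.delete(listener)
    },
  }
}

const store = createSharedStore()
const seen = []
const unsubscribe = store.subscribe((fieldName, value) => seen.push([ fieldName, value ]))

store.updateFieldValue('disabledFieldName', true) // what component 3 would do
console.log(store.getFieldValue('disabledFieldName')) // true (what component 1 would read)
```

In a React setup, the provider would hold one such store and re-render subscribed components when a listener fires.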

Finally, here is an animation of a complex configuration UI from an earlier version of the development platform, to give a feel for the power and charm of configuration:

1.2.4 Service Capabilities

When we need to build an array-like target object, the first thing that comes to mind is to display the UI as a list. So we designed some service-type components that are only responsible for rendering the list, while the components inside each list item are determined by the type of the array elements. For example, suppose we need an array-like target object such as:

{
    list: [
        { name: 'a', age: 12 },
        { name: 'b', age: 18 },
    ],
}

The corresponding configuration description can be written as follows:

[
    {
        fieldName: 'list',
        label: 'List',
        type: 'MODEL',
        model: {
            modelType: 'LIST',
            definitions: [
                {
                    fieldName: 'name',
                    label: 'Name',
                    type: 'STRING',
                },
                {
                    fieldName: 'age',
                    label: 'Age',
                    type: 'NUMBER',
                },
            ],
        },
    },
]

The corresponding UI is as follows:

This LIST component is a service-type component that displays array objects as a list.
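On the data side, a LIST component mostly needs to add and edit rows whose shape comes from the nested definitions. The sketch below illustrates that idea only. `addRow`, `updateRow`, and the per-type empty values in `EMPTY_BY_TYPE` are assumptions, since the configuration above declares no defaults for list items.

```javascript
// Assumed empty values per field type, used when a definition has no defaultValue.
const EMPTY_BY_TYPE = { STRING: '', TEXT: '', NUMBER: 0 }

// Create one empty row from the nested definitions of a LIST.
function createRow (definitions) {
  const row = {}
  for (const { fieldName, type, defaultValue } of definitions) {
    row[fieldName] = defaultValue !== undefined ? defaultValue : EMPTY_BY_TYPE[type]
  }
  return row
}

// Append an empty row to the list field, returning a new target object.
function addRow (target, listDefinition) {
  const { fieldName, model } = listDefinition
  const rows = target[fieldName] || []
  return { ...target, [fieldName]: [ ...rows, createRow(model.definitions) ] }
}

// Merge a patch into the row at `index`, returning a new target object.
function updateRow (target, fieldName, index, patch) {
  const rows = target[fieldName].map((row, i) => (i === index ? { ...row, ...patch } : row))
  return { ...target, [fieldName]: rows }
}

const listDefinition = {
  fieldName: 'list',
  type: 'MODEL',
  model: {
    modelType: 'LIST',
    definitions: [
      { fieldName: 'name', type: 'STRING' },
      { fieldName: 'age', type: 'NUMBER' },
    ],
  },
}

const withRow = addRow({}, listDefinition)
console.log(updateRow(withRow, 'list', 0, { name: 'a', age: 12 })) // { list: [ { name: 'a', age: 12 } ] }
```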

1.2.5 Registration Capabilities

Built-in components may not cover every configuration requirement: the configuration is only a convention, while the UI drawn to build an object is free-form and display forms vary widely. For this reason we provide a registration mechanism, so users can register custom component types to draw the corresponding target object.
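A registration mechanism of this kind can be sketched as a lookup table from type to component. Everything here is hypothetical: `registerComponent`, `resolveComponent`, and the `COLOR` type are illustrative, and components return plain strings so the example stays framework-free.

```javascript
// Map from component type to a render function.
const registry = new Map()

function registerComponent (type, component) {
  if (typeof component !== 'function') throw new TypeError('component must be a function')
  registry.set(type, component)
}

// MODEL definitions are looked up by model.modelType, everything else by type.
function resolveComponent (definition) {
  const key = definition.type === 'MODEL' ? definition.model.modelType : definition.type
  const component = registry.get(key)
  if (!component) throw new Error(`No component registered for "${key}"`)
  return component
}

// A built-in type and a user-registered custom type.
registerComponent('STRING', definition => `<input name="${definition.fieldName}">`)
registerComponent('COLOR', definition => `<color-picker name="${definition.fieldName}">`)

const colorDefinition = { fieldName: 'bg', type: 'MODEL', model: { modelType: 'COLOR' } }
console.log(resolveComponent(colorDefinition)(colorDefinition)) // <color-picker name="bg">
```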

1.3 Summary

Configuration capabilities this useful deserve to be abstracted, so they are also used in BI's custom charts. On this basis we wrote a library called Lego. As the name suggests, we want building configuration-oriented UIs to be as simple as playing with building blocks: once the description (interface) is agreed on, you just piece things together.

Having introduced the workflow, we also need a visual interface to describe the process, and a DAG is undoubtedly a good form of display.

2. DAG

DAG stands for Directed Acyclic Graph. It consists of a finite number of vertices and directed edges, such that starting from any vertex and following any number of directed edges, it is impossible to return to that vertex. For example, as shown in the figure below:

With the concept of a DAG in mind, how do we abstract a simple, easy-to-use DAG for the development platform's workflow scenario? First, let's sort out what information and state are needed to draw a DAG:

  • Node information (nodes)
  • Node location (location)
  • Connection information (edges)
  • Edit and read-only status

The first three points are easy to understand: they are the three essential elements for drawing a DAG. As for the fourth, workflows in the development platform have the concept of going online and offline. By data-warehouse development convention, a workflow that has finished development goes online and runs, and can no longer be modified. So our workflows are editable when offline and read-only when online.
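To make the drawing inputs and the "acyclic" constraint concrete, here is a minimal sketch. The shapes of `nodes`, `location`, and `edges` are assumptions for illustration, and `isAcyclic` is a hypothetical helper based on Kahn's algorithm. It expects every edge to reference a known node.

```javascript
// Assumed shapes for the three drawing inputs (field names are illustrative).
const graph = {
  nodes: [ { id: 'extract' }, { id: 'filter' }, { id: 'report' } ],
  location: { extract: { x: 0, y: 0 }, filter: { x: 160, y: 0 }, report: { x: 320, y: 0 } },
  edges: [ { from: 'extract', to: 'filter' }, { from: 'filter', to: 'report' } ],
}

// Kahn's algorithm: repeatedly remove nodes with no incoming edges.
// If every node can be removed this way, the graph has no cycle.
function isAcyclic (nodeIds, edges) {
  const inDegree = new Map(nodeIds.map(id => [ id, 0 ]))
  const outgoing = new Map(nodeIds.map(id => [ id, [] ]))
  for (const { from, to } of edges) {
    outgoing.get(from).push(to)
    inDegree.set(to, inDegree.get(to) + 1)
  }
  const queue = nodeIds.filter(id => inDegree.get(id) === 0)
  let visited = 0
  while (queue.length > 0) {
    const id = queue.shift()
    visited += 1
    for (const next of outgoing.get(id)) {
      inDegree.set(next, inDegree.get(next) - 1)
      if (inDegree.get(next) === 0) queue.push(next)
    }
  }
  return visited === nodeIds.length
}

const ids = graph.nodes.map(node => node.id)
console.log(isAcyclic(ids, graph.edges)) // true
console.log(isAcyclic(ids, [ ...graph.edges, { from: 'report', to: 'extract' } ])) // false
```

An editor could run such a check before accepting a new connection, so that a loop is never drawn in the first place.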

Starting from editing versus read-only, we can divide the DAG into two parts, Playground and Renderer, which can be used independently. Playground corresponds to the editing state and Renderer to the read-only state. Playground generates the drawing information in real time while editing, and Renderer renders from that drawing information in real time. Now let's sort out which capabilities the editing and read-only states should have:

  • Playground
    • Node dragging
    • Adding and deleting connections
    • Adding/copying nodes
    • Box-selecting nodes for batch copy/delete
    • Auto layout/undo
  • Renderer
    • Zooming in and out
    • Canvas dragging
    • Node clicking
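Of the Playground capabilities, undo is a good example of why generating the drawing information as plain data pays off: each edit can be stored as a snapshot. The sketch below is one possible approach, not the actual implementation. `createHistory`, `commit`, and `undo` are hypothetical names.

```javascript
// Keep previous snapshots of the drawing information so edits can be undone.
function createHistory (initialState) {
  const past = []
  let present = initialState
  return {
    get state () { return present },
    // Record the current snapshot and switch to the new one.
    commit (nextState) {
      past.push(present)
      present = nextState
    },
    // Step back one snapshot; returns false when there is nothing to undo.
    undo () {
      if (past.length === 0) return false
      present = past.pop()
      return true
    },
  }
}

const history = createHistory({ edges: [] })
history.commit({ edges: [ { from: 'a', to: 'b' } ] }) // the user draws a connection
console.log(history.state.edges.length) // 1
history.undo()
console.log(history.state.edges.length) // 0
```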

Thinking further, what other capabilities should our DAG have? Based on how the development platform uses it, I list the following points:

  • Style configuration (such as node size, connection width, etc.)
  • Adaptive width and height
  • Custom drawing of nodes and connections
  • Extension of other drawing capabilities (such as annotations, which are not part of the DAG itself but can be implemented as an extension)

At this point, our DAG has a fairly complete structure and implementation direction:

|- ConfigContext              --- configuration layer
     |- Playground            --- editing layer
        |- ResponsiveProvider --- adaptive width/height layer (optional)
           |- Renderer        --- read-only layer, display only
              |- Nodes        --- nodes
              |- Edges        --- connections

In use, it looks roughly like this:

2.1 Read-only use

<ConfigContext.Provider value={{ node: { width: 56, height: 56 } }}>
 <ResponsiveProvider>
  <Renderer nodes={nodes} location={location} edges={edges} />
 </ResponsiveProvider>
</ConfigContext.Provider>

2.2 Editing use

<ConfigContext.Provider value={{ node: { width: 60, height: 60 } }}>
 <Playground nodes={nodes} location={location} edges={edges} />
</ConfigContext.Provider>

2.3 Use of custom nodes and connections

<ConfigContext.Provider value={{ node: { width: 56, height: 56 } }}>
 <Renderer nodes={nodes} location={location} edges={edges}>
  <Nodes>
   {(props) => <CustomNode />}
  </Nodes>
  <Edges>
   {(props) => <CustomEdge />}
  </Edges>
 </Renderer>
</ConfigContext.Provider>

2.4 Underlying drawing

Here we choose SVG because it is powerful enough for drawing, supports custom styling with CSS, and makes event binding convenient. With this direction set, we can determine which tag each of the following elements corresponds to:

The drawn structure is roughly as follows:

Zooming and panning of the canvas are controlled through the viewBox attribute.
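As a sketch of how zooming and panning reduce to arithmetic on the viewBox, assume the viewBox is kept as an object `{ x, y, width, height }` in SVG user units. The helper names below are illustrative, not the component's API.

```javascript
// Zoom by `factor` (> 1 zooms in) while keeping the user-space point (cx, cy)
// at the same position on screen.
function zoomViewBox (viewBox, factor, cx, cy) {
  return {
    x: cx - (cx - viewBox.x) / factor,
    y: cy - (cy - viewBox.y) / factor,
    width: viewBox.width / factor,
    height: viewBox.height / factor,
  }
}

// Dragging the canvas content by (dx, dy) user units moves the viewBox the opposite way.
function panViewBox (viewBox, dx, dy) {
  return { ...viewBox, x: viewBox.x - dx, y: viewBox.y - dy }
}

// Serialize for the SVG `viewBox` attribute: "min-x min-y width height".
function toViewBoxAttribute ({ x, y, width, height }) {
  return `${x} ${y} ${width} ${height}`
}

const initialViewBox = { x: 0, y: 0, width: 800, height: 600 }
console.log(toViewBoxAttribute(zoomViewBox(initialViewBox, 2, 400, 300))) // 200 150 400 300
console.log(toViewBoxAttribute(panViewBox(initialViewBox, 50, -20))) // -50 20 800 600
```

Because only the viewBox changes, node and connection coordinates never need to be recomputed when the user zooms or drags the canvas.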

Given this HTML structure, what we need to care about is how to generate the connections. Here we compute quadratic Bézier curves from the positions of the two nodes to get a smooth curve that is point-symmetric about its midpoint, as follows:

Next, how to implement a quadratic Bézier curve in the path tag. First, drawing one requires three points, as shown in the animation below:

Second, because our curve is point-symmetric, we only need to work out half of it. That half is a quadratic Bézier curve, so the positions of its three points are easy to determine, as follows:

Here P0 is the start point and P4 is the end point. For ease of calculation, P1 sits at 1/4 of the horizontal spacing, at the same height as the start point, and P2 sits at 1/2 of both the horizontal and vertical spacing. Substituting the points gives: d = M P0x P0y Q P1x P1y P2x P2y T P4x P4y. The T command mirrors the control point P1 about P2 to draw the second half, so we get a complete curve composed of two quadratic Bézier curves.
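The point positions described above translate directly into a path string. In this sketch `edgePath` is a hypothetical helper name, and the start and end points are assumed to be the connection anchors of the two nodes.

```javascript
// Build the `d` attribute for a connection from `start` (P0) to `end` (P4).
function edgePath (start, end) {
  const dx = end.x - start.x
  const dy = end.y - start.y
  // P1: control point at 1/4 of the horizontal spacing, same height as the start.
  const p1 = { x: start.x + dx / 4, y: start.y }
  // P2: midpoint of the connection, where the two halves meet.
  const p2 = { x: start.x + dx / 2, y: start.y + dy / 2 }
  // Q draws the first half; T mirrors P1 about P2 and draws the second half.
  return `M ${start.x} ${start.y} Q ${p1.x} ${p1.y} ${p2.x} ${p2.y} T ${end.x} ${end.y}`
}

console.log(edgePath({ x: 0, y: 0 }, { x: 200, y: 100 })) // M 0 0 Q 50 0 100 50 T 200 100
```

The result can be used as the `d` attribute of a `<path>` element, with `fill` set to `none` so that only the stroke is visible.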

2.5 Layout

With nodes and connections in place, layout is also very important. Manual dragging is often not tidy enough, and an automatic layout algorithm makes things much easier. Here we choose dagre as the automatic layout tool. It offers three main ranking algorithms:

function rank(g) {
 switch(g.graph().ranker) {
  case "network-simplex": networkSimplexRanker(g); break;
  case "tight-tree": tightTreeRanker(g); break;
  case "longest-path": longestPathRanker(g); break;
  default: networkSimplexRanker(g);
 }
}

network-simplex and tight-tree produce similar layouts, both arranging nodes compactly. longest-path differs in that, when there are multiple end nodes, those nodes are guaranteed to be aligned with each other instead of each being placed as close as possible to its predecessor, as shown in the following figures:

  • network-simplex and tight-tree

  • longest-path
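The alignment of end nodes under longest-path can be reproduced with a few lines of ranking logic. The sketch below is a simplified illustration of the idea and is not dagre's code: `longestPathRanks` is a hypothetical helper, every edge is assumed to span at least one layer, and the input must already be acyclic.

```javascript
// Assign a layer (rank) to each node so that all end nodes share the last layer.
function longestPathRanks (nodeIds, edges) {
  const outgoing = new Map(nodeIds.map(id => [ id, [] ]))
  for (const { from, to } of edges) outgoing.get(from).push(to)

  const rank = new Map()
  function visit (id) {
    if (rank.has(id)) return rank.get(id)
    const targets = outgoing.get(id)
    // End nodes get rank 0; any other node sits one layer before its earliest successor.
    const value = targets.length === 0
      ? 0
      : Math.min(...targets.map(target => visit(target) - 1))
    rank.set(id, value)
    return value
  }
  nodeIds.forEach(visit)

  // Shift ranks so that the first layer is 0.
  const lowest = Math.min(...rank.values())
  return Object.fromEntries(nodeIds.map(id => [ id, rank.get(id) - lowest ]))
}

const ranks = longestPathRanks(
  [ 'a', 'b', 'c', 'd' ],
  [ { from: 'a', to: 'b' }, { from: 'b', to: 'c' }, { from: 'a', to: 'd' } ]
)
console.log(ranks) // { a: 0, b: 1, c: 2, d: 2 }
```

Here `d` hangs directly off `a`, yet it lands in the same layer as `c`. A compact ranker would instead place it in layer 1, right next to `b`.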

3. Other workflows

We won't go into detail on how to use these workflows here. Instead, we look at what ideas about drawing and applying workflows can be borrowed from them.

3.1 n8n

The workflow automation platform that doesn't box you in, that you never outgrow.

n8n supports event-driven triggering (typically through third-party application hooks, local file change monitoring, etc.) as well as timed scheduling with cron expressions, and it determines the dependencies between nodes by the order in which data is passed. It is very similar to our workflow, except that the dependencies in our workflow are on node task scheduling, not on data.

3.1.1 Application

So what is it suitable for? As shown below:

If you are an open source enthusiast and want to know right away when your GitHub repo is starred or unstarred, you can use GitHub's star webhook and send yourself a message through Slack. By integrating third-party platforms, otherwise unrelated applications can be chained together into a convenient workflow.

3.1.2 Summary

At present, n8n integrates 200+ applications, covering most mainstream ones. Some applications popular in China, such as DingTalk and Enterprise WeChat, are still missing, but fortunately n8n supports developing custom nodes. If you are interested, see the n8n documentation. On the whole, n8n is more of a workflow for integrating applications, though it also supports some local functions, such as reading and writing files or running git operations. It can fold common operations that we would otherwise write over and over in daily work or development into a workflow, which makes everyday tasks more convenient.

3.1.3 Reference

From its workflow design, perhaps some points can be used for reference:

  • When configuring a node, you can see the output data of the previous node, which makes the current step easier to configure
  • Once a node is configured, it can be executed immediately and its output data inspected
  • Connections carry some data visualization enhancements, such as showing how many rows the output data has
  • You can click a node directly to add and select the next node, which saves some connection operations

3.1.4 Others

Later, I tried to see whether nodes could form a loop. It turns out they can, but the application then gets stuck in an infinite loop, as follows:

The data grows without bound and the loop never ends.

3.2 Orange

Open source machine learning and data visualization. Build data analysis workflows visually, with a large, diverse toolbox.

3.2.1 Application

Orange is better suited to ML-related work. It is a bit like our AI Flow, but it also integrates data flow, data exploration, and chart analysis, so there is no need to go to separate pages to configure, process, and view things: data is viewed, processed, and analyzed within the workflow itself. Here is a simple screenshot:

One very interesting point: its connections can pass either the full data or only the selected data, as shown in the figure below:

The way data is passed is then reflected on the connection. As for the connections themselves, endpoints are displayed as arcs (my guess is that arcs enlarge the area of a node that can accept connections and also suit the circular nodes). An arc is a solid line if it has a connection and a dotted line if it does not, which makes status very easy to read.

3.2.2 Summary

Orange's functional integration is very powerful. Beyond basic data conversion, it also offers charts, models, and evaluation, which makes it well suited to AI-oriented data analysis.

4. Summary and thinking

The concept of workflow was proposed long ago. It is an abstraction, generalization, and description of a process and the business rules between its operation steps. Workflows make our processes standardized and their steps clear. Workflows built on top of data development spare us a series of repetitive operations, and displaying them as a DAG makes the process more intuitive. Of course, a DAG need not be confined to systems with ordering constraints such as scheduling. It can also be used to display causal relationships such as data lineage, or family trees. Taking it one step further, it can even be applied to a data processing network, where data flows from one point to another. In that case it does not necessarily have to be displayed visually, since only the concept is needed.

4.1 Possibilities

Our workflow has powerful scheduling and configuration capabilities, but it is constrained by a limited set of functional nodes. If we supported custom-configured nodes, users would have more room for imagination in data development and would no longer be limited to the existing nodes when developing workflows.


Origin blog.csdn.net/GUANDATA_/article/details/125935346