Description of GIS data model (for Mr. Luo)

This document is for the reference of Mr. Luo and his students only.

1. The perspective of machine learning

1.1 Matrix representation of structured data

In the field of machine learning, the most basic and commonly used data description is the $n \times m$ matrix $\mathbf{X}$, where $n$ is the number of objects (an object is also called an instance), $m$ is the number of features (a feature is also called an attribute), and $x_{ij}$ denotes the $j$-th feature value of the $i$-th object, which is a real number. Example:
$$\mathbf{X} = \left[ \begin{array}{ccc} 29.35 & 106.33 & 19.2 \\ 29.33 & 106.34 & 19.8 \\ 29.36 & 106.35 & 19.6 \\ 29.32 & 106.32 & 20.5 \end{array} \right]$$
It has 4 rows, each representing a plot in Chongqing, and 3 columns, representing longitude, latitude, and mean annual temperature. Essentially, the rows are independent and unordered, so swapping two rows does not change the data itself.
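As a minimal sketch, the example matrix and the row-order convention can be written down in plain Python (the values are those of the example above):

```python
# The n-by-m data matrix from Section 1.1 as a list of rows; column order is
# longitude, latitude, mean annual temperature.
X = [
    [29.35, 106.33, 19.2],
    [29.33, 106.34, 19.8],
    [29.36, 106.35, 19.6],
    [29.32, 106.32, 20.5],
]

n = len(X)      # number of objects (rows)
m = len(X[0])   # number of features (columns)

# Rows are unordered: swapping two rows yields the same data set.
X_swapped = [X[1], X[0], X[2], X[3]]
same_as_set = sorted(map(tuple, X)) == sorted(map(tuple, X_swapped))
print(n, m, same_as_set)  # 4 3 True
```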
The advantages of this description include:

  1. It is concise;
  2. It supports matrix operations;
  3. It corresponds to a two-dimensional table in a relational database, where each feature is atomic, i.e., it cannot be divided further.

The disadvantages include:

  1. Only real-valued data are supported;
  2. Relationships between data cannot be described.

1.2 Tuple representation of structured data

In the field of granular computing, the most common data description is the pair $S = (\mathbf{U}, \mathbf{A})$, where $\mathbf{U} = \{x_1, \dots, x_n\}$ is the set of objects and $\mathbf{A} = \{a_1, \dots, a_m\}$ is the set of attributes.

Table 1. A data table

| $\mathbf{U}$ | $a_1$ (coordinate) | $a_2$ (air temperature) | $a_3$ (crop) |
|---|---|---|---|
| $x_1$ | $(29.35, 106.33)$ | $19.2 \pm 3.6$ | rice |
| $x_2$ | $(29.33, 106.34)$ | $19.8 \pm 3.2$ | corn |
| $x_3$ | $(29.36, 106.35)$ | $19.6 \pm 4.5$ | sugar cane |
| $x_4$ | $(29.32, 106.32)$ | $20.5 \pm 3.3$ | corn |

Advantages of this description over the matrix description include:

  1. $\mathbf{U}$ is a set, so it naturally expresses that the objects are unordered. As mentioned earlier, this is only an implicit convention in the matrix notation;
  2. $a_j(x_i)$ can be a real number, an integer (or enumeration type), a Boolean (actually also an enumeration type), an interval, a fuzzy value, a set, etc. In Table 1, $a_1(x_1) = (29.35, 106.33)$ gives the latitude and longitude of the plot. Note that $(29.35, 106.33)$ here is one datum, not two.

Disadvantages of this description include:

  1. $a_j$ is both an attribute and a function, which is not very elegant;
  2. $a_j(x_i)$ is not necessarily atomic, so it cannot correspond to a two-dimensional table in a traditional relational database; it can only be supported by an object-oriented database.

To overcome disadvantage 1, one writes the complete form $S = (\mathbf{U}, \mathbf{A}, \mathbf{V}, I)$, where $\mathbf{V}$ is the set of all attribute values and $I: \mathbf{U} \times \mathbf{A} \to \mathbf{V}$ is the information function, e.g., $I(x_1, a_1) = (29.35, 106.33)$.
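As an illustration, Table 1 and the information function $I$ can be sketched directly in Python; the dict-based encoding is my own choice, not part of the formal definition:

```python
# A sketch of the complete information system S = (U, A, V, I) from Table 1,
# with I stored as a dict keyed by (object, attribute). Attribute values need
# not be atomic: coordinates are pairs, temperatures are (mean, spread) pairs.
U = ["x1", "x2", "x3", "x4"]
A = ["coordinate", "air_temperature", "crop"]

I = {
    ("x1", "coordinate"): (29.35, 106.33),
    ("x1", "air_temperature"): (19.2, 3.6),   # 19.2 +/- 3.6, an interval-type value
    ("x1", "crop"): "rice",
    ("x2", "coordinate"): (29.33, 106.34),
    ("x2", "air_temperature"): (19.8, 3.2),
    ("x2", "crop"): "corn",
    ("x3", "coordinate"): (29.36, 106.35),
    ("x3", "air_temperature"): (19.6, 4.5),
    ("x3", "crop"): "sugar cane",
    ("x4", "coordinate"): (29.32, 106.32),
    ("x4", "air_temperature"): (20.5, 3.3),
    ("x4", "crop"): "corn",
}

# V is simply the set of all attribute values that actually occur.
V = set(I.values())

print(I[("x1", "coordinate")])  # (29.35, 106.33)
```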

1.3 Structural model of GIS data (illustration only)

In the field of GIS, at least two sets of attributes are needed to describe a plot. The static information of a plot object (coordinates, shape, boundary) can be separated from the dynamic information (temperature, sunlight, crops), and simple data types (representable by integers or real numbers) can be separated from complex data types (such as shapes with boundaries). Definition 1 is written along these lines. Based on this data model, mainstream machine learning methods can be used directly.

1.4 Graphical models for GIS data (illustration only)

The adjacency relationship between plots may be very important, so Definition 2 uses a graph model to describe it. If two plots are adjacent, the corresponding nodes are connected by an edge (edge weights are not considered for the time being). Each node has its own static and dynamic attributes.
This data model is in fact a knowledge graph, though unfortunately I have not studied it in depth. Its common application field is social networks, where each node corresponds to a person; there are many established ways to process such data.

Figure 1. Example of a graph structure. The node set is $\mathbf{V} = \{v_1, v_2, \dots, v_9\}$, represented by numbers. $v_1$ and $v_2$ are connected by an edge, indicating that the two areas are adjacent. Each node has its own attributes; only the attributes of $v_1$ are shown here.
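A minimal sketch of such a graph model in plain Python; only the $v_1$-$v_2$ edge is stated in the caption, and the attribute values are copied from Table 1 for illustration:

```python
# Each node carries its own static and dynamic attributes, as in Figure 1.
nodes = {
    "v1": {"static": {"coordinate": (29.35, 106.33)},
           "dynamic": {"air_temperature": (19.2, 3.6), "crop": "rice"}},
    "v2": {"static": {"coordinate": (29.33, 106.34)},
           "dynamic": {"air_temperature": (19.8, 3.2), "crop": "corn"}},
}

# Undirected, unweighted adjacency: v1 and v2 are neighbouring plots.
edges = {("v1", "v2")}

def adjacent(u, v):
    """Two plots are adjacent if either orientation of the pair is an edge."""
    return (u, v) in edges or (v, u) in edges

print(adjacent("v1", "v2"))  # True
```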

1.5 Hierarchical model of GIS data (illustration only)

To represent that a large plot contains several small plots, a hierarchical (tree) model is needed: the root of the tree is the entire area, and the leaf nodes are the smallest plot units.
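A minimal sketch of the hierarchical model as nested dicts; the region and plot names are hypothetical:

```python
# Root = whole region, leaves = smallest plot units.
tree = {
    "region": {
        "district_A": {"plot_1": {}, "plot_2": {}},
        "district_B": {"plot_3": {}, "plot_4": {}},
    }
}

def leaves(node):
    """Collect leaf names (smallest plot units) under a subtree."""
    result = []
    for name, child in node.items():
        result.extend([name] if not child else leaves(child))
    return result

print(leaves(tree))  # ['plot_1', 'plot_2', 'plot_3', 'plot_4']
```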

1.6 Completion of machine learning tasks

Taking the simplest clustering problem as an example:

  • On the structured model, distances between objects can be defined, and then algorithms such as kMeans can be used for clustering.
  • On the knowledge graph, there is a series of dedicated clustering algorithms.
  • Each layer of the hierarchical model can be specified by experts or users; the advantage is good semantics (such as cities, districts, and counties). The hierarchy can also be obtained by clustering the knowledge graph hierarchically; the advantage is that it conforms to the data itself, but in this case it is best to fix the depth of the leaf nodes. At present I believe the hierarchical model is the result of clustering the graph model, but I have not yet figured out how to further cluster the hierarchical model. Figure 2 shows the result obtained from Figure 1: constructing tree-shaped data from a knowledge graph or basic structured data is itself a granulation process (the idea of granular computing).

Figure 2. The hierarchical model obtained by hierarchical clustering of Figure 1. Nodes 10-13 are branch nodes, representing regions. The model has 3 layers.
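The kMeans route mentioned in the first bullet can be sketched in plain Python over the plot coordinates of Section 1.1, with Euclidean distance and the first $k$ points as initial centers (an arbitrary choice for illustration):

```python
import math

X = [(29.35, 106.33), (29.33, 106.34), (29.36, 106.35), (29.32, 106.32)]

def dist(p, q):
    """Euclidean distance between two objects, as defined on the structured model."""
    return math.dist(p, q)

def kmeans(points, k, iters=10):
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        # Move each center to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return clusters

clusters = kmeans(X, 2)
print(sum(len(c) for c in clusters))  # 4: every plot is assigned to a cluster
```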

2. GIS perspective

The thinking of geography experts is not necessarily consistent with that of machine learning; how to seek common ground while reserving differences is the key.

2.1 Reliable calculation of irregular grid granulation expression

The concept of a "grid" is often used to describe planar areas, but is rarely used outside geographic information systems. The reasons include:

  1. Data for machine learning can be viewed as points in a high-dimensional space ($\mathbf{X}$ in Section 1.1 corresponds to a three-dimensional space); there is no need to divide the space into multiple cubic regions, and many such regions may not contain any data points.
  2. The general idea is to cluster these points into clusters, rather than to divide the space itself.
  3. A grid can only handle the information of a planar area, so its expressive ability is limited; other information (such as crop types) cannot be reflected.

"Irregularity" may be a problem on a plane (or sphere), but from the perspective of graph theory, plots are abstracted as nodes, and the shape of a plot is an attribute of the node (it can be expressed by a closed curve). As shown in Figure 1, it is quite natural to express this with a knowledge graph.

"Granulation" has at least two aspects:

  1. For spatial nodes, granulation can clarify their relationships, as shown in Figure 2.
  2. For the features of nodes, granulation can support different kinds of processing. For example, the temperature of a node may be represented as a time series within a year, and the units supported by granulation include hours, days, months, seasons, etc. The crops of a node can also be granulated.
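The temporal granulation in point 2 can be sketched as re-aggregating an hourly series into daily granules; all readings below are made up:

```python
# Hypothetical hourly temperatures for two days: (day, hour, temperature).
hourly = [(day, hour, 15.0 + hour % 12) for day in (1, 2) for hour in range(24)]

def granulate_by_day(series):
    """Average the hourly readings within each day: a coarser granule.
    Months or seasons would be handled the same way with a different key."""
    sums, counts = {}, {}
    for day, _, temp in series:
        sums[day] = sums.get(day, 0.0) + temp
        counts[day] = counts.get(day, 0) + 1
    return {day: sums[day] / counts[day] for day in sums}

daily = granulate_by_day(hourly)
print(daily)  # {1: 20.5, 2: 20.5}
```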

"Reliable calculation" should be related to the uncertainty of each attribute. For example, the rapeseed yield at a certain node may be between 200 and 300 kg. When estimating, some plots are estimated high and some low, but the sum tends toward a more appropriate value. This is similar to my estimate of my own college entrance examination score: higher in mathematics and lower in English, but the final error was only 4 points (at the risk of showing off).
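The cancellation effect can be illustrated numerically; all yields and error ranges below are made up:

```python
import random

# Each plot's yield estimate is off by a symmetric random amount, but the sum
# over many plots is relatively much closer to the truth.
random.seed(0)

true_yields = [250.0] * 100                      # e.g. 100 plots, 250 kg each
estimates = [y + random.uniform(-50, 50) for y in true_yields]

per_plot_rel_error = sum(abs(e - y) / y for e, y in zip(estimates, true_yields)) / 100
total_rel_error = abs(sum(estimates) - sum(true_yields)) / sum(true_yields)

print(round(per_plot_rel_error, 3), round(total_rel_error, 3))
```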

2.2 Uncertain Analysis Problems of Time-Series Feature Recombination Under Space Shape Constraints

"Spatial shape" may have a meaning similar to the earlier "irregularity".
"Constraint" may indicate a closed area.
"Time-series features" refer to dynamic features such as temperature and humidity. In fact, temporality and dynamics are the important properties that make GIS data meaningful. Time series and multivariate time series have their own modeling methods, and fusing them with other static data in a single model has good practical significance.
"Feature recombination" may mean feature selection (selecting the 10 features that really work out of 100) or feature extraction (extracting 10 new features from 100; they can be linear combinations of the original features, which PCA can do, or nonlinear, where deep learning is the most powerful).
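As a sketch of the feature-extraction case, PCA on two correlated features can be computed in closed form for a $2 \times 2$ covariance matrix; the data below are made up:

```python
import math

# Two correlated features for four objects.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Covariance matrix entries [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x, _ in data) / n
c = sum((y - my) ** 2 for _, y in data) / n
b = sum((x - mx) * (y - my) for x, y in data) / n

# Largest eigenvalue and its normalised eigenvector (closed form for 2x2).
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
v = (b, lam - a)
norm = math.hypot(*v)
v = (v[0] / norm, v[1] / norm)

# The extracted feature: projection of each (centred) object onto v.
# It is a linear combination of the originals, as described in the text.
extracted = [round((x - mx) * v[0] + (y - my) * v[1], 3) for x, y in data]
print(extracted)
```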
"Uncertainty" is the core of machine learning, because deterministic problems are already handled by scientific computing and database systems.
"Uncertainty analysis" or "uncertainty modeling" describes uncertainty so as to facilitate quantitative analysis.

2.3 Optimizable problems for active learning under confidence control

"Active learning" is itself an important research direction. It refers to methods that consult experts for data or labels through human-computer interaction, so as to achieve better results with fewer data (labels).
"Confidence control" refers to setting a confidence level according to the specific needs of the application (in practice, 100% accuracy is not required) to control the learning process.
Combining the two: control the number of data and labels queried by active learning so that the given confidence level is met.
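A toy illustration of the idea: if labels along a line are monotone (0 below an unknown boundary, 1 above), querying the midpoint each round locates the boundary to a required precision, which here stands in for the confidence level, with only logarithmically many label queries. The scenario and numbers are hypothetical:

```python
def oracle(x, threshold=0.37):
    """A human expert who labels a queried point (costly in practice)."""
    return int(x >= threshold)

def active_learn(lo=0.0, hi=1.0, width=0.01):
    """Query midpoints until the boundary is located to the required precision."""
    queries = 0
    while hi - lo > width:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, queries

boundary, queries = active_learn()
print(round(boundary, 2), queries)  # 0.37 7: the boundary found with 7 label queries
```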
Some of my previous papers are related to this:

  1. Yan-Xue Wu, Xue-Yang Min, Fan Min, Min Wang. Cost-sensitive active learning with a label uniform distribution model. International Journal of Approximate Reasoning 105 (2019) 49-65. Uses query cost and misclassification cost to control the active learning process.
  2. Fan Min, Qing-Hua Hu, William Zhu. Feature selection with test cost constraint. International Journal of Approximate Reasoning 55(1) (2014) 167-179. Uses test cost constraints to optimize the effect of attribute selection, reducing the attribute selection problem to a constraint satisfaction problem (CSP).

See http://www.fansmale.com/publications.html for more papers, especially on active learning.


Origin blog.csdn.net/minfanphd/article/details/129870924