Understanding Data Thoroughly (1): Data Sources

I. Introduction

In daily work and life we often hear questions like these: Do you have data to back that up? Where does your data come from? Is the data noisy?

So what exactly does "data" mean here?

Baidu Encyclopedia's definition of data is simple: data is the result of facts or observations, a logical summary of objective things, and the raw material used to represent them.

But think about it carefully: is what we call data in daily life really data? In fact, we are more often referring to "data knowledge" that has formed a system, has a logical structure, and is practical.

So we cannot treat data as a simple concept; there is a lot to learn about "data".

Let me first introduce four terms related to "data". Later, I will elaborate on the methodology for realizing their value.

Do you really understand what data, information, knowledge, and insight are?

  • Data (vegetables bought at the market): simple facts that are unprocessed, unorganized, and raw.
  • Information (vegetables trimmed and washed): data that has been organized and processed into a structure, and that is relevant and useful in a given scenario and context.
  • Knowledge: a map of information linked together through learning and experience, with the ability to predict, decide, and generalize.
  • Insight (being able to teach others how to cook): the ability to understand complex problems or situations accurately and deeply (which tools can help with).

Today, Xiao Chen will walk you through data sources and their specific types. After all, as the saying goes, know yourself and know your enemy, and you will never be defeated in a hundred battles. With today's groundwork, the next few installments will be much easier to follow~

II. Data sources (the vegetable market)

If data is the raw material we need for cooking, then choosing a data source is like deciding which market to go to before heading out to shop, and every market has its own specialty! You go to the seafood market for seafood and to the poultry market for poultry... The same goes for data: you need to filter data sources by the field you are working in. After all, ensuring data quality is the first step toward cooking a delicious dish~

As mentioned earlier, data is a broad concept. To make good use of it, we must first know the type of data we need, and then determine the source and collect the data based on that type.

1. Classifying data sources by degree of structure

1) Unstructured data

Unstructured data is the simplest form of data. It is all around us and almost always within reach: text, pictures, sound, and video are all unstructured data. This type of data is usually stored in a file repository (you can think of it as a well-organized directory on a computer's hard drive).

However, extracting value from data in this form is usually the hardest, because we first need to extract structured features from descriptive or abstract data. To use a piece of text, for example, we may need to extract its topic and whether its attitude toward that topic is positive or negative. And just as a thousand readers see a thousand Hamlets, this kind of information is highly subjective.

Text mining, which is very popular at present, takes exactly this kind of unstructured data as its source.
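
To make the idea of extracting structured features from text concrete, here is a minimal Python sketch. It is not a real text mining pipeline: the word lists, the `extract_features` function, and the sample review are all made up for illustration, and real projects would use a proper lexicon or a trained model.

```python
import re
from collections import Counter

# Tiny hand-made word lists, for illustration only (not a real sentiment lexicon).
POSITIVE = {"fresh", "delicious", "great", "friendly"}
NEGATIVE = {"stale", "expensive", "rude", "bad"}
STOPWORDS = {"the", "a", "an", "and", "but", "is", "was", "were", "at", "this", "of", "to"}


def extract_features(text: str) -> dict:
    """Turn one piece of free text into a structured row of features."""
    words = re.findall(r"[a-z']+", text.lower())
    content = [w for w in words if w not in STOPWORDS]
    pos = sum(w in POSITIVE for w in content)
    neg = sum(w in NEGATIVE for w in content)
    if pos > neg:
        sentiment = "positive"
    elif neg > pos:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {
        "word_count": len(words),
        # Most frequent non-stopwords serve as a crude stand-in for "topic".
        "top_words": [w for w, _ in Counter(content).most_common(3)],
        "sentiment": sentiment,
    }


# Hypothetical review text.
review = "The seafood at this market was fresh and delicious, but the staff were rude."
print(extract_features(review))
```

Even this toy version shows why the result is subjective: a different word list would label the same review differently.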

2) Structured data

Structured data, as the name suggests, is well-defined tabular data (rows and columns), which means we know which columns it has and what type of data each column contains. Such data is usually stored in a database, where we can use SQL to filter it and easily create data sets for our data science solutions.
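
As a rough illustration of filtering structured data with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, its rows, and the `total_by_category` helper are invented for this example.

```python
import sqlite3

# In-memory database with a made-up "sales" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, category TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (product, category, amount) VALUES (?, ?, ?)",
    [
        ("shrimp", "seafood", 120.0),
        ("crab", "seafood", 200.0),
        ("chicken", "poultry", 80.0),
        ("duck", "poultry", 95.0),
    ],
)


def total_by_category(connection, min_amount):
    """Keep rows with amount >= min_amount, then sum the amounts per category."""
    cursor = connection.execute(
        "SELECT category, SUM(amount) FROM sales "
        "WHERE amount >= ? GROUP BY category ORDER BY category",
        (min_amount,),
    )
    return cursor.fetchall()


print(total_by_category(conn, 90))
```

Because the columns and their types are known in advance, the filter and the aggregation each take a single clause of the query.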

3) Semi-structured data

Semi-structured data sits between unstructured and structured data. Although it follows a consistent format, the structure is not strict: for example, some fields may be missing or may have different types. Semi-structured data is usually stored as files; however, certain types of semi-structured data (such as JSON or XML) can be stored in document-oriented databases.
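
The following Python sketch shows what "consistent format, loose structure" can look like in practice, and one possible way to tidy it up. The JSON records and the `normalize` helper are made up for illustration; the handling of missing fields is an assumption, not a fixed rule.

```python
import json

# Made-up records: same general format, but fields may be missing or differently typed.
raw = """
[
  {"id": 1, "name": "shrimp", "price": 12.5, "tags": ["seafood", "fresh"]},
  {"id": 2, "name": "chicken", "price": "8.0"},
  {"id": 3, "name": "duck", "tags": ["poultry"]}
]
"""


def normalize(record: dict) -> dict:
    """Coerce one loosely structured record into a fixed set of columns."""
    price = record.get("price")
    return {
        "id": record["id"],
        "name": record["name"],
        # Price may be a number, a numeric string, or absent altogether.
        "price": float(price) if price is not None else None,
        # A missing tag list becomes an empty list.
        "tags": record.get("tags", []),
    }


rows = [normalize(r) for r in json.loads(raw)]
for row in rows:
    print(row)
```

After normalization every row has the same four columns, so the data could be loaded into a table like any other structured data.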

2. Classifying data sources by data privacy

1) Data sources within the organization (closed data sources)

The first place to look for data is inside the organization. Most companies today run ERP, CRM, workflow management, and similar systems. These systems usually store their data in databases in a structured way, and those databases hold a large amount of data from which value can be extracted fairly easily. For example, data from a workflow management system can reveal the bottlenecks in a business process, and data from an ERP system can be used to make sales forecasts.
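
As a small sketch of the bottleneck example, the Python code below averages the time spent in each step of a workflow and picks the slowest one. The log format, the step names, and the numbers are all invented; a real workflow management system would expose its own schema.

```python
from collections import defaultdict
from statistics import mean

# Made-up workflow log: (case id, step name, hours spent in that step).
workflow_log = [
    ("case-1", "submit", 1.0),
    ("case-1", "review", 30.0),
    ("case-1", "approve", 4.0),
    ("case-2", "submit", 2.0),
    ("case-2", "review", 42.0),
    ("case-2", "approve", 6.0),
]


def average_hours_per_step(log):
    """Group durations by step name and average them."""
    durations = defaultdict(list)
    for _case, step, hours in log:
        durations[step].append(hours)
    return {step: mean(values) for step, values in durations.items()}


averages = average_hours_per_step(workflow_log)
# The step with the highest average duration is the likely bottleneck.
bottleneck = max(averages, key=averages.get)
print(averages)
print("Likely bottleneck:", bottleneck)
```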

2) Public data sources (open data sources)

In addition to internal, non-public data, many organizations receive and send a large number of files, pictures, sound recordings, or videos. Data of this kind that is distributed and retained on the public Internet is a public data source. For example, imagine an insurance company that receives a large number of claims, which may be accompanied by pictures (on paper or in PDF format). These files are usually converted manually into a more structured format before processing, but some information is lost in that conversion. When trying to improve our data science solution, we can go back to these files to extract additional data, such as an overview of the scenario.

Later, we can use this additional data to improve the detection of fraudulent claims. That is the value of public data sources.

Beyond these, the industry uses many other classifications of data sources, such as whether the data is real-time, or whether the source is primary or secondary...

Origin blog.csdn.net/amumuum/article/details/112801902