How to manage unstructured data?

Editor: Peng Wenhua

Source: Big Data Architect

Hello friends, I am Lao Peng. Several projects I have run into recently all turned out to be inseparable from unstructured data.

When Lao Peng had just graduated, he worked on databases, and everything was structured data. Later Hadoop came along, which could be used to process semi-structured data from the Internet of Things and the Internet.

Unstructured-data scenarios were still relatively rare. Most projects picked just one or two spots where it was used, more as a showcase than anything else.

But after so many years of development, most enterprises now have at least a basic foundation for structured data, while governance of unstructured data is still a blank. Let's talk about this topic today.

Unstructured Data

The unstructured data mentioned here refers specifically to:

1. Various documents such as official documents and research reports

2. Surveillance video and other audio and video

3. Various special documents such as design drawings

Just thinking about these things is a headache. Unlike structured data in databases, this kind of data has far more problems. A few come to mind right away:

1. There is no unified storage (attachments everywhere, files passed around over WeChat)

2. There is no unified standard (everyone writes documents in their own way)

3. There are many types of data (everything that is not structured counts as unstructured or semi-structured)

4. It is a lawless zone and a governance blind spot (the first method I ever learned for handling unstructured data was TF-IDF word-frequency statistics, and the first unstructured-data application I ever saw was a word cloud)

5. No one manages it, and no one knows how to manage it (very, very few companies have a file management room and file managers)

If you like, I could list eight or ten more. In short, this is a huge pit!
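For readers who have not met the TF-IDF word-frequency statistics mentioned in point 4, here is a minimal sketch using only the Python standard library. The sample documents and the helper names (`tokenize`, `tf_idf`, `top_terms`) are made up for illustration, and the tokenizer only handles plain English words; real documents would need proper word segmentation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / total tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical sample documents, invented for this sketch.
docs = [
    "the invoice lists the invoice total",
    "the research report covers market demand",
    "the design drawing shows the bridge design",
]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    print(top_terms(score, 2), "<-", doc)
```

A word that appears in every document, such as "the" here, gets a score of zero, while a word repeated inside a single document rises to the top. That is roughly all a word cloud needs, which is also why it says so little about governance.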

Unstructured Data Governance

In fact, in my view, the unstructured data of most enterprises is far from meeting the preconditions for "governance".

The data has not even been gathered yet. It is scattered all over the place, so how are you supposed to govern it?

For structured data, we know that we need an inventory, standards, master data, indicators, and quality control, because we know the data sits in those databases.

No matter how many databases or tables there are, we know that the data is there. But unstructured data is different! Who knows where it is!

A company with a file management room or a knowledge center is already doing well. Complete or not, at least there is one centralized place.

But more often the data is stored all over: OA systems, mailboxes, cloud drives, personal storage, everywhere! It is impossible to handle!

So, if you want to do unstructured data governance, what is the first step? A data inventory? Data aggregation?

NONONONONO!

The first thing to do is to sort out the distribution of unstructured data in the enterprise, and know which ones are the focus of our governance!

Just ask yourself: with so many kinds of unstructured data, which are plentiful and which are scarce? Which are important and which are secondary? Which should be governed first, and which later? Which have the greatest impact on the business, and which the least? Which are of great value, and which of little?

If you have not figured out these questions and simply put your head down and work, who knows whether it will pay off after all that effort?
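As a starting point for the "which are more, which are less" question, a first pass over one storage location could be as simple as the sketch below. It assumes the data is reachable as an ordinary file tree; the function names and the demo files are invented for illustration, and OA systems or mailboxes would need their own export step first.

```python
import os
import tempfile
from collections import defaultdict


def survey(root):
    """Walk `root` and return {extension: [file_count, total_bytes]}."""
    stats = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            stats[ext][0] += 1
            stats[ext][1] += size
    return dict(stats)


def rank_by_volume(stats):
    """Sort extensions by total bytes, largest first."""
    return sorted(stats.items(), key=lambda kv: (-kv[1][1], kv[0]))


def build_demo_tree(root):
    """Create a few dummy files so the sketch runs anywhere (made-up names)."""
    files = {
        "oa/report.docx": 300,
        "oa/notice.docx": 200,
        "mail/scan.PDF": 1000,
        "disk/plan.dwg": 50,
    }
    for rel, size in files.items():
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(b"\0" * size)


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_tree(demo_root)
    demo_stats = survey(demo_root)
    for ext, (count, size) in rank_by_volume(demo_stats):
        print(f"{ext:8} files={count:3} bytes={size}")
```

Counts and sizes only answer the volume question. Importance, business impact, and value still have to come from the people who use the data, so a table like this is something to take into those conversations, not a substitute for them.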

You may ask, after figuring this out, shouldn’t it be time to gather data?

NONONONONO!

Still no. As always, you need something to pull the work forward, and in general it is best to let an application do the pulling. It is the same logic as data warehouse construction: building bottom-up shows results quickly.

The first project must be a quick win, to give everyone confidence. Otherwise it drags on indefinitely, and no one can stand that.

Therefore, the second step should be to define a suitable application based on the business, quickly collect some data, use NLP and other technologies to turn the unstructured data into structured data, then process it with database, big data, graph computing, and other technologies, and deliver one or two applications with visible results.
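To make "structure the unstructured data, then process it with a database" concrete, here is a deliberately small sketch. A regular expression stands in for the NLP step and an in-memory SQLite database stands in for the database; both are simplifications chosen for this example, and the notice texts, the pattern, and the table layout are all hypothetical.

```python
import re
import sqlite3

# Hypothetical notice texts, invented for this sketch.
NOTICES = [
    "Notice No. 2023-014 issued by Finance Department on 2023-03-02: travel budget update.",
    "Notice No. 2023-027 issued by Safety Office on 2023-05-18: monthly inspection schedule.",
    "Handwritten memo with no recognizable header.",
]

# A regex is only a stand-in for real NLP; it works because the
# sample notices follow one fixed sentence pattern.
PATTERN = re.compile(
    r"Notice No\. (?P<number>\d{4}-\d{3}) issued by (?P<issuer>.+?) "
    r"on (?P<date>\d{4}-\d{2}-\d{2}): (?P<subject>.+?)\.?$"
)


def extract(text):
    """Return the structured fields of one notice, or None if it does not parse."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


def load(conn, texts):
    """Insert every parsable notice into a table; return the texts that failed."""
    conn.execute(
        "CREATE TABLE notice ("
        "number TEXT PRIMARY KEY, issuer TEXT, issued_on TEXT, subject TEXT)"
    )
    unparsed = []
    for text in texts:
        row = extract(text)
        if row is None:
            unparsed.append(text)  # keep failures for manual review
            continue
        conn.execute(
            "INSERT INTO notice VALUES (:number, :issuer, :date, :subject)", row
        )
    conn.commit()
    return unparsed


conn = sqlite3.connect(":memory:")
unparsed = load(conn, NOTICES)

# Once structured, the documents can be queried like any other table.
rows = conn.execute(
    "SELECT issuer, subject FROM notice "
    "WHERE issued_on >= '2023-04-01' ORDER BY issued_on"
).fetchall()
print(rows)
print("needs manual review:", len(unparsed))
```

The list of texts that failed to parse matters as much as the table. In a first quick-win project it shows how much of the collected data the chosen technique can actually cover.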

For example:

In the reimbursement scenario, OCR is used to recognize the invoices, and RPA is used for invoice verification and data proofreading, so that reimbursement and bookkeeping can be done quickly.

This frees colleagues from the time they used to spend on reimbursement.
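The verification and proofreading step in this reimbursement scenario might look like the sketch below. It assumes the OCR engine has already produced plain text; no real OCR or RPA product is called. The invoice layout, the field patterns, and the `Claim` type are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Claim:
    """What the employee typed into the reimbursement form (hypothetical)."""
    invoice_no: str
    amount: Decimal


# Assumed invoice layout: an 8-digit number and a total with two decimals.
INVOICE_NO = re.compile(r"Invoice No[.:]?\s*(\d{8})")
TOTAL = re.compile(r"Total[.:]?\s*(\d+\.\d{2})")


def parse_ocr_text(text):
    """Pull the invoice number and total out of OCR text, or return None."""
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    if not number or not total:
        return None
    return number.group(1), Decimal(total.group(1))


def verify(claim, ocr_text, seen_invoices):
    """Return a list of problems; an empty list means these checks passed."""
    parsed = parse_ocr_text(ocr_text)
    if parsed is None:
        return ["unreadable invoice"]
    invoice_no, total = parsed
    problems = []
    if invoice_no != claim.invoice_no:
        problems.append("invoice number mismatch")
    if total != claim.amount:
        problems.append("amount mismatch")
    if invoice_no in seen_invoices:
        problems.append("duplicate invoice")
    seen_invoices.add(invoice_no)  # remember it so resubmissions are caught
    return problems


# Hypothetical OCR output for one hotel invoice.
ocr_text = "ACME Hotel\nInvoice No: 12345678\nTotal: 560.00\n"
seen = set()

first = verify(Claim("12345678", Decimal("560.00")), ocr_text, seen)
second = verify(Claim("12345678", Decimal("650.00")), ocr_text, seen)
print(first)
print(second)
```

The first claim passes. The second reuses the same invoice with a mistyped amount, so it is flagged twice. Amounts are compared as `Decimal` values so that currency figures are not subject to floating-point rounding.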

Summary

Managing unstructured data is very, very difficult. In both technology and management, it is orders of magnitude harder than structured data. The way of working is also completely different, so be careful! Be careful!!!


Typesetting | Lao Peng

Reviewer | Lao Peng Editor-in-Chief | Lao Peng
