A powerful tool for text data analysis: IBM BigInsights Text Analytics

The value of textual data analysis

Text data is everywhere in our lives: thoughts posted on WeChat Moments and Weibo, product reviews posted on forums, machine logs generated automatically by application back ends, and so on. This kind of data contains a lot of useful information, but textual expression is very flexible: the same information can be conveyed accurately in many different ways, without strictly following grammar. For example, the picture below shows three different ways of expressing a person's age.

In the example, the key pieces of information are the name and the age. The structured data on the right side of the figure is the form that captures this core information and that an application can process. How to convert unstructured text data into structured data that accurately expresses the information is a major problem in text data analysis.
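As a minimal sketch of this conversion, the Python snippet below turns three differently worded sentences into the same structured record of name and age. The sentences and the regular expressions are hypothetical illustrations, not the examples from the original figure, and a real extractor would need far more patterns.

```python
import re

# Hypothetical sentences: three ways of stating a person's age.
SENTENCES = [
    "Tom is 25 years old.",
    "Mary, aged 31, joined the team last year.",
    "Jack just celebrated his 40th birthday.",
]

# One assumed pattern per phrasing; each captures a name and an age.
PATTERNS = [
    re.compile(r"(?P<name>[A-Z][a-z]+) is (?P<age>\d{1,3}) years old"),
    re.compile(r"(?P<name>[A-Z][a-z]+), aged (?P<age>\d{1,3})\b"),
    re.compile(r"(?P<name>[A-Z][a-z]+) .*?\b(?P<age>\d{1,3})(?:st|nd|rd|th) birthday"),
]


def extract_age_records(sentences):
    """Return one {"name", "age"} record per sentence that matches a pattern."""
    records = []
    for sentence in sentences:
        for pattern in PATTERNS:
            match = pattern.search(sentence)
            if match:
                records.append(
                    {"name": match.group("name"), "age": int(match.group("age"))}
                )
                break  # the first matching phrasing wins
    return records


records = extract_age_records(SENTENCES)
for record in records:
    print(record)
```

All three sentences end up in the same two-field shape, which is what lets a downstream application process them uniformly.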

Approaches to textual data analysis

Generally, there are two main approaches to text data analysis: one based on syntax (grammar) analysis, and one based on semantic association, which sets grammar aside entirely and analyzes the text through its contextual associations.

Grammar-based analysis divides text data into grammatical elements, such as subject, predicate and object, according to the grammar of the language, and then generates the target information according to grammatical and semantic rules. This method suits scenarios where the text content is relatively standardized.

The approach based on semantic association combines techniques such as word segmentation and dictionaries to label text data, and then generates the final information by applying specific rules or combinations of rules. The implementation roughly includes the following steps:

Commonly used text analysis tools:

At present, there are many tools available for text data analysis. The common ones and their characteristics are as follows:

Pig: A data processing tool with a high-level language that is easy to program and extend; underneath, it uses MapReduce to process data in parallel.

JAQL: A data processing tool for JSON data, well suited to processing JSON.

AQL: Annotation Query Language, a tool for processing text data by annotation (labeling). Its syntax is similar to SQL, it is easy to use, and it has multiple built-in data extractors.

Python Natural Language Toolkit (NLTK): A text analysis library for Python that supports part-of-speech tagging, syntactic analysis, keyword extraction, text classification, sentiment analysis, and more.

Text data analysis tool in BigInsights: BigInsights, IBM's enterprise-level big data product, integrates AQL for text analysis. On top of AQL, IBM developed a graphical text analysis tool, Text Analytics, which makes text analysis much more convenient for users.

AQL introduction:

The processing of text data in the AQL framework goes through three main steps:

1. Data labeling: Use techniques such as dictionaries and regular expressions to label the text data to be analyzed. This step is achieved by defining various data extractors.

2. Rule-based data generation: After labeling, split and group the labeled data, define association rules, and so on, then generate a list of candidate data according to these rules.

3. Data merging and filtering: Apply final processing to the candidate data, such as merging, screening, and filtering out duplicates, to form the final result.
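The three steps above can be sketched in plain Python. This is not AQL and does not use any BigInsights API; the `Span` type, the first-name dictionary, and the "first name followed by a capitalized word" rule are all assumptions chosen only to make each step visible.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int
    text: str
    label: str


FIRST_NAMES = {"Alice", "Bob"}  # hypothetical dictionary


def label(text):
    """Step 1: label the text with a regex (capitalized words) and a dictionary."""
    caps = [
        Span(m.start(), m.end(), m.group(), "CapsWord")
        for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
    ]
    firsts = [s._replace(label="FirstName") for s in caps if s.text in FIRST_NAMES]
    return caps, firsts


def generate_candidates(text, caps, firsts):
    """Step 2: apply rules to the labels to build a list of candidate persons.

    Rule A: a first name alone. Rule B: a first name, one space, a capitalized word.
    """
    candidates = [s._replace(label="Person") for s in firsts]
    for f in firsts:
        for c in caps:
            if c.begin > f.end and text[f.end:c.begin] == " ":
                candidates.append(Span(f.begin, c.end, text[f.begin:c.end], "Person"))
    return candidates


def consolidate(candidates):
    """Step 3: remove duplicates and candidates contained in a longer candidate."""
    unique = sorted(set(candidates))
    return [
        s
        for s in unique
        if not any(o != s and o.begin <= s.begin and s.end <= o.end for o in unique)
    ]


text = "Alice Smith met Bob in Paris. Alice called later."
caps, firsts = label(text)
people = consolidate(generate_candidates(text, caps, firsts))
print([p.text for p in people])
```

In this sketch the standalone "Alice" at the start of the text is dropped during consolidation because the longer candidate "Alice Smith" covers the same characters, while the second "Alice" survives because nothing contains it.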

The AQL data processing process is shown in the following figure:

[Figure: AQL data processing flow]

Processing data with AQL requires learning the AQL syntax, getting familiar with a new environment, and, most importantly, writing code for each text analysis task, so it is not easy. The following figure shows the AQL code needed just to extract numbers from text:

[Figure: AQL code for extracting numbers from text]

Introduction to Text Analytics tools:

Based on AQL, IBM has developed Text Analytics, a text analysis tool with a fully graphical interface. Analysts do not need to write the underlying code; instead, they use the graphical tools to complete text data analysis tasks easily and quickly, which greatly improves the platform's text data analysis capabilities.

The Text Analytics tool interface is similar to Eclipse, and the entire tool is divided into the following areas:

1. Project management area: Different text analysis tasks can be separated into different projects.

2. Document browsing area: Displays the text document being processed; in the processing results, content matched by different labels is marked with different background colors.

3. Canvas area: Text processing rules are created and modified here by dragging and dropping with the mouse.

4. Property area: Displays the properties of the currently selected object and lets you set their values.

5. Results area: Displays the results after processing according to the current text processing rules.

[Figure: Text Analytics tool interface]

When analyzing text data with Text Analytics, all work is done in this interface. Users do not need to care about AQL details and code, nor about the background processing jobs. Text Analytics automatically generates the AQL for the text processing rules and submits jobs to the Hadoop cluster, where the data processing is done.

The following simple example demonstrates how to extract earnings numbers from plain text earnings data.

Text analysis in three easy steps

Step 1: Import the data

After creating a new project, click the plus button in the project area to add a text data source. The tool supports adding files from the local file system or HDFS, and supports data in .zip, .tar, .tgz, .gz and other formats.

[Figure: adding a text data source]

Step 2: Edit Text Data Rules

According to the needs of data processing, drag the required extractor from the "Extractor" menu to the canvas area, and define the attributes and rules in the extractor.

In this example, we only extract simple financial data, so we only need to concatenate three extractors: the literal character "$", the number extractor Number, and the currency unit extractor Currency, as shown in the following figure:

[Figure: the "$", Number, and Currency extractors concatenated on the canvas]
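To make the idea of concatenating three extractors concrete, here is a rough Python analogue. It is only a sketch: the tool itself generates AQL, and the regular expression for numbers and the list of currency units below are assumptions, not the definitions of the built-in Number and Currency extractors.

```python
import re

# Three small building blocks, mirroring the "$" literal, Number, and Currency
# extractors named in the text. The number pattern and unit list are assumed.
DOLLAR = r"\$"
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
CURRENCY_UNITS = ["thousand", "million", "billion"]
UNIT = "(?:" + "|".join(CURRENCY_UNITS) + ")"

# Sequence rule: "$", then a number, then optional whitespace and a unit.
AMOUNT = re.compile(
    rf"{DOLLAR}(?P<number>{NUMBER})\s*(?P<unit>{UNIT})\b", re.IGNORECASE
)


def extract_amounts(text):
    """Return every "$<number> <unit>" sequence found in the text."""
    return [
        {
            "amount": m.group(),
            "number": m.group("number"),
            "unit": m.group("unit").lower(),
        }
        for m in AMOUNT.finditer(text)
    ]


# Hypothetical earnings sentence.
sample = "Revenue rose to $4.2 billion, while net income was $310 million in 2014."
amounts = extract_amounts(sample)
for amount in amounts:
    print(amount)
```

Because the rule requires all three parts in sequence, the bare number "2014" in the sample is not reported as an amount.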

To extract profit data including department names, the following rules need to be defined:

[Figure: rules for extracting profit data with department names]

Step 3: Run and Export Results

Click the Run button in the canvas area. The text analysis results are displayed directly in the result list, and they can be exported for further processing and use.

[Figure: result list after running]

In addition, after running, the document area shows the text with different background colors according to which rules were hit, which makes inspection convenient.

[Figure: document area with matches highlighted]

Summary:

The Text Analytics tool in BigInsights provides zero-programming text analysis through a fully graphical interface, and its integration with Hadoop widens the range of text processing applications. It can help enterprise customers quickly implement a variety of text data analysis applications, such as Internet text data analysis and machine log analysis.

The text analysis results of Text Analytics can be processed and analyzed further. For example, they can be displayed as charts in Cognos, or used as an analysis data source in SPSS. For more details, please refer to Huidu Big Data.
