Open source Java CMS - FreeCMS 2.6 web page information collection


Project address: http://www.freeteam.cn/

Web page information collection

   Supported since FreeCMS 2.1

The target web page information can be captured through simple configuration. Incremental collection, keyword replacement, and scheduled collection are supported. The same collection rule can collect multiple pages (static and dynamic) and various information attributes. Collected information can be approved automatically, and its static information page can be generated automatically.

Collection rule management

Click Collection Rules from the management menu on the left to enter.

Add collection rules

Click the "Add" button below the collection rules list.

After filling in the relevant properties, click the "Save" button.

Collection rule attribute description

Collection rule attributes are divided into five tabs: basic, settings, collection address, collection attributes, and keyword replacement.

In general, filling in the relevant properties on the Basic tab is enough. The other tabs are available if more advanced settings are required.

The main properties are explained below.

Name: The name of the collection rule.

Collected to Column: The column to which the collected information is added.

Page Encoding: The page encoding of the target web page. The default is UTF-8.

Collection address: The address of the target web page. Only one address can be set on the Basic tab. To set more than one, use the collection address tab.

Collection scheduling: Sets the schedule on which the collection operation is executed. This setting is very important: the system executes the collection operation only if collection scheduling is set.

Content list start and end html: The system extracts information attributes by intercepting the text between keywords in the content of the target web page, so it is very important to set the start and end html of the target attribute. Both must be relatively unique on the page so that the system can intercept the target attribute correctly. This attribute is mainly used to intercept the html of the information list on the target page.
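The interception described above can be sketched in Java as follows. This is only an illustration of the idea, not FreeCMS source code: the class name HtmlInterceptSketch, the method between, and the sample page are all assumptions made for the example.

```java
final class HtmlInterceptSketch {

    /**
     * Returns the text between the first occurrence of startHtml and the
     * next occurrence of endHtml, or null if either marker is missing.
     */
    static String between(String page, String startHtml, String endHtml) {
        int start = page.indexOf(startHtml);
        if (start < 0) {
            return null; // start marker not found
        }
        start += startHtml.length();
        int end = page.indexOf(endHtml, start);
        if (end < 0) {
            return null; // end marker not found after the start marker
        }
        return page.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical target page with two lists; only the first is wanted.
        String page = "<body><ul class=\"news\"><li><a href=\"/a/1.html\">One</a></li></ul>"
                + "<ul class=\"nav\"></ul></body>";
        // A unique start marker selects the news list and not the nav list.
        System.out.println(between(page, "<ul class=\"news\">", "</ul>"));
    }
}
```

If the start html were just "<ul", the sketch would still match the first list here, but on a page where the navigation list comes first it would intercept the wrong block. This is why the start and end html should be as unique as possible.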

Content address start and end html: After the content list html is obtained according to the above attribute, this attribute is used to intercept each content address from it.
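Because a content list usually contains many addresses, the interception has to be repeated until no further match is found. A minimal Java sketch of that loop follows. The class name ContentAddressSketch, the method allBetween, and the sample list html are hypothetical, and the real FreeCMS implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class ContentAddressSketch {

    /** Collects every piece of text that lies between startHtml and endHtml. */
    static List<String> allBetween(String html, String startHtml, String endHtml) {
        if (startHtml.isEmpty() || endHtml.isEmpty()) {
            throw new IllegalArgumentException("start and end html must not be empty");
        }
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(startHtml, from);
            if (start < 0) {
                break; // no more start markers
            }
            start += startHtml.length();
            int end = html.indexOf(endHtml, start);
            if (end < 0) {
                break; // unterminated match, stop here
            }
            found.add(html.substring(start, end));
            from = end + endHtml.length(); // continue after this match
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical content list html, as intercepted from a list page.
        String listHtml = "<li><a href=\"/a/1.html\">One</a></li>"
                + "<li><a href=\"/a/2.html\">Two</a></li>";
        for (String address : allBetween(listHtml, "<a href=\"", "\">")) {
            System.out.println(address);
        }
    }
}
```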

Content title start and end html: After a content address is obtained according to the above attribute, the system fetches the web page at that address and then intercepts the content title according to this attribute. The other content-related attributes are set in the same way and are not described in detail below.

Status: The collection rules in the enabled state will be executed by the system.

Capture pictures: Download the pictures in the information content and store them locally.

Auto-approval: Set the collected information to the approved state directly.

Use the click volume of collected information: The default click volume of the collected information is 0. After this attribute and the content click volume start and end html are set, the system intercepts the click volume of the target information and uses it as the click volume of the collected information.

Maximum number of collected contents: No limit by default. If this property is set, the system counts from the collection records how many pieces of information this collection rule has already collected. Once the maximum number is exceeded, the system stops collecting.

Set the first picture as the title picture: If there is a picture in the information content, extract the first picture as the title picture, and set the information as picture information.

Clear HTML tags in content: Clear the HTML tags in the information content and keep the plain text.
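As an illustration of what clearing HTML tags means, the following Java sketch removes everything that looks like a tag with a simple regular expression. It is a deliberately naive assumption for demonstration: the class name ClearTagsSketch is hypothetical, how FreeCMS actually clears tags is not stated here, and a regular expression like this does not handle scripts, comments, or HTML entities properly.

```java
final class ClearTagsSketch {

    /** Removes anything of the form <...> and trims the remaining text. */
    static String clearTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(clearTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```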

Whether to collect when the content is empty: When the content is empty, you can choose not to collect the information.

Use the addition time of the collected information: The default addition time of the collected information is the current time. After this property and the content addition time start and end html are set, the system intercepts the addition time of the target information and uses it as the addition time of the collected information.

Addition time format of the collected information: The default format is yyyy-MM-dd. If the target page shows the addition time in a different format, you need to set the correct date format here.
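The effect of the format setting can be shown with a short Java sketch. It uses the standard java.time API, which is an assumption: FreeCMS may use a different date class internally. The class name AddTimeFormatSketch and the sample dates are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class AddTimeFormatSketch {

    static final String DEFAULT_PATTERN = "yyyy-MM-dd";

    /** Parses the intercepted text with the given pattern, or the default one if none is set. */
    static LocalDate parseAddDate(String text, String pattern) {
        String effective = (pattern == null || pattern.isEmpty()) ? DEFAULT_PATTERN : pattern;
        return LocalDate.parse(text.trim(), DateTimeFormatter.ofPattern(effective));
    }

    public static void main(String[] args) {
        // Target page already uses the default format.
        System.out.println(parseAddDate("2013-05-20", null));
        // Target page uses another format, so the pattern must be set to match.
        System.out.println(parseAddDate("20/05/2013", "dd/MM/yyyy"));
        // Wrong pattern for the page: parsing fails.
        try {
            parseAddDate("20/05/2013", null);
        } catch (DateTimeParseException e) {
            System.out.println("format does not match: " + e.getParsedString());
        }
    }
}
```

The last call shows why the setting matters: a date written as 20/05/2013 cannot be read with the default yyyy-MM-dd pattern.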

Collection start time: The default is the current time. If the collection start time is not reached, the system will not collect.

Collection end time: The default is never ending. If the collection end time is exceeded, the system will not collect.
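The start and end time checks amount to a simple time window test. A possible Java sketch follows. The class name CollectionWindowSketch is hypothetical, and treating a missing value as "no bound" is an assumption made for the example.

```java
import java.time.LocalDateTime;

final class CollectionWindowSketch {

    /**
     * Returns true if collection is allowed at the given moment.
     * A null start or end is treated as "no bound" (an assumption).
     */
    static boolean shouldCollect(LocalDateTime now, LocalDateTime start, LocalDateTime end) {
        if (start != null && now.isBefore(start)) {
            return false; // collection start time not reached yet
        }
        if (end != null && now.isAfter(end)) {
            return false; // collection end time exceeded
        }
        return true;
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2013, 1, 1, 0, 0);
        LocalDateTime end = LocalDateTime.of(2013, 12, 31, 23, 59);
        System.out.println(shouldCollect(LocalDateTime.of(2013, 6, 1, 12, 0), start, end)); // true
        System.out.println(shouldCollect(LocalDateTime.of(2014, 1, 1, 0, 0), start, end));  // false
    }
}
```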

Content address completion url: Some web pages use relative paths instead of full addresses, so you can set a prefix that is added to the content address.

Image address completion url: Some web pages use relative paths instead of full addresses, so you can set a prefix that is added to the image link address.

A tag link address completion url: Some web pages use relative paths instead of full addresses, so you can set a prefix that is added to the A tag link addresses in the content.
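The three completion settings above work the same way: a prefix is put in front of an address that is not complete. The following Java sketch shows one way this could behave. The class name UrlCompletionSketch and the example.com addresses are hypothetical. Leaving addresses that already start with http:// or https:// unchanged, and inserting or dropping a slash at the join, are assumptions, not documented FreeCMS behavior.

```java
final class UrlCompletionSketch {

    /** Puts the prefix in front of the address unless the address is already complete. */
    static String complete(String prefix, String address) {
        if (prefix == null || prefix.isEmpty()) {
            return address; // nothing configured
        }
        if (address.startsWith("http://") || address.startsWith("https://")) {
            return address; // already a full address (assumed behavior)
        }
        boolean prefixSlash = prefix.endsWith("/");
        boolean addressSlash = address.startsWith("/");
        if (prefixSlash && addressSlash) {
            return prefix + address.substring(1); // avoid a double slash
        }
        if (!prefixSlash && !addressSlash) {
            return prefix + "/" + address; // add the missing slash
        }
        return prefix + address;
    }

    public static void main(String[] args) {
        System.out.println(complete("http://www.example.com", "/a/1.html"));
        System.out.println(complete("http://www.example.com/news", "2.html"));
        System.out.println(complete("http://www.example.com", "http://img.example.com/x.png"));
    }
}
```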

Collection addresses are divided into static and dynamic addresses. A static address is a fixed address. A dynamic address generally refers to an address that can be paged. The paging variable is represented by {page}, and you can set the start page and the end page to collect. For example, for http://www.freetam.cn/list_{page}.html with the start page set to 1 and the end page set to 10, the system automatically collects the data of all pages from http://www.freetam.cn/list_1.html to http://www.freetam.cn/list_10.html.
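Expanding a dynamic address into its page addresses can be sketched in Java as follows. The class name PagedAddressSketch and the method expand are hypothetical. The sketch only illustrates the {page} substitution and is not taken from the FreeCMS source.

```java
import java.util.ArrayList;
import java.util.List;

final class PagedAddressSketch {

    /** Replaces {page} with every page number from startPage to endPage, inclusive. */
    static List<String> expand(String address, int startPage, int endPage) {
        List<String> urls = new ArrayList<>();
        if (!address.contains("{page}")) {
            urls.add(address); // static address: nothing to expand
            return urls;
        }
        for (int page = startPage; page <= endPage; page++) {
            urls.add(address.replace("{page}", String.valueOf(page)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Same example address as in the text above.
        for (String url : expand("http://www.freetam.cn/list_{page}.html", 1, 10)) {
            System.out.println(url);
        }
    }
}
```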

Under normal circumstances, only the title and content of the information need to be collected. The system can also collect the description, click volume, author, source, and addition time attributes.

With the keyword replacement function, you can replace the keywords in the collected information with the keywords you want.
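Keyword replacement can be pictured as applying a list of "keyword, replacement" pairs to the collected text. The Java sketch below is a minimal illustration. The class name KeywordReplaceSketch and the sample rules are hypothetical, and applying the rules in the order they were defined is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordReplaceSketch {

    /** Applies every rule in insertion order; each key is replaced by its value. */
    static String replaceKeywords(String text, Map<String, String> rules) {
        String result = text;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("OldSite", "NewSite");   // hypothetical rule
        rules.put("reprinted", "quoted");  // hypothetical rule
        System.out.println(replaceKeywords("This article is reprinted from OldSite.", rules));
        // prints "This article is quoted from NewSite."
    }
}
```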

 

Edit Collection Rules

Select the collection rule that needs to be edited, and then click the "Edit" button.

Note: Only one collection rule can be edited at a time.

After filling in the relevant properties, click the "Save" button.

Collect

Select the collection rule to run, and then click the "Collect" button.

Note: Only one collection rule can be run at a time.

Delete collection rules

Select the collection rule to be deleted, and then click the "Delete" button.

Tip: Multiple collection rules can be deleted at the same time.

To prevent accidental deletion, the system asks the user to confirm. Click "OK" to complete the deletion.

View collection records

From the management menu on the left, click Collection Records to enter.

Here you can view all web page collection records. You can delete the specified collection records, but the collected information data will not be deleted. Select the collection records to be deleted, and then click the "Delete" button.

   Tip: Multiple collection records can be deleted at the same time.

 

To prevent accidental deletion, the system asks the user to confirm. Click "OK" to complete the deletion.
