POI data processing flow

  POI data comes from many different sources, and the information obtained from each source differs, but it always falls into two kinds: basic data and detail data.

  Once the data has been accessed, the processing flow can be unified as:

    Data access => Data standardization => Data deduplication => Data fusion => Data release => Continuous update

    The operations inside each step may vary from one data source to another, but the flow generally follows the steps above. Each step is described in turn below.
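
    The flow above can be pictured as a chain of stages applied in order. The following is only a minimal sketch: the stage names and the `passthrough` placeholder are hypothetical and stand in for the real logic of each step.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]

# Hypothetical stage names mirroring the steps of the flow.
STAGE_NAMES = ["access", "standardize", "deduplicate", "fuse", "release"]


def passthrough(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(records: List[Record]) -> List[Record]:
        return [{**r, "trace": [*r.get("trace", []), name]} for r in records]
    return stage


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        records = stage(records)
    return records


if __name__ == "__main__":
    stages = [passthrough(n) for n in STAGE_NAMES]
    print(run_pipeline([{"name": "Cafe"}], stages))
```

    Continuous updating then amounts to running the same chain again whenever a source delivers new data.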

    1. Data access: The access method varies with the data source. Map-vendor data is the most standardized; it is usually delivered as MID/MIF files and is converted into the to-be-processed format before it flows downstream. Data crawled from the Internet is rich in content but follows diverse conventions, so it must be checked against the initial access conditions before it flows downstream. Partner data is relatively standard but reflects each partner's business focus; it usually requires data reconciliation with the partner and a good feedback and query interface that reports the access status. UGC data needs relatively little processing but requires timely feedback, and so on. Data sources also differ in importance and in order of magnitude: data that is large in volume but relatively low in importance needs validation at the entry point, while data that is small in volume but important needs a common reconciliation and feedback mechanism. Putting these mechanisms in place reduces the workload of later business expansion.
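
    As an illustration of entry validation for a high-volume source, here is a minimal sketch. The required fields and the `admit` helper are assumptions made for this example; the rejected list with its reasons is what a feedback mechanism could report back to the source.

```python
from typing import Dict, List, Tuple

# Hypothetical minimum schema a record must satisfy to enter the flow.
REQUIRED_FIELDS = ("name", "lon", "lat")


def entry_problems(record: Dict[str, object]) -> List[str]:
    """Return the reasons a record fails entry validation (empty if it passes)."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS
                if record.get(f) in (None, "")]
    if problems:
        return problems
    try:
        lon, lat = float(record["lon"]), float(record["lat"])
    except (TypeError, ValueError):
        return ["coordinates are not numeric"]
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        problems.append("coordinates out of range")
    return problems


def admit(records: List[Dict[str, object]]) -> Tuple[list, list]:
    """Split records into accepted ones and (record, reasons) rejections."""
    accepted, rejected = [], []
    for record in records:
        problems = entry_problems(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```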

    2. Data standardization: Standardization typically comprises three parts: 1) field alignment: different sources may use different field names for the same content, so the fields are converted to a unified name and path, and field values such as category and status are computed to complete the data; 2) correctness verification: for example, checking whether the province, city and district in the address match the administrative division that the coordinates fall in; 3) filtering out data in blacklisted categories or data that triggers the blacklist, such as content related to organized crime, terrorism and other illegal types. The standardization process is not complicated, but it becomes cumbersome as more data sources are accessed, so a robust, configurable service makes subsequent standardization work much more efficient.
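
    The three parts can be sketched as a small configurable function. Everything in the configuration below is hypothetical: the field aliases, the blacklisted category, and the rough bounding box used in place of a real lookup against administrative boundaries.

```python
from typing import Dict, Optional

# Hypothetical per-source field names mapped to unified names.
FIELD_ALIASES = {
    "title": "name", "poi_name": "name",
    "addr": "address",
    "lng": "lon", "longitude": "lon",
    "latitude": "lat",
    "type": "category",
}
# Hypothetical blacklisted category.
BLACKLISTED_CATEGORIES = {"illegal_gambling"}
# Rough, illustrative bounding box (min_lon, min_lat, max_lon, max_lat);
# a real check would use administrative boundary polygons.
PROVINCE_BOXES = {"Beijing": (115.4, 39.4, 117.5, 41.1)}


def align_fields(raw: Dict[str, object]) -> Dict[str, object]:
    """Rename source-specific fields to the unified names."""
    return {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}


def standardize(raw: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Align fields, fill defaults, drop blacklisted data, check coordinates."""
    record = align_fields(raw)
    record.setdefault("status", "active")  # fill in a missing status value
    if record.get("category") in BLACKLISTED_CATEGORIES:
        return None  # excluded by the blacklist
    box = PROVINCE_BOXES.get(record.get("province"))
    if box is not None:
        min_lon, min_lat, max_lon, max_lat = box
        lon = float(record.get("lon", "nan"))
        lat = float(record.get("lat", "nan"))
        # Does the stated province agree with where the coordinates fall?
        record["coord_ok"] = (min_lon <= lon <= max_lon
                              and min_lat <= lat <= max_lat)
    return record
```

    Keeping the aliases, blacklist and region data in configuration rather than code is what lets a new source be added without touching the function itself.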

    3. Data deduplication: This step determines whether newly accessed data duplicates data from sources that have already been accessed. In other words, if the system already contains the POI described by the new data, the new POI should be merged with the existing one and the original information updated; if the system does not yet contain it, the new data should be added to the system as an independent POI. The deduplication process is fairly complex and is not described in detail here. Put simply, an inverted index is built over the key information of the existing POIs; the index is queried with the information of the new POI; a similarity is computed for each POI in the returned list; and if the similarity reaches a threshold, the new POI is judged to be a duplicate.
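
    The inverted-index lookup can be sketched as follows. This is not the author's implementation: it indexes only the POI name, uses character bigrams as index keys and Jaccard similarity as the score, and the 0.6 threshold is an arbitrary value chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Character bigrams of the lowercased text with whitespace removed."""
    text = "".join(text.lower().split())
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


class PoiIndex:
    """Inverted index from name bigram to the ids of POIs containing it."""

    def __init__(self) -> None:
        self.pois: Dict[str, Dict[str, str]] = {}
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, poi_id: str, poi: Dict[str, str]) -> None:
        self.pois[poi_id] = poi
        for token in tokens(poi["name"]):
            self.postings[token].add(poi_id)

    def candidates(self, poi: Dict[str, str]) -> Set[str]:
        """Ids of existing POIs sharing at least one bigram with `poi`."""
        found: Set[str] = set()
        for token in tokens(poi["name"]):
            found |= self.postings.get(token, set())
        return found


def similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Jaccard similarity of the two names' bigram sets."""
    ta, tb = tokens(a["name"]), tokens(b["name"])
    return len(ta & tb) / len(ta | tb)


def find_duplicate(index: PoiIndex, poi: Dict[str, str],
                   threshold: float = 0.6) -> Optional[str]:
    """Return the id of the most similar existing POI, or None if below threshold."""
    best_id, best_score = None, 0.0
    for poi_id in index.candidates(poi):
        score = similarity(poi, index.pois[poi_id])
        if score > best_score:
            best_id, best_score = poi_id, score
    return best_id if best_score >= threshold else None
```

    A production system would likely combine several signals, such as distance between coordinates, address and phone number, rather than rely on the name alone.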

    4. Data fusion: Data from different sources that has been identified as the same POI is fused into a single POI record. The most reliable data from each source is selected as the basis, and POI records oriented to the different businesses are generated from it, so that the resulting POI meets different business needs.
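
    One simple way to "select the most reliable data" is a per-field source priority. The sketch below assumes this approach; the source names, the priority orders and the `view_for` helper are all hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical ranking: for each field, the most reliable source comes first.
SOURCE_PRIORITY = {
    "name": ["vendor", "partner", "crawler", "ugc"],
    "phone": ["partner", "ugc", "vendor", "crawler"],
}
DEFAULT_PRIORITY = ["vendor", "partner", "crawler", "ugc"]


def fuse(records_by_source: Dict[str, Dict[str, object]]) -> Dict[str, object]:
    """Merge records judged to be the same POI into one record.

    For every field, take the first non-empty value in priority order.
    Sources that do not appear in the priority list are ignored.
    """
    fields = {f for record in records_by_source.values() for f in record}
    fused: Dict[str, object] = {}
    for field in sorted(fields):
        for source in SOURCE_PRIORITY.get(field, DEFAULT_PRIORITY):
            value = records_by_source.get(source, {}).get(field)
            if value not in (None, ""):
                fused[field] = value
                break
    return fused


def view_for(fused: Dict[str, object], wanted: Iterable[str]) -> Dict[str, object]:
    """Project the fused POI onto the fields one business needs."""
    return {f: fused[f] for f in wanted if f in fused}
```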

    5. Data release: Data release means pushing the POI data produced by fusion to the online business parties. As with data access, release has to interface with multiple business parties and to adapt and verify the data according to each business, so a general release mechanism is necessary.
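
    A general release mechanism can be sketched as one adapter per business plus a shared delivery loop. The business names, payload shapes and the `send` callback below are assumptions for illustration only.

```python
from typing import Callable, Dict, List

Adapter = Callable[[Dict[str, object]], Dict[str, object]]


def search_adapter(poi: Dict[str, object]) -> Dict[str, object]:
    """Hypothetical payload for a search business."""
    return {"id": poi["id"], "title": poi["name"],
            "category": poi.get("category", "unknown")}


def navigation_adapter(poi: Dict[str, object]) -> Dict[str, object]:
    """Hypothetical payload for a navigation business; requires coordinates."""
    return {"id": poi["id"], "location": [poi["lon"], poi["lat"]]}


ADAPTERS: Dict[str, Adapter] = {
    "search": search_adapter,
    "navigation": navigation_adapter,
}


def release(poi: Dict[str, object],
            send: Callable[[str, Dict[str, object]], None]) -> List[str]:
    """Adapt the POI for each business and push it; return who received it."""
    delivered = []
    for business, adapt in ADAPTERS.items():
        try:
            payload = adapt(poi)
        except KeyError:
            continue  # a required field is missing, so skip this business
        send(business, payload)
        delivered.append(business)
    return delivered
```

    Adding a new business then means registering one more adapter rather than writing a new release path.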

    6. Data update: Data production is a continuous delivery process. Data is continuously acquired and integrated, the POI data is continuously updated, and data release is likewise an ongoing process.

Origin www.cnblogs.com/dlgh/p/11966486.html