A natural language processing pipeline

1. Data sources: text, PDF, databases, and other sources

2. Libraries used: jieba, gensim, sklearn, keras

3. Services that can be implemented: find words related or similar to a given term (after word segmentation), compare the similarity of two segmented texts, and determine which items are related or unrelated to the others (semantic fuzzy search)

For example, the words most similar to "Bank of China":

[["ICBC", 0.7910350561141968], ["601988", 0.7748256921768188], ["ICBC", 0.7616539001464844], ["China Construction Bank", 0.7573339993911743], ["Agricultural Bank of China", 0.7167254686355591], ["Bank", 0.7115263938903809], ["Agricultural Bank", 0.7070150375366211], ["CITIC Bank", 0.6993384957313538], ["CCB", 0.6886808276176453], ["ICBC", 0.684762716293335], ["China Merchants Bank", 0.6723880767822266], ["China Minsheng Bank", 0.6720935106277466], ["Industrial Bank", 0.6705615520477295], ["03988", 0.6620436310768127], ["Everbright Bank", 0.6612452268600464], ["Bank of Communications", 0.6425610780715942], ["601939", 0.6396690607070923], ["HSBC", 0.6354925036430359], ["China Everbright Bank", 0.6283385157585144], ["Hua Xia Bank", 0.6261048316955566], ["ABC", 0.6165546774864197], ["Bank of Nanjing", 0.6162608861923218], ["Gu Yu", 0.6026109457015991], ["Minsheng Bank", 0.6018795371055603], ["B02776", 0.6003248691558838], ["Bank of Beijing", 0.5989225506782532], ["00939", 0.5841124057769775], ["601288", 0.5798826217651367], ["French Industrial Bank", 0.5750421285629272], ["600036", 0.5725768804550171], ["BOC Hong Kong", 0.5725655555725098], ["Standard Chartered", 0.5723541975021362], ["Shanghai Bank", 0.5716006755828857], ["Capital Bank", 0.5714462399482727], ["Shi Chenyu", 0.5713250637054443], ["01398", 0.5696423053741455], ["01288", 0.5673946738243103], ["China Development Bank", 0.5673025846481323], ["the bank", 0.5642573237419128], ["601998", 0.5594305992126465], ["601328", 0.5585275292396545], ["CITIC Industrial Bank", 0.5555926561355591], ["Citibank", 0.5535871386528015], ["Ningbo Bank", 0.5529069900512695]]

China:

[["The World", 0.7685298919677734], ["Global", 0.7626694440841675], ["worldwide", 0.7018718123435974], ["country", 0.6887967586517334], ["the whole world", 0.681572437286377], ["America", 0.6747004985809326], ["Asia", 0.6721218824386597], ["The Chinese government", 0.6407063007354736], ["domestic", 0.6364794969558716], ["India", 0.6236740946769714], ["international", 0.6172101497650146], ["big country", 0.6167921423912048], ["Countries in Asia", 0.6133526563644409], ["Asia Pacific", 0.610878586769104], ["Worldwide", 0.6104856729507446], ["In the World", 0.6089214086532593], ["East Asia", 0.602276072], ["Japan", 0.601786196231842], ["World Today", 0.6002479791641235], ["Asia", 0.5914613604545593], ["Global", 0.5871683022022247], ["Global", 0.585560917854309], ["African Continent", 0.5852369070053101], ["World Market", 0.5849867463111877], ["Europe", 0.5787924528121948], ["Third World", 0.5771710872650146], ["Global Integration", 0.5766173601150513], ["European countries", 0.5756310224533081], ["Latin America", 0.5752301216125488], ["economic power", 0.5745469331741333], ["First World", 0.5730843544006348], ["East Asia", 0.5727769136428833], ["power", 0.5700076222419739], ["industry", 0.5689312219619751], ["Korea", 0.5672852396965027], ["States", 0.5603423118591309], ["emerging countries", 0.5577350854873657], ["developed countries", 0.5569929480552673], ["United Kingdom", 0.5562434196472168], ["Germany", 0.5535132884979248], ["Today", 0.5534329414367676], ["Latin America", 0.5512816309928894], ["East Asian Countries", 0.5505844354629517], ["China's rise", 0.5435972213745117], ["Latin America", 0.5431581735610962], ["Western Hemisphere", 0.5429360866546631], ["Western countries", 0.5408912897109985], ["country", 0.5392733216285706], ["Russia", 0.5382996797561646]]


Vanke:

[["Golden", 0.8261025547981262], ["Wharf", 0.8132781386375427], ["green city", 0.7946393489837646], ["rival", 0.7812688946723938], ["Country Garden", 0.7795591354370117], ["Yu Liang", 0.7790281772613525], ["Ocean Real Estate", 0.7744697332382202], ["Sunac", 0.7735781669616699], ["Evergrande Real Estate", 0.7618383169174194], ["Sunac China", 0.753994345664978], ["Hopson", 0.7338892221450806], ["CRL", 0.7292978167533875], ["Lake", 0.7278294563293457], ["Radiant", 0.7256796956062317], ["Lake real estate", 0.7223220467567444], ["Wang", 0.7217631936073303], ["Baoneng", 0.7196142673492432], ["Sun Hongbin", 0.7192676067352295], ["Greentown China", 0.7135359048843384], ["Yuexiu Real Estate", 0.7109189629554749], ["Shimao", 0.7004261016845703], ["China Jinmao", 0.6861996650695801], ["KWG", 0.6830298900604248], ["Agile", 0.6111322569847107], ["Vanke A", 0.677139937877655], ["green", 0.6746823787689209], ["R&F", 0.6702776551246643], ["Powerlong", 0.662824809551239], ["R&F Properties", 0.660904049873352], ["Baoneng Group", 0.6577337384223938], ["Jinke", 0.6565895676612854], ["Sunshine City", 0.6557801961898804], ["Fangxing", 0.654536247253418], ["Xiexin", 0.65335935354], ["dragon light estate", 0.644176721572876], ["Wharf", 0.6433624029159546], ["China Hengda", 0.6420278549194336], ["OCT", 0.6391571760177612], ["Xu Jiayin", 0.6391341686248779], ["Wantong Real Estate", 0.6383571028709412], ["Huayuan", 0.6379672288894653], ["Song Weiping", 0.6350336670875549], ["Leading Real Estate Enterprise", 0.6337549090385437], ["Dongyuan", 0.6329449415206909]]


4. Basic steps:

Load the data source -> gensim (train word vectors) -> classifier (traditional word-frequency based / deep learning with keras)

5. Using the model results: gensim.models.keyedvectors.KeyedVectors

wmdistance(document1, document2)  # the inputs are the word lists of the two documents

