Python association rule learning: a "market basket" analysis of drugs with the FP-Growth algorithm

Original link: http://tecdat.cn/?p=7318

 

Products can be categorized by vendor

On Evolution, a handful of top-level categories ("drugs", "digital goods", "fraud-related", etc.) are broken down into product-specific pages. Each page lists several different vendors.

I built a graph from the vendor co-occurrence relationships between products. Each node corresponds to a product, and the weight of an edge is the number of vendors who sell both of the products it connects. So, for example, if three vendors sell both mescaline and 4-AcO-DMT, the graph has an edge of weight 3 between the mescaline node and the 4-AcO-DMT node. I visualized the Evolution product network using hierarchical edge bundling based on a stochastic block model:
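As a rough illustration of how such edge weights can be counted, here is a minimal sketch using only the standard library. The function name `cooccurrence_weights` and the four vendors are made up for this example; it is not the code that produced the graph in this post.

```python
import collections
import itertools


def cooccurrence_weights(vendor_products):
    """Count, for every unordered product pair, how many vendors sell both."""
    weights = collections.Counter()
    for products in vendor_products.values():
        # sorted() gives each pair a canonical order, so (a, b) and (b, a)
        # are counted as the same undirected edge.
        for pair in itertools.combinations(sorted(set(products)), 2):
            weights[pair] += 1
    return weights


# Made-up vendors, for illustration only.
vendor_products = {
    "vendor_1": ["Mescaline", "4-AcO-DMT", "Cannabis"],
    "vendor_2": ["Mescaline", "4-AcO-DMT"],
    "vendor_3": ["4-AcO-DMT", "Mescaline", "MDMA"],
    "vendor_4": ["Cannabis", "MDMA"],
}

weights = cooccurrence_weights(vendor_products)
print(weights[("4-AcO-DMT", "Mescaline")])  # three vendors sell both
```

Each key of `weights` is a candidate edge and each value its weight; pairs that no vendor shares never appear in the counter.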

code segment


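import pandas as pd
import graph_tool as gt
import graph_tool.draw
import graph_tool.community
import itertools
import collections
import matplotlib
import matplotlib.cm  # needed for matplotlib.cm.Spectral below
import math

# g, pos, b, cts and MIN_SHARED_VENDORS come from earlier notebook cells that are not shown here.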
gt.draw.graph_draw(g, pos=pos, vertex_fill_color=b,
            edge_control_points=cts,
            vertex_size=20,
            vertex_text=g.vertex_properties['label'],
            vertex_text_rotation=g.vertex_properties['text_rot'],
            vertex_text_position=1,
            vertex_font_size=20,
            vertex_font_family='mono',
            vertex_anchor=0,
            vertex_color=b,
            vcmap=matplotlib.cm.Spectral,
            ecmap=matplotlib.cm.Spectral,
            edge_color=g.edge_properties['color'],
            bg_color=[0,0,0,1],
            output_size=[1024*2,1024*2],
            output='/home/aahu/Desktop/evo_nvends={0}.png'.format(MIN_SHARED_VENDORS))

 

The graph contains 73 nodes and 2,219 edges (I found 3,785 vendors in the data).

code segment:

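# coding: utf-8

from bs4 import BeautifulSoup
import re
import pandas as pd
import dateutil.parser  # "import dateutil" alone does not load the parser submodule
import os

import logging

logger = logging.getLogger(__name__)

# DATA_DIR and catdir_to_df() are defined in parts of the original script that are not shown here.


def main():
    # Every entry in DATA_DIR is expected to be named with a parseable date string.
    for datestr in os.listdir(DATA_DIR):
        d1 = os.path.join(DATA_DIR, datestr)
        fdate = dateutil.parser.parse(datestr)
        catdir = os.path.join(d1, 'category')
        if os.path.exists(catdir):
            logger.info(catdir)
            # Build one DataFrame per dated category directory and save it as TSV.
            df = catdir_to_df(catdir, fdate)
            outname = 'category_df_' + datestr + '.tsv'
            df.to_csv(os.path.join(DATA_DIR, outname), sep='\t', index=False)


if __name__ == '__main__':
    main()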

Edges with higher weights are drawn brighter. Nodes are clustered with a stochastic block model, and nodes in the same cluster are assigned the same color. There is a clear boundary between the upper part of the figure (drugs) and the lower part (non-drugs, i.e. weapons, hacking, credit cards, etc.). This suggests that vendors who sell drugs are unlikely to also sell non-drug products, and vice versa.

 

91.7% of vendors who sell speed and MDMA also sell ecstasy

Association rule learning is a straightforward and popular approach to market basket analysis. The traditional application is recommending items to shoppers based on what other customers have in their carts. For some reason, the canonical example is "customers who buy diapers also buy beer."

We do not have customer data from the crawl of Evolution's public listings. However, we do have data on which products each vendor sells, which lets us quantify what the visual analysis above suggests.

Here is a sample of the database (the complete file has 3,785 rows, one per vendor):

 

Vendor            Products
MrHolland         ['Cocaine', 'Cannabis', 'Stimulants', 'Hash']
Packstation24     ['Accounts', 'Benzos', 'IDs & Passports', 'SIM Cards', 'Fraud']
Spinifex          ['Benzos', 'Cannabis', 'Cocaine', 'Stimulants', 'Prescription', 'Sildenafil Citrate']
OzVendor          ['Software', 'Erotica', 'Dumps', 'E-Books', 'Fraud']
OzzyDealsDirect   ['Cannabis', 'Seeds', 'MDMA', 'Weed']
TatyThai          ['Accounts', 'Documents & Data', 'IDs & Passports', 'Paypal', 'CC & CVV']
PEA_King          ['Mescaline', 'Stimulants', 'Meth', 'Psychedelics']
PROAMFETAMINE     ['MDMA', 'Speed', 'Stimulants', 'Ecstasy', 'Pills']
ParrotFish        ['Weight Loss', 'Stimulants', 'Prescription', 'Ecstasy']

 

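Association rule mining is a huge field in computer science; hundreds of papers have been published on it over the past two decades.

I ran the FP-Growth algorithm with a minimum support of 40 and a minimum confidence of 0.1. The algorithm learned 12,364 rules.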
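To make the two thresholds concrete, here is a minimal sketch that mines rules from the nine sample vendors listed above. It is plain brute-force enumeration, not FP-Growth: FP-Growth finds the same frequent itemsets without testing every candidate, by compressing the transactions into a prefix tree. The function names, the `max_size` cap and the toy thresholds (support 2, confidence 0.5) are assumptions chosen for this small sample, not the settings used on the full data.

```python
import itertools

# The nine sample vendors from the table above (the full file has 3,785).
transactions = [
    {"Cocaine", "Cannabis", "Stimulants", "Hash"},
    {"Accounts", "Benzos", "IDs & Passports", "SIM Cards", "Fraud"},
    {"Benzos", "Cannabis", "Cocaine", "Stimulants", "Prescription",
     "Sildenafil Citrate"},
    {"Software", "Erotica", "Dumps", "E-Books", "Fraud"},
    {"Cannabis", "Seeds", "MDMA", "Weed"},
    {"Accounts", "Documents & Data", "IDs & Passports", "Paypal", "CC & CVV"},
    {"Mescaline", "Stimulants", "Meth", "Psychedelics"},
    {"MDMA", "Speed", "Stimulants", "Ecstasy", "Pills"},
    {"Weight Loss", "Stimulants", "Prescription", "Ecstasy"},
]


def support(itemset):
    """Absolute support: number of vendors who sell every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def mine_rules(min_support, min_confidence, max_size=3):
    """Brute-force rule mining; assumes min_support >= 1."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in itertools.combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset)
            if sup < min_support:
                continue
            # Split the frequent itemset into antecedent -> consequent.
            for k in range(1, size):
                for ante in itertools.combinations(combo, k):
                    antecedent = frozenset(ante)
                    # confidence = support(A and B) / support(A)
                    conf = sup / support(antecedent)
                    if conf >= min_confidence:
                        rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


for antecedent, consequent, sup, conf in mine_rules(min_support=2, min_confidence=0.5):
    print(sorted(antecedent), "->", sorted(consequent), sup, round(conf, 3))
```

On this tiny sample, ['Cocaine'] -> ['Cannabis'] has confidence 1.0 because both vendors who sell Cocaine also sell Cannabis, while the reverse rule only reaches 2/3 because a third vendor sells Cannabis without Cocaine.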

Antecedent                           Consequent               Support  Confidence
['Speed', 'MDMA']                    ['Ecstasy']              155      0.91716
['Ecstasy', 'Stimulants']            ['MDMA']                 310      0.768
['Speed', 'Weed', 'Stimulants']      ['Cannabis', 'Ecstasy']  68       0.623
['Fraud', 'Hacking']                 ['Accounts']             53       0.623
['Fraud', 'CC & CVV', 'Accounts']    ['Paypal']               43       0.492
['Documents & Data']                 ['Accounts']             139      0.492
['Guns']                             ['Weapons']              72       0.98
['Weapons']                          ['Guns']                 72       0.40
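The last two rows of the table show that confidence is not symmetric. Both rules rest on the same 72 vendors, but confidence divides that count by the support of the antecedent, so the formula can be inverted to estimate how many vendors list each antecedent. The sketch below does that arithmetic; the helper name is made up for illustration, and because the confidences in the table are rounded, the results are approximate.

```python
def implied_antecedent_support(rule_support, confidence):
    """Invert confidence = support(A and B) / support(A) to estimate support(A)."""
    return rule_support / confidence


# (antecedent, rule support, confidence) taken from the table above.
for label, sup, conf in [
    ("Speed and MDMA", 155, 0.91716),
    ("Guns", 72, 0.98),
    ("Weapons", 72, 0.40),
]:
    print(label, round(implied_antecedent_support(sup, conf)))
```

On these figures roughly 169 vendors sell both Speed and MDMA, about 73 sell Guns and about 180 sell Weapons. Nearly every Guns vendor is therefore also a Weapons vendor, while most Weapons vendors do not sell Guns, which is why the same 72 vendors give confidences of 0.98 and 0.40.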

 

 If you have any questions, please leave a comment below. 

 

 

  

Big Data tribe  - Chinese professional third-party data service providers to provide customized one-stop data mining and statistical analysis consultancy services

Statistical analysis and data mining consulting services: y0.cn/teradat (Consulting Services, please contact the official website customer service )

Click here to send me a messageQQ:3025393450

 

QQ exchange group: 186 388 004 

[Service] Scene  

Research; the company outsourcing; online and offline one training; data reptile collection; academic research; report writing; market research.

[Tribe] big data to provide customized one-stop data mining and statistical analysis consultancy

Welcome to elective our R language data analysis will be mining will know the course!

 

  


 


Origin www.cnblogs.com/tecdat/p/11646274.html