Notes | Python Performance Development | Python Performance Optimization Record (3)

Before optimization

Functions whose cumulative time exceeds 1 second in the current cProfile run:

        15013142 function calls (14601930 primitive calls) in 27.217 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.047    0.001   27.217    0.633 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       42    0.072    0.002   13.944    0.332 F:\topo\cop_kmeans.py:13(cop_kmeans)
       84    0.002    0.000   11.483    0.137 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   11.357    0.135   11.357    0.135 {pandas._libs.algos.nancorr}
    23039    0.595    0.000    6.043    0.000 {built-in method builtins.sum}
       42    1.213    0.029    5.205    0.124 F:\topo\cop_kmeans.py:197(get_ml_info)
410285/93131    0.486    0.000    4.970    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
    24722    0.052    0.000    4.327    0.000 F:\topo\cop_kmeans.py:79(closest_clusters)
    24722    0.597    0.000    4.215    0.000 F:\topo\cop_kmeans.py:80(<listcomp>)
     7888    0.056    0.000    4.037    0.001 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.011    0.000    3.867    0.092 F:\topo\cop_kmeans.py:217(<listcomp>)
    14942    0.015    0.000    3.843    0.000 F:\topo\cop_kmeans.py:217(<genexpr>)
    74166    0.058    0.000    3.618    0.000 <__array_function__ internals>:177(median)
    74166    0.078    0.000    3.499    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3711(median)
    74166    0.163    0.000    3.421    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3651(_ureduce)
    74166    0.459    0.000    3.229    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3801(_median)
     7888    0.030    0.000    2.283    0.000 D:\py\py3.10\lib\site-packages\pandas\core\series.py:2508(corr)
       42    0.000    0.000    2.204    0.052 F:\topo\cop_kmeans.py:71(tolerance)
     7888    0.049    0.000    1.772    0.000 D:\py\py3.10\lib\site-packages\pandas\core\nanops.py:83(_f)
    16272    0.168    0.000    1.750    0.000 D:\py\py3.10\lib\site-packages\pandas\core\series.py:323(__init__)
      139    1.709    0.012    1.713    0.012 F:\topo\cop_kmeans.py:153(compute_centers)
     7888    0.041    0.000    1.546    0.000 D:\py\py3.10\lib\site-packages\pandas\core\nanops.py:1524(nancorr)
       42    0.009    0.000    1.427    0.034 F:\topo\cop_kmeans.py:75(<listcomp>)
     7888    0.014    0.000    1.264    0.000 D:\py\py3.10\lib\site-packages\pandas\core\nanops.py:1566(func)
     7888    0.009    0.000    1.250    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.097    0.000    1.229    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
  1570172    1.119    0.000    1.119    0.000 F:\topo\cop_kmeans.py:75(<genexpr>)
    74166    0.069    0.000    1.094    0.000 <__array_function__ internals>:177(mean)
    82340    0.377    0.000    1.038    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)
    30775    0.139    0.000    1.009    0.000 D:\py\py3.10\lib\site-packages\pandas\core\construction.py:470(sanitize_array)
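The listings in this record have the layout produced by the standard-library cProfile and pstats modules when sorted by cumulative time. The following is a minimal sketch of how such a listing could be produced and filtered by a cumulative-time threshold. The names profile_call and busy_work are hypothetical and are not part of the original project.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, min_cumtime=1.0, **kwargs):
    """Run func under cProfile and return (result, report_text, slow_rows).

    slow_rows holds (function_key, cumulative_seconds) for every function whose
    cumulative time is at least min_cumtime, slowest first.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats()

    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    slow_rows = sorted(
        ((key, row[3]) for key, row in stats.stats.items() if row[3] >= min_cumtime),
        key=lambda item: item[1],
        reverse=True,
    )
    return result, buffer.getvalue(), slow_rows


def busy_work(n):
    """Hypothetical stand-in for semi_supervised_cluster()."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report, slow = profile_call(busy_work, 200_000, min_cumtime=0.0)
    print(report)
```

Raising min_cumtime to 1.0 or 0.5 would give the kind of cut-off used in the listings of this record.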

1st optimization: use numpy.corrcoef instead of pd.Series.corr

Modified location: metrics.py > pairwise_correlation()

Before optimization:

def pairwise_correlation(x, y):
    return 1 - pd.Series(x).corr(pd.Series(y))

Optimized:

def pairwise_correlation(x, y):
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    if not isinstance(y, np.ndarray):
        y = np.array(y)
    nan_idx = np.logical_or(np.isnan(x), np.isnan(y))
    return 1 - np.corrcoef(x[~nan_idx], y[~nan_idx])[0][1]

At present x is passed as a list and y as a numpy.ndarray, so the isinstance checks are needed. A later change to the call site will make these checks unnecessary.

Functions whose cumulative time exceeds 1 second in the current cProfile run:
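As a hedged illustration of the NaN handling, the sketch below masks out positions where either input is NaN, calls np.corrcoef on the rest, and cross-checks the result against the textbook Pearson formula. It uses np.asarray in place of the isinstance branches, which is an alternative to the call-site change mentioned here and not the author's code. pearson_reference and the sample data are hypothetical.

```python
import numpy as np


def pairwise_correlation(x, y):
    """Correlation distance (1 - Pearson r) over positions where both inputs are non-NaN."""
    # np.asarray accepts lists and arrays alike, so no isinstance check is needed.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~(np.isnan(x) | np.isnan(y))
    return 1 - np.corrcoef(x[valid], y[valid])[0, 1]


def pearson_reference(x, y):
    """Textbook Pearson r on the NaN-free pairs, used only as a cross-check."""
    pairs = [(a, b) for a, b in zip(x, y) if not (np.isnan(a) or np.isnan(b))]
    xs = np.array([a for a, _ in pairs])
    ys = np.array([b for _, b in pairs])
    dx, dy = xs - xs.mean(), ys - ys.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Mixed input types, mirroring the list / ndarray situation described above.
x = [1.0, 2.0, np.nan, 4.0, 5.0, 7.0]
y = np.array([2.0, 1.0, 3.0, np.nan, 6.0, 8.0])
distance = pairwise_correlation(x, y)
```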

        11925832 function calls (11554060 primitive calls) in 24.903 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.050    0.001   24.903    0.579 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   11.962    0.142 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   11.822    0.141   11.822    0.141 {pandas._libs.algos.nancorr}
       42    0.074    0.002   11.136    0.265 F:\topo\cop_kmeans.py:13(cop_kmeans)
402814/85660    0.457    0.000    4.824    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
    24722    0.054    0.000    4.419    0.000 F:\topo\cop_kmeans.py:79(closest_clusters)
    24722    0.605    0.000    4.304    0.000 F:\topo\cop_kmeans.py:80(<listcomp>)
    74166    0.058    0.000    3.699    0.000 <__array_function__ internals>:177(median)
    74166    0.079    0.000    3.578    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3711(median)
    74166    0.164    0.000    3.500    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3651(_ureduce)
    23039    0.576    0.000    3.360    0.000 {built-in method builtins.sum}
    74166    0.468    0.000    3.306    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:3801(_median)
       42    1.198    0.029    2.503    0.060 F:\topo\cop_kmeans.py:197(get_ml_info)
       42    0.000    0.000    2.211    0.053 F:\topo\cop_kmeans.py:71(tolerance)
      139    1.778    0.013    1.784    0.013 F:\topo\cop_kmeans.py:153(compute_centers)
       42    0.012    0.000    1.456    0.035 F:\topo\cop_kmeans.py:75(<listcomp>)
     7888    0.086    0.000    1.228    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.010    0.000    1.179    0.028 F:\topo\cop_kmeans.py:219(<listcomp>)
    14942    0.012    0.000    1.159    0.000 F:\topo\cop_kmeans.py:219(<genexpr>)
  1570172    1.149    0.000    1.149    0.000 F:\topo\cop_kmeans.py:75(<genexpr>)
    74166    0.071    0.000    1.121    0.000 <__array_function__ internals>:177(mean)
     7888    0.008    0.000    1.060    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.078    0.000    1.042    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
    82340    0.368    0.000    1.031    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)

2nd optimization: merge the many np.median calls made inside the loop into one call per cluster

Modified location: cop_kmeans.py > cop_kmeans()

Before optimization:

def closest_clusters(clusters, data_index, distance):
    """Rank the clusters by their median distance to the given data point."""
    distances = [np.median(distance[data_index, cluster]) for cluster in clusters]
    return sorted(range(len(distances)), key=lambda x: distances[x]), distances

for i, d in enumerate(dataset):
    indices, clusters_distances = closest_clusters(pre_clusters, i, dataset)
    pass  # the logic that follows does not use clusters_distances and does not modify pre_clusters or dataset

Optimized:

all_distances = [np.median(dataset[:, cluster], axis=1) for cluster in pre_clusters]
for i, d in enumerate(dataset):
    distances = [all_distances[j][i] for j in range(len(pre_clusters))]
    indices = sorted(range(len(distances)), key=lambda x: distances[x])
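In the previous profile np.median was called 74166 times, an average of three calls for each of the 24722 closest_clusters calls. The sketch below checks on small, made-up data that computing one median per cluster over all rows gives the same rankings as computing one median per (point, cluster) pair. The data and the two helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 8
# Hypothetical stand-ins: a square distance matrix and three lists of column indices.
dataset = rng.random((n_points, n_points))
pre_clusters = [[0, 3, 5], [1, 2], [4, 6, 7]]


def ranks_per_point(distance, clusters):
    """Original shape of the work: one np.median call per (point, cluster) pair."""
    ranks = []
    for i in range(len(distance)):
        d = [np.median(distance[i, cluster]) for cluster in clusters]
        ranks.append(sorted(range(len(d)), key=lambda x: d[x]))
    return ranks


def ranks_batched(distance, clusters):
    """Batched shape: one np.median call per cluster, covering every point at once."""
    all_distances = [np.median(distance[:, cluster], axis=1) for cluster in clusters]
    ranks = []
    for i in range(len(distance)):
        d = [all_distances[j][i] for j in range(len(clusters))]
        ranks.append(sorted(range(len(d)), key=lambda x: d[x]))
    return ranks


same = ranks_per_point(dataset, pre_clusters) == ranks_batched(dataset, pre_clusters)
```

The batching is only valid because neither pre_clusters nor dataset changes inside the loop, as noted in the original snippet.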

Functions whose cumulative time exceeds 1 second in the current cProfile run:

        8019608 function calls (7869083 primitive calls) in 18.432 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.044    0.001   18.432    0.429 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.561    0.126 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.441    0.124   10.441    0.124 {pandas._libs.algos.nancorr}
       42    0.059    0.001    6.309    0.150 F:\topo\cop_kmeans.py:13(cop_kmeans)
    23039    0.538    0.000    3.175    0.000 {built-in method builtins.sum}
       42    1.014    0.024    2.240    0.053 F:\topo\cop_kmeans.py:217(get_ml_info)
       42    0.000    0.000    2.085    0.050 F:\topo\cop_kmeans.py:91(tolerance)
      139    1.679    0.012    1.684    0.012 F:\topo\cop_kmeans.py:173(compute_centers)
       42    0.009    0.000    1.359    0.032 F:\topo\cop_kmeans.py:95(<listcomp>)
107818/11911    0.240    0.000    1.186    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.077    0.000    1.154    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.009    0.000    1.115    0.027 F:\topo\cop_kmeans.py:237(<listcomp>)
    14942    0.011    0.000    1.096    0.000 F:\topo\cop_kmeans.py:237(<genexpr>)
  1570172    1.076    0.000    1.076    0.000 F:\topo\cop_kmeans.py:95(<genexpr>)
     7888    0.007    0.000    1.000    0.000 <__array_function__ internals>:177(corrcoef)

3rd optimization: use array operations instead of Python list comprehensions

Modified location: cop_kmeans.py > tolerance()

Before optimization:

# taken from scikit-learn (https://goo.gl/1RYPP5)
def tolerance(tol, dataset):
    n = len(dataset)
    dim = len(dataset[0])
    averages = [sum(dataset[i][d] for i in range(n)) / float(n) for d in range(dim)]
    variances = [sum((dataset[i][d] - averages[d]) ** 2 for i in range(n)) / float(n) for d in range(dim)]
    return tol * sum(variances) / dim

Optimized:

def tolerance(tol, dataset):
    return tol * sum(np.var(dataset, axis=0)) / dataset.shape[1]
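The rewrite relies on np.var using ddof=0 by default, that is, dividing by n, which matches the division by float(n) in the list-comprehension version. A small sketch that checks the two versions agree on made-up data (the function names and sample values are hypothetical):

```python
import numpy as np


def tolerance_loops(tol, dataset):
    """List-comprehension version: per-column population variance, then the mean."""
    n = len(dataset)
    dim = len(dataset[0])
    averages = [sum(dataset[i][d] for i in range(n)) / float(n) for d in range(dim)]
    variances = [
        sum((dataset[i][d] - averages[d]) ** 2 for i in range(n)) / float(n)
        for d in range(dim)
    ]
    return tol * sum(variances) / dim


def tolerance_numpy(tol, dataset):
    """Array version: np.var with the default ddof=0 also divides by n."""
    return tol * sum(np.var(dataset, axis=0)) / dataset.shape[1]


data = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]])
tol_loops = tolerance_loops(1e-4, data)
tol_numpy = tolerance_numpy(1e-4, data)
```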

Functions whose cumulative time exceeds 0.5 seconds in the current cProfile run:

4864701 function calls (4714176 primitive calls) in 16.199 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.043    0.001   16.199    0.377 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.434    0.124 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.316    0.123   10.316    0.123 {pandas._libs.algos.nancorr}
       42    0.060    0.001    4.211    0.100 F:\topo\cop_kmeans.py:13(cop_kmeans)
       42    1.010    0.024    2.177    0.052 F:\topo\cop_kmeans.py:213(get_ml_info)
      139    1.720    0.012    1.724    0.012 F:\topo\cop_kmeans.py:169(compute_centers)
107860/11953    0.227    0.000    1.132    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.073    0.000    1.093    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    1.051    0.025 F:\topo\cop_kmeans.py:233(<listcomp>)
     8097    0.011    0.000    1.045    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    1.034    0.000 F:\topo\cop_kmeans.py:233(<genexpr>)
     7888    0.007    0.000    0.947    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.070    0.000    0.931    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.008    0.000    0.577    0.000 <__array_function__ internals>:177(cov)
     7888    0.149    0.000    0.559    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.008    0.000    0.525    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)

Time spent in the callees of cop_kmeans and get_ml_info:

F:\topo\cop_kmeans.py:13(cop_kmeans)            ->      40    0.001    0.001  F:\topo\cop_kmeans.py:41(<listcomp>)
                                                       139    0.011    0.082  F:\topo\cop_kmeans.py:45(<listcomp>)
                                                     24722    0.014    0.014  F:\topo\cop_kmeans.py:47(<listcomp>)
                                                       139    0.001    0.077  F:\topo\cop_kmeans.py:66(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:91(tolerance)
                                                        42    0.001    0.011  F:\topo\cop_kmeans.py:100(initialize_centers)
                                                     25346    0.012    0.012  F:\topo\cop_kmeans.py:157(violate_constraints)
                                                       139    1.720    1.724  F:\topo\cop_kmeans.py:169(compute_centers)
                                                        42    1.010    2.177  F:\topo\cop_kmeans.py:213(get_ml_info)
                                                        42    0.006    0.010  F:\topo\cop_kmeans.py:239(transitive_closure)
                                                     74971    0.006    0.006  {built-in method builtins.len}
                                                     24722    0.021    0.029  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
F:\topo\cop_kmeans.py:213(get_ml_info)          ->      42    0.005    0.005  F:\topo\cop_kmeans.py:225(<listcomp>)
                                                        42    0.008    1.051  F:\topo\cop_kmeans.py:233(<listcomp>)
                                                   1562953    0.110    0.110  {built-in method builtins.len}
                                                      7471    0.001    0.001  {method 'append' of 'list' objects}
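Per-caller breakdowns with this "->" layout can be obtained from pstats.Stats.print_callees, which takes a regular expression that restricts the report to matching functions. A minimal sketch, with hypothetical functions outer and inner standing in for cop_kmeans and its callees:

```python
import cProfile
import io
import pstats


def inner(n):
    """Hypothetical callee."""
    return sum(range(n))


def outer(n):
    """Hypothetical caller that invokes inner several times."""
    total = 0
    for _ in range(5):
        total += inner(n)
    return total


profiler = cProfile.Profile()
profiler.runcall(outer, 1000)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Show only the callees of functions whose printed name contains "(outer)".
stats.print_callees(r"\(outer\)")
callee_report = buffer.getvalue()
print(callee_report)
```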

4th optimization: hoist loop-invariant work so that it runs once outside the loop

Modified location: cop_kmeans.py > get_ml_info()

Before optimization:

for j, group in enumerate(groups):
    for d in range(dim):
        for i in group:
            centroids[j][d] += dataset[i][d]
        centroids[j][d] /= float(len(group))

Optimized:

for j, group in enumerate(groups):
    n_group = float(len(group))
    for d in range(dim):
        for i in group:
            centroids[j][d] += dataset[i][d]
        centroids[j][d] /= n_group

Functions whose cumulative time exceeds 0.5 seconds in the current cProfile run:

         3309815 function calls (3159204 primitive calls) in 15.745 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.042    0.001   15.745    0.366 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.381    0.124 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.262    0.122   10.262    0.122 {pandas._libs.algos.nancorr}
       42    0.056    0.001    3.764    0.090 F:\topo\cop_kmeans.py:13(cop_kmeans)
       42    0.778    0.019    1.838    0.044 F:\topo\cop_kmeans.py:213(get_ml_info)
      139    1.626    0.012    1.631    0.012 F:\topo\cop_kmeans.py:169(compute_centers)
107860/11953    0.229    0.000    1.139    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.073    0.000    1.093    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    1.053    0.025 F:\topo\cop_kmeans.py:234(<listcomp>)
     8097    0.010    0.000    1.047    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    1.036    0.000 F:\topo\cop_kmeans.py:234(<genexpr>)
     7888    0.007    0.000    0.947    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.071    0.000    0.931    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.008    0.000    0.577    0.000 <__array_function__ internals>:177(cov)
     7888    0.148    0.000    0.559    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.008    0.000    0.520    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)

Time spent in the callees of cop_kmeans and get_ml_info:

F:\topo\cop_kmeans.py:13(cop_kmeans)            ->      40    0.001    0.001  F:\topo\cop_kmeans.py:41(<listcomp>)
                                                       139    0.011    0.079  F:\topo\cop_kmeans.py:45(<listcomp>)
                                                     24722    0.013    0.013  F:\topo\cop_kmeans.py:47(<listcomp>)
                                                       139    0.001    0.074  F:\topo\cop_kmeans.py:66(<listcomp>)
                                                        42    0.000    0.007  F:\topo\cop_kmeans.py:91(tolerance)
                                                        42    0.001    0.012  F:\topo\cop_kmeans.py:100(initialize_centers)
                                                     25346    0.008    0.008  F:\topo\cop_kmeans.py:157(violate_constraints)
                                                       139    1.626    1.631  F:\topo\cop_kmeans.py:169(compute_centers)
                                                        42    0.778    1.838  F:\topo\cop_kmeans.py:213(get_ml_info)
                                                        42    0.007    0.011  F:\topo\cop_kmeans.py:240(transitive_closure)
                                                     74971    0.006    0.006  {built-in method builtins.len}
                                                     24722    0.018    0.027  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
F:\topo\cop_kmeans.py:213(get_ml_info)          ->      42    0.005    0.005  F:\topo\cop_kmeans.py:225(<listcomp>)
                                                        42    0.008    1.053  F:\topo\cop_kmeans.py:234(<listcomp>)
                                                      7723    0.001    0.001  {built-in method builtins.len}
                                                      7471    0.001    0.001  {method 'append' of 'list' objects}

5th optimization: use set union instead of a loop that marks entries of a boolean list (about 10% faster in an isolated test, but no significant gain in the application)

Modified location: cop_kmeans.py > get_ml_info()

Before optimization:

flags = [True] * n_dataset
groups = []
for i in range(n_dataset):
    if not flags[i]:
        continue
    group = list(ml[i] | {i})
    groups.append(group)
    for j in group:
        flags[j] = False

Optimized:

visited = set()
groups = []
for i in range(n_dataset):
    if i in visited:
        continue
    temp = ml[i] | {i}
    groups.append(list(temp))
    visited |= temp
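The two versions can be compared on a small example. The sketch below assumes ml maps each index to the set of indices it must be linked with, already transitively closed. That structure is inferred from the snippets, and the sample data and function names are hypothetical.

```python
def groups_with_flags(ml, n_dataset):
    """Flag-list version: mark every member of a group in a boolean list."""
    flags = [True] * n_dataset
    groups = []
    for i in range(n_dataset):
        if not flags[i]:
            continue
        group = list(ml[i] | {i})
        groups.append(group)
        for j in group:
            flags[j] = False
    return groups


def groups_with_set(ml, n_dataset):
    """Set version: merge each group into a visited set with one union."""
    visited = set()
    groups = []
    for i in range(n_dataset):
        if i in visited:
            continue
        temp = ml[i] | {i}
        groups.append(list(temp))
        visited |= temp
    return groups


# Hypothetical must-link relation over six points: {0, 1, 2}, {3}, {4, 5}.
ml = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set(), 4: {5}, 5: {4}}
by_flags = [sorted(g) for g in groups_with_flags(ml, 6)]
by_set = [sorted(g) for g in groups_with_set(ml, 6)]
```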

Functions whose cumulative time exceeds 0.5 seconds in the current cProfile run:

         3309387 function calls (3158862 primitive calls) in 16.126 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.043    0.001   16.126    0.375 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.693    0.127 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.574    0.126   10.574    0.126 {pandas._libs.algos.nancorr}
       42    0.059    0.001    3.830    0.091 F:\topo\cop_kmeans.py:13(cop_kmeans)
       42    0.745    0.018    1.814    0.043 F:\topo\cop_kmeans.py:213(get_ml_info)
      139    1.707    0.012    1.712    0.012 F:\topo\cop_kmeans.py:169(compute_centers)
107860/11953    0.230    0.000    1.142    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.074    0.000    1.103    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    1.063    0.025 F:\topo\cop_kmeans.py:233(<listcomp>)
     8097    0.011    0.000    1.057    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    1.046    0.000 F:\topo\cop_kmeans.py:233(<genexpr>)
     7888    0.007    0.000    0.953    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.071    0.000    0.937    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.008    0.000    0.579    0.000 <__array_function__ internals>:177(cov)
     7888    0.149    0.000    0.561    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.008    0.000    0.558    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)
      164    0.003    0.000    0.519    0.003 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:425(dict_to_mgr)

Time spent in the callees of cop_kmeans and get_ml_info:

F:\topo\cop_kmeans.py:13(cop_kmeans)            ->      40    0.001    0.001  F:\topo\cop_kmeans.py:41(<listcomp>)
                                                       139    0.011    0.081  F:\topo\cop_kmeans.py:45(<listcomp>)
                                                     24722    0.014    0.014  F:\topo\cop_kmeans.py:47(<listcomp>)
                                                       139    0.001    0.076  F:\topo\cop_kmeans.py:66(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:91(tolerance)
                                                        42    0.001    0.012  F:\topo\cop_kmeans.py:100(initialize_centers)
                                                     25346    0.009    0.009  F:\topo\cop_kmeans.py:157(violate_constraints)
                                                       139    1.707    1.712  F:\topo\cop_kmeans.py:169(compute_centers)
                                                        42    0.745    1.814  F:\topo\cop_kmeans.py:213(get_ml_info)
                                                        42    0.006    0.010  F:\topo\cop_kmeans.py:239(transitive_closure)
                                                     74971    0.006    0.006  {built-in method builtins.len}
                                                     24722    0.020    0.028  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
F:\topo\cop_kmeans.py:213(get_ml_info)          ->      42    0.005    0.005  F:\topo\cop_kmeans.py:224(<listcomp>)
                                                        42    0.008    1.063  F:\topo\cop_kmeans.py:233(<listcomp>)
                                                      7639    0.001    0.001  {built-in method builtins.len}
                                                      7471    0.001    0.001  {method 'append' of 'list' objects}

6th optimization: use NumPy array operations instead of Python loops

This continues the move to array operations and also removes the need to convert lists to arrays later.

Modified location: cop_kmeans.py > get_ml_info()

Before optimization:

dim = len(dataset[0])
centroids = [[0.0] * dim for i in range(len(groups))]
for j, group in enumerate(groups):
    n_group = float(len(group))
    for d in range(dim):
        for i in group:
            centroids[j][d] += dataset[i][d]
        centroids[j][d] /= n_group

Optimized:

dim = len(dataset[0])
centroids = np.zeros((len(groups), dim), dtype=np.float64)
for j, group in enumerate(groups):
    for i in group:
        centroids[j, :] += dataset[i, :]
    centroids[j, :] /= len(group)
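The row-by-row accumulation can be taken one step further by selecting all rows of a group with an index list and averaging them in a single call. This is a sketch of that idea, not the code used in this record. It assumes dataset is a 2-D NumPy array and every group is a non-empty list of row indices, and the sample data is made up.

```python
import numpy as np

rng = np.random.default_rng(1)
dataset = rng.random((6, 4))          # hypothetical data: 6 points, 4 dimensions
groups = [[0, 1, 2], [3], [4, 5]]     # hypothetical must-link groups


def centroids_row_loop(dataset, groups):
    """Accumulate each group's rows one at a time, then divide by the group size."""
    centroids = np.zeros((len(groups), dataset.shape[1]), dtype=np.float64)
    for j, group in enumerate(groups):
        for i in group:
            centroids[j, :] += dataset[i, :]
        centroids[j, :] /= len(group)
    return centroids


def centroids_indexed_mean(dataset, groups):
    """Select each group's rows with an index list and average them in one call."""
    return np.array([dataset[group].mean(axis=0) for group in groups])


loop_result = centroids_row_loop(dataset, groups)
mean_result = centroids_indexed_mean(dataset, groups)
```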

Functions whose cumulative time exceeds 0.5 seconds in the current cProfile run:

         3301916 function calls (3151391 primitive calls) in 15.182 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.032    0.001   15.182    0.353 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.734    0.128 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.618    0.126   10.618    0.126 {pandas._libs.algos.nancorr}
       42    0.057    0.001    2.924    0.070 F:\topo\cop_kmeans.py:13(cop_kmeans)
      139    1.617    0.012    1.622    0.012 F:\topo\cop_kmeans.py:169(compute_centers)
107860/11953    0.226    0.000    1.125    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.070    0.000    1.014    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.034    0.001    1.011    0.024 F:\topo\cop_kmeans.py:213(get_ml_info)
       42    0.008    0.000    0.975    0.023 F:\topo\cop_kmeans.py:230(<listcomp>)
     8097    0.010    0.000    0.969    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.958    0.000 F:\topo\cop_kmeans.py:230(<genexpr>)
     7888    0.007    0.000    0.942    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.071    0.000    0.927    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.572    0.000 <__array_function__ internals>:177(cov)
     7888    0.149    0.000    0.554    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.007    0.000    0.520    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)

Time spent in the callees of cop_kmeans, get_ml_info, and compute_centers:

F:\topo\cop_kmeans.py:13(cop_kmeans)            ->      40    0.000    0.000  F:\topo\cop_kmeans.py:41(<listcomp>)
                                                       139    0.010    0.078  F:\topo\cop_kmeans.py:45(<listcomp>)
                                                     24722    0.013    0.013  F:\topo\cop_kmeans.py:47(<listcomp>)
                                                       139    0.001    0.074  F:\topo\cop_kmeans.py:66(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:91(tolerance)
                                                        42    0.001    0.011  F:\topo\cop_kmeans.py:100(initialize_centers)
                                                     25346    0.008    0.008  F:\topo\cop_kmeans.py:157(violate_constraints)
                                                       139    1.617    1.622  F:\topo\cop_kmeans.py:169(compute_centers)
                                                        42    0.034    1.011  F:\topo\cop_kmeans.py:213(get_ml_info)
                                                        42    0.006    0.010  F:\topo\cop_kmeans.py:236(transitive_closure)
                                                     74971    0.006    0.006  {built-in method builtins.len}
                                                     24722    0.019    0.027  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
F:\topo\cop_kmeans.py:169(compute_centers)      ->     139    0.002    0.002  F:\topo\cop_kmeans.py:173(<listcomp>)
                                                       139    0.000    0.000  F:\topo\cop_kmeans.py:176(<listcomp>)
                                                       139    0.000    0.000  F:\topo\cop_kmeans.py:203(<listcomp>)
                                                       417    0.000    0.000  {built-in method builtins.len}
                                                     24722    0.002    0.002  {method 'append' of 'list' objects}
F:\topo\cop_kmeans.py:213(get_ml_info)          ->      42    0.008    0.975  F:\topo\cop_kmeans.py:230(<listcomp>)
                                                      7639    0.001    0.001  {built-in method builtins.len}
                                                        42    0.001    0.001  {built-in method numpy.zeros}
                                                      7471    0.001    0.001  {method 'append' of 'list' objects}

7th optimization: use NumPy array operations and collections.Counter instead of Python loops

As before, keep the data in NumPy arrays so that no list-to-array conversion is needed later.

Modified location: cop_kmeans.py > compute_centers()

Before optimization:

dim = len(dataset[0])
centers = [[0.0] * dim for i in range(k)]

counts = [0] * k_new  # counts is used only in the division loop below
for j, c in enumerate(clusters):
    for i in range(dim):
        centers[c][i] += dataset[j][i]
    counts[c] += 1

for j in range(k_new):
    for i in range(dim):
        centers[j][i] = centers[j][i] / float(counts[j])

Optimized:

dim = len(dataset[0])
centers = np.zeros((k,dim), dtype=np.float64)
counts = Counter(clusters)  # requires: from collections import Counter
for j, c in enumerate(clusters):
    centers[c, :] += dataset[j, :]

for j in range(k_new):
    centers[j, :] /= counts[j]
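The remaining per-point loop could also be replaced by np.add.at, which performs unbuffered in-place addition so that repeated cluster labels accumulate correctly, together with np.bincount for the counts. The sketch below is an alternative under stated assumptions, not the code used in this record. It assumes clusters is a sequence of integer labels in range(k), that every label occurs at least once, and it uses a single k where the original distinguishes k and k_new.

```python
from collections import Counter

import numpy as np


def compute_centers_counter(clusters, dataset, k):
    """Counter-based version, following the optimized snippet above."""
    centers = np.zeros((k, dataset.shape[1]), dtype=np.float64)
    counts = Counter(clusters)
    for j, c in enumerate(clusters):
        centers[c, :] += dataset[j, :]
    for j in range(k):
        centers[j, :] /= counts[j]
    return centers


def compute_centers_add_at(clusters, dataset, k):
    """Loop-free version: scatter-add the rows, then divide by the label counts."""
    labels = np.asarray(clusters)
    centers = np.zeros((k, dataset.shape[1]), dtype=np.float64)
    np.add.at(centers, labels, dataset)          # centers[labels[j]] += dataset[j]
    counts = np.bincount(labels, minlength=k)    # number of points per label
    return centers / counts[:, None]


rng = np.random.default_rng(3)
dataset = rng.random((6, 4))       # hypothetical data
clusters = [0, 2, 1, 0, 2, 2]      # hypothetical label for each row
centers_a = compute_centers_counter(clusters, dataset, 3)
centers_b = compute_centers_add_at(clusters, dataset, 3)
```

A plain centers[labels] += dataset would not work here, because fancy-indexed in-place addition applies only one update per repeated index.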

Functions whose cumulative time exceeds 0.5 seconds in the current cProfile run:

         3303306 function calls (3152781 primitive calls) in 13.225 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.031    0.001   13.225    0.308 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.001    0.000   10.428    0.124 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.314    0.123   10.314    0.123 {pandas._libs.algos.nancorr}
       42    0.053    0.001    1.303    0.031 F:\topo\cop_kmeans.py:14(cop_kmeans)
107860/11953    0.224    0.000    1.079    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       42    0.035    0.001    0.968    0.023 F:\topo\cop_kmeans.py:214(get_ml_info)
     7888    0.066    0.000    0.962    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    0.930    0.022 F:\topo\cop_kmeans.py:231(<listcomp>)
     8097    0.010    0.000    0.924    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.914    0.000 F:\topo\cop_kmeans.py:231(<genexpr>)
     7888    0.007    0.000    0.894    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.065    0.000    0.880    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.543    0.000 <__array_function__ internals>:177(cov)
     7888    0.139    0.000    0.526    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.007    0.000    0.510    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)

Time spent in the callees of cop_kmeans, get_ml_info, and compute_centers:

F:\topo\cop_kmeans.py:14(cop_kmeans)            ->      40    0.000    0.000  F:\topo\cop_kmeans.py:42(<listcomp>)
                                                       139    0.009    0.076  F:\topo\cop_kmeans.py:46(<listcomp>)
                                                     24722    0.013    0.013  F:\topo\cop_kmeans.py:48(<listcomp>)
                                                       139    0.001    0.059  F:\topo\cop_kmeans.py:67(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:92(tolerance)
                                                        42    0.001    0.010  F:\topo\cop_kmeans.py:101(initialize_centers)
                                                     25346    0.008    0.008  F:\topo\cop_kmeans.py:158(violate_constraints)
                                                       139    0.051    0.065  F:\topo\cop_kmeans.py:170(compute_centers)
                                                        42    0.035    0.968  F:\topo\cop_kmeans.py:214(get_ml_info)
                                                        42    0.007    0.010  F:\topo\cop_kmeans.py:237(transitive_closure)
                                                     74971    0.006    0.006  {built-in method builtins.len}
                                                     24722    0.019    0.026  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
F:\topo\cop_kmeans.py:214(get_ml_info)          ->      42    0.008    0.930  F:\topo\cop_kmeans.py:231(<listcomp>)
                                                      7639    0.001    0.001  {built-in method builtins.len}
                                                        42    0.001    0.001  {built-in method numpy.zeros}
                                                      7471    0.001    0.001  {method 'append' of 'list' objects}
F:\topo\cop_kmeans.py:170(compute_centers)      ->     139    0.000    0.002  D:\py\py3.10\lib\collections\__init__.py:565(__init__)
                                                       139    0.001    0.001  F:\topo\cop_kmeans.py:177(<listcomp>)
                                                       139    0.000    0.000  F:\topo\cop_kmeans.py:204(<listcomp>)
                                                       556    0.000    0.000  {built-in method builtins.len}
                                                       417    0.008    0.008  {built-in method builtins.print}
                                                       139    0.000    0.000  {built-in method numpy.zeros}
                                                     24722    0.002    0.002  {method 'append' of 'list' objects}

The 8th optimization: hoist work that is repeated on every loop iteration so that it runs once, outside the loop

Modified location: cop_kmeans.py > cop_kmeans()

Before optimization:

all_distances = [np.median(dataset[:, cluster], axis=1) for cluster in pre_clusters]
for i, d in enumerate(dataset):
    distances = [all_distances[j][i] for j in range(len(pre_clusters))]
    indices = sorted(range(len(distances)), key=lambda x: distances[x])
    counter = 0
    if clusters_[i] == -1:
        found_cluster = False
        while (not found_cluster) and counter < len(indices):
            pass  # the logic here does not modify pre_clusters
        pass  # the logic here does not modify pre_clusters

Optimized:

all_distances = [np.median(dataset[:, cluster], axis=1) for cluster in pre_clusters]
n_cluster = len(pre_clusters)
for i, d in enumerate(dataset):
    distances = [all_distances[j][i] for j in range(n_cluster)]
    indices = sorted(range(n_cluster), key=lambda x: distances[x])
    counter = 0
    if clusters_[i] == -1:
        found_cluster = False
        while (not found_cluster) and counter < n_cluster:
            pass
        pass
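
Going one step further than the change above, the per-sample list building and sorting could also be moved out of the loop. This is an assumption about a possible follow-up, not something the original code does. A minimal sketch with hypothetical data, checking that one stable argsort over a stacked distance matrix gives the same ordering as the per-sample sorted() call:

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.random((6, 8))                    # hypothetical data: 6 samples x 8 columns
pre_clusters = [[0, 1, 2], [3, 4], [5, 6, 7]]   # hypothetical column groups

# Same per-cluster medians as in the snippet above
all_distances = [np.median(dataset[:, cluster], axis=1) for cluster in pre_clusters]
n_cluster = len(pre_clusters)

# Loop version: build and sort a small list for every sample
loop_indices = []
for i in range(len(dataset)):
    distances = [all_distances[j][i] for j in range(n_cluster)]
    loop_indices.append(sorted(range(n_cluster), key=lambda x: distances[x]))

# Hoisted version: stack once, then sort every row in a single call.
# kind="stable" mirrors the stability of sorted() when distances tie.
distance_matrix = np.stack(all_distances, axis=1)   # shape (n_samples, n_cluster)
all_indices = np.argsort(distance_matrix, axis=1, kind="stable")
```

Inside the loop, `indices = all_indices[i]` would then replace the two list-building lines. Whether this pays off depends on the number of samples and clusters, so it should be measured with cProfile like the other steps.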

Functions whose cumulative time in the current cProfile run exceeds 0.1 seconds:

         3228099 function calls (3077574 primitive calls) in 13.317 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.032    0.001   13.317    0.310 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.002    0.000   10.442    0.124 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.317    0.123   10.317    0.123 {pandas._libs.algos.nancorr}
       42    0.044    0.001    1.309    0.031 F:\topo\cop_kmeans.py:14(cop_kmeans)
107860/11953    0.231    0.000    1.108    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       42    0.037    0.001    0.990    0.024 F:\topo\cop_kmeans.py:212(get_ml_info)
     7888    0.068    0.000    0.985    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    0.951    0.023 F:\topo\cop_kmeans.py:229(<listcomp>)
     8097    0.010    0.000    0.945    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.934    0.000 F:\topo\cop_kmeans.py:229(<genexpr>)
     7888    0.007    0.000    0.915    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.066    0.000    0.901    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.555    0.000 <__array_function__ internals>:177(cov)
     7888    0.141    0.000    0.538    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.008    0.000    0.528    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)
      164    0.003    0.000    0.489    0.003 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:425(dict_to_mgr)
      164    0.001    0.000    0.377    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:102(arrays_to_mgr)
      164    0.022    0.000    0.313    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:596(_homogenize)
     7888    0.007    0.000    0.293    0.000 <__array_function__ internals>:177(average)
      684    0.005    0.000    0.289    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:253(apply)
    14999    0.035    0.000    0.280    0.000 D:\py\py3.10\lib\site-packages\pandas\core\construction.py:470(sanitize_array)
     7888    0.029    0.000    0.279    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:395(average)
       86    0.000    0.000    0.253    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3630(__setitem__)
       86    0.001    0.000    0.253    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3749(_setitem_frame)
       86    0.006    0.000    0.236    0.003 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:9027(_where)
     7888    0.007    0.000    0.202    0.000 <__array_function__ internals>:177(clip)
     7888    0.008    0.000    0.188    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:2083(clip)
     8094    0.008    0.000    0.181    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:51(_wrapfunc)
   669066    0.116    0.000    0.174    0.000 {built-in method builtins.isinstance}
       43    0.004    0.000    0.171    0.004 F:\topo\cop_kmeans.py:78(make_constraint)
     7888    0.006    0.000    0.170    0.000 {method 'clip' of 'numpy.ndarray' objects}
    14052    0.035    0.000    0.168    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\cast.py:1873(construct_1d_arraylike_from_scalar)
     7888    0.020    0.000    0.165    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:125(_clip)
      240    0.002    0.000    0.159    0.001 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:5048(filter)
     8591    0.066    0.000    0.143    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)
     8174    0.006    0.000    0.141    0.000 {method 'mean' of 'numpy.ndarray' objects}
       86    0.000    0.000    0.141    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:339(putmask)
       86    0.001    0.000    0.125    0.001 D:\py\py3.10\lib\site-packages\pandas\core\internals\blocks.py:963(putmask)
       86    0.001    0.000    0.120    0.001 D:\py\py3.10\lib\site-packages\pandas\core\array_algos\putmask.py:116(putmask_without_repeat)
      376    0.039    0.000    0.110    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:1541(as_array)
     8093    0.006    0.000    0.109    0.000 <__array_function__ internals>:177(broadcast_to)
       84    0.000    0.000    0.109    0.001 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:1693(to_numpy)
       86    0.000    0.000    0.107    0.001 <__array_function__ internals>:177(putmask)

Time breakdown for cop_kmeans:

F:\topo\cop_kmeans.py:14(cop_kmeans)            ->      40    0.001    0.001  F:\topo\cop_kmeans.py:42(<listcomp>)
                                                       139    0.008    0.076  F:\topo\cop_kmeans.py:46(<listcomp>)
                                                     24722    0.013    0.013  F:\topo\cop_kmeans.py:49(<listcomp>)
                                                       139    0.001    0.062  F:\topo\cop_kmeans.py:68(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:93(tolerance)
                                                        42    0.001    0.010  F:\topo\cop_kmeans.py:102(initialize_centers)
                                                     25346    0.008    0.008  F:\topo\cop_kmeans.py:159(violate_constraints)
                                                       139    0.054    0.060  F:\topo\cop_kmeans.py:171(compute_centers)
                                                        42    0.037    0.990  F:\topo\cop_kmeans.py:212(get_ml_info)
                                                        42    0.007    0.011  F:\topo\cop_kmeans.py:235(transitive_closure)
                                                       320    0.000    0.000  {built-in method builtins.len}
                                                     24722    0.019    0.027  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}
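
Listings of this kind can be produced with the standard-library cProfile and pstats modules. The sketch below is only an illustration: `busy` is a hypothetical workload standing in for the real entry point, and `profile_call` and `functions_over` are helper names introduced here, not part of the project.

```python
import cProfile
import pstats


def busy(n):
    """Hypothetical workload standing in for the profiled entry point."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func under cProfile and return a pstats.Stats object."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    return pstats.Stats(profiler)


def functions_over(stats, min_cumtime):
    """Return (function name, cumtime) pairs above a cumulative-time threshold."""
    profile = stats.get_stats_profile()  # available since Python 3.9
    rows = [(name, fp.cumtime)
            for name, fp in profile.func_profiles.items()
            if fp.cumtime > min_cumtime]
    return sorted(rows, key=lambda row: row[1], reverse=True)


stats = profile_call(busy, 200_000)
stats.sort_stats("cumulative").print_stats(10)  # top 10 rows by cumulative time
stats.print_callees("busy")                     # per-callee breakdown for one function
slow = functions_over(stats, 0.1)               # same idea as the 0.1-second cut-off
```

Note that `get_stats_profile()` keys its result by function name only, so entries such as `<listcomp>` from different lines collapse into one; the printed tables keep them apart.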

The 9th optimization: remove the isinstance checks added in the 1st optimization (they account for only 15776 calls; the remaining 653290 isinstance calls presumably come from other functions)

Optimization 6 already guarantees that both arguments are np.ndarray, so the checks are redundant.

Modified location: metrics.py > pairwise_correlation()

Before optimization:

def pairwise_correlation(x, y):
    """
        Use pandas to ignore nan
    """
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    if not isinstance(y, np.ndarray):
        y = np.array(y)
    nan_idx = np.logical_or(np.isnan(x), np.isnan(y))
    return 1 - np.corrcoef(x[~nan_idx], y[~nan_idx])[0][1]

Optimized:

def pairwise_correlation(x, y):
    """
        Use pandas to ignore nan
    """
    nan_idx = np.logical_or(np.isnan(x), np.isnan(y))
    return 1 - np.corrcoef(x[~nan_idx], y[~nan_idx])[0][1]
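
Dropping the checks is only safe while every caller passes np.ndarray. One way to keep that guarantee explicit is to convert once at the call boundary instead of on every call. The sketch below assumes this approach; `as_float_array` is a hypothetical helper, not a function from the project. It also shows that the NaN handling is done with a NumPy mask rather than with pandas.

```python
import numpy as np


def pairwise_correlation(x, y):
    """Correlation distance (1 - Pearson r) over positions where neither input is NaN."""
    nan_idx = np.logical_or(np.isnan(x), np.isnan(y))
    return 1 - np.corrcoef(x[~nan_idx], y[~nan_idx])[0][1]


def as_float_array(values):
    """Hypothetical helper: convert to a float ndarray once, at the call boundary."""
    return np.asarray(values, dtype=float)


# Positions 2 and 3 contain a NaN in one of the inputs and are ignored,
# leaving x = [1, 2, 5] and y = [2, 4, 10], which are perfectly correlated.
x = as_float_array([1.0, 2.0, np.nan, 4.0, 5.0])
y = as_float_array([2.0, 4.0, 6.0, np.nan, 10.0])
d = pairwise_correlation(x, y)
```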

Functions whose cumulative time in the current cProfile run exceeds 0.1 seconds:

         3212323 function calls (3061798 primitive calls) in 13.013 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.030    0.001   13.013    0.303 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.001    0.000   10.240    0.122 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.127    0.121   10.127    0.121 {pandas._libs.algos.nancorr}
       42    0.043    0.001    1.287    0.031 F:\topo\cop_kmeans.py:14(cop_kmeans)
107860/11953    0.222    0.000    1.091    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       42    0.034    0.001    0.973    0.023 F:\topo\cop_kmeans.py:213(get_ml_info)
     7888    0.061    0.000    0.970    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    0.937    0.022 F:\topo\cop_kmeans.py:230(<listcomp>)
     8097    0.010    0.000    0.931    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.920    0.000 F:\topo\cop_kmeans.py:230(<genexpr>)
     7888    0.007    0.000    0.908    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.066    0.000    0.894    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.550    0.000 <__array_function__ internals>:177(cov)
     7888    0.140    0.000    0.533    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.007    0.000    0.507    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)
      164    0.003    0.000    0.470    0.003 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:425(dict_to_mgr)
      164    0.001    0.000    0.364    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:102(arrays_to_mgr)
      164    0.022    0.000    0.305    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:596(_homogenize)
     7888    0.007    0.000    0.291    0.000 <__array_function__ internals>:177(average)
     7888    0.029    0.000    0.277    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:395(average)
    14999    0.034    0.000    0.273    0.000 D:\py\py3.10\lib\site-packages\pandas\core\construction.py:470(sanitize_array)
      684    0.005    0.000    0.269    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:253(apply)
       86    0.001    0.000    0.244    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3630(__setitem__)
       86    0.001    0.000    0.243    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3749(_setitem_frame)
       86    0.006    0.000    0.226    0.003 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:9027(_where)
     7888    0.007    0.000    0.200    0.000 <__array_function__ internals>:177(clip)
     7888    0.008    0.000    0.185    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:2083(clip)
     8094    0.007    0.000    0.179    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:51(_wrapfunc)
     7888    0.006    0.000    0.168    0.000 {method 'clip' of 'numpy.ndarray' objects}
   653290    0.111    0.000    0.168    0.000 {built-in method builtins.isinstance}
       43    0.003    0.000    0.166    0.004 F:\topo\cop_kmeans.py:78(make_constraint)
    14052    0.033    0.000    0.163    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\cast.py:1873(construct_1d_arraylike_from_scalar)
     7888    0.019    0.000    0.162    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:125(_clip)
      240    0.002    0.000    0.154    0.001 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:5048(filter)
     8591    0.065    0.000    0.141    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)
     8174    0.006    0.000    0.140    0.000 {method 'mean' of 'numpy.ndarray' objects}
       86    0.000    0.000    0.131    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:339(putmask)
       86    0.001    0.000    0.116    0.001 D:\py\py3.10\lib\site-packages\pandas\core\internals\blocks.py:963(putmask)
       86    0.001    0.000    0.111    0.001 D:\py\py3.10\lib\site-packages\pandas\core\array_algos\putmask.py:116(putmask_without_repeat)
     8093    0.006    0.000    0.108    0.000 <__array_function__ internals>:177(broadcast_to)

Time breakdown for cop_kmeans:

F:\topo\cop_kmeans.py:14(cop_kmeans)            ->      40    0.001    0.001  F:\topo\cop_kmeans.py:42(<listcomp>)
                                                       139    0.008    0.076  F:\topo\cop_kmeans.py:46(<listcomp>)
                                                     24722    0.013    0.013  F:\topo\cop_kmeans.py:49(<listcomp>)
                                                       139    0.001    0.062  F:\topo\cop_kmeans.py:68(<listcomp>)
                                                        42    0.000    0.008  F:\topo\cop_kmeans.py:93(tolerance)
                                                        42    0.001    0.010  F:\topo\cop_kmeans.py:102(initialize_centers)
                                                     25346    0.008    0.008  F:\topo\cop_kmeans.py:159(violate_constraints)
                                                       139    0.054    0.060  F:\topo\cop_kmeans.py:171(compute_centers)
                                                        42    0.037    0.990  F:\topo\cop_kmeans.py:212(get_ml_info)
                                                        42    0.007    0.011  F:\topo\cop_kmeans.py:235(transitive_closure)
                                                       320    0.000    0.000  {built-in method builtins.len}
                                                     24722    0.019    0.027  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}

The 10th optimization: simplify overly complex logic (no significant gain)

Modified location: cop_kmeans.py > compute_centers()

Before optimization:

new_clusters = [[] for _ in range(k)]
for i in range(len(clusters)):
    for j in range(k):
        if clusters[i] == j:
            new_clusters[j].append(i)
            break

Optimized:

new_clusters = [[] for _ in range(k)]
for i in range(len(clusters)):
    if clusters[i] < k:
        new_clusters[clusters[i]].append(i)
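
The two versions can be checked against each other with a small example. The sketch below wraps them in hypothetical functions `group_nested` and `group_direct`. It also adds a lower bound `0 <= label`, because Python treats a negative label as an index counted from the end of the list; whether negative labels (such as the -1 marker used in cop_kmeans) can actually reach compute_centers is not shown in the source, so the guard is a precaution.

```python
def group_nested(clusters, k):
    """Original form: scan all k cluster ids for every sample."""
    new_clusters = [[] for _ in range(k)]
    for i in range(len(clusters)):
        for j in range(k):
            if clusters[i] == j:
                new_clusters[j].append(i)
                break
    return new_clusters


def group_direct(clusters, k):
    """Simplified form: use the label itself as the list index."""
    new_clusters = [[] for _ in range(k)]
    for i, label in enumerate(clusters):
        if 0 <= label < k:  # skip labels outside 0..k-1, including -1
            new_clusters[label].append(i)
    return new_clusters


clusters = [0, 2, -1, 1, 2, 0, 3]  # hypothetical labels, two of them out of range
k = 3
nested = group_nested(clusters, k)
direct = group_direct(clusters, k)
```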

Functions whose cumulative time in the current cProfile run exceeds 0.1 seconds:

         3212323 function calls (3061798 primitive calls) in 12.903 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.030    0.001   12.903    0.300 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.001    0.000   10.209    0.122 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.098    0.120   10.098    0.120 {pandas._libs.algos.nancorr}
       42    0.039    0.001    1.251    0.030 F:\topo\cop_kmeans.py:14(cop_kmeans)
107860/11953    0.216    0.000    1.067    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       42    0.035    0.001    0.961    0.023 F:\topo\cop_kmeans.py:207(get_ml_info)
     7888    0.062    0.000    0.954    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    0.924    0.022 F:\topo\cop_kmeans.py:224(<listcomp>)
     8097    0.010    0.000    0.918    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.908    0.000 F:\topo\cop_kmeans.py:224(<genexpr>)
     7888    0.007    0.000    0.892    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.065    0.000    0.877    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.544    0.000 <__array_function__ internals>:177(cov)
     7888    0.141    0.000    0.527    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
     2422    0.007    0.000    0.485    0.000 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:587(__init__)
      164    0.003    0.000    0.450    0.003 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:425(dict_to_mgr)
      164    0.001    0.000    0.350    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:102(arrays_to_mgr)
      164    0.021    0.000    0.292    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:596(_homogenize)
     7888    0.007    0.000    0.285    0.000 <__array_function__ internals>:177(average)
     7888    0.028    0.000    0.272    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:395(average)
      684    0.005    0.000    0.262    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:253(apply)
    14999    0.032    0.000    0.261    0.000 D:\py\py3.10\lib\site-packages\pandas\core\construction.py:470(sanitize_array)
       86    0.000    0.000    0.233    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3630(__setitem__)
       86    0.001    0.000    0.232    0.003 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:3749(_setitem_frame)
       86    0.006    0.000    0.216    0.003 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:9027(_where)
     7888    0.007    0.000    0.194    0.000 <__array_function__ internals>:177(clip)
     7888    0.007    0.000    0.180    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:2083(clip)
     8094    0.008    0.000    0.174    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:51(_wrapfunc)
       43    0.003    0.000    0.166    0.004 F:\topo\cop_kmeans.py:75(make_constraint)
     7888    0.006    0.000    0.163    0.000 {method 'clip' of 'numpy.ndarray' objects}
   653290    0.106    0.000    0.160    0.000 {built-in method builtins.isinstance}
    14052    0.031    0.000    0.158    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\cast.py:1873(construct_1d_arraylike_from_scalar)
     7888    0.019    0.000    0.157    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:125(_clip)
      240    0.001    0.000    0.155    0.001 D:\py\py3.10\lib\site-packages\pandas\core\generic.py:5048(filter)
     8591    0.063    0.000    0.138    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)
     8174    0.006    0.000    0.137    0.000 {method 'mean' of 'numpy.ndarray' objects}
       86    0.000    0.000    0.128    0.001 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:339(putmask)
       86    0.001    0.000    0.113    0.001 D:\py\py3.10\lib\site-packages\pandas\core\internals\blocks.py:963(putmask)
       86    0.000    0.000    0.109    0.001 D:\py\py3.10\lib\site-packages\pandas\core\array_algos\putmask.py:116(putmask_without_repeat)
     8093    0.006    0.000    0.106    0.000 <__array_function__ internals>:177(broadcast_to)

Time breakdown for cop_kmeans:

                                                    ncalls  tottime  cumtime
F:\topo\cop_kmeans.py:14(cop_kmeans)            ->      40    0.000    0.000  F:\topo\cop_kmeans.py:39(<listcomp>)
                                                       139    0.008    0.072  F:\topo\cop_kmeans.py:43(<listcomp>)
                                                     24722    0.012    0.012  F:\topo\cop_kmeans.py:46(<listcomp>)
                                                       139    0.001    0.057  F:\topo\cop_kmeans.py:65(<listcomp>)
                                                        42    0.000    0.007  F:\topo\cop_kmeans.py:90(tolerance)
                                                        42    0.001    0.010  F:\topo\cop_kmeans.py:99(initialize_centers)
                                                     25346    0.007    0.007  F:\topo\cop_kmeans.py:156(violate_constraints)
                                                       139    0.045    0.050  F:\topo\cop_kmeans.py:168(compute_centers)
                                                        42    0.035    0.961  F:\topo\cop_kmeans.py:207(get_ml_info)
                                                        42    0.006    0.010  F:\topo\cop_kmeans.py:230(transitive_closure)
                                                       320    0.000    0.000  {built-in method builtins.len}
                                                     24722    0.017    0.025  {built-in method builtins.sorted}
                                                       139    0.000    0.000  {built-in method builtins.sum}

After the first round of optimization

All functions with a cumulative running time of more than 0.5 seconds:

         3212323 function calls (3061798 primitive calls) in 12.903 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    0.030    0.001   12.903    0.300 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
       84    0.001    0.000   10.209    0.122 D:\py\py3.10\lib\site-packages\pandas\core\frame.py:9486(corr)
       84   10.098    0.120   10.098    0.120 {pandas._libs.algos.nancorr}
       42    0.039    0.001    1.251    0.030 F:\topo\cop_kmeans.py:14(cop_kmeans)
107860/11953    0.216    0.000    1.067    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       42    0.035    0.001    0.961    0.023 F:\topo\cop_kmeans.py:207(get_ml_info)
     7888    0.062    0.000    0.954    0.000 F:\topo\metrics.py:42(pairwise_correlation)
       42    0.008    0.000    0.924    0.022 F:\topo\cop_kmeans.py:224(<listcomp>)
     8097    0.010    0.000    0.918    0.000 {built-in method builtins.sum}
    14942    0.010    0.000    0.908    0.000 F:\topo\cop_kmeans.py:224(<genexpr>)
     7888    0.007    0.000    0.892    0.000 <__array_function__ internals>:177(corrcoef)
     7888    0.065    0.000    0.877    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     7888    0.007    0.000    0.544    0.000 <__array_function__ internals>:177(cov)
     7888    0.141    0.000    0.527    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)

Functions whose own (internal) time currently exceeds 0.02 seconds:

         3212323 function calls (3061798 primitive calls) in 12.903 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       84   10.098    0.120   10.098    0.120 {pandas._libs.algos.nancorr}
107860/11953    0.216    0.000    1.067    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     7888    0.141    0.000    0.527    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov)
   653290    0.106    0.000    0.160    0.000 {built-in method builtins.isinstance}
     7888    0.065    0.000    0.877    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2689(corrcoef)
     8767    0.065    0.000    0.065    0.000 {method 'copy' of 'numpy.ndarray' objects}
    11745    0.064    0.000    0.064    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     8591    0.063    0.000    0.138    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:162(_mean)
     7888    0.062    0.000    0.954    0.000 F:\topo\metrics.py:42(pairwise_correlation)
     8093    0.057    0.000    0.083    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\stride_tricks.py:339(_broadcast_to)
       43    0.047    0.001    0.049    0.001 D:\py\py3.10\lib\site-packages\pandas\core\algorithms.py:1551(diff)
      139    0.045    0.000    0.050    0.000 F:\topo\cop_kmeans.py:168(compute_centers)
    15776    0.042    0.000    0.092    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:91(_clip_dep_is_scalar_nan)
       42    0.039    0.001    1.251    0.030 F:\topo\cop_kmeans.py:14(cop_kmeans)
     7888    0.036    0.000    0.036    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:106(_clip_dep_invoke_with_casting)
       42    0.035    0.001    0.961    0.023 F:\topo\cop_kmeans.py:207(get_ml_info)
   134352    0.034    0.000    0.048    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\generic.py:43(_check)
      376    0.034    0.000    0.098    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\managers.py:1541(as_array)
    14052    0.033    0.000    0.052    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\cast.py:691(infer_dtype_from_scalar)
    14999    0.032    0.000    0.261    0.000 D:\py\py3.10\lib\site-packages\pandas\core\construction.py:470(sanitize_array)
      417    0.032    0.000    0.032    0.000 {method 'partition' of 'numpy.ndarray' objects}
    14052    0.031    0.000    0.158    0.000 D:\py\py3.10\lib\site-packages\pandas\core\dtypes\cast.py:1873(construct_1d_arraylike_from_scalar)
      469    0.031    0.000    0.063    0.000 D:\py\py3.10\lib\site-packages\pandas\core\internals\blocks.py:396(apply)
    39787    0.030    0.000    0.030    0.000 {built-in method numpy.array}
       43    0.030    0.001   12.903    0.300 F:\topo\semi_supervised_cluster.py:13(semi_supervised_cluster)
     7888    0.028    0.000    0.272    0.000 D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:395(average)
     8633    0.027    0.000    0.031    0.000 D:\py\py3.10\lib\site-packages\numpy\core\_methods.py:66(_count_reduce_items)
159598/120277    0.026    0.000    0.038    0.000 {built-in method builtins.len}
      247    0.024    0.000    0.077    0.000 D:\py\py3.10\lib\site-packages\scipy\cluster\hierarchy.py:1400(to_tree)
     1259    0.022    0.000    0.030    0.000 {pandas._libs.lib.maybe_convert_objects}
    16022    0.022    0.000    0.031    0.000 D:\py\py3.10\lib\site-packages\numpy\core\fromnumeric.py:3164(ndim)
      164    0.021    0.000    0.292    0.002 D:\py\py3.10\lib\site-packages\pandas\core\internals\construction.py:596(_homogenize)
     7477    0.020    0.000    0.020    0.000 {method 'count' of 'list' objects}

Of these:

  • {pandas._libs.algos.nancorr} is pandas' NaN-aware correlation routine
  • {built-in method numpy.core._multiarray_umath.implement_array_function} is the dispatch layer underneath every NumPy function call
  • D:\py\py3.10\lib\site-packages\numpy\lib\function_base.py:2462(cov) is a substep of NumPy's corrcoef calculation
  • {built-in method builtins.isinstance} is called in large numbers by pandas during each of its operations

Subsequent optimization suggestions

  • Removing all non-essential pandas logic is expected to reduce the running time of everything other than {pandas._libs.algos.nancorr} by more than 50%;
  • Make greater use of matrix (vectorized) calculations;
  • F:\topo\metrics.py:42(pairwise_correlation), F:\topo\cop_kmeans.py:168(compute_centers), F:\topo\cop_kmeans.py:14(cop_kmeans) and F:\topo\cop_kmeans.py:207(get_ml_info) still have room for optimization, though this would require relatively large revisions;
  • Consider whether the results of the repeated corr calculations can be reused.
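
The suggestions about matrix calculations and reusing correlation results can be illustrated with a NaN-aware correlation matrix that is computed once and then looked up. This is a sketch under assumptions, not the project's code: `nan_corr_matrix` is a hypothetical function, it uses pairwise-complete observations, and its single-pass formula can lose precision when values have a large mean relative to their spread.

```python
import numpy as np


def nan_corr_matrix(data):
    """Pairwise-complete Pearson correlation between the columns of `data`."""
    valid = ~np.isnan(data)
    m = valid.astype(float)
    x = np.where(valid, data, 0.0)   # NaN replaced by 0 so it drops out of the sums

    n = m.T @ m                      # n[i, j]: rows where columns i and j are both valid
    sum_i = x.T @ m                  # sum of column i over rows valid for both i and j
    sum_j = sum_i.T                  # sum of column j over the same rows
    sum_ij = x.T @ x                 # sum of products over the same rows
    sum_ii = (x * x).T @ m           # sum of squares of column i over the same rows
    sum_jj = sum_ii.T

    with np.errstate(invalid="ignore", divide="ignore"):
        cov = sum_ij - sum_i * sum_j / n
        var_i = sum_ii - sum_i ** 2 / n
        var_j = sum_jj - sum_j ** 2 / n
        return cov / np.sqrt(var_i * var_j)


rng = np.random.default_rng(1)
data = rng.random((50, 4))                    # hypothetical data: 50 rows x 4 columns
data[rng.random(data.shape) < 0.1] = np.nan   # knock out roughly 10% of the values
corr = nan_corr_matrix(data)                  # computed once
distance_0_1 = 1 - corr[0, 1]                 # later lookups reuse the matrix
```

With the matrix cached, a correlation distance between two columns becomes an array lookup instead of a fresh corrcoef call. Any replacement for the pandas corr call should be compared against the existing results before it is adopted.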

Origin blog.csdn.net/Changxing_J/article/details/130241665