How to dynamically debug third-party libraries for Python
Note: The method in this article only works for libraries that are installed together with their .py source code, such as sklearn.
Introduction
I used sklearn's sklearn.feature_extraction.text.TfidfTransformer to compute TF features, but the results sklearn produced did not match my manual calculation. The sklearn source code can be found on GitHub, but if you cannot debug it dynamically, you cannot see the intermediate results directly.
So the question is: how can we dynamically debug a Python third-party library (for example sklearn)? How can we see the intermediate results while the library's source code is running?
Suppose my code is as follows:
# Raw corpus: 3 texts
strs_train = [
    'God is love',
    'OpenGL on the GPU is fast',
    'Doctor David is PHD']
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
# First extract the bag-of-words features
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(strs_train)
# Then transform the bag-of-words features into TF features
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.todense())
How can I see the intermediate results computed inside sklearn.feature_extraction.text.TfidfTransformer.transform()?
Python Debugging Basics
Python comes with a module for debugging code: pdb. It supports setting breakpoints, single-stepping, stepping into functions, viewing code snippets, inspecting variable values, and changing variable values dynamically.
The following two lines of code can add a breakpoint to the program:
import pdb
pdb.set_trace()
After adding a breakpoint, run the program. When it stops at the breakpoint, you can use the following commands in the shell to debug the code.
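Command | Meaning |
---|---|
c | Continue execution until the next breakpoint |
n | Execute the next line, stepping over function calls |
r | Continue execution until the current function returns |
s | Step into the function being called |
b | Set a breakpoint (with no argument, list all breakpoints) |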
Debugging Python third-party libraries
With pdb, we can set breakpoints inside third-party libraries and debug them. Taking sklearn's sklearn.feature_extraction.text.TfidfTransformer as an example, the steps are as follows.
- (1) Find the location of the third-party library
First use the following Python code to find the location of the sklearn source code. Mine is at C:\\Users\\biny\\Anaconda3\\lib\\site-packages\\sklearn.
import sklearn, os
path = os.path.dirname(sklearn.__file__)
print(path)  # directory that contains the installed sklearn source
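The same lookup can be narrowed down to the exact file and line of a class or method with the standard `inspect` module. The sketch below uses the standard-library `json` package as a stand-in so that it runs even where sklearn is not installed; substituting `sklearn` and `TfidfTransformer` is the assumed real use.

```python
import inspect
import json  # stand-in for sklearn, chosen only because it is always available
import os

# Directory of the installed package (same idea as for sklearn).
package_dir = os.path.dirname(json.__file__)

# The .py file that defines a given class.
source_file = inspect.getsourcefile(json.JSONDecoder)

# The line number where a given method starts in that file.
_, first_line = inspect.getsourcelines(json.JSONDecoder.decode)

print(package_dir)
print(source_file, first_line)
```

Knowing the file and starting line in advance makes it quicker to find the place where the breakpoint should go.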
- (2) Delete Python's precompiled bytecode
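When a Python program runs, the Python interpreter first compiles the .py code into bytecode to speed things up, and the Python virtual machine then executes that bytecode.
The next time the same program runs, if the .py code has not changed, the compilation step is skipped and the previously compiled bytecode is executed directly.
This bytecode is stored as .pyc files under the __pycache__ folder. In principle this step is not required. However, if you delete the bytecode, run your own program again, and no new bytecode files appear, then you have located the wrong copy of the third-party library. This makes that mistake easy to spot.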
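Deleting the cached bytecode by hand works, but it can also be scripted. The following is a minimal sketch, not part of the original method: `remove_bytecode` is a hypothetical helper name, and the demonstration runs on a throwaway temporary directory rather than on a real installation.

```python
import pathlib
import shutil
import tempfile


def remove_bytecode(package_dir):
    """Delete every __pycache__ directory under package_dir and return the removed paths."""
    removed = []
    # sorted() materializes the matches before anything is deleted.
    for cache_dir in sorted(pathlib.Path(package_dir).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed


# Demonstration on a temporary directory that mimics a package layout.
demo_root = pathlib.Path(tempfile.mkdtemp())
cache = demo_root / "pkg" / "__pycache__"
cache.mkdir(parents=True)
(cache / "mod.cpython-311.pyc").write_bytes(b"")

removed_dirs = remove_bytecode(demo_root)
remaining = list(demo_root.rglob("__pycache__"))
print("removed:", len(removed_dirs), "remaining:", len(remaining))

shutil.rmtree(demo_root)  # clean up the demo directory
```

For the real check, the argument would be the library path found in step (1). If running your program afterwards recreates the __pycache__ folders there, you are looking at the copy of the library that Python actually imports.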
- (3) Add a breakpoint in the third-party library's source code
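Using the location of the third-party library, find the .py file that contains sklearn.feature_extraction.text.TfidfTransformer.transform(), and use pdb to add a breakpoint at the start of the function (as shown below).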
def transform(self, X, copy=True):
    import pdb
    pdb.set_trace()
    if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
        # preserve float family dtype
        X = sp.csr_matrix(X, copy=copy)
    else:
        # convert counts or binary occurrences to floats
        X = sp.csr_matrix(X, dtype=np.float64, copy=copy)
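- (4) Run your own program
Run your own code. It stops inside the third-party library, and you can then use pdb commands to debug the third-party code.
At this point the program has entered the third-party library and stopped at the breakpoint: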
C:\mine\tmp\debug_py_3rd_lib>python main.py
c:\users\biny\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py(1018)transform()
-> if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
(Pdb)
Use the n command (next) to single-step the code to the key point:
c:\users\biny\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py(1042)transform()
-> if self.norm:
(Pdb) n
Type the name of the intermediate variable you want to inspect (X.data). The line where execution has stopped is the one about to run, so what we see is the variable's value before that line executes:
c:\users\biny\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py(1043)transform()
-> X = normalize(X, norm=self.norm, copy=False)
(Pdb) X.data
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Continue executing (n command), and you can see that the intermediate variable's value has changed. You can also see that the change is caused by the call to normalize.
(Pdb) n
c:\users\biny\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py(1045)transform()
-> return X
(Pdb) X.data
array([ 0.57735027, 0.57735027, 0.57735027, 0.40824829, 0.40824829,
0.40824829, 0.40824829, 0.40824829, 0.40824829, 0.5 ,
0.5 , 0.5 , 0.5 ])
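The values in the session above can be cross-checked by hand. The sketch below assumes that normalize is applying L2 normalization per row (the default norm of TfidfTransformer is 'l2') and that the three sentences contain 3, 6, and 4 counted words, each appearing once, which matches the 13 ones seen before the call. It uses only the standard library, so it is a manual re-computation rather than a call into sklearn.

```python
import math

# One row per document: every counted word appears exactly once
# (assumption based on the 13 ones shown in X.data before normalize).
rows = [[1.0] * 3, [1.0] * 6, [1.0] * 4]


def l2_normalize(row):
    """Divide each entry by the Euclidean (L2) length of the row."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row]


normalized = [l2_normalize(row) for row in rows]
for row in normalized:
    print([round(value, 8) for value in row])
```

Each row becomes 1/sqrt(n) for a document with n counted words, which reproduces the 0.57735027, 0.40824829, and 0.5 seen in the debugger. A manual TF calculation that divides by the word count instead would give different numbers.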
Remember: once you have finished debugging, be sure to delete the two pdb breakpoint lines from the third-party source code!