Preliminary application of code2vec

Recently I wanted to use an embedding tool such as code2vec to convert source code into vectors. The only write-up I could find online on using the code2vec tooling from GitHub is this article:
https://blog.csdn.net/qysh123/article/details/106309967#comments_15034857

Its main content is as follows:

code2vec (GitHub project: https://github.com/tech-srl/code2vec) is a paper published at POPL 2019:

Alon, Uri, Meital Zilberstein, Omer Levy, and Eran Yahav. “code2vec: Learning distributed representations of code.” Proceedings of the ACM on Programming Languages 3, no. POPL (2019): 1-29.

Since its publication it has received a lot of attention. For example, at ASE 2019 at least two papers did follow-up work on it (both are critical of it to some extent, but that in itself shows its influence):

Jiang, Lin, Hui Liu, and He Jiang. “Machine Learning Based Recommendation of Method Names: How Far are We.” In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 602-614. IEEE, 2019.

Kang, Hong Jin, Tegawendé F. Bissyandé, and David Lo. “Assessing the Generalizability of Code2vec Token Embeddings.” In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1-12. IEEE, 2019.

Today, I will briefly summarize how to use existing tools to generate code2vec input data, using the astminer tool:

https://github.com/JetBrains-Research/astminer

This tool actually has a corresponding paper, and its previous name was PathMiner:

Kovalenko, Vladimir, Egor Bogomolov, Timofey Bryksin, and Alberto Bacchelli. “PathMiner: a library for mining of path-based representations of code.” In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 13-17. IEEE, 2019.

I personally think the new name is better. Here is a summary of how to run it on Windows:

After downloading the ZIP archive or cloning the repository with git, run the following in its directory:

gradle shadowJar

Note that the Gradle version used here needs to be above 5.5. Afterwards, the build\shadow directory contains a generated jar named lib-0.5.jar (the name may differ slightly depending on the version number). In fact, we can run this jar directly to get the output.

However, since astminer's README.md tells you to run sh scripts, I will briefly describe how to run sh scripts under Windows. Many others have already written this up as well:

https://blog.csdn.net/weixin_42376686/article/details/82391410

sh.exe can be found in the bin directory of the Git installation; mine, for example, is here: D:\Program Files\Git\bin

Double-click it to open a shell, then change to the astminer directory. There is one small pitfall here. For example, my astminer directory is D:\Projects\astminer; if I enter this directly:

cd D:\Projects\astminer

it reports: bash: cd: D:Projectsastminer: No such file or directory.
The reason is simple: bash treats the Windows separator \ as an escape character, so the backslashes are dropped from the path. What should be entered instead is the path with forward slashes (doubling each backslash also works):
cd D:/Projects/astminer
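
As a small illustration of the separator issue, here is a minimal sketch that uses Python's pathlib to turn a Windows-style path into two forms that a bash shell such as Git's sh.exe accepts. The helper name to_bash_paths is made up for this example and is not part of Git or astminer; it assumes an absolute path with a drive letter.

```python
from pathlib import PureWindowsPath


def to_bash_paths(windows_path: str) -> tuple[str, str]:
    """Return (forward-slash form, /drive/ form) of an absolute Windows path.

    Hypothetical helper for illustration only.
    """
    p = PureWindowsPath(windows_path)
    forward = p.as_posix()  # D:\Projects\astminer -> D:/Projects/astminer
    drive_letter = p.drive.rstrip(":").lower()  # "D:" -> "d"
    # Skip the anchor part ("D:\") and rebuild the path under /<drive letter>/
    msys = "/" + "/".join([drive_letter, *p.parts[1:]])
    return forward, msys


if __name__ == "__main__":
    forward, msys = to_bash_paths(r"D:\Projects\astminer")
    print("cd", forward)
    print("cd", msys)
```

Either printed form can be pasted after cd in the Git shell; the raw string (r"...") keeps Python itself from interpreting the backslashes.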
Then you can follow the instructions on GitHub and run:

./gradlew shadowJar

Then run:

./cli.sh code2vec

Adding the necessary parameters to this command generates the data that code2vec needs. At this point it looks as if everything has to be run under Linux, but if we look at the content of cli.sh, we find that it simply runs the jar built above:

#!/bin/bash
 
java -jar build/shadow/lib-0.5.jar "$@"

So running that jar directly gives the same result, and we can do it straight from the Windows cmd prompt:

java -jar lib-0.5.jar code2vec --lang cpp --project %source code directory% --output %directory for the generated code2vec input data%
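
The same call can also be scripted. The sketch below only assembles the argument list for the jar, using the flags shown in the command above (--lang, --project, --output). The function name, the placeholder paths, and the choice to wrap the call in Python are my own assumptions, not something astminer provides.

```python
import subprocess
from pathlib import Path


def build_code2vec_command(jar: str, lang: str, project: str, output: str) -> list[str]:
    """Assemble the 'java -jar ... code2vec' call as an argument list."""
    return [
        "java", "-jar", jar,
        "code2vec",
        "--lang", lang,
        "--project", project,
        "--output", output,
    ]


if __name__ == "__main__":
    # Placeholder paths: replace them with your own directories.
    cmd = build_code2vec_command("lib-0.5.jar", "cpp", "path/to/source", "path/to/output")
    print(" ".join(cmd))
    if Path("lib-0.5.jar").exists():
        # Only launch java when the jar is actually in the current directory.
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one string avoids quoting problems when a directory name contains spaces, such as D:\Program Files.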

Materials

1. The open-source implementation by the authors of code2vec: https://github.com/tech-srl/code2vec
2. The open-source project on GitHub that applies code2vec to multiple languages: https://github.com/JetBrains-Research/astminer
3. A blog post about using that open-source project: https://blog.csdn.net/qysh123/article/details/106309967

Practice process

1. Install Gradle

Following the blog post above, we first need to build the executable lib-0.5.jar file by running gradle shadowJar, so Gradle has to be installed first. I installed it on Windows. For the detailed installation steps, see: https://blog.csdn.net/lockhou/article/details/113817827

2. Run the ./gradlew shadowJar command

Open sh.exe in the bin folder of the Git installation and, in the shell that opens, change to the astminer directory before running the command. (My sh.exe is at: C:\Program Files\Git\bin\sh.exe)

3. Run the code2vec command

To use the command given in the blog post above, we need to change directory, in the shell opened by Git's sh.exe, to build\shadow inside the astminer directory, where the jar was generated (in my case astminer-master-dev\build\shadow\lib-0.6.jar), and then run the command:

java -jar lib-0.5.jar code2vec --lang cpp --project %source code directory% --output %directory for the generated code2vec input data%

The command I actually ran is:

java -jar lib-0.6.jar code2vec --lang c --project ../../../9_projects_Functions/Asterisk/Vulnerable_functions --output out1

--project ../../../9_projects_Functions/Asterisk/Vulnerable_functions is the relative path from the current directory to the directory containing the C source files. I put all the .c files in the Vulnerable_functions folder.

--output out1 means that the results produced from these C files are stored in the out1 folder.

Output format
For path-based representations, astminer supports two output formats. In both of them, four .csv files are stored. The files I obtained are shown in the figures below:
[Figures: the generated output files]
node_types.csv contains numeric IDs and the corresponding node types with directions (up/down, as described in the paper);
tokens.csv contains numeric IDs and the corresponding tokens;
paths.csv contains numeric IDs and AST paths in the form of a space-separated sequence of node type IDs;
path_contexts.csv contains labels and sequences of path contexts (each path context is a triple of two tokens and the path between them).
In the code2vec format, each line of path_contexts.csv starts with a label, followed by a sequence of space-separated triples. Each triple contains the start token ID, path ID, and end token ID, separated by commas.
In the csv format, each line of path_contexts.csv contains a label, then a comma, then a sequence of triples separated by ;. Each triple contains the start token ID, path ID, and end token ID, separated by spaces.
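
To make the two layouts of path_contexts.csv concrete, here is a minimal parsing sketch based only on the description above. The function names and the sample lines are made up for illustration; the sketch assumes that the label contains no whitespace (code2vec format) or no comma (csv format), that all IDs are integers, and that blank lines are skipped by the caller.

```python
from typing import NamedTuple


class PathContext(NamedTuple):
    start_token_id: int
    path_id: int
    end_token_id: int


def parse_code2vec_line(line: str) -> tuple[str, list[PathContext]]:
    """Parse 'label s,p,e s,p,e ...' (code2vec format)."""
    label, *triples = line.split()
    return label, [PathContext(*map(int, t.split(","))) for t in triples]


def parse_csv_line(line: str) -> tuple[str, list[PathContext]]:
    """Parse 'label,s p e;s p e;...' (csv format)."""
    # Split on the first comma only: everything after it is the triple list.
    label, _, rest = line.strip().partition(",")
    triples = [t for t in rest.split(";") if t.strip()]
    return label, [PathContext(*map(int, t.split())) for t in triples]


if __name__ == "__main__":
    # Made-up sample lines, not taken from a real astminer run.
    print(parse_code2vec_line("main 1,5,2 2,7,3"))
    print(parse_csv_line("main,1 5 2;2 7 3"))
```

Both sample lines describe the same two path contexts, so the two calls return equal results. The IDs can then be looked up in tokens.csv and paths.csv to recover the actual tokens and paths.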

--lang specifies the language(s) of the project to process; the results for each language are stored in a subfolder named after that language's file extension.

4. You can also use the commands provided by the open-source project directly

Preprocess

Run preprocessing on a C/C++ project to expand #define directives. In the other tasks, if C/C++ files containing macros are passed in, the macros and their occurrences in the code are dropped.

./cli.sh preprocess --project path/to/project --output path/to/preprocessedProject

Parsing

Extract ASTs from all files in the supported languages.

./cli.sh parse --lang py,java,c,cpp --project path/to/project --output path/to/result --storage dot

Path context

Extract path contexts from all files in the supported languages and store them in the form fileName triplesOfPathContexts.

./cli.sh pathContexts --lang py,java,c,cpp --project path/to/project --output path/to/results --maxL L --maxW W --maxContexts C --maxTokens T --maxPaths P

Code2vec

Extract data suitable as input for the code2vec model. Parse all files written in the specified languages into ASTs, split them into methods, and store the result in the form method|name triplesOfPathContexts.

./cli.sh code2vec --lang py,java,c,cpp --project path/to/project --output path/to/results --maxL L --maxW W --maxContexts C --maxTokens T --maxPaths P  --split-tokens --granularity method

I also tried the code2vec command as described by the open-source tool on GitHub. It is again run in the shell opened by Git's sh.exe, this time from the astminer directory. The command is as follows:

./cli.sh code2vec --lang c --project ../9_projects_Functions/Asterisk/Vulnerable_functions --output out

The results of the run are shown in the figures:
[Figures: output of the run]

A rough comparison of each output file shows that, for the same batch of input files, the two ways of running code2vec above produce the same results.
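
A comparison like this can also be automated. The sketch below is only one possible way to do it, using Python's standard filecmp module to compare the .csv files of two output folders byte for byte. The function name and the idea of passing the two folders on the command line are my own assumptions. A byte-level check is stricter than a rough comparison by eye: if a run happened to assign IDs in a different order, the files would be reported as different even though they describe the same data.

```python
import filecmp
import sys
from pathlib import Path


def compare_output_dirs(dir_a: str, dir_b: str) -> dict[str, list[str]]:
    """Compare all .csv files (recursively) in two output folders."""
    a, b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(a).as_posix() for p in a.rglob("*.csv")}
    files_b = {p.relative_to(b).as_posix() for p in b.rglob("*.csv")}
    common = sorted(files_a & files_b)
    # shallow=False forces a comparison of file contents, not just metadata.
    same, different, errors = filecmp.cmpfiles(a, b, common, shallow=False)
    return {
        "same": same,
        "different": different + errors,
        "only_in_a": sorted(files_a - files_b),
        "only_in_b": sorted(files_b - files_a),
    }


if __name__ == "__main__":
    # Usage: python compare_outputs.py <first output dir> <second output dir>
    if len(sys.argv) == 3:
        for key, names in compare_output_dirs(sys.argv[1], sys.argv[2]).items():
            print(f"{key}: {names}")
```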

Out of curiosity, I also tried the tool's AST generation. The working directory is the same as above; just enter the following command, and the results are stored in the ast folder.

./cli.sh parse --lang c --project ../9_projects_Functions/Asterisk/Vulnerable_functions --output ast  

Preliminary results

1. The results obtained by the two ways of running code2vec are basically the same.
2. For now we have only obtained the vocabularies and, based on them, a representation of each .c file in terms of those tables, stored in path_contexts.csv. We have not obtained the vector representation we wanted, so the output cannot yet be fed directly into a neural network to solve practical problems.

Origin blog.csdn.net/lockhou/article/details/113854491