Machine learning can identify the programmers behind code by their style

Source: ATYUN AI platform 

Automated tools can now accurately identify the authors of forum posts, provided enough training data is available. New research suggests the same approach can be applied to artificial languages such as code: it turns out that software developers also leave an identifying fingerprint in what they write.

Rachel Greenstadt, an associate professor of computer science at Drexel University, and her former doctoral student Aylin Caliskan, now an assistant professor at George Washington University, have found that code, like other forms of expression, is not completely anonymous. At the DefCon hacker conference on Friday, they will present research in which they used machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in plagiarism disputes, for example, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.

How code is de-anonymized

Here is a simple explanation of how the researchers used machine learning to discover who wrote a piece of code. First, the algorithm identifies all of the features present in a collection of code samples. There are a great many of these characteristics; think of every aspect that exists in natural language: word choice, how words are put together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features down to only those that actually distinguish one developer from another, trimming the list from hundreds of thousands to around 50.
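To make the narrowing step concrete, here is a minimal sketch of selecting the most author-discriminating features from a large raw feature matrix. This is not the researchers' actual pipeline; the toy data, the number of authors, and the use of scikit-learn's mutual-information scoring are all illustrative assumptions.

```python
# Minimal sketch: reduce a huge set of raw stylistic features to the ~50
# that best distinguish authors. Data here is synthetic, not real code features.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 1000))          # 200 code samples x 1,000 raw features (toy scale)
y = rng.integers(0, 10, size=200)    # labels for 10 hypothetical authors

# Keep only the features that carry the most information about authorship
selector = SelectKBest(score_func=mutual_info_classif, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 50)
```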

The researchers do not rely on low-level features, such as how the code is formatted. Instead, they build an "abstract syntax tree," which reflects the underlying structure of the code rather than its surface details. Their technique is akin to prioritizing someone's sentence structure over whether they indent each line of a paragraph.
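The sketch below illustrates the idea of structural features with Python's built-in ast module; the actual research targeted languages such as C++ with its own parsing tooling, so this is only an analogy, and the example function is made up.

```python
# Illustrative only: derive structure-based features (AST node counts) that
# survive reformatting, instead of relying on indentation or spacing.
import ast
from collections import Counter

source = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

tree = ast.parse(source)

# Count node types: loops, assignments, function definitions, etc.
# These counts describe how the code is built, not how it is laid out.
node_counts = Counter(type(node).__name__ for node in ast.walk(tree))
print(node_counts)
```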

The method also requires existing examples of someone's work to teach the algorithm what to look for when it encounters another code sample. If a random GitHub account pops up and publishes a snippet, Greenstadt and Caliskan would not necessarily be able to identify the person behind it, because they would have only one sample to work with (though they might be able to say it is a developer they have never seen before). However, Greenstadt and Caliskan do not need your life's work to attribute code to you. It only takes a few short samples.
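The following sketch shows this supervised setup: a classifier is trained on a few labelled samples per candidate author, then asked to attribute a new, unlabelled sample. A random forest stands in as the classifier because that family is commonly used in this line of research, but the feature vectors, counts, and configuration here are synthetic assumptions, not the papers' setup.

```python
# Sketch of the attribution step: train on a few samples per known author,
# then predict the author of a new snippet's feature vector.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_authors, samples_per_author, n_features = 5, 8, 50

# Synthetic "style" vectors: each author gets a slightly different offset
X_train = np.vstack([
    rng.normal(loc=a, scale=1.0, size=(samples_per_author, n_features))
    for a in range(n_authors)
])
y_train = np.repeat(np.arange(n_authors), samples_per_author)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

# A new, unlabelled sample that (hypothetically) came from author 3
new_sample = rng.normal(loc=3, scale=1.0, size=(1, n_features))
print("Predicted author:", clf.predict(new_sample)[0])
```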

For example, in a 2017 paper, Caliskan, Greenstadt, and two other researchers showed that even small snippets of code posted to GitHub were enough to distinguish one coder from another with a high degree of accuracy.

Most impressively, Caliskan and other researchers showed in a separate paper that they could de-anonymize programmers using only compiled binary code. After a developer finishes writing a piece of code, a program called a compiler converts it into a series of 1s and 0s that a machine can read, called a binary. To humans, it mostly looks like gibberish.

Caliskan and her co-researchers could decompile the binary back into the C++ programming language while preserving the developer's unique stylistic elements. Imagine writing a paper and running it through Google Translate to convert it into another language: the text may look completely different, but elements of how you write, such as your signature syntax, are still embedded in it. The same holds for code.

"Style is preserved," Caliskan said. "When things based on personal learning, there is still a very strong style."

For the binary experiments, Caliskan and the other researchers used code samples from Google's annual Code Jam contest. The machine learning algorithm correctly identified the author from a group of 100 individual programmers 96 percent of the time, using eight code samples from each programmer. Even when the pool was expanded to 600 programmers, the algorithm still identified the author correctly 83 percent of the time.

Impact on privacy and plagiarism

Caliskan and Greenstadt say their work could be used to tell whether a student plagiarized a programming assignment, or whether a developer violated a non-compete clause in an employment contract. Security researchers could use it to help determine who created a specific type of malware.

More worryingly, authoritarian governments could use de-anonymization techniques to identify the individuals behind tools such as censorship-circumvention software. The research also has privacy implications for developers who contribute to open source projects, especially if they consistently use the same GitHub account.

Greenstadt said: "People should realize that, in this case, 100% hide their identity is often very difficult."

For example, Greenstadt and Caliskan found that some off-the-shelf obfuscation methods, tools software engineers use to make code more complex and therefore harder to analyze, were not successful at hiding a developer's unique style. The researchers say that in the future, programmers may be able to use more sophisticated methods to conceal their style.

Greenstadt said: "I think that, while we continue, we will find one thing to hide these things are kind of confused I do not believe it will be the end of everything you do is always traceable. in any event, I hope not. "

For example, in another paper, a University of Washington team led by Lucy Simko found that programmers could craft code that tricks the algorithm into believing it was written by someone else. The team found that developers could forge another person's "coding signature" even without special training in forgery.

Future research

Greenstadt and Caliskan have also uncovered some interesting insights into the nature of programming. For example, they found that experienced developers are easier to identify than novices: the more skilled you are, the more unique your work becomes. This may be partly because beginner programmers often copy and paste solution code from sites like Stack Overflow.

Similarly, they found that code samples solving harder problems are also easier to attribute. Using a set of 62 programmers who each solved seven easy problems, the researchers could de-anonymize their work with 90 percent accuracy. When the researchers used samples of seven harder problems, accuracy rose to 95 percent.

In the future, Greenstadt and Caliskan want to study how other factors affect a person's coding style, such as what happens when members of the same organization collaborate on a project. They also want to explore whether people from different countries approach coding problems in different ways. For example, in a preliminary study, they found they could distinguish code samples written by Canadian developers from those written by Chinese developers with over 90 percent accuracy.

Another open question is whether the same attribution method can be applied across different programming languages in a standardized way. For now, the researchers emphasize, de-anonymizing code remains a somewhat mysterious process, even though their approach has so far proven effective.

Greenstadt said, "We are still trying to understand what makes something real can determine ownership."

Reprinted from the ATYUN artificial intelligence media platform. Original article: Machine learning can identify the programmers behind code by their style.
