A Python script bug could affect hundreds of academic studies

Python is widely acclaimed as a powerful and versatile programming language. It has a clear syntax and an ecosystem well suited to scientific computing, data analysis, interactive visualization, and more.

Not long ago, chemistry researchers at the University of Hawaii found an error in a set of Python scripts used to process chemical analysis data: running the scripts on different operating systems produces different results. While working through results from experiments on cyanobacteria, researcher Philip Williams and his colleagues found that the output of the script varied from one operating system to another. This led the researchers to question the published results of more than 150 chemistry papers.

The script, known as "Willoughby-Hoye", is about 1000 lines long, has been in use since 2014, and is used to calculate NMR chemical shift values. The researchers found that macOS Mavericks and Windows 10 gave the same result (173.2), which is what they expected, but macOS Mojave and Ubuntu gave different ones (172.4 and 172.7, respectively). The numbers look close, but in scientific research, where high precision is required, seemingly small differences can matter a great deal.

So why does this happen?

This bug is quite different from what we usually think of as a bug. A bug usually produces the same incorrect result everywhere, not only in certain environments. On inspection, the problem was traced to the way the data files are retrieved. The data for each run is stored in two files, which the script retrieves by file name and processes in pairs. The catch is that the order in which the files are retrieved depends on the operating system. As long as the files are matched correctly, the result is right. If they are not, the script pairs up and processes data from different runs.
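The failure mode can be illustrated with a minimal sketch. This is not the actual Willoughby-Hoye code: the file names and the `pair_by_position` helper are hypothetical, and the two listings simply stand in for the different orders two operating systems might return.

```python
# Hypothetical sketch: each run stores its data in two files,
# "<run>_a.out" and "<run>_b.out" (names invented for illustration).

def pair_by_position(listing):
    """Pair the i-th '_a' file with the i-th '_b' file, trusting listing order."""
    first = [name for name in listing if name.endswith("_a.out")]
    second = [name for name in listing if name.endswith("_b.out")]
    return list(zip(first, second))

# The same four files, in the order two operating systems might list them.
listing_os1 = ["run1_a.out", "run1_b.out", "run2_a.out", "run2_b.out"]
listing_os2 = ["run2_a.out", "run1_b.out", "run1_a.out", "run2_b.out"]

pairs_os1 = pair_by_position(listing_os1)  # each run paired with itself
pairs_os2 = pair_by_position(listing_os2)  # files from different runs mixed

print(pairs_os1)
print(pairs_os2)
```

With the first listing every pair belongs to a single run. With the second, `run2_a.out` is paired with `run1_b.out`, so the same code silently combines data from different runs.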

Many earlier news reports blamed glob, a module in the Python standard library. The script uses glob to find file paths that match a specific pattern, and the list of files to read is built from what glob returns. But the order of glob's results depends in turn on what the operating system's file system returns, so the script's output can be affected by the order in which files are processed. The Python documentation describes the glob module as follows:

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

So this is not a bug in glob: it returns results in arbitrary order and never promised a particular one. The fault lies not with Python but with whoever wrote the script. The author should have specified the required sort order in the code to guarantee consistent behavior. A careful programmer who reads the documentation closely will notice caveats like this one (glob returns results in arbitrary order) and account for them when writing code. The episode is also a reminder that good programming practice matters, especially in rigorous scientific computing, where a tiny error at the start can lead to a large deviation at the end.
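One common way to define the ordering explicitly is to sort what glob returns. The sketch below is an illustration under assumptions, not the fix actually applied to the script: the file names and the `list_matching_sorted` helper are hypothetical, and a temporary directory is used so the example is self-contained.

```python
import glob
import os
import tempfile


def list_matching_sorted(pattern):
    """Return the paths matching pattern in a deterministic, sorted order."""
    # glob.glob makes no promise about order, so sort explicitly.
    return sorted(glob.glob(pattern))


# Create a few empty files (hypothetical names) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["run2_a.out", "run3_a.out", "run1_a.out"]:
        open(os.path.join(tmp, name), "w").close()
    matches = list_matching_sorted(os.path.join(tmp, "*_a.out"))
    names = [os.path.basename(path) for path in matches]

print(names)
```

Note that `sorted` orders strings lexicographically, so file names with unpadded numbers (for example `run10` and `run2`) may need a custom sort key.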

After discovering the problem, Williams and his colleagues added the necessary sorting code to the script. He hopes scientists will pay more attention to the computational side of their experiments. As for how many other papers' results were affected by the script issue, he finds it hard to draw a conclusion.

In general, code is not what academia pays the most attention to. As a result, academic fields, computer science included, tend not to look too deeply into code quality. This is also why both academic researchers and people in industry sense a huge gap between "academic prototype code" and "industrial-grade code".

However, compared with bugs in industry code, the impact of code bugs on academic papers is, after all, limited.

Reference:

http://www.51testing.com/html/06/n-4462906.html

Origin www.oschina.net/news/110830/python-script-bug-affect-hundreds-research