Android App Data Acquisition: Obtaining Samples from the Androzoo Dataset (Downloading Google Play Apps)
Androzoo database introduction
Website link: Androzoo
From the website's homepage introduction: Androzoo is a publicly available dataset provided by the Université du Luxembourg. It is a growing collection of Android applications collected from multiple sources, including the official Google Play application market. (The share contributed by each market can be viewed in the Markets entry on the website.)
It currently (2021.4.27) contains 15,082,219 different APKs.
My experiment requires applications from Google Play. I originally tried to obtain them from apkpure, but my crawler was denied access and I could not find a good workaround, so I asked my seniors, who pointed me to this dataset.
Androzoo database sample acquisition method
The website describes how to use the API; see the API Documentation entry. Applications can be downloaded with the curl command, through a browser, or from a script.
The following uses the curl command as an example. The command pattern provided on the website is:
curl -O --remote-header-name -G -d apikey=${APIKEY} -d sha256=${SHA256} \
    https://androzoo.uni.lu/api/download
As you can see, downloading an application requires an APIKEY (which identifies the user) and a SHA256 (which specifies the application).
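As a rough Python counterpart to the curl command, here is a minimal sketch that builds the same GET request (endpoint plus the apikey and sha256 query parameters). The function names and the <sha256>.apk output name are my own assumptions for illustration; curl's --remote-header-name would instead use the file name sent by the server.

```python
import shutil
import urllib.parse
import urllib.request
from typing import Optional

# Endpoint taken from the curl command pattern on the website.
API_URL = "https://androzoo.uni.lu/api/download"


def build_download_url(apikey: str, sha256: str) -> str:
    """Build the URL that `curl -G -d apikey=... -d sha256=...` would request."""
    query = urllib.parse.urlencode({"apikey": apikey, "sha256": sha256})
    return f"{API_URL}?{query}"


def download_apk(apikey: str, sha256: str, out_path: Optional[str] = None) -> str:
    """Download one APK and return the path it was written to.

    Assumption of this sketch: the file is saved as <sha256>.apk unless
    out_path is given.
    """
    target = out_path or f"{sha256}.apk"
    with urllib.request.urlopen(build_download_url(apikey, sha256)) as resp:
        with open(target, "wb") as out:
            # Stream the response to disk instead of loading it all into memory.
            shutil.copyfileobj(resp, out)
    return target
```

Calling download_apk(my_key, some_sha256) with a valid key and hash should then fetch a single application; looping over a list of hashes gives a simple batch download.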
Get APIKEY
As explained in the Access entry, an application email needs to be sent to the address listed there ([email protected]), stating the name of the research institution and the name of the person requesting access. Make sure to send the application from a university (or research institution) email account.
Get SHA256
The list entry provides a CSV file (updated every day before 6 am Luxembourg local time) that can be downloaded. The fields in the file include:
sha256,sha1,md5,apk_size,dex_size,dex_date,pkg_name,vercode,vt_detection,vt_scan_date,markets
Among them is the sha256 needed for downloading; just fill it into the command.
A detailed explanation of what each field means can be read in the list entry.
The more commonly used fields are probably pkg_name and markets, which give the application's package name and the application market(s) the file came from, respectively.
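To make the use of these fields concrete, here is a small sketch that looks up the sha256 values for a given pkg_name. It assumes the CSV begins with a header row naming the fields listed above; the function names and the default path latest.csv.gz are chosen for illustration.

```python
import csv
import gzip


def sha256_for_package(csv_lines, pkg_name):
    """Yield (sha256, markets) for every row whose pkg_name matches.

    csv_lines is any iterable of text lines whose first line is the header
    (an assumption of this sketch).
    """
    for row in csv.DictReader(csv_lines):
        if row["pkg_name"] == pkg_name:
            yield row["sha256"], row["markets"]


def open_index(path="latest.csv.gz"):
    """Open the gzip-compressed CSV as text without unpacking it on disk."""
    return gzip.open(path, "rt", encoding="utf-8", newline="")
```

With the downloaded file in the working directory, something like `with open_index() as f: hits = list(sha256_for_package(f, "com.example.app"))` would collect every version of that (hypothetical) package.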
The entry also gives usage examples further down. Here is a brief walkthrough of the first one:
Select only APKs from Google Play Store:
zcat latest.csv.gz | grep -v ',snaggamea' | awk -F, '{if ($11 ~ /play.google.com/) {print} }'
zcat
A command-line utility for viewing the contents of compressed files without unpacking them on disk. The first command therefore prints the contents of the latest CSV file.
grep -v ',snaggamea'
The warning states:
There is a fake APK (BC564D52C6E79E1676C19D9602B1359A33B8714A1DC5FCB8ED602209D0B70266) whose pkg_name contains ",". Use grep -v ',snaggamea' to get rid of it.
awk
A text-processing tool that reads its input line by line, splits each line into fields (whitespace is the default delimiter), and then processes the resulting fields.
The usage is:
awk [-F field-separator] 'commands' input-file(s)
commands is the actual awk program, [-F field-separator] is optional, and input-file(s) are the files to process.
In awk, each item in a line delimited by the field separator is called a field. When -F field-separator is not specified, the default field separator is whitespace (spaces and tabs).
That is, the command uses , as the delimiter and prints a line if its 11th field (markets) matches play.google.com.
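For readers more comfortable with Python than with shell pipelines, the same three steps can be sketched as below. This is only an approximate line-by-line equivalent; the function names are my own, and the regular expression is copied unchanged from the awk command (so, as in awk, each dot matches any character).

```python
import gzip
import re

# Same pattern as in the awk command.
PLAY_PATTERN = re.compile(r"play.google.com")


def is_google_play_row(line: str) -> bool:
    """Return True if the pipeline above would print this line."""
    if ",snaggamea" in line:
        # grep -v ',snaggamea': drop the fake APK whose pkg_name contains ","
        return False
    fields = line.rstrip("\n").split(",")  # awk -F,
    # awk's $11 is index 10 here; awk treats a missing field as empty.
    markets = fields[10] if len(fields) > 10 else ""
    return PLAY_PATTERN.search(markets) is not None


def google_play_lines(lines):
    """Yield only the lines that come from Google Play."""
    for line in lines:
        if is_google_play_row(line):
            yield line


def print_google_play_rows(path="latest.csv.gz"):
    """Rough equivalent of the whole zcat | grep | awk pipeline."""
    with gzip.open(path, "rt", encoding="utf-8") as f:  # zcat
        for line in google_play_lines(f):
            print(line, end="")
```

Splitting on every comma is what makes the fake APK a problem: a comma inside pkg_name shifts all later fields by one, which is why that row is filtered out before the fields are counted.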
Further notes
Judging from the code my seniors gave me, they appear to have imported the CSV file into a database. Obtaining the SHA256 through a database query is also a very good approach.
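I do not know which database my seniors actually used, so the following is only a sketch of the idea using SQLite from the Python standard library. The table name apks, the function names, and the header-row assumption are all mine.

```python
import csv
import sqlite3

# Field names as listed for the CSV file.
FIELDS = ["sha256", "sha1", "md5", "apk_size", "dex_size", "dex_date",
          "pkg_name", "vercode", "vt_detection", "vt_scan_date", "markets"]


def load_index(conn, csv_lines):
    """Create a table named apks (hypothetical name) and fill it from the CSV.

    Assumes the first line of csv_lines is a header row with the field names.
    """
    conn.execute(f"CREATE TABLE IF NOT EXISTS apks ({', '.join(FIELDS)})")
    placeholders = ", ".join("?" for _ in FIELDS)
    reader = csv.DictReader(csv_lines)
    conn.executemany(
        f"INSERT INTO apks VALUES ({placeholders})",
        ([row[name] for name in FIELDS] for row in reader),
    )
    conn.commit()


def google_play_sha256(conn, limit=10):
    """Return up to `limit` sha256 values whose markets mention Google Play."""
    cur = conn.execute(
        "SELECT sha256 FROM apks WHERE markets LIKE ? ORDER BY sha256 LIMIT ?",
        ("%play.google.com%", limit),
    )
    return [r[0] for r in cur]
```

Once the file is loaded, selecting samples becomes an ordinary SQL query, which is convenient when the same index is filtered repeatedly with different conditions.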
While searching, I also found a download script that someone wrote themselves. I am attaching it here:
https://github.com/E0HYL/AndrozooDownloader