Android App Data Acquisition: Obtain samples from the Androzoo dataset (download the Google Play app)

Introduction to the Androzoo dataset

Website link: Androzoo (https://androzoo.uni.lu)

According to the introduction on the website's homepage, Androzoo is a publicly available dataset provided by the Université du Luxembourg. It is a growing collection of Android applications gathered from multiple sources, including the official Google Play application market. (The share contributed by each source can be viewed in the Markets entry on the website.)

It currently (2021.4.27) contains 15,082,219 different APKs.

My experiment requires applications from Google Play. I originally tried to obtain them from apkpure, but my crawler was denied access and I could not find a good workaround, so I asked my seniors, who pointed me to this dataset.

How to obtain samples from the Androzoo dataset

The website describes how to use the API; see the API Documentation entry. Applications can be downloaded with the curl command, from a browser, or from a script.

The following uses the curl command as an example. The command pattern provided on the website is:

curl -O --remote-header-name -G -d apikey=${APIKEY} -d sha256=${SHA256} https://androzoo.uni.lu/api/download

As you can see, downloading an application requires an APIKEY (which identifies the user) and a SHA256 (which specifies the application).
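For downloading from a script instead of curl, here is a minimal Python sketch of the same request. It only mirrors the curl pattern above (a GET request with apikey and sha256 as query parameters). The function names and the choice to save the file as <sha256>.apk are my own assumptions, not something the website prescribes.

```python
import os
import shutil
import urllib.parse
import urllib.request

# Endpoint taken from the curl command pattern shown on the website.
API_URL = "https://androzoo.uni.lu/api/download"


def build_download_url(apikey: str, sha256: str) -> str:
    """Build the GET URL, equivalent to curl's -G -d apikey=... -d sha256=..."""
    query = urllib.parse.urlencode({"apikey": apikey, "sha256": sha256})
    return f"{API_URL}?{query}"


def download_apk(apikey: str, sha256: str, out_dir: str = ".") -> str:
    """Download one APK and return the path it was saved to.

    The output name <sha256>.apk is a hypothetical convention chosen here.
    """
    out_path = os.path.join(out_dir, f"{sha256}.apk")
    url = build_download_url(apikey, sha256)
    with urllib.request.urlopen(url) as resp, open(out_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)  # stream the response body to disk
    return out_path


# Usage (needs a valid key and network access, so it is left commented out):
# download_apk("YOUR_APIKEY", "SHA256_OF_THE_APP")
```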

Getting the APIKEY

According to the instructions in the Access entry, an application email needs to be sent to [email protected], stating the name of the research institution and the name of the person requesting access. Make sure to send the application from a university (or research institution) email account.

Getting the SHA256

According to the description, the list entry provides a CSV file (updated every day before 6 am Luxembourg local time) that can be downloaded. The fields in the file include:

sha256,sha1,md5,apk_size,dex_size,dex_date,pkg_name,vercode,vt_detection,vt_scan_date,markets

Take the sha256 of each application you want to download and fill it into the command.

A detailed description of what each field means can be read in the list entry.

The more commonly used fields are probably pkg_name and markets, which give the application's package name and the application market(s) the file was collected from, respectively.

Usage examples are also given at the bottom of that entry. Here is a brief explanation of the first one:
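To make the fields concrete, here is a small Python sketch that reads rows by field name. It assumes the CSV starts with a header row containing the field names listed above, and that several markets in one row are separated by "|"; both are assumptions you should check against the list entry. The sample row is made up for illustration.

```python
import csv
import io


def iter_apk_rows(text_stream):
    """Yield each CSV row as a dict keyed by the header's field names."""
    yield from csv.DictReader(text_stream)


def split_markets(row):
    """Return the markets of one row as a list (assumes '|' separates them)."""
    return [m for m in row["markets"].split("|") if m]


# A made-up two-line sample standing in for the real CSV file.
sample = io.StringIO(
    "sha256,sha1,md5,apk_size,dex_size,dex_date,pkg_name,vercode,"
    "vt_detection,vt_scan_date,markets\n"
    "AAAA,BBBB,CCCC,1000,200,2020-01-01 00:00:00,com.example.app,1,0,,"
    "play.google.com|anzhi\n"
)
for row in iter_apk_rows(sample):
    print(row["sha256"], row["pkg_name"], split_markets(row))
```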

Select only APKs from Google Play Store: zcat latest.csv.gz | grep -v ',snaggamea' | awk -F, '{if ($11 ~ /play.google.com/) {print} }'

zcat is a command-line utility for viewing the contents of compressed files without decompressing them to disk. The first command prints the contents of the latest CSV file.
grep -v ',snaggamea' drops every line containing that string. The reason is given in a warning on the website:

There is a fake APK (BC564D52C6E79E1676C19D9602B1359A33B8714A1DC5FCB8ED602209D0B70266) whose pkg_name contains ",". Use grep -v ',snaggamea' to get rid of it.

awk is a text-processing tool that reads a file line by line, splits each line into pieces (using whitespace as the default delimiter), and then processes those pieces.
Its usage is:

awk [-F field-separator] 'commands' input-file(s)
Here commands are the actual awk commands, [-F field-separator] is optional, and input-file(s) are the files to process.
In awk, each line of a file is a record, and each item delimited by the field separator is called a field. When -F field-separator is not specified, the default field separator is whitespace (spaces and tabs).

That is, the command uses , as the delimiter and prints a line only if its 11th field (markets) matches play.google.com.
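The same pipeline can be sketched in Python, which I find easier to extend than a shell one-liner. This is only my own rough equivalent of zcat | grep -v | awk, not code from the website; the function name is hypothetical, and latest.csv.gz is the file name used in the example command.

```python
import gzip
import re

# Same pattern as awk's /play.google.com/ (note: '.' matches any character).
PLAY_PATTERN = re.compile(r"play.google.com")


def google_play_lines(lines):
    """Yield only the lines whose 11th comma-separated field matches Google Play."""
    for line in lines:
        if ",snaggamea" in line:  # grep -v ',snaggamea': drop the fake APK
            continue
        fields = line.rstrip("\n").split(",")  # awk -F,
        # $11 in awk is index 10 in Python.
        if len(fields) >= 11 and PLAY_PATTERN.search(fields[10]):
            yield line


# Usage (reads the compressed file directly, like zcat):
# with gzip.open("latest.csv.gz", "rt", encoding="utf-8") as fh:
#     for line in google_play_lines(fh):
#         print(line, end="")
```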

What I learned

Judging from the code my seniors gave me, they appear to have imported the CSV file into a database. Obtaining the SHA256 values through database queries is also a very good way to handle this.

While searching, I also found that an experienced developer has written a download script, which I link here.
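I do not know which database my seniors actually used, so the following is only a minimal sketch of the idea with Python's built-in sqlite3. The table name apks, the function names, and the assumption that the CSV has a header row are all mine. Rows that do not have exactly 11 fields are skipped, which also drops the malformed fake-APK row if its pkg_name is split by the extra comma.

```python
import csv
import io
import sqlite3

FIELDS = ["sha256", "sha1", "md5", "apk_size", "dex_size", "dex_date",
          "pkg_name", "vercode", "vt_detection", "vt_scan_date", "markets"]


def load_csv(conn, text_stream):
    """Create a table named 'apks' (hypothetical) and insert the CSV rows."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS apks ({', '.join(FIELDS)})")
    reader = csv.reader(text_stream)
    next(reader, None)  # skip the header row (assumed to be present)
    placeholders = ", ".join("?" for _ in FIELDS)
    rows = (row for row in reader if len(row) == len(FIELDS))
    conn.executemany(f"INSERT INTO apks VALUES ({placeholders})", rows)
    conn.commit()


def google_play_sha256(conn):
    """Return the sha256 of every row whose markets mention play.google.com."""
    cur = conn.execute(
        "SELECT sha256 FROM apks "
        "WHERE markets LIKE '%play.google.com%' ORDER BY sha256"
    )
    return [row[0] for row in cur]


# Usage with an in-memory database and a made-up sample:
conn = sqlite3.connect(":memory:")
load_csv(conn, io.StringIO(
    ",".join(FIELDS) + "\n"
    "AAAA,s,m,1,2,2020-01-01,com.a,1,0,,play.google.com\n"
    "BBBB,s,m,1,2,2020-01-01,com.b,1,0,,anzhi\n"
))
print(google_play_sha256(conn))
```

The returned hashes can then be filled into the download command one by one.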
https://github.com/E0HYL/AndrozooDownloader

Origin blog.csdn.net/m0_54352040/article/details/116208636