Blockchain is Watching You: Profiling and Deanonymizing Ethereum Users

The paper I will walk through today is about building blockchain user portraits. Its title translates as "The Blockchain is Watching You: Analyzing and Deanonymizing Ethereum Users".


  Bitcoin was born in 2009 and brought with it a new, decentralized model of trading. In Bitcoin, transaction records are pseudonymous: addresses are used in place of accounts. Observers can see the transactions between addresses but cannot directly obtain the identity of the user behind them, so user privacy is well protected.
   However, this anonymity is a serious obstacle for national regulators. Illegal activities such as money laundering, smuggling, and drug dealing are hard to track and identify under this anonymous transaction model. De-anonymizing such transactions is therefore necessary.
   To address the problem of blockchain deanonymization, the author of today's paper explores it on Ethereum and proposes some new solutions.
   I will cover three parts: related concepts, the author's experimental methods and conclusions, and some thoughts on on-chain data analysis and user portrait construction.

Related concepts

Quasi-identifier

   Data is represented as a table in which each row is a record and each column is an attribute. Each record is associated with a specific user or individual. The attributes fall into three categories: 1. Identifier: directly identifies an individual, such as an ID number; 2. Quasi-identifier set: the minimal set of attributes that can be joined with external tables to identify an individual, such as {Zip code, age} in Figure 1; 3. Sensitive data: data the user does not want known, such as the disease column in Figure 2. Everything in the table other than identifiers and quasi-identifiers can be treated as sensitive data.
   When a data table is published, disclosure of users' sensitive data should be avoided; that is, observers should not be able to associate a record with a particular user. Information disclosure falls into two categories: 1. Identity disclosure: a user is linked to a specific record; 2. Attribute disclosure: newly published information lets observers infer a user's characteristics more accurately.
   As shown below, Figure 2 is a k-anonymized version of the data in Figure 1: within each equivalence class, a record cannot be distinguished from the other k-1 records by its quasi-identifiers.

Figure 1. Raw data

Figure 2. 3-anonymous data
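
The idea behind Figures 1 and 2 can be sketched in a few lines of Python. This is only a minimal illustration: the rows, the zip-code masking, and the ten-year age buckets are assumptions made up for the example and are not the data shown in the figures.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, age, disease). Not the data from Figure 1.
RAW = [
    ("47677", 29, "heart disease"),
    ("47602", 22, "flu"),
    ("47678", 27, "cancer"),
    ("47905", 43, "flu"),
    ("47909", 45, "heart disease"),
    ("47906", 47, "cancer"),
]


def generalize(record):
    """Coarsen the quasi-identifiers: keep 3 zip digits, bucket age by decade."""
    zip_code, age, disease = record
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", disease)


def is_k_anonymous(records, k):
    """True if every (zip, age) combination is shared by at least k records."""
    groups = Counter((zip_code, age) for zip_code, age, _ in records)
    return all(count >= k for count in groups.values())


anonymized = [generalize(r) for r in RAW]
print(is_k_anonymous(RAW, 2))         # raw quasi-identifiers are all unique
print(is_k_anonymous(anonymized, 3))  # two equivalence classes of three records
```

Note that the sensitive column is left untouched; only the quasi-identifiers are generalized.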

De-anonymization

  Deanonymization refers to a data mining strategy in which anonymized data is cross-referenced with other data sources to re-identify the anonymized data source. Any information that distinguishes one data source from another can be used for deanonymization.
   We take social networks as the main example here, because graph structure comes up again later. De-anonymizing a social network mainly means de-anonymizing its nodes: identifying a node reveals a person's real information. The methods fall into two categories, mapping-based and guessing-based. A mapping-based method matches the nodes of a real network structure that the attacker knows or has crawled against the nodes of the published anonymized network. A guessing-based method uses the attacker's background knowledge to find one or more matching nodes in the published data. The deanonymization performed by the author of this paper is essentially address matching, that is, node matching.

Figure 3. Social Network

User portrait (User Profiling)

  A user portrait, that is, the labeling of user information, is the abstract picture of a user that a company builds after collecting and analyzing data on consumers' social attributes, living habits, and consumption behavior.

Figure 4. User portrait
  The core work of building a user portrait is to "label" users. A label is usually a manually defined, highly condensed feature identifier, such as age, gender, region, or user preference. Combining all of a user's labels outlines a three-dimensional "portrait" of that user. The meaning of the user portrait is shown in Figure 5.


Figure 5. The meaning of user portrait

Node embedding

  Formally, an embedding uses a low-dimensional dense vector to "represent" an object. The object can be a word (Word2Vec), an item (Item2Vec), or a node in a network (Graph Embedding). "Represent" means that the embedding vector expresses some characteristics of the object, and that the distance between vectors reflects the similarity between objects. Put more bluntly, the embedding is a matrix that maps each object to a vector; if two objects are very similar, the distance between their vectors is small. Here are a few examples:

  • Word Embedding: maps words to vectors. If two words have similar meanings, the Euclidean distance between their vectors is small
  • User Embedding: maps users to vectors. If two users have similar behavior habits, the Euclidean distance between their vectors is small
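
To make "similar objects have a small Euclidean distance" concrete, here is a minimal sketch. The three-dimensional vectors are hand-picked, hypothetical values for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
EMBEDDINGS = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(name, embeddings):
    """Return the other object whose vector is closest to `name`."""
    others = (key for key in embeddings if key != name)
    return min(others, key=lambda key: euclidean(embeddings[name], embeddings[key]))


print(nearest("king", EMBEDDINGS))
```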

       With embeddings in hand, we can turn to node embedding. In traditional machine learning, features are extracted specifically for the downstream task; if the task changes, the features must change too. Feature extraction is therefore tied to the downstream task.
       Traditional machine learning algorithms also struggle to extract features from graph-structured data, because much of the information in a graph is implicit, abstract, high-dimensional, and task-independent. Graph representation learning reduces this high-dimensional information to a low-dimensional space while preserving as much of the original graph's structure as possible. It has another important advantage: it weakens the dependence on downstream tasks, so features do not have to be re-extracted for every new task.
       Node embedding therefore maps the nodes of a graph into a vector space while preserving some properties of the nodes.

    Figure 6. node embedding


    Figure 7. The original network is mapped to the embedding space
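
As a rough illustration of mapping graph nodes to vectors, the sketch below uses a deliberately simple landmark-distance embedding: each node is represented by its hop distances to a few chosen landmark nodes. This is a stand-in technique chosen for brevity, not one of the node embedding methods evaluated in the paper, and the toy graph and landmark choice are assumptions.

```python
from collections import deque

# Hypothetical undirected toy graph: two triangles joined by the edge c-d.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def landmark_embedding(adj, landmarks):
    """Map each node to its hop distances to the landmarks (connected graph assumed)."""
    per_landmark = [bfs_distances(adj, landmark) for landmark in landmarks]
    return {node: [dist[node] for dist in per_landmark] for node in adj}


adj = build_adjacency(EDGES)
vectors = landmark_embedding(adj, landmarks=["a", "f"])
print(vectors["a"], vectors["b"], vectors["e"])
```

Nodes in the same triangle (such as a and b) end up with nearby vectors, while nodes in different triangles (such as a and e) end up far apart, which is the property a node embedding is meant to preserve.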

      After the nodes are mapped to the vector space, various downstream tasks can be performed.

    Figure 8. Downstream tasks

    Danaan-Gift attack

       The Danaan-Gift attack was first proposed in "Privacy Aspects and Subliminal Channels in Zcash". The attacker sends a small, tainted amount of Zcash to the target's shielded address; the value survives when the target later de-shields the funds, so it acts as a marker. The attack relies on the fact that Zcash transaction values are highly precise: the last few digits have no economic significance but can serve as a fingerprint.
       In the author's definition, the fingerprint of a transaction value is its last 7 digits in Zatoshis, of which the last 4 are particularly stable. Because the last four digits fall below the transaction fee threshold, they have little economic significance and are merely the remnant of a previous transaction. Mining pool payouts are calculated with full precision, which produces a random distribution in the least significant digits, so fingerprint values can be treated as unique. In the author's experiment, two fingerprints are considered a match if 5 of the last 7 digits are the same or if the last 4 digits are all equal.
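
The matching rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and "5 of the last 7 digits are the same" is interpreted here as at least 5 digits agreeing position by position, which is an assumption.

```python
def fingerprint(value_zatoshi):
    """Last 7 decimal digits of a value in Zatoshis, zero-padded to 7 characters."""
    return f"{value_zatoshi % 10**7:07d}"


def fingerprints_match(value_a, value_b):
    """Match if at least 5 of the 7 digits agree position-wise (assumed reading),
    or if the last 4 digits are identical."""
    fa, fb = fingerprint(value_a), fingerprint(value_b)
    same_positions = sum(x == y for x, y in zip(fa, fb))
    return same_positions >= 5 or fa[-4:] == fb[-4:]


# Hypothetical values: different amounts that share the same low-order digits.
print(fingerprints_match(150_003_456_789, 990_003_456_789))
print(fingerprints_match(1_234_567, 7_654_321))
```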

    Author's experiment and conclusion

       The author of this article mainly conducted three experiments: Ethereum user portrait and deanonymization, deanonymization of currency mixing service, and Danaan-Gift attack on Ethereum.

    Experimental data

       The experimental data in this paper consists of Ethereum addresses from three sources: the Twitter API, Humanity DAO, and Tornado Cash (TC) mixer contracts. After collecting the addresses, the author queries their transactions through an Ethereum block explorer.

    • The Twitter API provides ENS names; each ENS name can be associated with one or more addresses
    • Humanity DAO can be understood as an experiment that encourages participants to register in a decentralized way
    • TC mixer contracts are currency-mixing contracts: multiple participants deposit the same amount of funds into a contract to form an anonymity set


    Figure 9. ENS names from Twitter


    Figure 10. Average transaction volume for the three data sources

    Evaluation method

       The paper uses two evaluation metrics: AUC and entropy gain.

    AUC

       For the first two experiments, the algorithm returns a ranked list of candidate pairs for each account in the test set. Exactly one pair in each ranked list is the correct match. AUC can be expressed as $AUC = avg\left(\frac{r(a)}{|c(a)|}\right)$, averaged over all accounts $a$, where $r(a)$ is the rank of the correct pair and $c(a)$ is the candidate list returned for $a$.
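
A direct transcription of this formula might look like the sketch below. The function name and input format are hypothetical. The text does not say whether rank 1 denotes the top or the bottom of the list, so the function simply takes each rank as given.

```python
def auc(rank_results):
    """Average of r(a) / |c(a)| over all test accounts a.

    `rank_results` is a list of (rank_of_correct_pair, number_of_candidates)
    tuples, one per account.
    """
    if not rank_results:
        raise ValueError("need at least one account")
    return sum(rank / size for rank, size in rank_results) / len(rank_results)


# Hypothetical example: three accounts, each with four candidate pairs.
print(auc([(1, 4), (2, 4), (4, 4)]))
```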

    Entropy gain

       Besides the AUC of the matches, the author wants to quantify the privacy lost through deanonymizing matches. The information obtained by the attacker is expressed as an entropy gain, the difference between the prior entropy and the posterior entropy. Note: the prior entropy is the entropy when no deanonymization method has been applied. For TC mixer contracts the size of the anonymity set changes dynamically, so we need to show that the size of the anonymity set does not affect the comparison of entropy gains.
    Proof: the size of the prior anonymity set has no effect on the entropy gain
       Let $\Delta = gain(2n, p) - gain(n, p)$, that is, for a fixed probability distribution $p$, the difference between the entropy gains of anonymity sets of size $2n$ and $n$. If the probability distribution is smooth and changes little within a neighborhood, this difference is small, so the size of the prior anonymity set can be treated as having approximately no effect on the entropy gain.
    Infer the posterior probability distribution
       For each anonymity set of size $n$ in which the correct matching pair has rank $r$, let $P(n, r)$ be the uniform distribution on $[(r-1)/n, r/n]$. The posterior probability distribution is the average of the $P(n, r)$. Note: as mentioned above, the algorithm returns a list of candidate pairs for each account, and this list is in effect an anonymity set.
    Calculate entropy gain
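
One way to turn the steps above into numbers is sketched below. It is a rough illustration under stated assumptions rather than the paper's implementation: the interval [0, 1] is discretized into equal bins (the bin count is an arbitrary choice), the prior is taken to be uniform over those bins, and the function names are hypothetical.

```python
import math


def posterior_mass(observations, bins=64):
    """Average the uniform distributions on [(r-1)/n, r/n] over all (n, r) pairs.

    Returns the probability mass that falls into each of `bins` equal bins on [0, 1].
    """
    mass = [0.0] * bins
    for n, r in observations:
        lo, hi = (r - 1) / n, r / n
        for b in range(bins):
            b_lo, b_hi = b / bins, (b + 1) / bins
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            mass[b] += overlap / (hi - lo)  # share of this uniform inside bin b
    return [m / len(observations) for m in mass]


def entropy_bits(distribution):
    """Shannon entropy in bits, ignoring empty bins."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def entropy_gain(observations, bins=64):
    """Prior entropy (uniform over the bins, an assumption) minus posterior entropy."""
    return math.log2(bins) - entropy_bits(posterior_mass(observations, bins))


# Correct pair always ranked first of four: the posterior covers a quarter of [0, 1].
print(entropy_gain([(4, 1)]))
# Correct pair equally likely at every rank: nothing is learned.
print(entropy_gain([(4, 1), (4, 2), (4, 3), (4, 4)]))
```

In this discretized view, a posterior confined to a quarter of the interval corresponds to a gain of about two bits, and a flat posterior corresponds to a gain of about zero.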

    Experiment 1: Analysis of Ethereum user portraits

       The author selects ENS names with exactly two addresses and uses three quasi-identifiers to profile users: transaction time, gas fee, and position in the Ethereum transaction graph. The goal is to link accounts that belong to the same user.

    Figure 11. Transaction profiles for two ENS names
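
To give a feel for how transaction time can act as a quasi-identifier, the sketch below builds an hour-of-day activity profile for each address and compares profiles. The hourly bins, the use of cosine similarity, and the timestamps are all assumptions for illustration; this walkthrough does not specify the paper's exact representation or similarity measure.

```python
import math
from datetime import datetime, timezone


def hourly_profile(timestamps):
    """Normalized 24-bin histogram of transaction hours in UTC (non-empty input assumed)."""
    counts = [0] * 24
    for ts in timestamps:
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    total = sum(counts)
    return [c / total for c in counts]


def cosine_similarity(p, q):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


# Hypothetical Unix timestamps: addr_a and addr_b are active in the morning,
# addr_c late in the evening.
addr_a = [9 * 3600, 9 * 3600 + 600, 10 * 3600]
addr_b = [86400 + 9 * 3600 + 100, 86400 + 10 * 3600]
addr_c = [22 * 3600, 23 * 3600]

print(cosine_similarity(hourly_profile(addr_a), hourly_profile(addr_b)))
print(cosine_similarity(hourly_profile(addr_a), hourly_profile(addr_c)))
```

Addresses with similar daily rhythms score close to 1, while addresses active at disjoint hours score 0, so candidate pairs can be ranked by this score.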

       For the transaction graph, the author pioneers the use of node embedding methods to match and identify accounts of the same user, and compares them with identification based on each of the other two quasi-identifiers alone.


    Figure 12. AUC using transaction time only


    Figure 13. AUC using gas fees only


    Figure 14. AUC of twelve node embedding methods


    Figure 15. Entropy gain using only transaction time


    Figure 16. Entropy gain using gas fee only


    Figure 17. Entropy gains for twelve node embedding methods

    Experiment 2: Deanonymizing the currency mixing service

      In the TC mixer contracts, deposit addresses and redemption (withdrawal) addresses may be reused, which creates a privacy risk. The author therefore applies node embedding to deanonymize the mixer, looking for the redemption address that corresponds to a given deposit address.

    Figure 18. Number of redeeming addresses found for a given rank
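
The address-reuse risk in this experiment can be illustrated with a much simpler check than node embedding: look for addresses that appear on both sides of the mixer. The sketch below shows only that reuse signal; the event format and the addresses are hypothetical and are not real mixer data.

```python
# Hypothetical mixer events as (address, kind) pairs. Not real mixer data.
EVENTS = [
    ("0xaaa", "deposit"),
    ("0xbbb", "deposit"),
    ("0xccc", "deposit"),
    ("0xaaa", "withdraw"),  # the same address is reused for the withdrawal
    ("0xddd", "withdraw"),
]


def reused_addresses(events):
    """Addresses that appear both as a depositor and as a withdrawal recipient."""
    depositors = {addr for addr, kind in events if kind == "deposit"}
    withdrawers = {addr for addr, kind in events if kind == "withdraw"}
    return depositors & withdrawers


def unlinked_depositors(events):
    """Depositors that the reuse check alone cannot link to a withdrawal."""
    depositors = {addr for addr, kind in events if kind == "deposit"}
    return depositors - reused_addresses(events)


print(sorted(reused_addresses(EVENTS)))
print(sorted(unlinked_depositors(EVENTS)))
```

Every reused address shrinks the effective anonymity set for the remaining participants, which is why reuse weakens the mixer for everyone and not only for the careless user.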

    Experiment 3: Danaan-Gift attack on Ethereum

      The author analyzes the survival probability of fingerprints under a Danaan-Gift attack on Ethereum, showing that this attack is possible on Ethereum.

    Thoughts on blockchain deanonymization and user portrait construction

    The number of addresses can mislead you without your noticing
       The number of addresses is the most commonly misleading metric, because not all addresses are equally important. An address created for a temporary transfer obviously cannot be compared with a wallet address that holds assets for a long time.

    Anonymity vs. Interpretability
       Anonymity and interpretability pull against each other in blockchain datasets: the more anonymous a blockchain dataset is, the harder it is to get meaningful information from it.
    Deanonymization vs. Privacy Preservation
    Deanonymizing a blockchain dataset does not require knowing the true identity of each participant. "What you are" is far more important than "who you are".

    Origin blog.csdn.net/qq_41988893/article/details/123951849