Data Pricing in Machine Learning Pipelines

Authors: Zicun Cong, Xuan Luo, Jian Pei, Feida Zhu, Yong Zhang

Abstract

Machine learning is disruptive. At the same time, machine learning can only succeed through multi-party collaboration across multiple steps, like a pipeline in an ecosystem: collecting data for possible machine learning applications, training models collaboratively among multiple parties, and delivering machine learning services to end users. Data is critical and permeates the entire machine learning pipeline. Because machine learning pipelines involve many parties and, to be successful, must form a constructive and dynamic ecosystem, markets and data pricing are fundamental to connecting and facilitating these parties. In this paper, we survey the principles and recent research progress of data pricing in machine learning pipelines. We begin with a brief review of data markets and pricing requirements. We then focus on pricing in three important steps of the machine learning pipeline. To understand pricing in the training data collection step, we review the pricing of raw datasets and data labels. We also survey pricing in the collaborative training step of machine learning models, and outline pricing of machine learning models for end users in the deployment step. We also discuss a range of possible future directions.

Keywords data assets, data pricing, data products, machine learning, AI

1. Introduction

  Building a machine learning marketplace requires the cooperation of multiple parties.

  Data is crucial to machine learning. Machine learning models, especially deep models, rely on large amounts of data for training and testing. The subsequent deployment, updating, and maintenance of machine learning models also require data.

  Getting data is far from easy in machine learning. The party building a machine learning model faces the following challenges: (1) developing a training dataset: the cost of collecting data, creating appropriate labels, and ensuring data quality is high; (2) missing training data: the party often has to explore external sources and acquire the required data from outside; (3) providing machine learning services to others: data must be exchanged with other parties.

  Since data and models are essential in machine learning pipelines, exchanging data and models is the most basic interaction among the parties. Data and model marketplaces therefore become a natural choice for machine learning pipelines and ecosystems, and pricing becomes the core mechanism of machine learning pipelines.

  Data products refer to products and information services derived from datasets. Commoditizing data encourages data owners to share their datasets in exchange for remuneration and helps buyers obtain high-quality data products at scale.

  Agarwal et al. [1] summarize several properties that make data a unique asset: (1) data can be replicated at zero marginal cost; (2) the value of data is inherently combinatorial; (3) the value of data varies greatly among different buyers; (4) the usefulness of data lies in deriving valuable information from it, which is difficult to assess a priori. Due to these properties, pricing models for physical commodities cannot be directly applied or extended to data products, and new principles, theories, and methods are required.

  This article focuses on three steps in the machine learning pipeline related to data and model provisioning tasks: data collection (pricing raw datasets and pricing data labels), collaborative model training (pricing contributions of the provided data), and model deployment (pricing machine learning models).

The details are as follows:

  1. Pricing raw datasets. The first step in building a machine learning model is to collect training data; commoditized data can be traded to obtain a training dataset. The challenge in pricing raw datasets is how to set a price that reflects the usefulness of a dataset. In addition, the pricing model may be optimized toward different goals (revenue maximization, arbitrage freeness, truthfulness), which is also challenging.

  2. Pricing data labels. Crowdsourcing is now commonly used to acquire labels, but it faces a key challenge: how to account for label accuracy and compensate the corresponding crowd workers. This task typically arises in supervised learning.

  3. Revenue allocation in collaborative machine learning. Collaborative machine learning is popular: multiple data owners aggregate their data for model training and share the revenue from selling the resulting models. Since the contributions of the datasets vary, fair rewards are needed for every data owner's contribution.

  4. Pricing machine learning models. Customers can directly purchase trained machine learning models. However, models may be offered in different versions, which entails price differences.

  The above four tasks all concern linking the price of a data product in a transaction to its value to the customer. However, since these tasks have different application scenarios and pricing goals, they are addressed by orthogonal techniques.

  Existing models for the first two tasks price training data with absolute utility functions, i.e., the utility of a data product depends only on the attributes of the product itself. The important distinction between the two tasks lies in the utility function: the utility (e.g., accuracy) of data labels is difficult to compute due to the lack of ground truth for validation.
The third task evaluates the utility of a dataset by its marginal contribution to a machine learning model; the utility of one dataset thus depends on the other datasets used to jointly build the model.
Existing methods for the last task also employ absolute utility functions, but machine learning models and datasets have different properties, so new pricing models have been developed.

  These four tasks are connected when machine learning models and datasets are priced in an end-to-end fashion.

2. Data Market and Pricing

2.1. Data market

  A data marketplace is a platform on which people buy and sell data, for example, Dawex, Snowflake Data Marketplace, and BDEX.

  Muschalle et al. identified seven categories of participants in data marketplaces: analysts, application vendors, developers of data processing algorithms, data providers, consultants, licensing and certification entities, and data marketplace owners.

  Conceptual architectures of data marketplaces:

[Figure: (a) general data marketplace, (b) sell-side data marketplace, (c) buy-side data marketplace]

  Figure (a), general data marketplace. The data marketplace mainly consists of three entities: data sellers, an arbiter (also known as a data vendor or data broker), and data buyers. The arbiter collects data products from data sellers and sells them to data buyers. After the buyers pay, the arbiter distributes the revenue to the data sellers. In general, the arbiter is modeled as a not-for-profit participant in the data marketplace.

  Figure (b), sell-side data marketplace. A sell-side data marketplace has one data seller and multiple data buyers. In a sell-side market, the arbiter is operated by the monopoly data seller to sell that seller's data products. In the reviewed literature, sell-side markets are considered by pricing models for datasets in general and for specific types of data products, such as XML documents and data queries over relational databases.

  Figure (c), buy-side data marketplace. A buy-side data marketplace has multiple data providers and one data buyer. The arbiter is operated by the single data buyer to purchase data products from data providers. Many existing studies consider buy-side markets. For example, de Alfaro et al. study a buy-side market in which a single consumer pays crowd workers for labels to annotate the buyer's dataset.

2.2. Pricing strategy

  Three important categories of pricing strategies are cost-based pricing, customer value-based pricing, and competition-based pricing.

  • cost-based pricing

    • The price of a product is determined by adding a specific amount of markup to the cost .
    • Often adopted in personal data pricing , where the cost is the total privacy compensation to the data owner .
    • Disadvantages : Only internal factors are considered, external factors (competition, demand, etc.) are not considered
  • customer value-based pricing

    • The price of the product is determined mainly by the value of the product as perceived by the target customers.
    • The seller needs to estimate customers' demand for the product from their willingness and ability to pay.
    • This is the most commonly used data pricing strategy.
  • competition-based pricing

    • Product prices are strategically determined based on competitors' price levels and expected behavior.
    • Game theory provides powerful tools to implement this strategy, such as non-cooperative games: each seller is selfish and independently sets the price that maximizes its own profit, and the outcome of the competition is that the sellers' asking prices reach a Nash equilibrium.

  There are other pricing strategies as well, such as operations-oriented pricing, revenue-oriented pricing, and relationship-oriented pricing.

2.3. Four types of data markets

  Fricker and Maksimov identified four types of data markets: monopoly markets, oligopoly markets, markets with strong competition, and monopsony markets.

  • First, in a monopoly, the supplier has sufficient market power to set prices to maximize profits.
  • Second, in an oligopoly, a small number of suppliers dominate the market .
  • Third, in a highly competitive market where individual suppliers do not have sufficient market power to set profit-maximizing prices, prices tend to align with marginal costs.
  • Finally, in a monopsony, a single buyer controls the market as the sole consumer of the products offered by the sellers.

  Most studies assume an explicit or implicit monopoly (or monopsony) market structure, in which data sellers (or data buyers) do not care about competing with others. Balasubramanian et al. consider data pricing in oligopoly markets. Jiang et al. study a perfectly competitive market in which participants trade directly among themselves.

2.4. Requirements for Data Pricing

  Let’s review the six data pricing requirements identified by Pei et al .

  • truthfulness

    • In a real market, all participants are selfish and offer only prices that maximize their own utility.
    • A truthful market guarantees that reporting its true valuation is the best strategy for every participant.
    • Truthfulness simplifies the strategies of all participants and ensures basic fairness in the market.
  • revenue maximization

    • Revenue maximization is a strategy to increase the number of customers through lower prices .
    • Widely adopted by sellers in emerging markets to build market share and reputation.
  • fairness

    • In some cases, sellers need to cooperate in a transaction. A data market is fair to the participants in a coalition if the revenue generated by the coalition is distributed fairly among the sellers.

    • Assume that a group of sellers D = {s1, s2, ..., sn} cooperate in a transaction and receive payment v. Shapley cites four axioms of fair distribution:

      • Balance (efficiency): v must be fully distributed to the sellers in D (the sum of the sellers' Shapley values should equal the total payment to the coalition).
      • Symmetry : Sellers who contribute the same should get the same money.
      • Zero element (redundancy) : If a seller's data does not contribute to the alliance's acquisition of payment v, it should not receive any payment.
      • Additivity : If a set of sellers’ data can be used for two tasks t1 and t2, paying v1 and v2 respectively, then the payment for solving these two tasks t1 + t2 should be v1 + v2.
    • The Shapley value is a method of benefit allocation that distributes payment according to contribution: the greater the contribution, the greater the reward. It turns out that the Shapley value ψ(si) is the unique allocation satisfying the four axioms.

    • The Shapley value ψ(si) is defined as the average marginal contribution of si to all possible subsets of sellers S ⊆ D\{si}:

      $$\psi(s_i) = \frac{1}{N} \sum_{S \subseteq D \setminus \{s_i\}} \frac{U(S \cup \{s_i\}) - U(S)}{\binom{N-1}{|S|}}$$

      where U(·) is the utility function and N = |D|. The definition can equivalently be rewritten over permutations:

      $$\psi(s_i) = \frac{1}{N!} \sum_{\pi \in \Pi(D)} \left[ U\!\left(P_{\pi}^{s_i} \cup \{s_i\}\right) - U\!\left(P_{\pi}^{s_i}\right) \right]$$

      where π ∈ Π(D) is a permutation of the sellers and P_π^{s_i} is the set of sellers that precede seller si in π.

    • The uniqueness of the Shapley value under the fairness axioms, combined with its flexibility to support different utility functions, makes it a popular tool for implementing fair data markets.

  • No Arbitrage Pricing

    • Arbitrage is the activity of taking advantage of price differences between multiple markets.
    • In the data market, data sellers may offer multiple versions of a product. Buyers may circumvent the advertised price of the product they want by purchasing a combination of cheaper products, which harms the seller.
    • An ideal pricing function should ensure that there is no possibility of arbitrage .
  • privacy protection

    • In the data market, the privacy of buyers, sellers, and related third parties is very fragile and may be disclosed in different ways.
    • The focus is on privacy compensation: sensitive datasets are often traded with injected random noise for privacy protection. Datasets with less random noise are more accurate but may leak more privacy, so the data owners should receive more compensation.
  • Computational efficiency

    • Efficiently calculating prices for a large number of commodities and participants is a fundamental requirement of pricing models.
    • Prices should be computable in polynomial time with respect to the number of participants or data products. However, in some application scenarios, computing a pricing function with desirable properties, such as Shapley fairness, arbitrage freeness, and revenue maximization, takes exponential time.
    • Koutris et al. show that computing arbitrage-free prices of join queries over relational databases is NP-hard in general. How to efficiently compute prices with desirable properties is a technical challenge.

One more requirement arises when data is acquired for machine learning.

  • Contribution Incentive
    • When crowdsourcing data, a major challenge is ensuring that participants put in effort and provide accurate answers. If each task has a single fixed price, participants may submit arbitrary answers without even solving the task.
    • A desirable approach is to design appropriate rewards for crowdsourcing tasks that motivate participants to put in effort and provide higher-quality answers.

3. Pricing Raw Datasets

For pricing raw datasets, existing research considers four scenarios:
  The first and most traditional setting regards a dataset as an indivisible unit, does not consider competition between suppliers, and lets the intrinsic properties of the dataset determine its price.
  The second setting studies how to price indivisible datasets in a competitive market.
  In the third setting, consumers can buy a small part of an entire dataset, which is more flexible for consumers but introduces arbitrage problems.
  The final setting prices personal data through privacy compensation.

3.1. Pricing general data

Machine learning and statistical models are susceptible to poor-quality data, so pricing datasets based on their quality is a natural choice.

  Heckman et al. define a series of factors to measure data quality, such as the age of the data, data accuracy, and data volume, and propose a linear model that sets the price of a dataset as

  $$price(D) = p_0 + \sum_i w_i \cdot x_i$$

  where the $x_i$ are the quality factors and $p_0$ is a base price. Estimating the model parameters $w_i$ is difficult, because many datasets have no public prices associated with them.

  Yu and Zhang study the problem of multi-version trading of datasets built from different data quality factors. They assume that customers' demands and the highest acceptable prices for different versions are publicly known. A bilevel programming model is established to solve the problem: at the first level, the data seller determines the versions and their prices to maximize total revenue; at the second level, a group of buyers select data products to maximize their utilities. Solving the bilevel programming model is very difficult, so Yu and Zhang [107] propose a heuristic genetic algorithm to approximate it numerically.

3.2. Pricing crowdsourced data

Crowdsourcing can quickly and cheaply obtain large amounts of training data for machine learning models. In a crowdsourcing marketplace, task requesters initiate data collection tasks, and participants are compensated based on their reported costs. Since workers may overstate their costs, pricing models should incentivize workers to disclose their costs truthfully.

  Yang et al. design a reverse auction mechanism for mobile sensing data. The mechanism is truthful (every seller truthfully reports its collection cost), individually rational (no seller has a negative net profit), and profitable (the buyer has no negative net profit). The authors assume that a buyer has a set of sensing tasks Γ = {T1, T2, …, Tn}, and each task Ti has a value vi to the buyer. Each seller si chooses a subset of tasks Γi ⊆ Γ to complete, with cost ci. Seller si bids bi to sell the data and submits a task-bid pair (Γi, bi) to the buyer. After collecting all bids, the buyer selects a subset of sellers S as winners and determines the payment pi to each winner si.
  The proposed auction mechanism, MSensing, selects the winners S greedily. Starting with S = ∅, it iteratively adds the seller with the largest non-negative marginal utility as a winner. Every winner si ∈ S is paid pi such that si would not win the auction by bidding higher than pi. Specifically, MSensing runs the winner selection algorithm on the users S' = U \ {si}; the payment pi is the maximum price si could bid such that si would replace a user in S'. Note that pi ≥ bi: due to incomplete cost information, the buyer pays the seller more than its bid to incentivize the seller to disclose its actual cost. MSensing satisfies Myerson's characterization of truthful auction mechanisms.
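  To make the greedy winner-selection phase concrete, below is a minimal Python sketch under toy assumptions (the task values, seller task sets, and bids are hypothetical); it implements only the greedy selection described above and omits MSensing's critical-payment computation.

```python
# Simplified sketch of MSensing-style greedy winner selection (not the full
# payment rule). Task values, seller task sets, and bids are toy assumptions.

def coverage_value(task_values, covered):
    """Total value of the tasks covered so far."""
    return sum(task_values[t] for t in covered)

def greedy_winners(task_values, sellers):
    """
    task_values: dict task -> value to the buyer
    sellers: dict seller -> (set_of_tasks, bid)
    Iteratively pick the seller with the largest non-negative marginal
    utility (marginal covered value minus bid).
    """
    winners, covered = [], set()
    remaining = dict(sellers)
    while remaining:
        best, best_gain = None, float("-inf")
        for s, (tasks, bid) in remaining.items():
            gain = (coverage_value(task_values, covered | tasks)
                    - coverage_value(task_values, covered) - bid)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None or best_gain < 0:   # no seller adds non-negative utility
            break
        tasks, _ = remaining.pop(best)
        winners.append(best)
        covered |= tasks
    return winners

# Toy usage
tasks = {"T1": 5.0, "T2": 3.0, "T3": 4.0}
sellers = {"s1": ({"T1", "T2"}, 4.0), "s2": ({"T3"}, 1.0), "s3": ({"T2"}, 2.5)}
print(greedy_winners(tasks, sellers))   # ['s1', 's2']
```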

  Subsequent work by Jin et al. considers the case where the data buyer has a data quality requirement Qj for each sensing task. The authors propose a Vickrey–Clarke–Groves (VCG)-like truthful reverse combinatorial auction. They assume that the data quality qi of each seller is publicly known and is the same across sensing tasks. The authors first consider the case where each seller bids for only one set of sensing tasks Γi. The auction winners S must meet the quality requirement of each task tj, that is, ∑_{si∈S: tj∈Γi} qi ≥ Qj. The goal of the auction is to maximize the total utility of the buyer and the sellers. The authors prove that winner determination in this setting is NP-hard, and propose a greedy winner selection algorithm with a guaranteed approximation ratio to the optimal total utility. Each winner is paid its critical payment. The authors further study the total utility maximization problem in a more general scenario, where each seller can bid for multiple task bundles. They propose an iterative descending algorithm that achieves near-optimal total utility; however, that auction is not truthful.

  Koutsopoulos considers a setup similar to Jin et al., but assumes that the data buyer has only one sensing task. The author proposes a truthful reverse auction that minimizes the buyer's expected cost while ensuring the data quality requirement. The author assumes that the data buyer knows a priori the distribution of each seller's unit participation cost ci. The participation level xi of si is a positive real value indicating how much data is purchased from si. Given the sellers' bids, the data buyer determines the auction winners and their participation levels by solving a linear programming model that minimizes the total expected payment subject to the data quality constraint. Critical payments are made to the selected winners. All sellers bidding truthfully forms a Bayesian Nash equilibrium.

3.3. Pricing data query

The query-based pricing model tailors data purchases to users' needs. Customers can purchase only the parts of a dataset they are interested in through data queries and are charged based on the queries they issue. While this market mechanism provides buyers with great flexibility, a carelessly designed pricing model can open loopholes for arbitrage, enabling buyers to obtain query results at a cost lower than the advertised price.

  Given a database D and a set of query bundles S = {Q1, …, Qm}, a query bundle Q is determined by S if the answer to Q can be computed solely from the answers of the query bundles in S. A pricing function π is arbitrage-free if, whenever S determines Q,

$$\pi(Q) \le \sum_{i=1}^{m} \pi(Q_i),$$

that is, the answer to a query bundle Q cannot be obtained more cheaply from another set of query bundles.

  The first formal framework for arbitrage-free query-based data pricing was proposed by Koutris et al. The main idea is that the data seller first specifies the prices of several views V over a database, and an algorithm then determines the price of a query bundle Q. It is proved that if there is no arbitrage among the views in V, then there exists a unique arbitrage-free and discount-free pricing function π(Q). Specifically, π(Q) is the total price of the cheapest subset of V that determines Q, which can be found via query determinacy. They also analyze the complexity of evaluating the price function; unfortunately, pricing is NP-hard for a large class of practical queries. They develop polynomial-time algorithms for specific classes of join queries, such as chain queries and cyclic queries.
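  The following is a minimal sketch of this view-based pricing idea: the price of a query is the total price of the cheapest subset of priced views that determines it. The view prices and the determinacy oracle are toy assumptions, and the brute-force search merely stands in for the specialized algorithms developed by Koutris et al.

```python
# Minimal sketch of view-based arbitrage-free pricing in the spirit of
# Koutris et al.: the price of a query bundle Q is the total price of the
# cheapest subset of priced views that determines Q. Deciding determinacy is
# the hard part in practice, so it is supplied here as a callable oracle.
from itertools import combinations

def query_price(view_prices, determines, query):
    """
    view_prices: dict view_name -> price set by the seller
    determines: callable(set_of_views, query) -> bool
    Returns the price of `query`, or None if no subset of views determines it.
    Brute force over subsets, so only suitable for a handful of views.
    """
    views = list(view_prices)
    best = None
    for r in range(1, len(views) + 1):
        for subset in combinations(views, r):
            if determines(set(subset), query):
                cost = sum(view_prices[v] for v in subset)
                best = cost if best is None else min(best, cost)
    return best

# Toy usage: query "Q" can be answered either from V1 alone or from V2 + V3.
prices = {"V1": 10.0, "V2": 4.0, "V3": 3.0}
supports = {frozenset({"V1"}), frozenset({"V2", "V3"})}
det = lambda s, q: any(sup <= s for sup in supports)
print(query_price(prices, det, "Q"))   # 7.0, the cheaper determining subset
```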

  Subsequently, Koutris et al. developed a prototype pricing system, QueryMarket. They formulate the pricing model as an integer linear program (ILP) whose goal is to minimize the total cost of the purchased views. The purchased views Vp must satisfy the following requirements: for every tuple t in the query answer Q(D), there must be a subset of views in Vp that can produce t, and for each relation R in Q, at least one view over R must be purchased; for a tuple t not in Q(D), there must be a subset of views in Vp that can show that t does not belong to Q(D). Although the pricing problem in this setting is often NP-hard, QueryMarket shows that a large number of queries can still be priced in practice for smaller datasets. To handle the case where a query Q may require databases from multiple sellers, they introduce a revenue-sharing strategy among sellers: each seller earns a fraction of the query price π(Q) proportional to the seller's maximum revenue among all lowest-cost solutions of the ILP.

  Li et al. study the design of arbitrage-free pricing models for linear aggregate queries. Given a set of n real values x = ⟨x1, …, xn⟩, a linear query over x is a real-valued weight vector q = ⟨w1, …, wn⟩ whose answer is $q(x) = \sum_{i=1}^{n} w_i x_i$.

  The authors propose a marketplace in which data buyers can buy a single linear query q together with a buyer-specified variance constraint v. The answer to a query Q = (q, v) is an unbiased estimator of q(x) with variance at most v. The authors first show that an arbitrage-free pricing function π cannot decline faster than 1/v, i.e., $\pi(q, v/k) \le k \cdot \pi(q, v)$ for any $k \ge 1$; otherwise a buyer could obtain a low-variance answer more cheaply by averaging several higher-variance answers. They then propose a family of arbitrage-free pricing functions of the form $\pi(q, v) = f(q)^2 / v$, where f(·) is a semi-norm. Finally, they provide a general framework for synthesizing new arbitrage-free pricing functions from existing ones; for example, the sum of arbitrage-free pricing functions π1, π2, … remains arbitrage-free. Niu et al. list a series of well-known arbitrage-free pricing functions. In addition to synthesizing pricing functions, Li et al. also investigate a view-based pricing framework similar to that of Koutris et al.; by improving the theoretical results in the literature, they prove that view-based pricing of linear aggregate queries is NP-hard.

  Lin and Kifer study arbitrage-free pricing for general data queries. They propose three pricing schemes: instance-independent pricing, up-front dependent pricing, and delayed pricing. The authors further summarize five forms of arbitrage: price-based arbitrage, separate-account arbitrage, post-processing arbitrage, serendipitous arbitrage, and almost-certain arbitrage. The authors point out that the model of Koutris et al. suffers from price-based arbitrage, because the computed price can leak information about D. They theoretically construct instance-independent pricing functions and delayed pricing functions that are free of all forms of arbitrage. The main idea is to approach the pricing problem from a probabilistic perspective: queries that are more likely to reveal the real database instance are priced higher.

  Similarly, Deep and Koutris characterize the structure of pricing functions with respect to information arbitrage and bundle arbitrage, where information arbitrage covers the post-processing and serendipitous arbitrage defined by Lin and Kifer. For instance-independent pricing and answer-dependent pricing, an arbitrage-free pricing function should be monotone and subadditive in the amount of information revealed by the asked query. Several examples of arbitrage-free pricing functions are given, including weighted coverage functions and the Shannon entropy function.
  Deep and Koutris later implemented the theoretical framework in a real-time pricing system, QIRANA, which computes the price of a query bundle Q from the perspective of uncertainty reduction. They assume that the buyer entertains a set S of possible database instances with the same schema as the real database instance D. After receiving the query answer E = Q(D), the buyer can exclude every database instance Di ∈ S that cannot be D by checking whether Q(Di) = E. A query bundle that eliminates more database instances is priced higher, because it reveals more information about D. They propose an arbitrage-free, answer-dependent pricing function that assigns a weight wi to each database instance Di ∈ S and computes the price of a query bundle as

$$\pi(Q) = \sum_{D_i \in S:\, Q(D_i) \neq Q(D)} w_i .$$

By default, each possible database instance Di is assigned the same weight wi = P/|S|, where P is a parameter set by the data owner. Data owners may also provide QIRANA with some example query bundles and their corresponding prices; by solving an entropy maximization problem, QIRANA then learns the instance weights from the given examples. Choosing S as the complete set of possible database instances leads to a #P-hard problem, so to make the pricing function tractable, QIRANA uses a random sample of database instances as S.
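  Below is a minimal sketch of this uncertainty-based pricing idea, assuming a toy representation of database instances and queries; the default uniform weights mirror the description above, while QIRANA's weight learning and sampling of S are omitted.

```python
# Minimal sketch of answer-dependent, uncertainty-based pricing in the
# spirit of QIRANA: the price of a query bundle is the total weight of the
# candidate database instances its answers rule out. Candidates, weights,
# and queries below are toy assumptions.

def query_bundle_price(candidates, weights, queries, true_db):
    """
    candidates: list of possible database instances
    weights: list of non-negative weights, one per candidate
    queries: list of callables; query(db) returns the query answer
    true_db: the real database instance D
    """
    true_answers = tuple(q(true_db) for q in queries)
    price = 0.0
    for db, w in zip(candidates, weights):
        if tuple(q(db) for q in queries) != true_answers:
            price += w   # this candidate is ruled out, so the bundle reveals more
    return price

# Toy usage: databases are dicts of tuples; the query counts tuples in R.
D = {"R": [(1,), (2,), (3,)]}
S = [{"R": [(1,)]}, {"R": [(1,), (2,)]}, {"R": [(1,), (2,), (3,)]}]
P = 9.0
w = [P / len(S)] * len(S)            # uniform weights, as in QIRANA's default
count_R = lambda db: len(db["R"])
print(query_bundle_price(S, w, [count_R], D))   # 6.0: two of three candidates eliminated
```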

  Chawla et al. extend the above pricing function to maximize the seller's revenue. They consider a setting with unlimited supply and single-minded buyers, that is, each buyer only wants to buy a single query bundle Q and purchases it if the advertised price π(Q) is at most the buyer's valuation vQ. The authors employ a training dataset consisting of query bundles and their customer valuations, and study three pricing schemes. The main idea is that, following the weighted pricing function above, query bundles can be priced via the items (database instances) they eliminate. Uniform bundle pricing sets the same price for all query bundles. Item pricing sets the price of a query bundle with the weighted pricing function above, where the weights are learned from the training data. XOS pricing learns k weights $w_i^1, \ldots, w_i^k$ for each item Di and sets the price of Q to

$$\pi(Q) = \max_{j=1,\ldots,k} \; \sum_{D_i \in S:\, Q(D_i) \neq Q(D)} w_i^j .$$

The authors theoretically study the approximation ratio of each pricing scheme with respect to the optimal revenue. Although XOS pricing has the best approximation ratio, in practice item pricing usually yields larger revenue.

  Miao et al. study the problem of pricing selection-projection-natural-join queries over incomplete databases. They propose an arbitrage-free pricing function based on the notion of data provenance, which describes the origin of a piece of data and its processing history. Let t be a tuple in the query answer Q(D); the lineage L(t, D) of t is defined as the set of tuples in database D that contribute to t. The authors assume that each tuple ti has a base price p(ti), and set the price of Q to a weighted sum of the base prices of all tuples in the lineage of the query answer:

$$\pi(Q, D) = \sum_{t_i \in \bigcup_{t \in Q(D)} L(t, D)} \mu_i \, p(t_i),$$

where μi is the percentage of non-missing attributes of ti. The authors also propose an answer-quality-aware pricing function that adds a term based on the answer quality κ(Q, D) and a constant ∆; however, the quality-aware pricing function is not arbitrage-free.

  Buying data is usually not a one-time transaction: a customer may purchase multiple queries from the same data seller. History-aware pricing functions do not charge customers twice for information that has already been purchased. QueryMarket keeps track of the views a customer has purchased and avoids charging again when the customer re-queries those views in the future. QIRANA and the pricing model for selection-projection-natural-join queries over incomplete databases support the same kind of history-aware pricing as QueryMarket. A disadvantage of these history-based approaches is that the seller must provide reliable storage to preserve users' query histories.

  Upadhyaya et al. propose an optimal history-aware pricing scheme in which buyers pay only once for the purchased data. The key idea is to allow buyers to request refunds for repurchased data. In their setting, queries are priced according to their output size. The seller computes an identifier (a coupon) for each tuple in the query answer Q(D), and both Q(D) and the corresponding coupons are sent to the buyer. If the buyer receives the same tuple t from two queries, the buyer can claim a refund by showing the two coupons related to t in the two corresponding queries. To prevent buyers from borrowing coupons from others and claiming unjustified refunds, each coupon is uniquely associated with a buyer; by tracking coupon status, the data seller ensures that each coupon is used only once. However, this pricing function has no arbitrage-freeness guarantee.

3.4. Privacy Compensation

Trading and sharing personal data may compromise the data providers' privacy. Therefore, how to measure and appropriately compensate data providers for their privacy loss is an important issue in designing personal data markets.

  Differential privacy is a mathematical framework for rigorous privacy protection. Under differential privacy, random noise is injected into results computed over a dataset, so that data buyers can learn useful information about the whole dataset but cannot accurately learn the specifics of any individual. The amount of random noise affects both the data providers' privacy loss and the price of the data: datasets with less injected noise may leak more privacy and are priced higher. Pricing models for personal data usually adopt a cost-plus pricing strategy: the seller first compensates the data providers for their privacy loss, and then marks up the total privacy compensation to determine the price for data buyers.

  Ghosh and Roth initiated the study of pricing privacy through auctions. They propose to build a truthful marketplace that sells a single count query over binary data. In their setting, the data seller holds a dataset consisting of personal data di ∈ {0, 1}. The data seller sells an estimate ŝ of $s = \sum_i d_i$ and compensates the data providers for their privacy loss. Under the framework of differential privacy, the authors treat privacy as a commodity that can be traded: if a provider's data is used in an ε-differentially private manner, ε units of privacy should be purchased from that provider. The privacy compensation problem can therefore be cast as a variant of a multi-unit reverse auction. The authors assume that each data provider i has a linear privacy cost function $c_i(\varepsilon) = v_i \varepsilon$, which represents the cost of using i's data in an ε-differentially private manner, where vi is i's unit privacy cost. In the auction, data providers are asked to submit their asking prices bi for the use of their data. Ghosh and Roth consider two cases. In the first case, the buyer has an accuracy requirement on ŝ. The authors observe that it suffices to buy data from m individuals and use them in an ε-differentially private manner, where m and ε depend only on the accuracy goal. They show that a classic Vickrey–Clarke–Groves style auction minimizes the buyer's payment while guaranteeing the accuracy objective: select the m lowest bidders and pay each winner the uniform compensation ε·b, where b is the (m+1)-th smallest bid. In the second case, the buyer has a fixed budget and wishes to maximize the accuracy of ŝ; the authors propose a greedy approximation algorithm to solve this problem.
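  A minimal sketch of the first case is shown below; the bids, the required number of providers m, and the privacy level ε are toy assumptions.

```python
# Minimal sketch of the first Ghosh-Roth case described above: select the m
# lowest bidders and pay each winner epsilon times the (m+1)-th smallest bid.

def minimize_payment_auction(bids, m, epsilon):
    """
    bids: list of (provider_id, unit_privacy_bid)
    m: number of providers whose data is needed for the accuracy goal
    epsilon: differential privacy level at which winners' data is used
    Returns (winners, per_winner_payment). Requires at least m+1 bids so the
    threshold bid exists.
    """
    ranked = sorted(bids, key=lambda x: x[1])
    if len(ranked) <= m:
        raise ValueError("need more than m bidders to set the threshold payment")
    winners = [pid for pid, _ in ranked[:m]]
    threshold_bid = ranked[m][1]          # (m+1)-th smallest bid
    payment = epsilon * threshold_bid     # uniform payment to every winner
    return winners, payment

# Toy usage
bids = [("p1", 0.5), ("p2", 0.2), ("p3", 0.9), ("p4", 0.4)]
print(minimize_payment_auction(bids, m=2, epsilon=0.1))
# (['p2', 'p4'], 0.05) -- p2 and p4 win; each is paid 0.1 * 0.5
```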

  The value of personal data and the valuation of privacy may be correlated. For example, a patient may set a higher price for the patient's medical report than a healthy person would. Ghosh and Roth show a negative result: in the presence of such correlations, no individually rational direct mechanism can protect privacy.

  In a follow-up study, Dandekar et al. consider the scenario of selling a linear aggregate query q = ⟨w1, …, wn⟩ over real-valued personal data D = ⟨d1, …, dn⟩. They assume that data providers have the same linear privacy cost functions as above and propose a truthful reverse auction mechanism that maximizes the accuracy of the estimator for a budget-constrained buyer. The error of the estimator ŝ of the true answer s is its squared error (ŝ − s)². The results show that ŝ is more accurate when it is computed from more providers with larger corresponding weights in q. The problem is therefore transformed into a knapsack reverse auction [95] that maximizes the total weight of the selected providers under the budget constraint. Specifically, the authors treat the budget as the knapsack capacity, the privacy cost of data item di as its weight in the knapsack, and wi as the value of di. They propose a greedy algorithm with an approximation ratio of 5 for this problem.

  The aforementioned studies [34, 19] assume that data buyers can purchase an arbitrary amount of privacy from each data provider. However, conservative individuals may not want to sell their personal data if the privacy loss is too large. Nget et al. [77] study the same problem as Dandekar et al. [19] in a more realistic scenario, where an individual i can refuse to participate in the estimation if i's privacy loss exceeds a threshold εi. They assume that the privacy cost function of each data provider is public, and propose a heuristic query price determination method. The model first randomly samples a subset of data providers. Then, the data of each sampled individual i is used in an εi-differentially private manner, and the compensation is computed accordingly. If the total compensation exceeds the budget, the model reduces the differential privacy levels of high-cost providers to meet the budget goal. Finally, the perturbed query answer is generated via personalized differential privacy, which guarantees the differential privacy level of each selected individual. The above steps are repeated many times, and the perturbed answer with the smallest squared error is returned.

  Later, Zhang et al. [111] proposed a truthful personal data marketplace in which each data provider can specify an individual maximum tolerable privacy loss εi. They first show that the accuracy of the query answer is proportional to the total amount of privacy purchased. Under the assumption that the distribution of privacy costs of all individuals is public, they design a variant of the Bayesian optimal knapsack procurement mechanism [29] that maximizes the expected total purchased privacy subject to the data buyer's expected budget constraint, and solve it by adopting the algorithm in [29]. Noisy query answers are generated using personalized differential privacy [50], which guarantees εi-differential privacy for each selected individual i.

  The models proposed by Dandekar et al. [19] and Ghosh and Roth [34] may be vulnerable to arbitrage. Li et al. [57] consider the setting where data buyers specify a variance constraint v on the noisy query answers they buy. They assume that the privacy costs of individuals are public and propose a theoretical framework for assigning arbitrage-free prices to linear aggregate queries q. A noisy answer is derived from the true answer by adding Laplace noise with mean 0 and scale √(v/2), so that the answer has variance v. Measured in differential privacy, the privacy loss of individual i is bounded by ε = w/√(v/2) if the individual participates in the query, and 0 otherwise, where w is the largest absolute weight in q. Several privacy compensation functions are proposed, such as pi(ε) = ci·ε, where ci is the unit privacy cost of individual i. The price of the query is the sum of the privacy compensations, which is proven to be arbitrage-free.
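  The following is a minimal sketch of this compensation-based pricing rule under toy weights, unit privacy costs, and variance; it only illustrates how the query price aggregates the individual compensations.

```python
# Minimal sketch of the compensation-based query pricing described above:
# the query price is the sum of individual privacy compensations c_i * eps,
# with eps bounded by w_max / sqrt(v/2) for participating individuals.
import math

def price_linear_query(weights, unit_costs, v):
    """
    weights: list of query weights w_i (w_i == 0 means individual i does not
             participate in the query)
    unit_costs: list of unit privacy costs c_i, one per individual
    v: variance of the noisy answer the buyer requests
    Returns (total_price, per_individual_compensation).
    """
    scale = math.sqrt(v / 2.0)                     # Laplace scale giving variance v
    w_max = max(abs(w) for w in weights)
    eps = w_max / scale                            # privacy loss bound for participants
    comp = [c * eps if w != 0 else 0.0 for w, c in zip(weights, unit_costs)]
    return sum(comp), comp

# Toy usage: doubling the variance lowers the price by a factor of sqrt(2)
# here, i.e., the price declines no faster than 1/v, consistent with the
# arbitrage-freeness condition discussed earlier.
print(price_linear_query([1.0, 0.5, 0.0], [2.0, 1.0, 3.0], v=1.0))
print(price_linear_query([1.0, 0.5, 0.0], [2.0, 1.0, 3.0], v=2.0))
```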

  Li et al. [57] only compensate the individuals who participate in a query. However, since the data of two individuals may be correlated, the privacy of an individual who is not involved in the sale may still be compromised when another individual's data is disclosed. To compensate individuals fairly for their privacy, Niu et al. [78] extend the model of Li et al. [57] and propose an arbitrage-free and dependency-fair pricing model. Dependency fairness requires that a data provider be compensated for privacy whenever a query involves data of other providers that is correlated with that provider's data. Using dependent differential privacy [60], the privacy loss that a query causes to data provider i is bounded by $\varepsilon_i = ds_i / \sqrt{v/2}$, where dsi is the dependent sensitivity of the query with respect to provider i's data. The authors propose a bottom-up and a top-down mechanism to determine privacy compensations and the query price. The bottom-up mechanism computes the compensations in the same way as Li et al. and sets the query price to a multiple of the total compensation. The top-down mechanism first determines the query price using a user-defined arbitrage-free pricing function and then uses a portion of the buyer's payment for privacy compensation, with each data provider receiving a share proportional to its privacy loss.

  All the privacy compensation methods discussed above assume that there is a trusted platform/broker to trade the privacy of the data provider with the data buyer. However, data providers cannot control the use of their own data.

  To address this, Jin et al. developed a truthful crowdsensing data market in which data owners can decide how much privacy to disclose. In their marketplace, geolocation-obfuscated data is traded through auctions. Data owners first inject random noise into their data according to their own privacy preferences. Each data owner then bids with the mean and variance of the injected random noise and a fee. The buyer determines the auction winners to maximize data accuracy subject to the buyer's budget. The authors show that this optimization problem is NP-hard and propose a greedy heuristic: iteratively select the data owner with the largest marginal utility contribution until the budget is exhausted.

  In this section, we reviewed representative pricing models for raw datasets under four scenarios, where different requirements are considered. A limitation of the discussed pricing models is that datasets are priced without considering their downstream applications. Fernandez et al. [30] argue that the value of a dataset to customers is usually task-dependent and cannot be assessed solely from the intrinsic properties of the dataset. Since the pricing models for raw datasets are independent of downstream applications, these models can be used in machine learning processes for building both supervised and unsupervised learning models.

4. Pricing Data Labels

Crowdsourcing is a popular method for collecting large-scale labeled training data for machine learning tasks [88]. Unfortunately, crowdsourced data often suffers from quality issues, mainly because participants may submit low-quality labels. Participants can be deterred from doing so through performance-based remuneration. However, due to the lack of ground truth in label collection, evaluating label quality and pricing labels accordingly is a challenging task. In this section, we review two types of label pricing models designed to incentivize workers to work hard and submit accurate data labels.

4.1. Gold task pricing model

  Gold tasks are tasks for which the data buyer knows the answers in advance. Gold tasks can be mixed uniformly at random among the ordinary tasks; workers' performance on the gold tasks is used to rate the workers, which determines their compensation. Since workers cannot distinguish gold tasks from other tasks, this strategy can motivate workers to provide accurate labels.

  Shah and Zhou [88] consider a crowdsourcing setting where workers perform binary labeling tasks, and propose a multiplicative pricing model based on gold tasks. The model allows workers to skip assigned tasks if they are unsure of the answer. The total payment to a worker u is computed from u's performance on the answered gold tasks. Workers are selfish and want to maximize their individual expected payments. The authors assume that each worker has a personal confidence Pr(yt = l) about the likelihood that the true label yt of task t is l. The pricing model is designed to incentivize workers to report only labels with confidence greater than a threshold p. The total payment starts at β and is multiplied by 1/p for each correct answer on a gold task; however, if any gold task is answered incorrectly, the payment drops to zero, that is,

$$\pi(u) = \beta \cdot (1/p)^{c} \cdot \mathbb{1}(r = 0) \qquad (5)$$

where 1(·) is the indicator function, and c and r are the numbers of correct and incorrect answers on the gold tasks, respectively. This pricing model incentivizes workers to answer only the tasks they feel confident enough about. The model is incentive compatible, that is, workers maximize their expected payment if and only if they strive to report accurate labels. It also satisfies the "no free lunch" axiom: workers who only provide wrong answers are paid nothing. In their setting, the proposed model is the only incentive-compatible model that satisfies the "no free lunch" axiom.
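  Below is a minimal sketch of the multiplicative payment rule in Equation (5), using toy gold-task labels and worker reports.

```python
# Minimal sketch of the multiplicative payment rule in Equation (5): start
# from beta, multiply by 1/p for every correct gold-task answer, and pay
# nothing if any gold task is answered incorrectly.

def multiplicative_payment(gold_answers, worker_answers, beta, p):
    """
    gold_answers: dict task_id -> true label of a gold task
    worker_answers: dict task_id -> label reported by the worker
                    (skipped gold tasks are simply absent)
    beta: base payment; p: confidence threshold in (0, 1)
    """
    payment = beta
    for task, truth in gold_answers.items():
        if task not in worker_answers:
            continue                      # skipping a task keeps the payment unchanged
        if worker_answers[task] == truth:
            payment *= 1.0 / p            # reward a correct, confident answer
        else:
            return 0.0                    # any wrong gold answer forfeits everything
    return payment

# Toy usage: two correct gold answers and one skipped gold task.
gold = {"g1": 1, "g2": 0, "g3": 1}
reports = {"g1": 1, "g2": 0}
print(multiplicative_payment(gold, reports, beta=1.0, p=0.5))  # 4.0
```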

  Shah et al. [90] further extend Shah and Zhou's model [88] to multi-label tasks. For each task, a worker can submit a set Y of answers that the worker believes are most likely to be correct. This multiple-selection system gives workers more flexibility to express their beliefs and can leverage the expertise of workers with partial knowledge more effectively than a single-selection system. The authors assume that a worker's confidence that each label is the true label of a task lies in the set {0} ∪ (p, 1], where p is fixed and known, and the mechanism is designed to encourage workers to report exactly the labels they are confident about. If one of a worker's answers to a gold task is correct, the worker's reward for that gold task is (1−p)^(|Y|−1); otherwise it is 0. The total payment to the worker depends on the product of the worker's rewards over all gold tasks.

  In a later study, Shah and Zhou [89] propose a two-stage multiplicative pricing model to motivate workers to self-correct their answers. In the first stage, workers answer the assigned tasks. In the second stage, if a worker's answer to task t is inconsistent with that of a peer worker, the worker has a chance to change the answer. If the initial answer to the task is correct, worker u receives a high reward; if the updated answer is correct, worker u receives a low reward; and if the final answer is wrong, worker u receives zero reward. The total payment is determined by the product of the worker's rewards from the gold tasks. It is proved theoretically that this method is the only incentive-compatible model satisfying the no-free-lunch axiom. Experiments show that the self-correcting setup can significantly improve data quality compared with the standard single-stage setup.

  To reduce the variance of payments, the aforementioned methods [89, 90, 88] require each worker to solve a sufficient number of gold tasks. This wastes the procurement budget, since the answers to the gold tasks are already known.

  De Alfaro et al. [2] solve this problem by combining peer prediction with gold tasks. They arrange workers in a hierarchy, with each worker sharing a common task with each of its children. Some gold tasks are used to motivate high effort from the top-level workers. Assuming these workers put in enough effort to provide high-quality answers, their answers can serve as pseudo-gold tasks for the second-level workers, who in turn provide pseudo-gold tasks for the next level, and so on. A worker is penalized for disagreeing on the tasks shared with the level above. Since the top-level workers are rated by real gold tasks, they receive more accurate ratings than other workers, which is unfair to workers at lower levels.

  Subsequent work by Goel and Faltings considers fair payment among workers, that is, a worker's expected payment is directly proportional to the accuracy of the worker's answers, independent of the strategies and proficiencies of the worker's randomly chosen peers. The core idea is to estimate workers' proficiencies, i.e., the probability that a worker solves a task correctly. Goel and Faltings [36] first estimate the proficiency of a small group of workers with gold tasks. Then, the answers of these workers to non-gold tasks are used as contributed gold tasks, with the workers' proficiencies serving as the confidence of those tasks. The contributed gold tasks are used to evaluate the proficiencies of more workers. Finally, each worker's payment is proportional to the worker's estimated proficiency, so highly proficient workers receive high payments. This model guarantees that putting in the best effort to provide accurate labels is the dominant strategy for every worker.

4.2. Peer prediction based pricing models

Peer prediction based pricing models can incentivize effort and accurate data labels without access to gold tasks. These models exploit the stochastic correlation among answers to the same task to set up a game among the workers, called a mechanism in game theory. The game is designed so that workers who strive to solve their tasks receive high expected payments, while spammers who provide random answers receive no payment on average. A pricing model is incentive compatible if it admits exerting high effort and reporting truthfully as an equilibrium.

  Dasgupta and Ghosh [20] initiated the study of effort elicitation and proposed the DG model to price binary labels. A data buyer assigns a set of data labeling tasks to a set of workers, such that each task is labeled by multiple workers and each worker labels multiple tasks. They assume that a worker ui either invests no effort and thus provides a random label, or invests full effort at cost ci and provides a correct label with probability πi; here, πi is called the proficiency of ui. Workers are self-interested and want to maximize their payments.
  The DG model pays a worker ui for an assigned task t according to how surprisingly ui's report agrees with the report of a peer worker up. Let $y_i^t$ and $y_p^t$ denote the answers of ui and up to the task. The model pays ui a fixed reward for agreement on the task, minus the probability Pr(ui, up) that ui and up give the same answer on random tasks, that is,

$$\pi(u_i, t) = \beta \left( \mathbb{1}\!\left(y_i^t = y_p^t\right) - \Pr(u_i, u_p) \right),$$

where β is a non-negative payment scaling parameter chosen to cover the workers' effort costs, and Pr(ui, up) is approximated from the submitted labels. The total amount paid to worker ui is the sum of ui's payments over all of ui's tasks.
  This pricing model encourages effort, because spammers who do not solve their tasks and report random or constant labels receive an expected payment of zero. Under the assumption that all workers' proficiencies are better than random guessing, the DG model is proven to be incentive compatible. Although the model also admits uninformative equilibria (e.g., all workers reporting the same label), these equilibria yield lower profits for workers and are therefore unattractive.
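  A minimal sketch of a DG-style payment is given below; the chance-agreement term Pr(ui, up) is estimated from the two workers' label frequencies on their other tasks, and the label data are toy assumptions.

```python
# Minimal sketch of a DG-style payment: a worker is paid, per task, beta times
# (agreement with a peer on that task minus the estimated probability that the
# two workers would agree on a random task).

def _freq_of_ones(labels, exclude):
    vals = [v for t, v in labels.items() if t != exclude]
    return sum(vals) / len(vals)

def dg_payment(worker_labels, peer_labels, task, beta):
    """
    worker_labels, peer_labels: dict task_id -> binary label (0 or 1)
    task: the task being paid for (labeled by both workers)
    beta: non-negative payment scaling parameter
    """
    agree = 1.0 if worker_labels[task] == peer_labels[task] else 0.0
    fw = _freq_of_ones(worker_labels, task)       # worker's frequency of 1-labels elsewhere
    fp = _freq_of_ones(peer_labels, task)         # peer's frequency of 1-labels elsewhere
    chance_agreement = fw * fp + (1.0 - fw) * (1.0 - fp)
    return beta * (agree - chance_agreement)

# Toy usage: the workers agree on t1 but would agree only half the time by chance.
w = {"t1": 1, "t2": 0, "t3": 1}
p = {"t1": 1, "t2": 0, "t3": 0}
print(dg_payment(w, p, "t1", beta=2.0))   # 2.0 * (1 - 0.5) = 1.0
```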

  In the multi-label case, two labels l1 and l2 may be positively correlated. Shnayder et al. [92] show that, under the DG model, workers can earn more profit by misreporting l1 as l2. The correlated agreement (CA) mechanism [92] extends the DG model to multi-label tasks. In a CA mechanism, knowledge about label dependencies is required. A label correlation matrix Δ is learned from the workers' submissions, where element Δi,j = Pr(li, lj) − Pr(li)Pr(lj) is the correlation between labels li and lj. Let S(·) denote the sign function of Δ, that is, S(li, lj) = 1 if Δi,j > 0, and 0 otherwise. A worker is rewarded if the worker's report correlates positively with the peer's report. To penalize the case where all workers blindly report the same label, workers are penalized if they are likely to agree with their peers on random tasks. In particular, the payment to worker u for reporting $y_u^t$ on task t is

$$\pi(u, t) = S\!\left(y_u^t, y_p^t\right) - S\!\left(y_u^{t'}, y_p^{t''}\right),$$

where $y_p^t$ is the answer given by the peer worker up to task t, $y_u^{t'}$ is the answer of worker u to a random task t', and $y_p^{t''}$ is the answer of worker up to another random task t''.

  When the number of tasks is large, the label correlation Δ can be learned accurately, and the CA mechanism is the incentive-compatible mechanism with the highest payment. However, the mechanism fails if two labels l1 and l2 are indistinguishable to S(·), that is, ∀li ∈ Y, S(l1, li) = S(l2, li). In this case, workers may misreport l1 as l2 and still receive the same payment.

  [82] provide complementary theoretical results on pricing for multi-label tasks. They assume that labels are only weakly correlated, that is, Pr(op = l2 | o = l1) < Pr(op = l2 | o = l2), where o and op are the labels observed by worker u and peer worker up, respectively. The mechanism pays worker u for reporting y on a task t an amount proportional to $\mathbb{1}(y = y_p^t) / R(y)$, where R(y) is the empirical frequency of label y computed from all submissions. It is shown that exerting high effort and reporting truthfully is more profitable than any other equilibrium. However, their assumption about label correlation may not hold in some applications [92].

  The above methods [20, 92, 82] require that each task be completed by at least two workers, which leads to duplicated answers and thus cannot utilize the crowd effectively. For the binary-label setting, Liu and Chen [62] suggest learning a classifier M from the workers' reports and using the classifier's prediction M(t) as the peer report. Since the labels submitted by workers are highly noisy, the classifier is trained with noisy-label learning techniques [75]. Specifically, they first estimate the error rates of the submitted labels; then, the classifier is optimized using the noise-corrected loss function φ(·) proposed by Natarajan et al. [75]. The payment for report y on task t is based on −φ(M(t), y), so that labels with large losses are priced lower. Under the assumption that M is better than random guessing, striving to report the true label is the highest-paying equilibrium.
  Liu and Chen [63] study the sequential label collection problem, where labeling tasks are issued in multiple stages. In their setting, an accurately labeled task has a fixed value to the data buyer, while a wrongly labeled task has no value. They propose an incentive-compatible pricing model that maximizes the expected utility of the data buyer, which is the difference between the total value and the total payment.
  They develop a multi-armed bandit algorithm that extends the DG model [20] by dynamically adjusting the payment scaling parameter β. A larger β encourages more accurate labels but costs more money. Since bandit algorithms require a stationary environment, this approach may fail to learn the optimal β if adversarial workers adjust their strategies based on their interactions with the mechanism [42, 37]. [42] address this problem with reinforcement learning, which handles workers' strategic behavior more effectively.

  In practice, peer prediction based models need to shift payments to avoid negative payments. This adjustment can lead to potentially large payments for spammers. Radanovic and Faltings [81] address this by proposing a reputation system, PropeRBoost, to adjust the payments. PropeRBoost issues tasks to workers in multiple rounds and computes each worker's reputation score based on the worker's past submissions. In each round r, it first applies the DG model [20] to compute the workers' payments, and then rescales the payments according to the corresponding workers' reputations. The results show that the average payment to a spammer converges to 0 as r approaches infinity.

  In this section, we introduced gold task based and peer prediction based pricing models for data labels. The developed pricing models ensure that striving to report accurate data labels is the most profitable strategy for all workers. A major problem of gold task based methods is that they require a sufficient number of gold tasks to achieve good performance; however, in some cases, obtaining gold tasks is very costly. For peer prediction based methods, the existence of multiple equilibria is a major limitation, since workers may converge to an uninformative equilibrium in which they do not exert full effort [93].

5. Pricing in Collaborative Training of Machine Learning Models

Collaborative machine learning is an appealing paradigm where multiple data owners collaborate to build high-quality machine learning models by contributing their data . Since datasets from different data owners may have different contributions to trained machine learning models, data owners who contribute more valuable data should be rewarded more [94]. In this section, we review contribution evaluation and revenue distribution techniques in collaborative machine learning.

5.1. Revenue allocation based on the Shapley value

Shapley fairness is widely adopted as the basis for fair revenue allocation in collaborative machine learning. It guarantees that each participant receives a payment proportional to the participant's marginal contribution to the performance of the trained machine learning model. The challenge of adopting the Shapley value is its exponential computation cost.
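  For concreteness, the following is a minimal sketch of exact Shapley value computation using the subset form from Section 2.4; the utility function is a toy stand-in for training and evaluating a model, which is exactly why the exact computation does not scale.

```python
# Minimal sketch of exact Shapley value computation for data sellers, using
# the subset definition from Section 2.4. In practice utility(S) would train
# a model on the union of the sellers' datasets and report, e.g., test
# accuracy; the exponential number of subsets is what makes this intractable.
from itertools import combinations
from math import comb

def exact_shapley(sellers, utility):
    """
    sellers: list of seller identifiers
    utility: callable(frozenset_of_sellers) -> real-valued model utility
    Returns dict seller -> Shapley value.
    """
    n = len(sellers)
    values = {}
    for s in sellers:
        others = [x for x in sellers if x != s]
        total = 0.0
        for r in range(n):                       # subset sizes 0 .. n-1
            for subset in combinations(others, r):
                S = frozenset(subset)
                marginal = utility(S | {s}) - utility(S)
                total += marginal / comb(n - 1, r)
        values[s] = total / n
    return values

# Toy utility: diminishing returns in the number of participating sellers.
toy_utility = lambda S: len(S) ** 0.5
print(exact_shapley(["s1", "s2", "s3"], toy_utility))
# symmetric sellers get equal shares summing to the utility of the full coalition
```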

  Maleki et al. [67] address the efficiency problem of computing Shapley values by proposing a permutation sampling algorithm for bounded utility functions. By the permutation form of the Shapley value, the Shapley value of a seller is the average of the seller's marginal utility contributions over all permutations, which can be estimated by a sample mean. Let $\hat{\psi}(s_i)$ denote an (ε, δ)-approximation of the Shapley value of seller si, that is, $\Pr(|\hat{\psi}(s_i) - \psi(s_i)| \ge \epsilon) \le \delta$. To compute such estimates for all sellers, by Hoeffding's inequality [40], on the order of $O\!\left(\frac{r^2}{\epsilon^2}\log\frac{N}{\delta}\right)$ sampled permutations, and hence $O\!\left(\frac{r^2 N}{\epsilon^2}\log\frac{N}{\delta}\right)$ utility function evaluations, are required, where N is the number of sellers and r is the range of the utility function. Evaluating the utility function itself, such as computing test accuracy, is computationally expensive because it requires training a machine learning model. Therefore, the method cannot scale to a large number of sellers.
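  A minimal sketch of permutation sampling is shown below; the utility function is again a toy stand-in, and in practice each evaluation would involve training or updating a model.

```python
# Minimal sketch of Monte Carlo permutation sampling for Shapley values:
# average each seller's marginal contribution over random permutations.
import random

def permutation_shapley(sellers, utility, num_permutations, seed=0):
    rng = random.Random(seed)
    estimates = {s: 0.0 for s in sellers}
    for _ in range(num_permutations):
        perm = sellers[:]
        rng.shuffle(perm)
        coalition = frozenset()
        prev_utility = utility(coalition)
        for s in perm:                       # scan the permutation once
            coalition = coalition | {s}
            new_utility = utility(coalition)
            estimates[s] += new_utility - prev_utility
            prev_utility = new_utility
    return {s: v / num_permutations for s, v in estimates.items()}

# Toy usage with the same diminishing-returns utility as before; the
# estimates should be close to the exact values (about 0.577 each).
toy_utility = lambda S: len(S) ** 0.5
print(permutation_shapley(["s1", "s2", "s3"], toy_utility, num_permutations=200))
```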

  Ghorbani and Zou [33] extend the Monte Carlo method of Maleki et al. [67] to price individual data points in supervised learning, and propose truncation-based and gradient-based approximation methods. The truncation-based approach reduces the number of utility evaluations by ignoring large coalitions. The authors argue that it suffices to estimate the Shapley value up to the inherent noise in the predictive performance U on the test dataset, which can be measured as the bootstrap variance of U. Moreover, the performance change caused by adding one more training data point s to a large training set S is negligible; therefore, if the utility of S is close to the utility of the entire dataset D, the marginal contribution of s to S can be treated as 0 in practice, and its computation can be truncated. The gradient-based approach speeds up the evaluation of the utility function by reducing training time: the model only needs to be trained once over the training data, updated by performing gradient descent on one data point s at a time, with the marginal contribution of s taken to be the change in model performance. These two approximation methods introduce estimation bias into the approximated Shapley values, and there is no guarantee on the approximation error.

  Jia et al. [46] propose two approximation algorithms with provable error bounds on the Shapley values that significantly reduce the number of utility evaluations. The first algorithm adopts the idea of group testing [114] from feature selection. Let βi be a Boolean random variable indicating whether seller si is included in a random sample of sellers. The sampling distribution of β1, …, βN is designed such that the difference between the Shapley values of seller si and seller sj equals, up to a scaling constant, the expectation of $(\beta_i - \beta_j)\, U(\beta_1, \ldots, \beta_N)$, where U(β1, …, βN) is the utility of the sellers present in the sample and D is the set of all sellers. The sellers' Shapley values can then be recovered from the estimated pairwise Shapley differences by solving a feasibility problem. They show that the algorithm returns an (ε, δ)-approximation with O(N(log N)²) utility evaluations. The second algorithm is based on the observation that Shapley values are often approximately sparse, that is, most values lie around the mean. Exploiting this property, they apply the idea of sparse signal recovery from compressive sensing [83] and develop an algorithm that yields (ε, δ)-approximations with only O(N log N) utility evaluations.
  Jia et al. [45] further showed that the Shapley values of data points used in an unweighted kNN classifier can be computed exactly in O(NlogN) time. Given a test point xtest with label ytest, they define the utility of a kNN classifier on a set of data points S as the likelihood of the correct label, i.e., U(S) = (1/K) Σ_{i=1}^{min(K,|S|)} 1[yαi(S) = ytest], where αi(S) is the index of the training point in S that is the i-th closest to xtest. With this special utility function, the Shapley difference between two consecutive data points xαi(D) and xαi+1(D) can be computed efficiently, that is, ψ(xαi(D)) − ψ(xαi+1(D)) = (1[yαi(D) = ytest] − 1[yαi+1(D) = ytest])/K · min(K, i)/i (Equation 7). They first calculate the Shapley value of the training point farthest from xtest and then use Equation 7 to recursively calculate the Shapley values of the remaining points in order of decreasing distance to xtest. They further developed an (ε, δ)-approximation algorithm based on locality-sensitive hashing [21] with only sublinear complexity. The main idea is to compute the Shapley values only for the retrieved nearest neighbors of xtest and to ignore the remaining data points, whose Shapley values are too small to matter. In addition, they proposed a Monte Carlo approximation algorithm for the weighted kNN classifier.
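
A hedged sketch of this exact computation for a single test point is shown below. It follows the recursion in Equation 7; the base case used for the farthest training point (an indicator of a label match divided by N) reflects our reading of [45] and should be checked against the original paper, and the helper names are illustrative.

```python
import numpy as np

def knn_shapley_single_test(X_train, y_train, x_test, y_test, K):
    """Sketch of exact Shapley values for an unweighted kNN classifier and a
    single test point, following the recursion in Equation 7."""
    N = len(y_train)
    # Sort training points by increasing distance to the test point.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(distances)          # order[0] is the nearest neighbor
    shapley = np.zeros(N)
    # Base case (assumed): the farthest training point.
    farthest = order[-1]
    shapley[farthest] = float(y_train[farthest] == y_test) / N
    # Recursion from the second-farthest point toward the nearest one.
    for i in range(N - 2, -1, -1):         # i is the 0-based rank by distance
        cur, nxt = order[i], order[i + 1]
        rank = i + 1                       # 1-based rank used in Equation 7
        diff = (float(y_train[cur] == y_test) - float(y_train[nxt] == y_test)) / K
        shapley[cur] = shapley[nxt] + diff * min(K, rank) / rank
    return shapley
```

Values for a whole validation set can then be obtained by averaging the per-test-point values.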

  The aforementioned studies [45, 46, 33] evaluate the utility of models by their performance on validation datasets. Sim et al. [94] considered the situation where no validation dataset is available and proposed using the information gain of the model parameters as the utility function. Let θ represent the model parameters. After training on data D, the information gain IG(θ) = H(θ) − H(θ|D) is the reduction of uncertainty in θ, where H(·) is the entropy function. In addition to Shapley fairness, three additional incentives for income distribution are proposed, namely individual rationality, stability of the grand coalition, and group welfare. They also proposed p-Shapley fairness, which distributes a reward π(si) = kψ(si)^p to seller si. By adjusting the parameter p ∈ [0,1], they can trade off among the different incentive conditions. Instead of monetary rewards, each participant is rewarded with a machine learning model; to realize different levels of reward, the model given to a participant is trained with a correspondingly different level of noise injected into the training labels.
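
The reward scaling π(si) = kψ(si)^p is easy to illustrate. In the sketch below, choosing k so that the rewards exhaust a fixed budget is an illustrative assumption (in [94] the rewards are model-based rather than monetary), and the Shapley values are assumed to be non-negative.

```python
def p_shapley_rewards(shapley_values, p, budget):
    """Sketch of p-Shapley reward scaling: pi(si) = k * psi(si)**p, with k
    chosen here so that rewards sum to a given budget (an illustrative
    normalization). Assumes non-negative Shapley values."""
    scaled = {s: v ** p for s, v in shapley_values.items()}
    k = budget / sum(scaled.values())
    return {s: k * v for s, v in scaled.items()}

# p = 1 rewards sellers in proportion to their Shapley values;
# p = 0 rewards every seller equally.
rewards = p_shapley_rewards({"s1": 0.5, "s2": 0.3, "s3": 0.2}, p=0.5, budget=100.0)
```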

  Federated learning [7, 68] enables multiple decentralized parties to collaboratively train a machine learning model while keeping their training data local. The datasets provided by the participants are used in an order determined by the central server. Using Shapley values to evaluate participants' contributions incurs high communication costs among the dispersed participants. Moreover, Shapley values ignore the order in which the data sources are used. To address these challenges, Wang et al. [103] proposed the federated Shapley value. Let U(si + sj) denote the utility of the model that is trained first on the data of si and then on the data of sj. Let It be the set of participants selected in round t of the federated learning process. The federated Shapley value of participant si at round t is defined as ψt(si) = Σ_{S ⊆ It \ {si}} (|S|!(|It| − |S| − 1)!/|It|!) · (Ut(S ∪ {si}) − Ut(S)), where Ut(S) is the utility of the model from round t − 1 after it is further trained on the data of the participants in S. The federated Shapley value of si is ψ(si) = Σ_{t=1}^{T} ψt(si), where T is the total number of rounds in federated learning. The authors show that the federated Shapley value satisfies the balance and additivity axioms of Shapley fairness. The other two axioms, symmetry and zero element, are satisfied within each round. They extended the permutation sampling and group testing approximation methods [46] to compute federated Shapley values.
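
To make the round-by-round structure concrete, the sketch below sums per-round Shapley values over federated rounds. The brute-force permutation enumeration stands in for the permutation-sampling and group-testing approximations used in practice, and the interface of `round_utility` (a function scoring a coalition's updates against the previous round's model) is an illustrative assumption.

```python
from itertools import permutations

def exact_shapley(players, utility):
    """Exact Shapley values by enumerating permutations (only feasible for a
    handful of players); `utility` maps a tuple of players to a number."""
    values = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        prev = utility(())
        for i, p in enumerate(order):
            cur = utility(order[: i + 1])
            values[p] += (cur - prev) / len(perms)
            prev = cur
    return values

def federated_shapley(rounds, all_players):
    """Sketch of round-by-round valuation: `rounds` is a list of
    (participants, round_utility) pairs. A player's total value is the
    sum of its per-round values."""
    totals = {p: 0.0 for p in all_players}
    for participants, round_utility in rounds:
        per_round = exact_shapley(tuple(participants), round_utility)
        for p, v in per_round.items():
            totals[p] += v
    return totals
```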

  Participants in federated learning incur costs when contributing their datasets, such as privacy costs [41] and energy costs [51]. Yu et al. [108] proposed a fair revenue distribution mechanism for federated learning that jointly considers the costs and the contributions of the participants. In round t, each participant si incurs a cost ci(t) and receives a reward πi(t). si's regret ri(t) is a function of the difference between si's total cost and total reward so far. A large value of ri(t) indicates that si has not been well compensated for the costs it has incurred. The authors argue that the payments in each round should achieve contribution fairness and regret fairness. Contribution fairness requires that each participant's payment πi(t) be positively correlated with its Shapley value ψt(si), that is, Σi πi(t)ψt(si) should be maximized. Regret fairness requires that participants have similar regrets, i.e., the differences in regret between participants should be minimized. The participants' payments are determined by solving an optimization problem with a budget constraint. Theoretically, they show that each participant's time-averaged regret is bounded above by a constant as t → ∞.

  Shapley values are vulnerable to data replication attacks. A data provider can replicate his or her data at zero cost and act as an additional provider to obtain additional, undeserved rewards. Agarwal et al. [1] address this problem by penalizing similar datasets to discourage replication; that is, the replication-robust Shapley value is defined as ψr(si) = ψ(si) exp(−λ Σ_{j≠i} SM(si, sj)), where SM is a similarity measure between datasets and λ is a constant. However, the replication-robust Shapley values no longer satisfy the balance axiom of Shapley fairness.
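
A sketch of this adjustment, assuming the exponential similarity penalty stated above, is given below. Representing each seller's dataset by a feature vector and using cosine similarity as the measure SM are illustrative choices.

```python
import numpy as np

def replication_robust_values(shapley, features, lam=1.0):
    """Sketch of a replication-robust adjustment: down-weight each seller's
    Shapley value by an exponential penalty on its total similarity to the
    other sellers' datasets. Cosine similarity between dataset feature
    vectors stands in for the similarity measure SM."""
    names = list(shapley)
    vecs = np.array([features[n] for n in names], dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T                       # pairwise cosine similarities
    robust = {}
    for i, n in enumerate(names):
        total_sim = sim[i].sum() - sim[i, i]  # similarity to all other sellers
        robust[n] = shapley[n] * np.exp(-lam * total_sim)
    return robust
```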

  Han et al. [38] studied replication attacks in data marketplaces with submodular utility functions. They show that the total reward obtained by an attacker increases monotonically with the number of replicas the attacker creates. They found that the extra reward mainly comes from the marginal contributions of the attacker's replicas to small groups of sellers. To address this problem, the authors propose to down-weight these contributions when computing Shapley values. Their approach guarantees that the attacker receives a smaller reward the more copies it makes.

  Ohrimenko et al. [79] designed a replication-robust collaborative data marketplace that requires each participant to pay a participation fee. This approach discourages replication because the additional reward an attacker receives cannot cover the attacker's participation costs.

5.2. Other income distribution methods

In addition to the Shapley value, there are some other methods of revenue distribution in collaborative machine learning.

  Leave-one-out [18] is a common method to assess the importance of data. It compares the performance of the model trained on the full dataset with the performance of the model trained on the full dataset minus one point. The performance drop is defined as the value of that data point, i.e., v(si) = U(D) − U(D \ {si}). Leave-one-out is usually approximated by influence functions [18, 52], which estimate how the model changes when the weight of a training point is perturbed, without retraining the model. Richardson et al. [85] applied influence functions to reward federated learning participants for contributed data points. The results show that the pricing model is incentive compatible. Applying influence functions to price data points has also been studied in [84, 46]. In general, leave-one-out methods are more efficient than Shapley values because they do not require retraining the model on a large number of coalitions. However, leave-one-out may not accurately estimate the value of data points. For example, if two data points are exact duplicates, leave-one-out assigns both of them a value close to zero no matter how important the underlying datum is, because removing either one alone barely changes the performance while the other remains [106].
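
A minimal sketch of leave-one-out valuation is given below; `utility` again stands for a score such as validation accuracy and is an illustrative placeholder.

```python
def leave_one_out_values(data_points, utility):
    """Leave-one-out valuation sketch: the value of a point is the drop in
    utility when that point is removed from the full dataset."""
    full = set(data_points)
    base = utility(full)
    return {x: base - utility(full - {x}) for x in data_points}
```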

  Yan and Procaccia [104] designed a data pricing model based on the core [35], a well-known income distribution solution concept in cooperative game theory. The solution is designed to achieve maximum stability of the collaboration among participants. The core requires that the total reward of each coalition S be at least the utility U(S), that is, ∀S ⊆ D, Σ_{si∈S} π(si) ≥ U(S), where π(si) is the reward of participant si and D is the set of all participants. When such rewards cannot be achieved, the least core relaxes the constraints by allowing the smallest possible gap between the utility of S and the total reward of S. In particular, the least core computes the payment to each participant by solving the following linear program: minimize e subject to Σ_{si∈D} π(si) = U(D) and Σ_{si∈S} π(si) + e ≥ U(S) for all S ⊆ D (Equation 9). The number of constraints in Equation 9 grows exponentially with the number of participants. Yan and Procaccia [104] proposed a Monte Carlo approximation algorithm with guaranteed approximation error to address this efficiency problem. Their approximation method samples relatively few coalitions and solves Equation 9 over the sampled coalitions only. If Equation 9 has multiple solutions, the solution with the smallest l2 norm is chosen. The resulting income distribution satisfies the balance, symmetry, and zero element axioms of Shapley fairness.
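
A sketch of solving a sampled version of Equation 9 as a linear program is shown below; the uniform coalition sampling and the use of scipy's linprog are illustrative choices rather than the exact procedure of [104], which additionally breaks ties by the smallest l2 norm.

```python
import random
import numpy as np
from scipy.optimize import linprog

def sampled_least_core(players, utility, num_samples=200, seed=0):
    """Sketch of a sampled least-core computation in the spirit of Equation 9:
    minimize e subject to sum_i pi_i = U(D) and, for each sampled coalition S,
    sum_{i in S} pi_i + e >= U(S)."""
    rng = random.Random(seed)
    n = len(players)
    # Variables: pi_1, ..., pi_n, e ; objective: minimize e.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # Equality constraint: total payments equal the grand-coalition utility.
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0
    b_eq = [utility(set(players))]
    # Inequality constraints on sampled coalitions (linprog uses A_ub x <= b_ub).
    A_ub, b_ub = [], []
    for _ in range(num_samples):
        S = {p for p in players if rng.random() < 0.5}
        if not S:
            continue
        row = np.zeros(n + 1)
        for i, p in enumerate(players):
            if p in S:
                row[i] = -1.0
        row[-1] = -1.0                     # -(sum_{i in S} pi_i) - e <= -U(S)
        A_ub.append(row)
        b_ub.append(-utility(S))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(None, None)] * (n + 1))
    payments = dict(zip(players, res.x[:n]))
    return payments, res.x[-1]             # payments and the relaxation e
```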

  Yoon et al. [106] proposed a reinforcement learning algorithm to estimate the value of data points. They learn a data value estimator that predicts data values and selects the most valuable samples to train a target classifier. The data value estimator and the corresponding classifier are learned jointly, which allows them to reinforce each other's performance. However, this approach does not guarantee a fair distribution of income among participants.

  Most existing income allocation methods are developed in the setting of joint training of supervised machine learning models . Participants will be rewarded based on the contribution of their datasets to the utility of the jointly trained machine learning models. To adapt existing pricing models to the scenario of jointly training unsupervised machine learning models, the main challenge is to develop a utility function that all participants can agree on . For some traditional unsupervised machine learning models, there are some widely accepted performance metrics that can be used as utility functions. For example, Silhouette Coefficient [86] and Calinski-Harabasz index [12] are widely used to evaluate the performance of clustering algorithms when the ground truth class is unknown. However, developing utility functions for some unsupervised models, such as pre-trained deep language models [26, 9], can be challenging because they are evaluated differently in many downstream machine learning tasks.

  In this section, we review pricing models for the collaborative training of machine learning models. The main idea is to price each participant's dataset according to its contribution to the performance of the jointly trained machine learning model. Shapley value based approaches guarantee a fair income distribution among participants but are less computationally efficient and scalable. Some alternatives [106, 104] enjoy better efficiency or coalition stability, but lose the fairness guarantee.

6. Pricing Machine Learning Models

Many different applications and scenarios require machine learning models. Instead of building machine learning models from scratch, many users and companies turn to purchasing trained machine learning models due to lack of expertise and computing resources [109, 16]. In this section, we review pricing models for machine learning models and discuss pricing differences between machine learning models and raw datasets.

6.1. Pricing Model

Pricing machine learning models is an emerging field of research. To the best of our knowledge, existing research mainly focuses on no-arbitrage and revenue-maximizing pricing.

  Chen et al. [16] proposed an arbitrage-free and revenue-maximizing market for machine learning models. In their setting, a model owner sells multiple versions of a machine learning model to different buyers. The seller first trains the optimal model on the entire original dataset. The seller then generates different versions of the optimal model by adding Gaussian noise with different variances to the parameters of the optimal model. The expected error rate of the generated model instances increases monotonically with the variance of the injected noise. Arbitrage-free pricing ensures that a buyer cannot obtain a high-performance model more cheaply by combining cheaper low-performance versions. Under their scheme, a pricing function is arbitrage-free if and only if it is monotone and subadditive as a function of the inverse of the noise variance. Unfortunately, their pricing model only works for machine learning models trained with strictly convex objective functions.
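
For illustration, the sketch below generates model versions by perturbing an optimal parameter vector with Gaussian noise of increasing variance; the function and variable names are ours, not from [16].

```python
import numpy as np

def make_model_versions(optimal_params, noise_variances, seed=0):
    """Sketch of model versioning by parameter perturbation: each version is
    the optimal parameter vector plus zero-mean Gaussian noise; larger
    variances yield, in expectation, worse and hence cheaper versions."""
    rng = np.random.default_rng(seed)
    versions = []
    for var in noise_variances:
        noisy = optimal_params + rng.normal(0.0, np.sqrt(var), size=optimal_params.shape)
        versions.append((var, noisy))
    return versions

# Example: three versions of a linear model's weights with increasing noise.
versions = make_model_versions(np.array([0.7, -1.2, 3.0]), [0.0, 0.1, 1.0])
```
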
  Chen et al. [16] further studied revenue maximization in machine learning model pricing with respect to the demands and valuations of a set of buyers. They show that determining the optimal prices is coNP-hard. To overcome the computational hardness, they relax the subadditivity constraint π(x + y) ≤ π(x) + π(y) and compute an approximation π̂ of the optimal pricing function π. They show that π̂ is arbitrage-free and that ∀x > 0, π(x)/2 ≤ π̂(x) ≤ π(x). They propose a dynamic programming algorithm that computes π̂ in O(n^2) time, where n is the number of model versions.

  Liu et al. [61] proposed an end-to-end model marketplace that jointly considers the privacy costs of data owners and the demands of model buyers. A broker collects data from data owners and generates multiple versions of machine learning models for sale, trained on different subsets of the training data and with different privacy levels. The revenue is fully distributed to the data owners. Objective perturbation [14] is used to train models with a desired level of differential privacy, by injecting a quantified amount of random noise into the model's objective function. Each data owner si requires a minimum compensation for the use of its data to train a model with ε-differential privacy; the compensation depends on bi, which is proportional to the Shapley value of si with respect to all sellers' datasets, and on ci(ε), the privacy cost of si. An ideal pricing model should maximize revenue, be arbitrage-free across different privacy levels, and cover the compensation owed to the data owners. Computing the optimal pricing function is coNP-hard, so they proposed a dynamic programming algorithm to solve the problem approximately. A limitation of the pricing model is that it cannot adjust prices based on dynamic buyer demand, which can limit the broker's revenue.

  Agarwal et al. [1] considered conducting an online auction for a machine learning model market that is truthful and revenue maximizing. They assume that buyers arrive one at a time, each wanting to buy a machine learning model for the buyer's own prediction task. Let G(Ŷi, Yi) denote the predictive quality of the model's predictions Ŷi on the validation data Yi of buyer i. The reward buyer i obtains from the model is μi · G(Ŷi, Yi), where μi is buyer i's private valuation of a unit of performance. Let pi and bi denote the broker's asking price and buyer i's bid, respectively, both per unit of performance. The broker generates a noisy machine learning model for buyer i based on the price difference pi − bi. Specifically, the model is trained on a dataset with a quantified amount of injected random noise, such that the performance of the model is scaled down according to pi − bi. Buyer i is charged by a function RF(pi, bi, Yi), which is designed according to Myerson's payment function [71]. The utility that buyer i receives by bidding bi is μi · G(Ŷi, Yi) − RF(pi, bi, Yi), where Ŷi is the prediction of the returned noisy model. The results show that bidding truthfully, i.e., bi = μi, maximizes buyer i's utility. The authors apply the multiplicative weights method [3] to update the price pi from historical revenues. They show that the pricing mechanism maximizes revenue.

6.2. Pricing of raw data products and machine learning models

At a high level, pricing for machine learning models and raw datasets shares a common set of requirements and techniques. But their pricing models are fundamentally different in at least four respects.

  First, the pricing units for machine learning models are usually well-defined and fixed. Machine learning models are usually priced and sold as a whole . Customers can purchase machine learning models or use machine learning models through API calls, where each call has a fixed price. In contrast, raw datasets can be used at a variety of granularities . For example, a customer may be interested in sales information for customers in the United States in the last year. However, another customer may wish to purchase sales for the Christmas season. This flexibility makes versioning of raw data products easier and enables more flexible pricing mechanisms. For example, depending on the amount of information displayed, different prices can be assigned to different queries on the same database [25].

  Second, version control in model marketplaces is harder than version control in data marketplaces. Due to the powerful and flexible aggregation properties of datasets, different versions of datasets can be easily generated by aggregating by different dimensions. Generating different versions of a machine learning model requires more sophisticated techniques [16] because it is challenging to accurately control the differences between multiple versions.

  Third, the value of a raw dataset to a customer is often harder to assess than that of a machine learning model. Typically, raw datasets are used to train machine learning models. The ultimate value of a dataset depends not only on its intrinsic properties, but also on the specific task and analysis method for which the dataset is used [30]. As a result, it is often difficult for customers to understand the value of a dataset. Many machine learning models, in contrast, are designed for specific tasks and are directly used by humans to support decision making [1]. It is easier for people to validate and understand the value of such machine learning models. For example, customers can evaluate classification models based on their predictive accuracy.

  Finally, preventing arbitrage is generally more difficult in model markets than in raw data markets. As Tramèr et al. [100] and Yu et al. [109] have shown, machine learning models can be stolen by an attacker with a reasonable number of API calls. A client with a large number of query instances may first purchase some predictions from the target machine learning model. The client can then use these predictions to train a local surrogate model whose output is nearly equivalent to that of the target model, and use the local model to label the remaining query instances at almost no cost.
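
The extraction risk can be sketched as follows; `query_api` is a hypothetical stand-in for the paid prediction API of the target model, and logistic regression is only an illustrative choice of surrogate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_surrogate(query_api, X_query):
    """Sketch of model extraction: label a pool of query instances with the
    purchased predictions of the target model, then fit a local surrogate
    on those labels. Further predictions then cost the client nothing."""
    y_purchased = np.array([query_api(x) for x in X_query])  # paid API calls
    surrogate = LogisticRegression(max_iter=1000)
    surrogate.fit(X_query, y_purchased)
    return surrogate
```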

  In this section, we review pricing for machine learning models. We first revisit the no-arbitrage and revenue-maximizing pricing models. We then discuss several key differences between machine learning model offerings and raw dataset offerings, including pricing units, version control, arbitrage prevention, and customer valuation.

7. Conclusions and future directions

In this paper, we investigate data pricing in end-to-end machine learning pipelines . We consider three important steps in the machine learning pipeline that may involve pricing , namely raw data collection and labeling , collaborative training of machine learning models , and machine learning model marketplaces . We systematically review representative studies in these steps, discuss pricing principles and review existing methods. End-to-end machine learning pipelines play an increasingly important role in the current era of big data and artificial intelligence economy. To our knowledge, this is the first survey on data pricing in machine learning pipelines.

  Data pricing is still in its early stages . Future work faces many research challenges. We list some of them here.

  First, existing research focuses on designing appropriate reward models for each individual stage of the machine learning pipeline. There is a lack of systematic research on end-to-end income distribution solutions . As we introduced in our survey, the manufacturing process of machine learning models involves multiple parties, including data owners, data processors, machine learning model designers, and possibly other actors. Each party provides a value-added contribution at a stage of the pipeline and is rewarded. A natural question is how to allocate the manufacturing budget among the parties . To answer this question, we need a mechanism to measure and compare the contributions of different organizations at different stages. We also need a system that dynamically adjusts budget allocations based on changes in supply and demand.

  Second, almost all pricing models for collaborative model training formulate income distribution as a cooperative game and use Shapley values for the allocation. They justify the use of Shapley values by four axioms, namely balance, symmetry, zero element, and additivity. However, Yan and Procaccia [104] argue that the necessity of additivity for data valuation is debatable. Besides the additivity axiom, many other well-known allocation solutions in cooperative game theory satisfy the other three axioms. These solutions have their own advantages and limitations compared with the Shapley value. For example, the normalized Banzhaf value [13] computes each player's payment as the player's average marginal contribution to all coalitions of the other players. Although the normalized Banzhaf value does not satisfy the additivity axiom, it is more robust to data replication attacks than the Shapley value [38]. In markets where robustness matters more than additivity, normalized Banzhaf values are preferable to Shapley values. Different types of data markets may have different goals [30] and thus require different axioms. Therefore, we need to better understand the axioms that are necessary in different markets and explore market-specific solutions for income distribution.

  Third, fine-grained data acquisition for machine learning tasks has not been fully explored. In practice, datasets from two sellers may have similar or overlapping parts . Data buyers with limited budgets may not want to purchase too many similar data points, since the diversity of training datasets is crucial to the performance of machine learning models [30]. A query-based pricing model [53] allows data buyers to purchase only the portion of the dataset they are interested in . However, existing query-based pricing models are only designed for relational datasets in monopoly markets . Supporting query-based pricing in a common dataset marketplace with competing sellers presents new challenges and opportunities. For data buyers, for example, it is interesting to explore how budgets can be allocated among data sellers to maximize the utility of purchased datasets. It is important for data sellers to assign prices to different parts of their datasets based on supply and demand so that data sellers and their datasets can remain competitive in the marketplace.

  Finally, rigorous evaluation methods for data pricing models need to be developed. Many existing pricing models have only been evaluated in oversimplified experimental settings, where many assumptions are made about the behavior of market participants. However, models that are theoretically sound may not work in practice because some of these assumptions may break. For example, in real-world markets, participants may behave adversarially, act with incomplete knowledge, or form coalitions. The impact of such behaviors on the performance of pricing models has largely been overlooked in existing analyses. Therefore, as suggested by Fernandez et al. [30], a simulation platform should be developed that can simulate different behaviors of market participants. Such a platform can help us study the strengths and limitations of pricing models in a target environment and choose the most suitable model for deployment.
