A comprehensive summary of machine learning datasets

Datasets matter at every stage of building and evaluating machine learning models. When constructing a dataset, care must be taken to clean and label the data: a high-quality dataset often improves both the quality of model training and the accuracy of prediction. If you lack data of your own, it is worth looking for public datasets, especially those that are widely recognized and commonly used. For common tasks such as image recognition, object detection, and image segmentation, corresponding public datasets are readily available. Model selection and design are important, but so is the training data: while changing the model architecture to improve prediction accuracy, it pays to improve the quality of the input data, and also to consider increasing the amount of input data, to see whether either improves the model's predictive performance. With that in mind, this post sorts out and summarizes machine learning datasets mentioned in papers, data competitions, and community discussions, in the hope that they can support your own machine learning research.

01

Springleaf Marketing Response Dataset

Springleaf puts humanity back into lending by offering personal and auto loans to customers to help them take control of their lives and finances. Direct mail is an important way for the Springleaf team to connect with customers who may need a loan.

Direct offers provide great value to customers who need them and are a fundamental part of Springleaf's marketing strategy. To improve their targeting efforts, Springleaf had to ensure they were focusing on clients who were likely to respond and be good candidates for their services.

Springleaf asks you to predict which customers will respond to a direct mail offer using heavily anonymized features. The challenge is to construct new meta-variables and employ feature selection methods to deal with this dauntingly extensive dataset. The official address of the Springleaf Marketing Response dataset is:

https://www.kaggle.com/competitions/springleaf-marketing-response/data

In this dataset we are given a high-dimensional table of anonymized customer information. Each row corresponds to one customer, and the response variable is binary, labeled "target". We must predict the target variable for every row in the test set. The features have been anonymized for privacy and consist of a mix of continuous and categorical variables. You will come across many "placeholder" values in the data that stand in for things such as missing values; the dataset deliberately preserves this encoding to match Springleaf's internal system. The meaning, values, and types of the features are provided "as is", and dealing with a large, messy feature set is part of the challenge.
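As a quick first pass, one might inspect the data roughly as follows (a minimal sketch, assuming the Kaggle file has been downloaded as train.csv and that the binary response column is named target; both are assumptions rather than guarantees):

```python
import pandas as pd

# Load the training file; each row is one customer, "target" is the binary response.
train = pd.read_csv("train.csv")
print(train.shape)  # expect a very wide, high-dimensional table

# The anonymized features mix continuous and categorical data.
numeric_cols = train.select_dtypes(include="number").columns.drop("target", errors="ignore")
object_cols = train.select_dtypes(include="object").columns

# Many "placeholder" codes (e.g. -1, 9999, empty strings) stand in for missing values,
# so inspect the most frequent values per column before deciding how to treat them.
for col in list(numeric_cols)[:5]:
    print(col, train[col].value_counts(dropna=False).head(3).to_dict())
```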

02

StumbleUpon Evergreen classification challenge dataset

StumbleUpon is a user-curated web content discovery engine that recommends relevant, high-quality pages and media to users based on their interests. While some recommended pages (such as news articles or seasonal recipes) are only relevant for a short time, others retain a timeless quality and can be recommended to users long after they were discovered. In other words, pages can be categorized as either "ephemeral" or "evergreen". Ratings gathered from the community can provide strong signals that a page may no longer be relevant, but what if this distinction could be made in advance? A high-quality prediction of "ephemeral" or "evergreen" would greatly improve such recommender systems.

Many people know evergreen content as soon as they see it, but can an algorithm make the same decision without human intuition? Our task is to build a classifier that evaluates a large number of URLs and labels each one as evergreen or ephemeral. Can you do better than StumbleUpon? As an added bonus beyond the prize, doing well in this competition could land you a career at one of San Francisco's best places to work. The official address of the StumbleUpon Evergreen classification challenge dataset is:

https://www.kaggle.com/competitions/stumbleupon/data

The data has two components. The first component is a pair of files, train.tsv and test.tsv. Each is a tab-delimited text file containing the fields outlined on the competition page, for a total of 10,566 URLs; fields with no data available are indicated by question marks. train.tsv is the training set and contains 7,395 URLs, each given a binary label (evergreen (1) or non-evergreen (0)). test.tsv is the test/evaluation set and contains 3,171 URLs. The second component is raw_content.zip, a zip file containing the raw content of each URL as seen by StumbleUpon's crawler; the raw content of each URL is stored in a tab-delimited text file named after its urlid.
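A minimal loading sketch, assuming train.tsv and test.tsv sit in the working directory and that the binary evergreen column is named label (an assumption based on common Kaggle conventions):

```python
import pandas as pd

# Tab-delimited files; the "?" fields described above become NaN on read.
train = pd.read_csv("train.tsv", sep="\t", na_values="?")  # 7,395 URLs
test = pd.read_csv("test.tsv", sep="\t", na_values="?")    # 3,171 URLs

print(train["label"].value_counts())  # evergreen (1) vs. non-evergreen (0)
print(train.columns.tolist())
```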

03

Santander Customer Transaction Dataset

Santander's mission is to help people and businesses prosper, and the bank is always looking for ways to help customers understand their financial health and identify which products and services can help them achieve their financial goals. Its data science team continually challenges machine learning algorithms, working with the global data science community, to find more accurate answers to its most common questions, such as: Is a customer satisfied? Will a customer buy this product? Can a customer pay off a loan?

For this challenge, Santander enlisted Kagglers to help identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data Santander uses to solve this problem. The official address of the Santander customer transaction dataset is:

https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data

We are given an anonymized dataset containing numeric feature variables, a binary target column, and a string ID_code column; the task is to predict the value of the target column in the test set.
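A minimal baseline sketch, under the assumption that the training file is train.csv with columns ID_code, target, and purely numeric features, and that the evaluation metric is ROC AUC (an assumption not stated above):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
X = train.drop(columns=["ID_code", "target"])  # numeric anonymized features
y = train["target"]                            # binary label

# A simple linear baseline as a sanity check before trying heavier models.
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean())
```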

04

Google Brain Respiratory Pressure Dataset

What do doctors do when a patient is having trouble breathing? They use a ventilator to pump oxygen into the lungs of a sedated patient through a tube in the windpipe. But mechanical ventilation is a clinician-intensive procedure, a limitation that became apparent early in the COVID-19 pandemic. At the same time, developing new ways to control mechanical ventilators is prohibitively expensive, even before entering clinical trials. High-quality simulators can reduce this barrier.

Current simulators are trained as an ensemble where each model simulates a lung setting. However, the lungs and their properties form a continuous space, so a parametric approach that takes into account differences in patient lungs must be explored. A team at Google Brain, in collaboration with Princeton University, aims to grow the mechanical ventilation control community around machine learning. They argue that neural networks and deep learning can generalize better to lungs with different characteristics than current industry-standard PID controllers.

In this competition, we simulate a ventilator connected to a sedated patient's lungs. The best submissions will take lung properties such as compliance and resistance into account. If successful, this will help overcome the cost barrier of developing new ways to control mechanical ventilators, paving the way for algorithms that adapt to individual patients and reduce the burden on clinicians during these novel times and beyond. As a result, ventilator treatments may become more widely available to help patients breathe. The official address of the Google Brain respiratory pressure dataset is:

https://www.kaggle.com/competitions/ventilator-pressure-prediction/data

The ventilator data used in the competition was generated by connecting a modified open-source ventilator to an artificial bellows test lung through a breathing circuit. The setup has two control inputs and one state variable, the airway pressure. The first control input is a continuous variable from 0 to 100 representing the percentage the inspiratory solenoid valve is open to let air into the lung (0 is fully closed and no air enters, 100 is fully open). The second control input is a binary variable indicating whether the expiratory valve is open (1) or closed (0) to let air out. Participants are given a large time series of breaths and must learn to predict the airway pressure in the breathing circuit during a breath, given the time series of control inputs.
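To make the structure concrete, here is an illustrative sketch assuming the published column names breath_id, time_step, u_in, u_out, and pressure (treat these names as assumptions if your copy of the data differs):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Each breath_id is one breath: a short time series of control inputs and pressure.
first_id = train["breath_id"].iloc[0]
one_breath = train[train["breath_id"] == first_id]
print(one_breath[["time_step", "u_in", "u_out", "pressure"]].head())

# Lag features over u_in, computed within each breath, are a common starting point
# for sequence models that predict the pressure curve.
train["u_in_lag1"] = train.groupby("breath_id")["u_in"].shift(1).fillna(0)
```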

05

Allstate Claims Cost Dataset

When you have been devastated by a serious car accident, your focus is on the things that matter most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want to spend your time or energy. That is why Allstate, a US personal insurer, is continually seeking fresh ideas to improve the claims service it provides to the more than 16 million households it protects.

Allstate is currently developing automated methods for predicting claims cost and severity. In this challenge, Kagglers are invited to show off their creativity and flex their technical chops by building an algorithm that accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity, for the chance to be part of Allstate's effort to ensure a worry-free customer experience. The official address of the Allstate claims cost dataset is:

https://www.kaggle.com/competitions/allstate-claims-severity/data

Each row in this dataset represents one insurance claim, and we need to predict the value of the "loss" column. Variables whose names start with "cat" are categorical, while those starting with "cont" are continuous.
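A small preprocessing sketch that follows directly from the naming convention above (cat*/cont* columns, a "loss" target); the file name train.csv is assumed:

```python
import pandas as pd

train = pd.read_csv("train.csv")
cat_cols = [c for c in train.columns if c.startswith("cat")]    # categorical variables
cont_cols = [c for c in train.columns if c.startswith("cont")]  # continuous variables

# One-hot encode the categorical columns before fitting any regressor on "loss".
X = pd.get_dummies(train[cat_cols + cont_cols], columns=cat_cols)
y = train["loss"]
print(X.shape, y.describe())
```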

06

Smzdm ("值得买") e-commerce sales dataset

As e-commerce becomes deeply integrated with the global economy and every sector of society, it has become a powerful engine for the digital transformation of China's economy. A huge user base and a rapidly growing mobile internet industry have made China one of the largest and fastest-growing e-commerce markets in the world. Digital technologies such as big data, cloud computing, artificial intelligence, and virtual reality have created rich application scenarios for e-commerce, continually giving rise to new marketing models and business formats such as livestream selling, recommendation platforms, rural e-commerce, new domestic brands, new cultural and creative products, and online fresh groceries.

To apply AI technology to improve the user experience, solve the pain points and difficulties faced by e-commerce companies, and help cultivate outstanding talent in artificial intelligence, the first E-Commerce AI Algorithm Competition (ECAA) was launched under the guidance of the Department of E-commerce and Informatization of the Ministry of Commerce and the Beijing Municipal Bureau of Commerce. The official address of the Smzdm e-commerce sales dataset is:

https://www.automl.ai/competitions/19

The competition provides roughly one million real article records from the consumer portal Smzdm ("什么值得买") covering January through May 2021. The goal is to use modern machine learning algorithms to predict, from the information available in an article's first two hours, the product sales it will generate in hours three through fifteen, so that potential best-selling products can be spotted early. The business objective is thereby framed as a product sales forecasting problem, enabling better product recommendations for users and higher platform revenue.
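As a purely illustrative sketch of this setup (every column name below is hypothetical, since the official data dictionary is not reproduced here), the task reduces to regressing later sales on signals observed in an article's first two hours:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("articles.csv")  # hypothetical export of the ~1M article records

early_features = ["clicks_2h", "comments_2h", "favorites_2h"]  # assumed early-window signals
target = "orders_3_15h"                                        # assumed hours-3-to-15 sales label

# Fit a simple gradient-boosted regressor on the early-window features.
model = GradientBoostingRegressor()
model.fit(df[early_features], df[target])
```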

07

Iowa House Prices Dataset

This is the classic house-price dataset, a version of which can be loaded through scikit-learn's dataset utilities. It assumes you have some experience with R or Python and the basics of machine learning. It is a perfect competition for data science students who have completed an online machine learning course and want to expand their skills before attempting a featured competition.

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this competition's data proves that far more influences price negotiations than the number of bedrooms or a white picket fence. The official address of the Iowa house prices dataset is:

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and the competition challenges us to predict the final sale price of each home.
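A common first approach, sketched under the assumptions that the training file is train.csv, the target column is SalePrice, and the leaderboard metric is RMSE on log prices (the latter two are assumptions not stated above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])  # work on log prices
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)  # naive encoding

# Cross-validated RMSE of a regularized linear baseline.
score = -cross_val_score(Ridge(alpha=10.0), X, y, cv=5,
                         scoring="neg_root_mean_squared_error").mean()
print(score)
```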

08

TFI Restaurant Revenue Dataset

TFI owns more than 1,200 quick-service restaurants worldwide and is the company behind some of the world's best-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby's. It employs over 20,000 people in Europe and Asia and makes significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open a new restaurant is largely a subjective process based on the personal judgement and experience of the development team. This subjective data is difficult to extrapolate accurately across geographies and cultures. New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

A machine learning model that improves the efficiency of investment in new restaurant sites would allow TFI to invest more in other important areas of the business, such as sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges us to predict the annual restaurant revenue of 100,000 regional locations. The official address of the TFI restaurant revenue dataset is:

https://www.kaggle.com/competitions/restaurant-revenue-prediction/data

The dataset contains a training set of 137 restaurants and a test set of 100,000 restaurants. The data columns include the opening date, location, city type, and three categories of obfuscated data: demographic data, real estate data, and commercial data. The revenue column represents a restaurant's (transformed) revenue in a given year and is the target of the prediction task.
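A short feature-engineering sketch, assuming the columns are named Open Date, City Group, Type, and revenue, and that the date format is month/day/year (all assumptions worth checking against the actual files):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # only 137 rows, so keep models simple

# Convert the opening date into a numeric "days open" feature before modelling revenue.
train["Open Date"] = pd.to_datetime(train["Open Date"], format="%m/%d/%Y")
train["open_days"] = (pd.Timestamp("2015-01-01") - train["Open Date"]).dt.days

X = pd.get_dummies(train[["open_days", "City Group", "Type"]])
y = train["revenue"]
print(X.head(), y.mean())
```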

09

Walmart Retail Dataset

One challenge of modelling retail data is the need to make decisions based on limited history. If Christmas comes only once a year, so does the chance to see how strategic decisions affected the bottom line.

In this competition, we are given historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must predict the sales for each department in each store. To add to the challenge, the dataset includes selected holiday markdown events. These markdowns are known to affect sales, but it is difficult to predict which departments are affected and to what extent. The official address of the Walmart retail dataset is:

https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/data

The data covers the 45 Walmart stores in different regions; each store contains multiple departments, and our task is to forecast department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks that include these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effect of markdowns on these holiday weeks in the absence of complete/ideal historical data.
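The five-fold holiday weighting described above amounts to a weighted mean absolute error; a minimal sketch of that metric:

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday):
    """Weighted MAE where holiday weeks count five times as much as regular weeks."""
    weights = np.where(is_holiday, 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Toy example: the holiday week's error dominates the score.
print(weighted_mae(np.array([100.0, 200.0]),
                   np.array([110.0, 180.0]),
                   np.array([False, True])))
```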
