UCI Datasets: A Summary

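For some reason, many readers have been messaging me to ask about downloading the UCI datasets. I don't recall ever saying they could be downloaded from me, but since so many people want them, here is a repost. Reply "uci数据集" in the account backend to get the packaged UCI datasets, or browse the link below and download whichever datasets interest you: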

http://archive.ics.uci.edu/ml/index.php

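You are welcome to follow my WeChat official account. Going forward it will post content on Python, machine learning, algorithms, deep learning, paper reading, and the occasional bit of feel-good inspiration. Welcome!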

Search for coderwangson to follow.

1.Abalone : Predict the age of abalone from physical measurements.

2.Abscisic Acid Signaling Network : The objective is to determine the set of boolean rules that describe the interactions of the nodes within this plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations using an asynchronous update scheme.

3.Acute Inflammations : The data was created by a medical expert as a data set to test the expert system, which will perform the presumptive diagnosis of two diseases of the urinary system.

4.Adult : Predict whether income exceeds $50K/yr based on census data. Also known as the “Census Income” dataset.
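Files like the Adult data are, as far as I know, plain comma-separated text with no header row, so the column names have to be supplied by hand. Below is a minimal sketch using only the standard library. The shortened column list and the two rows are made-up placeholders for illustration, not records or the full schema of the real file.

```python
import csv
import io

# Hypothetical, shortened column list; the real file has more columns.
COLUMNS = ["age", "education", "hours_per_week", "income"]

# Made-up placeholder rows in the headerless "value, value, ..." style.
SAMPLE = """\
39, Bachelors, 40, <=50K
52, Masters, 45, >50K
"""


def load_rows(text):
    """Parse headerless CSV text into dicts and add a binary target."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for values in reader:
        if not values:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, values))
        row["age"] = int(row["age"])
        row["hours_per_week"] = int(row["hours_per_week"])
        # Target: does income exceed $50K/yr?
        row["over_50k"] = row["income"] == ">50K"
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
print([r["over_50k"] for r in rows])
```

For a real download, the same function could be fed the text of the file instead of `SAMPLE`, after extending `COLUMNS` to match the dataset's documentation.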

5.Annealing : Steel annealing data.

6.Anonymous Microsoft Web Data : Log of anonymous users of www.microsoft.com; predict areas of the web site a user visited based on data on other areas the user visited.

7.Arcene : ARCENE’s task is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables. This dataset is one of 5 datasets of the NIPS 2003 feature selection challenge.

8.Arrhythmia : Distinguish between the presence and absence of cardiac arrhythmia and classify it in one of the 16 groups.

9.Artificial Characters : Dataset artificially generated by using a first-order theory which describes the structure of ten capital letters of the English alphabet.

10.Audiology (Original) : Nominal audiology dataset from Baylor.

11.Audiology (Standardized) : Standardized version of the original audiology database.

12.Australian Sign Language signs : This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers with a total of 6650 sign samples.

13.Australian Sign Language signs (High Quality) : This data consists of samples of Auslan (Australian Sign Language) signs. 27 examples of each of 95 Auslan signs were captured from a native signer using high-quality position trackers.

14.Auto MPG : Revised from CMU StatLib library, data concerns city-cycle fuel consumption.

15.Automobile : From 1985 Ward’s Automotive Yearbook.

16.AutoUniv : AutoUniv is an advanced data generator for classification tasks. The aim is to reflect the nuances and heterogeneity of real data. Data can be generated in .csv, ARFF or C4.5 formats.

17.Bach Chorales : Time-series data based on chorales; challenge is to learn generative grammar; data in Lisp.

18.Badges : Badges labeled with a “+” or “-” as a function of a person’s name.

19.Bag of Words : This data set contains five text collections in the form of bags-of-words.
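For readers who have not met the term, a bag-of-words throws away word order and keeps only how often each word occurs. The sketch below builds one from two made-up sentences with the standard library. It illustrates the representation only and does not reproduce the file layout the dataset itself uses.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up example documents.
docs = ["The cat sat on the mat.", "The dog sat."]
bags = [bag_of_words(d) for d in docs]

# Build a shared vocabulary, then express each document as a count vector.
vocab = sorted(set().union(*bags))
vectors = [[bag[w] for w in vocab] for bag in bags]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```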

20.Balance Scale : Balance scale weight & distance database.
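To show how weight and distance combine, here is a small sketch. It assumes the class label is decided by comparing weight times distance on each side, and that the labels are "L", "B" and "R" for tipping left, balanced and tipping right. That is my understanding of the dataset rather than something stated in the description above, so check it against the dataset's documentation.

```python
def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R' or 'B' by comparing the torque on each side.

    Assumption: torque is weight * distance, and the larger side tips down.
    """
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"


print(balance_class(5, 5, 1, 1))  # L
print(balance_class(1, 2, 3, 4))  # R
print(balance_class(2, 3, 3, 2))  # B
```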

21.Balloons : Data previously used in cognitive psychology experiment; 4 data sets represent different conditions of an experiment.

22.Blood Transfusion Service Center : Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan; this is a classification problem.

23.Breast Cancer : Breast Cancer Data (Restricted Access).

24.Breast Cancer Wisconsin (Diagnostic) : Diagnostic Wisconsin Breast Cancer Database.

25.Breast Cancer Wisconsin (Original) : Original Wisconsin Breast Cancer Database.

26.Breast Cancer Wisconsin (Prognostic) : Prognostic Wisconsin Breast Cancer Database.

27.Breast Tissue : Dataset with electrical impedance measurements of freshly excised tissue samples from the breast.

28.CalIt2 Building People Counts : This data comes from the main door of the CalIt2 building at UCI.

29.Car Evaluation : Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.

30.Cardiotocography : The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.

31.Census Income : Predict whether income exceeds $50K/yr based on census data. Also known as the “Adult” dataset.

32.Census-Income (KDD) : This data set contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.

33.Challenger USA Space Shuttle O-Ring : Task: predict the number of O-rings that experience thermal distress on a flight at 31 degrees F given data on the previous 23 shuttle flights.

34.Character Trajectories : Multiple, labelled samples of pen tip trajectories recorded whilst writing individual characters. All samples are from the same writer, for the purposes of primitive extraction. Only characters with a single pen-down segment were considered.

35.Chess (Domain Theories) : 6 different domain theories for generating legal moves of chess.

36.Chess (King-Rook vs. King) : Chess Endgame Database for White King and Rook against Black King (KRK).

37.Chess (King-Rook vs. King-Knight) : Knight Pin Chess End-Game Database Creator.

38.Chess (King-Rook vs. King-Pawn) : King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7).

39.Cloud : Little Documentation.

40.CMU Face Images : This data consists of 640 black and white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and size.

41.Coil 1999 Competition Data : This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

42.Communities and Crime : Communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

43.Communities and Crime Unnormalized : Communities in the US. Data combines socio-economic data from the '90 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR.

44.Computer Hardware : Relative CPU Performance Data, described in terms of its cycle time, memory size, etc.

45.Concrete Compressive Strength : Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

46.Concrete Slump Test : Concrete is a highly complex material. The slump flow of concrete is not only determined by the water content, but is also influenced by other concrete ingredients.

47.Congressional Voting Records : 1984 United States Congressional Voting Records; Classify as Republican or Democrat.

48.Connect-4 : Contains connect-4 positions.

49.Connectionist Bench (Nettalk Corpus) : The file “nettalk.data” contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes.

50.Connectionist Bench (Sonar, Mines vs. Rocks) : The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

51.Connectionist Bench (Vowel Recognition - Deterding Data) : Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios.

52.Contraceptive Method Choice : Dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.

53.Corel Image Features : This dataset contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence

54.Covertype : Forest CoverType dataset.

55.Credit Approval : This data concerns credit card applications; good mix of attributes.

56.Cylinder Bands : Used in decision tree induction for mitigating process delays known as “cylinder bands” in rotogravure printing.

57.Demospongiae : Marine sponges of the Demospongiae class classification domain.

58.Dermatology : Aim for this dataset is to determine the type of Erythemato-Squamous Disease.

59.Dexter : DEXTER is a text classification problem in a bag-of-word representation. This is a two-class classification problem with sparse continuous input variables. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

60.DGP2 - The Second Data Generation Program : Generates application domains based on specific parameters, number of features, and proportion of positive to negative examples.

61.Diabetes : This diabetes dataset is from AIM '94.

62.Document Understanding : Five concepts, expressed as predicates, to be learned.

63.Dodgers Loop Sensor : Loop sensor data was collected for the Glendale on-ramp for the 101 North freeway in Los Angeles.

64.Dorothea : DOROTHEA is a drug discovery dataset. Chemical compounds represented by structural molecular features must be classified as active (binding to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003 feature selection challenge.

65.E. Coli Genes : Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

66.EBL Domain Theories : Assorted small-scale domain theories.

67.Echocardiogram : Data for classifying if patients will survive for at least one year after a heart attack.

68.Ecoli : This data contains protein localization sites.

69.Economic Sanctions : Domain Theory on Economic Sanctions; Undocumented.

70.EEG Database : This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp sampled at 256 Hz.

71.El Nino : The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.

72.Entree Chicago Recommendation Data : This data contains a record of user interactions with the Entree Chicago restaurant recommendation system.

73.Flags : From Collins Gem Guide to Flags, 1986.

74.Forest Fires : This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data (see details at: http://www.dsi.uminho.pt/~pcortez/forestfires ).

75.Function Finding : Cases collected mostly from investigations in physical science; intention is to evaluate function-finding algorithms.

76.Gisette : GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusible digits ‘4’ and ‘9’. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

77.Glass Identification : From USA Forensic Science Service; 6 types of glass; defined in terms of their oxide content (i.e. Na, Fe, K, etc).

78.Haberman’s Survival : Dataset contains cases from a study conducted on the survival of patients who had undergone surgery for breast cancer.

79.Hayes-Roth : Topic: human subjects study.

80.Heart Disease : 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach.

81.Hepatitis : From G.Gong: CMU; Mostly Boolean or numeric-valued attribute types; Includes cost data (donated by Peter Turney).

82.Hill-Valley : Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a “bump” in the terrain) or a Valley (a “dip” in the terrain).
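The hill-versus-valley idea is easy to picture with a toy example. The sketch below generates a synthetic 100-point record with a single smooth bump or dip and labels it with a naive rule: whichever extreme lies farther from the median decides the class. Both the generator and the rule are my own illustrative assumptions, not how the real dataset was produced, and the rule would likely struggle on noisy records.

```python
import math
from statistics import median


def make_record(kind, n=100, center=50, width=8.0, base=10.0, height=5.0):
    """Synthetic record: flat terrain with one Gaussian bump or dip."""
    sign = 1.0 if kind == "hill" else -1.0
    return [
        base + sign * height * math.exp(-(((x - center) / width) ** 2))
        for x in range(1, n + 1)
    ]


def classify(record):
    """Naive rule: compare how far the max and the min sit from the median."""
    m = median(record)
    return "hill" if max(record) - m > m - min(record) else "valley"


print(classify(make_record("hill")))    # hill
print(classify(make_record("valley")))  # valley
```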

83.Horse Colic : Well documented attributes; 368 instances with 28 attributes (continuous, discrete, and nominal); 30% missing values.

84.Housing : Taken from StatLib library.

85.ICU : Data set prepared for the use of participants for the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

86.Image Segmentation : Image data described by high-level numeric-valued attributes, 7 classes.

87.Insurance Company Benchmark (COIL 2000) : This data set used in the CoIL 2000 Challenge contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data.

88.Internet Advertisements : This dataset represents a set of possible advertisements on Internet pages.

89.Internet Usage Data : This data contains general demographic information on internet users in 1997.

90.Ionosphere : Classification of radar returns from the ionosphere.

91.IPUMS Census Database : This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990.

92.Iris : Famous database; from Fisher, 1936.

93.ISOLET : Goal: Predict which letter-name was spoken; a simple classification task.

94.Japanese Credit Screening : Includes domain theory (generated by talking to Japanese domain experts); data in Lisp.

95.Japanese Vowels : This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.

96.KDD Cup 1998 Data : This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98.

97.KDD Cup 1999 Data : This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99.

98.Kinship : Relational dataset.

99.Labor Relations : From Collective Bargaining Review.

100.LED Display Domain : From Classification and Regression Trees book; We provide here 2 C programs for generating sample databases.
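To give a feel for what such a generator does, here is a rough Python sketch, not a port of the C programs mentioned above. It assumes each sample is the seven on/off segments of a digit on a seven-segment display, with every segment flipped independently with some probability. The segment ordering and the 10% default noise rate are my own assumptions.

```python
import random

# Segments in the order a, b, c, d, e, f, g (top, upper right, lower right,
# bottom, lower left, upper left, middle). This ordering is an assumption.
SEGMENTS = {
    0: "1111110", 1: "0110000", 2: "1101101", 3: "1111001", 4: "0110011",
    5: "1011011", 6: "1011111", 7: "1110000", 8: "1111111", 9: "1111011",
}


def led_sample(digit, noise=0.1, rng=random):
    """Return the 7 segment bits for a digit, each flipped with prob. noise."""
    bits = [int(b) for b in SEGMENTS[digit]]
    return [b ^ 1 if rng.random() < noise else b for b in bits]


def make_dataset(n, noise=0.1, seed=0):
    """Generate n (segments, digit) pairs with a reproducible seed."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        digit = rng.randrange(10)
        data.append((led_sample(digit, noise, rng), digit))
    return data


for segments, digit in make_dataset(3):
    print(digit, segments)
```

With `noise=0.0` every sample is the clean pattern for its digit, which is a handy sanity check before adding noise.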

101.Lenses : Database for fitting contact lenses.

102.Letter Recognition : Database of character image features; try to identify the letter.

103.Libras Movement : The data set contains 15 classes of 24 instances each. Each class references a hand movement type in LIBRAS (Portuguese name ‘LÍngua BRAsileira de Sinais’, the official Brazilian sign language).

104.Liver Disorders : BUPA Medical Research Ltd. database donated by Richard S. Forsyth.

105.Localization Data for Person Activity : Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

106.Logic Theorist : All code for Logic Theorist.

107.Low Resolution Spectrometer : From IRAS data, NASA Ames Research Center.

108.Lung Cancer : Lung cancer data; no attribute definitions.

109.Lymphography : This lymphography domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. (Restricted access)

110.M. Tuberculosis Genes : Data giving characteristics of each ORF (potential gene) in the M. tuberculosis bacterium. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

111.Madelon : MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

112.MAGIC Gamma Telescope : Data are Monte Carlo (MC) generated to simulate registration of high energy gamma particles in an atmospheric Cherenkov telescope.

113.Mammographic Mass : Discrimination of benign and malignant mammographic masses based on BI-RADS attributes and the patient’s age.

114.Mechanical Analysis : Fault diagnosis problem of electromechanical devices; also PUMPS DATA SET is newer version with domain theory and results.

115.Meta-data : Meta-Data was used in order to give advice about which classification method is appropriate for a particular dataset (taken from results of Statlog project).

116.MiniBooNE particle identification : This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

117.Mobile Robots : Learning concepts from sensor data of a mobile robot; set of data sets.

118.Molecular Biology (Promoter Gene Sequences) : E. Coli promoter gene sequences (DNA) with partial domain theory.

119.Molecular Biology (Protein Secondary Structure) : From CMU connectionist bench repository; Classifies secondary structure of certain globular proteins.

120.Molecular Biology (Splice-junction Gene Sequences) : Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.

121.MONK’s Problems : A set of three artificial domains over the same attribute space; Used to test a wide range of induction algorithms.

122.Moral Reasoner : Horn-clause model that qualitatively simulates moral reasoning; Theory includes negated literals.

123.Movie : This data set contains a list of over 10000 films including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, etc.

124.MSNBC.com Anonymous Web Data : This data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category (see description) and are recorded in time order.

125.Multiple Features : This dataset consists of features of handwritten numerals (‘0’-‘9’) extracted from a collection of Dutch utility maps.

126.Mushroom : From Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible.

127.Musk (Version 1) : The goal is to learn to predict whether new molecules will be musks or non-musks.

128.Musk (Version 2) : The goal is to learn to predict whether new molecules will be musks or non-musks.

129.NSF Research Award Abstracts 1990-2003 : This data set consists of (a) 129,000 abstracts describing NSF awards for basic research, (b) bag-of-word data files extracted from the abstracts, (c) a list of words used for indexing the bag-of-word

130.Nursery : Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools.

131.Online Handwritten Assamese Characters Dataset : This is a dataset of 8235 online handwritten Assamese characters. The “online” process involves capturing of data as text is written on a digitizing tablet with an electronic pen.

132.Opinosis Opinion/Review : This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of ipod nano”.

133.OpinRank Review Dataset : This data set contains user reviews of cars and hotels collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews).

134.Optical Recognition of Handwritten Digits : Two versions of this database available; see folder.

135.Othello Domain Theory : Used in research to generate features for an inductive learning system.

136.Ozone Level Detection : Two ground ozone level data sets are included in this collection. One is the eight hour peak set (eighthr.data), the other is the one hour peak set (onehr.data). Those data were collected from 1998 to 2004 at the Houston, Galveston and Brazoria area.

137.p53 Mutants : The goal is to model mutant p53 transcriptional activity (active vs inactive) based on data extracted from biophysical simulations.

138.Page Blocks Classification : The problem consists of classifying all the blocks of the page layout of a document that has been detected by a segmentation process.

139.Parkinsons : Oxford Parkinson’s Disease Detection Dataset.

140.Parkinsons Telemonitoring : Oxford Parkinson’s Disease Telemonitoring Dataset.

141.PEMS-SF : 15 months worth of daily data (440 daily records) that describes the occupancy rate, between 0 and 1, of different car lanes of the San Francisco bay area freeways across time.

142.Pen-Based Recognition of Handwritten Digits : Digit database of 250 samples from 44 writers.

143.Pima Indians Diabetes : From National Institute of Diabetes and Digestive and Kidney Diseases; Includes cost data (donated by Peter Turney).

144.Pioneer-1 Mobile Robot Data : This dataset contains time series sensor readings of the Pioneer-1 mobile robot. The data is broken into “experiences” in which the robot takes action for some period of time and experiences a control

145.Pittsburgh Bridges : Bridges database that has original and numeric-discretized datasets.

146.Plants : Data has been extracted from the USDA plants database. It contains all plants (species and genera) in the database and the states of USA and Canada where they occur.

147.Poker Hand : Purpose is to predict poker hands.

148.Post-Operative Patient : Dataset of patient features.

149.Primary Tumor : From Ljubljana Oncology Institute.

150.Prodigy : Assorted domains like blocksworld, eightpuzzle, and schedworld.

151.Protein Data : Undocumented.

152.Pseudo Periodic Synthetic Time Series : This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself.

153.PubChem Bioassay Data : These highly imbalanced bioassay datasets are from the differing types of screening that can be performed using high-throughput screening (HTS) technology. 21 datasets were created from 12 bioassays.

154.Quadruped Mammals : The file animals.c is a data generator of structured instances representing quadruped animals.

155.Qualitative Structure Activity Relationships : Two sets of datasets are given: pyrimidines and triazines.

156.Record Linkage Comparison Patterns : Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.

157.Relative location of CT slices on axial axis : The dataset consists of 384 features extracted from CT images. The class variable is numeric and denotes the relative location of the CT slice on the axial axis of the human body.

158.Reuters Transcribed Subset : This dataset is created by reading out 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to create corresponding transcriptions.

159.Reuters-21578 Text Categorization Collection : This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

160.Robot Execution Failures : This dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals.

161.SECOM : Data from a semi-conductor manufacturing process.

162.Semeion Handwritten Digit : 1593 handwritten digits from around 80 persons were scanned, stretched in a rectangular box 16x16 in a gray scale of 256 values.

163.Servo : Data was from a simulation of a servo system.

164.Shuttle Landing Control : Tiny database; all nominal values.

165.Solar Flare : Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period

太阳耀斑：每个类的属性一定的阶级，在 24 小时内发生的太阳耀斑的数量进行计数

166.Soybean (Large) : Michalski’s famous soybean disease database

大豆（大）： MICHALSKI 著名的大豆疾病数据库

167.Soybean (Small) : Michalski’s famous soybean disease database

大豆（小）： MICHALSKI 著名的大豆疾病数据库

168.Spambase : Classifying Email as Spam or Non-Spam

Spambase：归类为“垃圾邮件”或“非垃圾邮件的电子邮件

169.SPECT Heart : Data on cardiac Single Proton Emission Computed Tomography (SPECT) images. Each patient classified into two categories: normal and abnormal.

SPECT 的心脏：心脏单个质子发射计算机断层显像（ SPECT）的图像数据。每个病人分为

两类：正常和不正常的。

170.SPECTF Heart : Data on cardiac Single Proton Emission Computed Tomography (SPECT) images. Each patient classified into two categories: normal and abnormal.

SPECTF 心脏：心脏单个质子发射计算机断层显像（ SPECT）的图像数据。每个病人分为两类：正常和不正常的。

171.Spoken Arabic Digit : This dataset contains timeseries of mel-frequency cepstrum coefficients (MFCCs) corresponding to spoken Arabic digits.
Includes data from 44 male and 44 female native Arabic speakers.

口语阿拉伯语位：该数据集包含 MEL 频率倒谱系数（ MFCCs ）讲阿拉伯语数字对应的时间序列。包括 44 男 44 女的母语讲阿拉伯语的数据。

172.Sponge : Data on sponges; Attributes in Spanish

海绵：关于海绵的数据；属性以西班牙语给出

173.Statlog (Australian Credit Approval) : This file concerns credit card applications. This database exists elsewhere in the repository (Credit Screening Database) in a slightly different form

Statlog（澳大利亚信贷审批）：该文件涉及信用卡申请。该数据库在资源库的其他位置（Credit Screening Database，信用筛查数据库）以略有不同的形式存在

174.Statlog (German Credit Data) : This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix

Statlog（德国信用数据）：该数据集将由一组属性描述的人划分为信用风险好或坏两类。提供两种格式（其中一种为全数值）。另附有一个代价矩阵
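
A cost matrix lets the two kinds of misclassification be penalized differently when scoring a classifier. The sketch below sums the cost of a set of predictions; the cost values, the labels `"good"`/`"bad"`, and the function name `total_cost` are illustrative assumptions and are not taken from the description above.

```python
# Hypothetical cost matrix (assumption, for illustration only),
# indexed as COST[actual][predicted].
COST = {
    "good": {"good": 0, "bad": 1},  # rejecting a good applicant
    "bad": {"good": 5, "bad": 0},   # accepting a bad applicant
}


def total_cost(actual, predicted, cost=COST):
    """Sum the misclassification cost over paired actual/predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(cost[a][p] for a, p in zip(actual, predicted))


actual = ["good", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "bad"]
print(total_cost(actual, predicted))
```

Under such a matrix, two classifiers with the same accuracy can have very different total costs, which is why the cost matrix matters for evaluation.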

175.Statlog (Heart) : This dataset is a heart disease database similar to a database already present in the repository (Heart Disease databases) but in a slightly different form

Statlog（心脏）：该数据集是一个心脏病数据库，与资源库中已有的数据库（Heart Disease databases）类似，但形式略有不同

176.Statlog (Image Segmentation) : This dataset is an image segmentation database similar to a database already present in the repository (Image segmentation database) but in a slightly different form.

Statlog（图像分割）：该数据集是一个图像分割数据库，与资源库中已有的数据库（Image segmentation database）类似，但形式略有不同。

177.Statlog (Landsat Satellite) : Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood

Statlog（Landsat 卫星）：卫星图像中 3x3 邻域内像素的多光谱值，以及与每个邻域中心像素相关联的分类
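
To make the "3x3 neighbourhood with a label on the central pixel" idea concrete, here is a minimal sketch that extracts the nine values around a centre pixel from a single band. The function name `neighbourhood_3x3` and the toy 4x4 band are assumptions for illustration; with several spectral bands, the same extraction would be repeated per band.

```python
def neighbourhood_3x3(image, row, col):
    """Return the 9 values of the 3x3 neighbourhood centred on (row, col), row by row."""
    if not (1 <= row < len(image) - 1 and 1 <= col < len(image[0]) - 1):
        raise ValueError("centre pixel must not lie on the image border")
    return [
        image[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
    ]


# Toy single-band image (assumption): value = 10 * row + column.
band = [[10 * r + c for c in range(4)] for r in range(4)]
features = neighbourhood_3x3(band, 1, 1)
print(features)     # nine values; the centre pixel is at index 4
print(features[4])  # value of the central pixel, the one the class label refers to
```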

178.Statlog (Shuttle) : The shuttle dataset contains 9 attributes all of which are numerical. Approximately 80% of the data belongs to class 1

Statlog（航天飞机）：Shuttle 数据集包含 9 个属性，全部为数值型。大约 80％ 的数据属于类别 1

179.Statlog (Vehicle Silhouettes) : 3D objects within a 2D image by application of an ensemble of shape feature extractors to the 2D silhouettes of the objects.

Statlog（车辆轮廓）：二维图像中的三维物体，通过将一组形状特征提取器应用于物体的二维轮廓得到。

180.Statlog Project : Various Databases: Vehicle silhouettes, Landsat Satellite, Shuttle, Australian Credit Approval, Heart Disease, Image Segmentation, German Credit

Statlog 项目：各种数据库：车辆轮廓、Landsat 卫星、航天飞机、澳大利亚信贷审批、心脏病、图像分割、德国信用

181.Steel Plates Faults : A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.

钢板缺陷：一个钢板缺陷数据集，缺陷分为 7 种不同类型。目标是训练机器学习模型进行自动模式识别。

182.Student Loan Relational : Student Loan Relational Domain

学生贷款关系型数据：学生贷款关系域

183.Synthetic Control Chart Time Series : This data consists of synthetically generated control charts.

合成控制图时间序列：该数据由人工合成生成的控制图组成。

184.Syskill and Webert Web Page Ratings : This database contains HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands- recording artists; Goats; Sheep; and BioMedical)

Syskill 和 Webert 网页评分：该数据库包含网页的 HTML 源代码以及单个用户对这些网页的评分。网页涉及四个不同主题（乐队/录音艺术家；山羊；绵羊；生物医学）

185.Teaching Assistant Evaluation : The data consist of evaluations of teaching performance; scores are “low”, “medium”, or “high”

助教评价：数据由教学表现的评价构成；评分为“低”、“中”或“高”

186.Thyroid Disease : 10 separate databases from Garavan Institute

甲状腺疾病：来自 Garavan 研究所的 10 个独立数据库

187.Tic-Tac-Toe Endgame : Binary classification task on possible configurations of tic-tac-toe game

井字棋残局：针对井字棋游戏可能棋局配置的二元分类任务
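
One natural way to read this binary task is "does player x have three in a row in this final board?". The sketch below labels a board under that reading. The target definition, the 9-character board encoding (`x`, `o`, `b` for blank, listed row by row), and the function name `x_wins` are all assumptions for illustration, since the description above does not spell them out.

```python
# The eight winning lines on a 3x3 board, as index triples into a
# 9-character string read row by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True if 'x' occupies all three cells of any winning line."""
    if len(board) != 9:
        raise ValueError("board must have exactly 9 cells")
    return any(all(board[i] == "x" for i in line) for line in LINES)


print(x_wins("xxxoobbob"))  # top row is all x
print(x_wins("xoxoxooxo"))  # full board with no line of three x
```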

188.Trains : 2 data formats (structured, one-instance-per-line)

火车：2 种数据格式（结构化格式、每行一个实例）

189.Twenty Newsgroups : This data set consists of 20000 messages taken from 20 newsgroups.

二十个新闻组：该数据集由取自 20 个新闻组的 20000 条消息组成。

190.UJI Pen Characters : Data consists of written characters in a UNIPEN-like format

UJI 笔迹字符：数据由以类 UNIPEN 格式记录的手写字符组成

191.UJI Pen Characters (Version 2) : A pen-based database with more than 11k isolated handwritten characters

UJI 笔迹字符（第 2 版）：一个基于笔输入的数据库，包含超过 1.1 万个孤立手写字符

192.Undocumented : Various datasets without documentation (feel free to explore!)

无文档：各种没有文档说明的数据集（欢迎自由探索！）

193.University : Data in original (LISP-readable) form

大学：原始（LISP 可读）形式的数据

194.UNIX User Data : This file contains 9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users at Purdue over the course of up to 2 years.

UNIX 用户数据：该文件包含 9 组经过脱敏处理的用户数据，取自普渡大学 8 位 UNIX 计算机用户在最长 2 年时间内的命令历史记录。

195.URL Reputation : Anonymized 120-day subset of the ICML-09 URL data containing 2.4 million examples and 3.2 million features.

URL 信誉：ICML-09 URL 数据经匿名化处理的 120 天子集，包含 240 万个样本和 320 万个特征。

196.US Census Data (1990) : The USCensus1990raw data set contains a one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample.

美国人口普查数据（1990 年）：USCensus1990raw 数据集包含公共使用微观数据样本（PUMS）个人记录的百分之一抽样，这些记录取自完整的 1990 年人口普查样本。

197.Volcanoes on Venus - JARtool experiment : The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft.

金星上的火山 - JARtool 实验：JARtool 项目是一项开创性工作，旨在开发一个自动化系统，对麦哲伦号探测器传回的大量金星图像中的小型火山进行编目。

198.Wall-Following Robot Navigation Data : The data were collected as the SCITOS G5 robot navigates through the room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its ‘waist’.

沿墙行走机器人导航数据：数据采集自 SCITOS G5 机器人沿墙按顺时针方向在房间内行进 4 圈的过程，使用了环绕其“腰部”呈圆形排列的 24 个超声波传感器。

199.Water Treatment Plant : Multiple classes predict plant state

水处理厂：用多个类别预测处理厂的运行状态

200.Waveform Database Generator (Version 1) : CART book’s waveform domains

波形数据库生成器（版本 1）：CART 一书中的波形域

201.Waveform Database Generator (Version 2) : CART book’s waveform domains

波形数据库生成器（第 2 版）：CART 一书中的波形域

202.Wine : Using chemical analysis determine the origin of wines

葡萄酒：利用化学分析判定葡萄酒的产地。

203.Wine Quality : Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/ ).

葡萄酒质量：包括两个数据集，分别对应来自葡萄牙北部的红、白 vinho verde（绿酒）葡萄酒样本。目标是基于理化检验对葡萄酒质量进行建模（参见 [Cortez et al., 2009]，http://www3.dsi.uminho.pt/pcortez/wine/ ）。

204.YearPredictionMSD : Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

年度预测 MSD ：根据音频特征预测歌曲的发行年份。歌曲大多是西方的商业曲目，年份从 1922 年到 2011 年，在 2000 年代达到峰值。

205.Yeast : Predicting the Cellular Localization Sites of Proteins

酵母 DataSet ：预测蛋白质的细胞定位点。

206.Zoo : Artificial, 7 classes of animals

动物园 DataSet ：人工数据，包含 7 类动物。

coderwangson 博客专家
