Quote
BibTeX
@article{WANG201621,
title = "Feature selection methods for big data bioinformatics: A survey from the search perspective",
journal = "Methods",
volume = "111",
pages = "21 - 31",
year = "2016",
note = "Big Data Bioinformatics",
issn = "1046-2023",
doi = "https://doi.org/10.1016/j.ymeth.2016.08.014",
url = "http://www.sciencedirect.com/science/article/pii/S1046202316302742",
author = "Lipo Wang and Yaoli Wang and Qing Chang",
keywords = "Biomarkers, Classification, Clustering, Computational biology, Computational intelligence, Data mining, Evolutionary computation, Evolutionary algorithms, Fuzzy logic, Genetic algorithms, Machine learning, Microarray, Neural networks, Particle swarm optimization, Pattern recognition, Random forests, Rough sets, Soft computing, Swarm intelligence, Support vector machines"
}
Normal
Lipo Wang, Yaoli Wang, Qing Chang,
Feature selection methods for big data bioinformatics: A survey from the search perspective,
Methods,
Volume 111,
2016,
Pages 21-31,
ISSN 1046-2023,
https://doi.org/10.1016/j.ymeth.2016.08.014.
(http://www.sciencedirect.com/science/article/pii/S1046202316302742)
Keywords: Biomarkers; Classification; Clustering; Computational biology; Computational intelligence; Data mining; Evolutionary computation; Evolutionary algorithms; Fuzzy logic; Genetic algorithms; Machine learning; Microarray; Neural networks; Particle swarm optimization; Pattern recognition; Random forests; Rough sets; Soft computing; Swarm intelligence; Support vector machines
Summary
Big Data Bioinformatics
Feature selection applications
Traditional classification:
1. filter
2. wrapper
3. embedded
New categories: (think of combinatorial optimization/search problems)
1. exhaustive search
2. heuristic search — with or without feature importance ranking extracted from the data
3. hybrid methods
1 Truly optimal feature selection: exhaustive search
Classifier:
- random forests
- support vector machines (SVMs)
- cluster-oriented ensemble classifiers
- random vector functional link (RVFL) networks
- radial basis function (RBF) neural networks
Search for the truly optimal feature subset — computationally expensive — NP-hard
Must exhaust all possible feature combinations
"combinatorial explosion" — the number of subsets grows exponentially with the number of original features — impossible in practice
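The combinatorial explosion above can be made concrete with a minimal stdlib-only sketch: enumerate all 2^n − 1 non-empty subsets and score each one. The scoring criterion here (average class-mean separation for 2 classes) is an assumption for illustration, standing in for a real classifier's accuracy.

```python
from itertools import combinations

def subset_score(X, y, subset):
    """Toy filter criterion (an assumption for illustration): average
    class-mean separation of the chosen features, for 2 classes."""
    diffs = []
    for j in subset:
        m0 = [x[j] for x, c in zip(X, y) if c == 0]
        m1 = [x[j] for x, c in zip(X, y) if c == 1]
        diffs.append(abs(sum(m0) / len(m0) - sum(m1) / len(m1)))
    return sum(diffs) / len(diffs)

def exhaustive_search(X, y, n_features):
    """Evaluate every non-empty feature subset: 2^n - 1 candidates,
    a count that explodes combinatorially as n grows."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            s = subset_score(X, y, subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset

# 4 samples, 3 features; only feature 0 separates the classes.
X = [[0.1, 5.0, 0.2], [0.2, 5.1, 0.1], [3.0, 5.0, 0.3], [3.1, 4.9, 0.2]]
y = [0, 0, 1, 1]
print(exhaustive_search(X, y, 3))  # → (0,)
```

With 3 features this loop visits 7 subsets; with the >20,000 genes typical of microarray data it would visit 2^20000 − 1, which is why truly optimal selection is intractable.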
2 Suboptimal Feature Selection: Heuristic Search
"Heuristic search": guided by "experience" or "wise choices"; expected to find good suboptimal solutions, or even the global optimum
better than random search
Necessary components of an algorithm:
1. Local improvement
2. Innovation
Simulated annealing — accepts worse solutions with a certain probability, which helps jump out of local optima
genetic algorithm (GA)
ant colony optimization (ACO)
particle swarm optimization (PSO)
chaotic simulated annealing
tabu search
noisy chaotic simulated annealing
branch-and-bound
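Simulated annealing from the list above maps naturally onto the binary-vector encoding of feature selection. A minimal stdlib-only sketch, where `score(mask)` is any user-supplied subset-quality function (an assumption; higher is better) and the toy target subset is hypothetical:

```python
import math
import random

def simulated_annealing_fs(score, n_features, iters=2000, t0=1.0, cooling=0.995, seed=0):
    """Feature selection by simulated annealing over binary selection vectors."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    cur = score(mask)
    best, best_s = list(mask), cur
    t = t0
    for _ in range(iters):
        j = rng.randrange(n_features)
        mask[j] = not mask[j]            # flip one bit: local move
        new = score(mask)
        # Accept a worse solution with probability exp((new - cur) / t):
        # this is what lets SA escape local optima.
        if new >= cur or rng.random() < math.exp((new - cur) / t):
            cur = new
            if cur > best_s:
                best_s, best = cur, list(mask)
        else:
            mask[j] = not mask[j]        # undo the move
        t *= cooling                     # cool down: behaves like hill-climbing late on
    return best, best_s

# Toy score: negative Hamming distance to a hypothetical ideal subset {0, 2}.
target = [True, False, True, False, False]
score = lambda m: -sum(a != b for a, b in zip(m, target))
best, s = simulated_annealing_fs(score, 5)
```

The bit-flip move is the "local improvement" component; the probabilistic acceptance of worse solutions is the "innovation" component that distinguishes SA from pure greedy search.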
A. Heuristic search without feature importance ranking extracted from the data
Binary vector — each bit indicates whether the corresponding feature is selected
The nearest neighbor classifier
case-based reasoning
a leave-one-out procedure
succinct rules
silhouette statistics
microarray
peak tree
input weights of an SVM or neural network — embedded — feature importance ranking (not derived directly from the data)
statistical analysis of the weights
K-means + SVM
margin influence analysis (MIA) + SVM
Mann–Whitney U test — nonparametric; makes no distribution-related assumptions
in mixed descriptor space
Blocking — modularity
aggregating the outputs of multiple learning algorithms — evaluating a subset of genes — significantly improved results — independent of the classification algorithm used
Quantitative structure–activity relationships (QSARs): biological activities of chemical compounds + their physicochemical descriptors
lexico-semantic event structures
a noun argument structure corpus
an SRL (semantic role labeling) system
nonparallel plane proximal classifiers + SVM
Regularization — high-dimensional data
The support feature machine (SFM)
fuzzy-rough sets
feature evaluation criteria:
1. dependency
2. relevance
3. redundancy
4. significance
the signal-to-noise ratio (SNR)
a Laplace naive Bayes model
Laplace distribution instead of the normal distribution
Array comparative genomic hybridization (aCGH)
V. Metsis, F. Makedon, D. Shen, H. Huang, DNA copy number selection using robust structured sparsity-inducing norms, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 168–181,
http://dx.doi.org/10.1109/TCBB.2013.141.
B. Greedy search with feature importance ranking extracted from the data
Evaluate the importance of each feature first
A subset of features that works best for one classifier may not work well for another.
Importance measures: (directly derived from input data)
1. t-test
2. fold-change difference
3. Z-score
4. Pearson correlation coefficient
5. relative entropy
6. mutual information
7. separability-correlation measure
8. feature relevance
9. label changes produced by each feature
10. information gain
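The t-test entry in the list above can be sketched as a minimal stdlib-only filter: score each feature by a Welch t-statistic computed directly from the data, rank, and keep the top k. The example data are hypothetical.

```python
import math

def t_scores(X, y):
    """Welch t-statistic per feature for a 2-class problem — an importance
    measure derived directly from the input data."""
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x, c in zip(X, y) if c == 0]
        b = [x[j] for x, c in zip(X, y) if c == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)  # sample variance
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        scores.append(abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-12))
    return scores

def top_k_features(X, y, k):
    """Greedy ranking: keep the k features with the largest |t| scores."""
    s = t_scores(X, y)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]

# Feature 0 separates classes with low variance; feature 1 is noisy.
X = [[0.1, 9.0], [0.2, 1.0], [3.0, 5.0], [3.1, 4.0]]
y = [0, 0, 1, 1]
print(top_k_features(X, y, 1))  # → [0]
```

Note the caveat from the notes above: each feature is scored individually, so a top-k subset chosen this way can still work well for one classifier and poorly for another.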
Dimension reduction method:
- class-separability measure
- Fisher ratio
- principal components analysis (PCA)
- t-test
4 feature selection (FS) methods:
- t-test
- significance analysis of microarrays (SAM)
- rank products (RP)
- random forest (RF)
3 Hybrid feature selection techniques
A Semi-exhaustive search
1 Pick some important features
- Feature importance ranking measure
- Fisher-Markov selector
- equal-width discretization scheme
- Collection of multiple traditional statistical methods
- high predictive power
2 Use fewer features for further search
- exhaustive search
- Multi-objective optimization
- an embedded GA, Tabu Search (TS), and SVM
- graph optimization model
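The two-step semi-exhaustive idea above can be sketched in a few lines of stdlib Python: first shortlist the m individually best features, then search that shortlist exhaustively. Both scoring functions are user-supplied assumptions here, not the survey's specific measures.

```python
from itertools import combinations

def hybrid_select(n_features, rank_score, subset_score, m=8):
    """Hybrid / semi-exhaustive sketch:
    step 1 — keep the m features with the best individual scores;
    step 2 — exhaustively search subsets of those m features only.
    `rank_score(j)` and `subset_score(subset)` are hypothetical inputs."""
    shortlist = sorted(range(n_features), key=lambda j: -rank_score(j))[:m]
    best, best_s = None, float("-inf")
    for k in range(1, len(shortlist) + 1):
        for sub in combinations(shortlist, k):
            s = subset_score(sub)
            if s > best_s:
                best, best_s = sub, s
    return best

# Hypothetical toy scores: individual importance = feature index;
# subset quality rewards members of {7, 9}, penalizes everything else.
good = {7, 9}
rank = lambda j: j
quality = lambda sub: sum(1 if j in good else -1 for j in sub)
print(hybrid_select(10, rank, quality, m=4))
```

The exhaustive step now costs 2^m − 1 evaluations instead of 2^n − 1, which is exactly the point of filtering first: m can be kept small enough (tens, not thousands) for full enumeration to be affordable.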
B Other Hybrid Feature Selection Methods
Feature extraction method
spectral biclustering
sparse component analysis
Poisson model
scatter matrix
singular value decomposition
weighted PCA
robust principal component analysis
linear discriminant analysis
Laplacian linear discriminant analysis (LLDA)
Laplacian score
SVD-entropy
nonnegative matrix factorization (NMF)
sparse NMF (SNMF)
artificial neural network classification scheme
4 Summary and Outlook
Big Data Bioinformatics
A Small sample problem
Very high dimensionality (genes) - >20,000
sample size - ~50 patients
overfitting and overoptimism
B Unbalanced data
The amount of data varies across classes
up-sampling classes with fewer data, down-sampling classes with more data
making classification errors class-sensitive (cost-sensitive learning)
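The up-sampling remedy above can be shown as a minimal stdlib-only sketch that replicates minority-class samples (with replacement) until every class matches the majority class size; the example data are hypothetical.

```python
import random

def balance_by_upsampling(X, y, seed=0):
    """Naive rebalancing: up-sample each minority class (with replacement)
    to the size of the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for x, c in zip(X, y):
        by_class.setdefault(c, []).append(x)
    n_max = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for c, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        for x in rows + extra:
            Xb.append(x)
            yb.append(c)
    return Xb, yb

# 4 samples of class 0, 1 sample of class 1 -> 4 and 4 after balancing.
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1]
Xb, yb = balance_by_upsampling(X, y)
```

Down-sampling is the mirror image (drop majority-class samples instead of duplicating minority ones), and cost-sensitive learning avoids resampling altogether by weighting errors on the rare class more heavily.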
signal-to-noise correlation coefficient (S2N)
Feature Assessment by Sliding Thresholds (FAST)
empirical mutual information — the data sparseness issue
multivariate normal distributions
C Class-dependent feature selection
Choose a different subset of features for each class
class-independent FS
class-dependent FS
class distributions
RBF neural classifier — the clustering property
GA
SVM
the multi-layer perceptron (MLP) neural network
the probability density function (PDF) projection theorem
principal component analysis (PCA) from class-specific subspaces
a C-class classification problem — C 2-class classifiers
feature importance measures:
- RELIEF
- class separability
- minimal-redundancy-maximal-relevancy
full class relevant (FCR) and partial class relevant (PCR) features
Markov blanket
multiclass ranking statistics
class-specific statistics
Pareto-front — alleviates the bias
F-score and KW-score
a binary tree of simpler classification subproblems
feature subsets of every class
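The class-dependent idea above — decompose a C-class problem into C 2-class problems and pick a separate feature subset for each class — can be sketched with stdlib Python. The relevance measure `score(j, X, y_bin)` is a user-supplied assumption (here a simple class-mean difference), standing in for measures like RELIEF or class separability.

```python
def class_dependent_subsets(X, y, score, k=2):
    """Class-dependent FS sketch: for each class c, binarize the labels
    one-vs-rest and keep a separate top-k feature subset for that class
    (class-independent FS would use one subset for all classes)."""
    subsets = {}
    for c in sorted(set(y)):
        y_bin = [1 if label == c else 0 for label in y]  # C-class -> 2-class
        ranked = sorted(range(len(X[0])), key=lambda j: -score(j, X, y_bin))
        subsets[c] = ranked[:k]
    return subsets

def mean_diff(j, X, y_bin):
    """Hypothetical relevance measure: |mean(class) - mean(rest)| of feature j."""
    pos = [x[j] for x, t in zip(X, y_bin) if t == 1]
    neg = [x[j] for x, t in zip(X, y_bin) if t == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

# Each class is marked by its own feature, so each gets a different subset.
X = [[5, 0, 0], [5, 0, 0], [0, 5, 0], [0, 5, 0], [0, 0, 5], [0, 0, 5]]
y = [0, 0, 1, 1, 2, 2]
print(class_dependent_subsets(X, y, mean_diff, k=1))  # → {0: [0], 1: [1], 2: [2]}
```

Each of the C binary classifiers then trains on its own class's subset, which is how class-dependent FS exploits differing class distributions.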