Feature selection methods for big data bioinformatics: a search-based perspective

Citation

BibTeX

@article{WANG201621,
  title = {Feature selection methods for big data bioinformatics: A survey from the search perspective},
  journal = {Methods},
  volume = {111},
  pages = {21 - 31},
  year = {2016},
  note = {Big Data Bioinformatics},
  issn = {1046-2023},
  doi = {https://doi.org/10.1016/j.ymeth.2016.08.014},
  url = {http://www.sciencedirect.com/science/article/pii/S1046202316302742},
  author = {Lipo Wang and Yaoli Wang and Qing Chang},
  keywords = {Biomarkers, Classification, Clustering, Computational biology, Computational intelligence, Data mining, Evolutionary computation, Evolutionary algorithms, Fuzzy logic, Genetic algorithms, Machine learning, Microarray, Neural networks, Particle swarm optimization, Pattern recognition, Random forests, Rough sets, Soft computing, Swarm intelligence, Support vector machines}
}

Plain text

Lipo Wang, Yaoli Wang, Qing Chang,
Feature selection methods for big data bioinformatics: A survey from the search perspective,
Methods,
Volume 111,
2016,
Pages 21-31,
ISSN 1046-2023,
https://doi.org/10.1016/j.ymeth.2016.08.014.
(http://www.sciencedirect.com/science/article/pii/S1046202316302742)
Keywords: Biomarkers; Classification; Clustering; Computational biology; Computational intelligence; Data mining; Evolutionary computation; Evolutionary algorithms; Fuzzy logic; Genetic algorithms; Machine learning; Microarray; Neural networks; Particle swarm optimization; Pattern recognition; Random forests; Rough sets; Soft computing; Swarm intelligence; Support vector machines


Summary

Big Data Bioinformatics

Feature selection applications

Traditional classification:
1. filter
2. wrapper
3. embedded

New categorization (viewing feature selection as a combinatorial optimization/search problem):
1. exhaustive search
2. heuristic search — with or without feature importance ranking derived from the data
3. hybrid methods


1 Truly optimal feature selection: exhaustive search

Classifier:

  1. random forests
  2. support vector machines (SVMs)
  3. cluster-oriented ensemble classifiers
  4. random vector functional link (RVFL) networks
  5. radial basis function (RBF) neural networks

Searching for the truly optimal subset of features is computationally expensive (NP-hard).

Exhaustive search must try all possible feature combinations.

"Combinatorial explosion" — the number of subsets grows exponentially with the number of features.

With more than about 30 original features, exhaustive search becomes practically impossible.
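To make the combinatorial explosion concrete, here is a minimal sketch of exhaustive subset search. The `evaluate` callable is an assumption standing in for any subset-scoring function (e.g. cross-validated accuracy of one of the classifiers above); it is not part of the survey.

```python
from itertools import combinations

def exhaustive_search(features, evaluate):
    """Evaluate every non-empty subset of `features` and return the best.

    `evaluate` is a hypothetical scoring function (higher is better),
    e.g. cross-validated classifier accuracy on that subset.
    """
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Combinatorial explosion: n features yield 2**n - 1 non-empty subsets.
# At n = 30 that is already over a billion subset evaluations.
print(2**30 - 1)  # 1073741823
```

The double loop is the whole problem: its cost is exponential in the number of features, which is why the survey treats exhaustive search as feasible only for very small feature sets.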


2 Suboptimal Feature Selection: Heuristic Search

"Heuristic search": guided by "experience" or "wise choices"; expected to find good suboptimal solutions, or sometimes even the global optimum.

better than random search

Necessary components of an algorithm:
1. Local improvement
2. Innovation

Simulated annealing — by accepting worse solutions with a certain probability, it helps the search jump out of local optima.

genetic algorithms (GA)
ant colony optimization (ACO)
particle swarm optimization (PSO)
chaotic simulated annealing
tabu search
noisy chaotic simulated annealing
branch-and-bound
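The binary-vector encoding (see below) and the accept-worse-moves idea behind simulated annealing can be sketched as follows. This is a minimal illustration, not the survey's algorithm: the `evaluate` scoring function, the geometric cooling schedule, and all hyperparameters are assumptions.

```python
import math
import random

def sa_feature_selection(n_features, evaluate, n_iters=500,
                         t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing over binary feature-selection vectors.

    `evaluate` scores a 0/1 mask (higher is better). A worse neighbor is
    accepted with probability exp(delta / T), which lets the search
    escape local optima; T shrinks geometrically each iteration.
    """
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    score = evaluate(mask)
    best_mask, best_score = mask[:], score
    t = t0
    for _ in range(n_iters):
        candidate = mask[:]
        i = rng.randrange(n_features)
        candidate[i] = 1 - candidate[i]        # flip one feature in/out
        cand_score = evaluate(candidate)
        delta = cand_score - score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            mask, score = candidate, cand_score
        if score > best_score:
            best_mask, best_score = mask[:], score
        t *= cooling                            # cool the temperature
    return best_mask, best_score
```

Early on (high T) even poor flips are often accepted; as T cools, the search becomes greedy, which is exactly the "local improvement + innovation" combination the notes list above.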


A. Heuristic search without feature importance ranking derived from the data

Binary vector — each bit indicates whether the corresponding feature is selected
the nearest-neighbor classifier
case-based reasoning
a leave-one-out procedure
succinct rules
silhouette statistics
microarray
peak tree
input weights of an SVM or neural network — embedded — feature importance ranking (not derived directly from the data)
weight statistical analysis
K-means + SVM
margin influence analysis (MIA) + SVM
Mann–Whitney U test — a nonparametric test with no distribution-related assumptions
in mixed descriptor space
Blocking — modularity
aggregating the outputs of multiple learning algorithms — evaluating a subset of genes — significantly improved — independent of the classification algorithm used
quantitative structure–activity relationships (QSARs): biological activities of chemical compounds + their physicochemical descriptors
lexico-semantic event structures
a noun argument structure corpus
an SRL (semantic role labeling) system
nonparallel plane proximal classifiers + SVM
+ Lp regularization — high-dimensional data
The support feature machine (SFM)
fuzzy-rough sets

feature evaluation criteria:
1. dependency
2. relevance
3. redundancy
4. significance

the signal-to-noise ratio (SNR)

a Laplace naive Bayes model — uses the Laplace distribution instead of the normal distribution


Array comparative genomic hybridization (aCGH)

V. Metsis, F. Makedon, D. Shen, H. Huang, DNA copy number selection using robust structured sparsity-inducing norms, IEEE/ACM Trans. Comput. Biol. Bioinf. 11 (1) (2014) 168–181, http://dx.doi.org/10.1109/TCBB.2013.141.


B. Greedy search with feature importance ranking derived from the data

Evaluate the importance of each feature first, then greedily select the top-ranked ones.

A subset of features that works best for one classifier may not work well for another.

Importance measures: (directly derived from input data)
1. t-test
2. fold-change difference
3. Z-score
4. Pearson correlation coefficient
5. relative entropy
6. mutual information
7. separability-correlation measure
8. feature relevance
9. label changes produced by each feature
10. information gain

Dimension reduction method:

  • class-separability measure
  • Fisher ratio
  • principal components analysis (PCA)
  • t-test

Four feature selection (FS) methods:

  • t-test
  • significance analysis of microarrays (SAM)
  • rank products (RP)
  • random forest (RF)
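As an illustration of ranking features directly from the data, here is a minimal sketch of t-test-based ranking, one of the importance measures listed above. Using the Welch (unequal-variance) form of the statistic is an assumption for this sketch.

```python
import math

def t_statistic(xs, ys):
    """Welch two-sample t-statistic between one feature's values
    in class 0 (`xs`) and class 1 (`ys`)."""
    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((x - m) ** 2 for x in v) / (len(v) - 1)
    mx, vx = mean_var(xs)
    my, vy = mean_var(ys)
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

def rank_features(X, y, top_k):
    """Rank features by |t| between the two classes; keep the top_k.

    X is a list of samples (each a list of feature values),
    y the 0/1 class labels.
    """
    scores = []
    for j in range(len(X[0])):
        xs = [row[j] for row, label in zip(X, y) if label == 0]
        ys = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(t_statistic(xs, ys)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]
```

Any of the other listed measures (fold change, mutual information, etc.) could be dropped in place of `t_statistic`; the greedy "rank then take the top k" skeleton stays the same.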

3 Hybrid feature selection techniques


A. Semi-exhaustive search

1. First pick a few important features
- Feature importance ranking measure
- Fisher-Markov selector
- equal-width discretization scheme
- Collection of multiple traditional statistical methods
- high predictive power
2. Then search further over this reduced feature set
- exhaustive search
- Multi-objective optimization
- an embedded GA, Tabu Search (TS), and SVM
- graph optimization model
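The two-step hybrid above (rank first, then search exhaustively over only the survivors) might be sketched like this. The `importance` and `evaluate` callables are hypothetical placeholders for, e.g., a Fisher-type per-feature score and a cross-validated subset accuracy.

```python
from itertools import combinations

def semi_exhaustive(features, importance, evaluate, top_m=10):
    """Hybrid FS sketch: keep the top_m features by a per-feature
    importance score, then exhaustively search subsets of only those.

    Exhaustive cost drops from 2**len(features) to 2**top_m subsets.
    """
    # Step 1: filter by importance ranking.
    ranked = sorted(features, key=importance, reverse=True)[:top_m]
    # Step 2: exhaustive search over the small surviving set.
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(ranked) + 1):
        for subset in combinations(ranked, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

This captures the trade-off the section describes: the filter step is cheap but may discard useful features, while the search step is optimal only within the filtered set.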


B. Other Hybrid Feature Selection Methods

Feature extraction method

spectral biclustering
sparse component analysis
Poisson model
scatter matrix
singular value decomposition
weighted PCA
robust principal component analysis
linear discriminant analysis
Laplacian linear discriminant analysis (LLDA)
Laplacian score
SVD-entropy
nonnegative matrix factorization (NMF)
sparse NMF (SNMF)
artificial neural network classification scheme


4 Summary and Outlook

Big Data Bioinformatics


A. Small sample problem

Very high dimensionality (genes): >20,000
Sample size: ~50 patients

overfitting and overoptimism


B. Imbalanced data

The amount of data differs across classes

up-sampling classes with fewer data, down-sampling classes with more data

making classification errors sensitive to classes (cost-sensitive learning)

signal-to-noise correlation coefficient (S2N)
Feature Assessment by Sliding Thresholds (FAST)
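The up-sampling idea mentioned above can be sketched in a few lines; this is a purely illustrative rebalancer (down-sampling would instead draw from each class only as many samples as the smallest class has).

```python
import random

def resample_balanced(X, y, seed=0):
    """Up-sample every class (with replacement) to the size of the
    largest class, so each class contributes equally to training."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        Xb.extend(rows + extra)
        yb.extend([label] * target)
    return Xb, yb
```

Cost-sensitive learning is the complementary strategy: leave the data as-is but weight errors on the minority class more heavily in the loss.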

empirical mutual information — the data sparseness issue

multivariate normal distributions


C. Class-dependent feature selection

Choose a different subset of features for each class

class-independent FS
class-dependent FS

class distributions
RBF neural classifier — the clustering property
GA
SVM
the multi-layer perceptron (MLP) neural network
the probability density function (PDF) projection theorem
principal component analysis (PCA) from class-specific subspaces

a C-class classification problem — C two-class classifiers
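Decomposing a C-class problem into C one-vs-rest two-class problems, each with its own feature subset, might look like this. The `rank` callable is a hypothetical per-class ranking function (e.g. the |t|-based ranking used for two-class problems).

```python
def class_dependent_subsets(X, y, rank, top_k):
    """Class-dependent FS sketch: for each class c, rank features on
    the one-vs-rest problem (c vs. all other classes) and keep a
    separate top_k subset per class, instead of one shared subset.

    `rank` is a hypothetical callable (X, binary_y, top_k) -> indices.
    """
    subsets = {}
    for c in sorted(set(y)):
        binary_y = [1 if label == c else 0 for label in y]
        subsets[c] = rank(X, binary_y, top_k)
    return subsets
```

Each of the C two-class classifiers is then trained only on its class's own subset, which is the contrast the notes draw between class-independent and class-dependent FS.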

feature importance measures:

  • RELIEF
  • class separability
  • minimal-redundancy-maximal-relevancy

full class relevant (FCR) and partial class relevant (PCR) features

Markov blanket

multiclass ranking statistics
class-specific statistics
Pareto-front — alleviates the bias
F-score and KW-score

a binary tree of simpler classification subproblems

feature subsets of every class
