[Translation] The 10 Statistical Techniques Data Scientists Need to Master

No matter where you stand on the question of whether "data scientist" is the sexiest job around, one fact is hard to ignore: our ability to analyze data, organize it, and interpret it in context matters more and more. Drawing on its huge base of job postings and employee feedback, Glassdoor (a US job-search community, translator's note) ranked data scientist first among the 25 best jobs in the United States. The role is here to stay, but there is no doubt that the specific tasks a data scientist performs will keep evolving. As technologies such as machine learning become ever more common, and emerging fields such as deep learning attract enormous attention from researchers, engineers, and the companies that hire them, data scientists will continue to ride the wave of innovation and technological progress.

Although strong programming skills are important, data science is not entirely about software engineering (in fact, being familiar with Python is enough to get started). Data scientists live at the intersection of programming, analysis, and critical thinking. As Josh Wills put it, "a data scientist is a person who is better at statistics than any programmer and better at programming than any statistician." As far as I can tell, many software engineers who want to become data scientists blindly reach for machine learning frameworks such as TensorFlow or Apache Spark to process data without fully understanding the statistical theory behind them. They do the same with statistical learning, the theoretical framework for machine learning that is grounded in statistics and functional analysis.

Why study statistical learning theory? Understanding the ideas behind the various techniques matters, because it tells you how and when to use them. You have to understand the simpler methods first in order to grasp the more sophisticated ones. Accurately assessing a method's performance is equally important, because it tells you whether it is actually working. Moreover, this is an exciting research area with important applications in science, industry, and finance. Ultimately, statistical learning is a fundamental ingredient in the training of a modern data scientist. Examples of statistical learning problems include:

  • Identify the risk factors for prostate cancer.
  • Classify a recorded phoneme based on a log-periodogram.
  • Predict whether someone will have a heart attack on the basis of demographic, dietary, and clinical measurements.
  • Customize an email spam detection system.
  • Identify handwritten zip codes.
  • Classify a tissue sample into one of several cancer classes.
  • Establish the relationship between salary and demographic variables in population survey data.

In my final semester of college, I taught myself data mining. The course material covered three books: Intro to Statistical Learning (Hastie, Tibshirani, Witten, James), Doing Bayesian Data Analysis (Kruschke), and Time Series Analysis and Applications (Shumway, Stoffer). I did many exercises on Bayesian analysis, Markov chains, hierarchical models, and supervised and unsupervised learning. That experience deepened my interest in the academic side of data mining and convinced me to explore it further. Recently I worked through the Statistical Learning online course on Stanford Lagunita, which covers all of the material in Intro to Statistical Learning. Having been exposed to this material twice, I want to share the 10 statistical techniques from that book that I believe any data scientist should learn in order to work with large data sets more effectively.

Before introducing the ten techniques, I want to distinguish between statistical learning and machine learning. I previously wrote about the most popular machine learning methods, so I am fairly confident in my ability to tell the two apart:

  • Machine learning is a branch of artificial intelligence.
  • Statistical learning is a branch of statistics.
  • Machine learning places greater emphasis on large-scale applications and predictive accuracy.
  • Statistical learning emphasizes models and their interpretability, along with precision and uncertainty.
  • But the boundary between the two is increasingly blurred, and there is a great deal of "cross-fertilization."
  • Machine learning has the better marketing!

1 - Linear Regression:

In statistics, linear regression is a method for predicting a target variable by fitting the best linear relationship between the dependent variable and the independent variables. The best fit is the one for which the sum of the distances between the fitted curve and the actual observations at each point is as small as possible. The fit is "best" in the sense that no other choice of shape produces less error. The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression uses a single independent variable to predict the dependent variable by fitting the best linear relationship. Multiple linear regression uses more than one independent variable to predict the dependent variable by fitting the best linear function.

Pick any two related things in your daily life. For example, suppose I have data on my monthly income, spending, and number of trips for the past three years. Now I want to answer the following questions (a minimal code sketch follows the list):

  • How much will I spend each month next year?
  • Which factor (monthly income or monthly number of trips) is more important in determining my monthly spending?
  • How are monthly income and the monthly number of trips related to monthly spending?
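As a rough illustration of how such questions could be explored, here is a minimal multiple-linear-regression sketch in Python. The data, variable names, and numbers are all invented for illustration, and scikit-learn is assumed to be available; it is not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 36 months of income, trips, and spending (all numbers invented)
rng = np.random.default_rng(0)
income = rng.normal(5000, 500, size=36)     # monthly income
trips = rng.integers(0, 5, size=36)         # monthly number of trips
spend = 0.6 * income + 300 * trips + rng.normal(0, 200, size=36)

# Multiple linear regression: fit the best linear function of both predictors
X = np.column_stack([income, trips])
model = LinearRegression().fit(X, spend)

print("coefficients (income, trips):", model.coef_)
print("intercept:", model.intercept_)
print("predicted spending for income=5200, trips=2:", model.predict([[5200, 2]])[0])
```

Comparing the raw coefficients only makes sense after standardizing the predictors; the sketch simply shows the mechanics of fitting and then predicting next month's spending.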

2 - Classification:

Classification is a data mining technique that assigns categories to a collection of data in order to support more accurate prediction and analysis. Also sometimes called a decision tree, classification is one of several methods intended to make the analysis of very large data sets effective. Two classification techniques stand out: logistic regression and discriminant analysis.

Logistic regression is the appropriate regression analysis when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Questions that logistic regression can examine include (a code sketch follows the list):

  • How does the probability of getting lung cancer (yes vs. no) change for every additional pound of excess weight and for every additional pack of cigarettes smoked per day?
  • Do body weight, calorie intake, fat intake, and age influence heart attacks (yes vs. no)?
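The following is a minimal logistic regression sketch on simulated data, loosely modeled on the lung-cancer question above. The variables and effect sizes are invented, and scikit-learn is an assumption of mine, not something the article specifies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data loosely based on the lung-cancer question (all numbers invented)
rng = np.random.default_rng(1)
packs_per_day = rng.uniform(0, 3, size=200)
pounds_overweight = rng.normal(10, 5, size=200)

# Generate a binary outcome whose log-odds rise with smoking and excess weight
logits = -2.0 + 1.5 * packs_per_day + 0.05 * pounds_overweight
cancer = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X = np.column_stack([packs_per_day, pounds_overweight])
clf = LogisticRegression().fit(X, cancer)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
print("odds ratios (packs/day, pounds overweight):", np.exp(clf.coef_).ravel())
```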

In discriminant analysis, two or more groups or populations are known a priori, and one or more new observations are classified into one of the known populations based on the measured characteristics. Discriminant analysis models the distribution of the predictors X separately in each response class, and then uses Bayes' theorem to flip these into estimates of the probability of each response category given the value of X. These models can be either linear or quadratic (a short comparison sketch follows the two bullets below).

  • Linear discriminant analysis computes a "discriminant score" for each observation in order to classify which response-variable class it belongs to. These scores are obtained by finding linear combinations of the independent variables. It assumes that the observations within each class are drawn from a multivariate Gaussian distribution, and that the covariance of the predictors is common across all k levels of the response variable Y.
  • Quadratic discriminant analysis provides an alternative approach. Like LDA, QDA assumes that the observations from each class of Y are drawn from a Gaussian distribution. Unlike LDA, however, QDA assumes that each class has its own covariance matrix. In other words, the predictors are not assumed to have a common covariance across each of the k levels of Y.
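As a quick comparison of the two, here is a hedged sketch that fits LDA and QDA on scikit-learn's bundled iris data. The dataset and library are my own choices for illustration, not part of the original article.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# LDA assumes one covariance matrix shared by all classes;
# QDA relaxes that assumption and fits a separate covariance matrix per class
lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()

print("LDA 5-fold accuracy:", cross_val_score(lda, X, y, cv=5).mean())
print("QDA 5-fold accuracy:", cross_val_score(qda, X, y, cv=5).mean())
```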

3 - Resampling Methods:

Resampling means drawing repeated samples from the original data sample. It is a non-parametric method of statistical inference. In other words, resampling does not involve looking up values in generic probability-distribution tables to compute approximate p values.

Resampling generates a unique sampling distribution based on the actual data. It uses experimental rather than analytical methods to generate that sampling distribution, and because it is based on unbiased samples of all the possible results the researcher studied, it yields unbiased estimates. To understand the concept of resampling, you should understand the bootstrap (also translated as Bootstrapping, translator's note) and cross-validation (a code sketch follows the list):

  • Bootstrapping is used in a variety of situations, such as validating the performance of a predictive model, ensemble methods, and estimating the bias and variance of a model. It works by sampling with replacement from the original data and using the "not chosen" data points as test cases. We can repeat this many times and use the average score to evaluate our model's performance.
  • Cross-validation, on the other hand, validates model performance by splitting the training data into k parts. We use k-1 parts as the training set and the "held-out" part as the test set. This step is repeated k times in different ways, and the average of the k scores is used as the performance estimate.
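A minimal sketch of both ideas, using synthetic data and scikit-learn (both are assumptions of mine, not part of the original text):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Cross-validation: split into k folds, train on k-1, test on the held-out fold
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV mean R^2:", cv_scores.mean())

# Bootstrap: resample rows with replacement, refit, and look at the spread
# of the estimates to gauge their variability
rng = np.random.default_rng(0)
boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))      # sample with replacement
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
print("bootstrap std. dev. of coefficients:", np.std(boot_coefs, axis=0))
```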

Typically, for a linear model, ordinary least squares is the main criterion used to fit the data. The next three methods are alternatives that can provide better prediction accuracy and better model interpretability when fitting linear models.

4 - Subset Selection:

This approach identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of features.

  • Best subset selection: here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The algorithm has two stages: (1) fit all models that contain k predictors, where k is the maximum length of the model; (2) select a single model using cross-validated prediction error. It is important to use validation or test error, not training error, to assess model fit, because RSS and R² increase monotonically as variables are added. The best approach is to cross-validate and choose the model with the highest R² and lowest RSS on the test-error estimates.
  • Forward stepwise selection considers a much smaller subset of the p predictors. It begins with a model containing no predictors and then adds predictors to the model one at a time until all the predictors are included. The order in which predictors are added is determined by which variable gives the greatest improvement to the fit, and variables are added only until no additional predictor improves the model's cross-validated prediction error.
  • Backward stepwise selection starts with all p predictors in the model and then iteratively removes the least useful predictor, one at a time.
  • Hybrid methods follow the forward stepwise approach; however, after adding each new variable, the method may also remove variables that no longer contribute to the model fit (see the sketch after this list).
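Here is a small forward-stepwise sketch using scikit-learn's SequentialFeatureSelector on its bundled diabetes data. The fixed subset size used as a stopping rule is a simplification of the cross-validation criterion described above, and the dataset is my own choice for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)   # 10 candidate predictors

# Forward stepwise selection: start with no predictors and greedily add the
# predictor that most improves the cross-validated fit, up to 4 features here
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=4,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```

Setting direction="backward" instead gives backward stepwise selection starting from the full model.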

5 - Shrinkage:

This approach fits a model involving all p predictors; however, the estimated coefficients shrink toward zero relative to the least squares estimates. This shrinkage, also known as regularization, has the effect of reducing variance and preventing overfitting. Depending on the type of shrinkage performed, some coefficients may be estimated to be exactly zero, so this method can also perform variable selection. The two best-known techniques for shrinking coefficients toward zero are ridge regression and the lasso.

  • Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS, but it also imposes a shrinkage penalty as the coefficients move away from zero. The penalty has the effect of shrinking the coefficient estimates toward zero. Without going into the math, it is useful to know that ridge regression shrinks the features with the smallest column-space variance. As in principal component analysis, ridge regression projects the data into a d-dimensional space and then shrinks the coefficients of the low-variance components more than those of the high-variance components, which correspond to the smallest and largest principal components respectively.
  • Ridge regression has at least one drawback: it includes all p predictors in the final model. The penalty term pushes many of the coefficients close to zero, but never exactly to zero. This is usually not a problem for prediction accuracy, but it can make the model harder to interpret. The lasso overcomes this drawback and is able to force some coefficients to be exactly zero, provided the tuning parameter s is small enough. Since s = 1 yields ordinary OLS regression, the coefficients shrink toward zero as s approaches 0. Lasso regression is therefore also a good way to perform variable selection (a ridge-versus-lasso sketch follows this list).
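A brief ridge-versus-lasso sketch on scikit-learn's bundled diabetes data; RidgeCV and LassoCV pick the penalty strength by cross-validation, and the alpha grid below is my own arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV, LassoCV

X, y = load_diabetes(return_X_y=True)

# Ridge shrinks every coefficient toward zero but keeps all p predictors
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)

# The lasso can set some coefficients exactly to zero, i.e. select variables
lasso = LassoCV(cv=5).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 1))
print("lasso coefficients:", np.round(lasso.coef_, 1))
print("predictors dropped by the lasso:", int(np.sum(lasso.coef_ == 0)))
```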

6 - Dimensionality Reduction:

Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of estimating M + 1 coefficients, where M < p. This is done by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model by least squares. Two approaches for this task are principal component regression and partial least squares.

  • Principal component regression (PCR) can be thought of as a way of deriving a low-dimensional set of features from a large set of variables. The first principal component of the data is the direction along which the observations vary the most. In other words, the first principal component is the line that comes closest to the data, and p distinct principal components can be fit. The second principal component is a linear combination of the variables that is uncorrelated with the first and has the largest variance subject to that constraint. The idea is that the principal components capture the most variance in the data using linear combinations along mutually orthogonal directions. This way we can also combine the effects of correlated variables to extract more information from the data, whereas ordinary least squares would force us to discard one of the correlated variables.
  • The PCR method described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way, since the response Y is not used to determine the principal component directions. That is, Y does not supervise the identification of the principal components, so there is no guarantee that the directions that best explain the predictors are also the best directions for predicting the response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension-reduction method: it first identifies a new, smaller set of features that are linear combinations of the original features, and then fits a linear model on the new M features via least squares. Unlike PCR, however, PLS uses the response variable to identify the new features (a PCR-versus-PLS sketch follows this list).
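A small PCR-versus-PLS sketch, assuming scikit-learn; PCR is expressed here as a PCA-plus-least-squares pipeline, and the choice of M = 5 components is arbitrary.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

# PCR: unsupervised PCA picks the M directions, then least squares fits on them
pcr = make_pipeline(PCA(n_components=5), LinearRegression())

# PLS: the directions are chosen using the response y as well (supervised)
pls = PLSRegression(n_components=5)

print("PCR 5-fold CV R^2:", cross_val_score(pcr, X, y, cv=5).mean())
print("PLS 5-fold CV R^2:", cross_val_score(pls, X, y, cv=5).mean())
```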

7 - Nonlinear Regression:

In statistics, nonlinear regression is a form of regression analysis in which the observed data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fit by a method of successive approximations. Below are a few important techniques for dealing with nonlinear models:

  • A step function is a function on the real numbers that can be written as a finite linear combination of indicator functions of intervals. Informally speaking, a step function is a piecewise constant function with only finitely many pieces.
  • A piecewise function is a function defined by multiple sub-functions, each sub-function applying to a certain interval of the main function's domain. Piecewise definition is actually a way of expressing the function rather than a property of the function itself, but with additional qualification it can describe the nature of the function. For example, a piecewise polynomial function is a function that is a polynomial on each of its sub-domains, but possibly a different polynomial on each.

  • A spline is a special function defined piecewise by polynomials. In computer graphics, a spline is a piecewise polynomial parametric curve. Splines are popular because of the simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes through curve fitting and interactive curve design.
  • A generalized additive model is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about those smooth functions (a spline-regression sketch follows this list).
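To make the spline idea concrete, here is a minimal regression-spline sketch on synthetic data. It assumes scikit-learn 1.0 or later for SplineTransformer, and it illustrates the basis-expansion approach described above rather than a full GAM.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Synthetic nonlinear relationship: y is a noisy sine of x
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, size=200)

# Expand x into piecewise cubic basis functions, then fit ordinary least squares
spline_model = make_pipeline(SplineTransformer(n_knots=8, degree=3),
                             LinearRegression())
spline_model.fit(x, y)
print("in-sample R^2 of the spline fit:", spline_model.score(x, y))
```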

8 - Tree-Based Methods:

Tree-based methods can be used for both regression and classification problems. They involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized as a tree, these approaches are known as decision-tree methods. The methods below grow multiple trees that are then combined to yield a single consensus prediction.

  • Bagging reduces the variance of a prediction by generating additional training data from the original dataset, using combinations with repetition to produce multiple datasets of the same size as the original. Increasing the training set this way cannot improve the model's predictive power; it only reduces the variance, narrowly tuning the prediction toward the expected outcome.
  • Boosting is an approach that computes the output from several different models and then averages the results using a weighted-average approach. By combining the strengths and weaknesses of these models and varying the weighting formula, you can obtain good predictive power over a wider range of input data, using different, more finely tuned models.

  • The random forest algorithm is actually very similar to bagging: it also draws random bootstrap samples of the training set. However, in addition to the bootstrap samples, it draws a random subset of features for training each individual tree, whereas in bagging each tree is given the full feature set. Because the feature selection is random, the trees are more independent of each other than in standard bagging, which usually results in better predictive performance (thanks to a better variance-bias trade-off) and faster training, since each tree learns from only a subset of the features (a side-by-side sketch follows this list).
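The sketch below puts the three ensembles side by side on scikit-learn's bundled breast-cancer data; the model settings are defaults or arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging (trees on bootstrap samples)": BaggingClassifier(n_estimators=100, random_state=0),
    "boosting (sequentially weighted trees)": GradientBoostingClassifier(random_state=0),
    "random forest (bagging + random feature subsets)": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```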

9 - SVM:

The support vector machine (SVM) is a commonly used supervised-learning classification technique. In layman's terms, it finds the hyperplane that best separates two classes of points with the maximum margin (a line in 2D space, a plane in 3D space, and a hyperplane in higher dimensions; more formally, a hyperplane is an (n-1)-dimensional subspace of an n-dimensional space). Essentially, it is a constrained optimization problem in which the margin is maximized subject to the constraint that the data are classified perfectly (a hard-margin classifier).

"Support" hyperplane data points are called "Support Vector." In the figure above, filling the blue circle and two filled square is the support vector. In the case of two types of data are not linearly separable, the data points will be projected onto a higher dimensional space, so that the data becomes linearly separable. Problems of data points comprising a plurality of categories can be decomposed into a plurality of "one" (one-versus-one), or "remaining one pair" (one-versus-rest) of the binary classification.

10 - Unsupervised Learning:

So far we have only discussed supervised learning techniques, in which the classes are known and the experience provided to the algorithm is the relationship between the entities and the classes they belong to. A different set of techniques is needed when the classes are unknown. They are called unsupervised because it is left to the algorithm to discover patterns in the data. Clustering is an example of unsupervised learning, in which the data are divided into clusters based on their similarity. Below are some of the most commonly used unsupervised learning algorithms (a short sketch follows the list):

  • Principal component analysis produces a low-dimensional representation of a dataset by identifying a set of linear combinations of features that have maximal variance and are mutually uncorrelated. This linear dimensionality-reduction technique can help uncover latent interactions between variables in an unsupervised setting.
  • k-means clustering: partitions the data into k distinct clusters based on each point's distance to the cluster centroids.
  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree.
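A compact sketch of all three on scikit-learn's bundled iris measurements, with the labels deliberately ignored to keep the setting unsupervised; the choices of two components and three clusters are assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)    # labels ignored: unsupervised setting

# PCA: low-dimensional representation along the directions of maximal variance
X_2d = PCA(n_components=2).fit_transform(X)

# k-means: partition the points into k clusters around learned centroids
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Hierarchical (agglomerative) clustering: merge points bottom-up into a cluster tree
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_2d)

print("k-means cluster sizes:", [int((km_labels == k).sum()) for k in range(3)])
print("hierarchical cluster sizes:", [int((hc_labels == k).sum()) for k in range(3)])
```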

This was a basic run-through of some fundamental statistical techniques that can help data science program managers and/or executives better understand what is happening inside their data science teams. In truth, some data science teams run algorithms purely through Python and R libraries, and most of them never even think about the underlying mathematics. But being able to understand the basics of statistical analysis gives your team a better approach: insight into the smallest parts makes manipulation and abstraction easier. I hope this basic statistics guide for data science gives you a solid understanding!

You can get all the lecture slides and the RStudio course material from my GitHub repository (github.com/khanhnamle1...).



