Data Science in Practice (1): Statistical Inference, Exploratory Data Analysis, and the Data Science Workflow

1 Introduction

To become a data scientist, you must first have the following foundations: statistics, linear algebra, and some programming skills.

You also need to develop the following skills: data preprocessing, data munging, data modeling, coding, visualization, and effective communication. These skills often go hand in hand.

1.1 Statistical Inference

Going from the real world to data, and then from the data back to the real world, is the domain of statistical inference.

More precisely, statistical inference is the discipline concerned with how to extract information from data generated by random processes; it comprises the procedures, methods, and theories common to that task.

1.2 Populations and Samples

In statistical inference, the population does not refer specifically to people; it refers to any particular collection of objects or units under study.

If we can extract and measure certain characteristics of these objects, we obtain a set of observations on the population; by convention, N denotes the total number of observations in the population.

A so-called sample is a subset selected from the population; its size is denoted by n.
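A minimal sketch of the idea in Python (the population values below are simulated for illustration, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: N measurements of some characteristic.
N = 100_000
population = rng.normal(loc=170, scale=10, size=N)

# A simple random sample of size n, drawn without replacement.
n = 500
sample = rng.choice(population, size=n, replace=False)

print(f"population mean: {population.mean():.2f}")
print(f"sample mean:     {sample.mean():.2f}")
```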

1.3 Populations and Samples of Big Data

1.4 Big Data Means Big Assumptions

Ignoring causality is a flaw of big data, not a feature. A model that ignores causation will not help solve existing problems; it can only raise more questions.

Data will not speak for itself; it can only describe, in quantified form, the unrepeatable social events happening around us.

1.5 Modeling

A model is a human-designed abstraction used to exclude irrelevant detail. When analyzing a model, researchers must keep in mind which details have been omitted.
Statistical modeling:

The modeling process starts with questions: What happens first? What influences what? What is the cause and what is the effect? How would you test the result?

One approach is to describe the relationship in mathematical language. A general formula will necessarily include parameters, but the values of those parameters are unknown.
Another approach is to draw. Some people first sketch a data-flow diagram, possibly with arrows, to describe how things influence one another or what happens over time. Before choosing a formula to express the relationship, such a diagram gives them a rough description of it.

Building the model:

Choosing a model is part of the modeling process; you have to make many assumptions about the underlying structure, so there ought to be a standard governing how a model is chosen and reasons that justify the choice. But there is no uniform standard; we can only proceed by trial and error,
hoping that after careful consideration such a standard will eventually be developed.

Exploratory data analysis (EDA) is a good place to start.

Work from easy to hard: try the most "stupid" thing first and look at the result; in hindsight it may not be as stupid as you think.

Remember, starting simple is always a good idea; there is a trade-off between simplicity and modeling accuracy. A simple model is easy to understand, and very often that simple initial model gets 90% of the task done and takes only a few hours to build, whereas a complex model might take months and only push that figure to 92%.

Building a model draws on many building blocks, one of which is the probability distribution.

Probability distributions:

Probability distributions are the foundation of statistical models.

A probability distribution can be understood as an assignment of probability to subsets of the possible outcomes, and it is represented by the corresponding probability distribution function.
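As an illustration (not tied to any dataset in this article), scipy can evaluate a distribution's probability over a subset of outcomes, its density at a point, and draw random outcomes from it:

```python
from scipy import stats

# A normal distribution with mean 0 and standard deviation 1.
dist = stats.norm(loc=0, scale=1)

# Probability assigned to the subset of outcomes between -1 and 1.
p = dist.cdf(1) - dist.cdf(-1)
print(f"P(-1 < X < 1) = {p:.3f}")  # roughly 0.683

# Density at a point, and a few random draws from the distribution.
print(dist.pdf(0))
print(dist.rvs(size=5, random_state=0))
```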

Fitting the model:

Fitting a model is the process of estimating the model's parameters from the observations.

Fitting often brings in various optimization methods and algorithms, such as maximum likelihood estimation, to determine the parameter values.

Fitting the model is where you start writing code: the code reads the data, the formula you wrote on paper is translated into code, and then R's or Python's built-in optimization routines use the data to compute the parameter values as precisely as possible.

As you become more sophisticated, or if optimization happens to be your strength, you may go on to study these optimization methods in depth. First of all you need to know that they exist,
and then understand roughly how they work; but you do not have to write the code that implements them yourself, because R and Python already provide good implementations that you can call directly.
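For example, here is a minimal sketch (on made-up data) of translating a formula on paper, y = a·x + b, into code and letting scipy's built-in optimizer estimate the parameters:

```python
import numpy as np
from scipy.optimize import curve_fit

# The formula written on paper: y = a * x + b, with unknown parameters a and b.
def model(x, a, b):
    return a * x + b

# Made-up observations, used only for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# curve_fit estimates the parameters from the data (least squares).
params, _ = curve_fit(model, x, y)
a_hat, b_hat = params
print(f"estimated a = {a_hat:.2f}, estimated b = {b_hat:.2f}")
```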

Overfitting:

Overfitting occurs when the parameters estimated from the data fit that particular sample so closely that the resulting model fails to capture reality and performs poorly on data outside the sample.
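A small sketch of the idea on made-up data: a high-degree polynomial fits the training sample almost perfectly, but does far worse on data outside the sample than a simple line does.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underlying reality is roughly linear, plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 3 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = 3 * x_test + rng.normal(scale=0.3, size=x_test.size)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```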

2 Exploratory Data Analysis

Exploratory data analysis is an important part of data science; it represents the methods and ideas that a group of statisticians at Bell Labs brought to working with data in scientific work.

The basic tools of exploratory data analysis are plots, graphs, and summary statistics. Generally speaking, EDA is a method of systematically working through the data: plotting the distribution of every variable (using boxplots), plotting time series, transforming variables, using scatterplot matrices to show the pairwise relationships between variables, and computing all the summary statistics. In other words, calculate the mean, minimum, maximum, and upper and lower quartiles, and identify the outliers.
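A minimal pandas sketch of that routine, assuming a hypothetical file data.csv with numeric columns (the file name is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; the file name is a placeholder.
df = pd.read_csv("data.csv")
numeric = df.select_dtypes("number")

# Summary statistics: mean, min, max, quartiles for every numeric column.
print(numeric.describe())

# Distribution of each variable, one boxplot per column.
numeric.plot(kind="box", subplots=True)

# Pairwise relationships between variables.
pd.plotting.scatter_matrix(numeric)
plt.show()
```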

Exploratory data analysis is more than a set of tools, though; it is also a way of thinking about how you and the data relate. You want to understand the data, understand its shape, develop an intuitive feel for it, and connect your understanding of the data with your understanding of the process that generated it. EDA is a bridge between you and the data; it is not about proving anything to anyone.

 

2.1 The Philosophy of Exploratory Data Analysis

There are many important reasons to use exploratory data analysis, including gaining intuition about the data, comparing the distributions of variables, sanity-checking the data (making sure it is on the expected scale, in the format you want, and so on), finding missing values and outliers, and summarizing the data.

For data generated by logging, the EDA process can also be used to debug the logging itself.

Finally, exploratory data analysis helps ensure that the product is performing as expected.

Exploratory data analysis comes at the beginning of an analysis, whereas data visualization is the last step, used to present the conclusions. In EDA, the graphics are there only to help you understand the data.

In exploratory data analysis, what you learn about the data can also shape how you design an algorithm. For example, suppose you are developing a ranking algorithm that orders the content recommended to users; to do this, you may need to define what "popular" means.
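As an illustration only, with hypothetical log columns (item_id, clicks, impressions) and a made-up file name, "popular" might be defined as a click-through rate, and EDA on the data would tell you whether that definition is reasonable:

```python
import pandas as pd

# Hypothetical interaction log; the file name and columns are made up for illustration.
logs = pd.read_csv("interactions.csv")  # columns: item_id, clicks, impressions

popularity = (
    logs.groupby("item_id")[["clicks", "impressions"]].sum()
        .assign(ctr=lambda d: d["clicks"] / d["impressions"])
        .sort_values("ctr", ascending=False)
)
print(popularity.head(10))
```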

2.2 Exercise: Exploratory Data Analysis

3 The Data Science Workflow

Specifically, we start from raw data such as logs, Olympic records, Enron employees' emails, or recorded genetic material. (Note that by the time we receive this raw data, some aspects of the underlying events have already been lost.) We need to process the raw data so that it is easy to analyze, so we build a pipeline for data munging: joining, piecing together, cleaning up, whatever you prefer to call it; the point is to reprocess the data. We can use Python, shell scripts, R, or SQL to do this.

Eventually we obtain nicely formatted data, for example records with fields like the following:
Name | Event | Year | Sex | Time

Once we have this clean data, we should do some exploratory data analysis. In the process we may find that the data is not so clean after all: it may contain duplicates, missing values, or absurd outliers, and some data may be unrecorded or recorded incorrectly. When we find such problems, we have to go back and collect more data, or spend more time cleaning it up.
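A small sketch of these checks in pandas, assuming the cleaned table above is stored in a file (the file name is a placeholder) with columns Name, Event, Year, Sex, and Time:

```python
import pandas as pd

# Hypothetical cleaned file with the columns described above; Time is assumed numeric (seconds).
records = pd.read_csv("olympic_times.csv")  # Name, Event, Year, Sex, Time

# Duplicate rows.
print("duplicates:", records.duplicated().sum())

# Missing values per column.
print(records.isna().sum())

# Ridiculous outliers, e.g. non-positive times or implausible years.
print(records[(records["Time"] <= 0) | (records["Year"] < 1896)])
```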

Next, we design models using algorithms such as k-nearest neighbors, linear regression, or naive Bayes. Which model to choose depends on the problem to be solved, which may be a classification problem, a prediction problem, or simply a basic descriptive problem.
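A minimal scikit-learn sketch of this step, using a naive Bayes classifier on a built-in toy dataset rather than the data discussed above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy classification data, just to show the fit/predict pattern.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```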

Then we can interpret, visualize, and report the results, or communicate them: to a boss or colleagues, as an article in an academic journal, or by presenting the findings at an academic conference.

Whenever we do such an analysis, this feedback loop should be taken into account in order to adjust for the bias the model introduces: a model does not just predict the future, it also influences it.

The data scientist's role in the data science workflow

 

How the data science workflow relates to the scientific method

The general steps are:

• ask a question;
• do some background research;
• construct a hypothesis;
• test the hypothesis with an experiment;
• analyze the data and draw conclusions;
• communicate your results to others.

In both the data science workflow and the scientific method, not every research question has to be solved by following every step rigidly; most problems can be solved with some combination of the steps. For example, if
your goal is data visualization (which can itself be regarded as a data product), you will probably not use any machine learning or statistical model at all: you just need to get clean data, do some exploratory
data analysis, and present the results as a chart.

4 Thought Experiment: How to Simulate Chaos

Most problems start as a pile of messy, chaotic data, or as a problem that has not yet been clearly defined, or as something urgent that needs solving.

As data scientists, we are to some extent responsible for restoring order from that chaos.

5 Case Study: RealDirect


Source: www.cnblogs.com/qiu-hua/p/12663583.html