Python and statistics (to be continued)

foreword

This article is based on my own experience and on material from the Google and IBM certificate programs. Thank you for reading and discussing; please correct me if there are any deficiencies.

basic knowledge

A/B testing

Companies use A/B testing to evaluate everything from website design to mobile apps, online ads, and marketing emails. A/B testing is a way to compare two versions of something to find out which one performs better. It has become popular because it works well in many online applications. For example, businesses often use A/B testing to create two variations of a web page and find out which one gets more clicks, purchases, or subscriptions. Even small changes to a web page, such as changing the color, size, or position of a button, can increase financial gains. A/B testing helps business leaders optimize product performance and improve the customer experience. Companies also use A/B testing in marketing emails: you can send two versions of a mailing to your customer list to find out which version generates more sales, or test two versions of an online ad to find out which version visitors click more often. Once you've run an A/B test, you can use the data to make permanent changes to your ads.

Let's walk through an example of A/B testing to see how it works. Say you run an online store and 10% of your site's visitors make a purchase. You want to run an A/B test to find out whether changing the size of the shopping cart button will increase the conversion rate, that is, the percentage of visitors who make a purchase. The test presents two versions of a web page, version A and version B, to a group of users randomly selected from all users who visit the website. Version A is the original web page; version B is the same page with a larger cart button. The test directs half of the users to version A and the other half to version B, and runs for two weeks. After the test, a statistical analysis of the results shows that the larger button in version B led to an increase in purchases.

Another key statistical concept in A/B testing is hypothesis testing. In A/B testing, we usually state two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the hypothesis that there is no change or effect, and the alternative hypothesis is the hypothesis that there is a change or effect. In the example above, the null hypothesis is that the larger cart button has no effect on the purchase rate, and the alternative hypothesis is that the larger cart button increases the purchase rate. Statisticians calculate probabilities and statistics to determine whether a hypothesis holds. If the observed outcome is very unlikely under the null hypothesis, we reject the null hypothesis and accept the alternative hypothesis; this means the larger cart button does increase the purchase rate. If the outcome could plausibly occur under the null hypothesis, we cannot reject it, which means the larger cart button may not actually affect the purchase rate.
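As a concrete illustration, here is a minimal sketch of how such a test could be evaluated with a two-proportion z-test from statsmodels. The visitor and purchase counts below are made-up numbers, not data from the example above.

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: purchases and visitors for each version.
conversions = [500, 570]   # [version A, version B]
visitors = [5000, 5000]

# Null hypothesis: both versions have the same conversion rate.
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")

A small p-value (for example, below 0.05) together with a higher observed rate for version B would be evidence against the null hypothesis of no effect.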

Descriptive and Inferential Statistics

Descriptive statistics are used to describe or summarize the main characteristics of a data set, which is very useful for quickly understanding large amounts of data. Data professionals typically implement descriptive statistics using two approaches, visualization and summary statistics. In terms of visualization, we can use charts such as histograms, scatterplots, and boxplots to help explore, visualize, and share data. In terms of summary statistics, we can use measures of central tendency and dispersion to describe data sets. For example, the mean and standard deviation are commonly used measures.
Inferential statistics are used to draw conclusions and make predictions based on data. They allow data professionals to make inferences about a population by analyzing sample data. Population data refers to all possible elements that we are interested in measuring, such as people, objects, or events. Sample data is a subset drawn from the population. Data professionals use samples to make inferences and predictions about the population.
Between sample and population, there are two related terms, parameter and statistic. A parameter is a characteristic of a population, and a statistic is a characteristic of a sample. For example, the average height of the entire giraffe population is a parameter, while the average height of 100 randomly selected giraffes is a statistic. When we cannot study the entire population, sample statistics can be used to estimate unknown values of population parameters.
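A small simulated illustration of parameter versus statistic, following the giraffe example (the heights below are made-up numbers):

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 10,000 giraffe heights (in feet).
population_heights = rng.normal(loc=16, scale=1.5, size=10_000)
parameter = population_heights.mean()              # population mean (parameter)

# A random sample of 100 giraffes drawn from that population.
sample = rng.choice(population_heights, size=100, replace=False)
statistic = sample.mean()                          # sample mean (statistic)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic): {statistic:.2f}")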

Probability distributions

Introduction

In some of my work as an NLP algorithm engineer, I use probability distributions to model different kinds of datasets and to identify salient patterns in my data. A probability distribution describes the likelihood of the possible outcomes of a random event. Probability distributions can represent the outcomes of simple random events, such as flipping a coin or rolling dice. They can also represent more complex events, such as the probability that a new drug will successfully treat a medical condition.

A random variable represents the value of a possible outcome of a random event. There are two types of random variables: discrete and continuous. A discrete random variable has a countable number of possible values. Typically, discrete variables are integers that can be counted. For example, if you roll a die five times, you can count the number of times the die lands on a 2; if you toss a coin five times, you can count the number of times it lands heads. A continuous random variable takes on all possible values within a certain range. With continuous variables, you are dealing with fractional values rather than integers, for example all decimal values between one and two, such as 1.1, 1.12, 1.125, and so on. These values are uncountable because there is no limit to the number of possible fractional values between one and two. Often these are quantities that can be measured, such as height, weight, time, or temperature. For example, if you measure the height of a person or object, you can keep making your measurement more precise: a person's height may be 70.2 inches, 70.23 inches, 70.237 inches, 70.2375 inches, and so on. There is no limit to the number of possible values.

It is not always immediately obvious whether a variable is discrete or continuous. To choose between the two, you can use the following general guideline: if you can count the number of outcomes you're dealing with, use a discrete random variable (for example, counting the number of times a coin lands heads); if you can measure the outcome, use a continuous random variable (for example, measuring the time it takes a person to run a marathon).

Now that we have explored random variables, let's return to probability distributions, which describe the probability of each possible value of a random variable. A discrete distribution represents a discrete random variable, and a continuous distribution represents a continuous random variable. Once you know the sample space of a random variable, you can assign probabilities to each possible value. Each particular distribution is described by a set of parameters that control its shape and location. Here, we only discuss the two most common distributions: the normal distribution and the uniform distribution.
The normal distribution, also known as the Gaussian distribution, is a common probability distribution for continuous random variables. Many natural phenomena, such as human height and intelligence scores, can be described using a normal distribution. A normal distribution can be described by two parameters: the mean and the standard deviation. The mean determines the center of the distribution, and the standard deviation controls the shape of the distribution. The shape of a normal distribution is a bell curve where most values are clustered around the mean and values farther away from the mean are less likely.
The uniform distribution is another common continuous probability distribution, in which all possible values have equal probability. For example, if you randomly pick a number between 0 and 1, each number is equally likely. A uniform distribution can be described by two parameters: the minimum and the maximum. The shape of a uniform distribution is a horizontal line, indicating that all values have equal probability.
In addition to these common distributions, there are many others, including the binomial, Poisson, exponential, and gamma distributions, among others. Each distribution has unique properties and uses, depending on the data you are trying to simulate or analyze.
In summary, a probability distribution is a way of describing the probabilities of possible outcomes of random events. For discrete random variables, you can use discrete probability distributions, such as the binomial and Poisson distributions. For continuous random variables, you need to use continuous probability distributions, such as normal and uniform. Understanding these distributions and their purpose will help you better understand and analyze your data.
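A quick sketch of sampling from these two continuous distributions with numpy and plotting histograms with matplotlib (the parameters here are chosen arbitrarily for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
normal_samples = rng.normal(loc=0, scale=1, size=10_000)    # mean 0, standard deviation 1
uniform_samples = rng.uniform(low=0, high=1, size=10_000)   # every value between 0 and 1 equally likely

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(normal_samples, bins=50)
axes[0].set_title("Normal (bell curve)")
axes[1].hist(uniform_samples, bins=50)
axes[1].set_title("Uniform (flat)")
plt.show()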

The binomial distribution

The binomial distribution is a discrete distribution used to model the probability of events with only two possible outcomes (success or failure). This definition assumes that each trial is independent, that is, it does not affect the probability of the other trials, and that each trial has the same probability of success. For example, the binomial distribution applies to 10 consecutive tosses of the same coin. Note that success and failure are just convenient labels. Each coin toss has only two possible outcomes, heads or tails, and depending on your analysis needs, you can choose to label either heads or tails as the successful outcome. Whatever labels you apply, it's important that the outcomes be mutually exclusive. As a quick recap, two outcomes are mutually exclusive if they cannot happen at the same time: in a single coin toss, you can't get heads and tails at the same time, only one of them.

Data professionals use the binomial distribution to model data in different domains, such as medicine, banking, investing, and machine learning. For example, they use it to model the probability that a new drug will have side effects, that a credit card transaction will be fraudulent, or that a stock price will rise or fall. In machine learning, the binomial distribution is often used for classification; for example, a data professional might train an algorithm to recognize whether a digital image contains a cat.

The binomial distribution represents a type of random event known as a binomial experiment. A binomial experiment is a type of random experiment. As you may recall, a random experiment is a process whose outcome cannot be determined in advance. All random experiments have three things in common: the experiment can have more than one possible outcome, every possible outcome can be listed in advance, and the outcome of the experiment depends on chance. A binomial experiment additionally has the following properties: the experiment consists of a number of repeated trials, each trial has only two possible outcomes, the probability of success is the same for each trial, and each trial is independent. An experiment in which a coin is tossed 10 times is a binomial experiment.
In probability theory, the mathematical expectation (or mean) is the weighted average of the values of a random variable, where the weights are the probabilities of each value occurring. For the binomial distribution, the expected value is n × p, where n is the number of trials and p is the probability of success on each trial. So if you toss a coin ten times, the expected value is 10 × 0.5 = 5, that is, five heads are expected.
In addition to the expected value, another important concept for the binomial distribution is the variance. Variance is a statistic used to measure the dispersion of the values of a random variable. For the binomial distribution, the variance is n × p × (1 − p). For example, if you toss a coin ten times and the probability of success on each trial is 0.5, then the variance is 10 × 0.5 × (1 − 0.5) = 2.5. The larger the variance, the more spread out the values of the random variable are.
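These coin-toss numbers can be checked directly with scipy.stats.binom (n = 10 tosses, p = 0.5); a small sketch:

from scipy import stats

n, p = 10, 0.5
print(stats.binom.mean(n, p))    # expected value: 10 * 0.5 = 5.0
print(stats.binom.var(n, p))     # variance: 10 * 0.5 * 0.5 = 2.5
print(stats.binom.pmf(5, n, p))  # probability of exactly 5 heads, about 0.246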
In summary, the binomial distribution is a discrete probability distribution used to model random events with only two possible outcomes, such as success or failure. It is suitable for scenarios where trials are repeated, where each trial is independent and each trial has the same probability of success. Its applications are vast, including fields such as medicine, banking, investing, and machine learning.

The Poisson distribution

The Poisson distribution is a probability distribution used to model the probability that a certain number of events occurs within a specific time period. The Poisson distribution can also be used to express the number of events that occur within a certain space (such as a distance, area, or volume), but in this article, we will focus on time. The Poisson distribution was originally derived in the 1830s by the French mathematician Siméon Denis Poisson, who developed it to describe the number of times a gambler wins a difficult game over a large number of attempts. Data professionals use the Poisson distribution to model data such as the expected number of calls per hour to a customer service call center, the number of visits to a website per hour, the daily traffic at a restaurant, and the number of severe storms in a city per month.

The Poisson distribution represents a type of random experiment known as a Poisson experiment. A Poisson experiment has the following properties: the number of events in the experiment can be counted, the average number of events that occur within a given time period is known, and each event is independent. Let's look at an example. Suppose you are a data professional working for a fast food chain, and you know that the store's drive-thru receives an average of two orders per minute. You want to determine the probability that the restaurant receives a certain number of orders within a given period of time. This is a Poisson experiment: the number of events can be counted (you can count the number of orders), the average number of events per time period is known (two per minute), and each event is independent (the probability that one person places an order does not affect the probability that another person places an order).

Once you know you are dealing with a Poisson distribution, you can apply the Poisson distribution formula to calculate probabilities. In short, the formula gives the probability of a certain number of events occurring within a certain time period. In this formula, the Greek letter λ represents the average number of events that occur during the time period, k represents the number of events, e is a constant approximately equal to 2.71828, and the exclamation point denotes a factorial. The formula is as follows:
P(X=k) = (λ^k / k!) * e^(-λ)
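Applying the formula to the drive-thru example above (λ = 2 orders per minute), here is a small sketch using scipy.stats.poisson:

from scipy import stats

lam = 2  # average number of orders per minute
for k in range(5):
    print(f"P(X = {k}) = {stats.poisson.pmf(k, lam):.4f}")

# For example, P(X = 0) is about 0.1353 and P(X = 2) is about 0.2707.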

The normal distribution

The normal distribution is a continuous probability distribution with a bell-shaped curve, a mean at the center of the curve, symmetrical sides, and an area under the curve of 1. We can use the normal distribution to describe and predict many different types of data sets, including height, weight, income, grades, and more. Understanding the normal distribution is important for further study of statistics and machine learning, as many methods are based on the normal distribution. The formula is as follows:
f(x) = (1 / (σ√(2π))) * e^(-(x - μ)² / (2σ²)), where μ is the mean and σ is the standard deviation.
The z-score is the number of standard deviations a data point lies from the population mean. A z-score gives you an idea of how far a data point is from the mean. For example, if a value is equal to the mean, its z-score is 0; if it is greater than the mean, its z-score is positive; if it is less than the mean, its z-score is negative. z-scores can help you standardize your data. In statistics, standardization is the process of putting different variables on the same scale; the formula for this process is presented below. z-scores are also called standard scores because they are based on the standard normal distribution, which is a normal distribution with mean 0 and standard deviation 1. z-scores typically range from -3 to +3. Standardization is useful because it lets you compare scores from different data sets with different units, means, and standard deviations. Data professionals use z-scores to better understand the relationships between data values within a single dataset and between different datasets. For example, they often use z-scores for anomaly detection, that is, finding outliers in a dataset. Applications of anomaly detection include finding fraud in financial transactions, finding defects in manufactured products, finding intrusions in computer networks, and more. As another example, different customer satisfaction surveys may have different scoring scales. One survey might rate a product or service from 1 to 20, another from 500 to 1,500, and a third from 130 to 180. Suppose the same product scored 9 on the first survey, 850 on the second, and 142 on the third. These numbers don't mean much by themselves, but if you know they all correspond to a z-score of 1, i.e. one standard deviation above the mean, you can meaningfully compare ratings across the different surveys.
You can calculate a z-score using the following formula: Z equals X minus μ, divided by σ. In this formula, X refers to a single data value or raw score, the Greek letter μ denotes the population mean, and the Greek letter σ denotes the population standard deviation. In other words, Z equals the raw score minus the mean, divided by the standard deviation. For example, say you took a standardized test and your score was 133. The mean score on this test is 100 and the standard deviation is 15. Assuming a normal distribution, you can use the formula to calculate your z-score: (133 − 100) / 15 = 33 / 15 = 2.2. A z-score of 2.2 tells you that your test score was 2.2 standard deviations above the mean, which means you scored very well. Recall that the rule of thumb says 95% of values fall within two standard deviations of the mean, and your score is more than two standard deviations above it. z-scores are useful because they show how an individual value compares to the rest of the distribution. Let's consider a different exam with a different grading scale. Say you got an 85 and you want to know whether that's a good grade relative to the rest of the class. Whether it is depends on the mean and standard deviation of all the test scores. Assuming that the scores follow a normal distribution with a mean of 90 and a standard deviation of 4, the z-score for a raw score of 85 is (85 − 90) / 4 = −5 / 4 = −1.25. A z-score of −1.25 tells you that your test score of 85 is 1.25 standard deviations below the mean.
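The two calculations above, written out in Python:

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

print(z_score(133, 100, 15))  # 2.2   -> 2.2 standard deviations above the mean
print(z_score(85, 90, 4))     # -1.25 -> 1.25 standard deviations below the mean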

Basic concepts of probability in Python (this is especially relevant to our line of work)

We'll use pandas and numpy for data manipulation and matplotlib for plotting. In addition to pandas, numpy, and matplotlib, we'll use two Python packages that may be new to you: SciPy stats and statsmodels. SciPy is an open-source library used to solve mathematical, scientific, and engineering problems; it lets you manipulate and visualize data with a variety of Python commands, and its stats module is dedicated to statistics. statsmodels is a Python package that lets you explore data, fit statistical models, and perform statistical tests. It includes an extensive list of statistical functions for different types of data.

Importing packages and initial operations (the dataset can be found directly on Baidu)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

# Load the district-wise education dataset and drop rows with missing values.
education_districtwise = pd.read_csv("education_districtwise.csv")
education_districtwise = education_districtwise.dropna()

data analysis

Rules of Thumb
Since the normal distribution appears to be a good fit for the regional literacy data, you can expect the rule of thumb to hold relatively well. Recall that the rule of thumb states that for a normal distribution:
68% of values fall within +/- 1 standard deviation of the mean
95% of values fall within +/- 2 standard deviations of the mean
99.7% of values fall within +/- 3 standard deviations of the mean
(Note: "SD" stands for standard deviation.)
In other words, you can expect:
68% of regional literacy rates to fall within +/- 1 standard deviation of the mean
95% of regional literacy rates to fall within +/- 2 standard deviations of the mean
99.7% of regional literacy rates to fall within +/- 3 standard deviations of the mean
First, create two new variables to store the mean and standard deviation of the regional literacy rates: mean_overall_li and std_overall_li.

mean_overall_li = education_districtwise['OVERALL_LI'].mean()
std_overall_li = education_districtwise['OVERALL_LI'].std()

With the dataset I downloaded, these come out to 73.39518927444797 (mean) and 10.098460413782469 (standard deviation).
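As a quick check of the rule of thumb, here is a minimal sketch (assuming, as above, that the OVERALL_LI column holds the literacy rates) that computes the share of regions within one standard deviation of the mean:

lower_limit = mean_overall_li - 1 * std_overall_li
upper_limit = mean_overall_li + 1 * std_overall_li
within_1_sd = ((education_districtwise['OVERALL_LI'] >= lower_limit) &
               (education_districtwise['OVERALL_LI'] <= upper_limit)).mean()
print(f"Within +/- 1 SD: {within_1_sd:.1%}")  # should be roughly 68% if the data are close to normal

Repeating the same pattern with 2 and 3 standard deviations should give values close to 95% and 99.7%, respectively.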
Recall that a z-score is a measure of how many standard deviations a data point is below or above the population mean. The z-score is useful because it tells you where a value lies in the distribution.
Data professionals often use z-scores for outlier detection. Typically, they consider observations with z-scores less than -3 or greater than +3 to be outliers; in other words, values that are more than 3 standard deviations from the mean.
To find outliers in the data, first create a new column called Z_SCORE that contains the z-scores for the literacy rates for each district in the dataset. Recall that the OVERALL_LI column lists literacy rates for all regions.
Then, the z-score is calculated using the function scipy.stats.zscore().

# Create a Z_SCORE column with the z-score of each district's literacy rate,
# then filter for outliers more than 3 standard deviations from the mean.
education_districtwise['Z_SCORE'] = stats.zscore(education_districtwise['OVERALL_LI'])
education_districtwise[(education_districtwise['Z_SCORE'] > 3) | (education_districtwise['Z_SCORE'] < -3)]

hypothesis testing

Hypothesis testing is a method used to determine whether sample data support a hypothesis. In statistics, we often need to compare the values of two or more population parameters. For example, you may want to know whether a new drug is more effective than an existing drug. To answer this question, you conduct an experiment to compare the effects of the two drugs. In the experiment, you take samples from two different populations and compare the means or proportions of those samples. Before you compare, you formulate two hypotheses. The null hypothesis states that there is no significant difference between the two populations, in other words that the new drug is no more effective than the existing one; the alternative hypothesis states that the new drug is more effective. Then you collect enough data to compute a test statistic, such as a t-value or z-value, to determine whether the sample data are consistent with the null hypothesis. If your data show that the new drug is more effective than the existing drug, you can reject the null hypothesis and accept the alternative hypothesis: the new drug is indeed more effective. If your data do not contradict the null hypothesis, you cannot reject it; that is, you do not have enough evidence to show a significant difference between the two drugs.
In order to perform hypothesis testing, you need to understand how p-values are calculated. The p-value is the probability of observing results at least as extreme as those in your sample, assuming the null hypothesis is true. If the p-value is small, your sample data are unlikely under the null hypothesis and you can reject it. Generally speaking, when the p-value is less than 0.05, we consider there to be enough evidence to reject the null hypothesis. However, this threshold is not absolute; it depends on factors such as sample size, effect size, and the type of hypothesis test. Therefore, when doing hypothesis testing, you need to consider many factors, not just the p-value.
Let's outline the steps for performing a hypothesis test. First, state the null and alternative hypotheses. Second, choose a significance level. Third, find the p-value. Fourth, reject or fail to reject the null hypothesis.
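Here is a minimal sketch of these four steps using a two-sample t-test from scipy.stats; the two samples below are simulated for illustration, not real drug data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
existing_drug = rng.normal(loc=50, scale=10, size=100)  # hypothetical response scores
new_drug = rng.normal(loc=54, scale=10, size=100)

# Step 1: H0: the two population means are equal; H1: they are not.
# Step 2: choose a significance level, for example 0.05.
t_stat, p_value = stats.ttest_ind(new_drug, existing_drug)

# Steps 3 and 4: look at the p-value and decide.
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")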

confidence interval

Confidence intervals are one of the most misunderstood concepts in statistics. Because it is a subtle topic, both newcomers and experienced researchers sometimes make inaccurate statements about confidence intervals. Let's look at an example. Suppose you are a data professional at a major urban planning firm. The city has asked your team to design new parks and trails featuring red maple trees. For planning purposes, your manager has asked you to estimate the average height of all red maple trees in the city, roughly 10,000 trees. Instead of measuring every tree, you collect a sample of 50 trees. The average height of the sample is 50 feet with a standard deviation of 7.5 feet. Based on a 95% confidence level, you calculate a confidence interval for the mean height that stretches from 48 feet to 52 feet. This range estimate will help your team design new parks and trails that comply with cityscape regulations.

At this point, you might be wondering: what does it mean to choose a 95% confidence level and to say you are 95% confident in the interval estimate? Earlier, you learned that the confidence level expresses the uncertainty of the estimation process. Let's look at what this means from a more technical perspective. A 95% confidence level means that if you repeatedly drew random samples from the population and used the same method to construct a confidence interval for each sample, you could expect 95% of those intervals to capture the population mean, and 5% of them not to. In practice, data professionals typically choose one random sample and generate one confidence interval, which may or may not contain the actual mean, because repeated random sampling is often difficult, expensive, and time-consuming. Confidence intervals give data professionals a way to quantify the uncertainty caused by random sampling.
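Reproducing the red maple example as a minimal sketch (sample of 50 trees, mean 50 ft, standard deviation 7.5 ft), using a normal-based interval from scipy.stats:

import numpy as np
from scipy import stats

sample_size = 50
sample_mean = 50
sample_sd = 7.5
standard_error = sample_sd / np.sqrt(sample_size)

# First positional argument is the confidence level (0.95 for 95%).
lower, upper = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)
print(f"95% CI: ({lower:.1f}, {upper:.1f}) feet")  # roughly 48 to 52 feet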
