How Programmers Achieve Financial Freedom Series 1: Using Programmer Skills for Data Analysis and Visualization

Author: Zen and the Art of Computer Programming

1. Introduction

As a programmer, have you ever wondered how to become more financially free and rid yourself of the constraints of depending on a single paycheck? If you have made the move from skilled employee to financially independent professional or entrepreneur, you already know how to raise your earning capacity without giving up your current job. This article gives an accessible introduction to using programmer skills to process, analyze, and visualize data, skills that can support your path toward financial freedom.

2. The significance of data analysis and visualization

Data plays a vital role in solving everyday problems. It lets us see the whole picture, reveal the logical relationships behind events, and find recurring patterns. Through data analysis, we turn raw data into useful information, use it to make decisions, and formulate strategies to reach our goals. Data analysis also helps us track the latest technological developments, adjust our strategies, stay competitive, and create new value for a company. Data analysis and visualization are among the most widely used techniques in the computer field.

3. What is a programming language? Why use it?

A programming language is a formal set of instructions used to write programs that tell computers or other automated systems how to process data. Programming languages differ greatly in syntax and features, but they share the same basic building blocks (data types, variables, operators, control statements, and functions), and every program is ultimately translated into machine code, or run by an interpreter, so that the computer can execute it. The choice of language affects how quickly we learn, how far that learning can take us, and which tools and libraries are available to us. Well-known programming languages include Python, Java, and JavaScript.

To understand how programmers use programming languages for data analysis and visualization, you first need to understand some basic concepts and terminology.

4. Basic concepts and terminology

Data type

A data type defines the form, size, and memory layout of a value, and which operations are valid on it. Different programming languages support different data types, such as integer, floating point, character, and Boolean, as well as composite structures such as arrays, linked lists, trees, and graphs.
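As a small illustration in Python, the sketch below creates a few values and prints the type Python assigns to each. The variable names and values are made up for the example.

```python
# Built-in scalar types
count = 42              # int
price = 19.99           # float
label = "revenue"       # str (Python has no separate character type)
is_paid = True          # bool

# Built-in container types
monthly = [120, 135, 150]             # list
summary = {"min": 120, "max": 150}    # dict

for value in (count, price, label, is_paid, monthly, summary):
    print(type(value).__name__, value)
```

Python infers the type from the value, so no type declaration is needed.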

Variable

A variable is a named storage location in memory that holds a data value. Its value can change while the program runs, and we access the value through the variable name. When a variable is created, storage space is allocated for it. Variables are divided into local variables and global variables.

  • Local variables are defined inside a function body and are accessible only within that function. In many style conventions, their names begin with a lowercase letter or underscore.
  • Global variables are accessible program-wide. Some style conventions write their names in uppercase, particularly when they are constants.
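A minimal Python sketch of the difference between the two scopes. The names `TAX_RATE` and `net_income` and the 20% rate are hypothetical, chosen only for illustration.

```python
TAX_RATE = 0.2  # module-level (global) variable, uppercase by convention for constants


def net_income(gross):
    tax = gross * TAX_RATE   # 'tax' is local: it exists only while the function runs
    return gross - tax


print(net_income(1000.0))    # 800.0
```

Trying to read `tax` outside `net_income` raises a `NameError`, because the local name is gone once the function returns.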

Operator

Operators are symbols that perform arithmetic, relational, assignment, logical, and other operations. The exact set depends on the language. Operators are mainly divided into the following categories:

  1. Arithmetic operators: used to perform arithmetic operations on two values, such as addition, subtraction, multiplication and division.
  2. Relational operators: used to compare two values, such as equal to, not equal to, greater than, less than, greater than or equal to, less than or equal to.
  3. Assignment operators: used to assign a value to a variable, such as =, +=, -=, *=, /=, etc.
  4. Logical operators: used to perform logical operations on expressions, such as AND, OR, and NOT.
  5. Conditional operator: used to return a value based on the value of a conditional expression, such as ? : in C-like languages.
  6. Increment/decrement operators: used to change the value of a variable by one, such as ++x, --x, x++, x-- in C-like languages.
  7. String operators: used for concatenation, comparison, finding substrings, etc.
  8. Function call operator: used to call functions.
  9. Bitwise operators: used to operate on the individual bits of integers, such as AND, OR, XOR, NOT, left shift, right shift, etc.
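The categories above look like this in Python. This is only a sketch with arbitrary values; note that Python has no ++ or -- operators and writes the conditional operator as `x if condition else y` rather than `? :`.

```python
a, b = 7, 2

print(a + b, a - b, a * b, a / b)    # arithmetic: 9 5 14 3.5
print(a // b, a % b)                 # floor division and remainder: 3 1
print(a > b, a == b, a != b)         # relational: True False True
print(a > 0 and b > 0, not a > b)    # logical: True False

total = 10
total += a                           # compound assignment (Python has no ++ or --)
print(total)                         # 17

larger = a if a > b else b           # conditional expression
print(larger)                        # 7

print("data" + "-" + "viz", "viz" in "data-viz")   # string concatenation, substring test
print(a & b, a | b, a ^ b, a << 1, a >> 1)          # bitwise: 2 7 5 14 3
```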

Control statement

A control statement directs the flow of program execution, allowing specific code to run, repeat, or be skipped depending on certain conditions. Control statements include the following categories:

  1. Branch statements: including if-else statements and switch-case statements.
  2. Loop statements: including for statements, while statements, and do-while statements.
  3. Jump statements: including goto statements, break statements, continue statements, and return statements.
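A short Python sketch combining the three categories. The balances are invented for the example. Python itself has no goto or do-while statement, and it offers `match` (from version 3.10) in place of switch-case, so only the forms below are shown.

```python
balances = [250, -40, 0, 900, 130]

positive_total = 0
for balance in balances:           # loop statement
    if balance < 0:                # branch statement
        continue                   # jump: skip negative balances
    elif balance > 500:
        break                      # jump: stop at the first balance above 500
    positive_total += balance

print(positive_total)              # 250

countdown = 3
while countdown > 0:               # while loop: runs as long as the condition holds
    countdown -= 1
```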

Function

A function is a self-contained block of code that performs a specific task and returns a result. A function consists of four parts: function name, parameters, function body and return value. Functions can be reused, which avoids code duplication.
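The four parts can be seen in a small Python function. The name `compound` and the annual-compounding formula are assumptions made for the example, not something taken from the article.

```python
def compound(principal, rate, years=1):          # name and parameters
    """Return the value of principal after compounding annually at rate."""
    return principal * (1 + rate) ** years       # body and return value


# The same function is reused with different arguments.
print(compound(1000, 0.05, 2))
print(compound(250, 0.03))
```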

Module

A module is an independent source file that contains functions, classes, variables and other definitions. Other programs can import a module and use its elements. In Python, a module is a .py file.
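For instance, Python's standard library ships modules such as `statistics` and `math`. The sketch below imports them in two common ways; the `returns` list is made-up sample data.

```python
import statistics          # import a whole module
from math import sqrt      # import a single name from a module

returns = [0.02, -0.01, 0.03, 0.01]

print(statistics.mean(returns))   # call a function through the module name
print(sqrt(16))                   # 4.0
```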

Object

An object is an abstraction that bundles data (attributes) with the operations on that data (methods). In object-oriented programming, anything the program models can be represented as an object. In Python, every value is an object, including numbers, strings, lists, tuples, dictionaries, functions, and classes.
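The following Python sketch checks that a handful of values are all objects and then defines a small class. `Account`, `double`, and their contents are hypothetical names used only for illustration.

```python
def double(x):
    return 2 * x


# Numbers, strings, containers, functions and classes are all objects.
for thing in (3, "text", [1, 2], (1, 2), {"k": 1}, double, int):
    print(type(thing).__name__, isinstance(thing, object))


class Account:
    """A user-defined object: data (owner, balance) plus behavior (deposit)."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance


acct = Account("alice")
acct.deposit(50)
print(acct.owner, acct.balance)
```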

API

An API (Application Programming Interface) is a set of predefined rules that ensures data exchange between different pieces of software follows an agreed protocol. An API lets developers integrate another program's functionality without knowing its underlying implementation. APIs allow programs to communicate with each other and make it easier for software to work together.

5. Core algorithms and operating steps

Chi-square test

Chi-square test principle

The chi-square test, in the goodness-of-fit form described here, is a statistical method for testing whether an observed sample is consistent with a hypothesized distribution. It uses the sample data to measure how far the observed frequencies deviate from the frequencies expected under that distribution. If the deviation is large enough, the sample does not match the hypothesized distribution well, and the null hypothesis is rejected. Conversely, if the deviation is small, the sample is consistent with the hypothesized distribution.

Specifically, the chi-square test applies to categorical data; continuous data must first be grouped into intervals. Assume that the random variable X has only k different values (categories), and that the observed frequencies of these values in a sample of size n are f1, f2,..., fk, so that f1 + f2 + ... + fk = n. Denote the sample data by D, where each observation belongs to exactly one of the k categories. The population distribution to be tested, F(x), assigns a probability pi to the i-th category, with p1 + p2 + ... + pk = 1.

Given the sample data D and the population distribution F(x), first calculate the expected frequency of each category under the null hypothesis:

ei = n * pi, i=1,...,k

Then calculate the chi-square statistic, which sums the squared deviations between observed and expected frequencies, each scaled by the expected frequency:

χ^2 = Σ((fi - ei)^2 / ei), i=1,...,k

If the sample data follows the population distribution F(x), χ^2 approximately follows a chi-square distribution with df = k - 1 degrees of freedom, provided the expected frequencies are not too small. If parameters of F(x) are estimated from the sample, df is reduced by the number of estimated parameters. The purpose of the chi-square test is to compare the computed χ^2 with this reference distribution and obtain the test result accordingly.

Chi-square test steps

  1. Collect sample data D and the population distribution F(x) to be tested.
  2. State the null hypothesis: the sample data D follows F(x).
  3. Count the observed frequency fi of each category in D, and the sample size n.
  4. From F(x), obtain the probability pi of each category.
  5. Calculate the expected frequency n * pi of each category.
  6. For each category, calculate the term (fi - n * pi)^2 / (n * pi).
  7. Calculate the sum of these terms to obtain the chi-square statistic χ^2.
  8. Determine the degrees of freedom df = k - 1, where k is the number of categories.
  9. From the chi-square distribution with df degrees of freedom, find the critical value for the chosen significance level, or the p-value of χ^2.
  10. Based on the size of χ^2 relative to the critical value, determine whether to reject the null hypothesis.
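A minimal Python sketch of the goodness-of-fit computation: it compares observed frequencies with the expected frequencies n * pi and sums the scaled squared deviations. The function name `chi_square_statistic` and the die-roll counts are hypothetical. The critical value 11.070 is the standard table value for 5 degrees of freedom at the 5% significance level, hard-coded here because the Python standard library has no chi-square distribution.

```python
import math


def chi_square_statistic(observed, probabilities):
    """Pearson chi-square statistic for observed counts vs. hypothesized probabilities."""
    if len(observed) != len(probabilities):
        raise ValueError("observed and probabilities must have the same length")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, tested against a fair die.
observed = [8, 12, 9, 11, 10, 10]
probabilities = [1 / 6] * 6

chi2 = chi_square_statistic(observed, probabilities)
df = len(observed) - 1

CRITICAL_VALUE_DF5_ALPHA_05 = 11.070   # table value, valid only for df = 5, alpha = 0.05
reject_null = chi2 > CRITICAL_VALUE_DF5_ALPHA_05

print(f"chi2 = {chi2:.3f}, df = {df}, reject null hypothesis: {reject_null}")
```

In practice a statistics library would compute the p-value directly, which avoids looking up critical values by hand.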

K-means clustering algorithm

Principle of K-means clustering algorithm

K-means clustering is a commonly used unsupervised learning algorithm. It iteratively divides n input samples into k clusters so that the samples in each cluster are as close as possible to the cluster center, while the clusters themselves are as far apart as possible. Each cluster center is the centroid (mean) of the samples assigned to it, and each sample is assigned to the nearest of the k centroids. The steps of the K-means algorithm are as follows:

  1. Initialize k centroids, for example by randomly selecting k of the n data points as the initial centroids.
  2. Assign each data point to the cluster with the nearest centroid.
  3. Update the centroids. For each cluster, the centroid is recalculated as the average position of all points in the cluster.
  4. Repeat steps 2 and 3 until the centroids no longer move.

K-means clustering algorithm operation steps

  1. Collect training data set X.
  2. Specify the initial coordinate values of the K cluster centers.
  3. Repeat the following operations until convergence:
     a. Assign each data point to the nearest cluster center.
     b. Update the coordinate values of the cluster centers.
     c. Check for convergence.
  4. Return the final clustering result C, in which each data point is assigned to a cluster.
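The assign-and-update loop can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions, not a production one: points are tuples of numbers, distance is squared Euclidean, initial centroids are k randomly chosen data points, and an empty cluster keeps its previous centroid. The function names, the `seed` parameter, and the sample points are all hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initialize with k distinct data points
    assignments = []
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])   # empty cluster: keep old centroid
        if new_centroids == centroids:               # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

K-means can settle on different clusterings depending on the initial centroids, so real workflows usually run it several times and keep the best result.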

PCA (Principal Component Analysis) dimensionality reduction algorithm

Principle of PCA dimensionality reduction algorithm

PCA (Principal Component Analysis) is a commonly used dimensionality reduction method for features. Its purpose is to convert high-dimensional data into low-dimensional data while retaining as much of the variance as possible. The main steps are as follows:

  1. Center the data and calculate the covariance matrix Σ of the samples in the original space. The covariance matrix Σ is a symmetric matrix: the diagonal holds the variance of each feature, and the remaining elements are the covariances between pairs of features.
  2. Compute the eigenvalues and eigenvectors of Σ, and select the eigenvectors corresponding to the k largest eigenvalues as the basis of the subspace.
  3. Project the original data onto the new subspace, that is, express the feature vector of each sample as coordinates with respect to the selected eigenvectors.

PCA dimensionality reduction algorithm operation steps

  1. Collect training data set X.
  2. Use the PCA algorithm to transform the data into low-dimensional data Y.
  3. Use the dimensionality-reduced data Y for downstream tasks such as classification.
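A minimal pure-Python sketch of PCA reduced to its first principal component (k = 1). To stay free of external libraries it swaps the full eigendecomposition for power iteration, which finds only the eigenvector with the largest eigenvalue. It assumes the covariance matrix is non-zero and that the starting vector is not orthogonal to that eigenvector. All function names and the sample data are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every sample."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Power iteration: approximates the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinates of each sample along the chosen component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 2-D data lying close to the line y = x.
X = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.0], [8.0, 8.1], [10.0, 9.9]]

centered = center(X)
cov = covariance_matrix(centered)
component = top_component(cov)
Y = project(centered, component)     # 1-D representation of the 2-D data

print(component)
print(Y)
```

For more than one component, a numerical library with a full eigendecomposition routine is the usual choice.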

Origin blog.csdn.net/universsky2015/article/details/133446800