Matlab | Matlab from Beginner to Giving Up (14): Multivariate Data Analysis

Author's GitHub: https://github.com/MichaelBeechan
Author's CSDN: https://blog.csdn.net/u011344545
Code download: https://github.com/MichaelBeechan/Matlab-From-Zero-To-One

%% Time: 2019.6.8
%% Function: multivariate data analysis

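%% Multivariate data
% MATLAB analyzes multivariate statistical data column by column:
% each column holds one variable and each row holds one observation.
% The (i,j)th element is the ith observation of the jth variable.
%{
For example, consider a data set with three variables:
heart rate
weight
hours of exercise per week
%}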
D = [72          134          3.2
     81          201          3.5
     69          156          7.1
     82          148          2.4
     75          170          1.2 ];
mu = mean(D)
sigma = std(D)
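
To make the column-wise convention concrete, here is a minimal sketch that reuses the matrix D from above and standardizes each variable. The explicit dim arguments and the variable name Z are illustrative choices, not part of the original script, and the subtraction relies on implicit expansion, which assumes MATLAB R2016b or later.

```matlab
% Same data as above: rows are observations, columns are variables
% (heart rate, weight, hours of exercise per week).
D = [72 134 3.2
     81 201 3.5
     69 156 7.1
     82 148 2.4
     75 170 1.2];

mu    = mean(D, 1);     % 1-by-3 row vector: one mean per column
sigma = std(D, 0, 1);   % 1-by-3 row vector: sample standard deviation per column

% Standardize each column to zero mean and unit standard deviation.
% Z is an illustrative name; implicit expansion applies mu and sigma to every row.
Z = (D - mu) ./ sigma;

disp(mu)                % per-variable means
disp(mean(Z, 1))        % numerically close to zero for every column
```

Averaging along rows instead, with mean(D, 2), would mix heart rate, weight, and exercise hours into one number, which is why the column-wise default fits this layout.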
 
% List the data analysis functions that ship with MATLAB
help datafun 
% List the functions in Statistics and Machine Learning Toolbox
% (requires the toolbox to be installed)
help stats

Output of help datafun:

  Basic operations.
    max         - Largest component.
    min         - Smallest component.
    bounds      - Smallest and largest components.
    mean        - Average or mean value.
    median      - Median value.
    mode        - Mode, or most frequent value in a sample.
    std         - Standard deviation.
    var         - Variance.
    sort        - Sort in ascending order. 
    sortrows    - Sort rows in ascending order.
    issorted    - TRUE for sorted vector and matrices.
    sum         - Sum of elements.
    prod        - Product of elements.
    histogram   - Histogram.
    histcounts  - Histogram bin counts.
    trapz       - Trapezoidal numerical integration.
    movsum      - Moving sum of elements.
    movvar      - Moving variance.
    movstd      - Moving standard deviation.
    movmedian   - Moving median.
    movmean     - Moving mean.
    movmin      - Moving minimum.
    movmax      - Moving maximum.
    movprod     - Moving product.
    movmad      - Moving median absolute deviation.
    cumsum      - Cumulative sum of elements.
    cumprod     - Cumulative product of elements.
    cummin      - Cumulative smallest component.
    cummax      - Cumulative largest component.
    cumtrapz    - Cumulative trapezoidal numerical integration.
 
  Finite differences.
    diff        - Difference and approximate derivative.
    gradient    - Approximate gradient.
    del2        - Discrete Laplacian.
 
  Correlation.
    corrcoef    - Correlation coefficients.
    cov         - Covariance matrix.
    subspace    - Angle between subspaces.
 
  Filtering and convolution.
    filter      - One-dimensional digital filter.
    filter2     - Two-dimensional digital filter.
    conv        - Convolution and polynomial multiplication.
    conv2       - Two-dimensional convolution.
    convn       - N-dimensional convolution.
    deconv      - Deconvolution and polynomial division.
    detrend     - Linear trend removal.
 
  Fourier transforms.
    fft         - Discrete Fourier transform.
    fft2        - Two-dimensional discrete Fourier transform.
    fftn        - N-dimensional discrete Fourier Transform.
    ifft        - Inverse discrete Fourier transform.
    ifft2       - Two-dimensional inverse discrete Fourier transform.
    ifftn       - N-dimensional inverse discrete Fourier Transform.
    fftshift    - Shift zero-frequency component to center of spectrum.
    ifftshift   - Inverse FFTSHIFT.
    fftw        - Interface to FFTW library run-time algorithm tuning control.
 
  Missing data.
    ismissing          - Find missing data.
    standardizeMissing - Convert to standard missing data.
    rmmissing          - Remove standard missing data.
    fillmissing        - Fill standard missing data.
 
  Data preprocessing.
    ischange     - Detect abrupt changes in data.
    islocalmax   - Detect local maxima in data.
    islocalmin   - Detect local minima in data.
    isoutlier    - Detect outliers in data.
    filloutliers - Replace outliers in data.
    smoothdata   - Smooth noisy data.
    rescale      - Rescales the range of data.
    normalize    - Normalizes data.
 
  Grouping.
    groupsummary - Summary computation by groups

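Output of help stats:

Return cached values and statistics for MemoizedFunction object

Note: this is the help text for the stats method of MemoizedFunction objects, not the toolbox function list, which suggests that Statistics and Machine Learning Toolbox was not installed when the command was run.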

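%% Data analysis

% Preprocessing -> Summarizing -> Visualizing -> Modeling
%{
Preprocessing - Account for outliers and missing values, and smooth the data to identify possible models.
Summarizing - Compute basic statistics that describe the overall location, scale, and shape of the data.
Visualizing - Plot the data to identify patterns and trends.
Modeling - Describe data trends more fully so that new data values can be predicted.
Goals:
Describe patterns in the data with simple models that lead to accurate predictions.
Understand the relationships among variables so that a model can be built.
%}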

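% Load the data
load count.dat  % The 24-by-3 array count holds hourly traffic counts (rows) at three intersections (columns) over one day

% Missing data: NaN (Not-a-Number) values are commonly used to represent missing data
% Use the isnan function to check whether the data at the third intersection contains NaN values
c3 = count(:, 3);  % Data at intersection 3
c3NaNCount = sum(isnan(c3))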
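
The same check extends to every column at once. The sketch below uses a small made-up matrix (not the real count.dat) so that it actually contains gaps; the names counts, nanPerColumn, rowsWithNaN, muAll, and muSafe are hypothetical, and the 'omitnan' flag assumes a MATLAB release that supports it for mean.

```matlab
% Hypothetical hourly counts with two missing readings (not the real count.dat)
counts = [11  11   9
           7 NaN  11
          14  17  20
          11  19 NaN];

nanPerColumn = sum(isnan(counts), 1);        % number of NaNs in each column
rowsWithNaN  = find(any(isnan(counts), 2));  % observations that contain a gap

muAll  = mean(counts, 1);             % NaN propagates into columns 2 and 3
muSafe = mean(counts, 1, 'omitnan');  % NaNs are skipped when averaging

disp(nanPerColumn)
disp(muSafe)
```

Skipping NaNs changes the number of samples behind each statistic, so it is worth keeping nanPerColumn alongside the results.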

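% Outliers
% Outliers are data values that differ markedly from the pattern in the rest of the data
% A common way to identify outliers is to look for values that lie more than
% a chosen number of standard deviations from the mean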
h = histogram(c3, 10);
N = max(h.Values);
mu3 = mean(c3);
sigma3 = std(c3);

hold on
plot([mu3 mu3], [0 N], 'r', 'LineWidth', 2)
X = repmat(mu3 + (1 : 2) * sigma3, 2, 1);
Y = repmat([0; N], 1, 2);
plot(X, Y, 'Color', [255 153 51]./255, 'LineWidth', 2)
legend('Data', 'Mean', 'Stds')
hold off

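% Flag values more than two standard deviations above the mean.
% This test is one-sided: unusually low counts are not flagged.
outliers = (c3 - mu3) > 2*sigma3;
c3m = c3; % Copy c3 to c3m
c3m(outliers) = NaN; % Replace the outliers with NaN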
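
When unusually low values matter as well, the rule can be made two-sided with abs. This is a minimal sketch on made-up readings, not the traffic data; the vector x and the names isHigh, isBoth, and xm are hypothetical.

```matlab
% Hypothetical readings with one high spike and one low spike
x = [50 52 49 51 50 48 51 49 50 52 120 -20]';

mu    = mean(x);
sigma = std(x);

isHigh = (x - mu) > 2*sigma;       % one-sided rule: only the high spike
isBoth = abs(x - mu) > 2*sigma;    % two-sided rule: both spikes

xm = x;
xm(isBoth) = NaN;                  % mark outliers as missing, as in the script

disp(find(isHigh)')
disp(find(isBoth)')
```

Because the spikes themselves inflate mu and sigma, a mean-based rule can miss outliers in heavily contaminated data; the isoutlier function listed in the help output above defaults to a median-based rule instead.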


% Smoothing and filtering
% Time-series plot of the data at the third intersection
plot(c3m,'o-')
hold on


% Use the MATLAB convn function to apply a simple moving-average smoother to the data
span = 3; % Size of the averaging window; span controls the extent of the smoothing
window = ones(span,1)/span; 
smoothed_c3m = convn(c3m,window,'same');

h = plot(smoothed_c3m,'ro-');
legend('Data','Smoothed Data')
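
Because c3m contains NaN, every averaging window that touches a NaN also returns NaN, so the gap in the smoothed curve is wider than the gap in the data. The sketch below shows this on a made-up vector and, as one possible alternative that the original script does not use, applies movmean with the 'omitnan' flag; the names x, viaConv, and viaMov are hypothetical.

```matlab
% Hypothetical series with a single missing sample
x = [4 5 NaN 7 8 9]';

span   = 3;
window = ones(span, 1) / span;

viaConv = convn(x, window, 'same');     % NaN spreads to every window touching it
viaMov  = movmean(x, span, 'omitnan');  % averages only the non-NaN samples

disp([x viaConv viaMov])
```

Note that convn pads the ends with zeros, which pulls the first and last smoothed values toward zero, while movmean shrinks the window at the ends by default.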


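% Smooth the same data (c3m) with the filter function instead.
% filter uses a trailing window, so its output is shifted relative to the convn result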
smoothed2_c3m = filter(window,1,c3m);
delete(h)
plot(smoothed2_c3m,'ro-','DisplayName','Smoothed Data');
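
The shift between the two smoothed curves comes from window alignment: convn with 'same' returns the central part of the convolution, while filter returns its start. A minimal sketch on a made-up ramp makes the one-sample delay visible for span = 3; the names x, centered, and trailing are hypothetical.

```matlab
% Hypothetical ramp, chosen so the delay is easy to read off
x = (1:6)';

span   = 3;
window = ones(span, 1) / span;

centered = convn(x, window, 'same');  % window centered on each sample
trailing = filter(window, 1, x);      % window covers the current and previous samples

% For an odd span, the filter output lags by (span - 1)/2 samples
delay = (span - 1) / 2;

disp([x centered trailing])
```

Shifting the filter output back by delay samples lines the two results up, apart from the samples at the ends.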



Reposted from blog.csdn.net/u011344545/article/details/91346960