Modeling ordinal regression data in R as a binary GLM

The most common model for analyzing ordinal data is the cumulative logit model. Essentially, the categorical outcome is treated as the manifestation of a continuous latent variable. Each predictor affects the outcome in only one way, so you obtain a single regression coefficient per predictor. However, the model has several intercepts: they represent the cut points that partition the latent variable into the observed categories.

As in an ordinary regression model, each predictor affects the outcome in a single, common way; this is the proportional odds assumption, or constraint. Alternatively, you can let each predictor have a different effect on the outcome at each cut point.

How can we model this with univariate GLM software? A UCLA IDRE article on multivariate random coefficient models is instructive here, because it uses nlme (univariate linear mixed model software) to obtain the results of a multivariate model. The basic idea is to stack the data so that they look like repeated measures of a single outcome, but to signal to the software that the outcomes are actually different, thereby requiring a different intercept and slope for each.

Therefore, what we have to do is convert the data from wide to long; the model is then a conventional binomial model, but we need to tell it to estimate a different intercept at each level. To this end, I use generalized estimating equations (GEE) with an unstructured working correlation structure.

Demonstration

library(ordinal) # For ordinal regression to check our results
library(geepack) # For GEE with binary data

The data set:

soup <- ordinal::soup
soup$ID <- 1:nrow(soup) # Create a person ID variable
str(soup)

'data.frame':	1847 obs. of  13 variables:
 $ RESP    : Factor w/ 185 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ PROD    : Factor w/ 2 levels "Ref","Test": 1 2 1 2 1 2 2 2 2 1 ...
 $ PRODID  : Factor w/ 6 levels "1","2","3","4",..: 1 2 1 3 1 6 2 4 5 1 ...
 $ SURENESS: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 6 5 5 6 5 5 2 5 5 2 ...
 $ DAY     : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2 2 2 ...
 $ SOUPTYPE: Factor w/ 3 levels "Self-made","Canned",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ SOUPFREQ: Factor w/ 3 levels ">1/week","1-4/month",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COLD    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ EASY    : Factor w/ 10 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ GENDER  : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
 $ AGEGROUP: Factor w/ 4 levels "18-30","31-40",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ LOCATION: Factor w/ 3 levels "Region 1","Region 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...

I will use the SURENESS variable as the outcome. It has six levels. The DAY and GENDER variables will serve as predictors.

# Select variables to work with
soup <- dplyr::select(soup, ID, SURENESS, DAY, GENDER)
# I like dummy variables with recognizable names
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0) # Make male reference group
soup$day2 <- ifelse(soup$DAY == "2", 1, 0) # Make day 1 reference group

The next step is to convert the ordinal outcome into binary outcomes, one for each of the five thresholds.


We can then carry out the conversion and create the five new outcome variables.
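One way to build soup.long (a sketch; the variable names SURE, VAL, and SURE.f follow the output shown below). It stacks one copy of the data per threshold, with VAL indicating whether the response reached that threshold:

```r
# Data preparation repeated from the steps above, so this chunk runs on its own
library(ordinal)  # provides the soup data

soup <- ordinal::soup
soup$ID <- 1:nrow(soup)
soup <- soup[, c("ID", "SURENESS", "DAY", "GENDER")]
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)

# Stack one copy of the data per threshold (2 through 6)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup
  d$SURE <- k                                        # which threshold this copy encodes
  d$VAL  <- as.numeric(as.integer(d$SURENESS) >= k)  # 1 if the response reached it
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]        # keep each person's 5 rows together
```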


head(soup.long) # Let's look at the data

     ID SURENESS DAY GENDER girl day2 SURE VAL SURE.f
1     1        6   1 Female    1    0    2   1      2
1848  1        6   1 Female    1    0    3   1      3
3695  1        6   1 Female    1    0    4   1      4
5542  1        6   1 Female    1    0    5   1      5
7389  1        6   1 Female    1    0    6   1      6
2     2        5   1 Female    1    0    2   1      2

Let's look at a person who did not choose the highest response category, by printing the rows with soup.long$ID == 22:



     ID SURENESS DAY GENDER girl day2 SURE VAL SURE.f
22   22        4   1 Female    1    0    2   1      2
1869 22        4   1 Female    1    0    3   1      3
3716 22        4   1 Female    1    0    4   1      4
5563 22        4   1 Female    1    0    5   0      5
7410 22        4   1 Female    1    0    6   0      6

This person chose SURENESS category 4. Her first three VAL scores are 1 and her last two are 0, because a response of 4 clears the thresholds up to 3-4 but falls below the 4-5 and 5-6 thresholds.

The next step is to create dummy variables for the thresholds. These variables will represent the model's intercepts.
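A sketch of the dummy construction (model.matrix builds one indicator column per threshold level; the multiplication by -1 is explained in the note below):

```r
# Data preparation repeated from the steps above, so this chunk runs on its own
library(ordinal)  # provides the soup data

soup <- ordinal::soup
soup$ID   <- 1:nrow(soup)
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup; d$SURE <- k
  d$VAL <- as.numeric(as.integer(d$SURENESS) >= k)
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]

# One indicator per threshold level (SURE.f2, ..., SURE.f6), times -1
soup.long <- cbind(soup.long, model.matrix(~ 0 + SURE.f, soup.long) * -1)
```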


Please note that I multiply the dummies by -1. In ordinal regression, doing so makes interpretation easier. In short, it ensures that a positive coefficient increases the odds of moving from a lower category (for example, 3) to a higher category (4), i.e. of responding in a higher category.

Now we are ready to run the model. We use GEE with an unstructured working correlation structure.
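A sketch of the GEE fit with geepack (the model name pom.bin follows the text below, and the formula matches the one shown in the model comparison later):

```r
# Data preparation repeated from the steps above, so this chunk runs on its own
library(ordinal)  # provides the soup data
library(geepack)

soup <- ordinal::soup
soup$ID   <- 1:nrow(soup)
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup; d$SURE <- k
  d$VAL <- as.numeric(as.integer(d$SURENESS) >= k)
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]
soup.long <- cbind(soup.long, model.matrix(~ 0 + SURE.f, soup.long) * -1)

# Binary GEE: one intercept per threshold, no overall intercept
pom.bin <- geeglm(VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2,
                  id = ID, family = binomial, corstr = "unstructured",
                  data = soup.long)
```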


Next, I estimate the model with standard ordinal regression:
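A sketch of the standard ordinal fit with ordinal::clm (the model name pom.ord follows the text below):

```r
library(ordinal)  # provides clm() and the soup data

soup <- ordinal::soup
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)

# Cumulative logit (proportional odds) model
pom.ord <- clm(SURENESS ~ girl + day2, data = soup)
```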


Let's compare the coefficients and standard errors:

        Estimate Estimate.1 Std.err Std. Error     Wald  z value Pr(>|W|) Pr(>|z|)
SURE.f2 -2.13244   -2.13155 0.10454    0.10450 416.0946 -20.3971   0.0000   0.0000
SURE.f3 -1.19345   -1.19259 0.09142    0.09232 170.4284 -12.9179   0.0000   0.0000
SURE.f4 -0.89164   -0.89079 0.08979    0.09011  98.5995  -9.8857   0.0000   0.0000
SURE.f5 -0.65782   -0.65697 0.08945    0.08898  54.0791  -7.3833   0.0000   0.0000
SURE.f6 -0.04558   -0.04477 0.08801    0.08789   0.2682  -0.5093   0.6046   0.6105
girl    -0.04932   -0.04917 0.09036    0.09074   0.2980  -0.5419   0.5851   0.5879
day2    -0.26172   -0.26037 0.08584    0.08579   9.2954  -3.0351   0.0023   0.0024

We can see that the results are very close.

However, estimating the model with glm(), which cannot account for the dependence between outcomes from the same person, produces different results.
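For reference, the naive glm() fit (same formula, but no correlation structure) might look like:

```r
# Data preparation repeated from the steps above, so this chunk runs on its own
library(ordinal)  # provides the soup data

soup <- ordinal::soup
soup$ID   <- 1:nrow(soup)
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup; d$SURE <- k
  d$VAL <- as.numeric(as.integer(d$SURENESS) >= k)
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]
soup.long <- cbind(soup.long, model.matrix(~ 0 + SURE.f, soup.long) * -1)

# Ordinary logistic regression, ignoring within-person dependence
glm.bin <- glm(VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2,
               family = binomial, data = soup.long)
round(coef(summary(glm.bin)), 5)
```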


        Estimate Std. Error z value Pr(>|z|)
SURE.f2 -2.15144    0.08255 -26.062   0.0000
SURE.f3 -1.21271    0.06736 -18.004   0.0000
SURE.f4 -0.91149    0.06472 -14.084   0.0000
SURE.f5 -0.67782    0.06327 -10.713   0.0000
SURE.f6 -0.06523    0.06178  -1.056   0.2911
girl    -0.07326    0.04961  -1.477   0.1398
day2    -0.26898    0.04653  -5.780   0.0000

Both the estimates and, in particular, the standard errors are inadequate: ignoring the within-person dependence makes the standard errors far too small.

We can easily relax pom.bin's proportional odds constraint. Let's fit a partial proportional odds model in which the constraint is relaxed for the day2 predictor. We do this by estimating interactions between the threshold dummies and the day2 predictor.
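A sketch of the relaxed (partial proportional odds) GEE fit, named npom.bin here to match the comparison below, with one threshold-by-day2 interaction per threshold:

```r
# Data preparation repeated from the steps above, so this chunk runs on its own
library(ordinal)  # provides the soup data
library(geepack)

soup <- ordinal::soup
soup$ID   <- 1:nrow(soup)
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup; d$SURE <- k
  d$VAL <- as.numeric(as.integer(d$SURENESS) >= k)
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]
soup.long <- cbind(soup.long, model.matrix(~ 0 + SURE.f, soup.long) * -1)

# day2 now enters only through its interactions with the threshold dummies,
# so its effect is free to differ at each threshold
npom.bin <- geeglm(VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl +
                     SURE.f2:day2 + SURE.f3:day2 + SURE.f4:day2 + SURE.f5:day2 + SURE.f6:day2,
                   id = ID, family = binomial, corstr = "unstructured",
                   data = soup.long)
```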


For comparison, I also fit the same model with ordinal regression, relaxing the constraint for day2 via the nominal argument.
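A sketch of the corresponding ordinal fit, using clm's nominal argument to let the day2 effect vary across thresholds:

```r
library(ordinal)  # provides clm() and the soup data

soup <- ordinal::soup
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)

# Partial proportional odds: day2 moves from the regression part to nominal
npom.ord <- clm(SURENESS ~ girl, nominal = ~ day2, data = soup)
```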



             Estimate Estimate.1 Std.err Std. Error     Wald  z value Pr(>|W|) Pr(>|z|)
SURE.f2      -2.02982   -2.03106 0.11800    0.11834 295.8986 -17.1630  0.00000  0.00000
SURE.f3      -1.22087   -1.22213 0.09829    0.09857 154.2801 -12.3980  0.00000  0.00000
SURE.f4      -0.92773   -0.92899 0.09458    0.09443  96.2112  -9.8375  0.00000  0.00000
SURE.f5      -0.65744   -0.65870 0.09246    0.09188  50.5554  -7.1693  0.00000  0.00000
SURE.f6      -0.04733   -0.04859 0.08955    0.08965   0.2793  -0.5420  0.59714  0.58784
SURE.f2:day2  0.07359    0.07360 0.14148    0.14155   0.2705   0.5199  0.60298  0.60312
SURE.f3:day2  0.31691    0.31697 0.10607    0.10613   8.9270   2.9867  0.00281  0.00282
SURE.f4:day2  0.33301    0.33308 0.09970    0.09973  11.1551   3.3398  0.00084  0.00084
SURE.f5:day2  0.26330    0.26339 0.09618    0.09616   7.4938   2.7391  0.00619  0.00616
SURE.f6:day2  0.26741    0.26748 0.09347    0.09345   8.1842   2.8622  0.00423  0.00421
girl         -0.04809   -0.04994 0.09048    0.09077   0.2825  -0.5502  0.59507  0.58221

The results are comparable.

Now we can compare the binary model with the proportional odds constraint to the binary model without it, to test the constraint on the day2 variable. geepack provides anova() Wald tests for comparing the two models:
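The Wald test is obtained by fitting both GEE models and passing them to anova() (a sketch, with both fits repeated so the chunk runs on its own):

```r
# Data preparation and both fits repeated from the steps above
library(ordinal)  # provides the soup data
library(geepack)

soup <- ordinal::soup
soup$ID   <- 1:nrow(soup)
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)
soup.long <- do.call(rbind, lapply(2:6, function(k) {
  d <- soup; d$SURE <- k
  d$VAL <- as.numeric(as.integer(d$SURENESS) >= k)
  d
}))
soup.long$SURE.f <- factor(soup.long$SURE)
soup.long <- soup.long[order(soup.long$ID), ]
soup.long <- cbind(soup.long, model.matrix(~ 0 + SURE.f, soup.long) * -1)

pom.bin  <- geeglm(VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2,
                   id = ID, family = binomial, corstr = "unstructured", data = soup.long)
npom.bin <- geeglm(VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl +
                     SURE.f2:day2 + SURE.f3:day2 + SURE.f4:day2 + SURE.f5:day2 + SURE.f6:day2,
                   id = ID, family = binomial, corstr = "unstructured", data = soup.long)

# Wald test of the 4 extra parameters in the relaxed model
anova(npom.bin, pom.bin)
```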


Analysis of 'Wald statistic' Table

Model 1 VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + SURE.f2:day2 + SURE.f3:day2 + SURE.f4:day2 + SURE.f5:day2 + SURE.f6:day2
Model 2 VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2
  Df   X2 P(>|Chi|)
1  4 6.94      0.14

The difference between the two models is not statistically significant, indicating that the proportional odds constraint on the day2 variable is adequate.

We can perform the same test with ordinal, either by comparing the pom.ord and npom.ord models with anova(), or with nominal_test(). Both are likelihood ratio tests, which are generally preferable to the Wald test from GEE above.
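A sketch of the likelihood ratio test with ordinal (both fits repeated for completeness):

```r
library(ordinal)  # provides clm(), anova(), nominal_test(), and the soup data

soup <- ordinal::soup
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0)
soup$day2 <- ifelse(soup$DAY == "2", 1, 0)

pom.ord  <- clm(SURENESS ~ girl + day2, data = soup)               # constrained
npom.ord <- clm(SURENESS ~ girl, nominal = ~ day2, data = soup)    # day2 relaxed

# Likelihood ratio test of the proportional odds constraint on day2
anova(pom.ord, npom.ord)
```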



Likelihood ratio tests of cumulative link models:

         formula:               nominal: link: threshold:
pom.ord  SURENESS ~ girl + day2 ~1       logit flexible  
npom.ord SURENESS ~ girl        ~day2    logit flexible  

         no.par  AIC logLik LR.stat df Pr(>Chisq)
pom.ord       7 5554  -2770                      
npom.ord     11 5555  -2766    6.91  4       0.14

nominal_test(pom.ord)

Tests of nominal effects

formula: SURENESS ~ girl + day2
       Df logLik  AIC  LRT Pr(>Chi)  
<none>     -2770 5554                
girl    4  -2766 5554 8.02    0.091 .
day2    4  -2766 5555 6.91    0.141  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Both tests arrive at the same result, and the Wald test comparing the GEE models gives essentially the same p-value, although its χ2 test statistic is slightly higher.


All things considered, using the ordinal package is of course much easier. Treating the model as binary may have some benefits, but they are matters of curiosity rather than need. For a reason I have not yet figured out, when one tries to use the fitted() function to obtain predicted probabilities from the ordinal model, it returns only a single set of fitted probabilities; ideally, it would return fitted probabilities for each threshold. With geepack, one can directly obtain predicted probabilities for each level. However, this advantage is negligible.


Also, if you are familiar with maximum likelihood estimation, you can simply program the likelihood function yourself.

Here is an example for the proportional odds case:
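A sketch of hand-coded maximum likelihood via optim (the parameter names a1 to a5, bg, and bd match the output below; the original res object evidently supports coef(summary(res)), which this sketch skips by printing the estimates and standard errors directly):

```r
library(ordinal)  # provides the soup data

soup <- ordinal::soup
y <- as.integer(soup$SURENESS)                       # outcome coded 1..6
X <- cbind(girl = as.numeric(soup$GENDER == "Female"),
           day2 = as.numeric(soup$DAY == "2"))

# Negative log-likelihood of the proportional odds model:
# P(Y <= k) = plogis(alpha_k - x'beta), k = 1, ..., 5
nll <- function(p) {
  alpha <- p[1:5]                                    # thresholds a1..a5
  beta  <- p[6:7]                                    # slopes bg, bd
  eta   <- drop(X %*% beta)
  cum   <- cbind(0, plogis(outer(-eta, alpha, "+")), 1)  # P(Y <= 0), ..., P(Y <= 6)
  pr    <- cum[cbind(seq_along(y), y + 1)] - cum[cbind(seq_along(y), y)]
  if (any(pr <= 0)) return(1e10)                     # guard against unordered thresholds
  -sum(log(pr))
}

res <- optim(c(-2, -1.2, -0.9, -0.7, 0, 0, 0), nll, method = "BFGS", hessian = TRUE)
est <- setNames(res$par, c("a1", "a2", "a3", "a4", "a5", "bg", "bd"))
se  <- sqrt(diag(solve(res$hessian)))                # SEs from the inverse Hessian
cbind(Estimate = est, `Std. Error` = se)
```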


coef(summary(res))
      Estimate Std. Error
a1 -2.13155603 0.10450286
a2 -1.19259266 0.09232077
a3 -0.89079068 0.09010891
a4 -0.65697671 0.08898063
a5 -0.04477565 0.08788869
bg -0.04917604 0.09073602
bd -0.26037369 0.08578617

coef(summary(pom.ord))
        Estimate Std. Error     z value     Pr(>|z|)
1|2  -2.13155281 0.10450291 -20.3970663 1.775532e-92
2|3  -1.19259171 0.09232091 -12.9178937 3.567748e-38
3|4  -0.89078590 0.09010896  -9.8856524 4.804418e-23
4|5  -0.65697465 0.08898068  -7.3833401 1.543671e-13
5|6  -0.04476553 0.08788871  -0.5093434 6.105115e-01
girl -0.04917245 0.09073601  -0.5419287 5.878676e-01
day2 -0.26037360 0.08578617  -3.0351465 2.404188e-03

The results are very similar. For a more definitive comparison of the models, we can always compare the log-likelihoods:

logLik(res)
'log Lik.' -2769.784 (df=7)

logLik(pom.ord)
'log Lik.' -2769.784 (df=7)

 



 

If you have any questions, please leave a comment below. 

 

 


Origin www.cnblogs.com/tecdat/p/12228536.html