## Statistics in R Series

Introduction

We have covered logistic regression models for both binary and ordinal data and demonstrated how to implement them in R. Earlier articles also discussed how to evaluate predictions using R libraries. We have seen the impact of single as well as multiple predictors on the response variable and quantified it. Binary and ordinal response variables were used to show how to handle different types of data. In this article, we will go through prediction analyses for four more regression models: the generalized ordinal regression model, the partial proportional odds model, the multinomial logistic model, and the Poisson regression model.

Dataset

Our study uses the same UCI Machine Learning Repository Adult Data Set as a case study. The dataset contains demographic data on more than 30,000 individuals. The data include each individual’s race, education, job, gender, salary, number of jobs held, hours worked per week, and income earned. As a refresher, the variables under consideration are shown below.

- Education: numeric and continuous. The health status of a person may be greatly affected by education.
- Marital status: binary (0 for single and 1 for married). The impact of this variable will most likely be minimal; nevertheless, it has been included in the analysis.
- Gender: binary (0 for female and 1 for male). It may also have a smaller impact, but it will be interesting to find out.
- Family income: binary (0 for average or lower than average and 1 for greater than average). Health conditions may be affected by this.
- Health status: ordinal (1 for poor, 2 for average, 3 for good and 4 for excellent)

Prediction in Generalized Ordinal Regression Model

Consider the case where we have collected data on hundreds of people. The data include each person’s education, age, marital status, health status, gender, family income, and full-time employment status. Education, gender, marital status, and family income are to be included as predictor variables in the regression model for health status. Apart from education, the predictor variables are all binary, which means they take a value of either 0 or 1. Education is a continuous variable that indicates the number of years a person has been educated. The following variables are considered for this regression analysis.

- Education years
- Marital status
- Gender
- Family income
- Health status

There will be one coefficient for each predictor variable if we perform an ordinal logistic regression and the proportional odds assumption holds. Suppose family income has a coefficient of ‘x’, which means that for each unit increase in family income (in this case from 0 to 1), the logit, or log odds, of being in a higher category of health status increases by ‘x’. As a result, we can make the following statements about this model.

- The log odds of being in average health or better, rather than poor health, increase by ‘x’ if family income increases to above-average status.
- The log odds of being in good health or better, rather than average health or worse, increase by ‘x’ if family income increases to above-average status.
- The log odds of being in excellent health, rather than good health or worse, increase by ‘x’ if family income increases to above-average status.

A proportional odds model is characterised by the same change in log odds across all levels of the outcome. Real-world data frequently violate this assumption, in which case we cannot proceed with the proportional odds model. As discussed earlier, two possible solutions to this nonproportional odds issue are a generalized ordinal model or a partial proportional odds model.

- Generalized ordinal regression model -> the effects of all predictors can vary across all outcome levels
- Partial proportional odds model -> the effects of all or some predictors are allowed to vary at some outcome levels
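The three variants can be written side by side. The notation here is an assumption introduced for illustration: $Y$ is health status coded 1 to 4, $j = 1, 2, 3$ indexes the cut points, $\alpha_j$ are the cut-point intercepts, $\mathbf{x}$ holds the predictors, and in the last line $\mathbf{x}_1$ collects the predictors that satisfy the proportional odds assumption while $\mathbf{x}_2$ collects those that violate it.

```latex
\begin{aligned}
\text{Proportional odds:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x}) = \alpha_j + \mathbf{x}^\top \boldsymbol{\beta} \\
\text{Generalized ordinal:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x}) = \alpha_j + \mathbf{x}^\top \boldsymbol{\beta}_j \\
\text{Partial proportional odds:} \quad
  & \operatorname{logit} P(Y > j \mid \mathbf{x}) = \alpha_j + \mathbf{x}_1^\top \boldsymbol{\beta} + \mathbf{x}_2^\top \boldsymbol{\gamma}_j
\end{aligned}
```

The only difference between the lines is whether a slope carries the cut-point index $j$: nowhere in the first model, everywhere in the second, and only for the violating predictors in the third.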

We have already implemented these models using the generalized approach and the PPO approach in earlier articles.

Now we will implement the prediction procedure using these models.
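As a rough illustration of the workflow, the sketch below fits a generalized ordinal model with the VGAM package and passes it to `ggpredict()`. The data are simulated, and the column names (`educ`, `marital`, `gender`, `faminc`, `health`) and cut points are assumptions chosen for this example rather than the dataset used in this series.

```r
# Sketch only: simulated data and hypothetical column names.
library(VGAM)       # vglm() and the cumulative() family
library(ggeffects)  # ggpredict()

set.seed(42)
n <- 1000
dat <- data.frame(
  educ    = sample(0:20, n, replace = TRUE),  # years of education
  marital = rbinom(n, 1, 0.5),
  gender  = rbinom(n, 1, 0.5),
  faminc  = rbinom(n, 1, 0.5)
)

# Simulate a four-level ordered outcome from a latent logistic variable
latent <- 0.15 * dat$educ + 0.4 * dat$faminc + rlogis(n)
dat$health <- cut(latent,
                  breaks = c(-Inf, 0, 1.5, 3.5, Inf),
                  labels = c("poor", "average", "good", "excellent"),
                  ordered_result = TRUE)

# parallel = FALSE gives every predictor its own slope at each cut point;
# reverse = TRUE models P(Y >= j), so positive slopes favour better health.
model1 <- vglm(health ~ educ + marital + gender + faminc,
               family = cumulative(parallel = FALSE, reverse = TRUE),
               data = dat)

# Predictions at 5, 10 and 15 years of education, with the other
# predictors held at typical values (confidence intervals on by default)
ggpredict(model1, terms = "educ[5,10,15]")
```

Because the data are simulated, the numbers printed by this sketch will not match the percentages quoted in this article.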

Here, we can see the cumulative predicted probabilities of having each health status for the given educ values. Recall that health status has four unique values.

If the person has 15 years of education,

- The cumulative probability of getting average health and above is 96%
- The cumulative probability of getting good health and above is 77%
- The cumulative probability of getting excellent health is 24%

If the person has only 5 years of education,

- The cumulative probability of getting average health and above is 81%
- The cumulative probability of getting good health and above is 41%
- The cumulative probability of getting excellent health is 8%

Therefore, it is clear that the number of years of education plays a major role in determining a person’s health status. If we want to obtain only the predicted probabilities of each category, we can execute the following command.

    ggpredict(model1, terms = "educ[5,10,15]", ci = NA)

If the person has 15 years of education,

- The probability of getting poor health is 4%
- The probability of getting average health is 20%
- The probability of getting good health is 52%
- The probability of getting excellent health is 24%

If the person has only 5 years of education,

- The probability of getting poor health is 19%
- The probability of getting average health is 40%
- The probability of getting good health is 33%
- The probability of getting excellent health is 8%

Clearly, more years of education increase the probability of better health. All of these values are adjusted for the mean values of marital status, gender and full-time working status.
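The category probabilities and the cumulative probabilities reported earlier are two views of the same prediction: each category probability is the difference between successive cumulative probabilities. A minimal base R sketch, using the rounded percentages quoted above and a hypothetical helper named `cumulative_to_category()`:

```r
# Cumulative probabilities P(health >= level) quoted above
cum_15 <- c(0.96, 0.77, 0.24)  # 15 years of education
cum_5  <- c(0.81, 0.41, 0.08)  # 5 years of education

health_levels <- c("poor", "average", "good", "excellent")

# Hypothetical helper: P(Y >= lowest level) is 1 and P(Y > highest level) is 0,
# so differencing the padded vector recovers the per-category probabilities.
cumulative_to_category <- function(cum, levels) {
  p <- -diff(c(1, cum, 0))
  names(p) <- levels
  p
}

round(cumulative_to_category(cum_15, health_levels), 2)
round(cumulative_to_category(cum_5, health_levels), 2)
```

For 5 years of education this reproduces the 19%, 40%, 33% and 8% listed above. For 15 years it gives 19% and 53% for average and good health rather than the reported 20% and 52%, a gap that is consistent with the inputs having already been rounded to whole percentages.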

Prediction in Partial Proportional Odds Model

In a partial proportional odds model, we can select the predictors whose effects we want to vary across the levels of the outcome. We can first determine which predictors violate the PO assumption and then place those variables after the *parallel = FALSE ~* argument. Here, we have specified marital status and family income as the violating predictors.
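A minimal sketch of how such a specification might look with VGAM is given below. As before, the data are simulated and the column names are assumptions made for this example; only the `parallel = FALSE ~` formula and the `ggpredict()` call mirror what is described in this article.

```r
# Sketch only: simulated data and hypothetical column names.
library(VGAM)
library(ggeffects)

set.seed(42)
n <- 1000
dat <- data.frame(
  educ    = sample(0:20, n, replace = TRUE),
  marital = rbinom(n, 1, 0.5),
  gender  = rbinom(n, 1, 0.5),
  faminc  = rbinom(n, 1, 0.5)
)
latent <- 0.15 * dat$educ + 0.4 * dat$faminc + rlogis(n)
dat$health <- cut(latent,
                  breaks = c(-Inf, 0, 1.5, 3.5, Inf),
                  labels = c("poor", "average", "good", "excellent"),
                  ordered_result = TRUE)

# Only marital and faminc get cut-point-specific slopes;
# educ and gender keep a single slope (proportional odds).
model2 <- vglm(health ~ educ + marital + gender + faminc,
               family = cumulative(parallel = FALSE ~ marital + faminc,
                                   reverse = TRUE),
               data = dat)

# Category probabilities at 5, 10 and 15 years of education
ggpredict(model2, terms = "educ[5,10,15]", ci = NA)
```

Calling `coef(model2, matrix = TRUE)` is one way to check the constraint: the rows for `educ` and `gender` should repeat one value across the three cut points, while `marital` and `faminc` should differ.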

If the person has 15 years of education,

- The probability of getting poor health is 4%
- The probability of getting average health is 20%
- The probability of getting good health is 52%
- The probability of getting excellent health is 24%

If the person has only 5 years of education,

- The probability of getting poor health is 17%
- The probability of getting average health is 41%
- The probability of getting good health is 35%
- The probability of getting excellent health is 7%

The cumulative probabilities can also be calculated using the approach described earlier.

Prediction in Multinomial Regression Model

We have covered multinomial logistic regression analysis in an earlier article.

Multinomial regression is a statistical approach to estimating the likelihood of an individual falling into a specific category relative to a baseline category, using a logit or log odds approach. Essentially, it works as an extension of binomial logistic regression when the nominal response variable has more than two outcomes. As part of multinomial regression, we are required to define a reference category, and the model will determine the parameters of several binomial models, each relative to the reference category.

In the following code, we have defined the first level of health status as the reference level, and we will compare the multiple binomial regression models with respect to this reference level.
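One way to set up such a model is sketched here with VGAM's `multinomial()` family, where `refLevel = 1` makes the first level the reference. The data are simulated, the column names are assumptions, and base `predict()` is used in place of `ggpredict()` to keep the sketch short.

```r
# Sketch only: simulated data and hypothetical column names.
library(VGAM)

set.seed(42)
n <- 1000
dat <- data.frame(
  educ    = sample(0:20, n, replace = TRUE),
  marital = rbinom(n, 1, 0.5),
  gender  = rbinom(n, 1, 0.5),
  faminc  = rbinom(n, 1, 0.5)
)
latent <- 0.15 * dat$educ + 0.4 * dat$faminc + rlogis(n)
# Unordered factor: the multinomial model ignores the ordering of the levels
dat$health <- cut(latent,
                  breaks = c(-Inf, 0, 1.5, 3.5, Inf),
                  labels = c("poor", "average", "good", "excellent"))

# refLevel = 1 uses the first level ("poor") as the reference category
model3 <- vglm(health ~ educ + marital + gender + faminc,
               family = multinomial(refLevel = 1),
               data = dat)

# Predicted category probabilities at 5, 10 and 15 years of education,
# holding the other predictors at their sample means
new_data <- data.frame(
  educ    = c(5, 10, 15),
  marital = mean(dat$marital),
  gender  = mean(dat$gender),
  faminc  = mean(dat$faminc)
)
probs <- predict(model3, newdata = new_data, type = "response")
round(probs, 2)
```

Each row of `probs` corresponds to one education value and sums to 1 across the four health categories.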

Our prediction approach yielded the next result.

If the person has 15 years of education,

- The probability of getting poor health is 4%
- The probability of getting average health is 19%
- The probability of getting good health is 52%
- The probability of getting excellent health is 25%

Again, these predicted probabilities are calculated holding the other predictors at their means. In multinomial logistic regression, the response variable should be nominal. Nevertheless, the response here is converted to ordinal in order to use the *ggpredict()* command.

Prediction in Poisson Regression Model

There are times when we need to deal with data that involve counting. To model a count response variable, such as the number of museum visits, we need Poisson regression. The number of visits to the hospital or the number of math courses taken by a particular group of students may also serve as examples. We have covered Poisson regression in an earlier article.

We are going to use the same dataset and predict the number of science museum visits from education years, gender, marital status, full-time working status and family income. The code block is shown below.
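A minimal sketch of such a model using base R's `glm()` follows. The data are simulated, the column names (`visits`, `fulltime`, and so on) and coefficients are assumptions, and base `predict()` is used instead of `ggpredict()` so that the example needs no extra packages.

```r
# Sketch only: simulated data and hypothetical column names.
set.seed(7)
n <- 1000
dat <- data.frame(
  educ     = sample(0:20, n, replace = TRUE),
  gender   = rbinom(n, 1, 0.5),
  marital  = rbinom(n, 1, 0.5),
  fulltime = rbinom(n, 1, 0.5),
  faminc   = rbinom(n, 1, 0.5)
)
# Counts drawn from a Poisson distribution with a log-linear mean
dat$visits <- rpois(n, lambda = exp(-2 + 0.08 * dat$educ + 0.3 * dat$gender))

model4 <- glm(visits ~ educ + gender + marital + fulltime + faminc,
              family = poisson(link = "log"),
              data = dat)

# Expected visits for a female (gender = 0) and a male (gender = 1)
# with 15 years of education, other predictors held at their means
new_data <- data.frame(
  educ     = 15,
  gender   = c(0, 1),
  marital  = mean(dat$marital),
  fulltime = mean(dat$fulltime),
  faminc   = mean(dat$faminc)
)
pred <- predict(model4, newdata = new_data, type = "response")
pred
```

With `type = "response"` the predictions are on the count scale, that is, the exponential of the linear predictor, so they can be read directly as expected numbers of visits.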

Using the same *ggpredict()* command, we obtain the following result for different education years as well as for different genders.

- The predicted number of science museum visits is 0.44 if the person is female (gender=0) and has 15 years of education
- The predicted number of science museum visits is 0.62 if the person is male (gender=1) and has 15 years of education
- This implies that females visit science museums less often than males. The conclusion is adjusted for the mean values of marital status, full-time working status and family income.

Conclusion

In this article, we have covered prediction analysis for four different types of regression models. The partial proportional odds model can be regarded as a special case of the generalized ordinal regression model, since the PPO model allows only a few predictors to vary their effect across outcome levels. The multinomial regression model is useful for nominal response variables that have unordered categories. Lastly, the Poisson regression model is well suited to predicting count variables. We have demonstrated the use of the *ggpredict()* function in all four regression models, along with the interpretation of the results.

Acknowledgement for Dataset

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (CC BY 4.0)

Thanks for reading.
