
Linear regression with a categorical dependent variable in SPSS



Posted

Hi,

 

I am desperately trying to find out whether I can perform an ordinary linear regression (not logistic) in SPSS on a dataset where the dependent variable is categorical (a nominal value - a class). I would be really grateful if someone could help me.

 

Thanks a lot!!!

Posted (edited)

No, you can't do that. Or rather, it's technically possible to do it in SPSS (i.e., the program will run), but it is theoretically/conceptually going to give you junk. OLS linear regression relies on the assumption that your independent variable has a linear association with your dependent variable; the outcome needs to be continuous, because there is no linear way to map nominal values that have no order. Your results will be meaningless, if you even get any. (You could run an OLS regression on ordered classes - that's not good either, and ordinal regression is a much better choice for that case.)

 

If your outcomes are nominal (unordered) and you have three or more, you really need to do a multinomial logistic regression (or some other kind of multinomial generalized linear model - but multinomial logit is likely to be the best). In a multinomial logistic regression, you set one of the categories as the base/reference group and compare the other 2+ groups to it. The interpretation is relatively simple if you don't have a bunch of covariates or interaction terms; IDRE at UCLA maintains a help page on it here. If you only have two groups, you do a binary logistic regression instead. Both are conceptually similar to OLS regression - the interpretation is different, though, because the shape of the curve is quite different (you talk in terms of (log) odds or relative risk instead of simple linear increases).
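
 

If it helps, here's roughly what that looks like in SPSS syntax. This is just a minimal sketch with made-up placeholder names (outcome, x1, and x2 are not from the OP's data), so substitute your own variables:

* Multinomial logistic regression for a nominal outcome with 3+ categories.
* The first category of outcome is used as the base/reference group.
NOMREG outcome (BASE=FIRST) WITH x1 x2
  /PRINT=PARAMETER SUMMARY LRT.

* Binary logistic regression for a two-group (0/1) outcome.
LOGISTIC REGRESSION VARIABLES=outcome
  /METHOD=ENTER x1 x2
  /PRINT=CI(95).

The BASE keyword is what sets the reference group that the other categories get compared against.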

Edited by juilletmercredi
Posted

Yeah, if it's a binary variable (1, 0), use logistic regression. If you have a categorical variable with more than two values (1, 2, 3, 4, etc.), then you ought to use multinomial logistic regression.

Posted (edited)

juilletmercredi said: "No, you can't do that. Or rather, it's technically possible to do it in SPSS (i.e., the program will run), but it is theoretically/conceptually going to give you junk. [...] Your results will be meaningless, if you even get any."

 

 

For the sake of completeness in this forum thread, I think it's important to point out that it IS possible to run an OLS regression with a binary dependent variable (0/1) and get something that is NOT "junk" or "meaningless". It's called the linear probability model (LPM), developed mostly within econometrics. Even though it's been around for 50+ years, econometricians still debate its relative merits vs. the use of probit or logit regression (latest overview of the debate here:

 

http://blogs.worldbank.org/impactevaluations/whether-to-probit-or-to-probe-it-in-defense-of-the-linear-probability-model
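
 

For anyone who wants to try it: the LPM is just OLS on the 0/1 outcome, so in SPSS it runs through the ordinary REGRESSION procedure. A minimal sketch with placeholder names (y01 is the 0/1 dependent variable; x1 and x2 are predictors). One caveat: the LPM's errors are heteroskedastic by construction, so you'd normally want robust standard errors on top of this, which may take a macro or a newer SPSS version:

* Linear probability model: plain OLS on a 0/1 outcome.
* Each coefficient is the change in the probability that y01 = 1.
REGRESSION
  /DEPENDENT y01
  /METHOD=ENTER x1 x2.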

 

After many years of crunching numbers, I've come to realize that there are very few absolute "yes" or "no" guidelines when it comes to good practice in data analysis. The answer to pretty much everything is a big fat "it depends" :D

Edited by spunky
Posted

I had a feeling someone was going to bring that up. Yes, it's true that there are very few absolute "yes" or "no" answers in statistics/data analysis, but I have found that when I'm teaching statistics or doing statistical consulting with someone who doesn't have an intermediate/advanced grasp of statistics, telling them about ambiguity sometimes confuses them. So I tend to use language that's a more standard "yes" or "no" unless there's some compelling reason not to. I like to teach my basic stats students the "rules", so to speak, and then teach flexibility of concepts to students who are taking more advanced classes (at least, that's the way I learned).

 

The OP sounded like he had more than two groups (although it was unclear). Personally, I would not choose the LPM over the probit or logit models in the majority of cases, and I think that someone who wanted to would need to be able to clearly explain its strengths and weaknesses so they could avoid issues like the one Friedman had with that reviewer. But I think this primarily comes down to the analyst's preference and knowledge level, as the experience spunky posted seems to demonstrate :)

Posted (edited)

juilletmercredi said: "I had a feeling someone was going to bring that up. [...] I like to teach my basic stats students the "rules", so to speak, and then teach flexibility of concepts to students who are taking more advanced classes (at least, that's the way I learned)."

 

Uhm… I guess that sort of reflects the difference in the backgrounds we come from (I come from a math background and switched to a social science PhD). It is true that in most social science programs the "rules" are taught first and the rationale comes later. The drawback I've found (coming from a very different program and a very different way in which this stuff is taught) is that students can either (a) become extremely fond of 'cookbook' approaches to data analysis (robbing them of the flexibility we're supposed to foster) or (b) have to re-learn a lot of the crucial stuff they learned before because, suddenly, everything they thought was set in stone does not necessarily hold all the time.

 

But say… what if you were to teach (or explain) the inherent assumptions that each model makes from the very beginning? Granted, this takes a lot more effort than the "hard rules" approach (and I know that from experience, because that's how I teach this stuff, from the intro courses to the advanced material, like my math professors did), but it helps create a deeper level of understanding from the start. You force people to acknowledge the ambiguity and complications that come with everyday data analysis and help them find ways to deal with those ambiguities. The key issue is, of course, that the emphasis is now on the theory behind the method and not necessarily on its application (which is what most methodology courses concern themselves with). But your students, or the people you consult for, will grasp things a lot better right from the start.

 

 

juilletmercredi said: "OP sounded like he had more than two groups (although it was unclear). [...] But I think this is primarily going to be the analyst's preference and knowledge level, as the experience spunky posted seems to demonstrate :)"

 

I also got the impression that the OP had more than two groups (although there is a multinomial linear probability model), but still, I think when you mention Friedman's reviewer you bring to light what I elaborated on before, particularly when he says: "believe our referee was stuck on a non-linear binary response model simply because that is the "correct" approach that we are taught in graduate econometrics". In Friedman's view, the reviewer had issues with his approach because of tradition. You're taught to do certain things in certain ways as a graduate student, and that's the way you have to do it. Then, when you have students of your own, you tell them that only certain things work in certain ways and that's how they have to do it. A lot of malpractice can get perpetuated this way, some of which can have horrible consequences. The popularity of robust methods during the '80s and '90s, which led to a delay in the discovery of the hole in the ozone layer, comes to mind.

 

In our particular case, you say you'd not use the LPM over the probit/logit model in the majority of cases. Does that mean you're also OK with leaving endogeneity in your probit/logit models in the majority of cases? Because I'm willing to argue that endogeneity gives you *more* problems than the LPM has; it's not easy to fix in probit/logit models, and it's EXTREMELY pervasive. So maybe you have (inadvertently, of course) helped perpetuate a more severe form of malpractice because you consider the LPM to be wrong or suboptimal. Now, I'm not blaming you for that (most of the time we don't even acknowledge the problem of endogeneity within psychology/education/sociology, etc.), but it helps me make the case that, regardless of the research question or the level of the person you're dealing with, it's always important to leave the door open to all pertinent options. Just get down (or up) to their level of understanding, explain what each method does, and let them make an informed decision about what seems to work best for their data. I bet you'll be surprised by the new stuff you learn every time! :D
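
 

(And for concreteness: one practical attraction of the LPM is that, being linear, it lets you handle an endogenous predictor with plain instrumental-variables machinery. Here's a rough sketch using SPSS's two-stage least squares procedure - note that it lives in the Regression add-on module, and every variable name here is a placeholder I made up: y01 is the 0/1 outcome, x_endog the endogenous predictor, z the instrument, x2 an exogenous control:

* Instrumental-variables LPM via two-stage least squares.
* Exogenous predictors are listed among the instruments as well.
2SLS y01 WITH x_endog x2
  /INSTRUMENTS=z x2
  /CONSTANT.

Fixing endogeneity inside a probit/logit model is a lot messier, which is part of my point.)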

Edited by spunky
Posted (edited)
spunky said: "But say… what if you were to teach (or explain) the inherent assumptions that each model makes from the very beginning? ... You force people to acknowledge the ambiguity and complications that come with everyday data analysis and help them find ways to deal with those ambiguities. The key issue is, of course, that the emphasis is now on the theory behind the method and not necessarily on its application (which is what most methodology courses concern themselves with). But your students, or the people you consult for, will grasp things a lot better right from the start."

 

Well, I do teach about model assumptions - that's a classic part of the statistics curriculum, and it's very important! There's more than one way to do this; I have personally found it effective to start with the classic model assumptions and statements, hinting at some ambiguity for my beginning students, and then move on to variations and ambiguities as they begin to understand more of the theoretical underpinnings. And yes, of course, I have given thought to beginning with the ambiguities. So far, with my own students, I have found that it increases confusion. I think the difference here is also our subfields - you're a quantitative person, so your concern is as much theoretical as it is applied. I am in social psych, and when I am teaching undergrads, the concern is application. I LOVE teaching about the theoretical underpinnings and ambiguities in statistics, but given the demands of the departments I've assisted or taught in, that has not been feasible.

 

But like I said, I don't leave out ambiguity altogether - I hint at it in my intro classes. For example, when it comes to level of measurement (like ordinal vs. continuous), I've included trick questions to get students to realize that certain outcomes can be treated in different ways depending on the analyst. When asking students which test they would use, I've deliberately included questions that could be answered with multiple kinds of analyses, and then we discuss. I just don't introduce it the way I would with advanced or graduate students. Same thing in my consultations - the level of ambiguity I get into depends on the interests and needs of the client. Some clients really want to know and understand the theory behind what's going on in their models. Some just need to get the models done for their papers, and for others the entire reason they hired me is that they don't want to think about this too deeply. Given the time and resource constraints on projects, it's often not feasible to explain everything and let people choose - and I have actually found that a lot of clients don't want that anyway. (Some do, and in those cases we have long, exciting conversations about models.)

 

spunky said: "Does that mean you're also OK with leaving endogeneity in your probit/logit models in the majority of cases?"

 

Yes - just as others are okay with the disadvantages/shortcomings of the LPM. I also never said that the LPM was wrong or suboptimal. It's not about right or wrong at this level - just trade-offs among the strengths and weaknesses of approaches. I personally prefer logit or probit models because I understand them better. Convention of the field has a lot to do with it too, admittedly.

Edited by juilletmercredi
