Advice on statistical model building
Andrew Grogan-Kaylor writes:
More and more in my work with students, I'm coming to a place where I realize that I know a lot, and am good at explaining, all kinds of statistical stuff like "we use a logit (or probit) for a binary outcome, and here's why" or "when data are clustered inside neighborhoods, we use multilevel models".
What I'm less good at is something that is emerging in a lot of the PhD students' questions, which are more general questions about how to build statistical models. When do you know that you have "enough" independent variables? What variables should be included in your analysis even if they're not going to end up being statistically significant? When you model interactions, how many interactions should you test? Which ones should you retain in your model?
I guess that what I'm looking for is a more "philosophical" piece about statistical model building in general, as opposed to what I usually read, which are pieces about the particulars of a specific statistical technique.
Do you know of any general overview of modeling such as an article? I recall you talking about something like this in your blog, but a search is not turning it up.
My reply: There must be some overviews out there, but the only ones I'm particularly happy with are those in chapter 4 (linear regression) and chapter 5 (logistic regression) of my book with Jennifer.
My brief bit of strategy advice is to start the model simple and add variables.
Also, take your most important main effects and include their interactions. That's a trick I learned a couple years ago, and it's worked over and over for me. It sounds obvious once you hear it, but if you look back on your earlier analyses, I bet you'll find you weren't always doing it.
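Here is a minimal sketch of that strategy in Python, using only numpy. The data are simulated and the coefficients are made up for illustration; `fit_ols` is a hypothetical helper, not anything from the book. It fits the main-effects model first and then adds the interaction of the two main effects.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients and residual sd for a design matrix X
    that already includes an intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    resid_sd = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return coef, resid_sd

# Simulated data; the coefficients below are made up for illustration.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(size=n)

ones = np.ones(n)

# Step 1: start simple, with main effects only.
coef_main, sd_main = fit_ols(np.column_stack([ones, x1, x2]), y)

# Step 2: add the interaction of the two most important main effects.
coef_inter, sd_inter = fit_ols(np.column_stack([ones, x1, x2, x1 * x2]), y)

print("main effects only:", np.round(coef_main, 2), "residual sd", round(float(sd_main), 2))
print("with interaction: ", np.round(coef_inter, 2), "residual sd", round(float(sd_inter), 2))
```

In this simulated example the interaction is real by construction, so the second fit should recover a coefficient near 0.5 on the product term and a smaller residual sd. With real data you would look at the size and sign of the estimate rather than at a test alone.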
I keep aware of statistical significance, but I don't think of model building as a process of testing variables or testing interactions. The main thing I get out of statistical significance is that if a coefficient is statistically significant and has the "wrong" sign--that is, it doesn't make sense--then I look more carefully to try to understand what's going on with the predictors.
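One way to make the "wrong sign" check routine is sketched below, again in Python with numpy. Everything here is an assumption for illustration: the helper names `ols_with_se` and `flag_wrong_signs`, the simulated data, the expected signs, and the rough two-standard-error threshold. The output is a list of predictors to look at more carefully, not a rule for dropping them.

```python
import numpy as np

def ols_with_se(X, y):
    """Least-squares coefficients and their classical standard errors."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

def flag_wrong_signs(names, coef, se, expected_sign, threshold=2.0):
    """Names of predictors whose estimate is more than `threshold` standard
    errors from zero and whose sign is opposite to the expected one."""
    flagged = []
    for name, b, s in zip(names, coef, se):
        sign = expected_sign.get(name, 0)  # 0 means no prior expectation
        if sign != 0 and abs(b) > threshold * s and np.sign(b) != sign:
            flagged.append(name)
    return flagged

# Simulated data: x1 and x2 are correlated, and x1 has a negative effect.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 - 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

names = ["intercept", "x1", "x2"]
X = np.column_stack([np.ones(n), x1, x2])
coef, se = ols_with_se(X, y)

# Suppose the analyst expected both predictors to have positive effects.
expected = {"x1": +1, "x2": +1}
flagged = flag_wrong_signs(names, coef, se, expected)
print("look more carefully at:", flagged)
```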
One other tip--I think we mention this somewhere in the book--is to remember that a regression coefficient can be interpreted as the average difference in the outcome, comparing two units that differ by 1 unit on predictor x but are identical on all other predictors. Sometimes this comparison doesn't make a lot of sense, in which case it might not be worth your while to try too hard to interpret the corresponding coefficient.
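To see that interpretation concretely, here is a small sketch in Python with numpy on simulated data. The predictors `x` and `z` and the two units are hypothetical. It compares the model's predictions for two units that differ by 1 on `x` and are identical on `z`, and checks that the difference is the fitted coefficient on `x`.

```python
import numpy as np

# Simulated data; the coefficients are made up for illustration.
rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 0.5 + 1.5 * x - 0.7 * z + rng.normal(size=n)

X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two hypothetical units: same z, and x differing by exactly 1.
unit_a = np.array([1.0, 0.3, -1.2])  # intercept term, x, z
unit_b = np.array([1.0, 1.3, -1.2])

# Difference in predicted outcome between the two units.
diff = unit_b @ coef - unit_a @ coef
print("difference in predictions:", diff)
print("coefficient on x:         ", coef[1])
```

The two printed numbers agree up to floating-point error. The comparison stops making sense when another column of the design matrix is built from `x`, such as a squared term or an interaction: you cannot move `x` by 1 while holding `x**2` fixed, so the coefficient on `x` alone has no such direct reading.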