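A model of the following form is a multiple linear regression. Its core idea is the same as in simple (one-predictor) linear regression.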

$$ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{p}x_{p} + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}) $$
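One way to make the model concrete is to simulate data from it and check that `lm()` recovers the coefficients. The following is a minimal sketch, not part of the original analysis: the sample size, the true coefficients (1, 2, -3), the error standard deviation (2), and the names `sim` and `fit_sim` are all assumptions chosen for illustration.

```r
# Simulate from y = b0 + b1*x1 + b2*x2 + eps, with eps ~ N(0, sigma^2)
set.seed(1)                          # fixed only for reproducibility
n   <- 1000                          # assumed sample size
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 2)    # assumed sigma = 2
y   <- 1 + 2 * x1 - 3 * x2 + eps     # assumed b0 = 1, b1 = 2, b2 = -3

sim     <- data.frame(y, x1, x2)
fit_sim <- lm(y ~ x1 + x2, data = sim)

coef(fit_sim)    # estimates of b0, b1, b2
sigma(fit_sim)   # estimate of sigma
```

With a sample this large the estimates should land close to the values used to generate the data, which is all the sketch is meant to show.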


The predictors are continuous variables.


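library(tidyverse)   # read_csv(), ggplot(), %>%
library(patchwork)   # combine ggplot objects with +

data = read_csv("wellbeing.csv")

# Descriptive statistics for every variable
data %>% psych::describe()

# Density plots of the outcome and of each predictor, side by side
p1 = data %>% ggplot() + geom_density(aes(wellbeing))
p2 = data %>% ggplot() + geom_density(aes(outdoor_time))
p3 = data %>% ggplot() + geom_density(aes(social_int))
p1 + p2 + p3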


fit = lm(wellbeing ~ outdoor_time + social_int, data = data)
summary(fit)

---------------------------------------------------------------------
Call:
lm(formula = wellbeing ~ outdoor_time + social_int, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-9.742 -4.915 -1.255  5.628 10.936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.3704     4.3205   1.243   0.2238    
outdoor_time   0.5924     0.1689   3.506   0.0015 ** 
social_int     1.8034     0.2691   6.702 2.37e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.148 on 29 degrees of freedom
Multiple R-squared:  0.7404,	Adjusted R-squared:  0.7224 
F-statistic: 41.34 on 2 and 29 DF,  p-value: 3.226e-09
---------------------------------------------------------------------
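As a quick check on how the output hangs together, the adjusted R-squared can be recomputed from the multiple R-squared and the degrees of freedom. The residual degrees of freedom are $n - p - 1 = 29$ with $p = 2$ predictors, so $n = 32$. The inputs below are the rounded values printed above, so the result agrees with the reported 0.7224 only up to rounding.

$$ R^{2}_{adj} = 1 - (1 - R^{2})\frac{n-1}{n-p-1} = 1 - (1 - 0.7404) \times \frac{31}{29} \approx 0.722 $$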

library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)

---------------------------------------------------------------------
                      Value p-value                Decision
Global Stat        2.002726  0.7353 Assumptions acceptable.
Skewness           0.056977  0.8113 Assumptions acceptable.
Kurtosis           1.741556  0.1869 Assumptions acceptable.
Link Function      0.199260  0.6553 Assumptions acceptable.
Heteroscedasticity 0.004932  0.9440 Assumptions acceptable.
---------------------------------------------------------------------

car::influencePlot(fit)


Residual standard error

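sigma(fit)                           # residual standard error, as reported by summary(fit)
# Same quantity by hand: sqrt(RSS / residual degrees of freedom)
rss = sum(fit$residuals^2)
rse = sqrt(rss / df.residual(fit))   # df.residual(fit) is 29 for this model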
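The hand computation follows the usual estimator of $\sigma$, where the residual sum of squares is divided by the residual degrees of freedom rather than by $n$. For this model $n = 32$ and $p = 2$, which gives the 29 degrees of freedom shown in the summary output.

$$ \hat{\sigma} = \sqrt{\frac{RSS}{n-p-1}} = \sqrt{\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{32-2-1}} $$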

Prediction

# New observations to predict for
pdata = tibble(social_int = c(24, 19, 15, 7), outdoor_time = c(3, 26, 20, 2))
predict(fit, newdata = pdata)
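To see what `predict()` computes, the point predictions can be reproduced by hand from the coefficients printed in the summary output. This is a minimal sketch in base R. The names `b`, `new_obs`, and `yhat` are illustrative, and because the coefficients are the rounded printed values, the results may differ from `predict()` in the last decimals.

```r
# Rounded coefficients copied from the summary output
b <- c(intercept = 5.3704, outdoor_time = 0.5924, social_int = 1.8034)

# Same new observations as above, as a plain data frame
new_obs <- data.frame(social_int   = c(24, 19, 15, 7),
                      outdoor_time = c(3, 26, 20, 2))

# y-hat = b0 + b1 * outdoor_time + b2 * social_int
yhat <- b[["intercept"]] +
  b[["outdoor_time"]] * new_obs$outdoor_time +
  b[["social_int"]]   * new_obs$social_int

yhat   # 50.4292 55.0374 44.2694 19.1790
```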

Multicollinearity

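Regression estimates are reliable only when the predictors are not strongly correlated with one another. When this assumption is violated, the high correlation among predictors makes the coefficient estimates imprecise. In practice, the variance inflation factor (VIF) is computed to check whether multicollinearity is present in the regression equation.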

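$$ VIF_{j} = \frac{1}{1-R_{j}^{2}} $$

Here $R_{j}^{2}$ is the R-squared obtained by regressing predictor $x_{j}$ on all the other predictors.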

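A common rule of thumb: 0 < VIF < 5 indicates no multicollinearity; 5 ≤ VIF < 10 indicates weak multicollinearity; 10 ≤ VIF < 100 indicates fairly strong multicollinearity; VIF ≥ 100 indicates severe multicollinearity.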

car::vif(fit)

Ideally the values are close to 1; $VIF>10$ signals a problem.
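The VIF formula can be checked by running the auxiliary regressions directly. The sketch below is base R only and uses the built-in `mtcars` data as a stand-in, since `wellbeing.csv` is not reproduced here. The helper `vif_by_hand()` and the model `fit_demo` are hypothetical names, not functions from `car`, and the helper assumes a model with at least two numeric predictors and no factor or interaction terms.

```r
# Stand-in model on a built-in dataset (an assumption for illustration)
fit_demo <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(fit_demo)
```

For a model like this one, `car::vif(fit_demo)` should report the same values, and the same numbers also appear on the diagonal of the inverse of the predictors' correlation matrix.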

Visualization