基于1980年1月至2018年12月的全国社会消费品零售总额的月度数据,进行如下一系列分析
1.获取数据
进入【国家统计局网页】http://www.stats.gov.cn/tjsj/
补充:【下载方法说明】http://jingyan.baidu.com/article/91f5db1b3ae7851c7f05e3ed.html
下载1980.01~2018.12期间的数据即可
2.数据的整理和缺失值的处理
观察数据,发现有相当一部分缺失。
该数据从2012年开始每年的1月和2月是缺失的。
首先使用Python执行完成填补,方法参考Data Preparing - *“根据变量数据类型选择不同的统计量”*的方法,定义missnumber变量作为检测指标,并用fillna函数填补缺失值。完成填补后对表格进行转置,并且将数据按年份由远及近重新排列,存储为新的csv文件
#2018年6月以前的数据为yue #ZD:社会消费品零售总额当期值(亿元) yue = read.csv("yuedu-final.csv",header=TRUE) sum(is.na(yue)) #确认无缺失值
#2018年7月以后的数据为yuet yuet = read.csv("yuet.csv",header=TRUE) yuet
3.TS图、ACF图、PACF图
对1980.01~2018.06的完整数据,画出时间序列图、ACF图和PACF图
library(tseries) library(TSA) library(urca) yuets = ts(yue$ZD,start=1984,frequency=12) par(mfrow=c(2,2)) plot(yuets) acf(yuets) pacf(yuets)
4.差分
观察上面的结果,发现序列显然是非平稳的。因此需要执行差分直至其平稳性通过。
需要考虑:可以用什么检验来检验平稳性是否通过?使用哪种类型的差分?需要差分几次?
5.差分后的ACF、PACF图
zdlog = log(yue$ZD) yuecf = diff(zdlog) #对ZD做差分 par(mfrow=c(2,2)) acf(yuecf) pacf(yuecf) plot(yuecf) library(forecast) ndiffs(yuecf) #此处ndiffs的结果为0,不需要再继续做差分 eacf(yuecf)
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 o o o x x o x x o o o x o o 1 x x o x x o o x o o o x x x 2 x x o x o o o x x o o x x x 3 o x x x o o o x o o o x o x 4 x o x x o o o o o o o x x o 5 x x x x x o o o o o o x x o 6 x o x x o o o o o o o x x o 7 x o x x o o o o o o o x x o
6、试探性选择模型
基于以上信息,试探性地选择几个可能的模型,并进行分析
e.g.如果对数据一阶差分,差分后的序列服从MA(1)模型,则为ARIMA(0,1,1)
model1
#model1 model1 = arima(zdlog,order=c(0,1,1)) model1
Call: arima(x = zdlog, order = c(0, 1, 1)) Coefficients: ma1 -0.0476 s.e. 0.0522 sigma^2 estimated as 0.005367: log likelihood = 493.44, aic = -984.89
par(mfrow=c(1,2)) plot(ts(model1$residuals)) Box.test(model1$residuals,type="Ljung-Box",lag=8)
Box-Ljung test data: model1$residuals X-squared = 56, df = 8, p-value = 2.821e-09
qqnorm(model1$residuals) qqline(model1$residuals)
model2
#model2 model2 = arima(zdlog,order=c(0,1,0),seasonal = c(1,1,0)) model2
Call: arima(x = zdlog, order = c(0, 1, 0), seasonal = c(1, 1, 0)) Coefficients: sar1 -0.4936 s.e. 0.0430 sigma^2 estimated as 0.008434: log likelihood = 399, aic = -796
par(mfrow=c(1,2)) plot(ts(model2$residuals)) Box.test(model2$residuals,type="Ljung-Box",lag=8)
Box-Ljung test data: model2$residuals X-squared = 97.677, df = 8, p-value < 2.2e-16
qqnorm(model2$residuals) qqline(model2$residuals)
auto
#auto modelauto = auto.arima(zdlog,trace=TRUE)
Fitting models using approximations to speed things up… ARIMA(2,1,2) with drift : -1035.74 ARIMA(0,1,0) with drift : -987.4594 ARIMA(1,1,0) with drift : -994.8467 ARIMA(0,1,1) with drift : -987.8844 ARIMA(0,1,0) : -978.8251 ARIMA(1,1,2) with drift : -1018.26 ARIMA(2,1,1) with drift : -1028.105 ARIMA(3,1,2) with drift : -1022.341 ARIMA(2,1,3) with drift : -1034.852 ARIMA(1,1,1) with drift : -1009.58 ARIMA(1,1,3) with drift : -1018.682 ARIMA(3,1,1) with drift : -993.9287 ARIMA(3,1,3) with drift : -1058.208 ARIMA(4,1,3) with drift : -1106.995 ARIMA(4,1,2) with drift : -1096.604 ARIMA(5,1,3) with drift : Inf ARIMA(4,1,4) with drift : Inf ARIMA(3,1,4) with drift : Inf ARIMA(5,1,2) with drift : Inf ARIMA(5,1,4) with drift : -1163.777 ARIMA(5,1,5) with drift : Inf ARIMA(4,1,5) with drift : Inf ARIMA(5,1,4) : -1054.406 Now re-fitting the best model(s) without approximations… ARIMA(5,1,4) with drift : -1184.331 Best model: ARIMA(5,1,4) with drift
modelauto
Series: zdlog ARIMA(5,1,4) with drift Coefficients: ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3 0.3535 0.0397 0.5804 -0.7189 -0.3218 -0.634 -0.0948 -0.6099 s.e. 0.0531 0.0438 0.0274 0.0410 0.0519 0.028 0.0374 0.0272 ma4 drift 0.9115 0.0122 s.e. 0.0216 0.0015 sigma^2 estimated as 0.003173: log likelihood=603.49 AIC=-1184.99 AICc=-1184.33 BIC=-1140.73
model3
#model3 model3 = arima(zdlog,c(5,1,4)) model3
Call: arima(x = zdlog, order = c(5, 1, 4)) Coefficients: ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3 -0.3780 -0.2096 0.4683 -0.1645 -0.2605 0.4148 0.2251 -0.5965 s.e. 0.1716 0.1099 0.1002 0.1149 0.0493 0.1757 0.1139 0.0849 ma4 0.0497 s.e. 0.1229 sigma^2 estimated as 0.004254: log likelihood = 540.04, aic = -1062.08
par(mfrow=c(1,2)) plot(ts(model3$residuals)) Box.test(model3$residuals,type="Ljung-Box",lag=8)
Box-Ljung test data: model3$residuals X-squared = 59.146, df = 8, p-value = 6.852e-10
qqnorm(model3$residuals)
7、比较模型
此处我比较几种模型的AIC
#比较AIC model1$aic model2$aic model3$aic
得到AIC分别为:
-984.8862
-795.9967
-1062.082
还有auto模型的 -1184.99
( auto为ARIMA(5,1,4) )
因此选择model4 = arima(zdlog,c(5,1,4))作为最佳模型
用滞后算子和差分算子表示:
8.预测并对比真实情况
使用得到的最佳模型model4,预测2018.07~2018.12的全国消费品零售总额,并和原始数据中2018年7月~12月的真实情况进行对比。
并且,这里使用不同线条样式绘制在图中,以便对比
#给出预测值 pre=predict(model3,6) exp(pre$pred)
Time Series: Start = 415 End = 420 Frequency = 1 [1] 30404.52 31007.98 31091.24 30151.80 30714.14 30750.63
yuetZD=ts(yuet$ZD,start=415,frequency=1) yuetZD
Time Series: Start = 415 End = 420 Frequency = 1 [1] 30733.7 31542.3 32005.4 35534.4 35259.7 35893.5
ts.plot(exp(pre$pred),ylim=c(30000,36000)) lines(yuetZD,col="red") #line of observed data lines(exp(pre$pred)+2*pre$se,lty=2,col="green") #line of observed data lines(exp(pre$pred)-2*pre$se,lty=2,col="green") #line of observed data
#给出预测值 pre=predict(model3,6) ts.plot(pre$pred,ylim=c(9.8,10.8)) yuetZD=ts(log(yuet$ZD),start=415,frequency=1) yuetZD
Time Series: Start = 415 End = 420 Frequency = 1 [1] 10.33312 10.35908 10.37366 10.47826 10.47050 10.48831
lines(yuetZD,col="red") #line of observed data lines(pre$pred+2*pre$se,lty=2,col="green") #line of observed data lines(pre$pred-2*pre$se,lty=2,col="green") #line of observed data
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)