Python 3 Machine Learning Cookbook, Chapter 8: Dissecting Time Series and Sequential Data 32

05-22

From the column: Python Machine Learning Cookbook, Study Notes

Extracting statistics from time series data

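One of the main reasons to analyze time series data is to extract interesting statistics from it. Given the nature of the data, time series analysis can reveal a great deal of information. This section shows how to extract these statistics.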

Create the file extract_stats.py and import the necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt

from convert_to_timeseries import convert_data_to_timeseries

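Use the same text file as in the previous section. Load its third and fourth columns, then create a pandas data structure to hold the data. This structure resembles a dictionary, with keys and their corresponding values: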

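# Input file containing data
input_file = 'data_timeseries.txt'

# Load the third and fourth columns (zero-based indices 2 and 3)
data1 = convert_data_to_timeseries(input_file, 2)
data2 = convert_data_to_timeseries(input_file, 3)

# Combine both series into one DataFrame keyed by column name
dataframe = pd.DataFrame({'first': data1, 'second': data2})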
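As a minimal sketch of the dictionary-like structure described above, the snippet below builds a DataFrame from two Series that share a date index. The values and dates are hypothetical stand-ins for illustration only; the real data comes from data_timeseries.txt through convert_data_to_timeseries from the previous section.

```python
import pandas as pd

# Hypothetical stand-in data, not the contents of data_timeseries.txt
index = pd.to_datetime(['1940-01-31', '1940-02-29', '1940-03-31', '1940-04-30'])
data1 = pd.Series([10.0, 20.0, 30.0, 40.0], index=index)
data2 = pd.Series([4.0, 3.0, 2.0, 1.0], index=index)

# The dict keys become column names; the Series are aligned on their shared index
dataframe = pd.DataFrame({'first': data1, 'second': data2})
print(dataframe)

# Columns are looked up by key, much like a dictionary
print(dataframe['second'])
```

Because the columns are aligned on the index, every row holds the two measurements taken at the same timestamp, which is what makes the row-wise statistics later in this section meaningful.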

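Next, extract some statistics. The following code prints the maximum and minimum of each column, then the mean of each column and the row-wise mean of the first 10 rows: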

# Print max and min
print('\nMaximum:\n', dataframe.max())
print('\nMinimum:\n', dataframe.min())

# Print mean
print('\nMean:\n', dataframe.mean())
print('\nMean row-wise:\n', dataframe.mean(axis=1)[:10])
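The difference between column-wise and row-wise statistics is easier to see on a tiny example. This is a sketch on made-up numbers (the variable names are illustrative, not from the original script): by default pandas reduces down each column, while axis=1 reduces across each row.

```python
import pandas as pd

# Made-up values chosen so the results are easy to verify by hand
df = pd.DataFrame({'first': [1.0, 2.0, 6.0], 'second': [3.0, 8.0, 4.0]})

col_max = df.max()            # one value per column
col_min = df.min()
col_mean = df.mean()          # (1 + 2 + 6) / 3 and (3 + 8 + 4) / 3
row_mean = df.mean(axis=1)    # average of 'first' and 'second' in each row

print(col_max, col_min, col_mean, row_mean, sep='\n\n')
```

The column-wise results are indexed by column name, while the row-wise result keeps the original row index, which is why the row-wise mean in the script is indexed by date.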

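The rolling mean is one of the more commonly used statistics in time series analysis. One of its best-known applications is smoothing a signal to remove noise. It is computed by averaging the signal inside a window and sliding that window along the time axis. The window size used here is 24. Correlation coefficients are also very useful for understanding the nature of the data: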

# Plot rolling mean
dataframe.rolling(window=24).mean().plot()

# Print correlation coefficients
print('\nCorrelation coefficients:\n', dataframe.corr())
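To make the sliding window concrete, here is a small sketch using the method-based rolling API of current pandas (the module-level pd.rolling_mean function is no longer available in recent releases). The series and the window size of 3 are made up for illustration.

```python
import pandas as pd

signal = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the current point and the two before it.
# The first window - 1 positions have no full window, so they are NaN.
rolling_mean = signal.rolling(window=3).mean()
print(rolling_mean)

# min_periods=1 averages over whatever points are available so far,
# which avoids the leading NaN values
rolling_mean_partial = signal.rolling(window=3, min_periods=1).mean()
print(rolling_mean_partial)
```

With a window of 24, as in the script, the first 23 points of the plotted curve are therefore missing unless min_periods is lowered.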

Next, plot the rolling correlation between the two columns using a window of size 60:

# Plot rolling correlation
plt.figure()
dataframe['first'].rolling(window=60).corr(dataframe['second']).plot()

plt.show()
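A rolling correlation shows how the relationship between two series changes over time, which a single overall coefficient cannot. The sketch below uses made-up series and a window of 3 (both are assumptions for illustration): the two series move together at first and in opposite directions at the end.

```python
import pandas as pd

# x rises and then falls; y rises throughout
x = pd.Series([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0, 9.0, 10.0, 11.0])

# Pearson correlation of x and y inside each 3-point window
rolling_corr = x.rolling(window=3).corr(y)
print(rolling_corr)

# A single coefficient over the whole series hides this change
overall_corr = x.corr(y)
print(overall_corr)
```

In the early windows the coefficient is close to 1, and in the final windows it is close to -1, even though the overall coefficient sits somewhere in between.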

Output

Maximum:
first     99.82
second    99.97
dtype: float64

Minimum:
first     0.07
second    0.00
dtype: float64

Mean:
first     51.264529
second    49.695417
dtype: float64

Mean row-wise:
1940-01-31    81.885
1940-02-29    41.135
1940-03-31    10.305
1940-04-30    83.545
1940-05-31    18.395
1940-06-30    16.695
1940-07-31    86.875
1940-08-31    42.255
1940-09-30    55.880
1940-10-31    34.720
Freq: M, dtype: float64

Correlation coefficients:
           first    second
first   1.000000  0.077607
second  0.077607  1.000000