Python · Data Toolkit

01-29

(The GitHub address used in this chapter is here)

(I vaguely remember explaining how to generate datasets somewhere before... well, never mind the details [hey])

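It is called a data toolkit, but what it implements is quite plain. It may be extended later as new needs come up, but for now it has only the following three features: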

Generating an XOR dataset
Generating a spiral dataset
Quantizing data (converting it to numeric form)

If that sounds fine to you, please read on ( σω)σ

First comes the simplest one, the XOR dataset generator. Pasting the code directly should be fine:

import numpy as np

# size is the total number of sample points
# scale controls how widely the dataset is spread out
def gen_xor(size=100, scale=1):
    x = np.random.randn(size) * scale
    y = np.random.randn(size) * scale
    z = np.zeros((size, 2))
    # One-hot labels: [0, 1] when x and y share a sign, [1, 0] otherwise
    z[x * y >= 0, :] = [0, 1]
    z[x * y < 0, :] = [1, 0]
    return np.c_[x, y].astype(np.float32), z
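As a quick sanity check, the snippet below calls gen_xor and confirms that the labels follow the sign of x * y. The function is repeated so the snippet runs on its own; the seed and the argument values are arbitrary choices for illustration.

```python
import numpy as np

def gen_xor(size=100, scale=1):
    # Same definition as in the text, repeated so this snippet is self-contained
    x = np.random.randn(size) * scale
    y = np.random.randn(size) * scale
    z = np.zeros((size, 2))
    z[x * y >= 0, :] = [0, 1]
    z[x * y < 0, :] = [1, 0]
    return np.c_[x, y].astype(np.float32), z

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
xy, z = gen_xor(size=200, scale=2)

# Points whose coordinates share a sign (first and third quadrants)
same_sign = xy[:, 0] * xy[:, 1] >= 0
print(xy.shape, z.shape)           # 200 points in total, one one-hot row each
print(np.all(z[same_sign, 1] == 1))  # same-sign points carry the label [0, 1]
```

Note that size is the total number of points; how many land in each quadrant is left to chance.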

Next is the spiral dataset generator, which uses the expression of a spiral in polar coordinates. Specifically:
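As a sketch read off from the gen_spin code below (an assumption about the intended formula, not a quotation of it), the j-th sample on the i-th spiral, with i = 0, ..., n-1 and j = 1, ..., m, can be written as:

```latex
r_j = \frac{j}{m}, \qquad
\theta_{i,j} = \frac{2\pi i}{n} + \frac{j-1}{m-1}\cdot\frac{8\pi}{n}, \qquad
(x_{i,j},\, y_{i,j}) = \bigl(r_j \sin\theta_{i,j},\; r_j \cos\theta_{i,j}\bigr)
```

In other words, the radius grows linearly from 1/m to 1 while the angle sweeps 8π/n, starting from 2πi/n.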

Here n is the number of spirals and m is the number of sample points on each spiral.

from math import pi

import numpy as np

# size is m in the formula above, n is n in the formula above,
# and n_class is the number of classes
def gen_spin(size=50, n=7, n_class=7):
    xs = np.zeros((size * n, 2), dtype=np.float32)
    ys = np.zeros(size * n, dtype=np.int8)
    for i in range(n):
        ix = slice(size * i, size * (i + 1))
        # Radii grow linearly from 1 / size to 1
        r = np.linspace(0.0, 1, size + 1)[1:]
        # Each spiral sweeps an angle of 8 * pi / n, starting at 2 * i * pi / n
        t = np.linspace(2 * i * pi / n, 2 * (i + 4) * pi / n, size)
        xs[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        ys[ix] = i % n_class
    # One-hot encode the class indices
    z = np.eye(n_class, dtype=int)[ys]
    return xs, z

Spirals generated this way may look a little too neat.

If that gives you the creeps, you can add a random term to t to make the data "messier".
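One possible way to add that random term is sketched below. The name gen_spin_noisy, the noise and seed parameters, and the choice of Gaussian noise are all assumptions made for illustration; the original only says that a random term is added to t.

```python
from math import pi

import numpy as np

def gen_spin_noisy(size=50, n=7, n_class=7, noise=0.1, seed=None):
    # Hypothetical variant of gen_spin: `noise` and `seed` are not in the original
    rng = np.random.default_rng(seed)
    xs = np.zeros((size * n, 2), dtype=np.float32)
    ys = np.zeros(size * n, dtype=np.int8)
    for i in range(n):
        ix = slice(size * i, size * (i + 1))
        r = np.linspace(0.0, 1, size + 1)[1:]
        t = np.linspace(2 * i * pi / n, 2 * (i + 4) * pi / n, size)
        # Assumed form of the random term: Gaussian jitter on the angle only
        t = t + rng.standard_normal(size) * noise
        xs[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        ys[ix] = i % n_class
    # One-hot encode the class indices
    return xs, np.eye(n_class, dtype=int)[ys]

xs, z = gen_spin_noisy(size=50, n=7, n_class=7, noise=0.1, seed=0)
print(xs.shape, z.shape)
```

Because only the angle is perturbed, every point keeps its radius; a larger noise value smears the arms together more, so it is worth tuning by eye.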

Last comes data quantization, which is not hard to implement but can be a bit tedious:

import numpy as np

# wc is short for whether_continuous
# continuous_rate is the threshold for deciding whether a feature is continuous
def quantize_data(x, y, wc=None, continuous_rate=0.1, separate=False):
    # Transpose x first, so that each row of xt holds one feature
    if isinstance(x, list):
        xt = [list(col) for col in zip(*x)]
    else:
        xt = x.T
    # Use set to collect every value each feature can take
    features = [set(feat) for feat in xt]
    # If wc is not given, use the threshold to decide whether each feature is continuous
    if wc is None:
        wc = np.array([len(feat) >= continuous_rate * len(y)
                       for feat in features])
    else:
        wc = np.array(wc, dtype=bool)

    # Conversion dictionaries used for quantization (None for continuous features)
    feat_dics = [None if wc[i] else {val: idx for idx, val in enumerate(feats)}
                 for i, feats in enumerate(features)]
    # If separate is False and every feature is discrete, use int;
    # in all other cases use double
    dtype = int if (not separate and np.all(~wc)) else np.double
    x = np.array([[val if wc[i] else feat_dics[i][val]
                   for i, val in enumerate(sample)] for sample in x], dtype=dtype)
    # If separate is True, split the discrete and the continuous columns
    if separate:
        x = (x[:, ~wc].astype(int), x[:, wc])

    # Quantize the labels, then invert the dictionary so that it maps
    # each index back to its original label
    label_dic = {label: idx for idx, label in enumerate(set(y))}
    y = np.array([label_dic[yy] for yy in y], dtype=np.int8)
    label_dic = {idx: label for label, idx in label_dic.items()}

    # Return everything that might still be needed later
    return x, y, wc, features, feat_dics, label_dic

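This quantization method will be used in the implementation of the naive Bayes model. It is also a simple form of data preprocessing in its own right, so it can be applied fairly widely.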
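To make the two core ideas easier to see in isolation (integer codes for discrete values, and the distinct-value threshold for continuity), here is a single-column sketch. The helpers encode_column and looks_continuous are hypothetical names, not part of the toolkit, and sorting the values is an added assumption to make the codes reproducible.

```python
import numpy as np

def encode_column(values):
    """Hypothetical helper: map each distinct value to an integer code."""
    # sorted() keeps the codes reproducible; a bare set has no guaranteed order
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    codes = np.array([mapping[value] for value in values], dtype=np.int64)
    # Build the inverse from the mapping itself so it recovers the original values
    inverse = {code: value for value, code in mapping.items()}
    return codes, mapping, inverse

def looks_continuous(values, continuous_rate=0.1):
    """Hypothetical helper: the distinct-value threshold test for one column."""
    return len(set(values)) >= continuous_rate * len(values)

colors = ["red", "green", "red", "blue"] * 25  # 100 samples, 3 distinct values
heights = [150 + 0.5 * k for k in range(100)]  # 100 samples, all distinct

codes, mapping, inverse = encode_column(colors)
decoded = [inverse[code] for code in codes.tolist()]

print(mapping)
print(looks_continuous(colors), looks_continuous(heights))
```

With the default threshold of 0.1, a column of 100 samples needs at least 10 distinct values to count as continuous, so on small datasets it is safer to pass wc explicitly.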

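That covers every feature of the "data toolkit" and how each one is implemented. It is fairly crude, but I personally find it quite handy in elementary settings ( σω)σ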

I hope all you dear readers enjoy it~