Solving the Off-Network User Classification Problem with Logistic Regression

Logistic regression solves binary classification problems efficiently. It can also handle multi-class problems, just not as naturally as KNN, which is multi-class by design; KNN, however, is too simplistic and less broadly applicable than logistic regression. My idea is to use a mobile user's contact circle to judge whether an off-network user is under 25, to see how many young people have actually gone over to China Unicom. Two features are extracted: the average age of the top five contacts in the user's circle, and the average contact intimacy. The model is trained on on-network-to-on-network data, and only put into use once the accuracy is good enough.
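To make the two features concrete, here is a minimal sketch of how one sample's feature row could be built from a user's contact records. The build_features helper, the (age, intimacy) tuples, and the numbers are illustrative assumptions, not the actual production data or pipeline:

# hypothetical helper: build the two features for one user from their contact circle
def build_features(contacts):
    # contacts: list of (age, intimacy) tuples, ranked by intimacy descending
    top5 = contacts[:5]
    avg_age = sum(age for age, _ in top5) / len(top5)                 # average age of top-5 contacts
    avg_intimacy = sum(intimacy for _, intimacy in top5) / len(top5)  # average contact intimacy
    return [avg_age, avg_intimacy]

# e.g. a user whose closest contacts are mostly in their twenties
print(build_features([(23, 0.9), (25, 0.8), (22, 0.7), (31, 0.5), (24, 0.4)]))
# -> approximately [25.0, 0.66]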

There are quite a few places this can be applied: predicting whether a user will subscribe to a service, whether they are about to churn, whether they will downgrade their plan, and so on. For real multi-class user problems, though, deep neural networks would be the best choice, but the computation is too heavy for my low-spec machine. Applying logistic regression in practice also needs a fair amount of refinement; unlike textbook data, real data won't reach 90%+ accuracy with simple computation. I still need to study other people's optimization algorithms.

I'm on Python 3.6 while the reference code was written for 2.7, so it took some time to understand it and adapt the code and parameters. I'm still not fluent: every error sent me to Baidu, and after an hour or so it finally ran. The code and results follow.

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 11 19:49:13 2018
"""

from numpy import *
import matplotlib.pyplot as plt
import time

# calculate the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))
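# Note: the sigmoid squashes any real-valued score into (0, 1), so the model's
# output can be read as the probability that the sample's label is 1;
# e.g. sigmoid(0) = 0.5, which is the decision threshold used in testLogRegres below.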

# train a logistic regression model using some optional optimize algorithm
# input: train_x is a mat datatype, each row stands for one sample
#        train_y is mat datatype too, each row is the corresponding label
#        opts is optimize option include step and maximum number of iterations
def trainLogRegres(train_x, train_y, opts):
    # calculate training time
    # startTime = time.time()

    numSamples, numFeatures = shape(train_x)
    alpha = opts['alpha']; maxIter = opts['maxIter']
    weights = ones((numFeatures, 1))

    # optimize through gradient descent algorithm
    for k in range(maxIter):
        if opts['optimizeType'] == 'gradDescent':  # gradient descent algorithm
            output = sigmoid(train_x * weights)
            error = train_y - output
            weights = weights + alpha * train_x.transpose() * error
        elif opts['optimizeType'] == 'stocGradDescent':  # stochastic gradient descent
            for i in range(numSamples):
                output = sigmoid(train_x[i, :] * weights)
                error = train_y[i, 0] - output
                weights = weights + alpha * train_x[i, :].transpose() * error
        elif opts['optimizeType'] == 'smoothStocGradDescent':  # smooth stochastic gradient descent
            # randomly select samples to optimize for reducing cycle fluctuations
            dataIndex = list(range(numSamples))
            for i in range(numSamples):
                alpha = 4.0 / (1.0 + k + i) + 0.01  # step size decays over the pass
                randIndex = int(random.uniform(0, len(dataIndex)))
                # look up the remaining sample's id; the reference code indexed
                # train_x with randIndex directly, which revisits samples instead
                # of visiting each one once per pass as the comment above intends
                sampleIndex = dataIndex[randIndex]
                output = sigmoid(train_x[sampleIndex, :] * weights)
                error = train_y[sampleIndex, 0] - output
                weights = weights + alpha * train_x[sampleIndex, :].transpose() * error
                del(dataIndex[randIndex])  # during one iteration, delete the optimized sample
        else:
            raise NameError('Not support optimize method type!')

    print('Congratulations, training complete!')
    return weights
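# The batch 'gradDescent' branch implements gradient ascent on the logistic
# log-likelihood: weights <- weights + alpha * X^T * (y - sigmoid(X * weights)).
# A worked single step with illustrative numbers: for weights = [1, 1]^T,
# alpha = 0.01 and one sample x = [1, 2] with y = 1, the error is
# 1 - sigmoid(3) ≈ 0.047, so weights moves to roughly [1.0005, 1.0009]^T.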

# test your trained Logistic Regression model given test set
def testLogRegres(weights, test_x, test_y):
    numSamples, numFeatures = shape(test_x)
    matchCount = 0
    for i in range(numSamples):
        # classify as positive when the predicted probability exceeds 0.5
        predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
        if predict == bool(test_y[i, 0]):
            matchCount += 1
    accuracy = float(matchCount) / numSamples
    return accuracy

# show your trained logistic regression model, only available with 2-D data
def showLogRegres(weights, train_x, train_y):
    # notice: train_x and train_y are mat datatype
    numSamples, numFeatures = shape(train_x)
    if numFeatures != 3:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    # draw all samples
    for i in range(numSamples):
        if int(train_y[i, 0]) == 0:
            plt.plot(train_x[i, 1], train_x[i, 2], 'or')
        elif int(train_y[i, 0]) == 1:
            plt.plot(train_x[i, 1], train_x[i, 2], 'ob')

    # draw the classify line
    min_x = min(train_x[:, 1])[0, 0]
    max_x = max(train_x[:, 1])[0, 0]
    weights = weights.getA()  # convert mat to array
    y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
    y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
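# The green line drawn above is the decision boundary: the set of points where
# sigmoid(w0 + w1*x1 + w2*x2) = 0.5, equivalently w0 + w1*x1 + w2*x2 = 0.
# Solving for x2 gives x2 = -(w0 + w1*x1) / w2, which is exactly what
# y_min_x and y_max_x evaluate at the two ends of the x1 range.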


def loadData():
    train_x = []
    train_y = []
    fileIn = open('E:/ceshi/luoji001.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split()
        # prepend the constant 1.0 as the bias term x0
        train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
        train_y.append(float(lineArr[2]))
    return mat(train_x), mat(train_y).transpose()
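# loadData expects each line of E:/ceshi/luoji001.txt to hold two feature
# values followed by a 0/1 label, separated by whitespace. A hypothetical
# example of the layout (values are illustrative, not the real data):
#   23.5  0.81  1
#   41.2  0.35  0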

## step 1: load data
print("step 1: load data...")
train_x, train_y = loadData()
test_x = train_x; test_y = train_y

## step 2: training...
print("step 2: training...")
opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
optimalWeights = trainLogRegres(train_x, train_y, opts)

## step 3: testing
print("step 3: testing...")
accuracy = testLogRegres(optimalWeights, test_x, test_y)

## step 4: show the result
print("step 4: show the result...")
print('The classify accuracy is: %.3f%%' % (accuracy * 100))
showLogRegres(optimalWeights, train_x, train_y)

Output:

step 1: load data...
step 2: training...
Congratulations, training complete!
step 3: testing...
step 4: show the result...
The classify accuracy is: 86.262%
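One caveat about that number: the script evaluates on the very data it was trained on (test_x = train_x), so 86.262% is an optimistic estimate. Below is a minimal sketch of a holdout evaluation, reusing loadData, trainLogRegres, and testLogRegres from above; the 70/30 split ratio and the index shuffling are my own assumptions, not part of the reference code:

import numpy as np

train_x, train_y = loadData()
numSamples = train_x.shape[0]

# shuffle row indices and hold out 30% of the samples for testing
indices = np.random.permutation(numSamples)
cut = int(numSamples * 0.7)
tr_idx, te_idx = indices[:cut], indices[cut:]

opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
w = trainLogRegres(train_x[tr_idx, :], train_y[tr_idx, :], opts)
print('holdout accuracy: %.3f%%' % (testLogRegres(w, train_x[te_idx, :], train_y[te_idx, :]) * 100))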

