University Of Michigan: Learning Text Classifiers in Python

Typically, for text classification models, you're going to focus on linear classifiers.

So it's a linear kernel, meaning you set the kernel parameter to linear. You can also specify the C parameter: C is the parameter for the soft margin (a soft margin helps avoid overfitting). The default value for kernel is RBF, a radial basis function kernel, and the default value for C is one, where you are neither too hard nor too soft on the margin. Once you have defined this SVM classifier, you can fit, or train, it the same way as you trained the naive Bayes one. And then you'll be able to predict the same way as last time: you call predict on the test data.
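As a sketch of how this might look with scikit-learn's SVC (the toy feature vectors and labels below are made up for illustration; in the lecture they would come from text features):

```python
from sklearn.svm import SVC

# Hypothetical toy features and binary labels, standing in for text features.
X_train = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y_train = [0, 1, 1, 0]

# kernel defaults to 'rbf' and C defaults to 1.0; here we ask for a
# linear kernel explicitly, as suggested for text classification.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)          # train, just like the naive Bayes classifier

pred = clf.predict([[0.9, 0.1]])   # predict on test data the same way
```

With these toy points the first feature separates the classes, so the linear SVM predicts class 1 for `[0.9, 0.1]`.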

So, you need to use some of the labeled data to see how well these models are doing, especially if you're comparing between models or tuning a model. For example, SVM has the C parameter, and you need to know what a good value of C is. So, how would you do it? That problem is called the model selection problem, and while training you need some way to evaluate candidate settings. There are two ways you could do model selection: one is keeping some part of the labeled data set separate as hold-out data, and the other option is cross-validation (to tune a parameter like C and settle on a model, there are generally two approaches: 1. the hold-out method, 2. cross-validation).

First we'll see how you could use the train/test split. So I'm going to call model_selection.train_test_split, give it the training data and the training labels, and then specify how much should be your test size.

So for example, suppose you have these data points. In this case, I have 15 of them, and I say I want to do a two-thirds/one-third split. So my test size is one third, or 0.333.
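A minimal sketch of that split with scikit-learn's train_test_split (the 15 data points and their labels here are hypothetical placeholders):

```python
from sklearn.model_selection import train_test_split

# 15 hypothetical labeled data points, as in the lecture's example.
X = [[i] for i in range(15)]
y = [0, 1] * 7 + [0]               # made-up binary labels

# Hold out one third as test data; shuffling is on by default, and
# random_state just makes the split reproducible for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.333, random_state=0)

print(len(X_train), len(X_test))   # 10 points in train, 5 in test
```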

That would mean 10 of them would be the train set and five of them would be the test set. Now, you should shuffle the labeled data so that the positive and negative classes are distributed uniformly at random. Then you could say you want to keep, let's say, 66 percent in train and 33 percent in test, or go 80/20 if you want to: four out of five parts go into the train set, and the fifth part becomes the test set. When you do it this way, you are losing a significant portion of your training data to the test set. Remember that you cannot see the test data when you're training the model; the test data is used exclusively to tune the parameters. So your training data effectively reduces to 66 percent in this case. The other way to do model selection would be cross-validation. Five-fold cross-validation would look something like this, where you split the data into five parts (in short, this passage is about the hold-out method and cross-validation).
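Five-fold cross-validation might be sketched like this with scikit-learn's cross_val_score (the 20 one-feature data points are invented for illustration):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 20 hypothetical one-feature data points, linearly separable by construction.
X = [[i / 19.0] for i in range(20)]
y = [0] * 10 + [1] * 10

clf = SVC(kernel='linear', C=1.0)

# Five-fold cross-validation: the labeled data is split into five parts;
# each part serves once as the validation fold while the model is trained
# on the other four, giving five accuracy scores instead of just one.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Unlike a single hold-out split, every labeled point gets used for both training and validation across the five folds, which is why cross-validation wastes less of the labeled data.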

