I can't seem to get Q2 right. I'm using the support vector classifier from the sklearn package (svm.SVC) in Python. I've set the parameters to the right values, but Ein (1 - recall in the output) is way too high for most classes. I don't think pandas is the cause, but I converted the class labels to int anyway, since pandas was reading them as float64 by default.
Code:
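import pandas as pd
from sklearn import svm, metrics

# filepath is the path to the training data file (defined elsewhere)
train_df = pd.read_csv(
    filepath,
    sep=r"\s+",  # columns are separated by one or more whitespace characters
    engine="python",
    header=None
)
train_df.columns = ["Digit", "Intensity", "Symmetry"]
train_df["Digit"] = train_df["Digit"].astype(int)  # labels are read as float64

clf = svm.SVC(
    C=0.01,
    kernel='poly',
    degree=2,  # integer degree; recent sklearn versions may reject a float here
    gamma=1.0,
    coef0=1.0
)

# .ix has been removed from pandas; .iloc selects by position
X = train_df.iloc[:, [1, 2]].values  # features: intensity and symmetry
y = train_df.iloc[:, 0].values       # labels: digit
clf.fit(X, y)

expected = y
predicted = clf.predict(X)
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))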
No 5s or 8s are predicted correctly, very few 4s and 6s, and only a few 3s and 7s. This is very strange.
Can someone show me what I'm doing wrong?
Output:
Code:
Classification report for classifier SVC(C=0.01, cache_size=200, class_weight=None, coef0=1.0,
  decision_function_shape=None, degree=2.0, gamma=1.0, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False):
             precision    recall  f1-score   support

          0       0.54      0.83      0.65      1194
          1       0.93      0.96      0.95      1005
          2       0.22      0.55      0.31       731
          3       0.27      0.12      0.17       658
          4       0.12      0.02      0.04       652
          5       0.00      0.00      0.00       556
          6       0.09      0.00      0.00       664
          7       0.26      0.16      0.19       645
          8       0.00      0.00      0.00       542
          9       0.21      0.56      0.30       644

avg / total       0.32      0.40      0.33      7291
Confusion matrix:
[[987 41 65 45 2 0 0 5 0 49]
[ 35 969 0 0 0 0 0 0 0 1]
[ 65 3 404 48 16 0 0 48 0 147]
[204 1 165 79 17 0 0 22 0 170]
[ 79 9 163 11 16 0 1 81 0 292]
[ 14 1 361 21 17 0 3 51 0 88]
[ 38 0 282 35 22 0 1 53 0 233]
[ 22 1 178 5 29 0 2 100 0 308]
[298 16 89 26 1 0 2 8 0 102]
[ 83 0 133 24 17 0 2 23 0 362]]
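To make the "1 - recall" reading of the report concrete, here is a minimal sketch in plain Python that recomputes per-class recall from the confusion matrix above. The helper name per_class_recall is made up for illustration, and the sketch assumes rows are true digits and columns are predicted digits (the sklearn convention). It also prints the overall fraction of misclassified training points, which is a different number from any single class's 1 - recall.

```python
# Confusion matrix copied from the output above.
# Assumption: rows = true digit, columns = predicted digit (sklearn convention).
confusion = [
    [987,  41,  65,  45,   2,   0,   0,   5,   0,  49],
    [ 35, 969,   0,   0,   0,   0,   0,   0,   0,   1],
    [ 65,   3, 404,  48,  16,   0,   0,  48,   0, 147],
    [204,   1, 165,  79,  17,   0,   0,  22,   0, 170],
    [ 79,   9, 163,  11,  16,   0,   1,  81,   0, 292],
    [ 14,   1, 361,  21,  17,   0,   3,  51,   0,  88],
    [ 38,   0, 282,  35,  22,   0,   1,  53,   0, 233],
    [ 22,   1, 178,   5,  29,   0,   2, 100,   0, 308],
    [298,  16,  89,  26,   1,   0,   2,   8,   0, 102],
    [ 83,   0, 133,  24,  17,   0,   2,  23,   0, 362],
]


def per_class_recall(matrix):
    """Hypothetical helper: recall of class d = matrix[d][d] / sum of row d."""
    recalls = []
    for digit, row in enumerate(matrix):
        support = sum(row)  # number of training points whose true label is `digit`
        recalls.append(row[digit] / support if support else 0.0)
    return recalls


recalls = per_class_recall(confusion)
miss_rates = [1.0 - r for r in recalls]  # the "1 - recall" quantity, per class

# Overall fraction of training points that were misclassified.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[d][d] for d in range(len(confusion)))
overall_error = 1.0 - correct / total

for digit, (r, m) in enumerate(zip(recalls, miss_rates)):
    print(f"digit {digit}: recall = {r:.2f}, 1 - recall = {m:.2f}")
print(f"overall fraction misclassified = {overall_error:.4f}")
```

The recalls computed this way should agree with the recall column of the report, which is a quick check that the matrix and the report describe the same run.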