[Python Data Mining Course] 27. Wine Data Analysis with an SVM Classifier

This article is part of the Python Data Mining Course series. Many earlier articles covered classification and clustering algorithms; this one explains the SVM classification algorithm, along with how to read data from a TXT file and carry out data analysis and evaluation. It is fairly basic; I hope it helps you and offers some ideas, as it is also part of my own teaching material. I also recommend the author's new book, 《Python网络数据爬取及分析从入门到精通(分析篇)》. If the article contains errors or shortcomings, please bear with me.

Contents: I. Basic concepts of SVM; II. Basic usage of SVM; III. Preprocessing the TXT wine dataset; IV. Analyzing the wine data with SVM; V. Code optimization

Over the past five years I have written 314 blog posts and 12 columns. I genuinely love sharing, love the CSDN platform, and want to help more people; the columns cover Python, data mining, web crawling, image processing, C#, Android, and more. Having now taught for two years, I feel all the more obliged to teach every student well and to help students in Guizhou write some real code and learn some real skills. As the saying goes, "a teacher is one who transmits the Way, imparts knowledge, and resolves doubts." Happy New Year in advance; in 2019 let us move forward hand in hand.

Previous articles for reference:

I. Basic Concepts of SVM

A support vector machine (SVM) is a common discriminative method. In machine learning it is a supervised learning model, typically used for pattern recognition, classification, and regression analysis. The algorithm's key characteristic is that, following the structural risk minimization principle, it constructs an optimal separating hyperplane that maximizes the classification margin, improving the learner's ability to generalize, and it handles problems such as nonlinearity, high dimensionality, and local minima well.

Since the author's mathematical derivation skills are limited and the theory behind SVM is fairly involved, for the fundamentals I recommend reading 《支持向量机通俗导论(理解SVM的三层境界)》 by the well-known CSDN algorithm blogger JULY, which explains the SVM algorithm from the ground up. This section focuses instead on how to use SVM.

The core idea of the SVM classification algorithm is to use a kernel function to look for a hyperplane in a high-dimensional space that satisfies the classification requirements, such that the training points lie as far as possible from the separating surface; in other words, to find a separating surface that maximizes the empty region on both sides. The training samples of the two classes that lie closest to the separating surface, on hyperplanes parallel to the optimal one, are called the support vectors.
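To make this concrete, here is a minimal sketch (hypothetical toy data, assuming scikit-learn is installed) that fits a linear-kernel SVC and inspects which training points were chosen as support vectors:

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable point clouds (hypothetical data)
X = np.array([[-2, -1], [-1, -2], [-1, -1], [1, 2], [2, 1], [2, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.support_vectors_)   # coordinates of the support vectors
print(clf.support_)           # their row indices in X

Only the points on the margin appear in support_vectors_; the interior points do not affect the fitted hyperplane.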

II. Basic Usage of SVM

1. The SVC prototype. In the scikit-learn (Sklearn) machine learning package, the SVM classification algorithm is implemented by the class svm.SVC, i.e. C-Support Vector Classification, which is built on libsvm. Its constructor is as follows:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

The parameter C is the penalty coefficient of the objective function, balancing the classification margin against misclassified samples; its default is 1.0. cache_size specifies the memory available for training (in MB). gamma is the kernel coefficient; the default 'auto' means 1/n_features. kernel can be 'rbf', 'linear', 'poly', or 'sigmoid'; 'rbf' is the default. degree sets the highest power of the polynomial kernel. max_iter is the maximum number of iterations; its default of -1 means no limit. coef0 is the independent term in the kernel function. class_weight assigns a weight to each class, so that different classes can receive different penalty parameters C; by default every class has weight one. decision_function_shape can be 'ovo' (one-vs-one), 'ovr' (one-vs-rest), or None (the default in this version).
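As an illustration of these parameters, a short sketch of constructing SVC with explicit arguments (the values here are illustrative, not tuned recommendations):

from sklearn.svm import SVC

clf = SVC(C=10.0,                          # larger C penalizes misclassified samples more heavily
          kernel='rbf',                    # radial basis function kernel
          gamma=0.1,                       # kernel coefficient
          class_weight='balanced',         # reweight classes inversely to their frequency
          decision_function_shape='ovr')   # one-vs-rest strategy for multi-class problems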

2. Algorithm steps. Using SVC involves two main steps:

Training: clf.fit(data, target).

Prediction: pre = clf.predict(data).

The following code is a simple example of calling the SVC classifier to make predictions. In the dataset, points whose x and y coordinates are negative get label 1, and points whose coordinates are positive get label 2; the model then predicts label 1 for the point [-0.8, -1] and label 2 for the point [2, 1].

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -2], [1, 3], [4, 6]])
y = np.array([1, 1, 2, 2])
clf = SVC()
clf.fit(X, y)                                 # train the classifier
print(clf)                                    # show the fitted estimator's parameters
print(clf.predict([[-0.8, -1], [2, 1]]))      # expected output: [1 2]

III. Preprocessing the TXT Wine Dataset

1. Dataset description. The experimental dataset is the wine dataset from the Most Popular Data Sets (hits since 2007) of the open UCI Machine Learning Repository. It contains the results of extensive analysis of three different wine cultivars grown in the same region of Italy: three classes of wine, 13 constituent features per wine, and 178 rows of data in total.

The dataset records the quantities of 13 constituents found in each of the three types of wine. The 13 constituents are: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. Each constituent can be treated as one feature, corresponding to one value per sample. The three types of wine are labeled "1", "2", and "3".

The data are stored in the file wine.txt. Each row is one sample; there are 178 rows with 14 columns each, where the first column is the class label and the remaining 13 columns are the features. Class 1 has 59 samples, class 2 has 71 samples, and class 3 has 48 samples.

2. The original dataset. The original description of the dataset reads as follows:

Data Set Information:

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

I think that the initial data set had around 30 variables, but for some reason I only have the 13-dimensional version. I had a list of what the 30 or so variables were, but a) I lost it, and b) I would not know which 13 variables are included in the set.

The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

In a classification context, this is a well-posed problem with well-behaved class structures. A good data set for first testing of a new classifier, but not very challenging.

Attribute Information:

All attributes are continuous. No statistics available, but it is suggested to standardise variables for certain uses (e.g. for use with classifiers which are NOT scale invariant). NOTE: the 1st attribute is the class identifier (1-3).
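Since SVM is exactly such a scale-sensitive classifier, the standardization suggested above matters in practice. A minimal sketch, assuming scikit-learn's StandardScaler (the sample rows are taken from the dataset below):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[14.23, 1.71], [13.2, 1.78], [12.37, .94]])  # a few raw feature rows
scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # each column now has zero mean and unit variance
print(X_std)

When training a real model, fit the scaler on the training set only and reuse it to transform the test set, to avoid leaking test-set statistics.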

The complete dataset follows (readers can copy it into a txt file):

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.,1.04,3.92,1065

1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050

1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185

1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480

1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735

1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450

1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290

1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295

1,14.83,1.,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045

1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045

1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510

1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280

1,13.75,1.73,2.41,16,,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320

1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150

1,14.38,1.87,2.38,12,102,3.3,3.,.29,2.96,7.5,1.2,3,1547

1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310

1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280

1,13.83,1.57,2.62,20,115,2.95,3.4,.4,1.72,6.6,1.13,2.57,1130

1,14.19,1.59,2.48,16.5,108,3.3,3.93,.32,1.86,8.7,1.23,2.82,1680

1,13.,3.1,2.56,15.2,116,2.7,3.03,.17,1.66,5.1,.96,3.36,845

1,14.06,1.63,2.28,16,126,3,3.17,.24,2.1,5.65,1.09,3.71,780
1,14.02,1.68,2.21,16,96,2.65,2.33,.26,1.98,4.7,1.04,3.59,1035
1,13.73,1.5,2.7,22.5,101,3,3.25,.29,2.38,5.7,1.19,2.71,1285

1,13.58,1.66,2.36,19.1,106,2.86,3.19,.22,1.95,6.9,1.09,2.88,1515
1,13.68,1.83,2.36,17.2,104,2.42,2.69,.42,1.97,3.84,1.23,2.87,990
1,13.76,1.53,2.7,19.5,132,2.95,2.74,.5,1.35,5.4,1.25,3,1235

1,13.51,1.8,2.65,19,110,2.35,2.53,.29,1.54,4.2,1.1,2.87,1095

1,13.48,1.81,2.41,20.5,100,2.7,2.98,.26,1.86,5.1,1.04,3.47,920
1,13.28,1.,2.84,15.5,110,2.6,2.68,.34,1.36,4.6,1.09,2.78,880
1,13.05,1.65,2.55,18,98,2.45,2.43,.29,1.44,4.25,1.12,2.51,1105
1,13.07,1.5,2.1,15.5,98,2.4,2.,.28,1.37,3.7,1.18,2.69,1020

1,14.22,3.99,2.51,13.2,128,3,3.04,.2,2.08,5.1,.,3.53,760

1,13.56,1.71,2.31,16.2,117,3.15,3.29,.34,2.34,6.13,.95,3.38,795
1,13.41,3.84,2.12,18.8,90,2.45,2.68,.27,1.48,4.28,.91,3,1035

1,13.88,1.,2.59,15,101,3.25,3.56,.17,1.7,5.43,.88,3.56,1095
1,13.24,3.98,2.29,17.5,103,2.,2.63,.32,1.66,4.36,.82,3,680

1,13.05,1.77,2.1,17,107,3,3,.28,2.03,5.04,.88,3.35,885

1,14.21,4.04,2.44,18.9,111,2.85,2.65,.3,1.25,5.24,.87,3.33,1080
1,14.38,3.59,2.28,16,102,3.25,3.17,.27,2.19,4.9,1.04,3.44,1065
1,13.9,1.68,2.12,16,101,3.1,3.39,.21,2.14,6.1,.91,3.33,985

1,14.1,2.02,2.4,18.8,103,2.75,2.92,.32,2.38,6.2,1.07,2.75,1060
1,13.94,1.73,2.27,17.4,108,2.88,3.54,.32,2.08,8.90,1.12,3.1,1260
1,13.05,1.73,2.04,12.4,92,2.72,3.27,.17,2.91,7.2,1.12,2.91,1150
1,13.83,1.65,2.6,17.2,94,2.45,2.99,.22,2.29,5.6,1.24,3.37,1265
1,13.82,1.75,2.42,14,111,3.88,3.74,.32,1.87,7.05,1.01,3.26,1190
1,13.77,1.9,2.68,17.1,115,3,2.79,.39,1.68,6.3,1.13,2.93,1375

1,13.74,1.67,2.25,16.4,118,2.6,2.9,.21,1.62,5.85,.92,3.2,1060

1,13.56,1.73,2.46,20.5,116,2.96,2.78,.2,2.45,6.25,.98,3.03,1120
1,14.22,1.7,2.3,16.3,118,3.2,3,.26,2.03,6.38,.94,3.31,970

1,13.29,1.97,2.68,16.8,102,3,3.23,.31,1.66,6,1.07,2.84,1270

1,13.72,1.43,2.5,16.7,108,3.4,3.67,.19,2.04,6.8,.,2.87,1285

2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520

2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680

2,12.,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450
2,13.67,1.25,1.92,18,94,2.1,1.79,.32,.73,3.8,1.23,2.46,630

2,12.37,1.13,2.16,19,87,3.5,3.1,.19,1.87,4.45,1.22,2.87,420

2,12.17,1.45,2.53,19,104,1.,1.75,.45,1.03,2.95,1.45,2.23,355
2,12.37,1.21,2.56,18.1,98,2.42,2.65,.37,2.08,4.6,1.19,2.3,678

2,13.11,1.01,1.7,15,78,2.98,3.18,.26,2.28,5.3,1.12,3.18,502

2,12.37,1.17,1.92,19.6,78,2.11,2,.27,1.04,4.68,1.12,3.48,510

2,13.34,.94,2.36,17,110,2.53,1.3,.55,.42,3.17,1.02,1.93,750

2,12.21,1.19,1.75,16.8,151,1.85,1.28,.14,2.5,2.85,1.28,3.07,718
2,12.29,1.61,2.21,20.4,103,1.1,1.02,.37,1.46,3.05,.906,1.82,870
2,13.86,1.51,2.67,25,86,2.95,2.86,.21,1.87,3.38,1.36,3.16,410
2,13.49,1.66,2.24,24,87,1.88,1.84,.27,1.03,3.74,.98,2.78,472

2,12.99,1.67,2.6,30,139,3.3,2.,.21,1.96,3.35,1.31,3.5,985

2,11.96,1.09,2.3,21,101,3.38,2.14,.13,1.65,3.21,.99,3.13,886

2,11.66,1.88,1.92,16,97,1.61,1.57,.34,1.15,3.8,1.23,2.14,428

2,13.03,.9,1.71,16,86,1.95,2.03,.24,1.46,4.6,1.19,2.48,392

2,11.84,2.,2.23,18,112,1.72,1.32,.43,.95,2.65,.96,2.52,500

2,12.33,.99,1.95,14.8,136,1.9,1.85,.35,2.76,3.4,1.06,2.31,750

2,12.7,3.87,2.4,23,101,2.83,2.55,.43,1.95,2.57,1.19,3.13,463

2,12,.92,2,19,86,2.42,2.26,.3,1.43,2.5,1.38,3.12,278

2,12.72,1.81,2.2,18.8,86,2.2,2.53,.26,1.77,3.9,1.16,3.14,714

2,12.08,1.13,2.51,24,78,2,1.58,.4,1.4,2.2,1.31,2.72,630

2,13.05,3.86,2.32,22.5,85,1.65,1.59,.61,1.62,4.8,.84,2.01,515

2,11.84,.,2.58,18,94,2.2,2.21,.22,2.35,3.05,.79,3.08,520

2,12.67,.98,2.24,18,99,2.2,1.94,.3,1.46,2.62,1.23,3.16,450

2,12.16,1.61,2.31,22.8,90,1.78,1.69,.43,1.56,2.45,1.33,2.26,495
2,11.65,1.67,2.62,26,88,1.92,1.61,.4,1.34,2.6,1.36,3.21,562

2,11.,2.06,2.46,21.6,84,1.95,1.69,.48,1.35,2.8,1,2.75,680

2,12.08,1.33,2.3,23.6,70,2.2,1.59,.42,1.38,1.74,1.07,3.21,625

2,12.08,1.83,2.32,18.5,81,1.6,1.5,.52,1.,2.4,1.08,2.27,480

2,12,1.51,2.42,22,86,1.45,1.25,.5,1.63,3.6,1.05,2.65,450

2,12.69,1.53,2.26,20.7,80,1.38,1.46,.58,1.62,3.05,.96,2.06,495
2,12.29,2.83,2.22,18,88,2.45,2.25,.25,1.99,2.15,1.15,3.3,290

2,11.62,1.99,2.28,18,98,3.02,2.26,.17,1.35,3.25,1.16,2.96,345
2,12.47,1.52,2.2,19,162,2.5,2.27,.32,3.28,2.6,1.16,2.63,937

2,11.81,2.12,2.74,21.5,134,1.6,.99,.14,1.56,2.5,.95,2.26,625

2,12.29,1.41,1.98,16,85,2.55,2.5,.29,1.77,2.9,1.23,2.74,428

2,12.37,1.07,2.1,18.5,88,3.52,3.75,.24,1.95,4.5,1.04,2.77,660

2,12.29,3.17,2.21,18,88,2.85,2.99,.45,2.81,2.3,1.42,2.83,406

2,12.08,2.08,1.7,17.5,97,2.23,2.17,.26,1.4,3.3,1.27,2.96,710

2,12.6,1.34,1.9,18.5,88,1.45,1.36,.29,1.35,2.45,1.04,2.77,562

2,12.34,2.45,2.46,21,98,2.56,2.11,.34,1.31,2.8,.8,3.38,438

2,11.82,1.72,1.88,19.5,86,2.5,1.,.37,1.42,2.06,.94,2.44,415

2,12.51,1.73,1.98,20.5,85,2.2,1.92,.32,1.48,2.94,1.04,3.57,672
2,11.41,.74,2.5,21,88,2.48,2.01,.42,1.44,3.08,1.1,2.31,434

2,12.08,1.39,2.5,22.5,84,2.56,2.29,.43,1.04,2.9,.93,3.19,385

2,11.03,1.51,2.2,21.5,85,2.46,2.17,.52,2.01,1.9,1.71,2.87,407

2,11.82,1.47,1.99,20.8,86,1.98,1.6,.3,1.53,1.95,.95,3.33,495

2,12.42,1.61,2.19,22.5,108,2,2.09,.34,1.61,2.06,1.06,2.96,345

2,12.77,3.43,1.98,16,80,1.63,1.25,.43,.83,3.4,.7,2.12,372

2,12,3.43,2,19,87,2,1.,.37,1.87,1.28,.93,3.05,5

2,11.45,2.4,2.42,20,96,2.9,2.79,.32,1.83,3.25,.8,3.39,625

2,11.56,2.05,3.23,28.5,119,3.18,5.08,.47,1.87,6,.93,3.69,465

2,12.42,4.43,2.73,26.5,102,2.2,2.13,.43,1.71,2.08,.92,3.12,365

2,13.05,5.8,2.13,21.5,86,2.62,2.65,.3,2.01,2.6,.73,3.1,380

2,11.87,4.31,2.39,21,82,2.86,3.03,.21,2.91,2.8,.75,3.,380

2,12.07,2.16,2.17,21,85,2.6,2.65,.37,1.35,2.76,.86,3.28,378

2,12.43,1.53,2.29,21.5,86,2.74,3.15,.39,1.77,3.94,.69,2.84,352

2,11.79,2.13,2.78,28.5,92,2.13,2.24,.58,1.76,3,.97,2.44,466

2,12.37,1.63,2.3,24.5,88,2.22,2.45,.4,1.9,2.12,.,2.78,342

2,12.04,4.3,2.38,22,80,2.1,1.75,.42,1.35,2.6,.79,2.57,580

3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630

3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530

3,12.81,2.31,2.4,24,98,1.15,1.09,.27,.83,5.7,.66,1.36,560

3,12.7,3.55,2.36,21.5,106,1.7,1.2,.17,.84,5,.78,1.29,600

3,12.51,1.24,2.25,17.5,85,2,.58,.6,1.25,5.45,.75,1.51,650

3,12.6,2.46,2.2,18.5,94,1.62,.66,.63,.94,7.1,.73,1.58,695

3,12.25,4.72,2.54,21,,1.38,.47,.53,.8,3.85,.75,1.27,720

3,12.53,5.51,2.,25,96,1.79,.6,.63,1.1,5,.82,1.69,515

3,13.49,3.59,2.19,19.5,88,1.62,.48,.58,.88,5.7,.81,1.82,580

3,12.84,2.96,2.61,24,101,2.32,.6,.53,.81,4.92,.,2.15,590

3,12.93,2.81,2.7,21,96,1.54,.5,.53,.75,4.6,.77,2.31,600

3,13.36,2.56,2.35,20,,1.4,.5,.37,.,5.6,.7,2.47,780

3,13.52,3.17,2.72,23.5,97,1.55,.52,.5,.55,4.35,.,2.06,520

3,13.62,4.95,2.35,20,92,2,.8,.47,1.02,4.4,.91,2.05,550

3,12.25,3.88,2.2,18.5,112,1.38,.78,.29,1.14,8.21,.65,2,855

3,13.16,3.57,2.15,21,102,1.5,.55,.43,1.3,4,.6,1.68,830

3,13.88,5.04,2.23,20,80,.98,.34,.4,.68,4.9,.58,1.33,415

3,12.87,4.61,2.48,21.5,86,1.7,.65,.47,.86,7.65,.54,1.86,625

3,13.32,3.24,2.38,21.5,92,1.93,.76,.45,1.25,8.42,.55,1.62,650

3,13.08,3.9,2.36,21.5,113,1.41,1.39,.34,1.14,9.40,.57,1.33,550

3,13.5,3.12,2.62,24,123,1.4,1.57,.22,1.25,8.60,.59,1.3,500

3,12.79,2.67,2.48,22,112,1.48,1.36,.24,1.26,10.8,.48,1.47,480

3,13.11,1.9,2.75,25.5,116,2.2,1.28,.26,1.56,7.1,.61,1.33,425

3,13.23,3.3,2.28,18.5,98,1.8,.83,.61,1.87,10.52,.56,1.51,675

3,12.58,1.29,2.1,20,103,1.48,.58,.53,1.4,7.6,.58,1.55,0

3,13.17,5.19,2.32,22,93,1.74,.63,.61,1.55,7.9,.6,1.48,725

3,13.84,4.12,2.38,19.5,,1.8,.83,.48,1.56,9.01,.57,1.,480

3,12.45,3.03,2.,27,97,1.9,.58,.63,1.14,7.5,.67,1.73,880

3,14.34,1.68,2.7,25,98,2.8,1.31,.53,2.7,13,.57,1.96,660

3,13.48,1.67,2.,22.5,,2.6,1.1,.52,2.29,11.75,.57,1.78,620

3,12.36,3.83,2.38,21,88,2.3,.92,.5,1.04,7.65,.56,1.58,520

3,13.69,3.26,2.54,20,107,1.83,.56,.5,.8,5.88,.96,1.82,680

3,12.85,3.27,2.58,22,106,1.65,.6,.6,.96,5.58,.87,2.11,570

3,12.96,3.45,2.35,18.5,106,1.39,.7,.4,.94,5.28,.68,1.75,675

3,13.78,2.76,2.3,22,90,1.35,.68,.41,1.03,9.58,.7,1.68,615

3,13.73,4.36,2.26,22.5,88,1.28,.47,.52,1.15,6.62,.78,1.75,520

3,13.45,3.7,2.6,23,111,1.7,.92,.43,1.46,10.68,.85,1.56,695

3,12.82,3.37,2.3,19.5,88,1.48,.66,.4,.97,10.26,.72,1.75,685

3,13.58,2.58,2.69,24.5,105,1.55,.84,.39,1.54,8.66,.74,1.8,750

3,13.4,4.6,2.86,25,112,1.98,.96,.27,1.11,8.5,.67,1.92,630

3,12.2,3.03,2.32,19,96,1.25,.49,.4,.73,5.5,.66,1.83,510

3,12.77,2.39,2.28,19.5,86,1.39,.51,.48,.,9.9999,.57,1.63,470

3,14.16,2.51,2.48,20,91,1.68,.7,.44,1.24,9.7,.62,1.71,660

3,13.71,5.65,2.45,20.5,95,1.68,.61,.52,1.06,7.7,.,1.74,740

3,13.4,3.91,2.48,23,102,1.8,.75,.43,1.41,7.3,.7,1.56,750

3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835

3,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840

3,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560

3. Reading the dataset. The whole dataset is comma separated. A common way to read this kind of dataset is to call open(), read through all of the TXT file's contents, split each line on the comma separator, and store the 14 columns of each row in an array or matrix for analysis. Here we use a different approach instead: calling loadtxt() to read the comma-separated data, as follows:

# -*- coding: utf-8 -*-
import os
import numpy as np

path = u"wine.txt"
data = np.loadtxt(path, dtype=float, delimiter=",")

Here path is the file path, dtype is the data type, delimiter is the separator, converters maps data columns to conversion functions (e.g. {1: fun}), and usecols selects which columns to read.
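To illustrate the converters and usecols parameters mentioned above, a hedged sketch (the chosen columns and the conversion function are for illustration only; depending on the numpy version, the converter may receive str or bytes):

import numpy as np

# Read only the class label (column 0) and the first two features (columns 1 and 2),
# passing column 0 through a conversion function, in the {column: fun} style.
data = np.loadtxt("wine.txt",
                  dtype=float,
                  delimiter=",",
                  converters={0: lambda s: float(s)},  # per-column transform
                  usecols=(0, 1, 2))
print(data.shape)   # (178, 3)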

3. Splitting the dataset into training and test sets. Since the first 59 samples of the Wine dataset all belong to class 1, the middle 71 to class 2, and the last 48 to class 3, the dataset must be split into a training set and a test set. The steps are as follows (an astype-based alternative to step (2) is sketched after this list):

(1) Call np.split() to separate the first column of class labels (the Y data) from the 13 feature columns (the X array). Its arguments are the data, the split position (here (1,), i.e. split after the first column), and axis, where axis=1 splits horizontally (by columns) and axis=0 splits vertically (by rows).

(2) Since the first column stores the class labels as the floats 1.0, 2.0, or 3.0, they need to be converted to integers. Here a for loop calls int() and stores the results in the list y; the ndarray astype() method would work as well.

(3) Finally, call np.concatenate() to combine rows 0-40, 60-100, and 140-160 into the training set (13 feature columns plus class labels); the remaining 78 rows form the test set.
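For reference, the astype() alternative mentioned in step (2) looks roughly like this:

import numpy as np

yy = np.array([[1.0], [2.0], [3.0]])   # shaped like the label column returned by np.split
y = yy.astype(int).ravel()             # -> array([1, 2, 3]), a flat integer array
print(y)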

# -*- coding: utf-8 -*-
import os
import numpy as np

path = u"wine/wine.txt"
data = np.loadtxt(path, dtype=float, delimiter=",")
print(data)

yy, x = np.split(data, (1,), axis=1)   # labels in yy, 13 feature columns in x
print(yy.shape, x.shape)

y = []
for n in yy:                           # convert the float labels to int
    y.append(int(n))

train_data = np.concatenate((x[0:40,:], x[60:100,:], x[140:160,:]), axis = 0)  # training set
train_target = np.concatenate((y[0:40], y[60:100], y[140:160]), axis = 0)      # training labels
test_data = np.concatenate((x[40:60, :], x[100:140, :], x[160:,:]), axis = 0)  # test set
test_target = np.concatenate((y[40:60], y[100:140], y[160:]), axis = 0)        # test labels
print(train_data.shape, train_target.shape)
print(test_data.shape, test_target.shape)

The output is as follows:

(178, 1) (178, 13)
(100, 13) (100,)
(78, 13) (78,)

Here is an additional, random way to split: calling sklearn.cross_validation's train_test_split to randomly divide the training and test sets. The code is as follows:

from sklearn.cross_validation import train_test_split

x, y = np.split(data, (1,), axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.7)

The parameter x is the feature set to split; y contains the corresponding labels; train_size is the training proportion, where 0.7 splits the data into 70% training and 30% test; random_state is the random seed. In some versions of the sklearn library this function is imported from model_selection instead; readers are encouraged to try it.
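For example, under newer scikit-learn versions (0.20 and later, where cross_validation was removed), the same split reads, reusing the x and y from above:

from sklearn.model_selection import train_test_split   # replaces sklearn.cross_validation

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.7)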

IV. Analyzing the Wine Data with SVM

1. Analysis workflow. We now use the SVM classification algorithm to analyze the Wine dataset. The analysis consists of the following six steps:

Load the dataset. Use loadtxt() to load the wine dataset, split on commas (,).

Split the dataset. Divide the Wine dataset into training and test sets, extracting only two of the 13 wine features for this analysis.

SVM training. Import svm.SVC() from the Sklearn machine learning package, call fit() to train the model, and predict(test_data) to predict the classification results.

Evaluate the algorithm. Use classification_report() to compute the precision, recall, and F-score of the predictions.

Create a grid. Take the minimum and maximum of the two feature columns and build the corresponding matrix grid, used to draw the background, via numpy's meshgrid().

Plot the results. Assign colors to the class labels, call pcolormesh() to paint the background regions, and call scatter() to plot the actual results as a scatter plot.

2. Complete code

# -*- coding: utf-8 -*-
import os
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Step 1: load the dataset
path = u"wine/wine.txt"
data = np.loadtxt(path, dtype=float, delimiter=",")
print(data)

# Step 2: split the dataset
yy, x = np.split(data, (1,), axis=1)   # first column is the label yy, the remaining 13 columns are x
print(yy.shape, x.shape)
y = []
for n in yy:                           # convert the float labels to int
    y.append(int(n))
x = x[:, :2]                           # keep only the first two columns of x, to match the plot's x/y axes
train_data = np.concatenate((x[0:40,:], x[60:100,:], x[140:160,:]), axis = 0)  # training set
train_target = np.concatenate((y[0:40], y[60:100], y[140:160]), axis = 0)      # training labels
test_data = np.concatenate((x[40:60, :], x[100:140, :], x[160:,:]), axis = 0)  # test set
test_target = np.concatenate((y[40:60], y[100:140], y[160:]), axis = 0)        # test labels
print(train_data.shape, train_target.shape)
print(test_data.shape, test_target.shape)

# Step 3: SVC training
clf = SVC()
clf.fit(train_data, train_target)
result = clf.predict(test_data)
print(result)

# Step 4: evaluate the algorithm
print(sum(result == test_target))                          # compare predictions with the true labels
print(metrics.classification_report(test_target, result))  # precision, recall, F-score

# Step 5: create the grid
x1_min, x1_max = test_data[:,0].min()-0.1, test_data[:,0].max()+0.1   # first column
x2_min, x2_max = test_data[:,1].min()-0.1, test_data[:,1].max()+0.1   # second column
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                     np.arange(x2_min, x2_max, 0.1))                  # grid data
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Step 6: plot the results
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])   # background color map
cmap_bold = ListedColormap(['#000000', '#00FF00', '#FFFFFF'])
plt.figure()
z = z.reshape(xx.shape)
print(xx.shape, yy.shape, z.shape, test_target.shape)
plt.pcolormesh(xx, yy, z, cmap=cmap_light)
plt.scatter(test_data[:,0], test_data[:,1], c=test_target,
            cmap=cmap_bold, s=50)
plt.show()

The code takes the first column of the 178 rows as the class label and the remaining 13 columns as the feature data, splitting them into a training set (100 rows) and a test set (78 rows). The output includes the SVM-predicted labels for the 78 test rows; 61 of them match the true labels (61/78 ≈ 0.78), giving a precision of 0.78, a recall of 0.78, and an F1-score of 0.78, followed by the visualization.

The resulting plot is shown below:

V. Code Optimization

The earlier SVM analysis of the wine dataset has two shortcomings. First, it splits the dataset with a fixed scheme: np.concatenate() assigns rows 0-40, 60-100, and 140-160 to the training set and the rest to the test set. Second, it extracts only two of the dataset's columns for SVM analysis and plotting, via "x = x[:, :2]", even though the wine dataset has 13 feature columns.

In real data analysis the dataset is usually split randomly, all features are used for training and prediction, and visualization is done after dimensionality reduction. Below we make some simple optimizations to the SVM wine analysis, namely:

Split the wine dataset randomly.

Train and predict on all of the dataset's features.

Reduce the dimensionality with the PCA algorithm before plotting.

The complete code follows. I hope readers study this part carefully as well, to better refine their own research or projects.

# -*- coding: utf-8 -*-
import os
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Step 1: load the dataset
path = u"wine/wine.txt"
data = np.loadtxt(path, dtype=float, delimiter=",")
print(data)

# Step 2: split the dataset
yy, x = np.split(data, (1,), axis=1)   # first column is the label yy, the remaining 13 columns are x
print(yy.shape, x.shape)
y = []
for n in yy:
    y.append(int(n))
y = np.array(y, dtype=int)             # convert the list to an array

# split randomly, with 40% of the data as the test set
train_data, test_data, train_target, test_target = train_test_split(x, y, test_size=0.4, random_state=42)
print(train_data.shape, train_target.shape)
print(test_data.shape, test_target.shape)

# Step 3: SVC training
clf = SVC()
clf.fit(train_data, train_target)
result = clf.predict(test_data)
print(result)
print(test_target)

# Step 4: evaluate the algorithm
print(sum(result == test_target))                          # compare predictions with the true labels
print(metrics.classification_report(test_target, result))  # precision, recall, F-score

# Step 5: dimensionality reduction
pca = PCA(n_components=2)
newData = pca.fit_transform(test_data)

# Step 6: plot the results
plt.figure()
cmap_bold = ListedColormap(['#000000', '#00FF00', '#FFFFFF'])
plt.scatter(newData[:,0], newData[:,1], c=test_target, cmap=cmap_bold, s=50)
plt.show()

The output shows that the precision, recall, and F-score are very low, at only 50%, 39%, and 23%. If the same code is run with a decision tree instead, the precision, recall, and F-score come out much higher. So no single algorithm suits every dataset: different datasets have different characteristics, and the best-performing algorithm differs accordingly. When doing data analysis we usually compare several algorithms and then refine our experiments and models.
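The decision-tree comparison mentioned above can be reproduced by swapping the classifier on the same split, roughly as follows (a sketch that assumes the train/test variables from the optimized code above):

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf = DecisionTreeClassifier()
clf.fit(train_data, train_target)
result = clf.predict(test_data)
print(metrics.classification_report(test_target, result))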
