Machine Learning: Decision Trees (DecisionTree: C4.5)

ops/2025/2/2 12:06:48/

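In "Machine Learning: Decision Trees (DecisionTree: ID3)" we noted that ID3 cannot handle attributes that are continuous-valued or that have missing values. The C4.5 algorithm removes both of these limitations.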

1. Handling continuous-valued attributes

For a dataset $D$ and a continuous-valued attribute $A$, suppose $A$ takes $M$ distinct values on $D$. The attribute can be discretized by bi-partition as follows:

  1. Sort the $M$ distinct values in ascending order and denote the result by $\{a^1, a^2, ..., a^M\}$.
  2. For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their mean $\frac{a^{i}+a^{i+1}}{2}$ as a split point $t$. The two resulting subsets are written $D_t^-$ (samples whose value of $A$ is at most $t$) and $D_t^+$ (samples whose value of $A$ is greater than $t$).
  3. This yields a set of $M-1$ candidate split points for the continuous attribute $A$:
    $$T_A=\left\{\frac{a^{i}+a^{i+1}}{2}\,\middle|\,1\le i\le M-1\right\}\tag{1}$$
  4. Evaluate these candidate split points the same way discrete attribute values are evaluated, and split the sample set at the best one:
    $$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\left(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\right) \end{aligned}\tag{2}$$
    In Eq. (2), $Gain(D, A, t)$ is the information gain of $D$ after bi-partition at split point $t$, $D_t^{\lambda}$ denotes a subset after the split, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
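The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference C4.5 implementation: the function names `entropy`, `candidate_splits` and `best_binary_split` are hypothetical, and the four-sample toy data is made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_splits(values):
    """Midpoints between adjacent distinct sorted values, as in Eq. (1)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_binary_split(values, labels):
    """Return (best_gain, best_t) over all candidate split points, as in Eq. (2)."""
    base = entropy(labels)
    n = len(labels)
    best_gain, best_t = -1.0, None
    for t in candidate_splits(values):
        left = [y for x, y in zip(values, labels) if x <= t]   # D_t^-
        right = [y for x, y in zip(values, labels) if x > t]   # D_t^+
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t


# Toy data (made up for illustration): the two classes separate cleanly at 2.5.
toy_gain, toy_t = best_binary_split([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"])
print(toy_gain, toy_t)
```

Scanning every midpoint costs one pass over the data per candidate; that is acceptable for a 17-sample example, while larger datasets would normally update class counts incrementally along the sorted order.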
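Table 1: Watermelon dataset 3.0

| No. | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |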

The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$), each with 8 attributes ($k=1,2,3,...,8$), and there are 2 classes ($c=\text{yes},\text{no}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the entropy of dataset $D$ (all logarithms are base 2) is:
$$Entropy(D)=-\left(\frac{8}{17}\log\frac{8}{17}+\frac{9}{17}\log\frac{9}{17}\right)=0.9975$$
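As a quick numeric check of the value above, here is a minimal sketch that evaluates the two-class entropy with base-2 logarithms. The helper name `binary_entropy` is an assumption made for this example.

```python
import math


def binary_entropy(pos, neg):
    """Entropy (base 2) of a two-class node with `pos` and `neg` samples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:  # the term 0 * log(0) is treated as 0
            p = count / total
            h -= p * math.log2(p)
    return h


root_entropy = binary_entropy(8, 9)  # 8 good melons, 9 bad melons
print(round(root_entropy, 4))  # 0.9975
```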

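Take the attribute "sugar content" as an example. Sorting the 17 samples by this attribute in ascending order gives:

Table 2: Watermelon dataset 3.0, sorted by "sugar content"

| No. | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |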

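The candidate bi-partition split points of the 17 samples on this attribute are the midpoints of adjacent sorted values:

| $i$ | Sorted value $a^i$ | Candidate split point $\frac{a^i+a^{i+1}}{2}$ |
|---|---|---|
| 1 | 0.042 | 0.0495 |
| 2 | 0.057 | 0.074 |
| 3 | 0.091 | 0.095 |
| 4 | 0.099 | 0.101 |
| 5 | 0.103 | 0.126 |
| 6 | 0.149 | 0.155 |
| 7 | 0.161 | 0.1795 |
| 8 | 0.198 | 0.2045 |
| 9 | 0.211 | 0.213 |
| 10 | 0.215 | 0.226 |
| 11 | 0.237 | 0.2505 |
| 12 | 0.264 | 0.2655 |
| 13 | 0.267 | 0.2925 |
| 14 | 0.318 | 0.344 |
| 15 | 0.370 | 0.373 |
| 16 | 0.376 | 0.418 |
| 17 | 0.460 | - |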
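The candidate split points can be reproduced directly from the sugar-content column of Table 1. The snippet below is a small sketch: the variable names are chosen for this example, and the values are listed in sample order 1 to 17.

```python
# Sugar content of samples 1..17, copied from Table 1.
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]

ordered = sorted(sugar)
# Midpoint of every adjacent pair; rounding only removes floating-point noise.
midpoints = [round((a + b) / 2, 4) for a, b in zip(ordered, ordered[1:])]

print(len(midpoints))  # 16
print(midpoints)
```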
  • For split point 0.0495, the two subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}. With the convention $0\log 0=0$:
    $$\begin{aligned} Entropy(D_{0.0495}^-)&=-\left(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}\right)=0\\ Entropy(D_{0.0495}^+)&=-\left(\frac{8}{16}\log\frac{8}{16}+\frac{8}{16}\log\frac{8}{16}\right)=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-\left(\frac{1}{17}\times 0+\frac{16}{17}\times 1.0\right)\\ &=0.0563 \end{aligned}$$
  • For split point 0.074, the two subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.074)= 0.9975-\left\{\frac{2}{17}\left[-\left(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}\right)\right]+\frac{15}{17}\left[-\left(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15}\right)\right]\right\}=0.1179$$
  • For split point 0.095, the two subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.095)= 0.9975-\left\{\frac{3}{17}\left[-\left(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}\right)\right]+\frac{14}{17}\left[-\left(\frac{8}{14}\log\frac{8}{14}+\frac{6}{14}\log\frac{6}{14}\right)\right]\right\}=0.1861$$
  • For split point 0.101, the two subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.101)= 0.9975-\left\{\frac{4}{17}\left[-\left(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}\right)\right]+\frac{13}{17}\left[-\left(\frac{8}{13}\log\frac{8}{13}+\frac{5}{13}\log\frac{5}{13}\right)\right]\right\}=0.2624$$
  • For split point 0.126, the two subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.126)= 0.9975-\left\{\frac{5}{17}\left[-\left(\frac{0}{5}\log\frac{0}{5}+\frac{5}{5}\log\frac{5}{5}\right)\right]+\frac{12}{17}\left[-\left(\frac{8}{12}\log\frac{8}{12}+\frac{4}{12}\log\frac{4}{12}\right)\right]\right\}=0.3492$$
  • For split point 0.155, the two subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.155)= 0.9975-\left\{\frac{6}{17}\left[-\left(\frac{1}{6}\log\frac{1}{6}+\frac{5}{6}\log\frac{5}{6}\right)\right]+\frac{11}{17}\left[-\left(\frac{7}{11}\log\frac{7}{11}+\frac{4}{11}\log\frac{4}{11}\right)\right]\right\}=0.1561$$
  • For split point 0.1795, the two subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\left\{\frac{7}{17}\left[-\left(\frac{1}{7}\log\frac{1}{7}+\frac{6}{7}\log\frac{6}{7}\right)\right]+\frac{10}{17}\left[-\left(\frac{7}{10}\log\frac{7}{10}+\frac{3}{10}\log\frac{3}{10}\right)\right]\right\}=0.2354$$
  • For split point 0.2045, the two subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\left\{\frac{8}{17}\left[-\left(\frac{1}{8}\log\frac{1}{8}+\frac{7}{8}\log\frac{7}{8}\right)\right]+\frac{9}{17}\left[-\left(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9}\right)\right]\right\}=0.3371$$
  • For split point 0.213, the two subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.213)= 0.9975-\left\{\frac{9}{17}\left[-\left(\frac{2}{9}\log\frac{2}{9}+\frac{7}{9}\log\frac{7}{9}\right)\right]+\frac{8}{17}\left[-\left(\frac{6}{8}\log\frac{6}{8}+\frac{2}{8}\log\frac{2}{8}\right)\right]\right\}=0.2111$$
  • For split point 0.226, the two subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.226)= 0.9975-\left\{\frac{10}{17}\left[-\left(\frac{3}{10}\log\frac{3}{10}+\frac{7}{10}\log\frac{7}{10}\right)\right]+\frac{7}{17}\left[-\left(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}\right)\right]\right\}=0.1237$$
  • For split point 0.2505, the two subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\left\{\frac{11}{17}\left[-\left(\frac{4}{11}\log\frac{4}{11}+\frac{7}{11}\log\frac{7}{11}\right)\right]+\frac{6}{17}\left[-\left(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}\right)\right]\right\}=0.0615$$
  • For split point 0.2655, the two subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\left\{\frac{12}{17}\left[-\left(\frac{5}{12}\log\frac{5}{12}+\frac{7}{12}\log\frac{7}{12}\right)\right]+\frac{5}{17}\left[-\left(\frac{3}{5}\log\frac{3}{5}+\frac{2}{5}\log\frac{2}{5}\right)\right]\right\}=0.0202$$
  • For split point 0.2925, the two subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\left\{\frac{13}{17}\left[-\left(\frac{5}{13}\log\frac{5}{13}+\frac{8}{13}\log\frac{8}{13}\right)\right]+\frac{4}{17}\left[-\left(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4}\right)\right]\right\}=0.0715$$
  • For split point 0.344, the two subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.344)= 0.9975-\left\{\frac{14}{17}\left[-\left(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14}\right)\right]+\frac{3}{17}\left[-\left(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3}\right)\right]\right\}=0.0241$$
  • For split point 0.373, the two subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}
    $$Gain(D, \text{sugar content}, 0.373)= 0.9975-\left\{\frac{15}{17}\left[-\left(\frac{6}{15}\log\frac{6}{15}+\frac{9}{15}\log\frac{9}{15}\right)\right]+\frac{2}{17}\left[-\left(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2}\right)\right]\right\}=0.1408$$
  • For split point 0.418, the two subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}
    $$Gain(D, \text{sugar content}, 0.418)= 0.9975-\left\{\frac{16}{17}\left[-\left(\frac{7}{16}\log\frac{7}{16}+\frac{9}{16}\log\frac{9}{16}\right)\right]+\frac{1}{17}\left[-\left(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1}\right)\right]\right\}=0.0669$$

Therefore, the maximum information gain for the attribute "sugar content" is 0.3492, attained at split point 0.126:
$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$
Likewise, the maximum information gain for the attribute "density" is 0.2624, attained at split point 0.3815:
$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$

This is how continuous-valued attributes are handled.
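The two results above can be checked numerically. The following is a sketch under stated assumptions: `entropy` and `best_split` are hypothetical helper names, the data is copied from Table 1, and samples 1 to 8 are taken as the good melons. Because the code does not round intermediate values, the last digit may differ slightly from the hand calculation.

```python
import math

# Columns of Table 1 in sample order 1..17.
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]
good = [True] * 8 + [False] * 9  # samples 1-8 are good melons, 9-17 are bad


def entropy(flags):
    """Base-2 entropy of a list of booleans; an empty list has entropy 0."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def best_split(values, flags):
    """Return (max gain, split point) over all adjacent-value midpoints."""
    ordered = sorted(values)
    base = entropy(flags)
    best = (-1.0, None)
    for a, b in zip(ordered, ordered[1:]):
        t = (a + b) / 2
        left = [f for v, f in zip(values, flags) if v <= t]
        right = [f for v, f in zip(values, flags) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(flags)
        best = max(best, (base - weighted, t))
    return best


sugar_gain, sugar_t = best_split(sugar, good)
density_gain, density_t = best_split(density, good)
print(sugar_gain, sugar_t)      # roughly 0.349 at t = 0.126
print(density_gain, density_t)  # roughly 0.262 at t = 0.3815
```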

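2. Handling attributes with missing values

Table 3: Watermelon dataset with missing values ("-" marks a missing value)

| No. | Color | Root | Sound | Texture | Navel | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 2 | dark | curled | muffled | clear | sunken | - | yes |
| 3 | dark | curled | - | clear | sunken | hard-smooth | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | yes |
| 5 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 6 | green | slightly curled | dull | clear | - | soft-sticky | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | yes |
| 8 | dark | slightly curled | dull | - | slightly sunken | hard-smooth | yes |
| 9 | dark | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |
| 10 | green | stiff | crisp | - | flat | soft-sticky | no |
| 11 | light | stiff | crisp | blurry | flat | - | no |
| 12 | light | curled | - | blurry | flat | soft-sticky | no |
| 13 | - | slightly curled | dull | slightly blurry | sunken | hard-smooth | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | no |
| 15 | dark | slightly curled | dull | clear | - | soft-sticky | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | no |
| 17 | green | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |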

(1) How is the splitting attribute chosen when attribute values are missing?

Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples that have no missing value on $A$. Suppose $A$ has $M$ possible values $\{a^1, a^2, ..., a^M\}$. Let $\widetilde{D}^m$ denote the subset of $\widetilde{D}$ whose value on $A$ is $a^m$, and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$), so that $\widetilde{D}=\cup_{k=1}^{K}\widetilde{D}_k=\cup_{m=1}^{M}\widetilde{D}^m$. Assign every sample $x$ a weight $w_x$ and define:
$$\rho=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x},\qquad \widetilde{p}_k=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x},\qquad \widetilde{r}_m=\frac{\sum_{x\in\widetilde{D}^m}w_x}{\sum_{x\in \widetilde{D}}w_x}$$
Here $\rho$ is the proportion of samples without missing values, $\widetilde{p}_k$ is the proportion of class $k$ among the samples without missing values, and $\widetilde{r}_m$ is the proportion of samples taking value $a^m$ on $A$ among the samples without missing values. Hence $\sum_{k=1}^K\widetilde{p}_k=\sum_{m=1}^M\widetilde{r}_m=1$.

With these definitions, the information gain generalizes to data with missing values as follows:
$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times\left[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)\right] \end{aligned}$$
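To make the roles of $\rho$, $\widetilde{p}_k$ and $\widetilde{r}_m$ concrete, here is a minimal weighted sketch. The function names are hypothetical, `None` is the marker chosen here for a missing value, and the "color" column of Table 3 serves as input with all weights equal to 1.

```python
import math
from collections import defaultdict


def entropy_from_weights(class_weights):
    """Base-2 entropy of a {class: total weight} mapping."""
    total = sum(class_weights.values())
    return -sum((w / total) * math.log2(w / total)
                for w in class_weights.values() if w > 0)


def gain_with_missing(values, labels, weights=None):
    """rho * Gain(D~, A); a value of None marks a missing entry."""
    if weights is None:
        weights = [1.0] * len(values)
    known = [(v, y, w) for v, y, w in zip(values, labels, weights) if v is not None]
    known_w = sum(w for _, _, w in known)
    rho = known_w / sum(weights)             # share of weight with a known value
    by_class = defaultdict(float)            # weight per class: gives p~_k
    by_value = defaultdict(lambda: defaultdict(float))  # per value: gives r~_m
    for v, y, w in known:
        by_class[y] += w
        by_value[v][y] += w
    remainder = sum(sum(cw.values()) / known_w * entropy_from_weights(cw)
                    for cw in by_value.values())
    return rho * (entropy_from_weights(by_class) - remainder)


# "Color" column of Table 3 in sample order 1..17 (None = missing).
color = [None, "dark", "dark", "green", None, "green", "dark", "dark", "dark",
         "green", "light", "light", None, "light", "dark", "light", "green"]
good = [True] * 8 + [False] * 9
color_gain = gain_with_missing(color, good)
print(color_gain)  # roughly 0.252
```

Passing explicit `weights` lets the same function be reused deeper in the tree, where samples with missing values carry fractional weights.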
At the root node the sample set is $D=\{1,2,3,...,17\}$, and every sample has weight $w_x=1$.

  • Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, with 3 values: "dark", "green" and "light"
    $$\begin{aligned} Gain(D, \text{color})&=\rho\times\left[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)\right]\\ &=\frac{14}{17}\times\left\{-\left(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14}\right)-\left[\frac{6}{14}\times\left(-\left(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}\right)\right)+\frac{4}{14}\times\left(-\left(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}\right)\right)+\frac{4}{14}\times\left(-\left(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}\right)\right)\right]\right\}\\ &=0.2519 \end{aligned}$$
  • Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, with 3 values: "curled", "slightly curled" and "stiff"
    $$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\left\{-\left(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15}\right)-\left[\frac{7}{15}\times\left(-\left(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}\right)\right)+\frac{6}{15}\times\left(-\left(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}\right)\right)+\frac{2}{15}\times\left(-\left(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}\right)\right)\right]\right\}\\ &=0.1711 \end{aligned}$$
  • Attribute "sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, with 3 values: "dull", "muffled" and "crisp"
    $$\begin{aligned} Gain(D, \text{sound})&=\frac{15}{17}\times\left\{-\left(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15}\right)-\left[\frac{8}{15}\times\left(-\left(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}\right)\right)+\frac{5}{15}\times\left(-\left(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}\right)\right)+\frac{2}{15}\times\left(-\left(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}\right)\right)\right]\right\}\\ &=0.1448 \end{aligned}$$
  • Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, with 3 values: "clear", "slightly blurry" and "blurry"
    $$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\left\{-\left(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15}\right)-\left[\frac{7}{15}\times\left(-\left(\frac{6}{7}\log\frac{6}{7}+\frac{1}{7}\log\frac{1}{7}\right)\right)+\frac{5}{15}\times\left(-\left(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}\right)\right)+\frac{3}{15}\times\left(-\left(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}\right)\right)\right]\right\}\\ &=0.4235 \end{aligned}$$
  • Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, with 3 values: "sunken", "slightly sunken" and "flat"
    $$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\left\{-\left(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15}\right)-\left[\frac{7}{15}\times\left(-\left(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}\right)\right)+\frac{4}{15}\times\left(-\left(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}\right)\right)+\frac{4}{15}\times\left(-\left(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}\right)\right)\right]\right\}\\ &=0.2888 \end{aligned}$$
  • Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, with 2 values: "hard-smooth" and "soft-sticky"
    $$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\left\{-\left(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15}\right)-\left[\frac{10}{15}\times\left(-\left(\frac{5}{10}\log\frac{5}{10}+\frac{5}{10}\log\frac{5}{10}\right)\right)+\frac{5}{15}\times\left(-\left(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}\right)\right)\right]\right\}\\ &=0.0057 \end{aligned}$$
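The six gains listed above can be recomputed in one pass over Table 3. This is a sketch only: the single-token value names (for example `slightly-curled`) and the `-` marker for a missing value are encoding choices made for this example, all weights are taken as 1 at the root, and unrounded arithmetic may differ from the hand-computed values in the last digit.

```python
import math
from collections import Counter

ATTRS = ["color", "root", "sound", "texture", "navel", "touch"]
# Table 3, one sample per line in order 1..17; "-" marks a missing value.
TABLE = """
-     curled          dull    clear           sunken          hard-smooth
dark  curled          muffled clear           sunken          -
dark  curled          -       clear           sunken          hard-smooth
green curled          muffled clear           sunken          hard-smooth
-     curled          dull    clear           sunken          hard-smooth
green slightly-curled dull    clear           -               soft-sticky
dark  slightly-curled dull    slightly-blurry slightly-sunken soft-sticky
dark  slightly-curled dull    -               slightly-sunken hard-smooth
dark  -               muffled slightly-blurry slightly-sunken hard-smooth
green stiff           crisp   -               flat            soft-sticky
light stiff           crisp   blurry          flat            -
light curled          -       blurry          flat            soft-sticky
-     slightly-curled dull    slightly-blurry sunken          hard-smooth
light slightly-curled muffled slightly-blurry sunken          hard-smooth
dark  slightly-curled dull    clear           -               soft-sticky
light curled          dull    blurry          flat            hard-smooth
green -               muffled slightly-blurry slightly-sunken hard-smooth
"""
rows = [line.split() for line in TABLE.strip().splitlines()]
good = [i < 8 for i in range(len(rows))]  # samples 1-8 are good melons


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain_with_missing(column, labels):
    """rho * (Entropy(D~) - sum_m r~_m * Entropy(D~^m)) with unit weights."""
    known = [(v, y) for v, y in zip(column, labels) if v != "-"]
    rho = len(known) / len(column)
    remainder = 0.0
    for value in {v for v, _ in known}:
        subset = [y for v, y in known if v == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return rho * (entropy([y for _, y in known]) - remainder)


gains = {a: gain_with_missing([r[i] for r in rows], good) for i, a in enumerate(ATTRS)}
best_attr = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name:8s} {value:.4f}")
print("best:", best_attr)  # texture
```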
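(2) Given a splitting attribute, how is a sample assigned when its value on that attribute is missing?
  • If the sample's value on the splitting attribute is known, it is sent to the child node for that value, and its weight in the child node stays at 1;
  • If the sample's value on the splitting attribute is unknown, it is sent to all child nodes at once, and its weight in the child node for value $a^m$ becomes $\widetilde{r}_m\cdot w_x$, i.e. it is scaled by that child's share of the known-value samples.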

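The attribute "texture" has the largest information gain and is used for the next split. It has 15 samples with known values (7 clear, 5 slightly blurry, 3 blurry) and 2 samples with unknown values, {8, 10}.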

| Attribute: value | Samples | Good melons | Bad melons | Samples with missing value | Weight from missing values | Total weight |
|---|---|---|---|---|---|---|
| Texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
| Texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
| Texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |
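The total weights in the table follow mechanically from the known-value counts. Here is a small sketch using exact fractions; the dictionary and variable names are illustrative only.

```python
from fractions import Fraction

# Known-value sample counts per texture value, from the table above.
known_counts = {"clear": 7, "slightly blurry": 5, "blurry": 3}
n_missing = 2                      # samples 8 and 10 have no texture value
n_known = sum(known_counts.values())

branch_weight = {}
for value, count in known_counts.items():
    share = Fraction(count, n_known)            # r~_m for this branch
    branch_weight[value] = count + n_missing * share

for value, weight in branch_weight.items():
    print(value, weight)
# The branch weights add up to 17, the total weight at the root.
print(sum(branch_weight.values()))  # 17
```

Using `Fraction` avoids floating-point noise and makes it easy to see that distributing the missing-value samples preserves the total weight.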

The child node texture = clear contains 7 samples with known values, {1,2,3,4,5,6,15}, of which 6 are good melons and 1 is a bad melon. Assuming the class distribution of the missing-value samples matches that of the known-value samples, i.e. $\frac{6}{7}$ and $\frac{1}{7}$, the entropy of the child node texture = clear is:
$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{k=1}^{K}p_k\log p_k\\ &=-\left(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\right)\\ &=0.5916 \end{aligned}$$
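The arithmetic of this entropy can be replayed step by step. The sketch below follows the assumption stated in the text (missing-value weight is shared between the classes in the same 6:1 ratio as the known samples); the variable names are chosen for this example.

```python
import math

good_known, bad_known = 6, 1        # known-value samples in the clear branch
n_known = good_known + bad_known
missing_weight = 2 * 7 / 15         # weight received from samples 8 and 10

# Distribute the missing-value weight in proportion to the known classes.
good_w = good_known + good_known / n_known * missing_weight
bad_w = bad_known + bad_known / n_known * missing_weight
total_w = good_w + bad_w            # 7 + 2 * 7/15

node_entropy = -sum(w / total_w * math.log2(w / total_w)
                    for w in (good_w, bad_w) if w > 0)
print(node_entropy)  # about 0.5917
```

Because the extra weight is split in the same ratio as the known samples, the class proportions remain 6/7 and 1/7, so the result equals the entropy of the 7 known samples alone.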

  • Child node texture = clear: information gain of the attribute color
    • color = dark has 3 samples (2 positive, 1 negative); color = green has 2 samples (2 positive); 2 samples have a missing value

    • Weights of the missing-value samples: for color = dark the weight is $\frac{3}{5}$, for a total of $2\times\frac{3}{5}=\frac{6}{5}$; for color = green the weight is $\frac{2}{5}$, for a total of $2\times\frac{2}{5}=\frac{4}{5}$

    • color = dark: weight of positive samples: $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples: $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight: $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{dark})=-\left(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\right)=0.9183$$

    • color = green: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight: $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{green})=-\left(\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\right)=0.0$$

    $$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-\left(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0\right)=0.1054$$

  • Child node texture = clear: information gain of the attribute root

    • root = curled has 5 samples (5 positive); root = slightly curled has 2 samples (1 positive, 1 negative); no sample has a missing value
      $$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{curled})&=-\left(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5}\right)=0.0\\ Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{slightly curled})&=-\left(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2}\right)=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-\left(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0\right)=0.3059 \end{aligned}$$
  • Child node texture = clear: information gain of the attribute sound

    • sound = dull has 4 samples (3 positive, 1 negative); sound = muffled has 2 samples (2 positive); 1 sample has a missing value

    • Weights of the missing-value sample: for sound = dull the weight is $\frac{4}{6}$, for a total of $\frac{4}{6}$; for sound = muffled the weight is $\frac{2}{6}$, for a total of $\frac{2}{6}$

    • sound = dull: weight of positive samples: $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples: $1+\frac{1}{4}\times\frac{4}{6}$; total weight: $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{sound}=\text{dull})=-\left(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\right)=0.8112$$

    • sound = muffled: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{6}$; total weight: $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{sound}=\text{muffled})=-\left(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\right)=0.0$$
      $$Gain(D^{\text{texture}=\text{clear}},\text{sound})=0.5916-\left(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0\right)=0.1144$$
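The gain for "sound" inside the texture = clear node can be replayed numerically. The sketch below simply mirrors the arithmetic in the text, including its choice of $7+2\times\frac{7}{15}$ as the parent weight; it is not presented as the canonical C4.5 procedure, and the names `h2` and `branches` are assumptions made for this example.

```python
import math


def h2(pos, neg):
    """Base-2 entropy of a two-class split given (possibly fractional) weights."""
    total = pos + neg
    return -sum(w / total * math.log2(w / total) for w in (pos, neg) if w > 0)


parent_entropy = h2(6, 1)            # texture = clear node, about 0.5917
parent_weight = 7 + 2 * 7 / 15       # known samples plus share of samples 8 and 10

# (good, bad) counts among samples with a known sound value in this node.
branches = {"dull": (3, 1), "muffled": (2, 0)}
n_missing = 1                        # sample 3 has no sound value
n_known = sum(p + n for p, n in branches.values())

sound_gain = parent_entropy
for pos, neg in branches.values():
    known = pos + neg
    # Branch weight: its known samples plus its share of the missing sample.
    weight = known + n_missing * known / n_known
    # A proportional split leaves the class ratio unchanged, so h2(pos, neg)
    # equals the branch entropy written out in the text.
    sound_gain -= weight / parent_weight * h2(pos, neg)

print(sound_gain)  # about 0.114
```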


http://www.ppmy.cn/ops/155027.html
