题目1
题目2
election | season | oil price | stock price |
---|---|---|---|
no | winter | rise | rise |
no | summer | fall | fall |
no | summer | rise | rise |
yes | winter | rise | fall |
no | winter | fall | fall |
yes | summer | rise | rise |
yes | summer | fall | fall |
yes | winter | fall | fall |
设数据集为S,
step1,计算熵
H ( S ) = − ( 3 8 log 2 3 8 + 5 8 log 2 5 8 ) = 0.954 H(S) = - \left( \frac{3}{8} \log_2 \frac{3}{8} + \frac{5}{8} \log_2 \frac{5}{8} \right)= 0.954 H(S)=−(83log283+85log285)=0.954
step2:计算条件熵
- step2.1 计算特征election的条件熵
- g ( S , e l e c t i o n ) = H ( S ) − H ( S ∣ e l e c t i o n ) g(S, election)=H(S)-H(S|election) g(S,election)=H(S)−H(S∣election)
- H ( S ∣ e l e c t i o n ) = 4 8 H ( S ∣ e l e c t i o n = y e s ) + 4 8 H ( S ∣ e l e c t i o n = n o ) H(S|election) = \frac{4}{8} H(S|election=yes) + \frac{4}{8} H(S|election=no) H(S∣election)=84H(S∣election=yes)+84H(S∣election=no)
- H ( S ∣ e l e c t i o n = y e s ) = − ( 1 4 log 2 1 4 + 3 4 log 2 3 4 ) H(S|election=yes) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) H(S∣election=yes)=−(41log241+43log243)
- H ( S ∣ e l e c t i o n = n o ) = − ( 2 4 log 2 2 4 + 2 4 log 2 2 4 ) H(S|election=no) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) H(S∣election=no)=−(42log242+42log242)
- 得 H ( S ∣ e l e c t i o n ) = 0.9055 H(S|election) = 0.9055 H(S∣election)=0.9055
- 故 g ( S , e l e c t i o n ) = 0.954 − 0.9055 = 0.0485 g(S, election)=0.954 - 0.9055 = 0.0485 g(S,election)=0.954−0.9055=0.0485
- step2.2 计算特征season的条件熵
- g ( S , s e a s o n ) = H ( S ) − H ( S ∣ s e a s o n ) g(S, season)=H(S)-H(S|season) g(S,season)=H(S)−H(S∣season)
- H ( S ∣ s e a s o n ) = 4 8 H ( S ∣ s e a s o n = w i n t e r ) + 4 8 H ( S ∣ s e a s o n = s u m m e r ) H(S|season) = \frac{4}{8} H(S|season=winter) + \frac{4}{8} H(S|season=summer) H(S∣season)=84H(S∣season=winter)+84H(S∣season=summer)
- H ( S ∣ s e a s o n = w i n t e r ) = − ( 1 4 log 2 1 4 + 3 4 log 2 3 4 ) = 0.811 H(S|season=winter) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) = 0.811 H(S∣season=winter)=−(41log241+43log243)=0.811
- H ( S ∣ s e a s o n = s u m m e r ) = − ( 2 4 log 2 2 4 + 2 4 log 2 2 4 ) = 1 H(S|season=summer) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) = 1 H(S∣season=summer)=−(42log242+42log242)=1
- 得 H ( S ∣ s e a s o n ) = 0.9055 H(S|season) = 0.9055 H(S∣season)=0.9055
- 故 g ( S , s e a s o n ) = 0.954 − 0.9055 = 0.0485 g(S, season)=0.954 - 0.9055 = 0.0485 g(S,season)=0.954−0.9055=0.0485
- step2.3 计算特征oil price 的条件熵
- g ( S , o i l p r i c e ) = H ( S ) − H ( S ∣ o i l p r i c e ) g(S, oil price)=H(S)-H(S|oil price) g(S,oilprice)=H(S)−H(S∣oilprice)
- H ( S ∣ o i l p r i c e ) = 4 8 H ( S ∣ o i l p r i c e = r i s e ) + 4 8 H ( S ∣ o i l p r i c e = f a l l ) H(S|oil price) = \frac{4}{8} H(S|oil price=rise) + \frac{4}{8} H(S|oil price=fall) H(S∣oilprice)=84H(S∣oilprice=rise)+84H(S∣oilprice=fall)
- H ( S ∣ o i l p r i c e = r i s e ) = − ( 3 4 log 2 3 4 + 1 4 log 2 1 4 ) H(S|oil price=rise) = - \left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right) H(S∣oilprice=rise)=−(43log243+41log241)
- H ( S ∣ o i l p r i c e = f a l l ) = − ( 0 4 log 2 0 4 + 4 4 log 2 4 4 ) = 0 H(S|oil price=fall) = - \left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0 H(S∣oilprice=fall)=−(40log240+44log244)=0
- 得 H ( S ∣ o i l p r i c e ) = 0.4055 H(S|oil price)=0.4055 H(S∣oilprice)=0.4055
- 故 g ( S , o i l p r i c e ) = 0.954 − 0.4055 = 0.5485 g(S, oil price)= 0.954 - 0.4055 = 0.5485 g(S,oilprice)=0.954−0.4055=0.5485
step3:选择信息增益最大的特征为根节点,即oil price为根节点。
step4:将 S S S根据特征oil price=rise和fall,划分为 S 1 S_1 S1, S 2 S_2 S2,其中oil price=fall,stock price全部为fall,则 S 2 S_2 S2为叶节点。对 S 1 S_1 S1从特征election 和season中选择新的特征。
-
step4.1 计算 g ( S 1 , e l e c t i o n ) = H ( S 1 ) − H ( S 1 ∣ e l e c t i o n ) g(S_1, election)=H(S_1)-H(S_1|election) g(S1,election)=H(S1)−H(S1∣election)
- H ( S 1 ∣ e l e c t i o n ) = 2 4 H ( S 1 ∣ e l e c t i o n = n o ) + 2 4 H ( S 1 ∣ e l e c t i o n = y e s ) H(S_1|election)=\frac{2}{4}H(S_1|election=no)+\frac{2}{4}H(S_1|election=yes) H(S1∣election)=42H(S1∣election=no)+42H(S1∣election=yes)
- H ( S 1 ∣ e l e c t i o n = n o ) = − 2 2 log 2 2 2 − 0 2 log 2 0 2 = 0 H(S_1|election=no)=-\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 H(S1∣election=no)=−22log222−20log220=0
- H ( S 1 ∣ e l e c t i o n = y e s ) = − 1 2 log 2 1 2 − 1 2 log 2 1 2 = 1 H(S_1|election=yes)=-\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 H(S1∣election=yes)=−21log221−21log221=1
- 得 H ( S 1 ∣ e l e c t i o n ) = 0.5 H(S_1|election) = 0.5 H(S1∣election)=0.5
- 故 g ( S 1 , e l e c t i o n ) = H ( S 1 ) − H ( S 1 ∣ e l e c t i o n ) = 0.811 − 0.5 = 0.311 g(S_1, election)=H(S_1)-H(S_1|election)=0.811-0.5=0.311 g(S1,election)=H(S1)−H(S1∣election)=0.811−0.5=0.311
-
step4.2 计算 g ( S 1 , s e a s o n ) = H ( S 1 ) − H ( S 1 ∣ s e a s o n ) g(S_1, season)=H(S_1)-H(S_1|season) g(S1,season)=H(S1)−H(S1∣season)
- H(S_1)=H(S|oil price=rise)=
- H ( S 1 ∣ s e a s o n ) = 2 4 H ( S 1 ∣ s e a s o n = w i n t e r ) + 2 4 H ( S 1 ∣ s e a s o n = s u m m e r ) H(S_1|season) = \frac{2}{4}H(S_1|season=winter)+ \frac{2}{4} H(S_1|season=summer) H(S1∣season)=42H(S1∣season=winter)+42H(S1∣season=summer)
- H ( S 1 ∣ s e a s o n = w i n t e r ) = − 1 2 log 2 1 2 − 1 2 log 2 1 2 = 1 H(S_1|season=winter)=-\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 H(S1∣season=winter)=−21log221−21log221=1
- H ( S 1 ∣ s e a s o n = s u m m e r ) = − 2 2 log 2 2 2 − 0 2 log 2 0 2 = 0 H(S_1|season=summer)=-\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 H(S1∣season=summer)=−22log222−20log220=0
- 得 H ( S 1 ∣ s e a s o n ) = 0.5 H(S_1|season) = 0.5 H(S1∣season)=0.5
- 故 g ( S 1 , s e a s o n ) = H ( S 1 ) − H ( S 1 ∣ s e a s o n ) = 0.811 − 0.5 = 0.311 g(S_1, season)=H(S_1)-H(S_1|season)=0.811-0.5=0.311 g(S1,season)=H(S1)−H(S1∣season)=0.811−0.5=0.311
step5: g ( S 1 , e l e c t i o n ) g(S_1, election) g(S1,election)和 g ( S 1 , s e a s o n ) g(S_1, season) g(S1,season)一样,这里选择election作为节点划分。election=no,stock price全为rise,为叶子节点。对数据集election=yes,设为 H ( S 11 ) H(S_{11}) H(S11),只有season特征,选择season作为划分特征,两子集都为纯。当 season = winter
时,stock price = fall
。当 season = summer
时,stock price = rise
。
oil price?
├── fall: stock price = fall
└── rise:├── election?│ ├── no: stock price = rise│ └── yes:│ ├── season?│ │ ├── winter: stock price = fall│ │ └── summer: stock price = rise
log计算:
- − 1 2 log 2 1 2 − 1 2 log 2 1 2 = 1 \displaystyle -\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 −21log221−21log221=1
- − 2 2 log 2 2 2 − 0 2 log 2 0 2 = 0 \displaystyle -\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 −22log222−20log220=0
- − 2 3 log 2 2 3 − 1 3 log 2 1 3 = 0.918 \displaystyle -\frac{2}{3}\log_2 \frac{2}{3}-\frac{1}{3}\log_2 \frac{1}{3} = 0.918 −32log232−31log231=0.918
- − 3 4 log 2 3 4 − 1 4 log 2 1 4 = 0.811 \displaystyle - \frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4}=0.811 −43log243−41log241=0.811
- − 3 5 log 2 3 5 − 2 5 log 2 2 5 = 0.971 \displaystyle - \frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}=0.971 −53log253−52log252=0.971
- − 1 5 log 2 1 5 − 4 5 log 2 4 5 = 0.722 \displaystyle - \frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5}=0.722 −51log251−54log254=0.722
- − 1 6 log 2 1 6 − 5 6 log 2 5 6 = 0.650 \displaystyle - \frac{1}{6} \log_2 \frac{1}{6} - \frac{5}{6} \log_2 \frac{5}{6}=0.650 −61log261−65log265=0.650