Algorithm: Decision Tree, Entropy, Information Gain and Continuous Features
Published: 2019-05-24


The decision tree is the foundation of the random forest.

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Based on historical samples and experience, we can build many different decision trees (models), so we need a method to evaluate the performance of a decision tree.

Decision boundary of Decision Tree

The structure of a decision tree is complex; learning it is a structured prediction problem.

NP-hard problem

A problem that cannot be solved in polynomial time is called NP-hard; finding the optimal decision tree is such a problem.

We therefore use an approximate approach, such as a greedy algorithm, to solve it.

Information Gain

Type 1 of the decision tree

Type 2 of the decision tree

We change the order of the two features to obtain a new decision tree.

If the dimension of the input data is large, the number of possible decision trees is very large!

 

We use the greedy algorithm to obtain the decision tree model.

The first step is to choose the root.

How do we choose the features?

Information entropy
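The uncertainty of a label set is measured by its information entropy: if the classes occur with probabilities p_1, ..., p_k, the entropy is H = -(p_1 log p_1 + ... + p_k log p_k). It is 0 when every sample has the same label and largest when the labels are evenly mixed (the calcEntropy function in the implementation below uses the natural logarithm).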

 

How do we quantify the reduction of uncertainty (entropy)? We take the old uncertainty and subtract the current uncertainty (entropy); this difference is the information gain.
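In formula form, if a dataset D is split by a feature A into subsets D_v, the information gain is IG(D, A) = H(D) - sum over v of (|D_v| / |D|) * H(D_v): the parent entropy minus the weighted average entropy of the children.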

The process of building the decision tree

Calculate the information gain for the feature 'time'

Calculate the information gain for the feature 'Match type'

Calculate the information gain for the feature 'court surface'

Calculate the information gain for the feature 'best effort'

We choose the feature with the maximum information gain as the root!

In the situation above, we choose 'court surface'.

For 'clay' we need to do a further split.

We obtain the final decision tree.
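The actual feature table is only shown in the figures above, so here is a minimal sketch of the information-gain computation on a hypothetical 'court surface' column with win/loss labels (the values are made up for illustration):

import numpy as np

def entropy(labels):
    # H = -sum(p * log p) over the class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(feature, labels):
    # parent entropy minus the weighted entropy of each branch
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

surface = np.array(['clay', 'grass', 'hard', 'clay', 'grass', 'hard'])  # hypothetical values
win = np.array([1, 0, 1, 0, 0, 1])                                      # hypothetical labels
print(information_gain(surface, win))

The feature whose column gives the largest information_gain value would be chosen as the root.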

Prevent the decision tree from overfitting

We try to keep the decision tree simple, for example by reducing the number of leaves.

The more leaves the tree has, the more complex it is.

Sometimes we constrain the depth of the decision tree to control the number of leaves.
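As an illustration (not part of the implementation at the end of this post), scikit-learn's DecisionTreeClassifier exposes exactly these two constraints:

from sklearn.tree import DecisionTreeClassifier

# limit the depth and the number of leaves to keep the tree simple;
# X_train and y_train are placeholder names for your training data
clf = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8)
# clf.fit(X_train, y_train)

A deeper tree with more leaves fits the training data more closely but is more likely to overfit.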

How to deal with continuous-value features?

For a discrete feature, we simply build a branch for each value of the current feature.

For a continuous-value feature, we calculate the information gain for each candidate split rule (threshold) and choose the best one (maximum information gain).

We can reuse a continuous-value feature in the subtrees, because it can be split again with a different threshold.
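A minimal sketch of threshold selection for one continuous feature, assuming (as is common, though the post does not fix the rule) that candidate thresholds are the midpoints between consecutive distinct sorted values:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_threshold(values, labels):
    # try the midpoint between each pair of consecutive distinct values
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_gain, best_t = -1.0, None
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue
        t = (values[i] + values[i + 1]) / 2
        left, right = labels[values <= t], labels[values > t]
        gain = entropy(labels) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

The same feature can appear again deeper in the tree with a different threshold, so it is not removed from the candidate set after a split.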

Decision tree for the regression problem

Sometimes we use the accuracy rate to measure the performance of the decision tree as a classifier.

We use the MSE (mean squared error) to measure the performance of the decision tree as a regressor.

For a classification problem, we use the entropy to evaluate the performance of the decision tree.

For a regression problem, we use the variance (or standard deviation) to evaluate the performance of the decision tree.
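A minimal sketch of how a regression split could be scored, using the weighted variance of the two child nodes (lower is better), analogous to the weighted entropy used for classification; the mask argument is a hypothetical boolean split rule:

import numpy as np

def regression_split_score(target, mask):
    # weighted variance of the two children produced by the boolean mask;
    # assumes the mask puts at least one sample on each side
    left, right = target[mask], target[~mask]
    return (len(left) * np.var(left) + len(right) * np.var(right)) / len(target)

Equivalently, one can report the variance reduction np.var(target) - regression_split_score(target, mask), which plays the role of information gain.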

 

An example:

A Python implementation of a decision tree classifier

reference: 

# author sesiria 2019
# a simple Decision Tree classifier implementation
import numpy as np
import sys
from collections import Counter

# **************************helper functions************************************
# function to calculate the entropy of the data.
# input shape should be (N,)
def calcEntropy(data):
    nSize = len(data)
    classes = np.unique(data)
    p = np.zeros(len(classes))
    for i in range(len(classes)):
        p[i] = len(data[data == classes[i]]) / nSize
    return -np.sum(p * np.log(p))

# function to compare two float numbers
def approximateEqual(X, Y):
    return np.abs(X - Y) < 10e-4

# choose the most frequent label
def chooseMaxFrequency(X):
    count = Counter(X)
    result = sorted(count.items(), key = lambda x : x[1])
    return result[-1][0]

# **********************definition of the Decision Tree***************************
# Node for the decision tree
class Node:
    # feature_index = -1 for leaf nodes
    def __init__(self, feature_index, label = None):
        # check for valid leaf nodes
        if feature_index == -1:
            assert label is not None
        self.feature_index = feature_index
        self.label = label  # it is only valid for a leaf node
        self.children = {}  # hashtable to store the child nodes, keyed by feature value

# simple decision tree classifier, only supports discrete input.
class DecisionTreeClassifier:
    # constructor
    def __init__(self, maxDepth, maxLeaves):
        self.root = None
        self.maxDepth = maxDepth
        self.maxLeaves = maxLeaves

    # train the decision tree model
    def fit(self, data, label):
        self.root = self.split(data, label, 0)

    # predict the classification via the decision tree model
    def predict(self, test):
        if len(test.shape) == 1:
            return self.search(self.root, test)

        predictions = np.zeros(test.shape[0])
        for n in range(len(predictions)):
            predictions[n] = self.search(self.root, test[n, :])
        return predictions

    # search for the leaf node that classifies the sample
    def search(self, node, data):
        assert node is not None
        # this is a leaf node
        if node.feature_index == -1:
            return node.label
        child = node.children[data[node.feature_index]]
        assert child is not None
        # drop the used feature before descending, to match how the tree was built
        return self.search(child, data[[x for x in range(data.shape[0]) if x != node.feature_index]])

    # split nodes
    def split(self, data, label, depth):
        # we stop the split under these conditions:
        # 1) we reach the max depth
        # 2) we reach the max size of the node
        # 3) only one feature is left and all its values are the same
        # 4) all samples have the same label
        if (depth == self.maxDepth or
            len(data) == self.maxLeaves or
            (data.shape[1] == 1 and len(np.unique(data[:, 0])) == 1) or
            len(np.unique(label)) == 1
           ):
            targetLabel = chooseMaxFrequency(label)
            # for a leaf node we don't need to pass the feature index
            node = Node(-1, targetLabel)
            return node
        # otherwise we need to build an internal node.
        bestFeature = self.featureSelection(data, label)
        node = Node(bestFeature)
        partitions = self.splitByfeature(data, bestFeature)
        # create the children of the current node
        for key in partitions:
            index = partitions[key]
            child = self.split(data[index, :][:, [x for x in range(data.shape[1]) if x != bestFeature]],
                               label[index],
                               depth + 1)
            node.children[key] = child
        return node

    # split the data set by the target feature_index
    # return a dict mapping each feature value to the boolean index of its rows
    def splitByfeature(self, data, feature_idx):
        nClass = np.unique(data[:, feature_idx])
        result = {}
        for i in range(len(nClass)):
            index = data[:, feature_idx] == nClass[i]
            # key by the actual feature value so that search() can look it up
            result[nClass[i]] = index
        return result

    # iterate over the features and choose the best one for the current split
    def featureSelection(self, data, label):
        # choose the feature with the minimum weighted entropy
        # (equivalent to the maximum information gain)
        nSize, n_features = data.shape
        # originEntropy = calcEntropy(label)
        bestFeatures = 0
        minEntropy = float(sys.maxsize)
        for i in range(n_features):
            # split by the current feature
            subIndexes = self.splitByfeature(data, i)
            # calculate the weighted entropy of the children
            currentEntropy = 0
            for key in subIndexes:
                idx = subIndexes[key]
                currentEntropy += len(label[idx]) * calcEntropy(label[idx]) / nSize
            if currentEntropy < minEntropy:
                minEntropy = currentEntropy
                bestFeatures = i
        return bestFeatures

# **************************unit test functions************************************
def test_chooseMaxFrequency():
    A = ['a', 'b', 'b', 'c', 'd', 'b', 'a']
    assert chooseMaxFrequency(A) == 'b'
    print("testing...chooseMaxFrequency()")
    print("pass!")

def test_calcEntropy():
    X = np.zeros(5)
    Y = np.ones(5)
    Z = np.array([0, 0, 0, 1, 1, 1])
    assert approximateEqual(calcEntropy(X), 0)
    assert approximateEqual(calcEntropy(Y), 0)
    assert approximateEqual(calcEntropy(Z), 0.69314718)
    print("testing...calcEntropy()")
    print("pass!")

def test_decisionTree():
    clf = DecisionTreeClassifier(100, 100)
    X = np.array([[0, 0, 0],
                  [1, 0, 0],
                  [0, 0, 0],
                  [0, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [1, 0, 0]
                 ])
    Y = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
    # test splitByfeature
    result = clf.splitByfeature(X, 0)
    # test featureSelection
    idx = clf.featureSelection(X, Y)
    clf.fit(X, Y)
    X_test = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 0]])
    predictions = clf.predict(X_test)

def sanity_check():
    # test_chooseMaxFrequency()
    # test_calcEntropy()
    test_decisionTree()

if __name__ == '__main__':
    sanity_check()

 

Reposted from: http://nwwci.baihongyu.com/
