The
proposed system uses the Random Tree algorithm. The Random Tree
algorithm makes multiple passes over the training data, in two
different ways. First, a pass through the data is made to create the
training data for each tree that will be built. Second, for each
tree, a pass is made through some of the columns of the data for each
internal node of the tree, to generate the counts that are used to
decide which attribute and split point to select for it.

To
build each tree, a separate batch or block of labeled records is
used. As a result, the Random Tree algorithm requires substantially
more labeled data than the standard algorithm to build a set of
trees. It would be conceivable to use a separate batch of labeled
data to make the decision about each internal node, but this would
require an even larger amount of labeled data, and would increase the
time before robust classifications could be made for unlabeled
records. Instead, we adapt ideas from streaming decision tree
algorithms to route every labeled record to an appropriate node of
the tree under construction, so that every labeled record contributes
to a decision about one node.

The
Random Tree algorithm builds a set of trees, just as the standard
Random Forest algorithm does. For the time being, the algorithm takes the required number of trees as a parameter; extensions in which the number of trees is derived from the classification accuracy, or in which new trees are added and old ones removed in response to changes in the labeled data, are straightforward. As a new
labeled record arrives, it is routed down the current tree, based on
its attribute values and the inequalities of the internal nodes,
until it arrives at a frontier node.
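
As an illustration of this routing step, the following minimal sketch (in Python) walks a record down a tree until it reaches a frontier node. The Node fields and the form of the inequality test are assumptions made for illustration, not the structures of the actual implementation.

# Sketch only: routing a labeled record to a frontier node.
class Node:
    def __init__(self, attribute=None, split=None, left=None, right=None):
        self.attribute = attribute   # index of the attribute tested at this node
        self.split = split           # split point used in the node's inequality
        self.left = left             # child for values <= split
        self.right = right           # child for values > split

    def is_frontier(self):
        # A frontier node has not yet been assigned an inequality.
        return self.attribute is None

def route(record, root):
    """Follow the internal-node inequalities until a frontier node is reached."""
    node = root
    while not node.is_frontier():
        value = record[node.attribute]
        node = node.left if value <= node.split else node.right
    return node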

At
the frontier node, the attribute values of the record contribute to
class counts that are used to compute Gini indexes. To be able to
maintain information about the distribution of attribute values and
their relationship to class labels, attributes are discretized into
fixed-length intervals. The boundaries between these intervals are
the possible split points for each attribute.
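
A minimal sketch of the statistics kept at a frontier node, assuming equal-width intervals over a known attribute range, is given below; the class and method names are illustrative only. Candidate splits are then evaluated by comparing the Gini indexes of the class counts falling on either side of an interval boundary.

# Sketch only: per-interval class counts for one attribute at a frontier node.
from collections import defaultdict

class AttributeStats:
    def __init__(self, n_bins, lo, hi):
        self.n_bins = n_bins
        self.lo = lo
        self.width = (hi - lo) / n_bins
        # counts[interval][class_label] -> number of records seen
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, value, label):
        interval = max(0, min(int((value - self.lo) / self.width), self.n_bins - 1))
        self.counts[interval][label] += 1

def gini(class_counts):
    """Gini index of a {class_label: count} distribution."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())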

The procedure for deciding when and how to change a frontier node into another kind of node is somewhat complex. A parameter, nmin, is used to decide how often to check whether a frontier node should be considered for transformation.

Whenever a node accumulates nmin labeled records, the Gini index and Hoeffding bound tests are applied. If the Hoeffding bound test is satisfied, then the frontier node has seen enough records to determine the best attribute and split value. The Gini index test is then used to select the best attribute and split point, and the frontier node is transformed into an internal node with an inequality based on this attribute and split point. The two children of this node become new frontier nodes.

If the number of records that have reached the frontier node exceeds a threshold called the node window, and the node has not yet been split, the algorithm checks whether this node should instead be transformed into a leaf. If the node has accumulated records that are predominantly from one class, then it is transformed into a leaf. Otherwise, the node is transformed into an internal node using the best attribute and split point found so far.
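
The check applied every nmin records can be sketched roughly as follows. The Hoeffding bound shown is the form commonly used in streaming decision trees; the confidence parameter delta, the merit function, and its range are illustrative assumptions rather than values taken from the system described here.

# Sketch only: decide whether a frontier node has seen enough records to split.
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that an observed mean over n records is within epsilon of
    the true mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(merits, n, delta=1e-6, value_range=1.0):
    """merits maps each candidate (attribute, split point) to a Gini-based
    merit (higher is better). Returns the best candidate if the gap between
    the best and second-best exceeds the Hoeffding bound, else None."""
    ranked = sorted(merits.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best, m1), (_, m2) = ranked[0], ranked[1]
    if m1 - m2 > hoeffding_bound(value_range, delta, n):
        return best
    return None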

The
size of the node window threshold depends on the depth of the node in
the tree, because fewer records will reach deeper nodes. We therefore
define the node window as a function of the tree window and the node
level:
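
The specific function is not reproduced here; purely as a hypothetical illustration of the stated behaviour (smaller thresholds for deeper nodes), one simple choice would be to halve the tree window at each additional level:

# Hypothetical illustration only; the actual node-window function may differ.
def node_window(tree_window, level):
    return tree_window // (2 ** level)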

The
construction of a tree is complete when a total of tree window
records have been used in its construction. The algorithm then begins
the construction of the next tree, until the required number of trees
have been built. A limited form of pruning is necessary, because a
node may have generated two descendant frontier nodes that do not see
enough subsequent records to be considered for splitting themselves.
If two sibling nodes have failed to receive enough records when the tree window is reached, the node purity is calculated for both. Node purity is the ratio of the number of records labeled with the most frequent class label to the total number of records. If both siblings' node purity is less than 1/number of classes, then both nodes are pruned and their parent node is labeled as a leaf rather than an internal node. Otherwise, the sibling nodes become leaf nodes labeled with the majority class among the records in their parent that would have flowed down to them.
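
A sketch of this pruning decision, assuming that each sibling's class counts (that is, the parent's records routed to that side) are available, might look like the following; the function and its return values are illustrative only.

# Sketch only: end-of-tree-window handling of two never-split sibling frontier nodes.
def node_purity(class_counts):
    """Fraction of records carrying the most frequent class label."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total if total else 0.0

def finalize_siblings(left_counts, right_counts, n_classes):
    """Prune both siblings (parent becomes a leaf) if both are too impure;
    otherwise turn each sibling into a leaf labeled with its majority class."""
    threshold = 1.0 / n_classes
    if node_purity(left_counts) < threshold and node_purity(right_counts) < threshold:
        return "prune_parent"
    def majority(counts):
        return max(counts, key=counts.get) if counts else None
    return ("leaves", majority(left_counts), majority(right_counts))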

The
algorithm handles a single phase of learning and classification. It
can be extended to multiple phases by growing new trees from
subsequent batches of labeled records, and discarding the oldest
trees from the previous phase.
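
A sketch of this multi-phase extension, assuming a build_tree routine that constructs one tree from one batch of labeled records, could maintain the ensemble as a bounded queue:

# Sketch only: grow a new tree per phase and retire the oldest one.
from collections import deque

def run_phases(label_batches, n_trees, build_tree):
    """Keep at most n_trees trees; each new batch replaces the oldest tree."""
    forest = deque(maxlen=n_trees)      # appending past maxlen drops the oldest tree
    for batch in label_batches:
        forest.append(build_tree(batch))
        yield list(forest)              # current ensemble after this phase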

The standard Random Forest algorithm relies on samples chosen using random
selection with replacement, both to guarantee attractive properties
of the learned models and to provide a natural test set. Sampling
with replacement is not possible when the data arrives as a stream,
so we must consider whether this affects the properties of the new
algorithm. For an infinite stream of data drawn from the same
distribution, the results produced by sampling with replacement and
sampling without replacement are indistinguishable, since each draw is effectively independent of the previous ones.