Tuesday, November 21, 2017

Coloring With Random Forests


Single Trees

Random forests are typically described in terms of trees, but they can also be described as partitioning space. To explore this representation, let's generate some data: a location and a color for many points, plotted below. This data has a bright spot near each corner, and we will see that random forests are able to identify this structure.
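As a sketch of the setup, the code below samples random locations in the unit square and assigns a brightness to each. The exact color rule from the post isn't shown, so the corner-brightness function here is a hypothetical stand-in that matches the description of bright spots near each corner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many random point locations in the unit square.
n = 500
points = rng.random((n, 2))

# Hypothetical color rule (the post's exact rule isn't shown): brightness
# peaks near each corner of the square and fades toward the center.
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
dist_to_corner = np.min(
    np.linalg.norm(points[:, None, :] - corners[None, :, :], axis=2), axis=1
)
color = 1.0 - dist_to_corner  # values near 1 are the bright spots
```

Any function with four corner peaks would serve equally well; the forest only ever sees the sampled points and their colors.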



One way to predict the color of a new point at a specific location is to use a random forest. We start by looking at the behavior of a single tree in the forest. A tree partitions the data into small buckets; for the data we have been given, it draws the partitions shown below.



The tree then finds the average color within each partition. Below I have drawn the plot with the mean color of each partition; I will call this the trained partition plot. It shows what the tree will predict for every point.
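This per-partition averaging can be sketched with scikit-learn's `DecisionTreeRegressor` (an assumption; the post uses a modified version of Jeremy Howard's code, and the target below is a hypothetical corner-brightness function):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 2))  # point locations in the unit square
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
dist = np.min(np.linalg.norm(X[:, None, :] - corners[None, :, :], axis=2), axis=1)
color = 1.0 - dist  # hypothetical corner-brightness target

# A single tree partitions the square into rectangles; capping the number
# of leaves keeps the partitions coarse enough to inspect.
tree = DecisionTreeRegressor(max_leaf_nodes=16, random_state=0).fit(X, color)

# Every point in a partition (leaf) is predicted as that partition's
# mean training color.
leaf_ids = tree.apply(X)
for lid in np.unique(leaf_ids):
    in_leaf = leaf_ids == lid
    assert np.allclose(tree.predict(X[in_leaf]), color[in_leaf].mean())
```

The assertion at the end checks exactly the claim in the text: the trained value of each partition is the mean of the colors it contains.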


To predict a new point, we simply place it on the trained partition plot and see what color is located there. For instance, we have plotted (0.5, 0.7) on the trained partition plot, and we can see that the tree will predict a medium red.
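The lookup above can be sketched in code: predicting (0.5, 0.7) just finds the partition containing that point and returns its trained mean. (scikit-learn and the product target below are illustrative assumptions, not the post's exact setup.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 2))
color = X[:, 0] * X[:, 1]  # hypothetical stand-in for the color values

tree = DecisionTreeRegressor(max_leaf_nodes=16, random_state=0).fit(X, color)

# Predicting the point (0.5, 0.7) looks up the partition that contains it
# and returns that partition's trained mean color.
new_point = np.array([[0.5, 0.7]])
prediction = tree.predict(new_point)[0]

same_partition = tree.apply(X) == tree.apply(new_point)[0]
assert np.isclose(prediction, color[same_partition].mean())
```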


Entire Forests

Now we want to build an entire forest of trees. We give a sample of the data to each tree and train each tree individually; for our data, each tree was given 50% of the data. We can see how each individual tree trains on its sample. An individual tree does not necessarily look like a good fit to the original data, but combined together they describe the data much better.

Now that we have trained each tree, we need to combine them. To do so, we average the predictions from all of the trees, which results in a single trained partition plot. For our random forest this produces the plot below; as we can see, the forest gives a fairly reasonable description of the original data.
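A minimal sketch of this bagging-and-averaging scheme, assuming scikit-learn trees and a hypothetical target (the 50% sampling follows the text; the tree count and leaf cap here are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 2))
color = X[:, 0] * X[:, 1]  # hypothetical target

# Train each tree on its own random 50% sample of the data, as in the text.
trees = []
for seed in range(20):
    sample = rng.choice(len(X), size=len(X) // 2, replace=False)
    trees.append(
        DecisionTreeRegressor(max_leaf_nodes=32, random_state=seed).fit(
            X[sample], color[sample]
        )
    )

# The forest's prediction is the average of the individual trees' predictions.
def forest_predict(points):
    return np.mean([t.predict(points) for t in trees], axis=0)

prediction = forest_predict([[0.5, 0.7]])
```

Averaging is why the forest's trained partition plot looks smoother than any single tree's: partition boundaries that only one tree drew get diluted by the other nineteen.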


Finding the Partitions

A random forest needs a way to choose its partitions. We will use the original data to show how it decides on the splits.

Each tree is trained independently, so we only need to understand how a single tree is trained. The tree searches for the horizontal or vertical line that is "best". To define "best" we need a way to score a candidate model (the score is usually mean squared error). For each candidate line, we take the mean of each resulting partition and use that simple model to compute a score; we then pick the horizontal or vertical line that reduces the score the most. For our points, the "best" split is shown below.
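The split search described above can be sketched as a brute-force scan over candidate horizontal and vertical lines, scoring each by the squared error around the two partition means (`best_split` and its test data are hypothetical names for illustration):

```python
import numpy as np

def best_split(X, y):
    """Find the axis-aligned split that most reduces total squared error.

    Each candidate split is scored by the sum of squared deviations from
    each side's mean -- the MSE-style criterion described in the text.
    """
    def sse(vals):
        return ((vals - vals.mean()) ** 2).sum() if len(vals) else 0.0

    best_score, best_feat, best_thresh = np.inf, None, None
    for feat in range(X.shape[1]):
        # Skip the smallest value so the left side is never empty.
        for thresh in np.unique(X[:, feat])[1:]:
            left = X[:, feat] < thresh
            score = sse(y[left]) + sse(y[~left])
            if score < best_score:
                best_score, best_feat, best_thresh = score, feat, thresh
    return best_score, best_feat, best_thresh

# Hypothetical data with a clean vertical split at x = 0.6.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] > 0.6).astype(float)
score, feat, thresh = best_split(X, y)
```

On this toy data the search recovers the vertical line near x = 0.6, since splitting there makes both partition means fit their points exactly. Real implementations use sorted cumulative sums to evaluate all thresholds in one pass rather than this O(n²) scan.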


The tree then takes each smaller partition and repeats the same process on both of them. This continues until there are very few points in each partition. The next level of splits for our data is shown below.
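One common way to encode the stopping rule ("very few points in each partition") is a minimum leaf size, sketched here with scikit-learn's `min_samples_leaf` (an assumed implementation detail, not necessarily the post's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = X[:, 0] + X[:, 1]  # hypothetical smooth target

# min_samples_leaf is one common stopping rule: partitions are split
# recursively until a split would leave fewer than this many points.
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# Count how many training points landed in each final partition.
leaf_sizes = np.bincount(tree.apply(X))
leaf_sizes = leaf_sizes[leaf_sizes > 0]
```

Every final partition holds at least five points, so no partition's mean is estimated from too little data.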



At the end, the partitions look like this.




We now just find the mean of each partition, and that gives us our final trained model.



Code

This uses a slightly modified version of Jeremy Howard's Random Forest found here. All code used to generate the plots can be found here.
