Anqi Fu

August 13, 2013

Run H2O From Within R

With the REST API, it’s simple to run H2O operations from within R using syntax similar to that of your favorite R functions. In this post, we’ll walk through a short demo of its capabilities. First, install and start H2O by following the tutorial here. Once the R package is loaded, you can list the included demos by typing demo(package = "h2o") and run one of them by typing, e.g., demo(h2o.glm). We’ll step through a few of the basic statistical functions in this tutorial.

Starting up H2O in R

library(h2o)
localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
h2o.checkClient(localH2O)

The beginning of each H2O R script looks the same: first, load the R package, then create an H2OClient object containing the IP address and port at which H2O resides. If you are running H2O on your local machine, the defaults are IP = 127.0.0.1 and port = 54321. You can call h2o.checkClient to verify that R can connect to H2O. Once that’s done, we are ready to work with some data!

Importing and Summarizing Data

In this tutorial, we will be working with the prostate cancer data set, which comes from a study by Dr. Donn Young at The Ohio State University Comprehensive Cancer Center of patients with varying degrees of prostate cancer. The relevant columns are CAPSULE (binary variable indicating tumor penetration of the prostatic capsule), AGE (in years), RACE (1 = white, 2 = black), PSA (prostate-specific antigen value), and GLEASON (total Gleason score, indicating how aggressive the cancer is). See Applied Logistic Regression by Hosmer and Lemeshow (2000) for more details.

prostate.data = h2o.importURL(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
summary(prostate.data)

The first line imports and parses the data set prostate.csv from the given URL, storing it in H2O under a unique identifier (hex key), prostate.hex. The method h2o.importURL returns an H2OParsedData object containing the IP and port on which H2O resides, as well as the data set’s hex key, which we save to prostate.data. All references within R to this data set will now be through prostate.data. Hence, if we wanted to get summary statistics, we’d call summary(prostate.data). This displays the minimum, maximum, median, mean, and quantiles of each column of the data set, just like in R (only relevant columns are shown below):

CAPSULE         AGE              RACE            PSA               GLEASON
Min.   :0.000   Min.   :43.000   Min.   :0.000   Min.   :  0.300   Min.   :0.000
1st Qu.:0.000   1st Qu.:62.000   1st Qu.:1.000   1st Qu.:  5.132   1st Qu.:6.000
Median :0.000   Median :67.000   Median :1.000   Median :  5.132   Median :6.000
Mean   :0.403   Mean   :66.039   Mean   :1.087   Mean   : 15.409   Mean   :6.384
3rd Qu.:1.000   3rd Qu.:71.000   3rd Qu.:1.000   3rd Qu.: 14.795   3rd Qu.:7.000
Max.   :1.000   Max.   :79.000   Max.   :2.000   Max.   :139.700   Max.   :9.000

Running GLM (Generalized Linear Model)

Now that we have a sense of the data set’s structure, we can run a statistical analysis on it. Let’s fit a logistic regression, with CAPSULE as the response and AGE, RACE, PSA and GLEASON as the predictors. The GLM family in this case is binomial, with the default logit link function. (See Wiki for a more detailed mathematical explanation). In essence, we are studying how the probability of capsular involvement is affected by a patient’s age, race, PSA and Gleason score.
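Written out, the binomial model with a logit link described above takes roughly the following form. The symbols p and β0 through β4 are notation introduced here for illustration; they do not appear in the H2O output.

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\mathrm{AGE}
  + \beta_2\,\mathrm{RACE}
  + \beta_3\,\mathrm{PSA}
  + \beta_4\,\mathrm{GLEASON},
\qquad p = \Pr(\mathrm{CAPSULE} = 1)
```

Each coefficient is therefore the change in the log-odds of capsular penetration for a one-unit increase in the corresponding predictor, holding the others fixed.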

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","GLEASON"), data = prostate.data, family = "binomial", nfolds = 10, alpha = 0.5)

You should get as your result the following coefficients:

Coefficients:
       AGE      RACE       PSA   GLEASON Intercept
  -0.02119  -0.46410   0.02804   1.07613  -5.86616
Degrees of Freedom: 379 Total (i.e. Null); 374 Residual
Null Deviance:     512.3
Residual Deviance: 416.3    AIC: 426.3

Looking at the coefficients, we see that the log-odds (and by extension, the probability) of prostate capsular penetration increase with PSA and Gleason score, as expected, but decrease slightly with age. The negative RACE coefficient indicates that a patient who is black is estimated to be less likely to exhibit capsular involvement than one who is white, although it is unknown whether this is a direct effect or whether race is capturing some other characteristic excluded from the regression.
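To make the interpretation concrete, here is a minimal sketch in plain R (no H2O connection needed) that turns the reported coefficients into a predicted probability. The coefficients are copied from the output above; the patient values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the H2O GLM output above.
coefs <- c(Intercept = -5.86616, AGE = -0.02119, RACE = -0.46410,
           PSA = 0.02804, GLEASON = 1.07613)

# Hypothetical patient (illustrative values, not taken from the data set).
patient <- c(Intercept = 1, AGE = 65, RACE = 1, PSA = 10, GLEASON = 7)

# Linear predictor: the log-odds of capsular penetration for this patient.
log_odds <- sum(coefs * patient[names(coefs)])

# The inverse logit (plogis) converts log-odds into a probability.
prob <- plogis(log_odds)

# Odds ratios: multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(coefs[-1])

print(round(c(log_odds = log_odds, probability = prob), 3))
print(round(odds_ratios, 3))
```

An odds ratio above 1 (PSA, GLEASON) means the odds of capsular penetration rise as that predictor increases, while one below 1 (AGE, RACE) means they fall.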

Running K-Means Clustering

Now, let’s run the k-means algorithm to group similar patients into clusters. (See Wiki for a description of the mathematics). We start with k = 5 clusters, using only the columns AGE, RACE, GLEASON, CAPSULE and PSA for clustering:

prostate.km = h2o.kmeans(data = prostate.data, centers = 5, cols = c("AGE","RACE","GLEASON","CAPSULE","PSA"))
print(prostate.km)

K-means clustering with 5 clusters of sizes 278, 4, 23, 69, 6
Cluster means:
       AGE     RACE  GLEASON   CAPSULE        PSA
1 66.14947 1.071174 6.124555 0.3060498   7.107402
2 65.75000 1.250000 8.000000 1.0000000 131.175000
3 66.09091 1.227273 7.136364 0.7272727  55.213636
4 65.44776 1.089552 7.014925 0.6119403  23.876119
5 67.50000 1.166667 7.666667 1.0000000  86.500000

From our results, we see that 278 patients, the large majority, are in cluster 1, with a mean age close to 66 years and only about 30% exhibiting capsular penetration. The mean PSA and Gleason score of this cluster are the lowest of the five. In contrast, cluster 2 is the smallest, with only 4 patients, but all of them show capsular penetration and, as expected, they have the highest mean Gleason score and PSA. The clusters differ most sharply in PSA, which suggests that k-means has separated the patients into groups with distinct disease severity.
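For comparison, a similar clustering can be sketched with base R's kmeans function, which works on data held in local memory. The data frame below is synthetic (made-up values standing in for three of the prostate columns), so its cluster sizes and means will not match the H2O output.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Synthetic stand-in for the prostate columns (values are made up).
n <- 200
patients <- data.frame(
  AGE     = round(rnorm(n, mean = 66, sd = 6)),
  GLEASON = sample(5:9, n, replace = TRUE),
  PSA     = round(rexp(n, rate = 1 / 15), 1)
)

# Same idea as h2o.kmeans(..., centers = 5), but computed in local R memory.
# nstart = 10 keeps the best of 10 random starts; iter.max raises the
# iteration limit above the default of 10.
fit <- kmeans(patients, centers = 5, nstart = 10, iter.max = 50)

print(fit$size)     # number of patients in each cluster
print(fit$centers)  # per-cluster column means
```

The size and centers components correspond to the "sizes" and "Cluster means" sections of the printout above.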

While this tutorial used a relatively small data set, H2O lets you manipulate data sets far too large for conventional R to handle. With the H2O R package installed, you can treat them like any other data set in R, and H2O will do the heavy lifting in the background for you. Try it out yourself!
