Run H2O From Within R

With the REST API, it's simple to run H2O operations from within R using syntax similar to that of your favorite R functions. In this post, we'll walk through a simple demo of its capabilities. First, get H2O installed and running by following the tutorial here. Once you have the R package loaded, you can list the included demos by typing demo(package="h2o") and run one of them by typing, for example, demo(h2o.glm). We'll be stepping through a few of the basic statistical functions in this tutorial.

Starting up H2O in R

    library(h2o)
    localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
    h2o.checkClient(localH2O)

The beginning of each H2O R script looks the same: first load the R package, then create an H2OClient object containing the IP address and port at which H2O resides. If you are running H2O on your local machine, the defaults are ip = "127.0.0.1" and port = 54321. You can call h2o.checkClient to check that H2O is reachable. Once that's done, we are ready to work with some data!
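
If the client check fails, it can help to first confirm that anything at all is listening on the expected port. The following is a minimal sketch in base R, not part of the h2o package: the helper name port_is_open is hypothetical, and a TRUE result only shows that some process accepts TCP connections on that port, not that it is H2O.

```r
# Hypothetical helper (not part of the h2o package): returns TRUE if a TCP
# connection to ip:port can be opened within `timeout` seconds, else FALSE.
port_is_open <- function(ip = "127.0.0.1", port = 54321, timeout = 2) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host = ip, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    )
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Defaults match the local address and port used in this tutorial.
if (!port_is_open()) {
  message("Nothing is listening on 127.0.0.1:54321; start H2O first.")
}
```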

Importing and Summarizing Data

In this tutorial, we will be working with the prostate cancer data set, which comes from a study by Dr. Donn Young at The Ohio State University Comprehensive Cancer Center of patients with varying degrees of prostate cancer. The relevant columns are CAPSULE (binary variable indicating tumor penetration of the prostatic capsule), AGE (in years), RACE (1 = white, 2 = black), PSA (prostate-specific antigen value), and GLEASON (total Gleason score, indicating how aggressive the cancer is). See Applied Logistic Regression by Hosmer and Lemeshow (2000) for more details.

    prostate.data = h2o.importURL(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
    summary(prostate.data)

The first line imports and parses the data set prostate.csv from the given URL, storing it in H2O under a unique identifier (hex key), prostate.hex. The method h2o.importURL returns an H2OParsedData object containing the IP and port on which H2O resides, as well as the data set's hex key, which we save to prostate.data. All references within R to this data set will now be through prostate.data. Hence, if we wanted to get summary statistics, we'd call summary(prostate.data). This displays the minimum, maximum, median, mean, and quantiles of each column of the data set, just like in R (only relevant columns are shown below):

       CAPSULE         AGE              RACE            PSA               GLEASON
 Min.   :0.000   Min.   :43.000   Min.   :0.000   Min.   :  0.300   Min.   :0.000
 1st Qu.:0.000   1st Qu.:62.000   1st Qu.:1.000   1st Qu.:  5.132   1st Qu.:6.000
 Median :0.000   Median :67.000   Median :1.000   Median :  5.132   Median :6.000
 Mean   :0.403   Mean   :66.039   Mean   :1.087   Mean   : 15.409   Mean   :6.384
 3rd Qu.:1.000   3rd Qu.:71.000   3rd Qu.:1.000   3rd Qu.: 14.795   3rd Qu.:7.000
 Max.   :1.000   Max.   :79.000   Max.   :2.000   Max.   :139.700   Max.   :9.000

Running GLM (Generalized Linear Model)

Now that we have a sense of the data set's structure, we can run a statistical analysis on it. Let's fit a logistic regression with CAPSULE as the response and AGE, RACE, PSA and GLEASON as the predictors. The GLM family in this case is binomial, with the default logit link function. (See Wiki for a more detailed mathematical explanation.) In essence, we are studying how the probability of capsular involvement is affected by a patient's age, race, PSA and Gleason score. The call below also sets nfolds = 10 for 10-fold cross-validation and alpha = 0.5 for the elastic net mixing parameter.

    h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","GLEASON"), data = prostate.data, family = "binomial", nfolds = 10, alpha = 0.5)

You should get as your result the following coefficients:

Coefficients:
      AGE      RACE       PSA   GLEASON Intercept
 -0.02119  -0.46410   0.02804   1.07613  -5.86616
Degrees of Freedom: 379 Total (i.e. Null);  374 Residual
Null Deviance:     512.3
Residual Deviance: 416.3  AIC: 426.3

Looking at the coefficients, we see that the log-odds (and by extension, the probability) of prostate capsular penetration increase with PSA and Gleason score, as expected, but decrease slightly with age. The negative RACE coefficient suggests that a black patient is less likely to exhibit capsular involvement than a white one, although the output above does not include a significance test, and it is unknown whether this is a direct effect or whether race is capturing some other characteristic excluded from the regression.
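
To make the log-odds interpretation concrete, the coefficients printed above can be converted to odds ratios and to a predicted probability in base R. This is only a sketch: it assumes the printed coefficients are on the original scale of the predictors, and the patient values are hypothetical, chosen for illustration rather than taken from the data set.

```r
# Coefficients copied from the GLM output above.
coefs <- c(AGE = -0.02119, RACE = -0.46410, PSA = 0.02804,
           GLEASON = 1.07613, Intercept = -5.86616)

# Odds ratio for a one-unit increase in each predictor:
# above 1 raises the odds of capsular penetration, below 1 lowers them.
predictors <- setdiff(names(coefs), "Intercept")
odds_ratios <- exp(coefs[predictors])
print(round(odds_ratios, 3))

# Hypothetical patient (illustrative values, not a row of the data set).
patient <- c(AGE = 65, RACE = 1, PSA = 10, GLEASON = 7)

# Linear predictor (log-odds), then the inverse logit gives a probability.
log_odds <- coefs[["Intercept"]] + sum(coefs[predictors] * patient[predictors])
prob <- plogis(log_odds)
print(round(prob, 3))
```

An odds ratio is multiplicative: each additional Gleason point multiplies the odds by exp(1.07613), while each additional year of age multiplies them by exp(-0.02119), which is just under 1.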

Running K-Means Clustering

Now, let's run the k-means algorithm to group similar patients into clusters. (See Wiki for a description of the mathematics.) We start with k = 5 clusters, using the columns AGE, RACE, GLEASON, CAPSULE and PSA:

    prostate.km = h2o.kmeans(data = prostate.data, centers = 5, cols = c("AGE","RACE","GLEASON","CAPSULE","PSA"))
    print(prostate.km)

K-means clustering with 5 clusters of sizes 278, 4, 23, 69, 6
Cluster means:
       AGE     RACE  GLEASON   CAPSULE        PSA
1 66.14947 1.071174 6.124555 0.3060498   7.107402
2 65.75000 1.250000 8.000000 1.0000000 131.175000
3 66.09091 1.227273 7.136364 0.7272727  55.213636
4 65.44776 1.089552 7.014925 0.6119403  23.876119
5 67.50000 1.166667 7.666667 1.0000000  86.500000

From our results, we see that 278 patients, the large majority, are in cluster 1, with a mean age close to 66 years and only about 30% exhibiting capsular penetration. The mean PSA and Gleason score of this cluster are the lowest of the five. In contrast, cluster 2 is the smallest, with only 4 patients, but they all show capsular penetration and, as expected, have far higher Gleason scores and PSA. The clusters differ mainly in PSA level, and higher-PSA clusters also show higher rates of capsular penetration, so k-means has separated the patients into groups that look clinically distinct.
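
One way to see which variables separate the clusters is to compare the printed cluster means directly. The sketch below uses only base R and the numbers from the output above. It inspects the reported centers and nothing else; whether H2O rescales the columns before clustering is not stated here, so treat the result as a description of the output rather than of the algorithm.

```r
# Cluster means copied from the k-means output above.
centers <- rbind(
  c(66.14947, 1.071174, 6.124555, 0.3060498,   7.107402),
  c(65.75000, 1.250000, 8.000000, 1.0000000, 131.175000),
  c(66.09091, 1.227273, 7.136364, 0.7272727,  55.213636),
  c(65.44776, 1.089552, 7.014925, 0.6119403,  23.876119),
  c(67.50000, 1.166667, 7.666667, 1.0000000,  86.500000)
)
colnames(centers) <- c("AGE", "RACE", "GLEASON", "CAPSULE", "PSA")
rownames(centers) <- 1:5

# Spread (max minus min) of each variable across the five centers.
spread <- apply(centers, 2, function(col) max(col) - min(col))
print(round(spread, 2))

# Largest Euclidean distance between any two centers, with and without PSA.
dist_all    <- dist(centers)
dist_no_psa <- dist(centers[, colnames(centers) != "PSA"])
print(round(max(dist_all), 1))
print(round(max(dist_no_psa), 1))
```

PSA has by far the largest spread across centers, and dropping it shrinks the distances between centers dramatically, which is consistent with the clusters being ordered mostly by PSA level.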

While this tutorial used a relatively small data set, H2O gives you the ability to manipulate data sets too large for conventional R to handle. With the H2O R package installed, you can treat them like any other data set in R, and H2O will do the heavy lifting in the background for you. Try it out yourself!

References