All About AutoML: H2O Framework with Examples

Machine learning (ML) has become a very popular field. Customer segmentation, sales prediction, product recommendation: almost every business needs it to work more effectively or to gain a competitive advantage in its niche. But employee salaries, servers, infrastructure for software integration, research, and the lack of any guarantee of results are factors that make you think twice before investing in this area. There are always ways to reduce the costs, however, and one of them is an AutoML service.

All you need to do is upload your data (usually tabular data), and the AutoML service will apply known techniques, frameworks, and best practices to build as good a model as it can. Getting started is usually free of charge, but you will pay per request once you start using the model. Of course, Google and Amazon are the kings of this market and will offer you all kinds of ways to spend your money.

But what else does the AutoML field offer? One of the most powerful and popular frameworks here is H2O. It saves you a lot of time by automating much of the routine computation for you.

Maybe you know nothing about machine learning but have to build a prediction model on your data. Maybe you are a machine learning engineer who is tired of doing the same things every time: splitting data into train and test sets, cross-validation, hyperparameter optimization, embedding, stacking, and so on. Or maybe you take part in machine learning competitions and want to spend less time on routine work. In all these cases H2O can save you time.
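To give a feeling for the routine work mentioned above, here is a minimal sketch of k-fold cross-validation bookkeeping in plain Python. The function name `k_fold_indices` and its parameters are made up for this illustration and are not part of H2O; an AutoML tool does this kind of splitting for you.

```python
import random


def k_fold_indices(n_rows, k=5, seed=1):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only, not an H2O function.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        # Everything outside the current fold is used for training.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


if __name__ == "__main__":
    for fold_no, (train_idx, valid_idx) in enumerate(k_fold_indices(10, k=5), start=1):
        print(fold_no, sorted(valid_idx), len(train_idx))
```

Every row lands in exactly one validation fold, so each model is scored on data it has not seen during training.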

Let’s see some example code:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

aml.leader

preds = aml.leader.predict(test)

Well, we uploaded the data, defined our target column, and set up H2O to train up to 20 models (max_models=20) and choose the best one to predict results. Fantastic, isn’t it?

Yes, it is. But if it is so simple, why are there so many online courses about machine learning, mathematical formulas, and so on? The biggest challenge is not building a model but making that model better. And this is where data preprocessing comes into play. For example, say you want to predict house prices (yes, you are right… it’s a classic machine learning case).

You have some parameters: the area of the house in square meters, the number of bedrooms, whether there is a swimming pool, whether there is a garage, and the year the house was built. So we have 5 parameters and the final price. H2O will do everything possible to find the dependencies between these columns and build a model.

Suppose the model reaches 65% accuracy (meaning that out of 100 predictions, 65 land close enough to the real price to count as correct and 35 do not). Is that enough for your case? Can we do something to increase accuracy? Yes, we can, and this is where the real work starts. What other “real world parameters” can influence house prices? The location of the house, its decoration, the view from the windows, which schools are nearby, whether it is in a big town, whether it is by the sea, whether the area is safe to live in, and so on.
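Because a price is a number, calling a prediction “correct” needs a tolerance. The sketch below assumes a prediction counts as correct when it is within 10% of the real price; that threshold, the function names, and the prices are all invented for illustration. It also prints the mean absolute error, a more common way to score price predictions.

```python
def within_tolerance_accuracy(actual, predicted, tolerance=0.10):
    """Share of predictions whose relative error is at most `tolerance`.

    The 10% default is an assumption for this example, not a standard.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1 for a, p in zip(actual, predicted)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between predicted and real values."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices, for illustration only.
actual = [200_000, 350_000, 500_000, 120_000]
predicted = [210_000, 300_000, 505_000, 150_000]

print(within_tolerance_accuracy(actual, predicted))  # 0.5
print(mean_absolute_error(actual, predicted))        # 23750.0
```

Here two of the four predictions fall within 10% of the real price, so the “accuracy” is 50%, while the average miss is 23,750.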

So many things may influence the price, but how can we measure them and turn them into parameters? We can add 1 or 0 for the seaside (present or not present), the average price of houses sold in the area over the last 2-3 years, the number of schools, hospitals, shops, and cinemas in the area, or the number of people living in the city. I think you get the point. Instead of 5 parameters, we will have 25 or maybe even 125. If the new columns really carry information about the price, accuracy will very likely improve. That’s how it works.
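As a rough sketch of how such extra columns can be produced, the plain-Python example below adds a seaside flag, the average recent sale price in the district, and the number of schools to each house. All data, field names, and the 1 km seaside threshold are invented for illustration.

```python
from statistics import mean

# Hypothetical raw data, invented for this example.
houses = [
    {"id": 1, "area_m2": 120, "district": "harbor", "distance_to_sea_km": 0.4},
    {"id": 2, "area_m2": 85, "district": "center", "distance_to_sea_km": 6.0},
]
recent_sales = [
    {"district": "harbor", "price": 410_000},
    {"district": "harbor", "price": 390_000},
    {"district": "center", "price": 250_000},
]
schools_per_district = {"harbor": 2, "center": 5}


def add_features(houses, recent_sales, schools_per_district, seaside_km=1.0):
    """Return copies of the house rows with three extra columns."""
    # Group recent sale prices by district, then average them.
    prices_by_district = {}
    for sale in recent_sales:
        prices_by_district.setdefault(sale["district"], []).append(sale["price"])
    avg_price = {d: mean(prices) for d, prices in prices_by_district.items()}

    enriched = []
    for house in houses:
        row = dict(house)  # copy, so the input rows stay unchanged
        row["seaside"] = 1 if house["distance_to_sea_km"] <= seaside_km else 0
        row["avg_area_price"] = avg_price.get(house["district"])  # None if no sales
        row["n_schools"] = schools_per_district.get(house["district"], 0)
        enriched.append(row)
    return enriched


enriched = add_features(houses, recent_sales, schools_per_district)
for row in enriched:
    print(row)
```

The enriched rows can then be saved as a CSV file and loaded with h2o.import_file, exactly like the training data in the example code.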

Machine learning is a wide field. It’s like building websites: one site can cost 50 dollars to build, another 50,000 dollars.

H2O can be a good way to save time and money and quickly get some good results.

Conclusion

If you have half an hour to build a machine learning model, copy and paste the code above, put your data in, and get results.
If you have several days, think about what else can influence the results and how it can become a new column in your data set, then follow the step above.
If you want the best possible model accuracy, you should learn what happens under the hood in data science: what backpropagation, random forests, gradient boosted trees, and so on are. You will need this for a deep understanding of what H2O is doing and how it can be improved. The good news is that there are hundreds of courses and thousands of videos that can help you.

I hope this article answered the main questions: when, why, and how you can use AutoML frameworks.
