Scalable Predictive Analytics for Big Data: A Q&A Spotlight with Clint Johnson of Alpine Data

Originally published 15 February 2012

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK offer the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Clint Johnson, Vice President of Customer Solutions for Alpine Data Labs. Ron and Clint discuss the benefits and challenges of predictive analytics and data mining in these days of “big data.”

Clint, predictive analytics and data mining have been used by companies to gain valuable insights from their data. But now with big data, there are new challenges, and companies are trying to effectively interpret the patterns in that data to gain insight into everything from fraud and risk management to customer retention and cross-selling. Before we get into the analytics, let's start with having you describe the big data landscape for us.

Clint Johnson: Sure. There are several items that I think are worth noting. Of course, there's just big data in general. We hear all the time about more and more data becoming available from many different sources, and more and more companies want to collect it. We're seeing this big pile-up of data that's now becoming available for analysis, reporting, and predictive analytics.

The other thing that we're seeing is that big data platforms are starting to mature. Over the last couple of years, we saw a huge consolidation of the analytic database platforms: Teradata bought Aster Data, EMC bought Greenplum, HP bought Vertica, and Oracle introduced its Exadata appliance. Now that these mainline vendors are offering their own analytic appliances, we're seeing really powerful combinations of hardware and database software that make managing and analyzing all of this data possible, and a lot easier than it had been in the past.

Along the lines of the big data appliances, we're seeing that Hadoop, the open source distributed storage and processing framework, has really taken off. With its low cost of entry, it's becoming pervasive, a real mainstay in a lot of organizations.

We're seeing Hadoop move beyond being just a departmental sandbox and starting to serve as a true production system in a lot of environments. As an example, we were working with a financial services company not long ago that had a very large Greenplum environment for its data warehouse. It turns out that the same company also had a Hadoop instance, and that Hadoop instance was ten times as large as the Greenplum data warehouse. They were using it to gather all of this data to build fraud detection patterns. The whole Hadoop effort at this institution had happened within roughly a year and was already a lot bigger than their data warehouse. So we're starting to see Hadoop really become pervasive.

If you put all of this together, there is more data coming in, the data platforms are maturing, and Hadoop is becoming much more mainstream. As a result, there is a great opportunity for developers to provide software packages that really let users and companies harness the horsepower of all the infrastructure they've been buying. All of those appliances can now be put to use for analyzing data.

Sure. And obviously, looking at this data with predictive analytics and data mining is Alpine Data's main focus. What are the biggest hurdles that companies have to overcome when they tackle these big data issues with data mining and predictive analytics?

Clint Johnson: Let's take a look at the traditional process of doing data mining. First, analysts start extracting data out of the data warehouse or source application system. After it's all extracted, it gets processed, formatted, transformed and munged together. That's about 60% to 80% of the actual data mining process, and it all happens before any analytic algorithms are run on the data. Finally, the data gets ported and formatted into a data mining application, oftentimes running on a departmental server. The analysts and data scientists crank through all of it and produce a formula or a model result. Once that result has been validated, it has to be incorporated back into the data warehouse or the operational environment so that the rest of the organization can use it, either for analysis or for scoring data. That's the process, and it's been this way for maybe 20 years.
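To make that workflow concrete, here is a minimal sketch of the traditional out-of-database process in Python. Everything in it is illustrative: the connection string, table names, columns and model choice are assumptions for the sake of the example, not anything from the interview.

```python
# Hypothetical sketch of the traditional out-of-database workflow:
# extract, transform locally, model, then load the results back.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression

engine = create_engine("postgresql://analyst@warehouse/edw")

# 1. Extract: pull (a sample of) the data out of the warehouse.
df = pd.read_sql("SELECT * FROM customer_features LIMIT 1000000", engine)

# 2. Transform and munge locally -- the 60-80% of the effort described above.
df["tenure_years"] = df["tenure_days"] / 365.0
features = df[["tenure_years", "num_products", "avg_balance"]].fillna(0)

# 3. Model on the departmental server.
model = LogisticRegression().fit(features, df["churned"])

# 4. Score and load the results back into the warehouse so the rest of
#    the organization can use them.
df["churn_score"] = model.predict_proba(features)[:, 1]
df[["customer_id", "churn_score"]].to_sql(
    "churn_scores", engine, if_exists="replace", index=False)
```

Every step except the modeling itself is data movement, which is exactly where this process strains as volumes grow.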

But as datasets get bigger and more data comes down the pipe, this process really breaks down. It gets really hard to move around terabytes or petabytes of data; it's hard to process it, and it's hard to work with it. As the data gets large, analysts are left taking sample sets of the data to build their models. And even after they've built their models on a sample set, they still have to get the results back into the data warehouse. So the big hurdles that we see out there, and that Alpine Miner is here to address, are around integration: being able to scale with the data and process the largest datasets. If your database engine can process the data, Alpine Miner will get you the results without moving the data out of the environment.

Well Clint, there are a lot of other vendors with predictive analytics solutions, such as SAS, IBM SPSS and Revolution Analytics. Why do you feel your product, Alpine Miner, is a good choice for companies that want to mine big data, and is Alpine Miner compatible with all the database platforms?

Clint Johnson: The whole reason the analytics process I described a minute ago exists is that these traditional data mining packages have to run on monolithic servers. They really don't have the scalability that these new analytic database and Hadoop engines offer today. Alpine Miner takes a very different approach to doing analytics: it leverages the horsepower of the database to run analytic functions and algorithms directly in the kernel of the system. We do it right where the data lives. Analysts using Alpine Miner can explore the data, transform it, format it, analyze it, and score it, and never have to pipe the results out.

If you think about using some of these traditional tools to build models on maybe two or three hundred terabytes of data, they don't scale well at all. The core differentiator of Alpine Miner compared to traditional tools like SAS, IBM SPSS, and Revolution R really comes down to scalability: we can scale as big as your data is. Because users are running these algorithms directly in the database, they can analyze a lot more data, churn through more models, and produce better results faster than with some of these traditional tools. Once the models are complete, the model output gets applied directly in the database, and users can iterate faster and faster through the modeling process.
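For contrast with the extract-and-model sketch earlier, here is what the in-database pattern looks like in its simplest form: the work is pushed to the database as SQL and only a tiny result comes back. This is a generic illustration of the technique, not Alpine Miner's actual implementation; the table, columns and model coefficients are made up.

```python
# Generic in-database scoring sketch: instead of extracting rows,
# send the computation to the database engine as SQL.
import psycopg2

conn = psycopg2.connect("dbname=edw user=analyst host=warehouse")

# Score every customer with a previously fitted logistic model directly
# in SQL; the coefficients below are hypothetical placeholders.
score_sql = """
CREATE TABLE churn_scores AS
SELECT customer_id,
       1.0 / (1.0 + EXP(-(-2.1
                          + 0.8  * tenure_days / 365.0
                          - 0.3  * num_products
                          + 0.02 * avg_balance / 1000.0))) AS churn_score
FROM customer_features;
"""
with conn, conn.cursor() as cur:
    cur.execute(score_sql)   # all the computation happens in the database
```

Because the scoring runs where the data lives, it scales with the database engine rather than with whatever departmental server the analyst happens to have.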

So Clint, when I look at Alpine Miner, it doesn't require any additional hardware, it's not necessary to move data out of the data warehouse, and it processes the data using the database engine. But wouldn't it seem to make more sense to move the analytic processing to an external engine so the processing doesn't affect other activities of the database? Why does Alpine Miner take the in-database approach?

Clint Johnson: The reason we did it is to address scalability. If you have tools that can run the analytics directly within the database, then the only issue left is workload management: making sure that the analytics process doesn't affect the normal reporting, BI, or data processing going on in the system. I think that was a problem earlier on with some of these platforms, but as the database appliances mature, it's becoming less of an issue. What we're seeing in these systems are much more robust workload management features. Administrators can provision out the system's resources, manage security, and handle user provisioning. All of that happens within the database anyway, and all of it can be leveraged directly by Alpine users.
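As a small illustration of what that workload management looks like in practice, here is a sketch using Greenplum-style resource queues. The queue names, limits and role are invented for the example, and the exact attributes vary by platform and version.

```python
# Sketch of workload management with Greenplum-style resource queues:
# analytic jobs get their own constrained queue so they cannot starve
# the BI and reporting workload running on the same system.
import psycopg2

conn = psycopg2.connect("dbname=edw user=gpadmin host=warehouse")
with conn, conn.cursor() as cur:
    # Cap concurrent analytic statements and lower their priority
    # relative to the reporting queue.
    cur.execute("CREATE RESOURCE QUEUE analytics_q "
                "WITH (ACTIVE_STATEMENTS=5, PRIORITY='LOW');")
    cur.execute("CREATE RESOURCE QUEUE reporting_q "
                "WITH (ACTIVE_STATEMENTS=40, PRIORITY='HIGH');")
    # Route the data-mining users through the constrained queue.
    cur.execute("ALTER ROLE miner_user RESOURCE QUEUE analytics_q;")
```

Since in-database analytics sessions are just database sessions, they inherit this provisioning automatically.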

Very intriguing. Can you give us some examples of how your customers are using Alpine?

Clint Johnson: Let's go back to the financial services customer. They're using Alpine for two main processes. One is an offer recommendation system. The model looks at all of their customers, the product sets, and the profitability of each of those customers, and generates a list of recommended products that, number one, the customer is likely to want and, number two, are very likely to increase the customer's profitability to the bank. This has been built out within the Alpine environment there, and they back-tested the model. One of the very interesting things they came across was that when customers went to a branch and opened a new account, they were opening the product the recommendation engine proposed only 1% of the time. The other 99% were just opening the marketing flavor of the month.

But what gets really interesting is that in that 1% of cases, 85 times out of 100 the average profitability of the customer relationship went up by $10 a month, or $120 a year. When customers just opened the marketing flavor of the month, profitability stayed the same. So you can see a really big ROI opportunity if, through its branch channel, the teller line and its online system, this bank could offer targeted recommendations and tailored products to its customer base of about 2 million customers. There's a lot of money to be made. That's just one example of Alpine use that we're seeing.
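Running the interview's numbers as a quick back-of-envelope calculation shows why. The 1% adoption figure, the 85-in-100 hit rate and the $120-a-year uplift come straight from the example; the 10% scenario is purely hypothetical, and the calculation simplifies by treating adoption as a fraction of the customer base per year.

```python
# Back-of-envelope version of the numbers in the bank example.
customers       = 2_000_000
uplift_per_year = 120.0   # $10/month when the recommended product is opened
hit_rate        = 0.85    # profitability rose in 85 of 100 such cases

def annual_uplift(adoption_rate):
    """Expected incremental profit if `adoption_rate` of customers
    open the recommended product in a given year."""
    return customers * adoption_rate * hit_rate * uplift_per_year

print(annual_uplift(0.01))   # today's 1% adoption:  about $2.0M/year
print(annual_uplift(0.10))   # hypothetical 10%:     about $20.4M/year
```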

We're also doing some work in the healthcare space, building out predictive models around clinical tests and weeding out drugs that don't work or that have some toxicity. We can sort through all of this test data, hone in on the results of a model, and get to a better product much quicker.

That's excellent. With Alpine Data’s focus on big data scalability, can you make some predictions about the future of predictive analytics and the future of Alpine Miner?

Clint Johnson: As far as predictive analytics goes, this is a green field, a blue ocean, or whatever you want to call it. I think people, companies and users intuitively recognize that there's value in their data if they could somehow extract it. So there's a big opportunity for a software company like Alpine to provide tools that really let them extract that value without the frustration.

As far as Alpine goes, we have a good roadmap set up. We’re on Version 2 of our first product, Alpine Miner. Version 2 right now works with the Greenplum platform, Oracle's Exadata, as well as standard Oracle 11g databases. It works with Postgres. We have a DB2 offering. We have a Netezza offering that's in the works, and we expect to have a Hadoop offering sometime in 2012. We're really trying to be platform independent and let users use the data platform of their choice.

We are also about to release a server edition of the product that we'll be demonstrating at the Strata conference this year. The server edition lets the data scientists who are using Alpine build their models and publish them to the server; other users within the organization can then subscribe to those models. There are some very robust model management capabilities in the server edition, including instructions, assumptions and documentation, and it really puts the models in front of a much broader audience. We're trying to enable broader participation by all the analysts, not just the PhDs and the data scientists. We want to get these models into a lot more hands, let people score their own local data and really localize the analytics, while taking advantage of all the work of the data scientists.
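In the abstract, the publish/subscribe flow he describes looks something like the sketch below. This is a generic pattern with an in-memory dictionary standing in for the server, not Alpine's actual API; every name in it is invented for illustration.

```python
# Generic model publish/subscribe sketch: a data scientist publishes a
# fitted model along with its documentation; an analyst fetches it and
# scores local data. A dict stands in for the server's model store.
import pickle

MODEL_REGISTRY = {}   # stand-in for the server-side model store

def publish(name, model, docs):
    """Data scientist side: push the model plus its assumptions/notes."""
    MODEL_REGISTRY[name] = {"blob": pickle.dumps(model), "docs": docs}

def subscribe(name):
    """Analyst side: pull the model and read its documentation."""
    entry = MODEL_REGISTRY[name]
    print(entry["docs"])          # instructions, assumptions, caveats
    return pickle.loads(entry["blob"])

# Usage: publish("offer_model_v2", fitted_model, "Assumes monthly features.")
#        model = subscribe("offer_model_v2"); model.predict(local_data)
```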

Clint, thank you for taking the time to talk with me about Alpine Data’s approach for big data predictive analytics and data mining.


  • Ron Powell
    Ron, an independent analyst and consultant, has an extensive technology background in business intelligence, analytics and data warehousing. In 2005, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). Ron also has a wealth of consulting expertise in business intelligence, business management and marketing. He may be contacted by email at rpowell@wi.rr.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.

