How to get the advantages of Big Data on your laptop
If you’ve been reading anything lately on business and technology, you can’t have missed the flood of articles on Big Data. Many of them imply that if you aren’t ‘doing’ Big Data, your competition will speed past you on their way to taking your piece of the market. But the hype overlooks something you learned a long time ago: the Pareto Principle. The same 80/20 rule you’ve relied on before will help you capture most of the value from your data.
Applying the Pareto Principle to Big Data
The first step is to narrow down which data actually matters to your analysis. Tens of thousands of customers and millions of orders generate a ton of data, most of it irrelevant to any single question. If you want to increase your average order size, for instance, you need transaction details such as expenditure totals and number of items purchased. You don’t need customer demographics, supply-chain history, or financials.
Here is the first gain from applying the Pareto Principle: even though you may have a lot of records, you only need a small set of attributes for analysis, and that set may fit nicely in memory on your laptop. After all, computing averages, running regressions, and even applying many sophisticated machine learning algorithms can all be done with the basic numerical data found in most operational databases.
You can keep leveraging this small data from your operational systems before going after more granular or real-time data from your web or transaction logs. Cohort analysis, customer segmentation, customer lifetime value, vendor performance, and supply-chain efficiency are just some of the analyses that will bring plenty of return on investment without Big Data infrastructure and methods.
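As a taste of how little machinery these analyses need, here is a minimal customer-value sketch over plain order records (the customer names and amounts are made up). A real lifetime-value analysis would project these historical figures forward with churn and purchase-frequency estimates, but the aggregation itself is trivial:

```python
# A minimal customer-value sketch over plain order records.
# Customer IDs and order totals below are hypothetical.
from collections import defaultdict

orders = [
    # (customer_id, order_total)
    ("alice", 120.0), ("alice", 80.0), ("alice", 100.0),
    ("bob", 40.0), ("bob", 60.0),
    ("carol", 300.0),
]

spend = defaultdict(float)   # total spend per customer
counts = defaultdict(int)    # number of orders per customer

for customer, total in orders:
    spend[customer] += total
    counts[customer] += 1

for customer in spend:
    avg = spend[customer] / counts[customer]
    print(f"{customer}: total {spend[customer]:.2f}, avg order {avg:.2f}")
```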
Even once you have analyzed this data and are ready to look at web or transaction logs, you may find that you don’t need to run every record through your statistical or machine learning algorithm. Unless you are looking for exceptional records (the needles in the haystack), as in fraud detection or credit defaults, sampling methods will almost certainly give you the results you need. See a Berkeley MOOC course on sampling for an introduction to sampling and why it provides valid results when properly applied.
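A quick demonstration of why sampling works, using synthetic data: a 0.1% random sample recovers the population average closely.

```python
# Synthetic demonstration: a small random sample recovers an aggregate
# statistic from a much larger population.
import random
import statistics

random.seed(0)

# One million "order totals" with a known average of exactly 49.5.
population = [i % 100 for i in range(1_000_000)]

# Draw a simple random sample of 0.1% of the records.
sample = random.sample(population, 1_000)

print(f"population mean: {statistics.fmean(population):.2f}")  # 49.50
print(f"sample mean:     {statistics.fmean(sample):.2f}")      # close to 49.50
```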
Many databases, including Microsoft SQL Server and Oracle, provide sampling clauses in their SQL dialects. These systems give you two options. The first selects random pages from the table and returns the records on those pages; it is quite fast, since only the selected pages need to be read. The second scans the table and randomly selects records as it goes. While much slower, this second method provides better randomness. Both are useful: I use the first to iterate quickly through many different models, then use the second as a final check.
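The difference between the two options can be illustrated with a small simulation (this models the idea, not any database’s actual internals): page-level sampling reads whole random pages, while row-level sampling scans everything and keeps each row with some probability.

```python
# Illustrative simulation of the two sampling options:
# (1) pick whole random pages; (2) scan all rows, keep each with prob. p.
import random
import statistics

random.seed(7)

PAGE_SIZE = 100
values = [random.gauss(50, 10) for _ in range(200_000)]
pages = [values[i:i + PAGE_SIZE] for i in range(0, len(values), PAGE_SIZE)]

# Option 1: page-level sampling -- read 1% of pages, keep every row on them.
sampled_pages = random.sample(pages, len(pages) // 100)
page_sample = [row for page in sampled_pages for row in page]

# Option 2: row-level sampling -- scan every row, keep each with p = 1%.
row_sample = [v for v in values if random.random() < 0.01]

print(f"true mean:        {statistics.fmean(values):.2f}")
print(f"page-sample mean: {statistics.fmean(page_sample):.2f}")
print(f"row-sample mean:  {statistics.fmean(row_sample):.2f}")
```

With rows assigned to pages arbitrarily, as here, both estimates land near the true mean. In a real table, rows on the same page are often correlated (for example, they were inserted together), which is exactly why the slower row-level scan gives better randomness.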
The important message here is that you don’t need Big Data infrastructure and techniques to extract important information from your data. You can get a long way with your existing tools and a laptop. In future posts I will describe how we have used relatively inexpensive small-data techniques at Coherent to help clients get much of what Big Data promises, at a fraction of the cost.