What is the history of Big Data and Machine Learning?

The term “Big Data” originally referred to the Internet and its glut of information: early Internet managers worried that they would be unable to store and process the data this new technology generated. While the term itself is traceable to a paper from the late 1990s, the concept significantly predates it.

  • In 1880, the Census Bureau announced that it would take 8 years to process all the data it had collected. The following year, Census Bureau engineer Herman Hollerith created the Hollerith Tabulating Machine to tackle the task. It stored data on punch cards and reduced the job from 8 years to a matter of months. Hollerith founded a small company around his invention in 1896, which merged with others in 1911 to form the Computing-Tabulating-Recording Company. In 1924, the company renamed itself International Business Machines, or IBM as we know it today.
  • In 1937, the new Social Security Administration contracted IBM to track earnings for 26 million Americans and 3 million employers. To handle this monumental task, IBM developed new technology that automatically processed punch cards.
  • In 1952, the newly formed National Security Agency contracted over 12,000 cryptologists to work through the overwhelming amount of information it amassed during the Cold War, an undertaking that would consume 10 years.
  • In 1965, the federal government started a project to store over 742 million tax returns and 175 million sets of fingerprints. Politicians scrapped the project because of public concern, but experts recognized its massive potential.

So, if we look at “big data” as the challenge of storing and retrieving data sets much larger than our current computational capacity can handle, managers and executives will recognize that it has been an issue for far longer than it has had an official name.

Machine learning, a subfield of computer science and artificial intelligence, emerged in the 1950s. General artificial intelligence refers to creating sentient machines: machines with something like emotions and consciousness. A narrower version of artificial intelligence refers to automating tasks using a combination of statistical principles and computing algorithms, some of which operate in a manner loosely modeled on the human brain.

If big data refers to storage and retrieval, machine learning refers to processing and analysis.

Over the last five years, many have used the two terms interchangeably. For this post, I’ll use machine learning to mean the processing and analysis of large data sets.

Is Machine Learning over-hyped?

Sometimes. The issue is that overpromising results often leads to disappointment. Here are three examples of how machine learning is not being used to its full potential.

1) The Wrong Tool for the Job
Companies often reach for machine learning as the default tool for simple problems. A programmer might enjoy playing with machine learning technology but may not have the training or expertise to recognize which tool is appropriate for the task. For example, when forecasting sales, an organization may try a complex, expensive algorithm when a cheaper, simpler one would achieve better results. Complex problem solving takes time and research to set up correctly. In another example, I was working for a technology vendor that was attempting to explain customer attrition using an immense amount of data and a complex algorithm. We ended up running a more accurate analysis with a much simpler algorithm just by switching our data source. Machine learning, a black-box approach by design, requires training the algorithm to a certain level of assured accuracy before releasing it into the wild. In summary, complex algorithms are often set up to solve basic problems, and machine learning is not used to its potential. A small sketch of the “try the simpler tool first” check appears below.
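
To make the point concrete, here is a minimal sketch of that “try the simpler tool first” check. It assumes scikit-learn and uses synthetic stand-in data (it is not the vendor project above): train a basic logistic regression and a more complex gradient-boosted model on the same attrition-style data, and only keep the complexity if cross-validated accuracy actually improves.

```python
# Minimal sketch: compare a simple model to a complex one before committing.
# Assumes scikit-learn; the data here is synthetic stand-in data, not real
# customer-attrition records.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for a customer-attrition table: 5,000 customers, 20 features.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

simple_model = LogisticRegression(max_iter=1000)
complex_model = GradientBoostingClassifier(random_state=0)

simple_acc = cross_val_score(simple_model, X, y, cv=5, scoring="accuracy").mean()
complex_acc = cross_val_score(complex_model, X, y, cv=5, scoring="accuracy").mean()

print(f"simple model accuracy:  {simple_acc:.3f}")
print(f"complex model accuracy: {complex_acc:.3f}")
# If the gap is negligible, the cheaper, simpler model is the right tool.
```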

2) Habits are Hard to Break
An organization may lack the culture to act on an algorithm’s recommendations. This applies to any form of analytics. When dashboards and scorecards arrived, many companies quickly adopted them, but few made them part of the management and decision process. So managers made decisions, and still do to this day, based on anecdotal evidence or a “we’ve always done it that way” approach. This creates a strategic catastrophe, akin to driving 60 miles per hour through a residential area at night, with no streetlights and your headlights off. In other words, organizations have access to complex algorithms and data but haven’t changed their habits to use them in decision making.

3) Decision Making isn’t Part of the Process
Programmers may create a machine learning algorithm in a business vacuum. If, at the end of the algorithm’s process, stakeholders aren’t using the results to make decisions, then at best the machine runs without supervision. At worst, it spins its wheels, costing money to produce something no one will act on, which means no savings or revenue increase will materialize. To summarize, if stakeholders and decision makers don’t understand how to interpret the output so that it applies to business questions and issues, the whole effort is rendered useless and wasteful.

Where’s the value?

First, a note on machine learning vs. statistics: the two overlap, but machine learning does a better job of sitting in an ongoing or live production environment. Statistics, which is more appropriate for after-the-fact analysis, also has many rules about which method to use and under what circumstances an analysis should be considered “invalid”. Machine learning has fewer rules; a programmer usually measures success with a single metric: the accuracy of the algorithm.

The value of machine learning depends on the task. If a process needs to evolve frequently over time, use machine learning; a minimal sketch of that kind of recurring retraining follows below. A one-off analysis or static process should use statistics, since machine learning would be overkill and possibly less accurate.
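
Here is a minimal sketch of that distinction, assuming scikit-learn and a hypothetical load_latest_data() feed: an evolving process gets its model refreshed on a schedule, while a one-off question is answered once with a static analysis and nothing needs to keep running in production.

```python
# Minimal sketch: retrain on a schedule when the process keeps evolving.
# Assumes scikit-learn; load_latest_data() is a hypothetical stand-in for
# whatever feeds the live process.
import numpy as np
from sklearn.linear_model import SGDClassifier

def load_latest_data(rng):
    """Hypothetical stand-in: returns the newest batch of features and labels."""
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    return X, y

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)

# Evolving process: refresh the model each cycle (e.g., nightly) as data arrives.
for cycle in range(3):
    X, y = load_latest_data(rng)
    model.partial_fit(X, y, classes=np.array([0, 1]))
    print(f"cycle {cycle}: accuracy on this batch = {model.score(X, y):.3f}")

# A one-off question, by contrast, is answered once with a static analysis
# (a classical statistical test); no model needs to keep running afterward.
```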

What are your thoughts on this big phenomenon in healthcare?