Big Data Is Dead: Why Data Size Doesn't Matter Anymore
This blog post argues that the era of "Big Data" is over: instead of worrying about how much data we have, we should focus on using that data effectively to make better decisions.
- Data Size Is Not the Problem: The author, who previously worked on Google BigQuery and at SingleStore, draws on years of usage data and industry trends to debunk the myth that data size is the primary obstacle to gaining insights.
- Most Organizations Have Moderate Data Sizes: While there are some companies with massive data sets, the vast majority operate with less than a terabyte of data, with many even smaller than 100 gigabytes.
- Modern Data Platforms Separate Storage and Compute: Separating storage from compute lets each scale independently. In practice, storage keeps growing while compute needs often stay flat.
- Workloads Typically Process Much Less Data Than Total Size: Dashboards and analytics often focus on recent data, resulting in a much smaller data processing footprint than the overall data size might suggest.
- Data Storage Age Patterns Show Most Data Is Infrequently Accessed: While the most recent data is heavily accessed, older data is rarely queried, meaning the actual working set size is often smaller than expected.
- Increasingly Powerful Hardware Makes Big Data Processing Less Relevant: Single machines keep getting more powerful, especially in the amount of RAM they offer, so many workloads no longer require distributed processing.
- Data Can Be a Liability: Keeping vast amounts of data can lead to regulatory compliance issues, legal risks, and increased complexity in managing data quality and interpretation.
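The points above about working-set size and single-machine capacity can be made concrete with a little arithmetic. The following is a minimal sketch, not anything from the original post: the `Partition` type, the daily 2 GB partitions, the 30-day "hot" window, the 256 GB machine, and the 50% memory headroom are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    age_days: int    # how old the data in this partition is
    size_gb: float   # stored size in gigabytes


def working_set_gb(partitions, hot_window_days):
    """Sum the size of partitions young enough to be queried regularly."""
    return sum(p.size_gb for p in partitions if p.age_days <= hot_window_days)


def fits_in_memory(size_gb, ram_gb, headroom=0.5):
    """True if the data fits in the fraction of RAM set aside for it."""
    return size_gb <= ram_gb * headroom


# Hypothetical table: one 2 GB partition per day, kept for three years.
partitions = [Partition(age_days=d, size_gb=2.0) for d in range(3 * 365)]

total = sum(p.size_gb for p in partitions)
hot = working_set_gb(partitions, hot_window_days=30)

print(f"total stored:       {total:.0f} GB")
print(f"30-day working set: {hot:.0f} GB")
print(f"working set fits on a 256 GB machine: {fits_in_memory(hot, ram_gb=256)}")
print(f"full history fits on a 256 GB machine: {fits_in_memory(total, ram_gb=256)}")
```

Under these assumed numbers the full history is a few terabytes, but the data that dashboards would actually touch is small enough for one machine. Real access patterns should be measured from query logs rather than assumed.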
Top Quotes
"Big Data is coming! You need to buy what I’m selling!"
"The cost of keeping data around is higher than just the cost to store the physical bytes."
"If you answer no to any of these questions, you might be a good candidate for a new generation of data tools that help you handle data at the size you actually have, not the size that people try to scare you into thinking that you might have someday."
Key Takeaways
- The "Big Data" scare is largely unfounded.
- Most organizations don’t have a significant data size problem.
- Focus on efficiently using the data you have rather than worrying about its size.
- New generations of data tools can help manage data effectively, regardless of size.
- Consider the true costs of data storage and processing before accumulating unnecessary data.
Action Steps
- Evaluate your organization's data storage and processing needs.
- Consider simplifying data storage strategies to focus on relevant and frequently accessed data.
- Investigate data tools that are designed for efficient processing and management of moderate data sizes.
- Develop a data retention policy that balances the need for historical data with the risks and costs associated with data storage.
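As one way to picture the retention step above, a policy can be written down as a small rule that maps each dataset to an action. This is only a sketch under stated assumptions: the function name `retention_action`, the 90-day archive threshold, the 730-day deletion threshold, and the `legal_hold` flag are hypothetical and would need to come from your own legal and business requirements.

```python
from datetime import date


def retention_action(created, last_accessed, today, *,
                     archive_after_days=90,    # hypothetical threshold
                     delete_after_days=730,    # hypothetical threshold
                     legal_hold=False):
    """Return "keep", "archive", or "delete" for a single dataset."""
    if legal_hold:
        # Data under a legal hold is never archived or deleted automatically.
        return "keep"
    if (today - created).days > delete_after_days:
        # Older than the maximum retention period.
        return "delete"
    if (today - last_accessed).days > archive_after_days:
        # Still within retention, but nobody has queried it recently.
        return "archive"
    return "keep"


today = date(2024, 1, 1)

# Recently created and recently queried.
print(retention_action(date(2023, 11, 1), date(2023, 12, 20), today))
# Within retention but untouched for most of a year.
print(retention_action(date(2023, 1, 1), date(2023, 2, 1), today))
# Past the maximum retention period.
print(retention_action(date(2021, 6, 1), date(2023, 12, 1), today))
# Past retention, but protected by a legal hold.
print(retention_action(date(2021, 6, 1), date(2023, 12, 1), today, legal_hold=True))
```

Keeping the rule this explicit makes the trade-off visible: each threshold is a decision about how much historical data is worth its storage, compliance, and maintenance cost.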