Big Data is Dead (2023) #
- The author argues that the "big data" hype cycle is over and that most datasets can be managed with simple tools on a single machine, even ones as large as 6 TiB.
- The author recounts a hiring experience where a candidate who understood this principle was chosen over those who wanted to use complex and expensive "big data" tools.
- He asserts that "not understanding the scale of 'real' big data was a no go" in his hiring process.
- He highlights the fallacy of "hammer and nail" thinking, where people who learn a new tool tend to apply it everywhere.
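The single-machine approach the author favors, streaming a CSV once and keeping only running aggregates in memory, can be sketched in a few lines. This is a hedged illustration: the file name, columns, and sample data are invented, not taken from the article.

```python
import csv
from collections import Counter

# Write a tiny sample file standing in for a multi-terabyte events log.
with open("events.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["customer", "amount"])
    w.writerows([["alice", "10"], ["bob", "5"], ["alice", "7"]])

# One sequential pass, constant memory: the same awk-style pattern works
# on files far larger than RAM, which is the point the anecdote makes.
totals = Counter()
with open("events.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["customer"]] += int(row["amount"])

print(totals.most_common())  # [('alice', 17), ('bob', 5)]
```

The equivalent awk one-liner the quoted interviewee had in mind would do the same sum in a single pipeline, with no cluster involved.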
The Problem with Trick Questions #
- Several commenters discuss the potential problems with using a trick question in an interview scenario.
- They argue that such questions can be stressful for the interviewee, don't necessarily reflect real-world scenarios, and can lead to inaccurate judgements about the candidate's abilities.
- They suggest the interviewer should avoid exploiting their position of power and instead ask clearer, more open-ended questions that let the candidate demonstrate their skills.
Simple Versus Complex Solutions #
- Commenters debate the merits of simple solutions versus more complex "big data" solutions.
- Many commenters agree that simpler solutions, such as CLI tools or a single-machine relational database like SQLite or PostgreSQL, are often sufficient for most data management tasks.
- They highlight the risk of overengineering and the hidden costs of complex solutions in maintenance, scaling, and accessibility.
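The "relational database on a single machine" option mentioned above can be as small as Python's bundled sqlite3 module. A minimal sketch, with an invented table and sample rows:

```python
import sqlite3

# In-memory database for the example; a file path works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7)],
)

# Ordinary SQL covers most reporting needs with no server or cluster.
top = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
).fetchall()
print(top)  # [('alice', 17), ('bob', 5)]
```

Swapping SQLite for PostgreSQL changes the connection line, not the shape of the solution.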
The Importance of Understanding Data Scale #
- There is consensus that most businesses deal with data sets much smaller than 6 TiB.
- Commenters emphasize the importance of understanding the specific needs of the problem and avoiding unnecessary complexity.
- Tools like DuckDB and ClickHouse are mentioned as good alternatives to complex "big data" tools for handling data of moderate size.
The Value of Flexibility #
- The discussion also touches on the importance of flexibility in data management.
- The author emphasizes the need to adapt the approach and tools to the needs of the project.
- Technologies like Apache Drill and Athena are mentioned as tools that provide flexibility and can query data stored in different formats and locations.
Top Quotes #
"The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it."
"I'm prone to the same fallacy: when I learn how to use a hammer, everything looks like a nail."
"Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me ,000 for saving you ,000."
"I think more like, how would you prepare and cook the best five course gala dinner for only . That requires true skill."
TL;DR #
The article argues that "big data" is not as big a problem as people think, and that many data sets can be easily managed with simple tools and approaches. It criticizes the tendency to overengineer solutions and emphasizes the importance of understanding the actual data scale and needs of the problem before jumping to complex and expensive "big data" solutions.