Viktor Mayer-Schönberger & Kenneth Cukier

Big data

Big data refers to large, complex data sets that are difficult to process using traditional database methods. Its key characteristics are high volume, velocity, and variety. Applications span industries such as banking, media, and healthcare. Common use cases include analyzing customer data to improve products and recommendations, detecting fraud, tracking disease outbreaks, and optimizing business operations. Implementing big data brings challenges such as storing massive datasets, ensuring data quality, and building the right analytics skills. Successfully leveraging it, however, can open new revenue opportunities, drive innovation, and create competitive advantages. Overall, big data is transforming how organizations extract value from information to make better decisions.

Big data's impact

Big data enhances understanding in complex areas, improving health and wellbeing solutions and boosting business productivity through thorough data analysis. It also spurs rapid innovation, enabling data-driven decisions and competitive advantages. At the same time, it raises ethical concerns around privacy and responsible use, necessitating strong safeguards against misuse. Big data's impact is thus twofold: it fosters progress and it demands ethical management.

Total data analysis

Throughout history, humans have struggled to collect, organize, and understand data, primarily because most information was analog, making analysis expensive and time-consuming. The 1880 U.S. census exemplifies this: it took eight years to process, and the results were outdated by the time they were complete. The 1890 census was projected to take even longer, but the advent of punch cards and tabulation machines reduced the processing time to one year, meeting the constitutional requirement for a decennial census to determine taxation and representation.

Traditionally, big-data problems were managed by analyzing a random sample and extrapolating, but this approach had limitations. Ensuring a genuinely random, representative sample was difficult, biases could skew results, and sampling could miss the nuances of subgroups. Sampling, a necessity born of past information-processing constraints, often obscured details, including those within the margin of error where interesting insights might lie.

The need for sampling has diminished with advances in technology. Today's sensors, GPS devices, web interactions, and social media generate vast data streams, and computers can process this information far more efficiently. Analyzing all available data, rather than just a sample, can lead to superior predictions and insights, a practice now feasible because the cost and complexity of data storage and processing have fallen. Computer scientist Oren Etzioni, for example, analyzed billions of flight-price data points to predict ticket price trends, leading to the creation of Farecast. Acquired by Microsoft, Farecast saves travelers money and is correct 75% of the time. Similarly, credit card fraud detection relies on analyzing all transactions to spot anomalies, often in real time, identifying outliers against the backdrop of normal activity. This shift toward using all available data is transforming how we analyze and derive value from information.

Less precise predictions

In the realm of big data, precision is often sacrificed for the sake of macro-level insights from large, real-world datasets. Big data's inherent messiness and inconsistency, spread across numerous servers, demands a tolerance for inaccuracy. For instance, while a single temperature sensor in a vineyard provides exact readings, multiple low-cost sensors offer a broader view of temperature variation despite their individual inaccuracies. Similarly, Google's web-scale translation services benefit more from large, varied datasets than from smaller, high-quality ones. The "Billion Prices Project" at MIT, which scrapes online prices to measure inflation, exemplifies the value of timely, large-scale data over precision. In natural language processing, Microsoft found that increasing the volume of training data significantly improved the accuracy of its grammar checker.
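The vineyard example can be made concrete with a small simulation. The sketch below is plain Python with invented temperature values and noise levels (not data from the book); it shows how the average of many noisy, low-cost readings can track the vineyard-wide picture better than one precise sensor at a single spot.

```python
import random

random.seed(0)

# Hypothetical vineyard: the true temperature varies across 100 locations.
true_temps = [15.0 + 0.05 * i for i in range(100)]  # gradient from 15.0 to ~20.0 °C

# One precise sensor at a single spot reads its own location almost exactly...
single_precise = true_temps[0] + random.gauss(0, 0.1)

# ...while many cheap sensors are individually noisy (±1 °C) but cover every spot.
cheap_readings = [t + random.gauss(0, 1.0) for t in true_temps]

true_mean = sum(true_temps) / len(true_temps)
cheap_mean = sum(cheap_readings) / len(cheap_readings)

print(f"true vineyard average:     {true_mean:.2f} °C")
print(f"single precise sensor:     {single_precise:.2f} °C")  # exact, but misses the overall picture
print(f"mean of 100 noisy sensors: {cheap_mean:.2f} °C")      # close to the true average
```

The individual errors largely cancel out in aggregate, which is the sense in which messy-but-comprehensive data can beat precise-but-narrow data.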
Embracing big data means accepting uncertainty to unlock insights that precise, small-scale data cannot provide, focusing on the big picture rather than exactness.

Patterns over causes

In 1995, Amazon began with a team of book critics who wrote reviews and made recommendations. A software engineer's suggestion to base recommendations on purchase patterns instead proved more effective. This data-driven approach led to the disbandment of the human editors, and Amazon's recommendation system now drives about one-third of its sales. Mayer-Schönberger and Cukier argue that, for driving sales, understanding why customers behave as they do matters less than identifying correlations in purchase data.

Correlations in big data help businesses understand the present and predict the future. Walmart, for instance, discovered that sales of Pop-Tarts increase before hurricanes and capitalized on this by placing them near emergency supplies. This differs from the traditional scientific method of forming a hypothesis and then testing it against data; instead, data sets are analyzed directly to find correlations that support predictions.

The approach is being applied across industries. In healthcare, researchers monitoring premature babies with sensors and software can predict infections before symptoms appear, allowing earlier treatment. Tools for analyzing complex data continue to advance, and while causal explanations remain valuable, the focus on correlations is yielding new insights and better predictions. With sufficient data, the emphasis shifts from causation to correlation, as the numbers can reveal previously unseen links.
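The shift from editorial judgment to purchase-pattern correlations can be illustrated with a toy co-occurrence recommender. This is a minimal sketch in Python with invented orders and item names; it is not Amazon's actual system, only the basic idea of ranking items by how often they are bought together, without any causal model of why.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories; real systems operate on vastly larger datasets.
orders = [
    {"hiking boots", "water bottle", "trail map"},
    {"hiking boots", "trail map"},
    {"water bottle", "sunscreen"},
    {"hiking boots", "water bottle"},
]

# Count how often each pair of items appears in the same order.
co_counts = defaultdict(int)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts[(a, b)] += 1

def recommend(item, top_n=3):
    """Suggest items that most often co-occur with `item` -- pure correlation."""
    scores = defaultdict(int)
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("hiking boots"))  # e.g. ['trail map', 'water bottle']
```

The recommender never asks why boots and trail maps go together; the correlation alone is enough to act on, which is the book's central point about patterns over causes.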
