Stats for Data Science

By Danny Kaplan | June 5, 2019

Stats for Data Science is a forward-looking introductory textbook, currently in draft form. The book is available online here.

From the preface …

Data science is an emerging computational discipline with roots sprouting in contemporary problems. Almost always these problems involve large amounts of data collected by automated systems and collated by bringing together diverse data from different sources. The goals for working with such data are varied: predicting the preferences of individual people for consumer products or news feeds; examining government or business or clinical medical records to answer questions such as the efficacy of a proposed program or intervention in public health; detecting and classifying rare events such as credit-card fraud; finding useful patterns in clouds of text or data that might help identify harmful interactions between medicines or extract meaning from thousands of documents.

Statistics as a discipline is also young, at least compared to mathematics, chemistry, and physics. It emerged in the decades around 1900.

If statistics had not already existed, data science would need to invent it. Statistics provides the foundation for describing variation among individuals, for relating different factors to one another, and for drawing appropriate inferences from patterns observed in data. Statistics also provided the impetus for a critically important method of science: the randomized controlled experiment. It is fair to say, “There can be no data science without statistics.”

The converse is not true. Statistics as a field can and did exist without data science. Statistics had a century to mature before the problems addressed by data science became broadly important. As such, the context to which statistics was applied was initially very different from the contexts in which data science is important. Statisticians had to find ways to deal with very limited amounts of data, and so the mathematics of small data became central to the self-definition of statistics. Statisticians had to help steer medicine and science away from arguments based on anecdote and narrow data, and so great emphasis was placed on the technique of random sampling and random assignment in experiment. Both of these uses of randomness help compensate for the potentially misleading influence of unknown or unmeasured factors. And statisticians had to do their work in a world without electronic computers or the idea of software. Without software or the machines to run it on, statisticians described their conceptual methods not with step-by-step algorithms composed of simple procedures, but using algebraic formulas derived from mathematical stand-ins for procedures and supplemented with elaborate tables of standardized probabilities. Lacking the computational facilities for working with randomness directly, statisticians eliminated the randomness by replacing it with deterministic, exactly repeatable idealizations of randomness.
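
To make the contrast in the paragraph above concrete, here is a minimal sketch of what "working with randomness directly" can look like once software is available. It compares two groups by repeatedly shuffling the group labels rather than consulting a table of standardized probabilities. This is an illustration only, not a method taken from the book: the function name `permutation_test`, its parameters, and the sample numbers are all assumptions made for this example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_shuffles=2000, seed=0):
    """Shuffle-based comparison of two group means (hypothetical helper).

    Returns the observed absolute difference in means and the fraction of
    random relabelings whose difference is at least as large.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles


if __name__ == "__main__":
    # Made-up measurements for illustration only (not real data).
    sample_a = [162, 158, 171, 165, 160, 168]
    sample_b = [175, 169, 180, 172, 178, 166]
    observed, fraction = permutation_test(sample_a, sample_b)
    print(f"observed difference in means: {observed:.2f}")
    print(f"fraction of shuffles at least as large: {fraction:.3f}")
```

The shuffling loop plays the role that an algebraic formula and a probability table play in the traditional approach: it asks how often chance alone would produce a difference as large as the one observed.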

The historically early maturing of statistics creates a difficult situation for the student undertaking to master data science. Statistics, as it is almost always taught, provides very little contact with data and with the description of relationships between variables. Much of a standard statistics course involves misleadingly precise calculations to do a simple comparison of two groups in terms of one quantity, e.g., the difference in height between males and females. Because many of the particular concepts and calculation techniques at the core of traditional introductory statistics have limited applicability to data science, there has been an unfortunate tendency to develop data science programs that leave out the statistical foundations that are genuinely important.

Stats for Data Science re-imagines statistics as if it were being invented today, alongside data science. It emphasizes concepts and techniques relating directly to data with multiple variables, to constructing predictive models of individual outcomes, and to making responsible inferences about causal relationships from data. It can get further than traditional introductory statistics because it bypasses the complications introduced by small data and embraces computing algorithms for working with randomness.
