Clarifying Scientific Concepts Part 5: Data
From Measurement to Data
Bridging from the previous section
So far, we have treated measurement as the disciplined act of assigning values to features of the world. We have asked what is being measured, how it is being measured, what instruments and units are involved, and how metrology gives measurement its structure. Those questions are essential because measurement is where scientific evidence first begins to take form. But a measurement by itself is rarely the final object of scientific reasoning. Most scientific claims are not built from a single reading, a single observation, or a single classification. They are built from collections of measurements assembled into what we call data.
This shift from measurement to data can seem simple: take many measurements, arrange them in a table, and begin analysis. But that simplicity is deceptive. A dataset is not a transparent window onto reality. It is a constructed artifact, produced through many decisions about what counts as an observation, which properties are worth recording, how those properties should be represented, and which measurements are included or excluded. Data are not “the world” in raw form. They are the world after it has passed through concepts, instruments, protocols, sampling schemes, coding rules, database structures, and cleaning decisions.
This matters because every later statistical result inherits the consequences of those earlier choices. A mean, a regression coefficient, a confidence interval, or a model prediction can only summarize the data that were actually produced. If the data were shaped by narrow definitions, inaccessible populations, inconsistent measurements, or undocumented cleaning decisions, then the statistical conclusions will reflect those limitations. Statistics begins long before calculation. It begins when we decide how the world will be allowed to appear in our data.
Records, variables, and units of analysis
A dataset usually has an internal grammar. At its most familiar, this grammar is tabular: rows and columns. Each row is a record, and each column is a variable. But this simple structure hides an important conceptual question: what does each row represent?
The answer is the unit of analysis. A unit of analysis is the kind of thing about which observations are being made. In one study, the unit might be a person. In another, it might be a household, a school, a hospital visit, a cell line, a tissue sample, an experimental run, a region, a species, a galaxy, or a social media post. Once we know the unit of analysis, variables become the features measured on each unit. If the unit is a person, variables might include age, blood pressure, educational attainment, or survey responses. If the unit is a region, variables might include rainfall, population density, disease incidence, or average income. If the unit is an experimental run, variables might include temperature, treatment condition, reaction time, and observed yield.
A record is the pairing of a unit with measured or assigned variables. In a tidy table, each row says, in effect: here is one unit, and here are the values we have recorded for that unit. But not all datasets are tidy, and not all records correspond cleanly to the conceptual object we care about. Problems emerge when the conceptual unit and the recorded unit diverge.
Suppose a researcher wants to study households, but the available dataset records information about the “household head.” Income, occupation, education, and age may be attached to that one individual, even though the scientific question concerns the household as a collective unit. In that case, the recorded unit is not quite the same as the conceptual unit. A household may contain multiple earners, dependents, relationships, and internal inequalities that are invisible if the household is represented by one person. The row appears to describe a household, but much of the actual household structure has disappeared.
Similar problems arise in many domains. A hospital dataset may record visits rather than patients, so one person can appear multiple times. A school dataset may record classrooms rather than students. A biological dataset may record samples rather than organisms. A platform dataset may record posts rather than users. None of these choices is automatically wrong, but each one changes what the data can legitimately support. Before asking what a dataset shows, we need to ask what its rows are.
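The gap between rows and units can be made concrete with a small sketch. The records below are invented for illustration, and the field names (patient_id, visit_date, systolic_bp) are assumptions rather than any real schema. The same table gives a different average depending on whether a row is treated as a visit or rolled up to a patient.

```python
from collections import defaultdict

# Hypothetical visit-level records: each row is one hospital visit, not one patient.
visits = [
    {"patient_id": "P1", "visit_date": "2023-01-05", "systolic_bp": 150},
    {"patient_id": "P1", "visit_date": "2023-02-10", "systolic_bp": 146},
    {"patient_id": "P1", "visit_date": "2023-03-15", "systolic_bp": 154},
    {"patient_id": "P2", "visit_date": "2023-01-20", "systolic_bp": 118},
    {"patient_id": "P3", "visit_date": "2023-02-02", "systolic_bp": 122},
]

def mean(values):
    return sum(values) / len(values)

# Visit-level mean: every visit counts once, so frequent visitors weigh more.
visit_level_mean = mean([v["systolic_bp"] for v in visits])

# Patient-level mean: average within each patient first, then across patients.
by_patient = defaultdict(list)
for v in visits:
    by_patient[v["patient_id"]].append(v["systolic_bp"])
patient_means = {pid: mean(bps) for pid, bps in by_patient.items()}
patient_level_mean = mean(list(patient_means.values()))

print(f"rows (visits): {len(visits)}, units (patients): {len(by_patient)}")
print(f"visit-level mean:   {visit_level_mean:.1f}")    # 138.0
print(f"patient-level mean: {patient_level_mean:.1f}")  # 130.0
```

Neither number is the correct one in general. The visit-level mean answers a question about visits, and the patient-level mean answers a question about patients.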
Scoping a dataset in terms of a scientific problem
A dataset should be scoped around a scientific problem, not merely assembled because information is available. The question should guide the design. What are the appropriate units? What variables need to be measured on each unit? What time window matters? What geographic or institutional boundaries define the setting? What contextual information is necessary for interpretation?
Consider a question such as: Does urban heat exposure increase emergency room visits among older adults? To build data for this question, we must decide what the units are. Are they individual people, emergency room visits, neighborhoods, days, or hospitals? Each choice creates a different dataset and a different kind of claim. If the unit is an individual person, we might need age, address, health history, and daily exposure estimates. If the unit is a neighborhood-day, we might need daily temperature, neighborhood demographics, and counts of emergency room visits. If the unit is a hospital, we might study changes in patient volume under different weather conditions.
We also need to define the time window. Are we studying one heat wave, one summer, ten years, or a changing climate trend over decades? We need to define geography. Are we studying one city, several cities, or an entire country? We need to define what counts as heat exposure and what counts as a relevant emergency room visit. Does exposure refer to outdoor temperature, indoor temperature, heat index, nighttime cooling, or individual-level sensor readings? Does the outcome include only heat-related diagnoses, all cardiovascular events, respiratory distress, dehydration, or total emergency visits?
These decisions happen before any formal statistical analysis. They are part of conceptual design. A dataset is useful only to the extent that its structure matches the scientific problem. If the data are poorly scoped, later statistical sophistication can create the appearance of rigor without actually answering the intended question.
Populations, Samples, and the Scope of Claims
Defining a population conceptually
Most scientific studies observe less than everything they care about. A researcher rarely measures every human, every mouse of a particular strain, every galaxy of a certain type, every social media post on a topic, or every ecosystem in a biome. Instead, researchers observe some units and then use those observations to reason about a larger set. This larger set is the population.
The target population is the full set of units to which we want our conclusions to apply. It is defined conceptually by the scientific question. If we ask whether a vaccine is effective in adults, the target population might be all adults in a given country, or all adults globally, or all adults with a particular health condition. If we ask how a pollutant affects fish development, the target population might be a species in a particular watershed or a broader class of organisms exposed to similar chemicals. If we study online misinformation, the target population might be all posts about a topic during a specified time period on a platform or across multiple platforms.
The accessible population is the subset of the target population that can realistically be reached or observed. This distinction is crucial. A researcher may care about all adults, but only be able to recruit adults who live near a clinic, speak a particular language, have internet access, or are enrolled in a health system. A biologist may care about a wild species broadly, but can only sample accessible sites. A social scientist may care about political attitudes across a country, but can only reach people listed in certain contact databases or willing to answer surveys.
The sample is the set of units actually observed. It is the concrete data-bearing subset. Inference is the bridge from sample to population. Every inferential claim therefore depends on how sturdy that bridge is. If the sample resembles the target population in relevant ways, the bridge may be strong. If the sample differs systematically from the target population, the bridge becomes fragile. Statistical tools can estimate some forms of uncertainty in crossing from sample to population, but they cannot magically erase a mismatch between the people, organisms, places, or events observed and the ones we ultimately want to understand.
Units, aggregation, and level problems
Population claims also depend on the level at which data are measured and analyzed. Individual-level data describe individual units: persons, cells, animals, transactions, or events. Group-level data describe aggregates: schools, neighborhoods, countries, laboratories, companies, tissues, or populations. Both levels are valuable, but they answer different questions.
A classic danger is the ecological fallacy: drawing conclusions about individuals from group-level data. Suppose neighborhoods with higher average income also have higher average rates of some health screening. It does not necessarily follow that higher-income individuals within each neighborhood are the ones receiving more screening. The group-level association may reflect neighborhood infrastructure, clinic density, local policy, or demographic composition rather than individual income effects.
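A toy example can show how the two levels come apart. The numbers below are invented and deliberately extreme: across neighborhoods, average income and average screening rise together, while within every neighborhood the higher-income individuals are screened less. The pearson helper is written out by hand so the sketch has no dependencies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: (income in thousands, number of screenings) per individual.
neighborhoods = {
    "A": [(20, 5), (30, 4), (40, 3)],
    "B": [(40, 7), (50, 6), (60, 5)],
    "C": [(60, 9), (70, 8), (80, 7)],
}

# Individual-level association inside each neighborhood.
within = {
    name: pearson([inc for inc, _ in people], [scr for _, scr in people])
    for name, people in neighborhoods.items()
}

# Group-level association between neighborhood averages.
avg_income = [sum(inc for inc, _ in p) / len(p) for p in neighborhoods.values()]
avg_screen = [sum(scr for _, scr in p) / len(p) for p in neighborhoods.values()]
between = pearson(avg_income, avg_screen)

print("within-neighborhood correlations:", within)  # each is -1.0
print("between-neighborhood correlation:", between)  # 1.0
```

Real data are rarely this clean, but the sign reversal illustrates why a group-level pattern cannot simply be read as an individual-level one.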
The reverse danger is sometimes called the atomistic fallacy: drawing conclusions about groups from individual-level data without accounting for group-level structure. Suppose individual students in a dataset show a relationship between study time and exam performance. That does not automatically explain differences between schools. School-level outcomes may depend on funding, curriculum, teacher experience, peer networks, class size, and institutional culture. Individual associations do not automatically scale up to group explanations.
The level of analysis changes what the population is. If rows are people, the population may be people. If rows are schools, the population may be schools. If rows are region-years, the population may be region-years. Confusion about level can lead to overclaiming. A dataset of hospital visits is not automatically a dataset of people. A dataset of countries is not automatically evidence about citizens. A dataset of cells in culture is not automatically evidence about whole organisms. The statistical unit and the scientific claim must be aligned.
Framing the scope of conclusions
A responsible analysis states not only what it found but also where, when, and for whom the finding is meant to hold. The question is: for which population and time period do my conclusions apply?
Sometimes the scope is narrow. A study may support a conclusion about students at a particular university in 2024, patients at a specific hospital system between 2018 and 2023, or measurements from one experimental apparatus under a particular protocol. Narrow claims can be extremely valuable. They are often more honest than broad claims because they stay close to the data-generating conditions.
Broader claims require stronger justification. To move from “students at this particular university in 2024” to “university students in similar institutions globally,” we need arguments about similarity across institutions, countries, curricula, cultures, and time periods. We may need replication, comparative samples, or theory explaining why the observed pattern should generalize. Broadness is not merely a rhetorical choice; it is an evidential burden.
This is why scope statements are part of statistical reasoning. They define the boundaries of inference. Without them, results float free of their conditions of production and are easily misused. A precise estimate for a narrow sample may be less useful for broad policy than a less precise estimate from a representative design. Conversely, a small, carefully defined experiment may reveal a mechanism that is scientifically important even if it does not immediately estimate population prevalence. The right scope depends on the question.
How Data Are Obtained
Sampling frames and access to the population
Sampling is often described as selecting units from a population. But in practice, researchers usually do not sample directly from an abstract population. They sample from a sampling frame: the list, system, map, registry, platform, or mechanism through which units can be selected.
A sampling frame might be a census register, an email list, a voter file, a patient registry, a school enrollment database, a biobank, a sensor grid, a map of field sites, a directory of firms, or an archive of platform activity. The sampling frame is operational. It says not just who exists, but who can appear in the data.
Frame problems are common. Undercoverage occurs when some members of the target population have no chance of being selected because they are absent from the frame. A phone survey that excludes people without stable phone access undercovers certain populations. A biobank based on volunteers may underrepresent people with limited healthcare access or mistrust of medical institutions. A sensor grid may fail to cover rural areas, indoor environments, or informal settlements.
Overcoverage occurs when the frame includes units that are not part of the target population or includes duplicate and outdated entries. A business directory may include closed firms. A patient list may include people who moved away. A database may contain multiple records for the same person under slightly different identifiers. These problems affect who appears to exist from the perspective of the data.
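As a rough sketch, the two frame problems can be expressed as set comparisons. The identifiers are hypothetical, and the example assumes that the target population can be listed, which is rarely true in practice. In real studies these quantities usually have to be estimated from external sources.

```python
# Hypothetical identifiers for illustration only.
target_population = {"u01", "u02", "u03", "u04", "u05", "u06"}

# The frame as a raw list: it can contain duplicates and ineligible entries.
frame_entries = ["u01", "u02", "u02", "u03", "u07", "u08"]
frame_units = set(frame_entries)

# Undercoverage: target units with no chance of selection.
undercovered = target_population - frame_units

# Overcoverage: frame entries outside the target, plus duplicated entries.
ineligible = frame_units - target_population
duplicates = {u for u in frame_units if frame_entries.count(u) > 1}

coverage_rate = len(target_population & frame_units) / len(target_population)

print("undercovered:", sorted(undercovered))  # ['u04', 'u05', 'u06']
print("ineligible:  ", sorted(ineligible))    # ['u07', 'u08']
print("duplicates:  ", sorted(duplicates))    # ['u02']
print("coverage rate:", coverage_rate)        # 0.5
```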
The sampling frame silently defines the doorway into the dataset. Units outside that doorway cannot be observed, no matter how large the eventual sample is. This is one of the most important reasons that “more data” does not automatically mean “better data.” A huge dataset drawn from a distorted frame can produce extremely precise estimates of the wrong population.
Probability sampling: the basic idea
Probability sampling refers to designs in which units are selected according to known or specifiable probabilities. The technical details can become sophisticated, but the central idea is simple: the selection process is deliberately structured so that sampling uncertainty can be understood.
In simple random sampling, every unit in the frame has an equal chance of selection, and every possible sample of a given size is equally likely. If we have a complete list of 10,000 eligible units and randomly select 1,000, then chance determines which units enter the sample. This does not guarantee a perfect miniature of the population in every sample, but it makes the process transparent. We can reason about the variability we would expect if the sampling were repeated.
In stratified sampling, the population is divided into meaningful subgroups, or strata, and samples are drawn within each stratum. For example, a health survey might stratify by age group, region, or sex to ensure that key subgroups are represented. Stratification is useful when some groups are small but scientifically important, or when researchers want more precise estimates within subgroups.
In cluster sampling, groups are sampled first, and then units within those groups are measured. A researcher might sample schools and then students within schools, villages and then households within villages, hospitals and then patients within hospitals. Cluster sampling is often practical and cost-effective, especially when a complete list of individuals is unavailable but a list of groups exists. However, units within the same cluster tend to resemble each other, so each additional unit adds less independent information. Cluster designs therefore usually carry more uncertainty than a simple random sample of the same size.
Randomization in sampling is not magic. It does not ensure that every sample is representative in every respect. Rather, it gives us a principled way to control selection bias and quantify sampling variability. Because the selection process is known, we can ask how much estimates would vary across imaginary repetitions of the same design.
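The three designs can be sketched in a few lines. This is a minimal illustration, not a survey-sampling library. The frame of 1,000 students in 20 schools and 2 regions is invented, the function names are hypothetical, and the cluster version takes every unit in each chosen cluster (a one-stage design), which is only one of several possible variants.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 1,000 students, 50 per school, schools 0-4 in the north.
frame = [
    {"id": i, "school": i % 20, "region": "north" if i % 20 < 5 else "south"}
    for i in range(1000)
]

def simple_random_sample(units, n):
    """Every unit in the frame has the same chance of selection."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum):
    """Draw a fixed number of units independently within each stratum."""
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    return [u for group in strata.values() for u in rng.sample(group, n_per_stratum)]

def cluster_sample(units, key, n_clusters):
    """Sample whole clusters, then keep every unit in the chosen clusters."""
    clusters = sorted({u[key] for u in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [u for u in units if u[key] in chosen]

srs = simple_random_sample(frame, 100)
strat = stratified_sample(frame, "region", 50)
clust = cluster_sample(frame, "school", 2)

north_in_strat = sum(u["region"] == "north" for u in strat)
print(len(srs), len(strat), len(clust))     # 100 100 100
print("north units in stratified sample:", north_in_strat)  # 50
print("schools in cluster sample:", len({u["school"] for u in clust}))  # 2
```

All three samples contain 100 units, but they carry different information. The stratified sample deliberately oversamples the smaller northern region, so it needs weights before it can be used for population-wide estimates. The cluster sample contains students from only two schools.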
Non-probability sampling and found data
Many datasets are not produced by probability sampling. Convenience samples include units that are easy to reach: students in a laboratory class, patients at a nearby clinic, volunteers responding to an online advertisement, or organisms collected from accessible field sites. Snowball samples recruit through networks: one participant refers another, who refers another. Administrative and platform data arise because institutions or systems record activity for operational reasons: hospital records, tax files, app logs, purchase histories, web scrapes, learning management systems, and social media archives.
These data are not bad by default. In many fields, they are indispensable. Administrative data can be large, detailed, and longitudinal. Platform data can capture behavior at scales impossible for traditional surveys. Clinical records can reveal patterns in real-world care. Convenience samples can be appropriate for early-stage experiments, mechanism testing, teaching laboratories, or contexts where the target population is intentionally narrow.
The problem is not that non-probability data are useless. The problem is that population claims become more assumption-laden. If units enter the data because they are easy to reach, because they volunteer, because they use a platform, because they interact with an institution, or because their activity leaves digital traces, then the data reflect those access mechanisms. The observed units may differ systematically from the unobserved units.
Big data can make this problem harder to see. A dataset with millions of records can feel authoritative. But size does not guarantee representativeness. A social media dataset may contain millions of posts while excluding people who do not use the platform, people who only read but do not post, people whose posts were deleted, or people whose language is not captured by the search terms. A hospital dataset may contain millions of encounters while excluding people who could not access care. The question is always: large relative to what population, and generated by what process?
Sources of sampling bias
Sampling bias arises when the observed units differ systematically from the population about which we want to make claims. Several forms are especially important.
Coverage bias occurs when parts of the population cannot appear in the sampling frame. If a survey frame excludes people without internet access, then internet access becomes a condition for being represented. If a wildlife study samples only accessible areas near roads, then animals in remote habitats may be missing. If a genomic database consists mainly of participants from certain ancestries, findings may not generalize to underrepresented populations.
Selection bias occurs when the process of participation or inclusion is related to the variables of interest. People who volunteer for a nutrition study may be more health-conscious than those who do not. Patients who receive a particular treatment may differ from those who do not in disease severity, insurance status, physician access, or personal preferences. Users who remain active on a platform may differ from those who leave.
Nonresponse bias occurs when units are invited or eligible but do not respond, and when their nonresponse is systematically related to the study topic. A survey about financial stress may miss people under the greatest stress if they lack time, stability, or trust to participate. A follow-up study may lose participants who experienced adverse outcomes. In these cases, the missing responses are not simply empty cells; they are clues about the social and practical conditions of data production.
These biases are not merely random noise. Random sampling error makes an estimate fluctuate around its target. Systematic sampling bias shifts the center of that fluctuation away from the target, so collecting more data through the same process does not correct it. A large biased sample can produce a result that is stable, precise, and wrong. This is why sampling is a generative act: it helps create the dataset and shapes the uncertainty attached to every later claim.
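A small simulation can illustrate "stable, precise, and wrong." Everything here is assumed for the sake of the sketch: a synthetic population of stress scores, and a response mechanism in which the chance of responding falls as stress rises. The small random sample is noisy but centered on the population mean, while the very large self-selected sample has a tiny standard error around the wrong value.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical population: stress scores for 100,000 people.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def standard_error(values):
    return statistics.stdev(values) / math.sqrt(len(values))

# Design 1: small simple random sample from the whole population.
small_srs = rng.sample(population, 200)

# Design 2: everyone is invited, but response depends on the outcome itself.
def responds(score):
    # Assumed mechanism: response probability falls from 0.9 to 0.1 as stress rises.
    p = min(0.9, max(0.1, 0.5 - (score - 50) / 50))
    return rng.random() < p

big_biased = [x for x in population if responds(x)]

for label, sample in [("small random sample", small_srs),
                      ("large biased sample", big_biased)]:
    est = statistics.fmean(sample)
    print(f"{label}: n={len(sample)}, estimate={est:.2f}, "
          f"SE={standard_error(sample):.3f}, error={est - true_mean:+.2f}")
```

The standard error of the biased sample describes how much its estimate would vary under the same flawed process. It says nothing about the distance between that process and the population of interest.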
Data Collection as a Process, Not an Event
Study designs and their implications for data
Data collection is not a single moment when facts are gathered. It is a process organized by a study design. The design determines what comparisons are possible, what temporal relationships can be observed, and what causal claims are plausible.
In experimental designs, researchers intervene. They assign treatments, conditions, exposures, or manipulations and then observe outcomes. Randomized experiments are powerful because random assignment tends to balance both known and unknown factors across groups on average, which makes causal interpretation more credible. But experiments are not automatically generalizable, ethical, or realistic. Laboratory control may come at the cost of ecological validity, and some interventions cannot be assigned.
In observational designs, researchers observe without assigning the exposure or condition of interest. Much of epidemiology, ecology, astronomy, economics, sociology, and clinical research is observational. Observational data can be rich and realistic, but causal claims require careful attention to confounding, selection, measurement, and timing. If people who receive a treatment are already different from those who do not, then outcome differences cannot automatically be attributed to the treatment.
Cross-sectional designs measure units at one point in time or over a short interval. They are useful for describing prevalence, relationships, and snapshots. Longitudinal designs follow units over time. Panels repeatedly measure the same units. Cohorts follow people, organisms, organizations, or other units from a defined starting point. Longitudinal data allow researchers to study change, timing, trajectories, and temporal ordering, but they introduce challenges such as attrition, changing measurement practices, and time-dependent confounding.
In biomedical contexts, case-control studies begin with outcome status: researchers compare units with a condition to units without it and look backward for exposures or risk factors. Case series describe a set of cases, often useful for identifying new phenomena, rare conditions, or clinical patterns. Each design produces a different kind of evidence. The statistical analysis must respect the design rather than pretending all datasets are interchangeable tables.
Protocols, standardization, and field or lab practice
A protocol defines how data should be collected. It specifies when measurements are taken, where they are taken, who takes them, which instruments are used, how those instruments are calibrated, how specimens are handled, how observers are trained, how questions are asked, and how deviations are documented. Protocols are an attempt to make data collection stable enough that observations can be compared.
Standardization matters because variation in procedure can masquerade as variation in the phenomenon. If one clinic measures blood pressure after five minutes of rest and another measures it immediately after patients arrive, differences between clinics may reflect protocol rather than health. If one field team samples water after rainfall and another during dry conditions, site differences may reflect timing. If one coder receives extensive training and another does not, disagreement may reflect coding practice rather than the events being coded.
But reality is messier than protocols. Field sites are inaccessible. Instruments drift. Participants misunderstand questions. Biological specimens degrade. Observers are tired. Software updates change defaults. Laboratories replace equipment. A protocol may say that measurements are taken under identical conditions, while actual practice involves compromises.
These deviations are not always catastrophic. Science routinely operates under imperfect conditions. The problem arises when deviations are invisible. If we do not know that one batch of samples was processed differently, one interviewer paraphrased questions, one sensor was miscalibrated, or one site changed its procedure midway through the study, we may interpret procedural artifacts as scientific findings. Documentation is therefore not administrative overhead. It is part of the evidential structure of the data.
From events to data entries: coding, classification, and judgment
Many data do not begin as numbers. They begin as events, behaviors, symptoms, images, narratives, specimens, or traces. Turning these into data often requires coding and classification.
A physician assigns diagnostic codes. A survey respondent chooses from response categories. A researcher labels animal behavior from video. A social scientist codes interview transcripts. A machine learning system classifies images. A laboratory threshold turns a continuous measurement into “positive” or “negative.” A government agency classifies causes of death, employment status, migration status, or industry type.
Coding creates analyzable variables, but it also introduces judgment. Categories are never purely natural. They depend on definitions, thresholds, conventions, institutional needs, and historical context. What counts as “high blood pressure”? Which symptoms qualify for a diagnosis? How should ambiguous survey responses be treated? When does a behavior begin and end? What counts as an outlier, a duplicate, a valid record, or a censored value?
These decisions should be visible in metadata. Metadata are data about the data: descriptions of variables, coding schemes, category definitions, measurement units, thresholds, instruments, protocols, transformations, and provenance. Without metadata, a dataset may be technically readable but scientifically opaque. A column labeled status or score is not self-explanatory. A category labeled 3 is meaningless without a codebook. A threshold-based variable cannot be interpreted without knowing the threshold.
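One lightweight way to keep such decisions visible is a machine-readable codebook stored alongside the data. The sketch below is only an illustration: the variable names, category codes, and the 140 mmHg threshold are assumptions chosen for the example, not a standard or a clinical recommendation.

```python
# Hypothetical codebook: every name, code, and threshold here is illustrative.
codebook = {
    "status": {
        "description": "Employment status at interview",
        "type": "categorical",
        "codes": {1: "employed", 2: "unemployed", 3: "not in labor force", 9: "missing"},
    },
    "sbp_high": {
        "description": "Systolic blood pressure at or above threshold",
        "type": "derived",
        "source": "sbp_mmhg",
        "threshold": 140,
        "unit": "mmHg",
    },
}

def decode(variable, value):
    """Translate a stored code into its documented label."""
    return codebook[variable]["codes"].get(value, "undocumented code")

def derive_high_bp(sbp_mmhg, threshold=codebook["sbp_high"]["threshold"]):
    """Apply the documented threshold; a different threshold gives a different variable."""
    return sbp_mmhg >= threshold

record = {"status": 3, "sbp_mmhg": 138}

print(decode("status", record["status"]))               # not in labor force
print(decode("status", 7))                              # undocumented code
print(derive_high_bp(record["sbp_mmhg"]))               # False
print(derive_high_bp(record["sbp_mmhg"], threshold=130))  # True
```

The same person is "high" or "not high" depending on a threshold that appears nowhere in the data column itself, which is why the threshold belongs in the metadata.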
The act of coding is part of the data-generating process. It is where concepts become columns.
Data cleaning as part of the generative story
Data cleaning is often described as the unglamorous work that happens before “real analysis.” That description is misleading. Cleaning is not merely janitorial. It actively shapes the dataset and therefore the conclusions drawn from it.
Cleaning may involve removing duplicates, reconciling inconsistent identifiers, correcting impossible values, standardizing units, fixing dates, harmonizing categories, linking records across sources, detecting outliers, excluding records, or imputing missing values. Each of these actions requires judgment.
Suppose a dataset contains two records with the same name and birthdate. Are they duplicates, or two different people? Suppose a blood pressure value is physiologically impossible. Is it a data entry error, a unit conversion problem, or evidence that the row should be removed? Suppose a participant skipped half a survey. Should their partial responses be retained? Suppose a sensor produced extreme readings during a storm. Are those readings invalid noise or exactly the phenomenon of interest?
Cleaning decisions can change estimates, associations, and uncertainty. Excluding outliers may make results look cleaner while removing rare but important cases. Imputing missing values may preserve sample size while adding model-based assumptions. Linking records may enrich data while introducing linkage errors. Harmonizing categories across datasets may enable comparison while erasing local distinctions.
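A short sketch can show how a single cleaning rule changes an estimate, and how the rule can be logged rather than applied silently. The wind-speed readings, the 20 m/s cutoff, and the clean function are all hypothetical.

```python
import statistics

# Hypothetical hourly wind-speed readings (m/s); the last three come from a storm.
readings = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 25.0, 30.0, 28.0]

def clean(values, max_valid=None):
    """Return the kept values plus a log of every exclusion and its rule."""
    kept, log = [], []
    for i, v in enumerate(values):
        if max_valid is not None and v > max_valid:
            log.append({"index": i, "value": v, "rule": f"value > {max_valid}"})
        else:
            kept.append(v)
    return kept, log

kept_all, log_all = clean(readings)                   # no exclusions
kept_trim, log_trim = clean(readings, max_valid=20.0)  # assumed cutoff

print("mean, nothing excluded:", statistics.fmean(kept_all))   # 11.8
print("mean, extremes excluded:", statistics.fmean(kept_trim))  # 5.0
print("exclusions logged:", len(log_trim))                      # 3
```

Whether 5.0 or 11.8 is the better summary depends on whether the storm is noise or the phenomenon of interest. The log does not settle that question, but it lets a later reader see that the question was answered and how.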
The key is not to avoid cleaning. Uncleaned data can be unusable. The key is to treat cleaning as part of the generative story and document it accordingly. A final dataset is not simply collected; it is produced.
Conceptualizing Data-Generating Processes Without Heavy Math
The data-generating process as a narrative
A data-generating process, or DGP, is a story about how the world produces the data we observe. In formal statistics, DGPs can be expressed mathematically. But before equations, they can be understood narratively.
A DGP begins with the world state: the true properties of units, many of which we may never observe directly. People have health statuses, preferences, histories, exposures, and constraints. Cells have molecular states. Ecosystems have species interactions. Galaxies have masses, distances, and histories. These properties exist whether or not our instruments capture them well.
Next come the processes that select units. Which people enter a study? Which cells are sampled? Which hospitals contribute records? Which posts are scraped? Which field sites are accessible? Selection determines which parts of the world become visible.
Then come the processes that measure units. Instruments, surveys, observers, sensors, assays, and algorithms translate properties into recorded values. Measurement error can enter here. Some errors are random, such as small fluctuations in repeated readings. Others are systematic, such as an instrument that is consistently miscalibrated or a survey question that consistently leads respondents toward a particular answer.
Finally, there are processes that record, code, store, clean, and transform the data. Values may be rounded, categorized, censored, linked, excluded, imputed, or aggregated. Missingness may occur because someone refused to answer, a machine failed, a record system did not capture a field, or a value was judged invalid.
Thinking through the DGP helps us ask better questions before fitting any model. What must have happened for this row to appear? What must have happened for this value to be missing? What selection processes are invisible? What measurement processes introduce error? What coding decisions turned reality into this column? A DGP narrative keeps statistics connected to the conditions that produced the data.
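The narrative can also be written as a toy simulation, with one function per stage. This is a sketch under invented assumptions: true blood pressures drawn from a normal distribution, a clinic that people with higher values are more likely to visit, an instrument with random noise and a +3 mmHg offset, and a record system that rounds values and loses about 10% of them. None of these numbers come from real data.

```python
import random
import statistics

rng = random.Random(7)

def world_state(n):
    """True (usually unobserved) blood pressure for each person."""
    return [{"id": i, "true_bp": rng.gauss(125, 15)} for i in range(n)]

def select(people):
    """Selection: assumed that higher values make a clinic visit more likely."""
    return [p for p in people if rng.random() < (0.2 if p["true_bp"] < 130 else 0.6)]

def measure(person):
    """Measurement: random noise plus an assumed +3 mmHg calibration offset."""
    return person["true_bp"] + rng.gauss(0, 5) + 3.0

def record(value):
    """Recording: round to whole numbers; about 10% of values are lost."""
    return None if rng.random() < 0.10 else round(value)

people = world_state(5000)
rows = [{"id": p["id"], "bp": record(measure(p))} for p in select(people)]
observed = [r["bp"] for r in rows if r["bp"] is not None]

true_mean = statistics.fmean(p["true_bp"] for p in people)
print(f"people in the world: {len(people)}")
print(f"rows in the dataset: {len(rows)}")
print(f"rows with a value:   {len(observed)}")
print(f"true mean: {true_mean:.1f}, observed mean: {statistics.fmean(observed):.1f}")
```

The observed mean sits well above the true mean, and no single stage is responsible. Selection and measurement offset both contribute, which is the kind of question a DGP narrative is meant to surface before any model is fitted.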
Aleatory and epistemic uncertainty
Uncertainty is not all of one kind. A useful distinction separates aleatory uncertainty from epistemic uncertainty.
Aleatory uncertainty is variability inherent in a system. Even if we knew a great deal, some variation would remain. Biological organisms differ. People make different choices. Weather fluctuates. Quantum and thermal processes involve randomness. Experimental runs vary. If we sampled another person, another mouse, another storm, another day, or another cell, we might observe a different outcome because the world itself is variable.
Epistemic uncertainty comes from limited knowledge. It reflects what we do not know but might learn. A small sample creates epistemic uncertainty because more observations could sharpen our estimate. Unknown mechanisms create epistemic uncertainty because better theory or measurement could clarify the process. Missing covariates, poor instrumentation, limited follow-up, and incomplete records all contribute to epistemic uncertainty.
A simple way to express the distinction is this: aleatory uncertainty is variation that would remain even if we knew everything relevant; epistemic uncertainty is uncertainty that could be reduced if we knew more. In practice, the boundary is not always sharp. What seems random at one level may become explainable at another. Still, the distinction helps clarify what kind of uncertainty we are facing and what kind of response is appropriate.
More data may reduce epistemic uncertainty, especially when the data are relevant and well measured. But more data cannot eliminate aleatory variability. Better design, better theory, better measurement, and better documentation may reduce epistemic uncertainty in ways that mere sample size cannot.
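A brief simulation can illustrate one narrow version of this point. It assumes individual outcomes that vary around 100 with a standard deviation of 15 (invented values), and it treats the standard error of the mean as a stand-in for one kind of epistemic uncertainty. As the sample grows, the uncertainty about the mean shrinks, while the spread among individuals stays roughly where it was.

```python
import random
import statistics

rng = random.Random(1)

def draw_outcomes(n):
    # Assumption: individual outcomes vary around 100 with SD 15.
    return [rng.gauss(100, 15) for _ in range(n)]

results = []
for n in (25, 400, 6400):
    sample = draw_outcomes(n)
    sd = statistics.stdev(sample)  # spread among individuals
    se = sd / n ** 0.5             # uncertainty about the mean
    results.append((n, sd, se))
    print(f"n={n:5d}  individual SD={sd:5.2f}  standard error of mean={se:5.2f}")
```

The sketch does not cover epistemic uncertainty that sample size cannot touch, such as an unmeasured covariate or a miscalibrated instrument.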
Imaginary repetitions and sampling variability
One of the central ideas in statistics is that the dataset we have is only one of many datasets we might have obtained under the same design. Imagine repeating the same study many times: drawing new samples from the same population, measuring them with the same instruments, applying the same protocol, and calculating the same summary each time. The results would not be identical. They would vary.
This thought experiment motivates sampling variability. If we sample 1,000 people to estimate average blood pressure, another random sample of 1,000 people would likely produce a slightly different average. If we conduct an experiment with a finite number of participants, another run of the experiment would likely produce a slightly different treatment effect estimate. Even a perfectly executed study has variability because sampling itself is a random process.
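The repetition thought experiment can be sketched as a small simulation. The population values below (a mean of 120 mmHg and a standard deviation of 15) are assumptions chosen for illustration, as are the names run_one_study and estimates; nothing here comes from a real study.

```python
import random
import statistics

# Assumed population values, for illustration only.
TRUE_MEAN = 120.0    # hypothetical mean systolic blood pressure (mmHg)
TRUE_SD = 15.0       # hypothetical person-to-person standard deviation
SAMPLE_SIZE = 1000
REPETITIONS = 200

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def run_one_study():
    """Draw one random sample and return its average."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)


# Repeat the same design many times and collect the estimates.
estimates = [run_one_study() for _ in range(REPETITIONS)]
spread = statistics.stdev(estimates)

print(f"lowest estimate:  {min(estimates):.2f}")
print(f"highest estimate: {max(estimates):.2f}")
print(f"spread (SD) of the estimates: {spread:.2f}")
```

Under these assumptions, theory suggests the spread should land near 15 / sqrt(1000), roughly 0.47 mmHg. Every repetition uses the same design, yet no two averages agree exactly.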
Confidence intervals, p-values, error bars, standard errors, and related tools are ways of reasoning about this imaginary repetition. They do not eliminate uncertainty. They organize it. They ask: how much would our estimate tend to move around if the study were repeated under the same design? How surprising would a result like this be under a specified assumption? How precisely have we estimated the quantity of interest?
This is why uncertainty is not a sign that a study failed. Uncertainty is a normal feature of learning from samples. The goal is not to pretend it is absent, but to characterize it honestly.
Data Quality and the Anatomy of Uncertainty
Types of error
Data quality is not a single property. A dataset can be excellent in one respect and weak in another. To understand quality, it helps to distinguish types of error.
Measurement error occurs when recorded values differ from the values we intend to measure. A scale may be miscalibrated. A diagnostic test may misclassify patients. A respondent may misunderstand a survey question. An observer may code behavior inconsistently. A sensor may drift over time. Measurement error can be random, adding noise, or systematic, creating bias. A thermometer that fluctuates unpredictably adds random error; a thermometer that is always two degrees too high creates systematic error.
Sampling error arises because we observe a sample rather than the entire population. Even with an ideal sampling frame and a probability design, the particular units selected will vary from sample to sample. This produces variability in estimates. Sampling error is not a mistake; it is the expected consequence of learning from a subset.
Processing error occurs after measurement, during data entry, linkage, coding, storage, transformation, or cleaning. A value may be typed incorrectly. Dates may be parsed in the wrong format. Units may be mixed. Records may be linked to the wrong person. Categories may be recoded inconsistently. Duplicates may remain or valid records may be removed.
Measurement and processing errors can each be random or systematic, and sampling has a parallel split: random sampling error on one side, systematic coverage or selection bias on the other. Random errors tend to increase variability and reduce precision. Systematic errors can shift conclusions in a particular direction. The distinction matters because different errors require different remedies. Repeating measurements may reduce random measurement noise, but it will not fix a systematically biased instrument. Increasing sample size may reduce sampling variability, but it will not correct coverage bias. More sophisticated modeling may account for some processing errors, but it cannot recover information that was destroyed by undocumented transformations.
Bias versus variability
Bias and variability are two distinct ways a result can go wrong: variability makes it imprecise, while bias makes it systematically off target.
Variability refers to how much estimates would spread across imaginary repetitions of a study. If a study were repeated many times and the estimates bounced widely from one repetition to another, the estimate is noisy. High variability means low precision. The result may be centered on the right target but difficult to pin down.
Bias refers to a systematic tendency to miss the target in a particular direction. If an instrument is miscalibrated, a sampling frame excludes a relevant group, or a survey question consistently pushes respondents toward one answer, results may be biased. Low variability does not protect against bias. A biased process can produce highly consistent wrong answers.
Consider two bathroom scales. One scale gives slightly different readings each time you step on it: 160, 163, 158, 162. It is noisy, but perhaps centered near the truth. Another scale gives the same reading every time, but it is always ten pounds too high. It is precise but biased. The second scale may look more reliable because it is consistent, but consistency alone is not truth.
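The two scales can be mimicked in a few lines. This is a minimal sketch: the true weight of 160 pounds, the noise level of 2 pounds, and the helper name summarize are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(7)   # fixed seed for reproducibility
TRUE_WEIGHT = 160.0      # assumed true weight, in pounds
N_READINGS = 50

# Scale A: unbiased but noisy (random error with an assumed SD of 2 pounds).
noisy = [rng.gauss(TRUE_WEIGHT, 2.0) for _ in range(N_READINGS)]
# Scale B: perfectly repeatable, but always 10 pounds too high.
biased = [TRUE_WEIGHT + 10.0 for _ in range(N_READINGS)]


def summarize(readings):
    """Return (bias, spread): the average error and the SD of the readings."""
    bias = statistics.fmean(readings) - TRUE_WEIGHT
    spread = statistics.pstdev(readings)
    return bias, spread


noisy_bias, noisy_spread = summarize(noisy)
biased_bias, biased_spread = summarize(biased)

print(f"noisy scale:  bias {noisy_bias:+.2f}, spread {noisy_spread:.2f}")
print(f"biased scale: bias {biased_bias:+.2f}, spread {biased_spread:.2f}")
```

Averaging more readings from the noisy scale pulls its average toward the truth. Averaging more readings from the biased scale only confirms the same wrong number.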
Scientific results can fail in either way. A small randomized study may be relatively unbiased but noisy. A massive convenience sample may produce extremely precise estimates that are systematically off. Good statistical reasoning asks both questions: how variable are the results, and are they aimed at the right target?
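The contrast between a small random sample and a massive convenience sample can also be simulated. In this sketch the population, the 30% trait share, and the inclusion probabilities (0.20 versus 0.10) are hypothetical values, chosen only to make the selection effect visible.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed for reproducibility

# Hypothetical population: 1 = has the trait, 0 = does not (about 30% have it).
population = [1 if rng.random() < 0.30 else 0 for _ in range(50_000)]
true_share = statistics.fmean(population)

# Small probability sample: every unit is equally likely to be chosen.
small_random = rng.sample(population, 200)

# Large convenience sample: units with the trait are assumed to be twice
# as likely to end up in the data (0.20 versus 0.10).
convenience = [y for y in population
               if rng.random() < (0.20 if y == 1 else 0.10)]

random_estimate = statistics.fmean(small_random)
convenience_estimate = statistics.fmean(convenience)

print(f"true share:                 {true_share:.3f}")
print(f"random sample (n=200):      {random_estimate:.3f}")
print(f"convenience sample (n={len(convenience)}): {convenience_estimate:.3f}")
```

With these assumed inclusion rates, the convenience estimate settles near 0.46 however many records accumulate. The small random sample wobbles from run to run but stays centered on the true share.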
Missing data as structured uncertainty
Missing data are often treated as a nuisance, but missingness is itself informative about the data-generating process. A blank cell is not just absence; it is the result of something that happened or failed to happen.
Sometimes data are missing for reasons that are effectively random with respect to the variables of interest. A lab machine may fail unpredictably for a small number of samples. A survey page may be skipped because of a software glitch. In such cases, missingness may mainly reduce sample size and precision.
Other times, missingness is related to variables we observe. Younger participants may be less likely to answer landline surveys, but we may know their age from registration data. Patients with longer hospital stays may have more complete lab records because they are observed more often. If missingness is related to observed variables, careful adjustment may help.
The most difficult cases occur when missingness is related to variables we do not observe, including the missing value itself. People with the highest income may be less willing to report income. Patients with severe symptoms may be more likely to drop out of a longitudinal study. Students struggling most may be least likely to complete evaluations. In these cases, the missing data may conceal precisely the information most needed.
Missingness can change the effective population represented by the sample. If the people who remain in a dataset differ from those who are missing, then the analysis may describe the reachable, recorded, complete-case population rather than the intended target population. Treating missingness as part of the DGP helps us ask: who is missing, why are they missing, and how might their absence alter the conclusions?
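A short simulation can show why the reason for missingness matters. The sketch below compares two of the situations described above: values missing at random, and values missing because of the value itself. The log-normal income distribution, the missingness rates, and the helper name is_reported are illustrative assumptions, not estimates from any survey.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility
N = 20_000

# Hypothetical right-skewed incomes (the log-normal parameters are assumptions).
incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(N)]
full_mean = statistics.fmean(incomes)
cutoff = sorted(incomes)[int(0.8 * N)]  # roughly the 80th percentile

# Mechanism 1: every value has the same 30% chance of being missing.
random_missing = [x for x in incomes if rng.random() >= 0.30]


# Mechanism 2: high earners are assumed to withhold income far more often.
def is_reported(income):
    p_missing = 0.70 if income > cutoff else 0.10
    return rng.random() >= p_missing


value_dependent = [x for x in incomes if is_reported(x)]

mcar_mean = statistics.fmean(random_missing)
mnar_mean = statistics.fmean(value_dependent)

print(f"mean with no missing values:          {full_mean:,.0f}")
print(f"mean when values are missing at random: {mcar_mean:,.0f}")
print(f"mean when high incomes go unreported:   {mnar_mean:,.0f}")
```

The first mechanism mostly costs sample size. The second shifts the complete-case average downward, so the analysis quietly describes a different population from the one intended.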
Descriptive summaries as tools for uncertainty awareness
Descriptive statistics are often introduced as simple ways to summarize data: means, medians, ranges, quantiles, histograms, standard deviations. But they are more than tidy summaries. They are instruments for inspecting uncertainty, variation, and data quality.
A mean tells us about a center of gravity, but it can be pulled by extreme values. A median tells us about the middle case and may better represent skewed distributions. Quantiles show how values are spread across the distribution. Ranges reveal extremes. Histograms and density plots show shape: symmetry, skew, multimodality, gaps, and outliers. These summaries help us see whether a single number is adequate or misleading.
The standard deviation describes variability among the observed data values. If we measure heights in a sample, the standard deviation tells us how spread out individual heights are around the average. It is about variation in the data.
The standard error describes variability in an estimate across hypothetical samples. If we estimate a mean, the standard error asks how much that estimated mean would tend to vary if we repeated the sampling process. It is about uncertainty in the estimate.
Confusing the two leads to conceptual mistakes. A population can have high individual variability but a precisely estimated mean if the sample is large and well designed. Conversely, a population can have modest individual variability but an imprecise estimate if the sample is small or poorly designed.
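The difference between the two quantities can be seen by sampling from the same hypothetical population at two sample sizes. This sketch assumes heights that are roughly normal with a mean of 170 cm and an SD of 10 cm, and it uses the textbook formula SE = SD / sqrt(n), which presumes independent observations from simple random sampling.

```python
import math
import random
import statistics

rng = random.Random(5)  # fixed seed for reproducibility


def sd_and_se(n):
    """Sample n hypothetical heights (cm); return (SD, SE of the mean)."""
    heights = [rng.gauss(170.0, 10.0) for _ in range(n)]
    sd = statistics.stdev(heights)  # spread of the individual values
    se = sd / math.sqrt(n)          # estimated spread of the sample mean
    return sd, se


sd_small, se_small = sd_and_se(50)
sd_large, se_large = sd_and_se(5000)

print(f"n=50:   SD {sd_small:.2f}, SE {se_small:.2f}")
print(f"n=5000: SD {sd_large:.2f}, SE {se_large:.2f}")
```

The standard deviation stays near 10 at both sample sizes because people do not become more alike when we measure more of them. The standard error shrinks because the average becomes easier to pin down.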
Descriptive summaries should be used early and often, not as a mechanical prelude but as a way of learning what kind of uncertainty the data contain. They can reveal implausible values, unexpected clusters, missingness patterns, subgroup differences, and distributions that challenge later modeling assumptions.
Statistics as Organized Reasoning from Samples
Descriptive versus inferential statistics
Descriptive statistics characterize the data in hand. They tell us what is present in the sample: the average age of participants, the distribution of incomes, the proportion of records with missing values, the range of temperatures observed, the difference between treatment groups in the collected data. Descriptive statistics do not require a leap beyond the observed dataset, though they still depend on how the data were generated and cleaned.
Inferential statistics go further. They use the sample to make claims about a broader population or process. An inferential claim might estimate the average effect of a treatment in a target population, the prevalence of a disease in a region, the relationship between exposure and outcome in a population, or the parameters of a process that generated observations.
The movement from description to inference is the movement from “what did we observe?” to “what does this imply beyond what we observed?” That movement always rests on assumptions. We assume something about sampling, measurement, independence, missingness, model structure, or the stability of processes across contexts. Sometimes those assumptions are well supported by design. Sometimes they are speculative. But they are never absent.
This is why descriptive and inferential statistics should not be treated as merely two chapters in a textbook. They represent different kinds of claims. A descriptive claim can be true of the sample and still fail as a population claim. A sample may contain 60% women, but that does not mean the target population is 60% women unless the sampling process supports that inference. Inference requires a bridge, and the bridge is built from design plus assumptions.
Frequentist and Bayesian lenses
Two major traditions offer different ways to organize statistical uncertainty: frequentist and Bayesian reasoning. At this stage, the distinction can be introduced lightly.
In a frequentist lens, probability is connected to long-run behavior under repeated procedures. A 95% confidence interval, for example, does not mean there is a 95% probability that this particular interval contains the true value. Rather, the 95% describes the procedure: if it were repeated many times under specified conditions, about 95% of the intervals it produced would contain the true value. A p-value summarizes how unusual the observed data, or more extreme data, would be under a specified null model.
Frequentist reasoning fits naturally with the imaginary repetition idea. What would happen if we repeated the sampling, experiment, or estimation procedure again and again? How often would a method make errors? How variable is an estimator across repetitions?
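The long-run reading of a confidence interval can be checked by simulation. This is a minimal sketch under assumed conditions: normally distributed data with a known true mean, and the common normal-approximation interval (mean plus or minus 1.96 standard errors). The function name interval_covers_truth is hypothetical.

```python
import math
import random
import statistics

rng = random.Random(2024)  # fixed seed for reproducibility
TRUE_MEAN, TRUE_SD = 50.0, 10.0  # assumed population values
N, REPETITIONS = 100, 1000
Z_95 = 1.96  # normal-approximation multiplier for a 95% interval


def interval_covers_truth():
    """Run one imaginary study; report whether its interval contains the truth."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    return mean - Z_95 * se <= TRUE_MEAN <= mean + Z_95 * se


hits = sum(interval_covers_truth() for _ in range(REPETITIONS))
coverage = hits / REPETITIONS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true mean or does not; the 95% belongs to the procedure across repetitions.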
In a Bayesian lens, probability represents degrees of belief or uncertainty about unknown quantities. We begin with prior beliefs or prior information, observe data, and update to posterior beliefs. The posterior distribution expresses uncertainty after combining prior information with the evidence in the data. Bayesian reasoning asks: given what we believed before and what we observed, what should we believe now?
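One of the simplest concrete cases of Bayesian updating is estimating a proportion with a Beta prior. The example below is a sketch, not a recommended analysis: the Beta(2, 2) prior and the data (14 successes in 20 trials) are invented for illustration. For this conjugate prior, the update amounts to adding the observed counts to the prior parameters.

```python
# Prior: Beta(2, 2), a mild belief centered on 0.5 (an assumed choice).
prior_a, prior_b = 2, 2

# Hypothetical data: 14 successes and 6 failures in 20 trials.
successes, failures = 14, 6

# Conjugate update: add the observed counts to the prior parameters.
post_a = prior_a + successes
post_b = prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


prior_mean = beta_mean(prior_a, prior_b)
posterior_mean = beta_mean(post_a, post_b)
sample_proportion = successes / (successes + failures)

print(f"prior mean:        {prior_mean:.3f}")         # 0.500
print(f"sample proportion: {sample_proportion:.3f}")  # 0.700
print(f"posterior mean:    {posterior_mean:.3f}")     # 0.667
```

The posterior mean sits between the prior mean and the sample proportion. With more data it would move closer to the sample proportion, and with a stronger prior it would move less.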
Both approaches are ways of reasoning from samples to broader claims. They differ in interpretation, workflow, and philosophical grounding, but both require assumptions. Both can be used well or poorly. Both can express uncertainty. The important point here is not to choose a side prematurely, but to see that statistical frameworks are organized languages for uncertainty.
The role of assumptions
No inference happens without assumptions. Some assumptions are explicit, such as a model stating that errors are normally distributed or that observations are independent. Others are implicit, such as assuming that nonrespondents resemble respondents after adjustment, that a measurement instrument behaves the same across groups, or that a relationship observed in one setting applies to another.
Distributional assumptions concern the shape of variability. Are errors roughly symmetric? Are extreme values common? Are counts better represented by one process than another? Independence assumptions concern whether observations can be treated as separate. If students are nested in classrooms, patients in hospitals, cells in plates, or repeated measures within individuals, independence may be unrealistic. Temporal assumptions concern whether processes are stable over time. Measurement assumptions concern whether variables capture the constructs they claim to capture.
Design can make assumptions more or less plausible. Random sampling supports certain population inferences. Random assignment supports causal comparisons. Standardized protocols support comparability. Longitudinal follow-up supports temporal reasoning. Good metadata supports interpretation. Weak design forces heavier reliance on modeling assumptions.
A mature statistical analysis does not hide assumptions. It names them, motivates them, checks them where possible, and asks how conclusions might change if they fail. Assumptions are not embarrassing. They are the scaffolding of inference.
Introducing Statistical Models as Golems
What a statistical model is in this narrative
A statistical model is a formalized version of a data-generating story. It encodes ideas about which variables matter, how they are related, where randomness enters, and what patterns we expect to see across possible datasets.
A model might say that an outcome varies around a mean, that the mean depends on predictors, that observations are grouped, that measurement error exists, that some values are missing through a particular process, or that individual trajectories change over time. These statements can be written mathematically, but their roots are conceptual. A model is not merely a formula. It is a disciplined guess about how data could have arisen.
This connection to the DGP is crucial. A model does not only describe the one dataset we observed. It imagines a space of possible datasets. It asks what kinds of data would be plausible if the model’s assumptions were true. Fitting a model to observed data is then a way of comparing the observed dataset to that imagined generative structure.
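The idea that a model imagines a space of possible datasets can be made concrete by simulating from one. The sketch below assumes a deliberately simple generative story, a straight line plus normal noise, with invented parameter values. The names simulate_dataset and ols_slope are hypothetical helpers.

```python
import random
import statistics

rng = random.Random(99)  # fixed seed for reproducibility

# Assumed generative story: y = ALPHA + BETA * x + normal noise.
ALPHA, BETA, SIGMA = 1.0, 0.5, 2.0
xs = list(range(20))


def simulate_dataset():
    """Generate one possible set of outcomes under the assumed model."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in xs]


def ols_slope(x_vals, y_vals):
    """Least-squares slope of y on x."""
    x_bar = statistics.fmean(x_vals)
    y_bar = statistics.fmean(y_vals)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_vals, y_vals))
    sxx = sum((x - x_bar) ** 2 for x in x_vals)
    return sxy / sxx


# Many imagined datasets, each summarized by its fitted slope.
slopes = [ols_slope(xs, simulate_dataset()) for _ in range(500)]

print(f"assumed slope:           {BETA}")
print(f"average fitted slope:    {statistics.fmean(slopes):.3f}")
print(f"spread of fitted slopes: {statistics.stdev(slopes):.3f}")
```

Each simulated dataset is one thing the model says could have happened. Comparing a slope fitted to real data with this simulated spread is one way to ask whether the observed data look like something the model could plausibly produce.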
Models can summarize, estimate, predict, explain, and simulate. But they always simplify. They leave things out. They impose structure. They translate messy reality into a system that can be reasoned with. That simplification is both their power and their danger.
Models as golems
Richard McElreath’s metaphor of statistical models as golems is useful because it captures their strange combination of power and mindlessness. In folklore, a golem is a creature animated to perform tasks. It can be strong and useful, but it does not understand the world in the way a person does. It follows instructions.
Statistical models are similar. They can process large amounts of information, estimate complex relationships, propagate uncertainty, and generate predictions. But they have no common sense. They do not know whether the sampling frame excluded half the population. They do not know whether a variable is poorly measured unless we tell them. They do not know whether a causal interpretation is inappropriate. They do not know whether a category is socially constructed, historically unstable, or ethically fraught.
Models are built from assumptions rather than clay. If the assumptions are thoughtful, transparent, and connected to design, the model can be a powerful helper. If the assumptions are careless, the model may produce nonsense with great confidence. Harm does not come because the model is malicious. It comes because the model does exactly what it was instructed to do, even when the instructions are incomplete or misguided.
This metaphor prepares us for later questions: how do we design models, fit them, check them, compare them, criticize them, and refine them? A model should not be worshipped as an oracle. It should be treated as a constructed tool whose behavior must be understood.
How modeling connects back to sampling and data collection
A model cannot fully repair a broken data-generating process. It cannot make an inaccessible population appear in the data. It cannot remove severe sampling bias by mathematical elegance alone. It cannot turn poor measurements into good measurements without additional information. It cannot recover distinctions erased by coding, or reconstruct undocumented cleaning decisions with certainty.
What a model can do is help quantify uncertainty under stated assumptions. It can help adjust for known design features, estimate relationships, pool information across groups, incorporate prior knowledge, represent measurement error, account for clustering, and extrapolate cautiously. It can make assumptions explicit enough to criticize. It can show how conclusions depend on different ways of representing the DGP.
But modeling always inherits the conditions of data collection. A model fitted to a convenience sample may describe that sample well while failing to generalize. A model fitted to biased measurements may estimate the wrong relationship precisely. A model fitted to a dataset with undocumented exclusions may produce results whose target population has silently changed.
“Garbage in, garbage out” understates the danger. In statistics, the risk is sometimes “garbage in, gospel out”: poor inputs transformed into polished, precise, authoritative-looking conclusions. The more sophisticated the model, the more tempting it can be to trust the output. That is why the earlier questions about measurement, sampling, protocols, metadata, and cleaning remain central even when the analysis becomes mathematically advanced.
Trust, Context, and Looking Ahead
Knowing where data came from is not a side issue. It is statistical. Provenance and metadata tell us how to interpret uncertainty, bias, scope, and evidence.
Provenance describes the origin and history of data: who collected them, under what conditions, using which instruments, from which sampling frame, according to what protocol, and through what transformations. Metadata describe the structure and meaning of the data: variable definitions, units, coding schemes, thresholds, missing value conventions, cleaning steps, linkage procedures, and version histories.
These details matter because uncertainty is not attached only to final estimates. It is embedded throughout the path from world to measurement to dataset to analysis. A confidence interval may quantify sampling variability under a model, but it does not by itself tell us whether the sampling frame excluded key groups. A p-value may summarize incompatibility with a null model, but it does not tell us whether the outcome was misclassified. A model coefficient may look precise, but it does not reveal undocumented cleaning choices.
Trustworthy statistical reasoning therefore begins by asking where the data came from. What population was accessible? What sample was observed? What measurements were taken? What protocols governed them? What coding decisions were made? What values are missing, and why? What transformations occurred between collection and analysis? What assumptions connect the observed data to the claim being made?
This prepares us for the next stage: treating data not only as tables but as documented, contextualized, historically produced objects. Metadata, provenance, and modeling are not separate concerns. They are different parts of the same problem: how to reason responsibly from partial, constructed evidence about a variable world.