Clarifying Scientific Concepts
There is a lot of confusion, propagated by various media sources, around fundamental scientific concepts and terminology. Colloquial uses of terms such as "theory" or "hypothesis" tend to distort the scientific usage of these terms. Scientific concepts can become trivialized as well; "your 'theory' is just as good as my 'theory'". I want to clarify some of this terminology because constant misuse simply confuses everyone, making it harder to distinguish between competing sources of information on social media platforms.
Science and Pseudo-Science
Demarcating science from non-science is quite difficult. There are obvious exemplars of pseudo-science and many prototypical examples of real science, but no set of necessary and sufficient conditions has been identified that can sort any particular example into a clearly defined bucket. Nevertheless, there are common features shared across many disciplines we deem scientific. These attributes form clusters; examples on the peripheries become harder to classify because they share features with canonical examples of pseudoscience. If you are familiar with Wittgenstein's notion of family resemblance then this might make sense to you. You can also think about it in terms of network clustering. In the diagram below, you can think of the edges between nodes as defining some common relation and distance as measuring some degree of closeness.
A two dimensional view might look something like this:
In the middle of the large cluster we might consider a discipline like physics, and towards the periphery of the cluster we could consider something like psychology, sociology, and economics. The green cluster could represent a pseudo-scientific category containing things like intelligent design. The thing to note with both visualizations is that there isn't a global set of definable features that can distinguish any two disciplines. We cannot construct a list containing all of the essential features that could exclude non-sciences without running into problems. For example, a pillar of modern science is the experiment. Theoretical physics, however, typically does not conduct experiments. Does this mean we exclude it from being a science? That would be absurd. Similarly, the geological sciences typically do not engage in much a priori theorizing, even though a common practice of modern science is establishing some sort of theory to explain observations. Does this mean geology is not a science? That, too, would be absurd.
These examples immediately make it obvious that, when discussing what science is, we have to consider that "science" is a term referring to a broad class of related disciplines. Something can be more or less scientific, exemplifying the fact that "science" is somewhat of a graded concept. This implies that there are qualities or properties by which we can evaluate any particular knowledge claim as science or not. I've shifted away from classifying disciplines as a whole to individual claims because we might run into the same problem when attempting to classify an entire discipline as scientific. For example, there are many knowledge claims coming out of the psychological literature that do not exhibit properties we consider scientific. It's also possible that for any given knowledge claim, some of the properties might not be exhibited. Nevertheless, this does not imply that all claims coming out of psychology are pseudo-scientific. This is also true of some of the more canonical examples of science. Therefore, we must consider the rate at which claims originating from any discipline exhibit scientific qualities. This will prevent us from a priori labeling knowledge claims as pseudo-scientific simply based on the discipline they come from.
Here is a list of some qualities that I think can be used when considering whether a claim is scientific. I do not claim this list to be exhaustive, and the order does not matter at this point. This is not to say that all qualities are of equal importance; I think that would be false.
- Makes predictions or retrodictions
- Is testable
- Is replicable
- Systematically records observations
- Is capable of verification and validation
- Acknowledges the boundaries of its explanatory breadth
- Self corrective and reflective
- Maintains a fair degree of precision and clarity with its terminology
- Is falsifiable
- Has an empirical basis
- Is reproducible
- Is internally coherent and logically consistent
- Strives for impartiality and objectivity
- Has a sufficient degree of generalizability
- Can be subject to scrutiny within a broader community of peer review
- Uses concepts that can be measured or quantified
- Subject to revision in light of new findings
- Is transparent with its methodology
- Is rigorous
- Seeks disconfirmation along with confirmation
- Is communicable
- Seeks to provide causal explanations
- Is highly critical during the design and analysis phases
- Critically assesses methods, assumptions, and interpretations of results
- Uses mathematical models and simulation methods
- Seeks simple and robust methods
- Leverages probabilistic reasoning and acknowledges the uncertainty of its conclusions
- Does not rely on authority unless it's a claim that is taken to be true within the community
- Considers alternative hypotheses
- Seeks convergent validity through multiple information sources and different methods
There are probably more I am not considering, but I think this is not a bad start. I am partial to mathematical modeling but acknowledge that there are disciplines, such as anthropology, that are scientific but might not emphasize mathematical modeling. Also, not all knowledge claims must come from a mathematical model. Nevertheless, scientific disciplines tend to use models because they help us check our assumptions against reality. Another consideration is that a claim might not be testable at this point in time, but technological innovations in the future can make it testable. The fact that something isn't immediately testable due to technical constraints does not make it unscientific. I would say that it's unscientific if, in principle, it cannot be tested; if there is no conceivable way to test the claim, then it is not testable. Again, these are qualities that claims should strive for if we want to consider them scientific.
Throughout the rest of this blog post I'll touch on some of these concepts. I just wanted to initially get this out of the way because many people are confused about which claims are genuinely scientific.
Theories and Scientific Theories
I am not going to focus on any particular theory. I just want to consider, in general, what it means to theorize in a scientific setting, how this activity differs from something like philosophical theorizing, and how both activities are quite different from how the public understands the term.
In the broadest sense of the term, a theory is a structured way of understanding, interpreting, or explaining phenomena. It provides a conceptual framework, a network of ideas that helps us make sense of observations, connect patterns, and predict or interpret outcomes. Theorizing is something humans do all the time; often when you are trying to explain something, you are assuming some underlying theory (although it's normally implicit and not fully structured). Theorizing, in the broadest sense, is any process of pattern finding, meaning making, or framework building. It’s the creative and interpretive act of connecting ideas into a coherent picture — whether the “data” are experiments, emotions, social behaviors, or symbols.
In science, theorizing takes on specific methodological and epistemic constraints. Scientific theories must be testable, falsifiable, and consistent with empirical data. They are often formalized, expressed mathematically, and aimed at predictive power. So while all scientific theories are theories, not all theories are scientific. Science narrows the broader act of theorizing into a disciplined method: empirical, systematic, and verifiable. In philosophy, theorizing is often about conceptual analysis rather than empirical testing. Philosophical theories often deal with abstractions more removed from empirical reality; they are not connected to experimental methods but rather focus on logical entailment. They might deal with concepts like possibility and necessity. You might be eager to claim that science deals with these concepts as well. You'd be correct: certain scientific theories entail the possibility and impossibility of various empirical outcomes. Philosophical possibility is much broader, consisting of what is logically possible; in other words, its theories are "metaphysical". So you can think of scientific and philosophical theorizing as specialized, formalized subsets of the larger, more universal human capacity to theorize — just like poetry and mathematics are specialized ways of using language.
There are common components to all theories, regardless of how fleshed out the theoretical details are:
- Concepts: the basic building blocks of a theory; they name and define the phenomena being discussed. For example, "gravity" in physics, or "motivation" in psychology. Concepts are abstractions; they simplify reality so we can think systematically about it.
- Construct: A type of concept that has been deliberately defined for a specific theoretical purpose. Constructs often can’t be directly observed but are inferred (e.g. “intelligence,” “social capital,” “self-esteem”).
- Propositions: statements that specify the relationships between concepts, how one thing affects or relates to another. In formal sciences, these are hypotheses; in philosophy or critical theory, they may be argumentative claims. A well-defined scientific theory generates testable hypotheses amenable to falsification.
- Assumptions: These are the underlying ideas or conditions taken for granted for the theory to work. For example, in Economics we often assume humans are rational decision-makers. Making assumptions explicit is key to understanding the scope and limits of a theory.
- Boundaries and Scope Conditions: This is the "where and when" of a theory, what domain or context it applies to. For example, a psychological theory may explain individual behavior, not group dynamics.
- Logical Structure: This is the theory's internal organization, how its pieces fit together coherently and systematically. A good theory has internal consistency and avoids contradictions.
- Empirical Linkages: This is how the theory connects to observation or experience. The theory entails certain observations; these are its predictions. In science, this means operational definitions and testability.
Theorizing itself tends to unfold through a series of stages:
- Observation or Problem Identification: It starts with noticing a phenomenon, inconsistency, or puzzle. “Something interesting is happening here — why?”
- Conceptualization: Identify key elements and name them. Define concepts clearly and delimit what you’re focusing on.
- Relationship Mapping: Propose how these elements relate. In science, this becomes hypotheses or models. In philosophy or social theory, this becomes conceptual arguments or dialectical relations.
- Integration and Abstraction: Bring multiple relationships together into a systematic framework. The theory begins to generalize — it becomes more than a list of observations.
- Validation or Evaluation: In science → testing with data, replication, falsification. In interpretive or critical theory → coherence, explanatory depth, ethical and practical adequacy.
- Refinement and Extension: Theories evolve as new evidence or perspectives emerge. This is the “living” nature of theory — it’s continuously reshaped.
I've been reading a lot from Paul Smaldino recently, and think his description of theory is incredibly useful. Paul Smaldino doesn’t offer a single, neat “textbook” definition of theory in the way a philosophy-of-science treatise might, but across his writings we can reconstruct how he treats and uses theories. From his published work (on modeling, methodology, philosophy of science), Smaldino’s view of theory includes the following aspects:
- Decomposition into parts, properties, relationships, and dynamics: In “How to Build a Strong Theoretical Foundation,” Smaldino urges that to develop a theory of some phenomenon, one must decompose the system into relevant parts, specify the properties of those parts, articulate the relationships among them, and define how these can change over time. Thus, theory is not just a verbal or narrative statement, but a structural decomposition plus a specification of dynamics and interactions.
- Theories are tools (not “Truth”): Smaldino is explicit that there is (in his view) no one “true” theory; rather, theories are evaluated by how useful they are for understanding, prediction, generalizability, and refinement. In other words, theory is pragmatic: it is judged by its capacity to guide thinking, to generate falsifiable hypotheses, to clarify assumptions, and to integrate with empirical work.
- Verbal vs. formal theories / role of models: Smaldino repeatedly distinguishes verbal theories (narrative descriptions, “story-like”) from formal theories (mathematical or computational models). He argues that verbal theories are often vague, underdetermined, and thus resist strong testing or falsification. Formal models serve as instantiations of theory—they force explicit specification of assumptions, highlight omitted aspects, and allow rigorous exploration of consequences. In this view, a “good” theory is one that can be (or already is) translated into a formal model (or a family of models) that sharpen and test its claims.
- Iterative and reflexive process: Smaldino sees theory construction as iterative: empirical work should refine the theory, and theory should shape what empirical questions get asked. He warns against treating data merely as support for a verbal theory; rather, data should prompt refinement, specification, or rejection of theoretical assumptions. Also, theory-building is reflexive: one must be conscious of which assumptions are built in (implicitly or explicitly), what is omitted for simplicity, and the “violence” (i.e., distortion) done to reality in modeling.
- Theoretical foundation and training: Smaldino laments that many social scientists lack training in theory construction and formal modeling. In “How to Build a Strong Theoretical Foundation,” he argues for greater methodological and conceptual training so that theory is not just received (from canonical frameworks) but actively constructed. His emphasis is that theory is not peripheral—it is central. Without robust theory, methods (however sophisticated) may produce results without insight. (“Better methods can’t make up for mediocre theory.”)
Putting these aspects together: a theory is a deliberately constructed specification of (i) entities or components of a system, (ii) the properties and possible states of those components, (iii) the relationships and rules by which those components interact, and (iv) the temporal dynamics of how those states and relationships evolve. A strong theory is one that (a) can be formalized in mathematical or computational models, (b) offers testable predictions or counterfactuals, (c) is subject to empirical refinement, and (d) is judged not by an abstract “Truth” but by its utility in explaining, predicting, generalizing, and guiding further inquiry.
In his book “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution”, he defines theory as:
"... a set of assumptions upon which hypotheses derived from that theory must depend. Strong theories allow us to generate clear and falsifiable hypotheses."
Distinguishing it from a theoretical framework:
“A theoretical framework is a broad collection of related theories that all share a common set of core assumptions.”
Theories guide inquiry and the modeling process. A theory frames what phenomena we pay attention to, what questions we ask, and how we model:
“Each [model] decomposes a system in a particular way … What questions does your theory address? What parts do you need to include to answer those questions? … Is your model a satisfying representation of your theory?”
That is, a theory is more than just a verbal narrative: it's the background of assumptions that define how one decomposes the phenomena, and from which hypotheses or models are generated. Formal models are instantiations or precise expressions of the theory, and are used as a way to stress test or refine the theory. There is a one-to-many relationship between theories and models; one theory can be expressed with many different models. This is what I take to be the scientific notion of theory: how I see it applied and how I was trained to apply the term (within the context of economic theory).
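To make the one-to-many relationship concrete, here is a minimal sketch (my own toy example, not one taken from Smaldino) of a single verbal theory, "ideas spread through social contact," instantiated as two different formal models: an aggregate difference-equation model and a stochastic agent-based model. Both rest on the same core assumption but decompose the system differently; all parameter values are invented for illustration.

```python
import random

# Verbal theory: "ideas spread through social contact."
# Two formal instantiations of that one theory follow.

# Model 1: aggregate (SI-style) difference equation.
# State: the fraction of the population that has adopted the idea.
def aggregate_model(adopted_frac, contact_rate, steps):
    history = [adopted_frac]
    for _ in range(steps):
        # New adoptions are proportional to contacts between
        # adopters and non-adopters.
        adopted_frac += contact_rate * adopted_frac * (1 - adopted_frac)
        history.append(adopted_frac)
    return history

# Model 2: stochastic agent-based version of the same theory.
# State: individual agents, each an adopter (True) or not (False).
def agent_model(n_agents, n_seeds, contact_prob, steps, seed=0):
    rng = random.Random(seed)
    adopted = [i < n_seeds for i in range(n_agents)]
    history = [sum(adopted) / n_agents]
    for _ in range(steps):
        for i in range(n_agents):
            if not adopted[i]:
                partner = rng.randrange(n_agents)
                # Contact with an adopter transmits the idea with some probability.
                if adopted[partner] and rng.random() < contact_prob:
                    adopted[i] = True
        history.append(sum(adopted) / n_agents)
    return history

if __name__ == "__main__":
    print(aggregate_model(0.05, contact_rate=0.5, steps=10))
    print(agent_model(200, n_seeds=10, contact_prob=0.5, steps=10))
```

Either model could be used to stress test the theory; where the two instantiations disagree, the disagreement itself tells you which assumptions are doing the work.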
Theoretical Virtues
What counts as a "good" theory? How do we compare two theories explaining the same data? Why is simplicity considered desirable? Theoretical virtues are the criteria by which we compare competing theories. In addition to simplicity, there are other common virtues such as elegance (symmetry), explanatory power (unifying phenomena under one framework), fruitfulness (good at generating testable predictions), and coherence (with itself and other theories). Scientists often invoke these when deciding between theories that fit data equally well.
The weight given to each theoretical virtue varies across fields and contexts. Empirical adequacy is typically non-negotiable. In practice, scientists do appeal to simplicity, elegance, and explanatory depth — even if they don’t always articulate these as “philosophical criteria.” Generally, theoretical scientists (e.g., theoretical physicists, cosmologists, or mathematicians) care more explicitly about theoretical virtues because their work often advances ahead of decisive empirical data. For example, a string theorist might emphasize mathematical beauty and unification, even though direct empirical tests might be lacking. Empiricists, on the other hand, tend to prioritize measurable success and predictive reliability. The line dividing the two is by no means sharp.
We will look at a paper by Michael Keas called "Systematizing the Theoretical Virtues". It provides a fairly comprehensive and structured account of the major theoretical virtues, and how they constitute a "logic of theory choice".
Evidential Virtues
- 1) Evidential accuracy: “A theory fits the empirical evidence well (regardless of causal claims).” Does the theory fit the data? This is the baseline virtue: the observable world looks the way the theory says it should. It’s neutral about causes; it’s just “getting the facts right.” Use it when comparing rivals that speak to the same dataset; watch for overfitting (a theory can “fit” because it has too much wiggle room). Evidential accuracy underwrites the other two evidential virtues: typically you assess causal adequacy and depth after you’ve seen solid fit.
- 2) Causal adequacy: “T’s causal factors plausibly produce the effects (evidence) in need of explanation.” Does the posited mechanism really have the oomph? Beyond fit, we ask whether the causes would in fact yield the observed effects (often many causes in interaction). Robustness analysis across heterogeneous models can support this by showing the same core causal structure yields the phenomenon across variations. Beware “dormant” causes that are merely named, not shown to operate at the required scale.
- 3) Explanatory depth: “Excels in causal history depth or in other depth measures such as the range of counterfactual questions that its law-like generalizations answer.” How far and how flexibly does the explanation reach? Depth comes in two flavors: (i) event-focused “how far back” causal history, and (ii) law-focused counterfactual range (how much would still hold under interventions or changed background conditions). It’s different from unification: depth concerns the same target system under varying conditions, not explaining more kinds of facts. Measure it by the breadth of stable “what-if” answers your laws support.
Coherential Virtues
- 4) Internal consistency: “T’s components are not contradictory.” No contradictions inside the theory. A minimal bar: if it derives P and ¬P, something must give. Subtle inconsistencies can hide in idealizations; don’t set the bar so high that all idealized modeling looks “inconsistent,” but don’t excuse genuine clashes as “just idealization,” either. Think formal coherence first, before aesthetic “niceness.”
- 5) Internal coherence: “Components are coordinated into an intuitively plausible whole… T lacks ad hoc hypotheses—components merely tacked on to solve isolated problems.” Parts hang together as an intuitively plausible whole (no ad hoc patches). Different from pure logic: a theory can be consistent yet obviously jury-rigged. Red flags: fixes that are untestable, explain nothing else, or sit awkwardly with the core principles. Use “negative” diagnosis (ad hocness) to pressure-test coherence.
- 6) Universal coherence: “T sits well with (or is not obviously contrary to) other warranted beliefs.” Fits with the rest of what we’re warranted to believe. This is external fit: harmony with well-established results and background commitments (including conservation principles, etc.). Clash here doesn’t instantly falsify, but it raises costs you must repay with exceptional evidential gains. Distinguish healthy tension (pushes progress) from outright conflict with robust knowledge.
Aesthetic Virtues
- 7) Beauty: “Evokes aesthetic pleasure in properly functioning and sufficiently informed persons.” The theory evokes aesthetic pleasure in appropriately situated observers. Beauty shows up as symmetry, aptness, “surprising inevitability,” etc. On Keas’s account, beauty may have extrinsic epistemic value (it can guide us toward other, more tightly connected virtues like simplicity and unification). Use with humility: beauty can inspire, but by itself it doesn’t guarantee truth.
- 8) Simplicity: “Explains the same facts as rivals, but with less theoretical content.” Same explananda, less theory. Think fewer entities (parsimony) and/or more concise principles (elegance). Practically, count independent parameters, primitive postulates, or distinct assumptions. Simplicity often correlates with better predictive performance in model selection (see the sketch after this list), but it also interacts with coherence (ad hoc add-ons usually bloat a theory).
- 9) Unification: “Explains more kinds of facts than rivals with the same amount of theoretical content.” Same resources, more kinds of facts explained. Unification and simplicity are complementary “styles of informativeness”: simplicity reduces content for the same domain; unification expands domain for the same content. Use it to prefer frameworks that tie disparate phenomena together (Maxwell’s electrodynamics-light, plate tectonics, etc.). Keep distinct the diachronic notion (“consilience” gained over time) from this aesthetic one present at introduction.
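As a small illustration of the parameter-counting idea, here is a sketch (toy simulated data, assuming Gaussian noise) that compares a simpler and a richer model of the same observations with AIC, one standard model-selection score that explicitly charges a price for extra theoretical content.

```python
import math
import random

# Toy data: a gently rising trend plus noise (numbers are invented).
rng = random.Random(1)
xs = list(range(20))
ys = [0.3 * x + rng.gauss(0, 1.0) for x in xs]

def gaussian_log_lik(residuals):
    """Maximized Gaussian log-likelihood given a model's residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, n_params):
    # AIC penalizes theoretical content: each free parameter costs 2 points.
    return 2 * n_params - 2 * log_lik

# Simpler model: y is constant (mean + noise variance = 2 parameters).
mean_y = sum(ys) / len(ys)
res_simple = [y - mean_y for y in ys]

# Richer model: y is linear in x (slope + intercept + variance = 3 parameters).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
res_linear = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("AIC, constant model:", round(aic(gaussian_log_lik(res_simple), 2), 1))
print("AIC, linear model:  ", round(aic(gaussian_log_lik(res_linear), 3), 1))
# Lower AIC wins: the richer model's extra fit has to outweigh its penalty.
```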
Diachronic Virtues
- 10) Durability: “Has survived testing by successful prediction or plausible accommodation of new data.” Survives testing over time (prediction or plausible accommodation). Durability is not mere popularity or longevity: it is survival under test. Prediction is often the gold standard; in historical sciences, repeated plausible accommodation of novel data also counts. A newborn theory can’t yet be “durable”; this virtue is inherently time-laden.
- 11) Fruitfulness: “Over time, generates additional discovery by means such as successful novel prediction, unification, and non ad hoc theoretical elaboration.” Generates further discovery (incl. novel prediction, non-ad hoc elaboration, added unification). If durability is conservation (passing tests), fruitfulness is innovation (creating new testable strands). Novel prediction here is genuinely new—wasn’t “built in” as a target during construction. Fruitfulness and durability interlock in mature research traditions (e.g., gravitational astronomy from Uranus’s anomaly to Neptune).
- 12) Applicability: “Used to guide successful action or to enhance technological control… higher when it enables outcomes otherwise not possible.” Guides successful action or control (science → technology, policy). Distinct from experimental control for testing; this is practical leverage (engineering, medicine, forecasting). It’s confirmatory and arrives only after earlier virtues are in place (you can’t apply what you haven’t yet credibly learned), so it is inherently diachronic.
Hypothesizing and Confirmation
The term "hypothesis" is frequently bastardized. Confirmation, and its counterpart disconfirmation, are also incredibly misunderstood by the general public. A hypothesis is pretty much just a testable guess, normally derived from a theoretical framework. It is a specific, testable statement about what you expect to happen. It's a prediction about reality that you intend to check with evidence. For example, "If plants are given more light, they will grow faster"; this is a hypothesis: it can be wrong but can definitely be tested. It’s different from an axiom (assumed true), a conjecture (unproven mathematical guess), or a proposition (any statement that is true or false but not necessarily testable), in that it is directly connected to the idea of testability and should have properties such as verifiability and falsifiability. This normally implies the phenomena referenced by the hypothesis are measurable, and therefore directly or indirectly observable through empirical data. In other words, hypotheses should be operationalizable, not merely verbal statements. The hypothesis must be expressed with the level of precision necessary to implement some test; if a hypothesis is not amenable to this, it's not testable in a way that can discern how likely it is.
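As a toy illustration of what operationalizing buys you, here is a minimal sketch (invented numbers, not real data) that turns "more light makes plants grow faster" into a concrete comparison: growth in centimeters for a high-light and a low-light group, summarized with a two-sample Welch t statistic. The point is only that the hypothesis is now precise enough to confront data.

```python
from statistics import mean, variance

# Hypothesis: "If plants are given more light, they will grow faster."
# Operationalization: growth in cm over two weeks, high-light vs. low-light group.
# The numbers below are made up purely for illustration.
high_light = [4.1, 3.8, 4.6, 4.0, 4.4, 3.9, 4.7, 4.2]
low_light  = [3.2, 3.5, 3.1, 3.6, 3.0, 3.4, 3.3, 3.7]

def welch_t(a, b):
    """Welch's two-sample t statistic (allows unequal variances)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

print(f"mean difference: {mean(high_light) - mean(low_light):.2f} cm")
print(f"Welch t statistic: {welch_t(high_light, low_light):.2f}")
# A large positive t is evidence that the high-light group grew faster;
# whether that counts as strong support still depends on the alternatives
# and caveats discussed next.
```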
I'd like to follow this with a few caveats. Scientific practice is often messy, and not defined by one thing such as falsification. Very often, strict falsifiability is not feasible. In practice, this might restrict the applicability of certain testing procedures, such as statistical testing. As Richard McElreath writes in Statistical Rethinking:
Science is not described by the falsification standard, and Popper recognized that. In fact, deductive falsification is impossible in nearly every scientific context.
(1) Hypotheses are not models. The relations among hypotheses and different kinds of models are complex. Many models correspond to the same hypothesis, and many hypotheses correspond to a single model. This makes strict falsification impossible.
(2) Measurement matters. Even when we think the data falsify a model, another observer will debate our methods and measures. They don’t trust the data. Sometimes they are right. (in addition to issues such as false positives and false negatives, observation error)
So in other words, the scientific method is not reducible to a statistical procedure. Statistical evidence is nevertheless an important feature of the process, and statistical methods can relate hypotheses to data, but they are not sufficient.
We will talk more about this later, but since modern science relies so heavily on procedures from statistics, it's impossible to fully separate the concept of a hypothesis from statistical inference. There are two concepts that frequently occur in the context of reasoning about hypotheses: confirmation and disconfirmation. Remember that a hypothesis makes a prediction about something; in other words, if the hypothesis were true, we would expect to observe something implied by that hypothesis. These observations are typically encapsulated by a probability distribution, and therefore are described by likelihoods. We have some hypothesis H, and we show that it entails some observation D. If we look for D and don't find it, we must conclude that H is false. However, finding D tells us nothing certain about H, because other hypotheses can also predict D. This is why we invoke the notion of likelihoods. If we observe D, we can't be certain that H explains D; but if we compare relative likelihoods, we can find that H makes D less surprising than the alternative hypotheses do. This type of reasoning is central to understanding how scientists reason under uncertainty.
I'll briefly introduce the idea of Bayesian confirmation. The core idea is that a hypothesis wins credit when evidence was more likely if the hypothesis were true than if it weren't. If seeing E is more expected under H than under “not-H,” then E confirms H. The stronger the shift, the stronger the confirmation. (Formally: strength ≈ how big the ratio is between P(E|H) and P(E|¬H).) Evidence confirms H when P(E | H) > P(E | not-H) and disconfirms when P(E | H) < P(E | not-H).
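Here is a minimal numerical sketch of that rule (all probabilities invented for illustration): the likelihood ratio P(E|H) / P(E|¬H) measures how strongly E confirms H, and Bayes' rule turns that ratio plus a prior into an updated credibility for H.

```python
# Bayesian confirmation sketch with made-up numbers.
# H is the hypothesis; E is the observed evidence.
p_h = 0.30              # prior credibility of H
p_e_given_h = 0.80      # how expected E is if H is true
p_e_given_not_h = 0.20  # how expected E is if H is false

# Likelihood ratio: how much more expected E is under H than under not-H.
likelihood_ratio = p_e_given_h / p_e_given_not_h

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = p_h / (1 - p_h)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.1f}")
print(f"P(H) before seeing E: {p_h:.2f} -> P(H | E): {posterior:.2f}")
# E confirms H because P(E|H) > P(E|not-H); if that inequality were
# reversed, the same arithmetic would lower P(H) instead.
```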
Evidence doesn’t “prove” a hypothesis; it shifts how credible it is. Many different stories can fit the same facts. What matters is which story makes those facts more expected than rival stories. The same observation can match multiple hypotheses; this is the idea of underdetermination. Also, analysis choices matter. What you count, how you measure, which model you use, and when you stop collecting data can all tilt the result without changing the raw facts. Instead of "proof", you should think of "support"; how much the evidence tips the scales relative to alternatives, not in isolation. Vague hypotheses also carry little weight; if almost anything you observe feels like confirmation, the statement doesn't rule anything out, and specific predictions are what force real tests. Here are a few questions to ask yourself when evaluating a hypothesis:
- What are the live alternatives? What else could explain this? (List at least one.)
- What did each hypothesis specifically predict? (Before seeing the data.)
- Would this result have surprised the rival more? (If yes, support is stronger.)
- What would disconfirm your hypothesis? (Name a clear outcome.)
- Did we tune our analysis after seeing results? (If yes, be cautious.)
- Does this hold in new data or by a different method? (Consilience.)
- Would those alternatives have expected this result as much as your hypothesis does? If your hypothesis makes the result less surprising than the alternatives, that’s good support. If lots of stories would’ve predicted it, it’s mild at best.
Evidence, Empirical Evidence, and Scientific Underdetermination
It is actually quite difficult to define evidence. What distinguishes a detective who uses evidence from the scientist who uses "empirical evidence", derived from empirical research, when advancing a claim? Clearly, what counts as evidence in these domains is not entirely overlapping. In addition, there is a plethora of synonymous terms that are very often used interchangeably with "evidence", but are conceptually distinct (data, facts, etc.), and these muddy the waters. There are also related concepts, such as the burden of proof and admissibility, that frequently arise when discussions involve evidence. In some contexts, these are formally established and institutionalized through rules and procedures, as is the case in law or debate. The word is also used as a modifier. Consider something like evidence-based policy or evidence-based medicine; to what extent does the word "evidence" impact how each discipline is carried out? What exactly is that modifier doing to the subsequent words? There is even a branch of epistemology called evidentialism, which is primarily concerned with the relationship between evidence, justification, and knowledge. Lastly, there are even attempts to construct frameworks that grade the quality of evidence, such as the Hierarchy of Evidence. Clearly, understanding how people use this term, in particular scientists, is of significant consequence. My main focus with this section is to characterize how scientists use and reason about evidence. But I also want to bridge the gap between these different senses of the term, so I'll introduce two authors who have had an impact on how I think about this concept.
In "Evidential Foundations of Probabilistic Reasoning", David Schum introduces the notion of a "Science of Evidence"; he recognizes the inherent plurality of the term and wants to abstract the notion across all disciplinary domains. Schum also recognizes the inherent uncertainty featured in all reasoning tasks based on evidence: “… in any inference task our evidence is always incomplete, rarely conclusive, and often imprecise or vague; it comes from sources having any gradation of credibility. As a result, conclusions reached from evidence […] can only be probabilistic in nature.” He also identifies and is concerned with, structural features of evidence; how it all connects together within a network of inference. Schum doesn’t pin “evidence” down with a single neat definition. He argues it’s best understood functionally; by what it does in reasoning. For him, evidence is any item of information (a trace, record, testimony, measurement, etc.) that bears on a hypothesis; its value depends on (i) relevance (how it connects to the hypothesis) and (ii) credibility (how much you can trust the item or its source). The overall inferential force (or probative weight) of an item is a joint product of those two strands. For Schum, evidence does not exist in a vacuum; it is relational, not free floating. An item isn’t “evidence” all by itself; it becomes evidence only relative to a specific hypothesis/probandum once you supply (and defend) an inference link from the item to that hypothesis. The link from an item to a hypothesis is licensed by background generalizations (“glue”) that are often implicit and need support; in other words, relevance must be argued. He distinguishes directly relevant evidence (bearing on the hypothesis) from ancillary (meta) evidence; material about the strength of that link (e.g., source credibility or whether the generalization really fits this case). “Evidence” is information put to work in support of a hypothesis, with its relevance and credibility argued. An item becomes evidence only when embedded in an argument that (1) states the hypothesis at issue, (2) shows relevance by supplying the generalization that links the item to that hypothesis, (3) supports credibility (often with ancillary evidence), and (4) assesses inferential force given how the item interacts with the rest of the mass of evidence.
Schum is explicit that a report about an event and the event’s actual occurrence are not the same; you must infer from the report to the world, and that inference is always uncertain. This is true of domains involving measurement as well; our measure of the thing is not the thing itself. Hence his insistence that all evidential reasoning is, in the end, probabilistic due to this uncertainty. Relevance is the logical/inferential link from an evidential claim to the hypothesis, and credibility concerns source reliability. With testimony, he decomposes credibility into veracity, objectivity, and observational sensitivity (were they truthful, unbiased, and in a position to observe?), but credibility standards are used beyond the legal realm (consider a scientist questioning the mechanism by which data was collected). Much of what we call “evidence” is actually evidence about other evidence; material that bears on a witness’s credibility or on the soundness of a measurement process. That ancillary layer is often what lets you evaluate the force of the directly relevant items. He often uses the likelihood ratio as a convenient gauge of an item’s inferential force and shows structurally-driven phenomena like inferential drag (links in a chain weaken force), redundancy, and synergy when items combine. But the broader point is structural: you can analyze evidential force even when precise frequencies are unavailable. The key point about structure is that basic configurations of evidence combinations have differing degrees of inferential force. Consider two witnesses claiming to report some event. If we find one of them is wrong, it might weaken our confidence in the other witness if we find out that they collaborated before reporting the account. In other words, there is structural collapse which leads to a non-linear reduction in inferential force (redundancy). The reverse can be true as well; one piece of information can amplify the inferential force of a collection of evidence (synergy). An item can also be clearly relevant (it bears on the hypothesis) and its source fully credible, yet still move the needle only a little. Fundamentally, the probabilistic assessment of evidence rests on these more primitive notions of relevance and credibility.
A bit more about the notion of inferential force; this is the diagnostic strength of an item given a stated hypothesis versus its rivals. As mentioned before, Schum gauges this with the likelihood ratio, though this is not strictly necessary; it scores force in much the same way that statisticians score posterior inference with Bayes factors. An item can have a likelihood ratio near one, and hence little inferential force, despite being relevant and credible. Similarly, an item can have very extreme values for likelihood ratios but carry little weight if the credibility and relevance are called into question. Here are just a few ways Schum describes how this can occur (a toy calculation after the list illustrates two of them):
- Low diagnosticity: A careful, honest witness saw the suspect “in a dark hoodie”—a description that fits many people. Credibility is high, relevance is clear, but P(E∣¬H) is also high, so the likelihood ratio is small.
- Chains = inferential drag: When E supports H only through several intermediate links (A→B→C→H), each link’s uncertainty compounds, typically reducing net force (“inferential drag”).
- Dependence & redundancy: Two “independent-looking” reports may trace back to the same primary source. The second then adds little; combining them yields less force than naïvely multiplying independent likelihood ratios. (Conversely, truly independent items can show synergy.)
- Ancillary constraints on the Likelihood Ratio: Ancillary (meta) evidence about the measurement/testimony changes the likelihoods (e.g., false-positive rates, viewing conditions), which can push force up or down without altering surface relevance.
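The chain and redundancy effects can be shown with a toy calculation (all numbers invented): routing evidence through an imperfect report dilutes its likelihood ratio, and a second report that is really a copy of the first adds nothing, whereas a genuinely independent one would multiply the force.

```python
# Toy illustration of inferential drag and redundancy (invented numbers).

# Direct evidence: event E bears on hypothesis H.
p_e_given_h, p_e_given_not_h = 0.80, 0.20
lr_direct = p_e_given_h / p_e_given_not_h  # force of E itself

# But we never observe E directly, only a report R that E occurred.
# The witness is imperfect: hit rate 0.9, false-report rate 0.1.
p_r_given_e, p_r_given_not_e = 0.90, 0.10

# Likelihood ratio of the report R for H, via the chain R -> E -> H.
p_r_given_h = p_r_given_e * p_e_given_h + p_r_given_not_e * (1 - p_e_given_h)
p_r_given_not_h = p_r_given_e * p_e_given_not_h + p_r_given_not_e * (1 - p_e_given_not_h)
lr_chained = p_r_given_h / p_r_given_not_h

print(f"force of E itself:           LR = {lr_direct:.2f}")
print(f"force of the report about E: LR = {lr_chained:.2f}  (inferential drag)")

# Redundancy: a second report that merely copies the first adds nothing new,
# while genuinely independent reports would multiply their likelihood ratios.
print(f"two independent reports:     LR = {lr_chained * lr_chained:.2f}")
print(f"two copies of one report:    LR = {lr_chained:.2f}  (redundancy)")
```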
The second author is Peter Achinstein, who distinguishes several senses of "evidence" (the first of these is restated symbolically after the list):
- Potential Evidence: a true statement e, together with true background b, is potential evidence for hypothesis h only if (i) e doesn’t entail h, and (ii) given e & b, it’s probable that there is an explanatory connection between e and h (Achinstein formalizes this with an “objective epistemic” probability >½).
- Veridical Evidence: Strong VE requires: (1) e is PE for h; and (2) h is true; and (3) there is an explanatory connection between e’s truth and h’s truth. (He also discusses a weaker VE that drops (3), but argues scientists should want the strong form to avoid “misleading” evidence.)
- ES-evidence (Epistemic Situation): e is true and anyone in a specified epistemic situation is justified in believing that e is (probably) VE for h.
- Subjective evidence: at time t, agent X believes e is (probably) VE for h, and X’s reason for believing h (is true/probable) is that e is true.
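Restating the potential-evidence conditions symbolically (my own notation, simply transcribing the necessary conditions listed above):

$$
e \text{ is potential evidence for } h \text{ (given background } b\text{)} \;\Longrightarrow\;
\begin{cases}
e \text{ and } b \text{ are true},\\
e \text{ does not entail } h,\\
p\big(\text{there is an explanatory connection between } e \text{ and } h \mid e \wedge b\big) > \tfrac{1}{2}.
\end{cases}
$$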
How does all of this map onto what working scientists actually do with evidence? Roughly:
- They start from questions/hypotheses/models. Even in exploratory work, there’s at least a background model (“these genes might co-express,” “this detector should see X events”). That gives evidence something to be for. This maps nicely to Schum's "evidence is not free floating" concept and also captures relevance.
- They produce data via instruments or observations. That’s the raw material; but nobody sensible treats raw data as already “evidence.” This collection step is often a source of critical questioning, mapping to Schum's notion of credibility.
- They process/clean/model it. This is where measurement error, instrument calibration, and statistical assumptions come in.
- They interpret it relative to rival explanations. “Does this pattern support model A over model B?” “Does this reject the null?” “Does this effect replicate?”
- They document uncertainty (standard errors, likelihoods, Bayes factors, upper bounds).
- They bring in meta/ancillary info (instrument logs, sample provenance, blinding procedures, preregistration, replication studies, peer review).
- They value replication and reproducibility as communal credibility checks: “Can someone else’s instrument get the same item?” That is ancillary evidence writ large.
Scientific Measurement
This topic actually spans entire academic journals, so it will be somewhat difficult to condense it into something concise. Measurement is important across pretty much every domain of science and engineering. Given the nuances specific to each domain, I'll try to capture the broad generalities that could represent how scientists "in general" think about measurement. Much of my thinking in this section comes from "Measurement Across the Sciences: Developing a Shared Concept System for Measurement".
As a disclaimer, this section will be biased towards measurement considerations in the social sciences, primarily because I am an applied econometrician by training. Like I mentioned previously, the concept of "measurement" is extremely broad. For example, there is a widely researched area of mathematics called "Measure Theory", which seeks to formalize and generalize common notions of measurement such as magnitude, mass, and probability. In contrast, there is the scientific study of measurement called "Metrology", which is less concerned with formalization, and much more concerned with establishing units of measurement, development of measurement methods/instruments, identification of measurement standards, evaluation of uncertainties, and the traceability/usability of these standards across a wider population. Mathematical theories of measure do not concern themselves with evidential grounds or success criteria associated with such methods. This culminates in products such as the International Vocabulary of Metrology standardized by ISO. If you look at broad definitions of measurement, they almost always make explicit reference to the fact that the thing being measured is physical. This raises the question: can non-physical "things" be measured? If something is non-physical, is it inherently incapable of being empirically investigated? Consider something like "subjective probability" in Bayesian statistics; to what extent can someone measure their "degrees of belief" about a proposition? Nevertheless, social scientists frequently make reference to "measurements" when conducting empirical research. Economists construct "Happiness Indexes", Psychologists measure "Personality", and Sociologists measure "Community Cohesion", often using sophisticated statistical methods that map observed data to unobserved "constructs". At first glance, I think it's obvious these are quite distinct from yardstick measures someone can use in a physical science lab. In many physical-science settings, there’s a well-defined quantity, and an instrument that’s been calibrated to that quantity. In a lot of social-science settings, there’s a theoretical construct, and we build a data-collection apparatus to approximate it. I'll explain this difference in detail later; for now let's dive into the fundamentals.
So what is measurement? It is the process of assigning numbers (or well-ordered labels) to aspects of the world according to a rule, so that the numbers reflect something about the thing. We have a set of real things (objects, events) and a set of numbers; measurement is a structure-preserving assignment from one set to the other. I think social science adds additional constraints, captured by the Stanford Encyclopedia of Philosophy (SEP) article on measurement in science. From section 7 of the SEP article, model-based accounts of measurement consist of two levels:
(i) a concrete process involving interactions between an object of interest, an instrument, and the environment; and (ii) a theoretical and/or statistical model of that process, where “model” denotes an abstract and local representation constructed from simplifying assumptions. The central goal of measurement according to this view is to assign values to one or more parameters of interest in the model in a manner that satisfies certain epistemic desiderata, in particular coherence and consistency.
So measurement involves interaction between the object (or aspect of the system), an instrument (or measuring tool), and an environment, which includes the subjects doing the measurement. Measurement represents these interactions with parameters, assigning values to the parameters (measurands), based on the results of the interactions. The SEP article continues, saying there are two main outputs identified by model-based accounts of measurement:
Instrument indications: these are properties of the measuring instrument in its final state after the measurement process is complete. Examples are digits on a display, marks on a multiple-choice questionnaire and bits stored in a device’s memory. Indications may be represented by numbers, but such numbers describe states of the instrument and should not be confused with measurement outcomes, which concern states of the object being measured.
Measurement Outcomes: these are knowledge claims about the values of one or more quantities attributed to the object being measured, and are typically accompanied by a specification of the measurement unit and scale and an estimate of measurement uncertainty. For example, a measurement outcome may be expressed by the sentence “the mass of object a is 20±1 grams with a probability of 68%”.
Inferring outcomes from instrument indications is non-trivial, often being theory laden and reliant on statistical assumptions about the object being measured, the instrument, the environment, and the calibration process. Let's concretize this. Consider an econometric study that seeks to understand the effect of some policy on health outcomes. The object being measured might be aggregate outcomes among a population (survival rates), the instrument might be self-report surveys or hospital reports, the environment might include sources of confounding variables (noise), and the calibration process might be how the surveyor constructed the survey (use of language, choice of questions, etc.), how those questions connect to the concept (health), and accepted measurement standards. Corrections in data might be necessary to account for systematic bias in data collection; for example, if we know there to be a non-response bias due to some expected reason, we might adjust the data according to statistical and theoretical assumptions. On this view, measurement is a set of procedures aimed at assigning values to model parameters based on instrument indications.
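To make that bias-correction step concrete, here is a toy sketch (entirely simulated, with hypothetical strata and response rates) of a simple reweighting adjustment: one group answers the health survey far less often, so the unweighted mean is biased, and weighting respondents back to known population shares recovers something close to the true value.

```python
import random

rng = random.Random(42)

# Simulated population: two strata with different true outcomes and
# different propensities to answer the survey (hypothetical numbers).
strata = {
    # name: (population share, true mean outcome, response rate)
    "young": (0.5, 60.0, 0.2),
    "old":   (0.5, 80.0, 0.8),
}

# Simulate the respondents we actually observe.
responses = []  # (stratum, observed value)
for name, (share, true_mean, resp_rate) in strata.items():
    for _ in range(int(10_000 * share)):
        if rng.random() < resp_rate:
            responses.append((name, rng.gauss(true_mean, 10.0)))

# Naive estimate: ignore who answered -> over-weights the "old" stratum.
naive = sum(v for _, v in responses) / len(responses)

# Weighted estimate: weight each respondent by population share / sample share.
counts = {name: sum(1 for s, _ in responses if s == name) for name in strata}
weighted = sum(
    v * (strata[s][0] / (counts[s] / len(responses))) for s, v in responses
) / len(responses)

true_value = sum(share * true_mean for share, true_mean, _ in strata.values())
print(f"true population mean: {true_value:.1f}")
print(f"naive survey mean:    {naive:.1f}   (biased by non-response)")
print(f"reweighted mean:      {weighted:.1f}")
```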
Like I mentioned earlier, my view of measurement comes from my economics graduate studies. The above is a partial view of the entire field of measurement; a highly partial view, I recognize. But is there core terminology independent of any particular domain? The text referenced above (Measurement Across the Sciences) seeks to establish such a lexicon. The authors note that the book is a departure from the VIM (mentioned above), which assumes that only physical quantities are measurable. This book seeks to expand that, so that non-physical properties of a system can also be measured (extending the scope of measurement to social science and management domains). According to the text, measurement can be thought of as:
a process based on empirical interaction with an object and aimed at producing information on a property of that object in the form of values of that property.
Measurement is an empirical process, designed on purpose, whose input is a property of an object, and that produces information in the forms of values of that property.
This makes it clear that we do not measure objects themselves, but properties of these objects (or systems). It also enables comparison of objects on these measured properties, assuming the system of measurement used to measure these values is the same. This is important, because it addresses "how" you went about measuring the property. If someone measures, say, intelligence using procedure X and someone else measures it with procedure Y, we might not be able to compare the two measures. Generally speaking, you must compare properties of the same kind in order for the comparison to be meaningful. "Property" in this context designates both properties of objects and their kinds of properties. Below the authors provide notation and more detail for how to refer to properties:
The last component of the definition of measurement is that it produces information on the measurand in the form of values of properties, and thus, in the specific case of quantities, in the form of values of quantities. Remember earlier, I mentioned these authors wanted to expand the notion of measurement to non-quantitative properties, something metrologists typically do not do. I'm not sure how controversial this is more broadly, but doing this enables qualitative research methods. Extending figure 2.5 to include this aspect:
Where Q[a] refers to "Generic property of object a", "q_ref" refers to the unit, and "x" is the numerical value of the quantity. We choose "Q" instead of "P" in the case where we are deliberately interested in quantitative properties of the object. This leads to the most generic understanding of measurement: it is designed empirical property evaluation (of an object, system, or process). Measurement is a process that connects entities of the empirical world and entities of the information world. The authors describe this connection using the following terminology (a toy example after the list walks through all three steps):
- Transduction: A measuring instrument interacts with an object and, being sensitive to a specific property, changes its own state to produce an indication of that property. In other words, it converts (transduces) the measurand into something observable—like a bathroom scale’s spring turning weight (force) into spring elongation, or a paper test turning a person’s reading ability into a pattern of marked answers. This step is purely empirical.
- Instrument scale application: The instrument is built so its observable indications can be systematically linked to information units (often numbers) through a scale. This means mapping the physical sign to a value—like turning spring elongation into a length reading in centimeters, or turning a pattern of checked boxes into scored responses. This step mixes empirical observation with informational mapping.
- Calibration function computation: Because the indication (e.g. length, scored items) is usually not the same kind of property as the measurand (e.g. force, reading ability), the indication value must be transformed via a calibration function that models how the instrument’s indication relates to the actual quantity of interest. Thus, length is converted to force, and scored responses to a reading-comprehension level. This step is entirely informational.
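Here is a toy bathroom-scale example of those three steps (all numbers invented): a mass stretches a spring (transduction), the elongation is read off a graduated scale (instrument scale application), and a calibration function converts that indication back into the quantity of interest, with a rough uncertainty estimated from repeated readings.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)

# --- Transduction (empirical): the object deforms a spring.
# Hypothetical instrument: 0.05 cm of elongation per kg, plus a little noise.
def transduce(true_mass_kg):
    return 0.05 * true_mass_kg + rng.gauss(0, 0.02)  # elongation in cm

# --- Instrument scale application (empirical + informational): read the
# elongation off a graduated scale, to the nearest 0.01 cm.
def read_scale(elongation_cm):
    return round(elongation_cm, 2)

# --- Calibration function (informational): convert the indication (cm)
# back into the measurand (kg), using the instrument's calibration.
def calibrate(indication_cm):
    return indication_cm / 0.05

# Repeated measurements of the same object give both a measurement outcome
# and a rough measurement uncertainty.
true_mass = 70.0
readings = [calibrate(read_scale(transduce(true_mass))) for _ in range(10)]
print(f"measurement outcome: {mean(readings):.1f} kg "
      f"+/- {stdev(readings):.1f} kg (from 10 repeated indications)")
```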
Remember, this is a designed procedure. The intent of the process is to produce an information entity; the measured value is expected to convey information about an empirical entity (and to be analyzed mathematically). As the people carrying out this procedure, we seek to minimize the distance between the actual property of the object and the measured value; we want to minimize the error/uncertainty. There can be multiple sources of drift between the actual value of the property and the measured value, which collectively contribute to measurement uncertainty. In the book, the authors mention that these sources partly derive from your measurement strategy (daily-life, operational, statistical, analytical). The two main sources of uncertainty are:
- Definitional uncertainty: we didn’t fully or sharply say what the measurand is. This refers to how fuzzy the measurand’s definition is.
- Measurement uncertainty: even if we did define it sharply, the instrument/procedure isn't perfectly repeatable, may not be perfectly sensitive to the measurand, has calibration issues, etc.
Combining the empirical and informational aspects, the refined definition becomes: "measurement is an empirical and informational process, designed on purpose, whose input is an empirical property of an object and that produces information in the form of values of that property."
A related distinction is between the intended property (what we mean to measure) and the effective property (what the procedure actually engages). In the human sciences, one can see an example of this distinction in the measurement of reading comprehension ability. Here, the assessments always specify that the tests are to be given under conditions free from distraction while the student is reading the passages and responding to the comprehension questions, so that a noisy environment, for example, would not be advisable. This is strongly associated with the intended property—a student’s comprehension of text under good conditions. However, it may be the case that, in a given situation, a student is asked to respond in a noisy and distracting environment—this would be a case where the effective property differs from the intended property, and, presumably, any measurements made in this distracting situation would tend to show lower reading comprehension ability.
In the social sciences, the path from the world to a number can be thought of as a pipeline (a toy sketch of the middle of this pipeline follows the list below):
Phenomenon → Concept → Construct → Operationalization → Instrument/Procedure → Data → Metric/Indicator → Interpretation
- Phenomenon: Something in the world you care about (well-being, intelligence, economic activity, discrimination).
- Concept: Your verbal idea of it — usually a bit fuzzy. “Intelligence is the ability to adapt and solve problems.” “Economic growth is how much more stuff a society produces.”
- Construct: The theory-shaped version of the concept — clearer, bounded, hooked into other ideas. A construct says, “this thing has these dimensions, relates to these causes/effects.” Psychologists love this word. It’s the “scientific” packaging of the concept. Economists use this word less often, but they're doing the same thing.
- Operationalization: “Given that construct, what observable things will stand in for it?” This is the missing link people skip. It’s the mapping rule. “We will treat X, Y, and Z behaviors/scores/answers as evidence of the construct.” This is highly theory laden, meaning it depends on the theoretical formulation.
- Instrument / Procedure: The actual tool or protocol: a survey scale, a test, a national accounts system, a coding scheme for interviews. This is the point where we ask "is the procedure measuring the attribute of the system/object we care about."
- Data: The raw responses, counts, test scores, monetary totals. These are supposed to be the raw measures obtained through the transduction phase, extracted from the measurand.
- Metric / Indicator: The processed number (or small set of numbers) we show to the world: IQ = 115, GDP = $24 trillion, Depression score = 18/27.
- Interpretation: “Therefore, this person is above average,” or “this economy is growing,” or “this group is more prejudiced.” When a theory is mathematically precise, there should be less room for competing interpretations. Every step prior to this influences what can be said about the metric.
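Here is a deliberately crude sketch of the middle of that pipeline for a hypothetical "well-being" construct (invented items and scoring, not a validated scale): an operationalization picks three survey items as stand-ins, the instrument collects 1-5 responses, and a metric is computed by reverse-scoring one item and averaging. Every choice involved (which items, the scoring rule, the equal weights) is theory laden in exactly the sense described above.

```python
# Construct: "subjective well-being" (hypothetical, simplified).
# Operationalization: three survey items are treated as indicators.
ITEMS = {
    "satisfied_with_life": False,   # higher response = more well-being
    "energy_most_days":    False,
    "often_feels_down":    True,    # reverse-scored: higher = less well-being
}

def wellbeing_score(responses):
    """Metric: average of the three items on a 1-5 scale, after reverse-scoring."""
    total = 0.0
    for item, reverse in ITEMS.items():
        r = responses[item]
        if not 1 <= r <= 5:
            raise ValueError(f"{item}: responses must be on a 1-5 scale")
        total += (6 - r) if reverse else r
    return total / len(ITEMS)

# Data: one respondent's raw answers (instrument output).
raw = {"satisfied_with_life": 4, "energy_most_days": 3, "often_feels_down": 2}

# Metric and interpretation: the number is about the construct as
# operationalized here, not about "well-being in itself".
print(f"well-being score: {wellbeing_score(raw):.2f} / 5")
```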
Each link in this chain can go wrong:
- Concept/construct problems: Concepts can be normative (“a good life,” “social capital”) but treated as descriptive. Different scholars mean different things by the same word. Constructs often bake in a theory: e.g. “intelligence is unitary” vs “intelligence is multiple.” The measure will follow that theory.
- Operationalization problems: Operationalization is a choice. You’re saying: “We can’t see the thing, so we will look at these things instead.” Choices can be narrow (only income for SES) or broad (income, education, occupation). Choices can be convenient rather than conceptually tight (we measure “learning” with multiple-choice tests because they’re easy to score). And crucially: different operationalizations can all be defensible — but they will give different numbers.
- Instrument/procedure problems: This is quite a problem in the social sciences. Very often in economics, this is conflated with modeling itself. You might hear something like "what was your identification strategy?", meaning how did you isolate causal relationships, not how you generated the data. This, however, is a problem of data collection and data reliability, having much to do with the sampling scheme (did we generate a truly representative subset of the population?).
- Data → metric problems: We often transform raw data (standardize, scale, weight, index); these transformations create meaning. Indexes like GDP combine heterogeneous stuff using formulas that look technical but are ultimately convention + theory.
- Interpretation problems: People forget the metric is about a construct, not the "thing in itself". They ignore error and uncertainty, they over-generalize from group-level properties to individuals, and the reverse.
- Content Validity: This asks "Did we include the right content for this construct?" Does the measure represent all the facets of the measurand it intends to cover? This is primarily about coverage; the instrument should span the entire conceptual category. This is hard to achieve in the social sciences. Social constructs are often broad and contested (e.g. "well-being," "social capital," "leadership"). If experts don't even agree on the domain, content validity can't be settled once and for all. You end up with "for this theory, this was good coverage," which is weaker than "this is the coverage."
- Criterion Validity: This asks “Does our measure relate in the right way to some external, meaningful criterion?” It refers to "the extent to which an operationalization of a construct, such as a test, relates to, or predicts, a theoretically related behavior or outcome — the criterion". For example, a job aptitude test should predict actual job performance. Often there is no gold-standard criterion. What’s the “true” criterion for intelligence, or for political trust, or for creativity? We use proxies (grades, supervisor ratings, future income), but those proxies are themselves social measurements with their own validity problems. So you get “a measure validated against another imperfect measure.”
- Construct Validity: This asks "Does this measure behave like the theory says the underlying construct should behave?" This actually refers to a broader umbrella of related validity questions. Does it correlate with things it should correlate with? (Convergent) Does it not correlate with things it shouldn't? (Discriminant) Does it fit into the nomological network — the web of other variables the theory posits? Generally speaking, construct validity refers to how well a set of indicators reflects or represents a concept that is not directly observable (latent). Are the numbers produced by your measurements actually mapping onto something in the real world? For example, IQ is not directly measurable.
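As a toy illustration of the convergent/discriminant logic, here is a short Python sketch with simulated data. The latent trait, the "new scale", the "established scale", and the unrelated variable are all invented; in real work the correlations would come from actual instruments administered to actual respondents.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulate a latent trait and three observed measures (all hypothetical):
latent_anxiety    = rng.normal(size=n)
new_scale         = latent_anxiety + rng.normal(scale=0.5, size=n)   # our new instrument
established_scale = latent_anxiety + rng.normal(scale=0.7, size=n)   # an existing instrument
unrelated_trait   = rng.normal(size=n)                               # conceptually unrelated

corr = np.corrcoef([new_scale, established_scale, unrelated_trait])
print(f"Convergent correlation (new vs established): {corr[0, 1]:.2f}")  # should be high
print(f"Discriminant correlation (new vs unrelated): {corr[0, 2]:.2f}")  # should be near zero
# High convergent plus low discriminant correlations are evidence (not proof)
# that the new scale behaves the way the construct's nomological network predicts.
```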
Construct (very stable) → Operational definition (community standard) → Instrument (calibrated) → Measurement (with known error)
Construct (contested) → Operationalization (chosen among several) → Instrument (partly human, context-sensitive) → Measurement (with unknown or changing error, plus assumptions) → Interpretation (theory-relative)
To wrap up this ridiculously long section (which could have been much longer because this is such a rich topic), I'd like to describe how I used to teach concepts of economic measurement. I used to work for a massive data provider in the finance industry. My job essentially was to "be the subject domain expert" for economic data and to "be the data engineer", which meant I had to understand how economic indices were constructed, generated, reported, and used across statistical agencies globally. I also used to teach fundamental economic concepts to non-economists who specialized in other data domains (like fixed income or commodities). These other domains are quite different from economics; they (like physics) are much more amenable to direct measurement. For example, a fixed income "measure" might just be a straightforward report from a bank about what interest rate they are charging on some financial instrument. No mystery there. Likewise, a commodities dataset reported from CBOE might simply be bids and asks for a particular trading day. As alluded to earlier, economic data is highly aggregated and entangled with sampling schemes and theory. You would think that, given the background of many of these people, they would have some familiarity with the construction of an economic index. Surprisingly, many would assume economic measurements are as straightforward as the data they specialize in. This is perhaps one thing many people misunderstand: the distinction between economists/statisticians and someone who majored in Business Administration. It becomes quite evident when you get into the nuances of data. At the start of these lectures, I would begin by saying something like: "We can measure the flow of water simply by putting a well calibrated and sensitive instrument into that river; this gives us a direct measure. Economic measurement is different from this. In many cases, we are often 'probing' for data. We construct surveys to extract information from subjects who can game the metric (Goodhart's Law) and who are aware they're being measured (think of the Hawthorne Effect; the respondent is not a passive transducer). In many cases, there isn't an observable 'thing' we are measuring. 'Instrument calibration' is completely different (and very possibly non-existent) in the social sciences."
In economics, the signal source is elicited, not naturally emitted. People answer questionnaires because you asked, firms report because you surveyed, households disclose because it's the census. That makes measurement reactive and context-dependent: wording, order, incentives, trust in the agency — all affect the signal. Even "administrative data" (tax records, unemployment claims) is behavior under rules — if the rules change, behavior and therefore the "measurement" changes. In economics, measures are highly theory dependent. "Unemployment," "inflation," "household," even "GDP" are statistical constructs defined by agencies: change the definition and you change the number, even if the world didn't change. The object is theory- and convention-dependent: you need a theory of labor-force attachment to define unemployment, and a theory of consumption to define a price index. Measurement error is radically different in the social sciences. Errors can come from comprehension, nonresponse, strategic answering, interviewer effects, mode effects, seasonal economic behavior, and policy changes; some of these errors are not i.i.d. and not stationary, and they change when the social context changes (which makes doing historical analyses incredibly difficult). Unlike in the physical sciences, repeating the measurement doesn't always reduce error (people may learn the test, or get bored). Now, I'm not arguing here that economists don't have methods to account for these issues; that would be foolish. I'm simply saying that the nature of the measurement process in economics is fundamentally different from that of a physical science, and therefore so is how you interpret the data. I think the biggest issue is that we are measuring a unit of analysis that is fundamentally reflexive, not passive; people know things and can infer from context, which can nudge their behavior ever so slightly, biasing the measurement. In physical science, we often tap into an existing, stable signal with a calibrated device. In social science and economics, we often have to coax a signal out of people and institutions using instruments made of questions, definitions, and incentives. That means the quality of the number depends much more on theory, on design, and on people's cooperation — not just on the sensitivity of the device.
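To illustrate the "change the definition, change the number" point, here is a small sketch on a made-up micro-sample: the same people produce different unemployment rates depending on whether "discouraged workers" are counted as unemployed. The status labels and counts are invented, and the two definitions are only loosely analogous to the headline versus broader measures that real statistical agencies publish.

```python
# Hypothetical micro-sample: each person's labor-market status (counts are made up).
population = (
    ["employed"] * 620
    + ["unemployed_searching"] * 40     # jobless and actively looking
    + ["discouraged"] * 25              # want work but stopped searching
    + ["not_in_labor_force"] * 315      # students, retirees, caregivers, etc.
)

def unemployment_rate(people, count_discouraged: bool) -> float:
    """Unemployment rate under two different statistical definitions of 'unemployed'."""
    unemployed = sum(s == "unemployed_searching" for s in people)
    if count_discouraged:
        unemployed += sum(s == "discouraged" for s in people)
    labor_force = sum(s == "employed" for s in people) + unemployed
    return 100 * unemployed / labor_force

# Same people, same behavior, two different constructs of "unemployment":
print(f"Headline definition:           {unemployment_rate(population, False):.1f}%")
print(f"Including discouraged workers: {unemployment_rate(population, True):.1f}%")
```

Nothing about the world changed between the two print statements; only the construct did.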
- Reproducibility in Science: A Metrology Perspective
- Measurement in metrology, psychology and social sciences: data generation traceability and numerical traceability as basic methodological principles applicable across sciences
Data, Statistics, and Uncertainty
Simply put, this is also a cornerstone of modern science. We will look at how scientists model the Data Generating Process, how data is collected, and how data is what binds science to reality. Let's first look at The Data-Generating Process and Scientific Inference.
Scientific Research and Big Data
As a corollary to the prior section, data-driven methods are also becoming quite prolific in many domains. The general public is grossly incompetent when it comes to understanding the nuances of collection, storage, governance, processing, transmission, provenance, and utility of data for inquiry. And yet, this has been a massive pillar in many of the advances of the past few decades. People generally do not have a clue why big data is so valuable, what can be done with it, and to whom. They are unaware that their digital footprint can be used to yield a fairly accurate picture of their beliefs and preferences, which can then be used for predictive analytics. They are also unaware of the value it provides to scientific researchers.
Scientific Representation, Models in Science, and Mathematical Modeling
How do scientists represent the target system they are studying? There are quite a range of scientific models in application across all domains of science.
Computer Simulation
The advent and proliferation of computing, programming languages, and software has undoubtedly had a significant impact on the way science is carried out. Simulation modeling is now quite indispensable within the toolkit of the modern scientist. I would go so far as to say that you simply cannot do modern science without the aid of a computer in one form or another. This is true for the physical and biological sciences as well as the social sciences; even non-traditional scientific disciplines like quantitative finance. In fact, most of my initial experience in this during grad school came through studying stochastic processes in financial engineering courses, in addition to Monte Carlo methods in Bayesian statistics and state space modeling in economics (as well as DSGE models). Since then, I've been interested in simulating social complexity via agent-based models. Most modeling cannot be done outside the context of computer simulation, which requires knowledge of algorithms, data structures, and computational complexity in order to implement your model. This is obviously a prolific aspect of science. So in this section, I want to describe the function of a simulation, how it augments the scientific toolkit, and various simulation methods, ones that I am more familiar with given my education and work experience.
When we simulate, we are simulating some process or system. This shows its generality: we can represent just about anything as a system or process, which means we can describe the properties, components, relationships, behavior, dynamics, and architecture of just about any system computationally, allowing us to reason about the real system under discussion in a controlled setting. A simulation is an imitation of the dynamics of a real-world process or system over time. This computational representation is studied, like non-computational models, for a variety of tasks including "what if" analysis, scenario analysis, intervention analysis, stress testing, modification, or pretty much anything else. The alternative approach to simulation is direct experimentation, which is infeasible in many situations. Simulations are often cheaper, faster, more easily replicated, safer, and more ethical. In many cases it's also just practically impossible to model a system mathematically with closed-form solutions; systems are often intractable and too complicated to solve. Approximations via simulation tend to be much more suitable for rapid experimentation. Like any model, a simulation is not assumption free; these assumptions are encapsulated in our formulation of the model. Simulation models allow us to modify our assumptions and test the implications.
These models are essential for engineering any system of significance. Consider the car you drive: how did the engineers determine its reliability? They used simulation methods to guide the design process. How do aircraft achieve such high reliability? Engineers use simulations to understand how the plane will operate under a variety of scenarios, and this influences their design decisions. How does an airline ensure timely arrival of planes and coordinate thousands of daily trips? They use simulation methods, among other methods like optimization. How did researchers identify a vaccine so quickly during the COVID pandemic? This is multifaceted, and involves simulation at every step. Supercomputers like those at Lawrence Livermore National Laboratory were used for rapid drug discovery. Identifying an effective drug involves discovering a molecular structure. You can imagine the combinatorial explosiveness of the search space; doing this purely by gathering information from experiments is simply not feasible when discoveries are needed quickly. Supercomputing allows you to simulate the effectiveness of a proposed structure, narrowing down the search space for researchers and allowing them to identify an effective structure more quickly by searching denser regions of the probability space. In addition, simulations were used for epidemic forecasting. Country-level microsimulations quantified how distancing, lockdowns, and closures could keep hospitals from being overwhelmed. Suppose you have normal capacity at a hospital, with limited ability to scale; massive stress on that system might overwhelm it, leading to excess deaths. Policymakers therefore want to know about these counterfactual situations and adjust their policies accordingly. Closures were also determined based on simulations. Airflow models revealed how respiratory particles move indoors, guiding ventilation, filtration, and layout choices, and indicating which facilities are likely locations for a massive outbreak, which subsequently impacts hospital stress. In each of these cases, simulations gave us usable answers while experiments and trials were still spinning up. These problems, like many complex problems, often involve systems of systems. Modeling and simulation allow researchers to understand how various systems interact; we can effectively integrate multiple models of systems to understand how they all interact. This is something that is very difficult without the use of computational resources. Supercomputers enabled rapid computational experimentation, which led to effective decision support. Put simply, computer simulation has a direct impact on the policy that affects your life.
Simulations can essentially be classified along three dimensions, each with two options. Think of it as a grid, where each cell represents a combination of these elements. There are stochastic vs deterministic simulations, static vs dynamic simulations, and discrete vs continuous simulations (in time and/or state). So you can have a discrete-time dynamic stochastic simulation, a stochastic continuous-time simulation, a deterministic dynamic discrete-event simulation, etc. Each of these dimensions represents a different aspect of the system under discussion. Stochastic systems have random components, dynamic systems are time dependent, and continuous systems are those whose state variables can take on a continuum of values. On the contrary, deterministic systems do not contain randomness, static representations do not depend on time, and discrete representations refer to systems whose states take on a finite or countable set of values. Each combination implies a different set of methods. It is entirely up to the researcher to decide how to model the system, but the decision is not arbitrary. Sometimes it is just easier to represent a system statically; this is often the case in economics. Introducing more moving parts makes the system harder to understand, so researchers must find a sweet spot between model complexity, granularity, and how well the model answers the question at hand. For example, in economics we have DSGE models that rely on the "representative agent". This is a sort of idealization about how people make decisions in an economy, imposed upon the entire collection of agents; the "representative agent" represents how everyone who is "rational" would make economic decisions. It assumes away any underlying network structure and heterogeneity. It idealizes the economic decision independent of other factors. This form allows us to have nice compact modeling formulations that are solvable or easy to reason about. But obviously, it does not have to be done this way. Agent-based models, by contrast, allow the modeler to encode heterogeneity. We can then run simulations "from the ground up" and use the results to reason about a real-world economy. This comes with its own set of costs and sacrifices: these models are harder to validate and make sense of. Decisions about how to represent a system therefore depend on these considerations.
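To make the taxonomy concrete, here is a minimal Python sketch of a dynamic, discrete-time model in its deterministic and stochastic variants. The growth rule, the parameter values, and the noise level are arbitrary choices made purely for illustration, not a model of any particular system.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(T=50, x0=100.0, growth=0.02, noise_sd=0.0):
    """Discrete-time dynamic model: x[t] = x[t-1] * (1 + growth) + shock.
    noise_sd = 0 gives the deterministic variant; noise_sd > 0 the stochastic one."""
    x = np.empty(T)
    x[0] = x0
    for t in range(1, T):
        shock = rng.normal(scale=noise_sd) if noise_sd > 0 else 0.0
        x[t] = x[t - 1] * (1 + growth) + shock
    return x

deterministic_path = simulate(noise_sd=0.0)   # identical output on every run
stochastic_path    = simulate(noise_sd=5.0)   # a different sample path for each run/seed

print("Deterministic final value:", round(deterministic_path[-1], 2))
print("Stochastic final value:   ", round(stochastic_path[-1], 2))
# A static model would drop the time index entirely; a continuous-time version
# would replace this difference equation with a differential equation.
```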
What are the elements of a simulation model? Well, it depends on the type of model and the domain you're studying. This taxonomy will be biased towards discrete-event simulations, but I think pretty much every simulation will implicitly refer to these elements. There are two kinds of objects in a simulation:
- Entities: individual elements of the system that are being simulated and whose behavior is being explicitly tracked. Each entity can be individually identified;
- Resources: also individual elements of the system but they are not modelled individually. They are treated as countable items whose behavior is not tracked.
These decisions are made by the modeler, and depend on the system under discussion. How do we organize the entities and resources?
- Attributes: properties of objects (that is, entities and resources). These are often used to control the behavior of the object. In a more comprehensive simulation, attributes might be the features that distinguish one entity from another.
- State: the collection of variables necessary to describe the system at any point in time. These fully characterize the system. For example, in a queuing system, the state might be the number of entities waiting and whether the server is busy.
- Queue: a collection of entities or resources ordered in some logical fashion. This refers to how entities are held and processed within the system.
- Event: an instant of time at which the state of the system changes. An event describes the possible ways the state can change and locates the time at which that change took place.
- Activity: a time period of specified length which is known when it begins (although its length may be random). This may be specified in terms of a random distribution.
- Delay: a duration of time of unspecified length, which is not known until it ends. This is not specified by the modeler ahead of time but is determined by the conditions of the system. Very often this is one of the desired outputs of a simulation (e.g. a customer's waiting time in a queue).
- Clock: variable representing simulated time.
- Processes: sequences of events with start and end rules, including decision logic, policies, and control rules.
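To make these elements concrete, below is a minimal, event-driven single-server queue written in plain Python (no simulation library), with arrival and service rates chosen purely for illustration. Customers are the entities, the single server is the resource, arrivals and service completions are the events, the future event list and clock drive the dynamics, and waiting time is the delay we collect as output.

```python
import heapq
import random
from collections import deque

random.seed(1)

ARRIVAL_RATE = 1.0    # customers per minute (assumed for illustration)
SERVICE_RATE = 1.25   # services per minute (assumed for illustration)
N_CUSTOMERS  = 10_000

events = []           # future event list: (event_time, sequence_no, event_type, customer_id)
seq = 0

def schedule(time, event_type, customer):
    """Push an event onto the future event list."""
    global seq
    heapq.heappush(events, (time, seq, event_type, customer))
    seq += 1

# Pre-schedule all arrivals (the entities entering the system).
t = 0.0
arrival_time = {}
for cid in range(N_CUSTOMERS):
    t += random.expovariate(ARRIVAL_RATE)         # activity: random interarrival time
    arrival_time[cid] = t
    schedule(t, "arrival", cid)

clock = 0.0            # simulation clock
server_busy = False    # state of the single resource
queue = deque()        # FIFO queue of waiting entities
waits = []             # delays: unknown until they end

while events:
    clock, _, event_type, cid = heapq.heappop(events)     # advance the clock to the next event
    if event_type == "arrival":
        if server_busy:
            queue.append(cid)                              # entity joins the queue
        else:
            server_busy = True
            waits.append(0.0)                              # served immediately: zero delay
            schedule(clock + random.expovariate(SERVICE_RATE), "departure", cid)
    else:  # "departure": service completed; start the next waiting entity or idle the resource
        if queue:
            nxt = queue.popleft()
            waits.append(clock - arrival_time[nxt])        # the delay ends now
            schedule(clock + random.expovariate(SERVICE_RATE), "departure", nxt)
        else:
            server_busy = False

print(f"Average wait in queue: {sum(waits) / len(waits):.2f} minutes")
```

As a rough sanity check, for these rates basic queueing theory (the M/M/1 model) predicts an average wait of about λ/(μ(μ − λ)) = 3.2 minutes, which the simulated average should approximate; this kind of comparison foreshadows the verification and validation steps below. With that vocabulary in place, a typical simulation study proceeds roughly through the following steps: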
1) Frame the decision and the system
2) Build the conceptual model:
- Entities and states: What things move or change (patients, packets, orders, molecules)? What states can they occupy (waiting, in service, recovered, failed)?
- Processes and rules: How do states change—by scheduled events (arrivals, service completions), by interactions (agent meetings), or by continuous flows (stock-and-flow)?
- Time treatment: Decide if you advance time by events (jump to next event; classic discrete-event), by fixed steps (∆t; good for differential equations or when events are dense), or hybrid (event-driven with sub-stepping for continuous parts).
- Resources and constraints: Servers, machines, beds, CPU cores, budgets. Specify capacities, calendars, and priorities.
- Randomness: Where uncertainty lives (interarrival times, service durations, agent behaviors, failure times) and how you’ll model it (distributions, correlations).
- Policies and controls: Schedules, routing rules, admission limits, pricing, triage—these become the levers for scenarios.
3) Input modeling: turn messy data into usable distributions
4) Choose a paradigm
- Discrete-event simulation (DES): Best for queuing, logistics, manufacturing, networks. You maintain an event calendar, a future event list, and process handlers that update state and schedule downstream events. You observe sharp changes at discrete times (arrivals, completions).
- Agent-based simulation (ABS): Best when micro-level behavior and interaction drive macro outcomes (epidemics, social systems, markets). Each agent carries rules; the system emerges from interactions. Often run with small time steps or event hooks.
- System dynamics (SD): Best for feedback-heavy, aggregate systems (stocks, flows, delays). You write coupled differential or difference equations and integrate in time.
- Monte Carlo (MC): Best for pure uncertainty propagation: sample inputs, evaluate a deterministic model, aggregate outputs. Often baked into other paradigms.
5) Implement a Minimal Version
6) Verification: prove you built the model you meant to build
7) Validation: prove the model is a good stand-in for reality
8) Experiment design: plan runs that answer the question
9) Randomness, variance, and confidence (see the sketch after this list)
10) Sensitivity, uncertainty, and robustness
11) Prepare results for presentation
12) Reproducibility and governance
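Referring back to step 9, here is a minimal sketch of what "replications plus a confidence interval" looks like in practice. The model is the same single-server queue as before, condensed via the Lindley recursion instead of an explicit event list; the parameters, the number of customers, and the number of replications are arbitrary choices for illustration. The point is only that a stochastic simulation should report uncertainty, not a single number.

```python
import random
import statistics

def one_replication(seed, arrival_rate=1.0, service_rate=1.25, n_customers=2_000):
    """One independent replication of the single-server queue, returning the mean wait."""
    rng = random.Random(seed)
    waits = [0.0]                                      # the first customer never waits
    for _ in range(n_customers - 1):
        service      = rng.expovariate(service_rate)   # previous customer's service time
        interarrival = rng.expovariate(arrival_rate)   # gap until the next arrival
        # Lindley recursion: W[n+1] = max(0, W[n] + S[n] - A[n+1])
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return sum(waits) / len(waits)

# Steps 8-9 in miniature: independent replications, then an interval estimate, not a point.
results = [one_replication(seed) for seed in range(30)]
mean = statistics.mean(results)
half_width = 1.96 * statistics.stdev(results) / len(results) ** 0.5   # approx. 95% CI (normal approximation)

print(f"Mean wait: {mean:.2f} minutes, 95% CI roughly +/- {half_width:.2f} over {len(results)} replications")
```

The same pattern (independent seeds, a summary statistic per run, an interval estimate across runs) applies regardless of paradigm: discrete-event, agent-based, or plain Monte Carlo.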
Mechanisms in Science
The act of identifying mechanistic cause and effect relations.
Scientific Explanation
What does it mean when someone says "Science has explained something"?
Scientific Reduction
What is the role of reduction in explanation? When larger systems are explained in terms of something more fundamental, what exactly are we accomplishing?
Scientific Objectivity
Whether or not the practice of science can be truly objective is not the purpose of this section. Rather, I'd like to discuss various methods it uses to maintain alignment with the standard, and how built in mechanisms self correct when deviations from the ideal occur.
Scientific Discovery
What constitutes a scientific discovery? With the constant barrage of "new discoveries" flooding the media, how do we make sense of what is going on?
Scientism
Can someone dogmatically adhere to science at the expense of other methods of inquiry? We will look at Six Signs of Scientism to answer this question. Susan Haack’s central objective in Six Signs of Scientism is to demarcate scientism from legitimate science; not in the naïve sense of drawing a boundary around science proper (a move she explicitly critiques as itself scientistic), but rather to expose a cluster of intellectual temptations in contemporary culture that inflate the authority, epistemic reach, or rhetorical prestige of science beyond its proper bounds. Early on, she defines scientism as “a kind of over-enthusiastic and uncritically deferential attitude toward science, an inability to see or an unwillingness to acknowledge its fallibility, its limitations, and its potential dangers” (Haack, p. 76). Her task is not to attack science, she explicitly defends its value, but to identify when admiration becomes uncritical worship. She warns that scientism is not a single thesis but a family of symptoms — subtle, culturally normalized behaviors and linguistic patterns. Hence: six “signs.” Each sign, she notes, is not definitive alone, but diagnostic when seen together.
Sign 1: Honorific use of "Science"
Sign 2: Using Scientific Trappings Decoratively
Sign 3: Obsession with Demarcation
Sign 4: The Quest for "The Scientific Method"
- There is no one “scientific method” used by all and only scientists (p. 89).
- This does not make scientific discovery miraculous; it makes it continuous with ordinary empirical inquiry, but amplified, refined, and disciplined by the distinctive helps science has developed (pp. 88–89).
Sign 5: Looking to Science for Answers Beyond its Scope
- Policy masquerading as science. Science can tell us the likely consequences of damming a river, changing tax codes, or modifying school governance; it cannot by itself adjudicate whether the ends are desirable, or what trade-offs are morally justifiable (p. 90). When researchers’ ethical/political convictions tilt their evidential judgment, or when normative conclusions are presented “as if they were scientific results,” we have scientism (p. 90).
- Empirical surveys as ethical verdicts. Haack analyzes a Lancet article advocating the “complete lives” principle for allocating scarce medical resources — giving priority to adolescents/young adults — and notes the authors cite surveys of what “most people think” as support (pp. 90–91). She underscores the category mistake: “most people think x is morally best” ≠ “x is morally best” (p. 91). Substituting measured preference for justification is a hallmark of scientism.
Sign 6: Denigrating the Non-Scientific
- Within inquiry: It is scientistic to assume empirical legal studies are inherently superior to interpretive legal scholarship (p. 92). Different questions demand different cognitive virtues and methods.
- Beyond inquiry: It is scientistic to assume that art, literature, music, craftsmanship, and tradition have lesser value simply because they are not avenues of empirical discovery (pp. 92–93).
Summarizing Scientism
Conclusion: The Richard Feynman Lectures
I've always found Feynman to be an excellent science communicator. So to wrap this up, let's have a look at his famous lecture on the scientific method:
Richard Feynman on Scientific Method (1964) | After noise reduction
Now, I'm going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it.
Then we-- well, don't laugh. That's really true. Then we compute the consequences of the guess to see what-- if this is right, if this law that we guessed is right, we see what it would imply, and then we compare those computation results to nature. Or we say, compare to experiment or experience. Compare it directly with observation to see if it works.
If it disagrees with experiment, it's wrong. And that simple statement is the key to science. It doesn't make a difference how beautiful your guess is. It doesn't make a difference how smart you are, who made the guess, or what his name is, if it disagrees with experiment, it's wrong. That's all there is to it.
It's therefore not unscientific to take a guess, although many people who are not in science think it is. For instance, I had a conversation about flying saucers some years ago with laymen.
Because I'm scientific. I know all about flying saucers. So I said, I don't think there are flying saucers. So the other-- my antagonist said, is it impossible that there are flying saucers? Can you prove that it's impossible? I said, no, I can't prove it's impossible. It's just very unlikely.
That, they say, you are very unscientific. If you can't prove an impossible, then why-- how can you say it's likely, that it's unlikely? Well, that's the way-- that it is scientific. It is scientific only to say what's more likely and less likely, and not to be proving all the time possible and impossible.
To define what I mean, I finally said to them, listen, I mean that from my knowledge of the world that I see around me, I think that it is much more likely that the reports of flying saucers are the result of the known irrational characteristics of terrestrial intelligence, rather than the unknown rational effort of extraterrestrial intelligence.
It's just more likely, that's all. And it's a good guess. And we always try to guess the most likely explanation, keeping in the back of the mind the fact that if it doesn't work, then we must discuss the other possibilities.
There was, for instance, for a while a phenomenon we called superconductivity. It still is a phenomenon, which is that metals conducts electricity without resistance at low temperatures. And it was not at first obvious that this was a consequence of the known laws with these particles. But it turns out that it has been thought through carefully enough, and it's seen, in fact, to be a consequence of known laws.
There are other phenomena, such as extrasensory perception, which cannot be explained by this known knowledge of physics here. And it is interesting, however, that that phenomenon has not been well established, and--
--that we cannot guarantee that it's there. So if it could be demonstrated, of course, that would prove that the physics is incomplete. And therefore, it's extremely interesting to physicists whether it's right or wrong. And many, many experiments exist which show it doesn't work.
The same goes for astrological influences. If that were true, that the stars could affect the day that it was good to go to the dentist, then-- it's in America we have that kind of astrology-- then it would be wrong. The physics theory would be wrong, because there's no mechanism understandable in principle from these things that would make it go. And that's the reason that there's some skepticism among scientists with regard to those ideas.
Now, you see, of course, that with this method, we can disprove any definite theory. We have a definite theory, a real guess from which you can really compute consequences which could be compared to experiment, and in principle, we can get rid of any theory. You can always prove any definite theory wrong. Notice, however, we never prove it right.
Suppose that you invent a good guess, calculate the consequences, and discover every consequence that you calculate agrees with the experiment. Your theory is then right? No, it is simply not proved wrong. Because in the future, there could be a wider range of experiments, you compute a wider range of consequences, and you may discover, then, that the thing is wrong.
That's why laws like Newton's laws for the motion of planets lasts such a long time. He guessed the law of gravitation, calculated all kinds of consequences for the solar system and so on, compared them to experiment, and it took several hundred years before the slight error of the motion of Mercury was developed.
During all that time, the theory had been failed to be proved wrong, and could be taken to be temporarily right. But it can never be proved right, because tomorrow's experiment may succeed in proving what you thought was right wrong. So we never are right. We can only be sure we're wrong. However, it's rather remarkable that we can last so long. I mean, have some idea which will last so long.
I must also point out to you that you cannot prove a vague theory wrong. If the guess that you make is poorly expressed and rather vague, and the method that you used for figuring out the consequences is rather a little vague-- you're not sure. You say, I think everything is because it's all due to [INAUDIBLE], and [INAUDIBLE] do this and that, more or less. So I can sort of explain how this works. Then you see that that theory is good, because it can't be proved wrong.
If the process of computing the consequences is indefinite, then with a little skill, any experimental result can be made to look like-- or an expected consequence. You're probably familiar with that in other fields. For example, A hates his mother. The reason is, of course, because she didn't caress him or love him enough when he was a child. Actually, if you investigate, you find out that as a matter of fact, she did love him very much, and everything was all right. Well, then, it's because she was overindulgent when he was [INAUDIBLE]. So by having a vague theory--
--it's possible to get either result.
Now, wait. Now, the cure for this one is the following. It would be possible to say, if it were possible to state ahead of time how much love is not enough, and how much love is overindulgent exactly, and then there would be a perfectly legitimate theory against which you can make tests. It is usually said when this is pointed out how much love is and so on, oh, you're dealing with psychological matters, and things can't be defined so precisely. Yes, but then you can't claim to know anything about it.
Now, I want to concentrate for now on-- because I'm a theoretical physicist, and more delighted with this end of the problem-- as to what goes-- how do you make the guesses? Now, it's strictly, as I said before, not of any importance where the guess comes from. It's only important that it should agree with experiment, and that it should be as definite as possible.
But, you say, that is very simple. We set up a machine-- a great computing machine-- which has a random wheel in it that makes a succession of guesses. And each time it guesses a hypotheses about how nature should work, computes immediately the consequences, and makes a comparison to a list of experimental results it has at the other end. In other words, guessing is a dumb man's job.
Actually, it's quite the opposite, and I will try to explain why.
The first problem is how to start. You see how I start? I'll start with all the known principles. But the principles that are all known are inconsistent with each other, so something has to be removed. So we get a lot of letters from people. We're always getting letters from people who are insisting that we ought to make holes in our guesses as follows. You see, you make a hole to make room for a new guess.
Somebody says, do you know, people always say space is continuous. But how do you know when you get to a small enough dimension that there really are enough points in between? It isn't just a lot of dots separated by a little distance.
Or they say, you know those quantum mechanical amplitudes you told me about? They're so complicated and absurd. What makes you think those are right? Maybe they aren't right. I get a lot of letters with such content.
But I must say that such remarks are perfectly obvious and are perfectly clear to anybody who is working on this problem, and it doesn't do any good to point this out. The problem is not what might be wrong, but what might be substituted precisely in place of it. If you say anything precise, for example, in the case of a continuous space. Suppose the precise composition is that space really consists of a series of dots only, and the space between them doesn't mean anything, and the dots are in a cubic array, then we can prove that immediately is wrong. That doesn't work.
You see, the problem is not to make-- to change, or to say something might be wrong, but to replace it by something. And that is not so easy. As soon as any real definite idea is substituted, it becomes almost immediately apparent that it doesn't work.
Secondly, there's an infinite number of possibilities of these simple types. It's something like this. You're sitting, working very hard. You work for a long time trying to open a safe. And some Joe comes along who hasn't-- doesn't know anything about what you're doing or anything, except that you're trying to open a safe.
He says, you know, why don't you try the combination 10, 20, 30? Because you're busy. You tried a lot of things. Maybe you already tried 10, 20, 30. Maybe you know that the middle number is already 32 and not 20. Maybe you know that as a matter of fact, this is a five-digit combination. There we go.
So these letters don't do any good, and so please don't send me any letters trying to tell me how the thing is going to work. I read them to make sure--
--that I haven't already thought of that. But it takes too long to answer them, because they're usually in the class, try 10, 20, 30.