Clarifying Scientific Concepts
There is a lot of confusion, propagated by various media sources, around fundamental scientific concepts and terminology. Colloquial uses of terms such as "theory" or "hypothesis" tend to distort the scientific usage of these terms. Scientific concepts can become trivialized as well; "your 'theory' is just as good as my 'theory'". I want to clarify some of this terminology because constant misuse simply confuses everyone, making it harder to distinguish between competing sources of information on social media platforms.
Science and Pseudo-Science
Demarcating science from non-science is quite difficult. There are obvious exemplars of pseudo-science and many prototypical examples of real science, but no set of necessary and sufficient conditions has been identified that can sort any particular example into a clearly defined bucket. Nevertheless, there are common features shared across many disciplines we deem scientific. These attributes form clusters; examples on the peripheries become harder to classify because they share features with canonical examples of pseudoscience. If you are familiar with Wittgenstein's notion of family resemblance then this might make sense to you. You can also think about it in terms of network clustering. In the diagram below, you can think of the edges between nodes as defining some common relation and distance as measuring some degree of closeness.
A two dimensional view might look something like this:
In the middle of the large cluster we might consider a discipline like physics, and towards the periphery of the cluster we could consider something like psychology, sociology, and economics. The green cluster could represent a pseudo-scientific category containing things like intelligent design. The thing to note with both visualizations is that there isn't a global set of definable features that can distinguish any two disciplines. We cannot construct a list containing all of the essential features that could exclude non-sciences without running into problems. For example, a pillar of modern science is the experiment. Theoretical physics, however, typically does not conduct experiments. Does this mean we exclude it from being a science? That would be absurd. Similarly, the geological sciences typically do not engage in much a priori theorizing, even though a common practice of modern science is establishing some sort of theory to explain observations. Does this mean geology is not a science? That, too, would be absurd.
These examples immediately make it obvious that, when discussing what science is, we have to consider that "science" is a term referring to a broad class of related disciplines. Something can be more or less scientific, exemplifying the fact that "science" is somewhat of a graded concept. This implies that there are qualities or properties by which we can evaluate any particular knowledge claim as science or not. I've shifted away from classifying disciplines as a whole to individual claims because we might run into the same problem when attempting to classify an entire discipline as scientific. For example, there are many knowledge claims coming out of the psychological literature that do not exhibit properties we consider scientific. It's also possible that for any given knowledge claim, some of the properties might not be exhibited. Nevertheless, this does not imply that all claims coming out of psychology are pseudo-scientific. This is also true of some of the more canonical examples of science. Therefore, we must consider the rate at which claims originating from any discipline exhibit scientific qualities. This will prevent us from a priori labeling knowledge claims as pseudo-scientific simply based on the discipline they come from.
Here is a list of some qualities that I think can be used when considering whether a claim is scientific. I do not claim this list to be exhaustive, and the order does not matter at this point. This is not to say that all qualities are of equal importance; I think that would be false.
- Makes predictions or retrodictions
- Is testable
- Is replicable
- Systematically records observations
- Is capable of verification and validation
- Acknowledges the boundaries of its explanatory breadth
- Self corrective and reflective
- Maintains a fair degree of precision and clarity with its terminology
- Is falsifiable
- Has an empirical basis
- Is reproducible
- Is internally coherent and logically consistent
- Strives for impartiality and objectivity
- Has a sufficient degree of generalizability
- Can be subject to scrutiny within a broader community of peer review
- Uses concepts that can be measured or quantified
- Subject to revision in light of new findings
- Is transparent with its methodology
- Is rigorous
- Seeks disconfirmation along with confirmation
- Is communicable
- Seeks to provide causal explanations
- Is highly critical during the design and analysis phases
- Critically assesses methods, assumptions, and interpretations of results
- Uses mathematical models and simulation methods
- Seeks simple and robust methods
- Leverages probabilistic reasoning and acknowledges the uncertainty of its conclusions
- Does not rely on authority unless it's a claim that is taken to be true within the community
- Considers alternative hypotheses
- Seeks convergent validity through multiple information sources and different methods
There are probably more I am not considering, but I think this is not a bad start. I am partial to mathematical modeling but acknowledge that there are disciplines, such as anthropology, that are scientific but might not emphasize mathematical modeling. Also, not all knowledge claims must come from a mathematical model. Nevertheless, scientific disciplines tend to use models because they help us check our assumptions against reality. Another consideration is that a claim might not be testable at this point in time, but technological innovations in the future can make it testable. The fact that something isn't immediately testable due to technical constraints does not make it unscientific. I would say that it's unscientific if, in principle, it cannot be tested; if there is no conceivable way to test the claim, then it is not testable. Again, these are qualities that claims should strive for if we want to consider them scientific.
Throughout the rest of this blog post I'll touch on some of these concepts. I just wanted to initially get this out of the way because many people are confused about which claims are genuinely scientific.
Theories and Scientific Theories
I am not going to focus on any particular theory. I just want to consider, in general, what it means to theorize in a scientific setting, how this activity differs from something like philosophical theorizing, and how both activities are quite different from how the public understands the term.
In the broadest sense of the term, a theory is a structured way of understanding, interpreting, or explaining phenomena. It provides a conceptual framework, a network of ideas that helps us make sense of observations, connect patterns, and predict or interpret outcomes. Theorizing is something humans do all the time; often when you are trying to explain something, you are assuming some underlying theory (although it's normally implicit and not fully structured). Theorizing, in the broadest sense, is any process of pattern finding, meaning making, or framework building. It’s the creative and interpretive act of connecting ideas into a coherent picture — whether the “data” are experiments, emotions, social behaviors, or symbols.
In science, theorizing takes on specific methodological and epistemic constraints. Scientific theories must be testable, falsifiable, and consistent with empirical data. They are often formalized, expressed mathematically, and aimed at predictive power. So while all scientific theories are theories, not all theories are scientific. Science narrows the broader act of theorizing into a disciplined method: empirical, systematic, and verifiable. In philosophy, theorizing is often about conceptual analysis rather than empirical testing. Philosophical theories often deal with abstractions more removed from empirical reality; they are not connected to experimental methods but rather focus on logical entailment. They might deal with concepts like possibility and necessity. You might be eager to claim that science deals with these concepts as well. You'd be correct: certain scientific theories entail the possibility and impossibility of various empirical outcomes. Philosophical possibility is much broader, consisting of what is logically possible; in other words, its theories are "metaphysical". So you can think of scientific and philosophical theorizing as specialized, formalized subsets of the larger, more universal human capacity to theorize — just like poetry and mathematics are specialized ways of using language.
There are common components to all theories, regardless of how fleshed out the theoretical details are:
- Concepts: the basic building blocks of a theory; they name and define the phenomena being discussed. For example, "gravity" in physics, or "motivation" in psychology. Concepts are abstractions; they simplify reality so we can think systematically about it.
- Construct: A type of concept that has been deliberately defined for a specific theoretical purpose. Constructs often can’t be directly observed but are inferred (e.g. “intelligence,” “social capital,” “self-esteem”).
- Propositions: statements that specify the relationships between concepts, how one thing affects or relates to another. In formal sciences, these are hypotheses; in philosophy or critical theory, they may be argumentative claims. A well-defined scientific theory generates testable hypotheses amenable to falsification.
- Assumptions: These are the underlying ideas or conditions taken for granted for the theory to work. For example, in Economics we often assume humans are rational decision-makers. Making assumptions explicit is key to understanding the scope and limits of a theory.
- Boundaries and Scope Conditions: This is the "where and when" of a theory, what domain or context it applies to. For example, a psychological theory may explain individual behavior, not group dynamics.
- Logical Structure: This is the theory's internal organization, how its pieces fit together coherently and systematically. A good theory has internal consistency and avoids contradictions.
- Empirical Linkages: This is how the theory connects to observation or experience. The theory entails certain observations; these are its predictions. In science, this means operational definitions and testability.
Theorizing itself tends to unfold through a series of stages:
- Observation or Problem Identification: It starts with noticing a phenomenon, inconsistency, or puzzle. “Something interesting is happening here — why?”
- Conceptualization: Identify key elements and name them. Define concepts clearly and delimit what you’re focusing on.
- Relationship Mapping: Propose how these elements relate. In science, this becomes hypotheses or models. In philosophy or social theory, this becomes conceptual arguments or dialectical relations.
- Integration and Abstraction: Bring multiple relationships together into a systematic framework. The theory begins to generalize — it becomes more than a list of observations.
- Validation or Evaluation: In science → testing with data, replication, falsification. In interpretive or critical theory → coherence, explanatory depth, ethical and practical adequacy.
- Refinement and Extension: Theories evolve as new evidence or perspectives emerge. This is the “living” nature of theory — it’s continuously reshaped.
I've been reading a lot from Paul Smaldino recently, and think his description of theory is incredibly useful. Paul Smaldino doesn’t offer a single, neat “textbook” definition of theory in the way a philosophy-of-science treatise might, but across his writings we can reconstruct how he treats and uses theories. From his published work (on modeling, methodology, philosophy of science), Smaldino’s view of theory includes the following aspects:
- Decomposition into parts, properties, relationships, and dynamics: In “How to Build a Strong Theoretical Foundation,” Smaldino urges that to develop a theory of some phenomenon, one must decompose the system into relevant parts, specify the properties of those parts, articulate the relationships among them, and define how these can change over time. Thus, theory is not just a verbal or narrative statement, but a structural decomposition plus a specification of dynamics and interactions.
- Theories are tools (not “Truth”): Smaldino is explicit that there is (in his view) no one “true” theory; rather, theories are evaluated by how useful they are for understanding, prediction, generalizability, and refinement. In other words, theory is pragmatic: it is judged by its capacity to guide thinking, to generate falsifiable hypotheses, to clarify assumptions, and to integrate with empirical work.
- Verbal vs. formal theories / role of models: Smaldino repeatedly distinguishes verbal theories (narrative descriptions, “story-like”) from formal theories (mathematical or computational models). He argues that verbal theories are often vague, underdetermined, and thus resist strong testing or falsification. Formal models serve as instantiations of theory—they force explicit specification of assumptions, highlight omitted aspects, and allow rigorous exploration of consequences. In this view, a “good” theory is one that can be (or already is) translated into a formal model (or a family of models) that sharpen and test its claims.
- Iterative and reflexive process: Smaldino sees theory construction as iterative: empirical work should refine the theory, and theory should shape what empirical questions get asked. He warns against treating data merely as support for a verbal theory; rather, data should prompt refinement, specification, or rejection of theoretical assumptions. Also, theory-building is reflexive: one must be conscious of which assumptions are built in (implicitly or explicitly), what is omitted for simplicity, and the “violence” (i.e., distortion) done to reality in modeling.
- Theoretical foundation and training: Smaldino laments that many social scientists lack training in theory construction and formal modeling. In “How to Build a Strong Theoretical Foundation,” he argues for greater methodological and conceptual training so that theory is not just received (from canonical frameworks) but actively constructed. His emphasis is that theory is not peripheral—it is central. Without robust theory, methods (however sophisticated) may produce results without insight. (“Better methods can’t make up for mediocre theory.”)
Putting these aspects together: a theory is a deliberately constructed specification of (i) entities or components of a system, (ii) the properties and possible states of those components, (iii) the relationships and rules by which those components interact, and (iv) the temporal dynamics of how those states and relationships evolve. A strong theory is one that (a) can be formalized in mathematical or computational models, (b) offers testable predictions or counterfactuals, (c) is subject to empirical refinement, and (d) is judged not by an abstract “Truth” but by its utility in explaining, predicting, generalizing, and guiding further inquiry.
In his book “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution”, he defines theory as:
"... a set of assumptions upon which hypotheses derived from that theory must depend. Strong theories allow us to generate clear and falsifiable hypotheses."
Distinguishing it from a theoretical framework:
“A theoretical framework is a broad collection of related theories that all share a common set of core assumptions.”
Theories guide inquiry and the modeling process. A theory frames what phenomena we pay attention to, what questions we ask, and how we model:
“Each [model] decomposes a system in a particular way … What questions does your theory address? What parts do you need to include to answer those questions? … Is your model a satisfying representation of your theory?”
That is, a theory is more than just a verbal narrative: it's the background of assumptions that define how one decomposes the phenomena, and from which hypotheses or models are generated. Formal models are instantiations or precise expressions of the theory, and are used as a way to stress test or refine the theory. There is a one-to-many relationship between theories and models; one theory can be expressed with many different models. This is what I take to be the scientific notion of theory: how I see it applied and how I was trained to apply the term (within the context of economic theory).
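To make the one-to-many relationship concrete, here is a minimal sketch (my own toy example, not one taken from Smaldino) of a single verbal theory, "ideas spread through social contact," instantiated as two different formal models: an aggregate difference-equation model and a stochastic agent-based model. Both rest on the same core assumption but decompose the system differently; all parameter values are invented for illustration.

```python
import random

# Verbal theory: "ideas spread through social contact."
# Two formal instantiations of that one theory follow.

# Model 1: aggregate (SI-style) difference equation.
# State: the fraction of the population that has adopted the idea.
def aggregate_model(adopted_frac, contact_rate, steps):
    history = [adopted_frac]
    for _ in range(steps):
        # New adoptions are proportional to contacts between
        # adopters and non-adopters.
        adopted_frac += contact_rate * adopted_frac * (1 - adopted_frac)
        history.append(adopted_frac)
    return history

# Model 2: stochastic agent-based version of the same theory.
# State: individual agents, each an adopter (True) or not (False).
def agent_model(n_agents, n_seeds, contact_prob, steps, seed=0):
    rng = random.Random(seed)
    adopted = [i < n_seeds for i in range(n_agents)]
    history = [sum(adopted) / n_agents]
    for _ in range(steps):
        for i in range(n_agents):
            if not adopted[i]:
                partner = rng.randrange(n_agents)
                # Contact with an adopter transmits the idea with some probability.
                if adopted[partner] and rng.random() < contact_prob:
                    adopted[i] = True
        history.append(sum(adopted) / n_agents)
    return history

if __name__ == "__main__":
    print(aggregate_model(0.05, contact_rate=0.5, steps=10))
    print(agent_model(200, n_seeds=10, contact_prob=0.5, steps=10))
```

Either model could be used to stress test the theory; where the two instantiations disagree, the disagreement itself tells you which assumptions are doing the work.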
Theoretical Virtues
What counts as a "good" theory? How do we compare two theories explaining the same data? Why is simplicity considered desirable? Theoretical virtues are the criteria by which we compare competing theories. In addition to simplicity, there are other common virtues such as elegance (symmetry), explanatory power (unifying phenomena under one framework), fruitfulness (good at generating testable predictions), and coherence (with itself and other theories). Scientists often invoke these when deciding between theories that fit data equally well.
The weight given to each theoretical virtue varies across fields and contexts. Empirical adequacy is typically non-negotiable. In practice, scientists do appeal to simplicity, elegance, and explanatory depth — even if they don’t always articulate these as “philosophical criteria.” Generally, theoretical scientists (e.g., theoretical physicists, cosmologists, or mathematicians) care more explicitly about theoretical virtues because their work often advances ahead of decisive empirical data. For example, a string theorist might emphasize mathematical beauty and unification, even though direct empirical tests might be lacking. Empiricists, on the other hand, tend to prioritize measurable success and predictive reliability. The line dividing the two is by no means sharp.
We will look at a paper by Michael Keas called "Systematizing the Theoretical Virtues". It provides a fairly comprehensive and structured account of the major theoretical virtues, and how they constitute a "logic of theory choice".
Evidential Virtues
- 1) Evidential accuracy: “A theory fits the empirical evidence well (regardless of causal claims).” Does the theory fit the data? This is the baseline virtue: the observable world looks the way the theory says it should. It’s neutral about causes; it’s just “getting the facts right.” Use it when comparing rivals that speak to the same dataset; watch for overfitting (a theory can “fit” because it has too much wiggle room). Evidential accuracy underwrites the other two evidential virtues: typically you assess causal adequacy and depth after you’ve seen solid fit.
- 2) Causal adequacy: “T’s causal factors plausibly produce the effects (evidence) in need of explanation.” Does the posited mechanism really have the oomph? Beyond fit, we ask whether the causes would in fact yield the observed effects (often many causes in interaction). Robustness analysis across heterogeneous models can support this by showing the same core causal structure yields the phenomenon across variations. Beware “dormant” causes that are merely named, not shown to operate at the required scale.
- 3) Explanatory depth: “Excels in causal history depth or in other depth measures such as the range of counterfactual questions that its law-like generalizations answer.” How far and how flexibly does the explanation reach? Depth comes in two flavors: (i) event-focused “how far back” causal history, and (ii) law-focused counterfactual range (how much would still hold under interventions or changed background conditions). It’s different from unification: depth concerns the same target system under varying conditions, not explaining more kinds of facts. Measure it by the breadth of stable “what-if” answers your laws support.
Coherential Virtues
- 4) Internal consistency: “T’s components are not contradictory.” No contradictions inside the theory. A minimal bar: if it derives P and ¬P, something must give. Subtle inconsistencies can hide in idealizations; don’t set the bar so high that all idealized modeling looks “inconsistent,” but don’t excuse genuine clashes as “just idealization,” either. Think formal coherence first, before aesthetic “niceness.”
- 5) Internal coherence: “Components are coordinated into an intuitively plausible whole… T lacks ad hoc hypotheses—components merely tacked on to solve isolated problems.” Parts hang together as an intuitively plausible whole (no ad hoc patches). Different from pure logic: a theory can be consistent yet obviously jury-rigged. Red flags: fixes that are untestable, explain nothing else, or sit awkwardly with the core principles. Use “negative” diagnosis (ad hocness) to pressure-test coherence.
- 6) Universal coherence: “T sits well with (or is not obviously contrary to) other warranted beliefs.” Fits with the rest of what we’re warranted to believe. This is external fit: harmony with well-established results and background commitments (including conservation principles, etc.). Clash here doesn’t instantly falsify, but it raises costs you must repay with exceptional evidential gains. Distinguish healthy tension (pushes progress) from outright conflict with robust knowledge.
Aesthetic Virtues
- 7) Beauty: “Evokes aesthetic pleasure in properly functioning and sufficiently informed persons.” The theory evokes aesthetic pleasure in appropriately situated observers. Beauty shows up as symmetry, aptness, “surprising inevitability,” etc. On Keas’s account, beauty may have extrinsic epistemic value (it can guide us toward other, more tightly connected virtues like simplicity and unification). Use with humility: beauty can inspire, but by itself it doesn’t guarantee truth.
- 8) Simplicity: “Explains the same facts as rivals, but with less theoretical content.” Same explananda, less theory. Think fewer entities (parsimony) and/or more concise principles (elegance). Practically, count independent parameters, primitive postulates, or distinct assumptions. Simplicity often correlates with better predictive performance in model selection (see the sketch after this list), but it also interacts with coherence (ad hoc add-ons usually bloat a theory).
- 9) Unification: “Explains more kinds of facts than rivals with the same amount of theoretical content.” Same resources, more kinds of facts explained. Unification and simplicity are complementary “styles of informativeness”: simplicity reduces content for the same domain; unification expands domain for the same content. Use it to prefer frameworks that tie disparate phenomena together (Maxwell’s electrodynamics-light, plate tectonics, etc.). Keep distinct the diachronic notion (“consilience” gained over time) from this aesthetic one present at introduction.
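As a small illustration of the parameter-counting idea, here is a sketch (toy simulated data, assuming Gaussian noise) that compares a simpler and a richer model of the same observations with AIC, one standard model-selection score that explicitly charges a price for extra theoretical content.

```python
import math
import random

# Toy data: a gently rising trend plus noise (numbers are invented).
rng = random.Random(1)
xs = list(range(20))
ys = [0.3 * x + rng.gauss(0, 1.0) for x in xs]

def gaussian_log_lik(residuals):
    """Maximized Gaussian log-likelihood given a model's residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, n_params):
    # AIC penalizes theoretical content: each free parameter costs 2 points.
    return 2 * n_params - 2 * log_lik

# Simpler model: y is constant (mean + noise variance = 2 parameters).
mean_y = sum(ys) / len(ys)
res_simple = [y - mean_y for y in ys]

# Richer model: y is linear in x (slope + intercept + variance = 3 parameters).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
res_linear = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("AIC, constant model:", round(aic(gaussian_log_lik(res_simple), 2), 1))
print("AIC, linear model:  ", round(aic(gaussian_log_lik(res_linear), 3), 1))
# Lower AIC wins: the richer model's extra fit has to outweigh its penalty.
```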
Diachronic Virtues
- 10) Durability: “Has survived testing by successful prediction or plausible accommodation of new data.” Survives testing over time (prediction or plausible accommodation). Durability is not mere popularity or longevity: it is survival under test. Prediction is often the gold standard; in historical sciences, repeated plausible accommodation of novel data also counts. A newborn theory can’t yet be “durable”; this virtue is inherently time-laden.
- 11) Fruitfulness: “Over time, generates additional discovery by means such as successful novel prediction, unification, and non ad hoc theoretical elaboration.” Generates further discovery (incl. novel prediction, non-ad hoc elaboration, added unification). If durability is conservation (passing tests), fruitfulness is innovation (creating new testable strands). Novel prediction here is genuinely new—wasn’t “built in” as a target during construction. Fruitfulness and durability interlock in mature research traditions (e.g., gravitational astronomy from Uranus’s anomaly to Neptune).
- 12) Applicability: “Used to guide successful action or to enhance technological control… higher when it enables outcomes otherwise not possible.” Guides successful action or control (science → technology, policy). Distinct from experimental control for testing; this is practical leverage (engineering, medicine, forecasting). It’s confirmatory and arrives only after earlier virtues are in place (you can’t apply what you haven’t yet credibly learned), so it is inherently diachronic.
Hypothesizing and Confirmation
The term "hypothesis" is frequently bastardized. Confirmation, and its counterpart disconfirmation, are also incredibly misunderstood by the general public. A hypothesis is pretty much just a testable guess, normally derived from a theoretical framework. It is a specific, testable statement about what you expect to happen. It's a prediction about reality that you intend to check with evidence. For example, "If plants are given more light, they will grow faster"; this is a hypothesis: it can be wrong but can definitely be tested. It’s different from an axiom (assumed true), a conjecture (unproven mathematical guess), or a proposition (any statement that is true or false but not necessarily testable), in that it is directly connected to the idea of testability and should have properties such as verifiability and falsifiability. This normally implies the phenomena referenced by the hypothesis are measurable, and therefore directly or indirectly observable through empirical data. In other words, hypotheses should be operationalizable, not merely verbal statements. The hypothesis must be expressed with the level of precision necessary to implement some test; if a hypothesis is not amenable to this, it's not testable in a way that can discern how likely it is.
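As a toy illustration of what operationalizing buys you, here is a minimal sketch (invented numbers, not real data) that turns "more light makes plants grow faster" into a concrete comparison: growth in centimeters for a high-light and a low-light group, summarized with a two-sample Welch t statistic. The point is only that the hypothesis is now precise enough to confront data.

```python
from statistics import mean, variance

# Hypothesis: "If plants are given more light, they will grow faster."
# Operationalization: growth in cm over two weeks, high-light vs. low-light group.
# The numbers below are made up purely for illustration.
high_light = [4.1, 3.8, 4.6, 4.0, 4.4, 3.9, 4.7, 4.2]
low_light  = [3.2, 3.5, 3.1, 3.6, 3.0, 3.4, 3.3, 3.7]

def welch_t(a, b):
    """Welch's two-sample t statistic (allows unequal variances)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

print(f"mean difference: {mean(high_light) - mean(low_light):.2f} cm")
print(f"Welch t statistic: {welch_t(high_light, low_light):.2f}")
# A large positive t is evidence that the high-light group grew faster;
# whether that counts as strong support still depends on the alternatives
# and caveats discussed next.
```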
I'd like to follow this with a few caveats. Scientific practice is often messy, and not defined by one thing such as falsification. Very often, strict falsifiability is not feasible. In practice, this might restrict the applicability of certain testing procedures, such as statistical testing. As Richard McElreath writes in Statistical Rethinking:
Science is not described by the falsification standard, and Popper recognized that. In fact, deductive falsification is impossible in nearly every scientific context.
(1) Hypotheses are not models. The relations among hypotheses and different kinds of models are complex. Many models correspond to the same hypothesis, and many hypotheses correspond to a single model. This makes strict falsification impossible.
(2) Measurement matters. Even when we think the data falsify a model, another observer will debate our methods and measures. They don’t trust the data. Sometimes they are right. (in addition to issues such as false positives and false negatives, observation error)
So in other words, the scientific method is not reducible to a statistical procedure. Statistical evidence is nevertheless an important feature of the process, and statistical methods can relate hypotheses to data, but they are not sufficient.
We will talk more about this later, but since modern science relies so heavily on procedures from statistics, it's impossible to fully separate the concept of a hypothesis from statistical inference. There are two concepts that frequently occur in the context of reasoning about hypotheses: confirmation and disconfirmation. Remember that a hypothesis makes a prediction about something; in other words, if the hypothesis were true, we would expect to observe something implied by that hypothesis. These observations are typically encapsulated by a probability distribution, and therefore are described by likelihoods. We have some hypothesis H, and we show that it entails some observation D. If we look for D and don't find it, we must conclude that H is false. However, finding D tells us nothing certain about H, because other hypotheses can also predict D. This is why we invoke the notion of likelihoods. If we observe D, we can't be certain that H explains D; but if we compare relative likelihoods, we can find that H makes D less surprising than the alternative hypotheses do. This type of reasoning is central to understanding how scientists reason under uncertainty.
I'll briefly introduce the idea of Bayesian confirmation. The core idea is that a hypothesis wins credit when evidence was more likely if the hypothesis were true than if it weren't. If seeing E is more expected under H than under “not-H,” then E confirms H. The stronger the shift, the stronger the confirmation. (Formally: strength ≈ how big the ratio is between P(E|H) and P(E|¬H).) Evidence confirms H when P(E | H) > P(E | not-H) and disconfirms when P(E | H) < P(E | not-H).
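Here is a minimal numerical sketch of that rule (all probabilities invented for illustration): the likelihood ratio P(E|H) / P(E|¬H) measures how strongly E confirms H, and Bayes' rule turns that ratio plus a prior into an updated credibility for H.

```python
# Bayesian confirmation sketch with made-up numbers.
# H is the hypothesis; E is the observed evidence.
p_h = 0.30              # prior credibility of H
p_e_given_h = 0.80      # how expected E is if H is true
p_e_given_not_h = 0.20  # how expected E is if H is false

# Likelihood ratio: how much more expected E is under H than under not-H.
likelihood_ratio = p_e_given_h / p_e_given_not_h

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = p_h / (1 - p_h)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.1f}")
print(f"P(H) before seeing E: {p_h:.2f} -> P(H | E): {posterior:.2f}")
# E confirms H because P(E|H) > P(E|not-H); if that inequality were
# reversed, the same arithmetic would lower P(H) instead.
```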
Evidence doesn’t “prove” a hypothesis; it shifts how credible it is. Many different stories can fit the same facts. What matters is which story makes those facts more expected than rival stories. The same observation can match multiple hypotheses; this is the idea of underdetermination. Also, analysis choices matter. What you count, how you measure, which model you use, and when you stop collecting data can all tilt the result without changing the raw facts. Instead of "proof", you should think of "support"; how much the evidence tips the scales relative to alternatives, not in isolation. Vague hypotheses also carry little weight; if almost anything you observe feels like confirmation, the statement doesn't rule anything out, and specific predictions are what force real tests. Here are a few questions to ask yourself when evaluating a hypothesis:
- What are the live alternatives? What else could explain this? (List at least one.)
- What did each hypothesis specifically predict? (Before seeing the data.)
- Would this result have surprised the rival more? (If yes, support is stronger.)
- What would disconfirm your hypothesis? (Name a clear outcome.)
- Did we tune our analysis after seeing results? (If yes, be cautious.)
- Does this hold in new data or by a different method? (Consilience.)
- Would those alternatives have expected this result as much as your hypothesis does? If your hypothesis makes the result less surprising than the alternatives, that’s good support. If lots of stories would’ve predicted it, it’s mild at best.
Evidence, Empirical Evidence, and Scientific Underdetermination
It is actually quite difficult to define evidence. What distinguishes a detective who uses evidence from the scientist who uses "empirical evidence", derived from empirical research, when advancing a claim? Clearly, what counts as evidence in these domains is not entirely overlapping. In addition, there is a plethora of synonymous terms that are very often used interchangeably with "evidence", but are conceptually distinct (data, facts, etc.), and these muddy the waters. There are also related concepts, such as the burden of proof and admissibility, that frequently arise when discussions involve evidence. In some contexts, these are formally established and institutionalized through rules and procedures, as is the case in law or debate. The word is also used as a modifier. Consider something like evidence-based policy or evidence-based medicine; to what extent does the word "evidence" impact how each discipline is carried out? What exactly is that modifier doing to the subsequent words? There is even a branch of epistemology called evidentialism, which is primarily concerned with the relationship between evidence, justification, and knowledge. Lastly, there are even attempts to construct frameworks that grade the quality of evidence, such as the Hierarchy of Evidence. Clearly, understanding how people use this term, in particular scientists, is of significant consequence. My main focus with this section is to characterize how scientists use and reason about evidence. But I also want to bridge the gap between these different senses of the term, so I'll introduce two authors who have had an impact on how I think about this concept.
In "Evidential Foundations of Probabilistic Reasoning", David Schum introduces the notion of a "Science of Evidence"; he recognizes the inherent plurality of the term and wants to abstract the notion across all disciplinary domains. Schum also recognizes the inherent uncertainty featured in all reasoning tasks based on evidence: “… in any inference task our evidence is always incomplete, rarely conclusive, and often imprecise or vague; it comes from sources having any gradation of credibility. As a result, conclusions reached from evidence […] can only be probabilistic in nature.” He also identifies and is concerned with, structural features of evidence; how it all connects together within a network of inference. Schum doesn’t pin “evidence” down with a single neat definition. He argues it’s best understood functionally; by what it does in reasoning. For him, evidence is any item of information (a trace, record, testimony, measurement, etc.) that bears on a hypothesis; its value depends on (i) relevance (how it connects to the hypothesis) and (ii) credibility (how much you can trust the item or its source). The overall inferential force (or probative weight) of an item is a joint product of those two strands. For Schum, evidence does not exist in a vacuum; it is relational, not free floating. An item isn’t “evidence” all by itself; it becomes evidence only relative to a specific hypothesis/probandum once you supply (and defend) an inference link from the item to that hypothesis. The link from an item to a hypothesis is licensed by background generalizations (“glue”) that are often implicit and need support; in other words, relevance must be argued. He distinguishes directly relevant evidence (bearing on the hypothesis) from ancillary (meta) evidence; material about the strength of that link (e.g., source credibility or whether the generalization really fits this case). “Evidence” is information put to work in support of a hypothesis, with its relevance and credibility argued. An item becomes evidence only when embedded in an argument that (1) states the hypothesis at issue, (2) shows relevance by supplying the generalization that links the item to that hypothesis, (3) supports credibility (often with ancillary evidence), and (4) assesses inferential force given how the item interacts with the rest of the mass of evidence.
Schum is explicit that a report about an event and the event’s actual occurrence are not the same; you must infer from the report to the world, and that inference is always uncertain. This is true of domains involving measurement as well; our measure of the thing is not the thing itself. Hence his insistence that all evidential reasoning is, in the end, probabilistic due to this uncertainty. Relevance is the logical/inferential link from an evidential claim to the hypothesis, and credibility concerns source reliability. With testimony, he decomposes credibility into veracity, objectivity, and observational sensitivity (were they truthful, unbiased, and in a position to observe?), but credibility standards are used beyond the legal realm (consider a scientist questioning the mechanism by which data was collected). Much of what we call “evidence” is actually evidence about other evidence; material that bears on a witness’s credibility or on the soundness of a measurement process. That ancillary layer is often what lets you evaluate the force of the directly relevant items. He often uses the likelihood ratio as a convenient gauge of an item’s inferential force and shows structurally-driven phenomena like inferential drag (links in a chain weaken force), redundancy, and synergy when items combine. But the broader point is structural: you can analyze evidential force even when precise frequencies are unavailable. The key point about structure is that basic configurations of evidence combinations have differing degrees of inferential force. Consider two witnesses claiming to report some event. If we find one of them is wrong, it might weaken our confidence in the other witness if we find out that they collaborated before reporting the account. In other words, there is structural collapse which leads to a non-linear reduction in inferential force (redundancy). The reverse can be true as well; one piece of information can amplify the inferential force of a collection of evidence (synergy). An item can also be clearly relevant (it bears on the hypothesis) and its source fully credible, yet still move the needle only a little. Fundamentally, the probabilistic assessment of evidence rests on these more primitive notions of relevance and credibility.
A bit more about the notion of inferential force; this is the diagnostic strength of an item given a stated hypothesis versus its rivals. As mentioned before, Schum gauges this with the likelihood ratio, though this is not strictly necessary; it scores force in much the same way that statisticians score posterior inference with Bayes factors. An item can have a likelihood ratio near one, and hence little inferential force, despite being relevant and credible. Similarly, an item can have very extreme values for likelihood ratios but carry little weight if the credibility and relevance are called into question. Here are just a few ways Schum describes how this can occur (a toy calculation after the list illustrates two of them):
- Low diagnosticity: A careful, honest witness saw the suspect “in a dark hoodie”—a description that fits many people. Credibility is high, relevance is clear, but P(E∣¬H) is also high, so the likelihood ratio is small.
- Chains = inferential drag: When E supports H only through several intermediate links (A→B→C→H), each link’s uncertainty compounds, typically reducing net force (“inferential drag”).
- Dependence & redundancy: Two “independent-looking” reports may trace back to the same primary source. The second then adds little; combining them yields less force than naïvely multiplying independent likelihood ratios. (Conversely, truly independent items can show synergy.)
- Ancillary constraints on the Likelihood Ratio: Ancillary (meta) evidence about the measurement/testimony changes the likelihoods (e.g., false-positive rates, viewing conditions), which can push force up or down without altering surface relevance.
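The chain and redundancy effects can be shown with a toy calculation (all numbers invented): routing evidence through an imperfect report dilutes its likelihood ratio, and a second report that is really a copy of the first adds nothing, whereas a genuinely independent one would multiply the force.

```python
# Toy illustration of inferential drag and redundancy (invented numbers).

# Direct evidence: event E bears on hypothesis H.
p_e_given_h, p_e_given_not_h = 0.80, 0.20
lr_direct = p_e_given_h / p_e_given_not_h  # force of E itself

# But we never observe E directly, only a report R that E occurred.
# The witness is imperfect: hit rate 0.9, false-report rate 0.1.
p_r_given_e, p_r_given_not_e = 0.90, 0.10

# Likelihood ratio of the report R for H, via the chain R -> E -> H.
p_r_given_h = p_r_given_e * p_e_given_h + p_r_given_not_e * (1 - p_e_given_h)
p_r_given_not_h = p_r_given_e * p_e_given_not_h + p_r_given_not_e * (1 - p_e_given_not_h)
lr_chained = p_r_given_h / p_r_given_not_h

print(f"force of E itself:           LR = {lr_direct:.2f}")
print(f"force of the report about E: LR = {lr_chained:.2f}  (inferential drag)")

# Redundancy: a second report that merely copies the first adds nothing new,
# while genuinely independent reports would multiply their likelihood ratios.
print(f"two independent reports:     LR = {lr_chained * lr_chained:.2f}")
print(f"two copies of one report:    LR = {lr_chained:.2f}  (redundancy)")
```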
The second author is Peter Achinstein, who distinguishes several senses of "evidence" (the first of these is restated symbolically after the list):
- Potential Evidence: a true statement e, together with true background b, is potential evidence for hypothesis h only if (i) e doesn’t entail h, and (ii) given e & b, it’s probable that there is an explanatory connection between e and h (Achinstein formalizes this with an “objective epistemic” probability >½).
- Veridical Evidence: Strong VE requires: (1) e is PE for h; and (2) h is true; and (3) there is an explanatory connection between e’s truth and h’s truth. (He also discusses a weaker VE that drops (3), but argues scientists should want the strong form to avoid “misleading” evidence.)
- ES-evidence (Epistemic Situation): e is true and anyone in a specified epistemic situation is justified in believing that e is (probably) VE for h.
- Subjective evidence: at time t, agent X believes e is (probably) VE for h, and X’s reason for believing h (is true/probable) is that e is true.
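Restating the potential-evidence conditions symbolically (my own notation, simply transcribing the necessary conditions listed above):

$$
e \text{ is potential evidence for } h \text{ (given background } b\text{)} \;\Longrightarrow\;
\begin{cases}
e \text{ and } b \text{ are true},\\
e \text{ does not entail } h,\\
p\big(\text{there is an explanatory connection between } e \text{ and } h \mid e \wedge b\big) > \tfrac{1}{2}.
\end{cases}
$$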
How does all of this map onto what working scientists actually do with evidence? Roughly:
- They start from questions/hypotheses/models. Even in exploratory work, there’s at least a background model (“these genes might co-express,” “this detector should see X events”). That gives evidence something to be for. This maps nicely to Schum's "evidence is not free floating" concept and also captures relevance.
- They produce data via instruments or observations. That’s the raw material; but nobody sensible treats raw data as already “evidence.” This collection step is often a source of critical questioning, mapping to Schum's notion of credibility.
- They process/clean/model it. This is where measurement error, instrument calibration, and statistical assumptions come in.
- They interpret it relative to rival explanations. “Does this pattern support model A over model B?” “Does this reject the null?” “Does this effect replicate?”
- They document uncertainty (standard errors, likelihoods, Bayes factors, upper bounds).
- They bring in meta/ancillary info (instrument logs, sample provenance, blinding procedures, preregistration, replication studies, peer review).
- They value replication and reproducibility as communal credibility checks: “Can someone else’s instrument get the same item?” That is ancillary evidence writ large.
Scientific Measurement
This topic actually spans entire academic journals, so it will be somewhat difficult to condense it into something concise. Measurement is important across pretty much every domain of science and engineering. Given the nuances specific to each domain, I'll try to capture the broad generalities that could represent how scientists "in general" think about measurement. Much of my thinking in this section comes from "Measurement Across the Sciences: Developing a Shared Concept System for Measurement".
As a disclaimer, this section will be biased towards measurement considerations in the social sciences, primarily because I am an applied econometrician by training. Like I mentioned previously, the concept of "measurement" is extremely broad. For example, there is a widely researched area of mathematics called "Measure Theory", which seeks to formalize and generalize common notions of measurement such as magnitude, mass, and probability. In contrast, there is the scientific study of measurement called "Metrology", which is less concerned with formalization, and much more concerned with establishing units of measurement, development of measurement methods/instruments, identification of measurement standards, evaluation of uncertainties, and the traceability/usability of these standards across a wider population. Mathematical theories of measure do not concern themselves with evidential grounds or success criteria associated with such methods. This culminates in products such as the International Vocabulary of Metrology standardized by ISO. If you look at broad definitions of measurement, they almost always make explicit reference to the fact that the thing being measured is physical. This raises the question: can non-physical "things" be measured? If something is non-physical, is it inherently incapable of being empirically investigated? Consider something like "subjective probability" in Bayesian statistics; to what extent can someone measure their "degrees of belief" about a proposition? Nevertheless, social scientists frequently make reference to "measurements" when conducting empirical research. Economists construct "Happiness Indexes", Psychologists measure "Personality", and Sociologists measure "Community Cohesion", often using sophisticated statistical methods that map observed data to unobserved "constructs". At first glance, I think it's obvious these are quite distinct from yardstick measures someone can use in a physical science lab. In many physical-science settings, there’s a well-defined quantity, and an instrument that’s been calibrated to that quantity. In a lot of social-science settings, there’s a theoretical construct, and we build a data-collection apparatus to approximate it. I'll explain this difference in detail later; for now let's dive into the fundamentals.
So what is measurement? It is the process of assigning numbers (or well-ordered labels) to aspects of the world according to a rule, so that the numbers reflect something about the thing. We have a set of real things (objects, events) and a set of numbers; measurement is a structure-preserving assignment from one set to the other. I think social science adds additional constraints, captured by the Stanford Encyclopedia of Philosophy (SEP) article on measurement in science. From section 7 of the SEP article, model-based accounts of measurement consist of two levels:
(i) a concrete process involving interactions between an object of interest, an instrument, and the environment; and (ii) a theoretical and/or statistical model of that process, where “model” denotes an abstract and local representation constructed from simplifying assumptions. The central goal of measurement according to this view is to assign values to one or more parameters of interest in the model in a manner that satisfies certain epistemic desiderata, in particular coherence and consistency.
So measurement involves interaction between the object (or aspect of the system), an instrument (or measuring tool), and an environment, which includes the subjects doing the measurement. Measurement represents these interactions with parameters, assigning values to the parameters (measurands), based on the results of the interactions. The SEP article continues, saying there are two main outputs identified by model-based accounts of measurement:
Instrument indications: these are properties of the measuring instrument in its final state after the measurement process is complete. Examples are digits on a display, marks on a multiple-choice questionnaire and bits stored in a device’s memory. Indications may be represented by numbers, but such numbers describe states of the instrument and should not be confused with measurement outcomes, which concern states of the object being measured.
Measurement Outcomes: these are knowledge claims about the values of one or more quantities attributed to the object being measured, and are typically accompanied by a specification of the measurement unit and scale and an estimate of measurement uncertainty. For example, a measurement outcome may be expressed by the sentence “the mass of object a is 20±1 grams with a probability of 68%”.
Inferring outcomes from instrument indications is non-trivial, often being theory laden and reliant on statistical assumptions about the object being measured, the instrument, the environment, and the calibration process. Let's concretize this. Consider an econometric study that seeks to understand the effect of some policy on health outcomes. The object being measured might be aggregate outcomes among a population (survival rates), the instrument might be self-report surveys or hospital reports, the environment might include sources of confounding variables (noise), and the calibration process might be how the surveyor constructed the survey (use of language, choice of questions, etc.), how those questions connect to the concept (health), and accepted measurement standards. Corrections in data might be necessary to account for systematic bias in data collection; for example, if we know there to be a non-response bias due to some expected reason, we might adjust the data according to statistical and theoretical assumptions. On this view, measurement is a set of procedures aimed at assigning values to model parameters based on instrument indications.
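To make that bias-correction step concrete, here is a toy sketch (entirely simulated, with hypothetical strata and response rates) of a simple reweighting adjustment: one group answers the health survey far less often, so the unweighted mean is biased, and weighting respondents back to known population shares recovers something close to the true value.

```python
import random

rng = random.Random(42)

# Simulated population: two strata with different true outcomes and
# different propensities to answer the survey (hypothetical numbers).
strata = {
    # name: (population share, true mean outcome, response rate)
    "young": (0.5, 60.0, 0.2),
    "old":   (0.5, 80.0, 0.8),
}

# Simulate the respondents we actually observe.
responses = []  # (stratum, observed value)
for name, (share, true_mean, resp_rate) in strata.items():
    for _ in range(int(10_000 * share)):
        if rng.random() < resp_rate:
            responses.append((name, rng.gauss(true_mean, 10.0)))

# Naive estimate: ignore who answered -> over-weights the "old" stratum.
naive = sum(v for _, v in responses) / len(responses)

# Weighted estimate: weight each respondent by population share / sample share.
counts = {name: sum(1 for s, _ in responses if s == name) for name in strata}
weighted = sum(
    v * (strata[s][0] / (counts[s] / len(responses))) for s, v in responses
) / len(responses)

true_value = sum(share * true_mean for share, true_mean, _ in strata.values())
print(f"true population mean: {true_value:.1f}")
print(f"naive survey mean:    {naive:.1f}   (biased by non-response)")
print(f"reweighted mean:      {weighted:.1f}")
```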
Like I mentioned earlier, my view of measurement comes from my economics graduate studies. The above is a partial view of the entire field of measurement; a highly partial view, I recognize. But is there core terminology independent of any particular domain? The text referenced above (Measurement Across the Sciences) seeks to establish such a lexicon. The authors note that the book is a departure from the VIM (mentioned above), which assumes that only physical quantities are measurable. This book seeks to expand that, so that non-physical properties of a system can also be measured (extending the scope of measurement to social science and management domains). According to the text, measurement can be thought of as:
a process based on empirical interaction with an object and aimed at producing information on a property of that object in the form of values of that property.
Measurement is an empirical process, designed on purpose, whose input is a property of an object, and that produces information in the forms of values of that property.
This makes it clear that we do not measure objects themselves, but properties of these objects (or systems). It also enables comparison of objects on these measured properties, assuming the system of measurement used to measure these values is the same. This is important, because it addresses "how" you went about measuring the property. If someone measures, say, intelligence using procedure X and someone else measures it with procedure Y, we might not be able to compare the two measures. Generally speaking, you must compare properties of the same kind in order for the comparison to be meaningful. "Property" in this context designates both properties of objects and their kinds of properties. Below the authors provide notation and more detail for how to refer to properties:
The last component of the definition of measurement is that it produces information on the measurand in the form of values of properties, and thus, in the specific case of quantities, in the form of values of quantities. Remember earlier, I mentioned these authors wanted to expand the notion of measurement to non-quantitative properties, something metrologists typically do not do. I'm not sure how controversial this is more broadly, but doing this enables qualitative research methods. Extending figure 2.5 to include this aspect:
Where Q[a] refers to "Generic property of object a", "q_ref" refers to the unit, and "x" is the numerical value of the quantity. We choose "Q" instead of "P" in the case where we are deliberately interested in quantitative properties of the object. This leads to the most generic understanding of measurement: it is designed empirical property evaluation (of an object, system, or process). Measurement is a process that connects entities of the empirical world and entities of the information world. The authors describe this connection using the following terminology (a toy example after the list walks through all three steps):
- Transduction: A measuring instrument interacts with an object and, being sensitive to a specific property, changes its own state to produce an indication of that property. In other words, it converts (transduces) the measurand into something observable—like a bathroom scale’s spring turning weight (force) into spring elongation, or a paper test turning a person’s reading ability into a pattern of marked answers. This step is purely empirical.
- Instrument scale application: The instrument is built so its observable indications can be systematically linked to information units (often numbers) through a scale. This means mapping the physical sign to a value—like turning spring elongation into a length reading in centimeters, or turning a pattern of checked boxes into scored responses. This step mixes empirical observation with informational mapping.
- Calibration function computation: Because the indication (e.g. length, scored items) is usually not the same kind of property as the measurand (e.g. force, reading ability), the indication value must be transformed via a calibration function that models how the instrument’s indication relates to the actual quantity of interest. Thus, length is converted to force, and scored responses to a reading-comprehension level. This step is entirely informational.
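Here is a toy bathroom-scale example of those three steps (all numbers invented): a mass stretches a spring (transduction), the elongation is read off a graduated scale (instrument scale application), and a calibration function converts that indication back into the quantity of interest, with a rough uncertainty estimated from repeated readings.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)

# --- Transduction (empirical): the object deforms a spring.
# Hypothetical instrument: 0.05 cm of elongation per kg, plus a little noise.
def transduce(true_mass_kg):
    return 0.05 * true_mass_kg + rng.gauss(0, 0.02)  # elongation in cm

# --- Instrument scale application (empirical + informational): read the
# elongation off a graduated scale, to the nearest 0.01 cm.
def read_scale(elongation_cm):
    return round(elongation_cm, 2)

# --- Calibration function (informational): convert the indication (cm)
# back into the measurand (kg), using the instrument's calibration.
def calibrate(indication_cm):
    return indication_cm / 0.05

# Repeated measurements of the same object give both a measurement outcome
# and a rough measurement uncertainty.
true_mass = 70.0
readings = [calibrate(read_scale(transduce(true_mass))) for _ in range(10)]
print(f"measurement outcome: {mean(readings):.1f} kg "
      f"+/- {stdev(readings):.1f} kg (from 10 repeated indications)")
```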
Remember, this is a designed procedure. The intent of the process is to produce an information entity; the measured value is expected to convey information about an empirical entity (and to be analyzed mathematically). As the people carrying out this procedure, we seek to minimize the distance between the actual property of the object and the measured value; we want to minimize the error/uncertainty. There can be multiple sources of drift between the actual value of the property and the measured value, which collectively contribute to measurement uncertainty. In the book, the authors mention that these sources partly derive from your measurement strategy (daily-life, operational, statistical, analytical). The two main sources of uncertainty are:
- Definitional uncertainty: we didn’t fully or sharply say what the measurand is. This refers to how fuzzy the measurand’s definition is.
- Measurement uncertainty: even if we did define it sharply, the instrument/procedure isn't perfectly repeatable, may not be perfectly sensitive to the measurand, has calibration issues, etc.
Combining the empirical and informational aspects, the refined definition becomes: "measurement is an empirical and informational process, designed on purpose, whose input is an empirical property of an object and that produces information in the form of values of that property."
A related distinction is between the intended property (what we mean to measure) and the effective property (what the procedure actually engages). In the human sciences, one can see an example of this distinction in the measurement of reading comprehension ability. Here, the assessments always specify that the tests are to be given under conditions free from distraction while the student is reading the passages and responding to the comprehension questions, so that a noisy environment, for example, would not be advisable. This is strongly associated with the intended property—a student’s comprehension of text under good conditions. However, it may be the case that, in a given situation, a student is asked to respond in a noisy and distracting environment—this would be a case where the effective property differs from the intended property, and, presumably, any measurements made in this distracting situation would tend to show lower reading comprehension ability.
In the social sciences, the path from the world to a number can be thought of as a pipeline (a toy sketch of the middle of this pipeline follows the list below):
Phenomenon → Concept → Construct → Operationalization → Instrument/Procedure → Data → Metric/Indicator → Interpretation
- Phenomenon: Something in the world you care about (well-being, intelligence, economic activity, discrimination).
- Concept: Your verbal idea of it — usually a bit fuzzy. “Intelligence is the ability to adapt and solve problems.” “Economic growth is how much more stuff a society produces.”
- Construct: The theory-shaped version of the concept — clearer, bounded, hooked into other ideas. A construct says, “this thing has these dimensions, relates to these causes/effects.” Psychologists love this word. It’s the “scientific” packaging of the concept. Economists use this word less often, but they're doing the same thing.
- Operationalization: “Given that construct, what observable things will stand in for it?” This is the missing link people skip. It’s the mapping rule. “We will treat X, Y, and Z behaviors/scores/answers as evidence of the construct.” This is highly theory laden, meaning it depends on the theoretical formulation.
- Instrument / Procedure: The actual tool or protocol: a survey scale, a test, a national accounts system, a coding scheme for interviews. This is the point where we ask "is the procedure measuring the attribute of the system/object we care about."
- Data: The raw responses, counts, test scores, monetary totals. These are supposed to be the raw measures obtained through the transduction phase, extracted from the measurand.
- Metric / Indicator: The processed number (or small set of numbers) we show to the world: IQ = 115, GDP = $24 trillion, Depression score = 18/27.
- Interpretation: “Therefore, this person is above average,” or “this economy is growing,” or “this group is more prejudiced.” When a theory is mathematically precise, there should be less room for competing interpretations. Every step prior to this influences what can be said about the metric.
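Here is a deliberately crude sketch of the middle of that pipeline for a hypothetical "well-being" construct (invented items and scoring, not a validated scale): an operationalization picks three survey items as stand-ins, the instrument collects 1-5 responses, and a metric is computed by reverse-scoring one item and averaging. Every choice involved (which items, the scoring rule, the equal weights) is theory laden in exactly the sense described above.

```python
# Construct: "subjective well-being" (hypothetical, simplified).
# Operationalization: three survey items are treated as indicators.
ITEMS = {
    "satisfied_with_life": False,   # higher response = more well-being
    "energy_most_days":    False,
    "often_feels_down":    True,    # reverse-scored: higher = less well-being
}

def wellbeing_score(responses):
    """Metric: average of the three items on a 1-5 scale, after reverse-scoring."""
    total = 0.0
    for item, reverse in ITEMS.items():
        r = responses[item]
        if not 1 <= r <= 5:
            raise ValueError(f"{item}: responses must be on a 1-5 scale")
        total += (6 - r) if reverse else r
    return total / len(ITEMS)

# Data: one respondent's raw answers (instrument output).
raw = {"satisfied_with_life": 4, "energy_most_days": 3, "often_feels_down": 2}

# Metric and interpretation: the number is about the construct as
# operationalized here, not about "well-being in itself".
print(f"well-being score: {wellbeing_score(raw):.2f} / 5")
```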
Each link in this chain can go wrong:
- Concept/construct problems: Concepts can be normative (“a good life,” “social capital”) but treated as descriptive. Different scholars mean different things by the same word. Constructs often bake in a theory: e.g. “intelligence is unitary” vs “intelligence is multiple.” The measure will follow that theory.
- Operationalization problems: Operationalization is a choice. You’re saying: “We can’t see the thing, so we will look at these things instead.” Choices can be narrow (only income for SES) or broad (income, education, occupation). Choices can be convenient rather than conceptually tight (we measure “learning” with multiple-choice tests because they’re easy to score). And crucially: different operationalizations can all be defensible — but they will give different numbers.
- Instrument/procedure problems: This is quite a problem in the social sciences. Very often in economics, this is conflated with modeling itself. You might hear something like "what was your identification strategy?", meaning how did you isolate causal relationships, not how you generated the data. This, however, is a problem of data collection and data reliability, having much to do with the sampling scheme (did we generate a truly representative subset of the population?).
- Data → metric problems: We often transform raw data (standardize, scale, weight, index); these transformations create meaning. Indexes like GDP combine heterogeneous stuff using formulas that look technical but are ultimately convention + theory.
- Interpretation problems: People forget the metric is about a construct, not the "thing in itself". They ignore error and uncertainty, they over-generalize from group-level properties to individuals, and the reverse.
- Content Validity: This asks "Did we include the right content for this construct?" Does the measure represent all the facets of the measurand it intends to cover? This is primarily about coverage; the instrument should span the entire conceptual category. This is hard to achieve in the social sciences. Social constructs are often broad and contested (e.g. "well-being," "social capital," "leadership"). If experts don't even agree on the domain, content validity can't be settled once and for all. You end up with "for this theory, this was good coverage," which is weaker than "this is the coverage."
- Criterion Validity: This asks “Does our measure relate in the right way to some external, meaningful criterion?” It refers to "the extent to which an operationalization of a construct, such as a test, relates to, or predicts, a theoretically related behavior or outcome — the criterion". For example, a job aptitude test should predict actual job performance. Often there is no gold-standard criterion. What’s the “true” criterion for intelligence, or for political trust, or for creativity? We use proxies (grades, supervisor ratings, future income), but those proxies are themselves social measurements with their own validity problems. So you get “a measure validated against another imperfect measure.”
- Construct Validity: This asks "Does this measure behave like the theory says the underlying construct should behave?" This actually refers to a broader umbrella of related validity questions. Does it correlate with things it should correlate with? (Convergent) Does it not correlate with things it shouldn't? (Discriminant) Does it fit into the nomological network — the web of other variables the theory posits? Generally speaking, construct validity refers to how well a set of indicators reflects or represents a concept that is not directly observable (latent). Are the numbers produced by your measurements actually mapping onto something in the real world? For example, IQ is not directly measurable.
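As a toy illustration of the convergent/discriminant logic, here is a short Python sketch with simulated data. The latent trait, the "new scale", the "established scale", and the unrelated variable are all invented; in real work the correlations would come from actual instruments administered to actual respondents.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulate a latent trait and three observed measures (all hypothetical):
latent_anxiety    = rng.normal(size=n)
new_scale         = latent_anxiety + rng.normal(scale=0.5, size=n)   # our new instrument
established_scale = latent_anxiety + rng.normal(scale=0.7, size=n)   # an existing instrument
unrelated_trait   = rng.normal(size=n)                               # conceptually unrelated

corr = np.corrcoef([new_scale, established_scale, unrelated_trait])
print(f"Convergent correlation (new vs established): {corr[0, 1]:.2f}")  # should be high
print(f"Discriminant correlation (new vs unrelated): {corr[0, 2]:.2f}")  # should be near zero
# High convergent plus low discriminant correlations are evidence (not proof)
# that the new scale behaves the way the construct's nomological network predicts.
```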
Construct (very stable) → Operational definition (community standard) → Instrument (calibrated) → Measurement (with known error)
Construct (contested) → Operationalization (chosen among several) → Instrument (partly human, context-sensitive) → Measurement (with unknown or changing error, plus assumptions) → Interpretation (theory-relative)
To wrap up this ridiculously long section (which could have been much longer because this is such a rich topic), I'd like to describe how I used to teach concepts of economic measurement. I used to work for a massive data provider in the finance industry. My job essentially was to "be the subject domain expert" for economic data and to "be the data engineer", which meant I had to understand how economic indices were constructed, generated, reported, and used across statistical agencies globally. I also used to teach fundamental economic concepts to non-economists who specialized in other data domains (like fixed income or commodities). These other domains are quite different from economics; they (like physics) are much more amenable to direct measurement. For example, a fixed income "measure" might just be a straightforward report from a bank about what interest rate they are charging on some financial instrument. No mystery there. Likewise, a commodities dataset reported from CBOE might simply be bids and asks for a particular trading day. As alluded to earlier, economic data is highly aggregated and entangled with sampling schemes and theory. You would think that, given the background of many of these people, they would have some familiarity with the construction of an economic index. Surprisingly, many would assume economic measurements are as straightforward as the data they specialize in. This is perhaps one thing many people misunderstand: the distinction between economists/statisticians and someone who majored in Business Administration. It becomes quite evident when you get into the nuances of data. At the start of these lectures, I would begin by saying something like: "We can measure the flow of water simply by putting a well calibrated and sensitive instrument into that river; this gives us a direct measure. Economic measurement is different from this. In many cases, we are often 'probing' for data. We construct surveys to extract information from subjects who can game the metric (Goodhart's Law) and who are aware they're being measured (think of the Hawthorne Effect; the respondent is not a passive transducer). In many cases, there isn't an observable 'thing' we are measuring. 'Instrument calibration' is completely different (and very possibly non-existent) in the social sciences."
In economics, the signal source is elicited, not naturally emitted. People answer questionnaires because you asked, firms report because you surveyed, households disclose because it's the census. That makes measurement reactive and context-dependent: wording, order, incentives, trust in the agency — all affect the signal. Even "administrative data" (tax records, unemployment claims) is behavior under rules — if the rules change, behavior and therefore the "measurement" changes. In economics, measures are highly theory dependent. "Unemployment," "inflation," "household," even "GDP" are statistical constructs defined by agencies: change the definition and you change the number, even if the world didn't change. The object is theory- and convention-dependent: you need a theory of labor-force attachment to define unemployment, and a theory of consumption to define a price index. Measurement error is radically different in the social sciences. Errors can come from comprehension, nonresponse, strategic answering, interviewer effects, mode effects, seasonal economic behavior, and policy changes; some of these errors are not i.i.d. and not stationary, and they change when the social context changes (which makes doing historical analyses incredibly difficult). Unlike in the physical sciences, repeating the measurement doesn't always reduce error (people may learn the test, or get bored). Now, I'm not arguing here that economists don't have methods to account for these issues; that would be foolish. I'm simply saying that the nature of the measurement process in economics is fundamentally different from that of a physical science, and therefore so is how you interpret the data. I think the biggest issue is that we are measuring a unit of analysis that is fundamentally reflexive, not passive; people know things and can infer from context, which can nudge their behavior ever so slightly, biasing the measurement. In physical science, we often tap into an existing, stable signal with a calibrated device. In social science and economics, we often have to coax a signal out of people and institutions using instruments made of questions, definitions, and incentives. That means the quality of the number depends much more on theory, on design, and on people's cooperation — not just on the sensitivity of the device.
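To illustrate the "change the definition, change the number" point, here is a small sketch on a made-up micro-sample: the same people produce different unemployment rates depending on whether "discouraged workers" are counted as unemployed. The status labels and counts are invented, and the two definitions are only loosely analogous to the headline versus broader measures that real statistical agencies publish.

```python
# Hypothetical micro-sample: each person's labor-market status (counts are made up).
population = (
    ["employed"] * 620
    + ["unemployed_searching"] * 40     # jobless and actively looking
    + ["discouraged"] * 25              # want work but stopped searching
    + ["not_in_labor_force"] * 315      # students, retirees, caregivers, etc.
)

def unemployment_rate(people, count_discouraged: bool) -> float:
    """Unemployment rate under two different statistical definitions of 'unemployed'."""
    unemployed = sum(s == "unemployed_searching" for s in people)
    if count_discouraged:
        unemployed += sum(s == "discouraged" for s in people)
    labor_force = sum(s == "employed" for s in people) + unemployed
    return 100 * unemployed / labor_force

# Same people, same behavior, two different constructs of "unemployment":
print(f"Headline definition:           {unemployment_rate(population, False):.1f}%")
print(f"Including discouraged workers: {unemployment_rate(population, True):.1f}%")
```

Nothing about the world changed between the two print statements; only the construct did.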
- Reproducibility in Science: A Metrology Perspective
- Measurement in metrology, psychology and social sciences: data generation traceability and numerical traceability as basic methodological principles applicable across sciences
Data, Statistics, and Uncertainty
Simply put, this is also a cornerstone of modern science. We will look at how scientists model the Data Generating Process, how data is collected, and how data is what binds science to reality. Let's first look at The Data-Generating Process and Scientific Inference.
Scientific Research and Big Data
As a corollary to the prior section, data-driven methods are also becoming quite prolific in many domains. The general public is grossly incompetent when it comes to understanding the nuances of collection, storage, governance, processing, transmission, provenance, and utility of data for inquiry. And yet, this has been a massive pillar in many of the advances of the past few decades. People generally do not have a clue why big data is so valuable, what can be done with it, and to whom. They are unaware that their digital footprint can be used to yield a fairly accurate picture of their beliefs and preferences, which can then be used for predictive analytics. They are also unaware of the value it provides to scientific researchers.
Scientific Representation, Models in Science, and Mathematical Modeling
How do scientists represent the target system they are studying? There are quite a range of scientific models in application across all domains of science.
Computer Simulation
The advent and proliferation of computing, programming languages, and software has undoubtedly had a significant impact on the way science is carried out. Simulation modeling is now quite indispensable within the toolkit of the modern scientist. I would go so far as to say that you simply cannot do modern science without the aid of a computer in one form or another. This is true for the physical and biological sciences as well as the social sciences; even non-traditional scientific disciplines like quantitative finance. In fact, most of my initial experience in this during grad school came through studying stochastic processes in financial engineering courses, in addition to Monte Carlo methods in Bayesian statistics and state space modeling in economics (as well as DSGE models). Since then, I've been interested in simulating social complexity via agent-based models. Most modeling cannot be done outside the context of computer simulation, which requires knowledge of algorithms, data structures, and computational complexity in order to implement your model. This is obviously a prolific aspect of science. So in this section, I want to describe the function of a simulation, how it augments the scientific toolkit, and various simulation methods, ones that I am more familiar with given my education and work experience.
When we simulate, we are simulating some process or system. This shows its generality: we can represent just about anything as a system or process, which means we can describe the properties, components, relationships, behavior, dynamics, and architecture of just about any system computationally, allowing us to reason about the real system under discussion in a controlled setting. A simulation is an imitation of the dynamics of a real-world process or system over time. This computational representation is studied, like non-computational models, for a variety of tasks including "what if" analysis, scenario analysis, intervention analysis, stress testing, modification, or pretty much anything else. The alternative approach to simulation is direct experimentation, which is infeasible in many situations. Simulations are often cheaper, faster, more easily replicated, safer, and more ethical. In many cases it's also just practically impossible to model a system mathematically with closed-form solutions; systems are often intractable and too complicated to solve. Approximations via simulation tend to be much more suitable for rapid experimentation. Like any model, a simulation is not assumption free; these assumptions are encapsulated in our formulation of the model. Simulation models allow us to modify our assumptions and test the implications.
These models are essential for engineering any system of significance. Consider the car you drive: how did the engineers determine its reliability? They used simulation methods to guide the design process. How do aircraft achieve such high reliability? Engineers use simulations to understand how the plane will operate under a variety of scenarios, and this influences their design decisions. How does an airline ensure timely arrival of planes and coordinate thousands of daily trips? They use simulation methods, among other methods like optimization. How did researchers identify a vaccine so quickly during the COVID pandemic? This is multifaceted, and involves simulation at every step. Supercomputers like those at Lawrence Livermore National Laboratory were used for rapid drug discovery. Identifying an effective drug involves discovering a molecular structure. You can imagine the combinatorial explosiveness of the search space; doing this purely by gathering information from experiments is simply not feasible when discoveries are needed quickly. Supercomputing allows you to simulate the effectiveness of a proposed structure, narrowing down the search space for researchers and allowing them to identify an effective structure more quickly by searching denser regions of the probability space. In addition, simulations were used for epidemic forecasting. Country-level microsimulations quantified how distancing, lockdowns, and closures could keep hospitals from being overwhelmed. Suppose you have normal capacity at a hospital, with limited ability to scale; massive stress on that system might overwhelm it, leading to excess deaths. Policymakers therefore want to know about these counterfactual situations and adjust their policies accordingly. Closures were also determined based on simulations. Airflow models revealed how respiratory particles move indoors, guiding ventilation, filtration, and layout choices, and indicating which facilities are likely locations for a massive outbreak, which subsequently impacts hospital stress. In each of these cases, simulations gave us usable answers while experiments and trials were still spinning up. These problems, like many complex problems, often involve systems of systems. Modeling and simulation allow researchers to understand how various systems interact; we can effectively integrate multiple models of systems to understand how they all interact. This is something that is very difficult without the use of computational resources. Supercomputers enabled rapid computational experimentation, which led to effective decision support. Put simply, computer simulation has a direct impact on the policy that affects your life.
Simulations can essentially be classified along three dimensions, each with two options. Think of it as a grid, where each cell represents a combination of these elements. There are stochastic vs deterministic simulations, static vs dynamic simulations, and discrete vs continuous simulations (in time and/or state). So you can have a discrete-time dynamic stochastic simulation, a stochastic continuous-time simulation, a deterministic dynamic discrete-event simulation, etc. Each of these dimensions represents a different aspect of the system under discussion. Stochastic systems have random components, dynamic systems are time dependent, and continuous systems are those whose state variables can take on a continuum of values. On the contrary, deterministic systems do not contain randomness, static representations do not depend on time, and discrete representations refer to systems whose states take on a finite or countable set of values. Each combination implies a different set of methods. It is entirely up to the researcher to decide how to model the system, but the decision is not arbitrary. Sometimes it is just easier to represent a system statically; this is often the case in economics. Introducing more moving parts makes the system harder to understand, so researchers must find a sweet spot between model complexity, granularity, and how well the model answers the question at hand. For example, in economics we have DSGE models that rely on the "representative agent". This is a sort of idealization about how people make decisions in an economy, imposed upon the entire collection of agents; the "representative agent" represents how everyone who is "rational" would make economic decisions. It assumes away any underlying network structure and heterogeneity. It idealizes the economic decision independent of other factors. This form allows us to have nice compact modeling formulations that are solvable or easy to reason about. But obviously, it does not have to be done this way. Agent-based models, by contrast, allow the modeler to encode heterogeneity. We can then run simulations "from the ground up" and use the results to reason about a real-world economy. This comes with its own set of costs and sacrifices: these models are harder to validate and make sense of. Decisions about how to represent a system therefore depend on these considerations.
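To make the taxonomy concrete, here is a minimal Python sketch of a dynamic, discrete-time model in its deterministic and stochastic variants. The growth rule, the parameter values, and the noise level are arbitrary choices made purely for illustration, not a model of any particular system.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(T=50, x0=100.0, growth=0.02, noise_sd=0.0):
    """Discrete-time dynamic model: x[t] = x[t-1] * (1 + growth) + shock.
    noise_sd = 0 gives the deterministic variant; noise_sd > 0 the stochastic one."""
    x = np.empty(T)
    x[0] = x0
    for t in range(1, T):
        shock = rng.normal(scale=noise_sd) if noise_sd > 0 else 0.0
        x[t] = x[t - 1] * (1 + growth) + shock
    return x

deterministic_path = simulate(noise_sd=0.0)   # identical output on every run
stochastic_path    = simulate(noise_sd=5.0)   # a different sample path for each run/seed

print("Deterministic final value:", round(deterministic_path[-1], 2))
print("Stochastic final value:   ", round(stochastic_path[-1], 2))
# A static model would drop the time index entirely; a continuous-time version
# would replace this difference equation with a differential equation.
```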
What are the elements of a simulation model? Well, it depends on the type of model and the domain you're studying. This taxonomy will be biased towards discrete-event simulations, but I think pretty much every simulation will implicitly refer to these elements. There are two kinds of objects in a simulation:
- Entities: individual elements of the system that are being simulated and whose behavior is being explicitly tracked. Each entity can be individually identified;
- Resources: also individual elements of the system but they are not modelled individually. They are treated as countable items whose behavior is not tracked.
These decisions are made by the modeler, and depend on the system under discussion. How do we organize the entities and resources?
- Attributes: properties of objects (that is, entities and resources). These are often used to control the behavior of the object. In a more comprehensive simulation, attributes might be the features that distinguish one entity from another.
- State: the collection of variables necessary to describe the system at any point in time. These fully characterize the system. For example, in a queuing system, the state might be the number of entities waiting and whether the server is busy.
- Queue: a collection of entities or resources ordered in some logical fashion. This refers to how entities are held and processed within the system.
- Event: an instant of time at which the state of the system changes. An event describes the possible ways the state can change and locates the time at which that change took place.
- Activity: a time period of specified length which is known when it begins (although its length may be random). This may be specified in terms of a random distribution.
- Delay: a duration of time of unspecified length, which is not known until it ends. This is not specified by the modeler ahead of time but is determined by the conditions of the system. Very often this is one of the desired outputs of a simulation (e.g. a customer's waiting time in a queue).
- Clock: variable representing simulated time.
- Processes: sequences of events with start and end rules, including decision logic, policies, and control rules.
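To make these elements concrete, below is a minimal, event-driven single-server queue written in plain Python (no simulation library), with arrival and service rates chosen purely for illustration. Customers are the entities, the single server is the resource, arrivals and service completions are the events, the future event list and clock drive the dynamics, and waiting time is the delay we collect as output.

```python
import heapq
import random
from collections import deque

random.seed(1)

ARRIVAL_RATE = 1.0    # customers per minute (assumed for illustration)
SERVICE_RATE = 1.25   # services per minute (assumed for illustration)
N_CUSTOMERS  = 10_000

events = []           # future event list: (event_time, sequence_no, event_type, customer_id)
seq = 0

def schedule(time, event_type, customer):
    """Push an event onto the future event list."""
    global seq
    heapq.heappush(events, (time, seq, event_type, customer))
    seq += 1

# Pre-schedule all arrivals (the entities entering the system).
t = 0.0
arrival_time = {}
for cid in range(N_CUSTOMERS):
    t += random.expovariate(ARRIVAL_RATE)         # activity: random interarrival time
    arrival_time[cid] = t
    schedule(t, "arrival", cid)

clock = 0.0            # simulation clock
server_busy = False    # state of the single resource
queue = deque()        # FIFO queue of waiting entities
waits = []             # delays: unknown until they end

while events:
    clock, _, event_type, cid = heapq.heappop(events)     # advance the clock to the next event
    if event_type == "arrival":
        if server_busy:
            queue.append(cid)                              # entity joins the queue
        else:
            server_busy = True
            waits.append(0.0)                              # served immediately: zero delay
            schedule(clock + random.expovariate(SERVICE_RATE), "departure", cid)
    else:  # "departure": service completed; start the next waiting entity or idle the resource
        if queue:
            nxt = queue.popleft()
            waits.append(clock - arrival_time[nxt])        # the delay ends now
            schedule(clock + random.expovariate(SERVICE_RATE), "departure", nxt)
        else:
            server_busy = False

print(f"Average wait in queue: {sum(waits) / len(waits):.2f} minutes")
```

As a rough sanity check, for these rates basic queueing theory (the M/M/1 model) predicts an average wait of about λ/(μ(μ − λ)) = 3.2 minutes, which the simulated average should approximate; this kind of comparison foreshadows the verification and validation steps below. With that vocabulary in place, a typical simulation study proceeds roughly through the following steps: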
1) Frame the decision and the system
2) Build the conceptual model:
- Entities and states: What things move or change (patients, packets, orders, molecules)? What states can they occupy (waiting, in service, recovered, failed)?
- Processes and rules: How do states change—by scheduled events (arrivals, service completions), by interactions (agent meetings), or by continuous flows (stock-and-flow)?
- Time treatment: Decide if you advance time by events (jump to next event; classic discrete-event), by fixed steps (∆t; good for differential equations or when events are dense), or hybrid (event-driven with sub-stepping for continuous parts).
- Resources and constraints: Servers, machines, beds, CPU cores, budgets. Specify capacities, calendars, and priorities.
- Randomness: Where uncertainty lives (interarrival times, service durations, agent behaviors, failure times) and how you’ll model it (distributions, correlations).
- Policies and controls: Schedules, routing rules, admission limits, pricing, triage—these become the levers for scenarios.
3) Input modeling: turn messy data into usable distributions
4) Choose a paradigm
- Discrete-event simulation (DES): Best for queuing, logistics, manufacturing, networks. You maintain an event calendar, a future event list, and process handlers that update state and schedule downstream events. You observe sharp changes at discrete times (arrivals, completions).
- Agent-based simulation (ABS): Best when micro-level behavior and interaction drive macro outcomes (epidemics, social systems, markets). Each agent carries rules; the system emerges from interactions. Often run with small time steps or event hooks.
- System dynamics (SD): Best for feedback-heavy, aggregate systems (stocks, flows, delays). You write coupled differential or difference equations and integrate in time.
- Monte Carlo (MC): Best for pure uncertainty propagation: sample inputs, evaluate a deterministic model, aggregate outputs. Often baked into other paradigms.
5) Implement a Minimal Version
6) Verification: prove you built the model you meant to build
7) Validation: prove the model is a good stand-in for reality
8) Experiment design: plan runs that answer the question
9) Randomness, variance, and confidence (see the sketch after this list)
10) Sensitivity, uncertainty, and robustness
11) Prepare results for presentation
12) Reproducibility and governance
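Referring back to step 9, here is a minimal sketch of what "replications plus a confidence interval" looks like in practice. The model is the same single-server queue as before, condensed via the Lindley recursion instead of an explicit event list; the parameters, the number of customers, and the number of replications are arbitrary choices for illustration. The point is only that a stochastic simulation should report uncertainty, not a single number.

```python
import random
import statistics

def one_replication(seed, arrival_rate=1.0, service_rate=1.25, n_customers=2_000):
    """One independent replication of the single-server queue, returning the mean wait."""
    rng = random.Random(seed)
    waits = [0.0]                                      # the first customer never waits
    for _ in range(n_customers - 1):
        service      = rng.expovariate(service_rate)   # previous customer's service time
        interarrival = rng.expovariate(arrival_rate)   # gap until the next arrival
        # Lindley recursion: W[n+1] = max(0, W[n] + S[n] - A[n+1])
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return sum(waits) / len(waits)

# Steps 8-9 in miniature: independent replications, then an interval estimate, not a point.
results = [one_replication(seed) for seed in range(30)]
mean = statistics.mean(results)
half_width = 1.96 * statistics.stdev(results) / len(results) ** 0.5   # approx. 95% CI (normal approximation)

print(f"Mean wait: {mean:.2f} minutes, 95% CI roughly +/- {half_width:.2f} over {len(results)} replications")
```

The same pattern (independent seeds, a summary statistic per run, an interval estimate across runs) applies regardless of paradigm: discrete-event, agent-based, or plain Monte Carlo.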
Mechanisms in Science
The act of identifying mechanistic cause and effect relations.
Scientific Explanation
What does it mean when someone says "Science has explained something"?
Scientific Reduction
What is the role of reduction in explanation? When larger systems are explained in terms of something more fundamental, what exactly are we accomplishing?
Scientific Objectivity
Whether or not the practice of science can be truly objective is not the purpose of this section. Rather, I'd like to discuss various methods it uses to maintain alignment with the standard, and how built in mechanisms self correct when deviations from the ideal occur.
Scientific Discovery
What constitutes a scientific discovery? With the constant barrage of "new discoveries" flooding the media, how do we make sense of what is going on?
Scientism
Can someone dogmatically adhere to science at the expense of other methods of inquiry? We will look at Six Signs of Scientism to answer this question. Susan Haack’s central objective in Six Signs of Scientism is to demarcate scientism from legitimate science; not in the naïve sense of drawing a boundary around science proper (a move she explicitly critiques as itself scientistic), but rather to expose a cluster of intellectual temptations in contemporary culture that inflate the authority, epistemic reach, or rhetorical prestige of science beyond its proper bounds. Early on, she defines scientism as “a kind of over-enthusiastic and uncritically deferential attitude toward science, an inability to see or an unwillingness to acknowledge its fallibility, its limitations, and its potential dangers” (Haack, p. 76). Her task is not to attack science, she explicitly defends its value, but to identify when admiration becomes uncritical worship. She warns that scientism is not a single thesis but a family of symptoms — subtle, culturally normalized behaviors and linguistic patterns. Hence: six “signs.” Each sign, she notes, is not definitive alone, but diagnostic when seen together.
Sign 1: Honorific use of "Science"
Sign 2: Using Scientific Trappings Decoratively
Sign 3: Obsession with Demarcation
Sign 4: The Quest for "The Scientific Method"
- There is no one “scientific method” used by all and only scientists (p. 89).
- This does not make scientific discovery miraculous; it makes it continuous with ordinary empirical inquiry, but amplified, refined, and disciplined by the distinctive helps science has developed (pp. 88–89).
Sign 5: Looking to Science for Answers Beyond its Scope
- Policy masquerading as science. Science can tell us the likely consequences of damming a river, changing tax codes, or modifying school governance; it cannot by itself adjudicate whether the ends are desirable, or what trade-offs are morally justifiable (p. 90). When researchers’ ethical/political convictions tilt their evidential judgment, or when normative conclusions are presented “as if they were scientific results,” we have scientism (p. 90).
- Empirical surveys as ethical verdicts. Haack analyzes a Lancet article advocating the “complete lives” principle for allocating scarce medical resources — giving priority to adolescents/young adults — and notes the authors cite surveys of what “most people think” as support (pp. 90–91). She underscores the category mistake: “most people think x is morally best” ≠ “x is morally best” (p. 91). Substituting measured preference for justification is a hallmark of scientism.
Sign 6: Denigrating the Non-Scientific
- Within inquiry: It is scientistic to assume empirical legal studies are inherently superior to interpretive legal scholarship (p. 92). Different questions demand different cognitive virtues and methods.
- Beyond inquiry: It is scientistic to assume that art, literature, music, craftsmanship, and tradition have lesser value simply because they are not avenues of empirical discovery (pp. 92–93).
Summarizing Scientism
Conclusion: The Richard Feynman Lectures
I've always found Feynman to be an excellent science communicator. So to wrap this up, let's have a look at his famous lecture on the scientific method:
Richard Feynman on Scientific Method (1964) | After noise reduction
Now, I'm going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it.
Then we-- well, don't laugh. That's really true. Then we compute the consequences of the guess to see what-- if this is right, if this law that we guessed is right, we see what it would imply, and then we compare those computation results to nature. Or we say, compare to experiment or experience. Compare it directly with observation to see if it works.
If it disagrees with experiment, it's wrong. And that simple statement is the key to science. It doesn't make a difference how beautiful your guess is. It doesn't make a difference how smart you are, who made the guess, or what his name is, if it disagrees with experiment, it's wrong. That's all there is to it.
It's therefore not unscientific to take a guess, although many people who are not in science think it is. For instance, I had a conversation about flying saucers some years ago with laymen.
Because I'm scientific. I know all about flying saucers. So I said, I don't think there are flying saucers. So the other-- my antagonist said, is it impossible that there are flying saucers? Can you prove that it's impossible? I said, no, I can't prove it's impossible. It's just very unlikely.
That, they say, you are very unscientific. If you can't prove an impossible, then why-- how can you say it's likely, that it's unlikely? Well, that's the way-- that it is scientific. It is scientific only to say what's more likely and less likely, and not to be proving all the time possible and impossible.
To define what I mean, I finally said to them, listen, I mean that from my knowledge of the world that I see around me, I think that it is much more likely that the reports of flying saucers are the result of the known irrational characteristics of terrestrial intelligence, rather than the unknown rational effort of extraterrestrial intelligence.
It's just more likely, that's all. And it's a good guess. And we always try to guess the most likely explanation, keeping in the back of the mind the fact that if it doesn't work, then we must discuss the other possibilities.
There was, for instance, for a while a phenomenon we called superconductivity. It still is a phenomenon, which is that metals conducts electricity without resistance at low temperatures. And it was not at first obvious that this was a consequence of the known laws with these particles. But it turns out that it has been thought through carefully enough, and it's seen, in fact, to be a consequence of known laws.
There are other phenomena, such as extrasensory perception, which cannot be explained by this known knowledge of physics here. And it is interesting, however, that that phenomenon has not been well established, and--
--that we cannot guarantee that it's there. So if it could be demonstrated, of course, that would prove that the physics is incomplete. And therefore, it's extremely interesting to physicists whether it's right or wrong. And many, many experiments exist which show it doesn't work.
The same goes for astrological influences. If that were true, that the stars could affect the day that it was good to go to the dentist, then-- it's in America we have that kind of astrology-- then it would be wrong. The physics theory would be wrong, because there's no mechanism understandable in principle from these things that would make it go. And that's the reason that there's some skepticism among scientists with regard to those ideas.
Now, you see, of course, that with this method, we can disprove any definite theory. We have a definite theory, a real guess from which you can really compute consequences which could be compared to experiment, and in principle, we can get rid of any theory. You can always prove any definite theory wrong. Notice, however, we never prove it right.
Suppose that you invent a good guess, calculate the consequences, and discover every consequence that you calculate agrees with the experiment. Your theory is then right? No, it is simply not proved wrong. Because in the future, there could be a wider range of experiments, you compute a wider range of consequences, and you may discover, then, that the thing is wrong.
That's why laws like Newton's laws for the motion of planets lasts such a long time. He guessed the law of gravitation, calculated all kinds of consequences for the solar system and so on, compared them to experiment, and it took several hundred years before the slight error of the motion of Mercury was developed.
During all that time, the theory had been failed to be proved wrong, and could be taken to be temporarily right. But it can never be proved right, because tomorrow's experiment may succeed in proving what you thought was right wrong. So we never are right. We can only be sure we're wrong. However, it's rather remarkable that we can last so long. I mean, have some idea which will last so long.
I must also point out to you that you cannot prove a vague theory wrong. If the guess that you make is poorly expressed and rather vague, and the method that you used for figuring out the consequences is rather a little vague-- you're not sure. You say, I think everything is because it's all due to [INAUDIBLE], and [INAUDIBLE] do this and that, more or less. So I can sort of explain how this works. Then you see that that theory is good, because it can't be proved wrong.
If the process of computing the consequences is indefinite, then with a little skill, any experimental result can be made to look like-- or an expected consequence. You're probably familiar with that in other fields. For example, A hates his mother. The reason is, of course, because she didn't caress him or love him enough when he was a child. Actually, if you investigate, you find out that as a matter of fact, she did love him very much, and everything was all right. Well, then, it's because she was overindulgent when he was [INAUDIBLE]. So by having a vague theory--
--it's possible to get either result.
Now, wait. Now, the cure for this one is the following. It would be possible to say, if it were possible to state ahead of time how much love is not enough, and how much love is overindulgent exactly, and then there would be a perfectly legitimate theory against which you can make tests. It is usually said when this is pointed out how much love is and so on, oh, you're dealing with psychological matters, and things can't be defined so precisely. Yes, but then you can't claim to know anything about it.
Now, I want to concentrate for now on-- because I'm a theoretical physicist, and more delighted with this end of the problem-- as to what goes-- how do you make the guesses? Now, it's strictly, as I said before, not of any importance where the guess comes from. It's only important that it should agree with experiment, and that it should be as definite as possible.
But, you say, that is very simple. We set up a machine-- a great computing machine-- which has a random wheel in it that makes a succession of guesses. And each time it guesses a hypotheses about how nature should work, computes immediately the consequences, and makes a comparison to a list of experimental results it has at the other end. In other words, guessing is a dumb man's job.
Actually, it's quite the opposite, and I will try to explain why.
The first problem is how to start. You see how I start? I'll start with all the known principles. But the principles that are all known are inconsistent with each other, so something has to be removed. So we get a lot of letters from people. We're always getting letters from people who are insisting that we ought to make holes in our guesses as follows. You see, you make a hole to make room for a new guess.
Somebody says, do you know, people always say space is continuous. But how do you know when you get to a small enough dimension that there really are enough points in between? It isn't just a lot of dots separated by a little distance.
Or they say, you know those quantum mechanical amplitudes you told me about? They're so complicated and absurd. What makes you think those are right? Maybe they aren't right. I get a lot of letters with such content.
But I must say that such remarks are perfectly obvious and are perfectly clear to anybody who is working on this problem, and it doesn't do any good to point this out. The problem is not what might be wrong, but what might be substituted precisely in place of it. If you say anything precise, for example, in the case of a continuous space. Suppose the precise composition is that space really consists of a series of dots only, and the space between them doesn't mean anything, and the dots are in a cubic array, then we can prove that immediately is wrong. That doesn't work.
You see, the problem is not to make-- to change, or to say something might be wrong, but to replace it by something. And that is not so easy. As soon as any real definite idea is substituted, it becomes almost immediately apparent that it doesn't work.
Secondly, there's an infinite number of possibilities of these simple types. It's something like this. You're sitting, working very hard. You work for a long time trying to open a safe. And some Joe comes along who hasn't-- doesn't know anything about what you're doing or anything, except that you're trying to open a safe.
He says, you know, why don't you try the combination 10, 20, 30? Because you're busy. You tried a lot of things. Maybe you already tried 10, 20, 30. Maybe you know that the middle number is already 32 and not 20. Maybe you know that as a matter of fact, this is a five-digit combination. There we go.
So these letters don't do any good, and so please don't send me any letters trying to tell me how the thing is going to work. I read them to make sure--
--that I haven't already thought of that. But it takes too long to answer them, because they're usually in the class, try 10, 20, 30.