Causal Inquiry Dialogue

While writing the "Theory of Rationality" post, I revisited the pragma-dialectical rules for critical discussion and Walton's extended dialogue types, and suddenly got an idea: Walton's extensions could be extended further into sub-types. Why not extend this framework into a dialogue exclusively concerned with the nuances and features of causal inquiry? This interests me in particular because I'm quite familiar with, and interested in, the empirical methods used to identify causal effects in data. From my time in economics graduate school, I distinctly recall most applied microeconometric research reducing to debates about whether X is a confounder or a proper instrument, or whether there is selection bias. That sure sounds like a sub-type of Walton's Inquiry Dialogue. Furthermore, Walton already has argumentation schemes for causal reasoning, so perhaps we can simply extend the existing work. Below I first review those schemes and then propose a dialogue sub-type.

1) Argument from Cause to Effect (C→E)

Canonical form (presumptive):
Generally, if A happens, B (likely) happens.
A happens (or will).
So, B will (likely) happen.

What to look for (evaluation criteria):

  • The strength and applicability of the causal regularity (“if A then B”).
  • Evidence that A really holds in this case.
  • Possible interveners/defeaters that could block B.
  • Plausible mechanism and proper temporal order (cause precedes effect).

Core Critical Questions (CQs):

  • CQ1 – Strength: How well-supported is the A→B generalization here?
  • CQ2 – Fact: Is the evidence that A occurs (here/now) good enough?
  • CQ3 – Interference: Are there other causal factors that would prevent B despite A?
  • CQ4 – Alternatives/base-rate: Could B occur anyway without A (or mainly from some other cause)? 
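
Since these schemes and their critical questions will later be operationalized as dialogue moves, it may help to see one encoded as data. Here is a minimal Python sketch (the class and field names are my own, not Walton's): the argument stands presumptively, and raising a critical question that goes unanswered suspends that presumption.

```python
from dataclasses import dataclass, field

@dataclass
class CausalScheme:
    """A defeasible argumentation scheme with attached critical questions."""
    name: str
    premises: list[str]
    conclusion: str
    critical_questions: dict[str, str]        # CQ id -> question text
    open_challenges: set[str] = field(default_factory=set)

    def challenge(self, cq_id: str) -> None:
        self.open_challenges.add(cq_id)       # raising a CQ suspends the presumption

    def answer(self, cq_id: str) -> None:
        self.open_challenges.discard(cq_id)   # answering it restores the presumption

    def stands(self) -> bool:
        return not self.open_challenges

c_to_e = CausalScheme(
    name="Argument from Cause to Effect",
    premises=["Generally, if A happens, B (likely) happens.", "A happens."],
    conclusion="B will (likely) happen.",
    critical_questions={
        "CQ1": "How well-supported is the A->B generalization here?",
        "CQ2": "Is the evidence that A occurs good enough?",
        "CQ3": "Are there other factors that would prevent B despite A?",
        "CQ4": "Could B occur anyway without A?",
    },
)
c_to_e.challenge("CQ3")
print(c_to_e.stands())   # False until CQ3 is answered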

2) Argument from Effect to Cause (E→C)

Canonical form (abductive/IBE-style):
Generally, if A happens, B (likely) happens.
B is observed.
So, (presumably) A happened (as the best explanation). 

Walton treats E→C as defeasible and best understood in an abductive, dialogue-driven setting: you hypothesize the cause that would explain the effect, then test it against rivals and further evidence. 

What to look for (evaluation criteria):

  • Explanatory adequacy of A for B (fit, coherence with background knowledge).
  • Comparative superiority over rival explanations.
  • Search thoroughness and stage of inquiry (is it too early to commit?).
  • Predictive leverage/testability (does A lead to further checkable consequences?).

Core Critical Questions (CQs) used here (IBE-style):

  • CQ1 – Adequacy: Does A genuinely explain B (independently of rivals)?
  • CQ2 – Bestness: Is A a better explanation than the alternatives considered so far?
  • CQ3 – Inquiry status: Have we looked hard enough (could more inquiry flip our judgment)?
  • CQ4 – Prudence: Should we withhold conclusion and investigate further before accepting A? 
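
CQ2 and CQ4 are inherently comparative, and a toy calculation shows why an explanation that fits well can still lose: posterior odds weigh explanatory fit (the likelihood) against base rates (the priors). All numbers below are hypothetical.

```python
# Toy abductive comparison: which cause better explains the observed effect B?
p_b_given_a1, prior_a1 = 0.8, 0.1   # A1 explains B well but is rare
p_b_given_a2, prior_a2 = 0.3, 0.4   # rival A2 fits worse but is far more common

posterior_odds = (p_b_given_a1 * prior_a1) / (p_b_given_a2 * prior_a2)
print(f"posterior odds A1:A2 = {posterior_odds:.2f}")   # ~0.67: the rival wins
```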

3) Argument from Correlation to Cause (Corr→C)

Canonical move (heuristic):
A and B are correlated.
Therefore, (tentatively) A causes B (or B causes A, or there’s a causal link).

This is a short, defeasible bridge from association to a causal hypothesis that then needs bolstering.

What to look for (evaluation criteria):

  • Reality/robustness of the correlation (replication, effect size, sampling).
  • Directionality and temporal precedence (does putative cause precede effect?).
  • Non-spuriousness (rule out third variables/common causes; avoid mere coincidence).
  • Mechanistic plausibility (is there a credible pathway from A to B?). 

Standard Critical Questions (CQs):

  • CQ1 – Reality: Is there really a correlation between A and B?
  • CQ2 – Coincidence: Is the correlation more than just coincidence?
  • CQ3 – Third factor: Could some third factor C be causing both A and B?
  • (Extended lists add temporal order and mechanism checks, but these three are the widely used core.) 

Walton explicitly brings the Bradford Hill "considerations" into his treatment of arguments from correlation to causation, treating them as a bank of critical questions for evaluating presumptive causal claims (he lists Hill's items in the correlation→causation chapter of Argument Evaluation and Evidence):

  1. Temporality – the putative cause precedes the effect.
    • Limits: necessary but not sufficient; onset lags and feedback loops complicate timing.
    • Extend: represent timing explicitly in a DAG or study protocol (avoid immortal-time bias via target-trial emulation).

  2. Strength (effect size) – larger associations are harder to dismiss as bias.
    • Limits: confounding can inflate/deflate effect sizes; small true effects exist.
    • Extend: use bias analysis/sensitivity analyses; triangulate across designs. 

  3. Consistency (reproducibility) – seen across studies, settings, methods.
    • Limits: heterogeneity can reflect real effect-modification, not error.
    • Extend: plan for heterogeneity (subgroup DAGs), and weigh differences across designs in triangulation. 

  4. Specificity – a cause leads to a single effect (or a very specific pattern).
    • Limits: rarely holds for multifactorial diseases; historically overemphasized.
    • Extend: replace with pattern specificity: look for distinctive constellations predicted by mechanisms. 

  5. Biological gradient (dose–response) – more exposure → more effect.
    • Limits: thresholds, U-shapes, saturation; exposure misclassification.
    • Extend: model nonlinearity; use negative controls and quantitative bias analysis.

  6. Plausibility – is there a credible mechanism?
    • Limits: theory-laden and time-bound (today’s “implausible” can be tomorrow’s accepted biology).
    • Extend: pair difference-making evidence with mechanistic evidence (Russo–Williamson thesis). 

  7. Coherence – fits with what else we know (lab, natural history, theory).
    • Limits: “coherence” can be vague; risk of confirmation bias.
    • Extend: make coherence testable by encoding background knowledge in DAGs and checking for implied conditional independencies (a code sketch follows this list).

  8. Experiment – manipulation changes outcomes (e.g., RCTs, natural experiments).
    • Limits: often infeasible/unethical; trials may lack external validity.
    • Extend: target-trial emulation with observational data; exploit quasi-experiments. 

  9. Analogy – by similarity to known causal relations.
    • Limits: weakest and most subjective; easy to cherry-pick analogues.
    • Extend: specify which similarities matter and test their implications (structured analogical mapping). 
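
To make item 7's extension concrete: a DAG implies specific conditional independencies, and each one is a checkable prediction. Here is a sketch using networkx (assumes networkx >= 3.3 for is_d_separator; the DAG is a hypothetical version of the air-pollution example used later in this post):

```python
import networkx as nx  # assumes networkx >= 3.3 for is_d_separator

# Hypothetical DAG: weather confounds PM2.5 and ER visits; holidays shift ER visits only
g = nx.DiGraph([
    ("weather", "pm25"), ("weather", "er_visits"),
    ("pm25", "er_visits"), ("holiday", "er_visits"),
])

# Implied by the DAG: holiday is marginally independent of pm25...
print(nx.is_d_separator(g, {"holiday"}, {"pm25"}, set()))          # True
# ...but NOT independent after conditioning on the collider er_visits
print(nx.is_d_separator(g, {"holiday"}, {"pm25"}, {"er_visits"}))  # False
# Every True above is a testable conditional independence to check against data.
```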

Walton’s causal argumentation schemes are defeasible and come with critical questions. Bradford Hill’s considerations plug in naturally as such questions when someone argues from correlation to cause:
  • “Does cause precede effect?” (Temporality)
  • “How strong/consistent is the association across settings?” (Strength, Consistency)
  • “Could a third factor explain both?” (Strength/Consistency via confounding)
  • “Is there a dose–response? A plausible mechanism? Coherence with other evidence?” (Gradient, Plausibility, Coherence)
  • “Any experimental or quasi-experimental confirmation? Are there relevant analogies?” (Experiment, Analogy)
  • Walton’s chapter explicitly points students to Hill (1965) when assessing correlation→causation moves. 

Practical limitations (why Hill alone isn’t enough)

  • Confounding, selection, and collider bias can satisfy several considerations spuriously; DAGs help diagnose these risks (a small simulation below illustrates the collider case).
  • Over-formalizing the list as a pass/fail test misses its heuristic intent. Modern reviews emphasize using Hill as guidance within a broader causal framework.
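
To illustrate the first point, a small numpy simulation: two causes that are truly independent become correlated once we select on a common effect, so an analyst who stratifies on a collider can "find" an association that then spuriously satisfies strength and consistency. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(size=n)                   # cause 1, independent of cause 2 by construction
b = rng.normal(size=n)                   # cause 2
collider = a + b + rng.normal(size=n)    # common effect of both

print(np.corrcoef(a, b)[0, 1])           # ~0.00: no marginal association
selected = collider > 1.0                # selecting/stratifying on the collider
print(np.corrcoef(a[selected], b[selected])[0, 1])   # clearly negative: spurious
```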

Bradford Hill's considerations can also be modernized with sensible extensions that incorporate subsequent advances in the causal-inference literature:
  1. Make the causal question explicit and design to answer it: Use the target-trial emulation playbook for observational studies (eligibility, treatment strategies, time zero, outcomes, estimand).
  2. Represent assumptions with DAGs: Encode background knowledge, identify confounding/selection structures, and derive testable implications that operationalize “coherence.” 
  3. Triangulate across methods and biases: Combine evidence differing in key sources of bias (e.g., MR, natural experiments, cohorts, case-crossovers) to strengthen inference.
  4. Blend mechanisms + difference-making: Use the Russo–Williamson insight: require both probabilistic/difference-making evidence and mechanistic support, instead of treating “plausibility” as a soft afterthought.
  5. Bias-aware quantification: Run routine sensitivity analyses for unmeasured confounding and measurement error alongside effect estimates; this refines “strength” and “consistency” (see the E-value sketch after this list). 
  6. Keep Hill as critical questions: Treat each item as a Walton-style CQ attached to the correlation→cause scheme; use them to structure inquiry rather than to “grade” causality. (This is exactly how Walton deploys them pedagogically.) 
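
To make item 5 concrete, the E-value of VanderWeele & Ding (2017) answers: how strong would an unmeasured confounder have to be, on the risk-ratio scale, with both exposure and outcome, to fully explain away the observed association? A minimal sketch (the example inputs are hypothetical):

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (VanderWeele & Ding 2017)."""
    if rr < 1:                          # protective effects: invert first
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

print(round(e_value(1.8), 2))   # 3.0: a confounder needs RR >= 3 with both
                                # exposure and outcome to explain this away
print(round(e_value(1.1), 2))   # 1.43: a weak association is much more fragile
```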

Now, here is a proposed "Causal Inquiry Dialogue", modeled after Walton's Inquiry Dialogue and intended to capture the dynamics of causal inquiry across disciplines:

Causal Inquiry Dialogue (CID)

1) Purpose (telos) & product

  • Goal: arrive at a warranted causal claim (or non-claim) about a well-specified effect of a well-specified cause, under explicit assumptions and with stated uncertainty.
  • Product: a conclusion tagged with (a) the estimand (what causal quantity), (b) scope (population, time, context), (c) assumptions & design, (d) robustness (sensitivity, rival explanations), and (e) status (accept/reject/suspend).
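
One way to keep that product honest is to treat it as a structured record rather than prose. Here is a minimal sketch of such a "warrant ledger" in Python (field names are mine, keyed to (a)-(e) above; the example values echo the running example developed later in this post):

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    ACCEPT = "accept (provisional)"
    REJECT = "reject"
    SUSPEND = "suspend"

@dataclass
class WarrantLedger:
    estimand: str            # (a) what causal quantity
    scope: str               # (b) population, time, context
    assumptions: list[str]   # (c) assumptions & design
    robustness: list[str]    # (d) sensitivity results, rivals addressed
    status: Status           # (e) accept / reject / suspend
    next_tests: list[str] = field(default_factory=list)

ledger = WarrantLedger(
    estimand="ATT of traffic restrictions on ER visits within 7 days",
    scope="Adults 18-65, County Z, 2016-2020",
    assumptions=["parallel trends", "no cross-border migration"],
    robustness=["pre-trends pass", "placebo outcomes null"],
    status=Status.ACCEPT,
    next_tests=["IV using refinery outages"],
)
```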

2) Initial situation

  • Anomaly, association, or policy problem triggers the dialogue.
  • Participants share incomplete knowledge and agree to rules that prioritize discovery over winning.

3) Roles (can be distributed across 2+ parties)

  • Proponent: advances a causal hypothesis and identification strategy.
  • Challenger: raises targeted doubts, counter-hypotheses, and tests.
  • Methodologist (optional role, often shared): vets design/assumptions and proposes diagnostics.
  • Mechanism expert (optional): articulates or tests mechanistic pathways.

4) Key commitments set in the Opening stage

  • Question clarity: state the causal question in counterfactual terms and name the estimand (e.g., ATE/CATE, effect of treatment on the treated, etc.).
  • Target trial/target system: define time-zero, eligibility, treatment strategies, outcomes, follow-up, and causal contrast.
  • Model sketch: present an initial DAG or structured mechanism showing confounders, mediators, colliders.
  • Burden of proof: proponent carries the forward burden (positive case); challengers carry a specific defeater burden (to point to concrete alternative mechanisms, biases, or tests).

5) Stages (specializing pragma-dialectics)

Stage I — Problem Formulation

  • Moves: propose_hypothesis(H), state_estimand(ψ), define_population(P), define_time_zero(T0).
  • Rule CID-1 (Clarity): no arguments about “causation” without a named estimand, population, and time-zero.

Stage II — Modeling & Identification

  • Moves: propose_DAG(G), justify_identification(I) (randomization, IV, DiD, RD, g-methods, etc.), list_assumptions(A).
  • Rule CID-2 (Identification): proponent must present a legible identification strategy and its assumptions; no hidden estimand drift.
  • Rule CID-3 (Comparative space): acknowledge plausible rival hypotheses and pathways.

Stage III — Evidence & Testing

  • Moves: propose_design(D), present_evidence(E), run_diagnostic(QC) (placebo/negative controls, balance checks, pre-trend checks, falsification tests, sensitivity/E-values).
  • Rule CID-4 (Relevance & Quality): evidence must be probative for the stated estimand under G and A (no p-hacking, data dredging, or post-hoc model switching without disclosure).
  • Rule CID-5 (Robustness): report sensitivity to key unverifiable assumptions (unmeasured confounding, measurement error, model misspecification).

Stage IV — Mechanisms & Coherence

  • Moves: propose_mechanism(M), predict_signature(S) (dose-response, lags, subgroup patterns), check_coherence(K) with background knowledge.
  • Rule CID-6 (Mechanistic–probabilistic pairing): difference-making evidence should be paired with mechanistic articulation (even if partial) or the conclusion stays provisional.

Stage V — Comparative Appraisal & Rival Explanations

  • Moves: table_rivals(R1..Rn), compare_fit(C), choose_best_explanation(BE).
  • Rule CID-7 (Best-explanation discipline): address live rivals; “victory by default” is disallowed.

Stage VI — Conclusion & Reporting

  • Moves: accept / reject / suspend, qualify_scope, state_uncertainty, declare_limits & next tests.
  • Rule CID-8 (Transparency & Scope): publish the warrant ledger (what supports the claim) and external validity limits.

6) Permitted dialogue moves (locutions) & their commitments

  • assert_association(A,B) → commit to reproducible measurement and design details.
  • propose_DAG(G) → commit to edges/omissions as working assumptions open to targeted challenge.
  • challenge_edge(X→Y) → must specify the basis (confounder, collider, measurement, selection).
  • propose_test(T) / request_sensitivity(S) → the other side must either perform it (if feasible) or justify non-feasibility.
  • propose_trial_or_quasi(E) → moves the dialogue toward experimentation when observational inference stalls.
  • assert_extrapolation(EXT) → requires a bridging argument for transportability (similarity of mechanisms/distributions).
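
These locutions and their commitment rules amount to a protocol, so they can be sketched as a tiny commitment-store machine. A minimal Python sketch (the rule strings paraphrase the list above; everything else is hypothetical), enforcing the requirement that challenge_edge name its basis:

```python
from dataclasses import dataclass, field

# Commitment each locution incurs, paraphrasing the list above
COMMITMENTS = {
    "assert_association": "reproducible measurement and design details",
    "propose_DAG": "edges/omissions as working assumptions, open to challenge",
    "challenge_edge": "a named basis: confounder, collider, measurement, selection",
    "propose_test": "other side performs it if feasible, or justifies non-feasibility",
    "propose_trial_or_quasi": "moving the dialogue toward experimentation",
    "assert_extrapolation": "a bridging argument for transportability",
}

@dataclass
class Dialogue:
    commitments: dict[str, list[str]] = field(default_factory=dict)

    def move(self, speaker: str, locution: str, content: str, basis: str | None = None):
        if locution == "challenge_edge" and basis is None:
            raise ValueError("challenge_edge must specify its basis")
        entry = f"{locution}({content}) -> committed to {COMMITMENTS[locution]}"
        self.commitments.setdefault(speaker, []).append(entry)

d = Dialogue()
d.move("proponent", "propose_DAG", "weather -> pm25 -> er_visits")
d.move("challenger", "challenge_edge", "pm25 -> er_visits", basis="confounder: mobility")
print(*d.commitments["challenger"], sep="\n")
```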

7) Evaluation grid: Critical Questions specialized for causality

Attach these CQs at the points where they’re most discriminating (they operationalize Bradford Hill-style concerns without score-keeping):

  1. Temporality CQ: Is time-zero defined, and does exposure precede effect with plausible lags?
  2. Identification CQ: Under G and A, is ψ identified? What assumptions are unverifiable?
  3. Confounding CQ: Which unmeasured factors could generate the association, and how strong must they be (sensitivity/E-value)?
  4. Design CQ: Do design diagnostics pass (balance, parallel trends, bandwidth/placebo tests, instrument strength/exclusion)?
  5. Robustness CQ: Do results survive alternative specifications, samples, and measures?
  6. Mechanism CQ: Is there an articulated pathway predicting distinctive signatures (dose–response, subgroup, temporal pattern) that we observe?
  7. Coherence CQ: Are implications consistent with other data/modalities (lab, quasi-experiments, natural history)?
  8. Rivals CQ: What are the best rival explanations and how do they fare on fit and testability?
  9. Transportability CQ: What changes if we move population/context; which assumptions support extrapolation? (A toy reweighting example follows this list.)
  10. Decision CQ (if policy-relevant): Given current warrant and uncertainty, what are the consequences of acting vs. waiting?
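
The Transportability CQ can sometimes be answered quantitatively: if the effect varies across an effect-modifier whose distribution differs between source and target, reweight the subgroup effects to the target's distribution. A toy post-stratification sketch (all numbers hypothetical):

```python
# Hypothetical subgroup (CATE) estimates from the source study
cate = {"high_exposure": -0.10, "low_exposure": -0.02}

source_shares = {"high_exposure": 0.3, "low_exposure": 0.7}
target_shares = {"high_exposure": 0.6, "low_exposure": 0.4}

ate_source = sum(cate[g] * source_shares[g] for g in cate)
ate_target = sum(cate[g] * target_shares[g] for g in cate)   # transported estimate

print(f"source ATE {ate_source:+.3f} -> transported ATE {ate_target:+.3f}")
# The bridge assumption doing the work: the subgroup effects themselves
# must carry over to the target population.
```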

8) Typical derailments (fallacies-as-violations)

  • Post hoc ergo propter hoc (violates CID-1 & Temporality CQ).
  • Confounding by indication / selection bias (violates CID-2/3).
  • Collider conditioning (violates CID-2 by corrupting identification).
  • P-hacking / garden of forking paths (violates CID-4).
  • Mechanism hand-waving (violates CID-6; claim remains provisional).
  • Rival neglect (violates CID-7).
  • Target drift (quietly changing estimand/design mid-stream; violates CID-2 & CID-8).
  • Illicit dialogue shift (sliding from inquiry into eristic or policy advocacy without declaring a shift and its rules).

9) Stopping rules (how to close the dialogue)

  • Accept (provisional): ψ is identified; diagnostics satisfactory; rivals addressed; mechanism articulated (or explicitly limited); uncertainty quantified; scope stated.
  • Suspend: open defeaters remain or diagnostics fail; specify the next most probative test.
  • Reject: identification breaks, key diagnostics fail, or a rival clearly dominates.

10) How this extends Walton + pragma-dialectics

  • Keeps inquiry’s discovery telos but adds causal-specific rules (estimand clarity, DAG/identification discipline, sensitivity obligations, rival management, and mechanism pairing).
  • Preserves the pragma-dialectical spirit (burden of proof, relevance, clarity, closure) with domain-specific instantiations that make “rationality” depend on good causal practice in a cooperative exchange.

11) Minimal “protocol card” you can use

  1. State ψ, P, T0 → what effect, on whom, from when?
  2. Show G, name I → DAG + identification strategy.
  3. Run QC → diagnostics + sensitivity/negative controls.
  4. Predict signatures → dose/lag/subgroups; check them.
  5. Confront rivals → articulate, test, compare.
  6. Conclude with scope & uncertainty → accept/suspend/reject; next test.

Below is a more detailed description of the function declarations listed above: moves 1-4 cover problem formulation, 5-7 modeling & identification, 8-10 evidence & testing, 11-13 mechanisms & coherence, 14-16 comparative appraisal, 17-20 conclusions & reporting, and 21-26 cross-cutting challenge/response moves. How to use each in practice:
  1. propose_hypothesis(H)
    • Purpose: Put a causal claim on the table in plain language.
    • When: Opening of inquiry.
    • Inputs: Cause, effect, context (e.g., “X increases Y among P during T”).
    • Output/Commitments: Commit to making the claim testable (estimand, design, diagnostics).
    • Acceptance checks: Clear, falsifiable, time-anchored.
    • CQs: Directional? Time-ordered? Contextualized?
    • Pitfalls: Vagueness (“X affects Y somehow”).
    • Example: “Traffic-related PM2.5 increases asthma ER visits among adults in County Z, 2016–2020.”
  2. state_estimand(Ψ)
    • Purpose: Turn the hypothesis into a named causal quantity.
    • When: Immediately after H.
    • Inputs: Target population, exposure contrast, outcome, time-zero, follow-up, summary (ATE/CATE/ATT).
    • Output/Commitments: A precise counterfactual query you will identify/estimate.
    • Acceptance checks: Formally writable (e.g., E[Y(1)−Y(0)] over window W).
    • CQs: Is the contrast precise? Are time & population fixed?
    • Pitfalls: Estimand drift later.
    • Example: “Ψ = ATT of odd-even traffic restrictions on ER visits within 7 days among registered drivers.”
  3. define_population(P)
    • Purpose: Lock in who Ψ is about.
    • When: With Ψ.
    • Inputs: Inclusion/exclusion, geography, time, eligibility.
    • Output/Commitments: Sampling frame; transportability boundaries.
    • Acceptance checks: Replicable inclusion criteria.
    • CQs: Any selection mechanisms tied to exposure/outcome?
    • Pitfalls: Convenience samples inducing bias.
    • Example: “Adults 18–65 in County Z, continuously insured, 2016–2020.”
  4. define_time_zero(T0)
    • Purpose: Fix the start of risk/measurement (avoid immortal-time bias).
    • When: With Ψ/P.
    • Inputs: Operational timestamp for exposure assignment & follow-up.
    • Output/Commitments: All timing claims become checkable.
    • Acceptance checks: Exposure precedes outcome; lags specified.
    • CQs: Is temporality satisfied for everyone in P?
    • Pitfalls: Post-exposure covariates treated as baseline.
    • Example: “T0 = 00:00 on restriction day; outcomes over next 7 days.”
  5. propose_DAG(G)
    • Purpose: Externalize assumed causal structure.
    • When: Early modeling.
    • Inputs: Nodes (exposure, outcome, covariates), edges (assumed arrows).
    • Output/Commitments: You own edges & omissions; open to targeted challenge.
    • Acceptance checks: G justifies a concrete adjustment/strategy.
    • CQs: Confounders, mediators, colliders correctly classified?
    • Pitfalls: Post-treatment adjustment; omitted common causes.
    • Example: Weather & mobility → PM2.5 & ER; PM2.5 → ER; no ER → PM2.5.
  6. justify_identification(I)
    • Purpose: Show how Ψ is identified under G + assumptions.
    • When: After G.
    • Inputs: Strategy (RCT, IV, RD, DiD, panel FE, g-methods, matching, TMLE, etc.) with identification conditions.
    • Output/Commitments: Map assumptions → estimand; diagnostic plan.
    • Acceptance checks: Clear conditions (exchangeability, positivity/SUTVA; IV relevance/exclusion; RD continuity; DiD parallel trends).
    • CQs: Are conditions plausible? How will you check them?
    • Pitfalls: “Black-box” ML with no identification story.
    • Example: DiD with city×week FE; 24-week pre-trends; matched control cities.
  7. list_assumptions(A)
    • Purpose: Make hidden levers visible.
    • When: With I.
    • Inputs: Testable & untestable assumptions; measurement & linkage assumptions.
    • Output/Commitments: Each assumption gets a diagnostic or sensitivity plan.
    • Acceptance checks: Feasible tests or reasoned defense for untestables.
    • CQs: Which assumption, if broken, flips the conclusion?
    • Pitfalls: Hand-waving (“no unmeasured confounding”).
    • Example: No cross-border migration this week; wind-shift IV affects ER only via PM2.5.
  8. propose_design(D)
    • Purpose: Commit to a concrete empirical design before results.
    • When: Pre-analysis / design registration.
    • Inputs: Data sources, inclusion rules, variables, transformations, windows, bandwidths, models.
    • Output/Commitments: A design others can reproduce.
    • Acceptance checks: Pre-specification or justified deviations.
    • CQs: Is D aligned with Ψ and I?
    • Pitfalls: Garden of forking paths; hidden post-hoc tweaks.
    • Example: Synthetic control; donor pool 20 counties; outcome = daily ER rate; covariates = weather, holidays.
  9. present_evidence(E)
    • Purpose: Put results on the table (estimates + uncertainty).
    • When: After D executed.
    • Inputs: Point estimates, intervals, diagnostics, robustness tables/figures.
    • Output/Commitments: Accept scrutiny relative to D, I, A, G, Ψ.
    • Acceptance checks: Traceable to design; uncertainty quantified; code/metadata if practical.
    • CQs: Consistent with identification diagnostics?
    • Pitfalls: Reporting only favorable specs; unit confusion.
    • Example: ATT = −6.2% (95% CI −9.8, −2.5); pre-trend p=0.62; placebo policies null.
  10. run_diagnostic(QC)
    • Purpose: Execute validity checks specific to I.
    • When: Alongside E.
    • Inputs: Balance/pre-trends, negative controls, IV F-stat, RD density, sensitivity (E-values), etc.
    • Output/Commitments: Abide by diagnostic implications (revise/suspend if they fail).
    • Acceptance checks: Pre-specified or justified; adequate power.
    • CQs: Do diagnostics support key assumptions?
    • Pitfalls: Underpowered/irrelevant tests; ignoring failures.
    • Example: RD McCrary p=0.47; IV first-stage F=28; negative-control outcome (fractures) null. (A toy simulation combining present_evidence and run_diagnostic appears after this list.)
  11. propose_mechanism(M)
    • Purpose: Articulate a causal pathway (even partial) that makes predictions.
    • When: After initial E (or earlier if known).
    • Inputs: Biological/behavioral/economic mechanism; intermediates.
    • Output/Commitments: Mechanism-linked, testable implications.
    • Acceptance checks: Compatible with G and E.
    • CQs: Intermediates measurable? Implied lags/dose/subgroups?
    • Pitfalls: Vague “plausibility” with no predictions.
    • Example: Inflammatory pathways; expect lag 0–2 days; stronger in COPD subgroup.
  12. predict_signature(SIG)
    • Purpose: Turn M into observable signatures.
    • When: With M.
    • Inputs: A-priori patterns (dose–response, lags, subgroup/geo gradients).
    • Output/Commitments: Agree that non-appearance weakens the claim.
    • Acceptance checks: Predictions precise enough to test.
    • CQs: Are signatures unique to M or shared with rivals?
    • Pitfalls: Post-hoc signature invention.
    • Example: Stronger effects on high-exposure commuting days; no effect on fractures.
  13. check_coherence(K)
    • Purpose: Integrate with external strands of evidence.
    • When: After E & SIG.
    • Inputs: Lab/toxicology, quasi-experiments, history, mechanistic literature.
    • Output/Commitments: Place finding in broader web; explain inconsistencies.
    • Acceptance checks: Citations & comparability discussed; conflicts acknowledged.
    • CQs: Any discordant high-quality results? Why?
    • Pitfalls: Cherry-picking supportive studies.
    • Example: Animal models show inflammation within 24h; UK congestion-charge study shows similar ER reduction.
  14. table_rivals(R₁…Rₙ)
    • Purpose: Lay out live alternative explanations side-by-side.
    • When: Before concluding.
    • Inputs: Confounding, measurement error, selection, alternative causes, reverse causation.
    • Output/Commitments: Each rival gets proposed tests/diagnostics.
    • Acceptance checks: Rivals are plausible (not straw versions).
    • CQs: Which rival best fits residual patterns?
    • Pitfalls: Ignoring the serious rival.
    • Example: R1: heat waves; R2: care-seeking changes; R3: coding changes.
  15. compare_fit(COMP)
    • Purpose: Evaluate main hypothesis vs. rivals on fit & testability.
    • When: After R₁…Rₙ.
    • Inputs: Likelihood/posteriors, predictive checks, out-of-sample performance, qualitative pattern match.
    • Output/Commitments: Transparent scoring or narrative with criteria.
    • Acceptance checks: Uses pre-agreed criteria or justified ex post.
    • CQs: Does any rival explain signatures better?
    • Pitfalls: Changing metrics mid-stream.
    • Example: Heat waves fail (effects persist in cool weeks); care-seeking fails placebo tests.
  16. choose_best_explanation(BE)
    • Purpose: Make the abductive choice, or suspend.
    • When: End of appraisal.
    • Inputs: COMP outcome; decision threshold reflecting stakes.
    • Output/Commitments: Reasoned, defeasible selection; or suspension with next tests.
    • Acceptance checks: Clear rationale tied to diagnostics & signatures.
    • CQs: Risk of premature closure?
    • Pitfalls: Victory by default (rivals not addressed).
    • Example: Adopt main hypothesis provisionally; rivals underperform on pre-specified diagnostics.
  17. accept / reject / suspend
    • Purpose: Close the dialogue honestly.
    • When: After BE.
    • Inputs: Evidence grade, diagnostics, rival status.
    • Output/Commitments: If accept: provisional & scoped; if suspend: name next test; if reject: explain failure.
    • Acceptance checks: Closure matches warrant.
    • CQs: Are you over-claiming?
    • Pitfalls: Treating “accept” as certainty; or never closing.
    • Example: Accept (provisional): −6% effect; next: mechanism sub-study.
  18. qualify_scope(SCOPE)
    • Purpose: State where the claim applies.
    • When: With closure.
    • Inputs: Population, time, setting; transportability limits.
    • Output/Commitments: Boundaries for reuse/policy.
    • Acceptance checks: Matches P and data support.
    • CQs: Any reason scope would shrink/expand?
    • Pitfalls: Over-generalization.
    • Example: Urban counties with similar traffic mix; 2010s vehicle fleet.
  19. state_uncertainty(UQ)
    • Purpose: Make uncertainty first-class (statistical + structural).
    • When: With closure.
    • Inputs: CIs/posteriors; sensitivity ranges; model dependence; unknowns.
    • Output/Commitments: Honest map of confidence and fragility.
    • Acceptance checks: Quantified where possible; qualitative where needed.
    • CQs: What single assumption, if wrong, flips the sign?
    • Pitfalls: Reporting only sampling error.
    • Example: If unmeasured confounder RR 2.0 with 15% prevalence difference exists, effect may vanish.
  20. declare_limits_and_next_tests(NEXT)
    • Purpose: Record what remains uncertain and the next best test.
    • When: Final step.
    • Inputs: Open defeaters; feasible designs.
    • Output/Commitments: Concrete plan (experiment, new data, quasi-design).
    • Acceptance checks: Next test would truly discriminate.
    • CQs: Is the next step proportionate to stakes?
    • Pitfalls: Vague “more research needed.”
    • Example: IV using refinery outages; mechanistic biomarker panel in COPD clinic.
  21. assert_association(A,B)
    • Purpose: Put a descriptive association on record (not yet causal).
    • When: Early evidence marshalling.
    • Inputs: Estimand-free correlation/regression with proper denominators/weights.
    • Output/Commitments: Full description of how measured; ready for stress-tests.
    • Acceptance checks: Replicability; robustness to basic spec changes.
    • CQs: Is it real (not artifact)?
    • Pitfalls: Implicit causal spin.
    • Example: Pearson r=0.31 across 200 days; Spearman 0.29.
  22. challenge_edge(X→Y)
    • Purpose: Target a specific arrow in G.
    • When: After DAG proposed.
    • Inputs: Missing confounder, wrong direction, collider path, measurement error.
    • Output/Commitments: Challenger offers a concrete alternative or test.
    • Acceptance checks: Connects to data/design or literature.
    • CQs: Would change alter identification?
    • Pitfalls: Generic skepticism.
    • Example: Mobility is a common cause of PM2.5 and ER; omitting it biases ATT.
  23. propose_test(T)
    • Purpose: Add a discriminating check.
    • When: Any time a live uncertainty is identified.
    • Inputs: Diagnostic, required data, expected pattern if H vs. rival.
    • Output/Commitments: Other party runs it if feasible, or justifies infeasibility.
    • Acceptance checks: Test truly discriminates; adequate power.
    • CQs: What result would falsify?
    • Pitfalls: Non-diagnostic tests.
    • Example: Add placebo outcome (appendicitis). Effect should be null.
  24. request_sensitivity(S)
    • Purpose: Quantify robustness to unmeasured threats.
    • When: After initial E.
    • Inputs: E-values, Rosenbaum bounds, bias-factor grids; parameter ranges.
    • Output/Commitments: Run & report; interpret.
    • Acceptance checks: Transparent parameterization; realistic ranges.
    • CQs: Are requested ranges realistic?
    • Pitfalls: Cherry-picking benign ranges.
    • Example: Report E-value for point estimate and CI bound.
  25. propose_trial_or_quasi(XP)
    • Purpose: Escalate to intervention (RCT) or strong quasi-experiment if feasible.
    • When: When observational designs plateau.
    • Inputs: Sketch of randomization or natural experiment; ethics/logistics.
    • Output/Commitments: Consider seriously (or justify why not).
    • Acceptance checks: Would answer Ψ with fewer assumptions.
    • CQs: Is XP ethical, timely, powered?
    • Pitfalls: Dismissing feasible experiments.
    • Example: Randomize congestion-pricing start across districts.
  26. assert_extrapolation(EXT)
    • Purpose: Argue that findings transport to new settings.
    • When: After acceptance/scope.
    • Inputs: Bridge assumptions; similarities/differences; reweighting/transport formulas if used.
    • Output/Commitments: State what must hold in the target to carry over Ψ.
    • Acceptance checks: Structural & distributional alignment argued or shown.
    • CQs: Which differences would break transport?
    • Pitfalls: Hand-wavy generalization.
    • Example: Effect transports to City Q because fleet mix, baseline and compliance are similar; reweighted estimate shown.
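
To see present_evidence and run_diagnostic working together, here is a toy simulation of the DiD design sketched in moves 6-10: estimate a known effect from simulated panel data, then run a placebo check on the pre-period. Everything here is simulated; the -0.06 effect merely echoes the running example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods, t_star = 200, 20, 10
treated = np.arange(n_units) < n_units // 2        # half the units treated at t_star
post = np.arange(n_periods) >= t_star
true_att = -0.06                                   # hypothetical effect

unit_fe = rng.normal(0.0, 0.5, n_units)[:, None]     # unit fixed effects
time_fe = rng.normal(0.0, 0.2, n_periods)[None, :]   # common shocks
y = (unit_fe + time_fe
     + true_att * (treated[:, None] & post[None, :])
     + rng.normal(0.0, 0.3, (n_units, n_periods)))

def did(y, treated, post):
    """Difference-in-differences on group-by-period means."""
    d_treated = y[treated][:, post].mean() - y[treated][:, ~post].mean()
    d_control = y[~treated][:, post].mean() - y[~treated][:, ~post].mean()
    return d_treated - d_control

print("ATT estimate:", round(did(y, treated, post), 3))   # close to -0.06
# run_diagnostic: placebo "treatment" at period 5, using only pre-period data
placebo_post = np.arange(t_star) >= 5
print("placebo estimate:", round(did(y[:, :t_star], treated, placebo_post), 3))  # ~0
```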

Common Assumptions

  1. Consistency (Well-defined Interventions)
    • The observed outcome under the treatment actually received equals the potential outcome for that treatment.
    • Requires treatments to be clearly defined and consistently applied.
  2. Exchangeability (No Unmeasured Confounding / Ignorability)
    • Given observed covariates, treatment assignment is independent of potential outcomes.
    • After adjusting for covariates, treated and untreated groups are comparable as if randomized.
  3. Positivity (Overlap / Common Support)
    • Every individual has a positive probability of receiving each treatment level, given their covariates.
    • Without overlap, causal effects cannot be compared across certain subgroups (see the propensity-overlap sketch after this list).
  4. Stable Unit Treatment Value Assumption (SUTVA)
    • Each individual’s potential outcome depends only on their own treatment, not on others’ treatments (no interference).
    • No hidden versions of the treatment.
  5. Correct Model Specification (when using parametric models)
    • The statistical model is correctly specified (e.g., functional form, distributional assumptions).
    • Nonparametric approaches rely less on this assumption.
  6. Additional Method-Specific Assumptions
    • Instrumental Variables (IV): instrument relevance, exclusion restriction, and monotonicity.
    • Difference-in-Differences (DiD): parallel trends assumption.
    • Regression Discontinuity (RD): continuity of potential outcomes at the cutoff.
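
Positivity is directly checkable: estimate propensity scores and look for regions of the covariate space with little or no overlap. A minimal sketch with scikit-learn (data simulated; the 0.02 trimming threshold is a common convention, not a rule):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))                             # observed covariates
p = 1 / (1 + np.exp(-(2.5 * x[:, 0] - 0.5 * x[:, 1])))  # strong covariate -> poor overlap
a = rng.binomial(1, p)                                  # treatment assignment

ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]

eps = 0.02
flagged = (ps < eps) | (ps > 1 - eps)
print(f"propensity range: [{ps.min():.3f}, {ps.max():.3f}]")
print(f"share with near-violations of positivity: {flagged.mean():.1%}")
```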

Concluding Remarks

In practice, a causal inquiry dialogue generally unfolds as follows:
  1. Formulate: propose_hypothesis → state_estimand → define_population/define_time_zero.
  2. Model/Identify: propose_DAG → justify_identification → list_assumptions.
  3. Design/Estimate: propose_design → run → present_evidence + run_diagnostic.
  4. Mechanize & Compare: propose_mechanism → predict_signature → check_coherence → table_rivals → compare_fit → choose_best_explanation.
  5. Close: accept/reject/suspend + qualify_scope + state_uncertainty + declare_limits_and_next_tests.
  6. At any point: opponents deploy challenge_edge, propose_test, request_sensitivity, propose_trial_or_quasi, assert_extrapolation.


