Below are four sections of the sample size justification. Part A contains a description of the population, as well as a description of the resource constraints that determine how much of the population can be sampled. In part B a description of which effect sizes are of interest is provided. In part C an overview of the inferential goal of the study is specified. In part D the sample size that will be collected is reported, and the informational value of the study is evaluated.

# A: Sample Description

Description of the population

We will assess public opinion in four large nationally representative samples that differ in their general trust in science (Spain, The Netherlands, Italy, and Poland). The population we want to generalize our findings to consists of the general public in these four countries. However, as we will collect data online, through four companies that facilitate survey studies, the population we will sample from is limited to the people in their database.
We aim to collect a sample of 250 participants from each of four countries (Spain, The Netherlands, Italy, Poland), for a total sample size of 1000. In consultation with the funder, we have received information that they can offer us the following stratified sample, which matches the age and sex distribution in the population:

Country       Female    Male    20-29   30-39   40-49   50-59   >59
Spain         127       123     31      41      53      46      80
Netherlands   125       124     38      39      41      47      83
Italy         128       122     31      36      46      48      89
Poland        129       121     38      50      44      38      79  

Can you collect data from the entire population?

No.

Description of resource constraints

This research project is funded by a research grant that aims to facilitate data collection through online data collection services that offer the option to collect a stratified sample that matches the population of a country on certain characteristics. The maximum sample size we can collect (N = 1000, 250 from each of four countries) is determined by the sample size the funder could provide us with. 

# B: Effects of Interest

Information about the Smallest Effect Size of Interest:

The smallest effect size of interest is specified as a Cohen's dz of 0.245.

The following details were provided about the smallest effect size of interest:

For our planned hypothesis tests H1 and H2 we have to specify a smallest effect size of interest. How much more acceptable can the public judge selective reporting, publication bias, and not sharing data, for researchers to still argue that knowledge about the incentive structure does *not* change moral acceptability judgments? And how much can trust decrease after learning about prevalence estimates from meta-scientific work, without considering this a meaningful change? We subjectively believe that a difference of 5 scale points on a 100-point scale is too small to matter in practice. Such a change corresponds to about a quarter of a standard deviation on the moral acceptability ratings for publication bias, according to our pilot data. If the difference in judgments is less than 5 scale points, it seems difficult for researchers to argue that the general public would not consider selective reporting, publication bias, and not sharing data morally unacceptable if only they understood the incentive structures in science.
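The correspondence between the raw 5-point difference and the standardized smallest effect size of interest can be reproduced with a short R sketch. It assumes the standard deviation of the difference scores from the pilot data for publication bias (20.42, as reported in part C); the variable names are illustrative and not part of the original analysis.

```r
# Raw smallest effect size of interest on the 100-point scale
sesoi_raw <- 5

# SD of the pre/post difference scores (pilot data, publication bias item)
sd_diff <- 20.42

# Cohen's dz = raw mean difference / SD of the difference scores
sesoi_dz <- sesoi_raw / sd_diff
round(sesoi_dz, 3)
```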

Information about the Minimal Statistically Detectable Effect:

The minimal statistically detectable effect is specified as a Cohen's dz of 0.0983.

The following details were provided about the minimal statistically detectable effect:

To be conservative, we plan for 5% missing data, giving a total sample size of 950. In G*Power we compute the critical t-value:
t tests - Means: Difference between two dependent means (matched pairs)
Analysis: Sensitivity: Compute required effect size
Input:    Tail(s)                     = Two
          α err prob                  = 0.0025
          Power (1-β err prob)        = 0.8
          Total sample size           = 950
Output:   Noncentrality parameter δ   = 3.8742947
          Critical t                  = 3.0314378
          Df                          = 949
          Effect size dz              = 0.1256987

We then convert the critical t-value to a critical Cohen's dz in R:
3.031/sqrt(949) = 0.09839042
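As a cross-check on the G*Power output, the critical t-value and the corresponding critical dz can be computed in base R. This is a minimal sketch assuming a two-sided test at an alpha level of 0.0025 with 950 pairs; it uses the same conversion as above (critical t divided by the square root of the degrees of freedom), and the variable names are illustrative.

```r
alpha   <- 0.0025   # two-sided alpha level
n_pairs <- 950      # planned sample after allowing for 5% missing data
df      <- n_pairs - 1

# Two-sided critical value of the central t distribution
t_crit <- qt(1 - alpha / 2, df)

# Conversion used in the text: critical t divided by sqrt(df)
dz_crit <- t_crit / sqrt(df)

round(c(t_crit = t_crit, dz_crit = dz_crit), 4)
```

Dividing by sqrt(n_pairs) instead of sqrt(df) is another common convention for converting a paired t-value to dz; with 950 pairs it gives a marginally smaller value.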

# C: Inferential Goal

The following information about the inferential goal related to statistical power has been provided:

The inferential goal is to perform a hypothesis test with a certain statistical power, computed by a sensitivity power analysis. The default alpha level is 0.05, which is lowered to 0.0025 for this study as justified below. A justification for the chosen alpha level and desired power (or for a sensitivity power analysis, the achieved power for effects of interest), and details of the power calculation (preferably in reproducible code) are provided below.

Given the effort required to collect data from a representative sample across four European countries, we do not expect that it will be easy for other researchers to replicate our study. If our findings are used to inform policy, we want to reduce the probability of erroneous claims below what the default alpha level of 0.05 allows. We therefore lower our alpha level to 0.0025, or 0.05 * 0.05, to yield the Type 1 error rate that would be achieved if two studies (e.g., one original and one direct replication) with a default alpha of 0.05 had been performed. As we will perform three paired t-tests (one each for selective reporting, publication bias, and not sharing data), we will effectively use an alpha level of 0.0025/3 for these tests (but an alpha level of 0.0025 for H2). With a final sample size of 950 (allowing for 5% missing values), we would have high (i.e., more than 99.99%) statistical power to conclude the absence of a meaningful effect (with the SESOI for a two-sided test set at 5 points on a 100-point scale), given the largest observed standard deviation from our pilot data (for publication bias, the standard deviation of the difference score was 20.42), assuming there is no true difference between the pre- and post-measures.

library(TOSTER)
powerTOSTpaired.raw(alpha = (0.0025/3), N = 950, low_eqbound = -5, high_eqbound = 5, sdif = 20.42)
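The reported power can be sanity-checked without TOSTER using a normal approximation to the two one-sided tests. The sketch below is an approximation written in base R under the assumptions stated above (a true difference of 0, symmetric bounds of 5 points, and an SD of the difference score of 20.42); it is not presented as the package's internal computation, and the variable names are illustrative.

```r
alpha   <- 0.0025 / 3   # alpha per test after correcting for three tests
n_pairs <- 950          # planned sample after allowing for 5% missing data
sd_diff <- 20.42        # SD of the difference scores (pilot data)
bound   <- 5            # symmetric equivalence bound in raw scale points

# Standard error of the mean difference
se <- sd_diff / sqrt(n_pairs)

# Approximate power of the two one-sided tests when the true difference is 0
power_tost <- 2 * pnorm(bound / se - qnorm(1 - alpha)) - 1
power_tost
```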

For a sensitivity power analysis with 950 pairs of observations, we would also have high (i.e., more than 99.99%) power to detect a difference of 5 scale points or larger (which, given the standard deviation of the difference score, equals a Cohen's dz of 0.245):

library(pwr)
pwr.t.test(d = 5/20.42, n = 950, sig.level = 0.0025/3, type="paired", alternative = "two.sided")  
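The same power can be checked in base R from the noncentral t distribution, which is the standard way to compute power for a two-sided paired t-test. This is a minimal sketch with illustrative variable names, using the same inputs as the pwr.t.test call above.

```r
alpha   <- 0.0025 / 3    # alpha per test after correcting for three tests
n_pairs <- 950           # number of pre/post pairs
dz      <- 5 / 20.42     # smallest effect size of interest as Cohen's dz
df      <- n_pairs - 1

ncp    <- dz * sqrt(n_pairs)        # noncentrality parameter
t_crit <- qt(1 - alpha / 2, df)     # two-sided critical value

# Power = probability that |t| exceeds the critical value under the alternative
power_ttest <- pt(t_crit, df, ncp = ncp, lower.tail = FALSE) +
  pt(-t_crit, df, ncp = ncp)
power_ttest
```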

The following information about the inferential goal related to estimation has been provided:

The inferential goal is to estimate parameters with a desired accuracy, based on a 95% confidence interval. Details related to the sample size computation:

Based on our pilot data collected from Dutch participants through Prolific, and the largest standard deviation observed for our dependent variables (the publication bias question, with an SD of 19.65), a sample of 250 per country will yield acceptability ratings with a margin of error of 2.44 on a 100-point scale (computed as 1.96 * (19.65/sqrt(250))), and a margin of error of 1.22 for N = 1000 when averaging across countries (computed as 1.96 * (19.65/sqrt(1000))). This means that in 95% of samples drawn using this procedure, the sample estimate for a country will fall within about 2.5 scale points of the true population value. We assume policy decisions would not meaningfully change if the estimates we report are 2.5 points higher or lower on a 100-point scale (or 1.25 points for the combined sample). Therefore, we believe the descriptives are accurate enough to inform policy decisions. If we conservatively allow for 5% missing data, the margins of error are still sufficient at 2.50 (for 238 observations per country) and 1.25 (for 950 observations in total).
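The margins of error for the four sample sizes discussed above can be recomputed in one step. This is a minimal base R sketch assuming the pilot SD of 19.65 and the normal-approximation multiplier of 1.96; the variable and element names are illustrative.

```r
sd_pilot <- 19.65   # largest SD in the pilot data (publication bias item)

# Sample sizes: full samples, and the same after allowing for 5% missing data
n <- c(per_country = 250, total = 1000,
       per_country_missing = 238, total_missing = 950)

# Half-width of a 95% confidence interval for a mean (normal approximation)
margin_of_error <- 1.96 * sd_pilot / sqrt(n)
round(margin_of_error, 2)
```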

# D: Informational Value of the Study

Based on the resource constraints, the effects of interest, and the inferential goals, the following evaluation of the informational value of the study has been provided:

Given the following resource constraints:

This research project is funded by a research grant that aims to facilitate data collection through online data collection services that offer the option to collect a stratified sample that matches the population of a country on certain characteristics. The maximum sample size we can collect (N = 1000, 250 from each of four countries) is determined by the sample size the funder could provide us with. 

and given a smallest effect size of interest of Cohen's dz = 0.245, a minimal statistically detectable effect of Cohen's dz = 0.0983,

and given the inferential goal based on a sensitivity power analysis with an alpha level of 0.0025 (lowered from the default of 0.05),

the sample size in the planned study consists of a total of 1000 participants, each contributing 1 observation. The following additional details about the sample size were provided:

Based on our pilot study we will present information as clearly as possible to prevent confusion, and we will analyze data from all completed questionnaires that are collected through the online data collection providers. To be conservative, we have based our sample size justification on a drop-out of 5%. 

An explanation of the informational value of the sample size that will be collected, given any resource constraints, the effects of interest, and the inferential goal, is provided below:

With a sample size of 1000 participants in total, we expect to be able to collect sufficiently accurate estimates to inform policy decisions. Our hypothesis tests have a very low Type 1 error rate while still achieving very high power for both a null-hypothesis test and an equivalence test against the smallest effect size of interest, which makes for a severe test of a difference between the pre- and post-measures. This should therefore be a study with relatively high informational value with respect to the statistical questions we aim to answer.