This tutorial introduces a number of basic concepts in data simulation using the statistical package R (R Core Team, 2019). I hope these concepts serve as the building blocks you need to simulate your own data.
Simulating data gives you power over every aspect of the data that result. That means you have to make a lot of decisions. Therefore, this tutorial separates the process of simulation into planning (Example 1) and data-generation steps (Example 2) at first, and then these steps are merged together in both Examples 3 and 4. The latter two examples extend the basic simulation tutorial to data that could be analyzed with a multilevel model (Example 3) or a structural equation model (Example 4).
The first two examples assume that you are familiar with the general linear model (GLM). The last two examples assume you have familiarity with multilevel or structural equation modelling, but they are optional (i.e., you could stop reading after Example 2).
This tutorial was generated using R Markdown. You can download the raw R Markdown script to run all these examples on your own computer.
You should be able to open the file in RStudio, uncomment any lines that install packages, and then “knit” the file. If you have trouble, then please let me know and hopefully I can debug it.
Simulated data is useful for many reasons: it lets you practice an analysis, test a draft of your analysis syntax before data collection (e.g., for preregistration), and conduct a power analysis. The examples today assume that you are simulating data to test a particular analysis syntax.
1. Write out your final model in analysis syntax
2. For each variable identified in Step 1, identify the following properties:
2.1. Variable name
2.2. Distribution, including relevant parameters like location and spread
2.3. Relationship to the other variables, considering effect size
3. Decide on the size of your sample
1. This example simulates data that could be analyzed with a GLM, specifically a moderated regression model. Physical stress (\(physical\_stress_i\)) will be predicted from a variable representing random assignment to a control condition or intervention (\(condition_i\)), neuroticism (\(centered\_neuroticism_i\)), and their interaction. That GLM syntax should look like this:
primary_model = glm(physical_stress ~ condition * centered_neuroticism)
2.1. From the syntax above, we see that there are 3 variables needed for the analysis:
physical_stress
condition
centered_neuroticism
2.2. Define each variable’s properties
When you simulate data, all the variable properties are made up by you. You truly just make them up, but you want them to be plausible. So, start by identifying the distribution that the variable should follow and then choose parameters for that distribution. Existing data or published papers are a great source for plausible parameter values if you have them. Otherwise, do what I did when creating the table below: I simply made up parameters that seemed plausible and reflected my expectations for each variable. These choices are explained in further detail below Table 1.
Construct | Variable Name | Range | Distribution |
---|---|---|---|
Physical Stress | physical_stress | {1, 2, …, 7} | \(\sim N(\mu = 3.15)\) |
Intervention Condition | condition | {-1, 1} | \(\sim B(n = 1, p = .5)\) |
Neuroticism | centered_neuroticism | {-3, -2.875, …, 3} | \(\sim N(\mu = -1, \sigma = .8)\) |
Physical Stress. With physical stress, I was thinking of a single item, 1 - 7 Likert scale that people would use to rate, “Have you felt worn down physically from stress?” (1 = Not at All, 7 = Extremely). The values will be integers ranging between 1 - 7. I chose a mean value of 3.15 somewhat arbitrarily, but I generally guess that most people are below the midpoint (i.e., \(\mu < 4\)) but not at the floor (i.e., \(\mu > 1\)). Normally, we would define a standard deviation for a variable we are simulating, but physical stress is the dependent variable and the dependent variable’s variance will be set by the simulation process. (This is because the dependent variable’s variance is conditional on the effect sizes that we will define in planning step 2.3; in other words, the dependent variable’s variance is partially explained by the independent variables.)
Intervention Condition. Intervention condition is going to be a 2-level categorical variable that represents random assignment to a control condition or a mindfulness-based cognitive therapy intervention. Participants have an equal chance of being assigned to the control or intervention conditions. Therefore, intervention condition will be a Binomially distributed variable with one trial (i.e., assignment to the intervention or not) and a probability of .5.
Neuroticism. I decided that the neuroticism variable would mimic an average of 8 variables (i.e., a composite score), where each variable is a 1 - 7 Likert scale. Note that I can choose to (a) simulate the final composite score already centered (e.g., a variable with values ranging between -3 to +3 in increments of 0.125) or (b) simulate the 8 variables in their raw scale (i.e., integers ranging from 1 - 7), take the mean, and then centre the result. I chose the former option (i.e., simulate one value that mimics the final composite score), because I am simulating these data specifically to test some draft analysis syntax, so I only need the variables in their final form. I might also make this choice if I were simulating data for a power analysis. However, I would make the other choice if I were simulating data to test a full draft analysis pipeline (i.e., including data-preparation syntax) or to reproduce a dataset; in those cases, I almost always want to simulate the raw data and then do all the data preparation on the simulated data that I would do on real data. Ultimately, as a construct, neuroticism is normally distributed, so we will use a normal distribution to generate these values. I decided neuroticism should have a mean of -1 to reflect self-enhancement and other response-bias errors that likely deflate neuroticism ratings. I chose the SD of .8 arbitrarily.
2.3. Declare relationships between variables
Finally, you must determine how all the variables will relate to each other in your simulated data set. These relationships usually represent your a priori hypotheses and you need them to simulate the data. The hypothesized relationships between the variables are listed in Table 2.
In the specific case of this example, there is an interaction term between variables in the GLM. Therefore, I must also consider how the product of \(condition_i\) and \(centered\_neuroticism_i\) relates to all other variables, too.
 | physical_stress | condition | centered_neuroticism | condition * centered_neuroticism |
---|---|---|---|---|
physical_stress | 1 | | | |
condition | d = .63 | 1 | | |
centered_neuroticism | r = .50 | d = .00 | 1 | |
condition * centered_neuroticism | r = -.30 | d = .00 | r = .00 | 1 |
In this table, I defined expected effect sizes using whatever effect size (e.g., Cohen’s d, Pearson r) made the most sense for each comparison. For example, experimental condition is a 2-level category, and I wanted an effect size that captures mean differences between the 2 levels of condition; Cohen’s d standardizes the difference between two means, so I used Cohen’s d. However, all effect sizes will be converted to correlation coefficients, r, for the simulation; the effect sizes will be incorporated by using them like standardized slopes in a GLM (Example 2). I will convert the d values to r values only after generating the \(condition_i\) variable, because I need to know the sample size of each level of \(condition_i\) to apply the d-to-r conversion equation with the most accuracy.
3. Determine Sample Size
The expected effect sizes are all medium or large, but I chose a sample size of N = 750 as a rough guess. I chose 750, because there are three effect sizes of interest declared above (one for each slope in our model), and we know that correlation estimates stabilize around sample sizes of N = 250 (Schönbrodt & Perugini, 2013). This is not a legitimate power analysis, but power does not really matter for this specific example.
I have a couple of general tips for conducting simulations:
One general tip for simulating data is that you should go in the order of your process. For example, you usually want to simulate the independent variable(s) before the dependent variable, because you will use the independent variable(s) to simulate the dependent variable.
A second general tip for simulating data is to always “set a random seed.” Monte Carlo simulation requires random numbers, and the computer generates pseudo-random numbers for the simulation using a complex algorithm. This algorithm gets started with a starting number that “seeds” the random process, like you would plant a seed in a garden and watch it grow. 🌱 So, we call this starting number the “random seed” for the randomizer that drives our simulation. If you set the random seed to a specific number, then your simulation will yield identical results every time you run it. Sometimes you want identical results each time the simulation is run, and sometimes you specifically do not want that; just think about it for a moment. If you want other people to reproduce your results (e.g., simulations conducted for research or teaching), then you definitely need a random seed. You can use literally any number for the random seed; it does not matter which one. When setting the random seed in the example below, I just banged my fingers against the number keys on my keyboard to get the number you see.
#--- Set the Random Seed ----
set.seed(91285103)
Putting the planning decisions from Example 1 together, the equation that will generate our data looks like this:
\[physical\_stress_i = 3.15 + .30 * condition_i + .50 * centered\_neuroticism_i - .30 * condition_i * centered\_neuroticism_i + e_i\]
In this equation, everything that has a subscript i needs to be simulated. That means the dependent variable, two independent variables, and the residuals (\(e_i\)), too.
Simulate data
Test the simulated data
Save data
I will demonstrate how to use Monte Carlo simulation and the general linear model (GLM) to create a simulated dataset. First, we will simulate the independent variables using Monte Carlo simulation. Second, we will simulate the dependent variable from the GLM. By specifying the GLM in different ways, this simple approach allows you to simulate complex multivariate relationships.
As specified in Table 2, intervention condition and neuroticism should be independent of each other. Neuroticism would be expected to be equal across conditions, because assignment to condition was random. Therefore, we can simulate each independent variable separately without having to consider covariance between them.¹
Condition. Above, we defined condition as a Binomial variable with 1 trial and .5 probability of success. You can conduct a Monte Carlo simulation to sample values of condition from this distribution using the \(rbinom()\) function. The \(rbinom()\) function will output values of 0 and 1. Especially because this variable will be used in an interaction term, I will effect-code the variable into values of -1 and 1 after generating it (see West, Aiken, & Krull, 1996).
#--- Display and Read Help File for rbinom() ----
# help(rbinom)
#--- Monte Carlo Simulation of ~B(1, .5) ----
condition = rbinom(n = 750,
size = 1,
prob = .5)
#--- Effect-code Condition ----
condition = ifelse(condition == 0, -1, condition)
#--- Inspect New Variable ----
head(condition)
## [1] 1 1 -1 -1 1 -1
data.frame(condition) %>%
ggplot(aes(condition)) +
geom_bar(fill = "cornflower blue", alpha = .6)
Neuroticism. Above, we decided that we would directly simulate one neuroticism variable that mimicked a composite variable made from 8 variables. If you take the average of 8 integer variables, then you get values that differ, at the lowest resolution, by 0.125. To imitate that property, I do a little fancy conversion after simulating the data. Otherwise, this is a classic Monte Carlo simulation of a normally distributed variable.
#--- Display and Read Help File for rnorm() ----
# help(rnorm)
#--- Monte Carlo Simulation of ~N(-1, .8) ----
centered_neuroticism = rnorm(n = 750,
mean = -1,
sd = .8)
head(centered_neuroticism)
## [1] -2.53329862 -0.44486107 -0.71672448 -0.57190558 -1.82056909 -0.09496492
Finally, I post-process neuroticism to make it as much like the variable I would actually measure and analyze. First, I convert the raw, continuous simulated values into the values that would result from averaging 8 variables rated on 1 - 7 Likert scales (i.e., values in increments of .125 = 1/8). After that, I mean-centre neuroticism.
#--- Manipulate neuroticism to be Maximally Plausible ----
centered_neuroticism = round(centered_neuroticism / .125) * .125
centered_neuroticism = centered_neuroticism - mean(centered_neuroticism, na.rm = TRUE)
head(centered_neuroticism)
## [1] -1.4831667 0.5168333 0.2668333 0.3918333 -0.8581667 0.8918333
data.frame(centered_neuroticism) %>%
ggplot(aes(centered_neuroticism)) +
geom_density(fill = "cornflower blue", alpha = .6)
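For contrast, here is a minimal sketch of option (b) from planning step 2.2 (simulating the 8 raw Likert items, averaging them, and then centring the composite). The item-level mean and SD are made-up illustration values, and the items are drawn independently here; a real composite would have correlated items (Example 4 shows how to simulate correlated variables).
#--- Optional Sketch: Simulate 8 Raw Items, Average, and Centre (Option b) ----
n_items = 8
raw_items = matrix(rnorm(n = 750 * n_items, mean = 3, sd = 1.2), # made-up item parameters
                   nrow = 750, ncol = n_items)
raw_items = pmin(pmax(round(raw_items), 1), 7)  # integers clamped to the 1 - 7 scale
neuroticism_composite = rowMeans(raw_items)     # average the 8 items into a composite
centered_composite = neuroticism_composite - mean(neuroticism_composite) # mean-centre
head(centered_composite)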
Note that it is possible for \(rnorm()\) to generate values that are not plausible for my hypothetical variable that ranges between -3 to +3 (e.g., 3.750). For example, it is possible for the code demonstrated above to generate a value of -4. There are a lot of ways to deal with this issue of implausible values, ranging in fanciness. A simple solution is to use the \(rtruncnorm()\) function from the \(truncnorm\) package.²
#--- Display and Read Help File for rtruncnorm() ----
# install.packages("truncnorm")
library(truncnorm)
# help(rtruncnorm)
#--- Monte Carlo Simulation of Truncated ~N(-1, .8) ----
centered_neuroticism = rtruncnorm(n = 750,
mean = -1,
sd = .8,
a = -3,
b = 3)
#--- Manipulate neuroticism to be Maximally Plausible ----
centered_neuroticism = round(centered_neuroticism / .125) * .125
centered_neuroticism = centered_neuroticism - mean(centered_neuroticism, na.rm = TRUE)
head(centered_neuroticism)
## [1] 0.2456667 0.6206667 0.3706667 -0.2543333 -1.0043333 -0.6293333
data.frame(centered_neuroticism) %>%
ggplot(aes(centered_neuroticism)) +
geom_density(fill = "cornflower blue", alpha = .6)
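Footnote 2 mentions one more option: “throwing back” any implausible values and redrawing them until every value is in range. Here is a minimal sketch of that approach, assuming the same ~N(-1, .8) distribution and the -3 to +3 bounds (the object name is just illustrative).
#--- Optional Sketch: "Throw Back" Implausible Values and Redraw Them ----
resampled_neuroticism = rnorm(n = 750, mean = -1, sd = .8)
out_of_range = which(resampled_neuroticism < -3 | resampled_neuroticism > 3)
while (length(out_of_range) > 0) {
  # Replace only the implausible draws with fresh draws from the same distribution
  resampled_neuroticism[out_of_range] = rnorm(n = length(out_of_range), mean = -1, sd = .8)
  out_of_range = which(resampled_neuroticism < -3 | resampled_neuroticism > 3)
}
range(resampled_neuroticism) # now guaranteed to fall within -3 to +3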
Now that we have our independent variables generated, we need to create a dependent variable that is related to the independent variables in the manner that we specified. The general linear model (GLM) is meant to predict outcomes, so it can be used as a simulation tool. We will use it in this way! We will combine the independent variables together in a linear equation to predict the dependent variable.
We begin by combining the information across Tables 1 and 2 with the primary model. We convert the predicted d = 0.63 to a Pearson r = .30 using the equation below. The values for \(n_1\) and \(n_2\) reflect the frequency of each level simulated for \(condition\).
table(condition)
## condition
## -1 1
## 373 377
\[r = {d\over \sqrt{d^2+{(N^2-2N)\over{n_1 n_2}}}} = {0.63\over \sqrt{0.63^2 + {(750^2 - 2*750)\over{373*377}}}} = {0.63\over \sqrt{0.3969+3.9895}} = 0.3008\]
#--- Convert Effect Size for condition to Pearson r
d = .63
conversion_correlation = d/sqrt(d^2 + (length(condition)^2 - 2*length(condition))/prod(table(condition)))
print(conversion_correlation)
## [1] 0.3008078
Note that if your category levels have equal sample sizes (i.e., \(n_1 = n_2\)), then you can use the constant “4” in the following equation, because that is the result you would get from the preceding equation when the sample sizes are equal across category levels.
\[r = {d\over \sqrt{d^2+4}}\]
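As a quick check on that shortcut, plugging our d = .63 into the equal-n version gives nearly the same value as the exact calculation above:
#--- d-to-r Conversion Assuming Equal Group Sizes ----
d = .63
d / sqrt(d^2 + 4) # roughly .300, very close to the exact .3008 above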
I will also note that there are a couple of effect sizes defined in Cohen’s d in Table 2 that equal zero. Cohen’s d and Pearson r are equivalent at zero. So, no calculations are required to convert effect sizes of d = 0 to r.
Now we are ready to simulate the dependent variable. We originally stated that physical stress is a function of neuroticism, condition, and their interaction, plus some error (Equation 1).
\[\begin{equation} \label{eq:1} physical\_stress_i = 3.15 + .30*condition_i + .50*centered\_neuroticism_i - .30 * condition_i * centered\_neuroticism_i + e_i \end{equation}\]
This is the equation we will use to generate values of physical stress. We introduce randomness into our model by using Monte Carlo simulation to generate the residuals for this model. Following the assumptions of the GLM, the residuals should be Normally distributed with a mean of 0. The mean of the residuals is set to 0 to reflect a model that does not systematically over-predict or under-predict the outcome. We will refine the value of the residual SD, \(\sigma_{e_i}\), shortly, but I am starting with a value of 1 for this example.
Now that we have considered everything we need to, let us simulate our dependent variable!
#--- Predicting Dependent Variable (DV) From Linear Model ----
physical_stress = 3.15 +
.30 * condition +
.5 * centered_neuroticism +
-.3 * condition * centered_neuroticism +
rnorm(n = 750, mean = 0, sd = 1)
#--- Inspect New Variable ----
head(physical_stress)
## [1] 4.998308 5.052018 3.864886 1.782625 3.597156 1.120585
data.frame(physical_stress) %>%
ggplot(aes(physical_stress)) +
geom_density(fill = "cornflower blue", alpha = .6)
The resulting values we get for physical stress are normally distributed, ranging between \(-\infty\) and \(\infty\). Ultimately, we want to simulate \(physical\_stress_i\) as a positive integer that ranges between 1 and 7. We cannot use \(rtruncnorm()\) in this case, because the dependent variable is generated from the model equation rather than drawn directly from a distribution. So, we will round the predicted values to create integers and then replace any values that fall outside the range of 1 to 7 with the nearest extreme value (i.e., 1 or 7). Please expect that both truncating and reducing the resolution of the variable in this way will deflate the effect sizes a bit.
#--- Round Predicted Values to Create Integers ----
physical_stress = round(physical_stress)
#--- Replace any values that exceed a plausible range with the most extreme values
physical_stress = ifelse(physical_stress > 7, 7, # Replace implausibly high values with highest value
ifelse(physical_stress < 1, 1, # Replace implausibly low values with lowest value
physical_stress)) # Otherwise, keep the same values
#--- Inspect New Variable ----
head(physical_stress)
## [1] 5 5 4 2 4 1
data.frame(physical_stress) %>%
ggplot(aes(physical_stress)) +
geom_histogram(fill = "cornflower blue", alpha = .6)
Okay, so on the one hand, this is all awesome. We simulated all our variables, and their values are plausible. There is the remaining issue of effect size, however. If you want your variables to be related to each other to a specific degree (e.g., you are simulating data for a power analysis), then we have an additional step.
We used the correlation effect sizes from Table 2 as the slopes for the GLM, but this method of simulation treats those estimates like unstandardized slopes. This means that the effect sizes we defined in Table 2 are not being applied properly right now. This becomes obvious when we print the standardized slopes for the primary model.
#--- Run the Primary Model ----
# Note that we need to fully write out the linear model and use the eval function for lm.beta() to work correctly with the interaction term
primary_model = glm(physical_stress ~ condition + centered_neuroticism + eval(condition * centered_neuroticism))
#--- Inspect Unstandardized Slopes ----
summary(primary_model)
##
## Call:
## glm(formula = physical_stress ~ condition + centered_neuroticism +
## eval(condition * centered_neuroticism))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4280 -0.6647 -0.0223 0.6734 3.2024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.21625 0.03718 86.496 < 2e-16 ***
## condition 0.24674 0.03718 6.636 6.21e-11 ***
## centered_neuroticism 0.47325 0.04815 9.828 < 2e-16 ***
## eval(condition * centered_neuroticism) -0.20275 0.04815 -4.211 2.86e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.036589)
##
## Null deviance: 942.44 on 749 degrees of freedom
## Residual deviance: 773.30 on 746 degrees of freedom
## AIC: 2161.3
##
## Number of Fisher Scoring iterations: 2
#--- Inspect Standardized Slopes ----
QuantPsyc::lm.beta(primary_model)
## condition centered_neuroticism
## 0.2201097 0.3261244
## eval(condition * centered_neuroticism)
## -0.1396962
You see that the unstandardized slopes are around the effect sizes we specified, whereas the standardized slopes are not. To control the effect sizes between variables, we need the GLM to treat our slopes like standardized slopes. The way to do this is to (a) calculate how much variance should be explained by the predictors in the model, given that the slopes represent standardized slopes, and then (b) set the residual variance to explain the remaining unexplained variance.
a) Calculate how much variance should be explained by the predictors in the model
There are three slopes in our model: the main effect of \(condition_i\) (\(b_1 = .30\)), the main effect of \(centered\_neuroticism_i\) (\(b_2 = .50\)), and the interaction between \(condition_i\) and \(centered\_neuroticism_i\) (\(b_3 = -.30\)). The squared correlation coefficient equals the variance explained by that effect. Since we expect all predictors to be uncorrelated, we can simply sum the squared values of our three slopes to find out how much variance our model is supposed to explain, \(R^2_\beta\).³
\[R^2_\beta = .30^2 + .50^2 + (-.30)^2 = .09 + .25 + .09 = 0.43\]
If our predictors explain 43% of the variance according to the effect sizes defined in Table 2, then that means that 57% of the variance is left unexplained. We will set the variance of the residuals, \(\sigma_{e_i}\), to account for this remaining amount of variance.
\[\sigma^2_{e_i} = 1 - R^2_\beta = 1 - .43 = 0.57\]
b) Set the residual variance to the remaining variance in the model
Now, we can set the residual variance to explain the remaining variance. We need to provide the standard deviation to the \(rnorm()\) function, not the variance. The standard deviation of the residuals, \(\sigma_{e_i}\), will be the square root of the residual variance.
\[\sigma_{e_i} = \sqrt{\sigma^2_{e_i}} = \sqrt{0.57} = 0.7549834\]
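If you would rather let R do this arithmetic than hard-code the result, here is a small sketch using the slopes from Table 2 (object names are just illustrative):
#--- Compute the Residual SD From the Standardized Slopes ----
slopes = c(.30, .50, -.30)     # condition, centered_neuroticism, interaction
r_squared_beta = sum(slopes^2) # variance explained, assuming uncorrelated predictors
residual_sd = sqrt(1 - r_squared_beta)
residual_sd
## [1] 0.7549834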
Now that we know the correct value for the SD of the residuals, we simulate the dependent variable again. This time, all our variables should relate to each other to the degree that we specified in Table 2.
#--- Predicting Dependent Variable (DV) From Linear Model ----
physical_stress = 3.15 +
.30 * condition +
.5 * centered_neuroticism +
-.3 * condition * centered_neuroticism +
rnorm(n = 750, mean = 0, sd = 0.7549834) # THIS IS THE VALUE THAT CHANGED!!!
#--- Converting DV Into Likert-Scale Format ----
physical_stress = round(physical_stress)
physical_stress = ifelse(physical_stress > 7, 7,
ifelse(physical_stress < 1, 1,
physical_stress))
#--- Test the Model ----
primary_model = glm(physical_stress ~ condition + centered_neuroticism + eval(condition * centered_neuroticism))
#--- Inspect Unstandardized Slopes ----
summary(primary_model)
##
## Call:
## glm(formula = physical_stress ~ condition + centered_neuroticism +
## eval(condition * centered_neuroticism))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4842 -0.4936 -0.1251 0.5534 2.6285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.15703 0.02942 107.300 <2e-16 ***
## condition 0.30874 0.02942 10.493 <2e-16 ***
## centered_neuroticism 0.44883 0.03810 11.780 <2e-16 ***
## eval(condition * centered_neuroticism) -0.37365 0.03810 -9.807 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.6490228)
##
## Null deviance: 715.37 on 749 degrees of freedom
## Residual deviance: 484.17 on 746 degrees of freedom
## AIC: 1810.2
##
## Number of Fisher Scoring iterations: 2
#--- Inspect Standardized Slopes ----
QuantPsyc::lm.beta(primary_model)
## condition centered_neuroticism
## 0.3161204 0.3550067
## eval(condition * centered_neuroticism)
## -0.2954960
Note that the only reason I simulated the dependent variable twice was to emphasize the issue with residuals and effect size. In a real simulation, you would only simulate the dependent variable once.
You want to test the simulated data in various ways. We have already been inspecting values generated along the way, which is important. Additionally, I want to make sure that the draft analysis syntax runs using the data we simulated. If it runs, then we can move on to the final step and save the data. If the analysis syntax returns an error, then there is either a problem with the simulated data or a problem with the draft analysis syntax – at least one of those two things will need to be fixed. So, I would figure out the problem, fix it, and then try to run the syntax again.
#--- See Whether The Originally Planned GLM Would Run ----
primary_model = glm(physical_stress ~ condition * centered_neuroticism)
summary(primary_model)
##
## Call:
## glm(formula = physical_stress ~ condition * centered_neuroticism)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4842 -0.4936 -0.1251 0.5534 2.6285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.15703 0.02942 107.300 <2e-16 ***
## condition 0.30874 0.02942 10.493 <2e-16 ***
## centered_neuroticism 0.44883 0.03810 11.780 <2e-16 ***
## condition:centered_neuroticism -0.37365 0.03810 -9.807 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.6490228)
##
## Null deviance: 715.37 on 749 degrees of freedom
## Residual deviance: 484.17 on 746 degrees of freedom
## AIC: 1810.2
##
## Number of Fisher Scoring iterations: 2
Indeed, the syntax ran without errors. That was the goal of the test. ✅
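As one more optional check (not part of the original analysis plan), you could also compare the observed correlations to the targets in Table 2; expect them to wobble a bit because we rounded and truncated the dependent variable.
#--- Optional: Compare Observed Correlations to the Targets in Table 2 ----
round(cor(data.frame(physical_stress,
                     condition,
                     centered_neuroticism,
                     interaction = condition * centered_neuroticism)), 2)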
Once you are satisfied with the data you have simulated, you are ready to save or export it. You are finished!
#--- First, I'll create an id variable ----
id = seq(from = 1, to = 750, by = 1)
#--- Now, I'll combine all the simulated variables into a data frame ----
simulated_data = data.frame(id, physical_stress, condition, centered_neuroticism)
write_csv(simulated_data, "Simulated Dataset.csv")
Lastly, I want to give you the tools to extend your simulations skills to generate data for “advanced” analyses. Specifically, I’ll demonstrate how to extend the example to hierarchical data for multilevel modelling and how to generate a big correlation matrix to simulate data for a path analysis and structural equation modelling (SEM).
This example assumes that you are familiar with multilevel modelling (a.k.a., “mixed models,” “hierarchical linear modelling,” “random effects models”). It is easy to take the regular GLM example and make it into a multilevel one. The GLM that drove Examples 1 and 2 was specified like this:
\[y_i = b_0 + b_1 x_{1_i} + b_2 x_{2_i} + b_3 x_{1_i} x_{2_i} + e_i\]
To simulate a multilevel dataset, we will pretend that we had a study design that was pretty similar to Examples 1 and 2, except that we measured physical stress every day for 7 days. We will keep the same model specification for the fixed effects, but we will estimate a random intercept for each participant. The equation for this multilevel model would be:
\[ y_{ij} = b_0 + b_1 x_{1_j} + b_2 x_{2_j} + b_3 x_{1_j} x_{2_j} + u_{0_j} + e_{ij} \]
Note that the only real difference between the equations is the extra residual for the intercept, \(u_{0_j}\). Therefore, to take our between-subjects simulation and turn it into a multilevel one, we really only need to simulate these extra residuals. Like our other residuals, these residuals should vary around zero. Their variance should be whatever you want the random-intercept variance to be, relative to the residual variance. In the following simulation, I will arbitrarily say that the intercept will have a variance of .40 (SD = 0.6324555).
#--- Generate a Residual for each "Participant" in Simulated Dataset ----
per_participant_intercept = rnorm(750, mean = 0, sd = 0.6324555)
head(per_participant_intercept)
## [1] -0.21886787 -0.05788667 -0.73939683 0.07846672 -0.77854670 0.29592453
The intercept residuals that we just simulated represent how much each person deviates from the average intercept. So, we will add them to the intercept in our model.
We are ready to simulate our multilevel data. First, we will create the participant-level (Level 2) dataset, containing variables that only vary between participants (and are invariant within participants). The first variable in the dataset is \(id\), which is the number that we generated for each participant prior to exporting the data. It did not really matter then, but it is helpful now. The second variable is \(per\_participant\_intercept\), which contains the intercept residuals that we simulated in the last chunk of code. Condition (\(condition_i\)) is a Level 2 variable, and thus the way we simulated it before is still good. Neuroticism (\(centered\_neuroticism_i\)) is here for the same reason condition is: It is a participant-level variable, so the way it was simulated before works equally well in the multilevel context. The reason you can simulate the Level 2 variables for a hierarchical dataset the same way you simulate variables for a non-hierarchical dataset is that we assume all Level 2 units were randomly sampled from the population and thus are independent. Values in the participant-level dataset will later be repeated across multiple rows.
#--- Organize Data into Level 2 (Participant-Level) Datset ----
participant_level_dataset = data.frame(id,
per_participant_intercept,
condition,
centered_neuroticism)
The second step is to simulate the Level 1 residuals and merge them with the Level 2 dataset. The multilevel model assumes that, after the random effects are taken into account, the Level 1 residuals are conditionally independent from each other. So, you can use a straightforward Monte Carlo simulation to generate the Level 1 residuals. I combine these residuals into a Level 1 dataset.
Note: As with the previous example using an ordinary GLM, if we want our slopes to reflect effect sizes, we need to reduce the variance of the Level 1 residuals by the variance of the Level 2 residuals. We simulated the random intercept to have a variance of 0.40 (SD = 0.6324555). The unexplained variance we estimated for the residuals in Example 2 was 0.57 (SD = .7549834); that is now the total variance that needs to be accounted for by the intercept and the Level 1 residuals. So, to find the new residual variance, I subtract the intercept variance from the previous residual variance (\(\sigma^2_{e_{ij}} = \sigma^2_{e_i} - \sigma^2_{0_j} = 0.57 - 0.4 = 0.17\)). From this recalculation, I make the SD of the Level 1 residuals equal to 0.4123106.
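The same arithmetic in R, for anyone who prefers not to hard-code the SD (object names are just illustrative):
#--- Compute the Level 1 Residual SD ----
total_unexplained_variance = 0.57 # residual variance from Example 2
intercept_variance = 0.40         # random-intercept variance chosen above
level_1_residual_sd = sqrt(total_unexplained_variance - intercept_variance)
level_1_residual_sd
## [1] 0.4123106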
#--- Monte Carlo Simulation of Level 1 Residuals ----
residuals = rnorm(n = 750*7, mean = 0, sd = 0.4123106)
#--- Create an id variable to match with the Level 2 dataset ----
participant_id = rep(1:750, each = 7)
# Look at the id to understand what we just did
head(participant_id, n = 21)
## [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3
#--- Create a Level 1 Dataset ----
observation_level_dataset = data.frame(id = participant_id,
residuals)
head(observation_level_dataset)
## id residuals
## 1 1 -0.06397713
## 2 1 0.14006790
## 3 1 0.38307868
## 4 1 -0.01320721
## 5 1 0.31727136
## 6 1 0.38999099
#--- Merge to two Datasets Together ----
dataset = left_join(observation_level_dataset, participant_level_dataset)
#--- Inspect the new Dataset ----
head(dataset)
## id residuals per_participant_intercept condition centered_neuroticism
## 1 1 -0.06397713 -0.2188679 1 0.2456667
## 2 1 0.14006790 -0.2188679 1 0.2456667
## 3 1 0.38307868 -0.2188679 1 0.2456667
## 4 1 -0.01320721 -0.2188679 1 0.2456667
## 5 1 0.31727136 -0.2188679 1 0.2456667
## 6 1 0.38999099 -0.2188679 1 0.2456667
Finally, we can put these components together into the multilevel model to predict values of the dependent variable, physical stress, that reflect the clustering in the data.
#--- Predicting Dependent Variable (DV) From Multilevel Model ----
dataset = within(dataset, {
physical_stress = (3.15 + per_participant_intercept) +
.30 * condition +
.5 * centered_neuroticism +
-.3 * condition * centered_neuroticism +
residuals
})
#--- Converting DV Into Likert-Scale Format ----
dataset = dataset %>%
mutate(physical_stress = round(ifelse(physical_stress > 7, 7,
ifelse(physical_stress < 1, 1,
physical_stress))))
#--- Inspect the Final Dataset ----
head(dataset)
## id residuals per_participant_intercept condition centered_neuroticism
## 1 1 -0.06397713 -0.2188679 1 0.2456667
## 2 1 0.14006790 -0.2188679 1 0.2456667
## 3 1 0.38307868 -0.2188679 1 0.2456667
## 4 1 -0.01320721 -0.2188679 1 0.2456667
## 5 1 0.31727136 -0.2188679 1 0.2456667
## 6 1 0.38999099 -0.2188679 1 0.2456667
## physical_stress
## 1 3
## 2 3
## 3 4
## 4 3
## 5 4
## 6 4
As a final step to ensure our simulated data is sufficient, let us test it by running it in an MLM!
#--- Test the Model ----
#install.packages(c("lme4", "lmerTest"))
library(lme4) # Necessary for multilevel/mixed models
library(lmerTest) # Calculates degrees of freedom and p-values for lmer()
primary_model = lmer(physical_stress ~ (1|id) + condition * centered_neuroticism, data = dataset)
summary(primary_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: physical_stress ~ (1 | id) + condition * centered_neuroticism
## Data: dataset
##
## REML criterion at convergence: 9671.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8606 -0.6095 -0.0264 0.6108 3.5385
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.4304 0.6560
## Residual 0.2559 0.5059
## Number of obs: 5250, groups: id, 750
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.15179 0.02496 745.99980 126.293 <2e-16
## condition 0.28035 0.02496 745.99980 11.233 <2e-16
## centered_neuroticism 0.51827 0.03232 745.99980 16.037 <2e-16
## condition:centered_neuroticism -0.28952 0.03232 745.99980 -8.958 <2e-16
##
## (Intercept) ***
## condition ***
## centered_neuroticism ***
## condition:centered_neuroticism ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) condtn cntrd_
## condition -0.005
## cntrd_nrtcs 0.000 -0.019
## cndtn:cntr_ -0.019 0.000 0.027
#--- Inspect New Slopes ----
#install.packages("r2glmm")
library(r2glmm) # Calculates Edwards et al. (2008) Partial R^2 for MLM!
r2beta(primary_model)
## Effect Rsq upper.CL lower.CL
## 1 Model 0.449 0.467 0.431
## 3 centered_neuroticism 0.304 0.324 0.285
## 2 condition 0.177 0.195 0.159
## 4 condition:centered_neuroticism 0.120 0.137 0.105
Sweet. It worked. Export the data, and you are DONE! ✅
write_csv(dataset, "Simulated Multilevel Dataset.csv")
This example assumes that you are familiar with structural equation modelling (SEM).
Ultimately, SEM is an analysis of a covariance matrix. Some of the first statistical packages used for SEM required a covariance or correlation matrix as input instead of your raw data. Therefore, to simulate data that would be suitable for SEM, you just need to create a set of correlated variables.
1. Similar to the other examples, you want to begin the simulation by specifying your final model. The best way to specify an SEM is to draw it! Since I am currently “isolating” with two small kids, multitasking takes on a whole different meaning. Aaaaand, I have no time. So, I drew the example SEM while playing with my daughter. 😊 🖍 🎨 👧
2.1. The structural model above represents the idea that both physical health and mental health contribute to well-being. The variables represented in rectangles are the measured variables, and that means they are the variables we need to simulate: WB1, WB2, WB3, WB4, sleep, nutrition, exercise, mood, relationships, and enrichment.
2.2. Next, we need to elaborate the properties of all variables that need to be simulated (Table 3). As with the previous examples, I am choosing these properties somewhat arbitrarily. The strategy I use is to think of what these data would be like if I collected them for real, and I make the simulated data as plausible and similar as possible. I elaborate my reasoning for all variables beneath Table 3.
Latent Construct | Measured Variable | Range | Distribution |
---|---|---|---|
Well-being | WB1 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\) |
Well-being | WB2 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\) |
Well-being | WB3 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\) |
Well-being | WB4 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\) |
Physical Health | sleep | {0, 1, …, \(\infty\)} | \(\sim Pois(\lambda = 438)\) |
Physical Health | nutrition | {0, …, 100} | \(\sim N(\mu = 68.3, \sigma = 8.3)\) |
Physical Health | exercise | {0, 1, …, \(\infty\)} | \(\sim Pois(\lambda = 20)\) |
Mental Health | mood | {1.0, 1.1, …, 7} | \(\sim N(\mu = 4.6, \sigma = 1.1)\) |
Mental Health | relationships | {1.0, 1.1, …, 7} | \(\sim N(\mu = 5.3, \sigma = .75)\) |
Mental Health | enrichment | {1.0, 1.1, …, 7} | \(\sim N(\mu = 3.8, \sigma = .85)\) |
Well-Being (WB1 - WB4). In this example, the measurement model for the latent variable, Well-Being, is estimated from responses to an imaginary scale of well-being. In this imaginary scale, each item is measured on a 7-point Likert scale from 1 - 7, where low values represent extremely poor well-being and high values represent extremely good well-being. We will assume the scale – and thus each item in the scale – has a mean of M = 4.85 and a standard deviation of SD = 0.76. I will say that this set of variables will be correlated with each other to a degree that implies excellent scale reliability, Cronbach’s \(\alpha\) = .90.
Physical Health (sleep, nutrition, exercise). Sleep and exercise are expected to be single-item, Poisson-distributed measures of how many minutes of sleep people got the previous night and how many minutes they exercised that day. We expect that people will sleep an average of 7.3 hours per night (438 minutes) and exercise about 20 minutes per day. However, the \(nutrition\) variable is a calculated score that combines things like servings per food group, total calories, and dietary fibre. This imaginary nutrition score will be an approximately Normally distributed variable that ranges from 0 to 100. Altogether, we will generate these three measured variables of the physical health latent variable assuming a good internal consistency of Cronbach’s \(\alpha\) = .83.
Mental Health (mood, relationships, enrichment). All of the “mental health” variables are assumed to themselves be composite scores built from many different variables. To make things easy on myself, I decided to think of each measured variable for mental health as an average of 10 items, which means that I can simply simulate a normal distribution for each one and then round to the tenths place to mimic the numbers that would result from averaging a 10-item Likert scale. I will simulate the three measured variables for the mental health construct using the same internal consistency as I used for the measured variables of physical health (Cronbach’s \(\alpha\) = .83).
2.3 Define relationships between variables
We need to define the relationships between all variables.
Structural Model Predictions. The relationships between measured variables from different latent variables will be represented with the same correlations as the expected standardized path estimate for the structural model. I will simulate a structural model that has a path of \(\beta\) = .55 from physical health to well-being and \(\beta\) = .40 from mental health to well-being. I anticipate the correlation (i.e., standardized estimate of the covariance path) between physical health and mental health to be about \(\beta\) = .30. This creates the values you see in Table 4a.
 | Well-Being | Physical Health | Mental Health |
---|---|---|---|
Well-Being | 1 | .55 | .40 |
Physical Health | | 1 | .30 |
Mental Health | | | 1 |
Measurement Model Predictions. Each latent variable defined above has its own measurement model comprised of either three (mental health and physical health) or four (well-being) measured variables. For each measurement model, I will assume that all manifest variables are correlated with each other to some average degree. So, I will simulate each set of variables to have an average correlation that matches the Cronbach’s \(\alpha\) values I defined in step 2.2.
I used Cronbach’s \(\alpha\) to estimate a good value for the average correlations between measured variables in each measurement model, because the equation for the standardized Cronbach’s \(\alpha\) includes the average inter-item correlation, \(\bar{r}\), and the number of items, \(K\). To calculate the average correlation, we must solve the equation for Cronbach’s standardized \(\alpha\) for the average inter-item correlation, \(\bar{r}\).
\[\alpha = {K\bar{r}\over{1 + (K-1)\bar{r}}}\] \[\alpha = {K\over{{1\over{\bar{r}}} + (K-1)}}\] \[\alpha({1\over{\bar{r}}} + (K-1)) = {K}\] \[{1\over{\bar{r}}} + (K-1) = {K\over\alpha}\] \[{1\over{\bar{r}}} = {K\over\alpha}-(K-1)\] \[1 = ({K\over\alpha}-(K-1))\bar{r}\] \[{1\over{{K\over\alpha}-(K-1)}} = \bar{r}\]
Now, we can use this equation to solve for the average correlation for variables within each measurement model.
Latent Variable | Predicted Cronbach’s \(\alpha\) | Number of Items, \(K\) | \(\bar{r} = {1\over{{K\over{\alpha}}-(K-1)}}\) |
---|---|---|---|
Well-Being | .90 | 4 | 0.6923 |
Physical Health | .83 | 3 | 0.6194 |
Mental Health | .83 | 3 | 0.6194 |
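The rearranged equation can also be wrapped in a small helper function to reproduce the values in Table 4b (the function name is just illustrative):
#--- Average Inter-Item Correlation Implied by a Standardized Cronbach's Alpha ----
average_inter_item_r = function(alpha, K) {
  1 / (K / alpha - (K - 1))
}
average_inter_item_r(alpha = .90, K = 4) # Well-Being: ~ .6923
average_inter_item_r(alpha = .83, K = 3) # Physical and Mental Health: ~ .6194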
Bringing these decisions together, we are ready to generate a big correlation matrix that combines the information from Tables 4a and 4b (Table 5). The correlations between variables that belong to the same measurement model match \(\bar{r}\) from Table 4b, and the correlations between variables that belong to different measurement models correspond to the standardized paths between latent variables in the structural model, defined as correlations in Table 4a.
 | WB1 | WB2 | WB3 | WB4 | sleep | nutrition | exercise | mood | relationships | enrichment |
---|---|---|---|---|---|---|---|---|---|---|
WB1 | 1 | .6923 | .6923 | .6923 | .55 | .55 | .55 | .40 | .40 | .40 |
WB2 | | 1 | .6923 | .6923 | .55 | .55 | .55 | .40 | .40 | .40 |
WB3 | | | 1 | .6923 | .55 | .55 | .55 | .40 | .40 | .40 |
WB4 | | | | 1 | .55 | .55 | .55 | .40 | .40 | .40 |
sleep | | | | | 1 | .6194 | .6194 | .30 | .30 | .30 |
nutrition | | | | | | 1 | .6194 | .30 | .30 | .30 |
exercise | | | | | | | 1 | .30 | .30 | .30 |
mood | | | | | | | | 1 | .6194 | .6194 |
relationships | | | | | | | | | 1 | .6194 |
enrichment | | | | | | | | | | 1 |
3. Decide on a sample size for the simulation
I will base my sample size for the simulated data on the general guideline of having 10 observations per estimated parameter in the model (Kline, 1998). Counting the parameters: 2 structural paths + 1 covariance path + (10 measurement paths - 3 loadings fixed to 1, one per measurement model) + 3 latent variable variances/disturbances + 10 disturbances for the measured variables = 23 parameters to estimate. Therefore, I will simulate a sample ten times that size (N = 230).
Taking together the information in Tables 3 and 5, I am ready to simulate a set of correlated variables for an SEM analysis.
The easiest way to simulate correlated variables in R is to use the \(mvrnorm()\) function in the \(MASS\) package. \(mvrnorm()\) wants you to give it the means of each variable (Table 3) and a covariance matrix, for which we will pass the correlation matrix from Table 5.
#--- Load MASS Package ----
# install.packages("MASS")
library(MASS)
First, I create a matrix in R that represents Table 5.
#--- Create Correlation Matrix of Population Values ----
correlation_matrix = matrix(c(1, .6923, .6923, .6923, .55, .55, .55, .40, .40, .40,
.6923, 1, .6923, .6923, .55, .55, .55, .40, .40, .40,
.6923, .6923, 1, .6923, .55, .55, .55, .40, .40, .40,
.6923, .6923, .6923, 1, .55, .55, .55, .40, .40, .40,
.55, .55, .55, .55, 1, 0.6194, 0.6194, .30, .30, .30,
.55, .55, .55, .55, 0.6194, 1, 0.6194, .30, .30, .30,
.55, .55, .55, .55, 0.6194, 0.6194, 1, .30, .30, .30,
.40, .40, .40, .40, .30, .30, .30, 1, 0.6194, 0.6194,
.40, .40, .40, .40, .30, .30, .30, 0.6194, 1, 0.6194,
.40, .40, .40, .40, .30, .30, .30, 0.6194, 0.6194, 1),
nrow = 10)
#--- Inspect Correlation Matrix ----
print(correlation_matrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1.0000 0.6923 0.6923 0.6923 0.5500 0.5500 0.5500 0.4000 0.4000 0.4000
## [2,] 0.6923 1.0000 0.6923 0.6923 0.5500 0.5500 0.5500 0.4000 0.4000 0.4000
## [3,] 0.6923 0.6923 1.0000 0.6923 0.5500 0.5500 0.5500 0.4000 0.4000 0.4000
## [4,] 0.6923 0.6923 0.6923 1.0000 0.5500 0.5500 0.5500 0.4000 0.4000 0.4000
## [5,] 0.5500 0.5500 0.5500 0.5500 1.0000 0.6194 0.6194 0.3000 0.3000 0.3000
## [6,] 0.5500 0.5500 0.5500 0.5500 0.6194 1.0000 0.6194 0.3000 0.3000 0.3000
## [7,] 0.5500 0.5500 0.5500 0.5500 0.6194 0.6194 1.0000 0.3000 0.3000 0.3000
## [8,] 0.4000 0.4000 0.4000 0.4000 0.3000 0.3000 0.3000 1.0000 0.6194 0.6194
## [9,] 0.4000 0.4000 0.4000 0.4000 0.3000 0.3000 0.3000 0.6194 1.0000 0.6194
## [10,] 0.4000 0.4000 0.4000 0.4000 0.3000 0.3000 0.3000 0.6194 0.6194 1.0000
Second, I create an array that holds the means for each variable that we will be simulating, which I took from Table 3. Note that because we are passing \(mvrnorm()\) a correlation matrix as its covariance matrix, every variable is simulated with an SD of 1, so the SDs that I provided in Table 3 don’t end up making any difference in the simulation (I included them just to call your attention to this fact). A sketch of how you could bring the SDs back in appears after the next code chunk.
#--- Make an Array That Holds the Means Defined in Table 3 ----
means_array = c(3.15, 3.15, 3.15, 3.15, # Well-being variables
438, 68.3, 20, # Physical health variables
4.6, 5.3, 3.8) # Mental health variables
#--- Inspect Means Array ----
print(means_array)
## [1] 3.15 3.15 3.15 3.15 438.00 68.30 20.00 4.60 5.30 3.80
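If you did want the Table 3 SDs to matter, you would pass \(mvrnorm()\) a covariance matrix instead of the correlation matrix. Here is a minimal sketch of that conversion, shown only for completeness and not used in this example; the SDs assumed for the two Poisson variables are \(\sqrt{\lambda}\).
#--- Optional: Turn the Correlation Matrix Into a Covariance Matrix ----
sds_array = c(0.76, 0.76, 0.76, 0.76,   # Well-being variables
              sqrt(438), 8.3, sqrt(20), # Physical health variables
              1.1, .75, .85)            # Mental health variables
covariance_matrix = diag(sds_array) %*% correlation_matrix %*% diag(sds_array)
# covariance_matrix could then be passed to mvrnorm() as the Sigma argument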
Finally, I simulate the data!
#--- Set the Random Seed ----
set.seed(3240692)
#--- Simulate Correlated Data ----
simulated_correlated_data = mvrnorm(n = 230, mu = means_array, Sigma = correlation_matrix)
#--- Turn these data into a data frame and name the columns ----
simulated_correlated_data = as.data.frame(simulated_correlated_data)
#--- Name the columns according to our variables names ----
names(simulated_correlated_data) = c("WB1", "WB2", "WB3", "WB4", "sleep", "nutrition", "exercise", "mood", "relationships", "enrichment")
#--- Inspect the Data ----
head(simulated_correlated_data)
## WB1 WB2 WB3 WB4 sleep nutrition exercise mood
## 1 3.974332 2.735126 3.217782 3.082822 438.4458 68.01458 20.88846 6.136803
## 2 2.257772 2.781725 2.232060 3.095962 437.5200 68.81524 19.15199 4.117323
## 3 2.500930 2.287801 2.954376 2.778039 437.6046 67.98726 20.06814 3.929887
## 4 1.341194 2.149544 2.096604 2.424412 436.2743 67.72909 19.04979 5.191252
## 5 2.664268 2.503059 2.784414 2.720353 436.6667 68.66780 18.90414 4.174824
## 6 4.733888 3.692419 5.451466 4.146647 438.8843 69.42985 21.93486 6.421581
## relationships enrichment
## 1 6.577405 3.450505
## 2 4.910892 3.448284
## 3 4.102914 2.958373
## 4 5.576063 3.546485
## 5 4.473924 4.583672
## 6 6.611260 4.854629
Now that I have a good correlation matrix, I will transform the two Poisson-distributed variables (i.e., \(sleep\) and \(exercise\)) from this normal distribution into the Poisson. To do that, I first need to identify the quantile of each observation and then find the Poisson value that would be equivalent to that quantile when \(\lambda\) was defined according to Table 3. I also need to ensure that all minimum and maximum values are adhered to. Finally, I need to round the well-being and physical health variables to be integers and round the mental health variables to the tenths place to mimic averages of 10-item scales.
#--- Identify the Quantile of each simulated Observation ----
simulated_correlated_data = mutate(simulated_correlated_data,
sleep_quantile = pnorm(sleep, mean = mean(sleep), sd = sd(sleep)),
sleep = qpois(sleep_quantile, lambda = 438),
exercise_quantile = pnorm(exercise, mean = mean(exercise), sd = sd(exercise)),
exercise = qpois(exercise_quantile, lambda = 20),
WB1 = round(WB1),
WB2 = round(WB2),
WB3 = round(WB3),
WB4 = round(WB4),
nutrition = round(nutrition),
mood = round(mood, 1),
relationships = round(relationships, 1),
enrichment = round(enrichment, 1),
sleep = ifelse(sleep < 0, 0, sleep),
exercise = ifelse(exercise < 0, 0, exercise),
nutrition = ifelse(nutrition < 0, 0, nutrition),
nutrition = ifelse(nutrition > 100, 100, nutrition),
WB1 = ifelse(WB1 < 1, 1, ifelse(WB1 > 7, 7, WB1)),
WB2 = ifelse(WB2 < 1, 1, ifelse(WB2 > 7, 7, WB2)),
WB3 = ifelse(WB3 < 1, 1, ifelse(WB3 > 7, 7, WB3)),
WB4 = ifelse(WB4 < 1, 1, ifelse(WB4 > 7, 7, WB4)),
mood = ifelse(mood < 1, 1, ifelse(mood > 7, 7, mood)),
relationships = ifelse(relationships < 1, 1, ifelse(relationships > 7, 7, relationships)),
enrichment = ifelse(enrichment < 1, 1, ifelse(enrichment > 7, 7, enrichment))
)
simulated_correlated_data = dplyr::select(simulated_correlated_data, - sleep_quantile, - exercise_quantile)
#--- Inspect the Post-Processed Data ----
head(simulated_correlated_data)
## WB1 WB2 WB3 WB4 sleep nutrition exercise mood relationships enrichment
## 1 4 3 3 3 450 68 25 6.1 6.6 3.5
## 2 2 3 2 3 430 69 16 4.1 4.9 3.4
## 3 3 2 3 3 431 68 20 3.9 4.1 3.0
## 4 1 2 2 2 403 68 15 5.2 5.6 3.5
## 5 3 3 3 3 411 69 15 4.2 4.5 4.6
## 6 5 4 5 4 459 69 31 6.4 6.6 4.9
That looks about right!
Finally, we are ready to test the simulated data. We will use the \(lavaan\) package to run an SEM using our simulated data. We will load that package and then estimate our SEM. I also use the \(semPlot\) package to plot our model and see if it matches the hand-drawn picture above.
#--- Load the Lavaan Package ----
# install.packages("lavaan", "semPlot")
library(lavaan)
library(semPlot)
#--- Define the Model ----
sem_model = '
WELL_BEING =~ WB1 + WB2 + WB3 + WB4
PHYSICAL_HEALTH =~ sleep + exercise + nutrition
MENTAL_HEALTH =~ mood + relationships + enrichment
WELL_BEING ~ PHYSICAL_HEALTH + MENTAL_HEALTH
PHYSICAL_HEALTH ~~ MENTAL_HEALTH
'
#--- Estimate the Model ----
sem_analysis = sem(sem_model, data= simulated_correlated_data)
## Warning in lav_model_estimate(lavmodel = lavmodel, lavpartable = lavpartable, :
## lavaan WARNING: the optimizer warns that a solution has NOT been found!
Notice that there is a convergence problem! We receive a message from R saying that convergence was not achieved. Indeed, looking at the output of the model, many parameters are not estimated.
#--- Inspect the Results ----
summary(sem_analysis, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-6 did NOT end normally after 1644 iterations
## ** WARNING ** Estimates below are most likely unreliable
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 23
##
## Number of observations 230
##
## Model Test User Model:
##
## Test statistic NA
## Degrees of freedom NA
## Warning in .local(object, ...): lavaan WARNING: fit measures not available if model did not converge
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## WELL_BEING =~
## WB1 1.000 0.750 0.765
## WB2 0.975 NA 0.731 0.741
## WB3 0.988 NA 0.741 0.786
## WB4 1.037 NA 0.778 0.793
## PHYSICAL_HEALTH =~
## sleep 1.000 10.488 0.500
## exercise 0.215 NA 2.259 0.501
## nutrition 0.048 NA 0.499 0.481
## MENTAL_HEALTH =~
## mood 1.000 0.715 0.734
## relationships 1.060 NA 0.758 0.769
## enrichment 1.129 NA 0.807 0.806
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## WELL_BEING ~
## PHYSICAL_HEALT -8.317 NA -116.307 -116.307
## MENTAL_HEALTH 122.845 NA 117.192 117.192
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PHYSICAL_HEALTH ~~
## MENTAL_HEALTH 7.512 NA 1.001 1.001
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .WB1 0.398 NA 0.398 0.415
## .WB2 0.438 NA 0.438 0.451
## .WB3 0.340 NA 0.340 0.382
## .WB4 0.357 NA 0.357 0.371
## .sleep 329.639 NA 329.639 0.750
## .exercise 15.237 NA 15.237 0.749
## .nutrition 0.828 NA 0.828 0.769
## .mood 0.437 NA 0.437 0.461
## .relationships 0.396 NA 0.396 0.408
## .enrichment 0.351 NA 0.351 0.350
## .WELL_BEING 16.749 NA 29.780 29.780
## PHYSICAL_HEALT 109.998 NA 1.000 1.000
## MENTAL_HEALTH 0.512 NA 1.000 1.000
One thing I notice in the summary output is that two of our variables, sleep and nutrition, are on much larger scales than the others (sleep is measured in hundreds of minutes and nutrition on a 0 - 100 scale). Wildly different variable scales can mess up SEM estimation substantially and even lead to impossible estimates. All the other variables are on scales in the ones range (e.g., 1 - 7), and I want all variables to be in that range. So, in my draft analysis syntax, I will rescale the simulated variables: I am going to divide sleep by 100 and nutrition by 10. Then, we rerun the analysis.
#--- Constrict Variables with Large Scales ----
simulated_correlated_data = simulated_correlated_data %>%
mutate(sleep = sleep/100,
nutrition = nutrition/10)
After doing those transformations, we try to estimate the SEM again.
#--- Estimate the Model Again ----
sem_analysis = sem(sem_model, data= simulated_correlated_data)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate
#--- Inspect the Results ----
summary(sem_analysis, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-6 ended normally after 45 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 23
##
## Number of observations 230
##
## Model Test User Model:
##
## Test statistic 35.584
## Degrees of freedom 32
## P-value (Chi-square) 0.303
##
## Model Test Baseline Model:
##
## Test statistic 1156.693
## Degrees of freedom 45
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.997
## Tucker-Lewis Index (TLI) 0.995
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2147.708
## Loglikelihood unrestricted model (H1) -2129.916
##
## Akaike (AIC) 4341.415
## Bayesian (BIC) 4420.491
## Sample-size adjusted Bayesian (BIC) 4347.595
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.022
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.055
## P-value RMSEA <= 0.05 0.908
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.026
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## WELL_BEING =~
## WB1 1.000 0.771 0.775
## WB2 0.973 0.084 11.543 0.000 0.751 0.750
## WB3 0.985 0.080 12.279 0.000 0.760 0.793
## WB4 1.034 0.083 12.398 0.000 0.798 0.800
## PHYSICAL_HEALTH =~
## sleep 1.000 0.165 0.789
## exercise 20.747 1.897 10.935 0.000 3.425 0.760
## nutrition 0.460 0.043 10.585 0.000 0.076 0.733
## MENTAL_HEALTH =~
## mood 1.000 0.716 0.735
## relationships 1.105 0.099 11.164 0.000 0.791 0.803
## enrichment 1.181 0.103 11.474 0.000 0.845 0.844
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## WELL_BEING ~
## PHYSICAL_HEALT 2.787 0.378 7.371 0.000 0.597 0.597
## MENTAL_HEALTH 0.398 0.078 5.133 0.000 0.370 0.370
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PHYSICAL_HEALTH ~~
## MENTAL_HEALTH 0.059 0.011 5.254 0.000 0.497 0.497
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .WB1 0.396 0.046 8.626 0.000 0.396 0.400
## .WB2 0.438 0.049 8.931 0.000 0.438 0.437
## .WB3 0.341 0.041 8.354 0.000 0.341 0.372
## .WB4 0.358 0.044 8.235 0.000 0.358 0.360
## .sleep 0.017 0.002 7.201 0.000 0.017 0.377
## .exercise 8.597 1.100 7.816 0.000 8.597 0.423
## .nutrition 0.005 0.001 8.269 0.000 0.005 0.463
## .mood 0.436 0.051 8.526 0.000 0.436 0.459
## .relationships 0.344 0.048 7.196 0.000 0.344 0.355
## .enrichment 0.288 0.048 6.013 0.000 0.288 0.287
## .WELL_BEING 0.171 0.036 4.738 0.000 0.288 0.288
## PHYSICAL_HEALT 0.027 0.004 6.579 0.000 1.000 1.000
## MENTAL_HEALTH 0.512 0.084 6.071 0.000 1.000 1.000
#--- Plot the Model ---
semPaths(sem_analysis, what = "std", rotation = 2)
Now the model fits well! This all looks good. A bonus is that you have seen an example of how simulating data and testing it with your intended analysis script allows you to debug that script before you even start collecting data! 💡
One way or the other, you have now simulated data for a structural equation model that actually fits. You can call this job done! ✅
#--- Export Simulated Data Set for SEM ----
write_csv(simulated_correlated_data, "Simulated SEM Dataset.csv")
Being able to simulate data is a liberating skill for any psychological scientist. It is practical, at the very least, and can become an integral component of your research program, at the most. Simulated data is useful for important things like testing draft analysis scripts and conducting power analyses.
I did my best to simulate data that would be as similar as possible to the data I would get from observing the natural world. Sometimes this is harder than others, because research tasks are, almost by definition, highly specialized. So, do your best; it is OK if some of your variables are not exactly as they would be if you collected real data – how much these deviations matter really depends on your goals.
Hopefully, the things you learned in this tutorial provide the building blocks you need to begin simulating data of your own! 🧙
If you did want to model covariance between predictors, then combine the skills you learn in Example 4 with this example.↩︎
Another simple solution is demonstrated for the dependent variable in this Example and again in Example 4. However, the best solution is to simulate new values for all implausible values until all values are plausible (e.g., to “throw back” the implausible value and simulate a new value to take its place).↩︎
If the predictors were correlated, then this sum would also need a term for each correlated pair of predictors (i.e., \(2 b_j b_k r_{jk}\)).↩︎