Section 5 Structuring

Data structuring is the process of correcting, rearranging, or removing inaccurate records from raw data so that, after the treatment, the transformed data will be easy to analyze. The variable names, types, and values must be consistent and uniform. The focus here is on the appearance of the data. We often restructure data so that we can merge two dataframes with pairable attributes, combine several sources describing the same event, or prepare data for intense statistical analysis that requires consistent and complete data. Whatever the purpose, having clean data is a good habit regardless of application or functionality. If somebody else picks up your project, you do not want to have to explain to them the complications and intricacies of messy data. The first stage of structuring is to make sure that your read-in process gave you data of the type that you anticipated or intended. Before we can do that, we should learn how to access specific parts of the data for examination.

5.1 Accessing Data in R

Once you have read in your data, you may want to look, change, or perform calculations on a specific part or element. Keep in mind that R uses 1-indexing instead of 0-indexing, which other languages like Python employ. 1-indexing means that you start counting from the number 1 instead of 0. To access a range of values in a dataframe, you can specify a set of numbers or column names. Let’s create a dataframe below so that you can get a sense of how this works.

x <- data.frame(replicate(5, sample(1:10,20,rep=TRUE)))
names(x) <- c("One", "Two", "Three", "Four", "Five")

What we’ve just done above is create a dataframe with 5 columns, each of which has 20 rows populated by a random number between 1 and 10. The names of these columns are One, Two, Three, Four, and Five respectively. Let’s say that we want to access just the first three rows of the third and fourth columns. We can do that one of two ways:

# Accessing a part of the data
x[1:3, 3:4]
x[1:3, c("Three", "Four")]

The first line shows how we use indexing to access data. The rows are listed first, then after a comma the columns are listed. If we want a range of values, we use ‘:’ to indicate we want every number between the two values. 1:4 means we want rows 1, 2, 3, and 4 while just typing 4 means we want just row 4. We use the same logic for columns. However, if we want to access columns by name, we can generate a vector within c(), which is the R shorthand way of writing a vector. Each value within c() is treated as a single variable, separated by commas. In the example, we typed c("Three", "Four") to get the third and fourth columns. Whatever the names of the columns that you want are, you can use this method to access them. You can also use c() to get a non-consecutive set of numbers. Say we want the first, third, and fifth column. We could use c(1,3,5) to get just these columns.

Let’s say we just want one column from a dataframe to use for examination or manipulation. We can use the $ character to clarify which one we want by name. You will see this all over R scripts, as it is very common to want to access a particular column or item in a list in its entirety.

# Accessing just the fifth column
x$Five
x$Five[6:8]

The first command gives us just the fifth column of the frame. The second command gives us the sixth to eighth row of the fifth column, using the same square bracket [] indexing as before. Typically, square brackets are reserved for indexing, and parentheses are used for functions.

One final method of subsetting is throuch exclusion. You can type - in front of a number if you want to access everything but a single column. For example, instead of typing c(1,2,3,5) to get all columns except for the fourth, we can write -4.

# These are equivalent statements
x[,c(1,2,3,5)]
x[,-4]

5.2 Data types

One type of anomaly that we may encounter is the coercion of irrelevant data types to variables. This is very common for numerically coded variables, or variables that have distinct levels. First we read in our data by the method we practiced in the earlier chapter:

If we read in the same SPSS data from the Reading data section, we get the coded values instead of the labels.

## # A tibble: 6 x 9
##   Zone     Q4           Q5    Q6           Q7     Q10   Q50     Q51          Q59
##   <chr> <dbl>    <dbl+lbl> <dbl>    <dbl+lbl> <dbl+l> <dbl> <dbl+l>    <dbl+lbl>
## 1 A         2 3                0 3 [Moderate…  2 [No]  1928 1 [Mal…  4 [$70,000…
## 2 A         1 4                0 3 [Moderate…  2 [No]  1962 1 [Mal… NA          
## 3 A         3 4                0 3 [Moderate…  2 [No]  1931 2 [Fem…  6 [Over $2…
## 4 A         3 6                1 1 [Fully Pr…  2 [No]  1950 1 [Mal…  5 [$100,00…
## 5 A         2 1 [Not Worr…     0 2 [Very Pre…  2 [No]  1948 1 [Mal…  5 [$100,00…
## 6 A         5 4                0 2 [Very Pre…  2 [No]  1938 2 [Fem… NA

It appears as though some of the categorical data was read in as integers. This is a quirk of R, that if left unchanged can drastically change the outcome of your analyses. To access the specific type of data of one column, we can use the class() function, then pass the name of the frame and column we are trying to access using the technique outlined above:

class(data$Q6)
## [1] "numeric"

Q6 is meant to be categorical, not numeric. If we run summary(data), we get this unintended result which affects several columns:

## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
##       Zone            Q4               Q5              Q6        
##  Min.   : NA    Min.   : 0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: NA    1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : NA    Median : 2.000   Median :2.000   Median :0.0000  
##  Mean   :NaN    Mean   : 2.537   Mean   :1.617   Mean   :0.4191  
##  3rd Qu.: NA    3rd Qu.: 3.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   : NA    Max.   :20.000   Max.   :2.000   Max.   :4.0000  
##  NA's   :1130   NA's   :134      NA's   :968     NA's   :116     
##        Q7             Q10             Q50            Q51            Q59       
##  Min.   :1.000   Min.   :1.000   Min.   :  19   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1944   1st Qu.:1.00   1st Qu.:3.000  
##  Median :3.000   Median :2.000   Median :1955   Median :2.00   Median :4.000  
##  Mean   :2.674   Mean   :1.796   Mean   :1944   Mean   :1.55   Mean   :3.715  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1966   3rd Qu.:2.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :2.000   Max.   :1992   Max.   :2.00   Max.   :6.000  
##  NA's   :114     NA's   :123     NA's   :47     NA's   :38     NA's   :112

Q4 and Q50 are the only variables that are supposed to be numeric, but here everything is treated as numeric, even when it was originally categorical data, which is incorrect. We might also want to treat Zone as a factor so that we can examine groups collectively based on this metric.

We can easily convert data types into factors using either dplyr::mutate_at() and applying as.factor function to the variables. Examples of both are given below:

# Converting data types
updated_data <- data %>% mutate_at(vars(-Q4, -Q6, -Q50), as_factor)

# Alternately, for specific columns
data$Q6 <- as.factor(data$Q6)

And now we can get the full summary statistics in the form that we want:

##  Zone          Q4               Q5            Q6        
##  A:684   Min.   : 0.000   5      :244   Min.   :0.0000  
##  B:446   1st Qu.: 2.000   4      :211   1st Qu.:0.0000  
##          Median : 2.000   3      :169   Median :0.0000  
##          Mean   : 2.537   6      :129   Mean   :0.4191  
##          3rd Qu.: 3.000   2      :104   3rd Qu.:1.0000  
##          Max.   :20.000   (Other):162   Max.   :4.0000  
##          NA's   :134      NA's   :111   NA's   :116     
##                    Q7        Q10           Q50           Q51     
##  Fully Prepared     : 93   Yes :205   Min.   :  19   Male  :491  
##  Very Prepared      :326   No  :802   1st Qu.:1944   Female:601  
##  Moderately Prepared:438   NA's:123   Median :1955   NA's  : 38  
##  A Little Prepared  :137              Mean   :1944               
##  Not at all Prepared: 22              3rd Qu.:1966               
##  NA's               :114              Max.   :1992               
##                                       NA's   :47                 
##                 Q59     
##  Less than $15,000: 81  
##  $15,000-$39,999  :169  
##  $40,000-$69,999  :215  
##  $70,000-$99,999  :190  
##  $100,000-$199,999:220  
##  Over $200,000    :143  
##  NA's             :112

As we can see from the summary, there might still be anomalies with the variables:

  • Q4: Number of storms experienced: The mean value is 2.5 but the max value is 20, which suggests heavily right skewed data.
  • Q50: Birth year: Some respondent answered 19 which is incorrect. It’s pretty clearly a typo, because nobody is 2,001 years old. We may also consider transforming this column into age instead of birth year, which makes it easier to interpret.

Perpetually in data analysis, the issue of missing values and how to deal with them comes up. We can see here that this dataset is no different, with several NA values across the board. We will talk at length about missing data later on in the next chapter on Data Cleaning.

5.3 Inspecting the data

In order to structure a dataset, we need to not only detect the anomalies within the data, but also decide what to do with them. Such anomalies can include values that are stored in the wrong format (ex: a number stored as a string), values that fall outside of the expected range (ex: outliers), values with inconsistent patterns (ex: dates stored as mm/dd/year vs dd/mm/year), trailing spaces in strings (ex: data vs data), or other logistical issues.

With our data successfully read in above, we can examine both structure and summary statistics for numerical and categorical variables.

# Structure of the data
str(data[,1:3])
## tibble [1,130 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Zone: chr [1:1130] "A" "A" "A" "A" ...
##   ..- attr(*, "format.spss")= chr "A9"
##   ..- attr(*, "display_width")= int 1
##  $ Q4  : num [1:1130] 2 1 3 3 2 5 3 5 1 2 ...
##   ..- attr(*, "label")= chr "Q4. Since the beginning of 2009, how many hurricanes and tropical storms, if any, hit your city or town on or n"| __truncated__
##   ..- attr(*, "format.spss")= chr "F2.0"
##   ..- attr(*, "display_width")= int 2
##  $ Q5  : dbl+lbl [1:1130]  3,  4,  4,  6,  1,  4,  6,  4,  3,  6,  7,  5,  7,  ...
##    ..@ label      : chr "Q5. Generally speaking, when a hurricane or tropical storm is approaching your city or town, how worried do you"| __truncated__
##    ..@ format.spss: chr "F1.0"
##    ..@ labels     : Named num [1:2] 1 7
##    .. ..- attr(*, "names")= chr [1:2] "Not Worried At All" "Extremely Worried"
# Summary for a numerical variable
summary(data$Q4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   2.000   2.537   3.000  20.000     134
# Summary for a categorical variable
summary(as_factor(data$Q7))
##      Fully Prepared       Very Prepared Moderately Prepared   A Little Prepared 
##                  93                 326                 438                 137 
## Not at all Prepared                NA's 
##                  22                 114

Here we will look at the output of a few other key functions. Calling head() will by default show you the first six rows of a dataframe. If you want to see more or fewer rows, simply put a comma in after the dataframe’s name and write the number of rows you want to display. Calling tail() will give you the exact opposite, displaying the last n rows of data, with the default again set to six.

# First 10 rows
head(data, 10)
## # A tibble: 10 x 9
##    Zone     Q4           Q5 Q6              Q7     Q10   Q50     Q51         Q59
##    <chr> <dbl>    <dbl+lbl> <fct>    <dbl+lbl> <dbl+l> <dbl> <dbl+l>   <dbl+lbl>
##  1 A         2 3            0     3 [Moderate…  2 [No]  1928 1 [Mal…  4 [$70,00…
##  2 A         1 4            0     3 [Moderate…  2 [No]  1962 1 [Mal… NA         
##  3 A         3 4            0     3 [Moderate…  2 [No]  1931 2 [Fem…  6 [Over $…
##  4 A         3 6            1     1 [Fully Pr…  2 [No]  1950 1 [Mal…  5 [$100,0…
##  5 A         2 1 [Not Worr… 0     2 [Very Pre…  2 [No]  1948 1 [Mal…  5 [$100,0…
##  6 A         5 4            0     2 [Very Pre…  2 [No]  1938 2 [Fem… NA         
##  7 A         3 6            1     3 [Moderate…  2 [No]  1977 2 [Fem… NA         
##  8 A         5 4            0     3 [Moderate…  2 [No]  1964 2 [Fem… NA         
##  9 A         1 3            0     3 [Moderate…  2 [No]  1976 1 [Mal…  3 [$40,00…
## 10 A         2 6            0     2 [Very Pre…  2 [No]  1964 2 [Fem…  6 [Over $…
# Last 10 rows
tail(data, 10)
## # A tibble: 10 x 9
##    Zone     Q4      Q5 Q6                Q7      Q10   Q50      Q51          Q59
##    <chr> <dbl> <dbl+l> <fct>      <dbl+lbl> <dbl+lb> <dbl> <dbl+lb>    <dbl+lbl>
##  1 B         1       2 0     4 [A Little P…  2 [No]   1980 1 [Male] 3 [$40,000-…
##  2 B         2       2 0     3 [Moderately…  2 [No]   1977 2 [Fema… 4 [$70,000-…
##  3 B         4       4 1     2 [Very Prepa…  1 [Yes]  1962 2 [Fema… 2 [$15,000-…
##  4 B         2       5 0     1 [Fully Prep…  2 [No]   1946 1 [Male] 5 [$100,000…
##  5 B        NA       4 <NA>  1 [Fully Prep…  1 [Yes]  1957 2 [Fema… 1 [Less tha…
##  6 B         1       4 1     4 [A Little P…  1 [Yes]  1987 2 [Fema… 6 [Over $20…
##  7 B         2       5 0     3 [Moderately…  2 [No]   1953 1 [Male] 4 [$70,000-…
##  8 B        NA       4 4     2 [Very Prepa…  2 [No]   1973 2 [Fema… 1 [Less tha…
##  9 B         2       5 0     3 [Moderately…  2 [No]   1980 1 [Male] 5 [$100,000…
## 10 B         2       2 0     4 [A Little P…  2 [No]     NA 2 [Fema… 3 [$40,000-…

To find the number of rows and columns your dataframe has, call dim(). To find just the rows, use nrow(). To find just the columns, use ncol(). Calling either names() or colnames() will give you the names of the columns. If your data has row names (which is not always the case), you can use rownames() to access those.

# Dataframe dimensions
dim(data)
## [1] 1130    9
# Total number of rows
nrow(data)
## [1] 1130
# Total number of columns
ncol(data)
## [1] 9
# Column names
names(data) # also colnames(data)
## [1] "Zone" "Q4"   "Q5"   "Q6"   "Q7"   "Q10"  "Q50"  "Q51"  "Q59"

We can also plot the data to visualize the distribution of variables using the dplyr and magrittr

# Plotting the first 5 columns
plot(data[,1:5])

If you use these techniques and see an outlier, whether it be an abnormally large or small number, you can use either the min() or max() function to get the specific value. Because we have missing values, we need to pass the argument na.rm = TRUE to ignore all NA values when calculating these. For several numerical evaluations, R makes sure to explicitly highlight the presence of NA values, which are treated as unknowns. Because R does not know if the missing value could have been the minimum or maximum, it will tell you that this calculation is NA unless you explicitly give it permission to ignore these missing values.

# Examine the problematic values we identified earlier
max(data$Q4, na.rm = TRUE)
## [1] 20
min(data$Q50, na.rm = TRUE)
## [1] 19

5.4 Subsetting and Filtering

We can remove incorrect or missing row values by using dplyr::filter:

# Removing rows where birth year is irrelevant
# Here we decided that all birth year must be greater 1900
updated_data <- data %>% filter(Q50 > 1900)

# Now if we re-run its summary we get the following
summary(updated_data$Q50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1908    1945    1955    1956    1966    1992
# Removing rows with birth year greater than 1900 and missing responses for Q4
updated_data <- data %>% filter(Q50 > 1900, !is.na(Q4))
summary(updated_data$Q50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1908    1944    1954    1955    1965    1990
summary(updated_data$Q4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   2.522   3.000  20.000

We can also select for only the variables that we are interested in using the function dplyr::select:

# Creating a new dataframe with only zone, gender, and income columns
updated_data <- data %>% select(Zone, Q59, Q51)
head(updated_data, 10)
## # A tibble: 10 x 3
##    Zone  Q59   Q51  
##    <fct> <fct> <fct>
##  1 A     4     1    
##  2 A     <NA>  1    
##  3 A     6     2    
##  4 A     5     1    
##  5 A     5     1    
##  6 A     <NA>  2    
##  7 A     <NA>  2    
##  8 A     <NA>  2    
##  9 A     3     1    
## 10 A     6     2
plot(table(updated_data), las=1)

It is also possible to split the dataset into multiple dataframes by number of rows using split():

# To split the dataset into multiple dataframes of 10 rows each
max_number_of_rows_per_dataframe <- 10
total_number_rows_in_the_current_dataset <- nrow(data)
sets_of_10rows_dataframes <- split(data, 
                                   rep(1:ceiling(total_number_rows_in_the_current_dataset/max_number_of_rows_per_dataframe), 
                                       each=max_number_of_rows_per_dataframe, 
                                       length.out=total_number_rows_in_the_current_dataset)
                                   )
# Here are the first 2 dataframes
sets_of_10rows_dataframes[[1]] # or sets_of_10rows_dataframes$`1`
## # A tibble: 10 x 9
##    Zone     Q4 Q5    Q6    Q7    Q10     Q50 Q51   Q59  
##    <fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
##  1 A         2 3     0     3     2      1928 1     4    
##  2 A         1 4     0     3     2      1962 1     <NA> 
##  3 A         3 4     0     3     2      1931 2     6    
##  4 A         3 6     1     1     2      1950 1     5    
##  5 A         2 1     0     2     2      1948 1     5    
##  6 A         5 4     0     2     2      1938 2     <NA> 
##  7 A         3 6     1     3     2      1977 2     <NA> 
##  8 A         5 4     0     3     2      1964 2     <NA> 
##  9 A         1 3     0     3     2      1976 1     3    
## 10 A         2 6     0     2     2      1964 2     6
sets_of_10rows_dataframes[[2]]
## # A tibble: 10 x 9
##    Zone     Q4 Q5    Q6    Q7    Q10     Q50 Q51   Q59  
##    <fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
##  1 A         2 7     2     1     1      1937 2     <NA> 
##  2 A         3 5     0     2     2      1943 1     4    
##  3 A         2 7     0     2     2      1954 2     5    
##  4 A         2 5     0     2     2      1959 2     5    
##  5 A         4 1     <NA>  2     2      1936 2     6    
##  6 A         1 3     1     3     1      1963 1     6    
##  7 A         2 3     1     2     1      1950 2     5    
##  8 A         4 6     0     3     2        NA <NA>  <NA> 
##  9 A         0 4     0     2     2      1941 1     5    
## 10 A        NA <NA>  <NA>  <NA>  <NA>   1952 2     5

5.5 Changing cell values

As we mentioned earlier, it is best if Q50 is stored as an age variable instead of the default birth year. Q50 is a numeric variable and we can simply change it by using dplyr::mutate()

With the filtered data, we can replace all of the values in Q50 with 2020 - Q50 like so:

# Replacing Q50 values to their age in 2020
updated_data <- data %>% mutate(Q50 = 2020 - Q50)
head(updated_data, 10)
## # A tibble: 10 x 9
##    Zone     Q4 Q5    Q6    Q7    Q10     Q50 Q51   Q59  
##    <fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
##  1 A         2 3     0     3     2        92 1     4    
##  2 A         1 4     0     3     2        58 1     <NA> 
##  3 A         3 4     0     3     2        89 2     6    
##  4 A         3 6     1     1     2        70 1     5    
##  5 A         2 1     0     2     2        72 1     5    
##  6 A         5 4     0     2     2        82 2     <NA> 
##  7 A         3 6     1     3     2        43 2     <NA> 
##  8 A         5 4     0     3     2        56 2     <NA> 
##  9 A         1 3     0     3     2        44 1     3    
## 10 A         2 6     0     2     2        56 2     6
# It is also possible to leave Q50 untouched and store the results into a new column
updated_data <- data %>% mutate(age = 2020 - Q50)
head(updated_data, 10)
## # A tibble: 10 x 10
##    Zone     Q4 Q5    Q6    Q7    Q10     Q50 Q51   Q59     age
##    <fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <dbl>
##  1 A         2 3     0     3     2      1928 1     4        92
##  2 A         1 4     0     3     2      1962 1     <NA>     58
##  3 A         3 4     0     3     2      1931 2     6        89
##  4 A         3 6     1     1     2      1950 1     5        70
##  5 A         2 1     0     2     2      1948 1     5        72
##  6 A         5 4     0     2     2      1938 2     <NA>     82
##  7 A         3 6     1     3     2      1977 2     <NA>     43
##  8 A         5 4     0     3     2      1964 2     <NA>     56
##  9 A         1 3     0     3     2      1976 1     3        44
## 10 A         2 6     0     2     2      1964 2     6        56
summary(updated_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   54.00   65.00   64.14   75.00  112.00

For a categorical variable, we use a different function: dplyr::recode_factor() or dplyr::recode(). We will apply this to Q5 as we have noticed in the previous section that not all of its values were labelled from SPSS. Here is its summary:

##    1    2    3    4    5    6    7 NA's 
##   58  101  164  197  232  123   97  104

Looking back at the questionnare, here is how it was phrased:

Because the survey itself does not have labels, the recoding will be up to the user. Here we chose to replace the extreme values with 1 and 7. As mentioned in the documentation: dplyr::recode() will preserve the existing order of levels while changing the values, and dplyr::recode_factor() will change the order of levels to match the order of replacements.

# Recoding Q5
recoded.with.recode <- recode(data$Q5, `Not Worried At All`="1", `Extremely Worried`="7")
summary(recoded.with.recode)
##    1    2    3    4    5    6    7 NA's 
##   58  101  164  197  232  123   97  104
recoded.with.recode_factor <- recode_factor(data$Q5, `Not Worried At All`="1", `Extremely Worried`="7")
summary(recoded.with.recode_factor)
##    1    7    2    3    4    5    6 NA's 
##   58   97  101  164  197  232  123  104

We can also change cell values without external libraries like dplyr by running the following code:

# Add column age where the values are 2020 - Q50
data$age <- 2020 - data$Q50

# Replace Q5 with value "Not Worried At All" to "1"
data$Q5[data$Q5 == "Not Worried At All"] <- 1

Notice above that we used a conditional subset in the second command. Just like we used indexing earlier, we can also use a Boolean vector (True/False) to pick only rows where some condition is met. Within the [] brackets, we gave a statement: data$Q5 == "Not Worried At All" using == to indicate we want True/False for each row. This syntax will return only the entries of Q5 that are True within the condition. We then reassign the value for each of these to be 1 instead of Not Worried At All, effectively doing the same task as we did above. R is smart enough to work with vectors in this way, running the same evaluation on each entry without you specifying.

One final set of functions that can be incredibly helpful with changing cell values are gsub() and grep(). If, for instance, you want to find all entries of a column that have a specific string pattern in them, you can use grep() to locate every instance. Simply type in the pattern, and the frame over which you want the function to look. For instance, if we wanted to find every entry of a column with the word "missing" in it, we could use grep() like so:

# Identify every row that contains the phrase "missing"
grep("missing", x$example)

Now let’s say that instead of simply locating every instance, we instead want to replace them with a substitute phrase. That is when we would use gsub(), which lets us substitute another phrase in. If we wanted to replace every part of every entry that says "CT" with "Connecticut", we could do so:

# Replace every instance of 'CT' with 'Connecticut'
gsub("CT", "Connecticut", x$statesample)

5.6 Pivoting the dataset

In some cases, we may want to split a column based on values, or merge multiple columns into fewer columns. A classic example is with time-stamped data. Sometimes we want rows to be days, with individual responses recorded as columns, and sometimes we want each row to be an individual, with columns representing days. It just depends on what the data are modelling. This process can be done using the tidyr package. For example, to convert the dataframe into long-format with only Zone, question, and value as columns:

# Read in the necessary package
library(tidyr)

# We have to pivot by variable type
# Pivot longer for factor variables
pivoted.longer <- data %>% 
  select_if(is.factor) %>% 
  pivot_longer(-Zone, names_to = "question", values_to = "value")
pivoted.longer
## # A tibble: 6,456 x 3
##    Zone  question value
##    <fct> <chr>    <fct>
##  1 A     Q5       3    
##  2 A     Q6       0    
##  3 A     Q7       3    
##  4 A     Q10      2    
##  5 A     Q51      1    
##  6 A     Q59      4    
##  7 A     Q5       4    
##  8 A     Q6       0    
##  9 A     Q7       3    
## 10 A     Q10      2    
## # … with 6,446 more rows
# Then we can reshape it back to the original
pivoted.wider <- pivoted.longer %>%
  group_by(question) %>% mutate(row = row_number()) %>% 
  pivot_wider(names_from = question, values_from = value) %>% 
  select(-row)
pivoted.wider
## # A tibble: 1,076 x 7
##    Zone  Q5    Q6    Q7    Q10   Q51   Q59  
##    <fct> <fct> <fct> <fct> <fct> <fct> <fct>
##  1 A     3     0     3     2     1     4    
##  2 A     4     0     3     2     1     <NA> 
##  3 A     4     0     3     2     2     6    
##  4 A     6     1     1     2     1     5    
##  5 A     1     0     2     2     1     5    
##  6 A     4     0     2     2     2     <NA> 
##  7 A     6     1     3     2     2     <NA> 
##  8 A     4     0     3     2     2     <NA> 
##  9 A     3     0     3     2     1     3    
## 10 A     6     0     2     2     2     6    
## # … with 1,066 more rows

tidyr::spread() and tidyr::gather() are the outdated equivalent of tidyr::pivot_wider() and tidyr::pivot_longer().

To merge or split columns, we can use tidyr::unite() or tidyr::separate(). For example, to merge Q7 and Q10:

# Creating a new column with responses from both Q7 and Q10
merged <- data %>% unite("Q7_Q10", Q7:Q10, sep = "__", remove = TRUE, na.rm = FALSE)
merged
## # A tibble: 1,076 x 9
##    Zone     Q4 Q5    Q6    Q7_Q10   Q50 Q51   Q59     age
##    <fct> <dbl> <fct> <fct> <chr>  <dbl> <fct> <fct> <dbl>
##  1 A         2 3     0     3__2    1928 1     4        92
##  2 A         1 4     0     3__2    1962 1     <NA>     58
##  3 A         3 4     0     3__2    1931 2     6        89
##  4 A         3 6     1     1__2    1950 1     5        70
##  5 A         2 1     0     2__2    1948 1     5        72
##  6 A         5 4     0     2__2    1938 2     <NA>     82
##  7 A         3 6     1     3__2    1977 2     <NA>     43
##  8 A         5 4     0     3__2    1964 2     <NA>     56
##  9 A         1 3     0     3__2    1976 1     3        44
## 10 A         2 6     0     2__2    1964 2     6        56
## # … with 1,066 more rows
# To split it back
merged %>% separate(Q7_Q10, c("Q7", "Q10"), sep = "__", remove = TRUE)
## # A tibble: 1,076 x 10
##    Zone     Q4 Q5    Q6    Q7    Q10     Q50 Q51   Q59     age
##    <fct> <dbl> <fct> <fct> <chr> <chr> <dbl> <fct> <fct> <dbl>
##  1 A         2 3     0     3     2      1928 1     4        92
##  2 A         1 4     0     3     2      1962 1     <NA>     58
##  3 A         3 4     0     3     2      1931 2     6        89
##  4 A         3 6     1     1     2      1950 1     5        70
##  5 A         2 1     0     2     2      1948 1     5        72
##  6 A         5 4     0     2     2      1938 2     <NA>     82
##  7 A         3 6     1     3     2      1977 2     <NA>     43
##  8 A         5 4     0     3     2      1964 2     <NA>     56
##  9 A         1 3     0     3     2      1976 1     3        44
## 10 A         2 6     0     2     2      1964 2     6        56
## # … with 1,066 more rows

5.7 Merging Two Dataframes

If we want to combine two dataframes based on a shared column, we can do that with a merge call. Let’s say we have two separate dataframes that discuss the same individuals. Each participant was given an ID that was recorded over the two frames, and we want one dataframe with their responses joined.

# Here we will create the simulated data described above
survey1 <- cbind(1:10, data.frame(replicate(10,sample(0:1,1000,rep=TRUE))))
survey2 <- cbind(1:10, data.frame(replicate(10,sample(0:1,1000,rep=TRUE))))

# Name the columns of both frames
names(survey1) <- c("ID", paste0("Q", 1:10))
names(survey2) <- c("ID", paste0("Q", 11:20))

The first column of each dataframe represents the participant’s ID, and the next 10 columns represent their responses to questions. We can use a merge to execute this:

# Merge the two by their 'ID' column
surveytotal <- merge(survey1, survey2, by = "ID")

With the merge command, we can do one of four types: an inner join, a left join, a right join, or a full join. With an inner join, we only include data that have an identification match in both sets. A left join saves all information from the left dataset and puts NA values in if there is not a corresponding row in the right, a right join does the opposite, preserving all of the right frame’s data, and a full join saves all rows from both of the original datasets.

The frame surveytotal now has all questions stored in the same place. If the ID column does not match up in format, this can give you issues, so make sure you only merge once your data is fully cleaned, which we will talk about in the next chapter.

We can also do two simpler combines, known as cbind() or rbind(), if the data do not need to be merged on an ID entry. Say, for example, we had two sheets of an Excel file which were completed on different days and had the exact same set of columns> If we want them to be in the same frame, one after the other, we could use an rbind() call to bind the rows like so:

# Generate the top dataframe
top <- data.frame(replicate(10, sample(0:1,1000,rep=TRUE)))
names(top) <- paste0("Row_", 1:10)

# Generate the bottom dataframe
bottom <- data.frame(replicate(10, sample(0:1,1000,rep=TRUE)))
names(bottom) <- paste0("Row_", 1:10)

# Use rbind() to bring them together
combined <- rbind(top, bottom)

If there are two frames that have the same number of rows that describe in sequence the same observations, we can use a cbind() call to combine them:

# Create the left data
left <- data.frame(replicate(10, sample(0:1,1000,rep=TRUE)))
names(left) <- paste0("Row_", 1:10)

# Create the right data
right <- data.frame(replicate(10, sample(0:1,1000,rep=TRUE)))
names(right) <- paste0("Row_", 11:20)

# Use cbind() to stick them together
combined <- cbind(left, right)

Keep in mind that this is only a valid merge if the two frames are composed of rows that represent the same observations.