Data exploration essay using SAS.

On Canvas you will find five files, a mix of csv files and SAS formatted datafiles. This information has been modified from data sourced from Food.com (Majumder et al 2019). See the image on the last page for guidance on the relationship between the tables.

Use this information to complete the following tasks using SAS code, with the goal of exploring the following question.

Question: Which recipes receive higher user ratings?

To assist you in answering this research question, you will need to complete at least the tasks listed below using SAS code:

1. Read the data into SAS using a data step or using a libref where appropriate. Ensure variables are formatted accordingly where required.

2. Produce a graph that shows the number of ingredients in the recipes in the data set. Ensure the graph is clearly labelled with the correct names and an informative title.

3. Using PROC SQL, determine the average rating given by each user who has reviewed more than 5 recipes. Report the number of users that meet this criterion and the overall average from these users in your report.

4. Create a new variable “Complexity” to distinguish recipes with a low number of steps from those with a high number of steps, where low complexity is fewer than 10 steps and high complexity is 10 or more steps.

5. Use a t test to determine whether the Complexity (created in task 4) leads to a difference in calories per serve, but only for recipes that have more than 15 ratings. Use α=0.03.

6. Include any other relevant analysis of this data source to assist you to understand which recipes receive higher user ratings.

Task objectives

Data exploration analysis using SAS

Customers who visit a restaurant or hotel often have different experiences of the same recipe (Smilansky, 2017). Clients who have been served may therefore be given the opportunity to rate the food, or the recipes in general, on a scale from strongly disagree (1) to strongly agree (5). The ratings received from clients can then be used to identify where the service can be improved.

The current analysis therefore aims to establish which recipes receive higher user ratings. The leading recipes can then serve as a benchmark, so that actions can be put in place to improve the ratings of other recipes. To make the results easy to understand, both graphs and tables are used to present them (Mirman, 2017).

To complete this task, secondary data from the Canvas portal were used to answer the research question. A total of five datasets were used, a mix of CSV files and SAS-formatted data files. The data were modified from information sourced from Food.com (Majumder et al 2019).

For instance, the recipe dataset has 231575 observations and 7 variables. Correlations were also calculated: in the recipe data, the correlation between the number of ingredients and the minutes taken is r = 0.00059, which indicates essentially no linear relationship. The number of ingredients has a mean of 9.05 and a standard deviation of 3.73, with a minimum of 1 and a maximum of 43, as shown in the table below:
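The reading step described above might look like the following minimal sketch. The libref name `food`, the folder path, the dataset name `recipes_demo` and the date layout in the CSV are assumptions made for illustration; only the variable names are taken from the summary tables in this essay. Inline rows stand in for the real file so the step runs on its own.

```sas
/* SAS-formatted files: a libref pointing at the folder that holds them is
   usually enough. The path is a placeholder, so the statement is shown
   inside this comment only:
       libname food "/path/to/canvas/files";                              */

/* CSV files: a DATA step with the DSD option reads comma-separated values.
   With a real file, INFILE would name the CSV and add FIRSTOBS=2 to skip
   the header row. The rows below are made-up stand-ins.                  */
data work.recipes_demo;
    infile datalines dsd truncover;
    /* Assumed date layout: yyyy-mm-dd */
    input id minutes contributor_id submitted :yymmdd10. n_steps n_ingredients;
    format submitted date9.;   /* display the stored date value as a date */
    datalines;
101,55,1001,2005-09-16,11,7
102,30,1002,2002-06-17,9,6
103,130,1003,2005-02-25,6,13
;
run;

/* Check variable types and formats after reading. */
proc contents data=work.recipes_demo;
run;
```

Applying a date format at read time keeps `submitted` readable in later output while leaving the underlying numeric value unchanged.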

N        Mean        Std Dev     Minimum     Maximum
------------------------------------------------------
231575   9.0514477   3.7348796   1.0000000   43.0000000
------------------------------------------------------

Furthermore, the mean of minutes in the recipe dataset is 9401.04 with a standard deviation of 4462560.30, as shown below. Such a large standard deviation indicates very wide variation in the minutes variable (Luo et al., 2016). The maximum of 2147483647 is the largest value a 32-bit signed integer can hold, which suggests that the mean and standard deviation are inflated by at least one implausible entry.

N        Mean      Std Dev      Minimum   Maximum
----------------------------------------------------
231575   9401.04   4462560.30   0         2147483647
----------------------------------------------------

Dataset information

Finally, the minimum, median and maximum of each numeric variable are shown below. The median number of ingredients is 9. Because the median is not distorted by extreme values, it is a good measure of central tendency to use for these variables.

Variable         Minimum      Median       Maximum
----------------------------------------------------
id               38.0000000   207264.00    537716.00
minutes          0            40.0000000   2147483647
contributor_id   27.0000000   173614.00    2002289981
submitted        14462.00     17189.00     21522.00
n_steps          0            9.0000000    145.0000000
n_ingredients    1.0000000    9.0000000    43.0000000
----------------------------------------------------
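Summary tables and a correlation of the kind reported above could be produced along the following lines. This is a sketch on a small made-up dataset (`recipes_demo` is a hypothetical name), not the code behind the reported figures.

```sas
/* Made-up rows standing in for the recipes table. */
data work.recipes_demo;
    input id minutes n_steps n_ingredients;
    datalines;
101 55 11 7
102 30 9 6
103 130 6 13
104 45 11 11
105 10 4 3
106 20 12 9
;
run;

/* N, mean, standard deviation, minimum, median and maximum in one table. */
proc means data=work.recipes_demo n mean std min median max maxdec=2;
    var n_ingredients minutes n_steps;
run;

/* Pearson correlation between number of ingredients and preparation time. */
proc corr data=work.recipes_demo pearson;
    var n_ingredients minutes;
run;
```

Requesting the median alongside the mean makes it easier to spot variables, such as minutes, where a few extreme values pull the mean far from the typical recipe.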

SAS software was used to analyse this dataset and present the findings. Descriptive statistics such as means, percentages and frequencies were used, along with inferential statistics such as t tests to draw conclusions about the population (MacRae, 2019). Unless stated otherwise, tests were evaluated at the 0.05 level of significance, a threshold widely used by social researchers (Greenland, 2018). The significance level is the probability of rejecting a null hypothesis that is in fact true.

The following graph shows the distribution of the number of ingredients in the recipes in the data set.

Figure 1: Distribution of the number of ingredients in the data.

Figure 1 shows that the distribution of the number of ingredients is positively skewed, with a tail to the right. This indicates that the majority of recipes have a relatively small number of ingredients, while comparatively few recipes have a large number (Safford et al., 2015).

To identify the unique users who had reviewed more than 5 recipes, a table was created in SAS and the output exported to Excel with the review count for each user. The results show that 18,556 users had reviewed more than 5 recipes, and the mean rating given by these users was 3.9. Relative to the maximum rating of 5, this suggests that users who review more than 5 recipes tend to give fairly high ratings.
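A labelled histogram of this kind could be drawn with PROC SGPLOT, as in the sketch below. The dataset name `ingredients_demo` and its values are made up for illustration; the title and axis labels are one possible wording.

```sas
/* Made-up ingredient counts with a longer right tail. */
data work.ingredients_demo;
    input n_ingredients @@;
    datalines;
3 4 5 5 6 6 6 7 7 7 7 8 8 8 9 9 10 11 13 16 21
;
run;

title "Distribution of the Number of Ingredients per Recipe";
proc sgplot data=work.ingredients_demo;
    /* One bar per whole number of ingredients, counted as recipes. */
    histogram n_ingredients / binwidth=1 scale=count;
    xaxis label="Number of ingredients";
    yaxis label="Number of recipes";
run;
title;   /* clear the title so it does not carry over to later output */
```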
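The per-user averages can be sketched in PROC SQL as follows. The table name `reviews_demo` and the column names `user_id`, `recipe_id` and `rating` are assumptions, and the rows are made up; the real review table may be named and laid out differently.

```sas
/* Made-up reviews: user 1 has 6 reviews, user 2 has 2, user 3 has 7. */
data work.reviews_demo;
    input user_id recipe_id rating @@;
    datalines;
1 101 5  1 102 4  1 103 5  1 104 3  1 105 4  1 106 5
2 101 2  2 103 4
3 101 4  3 102 4  3 103 3  3 104 5  3 105 4  3 106 4  3 107 5
;
run;

proc sql;
    /* Average rating per user, keeping only users with more than 5 reviews. */
    create table work.user_avg as
    select user_id,
           count(*)     as n_reviews,
           mean(rating) as avg_rating
    from work.reviews_demo
    group by user_id
    having count(*) > 5;

    /* Number of qualifying users and the average of their per-user means. */
    select count(*)         as n_users,
           mean(avg_rating) as overall_avg
    from work.user_avg;
quit;
```

The second query averages the per-user means, so every qualifying user counts equally. Averaging all individual ratings from those users instead would weight prolific reviewers more heavily, and the two figures can differ.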

To carry out the t test, the Recipes and Nutrition datasets were merged on the unique identifier “ID”, after which a new variable, “Complexity”, was created. The following hypotheses were developed and tested:

Null Hypothesis (H₀): There is no statistically significant mean difference in calories by complexity.

Research Hypothesis (H₁): There is a statistically significant mean difference in calories by complexity.

At a significance level of α = 0.03, we reject the null hypothesis and conclude that there is a statistically significant mean difference in calories by complexity (see Output 1 in the Appendices). The calorie data are not normally distributed, which should be borne in mind when interpreting the t test.
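A minimal sketch of the Complexity variable and the t test follows. It assumes the merged data already hold one row per recipe with a precomputed count of ratings; the names `recipe_nutr_demo`, `calories` and `n_ratings` are hypothetical and the rows are made up.

```sas
/* Made-up merged recipe and nutrition rows. */
data work.recipe_nutr_demo;
    input id n_steps calories n_ratings;
    datalines;
1 4 210.5 20
2 12 480.0 31
3 7 150.2 18
4 15 520.7 16
5 9 305.1 42
6 11 390.4 25
7 3 120.0 4
8 . 250.0 22
;
run;

data work.ttest_input;
    set work.recipe_nutr_demo;
    where n_ratings > 15;              /* only recipes with more than 15 ratings */
    length Complexity $4;
    /* A missing numeric value sorts below every number in SAS, so it must be
       handled first or it would be classified as "Low". */
    if missing(n_steps) then Complexity = " ";
    else if n_steps < 10 then Complexity = "Low";
    else Complexity = "High";
run;

/* Two-sample t test of calories by Complexity. ALPHA=0.03 sets the
   confidence level of the reported intervals to 97%; the decision itself
   comes from comparing the printed p-value with 0.03. */
proc ttest data=work.ttest_input alpha=0.03;
    class Complexity;
    var calories;
run;
```

PROC TTEST drops rows with a missing class value, so the recipe with an unknown number of steps does not enter either group. The output includes both the pooled and the Satterthwaite results, along with a test of equal variances that helps in choosing between them.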

For the variable Ingredients (many: 15 or more; few: fewer than 15), the hypotheses developed are:

Null Hypothesis (H₀): There is no statistically significant mean difference in calories per serve between recipes with many and few ingredients.

Research Hypothesis (H₁): There is a statistically significant mean difference in calories per serve between recipes with many and few ingredients.

At a significance level of α = 0.03, we reject the null hypothesis and conclude that there is a statistically significant mean difference in calories per serve between recipes with many and few ingredients (see Output 2 in the Appendices). Here too the data are not normally distributed.

Finally, for the variable Duration (long: 60 or more; short: less than 60), the hypotheses tested are:

Null Hypothesis (H₀): There is no statistically significant mean difference in calories per serve between long and short durations.

Research Hypothesis (H₁): There is a statistically significant mean difference in calories per serve between long and short durations.

At a significance level of α = 0.03, we reject the null hypothesis and conclude that there is a statistically significant mean difference in calories per serve between long and short durations (see Output 3 in the Appendices). Here too the data are not normally distributed.

To understand the recipes with higher ratings, the variation in the ratings was also calculated.

The first and second principal components (PCs), which have the highest eigenvalues, together account for a proportion of 0.5743 (57.43%) of the total sample variation. This implies that, although some ratings are high, there is not much variation in the ratings.

A scatterplot of the number of ingredients against calories per serve indicates no apparent association.

Conclusion

It is generally believed that foods with more calories per serve receive higher ratings. From this analysis, we conclude that recipes with high ratings tend to have short preparation times, a limited number of ingredients and few preparation steps.

References

Greenland, S., 2018. The unconditional information in P-values, and its refutational interpretation via S-values. under submission.

Luo, J., Wu, M., Gopukumar, D. and Zhao, Y., 2016. Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, pp.BII-S31559.

MacRae, A.W., 2019. Descriptive and inferential statistics. Companion Encyclopedia of Psychology: Volume Two, p.1099.

Mirman, D., 2017. Growth curve analysis and visualization using R. Chapman and Hall/CRC.

Safford, B., Api, A.M., Barratt, C., Comiskey, D., Daly, E.J., Ellis, G., McNamara, C., O’Mahony, C., Robison, S., Smith, B. and Thomas, R., 2015. Use of an aggregate exposure model to estimate consumer exposure to fragrance ingredients in personal care and cosmetic products. Regulatory Toxicology and Pharmacology, 72(3), pp.673-682.

Smilansky, S., 2017. Experiential marketing: A practical guide to interactive brand experiences. Kogan Page Publishers.

Watson, R., 2015. Quantitative research. Nursing Standard (2014+), 29(31), p.44.

Cite This Work

My Assignment Help. (2020). Data Exploration Analysis Using SAS Essay.. Retrieved from https://myassignmenthelp.com/free-samples/math1322-introduction-to-statistical-computing.
