Dataset:
The panel dataset contains monthly metrics for apps in the top free and top paid charts of the Google Play app store over nine months (May 2016 to January 2017).
The variables in the dataset are described below.
- log_users: the number of new users of the app per month.
- updates: the number of new updates for the app per month.
- rating: the star rating (between zero and five) the app has received from users up to the given month.
- size: the app file size, in megabytes, in the given month. It serves as a proxy for the complexity and sophistication of the app.
- log_price: the natural logarithm of the app price in the given month. The app price is the amount users must pay before downloading a paid app (the variable price is the non-logged version).
- chart: indicates whether the app is free or paid.
- rank: the app rank in the top chart in the given month.
- category: the app category.
- life_cysle: an indicator of the app age, ranging from the 1st to the 4th stage (inception to maturity of the app life cycle).
- app_id: the unique identifier of the app.
- period: the time-period identifier (at a monthly level) of the panel data.
Note: The log transformation applied to the price variable (log_price) is Ln(x+1), rather than Ln(x), to avoid losing observations with price=0; hence, if the price is zero, the log-transformed value is zero as well, because Ln(0+1)=0. For simplicity, you can interpret the effect size (if needed) as if the transformation were Ln(x).
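As an illustration of the note above, the following minimal Python sketch applies the Ln(x+1) transformation using only the standard library. The function name log_price_from_price is hypothetical, and the prices are made-up examples rather than values from the dataset; the dataset already includes log_price, so this only shows how a zero price maps to zero.

```python
import math


def log_price_from_price(price: float) -> float:
    """Return Ln(price + 1), the transformation described in the note."""
    if price < 0:
        raise ValueError("price cannot be negative")
    # math.log1p(x) computes Ln(1 + x) accurately, including for small x.
    return math.log1p(price)


# Illustrative prices only (not taken from the dataset).
for price in (0.0, 0.99, 4.99):
    print(price, round(log_price_from_price(price), 4))
```

A free app (price zero) therefore stays in the sample with log_price equal to zero instead of being dropped because Ln(0) is undefined.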
Other variables (calendar year, calendar month, and store) are also included. The app name is dropped from the dataset, but the unique identifier (app_id) is included, as explained above. Overall, the objective is to estimate the effect of the number of new updates, app rating, app file size, and being free or paid on the number of app users (while controlling for some factors, as explained below).
- Inspect your data and make sure there are no data entry mistakes in the values of the variables. For example, the price of a paid app should not be zero, and the size of an app cannot be negative. If you detect data entry mistakes, drop those observations.
- Build the needed variables (as explained below); for example, natural logarithm transformations of variables (if needed) or new categorical variables required for the analysis.
- Provide a brief explanation of the methodology, covering the data, the corrected and cleansed sample, the definition of the dependent, independent, and control variables, the objective of the analyses, and the baseline model.
- Provide a two-way table of summary statistics of your model’s numeric variables for the whole sample, Game apps, and non-Game apps (all in one table). Tip: you need to build a variable to distinguish between Game and non-Game apps.
- Provide the correlation matrix of the numeric variables of your model. Briefly discuss the results.
- Apply a statistical test and evaluate whether there is any significant difference (at the 0.05 significance level) between the app categories regarding the number of users (logged). Can you qualitatively support the test results with a graph?
- Inspect the data graphically. Examples include visual summary statistics across subsamples (categories, free vs paid, Game vs non-Game, etc.), the distribution/skewness of the main variables (i.e., the dependent and independent variables), a preliminary check of the relationship between the dependent and independent variables, and the longitudinal trend of the dependent variable (across subsamples). The details and types of graphs are your decision; the objective is to provide a concise yet informative inspection of the data before running the regression. You may pick a few graphs from the list above (or other graphs) that describe various aspects of the data efficiently.
- Conduct an OLS regression to estimate the effect of the number of new updates, app rating, app file size (logged), and being free or paid on the number of new users (logged) in the whole sample, while controlling for app price (logged), app category, and time period. Carefully interpret and discuss the results (e.g., R-squared, the statistical significance of the coefficients, the effect sizes). This will be the baseline model.
- Modify the baseline model to evaluate the differential effect of the number of new updates for high-quality vs low-quality apps. To run your model, you need to build a variable that distinguishes high-quality from low-quality apps. Define low-quality apps as those with a rating below four stars, and high-quality apps as those with a rating of four stars and above. Based on the results, discuss the statistical significance and effect size of the difference. You may use graphical illustrations to enhance your discussion. Based on your experience or understanding of the mobile app context, can you provide a conceptual explanation for the results?
- Modify the baseline model to evaluate the differential effect of app rating across app life cycle stages. You may use graphical illustrations to enhance your discussion.
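The cleaning and variable-building steps in the tasks above can be sketched in plain Python as follows. This is only one possible reading, under stated assumptions: the chart values "free"/"paid", the idea that Game categories start with "GAME", the helper names is_valid and add_variables, the new variable names (log_size, paid, game, high_quality, updates_x_high_quality), and the toy rows are all hypothetical and not taken from the dataset. Treating a zero file size as invalid is also an assumption, made so that Ln(size) is defined.

```python
import math


def is_valid(row: dict) -> bool:
    """Apply the two example data-entry checks named in the first task."""
    # Size cannot be negative; zero is also rejected here (assumption)
    # so that the natural log of size is defined.
    if row["size"] <= 0:
        return False
    # A paid app should not have a zero price.
    if row["chart"] == "paid" and row["price"] <= 0:
        return False
    return True


def add_variables(row: dict) -> dict:
    """Return a copy of the row with the derived variables added."""
    out = dict(row)
    out["log_size"] = math.log(row["size"])
    out["paid"] = 1 if row["chart"] == "paid" else 0
    # Assumption: Game categories are labelled with a "GAME" prefix.
    out["game"] = 1 if row["category"].upper().startswith("GAME") else 0
    # High quality means four stars and above, as defined in the task.
    out["high_quality"] = 1 if row["rating"] >= 4 else 0
    # Interaction term for the differential effect of updates.
    out["updates_x_high_quality"] = row["updates"] * out["high_quality"]
    return out


# Toy rows for illustration only (not real observations).
rows = [
    {"app_id": 1, "chart": "paid", "price": 2.99, "size": 50.0,
     "rating": 4.5, "updates": 2, "category": "GAME_ACTION"},
    {"app_id": 2, "chart": "paid", "price": 0.0, "size": 20.0,
     "rating": 3.9, "updates": 1, "category": "TOOLS"},   # paid, price 0
    {"app_id": 3, "chart": "free", "price": 0.0, "size": -5.0,
     "rating": 4.1, "updates": 0, "category": "TOOLS"},   # negative size
    {"app_id": 4, "chart": "free", "price": 0.0, "size": 12.0,
     "rating": 3.5, "updates": 3, "category": "TOOLS"},
]

clean = [add_variables(r) for r in rows if is_valid(r)]
print([r["app_id"] for r in clean])
```

If updates, high_quality, and updates_x_high_quality all enter the regression, the coefficient on the interaction term captures how the effect of updates differs between high-quality and low-quality apps.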