Geek Logbook

Tech sea log book

Understanding the Quality of a Multiple Linear Regression Model: Analyzing SalaryUSD Predictions

In this blog post, we’ll analyze the quality of linear regression models that predict SalaryUSD from EducationLevel and YearsExperience. As a first step toward a multiple regression, we fit one model per predictor, then look at the significance of each predictor and at how well each model fits the data.

Introduction to Multiple Linear Regression

Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables. The goal is to find the best-fitting linear function (a line with one predictor, a plane with two) that predicts the dependent variable from the independent variables. In our case, we’re interested in predicting the SalaryUSD of individuals based on their EducationLevel and YearsExperience.
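
For the two predictors used in this post, the model described above can be written out as follows. The symbols β and ε are notation introduced here for illustration; they do not appear in the original analysis.

```latex
\mathrm{SalaryUSD}_i = \beta_0 + \beta_1\,\mathrm{EducationLevel}_i + \beta_2\,\mathrm{YearsExperience}_i + \varepsilon_i

\hat{\beta} = \arg\min_{\beta_0,\beta_1,\beta_2} \sum_{i=1}^{n} \left( \mathrm{SalaryUSD}_i - \beta_0 - \beta_1\,\mathrm{EducationLevel}_i - \beta_2\,\mathrm{YearsExperience}_i \right)^2
```

The first line is the model; the second is the ordinary least squares criterion, which picks the coefficients that minimize the sum of squared residuals.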

The Dataset

For this analysis, we’re working with a dataset containing the following variables:

  • SalaryUSD: The salary of the individual in USD (dependent variable).
  • EducationLevel: The level of education attained by the individual.
  • YearsExperience: The number of years of experience the individual has.

We’ll use these variables to build our regression models and evaluate their effectiveness.
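
The post does not include the fitting code or the data, so here is a library-free sketch of what a one-predictor least-squares fit computes. The rows are made up for illustration and are not taken from the dataset; the function names are hypothetical.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up rows, NOT from the post's dataset.
years_experience = [1, 2, 3, 4, 5, 6]
salary_usd = [300, 450, 700, 850, 1200, 1300]

intercept, slope = fit_simple_ols(years_experience, salary_usd)
r2 = r_squared(years_experience, salary_usd, intercept, slope)
print(f"intercept={intercept:.2f} slope={slope:.2f} R2={r2:.3f}")
```

The slope is the covariance of the predictor and the response divided by the variance of the predictor, and the intercept makes the line pass through the point of means.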

Model Building and Analysis

To start, we constructed two simple linear regression models, each using one independent variable (EducationLevel and YearsExperience) to predict SalaryUSD. Here’s a summary of the results for each model:

Model 1: Predicting SalaryUSD Based on EducationLevel

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalaryUSD   R-squared:                       0.197
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                     243.8
Date:                Thu, 04 Jul 2024   Prob (F-statistic):           2.56e-49
Time:                        12:45:35   Log-Likelihood:                -7899.3
No. Observations:                 995   AIC:                         1.580e+04
Df Residuals:                     993   BIC:                         1.581e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            653.3062     62.800     10.403      0.000     530.070     776.543
EducationLevel   310.0718     19.857     15.615      0.000     271.105     349.039
==============================================================================
Omnibus:                      114.671   Durbin-Watson:                   0.377
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.703
Skew:                           0.221   Prob(JB):                     3.94e-09
Kurtosis:                       2.141   Cond. No.                         10.0
==============================================================================

Model 2: Predicting SalaryUSD Based on YearsExperience

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalaryUSD   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     226.6
Date:                Thu, 04 Jul 2024   Prob (F-statistic):           2.77e-46
Time:                        12:45:44   Log-Likelihood:                -7906.2
No. Observations:                 995   AIC:                         1.582e+04
Df Residuals:                     993   BIC:                         1.583e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              31.8407    104.745      0.304      0.761    -173.706     237.387
YearsExperience   220.9169     14.675     15.054      0.000     192.119     249.715
==============================================================================
Omnibus:                      119.015   Durbin-Watson:                   0.377
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.055
Skew:                           0.182   Prob(JB):                     8.99e-09
Kurtosis:                       2.127   Cond. No.                         35.1
==============================================================================
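
The coef column of each summary defines a fitted line, so a prediction is just the intercept plus the slope times the predictor value. A minimal sketch, with the coefficients copied from the two tables above; the input values (EducationLevel 3, five years of experience) are hypothetical, and the post does not say how EducationLevel is coded.

```python
# Coefficients copied from the two OLS summaries.
MODEL_1 = {"const": 653.3062, "EducationLevel": 310.0718}
MODEL_2 = {"const": 31.8407, "YearsExperience": 220.9169}


def predict(model, predictor, value):
    """Predicted SalaryUSD from a one-predictor fitted line."""
    return model["const"] + model[predictor] * value


# Hypothetical inputs, chosen only to illustrate the arithmetic.
salary_from_education = predict(MODEL_1, "EducationLevel", 3)
salary_from_experience = predict(MODEL_2, "YearsExperience", 5)

print(f"Model 1, EducationLevel=3: {salary_from_education:.2f}")
print(f"Model 2, YearsExperience=5: {salary_from_experience:.2f}")
```

The two models are fitted separately, so their predictions for the same person will generally differ.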

Analyzing the Results

Let’s break down the key metrics from the regression results:

  1. R-squared (R²):
    • For EducationLevel, R² = 0.197 indicates that 19.7% of the variance in SalaryUSD can be explained by the level of education.
    • For YearsExperience, R² = 0.186 indicates that 18.6% of the variance in SalaryUSD can be explained by years of experience.
    • Both R² values are relatively low, suggesting that there are other significant factors influencing salary that are not captured by these models.
  2. Coefficients:
    • EducationLevel: The positive coefficient (310.07) indicates that higher education levels are associated with higher salaries. In Model 1, each one-unit increase in EducationLevel corresponds to an average increase of about $310.07 in predicted salary.
    • YearsExperience: The positive coefficient (220.92) indicates that more years of experience are associated with higher salaries. In Model 2, each additional year of experience corresponds to an average increase of about $220.92 in predicted salary.
    • Because each model contains a single predictor, neither coefficient is adjusted for the other variable.
  3. Significance (p-value):
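    • Both predictors (EducationLevel and YearsExperience) have p-values reported as 0.000 (that is, below 0.001), far under the usual 0.05 threshold, so both are statistically significant predictors of SalaryUSD. The intercept of Model 2 is the exception: with p = 0.761, it is not distinguishable from zero.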
  4. Model Fit:
    • While the models are statistically significant, the relatively low R² values suggest that these are not strong predictors by themselves. A more complex model with additional variables might provide a better fit.
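
The metrics discussed above are tied together by simple identities, which makes it possible to sanity-check a summary table by hand. The sketch below recomputes the t-statistic, F-statistic, R² and adjusted R² from the coefficient, its standard error and the number of observations. The function name is hypothetical, and the identities F = t² and R² = F / (F + df_resid) hold only for a model with a single predictor.

```python
def summary_checks(coef, std_err, n_obs):
    """Recompute headline metrics of a ONE-predictor OLS summary."""
    df_resid = n_obs - 2              # observations minus (intercept + slope)
    t_stat = coef / std_err           # t-statistic of the slope
    f_stat = t_stat ** 2              # single-predictor identity: F = t^2
    r2 = f_stat / (f_stat + df_resid)
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid
    return {"t": t_stat, "F": f_stat, "R2": r2, "adj_R2": adj_r2}


# Inputs copied from the two summary tables (995 observations each).
edu = summary_checks(coef=310.0718, std_err=19.857, n_obs=995)
exp = summary_checks(coef=220.9169, std_err=14.675, n_obs=995)

for name, res in (("EducationLevel", edu), ("YearsExperience", exp)):
    print(name, {key: round(val, 3) for key, val in res.items()})
```

Up to rounding, the recomputed values agree with the reported ones (t of 15.615 and 15.054, F of 243.8 and 226.6, R² of 0.197 and 0.186), so the two tables are internally consistent.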

Conclusion

From this analysis, it’s clear that both education and experience are statistically significant predictors of salary, but they are not the only factors. Each explains less than 20% of the variance in SalaryUSD on its own, so a more comprehensive model is needed to predict it well. Adding more relevant predictors and exploring interactions between variables could lead to a more accurate and insightful model.

In future analyses, we could explore a multiple regression model that includes both predictors at once, polynomial regressions, or even machine learning techniques to improve the prediction accuracy. The key takeaway is that while linear regression is a powerful tool, understanding its limitations and knowing when to move to more complex models is crucial for accurate predictions.
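
As a starting point for that next step, here is a library-free sketch of a regression with both predictors in one model, solved through the normal equations. The rows are made up, and the salaries are generated exactly from assumed coefficients (100, 300, 200) so that the fit can be checked; none of this comes from the post's dataset, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_ols(rows, y):
    """rows: one tuple of predictor values per observation.

    Returns [intercept, b1, b2, ...] from the normal equations
    (X^T X) b = X^T y.
    """
    design = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up (EducationLevel, YearsExperience) rows, NOT from the post's data.
rows = [(1, 2), (2, 1), (2, 5), (3, 4), (4, 8), (4, 3)]
# Salaries generated exactly from assumed coefficients 100, 300 and 200.
salary = [100 + 300 * edu + 200 * yrs for edu, yrs in rows]

coefs = fit_multiple_ols(rows, salary)
print([round(c, 6) for c in coefs])
```

Because the made-up salaries contain no noise, the fit recovers the assumed coefficients up to floating-point error. On real data, a library routine based on a QR or SVD decomposition is usually preferable to the normal equations, which can be numerically unstable when predictors are strongly correlated.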
