In this blog post, I would like to share a few things I've learned about linear regression while using the CDC diabetes dataset to analyze the impact of obesity and inactivity on diabetes. The dataset provides a valuable opportunity to explore how these two factors contribute to the prevalence of diabetes in the United States. Linear regression predicts a target variable (in this case, diabetes prevalence) from one or more input variables (here, obesity and inactivity rates). The goal is to find the best-fit line (or, with two inputs, the best-fit plane) that describes the relationship between the input variables and the target variable.
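To make the "best-fit" idea concrete, here is a minimal sketch of an ordinary least squares fit with two inputs. The data is synthetic and generated on the spot, not the CDC figures, and the variable names and the coefficients used to generate the toy target are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC figures): rates in percent.
rng = np.random.default_rng(0)
obesity = rng.uniform(15, 40, size=200)
inactivity = rng.uniform(10, 35, size=200)

# Assumed relationship, used only to generate a toy target with some noise.
diabetes = 1.0 + 0.20 * obesity + 0.10 * inactivity + rng.normal(0, 0.5, size=200)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(obesity), obesity, inactivity])

# Ordinary least squares: choose beta to minimise ||X @ beta - diabetes||^2.
beta, *_ = np.linalg.lstsq(X, diabetes, rcond=None)
intercept, b_obesity, b_inactivity = beta

# Fitted values on the best-fit plane.
predicted = X @ beta
```

Each slope is read as the expected change in the target for a one-point change in that input while the other input is held fixed.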
As described in the lecture notes, I loaded the data into three separate data frames and then merged them into a master data frame. Since only 354 data points were common to the tables, I tried to isolate each factor and find its correlation with diabetes, i.e., the target variable. After plotting the distributions and calculating descriptive statistics of the factors against the diabetes data, I found that the distributions are skewed with negative kurtosis. The correlations between the target and factor variables are positive.
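The merge and summary steps can be sketched roughly as follows. The three tiny tables, the `FIPS` key, and the column names are made up for illustration and are not the actual CDC files; the point is only that an inner join keeps the rows present in every table, which is why the merged data can be much smaller than the inputs.

```python
import pandas as pd

# Toy tables keyed by a county code; key and column names are assumptions.
diabetes = pd.DataFrame({"FIPS": [1, 2, 3, 4, 5],
                         "diabetes_pct": [8.1, 9.4, 7.2, 10.3, 8.8]})
obesity = pd.DataFrame({"FIPS": [1, 2, 3, 4, 6],
                        "obesity_pct": [18.2, 20.5, 17.1, 22.0, 19.0]})
inactivity = pd.DataFrame({"FIPS": [1, 2, 3, 4, 7],
                           "inactivity_pct": [15.0, 17.3, 14.2, 18.9, 16.0]})

# Inner joins keep only the counties present in all three tables.
master = (diabetes
          .merge(obesity, on="FIPS", how="inner")
          .merge(inactivity, on="FIPS", how="inner"))

# Descriptive statistics for every numeric column except the key.
values = master.drop(columns="FIPS")
summary = pd.DataFrame({
    "mean": values.mean(),
    "std": values.std(),
    "skew": values.skew(),
    "kurtosis": values.kurt(),  # excess kurtosis: 0 for a normal distribution
})

# Pearson correlation of each factor with the target.
correlations = values.corr()["diabetes_pct"].drop("diabetes_pct")
```

Note that pandas reports excess kurtosis, so a negative value means flatter tails than a normal distribution.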
In terms of the class, I have learned more about the importance of calculating residuals, checking for heteroscedasticity, and using p-values when choosing a model for a regression problem.
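As a rough illustration of how these three ideas fit together, the sketch below computes residuals from a least squares fit and then applies the n·R² (Koenker) form of the Breusch-Pagan test for heteroscedasticity. It is not necessarily the procedure used in the class. The data is synthetic, the helper names are hypothetical, and the p-value shortcut only holds for exactly two regressors, where the chi-square survival function reduces to exp(-x/2).

```python
import math
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals; X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def r_squared(X, y):
    """Share of the variance in y explained by a linear fit on X."""
    _, resid = ols(X, y)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def breusch_pagan(X, y):
    """LM statistic and p-value; valid as written only for two regressors."""
    _, resid = ols(X, y)
    # Auxiliary regression: do the inputs explain the squared residuals?
    lm = len(y) * r_squared(X, resid ** 2)
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    return lm, math.exp(-lm / 2.0)

# Synthetic data (an assumption for illustration, not the CDC dataset).
rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x1, x2])

y_const = 2 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 1.0, n)      # constant noise
y_grow = 2 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 1.0, n) * x1  # noise grows with x1

beta, resid = ols(X, y_const)
lm_const, p_const = breusch_pagan(X, y_const)
lm_grow, p_grow = breusch_pagan(X, y_grow)
```

A small p-value here is evidence against constant residual variance, which in turn means the usual standard errors and p-values on the regression coefficients should be treated with caution.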