Nov 8th – Project Updates

Our latest analysis provides insights into the proportions of individuals shot relative to their estimated population percentage in the U.S. This comparison reveals notable disparities across different racial categories:

  • Black (B): Black individuals are shot at a proportion approximately 1.70 times higher than their representation in the U.S. population.
  • Native American (N): The proportion of Native American individuals shot is approximately 1.31 times their representation in the U.S. population.
  • Hispanic (H): Hispanic individuals are shot at a proportion approximately 0.77 times their representation in the U.S. population.
  • White (W): The proportion of White individuals shot is approximately 0.67 times their representation in the U.S. population.
  • Asian (A): Asian individuals are shot at a proportion approximately 0.29 times their representation in the U.S. population.
  • Other (O): Individuals from other racial categories are shot at a proportion approximately 0.09 times their representation in the U.S. population.

This analysis sheds light on the disparities in the likelihood of being shot across various racial groups. Our next steps involve exploring potential factors contributing to these disparities and further examining the broader implications of these findings.
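
For reference, each ratio above is a group's share of shootings divided by its share of the population. Below is a minimal sketch of that arithmetic; the counts and population shares are made-up placeholders (not the project's data), and `disparity_ratios` is a hypothetical helper name.

```python
# Hypothetical inputs for illustration only; these are NOT the project's data.
shot_counts = {"B": 25, "W": 50, "H": 20, "A": 5}                  # made-up incident counts
population_share = {"B": 0.125, "W": 0.625, "H": 0.2, "A": 0.05}   # made-up population shares


def disparity_ratios(counts, pop_share):
    """Return (share of shootings) / (share of population) for each group."""
    total = sum(counts.values())
    return {group: (counts[group] / total) / pop_share[group] for group in counts}


ratios = disparity_ratios(shot_counts, population_share)

# Print groups from most over-represented to most under-represented.
for group, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {ratio:.2f}")
```

A ratio above 1 means the group appears among those shot more often than its population share would suggest; a ratio below 1 means less often.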

Nov 3rd – Project Updates

To understand the size of the difference, we used Cohen’s d, an effect size measure that is commonly reported alongside t-tests. Cohen’s d quantifies the difference between two groups in terms of standard deviations.

Cohen’s d is calculated using the formula:

d = (Mean Difference) / (Pooled Standard Deviation)

where the Mean Difference is the difference in means between the two groups, and the Pooled Standard Deviation is the square root of the sample-size-weighted average of the two groups’ variances.

Calculating this for the mean age of the Black and White groups gave Cohen’s d = 0.57. By Cohen’s conventional thresholds (0.2 small, 0.5 medium, 0.8 large) this is a medium effect size; the mean age of White individuals is higher than that of Black individuals.
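
As a sanity check on the formula, here is a small sketch of Cohen’s d with the pooled standard deviation, using only the standard library. The two age lists are made-up placeholders, and `cohens_d` is a hypothetical helper, not the code used in the project.

```python
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled (sample-size weighted) standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical ages, for illustration only.
ages_group_1 = [40, 45, 50, 55, 60]
ages_group_2 = [30, 35, 40, 45, 50]

d = cohens_d(ages_group_1, ages_group_2)
print(f"Cohen's d = {d:.2f}")
```

The sign of d only reflects which group is passed first; the magnitude is what gets compared against the usual thresholds.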

Nov 1st – Project updates

Logistic regression serves as a statistical modeling technique utilized for examining the connection between a binary outcome variable and one or more predictor variables. In this analysis, I designated “manner_of_death” as the binary outcome variable, where the column specifies whether the death resulted from being “shot” or “shot and tasered.” Additional columns such as “armed,” “age,” “gender,” and “race” were considered as predictor variables to gauge the likelihood of a specific mode of death.

Subsequently, I delved into exploring and preprocessing the data, addressing missing values and encoding categorical variables. Following this, a logistic regression model was constructed by fitting the data. In this model, the binary variable became the dependent variable, and the other columns served as independent data, acting as predictors.

The model’s performance was then evaluated using various metrics, including accuracy, precision, recall, F1 score, and R2 score, all of which demonstrated satisfactory results. This comprehensive approach allowed for a thorough understanding of the relationship between the chosen predictors and the binary outcome, providing valuable insights into the risk estimation for different modes of death.
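
To make the classification metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels. The labels are invented for illustration (1 standing for “shot and tasered”, 0 for “shot”), and `classification_metrics` is a hypothetical helper rather than the project’s evaluation code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = "shot and tasered", 0 = "shot".
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When one outcome is much rarer than the other, accuracy alone can look good even if the rare class is mostly missed, which is why precision and recall are worth reading alongside it.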

Oct 30th – T-test

A t-test is a statistical test used to determine if there is a significant difference between the means of two groups. There are different types of t-tests, but the most common ones are the independent samples t-test and the paired samples t-test.

To evaluate whether there is a significant difference in the average age between two races using a t-test, we can use Python and the scipy.stats module. In our project, ttest_ind is used to perform an independent samples t-test on the age distributions of the two races. The null hypothesis is that there is no difference in the average age between the two races. The p-value is then used to determine whether to reject the null hypothesis.

We came to the conclusion that we fail to reject the null hypothesis: the test does not show a significant difference in the average age between the two races.
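
A minimal sketch of such a test with `scipy.stats.ttest_ind` is shown below. The two samples are randomly generated placeholders rather than the project’s ages, the 0.05 significance level is an assumption, and `equal_var=False` (Welch’s variant) is a choice made for this sketch, not necessarily what the project used.

```python
import random

from scipy import stats

random.seed(0)

# Hypothetical age samples for two groups, for illustration only.
ages_a = [random.gauss(35, 12) for _ in range(200)]
ages_b = [random.gauss(35, 12) for _ in range(200)]

# equal_var=False requests Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(ages_a, ages_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```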

Oct 27th – Project Updates

Regarding Black individuals: There is a slight disparity between the mean and median, indicating a subtle rightward skew in the age distribution. In fact, the skewness is approximately 1, suggesting a modest deviation from a perfectly symmetrical distribution. The kurtosis is close to 4, slightly higher than the typical value of 3 for a normal distribution. This implies the potential presence of concentration around the mean or the existence of a fat tail, or possibly both.

As for White individuals: The mean and median also exhibit a minor difference, signaling a mild right-leaning skew in the age distribution. The skewness is approximately 0.5, reflecting a relatively smaller deviation compared to the Black population. The kurtosis is nearly 3, indicating minimal concentration around the mean and the absence of a substantial fat tail.
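
For clarity on what these two numbers measure, here is a small sketch that computes moment-based skewness and kurtosis by hand. It uses the convention from the text, where a normal distribution has kurtosis 3. Note that `scipy.stats.kurtosis` reports excess kurtosis (normal = 0) by default. The ages are made-up placeholders.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and (non-excess) kurtosis.

    A symmetric distribution has skewness 0; a normal distribution has kurtosis 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical ages with a longer tail towards older values.
ages = [18, 22, 25, 27, 30, 31, 33, 36, 40, 45, 52, 70]

skew, kurt = skewness_and_kurtosis(ages)
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
```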

Oct 25th – Project updates (Analyzing Age Distribution)

In this project, we focused on understanding the age distribution in fatal police shootings in the United States using a dataset of 8,002 incidents. Our goal was to identify the most common age groups involved in these tragic events.

Age Distribution Analysis
With the data ready, we analyzed the age distribution of individuals killed by police using a histogram, visually representing incident frequency across different age groups.

Key Findings and Observations
The visualization revealed:

– A right-skewed distribution, indicating a higher frequency of incidents involving younger individuals.
– The most affected age groups were those in their 20s and 30s, raising significant concern.
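
The binning behind a histogram like this can be sketched in a few lines. The ages below are made-up placeholders, the 10-year bin width is an assumption, and `decade_counts` is a hypothetical helper; the project’s actual plot may have used different bins.

```python
from collections import Counter


def decade_counts(ages):
    """Count ages per 10-year bin, e.g. 27 -> '20-29'; missing ages (None) are skipped."""
    bins = Counter((int(a) // 10) * 10 for a in ages if a is not None)
    return {f"{low}-{low + 9}": bins[low] for low in sorted(bins)}


# Hypothetical ages, including one missing value.
ages = [19, 23, 27, 31, 34, 38, 45, None, 52, 67]

counts = decade_counts(ages)
for label, n in counts.items():
    print(f"{label:>6} | {'#' * n}")  # simple text histogram
```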

Conclusion and Reflection

This analysis provided valuable insights into age-related aspects of police shootings, emphasizing the vulnerability of younger demographics. This information guides further research, policy discussions, and interventions to understand and address contributing factors. The project demonstrates the power of data in illuminating societal issues, supporting informed decision-making for positive change.


Oct 23rd – Hierarchical Clustering

In the expansive realm of unsupervised machine learning, Hierarchical Clustering emerges as a nuanced approach, offering a panoramic view of data relationships. Unlike K-Means, this technique doesn’t necessitate predefining the number of clusters. Instead, it creates a hierarchical tree of clusters, known as a dendrogram, which illustrates the fusion of data points into progressively larger clusters. Hierarchical Clustering can be approached in two ways: Agglomerative, where each data point starts as an individual cluster and gradually merges, or Divisive, where the process begins with a single cluster encompassing all data points and progressively divides. This flexibility renders Hierarchical Clustering applicable across various domains, from biological taxonomy to market segmentation.

The dendrogram generated by Hierarchical Clustering serves as a visual narrative, providing insights into the relationships between data points. The y-axis of the dendrogram represents the distance at which clusters merge, while the x-axis showcases individual data points and their groupings. Analysts can then choose an optimal threshold to cut the dendrogram, delineating distinct clusters based on their desired level of granularity. This adaptability makes Hierarchical Clustering a potent tool for discerning intricate structures within datasets.
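
To illustrate the merge heights and the idea of cutting the dendrogram, here is a naive agglomerative sketch on 1-D points. It assumes single linkage (distance between the closest members), which is only one of several linkage choices, and `single_linkage` is a hypothetical helper; in practice a library routine such as `scipy.cluster.hierarchy.linkage` would typically be used instead.

```python
def single_linkage(points, cut_distance=None):
    """Naive single-linkage agglomerative clustering of 1-D points.

    Returns (clusters, merges). Each merge is (distance, cluster_a, cluster_b);
    the distances are the heights at which a dendrogram would join clusters.
    If cut_distance is given, merging stops once the next merge would exceed it.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        dist, i, j = min(
            (min(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if cut_distance is not None and dist > cut_distance:
            break  # equivalent to cutting the dendrogram at this height
        merges.append((dist, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


points = [1.0, 1.5, 5.0, 5.2, 9.0]  # hypothetical 1-D data

_, full_merges = single_linkage(points)              # full merge history
cut_clusters, _ = single_linkage(points, cut_distance=1.0)

print([round(m[0], 2) for m in full_merges])  # merge heights, smallest first
print(sorted(sorted(c) for c in cut_clusters))
```

Lowering `cut_distance` gives more, smaller clusters; raising it gives fewer, larger ones.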

The decision to employ Hierarchical Clustering often hinges on the complexity and nature of the dataset. Its ability to unveil hierarchical structures within data makes it ideal for scenarios where understanding relationships at multiple levels of granularity is crucial. Whether deciphering biological classifications or exploring market dynamics, Hierarchical Clustering offers a meticulous lens through which to comprehend the intricate tapestry of data relationships.

Oct 18th – K-Means and DBSCAN

K-Means clustering stands as a stalwart in the realm of unsupervised machine learning, offering a powerful technique for grouping data points based on their similarities. The algorithm strives to partition the dataset into ‘k’ distinct clusters, where each cluster is defined by a central point called a centroid. Iteratively, data points are assigned to the cluster whose centroid is nearest, and the centroids are recalculated until convergence. K-Means finds its utility in a myriad of applications, from customer segmentation in marketing to image compression in computer vision, providing a versatile solution for pattern recognition.
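
The assign-then-update loop described above can be sketched as follows. This is a bare-bones version of Lloyd’s algorithm on 2-D points with hand-picked starting centroids; the data and the `k_means` helper are illustrative assumptions, and real implementations add smarter initialisation and multiple restarts.

```python
def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts)) if pts else old
            for pts, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8)])

print(centroids)
print(clusters)
```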

In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a unique approach to clustering by identifying regions of high data density. Unlike K-Means, DBSCAN does not require the user to predefine the number of clusters. Instead, it classifies points into three categories: core points, border points, and noise points. Core points, surrounded by a minimum number of other points within a specified radius, form the nucleus of clusters. Border points lie on the periphery of these clusters, while points in sparser regions are designated as noise. This makes DBSCAN particularly adept at discovering clusters of arbitrary shapes and handling outliers effectively.
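
Here is a compact sketch of the core, border, and noise logic on 2-D points. The parameter names `eps` (radius) and `min_pts` (minimum neighbours, counting the point itself) and the sample data are assumptions made for illustration; this is a teaching version, not an optimised implementation.

```python
def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i):
        xi, yi = points[i]
        # All points within distance eps of point i (this includes i itself).
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

    labels = [None] * len(points)  # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point -> border point
            if labels[j] is not None:
                continue  # already assigned; do not expand from it again
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point receives the label -1, which is how this sketch marks noise.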

In choosing between K-Means and DBSCAN, the nature of the dataset and the desired outcome play pivotal roles. K-Means excels when the number of clusters is known, and the clusters are well-defined and spherical. On the other hand, DBSCAN shines when dealing with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unravel hidden structures, paving the way for more informed decision-making in diverse fields.

Oct 16th – Fatal shooting Data Exploration

Today, I delved into the intricate realm of comprehending the dataset pertaining to fatal police shootings in the United States since 2015. My discoveries and subsequent actions are chronicled below:

  1. Data Loading: I accessed two CSV files, housing incident and agency data, directly from the repositories of the venerable Washington Post. This method ensures that I am equipped with the most current and pertinent information.
  2. Data Exploration: Upon scrutinizing the datasets, a salient revelation emerged—the agency ID serves as the common identifier across both tables. This fortuitous alignment allows for the seamless merging of the two tables, enabling a correlation between incident details and the agency or agencies responsible.
  3. Data Manipulation: However, upon closer inspection, a hurdle manifested itself—I encountered difficulties merging the tables due to disparate data types. To surmount this obstacle, I undertook the task of converting the data type of the agency ID in the incidents table to int64. Furthermore, the ‘agency ID’ column in the incident data table exhibited multiple values that necessitated separation. Rest assured, I am diligently addressing this matter and will provide updates in subsequent posts.
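
One possible way to handle the steps above with pandas is sketched below. The tiny tables, the column names (`agency_ids`, `id`, `name`), and the `;` separator are all assumptions made for illustration and may not match the real files, so this is an outline of the idea rather than the final solution.

```python
import pandas as pd

# Tiny made-up tables standing in for the two CSV files; column names are assumptions.
incidents = pd.DataFrame({
    "id": [1, 2, 3],
    "agency_ids": ["10", "11;12", "10"],  # strings; some rows list several agencies
})
agencies = pd.DataFrame({
    "id": [10, 11, 12],
    "name": ["Agency A", "Agency B", "Agency C"],
})

# Split multi-valued IDs into one row per (incident, agency) pair.
exploded = incidents.assign(
    agency_id=incidents["agency_ids"].str.split(";")
).explode("agency_id")

# Cast to int64 so the key has the same dtype as the agency table's id column.
exploded["agency_id"] = exploded["agency_id"].astype("int64")

# Join incident rows to agency details; suffixes disambiguate the two "id" columns.
merged = exploded.merge(
    agencies, left_on="agency_id", right_on="id", suffixes=("_incident", "_agency")
)
print(merged[["id_incident", "agency_id", "name"]])
```

With this layout, an incident involving several agencies appears once per agency in the merged table, which matters when counting incidents afterwards.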

Oct 11th – ANOVA

Today’s class delved into the intriguing world of Analysis of Variance (ANOVA), and I’m excited to share how it’s like a statistical detective, especially when it comes to understanding variations in incidents like police shootings in the United States. Picture this: we have data representing different racial groups, and we want to know if the average number of incidents varies significantly between them. That’s where ANOVA steps in.

So, here’s the breakdown: ANOVA acts as our investigator, examining two types of variations. First, it scrutinizes the differences within each racial group, considering how the number of shootings might differ within the same race. Then, it compares that to the differences between the racial groups, analyzing how the average number of incidents might differ across all races. If the differences within each race are similar to the differences between races, ANOVA suggests that the variations might be due to random factors. But if the differences between races are significantly larger than the differences within each race, ANOVA signals that there’s likely something more profound at play.
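
The comparison of between-group and within-group variation boils down to the F statistic. Below is a small sketch that computes it by hand; the three groups are made-up numbers chosen so the arithmetic is easy to follow, and `one_way_anova_f` is a hypothetical helper.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical groups, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would explain; the p-value then comes from comparing F to an F distribution with (k − 1, n − k) degrees of freedom.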

Oct 2nd – Bootstrapping

In class we learned about various resampling methods such as k-fold cross validation; bootstrapping is another nifty statistical technique for data analysis. Imagine you have a small dataset, and you want to understand more about its characteristics. Enter bootstrapping, a method that enables you to generate multiple datasets by repeatedly sampling from your original data, with replacement. It’s like creating multiple mini-worlds from your limited observations, allowing you to get a better grip on the underlying patterns and uncertainties in your data.

Here’s the magic: since you’re sampling with replacement, some data points may appear more than once in a given bootstrap sample, while others might not appear at all. This process mimics the randomness inherent in real-world data collection. By creating these bootstrapped datasets and analyzing them, you can estimate things like the variability of your measurements or the uncertainty around a particular statistic. It’s a statistical resilience booster, giving you a more robust understanding of your data’s nuances.
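
As a concrete example, the sketch below uses resampling with replacement to build a percentile confidence interval for the mean. The data, the 2,000 resamples, and the 95% level are illustrative assumptions, and `bootstrap_ci` is a hypothetical helper.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data)))  # sample with replacement, same size as data
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical small sample, for illustration only.
ages = [23, 31, 35, 38, 41, 44, 52, 60]

low, high = bootstrap_ci(ages)
print(f"mean = {statistics.mean(ages):.1f}, 95% bootstrap CI = ({low:.1f}, {high:.1f})")
```

Passing a different `stat` (for example `statistics.median`) gives an interval for that statistic with no other changes.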

The beauty of bootstrapping lies in its simplicity and power. Whether you’re dealing with a small dataset, uncertain about your assumptions, or just curious about the reliability of your results, bootstrapping is like a statistical friend that says, “Let’s explore your data from various angles and see what insights we can uncover together.”

(9/18/23) – p-Value

I delved into the intricacies of p-values following the class and based on the videos recommended, and this exploration led me to a nuanced understanding of the null hypothesis. The null hypothesis, often denoted as H0, assumes a central role in hypothesis testing, positing the absence of a significant difference, effect, or relationship between variables or groups within a given population. It serves as the foundational hypothesis against which subsequent testing is conducted. It’s important to note that the failure to reject the null hypothesis doesn’t affirm its veracity; rather, it indicates an absence of sufficient evidence from the collected data to suggest otherwise. The structured approach of statistical hypothesis testing provides a systematic framework for this decision-making process in the realm of statistics.

Transitioning to the concept of p-values, they function as a critical metric in gauging evidence against the null hypothesis. The p-value, or probability value, quantifies the probability of observing a test statistic at least as extreme as the one derived from the sample data, assuming the null hypothesis holds true. This probability informs whether the results obtained from the sample data are statistically significant or could plausibly be attributed to random chance. The process involves formulating a null hypothesis, collecting and analyzing data, calculating the p-value, and subsequently comparing it to a predetermined significance level. A small p-value doesn’t prove the null hypothesis false; instead, it indicates evidence against it. Interpretation of p-values demands caution, taking into account the research question, study design, and potential biases in data collection.
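
A toy example helps make “at least as extreme” concrete. Suppose the null hypothesis is that a coin is fair and we observe 9 heads in 10 flips. The sketch below computes an exact two-sided p-value by adding up the probabilities of all outcomes that are no more likely than the observed one under H0. The coin example and the helper name `two_sided_binomial_p_value` are illustrative assumptions.

```python
from math import comb


def two_sided_binomial_p_value(successes, trials, p_null=0.5):
    """Exact two-sided p-value for a binomial count.

    Sums, under H0, the probability of every outcome that is no more likely
    than the observed one (one common definition of "at least as extreme").
    """
    def pmf(k):
        return comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # The tiny relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-12))


p_value = two_sided_binomial_p_value(successes=9, trials=10)
print(f"p-value = {p_value:.4f}")
```

Here the outcomes counted are 0, 1, 9, and 10 heads, so the p-value is 22/1024, roughly 0.021, which is below the common 0.05 level.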

In summary, the interplay between the null hypothesis and p-values is pivotal in the landscape of statistical analysis. The null hypothesis establishes the baseline assumption, and p-values offer a quantifiable measure of evidence against it. Proper interpretation requires a nuanced understanding of statistical significance, careful consideration of study parameters, and acknowledgment of the broader statistical context beyond p-values, including effect sizes and confidence intervals.

Cross validation

Based on the lecture class and the videos recommended by Prof. Gary, I was able to learn the intricacies of various cross validation approaches and their usefulness when dealing with smaller datasets such as the Project 1 dataset.

It all boils down to how we interpret the test error rate in comparison to the training error rate of a model. As we may have observed, the training error rate of a polynomial regression, for example, usually falls as the degree of the polynomial increases. However, the test error rate falls initially and then rises as the model becomes more complex, which is the signature of overfitting. To estimate the test error and guard against this, I learned that there are various cross validation approaches, as described below:

  • The Validation Set Approach

This process entails randomly dividing the available set of observations into two segments: a training set and a validation set (or hold-out set). The model is trained on the training set, and the resulting model is employed to predict responses for the observations in the validation set. The ensuing validation set error rate, often evaluated using Mean Squared Error (MSE) for quantitative responses, serves as an approximation of the test error rate. However, this approach comes with certain limitations.

The validation estimate of the test error rate can exhibit high variability, depending on the specific observations included in the training and validation sets. In addition, the validation approach employs only a subset of observations (those included in the training set rather than the validation set) to train the model. Given that statistical methods generally perform less effectively when trained on a smaller dataset, this implies that the validation set error rate may tend to overestimate the test error rate for the model fitted on the entire dataset.

  • Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is a technique used to assess the performance of predictive models by systematically leaving out one data point at a time for validation while training the model on the remaining data. This process is repeated for each observation in the dataset, allowing the model to be validated against the entire dataset. While LOOCV provides a thorough evaluation of model performance and utilizes all available data, it can be computationally expensive for large datasets due to the need for multiple model fittings. Despite this, LOOCV is commonly employed when the dataset size is limited, providing a robust assessment of model generalization.

  • k-Fold Cross-Validation

This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, . . . , MSEk.

The most important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off.
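
To tie the pieces together, here is a minimal sketch of k-fold cross-validation for a simple straight-line fit, where the final estimate is the average of the k fold errors MSE1, ..., MSEk. The data are synthetic placeholders, `fit_line` and `k_fold_mse` are hypothetical helpers, and LOOCV falls out as the special case k = n.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def k_fold_mse(xs, ys, k, seed=0):
    """Return the k per-fold MSEs and their average (the cross-validation estimate)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)       # random assignment of observations to folds
    folds = [idx[i::k] for i in range(k)]  # k groups of roughly equal size
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Error is measured only on the held-out fold.
        fold_mses.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold) / len(fold))
    return fold_mses, sum(fold_mses) / k


# Hypothetical, roughly linear data.
xs = list(range(20))
ys = [2 * x + 1 + ((-1) ** x) * 0.5 for x in xs]

fold_mses, cv_estimate = k_fold_mse(xs, ys, k=5)
loocv_mses, loocv_estimate = k_fold_mse(xs, ys, k=len(xs))  # LOOCV: k equals n

print(f"5-fold CV estimate: {cv_estimate:.3f}")
print(f"LOOCV estimate:     {loocv_estimate:.3f}")
```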


Week 1 and 2 summary post

In this blog post, I would like to mention a few things that I’ve learned about linear regression by using the CDC diabetes dataset to analyze the impact of obesity and inactivity on diabetes. This dataset provides a valuable opportunity to explore how these two factors contribute to the prevalence of diabetes in the United States. Linear regression was used to predict a target variable (in this case, diabetes prevalence) based on one or more input variables (obesity and inactivity rates). The goal is to find the best-fit line that describes the relationship between the input variables and the target variable.

As mentioned in the lecture notes, I loaded the data into 3 different data frames, which were then merged to form a master dataframe. Since only 354 data points were common across the tables, I tried to isolate the factors and find their correlation with diabetes, i.e., the target variable. On plotting the distributions and calculating the descriptive statistics of the factors against the diabetes data, I found that the distributions are skewed with negative kurtosis. The correlations between the target and factor variables are positive.

In terms of the class, I have learned more about the importance of calculating the residuals, heteroscedasticity, and the p-value in determining the model for the regression problem.


Nov 6th – Project Updates

In our latest project analysis, we’ve uncovered significant trends related to the armed status of individuals from diverse racial categories. Predominantly, individuals from the White, Black, and Hispanic groups are armed with guns. Notably, a substantial number of White individuals are armed with knives, while a concerning observation highlights a high number of unarmed Black individuals compared to other groups.

Our visualizations reveal consistent patterns across armed status categories, with White individuals leading, followed by Black and Hispanic individuals. Our next focus is on examining the relationship between threat levels and race. Initial findings show that “attack” incidents are prevalent across all racial categories, with White individuals in the lead, followed by Black and Hispanic individuals. The “other” category follows a similar pattern, with incidents labeled as “undetermined” being relatively low across all racial categories. This analysis aims to provide key insights into the perceived threat levels associated with different racial categories in police shooting incidents.