Oct 30th – T-test

A t-test is a statistical test used to determine if there is a significant difference between the means of two groups. There are different types of t-tests, but the most common ones are the independent samples t-test and the paired samples t-test.

To evaluate whether there is a significant difference in the average age between two races using a t-test, we can use Python and the scipy.stats module. In our project, ttest_ind is used to perform an independent samples t-test on the age distributions of the two races. The null hypothesis is that there is no significant difference in the average age between the two races. The p-value is then used to determine whether to reject the null hypothesis.
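Here is a minimal sketch of how this might look in code. The file name, the race codes, and the column names are placeholders, not necessarily what the project uses, and equal_var=False (Welch's t-test) is my own assumption rather than the project's exact call:

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names for the incidents data.
df = pd.read_csv("fatal_shootings.csv")

# Age distributions for the two races being compared (race codes are placeholders).
ages_black = df.loc[df["race"] == "B", "age"].dropna()
ages_white = df.loc[df["race"] == "W", "age"].dropna()

# Independent samples t-test; H0: the mean ages of the two groups are equal.
# equal_var=False uses Welch's t-test; the project may have used the pooled default.
t_stat, p_value = stats.ttest_ind(ages_black, ages_white, equal_var=False)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0 (the mean ages differ significantly)")
else:
    print(f"p = {p_value:.4f}: fail to reject H0")
```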

We came to the conclusion that we fail to reject the null hypothesis: the data do not provide evidence of a significant difference in the average age between the two races.

Oct 27th – Project Updates

Regarding Black individuals: There is a slight disparity between the mean and median, indicating a subtle rightward skew in the age distribution. In fact, the skewness is approximately 1, suggesting a modest deviation from a perfectly symmetrical distribution. The kurtosis is close to 4, slightly higher than the typical value of 3 for a normal distribution. This implies the potential presence of concentration around the mean or the existence of a fat tail, or possibly both.

As for White individuals: The mean and median also exhibit a minor difference, signaling a mild right-leaning skew in the age distribution. The skewness is approximately 0.5, reflecting a relatively smaller deviation compared to the Black population. The kurtosis is nearly 3, indicating minimal concentration around the mean and the absence of a substantial fat tail.
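A sketch of how these summary statistics could be computed with scipy.stats, reusing the same hypothetical file, column names, and race codes as in the t-test sketch above:

```python
import pandas as pd
from scipy import stats

# Hypothetical file, column names, and race codes.
df = pd.read_csv("fatal_shootings.csv")

for label, code in [("Black", "B"), ("White", "W")]:
    ages = df.loc[df["race"] == code, "age"].dropna()
    # fisher=False reports kurtosis on the scale where a normal distribution is 3.
    kurt = stats.kurtosis(ages, fisher=False)
    print(f"{label}: mean={ages.mean():.1f}, median={ages.median():.1f}, "
          f"skew={stats.skew(ages):.2f}, kurtosis={kurt:.2f}")
```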

Oct 25th – Project updates (Analyzing Age Distribution)

In this project, we focused on understanding the age distribution in fatal police shootings in the United States using a dataset of 8,002 incidents. Our goal was to identify the most common age groups involved in these tragic events.

Age Distribution Analysis
With the data ready, we analyzed the age distribution of individuals killed by police using a histogram, visually representing incident frequency across different age groups.
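The histogram itself is straightforward to produce; a minimal sketch, again assuming a hypothetical file name and an 'age' column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column name for the incidents data.
df = pd.read_csv("fatal_shootings.csv")

# Histogram of ages across all recorded incidents.
plt.hist(df["age"].dropna(), bins=30, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Number of incidents")
plt.title("Age distribution in fatal police shootings")
plt.show()
```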

Key Findings and Observations
The visualization revealed:

– A right-skewed distribution, indicating a higher frequency of incidents involving younger individuals.
– The most affected age groups were those in their 20s and 30s, raising significant concern.

Conclusion and Reflection

This analysis provided valuable insights into age-related aspects of police shootings, emphasizing the vulnerability of younger demographics. This information guides further research, policy discussions, and interventions to understand and address contributing factors. The project demonstrates the power of data in illuminating societal issues, supporting informed decision-making for positive change.


Oct 23rd – Hierarchical Clustering

In the expansive realm of unsupervised machine learning, Hierarchical Clustering emerges as a nuanced approach, offering a panoramic view of data relationships. Unlike K-Means, this technique doesn’t necessitate predefining the number of clusters. Instead, it creates a hierarchical tree of clusters, known as a dendrogram, which illustrates the fusion of data points into progressively larger clusters. Hierarchical Clustering can be approached in two ways: Agglomerative, where each data point starts as an individual cluster and gradually merges, or Divisive, where the process begins with a single cluster encompassing all data points and progressively divides. This flexibility renders Hierarchical Clustering applicable across various domains, from biological taxonomy to market segmentation.

The dendrogram generated by Hierarchical Clustering serves as a visual narrative, providing insights into the relationships between data points. The y-axis of the dendrogram represents the distance at which clusters merge, while the x-axis showcases individual data points and their groupings. Analysts can then choose an optimal threshold to cut the dendrogram, delineating distinct clusters based on their desired level of granularity. This adaptability makes Hierarchical Clustering a potent tool for discerning intricate structures within datasets.
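As an illustrative sketch (on toy data, not the project dataset), agglomerative clustering, its dendrogram, and a threshold cut can all be produced with scipy.cluster.hierarchy:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative linkage; 'ward' merges the pair of clusters that minimizes
# the increase in within-cluster variance.
Z = linkage(X, method="ward")

# The dendrogram shows merge distances on the y-axis over data points on the x-axis.
dendrogram(Z)
plt.ylabel("Merge distance")
plt.show()

# Cut the tree at a chosen distance threshold to obtain flat cluster labels.
labels = fcluster(Z, t=10, criterion="distance")
print(labels)
```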

The decision to employ Hierarchical Clustering often hinges on the complexity and nature of the dataset. Its ability to unveil hierarchical structures within data makes it ideal for scenarios where understanding relationships at multiple levels of granularity is crucial. Whether deciphering biological classifications or exploring market dynamics, Hierarchical Clustering offers a meticulous lens through which to comprehend the intricate tapestry of data relationships.

Oct 18th – K-Means and DBSCAN

K-Means clustering stands as a stalwart in the realm of unsupervised machine learning, offering a powerful technique for grouping data points based on their similarities. The algorithm strives to partition the dataset into ‘k’ distinct clusters, where each cluster is defined by a central point called a centroid. Iteratively, data points are assigned to the cluster whose centroid is nearest, and the centroids are recalculated until convergence. K-Means finds its utility in a myriad of applications, from customer segmentation in marketing to image compression in computer vision, providing a versatile solution for pattern recognition.

In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a unique approach to clustering by identifying regions of high data density. Unlike K-Means, DBSCAN does not require the user to predefine the number of clusters. Instead, it classifies points into three categories: core points, border points, and noise points. Core points, surrounded by a minimum number of other points within a specified radius, form the nucleus of clusters. Border points lie on the periphery of these clusters, while points in sparser regions are designated as noise. This makes DBSCAN particularly adept at discovering clusters of arbitrary shapes and handling outliers effectively.
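A side-by-side sketch of the two algorithms on toy data with scikit-learn; the parameter values here are assumptions chosen for illustration, not tuned for any real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Toy dataset with two crescent-shaped clusters plus a little noise.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# K-Means: the number of clusters must be chosen up front; works best for
# roughly spherical, well-separated clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed; eps is the neighborhood radius and
# min_samples the minimum number of points required to form a core point.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means clusters:", np.unique(kmeans_labels))
print("DBSCAN clusters (-1 = noise):", np.unique(dbscan_labels))
```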

In choosing between K-Means and DBSCAN, the nature of the dataset and the desired outcome play pivotal roles. K-Means excels when the number of clusters is known, and the clusters are well-defined and spherical. On the other hand, DBSCAN shines when dealing with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unravel hidden structures, paving the way for more informed decision-making in diverse fields.

Oct 16th – Fatal shooting Data Exploration

Today, I delved into the intricate realm of comprehending the dataset pertaining to fatal police shootings in the United States since 2015. My discoveries and subsequent actions are chronicled below:

  1. Data Loading: I accessed two CSV files, housing incident and agency data, directly from the repositories of the venerable Washington Post. This method ensures that I am equipped with the most current and pertinent information.
  2. Data Exploration: Upon scrutinizing the datasets, a salient revelation emerged—the agency ID serves as the common identifier across both tables. This fortuitous alignment allows for the seamless merging of the two tables, enabling a correlation between incident details and the agency or agencies responsible.
  3. Data Manipulation: However, upon closer inspection, a hurdle manifested itself—I encountered difficulties merging the tables due to disparate data types. To surmount this obstacle, I undertook the task of converting the data type of the agency ID in the incidents table to int64. Furthermore, the ‘agency ID’ column in the incident data table exhibited multiple values that necessitated separation. Rest assured, I am diligently addressing this matter and will provide updates in subsequent posts. A rough sketch of this workflow appears just after this list.
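The sketch below illustrates the loading, splitting, type conversion, and merge steps. The URLs and column names ('agency_ids', 'id') are hypothetical stand-ins; the actual Washington Post file locations and schema may differ:

```python
import pandas as pd

# Hypothetical URLs for the incident and agency CSV files.
incidents = pd.read_csv("https://example.com/fatal-police-shootings-data.csv")
agencies = pd.read_csv("https://example.com/fatal-police-shootings-agencies.csv")

# Some incidents list several agency IDs in one cell; split them into separate rows.
incidents = incidents.dropna(subset=["agency_ids"])
incidents["agency_ids"] = incidents["agency_ids"].astype(str).str.split(",")
incidents = incidents.explode("agency_ids")

# Align data types so the merge keys match (the agencies table uses an int64 'id' here).
incidents["agency_ids"] = incidents["agency_ids"].str.strip().astype("int64")

# Merge incident details with the responsible agency or agencies.
merged = incidents.merge(agencies, left_on="agency_ids", right_on="id", how="left")
print(merged.shape)
```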

Oct 11th – ANOVA

Today’s class delved into the intriguing world of Analysis of Variance (ANOVA), and I’m excited to share how it’s like a statistical detective, especially when it comes to understanding variations in incidents like police shootings in the United States. Picture this: we have data representing different racial groups, and we want to know if the average number of incidents varies significantly between them. That’s where ANOVA steps in.

So, here’s the breakdown: ANOVA acts as our investigator, examining two types of variations. First, it scrutinizes the differences within each racial group, considering how the number of shootings might differ within the same race. Then, it compares that to the differences between the racial groups, analyzing how the average number of incidents might differ across all races. If the differences within each race are similar to the differences between races, ANOVA suggests that the variations might be due to random factors. But if the differences between races are significantly larger than the differences within each race, ANOVA signals that there’s likely something more profound at play.
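A minimal one-way ANOVA sketch with scipy.stats.f_oneway, comparing a numeric variable (age, as a stand-in) across racial groups; the file and column names are the same hypothetical placeholders used earlier:

```python
import pandas as pd
from scipy import stats

# Hypothetical data: one row per incident, with 'race' and 'age' columns.
df = pd.read_csv("fatal_shootings.csv")

# Build one age sample per racial group.
groups = [grp["age"].dropna() for _, grp in df.groupby("race")]

# One-way ANOVA; H0: all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```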

Oct 2 – Bootstrapping

Alongside the resampling methods we learned about in class, such as k-fold cross-validation, bootstrapping is another nifty statistical technique for data analysis. Imagine you have a small dataset, and you want to understand more about its characteristics. Enter bootstrapping—a method that enables you to generate multiple datasets by repeatedly sampling from your original data, with replacement. It’s like creating multiple mini-worlds from your limited observations, allowing you to get a better grip on the underlying patterns and uncertainties in your data.

Here’s the magic: since you’re sampling with replacement, some data points may appear more than once in a given bootstrap sample, while others might not appear at all. This process mimics the randomness inherent in real-world data collection. By creating these bootstrapped datasets and analyzing them, you can estimate things like the variability of your measurements or the uncertainty around a particular statistic. It’s a statistical resilience booster, giving you a more robust understanding of your data’s nuances.
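A small sketch of bootstrapping the mean of a sample with NumPy; the data here is synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A small original sample (synthetic for illustration).
data = rng.normal(loc=35, scale=10, size=50)

# Draw many bootstrap samples (with replacement) and record each sample's mean.
n_boot = 10_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

# The spread of the bootstrap means estimates the uncertainty of the sample mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
```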

The beauty of bootstrapping lies in its simplicity and power. Whether you’re dealing with a small dataset, uncertain about your assumptions, or just curious about the reliability of your results, bootstrapping is like a statistical friend that says, “Let’s explore your data from various angles and see what insights we can uncover together.”

Sep 18th – p-Value

I delved into the intricacies of p-values following the class and based on the videos recommended, and this exploration led me to a nuanced understanding of the null hypothesis. The null hypothesis, often denoted as H0, assumes a central role in hypothesis testing, positing the absence of a significant difference, effect, or relationship between variables or groups within a given population. It serves as the foundational hypothesis against which subsequent testing is conducted. It’s important to note that the failure to reject the null hypothesis doesn’t affirm its veracity; rather, it indicates an absence of sufficient evidence from the collected data to suggest otherwise. The structured approach of statistical hypothesis testing provides a systematic framework for this decision-making process in the realm of statistics.

Transitioning to the concept of p-values, they function as a critical metric in gauging evidence against the null hypothesis. The p-value, or probability value, quantifies the likelihood of observing a test statistic as extreme as the one derived from sample data, assuming the null hypothesis holds true. This probability informs whether the results obtained from the sample data are statistically significant or could be attributed to random chance. The process involves formulating a null hypothesis, collecting and analyzing data, calculating the p-value, and subsequently comparing it to a predetermined significance level. A small p-value doesn’t necessarily invalidate the null hypothesis; instead, it suggests evidence against it. Interpretation of p-values demands caution, taking into account the research question, study design, and potential biases in data collection.
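As a toy illustration of that workflow (synthetic data, arbitrary numbers), a one-sample t-test shows the formulate-test-compare cycle end to end:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=40)  # synthetic sample

# H0: the population mean is 5. Compute the test statistic and its p-value.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

alpha = 0.05  # significance level chosen before looking at the data
print(f"p = {p_value:.3f};",
      "reject H0" if p_value < alpha else "fail to reject H0")
```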

In summary, the interplay between the null hypothesis and p-values is pivotal in the landscape of statistical analysis. The null hypothesis establishes the baseline assumption, and p-values offer a quantifiable measure of evidence against it. Proper interpretation requires a nuanced understanding of statistical significance, careful consideration of study parameters, and acknowledgment of the broader statistical context beyond p-values, including effect sizes and confidence intervals.

Cross validation

Based on the lecture and the videos recommended by Prof. Gary, I was able to learn the intricacies of various cross-validation approaches and their usefulness when dealing with smaller datasets similar to the Project 1 dataset.

It all boils down to how we interpret the test error rate in comparison to the training error rate of a model. As we may have observed, the training error rate of a polynomial regression, for example, keeps falling as the degree of the polynomial increases. However, the test error rate falls initially before rising again as the model grows more complex; this is the signature of overfitting. To address this, I learned that there are various cross-validation approaches, as described below:

  • The Validation Set Approach

This process entails randomly dividing the available set of observations into two segments: a training set and a validation set (or hold-out set). The model is trained on the training set, and the resulting model is employed to predict responses for the observations in the validation set. The ensuing validation set error rate, often evaluated using Mean Squared Error (MSE) for quantitative responses, serves as an approximation of the test error rate. However, this approach comes with certain limitations.

The validation estimate of the test error rate can exhibit high variability, depending on which observations end up in the training and validation sets. In addition, the validation approach employs only a subset of the observations—those in the training set rather than the validation set—to train the model. Since statistical methods generally perform less effectively when trained on fewer observations, the validation set error rate may tend to overestimate the test error rate for the model fitted on the entire dataset.
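A sketch of the validation-set approach for a polynomial regression, using scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=200)  # quadratic signal plus noise

# Randomly split into a training set and a validation (hold-out) set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit polynomial regressions of increasing degree and track the validation MSE.
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```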

  • Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is a technique used to assess the performance of predictive models by systematically leaving out one data point at a time for validation while training the model on the remaining data. This process is repeated for each observation in the dataset, allowing the model to be validated against the entire dataset. While LOOCV provides a thorough evaluation of model performance and utilizes all available data, it can be computationally expensive for large datasets due to the need for multiple model fittings. Despite this, LOOCV is commonly employed when the dataset size is limited, providing a robust assessment of model generalization.
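A minimal LOOCV sketch with scikit-learn, again on synthetic data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=30)

# Each iteration trains on n-1 observations and validates on the one left out.
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")
print(f"LOOCV estimate of test MSE: {-scores.mean():.2f}")
```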

  • k-Fold Cross-Validation

This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, . . . , MSEk.
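And a k-fold version of the same idea (k = 5 here, chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Split the data into k folds; each fold serves once as the validation set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")

# The k per-fold MSEs are averaged into the cross-validation estimate of test error.
print("Per-fold MSE:", np.round(-scores, 2))
print(f"k-fold CV estimate of test MSE: {-scores.mean():.2f}")
```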

The most important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off: LOOCV has lower bias because each model is trained on nearly all of the data, but its n fitted models are highly correlated, so the average of their errors has higher variance; k-fold CV with k = 5 or 10 averages over less correlated fits and often strikes a better balance.