In my latest project, I wanted to develop a predictive model for property valuations. After separating the dataset into features and the target variable, I preprocessed the data, treating numerical columns with a median strategy to handle missing values, and categorical columns with the most frequent strategy followed by one-hot encoding. This meticulous preparation ensured that each property’s unique attributes were ready to inform the predictive process.
Choosing a Decision Tree Regressor for its straightforward approach to learning, I created a pipeline that included both preprocessing and modeling stages. The training phase was an exercise in pattern recognition, with the model analyzing 80% of the data to understand the intricacies of real estate valuation. The decision tree’s method of breaking down data into a series of binary decisions made it an excellent tool for navigating the complex relationships between property features and their market values.
The model’s performance was evaluated using the mean squared error and R-squared metrics, revealing an impressive R-squared value of 0.976. This indicates that our model explains about 97.6% of the variance in property values, showcasing its ability to make highly accurate predictions. Such precision in predictive modeling is not only a triumph in statistical analysis but also a potential cornerstone for investors and policymakers in the real estate market, providing a reliable tool for future valuation assessments.
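To make the two metrics concrete, here is a minimal sketch of how mean squared error and R-squared can be computed by hand. The `actual` and `predicted` values are hypothetical numbers chosen for illustration, not figures from the assessment data, and the function names are my own.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - RSS/TSS: the share of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - rss / tss


# Hypothetical held-out values (in $1000s), for illustration only.
actual = [300.0, 450.0, 520.0, 610.0, 720.0]
predicted = [310.0, 440.0, 500.0, 630.0, 700.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.1f}, R-squared = {r2:.3f}")
```

An R-squared close to 1 means the residual sum of squares is small relative to the total variation in the target.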
My analysis of the dataset has led me to a striking conclusion: the size of a property, encompassing both its gross and living areas, is profoundly influential on its value. This insight came from observing correlations above 0.83 between these size metrics and land value. Moreover, the number of rooms and bedrooms also shares a moderate positive relationship with land value, further cementing the idea that more spacious properties fetch higher land valuations. This relationship holds true across the board, from land to building, and ultimately, the total value of a property.
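As a rough illustration of what a correlation of this kind measures, the sketch below computes the Pearson correlation coefficient from scratch. The `living_area` and `land_value` lists are hypothetical values, not rows from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical living areas (sq ft) and land values ($1000s), for illustration only.
living_area = [900, 1200, 1500, 2000, 2600]
land_value = [150, 210, 240, 330, 400]

r = pearson_r(living_area, land_value)
print(f"Pearson r = {r:.3f}")
```

Values near +1 indicate that the two quantities rise together almost linearly; values near 0 indicate little linear relationship.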
While size stands out as a crucial determinant of value, the age and renovation history of a building tell a subtler story. My observations reveal that these factors are not key players in determining land value. However, when it comes to building value, renovations (YR_REMODEL) seem to carry slightly more weight than the year of construction (YR_BUILT), hinting that modern updates can indeed enhance a building’s worth, albeit modestly.
Extending these insights into the total value of properties, I’ve noticed that the patterns observed for land and building values echo here as well. The total value is most responsive to the tangible, functional aspects of a property—its size, structure, and capacity. This suggests a consistent market trend: buyers and assessors alike prioritize the present capabilities and amenities of a property over its historical narrative or cosmetic improvements.
Through this data-driven journey, I’ve learned that in the realm of property valuation, the physical attributes of a property—its expansiveness and utility—reign supreme. The age and renovation history, while they do play roles, are secondary in the grand scheme of property valuation. This analysis not only informs potential investors and homeowners but also enriches my understanding of the real estate market’s valuation principles.
Diving into the dataset, I have charted a course through the construction and remodeling history of our urban landscape. The data stretches back to the 1700s, revealing an enduring legacy of architectural history with a mean construction year nestled in the roaring twenties, a time when the city was burgeoning with new structures. The standard deviation tells a story of diversity; a 42-year spread indicates an array of property ages, painting a picture of a city that grew in bursts and waves, rather than a steady stream.
The histogram I’ve examined shows a city that reached its zenith of construction in the 1900s, a testament to a bygone era of rapid expansion and the industrial boom. Over 50,600 buildings from that period still stand today, marking it as the most prolific in the city’s history. Fast forward to the 1980s, and the data reveals a peak in renovations, with over 104,000 properties rejuvenated, suggesting a decade of renewal and transformation, echoing a renewed spirit of urban revitalization.
Interestingly, the skewness of the construction years leans towards older properties, hinting at a city that values and preserves its past. Yet, the kurtosis presents a flat distribution, suggesting few anomalies in this historical narrative. When it comes to renovations, the data shows a less skewed distribution, revealing a consistent effort to update and modernize. However, the weak correlation between construction or remodel years and total property value challenges common perceptions, indicating that a property’s age or its updates do not necessarily equate to its market worth—a fascinating observation that suggests the city’s real estate value is dictated by more than just its history or its facelifts.
In my analysis of the “FY2023 Property Assessment Data,” I’ve uncovered intriguing correlations that shed light on the complex dynamics of property valuation. The data revealed a robust positive correlation between the gross area, living area, and the value metrics, including land and building values. This compelling link underscores a fundamental principle in real estate: larger properties tend to command higher values, reflecting the premium placed on space.
Another interesting observation is the positive correlation between the number of residential and commercial units and the total property value. This suggests that multi-unit properties, which offer more living or business space, naturally carry a higher valuation. Moreover, the strong correlation between gross tax and property values reaffirms the direct impact of valuation on tax obligations, highlighting the fiscal implications of property assessments.
Curiously, the year a building was constructed or remodeled shows little to no correlation with its size or value, pointing to the nuanced ways that age and modern updates intersect with market worth. Meanwhile, features like the number of parking spaces and fireplaces appear to have a varied influence on value, with parking showing a moderate link to property size, while fireplaces bear a negligible relationship, perhaps challenging conventional wisdom about the value these features add to a property.
In my recent analysis of property values across different neighborhoods, I’ve discovered a vibrant patchwork of economic climates that define our city’s landscape. The chart I’ve crafted here is a horizontal bar chart that lays bare the distribution of property counts across various parts of the city. East Boston, the lengthiest bar on the chart, suggests a bustling hub of real estate activity, possibly due to a combination of historical significance and contemporary development. Meanwhile, areas like Newton and Brookline, with shorter bars, might indicate exclusivity and higher property values despite a smaller count of properties.
This chart is more than a collection of colored bars; it’s a series of narratives about each neighborhood’s character and economic status. For instance, the substantial number of properties in Dorchester and Jamaica Plain could reflect a density of residential zones or a surge in property development. In contrast, the modest bars representing Chestnut Hill and Dedham may speak to a quieter real estate landscape, potentially one with larger properties and more green spaces.
In my recent analysis of the “FY2023 Property Assessment Data,” I was particularly captivated by the ‘Year Built’ aspect of the dataset. It felt like unearthing a historical timeline, where each property’s construction year tells a story of the past. The accompanying chart, a histogram of these years, visually narrates the peaks of construction activity, reflecting the region’s economic and social transformations over time.
The chart reveals a fascinating blend of historic and modern structures. It’s intriguing to see how different periods have left their architectural signatures, from historic buildings that echo the past to contemporary ones symbolizing modernity and progress. This mix not only highlights the region’s architectural evolution but also its adaptability to changing times and needs.
This exploration was more than an analysis; it was a journey through the architectural history of our region. It underscored the importance of balancing the preservation of historical buildings with the embrace of modern development. As I delved into these construction years, I gained a deeper appreciation for the stories embedded in our built environment, each structure a chapter in the ongoing narrative of our community’s growth.
In my initial exploration of the “FY2023 Property Assessment Data,” I was struck by the depth and breadth of information captured in this dataset. It encompasses a vast array of 180,627 properties, each detailed across 34 diverse attributes. These range from basic identifiers like street name, city, and zip code, to more intricate details such as land use classifications, building values, and the physical characteristics of the properties.
As I delved into the dataset, I noticed that it paints a rich tapestry of the area’s property landscape. The variety in attributes like land and building values particularly caught my attention, revealing the economic diversity of the region’s real estate. Furthermore, the dataset provides a window into the architectural history and development trends of the area, as evidenced by data points like the year properties were built or remodeled, and their physical conditions.
From a statistical standpoint, the range and spread of the data are remarkable. The properties’ construction years span from the early 18th century to the present, illustrating a fascinating mix of historical and contemporary architecture. The economic aspects, such as land and building values, show significant variation, underscoring the varied economic strata within the region. This initial analysis of the dataset has been enlightening, offering a comprehensive overview of the housing market and property dynamics. It lays the groundwork for more detailed investigations, which I anticipate will yield further insights into specific trends and patterns in property assessments.
In my exploration of decision trees within the realm of predictive modeling, a pivotal concept that has significantly enriched my understanding is the Residual Sum of Squares (RSS). This unassuming yet powerful metric serves as the linchpin in the decision tree algorithm, contributing substantially to the precision and efficacy of predictive modeling.
In essence, RSS functions as a guiding principle for decision trees, particularly during the process of making optimal splits. Its primary objective is to minimize the sum of squared differences between predicted values and actual outcomes. As the decision tree algorithm traverses through the dataset, RSS emerges as a discerning force, meticulously evaluating potential feature splits and selecting those that result in the minimal RSS at each node.
The role of RSS extends beyond the initial training phase, manifesting in the crucial process of pruning to prevent overfitting. Pruning, guided by RSS, strategically trims branches of the tree that contribute minimally to reducing the overall RSS. This delicate balance between complexity and accuracy ensures the decision tree’s capacity to generalize effectively to new and unseen data, cementing RSS as an integral component in the journey from model creation to refinement. In conclusion, my exploration of RSS in decision trees has underscored its significance as a decision-making criterion and a key contributor to the model’s predictive prowess.
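The split-selection idea can be sketched in a few lines. This is a minimal illustration for a single numeric feature, assuming each side of a split predicts its own mean; the `area` and `value` lists are hypothetical, and the function names are mine rather than any library’s API.

```python
def rss(values):
    """Residual sum of squares when predicting the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Try every midpoint between sorted x values; keep the one with the lowest total RSS."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Hypothetical living areas and values, for illustration only.
area = [800, 900, 1000, 2000, 2100, 2200]
value = [200, 210, 190, 400, 420, 410]

threshold, split_rss = best_split(area, value)
print(f"Best split: area <= {threshold}, total RSS = {split_rss}")
```

A full regression tree repeats this search over every feature at every node, then recurses into the two resulting groups.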
Our latest analysis provides insights into the proportions of individuals shot relative to their estimated population percentage in the U.S. This comparison reveals notable disparities across different racial categories:
- Black (B): Black individuals are shot at a proportion approximately 1.70 times higher than their representation in the U.S. population.
- Native American (N): The proportion of Native American individuals shot is approximately 1.31 times their representation in the U.S. population.
- Hispanic (H): Hispanic individuals are shot at a proportion approximately 0.77 times their representation in the U.S. population.
- White (W): The proportion of White individuals shot is approximately 0.67 times their representation in the U.S. population.
- Asian (A): Asian individuals are shot at a proportion approximately 0.29 times their representation in the U.S. population.
- Other (O): Individuals from other racial categories are shot at a proportion approximately 0.09 times their representation in the U.S. population.
This analysis sheds light on the disparities in the likelihood of being shot across various racial groups. Our next steps involve exploring potential factors contributing to these disparities and further examining the broader implications of these findings.
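The ratios above follow a simple recipe: a group’s share of incidents divided by its share of the population. Here is a minimal sketch of that arithmetic; the group labels, counts, and population shares are hypothetical placeholders, not the figures behind the results listed above.

```python
def representation_ratios(counts, population_share):
    """Share of incidents divided by share of population, per group."""
    total = sum(counts.values())
    return {
        group: (counts[group] / total) / population_share[group]
        for group in counts
    }


# Hypothetical placeholder inputs, for illustration only.
counts = {"group_1": 30, "group_2": 70}
population_share = {"group_1": 0.20, "group_2": 0.80}

ratios = representation_ratios(counts, population_share)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}x population share")
```

A ratio above 1 means the group appears more often in the incident data than its population share would suggest; a ratio below 1 means less often.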
To gauge the size of the age difference, we used Cohen’s d, an effect size measure that is commonly reported alongside t-tests. Cohen’s d quantifies the difference between two group means in units of standard deviation.
Cohen’s d is calculated using the formula:
d = (Mean Difference) / (Pooled Standard Deviation)
Where the Mean Difference is the difference in means between the two groups, and the Pooled Standard Deviation combines the standard deviations of the two groups, weighted by their sample sizes.
Calculating this for the mean ages of Black and White individuals, we found Cohen’s d = 0.57, which is a medium effect size by Cohen’s conventional thresholds (0.2 small, 0.5 medium, 0.8 large); the mean age of White individuals is higher than that of Black individuals.
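The formula can be sketched directly in Python. The two age lists below are hypothetical and are only there to show the calculation; they are not drawn from the shootings dataset.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical ages, for illustration only.
ages_a = [30, 35, 40, 45, 50]
ages_b = [25, 30, 35, 40, 45]

d = cohens_d(ages_a, ages_b)
print(f"Cohen's d = {d:.2f}")
```

Note that the pooling is done on variances and the square root is taken at the end, which is what "weighted by sample size" means in practice.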
Logistic regression serves as a statistical modeling technique utilized for examining the connection between a binary outcome variable and one or more predictor variables. In this analysis, I designated “manner_of_death” as the binary outcome variable, where the column specifies whether the death resulted from being “shot” or “shot and tasered.” Additional columns such as “armed,” “age,” “gender,” and “race” were considered as predictor variables to gauge the likelihood of a specific mode of death.
Subsequently, I delved into exploring and preprocessing the data, addressing missing values and encoding categorical variables. Following this, a logistic regression model was constructed by fitting the data. In this model, the binary variable became the dependent variable, and the other columns served as independent data, acting as predictors.
The model’s performance was then evaluated using various metrics, including accuracy, precision, recall, F1 score, and R2 score, all of which demonstrated satisfactory results. This comprehensive approach allowed for a thorough understanding of the relationship between the chosen predictors and the binary outcome, providing valuable insights into the risk estimation for different modes of death.
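For reference, the classification metrics named above can be computed from the four confusion-matrix counts. This is a small sketch with hypothetical labels (1 for the rarer class, 0 for the common one), not output from the fitted model.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When one class is much rarer than the other, accuracy alone can look good even for a model that rarely predicts the rare class, which is why precision, recall, and F1 are worth reporting alongside it.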
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups. There are different types of t-tests, but the most common ones are the independent samples t-test and the paired samples t-test.
To evaluate whether there is a significant difference in the average age between two races using a t-test, we can use Python and the scipy.stats module. In our project, ttest_ind is used to perform an independent samples t-test on the age distributions of the two races. The null hypothesis is that there is no difference in the average age between the two races. The p-value is then used to determine whether to reject the null hypothesis.
We came to the conclusion that we fail to reject the null hypothesis: the test did not show a significant difference in the average age between the two races.
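To show what the independent samples t-test computes, here is a standard-library sketch of the pooled-variance t statistic and its degrees of freedom. It does not compute the p-value; `scipy.stats.ttest_ind(a, b)` with its default `equal_var=True` returns this same statistic together with the two-sided p-value. The age lists are hypothetical.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Student's t statistic for two independent samples, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, n_a + n_b - 2


# Hypothetical ages, for illustration only.
ages_a = [30, 35, 40, 45, 50]
ages_b = [25, 30, 35, 40, 45]

t_stat, dof = pooled_t_statistic(ages_a, ages_b)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The larger the absolute value of t relative to the t distribution with those degrees of freedom, the smaller the p-value.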
Regarding Black individuals: There is a slight disparity between the mean and median, indicating a subtle rightward skew in the age distribution. In fact, the skewness is approximately 1, suggesting a modest deviation from a perfectly symmetrical distribution. The kurtosis is close to 4, slightly higher than the typical value of 3 for a normal distribution. This implies the potential presence of concentration around the mean or the existence of a fat tail, or possibly both.
As for White individuals: The mean and median also exhibit a minor difference, signaling a mild right-leaning skew in the age distribution. The skewness is approximately 0.5, reflecting a relatively smaller deviation compared to the Black population. The kurtosis is nearly 3, indicating minimal concentration around the mean and the absence of a substantial fat tail.
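The skewness and kurtosis figures discussed above are moment-based summaries. The sketch below uses population moments and the Pearson definition of kurtosis, under which a normal distribution scores 3; I am assuming that is the definition behind the numbers quoted here. In SciPy, `scipy.stats.kurtosis` reports excess kurtosis (normal = 0) unless called with `fisher=False`. The age list is hypothetical.

```python
def central_moment(xs, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment: positive when the right tail is longer."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Fourth standardized moment (Pearson definition; a normal distribution gives 3)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


# Hypothetical ages with a longer right tail, for illustration only.
ages = [20, 22, 25, 27, 30, 33, 38, 45, 60]

age_skew = skewness(ages)
age_kurt = kurtosis(ages)
print(f"skewness = {age_skew:.2f}, kurtosis = {age_kurt:.2f}")
```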
In this project, we focused on understanding the age distribution in fatal police shootings in the United States using a dataset of 8,002 incidents. Our goal was to identify the most common age groups involved in these tragic events.
Age Distribution Analysis
With the data ready, we analyzed the age distribution of individuals killed by police using a histogram, visually representing incident frequency across different age groups.
Key Findings and Observations
The visualization revealed:
- A right-skewed distribution, indicating a higher frequency of incidents involving younger individuals.
- The most affected age groups were those in their 20s and 30s, raising significant concern.
Conclusion and Reflection
This analysis provided valuable insights into age-related aspects of police shootings, emphasizing the vulnerability of younger demographics. This information guides further research, policy discussions, and interventions to understand and address contributing factors. The project demonstrates the power of data in illuminating societal issues, supporting informed decision-making for positive change.
In the expansive realm of unsupervised machine learning, Hierarchical Clustering emerges as a nuanced approach, offering a panoramic view of data relationships. Unlike K-Means, this technique doesn’t necessitate predefining the number of clusters. Instead, it creates a hierarchical tree of clusters, known as a dendrogram, which illustrates the fusion of data points into progressively larger clusters. Hierarchical Clustering can be approached in two ways: Agglomerative, where each data point starts as an individual cluster and gradually merges, or Divisive, where the process begins with a single cluster encompassing all data points and progressively divides. This flexibility renders Hierarchical Clustering applicable across various domains, from biological taxonomy to market segmentation.
The dendrogram generated by Hierarchical Clustering serves as a visual narrative, providing insights into the relationships between data points. The y-axis of the dendrogram represents the distance at which clusters merge, while the x-axis showcases individual data points and their groupings. Analysts can then choose an optimal threshold to cut the dendrogram, delineating distinct clusters based on their desired level of granularity. This adaptability makes Hierarchical Clustering a potent tool for discerning intricate structures within datasets.
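To illustrate the merge-then-cut idea, here is a small agglomerative sketch on one-dimensional points. It assumes single linkage (cluster distance is the closest pair of members), which is only one of several linkage choices, and it stops merging once the next merge would exceed a chosen threshold, which mimics cutting the dendrogram at that height. The points and threshold are hypothetical.

```python
def agglomerate(points, threshold):
    """Single-linkage agglomerative clustering that stops at a distance threshold.

    Returns the final clusters and the list of merge distances (dendrogram heights).
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        dist, i, j = min(
            (min(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break  # cutting the dendrogram here
        heights.append(dist)
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], heights


# Hypothetical one-dimensional data, for illustration only.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 20.0]
clusters, heights = agglomerate(points, threshold=3.0)
print(clusters)
print(heights)
```

Raising the threshold produces fewer, larger clusters; lowering it produces more, smaller ones.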
The decision to employ Hierarchical Clustering often hinges on the complexity and nature of the dataset. Its ability to unveil hierarchical structures within data makes it ideal for scenarios where understanding relationships at multiple levels of granularity is crucial. Whether deciphering biological classifications or exploring market dynamics, Hierarchical Clustering offers a meticulous lens through which to comprehend the intricate tapestry of data relationships.
K-Means clustering stands as a stalwart in the realm of unsupervised machine learning, offering a powerful technique for grouping data points based on their similarities. The algorithm strives to partition the dataset into ‘k’ distinct clusters, where each cluster is defined by a central point called a centroid. Iteratively, data points are assigned to the cluster whose centroid is nearest, and the centroids are recalculated until convergence. K-Means finds its utility in a myriad of applications, from customer segmentation in marketing to image compression in computer vision, providing a versatile solution for pattern recognition.
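The assign-then-recompute loop can be sketched in plain Python. This is a minimal version for two-dimensional points, assuming random initial centroids drawn from the data and squared Euclidean distance; the points are hypothetical, and a production implementation would usually add smarter initialization and multiple restarts.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on 2-D points: assign to nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k data points as starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points forming two well-separated groups, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```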
In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a unique approach to clustering by identifying regions of high data density. Unlike K-Means, DBSCAN does not require the user to predefine the number of clusters. Instead, it classifies points into three categories: core points, border points, and noise points. Core points, surrounded by a minimum number of other points within a specified radius, form the nucleus of clusters. Border points lie on the periphery of these clusters, while points in sparser regions are designated as noise. This makes DBSCAN particularly adept at discovering clusters of arbitrary shapes and handling outliers effectively.
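The core, border, and noise categories can be seen in a compact sketch. This version works on one-dimensional points for brevity; `eps` (the radius) and `min_pts` (the minimum neighbor count, which here includes the point itself) are the two usual DBSCAN parameters, and the data values are hypothetical.

```python
NOISE = -1


def dbscan_1d(points, eps, min_pts):
    """Minimal DBSCAN on 1-D points. Returns a cluster label per point (-1 means noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nearby = neighbors(i)
        if len(nearby) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nearby if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby_j = neighbors(j)
            if len(nearby_j) >= min_pts:
                queue.extend(nearby_j)  # j is also a core point, keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0]
labels = dbscan_1d(points, eps=0.5, min_pts=3)
print(labels)
```

Notice that the number of clusters is never specified; it falls out of the density parameters.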
In choosing between K-Means and DBSCAN, the nature of the dataset and the desired outcome play pivotal roles. K-Means excels when the number of clusters is known, and the clusters are well-defined and spherical. On the other hand, DBSCAN shines when dealing with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unravel hidden structures, paving the way for more informed decision-making in diverse fields.
Today, I delved into the dataset on fatal police shootings in the United States since 2015. My discoveries and subsequent actions are chronicled below:
- Data Loading: I accessed two CSV files, containing incident and agency data, directly from the Washington Post’s repository. This method ensures that I am equipped with the most current and pertinent information.
- Data Exploration: Upon scrutinizing the datasets, a salient revelation emerged—the agency ID serves as the common identifier across both tables. This fortuitous alignment allows for the seamless merging of the two tables, enabling a correlation between incident details and the agency or agencies responsible.
- Data Manipulation: However, upon closer inspection, a hurdle manifested itself—I encountered difficulties merging the tables due to disparate data types. To surmount this obstacle, I undertook the task of converting the data type of the agency ID in the incidents table to int64. Furthermore, the ‘agency ID’ column in the incident data table exhibited multiple values that necessitated separation. Rest assured, I am diligently addressing this matter and will provide updates in subsequent posts.
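One possible way to handle the multi-valued IDs is to split each incident into one row per agency before joining. The sketch below uses plain Python dictionaries to stay self-contained; the field names (`agency_ids`, `id`, `name`), the `;` separator, and the sample rows are all assumptions for illustration and may not match the actual files. With pandas, the same steps would roughly correspond to `Series.str.split`, `DataFrame.explode`, `Series.astype("int64")`, and `DataFrame.merge`.

```python
# Hypothetical rows; field names and the ";" separator are assumptions.
incidents = [
    {"id": 1, "agency_ids": "73"},
    {"id": 2, "agency_ids": "12;73"},
]
agencies = [
    {"id": 12, "name": "Agency A"},
    {"id": 73, "name": "Agency B"},
]

# Index agencies by their integer ID for fast lookup.
agency_by_id = {agency["id"]: agency for agency in agencies}

merged = []
for incident in incidents:
    # Split the multi-valued field so each agency gets its own row.
    for raw_id in incident["agency_ids"].split(";"):
        agency_id = int(raw_id)  # convert text to an integer so the key types match
        merged.append(
            {
                "incident_id": incident["id"],
                "agency_id": agency_id,
                "agency_name": agency_by_id[agency_id]["name"],
            }
        )

for row in merged:
    print(row)
```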
Today’s class delved into the intriguing world of Analysis of Variance (ANOVA), and I’m excited to share how it’s like a statistical detective, especially when it comes to understanding variations in incidents like police shootings in the United States. Picture this: we have data representing different racial groups, and we want to know if the average number of incidents varies significantly between them. That’s where ANOVA steps in.
So, here’s the breakdown: ANOVA acts as our investigator, examining two types of variations. First, it scrutinizes the differences within each racial group, considering how the number of shootings might differ within the same race. Then, it compares that to the differences between the racial groups, analyzing how the average number of incidents might differ across all races. If the differences within each race are similar to the differences between races, ANOVA suggests that the variations might be due to random factors. But if the differences between races are significantly larger than the differences within each race, ANOVA signals that there’s likely something more profound at play.
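The comparison of between-group and within-group variation is summarized by the F statistic. Here is a minimal sketch of a one-way ANOVA F computation; the three groups of numbers are hypothetical. `scipy.stats.f_oneway` computes the same statistic and also returns a p-value.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means))

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical measurements for three groups, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

An F value near 1 is what random variation alone would tend to produce; much larger values suggest the group means genuinely differ.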
In class we learned about various resampling methods such as k-fold cross-validation; bootstrapping is another nifty statistical technique for data analysis. Imagine you have a small dataset, and you want to understand more about its characteristics. Enter bootstrapping: a method that enables you to generate multiple datasets by repeatedly sampling from your original data, with replacement. It’s like creating multiple mini-worlds from your limited observations, allowing you to get a better grip on the underlying patterns and uncertainties in your data.
Here’s the magic: since you’re sampling with replacement, some data points may appear more than once in a given bootstrap sample, while others might not appear at all. This process mimics the randomness inherent in real-world data collection. By creating these bootstrapped datasets and analyzing them, you can estimate things like the variability of your measurements or the uncertainty around a particular statistic. It’s a statistical resilience booster, giving you a more robust understanding of your data’s nuances.
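Here is a small sketch of the idea: resample the data with replacement many times, compute the mean each time, and read off a rough 95% percentile interval. The sample values, the number of resamples, and the seed are arbitrary choices for illustration.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of n_resamples bootstrap samples, each drawn with replacement."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(data, k=len(data)))  # same size as the original
        for _ in range(n_resamples)
    ]


# Hypothetical small sample, for illustration only.
sample = [23, 31, 27, 45, 38, 29, 52, 34, 26, 41]

means = sorted(bootstrap_means(sample))
lower = means[int(0.025 * len(means))]   # roughly the 2.5th percentile
upper = means[int(0.975 * len(means))]   # roughly the 97.5th percentile
print(f"Sample mean: {statistics.mean(sample):.1f}")
print(f"Approximate 95% bootstrap interval: ({lower:.1f}, {upper:.1f})")
```

The width of the interval gives a feel for how much the mean might vary if the data were collected again.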
The beauty of bootstrapping lies in its simplicity and power. Whether you’re dealing with a small dataset, uncertain about your assumptions, or just curious about the reliability of your results, bootstrapping is like a statistical friend that says, “Let’s explore your data from various angles and see what insights we can uncover together.”
I delved into the intricacies of p-values following the class and based on the videos recommended, and this exploration led me to a nuanced understanding of the null hypothesis. The null hypothesis, often denoted as H0, assumes a central role in hypothesis testing, positing the absence of a significant difference, effect, or relationship between variables or groups within a given population. It serves as the foundational hypothesis against which subsequent testing is conducted. It’s important to note that the failure to reject the null hypothesis doesn’t affirm its veracity; rather, it indicates an absence of sufficient evidence from the collected data to suggest otherwise. The structured approach of statistical hypothesis testing provides a systematic framework for this decision-making process in the realm of statistics.
Transitioning to the concept of p-values, they function as a critical metric in gauging evidence against the null hypothesis. The p-value, or probability value, quantifies the likelihood of observing a test statistic at least as extreme as the one derived from sample data, assuming the null hypothesis holds true. This probability informs whether the results obtained from the sample data are statistically significant or could be attributed to random chance. The process involves formulating a null hypothesis, collecting and analyzing data, calculating the p-value, and subsequently comparing it to a predetermined significance level. A small p-value doesn’t necessarily invalidate the null hypothesis; instead, it suggests evidence against it. Interpretation of p-values demands caution, taking into account the research question, study design, and potential biases in data collection.
In summary, the interplay between the null hypothesis and p-values is pivotal in the landscape of statistical analysis. The null hypothesis establishes the baseline assumption, and p-values offer a quantifiable measure of evidence against it. Proper interpretation requires a nuanced understanding of statistical significance, careful consideration of study parameters, and acknowledgment of the broader statistical context beyond p-values, including effect sizes and confidence intervals.
Based on the lecture and the videos recommended by Prof. Gary, I learned the intricacies of various cross-validation approaches and their usefulness when dealing with smaller datasets like the Project 1 dataset.
It all boils down to how we interpret the test error rate in comparison to the training error rate of a model. The training error rate of a polynomial regression function, for example, usually falls as the degree of the polynomial increases. However, the test error rate falls initially and then rises as the model becomes more complex, which is the signature of overfitting. To detect this and choose a suitable level of complexity, I learned that there are various cross-validation approaches, as described below:
- The Validation Set Approach
This process entails randomly dividing the available set of observations into two segments: a training set and a validation set (or hold-out set). The model is trained on the training set, and the resulting model is employed to predict responses for the observations in the validation set. The ensuing validation set error rate, often evaluated using Mean Squared Error (MSE) for quantitative responses, serves as an approximation of the test error rate. However, this approach comes with certain limitations.
The validation estimate of the test error rate can exhibit high variability, depending on the specific observations included in the training and validation sets. In addition, the validation approach employs only a subset of observations (those included in the training set rather than the validation set) to train the model. Given that statistical methods generally perform less effectively when trained on a smaller dataset, this implies that the validation set error rate may tend to overstate the test error rate for the model fitted on the entire dataset.
- Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation (LOOCV) is a technique used to assess the performance of predictive models by systematically leaving out one data point at a time for validation while training the model on the remaining data. This process is repeated for each observation in the dataset, allowing the model to be validated against the entire dataset. While LOOCV provides a thorough evaluation of model performance and utilizes all available data, it can be computationally expensive for large datasets due to the need for multiple model fittings. Despite this, LOOCV is commonly employed when the dataset size is limited, providing a robust assessment of model generalization.
- k-Fold Cross-Validation
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE_1, MSE_2, ..., MSE_k, and the k-fold CV estimate is the average of these values.
The most important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off.
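The k-fold procedure can be sketched end to end with a simple model. This example assumes a one-predictor least-squares line as the model being validated and uses hypothetical data; setting k equal to the number of observations turns the same function into LOOCV.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k, seed=0):
    """Return the k-fold CV estimate (average fold MSE) and the per-fold MSEs."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in fold]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k, fold_mses


# Hypothetical, roughly linear data, for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9]

cv_mse, fold_mses = k_fold_mse(xs, ys, k=5)
print(f"5-fold CV estimate of test MSE: {cv_mse:.4f}")
```

Comparing this estimate across candidate models (for example, polynomials of different degree) is how cross-validation helps pick a level of complexity.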
In this blog post, I would like to mention a few things that I’ve learned about linear regression by using the CDC diabetes dataset to analyze the impact of obesity and inactivity on diabetes. This dataset provides a valuable opportunity to explore how these two factors contribute to the prevalence of diabetes in the United States. This technique was used to predict a target variable (in this case, diabetes prevalence) based on one or more input variables (obesity and inactivity rates). The goal is to find the best-fit line that describes the relationship between the input variables and the target variable.
As mentioned in the lecture notes, I loaded the data into three separate data frames, which were then merged to form a master data frame. Since only 354 data points are common to the tables, I tried to isolate the factors and find their correlation with diabetes, i.e., the target variable. On plotting the distributions and calculating the descriptive statistics of the factors against the diabetes data, I found that the distributions are skewed with negative kurtosis. The correlations between the target and factor variables are positive.
In class, I also learned more about the importance of residuals, heteroscedasticity, and p-values in choosing a model for a regression problem.
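As a small illustration of the best-fit line and its residuals, the sketch below fits a least-squares line with a single predictor and computes the residuals. It is a simplified stand-in for the two-predictor analysis described above, and the `inactivity` and `diabetes` values are hypothetical percentages, not CDC figures.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical rates (percent), for illustration only.
inactivity = [1, 2, 3, 4, 5]
diabetes = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(inactivity, diabetes)

# Residual = observed value minus the value predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(inactivity, diabetes)]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

Plotting residuals against the fitted values is a common visual check: a fan shape, where the spread grows or shrinks along the axis, is a sign of heteroscedasticity.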
In our latest project analysis, we’ve uncovered significant trends related to the armed status of individuals from diverse racial categories. Predominantly, individuals from the White, Black, and Hispanic groups are armed with guns. Notably, a substantial number of White individuals are armed with knives, while a concerning observation highlights a high number of unarmed Black individuals compared to other groups.
Our visualizations reveal consistent patterns across armed status categories, with White individuals leading, followed by Black and Hispanic individuals. Our next focus is on examining the relationship between threat levels and race. Initial findings show that “attack” incidents are prevalent across all racial categories, with White individuals in the lead, followed by Black and Hispanic individuals. The “other” category follows a similar pattern, with incidents labeled as “undetermined” being relatively low across all racial categories. This analysis aims to provide key insights into the perceived threat levels associated with different racial categories in police shooting incidents.