Author: Venkata Raghavendra Karthik Kanna
Last Modified: July 12, 2023

This notebook is an extension of the project conducted by Hicounselor. The project, titled "Work-Related Fatality Study: An In-depth Analysis using Python and MySQL", aims to examine and understand the factors contributing to work-related fatalities, identify trends and patterns, and provide insights to improve workplace safety measures.

Hicounselor provides virtual sandbox machines/kernels where the project can be implemented, and I completed the project there as well. This notebook serves as an extension for anyone who wants to work on their own machine locally, using both Python dataframes and SQL methods.

Software used:

  • Python (numpy, pandas, sqlalchemy, pymysql)
  • MySQL
In [1]:
## These packages may not come with the Anaconda installation by default, so make sure they are installed on your computer.
#!pip install pymysql
#!pip install sqlalchemy
## A MySQL installation and configuration is also needed (you need some way to establish a connection to a MySQL database).
In [2]:
import pandas as pd
import numpy as np

#Task 1: Read the given dataset
def read_data_from_csv():
    # Read the 'fatality.csv' file into a dataframe
    df = pd.read_csv('./fatality.csv')
    return df

#Task 2: Clean the given dataset
def data_cleaning():
    # do not edit the predefined function name
    df = read_data_from_csv()
    # Remove the 'Unnamed' column
    df.drop(columns=['Unnamed'], inplace=True)
    
    return df

#Task 3: Export the cleaned dataset to "fatalities_cleaned.csv"
def export_the_dataset():
    # do not edit the predefined function name
    df=data_cleaning()
    # Export the cleaned dataset with index=False and return it as 'df'
    df.to_csv('./fatalities_cleaned.csv', index=False)
    return df

#TASK 4: Load the cleaned dataset 'fatalities_cleaned.csv' into the database provided.
#Follow the instructions in the Task 4 description and complete the task accordingly.

#Check that the MySQL table was created from "fatalities_cleaned".
#Use this final dataset and upload it to the provided database for performing the analysis in MySQL.
#(On the Hicounselor platform, run this task by clicking on the terminal and then "run project".)
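
## Extension note: a minimal local sketch for Task 4 (an assumption for local use, not the platform's method).
## It reuses an SQLAlchemy engine like the one created later in this notebook; credentials are placeholders.
#   engine = create_engine("mysql+pymysql://user:password@localhost:3306/dbname")
#   export_the_dataset().to_sql('fatalities_cleaned', con=engine, if_exists='replace', index=False)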
In [3]:
## Create a cleaned dataset file as per above functions.
export_the_dataset()
Out[3]:
id incident_date day_of_week city state description plan citation
0 1 2009-06-23 tuesday tampa florida Victim was inside a building bundling material... unknown unknown
1 2 2009-06-30 tuesday conroe texas Victim was on a scaffold doing repairs on an a... unknown unknown
2 3 2009-07-06 monday miami florida Victim was conducting roofing work, and fell t... unknown unknown
3 4 2009-07-09 thursday north platte nebraska Victim was climbing a ladder while servicing a... unknown unknown
4 5 2009-07-11 saturday greensburg pennsylvania Employees were throwing away old files. They h... unknown unknown
... ... ... ... ... ... ... ... ...
14909 14910 2017-01-03 tuesday cut and shoot texas Worker died in fall from tree. federal no
14910 14911 2017-01-03 tuesday glendale kentucky Worker fatally stabbed by resident. state yes
14911 14912 2017-01-03 tuesday la verkin utah Worker died after being struck by excavator co... state no
14912 14913 2017-01-03 tuesday mount pleasant texas Worker died after being struck by tanker truck. federal yes
14913 14914 2017-01-02 monday shakopee minnesota Worker died in fall inside grain tank. state yes

14914 rows × 8 columns

In [4]:
df_cleaned = pd.read_csv('./fatalities_cleaned.csv')
In [5]:
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14914 entries, 0 to 14913
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             14914 non-null  int64 
 1   incident_date  14914 non-null  object
 2   day_of_week    14914 non-null  object
 3   city           14914 non-null  object
 4   state          14914 non-null  object
 5   description    14914 non-null  object
 6   plan           14914 non-null  object
 7   citation       14914 non-null  object
dtypes: int64(1), object(7)
memory usage: 932.2+ KB
In [6]:
df_cleaned.isnull().sum()
Out[6]:
id               0
incident_date    0
day_of_week      0
city             0
state            0
description      0
plan             0
citation         0
dtype: int64
In [7]:
df_cleaned.head()
Out[7]:
id incident_date day_of_week city state description plan citation
0 1 2009-06-23 tuesday tampa florida Victim was inside a building bundling material... unknown unknown
1 2 2009-06-30 tuesday conroe texas Victim was on a scaffold doing repairs on an a... unknown unknown
2 3 2009-07-06 monday miami florida Victim was conducting roofing work, and fell t... unknown unknown
3 4 2009-07-09 thursday north platte nebraska Victim was climbing a ladder while servicing a... unknown unknown
4 5 2009-07-11 saturday greensburg pennsylvania Employees were throwing away old files. They h... unknown unknown

We now have a cleaned dataset in the file fatalities_cleaned.csv. As per the project description, we are required to perform our analysis using SQL after loading the cleaned dataset into a MySQL instance. Therefore, I have set up a MySQL instance on my local machine and created the necessary database, table, and user.
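
For reference, here is a minimal sketch of that one-time setup in the MySQL shell; the database, user, and password below are illustrative placeholders, not the actual values used:

CREATE DATABASE IF NOT EXISTS fatality_db;
CREATE USER 'fatality_user'@'localhost' IDENTIFIED BY 'strong_password';
GRANT ALL PRIVILEGES ON fatality_db.* TO 'fatality_user'@'localhost';
FLUSH PRIVILEGES;

The table itself does not have to be created by hand: loading the CSV (for example with pandas' to_sql, as sketched under Task 4) can create it automatically.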

In this notebook, we will perform the analysis using two methods:

  • pandas (method 1)
  • SQL (method 2)

First, we will load the dataset into a dataframe and establish a connection to the database we created.

In [8]:
# pandas dataframe
df_cleaned = pd.read_csv('./fatalities_cleaned.csv')
df_cleaned.head()
Out[8]:
id incident_date day_of_week city state description plan citation
0 1 2009-06-23 tuesday tampa florida Victim was inside a building bundling material... unknown unknown
1 2 2009-06-30 tuesday conroe texas Victim was on a scaffold doing repairs on an a... unknown unknown
2 3 2009-07-06 monday miami florida Victim was conducting roofing work, and fell t... unknown unknown
3 4 2009-07-09 thursday north platte nebraska Victim was climbing a ladder while servicing a... unknown unknown
4 5 2009-07-11 saturday greensburg pennsylvania Employees were throwing away old files. They h... unknown unknown

Connect to the MySQL database from the Jupyter notebook¶

In [9]:
from sqlalchemy import create_engine
import pymysql
In [10]:
## Define database details
user = 'b8e35269'
password = 'Cab#22se'
host = 'localhost'
port = 3306
database = 'b8e35269'
In [11]:
conn_string = f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"
conn_string
Out[11]:
'mysql+pymysql://b8e35269:Cab#22se@localhost:3306/b8e35269'
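
If the password contains characters that are special in URLs (such as '@' or '#'), it is safer to escape it first; a small sketch using the standard library:

from urllib.parse import quote_plus
conn_string = f"mysql+pymysql://{user}:{quote_plus(password)}@{host}:{port}/{database}"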
In [12]:
engine = create_engine(conn_string)
In [13]:
connection = engine.connect()

At this point we have both data sources in place, and we can query them to answer our analysis questions.

1. What is the number of reported incidents?¶

method 1 - using pandas dataframes¶

In [14]:
print("Number of reported incidents: {}".format(df_cleaned.shape[0]))
Number of reported incidents: 14914

method 2 - using SQL query¶

In [15]:
query = "select count(*) as n_fatalities from fatalities_cleaned;"
pd.read_sql(query, connection)
Out[15]:
n_fatalities
0 14914

2. What is the year to year change for the number of fatal incidents?¶

The task is to calculate the year-to-year percentage change (rounded to the nearest whole number) in the number of fatalities for each incident year, excluding the year 2022.
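
For example, fatalities rose from 515 in 2009 to 1110 in 2010, so the year-to-year change is round(100 * (1110 - 515) / 515) = round(115.5) = 116.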

In [16]:
# Convert the 'incident_date' column to a datetime object. This could also have been done at load time using the parse_dates option.

df_cleaned['incident_date'] = pd.to_datetime(df_cleaned.incident_date)
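# Equivalent at load time:
# df_cleaned = pd.read_csv('./fatalities_cleaned.csv', parse_dates=['incident_date'])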

method 1 - using pandas dataframes¶

In [17]:
# create a column to store the year information
df_cleaned['incident_year'] = df_cleaned.incident_date.dt.year
yearwise_incidents = df_cleaned.groupby('incident_year').agg('count')[['id']].reset_index()
yearwise_incidents.rename(columns={'id': 'n_fatalities'}, inplace=True)
yearwise_incidents['previous_year'] = yearwise_incidents.n_fatalities.shift(1)
yearwise_incidents['year_to_year'] = np.round(100 * (yearwise_incidents.n_fatalities - yearwise_incidents.previous_year)/yearwise_incidents.previous_year)
yearwise_incidents.iloc[:-1, :] # as we don't need the data for the year 2022 (the last row)
Out[17]:
incident_year n_fatalities previous_year year_to_year
0 2009 515 NaN NaN
1 2010 1110 515.0 116.0
2 2011 1185 1110.0 7.0
3 2012 997 1185.0 -16.0
4 2013 1189 997.0 19.0
5 2014 1345 1189.0 13.0
6 2015 1148 1345.0 -15.0
7 2016 1106 1148.0 -4.0
8 2017 1541 1106.0 39.0
9 2018 1260 1541.0 -18.0
10 2019 1376 1260.0 9.0
11 2020 1119 1376.0 -19.0
12 2021 950 1119.0 -15.0

method 2 - using SQL query¶

In [18]:
query = """With table_year_incidents as (
select year(incident_date) as incident_year, count(*) as n_fatalities
from fatalities_cleaned
group by year(incident_date)
order by 1
    )
select incident_year, n_fatalities,
lag(n_fatalities) over() as previous_year,
round(100 * (n_fatalities - lag(n_fatalities) over() )/
            (lag(n_fatalities) over())) as year_to_year
from table_year_incidents
where incident_year != 2022"""

pd.read_sql(query, connection)
Out[18]:
incident_year n_fatalities previous_year year_to_year
0 2009 515 NaN NaN
1 2010 1110 515.0 116.0
2 2011 1185 1110.0 7.0
3 2012 997 1185.0 -16.0
4 2013 1189 997.0 19.0
5 2014 1345 1189.0 13.0
6 2015 1148 1345.0 -15.0
7 2016 1106 1148.0 -4.0
8 2017 1541 1106.0 39.0
9 2018 1260 1541.0 -18.0
10 2019 1376 1260.0 9.0
11 2020 1119 1376.0 -19.0
12 2021 950 1119.0 -15.0

3. What is the number of fatalities that received a citation?¶

The task is to calculate the total number of fatalities that received a citation. Grouping by the citation column gives the full breakdown; the answer is the 'yes' count, 3345.

method 1 - using pandas dataframes¶

In [19]:
# print("Number of fatalities that received a citation: {}"
#       .format(len(df_cleaned[df_cleaned['citation'] != 'unknown'])))

citationwise_fatalities = df_cleaned.groupby('citation').agg('count')[['id']].reset_index()
citationwise_fatalities.rename(columns={'id':'n_fatalities'},inplace=True)
citationwise_fatalities.sort_values('n_fatalities', ascending=False)
Out[19]:
citation n_fatalities
1 unknown 8886
2 yes 3345
0 no 2683

method 2 - using SQL query¶

In [20]:
query = """select citation, count(*) as n_fatalities
from fatalities_cleaned 
group by citation
order by n_fatalities desc
"""
pd.read_sql(query, connection)
Out[20]:
citation n_fatalities
0 unknown 8886
1 yes 3345
2 no 2683

4. What day of the week has the most fatalities and what is the overall percentage?¶

The task is to find the day of the week with the most reported fatalities and its percentage of the overall total (rounded to two decimal places).
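
For example, Tuesday's share works out to round(100 * 2728 / 14914, 2) = 18.29 percent.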

method 1 - using pandas dataframes¶

In [21]:
day_of_week_fatalities = df_cleaned.groupby('day_of_week').agg(['count'])['id'].reset_index()
day_of_week_fatalities.rename(columns = {'count': 'n_fatalities'}, inplace=True)
day_of_week_fatalities['percentage'] = np.round(100*day_of_week_fatalities.n_fatalities / day_of_week_fatalities.n_fatalities.sum(),2)
day_of_week_fatalities.sort_values('percentage', ascending=False)
Out[21]:
day_of_week n_fatalities percentage
5 tuesday 2728 18.29
6 wednesday 2706 18.14
1 monday 2626 17.61
4 thursday 2612 17.51
0 friday 2335 15.66
2 saturday 1177 7.89
3 sunday 730 4.89

method 2 - using SQL query¶
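
In the query below, sum(count(*)) over () is a window function that returns the grand total across all groups, so each day's percentage can be computed in a single statement.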

In [22]:
query = """select day_of_week, count(*) as n_fatalities,
round(100 * count(*)/sum(count(*)) over () , 2) as percentage
from fatalities_cleaned
group by day_of_week
order by percentage desc
"""
pd.read_sql(query, connection)
Out[22]:
day_of_week n_fatalities percentage
0 tuesday 2728 18.29
1 wednesday 2706 18.14
2 monday 2626 17.61
3 thursday 2612 17.51
4 friday 2335 15.66
5 saturday 1177 7.89
6 sunday 730 4.89

5. What is the number of fatalities involving welding?¶

The task is to count the total number of fatalities that involved welding, by matching descriptions containing the substring 'weld' (which also catches 'welder', 'welding', and so on).

method 1 - using pandas dataframes¶

In [23]:
print("Number of fatalities involving welding: {}"
      .format(df_cleaned.description.str.lower().str.contains('weld').sum()))
Number of fatalities involving welding: 79

method 2 - using SQL query¶
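
Note that the SQL wildcard % is doubled as %% here: the query string passes through the driver's %-style parameter formatting, where a literal percent sign generally has to be escaped.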

In [24]:
query = """select count(*) as welding_fatalities
from fatalities_cleaned
where lower(description) like '%%weld%%'
"""
pd.read_sql(query, connection)
Out[24]:
welding_fatalities
0 79

6. Select the last 5 from the previous query¶

The task is to list the last 5 welding-related fatalities from the previous query.

method 1 - using pandas dataframes¶

In [25]:
weldings_filter = df_cleaned.description.str.lower().str.contains('weld')
print("Last 5 fatalities during welding:")
Last 5 fatalities during welding:
In [26]:
df_cleaned.loc[weldings_filter,:].sort_values(by='incident_date', ascending=False).head()
Out[26]:
id incident_date day_of_week city state description plan citation incident_year
9587 9588 2021-04-14 wednesday cleveland ohio Worker electrocuted by portable welding machine. federal yes 2021
9812 9813 2021-01-30 saturday mission texas Worker died in welding explosion. federal yes 2021
9999 10000 2020-12-10 thursday urbana ohio Worker fatally crushed by seam welder. federal yes 2020
10678 10679 2020-05-24 sunday dallas texas Worker electrocted while welding HVAC pipe. federal no 2020
11743 11744 2019-07-08 monday kingwood texas Worker electrocuted while welding air conditio... federal no 2019

method 2 - using SQL query¶

In [27]:
query = """select *
from fatalities_cleaned
where lower(description) like '%%weld%%'
order by incident_date desc
limit 5
"""
pd.read_sql(query, connection)
Out[27]:
id incident_date day_of_week city state description plan citation
0 9588 2021-04-14 wednesday cleveland ohio Worker electrocuted by portable welding machine. federal yes
1 9813 2021-01-30 saturday mission texas Worker died in welding explosion. federal yes
2 10000 2020-12-10 thursday urbana ohio Worker fatally crushed by seam welder. federal yes
3 10679 2020-05-24 sunday dallas texas Worker electrocted while welding HVAC pipe. federal no
4 11744 2019-07-08 monday kingwood texas Worker electrocuted while welding air conditio... federal no

7. Select the top 5 states with the most fatal incidents.¶

The task is to find the top 5 states with the most fatal incidents.

method 1 - using pandas dataframes¶

In [28]:
statewise_fatalities = df_cleaned.groupby('state').agg('count')[['id']].reset_index()
statewise_fatalities.rename(columns={'id':'n_fatalities'}, inplace=True)
print("Top 5 states which have most fatal incidents:")
statewise_fatalities.sort_values('n_fatalities', ascending=False).head()
Top 5 states which have most fatal incidents:
Out[28]:
state n_fatalities
47 texas 1758
5 california 1352
10 florida 1021
35 new york 726
15 illinois 635

method 2 - using SQL query¶

In [29]:
query = """select state, count(*) as n_fatalities
from fatalities_cleaned
group by state
order by n_fatalities desc
limit 5
"""
pd.read_sql(query, connection)
Out[29]:
state n_fatalities
0 texas 1758
1 california 1352
2 florida 1021
3 new york 726
4 illinois 635

8. What are the top 5 states that had the most workplace fatalities from stabbings?¶

The task is to find the top 5 states with the most workplace fatalities from stabbings, by matching descriptions containing the substring 'stabb'.

method 1 - using pandas dataframes¶

In [30]:
stabb_filter = df_cleaned.description.str.lower().str.contains('stabb')
stabbing_fatalities = df_cleaned[stabb_filter]
statewise_stabbing_fatalities = stabbing_fatalities.groupby('state').agg('count')['id'].reset_index()
statewise_stabbing_fatalities.rename(columns={'id':'stabbing_fatalities'}, inplace=True)
print("Top 5 states which have most fatal incidents:")
statewise_stabbing_fatalities.sort_values('stabbing_fatalities', ascending=False).head()
Top 5 states which have most fatal incidents:
Out[30]:
state stabbing_fatalities
8 new york 7
0 california 5
4 kentucky 5
2 illinois 3
1 connecticut 2

method 2 - using SQL query¶

In [31]:
query = """select state, count(*) as stabbing_fatalities
from fatalities_cleaned
where lower(description) like '%%stabb%%'
group by state
order by stabbing_fatalities desc
limit 5
"""
pd.read_sql(query, connection)
Out[31]:
state stabbing_fatalities
0 new york 7
1 california 5
2 kentucky 5
3 illinois 3
4 connecticut 2

9. What are the top 10 states that had the most workplace fatalities from shootings?¶

The task is to find the top 10 states with the most workplace fatalities from shootings, by matching descriptions containing the substring 'shot'.

method 1 - using pandas dataframes¶

In [32]:
shoot_filter = df_cleaned.description.str.lower().str.contains('shot')
shooting_fatalities = df_cleaned[shoot_filter]
statewise_shooting_fatalities = shooting_fatalities.groupby('state').agg('count')['id'].reset_index()
statewise_shooting_fatalities.rename(columns={'id':'n_fatalities'}, inplace=True)
print("Top 10 states which have most fatal incidents:")
statewise_shooting_fatalities.sort_values(['n_fatalities','state'], ascending=False).head(10)
Top 10 states which have most fatal incidents:
Out[32]:
state n_fatalities
13 indiana 28
4 california 23
40 texas 21
29 new york 20
8 florida 14
16 kentucky 13
34 oregon 9
25 nevada 9
12 illinois 9
44 washington 8

method 2 - using SQL query¶

In [33]:
query = """select state, count(*) as shooting_fatalities
from fatalities_cleaned
where lower(description) like '%%shot%%'
group by state
order by shooting_fatalities desc
limit 10
"""
pd.read_sql(query, connection)
Out[33]:
state shooting_fatalities
0 indiana 28
1 california 23
2 texas 21
3 new york 20
4 florida 14
5 kentucky 13
6 nevada 9
7 illinois 9
8 oregon 9
9 washington 8

10. What is the total number of shooting deaths per year?¶

The task is to count the total number of deaths caused by shootings each year, in decreasing order.

method 1 - using pandas dataframes¶

In [34]:
# Reuse the shooting_fatalities dataframe filtered in question 9
yearwise_shooting_fatalities = shooting_fatalities.groupby('incident_year').agg('count')[['id']].reset_index()
yearwise_shooting_fatalities.rename(columns={'id':'shooting_fatalities'}, inplace=True)
In [35]:
yearwise_shooting_fatalities.sort_values('shooting_fatalities', ascending=False)
Out[35]:
incident_year shooting_fatalities
12 2021 38
6 2015 28
7 2016 28
11 2020 27
10 2019 24
9 2018 21
4 2013 19
5 2014 18
1 2010 16
8 2017 14
2 2011 13
0 2009 10
3 2012 9
13 2022 2

method 2 - using SQL query¶

In [36]:
query = """select year(incident_date) as incident_year, count(*) as shooting_fatalities
from fatalities_cleaned
where lower(description) like '%%shot%%'
group by year(incident_date)
order by shooting_fatalities desc
"""
pd.read_sql(query, connection)
Out[36]:
incident_year shooting_fatalities
0 2021 38
1 2015 28
2 2016 28
3 2020 27
4 2019 24
5 2018 21
6 2013 19
7 2014 18
8 2010 16
9 2017 14
10 2011 13
11 2009 10
12 2012 9
13 2022 2

Once we are done with the analysis, we close the database connection.

In [37]:
connection.close()
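
Optionally, the engine's connection pool can be released as well:

engine.dispose()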

This marks the end of the notebook. If any modifications are made to this notebook, I will upload the revised version to the GitHub repo.

Thank you for reading. Have a great day!!