Data Modeling + Normalization of a Human Resources Data Set

Jeremiah — Tue, 11 Feb 2020 21:28:26 +0000

This is a quick and simple demonstration of how I used Power BI’s Query Editor to clean, normalize, and model a Human Resources data set.

The Data Set:

I first learned about data modeling from using Excel’s Power Query. When I moved on to learning Power BI, I came across a data set that I thought could be a great tool for practicing data modeling. The data set was designed by Dr. Rich Heubner and Dr. Carla Patalano of New England College of Business for the purpose of teaching HR graduate students HR metrics, measurement, and analytics. It was designed to accompany a case study with workforce diversity as the focal point.

Human Resources Data Set (Base + Normalized)Download

The following is a description of each of the columns in the data set:

Feature	Description	DataType
Employee Name	Employee’s full name	Text
EmpID	Employee ID is unique to each employee	Text
MarriedID	Is the person married (1 or 0 for yes or no)	Binary
MaritalStatusID	Marital status code that matches the text field MaritalDesc	Integer
EmpStatusID	Employment status code that matches text field EmploymentStatus	Integer
DeptID	Department ID code that matches the department the employee works in	Integer
PerfScoreID	Performance Score code that matches the employee’s most recent performance score	Integer
FromDiversityJobFairID	Was the employee sourced from the Diversity job fair? 1 or 0 for yes or no	Binary
PayRate	The person’s hourly pay rate. All salaries are converted to hourly pay rate	Float
Termd	Has this employee been terminated – 1 or 0	Binary
PositionID	An integer indicating the person’s position	Integer
Position	The text name/title of the position the person has	Text
State	The state that the person lives in	Text
Zip	The zip code for the employee	Text
DOB	Date of Birth for the employee	Date
Sex	Sex – M or F	Text
MaritalDesc	The marital status of the person (divorced, single, widowed, separated, etc)	Text
CitizenDesc	Label for whether the person is a Citizen or Eligible NonCitizen	Text
HispanicLatino	Yes or No field for whether the employee is Hispanic/Latino	Text
RaceDesc	Description/text of the race the person identifies with	Text
DateofHire	Date the person was hired	Date
DateofTermination	Date the person was terminated, only populated if, in fact, Termd = 1	Date
TermReason	A text reason / description for why the person was terminated	Text
EmploymentStatus	A description/category of the person’s employment status. Anyone currently working full time = Active	Text
Department	Name of the department that the person works in	Text
ManagerName	The name of the person’s immediate manager	Text
ManagerID	A unique identifier for each manager.	Integer
RecruitmentSource	The name of the recruitment source where the employee was recruited from	Text
PerformanceScore	Performance Score text/category (Fully Meets, Partially Meets, PIP, Exceeds)	Text
EngagementSurvey	Results from the last engagement survey, managed by our external partner	Float
EmpSatisfaction	A basic satisfaction score between 1 and 5, as reported on a recent employee satisfaction survey	Integer
SpecialProjectsCount	The number of special projects that the employee worked on during the last 6 months	Integer
LastPerformanceReviewDate	The most recent date of the person’s last performance review.	Date
DaysLateLast30	The number of times that the employee was late to work during the last 30 days	Integer

Demonstrating any actual analytics work on this data set requires a whole separate post. For now, the goal is simply to demonstrate my data modeling skill set. If you take a look at the data, you will see why it lends itself so well for data modeling.

There are plenty of redundant columns. As an example, we can see that MaritalStatusID gives the information as the MaritalDesc (whether the person is married, divorced, single, etc.). The same is true with EmpStatusID and EmploymentStatus. We could easily split these ID-Descriptor pairs into separate tables to reduce redundancy, i.e., we could normalize this data set.

Note that normalization isn’t really necessary to do analysis on this data set. As I said before, I did this only as a demonstration of data modeling in Power BI. Another side-note is that I didn’t technically fully normalize the data. There are still some transitive dependence. I only did a very basic data normalization.

The Process:

I prepared a video showing how I processed the data from start to finish (loading, cleaning, normalizing, and modeling).

The following is a breakdown of each process and corresponding timestamps.

Data Cleaning

0:00 – 0:12 | I loaded the data set and did a quick glance to see how many NULL values we have. I removed most of the NULL values by removing all the NULL values in the Employee_Name columns.

0:13 – 0:30 | I split the Employee_Name column into two: a First_Name column and a Last_Name column. I used the comma as the separator and used the Trim functionality to remove the leading spaces in the First_Name values. I then noticed that the name “Jeremy Prater” was not split correctly because it didn’t have a comma in the first place. I corrected this manually. (I realized as I’m typing this that I left “Jeremy” in the Last_Name column and “Prater” in the First_Name column. “Jeremy” is most likely the person’s first name.)

0:35 – 0:45 | I replaced the single blank entry in the TermReason column with “Unknown”.

0:46 – 1:00 | I noticed that manager “Webster Butler” has a ManagerID of “39” and “null”. It’s possible that there are two managers named “Webster Butler” and that one was not assigned a ManagerID yet. I just assumed that these are the same person and so I replaced all “null” values with “39”.

1:05 – 1:27 | I corrected the Zip codes by adding back the leading zeros that was lost when the data set was loaded into Power BI. I did this by adding a custom column generated by the following M code: = Text.PadStart([Zip], 5, "0"). I then removed the original Zip column.

1:31 – 1:37 | I noticed that the HispanicLatino column has four unique values: “No”, “no”, “Yes”, and “yes”. I formatted this column by using the Capitalize Each Word method to remove the redundant “yes” and “no” values.

1:45 – 1:53 | I removed all the binary columns (MarriedID, GenderID, FromDiversityJobFairID, and Termd) because I believed they’re redundant. For example… all person with a MarriedID of “0” (i.e., never married) has a MaritalDesc of “Single”. If I would like to visualize the count of “Single” employees in the organization, I could just use the MaritalDesc column.

Data Normalization

2:10 – 2:24 | I created the Marital_Status table. I first copied both the MaritalStatusID and MaritalDesc columns into a new table. I then removed all the duplicate rows from this new table and sorted it by ascending MaritalStatusID.

2:40 – 2:55 | I created the Employment_Status table using the same method I used to create the previous table.

2:56 – 3:16 | I created the Departments table using the same method as before.

3:20 – 3:34 | I created the Performance_Scores table using the same method as before.

3:35 – 3:47 | I created the Performance_Scores table.

3:49 – 4:17 | I created the Citizenship_Status table. This time the CitizenDesc column does not come paired with an ID column. I had to generate the IDs. First, I copied the CitizenDesc column to a new table and removed all the duplicate values. I then added an index column and renamed it CitizenDescID. Finally, I merged this new table with the original table using the CitizenDesc columns as the connector and removed the CitizenDesc column from the original table.

4:18 – 4:43 | I created the Race_Descriptions table by using a similar method.

4:44 – 5:10 | I created the Termination_Reasons table using the same method.

5:11 – 5:25 | I created the Managers table. This time, the ManagerName column does come paired with a ManagerID column. There was no need to create a new ID column.

5:26 – 5:55 | I created the Recruitment_Sources table. The RecruitmentSource column does NOT come paired with an ID column. I generated an ID column called RecruitmentSourceID.

5:56 – 7:14 | I did a final check on the data set. I changed the data types of all ID fields into Whole Number and rearranged the columns to what I think makes more sense.

Data Modeling

7:23 – 8:22 | I designed the data model. I created relationships between the main Fact table and each of the Dimension tables. I connected each foreign key in the Fact table with a corresponding primary key in each of the Dimension table. The resulting schema is a Star schema.

And that’s it! Yes, it isn’t practical to do all this work for this particular data set. However, I believe it’s great practice for a situation when the data at hand is split into separate systems or csv files. I believe that knowing how to connect several data sources is a crucial data analysis skill. I hope that after reading through this post, you’re convinced that I have it.

Power BI – iJeremiah.com