Building a large real-world SQL Server demo database
In my previous post, I discussed a large data set that I was using for demonstrations and for working with R. Instead of using the usual AdventureWorks or Northwind databases that Microsoft makes available via Codeplex, I wanted something larger and a bit more real-world, and something I could continue to build on as more data became available. I had never really found anything I liked, so I had basically relegated the thought to a background task.
As I mentioned in the last post, I had the opportunity to attend an alpha delivery of the EMC "Data Science and Big Data Analytics" training course. As we worked through the labs, I couldn't help but think that the data set being used was very interesting and almost fit what I had been looking for. In speaking with the creators of the class, I learned that they had built the database from information obtained from the publicly available "Home Mortgage Disclosure Act" reporting site. Here in the US, when you apply for a home mortgage loan you must provide certain information to the lending institution, and the institution in turn must report information on every application it processes, whether approved or denied. The data is stripped of personal identifiers before it is submitted, but what is reported is extremely interesting from both a volume and a content perspective. I decided to spend a bit of time looking into how I could gain access to this data, and thanks to the magic of the interwebs, I was able to piece together everything I needed to build the database I was looking for.
Gaining Access to the Home Mortgage Disclosure Act Data
(I realize that the information posted here is really only relevant to those of us in the United States; however, I believe the resulting data is useful worldwide for learning and demo purposes.)
The Home Mortgage Disclosure Act was enacted by Congress in 1975 and is administered by the Federal Financial Institutions Examination Council (FFIEC). Because the FFIEC is a government institution funded by US taxpayer dollars, the data it collects and maintains is made available to the public free of charge. Basically, the data is chunked into yearly "drops" that are released according to a specific timeline. You can read about the timeline here: http://www.ffiec.gov/hmda/timeline.htm
There are multiple ways to obtain the data for a given year. The easiest is to download the "LAR/TS Windows Application" for that year (LAR = "Loan/Application Register", TS = "Transmittal Sheet") from http://www.ffiec.gov/hmda/hmdaproducts.htm (note the Windows application download links towards the bottom of the page). The downloaded application contains all of the data for a given year within a SQL Server Compact Edition (SQLCE) database, which works out to about 5GB per year. The problem with this format is that it is strictly a SQLCE database, so you'll need to extract the data for use with SQL Server. Another issue is that the data is organized into seemingly random groupings of states, meaning there is no single dataset that contains a full year of data for the entire country.
Another way to obtain the data is to download the text files directly from the site. The files are zipped and can be found by downloading the "ALL" file in the LAR table for the year you are interested in. (At the time of this posting, three years are available: 2008, 2009, and 2010.) These files are tab-delimited and will need to be imported into SQL Server.
Creating the HMDA Database
Because I want a database with a single "fact" table containing all of the LAR records, I first need to create the database and then the necessary table structure. The HMDA data is very "flat" and denormalized, so it works very well as a fact table. There are 45 fields in the text file, and column names are NOT included in the first row. The data dictionary can be found here: LAR Record Format
Since the resulting database works out to be about 5GB per year, and since we'll be importing three years of data, I'll start by creating a 15GB database with the following T-SQL command (I am using SQL Server 2012 as my destination, so some of the syntax might differ slightly from what you are used to):
CREATE DATABASE [HMDAData]
CONTAINMENT = NONE
ON PRIMARY
( NAME = N'HMDAData', FILENAME = N'C:\SQLData\MDF\HMDAData.mdf' , SIZE = 15GB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024MB )
LOG ON
( NAME = N'HMDAData_log', FILENAME = N'C:\SQLData\LDF\HMDAData_log.ldf' , SIZE = 8GB , MAXSIZE = 2048GB , FILEGROWTH = 10%);
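One optional tweak that is not part of the original script: since the only writes in this exercise are bulk loads, you may want to switch the database to the SIMPLE recovery model before importing so the log file doesn't grow past its 8GB starting size, and switch back afterwards if you need full recovery.
-- Optional: keep transaction log growth under control during the bulk loads.
-- Switch back to FULL afterwards if you need point-in-time recovery.
ALTER DATABASE [HMDAData] SET RECOVERY SIMPLE;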
Once the database is created, we can build the table that will hold the LAR records:
CREATE TABLE [dbo].[lar_data](
[year] [int] NOT NULL,
[respid] [nchar](10) NOT NULL,
[agycd] [nchar](1) NOT NULL,
[loan_type] [int] NOT NULL,
[property_type] [int] NOT NULL,
[loan_purpose] [nchar](1) NULL,
[occupancy] [nchar](1) NULL,
[loan_amount] [nchar](5) NULL,
[preapproval] [nchar](1) NULL,
[action_type] [nchar](1) NULL,
[msa_md] [nchar](5) NULL,
[state_code] [int] NOT NULL,
[county_code] [int] NULL,
[tract_code] [nchar](7) NOT NULL,
[applicant_ethnicity] [nchar](1) NULL,
[co_applicant_ethnicity] [nchar](1) NULL,
[applicant_race_1] [nchar](1) NULL,
[applicant_race_2] [nchar](1) NULL,
[applicant_race_3] [nchar](1) NULL,
[applicant_race_4] [nchar](1) NULL,
[applicant_race_5] [nchar](1) NULL,
[co_applicant_race_1] [nchar](1) NULL,
[co_applicant_race_2] [nchar](1) NULL,
[co_applicant_race_3] [nchar](1) NULL,
[co_applicant_race_4] [nchar](1) NULL,
[co_applicant_race_5] [nchar](1) NULL,
[applicant_sex] [nchar](1) NULL,
[co_applicant_sex] [nchar](1) NULL,
[applicant_income] [nchar](4) NULL,
[purchaser_type] [nchar](1) NULL,
[denial_reason_1] [nchar](1) NULL,
[denial_reason_2] [nchar](1) NULL,
[denial_reason_3] [nchar](1) NULL,
[rate_spread] [nchar](5) NULL,
[HOEPA_status] [nchar](1) NULL,
[lien_status] [nchar](1) NULL,
[edit_status] [nchar](1) NULL,
[seq_number] [nchar](7) NULL,
[population] [int] NULL,
[minority_population_percent] [numeric](18, 0) NULL,
[median_income] [int] NULL,
[tract_msa_income_percent] [numeric](18, 0) NULL,
[owner_occ_units] [int] NULL,
[owner_occ_1_to_4_family] [int] NULL,
[app_date_pre_2004] [nchar](1) NULL
) ON [PRIMARY];
Once the table is created, you can use either the SQL Server Import/Export Wizard or SSIS to import each of the files into the lar_data table. Since this is a simple data load with no transformations needed, the Import/Export Wizard works just fine; a scripted alternative is sketched a little further down.
Because the text file contains character data, the wizard will need to convert (CAST) the values destined for the int and numeric columns. If an error occurs on a specific conversion, it's best to just ignore it; we're not trying to create a perfect database, just one that works for most conditions.
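If you would rather script the load than click through the wizard for each file, something along the lines of the following BULK INSERT should also work. The path and file name here are placeholders for wherever you unzipped the 2010 "ALL" LAR file; remember that the file is tab-delimited with no header row.
-- A sketch of a scripted load as an alternative to the Import/Export Wizard.
-- The path and file name are placeholders; point them at your unzipped file.
BULK INSERT [dbo].[lar_data]
FROM 'C:\SQLData\Import\LAR_2010_ALL.txt'
WITH (
    FIELDTERMINATOR = '\t', -- tab-delimited columns
    ROWTERMINATOR = '\n',   -- one record per line
    FIRSTROW = 1,           -- no header row in the file
    MAXERRORS = 1000000,    -- skip rows that fail conversion rather than aborting
    TABLOCK                 -- table lock for faster bulk loading
);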
When the import for 2010 is complete, there will be approximately 16.3 million rows of data inserted into the table.
To complete the task, import the remaining years into the table. If you load the 2008, 2009, and 2010 data, you should end up with approximately 53.2 million rows.
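A quick way to verify each load is to count rows per year; the totals should be in the same ballpark as the figures above:
-- Sanity check after each import: row counts per year.
SELECT [year], COUNT_BIG(*) AS [row_count]
FROM [dbo].[lar_data]
GROUP BY [year]
ORDER BY [year];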
You may notice that many of the columns contain codes rather than descriptive values. The meaning of each code is in the data dictionary linked above, but typing all of that descriptive data in by hand is a pain. Since I have already extracted it and created the appropriate tables, you can download the following ZIP file. It contains one .csv file per table (column names in the first row, each file named after the table it populates), which are simple to import using the Import/Export Wizard, along with a .sql file that creates the remaining tables and inserts the descriptive data:
Zip file containing dimension tables and descriptive data
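If you would rather build the lookup tables by hand, their shape is simple. Here is a sketch of one of them, with column names matching what the view below expects; the code values shown are taken from the LAR record format data dictionary, and the full set of tables and values is in the .sql file inside the ZIP.
-- A sketch of one lookup ("dimension") table; the ZIP file contains the full
-- definitions and descriptive data for all of them.
CREATE TABLE [dbo].[tblLoanType](
    [loan_type] [int] NOT NULL,
    [description] [nvarchar](100) NOT NULL
);

INSERT INTO [dbo].[tblLoanType] ([loan_type], [description])
VALUES (1, N'Conventional'),
       (2, N'FHA-insured'),
       (3, N'VA-guaranteed'),
       (4, N'FSA/RHS-guaranteed');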
Creating an Appropriate “State” View
Now that you have the fact table loaded, and you've used the files I've supplied to create the dimension tables, you'll likely want to create a subset of the data for specific analysis. Since I live in Colorado, I decided to create a view that shows only Colorado data. The view definition is below; modify it accordingly to isolate the data for the state you are interested in:
CREATE VIEW [dbo].[vColoradoLoans]
AS
SELECT
lt.description AS [loan_type]
,lpt.description AS [property_type]
,lp.description AS [loan_purpose]
,o.description AS [occupancy_type]
,CAST(loan_amount AS money) * 1000 AS [loan_amount]
,p.description AS [preapproval]
,at.description AS [action_taken]
,s.state_name AS [state]
,c.county_name AS [county]
,e.description AS [applicant_ethnicity]
,r.description AS [applicant_race]
,sx.description AS [applicant_sex]
,CAST(applicant_income AS money) * 1000 AS [applicant_income]
,pt.description AS [purchaser_type]
,dr.description AS [denial_reason]
,CASE WHEN rate_spread = 'NA' THEN 0 ELSE CAST(rate_spread AS numeric(5, 2)) END AS [rate_spread] -- numeric(5,2) preserves the decimal portion of the spread
,population AS [tract_population]
,minority_population_percent
,median_income
,owner_occ_units
FROM
lar_data ld
JOIN
tblLoanType lt
ON
ld.loan_type = lt.loan_type
JOIN
tblPropertyType lpt
ON
ld.property_type = lpt.property_type
JOIN
tblLoanPurpose lp
ON
ld.loan_purpose = lp.loan_purpose
JOIN
tblOwnerOccupancy o
ON
ld.occupancy = o.owner_occupancy
JOIN
tblPreapproval p
ON
ld.preapproval = p.preapproval
JOIN
tblAction at
ON
ld.action_type = at.action_taken
JOIN
tblState s
ON
ld.state_code = s.state_code
JOIN
tblCounty c
ON
ld.county_code = c.county_code AND s.state_name=c.state_name
JOIN
tblEthnicity e
ON
ld.applicant_ethnicity = e.ethnicity
JOIN
tblRace r
ON
ld.applicant_race_1 = r.race
JOIN
tblSex sx
ON
ld.applicant_sex = sx.sex
JOIN
tblPurchaserType pt
ON
ld.purchaser_type = pt.purchaser_type
JOIN
tblDenialReason dr
ON
ld.denial_reason_1 = dr.reason
WHERE
ld.state_code = '08' -- Colorado
AND
ld.property_type = 1 -- Single Family Homes
AND
ld.applicant_income <> 'NA' -- remove invalid income reports
;
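With the view in place, a simple aggregate such as the following gives a quick feel for the data (it uses only the column names defined in the view):
-- Example query against the view: application counts and average loan amount
-- by county, largest counties first.
SELECT
     [county]
    ,COUNT(*) AS [applications]
    ,AVG([loan_amount]) AS [avg_loan_amount]
FROM [dbo].[vColoradoLoans]
GROUP BY [county]
ORDER BY [applications] DESC;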
Conclusion
Once you have the view in place, you have a large, flexible database with a real-world use case that you can use for demos, performance tuning, statistical analysis, and more. Of course, you'll want to add your own indexes and possibly partitioning depending on your use case.
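As a starting point, a nonclustered index covering the filter columns used by the state view might look something like this; adjust the key and included columns to match your own queries:
-- A possible starting index to support the state/property-type filter in the
-- view above; tailor the key and INCLUDE columns to your own workload.
CREATE NONCLUSTERED INDEX [IX_lar_data_state_property]
ON [dbo].[lar_data] ([state_code], [property_type])
INCLUDE ([loan_amount], [action_type], [county_code], [applicant_income]);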
I know that I’ve often wanted such a database when presenting or demonstrating specific functions within SQL Server, so I hope this database proves useful.