Monday, December 26, 2011

Building a large real-world SQL Server demo database

In my previous post, I discussed a large data set that I was using for demonstrations and for working with R. Instead of the usual AdventureWorks or Northwind databases that Microsoft makes available via CodePlex, I wanted something larger and more real-world, and something I could continue to build on as more data became available. I had never found anything I really liked, so the idea had been relegated to a background task.

As I mentioned in the last post, I had the opportunity to attend an alpha delivery of the EMC “Data Science and Big Data Analytics” training course. As we worked through the labs, I couldn’t help thinking that the data set being used was very interesting and almost exactly what I had been looking for. In speaking with the creators of the class, I learned that they had built the database from information obtained from the publicly available “Home Mortgage Disclosure Act” reporting site. Here in the US, when you apply for a home mortgage loan you must provide certain information to the lending institution, and the institution in turn must report information on every application it processes, whether approved or denied. The data is stripped of personal identifiers when it is submitted, but what remains is extremely interesting from both a volume and a content perspective. I decided to spend a bit of time looking into how I could gain access to this data, and thanks to the magic of the interwebs, I was able to piece together everything I needed to build the database I was looking for.

Gaining Access to the Home Mortgage Disclosure Act Data

(I realize that the information posted here is really only relevant to those of us in the United States; however, I believe the resulting data is useful and relevant worldwide for learning and demo purposes.)

The Home Mortgage Disclosure Act was enacted by Congress in 1975 and is administered by the Federal Financial Institutions Examination Council (FFIEC). Because the FFIEC is a government institution funded by US taxpayer dollars, the data it collects and maintains is made available to the public free of charge. The data is published in yearly “drops” that are released according to a specific timeline. You can read about the timeline here: http://www.ffiec.gov/hmda/timeline.htm

The data itself is available from the following link:

http://www.ffiec.gov/hmda/

There are multiple ways to obtain the data for a given year. The easiest is to download the “LAR/TS Windows Application” for that year (LAR = “Loan/Application Register”, TS = “Transmittal Sheet”). You can download the application from the following link: http://www.ffiec.gov/hmda/hmdaproducts.htm (note the Windows application download links towards the bottom of the page). The application ships with all of the data for a given year inside a SQL Server Compact Edition (SQLCE) database of roughly 5GB. The problem with this format is that it is strictly a SQLCE database, which means you will want to extract the data for use with SQL Server. Another issue is that the data is organized into seemingly arbitrary groups of states, so there is no single dataset containing an entire year for the whole country.

Another way to obtain the data is to download the text files directly from the site. The files are zipped and can be found by downloading the “ALL” file in the LAR table for the year you are interested in. (At the time of this posting, three years are available: 2008, 2009, and 2010.) These files are tab-delimited and will need to be imported into SQL Server.

Creating the HMDA Database

Because I want to create a database with a single “fact” table containing all of the LAR records, I first need to create the database and then the necessary table structure. The HMDA data is very “flat” and denormalized, so it works very well as a fact table. There are 45 fields in the text file, and column names are NOT included in the first row. The data dictionary can be found here: LAR Record Format

Since the resulting database works out to about 5GB per year, and since we’ll be importing three years of data, I’ll start by creating a 15GB database with the following T-SQL command (I am using SQL Server 2012 as my destination, so some of the syntax might differ slightly from what you are used to):

CREATE DATABASE [HMDAData]
CONTAINMENT = NONE
ON  PRIMARY
( NAME = N'HMDAData', FILENAME = N'C:\SQLData\MDF\HMDAData.mdf' , SIZE = 15GB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024MB )
LOG ON
( NAME = N'HMDAData_log', FILENAME = N'C:\SQLData\LDF\HMDAData_log.ldf' , SIZE = 8GB , MAXSIZE = 2048GB , FILEGROWTH = 10%);

Once the database is created, we can create the table to hold the LAR records. The table can be created with the following command:


CREATE TABLE [dbo].[lar_data](
    [year] [int] NOT NULL,
    [respid] [nchar](10) NOT NULL,
    [agycd] [nchar](1) NOT NULL,
    [loan_type] [int] NOT NULL,
    [property_type] [int] NOT NULL,
    [loan_purpose] [nchar](1) NULL,
    [occupancy] [nchar](1) NULL,
    [loan_amount] [nchar](5) NULL,
    [preapproval] [nchar](1) NULL,
    [action_type] [nchar](1) NULL,
    [msa_md] [nchar](5) NULL,
    [state_code] [int] NOT NULL,
    [county_code] [int] NULL,
    [tract_code] [nchar](7) NOT NULL,
    [applicant_ethnicity] [nchar](1) NULL,
    [co_applicant_ethnicity] [nchar](1) NULL,
    [applicant_race_1] [nchar](1) NULL,
    [applicant_race_2] [nchar](1) NULL,
    [applicant_race_3] [nchar](1) NULL,
    [applicant_race_4] [nchar](1) NULL,
    [applicant_race_5] [nchar](1) NULL,
    [co_applicant_race_1] [nchar](1) NULL,
    [co_applicant_race_2] [nchar](1) NULL,
    [co_applicant_race_3] [nchar](1) NULL,
    [co_applicant_race_4] [nchar](1) NULL,
    [co_applicant_race_5] [nchar](1) NULL,
    [applicant_sex] [nchar](1) NULL,
    [co_applicant_sex] [nchar](1) NULL,
    [applicant_income] [nchar](4) NULL,
    [purchaser_type] [nchar](1) NULL,
    [denial_reason_1] [nchar](1) NULL,
    [denial_reason_2] [nchar](1) NULL,
    [denial_reason_3] [nchar](1) NULL,
    [rate_spread] [nchar](5) NULL,
    [HOEPA_status] [nchar](1) NULL,
    [lien_status] [nchar](1) NULL,
    [edit_status] [nchar](1) NULL,
    [seq_number] [nchar](7) NULL,
    [population] [int] NULL,
    [minority_population_percent] [numeric](18, 0) NULL,
    [median_income] [int] NULL,
    [tract_msa_income_percent] [numeric](18, 0) NULL,
    [owner_occ_units] [int] NULL,
    [owner_occ_1_to_4_family] [int] NULL,
    [app_date_pre_2004] [nchar](1) NULL
) ON [PRIMARY];

Once the table is created, you can use either the SQL Server Import/Export Wizard or SSIS to import each of the files into the lar_data table. Since this is a straightforward load with no transformations needed, the Import/Export Wizard works just fine. Here’s an example of importing the 2010 file into the lar_data table:

[Screenshots: SQL Server Import/Export Wizard steps for importing the 2010 text file into the lar_data table]

Because the text file contains only character data, CASTs will be required for most of the columns. If an error occurs within a specific CAST, it’s best to just ignore it. We’re not trying to create a perfect database, just one that works for most conditions.


When the import for 2010 is complete, there will be approximately 16.3 million rows of data inserted into the table.

To complete the task, import the remaining years into the table. If you load the 2008, 2009, and 2010 data, you should end up with approximately 53.2 million rows.
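
If you would rather script the load than click through the wizard for each year, BULK INSERT works as well. Here is a minimal sketch; the file path and name below are placeholders for wherever you unzipped the downloaded “ALL” file, and you may need to adjust the row terminator to match the file’s line endings:

USE HMDAData;
GO

BULK INSERT dbo.lar_data
FROM 'C:\SQLData\Import\2010_lar_all.txt'   -- hypothetical path; point this at your unzipped file
WITH (
    FIELDTERMINATOR = '\t',   -- the HMDA files are tab-delimited
    ROWTERMINATOR   = '\n',   -- use '\r\n' if your file has Windows line endings
    FIRSTROW        = 1,      -- no header row in the file
    MAXERRORS       = 100000, -- skip rows that fail conversion, mirroring the "ignore errors" approach above
    TABLOCK                   -- take a bulk-update lock to speed up the load
);
GO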

You may notice that several of the columns contain codes rather than descriptive values. The meaning of each code can be found in the data dictionary linked above, but keying in the descriptions for every column is a pain. Since I have already extracted the data and created the appropriate tables, you can download the following ZIP file. It contains one .csv text file per table, named after the table it populates, with column names in the first row; these files are simple to import using the Import/Export Wizard. There is also a .sql file to create the remaining tables and insert the descriptive data:

Zip file containing dimension tables and descriptive data
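
To give you an idea of what is in the ZIP, here is a sketch of one of the dimension tables and its descriptive rows. The table and column names match what the view below expects; the code values are the standard HMDA loan type codes, with the descriptions abbreviated (the .sql file in the ZIP carries the official wording):

CREATE TABLE [dbo].[tblLoanType](
    [loan_type]   [int]           NOT NULL PRIMARY KEY,
    [description] [nvarchar](100) NOT NULL
);

INSERT INTO dbo.tblLoanType (loan_type, description)
VALUES (1, N'Conventional'),
       (2, N'FHA-insured'),
       (3, N'VA-guaranteed'),
       (4, N'FSA/RHS');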

Creating an Appropriate “State” View

Now that you have the fact table loaded and have used the files I supplied to create the dimension tables, you’ll likely want to create a subset of the data for specific analysis. Since I live in Colorado, I decided to create a view that shows only Colorado data. The view definition is below; you can modify it to isolate the data for the state you are interested in:

CREATE VIEW [dbo].[vColoradoLoans]
AS
SELECT
    lt.description AS [loan_type]
    ,lpt.description AS [property_type]
    ,lp.description AS [loan_purpose]
    ,o.description AS [occupancy_type]
    ,CAST(loan_amount AS money) * 1000 AS [loan_amount]   
    ,p.description AS [preapproval]
    ,at.description AS [action_taken]
    ,s.state_name AS [state]
    ,c.county_name AS [county]
    ,e.description AS [applicant_ethnicity]   
    ,r.description AS [applicant_race]
    ,sx.description AS [applicant_sex]
    ,CAST(applicant_income AS money) * 1000 AS [applicant_income]   
    ,pt.description AS [purchaser_type]
    ,dr.description AS [denial_reason]
    ,CASE WHEN rate_spread = 'NA' THEN 0 ELSE CAST(rate_spread AS numeric(9,2)) END AS [rate_spread] -- explicit scale; a plain numeric defaults to (18,0) and would truncate the decimals
    ,population AS [tract_population]
    ,minority_population_percent
    ,median_income
    ,owner_occ_units
FROM
    lar_data ld
JOIN
    tblLoanType lt
ON
    ld.loan_type = lt.loan_type
JOIN
    tblPropertyType lpt
ON
    ld.property_type = lpt.property_type
JOIN
    tblLoanPurpose lp
ON
    ld.loan_purpose = lp.loan_purpose
JOIN
    tblOwnerOccupancy o
ON
    ld.occupancy = o.owner_occupancy
JOIN
    tblPreapproval p
ON
    ld.preapproval = p.preapproval
JOIN
    tblAction at
ON
    ld.action_type = at.action_taken
JOIN
    tblState s
ON
    ld.state_code = s.state_code
JOIN
    tblCounty c
ON
    ld.county_code = c.county_code AND s.state_name=c.state_name
JOIN
    tblEthnicity e
ON
    ld.applicant_ethnicity = e.ethnicity
JOIN
    tblRace r
ON
    ld.applicant_race_1 = r.race
JOIN
    tblSex sx
ON
    ld.applicant_sex = sx.sex
JOIN
    tblPurchaserType pt
ON
    ld.purchaser_type = pt.purchaser_type
JOIN
    tblDenialReason dr
ON
    ld.denial_reason_1 = dr.reason
WHERE
    ld.state_code = 8 -- Colorado (state_code is stored as an int)
AND
    ld.property_type = 1 -- Single Family Homes
AND
    ld.applicant_income <> 'NA' -- remove invalid income reports
;
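
With the view in place, ad-hoc analysis over the flattened Colorado data is straightforward. For example, a quick sanity check of application volume and average loan amount by county might look like this (just a usage sketch against the view above):

SELECT TOP (10)
    county,
    COUNT(*)              AS applications,
    AVG(loan_amount)      AS avg_loan_amount,
    AVG(applicant_income) AS avg_applicant_income
FROM dbo.vColoradoLoans
GROUP BY county
ORDER BY applications DESC;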

Conclusion

Once you have the view in place, you have a large, flexible database built on a real-world use case that you can use for demos, performance tuning work, statistical analysis, and more. Of course, you’ll want to add your own indexes and possibly partitions depending on your use case.
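
As a starting point (the column choices here are just a guess at a typical workload, so treat this as a sketch), a nonclustered index covering the view’s filter helps the state-level queries, and since the destination is SQL Server 2012 you could instead try the new nonclustered columnstore index over the analysis columns. Keep in mind that in 2012 a columnstore index makes the table read-only until it is dropped or disabled, so add it only after all years have been loaded:

-- Rowstore index targeting the view's WHERE clause (hypothetical column choices)
CREATE NONCLUSTERED INDEX IX_lar_data_state_property
    ON dbo.lar_data (state_code, property_type);

-- Or a SQL Server 2012 nonclustered columnstore index over the analysis columns
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_lar_data
    ON dbo.lar_data ([year], state_code, county_code, loan_type, property_type,
                     loan_purpose, action_type, loan_amount, applicant_income);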


I know that I’ve often wanted such a database when presenting or demonstrating specific functionality within SQL Server, so I hope it proves useful.
