Introducing Blaze - Practice in which we solve a real problem
This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project
We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed this dataset is around 10GB on disk and so we don’t want to load it up into memory with a modern commercial notebook.
Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.
In this post we’re going to use the interactive Table
object, which
wraps up a dataset and calls compute whenever we ask for something to be
printed to the screen. It retains the abstract delayed-evaluation nature
of Blaze with the interactive feel of NumPy and Pandas.
from blaze import CSV, Table
csv = CSV('hmda_lar-2012.csv') # Open the CSV file
t = Table(csv) # Interact with CSV file using interactive Table object
t
action_taken | action_taken_name | agency_code | agency_abbr | agency_name | applicant_ethnicity | applicant_ethnicity_name | applicant_income_000s | applicant_race_1 | applicant_race_2 | applicant_race_3 | applicant_race_4 | applicant_race_5 | applicant_race_name_1 | applicant_race_name_2 | applicant_race_name_3 | applicant_race_name_4 | applicant_race_name_5 | applicant_sex | applicant_sex_name | application_date_indicator | as_of_year | census_tract_number | co_applicant_ethnicity | co_applicant_ethnicity_name | co_applicant_race_1 | co_applicant_race_2 | co_applicant_race_3 | co_applicant_race_4 | co_applicant_race_5 | co_applicant_race_name_1 | co_applicant_race_name_2 | co_applicant_race_name_3 | co_applicant_race_name_4 | co_applicant_race_name_5 | co_applicant_sex | co_applicant_sex_name | county_code | county_name | denial_reason_1 | denial_reason_2 | denial_reason_3 | denial_reason_name_1 | denial_reason_name_2 | denial_reason_name_3 | edit_status | edit_status_name | hoepa_status | hoepa_status_name | lien_status | lien_status_name | loan_purpose | loan_purpose_name | loan_type | loan_type_name | msamd | msamd_name | owner_occupancy | owner_occupancy_name | preapproval | preapproval_name | property_type | property_type_name | purchaser_type | purchaser_type_name | respondent_id | sequence_number | state_code | state_abbr | state_name | hud_median_family_income | loan_amount_000s | number_of_1_to_4_family_units | number_of_owner_occupied_units | minority_population | population | rate_spread | tract_to_msamd_income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Loan originated | 7 | HUD | Department of Housing and Urban Development | 2 | Not Hispanic or Latino | 173 | 5 | White | 1 | Male | 0 | 2012 | 8803.06 | 2 | Not Hispanic or Latino | 5 | White | 2 | Female | 197 | Will County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 1 | Conventional | 16974 | Chicago, Joliet, Naperville - IL | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 6 | Commercial bank, savings bank or savings assoc... | 36-4176531 | 2712 | 17 | IL | Illinois | 77300 | 264 | 2153 | 1971 | 45.820000 | 7894 | NaN | 170.679993 | ||||||||||||||||||||||
1 | 1 | Loan originated | 5 | NCUA | National Credit Union Administration | 2 | Not Hispanic or Latino | 83 | 5 | White | 1 | Male | 0 | 2012 | 2915.00 | 5 | No co-applicant | 8 | No co-applicant | 5 | No co-applicant | 111 | Midland County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 1 | Conventional | NaN | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 0 | Loan was not originated or was not sold in cal... | 0000060137 | 328 | 26 | MI | Michigan | 52100 | 116 | 1662 | 1271 | 3.340000 | 4315 | NaN | 102.760002 | |||||||||||||||||||||||
2 | 6 | Loan purchased by the institution | 9 | CFPB | Consumer Financial Protection Bureau | 4 | Not applicable | 70 | 7 | Not applicable | 4 | Not applicable | 2 | 2012 | 212.01 | 4 | Not applicable | 7 | Not applicable | 4 | Not applicable | 7 | Benton County | None | 6 | Quality edit failure only | 2 | Not a HOEPA loan | 4 | Not applicable | 3 | Refinancing | 1 | Conventional | 22220 | Fayetteville, Springdale, Rogers - AR, MO | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 8 | Affiliate institution | 0000476810 | 43575 | 5 | AR | Arkansas | 58200 | 159 | 1194 | 708 | 21.870001 | 4239 | NaN | 127.639999 | |||||||||||||||||||||
3 | 6 | Loan purchased by the institution | 9 | CFPB | Consumer Financial Protection Bureau | 2 | Not Hispanic or Latino | 108 | 5 | White | 2 | Female | 2 | 2012 | 407.06 | 5 | No co-applicant | 8 | No co-applicant | 5 | No co-applicant | 123 | Ramsey County | None | NaN | 2 | Not a HOEPA loan | 4 | Not applicable | 3 | Refinancing | 1 | Conventional | 33460 | Minneapolis, St. Paul, Bloomington - MN, WI | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 1 | Fannie Mae (FNMA) | 0000451965 | 2374657 | 27 | MN | Minnesota | 83900 | 100 | 1927 | 1871 | 13.680000 | 4832 | NaN | 137.669998 | ||||||||||||||||||||||
4 | 1 | Loan originated | 3 | FDIC | Federal Deposit Insurance Corporation | 2 | Not Hispanic or Latino | NaN | 5 | White | 1 | Male | 0 | 2012 | 104.00 | 2 | Not Hispanic or Latino | 5 | White | 2 | Female | 3 | Allen County | None | 6 | Quality edit failure only | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 2 | Home improvement | 1 | Conventional | 23060 | Fort Wayne - IN | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 0 | Loan was not originated or was not sold in cal... | 0000013801 | 11 | 18 | IN | Indiana | 63800 | 267 | 1309 | 1160 | 4.680000 | 3612 | NaN | 139.100006 | |||||||||||||||||||||
5 | 1 | Loan originated | 7 | HUD | Department of Housing and Urban Development | 2 | Not Hispanic or Latino | 144 | 5 | White | 1 | Male | 0 | 2012 | 8057.01 | 2 | Not Hispanic or Latino | 5 | White | 1 | Male | 31 | Cook County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 1 | Conventional | 16974 | Chicago, Joliet, Naperville - IL | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 3 | Freddie Mac (FHLMC) | 36-4327855 | 22594 | 17 | IL | Illinois | 77300 | 260 | 1390 | 1700 | 6.440000 | 5074 | NaN | 140.550003 | ||||||||||||||||||||||
6 | 1 | Loan originated | 7 | HUD | Department of Housing and Urban Development | 2 | Not Hispanic or Latino | 51 | 3 | Black or African American | 1 | Male | 0 | 2012 | 17.00 | 5 | No co-applicant | 8 | No co-applicant | 5 | No co-applicant | 19 | Calcasieu Parish | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 1 | Home purchase | 1 | Conventional | 29340 | Lake Charles - LA | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 1 | Fannie Mae (FNMA) | 7056000000 | 34177 | 22 | LA | Louisiana | 62400 | 115 | 3500 | 2797 | 29.260000 | 8745 | NaN | 86.739998 | ||||||||||||||||||||||
7 | 1 | Loan originated | 7 | HUD | Department of Housing and Urban Development | 2 | Not Hispanic or Latino | 162 | 5 | White | 1 | Male | 0 | 2012 | 2.01 | 2 | Not Hispanic or Latino | 5 | White | 2 | Female | 49 | Grand County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 2 | FHA-insured | NaN | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 6 | Commercial bank, savings bank or savings assoc... | 87-0623581 | 2141 | 8 | CO | Colorado | 61000 | 283 | 5706 | 1724 | 9.650000 | 4817 | NaN | 128.559998 | |||||||||||||||||||||||
8 | 1 | Loan originated | 3 | FDIC | Federal Deposit Insurance Corporation | 2 | Not Hispanic or Latino | 32 | 5 | White | 2 | Female | 0 | 2012 | 103.04 | 5 | No co-applicant | 8 | No co-applicant | 5 | No co-applicant | 3 | Allen County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 1 | Conventional | 23060 | Fort Wayne - IN | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 0 | Loan was not originated or was not sold in cal... | 0000013801 | 10 | 18 | IN | Indiana | 63800 | 40 | 2384 | 2210 | 6.640000 | 6601 | 3.33 | 149.690002 | ||||||||||||||||||||||
9 | 1 | Loan originated | 9 | CFPB | Consumer Financial Protection Bureau | 2 | Not Hispanic or Latino | 38 | 5 | White | 1 | Male | 0 | 2012 | 9608.00 | 2 | Not Hispanic or Latino | 5 | White | 2 | Female | 41 | Talbot County | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 3 | Refinancing | 1 | Conventional | NaN | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 3 | Freddie Mac (FHLMC) | 0000504713 | 18459 | 24 | MD | Maryland | 72600 | 108 | 1311 | 785 | 6.040000 | 1920 | NaN | 77.279999 | |||||||||||||||||||||||
10 | 1 | Loan originated | 7 | HUD | Department of Housing and Urban Development | 2 | Not Hispanic or Latino | 47 | 5 | White | 1 | Male | 0 | 2012 | 22.04 | 1 | Hispanic or Latino | 5 | White | 2 | Female | 19 | Calcasieu Parish | None | NaN | 2 | Not a HOEPA loan | 1 | Secured by a first lien | 1 | Home purchase | 3 | VA-guaranteed | 29340 | Lake Charles - LA | 1 | Owner-occupied as a principal dwelling | 3 | Not applicable | 1 | One-to-four family dwelling (other than manufa... | 6 | Commercial bank, savings bank or savings assoc... | 7056000000 | 33791 | 22 | LA | Louisiana | 62400 | 158 | 1854 | 1463 | 12.410000 | 4955 | NaN | 112.010002 |
Reducing the Dataset
That’s a lot of columns, most of which are redundant or don’t carry much information. Let’s clean up our dataset a bit by selecting a smaller subset of columns. Already this quick investigation improves our comprehension and reduces the size of the dataset.
columns = ['action_taken_name', 'agency_abbr', 'applicant_ethnicity_name',
'applicant_race_name_1', 'applicant_sex_name', 'county_name',
'loan_purpose_name', 'state_abbr']
t = t[columns]
t
action_taken_name | agency_abbr | applicant_ethnicity_name | applicant_race_name_1 | applicant_sex_name | county_name | loan_purpose_name | state_abbr | |
---|---|---|---|---|---|---|---|---|
0 | Loan originated | HUD | Not Hispanic or Latino | White | Male | Will County | Refinancing | IL |
1 | Loan originated | NCUA | Not Hispanic or Latino | White | Male | Midland County | Refinancing | MI |
2 | Loan purchased by the institution | CFPB | Not applicable | Not applicable | Not applicable | Benton County | Refinancing | AR |
3 | Loan purchased by the institution | CFPB | Not Hispanic or Latino | White | Female | Ramsey County | Refinancing | MN |
4 | Loan originated | FDIC | Not Hispanic or Latino | White | Male | Allen County | Home improvement | IN |
5 | Loan originated | HUD | Not Hispanic or Latino | White | Male | Cook County | Refinancing | IL |
6 | Loan originated | HUD | Not Hispanic or Latino | Black or African American | Male | Calcasieu Parish | Home purchase | LA |
7 | Loan originated | HUD | Not Hispanic or Latino | White | Male | Grand County | Refinancing | CO |
8 | Loan originated | FDIC | Not Hispanic or Latino | White | Female | Allen County | Refinancing | IN |
9 | Loan originated | CFPB | Not Hispanic or Latino | White | Male | Talbot County | Refinancing | MD |
10 | Loan originated | HUD | Not Hispanic or Latino | White | Male | Calcasieu Parish | Home purchase | LA |
More Complex Computation
Now that we can more clearly see what’s going on let’s ask a simple question:
How many times does each action occur in the state of New York?
t2 = t[t.state_abbr == 'NY']
t2
%%time
from blaze import into, by
from pandas import DataFrame
# Group on action_taken_name, count each group
into(DataFrame, by(t2.action_taken_name,
t2.action_taken_name.count()).sort('action_taken_name_count',
ascending=False))
CPU times: user 13min 50s, sys: 5.23 s, total: 13min 55s
Wall time: 13min 55s
action_taken_name | action_taken_name_count | |
---|---|---|
0 | Loan originated | 285106 |
1 | Application denied by financial institution | 109423 |
2 | Loan purchased by the institution | 75241 |
3 | Application withdrawn by applicant | 50563 |
4 | Application approved but not accepted | 25632 |
5 | File closed for incompleteness | 20585 |
6 | Preapproval request approved but not accepted | 259 |
7 | Preapproval request denied by financial instit... | 171 |
Great! Sadly, because it was reading through the CSV file and because it was using a Pure Python backend, that computation took fourteen minutes.
Moving to a Faster Backend
By default computations on CSV files use the streaming Python backend.
While robust for large files and decently fast, this backend parses the
CSV file each time we do a full-data operation, and this parsing is very
slow. Let’s move our reduced dataset to a more efficient and widely
accessible backend, sqlite
.
from blaze import SQL
sql = SQL('sqlite:///hmda.db', 'data', schema=t.schema) # A SQLite database
into(sql, t) # Migrate data
Yup, a little sqlite
database just arrived
$ ls -lh hmda*
-rw-r--r-- 1 mrocklin mrocklin 2.7G Aug 25 13:38 hmda.db
-rw-r--r-- 1 mrocklin mrocklin 12G Jul 10 12:15 hmda_lar-2012.csv
Working with SQL
Now that we’ve migrated our csv file into a sqlite database let’s redefine t
to use the SQL backend and repeat our computation.
# t = Table(csv)
t = Table(sql)
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
t2.action_taken_name.count()).sort('action_taken_name_count',
ascending=False))
CPU times: user 5.55 s, sys: 1.64 s, total: 7.19 s
Wall time: 7.46 s
action_taken_name | action_taken_name_count | |
---|---|---|
0 | Loan originated | 285106 |
1 | Application denied by financial institution | 109423 |
2 | Loan purchased by the institution | 75241 |
3 | Application withdrawn by applicant | 50563 |
4 | Application approved but not accepted | 25632 |
5 | File closed for incompleteness | 20585 |
6 | Preapproval request approved but not accepted | 259 |
7 | Preapproval request denied by financial instit... | 171 |
We’re about to repeat this same computation many times. We’ll omit the table result from here on out. It will always be the same.
Create an index on state name
This was much faster, largely because the data was stored in a binary
format. We can improve the query speed significantly by placing an index
on the state_abbr
field. This will cause the selection
t[t.state_abbr == 'NY']
to return more quickly, eliminating the need
for an expensive full table scan.
from blaze import create_index
create_index(sql, 'state_abbr', name='state_abbr_index')
Now we can ask this same query for many states at interactive timescales.
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
t2.action_taken_name.count()).sort('action_taken_name_count',
ascending=False))
CPU times: user 1.74 s, sys: 430 ms, total: 2.17 s
Wall time: 2.17 s
Comparing against MongoDB
Because moving between computational backends is now easy, we can quickly compare performance between backends. SQLite and MongoDB are similarly available technologies, each being trivial to set up on a personal computer. However they’re also fairly different technologies with varying communities.
Which performs faster for our sample computation?
import pymongo
db = pymongo.MongoClient().db
into(db.hmda, sql) # Migrate to Mongo DB from SQLite database
# t = Table(csv)
# t = Table(sql)
t = Table(db.hmda)
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
t2.action_taken_name.count()).sort('action_taken_name_count',
ascending=False))
CPU times: user 4.05 ms, sys: 701 µs, total: 4.76 ms
Wall time: 7.61 s
Almost exactly the same time as for SQLite.
We just did a complex thing easily. If we weren’t familiar with MongoDB we would need to learn how to set up a database, how to migrate data from SQL to MongoDB, and finally how to perform queries. Blaze eased that process considerably.
Create an index on state name
Again we create an index on the state name and observe the performance difference.
create_index(db.hmda, 'state_abbr', name='state_abbr_index')
t2 = t[t.state_abbr == 'NY']
%%time
into(DataFrame, by(t2.action_taken_name,
t2.action_taken_name.count()).sort('action_taken_name_count',
ascending=False))
CPU times: user 4.13 ms, sys: 844 µs, total: 4.97 ms
Wall time: 954 ms
Here the indexed MongoDB system seems about twice as fast as the comparably indexed SQLite system.
Results
Disclaimer: These results come from a single run. No attempt was made to optimize the backend configuration, nor was any consideration taken into account about databases being warmed up. These numbers are far from conclusive, and are merely here to present the ease with which intuitive-building experiments are easy with Blaze and the value of choosing the right backend.
Backend | Duration |
---|---|
NumPy/Pandas | need bigger machine |
Python/CSV | 14 minutes |
SQLite | 7 seconds |
MongoDB | 7 seconds |
SQLite (indexed) | 2 seconds |
MongoDB (indexed) | 1 second |
Conclusion
Blaze enables you to investigate, transform, and migrate large data intuitively. You can choose the right technology for your application without having to worry about learning a new syntax.
We hope that by lowering this barrier more users will use the right tool for the job.
More Information
- Documentation: blaze.pydata.org/
- Source: github.com/ContinuumIO/blaze/
-
Install with Anaconda:
conda install blaze