Jeff Butler
Director, Research Databases
IRS, Research, Analysis & Statistics
November 14, 2012
Presentation Agenda
[Diagram: data types, data sources, and business processes. Data labels: forms, schedules, worksheets, attachments, images, correspondence, transactions, phone calls, notices, transcripts (structured and unstructured). Source labels: taxpayers, employers, preparers, banks, brokers, non-profits, interagency, Fed/State, treaty partners, third-party intermediaries. Business process labels: tax returns, information returns, customer accounts, case management, customer service, enforcement, other.]
Key Features
Fast integration of data from a variety of sources in a format conducive to analysis
    Does the data model allow for fast load times and high performance for massively large data?
Comprehensive set of searchable metadata
    Are data elements easy to find and understand?
Dynamic storage that supports user-defined data structures
    Do flexible storage management protocols support user-created data?
High-performance database designed for analytics
    Is the right database available for fast analytics on massively large volumes?
Tools for queries, advanced analytics, and visualization
    Are there tools in the right place for the right job?
Systems to support massively large computing tasks
    Does the systems infrastructure support resource-intensive computation?
[Figure: two charts, 2007 to 2013, labeled "Available" and "Used"]
Continued shift to higher frequency data releases for research and analytics
In 2005, it took over 4 months to load a full year of tax return data vs. 10 hours today
Challenge: What are the limits of real-time replication in heterogeneous environments?
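As a rough sense of scale for the load-time figures above, the improvement can be worked out directly. This is only a back-of-the-envelope sketch: the 30-day month is an assumption not stated in the slides, and because the 2005 figure is given as "over 4 months", the result is a lower bound.

```python
# Back-of-the-envelope comparison of the two load times quoted above.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: the slides do not define a month


def speedup(old_hours: float, new_hours: float) -> float:
    """Return how many times faster the new load is than the old one."""
    return old_hours / new_hours


# "over 4 months" in 2005, treated here as exactly 4 months (lower bound)
load_hours_2005 = 4 * DAYS_PER_MONTH * HOURS_PER_DAY
# "10 hours today"
load_hours_today = 10

ratio = speedup(load_hours_2005, load_hours_today)
print(f"2005 load: at least {load_hours_2005} hours")
print(f"Speedup: at least {ratio:.0f}x")
```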
[Figure: two charts, 2004 to 2012]
Usage driven by new data, increased literacy in SQL, new tools, and web services
New users in Treasury, Joint Committee on Taxation (Congress), universities
Challenge: What is the right mix of analytic and operational use?
Source Systems
Servers
2005-2009: 2x CPU 192GB RAM
2009-2012: 4x CPU 512GB RAM
2012-2014: 8x CPU 1024GB RAM
Storage
2005-2009: 256MB drives, 1-2Gb/s I/O
2009-2012: 1GB drives, 2-4Gb/s I/O
2012-2014: 3GB drives, 8Gb/s I/O
Networking
SAN, bus adapter, and backplane speeds are critical for I/O-bound tasks
Most high-volume queries are I/O bound
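To illustrate why I/O speed dominates high-volume queries, the sketch below estimates the minimum time to read a dataset over a single link at the I/O rates listed above. It rests on assumptions not found in the slides: "Gb/s" is read as gigabits per second (10^9 bits), the 1 TB dataset size is hypothetical, and protocol overhead, caching, and parallel paths are ignored.

```python
# Lower-bound scan time for an I/O-bound query over a single link.
BITS_PER_BYTE = 8
BITS_PER_GIGABIT = 10**9  # assumption: "Gb/s" means 10^9 bits per second


def scan_seconds(data_bytes: int, link_gbps: float) -> float:
    """Seconds needed to move data_bytes over a link of link_gbps gigabits/s."""
    return (data_bytes * BITS_PER_BYTE) / (link_gbps * BITS_PER_GIGABIT)


# Hypothetical dataset size, chosen only for illustration
dataset_bytes = 10**12  # 1 TB

# I/O rates taken from the Storage list above
for label, gbps in [("1Gb/s", 1), ("2Gb/s", 2), ("4Gb/s", 4), ("8Gb/s", 8)]:
    minutes = scan_seconds(dataset_bytes, gbps) / 60
    print(f"{label}: about {minutes:.0f} minutes to scan 1 TB")
```

Under these assumptions the scan time falls in direct proportion to link speed, which is why a faster CPU alone does little for a query that is waiting on storage.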
Challenges
Jeff Butler
Director, Research Databases
Internal Revenue Service
Research, Analysis, and Statistics
jeff.butler@irs.gov