
An IBM-TechAmerica Event

TECHAMERICA BIG DATA COMMISSION

DEMYSTIFYING BIG DATA Washington, DC


November 14, 2012

Big Data and Analytics at the IRS

Jeff Butler
Director, Research Databases
IRS, Research, Analysis & Statistics
November 14, 2012

Presentation Agenda

IRS business environment
    Business processes, enterprise data, and systems
Research and analysis at the IRS
    Examples of analytics
    Methods and techniques
    Skills and system requirements
Big Data environment for IRS Research
    Volume, variety, velocity
    Systems, architecture, tools
    Information quality strategy
Best practices and lessons learned
    Big Data challenges
    Five myths about Big Data and Analytics



Big Data and Analytics at the IRS
IRS Business Environment



IRS Business Environment

Business Processes Data Environment Systems & Operations

Tax Returns
• 234 million tax returns filed
• 1.8 billion third-party information returns received

Accounts Management
• $2.4 trillion in gross receipts
• 122 million refunds totaling $415 billion

Customer Service
• 319 million visits to IRS website
• 83 million toll-free telephone calls

Enforcement
• 223 million letters or notices sent to taxpayers
• $116 billion in accounts receivable



IRS Business Environment

Business Processes Data Environment Systems & Operations

Category labels on the slide: Tax Returns, Information Returns, Customer Accounts, Case Management, Third Party

Types of Data
• Forms
• Schedules
• Worksheets
• Attachments
• Images
• Correspondence
• Transactions
• Phone Calls
• Notices
• Transcripts
• Structured
• Unstructured

Sources of Data
• Taxpayers
• Employers
• Preparers
• Banks
• Brokers
• Non-Profits
• Interagency
• Fed/State
• Treaty Partners
• Intermediaries
• Service
• Enforcement



IRS Business Environment

Business Processes Data Environment Systems & Operations

Business Processes
Tax Returns, Accounts Management, Customer Service, Enforcement, Other

Data Systems and Applications

Tax Processing
• Return Submission
• Refunds
• Math Errors
• Issue Resolution
• Settlements
• State Exchange

Case Management
• Examination
• Appeals
• Collection
• Underreporter
• Criminal Investigation

Customer Accounts
• Transactions
• Notices
• Correspondence
• Telephone
• Walk-in Centers
• Web Service



IRS Business Environment

Business Processes Data Environment Systems & Operations

• Over 450 separate systems or applications in the IRS
• Data are stored in different formats (flat files, XML, databases, VSAM) and on multiple platforms
• Separate authorization policies for system access
• Most systems designed for operational processing, not research or analytics

Case for Analytic Data Environment

• Cost of compiling data from multiple enterprise systems is too high
• Enterprise tools are not suited for advanced analytics
• Operational data systems are not designed to isolate resource-intensive computation for research and analysis
• Different skill sets are needed for analytics



Big Data and Analytics at the IRS
IRS Research Environment



Big Data and Analytics at the IRS
Research & Analytics in the IRS

Taxpayer Behavior

• Failure to file or remit payment
• Abusive tax shelters
• ID Theft
• Return preparer non-compliance
• Misreporting of income and deductions
• Refund fraud
• Off-shore transactions

Examples of Analytics

• Predict patterns of filing and payment compliance
• Estimate U.S. tax gap
• Measure taxpayer burden
• ID fraud and ID Theft
• Simulate impact of legislative changes on taxpayer behavior
• Optimize case management inventories
• Analyze taxpayer networks and their structural relationships
• Develop workload allocation models



Big Data and Analytics at the IRS
Research & Analytics in the IRS

Methods and Techniques

• Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)
• Social network analysis and graph theory
• Machine learning (neural networks, SVMs, genetic algorithms)
• Time series analysis
• Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)
• Simulation (Monte Carlo, MCMC, agent-based modeling)
• Decision trees (CART, CHAID, C5, hybrids)
• Bayes rules and other classifiers
• Sampling and survey estimation

Education and Skills

• Economics
• Statistics
• Mathematics
• Computer science
• Operations research
• Physics
• Behavioral sciences
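As one illustration of the regression-based methods named on this slide, here is a minimal sketch of logistic regression fitted by batch gradient descent. The toy data, the single feature, and the names `sigmoid` and `fit_logistic` are assumptions made for illustration only; nothing here reflects actual IRS models or data.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Made-up toy data: x is a hypothetical risk feature, y = 1 marks a flagged case.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 1.0 + b)   # predicted probability for a low feature value
p_high = sigmoid(w * 4.0 + b)  # predicted probability for a high feature value
print(f"w={w:.2f}, b={b:.2f}, p(x=1.0)={p_low:.2f}, p(x=4.0)={p_high:.2f}")
```

In practice a statistical package such as the SAS or R tools mentioned elsewhere in the deck would likely be used instead of a hand-written loop; the sketch only shows what the fitted model computes.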



Big Data and Analytics at the IRS
Research & Analytics in the IRS

Data and Systems Requirements (each paired with its Solutions Checklist question)

• Fast integration of data from a variety of sources in a format conducive to analysis
    Checklist: Does the data model allow for fast load times and high performance for massively large data?
• Comprehensive set of searchable metadata
    Checklist: Are data elements easy to find and understand?
• Dynamic storage that supports user-defined data structures
    Checklist: Do flexible storage management protocols support user-created data?
• High-performance database designed for analytics
    Checklist: Is the right database available for fast analytics on massively large volumes?
• Tools for queries, advanced analytics, and visualization
    Checklist: Are there tools in the right place for the right job?
• Systems to support massively large computing tasks
    Checklist: Does the systems infrastructure support resource-intensive computation?



Big Data and Analytics at the IRS
Big Data Environment



IRS Research: Big Data Environment
Compliance Data Warehouse (CDW)

Overview and Capabilities

• Data from over 30 sources, including tax returns, customer accounts, case management systems, and third parties
• Computing environment for resource-intensive processing and user-defined data structures
• Metadata for over 32,500 columns that includes definitions, lookup tables, cross-references, and other artifacts
• Tools for a variety of analytics, including SAS, SQL, R, Stata, Hyperion, and ArcGIS
• Web services for data profiling, reports, SSN masking, and password management
• Training and support to ensure efficient use of systems
• Nearly 1,000 users from IRS, Treasury, Congress, and universities



IRS Research: Big Data Environment
Compliance Data Warehouse (CDW)

Key Features

• Number of key data sources: 32
• Number of database tables: 1,985
• Number of columns: 46,150
• Number of columns with searchable metadata: 32,510
• Number of metadata-column attributes: 715,220
• Total database storage: 460TB
• Total disk storage: 1.2PB
• Number of user accounts: 920
• Average daily concurrent connections: 840
• Average daily database queries: 6,500
• Average daily database queries from the website: 1,200
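A few ratios can be derived from the Key Features figures. The sketch below is plain arithmetic on the numbers listed on this slide; the variable names are illustrative, and the derived ratios are not stated in the presentation itself.

```python
# Figures copied from the Key Features slide.
total_columns = 46_150
columns_with_metadata = 32_510
metadata_attributes = 715_220
daily_queries = 6_500
daily_web_queries = 1_200

# Derived ratios (not stated on the slide).
metadata_coverage = columns_with_metadata / total_columns
attributes_per_column = metadata_attributes / columns_with_metadata
web_query_share = daily_web_queries / daily_queries

print(f"Columns with searchable metadata: {metadata_coverage:.1%}")
print(f"Metadata attributes per documented column: {attributes_per_column:.1f}")
print(f"Share of daily queries from the website: {web_query_share:.1%}")
```

On these figures, roughly 70% of columns carry searchable metadata, each documented column has 22 attributes on average, and a little under a fifth of daily queries arrive through the website.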



IRS Compliance Data Warehouse
Data Availability (Volume)
[Figure: two charts covering 2007-2013: Storage Volume, in Terabytes, and Number of User Database Tables; legend: Available, Used]

• CDW is the largest database in the IRS
• Over 1,000% increase in data and storage in the past 8 years
• Challenge: Is data growing faster than the resources needed to support it?



IRS Compliance Data Warehouse
Data Timeliness (Velocity)
[Figure: two charts: Frequency of Data Release (2004-2012) and Extract-to-Load Latency, in Days (2005-2011)]

• Continued shift to higher frequency data releases for research and analytics
• In 2005, it took over 4 months to load a full year of tax return data vs. 10 hours today
• Challenge: What are the limits of real-time replication in heterogeneous environments?



IRS Compliance Data Warehouse
System and Database Usage
[Figure: two charts covering 2004-2012: Average Daily Connections and Average Daily Database Queries; legend: Accounts, Daily Connections, Database Queries, Web Queries]

• Usage driven by new data, increased literacy in SQL, new tools, and web services
• New users in Treasury, Joint Committee on Taxation (Congress), universities
• Challenge: What is the right mix of analytic and operational use?



IRS Compliance Data Warehouse
Metadata Management
[Figure: Number of Columns with Metadata, 2005-2013; legend: Columns, Metadata]

Metadata Matters

• One of the most important ingredients for data warehousing success
• Touches every part of the data supply chain
• Informs and guides decision making
• User satisfaction is highly correlated with robust, accessible metadata
• New frontier: Real-time data profiling at the metadata layer

• CDW has the largest structured metadata repository in the IRS
• In 2012, more than 32,500 columns each with over 20 separate metadata attributes
• Challenge: How to minimize the lag time between data and metadata releases?
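To make "data profiling at the metadata layer" concrete, the following is a minimal sketch of a column profile: summary statistics computed from the data and stored alongside a column's descriptive metadata. The function name `profile_column`, the sample rows, and the column names are hypothetical; the slide does not describe how CDW implements profiling.

```python
def profile_column(rows, column):
    """Summarize one column: row count, nulls, distinct values, min and max."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "column": column,
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Made-up sample rows; the column names are hypothetical.
rows = [
    {"tax_year": 2010, "filing_status": "S"},
    {"tax_year": 2011, "filing_status": "MFJ"},
    {"tax_year": 2011, "filing_status": None},
]

profile = [profile_column(rows, c) for c in ("tax_year", "filing_status")]
for entry in profile:
    print(entry)
```

Refreshing such a profile whenever new data is loaded is one way to shorten the lag between data and metadata releases raised in the challenge above.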



IRS Compliance Data Warehouse
Data Supply Chain

Source Systems: Database, Flat File, XML, VSAM
Processing steps: Extract, Staging, Transform, Load, Validate, Roll-Ups
Data warehouse: DW
Use: Query, Analysis, Reporting

Metadata along the chain: Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata
Central Metadata Repository (Web Accessible)
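The supply chain on this slide pairs each processing step with its own metadata. A minimal sketch of that idea follows: each stage of a small extract, transform, validate, load pipeline appends an entry to a list standing in for the central metadata repository. The function names, the sample data, the validation rule, and the ordering of validation before load are all assumptions for illustration, not a description of the CDW pipeline.

```python
import csv
import io
from datetime import datetime, timezone

metadata_repository = []  # stand-in for a central metadata repository

def record(stage, **details):
    """Append one metadata entry describing what a stage did."""
    entry = {"stage": stage, "at": datetime.now(timezone.utc).isoformat()}
    entry.update(details)
    metadata_repository.append(entry)

def extract(flat_file_text):
    """Read delimited text (a flat-file source) into a list of dicts."""
    rows = list(csv.DictReader(io.StringIO(flat_file_text)))
    record("extract", source="flat file", rows=len(rows))
    return rows

def transform(rows):
    """Convert the amount field from text to a number."""
    out = [{"id": r["id"], "amount": float(r["amount"])} for r in rows]
    record("transform", columns=sorted(out[0]) if out else [])
    return out

def validate(rows):
    """Keep rows that pass a hypothetical rule: amount must be non-negative."""
    good = [r for r in rows if r["amount"] >= 0]
    record("validate", passed=len(good), rejected=len(rows) - len(good))
    return good

def load(rows, warehouse):
    """Append validated rows to the in-memory stand-in for the warehouse."""
    warehouse.extend(rows)
    record("load", rows=len(rows))

warehouse = []
sample = "id,amount\n1,100.50\n2,-5\n3,20\n"  # made-up flat-file content
load(validate(transform(extract(sample))), warehouse)

for entry in metadata_repository:
    print(entry["stage"], {k: v for k, v in entry.items() if k not in ("stage", "at")})
```

Because every stage writes to the same repository, a user can trace how many rows entered, how many were rejected, and what was loaded without inspecting the pipeline code.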



IRS Compliance Data Warehouse
System and Data Accessibility

Database Servers (Sybase IQ, Oracle, SQL Server)
Shared Storage (>1PB) (DB, Backup, Staging, User)
Application/Web Servers (SAS, R, Hyperion)

IRS Network

SAS, R, SQL, ODBC/JDBC, Hyperion, ArcGIS



IRS Compliance Data Warehouse
Systems Infrastructure

• Servers
    - 2005-2009: 2x CPU, 192GB RAM
    - 2009-2012: 4x CPU, 512GB RAM
    - 2012-2014: 8x CPU, 1024GB RAM
• Storage
    - 2005-2009: 256MB drives, 1-2Gb/s I/O
    - 2009-2012: 1GB drives, 2-4Gb/s I/O
    - 2012-2014: 3GB drives, 8Gb/s I/O
• Networking
    - SAN, bus adapter, and backplane speeds are critical for I/O-bound tasks
    - Most high-volume queries are I/O bound

Strategy for High Performance

• Continue to leverage Moore's and Kryder's laws
• Find opportunities to improve network throughput for I/O-bound tasks
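A back-of-envelope calculation shows why I/O rates dominate at this scale. The sketch below estimates how long it would take to move the 460TB of database storage quoted on the Key Features slide once over a single link at each I/O rate listed above. It assumes decimal units, a single link running at full line rate, and no compression or parallelism, so real systems will differ; the function name `scan_hours` is illustrative.

```python
def scan_hours(data_terabytes, link_gigabits_per_second):
    """Hours to move the data once over a single link at full line rate."""
    bits = data_terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (link_gigabits_per_second * 1e9)
    return seconds / 3600

database_tb = 460  # total database storage from the Key Features slide

for gbps in (2, 4, 8):
    print(f"{gbps} Gb/s: {scan_hours(database_tb, gbps):.0f} hours")
```

Even at 8Gb/s a single pass takes more than five days under these assumptions, which is consistent with the slide's emphasis on network throughput and on keeping computation close to the data.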



IRS Compliance Data Warehouse
Web Services: Metadata and Data Profiling



Big Data and Analytics at the IRS
Best Practices and Lessons Learned



IRS Research: Big Data Environment
Best Practices and Lessons Learned

Strategies for Big Data and Analytics

• Build multi-disciplinary teams that combine analytic skills (statistics, machine learning) with IT skills (system and database administration)
• Maintain a focus on data quality that includes easily accessible metadata, web-based data profiling, and online feedback and collaboration capabilities
• Create simple data models that are conducive to the widest possible variety of analytics
• Implement right-size governance that allows for rapid change management
• Avoid investments in solutions that are in search of a problem
• Develop a culture that tolerates non-linear processes and controlled disruption



IRS Research: Big Data Environment
Best Practices and Lessons Learned

Challenges

• I/O bottlenecks from off-loading data from the database to the application server
    - Software vendors must push more analytic APIs into the database
    - Hardware vendors must provide faster disk speeds and network throughput rates
• Continued shift in costs to software, labor, and security
• Legal or administrative policies that inhibit interagency data exchange
• Growth of data outpacing the ability to manage data quality
• Managing dual goals of safeguarding privacy of data while expanding access to new information



Contact Information
If you have further questions or comments:

Jeff Butler
Director, Research Databases
Internal Revenue Service
Research, Analysis, and Statistics
jeff.butler@irs.gov
