FINRA: Building a Secure Data Science Platform on AWS

BDM203
FINRA
Building a Secure Data Science Platform on AWS
Scott Donaldson Senior Director, FINRA

Vincent Saulys Senior Director, FINRA
November 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DATA SCIENCE NEEDS
Data discovery & exploration
Bring disparate sources of data together
Semantic understanding of the data sets
Ease of use: enable users without requiring them to understand the underlying data infrastructure
Safeguard information with a high degree of security and least-privilege access
Model migration from research to prototype to production
Avoid time spent on environment administration
SEPARATE INFRASTRUCTURE SERVICES
SCALE THE DATA PLANT
Considerations
Scale compute and storage separately.
Resiliency and disaster recovery
Flexibility of instance types
Data discovery through an enterprise data catalog
Security
Virtual private cloud (VPC) & encryption
Separation of duties
DevOps: Automate everything
Least privilege and no catch-all rules
Centralized monitoring for total transparency
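The "least privilege and no catch-all rules" point can be illustrated with a small automated check of the kind a DevOps pipeline might run. This is a minimal sketch, not FINRA's tooling: it assumes policies follow the standard IAM JSON layout (`Statement`, `Effect`, `Action`, `Resource`), and the function name, bucket name, and prefix are hypothetical.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[str]:
    """IAM accepts a single string or a list of strings; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_catch_all_statements(policy: Dict[str, Any]) -> List[int]:
    """Return indexes of Allow statements whose Action or Resource is a bare '*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        if "*" in actions or "*" in resources:
            flagged.append(index)
    return flagged


# Hypothetical policy: the bucket name and prefix are made up for illustration.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-data-lake/research/*",
        },
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

flagged = find_catch_all_statements(example_policy)  # second statement is flagged
```

A check like this only catches the bare wildcard; narrower patterns such as `s3:*` would need a stricter rule, which is left out of this sketch.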
SCALE THE DATA PLANT
CENTRALIZED DATA MANAGEMENT
Unified catalog
Schemas
Versions
Encryption type
Storage policies
Lineage and Usage

Track publishers & consumers
Easily identify jobs and derived data sets
Shared Metastore
Common definition of tables & partitions
Use with Spark, Presto, Hive, etc.
Faster instantiation of clusters
http://finraos.github.io/herd
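To make the catalog idea concrete, the sketch below models the attributes listed on this slide (schema, version, encryption type, storage policy) together with simple lineage and usage tracking. It is an illustration only and does not reflect herd's actual API or data model; every class, method, and data set name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataSetVersion:
    name: str
    version: int
    publisher: str
    schema: Dict[str, str]          # column name -> type
    encryption_type: str
    storage_policy: str
    derived_from: List[str] = field(default_factory=list)  # parent data set names


class Catalog:
    """In-memory stand-in for a unified catalog (hypothetical, not herd)."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[DataSetVersion]] = {}
        self._consumers: Dict[str, Set[str]] = {}

    def publish(self, entry: DataSetVersion) -> None:
        self._entries.setdefault(entry.name, []).append(entry)

    def latest(self, name: str) -> DataSetVersion:
        return max(self._entries[name], key=lambda e: e.version)

    def record_consumer(self, name: str, job: str) -> None:
        self._consumers.setdefault(name, set()).add(job)

    def consumers(self, name: str) -> List[str]:
        return sorted(self._consumers.get(name, set()))

    def derived_data_sets(self, name: str) -> List[str]:
        """Names of data sets that list `name` as a parent."""
        return sorted(
            {e.name for versions in self._entries.values()
             for e in versions if name in e.derived_from}
        )


catalog = Catalog()
catalog.publish(DataSetVersion("trades", 1, "ingest_job",
                               {"symbol": "string", "price": "double"},
                               "SSE-KMS", "standard"))
catalog.publish(DataSetVersion("trades", 2, "ingest_job",
                               {"symbol": "string", "price": "double", "venue": "string"},
                               "SSE-KMS", "standard"))
catalog.publish(DataSetVersion("daily_trade_summary", 1, "daily_summary_job",
                               {"symbol": "string", "volume": "bigint"},
                               "SSE-KMS", "standard", derived_from=["trades"]))
catalog.record_consumer("trades", "daily_summary_job")
```

Keeping lineage as explicit parent links is what lets a question like "which data sets are derived from trades?" be answered from the catalog alone, without inspecting the jobs themselves.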
EFFECTS OF CLOUD CHANGE
Gold source of all the data in S3
Separated data and compute
Easily spin up compute with unlimited query engine capacity
REMAINING PAIN POINTS
Data scientists still relied on SQL to query the data
Data science continued to be done on local machines
No standard setup
Everyone administered their own machines
The data was too big for local machines
More people doing advanced analytics
Easy collaboration was still not addressed
DATA SCIENCE TOOLING: BEFORE UDSP
SOLUTION: UNIVERSAL DATA SCIENCE PLATFORM
UDSP V1
Secure
Technology controls and curates content
Self-Service
Users manage their machines
Scalable Compute
Size machines to your needs
Turnkey
Libraries pre-built and installed
NO USERS, WHY?
Needs driven by technology
IT: Reduce costs
Users: need more compute
Secure but inflexible
Local machines were more flexible
Install any package and experiment
Data availability
On-premises databases not reachable
Setup still required
Driver configuration to connect to databases
Technology in the way
Technology required to install any new package
UDSP V2
Flexible
Download/Install any package
Data Availability
No additional setup necessary
On-premises and cloud data accessible
Ownership
Changes proposed and vetted through the data science forum
ADOPTION METRICS
INVENTORY
R 3.2.5, Python (2.7.12 and 3.4.3)
Packages
R: 300+ Python: 100+
Tools for Building Packages
gcc, gfortran, make, java, maven, ant
IDEs
Jupyter, RStudio Server
Deep Learning
CUDA, cuDNN (if GPU present)
Theano, Caffe, Torch
TensorFlow
SELF SERVICE
Completely self service, no technology administration
Users select UDSP version and machine capacity
Users are associated with groups (AWS billing tags and machine selection choices)
Users manage their instances
Create, Stop, Terminate (delete)
Managers can administer their teams' instances
Dashboard to monitor resource usage
Stop instances from the dashboard
Reports for historical usage
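The self-service rules above (group-based machine choices and billing tags, owner-managed lifecycle, manager oversight) can be sketched as a small in-memory model. This is an assumption-laden illustration rather than the UDSP implementation: the class names, the `udsp-0001` style IDs, the user and group names, and the machine sizes offered are all hypothetical, and no AWS calls are made.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"


@dataclass
class Instance:
    instance_id: str
    owner: str
    group: str
    udsp_version: str
    machine_size: str
    state: State = State.RUNNING

    def billing_tags(self) -> Dict[str, str]:
        # Tags derive from group membership so usage can be reported per team.
        return {"group": self.group, "owner": self.owner}


class SelfServicePortal:
    def __init__(self, sizes_by_group: Dict[str, List[str]],
                 managers: Dict[str, str]) -> None:
        self._sizes_by_group = sizes_by_group  # group -> machine sizes offered
        self._managers = managers              # group -> manager user name
        self._instances: Dict[str, Instance] = {}

    def create(self, user: str, group: str, udsp_version: str,
               machine_size: str) -> Instance:
        if machine_size not in self._sizes_by_group.get(group, []):
            raise ValueError(f"{machine_size} is not offered to group {group}")
        instance_id = f"udsp-{len(self._instances) + 1:04d}"
        instance = Instance(instance_id, user, group, udsp_version, machine_size)
        self._instances[instance_id] = instance
        return instance

    def _change_state(self, user: str, instance_id: str, new_state: State) -> None:
        instance = self._instances[instance_id]
        # Only the owner or the group's manager may act on an instance.
        if user not in (instance.owner, self._managers.get(instance.group)):
            raise PermissionError(f"{user} may not manage {instance_id}")
        if instance.state is State.TERMINATED:
            raise ValueError("terminated instances cannot change state")
        instance.state = new_state

    def stop(self, user: str, instance_id: str) -> None:
        self._change_state(user, instance_id, State.STOPPED)

    def terminate(self, user: str, instance_id: str) -> None:
        self._change_state(user, instance_id, State.TERMINATED)

    def running_by_group(self) -> Dict[str, int]:
        """Dashboard-style count of running instances per group."""
        counts: Dict[str, int] = {}
        for instance in self._instances.values():
            if instance.state is State.RUNNING:
                counts[instance.group] = counts.get(instance.group, 0) + 1
        return counts


portal = SelfServicePortal({"surveillance": ["m4.xlarge", "p2.xlarge"]},
                           {"surveillance": "manager_a"})
notebook_box = portal.create("analyst_a", "surveillance", "2.0", "m4.xlarge")
gpu_box = portal.create("analyst_b", "surveillance", "2.0", "p2.xlarge")
portal.stop("manager_a", gpu_box.instance_id)  # a manager stops a team member's instance
running = portal.running_by_group()
```

In a real deployment the state changes would be backed by cloud API calls and the tags applied to the actual machines; the sketch only shows the authorization and bookkeeping rules.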
UDSP: CREATE AND LAUNCH
UDSP: MONITOR RUNNING INSTANCES
UDSP: USE TOOLS WITH BROWSER
MAINTAINING THE UDSP
Community Driven Experimentation
Data scientists can install any package to try it out
No technologist necessary to administer installation
New library (or version) is proposed for next release
Releases have been monthly
Envision quarterly releases
Philosophy: Support last major release (most recent patch)
R 3.3.1 is available and still receiving patches; UDSP has 3.2.5
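One reading of this support policy, consistent with the R example on the slide, is: take the release series just before the newest one and use its latest patch. The sketch below encodes that interpretation; the function names are hypothetical, the policy reading is an assumption, and the version list is only an illustrative input.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'X.Y.Z' into a tuple of integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def pick_supported_version(available: Iterable[str]) -> str:
    """Pick the latest patch of the series preceding the newest series.

    If only one series exists, fall back to its latest patch.
    """
    versions = sorted({parse_version(v) for v in available})
    if not versions:
        raise ValueError("no versions available")
    series = sorted({v[:2] for v in versions})
    target = series[-2] if len(series) > 1 else series[-1]
    chosen = max(v for v in versions if v[:2] == target)
    return ".".join(str(part) for part in chosen)


# Illustrative input only; the slide itself names just 3.3.1 and 3.2.5.
r_releases = ["3.2.3", "3.2.4", "3.2.5", "3.3.0", "3.3.1"]
selected = pick_supported_version(r_releases)
```

With this input the function selects 3.2.5, matching the version the slide says UDSP ships while 3.3.1 is still being patched.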
THE ROAD AHEAD
Clusters for Advanced Analytics

Surveillance Platform
Facilitate surveillance development on Spark
Data Framework for accessing and manipulating data
ML Framework standardizes algorithms, diagnostics and best practices
SURVEILLANCE PLATFORM
Spark as the processing platform

Cluster-based data processing and cluster-based data science
Frameworks will speed data engineering and data science
RECAP
Each improvement brought pressures to legacy ways of working
Flexibility of platform key to adoption
Groups do what they are best at (administer setups, do analytics)
Technology: get out of the way!
Full visibility to administer costs
RELATED SESSIONS
Other FINRA Sessions:
BDM203 Building a Secure Data Science Platform
DAT302 Best Practices for Migrating to RDS / Aurora
ENT313 FINRA in the Cloud, Big Data Enterprise
CMP316 Aligning Billions of Time Ordered Events with Spark
STG308 Analytics Without Limits. FINRA's Scalable Big Data Architecture on S3
ABOUT THE PRESENTERS
Scott Donaldson
Senior Director, FINRA
Data Analytics and Surveillance Systems
Scott.Donaldson@finra.org
https://www.linkedin.com/in/scottdonaldson

Vincent Saulys
Senior Director, FINRA
Advanced Surveillance Development
Vincent.Saulys@finra.org
www.linkedin.com/in/vincentsaulys
QUESTIONS?
Learn more at
http://technology.finra.org
FINRA Technology is hiring

http://technology.finra.org/careers.html
Thank you!
