Performance tips for large datasets

Abstract
The purpose of this brief blog post is to provide some guidance around the topic of performance, especially in light of TIBCO Spotfire (aka TS) version 3.0. Note: The content in this posting will be modified and evolve over time to adapt to newer Spotfire versions.

About performance
A high-performing system is no stronger than its weakest link. You should really optimize the performance of every link of the chain: Hardware, Operating System, Database, TS Server, TS Professional, TS Web Player.

Contemporary data volume definitions


Small datasets: less than 100 MB
Medium datasets: 100 MB to 1 GB
Large: 1 to 10 GB
Very Large: 10 to 100 GB
Extremely large: 100 GB to TB class
Challenging: TB sizes a week!

Regarding data volumes in TS Pro and/or Web Player


A good performance estimate can't be made using the number of rows alone
The total table size (on disk, at the dB or file, measured in MB) is a better indicator
The best indicator is the combination of both of the above with the degree of entropy
Entropy is the amount of chaos in the data. Higher randomness will imply worse compaction and indexing; lower randomness will hence imply better performance

Preface
The purpose is to discuss how to improve total systemic performance. We're not going to cover tips and tricks already explained elsewhere, for example:


How much data can Spotfire handle?


Virtually infinitely large data sets. That's described in more detail in the Spotfire Data On Demand article, featuring a combination of TS Server and TS Professional functionality that allows the client to retrieve only as much data as it needs at one point in time. Since you won't be loading all the data at once, you are able to analyze virtually any size of data as you go.

Can I load only a fraction of the data? Specifically, can I load data based on criteria or conditions?
Yes! You can use a Hard Filter in your Information Link. You can create a Prompted Information Link to prompt users for specific values, intervals, etc., and only retrieve data relevant to that user and moment in time. You can use Parameterized Information Links to retrieve only data that matches one or more conditions. You can use Personalized Information Links to retrieve data only pertinent to a specific user or group.
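For illustration, here is a minimal sketch of the kind of parameterized SQL such an Information Link could issue. The table and column names are hypothetical, and the ?region / ?start_date placeholders assume a ?-style parameter syntax:

SELECT order_id, customer_id, amount, order_date
FROM sales_orders                      -- hypothetical table
WHERE region = ?region                 -- personalized/parameterized condition
  AND order_date >= ?start_date        -- prompted value, limits the rows retrieved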

Can I load more data than fits in RAM?


Yes! Spotfire is a 4th-generation in-memory analytic engine, continuously improved over the last two decades. Like other similar technologies, we highly compact the information when we load it (never discarding the full detail, though!), so depending on the data entropy it will actually take less memory than your available RAM. But wait, there's more. Over the last decade we've optimized all our engines to gracefully page to disk the pieces which don't fit in RAM, as appropriate, and only when needed. This means that you are indeed able to work with much larger data volumes than ever fit in RAM, regardless of the degree of entropy. So you're guaranteed that Spotfire will load your data, and you'll be able to see not only aggregates but also its full level of detail, even if your data is larger than the available RAM, as long as you run on a true x64 platform and OS; there is currently a Microsoft limitation on 32-bit platforms, limiting the usable data volumes to around 1.5 GB on those platforms.

OK, we're done with what we won't be covering, so let's get started.

Operating System performance


Hardware
For starters: you will only get the best performance on a true 64-bit system, regardless of the OS you have on top of it. Now, I've said it. We can move on ;)

Keep it clean
It goes without saying, but you should really disable unnecessary software, services, daemons, etc. You'll want every bit of performance dedicated to the analytic core duties.

Microsoft OS
Microsoft file systems fragment badly (as does the Registry), which in turn affects dB performance (scattered files) and applications needing I/O access.

Use, e.g., the free MyDefrag for the file system
Google for free Registry defrag tools
Windows Services take up precious resources. Turn off all services not strictly necessary for the task of running your apps. Google about this topic for your particular OS, client or server

Linux systems
File systems don't fragment as badly. Ext4 has excellent performance; XFS or JFS are good choices too.

Solaris
ZFS (perhaps the best file system ever) will give you top performance.

Database performance
How to see the time required by the dB to actually serve data
You can do that in a number of ways. For instance: while TS Pro loads data, expand the progress dialog and uncheck the check box at the bottom, so you keep the dialog up even after the data has been loaded. That will allow you to study what is happening, for systemic performance-tweaking purposes. In the logged output in that dialog, look for three lines, one containing "Reading data from data source...", another containing "Reading data", and another one containing "Creating columns", all prefixed by a timestamp, like the example below (all other log lines removed):

[...]
17:51:35 Reading data from data source...
17:51:36 Reading data.
[...]
17:54:36 Creating columns.
[...]
17:54:51 Done

"Reading data from data source" means that Spotfire has asked the dB to start processing a query. If the query is very complex and the dB hasn't been tuned, the dB may take quite some time to start producing records. So, this will tell you the time the dB requires to process the SQL query.
"Reading data" means that data is flowing in from the dB and, if needed, being stored in a temporary local file cache for later use. So, this will tell you the speed at which data flows from the dB machine, over the network, into the temporary cache being built.
"Creating columns" means that the internal Spotfire in-memory data engine and filter engine are being created, column by column. So, this will tell you the speed with which the destination machine is able to process data, once the dB has already done all its work.

By studying these times, you'll be able to learn about potential bottlenecks!

Basic dB recommendations
Use normalized tables (star or snowflake schemas)
Use minimal types where possible, e.g., store a BIT instead of 'true'/'false' strings, and create a lookup table

For all columns involved in JOINs, or columns later used by the Spotfire Information Model to filter by, create PKs (Primary Keys), FKs (Foreign Keys) and INDEXes. Also create multi-column indexes for better joins and compound queries:
CREATE INDEX ix_1 ON table (col1, col2, ..., coln)

Run/create statistics for the relevant tables only; stats for dB system tables can actually worsen dB performance
Shrink/defrag the tables involved in the analytic process; generally needed if you got a production dB export
Convert subqueries to JOINs for better performance (see the sketch below)
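As an illustration of the last point, here is a minimal sketch of a subquery-to-JOIN rewrite, using hypothetical table and column names (equivalent assuming customer_id is the PK of customers):

-- Subquery version: the dB may plan this poorly on large tables
SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE segment = 'Gold');

-- Equivalent JOIN version: usually easier for the optimizer
SELECT o.*
FROM orders o
INNER JOIN customers c ON c.customer_id = o.customer_id
WHERE c.segment = 'Gold';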

Asymmetric working tables


Run JOINs to create a new smaller, relevant and common working set (in terms of number of rows). Example: one table contains transaction detail for active customers this month (1M unique rows); another table contains full account details for all customers (100M unique rows). For analytic purposes, you can reduce the set of interesting information by creating a new table from the most interesting columns of the inner join of these two tables (a sketch follows below). I've recently used both techniques above to improve dB response times by two orders of magnitude!
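A minimal sketch of such a working table, with hypothetical table and column names (CREATE TABLE ... AS SELECT works in Oracle, MySQL and PostgreSQL; SQL Server uses SELECT ... INTO instead):

-- Keep only this month's active customers, with just the columns the analysis needs
CREATE TABLE work_active_accounts AS
SELECT t.customer_id, t.tx_date, t.amount, a.segment, a.region
FROM transactions_this_month t
INNER JOIN accounts a ON a.customer_id = t.customer_id;

-- Index the columns Spotfire will later join or filter by
CREATE INDEX ix_work_1 ON work_active_accounts (customer_id, region);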

Limit the number of records in the exploratory phase


While Spotfire currently offers no OOTB solution to get a reduced data set to create your template, this is easily achieved using plain SQL. Here's how to get the first n rows really fast (even in complex queries). Examples with n = 1000; the dB optimizer hints shown in the queries are optional. Oracle:
SELECT t.* FROM (SELECT /*+ first_rows(1000) */ col1, col2 FROM table ORDER BY col2 DESC) t
WHERE rownum <= 1000

MS SQL:
SELECT TOP (1000) col1, col2 FROM table ORDER BY col2 DESC OPTION (FAST 1000)

MySQL and PostgreSQL:


SELECT col1, col2 FROM table ORDER BY col2 DESC LIMIT 1000;

Use the techniques above from within the Information Model Designer while creating Information Links, by simply editing the SQL query and modifying it accordingly. This is also useful when querying data for the first time, to avoid surprises.

For DBAs only



Use index-organized tables (for read-only dBs, e.g., Data Warehouses). These are self-contained: the data is in the index table. Oracle-specific, but can be mimicked in all other dBs. Very fast reading.
Use partitioned tables and indexes. Recommended by all major dB makers for higher performance with very large tables!
Bitmap indexes are better when data changes less frequently, like in a typical DW. Here's an article about bitmap indexes; just Google for more. (A sketch of both DDL statements follows below.)
For serious dB tuning I'd recommend reading SQL Tuning (O'Reilly). Very theoretical book though.
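As a sketch of the first and third points, the Oracle DDL could look like this (table and column names are hypothetical):

-- Index-organized table: the row data lives inside the primary-key index itself
CREATE TABLE dim_product (
    product_id   NUMBER PRIMARY KEY,
    product_name VARCHAR2(100)
) ORGANIZATION INDEX;

-- Bitmap index on a low-cardinality column in a rarely-updated DW table
CREATE BITMAP INDEX ix_sales_region ON fact_sales (region_id);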

Spotfire Servers performance


JDBC Connectors
For TSS, always use native JDBC connectors to dBs, i.e., the ones provided by the dB maker. The shipped DataDirect connectors aren't nearly as fast as the native ones from the dB provider. When using SQL Server with MS's JDBC connector, add
responseBuffering=adaptive

to the JDBC connection string. Example:


jdbc:sqlserver://dbsrv.company.com:1433;databaseName=spotfire_server;selectMethod=cursor;responseBuffering=adaptive

Anti-Virus
White-list everything under the TSS and Web Player installation directories in the Antivirus. AV slows down our servers, and neither TSS nor TSWP is a security threat, since both run in sandboxes (Java and .NET, respectively). At the bare minimum, edit web.config to re-point the WP Temp Dir elsewhere, and white-list that directory:
<log4net>
  <appender name="FileAppender" type="log4net.Appender.RollingFileAppender">
    <file value="D:\SpotfireTmp\WP_logs\Spotfire.Dxp.Web.log" />

Advanced settings in Web Player doc


Section 5.2: in particular, itemExpirationTimeout
For performance, also read section 5.3 (Scheduled Updates)
When expecting thousands of WP users, read section 5.4

Information Model performance



Use optimizer hints for dB queries as described earlier
Include as few columns as possible in top views
Drill down to fetch more detail, even from the same table

Server-side caching of dB query results


If a dB view contains difficult joins or aggregations to be done on the fly, that could potentially be pretty heavy for the dB. I'd recommend that you cache those aggregated views into tables (which obviously should be much smaller than the full detail tables); a sketch follows after this list:

Easiest path: single-click cache using TIBCO Spotfire Advanced Data Service (aka TS ADS). Works out of the box in TS ADS Studio. For added performance, install and configure a separate dB for caching (following the dedicated PDF about that) so you can tweak it even further.
Medium difficulty: in a DW-type dB, you can cache pre-aggregated tables in the dB itself. You may need to schedule the table creation using external software.
Hardest path: use a dedicated freeware dB. You'll need to somehow pull data from the original dB while building the new tables, and you'll still need to schedule and automate the entire process. MySQL with the Memory data engine would give you good performance; however, it'd be limited to the available RAM size on that machine.
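For the medium-difficulty path, here is a minimal sketch of materializing an aggregated view as a cache table, using hypothetical names (the CREATE TABLE ... AS SELECT syntax varies slightly between dBs):

-- Compute the heavy aggregation once, instead of on every query
CREATE TABLE cache_sales_by_region AS
SELECT region_id,
       COUNT(*)    AS nr_orders,
       SUM(amount) AS total_amount
FROM fact_sales
GROUP BY region_id;

-- Then point the Information Link at cache_sales_by_region instead of fact_sales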

TS Pro performance
Antivirus
If possible, white-list the entire folder where TS Pro is installed.

Remap the temporary folder


If you have a secondary disk/partition with more space, define a new dedicated Temp dir for TS Pro (also white-listing that area in the Antivirus): edit the
<install path>\3.0\Modules\Spotfire DXP Forms_<number here>\Spotfire.Dxp.Main.dll.config

file, and within the


<applicationSettings>

area, add, for example:


<Spotfire.Dxp.Internal.Properties.Settings>
  <setting name="TempFolder" serializeAs="String">
    <value>D:\files\DXPtemp</value>
  </setting>
</Spotfire.Dxp.Internal.Properties.Settings>

Numerical columns
Numerical columns (displayed as Range Sliders in the Filter Panel) aren't indexed by default for large datasets. That's perfectly fine, but in some cases you can improve filtering performance for those columns. If the column contains a large number of rows combined with a small number of unique values, then you can increase performance by forcing an indexing (described below). Typical examples would be cases where you have millions of rows containing integers like customer ages, zip3 codes, a set of a few hundred unique IDs, etc. In those cases you'll get a performance improvement when filtering; otherwise this trick can potentially make Spotfire run slower. How to do it: for those cases fitting the description above, if you're going to filter or join by a non-indexed column (i.e., use Add Columns or Add Rows), first force the index creation simply by changing the slider to an Item Slider and back to a Range Slider.

Implicit Joins
Sometimes implicit joins, i.e., side-by-side tables, are much more memory efficient than regular joins (Add Rows and Add Columns in Spotfire terminology). Implicit joins are especially useful when you're only looking for the proper text value of a property to filter by. More about side-by-side tables here.

Hardware acceleration and virtual machines or remote access


Scatters (2D & 3D), Line Charts and the Parallel Coordinate Plot (aka Profile Chart) all use direct hardware acceleration when available. When running within most virtual machines, those types of plots are not going to render as fast, since the VMs don't have proper hardware acceleration. The same issue occurs through some types of Remote Desktop connections. Remote access through TightVNC with the Mirage video driver seems to work OK though.

Initial visualizations for large data sets


I'd recommend setting your initial visualization to the Table Plot for all your data sets. How to change the default plot just for you: go to Tools > Options > Document, and set that option to Table. How to change the default plot for everyone in the corporation: have your Spotfire administrator go to Tools > Administration Manager > Preferences, select the Everyone group, click the Edit button, and set Visualization > Visualization Preferences > InitialVisualization to TablePlot.

Also, I'd recommend starting to explore the data using aggregated plots, e.g., Tree Maps, Pie Charts or Bar Charts. Scatter Plots (2D or 3D), Network Graphs (aka NGs) and Box Plots with Tukey-Kramer comparison circles will all be more computationally intensive: Scatters have to comb through every row (aggregated Scatter Plots, commonly referred to as Bubble Charts by other vendors, are OK though), and NGs can have almost exponential complexity.

Anyway, aggregated visualizations are going to give you the fastest rendering times. You can measure rendering time with an Easter egg (see further down in this article).


Performance-measuring easter egg


There's a way to measure the rendering time of all visualizations, which is extremely useful to pinpoint performance issues. Here's how to do it: in the white area of a plot, quickly hit three exclamation marks and a question mark, i.e., the following sequence:
!!!?

Eventually you'll see a box in the top left corner of the plot indicating graphic rendering time and total time (i.e., the chain of events leading up to actually being able to render the plot, usually the aggregations, etc.). The colors of those boxes indicate:
Green: hardware acceleration is available and used
Yellow: hardware acceleration not needed for that plot
Red: hardware acceleration needed but not available

Some random performance examples


The numbers below are not meant to be representative of the best performance you can get out of Spotfire (you can see that neither machine is a high-end machine!) but rather to show examples of real use in environments similar to what many of our users already have.

25M rows, 8 cols (1.4 GB text file, low entropy) on a 1-year-old laptop (*): 7 min to load from a *.csv text file. Same data, already embedded in a dxp file: 8 sec to load.
50M rows, 23 cols (6.2 GB file size, very high entropy) on a 3-year-old machine (**): around 35 min to load from a *.csv text file; would have been faster from a dB. 50M unique values in several of the columns.
6M rows, 63 cols (500 MB in the dB, medium entropy) on a 3-year-old machine (**): 3 min to load from a MySQL 5.1 dB.

(*) Lenovo T61 laptop, Intel Core T9300 2.5 GHz, 4 GB RAM, running Windows Vista Business x64, 1 year old.
(**) Sun Ultra 40 workstation, dual AMD64 2.8 GHz, 16 GB RAM, running Windows XP x64, 3 years old.
Published Oct 26 2009, 05:07 PM by Carlos Ferraro Cavallini


About Carlos Ferraro Cavallini


Carlos is a Senior Solutions Consultant with Spotfire focusing on large enterprises across the Americas, and has been with Spotfire since 1998 in a number of positions. Carlos has been working with Analytics and Business Intelligence since 1995. He holds an M.Sc. in Computing Science from the University of Gothenburg, Sweden.
