Improving Query Flexibility with Partition Mapping
By Bryan, April 19, 2011
Partition Mapping is a traditional database concept that MicroStrategy has adopted in the SQL Engine to give you even more options for performance and tuning. Traditional database partitioning involves taking a large table and breaking it up into smaller pieces at logical break points, such as one table per year or per region. The database is then intelligent enough to know where the break points are and direct queries to the specific tables containing the data, providing a smaller set of data to process. MicroStrategy takes a similar approach to carving Fact tables into smaller pieces, and offers two different implementations.

While this feature is intended to squeeze a little more efficiency out of a database, most advanced platforms, such as Netezza, Greenplum or Vertica, won't see any (or much) performance increase (though I'm no expert on those platforms; that's just from my observations). And while the feature may seem antiquated for projects trying to squeeze the last bit of life out of a stressed MS SQL Server, you can also use it in a few clever ways to provide query flexibility and even achieve some amazing data availability. Before I talk about different ways to take advantage of Partitioning, I should first describe the feature in general and how to use it. There are two methods of implementing Partitioning: Warehouse Partition Mapping and Metadata Partition Mapping.

Warehouse Partition Mapping

The best method for pure performance is the Table approach. Effectively, you create a table of tables by physically breaking your Fact table up into individual pieces, and then tying each table name
to an Attribute Element in a Control Table so MicroStrategy knows
how to load them.

Example: You have 10 years' worth of data in a table, and it's getting a little large to scan through. Most of your queries are for the last few years anyway, but you don't want to lose the ability to query history. You could break that single table up into 10 smaller pieces and create a control table such as:

Year   PBTName
2001   FactData2001
2002   FactData2002
...
2011   FactData2011

MicroStrategy will consume this as a special kind of table, called a Partition Mapping Table, and treat it as a single table. When you run a query, MicroStrategy will first determine which table(s) contain the Year(s) in your report, and use only the appropriate tables. Aside from the obvious benefit of smaller tables to query, consider also the benefit to ETL of having a smaller Current Year table to work with, as opposed to sifting through 9 years of data that isn't going to change on a day-to-day basis.

The obvious drawback of this approach is that you have to maintain that control table. While that may not seem like a big deal, especially if you are slicing by an Attribute like Region, which probably doesn't change very often, there is a requirement that isn't immediately noticeable: you must provide *separate* tables for every element. That means if you wanted something like FactData2011, FactData2010 and FactDataArchive (containing everything prior to 2010), you couldn't use this feature. Even if you define each row in your Control Table, the rows can't point to the same physical table in the database. Perhaps you could be clever and circumvent that
requirement with Views, but even then, if you're slicing on an Attribute that has lots of values, such as Month or even Day, you're still in for an unmanageable number of objects. For something that granular, you'd want to turn to the next Partitioning method.

Metadata Partition Mapping

I'm not a fan of this method's name, because whenever I say it, I feel the initial interpretation is that it has something to do with the Metadata. While I guess everything has something to do with the metadata, this is really Query Based Partition Mapping, compared to the Table Based Partition Mapping described previously. This approach is a lot more MicroStrategy-esque (to coin a word). Effectively, you build a filter that tells the SQL Engine which table to choose. The previous example that wasn't feasible (using the 3 tables FactData2011, FactData2010 and FactDataArchive) can be achieved with this method: you simply define the filter Year=2011 and choose FactData2011 as the target table, repeat the process for Year=2010 to FactData2010, and then define Year<2010 to FactDataArchive.

The downside of this approach is that before the report can run, a pre-query must first be run to determine which table to choose. This can be pretty inefficient if it has to query the fact tables themselves, though you can get it to come from dimensions with some careful schema tweaking. I don't have those details readily available right now, but when I've implemented this in the past, it was a challenge. Even once you craft pre-SQL that is efficient, debugging reports becomes considerably harder all around. Since the SQL Engine must first run a pre-query to determine which tables to use, you end up with Schrödinger's SQL, where every possible outcome is listed in the
SQL View. This occurs not only when trying to view the SQL pre-execution, but also in the Job Monitor, which can make the debugging process very difficult and annoying. Thankfully, once the report is complete, you can view the SQL and see the actual chosen path.

Partitioning for Real Time Data

Now on to my favorite way to leverage Metadata Partition Mapping. In my current environment, we load data in near real time out of necessity (saving it all up for a single nightly load would take too long to process), so the Warehouse is within an hour of real time at any point during the day. Since users mostly use the BI project for historical analysis and trending, we don't query the Today data very often, so we don't include it in Aggregates. Of course, when users do want to do some real time reporting, usually to debug some system issue, traditional Aggregate Table approaches wouldn't work here. An alternative to last week's Attribute Anchoring approach for this scenario would be to use Metadata Partition Mapping to force queries that include Today to go directly to the FactDetail table, and any queries < Today to go to the Aggregates. This provides some relief for the ETL, which no longer has to reprocess Aggregate tables throughout the day (if they are at the Day level, that would require lots of deletes/updates), without sacrificing access to the data.

9.2 MultiSource Enhancement

A fantastic enhancement in 9.2 is the ability to do Metadata Partitioning across MultiSource connections. This means you could use separate databases to hold your real time data vs your historical data. In the original example, that could mean moving Archived data off of your expensive Warehouse Appliance and into a cheaper database solution. Reports wouldn't have to change, and users wouldn't even notice (other than speed differences).
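To make the real time routing above concrete, here is a minimal sketch of the decision the pre-query makes: each partition pairs a filter with a target table, and the first matching filter wins. The table names (FactDetail, FactDayAgg) and the function are illustrative assumptions; MicroStrategy performs this step internally via SQL, not application code.

```python
from datetime import date

# Hypothetical partition definitions: each entry pairs a filter predicate
# with the physical table the query should target. Queries that include
# Today go to the detail table; everything earlier hits the aggregate.
PARTITIONS = [
    (lambda d: d >= date.today(), "FactDetail"),  # real time detail
    (lambda d: d < date.today(), "FactDayAgg"),   # pre-aggregated history
]

def choose_table(report_date):
    """Mimic the pre-query step: test each partition filter in order
    and return the first matching target table."""
    for predicate, table in PARTITIONS:
        if predicate(report_date):
            return table
    raise ValueError(f"No partition covers {report_date}")

print(choose_table(date.today()))      # FactDetail
print(choose_table(date(2011, 1, 1)))  # FactDayAgg
```

Keeping the filters non-overlapping matters: if two filters both matched, the chosen table would depend on definition order, which is exactly the kind of ambiguity that makes partition debugging painful.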
Another Real Time solution would be to hit Source System data directly. In most environments this may not be feasible, but in my current environment our Source System actually keeps aggregate level data in absolute real time (within a minute). While it doesn't keep history, that is enough that we could use MultiSource Metadata Partitioning to provide absolute real time data availability to our users in MicroStrategy.

For a longer list of considerations in defining your Partitioning strategy, check out this TechNote. For details on configuring Partitioning, start with this TechNote.
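As a recap of the Table approach from earlier, the control-table lookup can be sketched as a simple mapping from Year elements to partition tables. The names follow the FactData2001..FactData2011 example above; the lookup function itself is a hypothetical illustration of what the SQL Engine does when it resolves a report's Year filter against the Partition Mapping Table.

```python
# Sketch of the Warehouse Partition Mapping control table: one physical
# partition table (PBTName) per Year element, as in the earlier example.
CONTROL_TABLE = {year: f"FactData{year}" for year in range(2001, 2012)}

def tables_for_years(years):
    """Given the Year elements a report filters on, return only the
    partition tables that need to be scanned."""
    requested = set(years)
    missing = requested - CONTROL_TABLE.keys()
    if missing:
        # Mirrors the maintenance burden: every element must be mapped.
        raise KeyError(f"No partition mapped for years: {sorted(missing)}")
    return sorted(CONTROL_TABLE[y] for y in requested)

# A report filtered to 2010-2011 touches two small tables instead of
# scanning the full ten-year fact table.
print(tables_for_years([2010, 2011]))  # ['FactData2010', 'FactData2011']
```

The KeyError branch reflects the drawback discussed above: the control table must map every element to its own physical table, so an unmapped (or shared) target is not something this scheme can express.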