
Data Modeling

1. How are we going to decide which schema to implement in the data warehouse?
2. What is the difference between a logical data model and a physical data model in Erwin?
3. Match the following in the context of a data flow diagram:
   i. Circle; ii. Square; iii. Arrow; iv. Parallel lines
   a. Source of data; b. File or database; c. Conversion process; d. Data flow
   Options: 1. i-a, ii-b, iii-c, iv-d; 2. i-c, ii-a, iii-d, iv-b; 3. i-c, ii-d, iii-a, iv-b; 4. i-b, ii-a, iii-d, iv-c
4. Managers' salary details are hidden from employees. This is:
   1. Conceptual level of data hiding; 2. Physical level of data hiding; 3. External level of data hiding; 4. Either 1 or 2
5. Data modeling software tools.
6. In which normal form are the dimension table and the fact table in the schema?
7. Conceptual models.
8. Why are recursive relationships bad? How do you resolve them?
9. What is the difference between a star schema and a snowflake schema?
10. When should you consider denormalization?
11. Describe the third normal form.
12. What is second normal form?
13. What is first normal form?
14. Generally speaking, for a weak entity set to be meaningful it must be part of a:
    1. One-to-one relationship; 2. One-to-many relationship; 3. Many-to-many relationship; 4. Depends on the particular situation
15. Data modeling is the process of constructing:
    1. An orderly arrangement of data elements; 2. A graphic representation of data contained in an information system; 3. Physical elements of the information system; 4. A verbal description of the data need
16. What is an ERD?
17. What is the difference between the hashed file stage and the sequential file stage in DataStage Server?
18. Is this statement true or false: all databases must be in third normal form?
19. What is data sparsity and how does it affect aggregation?
20. What is an artificial (derived) primary key? When should it be used?

Data Warehousing
1. What is Data Warehousing?
2. What is Virtual Data Warehousing?
3. Explain in brief the various fundamental stages of Data Warehousing.
4. What is active data warehousing?
5. List the differences between a dependent data warehouse and an independent data warehouse.
6. What are data modeling and data mining? What are they used for?
7. Difference between ER Modeling and Dimensional Modeling.
8. What is the difference between data warehousing and business intelligence?
9. Describe dimensional modeling.
10. What is a snapshot with reference to a data warehouse?
11. List the types of dimension tables.
12. What is a degenerate dimension table?
13. What is a Data Mart?
14. Define fact table.
15. Define dimension table.
16. What is the difference between metadata and a data dictionary?
17. What is ETL?
18. Describe the various methods of loading dimension tables.
19. What is OLTP?
20. What is the difference between OLAP and a data warehouse?
21. What is ODS?
22. What is OLAP?
23. List the differences between OLTP and OLAP.
24. Describe the foreign key columns in fact tables and dimension tables.
25. Explain in brief data mining.
26. Difference between a view and a materialized view.
27. Explain in brief the ER diagram.
28. What is VLDB?
29. What is cube grouping?
30. Define the term slowly changing dimensions (SCD).
31. Differences between star and snowflake schemas.
32. What is a star schema?
33. Why is the fact table in normal form despite the fact that denormalization improves data warehouse processes?
34. Explain the use of lookup tables and aggregate tables.
35. What are a cube and a linked cube with reference to a data warehouse?
36. What is real-time data warehousing?
37. What are conformed dimensions used for?
38. What is a conformed fact?
39. What is a snowflake schema?
40. How do you load the time dimension?
41. What is a junk dimension?
42. What is the level of granularity of a fact table?
43. Define non-additive facts.
44. Explain the use of a factless fact table.
45. What is a hybrid slowly changing dimension?
46. Define BUS schema.
47. List the differences between the SAS tool and other tools.
48. Why is SAS so popular?
49. What is data cleaning? How can we do that?
50. Explain in brief critical column.
51. What is data cube technology used for?

What is Data Warehousing?


Answer A data warehouse is a storage area where subject-specific, relevant data is stored irrespective of its source. Data warehousing is the set of processes required to create and maintain such a warehouse; it merges data from multiple sources into a single, consistent form.

What are fact tables and dimension tables?


Answer As mentioned, data in a warehouse comes from transactions. A fact table in a data warehouse consists of facts and/or measures; the data in a fact table is usually numerical. A dimension table, on the other hand, contains fields used to describe the data in the fact tables; it provides additional, descriptive information (dimensions) for the fields of a fact table. E.g. if I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my dimension table will store the task and resource details. Hence, the relation between a fact table and a dimension table is one-to-many.
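
To make the one-to-many relation concrete, here is a minimal sketch using plain Python structures; the table and column names (task_dim, resource_fact) are invented for illustration.

```python
# Dimension table: descriptive attributes, keyed by a dimension key.
task_dim = {
    1: {"task_name": "Build ETL job", "project": "Warehouse Migration"},
    2: {"task_name": "Design schema", "project": "Warehouse Migration"},
}

# Fact table: numeric measures plus foreign keys into the dimensions.
# Several fact rows can reference the same dimension row (one-to-many).
resource_fact = [
    {"task_key": 1, "resources_used": 4},
    {"task_key": 1, "resources_used": 2},
    {"task_key": 2, "resources_used": 3},
]

# A simple "query": total resources per task, resolved via the dimension.
totals = {}
for row in resource_fact:
    name = task_dim[row["task_key"]]["task_name"]
    totals[name] = totals.get(name, 0) + row["resources_used"]

print(totals)  # {'Build ETL job': 6, 'Design schema': 3}
```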

What is ETL process in data warehousing?


Answer ETL stands for Extract, Transform, Load. It is the process of fetching data from different sources, converting it into a consistent and clean form, and loading it into the data warehouse. Different tools are available in the market to perform ETL jobs.
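
A minimal ETL sketch in Python follows; the source file name, column names and warehouse table are all assumptions made for illustration, with SQLite standing in for the warehouse.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file (sales.csv is a stand-in
# for any source system).
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: clean and standardize the data into a consistent form.
def transform(rows):
    cleaned = []
    for row in rows:
        if not row.get("amount"):          # drop incomplete records
            continue
        cleaned.append({
            "region": row["region"].strip().upper(),   # standardize text
            "amount": round(float(row["amount"]), 2),  # normalize numbers
        })
    return cleaned

# Load: insert the cleaned rows into the warehouse table.
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)", rows
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("sales.csv")), conn)
```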

Explain the difference between data mining and data warehousing.


Answer Data warehousing is merely extracting data from different sources, cleaning it and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc. E.g. a company's data warehouse stores all the relevant information on projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.

What is an OLTP system and OLAP system?


Answer OLTP: Online Transaction Processing helps and manages applications based on transactions involving high volumes of data. Typical examples of such transactions are found in banking, airline ticketing, etc. Because OLTP uses a client-server architecture, it supports transactions running across a network. OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations on usually low volumes of data. OLAP helps the user gain insight into data coming from different sources (multidimensional).

What is PDAP?
Answer A data cube stores data in a summarized version, which helps in faster analysis of the data. The data is stored in such a way that it supports reporting easily. E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an employee. Here, month and week could be considered the dimensions of the cube.
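
As a rough sketch of the idea, the Python snippet below pre-summarizes facts along a (employee, month) dimension pair the way a cube would, so later queries become simple lookups; all names and values are invented.

```python
from collections import defaultdict

# Raw facts: (employee, week, month, score). All values are illustrative.
facts = [
    ("alice", "W1", "Jan", 7), ("alice", "W2", "Jan", 9),
    ("bob",   "W1", "Jan", 5), ("bob",   "W1", "Feb", 8),
]

# Pre-summarize along the chosen dimensions, so later queries are
# lookups into the summary instead of full scans of the facts.
by_employee_month = defaultdict(int)
for emp, week, month, score in facts:
    by_employee_month[(emp, month)] += score

print(by_employee_month[("alice", "Jan")])  # 16
```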

What is snow flake scheme design in database?


Answer A snowflake schema in its simplest form is an arrangement of fact tables and dimension tables. The fact table is usually at the center, surrounded by the dimension tables. Normally in a snowflake schema the dimension tables are further broken down into more dimension tables. E.g. dimension tables include employee, projects and status. The status table can be further broken into status_weekly and status_monthly.

What is analysis service?


Answer Analysis services provide a combined view of the data used in OLAP or data mining. Services here refer to OLAP and data mining.

Explain sequence clustering algorithm.


Answer The sequence clustering algorithm collects similar or related paths, i.e. sequences of data containing events. E.g. the sequence clustering algorithm may help find the path to store products of a similar nature in a retail warehouse.

Explain discrete and continuous data in data mining.


Answer Discrete data can be considered as defined or finite data, e.g. mobile numbers, gender. Continuous data can be considered as data which changes continuously and in an ordered fashion, e.g. age.

Explain time series algorithm in data mining.


Answer The time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. E.g. the performance of one employee can be used to forecast the profit.

What is XMLA?
Answer XMLA is XML for Analysis, which can be considered a standard for accessing data in OLAP, data mining or other data sources on the internet. It is based on SOAP (the Simple Object Access Protocol). XMLA uses the Discover and Execute methods: Discover fetches information from the data source, while Execute allows applications to run commands against the data source.
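
As a rough illustration, the sketch below builds a minimal XMLA Discover request in Python. The endpoint URL is a placeholder, and the exact properties a real server requires will vary; this only shows the SOAP-envelope shape of an XMLA call.

```python
import urllib.request

# Minimal XMLA Discover request: a SOAP body containing a Discover call.
XMLA_DISCOVER = """\
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <RequestType>DISCOVER_DATASOURCES</RequestType>
      <Restrictions/>
      <Properties/>
    </Discover>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

req = urllib.request.Request(
    "http://example.com/olap/msmdpump.dll",   # placeholder endpoint
    data=XMLA_DISCOVER.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
# response = urllib.request.urlopen(req)  # run only against a real server
```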

Explain the difference between Data warehousing and Business Intelligence.


Answer Data warehousing helps you store the data, while business intelligence helps you use that data for decision making, forecasting, etc. Data warehousing, using ETL jobs, stores data in a meaningful form. However, in order to query the data for reporting, forecasting and the like, business intelligence tools were born.

What is Dimensional Modeling?


Answer Dimensional modeling is often used in data warehousing. In simpler words, it is a rational and consistent design technique used to build a data warehouse. DM uses the facts and dimensions of a warehouse for its design. Snowflake and star schemas are examples of dimensional models.

What is surrogate key? Explain it with an example.


Answer Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user but by the system. A primary difference between a primary key and a surrogate key in some databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity. E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record, while the surrogate key will be generated by the system (say, a serial number), since the SK is not derived from the data.
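
A minimal sketch of the idea in Python, with a simple counter standing in for the database sequence that would normally generate the key; all names are illustrative.

```python
import itertools

# Surrogate keys are system-generated and carry no business meaning.
_next_key = itertools.count(1)
employee_dim = {}

def add_employee(name, hired):
    sk = next(_next_key)          # system-generated, not derived from data
    employee_dim[sk] = {"name": name, "hired": hired}
    return sk

# Two employees with the same name get distinct surrogate keys.
k1 = add_employee("John Smith", 1998)
k2 = add_employee("John Smith", 2004)
print(k1, k2)  # 1 2
```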

What is the purpose of Factless Fact Table?


Answer Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don't really hold facts or measures, but they are commonly used for tracking some information about an event. E.g. to find the number of leaves taken by an employee in a month.
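
The sketch below shows the leave-tracking example in Python; the keys and date encoding are invented for illustration.

```python
# A factless fact table holds only foreign keys (no measures); the
# "fact" is that an event occurred at all.
leave_fact = [
    {"employee_key": 1, "date_key": 20240105},
    {"employee_key": 1, "date_key": 20240122},
    {"employee_key": 2, "date_key": 20240110},
]

# Counting rows answers "how many leaves did employee 1 take in January?"
jan_leaves = sum(
    1 for r in leave_fact
    if r["employee_key"] == 1 and 20240101 <= r["date_key"] <= 20240131
)
print(jan_leaves)  # 2
```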

What is a level of Granularity of a fact table?


Answer A fact table is usually designed at a low level of granularity. This means that we need to find the lowest level of information that can be stored in a fact table. E.g. employee performance is a very high level of granularity; employee_performance_daily and employee_performance_weekly can be considered lower levels of granularity.

Explain the difference between star and snowflake schemas.


Answer A snowflake schema design is usually more complex than a star schema. In a star schema, a fact table is surrounded by multiple dimension tables, and a snowflake schema starts the same way. However, in a snowflake schema, the dimension tables are further broken down into sub-dimensions. Hence, data in a snowflake schema is more normalized and standardized compared to a star schema. E.g. star schema: performance report is a fact table, and its dimension tables include performance_report_employee and performance_report_manager. Snowflake schema: the dimension tables can be broken down into performance_report_employee_weekly, monthly, etc.
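
To show the structural difference, here is a toy sketch in Python contrasting one denormalized star dimension with its snowflaked equivalent; the employee/department tables are invented for illustration.

```python
# Star schema: one wide, denormalized dimension table.
employee_dim_star = {
    1: {"name": "Alice", "dept_name": "Sales", "dept_region": "East"},
    2: {"name": "Bob",   "dept_name": "Sales", "dept_region": "East"},
}

# Snowflake schema: the department attributes are normalized out into
# their own table and referenced by key, removing the repetition.
dept_dim = {10: {"dept_name": "Sales", "dept_region": "East"}}
employee_dim_snow = {
    1: {"name": "Alice", "dept_key": 10},
    2: {"name": "Bob",   "dept_key": 10},
}

# Resolving a snowflaked attribute takes one extra lookup (an extra join).
region = dept_dim[employee_dim_snow[1]["dept_key"]]["dept_region"]
print(region)  # East
```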

What is the difference between view and materialized view?


Answer A view is created by combining data from different tables; hence, a view does not hold data of its own. A materialized view, on the other hand, commonly used in data warehousing, does hold data, which helps in decision making, performing calculations, etc. Its data is computed beforehand using queries and stored. When a view is created, the data is not stored in the database; it is produced when a query is fired on the view, whereas the data of a materialized view is stored.
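
A minimal sketch of the distinction in Python: the view is a function recomputed on every query, while the materialized view is its stored result; table and function names are invented.

```python
sales = [("East", 100), ("West", 250), ("East", 75)]

# A view stores no data: the result is computed each time it is queried.
def sales_by_region_view():
    totals = {}
    for region, amount in sales:
        totals[region] = totals.get(region, 0) + amount
    return totals

# A materialized view stores the pre-computed result; queries read the
# stored copy, which must be refreshed when the base data changes.
sales_by_region_mv = sales_by_region_view()   # computed once, stored

print(sales_by_region_view())   # recomputed on every call
print(sales_by_region_mv)       # read from the stored result
```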

What is junk dimension?


Answer In scenarios where certain data may not be appropriate to store in the main schema, this data (or these attributes) can be stored in a junk dimension. The data in a junk dimension usually consists of Boolean or flag values, e.g. whether the performance of an employee was up to the mark, or comments on performance.
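
A common way to build a junk dimension is to enumerate every combination of the leftover flags, one row per combination, so fact rows carry a single key instead of several flag columns. A toy sketch, with invented flag names:

```python
from itertools import product

# Each distinct combination of flags becomes one junk-dimension row.
flags = {
    "met_target": [True, False],
    "reviewed":   [True, False],
}

junk_dim = {
    sk: dict(zip(flags, combo))
    for sk, combo in enumerate(product(*flags.values()), start=1)
}
# Fact rows then carry a single junk_key instead of several flag columns.
print(junk_dim[1])  # {'met_target': True, 'reviewed': True}
```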

What are fundamental stages of Data Warehousing?


Answer The stages of a data warehouse help to find and understand how the data in the warehouse changes. At the initial stage of data warehousing, transaction data is merely copied to another server; here, even if the copied data is processed for reporting, the source data's performance won't be affected. In the next, evolving stage, the data in the warehouse is updated regularly from the source data. At the real-time data warehouse stage, data in the warehouse is updated for every transaction performed on the source data (e.g. booking a ticket). When the warehouse is at the integrated stage, it not only updates data as and when a transaction is performed but also generates transactions which are passed back to the online source data.

What is Data Scheme?


A data scheme is a diagrammatic representation that illustrates the data structures and the relationships of the data to each other in the relational database within the data warehouse. The data structures have their names defined along with their data types. Data schemes are handy guides for database and data warehouse implementation. A data scheme may or may not represent the real layout of the database; it is just a structural representation of the physical database. Data schemes are useful in troubleshooting databases.

What is Bit Mapped Index?


Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations. They work well with data of lower cardinality, i.e. data that takes fewer distinct values. Bitmap indexes are useful in data warehousing applications and have a significant space and performance advantage over other structures for such data. Tables that see few insert or update operations are good candidates. The advantages of bitmap indexes are:

- They have a highly compressed structure, making them fast to read.
- Their structure makes it possible for the system to combine multiple indexes together, so that the underlying table can be accessed faster.

The disadvantage of bitmap indexes is:

- The overhead of maintaining them is enormous.
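
A minimal sketch of the mechanism in Python, using integers as bit arrays over an invented low-cardinality color column: each distinct value gets a bitmap with one bit per row, and a query becomes a bitwise operation.

```python
# Minimal bitmap index over a low-cardinality column.
rows = ["red", "blue", "red", "green", "blue", "red"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << i      # set the bit for this row

# WHERE color = 'red' OR color = 'blue'  ->  bitwise OR of two bitmaps
mask = bitmaps["red"] | bitmaps["blue"]
matching_rows = [i for i in range(len(rows)) if mask >> i & 1]
print(matching_rows)  # [0, 1, 2, 4, 5]
```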

What is Bi-directional Extract?


In hierarchical, networked or relational databases, data can be extracted, cleansed and transferred in two directions. The ability of a system to do this is referred to as bidirectional extraction. This functionality is extremely useful in data warehousing projects.

Data extraction: The source systems the data is extracted from vary in many ways, from their structures and file formats to the department and business segment they belong to. Common source formats include flat files and relational databases, as well as non-relational database structures such as IMS, VSAM or ISAM.

Data transformation: The extracted data may undergo transformation, with the possible addition of metadata, before being exported to another large storage area. In the transformation phase, various functions related to business needs, requirements, rules and policies are applied. During this process some values are translated and encoded, and care is taken to avoid redundant data.

Data cleansing: In data cleansing, incorrect or corrupted data is scrutinized and the inaccuracies are removed, ensuring data consistency. It involves activities such as removing typographical errors and inconsistencies, and comparing and validating data entries against a list of entities.

Data loading: This is the last step of a bidirectional extract. The cleansed, transformed source data is loaded into the data warehouse.

Advantages: updates and data loading become very fast due to bidirectional extraction, and as timely updates arrive in a useful pattern, companies can make good use of the data to launch new products and formulate market strategies.

Disadvantages: more investment in advanced, faster IT infrastructure; failure to build in fault tolerance may mean unexpected stoppage of operations when the system breaks; and a skilled data administrator needs to be hired to manage the complex process.

What is Data Collection Frequency?


Data collection frequency is the rate at which data is collected. However, the data is not just collected and stored; it goes through various stages of processing, such as extraction from various sources, cleansing, transformation and then storage in useful patterns. It is important to keep a record of the rate at which data is collected, for several reasons:

- Companies can use these records to keep track of the transactions that have occurred, and can tell from them whether any invalid transactions ever occurred.
- In scenarios where the market changes rapidly, companies need very frequently updated data so they can assess the state of the market and invest appropriately.
- Some companies keep launching new products and keep updating their records so that their customers can see them, which in turn increases their business.
- When data warehouses face technical problems, the logs, together with the data collection frequency, can be used to determine the time and cause of the problem.

Due to real-time data collection, database managers and data warehouse specialists can make more room for recording data collection frequency.

What is Data Cardinality?


Cardinality is the term used in database relations to denote the occurrence of data on either side of a relation. There are three basic levels of data cardinality:

- High data cardinality: the values of a data column are very uncommon, e.g. email ids and user names.
- Normal data cardinality: the values of a data column are somewhat uncommon but not unique, e.g. a LAST_NAME column (there may be several entries with the same last name).
- Low data cardinality: the values of a data column are very common, e.g. 0/1 flag statuses.

Determining data cardinality is a substantial aspect of data modeling, where it is used to determine relationships. Types of relationship cardinalities:

- The link cardinality: 0:0 relationship
- The sub-type cardinality: 1:0 relationship
- The physical segment cardinality: 1:1 relationship
- The possession cardinality: 0:M relationship
- The child cardinality: 1:M mandatory relationship
- The characteristic cardinality: 0:M relationship
- The paradox cardinality: 1:M relationship
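
As a rough illustration of the high/normal/low distinction, the sketch below classifies a column by its ratio of distinct values to total rows; the thresholds are arbitrary and chosen only for the example.

```python
# Rough cardinality check for a column: ratio of distinct values to rows.
def cardinality(column):
    ratio = len(set(column)) / len(column)
    if ratio > 0.9:
        return "high"      # e.g. email ids, user names
    if ratio > 0.1:
        return "normal"    # e.g. last names
    return "low"           # e.g. 0/1 flags

print(cardinality(["a@x.com", "b@x.com", "c@x.com"]))  # high
print(cardinality([0, 1] * 10))                        # low
```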

What is Chained Data Replication?


In chained data replication, non-official data sets distributed among many disks provide load balancing among the servers within the data warehouse. Blocks of data are spread across clusters, and each cluster can contain a complete set of replicated data. Every data block in every cluster is a unique permutation of the data in the other clusters. When a disk fails, all the calls made to the data on that disk are redirected to the other disks where the data has been replicated. At times, replicas and disks are added online without having to move the data around in the existing copy or affect the arm movement of the disk. For load balancing, chained data replication lets multiple servers within the data warehouse share data request processing, since the data already has replicas on each server's disk.
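
A toy sketch of the redirect-on-failure behavior in Python; the block and disk names are invented, and this only illustrates the general idea of serving reads from surviving replicas.

```python
# Each block lives on several disks, so a failed disk's requests are
# redirected to a disk holding a replica of the same block.
block_locations = {
    "block_a": ["disk1", "disk2"],
    "block_b": ["disk2", "disk3"],
    "block_c": ["disk3", "disk1"],
}
failed_disks = {"disk2"}

def read(block):
    for disk in block_locations[block]:
        if disk not in failed_disks:
            return f"{block} served from {disk}"
    raise IOError(f"no live replica for {block}")

print(read("block_b"))  # block_b served from disk3
```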

What are Critical Success Factors?


Critical success factors are the key areas of activity in which favorable results are necessary for a company to reach its goal. There are four basic types of CSFs:

- Industry CSFs
- Strategy CSFs
- Environmental CSFs
- Temporal CSFs

Examples of CSFs include money, your future, customer satisfaction, quality, product or service development, intellectual capital, strategic relationships, employee attraction and retention, and sustainability. The advantages of identifying CSFs are that they are simple to understand; they help focus attention on major concerns; they are easy to communicate to coworkers; they are easy to monitor; and they can be used in concert with strategic planning methodologies.


What is Virtual Data Warehousing?


A virtual data warehouse provides a collective view of the completed data. It holds no historical data and can be considered a logical data model containing metadata.


What is active data warehousing?


An active data warehouse represents the single state of the business. Active data warehousing takes into account the analytic perspectives of customers and suppliers, and it helps deliver updated data through reports.


What is data modeling and data mining? What is this used for?
Data modeling is a technique used to define and analyze the requirements of the data that supports an organization's business processes. In simple terms, it is used to analyze data objects in order to identify the relationships among those data objects in the business. Data mining is a technique used to analyze datasets and derive useful insights/information. It is mainly used in retail, consumer goods, telecommunications and financial organizations that have a strong consumer orientation, in order to determine the impact on sales, customer satisfaction and profitability. Data mining is very helpful in determining the relationships among different business attributes.

Difference between ER Modeling and Dimensional Modeling


The entity-relationship model is a method used to represent entities/objects and the logical flow between them graphically, which in turn defines a database. It has both a logical and a physical model, and it is good for reporting and point queries. The dimensional model is a method in which the data is stored in two types of tables, namely fact tables and dimension tables. It has only a physical model, and it is good for ad hoc query analysis.

What is the difference between data warehousing and business intelligence?


Data warehousing relates to all aspects of data management, from the development and implementation to the operation of the data sets. It is a backup of all data relevant to the business context, i.e. a way of storing data. Business intelligence is used to analyze the data from a business point of view in order to measure an organization's success. Factors like sales, profitability, marketing campaign effectiveness, market share, operational efficiency, etc. are analyzed using business intelligence tools like Cognos, Informatica, SAS, etc.

Describe dimensional Modeling.


The dimensional model is a method in which the data is stored in two types of tables, namely fact tables and dimension tables. The fact table comprises the information used to measure business success, and the dimension table comprises the information on which that success is measured. Dimensional modeling is mainly used by data warehouse designers to build data warehouses. It represents the data in a standard, sequential manner that allows for high-performance access.

What is snapshot with reference to data warehouse?


A snapshot refers to a complete visualization of the data at the time of extraction. It occupies less space and can be used to back up and restore data quickly.
