
Expertise and role of Data Scientist and Data Engineer

Definition and goal of Data Engineering


Transaction concept and main issues
Four schedules on transaction to transfer EUR 50 from A to B. You should be
able to explain the process. Choose the best schedules and explain your
reason.
OLTP and OLAP
ACID Requirements
Distributed Database. Definition and Advantages
Cube diagram in multidimensional data (sales volume as a function of product, month
and region). You should be able to explain multidimensional data.
Possible dimensions for data retrieval.

Expertise and role of Data Scientist and Data Engineer

Data scientists spend their time and brainpower applying data science and
analytic results to critical business issues, helping an organization turn data
into information, information into knowledge and insights, and valuable,
actionable insights into better decision making and game-changing
strategies.

Data engineers are the designers, builders and managers of the information
or "big data" infrastructure. They develop the architecture that helps analyse
and process data in the way the organization needs it. And they make sure
those systems are performing smoothly.

Data Engineering Goal : The goal is to use the available data or generate
more data, and to thereby understand the process being investigated.

Transaction Concept : A transaction is a unit of program execution that
accesses and possibly updates various data items.

Two main issues to deal with:


1. Failures of various kinds, such as hardware failures and system crashes
2. Concurrent execution of multiple transactions
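
The second issue can be made concrete with a small simulation. The sketch below is only an illustration under assumptions: the notes mention a transaction that transfers EUR 50 from A to B, while the second transaction (T2, moving 10% of A to B), the starting balances, and the function names are hypothetical additions. It compares a serial schedule with one possible badly interleaved schedule.

```python
def serial_schedule(a, b):
    """T1 runs to completion, then T2 runs (a serial schedule)."""
    # T1: transfer 50 from A to B
    a = a - 50
    b = b + 50
    # T2 (hypothetical): transfer 10% of A to B
    temp = a // 10
    a = a - temp
    b = b + temp
    return a, b


def interleaved_schedule(a, b):
    """T1 and T2 interleave so that each overwrites the other's work."""
    db = {"A": a, "B": b}
    t1_a = db["A"] - 50        # T1 reads A and computes its new value locally
    t2_a = db["A"]             # T2 reads the old A
    temp = t2_a // 10
    db["A"] = t2_a - temp      # T2 writes A
    t2_b = db["B"]             # T2 reads B
    db["A"] = t1_a             # T1 writes A, overwriting T2's update
    db["B"] = db["B"] + 50     # T1 reads and writes B
    db["B"] = t2_b + temp      # T2 writes B from its stale read
    return db["A"], db["B"]


print(serial_schedule(1000, 2000))       # (855, 2145): A + B is still 3000
print(interleaved_schedule(1000, 2000))  # (950, 2100): A + B became 3050
```

In the serial schedule the sum A + B is preserved. In the interleaved one, EUR 50 appears from nowhere, which is the kind of inconsistency that concurrency control is meant to prevent.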

OLTP vs OLAP

OLTP (on-line transaction processing)

Major task of traditional relational DBMS

Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making
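
To contrast the two workloads, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and sample rows are assumptions made for illustration (loosely following the sales volume by product, month and region example in the topic list). A real OLAP system would normally run on a separate data warehouse rather than on the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, month TEXT, region TEXT, volume INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("pen", "Jan", "north", 10),
        ("pen", "Jan", "south", 20),
        ("ink", "Jan", "north", 7),
        ("ink", "Feb", "north", 3),
    ],
)
conn.commit()

# OLTP-style work: a short transaction that records one day-to-day operation.
with conn:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?, ?)", ("pen", "Feb", "north", 5)
    )

# OLAP-style work: an analytical query that aggregates many rows
# along chosen dimensions (here product and region, summing over month).
olap = conn.execute(
    "SELECT product, region, SUM(volume) FROM sales "
    "GROUP BY product, region ORDER BY product, region"
).fetchall()

print(olap)  # [('ink', 'north', 10), ('pen', 'north', 15), ('pen', 'south', 20)]
```

The OLTP statement touches a single row and must be fast and safe. The OLAP query reads the whole table and summarizes it for decision making.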

Visual on OLAP

ACID Requirements
1. Atomicity. Either all operations of the transaction are properly
reflected in the database or none are.
2. Consistency. Execution of a (single) transaction preserves the
consistency of the database.
3. Isolation. Although multiple transactions may execute concurrently,
each transaction must be unaware of other concurrently executing
transactions. Intermediate transaction results must be hidden from
other concurrently executed transactions.
4. Durability. After a transaction completes successfully, the changes it
has made to the database persist, even if there are system failures.
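
Atomicity and consistency can be sketched with Python's built-in sqlite3 module. This is an illustrative example under assumptions: the account table, the balances, and the transfer helper are hypothetical, and the rule "no balance may go negative" stands in for whatever consistency constraint the real database would have.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Assumed consistency rule: no account may end up negative.
            (lowest,) = conn.execute("SELECT MIN(balance) FROM account").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict, e.g. {'A': 100, 'B': 0}."""
    return dict(conn.execute("SELECT name, balance FROM account"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

ok1 = transfer(conn, "A", "B", 50)  # succeeds: both updates are committed
ok2 = transfer(conn, "A", "B", 80)  # fails: both updates are rolled back
print(ok1, ok2, balances(conn))     # True False {'A': 50, 'B': 50}
```

The failed transfer leaves no trace: neither the debit nor the credit survives, so the database moves from one consistent state to another or does not move at all.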

Distributed Database. Definition and Advantages


A distributed database (DDB) is a collection of multiple logically related
databases distributed over a computer network. A distributed
database management system (DDBMS) is a software system that manages a
distributed database while making the distribution transparent to the
user.

Advantages
Management of distributed data with different levels of
transparency:
For example, the EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented
horizontally and stored with possible replication.
Users do not have to worry about operational details of the network.
Replication transparency:

Copies of a data item can be stored at multiple sites, as shown
in the diagram above.
This is done to minimize access time to the required data.

Fragmentation transparency:

A relation can be fragmented horizontally (a subset of the tuples
of the relation) or vertically (a subset of the columns of the
relation).

Increased reliability and availability:


Improved performance:

A distributed DBMS fragments the database to keep data
closer to where it is needed most.
This reduces data management (access and modification)
time significantly.

Easier expansion (scalability):

New nodes (computers) can be added at any time
without changing the entire configuration.
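
Horizontal and vertical fragmentation, as described above, can be sketched in plain Python. This is a minimal illustration under assumptions: the notes name the EMPLOYEE table but not its columns, so the columns, the sample rows, and the helper functions here are hypothetical. A real DDBMS would place each fragment on a different site and hide that from the user.

```python
# Hypothetical EMPLOYEE relation; "ssn" is assumed to be the primary key.
employee = [
    {"ssn": 1, "name": "Ana", "dept": 5, "salary": 3000},
    {"ssn": 2, "name": "Bo", "dept": 4, "salary": 3500},
    {"ssn": 3, "name": "Cy", "dept": 5, "salary": 2800},
]


def horizontal_fragment(relation, predicate):
    """Keep a subset of the tuples (rows) of the relation."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, columns, key="ssn"):
    """Keep a subset of the columns, always including the key."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in relation]


# Horizontal fragments: department 5 at one site, everyone else at another.
dept5 = horizontal_fragment(employee, lambda r: r["dept"] == 5)
others = horizontal_fragment(employee, lambda r: r["dept"] != 5)

# Vertical fragments: general data at one site, salary data at another.
public = vertical_fragment(employee, ["name", "dept"])
private = vertical_fragment(employee, ["salary"])

# Reconstruction: union of horizontal fragments, join of vertical ones on the key.
from_horizontal = sorted(dept5 + others, key=lambda r: r["ssn"])
private_by_key = {r["ssn"]: r for r in private}
from_vertical = [{**p, **private_by_key[p["ssn"]]} for p in public]

print(from_horizontal == employee, from_vertical == employee)  # True True
```

Keeping the key in every vertical fragment is what makes the join possible. Fragmentation transparency means the user queries EMPLOYEE as a whole and the system performs this reconstruction behind the scenes.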

1) In a distributed database, data can be stored in different systems like
personal computers, servers, mainframes, etc.
2) A user does not know where the data is physically located. The database
presents the data to the user as if it were located locally.
3) Database can be accessed over different networks.
4) Data can be joined and updated from different tables which are located on
different machines.
5) Even if a system fails the integrity of the distributed database is
maintained.
6) A distributed database is secure.
