Wednesday, June 29, 2011

Datawarehouse Basics Part 2

In Transaction Server, data is prone to be changed at any given point of time. And therefore, it is not advisable to do both analysis and transaction on the same server.


Data Warehouse and Transactional Server Comparison
Data Warehouse
Operational/Transactional
Subject oriented
Application oriented
Summarized, refined & detailed
Detailed
Represents value over time
Accurate as of moment
Supports managerial needs
Supports day-to-day needs
Read only data
Can be updated
Batch processing
Real time transactions
Completely different life cycle25
Software Development Life Cycle
Analysis driven
Transaction driven
Dimensional model
Entity Relational Diagram
Large amount of data
Small amount of data
Relaxed availability
High availability
Flexible structure
Static structure


ETL (Extraction, Transformation and Loading) are together used as a process through which we extract data from single or multiple data sources, then transform in one format and load at a common place i.e. Data Warehouse.
ETL tool has API of all Databases supported by the ETL tool.


 

 


At the DATA STAGING AREA- Data is in unchanged format.

ETL TOOLS:

  • Informatica
  • Data Stage (IBM Product)
  • AB Initio
  • OWB (Oracle Warehouse Builder)
 
Benefits of Multi-Tier Architecture:


A few points under which we can see obvious benefits of Multi Tier Architecture over 2 Tier Architecture
  • Database Server and Processing is kept separately and thus it is easier to serve more requests
  • Authorization and Authentication by adding another security layer
  • Performance, Stability and Transparency.

DATAWAREHOUSE TEAM IS DIVIDED INTO FOLLOWING TEAMS

 

PRE-REQUISITE TO EXTRACTION
·         DDL (Data Definition Language), DML (Data Manipulation Language), DCL (Data Control Language) should be checked on OLTP Database. If there are any errors – they should be rectified.
Earlier to SAP BW, ETL in SAP was done through separate tools for ETL. Extraction (Data Stage), Transformation (Erwin), Loading (Cognos). 

Drawbacks of Separate Extraction, Transformation, Loading:

  • Not Cost Effective
  • Often it does not give desired result because of Technical constraint between different products.

SAP BW is complete package (DW tool under SAP) that contains Extraction, Transformation, and Loading.

ETL is set of processes through which we are extracting data from different and separate data sources, processing and keeping them after processing at a place called Data warehousing.

OLTP is based on E.R. Diagram. This is based on joins which are time consuming while processing.
In BW, we have FACT and DIMENSIONS which have Parent Child Relationship.

Landscape of SAP BW

 


Amount of data doubles every 18months, which affects response time and the sheer ability to comprehend its content.

There are many technology related reasons for the existence of DW.
  • DW is designed to address the incompatibility of informational and operational transaction systems. (Since the requirement of them are incompatible).
 
MODELLING
Business Intelligence:
Business (in layman terms) is any transaction between two difference entities. If the transaction happens in the same entity - it is not considered as business.
Intelligence- extra set of features needed for aiding in business decision making.


FACTS AND DIMENSIONS
FACT represents numerical data in a business transaction which is cumulative. e.g.: Employee salary, product price.
FACT TABLE: Where these numerical data are stored also known as Measures or Metrics.
Work of FACT TABLE in BI: It stores all business related Metrics, Measures, numbers in their own business area. If a data is not in FACT table, it is not possible to analyse that data.
DIMENSION: Descriptive Information which has definite value. It describes the FACT. Every FACT must contain Dimension, also it must contain UNIT. Unit can be any metric – most frequently used UNIT is kg or INR (Indian units for weight and currency).
There will be only one FACT Table in an SAP BW lifecycle. In BW we can make a maximum of 16 Dimensions, of which 3 are system defined and 13 can be user defined.
System Defined Dimensions:
Time: Client requirement decides the time dimension – year, week, day or hour.
Unit: most frequently used UNIT is kg or INR (Indian units for weight and currency
Data Packet: size of data to be stored.
example:
Employee Salary 50,000 per month.
In above statement: Employee is Dimension
Salary is Fact
50,000 is Fact Value
Month is Time
BI Content (BC) or Business Content:
Sample Analysis Reports of all business modules’  are already pre-manufactured and stored in BC.
TYPES OF DATA IN SAP BI:


  • Master Data (does not often change – it plays part of reference from where we can refer the Facts) e.g. EMP ID, Bank Account, Surname.
  • Transaction Data : DWH BI analysis mainly happens for this type of data. This type of data changes frequently. e.g. Salary, tax, amount, price
 
SCHEMA
 
It is an old concept. Every Database consists of number of schemas. It is collection of tables. Table is collection of rows and columns. Data is in the column.

With respect to Data warehouse, Schema is presentation of data behind the BI Server.
In SAP BI, two types of Schemas are used.


  • Star Schema (http://www.learndatamodeling.com/star.htm) 
  • Extended Star Schema or Multi Dimension Modeling.
Star Schema

 


Dimension IDs are used to select unique records.

Fact Table contains: respective Dimension IDs and FACT Data.
In Star Schema, Master Data is embedded with the Dimensions.


Drawback of Star Schema:
SAP AG made a rule that Master Data will be executed before Transaction Data.

Extended Star Schema 



In star schema the central fact table  is surrounded by  master data tables(=dimension tables).since master data lies within the cube. we cannot reuse the master data information for other cubes.where as in Extended star schema the central fact table is surrounded by dimension tables (!=master data tables). Master data tables lies outside the info cube. So,we can share the master data information for other cubes.
In star schema the fact table holds Alpha-numeric information.so,it degrades the database performance and query performance.where as in Extended star schema using SID tables we can see all the information in the fact table as numeric.so,it improves the database performance and query performance.

In star schema the fact table is directly connected to the master data tables and it can be connected to a maximum of 16 master data tables.so,we cannot analyze the data more than 16 angles.where as in Extended star schema the fact table is connect to maximum 16 dimension tables and each dimension table can be connected to a 248 SID  tables. so we can analyze the data in 16*248 angles  
 
Star schema
  • It can be analyzed only in 16 dimensions.
  • Master data is not reusable.
  • Performance is degraded because of the alpha-numeric values in the fact table.

 Extended  Star Schema

  • It can be analyzed in 16*248 dimensions.
  • Master data can be reusable as it is not inside the cube.
  • Better performance as there r no alphanumeric  values in the fact table.
 Earlier, request comes --> goes to Mater Data -->Transaction Data
 
Now with Extended Star Schema, goto Dimension and connect to the respective Master Data using the SID Table.
After Primary Key (Dim ID), SID is unique ID which is system generate
Operational Data: Stored in DW or cube, data which is part of operations and not legacy data.
If there is data which is going to be used after 5 years, we can keep it in separate DW.

MODELING
Basically involves designing the Schema, Framework and Methodology.
Before designing, we should know definition of all data fields (only fields, not their values). This is called Extract Structure. Full Database with data value is needed in Extraction.


Modeling Terms
The need and definition of Modeling: We create the schema.
If not designed properly, there will be problems like:
  • Integration Problem.
  • Difficulty in fetching data.
If designed properly, it helps in Business Analysis:

  • Faster
  • Easier
  • Accurate
 
Info objects: It is the smallest entity in BI and Infocube.
They are of two types:

  • Characteristics (e.g. EMP ID, Name, Bank Account Number) These are descriptions or Descriptive Items
  • Key Figure (e.g. Salary, Price, Amount)

Point to note is FACTs and Key Figures are not same.
Fact is part of  schema where Key figure is not direct property of Schema.
Character is going to store in Dimensions.
Since Server understands Dimensions and not Character. And all the data has to go through schema(basic entity), we  convert key figure into FACT and characteristics to Dimension. It is not a direct copy – there is conversion involved.
Infoobjects are kept in folders called Info Area. They cannot be kept independently. It is common place where we are going to keep Info Objects and Info Cube.
Whenever we search an InfoCube, we look for its Info Area.

Info Object Catalogue – It is used to keep separate type of data in different container/path/segment.
There will be 2 info-object catalogues always (because we have only two types of data – charecteristics and key figure)

When doing Modeling, first step is to create a source system. It is the system from where we are going to fetch and process data, it can be R/3, flat file, oracle, Sybase, XML, Another B/W System etc.
Source System should have IP settings and drivers.

Most often, the data is extracted and given to us since it is a lengthy and costly process.

 File System / Oracle / SAP  --> Extract Structure   --> (Replication) (Data in BW System Format) Transfer Structure  --> PSA (Transfer Rule) Communication Structure   --> (Update Rule) Info Object (Info Provider)

 
Transfer and Communication Structure are together called Info Source. They are already system created, we don’t create it. We only keep our structure in them.
Transfer Rule is collection of instructions which indicates transfer of structure from 1 stage to another stage.
PSA is Persistent Staging Area which is Temporary Backup Storage Area.
First Infocube is created then the Update Rule is created.
From Transfer Structure to Infocube creation – the work is done Admin Workbench (RSA1 or RSA1OLD).
Infocube is also known as Data Target/ Info Provider.
Info Provider are of two types: Physical and Virtual. It is a place where all the data is stored finally for analysis purpose.

Info Provider

  • Physical
  1. Info Cubes
  2. ODS / DSO
  3. Info Objects
  • Virtual
  1. Multi Provider (It is a collection of Info cube and DSO) which can be combination of two InfoCubes, two DSOs or one DSO and one Info Cube)
Infocube is architectured on Star or Extended Star Schema. This implies that there are FACT and Dimension tables in Info Cube.
Maximum 248 Characteristics and 233 Key Figures
Standard Infocube will always follow Star or Extended Star Schema rules. It is a storage area where data is arranged in the form of Schema.
Real Time Info Cube is used by SEM. It comes under new dimensional tools. They do no follow Star Schema concepts and they are not used in normal BI cycle.
Infocubes and ODS are also called Data Targets – they are more reliable that ODS.