I was very skeptical about online Universities, but this lecture series has changed my mind. Here I am waiting for my car oil change and I can refresh my understanding of data warehousing and data mining by watching a video on YouTube. Amazing!
If you get past the accent and the stuttering, this is a very good quality 300 series post Bachelor video course by Dr. S. Srinath, India Institute of Technology (IIT), Madras. IIT Madras has many other courses, some as web pages, many in video format - all free. Why is not this being done in USA and Europe?
Watching a video does not replace reading books or attending actual University. But I somehow enjoy listening to a person describing the subject, especially for the material that I already know to some extent. Listening gives me time to reflect, synthesize and apply ideas to the problems in hand. The notes below are my own, made while watching. Enjoy!
Lectures
Lecture 30 - Introduction to Data Warehousing and OLAP, Part I [57:49]
Operational efficiency (tactical) vs. strategic efficiency of databases: same underlying data, different nature of CRUD operations.
Data warehouse OLAP is a database system that is based on a join of different OLTP data sources (optimized for tactical operations), but optimized for the strategic decision making.
Transactional data (OLTP): requires frequent updates, has local relevance to a specific operation, accesses by point queries (queries that return one specific tuple: transaction, date, account, ...), focus on reduced update time.
Warehouse data (OLAP): does not change the dataset, updated on predictable intervals, has huge amount of aggregation, global relevance, focus on reduced query time (O in OLAP means online, real time).
Data Warehouse: infrastructure to manage historical data, designed to support OLAP queries, trending, what-if and reporting.
Data Warehouse consists of several datamarts, each covering distinct OLTP entities. Data Mart is a collection of data about specific segment of the entire data warehouse. Data mart can support the same basic OLAP query and reporting as main data warehouse, but is focused on the specific segment of data.
Data cleaning and integration for OLAP in a two directional process, which produces data back flash into the OLTP system.
Data cleaning: Removes dirt, removes duplicates, enforces standardization & semantics. Sources of data dirt: language/locale/encoding, abbreviations, semantic equivalence (Chennai == Madras), units of measure, missing values, use of 0/1 vs True/False vs T/F, address de-normalization, inconsistent values vs inconsistent relations, ... There is no single cleaner can be uniformly used and it is difficult to automate the entire process. Lots of ad hock knowledge needs to be provided. Hard to validate and prone to GIGO (garbage in, garbage out) syndrome. Complexity increases with the number of data sources and over time.
Data cleaning steps are: analysis (finding data and metadata), definition of transformation rules, rule verification, transformation and backflow (replace original data with normalized representation). Data cleaning techniques. Hash-Merge duplicates elimination algorithm. Sorted neighborhood technique for misspelling detection and duplicate elimination; Graph-Based transitive closure algorithm to reduce the number of passes.
Integration: data and schema integration. Wrapper maps schema into metaschema and Extractor phisically esxtracts the data. Mediator maps individual metaschema into uber schema and Constructor joins the data together.
Lecture 31 - Introduction to Data Warehousing and OLAP, Part II [58:11]
Data warehouse internals: involves aggregation, clustering, trending, multi-dimensional projections (how each dimension contributes to a total).
Typical warehouse has a hyper cube (data cube, which stores facts for dimensions) and materialized views (which store projections and aggregates). We do not store materialized views in the normal databases, but it is ok for OLAP as data is read-only. Hyper cubes support additional operators compared to classical SQL: pivoting (rotation of the hyper cube), slicing-dicing (extracting sub cubes), roll-up (aggregating across dimension) and drill-down (reveal details of an aggregated dimension).
Hyper cube implementations: ROLAP (multi-dimensional schema and queries are mapped into the relations model and is managed by a relational database; done in True Relational OLAP by Microstrategy) and MOLAP (native multi-dimensional model for storage and querying the data; done in EssBase).
Start schema architecture for OLAP cube. Snowflake schema is a star schema where dimension table has additional sub-dimensions in separate tables. Constellation model has multiple star schemas joined together, so there are multiple fact tables, but the dimension tables are shared.
Time dimension is special: found in most applications, has special meaning to the rollup (based on multiple calendars: tax year, academic or fiscal calendar), has interpretation for special events (release date, holiday, weekend), ordered dimension and order of traversal is very important. Revision history table.
Warehouse design starts with the enterprise model, then we build logical and physical model. Good warehouse design resembles enterprise model (not the operational model). Notion for ALL, to denote all dimention elements.
Classes of dimensions: categorical ({dog, cat}, {true, false}), ordinal ({1,2,3}, {Jan, Feb, ...}), sparse (small number of unique values (T/F, Male/Female)), dense (large number of unique values). Types of indexes and storage: tabular (with multi-level sorting), space filling curve, mutli-dimensional indexes (ordered index on multiple attributes, partitioned hashing, grid files, bitmap indexes, KD trees, Quad trees, R trees for regions, signature trees and extensible hashing as indexes into categorical data).
Lecture 32 - Case Study : MYSQL [57:55]
Lecture 33 - Case Study ORACLE and Microsoft Access [57:52]
Lecture 34 - Data Mining and Knowledge Discovery (Part 1) [54:46]
Value of data mining: extrapolation, wealth generation, security, strategic decision making, ...
Apriori algorithm: frequent item-sets and association rules mining.
Lecture 35 - Data Mining and Knowledge Discovery (Part 2) [57:59]
Classification (based on element difference) and clustering (based on element similarity)
Classification techniques: Hunt's decision tree identification
Measures for similarity between elements projected into n-dimensional space
Clustering algorithms: nearest member (unknown number of clusters), iterative partitioning (number of clusters and their initial centers are known, we only reassigning members)
Sequence: finite ordered list of item-sets.
Mining sequence data: apriori algoritm for sequences
Language inference: finite state machine that produced the item-set sequence, general vs. specific inference with shortest-run generalization,
Mining streaming data: infinite sequence, no storage, real-time analysis.
Running and standing queries; unlike normal databases where data is standing, in streaming data query is standing and dataset is moving.
Running mean, running variance.
Gamma-consistency.
Lecture 36 - Object Oriented Databases [57:49]
Lecture 37 - Object Oriented Databases II [56:39]
Lecture 38 - XML - Introductory Concepts [57:51]
Lecture 39 - XML Advanced Concepts [57:40]
Lecture 40 - XML Databases [58:12]
Lecture 41 - Case Study - Part One Database Design [53:43]
Lecture 42 - Case Study - Part 2 Database Design [58:36]
No comments yet
Leave a comment