Charles Engelke's Blog

July 9, 2003

OSCON Day 2: Building Data Warehouses with MySQL

Filed under: Ozette Brown — Charles Engelke @ 4:37 pm

Ok, this is an excellent view on the talk about Building Data
Warehouses with MySQL. Again, this is from Jeremy Zawodny.

In my opinion this was one of the best tutorials. The presenter
certainly knew his stuff and actually came from the MS SQL world.
His examples were good and meaningful. I would have liked to see actual code and how the code was applied.
The speaker eluded to having something to that effect at next years conference, if he’s envited to come back.

So, here it is; The talk starts with a story about why MySQL is the most
cost-effective data warehousing solution available. Compared to
Microsoft SQL Server (the cheapest closed-source solution), MySQL is a
big savings.

Warehouses vs. Marts

Data Warehouse vs. Data Mart. Plan for a warehouse, but build a
mart. There are a lot of things you might need, but there’s
no need to build all of it until you need it. Data marts lock
together to become a warehouse.

Book recommendation: The Data Warehouse Toolkit.

Data warehouse: focused on business processes, using standardized
granular facts. It’s a collection of standardized marts.

Data mart: focused on one narrow business, includes lightly
summarized data. It’s a component of a data warehouse.

Metadata

Metadata capture. Need to get terms and definitions correct and
agreed upon in advance. Policies and company practices factor into
the decision making. How does your business really quantify things?
You need to ask users what their business needs are. Sometimes this
involves going quite high in the organization.

Dimensions

Every time you hear “by”, think about dimensions. They’re wide and
flat tables (compared to fact tables). Lots of redundant data. Make
sure the measurement units are the same. This is hard in
multi-national companies. Which day (time zones)? Sizes and volumes
of items, etc. How will they be formatted? What enumerations
(possible values) will exist? M/F, 0/1, Y/N,
Small/Medium/Large/X-Large.

Calendar dimension example: date_id, date_value, description,
month, day, year, quarter, is_weekday, day_of_week, fiscal_month,
fiscal_quarter, astrological_sign, etc. Lots of duplication (imagine
one record for each day of the year). You could add weather info,
abnormal business closes, etc.

Fact Tables

All about keys. Many keys, few facts. Very deep (tall) and
narrow. It’s best not to store calculated values because you may need
to recalculate someday (margin, for example). You can calculate on
the fly (in the query) or in code that’s pulling the data. Use facts
that the business users understand.

Don’t use anything meaningful for keys. Never. Ever. Meaningful
things change when companies merge, change, etc. Just invent numbers
that are meaningful to the database only.

Star Schema

There’s a central fact, many dimensions (the arms), and no other
tables. Don’t “snowflake” or over-normalize by hanging new tables off
of the dimension tables.

When building the schema, decide on the grain. What’s the
smallest bit of data anyone will ask for? One day? One hour? One
week?

Scenario

Sales from a web-based meal order system (Vmeals.com). Many
clients/customers, caterer/restaurants, delivery locations, etc. The
database is relatively small now (300MB or so).

Lots of data to track (on the white board). The most important bit
will likely be order items. Starting simple, our facts are orders and
customer service metrics. Dimensions are calendar, customers,
products, promotions, and so on. Create a bus design.

We’ll use sales for this example.

Determine Grain

Pick the lowest possible grain that makes sense. For this
example, it’s orders or more specifically ordered items. Order will
be the second fact table. Dimensions: calendar (order date, deliver
date), product (menu items), customer, delivery location, provider
(vendor), licensee (market).

Finding the Data

Need to pull data from the on-line system to populate the
warehouse. Some may came from other places too: market information
system (MS Access), promotion engine, etc. Larger companies will have
many more.

The order fact table will be primarily built from line items from
orders. Think about additive, non-additive, and semi-additive values.
You generally want additive data. Store the data needed to compute
things, not the resultant values. Snapshot values (daily bank
balance) not additive.

Special Dimension Types

Degenerate dimensions have no corresponding dimension table.
Invoice or oder number are common examples. They’re only used in
groups or rollups, typically.

Role-playing dimensions are used over and over. Dates are good
example: payment date, order date, delivery date. In some systems
you’d use views for this. They’d all be views over the underlying
calendar table. You can use MySQL Merge tables to work around the
lack of views. Or you can create several one-table merge tables.

Slowly changing dimensions. Fixed data: just update. Changed
data: add a new row (like a new address). Fundamental schema change:
add a new column, keep old column data around.

Getting the Data

ETL process: extract, transform, load. Go to the source of the
data. Extraction can be tricky if you have lots of data and little
time (24×7 system). If the source data has date stamps, you can
perform incremental dumps. With some systems you can use a row
checksum and a computed index. Transformation is about making the
data match the warehouse’s metadata standards. Perl to the rescue!
(Or maybe using an intermediate database.)

Microsoft DTS is a good option in the Windows world. It comes with
SQL Server and is scripted via a scripting host language. Lots of
expensive commercial tools to do this too.

When dumping from other systems, watch out for blobs. They
probably don’t belong anyway. Make sure that stuff comes out as
quoted ASCII rather than some internal representative that MySQL won’t
grok.

To load, MySQL’s LOAD DATA INFILE works quite well for this. It’s
fast and flexible.

Demo: Dump via MSSQL BCP and load into MySQL using LOAD DATA.

Having a staging environment is important. Rather than loading all
the data into the warehouse, you can do a lot of intermediate work on
a different server before loading into the “real” warehouse. You can
use this staging area for run validation checks, manage any changes
needed (SQL, Perl, custom apps, DTS or other ETL tools), and perform
multi-step extractions.

A frehness date helps users understand when the latest data isn’t
as new as they might think.

Reporting Tools

All Java, open source: Jasper Reports, jFreeReport/Chart,
DataViz.

OLAP Tools

Open Source: Mondrian (Java), JPivot (Java/JSP), BEE (perl).

Advertisements

Blog at WordPress.com.

%d bloggers like this: