Discussion:
Data Warehousing 2; was: RE: [AM] ZZZ
Paul Oldfield
2004-02-04 22:17:40 UTC
Permalink
(responding to Dagna)
DATA WAREHOUSING
Does Data Warehousing have special problems? Again, I
don't know enough to comment on what approaches would
be good.
DG: Yes. The big special problem of data warehousing is
1. Name has different lengths, structures and contents on
every system - some have surname/initials, some have free-
format text, lengths vary from 20 to 1024 characters (hey, it's a
Varchar, it won't use what it doesn't need, so let's be safe!), so
you need to look at each one and work out how to transform it
and what to do when it fails (name is blank, for example). Then
there are addresses. Always a source of fun for the ETL
programmers.
Hmm... one of the DBMS I worked on was a 'heterogeneous';
we read data from different sources into objects. I guess this
would have been a pain for cases where there were loads of
tables - we had to describe the mapping for each table by
hand. I guess we also worked on pretty sparse data, quite
high processing on relatively small datasets. Creating
objects probably wouldn't be good for high data volumes,
low processing requiremetns.
2. Identifiers aren't unique across systems, so they all have to
be converted to a unique Warehouse id. And some of the
source systems have different id formats (number, alphanumeric,
formatted alpha/number/alpha...). And the same customer can
have different ids on different systems and there is no cross-
reference to identify the matches.
...and some customers will have multiple IDs on the one system,
etc., and the data wouldn't stay clean if you *did* manage to
cross-reference most customers...
3. What do you do with the rubbish? Stick it in anyway, or reject
it? And how does it get fixed, in either case? These questions
have to be answered for every table. Different tables may have
different answers. (This is emphatically not a development team
decision. It has to be made by the users, who have to realise the
consequences of their decision. Resist all attempts by the users
to push it back to you.)
And I've yet to see any attempt to keep the data clean from the
outset last beyond the first day of operation.
The other warehouse problem is the size of the thing. The first
iteration will take a long time (whatever value of 'long' you use),
because that's where you set things up so the next bits just add
in and use the infrastructure (way of generating ids, links back
to source systems, etl tool, database instance, etc) that you
decided on and set up in phase 1.
There will be ways to cut that down. I can't say what these will
be, but there always are.
If you are developing a warehouse, you need a full time data
analyst/modeller/architect on the project who knows the data,
knows why it was done like that, knows what can be changed
(so when you want to change a format, they either say 'OK',
'OK but...', or threaten you with extreme violence),
Something like a DBA, then?
and knows what transformations are needed, have been done,
or are near enough like the one you are starting that you can
re-use the old one, and they know who to go to when the feed
you have finished, tested, installed and are running live is
suddenly changed. (Because you are re-using existing interface
files and someone, somewhere, needed a change made.
You can probably sell tickets to this meeting, to anyone who
isn't too squeamish.)
I get the picture.
Always good to have another point of view. If you don't think
we're talking sense, yell. Somebody will learn something,
with any luck. Either we'll learn to express our ideas better,
or we'll learn to express better ideas ;-)
DG: Thanks. I will try to keep future replies a little shorter.
I've cut the reply into threads to help you ;-)


Paul Oldfield

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
www.aptprocess.com

any opinions expressed herein are not necessarily those of
Mentors of Cally or the Appropriate Process Movement
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For more information about AM, visit the Agile Modeling Home Page at www.agilemodeling.com
--^----------------------------------------------------------------
This email was sent to: gcma-***@gmane.org

EASY UNSUBSCRIBE click here: http://topica.com/u/?bUrKDA.bWnbtk.Z2NtYS1h
Or send an email to: agilemodeling-***@topica.com

TOPICA - Start your own email discussion group. FREE!
http://www.topica.com/partner/tag02/create/index2.html
--^----------------------------------------------------------------
Loading...