DATA QUALITY: IS GIGO A THING OF THE PAST?

Article published in DM Direct Newsletter
By John C. Hermansen

These days, we rarely hear the old saw, "garbage in, garbage out (GIGO)", but it used to be a very common explanation for the mediocre performance of database operations. The fact that we do not hear that expression as often, however, does not mean that the problem has gone away. Quite the contrary.

It is not that our databases are that much cleaner now, but rather that they are so much larger. Dirty data still plagues every database we have worked on over the past 20 years, and that means that an extraordinary amount of valuable information is lost forever in the mega-sized databases that are commonplace today. Because these databases are so enormous, the sheer volume of data errors in our data stores is now larger than ever before.

Projects and programs to cleanse or remediate corporate data continue to receive the same short shrift that they always have. One major hard drive crash, and a new DBA never again forgets to back up the database. But erroneous data rarely teaches the same kind of dramatic lesson, and thus dirty data persists.

Dirty Names

One of the most difficult database problems to fix is also the one that can cause the most havoc in databases: personal names. While we do find programs and services that can "clean up" addresses, telephone numbers and other data fields, until recently there were no such tools for helping with bad name data. Why is this?

People's names are complex data elements that cannot be looked up in a dictionary or table for the proper spelling or parsing. Names from cultures that we are not familiar with are especially likely to suffer significant damage during data entry. Below are a few examples of names that might be found in the database of any global organization today. How would you split these names in order to put the proper elements in the surname field and given name field? (Answers are shown below.)

Maria del Carmen Bustamante de la Fuente
Hisham Abu Ali Quereshi Noor Eldin
Chang Wen Ying
Nadezhda Ivanovna Ovtsyuk
William Martin Smith-Bagby Jr.
Kees Andries Van Der Merve

If these names are entered into a database improperly, it is very likely that they will never be seen again. So, how can the average DBA help data entry staff handle these vital data elements correctly? And what about the name data already in the database?

Key to Success

There is no substitute for establishing an ongoing process of data stewardship. This means first cleansing the existing legacy data, setting up and monitoring methods for error prevention at data entry and maintaining a routine for scheduled data cleansing. Doing this for other data fields is bothersome enough, but how can this possibly be done for complex personal name data?

Fortunately, new products are now coming to market that can help guarantee the quality of your valuable name data. Based on years of research and statistics derived from hundreds of millions of names from around the world, this new name recognition technology can be used to clean an existing database of personal names as well as to protect that cleaned database from future infection with bad data.

Using information about how cultures around the world define and use personal names, it is now possible to actually measure the accuracy of the way a name is parsed into surname and given name fields, no matter what type of name it is. This provides a superior method for cleaning up an existing database, and it allows for interactive checking of data entry. In fact, the entire parsing operation could be - and perhaps should be - automated using this new technology.
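To make the idea of "measuring" a parse concrete, here is a minimal Python sketch. It is not the commercial technology described in this article. It assumes a tiny, hand-made table of token likelihoods (the SURNAME_LIKELIHOOD values are invented for illustration), and the function names parse_score and best_split are hypothetical. A real product would presumably draw on statistics from a very large multicultural name corpus.

```python
from typing import Tuple

# ASSUMPTION: toy, hand-made likelihoods that a token belongs in the
# surname field. These numbers are invented for illustration only.
SURNAME_LIKELIHOOD = {
    "chang": 0.95, "wen": 0.10, "ying": 0.10,
    "nadezhda": 0.02, "ivanovna": 0.05, "ovtsyuk": 0.98,
}
UNKNOWN = 0.5  # no evidence either way for tokens missing from the table


def parse_score(surname: str, given: str) -> float:
    """Mean plausibility (0 to 1) of a proposed surname/given-name split."""
    scores = [SURNAME_LIKELIHOOD.get(t.lower(), UNKNOWN) for t in surname.split()]
    scores += [1.0 - SURNAME_LIKELIHOOD.get(t.lower(), UNKNOWN) for t in given.split()]
    return sum(scores) / len(scores) if scores else 0.0


def best_split(full_name: str) -> Tuple[str, str]:
    """Try every contiguous split in both field orders.

    Returns the (surname, given) pair with the highest score.
    """
    tokens = full_name.split()
    if len(tokens) < 2:
        # Nothing to split: treat the single token (if any) as the surname.
        return (" ".join(tokens), "")
    candidates = []
    for i in range(1, len(tokens)):
        head, tail = " ".join(tokens[:i]), " ".join(tokens[i:])
        candidates.append((tail, head))  # given name first, surname last
        candidates.append((head, tail))  # surname first, given name last
    return max(candidates, key=lambda c: parse_score(*c))


if __name__ == "__main__":
    for name in ("Chang Wen Ying", "Nadezhda Ivanovna Ovtsyuk"):
        surname, given = best_split(name)
        print(f"{surname}, {given}  (score {parse_score(surname, given):.2f})")
```

The same parse_score function could also serve as an interactive data entry check: a split that an operator keys in with a low score could be flagged for review before it is saved. Any cutoff for that flag would have to be tuned against real data.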

This parsing software was actually used on the set of names shown previously and instantly returned the following (correct) parses:

Bustamante de la Fuente, Maria del Carmen
Quereshi Noor Eldin, Hisham Abu Ali
Chang, Wen Ying
Ovtsyuk, Nadezhda Ivanovna
Smith-Bagby Jr., William Martin
Van Der Merve, Kees Andries

Capitalizing on Your Data Investment

Technologies such as name recognition software make it much easier now to commit a data shop to better stewardship of its valuable data. It still requires an unequivocal management decision to support clean data, but the rewards are certainly worth it.

In fact, the business risks of not enhancing and protecting the quality of your personal name data will only grow as customer databases become more and more international. Companies that lack the vision and the commitment to do a better job of handling their customers' names are unlikely to succeed in attracting this growing global market. Those companies that do understand the value of good data will now be able to demonstrate that directly to their customers every time they use their names.

Source: DM Review