
Today’s LinkedIn Discussion Thread: Enterprise Data Quality April 28, 2009

Posted by Peter Benza in Data Analysis, Data Elements, Data Governance, Data Optimization, Data Processes, Data Profiling, Data Quality, Data Sources, Data Standardization, Data Synchronization, Data Tools, Data Verification.

Here is the most recent question I added to my LinkedIn discussion group, Enterprise Data Quality.

QUESTION: What master data or existing “traditional” data management processes (or differentiators) have you identified to be useful across the enterprise regarding data quality?

MY INSIGHTS: Recently, I was able to demonstrate (and quantify) the impact of using an NCOA (National Change of Address) updated address on match/merge accuracy when two or more customer names and addresses from three disparate source systems were present. This test approach warrants consideration, especially when the volume of customer records at big companies today runs to hundreds of millions. It is ideal to apply the test to the entire file, not just a sample set. But we all know that today it's about money, time, value, resources, etc.

For testing purposes, I advised that all individual customer address attributes be replaced (where information was available) with NCOA-updated addresses and then loaded and processed through the “customer hub” technology. If you are not testing a piece of technology, then constructing your own match key or visually checking sample sets of customer records before and after is an alternative. Either way, inventory the matches and non-matches from the two different runs: once with addresses as-is and once with addresses that leverage the NCOA information.
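The two-run comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `match_key` rule (normalized name plus normalized address), the field names, and the sample records are all hypothetical, and a real customer hub would apply far richer matching logic.

```python
from collections import defaultdict

def normalize(text):
    """Uppercase, strip basic punctuation, and collapse whitespace."""
    return " ".join(text.upper().replace(".", "").replace(",", "").split())

def match_key(name, address):
    """Hypothetical home-grown match key: normalized name plus normalized address."""
    return normalize(name) + "|" + normalize(address)

def count_matched_records(records):
    """Count records that share their match key with at least one other record."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec["name"], rec["address"])].append(rec["id"])
    return sum(len(ids) for ids in groups.values() if len(ids) > 1)

def with_ncoa(rec):
    """Replace the address with the NCOA-updated one where available."""
    updated = dict(rec)
    updated["address"] = rec["ncoa_address"] or rec["address"]
    return updated

# Made-up sample records from three hypothetical source systems (A, B, C).
records = [
    {"id": "A1", "name": "Jane Doe", "address": "12 Oak St., Dayton OH",
     "ncoa_address": "98 Elm Ave, Dayton OH"},
    {"id": "B7", "name": "Jane Doe", "address": "98 Elm Ave, Dayton OH",
     "ncoa_address": None},
    {"id": "C3", "name": "John Roe", "address": "5 Pine Rd, Akron OH",
     "ncoa_address": None},
]

as_is = count_matched_records(records)                             # run 1: addresses as-is
ncoa_run = count_matched_records([with_ncoa(r) for r in records])  # run 2: NCOA applied
print(f"matched as-is: {as_is}, matched with NCOA: {ncoa_run}")
```

In this toy data the two Jane Doe records only come together once the older address has been replaced with its NCOA update, which is the effect the test is meant to inventory.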

My goal was to establish a business process focused on “pre-processing customer records” using a reliable third-party source (in this case NCOA) instead of becoming completely dependent on a current or future piece of technology that may offer the same results, especially when the methodology (the matching algorithms) is probabilistic. My approach reduces your dependency as well, and lets you focus on the “lift” the technology may offer if you are comparing two or more products.

By contrast, inside a deterministic matching utility (or off-the-shelf solution), adding extra space or columns to the end of your input file to store the NCOA addresses will allow you to accomplish the same results. But for test purposes, the easier way may be to replace addresses where an NCOA record is available.
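As a rough sketch of the extra-column alternative, the Python below keeps the as-is address and appends the NCOA address in its own field, then treats two records as a match when the names agree and any of their addresses agree. The field names and the matching rule are assumptions made for illustration, not the behavior of any particular matching product.

```python
def normalize(text):
    """Uppercase, strip basic punctuation, and collapse whitespace."""
    return " ".join(text.upper().replace(".", "").replace(",", "").split())

def address_set(rec):
    """Every address a record carries: the as-is one plus the appended NCOA column, if filled."""
    candidates = (rec["address"], rec.get("ncoa_address"))
    return {normalize(a) for a in candidates if a}

def is_match(a, b):
    """Deterministic rule (hypothetical): same name and at least one address in common."""
    same_name = normalize(a["name"]) == normalize(b["name"])
    return same_name and bool(address_set(a) & address_set(b))

# Made-up records: the first still carries an old address, with NCOA appended.
moved = {"name": "Jane Doe", "address": "12 Oak St", "ncoa_address": "98 Elm Ave"}
current = {"name": "Jane Doe", "address": "98 Elm Ave.", "ncoa_address": ""}
other = {"name": "John Roe", "address": "98 Elm Ave", "ncoa_address": ""}

print(is_match(moved, current))  # names agree and the NCOA address overlaps
print(is_match(moved, other))    # address overlaps but names differ
```

Keeping both addresses preserves the original value for audit purposes, at the cost of a wider input file.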

Remember, depending on the volume of records your client is dealing with, a pre-process (business process) may be ideal, rather than loading all the customer names and addresses into the third-party customer hub technology and processing them there. Caution: this all depends on how the business is required (e.g. for compliance) to store information from cradle to grave. But the rule of thumb for the MDM customer hub is to store the “best/master” record (the single customer view), with the exception of users with extended search requirements. The data warehouse (vs. MDM solutions) then becomes the next challenge: what to keep where, and how much. But that is another discussion.

The percentage gain realized from using the updated customer address was substantial (over 10%) on average across all the sources factored into the analysis. This means several tens of millions of customer records will match/merge more effectively (and efficiently), followed by the incremental lift from what the “customer hub” technology enables using its proprietary tools and techniques. This becomes the real differentiator!
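One way to express such a gain, assuming it is read as the difference in match rate between the NCOA run and the as-is run, is the small calculation below. The counts are invented purely to show the arithmetic and are not the figures from the analysis above.

```python
def match_rate(matched, total):
    """Share of records that found at least one match, as a percentage."""
    return 100 * matched / total

def lift_in_points(matched_as_is, matched_ncoa, total):
    """Percentage-point gain of the NCOA run over the as-is run."""
    return match_rate(matched_ncoa, total) - match_rate(matched_as_is, total)

# Illustrative counts only.
total_records = 1_000
lift = lift_in_points(matched_as_is=600, matched_ncoa=700, total=total_records)
print(f"lift: {lift} percentage points")
```

Measuring this baseline first makes it easier to separate what the pre-process contributes from what the hub technology adds on top.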

What is the major difference between structured and unstructured data? December 27, 2008

Posted by Peter Benza in Data Dictionary, Data Elements, Data Formats, Data Types.

A good rule of thumb is that structured (“tabular”) data fits into rows and columns, while unstructured data covers things like web pages, presentations, surveys, and images.

[Add more examples here.]

What other data aggregate functions are useful besides averages and means? September 19, 2007

Posted by Peter Benza in Data Aggregates, Data Consolidation, Data Elements, Data Errors, Data Research.
1 comment so far

(Be first to author an article on this topic.)

Spatial data layers and conflation September 18, 2007

Posted by Peter Benza in Data Accuracy, Data Elements, Data Formats, Data Types, Data Visualization.

Conflation is more than matching features from different spatial sources. A good spatial-matching technology that includes conflation as a parameter should also take into account each feature's location, the shape's attributes, and its relationships to other objects.
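To make those three ingredients concrete, here is a minimal Python sketch that scores a candidate pair of road features on location, attributes, and relationships. Everything in it is an assumption for illustration: the feature layout (`points`, `name`, `connects_to`), the centroid-distance measure, the `max_distance` cutoff, and the equal weighting are hypothetical choices, not an established conflation algorithm.

```python
import math

def centroid(points):
    """Average of a feature's vertices."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def conflation_score(a, b, max_distance=50.0):
    """Score from 0 to 1 combining location, attribute, and relationship agreement."""
    (ax, ay), (bx, by) = centroid(a["points"]), centroid(b["points"])
    distance = math.hypot(ax - bx, ay - by)
    location = max(0.0, 1.0 - distance / max_distance)   # closer centroids score higher

    attributes = 1.0 if a["name"].upper() == b["name"].upper() else 0.0

    shared = len(a["connects_to"] & b["connects_to"])     # relationships to other objects
    total = len(a["connects_to"] | b["connects_to"])
    relationships = shared / total if total else 0.0

    return (location + attributes + relationships) / 3

# Two made-up road segments from different sources, plus an unrelated one.
source_a = {"points": [(0, 0), (10, 0)], "name": "Main St",
            "connects_to": {"Oak Ave", "Elm St"}}
source_b = {"points": [(0, 2), (10, 2)], "name": "MAIN ST",
            "connects_to": {"Oak Ave"}}
far_away = {"points": [(500, 500), (510, 500)], "name": "Lake Rd",
            "connects_to": {"Shore Dr"}}

print(conflation_score(source_a, source_b))
print(conflation_score(source_a, far_away))
```

A score like this only ranks candidate pairs; deciding which source to display when they conflict still needs a rule of its own.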

A good example of this is when two or more road networks have conflicting views: how do you proceed if you end up wanting to display only one of the sources?

What geometrical matching techniques or advice do you have on this topic?

NCDM tradeshow from early 1990s August 19, 2007

Posted by Peter Benza in Data Elements, Data Management, Data Mining, Data Profiling.


Here is an old snapshot I found from a database marketing tradeshow I attended back in the early 1990s in Orlando, FL.

Data quality in the 1990s equated to postal name and address information: address standardization, ZIP correction, phone number verification, first-name tables, apartment number enhancements, area code changes, and probably the biggest, National Change of Address. Today, data quality has expanded to include product data, item data, financial data, and other master data sources across the enterprise.

Service bureaus like IRA (at that time) were among the few remaining privately held bureaus that mass-compiled consumer data on a national basis and collected information like exact age, phone numbers, length of residence, dwelling unit type, dwelling unit size, height/weight information, voting preference… the list goes on!

Today, with the evolution of database technology, consumer data used as reference data, statistical analysis, and advanced data profiling tools, the database marketing industry has truly taken all these subject areas to the next level.

Best practices for database management, data quality, and data governance are now prime time. Instead of concentrating only on how to cut costs (further), organizations want to shift to increasing revenues, and that begins with leveraging corporate data sources across the enterprise.

What kind of data references are being bolted-on to enhance record matching inside the customer database? August 17, 2007

Posted by Peter Benza in Data Elements, Data References, Data Strategy, Data Verification.

Organizations are turning to compiled reference data to complement the match/merge/optimize routines inside their customer data hub. A score/rank is also being pre-assigned (appended) to each customer record to make it easier to build match logic for use during the file build process.

A good example of this is aggregating surnames at various geographic levels: block group, census tract, ZIP code, county, and so on. The resulting surname statistics by geography are used as part of the overall algorithm applied during the integration/update process to improve the decision that brings two or more customer records together into what is referred to as a household.
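The aggregation step might look roughly like the Python below, which computes the share of each surname within a chosen geographic level. The record layout, the sample values, and the idea of using the share as a score are assumptions for illustration; licensed reference modules will define their own statistics.

```python
from collections import Counter

def surname_share_by_geo(records, geo_field):
    """Return {(geo, surname): share of that geography's records carrying the surname}."""
    geo_totals = Counter(rec[geo_field] for rec in records)
    pair_counts = Counter((rec[geo_field], rec["surname"].upper()) for rec in records)
    return {pair: count / geo_totals[pair[0]] for pair, count in pair_counts.items()}

# Made-up reference records; real compiled files would hold far more rows and levels.
records = [
    {"surname": "Smith", "zip": "45402", "county": "Montgomery"},
    {"surname": "Smith", "zip": "45402", "county": "Montgomery"},
    {"surname": "Benza", "zip": "45402", "county": "Montgomery"},
    {"surname": "Smith", "zip": "45403", "county": "Montgomery"},
]

by_zip = surname_share_by_geo(records, "zip")
by_county = surname_share_by_geo(records, "county")

for (geo, surname), share in sorted(by_county.items()):
    print(f"{geo} {surname}: {share:.2f}")
```

One plausible use is to trust a surname-plus-address agreement more when the surname is rare in that geography, since a coincidental match is then less likely.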

Note: Surname is only one data element. Others exist, and vendors in the information services industry have packaged this concept into licensed modules for use in an organization's master data management landscape.

Deciphering between data variables and data elements? August 16, 2007

Posted by Peter Benza in Data Consistancy, Data Consolidation, Data Elements, Data Formats, Data Standardization, Data Templates.

Here are two data variables that require some special attention or you just might “age” your customers too soon, too late, or not at all. 

Exact age is a data variable and is typically stored as a whole number representing a customer’s age. In this form it is a very powerful (and predictive) data variable, and it is one of the variables most commonly used to discriminate responders from non-responders.

Exact age in this case can’t be broken down into any smaller data elements. Okay, so now you understand the difference, but is this good enough given how you plan to use this data variable for target marketing purposes?

Exact age does have some limitations. What about maintaining this particular variable in your customer data warehouse? Left alone in its current format, exact age becomes an operational nightmare, because the stored number goes stale as customers get older. A more common and efficient way is to create a second data variable named date of birth, which includes three data elements: month, day, and year of birth.
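With date of birth stored as its three elements, exact age can be derived on demand instead of being maintained. A minimal Python sketch, where the function name and parameters are chosen only for illustration:

```python
from datetime import date

def exact_age(birth_year, birth_month, birth_day, as_of):
    """Whole-number age on the as_of date, derived from the three date-of-birth elements."""
    had_birthday = (as_of.month, as_of.day) >= (birth_month, birth_day)
    return as_of.year - birth_year - (0 if had_birthday else 1)

# The same customer, the day before and the day of a birthday.
day_before = exact_age(1962, 8, 16, as_of=date(2007, 8, 15))
birthday = exact_age(1962, 8, 16, as_of=date(2007, 8, 16))
print(day_before, birthday)
```

Because the age is computed against whatever date the selection runs on, the customer is never aged too soon, too late, or not at all.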

Remember, some data variables may have specific data elements within them, such as a phone number, street address, ZIP code, etc. The more you examine each of the data variables in your database, the more potential options you will begin to uncover.

What are some of the most popular business related data elements used today for data quality purposes? August 13, 2007

Posted by Peter Benza in Data Elements.

Be one of the first to author an article in this category!

Peter Benza – 1984 graduate of the direct marketing educational foundation – creates enterprise data quality weblog August 13, 2007

Posted by Peter Benza in Data Elements, Data Governance, Data Integrity, Data Management, Data Mining, Data Optimization, Data Profiling, Data Quality, Data Stewardship, Data Strategy, Data Tools, Data Variables, Data Visualization.