TIPS & TRICKS
Find Duplicates Intelligently
This tip explores two methods for identifying duplicates, both of which
integrate some calculation and scripting techniques.
By Stuart Gripman
Few problems cut across databases as pervasively as
duplicate records. Although record duplication may be most prevalent in
databases containing contact information, such as names and addresses,
it has the potential to show up anywhere. It comes as no surprise that
just about every client I've worked with over the years has expressed a
need to identify and remove duplicates.
Finding redundancy seems simple enough. Just find
duplicates and delete one of them, right? Well, sure, but how do you
define "duplicate?" Consider this example. Your database has the
following two records:
Ben Finn
412 Arlington Boulevard
Eastsound, WA 98070
Benjamin Finn
412 Arlington Bl.
Eastsound, WA 98070-1222
Likely, these two records are for the same person, and
you should purge one, but if you just went looking for identical
records, these two records wouldn't come up because there are small
differences on each line. Your method for identifying potential
duplicates must allow for "loose" matches if you're going to catch these
two "duplicates." In this tip, I'll explore two methods for identifying
duplicates, both of which will integrate some calculation and scripting
techniques.
The "human knows best" technique
Typically the chore of finding duplicates is going to involve some
judgment calls. Somebody has to look over the records and decide whether
a given record is a duplicate. Although it can be a real chore,
FileMaker Pro does provide some tools to ease the burden. Using some
scripting and calculations, you can let FileMaker Pro generate a list of
likely duplicates and present them to you for review.
The first step is to create a calculation field. In
this scenario, finding duplicates based on any single field will return
a large found set with few true duplicates. The new calculation field
will combine the contents of multiple fields. Your script will search
for duplicates using the calculation field, thereby reducing the number
of false duplicates. For this exercise, I chose to combine the Last Name
and ZIP code fields. Finding duplicate last names or ZIP codes alone is
sure to return many records that aren't true duplicates. When you search
on the combination of the two, FileMaker Pro will find fewer records
that are more likely to be duplicates.
In Define Database, create a new calculation field
named "zct_DuplicateCalc" and enter the following formula:
NameLast & Zip
The ampersand concatenates -- or "glues" -- NameLast
and ZIP together.
Now you're ready to write the script that simply
searches for duplicate values in your "zct_DuplicateCalc" field and
presents the sorted results.
When you run the script, FileMaker Pro finds the
records most likely to be duplicates, sorts them so they're
appropriately grouped together, and switches to a list layout where you
can review the results. In the example file, the script locates 22
records or 11 potential duplicates out of 500 records in the database.
Contrast this result with searching for duplicate data in the last name
field alone; that results in 58 records. Finding duplicate ZIP codes
alone returns 192 records.
Now that you have a manageable list of likely
duplicates, you can check them over and manually delete the unneeded
records.
The "blunt force" technique
The "blunt force" technique is a script that finds duplicates based on
your criteria and deletes all but the first duplicate without pausing
for verification. The scripting here is more complex than the previous
technique, but it requires no human review. Automation comes at the cost
of flexibility here. This technique deletes records without asking
first. Be sure you're developing and testing this technique on backup
copies of your data. When you or your script delete a record, it's gone
for good.
Before you can write the script, you'll have to add
two new fields to the database. Call the first "zgt_LastValue" and set
it to be a text field with global storage. The script uses this field to
compare records and it doesn't have to appear on any layout. The second
field is "zt_DeleteFlag," a regular text field.
The script works like this:
• Find all records with duplicate cities.
• Sort by city so the duplicates line up together.
• Starting with the first record, go to each record and compare its
"City" value to the "City" value in the previous record.
• If the "City" values match, mark the record for deletion.
• If there's no match, keep going.
• After checking all records, find the ones marked for deletion and
delete them.
Set your expectations
As you can see, defining duplicates and accurately finding them usually
requires some human intervention. This task can be much less onerous if
you take the time to define what "duplicate record" means for your
database first.
The approaches I presented in this issue are designed
to be compatible with FileMaker Pro 7 through 8.5. With minor
modifications, you can apply them to earlier versions.
ARCHIVE-CD can show you
how we can help you streamline
your work environment
— and expand your possibilities.
Call 800-323-1868 or visit www.archive-cd.com
|