Merging two lists

A good enough solution for small sets, but there's certainly a more standard algorithm for this kind of thing.
Sunday, June 13, 2021

The problem

We're merging a bunch of Salesforce records at work. We got a list of duplicates from an external tool (Cloudingo), but that didn't give us all the duplicates. So we also wrote some custom Ruby code to pull potential duplicates based on additional criteria. The result was two sets of account IDs, like this (very simplified):

List 1: a,b; d,e; f,g

List 2: a,n; d,e,o; x,y

Note: Each letter represents an ID of a Salesforce account, grouped by which accounts we decided were duplicates. So a and b are the IDs for two different Salesforce accounts, which we want to merge into a single account.

Some of the items in the matched on one element (a,b, a,n); others matched on multiple elements (d,e, d,e,o); others had no overlap at all (f,g, x,y).

The goal was to merge these two lists, so that we had a single set of duplicates: a,b,n; d,e,o; f,g; x,y.

My solution

  1. Concatenate the two lists together (cat list1.txt list2.txt > combined.txt). The list is now: a,b; d,e; f,g; a,n; d,e,o; x,y.
  2. Take the first item in the list (a,b) and pop it off the list (call it the CANDIDATE item). The list is now d,e; f,g; a,n; d,e,o; x,y.
  3. Go to the next item in the list (d,e).
  4. Check if there are any shared elements between CANDIDATE and the list item. If not, go onto the next item in the list.
    1. If there are shared elements, merge CANDIDATE into the list item and throw away duplicate elements (a,b shares elements with a,b,n, so merge to a,b,a,b,n and then discard the dupes to result in a,b,n). The list is now d,e; f,g; a,b,n; d,e,o; x,y.
    2. If there are no items in the list with shared items, insert CANDIDATE at the bottom of the list.
  5. Check to see if the number of unique items in the entire list equals the number of items in the list.
    1. If the counts are the same, then we've merged as much as we can, as there won't be any more elements that match. We're done. Stop running.
    2. if the counts are different, go to step #2 and repeat.

Something in my gut tells me that there's some more elegant solution to this (some kind of graph theory thing?). But this worked for the small data set we had, so good enough for now.

The code