/
Dedup

Dedup

DB design: 

CREATE TABLE dedup (
catchment_id varchar,
health_id1 varchar,
health_id2 varchar,
rule_name varchar,
rule_value varchar,
created_at timeuuid,
PRIMARY KEY (catchment_id, health_id1, health_id2, rule_name, rule_value));

Example:

Feed - p1 (create), p2 (create), p3 (create), p1 (updated occupation), p1 (updated nid2), p2 (voided, merged-with p3), h4 (create), p5 (create). pn represents HID.
All have same NID, nid1 but different catchments, c1, c2, c3, c4, c5. Only, h4, h5 have same phone #. c1 - 101112, c2 - 202122, c3 - 303133, so on.
Rules: same nid and same phone # are dedups. Dedups cannot be resolved for just one rule, but for all.

 

ScenariosActionEvent in FeedScheduled job runsDupes in C1Dupes in C2Dupes in C3Dupes in C4Dupes in C5
1Patient created (P1)created:p1No dupes found.     
2Patient created (P2)created:p2Finds P1 with same NIDP2-P1 (NID)P2-P1(NID)   
3Patient created (P3)created:p3Finds that P3 shares NID with P1 and P2P2-P1(NID)
P3-P1(NID) +
P2-P1(NID)
P3-P2(NID) +
P3-P1(NID) +
P3-P2(NID) +
  
4Patient updated (P1)updated:p1 (occupation)All entries related to p1 are deleted and recreated.P2-P1(NID)
P3-P1(NID)
P2-P1(NID)
P3-P2(NID)
P3-P1(NID)
P3-P2(NID)
  
5Patient updated (P1)updated:p1 (NID)All entries related to p1 are deleted and recreated.EmptyP3-P2(NID)P3-P2(NID)  
6Resolution. P2-P3 conflict for all rules.
p2 - voided and merged with p3
voided:P2 (merged with p3)Since P2 does not exist anymore, we must remove all pairs with P2 from all dedup tables.EmptyEmptyEmpty  
7 updated:P3Run deduplication rule anyway. Found no duplicatesEmptyEmptyEmpty  
8Patient created (P4)created: P4P4 matches P3 in NID. (P1's NID has changed in earlier step.)  P4-P3(NID) +
P4-P3(NID) +
 
9Patient created (P5)created:P5P5 is detected for NID duplication, matching with P3, P4. And Also with P4 for Ph. num  P4-P3(NID)
P5-P3 (NID) +
P4-P3(NID)
P5-P4 (NID+Ph.No) +
P5-P3(NID) +
P5-P4 (NID+Ph.No) +
10P3-P4 resolved for all rules,
both are unique.
ignore-duplicate: P3, P4As soon as approver says these must remain separate, we must go back and take them out of all catchments  P5-P3 (NID)P5-P4 (NID+Ph.No)P5-P3(NID)
P5-P4 (NID+Ph.No)

Scenario 2 (DB Calls): 

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2)

#3. Dedup DB:

CatchmentHID1HID2Rule NameRule ValueCreated At
A10B11h1h2nidnid1t1
A10B11C12h1h2nidnid1t2
      
A20B21h2h1nidnid1t3
A20B21C22h2h1nidnid1t4

Searches all patients by phone # ph2. No matching patient. No action taken.

Scenario 3: 

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2, h3)

#3. Dedup DB:

CatchmentHID1HID2Rule NameRule ValueCreated At
A10B11h1h2nidnid1t1
A10B11C12h1h2nidnid1t2
A10B11h1h3nidnid1t9
A10B11C12h1h3nidnid1t10
      
A20B21h2h1nidnid1t3
A20B21C22h2h1nidnid1t4
A20B21h2h3nidnid1t11
A20B21C22h2h3nidnid1t12
      
A30B31h3h1nidnid1t5
A30B31C32h3h1nidnid1t6
A30B31h3h2nidnid1t7
A30B31C32h3h2nidnid1t8

Searches all patients by phone # ph3. No matching patient. No action taken.

Scenario 4: 

#1 select hid2 from dedup where catchment in (1011, 101112) and hid1 = h1. 

#2 select catchment from patient where hid = .. (for each hid returned in #1)

#3

(Batch begin)

delete from dedup where catchment_id in (1011, 101112) and hid1=h1

delete from dedup where catchment_id in (...) and hid1=.. and hid2=h1 (repeated as many times required)

repeat scenario 2.

(Batch ends)

#4 Dedup db looks same as in scenario 3.

Note: If catchment is updated, it will be similar to scenario 3.

Scenario 5: 

Similar to scenario 4. 

Dedup db:

CatchmentHID1HID2Rule NameRule ValueCreated At
A20B21h2h3nidnid1t11
A20B21C22h2h3nidnid1t12
      
A30B31h3h2nidnid1t7
A30B31C32h3h2nidnid1t8

 

Scenario 6:

Similar to scenario 4. Dedup table becomes empty.

Scenario 8:

Similar to scenario 2. 

Dedup table:

CatchmentHID1HID2Rule NameRule ValueCreated At
A30B31h3h4nidnid1t..
A30B31C32h3h4nidnid1t..
      
A40B41h4h3nidnid1t..
A40B41C42h4h3nidnid1t..

Scenario 9:

Similar to scenario 2. 

Dedup table:

CatchmentHID1HID2Rule NameRule ValueCreated At
A30B31h3h4nidnid1t..
A30B31C32h3h4nidnid1t..
A30B31h3h5nidnid1t..
A30B31C32h3h5nidnid1t..
      
A40B41h4h3nidnid1t..
A40B41C42h4h3nidnid1t..
A40B41h4h5phone #ph1t..
A40B41C42h4h5phone #ph1t..
A40B41h4h5nidnid1t..
A40B41C42h4h5nidnid1t..
      
A50B51h5h3nidnid1t..
A50B51C52h5h3nidnid1t..
A50B51h5h4phone #ph1t..
A50B51C52h5h4phone #ph1t..

 

Scenario 10:

p3 and p4 are no more dupes. 

#1. select catchment from patient where hid in (h3, h4)

#2. delete from dedup where catchment in (..) and hid1 in (h3, h4) and hid2 in (h3, h4) [in won't work. so run separate queries in batch]

Dedup table:

CatchmentHID1HID2Rule NameRule ValueCreated At
A30B31h3h5nidnid1t..
A30B31C32h3h5nidnid1t..
      
A40B41h4h5phone #ph1t..
A40B41C42h4h5phone #ph1t..
A40B41h4h5nidnid1t..
A40B41C42h4h5nidnid1t..
      
A50B51h5h3nidnid1t..
A50B51C52h5h3nidnid1t..
A50B51h5h4phone #ph1t..
A50B51C52h5h4phone #nid1t..

 

Find duplicates by catchment:

select * from dedup where catchment = 102030.

Then sort the result by timeuuid (time component only) in-memory and return.