DB design:
CREATE TABLE dedup (
catchment_id varchar,
health_id1 varchar,
health_id2 varchar,
rule_name varchar,
rule_value varchar,
created_at timeuuid,
PRIMARY KEY (catchment_id, health_id1, health_id2, rule_name, rule_value));
Example:
Feed - p1 (create), p2 (create), p3 (create), p1 (updated occupation), p1 (updated nid2), p2 (voided, merged-with p3), h4 (create), p5 (create). pn represents HID.
All have same NID, nid1 but different catchments, c1, c2, c3, c4, c5. Only, h4, h5 have same phone #. c1 - 101112, c2 - 202122, c3 - 303133, so on.
Rules: same nid and same phone # are dedups. Dedups cannot be resolved for just one rule, but for all.
Scenarios | Action | Event in Feed | Scheduled job runs | Dupes in C1 | Dupes in C2 | Dupes in C3 | Dupes in C4 | Dupes in C5 |
1 | Patient created (P1) | created:p1 | No dupes found. | |||||
2 | Patient created (P2) | created:p2 | Finds P1 with same NID | P2-P1 (NID) | P2-P1(NID) | |||
3 | Patient created (P3) | created:p3 | Finds that P3 shares NID with P1 and P2 | P2-P1(NID) P3-P1(NID) + | P2-P1(NID) P3-P2(NID) + | P3-P1(NID) + P3-P2(NID) + | ||
4 | Patient updated (P1) | updated:p1 (occupation) | All entries related to p1 are deleted and recreated. | P2-P1(NID) P3-P1(NID) | P2-P1(NID) P3-P2(NID) | P3-P1(NID) P3-P2(NID) | ||
5 | Patient updated (P1) | updated:p1 (NID)Now then job runs, no dupes wil be found for P1. But we must also search the dedup tables to remove any pairs with P1 (reason NID match) that were earlier created | All entries related to p1 are deleted and recreated. | Empty | P3-P2(NID) | P3-P2(NID) | ||
6 | Resolution. P2-P3 conflict for all rules. p2 - voided and merged with p3 | voided:P2 (merged with p3) | Since P2 does not exist anymore, we must remove all pairs with P2 from all dedup tables. | Empty | Empty | Empty | ||
7 | updated:P3 | Run deduplication rule anyway. Found no duplicates | Empty | Empty | Empty | |||
8 | Patient created (P4) | created: P4 | P4 matches P3 in NID. (P1's NID has changed in earlier step.) | P4-P3(NID) + | P4-P3(NID) + | |||
9 | Patient created (P5) | created:P5 | P5 is detected for NID duplication, matching with P3, P4. And Also with P4 for Ph. num | P4-P3(NID) P5-P3 (NID) + | P4-P3(NID) P5-P4 (NID+Ph.No) + | P5-P3(NID) + P5-P4 (NID+Ph.No) + | ||
10 | P3-P4 resolved for all rules, both are unique. | ignore-duplicate: P3, P4 | As soon as approver says these must remain separate, we must go back and take them out of all catchments | P5-P3 (NID) | P5-P4 (NID+Ph.No) | P5-P3(NID) P5-P4 (NID+Ph.No) |
...
Searches all patients by phone # ph1ph2. No matching patient. No action taken.
...
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A10B11 | h1 | h2 | nid | nid1 | t1 |
A10B11C12 | h1 | h2 | nid | nid1 | t2 |
A10B11 | h1 | h3 | nid | nid1 | t9 |
A10B11C12 | h1 | h3 | nid | nid1 | t10 |
A20B21 | h2 | h1 | nid | nid1 | t3 |
A20B21C22 | h2 | h1 | nid | nid1 | t4 |
A20B21 | h2 | h3 | nid | nid1 | t11 |
A20B21C22 | h2 | h3 | nid | nid1 | t12 |
A30B31 | h3 | h1 | nid | nid1 | t5 |
A30B31C32 | h3 | h1 | nid | nid1 | t6 |
A30B31 | h3 | h2 | nid | nid1 | t7 |
A30B31C32 | h3 | h2 | nid | nid1 | t8 |
Searches all patients by phone # ph3. No matching patient. No action taken.
Scenario 4:
#1 select hid2 from dedup where catchment in (1011, 101112) and hid1 = h1.
...
#4 Dedup db looks same as in scenario 3.
Note: If catchment is updated, it will be similar to scenario 3.
Scenario 5:
Similar to scenario 4.
Dedup db:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A20B21 | h2 | h3 | nid | nid1 | t11 |
A20B21C22 | h2 | h3 | nid | nid1 | t12 |
A30B31 | h3 | h2 | nid | nid1 | t7 |
A30B31C32 | h3 | h2 | nid | nid1 | t8 |
Scenario 6:
Similar to scenario 4. Dedup table becomes empty.
Scenario 8:
Similar to scenario 2.
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h4 | nid | nid1 | t.. |
A30B31C32 | h3 | h4 | nid | nid1 | t.. |
A40B41 | h4 | h3 | nid | nid1 | t.. |
A40B41C42 | h4 | h3 | nid | nid1 | t.. |
Scenario 9:
Similar to scenario 2.
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h4 | nid | nid1 | t.. |
A30B31C32 | h3 | h4 | nid | nid1 | t.. |
A30B31 | h3 | h5 | nid | nid1 | t.. |
A30B31C32 | h3 | h5 | nid | nid1 | t.. |
A40B41 | h4 | h3 | nid | nid1 | t.. |
A40B41C42 | h4 | h3 | nid | nid1 | t.. |
A40B41 | h4 | h5 | phone # | ph1 | t.. |
A40B41C42 | h4 | h5 | phone # | ph1 | t.. |
A40B41 | h4 | h5 | nid | nid1 | t.. |
A40B41C42 | h4 | h5 | nid | nid1 | t.. |
A50B51 | h5 | h3 | nid | nid1 | t.. |
A50B51C52 | h5 | h3 | nid | nid1 | t.. |
A50B51 | h5 | h4 | phone # | ph1 | t.. |
A50B51C52 | h5 | h4 | phone # | ph1 | t.. |
Scenario 10:
p3 and p4 are no more dupes.
#1. select catchment from patient where hid in (h3, h4)
#2. delete from dedup where catchment in (..) and hid1 in (h3, h4) and hid2 in (h3, h4) [in won't work. so run separate queries in batch]
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h5 | nid | nid1 | t.. |
A30B31C32 | h3 | h5 | nid | nid1 | t.. |
A40B41 | h4 | h5 | phone # | ph1 | t.. |
A40B41C42 | h4 | h5 | phone # | ph1 | t.. |
A40B41 | h4 | h5 | nid | nid1 | t.. |
A40B41C42 | h4 | h5 | nid | nid1 | t.. |
A50B51 | h5 | h3 | nid | nid1 | t.. |
A50B51C52 | h5 | h3 | nid | nid1 | t.. |
A50B51 | h5 | h4 | phone # | ph1 | t.. |
A50B51C52 | h5 | h4 | phone # | nid1 | t.. |
Find duplicates by catchment:
select * from dedup where catchment = 102030.
Then sort the result by timeuuid (time component only) in-memory and return.