Dedup

Dedup

DB design: 

CREATE TABLE dedup (
catchment_id varchar,
health_id1 varchar,
health_id2 varchar,
rule_name varchar,
rule_value varchar,
created_at timeuuid,
PRIMARY KEY (catchment_id, health_id1, health_id2, rule_name, rule_value));

Example:

Feed - p1 (create), p2 (create), p3 (create), p1 (updated occupation), p1 (updated nid2), p2 (voided, merged-with p3), h4 (create), p5 (create). pn represents HID.
All have same NID, nid1 but different catchments, c1, c2, c3, c4, c5. Only, h4, h5 have same phone #. c1 - 101112, c2 - 202122, c3 - 303133, so on.
Rules: same nid and same phone # are dedups. Dedups cannot be resolved for just one rule, but for all.

 

Scenarios

Action

Event in Feed

Scheduled job runs

Dupes in C1

Dupes in C2

Dupes in C3

Dupes in C4

Dupes in C5

1

Patient created (P1)

created:p1

No dupes found.

 

 

 

 

 

2

Patient created (P2)

created:p2

Finds P1 with same NID

P2-P1 (NID)

P2-P1(NID)

 

 

 

3

Patient created (P3)

created:p3

Finds that P3 shares NID with P1 and P2

P2-P1(NID)
P3-P1(NID) +

P2-P1(NID)
P3-P2(NID) +

P3-P1(NID) +
P3-P2(NID) +

 

 

4

Patient updated (P1)

updated:p1 (occupation)

All entries related to p1 are deleted and recreated.

P2-P1(NID)
P3-P1(NID)

P2-P1(NID)
P3-P2(NID)

P3-P1(NID)
P3-P2(NID)

 

 

5

Patient updated (P1)

updated:p1 (NID)

All entries related to p1 are deleted and recreated.

Empty

P3-P2(NID)

P3-P2(NID)

 

 

6

Resolution. P2-P3 conflict for all rules.
p2 - voided and merged with p3

voided:P2 (merged with p3)

Since P2 does not exist anymore, we must remove all pairs with P2 from all dedup tables.

Empty

Empty

Empty

 

 

7

 

updated:P3

Run deduplication rule anyway. Found no duplicates

Empty

Empty

Empty

 

 

8

Patient created (P4)

created: P4

P4 matches P3 in NID. (P1's NID has changed in earlier step.)

 

 

P4-P3(NID) +


P4-P3(NID) +

 

9

Patient created (P5)

created:P5

P5 is detected for NID duplication, matching with P3, P4. And Also with P4 for Ph. num

 

 

P4-P3(NID)
P5-P3 (NID) +

P4-P3(NID)
P5-P4 (NID+Ph.No) +

P5-P3(NID) +
P5-P4 (NID+Ph.No) +

10

P3-P4 resolved for all rules,
both are unique.

ignore-duplicate: P3, P4

As soon as approver says these must remain separate, we must go back and take them out of all catchments

 

 

P5-P3 (NID)

P5-P4 (NID+Ph.No)

P5-P3(NID)
P5-P4 (NID+Ph.No)

Scenario 2 (DB Calls): 

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2)

#3. Dedup DB:

Catchment

HID1

HID2

Rule Name

Rule Value

Created At

A10B11

h1

h2

nid

nid1

t1

A10B11C12

h1

h2

nid

nid1

t2

 

 

 

 

 

 

A20B21

h2

h1

nid

nid1

t3

A20B21C22

h2

h1

nid

nid1

t4

Searches all patients by phone # ph2. No matching patient. No action taken.

Scenario 3: 

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2, h3)

#3. Dedup DB:

Catchment

HID1

HID2

Rule Name

Rule Value

Created At

A10B11

h1

h2

nid

nid1

t1

A10B11C12

h1

h2

nid

nid1

t2

A10B11

h1

h3

nid

nid1

t9

A10B11C12

h1

h3

nid

nid1

t10

 

 

 

 

 

 

A20B21

h2

h1

nid

nid1

t3

A20B21C22

h2

h1

nid

nid1

t4

A20B21

h2

h3

nid

nid1

t11

A20B21C22

h2

h3

nid

nid1

t12

 

 

 

 

 

 

A30B31

h3

h1

nid

nid1

t5

A30B31C32

h3

h1

nid

nid1

t6

A30B31

h3

h2

nid

nid1

t7

A30B31C32

h3

h2

nid

nid1

t8

Searches all patients by phone # ph3. No matching patient. No action taken.

Scenario 4: 

#1 select hid2 from dedup where catchment in (1011, 101112) and hid1 = h1. 

#2 select catchment from patient where hid = .. (for each hid returned in #1)

#3

(Batch begin)

delete from dedup where catchment_id in (1011, 101112) and hid1=h1

delete from dedup where catchment_id in (...) and hid1=.. and hid2=h1 (repeated as many times required)

repeat scenario 2.

(Batch ends)

#4 Dedup db looks same as in scenario 3.

Note: If catchment is updated, it will be similar to scenario 3.

Scenario 5: 

Similar to scenario 4. 

Dedup db:

Catchment

HID1

HID2

Rule Name

Rule Value

Created At

A20B21

h2

h3

nid

nid1

t11

A20B21C22

h2

h3

nid

nid1

t12

 

 

 

 

 

 

A30B31

h3

h2

nid

nid1

t7

A30B31C32

h3

h2

nid

nid1

t8

 

Scenario 6:

Similar to scenario 4. Dedup table becomes empty.

Scenario 8:

Similar to scenario 2. 

Dedup table:

Catchment

HID1

HID2

Rule Name

Rule Value

Created At

A30B31

h3

h4

nid

nid1

t..

A30B31C32

h3

h4

nid

nid1

t..