Data deduplication is a technique for eliminating redundant data in a dataset. During deduplication, extra copies of the same data are deleted, leaving only one copy to be stored.
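The workflow below finds matching records between two given datasets in two steps:
It first joins the two datasets on the column “State”, so that only records from the same state are compared.
It then applies distance algorithms to a few fields to measure the distance between the joined records; a small distance indicates a likely match.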
The tutorial for this workflow is available here: https://docs.sparkflows.io/en/latest/tutorials/data-engineering/dedup-customers.html
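
The join-then-compare idea described above can be sketched in plain Python. This is only an illustration, not the Sparkflows implementation: the field names (`id`, `name`, `city`, `state`), the choice of Levenshtein edit distance as the distance algorithm, and the `max_distance` threshold are all assumptions made for the example.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_matches(left, right, fields=("name", "city"), max_distance=2):
    """Return (left_id, right_id, distances) for likely duplicate pairs.

    The field names and the threshold are hypothetical, chosen for illustration.
    """
    # Step 1: join on "state" so only records from the same state are compared.
    by_state = defaultdict(list)
    for rec in right:
        by_state[rec["state"]].append(rec)

    matches = []
    for l_rec in left:
        for r_rec in by_state.get(l_rec["state"], []):
            # Step 2: compute a distance per field on the joined pair.
            distances = {
                f: levenshtein(l_rec[f].lower(), r_rec[f].lower()) for f in fields
            }
            # Keep the pair only if every field is within the threshold.
            if all(d <= max_distance for d in distances.values()):
                matches.append((l_rec["id"], r_rec["id"], distances))
    return matches


# Made-up sample data for the two datasets.
customers_a = [
    {"id": "a1", "name": "John Smith", "city": "Austin", "state": "TX"},
    {"id": "a2", "name": "Mary Jones", "city": "Boston", "state": "MA"},
]
customers_b = [
    {"id": "b1", "name": "Jon Smith", "city": "Austin", "state": "TX"},
    {"id": "b2", "name": "Mary Jones", "city": "Austin", "state": "TX"},
    {"id": "b3", "name": "Marie Jonas", "city": "Boston", "state": "MA"},
]

matches = find_matches(customers_a, customers_b)
print(matches)
```

With this sample data, only the pair `a1`/`b1` falls within the threshold on both fields. Joining on the state first keeps the number of pairwise comparisons small, since the distance functions run only on records that already share a state.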