Graph Theory for Advanced Data Deduplication

Author
Shaswat Pandey
Date
Dec 16, 2024
Category
Data
Data quality

Introduction

In the era of big data, ensuring the accuracy and reliability of information is crucial for businesses to make informed decisions. One of the most critical tasks in data management is data deduplication, a process designed to eliminate redundant copies of data and ensure that each piece of information is unique and consistent. There are several approaches to data deduplication; in this blog post, we will explore a graph-based approach built with the NetworkX library.

Overview of Data Deduplication

Data deduplication refers to the technique of identifying and removing duplicate copies of data within a dataset. This process is essential for several reasons:
  • First, it improves data quality by eliminating redundancies that can lead to inconsistencies and errors.
  • Second, it optimizes storage efficiency by reducing the amount of space required to store data, which can significantly lower operational costs.
  • Finally, deduplication enhances data processing performance by streamlining access to unique records.
In essence, deduplication is a fundamental practice for maintaining the integrity and efficiency of data systems.
Data deduplication is challenging when datasets contain multiple unique identifiers. Customers may appear under different IDs or with variations such as misspelled names or differing address formats. These discrepancies complicate identifying and merging records, requiring advanced techniques that recognize and resolve the variations to arrive at a single, accurate customer record.

Introduction to NetworkX and Tools

Graph theory provides an effective approach to customer data deduplication, with NetworkX as a powerful tool. Using its advanced algorithms, NetworkX helps identify and consolidate duplicates, streamlining the deduplication process and improving data accuracy.

Tools and Technologies Used

In the data deduplication process, several powerful tools and technologies can be used to efficiently handle and analyze complex datasets. Here’s a look at the key tools:
  • NetworkX: a versatile Python library for the creation, manipulation, and analysis of complex networks and graphs. It provides a rich set of features for handling both undirected and directed graphs, making it an ideal choice for network analysis tasks. In our project, we used NetworkX to model profile relationships as a graph, with profiles as nodes and relationships (like shared phone numbers and emails) as edges. NetworkX enabled us to construct individual graphs from various datasets, merge them into a unified graph, and identify connected components to efficiently group and deduplicate profiles.
  • FuzzyWuzzy: a Python library for fuzzy string matching, which is useful for comparing and matching strings that may have minor differences or typos. It utilizes Levenshtein distance to calculate the differences between sequences. We used FuzzyWuzzy to handle variations and misspellings in profile attributes, performing fuzzy matching to merge similar profiles and improve the accuracy of the deduplication process (a small sketch follows this list).
  • Pandas: a powerful and widely-used data manipulation library in Python. It provides data structures like DataFrames and Series that facilitate data cleaning, transformation, and analysis. Pandas was essential for transforming and processing data in our deduplication workflow: it let us convert connected components into a DataFrame, reshape data with functions like pd.wide_to_long(), and remove duplicates with drop_duplicates() to ensure unique profile IDs in the final dataset.
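To illustrate the fuzzy-matching idea, here is a minimal sketch using FuzzyWuzzy (assuming the fuzzywuzzy package is installed); the names are made up, and the 80% threshold mirrors the one used later in the implementation:

from fuzzywuzzy import fuzz

# Two spellings that likely refer to the same customer ("Smith" vs. "Sith").
score = fuzz.ratio("smith", "sith")  # 0-100 similarity based on edit distance
print(score)  # a high score (roughly 89 here) suggests the same person

# Profiles scoring above a chosen threshold (e.g. 80) are treated as candidate duplicates.
is_candidate_duplicate = score >= 80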

Use Case: Customer Data Management

Imagine two hotels, each maintaining separate customer databases with unique IDs for guests. After their acquisition by XYZ Company, the challenge arises of merging these databases. Many customers have multiple IDs, assigned independently by each hotel. To address this, XYZ must deduplicate the data, consolidating each customer’s records into a single entry. A new unique ID will be generated, combining the first four letters of the customer’s name with the last four digits of their contact number.
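As a small worked example of this ID rule, here is a hypothetical helper (not part of the actual pipeline, purely illustrative):

# Hypothetical illustration of the new unique ID: first four letters of the
# customer's name plus the last four digits of their contact number.
def build_customer_id(name: str, contact_number: str) -> str:
    letters = "".join(ch for ch in name if ch.isalpha())[:4].upper()
    digits = "".join(ch for ch in contact_number if ch.isdigit())[-4:]
    return letters + digits

print(build_customer_id("John Smith", "+1 555-014-4821"))  # JOHN4821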
Let’s say there are two distinct databases for the two hotels, each consisting of 2-3 million records. The records are partitioned by month and stored on-premises.

Deduplication Challenges

In some cases, records for the same customer have variations in their details. For instance, one database may record the name as "Smith," while another lists it as "Sith," referring to the same individual. Similarly, a single customer with two phone numbers might appear in different databases under different numbers, complicating the identification of duplicates.

Methodology


Data Preprocessing

Data preprocessing is the most important step to complete before data deduplication. It involves several stages:
Data Collection and Integration
If there are two different databases for two different hotels, we want to identify unique customers. We only want relevant data to improve the deduplication process. Hence, collecting relevant columns from each dataset and storing them in a single place is an important step.
Data Cleaning
Before starting deduplication, the data must be cleaned by eliminating extra spaces and stray symbols and bringing every record into this format: First_Name, Middle_Name, Last_Name, Email, Contact_Number.
You may use isnull() or dropna() to identify and remove missing values, and fillna() to impute them. Use drop_duplicates() to eliminate duplicate rows before joining. Use df[col].str.strip() to remove extra spaces, and regular expressions (e.g. \s+, [@#$]) to strip leftover whitespace and symbols.
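A minimal pandas sketch of these cleaning steps (the column names and values are illustrative):

import pandas as pd

# Illustrative cleaning of a raw customer table (column names are assumptions).
df = pd.DataFrame({
    "First_Name": [" John# ", "Jane", None],
    "Last_Name": ["Smith", "Doe ", "Lee"],
    "Email": ["john@example.com ", None, "mark@example.com"],
    "Contact_Number": ["555-0144", "555-0199", "555-0199"],
})

# Inspect and handle missing values.
print(df.isnull().sum())
df = df.dropna(subset=["First_Name"])        # drop rows missing a critical field
df["Email"] = df["Email"].fillna("unknown")  # or impute a placeholder

# Trim extra spaces and strip stray symbols from the name columns.
for name_col in ["First_Name", "Last_Name"]:
    df[name_col] = df[name_col].str.strip().str.replace(r"[@#$]", "", regex=True)
    df[name_col] = df[name_col].str.replace(r"\s+", " ", regex=True)

# Remove exact duplicate rows before joining datasets.
df = df.drop_duplicates()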
Data Standardization
Data standardization is essential for effective deduplication, aligning disparate data sources to ensure consistency before redundancy checks. In NetworkX-based deduplication, standardization enables accurate graph construction by ensuring nodes and edges consistently represent data characteristics. Normalizing formats, such as text case, encoding, and date formats, creates a cohesive dataset, improving the precision and performance of NetworkX algorithms in detecting duplicates and relationships.
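For example, a minimal standardization pass (with illustrative column names) might lower-case names and reduce phone numbers to digits so that equivalent records line up:

import pandas as pd

# Illustrative standardization before graph construction (column names are assumptions).
df = pd.DataFrame({
    "Full_Name": ["JOHN SMITH", "john  smith"],
    "Contact_Number": ["(555) 014-4821", "555-014-4821"],
})

# Normalize text case and whitespace so equivalent names compare equal.
df["Full_Name"] = df["Full_Name"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()

# Keep digits only so differently formatted phone numbers line up.
df["Contact_Number"] = df["Contact_Number"].str.replace(r"\D", "", regex=True)

print(df.drop_duplicates())  # both rows now standardize to the same values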

Graph Theory and NetworkX

With NetworkX, users can easily create graphs, add and remove nodes and edges, and perform various graph-based algorithms such as finding shortest paths, computing centrality measures, detecting communities, and more.
Nodes and Edges: In graph theory, a graph consists of nodes (vertices) and edges (connections between nodes). In this context, each profile (e.g., based on phone numbers, emails, or addresses) is represented as a node, and the relationships between profiles (e.g., shared phone numbers or email addresses) are represented as edges.
  • nx.Graph(): Creates an empty undirected graph in NetworkX. An undirected graph is used because the relationships between profiles do not have a direction; they are bidirectional.

Building the Graph

The graph is constructed by adding nodes and edges that represent the data, as sketched after the list below. Multiple graphs are built from different datasets (e.g., phone numbers, email addresses).
  • G.add_nodes_from(nodes): Adds multiple nodes to the graph from a list of nodes.
  • G.add_edges_from(edges): Adds multiple edges to the graph from a list of edges.
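Here is a toy sketch of that construction; the profile IDs and relationships are made up:

import networkx as nx

# Hypothetical profile IDs and the pairs linked by a shared attribute.
email_nodes = ["h1_1", "h2_7", "h1_3", "h2_3"]
email_edges = [("h1_1", "h2_7"), ("h1_3", "h2_3")]  # same email on both sides

phone_nodes = ["h1_3", "h2_3", "h1_9"]
phone_edges = [("h1_3", "h2_3")]                    # same phone number

# One undirected graph per relationship type.
G_email = nx.Graph()
G_email.add_nodes_from(email_nodes)
G_email.add_edges_from(email_edges)

G_phone = nx.Graph()
G_phone.add_nodes_from(phone_nodes)
G_phone.add_edges_from(phone_edges)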

Merging and Resolving Duplicates

1. Merging Graphs - Graph Composition:
nx.compose_all(graphs): Merges a list of graphs into a single graph, combining all nodes and edges from the individual graphs into one unified graph. This consolidates information from different datasets into one graph.
2. Identifying Connected Components
nx.connected_components(G): Identifies all connected components in an undirected graph, returning a generator of sets where each set contains the nodes of one connected component. A connected component in an undirected graph is a subset of nodes such that there is a path between any two nodes in the subset. In this context, connected components represent clusters of profiles that are interconnected through shared relationships; a minimal example follows.
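A self-contained toy example of the merge-and-cluster step (profile IDs are made up):

import networkx as nx

# Two small graphs built from different attributes (toy data).
G_email = nx.Graph([("h1_1", "h2_7"), ("h1_3", "h2_3")])
G_phone = nx.Graph([("h1_3", "h2_3")])
G_phone.add_node("h1_9")  # a profile with no match in the other hotel

# Merge the per-attribute graphs and extract clusters of linked profiles.
merged = nx.compose_all([G_email, G_phone])
for component in nx.connected_components(merged):
    print(component)
# {'h1_1', 'h2_7'}  {'h1_3', 'h2_3'}  {'h1_9'}
# Each set is one deduplicated customer: every profile ID in it refers to the same person.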

Implementation

Initially, there are duplicate records for the same customers:
import networkx as nx
import pandas as pd
from pyspark.sql.functions import col, concat, expr, greatest, length, levenshtein, lit, lower, row_number
from pyspark.sql.window import Window

hotel_1 = spark.table('hotel.hotel_data1')
hotel_2 = spark.table('hotel.hotel_data2')
master_hotel_data = hotel_1.union(hotel_2)
display(master_hotel_data)
Data standardization and processing: combining the first and last names into a full_name column that will be used for calculating string similarity:
# Assuming hotel_data1 and hotel_data2 are loaded from the corresponding hotel tables.
hotel_data1 = hotel_data1 \
    .withColumn('full_name', concat(lower(col("first_name")), lower(col("last_name")))) \
    .withColumn("profile_id", concat(lit("h1_"), col("profile_id")))
hotel_data2 = hotel_data2 \
    .withColumn('full_name', concat(lower(col("first_name")), lower(col("last_name")))) \
    .withColumn("profile_id", concat(lit("h2_"), col("profile_id")))
Adding a “b_” prefix to every column of the second hotel’s data in order to distinguish the columns of the two tables after joining:
b_hotel_data2 = hotel_data2
for column in b_hotel_data2.columns:
    b_hotel_data2 = b_hotel_data2.withColumnRenamed(column, "b_" + column)
Data deduplication: joining the two tables on email ID and keeping only the records with a name similarity of at least 80%:
customers_with_same_email = hotel_data1.join(
    b_hotel_data2, hotel_data1.email_id == b_hotel_data2.b_email_id, 'left'
).select("*")
customers_with_same_email_similarity = customers_with_same_email \
    .withColumn("levenshtein_distance", levenshtein(col("full_name"), col("b_full_name"))) \
    .withColumn("max_length", greatest(length(col("full_name")), length(col("b_full_name")))) \
    .withColumn("similarity_percentage", expr("100 * (1 - levenshtein_distance / max_length)"))
customers_with_same_email_max_similarity = customers_with_same_email_similarity.filter(col("similarity_percentage") >= 80)
display(customers_with_same_email_max_similarity)
 
Joining the two tables on contact number and keeping only the records with a name similarity of at least 80%:
customers_with_same_contactno = hotel_data1.join(
    b_hotel_data2, hotel_data1.contact_number == b_hotel_data2.b_contact_number, 'left'
).select("*")
customers_with_same_contactno_similarity = customers_with_same_contactno \
    .withColumn("levenshtein_distance", levenshtein(col("full_name"), col("b_full_name"))) \
    .withColumn("max_length", greatest(length(col("full_name")), length(col("b_full_name")))) \
    .withColumn("similarity_percentage", expr("100 * (1 - levenshtein_distance / max_length)"))
customers_with_same_contactno_max_similarity = customers_with_same_contactno_similarity.filter(col("similarity_percentage") >= 80)
display(customers_with_same_contactno_max_similarity.select("profile_id", "first_name", "last_name", "b_profile_id", "b_first_name", "b_last_name", "similarity_percentage"))
Building a graph for the customers sharing the same email ID:
email_similar = customers_with_same_email_max_similarity.select("*").dropDuplicates()
email_similar_id = [data[0] for data in email_similar.select('profile_id', 'b_profile_id').toLocalIterator()]
v4 = email_similar.select('profile_id', 'b_profile_id').rdd.map(tuple).collect()
G4 = nx.Graph()
G4.add_nodes_from(email_similar_id)
G4.add_edges_from(v4)
Building a graph for the customers sharing the same contact number:
contactno_similar = customers_with_same_contactno_max_similarity.select("*").dropDuplicates()
contactno_similar_id = [data[0] for data in contactno_similar.select('profile_id', 'b_profile_id').toLocalIterator()]
v5 = contactno_similar.select('profile_id', 'b_profile_id').rdd.map(tuple).collect()
G5 = nx.Graph()
G5.add_nodes_from(contactno_similar_id)
G5.add_edges_from(v5)
Composing both graphs in order to get the connected components:
Final_graph = nx.compose_all([G4, G5])
final_undirected = Final_graph.to_undirected()
final_connected_profiles = nx.connected_components(final_undirected)
After obtaining the connected components, we can fetch the unique customers that appear under multiple IDs:
profiles_mapped = pd.DataFrame(list(final_connected_profiles))
mapping_keys = spark.createDataFrame(profiles_mapped)
for c in mapping_keys.columns:
    mapping_keys = mapping_keys.withColumnRenamed(c, "profile_" + c)
display(mapping_keys)
 
For example, h1_3 and h2_3 are the same person but had different IDs.
Performing an inner join with master_data, which contains the data from both hotels:
df1 = spark.table('hotel.hotel_data1')
df2 = spark.table('hotel.hotel_data2')
master_data = df1.union(df2)
mapping_unique_id = mapping_keys.select("profile_0")
unique_customer = mapping_unique_id.join(master_data, mapping_unique_id.profile_0 == master_data.profile_id, "inner")
display(unique_customer)
 
Assigning a new customer ID to each customer in the deduplicated data:
window_spec = Window.orderBy("full_name")
final_dedup_data = unique_customer.withColumn(
    "unique_customer_id", concat(lit("TJ_"), row_number().over(window_spec))
)
display(final_dedup_data.select("unique_customer_id", "first_name", "last_name", "contact_number", "email_id"))

Results and Analysis

Deduplication Outcomes

The deduplication process led to significant improvements in the dataset’s quality and usability. Before deduplication, the dataset was plagued by redundancy and inconsistencies, impacting data accuracy and efficiency. After the deduplication process, the dataset was streamlined, with duplicate records merged into unified profiles, resulting in enhanced data quality and more accurate insights.

Performance Metrics

The performance metrics demonstrate the significant improvements achieved through the deduplication process. Before deduplication, the dataset suffered from high record counts, substantial duplication, and associated inefficiencies. Post-deduplication, the reduction in record count, improved data quality, and faster processing times highlight the success of the deduplication efforts in optimizing data management and analysis. These metrics underscore the value of deduplication in enhancing the performance and reliability of data systems.

Conclusion

Effective data deduplication is the cornerstone of reliable analytics and informed decision-making in today's data-driven world. Leveraging advanced tools like NetworkX and robust preprocessing techniques empowers organizations to consolidate fragmented records, enhance data quality, and streamline operations. The methodologies demonstrated here not only resolve redundancies but also unlock the full potential of data, driving efficiency, accuracy, and actionable insights.
