Smart LeetCode Practice: Navigating Problems with NetworkX
As we returned from our respective internships in the third semester of our MTech programs, we approached what was arguably the most intense and stressful phase: placement preparation. Ideally, one would have started early, but the demanding curriculum had left little room for dedicated prep. The challenge, then, was clear: how could we achieve the best results with minimal yet strategic preparation?
Given the number of companies hiring for SDE roles, most candidates naturally focus on practicing Data Structures and Algorithms (DSA) problems. This preparation typically falls into one of two approaches:
Most of us, however, land somewhere in between, attempting to cover a diverse set of commonly asked problems with enough variation to maximize exposure to different problem patterns. For that, we rely on sheets curated by professionals.
A bottleneck emerges while covering diverse topics: the Paradox of Choice. With an ever-growing influx of problem sheets, we often waste time deciding where to start. This also introduces a Developer Bias, as these sheets are typically curated by individuals whose selection criteria may not align with our specific needs or weaknesses.
This raised fundamental questions about effective problem progression in such a broad domain. Is there a universally ideal starting problem for each topic? What sequence of problems maximizes conceptual understanding? More importantly, can we move beyond individual intuition and develop data-driven procedures that offer generalizable learning strategies?
Modelling
If we assume that problems are related, a graph is a natural choice to represent these relationships. The edges must be directed since progression is a key requirement, indicating the flow from simpler to more complex problems. Additionally, edges should only exist between similar problems to ensure meaningful transitions, and weights can be assigned based on relative difficulty to guide an optimal learning path.
Feature Selection
To establish meaningful connections between problems, we need strong features that effectively convey similarity. When visiting a problem page, we are presented with multiple attributes:
For difficulty, we relied on the community-maintained LeetCode Problem Rating list; thanks to ZeroTrac and the community for compiling this valuable resource.
Exploratory Data Analysis
With the dataset established, exploration commenced. A straightforward linear correlation analysis of the tags produced the following results:
Analyzing the top correlating solution tags reveals expected relationships:
These correlations align with intuitive expectations. DFS and BFS commonly appear together due to their fundamental tree and graph traversal roles. The strong association between recursion and trees further supports its prevalence in pre- and post-order traversal problems.
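To make this concrete, here is a minimal sketch of how such a tag correlation analysis could be computed with pandas, assuming a hypothetical problems DataFrame whose tags column holds the list of solution tags per problem:

import numpy as np
import pandas as pd

# Hypothetical input: `problems` is a DataFrame with one row per problem and
# a "tags" column holding the list of solution tags for that problem.
tag_matrix = problems["tags"].str.join("|").str.get_dummies()  # one-hot tags

# Pearson correlation between every pair of tags.
corr = tag_matrix.corr()

# Keep only the upper triangle so each pair appears once, then rank them.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).stack().sort_values(ascending=False)
print(top_pairs.head(10))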
Since our feature vectors become sparse due to the high number of unique tags, we can employ Principal Component Analysis (PCA) to obtain denser embeddings while reducing redundancy. This allows for more compact and meaningful representations of problem relationships.
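As a rough sketch, the reduction could look like the following, where the number of components is an illustrative choice rather than the value used in the original analysis:

from sklearn.decomposition import PCA

# tag_matrix: the one-hot tag matrix from above (problems x tags).
pca = PCA(n_components=32, random_state=42)   # 32 components is illustrative
embeddings = pca.fit_transform(tag_matrix)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))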
Now that the feature vectors were available, the next step was determining an appropriate threshold for identifying neighboring nodes. One approach was to set a similarity threshold, considering an edge between two nodes only if their cosine similarity exceeded a certain value. Such a graph was defined at various thresholds, and its structural statistics were analyzed at each level.
The analysis was initially performed at thresholds of 25, 50, 75, and 90, before being narrowed down to a more refined range. The goal was to determine a threshold that minimized the number of connected components, ensuring that all problems remained reachable from each other. Outlier detection was conducted based on the total number of submitted tags to address the issue of small-sized clusters, but it did not yield any significant improvements.
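A minimal sketch of such a sweep, assuming `embeddings` holds the per-problem feature vectors and interpreting the thresholds as cosine-similarity values expressed as percentages:

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings)   # pairwise cosine similarity
np.fill_diagonal(sim, 0)              # ignore self-similarity

for pct in [25, 50, 75, 90]:
    thresh = pct / 100.0
    adj = (sim >= thresh).astype(int)
    G = nx.from_numpy_array(adj)      # undirected graph at this threshold
    print(f"threshold {thresh:.2f}: {G.number_of_edges()} edges, "
          f"{nx.number_connected_components(G)} connected components")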
Clustering was employed as an approach for topic modeling, where the maximum voted tags within a cluster could assist in unsupervised classification of a set of problems. However, this approach introduced a challenge.
Addressing this problem at an algorithmic level required devising a heuristic to manage small cluster sizes. The routine below recursively bisects clusters, enforcing a constraint on the maximum cluster size so that no cluster exceeds a predefined limit.
import numpy as np
from sklearn.cluster import KMeans

def RecursiveClustering(features, max_size, random_state=42):
    """Recursively bisect clusters with K-Means until every cluster
    contains at most max_size points. Returns one label per row."""
    features = np.asarray(features)
    # Start with all points unlabeled
    labels = np.full(len(features), -1, dtype=int)

    # Helper function to handle clustering of one subset of row indices
    def process_cluster(indices, label):
        # Small enough: label all these points with the current label
        if len(indices) <= max_size:
            labels[indices] = label
            return label + 1  # next available label
        try:
            # Try to split the cluster into two groups
            km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
            assignment = km.fit_predict(features[indices])
            group1 = indices[assignment == 0]
            group2 = indices[assignment == 1]
            # Check if the split actually worked
            if len(group1) == 0 or len(group2) == 0:
                raise ValueError("degenerate split")  # force fallback
            # Recurse on both groups, threading the next free label through
            next_label = process_cluster(group1, label)
            next_label = process_cluster(group2, next_label)
            return next_label
        except ValueError:
            # If anything went wrong, label the entire subset as one cluster
            labels[indices] = label
            return label + 1

    # Start with all data points as the first cluster
    process_cluster(np.arange(len(features)), 0)
    return labels
Clustering solely using K-Means resulted in a large, dense cluster where many problems were grouped together, while others were confined to much smaller clusters. This imbalance indicated that K-Means struggled to properly capture the structural relationships between problems.
The cluster sizes improved after enforcing the maximum size constraint, but the issue of small clusters still persisted. To address this, a merging routine would be required, potentially introducing an additional constraint to ensure better cluster balancing. This aspect remains an open challenge and will be explored as part of future work.
Inferencing
The graph is now constructed using the NetworkX library, a Python package designed for the creation, manipulation, and analysis of complex networks. An edge is established between nodes i and j whenever the cosine similarity of their feature vectors exceeds 0.7; each edge is directed from the easier problem to the harder one and weighted by the absolute difference in their ratings, so that the direction reflects the relative difficulty between problems.
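A minimal sketch of this construction, assuming `embeddings` are the reduced feature vectors, `problem_ids` their corresponding LeetCode IDs, and `ratings` the ZeroTrac difficulty ratings keyed by problem ID (all three names are placeholders):

import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings)
G = nx.DiGraph()
G.add_nodes_from(problem_ids)                     # one node per problem

for a in range(len(problem_ids)):
    for b in range(a + 1, len(problem_ids)):
        if sim[a, b] > 0.7:                        # similarity threshold
            u, v = problem_ids[a], problem_ids[b]
            if ratings[u] > ratings[v]:            # direct easier -> harder
                u, v = v, u
            G.add_edge(u, v, weight=abs(ratings[u] - ratings[v]))

isolated = [n for n in G.nodes if G.degree(n) == 0]
print("Problem IDs with zero degree:", isolated)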
Problem IDs with zero degree: [1584, 1072, 838, 769, 900, 1267, 2273, 2352, 821, 2293]
Note: t-SNE is a dimensionality reduction technique and may distort original high-dimensional relationships, making some points appear closer than they actually are.
Most edges exhibit a relatively small difficulty gap between their endpoints, as seen in the graph. Additionally, a significant number of problems have few neighbors, with a second peak in the degree distribution likely representing more generic problems that act as gateways to multiple concepts.
An easy array-focused DP problem is a great starting point, while hard multi-topic problems serve as ideal endpoints for mastering concepts.
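As an illustration of how such a progression could be read off the graph built above, a shortest-path query that minimises the cumulative difficulty jump (the problem IDs here are placeholders, not recommendations from the original analysis):

# Hypothetical example: a gentle practice sequence from an assumed easy
# starting problem to an assumed hard endpoint.
start, end = 70, 123                    # placeholder problem IDs
path = nx.shortest_path(G, source=start, target=end, weight="weight")
print("suggested practice order:", path)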
Summary
Future Work