Smart LeetCode Practice: Navigating Problems with NetworkX

As we returned from our respective internships in the third semester of our MTech programs, we approached what was arguably the most intense and stressful phase: placement preparation. Ideally, one would have started early, but the demanding curriculum had left little room for dedicated prep. The challenge, then, was clear: how could we achieve the best results with minimal yet strategic preparation?

Given the many companies hiring for SDE roles, most naturally focus on practicing Data Structures and Algorithms (DSA) problems. This preparation typically falls into two approaches:

  1. Patiently solving problems from a standard textbook, our choice being Algorithm Design by Jon Kleinberg and Éva Tardos (as covered in our Advanced DSA course), while carefully developing intuitive, if not formal, proofs for our solutions.
  2. Memorizing solutions to frequently asked interview problems, relying on pattern recognition rather than deep understanding.

Most of us, however, land somewhere in between, attempting to cover a diverse set of commonly asked problems with enough variation to maximize exposure to different problem patterns. For that, we rely on sheets curated by professionals.

A bottleneck emerges while covering diverse topics: the Paradox of Choice. With an ever-growing influx of problem sheets, we often waste time deciding where to start. This also introduces a Developer Bias, as these sheets are typically curated by individuals whose selection criteria may not align with our specific needs or weaknesses.

This raised fundamental questions about effective problem progression in such a broad domain. Is there a universally ideal starting problem for each topic? What sequence of problems maximizes conceptual understanding? More importantly, can we move beyond individual intuition and develop data-driven procedures that offer generalizable learning strategies?

Modelling

If we assume that problems are related, a graph is a natural choice to represent these relationships. The edges must be directed since progression is a key requirement, indicating the flow from simpler to more complex problems. Additionally, edges should only exist between similar problems to ensure meaningful transitions, and weights can be assigned based on relative difficulty to guide an optimal learning path.

Feature Selection

To establish meaningful connections between problems, we need strong features that effectively convey similarity. When visiting a problem page, we are presented with multiple attributes:

  • Problem description: might not be accurate, as it often presents an abstraction rather than the core concept. It also relies heavily on informal, natural language, which may fail to convey meaningful insight into the problem's underlying structure.
  • Topic tags: a good representation, but not very exhaustive. The assigned topics do not always cover the multiple approaches a problem can be solved with, which matters because exploring different approaches is an opportunity to learn new techniques or think outside the box.
  • Difficulty: pretty self-explanatory.
  • Solutions: not bad, but not great either. Code alone is not a strong feature, since syntax does not inherently convey much about the problem, and the structure of a solution can vary significantly with individual coding style. Pairing a solution with its explanation is also unreliable, as poorly written explanations often introduce noise with unnecessary details or provide no explanation at all.
  • User-submitted solution tags: the safest bet so far, as they provide a frequency-wise distribution of topics based on submissions from many users. Since these tags emerge from collective user input, they offer a more generalized and statistically significant representation of problem-solving approaches.


For difficulty, we rely on an existing community-compiled rating list, LeetCode Problem Rating; thanks to ZeroTrac and the community for creating and maintaining this valuable resource.

Exploratory Data Analysis

With the dataset established, exploration commenced. A straightforward linear correlation analysis of the tags produced the following results:


A heatmap of the Pearson correlation between user-submitted solution tags provides insights into how different problem-solving approaches relate to each other. Higher correlation values indicate that certain techniques frequently appear together, suggesting possible conceptual overlaps or dependencies between problem types.
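This correlation step can be sketched with pandas on a tiny made-up tag matrix (the tag names and counts below are illustrative stand-ins, not real LeetCode data):

```python
import pandas as pd

# Hypothetical tag-frequency matrix: rows are problems, columns are
# user-submitted solution tags (all counts invented for illustration).
tags = pd.DataFrame(
    {
        "dfs":       [5, 0, 4, 0, 3],
        "bfs":       [4, 0, 5, 1, 2],
        "recursion": [3, 1, 4, 0, 3],
        "two_ptr":   [0, 6, 0, 5, 0],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# Pearson correlation between tag columns; tags that co-occur often
# (here dfs/bfs) correlate strongly, unrelated tags do not.
corr = tags.corr(method="pearson")
print(corr.loc["dfs", "bfs"].round(2))
```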


Analyzing the top correlating solution tags reveals expected relationships:


Strongest relationships between algorithm tags.

These correlations align with intuitive expectations. DFS and BFS commonly appear together due to their fundamental roles in tree and graph traversal. The strong association between recursion and trees likewise reflects how often tree problems are solved with recursive pre- and post-order traversals.

Since our feature vectors become sparse due to the high number of unique tags, we can employ Principal Component Analysis (PCA) to obtain denser embeddings while reducing redundancy. This allows for more compact and meaningful representations of problem relationships.
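A minimal sketch of this reduction with scikit-learn, on a toy tag-count matrix (the values are invented; the real matrix is far wider and sparser):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy tag-count vectors: one row per problem, one column per tag.
X = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 1, 0],
    [0, 0, 0, 6, 5, 4],
    [0, 1, 0, 5, 6, 3],
    [3, 3, 4, 0, 0, 1],
])

# Project the sparse tag space down to a denser 2-D embedding.
pca = PCA(n_components=2)
embeddings = pca.fit_transform(X)
print(embeddings.shape)                     # (5, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```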


PCA feature reduction results.

Now that the feature vectors were available, the next step was determining an appropriate threshold for identifying neighboring nodes. One approach was to set a similarity threshold, considering an edge between two nodes only if their cosine similarity exceeded a certain value. Such a graph was defined at various thresholds, and its structural statistics were analyzed at each level.




The analysis was initially performed at thresholds of 25, 50, 75, and 90, before being narrowed down to a more refined range. The goal was to determine a threshold that minimized the number of connected components, ensuring that all problems remained reachable from each other. Outlier detection was conducted based on the total number of submitted tags to address the issue of small-sized clusters, but it did not yield any significant improvements.
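The threshold sweep can be sketched as follows, using random stand-in vectors in place of the real embeddings; only the mechanics are intended to match. Raising the threshold removes edges, so the number of connected components can only grow or stay the same:

```python
import itertools

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((30, 8))          # stand-in feature vectors
sim = cosine_similarity(X)

# Count connected components at each candidate similarity threshold.
counts = {}
for t in [0.25, 0.50, 0.75, 0.90]:
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i, j in itertools.combinations(range(len(X)), 2):
        if sim[i, j] > t:
            G.add_edge(i, j)
    counts[t] = nx.number_connected_components(G)

print(counts)
```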




Clustering was employed as an approach for topic modeling, where the maximum voted tags within a cluster could assist in unsupervised classification of a set of problems. However, this approach introduced a challenge.


Addressing this at an algorithmic level required a heuristic to manage cluster sizes: recursively bisecting any cluster that exceeds a predefined maximum, so that no cluster grows beyond that limit.

import numpy as np
from sklearn.cluster import KMeans


def recursive_clustering(X, max_size, random_state=0):
    """Recursively bisect clusters with 2-means until none exceeds max_size."""
    labels = np.full(len(X), -1, dtype=int)  # start with all points unlabeled

    def process_cluster(indices, label):
        if len(indices) <= max_size:
            labels[indices] = label  # small enough: label and stop
            return label + 1         # next available label

        try:
            # Split the oversized cluster into two groups with k-means (k=2)
            km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
            assignment = km.fit_predict(X[indices])
            group1 = indices[assignment == 0]
            group2 = indices[assignment == 1]

            # A degenerate split means k-means failed to separate the points
            if len(group1) == 0 or len(group2) == 0:
                raise ValueError("degenerate split")

            # Recurse on both halves
            next_label = process_cluster(group1, label)
            return process_cluster(group2, next_label)
        except ValueError:
            # Fallback: keep the cluster whole under the current label
            labels[indices] = label
            return label + 1

    process_cluster(np.arange(len(X)), 0)
    return labels



Clustering solely using K-Means resulted in a large, dense cluster where many problems were grouped together, while others were confined to much smaller clusters. This imbalance indicated that K-Means struggled to properly capture the structural relationships between problems.



The cluster sizes improved after enforcing the maximum size constraint, but the issue of small clusters still persisted. To address this, a merging routine would be required, potentially introducing an additional constraint to ensure better cluster balancing. This aspect remains an open challenge and will be explored as part of future work.


Impact of max cluster size on clustering metrics.

Inferencing

Now, the graph is defined using the NetworkX library, a Python package designed for the creation, manipulation, and analysis of complex networks. An edge is established between two nodes whenever the cosine similarity of their feature vectors exceeds 0.7, and each edge is directed from the easier problem to the harder one, with a weight equal to the absolute difference in their problem ratings.
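As a sketch of this construction, assuming embeddings and ratings are already in hand (the feature vectors and rating values below are made-up stand-ins for the PCA embeddings and the ZeroTrac ratings):

```python
import itertools

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for the real PCA embeddings and problem ratings.
features = np.array([[1.0, 0.2], [0.9, 0.3], [0.1, 1.0], [0.95, 0.25]])
ratings = {0: 1200, 1: 1450, 2: 1600, 3: 1900}

sim = cosine_similarity(features)
G = nx.DiGraph()
G.add_nodes_from(ratings)

for i, j in itertools.combinations(range(len(features)), 2):
    if sim[i, j] > 0.7:
        # Direct the edge from the easier problem to the harder one,
        # weighted by the rating gap.
        lo, hi = (i, j) if ratings[i] <= ratings[j] else (j, i)
        G.add_edge(lo, hi, weight=abs(ratings[i] - ratings[j]))

print(sorted(G.edges(data="weight")))
```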


Simple graph statistics.



Problem IDs with zero degree: [1584, 1072, 838, 769, 900, 1267, 2273, 2352, 821, 2293]

Note: t-SNE is a dimensionality reduction technique and may distort original high-dimensional relationships, making some points appear closer than they actually are.




Most edges exhibit a relatively small difficulty slope, as seen in the graph. A significant number of problems also have few neighbors, while a second peak likely represents more generic problems that act as gateways to multiple concepts.


Harder problems dominate the top-degree list as they cover more topics, linking to many simpler problems. Higher out-degree, however, is seen in easier problems that serve as stepping stones to complex ones.
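On a toy easy-to-hard graph (IDs and edges invented for illustration), this degree analysis reduces to a couple of lines:

```python
import networkx as nx

# Toy directed problem graph; every edge points easy -> hard.
G = nx.DiGraph([(1, 4), (2, 4), (3, 4), (1, 2), (1, 3)])

# Hard "sink" problems accumulate in-degree; easy "gateway" problems
# accumulate out-degree.
top_in = max(G.nodes, key=G.in_degree)
top_out = max(G.nodes, key=G.out_degree)
print(top_in, top_out)
```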



An easy array-focused DP problem is a great starting point, while hard multi-topic problems serve as ideal endpoints for mastering concepts.


Well, that escalated quickly! BFS gets you to harder problems with the fewest problems solved in between but ignores difficulty since it doesn’t consider edge weights.
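In NetworkX, this corresponds to `nx.shortest_path` without a weight argument, which runs a BFS; the toy graph below (weights are made-up rating gaps) shows how it happily takes a single steep jump:

```python
import networkx as nx

# Toy graph: edge weights are made-up rating gaps.
G = nx.DiGraph()
G.add_weighted_edges_from([(1, 2, 100), (2, 3, 100), (1, 3, 900)])

# With no weight argument, shortest_path is a BFS: fewest hops,
# even though the direct jump has a steep 900-point gap.
path = nx.shortest_path(G, source=1, target=3)
print(path)  # [1, 3]
```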


Now, while Dijkstra accounts for edge weights, it simply finds the shortest aggregated path without guaranteeing a smooth progression in difficulty.
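Passing `weight="weight"` switches the same call to Dijkstra; on the same kind of toy graph, the aggregated gap is minimised, but any individual hop can still be steep:

```python
import networkx as nx

# Same toy graph: edge weights are made-up rating gaps.
G = nx.DiGraph()
G.add_weighted_edges_from([(1, 2, 100), (2, 3, 100), (1, 3, 900)])

# With weight="weight", shortest_path runs Dijkstra: it picks the
# route with the smallest total gap (200 via node 2, not 900 direct).
path = nx.shortest_path(G, source=1, target=3, weight="weight")
print(path)  # [1, 2, 3]
```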


Creating a custom algorithm that prioritizes the lowest edge weights ensures a smoother progression but results in an excessively long path—about 257 problems in this case. The figure above highlights only 1 out of every 15 problems from that sequence.
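The article does not spell out the custom algorithm, so the sketch below is one plausible greedy interpretation of "prioritize the lowest edge weights": always follow the gentlest unvisited outgoing edge until the target is reached. This naturally produces longer but smoother paths:

```python
import networkx as nx

# Toy directed graph; weights are made-up rating gaps.
G = nx.DiGraph()
G.add_weighted_edges_from([(1, 2, 50), (1, 4, 400), (2, 3, 60), (3, 4, 70)])

def smooth_path(G, source, target):
    """Greedy sketch: repeatedly take the lowest-weight outgoing edge
    to an unvisited node until the target is reached."""
    path, node, seen = [source], source, {source}
    while node != target:
        candidates = [(d["weight"], v)
                      for _, v, d in G.edges(node, data=True)
                      if v not in seen]
        if not candidates:
            return None  # dead end in this simple sketch
        _, node = min(candidates)
        seen.add(node)
        path.append(node)
    return path

print(smooth_path(G, 1, 4))  # avoids the steep 400-weight shortcut
```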


Top-ranked problems serve as final challenge points


However, running PageRank on the inverted graph reveals different insights. High-PageRank problems now serve as key entry points. They are generally of medium difficulty, making them ideal for a structured learning path before progressing to harder challenges.
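Both PageRank runs can be sketched on a toy easy-to-hard graph (edges invented for illustration):

```python
import networkx as nx

# Toy easy -> hard graph (IDs and edges invented).
G = nx.DiGraph([(1, 4), (2, 4), (3, 4), (1, 3)])

# On the original graph, PageRank mass flows toward hard "sink"
# problems; reversing the edges surfaces entry points instead.
endpoints = nx.pagerank(G)
entrypoints = nx.pagerank(G.reverse())

print(max(endpoints, key=endpoints.get))    # final challenge
print(max(entrypoints, key=entrypoints.get))  # entry point
```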

Summary

  • The article addresses the challenge of efficient DSA problem selection for SDE placement preparation.
  • A graph-based model using problem tags and difficulty ratings is proposed to represent problem relationships.
  • Clustering and graph algorithms are used to infer optimal problem sequences.
  • Easy to medium problems with high connectivity are ideal starting points.
  • Hard, multi-topic problems serve as effective end-points for mastery.


Future Work

  • Utilize Graph Neural Networks (GNNs) to generate more sophisticated problem embeddings.
  • Explore advanced graph algorithms for refined problem sequencing and cluster balancing.
  • Incorporate user feedback and performance data for adaptive learning paths.

