Smart LeetCode Practice: Navigating Problems with NetworkX
As we returned from our respective internships in the third semester of our MTech programs, we approached what was arguably the most intense and stressful phase: placement preparation. Ideally, one would have started early, but the demanding curriculum had left little room for dedicated prep. The challenge, then, was clear: how could we achieve the best results with minimal yet strategic preparation?
Given the number of companies hiring for SDE roles, most candidates naturally focus on practicing Data Structures and Algorithms (DSA) problems. This preparation typically falls into one of two approaches:
Most of us, however, land somewhere in between, attempting to cover a diverse set of commonly asked problems with enough variation to maximize exposure to different problem patterns. For that, we rely on sheets curated by professionals.
A bottleneck emerges while covering diverse topics: the Paradox of Choice. With an ever-growing influx of problem sheets, we often waste time deciding where to start. This also introduces a Developer Bias, as these sheets are typically curated by individuals whose selection criteria may not align with our specific needs or weaknesses.
This raised fundamental questions about effective problem progression in such a broad domain. Is there a universally ideal starting problem for each topic? What sequence of problems maximizes conceptual understanding? More importantly, can we move beyond individual intuition and develop data-driven procedures that offer generalizable learning strategies?
Modelling
If we assume that problems are related, a graph is a natural choice to represent these relationships. The edges must be directed since progression is a key requirement, indicating the flow from simpler to more complex problems. Additionally, edges should only exist between similar problems to ensure meaningful transitions, and weights can be assigned based on relative difficulty to guide an optimal learning path.
Feature Selection
To establish meaningful connections between problems, we need strong features that effectively convey similarity. When visiting a problem page, we are presented with multiple attributes:
For difficulty, we relied on the community-maintained LeetCode Problem Rating list; thanks to ZeroTrac and the community for compiling this valuable resource.
Exploratory Data Analysis
With the dataset established, exploration commenced. A straightforward linear correlation analysis of the tags produced the following results:
Analyzing the top correlating solution tags reveals expected relationships:
These correlations align with intuitive expectations. DFS and BFS commonly appear together due to their fundamental tree and graph traversal roles. The strong association between recursion and trees further supports its prevalence in pre- and post-order traversal problems.
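To make this concrete, here is a minimal sketch of how such a tag correlation analysis could be computed with pandas, assuming a hypothetical problems DataFrame whose tags column holds the list of solution tags per problem:

import numpy as np
import pandas as pd

# Hypothetical input: `problems` is a DataFrame with one row per problem and
# a "tags" column holding the list of solution tags for that problem.
tag_matrix = problems["tags"].str.join("|").str.get_dummies()  # one-hot tags

# Pearson correlation between every pair of tags.
corr = tag_matrix.corr()

# Keep only the upper triangle so each pair appears once, then rank them.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).stack().sort_values(ascending=False)
print(top_pairs.head(10))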
Since our feature vectors become sparse due to the high number of unique tags, we can employ Principal Component Analysis (PCA) to obtain denser embeddings while reducing redundancy. This allows for more compact and meaningful representations of problem relationships.
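As a rough sketch, the reduction could look like the following, where the number of components is an illustrative choice rather than the value used in the original analysis:

from sklearn.decomposition import PCA

# tag_matrix: the one-hot tag matrix from above (problems x tags).
pca = PCA(n_components=32, random_state=42)   # 32 components is illustrative
embeddings = pca.fit_transform(tag_matrix)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))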
Now that the feature vectors were available, the next step was determining an appropriate threshold for identifying neighboring nodes. One approach was to set a similarity threshold, considering an edge between two nodes only if their cosine similarity exceeded a certain value. Such a graph was defined at various thresholds, and its structural statistics were analyzed at each level.
The analysis was initially performed at thresholds of 25, 50, 75, and 90, before being narrowed down to a more refined range. The goal was to determine a threshold that minimized the number of connected components, ensuring that all problems remained reachable from each other. Outlier detection was conducted based on the total number of submitted tags to address the issue of small-sized clusters, but it did not yield any significant improvements.
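A minimal sketch of such a sweep, assuming `embeddings` holds the per-problem feature vectors and interpreting the thresholds as cosine-similarity values expressed as percentages:

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings)   # pairwise cosine similarity
np.fill_diagonal(sim, 0)              # ignore self-similarity

for pct in [25, 50, 75, 90]:
    thresh = pct / 100.0
    adj = (sim >= thresh).astype(int)
    G = nx.from_numpy_array(adj)      # undirected graph at this threshold
    print(f"threshold {thresh:.2f}: {G.number_of_edges()} edges, "
          f"{nx.number_connected_components(G)} connected components")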
Clustering was employed as an approach for topic modeling, where the maximum voted tags within a cluster could assist in unsupervised classification of a set of problems. However, this approach introduced a challenge.
Addressing this problem at an algorithmic level required devising a heuristic to manage small cluster sizes. The routine below recursively bisects clusters, enforcing a constraint on the maximum cluster size so that no cluster exceeds a predefined limit.
import numpy as np
from sklearn.cluster import KMeans

def RecursiveClustering(features, max_size, random_state=42):
    """Recursively bisect clusters with K-Means until every cluster
    contains at most max_size points. Returns one label per row."""
    features = np.asarray(features)
    # Start with all points unlabeled
    labels = np.full(len(features), -1, dtype=int)

    # Helper function to handle clustering of one subset of row indices
    def process_cluster(indices, label):
        # Small enough: label all these points with the current label
        if len(indices) <= max_size:
            labels[indices] = label
            return label + 1  # next available label
        try:
            # Try to split the cluster into two groups
            km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
            assignment = km.fit_predict(features[indices])
            group1 = indices[assignment == 0]
            group2 = indices[assignment == 1]
            # Check if the split actually worked
            if len(group1) == 0 or len(group2) == 0:
                raise ValueError("degenerate split")  # force fallback
            # Recurse on both groups, threading the next free label through
            next_label = process_cluster(group1, label)
            next_label = process_cluster(group2, next_label)
            return next_label
        except ValueError:
            # If anything went wrong, label the entire subset as one cluster
            labels[indices] = label
            return label + 1

    # Start with all data points as the first cluster
    process_cluster(np.arange(len(features)), 0)
    return labels
Clustering solely using K-Means resulted in a large, dense cluster where many problems were grouped together, while others were confined to much smaller clusters. This imbalance indicated that K-Means struggled to properly capture the structural relationships between problems.
The cluster sizes improved after enforcing the maximum size constraint, but the issue of small clusters still persisted. To address this, a merging routine would be required, potentially introducing an additional constraint to ensure better cluster balancing. This aspect remains an open challenge and will be explored as part of future work.
Inferencing
The graph is now constructed using the NetworkX library, a Python package designed for the creation, manipulation, and analysis of complex networks. An edge is established between nodes i and j whenever the cosine similarity of their feature vectors exceeds 0.7; each edge is directed from the easier problem to the harder one and weighted by the absolute difference in their ratings, so that the direction reflects the relative difficulty between problems.
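A minimal sketch of this construction, assuming `embeddings` are the reduced feature vectors, `problem_ids` their corresponding LeetCode IDs, and `ratings` the ZeroTrac difficulty ratings keyed by problem ID (all three names are placeholders):

import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings)
G = nx.DiGraph()
G.add_nodes_from(problem_ids)                     # one node per problem

for a in range(len(problem_ids)):
    for b in range(a + 1, len(problem_ids)):
        if sim[a, b] > 0.7:                        # similarity threshold
            u, v = problem_ids[a], problem_ids[b]
            if ratings[u] > ratings[v]:            # direct easier -> harder
                u, v = v, u
            G.add_edge(u, v, weight=abs(ratings[u] - ratings[v]))

isolated = [n for n in G.nodes if G.degree(n) == 0]
print("Problem IDs with zero degree:", isolated)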
Problem IDs with zero degree: [1584, 1072, 838, 769, 900, 1267, 2273, 2352, 821, 2293]
Note: t-SNE is a dimensionality reduction technique and may distort original high-dimensional relationships, making some points appear closer than they actually are.
Most edges exhibit a relatively small difficulty gap between their endpoints, as seen in the graph. Additionally, a significant number of problems have few neighbors, with a second peak in the degree distribution likely representing more generic problems that act as gateways to multiple concepts.
An easy array-focused DP problem is a great starting point, while hard multi-topic problems serve as ideal endpoints for mastering concepts.
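As an illustration of how such a progression could be read off the graph built above, a shortest-path query that minimises the cumulative difficulty jump (the problem IDs here are placeholders, not recommendations from the original analysis):

# Hypothetical example: a gentle practice sequence from an assumed easy
# starting problem to an assumed hard endpoint.
start, end = 70, 123                    # placeholder problem IDs
path = nx.shortest_path(G, source=start, target=end, weight="weight")
print("suggested practice order:", path)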
Summary
Future Work