A guide to writing crisp and concise machine learning code!

Sandeep Sisupalan

Advanced Data Scientist at antuit.ai

Published Apr 18, 2020

Yes! This is what I chose to write as my first LinkedIn article! Time and again, I have seen even the most experienced coders write non-standard, hard-to-follow, unstructured code that is almost incoherent. As someone who has been on the receiving end of ugly code many times during knowledge transfers, I intend this largely as a rant on how you can make your code as easy as possible for others to read.

This is not rocket science, folks! While the code shown here is in Python (with Jupyter), you can adopt the principles in any language and tool. So let's get started.

1. Structure your code according to logical blocks

When starting out, most coders maintain a single script/notebook, which ends up being long and hard to decipher. It also makes debugging harder if the code is put into production. Try to structure your code into separate scripts/notebooks. This is what a typical data pipeline should look like:

01_Data_Preparation.ipynb -> 02_Exploratory_Data_Analysis.ipynb -> 03_Modelling.ipynb

Number your scripts to ensure there is no confusion as to the order of execution. If required, you can write a pipeline script which runs the above pipeline in order. Something along the lines of 04_Pipeline.ipynb which contains:

print('Running data preparation notebook...')
%run 01_Data_Preparation.ipynb

print('Running exploratory data analysis notebook...')
%run 02_Exploratory_Data_Analysis.ipynb

print('Running modelling notebook...')
%run 03_Modelling.ipynb

This structure is easy to debug. If you encounter any issues in execution, you will be able to pinpoint the exact script that is the root cause. If there are not many lines of code, it is not necessary to break them into different scripts. A comment block separating each logical segment will do as well. Something along the lines of:

###### IMPORT BLOCK ######


# All library imports here


###### CONFIG BLOCK ######


# All settings and global variables declared here


###### FUNCTION BLOCK ######


# All user defined functions here


def data_prep():

    ...

    return


def eda():

    ...

    return


def model():

    ...

    return


###### MAIN BLOCK ######

# Main block here

print('Running data_prep...')
data_prep()

...

This is the simplest structure to maintain. Once you include this principle in your coding DNA, you will find your code much more streamlined. Let there be order!!!
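To make the "pinpoint the exact stage" idea concrete for the single-script layout, here is a minimal sketch of a main block that runs the stages in order and names the one that failed. The stage bodies and the run_pipeline helper are hypothetical placeholders of mine, not part of any library; swap in your own functions.

```python
###### FUNCTION BLOCK ######

# Hypothetical stand-ins for the real data_prep(), eda() and model() functions.
def data_prep():
    return "prepared"


def eda():
    return "explored"


def model():
    return "trained"


def run_pipeline(stages):
    """Run (name, function) pairs in order and report which stage fails."""
    results = {}
    for name, stage in stages:
        print("Running %s..." % name)
        try:
            results[name] = stage()
        except Exception as exc:
            # Re-raise with the stage name so the root cause is easy to locate.
            raise RuntimeError("Stage '%s' failed" % name) from exc
    return results


###### MAIN BLOCK ######

if __name__ == "__main__":
    run_pipeline([("data_prep", data_prep), ("eda", eda), ("model", model)])
```

Keeping the stage order in one list means the execution order lives in exactly one place, much like the numbered notebooks do.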

2. Be as verbose as possible

Even experienced coders are guilty of this offence. Nothing is worse than code with little to no comments explaining what is happening. Well-placed comments are a sign of maturity as a coder. Let's have a look at a code snippet:

import os

import pandas as pd

# List to collect the dataframe read from each csv
frames = []

# Iterate over the list of files in the path
# (DATA_INPUT is assumed to be the input folder set in the config block)
for file in os.listdir(DATA_INPUT):

    # Consider only csv files
    if file.endswith('.csv'):

        # Find the path of the csv file
        pathname = os.path.join(DATA_INPUT, file)

        # Read the csv file
        print("%s is being read..." % file)
        temp = pd.read_csv(pathname, low_memory=False, encoding='latin1')

        # Strip space from headers so as to not encounter any
        # appending issues
        temp.columns = temp.columns.str.replace(' ', '')

        # Queue the file for appending
        frames.append(temp)
        print("%s has been appended..." % file)

# Append all the files sequentially in a single step
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
df = pd.concat(frames) if frames else pd.DataFrame()

Looking at the code snippet, even if you may not understand the syntax of the code, the comments clearly indicate what is going on.

If you are using notebooks for your code, then make use of Markdown cells. Markdown is your friend! Also, Jupyter notebooks have a number of wonderful extensions. In particular, I use the TOC extension, which generates a nice table of contents at the beginning of the notebook.

3. Follow a specific coding convention

This boils down to 3 simple things:

Name your variables, functions and scripts meaningfully.
Follow one casing convention consistently. I use underscores (snake_case). Some use camel case. To each their own.
Put all library imports at the top, then configurations and global variables, then user-defined functions and finally the main block.

These are the minimal guidelines one should follow. In particular, I follow the PEP 8 style guide.
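As a small illustration of the naming and casing points, here is a sketch contrasting a cryptic function with a descriptive one. All names here (f, filter_large_orders, MIN_ORDER_VALUE) are made up for this example.

```python
# Hard to follow: the names say nothing about the contents.
def f(a, b):
    return [x for x in a if x > b]


# Easier to follow: snake_case names that describe the data.
MIN_ORDER_VALUE = 100  # hypothetical configuration constant


def filter_large_orders(order_values, min_value=MIN_ORDER_VALUE):
    """Return the order values strictly greater than min_value."""
    return [value for value in order_values if value > min_value]


large_orders = filter_large_orders([50, 150, 100, 300])
```

Both functions do the same thing, but only the second one can be understood without reading its body.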

4. Functionize your code

If you have a block of code which is repeated elsewhere, it is better to put it into a function. As a rule of thumb, if you have to repeat your code more than two times, put it into a function.

While writing functions, also keep in mind to give them appropriate names and include docstrings. Let's have a look at a code snippet:
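A minimal sketch of what this looks like in practice: instead of repeating the same header-cleaning expression for every dataset, it lives in one function. The clean_header helper and the sample headers are hypothetical, chosen only for illustration.

```python
def clean_header(name):
    """Remove spaces from a column header and lower-case it."""
    return name.replace(" ", "").lower()


# The same logic is reused for every dataset instead of being copy-pasted.
train_headers = [clean_header(header) for header in ["Store ID", "Sales Qty"]]
test_headers = [clean_header(header) for header in ["Store ID", " Week "]]
```

If the cleaning rule changes later, it only has to change in one place.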

import numpy as np
from sklearn.metrics import mean_squared_error


def calc_rmse(y_true, y_pred):
    '''Function to calculate root mean square error'''

    # Convert the labels and predictions to numpy arrays
    y_true, y_pred = np.array(y_true), np.array(y_pred)

    # Calculate rmse
    return np.sqrt(mean_squared_error(y_true, y_pred))

Looking at it, we can easily identify what the function does from both the name of the function and the docstring.

5. Follow a good folder structure for your project

If I had a nickel for every time I have seen both code and outputs being dumped into the same place, I would probably have had enough money to save Harambe with my ill-gotten gains!

Jokes aside, following a good folder structure is critical. This is even more so when said code is being used in production.

This is the structure most of my projects take:

data
plots
notebooks/scripts
reports
models

A popular tool which does this beautifully is Cookiecutter.
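If you would rather not pull in a templating tool, the folder layout listed above can be created with a few lines of standard-library Python. This is only a sketch: the create_project_structure function and the PROJECT_FOLDERS constant are names I made up, and I have assumed "notebooks" for the notebooks/scripts entry.

```python
from pathlib import Path

# Folder names taken from the list above; "notebooks" is an assumed choice
# for the "notebooks/scripts" entry.
PROJECT_FOLDERS = ["data", "plots", "notebooks", "reports", "models"]


def create_project_structure(root):
    """Create the standard project folders under root and return their paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_project_structure("my_project") would set up the five folders under my_project, leaving any existing contents untouched.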

That brings me to the end of my rant! While there are many more ways to tame the chaos within your code, you will find that following these five simple steps is mostly enough to make your code look beautiful and structured. Hope this helps you (or whoever you hand over your code to)!
