A guide to writing crisp and concise machine learning code!
Yes! This is what I chose to write as my first LinkedIn article! Time and again, I have seen even the most experienced coders write non-standard, hard-to-follow, unstructured code that is almost incoherent. As someone who has been on the receiving end of ugly code many times during knowledge transfers, I mean this largely as a rant on how you can make your code as easy as possible for others to read.
This is not rocket science, folks! While the code shown here is in Python (with Jupyter), you can adopt the principles across any language and tool. So let's get started.
1. Structure your code according to logical blocks
When starting out, most coders maintain a single script/notebook, which ends up being long and hard to decipher. It also makes debugging harder once the code is put into production. Try to structure your code into separate scripts/notebooks. This is what a typical data pipeline should look like:
01_Data_Preparation.ipynb -> 02_Exploratory_Data_Analysis.ipynb -> 03_Modelling.ipynb
Number your scripts to ensure there is no confusion as to the order of execution. If required, you can write a pipeline script which runs the above pipeline in order. Something along the lines of 04_Pipeline.ipynb which contains:
    print('Running data preparation notebook...')
    %run 01_Data_Preparation.ipynb
    print('Running exploratory data analysis notebook...')
    %run 02_Exploratory_Data_Analysis.ipynb
    print('Running modelling notebook...')
    %run 03_Modelling.ipynb

This structure is easy to debug. If you encounter any issues in execution, you will be able to pinpoint the exact script which is the root cause. If there are not many lines of code, then it is not necessary to break them into different scripts. A comment block separating each logical segment will do as well. Something along the lines of:
    ###### IMPORT BLOCK ######
    # All library imports here

    ###### CONFIG BLOCK ######
    # All settings and global variables declared here

    ###### FUNCTION BLOCK ######
    # All user defined functions here
    def data_prep():
        ...
        return

    def eda():
        ...
        return

    def model():
        ...
        return

    ###### MAIN BLOCK ######
    # Main block here
    print('Running data_prep...')
    data_prep()
    ...
This is the simplest structure to maintain. Once you build this principle into your coding DNA, you will find your code much more streamlined. Let there be order!!!
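For plain Python scripts rather than notebooks, the same idea can be sketched as a small runner that executes each stage in order and names the stage that failed. This is only a sketch: `run_pipeline` and the stage bodies below are hypothetical stand-ins, not part of the pipeline described above.

```python
import time


# Hypothetical stage functions standing in for the three pipeline steps
def data_prep():
    return "prepared data"


def eda():
    return "eda summary"


def model():
    return "trained model"


def run_pipeline(stages):
    """Run (name, function) pairs in order and report which stage fails."""
    results = {}
    for name, stage in stages:
        print(f"Running {name}...")
        start = time.perf_counter()
        try:
            results[name] = stage()
        except Exception as exc:
            # Re-raise with the stage name so the failing step is obvious
            raise RuntimeError(f"Stage '{name}' failed: {exc}") from exc
        print(f"Finished {name} in {time.perf_counter() - start:.2f}s")
    return results


results = run_pipeline([
    ("01_data_prep", data_prep),
    ("02_eda", eda),
    ("03_model", model),
])
```

Because each stage is numbered and named, a failure message points straight at the step to investigate.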
2. Be as verbose as possible
Even experienced coders are guilty of this offence. Nothing is worse than code which has little to no comments explaining what is happening. Well-placed comments are a sign of maturity as a coder. Let's have a look at a code snippet:
    import os
    import pandas as pd

    # List to collect the dataframe read from each csv
    frames = []
    # Iterate over the list of files in the path
    for file in os.listdir(DATA_INPUT):
        # Consider only csv files
        if file.endswith('.csv'):
            # Find the path of the csv file
            pathname = os.path.join(DATA_INPUT, file)
            # Read the csv file
            print("%s is being read..." % file)
            temp = pd.read_csv(pathname, low_memory=False, encoding='latin1')
            # Strip space from headers so as to not encounter any
            # appending issues
            temp.columns = temp.columns.str.replace(' ', '')
            # Append the file towards the end sequentially
            frames.append(temp)
            print("%s has been appended..." % file)
    # Combine everything into one dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df = pd.concat(frames) if frames else pd.DataFrame()
Looking at the code snippet, even if you may not understand the syntax of the code, the comments clearly indicate what is going on.
If you are using notebooks for your code, then make use of Markdown cells. Markdown is your friend! Also, Jupyter notebooks have a number of wonderful extensions. In particular, I use the TOC extension, which churns out a nice table of contents at the beginning of the notebook.
3. Follow a specific coding convention
This boils down to three simple things:
- Name your variables, functions and scripts meaningfully.
- Follow a casing convention. I use underscores (snake_case). Some use camel casing. To each their own.
- All library imports at the top, then configurations and global variables, then user defined functions and finally the main block.
These are the minimal guidelines one should follow. In particular, I follow the PEP 8 style guide.
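As a small illustration of the naming and casing points, here is the same logic written twice. The function names, the constant and the sample numbers are all made up for this example.

```python
# Harder to follow: the names say nothing about what is being computed
def f(a, b):
    return [x for x in a if x > b]


# Easier to follow: snake_case names that describe intent
MIN_PURCHASE_AMOUNT = 100  # hypothetical setting, declared with the other globals


def filter_large_purchases(purchase_amounts, min_amount=MIN_PURCHASE_AMOUNT):
    """Return only the purchases strictly above min_amount."""
    return [amount for amount in purchase_amounts if amount > min_amount]


large_purchases = filter_large_purchases([50, 150, 300])
```

Both functions do exactly the same thing, but only the second one can be read without opening its body.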
4. Functionize your code
If you have a block of code which is repeated elsewhere, it is better to put it into a function. As a rule of thumb, if you have to repeat your code more than twice, put it into a function.
While writing functions, also keep in mind to give them appropriate names and include docstrings. Let's have a look at a code snippet:
    import numpy as np
    # Assuming mean_squared_error is the scikit-learn metric
    from sklearn.metrics import mean_squared_error

    def calc_rmse(y_true, y_pred):
        '''Function to calculate root mean square error'''
        # Convert the labels and predictions to numpy arrays
        y_true, y_pred = np.array(y_true), np.array(y_pred)
        # Calculate rmse
        return np.sqrt(mean_squared_error(y_true, y_pred))
Looking at it, we can easily identify what the function does from both its name and its docstring.
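To make the "repeat it more than twice, functionize it" rule concrete, here is a minimal sketch in plain Python. The helper `min_max_scale` and the sample columns are hypothetical and only serve to show one function replacing several copies of the same arithmetic.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical feature columns; one call per column instead of
# copy-pasting the scaling arithmetic for each of them
ages = [20, 30, 40]
incomes = [1000, 3000, 2000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

If the scaling logic ever needs to change, there is now exactly one place to change it.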
5. Follow a good folder structure for your project
If I had a nickel for every time I have seen both code and outputs being dumped into the same place, I would probably have had enough money to save Harambe with my ill-gotten gains!
Jokes aside, following a good folder structure is critical. This matters even more when the code is being used in production.
This is the structure most of my projects take:
- data
- plots
- notebooks/scripts
- reports
- models
A popular tool which does this beautifully is Cookiecutter.
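If you would rather not pull in a templating tool, the folder list above can be created with a few lines of standard-library Python. This is a rough sketch: `create_project_skeleton` and the project name `my_project` are hypothetical, and `notebooks` stands in for the "notebooks/scripts" entry.

```python
import tempfile
from pathlib import Path

# Folder names taken from the list above
PROJECT_FOLDERS = ["data", "plots", "notebooks", "reports", "models"]


def create_project_skeleton(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing project
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    created_paths = create_project_skeleton(Path(tmp_dir) / "my_project")
    created_names = [path.name for path in created_paths]
```

In a real project you would call `create_project_skeleton` once with the path of your project root.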
That brings me to the end of my rant! While there are many more ways to tame the chaos within your code, you will find that following these five simple steps is mostly enough to make your code look beautiful and structured. Hope this helps you (or whoever you hand over your code to)!