Evaluation of programs codes using machine learning

SUBMITTED BY –
MANAS CHHABRA 2K12/SE/041
ROHIT PAL 2K12/SE/066
TANMAY AGGARWAL 2K12/SE/

OBJECTIVE
• To detect copied code submitted by
different users on online judges.

HOW CAN WE SAY WHETHER TWO CODES ARE COPIED?
• Traditional way
• Write a program that compares every code submitted by a
user with every other code in the database.

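THEN WHAT IS THE PROBLEM WITH THIS METHOD?
Assume that 4000 users solve a given programming question and
that every code has 30 lines. Comparing every code with every
other code then takes 4000 × 3999 / 2 = 7,998,000 code-to-code
comparisons, and this count grows quadratically with the number
of users.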
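
A rough sketch of how the all-pairs cost grows. The function name `pairwise_cost` is hypothetical, and the line-level figure assumes a naive comparison of every line of one code against every line of the other, which the slides do not specify.

```python
def pairwise_cost(num_codes, lines_per_code):
    """Return (code pairs, naive line-to-line comparisons) for an all-pairs check."""
    # Every unordered pair of submissions is compared once.
    pairs = num_codes * (num_codes - 1) // 2
    # Assumption: each pair compares every line of one code with every line of the other.
    line_comparisons = pairs * lines_per_code ** 2
    return pairs, line_comparisons


pairs, line_comparisons = pairwise_cost(4000, 30)
print(pairs)             # 7998000
print(line_comparisons)  # 7198200000
```

Doubling the number of users roughly quadruples both figures, which is why an all-pairs check does not scale.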

WE CAN RESOLVE THIS COMPLEXITY USING MACHINE LEARNING
Before we see how we have reduced this complexity,
let’s first look at the machine learning concepts
which we have applied.

WHAT IS MACHINE LEARNING?
• According to Arthur Samuel, machine learning is a field of
study that gives computers the ability to learn without being
explicitly programmed.
• According to Tom Mitchell, it is a well-posed learning problem:
a computer program is said to learn from experience E with
respect to some task T and some performance measure P if its
performance on T, as measured by P, improves with experience
E.

MACHINE LEARNING ALGORITHMS
• Supervised Learning
• Unsupervised Learning

WE ARE GOING TO USE UNSUPERVISED
MACHINE LEARNING
Clustering: learning from unlabeled data.
Unsupervised learning tries to determine structure in the data.
A clustering algorithm groups data together based on data
features.

WHAT IS CLUSTERING GOOD FOR?
• Market segmentation - group customers into different market
segments
• Social network analysis - Facebook "smartlists"
• Organizing computer clusters and data centers for network
layout & location
• Astronomical data analysis - Understanding galaxy formation

K-MEANS ALGORITHM
• Used to automatically group the data into coherent clusters
• e.g. assume we are given some unlabeled data to split into two clusters
• Step 1: Randomly allocate two points as the cluster centroids
• Step 2: Cluster assignment step
• Go through each example and, depending on whether it is closer
to the red or the blue centroid, assign it to one of the two
clusters
• Step 3: Move centroid step
• Take each centroid and move it to the average of the
data points assigned to it
• Repeat Step 2 and Step 3 until convergence
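
The three steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the names `k_means` and `squared_distance`, the `seed` and `max_iters` parameters, and the sample points are all assumptions. Convergence is taken to mean that the cluster assignments stop changing.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each point to the index of its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignments


# Hypothetical sample: two visibly separate groups of 2D points.
sample = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, assignments = k_means(sample, k=2)
print(centroids, assignments)
```

Because the initial centroids are random, the cluster labels (0 or 1) can swap between runs with different seeds, and on harder data the result can depend on the initialization.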

MORE FORMAL DEFINITION
• INPUT:
• K (number of clusters in the data)
• Training set {x1, x2, x3, ..., xn}
• Algorithm
• Randomly initialize K cluster centroids as {μ1, μ2, μ3 ... μK}
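
After initialization, the standard K-means loop alternates two updates until the assignments stop changing. The formulation below is a sketch in the notation of the list above; the assignment index c_i and the cluster set C_k are notation introduced here for illustration and do not appear in the slides.

```latex
\begin{aligned}
&\text{Repeat until convergence:}\\
&\quad c_i := \arg\min_{k \in \{1,\dots,K\}} \lVert x_i - \mu_k \rVert^2
  \qquad \text{for } i = 1,\dots,n \quad \text{(cluster assignment)}\\
&\quad \mu_k := \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x_i ,
  \qquad C_k = \{\, i : c_i = k \,\} \quad \text{(move centroid)}
\end{aligned}
```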

DIMENSIONALITY REDUCTION
• Speeds up algorithms
• Reduces the space the data takes up
• Reduce dimension from nD to mD
e.g. Reduction of 3D->2D

PRINCIPAL COMPONENT ANALYSIS (PCA)
• To reduce from nD to kD we find k vectors (u(1), u(2), ... u(k)) onto
which to project the data so as to minimize the projection error
• So in general there are several vectors onto which we project the data
• Find a set of vectors and project the data onto the linear
subspace spanned by that set of vectors
• A point in that subspace can be described using the k vectors

• e.g. 3D->2D: find a pair of vectors which define a 2D plane
(surface) onto which you're going to project your data
• Much like the "shallow box" example in compression, we're
trying to create the shallowest box possible (by defining two of
its three dimensions, so the box's depth is minimized)

ALGORITHM
• Reducing data from n-dimensional to k-dimensional
• Compute the covariance matrix
•This is commonly denoted as
•Σ (Greek upper-case sigma), NOT the summation symbol
•Σ = sigma
•This is an [n x n] matrix
•Remember that xi is an [n x 1] matrix
•In MATLAB or Octave we can implement this as follows:
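
For reference, the usual definition of the covariance matrix used by PCA is sketched below. It assumes the features have already been mean-normalized and writes m for the number of training examples and X for the [m x n] matrix whose rows are the transposed examples; m and X are notation assumed here, not taken from the slides.

```latex
\Sigma \;=\; \frac{1}{m} \sum_{i=1}^{m} x_i \, x_i^{T}
\;=\; \frac{1}{m} \, X^{T} X ,
\qquad x_i \in \mathbb{R}^{n \times 1}, \quad \Sigma \in \mathbb{R}^{n \times n}
```

Each term x_i x_i^T is an [n x 1] by [1 x n] product, which is why the result is an [n x n] matrix.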

•Compute eigenvectors of matrix Σ
•[U,S,V] = svd(sigma)
•svd = singular value decomposition
•More numerically stable than eig
•eig also gives the eigenvectors
•U,S and V are matrices
•U matrix is also an [n x n] matrix
•Turns out the columns of U are the u vectors we want!
•So to reduce a system from n-dimensions to k-dimensions
•Just take the first k-vectors from U (first k columns)

• So if we need to reduce to k dimensions
• We extract the first k columns of matrix U into Ureduce
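
The whole procedure can be sketched in plain Python. Note the swap: the slides use `svd`, but the Python standard library has no SVD routine, so this sketch finds the leading eigenvectors of the covariance matrix by power iteration with deflation instead. All function names, the `iters` and `seed` parameters, and the sample data are assumptions for illustration. The last step projects each mean-normalized example x onto the chosen vectors, z = Ureduce^T x.

```python
import math
import random


def covariance(rows):
    """Mean-normalize the rows and return (sigma, centered rows)."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    sigma = [[sum(c[a] * c[b] for c in centered) / m for b in range(n)]
             for a in range(n)]
    return sigma, centered


def power_iteration(matrix, iters=500, seed=0):
    """Approximate the eigenvector with the largest eigenvalue."""
    rng = random.Random(seed)
    n = len(matrix)
    v = [rng.random() + 0.1 for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            break
        v = [x / norm for x in w]
    return v


def pca(rows, k):
    """Return (u_reduce, z): the k direction vectors and the projected data."""
    sigma, centered = covariance(rows)
    n = len(sigma)
    u_reduce = []
    for _ in range(k):
        u = power_iteration(sigma)
        eigenvalue = sum(u[a] * sigma[a][b] * u[b]
                         for a in range(n) for b in range(n))
        u_reduce.append(u)
        # Deflation: remove the component just found before searching again.
        sigma = [[sigma[a][b] - eigenvalue * u[a] * u[b] for b in range(n)]
                 for a in range(n)]
    # Projection: z = Ureduce^T x for every mean-normalized example x.
    z = [[sum(c[j] * u[j] for j in range(n)) for u in u_reduce]
         for c in centered]
    return u_reduce, z


# Hypothetical 3D data that really lies along a single direction (1, 2, 0).
data = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
u_reduce, z = pca(data, k=1)
print(u_reduce, z)
```

The sign of each direction vector is arbitrary, so the projected values may come out negated relative to another implementation.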

NOW LET’S APPLY THIS TO OUR PROJECT

FEATURES TO BE CONSIDERED
- Type of file, e.g. .java, .cpp, etc.
- No. of lines of code
- No. of functions
- No. of variables used
- No. of if-else conditions
- No. of loops
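
A rough sketch of how some of these features could be extracted from a C-like source file. The function `extract_features`, the regular expressions, and the sample program are hypothetical; the regexes are only heuristics (they do not skip comments or string literals), and counting variables is left out because it realistically needs a proper parser.

```python
import os
import re

# Keywords that look like calls followed by a block but are not functions.
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch", "return"}
# Heuristic: identifier, a parameter list without nested parentheses, then "{".
FUNCTION_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*\{")


def extract_features(filename, source):
    """Return a dictionary of simple, heuristic code features."""
    extension = os.path.splitext(filename)[1].lower()
    code_lines = [line for line in source.splitlines() if line.strip()]
    functions = [m.group(1) for m in FUNCTION_RE.finditer(source)
                 if m.group(1) not in CONTROL_KEYWORDS]
    return {
        "file_type": extension,
        "lines": len(code_lines),  # non-blank lines only
        "functions": len(functions),
        "conditionals": len(re.findall(r"\bif\b", source)),
        "loops": len(re.findall(r"\b(?:for|while)\b", source)),
    }


SAMPLE = """
int add(int a, int b) {
    return a + b;
}

int main() {
    int total = 0;
    for (int i = 0; i < 10; i++) {
        if (i % 2 == 0) {
            total = add(total, i);
        } else {
            total -= 1;
        }
    }
    while (total > 100) {
        total /= 2;
    }
    return total;
}
"""

print(extract_features("solution.cpp", SAMPLE))
```

The numeric values could then be arranged as one feature vector per submission, which is the form of input that K-means and PCA expect.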

APPLICATIONS
• Besides detecting cheating in online programming contests,
we can have the following applications
• We can use this in our programming labs to evaluate the
programs submitted by students according to complexity.
• We can learn which programming styles are currently in trend
