Statistics group project_Fraud Detection

YOUNG INDIA FELLOWSHIP

Statistics Course
Group Project
Members :

Abhishek Chopra
Adhiraj Sarmah
Kshitij Garg
Mahesh Jakhotia
Tulasi Prasad Chaudhary

7/25/2011

The group project is based on a real case study drawn from the test papers of Atlanta primary schools. Growing
pressure on teachers to improve the test performance of their classes has resulted in malpractice. Our task is
to identify methods that can detect whether fraud took place in the case that follows.

Contents

1) Problem Statement

2) Logical Analysis

3) Inference

4) Our Interpretation of the Cheating Process

5) Statistical Approaches

6) ANOVA

7) Pictorial Method

8) The Wilcoxon Rank Sum Test

9) Appendix

a. Table A.1 : Division of questions into groups based on the approach 1 used in ANOVA test

b. Table A.2 : Class A Results

c. Table A.3 : Class B Results

d. Table A.4 : Class A Results

e. Table A.5 : Class B Results

GROUP PROJECT STATISTICS – FRAUD DETECTION

Problem Statement: We have been given two sets of data from two different classrooms and are required to
devise and carry out an analysis to determine whether there was teacher fraud in one or both of the
classrooms.

There can be 4 different scenarios:
1) Both A & B data have been tampered with.
2) Neither A nor B data has been tampered with.
3) A is fraudulent, B is not.
4) B is fraudulent, A is not.

We have summarized our thought process in this document and demonstrated it with the Excel sheets
attached in the folder. We used several approaches to reach a solution. Each methodology has its own
assumptions, advantages and drawbacks.

Logical Analysis:

STEP – 1: We calculate the total number of correct answers for every question in both classes.
Since the analysis is student-wise and question-wise, with a correct answer scored as '1', this total
is also the number of students who answered each question correctly.

STEP – 2: We then find the total number of correct answers of the entire class and divide it by the
total number of students to arrive at the mean number of correct answers per student for both
classes.
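Steps 1 and 2 can be sketched in a few lines of Python. The 0/1 matrix here is a made-up toy example (hypothetical, not the project data), with one row per student and one column per question.

```python
# Toy 0/1 answer matrix (hypothetical data): rows = students, columns = questions.
# A value of 1 means the student answered the question correctly.
answers = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

# STEP 1: correct answers per question = number of students who got it right.
correct_per_question = [sum(column) for column in zip(*answers)]

# STEP 2: total correct answers in the class divided by the number of students.
total_correct = sum(correct_per_question)
mean_correct_per_student = total_correct / len(answers)

print(correct_per_question)      # per-question totals
print(mean_correct_per_student)  # mean number of correct answers per student
```

The per-question totals are the values that get plotted against the question number in the next step.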

STEP – 3: Using the results of the previous steps, we plot line graphs for both classes with
Questions on the X-Axis and Class Performance on the Y-Axis. These plots give a broad view of
whether there is any evidence of fraud.

# We found that in Class A, Questions 30 to 36 clearly show an anomaly.
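The visual check can be complemented by a simple numeric scan. The sketch below is an illustration only: it assumes that Table A.4 in the appendix, read row by row, gives the class A correct-answer counts for questions 1 to 44 in order, and it uses a cut-off of one standard deviation above the mean, which is our own choice and not part of the original analysis.

```python
from statistics import mean, stdev

# Correct answers per question for class A, questions 1..44.
# ASSUMPTION: Table A.4 read row by row lists the questions in order.
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15,
           14, 14, 14, 12, 10, 7, 7, 8, 14, 11, 9,
           9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16,
           18, 17, 2, 8, 9, 8, 1, 6, 4, 6, 2]

# Illustrative cut-off (an assumption): one standard deviation above the mean.
threshold = mean(class_a) + stdev(class_a)

# Question numbers (1-based) whose count exceeds the cut-off.
flagged = [q for q, count in enumerate(class_a, start=1) if count > threshold]

# Difficulty rises with the question number, so unusually high counts late in
# the test are the suspicious ones; keep only flags from the second half.
half = len(class_a) // 2
flagged_late = [q for q in flagged if q > half]

print(flagged)
print(flagged_late)
```

Under these assumptions the late flags form one unbroken run of questions inside the region that the plot singled out.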

STEP – 4: We decided to focus on the anomalous region. We analyzed questions 30-36 to see whether
they showed any abnormal patterns in either class.

# For 16 particular students in Class A, questions 30-36 showed a clear pattern of identical,
uniformly correct answers, which was not the case in Class B.

STEP – 5: We calculated the average score (i.e. the average number of correct answers) for each of
these 16 students in Class A with questions 30-36 included. The mean score of these 16 students
was 46%.

For Class B, the mean score of all the students was 38%.

STEP – 6: We calculated the average score (i.e. the average number of correct answers) for each of
these 16 students in Class A EXCLUDING questions 30-36. The mean score of the 16 students
DECREASED to 42% (i.e. a decrease of 4 percentage points).

For Class B, the mean score of all the students INCREASED to 40% (i.e. an increase of 2 percentage points).
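The include/exclude comparison of Steps 5 and 6 can be sketched as follows. The student-level data are not reproduced in this report, so the matrix is a hypothetical toy example and the helper name mean_score_pct is ours.

```python
def mean_score_pct(answers, excluded=()):
    """Mean percentage of correct answers, skipping the 1-based
    question numbers listed in `excluded`."""
    skip = set(excluded)
    kept_rows = [
        [mark for q, mark in enumerate(row, start=1) if q not in skip]
        for row in answers
    ]
    n_answers = sum(len(row) for row in kept_rows)
    return 100.0 * sum(sum(row) for row in kept_rows) / n_answers


# Toy data (hypothetical): 3 students, 6 questions. Questions 4-5 are correct
# for every student, mimicking a suspicious block.
toy = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 1],
]

with_block = mean_score_pct(toy)
without_block = mean_score_pct(toy, excluded=range(4, 6))
print(with_block, without_block)
```

A mean that drops once the block is removed, while the comparison class does not drop, is the pattern reported above for the 16 students of Class A.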

INFERENCE: Questions 30 to 36 therefore give reasonable grounds to believe that some form of
cheating or tampering took place on these questions.

Our interpretation of the Cheating Process

1) Across questions 30 to 36, the graphs show 16 students rising consistently above the other
students and above the class average. This amounts to 16 x 6 = 96 answers that have probably
been tampered with.

2) The reasons for choosing that particular set of questions (from 30 to 36) could be:

a) Since the level of difficulty is stated to increase with the question number, it is logical to
assume that more students would answer the first half of the questions correctly than the
second half. In the same way, the second half of the questions would be expected to show fewer
correct answers because the difficulty is higher.
b) It would therefore be logical for the teacher to tamper with the second half of the questions,
since most students would be expected to get the first half right anyway. Even within the second
half, it would be smarter to avoid the last few questions: they are the most difficult, and an
increased number of correct answers there would be easy to detect. So it would be logical to
choose questions near the beginning of the second half and well before the last few questions.

3) Choosing consecutive questions also reduces the time needed to edit the answers manually,
which matters given the limited time generally available to an invigilator or a teacher. And 96
answers is enough to shift the average class performance significantly, by the 4% that we found
in our analysis.

Statistical Approaches used:

1) ANOVA Method: We divided the questions of each class into groups and applied ANOVA to see
whether the groups have the same distribution. If one group did not have the same distribution,
this would suggest that the data of that group had been tampered with, since it disturbs the
distribution of the whole class. We used two approaches to divide the questions into groups. The
Tukey method can then be used to find the groups whose means deviate.

2) Pictorial distribution: A graph was plotted with the questions on the X axis and the class
performance on the Y axis. In the class A graph we found that between questions 30-36 the plot
was flat and the results were higher than for the other questions. On a pictorial basis we can
conclude that fraud has been committed on these questions.

3) The Wilcoxon Rank Sum Test: If we want to use the samples without assuming normality, we can
use the rank sum approach (used for non-normal distributions) discussed in section 9.2 of the
textbook. Since the other tests rest on mathematical assumptions that the given data do not
satisfy, this approach is attractive because it requires weaker assumptions.

Approach 1 : ANOVA Approach

To compare the means and distributions of several groups, ANOVA is preferred to multiple t-tests because
ANOVA leads to a single test statistic for comparing all the means, so the overall risk of a type I error can
be controlled. If we ran many t-tests, each at a given alpha level, we would not know the overall risk of
a type I error. The more tests one runs, the greater the risk of a false positive conclusion
somewhere among the tests.
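To give a feel for the size of this risk, the following short calculation estimates the chance of at least one false positive when all pairs of eight groups are compared with separate t-tests at alpha = 0.05. It assumes the tests are independent, which is a simplification made only for illustration.

```python
from math import comb

alpha = 0.05             # significance level of each individual t-test
groups = 8               # number of groups compared in this approach
pairs = comb(groups, 2)  # number of pairwise t-tests needed

# Chance of at least one false positive, ASSUMING the tests are independent.
familywise_error = 1 - (1 - alpha) ** pairs

print(pairs, round(familywise_error, 3))
```

With 28 pairwise tests the overall risk comes out at roughly 0.76, far above the nominal 0.05 of a single test.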

Initially we divided the questions of class A into groups according to their toughness level. Toughness
was measured by the area of right answers, i.e. the cumulative number of correct answers given by the
students. For example, the total number of correct answers in class A is 445. We split the questions into
eight groups of roughly equal area (445/8 ≈ 56), so that the total score within each group is close to 56.

The grouping is shown in Appendix Table A.1. The ANOVA test was applied to these groups to find out
whether the group means were the same or different.

Test Hypothesis

Ho : μ1 = μ2 = … = μ8

Ha : The means are not all the same (showing that one or more of the groups have been tampered with,
which made their means differ from those of the other groups)

Results for CLASS A

Anova: Single Factor for CLASS A

SUMMARY
Groups     Count   Sum   Average     Variance
Column 1   5       52    10.4        16.3
Column 2   4       47    11.75       7.583333
Column 3   4       62    15.5        5.666667
Column 4   6       58    9.666667    8.266667
Column 5   6       60    10          16.8
Column 6   6       53    8.833333    48.56667
Column 7   4       67    16.75       0.916667
Column 8   9       46    5.111111    8.861111

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        539.6763   7    77.09661   5.076268   0.000444   2.277143
Within Groups         546.7556   36   15.18765
Total                 1086.432   43

ANOVA Results for CLASS B

Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        287.308    7    41.044     3.695126   0.004144   2.277143
Within Groups         399.8738   36   11.10761
Total                 687.1818   43

In the test results, the F statistic for "between groups" in class A is 5.08, which is higher than the
critical F value (2.28). We can therefore reject the null hypothesis Ho that the means are equal. The same
holds for class B, where F = 3.70 also exceeds the critical value.
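The F statistic for class A can be recomputed by hand from the per-question counts. The sketch below is a plain-Python version of the one-way ANOVA calculation, not the spreadsheet procedure itself. It assumes that Table A.4 read row by row gives the class A counts for questions 1 to 44, and that the eight groups are consecutive blocks with the sizes shown in the SUMMARY table.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# ASSUMPTION: class A correct answers per question, in question order.
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15,
           14, 14, 14, 12, 10, 7, 7, 8, 14, 11, 9,
           9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16,
           18, 17, 2, 8, 9, 8, 1, 6, 4, 6, 2]
sizes = [5, 4, 4, 6, 6, 6, 4, 9]  # group counts from the SUMMARY table

# Cut the question sequence into consecutive groups of the given sizes.
groups, start = [], 0
for size in sizes:
    groups.append(class_a[start:start + size])
    start += size

f_stat, df_b, df_w = one_way_anova_f(groups)
F_CRIT = 2.277143  # critical value quoted in the ANOVA table

print([sum(g) for g in groups])  # group sums, to compare with the SUMMARY table
print(round(f_stat, 3), df_b, df_w, f_stat > F_CRIT)
```

The group sums and the F value agree with the spreadsheet output, which supports the reading of the tables assumed here.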

A flaw in this argument is that the group sizes differ. ANOVA assumes that the groups have equal
variances, and this assumption can be relaxed only when the groups are of equal size. A second point is
that the groups have to be independent. We therefore need a different grouping that satisfies these
assumptions. We divided the questions so that every group contains questions of all difficulty levels,
using a circular approach with four groups: questions 1 to 4 go into groups 1 to 4 respectively, questions
5 to 8 again go into groups 1 to 4, and so on. Each group thus contains questions of all types, which makes
the groups homogeneous. Please refer to Appendix Table A.4 - Class A Results for details of the grouping.
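The circular assignment amounts to dealing the questions out like cards. As a minimal sketch, again assuming that Table A.4 read row by row gives the class A counts in question order:

```python
# ASSUMPTION: class A correct answers per question, in question order.
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15,
           14, 14, 14, 12, 10, 7, 7, 8, 14, 11, 9,
           9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16,
           18, 17, 2, 8, 9, 8, 1, 6, 4, 6, 2]

N_GROUPS = 4

# Group i receives questions i+1, i+5, i+9, ... (every fourth question).
groups = [class_a[i::N_GROUPS] for i in range(N_GROUPS)]

print([len(g) for g in groups])  # equal group sizes
print([sum(g) for g in groups])  # group sums
```

Every group ends up with 11 questions spread over the whole difficulty range, which is what the equal-size argument relies on.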

Assumptions for ANOVA:

1) The sample measurements are selected from a normal population.
2) The samples are independent.
3) The unknown population mean and variance for the measurements from sample i are μi and σ² respectively.

We now explain why our current approach reasonably satisfies these assumptions.

The normality assumption is the least crucial. ANOVA is a test on means, so the central limit theorem
comes into play. The central limit theorem may not work for a small sample size, so we have taken a
larger sample size per group: with 11 observations per group, the central limit theorem can be applied
approximately. One alternative is the Kruskal-Wallis Rank Test, discussed in section 10.2 of the
textbook, which can be applied to non-normal samples. Since this method was not taught in class, we set
it aside and use the ANOVA test for the current problem.

The assumption of equal variances is important if the sample sizes are substantially different. Since we
have chosen the same sample size for every group, unequal variances are not a problem here. When all n's
are equal, the effect of even grossly unequal variances is minimal.

Coming to the independence problem, each group is homogeneous, containing questions from easy to tough,
so we treat each group as a whole as independent of the other groups. The circular approach ensures that
each group has a set of questions similar to that of the other groups.

Test Hypothesis

Ho : μ1 = μ2 = μ3 = μ4

Ha : The means are not all the same (showing that one or more of the groups have been tampered with,
which made their means differ from those of the other groups)

CLASS A Results

Groups   1    2    3    4
         13   9    4    14
         12   10   13   9
         15   19   15   14
         14   14   12   10
         7    7    8    14
         11   9    9    3
         14   2    12   3
         3    16   17   16
         16   18   17   2
         8    9    8    1
         6    4    6    2

SUMMARY
Groups     Count   Sum   Average    Variance
Column 1   11      119   10.81818   17.76364
Column 2   11      117   10.63636   30.45455
Column 3   11      121   11         19
Column 4   11      88    8          34.8

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        66.25      3    22.08333   0.865859   0.466735   2.838745
Within Groups         1020.182   40   25.50455
Total                 1086.432   43

ANOVA Results for Class B

Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        51.72727   3    17.24242   1.08536    0.366324   2.838745
Within Groups         635.4545   40   15.88636
Total                 687.1818   43

With the cyclic grouping, the F statistic is below the critical value of 2.84 in both classes (0.87 for
class A and 1.09 for class B), so Ho is not rejected for either class. The ANOVA F test only tells us
whether or not to reject Ho; when Ho is rejected, as with the first grouping, it does not indicate which
specific means differ. For that we can use the Tukey method, which examines the differences between
specified pairs of means and so can point to the group in which the tampering has been done.
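The Tukey comparison was not worked through in this report. As a rough sketch of its first step, the code below computes the studentized range statistic q for every pair of the four cyclic groups of class A, using the group means, the group size and the within-groups mean square from the ANOVA output. The critical value is deliberately not hard-coded; it has to be looked up in a studentized range table.

```python
from itertools import combinations
from math import sqrt

# Summary values copied from the cyclic-grouping ANOVA output for class A.
group_means = {1: 10.81818, 2: 10.63636, 3: 11.0, 4: 8.0}
n_per_group = 11
ms_within = 25.50455

# Standard error used by Tukey's method when all groups have the same size.
std_error = sqrt(ms_within / n_per_group)

# Studentized range statistic q for every pair of groups.
q_stats = {
    (a, b): abs(group_means[a] - group_means[b]) / std_error
    for a, b in combinations(group_means, 2)
}

for pair, q in sorted(q_stats.items(), key=lambda item: -item[1]):
    print(pair, round(q, 3))

# A pair of means differs significantly only if its q exceeds the critical
# value of the studentized range distribution for 4 groups and 40 degrees
# of freedom, taken from a table.
```

The largest q belongs to the pair made up of groups 3 and 4, so that pair is the first one to check against the tabulated value.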

Approach 2: The Pictorial Method

We can see that the frequency curve is approximately normal for class B, but it is skewed toward the
higher side in class A. This skew can be attributed to tampering or cheating by the teacher. The mean for
class A (Mean = 20.23) is also much higher than that of class B (Mean = 16.78).

From the plot of question number against the number of students who answered it correctly, we can say
that Q-30 to Q-36 in class A contain the tampered data. These questions do not follow the general trend:
they show a raised peak in the middle of an otherwise decreasing curve.

So we trim off Q-30 to Q-36 from both classes and plot the remaining questions again. Both curves now
come out approximately normal, and the skew in class A disappears. The mean for class A (Mean = 16.32)
has also fallen and is now comparable to that of class B (Mean = 15.44). This supports the conclusion
that some tampering was done on Q-30 to Q-36 in class A.

Approach 3 : The Wilcoxon Rank Sum Test

If we want to use the samples without assuming normality, we can use the rank sum approach (used for
non-normal distributions) discussed in section 9.2 of the textbook. Since the other tests rest on
mathematical assumptions that the given data do not satisfy, this approach is attractive because it
requires weaker assumptions.

This test requires the following condition:

1) The two populations have distributions of identical shape, which need not be normal.

The null hypothesis is that the two population distributions are identical. The alternative hypothesis is
that the mean of one group is larger than that of the other. If the null hypothesis is rejected, the two
groups are not distributed identically, which suggests that fraud has been committed in one of them. The
decision to reject is made by comparing the test statistic with its critical values.

Here the two groups could be the data from the two classes, or different groups of questions divided in a
homogeneous manner. Since this test was not covered in the syllabus, we have not worked the problem with
this method.
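For completeness, here is a minimal sketch of how the rank sum statistic could be computed for the two classes. It is an illustration, not a result of this project. It assumes that Tables A.4 and A.5 read row by row give the per-question correct counts of each class, it uses the large-sample normal approximation, and it omits the tie correction to the variance for simplicity. The function name rank_sum is ours.

```python
from math import sqrt


def rank_sum(x, y):
    """Return (W, z): the rank sum of x in the pooled sample and its
    normal-approximation z score (no tie correction in the variance)."""
    pooled = sorted(list(x) + list(y))
    # Midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    w = sum(rank_of[v] for v in x)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    return w, (w - mean_w) / sqrt(var_w)


# ASSUMPTION: per-question correct counts, read row by row from the appendix.
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15,
           14, 14, 14, 12, 10, 7, 7, 8, 14, 11, 9,
           9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16,
           18, 17, 2, 8, 9, 8, 1, 6, 4, 6, 2]
class_b = [13, 5, 6, 14, 2, 6, 9, 4, 10, 9, 15,
           10, 14, 7, 10, 12, 6, 3, 10, 10, 4, 4,
           12, 4, 12, 10, 8, 4, 10, 11, 6, 1, 3,
           1, 8, 4, 5, 2, 5, 4, 2, 3, 3, 1]

w, z = rank_sum(class_a, class_b)
print(w, round(z, 3))
```

A z score far from zero would point to a difference between the two distributions; the cut-off depends on the chosen significance level and on whether the alternative is one-sided or two-sided.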

APPENDIX

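Table A.1 : Division of questions into groups based on the approach 1 used in ANOVA test

Question   Cumulative Sum   Group (of 8)
1          13               A
2          22               A
3          26               A
4          40               A
5          52               A
6          62               B
7          75               B
8          84               B
9          99               B
10         118              C
11         133              C
12         147              C
13         161              C
14         175              D
15         187              D
16         197              D
17         204              D
18         211              D
19         219              D
20         233              E
21         244              E
22         253              E
23         262              E
24         265              E
25         279              E
26         281              F
27         293              F
28         296              F
29         299              F
30         315              F
31         332              F
32         348              G
33         364              G
34         382              G
35         399              G
36         401              H
37         409              H
38         418              H
39         426              H
40         427              H
41         433              H
42         437              H
43         443              H
44         445              H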

Table A.2 : Class A Results

CLASS A

Groups   A    B    C    D    E    F    G    H
         13   10   19   14   14   2    16   2
         9    13   15   12   11   12   16   8
         4    9    14   10   9    3    18   9
         14   15   14   7    9    3    17   8
         12   -    -    7    3    16   -    1
         -    -    -    8    14   17   -    6
         -    -    -    -    -    -    -    4
         -    -    -    -    -    -    -    6
         -    -    -    -    -    -    -    2

SUMMARY
Groups     Count   Sum   Average     Variance
Column 1   5       52    10.4        16.3
Column 2   4       47    11.75       7.583333
Column 3   4       62    15.5        5.666667
Column 4   6       58    9.666667    8.266667
Column 5   6       60    10          16.8
Column 6   6       53    8.833333    48.56667
Column 7   4       67    16.75       0.916667
Column 8   9       46    5.111111    8.861111

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        539.6763   7    77.09661   5.076268   0.000444   2.277143
Within Groups         546.7556   36   15.18765
Total                 1086.432   43

Table A.3 : Class B Results

CLASS B

Groups   A    B    C    D    E    F    G    H
         13   2    15   7    3    4    4    8
         5    6    10   10   10   12   10   4
         6    9    14   12   10   10   11   5
         14   4    -    6    4    8    6    2
         -    10   -    -    4    -    1    5
         -    9    -    -    12   -    3    4
         -    -    -    -    -    -    1    2
         -    -    -    -    -    -    -    3
         -    -    -    -    -    -    -    3
         -    -    -    -    -    -    -    1

SUMMARY
Groups     Count   Sum   Average     Variance
Column 1   4       38    9.5         21.66667
Column 2   6       40    6.666667    10.26667
Column 3   3       39    13          7
Column 4   4       35    8.75        7.583333
Column 5   6       43    7.166667    15.36667
Column 6   4       34    8.5         11.66667
Column 7   7       36    5.142857    16.47619
Column 8   10      37    3.7         4.011111

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        287.308    7    41.044     3.695126   0.004144   2.277143
Within Groups         399.8738   36   11.10761
Total                 687.1818   43

Table A.4 : Class A Results

CLASS A

Groups   1    2    3    4
         13   9    4    14
         12   10   13   9
         15   19   15   14
         14   14   12   10
         7    7    8    14
         11   9    9    3
         14   2    12   3
         3    16   17   16
         16   18   17   2
         8    9    8    1
         6    4    6    2

SUMMARY
Groups     Count   Sum   Average    Variance
Column 1   11      119   10.81818   17.76364
Column 2   11      117   10.63636   30.45455
Column 3   11      121   11         19
Column 4   11      88    8          34.8

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        66.25      3    22.08333   0.865859   0.466735   2.838745
Within Groups         1020.182   40   25.50455
Total                 1086.432   43

Table A.5 : Class B Results

CLASS B

Groups   1    2    3    4
         13   5    6    14
         2    6    9    4
         10   9    15   10
         14   7    10   12
         6    3    10   10
         4    4    12   4
         12   10   8    4
         10   11   6    1
         3    1    8    4
         5    2    5    4
         2    3    3    1

SUMMARY
Groups     Count   Sum   Average    Variance
Column 1   11      81    7.363636   20.65455
Column 2   11      61    5.545455   11.27273
Column 3   11      92    8.363636   11.45455
Column 4   11      68    6.181818   20.16364

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        51.72727   3    17.24242   1.08536    0.366324   2.838745
Within Groups         635.4545   40   15.88636
Total                 687.1818   43
