YOUNG INDIA FELLOWSHIP




                        Statistics Course
                                        Group Project
                                                Members :

                                              Abhishek Chopra
                                              Adhiraj Sarmah
                                                Kshitij Garg
                                              Mahesh Jakhotia
                                          Tulasi Prasad Chaudhary




                                                 7/25/2011




This group project is based on a real case study drawn from Atlanta primary school test papers. Growing
pressure on teachers to improve the test performance of their classes has led to malpractice. Our task
is to identify methods for detecting fraud, if any was committed, in the following case.

Contents

  1) Problem Statement
  2) Logical Analysis
  3) Inference
  4) Our Interpretation of the Cheating Process
  5) Statistical Approaches
  6) ANOVA
  7) Pictorial Method
  8) The Wilcoxon Rank Sum Test
  9) Appendix
         a. Table A.1 : Division of questions into groups based on the approach 1 used in ANOVA test
         b. Table A.2 : Class A Results
         c. Table A.3 : Class B Results
         d. Table A.4 : Class A Results
         e. Table A.5 : Class B Results

GROUP PROJECT STATISTICS – FRAUD DETECTION


Problem Statement: We have been given two sets of data, one for each of two classrooms, and we are required
to devise and carry out an analysis to determine whether there was teacher fraud in one or both of the
classrooms.

There are 4 possible scenarios:
1) Both A and B data have been tampered with.
2) Neither A nor B data has been tampered with.
3) A is fraudulent, B is not.
4) B is fraudulent, A is not.

We have summarized our thought process in this document and demonstrated it with the Excel sheets
attached in the folder. We used several approaches to derive the solution. Each method has its own
assumptions and its own pros and cons.

Logical Analysis:

       STEP – 1: We calculate the total number of correct answers for every question in both classes.
       Since we analyzed the data student by student and question by question, assigning each correct
       answer the value '1', this total is also the number of students who answered each question
       correctly in each class.

       STEP – 2: We then find the total number of correct answers for the entire class and divide it by the
       total number of students to arrive at the mean number of correct answers per student for
       both classes.

       STEP – 3: We take the results of the previous steps and plot line graphs for both classes with
       Questions on the X-axis and Class Performance on the Y-axis. These graphs give a
       broad perspective on whether there is any evidence of fraud.

       # We found that in Class A, Questions 30 to 36 clearly show an anomaly.
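
As a minimal sketch of STEP 1 to STEP 3, assume each class is stored as a 0/1 matrix with one row per student and one column per question. The small matrix and the function names below are hypothetical and are not the project data; the text bar chart merely stands in for the line graph.

```python
# Hypothetical 0/1 answer matrix: rows are students, columns are questions.
# A 1 marks a correct answer. Illustrative data only, not the project data.
answers = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]

def correct_per_question(matrix):
    """STEP 1: number of students who answered each question correctly."""
    return [sum(column) for column in zip(*matrix)]

def mean_correct_per_student(matrix):
    """STEP 2: total correct answers divided by the number of students."""
    total_correct = sum(sum(row) for row in matrix)
    return total_correct / len(matrix)

per_question = correct_per_question(answers)
mean_score = mean_correct_per_student(answers)

# STEP 3: a crude text version of the graph (question vs. class performance).
for number, count in enumerate(per_question, start=1):
    print(f"Q{number:>2} {'#' * count} {count}")
print(f"Mean correct answers per student: {mean_score:.2f}")
```

A run of questions whose bars stay unusually high and flat, against an otherwise falling trend, is the kind of anomaly the graph is meant to expose.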




       STEP – 4: We decided to focus on the anomalous region. We analyzed questions 30-36 and looked
       for abnormal patterns in them in both classes.

       # In Class A, 16 particular students showed a clear pattern of identical, uniformly correct
       answers to questions 30-36, which was not the case in Class B.

STEP – 5: We calculated the average score (i.e. the average number of correct answers) for each of these
16 students in Class A, with questions 30-36 included. The mean score of these 16
students was 46%.

For Class B, the mean score of all the students was 38%.




STEP – 6: We calculated the average score (i.e. the average number of correct answers) for each of these
      16 students in Class A EXCLUDING questions 30-36. The mean score of these 16
      students DECREASED to 42% (i.e. a decrease of 4 percentage points).

      For Class B, the mean score of all the students INCREASED to 40% (i.e. an increase of 2 percentage points).
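
The comparison in STEP 4 to STEP 6 can be sketched as follows. The matrix, the block position and the helper names are hypothetical and chosen only to illustrate the idea: pick out the students who are all correct on the suspect block, then compare their mean score with and without that block.

```python
# Hypothetical 0/1 answer matrix (rows: students, columns: questions 1..8).
# Illustrative only; the suspect block here is questions 5-7.
answers = [
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
]
BLOCK = range(4, 7)  # zero-based column indices of the suspect questions

def percent_score(row, skip=()):
    """Percentage of correct answers, ignoring the columns listed in skip."""
    kept = [value for index, value in enumerate(row) if index not in skip]
    return 100 * sum(kept) / len(kept)

def mean(values):
    return sum(values) / len(values)

# STEP 4: students whose answers are all correct across the suspect block.
suspects = [row for row in answers if all(row[i] for i in BLOCK)]

# STEP 5 and STEP 6: mean score of those students with and without the block.
with_block = mean([percent_score(row) for row in suspects])
without_block = mean([percent_score(row, skip=BLOCK) for row in suspects])

print(f"{len(suspects)} suspect students")
print(f"mean with block:    {with_block:.1f}%")
print(f"mean without block: {without_block:.1f}%")
```

A large drop once the block is excluded means those questions were contributing far more than the rest of the paper, which is the signal the steps above look for.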

      INFERENCE: Excluding questions 30 to 36 lowers the mean score of the 16 Class A students but
      raises that of Class B. This gives reasonable grounds to believe that some form of
      cheating/tampering was done on these questions.


      Our Interpretation of the Cheating Process

   1) For questions 30 to 36, the graphs show 16 students rising consistently and visibly above the
      other students and above the class average. This amounts to 16 x 6 questions, which is
      equal to 96 answers that have probably been tampered with.

   2) The reasons for choosing that particular set of questions (30 to 36) could be:

      a) Since it is given that the level of difficulty increases with the question number, it is logical
         to assume that more students would answer correctly in the first half of the paper than in the
         second half, because the difficulty is low at the beginning. In the same manner, the
         second half of the paper would be expected to show fewer correct answers, as the difficulty
         is higher.
      b) So it would be smart on the teacher's part to tamper with the second half
         of the questions, since most students would be expected to answer the first half
         correctly anyway. Even within the second half, it would be smarter to avoid the last few
         questions: they are the most difficult, and a jump in correct answers for those
         questions would be easily detected. So it would be logical to choose
         questions near the beginning of the second half and well before the last
         few questions.

   3) Choosing a set of consecutive questions also reduces the time needed to
      edit the answers manually, which matters given the limited time generally available to an
      invigilator or a teacher. And 96 answers are enough to raise the class
      average significantly, by the 4% we found in our
      analysis.


Statistical Approaches used:

   1) ANOVA Method: Initially we divided the data of each class into groups and applied ANOVA to see
      whether the groups have the same distribution. If one of the groups did not have the same
      distribution, we could conclude that the data of that group had been tampered with, as it
      disturbed the distribution of the whole class. We used two approaches to form the groups.
      Later on we used the Tukey method to find the groups whose means deviated.

   2) Pictorial distribution: A graph was plotted with the questions on the X-axis and the class
      performance on the Y-axis. When we analyzed the Class A graph, we found that between
      questions 30-36 the plot was flat and the results were higher than the performance on the other
      questions. We can conclude on a pictorial basis that fraud has been committed on these questions.

   3) The Wilcoxon Rank Sum Test: If we want to use the samples without assuming normality,
      we can use the rank sum approach (used for non-normal distributions) discussed in
      section 9.2 of the textbook. Since the other tests rest on mathematical assumptions
      that the given data do not satisfy, we can use this approach, which requires weaker
      mathematical assumptions.


Approach 1 : ANOVA Approach

To compare the means of several groups, ANOVA is preferred to multiple t-tests because
ANOVA yields a single test statistic for comparing all the means, so the overall risk of a Type I error can
be controlled. If we ran many t-tests, each at a given alpha level, we could not know the overall risk of
a Type I error. Certainly, the more tests one runs, the greater the risk of a false positive conclusion
somewhere among the tests.
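
To give a feel for how quickly that risk grows, here is a small sketch. It assumes a per-test alpha of 0.05 and, as a simplification, that the pairwise tests are independent; pairwise t-tests on shared groups are not truly independent, so the figure is only indicative.

```python
from math import comb

ALPHA = 0.05   # per-test significance level (assumed)
GROUPS = 8     # number of groups compared in the text

# Number of distinct pairwise t-tests among the groups.
pairs = comb(GROUPS, 2)

# Chance of at least one false positive if every test were independent.
familywise_error = 1 - (1 - ALPHA) ** pairs

print(f"{pairs} pairwise tests")
print(f"familywise Type I error risk: {familywise_error:.3f}")
```

With eight groups there are 28 pairwise comparisons, and under these assumptions the chance of at least one false positive is roughly three in four, which is why a single F test is preferred.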

Initially we divided the questions of Class A into groups according to their difficulty level. Difficulty
was measured by the area of right answers, i.e. the cumulative number of correct answers given by the
students. The total number of correct answers in the class is 445, so we formed eight groups of
roughly equal area (445/8 ≈ 56): the total score in each group is approximately 56.

The grouping into eight groups is shown in Appendix Table A.1.
The ANOVA test was applied to these groups to find out whether the group means were the same or different.



Test Hypothesis

H0: μ1 = μ2 = ... = μ8

Ha: The means are not all the same (showing that one or more of the groups has been tampered with, which
shifted its mean away from the other groups)

Results for CLASS A

Anova: Single Factor (CLASS A)

SUMMARY
Groups        Count    Sum    Average     Variance
Column 1          5     52    10.4        16.3
Column 2          4     47    11.75       7.583333
Column 3          4     62    15.5        5.666667
Column 4          6     58    9.666667    8.266667
Column 5          6     60    10          16.8
Column 6          6     53    8.833333    48.56667
Column 7          4     67    16.75       0.916667
Column 8          9     46    5.111111    8.861111

ANOVA
Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         539.6763     7    77.09661    5.076268    0.000444    2.277143
Within Groups          546.7556    36    15.18765

Total                  1086.432    43

ANOVA Results for CLASS B
Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         287.308      7    41.044      3.695126    0.004144    2.277143
Within Groups          399.8738    36    11.10761

Total                  687.1818    43


In the test results, the F statistic for "between groups" in Class A is 5.07, which is
higher than the critical F value (2.27), so the null hypothesis H0 that the means are equal can
be rejected. The same holds for Class B, where F = 3.69 also exceeds the critical value.
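
The Class A figures can be reproduced from the grouped data in Appendix Table A.2. The sketch below is a plain-Python one-way ANOVA written for illustration (the function name is ours); the critical value is simply copied from the table rather than computed.

```python
# Class A question groups as listed in Appendix Table A.2
# (correct answers per question, eight groups of unequal size).
groups = [
    [13, 9, 4, 14, 12],
    [10, 13, 9, 15],
    [19, 15, 14, 14],
    [14, 12, 10, 7, 7, 8],
    [14, 11, 9, 9, 3, 14],
    [2, 12, 3, 3, 16, 17],
    [16, 16, 18, 17],
    [2, 8, 9, 8, 1, 6, 4, 6, 2],
]

def one_way_anova(samples):
    """Return (SS between, SS within, df between, df within, F)."""
    observations = [x for sample in samples for x in sample]
    grand_mean = sum(observations) / len(observations)
    means = [sum(sample) / len(sample) for sample in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(observations) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(groups)
F_CRIT = 2.277143  # critical value copied from the ANOVA table

print(f"SS between = {ss_b:.4f} (df {df_b})")
print(f"SS within  = {ss_w:.4f} (df {df_w})")
print(f"F = {f_stat:.4f}")
print("reject H0" if f_stat > F_CRIT else "do not reject H0")
```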

But a flaw in this argument is that the sample sizes of the groups differ. ANOVA assumes that the
groups have equal variances, an assumption that can be relaxed only when the
groups are of equal size. A second point is that the groups have to be independent. Hence we
need a different approach that satisfies these assumptions. We therefore regrouped the questions in such
a way that every group contains questions of all difficulty levels. We used a circular (round-robin)
approach to divide the questions into four groups: questions 1 to 4 were placed one each in groups 1 to 4,
then questions 5 to 8 were placed one each in groups 1 to 4, and so on. Each group thus has questions
of all types, making it a homogeneous
model. Please refer to Appendix Table A.4 - Class A Results for more details of the grouping.
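
The circular grouping can be sketched as below. The per-question counts are taken as successive differences of the cumulative sums listed in Appendix Table A.1; the function name is ours. Dealing the 44 questions into four groups in turn reproduces the columns of Table A.4.

```python
# Cumulative number of correct answers for Class A (Appendix Table A.1).
cumulative = [
    13, 22, 26, 40, 52, 62, 75, 84, 99, 118, 133,
    147, 161, 175, 187, 197, 204, 211, 219, 233, 244, 253,
    262, 265, 279, 281, 293, 296, 299, 315, 332, 348, 364,
    382, 399, 401, 409, 418, 426, 427, 433, 437, 443, 445,
]
# Correct answers per question: difference between successive cumulative sums.
per_question = [b - a for a, b in zip([0] + cumulative, cumulative)]

def round_robin(values, n_groups):
    """Deal values into n_groups in turn: item k goes to group k mod n_groups."""
    return [values[start::n_groups] for start in range(n_groups)]

groups = round_robin(per_question, 4)
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {group} (sum {sum(group)})")
```

Because every fourth question lands in the same group, each group spans the whole range from easy to hard questions and all groups have 11 observations.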

Assumptions for ANOVA:

   1) The sample measurements are selected from a normal population.
   2) The samples are independent.
   3) The unknown population mean and variance for the measurements from sample i are μi and σ² respectively.

We now explain why our current approach approximately satisfies these assumptions.

The normality assumption is the least crucial. The ANOVA test is a test on means, so the central limit
theorem comes into play. The central limit theorem may not hold for a small sample size, so we have
taken a larger sample size per group: with 11 observations per group, the central limit theorem can be
applied approximately. One alternative is the Kruskal-Wallis rank test, which is discussed in
section 10.2 of the textbook. This method can be applied to non-normal samples. But since this
method was not taught in the classroom, we set it aside and use the ANOVA test
for the current problem.



The assumption of equal variances is important if the sample sizes are substantially different. Since we
have chosen the same sample size for every group, unequal variances are not a problem here. When all n's are
equal, the effect of even grossly unequal variances is minimal.

Coming to the independence problem: each group is homogeneous, containing questions
ranging from easy to tough, so each group as a whole is independent of the other groups.
The circular approach ensures that each group has a homogeneous set of questions similar
to those of the other groups.

Test Hypothesis

H0: μ1 = μ2 = μ3 = μ4

Ha: The means are not all the same (showing that one or more of the groups has been tampered with, which
shifted its mean away from the other groups)

        CLASS A Results

         1                2       3         4

        13                9       4         14
        12                10     13         9
        15                19     15         14
        14                14     12         10
        7                 7       8         14
        11                9       9         3
        14                2      12         3
        3                 16     17         16
        16                18     17         2
        8                 9       8         1
        6                 4       6         2

Anova: Single Factor

SUMMARY
    Groups             Count    Sum   Average Variance
Column 1                   11     119 10.81818 17.76364
Column 2                   11     117 10.63636 30.45455
Column 3                   11     121       11       19
Column 4                   11      88        8     34.8



ANOVA
   Source of
   Variation           SS        df         MS        F     P-value   F crit
Between Groups         66.25           3 22.08333 0.865859 0.466735 2.838745
Within Groups       1020.182          40 25.50455

Total               1086.432          43

ANOVA Results for Class B
     Source of
     Variation        SS          df         MS         F     P-value   F crit
 Between Groups    51.72727             3 17.24242   1.08536 0.366324 2.838745
 Within Groups     635.4545            40 15.88636

 Total             687.1818            43


The ANOVA F test only tells us whether or not to reject H0. Rejecting the null hypothesis that the
means are equal does not indicate which specific means differ. When H0 is rejected,
we can therefore use the Tukey method to examine the differences between specific pairs of means. By this
method we can point out the specific group in which the tampering has been done.
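
A minimal sketch of the Tukey comparison for the four equal-sized Class A groups of Appendix Table A.4 follows. The critical value 3.79 is an assumption: it is the approximate 5% studentized-range value for 4 groups and 40 error degrees of freedom as read from a standard table, not a figure from the project.

```python
from itertools import combinations
from math import sqrt

# Class A round-robin groups from Appendix Table A.4.
groups = {
    1: [13, 12, 15, 14, 7, 11, 14, 3, 16, 8, 6],
    2: [9, 10, 19, 14, 7, 9, 2, 16, 18, 9, 4],
    3: [4, 13, 15, 12, 8, 9, 12, 17, 17, 8, 6],
    4: [14, 9, 14, 10, 14, 3, 3, 16, 2, 1, 2],
}
# Assumption: approximate 5% studentized-range critical value for 4 groups
# and 40 error degrees of freedom, read from a standard table.
Q_CRIT = 3.79

n = 11  # observations per group (equal group sizes)
means = {k: sum(v) / n for k, v in groups.items()}
ss_within = sum((x - means[k]) ** 2 for k, v in groups.items() for x in v)
ms_within = ss_within / (len(groups) * (n - 1))
standard_error = sqrt(ms_within / n)

# Studentized range statistic q for every pair of group means.
q_values = {
    (a, b): abs(means[a] - means[b]) / standard_error
    for a, b in combinations(groups, 2)
}
for (a, b), q in q_values.items():
    verdict = "differ" if q > Q_CRIT else "no significant difference"
    print(f"groups {a} and {b}: q = {q:.2f} ({verdict})")
```

For these four groups no pair exceeds the assumed critical value, which agrees with the F test for the circular grouping not rejecting H0.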

Approach 2: The Pictorial Method

We can see that the frequency curve is approximately normal for Class B, but is skewed toward the higher
side for Class A. This skew can be attributed to tampering or cheating by the teacher. The mean for
Class A (Mean = 20.23) is much higher than that for Class B (Mean = 16.78).

As seen from the "Question vs. number of students who answered it correctly" plot, Q-30 to
Q-36 in Class A contain the tampered data. These questions do not follow the normal trend and show a
peak in the middle of an otherwise decreasing curve.

So we trim Q-30 to Q-36 from both classes and plot the remaining questions again.
Both curves now come out normal and Class A is no longer skewed. The
mean for Class A (Mean = 16.32) has also fallen and is now comparable to that of Class B (Mean = 15.44).
So we can conclude that some tampering was done on Q-30 to Q-36 in Class A.




Approach 3 : The Wilcoxon Rank Sum Test

If we want to use the samples without assuming normality, we can use the rank sum
approach (used for non-normal distributions) discussed in section 9.2 of the textbook. Since the other tests
rest on mathematical assumptions that the given data do not satisfy, we can use this
approach, which requires weaker mathematical assumptions.

This test requires the following condition:

   1) Identical distributions, but not necessarily normal.

The null hypothesis is that the two population distributions are identical. The alternative hypothesis is
that one of the groups is shifted higher than the other. If the null hypothesis is rejected, the
two groups are not distributed identically, which suggests that fraud has
been committed in one of them. We reject the null hypothesis when the test statistic falls beyond the
critical value.




Here the two groups could be the data from the two classes, or different groups of questions divided in a
homogeneous manner. But since this has not been covered in the syllabus, we have not worked the problem with
this method.
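
For illustration only, the rank sum statistic can be sketched in plain Python as below. The function names are ours, the z score uses the usual large-sample normal approximation without a tie correction, and the example compares group 1 of Class A (Table A.4) with group 1 of Class B (Table A.5); this is not an analysis carried out in the project.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (W, z): rank sum of x and its normal-approximation z score.

    The variance ignores the tie correction, so z is only approximate
    when the data contain many ties.
    """
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:len(x)])
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    variance = n1 * n2 * (n1 + n2 + 1) / 12
    return w, (w - expected) / sqrt(variance)

# Group 1 of Class A (Table A.4) against group 1 of Class B (Table A.5).
class_a = [13, 12, 15, 14, 7, 11, 14, 3, 16, 8, 6]
class_b = [13, 2, 10, 14, 6, 4, 12, 10, 3, 5, 2]
w, z = rank_sum_test(class_a, class_b)
print(f"W = {w}, z = {z:.2f}")
```

A z score far from zero would indicate that one group tends to rank higher than the other; the decision threshold depends on the chosen alpha and on whether the test is one-sided or two-sided.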

APPENDIX

Table A.1 : Division of questions into groups based on the approach 1 used in ANOVA test


Cumulative     8
  Sum        Groups

     13
     22
     26
     40
     52
     62
     75
     84
     99
    118
    133
    147
    161
    175
    187
    197
    204
    211
    219
    233
    244
    253
    262
    265
    279
    281
    293
    296
    299
    315
    332
    348
    364
    382
    399
    401
    409
    418
    426
    427
    433
    437
    443
    445
Table A.2 : Class A Results

           CLASS A

     Groups             A           B          C            D       E       F       G       H

                         13.00      10.00      19.00        14.00   14.00    2.00   16.00       2
                          9.00      13.00      15.00        12.00   11.00   12.00   16.00       8
                          4.00       9.00      14.00        10.00    9.00    3.00   18.00       9
                         14.00      15.00      14.00         7.00    9.00    3.00   17.00       8
                         12.00                               7.00    3.00   16.00               1
                                                             8.00   14.00   17.00               6
                                                                                                4
                                                                                                6
                                                                                                2


SUMMARY
     Groups             Count       Sum   Average Variance
Column 1                        5      52     10.4     16.3
Column 2                        4      47    11.75 7.583333
Column 3                        4      62     15.5 5.666667
Column 4                        6      58 9.666667 8.266667
Column 5                        6      60       10     16.8
Column 6                        6      53 8.833333 48.56667
Column 7                        4      67    16.75 0.916667
Column 8                        9      46 5.111111 8.861111



ANOVA
 Source of Variation      SS         df         MS        F     P-value   F crit
Between Groups         539.6763            7 77.09661 5.076268 0.000444 2.277143
Within Groups          546.7556           36 15.18765

Total                  1086.432           43




Table A.3: Class B Results


        CLASS B Results

        Groups             A        B          C            D       E    F         G   H

                               13    2.00      15.00         7.00    3    4     4.00   8.00
                                5    6.00      10.00        10.00   10   12    10.00   4.00
                                6    9.00      14.00        12.00   10   10    11.00   5.00
                               14    4.00                    6.00    4    8     6.00   2.00
                                    10.00                            4          1.00   5.00
                                     9.00                           12          3.00   4.00
                                                                                1.00   2.00
                                                                                       3.00
                                                                                       3.00
                                                                                       1.00


Anova: Single Factor

SUMMARY
     Groups               Count     Sum   Average Variance
Column 1                        4      38      9.5 21.66667
Column 2                        6      40 6.666667 10.26667
Column 3                        3      39       13        7
Column 4                        4      35     8.75 7.583333
Column 5                        6      43 7.166667 15.36667
Column 6                        4      34      8.5 11.66667
Column 7                        7      36 5.142857 16.47619
Column 8                       10      37      3.7 4.011111



ANOVA
 Source of Variation      SS         df         MS        F     P-value   F crit
Between Groups          287.308            7   41.044 3.695126 0.004144 2.277143
Within Groups          399.8738           36 11.10761

Total                  687.1818           43




Table A.4 - Class A Results

Class A :

        1               2        3         4

        13              9        4         14
        12              10      13         9
        15              19      15         14
        14              14      12         10
        7               7        8         14
        11              9        9         3
        14              2       12         3
        3               16      17         16
        16              18      17         2
        8               9        8         1
        6               4        6         2



Anova: Single Factor

SUMMARY
    Groups             Count    Sum   Average Variance
Column 1                   11     119 10.81818 17.76364
Column 2                   11     117 10.63636 30.45455
Column 3                   11     121       11       19
Column 4                   11      88        8     34.8



ANOVA
   Source of
   Variation         SS         df         MS        F     P-value   F crit
Between Groups       66.25            3 22.08333 0.865859 0.466735 2.838745
Within Groups     1020.182           40 25.50455

Total             1086.432           43




Table A.5 : Class B Results

             CLASS B



        1                 2        3           4

        13               5         6          14
        2                6         9          4
        10               9        15          10
        14               7        10          12
        6                3        10          10
        4                4        12          4
        12               10        8          4
        10               11        6          1
        3                1         8          4
        5                2         5          4
        2                3         3          1

Anova: Single Factor

SUMMARY
    Groups              Count     Sum       Average    Variance
Column 1                    11       81     7.363636   20.65455
Column 2                    11       61     5.545455   11.27273
Column 3                    11       92     8.363636   11.45455
Column 4                    11       68     6.181818   20.16364



ANOVA
   Source of
   Variation              SS      df         MS            F     P-value   F crit
Between Groups         51.72727         3 17.24242      1.08536 0.366324 2.838745
Within Groups          635.4545        40 15.88636

Total                  687.1818        43




  • 4. # There was a clear pattern of uniformly correct answers to questions 30-36 for a particular 16 students in Class A, which was not the case in Class B.
STEP – 5: We calculated the average score (i.e. the average number of correct answers) for each of these 16 students in Class A, including questions 30-36. The mean score of these 16 students is 46%. For Class B, the mean score of all the students is 38%.
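The check for a uniform run of correct answers, and the mean score of the flagged students, could look roughly like the sketch below. The data are hypothetical toy values and the helper names are assumptions for illustration; the project itself did this step in Excel.

```python
def students_with_perfect_block(matrix, first_q, last_q):
    """Indices of students who answered every question in the
    1-based inclusive range first_q..last_q correctly."""
    block = slice(first_q - 1, last_q)
    return [i for i, row in enumerate(matrix) if all(row[block])]

def mean_percent_correct(matrix, student_ids):
    """Average share of correct answers, in percent, over the given students."""
    shares = [sum(matrix[i]) / len(matrix[i]) for i in student_ids]
    return 100 * sum(shares) / len(shares)

# Hypothetical toy data: 4 students, 6 questions (1 = correct).
toy = [
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
]

flagged = students_with_perfect_block(toy, 3, 5)   # block = questions 3 to 5
flagged_mean = mean_percent_correct(toy, flagged)
print(flagged, round(flagged_mean, 2))
```

With the real data, the block would be questions 30 to 36 and the flagged set would be the 16 students mentioned above.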
  • 5. STEP – 6: We calculated the average score (i.e. the average number of correct answers) for each of these 16 students in Class A EXCLUDING questions 30-36. The mean score of the 16 students of Class A DECREASED to 42% (i.e. a decrease of 4 percentage points). For Class B, the mean score of all the students INCREASED to 40% (i.e. an increase of 2 percentage points).
INFERENCE: Therefore, questions 30 to 36 give reasonable grounds to believe that some form of cheating/tampering was done on these questions.
Our interpretation of the Cheating Process
1) From questions 30 to 36, the graphs show 16 students performing visibly and consistently above the other students and above the class average. This adds up to 16 x 6 = 96 answers that have probably been tampered with.
2) The reasons for choosing that particular set of questions (30 to 36) could be:
a) Since it is given that the level of difficulty increases with the question number, it is logical to assume that more students would get correct answers in the first half of the questions than in the second half, because the difficulty is low at the beginning. In the same manner, the second half of the questions would be expected to show fewer correct answers, as the difficulty is higher.
b) So it would be logical for the teacher to tamper with the second half of the questions, since most students would be expected to answer the first half correctly anyway. Even within the second half, it would be smarter to avoid the last few questions: they are the most difficult, and an increased number of correct answers there would be easy to detect. So it would be logical to choose questions near the beginning of the second half and well before the last few questions.
3) Choosing a set of consecutive questions also reduces the time needed to edit the answers manually, which matters given the limited time generally available to an invigilator or a teacher. And 96 answers are enough to shift the average class performance by a significant amount, namely the 4% we found in our analysis.
Statistical Approaches used:
1) ANOVA Method: Initially we divided the questions of each class into groups and applied ANOVA to see whether the groups have the same distribution. If one of the groups did not have the same distribution, we could conclude that the data of that group was tampered with, as it disturbed the distribution of the whole class. We used two approaches to divide the questions into groups. Later on we used the Tukey Method to find the groups whose mean deviated.
  • 6. 2) Pictorial distribution: A graph was plotted with the questions on the X-axis and the class performance on the Y-axis. When we analyzed the Class A graph, we found that between questions 30 and 36 the plot was flat and the results were higher than for the other questions. On a pictorial basis, this suggests that fraud was committed on these questions.
3) The Wilcoxon Rank Sum Test: If we want to use the samples without assuming normality, we can use the Rank Sum approach (used for non-normal distributions) discussed in section 9.2 of the textbook. Since the other tests rest on mathematical assumptions which are not satisfied by the given data, we can use this approach, which requires weaker assumptions.
Approach 1: ANOVA Approach
To compare the means and distributions of several groups, ANOVA is preferred to multiple "t-tests" because ANOVA leads to a single test statistic for comparing all the means, so the overall risk of a type I error can be controlled. If we ran many t-tests, each at a given alpha level, we could not know the overall risk of a type I error. Certainly, the more tests one runs, the greater the risk of a false positive conclusion somewhere among the tests.
Initially we divided the questions of Class A into groups according to their toughness level. The toughness level was defined by the area, i.e. the number of right answers given by the students. The total number of correct answers in Class A is 445, so we divided the questions into eight consecutive groups of roughly equal area (445/8 ≈ 56), so that the sum of the scores in each group is approximately 56. The grouping is shown in Appendix Table A.1. The ANOVA test was applied to these groups to find out whether the group means were the same or different.
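The single-factor ANOVA on these eight groups can be sketched in plain Python. The per-question counts below are obtained by differencing the cumulative sums in Appendix Table A.1, and the group sizes are taken from the Count column of the Class A summary table; the function names are assumptions made for this sketch, and the report itself used Excel's "Anova: Single Factor" tool.

```python
# Per-question correct counts for Class A, recovered by differencing the
# cumulative sums in Table A.1 (44 questions, total 445).
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15, 14, 14, 14, 12, 10,
           7, 7, 8, 14, 11, 9, 9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16, 18,
           17, 2, 8, 9, 8, 1, 6, 4, 6, 2]

# Group sizes as listed in the Class A summary table (Count column).
group_sizes = [5, 4, 4, 6, 6, 6, 4, 9]

def split_consecutive(values, sizes):
    """Cut the list into consecutive groups of the given sizes."""
    groups, start = [], 0
    for size in sizes:
        groups.append(values[start:start + size])
        start += size
    return groups

def one_way_anova(groups):
    """Return (SS between, SS within, F) for a single-factor ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

groups = split_consecutive(class_a, group_sizes)
ss_b, ss_w, f_stat = one_way_anova(groups)
print(round(ss_b, 4), round(ss_w, 4), round(f_stat, 4))
```

The sums of squares and the F statistic computed this way should agree with the Class A ANOVA table in this report; the p-value and critical F would additionally need an F-distribution routine, which is left out here.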
Test Hypothesis
H0: μ1 = μ2 = … = μ8
Ha: The means are not all the same (which would show that one or more of the groups have been tampered with, shifting their means away from those of the other groups)
Results for CLASS A
Anova: Single Factor for CLASS A
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   5      52   10.4       16.3
    Column 2   4      47   11.75      7.583333
    Column 3   4      62   15.5       5.666667
    Column 4   6      58   9.666667   8.266667
    Column 5   6      60   10         16.8
    Column 6   6      53   8.833333   48.56667
    Column 7   4      67   16.75      0.916667
    Column 8   9      46   5.111111   8.861111
  • 7. ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       539.6763  7   77.09661  5.076268  0.000444  2.277143
    Within Groups        546.7556  36  15.18765
    Total                1086.432  43
ANOVA Results for CLASS B
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       287.308   7   41.044    3.695126  0.004144  2.277143
    Within Groups        399.8738  36  11.10761
    Total                687.1818  43
In the test results, the F statistic for "between groups" in Class A is 5.08, which is higher than the critical F value (2.28). The null hypothesis H0 that the means are equal can therefore be rejected. One flaw in this argument is that the sample sizes of the groups differ: ANOVA assumes equal group variances, and this assumption matters unless the groups are of equal size. A second point is that the groups have to be independent. Hence we need a different approach to satisfy these assumptions.
We have now divided the questions so that each group contains questions of all difficulty levels. We used a circular approach to divide the questions into four groups: questions 1 to 4 go into groups 1 to 4 respectively, questions 5 to 8 go into groups 1 to 4 again, and so on. Each group thus has questions of all types, making it a homogeneous model. Please refer to Appendix Table A.4 (Class A Results) for more details of the grouping.
Assumptions for ANOVA:
1) The sample measurements are selected from a normal population.
2) The samples are independent.
3) The unknown population mean and variance for the measurements from sample i are μi and σ² respectively.
We now explain why our current approach approximately satisfies these assumptions. The normality assumption is the least crucial: the ANOVA test is a test on means, so the central limit theorem takes effect. The central limit theorem may not work for a small sample size. Hence we have taken a large sample size per group.
The sample size is 11 per group, so the central limit theorem can be applied approximately. One alternative is the Kruskal-Wallis Rank Test, discussed in section 10.2 of the textbook, which can be applied to non-normal samples. But since this method was not taught in class, we set it aside and focus on using the ANOVA test for the current problem.
  • 8. The assumption of equal variances is important if the sample sizes are substantially different. Since we have chosen the same sample size for every group, unequal variances are not a problem here: when all n's are equal, the effect of even grossly unequal variances is minimal. Coming to the independence problem: since each group is homogeneous, containing questions from easy to tough, we treat each group as a whole as independent of the other groups. The circular approach ensures that each group has a set of questions similar to those of the other groups.
Test Hypothesis
H0: μ1 = μ2 = μ3 = μ4
Ha: The means are not all the same (which would show that one or more of the groups have been tampered with, shifting their means away from those of the other groups)
CLASS A Results
    1    2    3    4
    13   9    4    14
    12   10   13   9
    15   19   15   14
    14   14   12   10
    7    7    8    14
    11   9    9    3
    14   2    12   3
    3    16   17   16
    16   18   17   2
    8    9    8    1
    6    4    6    2
Anova: Single Factor
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   11     119  10.81818   17.76364
    Column 2   11     117  10.63636   30.45455
    Column 3   11     121  11         19
    Column 4   11     88   8          34.8
ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       66.25     3   22.08333  0.865859  0.466735  2.838745
    Within Groups        1020.182  40  25.50455
    Total                1086.432  43
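The circular grouping and the resulting F statistic for Class A can be sketched as follows. The per-question counts are read row by row from the CLASS A Results table above; the helper names `circular_groups` and `anova_f` are assumptions for this illustration, not part of the original Excel workflow.

```python
# Class A correct counts per question, read row by row from the table above.
class_a = [13, 9, 4, 14, 12, 10, 13, 9, 15, 19, 15, 14, 14, 14, 12, 10,
           7, 7, 8, 14, 11, 9, 9, 3, 14, 2, 12, 3, 3, 16, 17, 16, 16, 18,
           17, 2, 8, 9, 8, 1, 6, 4, 6, 2]

def circular_groups(values, n_groups):
    """Deal the values out in rotation: question 1 to group 1,
    question 2 to group 2, ..., then start again at group 1."""
    return [values[k::n_groups] for k in range(n_groups)]

def anova_f(groups):
    """F statistic of a single-factor ANOVA (between MS / within MS)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (n_total - len(groups))
    return ms_between / ms_within

groups = circular_groups(class_a, 4)
group_sums = [sum(g) for g in groups]
f_stat = anova_f(groups)
print(group_sums, round(f_stat, 4))
```

Because every group receives every fourth question, each group spans the full range of difficulty, which is the homogeneity the circular approach is meant to achieve.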
  • 9. ANOVA Results for Class B
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       51.72727  3   17.24242  1.08536   0.366324  2.838745
    Within Groups        635.4545  40  15.88636
    Total                687.1818  43
With the circular grouping, F is below the critical value in both classes (0.87 and 1.09 against 2.84), so H0 is not rejected. The ANOVA F test only tells us whether or not to reject H0. Rejecting the null hypothesis that the means are equal does not indicate which means differ. We can therefore use the Tukey method to find the differences among specific means, and in this way point to the group in which the tampering was done.
Approach 2: The Pictorial Method
We can see that the frequency curve comes out normal for Class B, but it is skewed towards the higher side in Class A. This skew can be attributed to "tampering or cheating by the teacher". The mean for Class A (Mean = 20.23) is noticeably higher than that of Class B (Mean = 16.78). As seen from the "Question vs. No. of students who answered it correctly" plot, Q-30 to Q-36 in Class A contain the tampered data. These questions do not follow the normal trend and show a raised peak in the middle of the decreasing curve. So we trim Q-30 to Q-36 from both classes and plot the remaining questions again. Both curves now come out normal, and there is no skew in Class A. The mean for Class A (Mean = 16.32) has also reduced and is now comparable to Class B (Mean = 15.44). So it can be said that some tampering was done from Q-30 to Q-36 in Class A.
Approach 3: The Wilcoxon Rank Sum Test
If we want to use the samples without assuming normality, we can use the Rank Sum approach (used for non-normal distributions) discussed in section 9.2 of the textbook. Since the other tests rest on mathematical assumptions which are not satisfied by the given data, we can use this approach, which requires weaker assumptions.
This test requires the following condition: 1) Identical distributions, but not necessarily normal. The null hypothesis is that the two population distributions are identical. The alternative hypothesis is that the mean of one of the groups is larger than that of the other group. If the null hypothesis is rejected, the two groups are not identically distributed, which suggests that fraud has been committed in one of the groups. We compare the test statistic with the critical values to decide whether to reject the null hypothesis.
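As a rough illustration of how the rank sum statistic is formed, the sketch below ranks the pooled observations (tied values share their average rank), sums the ranks of the first sample, and standardizes the sum with the usual large-sample normal approximation. The two samples are hypothetical toy values, not the class data, and no tie correction is applied to the variance; the textbook's tables of critical values would be used for small samples.

```python
import math

def midranks(values):
    """Ranks 1..N of the values; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks

def rank_sum_test(sample_1, sample_2):
    """Return (W, z): rank sum of sample_1 and its normal-approximation z score."""
    n1, n2 = len(sample_1), len(sample_2)
    ranks = midranks(list(sample_1) + list(sample_2))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # assumes no ties
    return w, (w - mean_w) / sd_w

# Hypothetical toy samples, for illustration only.
w, z = rank_sum_test([1, 3, 5], [2, 4, 6, 8])
print(w, round(z, 4))
```

A large absolute value of `z` would point towards rejecting the null hypothesis of identical distributions.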
  • 10. Here the two groups could be the data from the two classes, or the different groups of questions divided in a homogeneous manner. But since this test has not been covered in the syllabus, we have not worked the problem with this method.
APPENDIX
Table A.1: Division of questions into groups based on approach 1 used in the ANOVA test
    Group  Cumulative Sum
    1      13 22 26 40 52
    2      62 75 84 99
    3      118 133 147 161
    4      175 187 197 204 211 219
    5      233 244 253 262 265 279
    6      281 293 296 299 315 332
    7      348 364 382 399
    8      401 409 418 426 427 433 437 443 445
  • 11. Table A.2: Class A Results
CLASS A Groups
    A    B    C    D    E    F    G    H
    13   10   19   14   14   2    16   2
    9    13   15   12   11   12   16   8
    4    9    14   10   9    3    18   9
    14   15   14   7    9    3    17   8
    12   -    -    7    3    16   -    1
    -    -    -    8    14   17   -    6
    -    -    -    -    -    -    -    4
    -    -    -    -    -    -    -    6
    -    -    -    -    -    -    -    2
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   5      52   10.4       16.3
    Column 2   4      47   11.75      7.583333
    Column 3   4      62   15.5       5.666667
    Column 4   6      58   9.666667   8.266667
    Column 5   6      60   10         16.8
    Column 6   6      53   8.833333   48.56667
    Column 7   4      67   16.75      0.916667
    Column 8   9      46   5.111111   8.861111
ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       539.6763  7   77.09661  5.076268  0.000444  2.277143
    Within Groups        546.7556  36  15.18765
    Total                1086.432  43
  • 12. Table A.3: Class B Results
CLASS B Results Groups
    A    B    C    D    E    F    G    H
    13   2    15   7    3    4    4    8
    5    6    10   10   10   12   10   4
    6    9    14   12   10   10   11   5
    14   4    -    6    4    8    6    2
    -    10   -    -    4    -    1    5
    -    9    -    -    12   -    3    4
    -    -    -    -    -    -    1    2
    -    -    -    -    -    -    -    3
    -    -    -    -    -    -    -    3
    -    -    -    -    -    -    -    1
Anova: Single Factor
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   4      38   9.5        21.66667
    Column 2   6      40   6.666667   10.26667
    Column 3   3      39   13         7
    Column 4   4      35   8.75       7.583333
    Column 5   6      43   7.166667   15.36667
    Column 6   4      34   8.5        11.66667
    Column 7   7      36   5.142857   16.47619
    Column 8   10     37   3.7        4.011111
ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       287.308   7   41.044    3.695126  0.004144  2.277143
    Within Groups        399.8738  36  11.10761
    Total                687.1818  43
  • 13. Table A.4: Class A Results
Class A:
    1    2    3    4
    13   9    4    14
    12   10   13   9
    15   19   15   14
    14   14   12   10
    7    7    8    14
    11   9    9    3
    14   2    12   3
    3    16   17   16
    16   18   17   2
    8    9    8    1
    6    4    6    2
Anova: Single Factor
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   11     119  10.81818   17.76364
    Column 2   11     117  10.63636   30.45455
    Column 3   11     121  11         19
    Column 4   11     88   8          34.8
ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       66.25     3   22.08333  0.865859  0.466735  2.838745
    Within Groups        1020.182  40  25.50455
    Total                1086.432  43
  • 14. Table A.5: Class B Results
CLASS B
    1    2    3    4
    13   5    6    14
    2    6    9    4
    10   9    15   10
    14   7    10   12
    6    3    10   10
    4    4    12   4
    12   10   8    4
    10   11   6    1
    3    1    8    4
    5    2    5    4
    2    3    3    1
Anova: Single Factor
SUMMARY
    Groups     Count  Sum  Average    Variance
    Column 1   11     81   7.363636   20.65455
    Column 2   11     61   5.545455   11.27273
    Column 3   11     92   8.363636   11.45455
    Column 4   11     68   6.181818   20.16364
ANOVA
    Source of Variation  SS        df  MS        F         P-value   F crit
    Between Groups       51.72727  3   17.24242  1.08536   0.366324  2.838745
    Within Groups        635.4545  40  15.88636
    Total                687.1818  43