Lecture Notes in Computer Science 2316
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
Springer
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Josep Domingo-Ferrer (Ed.)
Inference Control
in Statistical Databases
From Theory to Practice
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editor
Josep Domingo-Ferrer
Universitat Rovira i Virgili
Department of Computer Engineering and Mathematics
Av. Països Catalans 26, 43007 Tarragona, Spain
E-mail: jdomingo@etse.urv.es
Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Inference control in statistical databases : from theory to practice /
Josep Domingo-Ferrer (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ;
Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002
(Lecture notes in computer science ; Vol. 2316)
ISBN 3-540-43614-6
CR Subject Classification (1998): G.3, H.2.8, K.4.1, I.2.4
ISSN 0302-9743
ISBN 3-540-43614-6 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2002
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Christian Grosche, Hamburg
Printed on acid-free paper SPIN 10846628 06/3142 5 4 3 2 1 0
Preface
Inference control in statistical databases (also known as statistical disclosure
control, statistical disclosure limitation, or statistical confidentiality) is about
finding tradeoffs to the tension between the increasing societal demand for accu-
rate statistical data and the legal and ethical obligation to protect the privacy of
individuals and enterprises which are the source of data for producing statistics.
To put it bluntly, statistical agencies cannot expect to collect accurate informa-
tion from individual or corporate respondents unless they feel the privacy of
their responses is guaranteed.
This state-of-the-art survey covers some of the most recent work in the field
of inference control in statistical databases. This topic is no longer (and proba-
bly never was) a purely statistical or operations-research issue, but is gradually
entering the arena of knowledge management and artificial intelligence. To the
extent that techniques used by intruders to make inferences compromising pri-
vacy increasingly draw on data mining and record linkage, inference control tends
to become an integral part of computer science.
Articles in this book are revised versions of a few papers selected among
those presented at the seminar “Statistical Disclosure Control: From Theory to
Practice” held in Luxemburg on 13 and 14 December 2001 under the sponsorship
of EUROSTAT and the European Union 5th FP project “AMRADS” (IST-2000-
26125).
The book starts with an overview article which goes through the remaining
17 articles. These cover inference control for aggregate statistical data released
in tabular form, inference control for microdata files, software developments, and
user case studies. The article authors and I hope that this collective work
will be a reference point for both academics and official statisticians who wish to
keep abreast of the latest advances in this very dynamic field.
The help of the following experts in discussing and reviewing the selected
papers is gratefully acknowledged:
– Lawrence H. Cox (U. S. National Center for Health Statistics)
– Gerd Ronning (Universität Tübingen)
– Philip M. Steel (U. S. Bureau of the Census)
– William E. Winkler (U. S. Bureau of the Census)
As an organizer of the seminar from which articles in this book have evolved, I
wish to emphasize that such a seminar would not have taken place without the
sponsorship of EUROSTAT and the AMRADS project as well as the help and en-
couragement by Deo Ramprakash (AMRADS coordinator), Photis Nanopoulos,
Harald Sonnberger, and John King (all from EUROSTAT). Finally, the inputs
by Anco Hundepool (Statistics Netherlands and co-ordinator of the EU 5th FP
project “CASC”) and Francesc Sebé (Universitat Rovira i Virgili) were crucial
to the success of the seminar and the book, respectively. I apologize for possible
omissions.
February 2002 Josep Domingo-Ferrer
Table of Contents
Advances in Inference Control in Statistical Databases: An Overview . . . . . 1
Josep Domingo-Ferrer
Tabular Data Protection
Cell Suppression: Experience and Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Dale A. Robertson, Richard Ethier
Bounds on Entries in 3-Dimensional Contingency Tables Subject to Given
Marginal Totals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Lawrence H. Cox
Extending Cell Suppression to Protect Tabular Data against Several
Attackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Juan José Salazar González
Network Flows Heuristics for Complementary Cell Suppression:
An Empirical Evaluation and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Jordi Castro
HiTaS: A Heuristic Approach to Cell Suppression in Hierarchical Tables . . 74
Peter-Paul de Wolf
Microdata Protection
Model Based Disclosure Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Silvia Polettini, Luisa Franconi, Julian Stander
Microdata Protection through Noise Addition . . . . . . . . . . . . . . . . . . . . . . . . . 97
Ruth Brand
Sensitive Micro Data Protection Using Latin Hypercube
Sampling Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Ramesh A. Dandekar, Michael Cohen, Nancy Kirkendall
Integrating File and Record Level Disclosure Risk Assessment . . . . . . . . . . . 126
Mark Elliot
Disclosure Risk Assessment in Perturbative Microdata Protection . . . . . . . . 135
William E. Yancey, William E. Winkler, Robert H. Creecy
LHS-Based Hybrid Microdata vs Rank Swapping and Microaggregation
for Numeric Microdata Protection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Ramesh A. Dandekar, Josep Domingo-Ferrer, Francesc Sebé
Post-Masking Optimization of the Tradeoff between Information Loss and
Disclosure Risk in Masked Microdata Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Francesc Sebé, Josep Domingo-Ferrer, Josep Maria Mateo-Sanz,
Vicenç Torra
Software and User Case Studies
The CASC Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Anco Hundepool
Tools and Strategies to Protect Multiple Tables with the GHQUAR Cell
Suppression Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Sarah Giessing, Dietz Repsilber
SDC in the 2000 U.S. Decennial Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Laura Zayatz
Applications of Statistical Disclosure Control at Statistics Netherlands . . . . 203
Eric Schulte Nordholt
Empirical Evidences on Protecting Population Uniqueness at Idescat . . . . . 213
Julià Urrutia, Enric Ripoll
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Advances in Inference Control in Statistical
Databases: An Overview
Josep Domingo-Ferrer
Universitat Rovira i Virgili
Dept. of Computer Engineering and Mathematics
Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain
jdomingo@etse.urv.es
Abstract. Inference control in statistical databases is a discipline with
several other names, such as statistical disclosure control, statistical dis-
closure limitation, or statistical database protection. Regardless of the
name used, current work in this very active field is rooted in the work
that was started on statistical database protection in the 70s and 80s.
Massive production of computerized statistics by government agencies
combined with an increasing social importance of individual privacy has
led to a renewed interest in this topic. This is an overview of the latest
research advances described in this book.
Keywords: Inference control in statistical database, Statistical disclo-
sure control, Statistical disclosure limitation, Statistical database pro-
tection, Data security, Respondents’ privacy, Official statistics.
1 Introduction
The protection of confidential data is a constant issue of concern for data collec-
tors and especially for national statistical agencies. There are legal and ethical
obligations to maintain confidentiality of respondents whose answers are used
for surveys or whose administrative data are used to produce statistics. But,
beyond law and ethics, there are also practical reasons for data collectors to
care about confidentiality: unless respondents are convinced that their privacy
is being adequately protected, they are unlikely to co-operate and supply their
data for statistics to be produced on them.
The rest of this book consists of seventeen articles clustered in three parts:
1. The protection of tabular data is covered by the first five articles;
2. The protection of microdata (i.e. individual respondent data) is addressed
by the next seven articles;
3. Software for inference control and user case studies are reported in the last
five articles.
The material in this book focuses on the latest research developments of
the mathematical and computational aspects of inference control and should
be regarded as an update of [6]. For a systematic approach to the topic, we
J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 1–7, 2002.
© Springer-Verlag Berlin Heidelberg 2002
strongly recommend [13]; for a quicker overview, [12] may also be used. All
references given so far in this paragraph concentrate only on the mathematical
and computational side of the topic. If a broader scope is required, [7] is a work
where legal, organizational, and practical issues are covered in addition to the
purely computational ones.
This overview goes through the book articles and then gives an account of
related literature and other sources of information.
2 Tabular Data Protection
The first article “Cell suppression: experience and theory”, by Robertson and
Ethier, emphasizes that some basic points of cell suppression for table protec-
tion are not sufficiently known. While the underlying theory is well developed,
sensitivity rules in use are in some cases flawed and may lead to the release of
sensitive information. Another issue raised by the paper is the lack of a sound
information loss measure to assess the damage inflicted on a table in terms of
data utility by the use of a particular suppression pattern. The adoption of
information-theoretic measures is hinted as a possible improvement.
The article “Bounds on entries in 3-dimensional contingency tables subject
to given marginal totals” by Cox deals with algorithms for determining integer
bounds on suppressed entries of multi-dimensional contingency tables subject to
fixed marginal totals. Some heuristic algorithms are compared, and it is demon-
strated that they are not exact. Consequences for statistical database query
systems are discussed.
“Extending cell suppression to protect tabular data against several attack-
ers”, by Salazar, points out that attackers to confidentiality need not be just
external intruders; internal attackers, i.e. special respondents contributing to
different cell values of the table, must also be taken into account. This article
describes three mathematical models for the problem of finding a cell suppres-
sion pattern minimizing information loss while ensuring protection for different
sensitive cells and different intruders.
When a set of sensitive cells are suppressed from a table (primary sup-
pressions), a set of non-sensitive cells must be suppressed as well (complemen-
tary suppressions) to prevent primary suppressions from being computable from
marginal constraints. Network flows heuristics have been proposed in the past
for finding the minimal complementary cell suppression pattern in tabular data
protection. However, the heuristics known so far are only appropriate for two-
dimensional tables. In “Network flows heuristics for complementary cell suppres-
sion: an empirical evaluation and extensions”, by Castro, it is shown that network
flows heuristics (namely multicommodity network flows and network flows with
side constraints) can also be used to model three-dimensional, hierarchical, and
linked tables.
Also related to hierarchical tables is the last article on tabular data, authored
by De Wolf and entitled “HiTaS: a heuristic approach to cell suppression in hier-
archical tables”. A heuristic top-down approach is presented to find suppression
patterns in hierarchical tables. When a table of high level is protected using cell
suppression, its interior is regarded as the marginals of possibly several lower
level tables, each of which is protected while keeping their marginals fixed.
3 Microdata Protection
The first three articles in this part describe methods for microdata protection:
– Article “Model based disclosure protection”, by Polettini, Franconi, and
Stander, argues that any microdata protection method is based on a formal
reference model. Depending on the number of restrictions imposed, meth-
ods are classified as nonparametric, semiparametric or fully parametric. An
imputation procedure for business microdata based on a regression model is
applied to the Italian sample from the Community Innovation Survey. The
utility of the released data and the protection achieved are also evaluated.
– Adding noise is a widely used principle for microdata protection. In fact, re-
sults in the article by Yancey et al. (discussed below) show that noise addi-
tion methods can perform very well. Article “Microdata protection through
noise addition”, by Brand, contains an overview of noise addition algorithms.
These range from simple white noise addition to complex methods which try
to improve the tradeoff between data utility and data protection. Theoretical
properties of the presented algorithms are discussed in Brand’s article and
an illustrative numerical example is given. A minimal white-noise sketch also
follows this list.
– Synthetic microdata generation is an attractive alternative to protection
methods based on perturbing original microdata. The conceptual advantage
is that, even if a record in the released data set can be linked to a record in
the original data set, such a linkage is not actually a re-identification because
the released record is a synthetic one and was not derived from any specific
respondent. In “Sensitive microdata protection using Latin hypercube sam-
pling technique”, Dandekar, Cohen, and Kirkendall propose a method for
synthetic microdata generation based on Latin hypercube sampling.
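As a concrete illustration of the simplest member of this family, the following sketch adds uncorrelated (white) Gaussian noise scaled to each variable's standard deviation. The 10% noise level and the toy data are illustrative assumptions, not values taken from Brand's article.

import numpy as np

# A minimal sketch of plain white-noise masking for numeric microdata held in
# a NumPy array (rows = respondents, columns = variables).  The relative noise
# level, 10% of each variable's standard deviation, is an arbitrary choice.
def add_white_noise(data, relative_level=0.1, seed=0):
    rng = np.random.default_rng(seed)
    sigma = data.std(axis=0)                      # per-variable spread
    noise = rng.normal(0.0, relative_level * sigma, size=data.shape)
    return data + noise

original = np.array([[120.0, 3.2], [95.0, 4.1], [210.0, 2.7]])
print(add_white_noise(original))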
The last four articles in this part concentrate on assessing disclosure risk and
information loss achieved by microdata protection methods:
– Article “Integrating file and record level disclosure risk assessment”, by El-
liot, deals with disclosure risk in non-perturbative microdata protection. Two
methods for assessing disclosure risk at the record-level are described, one
based on the special uniques method and the other on data intrusion simu-
lation. Proposals to integrate both methods with file level risk measures are
also presented.
– Article “Disclosure risk assessment in perturbative microdata protection”,
by Yancey, Winkler, and Creecy, presents empirical re-identification results
that compare methods for microdata protection including rank swapping
and additive noise. Enhanced re-identification methods based on probabilis-
tic record linkage are used to empirically assess disclosure risk. Then the
performance of methods is measured in terms of information loss and disclo-
sure risk. The reported results extend earlier work by Domingo-Ferrer et al.
presented in [7].
– In “LHS-based hybrid microdata vs rank swapping and microaggregation for
numeric microdata protection”, Dandekar, Domingo-Ferrer, and Sebé report
on another comparison of methods for microdata protection. Specifically,
hybrid microdata generation as a mixture of original data and synthetic
microdata is compared with rank swapping and microaggregation, which had
been identified as the best performers in earlier work. As in the previous
article, the comparison considers information loss and disclosure risk, and
the latter is empirically assessed using record linkage. A minimal sketch of
microaggregation follows this list.
– Based on the metrics previously proposed to compare microdata protection
methods (also called masking methods) in terms of information loss and
disclosure risk, article “Post-masking optimization of the tradeoff between
information loss and disclosure risk in masked microdata sets”, by Sebé,
Domingo-Ferrer, Mateo-Sanz, and Torra, demonstrates how to improve the
performance of any microdata masking method. Post-masking optimization
of the metrics can be used to have the released data set preserve as much as
possible the moments of first and second order (and thus multivariate statis-
tics) of the original data without increasing disclosure risk. The technique
presented can also be used for synthetic microdata generation and can be
extended to preserve all moments up to m-th order, for any m.
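Since microaggregation recurs in the comparisons above, a minimal univariate sketch follows. The fixed group size and mean replacement are the textbook form of the method; the multivariate variants used in the studies cited here are more elaborate.

# A minimal sketch of univariate microaggregation with group size k: sort the
# values, partition them into consecutive groups of k (a short tail is folded
# into the previous group), and replace each value by its group mean.
def microaggregate(values, k=3):
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        if len(group) < k and start > 0:     # fold the short tail into the last group
            group = order[start - k:]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked

print(microaggregate([12, 7, 45, 30, 9, 41, 28], k=3))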
4 Software and User Case Studies
The first two articles in this part are related to software developments for the
protection of statistical data:
– “The CASC project”, by Hundepool, is an overview of the European project
CASC (Computational Aspects of Statistical Confidentiality, [2]), funded by
the EU 5th Framework Program. CASC can be regarded as a follow-up of
the SDC project carried out under the EU 4th Framework Program. The
central aim of the CASC project is to produce a new version of the Argus
software for statistical disclosure control. In order to reach this practical
goal, the project also includes methodological research both in tabular data
and microdata protection; the research results obtained will constitute the
core of the Argus improvement. Software testing by users is an important
part of CASC as well.
– The first sections of the article “Tools and strategies to protect multiple ta-
bles with the GHQUAR cell suppression engine”, by Gießing and Repsilber,
are an introduction to the GHQUAR software for tabular data protection.
The last sections of this article describe GHMITER, which is a software pro-
cedure allowing use of GHQUAR to protect sets of multiple linked tables.
This software constitutes a very fast solution to protect complex sets of big
tables and will be integrated in the new version of Argus developed under
the CASC project.
This last part of the book concludes with three articles presenting user case
studies in statistical inference control:
– “SDC in the 2000 U. S. Decennial Census”, by Zayatz, describes statistical
disclosure control techniques to be used for all products resulting from the
2000 U. S. Decennial Census. The discussion covers techniques for tabular
data, public microdata files, and on-line query systems for tables. For tabular
data, algorithms used are improvements of those used for the 1990 Decennial
Census. Algorithms for public-use microdata are new in many cases and will
result in less detail than was published in previous censuses. On-line table
query is a new service, so the disclosure control algorithms used there are
completely new ones.
– “Applications of statistical disclosure control at Statistics Netherlands”, by
Schulte Nordholt, reports on how Statistics Netherlands meets the require-
ments of statistical data protection and user service. Most users are satisfied
with data protected using the Argus software: τ-Argus is used to produce
safe tabular data, while µ-Argus yields publishable safe microdata. How-
ever, some researchers need more information than is released in the safe
data sets output by Argus and are willing to sign the proper non-disclosure
agreements. For such researchers, on-site access to unprotected data is of-
fered by Statistics Netherlands in two secure centers.
– The last article “Empirical evidences on protecting population uniqueness
at Idescat”, by Urrutia and Ripoll, presents the process of disclosure control
applied by Statistics Catalonia to microdata samples from census and sur-
veys with some population uniques. This process has been in use since 1995,
and has been implemented with µ-Argus since it first became available.
5 Related Literature and Information Sources
In addition to the above referenced books [6,7,12,13], a number of other sources
of information on current research in statistical inference control are available.
In fact, since statistical database protection is a rapidly evolving field, the use of
books should be directed to acquiring general insight on concepts and ideas, but
conference proceedings, research surveys, and journal articles remain essential
to gain up-to-date detailed knowledge on particular techniques and open issues.
This section contains a non-exhaustive list of research references, sorted from
a historical point of view:
1970s and 1980s. The first broadly known papers and books on statistical
database protection appear (e.g. [1,3,4,5,11]).
1990s. Eurostat produces a compendium for practitioners [10] and sponsors
a number of conferences on the topic, namely the three International Sem-
inars on Statistical Confidentiality (Dublin 1992 [9], Luxemburg 1994 [14],
and Bled 1996 [8]) and the Statistical Data Protection’98 conference (Lisbon
1998,[6]). While the first three events covered mathematical, legal, and orga-
nizational aspects, the Lisbon conference focused on the statistical, mathe-
matical, and computational aspects of statistical disclosure control and data
protection. The goals of those conferences were to promote research and
interaction between scientists and practitioners in order to consolidate sta-
tistical disclosure control as a high-quality research discipline encompassing
statistics, operations research, and computer science. In the second half of the
90s, the research project SDC was carried out under the EU 4th Framework
Program; its most visible result was the first version of the Argus software. In
the late 90s, other European organizations start joining the European Com-
mission in fostering research in this field. A first example is Statistisches
Bundesamt which organized in 1997 a conference for the German-speaking
community. A second example is the United Nations Economic Commission
for Europe, which has jointly organized with Eurostat two Work Sessions
on Statistical Data Confidentiality (Thessaloniki 1999 [15] and Skopje 2001).
Outside Europe, the U.S. Bureau of the Census and Statistics Canada have
devoted considerable attention to statistical disclosure control in their confer-
ences and symposia. In fact, well-known general conferences such as COMP-
STAT, U.S. Bureau of the Census Annual Research Conferences, Eurostat’s
ETK-NTTS conference series, IEEE Symposium on Security and Privacy,
etc. have hosted sessions and papers on statistical disclosure control.
2000s. In addition to the biennial Work Sessions on Statistical Data Confiden-
tiality organized by UNECE and Eurostat, other research activities are being
promoted by the U.S. Census Bureau, which sponsored the book [7], and by the
European projects CASC [2] and AMRADS (a co-sponsor of the seminar from
which this book originated).
As far as journals are concerned, there is not yet a monographic journal on
statistical database protection. However, at least the following journals occa-
sionally contain papers on this topic: Research in Official Statistics, Statistica
Neerlandica, Journal of Official Statistics, Journal of the American Statistical
Association, ACM Transactions on Database Systems, IEEE Transactions on
Software Engineering, IEEE Transactions on Knowledge and Data Engineering,
Computers & Mathematics with Applications, Statistical Journal of the UNECE,
Qüestiió and Netherlands Official Statistics.
Acknowledgments
Special thanks go to the authors of this book and to the discussants of the
seminar “Statistical Disclosure Control: From Theory to Practice” (L. Cox, G.
Ronning, P. M. Steel, and W. Winkler). Their ideas were invaluable to write this
overview, but I bear full responsibility for any inaccuracy, omission, or mistake
that may remain.
References
1. N. R. Adam and J. C. Wortmann, “Security-control methods for statistical
databases: A comparative study”, ACM Computing Surveys, vol. 21, no. 4, pp.
515-556, 1989.
2. The CASC Project, http://neon.vb.cbs.nl/rsm/casc/menu.htm
3. T. Dalenius, “The invasion of privacy problem and statistics production. An
overview”, Statistik Tidskrift, vol. 12, pp. 213-225, 1974.
4. D. E. Denning and J. Schlörer, “A fast procedure for finding a tracker in a statistical
database”, ACM Transactions on Database Systems, vol. 5, no. 1, pp. 88-102, 1980.
5. D. E. Denning, Cryptography and Data Security. Reading MA: Addison-Wesley,
1982.
6. J. Domingo-Ferrer (ed.), Statistical Data Protection. Luxemburg: Office for Official
Publications of the European Communities, 1999.
7. P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), Confidentiality, Disclosure
and Data Access. Amsterdam: North-Holland, 2001.
8. S. Dujić and I. Tršinar (eds.), Proceedings of the 3rd International Seminar on
Statistical Confidentiality (Bled, 1996). Ljubljana: Statistics Slovenia-Eurostat,
1996.
9. D. Lievesley (ed.), Proceedings of the International Seminar on Statistical Confi-
dentiality (Dublin, 1992). Luxemburg: Eurostat, 1993.
10. D. Schackis, Manual on Disclosure Control Methods. Luxemburg: Eurostat, 1993.
11. J. Schlörer, “Identification and retrieval of personal records from a statistical data
bank”, Methods Inform. Med., vol. 14, no.1, pp. 7-13, 1975.
12. L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New
York: Springer-Verlag, 1996.
13. L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New
York: Springer-Verlag, 2001.
14. Proceedings of the 2nd International Seminar on Statistical Confidentiality (Lux-
emburg, 1994). Luxemburg: Eurostat, 1995.
15. Statistical Data Confidentiality: Proc. of the Joint Eurostat/UNECE Work Ses-
sion on Statistical Data Confidentiality (Thessaloniki, 1999). Luxemburg: Euro-
stat, 1999.
* The opinions expressed in this paper are those of the authors, and not necessarily those of
Statistics Canada.
J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 8-20, 2002.
© Springer-Verlag Berlin Heidelberg 2002
Cell Suppression: Experience and Theory
Dale A. Robertson and Richard Ethier *
Statistics Canada
robedal@statcan.ca
ethiric@statcan.ca
Abstract. Cell suppression for disclosure avoidance has a well-developed
theory, unfortunately not sufficiently well known. This leads to confusion and
faulty practices. Poor (sometimes seriously flawed) sensitivity rules can be
used while inadequate protection mechanisms may release sensitive data. The
negative effects on the published information are often exaggerated. An
analysis of sensitivity rules will be done and some recommendations made.
Some implications of the basic protection mechanism will be explained. A
discussion of the information lost from a table with suppressions will be
given, with consequences for the evaluation of patterns and of suppression
heuristics. For most practitioners, the application of rules to detect sensitive
economic data is well understood (although the rules may not be). However,
the protection of that data may be an art rather than an application of sound
concepts. More misconceptions and pitfalls arise.
Keywords: Disclosure avoidance, cell sensitivity.
Cell suppression is a technique for disclosure control. It is used for additive tables,
typically business data, where it is the technique of choice. There is a good theory
of the technique, originally developed by Gordon Sande [1,2] with important
contributions by Larry Cox [3]. In practice, the use of cell suppression is troubled
by misconceptions at the most fundamental levels. The basic concept of sensitivity
is confused, the mechanism of protection is often misunderstood, and an erroneous
conception of information loss seems almost universal. These confusions prevent the
best results from being obtained. The sheer size of the problems makes automation
indispensable. Proper suppression is a subtle task and the practitioner needs a sound
framework of knowledge. Problems in using the available software are often related
to a lack of understanding of the foundations of the technique. Often the task is
delegated to lower level staff, not properly trained, who have difficulty describing
problems with the rigour needed for computer processing. This ignorance at the
foundation level leads to difficulty understanding the software. As the desire for
more comprehensive, detailed, and sophisticated outputs increases, the matter of
table and problem specification needs further attention.
Our experience has shown that the biggest challenge has been to teach the basic
ideas. The theory is not difficult to grasp, using only elementary mathematics, but
clarity of thought is required. The attempt of the non-mathematical to describe things
in simple terms has led to confusion, and the failure to appreciate the power and
value of the technique.
The idea of cell sensitivity is surprisingly poorly understood. Obsolete sensitivity
rules, with consistency problems, not well adapted to automatic processing, survive.
People erroneously think of sensitivity as a binary variable: a cell is publishable or
confidential. The theory shows that sensitivity can be precisely measured, and the
value is important in the protection process. The value helps capture some common
sense notions that early practitioners had intuitively understood.
The goal of disclosure avoidance is to protect the respondents, and ensure that a
response cannot be estimated accurately. Thus, a sensitive cell is one for which
knowledge of the value would permit an unduly accurate estimate of the contribution
of an individual respondent. Dominance is the situation where one or a small number
of responses contribute nearly all the total of the cell. Identifying the two leads to the
following pronouncement:
A cell should be considered sensitive if one respondent contributes 60% or more
of the total value.
This is an example of an N-K rule, with N being the number of respondents to
count, and K the threshold percentage. (Here N = 1 and K = 60. These values of N
and K are realistic and are used in practice.) Clearly a N-K rule measures
dominance. Using it for sensitivity creates the following situation.
Consider two cells each with 3 respondents, and of total 100.
Cell 1 has a response sequence of {59,40,1} and thus may be published according
to the rule. Cell 2 has a response sequence of {61,20,19}} and would be declared
sensitive by the rule.
Suppose the cell value of 100 is known to the second largest respondent (X2) and
he uses this information to estimate the largest (X1). He can remove his contribution
to get an upper bound, obtaining (with non-negative data)
For (59, 40, 1): 100 - 40 = 60, therefore X1 ≤ 60, while
for (61, 20, 19): 100 - 20 = 80, therefore X1 ≤ 80.
Since the actual values of X1 are 59 and 61 respectively, something has gone
badly wrong. Cell 1 is much more dangerous to publish than cell 2. The rule gets
things the wrong way around! Is this an exceptional case that took ingenuity to find?
No, this problem is intrinsic to the rule, examples of misclassification are numerous,
and the rule cannot be fixed. To see this we need to make a better analysis.
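The arithmetic above is easy to script. The following sketch (an illustration in Python, not code from any disclosure-control system) reproduces the bound that the second largest respondent can place on the largest one, and shows that the 1-60% rule gets the two cells the wrong way around.

# What respondent X2 can deduce from a published cell total, assuming the data
# are non-negative: X1 <= total - X2.  The two cells are the ones above; the
# 1-60% rule publishes the first and suppresses the second, yet the first one
# yields by far the tighter bound.
def x2_upper_bound_on_x1(responses):
    total = sum(responses)
    x = sorted(responses, reverse=True)
    x1, x2 = x[0], x[1]
    bound = total - x2
    return x1, bound, (bound - x1) / x1       # true X1, bound, relative slack

print(x2_upper_bound_on_x1([59, 40, 1]))      # (59, 60, ~0.017): within 2%
print(x2_upper_bound_on_x1([61, 20, 19]))     # (61, 80, ~0.311): over 30% slack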
One can understand much about sensitivity by looking at 3 respondent cells of
value 100. (Keeping the total at 100 just means that values are the same as the
percentages.) Call the three response values a, b, c.
One can represent these possible cells
pictorially. Recall (Fig. 1) that in an equilateral
triangle, for any interior point, the sum of the
perpendicular distances to the 3 sides is a constant
(which is in fact h, the height of the triangle).
One can nicely incorporate the condition a + b
+ c = 100 by plotting the cell as a point inside a
triangle, of height 100, measuring a, b, and c from
a corresponding edge (Fig 2). This gives a
symmetric treatment and avoids geometric
confusions that occur in other plots.
In Figure 2 we have drawn the medians,
intersecting at the centroid. The triangle is
divided into areas which correspond to the
possible size orders of a, b, and c. The upper kite
shaped region is the area where a is the biggest of
the three responses, with its right sub triangle the
area where a ≥ b ≥ c.
In the triangle diagram, where are the sensitive
cells? Cells very near an edge of the triangle
should be sensitive. Near an edge one of the
responses is negligible, effectively the cell has 2
respondents, hence is obviously sensitive (each
respondent knows the other). As a bare minimum requirement then any rule that
purports to define sensitivity must classify points near the edge as sensitive.
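For readers who wish to reproduce such pictures, the following sketch maps a cell (a, b, c) to Cartesian coordinates inside the triangle. The particular vertex placement is an assumption of the sketch; only the perpendicular-distance property described above matters.

import math

# Map a 3-respondent cell (a, b, c) with a + b + c = 100 to a point inside an
# equilateral triangle of height 100, so that the perpendicular distances from
# the point to the three edges are exactly a, b and c.
def triangle_point(a, b, c):
    h = a + b + c                    # triangle height (100 in the text)
    side = 2 * h / math.sqrt(3)
    A = (side / 2, h)                # vertex opposite the a = 0 edge (the bottom)
    B = (0.0, 0.0)                   # vertex opposite the b = 0 edge
    C = (side, 0.0)                  # vertex opposite the c = 0 edge
    x = (a * A[0] + b * B[0] + c * C[0]) / h      # barycentric combination
    y = (a * A[1] + b * B[1] + c * C[1]) / h      # y equals a, the distance to the bottom edge
    return x, y

print(triangle_point(59, 40, 1))
print(triangle_point(61, 20, 19))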
What does the 1-60% rule do? The region
where a is large is the small sub triangle at the
top, away from the a edge, with its base on the
line at 60% of the height of the triangle.
Likewise for b and c leading to (Fig. 3)
Now one problem is obvious. Significant
portions of the edges are allowed. This rule
fails the minimum requirement. Slightly
subtler is the over-suppression. That will
become clearer later. Points inside the
sensitive area near the interior edge and the
median are unnecessarily classified as
sensitive. The rule is a disaster.
Trying to strengthen the rule leads to more over-suppression, relaxing it leads to
more release of sensitive data. Any organisation which defines sensitivity using only
a 1-K rule (and they exist) has a serious problem. The identification of sensitivity
and dominance has led to a gross error, allowing
very precise knowledge of some respondents to
be deduced.
Now let's look at a good rule. Here is how a
good rule divides the triangle into sensitive and
non-sensitive areas. (Fig. 4)
The sensitivity border excludes the edges
without maintaining a fixed distance from them.
There is lots of symmetry and the behaviour in
one of the 6 sub triangles is reproduced in all the
others. This shape is not hard to understand.
This rule is expressed as a formula in terms of the responses in size order, X1, X2,
and X3. As one crosses a median, the size order, and hence the expression in terms
of a, b, and c changes, hence the slope discontinuities. The other thing to understand
is the slope of the one non-trivial part of a sub triangle boundary, the fact that the line
moves farther from the edge as one moves away from a median. The reasoning
needed is a simple generalisation of the previous argument about the two cells. Start
at the boundary point on the a median nearest the a axis. There a is the smallest
response, while b and c are of the same size and much larger. For easy numbers
suppose b=c= 45 and a=10. The protection afforded to b or c against estimation by
the other comes from the presence of a. On the separation line between sensitive and
non-sensitive areas, the value of the smallest response (a) is felt to be just big enough
to give protection to the larger responses (b and c). The values above indicate that
the large responses are allowed to be up to 4.5 times larger than the value of the
response providing protection, but no larger. A higher ratio is considered sensitive, a
lower one publishable. As one moves away from the median, one of b or c becomes
larger than 45, the other smaller. Consequently a must become larger than 10 to
maintain the critical ratio of 4.5:1. Hence a must increase (move upwards away from
the bottom) i.e. the line has an upward slope.
This simple but subtle rule is one known as the C times rule. C is the ratio, an
adjustable parameter (4.5 above) of the rule. (There are slight variations on this
formulation, termed the p/q rule or the p% rule. They are usually specified by an
inverse parameter p or p/q = 1/C. The differences are only in interpretation of the
parameter, not in the form of the rule. We find the inverse transformation harder to
grasp intuitively, and prefer this way).
The formula is easily written down then. The protection available when X2
estimates X1 is given by X3. The value of X3 has to be big enough (relative to X1)
to give the needed uncertainty. One compares X1 with X3 using the scaling factor C
to adjust their relative values. Explicitly one evaluates
S = X1/C - X3 (1)
Written in this form the rule appears not to depend on X2, but using X3 = T - X1 - X2
it can trivially be rewritten as
S = X1*(C+1)/C + X2 - T (2)
For a fixed total T then, the rule does depend on both X1 and X2 and they enter
with different coefficients, an important point. Sensitivity depends on the cell
structure (the relative sizes of the responses). This difference in coefficient values
captures this structure. (Note in passing that the rule grasps the concept of sensitivity
well enough that 1 and 2 respondent cells are automatically sensitive, one does not
have to add those requirements as side conditions.)
The rule is trivially generalised to more than 3 respondents in the cell, X3 is
simply changed to
X3 + X4 + X5 + … (3)
i.e. the sum of the smallest responses. One can treat all cells as effectively 3
respondent cells. (The rule can also be generalised to include coalitions where
respondents in the cell agree to share their knowledge in order to estimate another
respondent's contribution. The most dangerous case is X2 and X3 sharing their
contributions to estimate X1. The generalisation for a coalition of 2 respondents is
S = X1/C - (X4+X5+…) (4)
(and similar changes for larger coalitions.) This is a deceptively simple rule of wide
application, with a parameter whose meaning is clear. The C value can be taken as a
precise measure of the strength of the rule.
Note that S is a function defined over the area of the triangle. Sensitivity is not
just a binary classification into sensitive and non-sensitive. (S looks like an inverted
pyramid with the peak at the centroid, where a = b = c). The line separating the two
areas is the 0 valued contour of S. The value of S is meaningful. It indicates the
degree of uncertainty required in estimations of the cell value, and the minimum size
of an additional response big enough to render the cell non-sensitive. For example,
take a 2 respondent cell of some value. Its sensitivity is S = X1/C. To get a non-
sensitive cell S must be reduced to 0. This can be done by adding a response X3 of
value X1/C. The cell sensitivity tells us the minimum amount that must be added to
the cell to make it non-sensitive.
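A minimal sketch of the rule as a function follows; the function name and the default C = 4.5 are illustrative choices, while the formula is that of (1) and, for coalitions, (4).

# C-times (p%) sensitivity: S > 0 means the cell is sensitive, S <= 0 means it
# may be published.  'coalition' is the number of respondents assumed to pool
# their knowledge: 1 gives formula (1), 2 gives formula (4).
def sensitivity(responses, C=4.5, coalition=1):
    x = sorted(responses, reverse=True)
    x1 = x[0] if x else 0.0
    protection = sum(x[1 + coalition:])   # the responses below the attacking coalition
    return x1 / C - protection

print(sensitivity([59, 40, 1]))     # 12.1 > 0: sensitive, as argued earlier
print(sensitivity([61, 20, 19]))    # -5.4 < 0: publishable
print(sensitivity([70, 30]))        # 15.6: a 2-respondent cell is always sensitive,
                                    # and 15.6 is the minimum value to add to fix it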
The only other proposed rule that occurs with any frequency is a 2-K rule; a cell is
sensitive if 2 respondents contribute more than K % of the cell value. (Here K in the
range 80 - 95 is typical.) Only the sum of the top two responses (X1+X2) enters.
Their relative size (for a fixed total) is not used in any way. However, as just seen,
the relative sizes are important. We can conclude that this cannot be a very good rule
without further analysis. A 2-K rule cannot distinguish between a cell with responses
(50,45,5) and one with responses (90,5,5). Most would agree that the second one is
more sensitive than the first. The picture of the rule is easy to draw. If the sum of
two responses is to be less than a certain value, then the third must be larger than a
corresponding value. That gives a line parallel to an edge i.e. Figure 5.
As we have just observed, in a good rule the
sensitivity line should not be parallel to the
edge. This means that the 2-K rule is not
consistent in its level of protection. In certain
areas less protection is given than in others.
This would seem undesirable. The rule has no
clear objective. It might be hoped that the non-
uniformity is negligible, and that it doesn't
matter all that much. Alas no. We can
quantify the non-uniformity. It turns out to be
unpleasantly large. The effect of the rule is
variable, and it is difficult to choose a value for
K and to explain its meaning.
To measure the non uniformity one can find the two C times rules that surround
the 2-K rule, i.e. the strongest C times rule that is everywhere weaker than the 2-K
rule, and the weakest C times rule that is everywhere stronger than the 2-K rule.
These C values are easily found to be
C (inner) = k/(200-k) (stronger) (5)
C (outer) = (100-k)/k (weaker) (6)
For the typical range of K values, this gives the following graph (Figure 6).
One can see that the difference between the C values is rather large. One can get a
better picture by using a logarithmic scale (Figure 7).
On this graph, the difference between the lines is approximately constant, which
means the two values differ by a constant factor. The scale shows this constant to be
near 2 (more precisely it is in the range 1.7 to 1.9) i.e. the non-uniformity is serious.
The problem of deciding on an appropriate value of K is ill defined.
The summary is then:
Don't use N-K rules, they are at best inferior with consistency problems, at worst lead
to disasters. Their strength is unclear, and the K value hard to interpret. They arise
from a fundamental confusion between sensitivity and dominance.
Now it's time to talk a little about the protection mechanism. First, observe that in
practice, the respondents are businesses. Rough estimates of a business size are easy
to make. In addition, the quantities tabulated are often intrinsically non-negative.
These two things will be assumed; suppressed cells will only be allowed to take
values within 50% of the true ones from now on.
Here is a trivial table with 4 suppressions X1, X2, X3, X4 (Figure 8).
One or more of these cells may be
presumed sensitive, and the suppressions
protect the sensitive values from trivial
disclosure. Suppose X1 is sensitive. Note
that X1+X2 and X1+X3 are effectively
published. The suppressed data are not
lost, they are simply aggregated.
Aggregation is the protection mechanism at
work here, just as in statistics in general.
Data is neither lost nor distorted.
Suppression involves the creation of miscellaneous aggregations of the sensitive cells
with other cells. Obviously then if the pattern is to be satisfactory, then these
aggregations must be non-sensitive. Here the unions X1+X2 and X1+X3 must be
non-sensitive if this pattern is to be satisfactory. From our previous discussion if
follows that both X2 and X3 must be at least as large as the sensitivity of X1. As
complements, the best they can do is to add value to the protective responses. There
should be no responses from the large contributors in the complementary cells.
Proper behaviour of S when cells are combined will be of great importance.
Certainly one would like to ensure as a bare minimum
Two non-sensitive cells should never combine to form a sensitive cell.
The notion that aggregation provides protection is captured by the property of sub-
additivity [1,2]. This is an inequality relating the sensitivity of a combination of
two cells to the sensitivity of the components. The direction of the inequality is such
that aggregation is a good thing. One should also have smoothness of behaviour, the
effect on sensitivity should be limited by the size of the cell being added in. For a
good rule then one has the inequalities
S(X) - T(Y) ≤ S(X+Y) ≤ S(X) + S(Y) (7)
(Using a convenient normalisation for S)
Given that S ≥ 0 indicates sensitivity, and that the aim is to ensure that S(X+Y) is
not sensitive given that S(X) is, the rightmost inequality indicates that aggregation
tends to be helpful (and is definitely helpful if the complement Y is not sensitive,
S(Y) ≤ 0). The left inequality limits the decrease in the S value. A successful
complement must be big enough, T(Y) ≥ S(X), to allow S(X+Y) to be negative.
These inequalities are natural conditions that most simple rules obey, but people
have, at some effort, found rules that violate them.
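A quick numerical check of these inequalities, using the unnormalised S of (1) and two invented cells, is sketched below; it reuses the sensitivity function sketched earlier and illustrates the property without, of course, proving it.

# Check S(X) - T(Y) <= S(X+Y) <= S(X) + S(Y) for the C-times rule on invented
# response lists; sensitivity() is the function sketched in the earlier example.
def check_subadditivity(X, Y, C=4.5):
    s_x, s_y = sensitivity(X, C), sensitivity(Y, C)
    s_union = sensitivity(X + Y, C)
    return s_x - sum(Y) <= s_union <= s_x + s_y

print(check_subadditivity([61, 20, 19], [59, 40, 1]))   # True
print(check_subadditivity([70, 30], [10, 5, 5]))        # True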
These inequalities are only that. One cannot exactly know the sensitivity of a
union by looking at the summary properties of the cells. One needs the actual
responses. The typical complicating factor here is a respondent who has data in more
than one cell of a union. (This may be poor understanding of the corporate structure
or data classified at an unsuitable level.) Generation of a good pattern is thus a subtle
process, involving the magnitudes and sensitivities of the cells, and the pattern of
individual responses in the cells. One needs information about the individual
responses when creating a pattern. The sensitive union problem certainly is one that
is impractical to perform without automation. Sophisticated software is
indispensable. Tables that appear identical may need different suppression patterns,
because of the different underlying responses. We have found, using a small
pedagogical puzzle, that few people understand this point.
In the software in use at Statistics Canada (CONFID) the sensitive union problem
is dealt with by enumerating all sensitive combinations which are contained in non-
sensitive totals [4]. (They are in danger of being trivially disclosed.) All such
combinations are protected by the system. This may sound like combinatorial
explosion and a lot of work. It turns out that there are not that many to evaluate, the
extra work is negligible, and the effect on the pattern small. The algorithm makes
heavy use of the fact that non-sensitive cells cannot combine to form a sensitive
union.
Now let us talk a bit about the suppression process, and information. It is
generally said that one should minimise the information lost, but without much
discussion of what that means. It is often suggested that the information lost in a
table with suppressions can be measured by
i) the number of suppressed cells,
ii) the total suppressed value.
The fact that one proposes to measure something in two different and
incompatible ways surely shows that we are in trouble. Here are two tables with
suppressions (Figure 9, Figure 10).
These two tables have the same total number of suppressed cells, 4, and the same
total suppressed value, 103. By either of the above incompatible definitions then,
they should have the same amount of missing information. However, they do not.
Substantially more information has been removed from one of these two tables than
from the other (three times as much in fact). This misunderstanding about
information is related to the misunderstanding that the value of a suppressed cell
remains a total mystery, with no guesses about the value possible, and that somehow
the data is lost. In fact, it is just aggregated. Furthermore, any table with
suppressions is equivalent to a table of ranges for the hidden cells. This latter fact is
not well known, or if it is, the implications are not understood. These ranges can be
found by a straightforward and inexpensive LP calculation, maximising and
minimising the value of each hidden cell subject to the constraint equations implied
by the additivity of the table. The above tables are equivalent to Figure 11 and Figure 12
(provided the data are known to be non-negative). The different ranges provide a
clue about the information loss. The second table has wider ranges. Large
complementary cells used to protect cells of small sensitivity often turn out to have
very narrow ranges and contain valuable information. They should not be thought of
as lost.
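The range calculation is easy to sketch. The example below uses an off-the-shelf LP solver on a hypothetical block of four hidden cells; the marginal residuals (the published totals minus the published cells in each equation) are invented for illustration.

import numpy as np
from scipy.optimize import linprog

# For each hidden cell x_k, minimise and maximise x_k subject to the additivity
# equations A_eq @ x = b_eq and x >= 0; the interval [min, max] is the range
# that could be published for that cell.
def suppression_ranges(A_eq, b_eq):
    n = A_eq.shape[1]
    ranges = []
    for k in range(n):
        c = np.zeros(n)
        c[k] = 1.0
        lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
        hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
        ranges.append((lo.fun, -hi.fun))
    return ranges

# Hypothetical 2x2 pattern of hidden cells x1..x4 with residual row totals
# 30 and 40 and residual column totals 25 and 45.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
b = np.array([30, 40, 25, 45], dtype=float)
print(suppression_ranges(A, b))     # x1, for instance, ranges over [0, 25]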
Clearly one needs a better concept of information. Looking around, in the
introductory chapters of an introductory book we found (in the context of signal
transmission in the presence of errors) a thought that we paraphrase as
The additional information required to correct or repair a noisy signal (thus
recovering the original one) is equal to the amount of information which has been
lost due to the noise.
Thinking of a table with missing entries as a garbled message, the information
missing from a table is the minimum amount of information needed to regain the full
table. This may seem simple, but it is a subtle concept. One can use all properties
that are implied by the table structure. The previous measures do not use the
structural properties. The width of the ranges will have some bearing. One also can
use the fact that the hidden values are not independent, but are linked by simple
equations. Only a subset of them need be recovered by using more information, the
rest can be trivially calculated. Calculating the cost on a cell by cell basis implicitly
suggests that all hidden cells are independent, and that the information loss can
simply be added up.
For the first table (assuming all the quantities are integral for simplicity) there are
only 3 possible tables compatible with the suppression pattern. Consequently,
(having agreed upon a standard order for enumerating the possible tables) one only
needs to be told which table in the sequence is the true one, table 1, table 2 or table 3.
In this case the amount of information needed is that needed to select one out of three
(equally probable) possibilities. This is a precise amount of information, which even
has a name, one trit. In more common units 1 trit = 1.58 bits (since the binary log of
3 is 1.58...). For the second table one has 27 possible solutions. Selecting one from
27 could be done by 3 divisions into 3 equal parts selecting one part after each
division. So one requires 3 trits of information, hence the statement that 3 times as
much information has been lost in the second table. Counting the number of tables
correctly uses the cell ranges, and the relationships between the hidden values, both
of which are ignored by the value or number criteria.
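Counting the consistent tables is straightforward for small patterns. The sketch below brute-forces a hypothetical pattern (the actual values behind the figures are not reproduced in the text) and converts the count into trits.

import itertools, math

# Number of non-negative integer vectors x with A @ x == b and x[i] <= upper[i];
# each admissible x is one table consistent with the suppression pattern.
def count_tables(A, b, upper):
    count = 0
    for x in itertools.product(*(range(u + 1) for u in upper)):
        if all(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) == b_i
               for row, b_i in zip(A, b)):
            count += 1
    return count

# The same hypothetical pattern as before: x1+x2=30, x3+x4=40, x1+x3=25, x2+x4=45.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
b = [30, 40, 25, 45]
n = count_tables(A, b, upper=[30, 30, 40, 40])
print(n, math.log(n, 3))            # 26 tables, about 2.97 trits of lost information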
This viewpoint can resolve some other puzzles or paradoxes. Here is a
hypothetical table in which the two cells of value 120 are sensitive (Figure 13).
Here is the minimum cell count suppression pattern, with two complements
totalling 200, and the minimum suppressed value pattern (Figure 14, Figure 15),
with 6 complements totalling 60.
Here (Figure 16) is the pattern we prefer, which is intermediate between the two
others, having 4 complementary suppressions of total value 80.
Using the notion of the amount of information needed to recover the table, this is
in fact the best of the 3. With the minimum count pattern, the size of the
complements makes the ranges large, and there are many possible tables (101)
consistent with the pattern. With the minimum value pattern, although the ranges
are small, there are two independent sets of hidden cells, and hence the number of
tables is the product of the numbers of the two sub-tables with suppression (11*11).
In the preferred pattern one has 21 possible tables.
(Note for the minimum value pattern, one has two independent sub-patterns. Here
one would like to be able to say that the information lost is the sum of two terms, one
for each independent unit. Since the number of tables is the product of the numbers
of possible solutions to these two sub-problems, it is clear that it is appropriate to
take logarithms of the number of possible tables. Some of you may of course realise
that this is a vast oversimplification of information theory. There is a simple
generalisation of the measure if the tables are not equi-probable.) The form of
objective function that is in practice the most successful in CONFID may be looked
on as an attempt to approximate this sort of measure. Given all these subtleties, it
follows that the effects of interventions, forcing the publication or suppression of
certain cells should not be done without serious thought. One should always measure
the effect of this type of intervention.
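Converting the three counts quoted above into trits makes the comparison explicit; the following check is arithmetic only.

import math

# 101 tables for the minimum-count pattern, 11*11 for the minimum-value pattern
# (two independent sub-patterns of 11 tables each), 21 for the preferred one.
for name, n_tables in [("minimum count", 101), ("minimum value", 11 * 11), ("preferred", 21)]:
    print(name, round(math.log(n_tables, 3), 2), "trits")
# minimum count: 4.20, minimum value: 4.37, preferred: 2.77 trits; the preferred
# pattern needs the least extra information to recover the full table.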
Given that tables with suppressions are equivalent to range tables, and that sophisticated users are probably calculating these ranges themselves, either exactly or approximately, it has often been suggested that the statistical agencies improve their service, especially to the less sophisticated users, by publishing the ranges themselves. One suppression package, ACS [5], takes this farther by providing, in addition to the ranges, a hypothetical solution consistent with them, i.e. a conceivable set of values for the hidden cells which make the tables add up. These values are not an estimate in any statistical sense, but provide a full set of reasonable values that may be of use, for example, in certain types of models which do not tolerate missing data points. In our view, range publication would be highly desirable. It has the following advantages for the statistical agency.
The personnel doing the suppression need to think more and to have a better
understanding of the process. Seeing the ranges will add to the quality of the
patterns, especially if any hand tinkering has happened. The effect of tinkering can
be better evaluated.
Certain "problems" attributed to cell suppression are pseudo-problems, caused only by poor presentation. One such is the feared necessity of having to use a large complement to protect a not very sensitive cell. Well, if the sensitivity is small, the
cell doesn't require much ambiguity or protection. Most of the statistical value of the
complementary cell can be published.
Another problem, the issue of continuity in time series, becomes smaller in
importance. It is generally felt that a series such as Figure 17 is disconcerting. If
ranges were used one could publish something like Figure 18 which is less
disturbing.
Obviously there are advantages for the customer too. They are getting more data.
(Or the data more conveniently. More sophisticated users can perform the analysis
for themselves.) Giving the ranges explicitly helps the less sophisticated user.
In our opinion, the arguments for range publication are convincing, and any
objections are not very sensible. If one had competing statistical agencies, they
would be rushing to get this new and improved service out to the salesman.
A few conclusions
Since we do not yet have a good measure of the information lost in a table, it follows that all methods in use today are heuristic. They solve various problems that approximate
the real problem. It is difficult to quantify how well these approximate problems
resemble the real problem. Our feeling is that better progress would be attained if
one had agreed upon the proper problem, and discussed various methods of solution,
exact or approximate. Properties of the problem and its solution could be studied,
and a more objective way to evaluate the heuristic methods would be available. As
well, standard test data sets could be prepared to facilitate discussion and
comparison.
It is only a modest overstatement to suggest that some people don't know what
they are doing, in spite of the fact that a proper understanding is not difficult.
Therefore training and attitude are big problems. The common errors are the use of bad, deeply flawed sensitivity rules, and the use of inadequate methods to generate the patterns, methods that do not ensure that the sensitive cells have sufficient ambiguity or that combinations are protected.
References
[1] Towards Automated Disclosure Analysis for Statistical Agencies. Gordon Sande; Internal Document, Statistics Canada (1977)
[2] Automated Cell Suppression to Preserve Confidentiality of Business Statistics. Gordon Sande; Stat. Jour. U.N. ECE 2, pp. 33-41 (1984)
[3] Linear Sensitivity Measures in Statistical Disclosure Control. L.H. Cox; Jour. Stat. Plan. & Infer. V5, pp. 153-164 (1981)
[4] Improving Statistics Canada's Cell Suppression Software (CONFID). D.A. Robertson; COMPSTAT 2000 Proceedings in Computational Statistics. Ed. J.K. Bethlehem, P.G.M. van der Heiden, Physica Verlag (Heidelberg, New York) (2000)
[5] ACS, available from Sande and Associates, 600 Sanderling Ct., Secaucus, N.J. 07094, U.S.A. g.sande@worldnet.att.net
Bounds on Entries in 3-Dimensional Contingency Tables
Subject to Given Marginal Totals
Lawrence H. Cox
U.S. National Center for Health Statistics, 6525 Belcrest Road
Hyattsville, MD 20782 USA
lcox@cdc.gov
Abstract: Problems in statistical data security have led to interest in determining
exact integer bounds on entries in multi-dimensional contingency tables subject
to fixed marginal totals. We investigate the 3-dimensional integer planar
transportation problem (3-DIPTP). Heuristic algorithms for bounding entries in
3-DIPTPs have recently appeared. We demonstrate that these algorithms are not exact, that they are based on necessary but not sufficient conditions for solving the 3-DIPTP, and that all are insensitive to whether a feasible table exists. We compare the algorithms and
demonstrate that one is superior, but not original. We exhibit fractional extremal
points and discuss implications for statistical data base query systems.
1 Introduction
A problem of interest in operations research since the 1950s [1] and during the 1960s
and 1970s [2] is to establish sufficient conditions for existence of a feasible solution to
the 3-dimensional planar transportation problem (3-DPTP), viz., to the linear program:
(1)
where the given constants are referred to as the 2-dimensional marginal
totals. Attempts on the problem are summarized in [3]. Unfortunately, each has been
shown [2, 3] to yield necessary but not sufficient conditions for feasibility. In [4] is
given a sufficient condition for multi-dimensional transportation problems based on an
iterative nonlinear statistical procedure known as iterative proportional fitting. The
purpose here is to examine the role of feasibility in the pursuit of exact integral lower and
upper bounds on internal entries in a 3-DPTP subject to integer constraints (viz., a 3-
DIPTP) and to describe further research directions. Our notation suggests that internal
and marginal entries are integer. Integrality is not required for 3-DPTP (1), nor by the
feasibility procedure of [4]. However, henceforth integrality of all entries is assumed as
we focus on contingency tables - tables of nonnegative integer frequency counts and
totals - and on the 3-DIPTP.
The consistency conditions are necessary for feasibility of 3-DIPTP:
(2)
The respective values are the 1-dimensional marginal totals, and their common sum is the grand total. It is customary to
represent the 2-dimensional marginal totals in matrices defined by elements:
(3)
The feasibility problem is the existence of integer solutions to (1) subject to consistency
(2) and integrality conditions on the 2-dimensional marginal totals. The bounding
problem is to determine integer lower and upper bounds on each internal entry over contingency tables satisfying (1)-(2). Exact bounding determines, for each entry, the interval between its minimum and maximum values over all integer feasible solutions of (1)-(2).
The bounding problem is important in statistical data protection. To prevent
unauthorized disclosure of confidential subject-level data, it is necessary to thwart
narrow estimation of small counts. In lieu of releasing the internal entries of a 3-
dimensional contingency table, a national statistical office (NSO) may release only the
2-dimensional marginals. An important question for the NSO is then: how closely can
a third party estimate the suppressed internal entries using the published marginal totals?
During large-scale data production such as for a national census or survey, the NSO
needs to answer this question thousands of times.
Several factors can produce an infeasible table, viz., marginal totals satisfying (1)-(2) for which no feasible solution exists; moreover, infeasible tables are ubiquitous and abundant, viz., dense in the set of all potential tables [4]. To be useful, bounding methods must be
sensitive to infeasibility, otherwise meaningless data and erroneous inferences can result
[5].
The advent of public access statistical data base query systems has stimulated recent
research by statisticians on the bounding problem. Unfortunately, feasibility and its
consequences have been ignored. We highlight and explore this issue, through
examination of four papers representing separate approaches to bounding problems.
Three of the papers [6-8] were presented at the International Conference on Statistical
Data Protection (SDP’98), March 25-27, 1998, Lisbon, Portugal, sponsored by the
Statistical Office of the European Communities (EUROSTAT). The fourth [9] appeared
in Management Science and reports current research findings. A fifth, more recent,
paper [10] is also discussed. We examine issues raised and offer observations and
generalizations.
2 The F-Bounds
Given a 2-dimensional table with consistent sets of column and row marginal totals, the nominal upper bound for an internal entry equals the minimum of its row and column totals. The nominal lower bound is zero.
It is easy to obtain exact bounds in 2-dimensions. The nominal upper bound is exact, by
the stepping stones algorithm: set the entry to its nominal upper bound, and subtract this value
from the column, row and grand totals. Either the column total or the row total (or both)
must become zero: set all entries in the corresponding column (or row, or both) equal to
zero and drop this column (or row, or both) from the table. Arbitrarily pick an entry
from the remaining table, set it equal to its nominal upper bound, and continue. In a
finite number of iterations, a completely specified, consistent 2-dimensional table
exhibiting the nominal upper bound for the entry will be reached. Exact lower bounds can be
obtained as follows.
Since the other entries in the same row as a given entry sum to the row total minus the entry, and cannot exceed the sum of the other column totals (the grand total minus the entry's column total), the entry must be at least its row total plus its column total minus the grand total. That this bound is exact follows from observing that a table attaining it is feasible. Therefore, in 2 dimensions, exact bounds are given by:

max{0, row total + column total − grand total} ≤ entry ≤ min{row total, column total}.     (4)
These bounds generalize to m-dimensions, viz., each internal entry is contained in
precisely m(m-1)/2 2-dimensional tables, each of which yields a candidate lower bound.
The maximum of these lower bounds and zero provides a lower bound on the entry.
Unlike the 2-dimensional case, in m ≥ 3 dimensions these bounds are not necessarily
exact [5]. We refer to these bounds as the F-bounds. In 3-dimensions, the F-bounds are:
(5)
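As a concrete rendering of the bounds just described, the sketch below (mine; the paper gives no code and I have assumed index roles for the marginals) computes 3-dimensional F-bounds from the three sets of 2-dimensional marginals: the upper bound on an entry is the smallest 2-dimensional marginal containing it, and each of the three 2-dimensional tables containing the entry contributes a Fréchet-type candidate lower bound.

```python
# Hedged sketch of the 3-dimensional F-bounds (5); not the paper's code.
import numpy as np

def f_bounds(m_ij, m_ik, m_jk):
    """m_ij sums over k, m_ik sums over j, m_jk sums over i (assumed consistent)."""
    m_i = m_ij.sum(axis=1)   # 1-dimensional marginals
    m_j = m_ij.sum(axis=0)
    m_k = m_ik.sum(axis=0)
    upper = np.minimum(np.minimum(m_ij[:, :, None], m_ik[:, None, :]), m_jk[None, :, :])
    lower = np.maximum.reduce([
        np.zeros_like(upper),
        m_ij[:, :, None] + m_ik[:, None, :] - m_i[:, None, None],
        m_ij[:, :, None] + m_jk[None, :, :] - m_j[None, :, None],
        m_ik[:, None, :] + m_jk[None, :, :] - m_k[None, None, :],
    ])
    return lower, upper

# Example: the marginals of a 2x2x2 table with every internal entry equal to 1.
n = np.ones((2, 2, 2), dtype=int)
lo, up = f_bounds(n.sum(axis=2), n.sum(axis=1), n.sum(axis=0))
print(lo.min(), up.max())   # 0 2: each entry is only known to lie in [0, 2]
```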
3 The Bounding Methods of Fienberg and Buzzigoli-Giusti
3.1 The Procedure of Fienberg
In [6], Fienberg does not specify a bounding algorithm precisely, but illustrates an
approach via example. The example is a 3x3x2 table of sample counts from the 1990
Decennial Census Public Use Sample [6, Table 1]. For convenience, we present it in the
following form. The internal entries are:
Table 1. Fienberg [6, Table 1]
INCOME: High, Med, Low; SEX: MALE, FEMALE (cell counts not reproduced)
and the 2-dimensional marginal totals are:
.
In the remainder of this sub-section, we examine the properties of the bound procedure
of [6].
Corresponding to internal entry (i, j, k), there exists a collapsed 2x2x2 table:
Table 2. Collapsing a 3-dimensional table around entry
Entry (2, 2, 2) in the lower right-hand corner is the complement of , denoted . Fix
(i, j, k). Denote the 2-dimensional marginals of Table 2 to which contributes by
. Observe that:
(6)
From (6) results the 2-dimensional Fréchet lower bound of Fienberg (1999):
(7)
The nominal (also called Fréchet) upper bound on equals .
The 2-dimensional Fréchet bounds of [6] are thus:
(8)
Simple algebra reveals that the lower F-bounds of Section 2 and the 2-dimensional
lower Fréchet bounds of [6] are identical. The F-bounds are easier to implement.
Consequently, we replace (8) by (5).
From (6) also follows the 2-dimensional Bonferroni upper bound of [6]:
(9)
This Bonferroni upper bound is not redundant: if is sufficiently small, it can
be sharper than the nominal upper bound. This is illustrated by the entry of the
2x2x2 table with marginals:
(10)
yields the 1-dimensional Fréchet lower
bound [6]:
(11)
This bound is redundant with respect to the lower F-bounds. This is demonstrated as
follows:
(12)
The Fréchet-Bonferroni bounds of [6] can be replaced by:
(13)
We return to the example [6, Table 1]. Here, the 2-dimensional Bonferroni upper
bound (9) is not effective for any entry, and can be ignored. Thus, in this example, the
bounds (13) are identical to the F-bounds (5), and should yield identical results. They
in fact do not, most likely due to numeric error somewhere in [6]. This discrepancy needs to be kept in mind as we compare computational approaches below: that of Fienberg, using (13), and an alternative using (5).
Fienberg [6] computes the Fréchet bounds, without using the Bonferroni bound (9),
resulting in Table 7 of [6]. We applied the F-bounds (5), but in place of his Table 7,
obtained sharper bounds. Both sets of bounds are presented in Table 3, as follows. If
our bound agrees with [6, Table 7] we present the bound. If there is disagreement, we include the [6, Table 7] bound in parentheses alongside ours.
Table 3. F-Bounds and Fienberg (Table 7) Fréchet Bounds for Table 1
INCOME: High, Med, Low; SEX: MALE, FEMALE (bound values not reproduced)
Fienberg [6] next applies the Bonferroni upper bound (9) to Table 7, and reports improvement in five cells, resulting in his Table 8, the table of exact bounds for this table. We
were unable to reproduce these results: the Bonferroni bound provides improvement
for none of the entries.
3.2 The Shuttle Algorithm of Buzzigoli-Giusti
In [7], Buzzigoli-Giusti present the iterative shuttle algorithm, based on principles of
subadditivity:
- a sum of lower bounds on entries is a lower bound for the sum of the
entries, and
- a sum of upper bounds on entries is an upper bound for the sum of the
entries;
- the difference between the value (or an upper bound on the value) of an
aggregate and a lower bound on the sum of all but one entry in the
aggregate is an upper bound for that entry, and
- the difference between the value (or a lower bound on the value) of an
aggregate and an upper bound on the sum of all but one entry in
the aggregate is a lower bound for that entry.
The shuttle algorithm begins with nominal lower and upper bounds. For each entry and
its 2-dimensional marginal totals, the sum of current upper bounds of all other entries
contained in the 2-dimensional marginal is subtracted from the marginal. This is a
candidate lower bound for the entry. If the candidate improves the current lower
bound, it replaces it. This is followed by an analogous procedure using sums of lower
bounds and potentially improved upper bounds. This two-step procedure is repeated
until all bounds are stationary. The authors fail to note, though it is evident, that stationarity is reached in a finite number of iterations because the marginals are integer.
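The two-step procedure just described can be rendered roughly as follows (my own sketch, not the authors' code; it assumes all three sets of 2-dimensional marginals are given and does not use the collapsing of Table 2). Note that, as discussed in Section 3.4 below, such an implementation converges and returns bounds even when the marginals admit no feasible table.

```python
# Rough sketch of the shuttle algorithm for a 3-dimensional table with all
# three sets of 2-dimensional marginals fixed; not the authors' implementation.
import numpy as np

def shuttle_bounds(m_ij, m_ik, m_jk, max_rounds=1000):
    """m_ij sums over k, m_ik sums over j, m_jk sums over i (index roles assumed)."""
    I, J = m_ij.shape
    K = m_ik.shape[1]
    lo = np.zeros((I, J, K), dtype=np.int64)
    up = np.minimum(np.minimum(m_ij[:, :, None], m_ik[:, None, :]), m_jk[None, :, :])
    for _ in range(max_rounds):
        old_lo, old_up = lo.copy(), up.copy()
        # Step 1: a marginal minus the sum of the other entries' upper bounds is a
        # candidate lower bound for each entry contributing to that marginal.
        lo = np.maximum(lo, m_ij[:, :, None] - (up.sum(axis=2, keepdims=True) - up))
        lo = np.maximum(lo, m_ik[:, None, :] - (up.sum(axis=1, keepdims=True) - up))
        lo = np.maximum(lo, m_jk[None, :, :] - (up.sum(axis=0, keepdims=True) - up))
        # Step 2: the analogous update for upper bounds, using the improved lower bounds.
        up = np.minimum(up, m_ij[:, :, None] - (lo.sum(axis=2, keepdims=True) - lo))
        up = np.minimum(up, m_ik[:, None, :] - (lo.sum(axis=1, keepdims=True) - lo))
        up = np.minimum(up, m_jk[None, :, :] - (lo.sum(axis=0, keepdims=True) - lo))
        if np.array_equal(old_lo, lo) and np.array_equal(old_up, up):
            break
    return lo, up
```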
3.3 Comparative Analysis of Fienberg, Shuttle, and F-Bounding Methods
We compare the procedure of Fienberg ([6]), the shuttle algorithm and the F-bounds.
As previously observed, the bounds of [6] can be reduced to the F-bounds plus the 2-
dimensional Bonferroni upper bound, viz., (13). The shuttle algorithm produces
bounds at least as sharp as the F-bounds, for two reasons. First, the iterative shuttle
procedure enables improved lower bounds to improve the nominal and subsequent
upper bounds. Second, lower F-bounds can be no sharper than those produced during
the first set of steps of the shuttle algorithm. To see this, for concreteness consider a candidate lower F-bound for an entry: one of the 2-dimensional marginals containing the entry, plus a second such marginal, minus the 1-dimensional marginal they share. One of the three candidate shuttle lower bounds for the entry equals the first 2-dimensional marginal minus the sum of the current upper bounds of the other entries contributing to it. Each of those current upper bounds is at most its nominal upper bound, and the nominal upper bounds of those other entries sum to at most the shared 1-dimensional marginal minus the second 2-dimensional marginal. Consequently, the shuttle candidate lower bound is greater than or equal to the F-candidate, so the shuttle candidate is at least as sharp as the F-candidate. If the shuttle algorithm is employed, all of the lower Fienberg [6] bounds and lower F-bounds except the nominal one (namely, 0) are redundant.
Buzzigoli-Giusti [7] illustrate the 3-dimensional shuttle algorithm for the case of a
2x2x2 table. It is not clear whether, in the general case, they intend to
utilize the collapsing procedure of Table 2, but in what follows we assume that they do.
Consider the 2-dimensional Bonferroni upper bounds (9). From (6), the Bonferroni
upper bound for equals . Consider the right-hand 2-dimensional table in
Table 2. Apply the standard 2-dimensional lower F-bound to the entry in the upper left-
hand corner:
(14)
As previously observed, the shuttle algorithm will compute this bound during step 1 and,
if it is positive, replace the nominal lower bound with it, or with something sharper.
During step 2, the shuttle algorithm will use this lower bound (or something sharper)
to improve the upper bound on , as follows:
(15)
Thus, the Fienberg [6] 2-dimensional Bonferroni upper bound is redundant relative to
the shuttle algorithm. Consequently, if the shuttle and collapsing methodologies are
applied in combination, it suffices to begin with the nominal bounds and run the shuttle
algorithm to convergence. Application of this approach (steps 1-2-1) to Table 1 yields
Table 4 of exact bounds:
Table 4. Exact bounds for Table 1 from the Shuttle Algorithm
INCOME: High, Med, Low; SEX: MALE, FEMALE (bound values not reproduced)
3.4 Limitations of All Three Procedures
Although the shuttle algorithm produced exact bounds for this example, the shuttle
algorithm, and consequently the Fienberg [6] procedure and the F-bounds, are inexact, as follows. Examples 7b,c of [4] are 3-DIPTPs exhibiting one or more non-integer continuous exact bounds. Because it is based on iterative improvement of integer upper
bounds, in these situations the shuttle algorithm can come no closer than one unit larger
than the exact integer bound, and therefore is incapable of achieving the exact integer
bound.
The shuttle algorithm is not new: it was introduced by Schell [1] towards purported
sufficient conditions on the 2-dimensional marginals for feasibility of 3-DPTP. By
means of (16), Moravek-Vlach [2] show that the Schell conditions are necessary, but
not sufficient, for the existence of a solution to the 3-DPTP. This counterexample is
applicable here. Each 1-dimensional Fréchet lower bound is negative, thus not
effective. Each Bonferroni upper bound is too large to be sharp. No lower F-bound is
positive. Iteration of the shuttle produces no improvement. Therefore, each procedure
yields nominal lower (0) and upper (1) bounds for each entry. Each procedure
converges. Consequently, all three procedures produce seemingly correct bounds when in fact no table exists. A simpler counterexample (Example 2 of [5]), given by (17),
appears in the next section.
(16)
4 The Roehrig and Chowdhury Network Models
Roehrig et al. (1999) [8] and Chowdhury et al. (1999) [9] offer network models for
computing exact bounds on internal entries in 3-dimensional tables. Network models
are extremely convenient and efficient, and most importantly enjoy the integrality property, viz., integer constraints (viz., 2-dimensional marginal totals) assure integer optima. Network models provide a natural mechanism and language in which to express
2-dimensional tables, but most generalizations beyond 2-dimensions are apt to fail.
Roehrig et al. [8] construct a network model for 2x2x2 tables and claim that it can be
generalized to all 3-dimensional tables. This must be false. Cox ([4]) shows that the
class of 3-dimensional tables representable as a network is the set of tables for which at least one dimension has size 2. This is true because, if all dimensions have size at least 3,
it is possible to construct a 3-dimensional table with integer marginals whose
corresponding polytope has non-integer vertices ([4]), which contradicts the integrality
property.
Chowdhury et al. [9] address the following problem related to 3-DIPTP, also appearing in [8]. Suppose that the NSO releases only two of the three sets of 2-dimensional marginal totals (A and B), but not the third set (C) or the internal entries. Is it
possible to obtain exact lower and upper bounds for the remaining marginals (C)? The
authors construct independent networks corresponding to the 2-dimensional tables
defined by the appropriate (A, B) pairs and provide a procedure for obtaining exact
bounds.
Unfortunately, this problem is quite simple and can be solved directly without recourse
to networks or other mathematical formalities. In particular, the F-bounds of Section
2 suffice, as follows. Observe that, absent the C-constraints, the minimum (maximum)
feasible value of a C-marginal total Cij equals the sum of the minimum (maximum) values for the corresponding internal entries. As the internal entries are subject only to 2-dimensional constraints within their respective k-planes, the exact bounds for each of them are precisely its 2-dimensional lower and upper F-bounds. These can be computed at
sight and added along the k-dimension thus producing the corresponding C-bounds
without recourse to networks or other formulations.
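The direct argument just given translates into a few lines of code; a hedged sketch (mine, with assumed index roles: A[i,k] holds the sums over j, B[j,k] the sums over i, and C[i,j] the withheld sums over k):

```python
# Bounds on the withheld C-marginals from per-plane 2-dimensional F-bounds.
import numpy as np

def c_marginal_bounds(A, B):
    N = A.sum(axis=0)                                 # grand total of each k-plane
    lo_ijk = np.maximum(0, A[:, None, :] + B[None, :, :] - N[None, None, :])
    up_ijk = np.minimum(A[:, None, :], B[None, :, :])
    return lo_ijk.sum(axis=2), up_ijk.sum(axis=2)     # lower, upper bounds on C[i,j]
```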
The Chowdhury et al. [9] method is insensitive to infeasibility, again demonstrated by
example (16): all Fréchet lower and nominal upper bounds computed in the two 2-
dimensional tables defined by k = 1 and k = 2 contain the corresponding Cij (equal to
1 in all cases), but there is no underlying table as the problem is infeasible. Insensitivity
is also demonstrated by Example 2 of [5], also presented at the 1998 conference, viz.,
(17)
5 Fractional Extremal Points
Theorem 4.3 of [4] demonstrates that the only multi-dimensional integer planar
transportation problems (m-DIPTP) for which integer extremal points are assured are
those of size 2 × 2 × ··· × 2 × b × c, with m − 2 dimensions of size 2. In these situations, linear programming methods can be relied
upon to produce exact integer bounds on entries, and to do so in a computationally
efficient manner even for problems of large dimension or size. The situation for other
integer problems of transportation type, viz., m-dimensional contingency tables subject to k-dimensional marginal totals, k = 0, 1, ..., m-1, is quite different: until a direct connection can be demonstrated between exact continuous bounds on entries obtainable from linear programming and exact integer bounds on entries, linear programming will remain an unreliable tool for solving multi-dimensional problems of transportation type.
Integer programming is not a viable option for large dimensions or size or repetitive
use. Methods that exploit unique structure of certain subclasses of tables are then
appealing, though possibly of limited applicability.
Dobra-Fienberg [10] present one such method, based on notions from mathematical
statistics and graph theory. Given an m-dimensional integer problem of transportation
type and specified marginal totals, if these marginals form a set of sufficient statistics
for a specialized log-linear model known as a decomposable graphical model, then the
model is feasible and exact integer bounds can be obtained from straightforward
formulae. These formulae yield, in essence, the F- and Bonferroni bounds. The reader
is referred to [10] for details, [11] for details on log-linear models, and [12] for
development of graphical models.
The m-dimensional planar transportation problem considered here, m > 2, corresponds
to the no m-factor effect log-linear model, which is not graphical, and consequently the
Dobra-Fienberg method [10] does not apply to problems considered here.
The choice here and perhaps elsewhere of the 3- or m-DIPTP as the initial focus of
study for bounding problems is motivated by the following observations. If for reasons
of confidentiality the NSO cannot release the full m-dimensional tabulations (viz., the
m-dimensional marginal totals), then its next-best strategy is to release the (m-1)-
dimensional marginal totals, corresponding to the m-DIPTP. If it is not possible to
release all of these marginals, perhaps the release strategy of Chowdhury et al. [9]
should be investigated. Alternatively, release of the (m-2)-dimensional marginals might
be considered. This strategy for release is based on the principle of releasing the most
specific information possible without violating confidentiality. Dobra-Fienberg offer a different approach: a class of marginal totals, perhaps of varying dimension, that can
be released while assuring confidentiality via easy computation of exact bounds on
suppressed internal entries.
A formula-driven bounding method is valuable for large problems and for repetitive,
large scale use. Consider the m-dimensional integer problem of transportation type
specified by its 1-dimensional marginal totals. In the statistics literature, this is known
as the complete independence log-linear model [11]. This model, and in particular the
3-dimensional complete independence model, is graphical and decomposable. Thus,
exact bounding can be achieved using Dobra-Fienberg.
Such problems can exhibit non-integer extremal points. For example, consider the
3x3x3 complete independence model with 1-dimensional marginal totals given by the
vector:
(18)
Even though all continuous exact bounds on internal entries in (18) are integer, one
extremal point at which an entry is maximized (at 1) contains four non-integer entries and another contains six. Bounding using linear programming would only demonstrate that 1 is the continuous, not the integer, maximum if either of these extremal points were
exhibited. A strict network formulation is not possible because networks exhibit only
integer extremal points (although use of networks with side constraints is under
investigation). Integer programming is undesirable for large or repetitive applications.
Direct methods such as Dobra-Fienberg may be required. A drawback of Dobra-
Fienberg is that it applies only in specialized cases.
6 Discussion
In this paper we have examined the problem of determining exact integer bounds for
entries in 3-dimensional integer planar transportation problems. This research was
motivated by previous papers presenting heuristic approaches to similar problems that
failed in some way to meet simple criteria for performance or reasonableness. We
examined these and other approaches analytically and demonstrated the superiority of one of them.
We demonstrated that this method is imperfect and a reformulation of a method from
the operations research literature of the 1950s. We demonstrated that these methods are
insensitive to infeasibility and can produce meaningless results otherwise undetected.
We demonstrated that a method purported to generalize from 2x2x2 tables to all 3-
dimensional tables could not possibly do so. We demonstrated that a problem posed
and solved using networks in a Management Science paper can be solved by simple,
direct means without recourse to mathematical programming. We illustrated the
relationship between computing integer exact bounds, the presence of non-integer
extremal points and the applicability of mathematical programming formulations such
as networks.
NSOs must rely on automated statistical methods for operations including estimation,
tabulation, quality assurance, imputation, rounding and disclosure limitation.
Algorithms for these methods must converge to meaningful quantities. In particular,
these procedures should not report meaningless, misleading results such as seemingly
correct bounds on entries when no feasible values exist. These risks are multiplied in
statistical data base settings where data from different sources often are combined.
Methods examined here for bounds on suppressed internal entries in 3-dimensional
contingency tables fail this requirement because they are heuristic and based on
necessary, but not sufficient, conditions for the existence of a solution to the 3-DPTP.
In addition, most of these methods fail to ensure exact bounds, and are incapable of
identifying if and when they do in fact produce an exact bound. Nothing is gained by
extending these methods to higher dimensions.
Disclaimer
Opinions expressed are solely those of the author and are not intended to represent
policy or practices of the National Center for Health Statistics, Centers for Disease
Control and Prevention, or any other organization.
References
1. Schell, E. Distribution of a product over several properties. Proceedings, 2nd Symposium on Linear Programming. Washington, DC (1955) 615-642
2. Moravek, J. and Vlach, M. On necessary conditions for the existence of the solution to the multi-index transportation problem. Operations Research 15 (1967) 542-545
3. Vlach, M. Conditions for the existence of solutions of the three-dimensional planar
transportation problem. Discrete Applied Mathematics 13 (1986) 61-78
4. Cox, L. On properties of multi-dimensional statistical tables. Manuscript (April 2000) 29 pp.
5. Cox, L. Invited Talk: Some remarks on research directions in statistical data protection. Statistical Data Protection: Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 163-176
6. Fienberg, S. Fréchet and Bonferroni bounds for multi-way tables of counts with applications
to disclosure limitation. Statistical Data Protection: Proceedings of the Conference.
EUROSTAT, Luxembourg (1999) 115-129
7. Buzzigoli, L. and Giusti, A. An algorithm to calculate the lower and upper bounds of
the elements of an array given its marginals. Statistical Data Protection: Proceedings of
the Conference. EUROSTAT, Luxembourg (1999) 131-147
8. Roehrig, S., Padman, R., Duncan, G., and Krishnan, R. Disclosure detection in multiple
linked categorical datafiles: A unified network approach. Statistical Data Protection:
Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 149-162
9. Chowdhury, S., Duncan, G., Krishnan, R., Roehrig, S., and Mukherjee, S. Disclosure
detection in multivariate categorical databases: Auditing confidentiality protection
through two new matrix operators. Management Science 45 (1999) 1710-1723
10. Dobra, A. and Fienberg, S. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proceedings of the National Academy of Sciences 97 (2000) 11185-11192
11. Bishop, Y., Fienberg, S., and Holland, P. Discrete Multivariate Analysis: Theory and
Practice. Cambridge, MA: M.I.T. Press (1975)
12. Lauritzen, S. Graphical Models. Oxford: Clarendon Press (1996)
Extending Cell Suppression to Protect Tabular
Data against Several Attackers
Juan José Salazar González
DEIOC, Faculty of Mathematics, University of La Laguna
Av. Astrofísico Francisco Sánchez, s/n; 38271 La Laguna, Tenerife, Spain
Tel: +34 922 318184; Fax: +34 922 318170
jjsalaza@ull.es
Abstract. This paper presents three mathematical models for the problem of finding a cell suppression pattern minimizing the loss of information while guaranteeing protection level requirements for different sensitive cells and different intruders. This problem covers a very general setting in Statistical Disclosure Control, and it contains as particular cases several important problems like, e.g., the so-called "common respondent problem" mentioned in Jewett [9]. Hence, the three models also apply to the common respondent problem, among others. The first model corresponds to bi-level Mathematical Programming. The second model belongs to Integer Linear Programming (ILP) and could be used on small-size tables where some nominal values are known to assume discrete values. The third model is also an ILP model, valid when the nominal values of the table are continuous numbers, and with the advantage of containing a small number of variables (one 0-1 variable for each cell in the table). On the other hand, this model has a larger number of linear inequalities (related to the number of sensitive cells and the number of attackers). Nevertheless, this paper addresses this disadvantage, which is overcome by dynamically generating the important inequalities when necessary. The overall algorithm follows a modern Operational Research technique known as the branch-and-cut approach, and allows optimal solutions to be found for medium-size tables. On large-size tables the approach can be used to find near-optimal solutions. The paper illustrates the procedure on an introductory instance.
The paper ends by pointing out an alternative methodology (closely related to the one in Jewett [9]) that produces patterns by shrinking all the different intruders into a single one, and compares it with the classical single-attacker methodology and with the above multi-attacker methodology.
1 Introduction
Cell suppression is one of the most popular techniques for protecting sensitive in-
formation in statistical tables, and it is typically applied to 2- and 3-dimensional
* Work supported by the European project IST-2000-25069, "Computational Aspects of Statistical Confidentiality" (CASC).
tables whose entries (cells) are subject to marginal totals. The standard cell sup-
pression technique is based on the idea of protecting the sensitive information
by hiding the values of some cells with a symbol (e.g. ∗). The aim is that a
set of potential intruders could not guess (exactly or approximately) any one of the hidden values by using only the published values and some a-priori informa-
tion. The only assumption of this work is that this a-priori information must be
formulated as a linear system of equations or inequations with integer and/or
continuous variables. For example, we allow the intruder to know bounds on the
hidden values (as happens with the larger contributors to each cell) but not proba-
bility distributions on them. Notice that different intruders could know different
a-priori information.
The problem is considered so complex that the literature contains only heuristic approaches (i.e., procedures providing approximate, probably overprotected, suppression patterns) for special situations. For example, a relevant situation occurs when there is an entity which contributes to several cells, leading to the so-called common respondent problem. Possible simplifications valid for this situation consist in replacing all the different intruders by one stronger attacker with "protection capacities" associated to the potential secondary suppressions (see, e.g., Jewett [9] or Sande [17] for details), or in aggregating some sensitive cells into new "union" cells with stronger protection level requirements (see, e.g., Robertson [15]).
This paper presents the first mathematical models for the problem in the
general situation (i.e., without any simplification) and a first exact algorithm for its resolution. The approach proposed here looks for a suppression pattern with minimum loss of information which guarantees all the protection re-
quirements against all the attackers. It is also a practical method to find an
optimal solution using modern tools from Mathematical Programming, a well-
established methodology (see, e.g., [13]). The models and the algorithm apply
to a very general problem, containing the common respondent problem as a
particular case. Section 2 illustrates the classical cell suppression methodology
by means of a (very simple) introductory example, and Section 3 describes the
more general multi-attacker cell suppression methodology. Three mathematical
models are described in Section 4, the last one with one decision variable for
each cell and a large number of constraints. Section 5 proposes an algorithm for
finding an optimal solution of the model using the third model, and Section 6 il-
lustrates how it works on the introductory example. Section 7 presents a relaxed
methodology based on considering one worse-case attacker with the information
of all the original intruders, leading to an intermediate scheme that could be ap-
plied with smaller computational effort. This scheme considers the “protection
capacities” in Jewett [9] and provides overprotected patterns. Finally, Section 8
compares the classical, the multi-attacker and the intermediate methodologies,
and Section 9 summarizes the main ideas of this paper and points out a further
extension of the multi-attacker models.
2 Classical Cell Suppression Methodology
A common hypothesis in the classical cell suppression methodology (see, e.g.,
Willenborg and De Waal [18]) is that there is only one attacker interested in
the disclosure of all sensitive cells. We next introduce the main concepts of the
methodology through a simple example.
A B C Total
Activity I 20 50 10 80
Activity II 8 19 22 49
Activity III 17 32 12 61
Total 45 101 44 190
Fig. 1. Investment of enterprises by activity and region
A B C Total
Activity I 20 50 10 80
Activity II * 19 * 49
Activity III * 32 * 61
Total 45 101 44 190
Fig. 2. A possible suppression pattern
Figure 1 exhibits a statistical table giving the investment of enterprises (per
millions of guilders), classified by activity and region. For simplicity, the cell
corresponding to Activity i (for each i ∈ {I, II, III}) and Region j (for each
j ∈ {A, B, C}) will be represented by the pair (i, j). Let us assume that the
information in cell (II, C) is confidential, hence it is viewed as a sensitive cell
to be suppressed. By using the marginal totals the attacker can however recom-
pute the missing value in the sensitive cell, hence other table entries must be
suppressed as well, e.g., those of Figure 2. With this choice, the attacker cannot disclose the value of the sensitive cell exactly, although he/she can still com-
pute a range for the values of this cell which are consistent with the published
entries. Indeed, from Figure 2 the minimum value y^-_{II,C} for the sensitive cell (II, C) can be computed by solving a Linear Programming (LP) model in which the values y_{i,j} for the suppressed cells (i, j) are treated as unknowns, namely

y^-_{II,C} := min y_{II,C}
subject to
y_{II,A} + y_{II,C} = 30
y_{III,A} + y_{III,C} = 29
y_{II,A} + y_{III,A} = 25
y_{II,C} + y_{III,C} = 34
y_{II,A} ≥ 0, y_{III,A} ≥ 0, y_{II,C} ≥ 0, y_{III,C} ≥ 0.
Notice that the right-hand side values are known to the attacker, as they can be
obtained as the difference between the marginal and the published values in a
row/column. We are also assuming that the attacker knows that a missing value
is non-negative, i.e., 0 and infinity are known “external bounds” for suppressions.
The maximum value y^+_{II,C} for the sensitive cell can be computed in a perfectly analogous way, by solving the linear program of maximizing y_{II,C} subject to
the same constraints as before. Notice that each solution of this common set of
constraints is a congruent table according with the published suppression pattern
in Figure 2 and with the extra knowledge of the external bounds (non-negativity
on this example).
In the example, y^-_{II,C} = 5 and y^+_{II,C} = 30, i.e., the sensitive information
is “protected” within the protection interval [5,30]. If this interval is considered
sufficiently wide by the statistical office, then the sensitive cell is called protected;
otherwise Figure 2 is not a valid suppression pattern and new complementary
suppressions are needed.
Notice that the extreme values of the computed interval [5, 30] are only at-
tained if the cell (II, A) takes the quite unreasonable values of 0 and 25. Oth-
erwise, if the external bounds for each suppressed cell are assumed to be ±50%
of the nominal value, then the solution of the new two linear programs results
in the more realistic protection interval [18, 26] for the sensitive cell. That is
why it is very important to consider good estimations of the external bounds
known for the attacker on each suppressed cell when checking if a suppression
pattern protects (or not) each sensitive cell of the table. As already stated, in
the above example we are assuming that the external bounds are 0 and infinity,
i.e., the only knowledge of the attacker on the unknown variables is that they
are non-negative numbers.
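The two range computations above can be reproduced with any LP solver. A small sketch (mine, not part of the paper) using scipy, for both the nonnegativity bounds and the ±50% external bounds just mentioned:

```python
# Attacker's range computation for the sensitive cell (II, C) of the example.
from scipy.optimize import linprog

# Unknowns: y = (y_IIA, y_IIC, y_IIIA, y_IIIC), the four suppressed cells.
A_eq = [[1, 1, 0, 0],   # y_IIA + y_IIC   = 30 (row Activity II minus published 19)
        [0, 0, 1, 1],   # y_IIIA + y_IIIC = 29 (row Activity III minus published 32)
        [1, 0, 1, 0],   # y_IIA + y_IIIA  = 25 (column A minus published 20)
        [0, 1, 0, 1]]   # y_IIC + y_IIIC  = 34 (column C minus published 10)
b_eq = [30, 29, 25, 34]

def cell_range(bounds):
    """Return (min, max) of the sensitive cell (II, C) = y[1] over congruent tables."""
    lo = linprog(c=[0, 1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    hi = linprog(c=[0, -1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return lo.fun, -hi.fun

print(cell_range([(0, None)] * 4))                            # (5.0, 30.0)
print(cell_range([(4, 12), (11, 33), (8.5, 25.5), (6, 18)]))  # (18.0, 26.0) at +/-50%
```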
To classify the computed interval [y^-_p, y^+_p] around a nominal value a_p of a sensitive cell p as "sufficiently wide" or not, the statistical office must provide us with three parameters for each sensitive cell:
– an upper protection level representing the minimum allowed value of y^+_p − a_p;
– a lower protection level representing the minimum allowed value of a_p − y^-_p;
– a sliding protection level representing the minimum allowed value of y^+_p − y^-_p.
For example, if 7, 5 and 0 are the upper, lower and sliding protection levels,
respectively, then the interval [5, 30] is “sufficiently wide”, and therefore pattern
in Figure 2 is a valid solution for the statistical office (assuming the external
bounds are 0 and infinity).
The statistical office then aims at finding a valid suppression pattern pro-
tecting all the sensitive cells against the attacker, and such that the loss of
information associated with the suppressed entries is minimized. This results
into a combinatorial optimization problem known as the (classical) Cell Sup-
pression Problem, or CSP for short. CSP belongs to the class of the strongly
NP-hard problems (see, e.g., Kelly et al. [12], Geurts [7], Kao [10]), meaning
that it is very unlikely that an algorithm for the exact solution of CSP exists,
which guarantees an efficient (i.e., polynomial-time) performance for all possible
input instances.
Previous works on the classical CSP from the literature mainly concentrate
on 2-dimensional tables with marginals. Heuristic solution procedures have been
proposed by several authors, including Cox [1,2], Sande [16], Kelly et al. [12], and
Carvalho et al. [3]. Kelly [11] proposed a mixed-integer linear programming for-
mulation involving a huge number of variables and constraints (for instance, the
formulation involves more than 20,000,000 variables and 30,000,000 constraints
for a two-dimensional table with 100 rows, 100 columns and 5% sensitive en-
tries). Geurts [7] refined this model, and reported computational experiences on
small-size instances, the largest instance solved to optimality being a table with
20 rows, 6 columns and 17 sensitive cells (the computing time is not reported; for smaller instances, the code required several thousand CPU seconds on a SUN SPARC 1+ workstation). Gusfield [8] gave a polynomial algorithm for a special
case of the problem. Heuristics for 3-dimensional tables have been proposed in
Robertson [14], Sande [16], and Dellaert and Luijten [4]. Very recently, Fischetti
and Salazar [5] proposed a new method capable of solving to proven optimal-
ity, on a personal computer, 2-dimensional tables with about 250,000 cells and
10,000 sensitive entries. An extension of this methodology capable of solving to
proven optimality real-world 3- and 4-dimensional tables is presented in Fischetti
and Salazar [6].
3 Multi-attacker Cell Suppression Methodology
The classical cell suppression methodology has several disadvantages. One of them concerns the popular hypothesis that the table must be protected against one attacker. To be more precise, with the above assumption the attacker is supposed to be one external intruder with no information other than the structure of the table, the published values and the external bounds on the suppressed values. In practice, however, there are also other types of attackers, like some special respondents (e.g., different entities contributing to different cell values). We will refer to those as internal attackers, while the above intruder will be referred to as the external attacker.
an additional information concerning his/her own contribution to the table. To
be more precise, in the above example, if cell (II, A) has only one respondent,
the output from Figure 2 could be protected for an external attacker but not
from this potential internal attacker. Indeed, the respondent contributing to cell
(II, A) knows that yII,A ≥ 8 (and even that yII,A = 8 if he/she also knows that
he/she is the only contributor to such cell). This will allow him/her to compute
a more narrow protection interval for the sensitive cell (II, C) from Figure 2,
even if it is protected for the external attacker.
In order to avoid this important disadvantage of the classical cell suppres-
sion methodology, this paper proposes an extension of the mathematical model
presented in Fischetti and Salazar [6]. The extension determines a suppression
pattern protected against external and internal attackers and it will be described
in the next section.
The classical Cell Suppression will also be referred to throughout this article as Single-attacker CSP, while the new proposal will be named Multi-attacker CSP. The hypothesis of the new methodology is that we are given not only the basic information (nominal values, loss of information, etc.), but also a set of attackers K and, for each one, the specific information he/she has (i.e., his/her own bounds on unpublished values).
Notice that if K contains only the external attacker, then the multi-attacker
CSP reduces to the single-attacker CSP. Otherwise it could happen that some attackers need not be considered, since a suppression pattern that protects sen-
sitive information against one attacker could also protect the table against some
others. This is, for example, the case when there are two attackers with the
same protection requirements, but one knows tighter bounds; then it is enough
to protect the table against the stronger attacker. For each attacker, a similar
situation occurs with the sensitive cells, since protecting some of them could imply protecting others. Therefore, a clever preprocessing is always required to
reduce as much as possible the number of protection level requirements and the
number of attackers.
The availability of a well-implemented preprocessing could also help in the task of setting the potential internal attackers. Indeed, notice that in the literature several rules have been developed to establish the sensitive cells in a table (e.g., the dominance rule), but the same effort does not exist to establish the attackers. Within the preprocessing, a proposal could be to consider each respondent in a table as a potential intruder, and then simply apply the preprocessing to remove the dominated ones. In theory this approach could lead to a huge number of attackers, but in practice the number of attackers is expected to be no bigger than the number of cells in the table (and hopefully much smaller).
Another important observation is the following. Considering a particular sensitive cell, the statistical office could also be interested in providing different protection levels for each attacker. For example, suppose that the statistical office requires a lower protection level of 15 and an upper protection level of 5 for the sensitive cell (II, C) in the introductory example (Figure 1) against an external attacker. If the sensitive cell is the sum of the contributions from two respondents, one providing 18 units and the other providing 4 units, the statistical office could also be interested in requiring a special upper protection level of 20 against the biggest contributor to the sensitive cell (because he/she is a potential attacker with the extra knowledge that the sensitive cell contains at least the value 18). Indeed, notice that it does not make sense to protect the sensitive cell against this internal attacker with a lower protection requirement of at least 5, since he/she already knows a lower bound of 18. In other words, it makes sense that the statistical office wants different protection levels for different attackers, with the important assumption that each protection level must be smaller than the corresponding bound. The following section describes Mathematical
Programming models capturing all these features.
4 Mathematical Models
Let us consider a table [a_i, i ∈ I] with n := |I| cells. It can be a k-dimensional, hierarchical or linked table. Since there are marginals, the cells are linked through some equations indexed by J; let [Σ_{i∈I} m_{ij} y_i = b_j, j ∈ J] be the linear system defined by such equations. (Typically b_j = 0 and m_{ij} ∈ {−1, 0, +1} with one −1 in each equation.) We are also given a weight w_i for each cell i ∈ I for the loss of information incurred if such a cell is suppressed in the final suppression pattern. Let P ⊂ I be the set of sensitive cells (and hence the set of primary suppressions). Finally, let us consider a set K of potential attackers. Associated to each attacker k ∈ K, we are given the external bounds (lb^k_i, ub^k_i) known by the attacker on each suppressed cell i ∈ I, and the three protection levels (upl^k_p, lpl^k_p, spl^k_p) that the statistical office requires to protect each sensitive cell p ∈ P against such attacker k. From the last observation in the previous section, we will assume

lb^k_p ≤ a_p − lpl^k_p ≤ a_p ≤ a_p + upl^k_p ≤ ub^k_p   and   ub^k_p − lb^k_p ≥ spl^k_p

for each attacker k and each sensitive cell p.
Then, the optimization problem associated to the Cell Suppression Methodology can be modeled as follows. Let us consider a binary variable x_i associated to each cell i ∈ I, assuming value 1 if such cell must be suppressed in the final pattern, or 0 otherwise. Notice that the attacker will minimize and maximize unknown values on the set of consistent tables, defined by:

Σ_{i∈I} m_{ij} y_i = b_j ,  j ∈ J
lb^k_i ≤ y_i ≤ ub^k_i ,  i ∈ I when x_i = 1
y_i = a_i ,  i ∈ I when x_i = 0,

equivalently represented as the solution set of the following linear system:

Σ_{i∈I} m_{ij} y_i = b_j ,  j ∈ J
a_i − (a_i − lb^k_i) x_i ≤ y_i ≤ a_i + (ub^k_i − a_i) x_i ,  i ∈ I.     (1)
Therefore, our optimization problem is to find a value for each x_i such that the total loss of information in the final pattern is minimized, i.e.:

min Σ_{i∈I} w_i x_i     (2)

subject to, for each sensitive cell p ∈ P and for each attacker k ∈ K,
– the upper protection requirement must be satisfied, i.e.:

max {y_p : (1) holds} ≥ a_p + upl^k_p     (3)
– the lower protection requirement must be satisfied, i.e.:

min {y_p : (1) holds} ≤ a_p − lpl^k_p     (4)

– the sliding protection requirement must be satisfied, i.e.:

max {y_p : (1) holds} − min {y_p : (1) holds} ≥ spl^k_p     (5)

Finally, each variable must assume value 0 or 1, i.e.:

x_i ∈ {0, 1} for all i ∈ I.     (6)
Mathematical model (2)–(6) contains all the requirements of the statistical office (according to the definition given in Section 1), and therefore a solution [x*_i, i ∈ I] defines an optimal suppression pattern. The inconvenience is that it is not an easy model to solve, since it does not belong to standard Mixed Integer Linear Programming. In fact, the existence of optimization problems as constraints of a main optimization problem classifies the model in the so-called "Bilevel Mathematical Programming", for which no efficient algorithms are available to solve model (2)–(6) even at small sizes. Observe that the inconvenience of model (2)–(6) is not the number of variables (which is at most the number of cells, both for the master optimization problem and for each subproblem in the second level), but the fact that there are nested optimization problems in two levels. The best way to avoid the direct resolution is to look for a transformation into a classical Integer Programming model.
A first idea arises by observing that the optimization problem in condition (3) can be replaced by the existence of a table [f^{kp}_i, i ∈ I] such that it is congruent (i.e., it satisfies (1)) and it guarantees the upper protection level requirement, i.e.:

f^{kp}_p ≥ a_p + upl^k_p.

In the same way, the optimization problem in condition (4) can be replaced by the existence of a table [g^{kp}_i, i ∈ I] such that it is also congruent (i.e., it satisfies (1)) and it guarantees the lower protection level requirement, i.e.:

g^{kp}_p ≤ a_p − lpl^k_p.

Finally, the two optimization problems in condition (5) can be replaced by the above congruent tables if they guarantee the sliding protection level, i.e.:

f^{kp}_p − g^{kp}_p ≥ spl^k_p.

Figure 3 shows a first attempt to have an integer linear model.
Clearly, this new model is a Mixed Integer Linear Programming model, and therefore, in theory, there are efficient approaches to solve it. Nevertheless, the number of new variables (f^{kp}_i and g^{kp}_i) is really huge even for small tables. For example, the model associated with a table with 100 × 100 cells with 1%
min Σ_{i∈I} w_i x_i
subject to:
Σ_{i∈I} m_{ij} f^{kp}_i = b_j     for all j ∈ J
a_i − (a_i − lb^k_i) x_i ≤ f^{kp}_i ≤ a_i + (ub^k_i − a_i) x_i     for all i ∈ I
Σ_{i∈I} m_{ij} g^{kp}_i = b_j     for all j ∈ J
a_i − (a_i − lb^k_i) x_i ≤ g^{kp}_i ≤ a_i + (ub^k_i − a_i) x_i     for all i ∈ I
f^{kp}_p ≥ a_p + upl^k_p
g^{kp}_p ≤ a_p − lpl^k_p
f^{kp}_p − g^{kp}_p ≥ spl^k_p
for all p ∈ P and all k ∈ K, and also subject to:
x_i ∈ {0, 1} for all i ∈ I.
Fig. 3. First ILP model for multi-attacker CSP
sensitive cells and 100 attackers would have millions of variables. Therefore, another approach is necessary to transform model (2)–(6) without adding so many additional variables.
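To make the size discussion concrete, the sketch below (my own, hedged) builds the model of Figure 3 for the introductory table with a single external attacker, using PuLP purely as an illustrative solver interface (the paper prescribes no particular software). For brevity the marginal cells are kept published, a large finite value stands in for the attacker's infinite upper bounds, the loss of information is taken to be the cell value, and the sliding protection level is omitted since it is 0 in the example.

```python
# Hedged sketch of the Figure 3 ILP for the 3x3 introductory table, one attacker.
import pulp

rows, cols = ["I", "II", "III"], ["A", "B", "C"]
a = {("I", "A"): 20, ("I", "B"): 50, ("I", "C"): 10,
     ("II", "A"): 8, ("II", "B"): 19, ("II", "C"): 22,
     ("III", "A"): 17, ("III", "B"): 32, ("III", "C"): 12}
row_tot = {r: sum(a[r, c] for c in cols) for r in rows}
col_tot = {c: sum(a[r, c] for r in rows) for c in cols}
P = [("II", "C")]                 # sensitive cells
lb = {i: 0 for i in a}            # attacker's external lower bounds
ub = {i: 190 for i in a}          # finite stand-in for the infinite upper bounds
upl, lpl = 7, 5                   # protection levels quoted in Section 2
w = a                             # loss of information = cell value (an assumption)

m = pulp.LpProblem("multi_attacker_csp_fig3", pulp.LpMinimize)
x = {i: pulp.LpVariable(f"x_{i[0]}_{i[1]}", cat="Binary") for i in a}
f = {i: pulp.LpVariable(f"f_{i[0]}_{i[1]}") for i in a}
g = {i: pulp.LpVariable(f"g_{i[0]}_{i[1]}") for i in a}
m += pulp.lpSum(w[i] * x[i] for i in a)          # objective (2)

for y in (f, g):                                 # both congruent tables satisfy (1)
    for r in rows:
        m += pulp.lpSum(y[r, c] for c in cols) == row_tot[r]
    for c in cols:
        m += pulp.lpSum(y[r, c] for r in rows) == col_tot[c]
    for i in a:                                  # capacity constraints driven by x_i
        m += y[i] >= a[i] - (a[i] - lb[i]) * x[i]
        m += y[i] <= a[i] + (ub[i] - a[i]) * x[i]

for p in P:                                      # protection requirements
    m += x[p] == 1                               # primary suppression (also forced by the next constraints)
    m += f[p] >= a[p] + upl
    m += g[p] <= a[p] - lpl

m.solve(pulp.PULP_CBC_CMD(msg=False))
print(sorted(i for i in a if x[i].value() > 0.5))   # the suppression pattern
```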
An alternative approach which does not add any additional variable follows the idea described in Fischetti and Salazar [6] for the classical cell suppression problem (i.e., with one attacker). Based on Farkas' Lemma, it is possible to replace the second-level problems of model (2)–(6) by linear constraints on the x_i variables. Indeed, assuming that the values y_i in a congruent table are continuous numbers, the two linear programs in conditions (3)–(5) can be rewritten in their dual format. More precisely, by Dual Theory in Linear Programming,

max {y_p : (1) holds}

is equivalent to

min Σ_{j∈J} γ_j b_j + Σ_{i∈I} [α_i (a_i + (ub^k_i − a_i) x_i) − β_i (a_i − (a_i − lb^k_i) x_i)]
Inference Control In Statistical Databases From Theory To Practice 1st Edition Josep Domingoferrer Auth

  • 1. Inference Control In Statistical Databases From Theory To Practice 1st Edition Josep Domingoferrer Auth download https://ebookbell.com/product/inference-control-in-statistical- databases-from-theory-to-practice-1st-edition-josep- domingoferrer-auth-1664790 Explore and download more ebooks at ebookbell.com
  • 2. Here are some recommended products that we believe you will be interested in. You can click the link to download. Computational Inference And Control Of Quality In Multimedia Services 1st Edition Vlado Menkovski Auth https://ebookbell.com/product/computational-inference-and-control-of- quality-in-multimedia-services-1st-edition-vlado-menkovski- auth-5236580 Target In Control Social Influence As Distributed Information Processing Andrzej K Nowak Robin R Vallacher Agnieszka Rychwalska Magdalena Roszczyskakurasiska Karolina Ziembowicz Mikolaj Biesaga Marta Kacprzykmurawska https://ebookbell.com/product/target-in-control-social-influence-as- distributed-information-processing-andrzej-k-nowak-robin-r-vallacher- agnieszka-rychwalska-magdalena-roszczyskakurasiska-karolina- ziembowicz-mikolaj-biesaga-marta-kacprzykmurawska-11093316 Dark Psychology This Book Includes The Art Of How To Influence And Win People Using Emotional Manipulation Mind Control Nlp Techniques Persuasion Psychological Warfare Tactics In Relationships David Bennis https://ebookbell.com/product/dark-psychology-this-book-includes-the- art-of-how-to-influence-and-win-people-using-emotional-manipulation- mind-control-nlp-techniques-persuasion-psychological-warfare-tactics- in-relationships-david-bennis-47363694 Secure Data Provenance And Inference Control With Semantic Web 1st Edition Thuraisingham https://ebookbell.com/product/secure-data-provenance-and-inference- control-with-semantic-web-1st-edition-thuraisingham-55304616
  • 3. Secure Data Provenance And Inference Control With Semantic Web Thuraisingham https://ebookbell.com/product/secure-data-provenance-and-inference- control-with-semantic-web-thuraisingham-4732498 Secure Data Provenance And Inference Control With Semantic Web Bhavani Thuraisingham https://ebookbell.com/product/secure-data-provenance-and-inference- control-with-semantic-web-bhavani-thuraisingham-5411218 Algorithms For Analysis Inference And Control Of Boolean Networks Tatsuya Akutsu https://ebookbell.com/product/algorithms-for-analysis-inference-and- control-of-boolean-networks-tatsuya-akutsu-7029944 Copulabased Markov Models For Time Series Parametric Inference And Process Control 1st Ed Lihsien Sun https://ebookbell.com/product/copulabased-markov-models-for-time- series-parametric-inference-and-process-control-1st-ed-lihsien- sun-22477864 Influence Of Flight Control Laws On Structural Sizing Of Commercial Aircraft Rahmetalla Nazzeri https://ebookbell.com/product/influence-of-flight-control-laws-on- structural-sizing-of-commercial-aircraft-rahmetalla-nazzeri-36652924
  • 6. Lecture Notes in Computer Science 2316 Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
  • 8. Josep Domingo-Ferrer (Ed.) Inference Control in Statistical Databases From Theory to Practice
  • 9. Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editor Josep Domingo-Ferrer Universitat Rovira i Virgili Department of Computer Engineering and Mathematics Av. Paı̈sos Catalans 26, 43007 Tarragona, Spain E-mail: jdomingo@etse.urv.es Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Inference control in statistical databases : from theory to practice / Josep Domingo-Ferrer (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2316) ISBN 3-540-43614-6 CR Subject Classification (1998): G.3, H.2.8, K.4.1, I.2.4 ISSN 0302-9743 ISBN 3-540-43614-6 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by Christian Grosche, Hamburg Printed on acid-free paper SPIN 10846628 06/3142 5 4 3 2 1 0
  • 10. Preface Inference control in statistical databases (also known as statistical disclosure control, statistical disclosure limitation, or statistical confidentiality) is about finding tradeoffs to the tension between the increasing societal demand for accurate statistical data and the legal and ethical obligation to protect the privacy of individuals and enterprises which are the source of data for producing statistics. To put it bluntly, statistical agencies cannot expect to collect accurate information from individual or corporate respondents unless these feel the privacy of their responses is guaranteed. This state-of-the-art survey covers some of the most recent work in the field of inference control in statistical databases. This topic is no longer (and probably never was) a purely statistical or operations-research issue, but is gradually entering the arena of knowledge management and artificial intelligence. To the extent that techniques used by intruders to make inferences compromising privacy increasingly draw on data mining and record linkage, inference control tends to become an integral part of computer science. Articles in this book are revised versions of a few papers selected among those presented at the seminar “Statistical Disclosure Control: From Theory to Practice” held in Luxemburg on 13 and 14 December 2001 under the sponsorship of EUROSTAT and the European Union 5th FP project “AMRADS” (IST-2000-26125). The book starts with an overview article which goes through the remaining 17 articles. These cover inference control for aggregate statistical data released in tabular form, inference control for microdata files, software developments, and user case studies. The article authors and myself hope that this collective work will be a reference point to both academics and official statisticians who wish to keep abreast with the latest advances in this very dynamic field. The help of the following experts in discussing and reviewing the selected papers is gratefully acknowledged: – Lawrence H. Cox (U. S. National Center for Health Statistics) – Gerd Ronning (Universität Tübingen) – Philip M. Steel (U. S. Bureau of the Census) – William E. Winkler (U. S. Bureau of the Census) As an organizer of the seminar from which articles in this book have evolved, I wish to emphasize that such a seminar would not have taken place without the sponsorship of EUROSTAT and the AMRADS project as well as the help and encouragement by Deo Ramprakash (AMRADS coordinator), Photis Nanopoulos, Harald Sonnberger, and John King (all from EUROSTAT). Finally, the inputs by Anco Hundepool (Statistics Netherlands and co-ordinator of the EU 5th FP project “CASC”) and Francesc Sebé (Universitat Rovira i Virgili) were crucial to the success of the seminar and the book, respectively. I apologize for possible omissions. February 2002 Josep Domingo-Ferrer
  • 11. Table of Contents
    Advances in Inference Control in Statistical Databases: An Overview (Josep Domingo-Ferrer), p. 1
    Tabular Data Protection
    Cell Suppression: Experience and Theory (Dale A. Robertson, Richard Ethier), p. 8
    Bounds on Entries in 3-Dimensional Contingency Tables Subject to Given Marginal Totals (Lawrence H. Cox), p. 21
    Extending Cell Suppression to Protect Tabular Data against Several Attackers (Juan José Salazar González), p. 34
    Network Flows Heuristics for Complementary Cell Suppression: An Empirical Evaluation and Extensions (Jordi Castro), p. 59
    HiTaS: A Heuristic Approach to Cell Suppression in Hierarchical Tables (Peter-Paul de Wolf), p. 74
    Microdata Protection
    Model Based Disclosure Protection (Silvia Polettini, Luisa Franconi, Julian Stander), p. 83
    Microdata Protection through Noise Addition (Ruth Brand), p. 97
    Sensitive Micro Data Protection Using Latin Hypercube Sampling Technique (Ramesh A. Dandekar, Michael Cohen, Nancy Kirkendall), p. 117
    Integrating File and Record Level Disclosure Risk Assessment (Mark Elliot), p. 126
    Disclosure Risk Assessment in Perturbative Microdata Protection (William E. Yancey, William E. Winkler, Robert H. Creecy), p. 135
    LHS-Based Hybrid Microdata vs Rank Swapping and Microaggregation for Numeric Microdata Protection (Ramesh A. Dandekar, Josep Domingo-Ferrer, Francesc Sebé), p. 153
  • 12. Post-Masking Optimization of the Tradeoff between Information Loss and Disclosure Risk in Masked Microdata Sets (Francesc Sebé, Josep Domingo-Ferrer, Josep Maria Mateo-Sanz, Vicenç Torra), p. 163
    Software and User Case Studies
    The CASC Project (Anco Hundepool), p. 172
    Tools and Strategies to Protect Multiple Tables with the GHQUAR Cell Suppression Engine (Sarah Giessing, Dietz Repsilber), p. 181
    SDC in the 2000 U.S. Decennial Census (Laura Zayatz), p. 193
    Applications of Statistical Disclosure Control at Statistics Netherlands (Eric Schulte Nordholt), p. 203
    Empirical Evidences on Protecting Population Uniqueness at Idescat (Julià Urrutia, Enric Ripoll), p. 213
    Author Index, p. 231
  • 13. Advances in Inference Control in Statistical Databases: An Overview Josep Domingo-Ferrer Universitat Rovira i Virgili Dept. of Computer Engineering and Mathematics Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain jdomingo@etse.urv.es Abstract. Inference control in statistical databases is a discipline with several other names, such as statistical disclosure control, statistical disclosure limitation, or statistical database protection. Regardless of the name used, current work in this very active field is rooted in the work that was started on statistical database protection in the 70s and 80s. Massive production of computerized statistics by government agencies combined with an increasing social importance of individual privacy has led to a renewed interest in this topic. This is an overview of the latest research advances described in this book. Keywords: Inference control in statistical database, Statistical disclosure control, Statistical disclosure limitation, Statistical database protection, Data security, Respondents’ privacy, Official statistics. 1 Introduction The protection of confidential data is a constant issue of concern for data collectors and especially for national statistical agencies. There are legal and ethical obligations to maintain confidentiality of respondents whose answers are used for surveys or whose administrative data are used to produce statistics. But, beyond law and ethics, there are also practical reasons for data collectors to care about confidentiality: unless respondents are convinced that their privacy is being adequately protected, they are unlikely to co-operate and supply their data for statistics to be produced on them. The rest of this book consists of seventeen articles clustered in three parts: 1. The protection of tabular data is covered by the first five articles; 2. The protection of microdata (i.e. individual respondent data) is addressed by the next seven articles; 3. Software for inference control and user case studies are reported in the last five articles. The material in this book focuses on the latest research developments of the mathematical and computational aspects of inference control and should be regarded as an update of [6]. For a systematic approach to the topic, we J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 1–7, 2002. © Springer-Verlag Berlin Heidelberg 2002
  • 14. strongly recommend [13]; for a quicker overview, [12] may also be used. All references given so far in this paragraph concentrate only on the mathematical and computational side of the topic. If a broader scope is required, [7] is a work where legal, organizational, and practical issues are covered in addition to the purely computational ones. This overview goes through the book articles and then gives an account of related literature and other sources of information. 2 Tabular Data Protection The first article “Cell suppression: experience and theory”, by Robertson and Ethier, emphasizes that some basic points of cell suppression for table protection are not sufficiently known. While the underlying theory is well developed, sensitivity rules in use are in some cases flawed and may lead to the release of sensitive information. Another issue raised by the paper is the lack of a sound information loss measure to assess the damage inflicted to a table in terms of data utility by the use of a particular suppression pattern. The adoption of information-theoretic measures is hinted as a possible improvement. The article “Bounds on entries in 3-dimensional contingency tables subject to given marginal totals” by Cox deals with algorithms for determining integer bounds on suppressed entries of multi-dimensional contingency tables subject to fixed marginal totals. Some heuristic algorithms are compared, and it is demonstrated that they are not exact. Consequences for statistical database query systems are discussed. “Extending cell suppression to protect tabular data against several attackers”, by Salazar, points out that attackers to confidentiality need not be just external intruders; internal attackers, i.e. special respondents contributing to different cell values of the table, must also be taken into account. This article describes three mathematical models for the problem of finding a cell suppression pattern minimizing information loss while ensuring protection for different sensitive cells and different intruders. When a set of sensitive cells are suppressed from a table (primary suppressions), a set of non-sensitive cells must be suppressed as well (complementary suppressions) to prevent primary suppressions from being computable from marginal constraints. Network flows heuristics have been proposed in the past for finding the minimal complementary cell suppression pattern in tabular data protection. However, the heuristics known so far are only appropriate for two-dimensional tables. In “Network flows heuristics for complementary cell suppression: an empirical evaluation and extensions”, by Castro, it is shown that network flows heuristics (namely multicommodity network flows and network flows with side constraints) can also be used to model three-dimensional, hierarchical, and linked tables. Also related to hierarchical tables is the last article on tabular data, authored by De Wolf and entitled “HiTaS: a heuristic approach to cell suppression in hierarchical tables”. A heuristic top-down approach is presented to find suppression
  • 15. patterns in hierarchical tables. When a table of high level is protected using cell suppression, its interior is regarded as the marginals of possibly several lower level tables, each of which is protected while keeping their marginals fixed. 3 Microdata Protection The first three articles in this part describe methods for microdata protection: – Article “Model based disclosure protection”, by Polettini, Franconi, and Stander, argues that any microdata protection method is based on a formal reference model. Depending on the number of restrictions imposed, methods are classified as nonparametric, semiparametric or fully parametric. An imputation procedure for business microdata based on a regression model is applied to the Italian sample from the Community Innovation Survey. The utility of the released data and the protection achieved are also evaluated. – Adding noise is a widely used principle for microdata protection. In fact, results in the article by Yancey et al. (discussed below) show that noise addition methods can perform very well. Article “Microdata protection through noise addition”, by Brand, contains an overview of noise addition algorithms. These range from simple white noise addition to complex methods which try to improve the tradeoff between data utility and data protection. Theoretical properties of the presented algorithms are discussed in Brand’s article and an illustrative numerical example is given. – Synthetic microdata generation is an attractive alternative to protection methods based on perturbing original microdata. The conceptual advantage is that, even if a record in the released data set can be linked to a record in the original data set, such a linkage is not actually a re-identification because the released record is a synthetic one and was not derived from any specific respondent. In “Sensitive microdata protection using Latin hypercube sampling technique”, Dandekar, Cohen, and Kirkendall propose a method for synthetic microdata generation based on Latin hypercube sampling. The last four articles in this part concentrate on assessing disclosure risk and information loss achieved by microdata protection methods: – Article “Integrating file and record level disclosure risk assessment”, by Elliot, deals with disclosure risk in non-perturbative microdata protection. Two methods for assessing disclosure risk at the record-level are described, one based on the special uniques method and the other on data intrusion simulation. Proposals to integrate both methods with file level risk measures are also presented. – Article “Disclosure risk assessment in perturbative microdata protection”, by Yancey, Winkler, and Creecy, presents empirical re-identification results that compare methods for microdata protection including rank swapping and additive noise. Enhanced re-identification methods based on probabilistic record linkage are used to empirically assess disclosure risk. Then the
  • 16. performance of methods is measured in terms of information loss and disclosure risk. The reported results extend earlier work by Domingo-Ferrer et al. presented in [7]. – In “LHS-based hybrid microdata vs rank swapping and microaggregation for numeric microdata protection”, Dandekar, Domingo-Ferrer, and Sebé report on another comparison of methods for microdata protection. Specifically, hybrid microdata generation as a mixture of original data and synthetic microdata is compared with rank swapping and microaggregation, which had been identified as the best performers in earlier work. Like in the previous article, the comparison considers information loss and disclosure risk, and the latter is empirically assessed using record linkage. – Based on the metrics previously proposed to compare microdata protection methods (also called masking methods) in terms of information loss and disclosure risk, article “Post-masking optimization of the tradeoff between information loss and disclosure risk in masked microdata sets”, by Sebé, Domingo-Ferrer, Mateo-Sanz, and Torra, demonstrates how to improve the performance of any microdata masking method. Post-masking optimization of the metrics can be used to have the released data set preserve as much as possible the moments of first and second order (and thus multivariate statistics) of the original data without increasing disclosure risk. The technique presented can also be used for synthetic microdata generation and can be extended to preserve all moments up to m-th order, for any m. 4 Software and User Case Studies The first two articles in this part are related to software developments for the protection of statistical data: – “The CASC project”, by Hundepool, is an overview of the European project CASC (Computational Aspects of Statistical Confidentiality, [2]), funded by the EU 5th Framework Program. CASC can be regarded as a follow-up of the SDC project carried out under the EU 4th Framework Program. The central aim of the CASC project is to produce a new version of the Argus software for statistical disclosure control. In order to reach this practical goal, the project also includes methodological research both in tabular data and microdata protection; the research results obtained will constitute the core of the Argus improvement. Software testing by users is an important part of CASC as well. – The first sections of the article “Tools and strategies to protect multiple tables with the GHQUAR cell suppression engine”, by Gießing and Repsilber, are an introduction to the GHQUAR software for tabular data protection. The last sections of this article describe GHMITER, which is a software procedure allowing use of GHQUAR to protect sets of multiple linked tables. This software constitutes a very fast solution to protect complex sets of big tables and will be integrated in the new version of Argus developed under the CASC project.
  • 17. This last part of the book concludes with three articles presenting user case studies in statistical inference control: – “SDC in the 2000 U. S. Decennial Census”, by Zayatz, describes statistical disclosure control techniques to be used for all products resulting from the 2000 U. S. Decennial Census. The discussion covers techniques for tabular data, public microdata files, and on-line query systems for tables. For tabular data, algorithms used are improvements of those used for the 1990 Decennial Census. Algorithms for public-use microdata are new in many cases and will result in less detail than was published in previous censuses. On-line table query is a new service, so the disclosure control algorithms used there are completely new ones. – “Applications of statistical disclosure control at Statistics Netherlands”, by Schulte Nordholt, reports on how Statistics Netherlands meets the requirements of statistical data protection and user service. Most users are satisfied with data protected using the Argus software: τ-Argus is used to produce safe tabular data, while µ-Argus yields publishable safe microdata. However, some researchers need more information than is released in the safe data sets output by Argus and are willing to sign the proper non-disclosure agreements. For such researchers, on-site access to unprotected data is offered by Statistics Netherlands in two secure centers. – The last article “Empirical evidences on protecting population uniqueness at Idescat”, by Urrutia and Ripoll, presents the process of disclosure control applied by Statistics Catalonia to microdata samples from census and surveys with some population uniques. Such process has been in use since 1995, and has been implemented with µ-Argus since it first became available. 5 Related Literature and Information Sources In addition to the above referenced books [6,7,12,13], a number of other sources of information on current research in statistical inference control are available. In fact, since statistical database protection is a rapidly evolving field, the use of books should be directed to acquiring general insight on concepts and ideas, but conference proceedings, research surveys, and journal articles remain essential to gain up-to-date detailed knowledge on particular techniques and open issues. This section contains a non-exhaustive list of research references, sorted from a historical point of view: 1970s and 1980s. The first broadly known papers and books on statistical database protection appear (e.g. [1,3,4,5,11]). 1990s. Eurostat produces a compendium for practitioners [10] and sponsors a number of conferences on the topic, namely the three International Seminars on Statistical Confidentiality (Dublin 1992 [9], Luxemburg 1994 [14], and Bled 1996 [8]) and the Statistical Data Protection’98 conference (Lisbon 1998, [6]). While the first three events covered mathematical, legal, and organizational aspects, the Lisbon conference focused on the statistical, mathematical, and computational aspects of statistical disclosure control and data
  • 18. protection. The goals of those conferences were to promote research and interaction between scientists and practitioners in order to consolidate statistical disclosure control as a high-quality research discipline encompassing statistics, operations research, and computer science. In the second half of the 90s, the research project SDC was carried out under the EU 4th Framework Program; its most visible result was the first version of the Argus software. In the late 90s, other European organizations start joining the European Commission in fostering research in this field. A first example is Statistisches Bundesamt which organized in 1997 a conference for the German-speaking community. A second example is the United Nations Economic Commission for Europe, which has jointly organized with Eurostat two Work Sessions on Statistical Data Confidentiality (Thessaloniki 1999 [15] and Skopje 2001). Outside Europe, the U.S. Bureau of the Census and Statistics Canada have devoted considerable attention to statistical disclosure control in their conferences and symposia. In fact, well-known general conferences such as COMPSTAT, U.S. Bureau of the Census Annual Research Conferences, Eurostat’s ETK-NTTS conference series, IEEE Symposium on Security and Privacy, etc. have hosted sessions and papers on statistical disclosure control. 2000s. In addition to the biennial Work Sessions on Statistical Data Confidentiality organized by UNECE and Eurostat, other research activities are being promoted by the U.S. Census Bureau, which sponsored the book [7], by the European projects CASC [2], and AMRADS (a co-sponsor of the seminar which originated this book). As far as journals are concerned, there is not yet a monographic journal on statistical database protection. However, at least the following journals occasionally contain papers on this topic: Research in Official Statistics, Statistica Neerlandica, Journal of Official Statistics, Journal of the American Statistical Association, ACM Transactions on Database Systems, IEEE Transactions on Software Engineering, IEEE Transactions on Knowledge and Data Engineering, Computers & Mathematics with Applications, Statistical Journal of the UNECE, Qüestiió and Netherlands Official Statistics.
  • 19. References 1. N. R. Adam and J. C. Wortmann, “Security-control methods for statistical databases: A comparative study”, ACM Computing Surveys, vol. 21, no. 4, pp. 515-556, 1989. 2. The CASC Project, http://neon.vb.cbs.nl/rsm/casc/menu.htm 3. T. Dalenius, “The invasion of privacy problem and statistics production. An overview”, Statistik Tidskrift, vol. 12, pp. 213-225, 1974. 4. D. E. Denning and J. Schlörer, “A fast procedure for finding a tracker in a statistical database”, ACM Transactions on Database Systems, vol. 5, no. 1, pp. 88-102, 1980. 5. D. E. Denning, Cryptography and Data Security. Reading MA: Addison-Wesley, 1982. 6. J. Domingo-Ferrer (ed.), Statistical Data Protection. Luxemburg: Office for Official Publications of the European Communities, 1999. 7. P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (eds.), Confidentiality, Disclosure and Data Access. Amsterdam: North-Holland, 2001. 8. S. Dujić and I. Tršinar (eds.), Proceedings of the 3rd International Seminar on Statistical Confidentiality (Bled, 1996). Ljubljana: Statistics Slovenia-Eurostat, 1996. 9. D. Lievesley (ed.), Proceedings of the International Seminar on Statistical Confidentiality (Dublin, 1992). Luxemburg: Eurostat, 1993. 10. D. Schackis, Manual on Disclosure Control Methods. Luxemburg: Eurostat, 1993. 11. J. Schlörer, “Identification and retrieval of personal records from a statistical data bank”, Methods Inform. Med., vol. 14, no. 1, pp. 7-13, 1975. 12. L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996. 13. L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001. 14. Proceedings of the 2nd International Seminar on Statistical Confidentiality (Luxemburg, 1994). Luxemburg: Eurostat, 1995. 15. Statistical Data Confidentiality: Proc. of the Joint Eurostat/UNECE Work Session on Statistical Data Confidentiality (Thessaloniki, 1999). Luxemburg: Eurostat, 1999.
  • 20. _____________________ * The opinions expressed in this paper are those of the authors, and not necessarily those of Statistics Canada. J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 8-20, 2002. © Springer-Verlag Berlin Heidelberg 2002 Cell Suppression: Experience and Theory Dale A. Robertson and Richard Ethier * Statistics Canada robedal@statcan.ca ethiric@statcan.ca Abstract. Cell suppression for disclosure avoidance has a well-developed theory, unfortunately not sufficiently well known. This leads to confusion and faulty practices. Poor (sometimes seriously flawed) sensitivity rules can be used while inadequate protection mechanisms may release sensitive data. The negative effects on the published information are often exaggerated. An analysis of sensitivity rules will be done and some recommendations made. Some implications of the basic protection mechanism will be explained. A discussion of the information lost from a table with suppressions will be given, with consequences for the evaluation of patterns and of suppression heuristics. For most practitioners, the application of rules to detect sensitive economic data is well understood (although the rules may not be). However, the protection of that data may be an art rather than an application of sound concepts. More misconceptions and pitfalls arise. Keywords: Disclosure avoidance, cell sensitivity. Cell suppression is a technique for disclosure control. It is used for additive tables, typically business data, where it is the technique of choice. There is a good theory of the technique, originally developed by Gordon Sande [1,2] with important contributions by Larry Cox [3]. In practice, the use of cell suppression is troubled by misconceptions at the most fundamental levels. The basic concept of sensitivity is confused, the mechanism of protection is often misunderstood, and an erroneous conception of information loss seems almost universal. These confusions prevent the best results from being obtained. The sheer size of the problems makes automation indispensable. Proper suppression is a subtle task and the practitioner needs a sound framework of knowledge. Problems in using the available software are often related to a lack of understanding of the foundations of the technique. Often the task is delegated to lower level staff, not properly trained, who have difficulty describing problems with the rigour needed for computer processing. This ignorance at the foundation level leads to difficulty understanding the software. As the desire for more comprehensive, detailed, and sophisticated outputs increases, the matter of table and problem specification needs further attention. Our experience has shown that the biggest challenge has been to teach the basic ideas. The theory is not difficult to grasp, using only elementary mathematics, but clarity of thought is required. The attempt of the non-mathematical to describe things
  • 21. in simple terms has led to confusion, and the failure to appreciate the power and value of the technique. The idea of cell sensitivity is surprisingly poorly understood. Obsolete sensitivity rules, with consistency problems, not well adapted to automatic processing, survive. People erroneously think of sensitivity as a binary variable: a cell is publishable or confidential. The theory shows that sensitivity can be precisely measured, and the value is important in the protection process. The value helps capture some common sense notions that early practitioners had intuitively understood. The goal of disclosure avoidance is to protect the respondents, and ensure that a response cannot be estimated accurately. Thus, a sensitive cell is one for which knowledge of the value would permit an unduly accurate estimate of the contribution of an individual respondent. Dominance is the situation where one or a small number of responses contribute nearly all the total of the cell. Identifying the two leads to the following pronouncement: A cell should be considered sensitive if one respondent contributes 60% or more of the total value. This is an example of an N-K rule, with N being the number of respondents to count, and K the threshold percentage. (Here N = 1 and K = 60. These values of N and K are realistic and are used in practice.) Clearly an N-K rule measures dominance. Using it for sensitivity creates the following situation. Consider two cells each with 3 respondents, and of total 100. Cell 1 has a response sequence of {59,40,1} and thus may be published according to the rule. Cell 2 has a response sequence of {61,20,19} and would be declared sensitive by the rule. Suppose the cell value of 100 is known to the second largest respondent (X2) and he uses this information to estimate the largest (X1). He can remove his contribution to get an upper bound, obtaining (with non-negative data): for (59,40,1), 100-40 = 60, therefore X1 ≤ 60, while for (61,20,19), 100-20 = 80, therefore X1 ≤ 80. Since the actual values of X1 are 59 and 61 respectively, something has gone badly wrong. Cell 1 is much more dangerous to publish than cell 2. The rule gets things the wrong way around! Is this an exceptional case that took ingenuity to find? No, this problem is intrinsic to the rule, examples of misclassification are numerous, and the rule cannot be fixed. To see this we need to make a better analysis. One can understand much about sensitivity by looking at 3 respondent cells of value 100. (Keeping the total at 100 just means that values are the same as the percentages.) Call the three response values a, b, c.
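The misclassification is easy to reproduce. The short Python sketch below (mine, not code from the paper; the helper names are invented) applies the 1-60% rule to the two cells above and computes the upper bound the second-largest respondent can derive from the published total.

```python
# A sketch contrasting a 1-60% dominance rule with the bound the
# second-largest respondent can compute from the published cell total.

def one_k_sensitive(responses, k=0.60):
    """1-K dominance rule: flag the cell if one response is >= K of the total."""
    return max(responses) >= k * sum(responses)

def bound_from_second(responses):
    """Upper bound on X1 obtained by the second-largest respondent, who
    subtracts his own contribution from the published total (data >= 0)."""
    x = sorted(responses, reverse=True)
    return sum(responses) - x[1]

for cell in ([59, 40, 1], [61, 20, 19]):
    x1 = max(cell)
    ub = bound_from_second(cell)
    print(cell, "| 1-60% rule flags it:", one_k_sensitive(cell),
          "| X1 =", x1, "is bounded above by", ub,
          "| relative gap:", round((ub - x1) / x1, 3))

# The cell the rule would publish, {59,40,1}, pins X1 down to within 1.7%,
# while the cell it suppresses, {61,20,19}, leaves a 31% margin.
```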
  • 22. One can represent these possible cells pictorially. Recall (Fig. 1) that in an equilateral triangle, for any interior point, the sum of the perpendicular distance to the 3 sides is a constant (which is in fact h, the height of the triangle). One can nicely incorporate the condition a + b + c = 100 by plotting the cell as a point inside a triangle, of height 100, measuring a, b, and c from a corresponding edge (Fig. 2). This gives a symmetric treatment and avoids geometric confusions that occur in other plots. In Figure 2 we have drawn the medians, intersecting at the centroid. The triangle is divided into areas which correspond to the possible size orders of a, b, and c. The upper kite shaped region is the area where a is the biggest of the three responses, with its right sub triangle the area where a ≥ b ≥ c. In the triangle diagram, where are the sensitive cells? Cells very near an edge of the triangle should be sensitive. Near an edge one of the responses is negligible, effectively the cell has 2 respondents, hence is obviously sensitive (each respondent knows the other). As a bare minimum requirement then any rule that purports to define sensitivity must classify points near the edge as sensitive. What does the 1-60% rule do? The region where a is large is the small sub triangle at the top, away from the a edge, with its base on the line at 60% of the height of the triangle. Likewise for b and c, leading to Fig. 3. Now one problem is obvious. Significant portions of the edges are allowed. This rule fails the minimum requirement. Slightly subtler is the over-suppression. That will become clearer later. Points inside the sensitive area near the interior edge and the median are unnecessarily classified as sensitive. The rule is a disaster. Trying to strengthen the rule leads to more over-suppression, relaxing it leads to more release of sensitive data. Any organisation which defines sensitivity using only a 1-K rule (and they exist) has a serious problem. The identification of sensitivity
  • 23. and dominance has led to a gross error, allowing very precise knowledge of some respondents to be deduced. Now let's look at a good rule. Here is how a good rule divides the triangle into sensitive and non-sensitive areas (Fig. 4). The sensitivity border excludes the edges without maintaining a fixed distance from them. There is lots of symmetry and the behaviour in one of the 6 sub triangles is reproduced in all the others. This shape is not hard to understand. This rule is expressed as a formula in terms of the responses in size order, X1, X2, and X3. As one crosses a median, the size order, and hence the expression in terms of a, b, and c changes, hence the slope discontinuities. The other thing to understand is the slope of the one non-trivial part of a sub triangle boundary, the fact that the line moves farther from the edge as one moves away from a median. The reasoning needed is a simple generalisation of the previous argument about the two cells. Start at the boundary point on the a median nearest the a axis. There a is the smallest response, while b and c are of the same size and much larger. For easy numbers suppose b = c = 45 and a = 10. The protection afforded to b or c against estimation by the other comes from the presence of a. On the separation line between sensitive and non-sensitive areas, the value of the smallest response (a) is felt to be just big enough to give protection to the larger responses (b and c). The values above indicate that the large responses are allowed to be up to 4.5 times larger than the value of the response providing protection, but no larger. A higher ratio is considered sensitive, a lower one publishable. As one moves away from the median, one of b or c becomes larger than 45, the other smaller. Consequently a must become larger than 10 to maintain the critical ratio of 4.5:1. Hence a must increase (move upwards away from the bottom) i.e. the line has an upward slope. This simple but subtle rule is one known as the C times rule. C is the ratio, an adjustable parameter (4.5 above) of the rule. (There are slight variations on this formulation, termed the p/q rule or the p% rule. They are usually specified by an inverse parameter p or p/q = 1/C. The differences are only in interpretation of the parameter, not in the form of the rule. We find the inverse transformation harder to grasp intuitively, and prefer this way). The formula is easily written down then. The protection available when X2 estimates X1 is given by X3. The value of X3 has to be big enough (relative to X1) to give the needed uncertainty. One compares X1 with X3 using the scaling factor C to adjust their relative values. Explicitly one evaluates S = X1/C - X3 (1)
  • 24. Written in this form the rule appears not to depend on X2, but it can trivially be written as S = X1*(C+1)/C + X2 - T (2) For a fixed total T then, the rule does depend on both X1 and X2 and they enter with different coefficients, an important point. Sensitivity depends on the cell structure (the relative sizes of the responses). This difference in coefficient values captures this structure. (Note in passing that the rule grasps the concept of sensitivity well enough that 1 and 2 respondent cells are automatically sensitive, one does not have to add those requirements as side conditions.) The rule is trivially generalised to more than 3 respondents in the cell, X3 is simply changed to X3 + X4 + X5 + … (3) i.e. the sum of the smallest responses. One can treat all cells as effectively 3 respondent cells. (The rule can also be generalised to include coalitions where respondents in the cell agree to share their knowledge in order to estimate another respondent's contribution. The most dangerous case is X2 and X3 sharing their contributions to estimate X1. The generalisation for a coalition of 2 respondents is S = X1/C - (X4+X5+…) (4) and similar changes apply for larger coalitions.) This is a deceptively simple rule of wide application, with a parameter whose meaning is clear. The C value can be taken as a precise measure of the strength of the rule. Note that S is a function defined over the area of the triangle. Sensitivity is not just a binary classification into sensitive and non-sensitive. (S looks like an inverted pyramid with the peak at the centroid, where a = b = c). The line separating the two areas is the 0 valued contour of S. The value of S is meaningful. It indicates the degree of uncertainty required in estimations of the cell value, and the minimum size of an additional response big enough to render the cell non-sensitive. For example, take a 2 respondent cell of some value. Its sensitivity is S = X1/C. To get a non-sensitive cell S must be reduced to 0. This can be done by adding a response X3 of value X1/C. The cell sensitivity tells us the minimum amount that must be added to the cell to make it non-sensitive. The only other proposed rule that occurs with any frequency is a 2-K rule; a cell is sensitive if 2 respondents contribute more than K % of the cell value. (Here K in the range 80 - 95 is typical.) Only the sum of the top two responses (X1+X2) enters. Their relative size (for a fixed total) is not used in any way. However, as just seen, the relative sizes are important. We can conclude that this cannot be a very good rule without further analysis. A 2-K rule cannot distinguish between a cell with responses (50,45,5) and one with responses (90,5,5). Most would agree that the second one is more sensitive than the first. The picture of the rule is easy to draw. If the sum of two responses is to be less than a certain value, then the third must be larger than a corresponding value. That gives a line parallel to an edge i.e. Figure 5.
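A minimal sketch of the sensitivity S of formulas (1)-(4) (the function names and the default C = 4.5 are mine), shown next to the only quantity a 2-K rule looks at, the share of the top two responses:

```python
# C-times sensitivity: S = X1/C - (sum of the responses below the coalition).
# S >= 0 means sensitive; S is also the minimum value that must be added to
# the cell to make it publishable. coalition=1 is the single-intruder case,
# coalition=2 lets X2 and X3 pool their knowledge, as in formula (4).

def c_times_S(responses, C=4.5, coalition=1):
    x = sorted(responses, reverse=True)
    return x[0] / C - sum(x[1 + coalition:])

def top2_share(responses):
    x = sorted(responses, reverse=True)
    return 100.0 * (x[0] + x[1]) / sum(x)

for cell in ([45, 45, 10], [59, 40, 1], [61, 20, 19], [50, 45, 5], [90, 5, 5]):
    print(cell, "| S =", round(c_times_S(cell), 2),
          "| top-2 share =", round(top2_share(cell), 1), "%")

# [45,45,10] sits exactly on the S = 0 boundary; [59,40,1] comes out sensitive
# and [61,20,19] publishable, the reverse of the flawed 1-60% rule's verdict.
# [50,45,5] and [90,5,5] both have a top-2 share of 95%, so a 2-K rule cannot
# separate them, while S = 6.11 vs 15.0 ranks the second as far more sensitive.
```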
  • 25. As we have just observed, in a good rule the sensitivity line should not be parallel to the edge. This means that the 2-K rule is not consistent in its level of protection. In certain areas less protection is given than in others. This would seem undesirable. The rule has no clear objective. It might be hoped that the non-uniformity is negligible, and that it doesn't matter all that much. Alas no. We can quantify the non-uniformity. It turns out to be unpleasantly large. The effect of the rule is variable, and it is difficult to choose a value for K and to explain its meaning. To measure the non-uniformity one can find the two C times rules that surround the 2-K rule, i.e. the strongest C times rule that is everywhere weaker than the 2-K rule, and the weakest C times rule that is everywhere stronger than the 2-K rule. These C values are easily found to be C (inner) = k/(200-k) (stronger) (5) C (outer) = (100-k)/k (weaker) (6) For the typical range of K values, this gives the following graph (Figure 6). One can see that the difference between the C values is rather large. One can get a better picture by using a logarithmic scale (Figure 7). On this graph, the difference between the lines is approximately constant, which means the two values differ by a constant factor. The scale shows this constant to be near 2 (more precisely it is in the range 1.7 to 1.9) i.e. the non-uniformity is serious. The problem of deciding on an appropriate value of K is ill defined.
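The non-uniformity can also be seen directly, without the bracketing construction behind Figures 6 and 7. The sketch below (my own illustration, with an invented K value) walks along the 2-K sensitivity boundary of a 3-respondent cell and asks what dominance ratio X1/X3 the rule tolerates at each point; a C-times rule would hold that ratio fixed at C.

```python
# Along the 2-K boundary X1 + X2 = K (cell total 100), X3 is pinned at 100 - K,
# so the tolerated ratio X1/X3 drifts as X1 varies: the rule is lenient in
# some parts of the triangle and strict in others.
K, TOTAL = 85, 100

def tolerated_ratio(x1, k=K, total=TOTAL):
    x2, x3 = k - x1, total - k
    assert x1 >= x2 >= x3 >= 0, "responses must be in size order"
    return x1 / x3

for x1 in (45, 50, 55, 60, 65, 70):
    print(f"boundary cell ({x1}, {K - x1}, {TOTAL - K}): "
          f"tolerated X1/X3 = {tolerated_ratio(x1):.2f}")

# The implied C drifts from 3.0 to about 4.7 along a single boundary,
# so no single C value reproduces the 2-K rule's behaviour.
```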
  • 26. The summary is then: Don't use N-K rules, they are at best inferior with consistency problems, at worst lead to disasters. Their strength is unclear, and the K value hard to interpret. They arise from a fundamental confusion between sensitivity and dominance. Now it's time to talk a little about the protection mechanism. First, observe that in practice, the respondents are businesses. Rough estimates of a business size are easy to make. In addition, the quantities tabulated are often intrinsically non-negative. These two things will be assumed; suppressed cells will only be allowed to take values within 50% of the true ones from now on. Here is a trivial table with 4 suppressions X1, X2, X3, X4 (Figure 8). One or more of these cells may be presumed sensitive, and the suppressions protect the sensitive values from trivial disclosure. Suppose X1 is sensitive. Note that X1+X2 and X1+X3 are effectively published. The suppressed data are not lost, they are simply aggregated. Aggregation is the protection mechanism at work here, just as in statistics in general. Data is neither lost nor distorted. Suppression involves the creation of miscellaneous aggregations of the sensitive cells with other cells. Obviously then if the pattern is to be satisfactory, then these aggregations must be non-sensitive. Here the unions X1+X2 and X1+X3 must be non-sensitive if this pattern is to be satisfactory. From our previous discussion it follows that both X2 and X3 must be at least as large as the sensitivity of X1. As complements, the best they can do is to add value to the protective responses. There should be no responses from the large contributors in the complementary cells. Proper behaviour of S when cells are combined will be of great importance. Certainly one would like to ensure as a bare minimum: Two non-sensitive cells should never combine to form a sensitive cell. The notion that aggregation provides protection is captured by the property of sub-additivity [1,2]. This is an inequality relating the sensitivity of a combination of two cells to the sensitivity of the components. The direction of the inequality is such that aggregation is a good thing. One should also have smoothness of behaviour: the effect on sensitivity should be limited by the size of the cell being added in. For a good rule then one has the inequalities S(X) - T(Y) ≤ S(X+Y) ≤ S(X) + S(Y) (7) (Using a convenient normalisation for S)
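These inequalities are easy to check numerically for the C-times rule. The sketch below is mine; the cells are invented and the two cells are assumed to have disjoint respondents.

```python
# A numerical check of the inequalities in (7):
#   S(X) - T(Y) <= S(X+Y) <= S(X) + S(Y)

def c_times_S(responses, C=4.5):
    x = sorted(responses, reverse=True)
    return x[0] / C - sum(x[2:])

def check_subadditivity(X, Y, C=4.5):
    union = X + Y                      # pooling two disjoint respondent lists
    lower = c_times_S(X, C) - sum(Y)   # S(X) - T(Y), with T(Y) the total of Y
    upper = c_times_S(X, C) + c_times_S(Y, C)
    middle = c_times_S(union, C)
    eps = 1e-9                         # tolerance for floating-point noise
    return (lower - eps <= middle <= upper + eps,
            tuple(round(v, 2) for v in (lower, middle, upper)))

print(check_subadditivity([59, 40, 1], [10, 5, 5]))   # complement just big enough
print(check_subadditivity([59, 40, 1], [100, 2]))     # large complement, safe union

# A complement whose total T(Y) is smaller than S(X) can never drive S(X+Y)
# below zero, which is why complementary cells must be at least as large as
# the sensitivity they are supposed to protect.
```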
  • 27. Given that S ≥ 0 indicates sensitivity, and that the aim is to ensure that S(X+Y) is not sensitive given that S(X) is, the rightmost inequality indicates that aggregation tends to be helpful (and is definitely helpful if the complement Y is not sensitive, S(Y) < 0). The left inequality limits the decrease in the S value. A successful complement must be big enough, T(Y) ≥ S(X), to allow S(X+Y) to be negative. These inequalities are natural conditions that most simple rules obey, but people have, at some effort, found rules that violate them. These inequalities are only that. One cannot exactly know the sensitivity of a union by looking at the summary properties of the cells. One needs the actual responses. The typical complicating factor here is a respondent who has data in more than one cell of a union. (This may be due to poor understanding of the corporate structure or data classified at an unsuitable level.) Generation of a good pattern is thus a subtle process, involving the magnitudes and sensitivities of the cells, and the pattern of individual responses in the cells. One needs information about the individual responses when creating a pattern. The sensitive union problem certainly is one that is impractical to perform without automation. Sophisticated software is indispensable. Tables that appear identical may need different suppression patterns, because of the different underlying responses. We have found, using a small pedagogical puzzle, that few people understand this point. In the software in use at Statistics Canada (CONFID) the sensitive union problem is dealt with by enumerating all sensitive combinations which are contained in non-sensitive totals [4]. (They are in danger of being trivially disclosed.) All such combinations are protected by the system. This may sound like combinatorial explosion and a lot of work. It turns out that there are not that many to evaluate, the extra work is negligible, and the effect on the pattern small. The algorithm makes heavy use of the fact that non-sensitive cells cannot combine to form a sensitive union. Now let us talk a bit about the suppression process, and information. It is generally said that one should minimise the information lost, but without much discussion of what that means. It is often suggested that the information lost in a table with suppressions can be measured by i) the number of suppressed cells, ii) the total suppressed value. The fact that one proposes to measure something in two different and incompatible ways surely shows that we are in trouble. Here are two tables with suppressions (Figure 9, Figure 10).
  • 28. These two tables have the same total number of suppressed cells, 4, and the same total suppressed value, 103. By either of the above incompatible definitions then, they should have the same amount of missing information. However, they do not. Substantially more information has been removed from one of these two tables than from the other (three times as much in fact). This misunderstanding about information is related to the misunderstanding that the value of a suppressed cell remains a total mystery, with no guesses about the value possible, and that somehow the data is lost. In fact, it is just aggregated. Furthermore, any table with suppressions is equivalent to a table of ranges for the hidden cells. This latter fact is not well known, or if it is, the implications are not understood. These ranges can be found by a straightforward and inexpensive LP calculation, maximising and minimising the value of each hidden cell subject to the constraint equations implied by the additivity of the table. The above tables are equivalent to Figure 11 and Figure 12 (provided the data are known to be non-negative). The different ranges provide a clue about the information loss. The second table has wider ranges. Large complementary cells used to protect cells of small sensitivity often turn out to have very narrow ranges and contain valuable information. They should not be thought of as lost. Clearly one needs a better concept of information. Looking around, in the introductory chapters of an introductory book we found (in the context of signal transmission in the presence of errors) a thought that we paraphrase as: The additional information required to correct or repair a noisy signal (thus recovering the original one) is equal to the amount of information which has been lost due to the noise. Thinking of a table with missing entries as a garbled message, the information missing from a table is the minimum amount of information needed to regain the full table. This may seem simple, but it is a subtle concept. One can use all properties that are implied by the table structure. The previous measures do not use the structural properties. The width of the ranges will have some bearing. One also can use the fact that the hidden values are not independent, but are linked by simple equations. Only a subset of them need be recovered by using more information, the rest can be trivially calculated. Calculating the cost on a cell by cell basis implicitly suggests that all hidden cells are independent, and that the information loss can simply be added up. For the first table (assuming all the quantities are integral for simplicity) there are only 3 possible tables compatible with the suppression pattern. Consequently,
  • 29. (having agreed upon a standard order for enumerating the possible tables) one only needs to be told which table in the sequence is the true one, table 1, table 2 or table 3. In this case the amount of information needed is that needed to select one out of three (equally probable) possibilities. This is a precise amount of information, which even has a name, one trit. In more common units 1 trit = 1.58 bits (since the binary log of 3 is 1.58...). For the second table one has 27 possible solutions. Selecting one from 27 could be done by 3 divisions into 3 equal parts, selecting one part after each division. So one requires 3 trits of information, hence the statement that 3 times as much information has been lost in the second table. Counting the number of tables correctly uses the cell ranges, and the relationships between the hidden values, both of which are ignored by the value or number criteria. This viewpoint can resolve some other puzzles or paradoxes. Here is a hypothetical table in which the two cells of value 120 are sensitive (Figure 13). Here is the minimum cell count suppression pattern, with two complements totalling 200, and the minimum suppressed value pattern, (Figure 14, Figure 15), with 6 complements totalling 60.
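Both devices used in this argument, the LP-computed ranges of the hidden cells and the count of integer tables behind the trit arithmetic, can be sketched in a few lines of Python. The 2x3 table below is invented (the tables of Figures 8-16 are not reproduced in this text), the data are assumed non-negative, and scipy is assumed to be available.

```python
# Invented example: row totals, column totals, two published cells and a
# 2x2 rectangle of suppressed cells.
import math
from itertools import product
import numpy as np
from scipy.optimize import linprog

row_totals = [8, 12]
col_totals = [6, 6, 8]
published = {(0, 2): 3, (1, 2): 5}
hidden = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Equality constraints on the hidden cells implied by additivity of the table.
A_eq, b_eq = [], []
for r, rt in enumerate(row_totals):
    coeffs = [1.0 if c[0] == r else 0.0 for c in hidden]
    if any(coeffs):
        A_eq.append(coeffs)
        b_eq.append(rt - sum(v for (i, j), v in published.items() if i == r))
for cc, ct in enumerate(col_totals):
    coeffs = [1.0 if c[1] == cc else 0.0 for c in hidden]
    if any(coeffs):
        A_eq.append(coeffs)
        b_eq.append(ct - sum(v for (i, j), v in published.items() if j == cc))

# 1) Range of each hidden cell: two small LPs per cell.
for pos, cell in enumerate(hidden):
    cvec = np.zeros(len(hidden)); cvec[pos] = 1.0
    lo = linprog(cvec, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    hi = linprog(-cvec, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    print("cell", cell, "range", round(lo.fun, 2), "to", round(-hi.fun, 2))

# 2) Information lost: count the integer tables consistent with the pattern.
def consistent(values):
    table = dict(published); table.update(zip(hidden, values))
    rows = all(sum(v for (i, j), v in table.items() if i == r) == rt
               for r, rt in enumerate(row_totals))
    cols = all(sum(v for (i, j), v in table.items() if j == c) == ct
               for c, ct in enumerate(col_totals))
    return rows and cols

cap = max(row_totals)                      # crude bound on any hidden value
count = sum(consistent(v) for v in product(range(cap + 1), repeat=len(hidden)))
print(count, "consistent tables:",
      round(math.log(count, 3), 2), "trits =", round(math.log2(count), 2), "bits")
```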
  • 30. Here (Figure 16) is the pattern we prefer, which is intermediate between the two others, having 4 complementary suppressions of total value 80. Using the notion of the amount of information needed to recover the table, this is in fact the best of the 3. With the minimum count pattern, the size of the complements makes the ranges large, and there are many possible tables (101) consistent with the pattern. With the minimum value pattern, although the ranges are small, there are two independent sets of hidden cells, and hence the number of tables is the product of the numbers of the two sub-tables with suppression (11*11). In the preferred pattern one has 21 possible tables. (Note that for the minimum value pattern, one has two independent sub-patterns. Here one would like to be able to say that the information lost is the sum of two terms, one for each independent unit. Since the number of tables is the product of the numbers of possible solutions to these two sub-problems, it is clear that it is appropriate to take logarithms of the number of possible tables. Some of you may of course realise that this is a vast oversimplification of information theory. There is a simple generalisation of the measure if the tables are not equi-probable.) The form of objective function that is in practice the most successful in CONFID may be looked on as an attempt to approximate this sort of measure. Given all these subtleties, it follows that interventions forcing the publication or suppression of certain cells should not be made without serious thought. One should always measure the effect of this type of intervention. Given that tables with suppressions are equivalent to range tables, and that sophisticated users are probably calculating these ranges themselves, either exactly or approximately, it has often been suggested that the statistical agencies improve their service, especially to the less sophisticated users, by publishing the ranges themselves. One suppression package ACS [5] takes this further by providing, in addition to the ranges, a hypothetical solution consistent with them, i.e. a conceivable set of values for the hidden cells which make the tables add up. These values are not an estimate in any statistical sense, but provide a full set of reasonable values that may be of use in certain types of models for example, which do not like missing data points. In our view, range publication would be highly desirable. It has the following advantages for the statistical agency.
  • 31. The personnel doing the suppression need to think more and to have a better understanding of the process. Seeing the ranges will add to the quality of the patterns, especially if any hand tinkering has happened. The effect of tinkering can be better evaluated. Certain “problems” attributed to cell suppression are pseudo-problems, caused only by poor presentation. One such is the feared necessity of having to use a large complement to protect a not very sensitive cell. Well, if the sensitivity is small, the cell doesn't require much ambiguity or protection. Most of the statistical value of the complementary cell can be published. Another problem, the issue of continuity in time series, becomes smaller in importance. It is generally felt that a series such as Figure 17 is disconcerting. If ranges were used one could publish something like Figure 18, which is less disturbing. Obviously there are advantages for the customer too. They are getting more data. (Or the data more conveniently. More sophisticated users can perform the analysis for themselves.) Giving the ranges explicitly helps the less sophisticated user. In our opinion, the arguments for range publication are convincing, and any objections are not very sensible. If one had competing statistical agencies, they would be rushing to get this new and improved service out to the salesman. A few conclusions. If we don't yet have a good measure of information lost in a table, it follows that all methods in use today are heuristic. They solve various problems that approximate the real problem. It is difficult to quantify how well these approximate problems resemble the real problem. Our feeling is that better progress would be attained if one had agreed upon the proper problem, and discussed various methods of solution, exact or approximate. Properties of the problem and its solution could be studied, and a more objective way to evaluate the heuristic methods would be available. As well, standard test data sets could be prepared to facilitate discussion and comparison.
  • 32. It is only a modest overstatement to suggest that some people don't know what they are doing, in spite of the fact that a proper understanding is not difficult. Therefore training and attitude are big problems. The common errors that occur are using bad, deeply flawed sensitivity rules, and using inadequate methods to generate the patterns that do not ensure that the sensitive cells have sufficient ambiguity or protect combinations. References [1] Towards Automated Disclosure Analysis for Statistical Agencies. Gordon Sande; Internal Document, Statistics Canada (1977) [2] Automated Cell Suppression to Preserve Confidentiality of Business Statistics. Gordon Sande; Stat. Jour. U.N. ECE 2, pp. 33-41 (1984) [3] Linear Sensitivity Measures in Statistical Disclosure Control. L. H. Cox; Jour. Stat. Plan. Infer. v. 5, pp. 153-164 (1981) [4] Improving Statistics Canada's Cell Suppression Software (CONFID). D. A. Robertson; COMPSTAT 2000 Proceedings in Computational Statistics. Ed. J. K. Bethlehem, P. G. M. van der Heiden, Physica Verlag (Heidelberg, New York) (2000) [5] ACS available from Sande and Associates, 600 Sanderling Ct., Secaucus N.J. 07094 U.S.A. g.sande@worldnet.att.net
  • 33. J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 21 - 33, 2002. © Springer-Verlag Berlin Heidelberg 2002 Bounds on Entries in 3-Dimensional Contingency Tables Subject to Given Marginal Totals Lawrence H. Cox U.S. National Center for Health Statistics, 6525 Belcrest Road Hyattsville, MD 20782 USA lcox@cdc.gov Abstract: Problems in statistical data security have led to interest in determining exact integer bounds on entries in multi-dimensional contingency tables subject to fixed marginal totals. We investigate the 3-dimensional integer planar transportation problem (3-DIPTP). Heuristic algorithms for bounding entries in 3-DIPTPs have recently appeared. We demonstrate that these algorithms are not exact, that they are based on necessary but not sufficient conditions for solving the 3-DIPTP, and that all are insensitive to whether a feasible table exists. We compare the algorithms and demonstrate that one is superior, but not original. We exhibit fractional extremal points and discuss implications for statistical data base query systems. 1 Introduction A problem of interest in operations research since the 1950s [1] and during the 1960s and 1970s [2] is to establish sufficient conditions for existence of a feasible solution to the 3-dimensional planar transportation problem (3-DPTP), viz., to the linear program (1), whose right-hand sides are constants referred to as the 2-dimensional marginal totals. Attempts at the problem are summarized in [3]. Unfortunately, each has been shown [2, 3] to yield necessary but not sufficient conditions for feasibility. A sufficient condition for multi-dimensional transportation problems is given in [4], based on an iterative nonlinear statistical procedure known as iterative proportional fitting. The purpose here is to examine the role of feasibility in the pursuit of exact integral lower and upper bounds on internal entries in a 3-DPTP subject to integer constraints (viz., a 3-DIPTP) and to describe further research directions. Our notation suggests that internal and marginal entries are integer. Integrality is not required for 3-DPTP (1), nor by the feasibility procedure of [4]. However, henceforth integrality of all entries is assumed as we focus on contingency tables - tables of nonnegative integer frequency counts and totals - and on the 3-DIPTP.
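The display of the linear program (1) did not survive extraction here. Under a standard choice of notation (the symbols y_ijk, a_jk, b_ik, c_ij are assumed, not necessarily the paper's own), the 3-DPTP reads:

```latex
% Standard statement of the 3-dimensional planar transportation problem (1);
% the symbols y_{ijk}, a_{jk}, b_{ik}, c_{ij} are assumed notation.
\begin{aligned}
\sum_{i} y_{ijk} &= a_{jk}, & j &= 1,\dots,J,\; k = 1,\dots,K,\\
\sum_{j} y_{ijk} &= b_{ik}, & i &= 1,\dots,I,\; k = 1,\dots,K,\\
\sum_{k} y_{ijk} &= c_{ij}, & i &= 1,\dots,I,\; j = 1,\dots,J,\\
y_{ijk} &\ge 0 .
\end{aligned}
```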
  • 34. Lawrence H. Cox 22 The consistency conditions are necessary for feasibility of 3-DIPTP: (2) The respective values are the 1-dimensional marginal totals, and their common sum is the grand total. It is customary to represent the 2-dimensional marginal totals in matrices defined by their elements: (3) The feasibility problem is the existence of integer solutions to (1) subject to consistency (2) and integrality conditions on the 2-dimensional marginal totals. The bounding problem is to determine integer lower and upper bounds on each entry over contingency tables satisfying (1)-(2). Exact bounding determines, for each entry, the interval between its minimum and maximum values over all integer feasible solutions of (1)-(2). The bounding problem is important in statistical data protection. To prevent unauthorized disclosure of confidential subject-level data, it is necessary to thwart narrow estimation of small counts. In lieu of releasing the internal entries of a 3-dimensional contingency table, a national statistical office (NSO) may release only the 2-dimensional marginals. An important question for the NSO is then: how closely can a third party estimate the suppressed internal entries using the published marginal totals? During large-scale data production such as for a national census or survey, the NSO needs to answer this question thousands of times. Several factors can produce an infeasible table, viz., marginal totals satisfying (1)-(2) for which no feasible solution exists, and infeasible tables are ubiquitous and abundant, viz., dense in the set of all potential tables [4]. To be useful, bounding methods must be sensitive to infeasibility, otherwise meaningless data and erroneous inferences can result [5]. The advent of public access statistical data base query systems has stimulated recent research by statisticians on the bounding problem. Unfortunately, feasibility and its consequences have been ignored. We highlight and explore this issue through examination of four papers representing separate approaches to bounding problems. Three of the papers [6-8] were presented at the International Conference on Statistical Data Protection (SDP’98), March 25-27, 1998, Lisbon, Portugal, sponsored by the Statistical Office of the European Communities (EUROSTAT). The fourth [9] appeared in Management Science and reports current research findings. A fifth, more recent, paper [10] is also discussed. We examine issues raised and offer observations and generalizations.
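The consistency conditions (2) can be checked mechanically before any bounding is attempted: the three matrices of 2-dimensional marginals must induce matching 1-dimensional totals and a common grand total. A minimal sketch, with assumed array naming A[j, k] = n_{+jk}, B[i, k] = n_{i+k}, C[i, j] = n_{ij+}:

```python
# Check of the consistency conditions (2) for the 2-dimensional marginals
# of a 3-way table. Naming is assumed: A[j, k] = n_{+jk}, B[i, k] = n_{i+k},
# C[i, j] = n_{ij+}.
import numpy as np

def consistent(A: np.ndarray, B: np.ndarray, C: np.ndarray) -> bool:
    ok = True
    # The 1-dimensional totals must agree however they are computed.
    ok &= np.array_equal(B.sum(axis=1), C.sum(axis=1))   # n_{i++}
    ok &= np.array_equal(A.sum(axis=1), C.sum(axis=0))   # n_{+j+}
    ok &= np.array_equal(A.sum(axis=0), B.sum(axis=0))   # n_{++k}
    # The grand total must be the same for all three marginal matrices.
    ok &= (A.sum() == B.sum() == C.sum())
    return bool(ok)
```

As the paper stresses, these conditions are necessary but not sufficient: marginals can pass this check and still admit no feasible table.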
  • 35. Bounds on Entries in 3-Dimensional Contingency Tables 23 2 The F-Bounds Given a 2-dimensional table with consistent sets of column and row marginal totals, the nominal upper bound for an internal entry equals the smaller of its column and row totals. The nominal lower bound is zero. It is easy to obtain exact bounds in 2-dimensions. The nominal upper bound is exact, by the stepping stones algorithm: set the entry to its nominal upper bound, and subtract this value from the column, row and grand totals. Either the column total or the row total (or both) must become zero: set all entries in the corresponding column (or row, or both) equal to zero and drop this column (or row, or both) from the table. Arbitrarily pick an entry from the remaining table, set it equal to its nominal upper bound, and continue. In a finite number of iterations, a completely specified, consistent 2-dimensional table exhibiting the nominal upper bound for the chosen entry will be reached. Exact lower bounds can be obtained as follows. The entry can be no smaller than its row total plus its column total minus the grand total; that this bound is exact follows from observing that a table attaining it is feasible whenever the bound is nonnegative. Therefore, in 2-dimensions, exact bounds are given by: (4) These bounds generalize to m-dimensions, viz., each internal entry is contained in precisely m(m-1)/2 2-dimensional tables, each of which yields a candidate lower bound. The maximum of these lower bounds and zero provides a lower bound on the entry. Unlike the 2-dimensional case, in m ≥ 3 dimensions these bounds are not necessarily exact [5]. We refer to these bounds as the F-bounds. In 3-dimensions, the F-bounds are: (5)
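The display of (5) did not survive extraction above. A sketch of the 3-dimensional F-bounds as usually stated (one candidate lower bound from each of the three 2-dimensional tables containing the entry, the upper bound from the three 2-dimensional marginals), with the same assumed array naming as in the earlier sketch; this reconstruction follows the usual statement of the F-bounds rather than the paper's own display:

```python
# 3-dimensional F-bounds for entry (i, j, k), assuming A[j, k] = n_{+jk},
# B[i, k] = n_{i+k}, C[i, j] = n_{ij+}.
def f_bounds(A, B, C, i, j, k):
    n_i = C[i, :].sum()          # n_{i++}
    n_j = C[:, j].sum()          # n_{+j+}
    n_k = A[:, k].sum()          # n_{++k}
    upper = min(C[i, j], B[i, k], A[j, k])
    lower = max(0,
                C[i, j] + B[i, k] - n_i,   # 2-D table obtained by fixing i
                C[i, j] + A[j, k] - n_j,   # 2-D table obtained by fixing j
                B[i, k] + A[j, k] - n_k)   # 2-D table obtained by fixing k
    return lower, upper
```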
  • 36. Lawrence H. Cox 24 Table 1. Fienberg [6, Table 1] INCOME High Med Low High Med Low MALE FEMALE and the 2-dimensional marginal totals are: . In the remainder of this sub-section, we examine the properties of the bound procedure of [6]. Corresponding to internal entry , there exists a collapsed 2x2x2 table: Table 2. Collapsing a 3-dimensional table around entry Entry (2, 2, 2) in the lower right-hand corner is the complement of , denoted . Fix (i, j, k). Denote the 2-dimensional marginals of Table 2 to which contributes by . Observe that: (6) From (6) results the 2-dimensional Fréchet lower bound of Fienberg (1999): (7)
  • 37. Bounds on Entries in 3-Dimensional Contingency Tables 25 The nominal (also called Fréchet) upper bound on the entry equals the minimum of the 2-dimensional marginal totals to which it contributes. The 2-dimensional Fréchet bounds of [6] are thus: (8) Simple algebra reveals that the lower F-bounds of Section 2 and the 2-dimensional lower Fréchet bounds of [6] are identical. The F-bounds are easier to implement. Consequently, we replace (8) by (5). From (6) also follows the 2-dimensional Bonferroni upper bound of [6]: (9) This Bonferroni upper bound is not redundant: if the relevant totals are sufficiently small, it can be sharper than the nominal upper bound. This is illustrated by an entry of the 2x2x2 table with marginals: (10) In addition, [6] gives the 1-dimensional Fréchet lower bound: (11) This bound is redundant with respect to the lower F-bounds. This is demonstrated as follows: (12) The Fréchet-Bonferroni bounds of [6] can be replaced by: (13) We return to the example [6, Table 1]. Here, the 2-dimensional Bonferroni upper bound (9) is not effective for any entry, and can be ignored. Thus, in this example, the bounds (13) are identical to the F-bounds (5), and should yield identical results. They in fact do not, most likely due to numeric error somewhere in [6]. This discrepancy needs to be kept in mind as we compare computational approaches below: that of Fienberg, using (13), and an alternative using (5).
  • 38. Lawrence H. Cox 26 Fienberg [6] computes the Fréchet bounds, without using the Bonferroni bound (9), resulting in Table 7 of [6]. We applied the F-bounds (5), but in place of his Table 7, obtained sharper bounds. Both sets of bounds are presented in Table 3, as follows. If our bound agrees with [6 , Table 7] we present the bound. If there is disagreement, we include the [6], Table 7 bound in parentheses alongside ours. Table 3. F-Bounds and Fienberg (Table 7) Fréchet Bounds for Table 1 INCOME High Med Low High Med Low MALE FEMALE Fienberg [6] next applies the Bonferroni upper bound (9) to Table 7, and reports improvement in five cells and a Table 8, the table of exact bounds for the table. We were unable to reproduce these results: the Bonferroni bound provides improvement for none of the entries. 3.2 The Shuttle Algorithm of Buzzigoli-Giusti In [7], Buzzigoli-Giusti present the iterative shuttle algorithm, based on principles of subadditivity: - a sum of lower bounds on entries is a lower bound for the sum of the entries, and - a sum of upper bounds on entries is an upper bound for the sum of the entries; - the difference between the value (or an upper bound on the value) of an aggregate and a lower bound on the sum of all but one entry in the aggregate is an upper bound for that entry, and - the difference between the value (or a lower bound on the value) of an aggregate and an upper bound on the sum of all but one entry in the aggregate is a lower bound for that entry. The shuttle algorithm begins with nominal lower and upper bounds. For each entry and its 2-dimensional marginal totals, the sum of current upper bounds of all other entries contained in the 2-dimensional marginal is subtracted from the marginal. This is a candidate lower bound for the entry. If the candidate improves the current lower bound, it replaces it. This is followed by an analogous procedure using sums of lower bounds and potentially improved upper bounds. This two-step procedure is repeated until all bounds are stationary. The authors fail to note but it is evident that stationarity is reached in a finite number of iterations because the marginals are integer.
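The two-step procedure just described is easy to state in code. A minimal sketch for a 3-way table given its three 2-dimensional marginal matrices (the array naming follows the earlier sketches and is an assumption, and the collapsing procedure of Table 2 is not reproduced); as discussed in Section 3.4 below, the result can still be one unit away from the exact integer bound and is insensitive to infeasibility:

```python
# Shuttle-style iterative bounding from 2-dimensional marginals of a 3-way
# table, as described in Section 3.2. Assumed naming: A[j, k] = n_{+jk},
# B[i, k] = n_{i+k}, C[i, j] = n_{ij+} (numpy arrays of integers).
import itertools

def shuttle_bounds(A, B, C):
    I, K = B.shape
    J = A.shape[0]
    cells = list(itertools.product(range(I), range(J), range(K)))
    lo = {c: 0 for c in cells}                                   # nominal lower bounds
    up = {(i, j, k): min(C[i, j], B[i, k], A[j, k]) for (i, j, k) in cells}

    def sums_through(i, j, k):
        # The three marginal sums containing cell (i, j, k): total and the other cells.
        yield C[i, j], [(i, j, kk) for kk in range(K) if kk != k]
        yield B[i, k], [(i, jj, k) for jj in range(J) if jj != j]
        yield A[j, k], [(ii, j, k) for ii in range(I) if ii != i]

    changed = True
    while changed:                      # terminates: integer marginals, monotone updates
        changed = False
        for c in cells:                 # step 1: marginal minus sum of current upper bounds
            for total, others in sums_through(*c):
                cand = total - sum(up[o] for o in others)
                if cand > lo[c]:
                    lo[c], changed = cand, True
        for c in cells:                 # step 2: marginal minus sum of current lower bounds
            for total, others in sums_through(*c):
                cand = total - sum(lo[o] for o in others)
                if cand < up[c]:
                    up[c], changed = cand, True
    return lo, up
```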
  • 39. Bounds on Entries in 3-Dimensional Contingency Tables 27 3.3 Comparative Analysis of Fienberg, Shuttle, and F-Bounding Methods We compare the procedure of Fienberg ([6]), the shuttle algorithm and the F-bounds. As previously observed, the bounds of [6] can be reduced to the F-bounds plus the 2-dimensional Bonferroni upper bound, viz., (13). The shuttle algorithm produces bounds at least as sharp as the F-bounds, for two reasons. First, the iterative shuttle procedure enables improved lower bounds to improve the nominal and subsequent upper bounds. Second, lower F-bounds can be no sharper than those produced during the first set of steps of the shuttle algorithm. To see this, consider for concreteness the candidate lower F-bound for an entry derived from one of its 2-dimensional marginals. The corresponding candidate shuttle lower bound subtracts from that marginal the current upper bounds of the other entries it contains, and each of those current upper bounds is no larger than the marginal total that the F-bound subtracts in its place. Consequently, the shuttle candidate lower bound is greater than or equal to the F-candidate, so the shuttle candidate is at least as sharp as the F-candidate. If the shuttle algorithm is employed, all of the Fienberg (1999) and F lower bounds other than the nominal one (namely, 0) are redundant. Buzzigoli-Giusti [7] illustrate the 3-dimensional shuttle algorithm for the case of a 2x2x2 table. It is not clear, for the general case, if they intend to utilize the collapsing procedure of Table 2, but in what follows we assume that they do. Consider the 2-dimensional Bonferroni upper bounds (9). From (6), the Bonferroni upper bound for an entry can be written in terms of the entries of the collapsed table. Consider the right-hand 2-dimensional table in Table 2. Apply the standard 2-dimensional lower F-bound to the entry in the upper left-hand corner: (14) As previously observed, the shuttle algorithm will compute this bound during step 1 and, if it is positive, replace the nominal lower bound with it, or with something sharper. During step 2, the shuttle algorithm will use this lower bound (or something sharper) to improve the upper bound on the entry, as follows: (15) Thus, the Fienberg [6] 2-dimensional Bonferroni upper bound is redundant relative to the shuttle algorithm. Consequently, if the shuttle and collapsing methodologies are applied in combination, it suffices to begin with the nominal bounds and run the shuttle algorithm to convergence. Application of this approach (steps 1-2-1) to Table 1 yields Table 4 of exact bounds:
  • 40. Lawrence H. Cox 28 Table 4. Exact bounds for Table 1 from the Shuttle Algorithm INCOME High Med Low High Med Low MALE FEMALE 3.4 Limitations of All Three Procedures Although the shuttle algorithm produced exact bounds for this example, the shuttle algorithm, and consequently the Fienberg ([6]) procedure and the F-bounds, are inexact, as follows. Examples 7b,c of [4] are 3-DIPTPs exhibiting one or more non-integer continuous exact bounds. Because it is based on iterative improvement of integer upper bounds, in these situations the shuttle algorithm can come no closer than one unit larger than the exact integer bound, and therefore is incapable of achieving the exact integer bound. The shuttle algorithm is not new: it was introduced by Schell [1] towards purported sufficient conditions on the 2-dimensional marginals for feasibility of 3-DPTP. By means of (16), Moravek-Vlach [2] show that the Schell conditions are necessary, but not sufficient, for the existence of a solution to the 3-DPTP. This counterexample is applicable here. Each 1-dimensional Fréchet lower bound is negative, thus not effective. Each Bonferroni upper bound is too large to be sharp. No lower F-bound is positive. Iteration of the shuttle produces no improvement. Therefore, each procedure yields nominal lower (0) and upper (1) bounds for each entry. Each procedure converges. Consequently, all three procedures produce seemingly correct bounds when in fact no table exists. A simpler counterexample (Example 2 of [5]), given by (17), appears in the next section.
  • 41. Bounds on Entries in 3-Dimensional Contingency Tables 29 (16) 4 The Roehrig and Chowdhury Network Models Roehrig et al. (1999) [8] and Chowdhury et al. (1999) [9] offer network models for computing exact bounds on internal entries in 3-dimensional tables. Network models are extremely convenient and efficient, and, most important, enjoy the integrality property, viz., integer constraints (viz., 2-dimensional marginal totals) assure integer optima. Network models provide a natural mechanism and language in which to express 2-dimensional tables, but most generalizations beyond 2-dimensions are apt to fail. Roehrig et al. [8] construct a network model for 2x2x2 tables and claim that it can be generalized to all 3-dimensional tables. This must be false. Cox ([4]) shows that the class of 3-dimensional tables representable as a network is the set of tables whose size equals 2 along at least one index i. This is true because, if all dimensions exceed 2, it is possible to construct a 3-dimensional table with integer marginals whose corresponding polytope has non-integer vertices ([4]), which contradicts the integrality property. Chowdhury et al. [9] address the following problem related to 3-DIPTP, also appearing in [8]. Suppose that the NSO releases only two of the three sets of 2-dimensional marginal totals (A and B), but not the third set (C) or the internal entries. Is it possible to obtain exact lower and upper bounds for the remaining marginals (C)? The authors construct independent networks corresponding to the 2-dimensional tables defined by the appropriate (A, B) pairs and provide a procedure for obtaining exact bounds. Unfortunately, this problem is quite simple and can be solved directly without recourse to networks or other mathematical formalities. In particular, the F-bounds of Section 2 suffice, as follows. Observe that, absent the C-constraints, the minimum (maximum) feasible value of a C-marginal total Cij equals the sum of the minimum (maximum) values for the corresponding internal entries. As these entries are subject only to 2-dimensional constraints within their respective k-planes, the exact bounds for each are precisely its 2-dimensional lower and upper F-bounds. These can be computed at
  • 42. Lawrence H. Cox 30 sight and added along the k-dimension, thus producing the corresponding C-bounds without recourse to networks or other formulations. The Chowdhury et al. [9] method is insensitive to infeasibility, again demonstrated by example (16): all Fréchet lower and nominal upper bounds computed in the two 2-dimensional tables defined by k = 1 and k = 2 contain the corresponding Cij (equal to 1 in all cases), but there is no underlying table as the problem is infeasible. Insensitivity is also demonstrated by Example 2 of [5], also presented at the 1998 conference, viz., (17) 5 Fractional Extremal Points Theorem 4.3 of [4] demonstrates that the only multi-dimensional integer planar transportation problems (m-DIPTP) for which integer extremal points are assured are those of size 2 x ... x 2 x b x c, i.e., with m-2 dimensions of size 2. In these situations, linear programming methods can be relied upon to produce exact integer bounds on entries, and to do so in a computationally efficient manner even for problems of large dimension or size. The situation for other integer problems of transportation type, viz., m-dimensional contingency tables subject to k-dimensional marginal totals, k = 0, 1, ..., m-1, is quite different: until a direct connection can be demonstrated between exact continuous bounds on entries obtainable from linear programming and exact integer bounds on entries, linear programming will remain an unreliable tool for solving multi-dimensional problems of transportation type. Integer programming is not a viable option for large dimensions or size or repetitive use. Methods that exploit the unique structure of certain subclasses of tables are then appealing, though possibly of limited applicability. Dobra-Fienberg [10] present one such method, based on notions from mathematical statistics and graph theory. Given an m-dimensional integer problem of transportation type and specified marginal totals, if these marginals form a set of sufficient statistics for a specialized log-linear model known as a decomposable graphical model, then the model is feasible and exact integer bounds can be obtained from straightforward formulae. These formulae yield, in essence, the F- and Bonferroni bounds. The reader is referred to [10] for details, [11] for details on log-linear models, and [12] for development of graphical models.
  • 43. Bounds on Entries in 3-Dimensional Contingency Tables 31 The m-dimensional planar transportation problem considered here, m ≥ 2, corresponds to the no m-factor effect log-linear model, which is not graphical, and consequently the Dobra-Fienberg method [10] does not apply to problems considered here. The choice here and perhaps elsewhere of the 3- or m-DIPTP as the initial focus of study for bounding problems is motivated by the following observations. If for reasons of confidentiality the NSO cannot release the full m-dimensional tabulations (viz., the m-dimensional marginal totals), then its next-best strategy is to release the (m-1)-dimensional marginal totals, corresponding to the m-DIPTP. If it is not possible to release all of these marginals, perhaps the release strategy of Chowdhury et al. [9] should be investigated. Alternatively, release of the (m-2)-dimensional marginals might be considered. This strategy for release is based on the principle of releasing the most specific information possible without violating confidentiality. Dobra-Fienberg offers a different approach: a class of marginal totals, perhaps of varying dimension, that can be released while assuring confidentiality via easy computation of exact bounds on suppressed internal entries. A formula-driven bounding method is valuable for large problems and for repetitive, large-scale use. Consider the m-dimensional integer problem of transportation type specified by its 1-dimensional marginal totals. In the statistics literature, this is known as the complete independence log-linear model [11]. This model, and in particular the 3-dimensional complete independence model, is graphical and decomposable. Thus, exact bounding can be achieved using Dobra-Fienberg. Such problems can exhibit non-integer extremal points. For example, consider the 3x3x3 complete independence model with 1-dimensional marginal totals given by the vector: (18) Even though all continuous exact bounds on internal entries in (18) are integer, one extremal point at which an entry is maximized (at 1) contains four non-integer entries and another contains six. Bounding using linear programming would only demonstrate that this value is the continuous, not the integer, maximum if either of these extremal points were exhibited. A strict network formulation is not possible because networks exhibit only integer extremal points (although use of networks with side constraints is under investigation). Integer programming is undesirable for large or repetitive applications. Direct methods such as Dobra-Fienberg may be required. A drawback of Dobra-Fienberg is that it applies only in specialized cases.
  • 44. Lawrence H. Cox 32 6 Discussion In this paper we have examined the problem of determining exact integer bounds for entries in 3-dimensional integer planar transportation problems. This research was motivated by previous papers presenting heuristic approaches to similar problems that failed in some way to meet simple criteria for performance or reasonableness. We examined these and other approaches analytically and demonstrated one’s superiority. We demonstrated that this method is imperfect and a reformulation of a method from the operations literature of the 1950s. We demonstrated that these methods are insensitive to infeasibility and can produce meaningless results otherwise undetected. We demonstrated that a method purported to generalize from 2x2x2 tables to all 3- dimensional tables could not possibly do so. We demonstrated that a problem posed and solved using networks in a Management Science paper can be solved by simple, direct means without recourse to mathematical programming. We illustrated the relationship between computing integer exact bounds, the presence of non-integer extremal points and the applicability of mathematical programming formulations such as networks. NSOs must rely on automated statistical methods for operations including estimation, tabulation, quality assurance, imputation, rounding and disclosure limitation. Algorithms for these methods must converge to meaningful quantities. In particular, these procedures should not report meaningless, misleading results such as seemingly correct bounds on entries when no feasible values exist. These risks are multiplied in statistical data base settings where data from different sources often are combined. Methods examined here for bounds on suppressed internal entries in 3-dimensional contingency tables fail this requirement because they are heuristic and based on necessary, but not sufficient, conditions for the existence of a solution to the 3-DPTP. In addition, most of these methods fail to ensure exact bounds, and are incapable of identifying if and when they do in fact produce an exact bound. Nothing is gained by extending these methods to higher dimensions. Disclaimer Opinions expressed are solely those of the author and are not intended to represent policy or practices of the National Center for Health Statistics, Centers for Disease Control and Prevention, or any other organization. References 1. Schell, E. Distribution of a product over several properties. Proceedings, 2nd Symposium on Linear Programming. Washington, DC (1955) 615-642 2. Moravek, J. and Vlach, M. (1967). On necessary conditions for the existence of the solution to the multi-index transportation problem. Operations Research 15, 542-545 3. Vlach, M. Conditions for the existence of solutions of the three-dimensional planar transportation problem. Discrete Applied Mathematics 13 (1986) 61-78 4. Cox, L. (2000). On properties of multi-dimensional statistical tables. Manuscript (April 2000) 29 pp.
  • 45. Bounds on Entries in 3-Dimensional Contingency Tables 33 5. Cox, L. Invited Talk: Some remarks on research directions in statistical data protection. Statistical Data Protection: Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 163-176 6. Fienberg, S. Fréchet and Bonferroni bounds for multi-way tables of counts with applications to disclosure limitation. Statistical Data Protection: Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 115-129 7. Buzzigoli, L. and Giusti, A. An algorithm to calculate the lower and upper bounds of the elements of an array given its marginals. Statistical Data Protection: Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 131-147 8. Roehrig, S., Padman, R., Duncan, G., and Krishnan, R. Disclosure detection in multiple linked categorical datafiles: A unified network approach. Statistical Data Protection: Proceedings of the Conference. EUROSTAT, Luxembourg (1999) 149-162 9. Chowdhury, S., Duncan, G., Krishnan, R., Roehrig, S., and Mukherjee, S. Disclosure detection in multivariate categorical databases: Auditing confidentiality protection through two new matrix operators. Management Science 45 (1999) 1710-1723 10. Dobra, A. and Fienberg, S. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proceedings of the National Academy of Sciences 97 (2000) 11185-11192 11. Bishop, Y., Fienberg, S., and Holland, P. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: M.I.T. Press (1975) 12. Lauritzen, S. Graphical Models. Oxford: Clarendon Press (1996)
  • 46. Extending Cell Suppression to Protect Tabular Data against Several Attackers Juan José Salazar González DEIOC, Faculty of Mathematics, University of La Laguna Av. Astrofísico Francisco Sánchez, s/n; 38271 La Laguna, Tenerife, Spain Tel: +34 922 318184; Fax: +34 922 318170 jjsalaza@ull.es Abstract. This paper presents three mathematical models for the problem of finding a cell suppression pattern minimizing the loss of information while guaranteeing protection level requirements for different sensitive cells and different intruders. This problem covers a very general setting in Statistical Disclosure Control, and it contains as particular cases several important problems like, e.g., the so-called “common respondent problem” mentioned in Jewett [9]. Hence, the three models also apply to the common respondent problem, among others. The first model corresponds to bi-level Mathematical Programming. The second model belongs to Integer Linear Programming (ILP) and could be used on small-size tables where some nominal values are known to assume discrete values. The third model is also an ILP model, valid when the nominal values of the table are continuous numbers, and with the advantage of containing a small number of variables (one 0-1 variable for each cell in the table). On the other hand, this model has a larger number of linear inequalities (related to the number of sensitive cells and the number of attackers). Nevertheless, this paper addresses this disadvantage, which is overcome by dynamic generation of the important inequalities when necessary. The overall algorithm follows a modern Operational Research technique known as the branch-and-cut approach, and allows optimal solutions to be found for medium-size tables. On large-size tables the approach can be used to find near-optimal solutions. The paper illustrates the procedure on an introductory instance. The paper ends by pointing out an alternative methodology (closely related to the one in Jewett [9]) that produces patterns by shrinking all the different intruders into a single one, and compares it with the classical single-attacker methodology and with the above multi-attacker methodology. 1 Introduction Cell suppression is one of the most popular techniques for protecting sensitive information in statistical tables, and it is typically applied to 2- and 3-dimensional Work supported by the European project IST-2000-25069, “Computational Aspects of Statistical Confidentiality” (CASC). J. Domingo-Ferrer (Ed.): Inference Control in Statistical Databases, LNCS 2316, pp. 34–58, 2002. © Springer-Verlag Berlin Heidelberg 2002
  • 47. Multi-attacker Cell Suppression Problem 35 tables whose entries (cells) are subject to marginal totals. The standard cell sup- pression technique is based on the idea of protecting the sensitive information by hiding the values of some cells with a symbol (e.g. ∗). The aim is that a set of potential intruders could not guess (exactly or approximately) any one of the hiding values by only using the published values and some a-priori informa- tion. The only assumption of this work is that this a-priori information must be formulated as a linear system of equations or inequations with integer and/or continuous variables. For example, we allow the intruder to know bounds on the hiding values (as it happens with bigger contributors to each cell) but not proba- bility distributions on them. Notice that different intruders could know different a-priori information. The aim is considered so complex that there are in literature only heuristic approaches (i.e., procedures providing approximated —probably overprotected— suppression patterns) for special situations. For example, a relevant situation occurs when there is an entity which contributes to several cells, leading to the so-called common respondent problem. Possible simplifications valid for this situation consist on replacing all the different intruders by one stronger attacker with “protection capacities” associated to the potential secondary suppressions (see, e.g., Jewett [9] or Sande [17] for details), or on aggregating some sensitive cells into new “union” cells with stronger protection level requirements (see, e.g., Robertson [15]). This paper presents the first mathematical models for the problem in the general situation (i.e, without any simplification) and a first exact algorithm for the resolution. The here-proposed approach looks for a suppression pattern with minimum loss of information and which guarantees all the protection re- quirements against all the attackers. It is also a practical method to find an optimal solution using modern tools from Mathematical Programming, a well- established methodology (see, e.g., [13]). The models and the algorithm apply to a very general problem, containing the common respondent problem as a particular case. Section 2 illustrates the classical cell suppression methodology by means of a (very simple) introductory example, and Section 3 describes the more general multi-attacker cell suppression methodology. Three mathematical models are described in Section 4, the last one with one decision variable for each cell and a large number of constraints. Section 5 proposes an algorithm for finding an optimal solution of the model using the third model, and Section 6 il- lustrates how it works on the introductory example. Section 7 presents a relaxed methodology based on considering one worse-case attacker with the information of all the original intruders, leading to an intermediate scheme that could be ap- plied with smaller computational effort. This scheme considers the “protection capacities” in Jewett [9] and provides overprotected patterns. Finally, Section 8 compares the classical, the multi-attacker and the intermediate methodologies, and Section 9 summarizes the main ideas of this paper and point out a further extension of the multi-attacker models.
  • 48. 36 Juan José Salazar González 2 Classical Cell Suppression Methodology A common hypothesis in the classical cell suppression methodology (see, e.g., Willenborg and De Waal [18]) is that there is only one attacker interested in the disclosure of all sensitive cells. We next introduce the main concepts of the methodology through a simple example. A B C Total Activity I 20 50 10 80 Activity II 8 19 22 49 Activity III 17 32 12 61 Total 45 101 44 190 Fig. 1. Investment of enterprises by activity and region A B C Total Activity I 20 50 10 80 Activity II * 19 * 49 Activity III * 32 * 61 Total 45 101 44 190 Fig. 2. A possible suppression pattern Figure 1 exhibits a statistical table giving the investment of enterprises (per millions of guilders), classified by activity and region. For simplicity, the cell corresponding to Activity i (for each i ∈ {I, II, III}) and Region j (for each j ∈ {A, B, C}) will be represented by the pair (i, j). Let us assume that the information in cell (II, C) is confidential, hence it is viewed as a sensitive cell to be suppressed. By using the marginal totals the attacker can however recom- pute the missing value in the sensitive cell, hence other table entries must be suppressed as well, e.g., those of Figure 2. With this choice, the attacker cannot disclosure the value of the sensitive cell exactly, although he/she can still com- pute a range for the values of this cell which are consistent with the published entries. Indeed, from Figure 2 the minimum value y− II,C for the sensitive cell (II, C) can be computed by solving a Linear Programming (LP) model in which the values yi,j for the suppressed cells (i, j) are treated as unknowns, namely y− II,C := min yII,C subject to yII,A +yII,C = 30 yIII,A +yIII,C = 29 yII,A +yIII,A = 25 yII,C +yIII,C = 34 yII,A ≥ 0 , yIII,A ≥ 0 , yII,C ≥ 0 , yIII,C ≥ 0.
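The two linear programs just displayed (the minimization above and the analogous maximization) can be reproduced with any LP solver. A minimal sketch; scipy is used purely for illustration and is not the software discussed in this paper, and the variable ordering y = (y_IIA, y_IIC, y_IIIA, y_IIIC) is ours:

```python
# Attacker's linear programs for the suppression pattern of Figure 2:
# minimize and maximize y_IIC subject to the published sums and y >= 0.
from scipy.optimize import linprog

A_eq = [[1, 1, 0, 0],   # y_IIA  + y_IIC  = 30  (row II total minus published 19)
        [0, 0, 1, 1],   # y_IIIA + y_IIIC = 29  (row III total minus published 32)
        [1, 0, 1, 0],   # y_IIA  + y_IIIA = 25  (column A total minus published 20)
        [0, 1, 0, 1]]   # y_IIC  + y_IIIC = 34  (column C total minus published 10)
b_eq = [30, 29, 25, 34]
bounds = [(0, None)] * 4              # external attacker: only non-negativity is known

y_min = linprog([0, 1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
y_max = -linprog([0, -1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
print(y_min, y_max)                   # the attacker's interval for y_IIC
```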
  • 49. Multi-attacker Cell Suppression Problem 37 Notice that the right-hand side values are known to the attacker, as they can be obtained as the difference between the marginal and the published values in a row/column. We are also assuming that the attacker knows that a missing value is non-negative, i.e., 0 and infinity are known “external bounds” for suppressions. The maximum value y+ II,C for the sensitive cell can be computed in a perfectly analogous way, by solving the linear program of maximizing yII,C subject to the same constraints as before. Notice that each solution of this common set of constraints is a congruent table according with the published suppression pattern in Figure 2 and with the extra knowledge of the external bounds (non-negativity on this example). In the example, y− II,C = 5 and y+ II,C = 30, i.e., the sensitive information is “protected” within the protection interval [5,30]. If this interval is considered sufficiently wide by the statistical office, then the sensitive cell is called protected; otherwise Figure 2 is not a valid suppression pattern and new complementary suppressions are needed. Notice that the extreme values of the computed interval [5, 30] are only at- tained if the cell (II, A) takes the quite unreasonable values of 0 and 25. Oth- erwise, if the external bounds for each suppressed cell are assumed to be ±50% of the nominal value, then the solution of the new two linear programs results in the more realistic protection interval [18, 26] for the sensitive cell. That is why it is very important to consider good estimations of the external bounds known for the attacker on each suppressed cell when checking if a suppression pattern protects (or not) each sensitive cell of the table. As already stated, in the above example we are assuming that the external bounds are 0 and infinity, i.e., the only knowledge of the attacker on the unknown variables is that they are non-negative numbers. To classify the computed interval [y− p , y+ p ] around a nominal value ap of a sensitive cell p as “sufficiently wide” or not, the statistical office must provide us with three parameters for each sensitive cell: – an upper protection level representing the minimum allowed value to y+ p −ap; – a lower protection level representing the minimum allowed value to ap − y− p ; – an sliding protection level representing the minimum allowed value to y+ p − y− p . For example, if 7, 5 and 0 are the upper, lower and sliding protection levels, respectively, then the interval [5, 30] is “sufficiently wide”, and therefore pattern in Figure 2 is a valid solution for the statistical office (assuming the external bounds are 0 and infinity). The statistical office then aims at finding a valid suppression pattern pro- tecting all the sensitive cells against the attacker, and such that the loss of information associated with the suppressed entries is minimized. This results into a combinatorial optimization problem known as the (classical) Cell Sup- pression Problem, or CSP for short. CSP belongs to the class of the strongly NP-hard problems (see, e.g., Kelly et al. [12], Geurts [7], Kao [10]), meaning that it is very unlikely that an algorithm for the exact solution of CSP exists,
  • 50. 38 Juan José Salazar González which guarantees an efficient (i.e., polynomial-time) performance for all possible input instances. Previous works on the classical CSP from the literature mainly concentrate on 2-dimensional tables with marginals. Heuristic solution procedures have been proposed by several authors, including Cox [1,2], Sande [16], Kelly et al. [12], and Carvalho et al. [3]. Kelly [11] proposed a mixed-integer linear programming for- mulation involving a huge number of variables and constraints (for instance, the formulation involves more than 20,000,000 variables and 30,000,000 constraints for a two-dimensional table with 100 rows, 100 columns and 5% sensitive en- tries). Geurts [7] refined this model, and reported computational experiences on small-size instances, the largest instance solved to optimality being a table with 20 rows, 6 columns and 17 sensitive cells. (the computing time is not reported; for smaller instances, the code required several thousand CPU seconds on a SUN Spark 1+ workstation). Gusfield [8] gave a polynomial algorithm for a special case of the problem. Heuristics for 3-dimensional tables have been proposed in Robertson [14], Sande [16], and Dellaert and Luijten [4]. Very recently, Fischetti and Salazar [5] proposed a new method capable of solving to proven optimal- ity, on a personal computer, 2-dimensional tables with about 250,000 cells and 10,000 sensitive entries. An extension of this methodology capable of solving to proven optimality real-world 3- and 4-dimensional tables is presented in Fischetti and Salazar [6]. 3 Multi-attacker Cell Suppression Methodology The classical cell suppression methodology has several disadvantages. One of them concerns with the popular hypothesis that the table must be protected against one attacker. To be more precise, with the above assumption the attacker is supposed to be one external intruder with no other information different than the structure of the table, the published values and the external bounds on the suppressed values. In practice, however, there are also other types of attackers, like some special respondents (e.g., different entities contributing to different cell values). We will refer to those ones as internal attackers, while the above intruder will be refereed as external attacker. For each internal attacker there is an additional information concerning his/her own contribution to the table. To be more precise, in the above example, if cell (II, A) has only one respondent, the output from Figure 2 could be protected for an external attacker but not from this potential internal attacker. Indeed, the respondent contributing to cell (II, A) knows that yII,A ≥ 8 (and even that yII,A = 8 if he/she also knows that he/she is the only contributor to such cell). This will allow him/her to compute a more narrow protection interval for the sensitive cell (II, C) from Figure 2, even if it is protected for the external attacker. In order to avoid this important disadvantage of the classical cell suppres- sion methodology, this paper proposes an extension of the mathematical model presented in Fischetti and Salazar [6]. The extension determines a suppression
  • 51. Multi-attacker Cell Suppression Problem 39 pattern protected against external and internal attackers, and it will be described in the next section. The classical Cell Suppression problem will also be referred to throughout this article as the Single-attacker CSP, while the new proposal will be named the Multi-attacker CSP. The hypothesis of the new methodology is that we must be given not only the basic information (nominal values, loss of information, etc.), but also a set of attackers K and, for each one, the specific information he/she has (i.e., his/her own bounds on unpublished values). Notice that if K contains only the external attacker, then the multi-attacker CSP reduces to the single-attacker CSP. Otherwise it could happen that some attackers need not be considered, since a suppression pattern that protects sensitive information against one attacker could also protect the table against some others. This is, for example, the case when there are two attackers with the same protection requirements, but one knows tighter bounds; then it is enough to protect the table against the stronger attacker. For each attacker, a similar situation occurs with the sensitive cells, since protecting some of them could imply protecting others. Therefore, a clever preprocessing is always required to reduce as much as possible the number of protection level requirements and the number of attackers. The availability of a well-implemented preprocessing could also help in the task of identifying the potential internal attackers. Indeed, notice that several rules have been developed in the literature to establish the sensitive cells in a table (e.g., the dominance rule), but the same effort does not exist for establishing the attackers. Within the preprocessing, a proposal could be to consider each respondent in a table as a potential intruder, and then simply apply the preprocessing to remove the dominated ones. In theory this approach could lead to a huge number of attackers, but in practice the number of attackers is expected to be no bigger than the number of cells in the table (and hopefully much smaller). Another important observation is the following. For a particular sensitive cell, the statistical office could also be interested in providing different protection levels for each attacker. For example, suppose that the statistical office requires a lower protection level of 15 and an upper protection level of 5 for the sensitive cell (II, C) in the introductory example (Figure 1) against an external attacker. If the sensitive cell is the sum of the contributions from two respondents, one providing 18 units and the other providing 4 units, the statistical office could also be interested in requiring a special upper protection level of 20 against the biggest contributor to the sensitive cell (because he/she is a potential attacker with the extra knowledge that the sensitive cell contains at least the value 18). Indeed, notice that it does not make sense to protect the sensitive cell against this internal attacker with a lower protection requirement of 5 or more, since he/she already knows a lower bound of 18. In other words, it makes sense that the statistical office wants different protection levels for different attackers, with the important assumption that each protection level must be smaller than the corresponding bound. The following section describes Mathematical Programming models capturing all these features.
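Before turning to the models, the internal-attacker point above can be made concrete with the same LP machinery used for the classical example: adding the bound y_IIA ≥ 8 known to the respondent of cell (II, A) narrows the attacker's interval for the sensitive cell. This is only a sketch; scipy, the variable ordering and the way the check is coded are ours, while the protection levels (upper 5, lower 15) are those quoted in the example:

```python
# Audit the pattern of Figure 2 against two attackers with different bounds.
from scipy.optimize import linprog

A_eq = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
b_eq = [30, 29, 25, 34]
a_IIC, upl, lpl = 22, 5, 15                  # nominal value and protection levels

attackers = {
    "external": [(0, None)] * 4,                               # only y >= 0
    "internal": [(8, None), (0, None), (0, None), (0, None)],  # also knows y_IIA >= 8
}

for name, bounds in attackers.items():
    lo = linprog([0, 1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    hi = -linprog([0, -1, 0, 0], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    ok = hi >= a_IIC + upl and lo <= a_IIC - lpl
    print(f"{name}: range [{lo:.0f}, {hi:.0f}]  protected: {ok}")
# external: [5, 30], protected; internal: [5, 22], not protected, since the
# upper requirement 22 + 5 cannot be reached once y_IIA >= 8 is known.
```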
  • 52. 40 Juan José Salazar González 4 Mathematical Models Let us consider a table [ai, i ∈ I] with n := |I| cells. It can be a k-dimensional, hierarchical or linked table. Since there are marginals, the cells are linked through some equations indexes by J, and let [ i∈I mijyi = bj, j ∈ J] the linear system defined by such equations. (Typically bj = 0 and mij ∈ {−1, 0, +1} with one −1 in each equation.) We are also given a weight wi for each cell i ∈ I for the loss of information incurred if such cell is suppressed in the final suppression pattern. Let P ⊂ I the set of sensitive cells (and hence the set of primary suppression). Finally, let us consider a set K of potential attackers. Associated to each attacker k ∈ K, we are given with the external bounds (lbk i , ubk i ) known by the attacker on each suppressed cell i ∈ I, and with the three protection levels (uplkp , lplkp , splkp ) that the statistical office requires to protect each sensitive cell p ∈ P against such attacker k. From the last observation in the previous section, we will assume lbk p ≤ ap − lplkp ≤ ap ≤ ap + uplkp ≤ ubk p and ubk p − lbk p ≥ splkp for each attacker k and each sensitive cell p. Then, the optimization problem associated to the Cell Suppression Method- ology can be modeled as follows. Let us consider a binary variable xi associated to each cell i ∈ I, assuming value 1 if such cell must be suppressed in the final pattern, or 0 otherwise. Notice that the attacker will minimize and maximize unknown values on the set of consistent tables, defined by: i∈I mijyi = bj , j ∈ J lbk i ≤ yi ≤ ubk i , i ∈ I when xi = 1 yi = ai , i ∈ I when xi = 0, equivalently represented as the solution set of the following linear system: i∈I mijyi = bj , j ∈ J ai − (ai − lbk i )xi ≤ yi ≤ ai + (ubk i − ai)xi , i ∈ I. (1) Therefore, our optimization problem is to find a value for each xi such that the total loss of the information in the final pattern is minimized, i.e.: min i∈I wixi (2) subject to, for each sensitive cell p ∈ P and for each attacker k ∈ K, – the upper protection requirement must be satisfied, i.e.: max {yp : (1) holds } ≥ ap + uplkp (3)
  • 53. Multi-attacker Cell Suppression Problem 41 – the lower protection requirement must be satisfied, i.e.: min {yp : (1) holds } ≤ ap − lplkp (4) – the sliding protection requirement must be satisfied, i.e.: max {yp : (1) holds } − min {yp : (1) holds } ≥ splkp (5) Finally, each variable must assume value 0 or 1, i.e.: xi ∈ {0, 1} for all i ∈ I. (6) Mathematical model (2)–(6) contains all the requirements of the statistical office (according to the definition given in Section 1), and therefore a solution [x∗ i , i ∈ I] defines an optimal suppression pattern. The inconvenience is that it is not an easy model to solve, since it does not belong to standard Mixed Integer Linear Programming. In fact, the existence of optimization problems as constraints of a main optimization problem classifies the model as so-called “Bilevel Mathematical Programming”, for which no efficient algorithms exist to solve model (2)–(6) even at small sizes. Observe that the inconvenience of model (2)–(6) is not the number of variables (which is at most the number of cells, both for the master optimization problem and for each subproblem in the second level), but the fact that there are nested optimization problems in two levels. The best way to avoid direct resolution is to look for a transformation into a classical Integer Programming model. A first idea arises by observing that the optimization problem in condition (3) can be replaced by the existence of a table [fkp i , i ∈ I] that is congruent (i.e., it satisfies (1)) and guarantees the upper protection level requirement, i.e.: fkp p ≥ ap + uplkp . In the same way, the optimization problem in condition (4) can be replaced by the existence of a table [gkp i , i ∈ I] that is also congruent (i.e., it satisfies (1)) and guarantees the lower protection level requirement, i.e.: gkp p ≤ ap − lplkp . Finally, the two optimization problems in condition (5) can be replaced by the above congruent tables if they guarantee the sliding protection level, i.e.: fkp p − gkp p ≥ splkp . Figure 3 shows a first attempt at an integer linear model. Clearly, this new model is a Mixed Integer Linear Programming model, and therefore —in theory— there are efficient approaches to solve it. Nevertheless, the number of new variables (fkp i and gkp i ) is huge even on small tables. For example, the model associated with a table with 100 × 100 cells with 1%
  • 54. 42 Juan José Salazar González min i∈I wixi subject to: i∈I mijfkp i = bj for all j ∈ J ai − (ai − lbk i )xi ≤ fkp i ≤ ai + (ubk i − ai)xi for all i ∈ I i∈I mijgkp i = bj for all j ∈ J ai − (ai − lbk i )xi ≤ gkp i ≤ ai + (ubk i − ai)xi for all i ∈ I fkp p ≥ ap + upl kp gkp p ≤ ap − lplkp fkp p − gkp p ≥ splkp for all p ∈ P and all k ∈ K, and also subject to: xi ∈ {0, 1} for all i ∈ I. Fig. 3. First ILP model for multi-attacker CSP sensitive and 100 attackers would have millions of variables. Therefore, it is necessary another approach to transform model (2)–(6) without adding so many additional variables. An alternative approach which does not add any additional variable follows the idea described in Fischetti and Salazar [6] for the classical cell suppression problem (i.e., with one attacker). Based on the Farkas’ Lemma, it is possible to replace the second level problems of model (2)–(6) by linear constraints on the xi variables. Indeed, assuming that values yi in a congruent table are continuous numbers, the two linear programs in conditions (3)–(5) can be rewritten in their dual format. More precisely, by Dual Theory in Linear Programming max {yp : (1) holds } is equivalent to min j∈J γjbj + i∈I [αi(ai + (ubk i − ai)xi) − βi(ai − (ai − lbk i )xi)]
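Because the sub- and superscripts in the displayed formulas of this section were flattened in extraction, a consolidated restatement may help. The notation follows the text, with the flattened indices written out (e.g. lb_i^k, upl_p^k, f_i^{kp}); this is a transcription aid, not a new formulation:

```latex
% Congruent tables for attacker k (system (1)) and model (2)-(6), with the
% flattened indices written out explicitly.
\begin{aligned}
&\textbf{(1)}\quad \sum_{i\in I} m_{ij}\,y_i = b_j \ \ (j\in J),\qquad
 a_i-(a_i-lb_i^k)\,x_i \;\le\; y_i \;\le\; a_i+(ub_i^k-a_i)\,x_i \ \ (i\in I),\\[4pt]
&\textbf{(2)}\quad \min \sum_{i\in I} w_i x_i \quad\text{subject to, for all } p\in P,\ k\in K:\\
&\textbf{(3)}\quad \max\{y_p : (1)\ \text{holds}\} \;\ge\; a_p+upl_p^k,\\
&\textbf{(4)}\quad \min\{y_p : (1)\ \text{holds}\} \;\le\; a_p-lpl_p^k,\\
&\textbf{(5)}\quad \max\{y_p : (1)\ \text{holds}\}-\min\{y_p : (1)\ \text{holds}\} \;\ge\; spl_p^k,\\
&\textbf{(6)}\quad x_i\in\{0,1\}\ \ (i\in I).
\end{aligned}
```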
• 62. theory of classes and relations. The second of these topics we will postpone to a later chapter; the first must occupy us now. When we say that something is always true or true in all cases, it is clear that the something involved cannot be a proposition. A proposition is just true or false, and there is an end of the matter. There are no instances or cases of Socrates is a man or Napoleon died at St Helena. These are propositions, and it would be meaningless to speak of their being true in all cases. This phrase is only applicable to propositional functions. Take, for example, the sort of thing that is often said when causation is being discussed. (We are not concerned with the truth or falsehood of what is said, but only with its logical analysis.) We are told that A is, in every instance, followed by B. Now if there are instances of A, A must be some general concept of which it is significant to say x1 is A, x2 is A, x3 is A, and so on, where x1, x2, x3 are particulars which are not identical one with another. This applies, e.g., to our previous case of lightning. We say that lightning (A) is followed by thunder (B). But the separate flashes are particulars, not identical, but sharing the common property of being lightning. The only way of expressing a common property generally is to say that a common property of a number of objects is a propositional function which becomes true when any one of these objects is taken as the value of the variable. In this case all the objects are instances of the truth of the propositional function—for a propositional function, though it cannot itself be true or false, is true in certain instances and false in certain others, unless it is always true or always false. When, to return to our example, we say that A is in every instance followed by B, we mean that, whatever x may be, if x is an A, it is followed by a B; that is, we are asserting that a certain propositional function is always true. Sentences involving such words as all, every, a, the, some require propositional functions for their interpretation. The way in which propositional functions occur can be explained by means of two of the above words, namely, all and some. There are, in the last analysis, only two things that can be done with a propositional function: one is to assert that it is true in all cases, the other to assert that it is true in at least one case, or in some cases (as we shall say, assuming that there is to be no necessary implication of a plurality of cases). All the other uses of propositional functions can be reduced to these
  • 63. two. When we say that a propositional function is true in all cases, or always (as we shall also say, without any temporal suggestion), we mean that all its values are true. If is the function, and is the right sort of object to be an argument to , then is to be true, however may have been chosen. For example, if is human, is mortal is true whether is human or not; in fact, every proposition of this form is true. Thus the propositional function if is human, is mortal is always true, or true in all cases. Or, again, the statement there are no unicorns is the same as the statement the propositional function ' is not a unicorn' is true in all cases. The assertions in the preceding chapter about propositions, e.g. ' or ' implies ' or ,' are really assertions that certain propositional functions are true in all cases. We do not assert the above principle, for example, as being true only of this or that particular or , but as being true of any or concerning which it can be made significantly. The condition that a function is to be significant for a given argument is the same as the condition that it shall have a value for that argument, either true or false. The study of the conditions of significance belongs to the doctrine of types, which we shall not pursue beyond the sketch given in the preceding chapter. Not only the principles of deduction, but all the primitive propositions of logic, consist of assertions that certain propositional functions are always true. If this were not the case, they would have to mention particular things or concepts—Socrates, or redness, or east and west, or what not,—and clearly it is not the province of logic to make assertions which are true concerning one such thing or concept but not concerning another. It is part of the definition of logic (but not the whole of its definition) that all its propositions are completely general, i.e. they all consist of the assertion that some propositional function containing no constant terms is always true. We shall return in our final chapter to the discussion of propositional functions containing no constant terms. For the present we will proceed to the other thing that is to be done with a propositional function, namely, the assertion that it is sometimes true, i.e. true in at least one instance. When we say there are men, that means that the propositional function is a man is sometimes true. When we say some men are Greeks, that means that the propositional function is a man and a Greek is sometimes true. When we say cannibals still exist in Africa, that means that the propositional function is a cannibal now in Africa is sometimes
  • 64. true, i.e. is true for some values of . To say there are at least individuals in the world is to say that the propositional function is a class of individuals and a member of the cardinal number is sometimes true, or, as we may say, is true for certain values of . This form of expression is more convenient when it is necessary to indicate which is the variable constituent which we are taking as the argument to our propositional function. For example, the above propositional function, which we may shorten to is a class of individuals, contains two variables, and . The axiom of infinity, in the language of propositional functions, is: The propositional function 'if is an inductive number, it is true for some values of that is a class of individuals' is true for all possible values of . Here there is a subordinate function, is a class of individuals, which is said to be, in respect of , sometimes true; and the assertion that this happens if is an inductive number is said to be, in respect of , always true. The statement that a function is always true is the negation of the statement that not- is sometimes true, and the statement that is sometimes true is the negation of the statement that not- is always true. Thus the statement all men are mortals is the negation of the statement that the function is an immortal man is sometimes true. And the statement there are unicorns is the negation of the statement that the function is not a unicorn is always true.[38] We say that is never true or always false if not- is always true. We can, if we choose, take one of the pair always, sometimes as a primitive idea, and define the other by means of the one and negation. Thus if we choose sometimes as our primitive idea, we can define: ' is always true' is to mean 'it is false that not- is sometimes true.'[39] But for reasons connected with the theory of types it seems more correct to take both always and sometimes as primitive ideas, and define by their means the negation of propositions in which they occur. That is to say, assuming that we have already defined (or adopted as a primitive idea) the negation of propositions of the type to which belongs, we define: The negation of ' always' is 'not- sometimes'; and the negation of ' sometimes' is 'not- always.' In like manner we can re-define disjunction and the other truth-functions, as applied to propositions containing apparent variables, in terms of the definitions and primitive ideas for propositions containing no apparent
  • 65. variables. Propositions containing no apparent variables are called elementary propositions. From these we can mount up step by step, using such methods as have just been indicated, to the theory of truth-functions as applied to propositions containing one, two, three, ... variables, or any number up to , where is any assigned finite number. [38]The method of deduction is given in Principia Mathematica, vol. I. * 9. [39]For linguistic reasons, to avoid suggesting either the plural or the singular, it is often convenient to say is not always false rather than sometimes or is sometimes true. The forms which are taken as simplest in traditional formal logic are really far from being so, and all involve the assertion of all values or some values of a compound propositional function. Take, to begin with, all is . We will take it that is defined by a propositional function , and by a propositional function . E.g., if is men, will be is human; if is mortals, will be there is a time at which dies. Then all is means: ' implies ' is always true. It is to be observed that all is does not apply only to those terms that actually are 's; it says something equally about terms which are not 's. Suppose we come across an of which we do not know whether it is an or not; still, our statement all is tells us something about , namely, that if is an , then is a . And this is every bit as true when is not an as when is an . If it were not equally true in both cases, the reductio ad absurdum would not be a valid method; for the essence of this method consists in using implications in cases where (as it afterwards turns out) the hypothesis is false. We may put the matter another way. In order to understand all is , it is not necessary to be able to enumerate what terms are 's; provided we know what is meant by being an and what by being a , we can understand completely what is actually affirmed by all is , however little we may know of actual instances of either. This shows that it is not merely the actual terms that are 's that are relevant in the statement all is , but all the terms concerning which the supposition that they are 's is significant, i.e. all the terms that are 's, together with all the terms that are not 's—i.e. the whole of the appropriate logical type. What applies to statements about all applies also to statements about some. There are men, e.g., means that is human is true for some values of . Here all values of
  • 66. (i.e. all values for which is human is significant, whether true or false) are relevant, and not only those that in fact are human. (This becomes obvious if we consider how we could prove such a statement to be false.) Every assertion about all or some thus involves not only the arguments that make a certain function true, but all that make it significant, i.e. all for which it has a value at all, whether true or false. We may now proceed with our interpretation of the traditional forms of the old-fashioned formal logic. We assume that is those terms for which is true, and is those for which is true. (As we shall see in a later chapter, all classes are derived in this way from propositional functions.) Then: All is means ' implies ' is always true. Some is means ' and ' is sometimes true. No is means ' implies not- ' is always true. Some is not means ' and not- ' is sometimes true. It will be observed that the propositional functions which are here asserted for all or some values are not and themselves, but truth-functions of and for the same argument . The easiest way to conceive of the sort of thing that is intended is to start not from and in general, but from and , where is some constant. Suppose we are considering all men are mortal: we will begin with If Socrates is human, Socrates is mortal, and then we will regard Socrates as replaced by a variable wherever Socrates occurs. The object to be secured is that, although remains a variable, without any definite value, yet it is to have the same value in as in when we are asserting that implies is always true. This requires that we shall start with a function whose values are such as implies , rather than with two separate functions and ; for if we start with two separate functions we can never secure that the , while remaining undetermined, shall have the same value in both. For brevity we say always implies when we mean that implies is always true. Propositions of the form always implies
  • 67. are called formal implications; this name is given equally if there are several variables. The above definitions show how far removed from the simplest forms are such propositions as all is , with which traditional logic begins. It is typical of the lack of analysis involved that traditional logic treats all is as a proposition of the same form as is —e.g., it treats all men are mortal as of the same form as Socrates is mortal. As we have just seen, the first is of the form always implies , while the second is of the form . The emphatic separation of these two forms, which was effected by Peano and Frege, was a very vital advance in symbolic logic. It will be seen that all is and no is do not really differ in form, except by the substitution of not- for , and that the same applies to some is and some is not . It should also be observed that the traditional rules of conversion are faulty, if we adopt the view, which is the only technically tolerable one, that such propositions as all is do not involve the existence of 's, i.e. do not require that there should be terms which are 's. The above definitions lead to the result that, if is always false, i.e. if there are no 's, then all is and no is will both be true, whatever may be. For, according to the definition in the last chapter, implies means not- or which is always true if not- is always true. At the first moment, this result might lead the reader to desire different definitions, but a little practical experience soon shows that any different definitions would be inconvenient and would conceal the important ideas. The proposition always implies , and is sometimes true is essentially composite, and it would be very awkward to give this as the definition of all is , for then we should have no language left for always implies , which is needed a hundred times for once that the other is needed. But, with our definitions, all is does not imply some is , since the first allows the non-existence of and the second does not; thus conversion per accidens becomes invalid, and some moods of the syllogism are fallacious, e.g. Darapti: All is , all is , therefore some is , which fails if there is no . The notion of existence has several forms, one of which will occupy us in the next chapter; but the fundamental form is that which is derived immediately from the notion of sometimes true. We say that an argument satisfies a function if is true; this is the same sense in which the
  • 68. roots of an equation are said to satisfy the equation. Now if is sometimes true, we may say there are 's for which it is true, or we may say arguments satisfying exist This is the fundamental meaning of the word existence. Other meanings are either derived from this, or embody mere confusion of thought. We may correctly say men exist, meaning that is a man is sometimes true. But if we make a pseudo-syllogism: Men exist, Socrates is a man, therefore Socrates exists, we are talking nonsense, since Socrates is not, like men, merely an undetermined argument to a given propositional function. The fallacy is closely analogous to that of the argument: Men are numerous, Socrates is a man, therefore Socrates is numerous. In this case it is obvious that the conclusion is nonsensical, but in the case of existence it is not obvious, for reasons which will appear more fully in the next chapter. For the present let us merely note the fact that, though it is correct to say men exist, it is incorrect, or rather meaningless, to ascribe existence to a given particular who happens to be a man. Generally, terms satisfying exist means is sometimes true; but exists (where is a term satisfying ) is a mere noise or shape, devoid of significance. It will be found that by bearing in mind this simple fallacy we can solve many ancient philosophical puzzles concerning the meaning of existence. Another set of notions as to which philosophy has allowed itself to fall into hopeless confusions through not sufficiently separating propositions and propositional functions are the notions of modality: necessary, possible, and impossible. (Sometimes contingent or assertoric is used instead of possible.) The traditional view was that, among true propositions, some were necessary, while others were merely contingent or assertoric; while among false propositions some were impossible, namely, those whose contradictories were necessary, while others merely happened not to be true. In fact, however, there was never any clear account of what was added to truth by the conception of necessity. In the case of propositional functions, the three-fold division is obvious. If is an undetermined value of a certain propositional function, it will be necessary if the function is always true, possible if it is sometimes true, and impossible if it is never true. This sort of situation arises in regard to probability, for example. Suppose a ball is drawn from a bag which contains a number of balls: if all the balls are white, is white is necessary; if some are white, it is possible; if none, it is impossible. Here all that is known about is that it satisfies a certain
  • 69. propositional function, namely, was a ball in the bag. This is a situation which is general in probability problems and not uncommon in practical life —e.g. when a person calls of whom we know nothing except that he brings a letter of introduction from our friend so-and-so. In all such cases, as in regard to modality in general, the propositional function is relevant. For clear thinking, in many very diverse directions, the habit of keeping propositional functions sharply separated from propositions is of the utmost importance, and the failure to do so in the past has been a disgrace to philosophy.
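For a restricted, finite domain, the three-fold division just described can be made concrete in a few lines of code. The sketch below is not part of Russell's apparatus and is only a finite-domain analogy (his quantification is not restricted to finite collections); the bag contents and the names is_white and modality are invented here for illustration. It models a propositional function as a function from arguments to True or False, and renders "always true", "sometimes true", and "never true" as all(), any(), and neither, matching the bag-of-balls example.

    bag = ["white", "white", "black"]        # hypothetical bag of balls

    def is_white(ball):
        # a propositional function of one variable: "x is white"
        return ball == "white"

    def modality(phi, domain):
        """Classify phi over the domain as necessary, possible, or impossible."""
        if all(phi(x) for x in domain):
            return "necessary"    # the function is always true
        if any(phi(x) for x in domain):
            return "possible"     # the function is sometimes true
        return "impossible"       # the function is never true

    print(modality(is_white, bag))            # "possible", since some but not all balls are white

The point of the analogy is only that necessity, possibility, and impossibility attach to the propositional function relative to its range of arguments, not to any single proposition taken by itself.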
  • 70. CHAPTER XVI DESCRIPTIONS We dealt in the preceding chapter with the words all and some; in this chapter we shall consider the word the in the singular, and in the next chapter we shall consider the word the in the plural. It may be thought excessive to devote two chapters to one word, but to the philosophical mathematician it is a word of very great importance: like Browning's Grammarian with the enclitic , I would give the doctrine of this word if I were dead from the waist down and not merely in a prison. We have already had occasion to mention descriptive functions, i.e. such expressions as the father of or the sine of . These are to be defined by first defining descriptions. A description may be of two sorts, definite and indefinite (or ambiguous). An indefinite description is a phrase of the form a so-and-so, and a definite description is a phrase of the form the so-and-so (in the singular). Let us begin with the former. Who did you meet? I met a man. That is a very indefinite description. We are therefore not departing from usage in our terminology. Our question is: What do I really assert when I assert I met a man? Let us assume, for the moment, that my assertion is true, and that in fact I met Jones. It is clear that what I assert is not I met Jones. I may say I met a man, but it was not Jones; in that case, though I lie, I do not contradict myself, as I should do if when I say I met a man I really mean that I met Jones. It is clear also that the person to whom I am speaking can understand what I say, even if he is a foreigner and has never heard of Jones.
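Anticipating the analysis made explicit on the following page, where I met a man becomes "the function 'I met x and x is human' is sometimes true", a small finite sketch may help fix the idea. It is not part of the original text; the toy sets people_met and humans and the function met_a_man are hypothetical, and a finite domain is assumed purely for illustration. What it shows is that the truth of the statement never depends on naming any particular individual such as Jones.

    # Toy model of "I met a man" as an existential claim over a finite domain.
    people_met = ["Jones"]                    # hypothetical record of whom I met
    humans = {"Jones", "Smith", "Socrates"}   # hypothetical domain of humans

    def met_a_man(met, human_set):
        # "I met x and x is human" is sometimes true
        return any(x in human_set for x in met)

    print(met_a_man(people_met, humans))      # True, yet no particular man is a constituent
    print(met_a_man([], humans))              # False, but the statement remains significant

The second call illustrates the point made below: the statement stays significant, and merely false, even when no man at all was met.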
  • 71. But we may go further: not only Jones, but no actual man, enters into my statement. This becomes obvious when the statement is false, since then there is no more reason why Jones should be supposed to enter into the proposition than why anyone else should. Indeed the statement would remain significant, though it could not possibly be true, even if there were no man at all. I met a unicorn or I met a sea-serpent is a perfectly significant assertion, if we know what it would be to be a unicorn or a sea- serpent, i.e. what is the definition of these fabulous monsters. Thus it is only what we may call the concept that enters into the proposition. In the case of unicorn, for example, there is only the concept: there is not also, somewhere among the shades, something unreal which may be called a unicorn. Therefore, since it is significant (though false) to say I met a unicorn, it is clear that this proposition, rightly analysed, does not contain a constituent a unicorn, though it does contain the concept unicorn. The question of unreality, which confronts us at this point, is a very important one. Misled by grammar, the great majority of those logicians who have dealt with this question have dealt with it on mistaken lines. They have regarded grammatical form as a surer guide in analysis than, in fact, it is. And they have not known what differences in grammatical form are important. I met Jones and I met a man would count traditionally as propositions of the same form, but in actual fact they are of quite different forms: the first names an actual person, Jones; while the second involves a propositional function, and becomes, when made explicit: The function 'I met and is human' is sometimes true. (It will be remembered that we adopted the convention of using sometimes as not implying more than once.) This proposition is obviously not of the form I met , which accounts for the existence of the proposition I met a unicorn in spite of the fact that there is no such thing as a unicorn. For want of the apparatus of propositional functions, many logicians have been driven to the conclusion that there are unreal objects. It is argued, e.g. by Meinong,[40] that we can speak about the golden mountain, the round square, and so on; we can make true propositions of which these are the subjects; hence they must have some kind of logical being, since otherwise the propositions in which they occur would be meaningless. In such theories, it seems to me, there is a failure of that feeling for reality which ought to be preserved even in the most abstract studies. Logic, I
  • 72. should maintain, must no more admit a unicorn than zoology can; for logic is concerned with the real world just as truly as zoology, though with its more abstract and general features. To say that unicorns have an existence in heraldry, or in literature, or in imagination, is a most pitiful and paltry evasion. What exists in heraldry is not an animal, made of flesh and blood, moving and breathing of its own initiative. What exists is a picture, or a description in words. Similarly, to maintain that Hamlet, for example, exists in his own world, namely, in the world of Shakespeare's imagination, just as truly as (say) Napoleon existed in the ordinary world, is to say something deliberately confusing, or else confused to a degree which is scarcely credible. There is only one world, the real world: Shakespeare's imagination is part of it, and the thoughts that he had in writing Hamlet are real. So are the thoughts that we have in reading the play. But it is of the very essence of fiction that only the thoughts, feelings, etc., in Shakespeare and his readers are real, and that there is not, in addition to them, an objective Hamlet. When you have taken account of all the feelings roused by Napoleon in writers and readers of history, you have not touched the actual man; but in the case of Hamlet you have come to the end of him. If no one thought about Hamlet, there would be nothing left of him; if no one had thought about Napoleon, he would have soon seen to it that some one did. The sense of reality is vital in logic, and whoever juggles with it by pretending that Hamlet has another kind of reality is doing a disservice to thought. A robust sense of reality is very necessary in framing a correct analysis of propositions about unicorns, golden mountains, round squares, and other such pseudo-objects. [40]Untersuchungen zur Gegenstandstheorie und Psychologie, 1904. In obedience to the feeling of reality, we shall insist that, in the analysis of propositions, nothing unreal is to be admitted. But, after all, if there is nothing unreal, how, it may be asked, could we admit anything unreal? The reply is that, in dealing with propositions, we are dealing in the first instance with symbols, and if we attribute significance to groups of symbols which have no significance, we shall fall into the error of admitting unrealities, in the only sense in which this is possible, namely, as objects described. In the proposition I met a unicorn, the whole four words together make a significant proposition, and the word unicorn by itself is significant, in just the same sense as the word man. But the two words a
  • 73. unicorn do not form a subordinate group having a meaning of its own. Thus if we falsely attribute meaning to these two words, we find ourselves saddled with a unicorn, and with the problem how there can be such a thing in a world where there are no unicorns. A unicorn is an indefinite description which describes nothing. It is not an indefinite description which describes something unreal. Such a proposition as is unreal only has meaning when is a description, definite or indefinite; in that case the proposition will be true if is a description which describes nothing. But whether the description describes something or describes nothing, it is in any case not a constituent of the proposition in which it occurs; like a unicorn just now, it is not a subordinate group having a meaning of its own. All this results from the fact that, when is a description, is unreal or does not exist is not nonsense, but is always significant and sometimes true. We may now proceed to define generally the meaning of propositions which contain ambiguous descriptions. Suppose we wish to make some statement about a so-and-so, where so-and-so's are those objects that have a certain property , i.e. those objects for which the propositional function is true. (E.g. if we take a man as our instance of a so-and- so, will be is human.) Let us now wish to assert the property of a so-and-so, i.e. we wish to assert that a so-and-so has that property which has when is true. (E.g. in the case of I met a man, will be I met .) Now the proposition that a so-and-so has the property is not a proposition of the form . If it were, a so-and-so would have to be identical with for a suitable ; and although (in a sense) this may be true in some cases, it is certainly not true in such a case as a unicorn. It is just this fact, that the statement that a so-and-so has the property is not of the form , which makes it possible for a so-and-so to be, in a certain clearly definable sense, unreal. The definition is as follows:— The statement that an object having the property has the property means: The joint assertion of and is not always false. So far as logic goes, this is the same proposition as might be expressed by some 's are 's; but rhetorically there is a difference, because in the one case there is a suggestion of singularity, and in the other case of
  • 74. plurality. This, however, is not the important point. The important point is that, when rightly analysed, propositions verbally about a so-and-so are found to contain no constituent represented by this phrase. And that is why such propositions can be significant even when there is no such thing as a so-and-so. The definition of existence, as applied to ambiguous descriptions, results from what was said at the end of the preceding chapter. We say that men exist or a man exists if the propositional function is human is sometimes true; and generally a so-and-so exists if is so-and-so is sometimes true. We may put this in other language. The proposition Socrates is a man is no doubt equivalent to Socrates is human, but it is not the very same proposition. The is of Socrates is human expresses the relation of subject and predicate; the is of Socrates is a man expresses identity. It is a disgrace to the human race that it has chosen to employ the same word is for these two entirely different ideas—a disgrace which a symbolic logical language of course remedies. The identity in Socrates is a man is identity between an object named (accepting Socrates as a name, subject to qualifications explained later) and an object ambiguously described. An object ambiguously described will exist when at least one such proposition is true, i.e. when there is at least one true proposition of the form is a so-and-so, where is a name. It is characteristic of ambiguous (as opposed to definite) descriptions that there may be any number of true propositions of the above form—Socrates is a man, Plato is a man, etc. Thus a man exists follows from Socrates, or Plato, or anyone else. With definite descriptions, on the other hand, the corresponding form of proposition, namely, is the so-and-so (where is a name), can only be true for one value of at most. This brings us to the subject of definite descriptions, which are to be defined in a way analogous to that employed for ambiguous descriptions, but rather more complicated. We come now to the main subject of the present chapter, namely, the definition of the word the (in the singular). One very important point about the definition of a so-and-so applies equally to the so-and-so; the definition to be sought is a definition of propositions in which this phrase occurs, not a definition of the phrase itself in isolation. In the case of a so- and-so, this is fairly obvious: no one could suppose that a man was a definite object, which could be defined by itself. Socrates is a man, Plato is
  • 75. a man, Aristotle is a man, but we cannot infer that a man means the same as Socrates means and also the same as Plato means and also the same as Aristotle means, since these three names have different meanings. Nevertheless, when we have enumerated all the men in the world, there is nothing left of which we can say, This is a man, and not only so, but it is the 'a man,' the quintessential entity that is just an indefinite man without being anybody in particular. It is of course quite clear that whatever there is in the world is definite: if it is a man it is one definite man and not any other. Thus there cannot be such an entity as a man to be found in the world, as opposed to specific man. And accordingly it is natural that we do not define a man itself, but only the propositions in which it occurs. In the case of the so-and-so this is equally true, though at first sight less obvious. We may demonstrate that this must be the case, by a consideration of the difference between a name and a definite description. Take the proposition, Scott is the author of Waverley. We have here a name, Scott, and a description, the author of Waverley, which are asserted to apply to the same person. The distinction between a name and all other symbols may be explained as follows:— A name is a simple symbol whose meaning is something that can only occur as subject, i.e. something of the kind that, in Chapter XIII., we defined as an individual or a particular. And a simple symbol is one which has no parts that are symbols. Thus Scott is a simple symbol, because, though it has parts (namely, separate letters), these parts are not symbols. On the other hand, the author of Waverley is not a simple symbol, because the separate words that compose the phrase are parts which are symbols. If, as may be the case, whatever seems to be an individual is really capable of further analysis, we shall have to content ourselves with what may be called relative individuals, which will be terms that, throughout the context in question, are never analysed and never occur otherwise than as subjects. And in that case we shall have correspondingly to content ourselves with relative names. From the standpoint of our present problem, namely, the definition of descriptions, this problem, whether these are absolute names or only relative names, may be ignored, since it concerns different stages in the hierarchy of types, whereas we have to compare such couples as Scott and the author of Waverley, which both apply to the same object, and do not raise the
  • 76. problem of types. We may, therefore, for the moment, treat names as capable of being absolute; nothing that we shall have to say will depend upon this assumption, but the wording may be a little shortened by it. We have, then, two things to compare: (1) a name, which is a simple symbol, directly designating an individual which is its meaning, and having this meaning in its own right, independently of the meanings of all other words; (2) a description, which consists of several words, whose meanings are already fixed, and from which results whatever is to be taken as the meaning of the description. A proposition containing a description is not identical with what that proposition becomes when a name is substituted, even if the name names the same object as the description describes. Scott is the author of Waverley is obviously a different proposition from Scott is Scott: the first is a fact in literary history, the second a trivial truism. And if we put anyone other than Scott in place of the author of Waverley, our proposition would become false, and would therefore certainly no longer be the same proposition. But, it may be said, our proposition is essentially of the same form as (say) Scott is Sir Walter, in which two names are said to apply to the same person. The reply is that, if Scott is Sir Walter really means the person named 'Scott' is the person named 'Sir Walter,' then the names are being used as descriptions: i.e. the individual, instead of being named, is being described as the person having that name. This is a way in which names are frequently used in practice, and there will, as a rule, be nothing in the phraseology to show whether they are being used in this way or as names. When a name is used directly, merely to indicate what we are speaking about, it is no part of the fact asserted, or of the falsehood if our assertion happens to be false: it is merely part of the symbolism by which we express our thought. What we want to express is something which might (for example) be translated into a foreign language; it is something for which the actual words are a vehicle, but of which they are no part. On the other hand, when we make a proposition about the person called 'Scott,' the actual name Scott enters into what we are asserting, and not merely into the language used in making the assertion. Our proposition will now be a different one if we substitute the person called 'Sir Walter.' But so long as we are using names as names, whether we say Scott or whether we say Sir Walter is as irrelevant to what we are asserting as whether we speak
  • 77. English or French. Thus so long as names are used as names, Scott is Sir Walter is the same trivial proposition as Scott is Scott. This completes the proof that Scott is the author of Waverley is not the same proposition as results from substituting a name for the author of Waverley, no matter what name may be substituted. When we use a variable, and speak of a propositional function, say, the process of applying general statements about to particular cases will consist in substituting a name for the letter , assuming that is a function which has individuals for its arguments. Suppose, for example, that is always true; let it be, say, the law of identity, . Then we may substitute for any name we choose, and we shall obtain a true proposition. Assuming for the moment that Socrates, Plato, and Aristotle are names (a very rash assumption), we can infer from the law of identity that Socrates is Socrates, Plato is Plato, and Aristotle is Aristotle. But we shall commit a fallacy if we attempt to infer, without further premisses, that the author of Waverley is the author of Waverley. This results from what we have just proved, that, if we substitute a name for the author of Waverley in a proposition, the proposition we obtain is a different one. That is to say, applying the result to our present case: If is a name, is not the same proposition as the author of Waverley is the author of Waverley, no matter what name may be. Thus from the fact that all propositions of the form are true we cannot infer, without more ado, that the author of Waverley is the author of Waverley. In fact, propositions of the form the so-and-so is the so-and-so are not always true: it is necessary that the so-and-so should exist (a term which will be explained shortly). It is false that the present King of France is the present King of France, or that the round square is the round square. When we substitute a description for a name, propositional functions which are always true may become false, if the description describes nothing. There is no mystery in this as soon as we realise (what was proved in the preceding paragraph) that when we substitute a description the result is not a value of the propositional function in question. We are now in a position to define propositions in which a definite description occurs. The only thing that distinguishes the so-and-so from a so-and-so is the implication of uniqueness. We cannot speak of the inhabitant of London, because inhabiting London is an attribute which is
• 78. not unique. We cannot speak about the present King of France, because there is none; but we can speak about the present King of England. Thus propositions about the so-and-so always imply the corresponding propositions about a so-and-so, with the addendum that there is not more than one so-and-so. Such a proposition as Scott is the author of Waverley could not be true if Waverley had never been written, or if several people had written it; and no more could any other proposition resulting from a propositional function φx by the substitution of the author of Waverley for x. We may say that the author of Waverley means the value of x for which 'x wrote Waverley' is true. Thus the proposition the author of Waverley was Scotch, for example, involves: (1) 'x wrote Waverley' is not always false; (2) 'if x and y wrote Waverley, x and y are identical' is always true; (3) 'if x wrote Waverley, x was Scotch' is always true. These three propositions, translated into ordinary language, state: (1) at least one person wrote Waverley; (2) at most one person wrote Waverley; (3) whoever wrote Waverley was Scotch. All these three are implied by the author of Waverley was Scotch. Conversely, the three together (but no two of them) imply that the author of Waverley was Scotch. Hence the three together may be taken as defining what is meant by the proposition the author of Waverley was Scotch. We may somewhat simplify these three propositions. The first and second together are equivalent to: There is a term c such that 'x wrote Waverley' is true when x is c and is false when x is not c. In other words, There is a term c such that 'x wrote Waverley' is always equivalent to 'x is c.' (Two propositions are equivalent when both are true or both are false.) We have here, to begin with, two functions of x, 'x wrote Waverley' and 'x is c,' and we form a function of c by considering the equivalence of these two functions of x for all values of x; we then proceed to assert that the resulting function of c is sometimes true, i.e. that it is true for at least one value of c. (It obviously cannot be true for more than one value of c.) These