- AFR
- Sub Saharan Africa
- COICOP
- Classification of Individual Consumption by Purpose
- CRAN
- Comprehensive R Archive Network
- CTBIL
- Contingency Table-Based Information Loss
- DHS
- Demographic and Health Surveys
- DIS
- Data Intrusion Simulation
- EAP
- East Asia and the Pacific
- ECA
- Europe and Central Asia
- EU
- European Union
- GIS
- Geographical Information System
- GPS
- Global Positioning System
- GUI
- Graphical User Interface
- HIV/AIDS
- Human Immunodeficiency Virus/Acquired Immune Deficiency Syndrome
- I2D2
- International Income Distribution Database
- IHSN
- International Household Survey Network
- LAC
- Latin America and the Caribbean
- LSMS
- Living Standards Measurement Survey
- MDAV
- Maximum Distance Average Vector
- MDG
- Millennium Development Goal
- MENA
- Middle East and North America
- MICS
- Multiple Indicator Cluster Survey
- MME
- Mean Monthly Expenditures
- MMI
- Mean Monthly Income
- MSU
- Minimal Sample Uniques
- NSI
- National Statistical Institute
- NSO
- National Statistical Office
- OECD
- Organization for Economic Cooperation and Development
- PARIS21
- Partnership in Statistics for Development in the 21st century
- PRAM
- Post Randomization Method
- PC
- Principal Component
- PUF
- Public Use File
- SA
- South Asia
- SDC
- Statistical Disclosure Control
- SSE
- Sum of Squared Errors
- SHIP
- Survey-based Harmonized Indicators Program
- SUDA
- Special Uniques Detection Algorithm
- SUF
- Scientific Use File
- UNICEF
- United Nations Children’s Fund
- Administrative data
- Data collected for administrative purposes by government agencies. Typically, administrative data require specific SDC methods.
- Anonymization
- Use of techniques that convert confidential data into anonymized data/ removal or masking of identifying information from datasets.
- Attribute disclosure
- Attribute disclosure occurs if an intruder is able to determine new characteristics of an individual or organization based on the information available in the released data.
- Categorical variable
- A variable that takes values over a finite set, e.g., gender. Also called factor in R.
- Confidentiality
- Data confidentiality is a property of data, usually resulting from legislative measures, which prevents it from unauthorized disclosure. [2]
- Confidential data
- Data that will allow identification of an individual or organization, either directly or indirectly. [1]
- Continuous variable
- A variable with which numerical and arithmetic operations can be performed, e.g., income.
- Data protection
- Data protection refers to the set of privacy-motivated laws, policies and procedures that aim to minimize intrusion into respondents’ privacy caused by the collection, storage and dissemination of personal data. [2]
- Deterministic methods
- Anonymization methods that follow a certain algorithm and produce the same results if applied repeatedly to the same data with the same set of parameters.
- Direct identifier
- A variable that reveals directly and unambiguously the identity of a respondent, e.g., names, social identity numbers.
- Disclosure
- Disclosure occurs when a person or an organization recognizes or learns something that they did not already know about another person or organization through released data. [1] See also Identity disclosure, Attribute disclosure and Inferential disclosure.
- Disclosure risk
- A disclosure risk occurs if an unacceptably narrow estimation of a respondent’s confidential information is possible or if exact disclosure is possible with a high level of confidence. [2] Disclosure risk also refers to the probability that successful disclosure could occur.
- End user
- The user of the released microdata file after anonymization. Who is the end user depends on the release type.
- Factor variable
- Factor variables are one way to classify categorical variables in R.
- Hierarchical structure
- Data is made up of collections of records that are interconnected through links, e.g., individuals belonging to groups/households or employees belonging to companies.
- Identifier
- An identifier is a variable/ information that can be used to establish identity of an individual or organization. Identifiers can lead to direct or indirect identification.
- Identity disclosure
- Identity disclosure occurs if an intruder associates a known individual or organization with a released data record.
- Indirect identification
- Indirect identification occurs when the identity of an individual or organization is disclosed, not using direct identifiers but through a combination of unique characteristics in key variables. [1]
- Inferential disclosure
- Inferential disclosure occurs if an intruder is able to determine the value of some characteristic of an individual or organization more accurately with the released data than otherwise would have been possible.
- Information loss
- Information loss refers to the reduction of the information content in the released data relative to the information content in the raw data. Information loss is often measured with respect to common analytical measures, such as regressions and indicators. See also Utility.
- Interval
- A set of numbers between two designated endpoints that may or may not be included. Brackets (e.g., [0, 1]) denote a closed interval, which includes the endpoints 0 and 1. Parentheses (e.g., (0, 1) denote an open interval, which does not include the endpoints.
- Intruder
- A user who misuses released data by trying to disclose information about an individual or organization, using a set of characteristics known to the user.
- k-anonymity
- The risk measure k-anonymity is based on the principle that the number of individuals in a sample sharing the same combination of values (key) of categorical key variables should be higher than a specified threshold k.
- Key
- A combination or pattern of key variables/quasi-identifiers.
- Key variables
- A set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset. Key variables are also called “quasi-identifiers” or “implicit identifiers”.
- Microaggregation
- Anonymization method that is based on replacing values for a certain variable with a common value for a group of records. The grouping of records is based on a proximity measure of variables of interest. The groups of records are also used to calculate the replacement value.
- Microdata
- A set of records containing information on individual respondents or on economic entities. Such records may contain responses to a survey questionnaire or administrative forms.
- Noise addition
- Anonymization method based on adding or multiplying a stochastic or randomized number to the original values to protect data from exact matching with external files. Noise addition is typically applied to continuous variables.
- Non-perturbative methods
- Anonymization methods that reduce the detail in the data or suppress certain values (masking) without distorting the data structure.
- Observation
- A set of data derived from an object/unit of experiment, e.g., an individual (in individual-level data), a household (in household-level data) or a company (in company data). Observations are also called “records”.
- Original data
- The data before SDC/anonymization methods were applied. Also called “raw data” or “untreated data”.
- Outlier
- An unusual value that is correctly reported but is not typical of the rest of the population. Outliers can also be observations with an unusual combination of values for variables, such as 20-year-old widow. On their own age, 20 and widow are not unusual values, but their combination may be. [1]
- Perturbative methods
- Anonymization methods that alter values slightly to limit disclosure risk by creating uncertainty around the true values, while retaining as much content and structure as possible, e.g. microaggregation and noise addition.
- Population unique
- The only record in the population with a particular set of characteristics, such that the individual or organization can be distinguished from other units in the population based on that set of characteristics.
- Post Randomization Method (PRAM)
- Anonymization method for microdata in which the scores of a categorical variable are altered according to certain probabilities. It is thus intentional misclassification with known misclassification probabilities. [1]
- Probabilistic methods
- Anonymization methods that depend on a probability mechanism or a random number-generating mechanism. Every time a probabilistic method is used, a different outcome is generated.
- Privacy
- Privacy is a concept that applies to data subjects while confidentiality applies to data. The concept is defined as follows: "It is the status accorded to data which has been agreed upon between the person or organization furnishing the data and the organization receiving it and which describes the degree of protection which will be provided." [2]
- Public Use File (PUF)
- Type of release of microdata file, which is freely available to any user, for example on the internet.
- Quasi-identifiers
- A set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset. Quasi-identifiers are also called “key variables” or “implicit identifiers”.
- Raw data
- The data before SDC/anonymization methods were applied. Also called “original data” or “untreated data”.
- Recoding
- Anonymization method for microdata in which groups of existing categories/values are replaced with new values, e.g. the values ‘protestant’, and ‘catholic’ are replaced with ‘Christian’. Recoding reduces the detail in the data. Recoding of continuous variables leads to a transformation from continuous to categorical, e.g. creating income bands.
- Record
- A set of data derived from an object/unit of experiment, e.g., an individual (in individual-level data), a household (in household-level data) or a company (in company data). Records are also called “observations”.
- Regression
- A statistical process of measuring the relation between the mean value of one variable and corresponding values of other variables.
- Re-identification risk
- See Disclosure risk
- Release
- Dissemination – the release to users of information obtained through a statistical activity. [2]
- Respondents
- Individuals or units of observation whose information/responses to a survey make up the data file.
- Sample unique
- The only record in the sample with a particular set of characteristics, such that the individual or organization can be distinguished from other units in the sample based on that set of characteristics.
- Scientific Use File (SUF)
- Type of release of microdata file, which is only available to selected researchers under contract. Also known as “licensed file”, “microdata under contract” or “research file”.
- sdcMicro
- An R based package authored by Templ, M., Kowarik, A. and Meindl, B. with tools for the anonymization of microdata, i.e. for the creation of public- and scientific-use files.
- sdcMicroGUI
- A GUI for the R based sdcMicro package, which allows users to use the sdcMicro tools without R knowledge.
- Sensitive variables
- Sensitive or confidential variables are those whose values must not be discovered for any respondent in the dataset. The determination of sensitive variables is often subject to legal and ethical concerns.
- Statistical Disclosure Control (SDC)
- Statistical Disclosure Control techniques can be defined as the set of methods to reduce the risk of disclosing information on individuals, businesses or other organizations. Such methods are only related to the dissemination step and are usually based on restricting the amount of or modifying the data released. [2]
- Suppression
- Data suppression involves not releasing information that is considered unsafe because it fails confidentiality rules being applied. Sometimes this is done is by replacing values signifying individual attributes with missing values. In the context of this guide, usually to achieve a desired level of k- anonymity.
- Threshold
- An established level, value, margin or point at which values that fall above or below it will deem the data safe or unsafe. If unsafe, further action will need to be taken to reduce the risk of identification.
- Utility
- Data utility describes the value of data as an analytical resource, comprising analytical completeness and analytical validity.
- Untreated data
- The data before SDC/anonymization methods were applied. Also called “raw data” or “original data”.
- Variable
- Any characteristic, number or quantity that can be measured or counted for each unit of observation.
[1] | (1, 2, 3, 4, 5) Australian Bureau of Statistics, http://www.nss.gov.au/nss/home.nsf/pages/Confidentiality+-+Glossary |
[2] | (1, 2, 3, 4, 5, 6) OECD, http://stats.oecd.org/glossary |