From 8d95737e7b91f7cbc3d425f86cd0dc22f952120e Mon Sep 17 00:00:00 2001
From: thijsbenschop
Date: Tue, 30 Oct 2018 12:41:35 +0100
Subject: [PATCH] Add info on alpha for frequency calculation for reference in sdcApp guide

---
 measure_risk.rst | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
 sdcMicro.rst     | 10 ++++++++--
 2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/measure_risk.rst b/measure_risk.rst
index 3c9bf62..c84a2da 100644
--- a/measure_risk.rst
+++ b/measure_risk.rst
@@ -383,6 +383,58 @@ frequency.
    10   Urban      Female   Secondary incomplete   Non-LF         76       2               262             0.0074
    ==== ========== ======== ====================== ============== ======== =============== =============== ========
 
+In the calculation of frequencies, missing values (‘NA’s in R [8]) are treated as if they
+could take any valid value of the variable under consideration. :numref:`tab46` reproduces
+the records from :numref:`tab41`. However, in record 4, the values for the variables
+Education level and Labor status are recorded as missing. The sample frequencies :math:`f_{k}`
+for records 4 and 8 change as a result of the missing values. Record 4 could
+have the same key as record 6 or record 8, depending on the true value of the recorded
+missing values. Therefore, the sample frequency :math:`f_{k}` of record 4 increases to 3.
+The key of record 8 could coincide with the key of record 4, based on the missing values. The
+sample frequency :math:`f_{k}` of record 8 changes to 2.
+
+This treatment of missing values in the calculation of sample frequencies may lead
+to an overestimation of the sample frequencies. The value for the variable
+Education level of record
+4 is once interpreted as Secondary complete and once as Post-secondary.
+However, record 4 has only one true (unknown) value for the variable Education level. The same
+holds true for the variable Labor status.
+In order to take this observation into account, it is possible to count matches of keys
+that are based on one or more missing values in the keys with a weight of less than 1.
+This reflects the probability distribution of the true value.
+
+The parameter alpha specifies the weight of a match based on missing values
+in the sample frequencies. Alpha can take a value between 0 and 1; the default is 1.
+The same value for the parameter
+alpha applies to the frequency calculation of all key variables
+in the complete dataset. The last column in :numref:`tab46` shows the sample
+frequencies with alpha equal to 0.5. The
+other risk measures, such as population frequencies and individual risk,
+are calculated based on the adapted sample frequencies :math:`f_{k}`.
+
+.. _tab46:
+
+.. table:: Example dataset showing sample frequencies,
+   population frequencies and individual disclosure risk
+   :widths: auto
+   :align: center
+
+
+   ==== ========== ======== ====================== ============== =============== ==================================
+   No   Residence  Gender   Education level        Labor status   :math:`f_{k}`   :math:`f_{k}` with alpha = 0.5
+   ==== ========== ======== ====================== ============== =============== ==================================
+   1    Urban      Female   Secondary incomplete   Employed       2               2
+   2    Urban      Female   Secondary incomplete   Employed       2               2
+   3    Urban      Female   Primary incomplete     Non-LF         1               1
+   4    Urban      Male     NA/missing             NA/missing     **3**           3
+   5    Rural      Female   Secondary complete     Unemployed     1               1
+   6    Urban      Male     Secondary complete     Employed       2               1.5
+   7    Urban      Female   Primary complete       Non-LF         1               1
+   8    Urban      Male     Post-secondary         Unemployed     **2**           1.5
+   9    Urban      Female   Secondary incomplete   Non-LF         2               2
+   10   Urban      Female   Secondary incomplete   Non-LF         2               2
+   ==== ========== ======== ====================== ============== =============== ==================================
+
 In :numref:`code41`, we show how to use the *sdcMicro* package to create a list
 of sample frequencies :math:`f_{k}` for each record
 in a dataset. This is done by using the *sdcMicro* function freq(). A value of 2 for
diff --git a/sdcMicro.rst b/sdcMicro.rst
index 4eb906c..cae7645 100644
--- a/sdcMicro.rst
+++ b/sdcMicro.rst
@@ -292,6 +292,10 @@ sufficient to write the name of the object.
     # sensitive variables for l-diversity computation
     selectedSensibleVar = c('health')
 
+    # Options
+    alphaVal <- 1 # parameter alpha for frequency calculation
+    seedVal <- 12345 # seed
+
     # creating the sdcMicro object with the assigned variables
     sdcInitial <- createSdcObj(dat = file,
                                keyVars = selectedKeyVars,
@@ -300,8 +304,10 @@ sufficient to write the name of the object.
                                weightVar = selectedWeightVar,
                                pramVars = selectedPramVars,
                                hhId = selectedHouseholdID,
-                               strataVar = selectedStrataVar,
-                               sensibleVar = selectedSensibleVar)
+                               strataVar = selectedStrataVar,
+                               sensibleVar = selectedSensibleVar,
+                               alpha = alphaVal,
+                               seed = seedVal)
 
     # Summary of object
     sdcInitial
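
The weighting by alpha described in the measure_risk.rst hunk can be illustrated with a small base R sketch. It does not call *sdcMicro*; the helper `weighted_fk()` and the column names are hypothetical. The weighting rule used here is an assumption: a matching record whose own key contains a missing value contributes alpha, and every other match contributes 1. It was chosen because it reproduces the last column of the example table, and it is not a confirmed description of how *sdcMicro* computes frequencies internally.

```r
# Example records from the table (key variables only); NA marks a missing value.
dat <- data.frame(
  residence = c("Urban", "Urban", "Urban", "Urban", "Rural",
                "Urban", "Urban", "Urban", "Urban", "Urban"),
  gender    = c("Female", "Female", "Female", "Male", "Female",
                "Male", "Female", "Male", "Female", "Female"),
  educ      = c("Secondary incomplete", "Secondary incomplete",
                "Primary incomplete", NA, "Secondary complete",
                "Secondary complete", "Primary complete", "Post-secondary",
                "Secondary incomplete", "Secondary incomplete"),
  labor     = c("Employed", "Employed", "Non-LF", NA, "Unemployed",
                "Employed", "Non-LF", "Unemployed", "Non-LF", "Non-LF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of sdcMicro): sample frequencies in which
# missing values act as wildcards.
# Assumed rule: a matching record j != k whose key contains an NA adds alpha
# to f_k; every other matching record (including k itself) adds 1.
weighted_fk <- function(x, alpha = 1) {
  m <- as.matrix(x)                     # character matrix, NAs preserved
  n <- nrow(m)
  has_na <- rowSums(is.na(m)) > 0       # records with at least one missing key value
  fk <- numeric(n)
  for (k in seq_len(n)) {
    for (j in seq_len(n)) {
      # Two keys match if each variable is equal or missing in either record.
      agree <- is.na(m[k, ]) | is.na(m[j, ]) | m[k, ] == m[j, ]
      if (all(agree)) {
        w <- if (j != k && has_na[j]) alpha else 1
        fk[k] <- fk[k] + w
      }
    }
  }
  fk
}

fk_default <- weighted_fk(dat, alpha = 1)
fk_half    <- weighted_fk(dat, alpha = 0.5)
print(data.frame(No = seq_len(nrow(dat)), fk = fk_default, fk_alpha_0.5 = fk_half))
```

With `alpha = 1` the sketch returns the unweighted frequencies of the example table (3 for record 4, 2 for records 6 and 8). With `alpha = 0.5`, records 6 and 8 drop to 1.5 while record 4 stays at 3.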