3 Missing data
It is very often that some countries do not reporter the weight of the commodity traded, which might generate the loss of information, and also the comparison between regions will not be able to be performed. In this section, we describe all the processes involved in the estimation of the missing, imputation methods, mathematical correction, and the validation tool.
There are some codes that should be removed from the the imputation module. These codes were listed by the FIAS team and it is stored in the datatable fishtrade_hs_not_impute
under Fisheries Commodities domain.
Data imputation summary
- Data source:
- Domain: Fisheries Commodities
- Input data:
fishtrade_built_data
- Output data:
fishtrade_data_imputed
: The data imputed is stored in this table to be validated by the expert. It is known as a working table.fishtrade_data_imputed_validated
: This is almost a copy from thefishtrade_data_imputed
datatable, but in this datatable the user is not allowed to change anything. All changes should be done over the datatablefishtrade_data_imputed
, and then can be export to the final data tablefishtrade_data_imputed_validated
.
The imputation is performed at the partner level, i.e, the shipments are aggregated before the imputation.
- There are 16 methods to impute the missing weights.
- 8 methods perform the imputation at Tariff Line level.
- 8 methods perform the imputation at 6 digits (HS) level.
Some automatic imputations can not be achieved. The main reason for this is there is no enough information to execute any method robustly. It is necessary the user’s intervention (manual imputation)
Imputation Flow
Figure 3.1 shows the imputation module flow. The first step is to build the helper tables, which use as input the data tables stored in the trade-input-data
domain. Once the helper tables have been built, the next step is to get the data to be imputed from the data table fishtrade_built_data
. The Imputation Module output is stored at the data table fishtrade_data_imputed
. This data table works as a working table because the user can check the imputation by performing the manual imputation if needed. After the expert validation the data is exported to the production data table fishtrade_data_imputed_validated
.
The order to apply the methods can be found at Section 3.4, and also links to the source of the functions.
Figure 3.1: Imputation Module Flow.
3.1 Mathematical conversion
The first step in the Imputation Module is the mathematical conversion. This step is not considered as an imputation method once we are just converting the units reported by the country.
In Figure 3.2, the left side shows the conversions applied in the quantity variable and also the weight. The right side shows all type of unit reported by the countries. When some countries do not report the weight value, but they report the quantity and also the quantity unit, hence it is possible to retrieve the weight, if the unit is reported in kilogram (code = 8). The other conversions aim to put all quantities in the same unit.
- Function:
unit_conversion
Figure 3.2: Imputation module: mathematical conversions
3.2 Partially Agreggation
The tariff line data from UNSD contains multiple shipments with identical combination of reporter/partner/commodity/flow/year/qunit (unit of quantity). Those are transactions registered separately, thus rows containinig non-missing values and quantities can be aggregated. Missing variables of the same type are also aggregated if they are all missing, as they will produce a missing aggregated value for missing disaggregated values while correctly summing the remaining variables. In Figure 3.3, we show an example of this. The green rectangle shows the values reported that can be aggregated, and in the blue rectangle shows the missing values that will also aggregated. This figure show how the is reported, and after the partial aggregation.
Note that the combination of year = 2016 | reporter = 76| partner = 764 | flow = 1 | hs = 030119000000 | qunit = 5 after the partially aggregation result in two rows. One composed by the values aggregated, and another with the missing values aggregated.
Figure 3.3: Imputation module: partially aggregation
3.3 Methods
In this section, we describe the methods used to impute the missing weight values. In Sections 3.3.1 and 3.3.3, we introduce the Helper Tables used to estimate the weight per unit, and the Unit Values for several levels of aggregation respectively. The imputation methods that use these Helper tables are described in the Section 3.3.5. The other methods are explained in next sections.
3.3.1 Helper Tables
In this Section, we show the shape of the table and variables which composed each auxiliary table. For each table we describe brefily how the the values were estimated. For a full detailed explanation can be found in the Sections 3.3.2 and 3.3.3.
Figure 3.4: Helper Table: weight estimated per quantity unit for the levels of aggregation: reporter-year; reporter; region-year.
Figure 3.5: Helper Table: weight estimated per quantity unit for the levels of aggregation: region; HS at the six digits level; year.
Figure 3.6: Helper Table: unit value estimated for several levels of aggregation: reporter-year; region-year; global-year.
3.3.2 Estimate weight per unit
3.3.2.1 General Approach
Figure 3.7: Helper Tables: general approach that is used in the building of the helper tables.
3.3.2.2 Reporter - Year
Figure 3.8: Helper Tables: weight per unit estimated by the median of partners by year
3.3.2.3 Reporter
Figure 3.9: Helper Tables: weight per unit estimated by the median of partners across the years.
3.3.2.4 Region-Year
Figure 3.10: Helper Tables: weight per unit estimated by the median of region by year.
3.3.2.5 Global-Year
Figure 3.11: Helper Tables: weight per unit estimated by the median of commodity by year.
3.3.2.6 Global
Figure 3.12: Helper Tables: weight per unit estimated by the median of commodity.
3.3.3 Estimate Unit Value
3.3.3.1 General Approach
Figure 3.13: Helper Tables: general approach that is used in the building of the helper tables.
3.3.3.2 Reporter - Year
Figure 3.14: Helper Tables: unit value estimated at the tariff line level by the median of partners by year.
3.3.3.3 Region - Year
Figure 3.15: Helper Tables: unit value estimated by the median of region by year using 6 digits (HS).
3.3.3.4 Global - Year
Figure 3.16: Helper Tables: unit value estimated by the median of commodity using 6 digits (HS)
3.3.4 Imputation by partially reported weight
Figure 3.17: Helper Tables: Impute the missing weight with the ratio between the given value and unit value calculated using only the valid monetary values. (TL level).
3.3.5 Imputation using the helper tables
Figure 3.18: Helper Tables: imputation methods using the auxiliary tables.
3.3.6 Interpolate Unit Value
Figure 3.19: Helper Tables: Interpolate the UV using the previous, or next (year) available value, by reporter, partner, flow, and TL.
3.3.7 Unit Value Total
Figure 3.20: Helper Tables: The Unit Value Total is the ratio between the summing of the weights and the summing of the quantities, by year, reporter, flow, and TL.
3.3.8 Interpolate Unit Value Total
Figure 3.21: Helper Tables: Interpolate using the Unit Value Total (TL level) of previous year.
3.3.9 Global Year Expanded
Figure 3.22: Helper Tables: Expand the Global-Year table by all the combination among: year, flow, and HS.
3.3.10 Global Year Expanded - HS6
Figure 3.23: Helper Tables: Compute the median of Unit Value found in the previous step by year, flow, and HS.
3.4 Functions
The imputation methods are applied as the order below:
- Imputation by partially reported weight
- Level: Tariff Line
- Function: partially_aggregation
- Interpolate Unit Value
- Level: Tariff Line
- Function: uv_hs_interpolate
- Reporter - Year
- Level: Tariff Line
- Function: wq_uv_imputation
- Unit Value: Reporter - Year
- Level: Tariff Line
- Function: wq_uv_imputation
- Reporter - Weight
- Level: Tariff Line
- Function: wq_uv_imputation
- Unit Value Total
- Level: Tariff Line
- Function: wq_uv_imputation
- Unit Value Total
- Level: Tariff Line
- Function: wq_uv_imputation
- Unit Value: Region - Year
- Level: HS at the six digits level
- Function: wq_uv_imputation
- Unit Value: Global - Year
- Level: HS at the six digits level
- Function: wq_uv_imputation
- Region
- Level: HS at the six digits level
- Function: wq_uv_imputation
- Global - year
- Level: HS at the six digits level
- Function: wq_uv_imputation
- Global
- Level: HS at the six digits level
- Function: wq_uv_imputation
- Interpolate Unit Value Total
- Level: HS at the six digits level
- Function: run_imputation_module
- Global Year Expanded
- Level: HS at the six digits level
- Function: run_imputation_module
- Global Year Expanded - HS6
- Level: HS at the six digits level
- Function: run_imputation_module