6 Exploratory Data Analysis

6.1 The PDF File

The automated exploratory data analysis file, reports/training_set_eda.Rmd, is written in R Markdown and is re-knit automatically every time the user runs the feature engineering script fe.R. The result is reports/training_set_eda_DATE.pdf, which includes a number of summary statistics as well as uni- and multivariate visualizations of the variables in the training set.

6.1.1 Types of EDA Performed

6.1.1.1 Summary Statistics

Summary statistics are generated for all numerical variables. They include the minimum, first quartile, mean, median, third quartile, maximum, and percentage of missing values. These statistics help when assessing skew and choosing an appropriate measure of center (e.g., when the mean and median differ substantially), and the percentage of missing values helps when deciding whether to impute missing values or to simply discard them.
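As a rough illustration of the statistics listed above, the following base R sketch computes them for every numerical column of a data frame. The helper name summarize_numeric and the toy data are hypothetical and are not taken from reports/training_set_eda.Rmd, whose actual implementation may differ.

```r
# Hypothetical helper: the seven summary statistics for one numeric vector.
summarize_numeric <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  c(
    min         = min(x, na.rm = TRUE),
    q1          = q[1],
    mean        = mean(x, na.rm = TRUE),
    median      = median(x, na.rm = TRUE),
    q3          = q[2],
    max         = max(x, na.rm = TRUE),
    pct_missing = 100 * mean(is.na(x))  # share of NA values, in percent
  )
}

# Toy data for illustration only (not the real training set).
toy <- data.frame(
  price = c(1, 2, 3, 4, 100, NA),
  qty   = c(5, 5, 6, NA, NA, 7)
)

# One row per numerical variable, one column per statistic.
num_cols <- names(toy)[sapply(toy, is.numeric)]
stats <- t(sapply(toy[num_cols], summarize_numeric))
print(round(stats, 2))
```

In the toy price column the mean (22) is far above the median (3), which is the kind of gap that points to right skew.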

6.1.1.2 Distribution Plots

Distribution plots (namely histograms and density plots) are created for all numerical variables. These plots help detect skew (which may suggest the need for a logarithmic transformation) and help check whether parametric assumptions (e.g., normality) are satisfied. The R Markdown script automatically detects whether a variable is highly skewed (based on the third sample moment), and in that case it log-transforms the variable before plotting it. This eases visualization, and the transformation is easy to notice thanks to the \(x\)-axis label, which will include the text “(log)” after the variable name.
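The skew check described above could look roughly like the sketch below. It assumes that "third sample moment" means the standardized third central moment, that the cutoff is 1, and that only strictly positive, right-skewed variables are log-transformed. None of these details are stated in the text, and the names sample_skewness, skew_threshold, and prepare_for_plot are hypothetical.

```r
# Sample skewness: third central moment divided by the cubed standard
# deviation (both computed with a 1/n denominator).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Assumed cutoff; the threshold used by the actual report is not documented here.
skew_threshold <- 1

# Returns the values to plot and the x-axis label, adding "(log)" when the
# variable is log-transformed.
prepare_for_plot <- function(x, var_name) {
  can_log <- all(x > 0, na.rm = TRUE)  # log() needs strictly positive values
  if (can_log && sample_skewness(x) > skew_threshold) {
    list(values = log(x), label = paste(var_name, "(log)"))
  } else {
    list(values = x, label = var_name)
  }
}

# Toy vectors for illustration only.
skewed    <- prepare_for_plot(c(1, 1, 2, 2, 3, 1000), "toy_price")
symmetric <- prepare_for_plot(c(1, 2, 3, 4, 5), "toy_quantity")
print(skewed$label)
print(symmetric$label)
```

The returned label can be passed directly as the xlab argument of hist() or plot(density(...)).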

6.1.1.3 Boxplots

Boxplots are created for all numerical variables, and they once again help detect skew. Further, boxplots are useful in that they can highlight the number of outliers (i.e., observations more than 1.5 IQRs below the first quartile or above the third quartile), which is something that might not be immediately visible in histograms or density plots.
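The 1.5 IQR rule mentioned above can be checked numerically with a few lines of base R. The function name count_outliers is hypothetical. Note that this sketch uses quantile() with its default settings, whereas boxplot() derives its fences from hinges, so counts can differ slightly on small samples.

```r
# Count observations more than 1.5 IQRs below Q1 or above Q3.
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[2] + 1.5 * iqr
  sum(x < lower_fence | x > upper_fence, na.rm = TRUE)
}

# Toy data for illustration only: a single extreme value.
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
n_outliers <- count_outliers(toy_values)
print(n_outliers)
```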

6.1.1.4 Correlation Plots

Correlation plots are generated for all numerical variables, and their purpose is twofold. First, they can help detect non-linear relationships between predictors and the response, which might cause us to rethink using a linear model. Second, they help detect predictors that are highly correlated with each other, which can lead to multicollinearity in some models (indicated by a high variance inflation factor). Such high correlation is a sign that one of the two predictors should be excluded from the model.
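To complement the visual check, highly correlated predictor pairs can also be listed programmatically. The sketch below is one possible approach in base R. The function name find_correlated_pairs and the 0.9 cutoff are assumptions chosen for illustration, not values used by the report.

```r
# List pairs of numeric columns whose absolute Pearson correlation exceeds
# an (assumed) cutoff. Each pair is reported once (upper triangle only).
find_correlated_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df, use = "pairwise.complete.obs")
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only: x2 is almost a multiple of x1,
# while x3 is only weakly related to either.
toy <- data.frame(
  x1 = 1:10,
  x2 = 2 * (1:10) + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
pairs_found <- find_correlated_pairs(toy)
print(pairs_found)
```

For a model containing only the two flagged predictors, the variance inflation factor equals 1 / (1 - r^2), so a correlation close to 1 implies a very large value.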

6.1.1.5 Barplots

Barplots are created for all categorical variables. For each categorical variable, they display the number of observations in each category, ordered from highest to lowest count (with the exact count labeled on top of each bar). The very last bar always displays the count of missing values, which can suggest the need for imputation or other handling methods if the number of missing values is too high relative to the other categories. The barplots can also help with detecting the sparse categories that are being merged within the feature engineering script, so that one can assess the soundness of this merging step and take action accordingly.
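The ordering described above (categories by decreasing count, missing values last) can be reproduced with a short base R sketch. The function name count_categories, the "(missing)" label, and the toy data are hypothetical. The report itself may build its plots differently.

```r
# Category counts sorted from highest to lowest, with the count of
# missing values appended as the final entry.
count_categories <- function(x) {
  tab <- table(x, useNA = "no")                    # counts without NA
  counts <- setNames(as.vector(tab), names(tab))   # plain named vector
  counts <- sort(counts, decreasing = TRUE)
  c(counts, "(missing)" = sum(is.na(x)))
}

# Toy data for illustration only.
toy_payment <- c("cash", "credit", "cash", NA, "cash", "credit", "check", NA, NA)
bar_heights <- count_categories(toy_payment)
print(bar_heights)

# The named vector can be plotted directly, e.g.:
# barplot(bar_heights, ylab = "Number of observations")
```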

6.2 The Data Dictionary

The data dictionary is required for the automated EDA to run properly, and it needs to list every single variable of interest. It can be found in conf/meta/training_set_data_dict_VERSION.xlsx:

number  name                                  description  type  binary  role       use  comment
1       precio_unitario_item_solicitado       NA           num   N       target     Y    NA
2       presentacion_item_solicitado          NA           cat   N       predictor  N    NA
3       agricultura_familiar_item_solicitado  NA           cat   Y       predictor  Y    NA
4       produccion_nacional_item_solicitado   NA           cat   Y       predictor  Y    NA
5       contrato_abierto_llamado_grupo        NA           cat   Y       predictor  Y    NA
6       forma_adjudicacion_llamado            NA           cat   N       predictor  Y    NA
7       forma_pago_llamado                    NA           cat   N       predictor  Y    NA
8       tipo_unidad_contratacion              NA           cat   N       predictor  Y    NA
9       institucion_unidad_contratacion       NA           cat   Y       predictor  Y    NA
10      nombre_nivel_entidad                  NA           cat   N       predictor  Y    NA
11      service                               NA           cat   Y       predictor  Y    NA
12      police_buyer                          NA           cat   Y       predictor  Y    NA
13      hospital_buyer                        NA           cat   Y       predictor  Y    NA
14      health_buyer                          NA           cat   Y       predictor  Y    NA
15      law_buyer                             NA           cat   Y       predictor  Y    NA
16      ministry_buyer                        NA           cat   Y       predictor  Y    NA
17      education_buyer                       NA           cat   Y       predictor  Y    NA
18      army_buyer                            NA           cat   Y       predictor  Y    NA
19      tech_buyer                            NA           cat   Y       predictor  Y    NA
20      electricity_buyer                     NA           cat   Y       predictor  Y    NA
21      descripcion_ingles_producto_n1        NA           char  N       predictor  Y    NA
22      cantidad_item_solicitado              NA           num   Y       predictor  Y    NA
23      food_context                          NA           cat   Y       predictor  Y    NA
24      vehicle_context                       NA           cat   Y       predictor  Y    NA
25      construction_context                  NA           cat   Y       predictor  Y    NA
26      hardware_context                      NA           cat   Y       predictor  Y    NA
27      preventive_corrective_context         NA           cat   Y       predictor  Y    NA
28      real_estate_context                   NA           cat   Y       predictor  Y    NA
29      office_context                        NA           cat   Y       predictor  Y    NA
30      specialized_supplies_context          NA           cat   Y       predictor  Y    NA
31      cleaning_context                      NA           cat   Y       predictor  Y    NA
32      politics_context                      NA           cat   Y       predictor  Y    NA
33      medical_context                       NA           cat   Y       predictor  Y    NA
34      chemical_context                      NA           cat   Y       predictor  Y    NA
35      insurance_context                     NA           cat   Y       predictor  Y    NA
36      specific_brand_context                NA           cat   Y       predictor  Y    NA
37      electricity_context                   NA           cat   Y       predictor  Y    NA
38      kitchen_context                       NA           cat   Y       predictor  Y    NA
39      computer_context                      NA           cat   Y       predictor  Y    NA
40      air_conditioning_context              NA           cat   Y       predictor  Y    NA
41      spare_part_context                    NA           cat   Y       predictor  Y    NA
42      machine_context                       NA           cat   Y       predictor  Y    NA
43      fuel_context                          NA           cat   Y       predictor  Y    NA

The key fields in the data dictionary are:

  • name (the variable name, as generated by the feature engineering script)
  • type (num for numerical, cat for categorical or logical, and char for string)
  • role (predictor or target)
  • use (Y if the variable should be included in the automated EDA, N otherwise)
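To illustrate how the key fields listed above might drive the automated EDA, the sketch below builds a four-row excerpt of the data dictionary by hand and splits the variables by use, type, and role. The object names (dict, in_eda, and so on) are hypothetical. In practice the dictionary would be read from the .xlsx file (for instance with a package such as readxl), and the actual logic in the R Markdown script may differ.

```r
# Hand-built excerpt of the data dictionary (four of its rows, key fields only).
dict <- data.frame(
  name = c("precio_unitario_item_solicitado",
           "presentacion_item_solicitado",
           "forma_pago_llamado",
           "descripcion_ingles_producto_n1"),
  type = c("num", "cat", "cat", "char"),
  role = c("target", "predictor", "predictor", "predictor"),
  use  = c("Y", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Keep only the variables flagged for the automated EDA.
in_eda <- dict[dict$use == "Y", ]

# Route each variable to the matching kind of analysis.
numeric_vars     <- in_eda$name[in_eda$type == "num"]  # summary stats, histograms, boxplots
categorical_vars <- in_eda$name[in_eda$type == "cat"]  # barplots
target_var       <- in_eda$name[in_eda$role == "target"]

print(numeric_vars)
print(categorical_vars)
print(target_var)
```

Here presentacion_item_solicitado is dropped because its use field is N, so it would not appear in any plot.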