Data Preparation

Video Documentation

The following video will walk you through this section and how to use the various interactive widgets.



This section processes the excel file that is inputted. The excel file must contain time-series data adhering to the following rules:

  • data should be in wide-format (the columns are different timestamps, rows are kreise)
  • one sheet cannot have more than one variable (e.g., unemployment rate)
  • have one column containing the kreis-level area code ags5 ("amtlicher gemeindeschlüssel")

Time-series data excel workbook

Upload Excel Workbook

export excel workbook

When you confirm that you are using an excel workbook containing time-series data, you are prompted to upload the data. In this sample screenshot, you can see that the file "7444_318010_BMWI_Enkelmann_Eckdaten_Zeitreihe_Kreise.xlsx" is being uploaded.

Select data to clean

unemployment rate data

This is the assumed workbook for labor market data:

  • The index column "Region" shows that each row is a record of one kreis.
  • The rest of the columns should represent each year and month.
  • Each worksheet represents a different variable. For example, the current worksheet selected is "Alo_Quote", which stands for unemployment rate ("Arbeitslosenquote").
  • It is important that there is a sheet which is called "Alo_Quote" which contain the unemployment rate data.

Select variable

select variable

The default variable is set to "unemployment rate".

Select data format

select data format

Long format

long format one variable

This is an example of a single-variable "long" format data. The dataframe contains three columns: the kreis code (ags5), the time (date), and the variable (e.g. Alo Quote). The number of rows of this file is 401 kreis * #dates for each kreis.

Wide format

wide format one variable

This is an example of a single-variable "wide" format data. All the columns in the dataframe are dates. The index of the dataframe is the ags5 code. The number of rows of this file should be 401.

Final dataset cleaning

In this case, cropping to a certain dataframe: removing data from certain time periods that we may not want the mode to learn from.

Loading dataset

csv

Here, you load the data you would like to clean. By default, it loads the data that you have confirmed processing and merging in the last section.

Cropping timeframe

Pro Tip: The purpose of cropping the appropriate timeframe helps differentiate between a normal-time and crisis-time model. For example, we know that there was the economic crisis around 2008 and the COVID pandemic around 2020. Thus, when doing a normal-time unemployment rate prediction, you could crop out those times so that the model is only learning the pattern from normal-time data.

csv

This is where you can select the timeframe you would like to feed into the model. In this example, a timeframe between 2007-05 and 2007-07 is selected and shown in the preview. Again, once you confirm the dataset, it would be automatically loaded in the model on the next page.