Meta Data Information
Folder Structure
Below is the folder structure for the Github Repository. All of the notebooks and scripts are organized within the below folder structure and data is inputted with relative file paths. Therefore, once you set your own working directory, everything within should run on its own.
- conf/
- data/
- geodata/
- meta/
- school_loc/
- fb/
- opencellid/
- satellite/
- speedtest/
- survey/
- Brazil/
- Thailand/
- Philippines/
- training_sets/
- Brazil/
- Thailand/
- Philippines/
- worldpop/
- model/
- notebooks/
- src/
- scripts/
- map_offline
- data_gathering
- init.py
- opendata.py
- opendata_utils.py
- opendata_scrap.py
- opendata_facebook.py
- opendata_satellite.py
- feature_engineering
- init.py
- data_pipeline.py
- configs.py
- country.py
- survey.py
- school.py
- main.py
- requirements.txt
Configurations
We use configs.py to store the variable configurations that should be changed according to use-case:
WD
- Working Directory, e.g. 'C:/Users/itu/DSSGx/'COUNTRY
- Country name for current use-case, e.g. 'Thailand'COUNTRY_CODE
- Country code for current use-case, e.g. 'tha'AVAILABLE_COUNTRIES
- Countries for which survey data with an internet connectivity ground truth variable is available, e.g. list('bra', 'tha')FEATURES
- List of predictive features for use-case, e.g. list('speedtest', 'opencell', 'facebook', 'population', 'satellite'). This exemplary list contains each of the five open data sources we have used and they must be sytactically entered as shown.SCHOOL_BUFFER
- Buffer to define school area which will be used to joining survey and population data to schools, in kilometersSURVEY_AREAS
- Survey dataset geometry join type, 'tiles' if survey will be joined to country administrative region, province, etc., 'enumeration' if survey will be joined to enumeration area geometriesSATELLITE_COLLECTIONS
- Dictionary that has satellite image collection identifier as a key and image collection band names for multi-band images, e.g. '{'MODIS/006/MOD13A2': ['NDVI']}'SATELITTE_START_YEAR
- Start year of satellite imagery collection to be usedSATELITTE_END_YEAR
- End year of satellite imagery collection to be usedSATELITTE_BUFFER
- Buffer to define school area for which satellite imagery will be collected, in kilometersSATELITTE_MAX_CALL_SIZE
- Google Earth Engine API max feature collection length, by default 5000 pointsSATELLITE_IMAGE_SCALE
- Satellite Imagery scale of at which to request inputs to a computation is determined from the output. Resolution in meters; used 30 as default in our applicationsGOOGLE_SERVICES_ACCOUNT
- Google Services Account to call Google Earth Engine APIGOOGLE_EARTH_ENGINE_API_JSON_KEY
- JSON key file name that is located under satellite folderOPENCELLID_ACCESS_TOKEN
- OpenCelliD Project API access token as stringFACEBOOK_MARKETING_API_ACCESS_TOKEN
- Facebook Marketing API access token as stringFACEBOOK_AD_ACCOUNT_ID
- Facebook Ad account id as stringFACEBOOK_CALL_LIMIT
- Facebook Ads Management API maximum calls within one hour, by default it is 300 + 40 * (Number of Active Ads)FACEBOOK_RADIUS
- Radius to define school area for which Facebook API data will be collected, in kilometersFACEBOOK_SCHOOL_DATA_LEN
- # of schools in the dataset to keep facebook data with school length in the name in case schools wanted to be divided into chunks and also approximate Facebook API completion timeSPEEDTEST_TILE_TYPE
- Service type for Ookla Open Speedtest Dataset can be 'fixed' or 'mobile' representing fixed or mobile network performance aggregates of tilesSPEEDTEST_TILE_YEAR
- Speedtest data year, e.g. 2021SPEEDTEST_TILE_QUARTER
- Speedtest data quarter, e.g. 2POPULATION_DATASET_YEAR
- Population counts dataset year
Data Dictionaries
Data dictionaries should be created for each predictor and survey dataset. Dictionaries for the open-source data and Brazilian, Thai and Philippino survey data already exist and should be located in the data/meta/ folder.
An exemplary data dictionary (for speedtest data) is shown below:
Number | Name | Description | Type | Binary | Role | Use | Comment |
---|---|---|---|---|---|---|---|
1 | avg_d_kbps |
The average download speed of all tests performed in the tile, represented in kilobits per second | num | N | predictor | Y | mbps can also be used |
2 | avg_u_kbps |
The average upload speed of all tests performed in the tile, represented in kilobits per second | num | N | predictor | Y | mbps can also be used |
3 | avg_lat_ms |
The average latency of all tests performed in the tile, represented in milliseconds | num | N | predictor | N | |
4 | tests |
The number of tests taken in the tile | num | N | predictor | N | |
5 | devices |
The number of unique devices contributing tests in the tile | num | N | predictor | N |