MGB Framework

Honey | Tutorial | Dirty CSV Importer

Introduction

Honey (as well as the entire MGB framework) supports several file formats. For more details about the supported formats, refer to the file format section in the intermediate tutorial. Honey supports csv files, but these files need to have a specific structure: the time is the first column, and all attributes are numerical. Additionally, the records of each independent entity (if there are several such entities) should be separated into several files, or use the "Titarl trick". If your csv files do not have this structure, you can use the dirty csv importer of the Honey toolbox to convert this "dirty" csv into a csv that Honey can handle. The dirty csv importer supports csv files with:

The table below shows an example of two related "dirty" csv files.

Example : Dirty csv records (data.csv)
"DATE","JOBCODE","SHOP","PLATE","MALCODE","CITY","DISC","ACTION","MAINT_TYPE","TYPE","DEFECT"
2007-03-15,"469","SEAT_466","GF_94669",70,"Seattle",24,"removed","10_000_mile_inspection","minivan","A"
2008-04-16,"469","CHIC_456","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2008-04-16,"469","CHIC_456","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-02,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-03,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-04,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
...
Example : Dirty csv static information (data_static.csv)
"PLATE","MODEL","TYPE","COLOR"
"GF_94669","RENAULT","R21","BLUE"
"BX_9459","NISANE","350Z","BLACK"
...

The file data.csv describes a historical record of "maintenance operations" (e.g. repair, inspection) on a set of vehicles. Vehicles are identified by their "PLATE" and are considered independent. Each row represents a single operation on a specific vehicle. Various information is available for each operation, including the location, the date and the operation details.

The second file, data_static.csv, describes "static" information about each vehicle, i.e. information that does not change with time.
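The two files are linked by the PLATE column. Before importing, it can be useful to confirm that every vehicle in the records also has a row in the static file. The following Python sketch does this under the assumption that both files have a header row and share the entity column name; the function name and the plate "ZZ_0001" are hypothetical, and how Honey itself handles a missing static row is not covered here.

```python
import csv
import io

def missing_static_entities(record_lines, static_lines, key):
    """Return the entity ids found in the records but absent from the static file."""
    static_ids = {row[key] for row in csv.DictReader(static_lines)}
    record_ids = {row[key] for row in csv.DictReader(record_lines)}
    return sorted(record_ids - static_ids)

# Small in-memory stand-ins for data.csv and data_static.csv.
records = io.StringIO(
    '"DATE","PLATE"\n'
    '2007-03-15,"GF_94669"\n'
    '2007-03-16,"ZZ_0001"\n'   # hypothetical plate with no static row
)
static = io.StringIO(
    '"PLATE","MODEL","TYPE","COLOR"\n'
    '"GF_94669","RENAULT","R21","BLUE"\n'
    '"BX_9459","NISANE","350Z","BLACK"\n'
)
print(missing_static_entities(records, static, "PLATE"))  # ['ZZ_0001']
```

With real files, pass file objects opened with open(path, newline="") instead of the in-memory samples.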

The importer has one constraint on its input dirty csv file: because the Honey importer works in a greedy way (the dataset is never entirely loaded in memory), the records should be grouped entity by entity. If your csv file is not grouped this way, you can use the "sort" shell command (on Linux or Cygwin) to order it. In the previous example, if the csv file was not sorted, you could sort it with the command sort -t',' -k4 -o data_sorted.csv data.csv. Note that sort treats the header line like any other line, so the header may need to be moved back to the top of the sorted file.
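To verify that a file satisfies this grouping constraint, a single streaming pass is enough. The Python sketch below reports every entity whose rows are split into more than one block. It is only an illustration, not part of Honey: the function name is hypothetical, and it assumes a header row and fields that do not span several lines.

```python
import csv
import io

def find_ungrouped_entities(lines, entity_key):
    """Return the entity ids whose rows are not contiguous in the file."""
    finished = set()  # entities whose block of rows has already ended
    split = []        # entities that reappear after their block ended
    current = None
    for row in csv.DictReader(lines):
        entity = row[entity_key]
        if entity != current:
            if entity in finished and entity not in split:
                split.append(entity)
            if current is not None:
                finished.add(current)
            current = entity
    return split

# In-memory stand-in for an unsorted data.csv.
sample = io.StringIO(
    '"DATE","PLATE"\n'
    '2007-03-15,"GF_94669"\n'
    '2007-01-02,"BX_9459"\n'
    '2008-04-16,"GF_94669"\n'
)
print(find_ungrouped_entities(sample, "PLATE"))  # ['GF_94669']
```

Only the entity ids are kept in memory, so the check also works on files that are larger than the available memory. An empty result means the file is already grouped and does not need sorting.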

When importing a csv file, several parameters should be specified (e.g. the time format, the way to process each column). To do so, you will write a configuration text file that includes (non-exhaustive):

The configuration file for the csv importer is a plain text file. The following example shows such a configuration file for the csv file shown above. Once the configuration file is ready, use the following command to start the import: honey --tool:import config.cfg.

Example : Configuration text file (config.cfg)
inputFile path:"data_sorted.csv" staticData:"data_static.csv" separator:, maximportline:-1
outputFile path:"data_imported.bin" format:bin outputInDirectory:false

check noTimeDuplicate:0

time key:DATE format:"%Y-%m-%d" min:2007-01-01 max:2016-01-01 epsilon:0.01 factor:DAY

entity key:PLATE

state key:JOBCODE
state key:MAINT_TYPE
state key:TYPE
state key:DEFECT
state key:TYPE;DEFECT

scalar key:DISC

event key:JOBCODE
event key:JOBCODE;DEFECT;MAINT_TYPE

static_state key:MODEL
static_state key:TYPE
static_state key:COLOR

In this example, the output of the import will be a single bin file: data_imported.bin.
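The time line of the configuration uses a format string ("%Y-%m-%d") written with strftime-style directives, which Python's datetime.strptime also understands. Assuming Honey follows the same conventions, the sketch below checks the DATE column before import and lists the rows that do not parse or that fall outside the min and max values. The function name is hypothetical, the bounds are treated as inclusive (an assumption), and what Honey does with such rows is not stated here.

```python
import csv
import io
from datetime import datetime

def find_bad_dates(lines, time_key, fmt, min_date, max_date):
    """Return (row_number, value) pairs that fail to parse or lie outside the bounds."""
    low = datetime.strptime(min_date, fmt)
    high = datetime.strptime(max_date, fmt)
    bad = []
    # Data rows are numbered from 2, counting the header as row 1.
    for row_number, row in enumerate(csv.DictReader(lines), start=2):
        value = row[time_key]
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            bad.append((row_number, value))
            continue
        if not (low <= parsed <= high):
            bad.append((row_number, value))
    return bad

# In-memory stand-in for data_sorted.csv with two problematic rows.
sample = io.StringIO(
    '"DATE","PLATE"\n'
    '2007-03-15,"GF_94669"\n'
    '15/03/2007,"GF_94669"\n'
    '2016-06-01,"GF_94669"\n'
)
print(find_bad_dates(sample, "DATE", "%Y-%m-%d", "2007-01-01", "2016-01-01"))
# [(3, '15/03/2007'), (4, '2016-06-01')]
```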

The following is the list of possible options in the import configuration file:

Once a dataset is imported, it is strongly recommended to visualize it using the Event Viewer. For large datasets, you can restrict the number of imported entities or the number of imported rows to quickly test your configuration file. Finally, the Honey importing tool works in a greedy way, entity by entity. This means that the importing tool can be used to import very large datasets (larger than your computer's memory).
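Independently of the importer's own options, a small test file can also be cut from the grouped csv before import. The Python sketch below keeps the header and the rows of the first few entities, copying the lines unchanged. It is an illustration only: the function name and the plate "ZZ_0001" are hypothetical, and it assumes a header row, records already grouped by entity, and fields that do not span several lines.

```python
import csv

def take_first_entities(lines, entity_key, max_entities):
    """Yield the header and the raw lines of the first max_entities entities."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input
    # Locate the entity column from the header.
    column = next(csv.reader([header])).index(entity_key)
    yield header
    seen = 0
    current = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entity = next(csv.reader([line]))[column]
        if entity != current:
            if seen == max_entities:
                return  # stop reading as soon as the quota is reached
            seen += 1
            current = entity
        yield line

# In-memory stand-in for data_sorted.csv.
sample = [
    '"DATE","PLATE"\n',
    '2007-01-02,"GF_94669"\n',
    '2007-01-03,"GF_94669"\n',
    '2007-02-01,"BX_9459"\n',
    '2007-02-05,"ZZ_0001"\n',  # hypothetical third vehicle, dropped below
]
print("".join(take_first_entities(sample, "PLATE", 2)), end="")
```

Because the function is a generator that stops at the first row of the next entity, it reads only the beginning of the file, which keeps the test loop fast even on very large datasets.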