Honey | Tutorial | Dirty CSV Importer
Introduction
Honey (as well as the entire MGB framework) supports several file formats. For more details about the supported formats, refer to the file format section of the intermediate tutorial. Honey supports CSV files, but these files need to have a specific structure: the time is the first column, and all attributes are numerical. Additionally, the records of each independent entity (if there are several such entities) should be separated into several files, or use the "Titarl trick". If your CSV files do not have this structure, you can use the dirty CSV importer of the Honey toolbox to convert such a "dirty" CSV into a CSV that Honey can handle. The dirty CSV importer supports CSV files with:
- Time expressed in many formats (including text e.g. "2007-03-15")
- Categorical attributes.
- Entities (i.e. several independent SSTS with similar structures)
The listings below show an example of two related "dirty" CSV files.
2007-03-15,"469","SEAT_466","GF_94669",70,"Seattle",24,"removed","10_000_mile_inspection","minivan","A"
2008-04-16,"469","CHIC_456","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2008-04-16,"469","CHIC_456","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-02,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-03,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
2007-01-04,"3996","CHIC_617","GF_94669",70,"Chicago",24,"removed","10_000_mile_inspection","minivan","A"
...
"GF_94669","RENAULT","R21","BLUE"
"BX_9459","NISANE","350Z","BLACK"
...
The first file, data.csv, describes a historical record of "maintenance operations" (e.g. repair, inspection) on a set of vehicles. Vehicles are identified by their "PLATE" and are considered independent. Each row represents a single operation on a specific vehicle. Various information is available for each operation, including the location, date and operation details. Additionally, various static information is available for each vehicle and stored in a second CSV file:
The second file, data_static.csv, describes "static" information about each vehicle, i.e. information that does not change over time.
The importer has one constraint on its input dirty CSV file: because the Honey importer works in a greedy way (the entire dataset is never loaded in memory at once), the records should be grouped entity by entity. If your CSV file is not grouped this way, you can use the "sort" shell command (on Linux or Cygwin) to order it. In the previous example, if the CSV file were not sorted, you could sort it with the command sort -t',' -k4 -o data_sorted.csv data.csv.
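If you are unsure whether your file is already grouped entity by entity, a quick check can be scripted before running the importer. The sketch below is not part of Honey; it simply assumes, as in the example above, that the entity id is the fourth column (index 3):

```python
import csv

def is_grouped_by_entity(path, entity_col=3):
    """Return True if all rows sharing an entity id are contiguous."""
    seen, previous = set(), None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            entity = row[entity_col]
            if entity != previous:
                # A new block of rows starts here; its entity id must
                # not have been seen in an earlier block.
                if entity in seen:
                    return False
                seen.add(entity)
                previous = entity
    return True
```

If the function returns False, sort the file with the sort command shown above before importing.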
When importing a CSV file, several parameters should be specified (e.g. the format of the time, and how to process each column). To do so, you will write a configuration text file that includes (non-exhaustive list):
- The path to the input and output files.
- Format of the output file (bin, evt, sevt, etc.).
- Various optional filters on the entities or records (max rows, max entities, min number of records per entity, forbidden values, etc.).
- Which column represents the time, and how to interpret it.
- Which column represents the entity id.
- Which columns (or groups of columns) should be exported, and how (e.g. as event, scalar or state).
- Optionally, the location of another CSV file containing static information about the entities (e.g. type of car, purchase date).
The configuration file for the CSV importer is a plain text file. The following example shows such a configuration file for the CSV files shown above. Once the configuration file is ready, use the following command to start the import: honey --tool:import config.cfg.
outputFile path:"data_imported.bin" format:bin outputInDirectory:false
check noTimeDuplicate:0
time key:DATE format:"%Y-%m-%d" min:2007-01-01 max:2016-01-01 epsilon:0.01 factor:DAY
entity key:PLATE
state key:JOBCODE
state key:MAINT_TYPE
state key:TYPE
state key:DEFECT
state key:TYPE;DEFECT
scalar key:DISC
event key:JOBCODE
event key:JOBCODE;DEFECT;MAINT_TYPE
static_state key:MODEL
static_state key:TYPE
static_state key:COLOR
In this example, the output of this import will be a single bin file: data_imported.bin.
Following is the list of possible options in the import configuration file:
- inputFile : Defines the path and CSV syntax of the input datasets.
- path : Path to the dirty input CSV file.
- staticData : (Optional) Path to the unformatted input CSV file with static data.
- separator : CSV column separator character.
- maximportline : (Optional) Maximum number of imported lines.
- maximportentity : (Optional) Maximum number of imported entities.
- only : (Optional) Path to a text file containing the entities to import (one id per row). If not specified, all entities are imported.
- minRecords : (Optional) Only import entities with at least this number of records.
- maxRecords : (Optional) Only import entities with at most this number of records.
- minDuration : (Optional) Only import entities with a time difference between the last and first records greater than the specified value.
- outputFile : Defines the path and format of the output dataset.
- path : Path to the output file or output directory (if outputInDirectory:true).
- format : Output format. Can be "bin", "evt" or "csv".
- outputInDirectory : If false (default), all the records will be saved in a single file. If true, the records of each entity will be saved in a separate file, and a global .sevt file will be defined.
- filter : (Optional, can appear several times) Filters the records by the value of a column.
- key : Column name.
- value : Expected value.
- reverse : If false, only the records with the expected value are saved. If true, only the records with a value different from the expected value are saved.
- check : Various checks on the data.
- noTimeDuplicate : If true, make sure that no two records of the same entity have the same time.
- allStaticData : If true, each entity should have a row in the static dataset.
- time : Structure and format of the input time.
- key : Name of the column representing the time.
- format : Format of the time. Can be "direct" (i.e. a direct numerical value) or any format supported by std::time_get.
- separateRecords : If true, if check:noTimeDuplicate=false and two records have the same time, the records will be shifted by epsilon so they don't overlap.
- factor : Division factor to apply to the numerical time. If "DAY", the value will be 60*60*24 (i.e. the number of seconds in a day).
- min : Lower bound of the time for the dataset.
- max : Upper bound of the time for the dataset.
- epsilon : Very small numerical value added to or subtracted from record times to make sure they don't overlap.
- entity : Specify the column that contains the entity unique identifier.
- key : Name of the column.
- state : Specify the column (or the set of columns, separated by ";") to convert into a state.
- key : Column names separated by ";".
- scalar : Specify the column to convert into a scalar.
- key : Name of the column.
- event : Specify the column (or the set of columns, separated by ";") to convert into an event.
- key : Column names separated by ";".
- static_state : Specify the column (or the set of columns, separated by ";") from the static dataset to convert into a state.
- key : Column names separated by ";".
- static_scalar : Specify the column from the static dataset to convert into a scalar.
- key : Name of the column.
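As an illustration of the filter option above, the following fragment would keep only the records whose TYPE is "minivan" and discard the records whose DEFECT is "unknown". This is a sketch: the value "unknown" is a hypothetical example, and the exact quoting rules may differ in your version of Honey.

```
filter key:TYPE value:minivan reverse:false
filter key:DEFECT value:unknown reverse:true
```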
Once a dataset is imported, it is strongly recommended to visualize it with the Event Viewer. In the case of a large dataset, you can restrict the number of imported entities or rows to quickly test your configuration file. Finally, the Honey importing tool works in a greedy way, entity by entity. This means that it can be used to import very large datasets (larger than your computer's memory).
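For instance, to test a configuration on only the first rows and entities of a large dataset, the inputFile options listed above could be combined as follows. This is a sketch: the single-line layout mirrors the outputFile line of the earlier example, and the limits 10000 and 50 are arbitrary illustration values.

```
inputFile path:"data_sorted.csv" staticData:"data_static.csv" separator:"," maximportline:10000 maximportentity:50
```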