<iframe src="//www.googletagmanager.com/ns.html?id=GTM-TT4L49" height="0" width="0" style="display:none;visibility:hidden">
Jethro - We Make Real-Time Business Intelligence Work on Hadoop

Blog

How to Generate Mock Data for Testing

By Arthur Adams on March 09, 2016

Share

Do you want to test Jethro but don't have any data to use? With the csvgen utility you can very easily generate mock data and load it into Jethro. The utility generates csv files that can be loaded into Jethro based on your specifications. This document describes how to use the csvgen utility to generate data and load into Jethro.

csvgen

To use csvgen, first visit the csvgen github page here:

https://github.com/JethroData/csvgen

From within the github page, click on “Download Zip” to download a zip file containing all the required files.

Extract the zip file and open the README.md file for instructions on how to install and use csvgen.

Before generating the data, you need to create the input file that instructs csvgen how to generate the data. The input file contains text where each row corresponds to a column in the table and contains a description of how to generate data for that column. Each row begins with the data type and may contain a list of parameters. Here’s a description of the various data types and their parameters:

row_number

Generates a unique row number for each row, starting at 1 or at the number provided as parameter.

Parameters:
Start from, the number to start from, optional.
Prefix, the prefix to prepend to the number, optional.

Examples:
row_number 1000
row_number 100000 customer_

boolean

Generates a Boolean value, true or false.

Parameters:
Null %, the % of NULL values, 0-100, optional.

Examples:
boolean 20

uuid

Generates a unique id.

Parameters:
Null %, the % of NULL values, 0-100, optional.

Examples:
uuid 10

ip_address

Generates an IP address.

Parameters:
Null %, the % of NULL values, 0-100, optional.

Examples:
ip_address 20

regex

Generates a string based on a regular expression.

Parameters:
Expression, the regular expression, mandatory.
Null %, the % of NULL values, 0-100, optional.

Examples:
regex "201[2-5](0[1-9]|1[0-2])" 20
regex [a-zA-Z0-9_.+-]{2,10}@[a-zA-Z0-9-]{4,20}\\.(com|net|org)

word

Generates a random (meaningless) word.

Parameters:
Min, the minimum number of syllables, mandatory.
Max, the maximum number of syllables, mandatory.
Unique, the number of unique words to generate, optional (0 means no limit).
Null %, the % of NULL values, 0-100, optional.

Examples:
word 2 4 0 15
word 1 3 1000

number

Generates a whole or decimal number.

Parameters:
Min, minimum value, mandatory.
Max maximum value, mandatory.
Decimal, the number of decimal places (0 for whole), mandatory.
Null %, the % of NULL values, 0-100, optional.

Examples:
number 1 100 0 10
number -999 999 2 20

date

Generates a date with a specified format within a given period.

Parameters:
Start, start date period, mandatory.
End, end date period, mandatory.
Format, the date format as described here, mandatory.
Null %, the % of NULL values, 0-100, optional.

Examples:
date "01-01-2012 00:00:00" "12-31-2012 23:59:59" "%m-%d-%Y %H:%M:%S" 30
date "05-27-2014" "06-31-2014" "%m-%d-%Y" 25

fixed

Generates a string provided as parameter.

Parameters:
Fixed value, the string to generate, mandatory
Null %, the % of NULL values, 0-100, optional.

Examples:
Fixed hello 10

list

Generates a string from a list of values.

parameters:
List, list of values separated by a space enclosed by “[ ]” mandatory.
Null %, the % of NULL values, 0-100, optional.

Examples:
list “[dog cat mouse]” 20
Note that you must enclose the list with quotes.

custom list

Generates a string from a custom list of strings. For a given type, there should be a file <type>.csv in the data directory. The file should contain a list of values, each one on a separate line. One of these values will be picked randomly.
The data directory already contains several type files, like first_name, last_name, state, country, etc. You can add new custom types by adding such a file to the data directory.

Parameters:
Unique, the number of unique values to pick, optional (0 means pick from the entire list).
Null %, the % of NULL values, 0-100, optional.

Examples:
first_name
color 10
brands 100 10

 

Generate data and load into Jethro

The following steps assume that you already installed Jethro and that you have created an instance named “demo”. If your instance is named differently, just replace “demo” with your instance name.

For more information about installing Jethro, visit the Jethro documentation at http://jethro.io/learn/documentation

 

Let’s say that we want to generate data for a customer table that has the following definition:

CREATE TABLE customer
(
customer_id INTEGER,
first_name STRING,
last_name STRING,
age INTEGER,
birth_country STRING,
gender STRING,
phone_number STRING,
last_transaction_date TIMESTAMP,
balance FLOAT
);

Save the above data definition in a file called customer.ddl and run the following command from the Jethro server console as user Jethro to create the table:

JethroClient demo localhost:9111 -p jethro -i customer.ddl

 

Next we create a Jethro loader description file that will be used to load the data. Save the following in a file called customer.desc:

table customer
overwrite
row format delimited
        fields terminated by '|'
(
        customer_id,
        first_name,
        last_name,
        age,
        birth_country,
        gender  STRING,
        phone_number,
        last_transaction_date format='MM-dd-yyyy',
        balance
)

Now, we need to create the input file that instructs csvgen how to generate the data.

Here is an example of such an input file for the customer table:

row_number 1000000
word 1 4 1000
word 1 4 3000
number 18 100 0
country
list '[Male Female]'
regex '[1-9]\d{2}-\d{3}-\d{4}'
date '01-01-2000' '01-01-2016' '%m-%d-%Y' 25
number -999 999 2 10

It will generate the following for each column:

customer_id – A unique number starting at 1000000 for the first row and incrementing by one for subsequent rows.

first_name – A random word, 1 to 4 syllables, out of a list of 1000 words.

last_name – A random word, 1 to 4 syllables, out of a list of 3000 words.

age – A whole number between 18 and 100.

birth_country – A value from a list of countries. The list is taken from an existing file named country.csv in the data directory.

Gender – Either “Male” or “Female”.

phone_number – A string of 10 digits not starting with 0 generated by a regex with the format xxx xxx xxxx.

last_transaction_date – A date between 01-01-2000 and 01-01-2016 with 25% of NULL values.

balance – A decimal numbers with 2 places after the decimal point between -999 and 999 with 10% of NULL values.

Save the input text in a file called customer.in

Verify that the data is generated correctly by generating 100 rows with this command:

./csvgen.py -i 100 -d '|' customer.in

It will send the output to the console.

Here is an example of the last few lines of the output:

1000093|Paf|Yuc|76|Venezuela|Male|403-740-4749||706.68
1000094|Col|Axxa|64|Argentina|Male|371-486-9796|03-12-2009|
1000095|Uz|Ec|99|Italy|Male|624-579-8640|11-03-2000|-115.25
1000096|Bakdi|Iv|71|Iceland|Male|496-820-5813|03-13-2013|172.57
1000097|Vot|Iccife|71|United States|Male|637-522-6319||556.36
1000098|Nauhel|Owle|82|Senegal|Male|998-729-3208|11-24-2007|543.28
1000099|Ibri|Afzu|18|Slovenia|Female|629-826-0867|05-25-2013|432.62

Next, we will generate the data to be loaded to Jethro.

Let’s generate data for 10000 customers, using the delimiter ‘|’ and store it in a file customer.csv:

$./csvgen.py -i 10000 -d '|' -o customer.csv customer.in

 

Now that we have the csv file ready, we can load it into Jethro. Run the following command on the Jethro server console with user Jethro:

JethroLoader demo customer.desc customer.csv

You should see the name of the loader log file in the console. When the load finishes, view the log file to make sure that the load finished successfully. You should see lines like the following indicating that all the data was loaded:

Input records:                 10000
Records loaded:                 10000
Records skipped:               0 ( 0 empty lines, 0 no matching rules, 0 header lines, 0 outside partitions boundary )
Records rejected (see reject rep0
t file 'loader_rejects_20160304_212349_46837.out':Table rows - before :           0
Table rows - truncated :       0
Table rows - after :           10000

Another option is to generate the data and pass it to the loader using a pipe instead of saving it in a file. This can be achieved using the following:

 

./csvgen.py -i 10000 -d '|' customer.in | JethroLoader demo customer.desc stdin

 

Speeding up data generation

In order to generate a large amount of data, you can use the generate.sh script which can run multiple csvgen.py in parallel. This speeds up the data generation process and can be useful to generate data for large fact tables.

It spawns as many processes as requested and creates temporary output file for each. When all processes are done, it combines the output to one file.

 

To use it, run the following:

./generate.sh &lt;name&gt; &lt;rows&gt; &lt;number of processes&gt; &lt;delimiter&gt;

 

Where:

<name> is a description file named “<name>.in”. It also uses the same name for the output file which is generated as “<name>.csv”. This parameter is mandatory.

<rows> is the number of rows to generate for each process. Default =1M

<number of processes> is the number of processes to run in parallel. Default=1

<delimiter> is the column delimiter to use. Default=’|’

 

To generate data for 1000000 customers using 20 parallel processes, issue the command:

./generate.sh customer 50000 20 '|'

Note that you need to specify the number of rows for each process and not the total number of rows. For best results, align the number of processes with the number of CPU’s that the server has.