Friday, October 9, 2009

Bash script for test data generation

Here is the premise of this blog entry. I needed to discover the size of a metadata database, with full index data in it, for a project I am working on. I am trying to find worst-case numbers so I don't hit a wall unexpectedly at some point when someone pushes the boundaries of the system.

The first test I did was very simple: generate a single 5GB file containing random ASCII. I went this route because I wanted to ensure the index engine I am using couldn't reduce the data set. It does vector-based indexing and common-word elimination, so the completely random data in the file couldn't be reduced by the vector index in any way that would shrink the database. This process is awesome for saving space and does an amazing job in the real world, but I needed to defeat it for my test in case someone else introduced data to my system that would push the limits.

My simple approach to creating this test file was to use this command. By the way, I am using Ubuntu 9.10, so before this command would work I needed to install binutils (which provides strings) using "sudo apt-get install binutils".

cat /dev/urandom | strings > file.txt

I then opened a second terminal window to monitor the file size manually until it got to the size I needed. Simple enough...
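That manual watching could itself be scripted. Here is a rough sketch (my assumption of how you might do it, not what I did at the time) that polls the file size with stat and kills the pipeline once it crosses a threshold, using 100 KB just to keep the example small:

```shell
# Rough sketch: grow file.txt in the background, poll its size with GNU stat,
# and stop the pipeline once it crosses a threshold (100 KB for the example).
: > file.txt                               # create the file up front so stat never races
cat /dev/urandom | strings > file.txt &
pid=$!                                     # PID of the last pipeline stage (strings)
target=$((100 * 1024))
while [ "$(stat -c%s file.txt)" -lt "$target" ]; do
    sleep 1
done
kill $pid 2>/dev/null                      # cat then dies on SIGPIPE on its next write
```

The granularity is whatever the sleep is, so the file will overshoot the target a little; for this kind of test data that doesn't matter.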

I ran the data into the system, looked at my results after the test, and thought to myself, "This can't be right." Then I realized where the failure was in my test. Quite simply, every single file ingested would also generate a row in the metadata database, and at this point I had only generated a single row in the database for a 5GB file. That is not very real-world, and not a worst case.

I then started working on the problem from a more controlled approach, since I realized I was going to have to generate a good deal more data in smaller batches. The first step was to figure out how to generate an ASCII document of a specific size, since my earlier approach gave absolutely no control over the size of the file other than through manual intervention. Anyone who has worked with more than a couple of files understands that would make it impossible to generate the data set in the number of files I was going to need.

So the next thing I came up with was to use dd to create a file of the proper size, directing the output of /dev/urandom into a tmp file and then using that tmp file as dd's input.

My script started to look like this. I needed to make the script sleep for a period of time because I was having issues getting enough data into the tmp file before I would try to collect it with dd.

cat /dev/urandom | strings > tmp &
sleep .25
dd if=tmp of=testdata/n.txt bs=32k count=1
kill `ps -ef | grep '[u]random' | awk '{print $2}'`
rm tmp

This gave me a single 32k file full of random ASCII inside a directory labeled testdata, which I had previously created to keep this mess organized while testing the script. You will notice the .txt extension on my file name; that is for my specific testing purposes, as I identify my document types by extension at this point when deciding how to handle them, which is a completely different subject. (The bracketed pattern in the grep keeps it from matching its own entry in the ps output, and I use awk rather than nawk since nawk isn't installed on Ubuntu by default.)

After getting this all working, it was a simple matter of putting some controls in place to generate the number of files I needed at the size I needed them, and making the output of the script easier to deal with. I made it so I could pass those two variables, file count and file size, in on the command line. I ultimately ended up with a script that looked like this.
#The purpose of this script is to generate a set of files that are of a specific file size
#filled with random ascii data as to create a dataset for testing against
#deduplication and tsv indexing algorithms, specifically creating a worst case
#scenario as it pertains to the size of meta data databases.

#To use this script from the command line you can input the variables for
#the number of files you want generated and the size of those files in kb.

#Your command should look like " filecount filesize" or
#" 163840 32" This command would generate 163,840
#32k files or 5GB of data.

#If the files being created are smaller than the intended file size, you may have
#to increase the sleep timer, because not enough data is being generated into the tmp
#file to properly fill it before dd attempts to extract data from it.


mkdir -p testdata
i=0
bs="${2}k"
until [ $i -eq $1 ]
do
    cat /dev/urandom | strings > tmp &
    sleep .25
    echo "Making file #$i of size $bs"
    dd if=tmp of=testdata/$i.txt bs=$bs count=1 1>/dev/null 2>/dev/null
    kill `ps -ef | grep '[u]random' | awk '{print $2}'` 2>/dev/null
    rm tmp
    i=$(($i + 1))
done

I do owe a friend of mine some kudos for helping with my cleanup here, as there was a bunch of noise being propagated all over the screen that made the output of the script look like everything was blowing up. Plus he helped me make this script more usable as a general-purpose tool, instead of just for my narrowly focused problem.


  1. I'm wondering why you had to create an intermediate file and not just pipe the output of strings directly into dd like this:

    cat /dev/urandom | strings | dd of=testdata/n.txt bs=32k count=1

    It looks a lot simpler this way. Is there some reason why it wouldn't work for you?

  2. I initially attempted to do it that way, but I ran into an issue where the files I was generating were always smaller than what I was attempting to create.

    If you cat the output of /dev/urandom you will see it actually generates data in blocks. It's not a steady stream of random data that is available to the system.

    It was confusing to see that dd wasn't creating the proper file sizes. As close as I could tell, dd was being starved for data and would end because its source didn't have any more data to put into the file. It essentially copied all of the data it could and finished its task.

    You will notice that there is a .25 sleep in the script. The reason it is there is to allow /dev/urandom enough time to fill the tmp file before dd tries to use it. This allowed enough data to accumulate so that when dd read the file it would generate a file of the proper size.

    I hope that answers your question as to why that didn't work for me. It was a rather interesting problem and part of the reason I wrote a blog on the topic.

    - Dean
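For what it's worth, there are a couple of alternatives that sidestep the timing problem entirely. This is a sketch of my own assumptions, not anything from the original script: head -c will copy an exact byte count from a pipe, blocking until it has it, and GNU dd has an iflag=fullblock option that makes it keep reading until each block is actually full instead of finishing early on a short read. Either one removes the need for the tmp file and the sleep:

```shell
# Assumed alternatives (GNU coreutils): get an exact-size file straight from the pipe.

# 1) head -c copies exactly 32768 bytes from the pipe, waiting as long as it takes:
cat /dev/urandom | strings | head -c 32768 > testdata_headc.txt

# 2) dd with iflag=fullblock keeps reading until the 32k block is actually full,
#    instead of stopping at whatever a single short read from the pipe delivered:
cat /dev/urandom | strings | dd of=testdata_dd.txt bs=32k count=1 iflag=fullblock 2>/dev/null
```

In both cases the rest of the pipeline exits on its own: once head or dd closes the pipe, strings and cat die on SIGPIPE, so there is nothing left to kill.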