Here's a slightly disorganized collection of advice on stocking your generators with data.
How much is enough?
First of all, if the data forms a very limited set (such as the Aristotelian elements, continents, etc.), it's simple - include them all.
If the dataset isn't bounded, then try at least to include a representative set of data. There are many colors, for instance, and it may be worth having "brown and tan" or "purple and violet," but do you really need "taupe"?
If the dataset is large, try to find a way to classify the data so you end up with a representative sample. There may be many ways to portray temperature, but when it comes down to it you may just need words for "Cold," "Warm," and "Hot," so try to make sure each of those is fairly represented (or that the data is somehow coded so probabilities are not skewed in ways you don't want). In other words, having 15 words for cold, 2 for warm, and 7 for hot may skew the results, depending on how your generator works.
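Here's a minimal sketch of that skew in Python. The word lists are placeholders sized like the example (15 cold, 2 warm, 7 hot), and the function names are made up for illustration. It compares picking from one merged list with picking a category first and then a word inside it.

```python
import random

# Placeholder word lists sized like the example: 15 cold, 2 warm, 7 hot.
WORDS = {
    "cold": [f"cold-{i}" for i in range(1, 16)],
    "warm": [f"warm-{i}" for i in range(1, 3)],
    "hot": [f"hot-{i}" for i in range(1, 8)],
}

def flat_category_odds(words):
    """Chance of each category when picking uniformly from one merged list."""
    total = sum(len(items) for items in words.values())
    return {category: len(items) / total for category, items in words.items()}

def pick_flat(words, rng=random):
    """Merge every list and pick one word; the biggest category dominates."""
    merged = [word for items in words.values() for word in items]
    return rng.choice(merged)

def pick_by_category(words, rng=random):
    """Pick a category first, then a word in it; categories are equally likely."""
    category = rng.choice(list(words))
    return category, rng.choice(words[category])

odds = flat_category_odds(WORDS)
print({category: round(p, 3) for category, p in odds.items()})
# {'cold': 0.625, 'warm': 0.083, 'hot': 0.292}
```

With the merged list, "cold" turns up well over half the time. Picking the category first gives each of the three an even one-in-three chance, no matter how many words each one holds.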
In fact, classifying data is its own art, which leads us to:
Sometimes a kind of data may in turn be classifiable into sub-categories. You may do this because a complex generator needs it, but I also find that classifying a kind of data into its own categories gives you a framework to find and flesh out the data. For instance, a list of woods is one thing - but once it's broken into kinds of woods, it's easier to look for more woods of each kind and diversify the results of a generator.
A good way I've found to classify data within its own category is what I call the 3/5/7 rule:
Data, if it must be classified, should be classified into at least 3 categories, 5 if possible, 7 if you can do it - and it is always best to classify your data into an odd number of categories if it is some kind of continuum (such as temperature, age, etc.), just for your convenience. Note that this quick rule may not work depending on the data you're using - it just may not break down to an odd number.
I find 7 is ideal in that:
Again, this applies to data within a category, not the kinds of data you should have. In other words you may have four kinds of data types (such as eye color, temperature, minerals), but within those if you need to classify data, use the 3/5/7 rule.
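As a small illustration of the 3/5/7 rule on a continuum, here's a sketch in Python. The seven temperature labels and both function names are my own assumptions, not a fixed vocabulary. One convenience of an odd count is that the scale has a single middle category.

```python
# Hypothetical seven-step temperature scale, ordered coldest to hottest.
TEMPERATURE_SCALE = ["freezing", "cold", "cool", "mild", "warm", "hot", "scorching"]

def midpoint(scale):
    """Return the middle category, which only exists for an odd-length scale."""
    if len(scale) % 2 == 0:
        raise ValueError("an even number of categories has no single midpoint")
    return scale[len(scale) // 2]

def category_for(value, scale):
    """Map a value in [0.0, 1.0] onto equal-width bands of the scale."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0.0 and 1.0")
    # Clamp so that exactly 1.0 lands in the last band instead of past it.
    index = min(int(value * len(scale)), len(scale) - 1)
    return scale[index]

print(midpoint(TEMPERATURE_SCALE))           # mild
print(category_for(0.5, TEMPERATURE_SCALE))  # mild
print(category_for(1.0, TEMPERATURE_SCALE))  # scorching
```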
How Much Data?
Whether your data is classified or not, the question does arise - just how much data do you need? Let's face it, gathering data can drag - or it can make you look bad if you don't have enough of it.
The first answer, especially for unbounded data with no clear classifications, is "as much as you really can get within reason." I mean, you can go way too far and make your application overloaded and clunky, so use your own judgement.
I find that the minimum number of items in any definite category should be seven if at all possible. Seven items usually provide the minimal variability needed to give good random combinations and avoid imagination-numbing repetition. (Sound familiar? Exactly: see the 3/5/7 rule.)
However, I prefer to go by at least ten items, 25 if I can get it, 50 if possible, 100 if I can, and sometimes even more (for names, for instance). If the data is classified, then I try to make sure each classification of data is properly represented and that the probability of each category appearing is appropriate to my intentions.
If a kind of data is further broken into classifications, I usually try to make sure each classification, if possible, is well represented. The rules on this are a bit more difficult - I usually try to make sure the classes of data within a specific kind of data total up to a goodly number, and that data within classifications is diverse. A good rule of thumb is that, besides the data adding up to a good number, individual categories should have at least 3-7 items within them, minimum. Again, there's no hard and fast rule.
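These rules of thumb are easy to check automatically. Here's a minimal sketch in Python, assuming the data is kept as a dictionary of lists. The `audit` function, its default thresholds (ten items overall, three per category, taken from the low end of the numbers above), and the sample wood lists are all illustrative choices, not a standard.

```python
def audit(categories, min_per_category=3, min_total=10):
    """Return a list of warnings about data that looks too thin."""
    warnings = []
    total = sum(len(items) for items in categories.values())
    if total < min_total:
        warnings.append(f"only {total} items in total (want at least {min_total})")
    for name, items in categories.items():
        # Count distinct entries so accidental duplicates don't pad a category.
        distinct = len(set(items))
        if distinct < min_per_category:
            warnings.append(
                f"'{name}' has {distinct} distinct items "
                f"(want at least {min_per_category})"
            )
    return warnings

# Illustrative sample data: a short list of woods split into two kinds.
WOODS = {
    "hardwood": ["oak", "maple", "walnut", "cherry", "ash", "birch"],
    "softwood": ["pine", "cedar"],
}

for warning in audit(WOODS):
    print(warning)
# only 8 items in total (want at least 10)
# 'softwood' has 2 distinct items (want at least 3)
```

Raising `min_per_category` to 7 and `min_total` to 25 or more matches the more generous targets; the point is just to notice when one category is being starved.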
There are several reasons to have goodly amounts of data:
Data usually isn't just something you dump into a generator. Take the time to work on it and analyze it and you'll build better generators.