Sample

../_images/samplenodeicon.png

Sample Node icon

The Clario Sample node performs either simple random sampling or stratified sampling, randomly selecting rows from an input data stream. The node connector can be attached to a variety of nodes (e.g. Read File, Aggregate, Append, Missing), but it requires a valid stream of data.

Configuration

The Sample node has only one configuration tab.

Configuration Tab

The configuration tab consists of a few drop-down lists and text boxes.

First, specify the seed value (any integer between 1 and one billion). Then specify the number of samples (between 1 and 99). Sample will generate a unique value in the Replicate ID Attribute, which is added to the beginning of the output data stream metadata. To rename this attribute, click in the Replicate ID Attribute text box and type a new name; see the tips on Valid Characters for Attribute Names. If invalid keys are pressed in the text box, nothing will appear.
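The effect of the seed, the number of samples, and the Replicate ID Attribute can be sketched in Python. This is only an illustration, not the node's implementation: the function name `draw_replicates`, the default attribute name `ReplicateID`, and the assumption that each of the samples receives its own replicate ID (1, 2, 3, ...) are all hypothetical.

```python
import random

def draw_replicates(rows, sample_size, num_samples, seed,
                    replicate_attr="ReplicateID"):
    """Draw num_samples simple random samples of sample_size rows each.

    Assumption: every sample is tagged with its own replicate ID, and
    that ID is placed first in each output row.
    """
    if not 1 <= seed <= 1_000_000_000:
        raise ValueError("seed must be between 1 and one billion")
    if not 1 <= num_samples <= 99:
        raise ValueError("number of samples must be between 1 and 99")

    rng = random.Random(seed)  # same seed + same input order -> same output
    output = []
    for replicate_id in range(1, num_samples + 1):
        for row in rng.sample(rows, sample_size):
            # Put the replicate ID ahead of the original attributes.
            output.append({replicate_attr: replicate_id, **row})
    return output

rows = [{"id": i} for i in range(20)]
replicates = draw_replicates(rows, sample_size=5, num_samples=3, seed=42)
print(len(replicates))  # 3 samples of 5 rows each
```

Running the sketch twice with the same seed and the same input order returns identical rows, which is the purpose of fixing the seed.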

Specify the desired sampling method: Simple Random or Stratified.

../_images/samplesimplerandom.png

Simple Random

When Simple Random is selected, select either Rows or Percent, and then specify the corresponding Sample Size: a row count between 1 and the number of rows in the input data stream, or a percent from 1 to 100.
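A minimal Python sketch of the Rows and Percent options follows. The function name `simple_random_sample` and its parameters are hypothetical, and the way a percent is converted to a row count (Python's built-in `round`) is an assumption; the node may round differently.

```python
import random

def simple_random_sample(rows, size, unit="rows", seed=1):
    """Select rows at random, without replacement.

    unit="rows":    size is a row count between 1 and len(rows).
    unit="percent": size is a percent from 1 to 100.
    """
    if unit == "rows":
        if not 1 <= size <= len(rows):
            raise ValueError("row count must be between 1 and the input size")
        k = size
    elif unit == "percent":
        if not 1 <= size <= 100:
            raise ValueError("percent must be between 1 and 100")
        # Assumption: the percent is converted to a whole number of rows
        # with round(); the real node may truncate or round up instead.
        k = round(len(rows) * size / 100)
    else:
        raise ValueError("unit must be 'rows' or 'percent'")
    return random.Random(seed).sample(rows, k)

data = list(range(200))
by_rows = simple_random_sample(data, 10, unit="rows", seed=7)
by_percent = simple_random_sample(data, 25, unit="percent", seed=7)
print(len(by_rows), len(by_percent))
```

With 200 input rows, 25 percent corresponds to 50 sampled rows.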

When Stratified is selected, drag and drop an attribute from the Available list box (only String type attributes are available for selection) into the Class Attribute field. Then select either Rows or Percent. Click [+] at the bottom of the Strata box, enter the attribute value that defines the stratum and the corresponding sample size (number of rows or percent), and click [Save]. Repeat this process until you have entered all the strata you want, then click [Save]. The value entered for each stratum must be a valid value of the selected attribute.

../_images/samplestrat.png

Stratified
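Stratified sampling as configured above can be sketched in Python as follows. The function name `stratified_sample`, the `strata` dictionary (stratum value mapped to sample size), and the rounding of percents are hypothetical choices made for illustration, not the node's actual behavior.

```python
import random
from collections import defaultdict

def stratified_sample(rows, class_attr, strata, unit="rows", seed=1):
    """Sample each stratum separately.

    strata maps a value of class_attr to that stratum's sample size,
    given either as a row count or as a percent of the stratum.
    """
    # Group the input rows by the value of the class attribute.
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_attr]].append(row)

    rng = random.Random(seed)
    sample = []
    for value, size in strata.items():
        if value not in groups:
            raise ValueError(f"{value!r} is not a value of {class_attr!r}")
        members = groups[value]
        if unit == "rows":
            k = size
        else:
            # Assumption: percent is applied per stratum and rounded.
            k = round(len(members) * size / 100)
        sample.extend(rng.sample(members, k))
    return sample

rows = [{"region": "East" if i < 60 else "West", "id": i} for i in range(100)]
picked = stratified_sample(rows, "region", {"East": 6, "West": 10})
print(len(picked))  # 6 East rows plus 10 West rows
```

Only strata listed in `strata` contribute rows in this sketch; a value that does not occur in the class attribute raises an error, mirroring the requirement that each stratum value be valid.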

Results

There is one results set for the Sample node, which contains a Sample Results table.

../_images/sampleresults.png

Results Set

The table columns are as follows:

Column                  Description
Name                    Value of the attribute
Total Row Count         Total number of rows in the input file with the specified attribute value
Sample Row Count        Total number of rows sampled with the specified attribute value
Selection Probability   Sample Row Count / Total Row Count
Sampling Weight         1 / Selection Probability

If the Sampling Method is Simple Random, the Name will be ‘Simple’. If the Sampling Method is Stratified, the Name in each row will be the attribute value of the corresponding stratum defined in the Configuration Tab.
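The two derived columns follow directly from the row counts. The short Python sketch below reproduces the formulas from the table; the function name `sample_results_row` and the example counts are hypothetical.

```python
def sample_results_row(name, total_row_count, sample_row_count):
    """Build one row of a Sample Results table from the two row counts."""
    if sample_row_count <= 0 or total_row_count <= 0:
        raise ValueError("row counts must be positive")
    selection_probability = sample_row_count / total_row_count
    sampling_weight = 1 / selection_probability
    return {
        "Name": name,
        "Total Row Count": total_row_count,
        "Sample Row Count": sample_row_count,
        "Selection Probability": selection_probability,
        "Sampling Weight": sampling_weight,
    }

# Hypothetical example: 250 rows sampled from 1000 input rows.
print(sample_results_row("Simple", 1000, 250))
```

For 250 rows sampled out of 1000, the Selection Probability is 0.25 and the Sampling Weight is 4.0, meaning each sampled row stands in for four input rows.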

Note

To ensure consistent, repeatable samples, the rows of the input data stream must arrive in the same order on every run. To this end, it is recommended to sort the input data by the selected Class Attribute.
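Why input order matters can be illustrated with a small Python sketch. It assumes, as is typical for seeded samplers, that the random generator picks row positions, so the same seed selects different rows when the rows are reordered. The function name `sample_after_sort` and the `id` tie-breaker are hypothetical; sorting by the class attribute alone leaves rows with equal values in their incoming order, so this sketch adds a second key to make the order fully determined.

```python
import random

def sample_after_sort(rows, k, seed, sort_key):
    """Sort the input into a fixed order, then draw a seeded sample."""
    ordered = sorted(rows, key=sort_key)
    return random.Random(seed).sample(ordered, k)

rows = [{"id": i, "cls": "A" if i % 2 else "B"} for i in range(10)]
reordered = list(reversed(rows))  # same rows, different arrival order

# Sort by class attribute, then by id as a tie-breaker (an assumption).
key = lambda r: (r["cls"], r["id"])
first = sample_after_sort(rows, 3, seed=7, sort_key=key)
second = sample_after_sort(reordered, 3, seed=7, sort_key=key)
print(first == second)
```

Because both inputs are sorted into the same order before sampling, the two seeded draws return the same rows.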

Output Stream

The sampled dataset is ready for immediate use in other nodes to explore, manipulate, cleanse, and model the data. The data can be exported at any point in a workflow by using the Write File node.
