Linear Regression Node icon
The Clario Linear Regression node uses linear regression to build a model for a continuous dependent attribute. The resulting model equation can be used to create a prediction score based on one or more predictor attributes. Note that the dependent and predictor attributes must have been defined as numeric attributes in a previous read node. The node connector can be connected to a variety of nodes (e.g. Read File, Aggregate, Append, Missing), but requires a valid stream of data.
The Linear Regression node has two configuration tabs, Dependent Attribute and Predictor Attributes.
The Dependent Attribute tab contains an Available Attributes list box, a Dependent Attribute field, and a Settings area for the Attribute Selection Method drop down. First select an attribute from the Available Attributes list box and drag and drop it into the Dependent Attribute area (required).
Dependent Attribute Tab, None
Next choose the Attribute Selection Method. Choices for the Attribute Selection Method are None and Stepwise. If Stepwise is chosen, a box called ‘Stepwise Selection Options’ appears below the Attribute Selection Method drop down box. In this box, you can select the ‘Maximum p to Enter’ and ‘Minimum p to Remove’ values for the stepwise regression.
Dependent Attribute Tab, Stepwise
The Predictor Attributes tab is where you select the desired predictor attribute(s) by dragging and dropping them from the Available Attributes box to the Force Entry Attributes list box. If Selection Method is ‘None’, attributes must be selected for entry into the model. Select each attribute by clicking on it in the Available Attributes box, then drag and drop it into the Force Entry Attributes box. At least one attribute must be placed into the Force Entry Attributes box.
Predictor Attributes Tab, None
Predictor Attributes Tab, Stepwise
If Selection Method is ‘Stepwise’, select each attribute by clicking on it in the Available Attributes box, then drag and drop it into the Candidate Attributes box. If there are any attributes that you wish to force into the model, select them in the Available Attributes box and drag and drop them into the Force Entry Attributes box. At least one attribute must be placed into either the Force Entry or Candidate Attributes box. See tips on Finding and Selecting Attributes.
There is one results set with two different tabs (Detailed Results and Step History) for the Linear Regression node. When Attribute Selection Method is set to None, the Step History is omitted in the results set.
Results Set, None
The Detailed Results tab contains statistics such as R^2 and adjusted R^2 (coefficient of determination), Standard Error of Estimate, and Dependent Mean for each model step. It also contains, for each model attribute: name, regression coefficient, standard error, standardized coefficient, t-value, p-value, and tolerance. This results set also contains an Analysis of Variance (ANOVA) table which displays the F-statistic and corresponding p-value.
Detail Results, Stepwise
The Step History tab (stepwise method only) contains one row of data for each step in the model building process. Each row lists the attribute entered or removed, the step on which it was entered or removed, and the resulting model R^2 for that step.
Step History, Stepwise
The results from the Linear Regression node can be read into a Write File node, Score node, or Evaluate node. The results tables can also be exported into Excel by clicking the Export to Spreadsheet button found on the Toolbar. If the Linear Regression results are written to a file to be used in a scoring application, make sure ‘Full Precision’ is selected as the number format to avoid truncation of model coefficients.
Because the Clario framework does not make any assumptions about the length or width of the raw input data, we do not use any estimator that requires the full design matrix (X) and the vector of values of the dependent variable (Y) to be loaded into memory or written to disk, as the closed-form regression estimator does:
B = (X'X)^{-1}X'Y
nor do we use the standard computation technique of singular value decomposition on the raw data matrices to handle cases of extreme data redundancy.
Clario solves for the vector of regression coefficients \beta using basic ordinary least squares (OLS) techniques, together with corrective techniques for multicollinearity, but in a way which does not impose a priori conditions on the size of the data stream. A single pass is used to yield three components that are sufficient for producing the desired regression output:
In cases where R_{xx} is not ill-conditioned, the vector of standardized regression coefficients \beta is solved using:
\beta = R_{xx}^{-1} r_{yx}
In cases where the R_{xx} matrix is ill-conditioned, Clario will keep all of the chosen predictor variables in the model and will automatically try again using a generalized inverse. The generalized inverse yields a linear regression solution regardless of the condition of the R_{xx} matrix, but the resulting regression coefficients might not be properly interpretable without first removing or accounting for the extreme redundancy. It is your choice whether to accept that solution or to reject it; you may instead elect to eliminate the problem at its root, perhaps by creating composite scales or by removing some unnecessary variables. In any case, Clario notifies you through the results log when the generalized inverse option has been forced into effect. When \beta has been solved, the raw coefficients are computed using the formula:
b_i = \beta_i \frac{s_y}{s_i}
where s_i is the standard deviation of variable i and s_y is the standard deviation of y.
The R^2 statistic is then computed by: R^2 = \beta' r_{yx}
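The following is a minimal NumPy sketch of how a single pass over chunked data could feed the formulas above. It is not Clario's implementation: the function names `accumulate` and `solve_standardized`, the use of running sums and cross-products as the retained quantities, and the condition-number threshold of 1e12 are all assumptions made for illustration. The raw slope is rescaled as \beta_i s_y / s_i, the standard conversion from a standardized coefficient.

```python
import numpy as np

def accumulate(chunks):
    """One pass over (X_chunk, y_chunk) pairs, keeping only running sums."""
    n, s, ss = 0, None, None
    for X, y in chunks:
        Z = np.column_stack([X, y])          # predictors plus dependent variable
        if s is None:
            s = np.zeros(Z.shape[1])
            ss = np.zeros((Z.shape[1], Z.shape[1]))
        n += Z.shape[0]
        s += Z.sum(axis=0)                   # running column sums
        ss += Z.T @ Z                        # running cross-product sums
    return n, s, ss

def solve_standardized(n, s, ss, max_cond=1e12):
    """Solve beta = R_xx^{-1} r_yx from the accumulated sums."""
    mean = s / n
    cov = (ss - n * np.outer(mean, mean)) / (n - 1)
    sd = np.sqrt(np.diag(cov))
    R = cov / np.outer(sd, sd)               # correlation matrix, y in last row/col
    R_xx, r_yx = R[:-1, :-1], R[:-1, -1]
    # Assumed conditioning rule; the text does not state the actual criterion.
    used_pinv = np.linalg.cond(R_xx) >= max_cond
    if used_pinv:
        beta = np.linalg.pinv(R_xx) @ r_yx   # generalized inverse fallback
    else:
        beta = np.linalg.solve(R_xx, r_yx)
    b = beta * sd[-1] / sd[:-1]              # raw slopes: beta_i * s_y / s_i
    intercept = mean[-1] - b @ mean[:-1]     # standard OLS identity, not in the text
    r2 = float(beta @ r_yx)                  # R^2 = beta' r_yx
    return beta, b, intercept, r2, used_pinv
```

Accumulating raw cross-products is the simplest way to show the idea, but it is less numerically robust than an updating scheme when column means are large relative to their spread.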
If we let the number of independent variables be called k and the number of rows in the data N, the basic statistics computed above are used to produce all remaining Clario results including the ANOVA table:
Computation of the ANOVA Table with k predictors and N rows:
| Source | df | SS | MS | F | p(F) |
|---|---|---|---|---|---|
| Regression | k | SS_{REG} = R^2 (SS_{TOT}) | MS_{REG} = SS_{REG} / df_{REG} | F = MS_{REG} / MS_{RES} | P(f > F \mid df_{REG}, df_{RES}) |
| Residual | N-k-1 | SS_{RES} = (1-R^2)(SS_{TOT}) | MS_{RES} = SS_{RES} / df_{RES} | | |
| TOTAL | N-1 | SS_{TOT} | MS_{TOT} = SS_{TOT} / df_{TOT} | | |
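
As a small worked illustration of the table, the sketch below fills in the df, SS, MS, and F cells from R^2, the total sum of squares, k, and N. The function name `anova_table` and the dictionary layout are hypothetical. The p-value is left out because it needs the F distribution's upper-tail probability, which the standard library does not provide.

```python
def anova_table(r2, ss_tot, k, n):
    """Build the ANOVA cells from R^2, SS_TOT, k predictors and n rows."""
    df_reg, df_res, df_tot = k, n - k - 1, n - 1
    ss_reg = r2 * ss_tot                  # SS_REG = R^2 * SS_TOT
    ss_res = (1.0 - r2) * ss_tot          # SS_RES = (1 - R^2) * SS_TOT
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "regression": {"df": df_reg, "SS": ss_reg, "MS": ms_reg, "F": ms_reg / ms_res},
        "residual": {"df": df_res, "SS": ss_res, "MS": ms_res},
        "total": {"df": df_tot, "SS": ss_tot, "MS": ss_tot / df_tot},
    }

# Example: R^2 = 0.75, SS_TOT = 200, k = 3 predictors, N = 24 rows.
table = anova_table(0.75, 200.0, 3, 24)
print(table["regression"]["F"])           # MS_REG / MS_RES = 50 / 2.5
```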
Mallows' C_p statistic:
C_p = \frac{SS_{RES}}{MS_{RES}} - N + 2(k+1)
where MS_{RES} is the mean square residual with all candidate variables entered, and SS_{RES} is the residual sum of squares of the model with a specified subset of variables only.
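A short sketch of the C_p formula follows. The function name `mallows_cp` is hypothetical, and k is read here as the number of predictors in the subset model, which is an assumption about the notation. One useful sanity check: when the subset is the full candidate set, SS_{RES} / MS_{RES} equals N - k - 1, so C_p reduces to k + 1.

```python
def mallows_cp(ss_res_subset, ms_res_full, n, k):
    """C_p = SS_RES(subset) / MS_RES(full) - N + 2(k + 1)."""
    return ss_res_subset / ms_res_full - n + 2 * (k + 1)

# Full model with k = 3, N = 24, SS_RES = 50, so MS_RES = 50 / 20 = 2.5.
print(mallows_cp(50.0, 2.5, 24, 3))   # equals k + 1 for the full model
```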
The standard error of the estimate:
S_{y'} = \sqrt {MS_{RES}}
The standard errors of the coefficients:
S_{\beta_i} = \sqrt{\frac{MS_{RES}} {[s_i^2 \left( N - 1 \right)] [1-\left(1-\frac{1}{R^{ii}}\right)]}}
where s_i^2 is the variance of variable i and R^{ii} is the ith diagonal element of the inverse of R_{xx}, or, from the raw cross-product matrix:
S_{b_i} = \sqrt{MS_{RES}(X'X)^{-1}_{ii}}
(X'X)_{ij} = N [(R_{ij}s_i s_j ) + (\bar{x}_i \bar{x}_j) ]
T-tests for the slopes of coefficients:
t = \frac{\beta}{s_\beta} \sim t(df_{RES})
Tolerance:
tol_i = 1 - \left( 1 - \frac{1}{R^{ii}}\right) = \frac{1}{R^{ii}}
where R^{ii} is the ith diagonal element of the inverse of R_{xx}.
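The coefficient-level quantities above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not Clario's code: the function name `coefficient_metrics` is hypothetical, the data are held in memory for brevity, the standard error is taken as the square root of MS_{RES} / (s_i^2 (N-1) tol_i) and treated as belonging to the raw coefficient because of the s_i^2 (N-1) scaling, and tolerance is taken as the reciprocal of the ith diagonal of the inverse predictor correlation matrix.

```python
import numpy as np

def coefficient_metrics(X, y):
    """Raw slopes, their standard errors, t values and tolerances."""
    n, k = X.shape
    sd_x = X.std(axis=0, ddof=1)
    sd_y = y.std(ddof=1)
    R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    R_inv = np.linalg.inv(R[:k, :k])          # inverse of R_xx
    r_yx = R[:k, k]
    beta = R_inv @ r_yx                       # standardized coefficients
    r2 = beta @ r_yx
    ss_tot = sd_y ** 2 * (n - 1)
    ms_res = (1.0 - r2) * ss_tot / (n - k - 1)
    tol = 1.0 / np.diag(R_inv)                # tolerance = 1 / R^{ii}
    se_b = np.sqrt(ms_res / (sd_x ** 2 * (n - 1) * tol))
    b = beta * sd_y / sd_x                    # raw slopes
    t = b / se_b                              # compared against t(df_RES)
    return b, se_b, t, tol
```

The t values are the same whether computed from raw or standardized coefficients, since both the coefficient and its standard error are rescaled by the same factor.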
The Linear Regression node implements stepwise and forced entry algorithms. Forced entry is performed according to the formulae outlined above. The stepwise algorithm uses the above formulae repeatedly on subsets of variables selected from the master correlation matrix R.
Stepwise Algorithm Computation Steps
- Place all non-selected variables in the candidate pool.
- For each candidate variable, calculate the tolerance of the regression coefficient when it is entered into the existing model along with the other variables. If tolerance < 0.0001, skip the variable; otherwise continue.
- Test the strength of the current test attribute’s contribution to the model by computing the change-F test (Neter, Wasserman, & Kutner, 1985) of the full model (includes the variable) against the reduced model (includes only the existing variables).
- If none of the variables meet the p-to-enter criterion, variable selection is complete. Otherwise, select the variable with the smallest p-value from the change-F test.
- Next, test all variables in the model to see if any have lost explanatory power in the context of the added variable. This is done by computing, for each model variable, the t-test of the significance of its semi-partial regression coefficient. Remove any variable from the model and place it in the candidate pool if its p-value is higher than the p-to-remove criterion.
- Compute model statistics, the ANOVA table, and the variable metrics for this step.
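The steps above can be sketched on a correlation matrix as shown below. This is a simplified illustration, not Clario's implementation. It assumes NumPy and SciPy are available, that the dependent variable occupies the last row and column of R, that forced attributes are never removed, and that a `max_steps` guard (an addition not described in the text) is acceptable to rule out cycling. The function names are hypothetical.

```python
import numpy as np
from scipy import stats

def fit(R, idx):
    """Return (R^2, standardized betas, tolerances) for predictors idx."""
    if not idx:
        return 0.0, np.array([]), np.array([])
    R_inv = np.linalg.inv(R[np.ix_(idx, idx)])
    r_yx = R[idx, -1]
    beta = R_inv @ r_yx
    return float(beta @ r_yx), beta, 1.0 / np.diag(R_inv)

def candidate_tolerance(R, model, j):
    """1 - R^2 of candidate j regressed on the variables already in the model."""
    if not model:
        return 1.0
    r = R[model, j]
    return 1.0 - float(r @ np.linalg.solve(R[np.ix_(model, model)], r))

def stepwise(R, n, p_enter=0.05, p_remove=0.10, forced=(), min_tol=1e-4, max_steps=100):
    model = list(forced)
    pool = [j for j in range(R.shape[0] - 1) if j not in model]
    history = []
    for _ in range(max_steps):
        r2_old = fit(R, model)[0]
        best = None
        for j in pool:
            if candidate_tolerance(R, model, j) < min_tol:
                continue                              # too redundant, skip
            r2_new = fit(R, model + [j])[0]
            df_res = n - len(model) - 2               # residual df of the full model
            f_change = (r2_new - r2_old) * df_res / max(1.0 - r2_new, 1e-15)
            p = stats.f.sf(f_change, 1, df_res)       # change-F p-value
            if p <= p_enter and (best is None or p < best[1]):
                best = (j, p)
        if best is None:
            break                                     # nothing meets p-to-enter
        model.append(best[0])
        pool.remove(best[0])
        r2, beta, tol = fit(R, model)
        history.append(("enter", best[0], r2))
        # Removal check: t-test of each coefficient in the enlarged model.
        df_res = n - len(model) - 1
        se = np.sqrt(max(1.0 - r2, 1e-15) / (df_res * tol))
        p_vals = 2.0 * stats.t.sf(np.abs(beta / se), df_res)
        for j, p in zip(list(model), p_vals):
            if j not in forced and p > p_remove:
                model.remove(j)
                pool.append(j)
                history.append(("remove", j, fit(R, model)[0]))
    return model, history
```

Each history entry records the action, the attribute index, and the model R^2 after that action, which mirrors the kind of information the Step History tab reports.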