
Clementine 11.1 Node Reference

For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact: SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412. Tel: (312) 651-3000. Fax: (312) 651-3668.

SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Graphs powered by SPSS Inc.'s nViZn(TM) advanced visualization technology (http://www.spss.com/sm/nvizn). Patent No. 7,023,453.

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

Project phases are based on the CRISP-DM process model. Copyright 1997-2003 by CRISP-DM Consortium (http://www.crisp-dm.org). Some sample data sets are included from the UCI Knowledge Discovery in Databases Archive: Hettich, S. and Bay, S. D. 1999. The UCI KDD Archive (http://kdd.ics.uci.edu). Irvine, CA: University of California, Department of Information and Computer Science.

Microsoft and Windows are registered trademarks of Microsoft Corporation. IBM, DB2, and Intelligent Miner are trademarks of IBM Corporation in the U.S.A. and/or other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. UNIX is a registered trademark of The Open Group. Linux is a registered trademark of Linus Torvalds. Red Hat is a registered trademark of Red Hat Corporation. Solaris is a registered trademark of Sun Microsystems Corporation. DataDirect and SequeLink are registered trademarks of DataDirect Technologies.

Copyright 2001-2005 by JGoodies. Founder: Karsten Lentzsch. All rights reserved. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of JGoodies or Karsten Lentzsch nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ICU4C 3.2.1. Copyright 1995-2003 by International Business Machines Corporation and others. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

Clementine 11.1 Node Reference
Copyright 2007 by Integral Solutions Limited. All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the publisher.

1234567890  10 09 08 07

ISBN-13: 978-1-56827-388-4 ISBN-10: 1-56827-388-6

Preface

Clementine is the SPSS enterprise-strength data mining workbench. Clementine helps organizations to improve customer and citizen relationships through an in-depth understanding of data. Organizations use the insight gained from Clementine to retain profitable customers, identify cross-selling opportunities, attract new customers, detect fraud, reduce risk, and improve government service delivery. Clementine's visual interface invites users to apply their specific business expertise, which leads to more powerful predictive models and shortens time-to-solution. Clementine offers many modeling techniques, such as prediction, classification, segmentation, and association detection algorithms. Once models are created, Clementine Solution Publisher enables their delivery enterprise-wide to decision makers or to a database.

Serial Numbers
Your serial number is your identification number with SPSS Inc. You will need this serial number when you contact SPSS Inc. for information regarding support, payment, or an upgraded system. The serial number was provided with your Clementine system.

Customer Service
If you have any questions concerning your shipment or account, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Please have your serial number ready for identification.

Training Seminars
SPSS Inc. provides both public and onsite training seminars. All seminars feature hands-on workshops. Seminars will be offered in major cities on a regular basis. For more information on these seminars, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/.

Technical Support
The services of SPSS Technical Support are available to registered customers. Student Version customers can obtain technical support only for installation and environmental issues. Customers may contact Technical Support for assistance in using Clementine products or for installation help for one of the supported hardware environments. To reach Technical Support, see the SPSS Web site at http://www.spss.com or contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Be prepared to identify yourself, your organization, and the serial number of your system.

Tell Us Your Thoughts


Your comments are important. Please let us know about your experiences with SPSS products. We especially like to hear about new and interesting applications using Clementine. Please send e-mail to suggest@spss.com or write to SPSS Inc., Attn.: Director of Product Planning, 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Contacting SPSS
If you would like to be on our mailing list, contact one of our offices, listed on our Web site at http://www.spss.com/worldwide/.

Contents
1 About Clementine    1

Clementine Client, Server, and Batch . . . 1
Clementine Modules . . . 2
Clementine Options . . . 6
Clementine Solution Publisher . . . 6
Text Mining for Clementine . . . 6
Web Mining for Clementine . . . 7
Database Modeling and Optimization . . . 7
Deploying Scenarios to the Predictive Enterprise Repository . . . 7

2 Source Nodes    9

Overview . . . 9
Enterprise View Node . . . 10
Setting Options for the Enterprise View Node . . . 11
Enterprise View Connections . . . 12
Choosing the DPD . . . 14
Choosing the Table . . . 14
Variable File Node . . . 15
Setting Options for the Variable File Node . . . 16
Fixed File Node . . . 18
Setting Options for the Fixed File Node . . . 18
Setting Field Storage and Formatting . . . 20
Database Node . . . 22
Setting Database Node Options . . . 23
Adding a Database Connection . . . 25
Selecting a Database Table . . . 25
Querying the Database . . . 27
SPSS Import Node . . . 27
SAS Import Node . . . 29
Setting Options for the SAS Import Node . . . 30
Excel Import Node . . . 30
User Input Node . . . 32
Setting Options for the User Input Node . . . 33
Dimensions Import Node . . . 36
Dimensions Import File Options . . . 37
Metadata Properties . . . 39
Database Connection String . . . 40
Advanced Properties . . . 41
Multiple Responses, Loops, and Grids . . . 41
Dimensions Column Import Notes . . . 43
Common Source Node Tabs . . . 44
Setting Data Types in the Source Node . . . 44
Filtering Fields from the Source Node . . . 45

3 Record Operations Nodes    47

Overview of Record Operations . . . 47
Select Node . . . 48
Sample Node . . . 49
Setting Options for the Sample Node . . . 49
Balance Node . . . 50
Setting Options for the Balance Node . . . 51
Aggregate Node . . . 52
Setting Options for the Aggregate Node . . . 53
Sort Node . . . 54
Sort Optimization Settings . . . 55
Merge Node . . . 56
Types of Joins . . . 57
Specifying a Merge Method and Keys . . . 59
Selecting Data for Partial Joins . . . 60
Filtering Fields from the Merge Node . . . 60
Setting Input Order and Tagging . . . 61
Merge Optimization Settings . . . 63
Append Node . . . 65
Setting Append Options . . . 65
Distinct Node . . . 66

4 Field Operations Nodes    68

Field Operations Overview . . . 68
Type Node . . . 70
Data Types . . . 71
What Is Instantiation? . . . 73
Data Values . . . 74
Checking Type Values . . . 79
Setting Field Direction . . . 80
Copying Type Attributes . . . 81
Field Format Settings Tab . . . 82
Filter Node . . . 84
Setting Filtering Options . . . 85
Derive Node . . . 87
Setting Basic Options for the Derive Node . . . 88
Deriving Multiple Fields . . . 89
Setting Derive Formula Options . . . 91
Setting Derive Flag Options . . . 92
Setting Derive Set Options . . . 93
Setting Derive State Options . . . 94
Setting Derive Count Options . . . 95
Setting Derive Conditional Options . . . 96
Recoding Values with the Derive Node . . . 97
Filler Node . . . 98
Storage Conversion Using the Filler Node . . . 100
Anonymize Node . . . 101
Setting Options for the Anonymize Node . . . 102
Anonymizing Field Values . . . 104
Reclassify Node . . . 105
Setting Options for the Reclassify Node . . . 106
Reclassifying Multiple Fields . . . 108
Storage and Type for Reclassified Fields . . . 109
Binning Node . . . 109
Setting Options for the Binning Node . . . 110
Fixed-Width Bins . . . 111
Tiles (Equal Count or Sum) . . . 112
Rank Cases . . . 115
Mean/Standard Deviation . . . 116
Optimal Binning . . . 117
Previewing the Generated Bins . . . 118
Partition Node . . . 119
Partition Node Options . . . 120
Set to Flag Node . . . 121
Setting Options for the Set to Flag Node . . . 122
Restructure Node . . . 123
Setting Options for the Restructure Node . . . 124
Transpose Node . . . 125
Setting Options for the Transpose Node . . . 126
Time Intervals Node . . . 128
Specifying Time Intervals . . . 129
Time Interval Build Options . . . 131
Estimation Period . . . 133
Forecasts . . . 133
Supported Intervals . . . 136
History Node . . . 146
Setting Options for the History Node . . . 147
Field Reorder Node . . . 148
Setting Field Reorder Options . . . 148
SPSS Transform Node . . . 151
Setting Syntax Options . . . 152
Allowable Syntax . . . 152

5 Graph Nodes    155

Graph Nodes Overview . . . 155
Overlay Graphs . . . 156
3-D Graphs . . . 158
Animation . . . 159
Building Graphs . . . 160
Setting Output Options for Graphs . . . 160
Setting Appearance Options for Graphs . . . 161
Viewing Graph Output . . . 162
Editing Graphs . . . 166
Adding Titles and Footnotes . . . 174
Using Graph Stylesheets . . . 175
Printing, Saving, Copying, and Exporting Graphs . . . 176
Plot Node . . . 176
Setting Options for the Plot Node . . . 179
Using a Plot Graph . . . 183
Multiplot Node . . . 188
Setting Options for the Multiplot Node . . . 188
Using a Multiplot Graph . . . 190
Distribution Node . . . 190
Setting Options for the Distribution Node . . . 191
Using a Distribution Node . . . 193
Histogram Node . . . 196
Setting Additional Options for the Histogram Node . . . 197
Using Histograms and Collections . . . 198
Collection Node . . . 201
Setting Additional Options for the Collection Node . . . 202
Using a Collection Graph . . . 203
Web Node . . . 205
Setting Options for the Web Node . . . 207
Setting Additional Options for the Web Node . . . 208
Appearance Options for the Web Plot . . . 210
Using a Web Graph . . . 211
Evaluation Chart Node . . . 215
Setting Options for the Evaluation Chart Node . . . 220
Reading the Results of a Model Evaluation . . . 222
Using an Evaluation Chart . . . 223
Time Plot Node . . . 225
Setting Options for the Time Plot Node . . . 226
Appearance Options for the Time Plot . . . 227
Using a Time Plot Graph . . . 228

6 Modeling Overview    231

Overview of Modeling Nodes . . . 231
Modeling Node Fields Options . . . 235
Overview of Generated Models . . . 237
The Models Palette . . . 238
Browsing Generated Models . . . 239
Generated Model Summary . . . 240
Using Generated Models in Streams . . . 241
Regenerating a Modeling Node . . . 242
Importing and Exporting Models as PMML . . . 242
Model Types Supporting PMML . . . 245
Unrefined Models . . . 246

7 Screening Models    247

Screening Fields and Records . . . 247
Feature Selection Node . . . 247
Feature Selection Model Settings . . . 248
Feature Selection Options . . . 250
Generated Feature Selection Models . . . 251
Feature Selection Model Results . . . 252
Selecting Fields by Importance . . . 253
Generating a Filter from a Feature Selection Model . . . 253
Anomaly Detection Node . . . 254
Anomaly Detection Model Options . . . 256
Anomaly Detection Expert Options . . . 257
Generated Anomaly Detection Models . . . 259
Anomaly Detection Model Details . . . 260
Anomaly Detection Model Summary . . . 261
Anomaly Detection Model Settings . . . 261

8 Binary Classifier Node    263

Binary Classifier Node Model Options . . . 265
Binary Classifier Node Expert Options . . . 267
Binary Classifier Node Stopping Rules . . . 268
Binary Classifier Node Discard Options . . . 269
Binary Classifier Results Browser . . . 270
Generating Nodes and Models . . . 271
Generating Evaluation Charts . . . 272

9 Decision Trees    273

Decision Tree Models . . . 273
The Tree Builder . . . 275
Growing and Pruning the Tree . . . 276
Defining Custom Splits . . . 277
Split Details and Surrogates . . . 279
Customizing the Tree View . . . 281
Gains . . . 282
Risks . . . 292
Saving Tree Models and Results . . . 293
Generating Filter and Select Nodes . . . 295
Generating a Ruleset from a Decision Tree . . . 295
C&R Tree Node . . . 296
Tree Node Model Options . . . 297
C&R Tree Node Expert Options . . . 301
Tree Node Stopping Options . . . 302
Prior Probability Options . . . 303
Misclassification Cost Options . . . 304
CHAID Node . . . 305
CHAID Node Expert Options . . . 305
QUEST Node . . . 307
QUEST Node Expert Options . . . 307
C5.0 Node . . . 308
C5.0 Node Model Options . . . 310
Generating a Tree Model Directly . . . 312
Generated Decision Tree Models . . . 312
Decision Tree Model Rules . . . 313
Decision Tree Model Viewer . . . 316
Decision Tree/Ruleset Model Settings . . . 317
Boosted C5.0 Models . . . 319
Ruleset Nodes . . . 320
Ruleset Model Tab . . . 321
Importing Projects from AnswerTree 3.0 . . . 322

10 Neural Networks    323

Neural Net Node . . . 323
Neural Net Node Model Options . . . 324
Neural Net Node Additional Options . . . 326
Neural Net Node Learning Rates . . . 328
Generated Neural Network Models . . . 329
Neural Network Model Settings . . . 329
Neural Network Model Summary . . . 331
Generating a Filter Node from a Neural Network . . . 332

11 Decision List    333

Decision List Model Options . . . 338
Decision List Node Expert Options . . . 340
Generated Decision List Models . . . 341
Decision List Generated Model Settings Tab . . . 342
Decision List Viewer . . . 342
Decision List Viewer Workspace . . . 342
Working with Decision List Viewer . . . 348

12 Statistical Models    363

Linear Regression Node . . . 364
Linear Regression Node Model Options . . . 364
Linear Regression Node Expert Options . . . 366
Linear Regression Node Stepping Options . . . 367
Linear Regression Node Output Options . . . 367
Generated Linear Regression Models . . . 368
Linear Regression Model Summary . . . 369
Linear Regression Model Advanced Output . . . 370
Logistic Regression Node . . . 372
Logistic Regression Node Model Options . . . 373
Adding Terms to a Logistic Regression Model . . . 378
Logistic Regression Node Expert Options . . . 379
Logistic Regression Node Convergence Options . . . 380
Logistic Regression Node Output Options . . . 381
Logistic Regression Node Stepping Options . . . 382
Generated Logistic Regression Models . . . 384
Logistic Regression Model Equations . . . 385
Logistic Regression Model Summary . . . 386
Logistic Regression Model Settings . . . 387
Logistic Regression Model Advanced Output . . . 388
Factor Analysis/PCA Node . . . 390
Factor Analysis/PCA Node Model Options . . . 391
Factor Analysis/PCA Node Expert Options . . . 392
Factor/PCA Node Rotation Options . . . 393
Generated Factor Models . . . 394
Factor Model Equations . . . 395
Factor Model Summary . . . 395
Factor Model Advanced Output . . . 396
Discriminant Node . . . 398
Discriminant Node Model Options . . . 398
Discriminant Node Expert Options . . . 399
Discriminant Node Output Options . . . 400
Discriminant Node Stepping Options . . . 402
Generated Discriminant Models . . . 403
Discriminant Model Summary . . . 404
Discriminant Model Advanced Output . . . 404
Generalized Linear Models Node . . . 405
Generalized Linear Models Node Field Options . . . 406
Generalized Linear Models Node Model Options . . . 407
Generalized Linear Models Node Expert Options . . . 409
Generalized Linear Models Node Iterations Options . . . 412
Generalized Linear Models Node Output Options . . . 413
Generated Generalized Linear Models . . . 415
Generalized Linear Models Model Summary . . . 415
Generalized Linear Models Model Advanced Output . . . 416

13 Clustering Models    418

Kohonen Node . . . 419
Kohonen Node Model Options . . . 421
Kohonen Node Expert Options . . . 423
Generated Kohonen Models . . . 424
Kohonen Model Cluster Details . . . 424
Kohonen Model Summary . . . 425
K-Means Node . . . 426
K-Means Node Model Options . . . 427
K-Means Node Expert Options . . . 428
Generated K-Means Models . . . 429
K-Means Model Cluster Details . . . 429
K-Means Model Summary . . . 430
TwoStep Cluster Node . . . 431
TwoStep Cluster Node Model Options . . . 432
Generated TwoStep Cluster Models . . . 433
TwoStep Model Cluster Details . . . 433
TwoStep Model Summary . . . 434
The Cluster Viewer . . . 435
Cluster Viewer Tab . . . 436

14 Association Rules    448

Tabular versus Transactional Data . . . 449
GRI Node . . . 450
GRI Node Model Options . . . 451
Apriori Node . . . 452
Apriori Node Model Options . . . 452
Apriori Node Expert Options . . . 454
CARMA Node . . . 456
CARMA Node Fields Options . . . 456
CARMA Node Model Options . . . 458
CARMA Node Expert Options . . . 459
Generated Association Rule Models . . . 460
Association Rule Model Details . . . 461
Visualizing Association Rule Models with IBM Tools . . . 465
Association Rule Model Summary . . . 467
Generating a Ruleset from an Association Model . . . 468
Generating a Filtered Model . . . 469
Association Rule Model Settings . . . 470
Scoring Association Rules . . . 472
Deploying Association Models . . . 473
Sequence Node . . . 476
Sequence Node Fields Options . . . 476
Sequence Node Model Options . . . 478
Sequence Node Expert Options . . . 479
Generated Sequence Rule Models . . . 481
Sequence Rule Model Details . . . 482
Sequence Rule Model Settings . . . 485
Sequence Rule Model Summary . . . 485
Generating a Rule SuperNode from a Sequence Rule Model . . . 486

15 Time Series Models    488

Why Forecast? . . . 488
Time Series Data . . . 488
Characteristics of Time Series . . . 489
Autocorrelation and Partial Autocorrelation Functions . . . 493
Series Transformations . . . 494
Predictor Series . . . 494
Time Series Node . . . 495
Requirements . . . 496
Time Series Model Options . . . 498
Time Series Expert Modeler Criteria . . . 499
Time Series Exponential Smoothing Criteria . . . 501
Time Series ARIMA Criteria . . . 502
Transfer Functions . . . 504
Handling Outliers . . . 506
Generated Time Series Models . . . 507
Generating Multiple Models . . . 507
Using Time Series Models in Forecasting . . . 507
Re-estimating and Forecasting . . . 508
Time Series Model Node . . . 509
Time Series Model Residuals . . . 512
Time Series Model Summary . . . 513
Time Series Model Settings . . . 514

16 Self-Learning Response Node Models    515

SLRM Node . . . 515
SLRM Node Fields Options . . . 516
SLRM Node Model Options . . . 517
SLRM Node Settings Options . . . 518
Generated SLRM Models . . . 520
SLRM Model Settings . . . 521

17 Output Nodes    523

Overview of Output Nodes . . . 523
Managing Output . . . 524
Viewing Output . . . 525
View Output in an HTML Browser . . . 525
Exporting Output . . . 526
Selecting Cells and Columns . . . 527
Table Node . . . 528
Table Node Settings Tab . . . 528
Table Node Format Tab . . . 528
Output Node Output Tab . . . 529
Table Browser . . . 531
Matrix Node . . . 532
Matrix Node Settings Tab . . . 532
Matrix Node Appearance Tab . . . 534
Matrix Node Output Browser . . . 535
Analysis Node . . . 537
Analysis Node Analysis Tab . . . 537
Analysis Output Browser . . . 539
Data Audit Node . . . 541
Data Audit Node Settings Tab . . . 542
Data Audit Quality Tab . . . 543
Data Audit Output Browser . . . 545
Statistics Node . . . 554
Statistics Node Settings Tab . . . 555
Statistics Output Browser . . . 556
Means Node . . . 558
Comparing Means for Independent Groups . . . 559
Comparing Means Between Paired Fields . . . 560
Means Node Options . . . 560
Means Node Output Browser . . . 561
Report Node . . . 563
Report Node Template Tab . . . 564
Report Node Output Browser . . . 566
Set Globals Node . . . 566
Set Globals Node Settings Tab . . . 567
Transform Node . . . 568
Transform Node Options Tab . . . 569
Transform Node Output Tab . . . 570
Transform Node Output Viewer . . . 570
SPSS Output Node . . . 573
SPSS Output Node Syntax Tab . . . 573
SPSS Output Node Output Tab . . . 574
SPSS Output Browser . . . 575
SPSS Helper Applications . . . 575
Other Helper Applications . . . 577

18 Export Nodes    578

Overview of Export Nodes . . . 578
Database Output Node . . . 578
Database Node Export Tab . . . 580
Database Output Schema Options . . . 581
Database Output Index Options . . . 582
Database Output Advanced Options . . . 585
Flat File Node . . . 587
Flat File Export Tab . . . 587
SPSS Export Node . . . 588
SPSS Export Node Export Tab . . . 589
Renaming or Filtering Fields for SPSS . . . 590
SAS Export Node . . . 591
SAS Export Node Export Tab . . . 591
Excel Export Node . . . 592
Excel Node Export Tab . . . 592

19 SuperNodes 593

Overview of SuperNodes . . . 593
Types of SuperNodes . . . 593
  Source SuperNodes . . . 594
  Process SuperNodes . . . 594
  Terminal SuperNodes . . . 595
Creating SuperNodes . . . 596
  Nesting SuperNodes . . . 598
  Examples of Valid SuperNodes . . . 599
  Examples of Invalid SuperNodes . . . 600
Editing SuperNodes . . . 601
  Modifying SuperNode Types . . . 602
  Annotating and Renaming SuperNodes . . . 602
  SuperNode Parameters . . . 603
  SuperNodes and Caching . . . 607
  SuperNodes and Scripting . . . 607
Saving and Loading SuperNodes . . . 608


Glossary 610
Bibliography 614
Index 615


Chapter 1

About Clementine

Clementine is a data mining workbench that enables you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the industry-standard CRISP-DM model, Clementine supports the entire data mining process, from data to better business results. Clementine can be purchased as a standalone product, or in combination with a number of modules and options as summarized in the following sections. Note that additional products or updates may also be available. For complete information, see the Clementine home page (http://www.spss.com/clementine/).

Clementine Client, Server, and Batch


Clementine uses a client/server architecture to distribute requests for resource-intensive operations to powerful server software, resulting in faster performance on larger datasets. Additional products or updates beyond those listed here may also be available. For the most current information, see the Clementine Web page (http://www.spss.com/clementine/).
Clementine Client. Clementine Client is a functionally complete version of the product that is installed and run on the user's desktop computer. It can be run in local mode as a standalone product or in distributed mode along with Clementine Server for improved performance on large datasets.
Clementine Server. Clementine Server runs continually in distributed analysis mode together with one or more client installations, providing superior performance on large datasets because memory-intensive operations can be done on the server without downloading data to the client computer. Clementine Server also provides support for SQL optimization, batch-mode processing, and in-database modeling capabilities, delivering further benefits in performance and automation. At least one Clementine Client or Clementine Batch installation must be present to run an analysis.
Clementine Batch. Clementine Batch is a special version of the client that runs in batch mode only, providing support for the complete analytical capabilities of Clementine without access to the regular user interface. This allows long-running or repetitive tasks to be performed without user intervention and without the presence of the user interface on the screen. Unlike Clementine Client, which can be run as a standalone product, Clementine Batch must be licensed and used only in combination with Clementine Server.


Clementine Modules
The Clementine Base module includes a selection of the most commonly used analytical nodes to allow customers to get started with data mining. A broad range of modeling techniques are supported, including classification (decision trees), segmentation or clustering, association, and statistical methods. More specialized analytical modules are also available as add-ons to the Base module, as summarized below. For purchasing information, contact your sales representative, or see the Clementine home page (http://www.spss.com/clementine/). The following nodes are included in the Base module:
The Classification and Regression Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups). For more information, see C&R Tree Node in Chapter 9 on p. 296.

The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&RT analyses while also reducing the tendency found in classification tree methods to favor predictors that allow more splits. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. For more information, see QUEST Node in Chapter 9 on p. 307.

The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&RT and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and predictor fields can be range or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see CHAID Node in Chapter 9 on p. 305.

The K-Means node clusters the dataset into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see K-Means Node in Chapter 13 on p. 426.

The Generalized Rule Induction (GRI) node discovers association rules in the data. For example, customers who purchase razors and aftershave lotion are also likely to purchase shaving cream. GRI extracts rules with the highest information content based on an index that takes both the generality (support) and accuracy (confidence) of rules into account. GRI can handle numeric and categorical inputs, but the target must be categorical. For more information, see GRI Node in Chapter 14 on p. 450.
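To make the iterative procedure behind k-means concrete, the following short Python sketch implements the generic assign-and-update loop described above. It is illustrative only: it is not Clementine's K-Means implementation, and the sample points, the choice of k, and the iteration limit are invented for the example.

```python
import random

def kmeans(points, k, iterations=100):
    # Start from k randomly chosen records as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each record to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its assigned records.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centers.append(old)
        if new_centers == centers:   # no further refinement possible
            break
        centers = new_centers
    return centers

data = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
print(kmeans(data, k=2))
```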

3 About Clementine

The Factor/PCA node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields. For more information, see Factor Analysis/PCA Node in Chapter 12 on p. 390.

Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values. For more information, see Linear Regression Node in Chapter 12 on p. 364.

Classification Module

The Classification module helps organizations to predict a known result, such as whether a customer will buy or leave or whether a transaction fits a known pattern of fraud. Modeling techniques include machine learning (neural networks), decision trees (rule induction), subgroup identification, statistical methods, and multiple model generation. The following nodes are included:
The Binary Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or don't, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify. For more information, see Binary Classifier Node in Chapter 8 on p. 263.

The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply. For more information, see Neural Net Node in Chapter 10 on p. 323.

The C5.0 node builds either a decision tree or a ruleset. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see C5.0 Node in Chapter 9 on p. 308.

The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side in order to compare the results. For more information, see Decision List in Chapter 11 on p. 333.


The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecast data. A Time Series node must always be preceded by a Time Intervals node. For more information, see Time Series Node in Chapter 15 on p. 495.

The Feature Selection node screens predictor fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining predictors relative to a specified target. For example, given a dataset with hundreds of potential predictors, which are most likely to be useful in modeling patient outcomes? For more information, see Feature Selection Node in Chapter 7 on p. 247.

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see Logistic Regression Node in Chapter 12 on p. 372.

Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see Discriminant Node in Chapter 12 on p. 398.

The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see Generalized Linear Models Node in Chapter 12 on p. 405.

The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or small number of new cases, can be used to re-estimate the model without having to retrain the model using all data. For more information, see SLRM Node in Chapter 16 on p. 515.

Segmentation Module

The Segmentation module is recommended in cases where the specific result is unknown (for example, when identifying new patterns of fraud, or when identifying groups of interest in your customer base). Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics, and it distinguishes clustering models from the other machine-learning techniques available in Clementine: there is no predefined output or target field for the model to predict. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings. Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses (for example, by segmenting potential customers into homogeneous subgroups).


The following nodes are included:


The Kohonen node generates a type of neural network that can be used to cluster the dataset into distinct groups. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. You can look at the number of observations captured by each unit in the generated model to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see Kohonen Node in Chapter 13 on p. 419.

The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large datasets efficiently. For more information, see TwoStep Cluster Node in Chapter 13 on p. 431.

The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of normal data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see Anomaly Detection Node in Chapter 7 on p. 254.

Association Module

The Association module is most useful when predicting multiple outcomes, for example, customers who bought product X also bought Y and Z. Association rule algorithms automatically find the associations that you could find manually using visualization techniques, such as the Web node. The advantage of association rule algorithms over the more standard decision tree algorithms (C5.0 and C&RT) is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion. The following nodes are included:
The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently. For large problems, Apriori is generally faster to train than GRI; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see Apriori Node in Chapter 14 on p. 452.

The CARMA model extracts a set of rules from the data without requiring you to specify In (predictor) or Out (target) fields. In contrast to Apriori and GRI, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support. This means that the rules generated can be used for a wider variety of applications, for example, to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. For more information, see CARMA Node in Chapter 14 on p. 456.
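The generality (support) and accuracy (confidence) measures used to rank association rules are easy to state precisely. The short Python sketch below computes them for a single hand-written rule over a few invented transactions; it is not the GRI, Apriori, or CARMA algorithm itself, only the measures those algorithms evaluate.

```python
transactions = [
    {"razor", "aftershave", "shaving cream"},
    {"razor", "aftershave"},
    {"razor", "shaving cream"},
    {"aftershave"},
    {"razor", "aftershave", "shaving cream"},
]

# Hypothetical rule: razor & aftershave => shaving cream
antecedent = {"razor", "aftershave"}
consequent = {"shaving cream"}

n = len(transactions)
antecedent_count = sum(antecedent <= t for t in transactions)
rule_count = sum((antecedent | consequent) <= t for t in transactions)

antecedent_support = antecedent_count / n    # how general the rule is
rule_support = rule_count / n                # support for antecedent and consequent together
confidence = rule_count / antecedent_count   # how accurate the rule is

print(antecedent_support, rule_support, confidence)
```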


The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see Sequence Node in Chapter 14 on p. 476.

Clementine Options
In addition to the modules, the following components and features can be separately purchased and licensed for use with Clementine. Note that additional products or updates may also become available. For complete information, see the Clementine home page (http://www.spss.com/clementine/).
- Clementine Server access, providing improved scalability and performance on large datasets, as well as support for SQL optimization, batch-mode automation, and in-database modeling capabilities.
- Clementine Solution Publisher, for real-time or automated scoring outside the Clementine environment. For more information, see Clementine Solution Publisher in Chapter 2 in Clementine 11.1 Solution Publisher.
- Deployment to Predictive Enterprise Services. For more information, see Deploying Scenarios to the Predictive Enterprise Repository in Clementine 11.1 User's Guide.

Clementine Solution Publisher


Clementine Solution Publisher is an add-on component that enables organizations to publish Clementine streams for use outside of the standard Clementine environment. Published streams can be executed by using the Clementine Solution Publisher Runtime, which can be distributed and deployed as needed. Solution Publisher is installed along with Clementine Client but requires a separate license to enable the functionality.

Text Mining for Clementine


Text Mining for Clementine is a fully integrated add-on for Clementine that uses advanced linguistic technologies and Natural Language Processing (NLP) to rapidly process a large variety of unstructured text data, extract and organize the key concepts, and group these concepts into categories. Extracted concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling using Clementine's full suite of data mining tools to yield better and more focused decisions. The Text Mining node offers concept and category modeling as well as an interactive workbench where you can perform advanced exploration of text links and clusters, create your own categories, and refine the linguistic resource templates. A number of import formats are supported, including blogs and other Web-based sources. Custom templates, libraries, and dictionaries for specific domains, such as CRM and genomics, are also included.


Note: A separate license is required to access this component. For more information, contact your sales representative or see the Clementine Web page (http://www.spss.com/clementine/).

Web Mining for Clementine


Web Mining for Clementine is an add-on module that allows analysts to perform ad hoc predictive Web analysis within Clementine's intuitive visual workflow interface. Powered by proven NetGenesis Web analytics technology, Web Mining for Clementine transforms raw Web data into analysis-ready business events that allow you to segment users, understand the pathways and affinities charted by users as they navigate your site, and predict user propensity to convert, buy, or churn.

Database Modeling and Optimization


Clementine supports integration with data mining and modeling tools that are available from database vendors, including Oracle Data Miner, IBM DB2 Intelligent Miner, and Microsoft Analysis Services 2005. You can build, score, and store models inside the database, all from within the Clementine application. This allows you to combine the analytical capabilities and ease of use of Clementine with the power and performance of a database, while taking advantage of database-native algorithms provided by these vendors. Models are built inside the database and can then be browsed and scored through the Clementine interface in the normal manner, and can be deployed using Clementine Solution Publisher if needed. Supported algorithms are on the Database Modeling palette in Clementine. Using Clementine to access database-native algorithms offers several advantages:
- In-database algorithms are often closely integrated with the database server and may offer improved performance.
- Models built and stored in the database may be more easily deployed to and shared with any application that can access the database.
SQL Optimization. In-database modeling is closely related to, but distinct from, SQL Optimization, which allows you to generate SQL statements for native Clementine operations that can be pushed back to the database in order to improve performance. For example, the Merge, Aggregate, and Select nodes all generate SQL code that can be pushed back to the database in this manner. Using SQL Optimization in combination with database modeling may result in streams that can be executed from start to finish in the database, resulting in significant performance gains over streams executed in Clementine. For more information, see SQL Optimization in Chapter 6 in Clementine 11.1 Server Administration and Performance Guide.

Deploying Scenarios to the Predictive Enterprise Repository


Streams created in Clementine can be packaged as scenarios and deployed to SPSS Predictive Enterprise Services for purposes of automated scoring and model refresh, as well as further use in Predictive Applications 5.0. For example, a Self-Learning (SLRM) model can be automatically updated at regularly scheduled intervals as new data becomes available, or a set of streams can be deployed for purposes of Champion-Challenger analysis.
About SPSS Predictive Enterprise Services

SPSS Predictive Enterprise Services is an enterprise-level application that enables widespread use and deployment of predictive analytics. SPSS Predictive Enterprise Services provides centralized, secure, and auditable storage of analytical assets, advanced capabilities for management and control of predictive analytic processes, as well as sophisticated mechanisms for delivering the results of analytical processing to the end users. The benefits of SPSS Predictive Enterprise Services include safeguarding the value of analytical assets, ensuring compliance with regulatory requirements, improving the productivity of analysts, and minimizing the IT costs of managing analytics.
Other Deployment Methods

While SPSS Predictive Enterprise Services offers the most extensive features for managing enterprise content, a number of other mechanisms for deploying or exporting streams are also available, including:
- Use the Predictive Applications 4.x Wizard to export streams for deployment to that version of Predictive Applications. For more information, see Predictive Applications 4.x Wizard in Chapter 10 in Clementine 11.1 User's Guide.
- Use a Publisher node to export the stream and model for later use with Clementine Solution Publisher Runtime. For more information, see Clementine Solution Publisher in Chapter 2 in Clementine 11.1 Solution Publisher.
- Use the Cleo Wizard to prepare a stream for deployment as a Cleo scenario for real-time scoring over the Web. For more information, see Exporting to Cleo in Chapter 10 in Clementine 11.1 User's Guide.
- Export one or more models in PMML, an XML-based format for encoding model information. For more information, see Importing and Exporting Models as PMML in Chapter 10 in Clementine 11.1 User's Guide.

Chapter 2

Source Nodes
Overview

Source nodes enable you to import data stored in a number of formats, including flat files, SPSS (.sav), SAS, Microsoft Excel, and ODBC-compliant relational databases. You can also generate synthetic data using the User Input node. The Sources palette contains the following nodes:
The Enterprise View node creates a connection to a Predictive Enterprise Repository, enabling you to read Enterprise View data into a stream and to package a model in a scenario that can be accessed from the repository by other users. For more information, see Enterprise View Node on p. 10.

The Database node can be used to import data from a variety of other packages using ODBC (Open Database Connectivity), including Microsoft SQL Server, DB2, Oracle, and others. For more information, see Database Node on p. 22.

The Variable File node reads data from free-field text files, that is, files whose records contain a constant number of fields but a varied number of characters. This node is also useful for files with fixed-length header text and certain types of annotations. For more information, see Variable File Node on p. 15.

The Fixed File node imports data from fixed-field text files, that is, files whose fields are not delimited but start at the same position and are of a fixed length. Machine-generated or legacy data are frequently stored in fixed-field format. For more information, see Fixed File Node on p. 18.

The SPSS Import node reads data directly from a saved SPSS file (.sav). This format has replaced the Clementine cache file from earlier versions of Clementine. If you would like to import a saved cache file, you should use this node. For more information, see SPSS Import Node on p. 27.

The Dimensions Data Import node imports survey data based on the Dimensions Data Model used by SPSS market research products. The Dimensions Data Library must be installed to use this node. For more information, see Dimensions Import Node on p. 36.

The SAS Import node imports SAS data into Clementine. For more information, see SAS Import Node on p. 29.


The Excel Import node imports data from any version of Microsoft Excel. An ODBC data source is not required. For more information, see Excel Import Node on p. 30.

The User Input node provides an easy way to create synthetic data, either from scratch or by altering existing data. This is useful, for example, when you want to create a test dataset for modeling. For more information, see User Input Node on p. 32.

To begin a stream, add a source node to the stream canvas. Next, double-click the node to open its dialog box. The various tabs in the dialog box allow you to read in data; view the fields and values; and set a variety of options, including filters, data types, field direction, and missing-value checking.

Enterprise View Node


The Enterprise View node enables you to create and maintain a connection between a Clementine session and an Enterprise View in a shared SPSS Predictive Enterprise Repository. Doing so allows you to read data from an Enterprise View into a Clementine stream, and to package a Clementine model in a scenario that can be accessed by other users of the shared repository. A scenario is a file containing a Clementine stream with specific nodes, models, and additional properties that enable it to be deployed to a Predictive Enterprise Repository for the purposes of scoring or automatic model refresh. The use of Enterprise View nodes with scenarios ensures that, in a multi-user situation, all users are working from the same data.

A connection is a link from a Clementine session to an Enterprise View in the Predictive Enterprise Repository. The Enterprise View is the complete set of the data belonging to an organization, irrespective of where the data is physically located. Each connection consists of a specific selection of a single Application View (a subset of the Enterprise View tailored for a particular application), a Data Provider Definition (DPD, which links the logical Application View tables and columns to a physical data source), and an environment (which identifies which particular columns should be associated with defined business segments). The Enterprise View, Application Views, and DPD definitions are stored and versioned in the repository, although the actual data resides in one or more databases or other external sources.

Once a connection has been established, you specify an Application View table to work with in Clementine. In an Application View, a table is a logical view consisting of some or all columns from one or more physical tables in one or more physical databases. Thus the Enterprise View node allows records from multiple database tables to be seen as a single table in Clementine.
Requirements

To use the Enterprise View node, a Predictive Enterprise Repository must first be installed and configured at your site, with an Enterprise View, Application Views, and DPDs already defined. For more information, contact your local administrator, or see the SPSS Web site at http://www.spss.com/predictive_enterprise_services/. In addition, the PEV driver must be installed on each computer used to modify or execute the stream. For Windows, simply install the driver on the computer where Clementine Client or Clementine Server is installed, and no further configuration of the driver is needed.


On UNIX, a reference to the pev.sh script must be added to the startup script. For more information, see Configuring a Driver for the Enterprise View Node in Appendix B in Clementine 11.1 Server Administration and Performance Guide. Contact your local administrator for details on installing the PEV driver.

Setting Options for the Enterprise View Node


You can use the options on the Data tab of the Enterprise View dialog box to:
- Select an existing repository connection
- Edit an existing repository connection
- Create a new repository connection
- Select an Application View table
Refer to the SPSS Predictive Enterprise Services Administrator's Guide for details on working with repositories.
Figure 2-1 Adding a connection to a Predictive Enterprise Repository

Connection. The drop-down list provides options for selecting an existing repository connection, editing an existing connection, or adding a connection. If you are already logged in to a repository through Clementine, choosing the Add/Edit a connection ... option displays the Enterprise View Connections dialog box, from where you can define or edit the required details for the current connection. If you are not logged in, this option displays the repository Login dialog box.

Figure 2-2 Logging in to a repository

Once a connection to a repository has been established, that connection remains in place until you exit from Clementine. A connection can be shared by other nodes within the same stream, but you must create a new connection for each new stream. A successful login displays the Enterprise View Connections dialog box.
Table name. This field is initially empty and cannot be populated until you create a connection. If you know the name of the Application View table you would like to access, enter it in the Table Name field. Otherwise, click the Select button to open a dialog box listing the available Application View tables.

Enterprise View Connections


This dialog box enables you to define or edit the required details for the repository connection. You can specify the:
- Application View and version
- Environment
- Data Provider Definition (DPD)
- Connection description

Figure 2-3 Choosing an application view

Connections. Lists existing repository connections.

Add a new connection. Displays the Retrieve Object dialog box, from where you can search for and select an Application View from the repository.

Copy the selected connection. Makes a copy of a selected connection, saving you from having to browse again for the same Application View.

Delete the selected connection. Deletes the selected connection from the list.

Connection Details. For the connection currently selected in the Connections pane, displays the Application View, version label, environment, DPD, and descriptive text.


Application view. The drop-down list displays the selected application view, if any. If connections have been made to other Application Views in the current session, these also appear on the drop-down list. Click the adjacent Browse button to search for other Application Views in the repository.

Version. The drop-down field lists all defined version labels for the specified Application View. Version labels help identify specific repository object versions. For example, there may be two versions for a particular Application View. By using labels, you could specify the label TEST for the version used in the development environment and the label PRODUCTION for the version used in the production environment. Select an appropriate label.

Environment. The drop-down field lists all valid environments. The environment setting determines which DPDs are available, thus specifying which particular columns should be associated with defined business segments. For example, when Analytic is selected, only those Application View columns defined as Analytic are returned. The default environment is Analytic; you can also choose Operational.


Data provider. The drop-down list displays the names of up to ten Data Provider Definitions for the selected Application View. Only DPDs that reference the selected Application View are shown. Click the adjacent Browse button to view the name and path of all DPDs related to the current Application View.

Description. Descriptive text about the repository connection. This text will be used for the connection name; clicking OK causes the text to appear on the Connection drop-down list and title bar of the Enterprise View dialog box, and as the label of the Enterprise View node on the canvas.

Choosing the DPD


The Select Data Provider dialog box shows the name and path of all the DPDs that reference the current Application View.
Figure 2-4 Choosing a DPD

Application Views can have multiple DPDs in order to support the different stages of a project. For example, the historic data used to build a model may come from one database, while operational data comes from another. A DPD is defined against a particular ODBC data source. To use a DPD from Clementine, you must have an ODBC data source defined on the Clementine server host which has the same name, and which connects to the same data store, as the one referenced in the DPD.
E To choose a DPD to work with, select its name on the list and click OK.

Choosing the Table


The Select Table dialog box lists all the tables that are referenced in the current Application View. The dialog box is empty if no connection has been made to a Predictive Enterprise Repository.

Figure 2-5 Choosing a table

E To choose a table to work with, select its name on the list and click OK.

Variable File Node


You can use Variable File nodes to read data from free-field text files (files whose records contain a constant number of fields but a varied number of characters). This type of node is also useful for files with fixed-length header text and certain types of annotations. During the execution of a stream, the Variable File node first tries to read the file. If the file does not exist or you do not have permission to read it, an error will occur and the execution will end. If there are no problems opening the file, records will be read one at a time and passed through the stream until the entire file is read.

Figure 2-6 Variable File node dialog box

Setting Options for the Variable File Node


File. Specify the name of the file. You can enter a filename or click the ellipsis button (...) to select a file. The file path is shown once you have selected a file, and its contents are displayed with delimiters in the panel below it. The sample text displayed from your data source can be copied and pasted into the following controls: EOL comment characters and user-specified delimiters. Use Ctrl-C and Ctrl-V to copy and paste.
Read field names from file. Selected by default, this option treats the first row in the data file as labels for the columns. If your first row is not a header, deselect this option to automatically give each field a generic name, such as Field1, Field2, and so on, for the number of fields in the dataset.
Specify number of fields. Specify the number of fields in each record. Clementine can detect the number of fields automatically as long as the records are new-line terminated. You can also set a number manually.

Skip header characters. Specify how many characters you want to ignore at the beginning of the first record.


EOL comment characters. Specify characters, such as # or !, to indicate annotations in the data. Wherever one of these characters appears in the data file, everything up to but not including the next new-line character will be ignored.
Strip lead and trail spaces. Select options for discarding leading and trailing spaces in strings on import.
Invalid characters. Select Discard to remove invalid characters from the data input. Select Replace with to replace invalid characters with the specified symbol (one character only). Invalid characters are null (0) characters or any character that does not exist in the current encoding.
Encoding. Specifies the text-encoding method used. You can choose between the system default, stream default, or UTF-8. The system default is specified in the Windows Control Panel or, if running in distributed mode, on the server computer. For more information, see Unicode Support in Clementine in Appendix B in Clementine 11.1 User's Guide. The stream default is specified in the Stream Properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Decimal symbol. Select the type of decimal separator used in your data source. The Stream default is the character selected from the Options tab of the stream properties dialog box. Otherwise, select either Period (.) or Comma (,) to read all data in this dialog box using the chosen character as the decimal separator.
Delimiters. Using the check boxes listed for this control, you can specify which characters, such as the comma (,), define field boundaries in the file. You can also specify more than one delimiter, such as , | for records that use multiple delimiters. The default delimiter is the comma. Note: If the comma is also defined as the decimal separator for streams, the default settings here will not work. In cases where the comma is both the field delimiter and the decimal separator, select Other in the Delimiters list. Then manually specify a comma in the entry field. Select Allow multiple blank delimiters to treat multiple adjacent blank delimiter characters as a single delimiter. For example, if one data value is followed by four spaces and then another data value, this group would be treated as two fields rather than five.
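The effect of the Allow multiple blank delimiters option can be illustrated outside Clementine with a couple of lines of Python; the sample record below is invented and this is only a sketch of the behavior, not Clementine's parser.

```python
import re

record = "675.39    M"              # one value, four spaces, then another value

# Each blank delimiter taken separately: five fields, three of them empty.
print(re.split(" ", record))        # ['675.39', '', '', '', 'M']

# Adjacent blank delimiters collapsed into a single delimiter: two fields.
print(re.split(" +", record))       # ['675.39', 'M']
```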
Lines to scan for type. Specify how many lines to scan for specified data types.

Quotes. Using the drop-down lists for this control, you can specify how single and double quotation marks are treated on import. You can choose to Discard all quotation marks, Include as text by including them in the field value, or Pair and discard to match pairs of quotation marks and remove them. If a quotation mark is unmatched, you will receive an error message. Both Discard and Pair and discard store the field value (without quotation marks) as a string.

At any point while you are working in this dialog box, click Refresh to reload fields from the data source. This is useful when you are altering data connections to the source node or when you are working between tabs in the dialog box.


Fixed File Node


You can use Fixed File nodes to import data from fixed-field text files (files whose fields are not delimited but start at the same position and are of a fixed length). Machine-generated or legacy data are frequently stored in fixed-field format. Using the File tab of the Fixed File node, you can easily specify the position and length of columns in your data.

Setting Options for the Fixed File Node


The File tab of the Fixed File node allows you to bring data into Clementine and to specify the position of columns and length of records. Using the data preview pane in the center of the dialog box, you can click to add arrows specifying the break points between fields.
Figure 2-7 Specifying columns in fixed-field data

File. Specify the name of the file. You can enter a filename or click the ellipsis button (...) to select a file. Once you have selected a file, the file path is shown and its contents are displayed with delimiters in the panel below.

The data preview pane can be used to specify column position and length. The ruler at the top of the preview window helps to measure the length of variables and to specify the break point between them. You can specify break point lines by clicking in the ruler area above the fields.


Break points can be moved by dragging and can be discarded by dragging them outside of the data preview region. Each break-point line automatically adds a new field to the fields table below. Start positions indicated by the arrows are automatically added to the Start column in the table below.
Line oriented. Select if you want to skip the new-line character at the end of each record.

Skip header lines. Specify how many lines you want to ignore at the beginning of the first record. This is useful for ignoring column headers.


Record length. Specify the number of characters in each record.

Lines to scan for type. Specify how many lines to scan for specified data types.

Field. All fields that you have defined for this data file are listed here. There are two ways to define fields:
- Specify fields interactively using the data preview pane above.
- Specify fields manually by adding empty field rows to the table below. Click the button to the right of the fields pane to add new fields. Then, in the empty field, enter a field name, a start position, and a length. These options will automatically add arrows to the data preview pane, which can be easily adjusted.
To remove a previously defined field, select the field in the list and click the red delete button.
Start. Specify the position of the first character in the field. For example, if the second field of a record begins on the sixteenth character, you would enter 16 as the starting point.

Length. Specify how many characters are in the longest value for each field. This determines the cutoff point for the next field.
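As a rough illustration of how Start and Length carve fields out of a fixed-width record, the following Python sketch slices a made-up record using a hypothetical field table. The layout and values are invented; note that the Start positions described above are 1-based, so 1 is subtracted when slicing.

```python
# (field name, start position, length) -- invented layout for illustration only
fields = [("id", 1, 5), ("name", 6, 10), ("balance", 16, 8)]

record = "00042" + "Smith".ljust(10) + "1234.50".rjust(8)

row = {name: record[start - 1 : start - 1 + length].strip()
       for name, start, length in fields}
print(row)   # {'id': '00042', 'name': 'Smith', 'balance': '1234.50'}
```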


Strip lead and trail spaces. Select to discard leading and trailing spaces in strings on import.

Invalid characters. Select Discard to remove invalid characters from the data input. Select Replace with to replace invalid characters with the specified symbol (one character only). Invalid characters are null (0) characters or any character that does not exist in the current encoding.
Encoding. Specifies the text-encoding method used. You can choose between the system default, stream default, or UTF-8. The system default is specified in the Windows Control Panel or, if running in distributed mode, on the server computer. For more information, see Unicode Support in Clementine in Appendix B in Clementine 11.1 User's Guide. The stream default is specified in the Stream Properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Decimal symbol. Select the type of decimal separator used in your data source. Stream default is the character selected from the Options tab of the stream properties dialog box. Otherwise, select either Period (.) or Comma (,) to read all data in this dialog box using the chosen character as the decimal separator.

At any point while working in this dialog box, click Refresh to reload fields from the data source. This is useful when altering data connections to the source node or when working between tabs on the dialog box.


Setting Field Storage and Formatting


Options on the Data tab for Fixed File, Variable File, and User Input nodes allow you to specify storage type, field formatting, and other metadata for fields as they are imported or created in Clementine. For data read from other sources, storage is determined automatically but can be changed using a conversion function, such as to_integer, in a Filler node or Derive node.
Figure 2-8 Overriding storage type and field formatting upon import

Field. Use the Field column to view and select fields in the current dataset.

Override. Select the check box in the Override column to activate options in the Storage and Input Format columns.


Data Storage

Storage describes the way data are stored in a field. For example, a field with values of 1 and 0 stores integer data. This is distinct from the data type, which describes the usage of the data in Clementine and does not affect storage. For example, you may want to set the type for an integer field with values of 1 and 0 to flag. This usually indicates that 1 = True and 0 = False. While storage must be determined at the source, data type can be changed using a Type node at any point in the stream. For more information, see Data Types in Chapter 4 on p. 71. Available storage types are:
String. Used for fields that contain non-numeric data, also called alphanumeric data. A string can include any sequence of characters, such as fred, Class 2, or 1234. Note that numbers in strings cannot be used in calculations.


Integer. A field whose values are integers.

Real. Values are numbers that may include decimals (not limited to integers). The display format is specified in the Stream Options dialog box and can be overridden for individual fields in a Type node (Format tab). For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Time. Time measured as a duration. For example, a service call lasting 1 hour, 26 minutes, and 38 seconds might be represented as 01:26:38, depending on the current time format as specified in the Stream Options dialog box.


Timestamp. Time values that indicate a specific hour of the day rather than a duration. For example, a service call beginning at exactly 9:04 A.M. could be logged as 09:04:00, again depending on the current time format.


Date. Date values specified in a standard format such as year, month, and day (for example, 2005-09-26). The specific format is specified in the Stream Options dialog box.

Storage conversions. You can convert storage for a field using a variety of conversion functions, such as to_string and to_integer, in a Filler node. For more information, see Storage Conversion Using the Filler Node in Chapter 4 on p. 100. Note that conversion functions (and any other functions that require a specific type of input such as a date or time value) depend on the current formats specified in the Stream Options dialog box. For example, if you want to convert a string field with values Jan 2003, Feb 2003, etc. to date storage, select MON YYYY as the default date format for the stream. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide. Conversion functions are also available from the Derive node, for temporary conversion during a derive calculation. You can also use the Derive node to perform other manipulations, such as recoding string fields with discrete values. For more information, see Recoding Values with the Derive Node in Chapter 4 on p. 97.

Reading in mixed data. Note that when reading in fields with numeric storage (either integer, real, time, timestamp, or date), any non-numeric values are set to null or system missing. This is because, unlike some applications, Clementine does not allow mixed storage types within a field. To avoid this, any fields with mixed data should be read in as strings, either by changing the storage type in the source node or in the external application as necessary.
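The behavior described for mixed data can be pictured with a small generic sketch (this is not Clementine code, and the sample values are invented): any value that cannot be parsed under the numeric storage type simply becomes null.

```python
def to_real(value):
    try:
        return float(value)
    except ValueError:
        return None          # non-numeric value in a numeric field becomes missing

raw = ["3.14", "7", "N/A", "", "12.5"]
print([to_real(v) for v in raw])     # [3.14, 7.0, None, None, 12.5]
```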

Field Input Format

For all storage types except String and Integer, you can specify formatting options for the selected field using the drop-down list. For example, when merging data from various locales, you may need to specify a period (.) as the decimal separator for one field, while another will require a comma separator. Input options specified in the source node override the formatting options specified in the stream properties dialog box; however, they do not persist later in the stream. They are intended to parse input correctly based on your knowledge of the data. The specified formats are used as a guide for parsing the data as they are read into Clementine, not to determine how they should be formatted after being read into Clementine. To specify formatting on a per-field basis elsewhere in the stream, use the Format tab of a Type node. For more information, see Field Format Settings Tab in Chapter 4 on p. 82.

Figure 2-9 Specifying date and time formats for timestamp fields

Options vary depending on the storage type. For example, for the Real storage type, you can select Period (.) or Comma (,) as the decimal separator. For timestamp fields, a separate dialog box opens when you select Specify from the drop-down list. For more information, see Setting Field Format Options in Chapter 4 on p. 83. For all storage types, you can also select Stream default to use the stream default settings for import. Stream settings are specified in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
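The reason per-field input formats matter when merging data from different locales can be shown with a short, generic sketch (invented values, not Clementine code): the same characters parse to very different numbers depending on which symbol is treated as the decimal separator.

```python
def parse_real(text, decimal_symbol="."):
    if decimal_symbol == ",":
        # Comma is the decimal point; any periods are grouping symbols.
        text = text.replace(".", "").replace(",", ".")
    else:
        # Period is the decimal point; any commas are grouping symbols.
        text = text.replace(",", "")
    return float(text)

print(parse_real("1.234,56", decimal_symbol=","))   # 1234.56
print(parse_real("1,234.56", decimal_symbol="."))   # 1234.56
```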

Additional Options

Several other options can be specified using the Data tab:
- To view storage settings for data that are no longer connected through the current node (train data, for example), select View unused field settings. You can clear the legacy fields by clicking Clear.
- At any point while working in this dialog box, click Refresh to reload fields from the data source. This is useful when you are altering data connections to the source node or when you are working between tabs on the dialog box.

Database Node
The Database node can be used to import data from a variety of other packages using ODBC (Open Database Connectivity), including Microsoft SQL Server, DB2, Oracle, and others. To read or write to a database, you must have an ODBC data source installed and configured for the relevant database, with read or write permissions as needed. The SPSS Data Access Pack includes a set of ODBC drivers that can be used for this purpose, and these drivers are available from the SPSS Web site at http://www.spss.com/drivers/clientCLEM.htm. If you have questions about creating or setting permissions for ODBC data sources, contact your database administrator.

Supported ODBC Drivers

For the latest information on which databases and ODBC drivers are supported and tested for use with Clementine 11.1, please review the product compatibility matrices on the SPSS Support site (http://support.spss.com).


Where to Install Drivers

Note that ODBC drivers must be installed and configured on each computer where processing may occur.
- If you are running Clementine Client in local (standalone) mode, the drivers must be installed on the local computer.
- If you are running Clementine Client or Clementine Batch in distributed mode against a remote Clementine Server, the ODBC drivers need to be installed on the computer where Clementine Server is installed.
- If you need to access the same data sources from both Clementine Client and Clementine Server, the ODBC drivers must be installed on both computers.
- If you are running Clementine Client over Terminal Services, the ODBC drivers need to be installed on the Terminal Services server on which you have Clementine Client installed.
- If you have purchased Clementine Solution Publisher and are using the Solution Publisher Runtime to execute published streams on a separate computer, you also need to install and configure ODBC drivers on that computer.
Use the following general steps to access data from a database:
E Install an ODBC driver and configure a data source to the database you want to use.
E In the Database node dialog box, connect to a database using Table mode or SQL Query mode.
E Select a table from the database.
E Using the tabs in the Database node dialog box, you can alter usage types and filter data fields.

These steps are described in more detail in the next several topics.
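Independently of Clementine, it can be useful to confirm that an ODBC data source is configured correctly before pointing a Database node at it. The sketch below uses the third-party pyodbc Python package for such a check; the DSN name, credentials, and table name are placeholders, and this is not part of the Clementine workflow itself.

```python
import pyodbc

# Placeholder DSN, credentials, and table -- substitute your own data source.
conn = pyodbc.connect("DSN=my_datasource;UID=my_user;PWD=my_password")
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table")
for row in cursor.fetchmany(5):      # print the first few records as a sanity check
    print(row)
conn.close()
```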

Setting Database Node Options


You can use the options on the Data tab of the Database node dialog box to gain access to a database and read data from the selected table.

Figure 2-10 Loading data by selecting a table

Mode. Select Table to connect to a table using the dialog box controls. Select SQL Query to query the database selected below using SQL.


Data source. For both Table and SQL Query modes, you can enter a name in the Data Source field or select Add new database connection from the drop-down list. The following options are used to connect to a database and select a table using the dialog box:
Table name. If you know the name of the table you would like to access, enter it in the Table Name field. Otherwise, click the Select button to open a dialog box listing the available tables.
Quote table and column names. Specify whether you want table and column names to be enclosed in quotation marks when queries are sent to the database (if, for example, they contain spaces or punctuation). The As needed option will quote table and field names only if they include nonstandard characters. Nonstandard characters include non-ASCII characters, space characters, and any non-alphanumeric character other than a full stop (.). Select Never if you never want table and field names quoted. Select Always if you want all table and field names quoted.
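The As needed rule can be expressed as a small predicate, sketched here in Python for illustration; the quoting character and sample names are assumptions, and the exact quoting syntax depends on the database.

```python
import re

def quote_as_needed(name, quote_char='"'):
    # Standard characters are ASCII letters, digits, and the full stop.
    if re.fullmatch(r"[A-Za-z0-9.]+", name):
        return name
    return quote_char + name + quote_char   # spaces, punctuation, non-ASCII, etc.

print(quote_as_needed("CUSTOMERS"))        # CUSTOMERS
print(quote_as_needed("order details"))    # "order details"
```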
Strip lead and trail spaces. Select options for discarding leading and trailing spaces in strings.

Reading empty strings from Oracle. When reading from or writing to an Oracle database, be aware that, unlike Clementine and unlike most other databases, Oracle treats and stores empty string values as equivalent to null values. This means that the same data extracted from an Oracle database may behave differently than when extracted from a file or another database, and the data may return different results.


Adding a Database Connection


In order to open a database, you first have to select the data source to which you want to connect. On the Data tab, select Add new database connection from the Data Source drop-down list. This opens the Database Connections dialog box.
Figure 2-11 Database Connections dialog box

Data sources. Lists the available data sources. Be sure to scroll down if you do not see the desired database. Once you have selected a data source and entered any passwords, click Connect. Click Refresh to update the list.

User name. If the data source is password protected, enter your user name.

Password. If the data source is password protected, enter your password.

Connections. Shows currently connected databases. To remove connections, select one from the list and click Remove.

Once you have completed your selections, click OK to return to the main dialog box and select a table from the currently connected database.

Selecting a Database Table


After you have connected to a data source, you can choose to import fields from a specific table or view. From the Data tab of the Database dialog box, you can either enter the name of a table in the Table Name field or click Select to open a dialog box listing the available tables and views.

Figure 2-12 Selecting a table from the currently connected database

Show table owner. Select if a data source requires that the owner of a table must be specified before you can access the table. Deselect this option for data sources that do not have this requirement. Note: SAS and Oracle databases usually require you to show the table owner.
Tables/Views. Select the table or view to import.

Show. Lists the columns in the data source to which you are currently connected. Click one of the following options to customize your view of the available tables:
- Click User Tables to view ordinary database tables created by database users.
- Click System Tables to view database tables owned by the system (for example, tables that provide information about the database, such as details of indexes). This option can be used to view the tabs used in Excel databases. (Note that a separate Excel Import node is also available. For more information, see Excel Import Node on p. 30.)
- Click Views to view virtual tables based on a query involving one or more ordinary tables.
- Click Synonyms to view synonyms created in the database for any existing tables.
Name/Owner filters. These fields allow you to filter the list of displayed tables by name or owner. For example, type SYS to list only tables with that owner. For wildcard searches, an underscore (_) can be used to represent any single character and a percent sign (%) can represent any sequence of zero or more characters.
Set As Default. Saves the current settings as the default for the current user. These settings will

be restored in the future when a user opens a new table selector dialog box for the same data source name and user login only.


Querying the Database


Once you have connected to a data source, you can choose to import fields using an SQL query. From the main dialog box, select SQL Query as the connection mode. This adds a query editor window to the dialog box. Using the query editor, you can create or load an SQL query whose result set will be read into the data stream. To cancel and close the query editor window, select Table as the connection mode.
Figure 2-13 Loading data using SQL queries

Load Query. Click to open the file browser, which you can use to load a previously saved query.
Save Query. Click to open the Save Query dialog box, which you can use to save the current query.
Import Default. Click to import an example SQL SELECT statement constructed automatically using the table and columns selected in the dialog box.


Clear. Clear the contents of the work area. Use this option when you want to start over.

SPSS Import Node


You can use the SPSS Import node to read data directly from a saved SPSS file (.sav). This format is now used to replace the Clementine cache file from earlier versions of Clementine. If you would like to import a saved cache file, you should use the SPSS Import node.

Figure 2-14 Importing an SPSS file

Import file. Specify the name of the file. You can enter a filename or click the ellipsis button (...) to select a file. The file path is shown once you have selected a file.
Variable names. Select a method of handling variable names and labels upon import from an SPSS .sav file. Metadata that you choose to include here persists throughout your work in Clementine and may be exported again for use in SPSS.
Read names and labels. Select to read both variable names and labels into Clementine. By default, this option is selected and variable names are displayed in the Type node. Labels may be displayed in charts, model browsers, and other types of output, depending on the options specified in the stream properties dialog box. By default, the display of labels in output is disabled. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Read labels as names. Select to read the descriptive variable labels from the SPSS .sav file rather than the short field names, and use these labels as variable names in Clementine.
Values. Select a method of handling values and labels upon import from an SPSS .sav file. Metadata that you choose to include here persists throughout your work in Clementine and may be exported again for use in SPSS. Note: Ordinal data from SPSS version 8 and higher is mapped to the Ordered Set type in Clementine.
Read data and labels. Select to read both actual values and value labels into Clementine. By default, this option is selected and values themselves are displayed in the Type node. Value labels may be displayed in the Expression Builder, charts, model browsers, and other types of output, depending on the options specified in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Read data as labels. Select if you want to use the value labels from the .sav file rather than the numerical or symbolic codes used to represent the values. For example, selecting this option for data with a gender field whose values of 1 and 2 actually represent male and female, respectively, will convert the field to a string and import male and female as the actual values. It is important to consider missing values in your SPSS data before selecting this option. For example, if a numeric field uses labels only for missing values (0 = No Answer, 99 = Unknown), then selecting the option above will import only the value labels No Answer and Unknown and will convert the field to a string. In such cases, you should import the values themselves and set missing values in a Type node.

SAS Import Node


The SAS Import node allows you to bring SAS data into your data mining session. You can import four types of files:
SAS for Windows/OS2 (.sd2)
SAS for UNIX (.ssd)
SAS Transport File (.tpt)
SAS version 7/8/9 (.sas7bdat)
When the data are imported, all variables are kept and no variable types are changed. All cases are selected.
Figure 2-15 Importing a SAS file


Setting Options for the SAS Import Node


Import. Select which type of SAS file to import. You can choose SAS for Windows/OS2 (.sd2), SAS for UNIX (.ssd), SAS Transport File (.tpt), or SAS Version 7/8/9 (.sas7bdat).
Import file. Specify the name of the file. You can enter a filename or click the ellipsis button (...) to browse to the file's location.
Member. Select a member to import from the SAS transport file selected above. You can enter a member name or click Select to browse through all members in the file.
Read user formats from a SAS data file. Select to read user formats. SAS files store data and data formats (such as variable labels) in different files. Most often, you will want to import the formats as well. If you have a large dataset, however, you may want to deselect this option to save memory.
Format file. If a format file is required, this text box is activated. You can enter a filename or click the ellipsis button (...) to browse to the file's location.


Variable names. Select a method of handling variable names and labels upon import from a SAS file. Metadata that you choose to include here persists throughout your work in Clementine and may be exported again for use in SAS.
Read names and labels. Select to read both variable names and labels into Clementine. By default, this option is selected and variable names are displayed in the Type node. Labels may be displayed in the Expression Builder, charts, model browsers, and other types of output, depending on the options specified in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Read labels as names. Select to read the descriptive variable labels from the SAS file rather than the short field names and use these labels as variable names in Clementine.

Excel Import Node


The Excel Import node allows you to import data from any version of Microsoft Excel. Excel import is supported for Clementine Client and Server running on Windows platforms only and is not available on UNIX platforms.

Figure 2-16 Excel Import node

Import file. Specifies the name and location of the spreadsheet file to import.
Use Named Range. Allows you to specify a named range of cells as defined in the Excel worksheet. Click the ellipsis button (...) to choose from the list of available ranges. All rows in the specified range are returned, including blank rows. If a named range is used, other worksheet and data range settings are no longer applicable and are disabled as a result.
Worksheet. Specifies the worksheet to import, either by index or by name.
Index. Specify the index value for the worksheet you want to import, beginning with 0 for the first worksheet, 1 for the second worksheet, and so on.
Name. Specify the name of the worksheet you want to import. Click the ellipsis button (...) to choose from the list of available worksheets.
Data range. You can import data beginning with the first non-blank row or with an explicit range of cells.
First non-blank row. Locates the first non-blank cell and uses this as the upper left corner of the data range. If another blank row is encountered, you can choose whether to stop reading or choose Return blank rows to continue reading all data to the end of the worksheet, including blank rows.
Explicit range. Allows you to specify an explicit range by row or column (for example, A3:G178). All rows in the specified range are returned, including blank rows.
First row contains field names. Indicates that the first row in the specified range should be used as field (column) names. If not selected, field names are generated automatically.


Field Storage and Type

When reading values from Excel, fields with numeric storage are read in as ranges by default, and string fields are read in as sets. You can manually change the type (range versus set) on the Type tab, but the storage is determined automatically (although it can be changed using a conversion function, such as to_integer, in a Filler node or Derive node if necessary). For more information, see Setting Field Storage and Formatting on p. 20. By default, fields with a mix of numeric and string values read in as numbers, which means that any string values will be set to null (system missing) values in Clementine. This happens because, unlike Excel, Clementine does not allow mixed storage types within a field. To avoid this, you can manually set the cell format to Text in the Excel spreadsheet, which causes all values (including numbers) to read in as strings.
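As a brief sketch of the conversion approach described above, a CLEM expression along the following lines could be used in a Derive node (or as the replacement expression in a Filler node) to force integer storage; the field name Quantity is hypothetical, and the exact expression depends on your data:

to_integer(Quantity)

Conversely, to_string(Quantity) would force string storage. Which conversion is appropriate depends on how the mixed column should ultimately be treated downstream.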

User Input Node


The User Input node provides an easy way for you to create synthetic data, either from scratch or by altering existing data. This is useful, for example, when you want to create a test dataset for modeling.
Creating Data from Scratch

The User Input node is available from the Sources palette and can be added directly to the stream canvas.
E Click the Sources tab of the nodes palette.
E Drag and drop or double-click to add the User Input node to the stream canvas.
E Double-click to open its dialog box and specify fields and values.

Note: User Input nodes that are selected from the Sources palette will be completely blank, with no fields and no data information. This enables you to create synthetic data entirely from scratch.
Generating Data from an Existing Data Source

You can also generate a User Input node from any nonterminal node in the stream:
E Decide at which point in the stream you want to replace a node.
E Right-click on the node that will feed its data into the User Input node and select Generate User Input Node from the menu.
E The User Input node appears with all downstream processes attached to it, replacing the existing node at that point in your data stream. When generated, the node inherits all of the data structure and field type information (if available) from the metadata.

Note: If data have not been run through all nodes in the stream, then the nodes are not fully instantiated, meaning that storage and data values may not be available when replacing with a User Input node.

Figure 2-17 Generated User Input node dialog box for a newly generated node

Setting Options for the User Input Node


The dialog box for a User Input node contains several tools you can use to enter values and define the data structure for synthetic data. For a generated node, the table on the Data tab contains field names from the original data source. For a node added from the Sources palette, the table is blank. Using the table options, you can perform the following tasks:
Add new fields using the Add a New Field button at the right in the table.
Rename existing fields.
Specify data storage for each field.
Specify values.
Change the order of fields on the display.
Entering Data

For each field, you can specify values or insert values from the original dataset using the value picker button to the right of the table. See the rules described below for more information on specifying values. You can also choose to leave the field blank; fields left blank are filled with the system null ($null$).
Generate data. Enables you to specify how the records are generated when you execute the stream.


All combinations. Generates records containing every possible combination of the field values, so each field value will appear in several records. This can sometimes generate more data than is wanted, so often you might follow this node with a Sample node.
In order. Generates records in the order in which the data field values are specified. Each field value only appears in one record. The total number of records is equal to the largest number of values for a single field. Where fields have fewer than the largest number, undefined ($null$) values are inserted.
For example, the following entries will generate the records listed in the tables below.
Age. 30,60,10
BP. LOW
Cholesterol. NORMAL HIGH
Drug. (left blank)
Generate data set to All combinations:

Age    BP     Cholesterol    Drug
30     LOW    NORMAL         $null$
30     LOW    HIGH           $null$
40     LOW    NORMAL         $null$
40     LOW    HIGH           $null$
50     LOW    NORMAL         $null$
50     LOW    HIGH           $null$
60     LOW    NORMAL         $null$
60     LOW    HIGH           $null$

Generate data set to In order:

Age    BP       Cholesterol    Drug
30     LOW      NORMAL         $null$
40     $null$   HIGH           $null$
50     $null$   $null$         $null$
60     $null$   $null$         $null$

Data Storage

Storage describes the way data are stored in a field. For example, a field with values of 1 and 0 stores integer data. This is distinct from the data type, which describes the usage of the data in Clementine and does not affect storage. For example, you may want to set the type for an integer field with values of 1 and 0 to flag. This usually indicates that 1 = True and 0 = False. While storage must be determined at the source, data type can be changed using a Type node at any point in the stream. For more information, see Data Types in Chapter 4 on p. 71.


Available storage types are:


String. Used for fields that contain non-numeric data, also called alphanumeric data. A string can include any sequence of characters, such as fred, Class 2, or 1234. Note that numbers in strings cannot be used in calculations.
Integer. A field whose values are integers.
Real. Values are numbers that may include decimals (not limited to integers). The display format is specified in the Stream Options dialog box and can be overridden for individual fields in a Type node (Format tab). For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Time. Time measured as a duration. For example, a service call lasting 1 hour, 26 minutes, and 38 seconds might be represented as 01:26:38, depending on the current time format as specified in the Stream Options dialog box.
Timestamp. Time values that indicate a specific hour of the day rather than a duration. For example, a service call beginning at exactly 9:04 A.M. could be logged as 09:04:00, again depending on the current time format.
Date. Date values specified in a standard format such as year, month, and day (for example, 2005-09-26). The specific format is specified in the Stream Options dialog box.
Storage conversions. You can convert storage for a field using a variety of conversion functions, such as to_string and to_integer, in a Filler node. For more information, see Storage Conversion Using the Filler Node in Chapter 4 on p. 100. Note that conversion functions (and any other functions that require a specific type of input such as a date or time value) depend on the current formats specified in the Stream Options dialog box. For example, if you want to convert a string field with values Jan 2003, Feb 2003, etc. to date storage, select MON YYYY as the default date format for the stream. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide. Conversion functions are also available from the Derive node, for temporary conversion during a derive calculation. You can also use the Derive node to perform other manipulations, such as recoding string fields with discrete values. For more information, see Recoding Values with the Derive Node in Chapter 4 on p. 97.
Reading in mixed data. Note that when reading in fields with numeric storage (either integer, real, time, timestamp, or date), any non-numeric values are set to null or system missing. This is because, unlike some applications, Clementine does not allow mixed storage types within a field. To avoid this, any fields with mixed data should be read in as strings, either by changing the storage type in the source node or in the external application as necessary.
Note: Generated User Input nodes may already contain storage information garnered from the source node if instantiated. An uninstantiated node does not contain storage or usage type information.
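Returning to the date-conversion example mentioned under Storage conversions above, a minimal CLEM sketch would look as follows. It assumes a string field named StartMonth (a hypothetical name) containing values such as Jan 2003, and it assumes the stream's default date format has already been set to MON YYYY in the Stream Options dialog box; the Derive or Filler expression would then be:

to_date(StartMonth)

After this conversion the field has date storage, so date functions and date formatting options can be applied to it downstream.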

Figure 2-18 Specifying storage type for fields in a generated User Input node

Rules for Specifying Values

For symbolic fields, you should leave spaces between multiple values, such as:
HIGH MEDIUM LOW

For numeric fields, you can either enter multiple values in the same manner (listed with spaces between):
10 12 14 16 18 20

Or you can specify the same series of numbers by setting its limits (10, 20) and the steps in between (2). Using this method, you would type:
10,20,2

These two methods can be combined by embedding one within the other, such as:
1 5 7 10,20,2 21 23

This entry will produce the following values:


1 5 7 10 12 14 16 18 20 21 23

Dimensions Import Node


Dimensions source nodes import survey data based on the Dimensions Data Model used by market research software from SPSS. This format distinguishes case data (the actual responses to questions gathered during a survey) from the metadata that describes how the case data is collected and organized. Metadata consists of information such as question texts, variable names and descriptions, translations of the various texts, and the definition of the structure of the case data. When survey data is imported into Clementine, questions are rendered as fields, with a record for each respondent.
Note: This node requires Dimensions Data Model version 3.0 or higher, which is distributed along with Dimensions software products from SPSS. For more information, see the Dimensions Web page (http://www.spss.com/dimensions/index.htm).
Comments

Survey data is read from the flat, tabular VDATA format only. Surveys that support only the hierarchical HDATA format cannot be imported.
Types are instantiated automatically by using information from the metadata.
Multiple response values are enclosed in braces and separated by commas, for example, {dinosaurs,fossils,botany}.

Dimensions Import File Options


The File tab in the Dimensions node allows you to specify options for the metadata and case data you want to import.
Figure 2-19 Dimensions node File options

Metadata Settings
Metadata Provider. Survey data can be imported from a number of formats as supported by SPSS Dimensions Data Model software. Available provider types include the following:
Dimensions Metadata (MDD). Reads metadata from a questionnaire definition file (.mdd). This is the standard Dimensions Data Model format.

ADO Database. Reads case data and metadata from ADO files. Specify the name and location of the .adoinfo file that contains the metadata. The internal name of this DSC is mrADODsc.
In2data Database. Reads In2data case data and metadata. The internal name of this DSC is mrI2dDsc.
Dimensions Log File. Reads metadata from a standard Dimensions log file. Typically, log files have a .tmp filename extension. However, some log files may have another filename extension. If necessary, you can rename the file so that it has a .tmp filename extension. The internal name of this DSC is mrLogDsc.
Quancept Definitions File. Converts metadata to Quancept script. Specify the name of the Quancept .qdi file. The internal name of this DSC is mrQdiDrsDsc.
Quanvert Database. Reads Quanvert case data and metadata. Specify the name and location of the .qvinfo or .pkd file. The internal name of this DSC is mrQvDsc.
Dimensions Participation Database. Reads a project's Sample and History Table tables and creates derived categorical variables corresponding to the columns in those tables. The internal name of this DSC is mrSampleReportingMDSC.
SPSS File. Reads case data and metadata from an SPSS .sav file. Writes case data to an SPSS .sav file for analysis in SPSS. Writes metadata from an SPSS .sav file to an .mdd file. The internal name of this DSC is mrSavDsc.
Surveycraft File. Reads SurveyCraft case data and metadata. Specify the name of the SurveyCraft .vq file. The internal name of this DSC is mrSCDsc.
Dimensions Scripting File. Reads from metadata in an mrScriptMetadata file. Typically, these files have an .mdd or .dms filename extension. The internal name of this DSC is mrScriptMDSC.
Metadata properties. Optionally, select Properties to specify the survey version to import as well

as the language, context, and label type to use. For more information, see Metadata Properties on p. 39.
Case Data Settings
Get Case Data Settings. When reading metadata from .mdd files only, click Get Case Data Settings to determine what case data sources are associated with the selected metadata, along with the specific settings needed to access a given source. This option is available only for .mdd files.
Case Data Provider. The following provider types are supported:
ADO Database. Reads case data using the Microsoft ADO interface. Select OLE-DB UDL for the case data type, and specify a connection string in the Case Data UDL field. For more information, see Database Connection String on p. 40. The internal name of this component is mrADODsc.
In2data Database. Reads case data and metadata from an In2data database (.i2d) file. The internal name is mrI2dDsc.
Dimensions Log File. Reads case data from a standard Dimensions log file. Typically, log files have a .tmp filename extension. However, some log files may have another filename extension. If necessary, you can rename the file so that it has a .tmp filename extension. The internal name is mrLogDsc.
Quantum data file. Reads case data from any Quantum-format ASCII file (.dat). The internal name is mrPunchDsc.
Quancept Data File. Reads case data from a Quancept .drs, .drz, or .dru file. The internal name is mrQdiDrsDsc.
Quanvert Database. Reads case data from a Quanvert .qvinfo or .pkd file. The internal name is mrQvDsc.
Dimensions Database (MS SQL Server). Reads case data from a relational Microsoft SQL Server database. For more information, see Database Connection String on p. 40. The internal name is mrRdbDsc2.
SPSS File. Reads case data from an SPSS .sav file. The internal name is mrSavDsc.
Surveycraft File. Reads case data from a SurveyCraft .qdt file. Both the .vq and .qdt files must be in the same directory, with read and write access for both files. This is not how they are created by default when using SurveyCraft, so one of the files needs to be moved to import SurveyCraft data. The internal name is mrScDsc.
Dimensions XML. Reads case data from a Dimensions XML data file. Typically, this format may be used to transfer case data from one location to another. The internal name is mrXmlDsc.
Case Data Type. Specifies whether case data is read from a file, folder, OLE-DB UDL, or ODBC DSN, and updates the dialog box options accordingly. Valid options depend on the type of provider. For database providers, you can specify options for the OLE-DB or ODBC connection. For more information, see Database Connection String on p. 40.
Case Data Project. When reading case data from a Dimensions database, you can enter the name of the project. For all other case data types, this setting should be left blank.

Metadata Properties
When importing Dimensions survey data, you can specify the survey version to import as well as the language, context, and label type to use. Note that only one language, context, and label type can be imported at a time.
Figure 2-20 Dimension Import Metadata Properties

Version. Each survey version can be regarded as a snapshot of the metadata used to collect a particular set of case data. As a questionnaire undergoes changes, multiple versions may be created. You can import the latest version, all versions, or a specific version.


All versions. Select this option if you want to use a combination (superset) of all of the available versions. (This is sometimes called a superversion.) When there is a conflict between the versions, the most recent versions generally take precedence over the older versions. For example, if a category label differs in any of the versions, the text in the latest version will be used.
Latest version. Select this option if you want to use the most recent version.
Specify version. Select this option if you want to use a particular survey version.

Choosing all versions is useful when, for example, you want to export case data for more than one version and there have been changes to the variable and category definitions that mean that case data collected with one version is not valid in another version. Selecting all of the versions for which you want to export the case data means that generally you can export the case data collected with the different versions at the same time without encountering validity errors due to the differences between the versions. However, depending on the version changes, some validity errors may still be encountered.
Language. Questions and associated text can be stored in multiple languages in the metadata. You can use the default language for the survey or specify a particular language. If an item is unavailable in the specified language, the default is used.
Context. Select the user context you want to use. The user context controls which texts are displayed. For example, select Question to display question texts or Analysis to display shorter texts suitable for displaying when analyzing the data.
Label type. Lists the types of labels that have been defined. The default is label, which is used for question texts in the Question user context and variable descriptions in the Analysis user context. Other label types can be defined for instructions, descriptions, etc.

Database Connection String


When using the Dimensions node to import case data from a database via OLE-DB or ODBC, select Edit from the File tab to access the Connection String dialog box, which allows you to customize the connection string passed to the provider in order to fine-tune the connection.
Figure 2-21 Connection String dialog box


Advanced Properties
When using the Dimensions node to import case data from a database that requires an explicit login, select Advanced to provide a user ID and password to access the data source.
Figure 2-22 Advanced Properties dialog box

Multiple Responses, Loops, and Grids


Figure 2-23 Multiple response question

Multiple responses can be imported into a single field, with values separated by commas. For example, responses to a question such as Which museums have you visited? might be read into a single museums field, as follows:
museums
museum_of_design,institute_of_textiles_and_fashion
museum_of_design
archeological_museum
$null$
national_art_gallery,national_museum_of_science,other

For purposes of analysis, you could use a Derive node to generate a separate flag field for each response with an expression such as:
hassubstring(museums,"museum_of_design")
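Extending that idea (a sketch only; the category names follow the sample values shown above), a single derived flag could also combine several responses, for example to mark respondents who visited either of two specific museums:

hassubstring(museums,"museum_of_design") or hassubstring(museums,"archeological_museum")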

Figure 2-24 Deriving a flag field

For more information, see Derive Node in Chapter 4 on p. 87. A number of additional functions are also supported. For more information, see String Functions in Chapter 8 in Clementine 11.1 User's Guide.
Looping Questions

For a looping question that may be asked multiple times, the number of fields depends on the number of loops. For example, a two-part question that could be asked up to six times would generate 12 fields.
Figure 2-25 Looping question with a fixed number of possible responses


Grid Questions
Figure 2-26 Grid question with numeric responses

Dimensions Column Import Notes


Columns from the Dimensions data are read into Clementine as summarized in the following table.
Dimensions Column Type                                    Clementine Storage              Measure
Boolean flag (yes/no)                                     String                          Flag (values 0 and 1)
Categorical                                               String                          Set
Date or time stamp                                        Timestamp                       Range
Double (floating point value within a specified range)   Real                            Range
Long (integer value within a specified range)            Integer                         Range
Text (free text description)                              String                          Typeless
Level (indicates grids or loops within a question)       Does not occur in VDATA and is not imported into Clementine
Object (binary data such as a facsimile showing scribbled text or a voice recording)     Not imported into Clementine
None (unknown type)                                       Not imported into Clementine
Respondent.Serial column (associates a unique ID with each respondent)                   Integer                         Typeless

To avoid possible inconsistencies between value labels read from metadata and actual values, all metadata values are converted to lower case. For example, the value label E1720_years would be converted to e1720_years. Multiple response values are enclosed in braces and separated by commas, for example, {dinosaurs,fossils,botany}.


Common Source Node Tabs


The following options can be specied for all source nodes by clicking the corresponding tab:
Data tab. Used to change the default storage type.
Types tab. Used to set data types. This tab offers the same functionality as the Type node.
Filter tab. Used to eliminate or rename data fields. This tab offers the same functionality as the Filter node. For more information, see Setting Filtering Options in Chapter 4 on p. 85.
Annotations tab. Used for all nodes in Clementine, this tab offers options to rename nodes, supply a custom ToolTip, and store a lengthy annotation. For more information, see Annotating Nodes and Streams in Chapter 5 in Clementine 11.1 User's Guide.

Setting Data Types in the Source Node


Field properties can be specified in a source node or in a separate Type node. The functionality is similar in both nodes. The following properties are available:
Type. Used to describe characteristics of the data in a given field. If all of the details of a field are known, it is called fully instantiated. The type of a field is different from the storage of a field, which indicates whether data are stored as strings, integers, real numbers, dates, times, or timestamps.
Labels. Double-click any field name to specify value and field labels for data in Clementine. For example, field metadata imported from SPSS can be viewed or modified in the Type node. Similarly, you can create new labels for fields and their values. The labels that you specify in the Type node are displayed throughout Clementine depending on the selections you make in the Stream Properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Direction. Used to tell Modeling nodes whether fields will be Input (predictor fields) or Output (predicted fields) for a machine-learning process. Both and None are also available directions, along with Partition, which indicates a field used to partition records into separate samples for training, testing, and validation. For more information, see Setting Field Direction in Chapter 4 on p. 80.
Missing values. Used to specify which values will be treated as blanks.
Value checking. In the Check column, you can set options to ensure that field values conform to the specified range.
Instantiation options. Using the Values column, you can specify options for reading data values from the dataset, or use the Specify option to open another dialog box for setting values. You can also choose to pass fields without reading their values. For more information, see Type Node in Chapter 4 on p. 70.

Figure 2-27 Types tab options

Note: For the SPSS Import node, variables imported from an SPSS version 8.0 or higher .sav file will be set to Ordered Set if they have the measurement level of ordinal in SPSS.

When to Instantiate at the Source Node


There are two ways you can learn about the data storage and values of your fields. This instantiation can occur at either the source node, when you first bring data into Clementine, or by inserting a Type node into the data stream. Instantiating at the source node is useful when:
The dataset is small.
You plan to derive new fields using the Expression Builder (instantiating makes field values available from the Expression Builder).
Generally, if your dataset is not very large and you do not plan to add fields later in the stream, instantiating at the source node is the most convenient method.

Filtering Fields from the Source Node


The Filter tab on a source node dialog box allows you to exclude fields from downstream operations based on your initial examination of the data. This is useful, for example, if there are duplicate fields in the data or if you are already familiar enough with the data to exclude irrelevant fields. Alternatively, you can add a separate Filter node later in the stream. The functionality is similar in both cases. For more information, see Setting Filtering Options in Chapter 4 on p. 85.

Figure 2-28 Filtering fields from the source node

Chapter 3
Record Operations Nodes

Overview of Record Operations


Record operations nodes are used to make changes to data at the record level. These operations are important during the Data Understanding and Data Preparation phases of data mining because they allow you to tailor the data to your particular business need. For example, based on the results of the data audit conducted using the Data Audit node (Output palette), you might decide that you would like customer purchase records for the past three months to be merged. Using a Merge node, you can merge records based on the values of a key field, such as Customer ID. Or you might discover that a database containing information about Web site hits is unmanageable with over one million records. Using a Sample node, you can select a subset of data for use in modeling. The Record Operations palette contains the following nodes:
The Select node selects or discards a subset of records from the data stream based on a specific condition. For example, you might select the records that pertain to a particular sales region. For more information, see Select Node on p. 48.
The Sample node trims the size of the dataset according to parameters that you set. It is useful for paring down a large dataset, selecting a random sample to generate a model, or training a neural network. For more information, see Sample Node on p. 49.
The Balance node corrects imbalances in a dataset, so it conforms to a specified condition. The balancing directive adjusts the proportion of records where a condition is true by the factor specified. For more information, see Balance Node on p. 50.
The Aggregate node replaces a sequence of input records with summarized, aggregated output records. For more information, see Aggregate Node on p. 52.
The Sort node sorts records into ascending or descending order based on the values of one or more fields. For more information, see Sort Node on p. 54.
The Merge node takes multiple input records and creates a single output record containing some or all of the input fields. It is useful for merging data from different sources, such as internal customer data and purchased demographic data. For more information, see Merge Node on p. 56.



The Distinct node removes duplicate records, either by passing the first distinct record to the data stream or by discarding the first record and passing any duplicates to the data stream instead. For more information, see Distinct Node on p. 66.
The Append node concatenates sets of records. It is useful for combining datasets with similar structures but different data. For more information, see Append Node on p. 65.

Many of the nodes in the Record Operations palette require you to use a CLEM expression. If you are familiar with CLEM (Clementine Language for Expression Manipulation), you can type an expression in the field. However, all expression fields provide a button that opens the CLEM Expression Builder, which helps you create such expressions automatically. For more information, see The Expression Builder in Chapter 7 in Clementine 11.1 User's Guide.
Figure 3-1 Expression Builder button

Select Node
You can use Select nodes to select or discard a subset of records from the data stream based on a specific condition, such as BP (blood pressure) = "HIGH".
Figure 3-2 Select node dialog box

Mode. Specifies whether records that meet the condition will be included or excluded from the data stream.
Include. Select to include records that meet the selection condition.
Discard. Select to exclude records that meet the selection condition.
Condition. Displays the selection condition that will be used to test each record, which you specify using a CLEM expression. Either enter an expression in the window or use the Expression Builder by clicking the calculator (Expression Builder) button to the right of the window.


Select nodes are also used to choose a proportion of records. Typically, you would use a different node, the Sample node, for this operation. However, if the condition you want to specify is more complex than the parameters provided, you can create your own condition using the Select node. For example, you can create a condition such as:
BP = "HIGH" and random(10) <= 4

This will select approximately 40% of the records showing high blood pressure and pass those records downstream for further analysis.
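As a further sketch of the kind of condition that goes beyond the Sample node's built-in options, the following expression (with hypothetical percentages, reusing the BP field from the example above) keeps different proportions of records for different blood-pressure groups:

(BP = "HIGH" and random(10) <= 4) or (BP /= "HIGH" and random(10) <= 1)

With Mode set to Include, this passes roughly 40% of the high-blood-pressure records and roughly 10% of the remaining records downstream.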

Sample Node
You can use Sample nodes to specify a limit on the number of records passed to the data stream or to specify a proportion of records to discard. You may want to sample the original data for a variety of reasons, such as:
Increasing the performance of the data mining tool.
Paring down a large dataset, such as one with millions of records. Using Sample nodes, you can pass a random sample to generate a model that is usually as accurate as one derived from the full dataset.
Training a neural network. You should reserve a sample for training and a sample for testing.

Setting Options for the Sample Node


Figure 3-3 Sample node dialog box

Mode. Select whether to pass (include) or discard (exclude) records for the following modes:
Include sample. Select to include in the data stream the sample that you specify below. For example, if you set the mode to Include sample and set the 1-in-n option to 5, then every fifth record will be included in the data stream up to the maximum sample size.
Discard sample. Select to exclude the sample that you specify from the data stream. For example, if you set the mode to Discard sample and set the 1-in-n option to 5, then every fifth record will be discarded (excluded) from the data stream.


Sample. Select the method of sampling from the following options:
First. Select to use contiguous data sampling. For example, if the maximum sample size is set to 10000, then the first 10,000 records will either be passed on to the data stream (if the mode is Include sample) or discarded (if the mode is Discard sample).
1-in-n. Select to sample data by passing or discarding every nth record. For example, if n is set to 5, then every fifth record will either be passed to the data stream or discarded, depending on the mode selected.
Random %. Select to sample a random percentage of the data. For example, if you set the percentage to 20, then 20% of the data will either be passed to the data stream or discarded, depending on the mode selected. Use the field to specify a sampling percentage. You can also specify a seed value using the Set random seed control.
Maximum sample size. Specify the largest sample to be included or discarded from the data stream. This option is redundant and therefore disabled when First and Include are selected above. Also note the interaction between this setting and the Random % option (see below).
Important: When used in combination with the Random % option, the Maximum sample size setting may prevent certain records from being selected. For example, if you have 10 million records in your dataset, and you select 50% of records with a maximum sample size of 3 million records, then only the first 6 million records have a 50% chance of being selected. The remaining 4 million records are ignored (meaning they have no chance of being selected). For this reason, these two settings should not be combined in cases where it is essential that all records have an equal chance of being selected.
Set random seed. When sampling or partitioning records based on a random percentage, this

option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value, or click the Generate button to automatically generate a random value. If this option is not selected, a different sample will be generated each time the node is executed. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node on p. 54.

Balance Node
You can use Balance nodes to correct imbalances in datasets so they conform to specified test criteria. For example, suppose that a dataset has only two values, low or high, and that 90% of the cases are low while only 10% of the cases are high. Many modeling techniques have trouble with such biased data because they will tend to learn only the low outcome and ignore the high one, since it is more rare. If the data are well balanced with approximately equal numbers of low and high outcomes, models will have a better chance of finding patterns that distinguish the two groups. In this case, a Balance node is useful for creating a balancing directive that reduces cases with a low outcome.


Balancing is carried out by duplicating and then discarding records based on the conditions you specify. Records for which no condition holds are always passed through. Because this process works by duplicating and/or discarding records, the original sequence of your data is lost in downstream operations. Be sure to derive any sequence-related values before adding a Balance node to the data stream. Note: Balance nodes can be generated automatically from distribution charts and histograms.

Setting Options for the Balance Node


Figure 3-4 Balance node dialog box

Record balancing directives. Lists the current balancing directives. Each directive includes both a factor and a condition that tells the software to increase the proportion of records by the factor specified where the condition is true. A factor lower than 1.0 means that the proportion of indicated records will be decreased. For example, if you want to decrease the number of records where drug Y is the treatment drug, you might create a balancing directive with a factor of 0.7 and a condition Drug = "drugY". This directive means that the number of records where drug Y is the treatment drug will be reduced to 70% for all downstream operations.

Note: Balance factors for reduction may be specified to four decimal places. Factors set below 0.0001 will result in an error, since the results do not compute correctly.
Create conditions by clicking the button to the right of the text field. This inserts an empty row for entering new conditions. To create a CLEM expression for the condition, click the Expression Builder button.
Delete directives using the red delete button. Sort directives using the up and down arrow buttons.


Aggregate Node
Aggregation is a data preparation task frequently used to reduce the size of a dataset. Before proceeding with aggregation, you should take time to clean the data, concentrating especially on missing values. Once you have aggregated, potentially useful information regarding missing values may be lost. For more information, see Overview of Missing Values in Chapter 6 in Clementine 11.1 User's Guide. You can use an Aggregate node to replace a sequence of input records with summary, aggregated output records. For example, you might have a set of input records such as:
Age   Sex   Region   Branch   Sales
23    M     S        8        4
45    M     S        16       4
37    M     S        8        5
30    M     S        5        7
44    M     N        4        9
25    M     N        2        11
29    F     S        16       6
41    F     N        4        8
23    F     N        6        2
45    F     N        4        5
33    F     N        6        10

You can aggregate these records with Sex and Region as key fields. Then choose to aggregate Age with the mode Mean and Sales with the mode Sum. Select Include record count in field in the Aggregate node dialog box and your aggregated output would be:
Age     Sex   Region   Sales   RECORD_COUNT
35.5    F     N        25      4
34.5    M     N        20      2
29      F     S        6       1
33.75   M     S        20      4

Note: Fields such as Branch are automatically discarded when no aggregate mode is specified.

Figure 3-5 Aggregate node dialog box

Setting Options for the Aggregate Node


Key fields. Lists fields that can be used as keys for aggregation. Both numeric and symbolic fields can be used as keys. If you choose more than one key field, the values will be combined to produce a key value for aggregating records. One aggregated record will be generated for each unique combination of key field values. For example, if Sex and Region are your key fields, each unique combination of M and F with regions N and S (four unique combinations) will have an aggregated record. To add a key field, use the Field Chooser button to the right of the window.
Keys are contiguous. Select to treat the values for the key fields as equal if they occur in adjacent records.
Aggregate fields. Lists the numeric fields whose values will be aggregated as well as the selected modes of aggregation. To add fields to this list, use the Field Chooser button on the right.
Default mode. Specify the default aggregation mode to be used for newly added fields. If you frequently use the same aggregation, select one or more modes here and use the Apply to All button on the right to apply the selected modes to all fields listed above. The following aggregation modes are available in Clementine:
Sum. Select to return summed values for each key field combination.
Mean. Select to return the mean values for each key field combination.
Min. Select to return minimum values for each key field combination.


Max. Select to return maximum values for each key field combination.
SDev. Select to return the standard deviation for each key field combination.
New field name extension. Select to add a suffix or prefix, such as 1 or new, to duplicate aggregated fields. For example, the result of a minimum values aggregation on the field Age will produce a field name called Age_Min_1 if you have selected the suffix option and specified 1 as the extension. Note: Aggregation extensions such as _Min or Max_ are automatically added to the new field, indicating the type of aggregation performed. Select Suffix or Prefix to indicate your preferred extension style.
Include record count in field. Select to include an extra field in each output record called Record_Count, by default. This field indicates how many input records were aggregated to form each aggregate record. Create a custom name for this field by typing in the edit field.
Note: System null values are excluded when aggregates are computed, but they are included in the record count. Blank values, on the other hand, are included in both aggregation and record count. To exclude blank values, you can use a Filler node to replace blanks with null values. You can also remove blanks using a Select node.
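A minimal sketch of the Filler approach mentioned in the note above (the choice of fields to fill is up to you): in a Filler node, fill the relevant fields on the condition

@BLANK(@FIELD)

and use undef as the replacement value, which writes the system null ($null$) in place of each blank so that those values no longer contribute to the aggregated results.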
Performance

Aggregation operations may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 User's Guide.

Sort Node
You can use Sort nodes to sort records into ascending or descending order based on the values of one or more fields. For example, Sort nodes are frequently used to view and select records with the most common data values. Typically, you would first aggregate the data using the Aggregate node and then use the Sort node to sort the aggregated data into descending order of record counts. Displaying these results in a table will allow you to explore the data and to make decisions, such as selecting the records of the top 10 best customers.

Figure 3-6 Sort node dialog box

Sort by. All fields selected to use as sort keys are displayed in a table. A key field works best for sorting when it is numeric.
Add fields to this list using the Field Chooser button on the right.
Select an order by clicking the Ascending or Descending arrow in the table's Order column.
Delete fields using the red delete button.
Sort directives using the up and down arrow buttons.
Default sort order. Select either Ascending or Descending to use as the default sort order when new fields are added above.

Sort Optimization Settings


If you are working with data you know are already sorted by some key fields, you can specify which fields are already sorted, allowing the system to sort the rest of the data more efficiently. For example, you want to sort by Age (descending) and Drug (ascending) but know your data are already sorted by Age (descending).

Figure 3-7 Optimization settings

Data is presorted. Specifies whether the data are already sorted by one or more fields.
Specify existing sort order. Specify the fields that are already sorted. Using the Select Fields dialog box, add fields to the list. In the Order column, specify whether each field is sorted in ascending or descending order. If you are specifying multiple fields, make sure that you list them in the correct sorting order. Use the arrows to the right of the list to arrange the fields in the correct order. If you make a mistake in specifying the correct existing sort order, an error will appear when you execute the stream, displaying the record number where the sorting is inconsistent with what you specified.
Note: Sorting speed may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 User's Guide.

Merge Node
The function of a Merge node is to take multiple input records and create a single output record containing all or some of the input fields. This is a useful operation when you want to merge data from different sources, such as internal customer data and purchased demographic data. There are two ways to merge data in Clementine:
Merge by order concatenates corresponding records from all sources in the order of input until the smallest data source is exhausted. It is important if using this option that you have sorted your data using a Sort node.
Merge using a key field, such as Customer ID, to specify how to match records from one data source with records from the other(s). Several types of joins are possible in Clementine, including inner join, full outer join, partial outer join, and anti-join. For more information, see Types of Joins on p. 57.


Types of Joins
When using a key field for data merging, it is useful to spend some time thinking about which records will be excluded and which will be included. Clementine offers a variety of joins, which are discussed in detail below. The two basic types of joins are referred to as inner and outer joins. These methods are frequently used to merge tables from related datasets based on common values of a key field, such as Customer ID. Inner joins allow for clean merging and an output dataset that includes only complete records. Outer joins also include complete records from the merged data, but they also allow you to include unique data from one or more input tables. The types of joins allowed in Clementine are described in greater detail below.

An inner join includes only records in which a value for the key field is common to all input tables. That is, unmatched records will not be included in the output dataset.

A full outer join includes all records, both matching and nonmatching, from the input tables. Left and right outer joins are referred to as partial outer joins and are described below.

A partial outer join includes all records matched using the key field as well as unmatched records from specified tables. (Or, to put it another way, all records from some tables and only matching records from others.) Tables (such as A and B shown here) can be selected for inclusion in the outer join using the Select button on the Merge tab. Partial joins are also called left or right outer joins when only two tables are being merged. Since Clementine allows the merging of more than two tables, we refer to this as a partial outer join.

An anti-join includes only unmatched records for the first input table (Table A shown here). This type of join is the opposite of an inner join and does not include complete records in the output dataset.

For example, if you have information about farms in one dataset, and farm-related insurance claims in another, you can match the records from the first source to the second source using the Merge options.


To determine if a customer in your farm sample has filed an insurance claim, use the inner join option to return a list showing where all IDs match from the two samples.
Figure 3-8 Sample output for an inner join merge

Using the full outer join option returns both matching and nonmatching records from the input tables. The system-missing value ($null$) will be used for any incomplete values.
Figure 3-9 Sample output for a full outer join merge

A partial outer join includes all records matched using the key field as well as unmatched records from specified tables. The table displays all of the records matched on the ID field as well as the unmatched records retained from the first dataset.
Figure 3-10 Sample output for a partial outer join merge

If you are using the anti-join option, the table returns only unmatched records for the first input table.
Figure 3-11 Sample output for an anti-join merge


Specifying a Merge Method and Keys


Figure 3-12 Using the Merge tab to set merge method options

Merge Method. Select either Order or Keys to specify the method of merging records. Selecting
Keys activates the bottom half of the dialog box.

Order. Merges records by order such that the nth record from each input is merged to produce the nth output record. When any input runs out of matching records, no more output records are produced. This means that the number of records created is the number of records in the smallest dataset.
Keys. Uses a key field, such as Transaction ID, to merge records with the same value in the key field. This is equivalent to a database equi-join. If a key value occurs more than once, all possible combinations are returned. For example, if records with the same key field value A contain differing values B, C, and D in other fields, the merged fields will produce a separate record for each combination of A with value B, A with value C, and A with value D. Note: Null values are not considered identical in the merge-by-key method and will not join.
Possible keys. Lists all fields found in all input data sources. Select a field from this list and use the arrow button to add it as a key field used for merging records. More than one key field may be used.
Keys for merge. Lists all fields used to merge records from all input data sources based on values of the key fields. To remove a key from the list, select one and use the arrow button to return it to the Possible Keys list. When more than one key field is selected, the option below is enabled.
Combine duplicate key fields. When more than one key field is selected above, this option ensures that there is only one output field of that name. This option is enabled by default except in the case when streams have been imported from earlier versions of Clementine. When this option is disabled, duplicate key fields must be renamed or excluded using the Filter tab in the Merge node dialog box.
Include only matching records (inner join). Select to merge only complete records.
Include matching and non-matching records (full outer join). Select to perform a full outer join. This means that if values for the key field are not present in all input tables, the incomplete records are still retained. The undefined value ($null$) is added to the key field and included in the output record.
Include matching and selected non-matching records (partial outer join). Select to perform a partial outer join of the tables you select in a subdialog box. Click Select to specify tables for which incomplete records will be retained in the merge.
Include records in the first dataset not matching any others (anti-join). Select to perform a type of anti-join, where only nonmatching records from the first dataset are passed downstream. You can specify the order of input datasets using arrows on the Inputs tab. This type of join does not include complete records in the output dataset. For more information, see Types of Joins on p. 57.

Selecting Data for Partial Joins


For a partial outer join, you must select the table(s) for which incomplete records will be retained. For example, you may want to retain all records from a Customer table while retaining only matched records from the Mortgage Loan table.
Figure 3-13 Selecting data for a partial outer join

Outer Join column. In the Outer Join column, select datasets to include in their entirety. For a partial join, overlapping records will be retained as well as incomplete records for datasets selected here. For more information, see Types of Joins on p. 57.

Filtering Fields from the Merge Node


Merge nodes include a convenient way of filtering or renaming duplicate fields as a result of merging multiple data sources. Click the Filter tab in the dialog box to select filtering options.

Figure 3-14 Filtering from the Merge node

The options presented here are nearly identical to those for the Filter node. There are, however, additional options not discussed here that are available on the Filter menu. For more information, see Filter Node in Chapter 4 on p. 84.
Field. Displays the input fields from currently connected data sources.

Tag. Lists the tag name (or number) associated with the data source link. Click the Inputs tab to alter active links to this Merge node.


Source node. Displays the source node whose data is being merged.

Connected node. Displays the node name for the node that is connected to the Merge node. Frequently, complex data mining requires several merge or append operations that may include the same source node. The connected node name provides a way of differentiating these.

Filter. Displays the current connections between input and output fields. Active connections show an unbroken arrow. Connections with a red X indicate filtered fields.


Field. Lists the output fields after merging or appending. Duplicate fields are displayed in red. Click in the Filter field above to disable duplicate fields.


View current fields. Select to view information on fields selected to be used as key fields.

View unused field settings. Select to view information on fields that are not currently in use.

Setting Input Order and Tagging


Using the Inputs tab in the Merge and Append node dialog boxes, you can specify the order of input data sources and make any changes to the tag name for each source.

Figure 3-15 Using the Inputs tab to specify tags and input order

Tags and order of input datasets. Lists the input data sources in the order in which they will be merged or appended, along with the tag assigned to each.

Tag. Lists current tag names for each input data source. Tag names, or tags, are a way of uniquely identifying the data links for the merge or append operation. For example, imagine water from various pipes that is combined at one point and flows through a single pipe. Data in Clementine flows similarly, and the merging point is often a complex interaction between the various data sources. Tags provide a way of managing the inputs (pipes) to a Merge or Append node so that if the node is saved or disconnected, the links remain and are easily identifiable. When you connect additional data sources to a Merge or Append node, default tags are automatically created using numbers to represent the order in which you connected the nodes. This order is unrelated to the order of fields in the input or output datasets. You can change the default tag by entering a new name in the Tag column.
Source Node. Displays the source node whose data is being combined.

Connected Node. Displays the node name for the node that is connected to the Merge or Append node. Frequently, complex data mining requires several merge operations that may include the same source node. The connected node name provides a way of differentiating these.
Fields. Lists the number of fields in each data source.

View current tags. Select to view tags that are actively being used by the Merge or Append node. In other words, current tags identify links to the node that have data flowing through. Using the pipe metaphor, current tags are analogous to pipes with existing water flow.


View unused tag settings. Select to view tags, or links, that were previously used to connect to the Merge or Append node but are not currently connected with a data source. This is analogous to empty pipes still intact within a plumbing system. You can choose to connect these pipes to a new source or remove them. To remove unused tags from the node, click Clear. This clears all unused tags at once.
Figure 3-16 Removing unused tags from the Merge node

Merge Optimization Settings


The system provides two options that can help you merge data more efficiently in certain situations. These options allow you to optimize merging when one input dataset is significantly larger than the other datasets or when your data are already sorted by all or some of the key fields that you are using for the merge.

Figure 3-17 Optimization settings

One input dataset is relatively large. Select to indicate that one of the input datasets is much larger than the others. The system will cache the smaller datasets in memory and then perform the merge by processing the large dataset without caching or sorting it. You will commonly use this type of join with data designed using a star-schema or similar design, where there is a large central table of shared data (for example, in transactional data). If you select this option, click Select to specify the large dataset. Note that you can select only one large dataset. The following table summarizes which joins can be optimized using this method.
Type of Join    Can be optimized for a large input dataset?
Inner           Yes
Partial         Yes, if there are no incomplete records in the large dataset.
Full            No
Anti-join       Yes, if the large dataset is the first input.
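The idea behind this option can be sketched in plain Python (illustrative only; this is not how Clementine is implemented). The small input is cached in a dictionary keyed by the merge key, and the large input is processed one record at a time without being cached or sorted:

# Illustrative only: cache the small input in memory, stream the large one.
from typing import Dict, Iterable, Iterator

def streamed_inner_join(large: Iterable[dict], small: Iterable[dict],
                        key: str) -> Iterator[dict]:
    # Build an in-memory lookup for the small dataset (this sketch assumes
    # each key value occurs only once in the small input).
    lookup: Dict[object, dict] = {row[key]: row for row in small}
    for row in large:  # the large dataset is never cached or sorted
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}

facts = [{"ID": 1, "amount": 10}, {"ID": 2, "amount": 20}, {"ID": 9, "amount": 5}]
dims = [{"ID": 1, "region": "N"}, {"ID": 2, "region": "S"}]
print(list(streamed_inner_join(facts, dims, "ID")))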

All inputs are already sorted by key field(s). Select to indicate that the input data are already sorted by one or more of the key fields that you are using for the merge. Make sure all your input datasets are sorted.
Specify existing sort order. Specify the fields that are already sorted. Using the Select Fields dialog box, add fields to the list. You can select from only the key fields that are being used for the merge (specified in the Merge tab). In the Order column, specify whether each field is sorted in ascending or descending order. If you are specifying multiple fields, make sure that you list them in the correct sorting order. Use the arrows to the right of the list to arrange the fields in the correct order. If you make a mistake in specifying the correct existing sort order, an error will appear when you execute the stream, displaying the record number where the sorting is inconsistent with what you specified. Note: Merging speed may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 User's Guide.

Append Node
You can use Append nodes to concatenate sets of records. Unlike Merge nodes, which join records from different sources together, Append nodes read and pass downstream all of the records from one source until there are no more. Then the records from the next source are read using the same data structure (number of fields, field names, and so on) as the first, or primary, input. When the primary source has more fields than another input source, the system null value ($null$) will be used for any incomplete values.

Append nodes are useful for combining datasets with similar structures but different data. For example, you might have transaction data stored in different files for different time periods, such as a sales data file for March and a separate one for April. Assuming that they have the same structure (the same fields in the same order), the Append node will join them together into one large file, which you can then analyze. Note: In order to append files, the field types must be similar. For example, a field typed as a Set field cannot be appended with a field typed as Real Range.
Figure 3-18 Append node dialog box showing field matching by name

Setting Append Options


Match fields by. Select a method to use when matching fields to append.


Position. Select to append datasets based on the position of fields in the main data source. When using this method, your data should be sorted to ensure proper appending.
Name. Select to append datasets based on the name of fields in the input datasets. Also select Match case to enable case sensitivity when matching field names.

Output Field. Lists the source nodes that are connected to the Append node. The first node on the list is the primary input source. You can sort the fields in the display by clicking on the column heading. This sorting does not actually reorder the fields in the dataset.
Include fields from. Select Main dataset only to produce output fields based on the fields in the main dataset. The main dataset is the first input, specified on the Inputs tab. Select All datasets to produce output fields for all fields in all datasets regardless of whether there is a matching field across all input datasets.
Tag records by including source dataset in field. Select to add an additional field to the output file whose values indicate the source dataset for each record. Specify a name in the text field. The default field name is Input.
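For comparison with code, the append behavior described above can be imitated with pandas (illustrative only; the field names and data are invented, and NaN stands in for Clementine's $null$):

# Illustrative only: appending datasets with similar structures.
import pandas as pd

march = pd.DataFrame({"ID": [1, 2], "Sales": [100, 150]})
april = pd.DataFrame({"ID": [3, 4], "Sales": [90, 120], "Promo": ["Y", "N"]})

# "All datasets": keep every field; gaps are filled with missing values.
all_fields = pd.concat([march, april], ignore_index=True)

# "Main dataset only": keep only the fields of the first (primary) input.
main_only = pd.concat([march, april], ignore_index=True)[march.columns]

# "Tag records by including source dataset in field": add an Input field.
tagged = pd.concat([march.assign(Input="march"), april.assign(Input="april")],
                   ignore_index=True)

print(all_fields, main_only, tagged, sep="\n\n")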

Distinct Node
You can use Distinct nodes to remove duplicate records either by passing the first distinct record to the data stream or by discarding the first record and passing any duplicates to the data stream instead. This operation is useful when you want to have a single record for each item in the data, such as customers, accounts, or products. For example, Distinct nodes can be helpful in finding duplicate records in a customer database or in getting an index of all of the product IDs in your database.
Figure 3-19 Distinct node dialog box

Mode. Specify whether to include or exclude (discard) the first record.

Include. Select to include the first distinct record in the data stream.

Discard. Select to discard the first distinct record found and pass any duplicate records to the data stream instead. This option is useful for finding duplicates in your data so that you can examine them later in the stream.


Fields. Lists fields used to determine whether records are identical. Add fields to this list using the Field Chooser button on the right. Delete fields using the red delete button.
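To see the two modes side by side in code, here is a pandas sketch (illustrative only; the field names are invented and this is not part of Clementine):

# Illustrative only: the Distinct node's Include and Discard modes.
import pandas as pd

df = pd.DataFrame({"CustomerID": [1, 1, 2, 3, 3, 3],
                   "Amount":     [10, 12, 20, 5, 7, 9]})

# Include: pass the first distinct record for each key downstream.
include_first = df.drop_duplicates(subset=["CustomerID"], keep="first")

# Discard: drop the first record for each key and pass only the duplicates,
# which is useful for examining them later in the stream.
discard_first = df[df.duplicated(subset=["CustomerID"], keep="first")]

print(include_first, discard_first, sep="\n\n")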

Chapter 4

Field Operations Nodes

Field Operations Overview


After an initial data exploration, you will probably have to select, clean, or construct data in preparation for analysis. The Field Operations palette contains many nodes useful for this transformation and preparation. For example, using a Derive node, you might create an attribute that is not currently represented in the data. Or you might use a Binning node to recode field values automatically for targeted analysis. You will probably find yourself using a Type node frequently; it allows you to assign a data type, values, and a modeling role for each field in the dataset. Its operations are useful for handling missing values and downstream modeling. The Field Operations palette contains the following nodes:
The Type node specifies field metadata and properties. For example, you can specify a usage type (range, set, ordered set, or flag) for each field, set options for handling missing values and system nulls, set the role of a field for modeling purposes, specify field and value labels, and specify values for a field. For more information, see Type Node on p. 70.

The Filter node filters (discards) fields, renames fields, and maps fields from one source node to another. For more information, see Filter Node on p. 84.

The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, set, state, count, and conditional. For more information, see Derive Node on p. 87.

The Filler node replaces field values and changes storage. You can choose to replace values based on a CLEM condition, such as @BLANK(@FIELD). Alternatively, you can choose to replace all blanks or null values with a specific value. A Filler node is often used together with a Type node to replace missing values. For more information, see Filler Node on p. 98.

The Anonymize node transforms the way field names and values are represented downstream, thus disguising the original data. This can be useful if you want to allow other users to build models using sensitive data, such as customer names or other details.

The Reclassify node transforms one set of discrete values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For more information, see Reclassify Node on p. 105.



The Binning node automatically creates new set fields based on the values of one or more existing numeric range fields. For example, you can transform a scale income field into a new categorical field containing groups of income as deviations from the mean. Once you have created bins for the new field, you can generate a Derive node based on the cut points. For more information, see Binning Node on p. 109.

If you have SPSS installed and licensed on your computer, the SPSS Transform, or Data Preparation, node runs a selection of SPSS syntax commands against data sources in Clementine. For more information, see SPSS Transform Node on p. 151.

The Partition node generates a partition field, which splits the data into separate subsets for the training, testing, and validation stages of model building. For more information, see Partition Node on p. 119.

The Set to Flag node derives multiple flag fields based on the categorical values defined for one or more set fields. For more information, see Set to Flag Node on p. 121.

The Restructure node converts a set or flag field into a group of fields that can be populated with the values of yet another field. For example, given a field named payment type, with values of credit, cash, and debit, three new fields would be created (credit, cash, debit), each of which might contain the value of the actual payment made. For more information, see Restructure Node on p. 123.

The Transpose node swaps the data in rows and columns so that records become fields and fields become records. For more information, see Transpose Node on p. 125.

The Time Intervals node specifies intervals and creates labels (if needed) for modeling time series data. If values are not evenly spaced, the node can pad or aggregate values as needed to generate a uniform interval between records. For more information, see Time Intervals Node on p. 128.

The History node creates new fields containing data from fields in previous records. History nodes are most often used for sequential data, such as time series data. Before using a History node, you may want to sort the data using a Sort node. For more information, see History Node on p. 146.

The Field Reorder node defines the natural order used to display fields downstream. This order affects the display of fields in a variety of places, such as tables, lists, and the Field Chooser. This operation is useful when working with wide datasets to make fields of interest more visible. For more information, see Field Reorder Node on p. 148.

Several of these nodes can be generated directly from the audit report created by a Data Audit node. For more information, see Generating Other Nodes for Data Preparation in Chapter 17 on p. 554.


Type Node
Field properties can be specified in a source node or in a separate Type node. The functionality is similar in both nodes. The following properties are available:
Type. Used to describe characteristics of the data in a given field. If all of the details of a field are known, it is called fully instantiated. The type of a field is different from the storage of a field, which indicates whether data are stored as strings, integers, real numbers, dates, times, or timestamps.
Labels. Double-click any field name to specify value and field labels for data in Clementine. For example, field metadata imported from SPSS can be viewed or modified in the Type node. Similarly, you can create new labels for fields and their values. The labels that you specify in the Type node are displayed throughout Clementine depending on the selections you make in the Stream Properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
Direction. Used to tell Modeling nodes whether fields will be Input (predictor fields) or Output (predicted fields) for a machine-learning process. Both and None are also available directions, along with Partition, which indicates a field used to partition records into separate samples for training, testing, and validation. For more information, see Setting Field Direction on p. 80.
Missing values. Used to specify which values will be treated as blanks.

Value checking. In the Check column, you can set options to ensure that field values conform to the specified range.


Instantiation options. Using the Values column, you can specify options for reading data values from the dataset, or use the Specify option to open another dialog box for setting values. You can also choose to pass fields without reading their values.
Figure 4-1 Type node options


Several other options can be specified using the Type node window:

Using the tools menu button, you can choose to Ignore Unique Fields once a Type node has been instantiated (either through your specifications, reading values, or executing the stream). Ignoring unique fields will automatically ignore fields with only one value.

Using the tools menu button, you can choose to Ignore Large Sets once a Type node has been instantiated. Ignoring large sets will automatically ignore sets with a large number of members.

Using the tools menu button, you can generate a Filter node to discard selected fields.

Using the sunglasses toggle buttons, you can set the default for all fields to Read or Pass. The Types tab in the source node passes fields by default, while the Type node itself reads values by default.

Using the Clear Values button, you can clear changes to field values made in this node (non-inherited values) and reread values from upstream operations. This option is useful for resetting changes that you may have made for specific fields upstream.

Using the Clear All Values button, you can reset values for all fields read into the node. This option effectively sets the Values column to Read for all fields. This option is useful to reset values for all fields and reread values and types from upstream operations.

Using the context menu, you can choose to Copy attributes from one field to another. For more information, see Copying Type Attributes on p. 81.

Using the View unused field settings option, you can view type settings for fields that are no longer present in the data or were once connected to this Type node. This is useful when reusing a Type node for datasets that have changed.

Data Types
Data type describes the usage of the data fields in Clementine. It is frequently referred to as usage type and can be specified on the Types tab of a source or Type node. For example, you may want to set the type for an integer field with values of 1 and 0 to flag. This usually indicates that 1 = True and 0 = False.
Storage versus type. Note that the data type of a field is different from the storage of a field, which indicates whether data are stored as a string, integer, real number, date, time, or timestamp. While data types can be modified at any point in a stream using a Type node, storage must be determined at the source when reading data into Clementine (although it can subsequently be changed using a conversion function). For more information, see Setting Field Storage and Formatting in Chapter 2 on p. 20.

The following data types are available:
Range. Used to describe numeric values, such as a range of 0-100 or 0.75-1.25. A range value can be an integer, real number, or date/time.


Discrete. Used for string values when an exact number of distinct values is unknown. This is an uninstantiated data type, meaning that all possible information about the storage and usage of the data is not yet known. Once data have been read, the type will be flag, set, or typeless, depending on the maximum set size specified in the stream properties dialog box.


Flag. Used for data with two distinct values, such as Yes and No or 1 and 2. Data may be represented as text, integer, real number, or date/time. Note: Date/time refers to three types of storage: time, date, or timestamp.
Set. Used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. In this version of Clementine, a set can have any storage: numeric, string, or date/time. Note that setting a type to Set does not automatically change the values to string.
Ordered Set. Used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as an ordered set. The order of an ordered set in Clementine is defined by the natural sort order of its elements. For example, 1, 3, 5 is the default sort order for a set of integers, while HIGH, LOW, NORMAL (ascending alphabetically) is the order for a set of strings. The ordered set type enables you to define a set of categorical data as ordinal data for the purposes of visualization, model building (C5.0, C&R Tree, TwoStep), and export to other applications, such as SPSS, that recognize ordinal data as a distinct type. You can use an ordered set field anywhere that a set field can be used. Additionally, fields of any storage type (real, integer, string, date, time, and so on) can be defined as an ordered set. Note: When working with data from SPSS, variables defined as ordinal in SPSS version 8.0 or higher will be typed as an ordered set in Clementine. Similarly, when exporting data to SPSS, ordered sets will be retyped as ordinal in the exported .sav file.
Typeless. Used for data that does not conform to any of the above types or for set types with too many members. It is useful for cases in which the type would otherwise be a set with many members (such as an account number). When you select Typeless for a field, the direction is automatically set to None. The default maximum size for sets is 250 unique values. This number can be adjusted or disabled in the stream properties dialog box.

You can manually specify data types, or you can allow the software to read the data and determine the type based on the values that it reads.
To Use Auto-Typing
E In either a Type node or the Types tab of a source node, set the Values column to <Read> for the desired fields. This will make metadata available to all nodes downstream. You can quickly set all fields to <Read> or <Pass> using the sunglasses buttons on the dialog box.
E Click Read Values to read values from the data source immediately.

To Manually Set the Type for a Field


E Select a field in the table.
E From the drop-down list in the Type column, select a type for the field.
E Alternatively, you can use Ctrl-A or Ctrl-click to select multiple fields before using the drop-down list to select a type.

Figure 4-2 Manually setting types

What Is Instantiation?
Instantiation is the process of reading or specifying information, such as storage type and values for a data field. In order to optimize system resources, instantiating in Clementine is a user-directed process; you tell the software to read values by specifying options on the Types tab in a source node or by running data through a Type node.

Data with unknown types are also referred to as uninstantiated. Data whose storage type and values are unknown are displayed in the Type column of the Types tab as <Default>. When you have some information about a field's storage, such as string or numeric, the data are called partially instantiated. Discrete or Range are partially instantiated types. For example, Discrete specifies that the field is symbolic, but you don't know whether it is a set or a flag type. When all of the details about a type are known, including the values, a fully instantiated type (set, flag, or range) is displayed in this column. Note: The range type is used for both partially instantiated and fully instantiated data fields. Ranges can be either integers or real numbers.

During the execution of a data stream with a Type node, uninstantiated types immediately become partially instantiated, based on the initial data values. Once all of the data have passed through the node, all data become fully instantiated unless values were set to <Pass>. If execution is interrupted, the data will remain partially instantiated. Once the Types tab has been instantiated, the values of a field are static at that point in the stream. This means that any upstream changes will not affect the values of a particular field, even if you reexecute the stream. To change or update the values based on new data or added manipulations, you need to edit them in the Types tab itself or set the value for a field to <Read> or <Read +>.


When to Instantiate

Generally, if your dataset is not very large and you do not plan to add fields later in the stream, instantiating at the source node is the most convenient method. However, instantiating in a separate Type node is useful when:

The dataset is large, and the stream filters a subset prior to the Type node.
Data have been filtered in the stream.
Data have been merged or appended in the stream.
New data fields are derived during processing.

Data Values
Using the Values column of the data types table, you can read values automatically from the data, or you can specify types and values in a separate dialog box.
Figure 4-3 Selecting methods for reading, passing, or specifying data values

The options available from this drop-down list provide the following instructions for auto-typing:
Option      Function
<Read>      Data will be read when the node is executed.
<Read+>     Data will be read and appended to the current data (if any exist).
<Pass>      No data are read.
<Current>   Keep current data values.
Specify...  A separate dialog box is launched for you to specify values and type options.

Executing a Type node or clicking Read Values will auto-type and read values from your data source based on your selection. These values can also be specified manually using the Specify option or by double-clicking a cell in the Field column.


Once you have made changes for fields in the Type node, you can reset value information using the following buttons on the dialog box toolbar:

Using the Clear Values button, you can clear changes to field values made in this node (non-inherited values) and reread values from upstream operations. This option is useful for resetting changes that you may have made for specific fields upstream.

Using the Clear All Values button, you can reset values for all fields read into the node. This option effectively sets the Values column to Read for all fields. This option is useful to reset values for all fields and reread values and types from upstream operations.

Using the Values Dialog Box


Double-clicking a field in the Type node opens a separate dialog box where you can set options for reading, specifying, labeling, and handling values for the selected field.
Figure 4-4 Setting options for data values

Many of the controls are common to all types of data. These common controls are discussed here.
Type. Displays the currently selected type. You can change the type to reflect the way that you intend to use data in Clementine. For instance, if a field called day_of_week contains numbers representing individual days, you may want to change this type to a set in order to create a distribution node that examines each category individually.

Storage. Displays the storage type if known. Storage types are unaffected by the usage type (such as range, set, or flag) that you choose for work in Clementine. To alter the storage type, you can use the Data tab in Fixed File and Variable File source nodes or a conversion function in a Filler node.


Values. Select a method to determine values for the selected field. Selections that you make here override any selections that you made earlier from the Values column of the Type node dialog box. Choices for reading values include:
Read from data. Select to read values when the node is executed. This option is the same as <Read>.
Pass. Select not to read data for the current field. This option is the same as <Pass>.

Specify values and labels. Options here are used to specify values and labels for the selected field. Used in conjunction with value checking, this option allows you to specify values based on your knowledge of the current field. This option activates unique controls for each type of field. Options for values and labels are covered individually in subsequent topics. Note: You cannot specify values or labels for a typeless or <Default> field type.
Extend values from data. Select to append the current data with the values that you enter here. For example, if field_1 has a range from (0,10), and you enter a range of values from (8,16), the range is extended by adding the 16, without removing the original minimum. The new range would be (0,16). Choosing this option automatically sets the auto-typing option to <Read+>.
Check values. Select a method of coercing values to conform to the specified range, flag, or set values. This option corresponds to the Check column in the Type node dialog box, and settings made here override those in the dialog box. Used in conjunction with the Specify Values option, value checking allows you to conform values in the data with expected values. For example, if you specify values as 1, 0 and then use the Discard option, you can discard all records with values other than 1 or 0.

Define blanks. Select to activate the controls below that enable you to declare missing values or blanks in your data.


Missing values table. Allows you to define specific values (such as 99 or 0) as blanks. The value should be appropriate for the storage type of the field.


Range. Used to specify a range of missing values, for example, ages 1-17 or greater than 65. If a bound value is left blank then the range will be unbounded; for example, if a lower bound of 100 is specified with no upper bound, then all values greater than or equal to 100 will be defined as missing. The bound values are inclusive; for example, a range with a lower bound of 5 and an upper bound of 10 will include 5 and 10 in the range definition. A missing value range can be defined for any storage type, including date/time and string (in which case the alphabetic sort order will be used to determine whether a value is within the range).
Null/White space. You can also specify system nulls (displayed in the data as $null$) and white space (string values with no visible characters) as blanks. Note that the Type node also treats empty strings as white space for purposes of analysis, although they are stored differently internally and may be handled differently in certain cases. For more information, see Overview of Missing Values in Chapter 6 in Clementine 11.1 User's Guide. Note: To code blanks as undefined or $null$, you should use the Filler node.
Description. Use this text box to specify a field label. These labels appear in a variety of locations throughout Clementine, such as in graphs, tables, output, and model browsers, depending on selections you make in the stream properties dialog box.


Specifying Values and Labels for a Range


The range type is used for numeric fields. There are three storage types for range fields:

Real
Integer
Date/Time

The same dialog box is used to edit all three storage types; however, the different storage types are displayed for reference.
Figure 4-5 Options for specifying a range of values and their labels

Specifying Values

The following controls are unique to range fields and are used to specify a range of values:
Lower. Specify a lower limit for the range field values.

Upper. Specify an upper limit for the range field values.

Specifying Labels

You can specify labels for any value of a range field. Click the Labels button to open a separate dialog box for specifying value labels.

Values and Labels Subdialog Box


Clicking Labels in the Values dialog box for a range field opens a new dialog box in which you can specify labels for any value in the range.
Figure 4-6 Providing labels (optional) for range values


You can use the Values and Labels columns in this table to define value and label pairs. Currently defined pairs are shown here. You can add new label pairs by clicking in an empty cell and entering a value and its corresponding label. Note: Adding value/value-label pairs to this table will not cause any new values to be added to the field. Instead, it simply creates metadata for the field value. The labels that you specify in the Type node are displayed throughout Clementine (as ToolTips, output labels, and so on), depending on selections that you make in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.

Specifying Values and Labels for a Set


Set field types (sets and ordered sets) indicate that the data values are used discretely as a member of the set. The storage types for a set can be string, integer, real number, or date/time.
Figure 4-7 Options for specifying set values and labels

The following controls are unique to set fields and are used to specify values and labels:
Values. The Values column in the table allows you to specify values based on your knowledge of the current field. Using this table, you can enter expected values for the field and check the dataset's conformity to these values using the Check Values drop-down list. Using the arrow and delete buttons, you can modify existing values as well as reorder or delete values.
Labels. The Labels column enables you to specify labels for each value in the set. These labels appear in a variety of locations throughout Clementine, such as graphs, tables, output, and model browsers, depending on selections that you make in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.

Specifying Values for a Flag


Flag fields are used to display data that have two distinct values. The storage types for flags can be string, integer, real number, or date/time.

Figure 4-8 Options for specifying flag field values

True. Specify a flag value for the field when the condition is met.

False. Specify a flag value for the field when the condition is not met.

Labels. Specify labels for each value in the flag field. These labels appear in a variety of locations throughout Clementine, such as graphs, tables, output, and model browsers, depending on selections that you make in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.

Checking Type Values


Turning on the Check option for each field examines all values in that field to determine whether they comply with the current type settings or the values that you have specified in the Specify Values dialog box. This is useful for cleaning up datasets and reducing the size of a dataset within a single operation.
Figure 4-9 Selecting Check options for the selected field

The setting of the Check column in the Type node dialog box determines what happens when a value outside of the type limits is discovered. To change the Check settings for a field, use the drop-down list for that field in the Check column. To set the Check settings for all fields, click in the Field column and press Ctrl-A. Then use the drop-down list for any field in the Check column.


The following Check settings are available:


None. Values will be passed through without checking. This is the default setting.

Nullify. Change values outside of the limits to the system null ($null$).

Coerce. Fields whose types are fully instantiated will be checked for values that fall outside the specified ranges. Unspecified values will be converted to a legal value for that type using the following rules (sketched in code after this list):
For flags, any value other than the true and false value is converted to the false value.
For sets, any unknown value is converted to the first member of the set's values.
Numbers greater than the upper limit of a range are replaced by the upper limit.
Numbers less than the lower limit of a range are replaced by the lower limit.
Null values in a range are given the midpoint value for that range.
Discard. When illegal values are found, the entire record is discarded.

Warn. The number of illegal items is counted and reported in the stream properties dialog box when all of the data have been read.


Abort. The first illegal value encountered terminates the execution of the stream. The error is reported in the stream properties dialog box.
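Because the Coerce rules listed above are purely mechanical, they can be sketched in a few lines of Python (illustrative only; this is not Clementine's implementation, and the type representation is invented for the example):

# Illustrative only: the Coerce rules for flag, set, and range types.
def coerce_flag(value, true_value, false_value):
    # Any value other than the true value and false value becomes the false value.
    return value if value in (true_value, false_value) else false_value

def coerce_set(value, members):
    # Any unknown value becomes the first member of the set's values.
    return value if value in members else members[0]

def coerce_range(value, lower, upper):
    # Nulls get the midpoint; out-of-range numbers are clamped to the limits.
    if value is None:
        return (lower + upper) / 2.0
    return min(max(value, lower), upper)

print(coerce_flag("maybe", "T", "F"))      # -> F
print(coerce_set("XL", ["S", "M", "L"]))   # -> S
print(coerce_range(150, 0, 100))           # -> 100
print(coerce_range(None, 0, 100))          # -> 50.0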

Setting Field Direction


The direction of a field specifies how it is used in model building; for example, whether a field is an input or target (the thing being predicted).
Figure 4-10 Setting modeling role options for the Type node

The following roles are available:


In. The field will be used as an input to machine learning (a predictor field).


Out. The field will be used as an output or target for machine learning (one of the fields that the model will try to predict).

Both. The field will be used as both an input and an output by the GRI and Apriori nodes. All other modeling nodes will ignore the field.


Partition. Indicates a field used to partition the data into separate samples for training, testing, and (optional) validation purposes. The field must be an instantiated set type with two or three possible values (as defined in the Field Values dialog box). The first value represents the training sample, the second represents the testing sample, and the third (if present) represents the validation sample. Any additional values are ignored, and flag fields cannot be used. Note that to use the partition in an analysis, partitioning must be enabled on the Model Options tab in the appropriate model-building or analysis node. Records with null values for the partition field are excluded from the analysis when partitioning is enabled. If multiple partition fields have been defined in the stream, a single partition field must be specified on the Fields tab in each applicable modeling node. If a suitable field doesn't already exist in your data, you can create one using a Partition node or Derive node. For more information, see Partition Node on p. 119.
None. The field will be ignored by machine learning. Fields that have been set to Typeless are automatically set to None in the Direction column.

Copying Type Attributes


You can easily copy the attributes of a type, such as values, checking options, and missing values from one field to another:
E Right-click on the field whose attributes you want to copy.
E From the context menu, choose Copy.
E Right-click on the field(s) whose attributes you want to change.
E From the context menu, choose Paste Special.

Note: You can select multiple fields using the Ctrl-click method or by using the Select Fields option from the context menu.

A new dialog box opens, from which you can select the specific attributes that you want to paste. If you are pasting into multiple fields, the options that you select here will apply to all target fields.
Paste the following attributes. Select from the list below to paste attributes from one field to another.
Type. Select to paste the type.
Values. Select to paste the field values.
Missing. Select to paste missing value settings.
Check. Select to paste value checking options.
Direction. Select to paste the direction of a field.


Field Format Settings Tab


Figure 4-11 Type node, Format tab

The Format tab on the Table and Type nodes shows a list of current or unused fields and formatting options for each field. Following is a description of each column in the field formatting table:
Field. This shows the name of the selected field.

Format. By double-clicking a cell in this column, you can specify formatting for fields on an individual basis using the dialog box that opens. For more information, see Setting Field Format Options on p. 83. Formatting specified here overrides formatting specified in the overall stream properties. Note: The SPSS Export and SPSS Output nodes export .sav files that include per-field formatting in their metadata. If a per-field format is specified that is not supported by the SPSS .sav file format, then the node will use the SPSS default format.
Justify. Use this column to specify how the values should be justified within the table column. The default setting is Auto, which left-justifies symbolic values and right-justifies numeric values. You can override the default by selecting Left, Right, or Center.
Column Width. By default, column widths are automatically calculated based on the values of the field. To override the automatic width calculation, click a table cell and use the drop-down list to select a new width. To enter a custom width not listed here, open the Field Formats subdialog box by double-clicking a table cell in the Field or Format column. Alternatively, you can right-click on a cell and choose Set Format.
View current fields. By default, the dialog box shows the list of currently active fields. To view the list of unused fields, select View unused fields settings.


Context menu. The context menu for this tab provides various selection and setting update options.

Select All. Selects all fields.

Select None. Clears the selection.


Select Fields. Selects fields based on type or storage characteristics. Options are Select Discrete, Select Range (numeric), Select Typeless, Select Strings, Select Numbers, or Select Date/Time. For more information, see Data Types on p. 71.

Set Format. Opens a subdialog box for specifying date, time, and decimal options on a per-field basis.
Set Justify. Sets the justification for the selected field(s). Options are Auto, Center, Left, or Right.

Set Column Width. Sets the field width for selected fields. Specify Auto to read width from the data. Or you can set field width to 5, 10, 20, 30, 50, 100, or 200.

Setting Field Format Options


Field formatting is specified on a subdialog box available from the Format tab on the Type and Table nodes. If you have selected more than one field before opening this dialog box, then settings from the first field in the selection are used for all. Clicking OK after making specifications here will apply these settings to all fields selected on the Format tab.
Figure 4-12 Setting formatting options for one or more fields

The following options are available on a per-field basis. Many of these settings can also be specified in the stream properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide. Any settings made at the field level override the default specified for the stream.
Date format. Select a date format to be used for date storage fields or when strings are interpreted as dates by CLEM date functions.


Time format. Select a time format to be used for time storage fields or when strings are interpreted as times by CLEM time functions.


Number display format. You can choose from standard (####.###), scientific (#.###E+##), or currency display formats ($###.##).


Decimal symbol. Select either a comma (,) or period (.) as a decimal separator.

Grouping symbol. For number display formats, select the symbol used to group values (for example, the comma in 3,000.00). Options include none, period, comma, space, and locale-defined (in which case the default for the current locale is used).
Decimal places (standard, scientific, currency, export). For number display formats, specifies the number of decimal places to be used when displaying, printing, or exporting real numbers. This option is specified separately for each display format. The export format applies only to fields with real storage.

Justify. Specifies how the values should be justified within the column. The default setting is Auto, which left-justifies symbolic values and right-justifies numeric values. You can override the default by selecting Left, Right, or Center.


Column width. By default, column widths are automatically calculated based on the values of the field. You can specify a custom width in intervals of five using the arrows to the right of the list box.

Filter Node
Filter nodes have the following functions:

To filter or discard fields from records that pass through them. For example, as a medical researcher, you may not be concerned about the potassium level (field-level data) of patients (record-level data); therefore, you can filter out the K (potassium) field.
To rename fields.
To map fields from one source node to another. For more information, see Mapping Data Streams in Chapter 5 in Clementine 11.1 User's Guide.
Figure 4-13 Setting Filter node options


Setting Filtering Options


The table used on the Filter tab shows the name of each field as it comes into the node as well as the name of each field as it leaves. You can use the options in this table to rename or filter out fields that are duplicates or are unnecessary for downstream operations.
Field. Displays the input fields from currently connected data sources.

Filter. Displays the filter status of all input fields. Filtered fields include a red X in this column, indicating that this field will not be passed downstream. Click in the Filter column for a selected field to turn filtering on and off. You can also select options for multiple fields simultaneously using the Shift-click method of selection.
Field. Displays the fields as they leave the Filter node. Duplicate names are displayed in red. You can edit field names by clicking in this column and entering a new name. Or, remove fields by clicking in the Filter column to disable duplicate fields.

All columns in the table can be sorted by clicking on the column header.
View current fields. Select to view fields for datasets actively connected to the Filter node. This option is selected by default and is the most common method of using Filter nodes.
View unused field settings. Select to view fields for datasets that were once but are no longer connected to the Filter node. This option is useful when copying Filter nodes from one stream to another or when saving and reloading Filter nodes.

The filter menu at the top of this dialog box (available from the filter button) helps you to perform operations on multiple fields simultaneously.
Figure 4-14 Filter menu options

You can choose to:

Remove all fields.
Include all fields.
Toggle all fields.
Remove duplicates. Note: Selecting this option removes all occurrences of the duplicate name, including the first one.
Rename field names to conform with other SPSS applications. For more information, see Renaming or Filtering Fields for SPSS in Chapter 18 on p. 590.
Truncate field names.
Use input field names.
Anonymize field names.
Set the default filter state.

You can also use the arrow toggle buttons at the top of the dialog box to specify whether you want to include or discard fields by default. This is useful for large datasets where only a few fields are to be included downstream. For example, you can select only the fields you want to keep and specify that all others should be discarded (rather than individually selecting all of the fields to discard).

Truncating Field Names


Figure 4-15 Truncate Field Names dialog box

Using the options from the filter button menu, you can choose to truncate field names.
Maximum length. Specify a number of characters to limit the length of field names.

Number of digits. If field names, when truncated, are no longer unique, they will be further truncated and differentiated by adding digits to the name. You can specify the number of digits used. Use the arrow buttons to adjust the number.

For example, the table below illustrates how field names in a medical dataset are truncated using the default settings (maximum length=8 and number of digits=2).
Field Names        Truncated Field Names
Patient Input 1    Patien01
Patient Input 2    Patien02
Heart Rate         HeartRat
BP                 BP
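The same truncation-and-numbering rule can be sketched in Python (illustrative only; Clementine's exact tie-breaking may differ from this sketch):

# Illustrative only: truncate field names; names that clash after truncation
# are truncated further and numbered, as in the table above.
from collections import Counter

def truncate_names(names, max_length=8, digits=2):
    stripped = [name.replace(" ", "")[:max_length] for name in names]
    counts = Counter(stripped)
    result, seen = [], Counter()
    for short in stripped:
        if counts[short] > 1:                   # not unique after truncation
            seen[short] += 1
            stem = short[:max_length - digits]  # make room for the digits
            short = f"{stem}{seen[short]:0{digits}d}"
        result.append(short)
    return result

print(truncate_names(["Patient Input 1", "Patient Input 2", "Heart Rate", "BP"]))
# -> ['Patien01', 'Patien02', 'HeartRat', 'BP']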


Anonymizing Field Names


Figure 4-16 Transform Values dialog box

Using the option from the filter button menu, you can choose to anonymize the names of selected fields, or the names of all fields, in your data. Anonymized field names consist of a string prefix plus a unique numeric-based value.
Anonymize names of: Choose Selected fields only to anonymize only the names of fields already selected on the Filter tab. Default is All fields, which anonymizes all field names.

Field names prefix: The default prefix for anonymized field names is anon_; choose Custom and type your own prefix if you want a different one.

To restore the original field names, choose Use Input Field Names from the filter button menu.
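A rough Python equivalent of this renaming scheme follows (illustrative only; the mapping mechanics are an assumption of the sketch, not a description of Clementine's internals):

# Illustrative only: replace field names with a prefix plus a unique number,
# keeping a mapping so the original names can be restored later.
def anonymize_names(names, prefix="anon_"):
    return {name: f"{prefix}{i + 1}" for i, name in enumerate(names)}

mapping = anonymize_names(["Name", "Address", "Income"])
print(mapping)                                         # {'Name': 'anon_1', ...}
restored = {new: old for old, new in mapping.items()}  # to restore the originals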

Derive Node
One of the most powerful features in Clementine is the ability to modify data values and derive new fields from existing data. During lengthy data mining projects, it is common to perform several derivations, such as extracting a customer ID from a string of Web log data or creating a customer lifetime value based on transaction and demographic data. All of these transformations can be performed in Clementine, using a variety of field operations nodes. Several nodes in Clementine provide the ability to derive new fields:
The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, set, state, count, and conditional. For more information, see Derive Node on p. 87.

The Reclassify node transforms one set of discrete values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For more information, see Reclassify Node on p. 105.

The Binning node automatically creates new set fields based on the values of one or more existing numeric range fields. For example, you can transform a scale income field into a new categorical field containing groups of income as deviations from the mean. Once you have created bins for the new field, you can generate a Derive node based on the cut points. For more information, see Binning Node on p. 109.


The Set to Flag node derives multiple flag fields based on the categorical values defined for one or more set fields. For more information, see Set to Flag Node on p. 121.

The Restructure node converts a set or flag field into a group of fields that can be populated with the values of yet another field. For example, given a field named payment type, with values of credit, cash, and debit, three new fields would be created (credit, cash, debit), each of which might contain the value of the actual payment made. For more information, see Restructure Node on p. 123.

The History node creates new fields containing data from fields in previous records. History nodes are most often used for sequential data, such as time series data. Before using a History node, you may want to sort the data using a Sort node. For more information, see History Node on p. 146.

Using the Derive Node

Using the Derive node, you can create six types of new fields from one or more existing fields:
Formula. The new field is the result of an arbitrary CLEM (Clementine Language for Expression Manipulation) expression.


Flag. The new field is a flag, representing a specified condition.

Set. The new field is a set, meaning that its members are a group of specified values.

State. The new field is one of two states. Switching between these states is triggered by a specified condition.
Count. The new field is based on the number of times that a condition has been true.

Conditional. The new field is the value of one of two expressions, depending on the value of a condition.

Each of these Derive node types contains a set of special options in the Derive node dialog box. These options are discussed in subsequent topics.

Setting Basic Options for the Derive Node


At the top of the dialog box for Derive nodes are a number of options for selecting the type of Derive node that you need.

Figure 4-17 Derive node dialog box

Mode. Select Single or Multiple, depending on whether you want to derive multiple fields. When Multiple is selected, the dialog box changes to include options for multiple Derive fields.

Derive field. For simple Derive nodes, specify the name of the field that you want to derive and add to each record. The default name is DeriveN, where N is the number of Derive nodes that you have created thus far during the current session.

Derive as. Select a type of Derive node, such as Formula or Set, from the drop-down list. For each type, a new field is created based on the conditions that you specify in the type-specific dialog box. Selecting an option from the drop-down list will add a new set of controls to the main dialog box according to the properties of each Derive node type.
Field type. Select a type, such as range, set, or flag, for the newly derived field. This option is common to all forms of Derive nodes.

Note: Deriving new fields often requires the use of special functions or mathematical expressions. To help you create these expressions, an Expression Builder is available from the dialog box for all types of Derive nodes and provides rule checking as well as a complete list of CLEM expressions. For more information, see What Is CLEM? in Chapter 7 in Clementine 11.1 User's Guide.

Deriving Multiple Fields


Setting the mode to Multiple within a Derive node gives you the capability to derive multiple fields based on the same condition within the same node. This feature saves time when you want to make identical transformations on several fields in your dataset. For example, if you want to build a regression model predicting current salary based on beginning salary and previous experience, it might be beneficial to apply a log transformation to all three skewed variables. Rather than add a new Derive node for each transformation, you can apply the same function to all fields at once. Simply select all fields from which to derive a new field and then type the derive expression using the @FIELD function within the field parentheses.

Note: The @FIELD function is an important tool for deriving multiple fields at the same time. It allows you to refer to the contents of the current field or fields without specifying the exact field name. For instance, a CLEM expression used to apply a log transformation to multiple fields is log(@FIELD).
Figure 4-18 Deriving multiple fields

The following options are added to the dialog box when you select Multiple mode:
Derive from. Use the Field Chooser to select fields from which to derive new fields. One output field will be generated for each selected field. Note: Selected fields do not need to be the same storage type; however, the Derive operation will fail if the condition is not valid for all fields.
Field name extension. Type the extension that you would like added to the new field name(s). For example, for a new field containing the log of Current Salary, you could add the extension log_ to the field name, producing log_Current Salary. Use the radio buttons to choose whether to add the extension as a prefix (at the beginning) or as a suffix (at the end) of the field name. The default name is DeriveN, where N is the number of Derive nodes that you have created thus far during the current session.

As in the single-mode Derive node, you now need to create an expression to use for deriving a new field. Depending on the type of Derive operation selected, there are a number of options to create a condition. These options are discussed in subsequent topics. To create an expression, you can simply type in the formula field(s) or use the Expression Builder by clicking the calculator button. Remember to use the @FIELD function when referring to manipulations on multiple fields.
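To make the idea concrete outside the product, the following sketch applies the same transformation to several fields at once and prefixes each new field name, much as Multiple mode with log(@FIELD) does. It uses Python with pandas and NumPy and invented column names; it is illustrative only, not a Clementine feature:

# Illustrative only: one transformation applied to several fields at once,
# with a prefix added to each new field name.
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [30000, 45000],
                   "salbegin": [20000, 25000],
                   "prevexp": [12, 60]})

for col in ["salary", "salbegin", "prevexp"]:  # the fields chosen in Derive from
    df["log_" + col] = np.log(df[col])         # log(@FIELD) applied to each field

print(df)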

Selecting Multiple Fields


For all nodes that perform operations on multiple input fields, such as Derive (multiple mode), Aggregate, Sort, Multiplot, and Time Plot, you can easily select multiple fields using the following dialog box.
Figure 4-19 Selecting multiple fields

Sort by. You can sort available fields for viewing by selecting one of the following options:

Natural. View the order of fields as they have been passed down the data stream into the current node.
Name. Use alphabetical order to sort fields for viewing.

Type. View fields sorted by their type. This option is useful when selecting fields by type.

Select fields from the table one at a time or use the Shift-click and Ctrl-click methods to select multiple fields. You can also use the buttons below to select groups of fields based on their type or to select or deselect all fields in the table.

Setting Derive Formula Options


Derive Formula nodes create a new field for each record in a dataset based on the results of a CLEM expression. Note that this expression cannot be conditional. To derive values based on a conditional expression, use the flag or conditional type of Derive node.

Figure 4-20 Setting options for a Derive Formula node

Formula. Specify a formula using the CLEM language to derive a value for the new field. For example, using the P3_LoS stream shipped with the Clementine Application Template (CAT) for CRM, you can derive the length of service for contracts pertaining to all customers in the database. The new field is called LoS, and using the Expression Builder, you can create the following expression in the Formula field:
date_years_difference(CardStartDate,'20010101')

Upon execution, the new LoS field will be created for each record and will contain the value of the difference between the value for CardStartDate and the reference date (2001/01/01) for each record.
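As another sketch (the field names here are hypothetical and not part of the P3_LoS stream), any CLEM expression that returns a single value per record can serve as the formula. For instance, you could derive a salary growth ratio by entering:
Current_Salary / Beginning_Salary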

Setting Derive Flag Options


Derive Flag nodes are used to indicate a specific condition, such as high blood pressure or customer account inactivity. A flag field is created for each record, and when the true condition is met, the flag value for true is added in the field.

Figure 4-21 Deriving a flag field to indicate inactive accounts

True value. Specify a value to include in the flag field for records that match the condition specified below. The default is T.
False value. Specify a value to include in the flag field for records that do not match the condition specified below. The default is F.


True when. Specify a CLEM condition to evaluate certain values of each record and give the record a true value or a false value (defined above). Note that the true value will be given to records in the case of non-false numeric values.

Note: To return an empty string, you should type opening and closing quotes with nothing between them, such as "". Empty strings are often used, for example, as the false value in order to enable true values to stand out more clearly in a table. Similarly, quotes should be used if you want a string value that would otherwise be treated as a number.
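For example, using the drug demo data described later in this chapter (a sketch only; adjust the field name and value to match your own data), you could flag patients with high blood pressure by entering a True when condition such as:
BP = 'HIGH'
Conditions can also combine several tests with logical operators, for example:
BP = 'HIGH' and Age > 50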

Setting Derive Set Options


Derive Set nodes are used to execute a set of CLEM conditions in order to determine which condition each record satisfies. As a condition is met for each record, a value (indicating which set of conditions was met) will be added to the new, derived field.

Figure 4-22 Setting customer value categories using a Derive Set node

Default value. Specify a value to be used in the new field if none of the conditions are met.
Set field to. Specify a value to enter in the new field when a particular condition is met. Each value in the list has an associated condition that you specify in the adjacent column.
If this condition is true. Specify a condition for each member in the set field to list. Use the Expression Builder to select from available functions and fields. You can use the arrow and delete buttons to reorder or remove conditions.

A condition works by testing the values of a particular field in the dataset. As each condition is tested, the values specified above will be assigned to the new field to indicate which, if any, condition was met. If none of the conditions are met, the default value is used.
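As a sketch of how customer value categories such as those in the figure might be defined (the field name and cut points here are hypothetical), you could pair the set values Low, Medium, and High with conditions such as:
Total_Spend < 1000
Total_Spend >= 1000 and Total_Spend < 5000
Total_Spend >= 5000
Records that satisfy none of these conditions would receive the default value.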

Setting Derive State Options


Derive State nodes are somewhat similar to Derive Flag nodes. A Flag node sets values depending on the fulfillment of a single condition for the current record, but a Derive State node can change the values of a field depending on how it fulfills two independent conditions. This means that the value will change (turn on or off) as each condition is met.

Figure 4-23 Using a Derive State node to indicate the current status of power plant conditions

Initial state. Select whether to give each record of the new field the On or Off value initially. Note that this value can change as each condition is met.
On value. Specify the value for the new field when the On condition is met.
Switch On when. Specify a CLEM condition that will change the state to On when the condition is true. Click the calculator button to open the Expression Builder.
Off value. Specify the value for the new field when the Off condition is met.
Switch Off when. Specify a CLEM condition that will change the state to Off when the condition is false. Click the calculator button to open the Expression Builder.

Note: To specify an empty string, you should type opening and closing quotes with nothing between them, such as "". Similarly, quotes should be used if you want a string value that would otherwise be treated as a number.

Setting Derive Count Options


A Derive Count node is used to apply a series of conditions to the values of a numeric field in the dataset. As each condition is met, the value of the derived count field is increased by a set increment. This type of Derive node is useful for time series data.

Figure 4-24 Count options in the Derive node dialog box

Initial value. Sets a value used on execution for the new field. The initial value must be a numeric constant. Use the arrow buttons to increase or decrease the value.
Increment when. Specify the CLEM condition that, when met, will change the derived value based on the number specified in Increment by. Click the calculator button to open the Expression Builder.
Increment by. Set the value used to increment the count. You can use either a numeric constant or the result of a CLEM expression.
Reset when. Specify a condition that, when met, will reset the derived value to the initial value. Click the calculator button to open the Expression Builder.
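For example, as a sketch with a hypothetical field name, you could count the consecutive readings in a time series during which a sensor value stays above a threshold by setting Initial value to 0, Increment by to 1, and using conditions such as:
Increment when: Power > 1000
Reset when: Power <= 1000
The derived field then holds, for each record, how many readings in a row have exceeded the threshold.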

Setting Derive Conditional Options


Derive Conditional nodes use a series of If-Then-Else statements to derive the value of the new field.

Figure 4-25 Using a conditional Derive node to create a second customer value category

If. Specify a CLEM condition that will be evaluated for each record upon execution. If the condition is true (or non-false, in the case of numbers), the new field is given the value specified below by the Then expression. Click the calculator button to open the Expression Builder.
Then. Specify a value or CLEM expression for the new field when the If statement above is true (or non-false). Click the calculator button to open the Expression Builder.
Else. Specify a value or CLEM expression for the new field when the If statement above is false. Click the calculator button to open the Expression Builder.
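A minimal sketch (the field name and threshold are hypothetical): to create a second customer value category, you might enter
If: Total_Spend >= 5000
Then: 'High'
Else: 'Standard'
so that each record receives either 'High' or 'Standard' in the new field.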

Recoding Values with the Derive Node


Derive nodes can also be used to recode values, for example by converting a string field with discrete values to a numeric set field.

Figure 4-26 Recoding string values

E For Derive As, select the type of field (Set, Flag, etc.) as appropriate.
E Specify the conditions for recoding values. For example, you could set the value to 1 if Drug='drugA', 2 if Drug='drugB', and so on.

Filler Node
Filler nodes are used to replace field values and change storage. You can choose to replace values based on a specified CLEM condition, such as @BLANK(FIELD). Alternatively, you can choose to replace all blanks or null values with a specific value. Filler nodes are often used in conjunction with the Type node to replace missing values. For example, you can fill blanks with the mean value of a field by specifying an expression such as @GLOBAL_MEAN. This expression will fill all blanks with the mean value as calculated by a Set Globals node.
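As a sketch (assuming a numeric field named income and an upstream Set Globals node that has computed its mean), you could fill blanks in income by selecting it in Fill in fields, setting the replacement condition to
@BLANK(@FIELD)
and the replacement expression to
@GLOBAL_MEAN(income)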

Figure 4-27 Filler node dialog box

Fill in fields. Using the Field Chooser (button to the right of the text field), select fields from the dataset whose values will be examined and replaced. The default behavior is to replace values depending on the Condition and Replace with expressions specified below. You can also select an alternative method of replacement using the Replace options below. Note: When selecting multiple fields to replace with a user-defined value, it is important that the field types are similar (all numeric or all symbolic).
Replace. Select to replace the values of the selected field(s) using one of the following methods:
Based on condition. This option activates the Condition field and Expression Builder for you to create an expression used as a condition for replacement with the value specified.
Always. Replaces all values of the selected field. For example, you could use this option to convert the storage of income to a string using the following CLEM expression: (to_string(income)).
Blank values. Replaces all user-specified blank values in the selected field. The standard condition @BLANK(@FIELD) is used to select blanks. Note: You can define blanks using the Types tab of the source node or with a Type node.
Null values. Replaces all system null values in the selected field. The standard condition @NULL(@FIELD) is used to select nulls.
Blank and null values. Replaces both blank values and system nulls in the selected field. This option is useful when you are unsure whether or not nulls have been defined as missing values.
Condition. This option is available when you have selected the Based on condition option. Use this text box to specify a CLEM expression for evaluating the selected fields. Click the calculator button to open the Expression Builder.


Replace by. Specify a CLEM expression to give a new value to the selected fields. You can also replace the value with a null value by typing undef in the text box. Click the calculator button to open the Expression Builder.

Note: When the field(s) selected are string, you should replace them with a string value. Using the default 0 or another numeric value as the replacement value for string fields will result in an error.

Storage Conversion Using the Filler Node


Using the Replace condition of a Filler node, you can easily convert the field storage for single or multiple fields. For example, using the conversion function to_integer, you could convert income from a string to an integer using the following CLEM expression: to_integer(income).
Figure 4-28 Using a Filler node to convert field storage

You can view available conversion functions and automatically create a CLEM expression using the Expression Builder. From the Functions drop-down list, select Conversion to view a list of storage conversion functions. The following conversion functions are available:
to_integer(ITEM)
to_real(ITEM)
to_number(ITEM)
to_string(ITEM)
to_time(ITEM)
to_timestamp(ITEM)
to_date(ITEM)
to_datetime(ITEM)


Converting date and time values. Note that conversion functions (and any other functions that require a specific type of input, such as a date or time value) depend on the current formats specified in the Stream Options dialog box. For example, if you want to convert a string field with values Jan 2003, Feb 2003, etc. to date storage, select MON YYYY as the default date format for the stream. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
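Continuing that example as a sketch (assuming the string field is named Month and the stream date format is set to MON YYYY), you could select Month in a Filler node, choose the Always replace option, and enter the conversion expression:
to_date(Month)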

Conversion functions are also available from the Derive node, for temporary conversion during a derive calculation. You can also use the Derive node to perform other manipulations such as recoding string fields with discrete values. For more information, see Recoding Values with the Derive Node on p. 97.

Anonymize Node
The Anonymize node enables you to disguise field names, field values, or both when working with data that are to be included in a model downstream of the node. In this way, the generated model can be freely distributed (for example, to SPSS Technical Support) with no danger that unauthorized users will be able to view confidential data, such as employee records or patients' medical records. Depending on where you place the Anonymize node in the stream, you may need to make changes to other nodes. For example, if you insert an Anonymize node upstream from a Select node, the selection criteria in the Select node will need to be changed if they are acting on values that have now become anonymized. The method to be used for anonymizing depends on various factors. For field names, and all field values except Range data types, the data are replaced by a string of the form:
prefix_Sn

where prefix_ is either a user-specified string or the default string anon_, and n is an integer value that starts at 0 and is incremented for each unique value (for example, anon_S0, anon_S1, etc.). Field values of type Range must be transformed because ranges deal with integer or real values rather than strings. As such, they can be anonymized only by transforming the range into a different range, thus disguising the original data. Transformation of a value x in the range is performed in the following way:
A*(x + B)

where:
A is a scale factor, which must be greater than 0.
B is a translation offset to be added to the values.
Example

In the case of a field AGE where the scale factor A is set to 7 and the translation offset B is set to 3, the values for AGE are transformed into:
7*(AGE + 3)
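So, for instance, a record with an original AGE of 32 would be replaced by 7*(32 + 3) = 245, and a record with AGE of 50 by 7*(50 + 3) = 371; because A is greater than 0, the ordering of values is preserved while the original ages are disguised.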


Setting Options for the Anonymize Node


Here you can choose which fields are to have their values disguised further downstream. Note that the data fields must be instantiated upstream from the Anonymize node before anonymize operations can be performed. You can instantiate the data by clicking the Read Values button on a Type node or on the Types tab of a source node.
Figure 4-29 Setting anonymize options

Field. Lists the fields in the current dataset. If any field names have already been anonymized, the anonymized names are shown here.
Type. The data type of the field.
Anonymize Values. Select one or more fields, click this column, and choose Yes to anonymize the field value using the default prefix anon_; choose Specify to display a dialog box in which you can either enter your own prefix or, in the case of field values of type Range, specify whether the transformation of field values is to use random or user-specified values. Note that Range and non-Range field types cannot be specified in the same operation; you must do this separately for each type of field.
View current fields. Select to view fields for datasets actively connected to the Anonymize node. This option is selected by default.
View unused field settings. Select to view fields for datasets that were once but are no longer connected to the node. This option is useful when copying nodes from one stream to another or when saving and reloading nodes.


Specifying How Field Values Will Be Anonymized


The Replace Values dialog box lets you choose whether to use the default prefix for anonymized field values or to use a custom prefix. Clicking OK in this dialog box changes the setting of Anonymize Values on the Settings tab to Yes for the selected field or fields.
Figure 4-30 Replace Values dialog box

Field values prefix. The default prefix for anonymized field values is anon_; choose Custom and enter your own prefix if you want a different one.
The Transform Values dialog box is displayed only for fields of type Range and allows you to specify whether the transformation of field values is to use random or user-specified values.
Figure 4-31 Transform Values dialog box

Random. Choose this option to use random values for the transformation. Set random seed is selected by default; specify a value in the Seed field, or use the default value.
Fixed. Choose this option to specify your own values for the transformation:
Scale by. The number by which field values will be multiplied in the transformation. Minimum value is 1; maximum is normally 10, but this may be lowered to avoid overflow.
Translate by. The number that will be added to field values in the transformation. Minimum value is 0; maximum is normally 1000, but this may be lowered to avoid overflow.


Anonymizing Field Values


Fields that have been selected for anonymization on the Settings tab have their values anonymized:
when you execute the stream containing the Anonymize node
when you preview the values
To preview the values, click the Anonymize Values button on the Anonymized Values tab. Next, select a field name from the drop-down list. If the field type is Range, the display shows the:
minimum and maximum values of the original range
equation used to transform the values
Figure 4-32 Anonymizing field values

If the field type is anything other than Range, the screen displays the original and anonymized values for that field.

Figure 4-33 Anonymizing field values

If the display appears with a yellow background, this indicates that either the setting for the selected field has changed since the last time the values were anonymized, or that changes have been made to the data upstream of the Anonymize node such that the anonymized values may no longer be correct. The current set of values is displayed; click the Anonymize Values button again to generate a new set of values according to the current setting.
Anonymize Values. Creates anonymized values for the selected field and displays them in the table. If you are using random seeding for a field of type Range, clicking this button repeatedly creates a different set of values each time.
Clear Values. Clears the original and anonymized values from the table.

Reclassify Node
The Reclassify node enables the transformation from one set of discrete values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For example, you could reclassify the values for Product into three groups, such as Kitchenware, Bath and Linens, and Appliances. Often, this operation is performed directly from a Distribution node by grouping values and generating a Reclassify node. For more information, see Using a Distribution Node in Chapter 5 on p. 193. Reclassification can be performed for one or more symbolic fields. You can also choose to substitute the new values for the existing field or generate a new field.
Before using a Reclassify node, consider whether another Field Operations node is more appropriate for the task at hand:
To transform numeric ranges into sets using an automatic method, such as ranks or percentiles, you should use a Binning node. For more information, see Binning Node on p. 109.


To classify numeric ranges into sets manually, you should use a Derive node. For example, if you want to collapse salary values into specific salary range categories, you should use a Derive node to define each category manually.
To create one or more flag fields based on the values of a categorical field, such as Mortgage_type, you should use a Set to Flag node.
To convert a discrete field to numeric storage, you can use a Derive node. For example, you could convert No and Yes values to 0 and 1, respectively. For more information, see Recoding Values with the Derive Node on p. 97.

Setting Options for the Reclassify Node


There are three steps to using the Reclassify node:
E First, select whether you want to reclassify multiple fields or a single field.
E Next, choose whether to recode into the existing field or create a new field.
E Then, use the dynamic options in the Reclassify node dialog box to map sets as desired.
Figure 4-34 Reclassify node dialog box

Mode. Select Single to reclassify the categories for one field. Select Multiple to activate options enabling the transformation of more than one field at a time.


Reclassify into. Select New field to keep the original set field and derive an additional field containing the reclassified values. Select Existing field to overwrite the values in the original field with the new classifications. This is essentially a fill operation.

Once you have specified mode and replacement options, you must select the transformation field and specify the new classification values using the dynamic options on the bottom half of the dialog box. These options vary depending on the mode you have selected above.
Reclassify field(s). Use the Field Chooser button on the right to select one (Single mode) or more (Multiple mode) discrete fields.


New field name. Specify a name for the new set field containing recoded values. This option is available only in Single mode when New field is selected above. When Existing field is selected, the original field name is retained. When working in Multiple mode, this option is replaced with controls for specifying an extension added to each new field. For more information, see Reclassifying Multiple Fields on p. 108.
Reclassify values. This table enables a clear mapping from old set values to those you specify here.
Original value. This column lists existing values for the selected field(s).
New value. Use this column to type new category values or select one from the drop-down list. When you automatically generate a Reclassify node using values from a Distribution chart, these values are included in the drop-down list. This allows you to quickly map existing values to a known set of values. For example, healthcare organizations sometimes group diagnoses differently based upon network or locale. After a merger or acquisition, all parties will be required to reclassify new or even existing data in a consistent fashion. Rather than manually typing each target value from a lengthy list, you can read the master list of values into Clementine, run a Distribution chart for the Diagnosis field, and generate a Reclassify (values) node for this field directly from the chart. This process will make all of the target Diagnosis values available from the New Values drop-down list.
E Click Get to read original values for one or more fields selected above.
E Click Copy to paste original values over to the New value column for fields that have not been mapped yet. The unmapped original values are added to the drop-down list.
E Click Clear new to erase all specifications in the New value column. Note: This option does not erase the values from the drop-down list.
E Click Auto to automatically generate consecutive integers for each of the original values. Only integer values (no real values, such as 1.5, 2.5, and so on) can be generated.
Figure 4-35 Auto-classification dialog box


For example, you can automatically generate consecutive product ID numbers for product names or course numbers for university class offerings. This functionality corresponds to the Automatic Recode transformation for sets in SPSS.
For unspecified values use. This option is used for filling unspecified values in the new field. You can either choose to keep the original value by selecting Original value or specify a default value.

Reclassifying Multiple Fields


To map category values for more than one field at a time, set the mode to Multiple. This enables new settings in the Reclassify dialog box, which are described below.
Figure 4-36 Dynamic dialog box options for reclassifying multiple fields

Reclassify fields. Use the Field Chooser button on the right to select the fields that you want to transform. Using the Field Chooser, you can select all fields at once or fields of a similar type, such as set or flag.
Field name extension. When recoding multiple fields simultaneously, it is more efficient to specify a common extension added to all new fields rather than individual field names. Specify an extension such as _recode and select whether to append or prepend this extension to the original field names.


Storage and Type for Reclassified Fields


The Reclassify node always creates a set type field from the recode operation. In some cases, this may change the type of the field when using the Existing field reclassification mode. The new field's storage (how data are stored rather than how they are used) is calculated based on the following Settings tab options:
If unspecified values are set to use a default value, the storage type is determined by examining both the new values as well as the default value and determining the appropriate storage. For example, if all values can be parsed as integers, the field will have the integer storage type.
If unspecified values are set to use the original values, the storage type is based on the storage of the original field. If all of the values can be parsed as the storage of the original field, then that storage is preserved; otherwise, the storage is determined by finding the most appropriate storage type encompassing both old and new values. For example, reclassifying an integer set { 1, 2, 3, 4, 5 } with the reclassification 4 => 0, 5 => 0 generates a new integer set { 1, 2, 3, 0 }, whereas the reclassification 4 => Over 3, 5 => Over 3 will generate the string set { 1, 2, 3, Over 3 }.
Note: If the original type was uninstantiated, the new type will also be uninstantiated.

Binning Node
The Binning node enables you to automatically create new set fields based on the values of one or more existing numeric range fields. For example, you can transform a scale income field into a new categorical field containing income groups of equal width, or as deviations from the mean. Alternatively, you can select a categorical supervisor field in order to preserve the strength of the original association between the two fields. Binning can be useful for a number of reasons, including:
Algorithm requirements. Certain algorithms, such as Naive Bayes and Logistic Regression, require categorical inputs.


Performance. Algorithms such as multinomial logistic may perform better if the number of distinct values of input fields is reduced. For example, use the median or mean value for each bin rather than using the original values.
Data Privacy. Sensitive personal information, such as salaries, may be reported in ranges rather than actual salary figures in order to protect privacy.
A number of binning methods are available. Once you have created bins for the new field, you can generate a Derive node based on the cut points.
Before using a Binning node, consider whether another technique is more appropriate for the task at hand:
To manually specify cut points for categories, such as specific predefined salary ranges, use a Derive node. For more information, see Derive Node on p. 87.
To create new categories for existing sets, use a Reclassify node. For more information, see Reclassify Node on p. 105.


Missing Value Handling

The Binning node handles missing values in the following ways:


User-specified blanks. Missing values specified as blanks are included during the transformation. For example, if you designated 99 to indicate a blank value using the Type node, this value will be included in the binning process. To ignore blanks during binning, you should use a Filler node to replace the blank values with the system null value.
System-missing values ($null$). Null values are ignored during the binning transformation and remain nulls after the transformation.
The Settings tab provides options for available techniques. The View tab displays cut points established for data previously run through the node.

Setting Options for the Binning Node


Using the Binning node, you can automatically generate bins (categories) using the following techniques:
Fixed-width binning
Tiles (equal count or sum)
Mean and standard deviation
Ranks
Optimized relative to a categorical supervisor field
The bottom half of the dialog box changes dynamically depending on the binning method you select.

Figure 4-37 Binning node dialog box, Settings tab

Bin fields. Numeric range fields pending transformation are displayed here. The Binning node enables you to bin multiple fields simultaneously. Add or remove fields using the buttons on the right.
Binning method. Select the method used to determine cut points for new field bins (categories). The following topics discuss options for the available methods of binning.

Fixed-Width Bins
When you choose Fixed-width as the binning method, a new set of options is displayed in the dialog box.

Figure 4-38 Binning node dialog box (Settings tab) with options for fixed-width bins

Name extension. Specify an extension to use for the generated field(s). _BIN is the default extension. You may also specify whether the extension is added to the start (Prefix) or end (Suffix) of the field name. For example, you could generate a new field called income_BIN.
Bin width. Specify a value (integer or real) used to calculate the width of the bin. For example, you can use the default value, 10, to bin the field Age. Since Age has a range from 18-65, the generated bins would be the following:
Table 4-1 Bins for Age with range 18-65

Bin 1: >=13 to <23
Bin 2: >=23 to <33
Bin 3: >=33 to <43
Bin 4: >=43 to <53
Bin 5: >=53 to <63
Bin 6: >=63 to <73
The start of bin intervals is calculated using the lowest scanned value minus half the bin width (as specified). For example, in the bins shown above, 13 is used to start the intervals according to the following calculation: 18 [lowest data value] - 5 [0.5 x (Bin width of 10)] = 13.
No. of bins. Use this option to specify an integer used to determine the number of fixed-width bins (categories) for the new field(s).
Once you have executed the Binning node in a stream, you can view the bin thresholds generated by clicking the Preview tab in the Binning node dialog box. For more information, see Previewing the Generated Bins on p. 118.

Tiles (Equal Count or Sum)


The tile binning method creates set variables that can be used to split scanned records into percentile groups (or quartiles, deciles, and so on) so that each group contains the same number of records, or the sum of the values in each group is equal. Records are ranked in ascending order based on the value of the specified bin field, so that records with the lowest values for the selected bin variable are assigned a rank of 1, the next set of records are ranked 2, and so on. The threshold values for each bin are generated automatically based on the data and tiling method used.

Figure 4-39 Binning node dialog box (Settings tab) with options for equal count bins

Tile name extension. Specify an extension used for field(s) generated using standard p-tiles. The default extension is _TILE plus N, where N is the tile number. You may also specify whether the extension is added to the start (Prefix) or end (Suffix) of the field name. For example, you could generate a new field called income_TILE4.
Custom tile extension. Specify an extension used for a custom tile range. The default is _TILEN. Note that N in this case will not be replaced by the custom number.
Available p-tiles are:
Quartile. Generate 4 bins, each containing 25% of the cases.
Quintile. Generate 5 bins, each containing 20% of the cases.
Decile. Generate 10 bins, each containing 10% of the cases.
Vingtile. Generate 20 bins, each containing 5% of the cases.
Percentile. Generate 100 bins, each containing 1% of the cases.
Custom N. Select to specify the number of bins. For example, a value of 3 would produce 3 banded categories (2 cut points), each containing 33.3% of the cases.
Note that if there are fewer discrete values in the data than the number of tiles specified, all tiles will not be used. In such cases, the new distribution is likely to reflect the original distribution of your data.
Tiling method. Specifies the method used to assign records to bins.
Record count. Seeks to assign an equal number of records to each bin.
Sum of values. Seeks to assign records to bins such that the sum of the values in each bin is equal. When targeting sales efforts, for example, this method can be used to assign prospects to decile groups based on value per record, with the highest value prospects in the top bin. For example, a pharmaceutical company might rank physicians into decile groups based on the number of prescriptions they write. While each decile would contain approximately the same number of scripts, the number of individuals contributing those scripts would not be the same, with the individuals who write the most scripts concentrated in decile 10. Note that this approach assumes that all values are greater than zero, and may yield unexpected results if this is not the case.
Ties. A tie condition results when values on either side of a cut point are identical. For example, if you are assigning deciles and more than 10% of records have the same value for the bin field, then all of them cannot fit into the same bin without forcing the threshold one way or another. Ties can be moved up to the next bin or kept in the current one but must be resolved so that all records with identical values fall into the same bin, even if this causes some bins to have more records than expected. The thresholds of subsequent bins may also be adjusted as a result, causing values to be assigned differently for the same set of numbers based on the method used to resolve ties.
Add to next. Select to move the tie values up to the next bin.
Keep in current. Keeps tie values in the current (lower) bin. This method may result in fewer total bins being created.


Example: Tiling by Record Count

The table below illustrates how simplified field values are ranked as quartiles when tiling by record count. Note the results vary depending on the selected ties option.

Values    Add to Next    Keep in Current
10        1              1
13        2              1
15        3              2
15        3              2
20        4              3

The number of items per bin is calculated as:


total number of values / number of tiles

In the simplified example above, the desired number of items per bin is 1.25 (5 values / 4 quartiles). The value 13 (being value number 2) straddles the 1.25 desired count threshold and is therefore treated differently depending on the selected ties option. In Add to Next mode, it is added into bin 2. In Keep in Current mode, it is left in bin 1, pushing the range of values for bin 4 outside that of existing data values. As a result, only three bins are created, and the thresholds for each bin are adjusted accordingly.

Figure 4-40 Thresholds for generated bins

Note: The speed of binning by tiles may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 User's Guide.

Rank Cases
When you choose Ranks as the binning method, a new set of options is displayed in the dialog box.
Figure 4-41 Binning node dialog box (Settings tab) with options for ranks

Ranking creates new fields containing ranks, fractional ranks, and percentile values for numeric fields depending on the options specified below.
Rank order. Select Ascending (lowest value is marked 1) or Descending (highest value is marked 1).
Rank. Select to rank cases in ascending or descending order as specified above. The range of values in the new field will be 1-N, where N is the number of discrete values in the original field. Tied values are given the average of their rank.
Fractional rank. Select to rank cases where the value of the new field equals rank divided by the sum of the weights of the nonmissing cases. Fractional ranks fall in the range of 0-1.


Percentage fractional rank. Each rank is divided by the number of records with valid values and multiplied by 100. Percentage fractional ranks fall in the range of 1-100.
Extension. For all rank options, you can create custom extensions and specify whether the extension is added to the start (Prefix) or end (Suffix) of the field name. For example, you could generate a new field called income_P_RANK.

Mean/Standard Deviation
When you choose Mean/standard deviation as the binning method, a new set of options is displayed in the dialog box.
Figure 4-42 Binning node dialog box (Settings tab) with options for mean/standard deviation

This method generates one or more new fields with banded categories based on the values of the mean and standard deviation of the distribution of the specified field(s). Select the number of deviations to use below.
Name extension. Specify an extension to use for the generated field(s). _SDBIN is the default extension. You may also specify whether the extension is added to the start (Prefix) or end (Suffix) of the field name. For example, you could generate a new field called income_SDBIN.
+/- 1 standard deviation. Select to generate three bins.
+/- 2 standard deviations. Select to generate five bins.
+/- 3 standard deviations. Select to generate seven bins.

For example, selecting +/- 1 standard deviation results in the three bins as calculated below:
Bin 1: x < (Mean - Std. Dev)
Bin 2: (Mean - Std. Dev) <= x <= (Mean + Std. Dev)
Bin 3: x > (Mean + Std. Dev)

In a normal distribution, 68% of the cases fall within one standard deviation of the mean, 95% within two standard deviations, and 99% within three standard deviations. Note, however, that creating banded categories based on standard deviations may result in some bins being defined outside the actual data range and even outside the range of possible data values (for example, a negative salary range).
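As a numeric sketch (the field and figures are hypothetical), if income has a mean of 40,000 and a standard deviation of 15,000, selecting +/- 1 standard deviation would produce:
Bin 1: income < 25,000
Bin 2: 25,000 <= income <= 55,000
Bin 3: income > 55,000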


Optimal Binning
If the field you want to bin is strongly associated with another categorical field, you can select the categorical field as a supervisor field in order to create the bins in such a way as to preserve the strength of the original association between the two fields. For example, suppose you have used cluster analysis to group states based on delinquency rates for home loans, with the highest rates in the first cluster. In this case, you might choose Percent past due and Percent of foreclosures as the Bin fields and the cluster membership field generated by the model as the supervisor field.
Figure 4-43 Options for optimal or supervised binning

Name extension. Specify an extension to use for the generated field(s) and whether to add it at the start (Prefix) or end (Suffix) of the field name. For example, you could generate a new field called pastdue_OPTIMAL and another called inforeclosure_OPTIMAL.
Supervisor field. A categorical field used to construct the bins.
Merge bins that have relatively small case counts with a larger neighbor. If enabled, indicates that bins smaller than the specified threshold should be merged with a larger adjacent neighbor.
Pre-bin fields to improve performance with large datasets. Indicates if preprocessing should be used to streamline optimal binning. The method groups scale values into a large number of bins using a simple unsupervised binning method, represents values within each bin by the mean, and adjusts the case weight accordingly before proceeding with supervised binning. In practical terms, this method trades a degree of precision for speed and is recommended for large datasets. You can also specify the maximum number of bins to create when this option is used.

Cut Point Settings


The Cut Point Settings dialog box enables you to specify advanced options for the optimal binning algorithm. These options tell the algorithm how to calculate the bins using the target field.
Figure 4-44 Cut point settings for optimal binning

Bin end points. You can specify whether the lower or upper end points should be inclusive (lower <= x) or exclusive (lower < x).
First and last bins. For both the first and last bin, you can specify whether the bins should be unbounded (extending toward positive or negative infinity) or bounded by the lowest or highest data points.

Previewing the Generated Bins


The Bin values tab in the Binning node allows you to view the thresholds for generated bins. Using the Generate menu, you can also generate a Derive node that can be used to apply these thresholds from one dataset to another.

Figure 4-45 Binning node dialog box, Bin values tab

Binned field. Use the drop-down list to select a field for viewing. Field names shown use the original field name for clarity.


Tile. Use the drop-down list to select a tile, such as 10 or 100, for viewing. This option is available only when bins have been generated using the tile method (equal count or sum).
Bin thresholds. Threshold values are shown here for each generated bin, along with the number of records that fall into each bin. For the optimal binning method only, the number of records in each bin is shown as a percentage of the whole. Note that thresholds are not applicable when the rank binning method is used.
Read Values. Reads binned values from the dataset. Note that thresholds will also be overwritten when new data are run through the stream.


Generating a Derive Node

You can use the Generate menu to create a Derive node based on the current thresholds. This is useful for applying established bin thresholds from one set of data to another. Furthermore, once these split points are known, a Derive operation is more efficient (meaning faster) than a Binning operation when working with large datasets.

Partition Node
Partition nodes are used to generate a partition field that splits the data into separate subsets or samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a separate sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data.


The Partition node generates a set field with the direction set to Partition. Alternatively, if an appropriate field already exists in your data, it can be designated as a partition using a Type node. In this case, no separate Partition node is required. Any instantiated set field with two or three values can be used as a partition, but flag fields cannot be used. For more information, see Setting Field Direction on p. 80. Multiple partition fields can be defined in a stream, but if so, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.)
Enabling partitioning. To use the partition in an analysis, partitioning must be enabled on the Model Options tab in the appropriate model-building or analysis node. Deselecting this option makes it possible to disable partitioning without removing the field.

To create a partition field based on some other criterion, such as a date range or location, you can also use a Derive node. For more information, see Derive Node on p. 87.

Partition Node Options


Figure 4-46 Partition node dialog box, Settings tab

Partition field. Specifies the name of the field created by the node.
Partitions. You can partition the data into two samples (train and test) or three (train, test, and validation).
Train and test. Partitions the data into two samples, allowing you to train the model with one sample and test with another.


Train, test, and validation. Partitions the data into three samples, allowing you to train the model with one sample, test and refine the model using a second sample, and validate your results with a third. This reduces the size of each partition accordingly, however, and may be most suitable when working with a very large dataset.


Partition size. Specifies the relative size of each partition. If the sum of the partition sizes is less than 100%, then the records not included in a partition will be discarded. For example, if a user has 10 million records and has specified partition sizes of 5% training and 10% testing, after running the node, there should be roughly 500,000 training and one million testing records, with the remainder having been discarded.
Values. Specifies the values used to represent each partition sample in the data.
Use system-defined values (1, 2, and 3). Uses an integer to represent each partition; for example, all records that fall into the training sample have a value of 1 for the partition field. This ensures the data will be portable between locales and that if the partition field is reinstantiated elsewhere (for example, reading the data back from a database), the sort order is preserved (so that 1 will still represent the training partition). However, the values do require some interpretation.
Append labels to system-defined values. Combines the integer with a label; for example, training partition records have a value of 1_Training. This makes it possible for someone looking at the data to identify which value is which, and it preserves sort order. However, values are specific to a given locale.
Use labels as values. Uses the label with no integer; for example, Training. This allows you to specify the values by editing the labels. However, it makes the data locale-specific, and reinstantiation of a partition column will put the values in their natural sort order, which may not correspond to their semantic order.
Set random seed. When sampling or partitioning records based on a random percentage, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value, or click the Generate button to automatically generate a random value. If this option is not selected, a different sample will be generated each time the node is executed.

Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node in Chapter 3 on p. 54.
Generating Select Nodes

Using the Generate menu in the Partition node, you can automatically generate a Select node for each partition. For example, you could select all records in the training partition to obtain further evaluation or analyses using only this partition.

Set to Flag Node


The Set to Flag node is used to derive flag fields based on the categorical values defined for one or more set fields. For example, the drug demo data contains a set field, BP (blood pressure), with the values High, Normal, and Low. For easier data manipulation, you might create a flag field for high blood pressure, which indicates whether or not the patient has high blood pressure.

Figure 4-47 Creating a flag field for high blood pressure using the drug demo data

Setting Options for the Set to Flag Node


Set fields. Lists all fields in the data whose types are set. Select one from the list to display the values in the set. You can choose from these values to create a flag field. Note that data must be fully instantiated using an upstream source or Type node before you can see the available set fields (and their values). For more information, see Type Node on p. 70.
Field name extension. Select to enable controls for specifying an extension that will be added as a suffix or prefix to the new flag field. By default, new field names are automatically created by combining the original field name with the field value into a label, such as Fieldname_fieldvalue.
Available set values. Values in the set selected above are displayed here. Select one or more values for which you want to generate flags. For example, if the values in a field called blood_pressure are High, Medium, and Low, you can select High and add it to the list on the right. This will create a field with a flag for records with a value indicating high blood pressure.
Create flag fields. The newly created flag fields are listed here. You can specify options for naming the new field using the field name extension controls.
True value. Specify the true value used by the node when setting a flag. By default, this value is T.
False value. Specify the false value used by the node when setting a flag. By default, this value is F.


Aggregate keys. Select to group records together based on key fields specified below. When Aggregate keys is selected, all flag fields in a group will be turned on if any record was set to true. Use the Field Chooser to specify which key fields will be used to aggregate records.

Restructure Node
The Restructure node can be used to generate multiple fields based on the values of a set or flag field. The newly generated fields can contain values from another field or numeric flags (0 and 1). The functionality of this node is similar to that of the Set to Flag node. However, it offers more flexibility. It allows you to create fields of any type (including numeric flags), using the values from another field. You can then perform aggregation or other manipulations with other nodes downstream. (The Set to Flag node lets you aggregate fields in one step, which may be convenient if you are creating flag fields.) For example, the following dataset contains a set field, Account, with the values Savings and Draft. The opening balance and current balance are recorded for each account, and some customers have multiple accounts of each type. Let's say you want to know whether each customer has a particular account type, and if so, how much money is in each account type. You use the Restructure node to generate a field for each of the Account values, and you select Current_Balance as the value. Each new field is populated with the current balance for the given record.
Table 4-2 Sample data before restructuring

CustID   Account   Open_Bal   Current_Bal
12701    Draft     1000       1005.32
12702    Savings   100        144.51
12703    Savings   300        321.20
12703    Savings   150        204.51
12703    Draft     1200       586.32

Table 4-3 Sample data after restructuring

CustID   Account   Open_Bal   Current_Bal   Account_Draft_Current_Bal   Account_Savings_Current_Bal
12701    Draft     1000       1005.32       1005.32                     $null$
12702    Savings   100        144.51        $null$                      144.51
12703    Savings   300        321.20        $null$                      321.20
12703    Savings   150        204.51        $null$                      204.51
12703    Draft     1200       586.32        586.32                      $null$

Figure 4-48 Generating restructured fields for Account

Using the Restructure Node with the Aggregate Node

In many cases, you may want to pair the Restructure node with an Aggregate node. In the previous example, one customer (with the ID 12703) has three accounts. You can use an Aggregate node to calculate the total balance for each account type. The key field is CustID, and the aggregate fields are the new restructured fields, Account_Draft_Current_Bal and Account_Savings_Current_Bal. The following table shows the results.
Table 4-4 Sample data after restructuring and aggregation

CustID   Record_Count   Account_Draft_Current_Bal_Sum   Account_Savings_Current_Bal_Sum
12701    1              1005.32                         $null$
12702    1              $null$                          144.51
12703    3              586.32                          525.71
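For example, the Account_Savings_Current_Bal_Sum of 525.71 for customer 12703 is simply the sum of that customer's two savings balances from Table 4-3, 321.20 + 204.51, while the Record_Count of 3 reflects the three accounts held by that customer.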

Setting Options for the Restructure Node


Available fields. Lists all fields in the data whose types are set or flag. Select one from the list to display the values in the set (or flag); then choose from these values to create the restructured fields. Note that data must be fully instantiated using an upstream source or Type node before you can see the available fields (and their values). For more information, see Type Node on p. 70.
Available values. Values in the set selected above are displayed here. Select one or more values for which you want to generate restructured fields. For example, if the values in a field called Blood Pressure are High, Medium, and Low, you can select High and add it to the list on the right. This will create a field with a specified value (see below) for records with a value of High.


Create restructured fields. The newly created restructured fields are listed here. By default, new field names are automatically created by combining the original field name with the field value into a label, such as Fieldname_fieldvalue.
Include field name. Deselect to remove the original field name as a prefix from the new field names.
Use values from other fields. Specify one or more fields whose values will be used to populate the restructured fields. Use the Field Chooser to select one or more fields. For each field chosen, one new field is created. The value field name is appended to the restructured field name, for example, BP_High_Age or BP_Low_Age. Each new field inherits the type of the original value field.
Create numeric value flags. Select to populate the new fields with numeric value flags (0 for false and 1 for true), rather than using a value from another field.

Transpose Node
In Clementine, columns are fields and rows are records or observations. If necessary, a Transpose node can be used to swap the data in rows and columns so that fields become records and records become fields. For example, if you have time series data where each series is a row rather than a column, you can transpose the data prior to analysis.
Figure 4-49 Transpose node, Settings tab


Setting Options for the Transpose Node


New Field Names

New field names can be generated automatically based on a specified prefix or read from an existing field in the data.
Use prefix. This option generates new field names automatically based on the specified prefix (Field1, Field2, and so on). You can customize the prefix as needed. With this option, you must specify the number of fields to be created, regardless of the number of rows in the original data. For example, if Number of new fields is set to 100, all data beyond the first 100 rows will be discarded. If there are fewer than 100 rows in the original data, some fields will be null. (You can increase the number of fields as needed, but the purpose of this setting is to avoid transposing a million records into a million fields, which would produce an unmanageable result.) For example, suppose you have data with series in rows and a separate field (column) for each month. You can transpose this so that each series is in a separate field, with a row for each month.
Figure 4-50 Original data with series in rows

Figure 4-51 Transposed data with series in columns

Note: To produce the results shown, the Number of New Fields option was changed from 100 to 2, and the row ID name was changed from ID to Month (see below).
Read from field. Reads field names from an existing field. With this option, the number of new fields is determined by the data, up to the specified maximum. Each value of the selected field becomes a new field in the output data. The selected field can have any storage type (integer, string, date, and so on), but in order to avoid duplicate field names, each value of the selected field must be unique (in other words, the number of values should match the number of rows). If duplicate field names are encountered, a warning is displayed.

Figure 4-52 Reading field names from an existing field

Read Values. If the selected field has not been instantiated, select this option to populate the list of new field names. If the field has already been instantiated, then this step is not necessary.
Maximum number of values to read. When reading field names from the data, an upper limit is specified in order to avoid creating an inordinately large number of fields. (As noted above, transposing one million records into one million fields would produce an unmanageable result.)
For example, if the first column in your data specifies the name for each series, you can use these values as field names in the transposed data.
Figure 4-53 Original data with series in rows

Figure 4-54 Transposed data with series in columns

Transpose. By default, only numeric range fields are transposed (either integer or real storage). Optionally, you can choose a subset of numeric fields or transpose string fields instead. However, all transposed fields must be of the same storage type (either numeric or string but not both), since mixing the input fields would generate mixed values within each output column, which violates the rule that all values of a field must have the same storage. Other storage types (date, time, timestamp) cannot be transposed.
All numeric. Transposes all numeric fields (integer or real storage). The number of rows in the output matches the number of numeric fields in the original data.
All string. Transposes all string fields.
Custom. Allows you to select a subset of numeric fields. The number of rows in the output matches the number of fields selected. Note: This option is available only for numeric fields.
Row ID name. Specifies the name of the row ID field created by the node. The values of this field are determined by the names of the fields in the original data.
Tip: When transposing time series data from rows to columns, if your original data includes a row, such as date, month, or year, that labels the period for each measurement, be sure to read these labels into Clementine as field names (as demonstrated in the above examples, which show the month or date as field names in the original data, respectively) rather than including the label in the first row of data. This will avoid mixing labels and values in each column (which would force numbers to be read as strings, since storage types cannot be mixed within a column).

Time Intervals Node


The Time Intervals node allows you to specify intervals and generate labels for time series data to be used in a Time Series modeling or a Time Plot node for estimating or forecasting. A full range of time intervals is supported, from seconds to years. For example, if you have a series with daily measurements beginning January 3, 2005, you can label records starting on that date, with the second row being January 4, and so on. You can also specify the periodicity, for example, five days per week or eight hours per day. In addition, you can specify the range of records to be used for estimating. You can choose whether to exclude the earliest records in the series and whether to specify holdouts. Doing so enables you to test the model by holding out the most recent records in the time series data in order to compare their known values with the estimated values for those periods.

129 Field Operations Nodes

You can also specify how many time periods into the future you want to forecast, and you can specify future values for use in forecasting by downstream Time Series modeling nodes. The Time Intervals node generates a TimeLabel eld in a format appropriate to the specied interval and period along with a TimeIndex eld that assigns a unique integer to each record. A number of additional elds may also be generated, depending on the selected interval or period (such as the minute or second within which a measurement falls). You can pad or aggregate values as needed to ensure that measurements are equally spaced. Methods for modeling time series data require a uniform interval between each measurement, with any missing values indicated by empty rows. If your data do not already meet this requirement, the node can transform them to do so.
Comments

Periodic intervals may not match real time. For example, a series based on a standard ve-day work week would treat the gap between Friday and Monday as a single day. The Time Intervals node assumes that each series is in a eld or column, with a row for each measurement. If necessary you can transpose your data to meet this requirement. For more information, see Transpose Node on p. 125. For series that are not equally spaced, you can specify a eld that identies the date or time for each measurement. Note that this requires a date, time, or timestamp eld in the appropriate format to use as input. If necessary, you can convert an existing eld (such as a string label eld) to this format using a Filler node. For more information, see Storage Conversion Using the Filler Node on p. 100. When viewing details for the generated label and index elds, turning on the display of value labels is often helpful. For example, when viewing a table with values generated for monthly data, you can click the value labels icon on the toolbar to see January, February, March, and so on, rather than 1, 2, 3, and so on.
Figure 4-55 Value labels icon

Specifying Time Intervals


The Intervals tab allows you to specify the interval and periodicity for building or labeling the series. The specific settings depend on the selected interval. For example, if you choose Hours per day, you can specify the number of days per week, the day each week begins, the number of hours in each day, and the hour each day begins. For more information, see Supported Intervals on p. 136.

Figure 4-56 Time-interval settings for an hourly series

Labeling or Building the Series

You can label records consecutively or build the series based on a specified date, timestamp, or time field.

Start labeling from the first record. Specify the starting date and/or time to label consecutive records. If labeling hours per day, for example, you would specify the date and hour when the series begins, with a single record for each hour thereafter. Aside from adding labels, this method does not change the original data. Instead, it assumes that records are already equally spaced, with a uniform interval between each measurement. Any missing measurements must be indicated by empty rows in the data.

Build from data. For series that are not equally spaced, you can specify a field that identifies the date or time for each measurement. Note that this requires a date, time, or timestamp field in the appropriate format to use as input. For example, if you have a string field with values like Jan 2000, Feb 2000, and so on, you can convert this to a date field using a Filler node. For more information, see Storage Conversion Using the Filler Node on p. 100. The Build from data option also transforms the data to match the specified interval by padding or aggregating records as needed, for example, by rolling up weeks into months, or by replacing missing records with blanks or extrapolated values. You can specify the functions used to pad or aggregate records on the Build tab. For more information, see Time Interval Build Options on p. 131.
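The following is a minimal Python sketch, purely illustrative and not Clementine's implementation, of what rolling up a finer interval to a coarser one looks like: irregular measurements are grouped by month and aggregated with the mean, while a month with no measurements is padded with a blank (null) value. The third element of each output tuple mirrors the Count field described later in this section: the number of input records behind each output record, with 0 for a padded record.

from collections import defaultdict

# (month, value) pairs for an irregular series; month 3 has no measurements.
weekly = [(1, 10.0), (1, 12.0), (2, 9.0), (2, 11.0), (2, 10.0), (4, 8.0)]

groups = defaultdict(list)
for month, value in weekly:
    groups[month].append(value)

monthly = []
for month in range(1, 5):
    values = groups.get(month)
    if values:
        # aggregate: the default function for a range field is the mean
        monthly.append((month, sum(values) / len(values), len(values)))
    else:
        # pad: insert a blank record so the monthly interval stays uniform
        monthly.append((month, None, 0))

print(monthly)
# [(1, 11.0, 2), (2, 10.0, 3), (3, None, 0), (4, 8.0, 1)]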
New field name extension. Allows you to specify a prefix or suffix that is applied to all fields generated by the node. For example, using the default $TI_ prefix, the fields created by the node would be named $TI_TimeIndex, $TI_TimeLabel, and so on.


Date Format. Specifies the format for the TimeLabel field created by the node, as applicable to the current interval. Availability of this option depends on the current selection.

Time Format. Specifies the format for the TimeLabel field created by the node, as applicable to the current interval. Availability of this option depends on the current selection.

Time Interval Build Options


The Build tab in the Time Intervals node allows you to specify options for aggregating or padding fields to match the specified interval. These settings apply only when the Build from data option is selected on the Intervals tab. For example, if you have a mix of weekly and monthly data, you could aggregate or roll up the weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to weekly and pad the series by inserting blank values for any weeks that are missing, or by extrapolating missing values using a specified padding function. When you pad or aggregate data, any existing date or timestamp fields are effectively superseded by the generated TimeLabel and TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that measure time as a duration are preserved, such as a field that measures the length of a service call rather than the time the call started, as long as they are stored internally as time fields rather than timestamp. For more information, see Setting Field Storage and Formatting in Chapter 2 on p. 20. Other fields are aggregated based on the options specified on the Build tab.
Figure 4-57 Time Intervals node, Build tab

Use default fields and functions. Specifies that all fields should be aggregated or padded as needed, with the exception of date, timestamp, and typeless fields as noted above. The default function is applied based on the field type; for example, range fields are aggregated using the mean, while set fields use the mode. You can change the default for one or more field types in the lower part of the dialog box.

Specify fields and functions. Allows you to specify the fields to pad or aggregate, and the function used for each. Any fields not selected are dropped from the output. Use the icons on the right side to add or remove fields from the table, or click the cell in the appropriate column to change the aggregation or padding function used for that field to override the default. Typeless fields are excluded from the list and cannot be added to the table.

Default functions. Specifies the aggregation and padding functions used by default for different types of fields. These defaults are applied when Use defaults is selected and are also applied as the initial default for any new fields added to the table. (Changing the defaults does not change any of the existing settings in the table but does apply to any fields added subsequently.)
Aggregation functions. The following aggregation functions are available:

Range fields. Available functions for range fields include Mean, Sum, Mode, Min, and Max.

Set fields. Options include Mode, First, and Last. First means the first non-null value (sorted by date) in the aggregation group; Last means the last non-null value in the group.

Flag fields. Options include True if any true, Mode, First, and Last.

Padding functions. The following padding functions are available:

Range fields. Options include Blank and Mean of most recent points, which means the mean of the three most recent non-null values prior to the time period that will be created. If there are not three values, the new value is blank. Recent values include only actual values; a previously created padded value is not considered in the search for a non-null value.

Set fields. Options include Blank and Most recent value. Most recent refers to the most recent non-null value prior to the time period that will be created. Again, only actual values are considered in the search for a recent value.

Flag fields. Options include Blank, True, and False.

Maximum number of records in resulting dataset. Specifies an upper limit to the number of records created, which can otherwise become quite large, particularly when the time interval is set to seconds (whether deliberately or otherwise). For example, a series with only two values (Jan. 1, 2000 and Jan. 1, 2001) would generate 31,536,000 records if padded out to seconds (60 seconds x 60 minutes x 24 hours x 365 days). The system will stop processing and display a warning if the specified maximum is exceeded.
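As an illustration of the Mean of most recent points padding rule described above, here is a small Python sketch. It is a restatement of the rule, not Clementine code: it averages the three most recent actual, non-null values before the period being padded, falls back to a blank when fewer than three exist, and ignores values that were themselves padded.

def mean_of_most_recent_points(history):
    """history: (value, was_padded) pairs for earlier periods, oldest first.
    Returns the value to use when padding the next period."""
    actual = [v for v, was_padded in history
              if v is not None and not was_padded]    # previously padded values never count
    if len(actual) < 3:
        return None                                    # blank if fewer than three actual values
    return sum(actual[-3:]) / 3.0                      # mean of the three most recent

print(mean_of_most_recent_points([(4.0, False), (6.0, False), (None, True), (8.0, False)]))
# 6.0 -> mean of 4.0, 6.0, 8.0; the padded gap is ignored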
Count Field

When aggregating or padding values, a new Count field is created that indicates the number of records involved in determining the new record. If four weekly values were aggregated into a single month, for example, the count would be 4. For a padded record, the count is 0. The name of the field is Count, plus the prefix or suffix specified on the Intervals tab.


Estimation Period
The Estimation tab of the Time Intervals node allows you to specify the range of records used in model estimation, as well as any holdouts. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.
Figure 4-58 Time Intervals node, Estimation tab

Begin Estimation. You can begin the estimation period at the beginning of the data or exclude older values that may be of limited use in forecasting. Depending on the data, you may find that shortening the estimation period can speed up performance (and reduce the amount of time spent on data preparation) with no significant loss in forecasting accuracy.

End Estimation. You can estimate the model using all records up to the end of the data or hold out the most recent records in order to evaluate the model. For example, if you hold out the last three records and then specify 3 for the number of records to forecast, you are effectively forecasting values that are already known, allowing you to compare observed and predicted values to gauge the model's effectiveness.

Forecasts
The Forecast tab of the Time Intervals node allows you to specify the number of records you want to forecast and to specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.

Figure 4-59 Time Intervals node, Forecast tab

Extend records into the future. Specifies the number of records to extend beyond the estimation period. Note that these records may or may not be forecasts, depending on the number of holdouts specified on the Estimation tab. For example, if you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and only 1 future value. (The six holdouts can be compared to the observed values to gauge the accuracy of the model.)

Future indicator field. Label of the generated field that indicates whether a record contains forecast data. The default value for the label is $TI_Future.

Future Values to Use in Forecasting. For each record that you want to forecast (excluding holdouts), if you are using predictor fields (Direction = In), you must specify estimated values for the forecast period for each predictor. You can either specify values manually or choose from a list.

Field. Click the field selector button and choose any fields that may be used as predictors. Note that fields selected here may or may not be used in modeling; to actually use a field as a predictor, it must be selected in a downstream modeling node. This dialog box simply gives you a convenient place to specify future values so they can be shared by multiple downstream modeling nodes without specifying them separately in each node. Also note that the list of available fields may be constrained by selections on the Build tab. For example, if Specify fields and functions is selected on the Build tab, any fields not aggregated or padded are dropped from the stream and cannot be used in modeling.


Note: If future values are specified for a field that is no longer available in the stream (because it has been dropped or because of updated selections made on the Build tab), the field is shown in red on the Forecast tab.
Values. For each field, you can choose from a list of functions, or click Specify to either enter values manually or choose from a list of predefined values. If the predictor fields relate to items that are under your control, or which are otherwise knowable in advance, you should enter values manually. For example, if you are forecasting next month's revenues for a hotel based on the number of room reservations, you could specify the number of reservations you actually have for that period. Conversely, if a predictor field relates to something outside your control, such as a stock price, you could use a function such as the most recent value or the mean of recent points. The available functions depend on the type of field.

Field type            Functions
Range or Set field    Blank, Mean of recent points, Most recent value, Specify
Flag field            Blank, Most recent value, True, False, Specify

Mean of recent points - calculates the future value from the mean of the last three data points.
Most recent value - sets the future value to that of the most recent data point.
True/False - sets the future value of a flag field to True or False as specified.
Specify - opens a dialog box for specifying future values manually, or choosing them from a predefined list.

Figure 4-60 Specifying future values for predictors

Future Values
Here you can specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually.


You can enter values manually, or click the selector button on the right side of the dialog box to choose from a list of values defined for the current field. For more information, see Viewing or Selecting Values in Chapter 7 in the Clementine 11.1 User's Guide. The number of future values that you can specify corresponds to the number of records by which you are extending the time series into the future.

Supported Intervals
The Time Intervals node supports a full range of intervals from seconds to years, as well as cyclic (for example, seasonal) and non-cyclic periods. You specify the interval in the Time Interval field on the Intervals tab.

Periods
Select Periods to label an existing, non-cyclic series that doesn't match any of the other specified intervals. The series must already be in the correct order, with a uniform interval between each measurement. The Build from data option is not available when this interval is selected.
Figure 4-61 Time-interval settings for non-cyclic periods

Sample Output

Records are labeled incrementally based on the specified starting value (Period 1, Period 2, and so on). New fields are created as follows:

$TI_TimeIndex (Integer)   $TI_TimeLabel (String)   $TI_Period (Integer)
1                         Period 1                 1
2                         Period 2                 2
3                         Period 3                 3
4                         Period 4                 4
5                         Period 5                 5

Cyclic Periods
Select Cyclic Periods to label an existing series with a repeating cycle that doesn't fit one of the standard intervals. For example, you could use this option if you have only 10 months in your fiscal year. The series must already be in the correct order, with a uniform interval between each measurement. (The Build from data option is not available when this interval is selected.)

Figure 4-62 Time-interval settings for cyclic periods

Sample Output

Records are labeled incrementally based on the specified starting cycle and period (Cycle 1, Period 1, Cycle 1, Period 2, and so on). For example, with the number of periods per cycle set to 3, new fields are created as follows:

$TI_TimeIndex (Integer)   $TI_TimeLabel (String)   $TI_Cycle (Integer)   $TI_Period (Integer)
1                         Cycle 1, Period 1        1                     1
2                         Cycle 1, Period 2        1                     2
3                         Cycle 1, Period 3        1                     3
4                         Cycle 2, Period 1        2                     1
5                         Cycle 2, Period 2        2                     2

Years
For years, you can specify the starting year to label consecutive records or select Build from data to specify a timestamp or date field that identifies the year for each record.
Figure 4-63 Time-interval settings for a yearly series

Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Year (Integer)
1                         2000                     2000
2                         2001                     2001
3                         2002                     2002
4                         2003                     2003
5                         2004                     2004

Quarters
For a quarterly series, you can specify the month when the fiscal year begins. You can also specify the starting quarter and year (for example, Q1 2000) to label consecutive records or select Build from data to choose a timestamp or date field that identifies the quarter and year for each record.
Figure 4-64 Time-interval settings for quarterly series

Sample Output

For a fiscal year starting in January, new fields would be created and populated as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Year (Integer)   $TI-Quarter (Integer with labels)
1                         Q1 2000                  2000                 1 (Q1)
2                         Q2 2000                  2000                 2 (Q2)
3                         Q3 2000                  2000                 3 (Q3)
4                         Q4 2000                  2000                 4 (Q4)
5                         Q1 2001                  2001                 1 (Q1)

If the year starts in a month other than January, new fields are as below (assuming a fiscal year starting in July). To view the labels that identify the months for each quarter, turn on the display of value labels by clicking the toolbar icon.
Figure 4-65 Value labels icon

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Year (Integer)   $TI-Quarter (Integer with labels)
1                         Q1 2000/2001             1                    1 (Q1 Jul-Sep)
2                         Q2 2000/2001             1                    2 (Q2 Oct-Dec)
3                         Q3 2000/2001             1                    3 (Q3 Jan-Mar)
4                         Q4 2000/2001             1                    4 (Q4 Apr-Jun)
5                         Q1 2001/2002             2                    1 (Q1 Jul-Sep)

Months
You can select the starting year and month to label consecutive records or select Build from data to choose a timestamp or date field that indicates the month for each record.
Figure 4-66 Time-interval settings for a monthly series

Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Date)   $TI-Year (Integer)   $TI-Months (Integer with labels)
1                         Jan 2000               2000                 1 (January)
2                         Feb 2000               2000                 2 (February)
3                         Mar 2000               2000                 3 (March)
4                         Apr 2000               2000                 4 (April)
5                         May 2000               2000                 5 (May)

Weeks (Non-Periodic)
For a weekly series, you can select the day of the week on which the cycle begins. Note that weeks can be only non-periodic because different months, quarters, and even years do not necessarily have the same number of weeks. However, time-stamped data can be easily aggregated or padded to a weekly level for non-periodic models.

Figure 4-67 Time-interval settings for a weekly series

Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Date)   $TI-Week (Integer)
1                         1999-12-27             1
2                         2000-01-03             2
3                         2000-01-10             3
4                         2000-01-17             4
5                         2000-01-24             5

The $TI-TimeLabel field for a week shows the first day of that week. In the preceding table, the user starts labeling from January 1, 2000. However, the week starts on Monday, and January 1, 2000, is a Saturday. Thus, the week that includes January 1 starts on December 27, 1999, and that date is the label of the first point. The Date format determines the strings produced for the $TI-TimeLabel field.
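The date arithmetic behind that label is simple. The following Python sketch, offered purely as an illustration and not as the node's implementation, finds the first day of the week (Monday here) on or before a given start date, reproducing the December 27, 1999 label for a series that starts on Saturday, January 1, 2000.

from datetime import date, timedelta

def week_label(start, week_begins_on=0):           # 0 = Monday in Python's weekday() numbering
    """Return the first day of the week that contains `start`."""
    offset = (start.weekday() - week_begins_on) % 7
    return start - timedelta(days=offset)

print(week_label(date(2000, 1, 1)))                # 1999-12-27, the Monday of that week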

Days per Week


For daily measurements that fall into a weekly cycle, you can specify the number of days per week and the day each week begins. You can specify a starting date to label consecutive records or select Build from data to choose a timestamp or date field that indicates the date for each record.
Figure 4-68 Time-interval settings for a daily series


Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Date)   $TI-Week (Integer)   $TI-Day (Integer with labels)
1                         Jan 5 2005             1                    3 (Wednesday)
2                         Jan 6 2005             1                    4 (Thursday)
3                         Jan 7 2005             1                    5 (Friday)
4                         Jan 10 2005            2                    1 (Monday)
5                         Jan 11 2005            2                    2 (Tuesday)

Note: The week always starts at 1 for the first time period and does not cycle based on the calendar. Thus, week 52 is followed by week 53, 54, and so on. The week does not reflect the week of the year, just the number of weekly increments in the series.

Days (Non-Periodic)
Choose non-periodic days if you have daily measurements that do not fit into a regular weekly cycle. You can specify a starting date to label consecutive records or select Build from data to choose a timestamp or date field that indicates the date for each record.
Figure 4-69 Time-interval settings for a daily series (non-periodic)

Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Date)
1                         Jan 5 2005
2                         Jan 6 2005
3                         Jan 7 2005
4                         Jan 8 2005
5                         Jan 9 2005


Hours per Day


For hourly measurements that fit into a daily cycle, you can specify the number of days per week, the number of hours in the day (such as an eight-hour workday), the day when the week begins, and the hour when each day begins. Hours can be specified down to the minute based on a 24-hour clock (for example, 14:05 = 2:05 p.m.).
Figure 4-70 Time-interval settings for an hourly series

You can specify the starting date and time to label consecutive records or select Build from data to choose a timestamp field that identifies the date and time for each record.
Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Timestamp)   $TI-Day (Integer with labels)   $TI-Hour (Integer with labels)
1                         Jan 5 2005 8:00             3 (Wednesday)                   8 (8:00)
2                         Jan 5 2005 9:00             3 (Wednesday)                   9 (9:00)
3                         Jan 5 2005 10:00            3 (Wednesday)                   10 (10:00)
4                         Jan 5 2005 11:00            3 (Wednesday)                   11 (11:00)
5                         Jan 5 2005 12:00            3 (Wednesday)                   12 (12:00)

Hours (Non-Periodic)
Choose this option if you have hourly measurements that do not fit into a regular daily cycle. You can specify the starting time to label consecutive records or select Build from data to choose a timestamp or time field that indicates the time for each record.

Figure 4-71 Time-interval settings for an hourly series (non-periodic)

Hours are based on a 24-hour clock (13:00 = 1:00 p.m.), and do not wrap (hour 25 follows hour 24).
Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Hour (Integer with labels)
1                         8:00                     8 (8:00)
2                         9:00                     9 (9:00)
3                         10:00                    10 (10:00)
4                         11:00                    11 (11:00)
5                         12:00                    12 (12:00)

Minutes per Day


For measurements by the minute that fall into a daily cycle, you can specify the number of days per week, the day the week begins, the number of hours in the day, and the time the day begins. Hours are specified based on a 24-hour clock and can be specified down to the minute and second using colons (for example, 2:05:17 p.m. = 14:05:17). You can also specify the number of minutes to increment (every minute, every two minutes, and so on, where the increment must be a value that divides evenly into 60).
Figure 4-72 Time-interval settings for minutes per day

You can specify the starting date and time to label consecutive records or select Build from data to choose a timestamp field that identifies the date and time for each record.


Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Timestamp)   $TI-Minute
1                         2005-01-05 08:00:00         0
2                         2005-01-05 08:01:00         1
3                         2005-01-05 08:02:00         2
4                         2005-01-05 08:03:00         3
5                         2005-01-05 08:04:00         4

Minutes (Non-Periodic)
Choose this option if you have measurements by the minute that do not fit into a regular daily cycle. You can specify the number of minutes to increment (every minute, every two minutes, and so on, where the specified value must be a number that divides evenly into 60).
Figure 4-73 Time-interval settings for minutes (non-periodic)

You can specify the starting time to label consecutive records or select Build from data to choose a timestamp or time field that identifies the time for each record.
Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Minute
1                         8:00                     0
2                         8:01                     1
3                         8:02                     2
4                         8:03                     3
5                         8:04                     4

The TimeLabel string is created by using a colon between the hour and minute. The hour does not wrap; hour 25 follows hour 24. Minutes increment by the value specified in the dialog box. For example, if the increment is 2, the TimeLabel will be 8:00, 8:02, and so on, and the minutes will be 0, 2, and so on.


Seconds per Day


For second intervals that fall into a daily cycle, you can specify the number of days per week, the day the week begins, the number of hours in the day, and the time the day begins. Hours are specified based on a 24-hour clock and can be specified down to the minute and second using colons (for example, 2:05:17 p.m. = 14:05:17). You can also specify the number of seconds to increment (every second, every two seconds, and so on, where the specified value must be a number that divides evenly into 60).
Figure 4-74 Time-interval settings for seconds per day

You can specify the date and time to start labeling consecutive records or select Build from data to choose a timestamp field that specifies the date and time for each record.
Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (Timestamp)   $TI-Minute   $TI-Second
1                         2005-01-05 08:00:00         0            0
2                         2005-01-05 08:00:01         0            1
3                         2005-01-05 08:00:02         0            2
4                         2005-01-05 08:00:03         0            3
5                         2005-01-05 08:00:04         0            4

Seconds (Non-Periodic)
Choose this option if you have measurements taken by the second that do not fit into a regular daily cycle. You can specify the number of seconds to increment (every second, every two seconds, and so on, where the specified value must be a number that divides evenly into 60).

Figure 4-75 Time-interval settings for seconds (non-periodic)

Specify the time to start labeling consecutive records or select Build from data to choose a timestamp or time field that identifies the time for each record.
Sample Output

New fields are created as follows:

$TI-TimeIndex (Integer)   $TI-TimeLabel (String)   $TI-Minute   $TI-Second
1                         8:00:00                  0            0
2                         8:00:01                  0            1
3                         8:00:02                  0            2
4                         8:00:03                  0            3
5                         8:00:04                  0            4

The TimeLabel string is created by using a colon between the hour and minute and between minute and second. The hour does not wrap; hour 25 follows hour 24. Seconds increment by whatever number is specified as the increment. If the increment is 2, the TimeLabel will be 8:00:00, 8:00:02, and so on, and the seconds will be 0, 2, and so on.

History Node
History nodes are most often used for sequential data, such as time series data. They are used to create new fields containing data from fields in previous records. When using a History node, you may want to have data that is presorted by a particular field. You can use a Sort node to do this.


Setting Options for the History Node


Figure 4-76 History node dialog box

Selected fields. Using the Field Chooser (the button to the right of the text box), select the fields for which you want a history. Each selected field is used to create new fields for all records in the dataset.

Offset. Specify the latest record prior to the current record from which you want to extract historical field values. For example, if Offset is set to 3, as each record passes through this node, the field values for the third previous record will be included in the current record. Use the Span settings to specify how far back records will be extracted from. Use the arrows to adjust the offset value.

Span. Specify the number of prior records from which you want to extract values. For example, if Offset is set to 3 and Span is set to 5, each record that passes through the node will have five fields added to it for each field specified in the Selected Fields list. This means that when the node is processing record 10, fields will be added from record 7 through record 3. Use the arrows to adjust the span value.
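The offset/span arithmetic can be sketched in a few lines of Python. This is an illustration of the rule just described, not Clementine's implementation, and the generated field names of the form value_N-k are hypothetical, chosen only to make the output readable.

def add_history(records, field, offset=3, span=5):
    """For each record, add `span` history fields pulled from the records
    offset, offset + 1, ..., offset + span - 1 positions earlier."""
    out = []
    for i, rec in enumerate(records):
        new_rec = dict(rec)
        for k in range(span):
            j = i - offset - k                     # record 10 (index 9) pulls from records 7..3
            name = f"{field}_N-{offset + k}"       # hypothetical naming, for illustration only
            new_rec[name] = records[j][field] if j >= 0 else None  # None stands in for $null$
        out.append(new_rec)
    return out

data = [{"sales": v} for v in [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
print(add_history(data, "sales")[9])
# {'sales': 19, 'sales_N-3': 16, 'sales_N-4': 15, 'sales_N-5': 14, 'sales_N-6': 13, 'sales_N-7': 12}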
Where history is unavailable. Select one of the following options for handling records that have no history values. This usually refers to the first several records at the top of the dataset, for which there are no previous records to use as a history.

Discard records. Select to discard records where no history value is available for the field selected.

Leave history undefined. Select to keep records where no history value is available. The history field will be filled with an undefined value, displayed as $null$.

Fill values with. Specify a value or string to be used for records where no history value is available. The default replacement value is undef, the system null. Null values are displayed in Clementine using the string $null$.


When selecting a replacement value, keep in mind the following rules in order for proper execution to occur:

Selected fields should be of the same storage type.

If all of the selected fields have numeric storage, the replacement value must be parsed as an integer.

If all of the selected fields have real storage, the replacement value must be parsed as a real.

If all of the selected fields have symbolic storage, the replacement value must be parsed as a string.

If all of the selected fields have date/time storage, the replacement value must be parsed as a date/time field.

If any of the above conditions are not met, you will receive an error when executing the History node.

Field Reorder Node


The Field Reorder node enables you to define the natural order used to display fields downstream. This order affects the display of fields in a variety of places, such as tables, lists, and the Field Chooser. This operation is useful, for example, when working with wide datasets to make fields of interest more visible.

Setting Field Reorder Options


There are two ways to reorder fields: custom ordering and automatic sorting.

Custom Ordering

Select Custom Order to enable a table of field names and types where you can view all fields and use arrow buttons to create a custom order.

Figure 4-77 Reordering to display fields of interest first

To reorder fields:

E Select a field in the table. Use the Ctrl-click method to select multiple fields.

E Use the simple arrow buttons to move the field(s) up or down one row.

E Use the line-arrow buttons to move the field(s) to the bottom or top of the list.

E Specify the order of fields not included here by moving the divider row, indicated as [other fields], up or down.

Other fields. The purpose of the [other fields] divider row is to break the table into two halves. Fields appearing above the divider row will be ordered (as they appear in the table) at the top of all natural orders used to display the fields downstream of this node. Fields appearing below the divider row will be ordered (as they appear in the table) at the bottom of all natural orders used to display the fields downstream of this node.

Figure 4-78 Diagram illustrating how other fields are incorporated into the new field order

All other fields not appearing in the field reorder table will appear between these top and bottom fields, as indicated by the placement of the divider row.

Additional custom sorting options include:

Sort fields in ascending or descending order by clicking the arrows above each column header (Type, Name, and Storage). When sorting by column, fields not specified here (indicated by the [other fields] row) are sorted last in their natural order.

Click Clear Unused to delete all unused fields from the Field Reorder node. Unused fields are displayed in the table with a red font. This indicates that the field has been deleted in upstream operations.

Specify ordering for any new fields (displayed with a lightning icon to indicate a new or unspecified field). When you click OK or Apply, the icon disappears.

Note: If fields are added upstream after a custom order has been applied, the new fields will be appended at the bottom of the custom list.
Automatic Sorting

Select Automatic Sort to specify a parameter for sorting. The dialog box options dynamically change to provide options for automatic sorting.
Figure 4-79 Reordering all fields using automatic sorting options

Sort By. Select one of three ways to sort fields read into the Reorder node. The arrow buttons indicate whether the order will be ascending or descending. Select one to make a change.

Name
Type
Storage

Fields added upstream of the Field Reorder node after auto-sort has been applied will automatically be placed in their proper position based on the sort type selected.


SPSS Transform Node


The SPSS Transform node allows you to complete data transformations using SPSS command syntax. This makes it possible to complete a number of transformations not supported by Clementine and allows automation of complex, multistep transformations, including the creation of a number of fields from a single node. It resembles the SPSS Output node, except that the data are returned to Clementine for further analysis, whereas in the Output node the data are returned as the requested output objects, such as graphs or tables.

Note: You must have SPSS installed and licensed on your computer to use this node. For more information, see SPSS Helper Applications in Chapter 17 on p. 575.

For details on specific SPSS procedures, see the SPSS Command Syntax Reference, which is available under the \documentation folder on the product CD-ROM and also available from the Windows Start menu by choosing Start > [All] Programs > SPSS Clementine 11.1 > Documentation. Note that a newer version of this document may have been included with your copy of SPSS software. You can also click the SPSS Syntax Help button available from the node dialog box in Clementine. This will provide syntax help for the command that you are currently typing.

If necessary, you can use the Filter tab to filter or rename fields so they conform to SPSS naming standards. For more information, see Renaming or Filtering Fields for SPSS in Chapter 18 on p. 590.

Note: Not all SPSS syntax is supported by this node. For more information, see Allowable Syntax on p. 152.


Setting Syntax Options


Figure 4-80 SPSS Transform node dialog box

Check. After you have entered your syntax commands in the upper part of the dialog box, use this button to validate your entries. Any incorrect syntax is identified in the bottom part of the dialog box. To ensure that the checking process does not take too long, when you validate the syntax, Clementine checks against a representative sample of your data rather than against the entire dataset.

The Check syntax before saving option is selected by default. If you need to close or save the stream while it is still only partially complete, deselect (uncheck) this box to prevent a syntax check being run automatically.

Allowable Syntax
If you have a lot of legacy syntax from SPSS or are familiar with the data preparation features of SPSS, you can use the SPSS Transform node to run many of your existing transformations. As a guideline, the node enables you to transform data in predictable ways, for example, by running looped commands or by changing, adding, sorting, filtering, or selecting data.

153 Field Operations Nodes

Examples of the commands that can be carried out are:

Compute random numbers according to a binomial distribution:
COMPUTE newvar = RV.BINOM(10000,0.1)

Recode a variable into a new variable:


RECODE Age (Lowest thru 30=1) (30 thru 50=2) (50 thru Highest=3) INTO AgeRecoded

Replace missing values:


RMV Age_1=SMEAN(Age)

The SPSS syntax that is supported by the SPSS Transform node is listed in the following table:

Command Name
ADD VALUE LABELS
APPLY DICTIONARY
AUTORECODE
BREAK
CD
CLEAR MODEL PROGRAMS
CLEAR TIME PROGRAM
CLEAR TRANSFORMATIONS
COMPUTE
COUNT
CREATE
DATE
DEFINE-!ENDDEFINE
DELETE VARIABLES
DO IF
DO REPEAT
ELSE
ELSE IF
END CASE
END FILE
END IF
END INPUT PROGRAM
END LOOP
END REPEAT
EXECUTE
FILE HANDLE
FILE LABEL
FILE TYPE-END FILE TYPE
FILTER
FORMATS
IF
INCLUDE
INPUT PROGRAM-END INPUT PROGRAM
INSERT
LEAVE
LOOP-END LOOP
MATRIX-END MATRIX
MISSING VALUES
N OF CASES
NUMERIC
PERMISSIONS
PRESERVE
RANK
RECODE
RENAME VARIABLES
RESTORE
RMV
SAMPLE
SELECT IF
SET
SORT CASES
STRING
SUBTITLE
TEMPORARY
TITLE
UPDATE
V2C
VALIDATEDATA
VALUE LABELS
VARIABLE ATTRIBUTE
VARSTOCASES
VECTOR

Chapter 5

Graph Nodes

Graph Nodes Overview


Several phases of the data mining process use graphs and charts to explore data brought into Clementine. For example, you can connect a Plot or Distribution node to a data source to gain insight into data types and distributions. You can then perform record and field manipulations to prepare the data for downstream modeling operations. Another common use of graphs is to check the distribution and relationships between newly derived fields. The Graphs palette contains the following nodes:

The Plot node shows the relationship between numeric fields. You can create a plot by using points (a scatterplot) or lines. For more information, see Plot Node on p. 176.

The Multiplot node creates a plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines; each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you want to explore the fluctuation of several variables over time. For more information, see Multiplot Node on p. 188.

The Distribution node shows the occurrence of symbolic (categorical) values, such as mortgage type or gender. Typically, you might use the Distribution node to show imbalances in the data, which you could then rectify using a Balance node before creating a model. For more information, see Distribution Node on p. 190.

The Histogram node shows the occurrence of values for numeric fields. It is often used to explore the data before manipulations and model building. Similar to the Distribution node, the Histogram node frequently reveals imbalances in the data. For more information, see Histogram Node on p. 196.

The Collection node shows the distribution of values for one numeric field relative to the values of another. (It creates graphs that are similar to histograms.) It is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category. For more information, see Collection Node on p. 201.

The Web node illustrates the strength of the relationship between values of two or more symbolic (categorical) fields. The graph uses lines of various widths to indicate connection strength. You might use a Web node, for example, to explore the relationship between the purchase of a set of items at an e-commerce site. For more information, see Web Node on p. 205.



The Evaluation node helps to evaluate and compare predictive models. The evaluation chart shows how well models predict particular outcomes. It sorts records based on the predicted value and confidence of the prediction. It splits the records into groups of equal size (quantiles) and then plots the value of the business criterion for each quantile from highest to lowest. Multiple models are shown as separate lines in the plot. For more information, see Evaluation Chart Node on p. 215.

The Time Plot node displays one or more sets of time series data. Typically, you would first use a Time Intervals node to create a TimeLabel field, which would be used to label the x axis. For more information, see Time Plot Node on p. 225.

Once you have configured the options for a graph node, you can execute it from within the dialog box or as part of a stream. In the generated graph window, you can generate Derive (Set and Flag) and Select nodes based on a selection or region of data, effectively subsetting the data. For example, you might use this powerful feature to identify and exclude outliers.

Overlay Graphs
A variety of overlays can be applied to different graphs in Clementine, allowing you to explore additional aspects of the data. The following overlays are available, listed here with the applicable graphs:

Color: plot, histogram, collection
Panel: plot, multiplot, histogram, collection
Size: plot
Shape: plot
Transparency: plot
Animation: multiplot, histogram, collection
Figure 5-1 Graph with size overlay

Figure 5-2 Graph with panel overlay

Figure 5-3 Graph with color overlay

Figure 5-4 Graph with color and transparency overlays


3-D Graphs
Plots and collection graphs in Clementine have the ability to display information on a third axis. This provides you with additional flexibility in visualizing your data when selecting subsets or deriving new fields for modeling. Once you have created a 3-D graph, you can click it and drag your mouse to rotate it and view it from any angle.
Figure 5-5 Collection graph with x, y, and z axes

There are two ways of creating 3-D graphs in Clementine: plotting information on a third axis (true 3-D graphs) and displaying graphs with 3-D effects. Both methods are available for plots and collections.
To Plot Information on a Third Axis
E In the graph node dialog box, click the Plot tab.

E Click the 3-D button to enable options for the z axis.

E Use the Field Chooser button to select a field for the z axis. In some cases, only symbolic fields are allowed here. The Field Chooser will display the appropriate fields.
To Add 3-D Effects to a Graph
E Once you have created a graph, click the Graph tab in the output window.

E Click the 3-D button to switch the view to a three-dimensional graph.


Animation
Plots, multiplots, and histograms can be animated in Clementine. An animation graph works like a movie clip; you click the play button to flip through charts for all categories. An animation variable with many categories works especially well, since the animation flips through all of the graphs for you. Keeping the number of distinct categories reasonable (such as 15) will ensure normal performance of the software.
Figure 5-6 Animated plot using a variable with three categories - high blood pressure

Figure 5-7 Animated plot using a variable with three categories - low blood pressure

Figure 5-8 Animated plot using a variable with three categories - normal blood pressure

Once you have generated an animated chart, you can use the animation tools in a number of ways:

Pause the animation at any point.

Use the slider to view the animation at the desired point (category).

Building Graphs
Once you have added a graph node to a stream, you can double-click it to open a dialog box for specifying options. Most graphs contain a number of unique options presented on one or more tabs. There are also several tab options common to all graphs. The following sections contain more information about these common options.

Setting Output Options for Graphs


For all graph types, you can specify the following options for the filename and display of generated graphs. Note: For distributions, the file types are different and reflect the distribution's similarity to tables. For more information, see Output Options for the Distribution Node on p. 192.
Output name. Specifies the name of the graph produced when the node is executed. Auto chooses a name based on the node that generates the output. Optionally, you can select Custom to specify a different name.

Output to screen. Select to generate and display the graph in a Clementine window.

Output to file. Select to save the generated graph as a file of the type specified in the File type drop-down list.

Filename. Specify a filename used for the generated graph. Use the ellipsis button (...) to specify a file and location.


File type. Available file types are:

Bitmap (.bmp)
JPEG (.jpg)
PNG (.png)
HTML document (.html)
ViZml document (.xml) for use in other SPSS applications
Output object (.cou)

Setting Appearance Options for Graphs


For all graphs, you can specify appearance options before graph creation.
Figure 5-9 Setting appearance options for graphs

Title. Enter the text to be used for the graph's title.

Caption. Enter the text to be used for the graph's caption.

X label. Either accept the automatically generated x-axis label, or select Custom to specify a custom label.

Y label. Either accept the automatically generated y-axis label, or select Custom to specify a custom label.

Z label. Available only for 3-D graphs; either accept the automatically generated z-axis label, or select Custom to specify a custom label.

Display gridline. Selected by default, this option displays a gridline behind the plot or graph that enables you to more easily determine region and band cutoff points. Gridlines are always displayed in white unless the graph background is white; in this case, they are displayed in gray.


Color settings used for points and bars are specified in the User Options dialog box.

E To access this dialog box, from the Clementine window menus, choose:
Tools
User Options...

E Then click the Display tab.

Note: Colors used for points, lines, and bars must be specified before graph creation in order for changes to take effect.
Layout. For time plots only, you can specify whether time values are plotted along a horizontal or vertical axis. For more information, see Time Plot Node on p. 225.

After you have generated a graph, you can amend details such as the font, line style, and the color of fonts, lines, and graph contents. For more information, see Viewing Graph Output on p. 162.

Viewing Graph Output


Once you have created graphs, there are two different ways in which you can work with them:

Use the Selection/Interaction mode to work on the data shown within the graph. To select this mode, click the Selection mode icon on the toolbar. Alternatively, choose Enable Interaction from the Edit menu.
Figure 5-10 Selection mode icon

Note: This is the default way in which graphs are displayed when first generated.

Use the Edit mode to change the look of the graph. For example, you can change the shape and size of items such as plots and points, change the colors of lines and other items, change the font type and size used for labels, and adjust the space separating items. To select this mode, click the Edit mode icon on the toolbar. Alternatively, choose Enable Editing from the Edit menu.
Figure 5-11 Edit mode icon

Using the Selection/Interaction mode

In this mode, there are several ways that you can customize and manipulate graphs. For example, you can explore graphs in any of the following ways:

Use the mouse to select an area of a graph for further operations.

Use the options available from the menu bar. Different graphs may have different types of menus and options available.

Right-click a selected area to bring up a menu of available options for that area.

Figure 5-12 Evaluation chart with context-menu options for a defined region

Using these methods, you can perform the following operations, depending on the type of graph created:

Highlight data regions on plot graphs using the mouse to specify a rectangular area.

Highlight data bands on histograms and collection graphs by clicking in the graph area.

Identify and label subsets of your data.

Generate manipulation nodes based on selected areas of the graph. For details about the node-generation options for a particular type of graph, see the documentation for that graph node.

Figure 5-13 Exploring a plot using a variety of methods

Using the Edit mode

In this mode, you can change the look of a graph to match your needs. For example, you may want to change the font to make it larger or change the colors to match your corporate style guide. For more information, see Editing Graphs on p. 166.

To add further detail to your graphs, you can apply title, footnote, and axis labels. For more information, see Adding Titles and Footnotes on p. 174.

In Edit mode, there are several toolbars that affect different aspects of the graph's layout. If you find that there are any you don't use, you can hide them to increase the amount of space in the dialog box in which the graph is displayed. To select or deselect toolbars, click the relevant toolbar name on the View menu.

Figure 5-14 Selecting view options in Editing mode

Using Tooltips

When working with a graph that displays many values, you may be interested in the exact count of a specific value. This count value can be displayed as a tooltip when you hover your mouse pointer over the graph. To turn tooltips on or off, choose:
View
Tooltips

Figure 5-15 Displaying Tooltip information for one value

Saving graph layout changes

When you have made changes to the look of a graph, you can save the changes to be applied to other graphs. For more information, see Using Graph Stylesheets on p. 175.

Editing Graphs
You have several options for editing a graph. You can:

Edit text and format it.
Change the fill color and pattern of frames and graphic elements.
Change the color and dashing of borders and lines.
Rotate and change the shape and aspect ratio of point elements.
Change the size of graphic elements (such as bars and points).
Adjust the space around items by using margins and padding.
Change the axis and scale settings.
Set the orientation of axes and panels.
Change the position of the legend.

The following topics describe how to perform these various tasks. It is also recommended that you read the general rules for editing graphs.


General Rules for Editing Graphs


Selection

The options available for editing depend on the selection. Different toolbar and properties palette options are enabled depending on what is selected. Only the enabled items apply to the current selection. For example, if an axis is selected, the Scale, Major Ticks, and Minor Ticks tabs are available in the properties palette. Here are some tips for selecting items in the graph:

Click an item to select it.

When selecting graphic elements (such as points in a scatterplot or bars in a bar chart), the first click selects all graphic elements. Click again to drill down the selection to groups of graphic elements or a single graphic element.

Press Esc to deselect everything.
Automatic Settings

Some settings provide an -auto- option. This indicates that automatic values are applied. Which automatic settings are used depends on the specific graph and data values. You can enter a value to override the automatic setting. If you want to restore the automatic setting, delete the current value and press Enter. The setting will display -auto- again.
Removing/Hiding Items

You can remove/hide various items in the graph. For example, you can hide the legend or axis label. To delete an item, select it and press Delete. If the item does not allow deletion, nothing will happen. If you accidentally delete an item, press Ctrl+Z to undo the deletion.
State

Some toolbars reflect the state of the current selection; others don't. The properties palette always reflects state. If a toolbar does not reflect state, this is mentioned in the topic that describes the toolbar.

Editing and Formatting Text


You can edit text in place and change the formatting of an entire text block. Note that you can't edit text that is linked directly to data values. For example, you can't edit a tick label because the content of the label is derived from the underlying data. However, you can format any text in the graph.
How to Edit Text in Place
E Double-click the text block. This action selects all the text. All toolbars are disabled at this time,

because you cannot change any other part of the graph while editing text.

E Type to replace the existing text. You can also click the text again to display a cursor. Position the cursor in the desired position and enter the additional text.
How to Format Text
E Select the frame containing the text. Do not double-click the text.

E Format text using the font toolbar. If the toolbar is not enabled, make sure only the frame containing the text is selected. If the text itself is selected, the toolbar will be disabled.
Figure 5-16 Font toolbar

You can change the font:

Color
Family (e.g., Arial or Verdana)
Size (the unit is pixels unless you indicate a different unit, such as pt)
Weight
Alignment relative to the text frame

Formatting applies to all the text in a frame. You can't change the formatting of individual letters or words in any particular block of text.

Changing Colors, Patterns, and Dashings


Many different items in a graph have a fill and border. The most obvious example is a bar in a bar chart. The color of the bars is the fill color. They may also have a solid, black border around them. There are other less obvious items in the graph that have fill colors. If the fill color is transparent, you may not know there is a fill. For example, consider the text in an axis label. It appears as if this text is floating text, but it actually appears in a frame that has a transparent fill color. You can see the frame by selecting the axis label. Any frame in the graph can have a fill and border style, including the frame around the whole graph.
How to Change the Colors, Patterns, and Dashing
E Select the item you want to format. For example, select the bars in a bar chart or a frame containing text. If the graph is split by a categorical variable or field, you can also select the group that corresponds to an individual category. This allows you to change the default aesthetic assigned to that group. For example, you can change the color of one of the stacking groups in a stacked bar chart.

E To change the fill color, the border color, or the fill pattern, use the color toolbar.

Figure 5-17 Color toolbar

Note: This toolbar does not reflect the state of the current selection. You can click the button to select the displayed option or click the drop-down arrow to choose another option. For colors, notice there is one color that looks like white with a red, diagonal line through it. This is the transparent color. You could use this, for example, to hide the borders on bars in a histogram.

The first button controls the fill color. The second button controls the border color. The third button controls the fill pattern. The fill pattern uses the border color. Therefore, the fill pattern is visible only if there is a visible border color.
E To change the dashing of a border or line, use the line toolbar.

Figure 5-18 Line toolbar

Note: This toolbar does not reflect the state of the current selection. As with the other toolbar, you can click the button to select the displayed option or click the drop-down arrow to choose another option.

Rotating and Changing the Shape and Aspect Ratio of Point Elements
You can rotate point elements, assign a different predefined shape, or change the aspect ratio (the ratio of width to height).
How to Modify Point Elements
E Select the point elements. You cannot rotate or change the shape and aspect ratio of individual

point elements.
E Use the symbol toolbar to modify the points.

Figure 5-19 Symbol toolbar

The first button allows you to change the shape of the points. Click the drop-down arrow and select a predefined shape.


The second button allows you to rotate the points to a specific compass position. Click the drop-down arrow and then drag the needle to the desired position.

The third button allows you to change the aspect ratio. Click the drop-down arrow and then click and drag the rectangle that appears. The shape of the rectangle represents the aspect ratio.

Changing the Size of Graphic Elements


You can change the size of the graphic elements in the graph. These include bars, lines, and points, among others. If the graphic element is sized by a variable or field, the specified size is the minimum size.
How to Change the Size of the Graphic Elements
E Select the graphic elements you want to resize.

E Use the slider or enter a specific size for the option available on the symbol toolbar. The unit is pixels unless you indicate a different unit (such as cm or in). You can also specify a percentage (such as 30%), which means that a graphic element uses the specified percentage of the available space. The available space depends on the graphic element type and the specific graph.
Figure 5-20 Size control on symbol toolbar

Specifying Margins and Padding


If there is too much or too little spacing around or inside a frame in the graph, you can change its margin and padding settings. The margin is the amount of space between the frame and other items around it. The padding is the amount of space between the border of the frame and the contents of the frame.
How to Specify Margins and Padding
E Select the frame for which you want to specify margins and padding. This can be a text frame, the frame around the legend, or even the data frame displaying the graphic elements (such as bars and points).

E Use the Margins tab on the properties palette to specify the settings. All sizes are in pixels unless you indicate a different unit (such as cm or in).


Figure 5-21 Margins tab


Changing the Axis and Scale Settings


There are several options for modifying axes and scales.
How to Change Axis and Scale Settings
E Select any part of the axis (for example, the axis label or tick labels).

E Use the Scale, Major Ticks, and Minor Ticks tabs on the properties palette to change the axis and scale settings.


Figure 5-22 Properties palette

Scale tab

Type. Specifies whether the scale is linear or transformed. Scale transformations help you understand the data or make assumptions necessary for statistical inference. On scatterplots, you might use a transformed scale if the relationship between the independent and dependent variables or fields is nonlinear. Scale transformations can also be used to make a skewed histogram more symmetric so that it resembles a normal distribution. Note that you are transforming only the scale on which the data are displayed; you are not transforming the actual data.

linear. Specifies a linear, untransformed scale.

log. Specifies a base-10 log transformed scale. To accommodate zero and negative values, this transformation uses a modified version of the log function. This safe log function is defined as sign(x) * log(1 + abs(x)). So, safeLog(-99) equals: sign(-99) * log(1 + abs(-99)) = -1 * log(1 + 99) = -1 * 2 = -2.

power. Specifies a power transformed scale, using an exponent of 0.5. To accommodate negative values, this transformation uses a modified version of the power function. This safe power function is defined as sign(x) * pow(abs(x), 0.5). So, safePower(-100) equals: sign(-100) * pow(abs(-100), 0.5) = -1 * pow(100, 0.5) = -1 * 10 = -10.
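The two safe functions just described are easy to reproduce outside the product. The following is a minimal sketch in Python; the names safe_log and safe_power are illustrative only and are not part of Clementine:

import math

def safe_log(x):
    # sign(x) * log10(1 + abs(x)); defined for zero and negative values
    return math.copysign(math.log10(1 + abs(x)), x)

def safe_power(x, exponent=0.5):
    # sign(x) * abs(x) ** exponent; defined for negative values
    return math.copysign(abs(x) ** exponent, x)

print(safe_log(-99))     # -2.0, matching the example above
print(safe_power(-100))  # -10.0, matching the example above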
Min/Max/Nice Low/Nice High. Specifies the range for the scale. Selecting Nice Low and Nice High allows the application to select an appropriate scale based on the data. The minimum and maximum are "nice" because they are typically whole values greater or less than the maximum and minimum data values. For example, if the data range from 4 to 92, a nice low and high for the scale may be 0 and 100 rather than the actual data minimum and maximum. Be careful that you don't set a range that is too small and hides important items. Also note that you cannot set an explicit minimum and maximum if the Include Zero option is selected.

Low/High Margin. Creates margins at the low and/or high end of the axis. The margin appears perpendicular to the selected axis. The unit is pixels unless you indicate a different unit (such as cm or in). For example, if you set the High Margin to 5 for the vertical axis, a horizontal margin of 5 px runs along the top of the data frame.


Reverse. Specifies whether the scale is reversed.

Include Zero. Indicates that the scale should include 0. This option is commonly used for bar charts to ensure the bars begin at 0, rather than a value near the height of the smallest bar. If this option is selected, Min and Max are disabled because you cannot set a custom minimum and maximum for the scale range.
Major Ticks/Minor Ticks Tabs

Ticks or tick marks are the lines that appear on an axis and indicate values at specific intervals or categories. Major ticks are the tick marks with labels; they are also longer than other tick marks. Minor ticks are tick marks that appear between the major tick marks. Some options are specific to the tick type, but most options are available for major and minor ticks.

Show ticks. Specifies whether major or minor ticks are displayed on the graph.

Show gridlines. Specifies whether gridlines are displayed at the major or minor ticks. Gridlines are lines that cross the whole graph from axis to axis.

Position. Specifies the position of the tick marks relative to the axis.

Length. Specifies the length of the tick marks. The unit is pixels unless you indicate a different unit (such as cm or in).

Base. Applies only to major ticks. Specifies the value at which the first major tick appears.

Delta. Applies only to major ticks. Specifies the difference between major ticks. That is, major ticks will appear at every nth value, where n is the delta value.

Divisions. Applies only to minor ticks. Specifies the number of minor tick divisions between major ticks. The number of minor ticks is one less than the number of divisions. For example, assume that there are major ticks at 0 and 100. If you enter 2 as the number of minor tick divisions, there will be one minor tick at 50, dividing the 0–100 range and creating two divisions.
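The arithmetic behind Base, Delta, and Divisions can be sketched as follows. This is only an illustration of the relationships described above, not product code:

def major_ticks(base, delta, axis_max):
    # major ticks start at 'base' and repeat every 'delta' up to the axis maximum
    ticks, value = [], base
    while value <= axis_max:
        ticks.append(value)
        value += delta
    return ticks

def minor_ticks(lower, upper, divisions):
    # divisions between two major ticks yield divisions - 1 minor marks
    step = (upper - lower) / divisions
    return [lower + i * step for i in range(1, divisions)]

print(major_ticks(0, 100, 300))   # [0, 100, 200, 300]
print(minor_ticks(0, 100, 2))     # [50.0] -- one minor tick, two divisions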

Changing the Orientation of Axes and Panels


You can change the orientation of the graph axes or, if you are using panels, the orientation of the panels.
How to Change the Orientation of the Graph Axes

Changing the orientation of the axes is called transposing. It is similar to swapping the vertical and horizontal axes in a 2-D chart. You do not need to select anything to transpose.
E Click the transpose button on the toolbar.

Figure 5-23 Transpose button

E If necessary, click the button again to change the orientation back to the original appearance.


How to Change the Orientation of the Panels


E Select any part of the graph.

E Click Panels on the properties palette.

Figure 5-24 Panels tab

E Select an option from Layout:

Table. Lays out panels like a table, in that there is a row or column assigned to every individual value.

Transposed. Lays out panels like a table, but also swaps the original rows and columns. This option is not the same as transposing the graph itself; the x axis and the y axis are unchanged when you select this option.

List. Lays out panels like a list, in that each cell represents a combination of values. Columns and rows are no longer assigned to individual values. This option allows the panels to wrap if needed.

Changing the Position of the Legend


If the graph includes a legend, the legend is typically displayed to the right of the graph. You can change this position if needed.
How to Change the Legend Position
E Select the legend.

E Click Legend on the properties palette.

Figure 5-25 Legend tab

E Select a position.

Keyboard Shortcuts
Table 5-1 Keyboard shortcuts

Shortcut Key    Function
Ctrl+Space      Toggle between selection and editing mode
Delete          Delete a graph item
Ctrl+Z          Undo
Ctrl+Y          Redo

Adding Titles and Footnotes


For all graph types, you can add a unique title, footnote, or axis labels to help identify what is shown in the graph.
Figure 5-26 Adding a graph title

For example, to add a title to a graph:


E Select Add Graph Title from the Edit menu. A text box containing <TITLE> is displayed above the graph.

E Double-click on the <TITLE> text.

E Type the required title and press Return.


Using Graph Stylesheets


Basic graph display information, such as the colors, sizes, and styles of fonts and lines, as well as the data displayed, is controlled by a stylesheet. A default stylesheet is supplied with Clementine; however, you can make changes to it if required. For example, you may have a corporate color scheme for presentations that you want used in your graphs. For more information, see Editing Graphs on p. 166. In the graph nodes, you can use the Edit mode to change the look of a graph and then save the changes as a stylesheet, to be applied either to all graphs that you subsequently generate from the current graph node or as a new default stylesheet for all graphs that you produce using Clementine.
Figure 5-27 Selecting graph styles

There are four stylesheet options available from the Styles option on the Edit menu:
Store Styles in Node. This stores modifications to the selected graph's styles so that they are applied to any future graphs created from the same graph node in the current stream.

Store Styles as Default. This stores modifications to the selected graph's styles so that they are applied to all future graphs created from any graph node in any stream. After selecting this option, you can use Apply Default Styles to change any other existing graphs to use the same styles.

Apply Default Styles. This changes the selected graph's styles to those that are currently saved as the default styles.

Apply Original Styles. This changes a graph's styles back to the ones supplied with Clementine as the original default.


Printing, Saving, Copying, and Exporting Graphs


Each graph has a number of options that allow you to save or print the graph or export it to another format. Most of these options are available from the File menu. In addition, from the Edit menu, you can choose to copy the graph for use in another application.
Figure 5-28 File menu and toolbar for graph windows

To print the graph, use the Print menu item or button. Before you print, you can use Page Setup and Print Preview to set print options and preview the output. To save the graph to a Clementine output file (*.cou), choose Save or Save As from the File menu. To save the graph in another format, such as bitmap, PNG, XML (or HTML, if applicable), choose Export from the File menu. To save the graph in the Predictive Enterprise Repository, choose Store Output from the File menu. To copy the graph for use in another application, such as MS Word or MS PowerPoint, choose Copy Graph from the Edit menu. Alternatively, use Ctrl-C. The remainder of this chapter focuses on the specific options for creating graphs and using them in their output windows.

Plot Node
Plot nodes show the relationship between numeric fields. You can create a plot using points (also known as a scatterplot), or you can use lines. You can create three types of line plots by specifying an X Mode in the dialog box.


X Mode = Sort

Setting X Mode to Sort causes data to be sorted by values for the field plotted on the x axis. This produces a single line running from left to right on the graph. Using a set variable as an overlay produces multiple lines of different hues running from left to right on the graph.
Figure 5-29 Line plot with X Mode set to Sort

X Mode = Overlay

Setting X Mode to Overlay creates multiple line plots on the same graph. Data are not sorted for an overlay plot; as long as the values on the x axis increase, data will be plotted on a single line. If the values decrease, a new line begins. For example, as x moves from 0 to 100, the y values are plotted on a single line. When x falls below 100, a new line is plotted in addition to the first one. The finished plot might have numerous lines, which is useful for comparing several series of y values. This type of plot is useful for data with a periodic time component, such as electricity demand over successive 24-hour periods.

Figure 5-30 Line plot with X Mode set to Overlay

X Mode = As Read

Setting X Mode to As Read plots x and y values as they are read from the data source. This option is useful for data with a time series component where you are interested in trends or patterns that depend on the order of the data. You may need to sort the data before creating this type of plot. It may also be useful to compare two similar plots, one with X Mode set to Sort and one set to As Read, to determine how much of a pattern depends on the sorting.

Figure 5-31 Line plot shown earlier as Sort, executed again with X Mode set to As Read

Setting Options for the Plot Node


Plots show values of a Y field against values of an X field. Often, these fields correspond to a dependent variable and an independent variable, respectively.

Figure 5-32 Setting options for a Plot node

X field. Select a field from the list to display on the x axis, also known as the horizontal axis or abscissa.

Y field. Select a field from the list to display on the y axis, also known as the vertical axis or ordinate.

Z field. When you click the 3-D chart button, a third field becomes available; select a field from the list to display on the z axis.

Overlay. There are several ways to illustrate categories for data values. For example, you can use maincrop as a color overlay to indicate the estincome and claimvalue values for the main crop grown by claim applicants.

Color. Select a field to illustrate categories for data values by using a different color for each value.

Panel. Select a set or flag field to use in making a separate chart for each category. Charts will be paneled, or displayed together in one output window.

Size. Select a field to illustrate categories for data values by using a gradient of sizes. This overlay is not available for line plots.

Animation. Select a set or flag field to illustrate categories for data values by creating a series of charts displayed in sequence using animation.

Shape. Select a set or flag field to illustrate categories for data values by using a different point shape for each category. This overlay is not available for line plots.

Transparency. Select a field to illustrate categories for data values by using a different level of transparency for each category. This overlay is not available for line plots.

When using a range field as an overlay for color, size, or transparency, the legend uses a continuous scale rather than discrete categories.

Overlay type. Specifies whether an overlay function or smoother is displayed.


Smoother. Displays a smoothed fit line computed using locally weighted iterative robust least squares regression (LOESS). This method effectively computes a series of regressions, each focused on a small area within the plot, producing a series of local regression lines that are then joined to create a smooth curve. (A small sketch of the idea appears after this list.)
Figure 5-33 Plot with a LOESS smoother overlay

Function. Select to specify a known function to compare to actual values. For example, to compare actual versus predicted values, you can plot the function y = x as an overlay. Specify a function for y = in the text box. The default function is y = x, but you can specify any sort of function, such as a quadratic function or an arbitrary expression, in terms of x. If you have specified a 3-D graph, you can also specify an overlay function for z. Note: Overlay functions are not available for a panel or animation graph.


None. No overlay is displayed.

Note: The smoother and overlay functions are always calculated as a function of y. Once you have set options for a plot, you can execute the plot directly from the dialog box by clicking Execute. You may, however, want to use the Options tab for additional specifications, such as binning, X Mode, and style.
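To illustrate the general idea behind the Smoother option (nearest-neighbour windows, tricube weights, local linear fits), here is a minimal LOESS-style sketch in Python. It is only an illustration of the technique; it is not Clementine's implementation and omits the iterative robustness step:

import numpy as np

def loess_sketch(x, y, frac=0.5, grid_size=100):
    # evaluate a locally weighted linear fit on a grid of x positions
    x, y = np.asarray(x, float), np.asarray(y, float)
    k = max(2, int(frac * len(x)))                 # points in each local window
    xs = np.linspace(x.min(), x.max(), grid_size)
    ys = np.empty_like(xs)
    for i, x0 in enumerate(xs):
        d = np.abs(x - x0)
        idx = np.argsort(d)[:k]                    # k nearest neighbours of x0
        w = (1 - (d[idx] / (d[idx].max() + 1e-12)) ** 3) ** 3   # tricube weights
        slope, intercept = np.polyfit(x[idx], y[idx], 1, w=np.sqrt(w))
        ys[i] = slope * x0 + intercept             # local line evaluated at x0
    return xs, ys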


Additional Plot Options


Figure 5-34 Options tab settings for a Plot node

Style. Select either Point or Line for the plot style. Selecting Line activates the X Mode control described below.

Point. By default, the point shape is a plus symbol (+). Once the graph is created, you can change the point shape and alter its size.

X Mode. For line plots, you must choose an X Mode to define the style of the line plot. Select Sort, Overlay, or As Read. For Overlay or As Read, you should specify a maximum dataset size used to sample the first n records. Otherwise, the default 2,000 records will be used.

Automatic X range. Select to use the entire range of values in the data along the x axis. Deselect to use an explicit subset of values based on your specified Min and Max values. Either enter values or use the arrows. Automatic ranges are selected by default to enable rapid graph building.

Automatic Y range. Select to use the entire range of values in the data along the y axis. Deselect to use an explicit subset of values based on your specified Min and Max values. Either enter values or use the arrows. Automatic ranges are selected by default to enable rapid graph building.

Automatic Z range. When a 3-D graph is specified on the Plot tab, you can select this option to use the entire range of values in the data along the z axis. Deselect to use an explicit subset of values based on your specified Min and Max values. Either enter values or use the arrows. Automatic ranges are selected by default to enable rapid graph building.

Jitter. Also known as agitation, jitter is useful for point plots of a dataset in which many values are repeated. In order to see a clearer distribution of values, you can use jitter to distribute the points randomly around the actual value.


Note to users of earlier versions of Clementine: The jitter value used in a plot uses a different metric in this release of Clementine. In earlier versions, the value was an actual number, but it is now a proportion of the frame size. This means that agitation values in old streams are likely to be too large. For this release, any nonzero agitation values will be converted to the value 0.2.
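As a rough illustration of the proportional metric described in the note above, jitter can be thought of as adding random noise whose spread is a fraction of the plotted range. The helper below is hypothetical and not the product's implementation:

import numpy as np

def jitter(values, frame_range, proportion=0.2, seed=None):
    # scatter points around their true value by a fraction of the frame size
    rng = np.random.default_rng(seed)
    spread = proportion * frame_range            # e.g. 0.2 of the plotted x range
    return np.asarray(values, float) + rng.uniform(-spread / 2, spread / 2, len(values))

x = [1, 1, 1, 2, 2, 3]
print(jitter(x, frame_range=max(x) - min(x), seed=0))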
Maximum number of records to plot. Specify a method for plotting large datasets. You can specify a maximum dataset size or use the default 2,000 records. Performance is enhanced for large datasets when you select the Bin or Sample options. Alternatively, you can choose to plot all data points by selecting Use all data, but note that this may dramatically decrease the performance of the software. Note: When X Mode is set to Overlay or As Read, these options are disabled and only the first n records are used.

Bin. Select to enable binning when the dataset contains more than the specified number of records. Binning divides the graph into fine grids before actually plotting and counts the number of points that would appear in each grid cell. In the final graph, one point is plotted per cell at the bin centroid (the average of all point locations in the bin). The size of the plotted symbol indicates the number of points in that region (unless you have used size as an overlay). Using the centroid and size to represent the number of points makes the binned plot a superior way to represent large datasets, because it prevents overplotting in dense regions (undifferentiated masses of color) and reduces symbol artifacts (artificial patterns of density). Symbol artifacts occur when certain symbols (particularly the plus symbol [+]) collide in a way that produces dense areas not present in the raw data. (A simplified sketch of this binning appears after the next item.)
Sample. Select to randomly sample the data to the number of records entered in the text field. The default is 2,000.
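The binning described above can be sketched as follows. This is a simplified illustration of grid binning with centroids; the grid size and names are assumptions, and it is not the product's algorithm:

import numpy as np

def bin_points(x, y, grid=50):
    # one symbol per occupied grid cell, placed at the cell centroid,
    # sized by the number of raw points that fell into that cell
    x, y = np.asarray(x, float), np.asarray(y, float)
    ix = np.clip(((x - x.min()) / ((x.max() - x.min()) or 1) * grid).astype(int), 0, grid - 1)
    iy = np.clip(((y - y.min()) / ((y.max() - y.min()) or 1) * grid).astype(int), 0, grid - 1)
    cells = {}
    for cx, cy, i, j in zip(x, y, ix, iy):
        sx, sy, n = cells.get((i, j), (0.0, 0.0, 0))
        cells[(i, j)] = (sx + cx, sy + cy, n + 1)
    # centroid x, centroid y, and count (the count drives the symbol size)
    return [(sx / n, sy / n, n) for sx, sy, n in cells.values()]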

Using a Plot Graph


Plots, multiplots, and evaluation charts are essentially plots of X against Y. For example, if you are exploring potential fraud in agricultural grant applications (as illustrated in fraud.str in the demos folder of your Clementine installation), you might want to plot the income claimed on the application versus the income estimated by a neural net. Using an overlay, such as crop type, will illustrate whether there is a relationship between claims (value or number) and type of crop.

Figure 5-35 Plot of the relationship between estimated income and claim value with main crop type as an overlay

Since plots, multiplots, and evaluation charts are two-dimensional displays of Y against X, it is easy to interact with them by selecting regions with the mouse. A region is an area of the graph described by its minimum and maximum X and Y values. Note: Regions cannot be defined in 3-D or animated plots.
To Define a Region

You can either use the mouse to interact with the graph, or you can use the Edit Graph Regions dialog box to specify region boundaries and related options. For more information, see Editing Graph Regions on p. 187. To use the mouse for defining a region:

E Click the left mouse button somewhere in the plot to define a corner of the region.

E Drag the mouse to the position desired for the opposite corner of the region. The resulting rectangle cannot exceed the boundaries of the axes.

E Release the mouse button to create a permanent rectangle for the region. By default, the new region is called Region<N>, where N corresponds to the number of regions already created in the Clementine session.

Figure 5-36 Defining a region of high claim values

Once you have defined a region, there are numerous ways to delve deeper into the selected area of the graph. Use the mouse in the following ways to produce feedback in the graph window:

Hover over data points to see point-specific information.

Right-click and hold the mouse button in a region to see information about the boundaries of that region.

Simply right-click in a region to bring up a context menu with additional options, such as generating process nodes.

Figure 5-37 Exploring the region of high claim values

To Rename a Region
E Right-click anywhere in the defined region.

E From the context menu, choose Rename Region.

E Enter a new name and click OK.

Note: You can also rename the default region by right-clicking anywhere outside the region and choosing Rename Default Region. Once you have defined regions, you can select subsets of records on the basis of their inclusion in a particular region or in one of several regions. You can also incorporate region information for a record by producing a Derive node to flag records based on their inclusion in a region.
To Select or Flag Records in a Single Region
E Right-click in the region. Note that when you hold the right mouse button, the details for the region are displayed in the feedback panel below the plot.


E From the context menu, choose Generate Select Node for Region or Generate Derive Node for Region.

A Select node or Derive node is automatically added to the stream canvas with the appropriate options and conditions specified. The Select node selects all records in the region. The Derive node generates a flag for records whose values fall within the region. The flag field name corresponds to the region name, with the flags set to T for records inside the region and F for records outside.
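The flag logic of such a generated Derive node amounts to a simple bounds test. The sketch below is hypothetical; the field names, dictionary keys, and example numbers are made up for illustration only:

def region_flag(x_value, y_value, region):
    # return 'T' if the point lies inside the region's X/Y bounds, else 'F'
    inside = (region["min_x"] <= x_value <= region["max_x"]
              and region["min_y"] <= y_value <= region["max_y"])
    return "T" if inside else "F"

high_claims = {"min_x": 30000, "max_x": 80000, "min_y": 20000, "max_y": 60000}
print(region_flag(45000, 25000, high_claims))   # T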
To Select, Flag, or Derive a Set for Records in All Regions
E From the Generate menu in the graph window, choose Derive Node (Set), Derive Node (Flag), or Select Node.

E For all selections, a new node appears on the stream canvas with the following characteristics, depending on your selection:


Derive Set. Produces a new field called region for each record. The value of that field is the name of the region into which the records fall. Records falling outside all regions receive the name of the default region. (Right-click outside all regions and choose Rename Default Region to change the name of the default region.)

Derive Flag. Creates a flag field called in_region with the flags set to T for records inside any region and F for records outside all regions.

Select Node. Generates a new node that tests for inclusion in any region. This node selects records in any region for downstream processing.

Editing Graph Regions


For plots, multiplots, and evaluation charts, you can edit the properties of regions defined on the graph. To open this dialog box, from the graph window menus, choose:

Edit Graph Regions...

Figure 5-38 Specifying properties for the defined regions

Region Name. Enter adjustments to the defined region names.

You can manually specify the boundaries of the region by adjusting the Min and Max values for X and Y. Add new regions by specifying the name and boundaries. Then press the Enter key to begin a new row.


To Delete a Region
E From the Edit menu, open the Graph Regions dialog box.

E Click the region you want to delete.

E Click the delete button to the right of the Max Y table heading.

Multiplot Node
A multiplot is a special type of plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines, and each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you have time sequence data and want to explore the fluctuation of several variables over time.
Figure 5-39 Setting options for a Multiplot node

Setting Options for the Multiplot Node


X field. Select a field to display along the x axis.

Y fields. Select one or more fields from the list to display over the range of X field values. Use the Field Chooser button to select multiple fields. Click the delete button to remove fields from the list.

Overlay. There are several ways to illustrate categories for data values. For example, you might use an animation overlay to display multiple plots for each value in the data. This is useful for sets with many categories, such as 10. When used for sets with more than 15 categories, you may notice a decrease in performance.


Panel. Select a set or flag field to use in making a separate chart for each category. Charts will be paneled, or displayed together in one output window.

Animation. Select a set or flag field to illustrate categories for data values by creating a series of charts displayed in sequence using animation.


Normalize. Select to scale all Y values to the range 0–1 for display on the graph. Normalizing helps you explore the relationship between lines that might otherwise be obscured due to differences in the range of values for each series, and is recommended when plotting multiple lines on the same graph or when comparing plots in side-by-side panels. (Normalizing is not necessary when all data values fall within a similar range.)
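One simple way to scale a series into the 0–1 range is min-max scaling. The sketch below is an illustration of that idea, assuming each Y series is scaled independently; it is not necessarily the exact method the node uses:

def normalize(series):
    # min-max scale a series of Y values into the 0-1 range
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1            # guard against a perfectly flat series
    return [(v - lo) / span for v in series]

print(normalize([5, 10, 20]))        # [0.0, 0.333..., 1.0]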
Figure 5-40 Standard multiplot showing power-plant fluctuation over time (note that without normalizing, the plot for Pressure is impossible to see)

Figure 5-41 Normalized multiplot showing a plot for Pressure

Overlay function. Select to specify a known function to compare to actual values. For example, to compare actual versus predicted values, you can plot the function y = x as an overlay. Specify a function in the y = text box. The default function is y = x, but you can specify any sort of function, such as a quadratic function or an arbitrary expression, in terms of x.


When number of records greater than. Specify a method for plotting large datasets. You can specify a maximum dataset size or use the default 2,000 points. Performance is enhanced for large datasets when you select the Bin or Sample options. Alternatively, you can choose to plot all data points by selecting Use all data, but note that this may dramatically decrease the performance of the software. Note: When X Mode is set to Overlay or As Read, these options are disabled and only the first n records are used.

Bin. Select to enable binning when the dataset contains more than the specified number of records. Binning divides the graph into fine grids before actually plotting and counts the number of connections that would appear in each grid cell. In the final graph, one connection is used per cell at the bin centroid (the average of all connection points in the bin).

Sample. Select to randomly sample the data to the specified number of records.

Using a Multiplot Graph


Plots and multiplots are two-dimensional displays of Y against X, making it easy to interact with them by selecting regions with the mouse. A region is an area of the graph described by its minimum and maximum X and Y values. Since multiplots are essentially a type of plot, the graph window displays the same options as those for the Plot node. For more information, see Using a Plot Graph on p. 183.

Distribution Node
A distribution graph or table shows the occurrence of symbolic (non-numeric) values, such as mortgage type or gender, in a dataset. A typical use of the Distribution node is to show imbalances in the data that can be rectified by using a Balance node before creating a model. You can automatically generate a Balance node using the Generate menu in the distribution graph or table window. Note: To show the occurrence of numeric values, you should use a Histogram node.


Setting Options for the Distribution Node


Figure 5-42 Setting options for a Distribution node

Plot. Select the type of distribution. Select Selected fields to show the distribution of the selected field. Select All flags (true values) to show the distribution of true values for flag fields in the dataset.

Field. Select a set or flag field for which to show the distribution of values. Only fields that have not been explicitly set as numeric appear on the list.

Overlay. Select a set or flag field to use as a color overlay, illustrating the distribution of its values within each value of the field selected above. For example, you can use marketing campaign response (pep) as an overlay for number of children (children) to illustrate responsiveness by family size.

Normalize by color. Select to scale bars so that all bars take up the full width of the graph. The overlay values equal a proportion of each bar, making comparisons across categories easier.

Sort. Select the method used to display values in the distribution graph. Select Alphabetic to use alphabetical order or By count to list values in decreasing order of occurrence.

Proportional scale. Select to scale the distribution of values so that the value with the largest count fills the full width of the plot. All other bars are scaled against this value. Deselecting this option scales bars according to the total counts of each value.


Output Options for the Distribution Node


Figure 5-43 Setting output options for a Distribution node

The following options are displayed on the Output tab for distributions:
Output name. Specifies the name of the graph produced when the node is executed. Auto chooses a name based on the node that generates the output. Optionally, you can select Custom to specify a different name.

Output to screen. Select to generate and display the graph in a Clementine window.

Output to file. Select to save the generated graph or table as a file of the type specified in the File type drop-down list.

Output Graph. Select to generate and output the node in a file as a graph.

Output Table. Select to generate and output the node in a file as a table.

Filename. Specify a filename used for the generated graph or table. Use the ellipsis button (...) to specify a specific file and location.

File type. If you select Output Graph, the available graph file types are:

Bitmap (.bmp)
JPEG (.jpg)
PNG (.png)
HTML document (.html)
ViZml document (.xml) for use in other SPSS applications
Output object (.cou)

If you select Output Table, the available table file types are:

Tab delimited data (.tab)
Comma delimited data (.csv)


HTML document (.html)
Output object (.cou)


Paginate output. When saving output as HTML, this option is enabled to allow you to control the size of each HTML page.

Lines per page. When Paginate output is selected, this option is enabled to allow you to determine the length of each HTML page. The default setting is 400 rows.

Using a Distribution Node


Distribution nodes are used to show the distribution of symbolic values in a dataset. They are frequently used before manipulation nodes to explore the data and correct any imbalances. For example, if instances of respondents without children occur much more frequently than other types of respondents, you might want to reduce these instances so that a more useful rule can be generated in later data mining operations. A Distribution node will help you to examine and make decisions about such imbalances. The Distribution node is unusual in that it produces both a graph and a table to analyze your data.
Figure 5-44 Distribution graph showing the proportion of numbers of children with response to a marketing campaign

Figure 5-45 Distribution table showing the proportion of numbers of children with response to a marketing campaign

Once you have created a distribution table and graph and examined the results, you can use options from the menus to group values, copy values, and generate a number of nodes for data preparation. In addition, you can copy or export the graph and table information for use in other applications, such as MS Word or MS PowerPoint.
Graph File Menu Options

In addition to the standard File menu options, the Export Graph option enables you to export the graph in one of the following formats:

Bitmap (.bmp)
JPEG (.jpg)
PNG (.png)
HTML document (.html)
ViZml document (.xml) for use in other SPSS applications
Table File Menu Options

In addition to the standard File menu options, the Export Table option enables you to export the table in one of the following formats:

Tab delimited (.tab)
Comma delimited (.csv)
HTML document (.html)
Table Edit Menu Options

You can use options on the Edit menu to group, select, and copy values in the distribution table.
To Select and Copy Values from a Distribution Table

E Click and hold the mouse button while dragging it over the rows to select a set of values. You can use the Edit menu to Select All values.

E From the Edit menu, choose Copy Table or Copy Table (inc. field names).

E Paste to the clipboard or into the desired application.

Note: The bars do not get copied directly. Instead, the table values are copied. This means that overlaid values will not be displayed in the copied table.
To Group Values from a Distribution Table
E Select values for grouping using the Ctrl-click method.

E From the Edit menu, choose Group.

Note: When you group and ungroup values, the graph on the Graph tab is automatically redrawn to show the changes. You can also:

Ungroup values by selecting the group name in the distribution list and choosing Ungroup from the Edit menu.

Edit groups by selecting the group name in the distribution list and choosing Edit group from the Edit menu. This opens a dialog box where values can be shifted to and from the group.
Figure 5-46 Edit group dialog box

Generate Menu Options

You can use options on the Generate menu to select a subset of data, derive a flag field, regroup values, or balance the data from either a graph or table. These operations generate a data preparation node and place it on the stream canvas. To use the generated node, connect it to an existing stream.

Select Node. Select the graph, or any row from the table, to generate a Select node for that category. You can select multiple categories using Ctrl-click in the distribution table.

Derive Node. Select the graph, or any cell from the table, to generate a Derive flag node for that category. You can select multiple categories using Ctrl-click in the distribution table.

Balance Node (boost). Use this option to generate a Balance node that boosts the size of smaller subsets.


Balance Node (reduce). Use this option to generate a Balance node that reduces the size of larger subsets.

Reclassify Node (groups). Use this option to generate a Reclassify node that recodes specific values of the displayed field depending upon their inclusion in a group. Groups can be selected using the Ctrl-click method. You can group values by selecting them and using the Edit menu options. Note: Groups are only available from the Table tab.

Reclassify Node (values). Use this option to generate a Reclassify node where values from the distribution are available from the drop-down list in the New values column. This functionality is useful when data must be reclassified into an existing set of numerous values. For example, to merge financial data from various companies for analysis, it might be necessary to reclassify products into a standard set of values. If the values are predefined, you can read them into Clementine as a flat file and use a distribution to display all values. Then generate a Reclassify (values) node for this field directly from the chart. This process makes all the target values visible from the New values column (drop-down list) in the Reclassify node.

Histogram Node
Histogram nodes show the occurrence of values for numeric fields. They are often used to explore the data before manipulations and model building. Similar to the Distribution node, Histogram nodes are frequently used to reveal imbalances in the data. Note: To show the occurrence of values for symbolic fields, you should use a Distribution node.
Figure 5-47 Setting options for a Histogram node

Field. Select a numeric field for which to show the distribution of values. Only fields that have not been explicitly defined as symbolic (categorical) will be listed.


Overlay. Select a symbolic field to show categories of values for the field selected above. Selecting an overlay field converts the histogram to a stacked chart with colors used to represent different categories of the overlay field. Three types of overlays are available for histograms:

Color. Select a field to illustrate categories for data values by using a different color for each value.

Panel. Select a set or flag field to use in making a separate graph for each category. Graphs will be paneled, or displayed together in one output window.

Animation. Select a set or flag field to illustrate categories for data values by creating a series of graphs displayed in sequence using animation.

Setting Additional Options for the Histogram Node


Figure 5-48 Options tab settings for a Histogram node

Automatic X range. Select to use the entire range of values in the data along the x axis. Deselect to use an explicit subset of values based on your specified Min and Max values. Either enter values or use the arrows. Automatic ranges are selected by default to enable rapid graph building.

Bins. Select By number to display a fixed number of histogram bars whose width depends on the range specified above and the number of buckets specified below. Select By width to create a histogram with bars of a fixed width (specified below). The number of bars depends on the specified width and the range of values. (See the sketch at the end of this list.)

No. of bins. Specify the number of buckets (bars) to be used in the histogram. Use the arrows to adjust the number.


Bin width. Specify the width of histogram bars.

Normalize by color. Select to adjust all bars to the same height, displaying overlaid values as a percentage of the total cases in each bar.


Show normal curve. Select to add a normal curve to the graph showing the mean and variance of the data.
Separate bands for each color. Select to display each overlaid value as a separate band on the graph.
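The relationship between the two Bins options above can be sketched as follows; this is an illustrative helper, not product code. With By number the bar width follows from the data range, and with By width the number of bars does:

import numpy as np

def bin_edges(x_min, x_max, by_number=None, by_width=None):
    # histogram bin edges from either a fixed bar count or a fixed bar width
    if by_number is not None:
        return np.linspace(x_min, x_max, by_number + 1)
    n_bars = int(np.ceil((x_max - x_min) / by_width))
    return x_min + by_width * np.arange(n_bars + 1)

print(bin_edges(0, 100, by_number=4))   # edges at 0, 25, 50, 75, 100
print(bin_edges(0, 100, by_width=30))   # edges at 0, 30, 60, 90, 120 -- four bars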


Using Histograms and Collections


Histograms and collections offer a similar window into your data before modeling. Histograms show the distribution of values in a numeric field whose values range along the x axis. Collections show the distribution of values for one numeric field relative to the values of another, rather than the occurrence of values for a single field. Both types of charts are frequently used before manipulation nodes to explore the data and correct any imbalances by generating a Balance node from the output window. You can also generate a Derive Flag node to add a field showing which band each record falls into, or a Select node to select all records within a particular set or range of values. Such operations help you to focus on a particular subset of data for further exploration.
Figure 5-49 Histogram showing the distribution of increased purchases by category due to promotion

Several options are available in the histogram window. These options apply to both histograms and collections. For example, you can:

Split the range of values on the x axis into bands.

Generate a Select or Derive Flag node based on inclusion in a particular band's range of values.

Generate a Derive Set node to indicate the band into which a record's values fall.

Generate a Balance node to correct imbalances in the data.

View the graph in 3-D (available for collections only).


To Define a Band

You can either use the mouse to interact with the graph, or you can use the Edit Graph Bands dialog box to specify the boundaries of bands and other related options. For more information, see Editing Graph Bands on p. 201. To use the mouse for defining a band:

Click anywhere in the histogram to set a line defining a band of values.

Or click the Bands button on the toolbar to split the graph into equal bands. This method adds additional options to the toolbar, which you can use to specify a number of equal bands.
Figure 5-50 Creating equal bands

Once you have defined a band, there are numerous ways to delve deeper into the selected area of the graph. Use the mouse in the following ways to produce feedback in the graph window:

Hover over bars to see bar-specific information.

Check the range of values for a band by right-clicking inside a band and reading the feedback panel at the bottom of the window.

Simply right-click in a band to bring up a context menu with additional options, such as generating process nodes.

Rename bands by right-clicking in a band and selecting Rename Band. By default, bands are named bandN, where N equals the number of bands from left to right on the x axis.

Move the boundaries of a band by selecting a band line with your mouse and moving it to the desired location on the x axis.

Delete bands by right-clicking on a line and selecting Delete Band.

Once you have created a histogram, defined bands, and examined the results, you can use options on the Generate menu and the context menu to create Balance, Select, or Derive nodes.

Figure 5-51 Generate and context menus showing options for generating nodes and renaming bands

To Select or Flag Records in a Particular Band


E Right-click in the band. Notice that the details for the band are displayed in the feedback panel below the plot.


E From the context menu, choose Generate Select Node for Band or Generate Derive Node for Band.

A Select node or Derive node is automatically added to the stream canvas with the appropriate options and conditions specified. The Select node selects all records in the band. The Derive node generates a flag for records whose values fall within the band. The flag field name corresponds to the band name, with flags set to T for records inside the band and F for records outside.
To Derive a Set for Records in All Regions
E From the Generate menu in the graph window, choose Derive Node.

E A new Derive Set node appears on the stream canvas with options set to create a new field called band for each record. The value of that field equals the name of the band that each record falls into.

To Create a Balance Node for Imbalanced Data

E From the Generate menu in the graph window, choose one of the two Balance node types:

Balance Node (boost). Generates a Balance node to boost the occurrence of infrequent values.

Balance Node (reduce). Generates a Balance node to reduce the frequency of common values.

The generated node will be placed on the stream canvas. To use the node, connect it to an existing stream.


Editing Graph Bands


For histograms, collections, evaluation charts, and time plots, you can edit the properties of bands defined on the graph. To open this dialog box, from the graph window menus, choose:

Edit Graph Bands...

Figure 5-52 Specifying properties for graph bands

Band Name. Enter adjustments to the defined band names.

You can manually specify the boundaries of the bands by adjusting the Lower Bound values. Add new bands by specifying the name and boundaries. Then press the Enter key to begin a new row. Delete bands by selecting one in the table and clicking the delete button.

Collection Node
Collections are similar to histograms except that collections show the distribution of values for one numeric field relative to the values of another, rather than the occurrence of values for a single field. A collection is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category.

Figure 5-53 Setting options for a Collection node
202 Chapter 5 Figure 5-53 Setting options for a Collection node

Collect. Select a field whose values will be collected and displayed over the range of values for the field specified below in Over. Only fields that have not been defined as symbolic are listed.

Over. Select a field whose values will be used to display the collection field specified above.

By. Enabled when creating a 3-D graph, this option allows you to select a set or flag field used to display the collection field by categories.

Operation. Select what each bar or bucket in the collection graph represents. Options include Sum, Mean, Max, Min, and Standard Deviation.

Overlay. Select a symbolic field to show categories of values for the field selected above. Selecting an overlay field converts the collection and creates multiple bars of varying colors for each category. Three types of overlays are available for collections:

Color. Select a field to illustrate categories for data values by using a different color for each value.

Panel. Select a set or flag field to use in making a separate graph for each category. Graphs will be paneled, or displayed together in one output window.

Animation. Select a set or flag field to illustrate categories for data values by creating a series of graphs displayed in sequence using animation.
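To make the Collect, Over, and Operation options concrete, a collection bar can be thought of as an aggregate of the Collect field within a bin of the Over field. The sketch below is illustrative only; the binning choice, field data, and names are assumptions, not the product's implementation:

import numpy as np

def collection_bars(over, collect, n_bins=10, op=np.mean):
    # one bar per bin of 'over', holding op() of 'collect' within that bin
    over, collect = np.asarray(over, float), np.asarray(collect, float)
    edges = np.linspace(over.min(), over.max(), n_bins + 1)
    which = np.clip(np.digitize(over, edges) - 1, 0, n_bins - 1)
    return [float(op(collect[which == b])) if np.any(which == b) else 0.0
            for b in range(n_bins)]

age = [20, 25, 30, 42, 47, 55, 61, 68]            # hypothetical Over values
na_to_k = [7.3, 9.1, 10.2, 14.6, 12.0, 8.4, 19.5, 11.3]   # hypothetical Collect values
print(collection_bars(age, na_to_k, n_bins=4, op=np.sum))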

Setting Additional Options for the Collection Node


Automatic X range. Select to use the entire range of values in the data along the x axis. Deselect to use an explicit subset of values based on your specified Min and Max values. Either enter values or use the arrows. Automatic ranges are selected by default to enable rapid graph building.


Bins. Select By number to display a fixed number of collection bars whose width depends on the range specified above and the number of buckets specified below. Select By width to create a collection with bars of a fixed width (specified below). The number of bars depends on the specified width and the range of values.

No. of bins. Specify the number of buckets (bars) to be used in the collection. Use the arrows to adjust the number.


Bin width. Specify the width of collection bars.

Using a Collection Graph


Collection nodes show the distribution of values in a numeric field whose values range along the x axis. They are frequently used before manipulation nodes to explore the data and correct any imbalances by generating a Balance node from the graph window. You can also generate a Derive Flag node to add a field showing which range (band) each record falls into, or a Select node to select all records within a particular range of values. Such operations help you to focus on a particular subset of data for further exploration.
Figure 5-54 3-D collection graph showing sum of Na_to_K over Age for both high and low cholesterol levels

Figure 5-55 Collection graph without z axis displayed but with Cholesterol as color overlay

Once you have created a collection graph, several options are available in the graph window. For example, you can:

Split the range of values on the x axis into bands.

Generate a Select or Derive Flag node based on inclusion in a particular band's range of values.

Generate a Derive Set node to indicate which band contains a record's value.

Generate a Balance node to correct imbalances in the data.

Figure 5-56 Options for generating Select and Derive nodes to examine a band of interest

Since collections are very similar to histograms, the graph window displays the same options. For more information, see Using Histograms and Collections on p. 198.

Web Node
Web nodes show the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.

Figure 5-57 Web node showing relationships between the purchase of grocery items

Directed Webs

Directed Web nodes are similar to Web nodes in that they show the strength of relationships between symbolic fields. However, directed web graphs show connections only from one or more From fields to a single To field. The connections are unidirectional in the sense that they are one-way connections.
Figure 5-58 Directed web showing the relationship between the purchase of grocery items and gender

Like Web nodes, the graph displays connections using varying types of lines to indicate connection strength. You can use a Directed Web node, for example, to explore the relationship between gender and a proclivity for certain purchase items.


Setting Options for the Web Node


Figure 5-59 Setting options for a Web node

Web. Select to create a web graph illustrating the strength of relationships between all specified fields.

Directed web. Select to create a directional web graph illustrating the strength of relationships between multiple fields and the values of one field, such as gender or religion. When this option is selected, a To Field is activated, and the Fields control below is renamed From Fields for additional clarity.

Figure 5-60 Directed web options

To Field (directed webs only). Select a flag or set field used for a directed web. Only fields that have not been explicitly set as numeric are listed.

Fields/From Fields. Select fields to create a web graph. Only fields that have not been explicitly set as numeric are listed. Use the Field Chooser button to select multiple fields or select fields by type. Note: For a directed web, this control is used to select From fields.

Show true flags only. Select to display only true flags for a flag field. This option simplifies the web display and is often used for data where the occurrence of positive values is of special importance.


Line values are. Select a threshold type from the drop-down list.

Absolute sets thresholds based on the number of records having each pair of values.

Overall percentages shows the absolute number of cases represented by the link as a proportion of all of the occurrences of each pair of values represented in the web plot.

Percentages of smaller field/value and Percentages of larger field/value indicate which field/value to use for evaluating percentages. For example, suppose 100 records have the value drugY for the field Drug and only 10 have the value LOW for the field BP. If seven records have both values drugY and LOW, this percentage is either 70% or 7%, depending on which field you are referencing, smaller (BP) or larger (Drug). Note: For directed web graphs, the third and fourth options above are not available. Instead, you can select Percentage of To field/value and Percentage of From field/value.
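As a worked version of the example above (an illustrative helper, not part of the product):

def link_percentages(joint_count, count_a, count_b):
    # express a link's joint record count as a percentage of the smaller
    # and of the larger field/value it connects
    smaller, larger = sorted((count_a, count_b))
    return 100.0 * joint_count / smaller, 100.0 * joint_count / larger

# 7 records have both BP = LOW (10 records) and Drug = drugY (100 records)
print(link_percentages(7, 10, 100))   # (70.0, 7.0)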
Strong links are heavier. Selected by default, this is the standard way of viewing links between fields.

Weak links are heavier. Select to reverse the meaning of links displayed in bold lines. This option is frequently used for fraud detection or examination of outliers.

Setting Additional Options for the Web Node


The Options tab for Web nodes contains a number of additional options to customize the output graph.
Figure 5-61 Options tab settings for a Web node


Number of Links. The following controls are used to control the number of links displayed in the output graph. Some of these options, such as Weak links below and Strong links above, are also available in the output graph window. You can also use a slider control in the final graph to adjust the number of links displayed.

Maximum number of links to display. Specify a number indicating the maximum number of links to show on the output graph. Use the arrows to adjust the value.

Show only links above. Specify a number indicating the minimum value for which to show a connection in the web. Use the arrows to adjust the value.

Show all links. Specify to display all links regardless of minimum or maximum values. Selecting this option may increase processing time if there are a large number of fields.

Discard if very few records. Select to ignore connections that are supported by too few records. Set the threshold for this option by entering a number in Min. records/line.

Discard if very many records. Select to ignore strongly supported connections. Enter a number in Max. records/line.

Weak links below. Specify a number indicating the threshold for weak connections (dotted lines) and regular connections (normal lines). All connections below this value are considered weak.

Strong links above. Specify a threshold for strong connections (heavy lines) and regular connections (normal lines). All connections above this value are considered strong.

Link Size. Specify options for controlling the size of links:

Link size varies continuously. Select to display a range of link sizes reflecting the variation in connection strengths based on actual data values.

Link size shows strong/normal/weak categories. Select to display three strengths of connections: strong, normal, and weak. The cutoff points for these categories can be specified above as well as in the final graph.

Web Display. Select a type of web display:

Circle. Select to use the standard web display.

Network layout. Select to use an algorithm to group together the strongest links. This is intended to highlight strong links using spatial differentiation as well as weighted lines.

Directed Layout. Select to create a directed web display that uses the To Field selection from the Plot tab as the focus for the direction.


Grid Layout. Select to create a web display that is laid out in a regularly spaced grid pattern.

Figure 5-62 Network display showing strong connections from frozenmeal and cannedveg to other grocery items

Appearance Options for the Web Plot


The Appearance tab for web plots contains a subset of options available for other types of graphs.
Figure 5-63 Appearance tab settings for a web plot

Title. Enter a title for the graph.

Caption. Enter a caption for the graph.


Show legend. Specifies whether the legend is displayed. For plots with a large number of fields, hiding the legend may improve the appearance of the plot.

Use labels as nodes. Includes the label text within each node rather than displaying adjacent labels. For plots with a small number of fields, this may result in a more readable chart.
Figure 5-64 Web plot showing labels as nodes

Using a Web Graph


Web nodes are used to show the strength of relationships between values of two or more symbolic fields. Connections are displayed in a graph with varying types of lines to indicate connections of increasing strength. You can use a Web node, for example, to explore the relationship between cholesterol levels, blood pressure, and the drug that was effective in treating the patient's illness.

Strong connections are shown with a heavy line. This indicates that the two values are strongly related and should be further explored.

Medium connections are shown with a line of normal weight.

Weak connections are shown with a dotted line.

If no line is shown between two values, this means either that the two values never occur in the same record or that this combination occurs in a number of records below the threshold specified in the Web node dialog box.

Once you have created a Web node, there are several options for adjusting the graph display and generating nodes for further analysis.

Figure 5-65 Web graph indicating a number of strong relationships, such as normal blood pressure with DrugX and high cholesterol with DrugY

For both Web nodes and Directed Web nodes, you can:

Change the layout of the web display.

Hide points to simplify the display.

Change the thresholds controlling line styles.

Highlight lines between values to indicate a selected relationship.

Generate a Select node for one or more selected records, or a Derive Flag node associated with one or more relationships in the web.
To Adjust Points

Move points by clicking the mouse on a point and dragging it to the new location. The web will be redrawn to reflect the new location.

Hide points by right-clicking on a point in the web and choosing Hide or Hide and Replan from the context menu. Hide simply hides the selected point and any lines associated with it. Hide and Replan redraws the web, adjusting for any changes you have made. Any manual moves are undone.

Show all hidden points by choosing Reveal All or Reveal All and Replan from the Web menu in the graph window. Selecting Reveal All and Replan redraws the web, adjusting to include all previously hidden points and their connections.


To Select, or Highlight, Lines


E Left-click to select a line and highlight it in red.

E Continue to select additional lines by repeating this process.

You can deselect lines by choosing Clear Selection from the Web menu in the graph window.
To View the Web Using a Different Layout
E From the Web menu, choose Circle Layout, Network Layout, Directed Layout, or Grid Layout to change the layout of the graph.


To Turn the Links Slider on or off
E From the View menu, choose Links Slider.

To Select or Flag Records for a Single Relationship


E Right-click on the line representing the relationship of interest. E From the context menu, choose Generate Select Node For Link or Generate Derive Node For Link.

A Select node or Derive node is automatically added to the stream canvas with the appropriate options and conditions specified: The Select node selects all records in the given relationship. The Derive node generates a flag indicating whether the selected relationship holds true for records in the entire dataset. The flag field is named by joining the two values in the relationship with an underscore, such as LOW_drugC or drugC_LOW.
To Select or Flag Records for a Group of Relationships
E Select the line(s) in the web display representing relationships of interest. E From the Generate menu in the graph window, choose Select Node (And), Select Node (Or), Derive Node (And), or Derive Node (Or).

The Or nodes give the disjunction of conditions. This means that the node will apply to records for which any of the selected relationships hold. The And nodes give the conjunction of conditions. This means that the node will apply only to records for which all selected relationships hold. An error occurs if any of the selected relationships are mutually exclusive. After you have completed your selection, a Select node or Derive node is automatically added to the stream canvas with the appropriate options and conditions specified.


Adjusting Web Thresholds


After you have created a web graph, you can adjust the thresholds controlling line styles using the toolbar slider to change the minimum visible line. You can also view additional threshold options by clicking the yellow double-arrow button on the toolbar to expand the web graph window. Then click the Controls tab to view additional options.
Figure 5-66 Expanded window featuring display and threshold options

Threshold values are. Shows the type of threshold selected during creation in the Web node

dialog box.
Strong links are heavier. Selected by default, this is the standard way of viewing links between

fields.
Weak links are heavier. Select to reverse the meaning of links displayed in bold lines. This option is frequently used for fraud detection or examination of outliers. Web Display. Specify options for controlling the size of links in the output graph: Size varies continuously. Select to display a range of link sizes reflecting the variation in

connection strengths based on actual data values.


Size shows strong/normal/weak categories. Select to display three strengths of

connections: strong, normal, and weak. The cutoff points for these categories can be specified above as well as in the final graph.
Strong links above. Specify a threshold for strong connections (heavy lines) and regular connections (normal lines). All connections above this value are considered strong. Use the slider to adjust the value or enter a number in the field. Weak links below. Specify a number indicating the threshold for weak connections (dotted lines)

and regular connections (normal lines). All connections below this value are considered weak. Use the slider to adjust the value or enter a number in the field.


After you have adjusted the thresholds for a web, you can replan, or redraw, the web display with the new threshold values by clicking the black replan button on the web graph toolbar. Once you have found settings that reveal the most meaningful patterns, you can update the original settings in the Web node (also called the Parent Web node) by choosing Update Parent Node from the Web menu in the graph window.

Creating a Web Summary


You can create a web summary document that lists strong, medium, and weak links by clicking the yellow double-arrow button on the toolbar to expand the web graph window. Then click the Summary tab to view tables for each type of link. Tables can be expanded and collapsed using the toggle buttons for each.
Figure 5-67 Web summary listing connections between blood pressure, cholesterol, and drug type

Evaluation Chart Node


The Evaluation Chart node offers an easy way to evaluate and compare predictive models to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of the business criterion for each quantile, from highest to lowest. Multiple models are shown as separate lines in the plot.


Outcomes are handled by defining a specific value or range of values as a hit. Hits usually indicate success of some sort (such as a sale to a customer) or an event of interest (such as a specific medical diagnosis). You can define hit criteria on the Options tab of the dialog box, or you can use the default hit criteria as follows: Flag output fields are straightforward; hits correspond to true values. For Set output fields, the first value in the set defines a hit. For Range output fields, hits equal values greater than the midpoint of the field's range. There are five types of evaluation charts, each of which emphasizes a different evaluation criterion.
Gains Charts

Gains are defined as the proportion of total hits that occurs in each quantile. Gains are computed as (number of hits in quantile / total number of hits) × 100%.
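For example, using purely illustrative figures, a quantile that contains 30 of the 150 total hits in the data has a gain of (30 / 150) × 100% = 20%.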
Figure 5-68 Gains chart (cumulative) with baseline, best line, and business rule displayed

Lift Charts

Lift compares the percentage of records in each quantile that are hits with the overall percentage of hits in the training data. It is computed as (hits in quantile / records in quantile) / (total hits / total records).
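For example, using purely illustrative figures, if 60 of the 200 records in a quantile are hits and the training data contain 300 hits among 2,000 records overall, the lift for that quantile is (60 / 200) / (300 / 2,000) = 0.30 / 0.15 = 2.0.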

Figure 5-69 Lift chart (cumulative) using points and best line

Response Charts

Response is simply the percentage of records in the quantile that are hits. Response is computed as (hits in quantile / records in quantile) × 100%.
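For example, using purely illustrative figures, a quantile in which 60 of 200 records are hits has a response of (60 / 200) × 100% = 30%.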
Figure 5-70 Response chart (cumulative) with best line


Profit Charts

Profit equals the revenue for each record minus the cost for the record. Profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records. Profits and costs can be fixed or can be defined by fields in the data. Profits are computed as (sum of revenue for records in quantile − sum of costs for records in quantile).
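For example, using purely illustrative figures, a quantile containing 200 records and 40 hits, with revenue of $25 per hit and a cost of $2 per record, yields a profit of (40 × $25) − (200 × $2) = $1,000 − $400 = $600.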
Figure 5-71 Profit chart (cumulative) with best line

ROI Charts

ROI (return on investment) is similar to profit in that it involves defining revenues and costs. ROI compares profits to costs for the quantile. ROI is computed as (profits for quantile / costs for quantile) × 100%.
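For example, using purely illustrative figures, a quantile that generates $600 in profit against $400 in costs has an ROI of ($600 / $400) × 100% = 150%.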

Figure 5-72 ROI chart (cumulative) with best line

Evaluation charts can also be cumulative, so that each point equals the value for the corresponding quantile plus all higher quantiles. Cumulative charts usually convey the overall performance of models better, whereas noncumulative charts often excel at indicating particular problem areas for models.
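To make the quantile calculations concrete, the following minimal sketch (in Python, not part of Clementine) shows one way the cumulative gains, response, and lift values described above could be computed outside of Clementine, for example from values written with the Export results to file option on the Options tab. The scores and hit flags are hypothetical.

    # Minimal illustrative sketch: cumulative gains, response, and lift per quantile.
    # Assumes a list of predicted scores and matching 0/1 hit flags (hypothetical data).
    def cumulative_evaluation(scores, hits, n_quantiles=10):
        # Sort records by predicted score, best (highest) scores first.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        sorted_hits = [hits[i] for i in order]
        total_records = len(sorted_hits)
        total_hits = sum(sorted_hits)
        overall_rate = total_hits / total_records
        results = []
        for q in range(1, n_quantiles + 1):
            # Each cumulative quantile includes this quantile plus all higher (better) ones.
            cutoff = round(q * total_records / n_quantiles)
            cum_hits = sum(sorted_hits[:cutoff])
            gains = 100.0 * cum_hits / total_hits       # percentage of all hits captured so far
            response = 100.0 * cum_hits / cutoff        # hit rate within the cumulative sample
            lift = (cum_hits / cutoff) / overall_rate   # response relative to the overall hit rate
            results.append((q, gains, response, lift))
        return results

    # Illustrative usage with made-up scores and hit flags, plotted as quintiles:
    scores = [0.91, 0.85, 0.72, 0.66, 0.58, 0.44, 0.37, 0.29, 0.18, 0.05]
    hits = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for q, gains, response, lift in cumulative_evaluation(scores, hits, n_quantiles=5):
        print("quantile %d: gains=%.0f%% response=%.0f%% lift=%.2f" % (q, gains, response, lift))

As in the charts themselves, a model that provides no information would yield gains close to the proportion of records selected and lift close to 1.0 at every quantile.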


Setting Options for the Evaluation Chart Node


Figure 5-73 Setting options for an Evaluation Chart node

Chart type. Select one of the following types: Gains, Response, Lift, Profit, or ROI (return on

investment).
Cumulative plot. Select to create a cumulative chart. Values in cumulative charts are plotted for

each quantile plus all higher quantiles.


Include baseline. Select to include a baseline in the plot, indicating a perfectly random distribution

of hits where confidence becomes irrelevant. (Include baseline is not available for Profit and ROI charts.)
Include best line. Select to include a best line in the plot, indicating perfect confidence (where

hits = 100% of cases).


Find predicted/predictor fields using. Select either Model output field metadata to search for the

predicted fields in the graph using their metadata, or select Field name format to search for them by name.
Plot. Select the size of quantiles to plot in the chart from the drop-down list. Options include
Quartiles, Quintiles, Deciles, Vingtiles, Percentiles, and 1000-tiles.

Style. Select Line or Point. Specify a point type by selecting one from the drop-down list. Options include Dot, Rectangle, Plus, Triangle, Hexagon, Horizontal dash, and Vertical dash.

For Profit and ROI charts, additional controls allow you to specify costs, revenue, and weights.
Costs. Specify the cost associated with each record. You can select Fixed or Variable costs.

For fixed costs, specify the cost value. For variable costs, click the Field Chooser button to select a field as the cost field.


Revenue. Specify the revenue associated with each record that represents a hit. You can select
Fixed or Variable revenue. For fixed revenue, specify the revenue value. For variable revenue, click the Field Chooser button to select a field as the revenue field.

Weight. If the records in your data represent more than one unit, you can use frequency

weights to adjust the results. Specify the weight associated with each record, using Fixed or Variable weights. For fixed weights, specify the weight value (the number of units per record). For variable weights, click the Field Chooser button to select a field as the weight field.
Split by partition. If a partition field is used to split records into training, test, and validation samples, select this option to display a separate evaluation chart for each partition. For more information, see Partition Node in Chapter 4 on p. 119.

Note: When splitting by partition, records with null values in the partition field are excluded from the evaluation. This will never be an issue if a Partition node is used, since Partition nodes do not generate null values.

Setting Additional Options for Evaluation Charts


The Options tab for evaluation charts provides flexibility in defining hits, scoring criteria, and business rules displayed in the chart. You can also set options for exporting the results of the model evaluation.
Figure 5-74 Options tab settings for an Evaluation Chart node

User defined hit. Select to specify a custom condition used to indicate a hit. This option is useful

for defining the outcome of interest rather than deducing it from the type of target field and the


Condition. When User defined hit is selected above, you must specify a CLEM expression for a

hit condition. For example, @TARGET = "YES" is a valid condition indicating that a value of Yes for the target field will be counted as a hit in the evaluation. The specified condition will be used for all target fields. To create a condition, type in the field or use the Expression Builder to generate a condition expression. If the data are instantiated, you can insert values directly from the Expression Builder.
User defined score. Select to specify a condition used for scoring cases before assigning them to quantiles. The default score is calculated from the predicted value and the confidence. Use the Expression field below to create a custom scoring expression. Expression. Specify a CLEM expression used for scoring. For example, if a numeric output in the range 0–1 is ordered so that lower values are better than higher, you might define a hit above as @TARGET < 0.5 and the associated score as 1 − @PREDICTED. The score expression must result in a numeric value. To create a condition, type in the field or use the Expression Builder to generate a condition expression. Include business rule. Select to specify a rule condition reflecting criteria of interest. For example,

you may want to display a rule for all cases where mortgage = "Y" and income >= 33000. Business rules are drawn on the chart and labeled in the key as Rule.
Condition. Specify a CLEM expression used to define a business rule in the output chart. Simply

type in the field or use the Expression Builder to generate a condition expression. If the data are instantiated, you can insert values directly from the Expression Builder.
Export results to file. Select to export the results of the model evaluation to a delimited text file. You can read this file to perform specialized analyses on the calculated values. Set the following options for export: Filename. Enter the filename for the output file. Use the ellipsis button (...) to browse to the

desired directory.
Delimiter. Enter a character, such as a comma or space, to use as the field delimiter. Include field names. Select this option to include field names as the first line of the output file. New line after each record. Select this option to begin each record on a new line.

Reading the Results of a Model Evaluation


The interpretation of an evaluation chart depends to a certain extent on the type of chart, but there are some characteristics common to all evaluation charts. For cumulative charts, higher lines indicate better models, especially on the left side of the chart. In many cases, when comparing multiple models the lines will cross, so that one model will be higher in one part of the chart and another will be higher in a different part of the chart. In this case, you need to consider what portion of the sample you want (which defines a point on the x axis) when deciding which model to choose. Most of the noncumulative charts will be very similar. For good models, noncumulative charts should be high toward the left side of the chart and low toward the right side of the chart. (If a noncumulative chart shows a sawtooth pattern, you can smooth it out by reducing the number of quantiles to plot and re-executing the graph.) Dips on the left side of the chart or spikes on the


right side can indicate areas where the model is predicting poorly. A flat line across the whole graph indicates a model that essentially provides no information.
Gains charts. Cumulative gains charts always start at 0% and end at 100% as you go from left to right. For a good model, the gains chart will rise steeply toward 100% and then level off. A model that provides no information will follow the diagonal from lower left to upper right (shown in the chart if Include baseline is selected). Lift charts. Cumulative lift charts tend to start above 1.0 and gradually descend until they reach

1.0 as you go from left to right. The right edge of the chart represents the entire dataset, so the ratio of hits in cumulative quantiles to hits in data is 1.0. For a good model, lift should start well above 1.0 on the left, remain on a high plateau as you move to the right, and then trail off sharply toward 1.0 on the right side of the chart. For a model that provides no information, the line will hover around 1.0 for the entire graph. (If Include baseline is selected, a horizontal line at 1.0 is shown in the chart for reference.)
Response charts. Cumulative response charts tend to be very similar to lift charts except for the

scaling. Response charts usually start near 100% and gradually descend until they reach the overall response rate (total hits / total records) on the right edge of the chart. For a good model, the line will start near or at 100% on the left, remain on a high plateau as you move to the right, and then trail off sharply toward the overall response rate on the right side of the chart. For a model that provides no information, the line will hover around the overall response rate for the entire graph. (If Include baseline is selected, a horizontal line at the overall response rate is shown in the chart for reference.)
Profit charts. Cumulative profit charts show the sum of profits as you increase the size of the

selected sample, moving from left to right. Profit charts usually start near 0, increase steadily as you move to the right until they reach a peak or plateau in the middle, and then decrease toward the right edge of the chart. For a good model, profits will show a well-defined peak somewhere in the middle of the chart. For a model that provides no information, the line will be relatively straight and may be increasing, decreasing, or level depending on the cost/revenue structure that applies.
ROI charts. Cumulative ROI (return on investment) charts tend to be similar to response charts and

lift charts except for the scaling. ROI charts usually start above 0% and gradually descend until they reach the overall ROI for the entire dataset (which can be negative). For a good model, the line should start well above 0%, remain on a high plateau as you move to the right, and then trail off rather sharply toward the overall ROI on the right side of the chart. For a model that provides no information, the line should hover around the overall ROI value.

Using an Evaluation Chart


Using the mouse to explore an evaluation chart is similar to using a histogram or collection graph.

Figure 5-75 Working with an evaluation chart

The x axis represents model scores across the specified quantiles, such as vingtiles or deciles. You can partition the x axis into bands just as you would for a histogram by clicking with the mouse or using the splitter icon to display options for automatically splitting the axis into equal bands.
Figure 5-76 Splitter icon used to expand the toolbar with options for splitting into bands

You can manually edit the boundaries of bands by selecting Graph Bands from the Edit menu. For more information, see Editing Graph Bands on p. 201.
Using Bands to Produce Feedback

Once you have defined a band, there are numerous ways to delve deeper into the selected area of the graph. Use the mouse in the following ways to produce feedback in the graph window: Hover over bands to provide point-specific information. Check the range for a band by right-clicking inside a band and reading the feedback panel at the bottom of the window. Right-click in a band to bring up a context menu with additional options, such as generating process nodes.


Rename bands by right-clicking in a band and selecting Rename Band. By default, bands are named bandN, where N equals the number of bands from left to right on the x axis. Move the boundaries of a band by selecting a band line with your mouse and moving it to the desired location on the x axis. Delete bands by right-clicking on a line and selecting Delete Band.
Generating Nodes

Once you have created an evaluation chart, defined bands, and examined the results, you can use options on the Generate menu and the context menu to automatically create nodes based upon selections in the graph. Generate a Select or Derive Flag node based on inclusion in a particular band's range of values. Generate a Derive Set node to indicate which band contains the record based upon score and hit criteria for the model.

Selecting a Model
When generating nodes from an Evaluation Chart, you will be prompted to select a single model from all available models in the chart.
Figure 5-77 Selecting a model for node generation

Select a model and click OK to generate the new node onto the stream canvas.

Time Plot Node


Time Plot nodes allow you to view one or more time series plotted over time. The series you plot must contain numeric values and are assumed to occur over a range of time in which the periods are uniform. You usually use a Time Intervals node before a Time Plot node to create a TimeLabel field, which is used by default to label the x axis in the graphs. For more information, see Time Intervals Node in Chapter 4 on p. 128.

Figure 5-78 Plotting sales of men's and women's clothing and jewelry over time

Creating Interventions and Events

You can create Event and Intervention fields from the Time Plot output by generating a derive (flag or set) node from the context menus. For example, you could create an event field in the case of a rail strike, where the derived state is True if the event happened and False otherwise. For an Intervention field, for a price rise for example, you could use a derive count to identify the date of the rise, with 0 for the old price and 1 for the new price. For more information, see Derive Node in Chapter 4 on p. 87.
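As a purely illustrative sketch (the field name day_index is hypothetical), the Derive Flag node for a strike event might use a True when condition such as day_index >= 120 and day_index <= 122, while the Derive node for a price-rise intervention might use a CLEM formula such as if day_index >= 150 then 1 else 0 endif, so that records before the change receive 0 and records from the change onward receive 1.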

Setting Options for the Time Plot Node


Figure 5-79 Setting options for a Time Plot node


Plot. Provides a choice of how to plot time series data. Selected series. Plots values for selected time series. If you select this option when plotting

confidence intervals, deselect the Normalize check box.


Selected Time Series models. Used in conjunction with a Time Series model, this option plots

all the related fields (actual and predicted values, as well as confidence intervals) for one or more selected time series. This option disables some other options on the dialog box. This is the preferred option if plotting confidence intervals.
Series. Select one or more fields with time series data you want to plot. The data must be numeric. X axis label. Choose either the default label or a single field to use as the label for the x axis in plots. If you choose Default, then the system uses the TimeLabel field created from a Time Intervals node upstream or sequential integers if there is no Time Intervals node. For more information, see Time Intervals Node in Chapter 4 on p. 128. Display series in separate panels. Specifies whether each series is displayed in a separate panel.

Alternatively, if you do not choose to panel, all time series are plotted on the same graph, and smoothers will not be available. When plotting all time series on the same graph, each series will be represented by a different color.
Normalize. Select to scale all Y values to the range 0–1 for display on the graph. Normalizing

helps you explore the relationship between lines that might otherwise be obscured due to differences in the range of values for each series and is recommended when plotting multiple lines on the same graph, or when comparing plots in side-by-side panels. (Normalizing is not necessary when all data values fall within a similar range.)
Display. Select one or more elements to display in your plot. You can choose from lines, points,

and (LOESS) smoothers. Smoothers are available only if you display the series in separate panels. By default, the line element is selected. Make sure you select at least one plot element before you execute the graph node; otherwise, the system will return an error stating that you have selected nothing to plot.
Point Type. If you choose to plot points, select the symbol that will represent points in the graph. Limit records. Select this option if you want to limit the number of records plotted. Maximum number of records to plot. Specify the number of records, read from the beginning of

your data file, that will be plotted. If you want to plot the last n records in your data file, you can use a Sort node prior to this node to arrange the records in descending order by time.

Appearance Options for the Time Plot


The Layout option on the Appearance tab allows you to specify whether time values are plotted along a horizontal or vertical axis. Other options are similar to those for other graphs. For more information, see Setting Appearance Options for Graphs on p. 161.

Figure 5-80 Appearance tab settings for a time plot

Using a Time Plot Graph


Once you have created a Time Plot graph, there are several options for adjusting the graph display and generating nodes for further analysis.
Figure 5-81 Comparing catalog sales of men's and women's clothing and jewelry over days

For example, you can: Split the range of values on the x axis into bands. Generate a Select or Derive Flag node based on inclusion in a particular band's range of values. Generate a Derive Set node to indicate the band into which a record's values fall.


To Define a Band

You can either use the mouse to interact with the graph, or you can use the Edit Graph Bands dialog box to specify the boundaries of bands and other related options. For more information, see Editing Graph Bands on p. 201. To use the mouse for defining a band: Click anywhere in the graph to set a line defining a band of values. Once you have defined a band, there are numerous ways to delve deeper into the selected area of the graph. Use the mouse in the following ways to produce feedback in the graph window: Check the range of values for a band by right-clicking inside a band and reading the feedback panel at the bottom of the window. Simply right-click in a band to bring up a context menu with additional options, such as generating process nodes. Rename bands by right-clicking in a band and selecting Rename Band. By default, bands are named bandN, where N equals the number of bands from left to right on the x axis. Move the boundaries of a band by selecting a band line with your mouse and moving it to the desired location on the x axis. Delete bands by right-clicking on a line and selecting Delete Band. Once you have created a time plot, defined bands, and examined the results, you can use options on the Generate menu and the context menu to create Select or Derive nodes.
Figure 5-82 Generate and context menus showing options for generating nodes and renaming bands


To Select or Flag Records in a Particular Band


E Right-click in the band. Notice that the details for the band are displayed in the feedback panel

below the plot.


E From the context menu, choose Generate Select Node for Band or Generate Derive Node for Band.

A Select node or Derive node is automatically added to the stream canvas with the appropriate options and conditions specified. The Select node selects all records in the band. The Derive node generates a flag for records whose values fall within the band. The flag field name corresponds to the band name, with flags set to T for records inside the band and F for records outside.
To Derive a Set for Records in All Regions
E From the Generate menu in the graph window, choose Derive Node. E A new Derive Set node appears on the stream canvas with options set to create a new field called

band for each record. The value of that field equals the name of the band that each record falls into.

Chapter 6

Modeling Overview
Overview of Modeling Nodes

Clementine offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the Modeling palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems. Modeling nodes are packaged by module. For more information, see Clementine Modules in Chapter 1 in Clementine 11.1 Users Guide. Detailed documentation on the modeling algorithms is also available. For more information, see the Clementine Algorithms Guide, available from the Windows Start menu by choosing Start >
[All] Programs > SPSS Clementine 11.1 > Documentation.

Available modeling nodes include the following:


Binary Classification models model yes-or-no outcomes using a number of methods.
The Binary Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or don't, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify. For more information, see Binary Classifier Node in Chapter 8 on p. 263.

Screening models can be used to locate fields and records that are most likely to be of interest in

modeling or can be used to identify outliers that do not fit known patterns.


The Feature Selection node screens predictor fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining predictors relative to a specified target. For example, given a dataset with hundreds of potential predictors, which are most likely to be useful in modeling patient outcomes? For more information, see Feature Selection Node in Chapter 7 on p. 247. The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of normal data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see Anomaly Detection Node in Chapter 7 on p. 254.



Decision List models consist of a list of rules in which each rule has a condition and an outcome.

Rules are applied in order, and the first rule that matches determines the outcome.
The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side in order to compare the results. For more information, see Decision List in Chapter 11 on p. 333.

Decision Tree models allow you to develop classification systems that predict or classify future

observations based on a set of decision rules.


The Classification and Regression Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups). For more information, see C&R Tree Node in Chapter 9 on p. 296. The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&RT and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and predictor fields can be range or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see CHAID Node in Chapter 9 on p. 305. The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&RT analyses while also reducing the tendency found in classification tree methods to favor predictors that allow more splits. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. For more information, see QUEST Node in Chapter 9 on p. 307. The C5.0 node builds either a decision tree or a ruleset. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see C5.0 Node in Chapter 9 on p. 308.

Neural Network models use a simplified model of the way information is processed by the human

brain.
The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply. For more information, see Neural Net Node in Chapter 10 on p. 323.


Statistical models use mathematical equations to encode information extracted from the data.
Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values. For more information, see Linear Regression Node in Chapter 12 on p. 364. Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see Logistic Regression Node in Chapter 12 on p. 372. The Factor/PCA node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarizes the information in the original set of fields. For more information, see Factor Analysis/PCA Node in Chapter 12 on p. 390. The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see Generalized Linear Models Node in Chapter 12 on p. 405. Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see Discriminant Node in Chapter 12 on p. 398.

Clustering models focus on identifying groups of similar records.


The K-Means node clusters the dataset into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see K-Means Node in Chapter 13 on p. 426. The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large datasets efficiently. For more information, see TwoStep Cluster Node in Chapter 13 on p. 431.


The Kohonen node generates a type of neural network that can be used to cluster the dataset into distinct groups. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. You can look at the number of observations captured by each unit in the generated model to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see Kohonen Node in Chapter 13 on p. 419.

Association models associate a particular conclusion (such as a decision to buy something) with a

set of conditions.
The Generalized Rule Induction (GRI) node discovers association rules in the data. For example, customers who purchase razors and aftershave lotion are also likely to purchase shaving cream. GRI extracts rules with the highest information content based on an index that takes both the generality (support) and accuracy (confidence) of rules into account. GRI can handle numeric and categorical inputs, but the target must be categorical. For more information, see GRI Node in Chapter 14 on p. 450. The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently. For large problems, Apriori is generally faster to train than GRI; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see Apriori Node in Chapter 14 on p. 452. The CARMA model extracts a set of rules from the data without requiring you to specify In (predictor) or Out (target) fields. In contrast to Apriori and GRI, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support. This means that the rules generated can be used for a wider variety of applications: for example, to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. For more information, see CARMA Node in Chapter 14 on p. 456. The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see Sequence Node in Chapter 14 on p. 476.

Time Series models produce forecasts of future performance from existing time series data.
The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecast data. A Time Series node must always be preceded by a Time Intervals node. For more information, see Time Series Node in Chapter 15 on p. 495.


Self-learning models use the latest data, no matter how small, to reestimate an existing model.
The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or small number of new cases, can be used to re-estimate the model without having to retrain the model using all data. For more information, see SLRM Node in Chapter 16 on p. 515.

Modeling Node Fields Options


All modeling nodes have a Fields tab, where you can specify the fields to be used in building the model.
Figure 6-1 Fields tab for C&R Tree node

Before you can build a model, you need to specify which fields you want to use as targets and as inputs. With a few exceptions, all modeling nodes will use field information from an upstream Type node. If you are using a Type node to select input and target fields, you don't need to change anything on this tab. (Exceptions include the Sequence node and the Text Extraction node, which require that field settings be specified in the modeling node.)
Use type node settings. This option tells the node to use field information from an upstream Type

node. This is the default.


Use custom settings. This option tells the node to use field information specified here instead of

that given in any upstream Type node(s). After selecting this option, specify the fields below:
Target. For models that require one or more target fields, select the target field(s). This is

similar to setting a field's direction to Out in a Type node.


Inputs. Select the input field(s). This is similar to setting a field's direction to In in a Type node. Partition. This field allows you to specify a field used to partition the data into separate

samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a different sample to test it, you can get a good indication of how well the model will generalize to larger datasets that are similar to the current data.


If multiple partition fields have been defined by using Type or Partition nodes, a single partition field must be selected on the Fields tab in each modeling node that uses partitioning. (If only one partition is present, it is automatically used whenever partitioning is enabled.) For more information, see Partition Node in Chapter 4 on p. 119. Also note that to apply the selected partition in your analysis, partitioning must also be enabled in the Model Options tab for the node. (Deselecting this option makes it possible to disable partitioning without changing field settings.)
Use frequency field. This option allows you to select a field as a frequency weight. Use

this if the records in your training data represent more than one unit each: for example, if you are using aggregated data. The field values should be the number of units represented by each record. Note: Values for a frequency field should be positive integers. Frequency weights affect calculation of branch instances for C&RT models. Records with a negative or zero frequency weight are excluded from the analysis. Non-integer frequency weights are rounded to the nearest integer.
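For example, with aggregated data in which one record summarizes three identical transactions, a frequency value of 3 causes that record to be counted as three units when the model is built.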
Use weight field. This option allows you to select a field as a case weight. Case weights are

used to account for differences in variance across levels of the output field. Note: These weights are used in model estimation but do not affect calculation of branch instances for C&RT models. Case weight values should be positive but need not be integer values. Records with a negative or zero case weight are excluded from the analysis.
Consequents. For rule induction nodes (Apriori and GRI), select the fields to be used as

consequents in the resulting ruleset. (This corresponds to fields with type Out or Both in a Type node.)
Antecedents. For rule induction nodes (Apriori and GRI), select the fields to be used as

antecedents in the resulting ruleset. (This corresponds to fields with type In or Both in a Type node.)
Transactional data format (Apriori, CARMA, and DB2 Association nodes only). Data in this format

have two fields: one for an ID and one for content. Each record represents a single item, and associated items are linked by having the same ID. For more information, see Tabular versus Transactional Data in Chapter 14 on p. 449.
Tabular data format (Apriori and CARMA nodes only). Tabular data have items represented by

separate flags, and each record represents a complete set of associated items. For more information, see Tabular versus Transactional Data in Chapter 14 on p. 449. Some models have a Fields tab that differs from those described above. For more information, see Sequence Node Fields Options in Chapter 14 on p. 476. For more information, see CARMA Node Fields Options in Chapter 14 on p. 456.
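As a simple illustration of the transactional and tabular formats described above, using hypothetical grocery data, transactional format lists one item per record, linked by ID:

    ID     Content
    1001   bread
    1001   milk
    1002   bread

whereas the equivalent tabular format uses one record per ID with a flag field for each item:

    ID     bread   milk
    1001   T       T
    1002   T       F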


Overview of Generated Models


Generated models are the fruits of your data modeling labor. A generated model node is created whenever you successfully execute a modeling node. Generated models contain information about the model created and provide a mechanism for using that model to generate predictions and facilitate further data mining. Generated models are placed in the generated models palette (located on the Models tab in the managers window in the upper right corner of the Clementine window), where they are represented by diamond-shaped icons (and occasionally called nuggets). From there, they can be selected and browsed to view details of the model. Generated models other than unrefined rule models can be placed into the stream to generate predictions or to allow further analysis of their properties. You can identify the type of a generated model node from its icon, which is based on the design for the node used to create the model in most cases. A sampling is shown below (note that not all model types are pictured).
Icon table (icons not reproduced here); node types shown include: Neural Network, Kohonen Net, C5.0 Tree model, Linear Regression Equation, Ruleset, K-Means model, Logistic Regression Equation, C&R Tree model, Factor/PCA Equation, Sequence set, Apriori model, CARMA model, QUEST tree, Time Series model, Feature Selection model, Anomaly Detection model, CHAID tree, and Unrefined models, such as GRI and CEMI models (generated models palette only).


The following topics provide information on using generated models in Clementine. For an in-depth understanding of the algorithms used in Clementine, see the Clementine Algorithms Guide, available on the product CD.

The Models Palette


The generated models palette (on the Models tab in the managers window) allows you to use, examine, and modify generated model nodes in various ways.
Figure 6-2 Generated models palette

Right-clicking a generated model node in the generated models palette opens a context menu with the following options for modifying the node:
Figure 6-3 Generated model context menu

Add To Stream. Adds the generated model node to the currently active stream. If there is a

selected node in the stream, the generated model node will be connected to the selected node when such a connection is possible.
Browse. Opens the model browser for the node. Rename and Annotate. Allows you to rename the generated model node and/or modify the

annotation for the node.


Save Model. Saves the node to an external file. Store Model. Stores the model in SPSS Predictive Enterprise Repository. For more

information, see Predictive Enterprise Repository in Chapter 9 in Clementine 11.1 Users Guide.


Export PMML. Exports the model as predictive model markup language (PMML), which can

be used with SPSS SmartScore for scoring new data outside of Clementine. Export PMML is available for all generated model nodes except those created by CEMI modeling nodes. A separate license is required to access this feature. For more information, see Setting PMML Export Options in Chapter 3 in Clementine 11.1 Users Guide.
Add to Project. Saves the generated model and adds it to the current project. On the Classes

tab, the node will be added to the Generated Models folder. On the CRISP-DM tab, it will be added to the default project phase. (See Setting the Default Project Phase for information on how to change the default project phase.)
Delete. Deletes the node from the palette.
Figure 6-4 Generated models palette context menu

Right-clicking an unoccupied area in the generated models palette opens a context menu with the following options:
Open Model. Loads a generated model previously created in Clementine. Load Palette. Loads a saved palette from an external file. Save Palette. Saves the entire contents of the generated models palette to an external file. Clear Palette. Deletes all nodes from the palette. Add To Project. Saves the generated models palette and adds it to the current project. On the

Classes tab, the node will be added to the Generated Models folder. On the CRISP-DM tab, it will be added to the default project phase.
Import PMML. Loads a model from an external file. You can open, browse, and score PMML

models created by SPSS and AnswerTree.

Browsing Generated Models


The generated model browsers allow you to examine and use the results of your models. From the browser, you can save, print, or export the generated model, examine the model summary, and view or edit annotations for the model. For some types of generated models, you can also generate new nodes, such as Filter nodes or Ruleset nodes. For some models, you can also view model parameters, such as rules or cluster centers. For some types of models (tree-based models and cluster models), you can view a graphical representation of the structure of the model. Controls for using the generated model browsers are described below.


Menus

File menu. All generated models have a File menu, containing the following options: Save Node. Saves the generated model node to a file. Close. Closes the current generated model browser. Header and Footer. Allows you to edit the page header and footer for printing from the node. Page Setup. Allows you to change the page setup for printing from the node. Print Preview. Displays a preview of how the node will look when printed. Select the

information you want to preview from the submenu.


Print. Prints the contents of the node. Select the information you want to print from the

submenu.
Export Text. Exports the contents of the node to a text file. Select the information you want to

export from the submenu.


Export HTML. Exports the contents of the node to an HTML file. Select the information you

want to export from the submenu.


Export PMML. Exports the model as predictive model markup language (PMML), which

can be used with other PMML-compatible software. A separate license is required to access this feature.
Export SQL. Exports the model as structured query language (SQL), which can be edited and

used with other databases. Note: SQL Export is available only from the following models: C5, C&RT, CHAID, Linear Regression, Logistic Regression, Neural Net, PCA/Factor, and QUEST.
Generate menu. Most generated models also have a Generate menu, allowing you to generate

new nodes based on the generated model. The options available from this menu will depend on the type of model you are browsing. See the specic generated model type for details about what you can generate from a particular model.

Generated Model Summary


The Summary tab for a generated model displays information about the fields, build settings, and model estimation process. Results are presented in a tree view that can be expanded or collapsed by clicking specific items.
Analysis. Displays information about the model. Specific details vary by model type, and are

covered in the section for each generated model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537.
Fields. Lists the fields used as the target and the inputs in building the model. Build Settings. Contains information about the settings used in building the model.


Training Summary. Shows the type of model, the stream used to create it, the user who created it,

when it was built, and the elapsed time for building the model.

Using Generated Models in Streams


The generated models can be placed in streams to score new data and generate new nodes. Scoring data allows you to use the information gained from model building to create predictions for new records. For some models, generated model nodes can also give you additional information about the quality of the prediction, such as confidence values or distances from cluster centers. Generating new nodes allows you to easily create new nodes based on the structure of the generated model. For example, most models that perform input field selection allow you to generate Filter nodes that will pass only input fields that the model identified as important.
To use a generated model node for scoring data:
E Select the desired model by clicking it in the generated models palette. E Add the model to the stream by clicking the desired location in the stream canvas. E Connect the generated model node to a data source or stream that will pass data to it. Figure 6-5 Using a generated model for scoring

E Add or connect one or more processing or output nodes (such as a Table or Analysis node) to

the generated model node.


E Execute one of the nodes downstream from the generated model node.

Note: You cannot use the Unrefined Rule node (the results of creating a GRI model) for scoring data. To score data based on a GRI association rule model, use the Unrefined Rule node to generate a Ruleset node, and use the Ruleset node for scoring. For more information, see Generating a Ruleset from an Association Model in Chapter 14 on p. 468.


To use a generated model node for generating processing nodes:


E On the palette, browse the model, or, on the stream canvas, edit the model. E Select the desired node type from the Generate menu of the generated model browser window.

The options available will vary, depending on the type of generated model node. See the specic generated model type for details about what you can generate from a particular model.

Regenerating a Modeling Node


If you have a generated model that you want to modify or update and the stream used to create the model is not available, you can regenerate a modeling node with the same options used to create the original model.
E To rebuild a model, click on the model in the generated models palette and select Generate Modeling Node. E Alternatively, when browsing any model, choose Modeling Node from the Generate menu.

The regenerated modeling node should be functionally identical to the one used to create the original model in most cases. For Decision Tree models, additional settings specified during the interactive session may also be stored with the node, and the Use tree directives option will be enabled in the regenerated modeling node. For more information, see Tree Node Model Options in Chapter 9 on p. 297. For Decision List models, the Use saved session information option will be enabled. For more information, see Decision List Model Options in Chapter 11 on p. 338. For Time Series models, the Reuse Stored Settings option is enabled, allowing you to regenerate the previous model with current data. For more information, see Time Series Model Options in Chapter 15 on p. 498.

Importing and Exporting Models as PMML


PMML, or predictive model markup language, is an XML format for describing data mining and statistical models, including inputs to the models, transformations used to prepare data for data mining, and the parameters that dene the models themselves. Clementine can import and export PMML, making it possible to share models with other applications that support this format, such as SPSS or SPSS Categorize. For more information about PMML, see the data mining group Web site (http://www.dmg.org).
To Export a Model

PMML export is supported for most of the model types generated in Clementine. For more information, see Model Types Supporting PMML in Clementine 11.1 Users Guide.
E Right-click a model on the Models tab in the managers window. E From the context menu, choose Export PMML.

Figure 6-6 Exporting a model in PMML format

E In the Export dialog box, specify a target directory and a unique name for the model.

Note: You can change options for PMML export in the User Options dialog box. For more information, see Setting PMML Export Options in Chapter 3 in Clementine 11.1 Users Guide.
To Import a Model Saved as PMML

Models exported as PMML from Clementine or another application can be imported into the generated models palette. For more information, see Model Types Supporting PMML in Clementine 11.1 Users Guide.
E In the generated models palette, right-click on the palette and select Import PMML from the

context menu.
Figure 6-7 Importing a model in PMML format

E Select the file to import and specify options for variable and value labels as desired.

Figure 6-8 Selecting the XML file for a model saved using PMML

Use variable labels. The PMML may specify both variable names and variable labels (such as Referrer ID for RefID) for variables in the data dictionary. Select this option to use variable labels if they are present in the originally exported PMML. Use value labels. The PMML may specify both values and value labels (such as Male for M

or Female for F) for a variable. Select this option to use the value labels if they are present in the PMML. If you have selected the above label options but there are no variable or value labels in the PMML, the variable names and literal values are used as normal. By default, both options are selected.


Model Types Supporting PMML


PMML Export

Clementine models. All models created in Clementine can be exported as PMML 3.1, with the exception of the following:

Model type               PMML Export (version 3.1)
PCA/Factor               not available
Text Extraction          not available
Feature Selection        not available
Anomaly Detection        not available
Time Series              not available
Unrefined (GRI, CEMI)    not available

Database native models. For models generated using database-native algorithms, PMML export is

available for IBM Intelligent Miner models only. Models created using Analysis Services from Microsoft or Oracle Data Miner cannot be exported. Also note that IBM models exported as PMML cannot be imported back into Clementine. For more information, see Database Modeling Overview in Chapter 2 in Clementine 11.1 In-Database Mining Guide.
PMML 3.1 Import

Clementine can import and score PMML 3.1 models generated by current versions of all SPSS products, including models exported from Clementine as well as model or transformation PMML generated by SPSS 15.0. Essentially, this means any PMML that the SPSS Smartscore component can score, with the following exceptions:
Apriori, CARMA, and Anomaly Detection models cannot be imported.
PMML models may not be browsed after importing into Clementine, even though they can be used in scoring. (Note that this includes models that were originally exported from Clementine. To avoid this limitation, export the model as a generated model file (*.gm) rather than PMML.)
Models that cannot be scored will not be imported.
IBM Intelligent Miner models exported as PMML cannot be imported back into Clementine.
Importing Earlier Versions of PMML (2.1 or 3.0)

PMML import for legacy models exported from earlier releases of Clementine (prior to 11.0) is supported for some, but not all, model types, as indicated below:
Model type               PMML Import (2.1 or 3.0)
Neural Network           not available
C&R Tree                 yes
CHAID Tree               yes
QUEST Tree               yes
C5.0 Tree                not available
Ruleset                  not available
Kohonen Net              not available
K-Means                  not available
TwoStep                  yes
Linear Regression        yes
Logistic Regression      yes
Factor/PCA               not available
Sequence                 not available
CARMA                    not available
Apriori                  not available
Text Extraction          not available
Feature Selection        not available
Anomaly Detection        not available
Unrefined (GRI, CEMI)    not available

Unrefined Models
A GRI model and those models generated by using the Clementine External Module Interface (CEMI) contain information extracted from the data but are not designed for generating predictions directly. This means they cannot be added to streams. Unrefined models appear as diamonds in the rough on the generated models palette.
Figure 6-9 Unrefined model icon

To see information about the unrefined rule model, right-click the model and choose Browse from the context menu. Like other models generated in Clementine, the various tabs provide summary and rule information about the model created.
Generating nodes. The Generate menu allows you to create new nodes based on the rules.
Select Node. Generates a Select node to select records to which the currently selected rule applies. This option is disabled if no rule is selected.
Rule set. Generates a Ruleset node to predict values for a single target field. For more information, see Generating a Ruleset from an Association Model in Chapter 14 on p. 468.

Chapter 7

Screening Models
Screening Fields and Records

Several modeling nodes can be used during the preliminary stages of an analysis in order to locate fields and records that are most likely to be of interest in modeling. You can use the Feature Selection node to screen and rank fields by importance and the Anomaly Detection node to locate unusual records that do not conform to the known patterns of normal data.
The Feature Selection node screens predictor fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining predictors relative to a specified target. For example, given a dataset with hundreds of potential predictors, which are most likely to be useful in modeling patient outcomes? For more information, see Feature Selection Node on p. 247. The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of normal data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see Anomaly Detection Node on p. 254.

Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model without regard for any specific target (dependent) field and regardless of whether those fields are relevant to the pattern you are trying to predict. For this reason, you may want to use anomaly detection in combination with feature selection or another technique for screening and ranking fields. For example, you can use feature selection to identify the most important fields relative to a specific target and then use anomaly detection to locate the records that are the most unusual with respect to those fields. (An alternative approach would be to build a decision tree model and then examine any misclassified records as potential anomalies. However, this method would be more difficult to replicate or automate on a large scale.)

Feature Selection Node


This node is available with the Classification module. Data mining problems may involve hundreds, or even thousands, of fields that can potentially be used as predictors. As a result, a great deal of time and effort may be spent examining which fields or variables to include in the model. To narrow down the choices, the Feature Selection algorithm can be used to identify the fields that are most important for a given analysis. For example, if you are trying to predict patient outcomes based on a number of factors, which factors are the most likely to be important?


Feature selection consists of three steps:


Screening. Removes unimportant and problematic predictors and records or cases, such as predictors with too many missing values or predictors with too much or too little variation to be useful.
Ranking. Sorts remaining predictors and assigns ranks based on importance.
Selecting. Identifies the subset of features to use in subsequent models; for example, by preserving only the most important predictors and filtering or excluding all others.
In an age where many organizations are overloaded with too much data, the benefits of feature selection in simplifying and speeding the modeling process can be substantial. By focusing attention quickly on the fields that matter most, you can reduce the amount of computation required; more easily locate small but important relationships that might otherwise be overlooked; and, ultimately, obtain simpler, more accurate, and more easily explainable models. By reducing the number of fields used in the model, you may find that you can reduce scoring times as well as the amount of data collected in future iterations. Paring down the number of fields may be particularly useful for models such as Logistic Regression, which imposes a limit of 350 fields.
Example. A telephone company has a data warehouse containing information about responses to a special promotion by 5,000 of the company's customers. The data includes a large number of fields containing customers' ages, employment, income, and telephone usage statistics. Three target fields show whether or not the customer responded to each of three offers. The company wants to use this data to help predict which customers are most likely to respond to similar offers in the future. For more information, see Screening Predictors (Feature Selection) in Chapter 6 in Clementine 11.1 Applications Guide.
Requirements. A single target (Out) field, along with multiple predictors you want to screen or rank relative to the target. Both target and predictor fields can be numeric range or categorical.

Feature Selection Model Settings


This node is available with the Classification module. The settings on the Model tab include standard model options along with settings that allow you to fine-tune the criteria for screening predictors.

Figure 7-1 Feature Selection Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Screening Predictors

Screening involves removing predictors or cases that do not add any useful information with respect to the predictor/target relationship. Screening options are based on attributes of the field in question without respect to predictive power relative to the selected target field. Screened fields are excluded from the computations used to rank predictors and optionally can be filtered or removed from the data used in modeling. Fields can be screened based on the following criteria:
Maximum percentage of missing values. Screens fields with too many missing values, expressed as a percentage of the total number of records. Fields with a large percentage of missing values provide little predictive information.
Maximum percentage of records in a single category. Screens fields that have too many records falling into the same category relative to the total number of records. For example, if 95% of the customers in the database drive the same type of car, including this information is not useful in distinguishing one customer from the next. Any fields that exceed the specified maximum are screened. This option applies to categorical fields only.
Maximum number of categories as a percentage of records. Screens fields with too many categories relative to the total number of records. If a high percentage of the categories contains only a single case, the field may be of limited use. For example, if every customer wears a different hat, this information is unlikely to be useful in modeling patterns of behavior. This option applies to categorical fields only.


Minimum coefficient of variation. Screens fields with a coefficient of variation less than or equal to the specified minimum. This measure is the ratio of the predictor standard deviation to the predictor mean. If this value is near zero, there is not much variability in the values for the variable. This option applies to numeric range fields only.
Minimum standard deviation. Screens fields with standard deviation less than or equal to the specified minimum. This option applies to numeric range fields only.


Records with missing data. Records or cases that have missing values for the target field, or missing values for all predictors, are automatically excluded from all computations used in the rankings.
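For readers who want to reproduce the flavor of these screening rules outside of Clementine, the following pandas sketch applies similar criteria to a DataFrame. The threshold values and the way fields are classified as categorical or numeric are illustrative assumptions, not the algorithm's actual defaults.

# Illustrative sketch of the screening criteria described above (not Clementine's implementation).
import pandas as pd

def screen_predictors(df, target,
                      max_missing_pct=70.0, max_single_cat_pct=90.0,
                      max_categories_pct=95.0, min_cv=0.1, min_std=0.0):
    df = df[df[target].notna()]                 # records missing the target are excluded
    screened = []
    for col in (c for c in df.columns if c != target):
        s = df[col]
        if s.isna().mean() * 100 > max_missing_pct:
            screened.append((col, 'too many missing values')); continue
        if s.dtype.kind in 'Ob':                # treat object/boolean columns as categorical
            counts = s.value_counts(dropna=True)
            if len(counts) and counts.iloc[0] / len(s) * 100 > max_single_cat_pct:
                screened.append((col, 'single category dominates')); continue
            if s.nunique() / len(s) * 100 > max_categories_pct:
                screened.append((col, 'too many categories')); continue
        else:                                   # numeric range field
            mean, std = s.mean(), s.std()
            if std <= min_std:
                screened.append((col, 'standard deviation too low')); continue
            if mean != 0 and abs(std / mean) <= min_cv:
                screened.append((col, 'coefficient of variation too low'))
    return screened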

Feature Selection Options


This node is available with the Classification module. The Options tab allows you to specify the default settings for selecting or excluding predictor fields in the generated model. You can then add the model to a stream to select a subset of fields for use in subsequent model-building efforts. Alternatively, you can override these settings by selecting or deselecting additional fields in the model browser after generating the model. However, the default settings make it possible to apply the generated model without further changes, which may be particularly useful for purposes of scripting or batch mode automation. For more information, see Feature Selection Model Results on p. 252.
Figure 7-2 Feature Selection Options tab

The following options are available:


All fields ranked. Selects fields based on their ranking as important, marginal, or unimportant. You can edit the label for each ranking as well as the cutoff values used to assign records to one rank or another.
Top number of fields. Selects the top n fields based on importance.
Importance greater than. Selects all fields with importance greater than the specified value.


The target field is always preserved regardless of the selection.


Importance Ranking Options
Importance. A measure used to rank fields or results on a percentage scale, defined broadly as 1 minus the p value, or the probability of obtaining a result as extreme or more extreme than the observed result by chance alone. The measure used to rank importance depends on whether the predictors and the target are all categorical, all numeric ranges, or a mix of range and categorical. Despite the differences in computation, the use of a standard percentage scale allows comparisons across different types of fields and results.
All categorical. When all predictors and the target are categorical, importance can be ranked based on any of four measures:
Pearson chi-square. Tests for independence of the target and the predictor without indicating the strength or direction of any existing relationship.
Likelihood-ratio chi-square. Similar to Pearson's chi-square but also tests for target-predictor independence.
Cramer's V. A measure of association based on Pearson's chi-square statistic. Values range from 0, which indicates no association, to 1, which indicates perfect association.
Lambda. A measure of association reflecting the proportional reduction in error when the variable is used to predict the target value. A value of 1 indicates the predictor perfectly predicts the target, while a value of 0 means the predictor provides no useful information about the target.
Some categorical. When some (but not all) predictors are categorical and the target is also categorical, importance can be ranked based on either the Pearson or likelihood-ratio chi-square. (Cramer's V and lambda are not available unless all predictors are categorical.)
Categorical versus continuous. When ranking a categorical predictor against a continuous target or vice versa (one or the other is categorical but not both), the F statistic is used.
Both continuous. When ranking a continuous predictor against a continuous target, the t statistic based on the correlation coefficient is used.
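As a rough illustration of the "1 minus p value" scale, the sketch below pairs each combination of field types with a standard test from SciPy. It is not Clementine's exact computation (for example, it omits Cramer's V and lambda), but it shows how a single percentage-style scale can be derived from different tests.

# Hedged sketch of the importance measure using SciPy tests.
import pandas as pd
from scipy import stats

def importance(df, predictor, target):
    sub = df[[predictor, target]].dropna()
    x, y = sub[predictor], sub[target]
    x_cat = x.dtype.kind in 'Ob'
    y_cat = y.dtype.kind in 'Ob'
    if x_cat and y_cat:
        # all categorical: Pearson chi-square on the contingency table
        _, p, _, _ = stats.chi2_contingency(pd.crosstab(x, y))
    elif x_cat or y_cat:
        # one categorical, one continuous: F statistic (one-way ANOVA)
        cat, num = (x, y) if x_cat else (y, x)
        groups = [num[cat == level] for level in cat.unique()]
        _, p = stats.f_oneway(*groups)
    else:
        # both continuous: t test based on the correlation coefficient
        _, p = stats.pearsonr(x, y)
    return 1.0 - p          # importance on a 0..1 scale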

Generated Feature Selection Models


Generated Feature Selection models display the importance of each predictor relative to a selected target, as ranked by the Feature Selection node. Any fields that were screened out prior to the ranking are also listed. For more information, see Feature Selection Node on p. 247. When you execute a stream containing a generated Feature Selection model, the model acts as a filter that preserves only selected predictors, as indicated by the current selection on the Model tab. For example, you could select all fields ranked as important (one of the default options) or manually select a subset of fields on the Model tab. The target field is also preserved regardless of the selection. All other fields are excluded. Filtering is based on the field name only; for example, if you select age and income, any field that matches either of these names will be preserved. The model does not update field rankings based on new data; it simply filters fields based on the selected names. For this reason, care should be used in applying the model to new or updated data. When in doubt, regenerating the model is recommended.


Feature Selection Model Results


The Model tab for a generated Feature Selection model displays the rank and importance of all predictors in the upper pane and allows you to select fields for filtering by using the check boxes in the column on the left. When you execute the stream, only the checked fields are preserved. The other fields are discarded. The default selections are based on the options specified in the model-building node, but you can select or deselect additional fields as needed. The lower pane lists predictors that have been excluded from the rankings based on the percentage of missing values or on other criteria specified in the modeling node. As with the ranked fields, you can choose to include or discard these fields by using the check boxes in the column on the left. For more information, see Feature Selection Model Settings on p. 248.
Figure 7-3 Feature Selection model results

To sort the list by rank, field name, importance, or any of the other displayed columns, double-click on the column header. Or, to use the toolbar, select the desired item from the Sort By list, and use the up and down arrows to change the direction of the sort.


You can use the toolbar to check or uncheck all fields and to access the Check Fields dialog box, which allows you to select fields by rank or importance. You can also press the Shift and Ctrl keys while clicking on fields to extend the selection and use the space bar to toggle on or off a group of selected fields. For more information, see Selecting Fields by Importance on p. 253. The threshold values for ranking predictors as important, marginal, or unimportant are displayed in the legend below the table. These values are specified in the model-building node. For more information, see Feature Selection Options on p. 250.

Selecting Fields by Importance


When scoring data using a generated Feature Selection model, all fields selected from the list of ranked or screened fields (as indicated by the check boxes in the column on the left) will be preserved. Other fields will be discarded. To change the selection, you can use the toolbar to access the Check Fields dialog box, which allows you to select fields by rank or importance.
Figure 7-4 Check fields dialog box

All fields marked. Selects all fields marked as important, marginal, or unimportant.
Top number of fields. Allows you to select the top n fields based on importance.
Importance greater than. Selects all fields with importance greater than the specified threshold.

Generating a Filter from a Feature Selection Model


Based on the results of a Feature Selection model, you can generate one or more Filter nodes that include or exclude subsets of fields based on importance relative to the specified target. While the generated model can also be used as a filter, this gives you the flexibility to experiment with different subsets of fields without copying or modifying the model. The target field is always preserved by the filter regardless of whether include or exclude is selected.

Figure 7-5 Generating a Filter node

Include/Exclude. You can choose to include or exclude fields; for example, to include the top 10 fields or exclude all fields marked as unimportant.
Selected fields. Includes or excludes all fields currently selected in the table.
All fields marked. Selects all fields marked as important, marginal, or unimportant.
Top number of fields. Allows you to select the top n fields based on importance.
Importance greater than. Selects all fields with importance greater than the specified threshold.
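Conceptually, the generated filter keeps or drops columns by name. The sketch below mimics the include/exclude choice on a pandas DataFrame; the importance table, threshold, and top-n values are hypothetical inputs rather than anything produced by Clementine.

# Simple sketch of include/exclude filtering by importance (illustrative only).
import pandas as pd

def apply_field_filter(df, importance, target, mode='include',
                       threshold=0.95, top_n=None):
    """importance: dict mapping field name -> importance on a 0..1 scale."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    chosen = ranked[:top_n] if top_n else [f for f in ranked if importance[f] > threshold]
    if mode == 'include':
        keep = [f for f in chosen if f in df.columns]
    else:                                   # exclude the chosen fields instead
        keep = [f for f in df.columns if f not in chosen and f != target]
    return df[keep + [target]]              # the target field is always preserved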

Anomaly Detection Node


This node is available with the Segmentation module. Anomaly detection models are used to identify outliers, or unusual cases, in the data. Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like. This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications, such as fraud detection, where new patterns may constantly be emerging. Anomaly detection is an unsupervised method, which means that it does not require a training dataset containing known cases of fraud to use as a starting point. While traditional methods of identifying outliers generally look at one or two variables at a time, anomaly detection can examine large numbers of fields to identify clusters or peer groups into which similar records fall. Each record can then be compared to others in its peer group to identify possible anomalies. The further away a case is from the normal center, the more likely it is to be unusual. For example, the algorithm might lump records into three distinct clusters and flag those that fall far from the center of any one cluster.

Figure 7-6 Using clustering to identify potential anomalies

Each record is assigned an anomaly index, which is the ratio of the group deviation index to its average over the cluster that the case belongs to. The larger the value of this index, the greater the case's deviation from the average. Under usual circumstances, cases with anomaly index values less than 1 or even 1.5 would not be considered anomalies, because the deviation is about the same as or only a bit more than the average. However, cases with an index value greater than 2 could be good anomaly candidates because the deviation is at least twice the average. Anomaly detection is an exploratory method designed for quick detection of unusual cases or records that should be candidates for further analysis. These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error or artifact in the data collection process. Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model without regard for any specific target (dependent) field and regardless of whether those fields are relevant to the pattern you are trying to predict. For this reason, you may want to use anomaly detection in combination with feature selection or another technique for screening and ranking fields. For example, you can use feature selection to identify the most important fields relative to a specific target and then use anomaly detection to locate the records that are the most unusual with respect to those fields. (An alternative approach would be to build a decision tree model and then examine any misclassified records as potential anomalies. However, this method would be more difficult to replicate or automate on a large scale.)
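The index itself is easy to mimic once records have been grouped. The following sketch uses scikit-learn's KMeans purely as a stand-in for the peer-group clustering (the actual two-stage algorithm differs) to illustrate the deviation ratio and the rule-of-thumb cutoff of 2.

# Hedged sketch of the anomaly index idea; not Clementine's algorithm.
import numpy as np
from sklearn.cluster import KMeans

def anomaly_index(X, n_peer_groups=3, cutoff=2.0):
    km = KMeans(n_clusters=n_peer_groups, n_init=10, random_state=0).fit(X)
    # per-record deviation from the center of its assigned peer group
    deviation = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    index = np.empty_like(deviation)
    for g in range(n_peer_groups):
        mask = km.labels_ == g
        index[mask] = deviation[mask] / deviation[mask].mean()
    return index, index >= cutoff     # anomaly index and anomaly flag per record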
Example. In screening agricultural development grants for possible cases of fraud, anomaly detection can be used to discover deviations from the norm, highlighting those records that are abnormal and worthy of further investigation. You are particularly interested in grant applications that appear to claim too much (or too little) money for the type and size of farm. For more information, see Fraud Screening (Anomaly Detection/Neural Net) in Chapter 7 in Clementine 11.1 Applications Guide.
Requirements. One or more input fields. Note that only fields with Direction set to In using a source or Type node can be used as inputs. Target fields (Direction set to Out or Both) are ignored.
Strengths. By flagging cases that do not conform to a known set of rules rather than those that do, Anomaly Detection models can identify unusual cases even when they don't follow previously known patterns. When used in combination with feature selection, anomaly detection makes it possible to screen large amounts of data to identify the records of greatest interest relatively quickly.

Anomaly Detection Model Options


This node is available with the Segmentation module.
Figure 7-7 Anomaly Detection Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Determine cutoff value for anomaly based on. Specifies the method used to determine the cutoff value for flagging anomalies. The following options are available:


Minimum anomaly index level. Specifies the minimum cutoff value for flagging anomalies. Records that meet or exceed this threshold are flagged.


Percentage of most anomalous records in the training data. Automatically sets the threshold at a level that flags the specified percentage of records in the training data. The resulting cutoff is included as a parameter in the model. Note that this option determines how the cutoff value is set, not the actual percentage of records to be flagged during scoring. Actual scoring results may vary depending on the data.
Number of most anomalous records in the training data. Automatically sets the threshold at a level that flags the specified number of records in the training data. The resulting threshold is included as a parameter in the model. Note that this option determines how the cutoff value is set, not the specific number of records to be flagged during scoring. Actual scoring results may vary depending on the data.
Note: Regardless of how the cutoff value is determined, it does not affect the underlying anomaly index value reported for each record. It simply specifies the threshold for flagging records as anomalous when estimating or scoring the model. If you later want to examine a larger or smaller number of records, you can use a Select node to identify a subset of records based on the anomaly index value ($O-AnomalyIndex > X).
Number of anomaly fields to report. Specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are reported, defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned.
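To convey why a flagged record singles out particular fields, the sketch below computes a simple per-field standardized deviation from the peer group's mean and reports the n largest. The real variable deviation index is computed differently, so treat this only as an illustration of the idea; it also assumes numeric fields and a default integer index.

# Illustrative sketch: which fields deviate most from the record's peer group.
import numpy as np
import pandas as pd

def top_anomaly_fields(df, peer_group_labels, record_index, n_fields=3):
    group = peer_group_labels[record_index]
    peers = df[peer_group_labels == group]
    # standardized difference of this record from its peer-group norms
    z = (df.loc[record_index] - peers.mean()) / peers.std().replace(0, np.nan)
    return z.abs().sort_values(ascending=False).head(n_fields)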

Anomaly Detection Expert Options


This node is available with the Segmentation module. To specify options for missing values and other settings, set the mode to Expert on the Expert tab.

Figure 7-8 Anomaly Detection Expert tab

Adjustment coefficient. Value used to balance the relative weight given to numeric range and categorical fields in calculating the distance. Larger values increase the influence of numeric range fields. This must be a nonzero value.
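As a loose illustration of what this coefficient does, the sketch below weights the numeric portion of a record-to-record distance by an adjustment value. The algorithm's actual log-likelihood distance is different; this only shows how one term can be given more or less influence.

# Illustrative mixed numeric/categorical distance (not the algorithm's actual measure).
import numpy as np

def mixed_distance(a, b, numeric_idx, categorical_idx, adjustment=1.0):
    # squared differences for numeric range fields, simple mismatch for categorical fields
    num = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    cat = sum(0 if a[i] == b[i] else 1 for i in categorical_idx)
    # larger adjustment values increase the influence of the numeric fields
    return np.sqrt(adjustment * num + cat)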
Automatically calculate number of peer groups. Anomaly detection can be used to rapidly analyze a large number of possible solutions to choose the optimal number of peer groups for the training data. You can broaden or narrow the range by setting the minimum and maximum number of peer groups. Larger values will allow the system to explore a broader range of possible solutions; however, the cost is increased processing time.
Specify number of peer groups. If you know how many clusters to include in your model, select this option and enter the number of peer groups. Selecting this option will generally result in improved performance.
Noise level and ratio. These settings determine how outliers are treated during two-stage clustering. In the first stage, a cluster feature (CF) tree is used to condense the data from a very large number of individual records to a manageable number of clusters. The tree is built based on similarity measures, and when a node of the tree gets too many records in it, it splits into child nodes. In the second stage, hierarchical clustering commences on the terminal nodes of the CF tree. Noise handling is turned on in the first data pass, and it is off in the second data pass. The cases in the noise cluster from the first data pass are assigned to the regular clusters in the second data pass.
Noise level. Specify a value between 0 and 0.5. This setting is relevant only if the CF tree fills during the growth phase, meaning that it cannot accept any more cases in a leaf node and that no leaf node can be split.


If the CF tree fills and the noise level is set to 0, the threshold will be increased and the CF tree regrown with all cases. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of -1. The outlier cluster is not included in the count of the number of clusters; that is, if you specify n clusters and noise handling, the algorithm will output n clusters and one noise cluster. In practical terms, increasing this value gives the algorithm more latitude to fit unusual records into the tree rather than assign them to a separate outlier cluster. If the CF tree fills and the noise level is greater than 0, the CF tree will be regrown after placing any data in sparse leaves into their own noise leaf. A leaf is considered sparse if the ratio of the number of cases in the sparse leaf to the number of cases in the largest leaf is less than the noise level. After the tree is grown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded for the second phase of clustering.
Noise ratio. Specifies the portion of memory allocated for the component that should be used for noise buffering. This value ranges between 0.0 and 0.5. If inserting a specific case into a leaf of the tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split, adding another small cluster to the CF tree. In practical terms, increasing this setting may cause the algorithm to gravitate more quickly toward a simpler tree.
Impute missing values. For numeric range fields, substitutes the field mean in place of any missing values. For categorical fields, missing categories are combined and treated as a valid category. If this option is deselected, any records with missing values are excluded from the analysis.

Generated Anomaly Detection Models


Generated Anomaly Detection models contain all of the information captured by the Anomaly Detection model as well as information about the training data and estimation process. When you execute a stream containing a generated Anomaly Detection model, a number of new fields are added to the stream, as determined by the selections made on the Settings tab in the generated model. For more information, see Anomaly Detection Model Settings on p. 261. New field names are based on the model name, prefaced by $O, as summarized in the following table:
$O-Anomaly          Flag field indicating whether or not the record is anomalous.
$O-AnomalyIndex     The anomaly index value for the record.
$O-PeerGroup        Specifies the peer group to which the record is assigned.
$O-Field-n          Name of the nth most anomalous field in terms of deviation from the cluster norm.
$O-FieldImpact-n    Variable deviation index for the field. This value measures the deviation from the field norm for the cluster to which the record is assigned.

Optionally, you can suppress scores for non-anomalous records to make the results easier to read.

Figure 7-9 Scoring results with non-anomalous records suppressed

Anomaly Detection Model Details


The Model tab for a generated Anomaly Detection model displays information about the peer groups in the model.
Figure 7-10 Anomaly Detection model details


Note that the peer group sizes and statistics reported are estimates based on the training data and may differ slightly from actual scoring results even if run on the same data.

Anomaly Detection Model Summary


The Summary tab for a generated Anomaly Detection model displays information about the fields, build settings, and estimation process. The number of peer groups is also shown, along with the cutoff value used to flag records as anomalous.
Figure 7-11 Anomaly Detection model summary

Anomaly Detection Model Settings


The Settings tab allows you to specify options for scoring the generated model.

Figure 7-12 Scoring options for an Anomaly Detection model

Indicate anomalous records with. Specifies how anomalous records are treated in the output.
Flag and index. Creates a flag field that is set to True for all records that exceed the cutoff value included in the model. The anomaly index is also reported for each record in a separate field. For more information, see Anomaly Detection Model Options on p. 256.
Flag only. Creates a flag field but without reporting the anomaly index for each record.
Index only. Reports the anomaly index without creating a flag field.
Number of anomaly fields to report. Specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are reported, defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned.
Discard records. Select this option to discard all non-anomalous records from the stream, making it easier to focus on potential anomalies in any downstream nodes. Alternatively, you can choose to discard all anomalous records in order to limit the subsequent analysis to those records that are not flagged as potential anomalies based on the model. Note: Due to slight differences in rounding, the actual number of records flagged during scoring may not be identical to those flagged while training the model, even if run on the same data.

Chapter 8

Binary Classifier Node


This node is available with the Classification module.

The Binary Classifier node allows you to create and compare models for binary (yes/no) outcomes using a number of different methods, making it easier to try out a variety of approaches and compare the results. You can select the specific modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. For example, rather than choose between the quick, dynamic, or prune method for a Neural Net, you can try them all. The node generates a set of models based on the specified options and ranks the candidates based on the criteria you specify.
Figure 8-1 Binary Classifier modeling results

Example. A bank wants to be able to predict whether a given customer is likely to default on a loan. Using a Binary Classifier node, you can generate a number of models that you can use to classify customers as good or bad credit risks.
Requirements. A single target field of type Flag (Direction = Out) and at least one predictor (In) field. The True value defined for the target field is assumed to represent a hit when calculating profits, lift, and related statistics. Predictor fields can be numeric ranges or categorical, although any categorical predictors must have numeric storage (not string). If necessary, a Reclassify node can be used to convert them. For more information, see Reclassify Node in Chapter 4 on p. 105. Numeric range predictors can be binned in some cases; see the sections on specific algorithms for details.
Supported Algorithms

Supported algorithms include Neural Net, Decision Trees (C5.0, C&RT, QUEST, and CHAID), Logistic Regression, and Decision List.
The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply. For more information, see Neural Net Node in Chapter 10 on p. 323.
The C5.0 node builds either a decision tree or a ruleset. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see C5.0 Node in Chapter 9 on p. 308.
The Classification and Regression Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups). For more information, see C&R Tree Node in Chapter 9 on p. 296.
The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&RT analyses while also reducing the tendency found in classification tree methods to favor predictors that allow more splits. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. For more information, see QUEST Node in Chapter 9 on p. 307.
The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&RT and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and predictor fields can be range or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see CHAID Node in Chapter 9 on p. 305.
The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side in order to compare the results. For more information, see Decision List in Chapter 11 on p. 333.


Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see Logistic Regression Node in Chapter 12 on p. 372.

Models and Execution Time

Powerful though it is to be able to compare a large number of models in a single pass, be aware that depending on the number of models and the size of the dataset, the node may take hours or even days to execute. When selecting options, pay attention to the number of models being produced. When practical, you may want to schedule modeling runs during nights or weekends when system resources are less likely to be in demand. If necessary, a Partition or Sample node can be used to reduce the number of records included in the initial training pass. Once you have narrowed the choices to a few candidate models, the full dataset can be restored. See Sample Node or Partition Node for more information. To reduce the number of input fields, use Feature Selection. For more information, see Feature Selection Node in Chapter 7 on p. 247. Optionally, you can limit the amount of time spent estimating any one model. For more information, see Binary Classifier Node Expert Options on p. 267.

Binary Classifier Node Model Options


This node is available with the Classification module. The Model tab of the Binary Classifier node allows you to specify the number of models to be saved, along with the criteria used to compare models.

Figure 8-2 Binary Classifier node, Model tab

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Rank models by. Specifies the criteria used to compare models, regardless of which algorithm is used. Note that the True value defined for the target field is assumed to represent a hit when calculating profits, lift, and related statistics.
Overall accuracy. The percentage of records that is correctly predicted by the model relative to the total number of records.
Area under the ROC curve. The ROC curve provides an index for the performance of a model. The further the curve lies above the reference line, the more accurate the test.

Profit (Cumulative). The sum of profits across cumulative percentiles (sorted in terms of confidence for the prediction), as computed based on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top percentile, increases steadily, and then decreases. For a good model, profits will show a well-defined peak, which is reported along with the percentile where it occurs. For a model that provides no information, the profit curve will be relatively straight and may be increasing, decreasing, or level, depending on the cost/revenue structure that applies.


Lift (Cumulative). The ratio of hits in cumulative quantiles relative to the overall sample (where quantiles are sorted in terms of confidence for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no information, the lift will hover around 1.0.
Number of variables. Ranks models based on the number of variables used.
Rank models using. If a partition is in use, you can specify whether ranks are based on the training dataset or the testing set. With large datasets, use of a partition for preliminary screening of models may greatly improve performance.
Maximum models listed in summary report. Specifies the maximum number of models to be listed in the summary report produced by the node. The top-ranking models will be listed according to the specified ranking criterion. Note that increasing this limit may slow performance. The maximum allowable value is 100.
Profit Criteria. Profit equals the revenue for each record minus the cost for the record. Profits for a quantile are simply the sum of profits for all records in the quantile. Profits are assumed to apply only to hits, but costs apply to all records.
Costs. Specify the cost associated with each record. You can select Fixed or Variable costs. For fixed costs, specify the cost value. For variable costs, click the Field Chooser button to select a field as the cost field.
Revenue. Specify the revenue associated with each record that represents a hit. You can select Fixed or Variable revenue. For fixed revenue, specify the revenue value. For variable revenue, click the Field Chooser button to select a field as the revenue field.
Weight. If the records in your data represent more than one unit, you can use frequency weights to adjust the results. Specify the weight associated with each record, using Fixed or Variable weights. For fixed weights, specify the weight value (the number of units per record). For variable weights, click the Field Chooser button to select a field as the weight field.
Lift Criteria. Specifies the percentile to use for lift calculations. Note that you can also change this value when comparing the results. For more information, see Binary Classifier Results Browser on p. 270.
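For readers who want to see how criteria of this kind are typically computed, the sketch below evaluates one model's scores with NumPy and scikit-learn. The fixed cost and revenue values and the 30th-percentile lift cutoff are examples only, and the formulas are simplified relative to the node's actual calculations.

# Illustrative ranking metrics for one candidate model (simplified).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def rank_metrics(y_true, confidence, cost=1.0, revenue=5.0, percentile=30):
    y_true = np.asarray(y_true)
    confidence = np.asarray(confidence)
    y_pred = (confidence >= 0.5).astype(int)
    order = np.argsort(-confidence)                  # highest confidence first
    hits = y_true[order]
    k = max(1, int(len(hits) * percentile / 100))    # size of the top quantile
    lift = hits[:k].mean() / hits.mean()             # cumulative lift (assumes some hits exist)
    profit = revenue * hits[:k].sum() - cost * k     # cumulative profit with fixed cost/revenue
    return {'accuracy': accuracy_score(y_true, y_pred),
            'area_under_curve': roc_auc_score(y_true, confidence),
            'lift': lift,
            'profit': profit}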

Binary Classifier Node Expert Options


This node is available with the Classification module. The Expert tab of the Binary Classifier node allows you to apply a partition (if available), select the algorithms to use, and specify stopping rules.

Figure 8-3 Binary Classifier node, Expert tab

Models used. Use the check boxes in the column on the left to specify the model types (algorithms) to include in the comparison. The more types you select, the more models will be created, and the processing time will be longer.
Model parameters. For each model type, you can use the default settings, or select Specify to choose options for each model type. The specific options are similar to those available in the separate modeling nodes, with the difference that multiple options or combinations can be selected. For example, if comparing Neural Net models, rather than choose one of the six training methods, you can choose all of them to train six models in a single pass. Supported algorithms are listed at the beginning of this section on p. 263.
Number of models. Lists the number of models produced for each algorithm based on current settings. When combining options, the number of models can quickly add up, so paying close attention to this number is strongly recommended, particularly when using large datasets.
Restrict maximum time spent building a single model. Sets a maximum time limit for any one model. For example, if a particular model requires an unexpectedly long time to train because of some complex interaction, you probably don't want it to hold up your entire modeling run.

Binary Classifier Node Stopping Rules


This node is available with the Classification module.


Stopping rules specified for the Binary Classifier node relate to the overall node execution, not the stopping of individual models built by the node.
Figure 8-4 Stopping rules

Restrict overall execution time. Stops execution after a specified number of hours. All models generated up to that point will be included in the results, but no further models will be produced.
A model is built that meets all selected filter criteria. Stops execution when a model passes all criteria specified on the Discard tab. For more information, see Binary Classifier Node Discard Options on p. 269.

Binary Classifier Node Discard Options


This node is available with the Classification module. The Discard tab of the Binary Classifier node allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report.

Figure 8-5 Binary Classifier node, Discard tab

You can specify a minimum threshold for overall accuracy, lift, profit, and area under the curve, and specify a maximum threshold for the number of variables used in the model. Lift and profit are determined as specified in the modeling node. For more information, see Binary Classifier Node Model Options on p. 265. Optionally, you can configure the node to stop execution the first time a model is generated that meets all specified criteria. For more information, see Binary Classifier Node Stopping Rules on p. 268.

Binary Classifier Results Browser


This node is available with the Classification module. The Binary Classifier node produces a report that summarizes the results of the modeling run. For each model listed in the report, the build time, profit, lift, and accuracy are displayed. You can sort the table on any of these columns to quickly identify the most interesting models.

Figure 8-6 Binary Classifier modeling results

Toolbar Options

Use the toolbar to show or hide specific columns or to change the column used to sort the table. (You can also change the sort by clicking on the column headers.) If a partition is in use, you can choose to view results for the training or testing partition as applicable. For more information, see Partition Node in Chapter 4 on p. 119.
Ranking and Comparing Models

The table presents a number of metrics that can be used to rank models. See Binary Classifier Node Model Options on p. 265 for an overview of these measures. For the maximum profit, the percentile in which the maximum occurs is also reported. For cumulative lift, you can change the selected percentile using the toolbar.

Generating Nodes and Models


You can generate a model or modeling node for any of the models listed in the Binary Classifier Results browser.
E Under the Generate column in the table, select one or more models.
E From the Generate menu, select Model(s) to Palette to add each to the Models palette. Each generated model can be saved or used as is without reexecuting the stream.

E Alternatively, you can select Modeling Node(s) from the Generate menu to add one or more modeling nodes to the stream canvas. These nodes can be used to reestimate the selected models without repeating the entire Binary Classifier modeling run.

Generating Evaluation Charts


You can generate an evaluation chart for any of the models listed in the report. Evaluation charts offer a visual way to evaluate and compare predictive models. For more information, see Evaluation Chart Node in Chapter 5 on p. 215.
Figure 8-7 Response chart (cumulative) with best line and baseline

E Under the Generate column in the Binary Classifier Results browser, select the models that you want to evaluate.


E From the Generate menu, choose Evaluation Chart(s).
Figure 8-8 Generating an evaluation chart

E Select the chart type and other options as desired. For more information, see Setting Options for the Evaluation Chart Node in Chapter 5 on p. 220.

Chapter 9

Decision Trees
Decision Tree Models

Decision tree models allow you to develop classification systems that predict or classify future observations based on a set of decision rules. If you have data divided into classes that interest you (for example, high- versus low-risk loans, subscribers versus nonsubscribers, voters versus nonvoters, or types of bacteria), you can use your data to build rules that you can use to classify old or new cases with maximum accuracy. For example, you might build a tree that classifies credit risk or purchase intent based on age and other factors.
Figure 9-1 Tree window

This approach, sometimes known as rule induction, has several advantages. First, the reasoning process behind the model is clearly evident when browsing the tree. This is in contrast to other black box modeling techniques in which the internal logic can be difficult to work out.


Figure 9-2 Simple decision tree

Second, the process will automatically include in its rule only the attributes that really matter in making a decision. Attributes that do not contribute to the accuracy of the tree are ignored. This can yield very useful information about the data and can be used to reduce the data to relevant fields only before training another learning technique, such as a neural net. Generated decision tree models can be converted into a collection of if-then rules (a ruleset), which in many cases show the information in a more comprehensible form. The decision-tree presentation is useful when you want to see how attributes in the data can split, or partition, the population into subsets relevant to the problem. The ruleset presentation is useful if you want to see how particular groups of items relate to a specific conclusion. For example, the following rule gives us a profile for a group of cars that is worth buying:
IF mot = 'yes' AND mileage = 'low' THEN -> 'BUY'.

Tree-Building Algorithms

Four algorithms are available for performing classification and segmentation analysis. These algorithms all perform basically the same thing: they examine all of the fields of your database to find the one that gives the best classification or prediction by splitting the data into subgroups. The process is applied recursively, splitting subgroups into smaller and smaller units until the tree is finished (as defined by certain stopping criteria). The target and input fields used in tree building can be numeric ranges or categorical, depending on the algorithm used. If a range target is used, a regression tree is generated; if a categorical target is used, a classification tree is generated.
The Classification and Regression Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups). For more information, see C&R Tree Node on p. 296.


The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&RT and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and predictor fields can be range or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see CHAID Node on p. 305.
The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&RT analyses while also reducing the tendency found in classification tree methods to favor predictors that allow more splits. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. For more information, see QUEST Node on p. 307.
The C5.0 node builds either a decision tree or a ruleset. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see C5.0 Node on p. 308.

General Uses of Tree-Based Analysis

The following are some general uses of tree-based analysis:


Segmentation. Identify persons who are likely to be members of a particular class.
Stratification. Assign cases into one of several categories, such as high-, medium-, and low-risk groups.
Prediction. Create rules and use them to predict future events. Prediction can also mean attempts to relate predictive attributes to values of a continuous variable.
Data reduction and variable screening. Select a useful subset of predictors from a large set of variables for use in building a formal parametric model.
Interaction identification. Identify relationships that pertain only to specific subgroups and specify these in a formal parametric model.
Category merging and banding continuous variables. Recode group predictor categories and continuous variables with minimal loss of information.

The Tree Builder


You can generate a tree model automatically, allowing the algorithm to choose the best split at each level, or you can use the interactive Tree Builder to take control, applying your business knowledge to refine or simplify the tree before saving the generated model.
E Create a stream and add one of the tree-building nodes, either C&RT, CHAID, or QUEST. (Note: Interactive tree-building is not supported for C5.0 trees.)


E On the Model tab, select Interactive Tree.
E Select target and predictor fields and specify additional model options as needed. For specific instructions, see the documentation for each tree-building node.


E Execute the stream to launch the Tree Builder.

Figure 9-3 Tree window

The current tree is displayed, starting with the root node. You can edit and prune the tree level-by-level and access gains, risks, and related information before generating one or more models.
Comments

With the C&RT, CHAID, and QUEST nodes, any ordered set fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them. For more information, see Reclassify Node in Chapter 4 on p. 105. Optionally, you can use a partition field to separate the data into training and test samples. For more information, see Partition Node in Chapter 4 on p. 119. As an alternative to using the Tree Builder, you can also generate a model directly from the build node as with other Clementine models. For more information, see Generating a Tree Model Directly on p. 312.

Growing and Pruning the Tree


The Viewer tab in the Tree Builder displays the current tree, starting with the root node.

E To grow the tree, from the menus choose:
Tree
Grow Tree

The system builds the tree by recursively splitting each branch until one or more stopping criteria are met. At each split, the best predictor is automatically selected based on the modeling method used.
E Alternatively, select Grow Tree One Level to add a single level.
E To add a branch below a specific node, select the node and select Grow Branch.
E To choose the predictor used for a split, select the desired node and select Grow Branch with Custom Split. For more information, see Defining Custom Splits on p. 277.
E To prune a branch, select a node and select Remove Branch from the Tree menu to clear up the selected node.
E To remove the bottom level from the tree, select Remove One Level.
E For C&R trees only, select Grow Tree and Prune to prune based on a cost-complexity algorithm that adjusts the risk estimate based on the number of terminal nodes, typically resulting in a simpler tree. For more information, see C&R Tree Node on p. 296.
Interrupting tree growth. To interrupt a tree-growing operation (if it is taking longer than expected, for example), click the interrupt button on the toolbar. (The button is enabled only during tree growth, when it displays a red square.) This stops the current growing operation at its current point, leaving any nodes that have already been added, without saving changes or closing the window. The Tree Builder remains open, allowing you to generate a model, update directives, or export output in the appropriate format as needed.

Defining Custom Splits


The Define Split dialog box allows you to select the predictor and specify conditions for each split.
E In the Tree Builder, select a node on the Viewer tab, and from the menus choose:
Tree
Grow Branch with Custom Split

Figure 9-4 Define Split dialog box

E Select the desired predictor from the drop-down list, or click on the ellipsis button (...) to view details on each predictor. For more information, see Viewing Predictor Details on p. 278.
E You can accept the default conditions for each split or select Custom to specify conditions for the split as appropriate. For numeric ranges, you can specify the range of values that fall into each new node (for example, New Node 1 >= 100). For categorical predictors, you can specify the specific values (or range of values in case of an ordered set) that map to each new node.
E Select Grow to regrow the branch using the selected predictor.

The tree can generally be split using any predictor, regardless of stopping rules. The only exceptions are when the node is pure (meaning that 100% of cases fall into the same target class, thus there is nothing left to split) or the chosen predictor is constant (there is nothing to split against).
Missing values. For CHAID trees only, if missing values are available for a given predictor, you have the option when dening a custom split to assign them to a specic child node. (With C&RT and QUEST trees, missing values are handled using surrogates as dened in the algorithm. For more information, see Split Details and Surrogates on p. 279.)

Viewing Predictor Details


The Select Predictor dialog box displays statistics on available predictors (or competitors as they are sometimes called) that can be used for the current split.

Figure 9-5 Select Predictor dialog box

For CHAID and exhaustive CHAID, the chi-square statistic is listed for each categorical predictor; if a predictor is a numeric range, the F statistic is shown. The chi-square statistic is a measure of how independent the target field is from the splitting field. A high chi-square statistic generally corresponds to a lower probability, meaning that there is less chance that the two fields are independent, which is an indication that the split is a good one. Degrees of freedom are also included because these take into account the fact that it is easier for a three-way split to have a large statistic and small probability than it is for a two-way split.
For C&RT and QUEST, the improvement for each predictor is displayed. The greater the improvement, the greater the reduction in impurity between the parent and child nodes if that predictor is used. (A pure node is one in which all cases fall into a single target category; the lower the impurity across the tree, the better the model fits the data.) In other words, a high improvement figure generally indicates a useful split for this type of tree. The impurity measure used is specified in the model-building node. For more information, see C&R Tree Node Expert Options on p. 301.

Split Details and Surrogates


You can select any node in the Viewer tab and select the split information button on the right side of the toolbar to view details about the split for that node. The split rule used is displayed, along with relevant statistics. For C&RT categorical trees, improvement and association are displayed. The association is a measure of correspondence between a surrogate and the primary split field, with the best surrogate generally being the one that most closely mimics the split field. For C&RT and QUEST trees, any surrogates used in place of the primary predictor are also listed.

Figure 9-6 Tree Builder window with split information displayed

E To edit the split for the selected node, you can click the icon on the left side of the surrogates panel to open the Define Split dialog box. (As a shortcut, you can select a surrogate from the list before clicking the icon to select it as the primary split field.)
Surrogates

Where applicable, any surrogates for the primary split field are shown for the selected node. Surrogates are alternate fields used if the primary predictor value is missing for a given record. The maximum number of surrogates allowed for a given split is specified in the model-building node, but the actual number depends on the training data. In general, the more missing data, the more surrogates are likely to be used. For other decision tree models, this tab is empty.
Note: To be included in the model, surrogates must be identified during the training phase. If the training sample has no missing values, then no surrogates will be identified, and any records with missing values encountered during testing or scoring will automatically fall into the child node with the largest number of records. If missing values are expected during testing or scoring, be sure that values are missing from the training sample as well. Surrogates are not available for CHAID trees. Although surrogates are not used for CHAID trees, when defining a custom split you have the option to assign missing values to a specific child node. For more information, see Defining Custom Splits on p. 277.

Customizing the Tree View


The Viewer tab in the Tree Builder displays the current tree. By default, all branches in the tree are expanded, but you can expand and collapse branches and customize other settings as needed.
Figure 9-7 Left-to-right view with split details, node graphs, and labels visible

Click the minus sign (-) next to a parent node to hide all of its child nodes. Click the plus sign (+) next to a parent node to display its child nodes.
Use the View menu or toolbar to change the orientation of the tree (top-down, left-to-right, or right-to-left).
Click the label icon on the toolbar to show or hide field and value labels.
Use the magnifying glass icons to zoom the view in or out, or click the tree map button on the far right side of the toolbar to view a diagram of the complete tree.
If a partition field is in use, you can swap the tree view between training and testing partitions (View menu, Partition). When the testing sample is displayed, the tree can be viewed but not edited. (The current partition is displayed in the status bar in the lower right corner of the window.)
Click the split information icon to view details on the current split. For more information, see Split Details and Surrogates on p. 279.
Display statistics, graphs, or both within each node (see below).

Displaying Statistics and Graphs

Node statistics. For a categorical target field, the table in each node shows the number and percentage of records in each category and the percentage of the entire sample that the node represents. For a range (numeric) target field, the table shows the mean, standard deviation, number of records, and predicted value of the target field.

Node graphs. For a categorical target field, the graph is a bar chart of percentages in each category of the target field. Preceding each row in the table is a color swatch that corresponds to the color that represents each of the target field categories in the graphs for the node. For a range (numeric) target field, the graph shows a histogram of the target field for records in the node.

Gains
The Gains tab displays statistics for all terminal nodes in the tree. Gains provide a measure of how far the mean or proportion at a given node differs from the overall mean. Generally speaking, the greater this difference, the more useful the tree is as a tool for making decisions. For example, an index or lift value of 148% for a node indicates that records in the node are about one-and-a-half times as likely to fall under the target category as for the dataset as a whole.
Figure 9-8 Gains tab

The Gains tab allows you to:
Display node-by-node, cumulative, or quantile statistics.
Display gains or profits.
Swap the view between tables and charts.
Select the target category (categorical targets only).
Sort the table in ascending or descending order based on the index percentage. If statistics for multiple partitions are displayed, sorts are always applied on the training sample rather than on the testing sample.
In general, selections made in the gains table will be updated in the tree view and vice versa. For example, if you select a row in the table, the corresponding node will be selected in the tree. This may not apply in certain situations, for example, when viewing quantile statistics for training and testing samples side-by-side (since node groups may not be the same between training and testing). If training and testing samples have been defined, statistics for each sample are displayed side-by-side. For more information, see Gains for Partitions on p. 292.

Classification Gains

For classification trees (those with a categorical target variable), the gain index percentage tells you how much the proportion of a given target category at each node differs from the overall proportion.
Node-by-Node Statistics

In this view, the table displays one row for each terminal node. For example, if the overall response to your direct mail campaign was 10% but 20% of the records that fall into node X responded positively, the index percentage for the node would be 200%, indicating that respondents in this group are twice as likely to buy relative to the overall population.
Figure 9-9 Node-by-node gain statistics

Nodes. The ID of the current node (as displayed on the Viewer tab).


Node n. The total number of records at that node.

Node %. The percentage of all records in the dataset that fall into this node.

Gain n. The number of records with the selected target category that fall into this node. In other words, of all the records in the dataset that fall under the target category, how many are in this node?

Gain %. The percentage of all records in the target category, across the entire dataset, that fall into this node.

Response %. The percentage of records in the current node that fall under the target category. Responses in this context are sometimes referred to as hits.

Index %. The response percentage for the current node expressed as a percentage of the response percentage for the entire dataset. For example, an index value of 300% indicates that records in this node are three times as likely to fall under the target category as for the dataset as a whole.
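The relationships among these columns can be reproduced with a short calculation. The following Python sketch is purely illustrative (the record counts are invented and the code is not part of Clementine); it derives Node %, Gain %, Response %, and Index % for a single terminal node from raw counts.

# Illustrative only: node-by-node gain statistics from raw counts.
total_records = 1000    # records in the whole dataset (assumed)
total_hits = 100        # records in the target category overall (assumed)
node_records = 50       # records falling into this terminal node (assumed)
node_hits = 20          # records in this node that are in the target category (assumed)

node_pct = 100.0 * node_records / total_records           # Node %
gain_pct = 100.0 * node_hits / total_hits                  # Gain %
response_pct = 100.0 * node_hits / node_records            # Response %
overall_response_pct = 100.0 * total_hits / total_records  # dataset response rate
index_pct = 100.0 * response_pct / overall_response_pct    # Index %

print(node_pct, gain_pct, response_pct, index_pct)   # 5.0 20.0 40.0 400.0

With these assumed counts, the node holds 5% of all records but 20% of all hits, so its response rate (40%) is four times the overall rate (10%), giving an index of 400%.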

Cumulative Statistics

In the cumulative view, the table displays one node per row, but statistics are cumulative, sorted in ascending or descending order by index percentage. For example, if a descending sort is applied, the node with the highest index percentage is listed first, and statistics in the rows that follow are cumulative for that row and above.

Figure 9-10 Cumulative gains sorted in descending order

The cumulative index percentage decreases row-by-row as nodes with lower and lower response percentages are added. (The wider you cast the net, the lower the bang per buck.) The cumulative index for the final row is always 100% because at this point the entire dataset is included.

Quantiles

In this view, each row in the table represents a quantile rather than a node. The quantiles are either quartiles, quintiles (fifths), deciles (tenths), vingtiles (twentieths), or percentiles (hundredths). Multiple nodes can be listed in a single quantile if more than one node is needed to make up that percentage (for example, if quartiles are displayed but the top two nodes contain fewer than 50% of all cases). The rest of the table is cumulative and can be interpreted in the same manner as the cumulative view.
Figure 9-11 Gains by quartile listed in descending order

Classification Profits and ROI


For classification trees, gains statistics can also be displayed in terms of profit and ROI (return on investment). The Define Profits dialog box allows you to specify revenue and expenses for each category.
E On the Gains tab, click the Profit button on the toolbar to access the dialog box.

Figure 9-12 Define Profits dialog box

E Assign revenue and expense values for each category of the target field.

For example, if it costs you $0.48 to mail an offer to each customer and the revenue from a positive response is $9.95 for a three-month subscription, then each "no" response costs you $0.48 and each "yes" earns you $9.47 (calculated as 9.95 - 0.48).

Figure 9-13 Profit and ROI statistics

In the gains table, profit is calculated as the sum of revenues minus expenditures for each of the records at a terminal node. ROI is total profit divided by total expenditure at a node.
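As a rough illustration of these two calculations, the sketch below applies the direct-mail figures from the example above to one hypothetical terminal node (the record counts are assumed; this is not Clementine code).

# Illustrative only: profit and ROI for one terminal node.
revenue = {"yes": 9.95, "no": 0.00}   # revenue per record, by actual response
expense = {"yes": 0.48, "no": 0.48}   # mailing cost applies to every record

node_counts = {"yes": 30, "no": 70}   # assumed records at the node, by actual response

profit = sum(node_counts[c] * (revenue[c] - expense[c]) for c in node_counts)
total_expense = sum(node_counts[c] * expense[c] for c in node_counts)
roi = profit / total_expense

print(profit, roi)   # 250.5 and roughly 5.22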
Comments

Profit values affect only the average profit and ROI values displayed in the gains table, as a way of viewing statistics in terms more applicable to your bottom line. They do not affect the basic tree model structure. Profits should not be confused with misclassification costs, which are specified in the model-building node and are factored into the model as a way of protecting against costly mistakes. For more information, see Misclassification Cost Options on p. 304. Profit specifications are not persisted between one interactive tree-building session and the next.

Regression Gains
For regression trees, you can choose between node-by-node, cumulative node-by-node, and quantile views. Average values are shown in the table. Charts are available only for quantiles.

Gains Charts
Charts can be displayed on the Gains tab as an alternative to tables.
E On the Gains tab, select the Quantiles icon (third from left on the toolbar). (Charts are not available for node-by-node or cumulative statistics.)
E Select the Charts icon.
E Select the displayed units (percentiles, deciles, and so on) from the drop-down list as desired.
E Select Gains, Response, or Lift to change the displayed measure.

Gains Chart

The gains chart plots the values in the Gains % column from the table. Gains are defined as the proportion of hits in each increment relative to the total number of hits in the tree, using the equation:
(hits in increment / total number of hits) x 100%

Figure 9-14 Gains chart

The chart effectively illustrates how widely you need to cast the net to capture a given percentage of all the hits in the tree. The diagonal line plots the expected response for the entire sample, if the model were not used. In this case, the response rate would be constant, since one person is just as likely to respond as another. To double your yield, you would need to ask twice as many people. The curved line indicates how much you can improve your response by including only those who rank in the higher percentiles based on gain. For example, including the top 50% might net you more than 70% of the positive responses. The steeper the curve, the higher the gain.
Lift Chart

The lift chart plots the values in the Index % column in the table. This chart compares the percentage of records in each increment that are hits with the overall percentage of hits in the training dataset, using the equation: (hits in increment / records in increment) / (total number of hits / total number of records)

Figure 9-15 Lift chart

Response Chart

The response chart plots the values in the Response % column of the table. The response is the percentage of records in the increment that are hits, using the equation:
(responses in increment / records in increment) x 100%
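All three chart measures are built from the same ingredients: the hit and record counts per increment and their totals. The sketch below is illustrative only; the increment counts are invented, and the increments are assumed to have already been formed (for example, as deciles sorted by node response).

# Illustrative only: per-increment values behind the gains, response, and lift charts.
# Each tuple is (records_in_increment, hits_in_increment), best increment first.
increments = [(100, 60), (100, 25), (100, 10), (100, 5)]   # assumed

total_records = sum(r for r, h in increments)
total_hits = sum(h for r, h in increments)

for records, hits in increments:
    gains = 100.0 * hits / total_hits        # (hits in increment / total hits) x 100%
    response = 100.0 * hits / records        # (responses in increment / records) x 100%
    lift = (hits / records) / (total_hits / total_records)
    print(gains, response, lift)

The plotted curves accumulate these values from the best increment onward, which is why the gains curve always reaches 100% once the entire dataset has been included.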

Figure 9-16 Response chart

Gains-Based Selection

The Gains-Based Selection dialog box allows you to automatically select terminal nodes with the best (or worst) gains based on a specified rule or threshold. You can then generate a Select node based on the selection.

Figure 9-17 Gains-Based Selection dialog box

E On the Gains tab, select the node-by-node or cumulative view and select the target category on which you want to base the selection. (Selections are based on the current table display and are not available for quantiles.)
E On the Gains tab, from the menus choose:
Edit
Select Terminal Nodes
Gains-Based Selection

Select only. You can select matching nodes or nonmatching nodes, for example, to select all but the top 100 records.

Match by gains information. Matches nodes based on gain statistics for the current target category, including:
Nodes where the gain, response, or lift (index) matches a specified threshold, for example, response greater than or equal to 50%
The top n nodes based on the gain for the target category
The top nodes up to a specified number of records
The top nodes up to a specified percentage of training data

E Click OK to update the selection on the Viewer tab.
E To create a new Select node based on the current selection on the Viewer tab, choose Select Node from the Generate menu. For more information, see Generating Filter and Select Nodes on p. 295.
Note that since you are actually selecting nodes rather than records or percentages, a perfect match with the selection criterion may not always be achieved. The system selects complete nodes up to the specified level. For example, if you select the top 12 cases and you have 10 in the first node and two in the second node, only the first node will be selected.

Gains for Partitions

If a partition field is in use, gains are displayed separately for each partition. Both tables are sorted together based on the training data.

Figure 9-18 Gains for training and testing partitions

When viewing quantile gains for partitions, selections made in the tree view may not be updated in the table. This is because the node groups may not be the same between training and testing samples; thus, a given node in the tree can be listed in different rows in each table. As a result, terminal nodes selected in the tree view may not be reflected in the quantiles view on the gains panel.

Risks
Risks tell you the chances of misclassification at any level. The Risks tab displays a point risk estimate and (for categorical outputs) a misclassification table.

Figure 9-19 Misclassification table for a categorical target

For numeric predictions, the risk is a pooled estimate of the variance at each of the terminal nodes. For categorical predictions, the risk is the proportion of cases incorrectly classified, adjusted for any priors or misclassification costs.
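For a categorical target, the unadjusted risk estimate is simply the off-diagonal share of the misclassification table. A minimal sketch (invented counts; priors and misclassification cost adjustments are not shown):

# Illustrative only: risk estimate from a misclassification table.
# confusion[actual][predicted] = number of records
confusion = {
    "yes": {"yes": 80, "no": 20},
    "no":  {"yes": 10, "no": 90},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[category][category] for category in confusion)
risk = (total - correct) / total
print(risk)   # 0.15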

Saving Tree Models and Results


You can save or export the results of your interactive tree-building sessions in a number of ways, including:
Generate a model based on the current tree (Tree Builder, Generate menu).
Save the directives used to grow the current tree. The next time the model-building node is executed, the current tree will automatically be regrown, including any custom splits that you have defined.
Export model, gain, and risk information. For more information, see Exporting Model, Gain, and Risk Information on p. 294.
From either the Tree Builder or a generated tree model, you can:
Generate a Filter or Select node based on the current tree. For more information, see Generating Filter and Select Nodes on p. 295.
Generate a Ruleset node that represents the tree structure as a set of rules defining the terminal branches of the tree. For more information, see Generating a Ruleset from a Decision Tree on p. 295.
In addition, for generated tree models only, you can export the model in PMML format. For more information, see The Models Palette in Chapter 6 on p. 238. If the model includes any custom splits, this information is not preserved in the exported PMML. (The split is preserved, but the fact that it is custom rather than chosen by the algorithm is not.)

Note: The interactive tree itself cannot be saved. To avoid losing your work, generate a model and/or update tree directives before closing the Tree Builder window.

Generating a Model from the Tree Builder


To generate a model based on the current tree, from the Tree Builder menus choose:
Generate
Model

Figure 9-20 Generating a decision tree model

You can choose from the following options:


Model name. You can specify a custom name or generate the name automatically based on the name of the modeling node.

Create node on. You can add the node on the Canvas, GM Palette, or Both.

Updating Tree Directives


To preserve your work from an interactive tree-building session, you can save the directives used to generate the current tree. Unlike saving a generated model, which cannot be edited further, this allows you to regenerate the tree in its current state for further editing.
E To update directives, from the Tree Builder menus choose:
File
Update Directives

Directives are saved in the modeling node used to create the tree (either C&RT, QUEST, or CHAID) and can be used to regenerate the current tree. For more information, see Tree-Growing Directives on p. 298.

Exporting Model, Gain, and Risk Information


From the interactive Tree Builder, you can export model, gain, and risk statistics in text, HTML, or image formats as appropriate.
E In the Tree Builder window, select the tab or view that you want to export.
E From the menus choose:
File
Export
E Select Text, HTML, or Graph as appropriate, and select the specific items you want to export from the submenu.

Where applicable, the export is based on current selections.


Exporting Text or HTML formats. You can export gain or risk statistics for the training or testing partition (if defined). The export is based on the current selections on the Gains tab; for example, you can choose node-by-node, cumulative, or quantile statistics.

Exporting graphics. You can export the current tree as displayed on the Viewer tab or export gains charts for the training or testing partition (if defined). Available formats include .JPEG, .PNG, and bitmap (.BMP). For gains, the export is based on current selections on the Gains tab (available only when a chart is displayed).

Generating Filter and Select Nodes


In the Tree Builder window, or when browsing a generated decision tree model, from the menus choose:
Generate
Filter Node

or
Select Node

Filter Node. Generates a node that filters any fields not used by the current tree. This is a quick way to pare down the dataset to include only those fields that are selected as important by the algorithm. If there is a Type node upstream from this Decision Tree node, any fields with direction OUT are passed on by the generated Filter node.

Select Node. Generates a node that selects all records that fall into the current node. This option requires that one or more tree branches be selected in the Viewer tab. The generated node is placed on the stream canvas.

Generating a Ruleset from a Decision Tree


You can generate a Ruleset node that represents the tree structure as a set of rules defining the terminal branches of the tree. Rulesets can often retain most of the important information from a full decision tree but with a less complex model. The most important difference is that with a ruleset, more than one rule may apply for any particular record, or no rules at all may apply. For example, you might see all of the rules that predict a no outcome followed by all of those that predict yes. If multiple rules apply, each rule gets a weighted vote based on the confidence associated with that rule, and the final prediction is decided by combining the weighted votes of all of the rules that apply to the record in question. If no rule applies, a default prediction is assigned to the record. Rulesets can be generated only from trees with categorical target fields (no regression trees).
E In the Tree Builder window, or when browsing a generated decision tree model, from the menus choose:
Figure 9-21 Generate Ruleset dialog box

Rule set name. Allows you to specify the name of the new generated Ruleset node.

Create node on. Controls the location of the new generated Ruleset node. Select Canvas, GM Palette, or Both.

Minimum instances. Specify the minimum number of instances (number of records to which the rule applies) to preserve in the generated ruleset. Rules with support less than the specified value will not appear in the new ruleset.

Minimum confidence. Specify the minimum confidence for rules to be preserved in the generated ruleset. Rules with confidence less than the specified value will not appear in the new ruleset.

C&R Tree Node


This node is included with the Base module. The Classification and Regression (C&R) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values. C&R Tree starts by examining the input fields to find the best split, measured by the reduction in an impurity index that results from the split. The split defines two subgroups, each of which is subsequently split into two more subgroups, and so on, until one of the stopping criteria is triggered. All splits are binary (only two subgroups).

Pruning

C&R trees give you the option to first grow the tree and then prune based on a cost-complexity algorithm that adjusts the risk estimate based on the number of terminal nodes. This method, which allows the tree to grow large before pruning based on more complex criteria, may result in smaller trees with better cross-validation properties. Increasing the number of terminal nodes generally reduces the risk for the current (training) data, but the actual risk may be higher when the model is generalized to unseen data. In an extreme case, suppose you have a separate terminal node for each record in the training set. The risk estimate would be 0%, since every record falls into its own node, but the risk of misclassification for unseen (testing) data would almost certainly be greater than 0. The cost-complexity measure attempts to compensate for this.
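The general cost-complexity idea behind this option can be sketched as follows: the adjusted measure for a subtree is its training risk plus a penalty proportional to its number of terminal nodes. The sketch shows the standard form of the calculation only, not the exact internal implementation, and the numbers used are assumed.

# Illustrative only: cost-complexity measure = training risk + alpha * number of terminal nodes.
def cost_complexity(training_risk, n_terminal_nodes, alpha):
    """Penalize low training risk that was bought with many terminal nodes."""
    return training_risk + alpha * n_terminal_nodes

# A large tree with very low training risk can score worse than a smaller tree
# once the complexity penalty is applied (the alpha value is assumed).
print(cost_complexity(0.02, 40, 0.005))   # 0.22
print(cost_complexity(0.08, 5, 0.005))    # 0.105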

Example. A cable TV company has commissioned a marketing study to determine which customers would buy a subscription to an interactive news service via cable. Using the data from the study, you can create a stream in which the target field is the intent to buy the subscription and the predictor fields include age, sex, education, income category, hours spent watching television each day, and number of children. By applying a C&R Tree node to the stream, you will be able to predict and classify the responses to get the highest response rate for your campaign. For more information, see News Service Sales (C&RT) in Chapter 9 in Clementine 11.1 Applications Guide.

Requirements. To train a C&R Tree model, you need one or more In fields and exactly one Out field. Target and predictor fields can be range or categorical. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated, and any ordinal fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them. For more information, see Reclassify Node in Chapter 4 on p. 105.

Strengths. C&R Tree models are quite robust in the presence of problems such as missing data and large numbers of fields. They usually do not require long training times to estimate. In addition, C&R Tree models tend to be easier to understand than some other model types; the rules derived from the model have a very straightforward interpretation. Unlike C5.0, C&R Tree can accommodate numeric ranges as well as categorical output fields.

Tree Node Model Options


Figure 9-22 C&R Tree node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Method. For CHAID trees only, you can specify standard or exhaustive CHAID. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits for each predictor but takes longer to compute.

Build method. Specifies the method used to build the model. Direct generates a model automatically when the stream is executed. Interactive launches the Tree Builder, which allows you to build your tree one level at a time, edit splits, and prune as desired before saving the generated model.

Use tree directives. Select this option to specify directives to apply when generating an interactive tree from the node. For example, you could specify the first- and second-level splits, and these would automatically be applied when the Tree Builder is launched. You can also save directives from an interactive tree-building session in order to re-create the tree at a future date. For more information, see Updating Tree Directives on p. 294.

Maximum tree depth. Specify the maximum number of levels below the root node (the number of times the sample will be split recursively).

Tree-Growing Directives
For C&RT, CHAID, and QUEST models, tree directives specify conditions for growing the tree, one level at a time. Directives are applied each time the interactive Tree Builder is launched from the node. Directives are most safely used as a way to regenerate a tree created during a previous interactive session. For more information, see Updating Tree Directives on p. 294. You can also edit directives manually, but this should be done with care. Directives are highly specific to the structure of the tree they describe. Thus, any change to the underlying data or modeling options may cause a previously valid set of directives to fail. For example, if the CHAID algorithm changes a two-way split to a three-way split based on updated data, any directives based on the previous two-way split would fail.
Note: If you choose to generate a model directly (without using the Tree Builder), any tree directives are ignored.
Editing Directives
E To view or edit saved directives, open the model-building node and select the Model tab.

Figure 9-23 Use tree directives option enabled in the modeling node

E Select Interactive Tree to enable the controls, select Use tree directives, and click Directives.

Figure 9-24 Tree-growing directives

Directive Syntax

Directives specify conditions for growing the tree, starting with the root node. For example, to grow the tree one level:
Grow Node Index 0 Children 1 2

Since no predictor is specified, the algorithm chooses the best split.

Note that the first split must always be on the root node (Index 0) and the index values for both children must be specified (1 and 2 in this case). It is invalid to specify Grow Node Index 2 Children 3 4 unless you first grew the root that created Node 2. To grow the tree:
Grow Tree

To grow and prune the tree (C&RT only):


Grow_And_Prune Tree

To specify a custom split for a range predictor:


Grow Node Index 0 Children 1 2 Spliton ( "EDUCATE", Interval ( NegativeInfinity, 12.5) Interval ( 12.5, Infinity ) )

To split on a set predictor with two values:


Grow Node Index 2 Children 3 4 Spliton ( "GENDER", Group( "0.0" )Group( "1.0" ))

To split on an ordered set predictor:


Grow Node Index 4 Children 5 6 Spliton ( "CHILDS", Interval ( NegativeInfinity, 1.0) Interval ( 1.0, Infinity ))

For a set predictor with multiple values:


Grow Node Index 6 Children 7 8 Spliton ( "ORGS", Group( "2.0","4.0" ) Group( "0.0","1.0","3.0","6.0" ) )

Note that when specifying custom splits, field names and values (EDUCATE, GENDER, CHILDS, etc.) are case sensitive.

Directives for CHAID Trees

Directives for CHAID trees are particularly sensitive to changes in the data or model because, unlike C&RT and QUEST, they are not constrained to use binary splits. For example, the following syntax looks perfectly valid but would fail if the algorithm splits the root node into more than two children:
Grow Node Index 0 Children 1 2
Grow Node Index 1 Children 3 4

With CHAID, it is possible that Node 0 will have 3 or 4 children, which would cause the second line of syntax to fail.


Using Directives in Scripts

Directives can also be embedded in scripts using triple quotation marks. For more information, see Blocks of Literal Text in Chapter 3 in Clementine 11.1 Scripting, Automation, and CEMI Reference.

C&R Tree Node Expert Options


This node is included with the Base module. Expert options allow you to fine-tune the model-building process. To access expert options, set the mode to Expert on the Expert tab.
Figure 9-25 C&R Tree expert options

Maximum surrogates. Surrogates are a method for dealing with missing values. For each split in the tree, C&R Tree identifies the input fields that are most similar to the selected split field. Those fields are the surrogates for that split. When a record must be classified but has a missing value for a split field, its value on a surrogate field can be used to make the split. Increasing this setting will allow more flexibility to handle missing values but may also lead to increased memory usage and longer training times.

Minimum change in impurity. Specify the minimum change in impurity to create a new split in the tree. Impurity refers to the extent to which subgroups defined by the tree have a wide range of output field values within each group. For categorical targets, a node is considered pure if 100% of cases in the node fall into a specific category of the target field. The goal of tree building is to create subgroups with similar output values; in other words, to minimize the impurity within each node. If the best split for a branch reduces the impurity by less than the specified amount, the split will not be made.

Impurity measure for categorical targets. For categorical target fields, specify the method used to measure the impurity of the tree; a small sketch of the Gini calculation appears at the end of this section. (For continuous targets, this option is ignored, and the least squared deviation impurity measure is always used.)
Gini is a general impurity measure based on probabilities of category membership for the branch.
Twoing is an impurity measure that emphasizes the binary split and is more likely to lead to approximately equal-sized branches from a split.
Ordered twoing adds the additional constraint that only contiguous target classes can be grouped together, and is applicable only with ordinal targets. If this option is selected for a nominal target, the standard twoing measure is used by default.
Stopping. Specifies rules for when to stop splitting nodes in the tree. For more information, see Tree Node Stopping Options on p. 302.

Prune tree. Pruning consists of removing bottom-level splits that do not contribute significantly to the accuracy of the tree. Pruning can help simplify the tree, making it easier to interpret and, in some cases, improving generalization by helping to avoid overfitting. If you want the full tree without pruning, deselect this option.

Use standard error rule. Allows you to specify a more liberal pruning rule. The standard error rule allows the algorithm to select the simplest tree whose risk estimate is close to (but possibly greater than) that of the subtree with the smallest risk. The multiplier indicates the allowable difference, in standard errors, between the risk estimate for the pruned tree and that of the tree with the smallest risk. For example, if you specify 2, a tree whose risk estimate is (2 x standard error) larger than that of the full tree could be selected.

Priors. Allows you to set prior probabilities for target categories. For more information, see Prior Probability Options on p. 303.
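For reference, the sketch below shows the textbook Gini impurity calculation and the improvement that results from a candidate binary split. It uses the general formula only; the weighting actually applied by the node (including any priors and misclassification costs) is not reproduced, and the category counts are invented.

# Illustrative only: Gini impurity and the improvement for a candidate binary split.
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for per-category record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [50, 50]                    # assumed category counts at the parent node
left, right = [40, 10], [10, 40]     # counts in the two child nodes after the split

n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
improvement = gini(parent) - weighted_children
print(gini(parent), improvement)     # 0.5 and 0.18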

Tree Node Stopping Options


Figure 9-26 Tree node stopping options

These options control how the tree is constructed. Stopping rules determine when to stop splitting specific branches of the tree. Set the minimum branch sizes to prevent splits that would create very small subgroups. Minimum records in parent branch will prevent a split if the number of records in the node to be split (the parent) is less than the specified value. Minimum records in child branch will prevent a split if the number of records in any branch created by the split (the child) would be less than the specified value.

Use percentage. Allows you to specify sizes in terms of percentage of overall training data.

Use absolute value. Allows you to specify sizes as the absolute numbers of records.

Prior Probability Options


Figure 9-27 C&R Tree prior probabilities options

These options allow you to specify prior probabilities for categories when predicting a categorical target field. Prior probabilities are estimates of the overall relative frequency for each target category in the population from which the training data are drawn. In other words, they are the probability estimates that you would make for each possible target value prior to knowing anything about predictor values. There are three methods of setting priors:

Based on training data. This is the default. Prior probabilities are based on the relative frequencies of the categories in the training data.

Equal for all classes. Prior probabilities for all categories are defined as 1/k, where k is the number of target categories.

Custom. You can specify your own prior probabilities. Starting values for prior probabilities are set as equal for all classes. You can adjust the probabilities for individual categories to user-defined values. To adjust a specific category's probability, select the probability cell in the table corresponding to the desired category, delete the contents of the cell, and enter the desired value. The prior probabilities for all categories should sum to 1.0 (the probability constraint). If they do not sum to 1.0, Clementine will give a warning and offer to automatically normalize the values. This automatic adjustment preserves the proportions across categories while enforcing the probability constraint (see the sketch at the end of this section). You can perform this adjustment at any time by clicking the Normalize button. To reset the table to equal values for all categories, click the Equalize button.

Adjust priors using misclassification costs. This option allows you to adjust the priors, based on misclassification costs (specified on the Costs tab). This enables you to incorporate cost information into the tree-growing process directly for trees that use the Twoing impurity measure. (When this option is not selected, cost information is used only in classifying records and calculating risk estimates for trees based on the Twoing measure.) For more information, see Misclassification Cost Options on p. 304.
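The automatic normalization mentioned under Custom simply rescales the entered values so that they sum to 1.0 while preserving their relative proportions, as in this sketch (illustrative only; the category names and values are assumed).

# Illustrative only: normalizing user-entered priors to sum to 1.0.
priors = {"bad": 0.5, "good": 0.3}    # assumed entries; they sum to 0.8

total = sum(priors.values())
normalized = {category: value / total for category, value in priors.items()}
print(normalized)   # {'bad': 0.625, 'good': 0.375}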

Misclassification Cost Options


Figure 9-28 Specifying misclassification costs

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors. Misclassification costs are basically weights applied to specific outcomes. These weights are factored into the model and may actually change the prediction (as a way of protecting against costly mistakes). The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, all misclassification costs are set to 1.0. To enter custom cost values, select Use misclassification costs and enter your custom values into the cost matrix. To change a misclassification cost, select the cell corresponding to the desired combination of predicted and actual values, delete the existing contents of the cell, and enter the desired cost for the cell. Remember that customized misclassification costs are not automatically symmetric. For example, if you set the cost of misclassifying A as B to be 2.0, the cost of misclassifying B as A will still have the default value of 1.0 unless you explicitly change it as well.
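The following sketch shows an asymmetric cost matrix of the kind described above and the total cost it assigns to a handful of scored records. It is illustrative only; the category names, costs, and record outcomes are assumed, and this is not how the node stores the matrix internally.

# Illustrative only: applying an asymmetric misclassification cost matrix.
# cost[actual][predicted]; correct predictions cost nothing.
cost = {
    "high risk": {"high risk": 0.0, "low risk": 2.0},   # the costlier mistake
    "low risk":  {"high risk": 1.0, "low risk": 0.0},
}

# (actual, predicted) pairs for a few scored records (assumed).
results = [("high risk", "low risk"), ("low risk", "low risk"), ("low risk", "high risk")]
total_cost = sum(cost[actual][predicted] for actual, predicted in results)
print(total_cost)   # 3.0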


CHAID Node
This node is included with the Base module. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the crosstabulations between each of the predictor variables and the outcome and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID will select the predictor that is the most significant (smallest p value). If a predictor has more than two categories, these are compared, and categories that show no differences in the outcome are collapsed together. This is done by successively joining the pair of categories showing the least significant difference. This category-merging process stops when all remaining categories differ at the specified testing level. For set predictors, any categories can be merged; for an ordinal set, only contiguous categories can be merged. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits for each predictor but takes longer to compute.
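The kind of test CHAID applies at each step can be illustrated with an ordinary chi-square test of independence on a predictor-by-target crosstabulation. The sketch below uses SciPy and invented counts; CHAID's own category merging, Bonferroni adjustment, and likelihood-ratio option are not shown.

# Illustrative only: chi-square independence test on a predictor/target crosstab.
from scipy.stats import chi2_contingency

# Rows = predictor categories, columns = target categories (assumed counts).
crosstab = [
    [30, 70],   # predictor category A
    [55, 45],   # predictor category B
    [50, 50],   # predictor category C
]

chi2, p_value, dof, expected = chi2_contingency(crosstab)
# A small p value suggests the predictor and target are not independent,
# so a split on this predictor is a strong candidate.
print(chi2, dof, p_value)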
Requirements. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level. Any ordinal fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them. For more information, see Reclassify Node in Chapter 4 on p. 105.

Strengths. Unlike the C&RT and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. It therefore tends to create a wider tree than the binary growing methods. CHAID works for all types of predictors, and it accepts both case weights and frequency variables.

CHAID Node Expert Options


This node is included with the Base module.

Figure 9-29 CHAID node expert options

Alpha for splitting. Specifies the significance level (alpha) for splitting nodes. The value must be between 0 and 1. Lower values tend to produce trees with fewer nodes.

Alpha for merging. Specifies the significance level (alpha) for merging categories. The value must be greater than 0 and less than or equal to 1. To prevent any merging of categories, specify a value of 1. For range targets, this means the number of categories for the variable in the final tree matches the specified number of intervals. This option is not available for Exhaustive CHAID.

Chi-square for categorical targets. For categorical targets, you can specify the method used to calculate the chi-square statistic.
Pearson. This method provides faster calculations but should be used with caution on small samples.
Likelihood ratio. This method is more robust than Pearson but takes longer to calculate. For small samples, this is the preferred method. For range targets, this method is always used.

Stopping. Specifies rules for when to stop splitting nodes in the tree. For more information, see Tree Node Stopping Options on p. 302.

Epsilon for convergence. When estimating cell frequencies (for both the nominal model and the row effects ordinal model), an iterative procedure is used to converge on the optimal estimate used in the chi-square test for a specific split. Epsilon determines how much change must occur for iterations to continue; if the change from the last iteration is smaller than the specified value, iterations stop. If you are having problems with the algorithm not converging, you can increase this value or increase the maximum number of iterations until convergence occurs.

Maximum iterations for convergence. Specifies the maximum number of iterations before stopping, whether convergence has taken place or not.


Allow splitting of merged categories. The CHAID algorithm attempts to merge categories in order to produce the simplest tree that describes the model. If selected, this option allows merged categories to be resplit if that results in a better solution.

Use Bonferroni adjustment. Adjusts significance values when testing the various category combinations of a predictor. Values are adjusted based on the number of tests, which directly relates to the number of categories and measurement level of a predictor. This is generally desirable because it better controls the false-positive error rate. Disabling this option will increase the power of your analysis to find true differences, but at the cost of an increased false-positive rate. In particular, disabling this option may be recommended for small samples.
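The Bonferroni idea itself is simple: each p value is multiplied by the number of comparisons performed (and capped at 1.0). This is a sketch of the general principle, not the node's exact adjustment, which depends on the measurement level of the predictor.

# Illustrative only: Bonferroni adjustment of a p value for multiple comparisons.
def bonferroni(p_value, n_tests):
    return min(1.0, p_value * n_tests)

# A split that looks significant on its own may not survive adjustment
# when many category combinations were tested (numbers are assumed).
print(bonferroni(0.02, 1))    # 0.02
print(bonferroni(0.02, 10))   # 0.2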

QUEST Node
This node is included with the Base module. QUEST, or Quick, Unbiased, Efficient Statistical Tree, is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&RT analyses with either many variables or many cases. A second goal of QUEST was to reduce the tendency found in classification tree methods to favor predictors that allow more splits; that is, continuous predictor variables or those with many categories. QUEST uses a sequence of rules, based on significance tests, to evaluate the predictor variables at a node. For selection purposes, as little as a single test may need to be performed on each predictor at a node. Unlike C&RT, all splits are not examined, and unlike C&RT and CHAID, category combinations are not tested when evaluating a predictor for selection. This speeds the analysis. Splits are determined by running quadratic discriminant analysis using the selected predictor on groups formed by the target categories. This method again results in a speed improvement over exhaustive search (C&RT) to determine the optimal split.
Requirements. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. Weight fields cannot be used. Any ordinal fields used in the model must have numeric storage (not string). If necessary, the Reclassify node can be used to convert them. For more information, see Reclassify Node in Chapter 4 on p. 105.

Strengths. Like CHAID, but unlike C&RT, QUEST uses statistical tests to decide whether or not a predictor is used. It also separates the issues of predictor selection and splitting, applying different criteria to each. This contrasts with CHAID, in which the statistical test result that determines variable selection also produces the split. Similarly, C&RT employs the impurity-change measure both to select the predictor variable and to determine the split.

QUEST Node Expert Options


This node is included with the Base module. Expert options allow you to fine-tune the model-building process. To access expert options, set the mode to Expert on the Expert tab.

Figure 9-30 QUEST node expert options

Maximum surrogates. Surrogates are a method for dealing with missing values. For each split in the tree, the algorithm identifies the input fields that are most similar to the selected split field. Those fields are the surrogates for that split. When a record must be classified but has a missing value for a split field, its value on a surrogate field can be used to make the split. Increasing this setting will allow more flexibility to handle missing values but may also lead to increased memory usage and longer training times.

Alpha for Splitting. Specifies the significance level (alpha) for splitting nodes. The value must be between 0 and 1. Lower values tend to produce trees with fewer nodes.

Stopping. Specifies rules for when to stop splitting nodes in the tree. For more information, see Tree Node Stopping Options on p. 302.

Prune tree. Pruning consists of removing bottom-level splits that do not contribute significantly to the accuracy of the tree. Pruning can help simplify the tree, making it easier to interpret and, in some cases, improving generalization. If you want the full tree without pruning, deselect this option.

Use standard error rule. Allows you to specify a more liberal pruning rule. The standard error rule allows the algorithm to select the simplest tree whose risk estimate is close to (but possibly greater than) that of the subtree with the smallest risk. The multiplier indicates the allowable difference, in standard errors, between the risk estimate for the pruned tree and that of the tree with the smallest risk. For example, if you specify 2, a tree whose risk estimate is (2 x standard error) larger than that of the full tree could be selected.

Priors. Allows you to set prior probabilities for target categories. For more information, see Prior Probability Options on p. 303.

C5.0 Node
This node is available with the Classification module.

This node uses the C5.0 algorithm to build either a decision tree or a ruleset. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.
Note: As of release 11.0, a new version of the C5.0 algorithm is in use. When analyzing data with categorical fields (Set or Ordered Set fields), the new version is more likely to group categories together than previous versions of C5.0, which tends to produce smaller trees when you have categorical fields in your data.
C5.0 can produce two kinds of models. A decision tree is a straightforward description of the splits found by the algorithm. Each terminal (or leaf) node describes a particular subset of the training data, and each case in the training data belongs to exactly one terminal node in the tree. In other words, exactly one prediction is possible for any particular data record presented to a decision tree. In contrast, a ruleset is a set of rules that tries to make predictions for individual records. Rulesets are derived from decision trees and, in a way, represent a simplified or distilled version of the information found in the decision tree. Rulesets can often retain most of the important information from a full decision tree but with a less complex model. Because of the way rulesets work, they do not have the same properties as decision trees. The most important difference is that with a ruleset, more than one rule may apply for any particular record, or no rules at all may apply. If multiple rules apply, each rule gets a weighted vote based on the confidence associated with that rule, and the final prediction is decided by combining the weighted votes of all of the rules that apply to the record in question. If no rule applies, a default prediction is assigned to the record.
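Information gain, the splitting criterion named above, compares the entropy of the target field before and after a candidate split. The sketch below shows the textbook calculation only, with invented counts; the actual C5.0 implementation is more involved (it is generally described as applying a gain-ratio style correction for the number of branches).

# Illustrative only: entropy and information gain for a candidate split.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent = [50, 50]                  # assumed target counts before the split
children = [[40, 10], [10, 40]]    # target counts in each subsample after the split

n = sum(parent)
gain = entropy(parent) - sum((sum(c) / n) * entropy(c) for c in children)
print(gain)   # approximately 0.278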
Example. A medical researcher has collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications. You can use a C5.0 model, in conjunction with other nodes, to help find out which drug might be appropriate for a future patient with the same illness. For more information, see Drug Treatments (Exploratory Graphs/C5.0) in Chapter 4 in Clementine 11.1 Applications Guide.

Requirements. To train a C5.0 model, you need one or more In fields and exactly one symbolic Out field. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated.

Strengths. C5.0 models are quite robust in the presence of problems such as missing data and large numbers of input fields. They usually do not require long training times to estimate. In addition, C5.0 models tend to be easier to understand than some other model types, since the rules derived from the model have a very straightforward interpretation. C5.0 also offers the powerful boosting method to increase accuracy of classification.
Note: C5.0 model building speed may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 Users Guide.


C5.0 Node Model Options


This node is available with the Classification module.
Figure 9-31 C5.0 node model options

Model name. Specify the name of the model to be produced.
Auto. With this option selected, the model name will be generated automatically, based on the target field name(s). This is the default.
Custom. Select this option to specify your own name for the generated model that will be created by this node.

Output type. Specify here whether you want the resulting generated model to be a Decision tree or a Rule set.

Group symbolics. If this option is selected, C5.0 will attempt to combine symbolic values that have similar patterns with respect to the output field. If this option is not selected, C5.0 will create a child node for every value of the symbolic field used to split the parent node. For example, if C5.0 splits on a COLOR field (with values RED, GREEN, and BLUE), it will create a three-way split by default. However, if this option is selected, and the records where COLOR = RED are very similar to records where COLOR = BLUE, it will create a two-way split, with the GREENs in one group and the BLUEs and REDs together in the other.
Use boosting. The C5.0 algorithm has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then, a second model is built in such a way that it focuses on the records that were misclassified by the first model. Then a third model is built to focus on the second model's errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction (a rough sketch of the weighted-voting idea appears at the end of this section). Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model. This feature is based on the research of Freund & Schapire, with some proprietary improvements to handle noisy data better.
Cross-validate. If this option is selected, C5.0 will use a set of models built on subsets of the training data to estimate the accuracy of a model built on the full dataset. This is useful if your dataset is too small to split into traditional training and testing sets. The cross-validation models are discarded after the accuracy estimate is calculated. You can specify the number of folds, or the number of models used for cross-validation. Note that in previous versions of Clementine, building the model and cross-validating it were two separate operations. In the current version, no separate model-building step is required. Model building and cross-validation are performed at the same time.

Mode. For Simple training, most of the C5.0 parameters are set automatically. Expert training allows more direct control over the training parameters.

Simple Mode Options

Favor. By default, C5.0 will try to produce the most accurate tree possible. In some instances, this can lead to overfitting, which can result in poor performance when the model is applied to new data. Select Generality to use algorithm settings that are less susceptible to this problem.
Note: Models built with the Generality option selected are not guaranteed to generalize better than other models. When generality is a critical issue, always validate your model against a held-out test sample.

Expected noise (%). Specify the expected proportion of noisy or erroneous data in the training set.

Expert Mode Options

Pruning severity. Determines the extent to which the generated decision tree or ruleset will be pruned. Increase this value to obtain a smaller, more concise tree. Decrease it to obtain a more accurate tree. This setting affects local pruning only (see Use global pruning below).

Minimum records per child branch. The size of subgroups can be used to limit the number of splits in any branch of the tree. A branch of the tree will be split only if two or more of the resulting subbranches would contain at least this many records from the training set. The default value is 2. Increase this value to help prevent overtraining with noisy data.

Use global pruning. Trees are pruned in two stages: first, a local pruning stage examines subtrees and collapses branches to increase the accuracy of the model; second, a global pruning stage considers the tree as a whole, and weak subtrees may be collapsed. Global pruning is performed by default. To omit the global pruning stage, deselect this option.

Winnow attributes. If this option is selected, C5.0 will examine the usefulness of the predictors before starting to build the model. Predictors that are found to be irrelevant are then excluded from the model-building process. This option can be helpful for models with many predictor fields and can help prevent overfitting.
Note: C5.0 model building speed may benefit from enabling parallel processing. For more information, see Setting Optimization Options in Chapter 3 in Clementine 11.1 Users Guide.
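As noted under Use boosting, the individual predictions of a boosted sequence of models are combined by a weighted vote. The sketch below shows the general idea only, with assumed predictions and confidences; C5.0's actual weighting scheme is proprietary and is not reproduced here.

# Illustrative only: combining boosted-model predictions by confidence-weighted voting.
from collections import defaultdict

# (predicted category, confidence) from each model in the boosted sequence (assumed).
predictions = [("yes", 0.9), ("no", 0.6), ("yes", 0.7)]

votes = defaultdict(float)
for category, confidence in predictions:
    votes[category] += confidence

final_prediction = max(votes, key=votes.get)
print(final_prediction, dict(votes))   # yes {'yes': 1.6, 'no': 0.6}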


Generating a Tree Model Directly


As an alternative to using the Tree Builder, you can generate a tree model directly from the build node when the stream is executed. This is consistent with most other model-building nodes in Clementine and may be useful if you want to automate tree generating using batch mode, for example. For C5.0 tree models, which are not supported by the interactive Tree Builder, this is the only method that can be used.
E Create a stream and add one of the tree-building nodesC&RT, CHAID, QUEST, or C5.0. Figure 9-32 Generating a tree model directly from the modeling node

E On the Model tab, select Model for the build option. (For C5.0, select Decision Tree.)
E Select target and predictor fields and specify additional model options, as needed. For specific instructions, see the documentation for each tree-building node.


E Execute the stream to generate the model.

Comments

When generating trees using this method, tree-growing directives are ignored. Whether interactive or direct, both methods of creating decision trees ultimately generate similar models. It's just a question of how much control you want along the way.

Generated Decision Tree Models


Generated decision tree models represent the tree structures for predicting a particular output field discovered by one of the decision tree modeling nodes (C&R Tree, CHAID, QUEST, C5.0, or Build Rule from previous versions of Clementine). Tree models can be generated directly from the model-building node or from the interactive Tree Builder. For more information, see The Tree Builder on p. 275.


Scoring Tree Models

When you execute a stream containing a generated tree model node, the specific result depends on the type of tree. For classification trees (categorical target), two new fields, containing the predicted value and the confidence for each record, are added to the data. The prediction is based on the most frequent category for the terminal node to which the record is assigned; if a majority of respondents in a given node is yes, the prediction for all records assigned to that node is yes. For regression trees, only predicted values are generated; confidences are not assigned. Optionally, for CHAID, QUEST, and C&RT models, an additional field can be added that indicates the ID for the node to which each record is assigned. The new field names are derived from the model name by adding prefixes. For C&RT, CHAID, and QUEST trees, the prefixes are $R- for the prediction field, $RC- for the confidence field, and $RI- for the node identifier field. For C5.0 trees, the prefixes are $C- for the prediction field and $CC- for the confidence field. If multiple tree model nodes are present, the new field names will include numbers in the prefix to distinguish them if necessary; for example, $R1- and $RC1-, $R2-, and so on.
Working with Generated Tree Models

You can save or export information related to the model in a number of ways. Note: Many of these options are also available from the Tree Builder window. From either the Tree Builder or a generated tree model, you can:
Generate a Filter or Select node based on the current tree. For more information, see Generating Filter and Select Nodes on p. 295.
Generate a Ruleset node that represents the tree structure as a set of rules defining the terminal branches of the tree. For more information, see Generating a Ruleset from a Decision Tree on p. 295.
In addition, for generated tree models only, you can export the model in PMML format. For more information, see The Models Palette in Chapter 6 on p. 238. If the model includes any custom splits, this information is not preserved in the exported PMML. (The split is preserved, but the fact that it is custom rather than chosen by the algorithm is not.)
For boosted C5.0 models only, you can choose Single Decision Tree (Canvas) or Single Decision Tree (GM Palette) to create a new single Ruleset derived from the currently selected rule. For more information, see Boosted C5.0 Models on p. 319.
Note: Although the Build Rule node was replaced by the C&R Tree node in version 6.0, Decision Tree nodes in existing streams that were originally created using a Build Rule node will still function properly.

Decision Tree Model Rules


The Model tab for a generated decision tree displays a list of conditions defining the partitioning of data discovered by the algorithm: essentially, a series of rules that can be used to assign individual records to child nodes based on the values of different predictors.

Figure 9-33 Sample Decision Tree node Model tab

Decision trees work by recursively partitioning the data based on input field values. The data partitions are called branches. The initial branch (sometimes called the root) encompasses all data records. The root is split into subsets, or child branches, based on the value of a particular input field. Each child branch can be further split into sub-branches, which can in turn be split again, and so on. At the lowest level of the tree are branches that have no more splits. Such branches are known as terminal branches (or leaves). The rule browser shows the input values that define each partition or branch and a summary of output field values for the records in that split. For general information on using the model browser, see Browsing Generated Models. For splits based on numeric fields, the branch is shown by a line of the form:
fieldname relation value [summary]

where relation is a numeric relation. For example, a branch defined by values greater than 100 for the revenue field would appear as
revenue > 100 [summary]

For splits based on symbolic fields, the branch is shown by a line of the form:
fieldname = value [summary] or fieldname in [values] [summary]


where values represents the field values that define the branch. For example, a branch that includes records where the value of region can be North, West, or South would be represented as
region in ["North" "West" "South"] [summary]

For terminal branches, a prediction is also given, adding an arrow and the predicted value to the end of the rule condition. For example, a leaf defined by revenue > 100 that predicts a value of high for the output field would be displayed as
revenue > 100 [Mode: high] → high

The summary for the branch is defined differently for symbolic and numeric output fields. For trees with numeric output fields, the summary is the average value for the branch, and the effect of the branch is the difference between the average for the branch and the average of its parent branch. For trees with symbolic output fields, the summary is the mode, or the most frequent value, for records in the branch. To fully describe a branch, you need to include the condition that defines the branch, plus the conditions that define the splits further up the tree. For example, in the tree
revenue > 100
    region = "North"
    region in ["South" "East" "West"]
revenue <= 200

the branch represented by the second line is defined by the conditions revenue > 100 and region = North. If you click Show Instances/Confidence on the toolbar, each rule will also show information about the number of records to which the rule applies (Instances) and the proportion of those records for which the rule is true (Confidence). If you click Show Additional Information Panel on the toolbar, you will see a panel containing detailed information for the selected rule at the bottom of the window. The information panel contains three tabs.
Figure 9-34 Information panel

History

This tab traces the split conditions from the root node down to the selected node. This provides a list of conditions that determine when a record is assigned to the selected node. Records for which all of the conditions are true will be assigned to this node.


Frequencies

For models with symbolic target fields, this tab shows, for each possible target value, the number of records assigned to this node (in the training data) that have that target value. The frequency figure, expressed as a percentage (shown to a maximum of three decimal places), is also displayed. For models with numeric targets, this tab is empty.
Surrogates

Where applicable, any surrogates for the primary split field are shown for the selected node. Surrogates are alternate fields used if the primary predictor value is missing for a given record. The maximum number of surrogates allowed for a given split is specified in the model-building node, but the actual number depends on the training data. In general, the more missing data, the more surrogates are likely to be used. For other decision tree models, this tab is empty. Note: To be included in the model, surrogates must be identified during the training phase. If the training sample has no missing values, then no surrogates will be identified, and any records with missing values encountered during testing or scoring will automatically fall into the child node with the largest number of records. If missing values are expected during testing or scoring, be sure that values are missing from the training sample, as well. Surrogates are not available for CHAID trees.

Decision Tree Model Viewer


The Viewer tab for a generated decision tree model resembles the display in the Tree Builder. The main difference is that when browsing the generated model, you cannot grow or modify the tree. Other options for customizing the display are similar between the two components. For more information, see Customizing the Tree View on p. 281.

Figure 9-35 Sample Decision Tree Viewer tab with tree map window

Decision Tree/Ruleset Model Settings


The Settings tab for a generated decision tree model or ruleset allows you to specify options for confidences and for SQL generation during model scoring. This tab is available only after the generated model has been added to a stream.

Figure 9-36 Decision Tree/Ruleset Settings tab

Generate SQL for this model. There are two ways you can use SQL with Clementine:

Export the SQL as a text file for modification and use in another, unconnected, database. For more information, see Browsing Generated Models in Chapter 6 on p. 239.
Enable SQL generation for the model in order to take advantage of database performance. This setting only applies when using data from a database. For more information, see SQL Optimization in Chapter 6 in Clementine 11.1 Server Administration and Performance Guide. Also note that actual performance results may vary, depending on the complexity of the model and the capabilities of each DBMS, particularly when large numbers of categorical predictors with many discrete values are used. In general, the more complex the model, the greater the likelihood that databases will struggle with the resulting generated SQL.
Select one of the options below to enable or disable SQL generation for the model.
Do not generate. Select to disable SQL generation for the model.
No missing value support. Select to enable SQL generation without the overhead of handling missing values. This option simply sets the prediction to null ($null$) when a missing value is encountered while scoring a case. Note: This option is available only for decision trees and is the recommended selection for C5.0 trees or when the data has already been treated for missing values.
With missing value support. Select to enable SQL generation with full missing value support. This means that SQL is generated so that missing values are handled as specified in the model. For example, C&RT trees use surrogate rules and biggest child fallback. Note: SQL generation does not provide efficient support for C5.0's treatment of missing values; therefore, this option is not enabled for C5.0 trees. No missing value support is recommended if you still want to generate SQL for C5.0 trees.
Calculate Confidences. Select to include confidences in scoring operations. When scoring models in the database, excluding confidences allows you to generate more efficient SQL. Note that for regression trees, confidences are not assigned.
Rule identifier. For CHAID, QUEST, and C&RT models, this option adds a field in the scoring output that indicates the ID for the terminal node to which each record is assigned. Note: When this option is selected, SQL generation is not available.

Boosted C5.0 Models


This node is available with the Classification module.
Figure 9-37 Sample boosted C5.0 Decision Tree node Model tab

When you create a boosted C5.0 model (either a ruleset or a decision tree), you actually create a set of related models. The model rule browser for a boosted C5.0 model shows the list of models at the top level of the hierarchy, along with the accuracy of each model and the cumulative accuracy of the boosted models up to and including the current model. To examine the rules or splits for a particular model, select that model and expand it as you would a rule or branch in a single model.


You can also extract a particular model from the set of boosted models and create a new generated Ruleset node containing just that model. To create a new ruleset from a boosted C5.0 model, select the ruleset or tree of interest and choose either Single Decision Tree (GM Palette) or Single Decision Tree (Canvas) from the Generate menu.

Ruleset Nodes
A Ruleset node represents the rules for predicting a particular output field discovered by one of the association rule modeling nodes (Apriori or GRI) or by one of the tree-building nodes (C&RT, CHAID, QUEST, or C5.0). For association rules, the ruleset must be generated from an Unrefined Rule node. For trees, a ruleset can be generated from the Tree Builder, from a C5.0 model-building node, or from any generated tree model. Unlike Unrefined Rule nodes, Ruleset nodes can be placed in streams to generate predictions. When you execute a stream containing a Ruleset node, two new fields, containing the predicted value and the confidence for each record, are added to the data. The new field names are derived from the model name by adding prefixes. For association rulesets, the prefixes are $A- for the prediction field and $AC- for the confidence field. For C5.0 rulesets, the prefixes are $C- for the prediction field and $CC- for the confidence field. For C&R Tree rulesets, the prefixes are $R- for the prediction field and $RC- for the confidence field. In a stream with multiple Ruleset nodes in a series predicting the same output field(s), the new field names will include numbers in the prefix to distinguish them from each other. The first Association Ruleset node in the stream will use the usual names, the second node will use names starting with $A1- and $AC1-, the third node will use names starting with $A2- and $AC2-, and so on.
How rules are applied. Rulesets generated from association rules are unlike other generated model nodes because, for any particular record, more than one prediction can be generated, and those predictions may not all agree. There are two methods for generating predictions from rulesets. Note: Rulesets generated from decision trees return the same results regardless of which method is used, since the rules derived from a decision tree are mutually exclusive.
Voting. This method attempts to combine the predictions of all of the rules that apply to the record. For each record, all rules are examined and each rule that applies to the record is used to generate a prediction and an associated confidence. The sum of confidence figures for each output value is computed, and the value with the greatest confidence sum is chosen as the final prediction. The confidence for the final prediction is the confidence sum for that value divided by the number of rules that fired for that record.
First hit. This method simply tests the rules in order, and the first rule that applies to the record is the one used to generate the prediction. The method used can be controlled in the stream options. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 User's Guide.
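The difference between the two methods can be sketched in a few lines of Python. This is an illustration only, not Clementine code: rules are represented as hypothetical (applies, prediction, confidence) triples, where applies is a predicate over a record dictionary.

from collections import defaultdict

def score_first_hit(rules, record):
    # Return the prediction of the first rule that applies to the record.
    for applies, prediction, confidence in rules:
        if applies(record):
            return prediction, confidence
    return None, None

def score_voting(rules, record):
    # Sum confidences per predicted value over all rules that fire; pick the largest sum.
    sums = defaultdict(float)
    fired = 0
    for applies, prediction, confidence in rules:
        if applies(record):
            sums[prediction] += confidence
            fired += 1
    if not fired:
        return None, None
    best = max(sums, key=sums.get)
    return best, sums[best] / fired        # confidence = winning sum / rules fired

rules = [
    (lambda r: r["revenue"] > 100, "high", 0.8),
    (lambda r: r["region"] == "North", "high", 0.6),
    (lambda r: r["revenue"] <= 100, "low", 0.7),
]
print(score_voting(rules, {"revenue": 150, "region": "North"}))     # ('high', 0.7)
print(score_first_hit(rules, {"revenue": 150, "region": "North"}))  # ('high', 0.8)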
Generating nodes. The Generate menu allows you to create new nodes based on the ruleset.
Filter Node. Creates a new Filter node to filter fields that are not used by rules in the ruleset.
Select Node. Creates a new Select node to select records to which the selected rule applies. The generated node will select records for which all antecedents of the rule are true. This option requires a rule to be selected.


Rule Trace Node. Creates a new SuperNode that will compute a field indicating which rule was used to make the prediction for each record. When a ruleset is evaluated using the first hit method, this is simply a symbol indicating the first rule that would fire. When the ruleset is evaluated using the voting method, this is a more complex string showing the input to the voting mechanism.
Single Decision Tree (Canvas)/Single Decision Tree (GM Palette). Creates a new single Ruleset derived from the currently selected rule. Available only for boosted C5.0 models. For more information, see Boosted C5.0 Models on p. 319.
Model to Palette. Returns the model to the generated models palette. This is useful in situations where a colleague may have sent you a stream containing the model and not the model itself. Note: The Settings and Summary tabs in the Ruleset node are identical to those for decision tree models.

Ruleset Model Tab


The Model tab for a Ruleset node displays a list of rules extracted from the data by the algorithm.
Figure 9-38 Sample generated Ruleset node Model tab

Rules are broken down by consequent (predicted category), and are presented in the following format:
if antecedent_1 and antecedent_2 ... and antecedent_n

then predicted value

where consequent and antecedent_1 through antecedent_n are all conditions. The rule is interpreted as "for records where antecedent_1 through antecedent_n are all true, consequent is also likely to be true." If you click the Show Instances/Confidence button on the toolbar, each rule will also show information on the number of records to which the rule applies, that is, for which the antecedents are true (Instances), and the proportion of those records for which the entire rule is true (Confidence). Note that confidence is calculated somewhat differently for C5.0 rulesets. C5.0 uses the following formula for calculating the confidence of a rule:
(1 + number of records where rule is correct) / (2 + number of records for which the rule's antecedents are true)

This calculation of the confidence estimate adjusts for the process of generalizing rules from a decision tree (which is what C5.0 does when it creates a ruleset).
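Written out as a small Python function (a sketch of the formula above, not Clementine code), the calculation looks like this; it is a Laplace-style adjustment that keeps confidences away from 0 and 1 for rules that cover few records.

def c5_rule_confidence(n_correct, n_covered):
    # n_covered: records for which the rule's antecedents are true.
    # n_correct: those covered records for which the rule's prediction is correct.
    return (1 + n_correct) / (2 + n_covered)

# A rule that covers 50 records and is correct for 48 of them:
print(c5_rule_confidence(48, 50))   # 0.9423...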

Importing Projects from AnswerTree 3.0


Clementine can import projects saved in AnswerTree 3.0 or 3.1 using the standard File > Open dialog box, as follows:
E From the Clementine menus choose:
File > Open Stream
E From the Files of Type drop-down list, select Answer Tree Files (*.atp, *.ats).

Each imported project is converted into a Clementine stream with the following nodes:
One source node that defines the data source used (for example, an SPSS data file or database source).
For each tree in the project (there can be several), one Type node is created that defines properties for each field (variable), including type, direction (input or predictor field versus output or predicted field), missing values, and other options.
For each tree in the project, a Partition node is created that partitions the data for a training or test sample, and a Tree Model node is created that defines parameters for generating the tree (either a C&R Tree, QUEST, or CHAID node).
E To view the generated tree(s), execute the stream.

Comments

Decision trees generated in Clementine cannot be exported to AnswerTree; the import from AnswerTree to Clementine is a one-way trip. Profits defined in AnswerTree are not preserved when the project is imported into Clementine.

Neural Networks
Neural Net Node
This node is available with the Classification module.


The Neural Net node (formerly called Train Net) is used to create and train a neural network. Neural networks are simple models of the way the nervous system operates. The basic units are neurons, which are typically organized into layers, as shown in the following figure.
Figure 10-1 Structure of a neural network

A neural network, sometimes called a multilayer perceptron, is basically a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. The processing units are arranged in layers. There are typically three parts in a neural network: an input layer, with units representing the input fields; one or more hidden layers; and an output layer, with a unit or units representing the output field(s). The units are connected with varying connection strengths (or weights). Input data are presented to the first layer, and values are propagated from each neuron to every neuron in the next layer. Eventually, a result is delivered from the output layer. The network learns by examining individual records, generating a prediction for each record, and making adjustments to the weights whenever it makes an incorrect prediction. This process is repeated many times, and the network continues to improve its predictions until one or more of the stopping criteria have been met. Initially, all weights are random, and the answers that come out of the net are probably nonsensical. The network learns through training. Examples for which the output is known are repeatedly presented to the network, and the answers it gives are compared to the known outcomes. Information from this comparison is passed back through the network, gradually changing the weights. As training progresses, the network becomes increasingly accurate in replicating the known outcomes. Once trained, the network can be applied to future cases where the outcome is unknown.
Example. In screening agricultural development grants for possible cases of fraud, a neural network can be used for an in-depth exploration of deviations from the norm, highlighting those records that are abnormal and worthy of further investigation. You are particularly interested in grant applications that appear to claim too much (or too little) money for the type and size of farm. For more information, see Fraud Screening (Anomaly Detection/Neural Net) in Chapter 7 in Clementine 11.1 Applications Guide.
Requirements. There are no restrictions on field types. Neural Net nodes can handle numeric, symbolic, or flag inputs and outputs. The Neural Net node expects one or more fields with direction In and one or more fields with direction Out. Fields set to Both or None are ignored. Field types must be fully instantiated when the node is executed.
Strengths. Neural networks are powerful general function estimators. They usually perform prediction tasks at least as well as other techniques and sometimes perform significantly better. They also require minimal statistical or mathematical knowledge to train or apply. Clementine incorporates several features to avoid some of the common pitfalls of neural networks, including sensitivity analysis to aid in interpretation of the network, pruning and validation to prevent overtraining, and dynamic networks to automatically find an appropriate network architecture.

Neural Net Node Model Options


This node is available with the Classification module.
Figure 10-2 Neural Net node model options


Editing the Neural Net node allows you to set the parameters for the node. You can set the following parameters:
Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Method. Clementine provides six training methods for building neural network models:
Quick. This method uses rules of thumb and characteristics of the data to choose an appropriate shape (topology) for the network. Note that the method for calculating default size of the hidden layer has changed from previous versions of Clementine. The new method will generally produce smaller hidden layers that are faster to train and generalize better. If you find you get poor accuracy with the default size, try increasing the size of the hidden layer on the Expert tab or try an alternative training method.
Dynamic. This method creates an initial topology but modifies the topology by adding and/or removing hidden units as training progresses.


Multiple. This method creates several networks of different topologies (the exact number depends on the training data). These networks are then trained in a pseudo-parallel fashion. At the end of training, the model with the lowest RMS error is presented as the final model.
Prune. This method starts with a large network and removes (prunes) the weakest units in the hidden and input layers as training proceeds. This method is usually slow, but it often yields better results than other methods.
RBFN. The radial basis function network (RBFN) uses a technique similar to k-means clustering to partition the data based on values of the target field.


Exhaustive prune. This method is related to the Prune method. It starts with a large network and prunes the weakest units in the hidden and input layers as training proceeds. With Exhaustive Prune, network training parameters are chosen to ensure a very thorough search of the space of possible models to find the best one. This method is usually the slowest, but it often yields the best results. Note that this method can take a long time to train, especially with large datasets.
Prevent overtraining. This option randomly splits the data into separate training and testing sets for purposes of model building. The network is trained on the training set, and accuracy is estimated based on the test set. Specify the proportion of the data to be used for training in the Sample % box in the Neural Net node, and the remainder of the data will be used for validation. Note: If a separate partition field is in use (as created by a Partition node, for example), the Prevent overtraining setting is applied to the training partition only, effectively partitioning the partition. This shouldn't be a problem (each algorithm uses the training data as it sees fit) but is noted here to alleviate any possible confusion.
Set random seed. If no random seed is set, the sequence of random values used to initialize the network weights will be different every time the node is executed. This can cause the node to create different models on different runs, even if the node settings and data values are exactly the same. By selecting this option, you can set the random seed to a specific value so the resulting model is exactly reproducible. A specific random seed always generates the same sequence of random values, in which case executing the node always yields the same generated model. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node in Chapter 3 on p. 54.
Stop on. You can select one of the following stopping criteria:
Default. With this setting, the network will stop training when the network appears to have reached its optimally trained state. If the default setting is used with the Multiple training method, the networks that fail to train well are discarded as training progresses.
Accuracy (%). With this option, training will continue until the specified accuracy is attained. This may never happen, but you can interrupt training at any point and save the net with the best accuracy achieved so far.
Cycles. With this option, training will continue for the specified number of cycles (passes through the data).


Time (mins). With this option, training will continue for the specified amount of time (in minutes). Note that training may go a bit beyond the specified time limit in order to complete the current cycle.
Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate at some sacrifice to speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg. For more information, see Using the options.cfg File in Chapter 4 in Clementine 11.1 Server Administration and Performance Guide.

Neural Net Node Additional Options


This node is available with the Classification module.

Figure 10-3 Neural Net node options

Continue training existing model. By default, each time you execute a Neural Net node, a completely new network is created. If you select this option, training continues with the last net successfully produced by the node. The node correctly handles changes in training method between runs, except that RBFN networks cannot be adapted to other types of networks. Thus, when changing to or from the RBFN method, a new network is always created when the changed node is executed.
Use binary set encoding. If this option is selected, Clementine will use a compressed binary encoding scheme for set fields. This option allows you to more easily build neural net models using set fields with large numbers of values as inputs. However, if you use this option, you may need to increase the complexity of the network architecture (by adding more hidden units or more hidden layers) to allow the network to properly use the compressed information in binary encoded set fields. Note: The simplemax and softmax scoring methods, SQL generation, and export to PMML are not supported for models that use binary set encoding. For more information, see Neural Network Model Settings on p. 329.
Show feedback graph. If this option is selected, you will see a graph that displays the accuracy of the network over time as it learns. In addition, if you have selected Generate log file, you will see a second graph showing the training set and test set metrics (defined below). Note: This feature can slow training time. To speed training time, deselect this option.

Figure 10-4 Neural Net feedback graph

Model selection. By default, when training is interrupted, the node will return the Best network as the generated net node. You can request that the node return the Final model instead.
Sensitivity analysis. With this option selected, a sensitivity analysis of the input fields will be performed after the network has been trained. The sensitivity analysis provides information on which input fields are most important in predicting the output field(s). (These results are part of the model information available in the generated model browser.)
Generate log file. If this option is selected, information on training progress will be written to the specified log file. To change the log file, enter a log filename or use the File Chooser button (labeled with an ellipsis) to select a location. (If you select a file that already exists, the new information will be appended to the file.) The format of each entry in the log file is as follows:
<Time> <Net ID> <Training Cycle> <Training Set Metric> <Test Set Metric>

<Time> takes the format HH:MM:SS.
<Net ID> indicates which network is being trained when the network is in Multiple training mode. For other training modes, the value is always 1.


<Training Cycle> is an integer, incrementing from 0 on each training run.
<Training Set Metric> and <Test Set Metric> are measures of network performance on the training data and test data, respectively. (These values are identical when Prevent overtraining is deselected.) They are calculated as the squared correlation between predicted and actual values divided by the mean squared error (MSE). If both Generate log file and Show feedback graph are selected, these metrics are displayed in the feedback graph in addition to the usual accuracy values.
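If you need to post-process such a log outside Clementine, the five whitespace-separated fields described above are straightforward to read back. The following Python sketch is illustrative only; the log filename is hypothetical.

def read_training_log(path):
    # Parse each log entry into a dictionary; skip blank or malformed lines.
    entries = []
    with open(path) as log:
        for line in log:
            parts = line.split()
            if len(parts) != 5:
                continue
            time_str, net_id, cycle, train_metric, test_metric = parts
            entries.append({
                "time": time_str,                 # HH:MM:SS
                "net_id": int(net_id),            # always 1 except in Multiple mode
                "cycle": int(cycle),
                "train_metric": float(train_metric),
                "test_metric": float(test_metric),
            })
    return entries

for entry in read_training_log("neural_net_training.log"):
    print(entry["cycle"], entry["train_metric"], entry["test_metric"])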

Neural Net Node Learning Rates


Neural net training is controlled by several parameters. These parameters can be set by using the Expert tab of the Neural Net node dialog box.
Alpha. A momentum term used in updating the weights during training. Momentum tends to keep the weight changes moving in a consistent direction. Specify a value between 0 and 1. Higher values of alpha increase momentum, decreasing the tendency to change direction based on local variations in the data.


Eta. The learning rate, which controls how much the weights are adjusted at each update. Eta changes as training proceeds for all training methods except RBFN, where eta remains constant. Initial Eta is the starting value of eta. During training, eta starts at Initial Eta, decreases to Low Eta, then is reset to High Eta and decreases to Low Eta again. The last two steps are repeated until training is complete. This process is shown in the following figure.
Figure 10-5 How eta changes during neural network training

Eta decay specifies the rate at which eta decreases, expressed as the number of cycles to go from High Eta to Low Eta. Specify values for each eta option.
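The schedule described above can be written out explicitly. The following Python sketch is an illustration of the description, not the exact Clementine implementation; in particular, a linear descent from the current high value to Low Eta over Eta decay cycles is assumed.

def eta_schedule(initial_eta, high_eta, low_eta, eta_decay, cycles):
    # Return eta for each training cycle: start at Initial Eta, decay to Low Eta,
    # then repeatedly reset to High Eta and decay to Low Eta again.
    etas = []
    current_high = initial_eta          # the first descent starts from Initial Eta
    position = 0
    for _ in range(cycles):
        fraction = min(position / eta_decay, 1.0)
        etas.append(current_high - (current_high - low_eta) * fraction)
        position += 1
        if position > eta_decay:        # reached Low Eta: reset to High Eta
            current_high = high_eta
            position = 0
    return etas

print(eta_schedule(initial_eta=0.3, high_eta=0.1, low_eta=0.01, eta_decay=5, cycles=15))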

Generated Neural Network Models


Generated neural network models contain all of the information captured by the trained network, as well as information about the neural network's characteristics, such as accuracy and architecture. When you execute a stream containing a generated neural network model, a new field is added to the stream for each output field from the original training data. The new field contains the network's prediction for the corresponding output field. The name of each new prediction field is the name of the output field being predicted, with $N- added to the beginning. For example, for an output field named profit, the predicted values would appear in a new field called $N-profit. For symbolic output fields, a second new field is also added, containing the confidence for the prediction. The confidence field is named in a similar manner, with $NC- added to the beginning of the original output field name.
Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model. For more information, see Generating a Filter Node from a Neural Network on p. 332.

Neural Network Model Settings


The Settings tab for a neural network model specifies how confidences are calculated and whether SQL is generated to take advantage of in-database mining. This tab is only available after the generated model has been added to a stream.

Figure 10-6 Sample Generated Neural Net node Settings tab

You can specify whether confidences are calculated for categorical output fields, and you can specify the method used.
Difference Method

The Difference method calculates confidences for flag and set data as follows:
Flag data. Confidence is computed as abs(0.5 - Raw Output) * 2. Values are converted into a scale of 0 to 1. If the output unit value is below 0.5, it is predicted as 0 (false), and if it is 0.5 or above, it is predicted as 1 (true). For example, if the Neural Net prediction value is 0.72, this is displayed as true, and the confidence will be abs(0.5 - 0.72) * 2 = 0.44.
Set data. Set output fields are internally converted to flags for neural networks, so there is a separate raw output value for each category of the output field. Values are converted into a scale of 0 to 1. Confidence is computed as (Highest Raw Output - Second Highest Raw Output). The highest scaled value defines which predicted set value is chosen, and the difference between the highest scaled value and the second highest scaled value is the confidence. For example, if there are four set values (red, blue, white, black), and the scaled values produced by Neural Net are red = 0.32, blue = 0.85, white = 0.04, and black = 0.27, then the predicted set value would be blue, and the confidence would be 0.85 - 0.32 = 0.53.
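Both calculations can be summarized in a short sketch (illustrative Python only, not Clementine code):

def flag_confidence(raw_output):
    # Flag fields: predict true at or above 0.5; confidence is the distance from 0.5, rescaled to 0-1.
    prediction = raw_output >= 0.5
    return prediction, abs(0.5 - raw_output) * 2

def set_confidence(scaled_outputs):
    # Set fields: predict the category with the highest scaled output;
    # confidence is the gap between the highest and second highest values.
    ranked = sorted(scaled_outputs, key=scaled_outputs.get, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, scaled_outputs[best] - scaled_outputs[second]

print(flag_confidence(0.72))   # prediction True, confidence 0.44 (up to rounding)
print(set_confidence({"red": 0.32, "blue": 0.85, "white": 0.04, "black": 0.27}))   # blue, 0.53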
SoftMax and SimpleMax Methods

The SoftMax and SimpleMax methods for calculating confidences are supported by the PMML standard. The SimpleMax method actually generates pseudo-probabilities that sum to 1, and it is used mainly for hidden layers in RBF (Radial Basis Function) networks. For most purposes, the SoftMax method is recommended.


Generate SQL for this model. There are two ways you can use SQL with Clementine:

Export the SQL as a text file for modification and use in another, unconnected, database. For more information, see Browsing Generated Models in Chapter 6 on p. 239.
Enable SQL generation for the model in order to take advantage of database performance. This setting only applies when using data from a database. For more information, see SQL Optimization in Chapter 6 in Clementine 11.1 Server Administration and Performance Guide. Also note that actual performance results may vary, depending on the complexity of the model and the capabilities of each DBMS, particularly when large numbers of categorical predictors with many discrete values are used. In general, the more complex the model, the greater the likelihood that databases will struggle with the resulting generated SQL.

Neural Network Model Summary


The Summary tab for a neural network model displays information about the estimated accuracy of the network; the architecture or topology of the network; and the relative importance of fields, as determined by sensitivity analysis (if you requested it). In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537.
Figure 10-7 Sample generated net node Summary tab

Estimated accuracy. This is an index of the accuracy of the predictions. For symbolic outputs, this is simply the percentage of records for which the predicted value is correct. For numeric targets, the calculation is based on the differences between the predicted values and the actual values in the training data. The formula for finding the accuracy of numeric fields is

(1.0 - abs(Actual - Predicted) / (Range of Output Field)) * 100.0

where Actual is the actual value of the output field, Predicted is the value predicted by the network, and Range of Output Field is the range of values for the output field (the highest value for the field minus the lowest value). This accuracy is calculated for each record, and the overall accuracy is the average of the values for all records in the training data. Because these estimates are based on the training data, they are likely to be somewhat optimistic. The accuracy of the model on new data will usually be somewhat lower than this.
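The same calculation, written directly as a Python sketch of the formula above (not Clementine code):

def numeric_accuracy(actuals, predictions):
    # Average per-record accuracy, in percent, as defined above.
    output_range = max(actuals) - min(actuals)      # range of the output field
    per_record = [
        (1.0 - abs(actual - predicted) / output_range) * 100.0
        for actual, predicted in zip(actuals, predictions)
    ]
    return sum(per_record) / len(per_record)

print(numeric_accuracy([10, 20, 30, 40], [12, 19, 33, 38]))   # approximately 93.3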
Input, Hidden, and Output Layers. The number of units is listed separately for each layer in the network.
Relative Importance of Inputs. This section contains the results of the sensitivity analysis if you requested one. The input fields are listed in order of importance, from most important to least important. The value listed for each input is a measure of its relative importance, varying between 0 (a field that has no effect on the prediction) and 1.0 (a field that completely determines the prediction).

Generating a Filter Node from a Neural Network


Figure 10-8 Generate Filter Node dialog box

You can generate a Filter node from a generated neural network model. The dialog box contains a list of fields in descending order of relative importance in the model. Select the fields to be retained in the model, and then click OK. The generated Filter node will appear on the stream canvas.
Selecting fields. Click the last field you want to retain (the one with the smallest relative importance that meets your criteria). This will select that field and all fields with a higher relative importance. The top field (with the highest importance) is always selected.

Decision List
This node is available with the Classification module.


Decision List models identify subgroups or segments that show a higher or lower likelihood of a binary (yes or no) outcome relative to the overall sample. For example, you might look for customers who are least likely to churn or most likely to say yes to a particular offer or campaign. The Decision List Viewer gives you complete control over the model, allowing you to edit segments, add your own business rules, specify how each segment is scored, and customize the model in a number of other ways to optimize the proportion of hits across all segments. As such, it is particularly well suited for generating mailing lists or otherwise identifying which records to target for a particular campaign. You can also use multiple mining tasks to combine modeling approaches; for example, by identifying high and low performing segments within the same model and including or excluding each in the scoring stage as appropriate.


Figure 11-1 Decision List model

Segments, Rules, and Conditions

A model consists of a list of segments, each of which is defined by a rule that selects matching records. A given rule may have multiple conditions; for example:
RFM_SCORE > 10 and MONTHS_CURRENT <= 9

Rules are applied in the order listed, with the first matching rule determining the outcome for a given record. Taken independently, rules or conditions may overlap, but the order of rules resolves ambiguity. If no rule matches, the record is assigned to the remainder rule.
Complete Control over Scoring

The Decision List Viewer allows you to view, modify, and reorganize segments and to choose which to include or exclude for purposes of scoring. For example, you can choose to exclude one group of customers from future offers and include others and immediately see how this affects your overall hit rate. Decision List models return a score of Yes for included segments and $null for everything else, including the remainder. This direct control over scoring makes Decision List models ideal for generating mailing lists, and they are widely used in customer relationship management, including call center or marketing applications.
Figure 11-2 Decision List model

Mining Tasks, Measures, and Selections

The modeling process is driven by mining tasks. Each mining task effectively initiates a new modeling run and returns a new set of alternative models to choose from. The default task is based on your initial specifications in the Decision List node, but you can define any number of custom tasks. You can also apply tasks iteratively; for example, you can run an up search on the entire training set and then run a down search on the remainder to weed out low-performing segments.

Figure 11-3 Creating a mining task

Data Selections

You can define data selections and custom model measures for model building and evaluation. For example, you can specify a data selection in a mining task to tailor the model to a specific region and create a custom measure to evaluate how well that model performs on the whole country. Unlike mining tasks, measures don't change the underlying model but provide another lens to assess how well it performs.

Figure 11-4 Creating a data selection

Adding Your Business Knowledge

By fine-tuning or extending the segments identified by the algorithm, the Decision List Viewer allows you to incorporate your business knowledge right into the model. You can edit the segments generated by the model or add additional segments based on rules that you specify. You can then apply the changes and preview the results.
Figure 11-5 Decision List node Model tab

For further insight, a dynamic link with Excel allows you to export your data to Excel, where it can be used to create presentation charts and to calculate custom measures, such as complex Profit and ROI, which can be viewed in the Decision List Viewer while you are building the model.
Example. The marketing department of a financial institution wants to achieve more profitable results in future campaigns by matching the right offer to each customer. You can use a Decision List model to identify the characteristics of customers most likely to respond favorably based on previous promotions and to generate a mailing list based on the results. For more information, see Modeling Customer Response (Decision List) in Chapter 8 in Clementine 11.1 Applications Guide.


Requirements. A single categorical target field of type Flag or Set that indicates the binary outcome you want to predict (yes/no), and at least one predictor (In) field. When the target field type is Set, you must manually choose a single value to be treated as a hit, or response; all the other values are lumped together as not hit. An optional frequency field may also be specified. Continuous date/time fields are ignored. Continuous numeric range predictors are automatically binned by the algorithm as specified on the Expert tab in the modeling node. For finer control over binning, add an upstream binning node and use the binned field as input with type Ordered Set.

Decision List Model Options


Figure 11-6 Decision List node: Model tab

Mode. Specifies the method used to build the model.
Generate model. Automatically generates a model on the models palette when the node is executed. The resulting model can be added to streams for purposes of scoring but cannot be further edited.
Launch interactive session. Opens the Decision List Viewer interactive modeling (output) window, allowing you to pick from multiple alternatives and repeatedly apply the algorithm with different settings to progressively grow or modify the model. For more information, see Decision List Viewer on p. 342.
Use saved interactive session information. Launches an interactive session using previously saved settings. Interactive settings can be saved from the Decision List Viewer using the Generate menu (to create a model or modeling node) or the File menu (to update the node from which the session was launched). For more information, see Generate New Model on p. 356.
Target value. Specifies the value of the target field that indicates the outcome you want to model. For example, if the target field churn is coded 0 = no and 1 = yes, specify 1 to identify rules that indicate which records are likely to churn.
Search direction. Indicates the search direction, Up or Down, relative to the target value. An upward search looks for segments with a high frequency of the target value; a downward search creates segments with a low frequency. Finding and excluding low-frequency segments can be a useful way to improve your model and can be particularly useful when the remainder has a low frequency.
Maximum number of new segments. Specifies the maximum number of segments to return. The top N segments are created, where the best segment is the one with the highest probability or, if more than one model has the same probability, the highest coverage. The minimum allowed setting is 1; there is no maximum setting.
Minimum segment size. The two settings below dictate the minimum segment size. The larger of the two values takes precedence. For example, if the percentage value equates to a number higher than the absolute value, the percentage setting takes precedence.
As percentage of previous segment (%). Specifies the minimum group size as a percentage of records. The minimum allowed setting is 0; the maximum allowed setting is 99.9.
As absolute value (N). Specifies the minimum group size as an absolute number of records. The minimum allowed setting is 1; there is no maximum setting.


Segment rules.
Maximum attributes per segment. Specifies the maximum number of conditions per segment rule. The minimum allowed setting is 1; there is no maximum setting.


Allow attribute reuse within segment. When enabled, each cycle can consider all attributes, even those that have been used in previous cycles. The conditions for a segment are built up in cycles, where each cycle adds a new condition. The number of cycles is defined using the Maximum number of attributes setting.
Confidence interval for new conditions (%). Specifies the confidence level for testing segment significance. This setting plays a significant role in the number of segments (if any) that are returned as well as the number-of-conditions-per-segment rule. The higher the value, the smaller the returned result set. The minimum allowed setting is 50; the maximum allowed setting is 99.9.


Decision List Node Expert Options


Figure 11-7 Decision List node: Expert tab

Expert options allow you to fine-tune the model-building process.


Binning method. The method used for binning continuous fields (equal count or equal width).
Number of bins. The number of bins to create for continuous fields. The minimum allowed setting is 2; there is no maximum setting.


Model search width. The maximum number of model results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting.


Rule search width. The maximum number of rule results per cycle that can be used for the next cycle. The minimum allowed setting is 1; there is no maximum setting.
Bin merging factor. The minimum amount by which a segment must grow when merged with its neighbor. The minimum allowed setting is 1.01; there is no maximum setting.


Allow missing values in conditions. True to allow the IS MISSING test in rules.
Discard intermediate results. When True, only the final results of the search process are returned. A final result is a result that is not refined any further in the search process. When False, intermediate results are also returned.
Maximum number of alternatives. Specifies the number of alternatives that will be returned upon running the mining task. The minimum allowed setting is 1; there is no maximum setting.


Generated Decision List Models


A model consists of a list of segments, each of which is defined by a rule that selects matching records. You can easily view or modify the segments before generating the model and choose which ones to include or exclude. When used in scoring, Decision List models return Yes for included segments and $null for everything else, including the remainder. This direct control over scoring makes Decision List models ideal for generating mailing lists, and they are widely used in customer relationship management, including call center or marketing applications.
Figure 11-8 Decision List model

When you execute a stream containing a Decision List model, the node adds three new fields containing the score (either 1, meaning Yes, for included segments, or $null for excluded segments), the probability (hit rate) for the segment within which the record falls, and the ID number for the segment. The names of the new fields are derived from the name of the output field being predicted, prefixed with $D- for the score, $DP- for the probability, and $DI- for the segment ID. The model is scored based on the target value specified when the model was built. You can manually exclude segments so that they score as $null. For example, if you run a down search to find segments with lower than average hit rates, these low segments will be scored as Yes unless you manually exclude them. If necessary, nulls can be recoded as No using a Derive or Filler node.
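A rough sketch of this scoring behavior in Python (illustrative only, not Clementine code; segments are represented here as hypothetical dictionaries holding a predicate, a probability, and an ID, and a record caught by an excluded segment is simply scored as $null in this sketch):

def score_decision_list(segments, record):
    # Return values analogous to the $D-, $DP-, and $DI- fields: score, probability, segment ID.
    for segment in segments:
        if segment["applies"](record):
            if segment["excluded"]:
                return None, None, None      # excluded segment: scored as $null
            return 1, segment["probability"], segment["id"]
    return None, None, None                  # remainder (no segment matched): $null

segments = [
    {"id": 1, "excluded": False, "probability": 0.63,
     "applies": lambda r: r["RFM_SCORE"] > 10 and r["MONTHS_CURRENT"] <= 9},
    {"id": 2, "excluded": True, "probability": 0.21,
     "applies": lambda r: r["RFM_SCORE"] <= 10},
]
print(score_decision_list(segments, {"RFM_SCORE": 12, "MONTHS_CURRENT": 6}))   # (1, 0.63, 1)
print(score_decision_list(segments, {"RFM_SCORE": 5, "MONTHS_CURRENT": 12}))   # (None, None, None)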
PMML

A Decision List model can be stored as a PMML RuleSetModel with a first hit selection criterion. However, all of the rules are expected to have the same score. To allow for changes to the target field or the target value, multiple ruleset models can be stored in one file to be applied in order, cases not matched by the first model being passed to the second, and so on. The algorithm name DecisionList is used to indicate this non-standard behavior, and only ruleset models with this name are recognized as Decision List models and scored as such.

Decision List Generated Model Settings Tab


The Settings tab is used to enable and disable SQL push-back.
Generate SQL for this model. When enabled, Clementine will attempt to push back the Decision List model to SQL.

Decision List Viewer


The easy-to-use, task-based Decision List Viewer graphical interface takes the complexity out of the model-building process, freeing you from the low-level details of data mining techniques and allowing you to devote your full attention to those parts of the analysis requiring user intervention, such as setting objectives, choosing target groups, analyzing the results, and selecting the optimal model. Before you start working with Decision List Viewer, you should make sure that the following options have been configured and are available.

Decision List Viewer Workspace


The Decision List Viewer workspace provides options for configuring, evaluating, and deploying models. The workspace consists of the following panes:
Working model pane. Displays the current model representation.
Preview pane. Displays an alternative model or model snapshot to compare to the working model.
Managers pane. The Session Results tab displays mining task results as well as alternative models. The Snapshots tab displays current model snapshots (a snapshot is a model representation at a specific point in time).

Note: The generated, read-only model displays only the working model pane and cannot be modified.

Working Model Pane


The working model pane displays the current model, including mining tasks and other actions that apply to the working model.

Figure 11-9 Working model pane

ID. Identifies the sequential segment order. Model segments are calculated, in sequence, according to their ID number.
Segment. Provides the segment name and defined segment conditions. By default, the segment name is the field name or concatenated field names used in the conditions, with a comma as a separator.
Score. Represents the field that you want to predict, whose value is assumed to be related to the values of other fields (the predictors). Note: The following options can be toggled to display via the Organizing Model Measures dialog.
Cover. The pie chart visually identifies the coverage each segment has in relation to the entire cover.
Cover (n). Lists the coverage for each segment in relation to the entire cover.
Frequency. Lists the number of hits received in relation to the cover. For example, when the cover is 79 and the frequency is 50, that means that 50 out of 79 responded for the selected segment.
Probability. Indicates the segment probability. For example, when the cover is 79 and the frequency is 50, that means that the probability for the segment is 63.29% (50 divided by 79).
Error. Indicates the segment error.

The information at the bottom of the pane indicates the cover, frequency, and probability for the entire model.


Working Model Toolbar

The working model pane provides the following functions via a toolbar. Note: The functions are also available by right-clicking a model segment.
Table 11-1 Working Model Toolbar

Initiates the default mining task. The default mining task is defined in the Mining Tasks dialog (the task can be started from the dialog as well).
Launches the Mining Tasks dialog that provides options for manually adding segments to a model by running mining tasks.
Launches the Inserting Segments dialog that provides options for creating new model segments.
Launches the Editing Segment Rules dialog that provides options for adding conditions to model segments or changing previously defined model segment conditions.
Moves the selected segment up in the model hierarchy.
Moves the selected segment down in the model hierarchy.
Deletes the selected segment.
Toggles whether the selected segment is included in the model. When excluded, the segment results are added to the remainder. This differs from deleting a segment in that you have the option of reactivating the segment.
Takes a snapshot of the current model structure. Snapshots display in the Snapshots tab and are commonly used for model comparison purposes.

Preview Pane
The Preview pane allows you to compare the working model against model snapshots or generated alternative models. By default, the Preview pane is minimized. Click Preview on the Managers pane to maximize the Preview pane.

Figure 11-10 Preview pane

The information at the bottom of the pane indicates the preview model's cover, frequency, and probability.
Preview Toolbar

The Preview pane provides the following functions via a toolbar.


Table 11-2 Preview Toolbar

When a new model is previewed and this button is clicked, the current preview model becomes the working model. All model measures and segments from the Preview pane are transferred to the Working Model pane.
When an alternative model or snapshot is previewed and this button is clicked, a copy of the model is created in the Session Results tab and the current preview model becomes the working model. All model measures and segments from the Preview pane are transferred to the Working Model pane.
Displays the previous alternative model. Alternative model order is indicated in the Session Results tab.
Displays the next alternative model. Alternative model order is indicated in the Session Results tab.
Displays the previous snapshot view. Snapshot order is indicated in the Snapshots tab.
Displays the next snapshot view. Snapshot order is indicated in the Snapshots tab.


Managers Pane
The Managers pane contains the following tabs: Session Results Tab. Lists the generated mining task results. Snapshots Tab. Lists the model snapshots.

Session Results Tab


The Session Results tab lists all alternative mining results for the selected model or segment on the working model pane. Each result, if there are any, appears alongside the name of the target that motivated the analysis. As you select segments on the working model pane, the Session Results tab displays any alternatives that have been found for that segment.
Figure 11-11 Session Results tab

Each generated model alternative displays specific model information. For example:

1.1 Alternative 1 [9#, 58.87%]

1.1 indicates the mining task sequence.
Alternative 1 indicates that this is the first returned alternative. The first alternative usually contains the best results.
9# indicates that the alternative model contains nine segments.
58.87% indicates that the alternative model has a probability of 58.87%.


To move quickly through the list of alternatives, you can use the scroll bars at the right and bottom edges of the screen. The scroll bars are displayed automatically as soon as there are more alternatives than fit in the window. Note: Session results are not saved with the model; results are valid only during the active session.

Snapshots Tab
A snapshot is a view of a model at a specific point in time. For example, you could take a model snapshot when you want to load a different alternative model into the working model pane but do not want to lose the work on the current model. The Snapshots tab lists all model snapshots manually taken for any number of working model states. Note: Snapshots are saved with the model. The first snapshot is automatically generated when Decision List Viewer loads the model. This snapshot preserves the original model structure, ensuring that you can always return to the original model state. The generated snapshot name displays as a timestamp, indicating when it was generated.
Create a Model Snapshot
E Select an appropriate model/alternative to display in the working model pane.
E Make any necessary changes to the working model.
E Click Take Snapshot. A new snapshot is displayed on the Snapshots tab.
Figure 11-12 Snapshots pane


Name. The snapshot name. You can change a snapshot name by right-clicking the snapshot and selecting Rename Snapshot from the menu.
Target. Indicates the target value at the time the snapshot was created.
#. Indicates the number of segments contained in the selected snapshot.
P (%). Indicates the probability percentage for the selected snapshot.

When you select a snapshot, the snapshot's model segments are loaded into the Preview pane. Click Preview to display the pane. You can delete a snapshot by clicking Delete or by right-clicking the snapshot and selecting Delete from the menu.

Working with Decision List Viewer


A model that will best predict customer response and behavior is built in various stages. When Decision List Viewer launches, the working model is populated with the defined model segments and measures, ready for you to start a mining task, modify the segments/measures as required, and generate a new model or modeling node. You can add one or more segment rules until you have developed a satisfactory model. You can add segment rules to the model by running mining tasks or by using the Edit Segment Rule function. In the model building process, you can assess the model's value by validating the model against measure data, by visualizing the model in a chart, or by generating custom Excel measures. When you feel certain about the model's quality, you can generate a new model and place it on the Clementine canvas or Model palette.

Mining Tasks
A mining task is a collection of parameters that determines the way new rules are generated. Some of these parameters are selectable to provide the user with the flexibility to adapt models to new situations. A task consists of a task template (type), a target, and a build selection (mining data set). The following sections detail the various mining task operations:
Running Mining Tasks
Creating a Mining Task
Organizing Data Selections

Running Mining Tasks


Decision List Viewer allows you to manually add rules to a model by running mining tasks or by copying and pasting rules between models. A mining task holds information on how to generate new rules (the data mining parameter settings, such as the search strategy, source attributes, search width, confidence level, and so on), the customer behavior to predict, and the data to investigate. The goal of a mining task is to search for the best possible rules.


To generate a model segment by running a mining task:


E Click the Remainder row. If there are already segments displayed on the model canvas, you can also select one of the segments to find additional rules based on the selected segment. After selecting the remainder or segment, use one of the following methods to access the Organize Mining Tasks dialog box:
From the Tools menu, choose Organize Mining Tasks.
Right-click the Remainder row/segment and choose Organize Mining Tasks.
Click the Organize Mining Tasks button on the toolbar of the working model pane.
The Organize Mining Tasks dialog box opens.
Figure 11-13 Organize Mining Tasks dialog box

E Select one of the mining tasks from the list of predefined tasks.

Note: If no mining tasks are defined, you must create one.


E Click Execute.

While the task is processing, the progress is displayed at the bottom of the workspace and informs you when the task has completed. Precisely how long a task takes to complete depends on the complexity of the mining task and the size of the dataset. As soon as the task completes, the results (if any) are added to the Session Results tab. Note: A task result will either complete with models, complete with no models, or fail.
E To view the mining task results for the Model for comparison pane, click one of the generated alternatives on the Session Results tab.

The process of finding new model rules can be repeated until no new rules are added to the model. This means that all significant groups of customers have been found. It is possible to run a mining task on any existing model segment. If the result of a task is not what you are looking for, you can choose to start another mining task on the same segment. This will provide additional found rules based on the selected segment. Segments that are below the selected segment (that is, added to the model later than the selected segment) are replaced by the new segments because each segment depends on its predecessors.

Creating a Mining Task


A mining task is the mechanism that searches for the collection of rules that make up a data model. Alongside the search criteria defined in the selected template, a task also defines the target (the actual question that motivated the analysis, such as how many customers are likely to respond to a mailing), and it identifies the datasets to be used. The goal of a mining task is to search for the best possible models.
To create a mining task:
E Select the segment from which you want to mine additional segment conditions.
E From the Tools menu, choose Organize Mining Tasks. You can also access this function from the toolbar button or by right-clicking in a model and selecting Organize Mining Tasks. The Organize Mining Tasks dialog box opens.


Figure 11-14 Organize Mining Tasks dialog box

E Click the Create a new mining task button. The Define New Mining Task dialog box opens.
Figure 11-15 Define New Mining Task dialog box

E Provide a new task name and select a predefined task on which to base the new task.
E Click Continue. The Create/Edit Mining Task dialog box opens. The dialog box provides options for further defining the mining task. Make any necessary changes and click OK to return to the Organize Mining Tasks dialog box.


E Click Set as Default to specify this task as the default task. Decision List Viewer uses the default

task settings to run each task until an alternative task is selected.


E Click Execute to start the mining task on the selected segment.

Create/Edit Mining Task


The Create/Edit Mining Task dialog box provides options for defining a new mining task or editing an existing one. Most parameters available for mining tasks are similar to those offered in the Decision List node. For more information, see Decision List Model Options on p. 338.

Data

Build selection. Provides options for specifying the evaluation measure that Decision List Viewer should analyze to find new rules. The listed evaluation measures are created/edited in the Organize Data Selections dialog box.

Available fields. Provides options for displaying all fields or manually selecting which fields to display.

E If the Custom option is selected, click Edit to open the Customize Available Fields dialog box, which allows you to select which fields are available as segment attributes found by the mining task.
Figure 11-16 Customize Available Fields dialog box


Organizing Data Selections


By organizing data selections (a mining dataset), you can specify which evaluation measures Decision List Viewer should analyze to find new rules and select which data selections are used as the basis for measures.
To organize data selections:
E From the Tools menu, choose Organize Data Selections, or right-click a segment and select the

option. The Organize Data Selections dialog box opens.


Figure 11-17 Organize Data Selections dialog box

Note: The Organize Data Selections dialog box also allows you to edit or delete existing data selections.
E Click the Add new data selection button. A new data selection entry is added to the existing table.
E Click Name and enter an appropriate selection name.
E Click Partition and select an appropriate partition type.
E Click Condition and select an appropriate condition option. When Specify is selected, the Specify Selection Condition dialog box opens, providing options for defining specific field conditions.

Figure 11-18 Specify Selection Condition dialog box

E Define the appropriate condition and click OK.

The data selections are available from the Build Selection drop-down list in the Create/Edit Mining Task dialog box. The list allows you to select which evaluation measure is used for a particular mining task.

Segment Rules
You find model segment rules by running a mining task based on a task template or by manually adding segment rules to a model using the Insert Segment or Edit Segment Rule functions. If you choose to mine for new segment rules, the results, if any, are displayed on the Session Results tab. You can quickly refine your model by replacing segment rules on the working model with one of the alternative results. In this way, you can experiment with differing results until you have built a model that accurately describes your optimum target group.

Inserting Segments
To add a segment rule condition to a model:
E Select a model location where you want to add a new segment. The new segment will be inserted directly above the selected segment.
E From the Edit menu, choose Insert Segment, or access this selection by right-clicking a segment. The Insert Segment dialog box opens, allowing you to insert new segment rule conditions.
E Click Insert. The Insert Condition dialog box opens, allowing you to define the attributes for the new rule condition.

E Select a field and an operator from the drop-down lists.

Note: If you select the Not in operator, the selected condition will function as an exclusion condition and displays in red in the Insert Rule dialog box. For example, when the condition region = 'TOWN' displays in red, it means that TOWN is excluded from the result set.

E Enter one or more values or click the Insert Value icon to display the Insert Value dialog box. The dialog box allows you to choose a value defined for the selected field (for example, the field married will provide the values yes and no).
E Click OK to return to the Insert Segment dialog box. Click OK a second time to add the created segment to the model. The new segment will display in the specified model location.

Editing Segment Rules


The Edit Segment Rule functionality allows you to add, change, or delete segment rule conditions.
To change a segment rule condition:
E Select the model segment that you want to edit.
E From the Edit menu, choose Edit Segment Rule, or right-click on the rule to access this selection. The Edit Segment Rule dialog box opens.


E Select the appropriate condition and click Edit.

The Edit Condition dialog box opens, allowing you to define the attributes for the selected rule condition.
E Select a field and an operator from the drop-down lists.

Note: If you select the Not in operator, the selected condition will function as an exclusion condition and displays in red in the Edit Segment dialog box. For example, when the condition region = 'TOWN' displays in red, it means that TOWN is excluded from the result set.

E Enter one or more values or click the Insert Value button to display the Insert Value dialog box. The dialog box allows you to choose a value defined for the selected field (for example, the field married will provide the values yes and no).
E Click OK to return to the Edit Segment dialog box. Click OK a second time to return to the working model. The selected segment will display with the updated rule conditions.

Deleting Segment Rule Conditions


To delete a segment rule condition:
E Select the model segment containing the rule conditions that you want to delete.

E From the Edit menu, choose Edit Segment Rule, or right-click on the segment to access this selection. The Edit Segment Rule dialog box opens, allowing you to delete one or more segment rule conditions.
E Select the appropriate rule condition and click Delete.
E Click OK.

Deleting one or more segment rule conditions causes the working model pane to refresh its measure metrics.

Alternative Models
The Session Results tab displays the results of each mining task. Each result consists of the condition in the selected data that is most consistent with the target, as well as any good enough alternatives. The total number of alternatives depends on the search criteria effective during the analysis process.
To view alternative models:
E Click on an alternative model on the Session Results tab. The alternative model segments display

or replace the current model segments in the Preview pane.


E To work with an alternative model in the working model pane, click Promote to Working Model

in the Preview pane or right-click an alternative name on the Session Results tab and select Promote to Working Model. Note: Alternative models are not saved when you generate a new model.

Customizing a Model
Data are not static. Customers move, get married, and change jobs. Products lose market focus and become obsolete. Decision List Viewer offers business users the flexibility to adapt models to new situations easily and quickly. You can change a model by editing, prioritizing, deleting, or inactivating specific model segments.

Prioritizing Segments
You can rank model rules in any order you choose. By default, model segments are displayed in order of priority, the first segment having the highest priority. When you assign a different priority to one or more of the segments, the model is changed accordingly. You may alter the model as required by moving segments to a higher or lower priority position.
To prioritize model segments:
E Select the model segment to which you want to assign a different priority.

E Click one of the two arrow buttons on the working model pane toolbar to move the selected model segment up or down the list.

After prioritization, all previous assessment results are recalculated and the new values are displayed.

Deleting Segments
To delete one or more segments:
E Select a model segment.
E From the Edit menu, select Delete Segment, or click the delete button on the toolbar of the working model pane.

The measures are recalculated for the modified model, and the model is changed accordingly.

Excluding Segments
As you are searching for particular groups, you will probably base business actions on a selection of the model segments. When deploying a model, you may choose to exclude segments within a model. Excluded segments are scored as null values. Excluding a segment does not mean the segment is not used; it means that all records matching this rule are excluded from the mailing list. The rule is still applied, but differently.
To exclude specific model segments:
E Select a segment from the working model pane.
E Click the Toggle Segment Exclusion button on the toolbar of the working model pane. Excluded is now displayed in the selected Target column of the selected segment.

Note: Unlike deleted segments, excluded segments remain available for reuse in the final model. Excluded segments affect chart results.

Generate New Model


The Generate New Model dialog box provides options for naming the model and selecting where the new node is created.
Model name. Select Custom to adjust the auto-generated name or to create a unique name for

the node as displayed on the stream canvas.


Create node on. Selecting Canvas places the new model on the working canvas; selecting GM
Palette places the new model on the Models palette; selecting Both places the new model on both

the working canvas and the Models palette.


Include interactive session state. When enabled, the interactive session state is preserved in the

generated model. When you later generate a modeling node from the model, the state is carried over and used to initialize the interactive session. Regardless of whether the option is selected, the model itself scores new data identically. When the option is not selected, the model is still able to create a build node, but it will be a more generic build node that starts a new interactive session rather than pick up where the old session left off. If you change the node settings but execute with a saved state, the settings you have changed are ignored in favor of the settings from the saved state. Note: The standard metrics are the only metrics that remain with the model. Additional metrics are preserved with the interactive state. The generated model does not represent the saved interactive mining task state. Once you launch the Decision List Viewer, it displays the settings originally made through the Viewer. For more information, see Regenerating a Modeling Node in Chapter 6 on p. 242.

Model Assessment
Successful modeling requires the careful assessment of the model before implementation in the production environment takes place. Decision List Viewer provides a number of statistical and business measures that can be used to assess the impact of a model in the real world. These include gains charts and full interoperability with Excel, thus enabling cost/benefit scenarios to be simulated for assessing the impact of deployment. You can assess your model in the following ways:
Using the predefined statistical and business measures available in Decision List Viewer (probability, frequency).
Evaluating measures imported from Microsoft Excel.
Visualizing the model using a gains chart.

Organizing Model Measures


Decision List Viewer provides options for defining the measures that are calculated and displayed as columns. Each segment can include the default cover, frequency, probability, and error measures represented as columns. You can also create new measures that will be displayed as columns.
Defining Model Measures To add a measure to your model or to define an existing measure:
E From the Tools menu, choose Organize Model Measures, or right-click on the model to make this

selection. The Organize Model Measures dialog box opens.

Figure 11-19 Organize Model Measures dialog box

E Click the Add new model measure button. A new measure is displayed in the Metrics and

Selections table.
E Provide a measure name and select an appropriate type, display option, and selection. The Show

column indicates whether the measure will display for the working model. When defining an existing measure, select an appropriate metric and selection and indicate if the measure will display for the working model.
E Click OK to return to the Decision List Viewer workspace. If the Show column for the new

measure was checked, the new measure will display for the working model.
Custom Metrics in Excel

For more information, see Assessment in Excel on p. 359.

Refreshing Measures
In certain cases, it may be necessary to recalculate the model measures, such as when you apply an existing model to a new set of customers.
To recalculate (refresh) the model measures:

From the Edit menu, choose Refresh All Measures.


or Press F5. All measures are recalculated, and the new values are shown for the working model.

Assessment in Excel
Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts, calculate custom measures, such as complex profit and ROI measures, and view them in Decision List Viewer while building the model. For more information, see Calculating Custom Measures Using Excel in Chapter 8 in Clementine 11.1 Applications Guide. Note: In order for you to work with an Excel spreadsheet, the analytical CRM expert has to define configuration information for the synchronization of Decision List Viewer with Microsoft Excel. The configuration is contained in an Excel spreadsheet file and indicates which information is transferred from Decision List Viewer to Excel, and vice versa. The following steps are valid only when MS Excel is installed. If Excel is not installed, the options for synchronizing models with Excel are not displayed.
To synchronize models with MS Excel:
E Open the model and click the Organize Model Measure button on the toolbar of the working

model pane.
E Select Yes for the Calculate custom measures in Excel option. The Workbook field activates, allowing you to select a preconfigured Excel workbook template.


E Click the Connect to Excel button. The Open dialog box opens, allowing you to navigate to the

preconfigured template location on your local or network file system.


E Select the appropriate Excel template and click Open. The selected Excel template launches, and

the Choose Inputs for Custom Measures dialog box opens.


E Select the appropriate mappings between the metric names defined in the Excel template and the model metric names and click OK.

Once the link is established, Excel starts with the preconfigured Excel template that displays the model rules in the spreadsheet. The results calculated in Excel are displayed as new columns in Decision List Viewer. Excel metrics do not remain when the model is saved; the metrics are valid only during the active session. However, you can create snapshots that include Excel metrics. The Excel metrics saved in the snapshot views are valid only for historical comparison purposes and do not refresh when reopened. For more information, see Snapshots Tab on p. 347. The Excel metrics will not display in the snapshots until you reestablish a connection to the Excel template.


MS Excel Integration Setup


The integration between Decision List Viewer and Microsoft Excel is accomplished through the use of a preconfigured Excel spreadsheet template. The template consists of three worksheets:
Model Builder. Displays the imported Decision List Viewer measures, the custom Excel measures, and the calculation totals (defined on the Settings worksheet).
Settings. Provides the variables to generate calculations based on the imported Decision List Viewer measures and the custom Excel measures.
Configuration. Provides options for specifying which measures are imported from Decision List Viewer and for defining the custom Excel measures.


Metrics from Model. Indicates which Decision List Viewer metrics are used in the calculations.
Metrics to Model. Indicates which Excel-generated metric(s) will be returned to Decision List Viewer. The Excel-generated metrics display as new measure columns in Decision List Viewer.

Note: Excel metrics do not remain with the model when you generate a new model; the metrics are valid only during the active session.

Visualizing Models
The best way to understand the impact of a model is to visualize it. Using a gains chart, you can obtain valuable day-to-day insight into the business and technical benefit of your model by studying the effect of multiple alternatives in real time. The Gains Chart section shows the benefit of a model over randomized decision-making and allows the direct comparison of multiple charts when there are alternative models.

Gains Chart
The gains chart plots the values in the Gains % column from the table. Gains are defined as the proportion of hits in each increment relative to the total number of hits in the tree, using the equation:

(hits in increment / total number of hits) x 100%

Gains charts effectively illustrate how widely you need to cast the net to capture a given percentage of all of the hits in the tree. The diagonal line plots the expected response for the entire sample if the model is not used. In this case, the response rate would be constant, since one person is just as likely to respond as another. To double your yield, you would need to ask twice as many people. The curved line indicates how much you can improve your response by including only those who rank in the higher percentiles based on gain. For example, including the top 50% might net you more than 70% of the positive responses. The steeper the curve, the higher the gain.
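As an illustration of the calculation (a minimal Python sketch with made-up hit counts, not Clementine code), the cumulative gains curve can be computed from per-increment hit counts like this:

# Illustrative only: cumulative gains from per-increment hit counts,
# ordered from the highest-scoring increment to the lowest.
hits_per_increment = [40, 25, 15, 12, 8]
total_hits = sum(hits_per_increment)          # 100 in this example

cumulative = 0
for i, hits in enumerate(hits_per_increment, start=1):
    cumulative += hits
    gain = cumulative / total_hits * 100      # (hits so far / total hits) x 100%
    print(f"Top {i} increment(s): gain = {gain:.1f}%")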

Figure 11-20 Gains tab

To view a gains chart:


E Open a Clementine stream that contains a Decision List node and launch an interactive session

from the node.


E Click the Gains tab. Depending on which partitions are specified, you may see one or two charts (two charts would display, for example, when both the training and testing partitions are defined for the model measures). The charts display both the working model and preview model information (if a preview model is specified). By default, the charts display as segments. You can switch the charts to display as quantiles by selecting Quantiles and then selecting the appropriate quantile method from the drop-down menu. Note: See Editing Graphs for information on working with graphs.

Chart Options
The Chart Options feature provides options for selecting which models and snapshots are charted, which partitions are plotted, and whether or not segment labels display.

Figure 11-21 Chart Options dialog box

Models to Plot

Current Models. Allows you to select which models to chart. You can select the working model, preview model, or any created snapshot models.

Partitions to Plot

Partitions for left-hand chart. The drop-down list provides options for displaying all defined partitions or all data.

Partitions for right-hand chart. The drop-down list provides options for displaying all defined partitions, all data, or only the left-hand chart. When Graph only left is selected, only the left chart is displayed.

Display Segment Labels. When enabled, each segment label is displayed on the charts.

Chapter 12
Statistical Models

Statistical models use mathematical equations to encode information extracted from the data. Several statistical modeling nodes are available.
Figure 12-1 Simple linear regression equation

Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values. For more information, see Linear Regression Node on p. 364.

Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see Logistic Regression Node on p. 372.

The Factor/PCA node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields. For more information, see Factor Analysis/PCA Node on p. 390.



Statistical models have been around for some time and are relatively well understood mathematically. They represent basic models that assume fairly simple relationships in the data. In some cases, they can give you adequate models very quickly. Even for problems in which more flexible machine-learning techniques (such as neural networks) can ultimately give better results, you can use statistical models as baseline predictive models to judge the performance of advanced techniques.

Linear Regression Node


This node is included with the Base module. The Linear Regression node generates a linear regression model. This model estimates the best-fitting linear equation for predicting the output field, based on the input fields. The regression equation represents a straight line or plane that minimizes the squared differences between predicted and actual output values. This is a very common statistical technique for summarizing data and making predictions.
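To illustrate what "minimizes the squared differences" means in practice, the following minimal Python sketch (it uses numpy, which is an assumption outside Clementine, and made-up values) fits the same kind of least-squares line:

# Illustrative only: ordinary least squares picks the intercept and slope
# that minimize the squared differences between predicted and actual values.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one predictor (In) field
y = np.array([2.1, 3.9, 6.2, 8.1])           # target (Out) field

X1 = np.hstack([np.ones((X.shape[0], 1)), X])        # add the constant term
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)      # least-squares solution
intercept, slope = coeffs
print(f"predicted = {intercept:.2f} + {slope:.2f} * x")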
Requirements. Only numeric fields can be used in a regression model. You must have exactly one target (Out) field and one or more predictors (In fields). Fields with direction Both or None are ignored, as are non-numeric fields. (If necessary, non-numeric fields can be recoded using a Derive node. For more information, see Recoding Values with the Derive Node in Chapter 4 on p. 97.)
Strengths. Regression models are relatively simple and give an easily interpreted mathematical formula for generating predictions. Because regression modeling is a long-established statistical procedure, the properties of these models are well understood. Regression models are also typically very fast to train. The Linear Regression node provides methods for automatic field selection in order to eliminate nonsignificant input fields from the equation. Note: In cases where the target field is categorical rather than a continuous range, such as yes/no or churn/don't churn, logistic regression can be used as an alternative. Logistic regression also provides support for non-numeric inputs, removing the need to recode these fields. For more information, see Logistic Regression Node on p. 372.

Linear Regression Node Model Options


This node is included with the Base module.

Figure 12-2 Linear Regression node Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Method. Specify the method to be used in building the regression model.

Enter. This is the default method, which enters all of the In fields into the equation directly. No field selection is performed in building the model.


Stepwise. The Stepwise method of field selection builds the equation in steps, as the name implies. The initial model is the simplest model possible, with no input fields in the equation. At each step, input fields that have not yet been added to the model are evaluated, and if the best of those input fields adds significantly to the predictive power of the model, it is added. In addition, input fields that are currently in the model are reevaluated to determine if any of them can be removed without significantly detracting from the model. If so, they are removed. Then the process repeats, and other fields are added and/or removed. When no more fields can be added to improve the model, and no more fields can be removed without detracting from the model, the final model is generated.
Backwards. The Backwards method of field selection is similar to the Stepwise method in that the model is built in steps. However, with this method, the initial model contains all of the input fields as predictors, and fields can only be removed from the model. Input fields that contribute little to the model are removed one by one until no more fields can be removed without significantly worsening the model, yielding the final model.
Forwards. The Forwards method is essentially the opposite of the Backwards method. With this method, the initial model is the simplest model with no input fields, and fields can only be added to the model. At each step, input fields not yet in the model are tested based on how much they would improve the model, and the best of those fields is added to the model. When no more fields can be added, or the best candidate field does not produce a large-enough improvement in the model, the final model is generated.

Note: The automatic methods (including Stepwise, Forwards, and Backwards) are highly adaptable learning methods and have a strong tendency to overfit the training data. When using these methods, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data.
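The greedy logic behind the Forwards method can be sketched in a few lines of Python (purely illustrative, not the Clementine implementation; the score function here is a stand-in for whatever model-quality criterion is used):

# Illustrative only: greedy forward selection adds the field that most improves
# the model and stops when no remaining field improves it.
def forward_select(fields, score):
    """score(selected_fields) returns a model quality value; higher is better."""
    selected = []
    best = score(selected)
    while True:
        candidates = [f for f in fields if f not in selected]
        if not candidates:
            break
        top_score, top_field = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best:        # no candidate improves the model; stop
            break
        selected.append(top_field)
        best = top_score
    return selected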
Include constant in equation. This option determines whether the resulting equation will include a constant term. In most situations, you should leave this option selected. Deselecting it can be useful if you have prior knowledge that the output field equals 0 whenever the predictor field or fields equal 0.

Linear Regression Node Expert Options


This node is included with the Base module. If you have detailed knowledge of linear regression models, expert options allow you to fine-tune the model-building process. To access expert options, set Mode to Expert on the Expert tab.
Figure 12-3 Linear Regression Expert tab

Missing values. By default, the Linear Regression node will use only records that have valid values for all fields used in the model. (This is sometimes called listwise deletion of missing values.) If you have a lot of missing data, you may find this approach eliminates too many records, leaving you without enough data to generate a good model. In such cases, you can deselect the Only use complete records option. Clementine will then attempt to use as much information as possible to estimate the regression model, including records where some of the fields have missing values. (This is sometimes called pairwise deletion of missing values.) However, in some situations, using incomplete records in this manner can lead to computational problems in estimating the regression equation. For more information, see Overview of Missing Values in Chapter 6 in Clementine 11.1 User's Guide.
Singularity tolerance. This option allows you to specify the minimum proportion of variance in a field that must be independent of other fields in the model.

Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise, Forwards, or Backwards estimation methods. (The button is disabled if the Enter method is selected.) For more information, see Linear Regression Node Stepping Options on p. 367.

Output. These options allow you to request additional statistics that will appear in the advanced output of the generated model built by the node. For more information, see Linear Regression Node Output Options on p. 367.

Linear Regression Node Stepping Options


This node is included with the Base module.
Figure 12-4 Linear Regression Stepping Criteria

Select one of the two criteria for stepping, and change the cutoff values as desired. Note: There is an inverse relationship between the two criteria. The more important a field is for the model, the smaller the p value but the larger the F value.
Use probability of F. This option allows you to specify selection criteria based on the statistical probability (the p value) associated with each field. Fields will be added to the model only if the associated p value is smaller than the Entry value and will be removed only if the p value is larger than the Removal value. The Entry value must be less than the Removal value.
Use F value. This option allows you to specify selection criteria based on the F statistic associated with each field. The F statistic is a measure of how much each field contributes to the model. Fields will be added to the model only if the associated F value is larger than the Entry value and will be removed only if the F value is smaller than the Removal value. The Entry value must be greater than the Removal value.
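The inverse relationship between the two criteria can be seen numerically. A minimal Python sketch (it uses scipy, which is an assumption outside Clementine; the degrees of freedom are made-up example values) converts F values into the corresponding p values:

# Illustrative only: larger F values correspond to smaller p values.
from scipy import stats

df_model, df_error = 1, 100          # example degrees of freedom
for f_value in (1.0, 4.0, 10.0):
    p_value = stats.f.sf(f_value, df_model, df_error)   # upper-tail probability
    print(f"F = {f_value:5.1f}  ->  p = {p_value:.4f}")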

Linear Regression Node Output Options


This node is included with the Base module. Select the optional output you want to display in the advanced output of the generated linear regression model. To view the advanced output, browse the generated model and click the Advanced tab. For more information, see Linear Regression Model Advanced Output on p. 370.

Figure 12-5 Linear Regression Advanced Output Options

Model fit. Summary of model fit, including R-square. This represents the proportion of variance in the output field that can be explained by the input fields (see the sketch at the end of this list).


R squared change. The change in R-square at each step for Stepwise, Forwards, and Backwards

estimation methods.
Selection criteria. Statistics estimating the information content of the model for each step of the

model (to help evaluate model improvement). Statistics include the Akaike Information Criterion, Amemiya's Prediction Criterion, Mallows' Prediction Criterion, and Schwarz Bayesian Criterion.
Descriptives. Basic descriptive statistics about the input and output fields.

Part and partial correlations. Statistics that help to determine importance and unique contributions of individual input fields to the model.


Collinearity Diagnostics. Statistics that help to identify problems with redundant input fields.

Regression coefficients. Statistics for the regression coefficients.

Confidence interval. The 95% confidence interval for each coefficient in the equation.

Covariance matrix. The covariance matrix of the input fields.

Exclude fields. Statistics on fields that were considered for inclusion in the model but ultimately rejected based on the selection method used (Stepwise, Backwards, etc.).

Residuals. Statistics for the residuals (or the differences between predicted values and actual values).
Durbin-Watson. The Durbin-Watson test of autocorrelation. This test detects effects of record

order that can invalidate the regression model.
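As a concrete illustration of what R-square measures (a minimal Python sketch with made-up values; not Clementine code), it is one minus the ratio of residual variation to total variation in the output field:

# Illustrative only: R-square = 1 - (residual sum of squares / total sum of squares).
import numpy as np

actual    = np.array([2.0, 4.0, 6.0, 8.0])
predicted = np.array([2.2, 3.9, 6.1, 7.8])

ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R-square = {r_squared:.3f}")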

Generated Linear Regression Models


This node is included with the Base module. Generated linear regression models represent the equations estimated by Linear Regression nodes. They contain all of the information captured by the linear regression model, as well as information about the model structure and performance.


When you execute a stream containing a Linear Regression Equation node, the node adds a new field containing the model's prediction for the output field. The name of the new field is derived from the name of the output field being predicted, prefixed with $E-. For example, for an output field named profit, the new field would be named $E-profit.
Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model. This is most useful with models built using one of the field selection methods. For more information, see Linear Regression Node Model Options on p. 364. For general information on using the model browser, see Browsing Generated Models on p. 239.
Evaluating the Model

As with other generated models, you can use an Analysis node to evaluate the model results. For more information, see Analysis Node in Chapter 17 on p. 537. You can also use a Plot node to display predicted values versus actual values, which can help you identify the records that are most difficult for the model to classify accurately and to identify systematic errors in the model. You can also assess the linear regression model by using the information available in the advanced output. To view the advanced output, click the Advanced tab of the generated model browser. The advanced output contains a lot of detailed information and is meant for users with extensive knowledge of linear regression. For more information, see Linear Regression Model Advanced Output on p. 370.

Linear Regression Model Summary


This node is included with the Base module. The Summary tab for a generated linear regression model displays each input field with its coefficient in the regression equation. The complete regression equation is the sum of all entries. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537.

Figure 12-6 Sample Linear Regression Equation node Summary tab

Linear Regression Model Advanced Output


This node is included with the Base module.

Figure 12-7 Sample Linear Regression Equation node Advanced tab

The advanced output for linear regression gives detailed information on the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of linear regression analysis is required to properly interpret this output.
Warnings. Indicates any warnings or potential problems with the results.

Descriptive statistics (optional). Shows the number of valid records (cases), the mean, and the standard deviation for each field in the analysis.


Correlations (optional). Shows the correlation matrix of input and output fields. One-tailed significance and the number of records (cases) for each correlation are also displayed.
Variables entered/removed. Shows fields added to or removed from the model at each step for Stepwise, Forwards, and Backwards regression methods. For the Enter method, only one row is shown entering all fields immediately.

Model summary. Shows various summaries of model fit. If the R squared change option is selected in the Linear Regression node, change in model fit is reported at each step for Stepwise, Forwards, and Backwards methods. If the Selection criteria option is selected in the Linear Regression node, additional model fit statistics are reported at each step, including Akaike Information Criterion, Amemiya's Prediction Criterion, Mallows' Prediction Criterion, and Schwarz Bayesian Criterion.

ANOVA. Shows the analysis of variance (ANOVA) table for the model.

Coefficients. Shows the coefficients of the model and statistical tests of those coefficients. If the Confidence interval option is selected in the Linear Regression node, 95% confidence intervals are also reported in this table. If the Part and partial correlations option is selected, part and partial correlations are also reported in this table. Finally, if the Collinearity diagnostics option is selected, collinearity statistics for input fields are reported in this table.
Coefficient correlations (optional). Shows correlations among coefficient estimates.

Collinearity diagnostics (optional). Shows collinearity diagnostics for identifying situations in which the input fields form a linearly dependent set.


Casewise diagnostics (optional). Shows the records with the largest prediction errors. Residuals statistics (optional). Shows summary statistics describing the distribution of prediction

errors.

Logistic Regression Node


This node is available with the Classification module. Logistic regression, also known as nominal regression, is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric one. Both binomial models (for targets with two discrete categories) and multinomial models (for targets with more than two categories) are supported. Logistic regression works by building a set of equations that relate the input field values to the probabilities associated with each of the output field categories. Once the model is generated, it can be used to estimate probabilities for new data. For each record, a probability of membership is computed for each possible output category. The target category with the highest probability is assigned as the predicted output value for that record.
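The scoring step described above (one probability per category, highest probability wins) can be sketched outside Clementine in a few lines of Python (scikit-learn and the tiny made-up dataset are assumptions used purely for illustration):

# Illustrative only: a categorical target gets one probability per category,
# and the category with the highest probability becomes the prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 1], [40, 0], [35, 1], [50, 0], [23, 1], [60, 0]])
y = np.array(["A", "B", "A", "C", "A", "C"])      # three target categories

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba([[30, 1]])[0]          # one probability per category
predicted = model.classes_[np.argmax(probs)]       # highest probability wins
print(dict(zip(model.classes_, probs.round(3))), "->", predicted)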
Binomial example. A telecommunications provider is concerned about the number of customers it is losing to competitors. Using service usage data, you can create a binomial model to predict which customers are liable to transfer to another provider and customize offers so as to retain as many customers as possible. A binomial model is used because the target has two distinct categories (likely to transfer or not). For more information, see Telecommunications Churn (Binomial Logistic Regression) in Chapter 11 in Clementine 11.1 Applications Guide. Multinomial example. A telecommunications provider has segmented its customer base by service

usage patterns, categorizing the customers into four groups. Using demographic data to predict group membership, you can create a multinomial model to classify prospective customers into groups and then customize offers for individual customers. For more information, see Classifying Telecommunications Customers (Multinomial Logistic Regression) in Chapter 10 in Clementine 11.1 Applications Guide.
Requirements. For a binomial model, you need one or more predictor (In) fields and exactly one categorical target (Out) field (usually a flag or, occasionally, a set) with two or more categories. For a multinomial model, the target must be a set field with three or more categories. Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated. Note: For binomial models only, string fields must be limited to 8 characters. If necessary, longer strings can be recoded using a Reclassify node. For more information, see Reclassify Node in Chapter 4 on p. 105.

Strengths. Logistic regression models are often quite accurate. They can handle symbolic and numeric input fields. They can give predicted probabilities for all target categories so that a second-best guess can easily be identified. Logistic models are most effective when group membership is a truly categorical field; if group membership is based on values of a continuous range field (for example, high IQ versus low IQ), you should consider using linear regression to take advantage of the richer information offered by the full range of values. Logistic models can also perform automatic field selection, although other approaches such as tree models or Feature Selection may do this more quickly on large datasets. Finally, since logistic models are well understood by many analysts and data miners, they may be used by some as a baseline against which other modeling techniques can be compared. When processing large datasets, you can improve performance noticeably by disabling the likelihood-ratio test, an advanced output option. For more information, see Logistic Regression Node Output Options on p. 381.

Logistic Regression Node Model Options


This node is available with the Classification module.
Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Procedure. Specifies whether a binomial or multinomial model is created. The options available in the dialog box vary depending on which type of modeling procedure is selected.
Binomial. Used when the target field is a flag or set with two discrete values (dichotomous), such as yes/no, on/off, male/female.

Multinomial. Used when the target field is a set field with more than two values. You can specify Main effects, Full factorial, or Custom.


Include constant in equation. This option determines whether the resulting equations will include a

constant term. In most situations, you should leave this option selected.


Binomial Models
Figure 12-8 Logistic Regression node, binomial model options

For binomial models, the following methods and options are available:
Method. Specify the method to be used in building the logistic regression model.

Enter. This is the default method, which enters all of the terms into the equation directly. No field selection is performed in building the model.


Forwards. The Forwards method of field selection builds the model by moving forward step by step. With this method, the initial model is the simplest model, and only the constant and terms can be added to the model. At each step, terms not yet in the model are tested based on how much they would improve the model, and the best of those terms is added to the model. When no more terms can be added, or the best candidate term does not produce a large-enough improvement in the model, the final model is generated.
Backwards. The Backwards method is essentially the opposite of the Forwards method. With this method, the initial model contains all of the terms as predictors, and terms can only be removed from the model. Model terms that contribute little to the model are removed one by one until no more terms can be removed without significantly worsening the model, yielding the final model.
Categorical inputs. Lists the fields that are identified as categorical, that is, a flag, set, or ordered set. You can specify the contrast and base category for each categorical field.


Field Name. This column contains the field names of the categorical inputs and is prepopulated with all flag and set values in the data. To add continuous or numerical inputs into this column, click the Add Fields icon to the right of the list and select the required inputs.

Contrast. The interpretation of the regression coefficients for a categorical field depends on the contrasts that are used. The contrast determines how hypothesis tests are set up to compare the estimated means. For example, if you know that a categorical field has implicit order, such as a pattern or grouping, you can use the contrast to model that order. The available contrasts are:
Indicator. Contrasts indicate the presence or absence of category membership. This is the

default method.
Simple. Each category of the predictor field, except the reference category, is compared to the reference category.

Difference. Each category of the predictor field, except the first category, is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.

Helmert. Each category of the predictor field, except the last category, is compared to the average effect of subsequent categories.

Repeated. Each category of the predictor field, except the first category, is compared to the category that precedes it.

Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric fields only.

Deviation. Each category of the predictor field, except the reference category, is compared to the overall effect.

Base Category. Specifies how the reference category is determined for the selected contrast type. Select First to use the first category for the input field (sorted alphabetically) or select Last to use the last category. The default value is First. Note: This field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial. The other categories are related to the base category in a relative fashion, to identify what makes them more likely to be in their own category. This can help you identify the fields and values that are more likely to give a specific response. The base category is shown in the output as 0.0. This is because comparing it to itself produces an empty result. All other categories are shown as equations relevant to the base category. For more information, see Logistic Regression Model Equations on p. 385.


Multinomial Models
Figure 12-9 Logistic Regression node, multinomial model options

For multinomial models, the following methods and options are available:
Method. Specify the method to be used in building the logistic regression model.

Enter. This is the default method, which enters all of the terms into the equation directly. No field selection is performed in building the model.


Stepwise. The Stepwise method of field selection builds the equation in steps, as the name implies. The initial model is the simplest model possible, with no model terms (except the constant) in the equation. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. In addition, terms that are currently in the model are reevaluated to determine if any of them can be removed without significantly detracting from the model. If so, they are removed. The process repeats, and other terms are added and/or removed. When no more terms can be added to improve the model, and no more terms can be removed without detracting from the model, the final model is generated.
Forwards. The Forwards method of field selection is similar to the Stepwise method in that the model is built in steps. However, with this method, the initial model is the simplest model, and only the constant and terms can be added to the model. At each step, terms not yet in the model are tested based on how much they would improve the model, and the best of those terms is added to the model. When no more terms can be added, or the best candidate term does not produce a large-enough improvement in the model, the final model is generated.


Backwards. The Backwards method is essentially the opposite of the Forwards method. With this method, the initial model contains all of the terms as predictors, and terms can only be removed from the model. Model terms that contribute little to the model are removed one by one until no more terms can be removed without significantly worsening the model, yielding the final model.
Backwards Stepwise. The Backwards Stepwise method is essentially the opposite of the Stepwise method. With this method, the initial model contains all of the terms as predictors. At each step, terms in the model are evaluated, and any terms that can be removed without significantly detracting from the model are removed. In addition, previously removed terms are reevaluated to determine if the best of those terms adds significantly to the predictive power of the model. If so, it is added back into the model. When no more terms can be removed without significantly detracting from the model, and no more terms can be added to improve the model, the final model is generated.

Note: The automatic methods, including Stepwise, Forwards, and Backwards, are highly adaptable learning methods and have a strong tendency to overfit the training data. When using these methods, it is especially important to verify the validity of the resulting model either with new data or a hold-out test sample created using the Partition node. For more information, see Partition Node in Chapter 4 on p. 119.
Base category for target. Specifies how the reference category is determined. This is used as the baseline against which the regression equations for all other categories in the target are estimated. Select First to use the first category for the current target field (sorted alphabetically) or select Last to use the last category. Alternatively, you can select Specify to choose a specific category, and select the desired value from the list. Available values can be defined for each field in a Type node. For more information, see Using the Values Dialog Box in Chapter 4 on p. 75.

Often you would specify the category in which you are least interested to be the base category, for example, a loss-leader product. The other categories are then related to this base category in a relative fashion to identify what makes them more likely to be in their own category. This can help you identify the fields and values that are more likely to give a specific response. The base category is shown in the output as 0.0. This is because comparing it to itself produces an empty result. All other categories are shown as equations relevant to the base category. For more information, see Logistic Regression Model Equations on p. 385.
Model type. There are three options for defining the terms in the model. Main Effects models include only the input fields individually and do not test interactions (multiplicative effects) between input fields. Full Factorial models include all interactions as well as the input field main effects. Full factorial models are better able to capture complex relationships but are also much more difficult to interpret and are more likely to suffer from overfitting. Because of the potentially large number of possible combinations, automatic field selection methods (methods other than Enter) are disabled for full factorial models. Custom models include only the terms (main effects and interactions) that you specify. When selecting this option, use the Model Terms list to add or remove terms in the model.

Model Terms. When building a Custom model, you will need to explicitly specify the terms in the model. The list shows the current set of terms for the model. The buttons on the right side of the Model Terms list allow you to add and remove model terms.

E To add terms to the model, click the Add new model terms button.
E To delete terms, select the desired terms and click the Delete selected model terms button.

Adding Terms to a Logistic Regression Model


This node is available with the Classification module. When requesting a custom logistic regression model, you can add terms to the model by clicking the Add new model terms button on the Logistic Regression Model tab. A new dialog box opens in which you can specify terms.
Figure 12-10 Logistic Regression New Terms dialog box

Type of term to add. There are several ways to add terms to the model, based on the selection of input fields in the Available fields list.
Single interaction. Inserts the term representing the interaction of all selected fields.
Main effects. Inserts one main effect term (the field itself) for each selected input field.
All 2-way interactions. Inserts a 2-way interaction term (the product of the input fields) for each possible pair of selected input fields. For example, if you have selected input fields A, B, and C in the Available fields list, this method will insert the terms A * B, A * C, and B * C.


All 3-way interactions. Inserts a 3-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken three at a time. For example, if you have selected input fields A, B, C, and D in the Available fields list, this method will insert the terms A * B * C, A * B * D, A * C * D, and B * C * D.
All 4-way interactions. Inserts a 4-way interaction term (the product of the input fields) for each possible combination of selected input fields, taken four at a time. For example, if you have selected input fields A, B, C, D, and E in the Available fields list, this method will insert the terms A * B * C * D, A * B * C * E, A * B * D * E, A * C * D * E, and B * C * D * E.
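The k-way interaction terms are simply all combinations of the selected fields taken k at a time. The following Python sketch is illustrative only (Clementine builds these terms in the dialog box itself); the field names are hypothetical.

from itertools import combinations

def interaction_terms(fields, k):
    """Return all k-way interaction terms (products of k distinct fields)."""
    return [" * ".join(combo) for combo in combinations(fields, k)]

fields = ["A", "B", "C", "D"]            # hypothetical input fields
print(interaction_terms(fields, 2))      # ['A * B', 'A * C', 'A * D', 'B * C', 'B * D', 'C * D']
print(interaction_terms(fields, 3))      # ['A * B * C', 'A * B * D', 'A * C * D', 'B * C * D']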
Available fields. Lists the available input fields to be used in constructing model terms.
Preview. Shows the terms that will be added to the model if you click Insert, based on the selected fields and the term type selected above.
Insert. Inserts terms in the model (based on the current selection of fields and term type) and closes the dialog box.

Logistic Regression Node Expert Options


This node is available with the Classification module. If you have detailed knowledge of logistic regression, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.
Figure 12-11 Logistic Regression Expert tab

Scale. (Multinomial models only) You can specify a dispersion scaling value that will be used to correct the estimate of the parameter covariance matrix. Pearson estimates the scaling value by using the Pearson chi-square statistic. Deviance estimates the scaling value by using the deviance function (likelihood-ratio chi-square) statistic. You can also specify your own user-defined scaling value. It must be a positive numeric value.
Append all probabilities. If this option is selected, probabilities for each category of the output field will be added to each record processed by the node. If this option is not selected, only the probability of the predicted category is added.

For example, a table containing the results of a multinomial model with three categories will include five new columns: one column listing the predicted category, one column listing the probability of that predicted category, and three further columns listing the probability of membership in each of the categories. For more information, see Generated Logistic Regression Models on p. 384. Note: This option is always selected for binomial models.
Singularity tolerance. Specify the tolerance used in checking for singularities.
Convergence. These options allow you to control the parameters for model convergence. When you execute the model, the convergence settings control how many times the different parameters are repeatedly run through to see how well they fit. The more often the parameters are tried, the closer the results will be (that is, the results will converge). For more information, see Logistic Regression Node Convergence Options on p. 380.
Output. These options allow you to request additional statistics that will appear in the advanced output of the generated model built by the node. For more information, see Logistic Regression Node Output Options on p. 381.
Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise, Forwards, Backwards, or Backwards Stepwise estimation methods. (The button is disabled if the Enter method is selected.) For more information, see Logistic Regression Node Stepping Options on p. 382.

Logistic Regression Node Convergence Options


This node is available with the Classification module. You can set the convergence parameters for logistic regression model estimation.
Figure 12-12 Logistic Regression Convergence options

Maximum iterations. Specify the maximum number of iterations for estimating the model.


Maximum step-halving. Step-halving is a technique used by logistic regression to deal with complexities in the estimation process. Under normal circumstances, you should use the default setting.
Log-likelihood convergence. Iterations stop if the relative change in the log-likelihood is less than this value. The criterion is not used if the value is 0.
Parameter convergence. Iterations stop if the absolute change or relative change in the parameter estimates is less than this value. The criterion is not used if the value is 0.
Delta. (Multinomial models only) You can specify a value between 0 and 1 to be added to each empty cell (combination of input field and output field values). This can help the estimation algorithm deal with data where there are many possible combinations of field values relative to the number of records in the data. The default is 0.

Logistic Regression Node Output Options


This node is available with the Classification module. Select the optional output you want to display in the advanced output of the generated logistic regression model. To view the advanced output, browse the generated model and click the Advanced tab. For more information, see Logistic Regression Model Advanced Output on p. 388.
Binomial Options
Figure 12-13 Logistic Regression, Binomial output options

Select the types of output to be generated for the model. For more information, see Logistic Regression Model Advanced Output on p. 388.
Display. Select whether to display the results at each step, or to wait until all steps have been worked through.
CI for exp(B). Select the confidence intervals for each coefficient (shown as Beta) in the expression. Specify the level of the confidence interval (the default is 95%).


Residual Diagnosis. Requests a Casewise Diagnostics table of residuals.


Outliers outside (std. dev.). List only residual cases for which the absolute standardized value of the listed variable is at least as large as the value you specify. The default value is 2.
All cases. Include all cases in the Casewise Diagnostic table of residuals. Note: Because this option lists each of the input records, it may result in an exceptionally large table in the report, with one line for every record.
Classification cutoff. This allows you to determine the cutpoint for classifying cases. Cases with predicted values that exceed the classification cutoff are classified as positive, while those with predicted values smaller than the cutoff are classified as negative. To change the default, enter a value between 0.01 and 0.99.

Multinomial Options
Figure 12-14 Logistic Regression, Multinomial output options

Select the types of output to be generated for the model. For more information, see Logistic Regression Model Advanced Output on p. 388. Note: Selecting the Likelihood ratio tests option greatly increases the processing time required to build a logistic regression model. If your model is taking too long to build, consider disabling this option or using the Wald and Score statistics instead. For more information, see Logistic Regression Node Stepping Options on p. 382.
Iteration history for every. Select the step interval for printing iteration status in the advanced output.
Confidence Interval. The confidence intervals for coefficients in the equations. Specify the level of the confidence interval (the default is 95%).

Logistic Regression Node Stepping Options


This node is available with the Classification module.

Figure 12-15 Logistic Regression Stepping Criteria

Number of terms in model. (Multinomial models only) You can specify the minimum number of terms in the model for Backwards and Backwards Stepwise models and the maximum number of terms for Forwards and Stepwise models. If you specify a minimum value greater than 0, the model will include that many terms, even if some of the terms would have been removed based on statistical criteria. The minimum setting is ignored for Forwards, Stepwise, and Enter models. If you specify a maximum, some terms may be omitted from the model, even though they would have been selected based on statistical criteria. The Specify Maximum setting is ignored for Backwards, Backwards Stepwise, and Enter models.
Entry criterion. (Multinomial models only) Select Score to maximize speed of processing. The Likelihood Ratio option may provide somewhat more robust estimates but takes longer to compute. The default setting is to use the Score statistic.

Removal criterion. Select Likelihood Ratio for a more robust model. To shorten the time required to build the model, you can try selecting Wald. However, if you have complete or quasi-complete separation in the data (which you can determine by using the Advanced tab on the generated model), the Wald statistic becomes particularly unreliable and should not be used. The default setting is to use the likelihood-ratio statistic. For binomial models, there is the additional option Conditional. This provides removal testing based on the probability of the likelihood-ratio statistic based on conditional parameter estimates.
Significance thresholds for criteria. This option allows you to specify selection criteria based on the statistical probability (the p value) associated with each field. Fields will be added to the model only if the associated p value is smaller than the Entry value and will be removed only if the p value is larger than the Removal value. The Entry value must be smaller than the Removal value.


Requirements for entry or removal. (Multinomial models only) For some applications, it doesn't make mathematical sense to add interaction terms to the model unless the model also contains the lower-order terms for the fields involved in the interaction term. For example, it may not make sense to include A * B in the model unless A and B also appear in the model. These options let you determine how such dependencies are handled during stepwise term selection.
Hierarchy for discrete effects. Higher-order effects (interactions involving more fields) will enter the model only if all lower-order effects (main effects or interactions involving fewer fields) for the relevant fields are already in the model, and lower-order effects will not be removed if higher-order effects involving the same fields are in the model. This option applies only to discrete fields. For more information, see Data Types in Chapter 4 on p. 71.
Hierarchy for all effects. This option works as described above, except it applies to all input fields.
Containment for all effects. Effects can appear in the model only if all of the effects contained in the effect also appear in the model. This option is similar to the Hierarchy for all effects option except that range fields are treated somewhat differently. For an effect to contain another effect, the contained (lower-order) effect must include all of the range fields involved in the containing (higher-order) effect, and the contained effect's discrete fields must be a subset of those in the containing effect. For example, if A and B are discrete fields and X is a range field, the term A * B * X contains the terms A * X and B * X.
None. No relationships are enforced; terms are added to and removed from the model independently.

Generated Logistic Regression Models


This node is available with the Classication module. Logistic regression models represent the equations estimated by Logistic Regression nodes. They contain all of the information captured by the logistic regression model, as well as information about the model structure and performance. This type of equation may also be generated by other models such as Oracle SVM. When you execute a stream containing a logistic regression model, the node adds two new elds containing the models prediction and the associated probability. The names of the new elds are derived from the name of the output eld being predicted, prexed with $L- for the predicted category and $LP- for the associated probability. For example, for an output eld named colorpref, the new elds would be named $L-colorpref and $LP-colorpref. In addition, if you have selected the Append all probabilities option in the Logistic Regression node, an additional eld will be added for each category of the output eld, containing the probability belonging to the corresponding category for each record. These additional elds are named based on the values of the output eld, prexed by $LP-. For example, if the legal values of colorpref are Red, Green, and Blue, three new elds will be added: $LP-Red, $LP-Green, and $LP-Blue.
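As an illustration of the naming convention described above, the following Python sketch derives the new field names for a hypothetical output field. It is only a paraphrase of the convention, not Clementine code.

def logistic_output_fields(target, categories, append_all_probabilities=False):
    """Derive the field names a logistic regression model adds when scoring."""
    fields = ["$L-" + target, "$LP-" + target]              # prediction and its probability
    if append_all_probabilities:
        fields += ["$LP-" + value for value in categories]  # one probability per category
    return fields

print(logistic_output_fields("colorpref", ["Red", "Green", "Blue"], True))
# ['$L-colorpref', '$LP-colorpref', '$LP-Red', '$LP-Green', '$LP-Blue']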
Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model. Fields that are dropped from the model due to multicollinearity will be filtered by the generated node, as well as fields not used in the model.


Logistic Regression Model Equations


This node is available with the Classification module. Note: The Model tab is only available for Multinomial Logistic Regression models. The tab displays the actual equations estimated by a Logistic Regression node (one equation for each category in the target field, except the baseline category). The equations are displayed in a tree format. This type of equation may also be generated by certain other models such as Oracle SVM. For more information, see Browsing Generated Models in Chapter 6 on p. 239.
Figure 12-16 Sample Logistic Regression Equation node Model tab

Equation For. Shows the regression equations used to derive the target category probabilities, given a set of predictor values. The last category of the target field is considered the baseline category; the equations shown give the log-odds for the other target categories relative to the baseline category for a particular set of predictor values. The predicted probability for each category of the given predictor pattern is derived from these log-odds values.


How Are Probabilities Calculated?

Each equation calculates the log-odds for a particular target category, relative to the baseline category. The log-odds, also called the logit, is the ratio of the probability for the specified target category to that of the baseline category, with the natural logarithm function applied to the result. For the baseline category, the odds of the category relative to itself is 1.0, and thus the log-odds is 0. You can think of this as an implicit equation for the baseline category where all coefficients are 0. To derive the probability from the log-odds for a particular target category, you take the logit value calculated by the equation for that category and apply the following formula:

P(group_i) = exp(g_i) / Σ_k exp(g_k)

where g is the calculated log-odds, i is the category index, and k goes from 1 to the number of target categories.
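A minimal sketch of this calculation in Python, assuming the baseline category's log-odds is fixed at 0 as described above (the category names and log-odds values are hypothetical):

import math

def category_probabilities(log_odds):
    """Convert per-category log-odds (relative to the baseline) into probabilities."""
    exps = {cat: math.exp(g) for cat, g in log_odds.items()}
    total = sum(exps.values())
    return {cat: v / total for cat, v in exps.items()}

# Two non-baseline equations plus the implicit baseline equation (log-odds = 0).
log_odds = {"Red": 1.2, "Green": -0.4, "Blue": 0.0}   # Blue is the baseline category
print(category_probabilities(log_odds))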

Logistic Regression Model Summary


This node is available with the Classification module. The Summary tab for a generated logistic regression model displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537. For general information on using the model browser, see Browsing Generated Models on p. 239.

Figure 12-17 Sample Logistic Regression Equation node Summary tab

Logistic Regression Model Settings


This node is available with the Classification module. The Settings tab for a logistic regression model specifies options for confidences and SQL generation during model scoring. This tab is only available after the generated model has been added to a stream.

Figure 12-18 Settings tab for a Logistic Regression model

Scoring options. These options are unavailable if you choose to generate SQL for the model.
Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, only the probability of the predicted category is added. For example, a table containing the results of a binomial model will include one column listing the probability of the outcome being correctly predicted, another column showing the probability that this prediction is a hit, and a further column showing the probability that this prediction is a miss.
Calculate confidences. Specifies whether confidences are calculated during scoring.

Note: The scoring options are always selected for binomial models.
Generate SQL for this model. There are two ways you can use SQL with Clementine:
Export the SQL as a text file for modification and use in another, unconnected, database. For more information, see Browsing Generated Models in Chapter 6 on p. 239.
Enable SQL generation for the model in order to take advantage of database performance. This setting only applies when using data from a database. For more information, see SQL Optimization in Chapter 6 in Clementine 11.1 Server Administration and Performance Guide. Also note that actual performance results may vary, depending on the complexity of the model and the capabilities of each DBMS, particularly when large numbers of categorical predictors with many discrete values are used. In general, the more complex the model, the greater the likelihood that databases will struggle with the resulting generated SQL. Note: Generate SQL is unavailable when scoring options are selected.

Logistic Regression Model Advanced Output


This node is available with the Classification module.

Figure 12-19 Sample Logistic Regression Equation node Advanced tab

The advanced output for logistic regression (also known as nominal regression) gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of logistic regression analysis is required to properly interpret this output.
Warnings. Indicates any warnings or potential problems with the results.
Case processing summary. Lists the number of records processed, broken down by each symbolic field in the model.
Step summary (optional). Lists the effects added or removed at each step of model creation, when using automatic field selection. Note: Only shown for the Stepwise, Forwards, Backwards, or Backwards Stepwise methods.
Iteration history (optional). Shows the iteration history of parameter estimates for every n iterations beginning with the initial estimates, where n is the value of the print interval. The default is to print every iteration (n=1).
Model fitting information (Multinomial models). Shows the likelihood-ratio test of your model (Final) against one in which all of the parameter coefficients are 0 (Intercept Only).


Classification (optional). Shows the matrix of predicted and actual output field values with percentages.
Goodness-of-fit chi-square statistics (optional). Shows Pearson's and likelihood-ratio chi-square statistics. These statistics test the overall fit of the model to the training data.
Hosmer and Lemeshow goodness-of-fit (optional). Shows the results of grouping cases into deciles of risk and comparing the observed probability with the expected probability within each decile. This goodness-of-fit statistic is more robust than the traditional goodness-of-fit statistic used in multinomial models, particularly for models with continuous covariates and studies with small sample sizes.
Pseudo R-square (optional). Shows the Cox and Snell, Nagelkerke, and McFadden R-square measures of model fit. These statistics are in some ways analogous to the R-square statistic in linear regression.
Monotonicity measures (optional). Shows the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. The Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and Concordance Index C are also displayed in this table.
Information criteria (optional). Shows Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).


Likelihood ratio tests (optional). Shows statistics testing whether the coefficients of the model effects are statistically different from 0. Significant input fields are those with very small significance levels in the output (labeled Sig.).
Parameter estimates (optional). Shows estimates of the equation coefficients, tests of those coefficients, odds ratios derived from the coefficients (labeled Exp(B)), and confidence intervals for the odds ratios.
Asymptotic covariance/correlation matrix (optional). Shows the asymptotic covariances and/or correlations of the coefficient estimates.
Observed and predicted frequencies (optional). For each covariate pattern, shows the observed and predicted frequencies for each output field value. This table can be quite large, especially for models with numeric input fields. If the resulting table would be too large to be practical, it is omitted, and a warning appears.

Factor Analysis/PCA Node


This node is included with the Base module. The Factor/PCA node provides powerful data-reduction techniques to reduce the complexity of your data. Two similar but distinct approaches are provided.


Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. PCA focuses on all variance, including both shared and unique variance. Factor analysis attempts to identify underlying concepts, or factors, that explain the pattern of correlations within a set of observed fields. Factor analysis focuses on shared variance only. Variance that is unique to specific fields is not considered in estimating the model. Several methods of factor analysis are provided by the Factor/PCA node. For both approaches, the goal is to find a small number of derived fields that effectively summarize the information in the original set of fields.
Requirements. Only numeric fields can be used in a factor/PCA model. To estimate a factor analysis or PCA, you need one or more In fields. Fields with direction Out, Both, or None are ignored, as are non-numeric fields.
Strengths. Factor analysis and PCA can effectively reduce the complexity of your data without sacrificing much of the information content. These techniques can help you build more robust models that execute more quickly than would be possible with the raw input fields.
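To make the data-reduction idea concrete, the following sketch (using scikit-learn, which is not part of Clementine) reduces a set of correlated numeric fields to two principal components and reports how much of the total variance they capture; the data are synthetic.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                        # one underlying concept
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1))   # several highly correlated fields
               for _ in range(5)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)    # most variance is captured by the first component
scores = pca.transform(X)               # the derived fields that would summarize the originals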

Factor Analysis/PCA Node Model Options


This node is included with the Base module.
Figure 12-20 Factor/PCA node Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Extraction Method. Specify the method to be used for data reduction.


Principal Components. This is the default method, which uses PCA to find components that summarize the input fields.
Unweighted Least Squares. This factor analysis method works by finding the set of factors that is best able to reproduce the pattern of relationships (correlations) among the input fields.
Generalized Least Squares. This factor analysis method is similar to unweighted least squares, except that it uses weighting to de-emphasize fields with a lot of unique (unshared) variance.
Maximum Likelihood. This factor analysis method produces factor equations that are most likely to have produced the observed pattern of relationships (correlations) in the input fields, based on assumptions about the form of those relationships. Specifically, the method assumes that the training data follow a multivariate normal distribution.
Principal Axis Factoring. This factor analysis method is very similar to the principal components method, except that it focuses on shared variance only.
Alpha Factoring. This factor analysis method considers the fields in the analysis to be a sample from the universe of potential input fields. It maximizes the statistical reliability of the factors.
Image Factoring. This factor analysis method uses data estimation to isolate the common variance and find factors that describe it.

Factor Analysis/PCA Node Expert Options


This node is included with the Base module. If you have detailed knowledge of factor analysis and PCA, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.
Figure 12-21 Factor/PCA Expert tab


Missing values. By default, Clementine will use only records that have valid values for all fields used in the model. (This is sometimes called listwise deletion of missing values.) If you have a lot of missing data, you may find that this approach eliminates too many records, leaving you without enough data to generate a good model. In such cases, you can deselect the Only use complete records option. Clementine will then attempt to use as much information as possible to estimate the model, including records where some of the fields have missing values. (This is sometimes called pairwise deletion of missing values.) However, in some situations, using incomplete records in this manner can lead to computational problems in estimating the model.
Fields. Specify whether to use the correlation matrix (the default) or the covariance matrix of the input fields in estimating the model.


Maximum iterations for convergence. Specify the maximum number of iterations for estimating the model.
Extract factors. There are two ways to select the number of factors to extract from the input fields.
Eigenvalues over. This option will retain all factors or components with eigenvalues larger than the specified criterion. Eigenvalues measure the ability of each factor or component to summarize variance in the set of input fields. The model will retain all factors or components with eigenvalues greater than the specified value when using the correlation matrix. When using the covariance matrix, the criterion is the specified value times the mean eigenvalue. That scaling gives this option a similar meaning for both types of matrix.
Maximum number. This option will retain the specified number of factors or components in descending order of eigenvalues. In other words, the factors or components corresponding to the n highest eigenvalues are retained, where n is the specified criterion. The default extraction criterion is five factors/components.
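When the correlation matrix is used, the Eigenvalues over criterion amounts to keeping components whose eigenvalues exceed the stated value (commonly 1.0, the average eigenvalue of a correlation matrix). A minimal numpy sketch with synthetic data, for illustration only:

import numpy as np

X = np.random.default_rng(1).normal(size=(500, 6))    # synthetic numeric input fields
corr = np.corrcoef(X, rowvar=False)                   # correlation matrix of the fields
eigenvalues = np.linalg.eigvalsh(corr)[::-1]          # eigenvalues, sorted descending

criterion = 1.0
n_retained = int((eigenvalues > criterion).sum())
print(eigenvalues, "->", n_retained, "components retained")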
Component/factor matrix format. These options control the format of the factor matrix (or component matrix for PCA models).
Sort values. If this option is selected, factor loadings in the model output will be sorted numerically.
Hide values below. If this option is selected, scores below the specified threshold will be hidden in the matrix to make it easier to see the pattern in the matrix.
Rotation. These options allow you to control the rotation method for the model. For more information, see Factor/PCA Node Rotation Options on p. 393.

Factor/PCA Node Rotation Options


This node is included with the Base module.

Figure 12-22 Factor/PCA Rotation options

In many cases, mathematically rotating the set of retained factors can increase their usefulness and especially their interpretability. Select a rotation method:
No rotation. The default option. No rotation is used.
Varimax. An orthogonal rotation method that minimizes the number of fields with high loadings on each factor. It simplifies the interpretation of the factors.
Direct oblimin. A method for oblique (non-orthogonal) rotation. When Delta equals 0 (the default), solutions are oblique. As delta becomes more negative, the factors become less oblique. To override the default delta of 0, enter a number less than or equal to 0.8.
Quartimax. An orthogonal method that minimizes the number of factors needed to explain each field. It simplifies the interpretation of the observed fields.
Equamax. A rotation method that is a combination of the Varimax method, which simplifies the factors, and the Quartimax method, which simplifies the fields. The number of fields that load highly on a factor and the number of factors needed to explain a field are minimized.
Promax. An oblique rotation, which allows factors to be correlated. It can be calculated more quickly than a direct oblimin rotation, so it can be useful for large datasets. Kappa controls the obliqueness of the solution (the extent to which factors can be correlated).

Generated Factor Models


This node is included with the Base module. Factor models represent the factor analysis and principal component analysis (PCA) models created by Factor/PCA nodes. They contain all of the information captured by the trained model, as well as information about the model's performance and characteristics. When you execute a stream containing a factor equation model, the node adds a new field for each factor or component in the model. The new field names are derived from the model name, prefixed by $F- and suffixed by -n, where n is the number of the factor or component. For example, if your model is named Factor and contains three factors, the new fields would be named $F-Factor-1, $F-Factor-2, and $F-Factor-3. To get a better sense of what the factor model has encoded, you can do some more downstream analysis. A useful way to view the result of the factor model is to view the correlations between factors and input fields using a Statistics node. This shows you which input fields load heavily on which factors and can help you discover if your factors have any underlying meaning or interpretation. For more information, see Statistics Node in Chapter 17 on p. 554.


You can also assess the factor model by using the information available in the advanced output. To view the advanced output, click the Advanced tab of the generated model browser. The advanced output contains a lot of detailed information and is meant for users with extensive knowledge of factor analysis or PCA. For more information, see Factor Model Advanced Output on p. 396.

Factor Model Equations


This node is included with the Base module. The Model tab for a generated factor equation displays the factor score equation for each factor. Factor or component scores are calculated by multiplying each input field value by its coefficient and summing the results.
Figure 12-23 Sample Factor Equation node Model tab
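In other words, each factor score is a weighted sum (dot product) of the input field values. A minimal Python sketch with hypothetical coefficients:

import numpy as np

coefficients = np.array([[0.40, 0.35, 0.05],      # factor 1 coefficients for fields x1..x3
                         [0.02, 0.10, 0.55]])     # factor 2 coefficients
record = np.array([1.2, -0.7, 0.3])               # one record's (standardized) field values

factor_scores = coefficients @ record             # multiply each value by its coefficient and sum
print(factor_scores)                              # e.g. the values written to $F-Factor-1 and $F-Factor-2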

Factor Model Summary


This node is included with the Base module. The Summary tab for a factor model displays the number of factors retained in the factor/PCA model, along with additional information on the fields and settings used to generate the model. For more information, see Browsing Generated Models in Chapter 6 on p. 239.

Figure 12-24 Sample Factor Equation node Summary tab

Factor Model Advanced Output


This node is included with the Base module.

Figure 12-25 Sample Factor Equation node Advanced tab

The advanced output for factor analysis gives detailed information on the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of factor analysis is required to properly interpret this output.
Warnings. Indicates any warnings or potential problems with the results.
Communalities. Shows the proportion of each field's variance that is accounted for by the factors or components. Initial gives the initial communalities with the full set of factors (the model starts with as many factors as input fields), and Extraction gives the communalities based on the retained set of factors.
Total variance explained. Shows the total variance explained by the factors in the model. Initial Eigenvalues shows the variance explained by the full set of initial factors. Extraction Sums of Squared Loadings shows the variance explained by factors retained in the model. Rotation Sums of Squared Loadings shows the variance explained by the rotated factors. Note that for oblique rotations, Rotation Sums of Squared Loadings shows only the sums of squared loadings and does not show variance percentages.
Factor (or component) matrix. Shows correlations between input fields and unrotated factors.
Rotated factor (or component) matrix. Shows correlations between input fields and rotated factors for orthogonal rotations.
Pattern matrix. Shows the partial correlations between input fields and rotated factors for oblique rotations.


Structure matrix. Shows the simple correlations between input fields and rotated factors for oblique rotations.
Factor correlation matrix. Shows correlations among factors for oblique rotations.

Discriminant Node
This node is available with the Classification module. Discriminant analysis builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.
Example. A telecommunications company can use discriminant analysis to classify customers into groups based on usage data. This allows them to score potential customers and target those who are most likely to be in the most valuable groups. For more information, see Classifying Telecommunications Customers (Discriminant Analysis) in Chapter 18 in Clementine 11.1 Applications Guide.
Requirements. You need one or more predictor (In) fields and exactly one target (Out) field. The target must be a categorical field (Flag or Set) with string or integer storage. (Storage can be converted using a Filler or Derive node if necessary. For more information, see Storage Conversion Using the Filler Node in Chapter 4 on p. 100.) Fields set to Both or None are ignored. Fields used in the model must have their types fully instantiated.
Strengths. Discriminant analysis and Logistic Regression are both suitable classification models. However, Discriminant analysis makes more assumptions about the input fields, for example, that they are normally distributed and should be scale, and it can give better results if those requirements are met, especially if the sample size is small.

Discriminant Node Model Options


This node is available with the Classification module.

Figure 12-26 Discriminant node Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Method. The following options are available for entering predictors into the model:
Enter. This is the default method, which enters all of the terms into the equation directly. Terms that do not add significantly to the predictive power of the model are not added.
Stepwise. The initial model is the simplest model possible, with no model terms (except the constant) in the equation. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. Note: The Stepwise method has a strong tendency to overfit the training data. When using these methods, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data.

Discriminant Node Expert Options


This node is available with the Classification module. If you have detailed knowledge of discriminant analysis, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 12-27 Discriminant node Expert tab

Prior Probabilities. This option determines whether the classification coefficients are adjusted for a priori knowledge of group membership.
All groups equal. Equal prior probabilities are assumed for all groups; this has no effect on the coefficients.
Compute from group sizes. The observed group sizes in your sample determine the prior probabilities of group membership. For example, if 50% of the observations included in the analysis fall into the first group, 25% in the second, and 25% in the third, the classification coefficients are adjusted to increase the likelihood of membership in the first group relative to the other two.
Use Covariance Matrix. You can choose to classify cases using a within-groups covariance matrix or a separate-groups covariance matrix.
Within-groups. The pooled within-groups covariance matrix is used to classify cases.
Separate-groups. Separate-groups covariance matrices are used for classification. Because classification is based on the discriminant functions (not based on the original variables), this option is not always equivalent to quadratic discrimination.
Output. These options allow you to request additional statistics that will appear in the advanced output of the generated model built by the node. For more information, see Discriminant Node Output Options on p. 400.
Stepping. These options allow you to control the criteria for adding and removing fields with the Stepwise estimation method. (The button is disabled if the Enter method is selected.) For more information, see Discriminant Node Stepping Options on p. 402.

Discriminant Node Output Options


This node is available with the Classification module.

Figure 12-28 Discriminant node Advanced Output options

Select the optional output you want to display in the advanced output of the generated discriminant model. To view the advanced output, browse the generated model and click the Advanced tab. For more information, see Discriminant Model Advanced Output on p. 404.
Descriptives. Available options are means (including standard deviations), univariate ANOVAs, and Box's M test.
Means. Displays total and group means, as well as standard deviations for the independent variables.
Univariate ANOVAs. Performs a one-way analysis-of-variance test for equality of group means for each independent variable.
Box's M. A test for the equality of the group covariance matrices. For sufficiently large samples, a nonsignificant p value means there is insufficient evidence that the matrices differ. The test is sensitive to departures from multivariate normality.
Function Coefficients. Available options are Fisher's classification coefficients and unstandardized coefficients.
Fisher's. Displays Fisher's classification function coefficients that can be used directly for classification. A set of coefficients is obtained for each group, and a case is assigned to the group for which it has the largest discriminant score.
Unstandardized. Displays the unstandardized discriminant function coefficients.
Matrices. Available matrices of coefficients for independent variables are within-groups correlation matrix, within-groups covariance matrix, separate-groups covariance matrix, and total covariance matrix.


Within-groups correlation. Displays a pooled within-groups correlation matrix that is obtained by averaging the separate covariance matrices for all groups before computing the correlations.
Within-groups covariance. Displays a pooled within-groups covariance matrix, which may differ from the total covariance matrix. The matrix is obtained by averaging the separate covariance matrices for all groups.
Separate-groups covariance. Displays separate covariance matrices for each group.
Total covariance. Displays a covariance matrix from all cases as if they were from a single sample.
Classification. The following output pertains to the classification results.
Casewise results. Codes for actual group, predicted group, posterior probabilities, and discriminant scores are displayed for each case.
Summary table. The number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. Sometimes called the "Confusion Matrix."
Leave-one-out classification. Each case in the analysis is classified by the functions derived from all cases other than that case. It is also known as the "U-method."
Territorial map. A plot of the boundaries used to classify cases into groups based on function values. The numbers correspond to groups into which cases are classified. The mean for each group is indicated by an asterisk within its boundaries. The map is not displayed if there is only one discriminant function.
Stepwise. Summary of steps displays statistics for all variables after each step; F for pairwise distances displays a matrix of pairwise F ratios for each pair of groups. The F ratios can be used for significance tests of the Mahalanobis distances between groups.

Discriminant Node Stepping Options


This node is available with the Classification module.
Figure 12-29 Discriminant node stepping options


Method. Select the statistic to be used for entering or removing new variables. Available alternatives are Wilks' lambda, unexplained variance, Mahalanobis distance, smallest F ratio, and Rao's V. With Rao's V, you can specify the minimum increase in V for a variable to enter.
Wilks' lambda. A variable selection method for stepwise discriminant analysis that chooses variables for entry into the equation on the basis of how much they lower Wilks' lambda. At each step, the variable that minimizes the overall Wilks' lambda is entered.
Unexplained variance. At each step, the variable that minimizes the sum of the unexplained variation between groups is entered.
Mahalanobis distance. A measure of how much a case's values on the independent variables differ from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables (see the sketch following this list).
Smallest F ratio. A method of variable selection in stepwise analysis based on maximizing an F ratio computed from the Mahalanobis distance between groups.
Rao's V. A measure of the differences between group means. Also called the Lawley-Hotelling trace. At each step, the variable that maximizes the increase in Rao's V is entered. After selecting this option, enter the minimum value a variable must have to enter the analysis.
Criteria. Available alternatives are Use F value and Use probability of F. Enter values for entering and removing variables.
Use F value. A variable is entered into the model if its F value is greater than the Entry value and is removed if the F value is less than the Removal value. Entry must be greater than Removal, and both values must be positive. To enter more variables into the model, lower the Entry value. To remove more variables from the model, increase the Removal value.
Use probability of F. A variable is entered into the model if the significance level of its F value is less than the Entry value and is removed if the significance level is greater than the Removal value. Entry must be less than Removal, and both values must be positive. To enter more variables into the model, increase the Entry value. To remove more variables from the model, lower the Removal value.
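As a side note to the Mahalanobis distance criterion above, one common formulation of the squared distance of a case from the centroid is d²(x) = (x − mean)ᵀ Σ⁻¹ (x − mean), where Σ is the covariance matrix of the independent variables. A minimal numpy sketch with synthetic data, for illustration only:

import numpy as np

X = np.random.default_rng(2).normal(size=(300, 4))     # synthetic cases x independent variables
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of one case from the centroid of all cases."""
    d = x - mean
    return float(d @ cov_inv @ d)

print(mahalanobis_sq(X[0]))    # large values flag cases with extreme variable values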

Generated Discriminant Models


This node is available with the Classification module. Discriminant models represent the equations estimated by Discriminant nodes. They contain all of the information captured by the discriminant model, as well as information about the model structure and performance. When you execute a stream containing a discriminant model, the node adds two new fields containing the model's prediction and the associated probability. The names of the new fields are derived from the name of the output field being predicted, prefixed with $D- for the predicted category and $DP- for the associated probability. For example, for an output field named colorpref, the new fields would be named $D-colorpref and $DP-colorpref.
Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model.


Discriminant Model Summary


This node is available with the Classification module. The Summary tab for a generated discriminant model displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537. For general information on using the model browser, see Browsing Generated Models on p. 239.
Figure 12-30 Sample Discriminant Equation node Summary tab

Discriminant Model Advanced Output


This node is available with the Classification module.

Figure 12-31 Sample Discriminant Equation node Advanced tab

The advanced output for discriminant analysis gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of discriminant analysis is required to properly interpret this output. For more information, see Discriminant Node Output Options on p. 400.

Generalized Linear Models Node


This node is available with the Classification module. The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, loglinear models for count data, complementary log-log models for interval-censored survival data, plus many other statistical models through its very general model formulation.
Examples. A shipping company can use generalized linear models to fit a Poisson regression to damage counts for several types of ships constructed in different time periods, and the resulting model can help determine which ship types are most prone to damage. For more information, see Using Poisson Regression to Analyze Ship Damage Rates (Generalized Linear Models) in Chapter 20 in Clementine 11.1 Applications Guide.


A car insurance company can use generalized linear models to fit a gamma regression to damage claims for cars, and the resulting model can help determine the factors that contribute the most to claim size. For more information, see Fitting a Gamma Regression to Car Insurance Claims (Generalized Linear Models) in Chapter 21 in Clementine 11.1 Applications Guide. Medical researchers can use generalized linear models to fit a complementary log-log regression to interval-censored survival data to predict the time to recurrence for a medical condition. For more information, see Analyzing Interval-Censored Survival Data (Generalized Linear Models) in Chapter 19 in Clementine 11.1 Applications Guide. Generalized Linear Models works by building an equation that relates the input field values to the output field values. Once the model is generated, it can be used to estimate values for new data. For each record, a probability of membership is computed for each possible output category. The target category with the highest probability is assigned as the predicted output value for that record.
Requirements. You need one or more predictor (In) fields and exactly one target (Out) field (which can be of any type) with two or more categories. Fields used in the model must have their types fully instantiated.
Strengths. The generalized linear model is extremely flexible, but the process of choosing the model structure is not automated and thus demands a level of familiarity with your data that is not required by black box algorithms.

Generalized Linear Models Node Field Options


This node is available with the Classification module.

Figure 12-32 Generalized Linear Models node Fields tab

In addition to the target, input, and partition custom options typically offered on modeling node Fields tabs (see Modeling Node Fields Options on p. 235), the Generalized Linear Models node offers the following extra functionality.
Use weight field. The scale parameter is an estimated model parameter related to the variance of the response. The scale weights are known values that can vary from observation to observation. If the scale weight variable is specified, the scale parameter, which is related to the variance of the response, is divided by it for each observation. Records with scale weight values that are less than or equal to 0 or are missing are not used in the analysis.
Target field represents number of events occurring in a set of trials. When the response is a number of events occurring in a set of trials, the target field contains the number of events and you can select an additional variable containing the number of trials. Alternatively, if the number of trials is the same across all subjects, then trials may be specified using a fixed value. The number of trials should be greater than or equal to the number of events for each record. Events should be non-negative integers, and trials should be positive integers.

Generalized Linear Models Node Model Options


This node is available with the Classification module.

Figure 12-33 Generalized Linear Models node Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Offset. The offset term is a structural predictor. Its coefficient is not estimated by the model but is assumed to have the value 1; thus, the values of the offset are simply added to the linear predictor of the dependent variable. This is especially useful in Poisson regression models, where each case may have different levels of exposure to the event of interest. For example, when modeling accident rates for individual drivers, there is an important difference between a driver who has been at fault in one accident in three years of experience and a driver who has been at fault in one accident in 25 years! The number of accidents can be modeled as a Poisson response if the experience of the driver is included as an offset term.
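A hedged sketch of the offset idea using the Python statsmodels library (not part of Clementine): the log of each driver's years of experience enters the linear predictor with a fixed coefficient of 1, so the model effectively describes accident rates rather than raw counts. The data and field names are invented.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
years = rng.uniform(1, 25, size=200)          # exposure: years of driving experience
age = rng.uniform(18, 70, size=200)           # a hypothetical predictor
accidents = rng.poisson(0.05 * years)         # synthetic accident counts

X = sm.add_constant(age)
model = sm.GLM(accidents, X, family=sm.families.Poisson(),
               offset=np.log(years))          # offset coefficient is fixed at 1
print(model.fit().summary())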

Base category for target. For binary response, you can choose the reference category for the dependent variable. This can affect certain output, such as parameter estimates and saved values, but it should not change the model fit. For example, if your binary response takes values 0 and 1:
By default, the procedure makes the last (highest-valued) category, or 1, the reference category. In this situation, model-saved probabilities estimate the chance that a given case takes the value 0, and parameter estimates should be interpreted as relating to the likelihood of category 0.
If you specify the first (lowest-valued) category, or 0, as the reference category, then model-saved probabilities estimate the chance that a given case takes the value 1.
If you specify the custom category and your variable has defined labels, you can set the reference category by choosing a value from the list. This can be convenient when, in the middle of specifying a model, you don't remember exactly how a particular variable was coded.
Include intercept in model. The intercept is usually included in the model. If you can assume the data pass through the origin, you can exclude the intercept.

Generalized Linear Models Node Expert Options


This node is available with the Classification module. If you have detailed knowledge of Generalized Linear Models, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.

Figure 12-34 Generalized Linear Models Expert tab

Target Field Distribution and Link Function
Distribution. This selection specifies the distribution of the dependent variable. The ability to specify a non-normal distribution and non-identity link function is the essential improvement of the generalized linear model over the general linear model. There are many possible distribution-link function combinations, and several may be appropriate for any given dataset, so your choice can be guided by a priori theoretical considerations or which combination seems to fit best.
Binomial. This distribution is appropriate only for variables that represent a binary response or number of events.
Gamma. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.
Inverse Gaussian. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.
Negative Binomial. This distribution can be thought of as the number of trials required to observe k successes and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis. The fixed value of the negative binomial distribution's ancillary parameter can be any number greater than or equal to 0. When the ancillary parameter is set to 0, using this distribution is equivalent to using the Poisson distribution.
Normal. This is appropriate for scale variables whose values take a symmetric, bell-shaped distribution about a central (mean) value. The dependent variable must be numeric.
Poisson. This distribution can be thought of as the number of occurrences of an event of interest in a fixed period of time and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis.
Link Functions. The link function is a transformation of the dependent variable that allows estimation of the model. The following functions are available:
Identity. f(x) = x. The dependent variable is not transformed. This link can be used with any distribution.
Complementary log-log. f(x) = log(−log(1 − x)). This is appropriate only with the binomial distribution.
Log. f(x) = log(x). This link can be used with any distribution.
Log complement. f(x) = log(1 − x). This is appropriate only with the binomial distribution.
Logit. f(x) = log(x / (1 − x)). This is appropriate only with the binomial distribution.
Negative Binomial. f(x) = log(x / (x + k⁻¹)), where k is the ancillary parameter of the negative binomial distribution. This is appropriate only with the negative binomial distribution.
Negative log-log. f(x) = −log(−log(x)). This is appropriate only with the binomial distribution.
Odds power. f(x) = [(x / (1 − x))^α − 1] / α, if α ≠ 0; f(x) = log(x), if α = 0. α is the required number specification and must be a real number. This is appropriate only with the binomial distribution.
Probit. f(x) = Φ⁻¹(x), where Φ⁻¹ is the inverse standard normal cumulative distribution function. This is appropriate only with the binomial distribution.
Power. f(x) = x^α, if α ≠ 0; f(x) = log(x), if α = 0. α is the required number specification and must be a real number. This link can be used with any distribution.
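These link functions are simple numeric transformations that can be checked directly. The following Python sketch is purely illustrative (it is not part of Clementine, and the function names are invented for the example); it implements several of the links above, including the α = 0 special cases:

    import math

    def logit(x):
        # f(x) = log(x / (1 - x)); binomial distribution only
        return math.log(x / (1.0 - x))

    def complementary_log_log(x):
        # f(x) = log(-log(1 - x)); binomial distribution only
        return math.log(-math.log(1.0 - x))

    def negative_binomial_link(x, k):
        # f(x) = log(x / (x + 1/k)), where k is the ancillary parameter
        return math.log(x / (x + 1.0 / k))

    def power_link(x, alpha):
        # f(x) = x**alpha if alpha != 0; log(x) if alpha == 0
        return math.log(x) if alpha == 0 else x ** alpha

    def odds_power(x, alpha):
        # f(x) = [(x/(1-x))**alpha - 1] / alpha if alpha != 0; log(x) if alpha == 0
        if alpha == 0:
            return math.log(x)
        return ((x / (1.0 - x)) ** alpha - 1.0) / alpha

    print(logit(0.5))            # 0.0
    print(odds_power(0.8, 2.0))  # 7.5

For example, a predicted probability of 0.5 maps to 0 under the logit link, which is why parameter estimates on the logit scale are centered on the even-odds point.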


Parameter Estimation. The controls in this group allow you to specify estimation methods and to provide initial values for the parameter estimates.
Method. You can select a parameter estimation method. Choose between Newton-Raphson, Fisher scoring, or a hybrid method in which Fisher scoring iterations are performed before switching to the Newton-Raphson method. If convergence is achieved during the Fisher scoring phase of the hybrid method before the maximum number of Fisher iterations is reached, the algorithm continues with the Newton-Raphson method. (A schematic of this hybrid sequence appears after these options.)
Scale parameter method. You can select the scale parameter estimation method. Maximum-likelihood jointly estimates the scale parameter with the model effects; note that this option is not valid if the response has a negative binomial, Poisson, or binomial distribution. The deviance and Pearson chi-square options estimate the scale parameter from the value of those statistics. Alternatively, you can specify a fixed value for the scale parameter.
Initial values. The procedure will automatically compute initial values for parameters.
Covariance matrix. The model-based estimator is the negative of the generalized inverse of the Hessian matrix. The robust (also called the Huber/White/sandwich) estimator is a corrected model-based estimator that provides a consistent estimate of the covariance even when the specification of the variance and link functions is incorrect.
Iterations. These options allow you to control the parameters for model convergence. For more information, see Generalized Linear Models Node Iterations Options on p. 412.
Output. These options allow you to request additional statistics that will appear in the advanced output of the generated model built by the node. For more information, see Generalized Linear Models Node Output Options on p. 413.
Singularity tolerance. Singular (or non-invertible) matrices have linearly dependent columns, which can cause serious problems for the estimation algorithm. Even near-singular matrices can lead to poor results, so the procedure will treat a matrix whose determinant is less than the tolerance as singular. Specify a positive value.
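The hybrid estimation method described above can be pictured as a simple two-phase loop. The sketch below is only a schematic of that control flow (it is not Clementine's implementation, and fisher_step and newton_step are hypothetical callbacks standing in for the actual scoring equations):

    # Schematic of a hybrid Fisher scoring / Newton-Raphson loop (illustration only).
    def hybrid_estimate(beta, fisher_step, newton_step,
                        max_fisher_iter=1, max_iter=100):
        converged = False
        # Phase 1: a limited number of Fisher scoring iterations.
        for _ in range(max_fisher_iter):
            beta, converged = fisher_step(beta)
            if converged:
                break
        # Phase 2: continue with Newton-Raphson until convergence or the
        # maximum number of iterations, even if phase 1 already converged.
        for _ in range(max_iter):
            beta, converged = newton_step(beta)
            if converged:
                break
        return beta, converged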

Generalized Linear Models Node Iterations Options


This node is available with the Classification module. You can set the convergence parameters for Generalized Linear Models model estimation.
Figure 12-35 Generalized Linear Modeling Iterations options

Iterations.
Maximum iterations. The maximum number of iterations the algorithm will execute. Specify a non-negative integer.
Maximum step-halving. At each iteration, the step size is reduced by a factor of 0.5 until the log-likelihood increases or maximum step-halving is reached. Specify a positive integer.


Check for separation of data points. When selected, the algorithm performs tests to ensure that the parameter estimates have unique values. Separation occurs when the procedure can produce a model that correctly classifies every case. This option is available for binomial responses with binary format.
Convergence Criteria.
Parameter convergence. When selected, the algorithm stops after an iteration in which the absolute or relative change in the parameter estimates is less than the value specified, which must be positive.
Log-likelihood convergence. When selected, the algorithm stops after an iteration in which the absolute or relative change in the log-likelihood function is less than the value specified, which must be positive.
Hessian convergence. For the Absolute specification, convergence is assumed if a statistic based on the Hessian convergence is less than the positive value specified. For the Relative specification, convergence is assumed if the statistic is less than the product of the positive value specified and the absolute value of the log-likelihood.
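The absolute and relative forms of the parameter and log-likelihood criteria can be illustrated with a few lines of Python. This is a simplified sketch of the kind of test described above, not the node's actual stopping rule:

    def parameter_converged(old, new, tol=1e-6, absolute=True):
        # Stop when the absolute (or relative) change in every parameter
        # estimate is smaller than the positive tolerance.
        changes = []
        for o, n in zip(old, new):
            delta = abs(n - o)
            if not absolute:
                delta = delta / max(abs(o), 1e-12)  # relative change
            changes.append(delta)
        return max(changes) < tol

    def log_likelihood_converged(old_ll, new_ll, tol=1e-6, absolute=True):
        delta = abs(new_ll - old_ll)
        if not absolute:
            delta = delta / max(abs(old_ll), 1e-12)
        return delta < tol

    print(parameter_converged([1.00, 2.00], [1.0000004, 2.0000002]))  # True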

Generalized Linear Models Node Output Options


This node is available with the Classification module.
Figure 12-36 Generalized Linear Models Advanced Output options

Select the optional output you want to display in the advanced output of the generated Generalized Linear Model. To view the advanced output, browse the generated model and click the Advanced tab. For more information, see Generalized Linear Models Model Advanced Output on p. 416.


Print. The following output is available:
Case processing summary. Displays the number and percentage of cases included in and excluded from the analysis, and the Correlated Data Summary table.
Descriptive statistics. Displays descriptive statistics and summary information about the dependent variable, covariates, and factors.
Model information. Displays the dataset name, dependent variable or events and trials variables, offset variable, scale weight variable, probability distribution, and link function.
Goodness of fit statistics. Displays two extensions of Akaike's Information Criterion for model selection: Quasi-likelihood under the independence model criterion (QIC) for choosing the best correlation structure and another QIC measure for choosing the best subset of predictors.
Model summary statistics. Displays model fit tests, including likelihood-ratio statistics for the model fit omnibus test and statistics for the Type I or Type III contrasts for each effect.
Parameter estimates. Displays parameter estimates and corresponding test statistics and confidence intervals. You can optionally display exponentiated parameter estimates in addition to the raw parameter estimates.
Covariance matrix for parameter estimates. Displays the estimated parameter covariance matrix.
Correlation matrix for parameter estimates. Displays the estimated parameter correlation matrix.
Contrast coefficient (L) matrices. Displays contrast coefficients for the default effects and for the estimated marginal means, if requested on the EM Means tab.
General estimable functions. Displays the matrices for generating the contrast coefficient (L) matrices.
Iteration history. Displays the iteration history for the parameter estimates and log-likelihood and prints the last evaluation of the gradient vector and the Hessian matrix. The iteration history table displays parameter estimates for every nth iteration beginning with the 0th iteration (the initial estimates), where n is the value of the print interval. If the iteration history is requested, then the last iteration is always displayed regardless of n.
Lagrange multiplier test. Displays Lagrange multiplier test statistics for assessing the validity of a scale parameter that is computed using the deviance or Pearson chi-square, or set at a fixed number, for the normal, gamma, and inverse Gaussian distributions. For the negative binomial distribution, this tests the fixed ancillary parameter.
Model Effects.
Analysis type. Specify the type of analysis to produce. Type I analysis is generally appropriate when you have a priori reasons for ordering predictors in the model, while Type III is more generally applicable. Wald statistics are produced in either case.
Confidence intervals. Specify a confidence level greater than 50 and less than 100. Wald intervals are based on the assumption that parameters have an asymptotic normal distribution.
Log-likelihood function. This controls the display format of the log-likelihood function. The full function includes an additional term that is constant with respect to the parameter estimates; it has no effect on parameter estimation and is left out of the display in some software products.


Generated Generalized Linear Models


This node is available with the Classification module. Generated Generalized Linear Models represent the equations estimated by Generalized Linear Models nodes. They contain all of the information captured by the Generalized Linear Model, as well as information about the model structure and performance. When you execute a stream containing a Generalized Linear Model, the node adds new fields whose contents depend on the nature of the target field:
Flag target. Adds fields containing the predicted category and associated probability, and the probabilities for each category. The names of the first two new fields are derived from the name of the output field being predicted, prefixed with $G- for the predicted category and $GP- for the associated probability. For example, for an output field named default, the new fields would be named $G-default and $GP-default. The remaining fields are named based on the values of the output field, prefixed by $GP-. For example, if the legal values of default are Yes and No, two further fields will be added: $GP-Yes and $GP-No.
Range target. Adds fields containing the predicted mean and standard error.
Range target, representing number of events in a series of trials. Adds fields containing the predicted mean and standard error.


Generating a Filter node. The Generate menu allows you to create a new Filter node to pass input fields based on the results of the model.

Generalized Linear Models Model Summary


This node is available with the Classification module. The Summary tab for a generated Generalized Linear Model displays the fields and settings used to generate the model. In addition, if you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537. For general information on using the model browser, see Browsing Generated Models on p. 239.

Figure 12-37 Sample Generalized Linear Models Equation node Summary tab

Generalized Linear Models Model Advanced Output


This node is available with the Classification module.

Figure 12-38 Sample Generalized Linear Models Equation node Advanced tab

The advanced output for Generalized Linear Models gives detailed information about the estimated model and its performance. Most of the information contained in the advanced output is quite technical, and extensive knowledge of Generalized Linear Models analysis is required to properly interpret this output. For more information, see Generalized Linear Models Node Output Options on p. 413.

Chapter 13
Clustering Models

Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics. In fact, you may not even know exactly how many groups to look for. This is what distinguishes clustering models from the other machine-learning techniques available in Clementine: there is no predefined output or target field for the model to predict. These models are often referred to as unsupervised learning models, since there is no external standard by which to judge the model's classification performance. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings. Clustering methods are based on measuring distances between records and between clusters. Records are assigned to clusters in a way that tends to minimize the distance between records belonging to the same cluster.
Figure 13-1 Simple clustering model

Clementine provides three methods for clustering:


The K-Means node clusters the dataset into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see K-Means Node on p. 426.



The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large datasets efficiently. For more information, see TwoStep Cluster Node on p. 431. The Kohonen node generates a type of neural network that can be used to cluster the dataset into distinct groups. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. You can look at the number of observations captured by each unit in the generated model to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see Kohonen Node on p. 419.

Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses. A common example of this is the market segments used by marketers to partition their overall market into homogeneous subgroups. Each segment has special characteristics that affect the success of marketing efforts targeted toward it. If you are using data mining to optimize your marketing strategy, you can usually improve your model significantly by identifying the appropriate segments and using that segment information in your predictive models.

Kohonen Node
This node is available with the Segmentation module. Kohonen networks are a type of neural network that performs clustering; they are also known as knets or self-organizing maps. This type of network can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning. Records are grouped so that records within a group or cluster tend to be similar to each other, and records in different groups are dissimilar. The basic units are neurons, and they are organized into two layers: the input layer and the output layer (also called the output map). All of the input neurons are connected to all of the output neurons, and these connections have strengths, or weights, associated with them. During training, each unit competes with all of the others to win each record. The output map is a two-dimensional grid of neurons, with no connections between the units. A 3 × 4 map is shown below, although maps are typically larger than this.

Figure 13-2 Structure of a Kohonen network

Input data is presented to the input layer, and the values are propagated to the output layer. The output neuron with the strongest response is said to be the winner and is the answer for that input. Initially, all weights are random. When a unit wins a record, its weights (along with those of other nearby units, collectively referred to as a neighborhood) are adjusted to better match the pattern of predictor values for that record. All of the input records are shown, and weights are updated accordingly. This process is repeated many times until the changes become very small. As training proceeds, the weights on the grid units are adjusted so that they form a two-dimensional map of the clusters (hence the term self-organizing map). When the network is fully trained, records that are similar should appear close together on the output map, whereas records that are vastly different will appear far apart. Unlike most learning methods in Clementine, Kohonen networks do not use a target field. This type of learning, with no target field, is called unsupervised learning. Instead of trying to predict an outcome, Kohonen nets try to uncover patterns in the set of input fields. Usually, a Kohonen net will end up with a few units that summarize many observations (strong units), and several units that don't really correspond to any of the observations (weak units). The strong units (and sometimes other units adjacent to them in the grid) represent probable cluster centers. Another use of Kohonen networks is in dimension reduction. The spatial characteristic of the two-dimensional grid provides a mapping from the k original predictors to two derived features that preserve the similarity relationships in the original predictors. In some cases, this can give you the same kind of benefit as factor analysis or PCA. Note that the method for calculating the default size of the output grid has changed from previous versions of Clementine. The new method will generally produce smaller output layers that are faster to train and generalize better. If you find that you get poor results with the default size, try increasing the size of the output grid on the Expert tab. For more information, see Kohonen Node Expert Options on p. 423.
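The winner-and-neighborhood update described above can be sketched in a few lines of Python. This is a minimal, generic self-organizing-map step for illustration only; it is not Clementine's algorithm, and the learning rate and radius values are arbitrary:

    import numpy as np

    def som_update(weights, record, eta, radius):
        # weights has shape (width, length, n_inputs): one weight vector per output unit.
        # 1. Find the winning unit, the one whose weights are closest to the record.
        distances = np.linalg.norm(weights - record, axis=2)
        winner = np.unravel_index(np.argmin(distances), distances.shape)
        # 2. Move the winner and the units in its grid neighborhood toward the record.
        for x in range(weights.shape[0]):
            for y in range(weights.shape[1]):
                grid_dist = np.hypot(x - winner[0], y - winner[1])
                if grid_dist <= radius:
                    weights[x, y] += eta * (record - weights[x, y])
        return winner

    rng = np.random.default_rng(1)
    weights = rng.random((3, 4, 2))              # a 3 x 4 map over two input fields
    winner = som_update(weights, np.array([0.9, 0.1]), eta=0.3, radius=1.0)
    print(winner)                                # grid coordinates of the strongest unit

Repeating this update over many passes, while shrinking eta and the radius, is what produces the two-dimensional map of clusters.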
Requirements. To train a Kohonen net, you need one or more In fields. Fields set as Out, Both, or None are ignored.


Strengths. You do not need to have data on group membership to build a Kohonen network model. You don't even need to know the number of groups to look for. Kohonen networks start with a large number of units, and as training progresses, the units gravitate toward the natural clusters in the data. You can look at the number of observations captured by each unit in the generated model to identify the strong units, which can give you a sense of the appropriate number of clusters.

Kohonen Node Model Options


This node is available with the Segmentation module.
Figure 13-3 Kohonen node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Continue training existing model. By default, each time you execute a Kohonen node, a completely new network is created. If you select this option, training continues with the last net successfully produced by the node.
Show feedback graph. If this option is selected, a visual representation of the two-dimensional array is displayed during training. The strength of each node is represented by color. Red denotes a unit that is winning many records (a strong unit), and white denotes a unit that is winning few or no records (a weak unit). Note that this feature can slow training time. To speed up training time, deselect this option.

Figure 13-4 Kohonen feedback graph

Stop on. The default stopping criterion stops training based on internal parameters. You can also specify time as the stopping criterion. Enter the time (in minutes) for the network to train.
Set random seed. If no random seed is set, the sequence of random values used to initialize the network weights will be different every time the node is executed. This can cause the node to create different models on different runs, even if the node settings and data values are exactly the same. By selecting this option, you can set the random seed to a specific value so the resulting model is exactly reproducible. A specific random seed always generates the same sequence of random values, in which case executing the node always yields the same generated model. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node in Chapter 3 on p. 54.
Note: Use binary set encoding, an option available in previous versions of Clementine, has been removed. In some situations, that option tended to distort distance information between records and thus was not suitable for use with Kohonen nets, which rely heavily on such distance information. If you want to include set fields in your model but are having memory problems in building the model, or the model is taking too long to build, consider recoding large set fields to reduce the number of values or using a different field with fewer values as a proxy for the large set. For example, if you are having a problem with a product_id field containing values for individual products, you might consider removing it from the model and adding a less detailed product_category field instead.
Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate at some sacrifice to speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg. For more information, see Using the options.cfg File in Chapter 4 in Clementine 11.1 Server Administration and Performance Guide.

Kohonen Node Expert Options


This node is available with the Segmentation module. For those with detailed knowledge of Kohonen networks, expert options allow you to fine-tune the training process. To access expert options, set the Mode to Expert on the Expert tab.
Figure 13-5 Kohonen expert options

Width and Length. Specify the size (width and length) of the two-dimensional output map as the number of output units along each dimension.
Learning rate decay. Select either linear or exponential learning rate decay. The learning rate is a weighting factor that decreases over time, such that the network starts off encoding large-scale features of the data and gradually focuses on more fine-level detail.
Phase 1 and Phase 2. Kohonen net training is split into two phases. Phase 1 is a rough estimation phase, used to capture the gross patterns in the data. Phase 2 is a tuning phase, used to adjust the map to model the finer features of the data. For each phase, there are three parameters:
Neighborhood. Sets the starting size (radius) of the neighborhood. This determines the number of nearby units that get updated along with the winning unit during training. During phase 1, the neighborhood size starts at Phase 1 Neighborhood and decreases to (Phase 2 Neighborhood + 1). During phase 2, neighborhood size starts at Phase 2 Neighborhood and decreases to 1.0. Phase 1 Neighborhood should be larger than Phase 2 Neighborhood.
Initial Eta. Sets the starting value for learning rate eta. During phase 1, eta starts at Phase 1 Initial Eta and decreases to Phase 2 Initial Eta. During phase 2, eta starts at Phase 2 Initial Eta and decreases to 0. Phase 1 Initial Eta should be larger than Phase 2 Initial Eta.
Cycles. Sets the number of cycles for each phase of training. Each phase continues for the specified number of passes through the data.
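A small sketch may make these decay schedules easier to picture. Assuming linear decay (the Learning rate decay option also allows exponential decay) and purely illustrative starting values rather than Clementine's defaults, the neighborhood and eta schedules behave like this:

    def linear_schedule(start, end, cycles):
        # Values that decay linearly from start to end over the given number of cycles.
        if cycles <= 1:
            return [float(start)]
        step = (start - end) / (cycles - 1)
        return [start - i * step for i in range(cycles)]

    # Illustrative settings (not Clementine's defaults).
    phase1_neighborhood, phase2_neighborhood = 3.0, 1.0
    phase1_eta, phase2_eta = 0.3, 0.1

    # Phase 1: neighborhood shrinks from Phase 1 Neighborhood to (Phase 2 Neighborhood + 1);
    # eta shrinks from Phase 1 Initial Eta to Phase 2 Initial Eta.
    print(linear_schedule(phase1_neighborhood, phase2_neighborhood + 1, cycles=3))  # [3.0, 2.5, 2.0]
    print(linear_schedule(phase1_eta, phase2_eta, cycles=3))                        # approx. [0.3, 0.2, 0.1]

    # Phase 2: neighborhood shrinks to 1.0 and eta decays toward 0.
    print(linear_schedule(phase2_neighborhood, 1.0, cycles=3))                      # [1.0, 1.0, 1.0]
    print(linear_schedule(phase2_eta, 0.0, cycles=3))                               # approx. [0.1, 0.05, 0.0]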

Generated Kohonen Models


Generated Kohonen models contain all of the information captured by the trained Kohonen network, as well as information about the network's architecture. When you execute a stream containing a generated Kohonen model, the node adds two new fields containing the X and Y coordinates of the unit in the Kohonen output grid that responded most strongly to that record. The new field names are derived from the model name, prefixed by $KX- and $KY-. For example, if your model is named Kohonen, the new fields would be named $KX-Kohonen and $KY-Kohonen. To get a better sense of what the Kohonen net has encoded, click the Viewer tab on the generated model browser. This displays the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see Cluster Viewer Tab on p. 436. If you prefer to visualize the clusters as a grid, you can view the result of the Kohonen net by plotting the $KX- and $KY- fields using a Plot node. (You should select X-Agitation and Y-Agitation in the Plot node to prevent each unit's records from all being plotted on top of each other.) In the plot, you can also overlay a symbolic field to investigate how the Kohonen net has clustered the data. Another powerful technique for gaining insight into the Kohonen network is to use rule induction to discover the characteristics that distinguish the clusters found by the network. For more information, see C5.0 Node in Chapter 9 on p. 308.

Kohonen Model Cluster Details


The Model tab for a generated Kohonen model displays detailed information about the clusters defined by the model. The units of the Kohonen map, commonly treated as clusters, are labeled, and the number of records assigned to each cluster is shown. Each cluster is described by its center, which can be thought of as the prototype for the cluster. For scale fields, the mean value and standard deviation for training records assigned to the cluster are given; for symbolic fields, the proportion for each distinct value is reported (except for values that do not occur for any records in the cluster, which are omitted).

Figure 13-6 Sample generated Kohonen node Model tab

For general information on using the model browser, see Browsing Generated Models in Chapter 6 on p. 239.

Kohonen Model Summary


The Summary tab for a generated Kohonen node displays information about the architecture or topology of the network. The length and width of the two-dimensional Kohonen feature map (the output layer) are shown as $KX-model_name and $KY-model_name. For the input and output layers, the number of units in that layer is listed.

Figure 13-7 Sample generated Kohonen node Summary tab

K-Means Node
This node is included with the Base module. The K-Means node provides a method of cluster analysis. It can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning. Unlike most learning methods in Clementine, K-Means models do not use a target field. This type of learning, with no target field, is called unsupervised learning. Instead of trying to predict an outcome, K-Means tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar. K-Means works by defining a set of starting cluster centers derived from data. It then assigns each record to the cluster to which it is most similar, based on the record's input field values. After all cases have been assigned, the cluster centers are updated to reflect the new set of records assigned to each cluster. The records are then checked again to see whether they should be reassigned to a different cluster, and the record assignment/cluster iteration process continues until either the maximum number of iterations is reached, or the change between one iteration and the next fails to exceed a specified threshold. Note: The resulting model depends to a certain extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model.
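The assign-and-update loop described above can be written out compactly. The sketch below is a generic k-means implementation for illustration, not Clementine's algorithm (which, for example, chooses its starting centers differently and handles set fields through encoding):

    import numpy as np

    def k_means(data, k, max_iterations=20, change_tolerance=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Start from k records chosen as initial cluster centers.
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(max_iterations):
            # Assign each record to the nearest cluster center.
            distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            assignments = distances.argmin(axis=1)
            # Update each center to the mean of the records assigned to it.
            new_centers = np.array([
                data[assignments == j].mean(axis=0) if np.any(assignments == j)
                else centers[j]
                for j in range(k)
            ])
            # Stop when the largest center movement falls below the tolerance.
            moved = np.max(np.linalg.norm(new_centers - centers, axis=1))
            centers = new_centers
            if moved < change_tolerance:
                break
        return centers, assignments

    rng = np.random.default_rng(42)
    data = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(3.0, 0.2, (20, 2))])
    centers, assignments = k_means(data, k=2)
    print(centers)   # the two cluster centers found for this two-group sample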


Requirements. To train a K-Means model, you need one or more In fields. Fields with direction Out, Both, or None are ignored.
Strengths. You do not need to have data on group membership to build a K-Means model. The K-Means model is often the fastest method of clustering for large data sets.

K-Means Node Model Options


This node is included with the Base module.
Figure 13-8 K-Means node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Specified number of clusters. Specify the number of clusters to generate. The default is 5.
Generate distance field. If this option is selected, the generated model will include a field containing the distance of each record from the center of its assigned cluster.
Show cluster proximity. Select this option to include information about distances between cluster centers in the generated model output.
Cluster display. Specify the format for the generated cluster membership field. Cluster membership can be indicated as a String with the specified Label prefix (for example, "Cluster 1", "Cluster 2", and so on), or as a Number.
Note: Use binary set encoding, an option available in previous versions of Clementine, has been removed. In some situations, that option tended to distort distance information between records and was thus unsuitable for use with K-Means models, which rely heavily on such distance information. If you want to include set fields in your model but are having memory problems in building the model or the model is taking too long to build, consider recoding large set fields to reduce the number of values or using a different field with fewer values as a proxy for the large set. For example, if you are having a problem with a product_id field containing values for individual products, you might consider removing it from the model and adding a less detailed product_category field instead.
Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm to never use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate at some sacrifice to speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg. For more information, see Using the options.cfg File in Chapter 4 in Clementine 11.1 Server Administration and Performance Guide.

K-Means Node Expert Options


This node is included with the Base module. For those with detailed knowledge of k-means clustering, expert options allow you to fine-tune the training process. To access expert options, set the Mode to Expert on the Expert tab.
Figure 13-9 K-Means expert options

Stop on. Specify the stopping criterion to be used in training the model. The Default stopping criterion is 20 iterations or change < 0.000001, whichever occurs first. Select Custom to specify your own stopping criteria.
Maximum Iterations. This option allows you to stop model training after the number of iterations specified.
Change tolerance. This option allows you to stop model training when the largest change in cluster centers for an iteration is less than the level specified.


Encoding value for sets. Specify a value between 0 and 1.0 to use for recoding set fields as groups of numeric fields. The default value is the square root of 0.5 (approximately 0.707107), to provide the proper weighting for recoded flag fields. Values closer to 1.0 will weight set fields more heavily than numeric fields.
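To see why the default of sqrt(0.5) gives balanced weighting, consider what the recoding does: each category becomes a flag field whose "true" value is the encoding value. The sketch below is an illustration of the idea only, not Clementine's internal encoding:

    import math

    def encode_set_field(values, encoding_value=math.sqrt(0.5)):
        # Recode a set (categorical) field as one numeric flag field per category,
        # scaled by the encoding value.
        categories = sorted(set(values))
        encoded = [[encoding_value if v == c else 0.0 for c in categories]
                   for v in values]
        return categories, encoded

    categories, rows = encode_set_field(["drugA", "drugB", "drugA"])
    print(categories)   # ['drugA', 'drugB']
    print(rows[0])      # [0.7071067811865476, 0.0]

With the default value, two records that disagree on the set field differ by sqrt(0.5) in two of the derived fields, so their squared distance from that field is 0.5 + 0.5 = 1, the same contribution as a unit difference on a single numeric field; values closer to 1.0 make a mismatch count for more.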

Generated K-Means Models


Generated K-Means models contain all of the information captured by the clustering model, as well as information about the training data and the estimation process. When you execute a stream containing a generated K-Means model, the node adds two new fields containing the cluster membership and distance from the assigned cluster center for that record. The new field names are derived from the model name, prefixed by $KM- for the cluster membership and $KMD- for the distance from the cluster center. For example, if your model is named Kmeans, the new fields would be named $KM-Kmeans and $KMD-Kmeans. A powerful technique for gaining insight into the K-Means model is to use rule induction to discover the characteristics that distinguish the clusters found by the model. For more information, see C5.0 Node in Chapter 9 on p. 308. You can also click the Viewer tab on the generated model browser to display the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see Cluster Viewer Tab on p. 436.

K-Means Model Cluster Details


The Model tab for a generated K-Means model contains detailed information about the clusters defined by the model. Clusters are labeled and the number of records assigned to each cluster is shown. Each cluster is described by its center, which can be thought of as the prototype for the cluster. For scale fields, the mean value for training records assigned to the cluster is given; for symbolic fields, the proportion for each distinct value is reported (except for values that do not occur for any records in the cluster, which are omitted). If you requested Show cluster proximity in the K-Means node used to generate the model, each cluster description will also contain its proximities from every other cluster.

Figure 13-10 Sample generated K-Means node Model tab

For more information, see Browsing Generated Models in Chapter 6 on p. 239.

K-Means Model Summary


The Summary tab for a generated K-Means model contains information about the training data, the estimation process, and the clusters defined by the model. The number of clusters is shown, as well as the iteration history. If you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537.

Figure 13-11 Sample generated K-Means node Summary tab

TwoStep Cluster Node


This node is available with the Segmentation module. The TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning. As with Kohonen nodes and K-Means nodes, TwoStep Cluster models do not use a target field. Instead of trying to predict an outcome, TwoStep Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar. TwoStep Cluster is a two-step clustering method. The first step makes a single pass through the data, during which it compresses the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters, without requiring another pass through the data. Hierarchical clustering has the advantage of not requiring the number of clusters to be selected ahead of time. Many hierarchical clustering methods start with individual records as starting clusters and merge them recursively to produce ever larger clusters. Though such approaches often break down with large amounts of data, TwoStep's initial preclustering makes hierarchical clustering fast even for large data sets.


Note: The resulting model depends to a certain extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model.
Requirements. To train a TwoStep Cluster model, you need one or more In fields. Fields with direction Out, Both, or None are ignored. The TwoStep Cluster algorithm does not handle missing values. Records with blanks for any of the input fields will be ignored when building the model.
Strengths. TwoStep Cluster can handle mixed field types and is able to handle large data sets efficiently. It also has the ability to test several cluster solutions and choose the best, so you don't need to know how many clusters to ask for at the outset. TwoStep Cluster can be set to automatically exclude outliers, or extremely unusual cases that can contaminate your results.

TwoStep Cluster Node Model Options


This node is available with the Segmentation module.
Figure 13-12 TwoStep Cluster node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.
Standardize numeric fields. By default, TwoStep will standardize all numeric input fields to the same scale, with a mean of 0 and a variance of 1. To retain the original scaling for numeric fields, deselect this option. Symbolic fields are not affected.
Exclude outliers. If you select this option, records that don't appear to fit into a substantive cluster will be automatically excluded from the analysis. This prevents such cases from distorting the results.


Outlier detection occurs during the preclustering step. When this option is selected, subclusters with few records relative to other subclusters are considered potential outliers, and the tree of subclusters is rebuilt excluding those records. Some of those potential outlier records can be added to the rebuilt subclusters if they are similar enough to any of the new subcluster profiles. The rest of the potential outliers that cannot be merged are considered outliers and are added to a noise cluster and excluded from the hierarchical clustering step. When scoring data with a TwoStep model that uses outlier handling, new cases that are more than a certain threshold distance (based on the log-likelihood) from the nearest substantive cluster are considered outliers and are assigned to the noise cluster.
Cluster label. Specify the format for the generated cluster membership field. Cluster membership can be indicated as a String with the specified Label prefix (for example, "Cluster 1", "Cluster 2", and so on), or as a Number.
Automatically calculate number of clusters. TwoStep Cluster can very rapidly analyze a large number of cluster solutions to choose the optimal number of clusters for the training data. Specify a range of solutions to try by setting the Maximum and the Minimum number of clusters. TwoStep uses a two-stage process to determine the optimal number of clusters. In the first stage, an upper bound on the number of clusters in the model is selected based on the change in the Bayes Information Criterion (BIC) as more clusters are added. In the second stage, the change in the minimum distance between clusters is found for all models with fewer clusters than the minimum-BIC solution. The largest change in distance is used to identify the final cluster model. A simplified sketch of this two-stage selection follows these options.
Specify number of clusters. If you know how many clusters to include in your model, select this option and enter the number of clusters.
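As a very rough illustration of the two-stage rule above, suppose that for each candidate number of clusters we already know the BIC of the solution and the minimum distance between its clusters. The Python sketch below applies a simplified version of the selection logic; the criteria used by the actual TwoStep algorithm are more elaborate, and the numbers are invented:

    def choose_number_of_clusters(bic, min_distance):
        ks = sorted(bic)
        # Stage 1: the upper bound is the candidate with the smallest BIC.
        upper = min(ks, key=lambda k: bic[k])
        # Stage 2: among solutions with fewer clusters than the upper bound, pick the
        # point where adding one more cluster causes the largest drop in the minimum
        # between-cluster distance (beyond that point, clusters are forced close together).
        candidates = [k for k in ks if 2 <= k < upper] or [upper]
        def distance_drop(k):
            return min_distance[k] - min_distance.get(k + 1, min_distance[k])
        return max(candidates, key=distance_drop)

    bic = {2: 410.0, 3: 395.0, 4: 390.0, 5: 402.0}
    min_distance = {2: 6.0, 3: 2.1, 4: 1.9, 5: 1.8}
    print(choose_number_of_clusters(bic, min_distance))   # 2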

Generated TwoStep Cluster Models


Generated TwoStep cluster models contain all of the information captured by the clustering model, as well as information about the training data and the estimation process. When you execute a stream containing a generated TwoStep model, the node adds a new field containing the cluster membership for that record. The new field name is derived from the model name, prefixed by $T-. For example, if your model is named TwoStep, the new field would be named $T-TwoStep. A powerful technique for gaining insight into the TwoStep model is to use rule induction to discover the characteristics that distinguish the clusters found by the model. For more information, see C5.0 Node in Chapter 9 on p. 308. You can also click the Viewer tab on the generated model browser to display the Cluster Viewer, providing a graphical representation of clusters, fields, and importance levels. For more information, see Cluster Viewer Tab on p. 436.

TwoStep Model Cluster Details


The Model tab for a generated TwoStep model contains detailed information about the clusters defined by the model. Clusters are labeled, and the number of records assigned to each cluster is shown. Each cluster is described by its center, which can be thought of as the prototype for the cluster. For scale fields, the average value and standard deviation for training records assigned to the cluster are given; for symbolic fields, the proportion for each distinct value is reported (except for values that do not occur for any records in the cluster, which are omitted).
Figure 13-13 Sample generated TwoStep node Model tab

Note: When scoring records using a TwoStep model, the number of records assigned to a given segment may differ slightly from the number reported in the model browser. This is because the methods used in scoring differ slightly from those used to build the model; they are designed to optimize performance even for large data sets while making only two passes of the data.

TwoStep Model Summary


The Summary tab for a generated TwoStep model displays the number of clusters found, along with information about the training data, the estimation process, and build settings used.

Figure 13-14 Sample generated TwoStep node Summary tab

For more information, see Browsing Generated Models in Chapter 6 on p. 239.

The Cluster Viewer


Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign. The following cluster models are generated in Clementine: the generated Kohonen net node, the generated K-Means node, and the generated TwoStep cluster node. To see information about the generated cluster models, right-click the model node and select Browse from the context menu (or Edit for nodes in a stream).


Cluster Viewer Tab


The Viewer tab for cluster models shows a graphical display of summary statistics and distributions for fields between clusters.
Figure 13-15 Sample Viewer tab with cluster display

By default, the clusters are displayed on the x axis and the fields on the y axis. If the cluster matrix is large, it is automatically paginated for faster display on the screen. The expanded dialog box contains options for viewing all clusters and fields at once. The toolbar contains buttons used for navigating through paginated results. For more information, see Navigating the Cluster View on p. 441. The cluster axis lists each cluster in cluster number order and by default includes an Importance column. An Overall column can be added using options on the expanded dialog box.


The Overall column displays the values (represented by bars) for all clusters in the data set and provides a useful comparison tool. Expand the dialog box using the yellow arrow button and select the Show Overall option. The Importance column displays the overall importance of the field to the model. It is displayed as 1 minus the p value (the probability value from the t test or chi-square test used to measure importance). The field axis lists each field (variable) used in the analysis and is sorted alphabetically. Both discrete fields and scale fields are displayed by default. The individual cells of the table show summaries of a given field's values for the records in a given cluster. These values can be displayed as small charts or as scale values. Note: Some models created before Clementine 8.0 may not display full information on the Viewer tab: for pre-8.0 K-Means models, numeric fields always show importance as Unknown, and text view may not display any information; for pre-8.0 Kohonen models, the Viewer tab is not available.

Understanding the Cluster View


There are two approaches to interpreting the results in a cluster display: Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others? Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another? Using the main view and the various drill-down views in the Cluster display, you can gain insight to help you answer these questions.
Figure 13-16 Subsection of Top View display for clusters

As you read across the row for a field, take note of how the category frequency (for discrete fields) and the mean-value distribution (for range fields) varies among clusters. For example, in the image above, notice that clusters 2 and 5 contain entirely different values for the BP (blood pressure) field. This information, combined with the importance level indicated in the column on the right, tells you that blood pressure is an important determinant of membership in a cluster. These clusters and the BP field are worth examining in greater detail. Using the display, you can double-click the field for a more detailed view, displaying actual values and statistics. The following tips provide more information on interpreting the detailed view for fields and clusters.
What Is Importance?

For both range (numeric) and discrete fields, the higher the importance measure, the less likely it is that the variation for a field between clusters is due to chance and the more likely it is due to some underlying difference. In other words, fields with a higher importance level are those to explore further. Importance is calculated as 1 minus the significance value of a statistical test. For categorical variables, the test is a chi-square test. The null hypothesis is that the within-cluster distributions of category counts are the same across clusters. If the categorical variable is really influential in determining cluster membership, the null hypothesis will be rejected and the significance level will be close to zero, so the importance index is close to one. For continuous variables, the test is a Student's t test. The null hypothesis is that the within-cluster means are the same across clusters. If the continuous variable is really influential in determining cluster membership, the null hypothesis will be rejected and the significance level will be close to zero, so the importance index is close to one.
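As an illustration of this calculation outside Clementine, the sketch below computes importance as 1 minus the p value using SciPy (assuming SciPy is installed). For a categorical field, a contingency table of cluster-by-category counts feeds a chi-square test; for a continuous field, a t test compares the means of two clusters. This is a simplified stand-in for the tests Clementine applies, not a reproduction of them:

    from scipy import stats

    def categorical_importance(contingency_table):
        # Rows are clusters, columns are category counts for one field.
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
        return 1.0 - p_value

    def continuous_importance(values_cluster_a, values_cluster_b):
        # Two-cluster case shown for simplicity.
        result = stats.ttest_ind(values_cluster_a, values_cluster_b)
        return 1.0 - result.pvalue

    # A field whose category counts differ sharply between two clusters
    # receives an importance close to 1.
    print(categorical_importance([[40, 5], [8, 47]]))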
Reading the Display for Discrete Fields

For discrete fields, or sets, the Top View (the default cluster comparison view) displays distribution charts indicating the category counts of the field for each cluster. Drill down (by double-clicking or using the expanded tab options) to view actual counts for each value within a cluster. These counts indicate the number of records with the given value that fall into a specific cluster.

Figure 13-17 Drill-down view for a discrete field

To view both counts and percentages, view the display as text. For more information, see Viewing Clusters As Text on p. 447. At any time, you can click the Top View button on the toolbar to return to the main Viewer display for all fields and clusters. Use the arrow buttons to flip through recent views.
Figure 13-18 Buttons used to return to Top View and flip through recent views

Reading the Display for Scale Fields

For scale fields, the Viewer displays bars representing the mean value of a field for each cluster. The Overall column compares these mean values, but is not a histogram indicating frequency distribution. Drill down (by double-clicking or using the expanded tab options) to view the actual mean value and standard deviation of the field for each cluster.

Figure 13-19 Drill-down view for a scale field

Reading Cluster Details

You can view detailed information about a single cluster by drilling down into the display. This is an effective way to quickly examine a cluster of interest and determine which field(s) might contribute to the cluster's uniqueness. Compare the Cluster and Overall charts by field and use the importance levels to determine fields that provide separation or commonality between clusters.

Figure 13-20 Drill-down view for a single cluster

Navigating the Cluster View


The Cluster Viewer is an interactive display. Using the mouse or the keyboard, you can: drill down to view more details for a field or cluster; move through paginated results; compare clusters or fields by expanding the dialog box to select items of interest; alter the display using toolbar buttons; scroll through views; transpose axes using toolbar buttons; print, copy, and zoom; and generate Derive, Filter, and Select nodes using the Generate button.


Using the Toolbar

You can control the display using the toolbar buttons. Move through paginated results for clusters and fields, or drill down to view a specific cluster or field. You can also change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. You can also scroll through previous views, return to the top view, and open a dialog box to specify the colors and thresholds for displaying importance.
Figure 13-21 Toolbar for navigating and controlling the Cluster Viewer

Use your mouse on the Viewer tab to hover over a toolbar button and activate a ToolTip explaining its functionality.
Moving Columns

Columns can be moved to a new position in the table by selecting one or more column headers, holding down the left mouse button, and then dragging the columns to the desired position in the table. The same approach can be taken to move rows to a new position. Note that only adjacent columns or rows can be moved together.
Generating Nodes from Cluster Models

The Generate menu allows you to create new nodes based on the cluster model. This option is available from the Model and Viewer tabs of the generated model and allows you to generate nodes based on the current display or selection (that is, all visible clusters or all selected ones). For example, you can drill down to view details on a single field and then generate a Filter node to discard all other (nonvisible) fields. The generated nodes are placed unconnected on the canvas. Connect and make any desired edits before execution.
Filter Node. Creates a new Filter node to filter fields that are not used by the cluster model, and/or not visible in the current Viewer display. If there is a Type node upstream from this Cluster node, any fields with direction OUT are discarded by the generated Filter node.
Filter Node (from selection). Creates a new Filter node to filter fields based on selections in the Viewer. Select multiple fields using the Ctrl-click method. Fields selected in the Viewer are discarded downstream, but you can change this behavior by editing the Filter node before execution.
Select Node. Creates a new Select node to select records based on their membership in any of the clusters visible in the current Viewer display. A select condition is automatically generated.
Select Node (from selection). Creates a new Select node to select records based on membership in clusters selected in the Viewer. Select multiple clusters using the Ctrl-click method.


Derive Node. Creates a new Derive node, which derives a flag field that assigns records a value of True or False based on membership in all clusters visible in the Viewer. A derive condition is automatically generated.
Derive Node (from selection). Creates a new Derive node, which derives a flag field based on membership in clusters selected in the Viewer. Select multiple clusters using the Ctrl-click method.

Selecting Clusters for Display


You can specify clusters for display by selecting a cluster column in the viewer and double-clicking. Multiple adjacent cells, rows, or columns can be selected by holding down the Shift key on the keyboard while making a selection. Multiple nonadjacent cells, rows, or columns can be selected by holding down the Ctrl key while making a selection. Alternatively, you can select clusters for display using a dialog box available from the expanded Cluster Viewer. To open the dialog box:
E Click the yellow arrow at the top of the Viewer to expand for more options.
Figure 13-22 Expanded Viewer tab with Show and Sort options
E From the Cluster drop-down list, select one of several options for display.
Select Display All to show all clusters in the matrix.
Select a cluster number to display details for only that cluster.
Select Clusters Larger than to set a threshold for displaying clusters. This enables the Records option, which allows you to specify the minimum number of records in a cluster for it to be displayed.
Select Clusters Smaller than to set a threshold for displaying clusters. This enables the Records option, which allows you to specify the maximum number of records in a cluster for it to be displayed.
Select Custom to hand-select clusters for display. To the right of the drop-down list, click the ellipsis (...) button to open a dialog box where you can select available clusters.
Custom Selection of Clusters

In the Show Selected Clusters dialog box, cluster names are listed in the column on the right. Individual clusters can be selected for display using the column on the left. Click Select All to select and view all clusters. Click Clear to deselect all clusters in the dialog box.

Selecting Fields for Display


You can specify fields for display by selecting a field row in the viewer and double-clicking. Alternatively, you can select fields using a dialog box available from the expanded Cluster Viewer. To open the dialog box:
E Click the yellow arrow at the top of the Viewer to expand for more options.
E From the Field drop-down list, select one of several options for display.
Select Display All to show all fields in the matrix.
Select a field name to display details for only that field.
Select All Ranges to display all range (numeric) fields.
Select All Discrete to display all discrete (categorical) fields.
Select Conditional to display fields that meet a certain level of importance. You can specify the importance condition using the Show drop-down list.

Figure 13-23 Displaying fields based on importance level

Select Custom to hand-select fields for display. To the right of the drop-down list, click the ellipsis (...) button to open a dialog box where you can select available fields.
Custom Selection of Fields

In the Show Selected Fields dialog box, field names are listed in the column on the right. Individual fields can be selected for display using the column on the left. Click Select All to display all fields. Click Clear to deselect all fields in the dialog box.

Sorting Display Items


When viewing cluster results as a whole or individual fields and clusters, it is often useful to sort the display table by areas of interest. Sorting options are available from the expanded Cluster Viewer. To sort clusters or fields:
E Click the yellow arrow at the top of the Viewer to expand for more options.
E In the Sort Options control box, select a sorting method. Various options may be disabled if you are viewing individual fields or clusters.


Figure 13-24 Sort options on the expanded Viewer tab


Available sort options include: For clusters, you can sort by the size or name of the cluster. For fields, you can sort by field name or importance level. Note: Fields are sorted by importance within field type. For example, scale fields are sorted by importance first, then discrete fields. Use the arrow buttons to specify sort direction.

Setting Importance Options


Using the importance dialog box, you can specify options to represent importance in the browser. Click the Importance options button on the toolbar to open the dialog box.
Figure 13-25 Color options toolbar button

Figure 13-26 Specifying format and display options for importance statistics

Labels. To show importance labels in the cluster display, select Show labels in the Importance Settings dialog box. This activates the label text fields where you can provide suitable labels.

Thresholds. Use the arrow controls to specify the desired importance threshold associated with the icon and label.


Colors. Select a color from the drop-down list to use for the importance icon.

Icons. Select an icon from the drop-down list to use for the associated level of importance.

What Is Importance?

For both range (numeric) and discrete fields, the higher the importance measure, the less likely it is that the variation for a field between clusters is due to chance and the more likely it is due to some underlying difference. In other words, fields with a higher importance level are those to explore further. Importance is calculated as 1 minus the significance value of a statistical test.


For categorical variables, the test is a chi-square test. The null hypothesis is that the within-cluster distributions of category counts are the same across clusters. If the categorical variable really is influential in determining cluster membership, the null hypothesis will be rejected and the significance level will be close to zero; hence, the importance index is close to one. For continuous variables, the test is a Student's t test. The null hypothesis is that the within-cluster means are the same across clusters. If the continuous variable really is influential in determining cluster membership, the null hypothesis will be rejected and the significance level will be close to zero; hence, the importance index is close to one.
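To make the relationship between the test and the importance index concrete, the following minimal Python sketch (not Clementine's implementation) derives an importance value from a chi-square test for a categorical field and from a t test for a continuous field; the counts and samples are hypothetical.

from scipy.stats import chi2_contingency, ttest_ind

# Categorical field: chi-square test on hypothetical category counts per cluster.
counts = [[120, 30],   # cluster 1: counts for category A and category B
          [40, 90]]    # cluster 2
_, p_categorical, _, _ = chi2_contingency(counts)
importance_categorical = 1 - p_categorical   # importance = 1 minus significance

# Continuous field: t test comparing hypothetical field values in two clusters.
cluster_a = [5.1, 4.8, 5.3, 5.0, 4.9]
cluster_b = [6.2, 6.0, 6.4, 6.1, 6.3]
_, p_continuous = ttest_ind(cluster_a, cluster_b)
importance_continuous = 1 - p_continuous

print(round(importance_categorical, 3), round(importance_continuous, 3))

Fields whose values differ strongly between clusters produce p values near zero and therefore importance values near one.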

Viewing Clusters As Text


Information in the Cluster Viewer can also be displayed as text, where all values are displayed as numerical values instead of as charts.
Figure 13-27 Selected clusters displayed as text

The text view, while different in appearance, operates in the same manner as the graphical view. To view as text:
E Click the yellow arrow at the top of the Viewer to expand for more options. E For both Display sizes and Display distributions, you can select to view results as text.

Chapter 14

Association Rules

Association rules associate a particular conclusion (the purchase of a particular product) with a set of conditions (the purchase of several other products). For example, the rule
beer <= cannedveg & frozenmeal (173, 17.0%, 0.84)

states that beer often occurs when cannedveg and frozenmeal occur together. The rule is 84% reliable and applies to 17% of the data, or 173 records. Association rule algorithms automatically find the associations that you could find manually using visualization techniques, such as the Web node.
Figure 14-1 Web node showing associations between market basket items
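As a concrete illustration of the three numbers attached to a rule (instances, support, and confidence), the following minimal Python sketch recomputes them from a handful of hypothetical flag records; it is illustrative only, not how the association nodes are implemented.

records = [
    {"cannedveg": True,  "frozenmeal": True,  "beer": True},
    {"cannedveg": True,  "frozenmeal": True,  "beer": False},
    {"cannedveg": True,  "frozenmeal": False, "beer": False},
    {"cannedveg": False, "frozenmeal": True,  "beer": True},
]

antecedent_hits = [r for r in records if r["cannedveg"] and r["frozenmeal"]]
rule_hits = [r for r in antecedent_hits if r["beer"]]

instances = len(antecedent_hits)          # corresponds to the 173 in the rule above
support = instances / len(records)        # corresponds to the 17.0%
confidence = len(rule_hits) / instances   # corresponds to the 0.84
print(instances, support, confidence)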

The advantage of association rule algorithms over the more standard decision tree algorithms (C5.0 and C&R Trees) is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion.

The disadvantage of association algorithms is that they are trying to find patterns within a potentially very large search space and, hence, can require much more time to run than a decision tree algorithm. The algorithms use a generate-and-test method for finding rules: simple rules are generated initially, and these are validated against the data set. The good rules are stored and all rules, subject to various constraints, are then specialized. Specialization is the process of adding conditions to a rule. These new rules are then validated against the data, and the process iteratively stores the best or most interesting rules found. The user usually supplies some limit to the number of antecedents to allow in a rule, and various techniques based on information theory or efficient indexing schemes are used to reduce the potentially large search space. At the end of the processing, a table of the best rules is presented. Unlike a decision tree, this set of association rules cannot be used directly to make predictions in the way that a standard model (such as a decision tree or a neural network) can. This is due to the many different possible conclusions for the rules. Another level of transformation is required to turn the association rules into a classification ruleset. Hence, the association rules produced by association algorithms are known as unrefined models. Although the user can browse these unrefined models, they cannot be used explicitly as classification models unless the user tells the system to generate a classification model from the unrefined model. This is done from the browser through a Generate menu option. Clementine provides three association rule algorithms:
The Generalized Rule Induction (GRI) node discovers association rules in the data. For example, customers who purchase razors and aftershave lotion are also likely to purchase shaving cream. GRI extracts rules with the highest information content based on an index that takes both the generality (support) and accuracy (confidence) of rules into account. GRI can handle numeric and categorical inputs, but the target must be categorical. For more information, see GRI Node on p. 450.

The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently. For large problems, Apriori is generally faster to train than GRI; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see Apriori Node on p. 452.

The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see Sequence Node on p. 476.

Tabular versus Transactional Data


Data used by association rule models may be in transactional or tabular format, as described below. These are general descriptions; specific requirements may vary as discussed in the documentation for each model type. Note that when scoring models, the data to be scored must mirror the format of the data used to build the model. Models built using tabular data can be used to score only tabular data; models built using transactional data can score only transactional data.


Transactional Format

Transactional data have a separate record for each transaction or item. If a customer makes multiple purchases, for example, each would be a separate record, with associated items linked by a customer ID. This is also sometimes known as till-roll format.
Customer   Purchase
1          jam
2          milk
3          jam
3          bread
4          jam
4          bread
4          milk

The Apriori, CARMA, and Sequence nodes can all use transactional data.
Tabular Data

Tabular data (also known as basket or truth-table data) have items represented by separate flags, where each flag field represents the presence or absence of a specific item. Each record represents a complete set of associated items. Flag fields can be categorical or numeric, although certain models may have more specific requirements.
Customer   Jam   Bread   Milk
1          T     F       F
2          F     F       T
3          T     T       F
4          T     T       T

The Apriori, CARMA, GRI, and Sequence nodes can all use tabular data.
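Because models must be scored with data in the same format used to build them, it is sometimes convenient to convert till-roll records into flag fields before modeling. Below is a minimal pandas sketch of that conversion (illustrative only, performed outside Clementine) using the example data above.

import pandas as pd

transactions = pd.DataFrame({
    "Customer": [1, 2, 3, 3, 4, 4, 4],
    "Purchase": ["jam", "milk", "jam", "bread", "jam", "bread", "milk"],
})

# One row per customer, one true/false flag column per item.
tabular = pd.crosstab(transactions["Customer"], transactions["Purchase"]).astype(bool)
print(tabular)

Within a Clementine stream, the equivalent restructuring is normally done with data preparation nodes (such as a set-to-flag operation) rather than with external code.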

GRI Node
This node is included with the Base module. The Generalized Rule Induction (GRI) node discovers association rules in the data. Association rules are statements in the form
if antecedent(s) then consequent(s)

For example, if a customer purchases a razor and aftershave lotion, then you can be 80% confident that the customer will also purchase shaving cream.
if razor and aftershave lotion then shaving cream


GRI extracts a set of rules from the data, pulling out the rules with the highest information content. Information content is measured using an index that takes both the generality (support) and accuracy (confidence) of rules into account.
Requirements. To create GRI association rules, you need one or more In fields and one or more Out fields. Output fields (those with direction Out or Both) must be symbolic. Fields with direction None are ignored. Field types must be fully instantiated before executing the node. In contrast to Apriori and CARMA, which read both tabular and transactional data, GRI requires that data be in tabular format. For more information, see Tabular versus Transactional Data on p. 449.
Strengths. Association rules are usually fairly easy to interpret, in contrast to other methods such as neural networks. Rules in a set can overlap such that some records may trigger more than one rule. This makes the rules in a ruleset more general than is possible with a decision tree. The GRI node can also handle multiple output fields. In contrast to Apriori, GRI can handle numeric as well as symbolic input fields.

GRI Node Model Options


This node is included with the Base module.
Figure 14-2 GRI node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Minimum antecedent support. You can also specify a support criterion (as a percentage). Support refers to the percentage of records in the training data for which the antecedents (the if part of the rule) are true. (Note that this definition of support differs from that used in the CARMA and Sequence nodes. For more information, see Sequence Node Model Options on p. 478.) If you are getting rules that apply to very small subsets of the data, try increasing this setting.

Note: The definition of support for Apriori and GRI is based on the number of records with the antecedents. This is in contrast to the CARMA and Sequence algorithms, for which the definition of support is based on the number of records with all of the items in a rule (that is, both the antecedents and the consequent). The results for association models show both the (antecedent) support and rule support measures.
Minimum rule confidence. You can specify an accuracy criterion (as a percentage) for keeping rules in the ruleset. Rules with lower confidence than the specified criterion are discarded. If you are getting too many rules or uninteresting rules, try increasing this setting. If you are getting too few rules (or no rules at all), try decreasing this setting.

Maximum number of antecedents. You can specify the maximum number of antecedents for any rule. This is a way to limit the complexity of the rules. If the rules are too complex or too specific, try decreasing this setting. This setting also has a large influence on training time. If your ruleset is taking too long to train, try reducing this setting.

Maximum number of rules. This option determines the number of rules retained in the ruleset. Rules are retained in descending order of interest (as calculated by the GRI algorithm). Note that the ruleset may contain fewer rules than the number specified, especially if you use a stringent confidence or support criterion.

Only true values for flags. If this option is selected, only true values will appear in the resulting rules. This can help make rules easier to understand.

Apriori Node
This node is available with the Association module. The Apriori node also discovers association rules in the data. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently.

Requirements. To create an Apriori ruleset, you need one or more In fields and one or more Out fields. Input and output fields (those with direction In, Out, or Both) must be symbolic. Fields with direction None are ignored. Field types must be fully instantiated before executing the node. Data can be in tabular or transactional format. For more information, see Tabular versus Transactional Data on p. 449.

Strengths. For large problems, Apriori is generally faster to train than GRI. It also has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions. Apriori offers five different training methods, allowing more flexibility in matching the data mining method to the problem at hand.

Apriori Node Model Options


This node is available with the Association module.

Figure 14-3 Apriori node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Minimum antecedent support. You can specify a support criterion for keeping rules in the ruleset. Support refers to the percentage of records in the training data for which the antecedents (the if part of the rule) are true. (Note that this definition of support differs from that used in the CARMA and Sequence nodes. For more information, see Sequence Node Model Options on p. 478.) If you are getting rules that apply to very small subsets of the data, try increasing this setting.

Note: The definition of support for Apriori and GRI is based on the number of records with the antecedents. This is in contrast to the CARMA and Sequence algorithms, for which the definition of support is based on the number of records with all the items in a rule (that is, both the antecedents and the consequent). The results for association models show both the (antecedent) support and rule support measures.

Minimum rule confidence. You can also specify a confidence criterion. Confidence is based on the records for which the rule's antecedents are true and is the percentage of those records for which the consequent(s) are also true. In other words, it's the percentage of predictions based on the rule that are correct. Rules with lower confidence than the specified criterion are discarded. If you are getting too many rules, try increasing this setting. If you are getting too few rules (or no rules at all), try decreasing this setting.

Maximum number of antecedents. You can specify the maximum number of preconditions for any rule. This is a way to limit the complexity of the rules. If the rules are too complex or too specific, try decreasing this setting. This setting also has a large influence on training time. If your ruleset is taking too long to train, try reducing this setting.


Only true values for flags. If this option is selected for data in tabular (truth table) format, then only true values will appear in the resulting rules. This can help make rules easier to understand. The option does not apply to data in transactional format. For more information, see Tabular versus Transactional Data on p. 449.

Optimize. Select options designed to increase performance during model building based on your specific needs. Select Speed to instruct the algorithm never to use disk spilling in order to improve performance. Select Memory to instruct the algorithm to use disk spilling when appropriate, at some sacrifice to speed. This option is selected by default. Note: When running in distributed mode, this setting can be overridden by administrator options specified in options.cfg. See the Clementine Server Administrator's Guide for more information.

Apriori Node Expert Options


This node is available with the Association module. For those with detailed knowledge of Apriori's operation, the following expert options allow you to fine-tune the induction process. To access expert options, set the Mode to Expert on the Expert tab.
Figure 14-4 Apriori expert options

Evaluation measure. Apriori supports five methods of evaluating potential rules.

Rule Confidence. The default method uses the confidence (or accuracy) of the rule to evaluate rules. For this measure, the Evaluation measure lower bound is disabled, since it is redundant with the Minimum rule confidence option on the Model tab. For more information, see Apriori Node Model Options on p. 452.


Confidence Difference. (Also called absolute confidence difference to prior.) This evaluation measure is the absolute difference between the rule's confidence and its prior confidence. This option prevents bias where the outcomes are not evenly distributed and helps prevent obvious rules from being kept. For example, it may be the case that 80% of customers buy your most popular product. A rule that predicts buying that popular product with 85% accuracy doesn't add much to your knowledge, even though 85% accuracy may seem quite good on an absolute scale. Set the evaluation measure lower bound to the minimum difference in confidence for which you want rules to be kept.

Confidence Ratio. (Also called difference of confidence quotient to 1.) This evaluation measure is the ratio of rule confidence to prior confidence (or, if the ratio is greater than one, its reciprocal) subtracted from 1. Like Confidence Difference, this method takes uneven distributions into account. It is especially good at finding rules that predict rare events. For example, suppose that there is a rare medical condition that occurs in only 1% of patients. A rule that is able to predict this condition 10% of the time is a great improvement over random guessing, even though on an absolute scale, 10% accuracy might not seem very impressive. Set the evaluation measure lower bound to the difference for which you want rules to be kept.

Information Difference. (Also called information difference to prior.) This measure is based on the information gain measure. If the probability of a particular consequent is considered as a logical value (a bit), then the information gain is the proportion of that bit that can be determined, based on the antecedents. The information difference is the difference between the information gain, given the antecedents, and the information gain, given only the prior confidence of the consequent. An important feature of this method is that it takes support into account, so that rules that cover more records are preferred for a given level of confidence. Set the evaluation measure lower bound to the information difference for which you want rules to be kept. Note: Because the scale for this measure is somewhat less intuitive than the other scales, you may need to experiment with different lower bounds to get a satisfactory ruleset.

Normalized Chi-square. (Also called normalized chi-squared measure.) This measure is a statistical index of association between antecedents and consequents. The measure is normalized to take values between 0 and 1. This measure is even more strongly dependent on support than the information difference measure. Set the evaluation measure lower bound to the value of the measure for which you want rules to be kept. Note: As with the information difference measure, the scale for this measure is somewhat less intuitive than the other scales, so you may need to experiment with different lower bounds to get a satisfactory ruleset.
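The two confidence-based measures can be written down directly from the descriptions above. The following Python sketch is an illustrative reading of those prose definitions, not the node's exact implementation:

def confidence_difference(rule_conf, prior_conf):
    # absolute confidence difference to prior
    return abs(rule_conf - prior_conf)

def confidence_ratio(rule_conf, prior_conf):
    # difference of confidence quotient to 1
    ratio = rule_conf / prior_conf
    if ratio > 1:
        ratio = 1 / ratio
    return 1 - ratio

# 85% confidence adds little when 80% of records already show the outcome:
print(round(confidence_difference(0.85, 0.80), 2))   # 0.05
# predicting a 1%-prevalence event with 10% confidence scores much higher:
print(round(confidence_ratio(0.10, 0.01), 2))        # 0.9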
Allow rules without antecedents. Select to allow rules that include only the consequent (item or item set). This is useful when you are interested in determining common items or item sets. For example, cannedveg is a single-item rule without an antecedent that indicates purchasing cannedveg is a common occurrence in the data. In some cases, you may want to include such rules if you are simply interested in the most confident predictions. This option is off by default. By convention, antecedent support for rules without antecedents is expressed as 100%, and rule support will be the same as confidence.


CARMA Node
This node is available with the Association module. The CARMA node uses an association rules discovery algorithm to discover association rules in the data. Association rules are statements in the form
if antecedent(s) then consequent(s)

For example, if a Web customer purchases a wireless card and a high-end wireless router, the customer is also likely to purchase a wireless music server if offered. The CARMA model extracts a set of rules from the data without requiring you to specify In (predictor) or Out (target) fields. This means that the rules generated can be used for a wider variety of applications. For example, you can use rules generated by this node to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. Using Clementine, you can determine which clients have purchased the antecedent products and construct a marketing campaign designed to promote the consequent product.
Requirements. In contrast to GRI and Apriori, the CARMA node does not require In fields or Out fields. This is integral to the way the algorithm works and is equivalent to building an Apriori model with all fields set to Both. You can constrain which items appear only as antecedents or consequents by filtering the model after it is built. For example, you can use the model browser to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. To create a CARMA ruleset, you need to specify an ID field and one or more content fields. The ID field can have any direction or type. Fields with direction None are ignored. Field types must be fully instantiated before executing the node. Like Apriori, data may be in tabular or transactional format. For more information, see Tabular versus Transactional Data on p. 449.
Strengths. The CARMA node is based on the CARMA association rules algorithm. In contrast to Apriori and GRI, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than antecedent support. CARMA also allows rules with multiple consequents. Like Apriori, models generated by a CARMA node can be inserted into a data stream to create predictions. For more information, see Overview of Generated Models in Chapter 6 on p. 237.

CARMA Node Fields Options


This node is available with the Association module. Before executing a CARMA node, you must specify ID and content fields on the Fields tab of the CARMA node. While most modeling nodes share identical Fields tab options, the CARMA node contains several unique options. All options are discussed below.

Figure 14-5 CARMA node fields options

Use Type node settings. This option tells the node to use field information from an upstream Type node. This is the default.

Use custom settings. This option tells the node to use field information specified here instead of that given in any upstream Type node(s). After selecting this option, specify fields below according to whether you are reading data in transactional or tabular format. For more information, see Tabular versus Transactional Data on p. 449.

Use transactional format. This option changes the field controls at the bottom of this dialog box depending on whether your data are in transactional or tabular format. If you use multiple fields with transactional data, the items specified in these fields for a particular record are assumed to represent items found in a single transaction with a single timestamp. For more information, see Tabular versus Transactional Data on p. 449. CARMA can handle data in either format, as discussed below.

ID field. For transactional data, select an ID field from the list. Numeric or symbolic fields can be used as the ID field. Each unique value of this field should indicate a specific unit of analysis. For example, in a market basket application, each ID might represent a single customer. For a Web log analysis application, each ID might represent a computer (by IP address) or a user (by login data).

IDs are contiguous. If your data are presorted so that all records with the same ID appear together in the data stream, select this option to speed up processing. If your data are not presorted (or you are not sure), leave this option unselected and the CARMA node will sort the data automatically. Note: If your data are not sorted and you select this option, you may get invalid results in your model.


Content field(s). Specify the content field(s) for the model. These fields contain the items of interest in association modeling. You can specify multiple flag fields (if data are in tabular format) or a single set field (if data are in transactional format).

CARMA Node Model Options


This node is available with the Association module.
Figure 14-6 CARMA node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Minimum rule support (%). You can specify a support criterion. Rule support refers to the proportion of IDs in the training data that contain the entire rule. (Note that this definition of support differs from the antecedent support used in the GRI and Apriori nodes.) If you want to focus on more common rules, increase this setting.

Minimum rule confidence (%). You can specify a confidence criterion for keeping rules in the ruleset. Confidence refers to the percentage of IDs where a correct prediction is made (out of all IDs for which the rule makes a prediction). It is calculated as the number of IDs for which the entire rule is found divided by the number of IDs for which the antecedents are found, based on the training data. Rules with lower confidence than the specified criterion are discarded. If you are getting uninteresting or too many rules, try increasing this setting. If you are getting too few rules, try decreasing this setting.
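Below is a minimal Python sketch of these two ID-based measures, assuming a simple (customer, item) transactional layout; it is illustrative only and not the CARMA implementation.

transactions = [
    ("Fred", "bread"), ("Fred", "cheese"), ("Fred", "wine"),
    ("Wilma", "bread"), ("Wilma", "cheese"),
    ("Barney", "wine"),
]

baskets = {}
for customer, item in transactions:
    baskets.setdefault(customer, set()).add(item)

antecedents, consequent = {"bread", "cheese"}, "wine"
ids_with_antecedents = [b for b in baskets.values() if antecedents <= b]
ids_with_rule = [b for b in ids_with_antecedents if consequent in b]

rule_support = len(ids_with_rule) / len(baskets)             # proportion of all IDs containing the entire rule
confidence = len(ids_with_rule) / len(ids_with_antecedents)  # entire rule found / antecedents found
print(rule_support, confidence)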


Maximum rule size. You can set the maximum number of distinct item sets (as opposed to items) in a rule. If the rules of interest are relatively short, you can decrease this setting to speed up building the ruleset.

CARMA Node Expert Options


This node is available with the Association module. For those with detailed knowledge of the CARMA node's operation, the following expert options allow you to fine-tune the model-building process. To access expert options, set the mode to Expert on the Expert tab.
Figure 14-7 CARMA node expert options

Exclude rules with multiple consequents. Select to exclude two-headed consequents, that is, consequents that contain two items. For example, the rule bread & cheese & fish → wine & fruit contains a two-headed consequent, wine & fruit. By default, such rules are included.

Set pruning value. To conserve memory, the CARMA algorithm periodically removes (prunes) infrequent item sets from its list of potential item sets during processing. Select this option to adjust the frequency of pruning; the number you specify determines how often pruning occurs. Enter a smaller value to decrease the memory requirements of the algorithm (but potentially increase the training time required), or enter a larger value to speed up training (but potentially increase memory requirements). The default value is 500.


Vary support. Select to increase efficiency by excluding infrequent item sets that appear to be frequent when they appear unevenly. This is achieved by starting with a higher support level and tapering it down to the level specified on the Model tab. Enter a value for Estimated number of transactions to specify how quickly the support level should be tapered.

Allow rules without antecedents. Select to allow rules that include only the consequent (item or item set). This is useful when you are interested in determining common items or item sets. For example, cannedveg is a single-item rule without an antecedent that indicates purchasing cannedveg is a common occurrence in the data. In some cases, you may want to include such rules if you are simply interested in the most confident predictions. This option is unselected by default.

Generated Association Rule Models


Association rule models represent the rules discovered by one of the following association rule modeling nodes: Apriori, GRI, or CARMA. The generated models contain information about the rules extracted from the data during model building.
Viewing Results

You can browse the rules generated by association models (GRI, Apriori, and CARMA) and Sequence models using the Model tab on the dialog box. Browsing a model shows you the information about the rules and provides options for filtering and sorting results before generating new nodes or scoring the model. In this release, you also have the ability to view the results of association models using the Intelligent Miner Visualizer, a Java-based results browser from IBM. If you have previously purchased and installed Intelligent Miner Visualizer, you can launch it directly from Clementine to view generated models. For more information, see Visualizing Association Rule Models with IBM Tools on p. 465.
Scoring the Model

Refined models (Apriori, CARMA, and Sequence) may be added to a stream and used for scoring. For more information, see Using Generated Models in Streams in Chapter 6 on p. 241. Models used for scoring include an extra Settings tab on their respective dialog boxes. For more information, see Association Rule Model Settings on p. 470. The unrefined GRI model cannot be used for scoring in its raw format. Instead, you can generate a ruleset and use the ruleset for scoring. For more information, see Generating a Ruleset from an Association Model on p. 468.


Association Rule Model Details


On the Model tab of a generated Association Rule model, you can see a table containing the rules extracted by the algorithm. Each row in the table represents a rule. The first column represents the consequents (the then part of the rule), while the next column represents the antecedents (the if part of the rule). Subsequent columns contain rule information, such as confidence, support, and lift.
Figure 14-8 Sample Association Rule node Model tab

Association rules are often shown in the following format:


Consequent        Antecedent
Drug = drugY      Sex = F
                  BP = HIGH

The example rule is interpreted as if Sex = F and BP = HIGH, then Drug is likely to be drugY; or to phrase it another way, for records where Sex = F and BP = HIGH, Drug is likely to be drugY. Using the dialog box toolbar, you can choose to display additional information, such as confidence, support, and instances.
Show/Hide menu. The Show/Hide menu (percentage toolbar button) controls options for the

display of rules.
Figure 14-9 Show/Hide button


The following display options are available:

Rule ID displays the rule ID assigned during model building. A rule ID enables you to identify which rules are being applied for a given prediction. Rule IDs also allow you to merge additional rule information, such as deployability, product information, or antecedents, at a later time.

Instances displays information about the number of unique IDs to which the rule applies, that is, for which the antecedents are true. For example, given the rule bread → cheese, the number of records in the training data that include the antecedent bread are referred to as instances.

Support displays antecedent support, that is, the proportion of IDs for which the antecedents are true, based on the training data. For example, if 50% of the training data includes the purchase of bread, then the rule bread → cheese will have an antecedent support of 50%. Note: Support as defined here is the same as the instances but is represented as a percentage.

Confidence displays the ratio of rule support to antecedent support. This indicates the proportion of IDs with the specified antecedent(s) for which the consequent(s) is/are also true. For example, if 50% of the training data contains bread (indicating antecedent support) but only 20% contains both bread and cheese (indicating rule support), then confidence for the rule bread → cheese would be Rule Support / Antecedent Support or, in this case, 40%.

Rule Support displays the proportion of IDs for which the entire rule, antecedents and consequent(s), is true. For example, if 20% of the training data contains both bread and cheese, then rule support for the rule bread → cheese is 20%.

Lift displays the ratio of confidence for the rule to the prior probability of having the consequent. For example, if 10% of the entire population buys bread, then a rule that predicts whether people will buy bread with 20% confidence will have a lift of 20/10 = 2. If another rule tells you that people will buy bread with 11% confidence, then the rule has a lift of close to 1, meaning that having the antecedent(s) does not make a lot of difference in the probability of having the consequent. In general, rules with lift different from 1 will be more interesting than rules with lift close to 1.

Deployability is a measure of what percentage of the training data satisfies the conditions of the antecedent but does not satisfy the consequent. In product purchase terms, it basically means what percentage of the total customer base owns (or has purchased) the antecedent(s) but has not yet purchased the consequent. The deployability statistic is defined as (Antecedent Support in # of Records - Rule Support in # of Records) / Number of Records, where Antecedent Support means the number of records for which the antecedents are true and Rule Support means the number of records for which both antecedents and the consequent are true.
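The worked figures above can be tied together in a short Python sketch; the record counts and the 10% prior for the consequent are assumed values, used only to reproduce the arithmetic described in the text rather than to mirror Clementine's internals.

n_records = 1000
n_antecedent = 500       # records containing bread (antecedent support of 50%)
n_rule = 200             # records containing both bread and cheese (rule support of 20%)
prior_consequent = 0.10  # assumed share of all records containing the consequent

instances = n_antecedent
antecedent_support = n_antecedent / n_records          # 0.50
rule_support = n_rule / n_records                      # 0.20
confidence = rule_support / antecedent_support         # 0.40
lift = confidence / prior_consequent                   # about 4.0
deployability = (n_antecedent - n_rule) / n_records    # 0.30

print(instances, antecedent_support, rule_support,
      round(confidence, 2), round(lift, 2), deployability)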
Sort menu. The Sort menu button on the toolbar controls the sorting of rules. Direction of sorting

(ascending or descending) can be changed using the sort direction button (up or down arrow).
Figure 14-10 Toolbar options for sorting

You can sort rules by:
Support
Confidence
Rule Support
Consequent
Lift
Deployability


Filter button. The Filter button (funnel icon) on the menu expands the bottom of the dialog box to

show a panel where active rule filters are displayed. Filters are used to narrow the number of rules displayed on the Models tab.
Figure 14-11 Filter button

To create a filter, click the Filter icon to the right of the expanded panel. This opens a separate dialog box in which you can specify constraints for displaying rules. Note that the Filter button is often used in conjunction with the Generate menu to first filter rules and then generate a model containing that subset of rules. For more information, see Specifying Filters for Rules below.
Find Rule button. The Find Rule button (binoculars icon) enables you to search the rules shown for

a specified rule ID. The adjacent display box indicates the number of rules currently displayed out of the number available. Rule IDs are assigned by the model in the order of discovery at the time and are added to the data during scoring.
Figure 14-12 Find Rule button

To reorder rule IDs:


E You can rearrange rule IDs in Clementine by first sorting the rule display table according to the desired measurement, such as confidence or lift.

E Then, using options from the Generate menu, create a filtered model.

E In the Filtered Model dialog box, select Renumber rules consecutively starting with, and specify a start number. For more information, see Generating a Filtered Model on p. 469.

Specifying Filters for Rules


By default, rule algorithms, such as Apriori, CARMA, and Sequence, may generate a large and cumbersome number of rules. To enhance clarity when browsing or to streamline rule scoring, you should consider filtering rules so that consequents and antecedents of interest are more prominently displayed. Using the filtering options on the Model tab of a rule browser, you can open a dialog box for specifying filter qualifications.

Figure 14-13 Rules browser filter dialog box

Consequents. Select Enable Filter to activate options for filtering rules based on the inclusion or exclusion of specified consequents. Select Includes any of to create a filter where rules contain at least one of the specified consequents. Alternatively, select Excludes to create a filter excluding specified consequents. You can select consequents using the picker icon to the right of the list box. This opens a dialog box listing all consequents present in the generated rules. Note: Consequents may contain more than one item. Filters will check only that a consequent contains one of the items specified.

Antecedents. Select Enable Filter to activate options for filtering rules based on the inclusion or exclusion of specified antecedents. You can select items using the picker icon to the right of the list box. This opens a dialog box listing all antecedents present in the generated rules. Select Includes all of to set the filter as an inclusionary one where all antecedents specified must be included in a rule. Select Includes any of to create a filter where rules contain at least one of the specified antecedents. Select Excludes to create a filter excluding rules that contain a specified antecedent.

Confidence. Select Enable Filter to activate options for filtering rules based on the level of confidence for a rule. You can use the Min and Max controls to specify a confidence range. When you are browsing generated models, confidence is listed as a percentage. When you are scoring output, confidence is expressed as a number between 0 and 1.


Antecedent Support. Select Enable Filter to activate options for filtering rules based on the level of antecedent support for a rule. Antecedent support indicates the proportion of training data that contains the same antecedents as the current rule, making it analogous to a popularity index. You can use the Min and Max controls to specify a range used to filter rules based on support level.

Lift. Select Enable Filter to activate options for filtering rules based on the lift measurement for a rule. Note: Lift filtering is available only for association models built after release 8.5 or for earlier models that contain a lift measurement. Sequence models do not contain this option. Click OK to apply all filters that have been enabled in this dialog box.

Visualizing Association Rule Models with IBM Tools


From Clementine, you now have the ability to access the DB2 Intelligent Miner Visualization tool for browsing association rule models. Using options in Clementine, you can directly launch DB2 Intelligent Miner Visualization, a Java-based results browser, to explore models created in Clementine. To enable this functionality, you first need to provide some specifications in the Helper Applications dialog box.
E From the menus choose:
Tools
Helper Applications
E Click the IBM tab.
Figure 14-14 Options for enabling DB2 Intelligent Miner Visualization

E Select Enable launch of IBM DB2 Intelligent Miner Visualization for Association Models.

E Specify the location of the executable file, typically called imvisualizerw.exe.

Launching DB2 Intelligent Miner Visualization


Once you have specified the location of DB2 Intelligent Miner Visualization and enabled this option in Clementine, you can access the visualization browser from any association model in Clementine. Note: This feature requires that IBM DB2 Intelligent Miner Visualization be installed on the Clementine client host. This functionality has been tested with version 8.1 of Intelligent Miner. To launch DB2 Intelligent Miner Visualization for a Clementine model:
E From the Models tab at the top of the Clementine window, right-click an association model,

such as Apriori, GRI, or CARMA.


E From the context menu, choose DB2 IM Visualization. Figure 14-15 Launching DB2 IM Visualization for a Clementine association model

This launches the executable le and displays the current model using the Association Visualizer from DB2 Intelligent Miner.

467 Association Rules Figure 14-16 Apriori model built in Clementine and displayed using Association Visualizer

For information on using DB2 Intelligent Miner Visualization, see the product documentation.

Association Rule Model Summary


The Summary tab of an association rule model displays the type of model, such as Apriori or CARMA, along with the number of rules discovered and the minimum and maximum for support, lift, and confidence of rules in the ruleset.

Figure 14-17 Sample Unrefined Rule node Summary tab

Generating a Ruleset from an Association Model


Figure 14-18 Generate Ruleset dialog box

Association models, such as Apriori and CARMA, can be used to score data directly, or you can first generate a subset of rules, known as a ruleset. Rulesets are particularly useful when you are working with the GRI unrefined model, which cannot be used directly for scoring. For more information, see Unrefined Models in Chapter 6 on p. 246.


To generate a ruleset, choose Rule set from the Generate menu in the generated model browser. You can specify the following options for translating the rules into a ruleset:

Rule set name. Allows you to specify the name of the new generated Ruleset node.

Create node on. Controls the location of the new generated Ruleset node. Select Canvas, GM Palette, or Both.

Target field. Determines which output field will be used for the generated Ruleset node. Select a single output field from the list.

Minimum support. Specify the minimum support for rules to be preserved in the generated ruleset. Rules with support less than the specified value will not appear in the new ruleset.

Minimum confidence. Specify the minimum confidence for rules to be preserved in the generated ruleset. Rules with confidence less than the specified value will not appear in the new ruleset.

Default value. Allows you to specify a default value for the target field that is assigned to scored records for which no rule fires.

Generating a Filtered Model


Figure 14-19 Generate New Model dialog box

To generate a filtered model from an association model, such as an Apriori, CARMA, or Sequence Ruleset node, choose Filtered Model from the Generate menu in the generated model browser. This creates a subset model that includes only those rules currently displayed in the browser. Note: You cannot generate filtered models for GRI models (unrefined models). You can specify the following options for filtering rules:

Name for New Model. Allows you to specify the name of the new Filtered Model node.

Create node on. Controls the location of the new Filtered Model node. Select Canvas, GM Palette, or Both.

Rule numbering. Specify how rule IDs will be numbered in the subset of rules included in the filtered model.


Retain original rule ID numbers. Select to maintain the original numbering of rules. By default, rules are given an ID that corresponds with their order of discovery by the algorithm. That order may vary depending on the algorithm employed.

Renumber rules consecutively starting with. Select to assign new rule IDs for the filtered rules. New IDs are assigned based on the sort order displayed in the rule browser table on the Model tab, beginning with the number you specify here. You can specify the start number for IDs using the arrows to the right.

Association Rule Model Settings


This Settings tab is used to specify scoring options for association models (Apriori and CARMA). This tab is available only after the generated model has been added to a stream for purposes of scoring. Note: The dialog box for browsing a GRI unrefined model does not include the Settings tab, since it cannot be scored. To score the GRI unrefined model, you must first generate a ruleset. For more information, see Generating a Ruleset from an Association Model on p. 468.
Figure 14-20 Sample Association model Settings tab

Maximum number of predictions. Specify the maximum number of predictions included for each set of basket items. This option is used in conjunction with Rule Criterion below to produce the top predictions, where top indicates the highest level of confidence, support, lift, and so on, as specified below.


Rule Criterion. Select the measure used to determine the strength of rules. Rules are sorted by the strength of the criterion selected here in order to return the top predictions for an item set. Available criteria are:
Confidence
Support
Rule support (Support * Confidence)
Lift
Deployability

Allow repeat predictions. Select to include multiple rules with the same consequent when scoring. For example, selecting this option allows the following rules to be scored:

bread & cheese → wine
cheese & fruit → wine

Turn off this option to exclude repeat predictions when scoring. Note: Rules with multiple consequents (bread & cheese & fruit → wine & pate) are considered repeat predictions only if all consequents (wine & pate) have been predicted before.

Ignore unmatched basket items. Select to ignore the presence of additional items in the item set. For example, when this option is selected for a basket that contains [tent & sleeping bag & kettle], the rule tent & sleeping bag → gas_stove will apply despite the extra item (kettle) present in the basket.

There may be some circumstances where extra items should be excluded. For example, it is likely that someone who purchases a tent, sleeping bag, and kettle may already have a gas stove, indicated by the presence of the kettle. In other words, a gas stove may not be the best prediction. In such cases, you should deselect Ignore unmatched basket items to ensure that rule antecedents exactly match the contents of a basket. By default, unmatched items are ignored.

Check that predictions are not in basket. Select to ensure that consequents are not also present in the basket. For example, if the purpose of scoring is to make a home furniture product recommendation, then it is unlikely that a basket that already contains a dining room table will be likely to purchase another one. In such a case, you should select this option. On the other hand, if products are perishable or disposable (such as cheese, baby formula, or tissue), then rules where the consequent is already present in the basket may be of value. In the latter case, the most useful option might be Do not check basket for predictions below.

Check that predictions are in basket. Select this option to ensure that consequents are also present in the basket. This approach is useful when you are attempting to gain insight into existing customers or transactions. For example, you may want to identify rules with the highest lift and then explore which customers fit these rules.

Do not check basket for predictions. Select to include all rules when scoring, regardless of the presence or absence of consequents in the basket.
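Below is a minimal Python sketch of how these basket-matching choices interact when a single rule is applied to a basket; the function and option names are hypothetical simplifications of the settings described above, not Clementine code.

def rule_applies(antecedents, consequent, basket,
                 ignore_unmatched=True, prediction_not_in_basket=True):
    if not set(antecedents) <= basket:
        return False                    # all antecedents must be present
    if not ignore_unmatched and basket - set(antecedents) - {consequent}:
        return False                    # extra basket items disqualify the rule
    if prediction_not_in_basket and consequent in basket:
        return False                    # do not predict an item already in the basket
    return True

basket = {"tent", "sleeping bag", "kettle"}
print(rule_applies(["tent", "sleeping bag"], "gas_stove", basket))          # True
print(rule_applies(["tent", "sleeping bag"], "gas_stove", basket,
                   ignore_unmatched=False))                                 # False: kettle is unmatched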


Scoring Association Rules


Scores produced by running new data through an association rule model are returned in separate fields. Three new fields are added for each prediction, with P representing the prediction, C representing confidence, and I representing the rule ID. The organization of these output fields depends on whether the input data are in transactional or tabular format. See Tabular versus Transactional Data on p. 449 for an overview of these formats. For example, suppose you are scoring basket data using a model that generates predictions based on the following three rules:
Rule_15  bread & wine → meat (confidence 54%)
Rule_22  cheese → fruit (confidence 43%)
Rule_5   bread & cheese → frozveg (confidence 24%)

Tabular data. For tabular data, the three predictions (3 is the default) are returned in a single record.
Table 14-1 Scores in tabular format

ID    Bread  Wine  Cheese  P1    C1    I1  P2     C2    I2  P3       C3    I3
Fred  1      1     1       meat  0.54  15  fruit  0.43  22  frozveg  0.24  5

Transactional data. For transactional data, a separate record is generated for each prediction.

Predictions are still added in separate columns, but scores are returned as they are calculated. This results in records with incomplete predictions, as shown in the sample output below. The second and third predictions (P2 and P3) are blank in the first record, along with the associated confidences and rule IDs. As scores are returned, however, the final record contains all three predictions.
Table 14-2 Scores in transactional format

ID    Item    P1    C1    I1  P2      C2      I2      P3       C3      I3
Fred  bread   meat  0.54  14  $null$  $null$  $null$  $null$   $null$  $null$
Fred  wine    meat  0.54  14  fruit   0.43    22      $null$   $null$  $null$
Fred  cheese  meat  0.54  14  fruit   0.43    22      frozveg  0.24    5

To include only complete predictions for reporting or deployment purposes, use a Select node to select complete records. Note: The field names used in these examples are abbreviated for clarity. During actual use, results fields for association models are named as follows:

New field                          Example field name
Prediction                         $A-TRANSACTION_NUMBER-1
Confidence (or other criterion)    $AC-TRANSACTION_NUMBER-1
Rule ID                            $A-Rule_ID-1
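As a rough illustration of how the three columns per prediction come about, the following Python sketch scores one basket against the three example rules used earlier and emits prediction, confidence, and rule ID fields; the output names follow the pattern above but use a hypothetical content field called basket, and the logic is illustrative rather than Clementine's implementation.

rules = [
    {"id": 15, "antecedents": {"bread", "wine"},   "consequent": "meat",    "confidence": 0.54},
    {"id": 22, "antecedents": {"cheese"},          "consequent": "fruit",   "confidence": 0.43},
    {"id": 5,  "antecedents": {"bread", "cheese"}, "consequent": "frozveg", "confidence": 0.24},
]

def score(basket, max_predictions=3):
    hits = [r for r in rules if r["antecedents"] <= basket]
    hits.sort(key=lambda r: r["confidence"], reverse=True)   # rule criterion: confidence
    row = {}
    for i, rule in enumerate(hits[:max_predictions], start=1):
        row[f"$A-basket-{i}"] = rule["consequent"]
        row[f"$AC-basket-{i}"] = rule["confidence"]
        row[f"$A-Rule_ID-{i}"] = rule["id"]
    return row

print(score({"bread", "wine", "cheese"}))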


Rules with Multiple Consequents

The CARMA algorithm allows rules with multiple consequents, for example:

bread → wine & cheese

When you are scoring such two-headed rules, predictions are returned in the format displayed in the following table:
Table 14-3 Scoring results including a prediction with multiple consequents

ID    Bread  Wine  Cheese  P1        C1    I1  P2     C2    I2  P3       C3    I3
Fred  1      1     1       meat&veg  0.54  16  fruit  0.43  22  frozveg  0.24  5

In some cases, you may need to split such scores before deployment. To split a prediction with multiple consequents, you will need to parse the field using the CLEM string functions. For more information, see String Functions in Chapter 8 in the Clementine 11.1 User's Guide.
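Within a stream, the parsing is done with CLEM string functions, as noted above; purely as an illustration, the same split in Python is a one-liner:

prediction = "wine&cheese"
consequents = prediction.split("&")   # ['wine', 'cheese'], one entry per consequent
print(consequents)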

Deploying Association Models


When scoring association models, predictions and confidences are output in separate columns (where P represents the prediction, C represents confidence, and I represents the rule ID). This is the case whether the input data are tabular or transactional. For more information, see Scoring Association Rules on p. 472.
Figure 14-21 Tabular scores with predictions in columns

When preparing scores for deployment, you might find that your application requires you to transpose your output data to a format with predictions in rows rather than columns (one prediction per row, sometimes known as till-roll format).

474 Chapter 14 Figure 14-22 Transposed scores with predictions in rows

Transposing Tabular Scores

You can transpose tabular scores from columns to rows using a combination of steps in Clementine, as described in the steps that follow.
Figure 14-23 Example stream used to transpose tabular data into till-roll format

E Use the @INDEX function in a Derive node to ascertain the current order of predictions and save this indicator in a new field, such as Original_order.

E Add a Type node to ensure that all fields are instantiated.

E Use a Filter node to rename the default prediction, confidence, and ID fields (P1, C1, I1) to common fields, such as Pred, Crit, and Rule_ID, which will be used to append records later on. You will need one Filter node for each prediction generated.

Figure 14-24 Filtering fields for predictions 1 and 3 while renaming fields for prediction 2.

E Use an Append node to append values for the shared Pred, Crit, and Rule_ID fields.

E Attach a Sort node to sort records in ascending order for the field Original_order and in descending order for Crit, which is the field used to sort predictions by criteria such as confidence, lift, and support.

E Use another Filter node to filter the field Original_order from the output.

At this point, the data are ready for deployment.
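Outside Clementine, the same reshaping can be sketched in a few lines of pandas; the common field names (Pred, Crit, Rule_ID) mirror the renaming step above, and the single input record is hypothetical.

import pandas as pd

scores = pd.DataFrame([{
    "ID": "Fred",
    "P1": "meat", "C1": 0.54, "I1": 15,
    "P2": "fruit", "C2": 0.43, "I2": 22,
    "P3": "frozveg", "C3": 0.24, "I3": 5,
}])

parts = []
for i in (1, 2, 3):
    part = scores[["ID", f"P{i}", f"C{i}", f"I{i}"]]
    part.columns = ["ID", "Pred", "Crit", "Rule_ID"]   # common names, as in the Filter nodes
    parts.append(part)

till_roll = (pd.concat(parts)
             .sort_values(["ID", "Crit"], ascending=[True, False])  # record order, then descending criterion
             .reset_index(drop=True))
print(till_roll)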


Transposing Transactional Scores

The process is similar for transposing transactional scores. For example, the stream shown below transposes scores to a format with a single prediction in each row as needed for deployment.
Figure 14-25 Example stream used to transpose transactional data into till-roll format


With the addition of two Select nodes, the process is identical to that explained above for tabular data. The first Select node is used to compare rule IDs across adjacent records and include only unique or undefined records. This Select node uses the following CLEM expression to select records: ID /= @OFFSET(ID,-1) or @OFFSET(ID,-1) = undef. The second Select node is used to discard extraneous rules, or rules where Rule_ID has a null value. This Select node uses the following CLEM expression to discard records: not(@NULL(Rule_ID)). For more information on transposing scores for deployment, contact SPSS Technical Support.
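A rough Python analogue of the two Select nodes, assuming records arrive already sorted as in the stream above; the CLEM expressions quoted in the text remain the authoritative version, and this sketch is only an approximation of their effect.

records = [
    {"ID": "Fred", "Rule_ID": 14},
    {"ID": "Fred", "Rule_ID": 14},   # duplicate of the previous rule ID: dropped
    {"ID": "Fred", "Rule_ID": 22},
    {"ID": "Fred", "Rule_ID": None}, # null rule ID: dropped by the second filter
]

kept, previous = [], object()        # sentinel plays the role of undef for the first record
for record in records:
    if record["Rule_ID"] != previous:            # analogue of the @OFFSET comparison
        kept.append(record)
    previous = record["Rule_ID"]

kept = [r for r in kept if r["Rule_ID"] is not None]   # analogue of not(@NULL(Rule_ID))
print(kept)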

Sequence Node
This node is available with the Association module. The Sequence node discovers patterns in sequential or time-oriented data, in the format bread → cheese. The elements of a sequence are item sets that constitute a single transaction. For example, if a person goes to the store and purchases bread and milk and then a few days later returns to the store and purchases some cheese, that person's buying activity can be represented as two item sets. The first item set contains bread and milk, and the second one contains cheese. A sequence is a list of item sets that tend to occur in a predictable order. The Sequence node detects frequent sequences and creates a generated model node that can be used to make predictions.
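The idea of a sequence as an ordered list of item sets can be made concrete with a small Python sketch; the (customer, time, item) layout is an assumed transactional format used for illustration, not a Clementine data structure.

from collections import defaultdict

records = [
    ("Fred", 1, "bread"), ("Fred", 1, "milk"),   # first visit: one item set
    ("Fred", 5, "cheese"),                       # later visit: a second item set
]

visits = defaultdict(lambda: defaultdict(set))
for customer, time, item in records:
    visits[customer][time].add(item)

sequences = {
    customer: [items for _, items in sorted(timeline.items())]
    for customer, timeline in visits.items()
}
print(sequences)   # Fred's sequence: two item sets, {bread, milk} followed by {cheese}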
Requirements. To create a Sequence ruleset, you need to specify an ID field, an optional time field, and one or more content fields. Note that these settings must be made on the Fields tab of the Modeling node; they cannot be read from an upstream Type node. The ID field can have any direction or type. If you specify a time field, it can have any direction but must be numeric, date, time, or timestamp. If you do not specify a time field, the Sequence node will use an implied timestamp, in effect using row numbers as time values. Content fields can have any type and direction, but all content fields must be of the same type. If they are numeric, they must be integer ranges (not real ranges).

Strengths. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. In addition, the generated model node created by a Sequence node can be inserted into a data stream to create predictions. The generated model node can also generate SuperNodes for detecting and counting specific sequences and for making predictions based on specific sequences.

Sequence Node Fields Options


This node is available with the Association module.

Figure 14-26 Sequence node fields options

Before executing a Sequence node, you must specify ID and content fields on the Fields tab of the Sequence node. If you want to use a time field, you also need to specify that here.

ID field. Select an ID field from the list. Numeric or symbolic fields can be used as the ID field. Each unique value of this field should indicate a specific unit of analysis. For example, in a market basket application, each ID might represent a single customer. For a Web log analysis application, each ID might represent a computer (by IP address) or a user (by login data).

IDs are contiguous. If your data are presorted so that all records with the same ID appear together in the data stream, select this option to speed up processing. If your data are not presorted (or you are not sure), leave this option unselected, and the Sequence node will sort the data automatically. Note: If your data are not sorted and you select this option, you may get invalid results in your Sequence model.

Time field. If you want to use a field in the data to indicate event times, select Use time field and specify the field to be used. The time field must be numeric, date, time, or timestamp. If no time field is specified, records are assumed to arrive from the data source in sequential order, and record numbers are used as time values (the first record occurs at time "1"; the second, at time "2"; and so on).

Content fields. Specify the content field(s) for the model. These fields contain the events of interest in sequence modeling.


Tabular versus transactional data. The Sequence node can handle data in either tabular or transactional format. If you use multiple fields with transactional data, the items specified in these fields for a particular record are assumed to represent items found in a single transaction with a single timestamp. For more information, see Tabular versus Transactional Data on p. 449.

Sequence Node Model Options


This node is available with the Association module.
Figure 14-27 Sequence node model options

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.
Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Minimum rule support (%). You can specify a support criterion. Rule support refers to the proportion of IDs in the training data that contain the entire sequence. If you want to focus on more common sequences, increase this setting.
Minimum rule confidence (%). You can specify a confidence criterion for keeping sequences in the sequence set. Confidence refers to the percentage of the IDs where a correct prediction is made, out of all the IDs for which the rule makes a prediction. It is calculated as the number of IDs for which the entire sequence is found divided by the number of IDs for which the antecedents are found, based on the training data. Sequences with lower confidence than the specified criterion are discarded. If you are getting too many sequences or uninteresting sequences, try increasing this setting. If you are getting too few sequences, try decreasing this setting. A minimal illustration of how support and confidence are calculated appears at the end of this section.


Maximum sequence size. You can set the maximum number of distinct item sets (as opposed to items) in a sequence. If the sequences of interest are relatively short, you can decrease this setting to speed up building the sequence set.

Predictions to add to stream. Specify the number of predictions to be added to the stream by the resulting generated Model node. For more information, see Generated Sequence Rule Models on p. 481.
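
The following is a minimal Python sketch, not Clementine's CARMA implementation, of how rule support and confidence are defined for a single candidate sequence rule (here, bread → cheese); the sample data and helper function are purely illustrative.

    # Transactions per ID, in time order (hypothetical data for illustration).
    sequences_by_id = {
        1001: [{"bread", "milk"}, {"cheese"}],
        1002: [{"bread"}],
        1003: [{"bread", "milk"}, {"jam"}, {"cheese"}],
        1004: [{"milk"}],
    }

    def contains_in_order(transactions, itemsets):
        """True if each item set is found, in order, in successive transactions."""
        pos = 0
        for t in transactions:
            if pos < len(itemsets) and itemsets[pos] <= t:
                pos += 1
        return pos == len(itemsets)

    antecedent = [{"bread"}]                 # the "if" part of bread => cheese
    whole_rule = [{"bread"}, {"cheese"}]     # antecedent followed by consequent

    n_ids = len(sequences_by_id)
    antecedent_ids = sum(contains_in_order(t, antecedent) for t in sequences_by_id.values())
    rule_ids = sum(contains_in_order(t, whole_rule) for t in sequences_by_id.values())

    rule_support = rule_ids / n_ids          # IDs containing the entire sequence: 2/4 = 50%
    confidence = rule_ids / antecedent_ids   # IDs with the whole sequence / IDs with the antecedents: 2/3
    print(rule_support, confidence)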

Sequence Node Expert Options


This node is available with the Association module. For those with detailed knowledge of the Sequence node's operation, the following expert options allow you to fine-tune the model-building process. To access expert options, set the Mode to Expert on the Expert tab.
Figure 14-28 Sequence node expert options

Set maximum duration. If this option is selected, sequences will be limited to those with a duration (the time between the first and last item set) less than or equal to the value specified. If you haven't specified a time field, the duration is expressed in terms of rows (records) in the raw data. If the time field used is a time, date, or timestamp field, the duration is expressed in seconds. For numeric fields, the duration is expressed in the same units as the field itself.
Set pruning value. The CARMA algorithm used in the Sequence node periodically removes (prunes) infrequent item sets from its list of potential item sets during processing to conserve memory. Select this option to adjust the frequency of pruning. The number specified determines the frequency of pruning. Enter a smaller value to decrease the memory requirements of the algorithm (but potentially increase the training time required), or enter a larger value to speed up training (but potentially increase memory requirements).


Set maximum sequences in memory. If this option is selected, the CARMA algorithm will limit its memory store of candidate sequences during model building to the number of sequences specified. Select this option if Clementine is using too much memory during the building of Sequence models. Note that the maximum sequences value you specify here is the number of candidate sequences tracked internally as the model is built. This number should be much larger than the number of sequences you expect in the final model.

Constrain gaps between item sets. This option allows you to specify constraints on the time gaps that separate item sets. If selected, item sets with time gaps smaller than the Minimum gap or larger than the Maximum gap that you specify will not be considered to form part of a sequence. Use this option to avoid counting sequences that include long time intervals or those that take place in a very short time span. Note: If the time field used is a time, date, or timestamp field, the time gap is expressed in seconds. For numeric fields, the time gap is expressed in the same units as the time field. For example, consider this list of transactions:
ID      Time    Content
1001    1       apples
1001    2       bread
1001    5       cheese
1001    6       dressing

If you build a model on these data with the minimum gap set to 2, you would get the following sequences:

apples → cheese
apples → dressing
bread → cheese
bread → dressing

You would not see sequences such as apples → bread because the gap between apples and bread is smaller than the minimum gap. Similarly, if the data were instead:
ID      Time    Content
1001    1       apples
1001    2       bread
1001    5       cheese
1001    20      dressing

and the maximum gap were set to 10, you would not see any sequences with dressing, because the gap between cheese and dressing is too large for them to be considered part of the same sequence.
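
Below is a minimal Python sketch, offered only as an illustration of the gap constraint described above (it is not the CARMA algorithm); the transaction list mirrors the first example table.

    # (time, item) pairs for ID 1001, in time order.
    transactions = [(1, "apples"), (2, "bread"), (5, "cheese"), (6, "dressing")]
    min_gap, max_gap = 2, 10

    pairs = []
    for i, (t1, item1) in enumerate(transactions):
        for t2, item2 in transactions[i + 1:]:
            gap = t2 - t1
            if min_gap <= gap <= max_gap:      # gaps outside the limits are not counted
                pairs.append((item1, item2))

    print(pairs)
    # [('apples', 'cheese'), ('apples', 'dressing'), ('bread', 'cheese'), ('bread', 'dressing')]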


Generated Sequence Rule Models


Generated Sequence Rule models represent the sequences found for a particular output field discovered by the Sequence node and can be added to streams to generate predictions. When you execute a stream containing a Sequence Rules node, the Sequence Rules node adds a pair of fields containing predictions and associated confidence values for each prediction from the sequence model to the data. By default, three pairs of fields containing the top three predictions (and their associated confidence values) are added. You can change the number of predictions generated when you build the model by setting the Sequence node model options at build time, as well as on the Settings tab after adding the model to a stream. For more information, see Sequence Rule Model Settings on p. 485.

The new field names are derived from the model name. The field names are $S-sequence-n for the prediction field (where n indicates the nth prediction) and $SC-sequence-n for the confidence field. In a stream with multiple Sequence Rules nodes in a series, the new field names will include numbers in the prefix to distinguish them from each other. The first Sequence Set node in the stream will use the usual names, the second node will use names starting with $S1- and $SC1-, the third node will use names starting with $S2- and $SC2-, and so on. Predictions appear in order by confidence, so that $S-sequence-1 contains the prediction with the highest confidence, $S-sequence-2 contains the prediction with the next highest confidence, and so on. For records where the number of available predictions is smaller than the number of predictions requested, remaining predictions contain the value $null$. For example, if only two predictions can be made for a particular record, the values of $S-sequence-3 and $SC-sequence-3 will be $null$.

For each record, the rules in the model are compared to the set of transactions processed for the current ID so far, including the current record and any previous records with the same ID and earlier timestamp. The k rules with the highest confidence values that apply to this set of transactions are used to generate the k predictions for the record, where k is the number of predictions specified on the Settings tab after adding the model to the stream. (If multiple rules predict the same outcome for the transaction set, only the rule with the highest confidence is used.) For more information, see Sequence Rule Model Settings on p. 485.

As with other types of association rule models, the data format must match the format used in building the sequence model. For example, models built using tabular data can be used to score only tabular data. For more information, see Scoring Association Rules on p. 472.

Note: When scoring data using a generated Sequence Set node in a stream, any tolerance or gap settings that you selected in building the model are ignored for scoring purposes.
Predictions from Sequence Rules

The node handles the records in a time-dependent manner (or order-dependent, if no timestamp field was used to build the model). Records should be sorted by the ID field and timestamp field (if present). However, predictions are not tied to the timestamp of the record to which they are added. They simply refer to the most likely items to appear at some point in the future, given the history of transactions for the current ID up to the current record.

Note that the predictions for each record do not necessarily depend on that record's transactions. If the current record's transactions do not trigger a specific rule, rules will be selected based on the previous transactions for the current ID. In other words, if the current record doesn't add any useful predictive information to the sequence, the prediction from the last useful transaction for this ID is carried forward to the current record. For example, suppose you have a Sequence Rule model with the single rule
Jam → Bread (0.66)

and you pass it the following records:


ID    Purchase    Prediction
001   jam         bread
001   milk        bread

Notice that the first record generates a prediction of bread, as you would expect. The second record also contains a prediction of bread, because there's no rule for jam followed by milk; therefore, the milk transaction doesn't add any useful information, and the rule Jam → Bread still applies.
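
The carry-forward behavior can be pictured with a minimal Python sketch (the rule table, field names, and scoring loop below are assumptions for illustration only; actual scoring is performed by the generated model node):

    # One-rule model: antecedent -> (prediction, confidence).
    rules = {("jam",): ("bread", 0.66)}

    history, scored = [], []
    for purchase in ["jam", "milk"]:          # records for ID 001, in order
        history.append(purchase)
        prediction, confidence = None, None
        for antecedent, (cons, conf) in rules.items():
            if all(item in history for item in antecedent):
                prediction, confidence = cons, conf
        scored.append({"Purchase": purchase,
                       "$S-sequence-1": prediction,
                       "$SC-sequence-1": confidence})

    print(scored)   # both records are scored 'bread'; milk adds no new information
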
Generating New Nodes

The Generate menu allows you to create new SuperNodes based on the sequence model.
Rule SuperNode. Generates a SuperNode that can detect and count occurrences of sequences in scored data. This option is disabled if no rule is selected. For more information, see Generating a Rule SuperNode from a Sequence Rule Model on p. 486.
Model to Palette. Returns the model to the generated models palette. This is useful in situations where a colleague may have sent you a stream containing the model and not the model itself.

Sequence Rule Model Details


The Model tab for a generated sequence rule model displays the rules extracted by the algorithm. Each row in the table represents a rule, with the antecedent (the "if" part of the rule) in the first column followed by the consequent (the "then" part of the rule) in the second column.

Figure 14-29 Sequence Rule browser Model tab

Each rule is shown in the following format:


Antecedent              Consequent
beer and cannedveg      beer
fish, then fish         fish

The first example rule is interpreted as "for IDs that had beer and cannedveg in the same transaction, there is likely a subsequent occurrence of beer." The second example rule can be interpreted as "for IDs that had fish in one transaction and then fish in another, there is a likely subsequent occurrence of fish." Note that in the first rule, beer and cannedveg are purchased at the same time; in the second rule, fish is purchased in two separate transactions.
Show/Hide menu. The Show/Hide menu (percentage toolbar button) controls options for the display of rules. The following display options are available:

Instances displays information about the number of unique IDs for which the full sequence (both antecedents and consequent) appears. (Note this differs from Association models, for which the number of instances refers to the number of IDs for which only the antecedents apply.) For example, given the rule bread → cheese, the number of IDs in the training data that include both bread and cheese are referred to as instances.


Support displays the proportion of IDs in the training data for which the antecedents are true. For example, if 50% of the training data includes the antecedent bread, then the support for the bread → cheese rule would be 50%. (Unlike Association models, support is not based on the number of instances, as noted above.)

Confidence displays the percentage of the IDs where a correct prediction is made, out of all the IDs for which the rule makes a prediction. It is calculated as the number of IDs for which the entire sequence is found divided by the number of IDs for which the antecedents are found, based on the training data. For example, if 50% of the training data contains cannedveg (indicating antecedent support) but only 20% contains both cannedveg and frozenmeal, then confidence for the rule cannedveg → frozenmeal would be Rule Support / Antecedent Support or, in this case, 40%.

Rule Support for Sequence models is based on instances and displays the proportion of training records for which the entire rule, antecedents, and consequent(s), are true. For example, if 20% of the training data contains both bread and cheese, then rule support for the rule bread → cheese is 20%.

Note that the proportions are based on valid transactions (transactions with at least one observed item or true value) rather than total transactions. Invalid transactions (those with no items or true values) are discarded for these calculations.
Sort menu. The Sort menu button on the toolbar controls the sorting of rules. Direction of sorting (ascending or descending) can be changed using the sort direction button (up or down arrow).
Figure 14-30 Toolbar options for sorting

You can sort rules by:

Support
Confidence
Rule Support
Consequent
First Antecedent
Last Antecedent
Number of Items (antecedents)

For example, the following table is sorted in descending order by number of items. Rules with multiple items in the antecedent set precede those with fewer items.
Antecedent                            Consequent
beer and cannedveg and frozenmeal     frozenmeal
beer and cannedveg                    beer
fish, then fish                       fish
softdrink                             softdrink


Filter button. The Filter button (funnel icon) on the menu expands the bottom of the dialog box to show a panel where active rule filters are displayed. Filters are used to narrow the number of rules displayed on the Models tab.
Figure 14-31 Filter button

To create a filter, click the Filter icon to the right of the expanded panel. This opens a separate dialog box in which you can specify constraints for displaying rules. Note that the Filter button is often used in conjunction with the Generate menu to first filter rules and then generate a model containing that subset of rules. For more information, see Specifying Filters for Rules below.

Sequence Rule Model Settings


The Settings tab for a generated sequence rule model displays scoring options for the model. This tab is available only after the model has been added to the stream canvas for scoring.
Figure 14-32 Sample Association model Settings tab

Maximum number of predictions. Specify the maximum number of predictions included for each set of basket items. The rules with the highest confidence values that apply to this set of transactions are used to generate predictions for the record up to the specified limit.

Sequence Rule Model Summary


The Summary tab for a sequence rule model displays the number of rules discovered and the minimum and maximum for support and confidence in the rules. If you have executed an Analysis node attached to this modeling node, information from that analysis will also appear in this section. For more information, see Analysis Node in Chapter 17 on p. 537.

Figure 14-33 Sample Sequence Rules node Summary tab

For more information, see Browsing Generated Models in Chapter 6 on p. 239.

Generating a Rule SuperNode from a Sequence Rule Model


Figure 14-34 Generate Rule SuperNode dialog box

To generate a Rule SuperNode based on a sequence rule:


E On the Model tab for the generated sequence rule, click on a row in the table to select the desired rule.
E From the rule browser menus choose: Generate > Rule SuperNode


Important: To use the generated SuperNode, you must sort the data by ID field (and Time field, if any) before passing them into the SuperNode. The SuperNode will not detect sequences properly in unsorted data. You can specify the following options for generating a Rule SuperNode:
Detect. Specifies how matches are defined for data passed into the SuperNode.

Antecedents only. The SuperNode will identify a match any time it finds the antecedents for the selected rule in the correct order within a set of records having the same ID, regardless of whether the consequent is also found. Note that this does not take into account timestamp tolerance or item gap constraint settings from the original Sequence modeling node. When the last antecedent item set is detected in the stream (and all other antecedents have been found in the proper order), all subsequent records with the current ID will contain the summary selected below.
Entire sequence. The SuperNode will identify a match any time it finds the antecedents and the consequent for the selected rule in the correct order within a set of records having the same ID. This does not take into account timestamp tolerance or item gap constraint settings from the original Sequence modeling node. When the consequent is detected in the stream (and all antecedents have also been found in the correct order), the current record and all subsequent records with the current ID will contain the summary selected below.
Display. Controls how match summaries are added to the data in the Rule SuperNode output.

Consequent value for first occurrence. The value added to the data is the consequent value predicted based on the first occurrence of the match. Values are added as a new field named rule_n_consequent, where n is the rule number (based on the order of creation of Rule SuperNodes in the stream).
True value for first occurrence. The value added to the data is true if there is at least one match for the ID and false if there is no match. Values are added as a new field named rule_n_flag.
Count of occurrences. The value added to the data is the number of matches for the ID. Values are added as a new field named rule_n_count.


Rule number. The value added is the rule number for the selected rule. Rule numbers are assigned based on the order in which the SuperNode was added to the stream. For example, the first Rule SuperNode is considered rule 1, the second Rule SuperNode is considered rule 2, etc. This option is most useful when you will be including multiple Rule SuperNodes in your stream. Values are added as a new field named rule_n_number.
Include confidence figures. If selected, this option will add the rule confidence to the data stream as well as the other summary selected above. Values are added as a new field named rule_n_confidence.
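
The matching behavior can be pictured with a minimal Python sketch (illustrative only; the actual SuperNode is generated stream logic, and the records, rule, and field name below are hypothetical). It shows "Antecedents only" detection with a count-of-occurrences summary on data already sorted by ID:

    # Records sorted by ID; the rule's antecedent sequence is simply ["jam"].
    records = [("001", "jam"), ("001", "milk"), ("001", "jam"), ("002", "milk")]
    antecedents = ["jam"]

    state = {}
    for rec_id, item in records:
        s = state.setdefault(rec_id, {"pos": 0, "count": 0})
        if s["pos"] < len(antecedents) and item == antecedents[s["pos"]]:
            s["pos"] += 1
            if s["pos"] == len(antecedents):   # full antecedent sequence found
                s["count"] += 1
                s["pos"] = 0                   # look for the next occurrence
        print(rec_id, item, {"rule_1_count": s["count"]})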

Chapter 15
Time Series Models

Why Forecast?

To forecast means to predict the values of one or more series over time. For example, you may want to predict the expected demand for a line of products or services in order to allocate resources for manufacturing or distribution. Because planning decisions take time to implement, forecasts are an essential tool in many planning processes.

Methods of modeling time series assume that history repeats itself, if not exactly, then closely enough that by studying the past, you can make better decisions in the future. To predict sales for next year, for example, you would probably start by looking at this year's sales and work backward to figure out what trends or patterns, if any, have developed in recent years. But patterns can be difficult to gauge. If your sales increase several weeks in a row, for example, is this part of a seasonal cycle or the beginning of a long-term trend?

Using statistical modeling techniques, you can analyze the patterns in your past data and project those patterns to determine a range within which future values of the series are likely to fall. The result is more accurate forecasts on which to base your decisions.

Time Series Data


A time series is an ordered collection of measurements taken at regular intervals, for example, daily stock prices or weekly sales data. The measurements may be of anything that interests you, and each series can generally be classified as one of the following:
Dependent. A series that you want to forecast.

Predictor. A series that may help to explain the target, for example, using an advertising budget to predict sales. Predictors can only be used with ARIMA models.

Event. A special predictor series used to account for predictable recurring incidents, for example, sales promotions.

Intervention. A special predictor series used to account for one-time past incidents, for example, a power outage or employee strike.

The intervals can represent any unit of time, but the interval must be the same for all measurements. Moreover, any interval for which there is no measurement must be set to the missing value. Thus, the number of intervals with measurements (including those with missing values) defines the length of time of the historical span of the data.
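
As a minimal illustration of the uniform-interval requirement, the following pandas sketch (pandas is used purely for illustration; in Clementine this preparation is the job of the Time Intervals node) reindexes an irregular monthly series so that the missing month becomes an explicit missing value:

    import pandas as pd

    # Three observations with one month (2006-03) missing entirely.
    sales = pd.Series(
        [112.0, 118.0, 129.0],
        index=pd.to_datetime(["2006-01-01", "2006-02-01", "2006-04-01"]),
    )
    monthly = sales.asfreq("MS")   # month-start frequency; 2006-03-01 becomes NaN
    print(monthly)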



Characteristics of Time Series


Studying the past behavior of a series will help you identify patterns and make better forecasts. When plotted, many time series exhibit one or more of the following features:

Trends
Seasonal and nonseasonal cycles
Pulses and steps
Outliers
Intermittent demand

Trends
A trend is a gradual upward or downward shift in the level of the series or the tendency of the series values to increase or decrease over time.
Figure 15-1 Trend

Trends are either local or global, but a single series can exhibit both types. Historically, series plots of the stock market index show an upward global trend. Local downward trends have appeared in times of recession, and local upward trends have appeared in times of prosperity. Trends can also be either linear or nonlinear. Linear trends are positive or negative additive increments to the level of the series, comparable to the effect of simple interest on principal. Nonlinear trends are often multiplicative, with increments that are proportional to the previous series value(s). Global linear trends are fit and forecast well by both exponential smoothing and ARIMA models. In building ARIMA models, series showing trends are generally differenced to remove the effect of the trend.

Seasonal Cycles
A seasonal cycle is a repetitive, predictable pattern in the series values.

Figure 15-2 Seasonal cycle

Seasonal cycles are tied to the interval of your series. For instance, monthly data typically cycles over quarters and years. A monthly series might show a significant quarterly cycle with a low in the first quarter or a yearly cycle with a peak every December. Series that show a seasonal cycle are said to exhibit seasonality. Seasonal patterns are useful in obtaining good fits and forecasts, and there are exponential smoothing and ARIMA models that capture seasonality.

Nonseasonal Cycles
A nonseasonal cycle is a repetitive, possibly unpredictable, pattern in the series values.
Figure 15-3 Nonseasonal cycle

Some series, such as unemployment rate, clearly display cyclical behavior; however, the periodicity of the cycle varies over time, making it difficult to predict when a high or low will occur. Other series may have predictable cycles but do not neatly fit into the Gregorian calendar or have cycles longer than a year. For example, the tides follow the lunar calendar, international travel and trade related to the Olympics swell every four years, and there are many religious holidays whose Gregorian dates change from year to year.

Nonseasonal cyclical patterns are difficult to model and generally increase uncertainty in forecasting. The stock market, for example, provides numerous instances of series that have defied the efforts of forecasters. All the same, nonseasonal patterns must be accounted for when they exist. In many cases, you can still identify a model that fits the historical data reasonably well, which gives you the best chance to minimize uncertainty in forecasting.

Pulses and Steps


Many series experience abrupt changes in level. They generally come in two types:

A sudden, temporary shift, or pulse, in the series level
A sudden, permanent shift, or step, in the series level
Figure 15-4 Series with a pulse

When steps or pulses are observed, it is important to find a plausible explanation. Time series models are designed to account for gradual, not sudden, change. As a result, they tend to underestimate pulses and be ruined by steps, which lead to poor model fits and uncertain forecasts. (Some instances of seasonality may appear to exhibit sudden changes in level, but the level is constant from one seasonal period to the next.) If a disturbance can be explained, it can be modeled using an intervention or event. For example, during August 1973, an oil embargo imposed by the Organization of Petroleum Exporting Countries (OPEC) caused a drastic change in the inflation rate, which then returned to normal levels in the ensuing months. By specifying a point intervention for the month of the embargo, you can improve the fit of your model, thus indirectly improving your forecasts. For example, a retail store might find that sales were much higher than usual on the day all items were marked 50% off. By specifying the 50%-off promotion as a recurring event, you can improve the fit of your model and estimate the effect of repeating the promotion on future dates.

Outliers
Shifts in the level of a time series that cannot be explained are referred to as outliers. These observations are inconsistent with the remainder of the series and can dramatically influence the analysis and, consequently, affect the forecasting ability of the time series model. The following figure displays several types of outliers commonly occurring in time series. The blue lines represent a series without outliers. The red lines suggest a pattern that might be present if the series contained outliers. These outliers are all classified as deterministic because they affect only the mean level of the series.

Figure 15-5 Outlier types (panels: Additive Outlier, Innovational Outlier, Level Shift Outlier, Transient Change Outlier, Seasonal Additive Outlier, Local Trend Outlier)

Additive Outlier. An additive outlier appears as a surprisingly large or small value occurring for a single observation. Subsequent observations are unaffected by an additive outlier. Consecutive additive outliers are typically referred to as additive outlier patches.
Innovational Outlier. An innovational outlier is characterized by an initial impact with effects lingering over subsequent observations. The influence of the outliers may increase as time proceeds.
Level Shift Outlier. For a level shift, all observations appearing after the outlier move to a new level. In contrast to additive outliers, a level shift outlier affects many observations and has a permanent effect.
Transient Change Outlier. Transient change outliers are similar to level shift outliers, but the effect of the outlier diminishes exponentially over the subsequent observations. Eventually, the series returns to its normal level.
Seasonal Additive Outlier. A seasonal additive outlier appears as a surprisingly large or small value occurring repeatedly at regular intervals.


Local Trend Outlier. A local trend outlier yields a general drift in the series caused by a pattern in the outliers after the onset of the initial outlier.

Outlier detection in time series involves determining the location, type, and magnitude of any outliers present in a time series. Tsay (1988) proposed an iterative procedure for detecting mean level change to identify deterministic outliers. This process involves comparing a time series model that assumes no outliers are present to another model that incorporates outliers. Differences between the models yield estimates of the effect of treating any given point as an outlier.

Autocorrelation and Partial Autocorrelation Functions


Autocorrelation and partial autocorrelation are measures of association between current and past series values and indicate which past series values are most useful in predicting future values. With this knowledge, you can determine the order of processes in an ARIMA model. More specifically:
Autocorrelation function (ACF). At lag k, this is the correlation between series values that are k intervals apart.


Partial autocorrelation function (PACF). At lag k, this is the correlation between series values that are k intervals apart, accounting for the values of the intervals between.
Figure 15-6 ACF plot for a series

The x axis of the ACF plot indicates the lag at which the autocorrelation is computed; the y axis indicates the value of the correlation (between -1 and 1). For example, a spike at lag 1 in an ACF plot indicates a strong correlation between each series value and the preceding value, a spike at lag 2 indicates a strong correlation between each value and the value occurring two points previously, and so on. A positive correlation indicates that large current values correspond with large values at the specified lag; a negative correlation indicates that large current values correspond with small values at the specified lag. The absolute value of a correlation is a measure of the strength of the association, with larger absolute values indicating stronger relationships.
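
For reference, the quantity plotted on the y axis can be computed as a sample autocorrelation; the following short Python sketch (not SPSS's implementation, using made-up data) shows the calculation at lags 1 through 3:

    import numpy as np

    def acf(series, k):
        """Sample autocorrelation between values k intervals apart."""
        x = np.asarray(series, dtype=float)
        x = x - x.mean()
        return 1.0 if k == 0 else float(np.sum(x[k:] * x[:-k]) / np.sum(x * x))

    y = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 10.0, 9.0]
    print([round(acf(y, k), 3) for k in range(1, 4)])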


Series Transformations
Transformations are often useful for stabilizing a series before estimating models. This is particularly important for ARIMA models, which require series to be stationary before models are estimated. A series is stationary if the global level (mean) and average deviation from the level (variance) are constant throughout the series. While most interesting series are not stationary, ARIMA is effective as long as the series can be made stationary by applying transformations, such as the natural log, differencing, or seasonal differencing.
Variance stabilizing transformations. Series in which the variance changes over time can often be stabilized using a natural log or square root transformation. These are also called functional transformations.

Natural log. The natural logarithm is applied to the series values.
Square root. The square root function is applied to the series values.

Natural log and square root transformations cannot be used for series with negative values.
Level stabilizing transformations. A slow decline of the values in the ACF indicates that each series value is strongly correlated with the previous value. By analyzing the change in the series values, you obtain a stable level.

Simple differencing. The differences between each value and the previous value in the series are computed, excepting, of course, the oldest value in the series. This means that the differenced series will have one less value than the original series.
Seasonal differencing. Identical to simple differencing, except that the differences between each value and the previous seasonal value are computed.

When either simple or seasonal differencing is simultaneously in use with either the log or square root transformation, the variance stabilizing transformation is always applied first. When simple and seasonal differencing are both in use, the resulting series values are the same whether simple differencing or seasonal differencing is applied first.
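
A minimal NumPy sketch of these transformations, applied in the documented order (functional transformation first, then differencing), with made-up monthly values:

    import numpy as np

    series = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0,
                       136.0, 119.0, 104.0, 118.0, 115.0, 126.0, 141.0, 135.0])

    logged = np.log(series)                 # variance-stabilizing natural log, applied first
    diffed = np.diff(logged)                # simple differencing: one fewer value than the original
    seasonal = logged[12:] - logged[:-12]   # seasonal differencing with period 12
    print(len(series), len(diffed), len(seasonal))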

Predictor Series
Predictor series include related data that may help explain the behavior of the series to be forecast. For example, a Web- or catalog-based retailer might forecast sales based on the number of catalogs mailed, the number of phone lines open, or the number of hits to the company Web page. Any series can be used as a predictor provided that the series extends as far into the future as you want to forecast and has complete data with no missing values.

Use care when adding predictors to a model. Adding large numbers of predictors will increase the time required to estimate models. While adding predictors may improve a model's ability to fit the historical data, it doesn't necessarily mean that the model does a better job of forecasting, so the added complexity may not be worth the trouble. Ideally, the goal should be to identify the simplest model that does a good job of forecasting. As a general rule, it is recommended that the number of predictors should be less than the sample size divided by 15 (at most, one predictor per 15 cases).


Predictors with missing data. Predictors with incomplete or missing data cannot be used in forecasting. This applies to both historical data and future values. In some cases, you can avoid this limitation by setting the model's estimation span to exclude the oldest data when estimating models.

Time Series Node


The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series and produces forecasts based on the time series data.

Exponential smoothing is a method of forecasting that uses weighted values of previous series observations to predict future values. As such, exponential smoothing is not based on a theoretical understanding of the data. It forecasts one point at a time, adjusting its forecasts as new data come in. The technique is useful for forecasting series that exhibit trend, seasonality, or both. You can choose from a variety of exponential smoothing models that differ in their treatment of trend and seasonality.

ARIMA models provide more sophisticated methods for modeling trend and seasonal components than do exponential smoothing models, and, in particular, they allow the added benefit of including independent (predictor) variables in the model. This involves explicitly specifying autoregressive and moving average orders as well as the degree of differencing. You can include predictor variables and define transfer functions for any or all of them. You can also specify automatic detection of outliers or an explicit set of outliers.

Tip: In practical terms, ARIMA models are most useful if you want to include predictors that may help to explain the behavior of the series being forecast, such as the number of catalogs mailed or the number of hits to a company Web page. Exponential smoothing models describe the behavior of the time series without attempting to understand why it behaves as it does. For example, a series that historically has peaked every 12 months will probably continue to do so even if you don't know why.

Also available is an Expert Modeler, which automatically identifies and estimates the best-fitting ARIMA or exponential smoothing model for one or more target variables, thus eliminating the need to identify an appropriate model through trial and error. In all cases, the Expert Modeler picks the best model for each of the target variables specified. If in doubt, use the Expert Modeler.

If predictor variables are specified, the Expert Modeler selects for inclusion in ARIMA models those variables that have a statistically significant relationship with the dependent series. Model variables are transformed where appropriate using differencing and/or a square root or natural log transformation. By default, the Expert Modeler considers all exponential smoothing models and all ARIMA models and picks the best model among them for each target field. You can, however, limit the Expert Modeler only to pick the best of the exponential smoothing models or only to pick the best of the ARIMA models. You can also specify automatic detection of outliers.
Example. An analyst for a national broadband provider is required to produce forecasts of user subscriptions in order to predict utilization of bandwidth. Forecasts are needed for each of the local markets that make up the national subscriber base. You can use time series modeling to produce forecasts for the next three months for a number of local markets. For more information, see Forecasting Bandwidth Utilization (Time Series) in Chapter 12 in Clementine 11.1 Applications Guide.

Requirements
The Time Series node is different from other Clementine nodes in that you cannot simply insert it into a stream and execute the stream. The Time Series node must always be preceded by a Time Intervals node that specifies such information as the time interval to use (years, quarters, months, etc.), the data to use for estimation, and how far into the future to extend a forecast, if used.
Figure 15-7 Always precede a Time Series node with a Time Intervals node

The time series data must be evenly spaced. Methods for modeling time series data require a uniform interval between each measurement, with any missing values indicated by empty rows. If your data do not already meet this requirement, the Time Intervals node can transform values as needed. For more information, see Time Intervals Node in Chapter 4 on p. 128.

Other points to note in connection with Time Series nodes are:

Fields must be numeric.
Date fields cannot be used as inputs.
Partitions are ignored.


Field Options
Figure 15-8 Time Series node Fields tab

The Fields tab is where you specify the fields to be used in building the model. Before you can build a model, you need to specify which fields you want to use as targets and as inputs. Typically the Time Series node uses field information from an upstream Type node. If you are using a Type node to select input and target fields, you don't need to change anything on this tab.
Use type node settings. This option tells the node to use field information from an upstream Type node. This is the default.


Use custom settings. This option tells the node to use field information specified here instead of that given in any upstream Type node(s). After selecting this option, specify the fields below. Note that fields stored as dates are not accepted as either target or input fields.
Target. Select one or more target fields. This is similar to setting a field's direction to Out in a Type node. Target fields for a time series model must be of type Range. A separate model is created for each target field. A target field considers all specified Input fields except itself as possible inputs. Thus, the same field can appear in both lists; such a field will be used as a possible input to all models except the one where it is a target.
Inputs. Select the input field(s). This is similar to setting a field's direction to In in a Type node. Input fields for a time series model must be numeric.


Time Series Model Options


Figure 15-9 Time Series node Model tab

Model name. Specifies the name assigned to the model that is generated when the node is executed.

Auto. Generates the model name automatically based on the target or ID field names or the name of the model type in cases where no target is specified (such as clustering models).
Custom. Allows you to specify a custom name for the generated model.

Method. You can choose Expert Modeler, Exponential Smoothing, or ARIMA. For more information, see Time Series Node on p. 495. Select Criteria... to specify options for the selected method. Alternatively, if you have already generated a time series model, you can choose to reuse the criteria from that model. For more information, see Re-estimating and Forecasting on p. 508.
Expert Modeler. Choose this option to use the Expert Modeler, which automatically finds the best-fitting model for each dependent series.


Exponential Smoothing. Use this option to specify a custom exponential smoothing model.

ARIMA. Use this option to specify a custom ARIMA model.

Reuse Stored Settings. If you have already generated a time series model, select this option to reuse the criteria settings specified for that model and generate a new model node in the Models palette, rather than building a new model from the beginning. In this way you can save time by re-estimating and producing a new forecast based on the same model settings as before but using more recent data. Thus, for example, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting for that data; the system does not reattempt to find the best model type for the new data.
Time Interval Information

This section of the dialog box contains information about specifications for estimation and forecasting made on the Time Intervals node. Note that this section does not appear if you choose the Reuse Stored Settings option for the modeling method. The first line of the information indicates whether any records are excluded from the model or used as holdouts. For more information, see Estimation Period in Chapter 4 on p. 133. The second line provides information about any forecast periods specified on the Time Intervals node. For more information, see Forecasts in Chapter 4 on p. 133. If the first line reads No time interval defined, this indicates that no Time Intervals node is connected. This situation will cause an error on attempting to execute the stream; you must include a Time Intervals node upstream from the Time Series node.
Confidence limit width (%). Confidence intervals are computed for the model predictions and residual autocorrelations. You can specify any positive value less than 100. By default, a 95% confidence interval is used.
Maximum number of lags in ACF and PACF output. You can set the maximum number of lags shown in tables and plots of autocorrelations and partial autocorrelations.

Time Series Expert Modeler Criteria


Figure 15-10 Expert Modeler options


Model Type. The following options are available:

All models. The Expert Modeler considers both ARIMA and exponential smoothing models.
Exponential smoothing models only. The Expert Modeler only considers exponential smoothing models.
ARIMA models only. The Expert Modeler only considers ARIMA models.

Expert Modeler considers seasonal models. This option is only enabled if a periodicity has been defined for the active dataset. When this option is selected (checked), the Expert Modeler considers both seasonal and nonseasonal models. If this option is not selected, the Expert Modeler only considers nonseasonal models.
Events and Interventions. Enables you to designate certain predictor fields as event or intervention fields. Doing so identifies a field as containing time series data affected by events (predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a power outage or employee strike). The Expert Modeler will consider only simple regression and not arbitrary transfer functions for predictors identified as event or intervention fields. Predictor fields must be of type Flag, Set, or Ordered Set, and must be numeric (e.g., 1/0, not True/False, for a Flag field), before they will appear in this list. For more information, see Pulses and Steps on p. 491.
Outliers
Figure 15-11 Specifying outlier detection with Expert Modeler

Detect outliers automatically. By default, automatic detection of outliers is not performed. Select (check) this option to perform automatic detection of outliers, then select the desired outlier types. For more information, see Outliers on p. 491.


Time Series Exponential Smoothing Criteria


Figure 15-12 Exponential smoothing criteria

Model Type. Exponential smoothing models (Gardner, 1985) are classified as either seasonal or nonseasonal. Seasonal models are only available if the periodicity defined using the Time Intervals node is seasonal. The seasonal periodicities are: cyclic periods, years, quarters, months, days per week, hours per day, minutes per day, and seconds per day. For more information, see Time Intervals Node in Chapter 4 on p. 128.
Simple. This model is appropriate for series in which there is no trend or seasonality. Its only relevant smoothing parameter is level (a minimal sketch of simple exponential smoothing appears at the end of this section). Simple exponential smoothing is most similar to an ARIMA with zero orders of autoregression, one order of differencing, one order of moving average, and no constant.
Holt's linear trend. This model is appropriate for series in which there is a linear trend and no seasonality. Its relevant smoothing parameters are level and trend, and, in this model, they are not constrained by each other's values. Holt's model is more general than Brown's model but may take longer to compute estimates for large series. Holt's exponential smoothing is most similar to an ARIMA with zero orders of autoregression, two orders of differencing, and two orders of moving average.
Brown's linear trend. This model is appropriate for series in which there is a linear trend and no seasonality. Its relevant smoothing parameters are level and trend, but, in this model, they are assumed to be equal. Brown's model is therefore a special case of Holt's model. Brown's exponential smoothing is most similar to an ARIMA with zero orders of autoregression, two orders of differencing, and two orders of moving average, with the coefficient for the second order of moving average equal to one-half of the coefficient for the first order squared.


Damped trend. This model is appropriate for series with a linear trend that is dying out and with no seasonality. Its relevant smoothing parameters are level, trend, and damping trend. Damped exponential smoothing is most similar to an ARIMA with one order of autoregression, one order of differencing, and two orders of moving average.
Simple seasonal. This model is appropriate for series with no trend and with a seasonal effect that is constant over time. Its relevant smoothing parameters are level and season. Seasonal exponential smoothing is most similar to an ARIMA with zero orders of autoregression; one order of differencing; one order of seasonal differencing; and orders 1, p, and p+1 of moving average, where p is the number of periods in a seasonal interval. For monthly data, p = 12.
Winters' additive. This model is appropriate for series with a linear trend and a seasonal effect that is constant over time. Its relevant smoothing parameters are level, trend, and season. Winters' additive exponential smoothing is most similar to an ARIMA with zero orders of autoregression; one order of differencing; one order of seasonal differencing; and p+1 orders of moving average, where p is the number of periods in a seasonal interval. For monthly data, p = 12.
Winters' multiplicative. This model is appropriate for series with a linear trend and a seasonal effect that changes with the magnitude of the series. Its relevant smoothing parameters are level, trend, and season. Winters' multiplicative exponential smoothing is not similar to any ARIMA model.
Target Transformation. You can specify a transformation to be performed on each dependent variable before it is modeled. For more information, see Series Transformations on p. 494.

None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.

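For reference, the following is a minimal Python sketch of the simplest case described above, simple exponential smoothing with a single level parameter (the alpha value is a hypothetical choice; this is an illustration, not Clementine's estimation routine):

    def simple_exp_smoothing(series, alpha=0.3):
        """Return the one-step-ahead forecast from simple exponential smoothing."""
        level = series[0]
        for y in series[1:]:
            level = alpha * y + (1 - alpha) * level   # adjust the level as each new value arrives
        return level

    print(simple_exp_smoothing([12.0, 14.0, 13.0, 15.0, 16.0]))
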
Time Series ARIMA Criteria


The Time Series node allows you to build custom nonseasonal or seasonal ARIMA models, also known as Box-Jenkins (Box, Jenkins, and Reinsel, 1994) models, with or without a fixed set of input (predictor) variables. You can define transfer functions for any or all of the input variables and specify automatic detection of outliers or an explicit set of outliers. All input variables specified are explicitly included in the model. This is in contrast to using the Expert Modeler, where input variables are included only if they have a statistically significant relationship with the target variable.
Model

The Model tab allows you to specify the structure of a custom ARIMA model.

Figure 15-13 Specifying the structure of an ARIMA model

ARIMA Orders. Enter values for the various ARIMA components of your model into the corresponding cells of the Structure grid. All values must be non-negative integers. For autoregressive and moving average components, the value represents the maximum order. All positive lower orders will be included in the model. For example, if you specify 2, the model includes orders 2 and 1. Cells in the Seasonal column are only enabled if a periodicity has been defined for the active dataset.
Autoregressive (p). The number of autoregressive orders in the model. Autoregressive orders specify which previous values from the series are used to predict current values. For example, an autoregressive order of 2 specifies that the value of the series two time periods in the past be used to predict the current value.
Difference (d). Specifies the order of differencing applied to the series before estimating models. Differencing is necessary when trends are present (series with trends are typically nonstationary and ARIMA modeling assumes stationarity) and is used to remove their effect. The order of differencing corresponds to the degree of series trend: first-order differencing accounts for linear trends, second-order differencing accounts for quadratic trends, and so on.
Moving Average (q). The number of moving average orders in the model. Moving average orders specify how deviations from the series mean for previous values are used to predict current values. For example, moving-average orders of 1 and 2 specify that deviations from the mean value of the series from each of the last two time periods be considered when predicting current values of the series.
Seasonal Orders. Seasonal autoregressive, moving average, and differencing components play the same roles as their nonseasonal counterparts. For seasonal orders, however, current series values are affected by previous series values separated by one or more seasonal periods. For example, for monthly data (seasonal period of 12), a seasonal order of 1 means that the current series value is affected by the series value 12 periods prior to the current one. A seasonal order of 1, for monthly data, is then the same as specifying a nonseasonal order of 12.
Target Transformation. You can specify a transformation to be performed on each target variable before it is modeled. For more information, see Series Transformations on p. 494.

None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.

Include constant in model. Inclusion of a constant is standard unless you are sure that the overall mean series value is 0. Excluding the constant is recommended when differencing is applied.
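
As an outside point of comparison, the following minimal sketch fits a nonseasonal ARIMA(1, 1, 1) using the statsmodels Python library (an assumption for illustration only; Clementine builds its ARIMA models through this node, not through Python). With d = 1, no constant is included, in line with the recommendation above; the data are made up:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0,
                  136.0, 119.0, 104.0, 118.0, 115.0, 126.0, 141.0, 135.0,
                  125.0, 149.0, 170.0, 170.0, 158.0, 133.0, 114.0, 140.0])

    # order=(p, d, q): one autoregressive order, one order of differencing,
    # one moving average order.
    model = ARIMA(y, order=(1, 1, 1)).fit()
    print(model.forecast(steps=3))   # forecasts for the next three periods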

Transfer Functions
Figure 15-14 Defining transfer functions

The Transfer Functions tab allows you to define transfer functions for any or all of the input fields. Transfer functions allow you to specify the manner in which past values of these fields are used to forecast future values of the target series. The tab appears only if predictors (i.e., input fields) are specified, either on the Type node (Direction = In) or on the Fields tab of the Time Series node (Use custom settings, Inputs). For more information, see Setting Field Direction in Chapter 4 on p. 80. The top list shows all predictor fields. The remaining information in this dialog box is specific to the selected predictor field in the list.


Transfer Function Orders. Enter values for the various components of the transfer function into the corresponding cells of the Structure grid. All values must be non-negative integers. For numerator and denominator components, the value represents the maximum order. All positive lower orders will be included in the model. In addition, order 0 is always included for numerator components. For example, if you specify 2 for numerator, the model includes orders 2, 1, and 0. If you specify 3 for denominator, the model includes orders 3, 2, and 1. Cells in the Seasonal column are only enabled if a periodicity has been defined for the active dataset.

Numerator. The numerator order of the transfer function specifies which previous values from the selected independent (predictor) series are used to predict current values of the dependent series. For example, a numerator order of 1 specifies that the value of an independent series one time period in the past, as well as the current value of the independent series, is used to predict the current value of each dependent series.

Denominator. The denominator order of the transfer function specifies how deviations from the series mean, for previous values of the selected independent (predictor) series, are used to predict current values of the dependent series. For example, a denominator order of 1 specifies that deviations from the mean value of an independent series one time period in the past be considered when predicting the current value of each dependent series.

Difference. Specifies the order of differencing applied to the selected independent (predictor) series before estimating models. Differencing is necessary when trends are present and is used to remove their effect.
Seasonal Orders. Seasonal numerator, denominator, and differencing components play the same roles as their nonseasonal counterparts. For seasonal orders, however, current series values are affected by previous series values separated by one or more seasonal periods. For example, for monthly data (seasonal period of 12), a seasonal order of 1 means that the current series value is affected by the series value 12 periods prior to the current one. A seasonal order of 1, for monthly data, is then the same as specifying a nonseasonal order of 12.
Delay. Setting a delay causes the input field's influence to be delayed by the number of intervals specified. For example, if the delay is set to 5, the value of the input field at time t doesn't affect forecasts until five periods have elapsed (t + 5). A minimal sketch of this shifting appears after the transformation list below.

Transformation. Specification of a transfer function for a set of independent variables also includes an optional transformation to be performed on those variables.


None. No transformation is performed.
Square root. Square root transformation is performed.
Natural log. Natural log transformation is performed.
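
The delay setting simply shifts the predictor in time; a minimal pandas sketch (pandas and the field name are assumptions for illustration only) of a delay of 5:

    import pandas as pd

    catalogs = pd.Series([10, 12, 9, 15, 11, 13, 14, 16], name="catalogs_mailed")
    delayed = catalogs.shift(5)   # the value at time t is the predictor value from t - 5
    print(pd.concat([catalogs, delayed.rename("catalogs_delay5")], axis=1))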


Handling Outliers
Figure 15-15 Handling outliers in an ARIMA model

The Outliers tab provides a number of choices for the handling of outliers (Pena, Tiao, and Tsay, 2001) in the data.
Do not detect outliers or model them. By default, outliers are neither detected nor modeled. Select this option to disable any detection or modeling of outliers.


Detect outliers automatically. Select this option to perform automatic detection of outliers, and select one or more of the outlier types shown.


Type of Outliers to Detect. Select the outlier type(s) you want to detect. The supported types are:

Additive (default)
Level shift (default)
Innovational
Transient
Seasonal additive
Local trend
Additive patch

For more information, see Outliers on p. 491.


Generated Time Series Models


The time series modeling operation can create a number of new fields with the prefix $TS, as follows:
$TS-colname       The generated model data for each column of the original data.
$TSLCI-colname    The lower confidence interval value for each column of the generated model data.*
$TSUCI-colname    The upper confidence interval value for each column of the generated model data.*
$TSNR-colname     The noise residual value for each column of the generated model data.*
$TS-Total         The total of the $TS-colname values for this row.
$TSLCI-Total      The total of the $TSLCI-colname values for this row.*
$TSUCI-Total      The total of the $TSUCI-colname values for this row.*
$TSNR-Total       The total of the $TSNR-colname values for this row.*

* Creation of these fields depends on options on the Settings tab of the Time Series model. For more information, see Time Series Model Settings on p. 514.

Generating Multiple Models


Time series modeling in Clementine generates a single model (either ARIMA or exponential smoothing) for each target field. Thus, if you have multiple target fields, Clementine generates multiple models in a single operation, saving time and enabling you to compare the generated models. If you want to compare an ARIMA model and an exponential smoothing model for the same target field, you can perform separate executions of the Time Series node, specifying a different model each time.

Using Time Series Models in Forecasting


A time series build operation uses a specific series of ordered cases, known as the estimation span, to build a model that can be used to forecast future values of the series. This model contains information about the time span used, including the interval. In order to forecast using this model, the same time span and interval information must be used with the same series for both the target variable and predictor variables.

For example, suppose that at the beginning of January you want to forecast monthly sales of Product 1 for the first three months of the year. You build a model using the actual monthly sales data for Product 1 from January through to December of the previous year (which we'll call Year 1), setting the Time Interval to Months. You can then use the model to forecast sales of Product 1 for the first three months of Year 2. In fact you could forecast any number of months ahead, but of course, the further into the future you try to predict, the less effective the model will become. It would not, however, be possible to forecast the first three weeks of Year 2, because the interval used to build the model was Months. It would also make no sense to use this model to predict the sales of Product 2; a time series model is relevant only for the data that was used to define it. For more information, see Forecasting Bandwidth Utilization (Time Series) in Chapter 12 in Clementine 11.1 Applications Guide.

Re-estimating and Forecasting


The estimation period is hard coded into the model that is generated. This means that any values outside the estimation period are ignored if you apply the current model to new data. Thus, a time series model must be re-estimated each time new data is available, in contrast to other Clementine models, which can be reapplied unchanged for the purposes of scoring. To continue the previous example, suppose that by the beginning of April in Year 2, you have the actual monthly sales data for January through March. However, if you re-apply the model you generated at the beginning of January, it will again forecast January through March and ignore the known sales data for that period. The solution is to generate a new model based on the updated actual data. Assuming that you do not change the forecasting parameters, the new model can be used to forecast the next three months, April through June. If you still have access to the stream that was used to generate the original model, you can simply replace the reference to the source file in that stream with a reference to the file containing the updated data and re-execute the stream to generate the new model. However, if all you have is the original model saved in a file, you can still use it to generate a Time Series node that you can then add to a new stream containing a reference to the updated source file. Provided this new stream precedes the Time Series node with a Time Intervals node where the interval is set to Months, executing this new stream will then generate the required new model. For more information, see Reapplying a Time Series Model in Chapter 12 in Clementine 11.1 Applications Guide.
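To make the re-estimation idea concrete, here is a small stand-alone Python sketch using statsmodels. It is an illustration only, not Clementine code; the sales figures and the ARIMA order are hypothetical. The model is simply re-fit on the full updated monthly series and then asked for the next three months.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Year 1 plus the newly available January-March actuals for Year 2 (hypothetical values)
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,  # Year 1
     115, 126, 141],                                               # Year 2, Jan-Mar
    index=pd.date_range("2006-01-01", periods=15, freq="MS"),
)

model = ARIMA(sales, order=(1, 1, 1)).fit()   # re-estimate on the full updated span
print(model.forecast(steps=3))                # forecast April-June of Year 2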


Time Series Model Node


Figure 15-16 Time Series model node

The Time Series model displays details of the various models selected for each of the series input into the Time Series build node. Multiple series (such as data relating to product lines, regions, or stores) can be input, and a separate model is generated for each target series. For example, if revenue in the eastern region is found to fit an ARIMA model, but the western region fits only a simple moving average, each region is scored with the appropriate model. The default output shows, for each target field, the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes columns for Ljung-Box Q, degrees of freedom, and significance values. You can also choose advanced output, which displays the following additional columns:
R-squared
RMSE (Root Mean Square Error)
MAPE (Mean Absolute Percentage Error)
MAE (Mean Absolute Error)
MaxAPE (Maximum Absolute Percentage Error)

MaxAE (Maximum Absolute Error)
Norm. BIC (Normalized Bayesian Information Criterion)
Generate. Enables you to generate a Time Series node back to the stream or the palette.

Generate Modeling Node. Places a Time Series node into a stream with the settings used to create this set of models. This would be useful, for example, if you have a stream in which you want to use these model settings but you no longer have the Time Series node used to generate them.

Model to Palette. Places a model containing all the targets in the Models manager.
Figure 15-17 Check All and Un-check All buttons

Check boxes. Choose which models you want to use in scoring. All the boxes are checked by default. The Check all and Un-check all buttons act on all the boxes in a single operation.

Sort by. Enables you to sort the output rows in ascending or descending order of a specified column of the display. The Selected option sorts the output based on one or more rows selected by check boxes. This would be useful, for example, to cause target fields named Market_1 to Market_9 to be displayed before Market_10, as the default sort order displays Market_10 immediately after Market_1.

View. The default view (Simple) displays the basic set of output columns. The Advanced option displays additional columns for goodness-of-fit measures.


Number of records used in estimation. The number of rows in the original source data file.

Target. The field or fields identified as the target fields (Direction = Out) in the Type node.

Model. The type of model used for this target field.

Predictors. The number of predictors (Direction = In) used for this target field.

Outliers. This column is displayed only if you have requested (in the Expert Modeler or ARIMA criteria) the automatic detection of outliers. The value shown is the number of outliers detected.

Stationary R-squared. A measure that compares the stationary part of the model to a simple mean model. This measure is preferable to ordinary R-squared when there is a trend or seasonal pattern. Stationary R-squared can be negative, with a range of negative infinity to 1. Negative values mean that the model under consideration is worse than the baseline model. Positive values mean that the model under consideration is better than the baseline model.

R-Squared. Goodness-of-fit measure of a linear model, sometimes called the coefficient of determination. It is the proportion of variation in the dependent variable explained by the regression model. It ranges in value from 0 to 1. Small values indicate that the model does not fit the data well.

RMSE. Root Mean Square Error. The square root of mean square error. A measure of how much a dependent series varies from its model-predicted level, expressed in the same units as the dependent series.

MAPE. Mean Absolute Percentage Error. A measure of how much a dependent series varies from its model-predicted level. It is independent of the units used and can therefore be used to compare series with different units.

MAE. Mean absolute error. Measures how much the series varies from its model-predicted level. MAE is reported in the original series units.

MaxAPE. Maximum Absolute Percentage Error. The largest forecasted error, expressed as a percentage. This measure is useful for imagining a worst-case scenario for your forecasts.

MaxAE. Maximum Absolute Error. The largest forecasted error, expressed in the same units as the dependent series. Like MaxAPE, it is useful for imagining the worst-case scenario for your forecasts. Maximum absolute error and maximum absolute percentage error may occur at different series points; for example, when the absolute error for a large series value is slightly larger than the absolute error for a small series value. In that case, the maximum absolute error will occur at the larger series value and the maximum absolute percentage error will occur at the smaller series value.

Normalized BIC. Normalized Bayesian Information Criterion. A general measure of the overall fit of a model that attempts to account for model complexity. It is a score based upon the mean square error and includes a penalty for the number of parameters in the model and the length of the series. The penalty removes the advantage of models with more parameters, making the statistic easy to compare across different models for the same series.

Q. The Ljung-Box Q statistic for this target field.

df. Degrees of freedom for this target field.

Sig. Significance level for this target field.

Summary Statistics. This section contains various summary statistics for the different columns, including mean, minimum, maximum, and percentile values.
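For readers who want to see how these error measures relate to one another, the following stand-alone Python sketch computes them from a short series of actual and model-predicted values. It is an illustration only, not Clementine code, and the numbers are hypothetical.

import numpy as np

actual    = np.array([120.0, 135.0, 128.0, 150.0, 142.0])   # observed series values (hypothetical)
predicted = np.array([118.0, 131.0, 133.0, 146.0, 149.0])   # model-predicted values (hypothetical)

errors = actual - predicted
pct_errors = np.abs(errors) / np.abs(actual) * 100           # absolute percentage errors

rmse    = np.sqrt(np.mean(errors ** 2))    # Root Mean Square Error, in series units
mae     = np.mean(np.abs(errors))          # Mean Absolute Error, in series units
mape    = np.mean(pct_errors)              # Mean Absolute Percentage Error, unit-free
max_ae  = np.max(np.abs(errors))           # Maximum Absolute Error
max_ape = np.max(pct_errors)               # Maximum Absolute Percentage Error

print(rmse, mae, mape, max_ae, max_ape)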


Time Series Model Residuals


Figure 15-18 Residuals ACF and PACF display

The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each target field. For more information, see Autocorrelation and Partial Autocorrelation Functions on p. 493.

Display plot for model. Select the target field for which you want to display the residual ACF and residual PACF.


Time Series Model Summary


Figure 15-19 Summary details for Time Series model

The Summary tab of a generated model displays information about the model itself (Analysis), fields used in the model (Fields), settings used when building the model (Build Settings), and model training (Training Summary). When you first browse the node, the Summary tab results are collapsed. To see the results of interest, use the expander control to the left of an item to unfold it or click the Expand All button to show all results. To hide the results when you have finished viewing them, use the expander control to collapse the specific results that you want to hide or click the Collapse All button to collapse all results.

Analysis. Displays information about the specific model.

Fields. Lists the fields used as the target and the inputs in building the model.

Build Settings. Contains information about the settings used in building the model.

Training Summary. Shows the type of model, the stream used to create it, the user who created it, when it was built, and the elapsed time for building the model.


Time Series Model Settings


Figure 15-20 Model settings display

The Settings tab enables you to specify what extra fields are created by the modeling operation.

Create new fields for each model to be scored. Enables you to specify the new fields to create for each model to be scored.

Calculate upper and lower confidence limits. If checked, creates new fields (with the default prefixes $TSLCI- and $TSUCI-) for the lower and upper confidence intervals, respectively, for each target field, together with totals of these values.

Calculate noise residuals. If checked, creates a new field (with the default prefix $TSNR-) for the noise residuals for each target field, together with a total of these values.

Chapter 16
Self-Learning Response Node Models

SLRM Node
This node is available with the Classification module. The Self-Learning Response Model (SLRM) node enables you to build a model that you can continually update, or re-estimate, as a dataset grows without having to rebuild the model every time using the complete dataset. For example, this is useful when you have several products and you want to identify which product a customer is most likely to buy if you offer it to them. This model allows you to predict which offers are most appropriate for customers and the probability of the offers being accepted. The model can initially be built using a small dataset with randomly made offers and the responses to those offers. As the dataset grows, the model can be updated and therefore becomes more able to predict the most suitable offers for customers and the probability of their acceptance based upon other input fields such as age, gender, job, income, and so on. The offers available can be changed by adding or removing them from within the node dialog box, instead of having to change the target field of the dataset. When coupled with SPSS Predictive Enterprise Services, you can set up automatic regular updates to the model. This process, without the need for human oversight or action, provides a flexible and low-cost solution for organizations and applications where custom intervention by a data miner is not possible or necessary.

Example. A financial institution wants to achieve more profitable results by matching the offer that is most likely to be accepted to each customer. You can use a self-learning model to identify the characteristics of customers most likely to respond favorably based on previous promotions and to update the model in real time based on the latest customer responses. For more information, see Making Offers to Customers (Self-Learning) in Chapter 14 in Clementine 11.1 Applications Guide.



SLRM Node Fields Options


Figure 16-1 Self-learning node: Model tab

Before executing a SLRM node, you must specify both the target and target response fields on the Fields tab of the node.

Target field. Select the target field from the list. For example, a set containing the different products you want to offer to customers. Note: The target field cannot contain integer values.

Target response field. Select the target response field from the list. For example, Accepted or Rejected. Note: This field must be a Flag. The true value of the flag indicates offer acceptance and the false value indicates offer refusal.

The remaining fields on this dialog box are the standard ones used throughout Clementine. For more information, see Modeling Node Fields Options in Chapter 6 on p. 235. Note: If the source data includes ranges that are to be used as continuous predictors, you must ensure that the metadata includes both the minimum and maximum details for each range.


SLRM Node Model Options


Figure 16-2 Self-learning node: Model tab

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model. For more information, see Partition Node in Chapter 4 on p. 119.

Continue training existing model. If you select this option, the results shown on the generated Model tab are regenerated and updated each time the model is run. For example, you would do this when you have added a new or updated data source to an existing model.

Target field values. By default this is set to Use all, which means that a model will be built that contains every offer associated with the selected target field value. If you want to generate a model that contains only some of the target field's offers, click Specify and use the Add, Edit, and Delete buttons to enter or amend the names of the offers for which you want to build a model. For example, if you chose a target that lists all of the products you supply, you can use this field to limit the offered products to just a few that you enter here.

Model Assessment. The fields in this panel are independent from the model in that they don't affect the scoring. Instead they enable you to create a visual representation of how well the model will predict results.


Note: To display the model assessment results in the generated model you must also select the Display model evaluation box.
Include model assessment. Select this box to create graphs that show the model's predicted accuracy for each selected offer.

Set random seed. When estimating the accuracy of a model based on a random percentage, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed.

Simulated sample size. Specify the number of records to be used in the sample when assessing the model. The default is 100.

Number of iterations. This enables you to stop building the model assessment after the number of iterations specified. Specify the maximum number of iterations; the default is 20. Note: Bear in mind that large sample sizes and high numbers of iterations will increase the amount of time it takes to build the model.

Display model evaluation. Select this option to display a graphical representation of the results in the generated model.

SLRM Node Settings Options


Figure 16-3 Self-learning node: Settings tab

The node settings options allow you to fine-tune the model-building process.


Maximum number of predictions per record. This option allows you to limit the number of predictions made for each record in the dataset. The default is 3. For example, you may have six offers (such as savings, mortgage, car loan, pension, credit card, and insurance) but you only want to know the best two to recommend; in this case you would set this field to 2. When you build the model and attach it to a table, you would see two prediction columns (and the associated confidence in the probability of the offer being accepted) per record. The predictions could be made up of any of the six possible offers.

Level of randomization. To prevent any bias (for example, in a small or incomplete dataset) and treat all potential offers equally, you can add a level of randomization to the selection of offers and the probability of their appearing as recommended offers. Randomization is expressed as a percentage, shown as decimal values between 0.0 (no randomization) and 1.0 (completely random). The default is 0.0.

Set random seed. When adding a level of randomization to selection of an offer, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node in Chapter 3 on p. 54.
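As a rough illustration of what a randomization level does conceptually, the sketch below blends each offer's model score with a uniform random component before ranking. This is an assumption used purely for illustration, not Clementine's actual scoring algorithm, and the offer names, scores, and weighting scheme are hypothetical.

import random

def rank_offers(scores, randomization=0.0, seed=None):
    """Rank offers by model score, blended with a random component.

    scores:        dict mapping offer name -> model score in [0, 1]
    randomization: 0.0 uses the scores as-is; 1.0 ranks offers completely at random
    """
    rng = random.Random(seed)   # a fixed seed reproduces the same ranking each run
    blended = {
        offer: (1.0 - randomization) * score + randomization * rng.random()
        for offer, score in scores.items()
    }
    return sorted(blended, key=blended.get, reverse=True)

# Example: descending sort order, best two of six offers recommended
scores = {"savings": 0.42, "mortgage": 0.18, "car loan": 0.33,
          "pension": 0.27, "credit card": 0.51, "insurance": 0.12}
print(rank_offers(scores, randomization=0.2, seed=1)[:2])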
Sort order. Select the order in which offers are to be displayed in the built model:

Descending. The model displays offers with the highest scores first. These are the offers that have the greatest probability of being accepted.

Ascending. The model displays offers with the lowest scores first. These are the offers that have the greatest probability of being rejected. For example, this may be useful when deciding which customers to remove from a marketing campaign for a specific offer.

Preferences for target fields. When building a model, there may be certain aspects of the data that you want to actively promote or remove. For example, if building a model that selects the best financial offer to promote to a customer, you may want to ensure that one particular offer is always included regardless of how well it scores against each customer. To include an offer in this panel and edit its preferences, click Add, type the offer's name (for example, Savings or Mortgage), and click OK.

Value. This shows the name of the offer that you added.

Preference. Specify the level of preference to be applied to the offer. Preference is expressed as a percentage, shown as decimal values between 0.0 (not preferred) and 1.0 (most preferred). The default is 0.0.

Always include. To ensure that a specific offer is always included in the predictions, select this box. Note: If the Preference is set to 0.0, the Always include setting is ignored.


Take account of model reliability. A well-structured, data-rich model that has been fine-tuned through several regenerations should always produce more accurate results compared to a brand new model with little data. To take advantage of the more mature model's increased reliability, select this box.

Generated SLRM Models


Note: Results are only shown on this tab if you select both Include model assessment and Display model evaluation on the Model options tab.
Figure 16-4 Self-learning model

When you execute a stream containing a SLRM model, the node estimates the accuracy of the predictions for each target field value (offer) and the importance of each predictor used. Note: If you selected Continue training existing model on the Model tab, the information shown on this dialog box is updated each time you regenerate the model.
View. When you have more than one offer, select the one for which you want to display results.


Model Performance. This shows the estimated model accuracy of each offer. The test set is generated through simulation.

Variable Importance. This is also generated through simulation and shows the impact of each predictor. This is done by removing each predictor in turn from the model and seeing how this affects the model's accuracy.

Association with Response. The third graph displays the association (correlation) of each predictor with the target variable.

SLRM Model Settings


The Settings tab for a SLRM model specifies options for modifying the built model. For example, you may use the SLRM node to build several different models using the same data and settings, then use this tab in each model to slightly modify the settings to see how that affects the results. Note: This tab is only available after the generated model has been added to a stream.
Figure 16-5 Self-learning model settings

Maximum number of predictions per record. This option allows you to limit the number of predictions made for each record in the dataset. The default is 3. For example, you may have six offers (such as savings, mortgage, car loan, pension, credit card, and insurance) but you only want to know the best two to recommend; in this case you would set this field to 2. When you build the model and attach it to a table, you would see two prediction columns (and the associated confidence in the probability of the offer being accepted) per record. The predictions could be made up of any of the six possible offers.

Level of randomization. To prevent any bias (for example, in a small or incomplete dataset) and treat all potential offers equally, you can add a level of randomization to the selection of offers and the probability of their appearing as recommended offers. Randomization is expressed as a percentage, shown as decimal values between 0.0 (no randomization) and 1.0 (completely random). The default is 0.0.

Set random seed. When adding a level of randomization to selection of an offer, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed. Note: When using the Set random seed option with records read from a database, a Sort node may be required prior to sampling in order to ensure the same result each time the node is executed. This is because the random seed depends on the order of records, which is not guaranteed to stay the same in a relational database. For more information, see Sort Node in Chapter 3 on p. 54.

Sort order. Select the order in which offers are to be displayed in the built model:

Descending. The model displays offers with the highest scores first. These are the offers that have the greatest probability of being accepted.

Ascending. The model displays offers with the lowest scores first. These are the offers that have the greatest probability of being rejected. For example, this may be useful when deciding which customers to remove from a marketing campaign for a specific offer.

Preferences for target fields. When building a model, there may be certain aspects of the data that you want to actively promote or remove. For example, if building a model that selects the best financial offer to promote to a customer, you may want to ensure that one particular offer is always included regardless of how well it scores against each customer. To include an offer in this panel and edit its preferences, click Add, type the offer's name (for example, Savings or Mortgage), and click OK.

Value. This shows the name of the offer that you added.

Preference. Specify the level of preference to be applied to the offer. Preference is expressed as a percentage, shown as decimal values between 0.0 (not preferred) and 1.0 (most preferred). The default is 0.0.

Always include. To ensure that a specific offer is always included in the predictions, select this box. Note: If the Preference is set to 0.0, the Always include setting is ignored.

Take account of model reliability. A well-structured, data-rich model that has been fine-tuned through several regenerations should always produce more accurate results compared to a brand new model with little data. To take advantage of the more mature model's increased reliability, select this box.

Chapter 17
Output Nodes

Overview of Output Nodes


Output nodes provide the means to obtain information about your data and models. They also provide a mechanism for exporting data in various formats to interface with your other software tools. The following output nodes are available:
The Table node displays the data in table format, which can also be written to a file. This is useful anytime that you need to inspect your data values or export them in an easily readable form. For more information, see Table Node on p. 528.

The Matrix node creates a table that shows relationships between fields. It is most commonly used to show the relationship between two symbolic fields, but it can also show relationships between flag fields or numeric fields. For more information, see Matrix Node on p. 532.

The Analysis node evaluates predictive models' ability to generate accurate predictions. Analysis nodes perform various comparisons between predicted values and actual values for one or more generated model nodes. They can also compare predictive models to each other. For more information, see Analysis Node on p. 537.

The Data Audit node provides a comprehensive first look at the data you bring into Clementine. Often used during data exploration, the data audit report shows summary statistics, as well as histograms and distribution graphs for each data field. The results are displayed in an easy-to-read matrix that can be sorted and used to generate full-size graphs and data preparation nodes. For more information, see Data Audit Node on p. 541.

The Transform node allows you to select and visually preview the results of transformations before applying them to selected fields. For more information, see Transform Node on p. 568.

The Statistics node provides basic summary information about numeric fields. It calculates summary statistics for individual fields and correlations between fields. For more information, see Statistics Node on p. 554.

The Means node compares the means between independent groups or between pairs of related fields to test whether a significant difference exists. For example, you could compare mean revenues before and after running a promotion or compare revenues from customers who did not receive the promotion with those who did. For more information, see Means Node on p. 558.



The Report node creates formatted reports containing fixed text as well as data and other expressions derived from the data. You specify the format of the report using text templates to define the fixed text and data output constructions. You can provide custom text formatting by using HTML tags in the template and by setting options on the Output tab. You can include data values and other conditional output by using CLEM expressions in the template. For more information, see Report Node on p. 563.

The Set Globals node scans the data and computes summary values that can be used in CLEM expressions. For example, you can use this node to compute statistics for a field called age and then use the overall mean of age in CLEM expressions by inserting the function @GLOBAL_MEAN(age). For more information, see Set Globals Node on p. 566.

If you have SPSS installed and licensed on your computer, the SPSS Output node allows you to call an SPSS procedure to analyze your Clementine data. You can view the results in a browser window or save results in the SPSS output file format. A wide variety of SPSS analytical procedures is accessible from Clementine. For more information, see SPSS Output Node on p. 573.

Managing Output
The Output manager shows the charts, graphs, and tables generated during a Clementine session. You can always reopen an output by double-clicking it in the manager; you do not have to rerun the corresponding stream or node.
To view the Output manager:
E Open the View menu and choose Managers. Click the Outputs tab.
Figure 17-1 Output manager

From the Output manager, you can:
Display existing output objects, such as histograms, evaluation charts, and tables.
Rename output objects.
Save output objects to disk or to the Predictive Enterprise Repository (if available).
Add output files to the current project.


Delete unsaved output objects from the current session.
Open saved output objects or retrieve them from the Predictive Enterprise Repository (if available).
To access these options, right-click anywhere on the Outputs tab.

Viewing Output
On-screen output is displayed in an output browser window. The output browser window has its own set of menus that allow you to print or save the output, or export it to another format. Note that specific options may vary depending on the type of output.
Printing, saving, and exporting data. More information is available as follows:

To print the output, use the Print menu option or button. Before you print, you can use Page Setup and Print Preview to set print options and preview the output.
To save the output to a Clementine output file (.cou), choose Save or Save As from the File menu.
To save the output in another format, such as text or HTML, choose Export from the File menu. For more information, see Exporting Output on p. 526.
To save the output in the Predictive Enterprise Repository, choose Store Output from the File menu. Note that this option requires a separate license.

Selecting cells and columns. The Edit menu contains various options for selecting, deselecting, and copying cells and columns, as appropriate for the current output type. For more information, see Selecting Cells and Columns on p. 527.
Generating new nodes. The Generate menu allows you to generate new nodes based on the contents of the output browser. The options vary depending on the type of output and the items in the output that are currently selected. For details about the node-generation options for a particular type of output, see the documentation for that output.

View output in an HTML browser


From the Advanced tab on the Linear, Logistic, and PCA/Factor model nodes, you can launch the displayed information in a separate browser, such as Internet Explorer. The information is output as HTML, enabling you to save it and reuse it elsewhere, such as on a corporate intranet or Internet site.

Figure 17-2 Sample Logistic Regression Equation node Advanced tab

To display the information in a browser, click the launch button, below the model icon in the top left of the Advanced tab dialog box.

Exporting Output
In the output browser window, you may choose to export the output to another format, such as text or HTML. The export formats vary depending on the type of output, but in general are similar to the file type options available if you select Save to file in the node used to generate the output.
To export output:
E In the output browser, open the File menu and choose Export. Then choose the file type that you want to create:


Tab Delimited (*.tab). This option generates a formatted text file containing the data values. This style is often useful for generating a plain-text representation of the information that can be imported into other applications. This option is available for the Table, Matrix, and Means nodes.

Comma Delimited (*.dat). This option generates a comma-delimited text file containing the data values. This style is often useful as a quick way to generate a data file that can be imported into spreadsheets or other data analysis applications. This option is available for the Table, Matrix, and Means nodes.

Transposed Tab Delimited (*.tab). This option is identical to the Tab Delimited option, but the data is transposed so that rows represent fields and the columns represent records.

Transposed Comma Delimited (*.dat). This option is identical to the Comma Delimited option, but the data is transposed so that rows represent fields and the columns represent records.

HTML (*.html). This option writes HTML-formatted output to a file or files.

Selecting Cells and Columns


Figure 17-3 Table browser window

A number of nodes, including the Table node, Matrix node, and Means node, generate tabular output. These output tables can be viewed and manipulated in similar ways, including selecting cells, copying all or part of the table to the Clipboard, generating new nodes based on the current selection, and saving and printing the table.
Selecting cells. To select a cell, click it. To select a rectangular range of cells, click one corner of the desired range, drag the mouse to the other corner of the range, and release the mouse button. To select an entire column, click the column heading. To select multiple columns, use Shift-click or Ctrl-click on column headings.

When you make a new selection, the old selection is cleared. By holding down the Ctrl key while selecting, you can add the new selection to any existing selection instead of clearing the old selection. You can use this method to select multiple, noncontiguous regions of the table. The Edit menu also contains the Select All and Clear Selection options.


Reordering columns. The Table node and Means node output browsers allow you to move columns in the table by clicking a column heading and dragging it to the desired location. You can move only one column at a time.

Table Node
The Table node allows you to create a table from your data, which can either be displayed on the screen or written to a file. This is useful anytime you need to inspect your data values or export them in an easily readable form.

Table Node Settings Tab


Figure 17-4 Table node: Settings tab

Highlight records where. You can highlight records in the table by entering a CLEM expression that is true for the records to be highlighted. This option is enabled only when Output to screen is selected.

Table Node Format Tab


The Format tab contains options used to specify formatting on a per-field basis. This tab is shared with the Type node. For more information, see Field Format Settings Tab on p. 82.


Output Node Output Tab


Figure 17-5 Output node Output tab

For nodes that generate table-style output, the Output tab lets you specify the format and location of the results.
Output name. Specifies the name of the output produced when the node is executed. Auto chooses a name based on the node that generates the output. Optionally, you can select Custom to specify a different name.

Output to screen (the default). Creates an output object to view online. The output object will appear on the Outputs tab of the manager window when the output node is executed.

Output to file. Saves the output to a file when the node is executed. If you choose this option, enter a filename (or navigate to a directory and specify a filename using the File Chooser button) and select a file type. Note that some file types may be unavailable for certain types of output.

Data (tab delimited) (*.tab). This option generates a formatted text file containing the data values. This style is often useful for generating a plain-text representation of the information that can be imported into other applications. This option is available for the Table, Matrix, and Means nodes.

Data (comma delimited) (*.dat). This option generates a comma-delimited text file containing the data values. This style is often useful as a quick way to generate a data file that can be imported into spreadsheets or other data analysis applications. This option is available for the Table, Matrix, and Means nodes.

HTML (*.html). This option writes HTML-formatted output to a file or files. For tabular output (from the Table, Matrix, or Means nodes), a set of HTML files contains a contents panel listing field names and the data in an HTML table. The table may be split over multiple HTML files if the number of rows in the table exceeds the Lines per page specification. In this case, the contents panel contains links to all table pages and provides a means of navigating the table. For non-tabular output, a single HTML file is created containing the results of the node. Note: If the HTML output contains only formatting for the first page, select Paginate output and adjust the Lines per page specification to include all output on a single page. Or if the output template for nodes such as the Report node contains custom HTML tags, be sure you have specified Custom as the format type.

Text File (*.txt). This option generates a text file containing the output. This style is often useful for generating output that can be imported into other applications, such as word processors or presentation software. This option is not available for some nodes.

Output object (*.cou). Output objects saved in this format can be opened and viewed in Clementine, added to projects, and published and tracked using the SPSS Predictive Enterprise Repository.

Output view. For the Means node, you can specify whether simple or advanced output is displayed by default. Note you can also toggle between these views when browsing the generated output. For more information, see Means Node Output Browser on p. 561.

Format. For the Report node, you can choose whether output is automatically formatted or formatted using HTML included in the template. Select Custom to allow HTML formatting in the template.

Title. For the Report node, you can specify optional title text that will appear at the top of the report output.

Highlight inserted text. For the Report node, select this option to highlight text generated by CLEM expressions in the Report template. For more information, see Report Node Template Tab on p. 564. This option is not recommended when using Custom formatting.

Lines per page. For the Report node, specify a number of lines to include on each page during Auto formatting of the output report.

Transpose data. This option transposes the data before export, so that rows represent fields and the columns represent records.

Note: For large tables, the above options can be somewhat inefficient, especially when working with a remote server. In such cases, using a File output node provides much better performance. For more information, see Flat File Node in Chapter 18 on p. 587.


Table Browser
Figure 17-6 Table browser window

The table browser displays tabular data and allows you to perform standard operations including selecting and copying cells, reordering columns, and saving and printing the table. For more information, see Selecting Cells and Columns on p. 527.
Searching the table. The search button (with the binoculars icon) on the main toolbar activates the search toolbar, allowing you to search the table for specific values. You can search forward or backward in the table, you can specify a case-sensitive search (the Aa button), and you can interrupt a search-in-progress with the interrupt search button.

Figure 17-7 Table with search controls activated

Generating new nodes. The Generate menu contains node generation operations.

Select Node (Records). Generates a Select node that selects the records for which any cell in the table is selected.

Select (And). Generates a Select node that selects records containing all of the values selected in the table.

Select (Or). Generates a Select node that selects records containing any of the values selected in the table.

Derive (Records). Generates a Derive node to create a new flag field. The flag field contains T for records for which any cell in the table is selected and F for the remaining records.

Derive (And). Generates a Derive node to create a new flag field. The flag field contains T for records containing all of the values selected in the table and F for the remaining records.

Derive (Or). Generates a Derive node to create a new flag field. The flag field contains T for records containing any of the values selected in the table and F for the remaining records.

Matrix Node
The Matrix node allows you to create a table that shows relationships between fields. It is most commonly used to show the relationship between two symbolic fields, but it can also be used to show relationships between flag fields or between numeric fields.

Matrix Node Settings Tab


The Settings tab lets you specify options for the structure of the matrix.

Figure 17-8 Matrix node: Settings tab

Fields. Select a field selection type from the following options:

Selected. This option allows you to select a symbolic field for the rows and one for the columns of the matrix. The rows and columns of the matrix are defined by the list of values for the selected symbolic field. The cells of the matrix contain the summary statistics selected below.

All flags (true values). This option requests a matrix with one row and one column for each flag field in the data. The cells of the matrix contain the counts of double positives for each flag combination. In other words, for a row corresponding to bought bread and a column corresponding to bought cheese, the cell at the intersection of that row and column contains the number of records for which both bought bread and bought cheese are true.

All numerics. This option requests a matrix with one row and one column for each numeric field. The cells of the matrix represent the sum of the cross-products for the corresponding pair of fields. In other words, for each cell in the matrix, the values for the row field and the column field are multiplied for each record and then summed across records.
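For a quick numeric illustration of the sum of cross-products, here is a stand-alone Python sketch (not Clementine code; the field values are hypothetical):

import numpy as np

# Hypothetical numeric fields, one value per record
revenue = np.array([100.0, 250.0, 175.0])
units   = np.array([2.0,   5.0,   3.0])

# Cell at the intersection of the revenue row and the units column:
cross_product_sum = np.sum(revenue * units)   # 100*2 + 250*5 + 175*3 = 1975
print(cross_product_sum)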
Include missing values. Includes user-missing (blank) and system missing ($null$) values in the row and column output. For example, if the value N/A has been defined as user-missing for the selected column field, a separate column labeled N/A will be included in the table (assuming this value actually occurs in the data) just like any other category. If this option is deselected, the N/A column is excluded regardless of how often it occurs. For more information, see Overview of Missing Values in Chapter 6 in Clementine 11.1 User's Guide. Note: The option to include missing values applies only when selected fields are cross-tabulated. Blank values are mapped to $null$ and are excluded from aggregation for the function field when the mode is Selected and the content is set to Function and for all numeric fields when the mode is set to All Numerics.


Cell contents. If you have chosen Selected fields above, you can specify the statistic to be used in the cells of the matrix. Select a count-based statistic, or select an overlay field to summarize values of a numeric field based on the values of the row and column fields.

Cross-tabulations. Cell values are counts and/or percentages of how many records have the corresponding combination of values. You can specify which cross-tabulation summaries you want using the options on the Appearance tab. The global chi-square value is also displayed along with the significance. For more information, see Matrix Node Output Browser on p. 535.

Function. If you select a summary function, cell values are a function of the selected overlay field values for cases having the appropriate row and column values. For example, if the row field is Region, the column field is Product, and the overlay field is Revenue, then the cell in the Northeast row and the Widget column will contain the sum (or average, minimum, or maximum) of revenue for widgets sold in the northeast region. The default summary function is Mean. You can select another function for summarizing the function field. Options include Mean, Sum, SDev (standard deviation), Max (maximum), and Min (minimum).

Matrix Node Appearance Tab


The Appearance tab allows you to control sorting and highlighting options for the matrix, as well as statistics presented for cross-tabulation matrices.
Figure 17-9 Matrix node: Appearance tab

Rows and columns. Controls the sorting of row and column headings in the matrix. The default is Unsorted. Select Ascending or Descending to sort row and column headings in the specified direction.

Overlay. Allows you to highlight extreme values in the matrix. Values are highlighted based on cell counts (for cross-tabulation matrices) or calculated values (for function matrices).


Highlight top. You can request the highest values in the matrix to be highlighted (in red). Specify the number of values to highlight.

Highlight bottom. You can also request the lowest values in the matrix to be highlighted (in green). Specify the number of values to highlight.

Note: For the two highlighting options, ties can cause more values than requested to be highlighted. For example, if you have a matrix with six zeros among the cells and you request Highlight bottom 5, all six zeros will be highlighted.

Cross-tabulation cell contents. For cross-tabulations, you can specify the summary statistics contained in the matrix for cross-tabulation matrices. These options are not available when either the All Numerics or Function option is selected on the Settings tab.

Counts. Cells include the number of records with the row value that have the corresponding column value. This is the only default cell content.

Expected values. The expected value for the number of records in the cell, assuming that there is no relationship between the rows and columns. Expected values are based on the following formula:

p(row value) * p(column value) * total number of records

Residuals. The difference between observed and expected values.

Percentage of row. The percentage of all records with the row value that have the corresponding column value. Percentages sum to 100 within rows.

Percentage of column. The percentage of all records with the column value that have the corresponding row value. Percentages sum to 100 within columns.

Percentage of total. The percentage of all records having the combination of column value and row value. Percentages sum to 100 over the whole matrix.

Include row and column totals. Adds a row and a column to the matrix for column and row totals.

Apply Settings. (Output Browser only) Enables you to make changes to the appearance of the Matrix node output without having to close and reopen the Output Browser. Make the changes on this tab of the Output Browser, click this button, and then select the Matrix tab to see the effect of the changes.

Matrix Node Output Browser


The matrix browser displays cross-tabulated data and allows you to perform operations on the matrix, including selecting cells, copying the matrix to the Clipboard in whole or in part, generating new nodes based on the matrix selection, and saving and printing the matrix. The matrix browser may also be used to display output from certain models, such as Naive Bayes models from Oracle.

Figure 17-10 Matrix browser

The File and Edit menus provide the usual options for printing, saving, and exporting output, and for selecting and copying data. For more information, see Viewing Output on p. 525.
Chi-square. For a cross-tabulation of two categorical fields, the global Pearson chi-square is also displayed below the table. This test indicates the probability that the two fields are unrelated, based on the difference between observed counts and the counts you would expect if no relationship exists. For example, if there were no relationship between customer satisfaction and store location, you would expect similar satisfaction rates across all stores. But if customers at certain stores consistently report higher rates than others, you might suspect it wasn't a coincidence. The greater the difference, the smaller the probability that it was the result of chance sampling error alone. The chi-square test indicates the probability that the two fields are unrelated, in which case any differences between observed and expected frequencies are the result of chance alone. If this probability is very small (typically less than 5%), then the relationship between the two fields is said to be significant. If there is only one column or one row (a one-way chi-square test), the degrees of freedom is the number of cells minus one. For a two-way chi-square, the degrees of freedom is the number of rows minus one times the number of columns minus one. Use caution when interpreting the chi-square statistic if any of the expected cell frequencies are less than five. The chi-square test is available only for a cross-tabulation of two fields. (When All flags or All numerics is selected on the Settings tab, this test is not displayed.)
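The expected values, residuals, and two-way chi-square test described here can be reproduced outside Clementine with a few lines of Python using scipy. This is an illustration only; the cross-tabulated counts are hypothetical.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulated counts: rows = store location, columns = satisfaction level
observed = np.array([[30, 20],
                     [25, 45]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

residuals = observed - expected          # observed minus expected counts
# dof = (rows - 1) * (columns - 1) = 1 for a 2 x 2 table
print(chi2, p_value, dof)
print(expected)
print(residuals)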
Generate menu. The Generate menu contains node generation operations. These operations are available only for cross-tabulated matrices, and you must have at least one cell selected in the matrix.

Select Node. Generates a Select node that selects the records that match any selected cell in the matrix.


Derive Node (Flag). Generates a Derive node to create a new flag field. The flag field contains T for records that match any selected cell in the matrix and F for the remaining records.

Derive Node (Set). Generates a Derive node to create a new set field. The set field contains one category for each contiguous set of selected cells in the matrix.

Analysis Node
The Analysis node allows you to evaluate the ability of a model to generate accurate predictions. Analysis nodes perform various comparisons between predicted values and actual values (your target or Out field) for one or more generated model nodes. Analysis nodes can also be used to compare predictive models to other predictive models. When you execute an Analysis node, a summary of the analysis results is automatically added to the Analysis section on the Summary tab for each generated model node in the executed stream. The detailed analysis results appear on the Outputs tab of the manager window or can be written directly to a file. Note: Because Analysis nodes compare predicted values to actual values, they are only useful with supervised models (those that require an Out field). For unsupervised models such as clustering algorithms, there are no actual results available to use as a basis for comparison.

Analysis Node Analysis Tab


The Analysis tab allows you to specify the details of the analysis.
Figure 17-11 Analysis node: Analysis tab


Coincidence matrices (for symbolic targets). Shows the pattern of matches between each generated (predicted) field and its target field for symbolic targets. A table is displayed with rows defined by actual values and columns defined by predicted values, with the number of records having that pattern in each cell. This is useful for identifying systematic errors in prediction. If there is more than one generated field related to the same output field but produced by different models, the cases where these fields agree and disagree are counted and the totals are displayed. For the cases where they agree, another set of correct/wrong statistics is displayed.

Performance evaluation. Shows performance evaluation statistics for models with symbolic outputs. This statistic, reported for each category of the output field(s), is a measure of the average information content (in bits) of the model for predicting records belonging to that category. It takes the difficulty of the classification problem into account, so accurate predictions for rare categories will earn a higher performance evaluation index than accurate predictions for common categories. If the model does no better than guessing for a category, the performance evaluation index for that category will be 0.

Confidence figures (if available). For models that generate a confidence field, this option reports statistics on the confidence values and their relationship to predictions. There are two settings for this option:

Threshold for. Reports the confidence level above which the accuracy will be the specified percentage.

Improve accuracy. Reports the confidence level above which the accuracy is improved by the specified factor. For example, if the overall accuracy is 90% and this option is set to 2.0, the reported value will be the confidence required for 95% accuracy.

Split by partition. If a partition field is used to split records into training, test, and validation samples, select this option to display results separately for each partition. For more information, see Partition Node in Chapter 4 on p. 119.

Note: When splitting by partition, records with null values in the partition field are excluded from the analysis. This will never be an issue if a Partition node is used, since Partition nodes do not generate null values.

User defined analysis. You can specify your own analysis calculation to be used in evaluating your model(s). Use CLEM expressions to specify what should be computed for each record and how to combine the record-level scores into an overall score. Use the functions @TARGET and @PREDICTED to refer to the target (actual output) value and the predicted value, respectively. (An illustrative sketch of this record-level and overall-score pattern follows the list below.)

If. Specify a conditional expression if you need to use different calculations depending on some condition.

Then. Specify the calculation if the If condition is true.

Else. Specify the calculation if the If condition is false.

Use. Select a statistic to compute an overall score from the individual scores.

Break down analysis by fields. Shows the symbolic fields available for breaking down the analysis. In addition to the overall analysis, a separate analysis will be reported for each category of each breakdown field.
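As referenced above, the following stand-alone Python sketch illustrates the record-level calculation and overall-score pattern of a user-defined analysis. In Clementine itself the equivalent logic is written as CLEM expressions using @TARGET and @PREDICTED; the tolerance rule and the values here are hypothetical.

# Record-level rule: score 1 if the prediction is within 5 units of the target, else 0.
# Overall score: the mean of the record-level scores (the "Use" statistic).
targets     = [98.0, 105.0, 87.0, 110.0]   # hypothetical actual values
predictions = [95.0, 112.0, 88.0, 109.0]   # hypothetical predicted values

def record_score(target, predicted):
    return 1 if abs(target - predicted) < 5 else 0

scores = [record_score(t, p) for t, p in zip(targets, predictions)]
overall = sum(scores) / len(scores)        # 3 of 4 within tolerance -> 0.75
print(overall)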


Analysis Output Browser


The analysis output browser lets you see the results of executing the Analysis node. The usual saving, exporting, and printing options are available from the File menu. For more information, see Viewing Output on p. 525.
Figure 17-12 Analysis output browser

When you first browse Analysis output, the results are expanded. To hide results after viewing them, use the expander control to the left of the item to collapse the specific results you want to hide or click the Collapse All button to collapse all results. To see results again after collapsing them, use the expander control to the left of the item to show the results or click the Expand All button to show all results.

Results for output field. The Analysis output contains a section for each output field for which there is a corresponding prediction field created by a generated model.

Comparing. Within the output field section is a subsection for each prediction field associated with that output field. For symbolic output fields, the top level of this section contains a table showing the number and percentage of correct and incorrect predictions and the total number of records in the stream. For numeric output fields, this section shows the following information (a small computational sketch of these measures appears at the end of this section):

Minimum Error. Shows the minimum error (difference between observed and predicted values).


Maximum Error. Shows the maximum error. Mean Error. Shows the average (mean) of errors across all records. This indicates whether

there is a systematic bias (a stronger tendency to overestimate than to underestimate, or vice versa) in the model.
Mean Absolute Error. Shows the average of the absolute values of the errors across all records.

Indicates the average magnitude of error, independent of the direction.


Standard Deviation. Shows the standard deviation of the errors. Linear Correlation. Shows the linear correlation between the predicted and actual values.

This statistic varies between 1.0 and 1.0. Values close to +1.0 indicate a strong positive association, so that high predicted values are associated with high actual values and low predicted values are associated with low actual values. Values close to 1.0 indicate a strong negative association, so that high predicted values are associated with low actual values, and vice versa. Values close to 0.0 indicate a weak association, so that predicted values are more or less independent of actual values.
Occurrences. Shows the number of records used in the analysis.

Coincidence Matrix. For symbolic output fields, if you requested a coincidence matrix in the analysis options, a subsection appears here containing the matrix. The rows represent actual observed values, and the columns represent predicted values. The cell in the table indicates the number of records for each combination of predicted and actual values.

Performance Evaluation. For symbolic output fields, if you requested performance evaluation statistics in the analysis options, the performance evaluation results appear here. Each output category is listed with its performance evaluation statistic.

Confidence Values Report. For symbolic output fields, if you requested confidence values in the analysis options, the values appear here. The following statistics are reported for model confidence values:

Range. Shows the range (smallest and largest values) of confidence values for records in the stream data.

Mean Correct. Shows the average confidence for records that are classified correctly.

Mean Incorrect. Shows the average confidence for records that are classified incorrectly.

Always Correct Above. Shows the confidence threshold above which predictions are always correct and shows the percentage of cases meeting this criterion.

Always Incorrect Below. Shows the confidence threshold below which predictions are always incorrect and shows the percentage of cases meeting this criterion.

X% Accuracy Above. Shows the confidence level at which accuracy is X%. X is approximately the value specified for Threshold for in the Analysis options. For some models and data sets, it is not possible to choose a confidence value that gives the exact threshold specified in the options (usually due to clusters of similar cases with the same confidence value near the threshold). The threshold reported is the closest value to the specified accuracy criterion that can be obtained with a single confidence value threshold.

X Fold Correct Above. Shows the confidence value at which accuracy is X times better than it is for the overall data set. X is the value specified for Improve accuracy in the Analysis options.


Agreement between. If two or more generated models that predict the same output field are included in the stream, you will also see statistics on the agreement between predictions generated by the models. This includes the number and percentage of records for which the predictions agree (for symbolic output fields) or error summary statistics (for numeric output fields). For symbolic fields, it includes an analysis of predictions compared to actual values for the subset of records on which the models agree (generate the same predicted value).

Data Audit Node


The Data Audit node provides a comprehensive first look at the data you bring into Clementine, presented in an easy-to-read matrix that can be sorted and used to generate full-size graphs and a variety of data preparation nodes.
Figure 17-13 Data Audit browser

The audit report displays summary statistics, histograms, and distribution graphs that may be useful in gaining a preliminary understanding of the data. The Quality tab in the audit report displays information about outliers, extremes, and missing values, and offers tools for handling these values.
Using the Data Audit Node

The Data Audit node can be attached directly to a source node or downstream from an instantiated Type node. You can also generate a number of data preparation nodes based on the results. For example, you can generate a Filter node that excludes fields with too many missing values to be useful in modeling, and generate a SuperNode that imputes missing values for any or all of the fields that remain. This is where the real power of the audit comes in, allowing you not only to assess the current state of your data, but to take action based on the assessment. For more information, see Preparing Data for Analysis (Data Audit) in Chapter 5 in Clementine 11.1 Applications Guide.

Figure 17-14 Stream with Missing Values SuperNode

Screening or sampling the data. Because an initial audit is particularly effective when dealing with big data, a Sample node may be used to reduce processing time during the initial exploration by selecting only a subset of records. The Data Audit node can also be used in combination with nodes such as Feature Selection and Anomaly Detection in the exploratory stages of analysis.

Data Audit Node Settings Tab


The Settings tab allows you to specify basic parameters for the audit.
Figure 17-15 Data Audit node: Settings tab


Default. You can simply attach the node to your stream and click Execute to generate an audit report for all fields based on default settings, as follows:

If there are no Type node settings, all fields are included in the report. If there are Type settings (regardless of whether or not they are instantiated), all In, Out, and Both fields are included in the display. If there is a single Out field, it is used as the Overlay field. If there is more than one Out field specified, no default overlay is specified.
Use custom fields. Select this option to manually select fields. Use the field chooser button on the right to select fields individually or by type.


Overlay field. The overlay field is used in drawing the thumbnail graphs shown in the audit report. In the case of a numeric range field, bivariate statistics (covariance and correlation) are also calculated. If a single Out field is present based on Type node settings, it is used as the default overlay field as described above. Alternatively, you can select Use custom fields in order to specify an overlay.
Display. Allows you to specify whether graphs are available in the output, and to choose the statistics displayed by default.


Graphs. Displays a graph for each selected field; either a distribution (bar) graph, histogram, or scatterplot as appropriate for the data. Graphs are displayed as thumbnails in the initial report, but full-sized graphs and graph nodes can also be generated. For more information, see Data Audit Output Browser on p. 545.
Basic/Advanced statistics. Specifies the level of statistics displayed in the output by default. While this setting determines the initial display, all statistics are available in the output regardless of this setting. For more information, see Display Statistics on p. 546.
Median and mode. Calculates the median and mode for all fields in the report. Note that with large data sets, these statistics may increase processing time, since they take longer than others to compute. In the case of the median only, the reported value may be based on a sample of 2000 records (rather than the full data set) in some cases. This sampling is done on a per-field basis in cases where memory limits would otherwise be exceeded. When sampling is in effect, the results will be labeled as such in the output (Sample Median rather than just Median). All statistics other than the median are always computed using the full data set.

Empty or typeless fields. When used with instantiated data, typeless fields are not included in the audit report. To include typeless fields (including empty fields), select Clear All Values in any upstream Type nodes. This ensures that data are not instantiated, causing all fields to be included in the report. For example, this may be useful if you want to obtain a complete list of all fields or generate a Filter node that will exclude those that are empty. For more information, see Filtering Fields with Missing Data on p. 552.

Data Audit Quality Tab


The Quality tab in the Data Audit node provides options for handling missing values, outliers, and extreme values.

Figure 17-16 Data Audit node: Quality tab

Missing Values

Count of records with valid values. Select this option to show the number of records with valid values for each evaluated field. Note that null (undefined) values, blank values, white spaces, and empty strings are always treated as invalid values.
Breakdown counts of records with invalid values. Select this option to show the number of records with each type of invalid value for each field.


Outliers and Extreme Values

Detection method for outliers and extreme values. Two methods are supported:

Standard deviation from the mean. Detects outliers and extremes based on the number of standard deviations from the mean. For example, if you have a field with a mean of 100 and a standard deviation of 10, you could specify 3.0 to indicate that any value below 70 or above 130 should be treated as an outlier.

Interquartile range. Detects outliers and extremes based on the interquartile range, which is the range within which the two central quartiles fall (between the 25th and 75th percentiles). For example, based on the default setting of 1.5, the lower threshold for outliers would be Q1 - 1.5 * IQR and the upper threshold would be Q3 + 1.5 * IQR. Note that using this option may slow performance on large data sets.
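As a hypothetical worked example, if Q1 = 20 and Q3 = 60, then IQR = 40; with the default multiplier of 1.5, values below 20 - 1.5 * 40 = -40 or above 60 + 1.5 * 40 = 120 would be flagged.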


Data Audit Output Browser


The Data Audit browser is a powerful tool for gaining an overview of your data. The Audit tab displays thumbnail graphs and statistics for all fields, while the Quality tab displays information about outliers, extremes, and missing values. Based on the initial graphs and summary statistics, you might decide to recode a numeric field, derive a new field, or reclassify the values of a set field. Or you may want to explore further using more sophisticated visualization. You can do this right from the audit report by using the Generate menu to create any number of nodes that can be used to transform or visualize the data.
Figure 17-17 Generating a Missing Values SuperNode

Sort columns by clicking on the column header, or reorder columns using drag and drop. Most standard output operations are also supported. For more information, see Viewing Output on p. 525. View values and ranges for fields by double-clicking a field in the Type or Unique columns. Use the toolbar or Edit menu to show or hide value labels, or to choose the statistics you want to display. For more information, see Display Statistics on p. 546.

Viewing and Generating Graphs


If no overlay is selected, the Audit tab displays either bar charts (for set or flag fields) or histograms (range fields).
Figure 17-18 Excerpt of audit results without an overlay field

For a set or flag field overlay, the graphs are colored by the values of the overlay.

Figure 17-19 Excerpt of audit results with a set field overlay

For a scale field overlay, two-dimensional scatterplots are generated rather than one-dimensional bars and histograms. In this case, the x axis maps to the overlay field, enabling you to see the same scale on all x axes as you read down the table.
Figure 17-20 Excerpt of audit results with a scale field overlay

For Flag or Set fields, hold the mouse cursor over a bar to display the underlying value or label in a ToolTip. For Flag or Set fields, use the toolbar to toggle the orientation of thumbnail graphs from horizontal to vertical. To generate a full-sized graph from any thumbnail, double-click on the thumbnail, or select a thumbnail and choose Graph output from the Generate menu. Note: If a thumbnail graph was based on sampled data, the generated graph will contain all cases if the original data stream is still open. To generate a matching graph node, select one or more fields on the Audit tab, and select Graph node from the Generate menu. The resulting node is added to the stream canvas and can be used to re-create the graph each time the stream is executed. If an overlay set has more than 100 values, a warning is raised and the overlay is not included.

Display Statistics
The Display Statistics dialog box allows you to choose the statistics displayed on the Audit tab. The initial settings are specified in the Data Audit node. For more information, see Data Audit Node Settings Tab on p. 542.

Figure 17-21 Display Statistics

Minimum. The smallest value of a numeric variable.

Maximum. The largest value of a numeric variable.

Sum. The sum or total of the values, across all cases with nonmissing values.

Range. The difference between the largest and smallest values of a numeric variable, the maximum minus the minimum.

Mean. A measure of central tendency. The arithmetic average, the sum divided by the number of cases.

Standard Error of Mean. A measure of how much the value of the mean may vary from sample to sample taken from the same distribution. It can be used to roughly compare the observed mean to a hypothesized value (that is, you can conclude the two values are different if the ratio of the difference to the standard error is less than -2 or greater than +2).

Standard Deviation. A measure of dispersion around the mean, equal to the square root of the variance. The standard deviation is measured in the same units as the original variable.

Variance. A measure of dispersion around the mean, equal to the sum of squared deviations from the mean divided by one less than the number of cases. The variance is measured in units that are the square of those of the variable itself.

Skewness. A measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness value of 0. A distribution with a significant positive skewness has a long right tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.

Standard Error of Skewness. The ratio of skewness to its standard error can be used as a test of normality (that is, you can reject normality if the ratio is less than -2 or greater than +2). A large positive value for skewness indicates a long right tail; an extreme negative value indicates a long left tail.

Kurtosis. A measure of the extent to which observations cluster around a central point. For a normal distribution, the value of the kurtosis statistic is zero. Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution, and negative kurtosis indicates that the observations cluster less and have shorter tails.


Standard Error of Kurtosis. The ratio of kurtosis to its standard error can be used as a test of normality (that is, you can reject normality if the ratio is less than -2 or greater than +2). A large positive value for kurtosis indicates that the tails of the distribution are longer than those of a normal distribution; a negative value for kurtosis indicates shorter tails (becoming like those of a box-shaped uniform distribution).

Unique. The number of distinct values observed for the field.

Valid. The number of valid cases, having neither the system-missing value nor a value defined as user-missing.

Median. The value above and below which half of the cases fall, the 50th percentile. If there is an even number of cases, the median is the average of the two middle cases when they are sorted in ascending or descending order. The median is a measure of central tendency not sensitive to outlying values (unlike the mean, which can be affected by a few extremely high or low values).

Mode. The most frequently occurring value. If several values share the greatest frequency of occurrence, each of them is a mode.

Note that median and mode are suppressed by default in order to improve performance but can be selected on the Settings tab in the Data Audit node. For more information, see Data Audit Node Settings Tab on p. 542.
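For reference, the variance, standard deviation, and standard error of the mean described above correspond to the usual formulas (shown here as a convenience; $x_i$ are the $n$ valid values and $\bar{x}$ their mean):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \qquad SE(\bar{x}) = \frac{s}{\sqrt{n}}$$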
Statistics for Overlays

If a numeric range overlay field is in use, the following statistics are also available:
Correlation (Pearson). Measure of the strength of association between two variables. Two variables are correlated if a change in the value of one signifies a change in the other. Values close to 1 (or -1) indicate a very strong relationship; values close to 0 indicate a weak or no relationship. The sign of the coefficient indicates the direction of the relationship, where a positive correlation means that increases in one variable tend to accompany increases in the other variable.

Correlation T. The test statistic for the correlation coefficient, indicating whether the correlation is significantly different from zero.

Correlation T df. Degrees of freedom for the test statistic.

Correlation T significance. Significance of the t statistic.

Covariance. An unstandardized measure of association between two variables, equal to the cross-product deviation divided by N-1.
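These quantities follow the standard formulas for paired values of two fields $x$ and $y$ (given here for orientation; the node's exact computation is not spelled out in this section):

$$\mathrm{cov}(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}), \qquad r = \frac{\mathrm{cov}(x, y)}{s_x s_y}, \qquad t = r\sqrt{\frac{N-2}{1 - r^2}}, \quad df = N - 2$$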


Data Audit Browser Quality Tab


Figure 17-22 Quality report in the Data Audit browser

The Quality tab in the Data Audit browser displays the results of the data quality analysis, and allows you to specify treatments for outliers, extremes, and missing values.

Imputing Missing Values


The audit report lists the percentage of complete records for each field, along with the number of valid, null, and blank values. You can choose to impute missing values for specific fields as appropriate, and then generate a SuperNode to apply these transformations.
E In the Impute Missing column, specify the type of values you want to impute, if any. You can choose to impute blanks, nulls, both, or specify a custom condition or expression that selects the values to impute. There are several types of missing values recognized by Clementine:
Null or system-missing values. These are nonstring values that have been left blank in the database or source file and have not been specifically defined as missing in a source or Type node. System-missing values are displayed as $null$ in Clementine. Note that empty strings are not considered nulls in Clementine, although they may be treated as nulls by certain databases (see below).
Empty strings and white space. Clementine treats empty string values and white space (strings with no visible characters) as distinct from null values. Empty strings are treated as equivalent to white space for most purposes. For example, if you select the option to treat white space as blanks in a source or Type node, this setting applies to empty strings as well.
Blank or user-defined missing values. These are values such as unknown, 99, or -1 that are explicitly defined in a source node or Type node as missing. Optionally, you can also choose to treat nulls and white space as blanks, which allows them to be flagged for special treatment and to be excluded from most calculations. For example, you can use the @BLANK function to treat these values, along with other types of missing values, as blanks. For more information, see Using the Values Dialog Box in Chapter 4 on p. 75.
E In the Method column, specify the method you want to use.

The following methods are available for imputing missing values:


Fixed. Substitutes a fixed value (either the field mean, midpoint of the range, or a constant that you specify).
Random. Substitutes a random value based on a normal or uniform distribution.

Expression. Allows you to specify a custom expression. For example, you could replace values with a global variable created by the Set Globals node.


Algorithm. Substitutes a value predicted by a model based on the C&RT algorithm. For each field imputed using this method, there will be a separate C&RT model, along with a Filler node that replaces blanks and nulls with the value predicted by the model. A Filter node is then used to remove the prediction fields generated by the model.
E To generate a Missing Values SuperNode, from the menus choose: Generate > Missing Values SuperNode
Figure 17-23 Missing Values SuperNode dialog box

E Select All fields or Selected fields only, and specify a sample size if desired. (The specified sample is a percentage; by default, 10% of all records are sampled.)


E Click OK to add the generated SuperNode to the stream canvas.
E Attach the SuperNode to the stream to apply the transformations.
Figure 17-24 Adding the SuperNode to the stream


Within the SuperNode, a combination of generated model, Filler, and Filter nodes is used as appropriate. To understand how it works, you can edit the SuperNode and click Zoom In, and you can add, edit, or remove specific nodes within the SuperNode to fine-tune the behavior.
Figure 17-25 Zooming in on the SuperNode

Handling Outliers and Extreme Values


The audit report lists the number of outliers and extremes for each field, based on the detection options specified in the Data Audit node. For more information, see Data Audit Quality Tab on p. 543. You can choose to coerce, discard, or nullify these values for specific fields as appropriate, and then generate a SuperNode to apply the transformations.
E In the Action column, specify handling for outliers and extremes for specific fields as desired.

The following actions are available for handling outliers and extremes:
Coerce. Replaces outliers and extreme values with the nearest value that would not be considered extreme. For example, if an outlier is defined to be anything above or below 3 standard deviations, then all outliers would be replaced with the highest or lowest value within this range.
Discard. Discards records with outlying or extreme values for the specified field.

Nullify. Replaces outliers and extremes with the null or system-missing value.

Coerce outliers / discard extremes. Discards extreme values only.

Coerce outliers / nullify extremes. Nullifies extreme values only.
E To generate the SuperNode, from the menus choose: Generate > Outlier & Extreme SuperNode
Figure 17-26 Outlier SuperNode dialog box

E Select All fields or Selected fields only, and then click OK to add the generated SuperNode to the stream canvas.
E Attach the SuperNode to the stream to apply the transformations.


Optionally, you can edit the SuperNode and zoom in to browse or make changes. Within the SuperNode, values are discarded, coerced, or nullified using a series of Select and/or Filler nodes as appropriate.

Filtering Fields with Missing Data


From the Data Audit browser, you can create a new Filter node based on the results of the Quality analysis.
Figure 17-27 Generate Filter from Quality dialog box

Mode. Select the desired operation for specified fields, either Include or Exclude.

Selected fields. The Filter node will include/exclude the fields selected on the Quality tab.

For example, you could sort the table on the % Complete column, use shift-click to select the least complete fields, and then generate a Filter node that excludes these fields.
Fields with quality percentage higher than. The Filter node will include/exclude fields where the percentage of complete records is greater than the specified threshold. The default threshold is 50%.
Filtering Empty or Typeless Fields

Note that after data values have been instantiated, typeless or empty fields are excluded from the audit results, and from most other output in Clementine. These fields are ignored for purposes of modeling, but may bloat or clutter the data. If so, you can use the Data Audit browser to generate a Filter node that removes these fields from the stream.
E To make sure that all fields are included in the audit, including empty or typeless fields, select Clear All Values in the upstream source or Type node, or set Values to <pass> for all fields.

Figure 17-28 Passing uninstantiated values in the Type node

E In the Data Audit browser, sort on the % Complete column, select the fields that have zero valid values (or some other threshold), and use the Generate menu to produce a Filter node which can be added to the stream.

Selecting Records with Missing Data


From the Data Audit browser, you can create a new Select node based on the results of the quality analysis.
Figure 17-29 Generate Select node dialog box

Select when record is. Specify whether records should be kept when they are Valid or Invalid.

Look for invalid values in. Specify where to check for invalid values.

All fields. The Select node will check all fields for invalid values.


Fields selected in table. The Select node will check only the fields currently selected in the Quality output table.


Fields with quality percentage higher than. The Select node will check fields where the percentage of complete records is greater than the specified threshold. The default threshold is 50%.
Consider a record invalid if an invalid value is found in. Specify the condition for identifying a record as invalid.
Any of the above fields. The Select node will consider a record invalid if any of the fields specified above contains an invalid value for that record.


All of the above fields. The Select node will consider a record invalid only if all of the fields specified above contain invalid values for that record.

Generating Other Nodes for Data Preparation


A variety of nodes used in data preparation can be generated directly from the Data Audit browser, including Reclassify, Binning, and Derive nodes. For example: You can derive a new field based on the values of claimvalue and farmincome by selecting both in the audit report and choosing Derive from the Generate menu. The new node is added to the stream canvas. Similarly, you may determine, based on audit results, that recoding farmincome into percentile-based bins provides a more focused analysis. To generate a Binning node, select the field row in the display and choose Binning from the Generate menu. Once a node is generated and added to the stream canvas, you must attach it to the stream and open the node to specify options for the selected field(s).

Statistics Node
The Statistics node gives you basic summary information about numeric fields. You can get summary statistics for individual fields and correlations between fields.


Statistics Node Settings Tab


Figure 17-30 Statistics node: Settings tab

Examine. Select the field or fields for which you want individual summary statistics. You can select multiple fields.


Statistics. Select the statistics to report. Available options include Count, Mean, Sum, Min, Max,
Range, Variance, Std Dev, Std Error of Mean, Median, and Mode.

Correlate. Select the field or fields that you want to correlate. You can select multiple fields. When correlation fields are selected, the correlation between each Examine field and the correlation field(s) will be listed in the output.
Correlation Settings. You can specify options for displaying the strength of correlations in the output.

Correlation Settings
Clementine can characterize correlations with descriptive labels to help highlight important relationships. The correlation measures the strength of relationship between two numeric range fields. It takes values between -1.0 and 1.0. Values close to +1.0 indicate a strong positive association, so that high values on one field are associated with high values on the other and low values are associated with low values. Values close to -1.0 indicate a strong negative association, so that high values for one field are associated with low values for the other, and vice versa. Values close to 0.0 indicate a weak association, so that values for the two fields are more or less independent. You can control display of correlation labels, change the thresholds that define the categories, and change the labels used for each range. Because the way you characterize correlation values depends greatly on the problem domain, you may want to customize the ranges and labels to fit your specific situation.

Figure 17-31 Correlation Settings dialog box

Show correlation strength labels in output. This option is selected by default. Deselect this option to omit the descriptive labels from the output.


Correlation Strength. There are two options for defining and labeling the strength of correlations:

Define correlation strength by importance (1-p). Labels correlations based on importance, defined as 1 minus the significance, or 1 minus the probability that the difference in means could be explained by chance alone. The closer this value comes to 1, the greater the chance that the two fields are not independent; in other words, that some relationship exists between them. Labeling correlations based on importance is generally recommended over absolute value because it accounts for variability in the data; for example, a coefficient of 0.6 may be highly significant in one data set and not significant at all in another. By default, importance values between 0.0 and 0.9 are labeled as Weak, those between 0.9 and 0.95 are labeled as Medium, and those between 0.95 and 1.0 are labeled as Strong.
Define correlation strength by absolute value. Labels correlations based on the absolute value of the Pearson correlation coefficient, which ranges between -1 and 1, as described above. The closer the absolute value of this measure comes to 1, the stronger the correlation. By default, correlations between 0.0 and 0.3333 (in absolute value) are labeled as Weak, those between 0.3333 and 0.6666 are labeled as Medium, and those between 0.6666 and 1.0 are labeled as Strong. Note, however, that the significance of any given value is difficult to generalize from one data set to another; for this reason, defining correlations based on probability rather than absolute value is recommended in most cases.

Statistics Output Browser


The Statistics node output browser displays the results of the statistical analysis and allows you to perform operations, including selecting fields, generating new nodes based on the selection, and saving and printing the results. The usual saving, exporting, and printing options are available from the File menu, and the usual editing options are available from the Edit menu. For more information, see Viewing Output on p. 525. When you first browse Statistics output, the results are expanded. To hide results after viewing them, use the expander control to the left of the item to collapse the specific results you want to hide or click the Collapse All button to collapse all results. To see results again after collapsing them, use the expander control to the left of the item to show the results or click the Expand All button to show all results.

Figure 17-32 Statistics output browser

The output contains a section for each Examine field, containing a table of the requested statistics.
Count. The number of records with valid values for the field.

Mean. The average (mean) value for the field across all records.

Sum. The sum of values for the field across all records.

Min. The minimum value for the field.

Max. The maximum value for the field.

Range. The difference between the minimum and maximum values.

Variance. A measure of the variability in the values of a field. It is calculated by taking the difference between each value and the overall mean, squaring it, summing across all of the values, and dividing by the number of records.
Standard Deviation. Another measure of variability in the values of a field, calculated as the square root of the variance.


Standard Error of Mean. A measure of the uncertainty in the estimate of the field's mean if the mean is assumed to apply to new data.


Median. The middle value for the field; that is, the value that divides the upper half of the data from the lower half of the data (based on values of the field).
Mode. The most common single value in the data.

Correlations. If you specified correlate fields, the output also contains a section listing the Pearson correlation between the Examine field and each correlate field, and optional descriptive labels for the correlation values. For more information, see Correlation Settings on p. 555.


Generate menu. The Generate menu contains node generation operations.

Filter. Generates a Filter node to filter out fields that are uncorrelated or weakly correlated with other fields.

Generating a Filter Node from Statistics


Figure 17-33 Generate Filter from Statistics dialog box

The Filter node generated from a Statistics output browser will filter fields based on their correlations with other fields. It works by sorting the correlations in order of absolute value, taking the largest correlations (according to the criterion set in the dialog box), and creating a filter that passes all fields that appear in any of those large correlations.
Mode. Decide how to select correlations. Include causes fields appearing in the specified correlations to be retained. Exclude causes the fields to be filtered.


Include/Exclude fields appearing in. Define the criterion for selecting correlations.

Top number of correlations. Selects the specified number of correlations and includes/excludes fields that appear in any of those correlations.


Top percentage of correlations (%). Selects the specified percentage (n%) of correlations and includes/excludes fields that appear in any of those correlations.


Correlations greater than. Selects correlations greater in absolute value than the specified threshold.

Means Node
The Means node compares the means between independent groups or between pairs of related fields to test whether a significant difference exists. For example, you can compare mean revenues before and after running a promotion or compare revenues from customers who didn't receive the promotion with those who did. You can compare means in two different ways, depending on your data:
Between groups within a field. To compare independent groups, select a test field and a grouping field. For example, you could exclude a sample of holdout customers when sending a promotion and compare mean revenues for the holdout group with all of the others. In this case, you would specify a single test field that indicates the revenue for each customer, with a flag or set field that indicates whether they received the offer. The samples are independent in the sense that each record is assigned to one group or another, and there is no way to link a specific member of one group to a specific member of another. You can also specify a set field with more than two values to compare the means for multiple groups. When executed, the node calculates a one-way ANOVA test on the selected fields. In cases where there are only two field groups, the one-way ANOVA results are essentially the same as an independent-samples t test. For more information, see Comparing Means for Independent Groups on p. 559.
Between pairs of fields. When comparing means for two related fields, the groups must be paired in some way for the results to be meaningful. For example, you could compare the mean revenues from the same group of customers before and after running a promotion or compare usage rates for a service between husband-wife pairs to see if they are different. Each record contains two separate but related measures that can be compared meaningfully. When executed, the node calculates a paired-samples t test on each field pair selected. For more information, see Comparing Means Between Paired Fields on p. 560.

Comparing Means for Independent Groups


Select Between groups within a field in the Means node to compare the mean for two or more independent groups.
Figure 17-34 Comparing means between groups within one field

Grouping field. Select a numeric flag or set field with two or more distinct values that divides records into the groups you want to compare, such as those who received an offer versus those who did not. Regardless of the number of test fields, only one grouping field can be selected.
Test fields. Select one or more numeric fields that contain the measures you want to test. A separate test will be conducted for each field you select. For example, you could test the impact of a given promotion on usage, revenue, and churn.


Comparing Means Between Paired Fields


Select Between pairs of fields in the Means node to compare means between separate fields. The fields must be related in some way for the results to be meaningful, such as revenues before and after a promotion. Multiple field pairs can also be selected.
Figure 17-35 Comparing means between paired fields

Field one. Select a numeric field that contains the first of the measures you want to compare. In a before-and-after study, this would be the before field.

Field two. Select the second field you want to compare.

Add. Adds the selected pair to the Test field pair(s) list.

Repeat field selections as needed to add multiple pairs to the list.


Correlation settings. Allows you to specify options for labeling the strength of correlations. For more information, see Correlation Settings on p. 555.

Means Node Options


The Options tab allows you to set the threshold values used to label results as important, marginal, or unimportant. You can also edit the label for each ranking. Importance is measured on a percentage scale and can be broadly defined as 1 minus the probability of obtaining a result (such as the difference in means between two fields) as extreme as or more extreme than the observed result by chance alone. For example, an importance value greater than 0.95 indicates less than a 5% chance that the result could be explained by chance alone.

Figure 17-36 Importance settings

Importance labels. You can edit the labels used to label each field pair or group in the output. The default labels are important, marginal, and unimportant.

Cutoff values. Specifies the threshold for each rank. Typically, importance values greater than 0.95 would rank as important, while those lower than 0.9 would be unimportant, but these thresholds can be adjusted as needed.

Note: Importance measures are available in a number of Clementine nodes. The specific computations depend on the node and on the type of target and predictors used, but the values can still be compared, since all are measured on a percentage scale.
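As a hypothetical worked example under the default cutoffs, a comparison with a significance (p) value of 0.03 has an importance of 1 - 0.03 = 0.97 and would be labeled important; p = 0.08 gives an importance of 0.92, which falls in the marginal band; and p = 0.15 gives 0.85, labeled unimportant.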

Means Node Output Browser


The Means output browser displays cross-tabulated data and allows you to perform standard operations, including selecting and copying the table one row at a time, sorting by any column, and saving and printing the table. For more information, see Viewing Output on p. 525. The specific information in the table depends on the type of comparison (groups within a field or separate fields).
Sort by. Allows you to sort the output by a specific column. Click the up or down arrow to change the direction of the sort. Alternatively, you can click on any column heading to sort by that column. (To change the direction of the sort within the column, click again.)
View. You can choose Simple or Advanced to control the level of detail in the display. The advanced view includes all of the information from the simple view but with additional details provided.


Means Output Comparing Groups within a Field


When comparing groups within a field, the name of the grouping field is displayed above the output table, and means and related statistics are reported separately for each group. The table includes a separate row for each test field.
Figure 17-37 Comparing groups within a field

The following columns are displayed:


Field. Lists the names of the selected test fields.

Means by group. Displays the mean for each category of the grouping field. For example, you might compare those who received a special offer (New Promotion) with those who didn't (Standard). In the advanced view, the standard deviation, standard error, and count are also displayed.
Importance. Displays the importance value and label. For more information, see Means Node Options on p. 560.


Advanced Output

In the advanced view, the following additional columns are displayed.


F-Test. This test is based on the ratio of the variance between the groups and the variance within each group. If the means are the same for all groups, you would expect the F ratio to be close to 1, since both are estimates of the same population variance. The larger this ratio, the greater the variation between groups and the greater the chance that a significant difference exists.
df. Displays the degrees of freedom.
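In standard one-way ANOVA terms (shown for orientation rather than as the node's exact output), the F ratio for $k$ groups and $N$ records in total is

$$F = \frac{\text{between-groups mean square}}{\text{within-groups mean square}}, \qquad df = (k - 1,\; N - k)$$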

Means Output Comparing Pairs of Fields


When comparing separate fields, the output table includes a row for each selected field pair.

Figure 17-38 Comparing pairs of fields

Field One/Two. Displays the name of the first and second field in each pair. In the advanced view, the standard deviation, standard error, and count are also displayed.
Mean One/Two. Displays the mean for each field, respectively.

Correlation. Measures the strength of relationship between two numeric range fields. Values close to +1.0 indicate a strong positive association, and values close to -1.0 indicate a strong negative association. For more information, see Correlation Settings on p. 555.
Mean Difference. Displays the difference between the two field means.

Importance. Displays the importance value and label. For more information, see Means Node Options on p. 560.


Advanced Output

Advanced output adds the following columns:


95% Confidence Interval. Lower and upper boundaries of the range within which the true mean is likely to fall in 95% of all possible samples of this size from this population.
T-Test. The t statistic is obtained by dividing the mean difference by its standard error. The greater the absolute value of this statistic, the greater the probability that the means are not the same.
df. Displays the degrees of freedom for the statistic.
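For orientation, the paired-samples t statistic described above corresponds to the standard formula, where $\bar{d}$ is the mean of the per-record differences, $s_d$ their standard deviation, and $n$ the number of pairs:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad df = n - 1$$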

Report Node
The Report node allows you to create formatted reports containing fixed text, as well as data and other expressions derived from the data. You specify the format of the report by using text templates to define the fixed text and the data output constructions. You can provide custom text formatting using HTML tags in the template and by setting options on the Output tab. Data values and other conditional output are included in the report using CLEM expressions in the template.
Alternatives to the Report Node

The Report node is most typically used to list records or cases output from a stream, such as all records meeting a certain condition. In this regard, it can be thought of as a less-structured alternative to the Table node.


If you want a report that lists field information or anything else that is defined in the Clementine stream rather than the data itself (such as field definitions specified in a Type node), then a script can be used instead. For more information, see Type Node Report in Chapter 6 in Clementine 11.1 Scripting, Automation, and CEMI Reference. To generate a report that includes multiple output objects, such as a collection of models, tables, and graphs generated by one or more streams, and that can be output in multiple formats, including text, HTML, and Microsoft Word/Office, a Clementine project can be used. For more information, see Introduction to Projects in Chapter 11 in Clementine 11.1 User's Guide. To produce a list of field names without using scripting, you can use a Table node preceded by a Sample node that discards all records. This produces a table with no rows, which can be transposed on export to produce a list of field names in a single column. (Select Transpose data on the Output tab in the Table node to do this.)

Report Node Template Tab


Figure 17-39 Report node: Template tab

Creating a template. To define the contents of the report, you create a template on the Report node Template tab. The template consists of lines of text, each of which specifies something about the contents of the report, and some special tag lines used to indicate the scope of the content lines. Within each content line, CLEM expressions enclosed in square brackets ([]) are evaluated before the line is sent to the report. There are three possible scopes for a line in the template:
Fixed. Lines that are not marked otherwise are considered fixed. Fixed lines are copied into the report only once, after any expressions that they contain are evaluated. For example, the line

This is my report, printed on [@TODAY]

would copy a single line to the report, containing the text and the current date.
Global (iterate ALL). Lines contained between the special tags #ALL and # are copied to the report once for each record of input data. CLEM expressions (enclosed in brackets) are evaluated based on the current record for each output line. For example, the lines
#ALL
For record [@INDEX], the value of AGE is [AGE]
#

would include one line for each record indicating the record number and age. To generate a list of all records:
#ALL
[Age] [Sex] [BP] [Cholesterol]
#

Conditional (iterate WHERE). Lines contained between the special tags #WHERE <condition> and # are copied to the report once for each record where the specified condition is true. The condition is a CLEM expression. (In the WHERE condition, the brackets are optional.) For example, the lines
#WHERE [SEX = 'M']
Male at record no. [@INDEX] has age [AGE].
#

will write one line to the file for each record with a value of M for sex. The complete report will contain the fixed, global, and conditional lines defined by applying the template to the input data. You can specify options for displaying or saving results using the Output tab, common to various types of output nodes. For more information, see Output Node Output Tab on p. 529.
Outputting Data in HTML or XML Format

You can include HTML or XML tags directly in the template in order to write reports in either of these formats. For example, the following template produces an HTML table.
This report is written in HTML.
Only records where Age is above 60 are included.

<HTML>
<TABLE border="2">
<TR>
  <TD>Age</TD>
  <TD>BP</TD>
  <TD>Cholesterol</TD>
  <TD>Drug</TD>
</TR>
#WHERE Age > 60
<TR>
  <TD>[Age]</TD>
  <TD>[BP]</TD>
  <TD>[Cholesterol]</TD>
  <TD>[Drug]</TD>
</TR>
#
</TABLE>
</HTML>
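A template written with XML rather than HTML tags works in the same way; the tags are treated as fixed text, and the bracketed CLEM expressions supply the data. The following sketch (the element names are arbitrary) produces one element per record over age 60:

<records>
#WHERE Age > 60
<record>
  <age>[Age]</age>
  <bp>[BP]</bp>
  <cholesterol>[Cholesterol]</cholesterol>
  <drug>[Drug]</drug>
</record>
#
</records>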

Report Node Output Browser


The report browser shows you the contents of the generated report. The usual saving, exporting, and printing options are available from the File menu, and the usual editing options are available from the Edit menu. For more information, see Viewing Output on p. 525.
Figure 17-40 Report browser

Set Globals Node


The Set Globals node scans the data and computes summary values that can be used in CLEM expressions. For example, you can use a Set Globals node to compute statistics for a field called age and then use the overall mean of age in CLEM expressions by inserting the function @GLOBAL_MEAN(age). For more information, see CLEM Reference Overview in Chapter 8 in Clementine 11.1 User's Guide.
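As an illustration, once a Set Globals node has computed MEAN and SDEV for age, a downstream Derive or Filler node could standardize the field with a CLEM expression such as the following (a sketch that assumes the corresponding @GLOBAL_SDEV function is available alongside @GLOBAL_MEAN):

(age - @GLOBAL_MEAN(age)) / @GLOBAL_SDEV(age)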


Set Globals Node Settings Tab


Figure 17-41 Set Globals node: Settings tab

Globals to be created. Select the field or fields for which you want globals to be available. You can select multiple fields. For each field, specify the statistics to compute by making sure that the statistics you want are selected in the columns next to the field name.
MEAN. The average (mean) value for the field across all records.

SUM. The sum of values for the field across all records.

MIN. The minimum value for the field.

MAX. The maximum value for the field.

SDEV. The standard deviation, which is a measure of variability in the values of a field and is calculated as the square root of the variance.


Default operation(s). The options selected here will be used when new fields are added to the Globals list above. To change the default set of statistics, select or deselect statistics as appropriate. You can also use the Apply button to apply the default operations to all fields in the list.
Clear all globals before executing. Select this option to remove all global values before calculating new values. If this option is not selected, newly calculated values replace older values, but globals that are not recalculated remain available as well.

Display preview of globals created after execution. If you select this option, the Globals tab of the stream properties dialog box will appear after execution to display the calculated global values. For more information, see Viewing Global Values for Streams in Chapter 5 in Clementine 11.1 User's Guide.


Transform Node
Normalizing input fields is an important step before using traditional scoring techniques, such as regression, logistic regression, and discriminant analysis. These techniques carry assumptions about normal distributions of data that may not be true for many raw data files. One approach to dealing with real-world data is to apply transformations that move a raw data element toward a more normal distribution. In addition, normalized fields can easily be compared with each other; for example, income and age are on totally different scales in a raw data file, but when normalized, the relative impact of each can be easily interpreted. The Transform node provides an output viewer that enables you to perform a rapid visual assessment of the best transformation to use. You can see at a glance whether variables are normally distributed and, if necessary, choose the transformation you want and apply it. You can pick multiple fields and perform one transformation per field. After selecting the preferred transformations for the fields, you can generate Derive or Filler nodes that perform the transformations and attach these nodes to the stream. The Derive node creates new fields, while the Filler node transforms the existing ones. For more information, see Generating Graphs on p. 572.
Transform Node Fields Tab

On the Fields tab, you can specify which fields of the data you want to use for viewing possible transformations and applying them. Only numeric fields can be transformed. Click the field selector button and select one or more numeric fields from the list displayed.
Figure 17-42 Transform node: Fields tab


Transform Node Options Tab


The Options tab allows you to specify the type of transformations you want to include. You can choose to include all available transformations, or select transformations individually. In the latter case, you can also enter a number to offset the data for the inverse and log transformations. Doing so is useful in situations where a large proportion of zeros in the data would bias the mean and standard deviation results. For example, assume that you have a field named BALANCE that has some zero values in it, and you want to use the inverse transformation on it. To avoid undesired bias, you would select Inverse (1/x) and enter 1 in the Use a data offset field. (Note that this offset is not related to that performed by the @OFFSET sequence function in Clementine.)
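With that offset in place, the transformations are applied to the offset values rather than the raw ones, so a zero balance is handled as, for example, 1/(BALANCE + 1) = 1 rather than an undefined 1/0. (This reading of the offset, adding it to each value before transforming, is an assumption based on the description above.)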
Figure 17-43 Transform node: Options tab

All formulas. Indicates that all available transformations should be calculated and shown in the output.
Select formulas. Allows you to select the different transformations to be calculated and shown in the output.
Inverse (1/x). Indicates that the inverse transformation should be displayed in the output.

Log (log n). Indicates that the log n transformation should be displayed in the output.

Log (log 10). Indicates that the log 10 transformation should be displayed in the output.

Exponential. Indicates that the exponential transformation (e^x) should be displayed in the output.
Square Root. Indicates that the square root transformation should be displayed in the output.


Transform Node Output Tab


The Output tab lets you specify the format and location of the output. You can choose to display the results on the screen, or send them to one of the standard file types. For more information, see Output Node Output Tab on p. 529.

Transform Node Output Viewer


The output viewer enables you to see the results of executing the Transform node. The viewer is a powerful tool that displays multiple transformations per field in thumbnail views of the transformation, enabling you to compare fields quickly. You can use options on its File menu to save, export, or print the output. For more information, see Viewing Output on p. 525.
Figure 17-44 Viewing available transformations per field

For each transformation (other than Selected Transform), a legend is displayed underneath in the format: Mean (Standard deviation)

Generating Nodes for the Transformations


The output viewer provides a useful starting point for your data preparation. For example, you might want to normalize the field AGE so that you can use a scoring technique (such as logistic regression or discriminant analysis) that assumes a normal distribution. Based upon the initial graphs and summary statistics, you might decide to transform the AGE field according to a particular distribution (for example, log). After selecting the preferred distribution, you can then generate a Derive node with a standardized transformation to use for scoring. You can generate the following field operations nodes from the output viewer: Derive and Filler. A Derive node creates new fields with the desired transformations, while the Filler node transforms existing fields. The nodes are placed on the canvas in the form of a SuperNode.


If you select the same transformation for different fields, a Derive or Filler node contains the formulas for that transformation type for all the fields to which that transformation applies. For example, assume that you have selected the following fields and transformations to generate a Derive node:
Field       Transformation
AGE         Current Distribution
INCOME      Log
OPEN_BAL    Inverse
BALANCE     Inverse

The following nodes are contained in the SuperNode:


Figure 17-45 SuperNode on canvas

In this example, the Derive_Log node has the log formula for the INCOME field, and the Derive_Inverse node has the inverse formulas for the OPEN_BAL and BALANCE fields.
To generate a node:
E For each field in the output viewer, select the desired transformation.
E From the Generate menu, choose Derive Node or Filler Node as desired.

Doing so displays the Generate Derive Node or Generate Filler Node dialog box, as appropriate.
Figure 17-46 Choosing standardized or non-standardized transformation

Choose Non-standardized transformation or Standardized transformation (z-score) as desired. The second option applies a z score to the transformation; z scores represent values as a function of distance from the mean of the variable in standard deviations. For example, if you apply the log transformation to the AGE field and choose a standardized transformation, the final equation for the generated node will be:

(log(AGE)-Mean)/SD


Once a node is generated and appears on the stream canvas:


E Attach it to the stream.
E For a SuperNode, optionally double-click the node to view its contents.
E Optionally double-click a Derive or Filler node to modify options for the selected field(s).

Generating Graphs
You can generate full-size histogram output from a thumbnail histogram in the output viewer.
To generate a graph:
E Double-click a thumbnail graph in the output viewer.

or
E Select a thumbnail graph in the output viewer.
E From the Generate menu, choose Graph output.

Doing so displays the histogram with a normal distribution curve overlaid. This enables you to compare how closely each available transformation matches a normal distribution.
Figure 17-47 Transformation histogram with normal distribution curve overlaid


Other Operations
From the output viewer, you can also: Sort the output grid by the Field column. Export the output to an HTML file. For more information, see Exporting Output on p. 526.

SPSS Output Node


The SPSS Output node allows you to call an SPSS procedure to analyze your Clementine data. You can view the results in a browser window or save results in the SPSS output file format. A wide variety of SPSS analytical procedures is accessible from Clementine. Note: You must have SPSS installed and licensed on your computer to use this node. For more information, see SPSS Helper Applications on p. 575. For details on specific SPSS procedures, see the SPSS Command Syntax Reference, which is available under the \documentation folder on the product CD-ROM and also available from the Windows Start menu by choosing Start > [All] Programs > SPSS Clementine 11.1 > Documentation. Note that a newer version of this document may have been included with your copy of SPSS software. You can also click the SPSS Syntax Help button, available from the Syntax tab on the SPSS Output node dialog box in Clementine. This will provide syntax help for the command that you are currently typing. If necessary, you can use the Filter tab to filter or rename fields so they conform to SPSS naming standards. For more information, see Renaming or Filtering Fields for SPSS in Chapter 18 on p. 590.

SPSS Output Node Syntax Tab


Use this dialog box to create syntax for SPSS procedures. Syntax is composed of two parts: a statement and associated options. The statement specifies the analysis or operation to be performed and the fields to be used. The options specify everything else, including which statistics to display, derived fields to save, and so on. If you have previously created syntax files, you can use them here by choosing Open from the File menu. Selecting an .sps file will paste the contents into the Procedure node dialog box. To insert previously saved syntax without replacing the current contents, choose Insert from the File menu. This will paste the contents of an .sps file at the point specified by the cursor. If you are unfamiliar with SPSS syntax, the simplest way to create syntax in Clementine is to first run the command in SPSS, copy the syntax into the SPSS Procedure node in Clementine, and execute the stream. Once you have created syntax for a frequently used procedure, you can save the syntax by choosing Save or Save As from the File menu.

574 Chapter 17 Figure 17-48 SPSS Output node dialog box

When you click Execute, the results are shown in the SPSS Output Browser.

SPSS Output Node Output Tab


The Output tab lets you specify the format and location of the output. You can choose to display the results on the screen, or send them to one of the standard le types. For more information, see Output Node Output Tab on p. 529.

575 Output Nodes

SPSS Output Browser


Figure 17-49 SPSS Output browser

The SPSS output browser shows you the results of the SPSS procedure that you executed in the SPSS Output node. The usual saving, exporting, and printing options are available from the File menu, and the usual editing options are available from the Edit menu. For more information, see Viewing Output on p. 525.

SPSS Helper Applications


If SPSS is installed and licensed on your computer, you can congure Clementine to process data with SPSS functionality using the SPSS Transform, SPSS Output, or SPSS Export nodes.
E To congure Clementine to work with SPSS and other applications, choose Helper Applications

from the Tools menu.

576 Chapter 17 Figure 17-50 Helper Applications dialog box

SPSS Interactive. Enter the name of the command to execute SPSS in interactive mode (usually, spsswin.exe in the SPSS program directory). This interactivity is used by the SPSS Export node. For more information, see SPSS Export Node in Chapter 18 on p. 588. Connection. If SPSS Server is located on the same server as Clementine Server, you can enable a

connection between the two applications, which increases efciency by leaving data on the server during analysis. Select Server to enable the Port option below. The default setting is Local.
Port. Specify the server port for SPSS Server. SPSS License Location Utility. To enable Clementine to use the SPSS Transform and SPSS

Output nodes, you must have a copy of SPSS installed and licensed on the computer where the stream is executed. If running Clementine Client in local (standalone) mode, the licensed copy of SPSS must be on the local computer. Click this button to specify the location of the local SPSS installation you want to use for licensing. If running in distributed mode against a remote Clementine Server, the licensed version of SPSS must be on the server computer, and the license conguration must be done on the server.
E Select Other to specify options for AnswerTree, Excel, and other applications.

577 Output Nodes

Comments

If you have trouble running SPSS procedure nodes, consider the following tips: If field names used in Clementine are longer than eight characters (for versions prior to SPSS 12.0), longer than 64 characters (for SPSS 12.0 and subsequent versions), or contain invalid characters, it is necessary to rename or truncate them before reading them into SPSS. For more information, see Renaming or Filtering Fields for SPSS in Chapter 18 on p. 590. If SPSS was installed after Clementine, you may need to specify the SPSS license location, as explained above.

Other Helper Applications


On the Other tab of the Helper Applications dialog box, you can specify the location of applications, such as AnswerTree and Excel, to work interactively with data from Clementine.
Figure 17-51 Helper Applications dialog box:Other tab

AnswerTree command. Enter the name of the command to execute AnswerTree (normally

atree.exe in the AnswerTree program directory).


Excel(tm) command. Enter the name of the command to execute Excel (normally excel.exe in the

Excel program directory).


Publish to Web URL. Enter the URL for your SPSS Web Deployment Framework (SWDF) server for the Publish to Web option.

Export Nodes

18

Chapter

Overview of Export Nodes


Export nodes provide a mechanism for exporting data in various formats to interface with your other software tools. The following export nodes are available:
The Database Export node writes data to an ODBC-compliant relational data source. In order to write to an ODBC data source, the data source must exist and you must have write permission for it. For more information, see Database Output Node on p. 578. The Flat File node outputs data to a delimited text le. It is useful for exporting data that can be read by other analysis or spreadsheet software. For more information, see Flat File Node on p. 587. The SPSS Export node outputs data in SPSS .sav format. The .sav les can be read by SPSS Base and other SPSS products. This is also the format used for Clementine cache les. For more information, see SPSS Export Node on p. 588. The SAS Export node outputs data in SAS format, to be read into SAS or a SAS-compatible software package. Three SAS le formats are available: SAS for Windows/OS2, SAS for UNIX, or SAS Version 7/8. For more information, see SAS Export Node on p. 591. The Excel Export node outputs data in Microsoft Excel format (.xls). Optionally, you can choose to launch Excel automatically and open the exported le when the node is executed.For more information, see Excel Export Node on p. 592.

Database Output Node


You can use Database nodes to write data to ODBC-compliant relational data sources. To read or write to a database, you must have an ODBC data source installed and congured for the relevant database, with read or write permissions as needed. The SPSS Data Access Pack includes a set of ODBC drivers that can be used for this purpose, and these drivers are available from the SPSS Web site at http://www.spss.com/drivers/clientCLEM.htm. If you have questions about creating or setting permissions for ODBC data sources, contact your database administrator.
578

579 Export Nodes

Supported ODBC Drivers

For the latest information on which databases and ODBC drivers are supported and tested for use with Clementine 11.1, please review the product compatibility matrices on the SPSS Support site (http://support.spss.com).
Where to Install Drivers

Note that ODBC drivers must be installed and congured on each computer where processing may occur. If you are running Clementine Client in local (standalone) mode, the drivers must be installed on the local computer. If you are running Clementine Client or Clementine Batch in distributed mode against a remote Clementine Server, the ODBC drivers need to be installed on the computer where Clementine Server is installed. If you need to access the same data sources from both Clementine Client and Clementine Server, the ODBC drivers must be installed on both computers. If you are running Clementine Client over Terminal Services, the ODBC drivers need to be installed on the Terminal Services server on which you have Clementine Client installed. If you have purchased Clementine Solution Publisher and are using the Solution Publisher Runtime to execute published streams on a separate computer, you also need to install and congure ODBC drivers on that computer. Use the following general steps to write data to a database:
E Install an ODBC driver and congure a data source to the database you want to use. E On the Database node Export tab, specify the data source and table you want to write to. You can

create a new table or insert data into an existing one.


E Specify additional options as needed.

These steps are described in more detail in the next several topics.

580 Chapter 18

Database Node Export Tab


Figure 18-1 Database Output node: Export tab

Data source. Shows the selected data source. Enter the name or select it from the drop-down list.

If you dont see the desired database in the list, select Add new database connection and locate your database from the Database Connections dialog box. For more information, see Adding a Database Connection in Chapter 2 on p. 25.
Table name. Enter the name of the table to which you want to send the data. If you select the Insert into table option, you can select an existing table in the database by clicking the Select button. Create table. Select this option to create a new database table or to overwrite an existing database

table.
Insert into table. Select this option to insert the data into an existing database table. Drop existing table. Select this option to delete any existing table with the same name when

creating a new table.


Delete existing rows. Select this option to delete existing rows from the table before exporting

when inserting into a table. Note: If either of the two options above is selected, you will receive an Overwrite warning message when you execute the node. To suppress the warnings, deselect Warn when a node overwrites a database table on the Notifications tab of the User Options dialog box. For more information, see Setting Notification Options in Chapter 3 in Clementine 11.1 User's Guide.
Default string size. Fields you have marked as typeless in an upstream Type node are written to the

database as string elds. Specify the size of strings to be used for typeless elds.
Quote table and column names. Select options used when sending a CREATE TABLE statement to

the database. Tables or columns with spaces or nonstandard characters must be quoted.
As needed. Select to allow Clementine to automatically determine when quoting is needed on

an individual basis.
Always. Select to always enclose table and column names in quotes. Never. Select to disable the use of quotes.
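For example, if Always is selected and a table or column name contains spaces, the identifiers are enclosed in the quoting characters appropriate to your database. The statement below is purely an illustration, using standard SQL double quotes and hypothetical names:
CREATE TABLE "My Table" ("Customer ID" VARCHAR(10), "Order Total" FLOAT)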

581 Export Nodes

Generate an import node for this data. Select to generate a Database source node for the data as exported to the specied data source and table. Upon execution, this node is added to the stream canvas.

Click Schema to open a dialog box where you can set SQL data types for your fields, and specify the primary key for purposes of database indexing. For more information, see Database Output Schema Options on p. 581. Click Indexes to specify options for indexing the exported table in order to improve database performance. For more information, see Database Output Index Options on p. 582. Click Advanced to specify bulk loading and database commit options. For more information, see Database Output Advanced Options on p. 585.

Database Output Schema Options


The database output Schema dialog box allows you to set SQL data types for your elds, specify which elds are primary keys, and customize the CREATE TABLE statement generated upon export.
Figure 18-2 Database output Schema dialog box

The dialog box has two parts: The text field at the top displays the template used to generate the CREATE TABLE command, which by default follows the format:
CREATE TABLE <table-name> <(table columns)>

The table in the lower portion allows you to specify the type for each field, and to indicate which fields are primary keys as discussed below. The dialog box automatically generates the values of the <table-name> and <(table columns)> parameters based on the specifications in the table.
Customizing CREATE TABLE Statements

Using the top portion of this dialog box, you can add extra database-specific options to the CREATE TABLE statement.
E Select the Customize CREATE TABLE command check box to activate the text window.

582 Chapter 18 E Add any database-specific options to the statement. Be sure to retain the text <table-name> and

(<table-columns>) parameters because these are substituted for the real table name and column definitions by Clementine.
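For example, a customized template for an Oracle database might append a storage clause after the column definitions, keeping the placeholder text exactly as it appears in the default template; the tablespace name here is purely illustrative:
CREATE TABLE <table-name> <(table columns)> TABLESPACE CLEM_DATA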

Setting Data Types

By default, Clementine will allow the database server to assign data types automatically. To override the automatic type for a field, find the row corresponding to the field and select the desired type from the drop-down list in the Type column of the schema table. For types that take a length, precision, or scale argument (BINARY, VARBINARY, CHAR, VARCHAR, NUMERIC, and NUMBER), you should specify a length rather than allow the database server to assign an automatic length. For example, specifying a sensible value, such as VARCHAR(25), for length ensures that the storage type in Clementine will be overwritten if that is your intention. To override the automatic assignment, select Specify from the Type drop-down list and replace the type definition with the desired SQL type definition statement.
Figure 18-3 Database output Specify Type dialog box

The easiest way to do this is to first select the type that is closest to the desired type definition and then select Specify to edit that definition. For example, to set the SQL data type to VARCHAR(25), first set the type to VARCHAR(length) from the Type drop-down list, and then select Specify and replace the text length with the value 25.
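As a sketch with hypothetical table and field names, the overridden field then appears in the generated statement with the exact definition you entered:
CREATE TABLE MYTABLE (NAME VARCHAR(25), AGE INTEGER)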

Primary Keys

If one or more columns in the exported table must have a unique value or combination of values for every row, you can indicate this by selecting the Primary Key check box for each field that applies. Most databases will not allow the table to be modified in a manner that invalidates a primary key constraint and will automatically create an index over the primary key to help enforce this restriction. (Optionally, you can create indexes for other fields in the Indexes dialog box. For more information, see Database Output Index Options on p. 582.)
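For example, designating EMPID as the primary key might yield generated SQL along the following lines (the field names and types are illustrative only):
CREATE TABLE MYTABLE (EMPID INTEGER NOT NULL, DEPTID INTEGER, PRIMARY KEY (EMPID))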

Database Output Index Options


The Indexes dialog box allows you to create indexes on database tables exported from Clementine. You can specify the field sets you want to include and customize the CREATE INDEX command, as needed.

583 Export Nodes Figure 18-4 Database output Indexes dialog box

The dialog box has two parts: The text field at the top displays a template that can be used to generate one or more CREATE INDEX commands, which by default follows the format:
CREATE INDEX <index-name> ON <table-name>

The table in the lower portion of the dialog box allows you to add specifications for each index you want to create. For each index, you specify the index name and the fields or columns to include. The dialog box automatically generates the values of the <index-name> and <table-name> parameters accordingly. For example, the generated SQL for a single index on the fields empid and deptid might look like this:
CREATE INDEX MYTABLE_IDX1 ON MYTABLE(EMPID,DEPTID)

You can add multiple rows to create multiple indexes. A separate CREATE INDEX command is generated for each row.
Customizing the CREATE INDEX Command

Optionally, you can customize the CREATE INDEX command for all indexes or for a specific index only. This gives you the flexibility to accommodate specific database requirements or options and to apply customizations to all indexes or only specific ones, as needed. Select Customize the CREATE INDEX command at the top of the dialog box to modify the template used for all indexes added subsequently. Note that changes will not automatically apply to indexes that have already been added to the table. Select one or more rows in the table and then click Update selected indexes at the top of the dialog box to apply the current customizations to all selected rows. Select the Customize check box in each row to modify the command template for that index only.

584 Chapter 18

Note that the values of the <index-name> and <table-name> parameters are generated automatically by the dialog box based on the table specifications and cannot be edited directly.
BITMAP keyword. If you are using an Oracle database, you can customize the template to create

a bitmap index rather than a standard index, as follows:


CREATE BITMAP INDEX <index-name> ON <table-name>

Bitmap indexes may be useful for indexing columns with a small number of distinct values. The resulting SQL might look like this:
CREATE BITMAP INDEX MYTABLE_IDX1 ON MYTABLE(COLOR)

UNIQUE keyword. Most databases support the UNIQUE keyword in the CREATE INDEX command. This enforces a uniqueness constraint similar to a primary key constraint on the underlying table.
CREATE UNIQUE INDEX <index-name> ON <table-name>

Note that for fields actually designated as primary keys, this specification is not necessary. Most databases will automatically create an index for any fields specified as primary key fields within the CREATE TABLE command, so explicitly creating indexes on these fields is not necessary. For more information, see Database Output Schema Options on p. 581.
FILLFACTOR keyword. Some physical parameters for the index can be fine-tuned. For example,

SQL Server allows the user to trade off the index size (after initial creation) against the costs of maintenance as future changes are made to the table.
CREATE INDEX MYTABLE_IDX1 ON MYTABLE(EMPID,DEPTID) WITH FILLFACTOR=20

Other Comments

If an index already exists with the specified name, index creation will fail. Any failures will initially be treated as warnings, allowing subsequent indexes to be created and then re-reported as an error in the message log after all indexes have been attempted. For best performance, indexes should be created after data has been loaded into the table. Indexes must contain at least one column. Before executing the node, you can preview the generated SQL in the message log. For more information, see Previewing Generated SQL in Chapter 6 in Clementine 11.1 Server Administration and Performance Guide. For temporary tables written to the database (that is, when node caching is enabled), the options to specify primary keys and indexes are not available. However, the system may create indexes on the temporary table as appropriate, depending on how the data is used in downstream nodes. For example, if cached data is subsequently joined by a DEPT column, it would make sense to index the cached table on this column. For more information, see Caching Options for Nodes in Chapter 5 in Clementine 11.1 User's Guide.
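As an illustration of that case only (the index and table names below are hypothetical, since temporary table names are system-generated), such an index might look like:
CREATE INDEX TMP_IDX1 ON TMP_TABLE(DEPT)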

585 Export Nodes

Indexes and Query Optimization

In some database management systems, once a database table has been created, loaded, and indexed, a further step is required before the optimizer is able to utilize the indexes to speed up query execution on the new table. For example, in Oracle, the cost-based query optimizer requires that a table be analyzed before its indexes can be used in query optimization. The file odbc-oracle-properties.cfg has been updated to make this happen, as follows:
# Defines SQL to be executed after a table and any associated indexes
# have been created and populated
table_analysis_sql, 'ANALYZE TABLE <table-name> COMPUTE STATISTICS'

This step is executed whenever a table is created in Oracle (regardless of whether primary keys or indexes are defined). If necessary, the ODBC properties file for additional databases can be customized in a similar way. These files are installed in the /config folder under your Clementine installation, for example, C:\Program Files\SPSS Clementine\10.0\config\odbc-oracle-properties.cfg.
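As a sketch of what a similar customization might look like for another database (which properties file to edit, and whether this step is needed at all, depend on your database and driver, so treat this entry as an assumption to verify), a SQL Server configuration could run UPDATE STATISTICS after the table is populated:
# Hypothetical customization for a SQL Server properties file
table_analysis_sql, 'UPDATE STATISTICS <table-name>'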

Database Output Advanced Options


When you click the Advanced button from the Database and Publisher node dialog boxes, a new dialog box opens to specify technical details for exporting results to a database.
Figure 18-5 Specifying advanced options for database export

Batch commit. Select to turn off row-by-row commits to the database. Batch size. Specify the number of records to send to the database before committing to memory.

Lowering this number provides greater data integrity at the cost of slower transfer speeds. You may want to ne-tune this number for optimal performance with your database.

586 Chapter 18

Use bulk loading. Select a method for bulk loading data to the database directly from Clementine.

Some experimentation may be required to select which bulk load options are appropriate for a particular scenario.
Via ODBC. Select to use the ODBC API to execute multiple-row inserts with greater efficiency

than normal export to the database. Choose from row-wise or column-wise binding in the options below.
Via external loader. Select to use a custom bulk loader program specific to your database.

Selecting this option activates a variety of options below.


Advanced ODBC Options. These options are available only when Via ODBC is selected. Note that

this functionality may not be supported by all ODBC drivers.


Row-wise. Select row-wise binding to use the SQLBulkOperations call for loading data

into the database. Row-wise binding typically improves speed compared to the use of parameterized inserts that insert data on a record-by-record basis.
Column-wise. Select to use column-wise binding for loading data into the database.

Column-wise binding improves performance by binding each database column (in a parameterized INSERT statement) to an array of N values. Executing the INSERT statement once causes N rows to be inserted into the database. This method can dramatically increase performance.
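For instance, with column-wise binding an INSERT statement of the following form (the table and field names are hypothetical) is prepared once, each ? parameter is bound to an array of N values, and a single execution inserts N rows:
INSERT INTO MYTABLE (EMPID, DEPTID) VALUES (?, ?)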
External Loader Options. When Via external loader is specied, a variety of options are displayed

for exporting the data set to a le and specifying and executing a custom loader program to load the data from that le into the database. Clementine can interface with external loaders for many popular database systems. Several scripts have been included with the software and are available along with technical documentation under the /scripts subdirectory.
Use delimiter. Specify which delimiter character should be used in the exported le. Select
Tab to delimit with tab and Space to delimit with spaces. Select Other to specify another

character, such as a comma (,).


Specify data file. Select to enter the path to use for the data le written during bulk loading.

By default, a temporary le is created in the temp directory on the server.


Specify loader program. Select to specify a bulk loading program. By default, the software

searches the /scripts subdirectory of the Clementine installation for a python script to execute for a given database. Several scripts have been included with the software and are available along with technical documentation under the /scripts subdirectory.
Generate log. Select to generate a log le to the specied directory. The log le contains error

information and is useful if the bulk load operation fails.


Check table size. Select to perform table checking that ensures that the increase in table size

corresponds to the number of rows exported from Clementine.


Extra loader options. Specify additional arguments to the loader program. Use double-quotes

for arguments containing spaces. Double-quotes are included in optional arguments by escaping with a backslash. For example, the option specified as -comment "This is a \"comment\"" includes both the -comment flag and the comment itself rendered as This is a "comment".

587 Export Nodes

A single backslash can be included by escaping with another backslash. For example, the option specified as -specialdir "C:\\Test Scripts\\" includes the flag -specialdir and the directory rendered as C:\Test Scripts\.

Flat File Node


The Flat File node allows you to write data to a delimited text file. This is useful for exporting data that can be read by other analysis or spreadsheet software. Note: You cannot write files in the old Clementine cache format, because Clementine no longer uses that format for cache files. Clementine cache files are now saved in SPSS .sav format, which you can write using an SPSS Export node. For more information, see SPSS Export Node on p. 588.

Flat File Export Tab


Figure 18-6 File node: Export tab

Export file. Specify the name of the le. Enter a lename or click the File Chooser button to

browse to the les location.


Write mode. If Overwrite is selected, any existing data in the specied le will be overwritten. If
Append is selected, output from this node will be added to the end of the existing le, preserving

any data it contains.


Include field names. If this option is selected, eld names will be written to the rst line of the

output le. This option is available only for the Overwrite write mode.

588 Chapter 18

New line after each record. If this option is selected, each record will be written on a new line in

the output le.


Field separator. Select the character to insert between field values in the generated text file. Options are Comma, Tab, Space, and Other. If you select Other, enter the desired delimiter character(s) in the text box. Symbol quotes. Select the type of quoting to use for values of symbolic fields. Options are None

(values are not quoted), Single ('), Double ("), and Other. If you select Other, enter the desired quoting character(s) in the text box.
Encoding. Specifies the text-encoding method used. You can choose between the system default,

stream default, or UTF-8. The system default is specied in the Windows Control Panel or, if running in distributed mode, on the server computer. For more information, see Unicode Support in Clementine in Appendix B in Clementine 11.1 Users Guide. The stream default is specied in the Stream Properties dialog box. For more information, see Setting Options for Streams in Chapter 5 in Clementine 11.1 Users Guide.
Decimal symbol. Specify how decimals should be represented in the exported data. Stream default. The decimal separator defined by the current stream's default setting will be

used. This will normally be the decimal separator defined by the computer's locale settings.
Period (.). The period character will be used as the decimal separator. Comma (,). The comma character will be used as the decimal separator. Generate an import node for this data. Select this option to automatically generate a Variable File source node that will read the exported data file. For more information, see Variable File Node in Chapter 2 on p. 15.

SPSS Export Node


The SPSS Export node allows you to export data in SPSS .sav format. SPSS .sav files can be read by SPSS Base and other SPSS products. This is now also the format used for Clementine cache files. Mapping Clementine field names to SPSS variable names can sometimes cause errors because SPSS variable names are limited to 64 characters and cannot include certain characters, such as spaces, $, and so on. There are two ways to adjust for these restrictions: You can rename fields conforming to SPSS variable name requirements by clicking the Filter tab. For more information, see Renaming or Filtering Fields for SPSS on p. 590. Choose to export both field names and labels from Clementine.

589 Export Nodes

SPSS Export Node Export Tab


Figure 18-7 SPSS Export node: Export tab

Export file. Specify the name of the le. Enter a lename or click the le chooser button to

browse to the les location.


Export field names. Select a method of handling variable names and labels upon export from

Clementine to an SPSS .sav le.


Names and variable labels. Select to export both Clementine eld names and eld labels.

Names are exported as SPSS variable names, while labels are exported as SPSS variable labels.
Names as variable labels. Select to use the Clementine eld names as variable labels in

SPSS. Clementine allows characters in eld names that are invalid in SPSS variable names. To prevent possibly creating invalid SPSS names, select Names as variable labels instead, or use the Filter tab to adjust eld names.
Launch Application. If SPSS or AnswerTree is installed on your computer, you can select this option to invoke either application directly on the saved data file. Options for launching each application must be specified in the Helper Applications dialog box. For more information, see SPSS Helper Applications in Chapter 17 on p. 575. To simply create an SPSS .sav file without opening an external program, deselect this option. Generate an import node for this data. Select this option to automatically generate an SPSS File

node that will read the exported data file. For more information, see SPSS Import Node in Chapter 2 on p. 27.

590 Chapter 18

Renaming or Filtering Fields for SPSS


Before exporting or deploying data from Clementine to external applications such as SPSS, it may be necessary to rename or adjust field names. The SPSS Transform, SPSS Output, and SPSS Export dialog boxes contain a Filter tab to facilitate this process. A basic description of Filter tab functionality is discussed elsewhere. For more information, see Setting Filtering Options in Chapter 4 on p. 85. This topic provides tips for reading data into SPSS.
Figure 18-8 Renaming fields for SPSS on the Filter tab of the SPSS Transform node

Tips for SPSS

To adjust field names to conform with SPSS, select Rename For SPSS from the Filter menu. This adjusts field names in the Filter window according to the following restrictions for data in SPSS version 12.0 and higher.
Table 18-1 Field name restrictions and corrective action

SPSS restriction: Field names must begin with a letter.
Corrective renaming: The letter X is added to the beginning of the name.

SPSS restriction: The name cannot include blank spaces or any special characters except a period (.) or the symbols @, #, _, or $.
Corrective renaming: Invalid characters are replaced with a # symbol.

SPSS restriction: Field names cannot end in a period.
Corrective renaming: Periods are replaced with a # symbol.

SPSS restriction: Length of field names cannot exceed 64 characters.
Corrective renaming: Long names are truncated to 64 characters, according to standards for SPSS 12.0 and higher.

SPSS restriction: Field names must be unique. Note: Names in SPSS are not case sensitive.
Corrective renaming: Duplicate names are truncated to five characters and then appended with an index ensuring uniqueness.

SPSS restriction: Reserved keywords are ALL, NE, EQ, TO, LE, LT, BY, OR, GT, AND, NOT, GE, and WITH.
Corrective renaming: Field names matching a reserved word are appended with the # symbol. For example, WITH becomes WITH#.

591 Export Nodes

SAS Export Node


The SAS Export node allows you to write data in SAS format to be read into SAS or a SAS-compatible software package. You can export in three SAS le formats: SAS for Windows/OS2, SAS for UNIX, or SAS Version 7/8.

SAS Export Node Export Tab


Figure 18-9 SAS Export node: Export tab

Export file. Specify the name of the le. Enter a lename or click the File Chooser button to

browse to the les location.


Export. Specify the export le format. Options are SAS for Windows/OS2, SAS for UNIX, or SAS
Version 7/8.

Export field names. Select options for exporting eld names and labels from Clementine for

use with SAS.


Names and variable labels. Select to export both Clementine eld names and eld labels.

Names are exported as SAS variable names, while labels are exported as SAS variable labels.
Names as variable labels. Select to use the Clementine eld names as variable labels in SAS.

Clementine allows characters in eld names that are invalid in SAS variable names. To prevent possibly creating invalid SAS names, select Names and variable labels instead.
Generate an import node for this data. Select this option to automatically generate a SAS File node

that will read the exported data le. For more information, see SAS Import Node in Chapter 2 on p. 29.

592 Chapter 18

Excel Export Node


The Excel Export node outputs data in Microsoft Excel format (.xls). Optionally, you can choose to automatically launch Excel and open the exported file when the node is executed. Excel export is supported for Clementine Client and Server running on Windows platforms only and is not available on UNIX platforms. Note: Options for launching Excel are specified in the Helper Applications dialog box. For more information, see Other Helper Applications in Chapter 17 on p. 577.

Excel Node Export Tab


Figure 18-10 Excel node: Export tab

File name. Enter a lename or click the le chooser button to browse to the les location. The

default lename is excelexp.xls.


Include field names. Specifies whether field names should be included in the first row of the

worksheet.
Launch Excel. Specifies whether Excel is automatically launched on the exported file when the

node is executed. Note that when running in distributed mode against Clementine Server, the output is saved to the server file system, and Excel is launched on the Client with a copy of the exported file.
Generate an import node for this data. Select this option to automatically generate an Excel Import node that will read the exported data file. For more information, see Excel Import Node in Chapter 2 on p. 30.

SuperNodes

19

Chapter

Overview of SuperNodes
One of the reasons that Clementine's visual programming interface is so easy to learn is that each node has a clearly defined function. However, for complex processing, a long sequence of nodes may be necessary. Eventually, this may clutter the stream canvas and make it difficult to follow stream diagrams. There are two ways to avoid the clutter of a long and complex stream: You can split a processing sequence into several streams that feed one into the other. The first stream, for example, creates a data file that the second uses as input. The second creates a file that the third uses as input, and so on. You can manage these multiple streams by saving them in a project. A project provides organization for multiple streams and their output. However, a project file contains only a reference to the objects it contains, and you will still have multiple stream files to manage. You can create a SuperNode as a more streamlined alternative when working with complex stream processes. SuperNodes group multiple nodes into a single node by encapsulating sections of a data stream. This provides numerous benefits to the data miner: Streams are neater and more manageable. Nodes can be combined into a business-specific SuperNode. SuperNodes can be exported to libraries for reuse in multiple data mining projects.

Types of SuperNodes
SuperNodes are represented in the data stream by a star icon. The icon is shaded to represent the type of SuperNode and the direction in which the stream must ow to or from it. There are three types of SuperNodes: Source SuperNodes Process SuperNodes Terminal SuperNodes

593

594 Chapter 19

Source SuperNodes
Source SuperNodes contain a data source just like a normal source node and can be used anywhere that a normal source node can be used. The left side of a source SuperNode is shaded to indicate that it is closed on the left and that data must ow downstream from a SuperNode.
Figure 19-1 Source SuperNode with zoomed-in version imposed over stream

Source SuperNodes have only one connection point on the right, showing that data leaves the SuperNode and ows to the stream.

Process SuperNodes
Process SuperNodes contain only process nodes and are unshaded to show that data can ow both in and out of this type of SuperNode.

595 SuperNodes Figure 19-2 Process SuperNode with zoomed-in version imposed over stream

Process SuperNodes have connection points on both the left and right, showing that data enters the SuperNode and leaves to ow back to the stream. Although SuperNodes can contain additional stream fragments and even extra streams, both connection points must ow through a single path connecting the From Stream and To Stream points. Note: Process SuperNodes are also sometimes referred to as Manipulation SuperNodes.

Terminal SuperNodes
Terminal SuperNodes contain one or more terminal nodes (plot, table, and so on) and can be used in the same manner as a terminal node. A terminal SuperNode is shaded on the right side to indicate that it is closed on the right and that data can ow only into a terminal SuperNode.

596 Chapter 19 Figure 19-3 Terminal SuperNode with zoomed-in version imposed over stream

Terminal SuperNodes have only one connection point on the left, showing that data enters the SuperNode from the stream and terminates inside the SuperNode. Terminal SuperNodes can also contain scripts that are used to specify the order of execution for all terminal nodes inside the SuperNode. For more information, see SuperNodes and Scripting on p. 607.

Creating SuperNodes
Creating a SuperNode shrinks the data stream by encapsulating several nodes into one node. Once you have created or loaded a stream on the canvas, there are several ways to create a SuperNode.
Multiple Selection

The simplest way to create a SuperNode is by selecting all of the nodes that you want to encapsulate:
E Use the mouse to select multiple nodes on the stream canvas. You can also use Shift-click to

select a stream or section of a stream. Note: Nodes that you select must be from a continuous or forked stream. You cannot select nodes that are not adjacent or connected in some way.
E Then, using one of the following three methods, encapsulate the selected nodes:

Click the SuperNode icon (shaped like a star) on the toolbar. Right-click on the SuperNode, and from the context menu choose:
Create SuperNode From Selection

From the SuperNode menu, choose:


Create SuperNode From Selection

597 SuperNodes Figure 19-4 Creating a SuperNode using multiple selection

All three of these options encapsulate the nodes into a SuperNode shaded to reflect its type (source, process, or terminal) based on its contents.
Single Selection

You can also create a SuperNode by selecting a single node and using menu options to determine the start and end of the SuperNode or encapsulating everything downstream of the selected node.
E Click the node that determines the start of encapsulation. E From the SuperNode menu, choose: Create SuperNode From Here

598 Chapter 19 Figure 19-5 Creating a SuperNode using the context menu for single selection

SuperNodes can also be created more interactively by selecting the start and end of the stream section to encapsulate nodes:
E Click on the rst or last node that you want to include in the SuperNode. E From the SuperNode menu, choose: Create SuperNode Select... E Alternatively, you can use the context menu options by right-clicking on the desired node. E The cursor becomes a SuperNode icon, indicating that you must select another point in the stream.

Move either upstream or downstream to the other end of the SuperNode fragment and click on a node. This action will replace all nodes in between with the SuperNode star icon. Note: Nodes that you select must be from a continuous or forked stream. You cannot select nodes that are not adjacent or connected in some way.

Nesting SuperNodes
SuperNodes can be nested within other SuperNodes. The same rules for each type of SuperNode (source, process, and terminal) apply to nested SuperNodes. For example, a process SuperNode with nesting must have a continuous data ow through all nested SuperNodes in order for it to remain a process SuperNode. If one of the nested SuperNodes is terminal, then data would no longer ow through the hierarchy.

599 SuperNodes Figure 19-6 Process SuperNode nested within another process SuperNode

Terminal and source SuperNodes can contain other types of nested SuperNodes, but the same basic rules for creating SuperNodes apply.

Examples of Valid SuperNodes


Almost anything you create in Clementine can be encapsulated in a SuperNode. Following are examples of valid SuperNodes:
Figure 19-7 Valid process SuperNode with two connections in a valid stream flow

600 Chapter 19 Figure 19-8 Valid terminal SuperNode including separate stream used to test generated models

Figure 19-9 Valid process SuperNode containing a nested SuperNode

Examples of Invalid SuperNodes


The most important aspect of creating valid SuperNodes is to ensure that data flows linearly through the SuperNode connections. If there are two connections (a process SuperNode), then data must flow in a stream from the beginning connector to the ending connector. Similarly, a source SuperNode must allow data to flow from the source node to the single connector that brings data back to the zoomed-out stream.

601 SuperNodes Figure 19-10 Invalid source SuperNode: Source node not connected to the data flow path

Figure 19-11 Invalid terminal SuperNode: Nested SuperNode not connected to the data flow path

Editing SuperNodes
Once you have created a SuperNode, you can examine it more closely by zooming in to it. To view the contents of a SuperNode, you can use the zoom-in icon from the Clementine toolbar, or the following method:
E Right-click on a SuperNode. E From the context menu, choose Zoom In.

602 Chapter 19

The contents of the selected SuperNode will be displayed in a slightly different Clementine environment, with connectors showing the flow of data through the stream or stream fragment. At this level on the stream canvas, there are several tasks that you can perform: Modify the SuperNode type (source, process, or terminal). Create parameters or edit the values of a parameter. Parameters are used in scripting and CLEM expressions. Specify caching options for the SuperNode and its subnodes. Create or modify a SuperNode script (terminal SuperNodes only).

Modifying SuperNode Types


In some circumstances, it is useful to alter the type of a SuperNode. This option is available only when you are zoomed in to a SuperNode, and it applies only to the SuperNode at that level. The three types of SuperNodes and their connectors are:
Source SuperNode: One connection going out
Process SuperNode: Two connections, one coming in and one going out
Terminal SuperNode: One connection coming in

To change the type of a SuperNode:


E Be sure that you are zoomed in to the SuperNode. E Click the toolbar button for the type of SuperNode to which you want to convert. E Alternatively, you can use the SuperNode menu to choose a type. From the SuperNode menu, choose SuperNode Type, and then choose the type.

Annotating and Renaming SuperNodes


You can rename a SuperNode as it appears in the stream as well as write annotations used in a project or report. To access these properties:
E Right-click on a SuperNode (zoomed out) and choose Rename and Annotate. E Alternatively, from the SuperNode menu choose Rename and Annotate. This option is available

in both zoomed-in and zoomed-out modes. In both cases, a dialog box opens with the Annotations tab selected. Use the options here to customize the name displayed on the stream canvas and provide documentation regarding SuperNode operations.

603 SuperNodes Figure 19-12 Annotating a SuperNode

SuperNode Parameters
In Clementine, you have the ability to set user-defined variables, such as Minvalue, whose values can be specified when used in scripting or CLEM expressions. These variables are called parameters. You can set parameters for streams, sessions, and SuperNodes. Any parameters set for a SuperNode are available when building CLEM expressions in that SuperNode or any nested nodes. Parameters set for nested SuperNodes are not available to their parent SuperNode. There are two steps to creating and setting parameters for SuperNodes: Define parameters for the SuperNode. Then, specify the value for each parameter of the SuperNode. These parameters can then be used in CLEM expressions for any encapsulated nodes.

Defining SuperNode Parameters


Parameters for a SuperNode can be defined in both zoomed-out and zoomed-in modes. The parameters defined apply to all encapsulated nodes. To define the parameters of a SuperNode, you first need to access the Parameters tab of the SuperNode dialog box. Use one of the following methods to open the dialog box: Double-click a SuperNode in the stream. From the SuperNode menu, choose Set Parameters. Alternatively, when zoomed in to a SuperNode, choose Set Parameters from the context menu. Once you have opened the dialog box, the Parameters tab is visible with any previously defined parameters.
To define a new parameter:
E Click the Define Parameters button to open the dialog box.

604 Chapter 19 Figure 19-13 Defining parameters for a SuperNode

Name. Parameter names are listed here. You can create a new parameter by entering a name in

this field. For example, to create a parameter for the minimum temperature, you could type
minvalue. Do not include the $P- prefix that denotes a parameter in CLEM expressions. This

name is also used for display in the Expression Builder.


Long name. Lists the descriptive name for each parameter created. Storage. Select a storage type from the drop-down list. Storage indicates how the data values

are stored in the parameter. For example, when working with values containing leading zeros that you want to preserve (such as 008), you should select String as the storage type. Otherwise, the zeros will be stripped from the value. Available storage types are string, integer, real, time, date, and timestamp. For date parameters, note that values must be specified using ISO standard notation as below.
Value. Lists the current value for each parameter. Adjust the parameter as desired. Note that for date parameters, values must be specified in ISO standard notation (that is, YYYY-MM-DD). Dates specified in other formats are not accepted. Type (optional). If you plan to deploy the stream to an external application, select a usage type from the drop-down list. Otherwise, it is advisable to leave the Type column as is.

Note that long name, storage, and type options can be set for parameters through the user interface only. These options cannot be set using scripts. Click the arrows at the right to move the selected parameter further up or down the list of available parameters. Use the delete button (marked with an X) to remove the selected parameter.

Setting Values for SuperNode Parameters


Once you have dened parameters for a SuperNode, you can specify values using the parameters in a CLEM expression or script.

605 SuperNodes

To specify the parameters of a SuperNode:


E Double-click on the SuperNode icon to open the SuperNode dialog box. E Alternatively, from the SuperNode menu choose Set Parameters. E Click the Parameters tab. Note: The fields in this dialog box are the fields defined by clicking the Define Parameters button on this tab. E Enter a value in the text box for each parameter that you have created. For example, you can set the

value minvalue to a particular threshold of interest. This parameter can then be used in numerous operations, such as selecting records above or below this threshold for further exploration.
Figure 19-14 Specifying parameters for a SuperNode

Using SuperNode Parameters to Access Node Properties


SuperNode parameters can also be used to dene node properties (also known as slot parameters) for encapsulated nodes. For example, suppose you want to specify that a SuperNode train an encapsulated Neural Net node for a certain length of time using a random sample of the data available. Using parameters, you can specify values for the length of time and percentage sample.
Figure 19-15 Stream fragment encapsulated in a SuperNode

The example SuperNode contains a Sample node called Sample and a Neural Net node called Train. You can use the node dialog boxes to specify the Sample node's Sample setting as Random % and the Neural Net node's Stop on setting to Time. Once these options are specified,

606 Chapter 19

you can access the node properties with parameters and specify specic values for the SuperNode. In the SuperNode dialog box, click Define Parameters and create the following parameters:
Figure 19-16 Defining parameters to access node properties

Note: The parameter names, such as Sample.rand_pct, use correct syntax for referring to node properties, where Sample represents the name of the node and rand_pct is a node property. For more information, see Properties Reference Overview in Chapter 10 in Clementine 11.1 Scripting, Automation, and CEMI Reference. Once you have defined these parameters, you can easily modify values for the two Sample and Neural Net node properties without reopening each dialog box. Instead, simply select Set Parameters from the SuperNode menu to access the Parameters tab of the SuperNode dialog box, where you can specify new values for Random % and Time. This is particularly useful when exploring the data during numerous iterations of model building.
Figure 19-17 Specifying values for node properties on the Parameters tab in the SuperNode dialog box

607 SuperNodes

SuperNodes and Caching


From within a SuperNode, all nodes except terminal nodes can be cached. Caching is controlled by right-clicking on a node and choosing one of several options from the Cache context menu. This menu option is available both from outside a SuperNode and for the nodes encapsulated within a SuperNode.
Figure 19-18 Selecting caching options for a SuperNode

There are several guidelines for SuperNode caches: If any of the nodes encapsulated in a SuperNode have caching enabled, the SuperNode will also. Disabling the cache on a SuperNode disables the cache for all encapsulated nodes. Enabling caching on a SuperNode actually enables the cache on the last cacheable subnode. In other words, if the last subnode is a Select node, the cache will be enabled for that Select node. If the last subnode is a terminal node (which does not allow caching), the next node upstream that supports caching will be enabled. Once you have set caches for the subnodes of a SuperNode, any activities upstream from the cached node, such as adding or editing nodes, will flush the caches.

SuperNodes and Scripting


You can use the Clementine scripting language to write simple programs that manipulate and execute the contents of a terminal SuperNode. For instance, you might want to specify the order of execution for a complex stream. As an example, if a SuperNode contains a Set Globals node that needs to be executed before a Plot node, you can create a script that executes the Set Globals

608 Chapter 19

node rst. Values calculated by this node, such as the average or standard deviation, can then be used when the Plot node is executed. The Script tab of the SuperNode dialog box is available only for terminal SuperNodes.
To open the scripting dialog box for a terminal SuperNode:
E Right-click on the SuperNode canvas and choose SuperNode Script. E Alternatively, in both zoomed-in and zoomed-out modes, you can choose SuperNode Script from

the SuperNode menu. Note: SuperNode scripts are executed only with the stream and SuperNode when you have selected Run this script in the dialog box.
Figure 19-19 Creating a script for a SuperNode

Specific options for scripting and its use within Clementine are discussed in the Scripting, Automation, and CEMI Reference, which can be accessed along with other documentation from the Windows Start menu (All Programs, Clementine 10.0, Documentation). For more information, see Scripting Overview in Chapter 2 in Clementine 11.1 Scripting, Automation, and CEMI Reference.

Saving and Loading SuperNodes


One of the advantages of SuperNodes is that they can be saved and reused in other streams. When saving and loading SuperNodes, note that they use an .slb extension.
To save a SuperNode:
E Zoom in on the SuperNode.

609 SuperNodes E From the SuperNode menu, choose Save SuperNode. E Specify a filename and directory in the dialog box. E Select whether to add the saved SuperNode to the current project. E Click Save.

To load a SuperNode:
E From the Insert menu in the Clementine window, choose SuperNode. E Select a SuperNode file (.slb) from the current directory or browse to a different one. E Click Load.

Note: Imported SuperNodes have the default values for all of their parameters. To change the parameters, double-click on a SuperNode on the stream canvas.

Glossary
Area under the ROC curve. The ROC curve provides an index for the performance of a model. The

further the curve lies above the reference line, the more accurate the test.

Box's M test. A test for the equality of the group covariance matrices. For sufficiently large samples, a nonsignificant p value means there is insufficient evidence that the matrices differ. The test is sensitive to departures from multivariate normality. Correlation (Pearson). Measure of the strength of association between two variables. Two variables

are correlated if a change in the value of one signifies a change in the other. Values close to 1 (or -1) indicate a very strong relationship; values close to 0 indicate a weak or no relationship. The sign of the coefficient indicates the direction of the relationship, where a positive correlation means that increases in one variable tend to accompany increases in the other variable.

Correlation T. The test statistic for the correlation coefficient, indicating whether the correlation is significantly different from zero. Correlation T df. Degrees of freedom for the test statistic. Correlation T significance. Significance of the t statistic. Covariance. An unstandardized measure of association between two variables, equal to the cross-product deviation divided by N-1.

Fisher's. Displays Fisher's classification function coefficients that can be used directly for classification. A set of coefficients is obtained for each group, and a case is assigned to the group for which it has the largest discriminant score.

Importance. A measure used to rank fields or results on a percentage scale, defined broadly as 1 minus the p value, or the probability of obtaining a result as extreme or more extreme than the observed result by chance alone. The measure used to rank importance depends on whether the predictors and the target are all categorical, all numeric ranges, or a mix of range and categorical. Despite the differences in computation, the use of a standard percentage scale allows comparisons across different types of fields and results.

Kurtosis. A measure of the extent to which observations cluster around a central point. For a

normal distribution, the value of the kurtosis statistic is zero. Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution, and negative kurtosis indicates that the observations cluster less and have shorter tails.
Leave-one-out Classification. Each case in the analysis is classified by the functions derived from all cases other than that case. It is also known as the "U-method." Lift (Cumulative). The ratio of hits in cumulative quantiles relative to the overall sample (where quantiles are sorted in terms of confidence for the prediction). For example, a lift value of 3 for the top quantile indicates a hit rate three times as high as for the sample overall. For a good model, lift should start well above 1.0 for the top quantiles and then drop off sharply toward 1.0 for the lower quantiles. For a model that provides no information, the lift will hover around 1.0. MAE. Mean absolute error. Measures how much the series varies from its model-predicted level. MAE is reported in the original series units.

610

611 Glossary

Mahalanobis Distance. A measure of how much a case's values on the independent variables differ

from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables.

MAPE. Mean Absolute Percentage Error. A measure of how much a dependent series varies from its model-predicted level. It is independent of the units used and can therefore be used to compare series with different units. MaxAE. Maximum Absolute Error. The largest forecasted error, expressed in the same units

as the dependent series. Like MaxAPE, it is useful for imagining the worst-case scenario for your forecasts. Maximum absolute error and maximum absolute percentage error may occur at different series pointsfor example, when the absolute error for a large series value is slightly larger than the absolute error for a small series value. In that case, the maximum absolute error will occur at the larger series value and the maximum absolute percentage error will occur at the smaller series value.
MaxAPE. Maximum Absolute Percentage Error. The largest forecasted error, expressed as a

percentage. This measure is useful for imagining a worst-case scenario for your forecasts.

Maximizing the Smallest F Ratio Method of Entry. A method of variable selection in stepwise analysis based on maximizing an F ratio computed from the Mahalanobis distance between groups.

Maximum. The largest value of a numeric variable.

Mean. A measure of central tendency. The arithmetic average, the sum divided by the number of cases.

Median. The value above and below which half of the cases fall, the 50th percentile. If there is an even number of cases, the median is the average of the two middle cases when they are sorted in ascending or descending order. The median is a measure of central tendency not sensitive to outlying values (unlike the mean, which can be affected by a few extremely high or low values).

Minimize Wilks' Lambda. A variable selection method for stepwise discriminant analysis that chooses variables for entry into the equation on the basis of how much they lower Wilks' lambda. At each step, the variable that minimizes the overall Wilks' lambda is entered. Minimum. The smallest value of a numeric variable. Mode. The most frequently occurring value. If several values share the greatest frequency of occurrence, each of them is a mode. Normalized BIC. Normalized Bayesian Information Criterion. A general measure of the overall fit of a model that attempts to account for model complexity. It is a score based upon the mean square error and includes a penalty for the number of parameters in the model and the length of the series. The penalty removes the advantage of models with more parameters, making the statistic easy to compare across different models for the same series. Number of variables. Ranks models based on the number of variables used. Overall accuracy. The percentage of records that is correctly predicted by the model relative to the total number of records. Profit (Cumulative). The sum of profits across cumulative percentiles (sorted in terms of confidence for the prediction), as computed based on the specified cost, revenue, and weight criteria. Typically, the profit starts near 0 for the top percentile, increases steadily, and then decreases. For a good model, profits will show a well-defined peak, which is reported along with the percentile where it occurs. For a model that provides no information, the profit curve will be relatively straight and may be increasing, decreasing, or level, depending on the cost/revenue structure that applies.


Range. The difference between the largest and smallest values of a numeric variable, the maximum minus the minimum.

Rao's V (Discriminant Analysis). A measure of the differences between group means. Also called the Lawley-Hotelling trace. At each step, the variable that maximizes the increase in Rao's V is entered. After selecting this option, enter the minimum value a variable must have to enter the analysis.

RMSE. Root Mean Square Error. The square root of mean square error. A measure of how much a dependent series varies from its model-predicted level, expressed in the same units as the dependent series.
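A minimal sketch of the RMSE calculation (Python with NumPy; illustrative only):

import numpy as np

def rmse(actual, predicted):
    # square root of the mean squared error, in the units of the dependent series
    diff = np.asarray(actual, float) - np.asarray(predicted, float)
    return float(np.sqrt(np.mean(diff ** 2)))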

R-Squared. Goodness-of-fit measure of a linear model, sometimes called the coefficient of determination. It is the proportion of variation in the dependent variable explained by the regression model. It ranges in value from 0 to 1. Small values indicate that the model does not fit the data well.

Skewness. A measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness value of 0. A distribution with a significant positive skewness has a long right tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
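The two definitions above correspond to the following simple calculations (a Python sketch assuming NumPy arrays as inputs; the skewness shown is the plain moment-based estimator, whereas SPSS reports a bias-adjusted variant):

import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares
    return 1.0 - ss_res / ss_tot                 # proportion of variation explained

def skewness(x):
    z = (x - x.mean()) / x.std()                 # standardized values
    return float(np.mean(z ** 3))                # positive: long right tail; negative: long left tail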

Standard Deviation. A measure of dispersion around the mean, equal to the square root of the variance. The standard deviation is measured in the same units as the original variable.

Standard Error of Kurtosis. The ratio of kurtosis to its standard error can be used as a test of normality (that is, you can reject normality if the ratio is less than -2 or greater than +2). A large positive value for kurtosis indicates that the tails of the distribution are longer than those of a normal distribution; a negative value for kurtosis indicates shorter tails (becoming like those of a box-shaped uniform distribution).

Standard Error of Mean. A measure of how much the value of the mean may vary from sample to sample taken from the same distribution. It can be used to roughly compare the observed mean to a hypothesized value (that is, you can conclude the two values are different if the ratio of the difference to the standard error is less than -2 or greater than +2).

Standard Error of Skewness. The ratio of skewness to its standard error can be used as a test of normality (that is, you can reject normality if the ratio is less than -2 or greater than +2). A large positive value for skewness indicates a long right tail; an extreme negative value indicates a long left tail.
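The "+/-2" guideline in these entries can be sketched as follows (Python with NumPy; the standard errors used here are the common large-sample approximations sqrt(6/n) and sqrt(24/n), whereas SPSS computes exact small-sample formulas):

import numpy as np

def standardized_shape_ratios(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)
    excess_kurt = np.mean(z ** 4) - 3.0          # normal distribution has excess kurtosis 0
    ratio_skew = skew / np.sqrt(6.0 / n)         # |ratio| > 2 suggests asymmetry
    ratio_kurt = excess_kurt / np.sqrt(24.0 / n) # |ratio| > 2 suggests non-normal tails
    return ratio_skew, ratio_kurt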

Stationary R-squared. A measure that compares the stationary part of the model to a simple mean model. This measure is preferable to ordinary R-squared when there is a trend or seasonal pattern. Stationary R-squared can be negative, with a range of negative infinity to 1. Negative values mean that the model under consideration is worse than the baseline model. Positive values mean that the model under consideration is better than the baseline model.

Sum. The sum or total of the values, across all cases with nonmissing values.

Territorial Map. A plot of the boundaries used to classify cases into groups based on function values. The numbers correspond to groups into which cases are classified. The mean for each group is indicated by an asterisk within its boundaries. The map is not displayed if there is only one discriminant function.

Unexplained Variance. At each step, the variable that minimizes the sum of the unexplained variation between groups is entered.

Unique. Evaluates all effects simultaneously, adjusting each effect for all other effects of any type.


Valid. Valid cases, having neither the system-missing value nor a value defined as user-missing.

Variance. A measure of dispersion around the mean, equal to the sum of squared deviations from the mean divided by one less than the number of cases. The variance is measured in units that are the square of those of the variable itself.
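For reference, a minimal sketch of the variance and standard deviation as defined above (Python with NumPy; equivalent to np.var(x, ddof=1) and its square root):

import numpy as np

def variance_and_sd(x):
    x = np.asarray(x, dtype=float)
    var = np.sum((x - x.mean()) ** 2) / (len(x) - 1)   # squared deviations over n - 1
    return var, np.sqrt(var)                           # sd is in the variable's own units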

Bibliography

Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. 1994. Time series analysis: Forecasting and control, 3rd ed. Englewood Cliffs, N.J.: Prentice Hall.

Gardner, E. S. 1985. Exponential smoothing: The state of the art. Journal of Forecasting, 4, 1-28.

Peña, D., G. C. Tiao, and R. S. Tsay, eds. 2001. A course in time series analysis. New York: John Wiley and Sons.


Index
1-in-n sampling, 49 3-D graphs, 158 absolute condence difference to prior (apriori evaluation measure), 455 add model rules, 353 adding records, 52 additional information panel decision tree models, 315 additive outlier in Time Series Modeler, 506 additive outlier patches, 492 additive outliers, 492 additive patch outlier in Time Series Modeler, 506 ADO databases importing, 37 advanced output Discriminant Node, 400, 404 Factor/PCA node, 394, 396 Generalized Linear Models node, 413 Linear Regression node, 367, 370 Logistic Regression node, 381, 388 aggregate node overview, 52 parallel processing, 54 performance, 54 setting options, 53 aggregating records, 122 aggregating time series data, 131 Akaike Information Criterion linear regression, 367 algorithms, 238, 274 alpha neural net node, 328 alpha for merging CHAID node, 306 alpha for splitting CHAID node, 306 QUEST node, 308 alternative models, 355 Alternative Rules pane, 353 Amemiyas Prediction Criterion linear regression, 367 analysis browser interpreting, 539 Analysis node, 537 analysis tab, 537 output tab, 529 Analysis Services integrating with Clementine, 7 animation, 156, 159 Anomaly Detection models, 260 anomaly elds, 261 cutoff value, 261 peer groups, 261 scoring, 259, 261 Anomaly Detection node, 254 adjustment coefcient, 257 anomaly elds, 256 anomaly index, 256 cutoff value, 256 missing values, 257 noise level, 257 peer groups, 257 Anonymize node creating anonymized values, 104 overview, 101 setting options, 102 anonymizing eld names, 87 ANOVA Means node, 559 AnswerTree launching from Clementine, 577, 589 antecedent rules without, 459 anti-join, 57 Append node eld matching, 65 overview, 65 setting options, 65 tagging elds, 61 apriori node, 452, 454 evaluation measures, 454 expert options, 454 options, 452 tabular versus transactional data, 236 ARIMA models, 495 autoregressive orders, 503 constant, 503 criteria in Time Series Modeler, 502 differencing orders, 503 moving average orders, 503 outliers, 506 seasonal orders, 503 transfer functions, 504 ascending order, 54 assess a model, 357 assessment in Excel, 359 assigning data types, 44, 68, 70



association models deploying, 473 transposing scores, 473 Association module, 2 association node DB2 Intelligent Miner, 236 association plots, 205, 207 association rule node, 469 association rules, 317, 320321, 448, 460461, 463, 467470, 481482, 485 apriori, 452 browsing with DB2 Intelligent Miner Visualization, 465466 CARMA, 456 for sequences, 476 GRI, 450 asymptotic correlations logistic regression, 381, 388 asymptotic covariance logistic regression, 381 audit Data Audit node, 541 initial data audit, 541 auto-typing, 72, 74 autocorrelation function series, 493 automatic recode, 105106 autoregression ARIMA models, 503 balance factors, 51 Balance node generating, 193, 198, 203 overview, 50 setting options, 51 banding continuous variables, 275 base category Logistic Regression node, 377 baseline evaluation chart options, 220 basket data, 449, 472473 best line evaluation chart options, 220 biased data, 50 Binary Classier node, 263, 265, 267268, 270 binary set encoding, 327 Binning node equal counts, 112 equal sums, 112 xed-width bins, 111 mean/standard deviation bins, 116 optimal, 117 overview, 109 previewing bins, 118 ranks, 115 setting options, 110

BITMAP indexes database tables, 584 Blank function padding time series, 132 blank handling, 44, 70, 75 Binning node, 110 lling values, 98 blank rows Excel les, 30 blank values in Matrix tables, 533 blanks, 549 in Matrix tables, 533 Bonferroni adjustment CHAID node, 307 boosting, 310, 319 Boxs M test Discriminant Node, 400 build rule node , 312 bulk loading, 585 business rule evaluation chart options, 221 C&R tree node, 275, 296, 312313, 316317 case weights, 236 frequency weights, 236 impurity measures, 301 misclassication costs, 304 ordered sets, 72 prior probabilities, 301, 303 pruning, 301 stopping criteria, 301 stopping options, 302 surrogates, 301 tree depth, 297 C&R Trees, 274 C5.0, 274, 312313, 316317, 319321 boosting, 319 parallel processing, 309, 311 performance, 309, 311 C5.0 node, 308, 310 boosting, 310 misclassication costs, 304, 310 options, 310 ordered sets, 72 pruning, 310 cache SuperNodes, 607 cache le node, 27 CARMA multiple consequents, 473 CARMA node, 456, 459 content eld(s), 456 data formats, 456 expert options, 459 eld options, 456 ID eld, 456


options, 458 tabular versus transactional data, 459 time eld, 456 case data importing survey data, 3637 case processing summary in Generalized Linear Models, 414 category merging, 275 cell ranges Excel les, 30 CEMI models, 246 CHAID, 274 CHAID node, 275, 305, 312313, 316317 misclassication costs, 304 tree depth, 297 chart options, 361 charts saving output, 529 checking types, 79 chi-square Feature Selection, 251 Matrix node, 535 classication gains decision trees, 283, 287 Classication module, 2 classication table logistic regression, 381 classication trees, 296, 305, 307308 clear values, 44 CLEM expressions, 47 Clementine Batch, 1 Clementine Client, 1 Clementine Server, 1 Clementine Solution Publisher, 6 Cleo, 7 cluster analysis Anomaly Detection, 257 number of clusters, 432 cluster viewer display options, 443445 importance, 446 interpreting results, 437 overview, 436 text view, 447 using, 441 view all, 443444 clustering, 418419, 426, 429431, 433435 overall display, 436 viewing clusters, 436 coefcient of variance screening elds, 249 coercing values, 79 coincidence matrix Analysis node, 537 Collection node creating, 202 graph window, 203

overview, 201 collinearity diagnostics linear regression, 367, 370 column order table browser, 527, 531 column width for elds, 82 column-wise binding, 585 combining data, 65 from multiple les, 56 comma, 17, 82 comma-delimited les exporting, 526, 592 saving, 529 comment characters in variable les, 16 commit size, 585 compressed binary encoding neural nets, 327 concatenating records, 65 conditions specifying a series, 95 condence apriori node, 452 association rules, 462463, 484 CARMA node, 458 decision tree models, 315 for sequences, 482 GRI node, 451 neural networks, 329 sequence node, 478 condence difference (apriori evaluation measure), 455 condence intervals linear regression, 367, 370 logistic regression, 381 Means node, 562563 condence ratio (apriori evaluation measure), 455 condences decision tree models, 317 logistic regression, 387 rulesets, 317 connections database, 25 to Predictive Enterprise Repository, 10 consequent multiple consequents, 459 content eld(s) CARMA node, 456 sequence node, 476 contiguous data sampling, 49 contiguous keys, 53 continuous variables segmenting, 275 contrast coefcients matrix in Generalized Linear Models, 414 convergence options Generalized Linear Models node, 412


Logistic Regression node, 380 converting sets to ags, 121, 123 copying type attributes, 81 correlation matrix in Generalized Linear Models, 414 correlations, 555 absolute value, 555 descriptive labels, 555 Means node, 563 probability, 555 signicance, 555 statistics output, 556 costs decision trees, 304 evaluation charts, 220 Count eld padding or aggregating time series, 132 Time Intervals node, 132 counts Binning node, 112 statistics output, 556 covariance matrix in Generalized Linear Models, 414 linear regression, 367, 370 Cramrs V Feature Selection, 251 create snapshot, 347 create a mining task, 350 CREATE INDEX command, 582 create/edit mining task, 351 creating new elds, 8788 synthetic data, 32 CRISP-DM data understanding, 9 CRISP-DM process model data preparation, 68 cross-tabulation Matrix node, 532, 534 currency display format, 83 custom splits decision trees, 277279 customize a model, 355 cut points Binning node, 109 cyclic periods Time Intervals node, 136 daily measurements Time Intervals node, 140141 DAT les exporting, 526, 592 saving, 529 data aggregating, 52 anonymizing, 101

audit, 541 exploring, 541 preparation, 47 reduction, 275, 390 storage, 20, 34, 98, 100 storage type, 75 understanding, 47 data audit browser Edit menu, 545 File menu, 545 generating graphs, 554 generating nodes, 554 Data Audit node, 541 output tab, 529 settings tab, 542 Data Provider Denition, 10 data quality Data Audit browser, 549 data types, 18, 44, 68, 70 instantiation, 73 database bulk loading, 585 connecting to data, 25 reading data from, 2223 selecting a table, 25 database connection settings Dimensions import node, 40 password, 41 user ID, 41 database modeling Analysis Services, 7 IBM Intelligent Miner, 7 Oracle Data Miner, 7 database node query editor, 27 Database Output node, 578 data source, 580 export tab, 580 indexing tables, 582 schema, 581 table name, 580 date storage format, 20, 34 date/time, 71 dates setting formats, 8283 DB2 Intelligent Miner association node transactional data, 236 DB2 Intelligent Miner Visualization, 465 decile bins, 112 decimal places display formats, 83 decimal symbol, 1617, 82 Flat File Output node, 587 number display formats, 84 Decision List models excluding segments, 334 mailing lists, 333


PMML, 341 scoring, 334, 341 segments, 341 settings, 342 SQL generation, 342 Decision List node, 333 binning method, 340 expert options, 340 model options, 338 requirements, 338 search direction, 338 search width, 340 target value, 338 Decision List Viewer working with, 348 workspace, 342 decision tree models, 273, 312313, 316 additional information panel, 315 exporting results, 294 generating, 294 rule frequencies, 315 surrogates, 315 decision tree node, 295 decision trees, 275276, 281, 296, 305, 307308, 312313 custom splits, 277 exporting results, 294 gains, 282283, 287, 290, 292 gains charts, 287 generating models, 293294 misclassication costs, 304 predictors, 278 prots, 285 ROI, 285 stopping options, 302 surrogates, 279 decreasing data, 4849 dene a build selection, 350 dening model measures, 357 degrees of freedom Matrix node, 535 Means node, 562563 delete a segment rule condition, 354 deleting output objects, 524 deleting segments, 356 delimiters, 1617, 585 deployability measure, 462 Derive node conditional, 96 converting eld storage, 97 count, 95 ag, 92 formula, 91 generating, 183, 193, 198, 203, 211, 228 generating from a Binning node, 118 generating from bins, 109 multiple derive, 89

overview, 87 recoding values, 97 set, 93 setting options, 88 state, 94 descending order, 54 descriptive statistics in Generalized Linear Models, 414 descriptives linear regression, 367, 370 difference method neural net condences, 330 difference of condence quotient to 1 (apriori evaluation measure), 455 difference transformation ARIMA models, 503 differencing transformation, 494 dimension reduction, 419 Dimensions data importing, 3637, 41, 43 Dimensions Import node, 3637, 41, 43 Dimensions log les importing, 37 Dimensions Metadata Documents importing, 37 direct oblimin rotation Factor/PCA node, 393 Directed Web node overview, 205, 207 direction of elds, 44, 70, 80 directives, 51 decision trees, 294 discarding elds, 84 samples, 49 discriminant analysis, 398 Discriminant Equation node, 403404 Discriminant Node, 398399 advanced (expert) output, 400 expert options, 399 model form, 398 stepping criteria (eld selection), 402 Discriminant Node Discriminant Node expert options convergence criteria, 399 Discriminant Node, 399 expert output, 399 disguising data for use in a model, 101 display formats currency, 83 decimal places, 83 grouping symbol, 83 numbers, 83 scientic, 83 Distinct node overview, 66 distribution, 196


Distribution node creating, 191192 overview, 190 using the graph, 193 using the table, 193 DPD, 10 DTD, 243 dummy coding, 121 duplicate elds, 56, 85 records, 66 Durbin-Watson test linear regression, 367, 370 edit segment rule, 354 editing graphs, 166 automatic settings, 167 axes, 171 colors and patterns, 168 dashing, 168 legend position, 173 margins, 170 padding, 170 panels, 172 point aspect ratio, 169 point rotation, 169 point shape, 169 rules, 167 scales, 171 selection, 167 size of graphic elements, 170 text, 167 transpose, 172 eigenvalues Factor/PCA node, 393 encapsulating nodes, 596 encoding, 17, 19, 588 Enterprise View node, 10 EOL characters, 16 epsilon for convergence CHAID node, 306 equal counts Binning node, 112 equamax rotation Factor/PCA node, 393 estimation period, 133 eta neural net node, 328 evaluating models, 537 Evaluation Chart node business rule, 221 creating, 220221 hit condition, 221 overview, 215 reading results, 222 score expression, 221 using the graph, 223

evaluation measures apriori node, 454 events identifying, 491 Excel launching from Clementine, 577, 592 Excel Export node, 592 Excel les exporting, 592 Excel Import node, 30 generating from output, 592 excluding segments, 356 execution specifying the order of, 607 Exhaustive CHAID, 274275, 297 exhaustive pruning neural net node, 324 expected values Matrix node, 534 Expert Modeler criteria in Time Series Modeler, 499 outliers, 500 expert options apriori node, 454 CARMA node, 459 Factor/PCA node, 392 Generalized Linear Models node, 409 K-Means node, 428 Kohonen node, 423 Logistic Regression node, 379 sequence node, 479 expert output Discriminant Node, 400 Factor/PCA node, 393 Generalized Linear Models node, 413 Linear Regression node, 367 Logistic Regression node, 381 exploring data Data Audit node, 541 exponential smoothing, 495 criteria in Time Series Modeler, 501 export decimal places, 83 export nodes, 578 exporting generated models, 238 output, 526 PMML, 242, 245 SQL, 240 SuperNodes, 608 exporting data DAT les, 592 at le format, 587 SAS format, 591 text, 592 to a database, 578 to AnswerTree, 588 to Excel, 592


to SPSS, 588 Expression Builder, 47 extension derived eld, 89 F statistic Feature Selection, 251 Means node, 562 factor analysis, 390, 394396 Factor Equation node, 394396 Factor/PCA node, 390392 eigenvalues, 393 estimation methods, 391 expert options, 392 expert output, 393 factor scores, 392 iterations, 392 missing-value handling, 392 number of factors, 392 rotation options, 393 false values, 78 Feature Selection models, 251253 Feature Selection node eld importance, 248 generating Filter nodes, 253 importance, 247248, 251252 ranking predictors, 247248, 251252 screening predictors, 247248, 251252 feedback graph neural net node, 326 eld attributes, 81 eld derivation formula, 91 eld names, 86 anonymizing, 87 data export, 578, 587, 589, 591 eld operations nodes, 68 generating from a data audit, 554 eld options modeling nodes, 235 SLRM node, 516 Field Reorder node, 148 automatic sorting, 150 custom ordering, 148 setting options, 148 eld selection Linear Regression node, 364 eld storage converting, 97 eld types, 44, 70 elds anonymizing data, 101 delimiters, 17 deriving multiple elds, 89 eld and value labels, 44, 70, 76 ranking importance, 247, 250253 reordering, 148 screening, 247, 249, 251253

selecting for analysis, 247, 249253 selecting multiple, 91 transposing, 125126 Filler node overview, 98 FILLFACTOR keyword indexing database tables, 584 Filter node generating from decision trees, 295 generating from neural network model, 332 overview, 84 setting options, 85 ltering elds, 60, 84 for SPSS, 590 ltering rules, 463, 485 association rules, 463 First function time series aggregation, 132 rst hit ruleset, 320 scal year Time Intervals node, 138 xed le node overview, 18 setting options, 18 xed-eld text data, 18 ag type, 7172, 78 Flat File Output node, 587 export tab, 587 forecasting overview, 488 predictor series, 494 format les, 30 formats data, 20, 82 fractional ranks, 115 fraud detection Anomaly Detection, 254 free-eld text data, 15 frequencies Binning node, 112 decision tree models, 315 functional transformation, 494 gains decision trees, 282283, 287, 292 exporting, 294 gains chart, 360 gains charts, 215, 222 decision trees, 287 gains-based selection decision trees, 290 general estimable function in Generalized Linear Models, 414 Generalized Linear Models Equation node, 415416 Generalized Linear Models node, 405 advanced (expert) output, 413 convergence options, 412


expert options, 409 elds, 406 model form, 407 generate new model, 356 generated K-Means node, 429430 generated Kohonen node, 424425 generated models, 237, 241, 246, 312313, 317, 319321, 329, 331, 368370, 384388, 394396, 403404, 415416, 424425, 429430, 433434, 460461, 467, 470, 482, 485 exporting, 238239 generating processing nodes from, 241 menus, 239 printing, 239 saving, 239 saving and loading, 238 scoring data with, 241 SLRM node, 520 Summary tab, 240 tabs, 239 using in streams, 241 generated models palette, 237 saving and loading, 238 generated net node, 329, 331 generated sequence rules node, 481482, 485 generated sequence ruleset, 468469 generated TwoStep cluster node, 433434 generating ags, 122, 124 getting started, 342 Gini impurity measure, 302 global values, 566 goodness of t in Generalized Linear Models, 414 goodness-of-t chi-square statistics logistic regression, 381, 388 graph nodes, 155, 162 graphs 3-D, 158 animation, 156 appearance options, 161 axes, 171 axis labels, 174 collections, 201 color overlay, 156 colors and patterns, 168 copying, 176 dashings, 168 default color scheme, 175 deleting regions, 188 distributions, 190 edit mode, 164 editing, 166 editing regions, 187 evaluation charts, 215 exporting, 176 footnote, 174 generating from a data audit, 554

histograms, 196 interaction mode, 162 legend position, 173 margins, 170 multiplot, 188 output, 160 padding, 170 panel overlay, 156 panels, 172 plots, 176 point aspect ratio, 169 point rotation, 169 point shape, 169 printing, 176 rotating a 3-D image, 158 saving, 176 saving edited layouts, 175 saving layout changes, 175 saving output, 529 scales, 171 selection mode, 162 shape overlay, 156 size of graphic elements, 170 size overlay, 156 stylesheet, 175 text, 167 time series, 225 title, 174 tooltips, 165 transparency, 156 transpose, 172 webs, 205, 207 with animation, 159 GRI node, 450451 options, 451 grouping symbol number display formats, 84 grouping values, 193 handling missing values, 68 hassubstring function, 41 helper applications, 575 DB2 Intelligent Miner Visualization, 465 Histogram node creating, 197 overview, 196 using the graph, 198 history decision tree models, 315 History node, 147 overview, 146 hits decision tree gains, 282 evaluation chart options, 221 holdouts time series modeling, 133


Hosmer and Lemeshow goodness-of-t logistic regression, 388 hourly measurements Time Intervals node, 142 HTML saving output, 529 HTML output Report node, 565 view in browser, 525 IBM, 466 DB2 Intelligent Miner Visualization, 465 IBM Intelligent Miner integrating with Clementine, 7 PMML export, 245 ID eld CARMA node, 456 sequence node, 476 if-then-else statements, 96 imbalanced data, 50 importance comparing means, 560 in the cluster viewer, 446 Means node, 562563 ranking predictors, 247, 250253 importing PMML, 243, 245 PMML models, 238 SuperNodes, 608 impurity measures C&R tree node, 302 decision trees, 301 in-database modeling Analysis Services, 7 IBM Intelligent Miner, 7 Oracle Data Miner, 7 In2data databases importing, 37 incomplete records, 59 increasing performance, 49 index decision tree gains, 282 indexing database tables, 582 information difference (apriori evaluation measure), 455 inner join, 57 innovational outlier in Time Series Modeler, 506 innovational outliers, 492 insert model segment, 353 instances, 462, 483 decision tree models, 315 instantiation, 44, 7074 source node, 45 integer ranges, 77 integer storage format, 20, 34 integration ARIMA models, 503

interaction identication, 275 interactive trees, 273, 275276, 278, 281 custom splits, 277 exporting results, 294 gains, 282283, 287, 290, 292 gains charts, 287 generating models, 293294 prots, 285 ROI, 285 surrogates, 279 intervals time series data, 128 interventions identifying, 491 iteration history in Generalized Linear Models, 414 logistic regression, 381 jittering, 182 joining datasets, 65 joins, 5657, 59 partial outer, 60 justication for elds, 82 k-means clustering, 418, 426, 429430 k-means generated models, 429 K-Means node, 426428 distance eld, 427 encoding value for sets, 428 expert options, 428 stopping criteria, 428 key elds, 53, 122 key method, 56 key value for aggregation, 53 Kohonen generated models, 424 Kohonen networks, 418 Kohonen node, 419, 421, 423 binary set encoding option (removed), 421 expert options, 423 feedback graph, 421 learning rate, 423 neighborhood, 419, 423 stopping criteria, 421 L matrix in Generalized Linear Models, 414 label elds labeling records in output, 80 label types survey import, 39 labels, 78 exporting, 589, 591 importing, 28, 30 specifying, 44, 70, 7577, 79 value, 243


variable, 243 lag ACF and PACF, 493 lagged data, 146 Lagrange multiplier test Generalized Linear Models, 414 lambda Feature Selection, 251 language survey import, 39 large databases, 47 performing a data audit, 541 Last function time series aggregation, 132 learning rate neural net node, 328 legends showing or hiding, 211 level shift outlier in Time Series Modeler, 506 level shift outliers, 492 level stabilizing transformation, 494 lift, 462 association rules, 463 decision tree gains, 282 lift charts, 215, 222 decision tree gains, 288 likelihood ratio test logistic regression, 381, 388 likelihood-ratio chi-square CHAID node, 306 Feature Selection, 251 line plots, 155, 176, 188 linear regression, 363364, 369370, 395 Linear Regression Equation node, 369370, 395 Linear Regression node, 364, 366 advanced (expert) output, 367 Backwards estimation method, 364 expert options, 366 expert output, 366 eld selection, 364 Forwards estimation method, 364 missing-value handling, 366 stepping criteria (eld selection), 366367 Stepwise estimation method, 364 weighted least squares, 236 linear trends identifying, 489 links Web node, 208 loading generated models, 238 local trend outlier in Time Series Modeler, 506 local trend outliers, 493 locally weighted least squares regression Plot node, 181, 227

LOESS smoother Plot node, 181, 227 log transformation, 494 in Time Series Modeler, 504 log-odds, 385 logistic regression, 363, 372, 385388 Logistic Regression Equation node, 384, 386388 equations, 385 Model tab, 385 Logistic Regression node, 372373, 379 advanced (expert) output, 381 convergence criteria, 379 convergence options, 380 expert options, 379 expert output, 379 model form, 373 stepping criteria (eld selection), 382 lowess smoother. See LOESS smoother Plot node, 181, 227 mailing lists Decision List models, 333 main dataset, 65 Mallows Prediction Criterion linear regression, 367 managers models tab, 238 outputs tab, 524 Managers pane, 346 market research data importing, 3637, 41, 43 matrix browser Generate menu, 535 Matrix node, 532 appearance tab, 534 column percentages, 534 cross-tabulation, 534 highlighting, 534 output browser, 535 output tab, 529 row percentages, 534 settings tab, 532 sorting rows and columns, 534 matrix output saving as text, 529 Max function time series aggregation, 132 maximum Set Globals node, 567 statistics output, 556 maximum value for aggregation, 53 MDD documents importing, 37 mean Binning node, 116 Set Globals node, 567 statistics output, 556


Mean function time series aggregation, 132 Mean of most recent function padding time series, 132 mean value for aggregation, 53 mean value for records, 52 mean/standard deviation used to bin elds, 116 means comparing, 558561 Means node, 558 importance, 560 independent groups, 559 output browser, 561562 output tab, 529 paired elds, 560 median statistics output, 556 member (SAS import) setting, 30 Merge node ltering elds, 60 optimization settings, 63 overview, 56 setting options, 59 tagging elds, 61 metadata, 44, 70, 75 importing survey data, 3637 Microsoft Excel import node, 30 Min function time series aggregation, 132 minimum Set Globals node, 567 statistics output, 556 minimum value for aggregation, 53 mining tasks, 348 Decision List models, 333 minute increments Time Intervals node, 143144 misclassication costs C5.0 node, 310 decision trees, 304 missing data predictor series, 494 missing values, 68, 75 CHAID trees, 278 excluding from SQL, 318 lling, 549 handling, 549 in Aggregate nodes, 52 in Matrix tables, 533 screening elds, 249 mode statistics output, 556 Mode function time series aggregation, 132 model evaluation, 215

model t linear regression, 367, 370 logistic regression, 388 model information in Generalized Linear Models, 414 model measures dene, 357 refresh, 358 model options SLRM node, 517 modeling nodes, 231, 254, 308, 323, 364, 372, 390, 398, 405, 419, 426, 431, 450, 452, 476, 515 modeling roles specifying for elds, 44, 70, 80 models anonymizing data for, 101 ARIMA, 503 importing, 238 Summary tab, 240 models tab saving and loading, 238 modifying data values, 87 momentum neural net node, 328 monthly data Time Intervals node, 139 Most recent function padding time series, 132 moving average ARIMA models, 503 MS Excel setup integration format, 360 multilayer perceptrons, 323 multiple derive, 89 multiple elds selecting, 91 multiple inputs, 56 multiple regression, 364 multiple response data importing, 3637, 41, 43 Multiplot node creating, 188 overview, 188 using the graph, 190 natural log transformation, 494 in Time Series Modeler, 504 natural order altering, 148 NetGenesis Web analytics technology, 7 network web graph, 208 neural net node, 323324 alpha, 328 dynamic training method, 324 eta, 328 exhaustive prune training method, 324 feedback graph, 326


eld options, 235 learning rate (eta), 328 momentum (alpha), 328 multiple training method, 324 prune training method, 324 quick training method, 324 radial basis function network (RBFN) training method, 324 sensitivity analysis, 326 stopping criteria, 324 training log, 326 neural network models generating lter nodes, 332 neural networks, 323, 329, 331, 419, 424425 condence method, 329330 node properties, 605 nominal regression, 372 nonlinear trends identifying, 489 nonseasonal cycles, 490 normalize values graph nodes, 189, 227 normalized chi-square (apriori evaluation measure), 455 null values in Matrix tables, 533 mixed data, 21, 35 nulls, 75, 549 in Matrix tables, 533 number display formats, 83 ODBC, 22 bulk loading via, 585 ODBC output node. See Database Output node, 578 odbc-oracle-properties.cfg le, 585 one-way ANOVA Means node, 559 opening output objects, 524 optimal binning, 117 optimization SQL pushback, 7 optimizing performance, 326, 422, 428, 454 options SPSS, 575 Oracle, 22 Oracle Data Miner integrating with Clementine, 7 order merging, 56 order of execution specifying, 607 order of input data, 61 ordered sets, 72 ordered twoing impurity measure, 302 ordering data, 54, 148 ordinal data, 72 organize data selections, 352 outer join, 57

outliers, 491 additive, 492 additive patches, 492 ARIMA models, 506 deterministic, 491 Expert Modeler, 500 identifying, 254 in series, 491 innovational, 492 level shift, 492 local trend, 493 seasonal additive, 492 transient change, 492 output exporting, 526 generating new nodes from, 525 HTML, 525 printing, 525 saving, 525 output les saving, 529 output formats, 529 output manager, 524 output nodes, 523, 528, 532, 537, 541, 554, 563, 566, 573, 578, 587588, 591592 output tab, 529 overlays, 179 for graphs, 156 overtraining neural net node, 325 overwriting database tables, 580 p value, 251 importance, 560 padding time series data, 131 panel, 156 parallel processing aggregate node, 54 C5.0, 309, 311 merging, 63 sorting, 56 parameter estimates in Generalized Linear Models, 414 logistic regression, 388 parameters node properties, 605 setting for SuperNodes, 603 SuperNodes, 603604 part and partial correlations linear regression, 367, 370 partial autocorrelation function series, 493 partial joins, 57, 60 partition elds, 44, 70, 80, 119120 selecting, 235 Partition node, 119120


partitioning data, 119120, 235 Analysis node, 537 evaluation charts, 221 model building, 249, 256, 297, 325, 365, 373, 391, 399, 408, 421, 427, 432, 451, 453, 458, 478, 517 neural net node, 325 partitions, 292 passing samples, 49 password database connection settings, 25, 41 PCA, 390 Pearson chi-square CHAID node, 306 Feature Selection, 251 Matrix node, 535 Pearson correlations Means node, 563 statistics output, 556 peer groups Anomaly Detection, 257 percentages, 49 percentile bins, 112 performance aggregate node, 54 Binning nodes, 118 C5.0, 309, 311 Derive nodes, 118 merging, 63 sorting, 56 performance enhancements, 326, 383, 422, 428, 454 performance evaluation statistic, 537 period, 82 periodicity in Time Series Modeler, 504 time series data, 128 periods Time Intervals node, 136 Plot node, 155 creating, 176, 179 using the graph, 183 plotting associations, 205, 207 PMML exporting models, 242, 245 importing models, 243, 245 PMML models exporting, 238 importing, 238 point interventions identifying, 491 point plots, 155, 176, 188 prediction, 275 Predictive Applications Wizard, 7 Predictive Enterprise Repository, 7 connecting to, 10 Predictive Framework, 7 PredictiveMarketing, 7

predictor series, 494 missing data, 494 predictors decision trees, 278 ranking importance, 247, 250253 screening, 247, 249, 251253 selecting for analysis, 247, 249253 surrogates, 279 Preview pane, 344 primary key elds Database Output node, 582 principal components analysis (PCA), 390, 394396 printing output, 525 prior probabilities, 303 decision trees, 301, 303 prioritizing segments, 355 probabilities in logistic regression, 385 prot charts, 215, 222 prots decision tree gains, 285 Promax rotation Factor/PCA node, 393 properties for elds, 82 node, 605 pruning decision trees, 296, 301 pseudo R-square logistic regression, 388 publish to web URL setting, 577 pulses in series, 491 python bulk loading scripts, 585 quality browser generating lter nodes, 552 generating select nodes, 553 quality report Data Audit browser, 549 Quancept data importing, 37 Quantum data importing, 37 Quanvert databases importing, 37 quarterly data Time Intervals node, 138 quartile bins, 112 Quartimax rotation Factor/PCA node, 393 query, 2223 query editor, 27 QUEST, 274 QUEST node, 275, 307, 312313, 316317 misclassication costs, 304


prior probabilities, 307 pruning, 307 stopping criteria, 307 surrogates, 307 tree depth, 297 quintile bins, 112 quotation marks importing text les, 17 quotes for database export, 580 R-squared change linear regression, 367, 370 radial basis function network (RBFN), 324 random samples, 49 random seed value sampling records, 50, 121 range statistics output, 556 range eld type, 77 ranges, 7172 missing values, 75 rank cases, 115 ranking predictors, 247, 250253 real ranges, 77 real storage format, 20, 34 Reclassify node, 106, 108 generating from a distribution, 193 overview, 105, 109 recode, 105106, 109 record counts, 53 labels, 80 length, 18 record operations nodes, 47 aggregate node, 52 Append node, 65 Balance node, 50 Distinct node, 66 merge node, 56 sample node, 49 select node, 48 sort node, 54 Time Intervals node, 128 records merging, 56 transposing, 125126 reference category Logistic Regression node, 377 refreshing measures, 358 regression, 364, 368 Regression Equation node, 368 regression gains decision trees, 287, 290, 292 regression trees, 296, 305, 307 relative importance of inputs. See sensitivity analysis, 331

renaming elds for export, 590 renaming output objects, 524 replacing eld values, 98 report browser, 566 Report node, 563 output tab, 529 template tab, 564 reports saving output, 529 residuals Matrix node, 534 response charts, 215, 222 decision tree gains, 282, 289 Restructure node, 123124 with Aggregate node, 124 restructuring data, 123 revenue evaluation charts, 220 risk estimate decision tree gains, 292 risks exporting, 294 ROI charts, 215, 222 decision tree gains, 285 roles specifying for elds, 44, 70, 80 rolling up time series data, 131 rotating 3-D graphs, 158 rotation of factors/components, 393 row-wise binding, 585 rule conditions Decision List models, 333 rule ID, 463 rule induction, 273, 296, 305, 307308, 450, 452 Rule SuperNode generating from sequence rules, 486 rules association rules, 450, 452, 456 rule support, 462, 484 ruleset node, 295, 317, 320321, 468470 rulesets generating from decision trees, 295 run a mining task, 348 sample node overview, 49 setting options, 49 sampling, 49 SAS data, 591 export node, 591 import node, 29 setting import options, 30 transport les, 29


types of import les, 29 .sav les, 27 saving generated models, 238 output, 525 output objects, 524, 529 scale factors, 51 scatterplots, 155, 176, 188 scenario, 10 schema Database Output node, 581 Schwarz Bayesian Criterion linear regression, 367 scientic display format, 83 Score statistic, 381383 scoring evaluation chart options, 221 scoring data, 241 screening predictors, 247, 249, 251253 scripting SuperNodes, 607 .sd2 (SAS) les, 29 searching table browser, 531 seasonal additive outlier in Time Series Modeler, 506 seasonal additive outliers, 492 seasonal difference transformation ARIMA models, 503 seasonal differencing transformation, 494 seasonal orders ARIMA models, 503 seasonality, 490 identifying, 489 second increments Time Intervals node, 145 seed value sampling and records, 50, 121 segment rule edit, 354 segmentation, 275 Segmentation module, 2 segments Decision List models, 333 Select node generating, 183, 193, 198, 203, 211, 228 generating from decision trees, 295 overview, 48 selecting rows (cases), 48 selection criteria linear regression, 367, 370 Self-Learning Response Model node local trend, 515 self-organizing maps, 419 sensitivity analysis neural net node, 326 neural networks, 331

sequence browser, 485 sequence detection, 448, 476 sequence node, 476, 479 content eld(s), 476 data formats, 476 expert options, 479 eld options, 476 generated sequence rules, 481 ID eld, 476 options, 478 predictions, 481 tabular versus transactional data, 479 time eld, 476 sequence rules node generating a rule SuperNode, 486 sequences generated sequence rules, 482, 485 sequence browser, 485 sorting, 485 series transforming, 494 Session Results tab, 346 Set Globals node, 566 settings tab, 567 set random seed sampling records, 50, 121 Set to Flag node, 121122 set type, 7172, 78 sets converting to ags, 121, 123 transforming, 106, 108 settings options SLRM node, 518 signicance correlation strength, 555 simplemax method neural net condences, 330 simplemax scoring, 327 .slb les, 608 SLRM node, 515 eld options, 516 generated models, 520 model settings, 521 preferences for target elds, 519, 522 randomization of results, 519, 522 settings, 519, 521 smoother Plot node, 181, 227 Snapshots tab, 346347 softmax method neural net condences, 330 softmax scoring, 327 Solution Publisher, 6 sort node optimization settings, 55 overview, 54


sorting elds, 148 presorted elds, 55 records, 54 source nodes database node, 22 Enterprise View node, 10 Excel Import node, 30 xed le node, 18 instantiating types, 45 overview, 9 SAS import node, 29 SPSS import node, 27 user input node, 3233 variable le node, 15 splits decision trees, 277279 SPSS launching from Clementine, 573, 575, 589 license location, 575 valid eld names, 590 SPSS data les importing survey data, 37 SPSS export node, 588 export tab, 589 ordinal data, 72 SPSS import node ordinal data, 72 overview, 27 SPSS MR importing survey data, 3637, 41, 43 SPSS output browser, 575 SPSS Output node, 573 Output tab, 574 Syntax tab, 573 SPSS Transform node, 151 allowable syntax, 152 setting options, 152 SQL, 2223, 27 export, 240 SQL generation logistic regression, 388 neural networks, 331 rulesets, 318 SQL optimization. See SQL generation, 7 square root transformation, 494 in Time Series Modeler, 504 .ssd (SAS) les, 29 standard deviation Binning node, 116 screening elds, 249 Set Globals node, 567 statistics output, 556 standard deviation for aggregation, 53 standard error of mean statistics output, 556

standard error rule QUEST node, 308 start a mining task, 350 statistical models, 363 statistics Data Audit node, 541 Matrix node, 532 statistics browser Generate menu, 556 generating lter nodes, 558 interpreting, 556 Statistics node, 554 correlation labels, 555 correlations, 555 output tab, 529 settings tab, 555 statistics, 555 step interventions identifying, 491 stepwise eld selection Discriminant Node, 402 stopping criteria decision trees, 301 stopping options decision trees, 302 storage, 75 converting, 9798, 100 storage formats, 20 stratication, 275 string storage format, 20, 34 sum Set Globals node, 567 statistics output, 556 Sum function time series aggregation, 132 summary data, 52 summary statistics Data Audit node, 541 logistic regression, 381 summed values, 53 SuperNode parameters, 603605 SuperNodes, 593 creating, 596 creating caches for, 607 editing, 601 loading, 608 nesting, 598 process SuperNodes, 594 saving, 608 scripting, 607 setting parameters, 603 source SuperNodes, 594 terminal SuperNodes, 595 types of, 593 zooming in, 601 supervised binning, 117


support, 463 antecedent support, 462, 484 apriori node, 452 association rules, 463 CARMA node, 458, 460 for sequences, 482 GRI node, 451 rule support, 462, 484 sequence node, 478 surrogates C&R tree node, 301 decision tree models, 315 decision trees, 279, 301 QUEST node, 308 survey data importing, 3637, 41, 43 Surveycraft data importing, 37 syntax tab SPSS Output node, 573 SPSS Transform node, 152 synthetic data, 32 system missing values in Matrix tables, 533 system-missing values, 549 t statistic Feature Selection, 251 t test independent samples, 559 Means node, 559560, 563 paired samples, 560 table browser Generate menu, 531 reordering columns, 527, 531 searching, 531 selecting cells, 527, 531 Table node, 528 column justication, 82 column width, 82 format tab, 82 output settings, 528 output tab, 529 settings tab, 528 table owner, 25 tables joining, 57 reading from a database, 25 saving as text, 529 saving output, 529 tabular data, 449, 472 apriori node, 236 CARMA node, 456 sequence node, 476 transposing, 473 tabular output reordering columns, 527

selecting cells, 527 tags, 56, 61 templates Report node, 564 territorial map Discriminant Node, 400 test metric neural net node, 326 test samples partitioning data, 119120 text data, 15, 18 encoding, 17, 19, 588 text les, 15 exporting, 592 Text Mining for Clementine, 6 thresholds viewing bin thresholds, 118 ties Binning node, 112 tiles Binning node, 112 till-roll data, 449, 472473 time setting formats, 82 time eld CARMA node, 456 sequence node, 476 time formats, 83 Time Intervals node, 129, 131, 133 overview, 128 Time Plot node appearance options, 227 overview, 225 plot options, 226 using the graph, 228 time plots, 155 time series, 146 time series data aggregating, 128, 131 building from data, 131 dening, 128129, 131, 133 estimation period, 133 holdouts, 133 intervals, 129 labeling, 128129, 131, 133 padding, 128, 131 Time Series Model node, 509 Time Series Modeler outliers, 506 periodicity, 504 series transformation, 504 transfer functions, 504 Time Series node, 495 ARIMA criteria, 502 ARIMA models, 495 Expert Modeler criteria, 499


exponential smoothing, 495 exponential smoothing criteria, 501 outliers, 500 requirements, 496 time storage format, 20, 34 TimeIndex eld Time Intervals node, 130 TimeLabel eld Time Intervals node, 130 timestamp, 71 timestamp storage format, 20, 34 tooltips in graphs, 165 .tpt (SAS) les, 29 train metric neural net node, 326 train net node. See neural net node, 323 training datasets, 49 training samples partitioning data, 119120 transactional data, 449, 472473 apriori node, 236 CARMA node, 456 DB2 Intelligent Miner association node, 236 sequence node, 476 transfer functions, 504 delay, 504 denominator orders, 504 difference orders, 504 numerator orders, 504 seasonal orders, 504 Transform node, 568 transformations reclassify, 105, 109 recode, 105, 109 transforming series, 494 transient change outliers, 492 transient outlier in Time Series Modeler, 506 transparency, 156 Transpose node, 125 eld names, 126 numeric elds, 126 string elds, 126 transposing data, 125126 transposing tabular output, 473 tree builder, 275276, 281 custom splits, 277 exporting results, 294 gains, 282283, 287, 290, 292 gains charts, 287 generating models, 293294 predictors, 278 prots, 285 ROI, 285 surrogates, 279 tree depth C&R tree node, 298

CHAID node, 298 QUEST node, 298 tree directives C&R tree node, 298 CHAID node, 298 decision trees, 294 QUEST node, 298 tree map decision tree models, 316 tree viewer, 316 tree-based analysis general uses, 275 trends identifying, 489 True if any true function time series aggregation, 132 true values, 78 truncating eld names, 8586 truth-table data, 449, 472473 two-headed rules, 459 twoing impurity measure, 302 TwoStep cluster node, 431433 number of clusters, 432 options, 432 ordered sets, 72 outlier handling, 432 standardization of elds, 432 TwoStep clustering, 418, 433434 type, 20 type attributes, 81 Type node blank handling, 75 clearing values, 44 column justication, 82 column width, 82 copying types, 81 ag eld type, 78 format tab, 82 overview, 70 range eld type, 77 set eld type, 78 setting modeling role, 80 setting options, 71 unbiased data, 50 undened values, 59 UNIQUE keyword indexing database tables, 584 unique records, 66 unrened models, 251253, 448 unrened rule node, 246, 460461, 467468 unsupervised learning, 418 unsupervised models, 419 usage type, 20 user ID database connection settings, 41


user input node overview, 32 setting options, 33 user-missing values, 549 in Matrix tables, 533 UTF-8 encoding, 17, 19, 588 validation samples partitioning data, 119120 value labels, 27 values eld and value labels, 44, 70, 75 reading, 74 specifying, 75 variable le node, 15 setting options, 16 variable labels, 27 SPSS Export node, 588 variable names data export, 578, 587, 589, 591 variables screening, 275 variance statistics output, 556 variance stabilizing transformation, 494 Varimax rotation Factor/PCA node, 393 viewer tab decision tree models, 316 viewing HTML output in browser, 525 views, 25 vingtile bins, 112 visualization clustering models, 436 decision trees, 316 graphs and charts, 155 using IBM tools, 465466 visualize a model, 360 voting ruleset, 320 Wald statistic, 381383 Web Mining for Clementine, 7 Web node adjusting points, 212 adjusting thresholds, 214215 appearance options, 210 change layout, 213 creating, 208 links slider, 213 overview, 205, 207 slider, 213 using the graph, 211 weekly data Time Intervals node, 139 weight elds, 236

weighted least squares, 236 weights evaluation charts, 220 working model pane, 342 worksheets importing from Excel, 30 XLS les exporting, 592 XML output Report node, 565 yearly data Time Intervals node, 137 zooming, 601
