Clementine 12.0 User's Guide
For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact:

SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Graphs powered by SPSS Inc.'s nViZn(TM) advanced visualization technology (http://www.spss.com/sm/nvizn) Patent No. 7,023,453

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

Project phases are based on the CRISP-DM process model. Copyright 1997-2003 by CRISP-DM Consortium (http://www.crisp-dm.org).

Some sample datasets are included from the UCI Knowledge Discovery in Databases Archive: Hettich, S. and Bay, S. D. 1999. The UCI KDD Archive (http://kdd.ics.uci.edu). Irvine, CA: University of California, Department of Information and Computer Science.

Microsoft and Windows are registered trademarks of Microsoft Corporation. IBM, DB2, and Intelligent Miner are trademarks of IBM Corporation in the U.S.A. and/or other countries. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. UNIX is a registered trademark of The Open Group. Linux is a registered trademark of Linus Torvalds. Red Hat is a registered trademark of Red Hat Corporation. Solaris is a registered trademark of Sun Microsystems Corporation. DataDirect and SequeLink are registered trademarks of DataDirect Technologies.

ICU 3.2.1, ICU 3.6, and ICU4J 2.8. Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

SiteMesh 2.0.1. This product includes software developed by the OpenSymphony Group (http://www.opensymphony.com).

ZSI 2.0. This product includes ZSI 2.0 software licensed under the following terms. Copyright 2001, Zolera Systems, Inc. All Rights Reserved. Copyright 2002-2003, Rich Salz. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Clementine 12.0 User's Guide
Copyright 2007 by Integral Solutions Limited. All rights reserved. Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
Preface
Clementine is the SPSS enterprise-strength data mining workbench. Clementine helps organizations to improve customer and citizen relationships through an in-depth understanding of data. Organizations use the insight gained from Clementine to retain profitable customers, identify cross-selling opportunities, attract new customers, detect fraud, reduce risk, and improve government service delivery. Clementine's visual interface invites users to apply their specific business expertise, which leads to more powerful predictive models and shortens time-to-solution. Clementine offers many modeling techniques, such as prediction, classification, segmentation, and association detection algorithms. Once models are created, Clementine Solution Publisher enables their delivery enterprise-wide to decision makers or to a database.
Serial Numbers
Your serial number is your identification number with SPSS Inc. You will need this serial number when you contact SPSS Inc. for information regarding support, payment, or an upgraded system. The serial number was provided with your Clementine system.
Customer Service
If you have any questions concerning your shipment or account, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Please have your serial number ready for identification.
Training Seminars
SPSS Inc. provides both public and onsite training seminars. All seminars feature hands-on workshops. Seminars will be offered in major cities on a regular basis. For more information on these seminars, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/.
Technical Support
The services of SPSS Technical Support are available to registered customers. Student Version customers can obtain technical support only for installation and environmental issues. Customers may contact Technical Support for assistance in using Clementine products or for installation help for one of the supported hardware environments. To reach Technical Support, see the SPSS Web site at http://www.spss.com or contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Be prepared to identify yourself, your organization, and the serial number of your system.
Contacting SPSS
If you would like to be on our mailing list, contact one of our offices, listed on our Web site at http://www.spss.com/worldwide/.
Contents

1 About Clementine
  Clementine Server and Clementine Batch
  Clementine Modules
  Clementine Options
  Text Mining for Clementine
  Web Mining for Clementine
  Clementine Documentation
  Application Examples
  Demos Folder

2 New Features

3 Clementine Overview
  Getting Started
  Starting Clementine
  Clementine Interface at a Glance
  Clementine Stream Canvas
  Nodes Palette
  Clementine Managers
  Clementine Projects
  Clementine Application Templates
  Clementine Toolbars
  Customizing the Clementine Window
  Using the Mouse in Clementine
  Using Shortcut Keys
  Printing
  Automating Clementine

  Data Mining Overview
  Assessing the Data
  A Strategy for Data Mining
  The CRISP-DM Process Model
  Types of Models
  Data Mining Examples
  Using Application Templates

5 Building Streams
  Stream-Building Overview
  Building Data Streams
  Working with Nodes
  Working with Streams
  Executing Streams
  Saving Data Streams
  Loading Files
  Mapping Data Streams
  Tips and Shortcuts

  Overview of Missing Values
  Handling Missing Values
  Handling Records with Missing Values
  Handling Fields with Missing Values
  Imputing or Filling Missing Values
  CLEM Functions for Missing Values

  Expressions and Conditions
  Stream, Session, and SuperNode Parameters
  Working with Strings
  Handling Blanks and Missing Values
  Working with Numbers
  Working with Times and Dates
  Summarizing Multiple Fields
  Working with Multiple-Response Data
  The Expression Builder
  Accessing the Expression Builder
  Creating Expressions
  Selecting Functions
  Selecting Fields, Parameters, and Global Variables
  Viewing or Selecting Values
  Checking CLEM Expressions
  Find and Replace

  CLEM Reference Overview
  CLEM Datatypes
  Integers
  Reals
  Characters
  Strings
  Lists
  Fields
  Dates
  Time
  CLEM Operators
  Functions Reference
  Conventions in Function Descriptions
  Information Functions
  Conversion Functions
  Comparison Functions
  Logical Functions
  Numeric Functions
  Trigonometric Functions
  Probability Functions
  Bitwise Integer Operations
  Random Functions
  String Functions
  SoundEx Functions
  Date and Time Functions
  Sequence Functions
  Global Functions
  Functions Handling Blanks and Null Values
  Special Fields

  Deploying Scenarios to SPSS Predictive Enterprise Services
  SPSS Predictive Enterprise View
  Stream Deployment Options
  Scoring and Modeling Parameters
  SPSS Predictive Enterprise Repository
  Connecting to SPSS Predictive Enterprise Repository
  Storing Objects
  Retrieving Stored Objects
  Browsing Repository Content
  Object Properties
  Deleting Repository Objects
  Searching Repository Content
  Adding and Removing Folders

  Cleo Wizard Overview
  Cleo Stream Prerequisites
  Step 1: Cleo Wizard Overview Screen
  Importing and Exporting Models as PMML
  Model Types Supporting PMML

  Introduction to Projects
  CRISP-DM View
  Classes View
  Building a Project
  Creating a New Project
  Adding to a Project
  Transferring Projects to SPSS Predictive Enterprise Repository
  Setting Project Properties
  Annotating a Project
  Object Properties
  Closing a Project
  Generating a Report

12 Customizing Clementine
  Customizing Clementine Options
  Setting Clementine Options
  System Options
  Setting Default Directories
  Setting User Options
  Setting User Information
  Customizing the Nodes Palette
  Customizing the Palette Manager
  Changing a Palette Tab View

  Order of Nodes
  Node Caches
  Performance: Process Nodes
  Performance: Modeling Nodes
  Performance: CLEM Expressions

  Overview of Accessibility in Clementine
  Types of Accessibility Support
  Accessibility for the Visually Impaired
  Accessibility for Blind Users
  Keyboard Accessibility
  Using a Screen Reader
  Accessibility in the Tree Builder
  Tips for Use
  Interference with Other Software
  JAWS and Java
  Using Graphs in Clementine

B Unicode Support

Index
Chapter 1
About Clementine
Clementine is a data mining workbench that enables you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the industry-standard CRISP-DM model, Clementine supports the entire data mining process, from data to better business results. Clementine Client can be purchased as a standalone product, or used in combination with Clementine Server. A number of additional modules and options are also available, as summarized in the following sections. For more information, see http://www.spss.com/clementine/.
Clementine Client. Clementine Client is installed and run on the user's desktop computer. It can be run in local mode as a standalone product or in distributed mode along with Clementine Server for improved performance on large datasets.
Clementine Server. Clementine Server runs continually in distributed analysis mode together with one or more Clementine Client installations, providing superior performance on large datasets because memory-intensive operations can be done on the server without downloading data to the client computer. Clementine Server also provides support for SQL optimization, batch-mode processing, and in-database modeling capabilities, delivering further benefits in performance and automation. At least one Clementine Client or Clementine Batch installation must be present to run an analysis.
Clementine Batch. A special version of the client that runs in batch mode only, providing support for the complete analytical capabilities of Clementine without access to the regular user interface. This allows long-running or repetitive tasks to be performed without user intervention and without the presence of the user interface on the screen. Unlike Clementine Client, which can be run as a standalone product, Clementine Batch must be licensed and used only in combination with Clementine Server.
Clementine Modules
Clementine offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the Modeling palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems. Modeling nodes are packaged into Base, Classification, Association, and Segmentation modules. For more information, see Overview of Modeling Nodes in Chapter 3 in Clementine 12.0 Modeling Nodes.
Clementine Options
In addition to the modules, the following components and features can be separately purchased and licensed for use with Clementine. Note that additional products or updates may also become available. For more information, see http://www.spss.com/clementine/.
Clementine Server access, providing improved scalability and performance on large datasets, as well as support for SQL optimization, batch-mode automation, and in-database modeling capabilities.
Clementine Solution Publisher, for real-time or automated scoring outside the Clementine environment. For more information, see Clementine Solution Publisher in Chapter 2 in Clementine 12.0 Solution Publisher.
Deployment to SPSS Predictive Enterprise Services. For more information, see Deploying Scenarios to SPSS Predictive Enterprise Services in Chapter 9 on p. 136.
Clementine Documentation
Complete documentation is available from the Clementine 12.0 program group under SPSS Inc on the Windows Start menu. This includes online Help and PDF documentation for Clementine Client, Clementine Server, and Clementine Solution Publisher, as well as the Applications Guide and other supporting materials. Complete documentation for each product is also available under the \documentation folder on each product CD.
Clementine Users Guide. General introduction to using Clementine, including how to build
data streams, handle missing values, build CLEM expressions, work with projects and reports, and package streams for deployment to SPSS Predictive Enterprise Services or Predictive Applications.
Clementine Source, Process, and Output Nodes. Descriptions of all the nodes used to read,
process, and output data in different formats. Effectively this means all nodes other than modeling nodes.
Clementine Modeling Nodes. Clementine offers a variety of modeling methods taken from
machine learning, articial intelligence, and statistics. For more information, see Overview of Modeling Nodes in Chapter 3 in Clementine 12.0 Modeling Nodes.
Clementine Applications Guide. The examples in this guide provide brief, targeted introductions
to specific modeling methods and techniques. An online version of this guide is also available from the Help menu. To access the sample streams and data files from any Clementine Client installation, choose Demos from the Clementine 12.0 program group under SPSS Inc on the Windows Start menu. For more information, see Application Examples on p. 4.
Clementine Scripting and Automation. Information on automating the system through scripting
and batch mode execution, including the properties that can be used to manipulate nodes and streams. The section on CEMI that was available in previous releases has been removed from the printed version, but is still included in the online Help. Note: CEMI has been deprecated in release 12.0 in favor of CLEF and will no longer be supported in Clementine 13.0.
Clementine CLEF Developer's Guide. CLEF provides the ability to integrate third-party programs
such as data processing routines or modeling algorithms as nodes in Clementine, and replaces the CEMI functionality available in previous releases.
Clementine In-Database Mining Guide. Information on how to leverage the power of your
database to improve performance and extend the range of analytical capabilities through third-party algorithms.
Clementine Server and Performance Guide. Information on how to configure and administer
Clementine Server.
that enables organizations to publish streams for use outside of the standard Clementine environment.
CRISP-DM 1.0 Guide. Step-by-step guide to data mining using the CRISP-DM methodology.
SPSS Command Syntax Reference. Documentation of SPSS command syntax (available
Most Help features in this application use technology based on Microsoft Internet Explorer. Some versions of Internet Explorer (including the version provided with Microsoft Windows XP, Service Pack 2) will by default block what they consider to be active content in Internet Explorer windows on your local computer. This default setting may result in some blocked content in Help features. To see all Help content, you can change the default behavior of Internet Explorer.
► From the Internet Explorer menus choose:
  Tools > Internet Options...
► Click the Advanced tab.
► Scroll down to the Security section.
► Select (check) Allow active content to run in files on My Computer.
Application Examples
While the data mining tools in Clementine can help solve a wide variety of business and organizational problems, the application examples provide brief, targeted introductions to specific modeling methods and techniques. The datasets used here are much smaller than the enormous data stores managed by some data miners, but the concepts and methods involved should be scalable to real-world applications. You can access the examples by choosing Application Examples from the Help menu in Clementine Client. The data files and sample streams are installed in the Demos folder under the product installation directory. For more information, see Demos Folder on p. 5. A PDF version of the Applications Guide is also available. For more information, see Clementine Documentation on p. 3.
Database modeling examples. For more information, see Database Modeling Overview in Chapter
Demos Folder
The data files and sample streams used with the application examples are installed in the Demos folder under the product installation directory. This folder can also be accessed from the Clementine 12.0 program group under SPSS Inc on the Windows Start menu, or by choosing Demos from the list of recent directories in the File Open dialog box.
Figure 1-1 Choosing the Demos folder from the list of recently-used directories
Streams are organized into subfolders by module (Base, Classification, Segmentation, and so forth), along with additional subfolders for SPSS integration and database modeling.
Chapter 2
New Features
New Features in Clementine 12.0
The Clementine 12.0 release adds a wide range of new features, including improved support for automated modeling, a wide range of new modeling and analytical techniques, powerful new tools for producing tables and charts, and improvements to performance, scalability, and extensibility.
Automated Modeling
Figure 2-1 Numeric Predictor results
Numeric Predictor node. The Numeric Predictor node allows you to create and compare models
for numeric range (continuous) outcomes using a number of different methods, making it easier to try out a variety of approaches and compare the results. It works in a similar manner to the Binary Classifier node but for numeric range rather than binary targets. For more information, see Numeric Predictor Node in Chapter 5 in Clementine 12.0 Modeling Nodes.
Ensemble node. The Ensemble node combines two or more model nuggets to obtain more reliable
predictions than can be gained from the individual models. For more information, see Ensemble Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
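To illustrate the general idea of combining several models into one prediction, the following Python sketch uses simple majority voting. It is an illustration only and is not necessarily the method the Ensemble node applies; the function name majority_vote and the sample predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-record predictions from several models by majority vote.

    predictions_per_model: list of lists, one inner list per model,
    all of the same length (one prediction per record).
    """
    combined = []
    for record_predictions in zip(*predictions_per_model):
        # Pick the label predicted by the largest number of models.
        label, _count = Counter(record_predictions).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from three models for four records.
model_a = ["yes", "no", "yes", "no"]
model_b = ["yes", "yes", "no", "no"]
model_c = ["no", "yes", "yes", "no"]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

With an odd number of models and a binary target, a vote can never tie, which is why three models are used in this sketch.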
New Algorithms
Figure 2-2 Bayesian network model
Bayesian Network node. The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with common-sense real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes. In the model viewer, the model is portrayed as a graphical network that consists of variables and directed links (also known as nodes and arcs); the parameters are displayed as conditional probability tables. In the current Clementine 12.0 release, the node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification. For more information, see Bayesian Network Node in Chapter 7 in Clementine 12.0 Modeling Nodes.
Cox Regression node. Cox Regression is a form of survival analysis that models time-to-event
data; for example, the time until a customer churns or the time to failure of a mechanical part. The main advantage of Cox Regression is that it correctly handles censored cases; that is, records
for which the terminal event has not occurred. This is important because when examining real data, there will always be customers who have not churned or parts that have not failed. Ignoring them will give you a very inaccurate idea of survival times. In addition, Cox Regression enables you to model the effect of external factors on survival time (for example, by building a model of length of employment based on educational level and job category). For more information, see Cox Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
SVM (Support Vector Machines) node. SVM is a classication and regression technique that
maximizes the predictive accuracy of a model without overfitting the data. SVM is very good at extracting information from wide datasets (for example, those with a very large number of predictor fields, such as 10,000 or more). It can be useful in many fields, including customer relationship management (CRM), image and speech recognition, and text mining concept extraction. For more information, see SVM Node in Chapter 15 in Clementine 12.0 Modeling Nodes.
Other Analytical Enhancements
Figure 2-3 RFM aggregate node
RFM aggregation. The RFM Aggregate node takes customers' historical transactional data, strips
away any unused data, and aggregates transactions as input to an RFM Analysis node. For more information, see RFM Aggregate Node in Chapter 3 in Clementine 12.0 Source, Process, and Output Nodes.
RFM analysis. Recency, Frequency, Monetary analysis is a marketing technique used to determine
quantitatively which customers are the best ones by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends over all transactions (monetary). RFM analysis is based on the idea that past behavior is the best predictor of future action. For more information, see RFM Analysis Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
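The aggregation that precedes RFM analysis can be sketched in a few lines of Python: each customer's transactions are collapsed into a single row holding recency, frequency, and monetary value. This is a minimal illustration under stated assumptions, not the RFM Aggregate node's implementation; the function name rfm_aggregate, the field names, and the sample transactions are all hypothetical.

```python
from datetime import date

def rfm_aggregate(transactions, as_of):
    """Collapse (customer_id, purchase_date, amount) rows to one row per customer."""
    summary = {}
    for customer_id, purchase_date, amount in transactions:
        row = summary.setdefault(
            customer_id,
            {"last_purchase": purchase_date, "frequency": 0, "monetary": 0.0},
        )
        row["last_purchase"] = max(row["last_purchase"], purchase_date)
        row["frequency"] += 1
        row["monetary"] += amount
    # Express recency as the number of days between the most recent
    # purchase and the reference date.
    return {
        customer_id: {
            "recency_days": (as_of - row["last_purchase"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer_id, row in summary.items()
    }

# Hypothetical transaction history.
transactions = [
    ("C1", date(2008, 3, 1), 20.0),
    ("C2", date(2008, 3, 11), 75.0),
    ("C1", date(2008, 3, 21), 35.0),
]
result = rfm_aggregate(transactions, as_of=date(2008, 3, 31))
print(result)
```

Measuring recency in days against a fixed reference date is a choice made for this sketch; other units or reference points would work equally well.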
Propensity scores. For models that produce binary (yes or no) predictions, you can now request propensity scores in addition to the standard prediction and confidence fields. Propensity indicates the probability of a given outcome; for example, the probability that a direct mail recipient will respond to a campaign, or an existing customer will churn. Propensity scores provide this information in a format that can be more easily compared across models. For more information, see Modeling Node Analyze Options in Chapter 3 in Clementine 12.0 Modeling Nodes.
Variable importance chart. Typically, you will want to focus your modeling efforts on the variables that matter the most and consider dropping or ignoring those that matter the least. The variable importance chart helps to do this by indicating the relative importance of each variable in estimating the model. For more information, see Variable Importance in Chapter 3 in Clementine 12.0 Modeling Nodes.
Dimensions support. The Dimensions source node has been enhanced to import multiple response
variables as multiple flag fields. A new Dimensions export node has also been added. For more information, see Dimensions Source Node in Chapter 2 in Clementine 12.0 Source, Process, and Output Nodes.
Multiple response variables. Multiple response variables can be defined for use with the new
Custom Tables node. Multiple response variables imported from Dimensions or SPSS are also preserved in Clementine. For more information, see Editing Multiple Response Sets in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
Clustered and stratified samples. Sampling is a critical step in many data mining problems. The
Sample node has been enhanced to support stratified sampling and clustered sampling and to provide greater control over random samples and other settings. For more information, see Sample Node in Chapter 3 in Clementine 12.0 Source, Process, and Output Nodes.
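As a rough illustration of what stratified sampling means, the Python sketch below draws the same fraction of records from each stratum so that every group stays represented in the sample. It is a generic sketch and does not reproduce the Sample node's options; the function name stratified_sample, the region field, and the sample records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction of records from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record from each stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 80 records in one region, 20 in another.
records = [{"id": i, "region": "north" if i < 80 else "south"} for i in range(100)]
sample = stratified_sample(records, lambda r: r["region"], fraction=0.1)
print(len(sample))
```

A plain random sample of 10 records could by chance contain no records from the smaller region; sampling within each stratum avoids that.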
Improved Visualization
Figure 2-5 Graphboard node: Basic tab
Graphboard node. The Graphboard node offers many different types of graphs in one single node.
You can choose the data fields you want to explore, and the Graphboard node will automatically filter out any graph types that will not work with the field choices. You can then select the graph you want to produce. More advanced options, such as overlay aesthetics, panels, and summary statistics are available on the Detailed tab. For more information, see Graphboard Node in Chapter 5 in Clementine 12.0 Source, Process, and Output Nodes.
Graph editing. A number of enhancements have been added, including reordering of categories
on graph axes and improved ability to draw bands and regions as well as mark elements for the purpose of generating Select, Balance, and Filter nodes.
Custom Table node. This node supports a wide range of options, including the ability to nest, stack,
or layer variables in multiple dimensions, to display summaries for multiple statistics, and to display multiple response sets. For more information, see Custom Table Node in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes.
Performance, Scalability, and Extensibility
SQL Optimization. The Distinct node and Sample node have been optimized for SQL Pushback.
Real-time scoring enhancements. Clementine Solution Publisher has been enhanced to improve its real-time scoring capability and to meet requirements for future releases of Predictive Applications.
Server cluster load balancing. Load balancing allows you to optimize performance by distributing processing tasks across multiple servers. Clementine provides support for server clustering through SPSS Coordinator of Processes (COP) available in SPSS Predictive Enterprise Services. The COP provides server management capabilities designed to optimize client-server communication and processing. For more information, see Load Balancing with Server Clusters in Appendix A in Clementine 12.0 Server Administration and Performance Guide.
CLEF. Enables you to create or integrate custom or third-party components into Clementine streams and optionally deploy the resulting streams to Clementine Solution Publisher or Predictive Applications. For more information, see Introduction to CLEF in Chapter 1 in Clementine 12.0 CLEF Developer's Guide.
Other Enhancements
Customizable palettes. You can now customize your palettes and add sub-palettes for more
convenient access. For more information, see Customizing the Nodes Palette in Chapter 12 on p. 194.
C5.0 Enhancements. The C5.0 node now supports a weight field. For more information, see C5.0
The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with real-world knowledge to establish the likelihood of occurrences. In the current Clementine 12.0 release, the node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification. For more information, see Bayesian Network Node in Chapter 7 in Clementine 12.0 Modeling Nodes.
The Cox regression node enables you to build a survival model for time-to-event data in the presence of censored records. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time (t) for given values of the predictor variables. For more information, see Cox Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
The Ensemble node combines two or more model nuggets to obtain more accurate predictions than can be gained from any one model. For more information, see Ensemble Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
The Recency, Frequency, Monetary (RFM) Aggregate node enables you to take customers' historical transactional data, strip away any unused data, and combine all of their remaining transaction data into a single row that lists when they last dealt with you, how many transactions they have made, and the total monetary value of those transactions. For more information, see RFM Aggregate Node in Chapter 3 in Clementine 12.0 Source, Process, and Output Nodes.
The Recency, Frequency, Monetary (RFM) Analysis node enables you to determine quantitatively which customers are likely to be the best ones by examining how recently they last purchased from you (recency), how often they purchased (frequency), and how much they spent over all transactions (monetary). For more information, see RFM Analysis Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
The Custom Table node supports a wide range of options, including the ability to nest, stack, or layer variables in multiple dimensions, to display summaries for multiple statistics, and to display multiple response sets. For more information, see Custom Table Node in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes.
The Dimensions export node outputs data in the format used by SPSS Dimensions market research software. The Dimensions Data Library must be installed to use this node. For more information, see Dimensions Export Node in Chapter 7 in Clementine 12.0 Source, Process, and Output Nodes.
Chapter 3
Clementine Overview
Getting Started
As a data mining application, Clementine offers a strategic approach to finding useful relationships in large datasets. In contrast to more traditional statistical methods, you do not necessarily need to know what you are looking for when you start. You can explore your data, fitting different models and investigating different relationships, until you find useful information. Following is a sample of the types of problems that data mining can help to solve.
Public sector. Governments around the world use data mining to explore massive data stores,
improve citizen relationships, detect occurrences of fraud, such as money laundering and tax evasion, detect crime and terrorist patterns, and enhance the expanding realm of e-government.
Figure 3-1 Fraud detection in Clementine
CRM. Customer relationship management can be improved thanks to smart classification of customer types and accurate predictions of churn. Clementine has successfully helped businesses attract and retain the most valuable customers in a variety of industries.
Figure 3-2 Customer value pyramid.
Web mining. With powerful sequencing and prediction algorithms, Clementine contains the
necessary tools to discover exactly what guests do at a Web site and deliver exactly the products or information they desire. From data preparation to modeling, the entire data mining process can be managed inside of Clementine.
Figure 3-3 Exploring web-site behavior using Clementine graphs
Drug discovery and bioinformatics. Data mining aids both pharmaceutical and genomics research
by analyzing the vast data stores resulting from increased lab automation. Clustering and classification models help generate leads from compound libraries, while sequence detection aids the discovery of patterns.
Figure 3-4 Visualization in the Microarray CAT
Starting Clementine
To start the application, choose Clementine 12.0 from the SPSS Inc program group on the Windows Start menu. The main window will appear after a few seconds.
Figure 3-5 Clementine main application window
command-prompt, window.
► To launch the Clementine Client interface in interactive mode, type the clementine command
for example:
clemb -server -hostname myserver -port 80 -username dminer -password 1234 -stream report.str -execute
The available arguments (flags) allow you to connect to a server, load streams, execute scripts, or specify other parameters as needed.
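When such a command is launched from another program rather than typed by hand, it is usually safer to assemble it as a list of arguments. The Python sketch below rebuilds the example command shown above using only the flags that appear in that example; the helper name build_clemb_command is hypothetical, and the sketch only prints the command instead of running it.

```python
import shlex

def build_clemb_command(hostname, port, username, password, stream):
    """Return the argument list for the example invocation shown above."""
    return [
        "clemb", "-server",
        "-hostname", hostname,
        "-port", str(port),
        "-username", username,
        "-password", password,
        "-stream", stream,
        "-execute",
    ]

# Values taken from the example command line.
args = build_clemb_command("myserver", 80, "dminer", "1234", "report.str")
print(shlex.join(args))
```

Passing the same list to subprocess.run(args) would start the program, assuming the executable is installed and on the search path of the machine in question.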
from the table. Click Add or Edit to add or edit a connection. For more information, see Adding and Editing the Clementine Server Connection on p. 19. Click Search to access a server or server cluster in the SPSS COP. For more information, see Searching for Servers in SPSS Predictive Enterprise Services on p. 20.
Figure 3-6 Server Login dialog box
Server table. This table contains the set of defined server connections. The table displays the default connection, server name, description, and port number. You can manually add a new connection, as well as select or search for an existing connection. To set a particular server as the default connection, select the check box in the Default column in the table for the connection.
User ID. Enter the user name with which to log on to the server.
Password. Enter the password associated with the specified user name.
Domain. Specify the domain used to log on to the server. A domain name is required only when the server computer is in a different Windows domain than the client computer.
Default data path. Specify a path used for data on the server computer. Click the ellipsis button (...)
► Enter the server connection details and click OK to save the connection and return to the Server
can be identified by an alphanumeric name (for example, myserver) or an IP address assigned to the server computer (for example, 202.123.456.78).
Port. Give the port number on which the server is listening. If the default does not work, ask
connection should be used. SSL is a commonly used protocol for securing data sent over a network. To use this feature, SSL must be enabled on the server hosting Clementine Server. If necessary, contact your local administrator for details. For more information, see Using SSL to Encrypt Data in Chapter 4 in Clementine 12.0 Server Administration and Performance Guide.
To Edit Server Connections
► From the Tools menu, choose Server Login. The Server Login dialog box opens.
► In this dialog box, select the connection you want to edit and then click Edit. The Server Login
on to SPSS Predictive Enterprise Services when you attempt to browse the SPSS COP, you will be prompted to do so. For more information, see Connecting to SPSS Predictive Enterprise Repository in Chapter 9 on p. 141.
► Select the server or server cluster from the list.
► Click OK to close the dialog box and add this connection to the table in the Server Login dialog box.
changes you made. Restarting the machine will also restart the service. All temp files will now be written to this new directory. Note: The most common error when you are attempting to do this is to use the wrong type of slashes. Because of Clementine's UNIX history, forward slashes are used.
► In the Target text box, add -noshare to the end of the string.
► In Windows Explorer, select:
  Tools > Folder Options...
► On the File Types tab, select the Clementine Stream option and click Advanced.
► In the Edit File Type dialog box, select Open with Clementine and click Edit.
Figure 3-11 Select the action
► In the Application used to perform action text box, add -noshare before the -stream argument.
Streams are created by drawing diagrams of data operations relevant to your business on the main canvas in the interface. Each operation is represented by an icon or node, and the nodes are linked together in a stream representing the flow of data through each operation. You can work with multiple streams at one time in Clementine, either in the same stream canvas or by opening a new stream canvas. During a session, streams are stored in the Streams manager, at the upper right of the Clementine window.
Nodes Palette
Most of the data and modeling tools in Clementine reside in the Nodes Palette, across the bottom of the window below the stream canvas. For example, the Record Ops palette tab contains nodes that you can use to perform operations on the data records, such as selecting, merging, and appending. To add nodes to the canvas, double-click icons from the Nodes Palette or drag and drop them onto the canvas. You then connect them to create a stream, representing the flow of data.
Figure 3-14 Record Ops tab on the nodes palette
Each palette tab contains a collection of related nodes used for different phases of stream operations, such as:
Sources. Nodes bring data into Clementine.
Record Ops. Nodes perform operations on data records, such as selecting, merging, and appending.
Field Ops. Nodes perform operations on data fields, such as filtering, deriving new fields, and
viewed in Clementine or sent directly to another application, such as SPSS or Excel. As you become more familiar with Clementine, you can customize the palette contents for your own use. For more information, see Customizing the Nodes Palette in Chapter 12 on p. 194. Located below the Nodes Palette, a report window provides feedback on the progress of various operations, such as when data are being read into the data stream. Also located below the Nodes Palette, a status window provides information on what the application is currently doing, as well as indications of when user feedback is required.
Clementine Managers
You can use the Streams tab to open, rename, save, and delete the streams created in a session.
Figure 3-15 Streams tab
The Outputs tab contains a variety of files, such as graphs and tables, produced by stream operations in Clementine. You can display, save, rename, and close the tables, graphs, and reports listed on this tab.
Figure 3-16 Outputs tab
The Models tab is the most powerful of the manager tabs. This tab contains all model nuggets, which are models generated in Clementine, for the current session. These models can be browsed directly from the Models tab or added to the stream in the canvas.
Clementine Projects
On the lower right side of the window is the projects tool, used to create and manage data mining projects. For more information, see Introduction to Projects in Chapter 11 on p. 170. There are two ways to view projects you create in Clementine: in the Classes view and the CRISP-DM view. The CRISP-DM tab provides a way to organize projects according to the Cross-Industry Standard Process for Data Mining, an industry-proven, nonproprietary methodology. For both experienced and first-time data miners, using the CRISP-DM tool will help you to better organize and communicate your efforts.
Figure 3-18 CRISP-DM view
The Classes tab provides a way to organize your work in Clementine categorically, by the types of objects you create. This view is useful when taking inventory of data, streams, and models.
Clementine Toolbars
At the top of the Clementine window, you will find a toolbar of icons that provides a number of useful functions. Following are toolbar buttons and their functions:
Create new stream
Save stream
Open Clementine Application Templates (CATs)
Copy to clipboard
Open stream
Print current stream
Cut & move to clipboard
Paste selection
Undo last action
Edit stream properties
Execute stream selection
Add SuperNode
Zoom out (SuperNodes only)
Redo
Execute current stream
Stop stream (Active only during stream execution)
Zoom in (SuperNodes only)
As an alternative to closing the nodes palette and manager and project windows, you can use the stream canvas as a scrollable page by moving vertically and horizontally with the scrollbars at the side and bottom of the Clementine window.
context-sensitive menus, and access various other standard controls and options. Click and hold the button to move and drag nodes.
Double-click. Double-click using the left mouse button to place nodes on the stream canvas
stream canvas. Double-click the middle mouse button to disconnect a node. If you do not have a three-button mouse, you can simulate this feature by pressing the Alt key while clicking and dragging the mouse.
Shortcut Key    Function
Ctrl-A          Select all
Ctrl-X          Cut
Ctrl-N          New stream
Ctrl-O          Open stream
Ctrl-P          Print
Ctrl-C          Copy
Ctrl-V          Paste
Ctrl-Z          Undo
Ctrl-Q          Select all nodes downstream of the selected node
Function
Deselect all downstream nodes (toggles with Ctrl-Q)
Execute from selected node
Save current stream
Move selected nodes on the stream canvas in the direction of the arrow used
Open the context menu for the selected node
Shortcut Key    Function
Ctrl-Alt-D      Duplicate node
Ctrl-Alt-L      Load node
Ctrl-Alt-R      Rename node
Ctrl-Alt-U      Create User Input node
Ctrl-Alt-C      Toggle cache on/off
Ctrl-Alt-F      Flush cache
Ctrl-Alt-X      Expand SuperNode
Ctrl-Alt-Z      Zoom in/zoom out
Delete          Delete node or connection
Backspace       Delete node or connection
Printing
The following objects can be printed in Clementine:
Stream diagrams
Graphs
Tables
Reports (from the Report node and Project Reports)
Scripts (from the stream properties, Standalone Script, or SuperNode script dialog boxes)
Models (Model browsers, dialog box tabs with current focus, tree viewers)
Annotations (using the Annotations tab for output)
To print an object:
To print without previewing, click the Print button on the toolbar.
To set up the page before printing, select Page Setup from the File menu.
To preview before printing, select Print Preview from the File menu.
To view the standard print dialog box with options for selecting printers and specifying appearance options, select Print from the File menu.
Automating Clementine
Since advanced data mining can be a complex and sometimes lengthy process, Clementine includes several types of coding and automation support.
Clementine Language for Expression Manipulation (CLEM) is a language for analyzing and manipulating the data that flows along Clementine streams. Data miners use CLEM extensively in stream operations to perform tasks as simple as deriving profit from cost and revenue data or as complex as transforming Web-log data into a set of fields and records with usable information. For more information, see What Is CLEM? in Chapter 7 on p. 84.
Scripting is a powerful tool for automating processes in the user interface and working with objects in batch mode. Scripts can perform the same kinds of actions that users perform with a mouse or a keyboard. You can set options for nodes and perform derivations using a subset of CLEM. You can also specify output and manipulate generated models. For more information, see Scripting Overview in Chapter 2 in Clementine 12.0 Scripting and Automation Guide.
Batch mode enables you to use Clementine in a non-interactive manner by running Clementine with no visible user interface. Using scripts, you can specify stream and node operations as well as modeling parameters and deployment options. For more information, see Introduction to Batch Mode in Chapter 7 in Clementine 12.0 Scripting and Automation Guide.
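To make the simplest task mentioned above concrete (deriving profit from cost and revenue data), the following sketch shows the equivalent computation in Python rather than in CLEM. It is only an analogy for the kind of per-record derivation described; the field names, the sample records, and the function name derive_profit are hypothetical.

```python
def derive_profit(record):
    """Return a copy of the record with a derived profit field added."""
    return {**record, "profit": record["revenue"] - record["cost"]}

# Hypothetical records with revenue and cost fields.
records = [
    {"product": "A", "revenue": 120.0, "cost": 75.0},
    {"product": "B", "revenue": 80.0, "cost": 95.0},
]

derived = [derive_profit(r) for r in records]
for row in derived:
    print(row["product"], row["profit"])
```

The derivation is applied to each record independently and leaves the original fields untouched, which mirrors the idea of adding a new field to data as it passes through a stream.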
Chapter
Through a variety of techniques, data mining identifies nuggets of information in bodies of data. Data mining extracts information in such a way that it can be used in areas such as decision support, prediction, forecasts, and estimation. Data is often voluminous but of low value and with little direct usefulness in its raw form. It is the hidden information in the data that has value. In data mining, success comes from combining your (or your expert's) knowledge of the data with advanced, active analysis techniques in which the computer identifies the underlying relationships and features in the data. The process of data mining generates models from historical data that are later used for predictions, pattern detection, and more. The technique for building these models is called machine learning or modeling.
Modeling Techniques
Clementine includes a number of machine-learning and modeling technologies, which can be roughly grouped according to the types of problems they are intended to solve.
Predictive modeling methods include decision trees, neural networks, and statistical models.
Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. Clustering methods include Kohonen, k-means, and TwoStep.
Association rules associate a particular conclusion (such as the purchase of a particular product) with a set of conditions (the purchase of several other products).
Screening models can be used to screen data to locate fields and records that are most likely to be of interest in modeling and identify outliers that may not fit known patterns. Available methods include feature selection and anomaly detection.
Data Manipulation and Discovery
Clementine also includes many facilities that let you apply your expertise to the data:
Data manipulation. Constructs new data items derived from existing ones and breaks down the data into meaningful subsets. Data from a variety of sources can be merged and filtered.
Browsing and visualization. Displays aspects of the data using the Data Audit node to perform
an initial audit including graphs and statistics. Advanced visualization includes interactive graphics, which can be exported for inclusion in project reports.
Statistics. Confirms suspected relationships between variables in the data. Statistics from
Typically, you will use these facilities to identify a promising set of attributes in the data. These attributes can then be fed to the modeling techniques, which will attempt to identify underlying rules and relationships.
Typical Applications
hiring process.
Medical research. Create decision rules that suggest appropriate procedures based on medical
evidence.
Market analysis. Determine which variables, such as geography, price, and customer
product defects.
Policy studies. Use survey data to formulate policy by applying decision rules to select the most important variables.
Health care. User surveys and clinical data can be combined to discover variables that contribute to health.
Terminology
The terms attribute, field, and variable refer to a single data item common to all cases under consideration. A collection of attribute values that refers to a specific case is called a record, an example, or a case.
This may seem like an obvious question, but be aware that although data might be available, it may not be in a form that can be used easily. Clementine can import data from databases (via ODBC) or from files. The data, however, might be held in some other form on a machine that cannot be directly accessed. It will need to be downloaded or dumped in a suitable form before it can be used. It might be scattered among different databases and sources and need to be pulled together. It may not even be online. If it exists only on paper, data entry will be required before you can begin data mining.
Does the Data Cover the Relevant Attributes?
The object of data mining is to identify relevant attributes, so this may seem like an odd question. It is very useful, however, to look at what data is available and to try to identify the likely relevant factors that are not recorded. In trying to predict ice cream sales, for example, you may have a lot of information about retail outlets or sales history, but you may not have weather and temperature information, which is likely to play a significant role. Missing attributes don't necessarily mean that data mining will not produce useful results, but they can limit the accuracy of resulting predictions.
A quick way of assessing the situation is to perform a comprehensive audit of your data. Before moving on, consider attaching a Data Audit node to your data source and executing it to generate a full report. For more information, see Data Audit Node in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes.
Is the Data Noisy?
Data often contains errors or may contain subjective, and therefore variable, judgments. These phenomena are collectively referred to as noise. Sometimes noise in data is normal. There may well be underlying rules, but they may not hold for 100% of the cases. Typically, the more noise there is in data, the more difficult it is to get accurate results. However, Clementine's machine-learning methods are able to handle noisy data and have been used successfully on datasets containing almost 50% noise.
Is There Enough Data?
In data mining, it is not necessarily the size of a dataset that is important. The representativeness of the dataset is far more significant, together with its coverage of possible outcomes and combinations of variables. Typically, the more attributes that are considered, the more records that will be needed to give representative coverage. If the data is representative and there are general underlying rules, it may well be that a data sample of a few thousand (or even a few hundred) records will give equally good results as a million, and you will get the results more quickly.
Is Expertise on the Data Available?
In many cases, you will be working on your own data and will therefore be highly familiar with its content and meaning. However, if you are working on data for another department of your organization or for a client, it is highly desirable that you have access to experts who know the data. They can guide you in the identification of relevant attributes and can help to interpret the results of data mining, distinguishing the true nuggets of information from fool's gold, or artifacts caused by anomalies in the datasets.
understanding includes determining business objectives, assessing the situation, determining data mining goals, and producing a project plan.
Data understanding. Data provides the raw materials of data mining. This phase addresses
the need to understand what your data resources are and the characteristics of those resources. It includes collecting initial data, describing data, exploring data, and verifying data quality. The Data Audit node available from the Output nodes palette is an indispensable tool for data understanding.
Data preparation. After cataloging your data resources, you will need to prepare your data for
mining. Preparations include selecting, cleaning, constructing, integrating, and formatting data.
Modeling. This is, of course, the flashy part of data mining, where sophisticated analysis
methods are used to extract information from the data. This phase involves selecting modeling techniques, generating test designs, and building and assessing models.
Evaluation. Once you have chosen your models, you are ready to evaluate how the data mining
results can help you to achieve your business objectives. Elements of this phase include evaluating results, reviewing the data mining process, and determining the next steps.
Deployment. Now that you've invested all of this effort, it's time to reap the benefits. This phase focuses on integrating your new knowledge into your everyday business processes to solve your original business problem. This phase includes plan deployment, monitoring and maintenance, producing a final report, and reviewing the project.
There are some key points in this process model. First, while there is a general tendency for the process to flow through the steps in the order outlined above, there are also a number of places where the phases influence each other in a nonlinear way. For example, data preparation usually precedes modeling. However, decisions made and information gathered during the modeling phase can often lead you to rethink parts of the data preparation phase, which can then present new modeling issues. The two phases feed back on each other until both phases have been resolved
adequately. Similarly, the evaluation phase can lead you to reevaluate your original business understanding, and you may decide that you've been trying to answer the wrong question. At this point, you can revise your business understanding and proceed through the rest of the process again with a better target in mind.
The second key point is the iterative nature of data mining. You will rarely, if ever, simply plan a data mining project, execute it, and then pack up your data and go home. Data mining to address your customers' demands is an ongoing endeavor. The knowledge gained from one cycle of data mining will almost invariably lead to new questions, new issues, and new opportunities to identify and meet your customers' needs. Those new questions, issues, and opportunities can usually be addressed by mining your data once again. This process of mining and identifying new opportunities should become part of the way you think about your business and a cornerstone of your overall business strategy.
This introduction provides only a brief overview of the CRISP-DM process model. For complete details on the model, consult any of the following resources:
The CRISP-DM Guide, which can be accessed along with other documentation from the program group on the Windows Start menu.
The CRISP-DM Help system, available from the Start menu or by choosing Help on CRISP-DM from the Help menu in Clementine.
Data Mining with Confidence, published by SPSS Inc. This guide is available from the SPSS online bookstore.
Types of Models
Clementine offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics. The methods available on the Modeling palette allow you to derive new information from your data and to develop predictive models. Each method has certain strengths and is best suited for particular types of problems.
The Clementine Applications Guide provides examples for many of these methods, along with a general introduction to the modeling process. This guide is available as an online tutorial, and also in PDF format. For more information, see Application Examples in Chapter 1 on p. 4.
Modeling nodes are packaged into Base, Classification, Association, and Segmentation modules as described below.
Base Module
The Clementine Base module includes a selection of the most commonly used analytical nodes to allow customers to get started with data mining. A broad range of modeling techniques are supported, including classification (decision trees), segmentation or clustering, association, and statistical methods. More specialized analytical modules are also available as add-ons to the Base module; these are the Classification, Segmentation, and Association modules, as summarized below. For more information, see http://www.spss.com/clementine/.
The following nodes, sorted according to modeling type, are included in the Base module:
Classification models
The Classification and Regression (C&R) Tree node generates a decision tree that allows you to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups). For more information, see C&R Tree Node in Chapter 6 in Clementine 12.0 Modeling Nodes.
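To make the idea of minimizing impurity with binary splits more concrete, here is a minimal Python sketch. It assumes Gini impurity as the impurity measure and a single numeric predictor; both are choices made for the example rather than a description of how the C&R Tree node is implemented, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label is the same (a 'pure' node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Try each threshold on one numeric predictor and return
    (threshold, weighted impurity) for the best two-way split.
    The threshold is None if no split improves on the unsplit node."""
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best
```

A tree builder would apply this search to every candidate predictor, keep the best split, and then repeat the process on each of the two resulting subgroups.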
The QUEST node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor predictors that allow more splits. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary. For more information, see QUEST Node in Chapter 6 in Clementine 12.0 Modeling Nodes.
The CHAID node generates decision trees using chi-square statistics to identify optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate nonbinary trees, meaning that some splits have more than two branches. Target and predictor fields can be range or categorical. Exhaustive CHAID is a modification of CHAID that does a more thorough job of examining all possible splits but takes longer to compute. For more information, see CHAID Node in Chapter 6 in Clementine 12.0 Modeling Nodes.
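The chi-square statistic behind this kind of split selection can be sketched in a few lines of Python. The example below only computes the Pearson chi-square statistic for a table of predictor categories against target categories; the rest of the CHAID procedure is not shown, and the function name and sample counts are hypothetical. It assumes that no row or column of the table is entirely zero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    `table` is a list of rows; each row holds the observed counts of the
    target categories for one category (branch) of the predictor.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical three-branch split: rows are regions, columns are (buy, no buy).
by_region = [[30, 10], [20, 20], [5, 35]]
region_statistic = chi_square(by_region)
```

A larger statistic indicates a stronger association between the predictor and the target, so a predictor with a high value is a more promising candidate for a split. Note that the table can have more than two rows, which corresponds to a split with more than two branches.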
The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side in order to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome. For more information, see Decision List in Chapter 9 in Clementine 12.0 Modeling Nodes.
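The "first matching rule wins" behavior can be illustrated with a short Python sketch. The rules, field names, and outcome labels below are invented for the example; only the ordering logic reflects the description above.

```python
def apply_decision_list(rules, record, default):
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome
    return default  # no rule matched


# Hypothetical rules: each is a (condition, outcome) pair, checked in order.
rules = [
    (lambda r: r["complaints"] > 2, "likely to churn"),
    (lambda r: r["tenure_months"] >= 24, "unlikely to churn"),
]

# This record satisfies both conditions, but the first rule determines the outcome.
customer = {"complaints": 3, "tenure_months": 30}
outcome = apply_decision_list(rules, customer, default="remainder")
```

Because order matters, swapping the two rules would change the outcome for this customer.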
Linear regression is a common statistical technique for summarizing data and making predictions by fitting a straight line or surface that minimizes the discrepancies between predicted and actual output values. For more information, see Regression Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
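For the single-predictor case, fitting a straight line by least squares can be sketched as follows. This is the standard textbook calculation written in plain Python, offered as an illustration only; the function name and data are hypothetical, and the sketch assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # prediction for a new input value
```

The fitted line minimizes the sum of squared differences between the predicted and actual y values for the training points.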
The Factor/PCA node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields. For both approaches, the goal is to find a small number of derived fields that effectively summarizes the information in the original set of fields. For more information, see PCA/Factor Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
Segmentation models
The K-Means node clusters the dataset into distinct groups (or clusters). The method defines a fixed number of clusters, iteratively assigns records to clusters, and adjusts the cluster centers until further refinement can no longer improve the model. Instead of trying to predict an outcome, k-means uses a process known as unsupervised learning to uncover patterns in the set of input fields. For more information, see K-Means Node in Chapter 11 in Clementine 12.0 Modeling Nodes.
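The assign-and-adjust loop described above can be sketched in Python for one-dimensional data. The starting centers, the data, and the function name are assumptions made for the example; real data would usually have several input fields and a distance measure across all of them.

```python
def k_means(points, centers, max_iter=100):
    """Cluster 1-D points around a fixed number of centers.

    Repeats two steps until the centers stop moving:
    assign each point to its nearest center, then move each
    center to the mean of the points assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep an empty cluster's center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # no further refinement is possible
        centers = new_centers
    return centers, clusters


final_centers, final_clusters = k_means(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[0.0, 5.0]
)
```

No target field is involved at any point, which is what makes this an unsupervised method.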
Association models
The Generalized Rule Induction (GRI) node discovers association rules in the data. For example, customers who purchase razors and aftershave lotion are also likely to purchase shaving cream. GRI extracts rules with the highest information content based on an index that takes both the generality (support) and accuracy (confidence) of rules into account. GRI can handle numeric and categorical inputs, but the target must be categorical. For more information, see GRI Node in Chapter 12 in Clementine 12.0 Modeling Nodes.
Classification Module
The Classification module helps organizations to predict a known result, such as whether a customer will buy or leave or whether a transaction fits a known pattern of fraud. Modeling techniques include machine learning (neural networks), decision trees (rule induction), subgroup identification, statistical methods, and multiple model generation. The following nodes are included:
The Binary Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or don't, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results. The node generates a set of models based on the specified options and ranks the best candidates according to the criteria you specify. For more information, see Binary Classifier Node in Chapter 5 in Clementine 12.0 Modeling Nodes.
The Numeric Predictor node estimates and compares models for continuous numeric range outcomes using a number of different methods. The node works in the same manner as the Binary Classifier node, allowing you to choose the algorithms to use and to experiment with multiple combinations of options in a single modeling pass. Supported algorithms include neural networks, C&R Tree, CHAID, linear regression, generalized linear regression, and support vector machines (SVM). Models can be compared based on correlation, relative error, or number of variables used. For more information, see Numeric Predictor Node in Chapter 5 in Clementine 12.0 Modeling Nodes.
The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply. For more information, see Neural Net Node in Chapter 8 in Clementine 12.0 Modeling Nodes.
The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed. For more information, see C5.0 Node in Chapter 6 in Clementine 12.0 Modeling Nodes.
The Feature Selection node screens predictor fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining predictors relative to a specified target. For example, given a dataset with hundreds of potential predictors, which are most likely to be useful in modeling patient outcomes? For more information, see Feature Selection Node in Chapter 4 in Clementine 12.0 Modeling Nodes.
Discriminant analysis makes more stringent assumptions than logistic regression but can be a valuable alternative or supplement to a logistic regression analysis when those assumptions are met. For more information, see Discriminant Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range. For more information, see Logistic Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers the functionality of a wide number of statistical models, including linear regression, logistic regression, loglinear models for count data, and interval-censored survival models. For more information, see GenLin Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
The Bayesian Network node enables you to build a probability model by combining observed and recorded evidence with real-world knowledge to establish the likelihood of occurrences. In the current Clementine 12.0 release, the node focuses on Tree Augmented Naïve Bayes (TAN) and Markov Blanket networks that are primarily used for classification. For more information, see Bayesian Network Node in Chapter 7 in Clementine 12.0 Modeling Nodes.
The Cox regression node enables you to build a survival model for time-to-event data in the presence of censored records. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time (t) for given values of the predictor variables. For more information, see Cox Node in Chapter 10 in Clementine 12.0 Modeling Nodes.
The Support Vector Machine (SVM) node enables you to classify data into one of two groups without overfitting. SVM works well with wide datasets, such as those with a very large number of predictor fields. For more information, see SVM Node in Chapter 15 in Clementine 12.0 Modeling Nodes.
The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or small number of new cases, can be used to reestimate the model without having to retrain the model using all data. For more information, see SLRM Node in Chapter 14 in Clementine 12.0 Modeling Nodes.
The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecasts of future performance. A Time Series node must always be preceded by a Time Intervals node. For more information, see Time Series Modeling Node in Chapter 13 in Clementine 12.0 Modeling Nodes.
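Among the nodes listed in this module, the C5.0 node is described as splitting on the field that provides the maximum information gain. The following Python sketch shows one common way to compute information gain for a categorical field, using entropy measured in bits. It illustrates the general concept only and is not a description of the C5.0 algorithm itself; the field names and records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, field, target):
    """Reduction in target entropy obtained by splitting rows on `field`.

    A categorical field produces one subgroup per distinct value,
    so a split can have more than two branches.
    """
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical records: "plan" separates the outcomes, "region" does not.
rows = [
    {"plan": "a", "region": "north", "churn": "yes"},
    {"plan": "a", "region": "south", "churn": "yes"},
    {"plan": "b", "region": "north", "churn": "no"},
    {"plan": "b", "region": "south", "churn": "no"},
]
best_field = max(["plan", "region"], key=lambda f: information_gain(rows, f, "churn"))
```

In this small example the plan field would be chosen for the first split because it removes all uncertainty about the target, whereas region removes none.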
Segmentation Module
The Segmentation module is recommended in cases where the specific result is unknown (for example, when identifying new patterns of fraud, or when identifying groups of interest in your customer base). Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics, and it distinguishes clustering models from the other modeling techniques in that there is no predefined output or target field for the model to predict. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings. Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses (for example, by segmenting potential customers into homogeneous subgroups). The following nodes are included:
The Kohonen node generates a type of neural network that can be used to cluster the dataset into distinct groups. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters. For more information, see Kohonen Node in Chapter 11 in Clementine 12.0 Modeling Nodes.
The TwoStep node uses a two-step clustering method. The first step makes a single pass through the data to compress the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters. TwoStep has the advantage of automatically estimating the optimal number of clusters for the training data. It can handle mixed field types and large datasets efficiently. For more information, see TwoStep Cluster Node in Chapter 11 in Clementine 12.0 Modeling Nodes.
The Anomaly Detection node identifies unusual cases, or outliers, that do not conform to patterns of normal data. With this node, it is possible to identify outliers even if they do not fit any previously known patterns and even if you are not exactly sure what you are looking for. For more information, see Anomaly Detection Node in Chapter 4 in Clementine 12.0 Modeling Nodes.
Association Module
The Association module is most useful when predicting multiple outcomes; for example, customers who bought product X also bought Y and Z. Association models associate a particular conclusion (such as the decision to buy something) with a set of conditions. Association rule algorithms automatically find the associations that you could find manually using visualization techniques, such as the Web node. The advantage of association rule algorithms over the more
standard decision tree algorithms (C5.0 and C&RT) is that associations can exist between any of the attributes. A decision tree algorithm will build rules with only a single conclusion, whereas association algorithms attempt to find many rules, each of which may have a different conclusion. The following nodes are included:
The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently. For large problems, Apriori is generally faster to train than GRI; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data. For more information, see Apriori Node in Chapter 12 in Clementine 12.0 Modeling Nodes.
The CARMA model extracts a set of rules from the data without requiring you to specify In (predictor) or Out (target) fields. In contrast to Apriori and GRI, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support. This means that the rules generated can be used for a wider variety of applications; for example, to find a list of products or services (antecedents) whose consequent is the item that you want to promote this holiday season. For more information, see CARMA Node in Chapter 12 in Clementine 12.0 Modeling Nodes.
The Sequence node discovers association rules in sequential or time-oriented data. A sequence is a list of item sets that tends to occur in a predictable order. For example, a customer who purchases a razor and aftershave lotion may purchase shaving cream the next time he shops. The Sequence node is based on the CARMA association rules algorithm, which uses an efficient two-pass method for finding sequences. For more information, see Sequence Node in Chapter 12 in Clementine 12.0 Modeling Nodes.
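The difference between antecedent support and rule support, together with confidence, can be made concrete with a small Python sketch. The baskets and the function name are hypothetical, and the calculation is a plain illustration of the three measures rather than any node's actual algorithm.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Measures for the rule: antecedent items -> consequent items.

    antecedent_support: share of transactions containing the antecedent
    rule_support:       share containing both antecedent and consequent
    confidence:         share of antecedent transactions that also
                        contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return {
        "antecedent_support": len(with_antecedent) / n,
        "rule_support": len(with_both) / n,
        "confidence": len(with_both) / len(with_antecedent) if with_antecedent else 0.0,
    }


# Hypothetical shopping baskets.
baskets = [
    {"razor", "aftershave", "shaving cream"},
    {"razor", "aftershave"},
    {"razor", "shaving cream"},
    {"bread"},
]
metrics = rule_metrics(baskets, {"razor", "aftershave"}, {"shaving cream"})
```

Here two of the four baskets contain the antecedent but only one contains the whole rule, so rule support is lower than antecedent support. A threshold set on one measure therefore does not filter rules in the same way as a threshold set on the other.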
Detailed documentation on the modeling algorithms is also available. For more information, see the Clementine Algorithms Guide, available on the product CD. Note: After installing Clementine Client, you can access this document from the Clementine 12.0 program group (under the SPSS Inc program group) on the Windows Start menu.
Sample data that allows streams to be executed for illustrative purposes without modification. CAT data is supplied in flat files to avoid dependence on a database system. In the case of CATs that use the SPSS Reference Data model, data is delivered in tables preconfigured for a SQL Server database.
A user's guide that explains the application, the approach and structure used in the stream library, the purpose and use of each stream, and how to apply the streams to new data.
Opening CAT Streams
Once you have purchased and installed a CAT, you can access the streams or project files easily by using options from the File menu or a CAT toolbar button.
On the toolbar at the top of the window, click the CAT button.
Figure 4-2 Toolbar button used to open the template library
Alternatively, from the File menu, choose: Template Library
Using the dialog box, which opens to the templates directory, select the CAT stream you want to open. You can either choose to open it as a separate stream or insert the CAT stream into the currently open stream.
If the CAT you purchased ships with a project (.cpj) file, you can use this project file to open and organize the many CAT streams. To open a CAT project:
From the File menu, choose: Project > Open Project
In the dialog box that appears, navigate to the application template directory using the diamond drop-down list.
Alternatively, you can navigate manually to the directory. By default, CATs are installed in an \STL subfolder under your installation directory, for example C:\Program Files\Clementine\ 12.0\STL\. For more information on CATs, see the documentation available on the CAT CD.
Chapter
Building Streams
Stream-Building Overview
Data mining using Clementine focuses on the process of running data through a series of nodes, referred to as a stream. This series of nodes represents operations to be performed on the data, while links between the nodes indicate the direction of data flow. Typically, you use a data stream to read data into Clementine, run it through a series of manipulations, and then send it to a destination, such as an SPSS file or the Clementine Solution Publisher. For example, suppose that you want to open a data source, add a new field, select records based on values in the new field, and then display the results in a table. In this case, your data stream would consist of four nodes:
A Variable File node, which you set up to read the data from the data source.
A Derive node, which you use to add the new, calculated eld to the dataset.
A Select node, which you use to set up selection criteria to exclude records from the data stream.
A Table node, which you use to display the results of your manipulations onscreen.
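The four-node stream described above can be mimicked in Python as a chain of small functions, each passing records to the next. This is only an analogy for how data flows through a stream; the function names, the sample data, and the profit calculation are assumptions made for the example and are not part of Clementine.

```python
import csv
import io


def variable_file(text):
    """Source step: read delimited text and yield one record (dict) per row."""
    yield from csv.DictReader(io.StringIO(text))


def derive(records, name, func):
    """Derive step: add a new calculated field to every record."""
    for record in records:
        yield {**record, name: func(record)}


def select(records, condition):
    """Select step: pass on only the records that meet the condition."""
    for record in records:
        if condition(record):
            yield record


def table(records):
    """Terminal step: collect the results for display."""
    return list(records)


# Hypothetical data source with three records.
data = "id,revenue,cost\n1,120,80\n2,75,90\n3,200,150\n"

source = variable_file(data)
with_profit = derive(source, "profit", lambda r: float(r["revenue"]) - float(r["cost"]))
profitable = select(with_profit, lambda r: r["profit"] > 0)
result = table(profitable)
```

As in a stream, nothing is processed until the terminal step runs, and each record moves through the steps in the order in which they are connected.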
This section contains more detailed information on working with nodes to create more complex data streams. It also discusses options and settings for nodes and streams. For step-by-step examples of stream building using the data shipped with Clementine (in the Demos folder of your program installation), see Chapter 1.
To remove a node from the data stream, click it and press the Delete key, or right-click and choose Delete from the context menu.
The simplest way to form a stream is to double-click nodes on the palette. This method automatically connects the new node to the selected node on the stream canvas. For example, if the canvas contains a Database node, you can select this node and then double-click the next node from the palette, such as a Derive node. This action automatically connects the Derive node to the existing Database node. You can repeat this process until you have reached a terminal node, such as a Histogram or Publisher node, at which point any new nodes will be connected to the last non-terminal node upstream.
Figure 5-2 Stream created by double-clicking nodes from the palettes
On the stream canvas, you can click and drag from one node to another using the middle mouse button. (If your mouse does not have a middle button, you can simulate this by pressing the Alt key while dragging with the mouse from one node to another.)
Figure 5-3 Using the middle mouse button to connect nodes
If you do not have a middle mouse button and prefer to manually connect nodes, you can use the context menu for a node to connect it to another node already on the canvas.
E Select a node and right-click to open the context menu.
E From the menu, choose Connect.
E A connection icon will appear both on the start node and the cursor. Click on a second node on
When connecting nodes, there are several guidelines to follow. You will receive an error message if you attempt to make any of the following types of connections:
A connection leading to a source node
A connection leading from a terminal node
A node having more than its maximum number of input connections
Connecting two nodes that are already connected
Circularity (data returns to a node from which it has already flowed)
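The connection guidelines above amount to a set of checks on a directed graph. The following Python sketch illustrates one way such checks could be expressed; the Stream class, the node kinds, the input limits, and the error wording are assumptions made for illustration and do not describe how Clementine is implemented.

```python
from collections import defaultdict


class Stream:
    """Toy stream graph; node kinds and input limits are illustrative assumptions."""

    def __init__(self):
        self.kind = {}                  # node name -> "source" | "process" | "terminal"
        self.max_inputs = {}            # node name -> maximum number of input connections
        self.links = defaultdict(set)   # node name -> set of downstream node names

    def add_node(self, name, kind, max_inputs=1):
        self.kind[name] = kind
        self.max_inputs[name] = 0 if kind == "source" else max_inputs

    def _reaches(self, start, target):
        """Return True if target can be reached by following links from start."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.links[node])
        return False

    def connect(self, src, dst):
        """Add a link from src to dst, or raise ValueError if a guideline is broken."""
        if self.kind[dst] == "source":
            raise ValueError("connection leads to a source node")
        if self.kind[src] == "terminal":
            raise ValueError("connection leads from a terminal node")
        if dst in self.links[src]:
            raise ValueError("nodes are already connected")
        inputs = sum(1 for outs in self.links.values() if dst in outs)
        if inputs >= self.max_inputs[dst]:
            raise ValueError("too many input connections")
        if self._reaches(dst, src):
            raise ValueError("connection would create a circularity")
        self.links[src].add(dst)


def try_connect(stream, src, dst):
    """Attempt a connection and return the error message, or None on success."""
    try:
        stream.connect(src, dst)
    except ValueError as err:
        return str(err)
    return None


stream = Stream()
stream.add_node("database", "source")
stream.add_node("derive", "process")
stream.add_node("merge", "process", max_inputs=2)
stream.add_node("table", "terminal")
stream.connect("database", "derive")
stream.connect("derive", "merge")
stream.connect("merge", "table")
print(try_connect(stream, "table", "merge"))
```

The circularity check walks downstream from the proposed destination; if it can already reach the proposed start node, the new link would close a loop.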
To Bypass a Node
E On the stream canvas, use the middle mouse button to double-click the node that you want to bypass. Alternatively, you can use Alt-double-click.
Note: You can undo this action by choosing Undo from the Edit menu or by pressing Ctrl-Z.
E With the middle mouse button, click and drag the connection arrow into which you want to insert the node. Alternatively, you can hold down the Alt key while clicking and dragging to simulate a middle mouse button.
Figure 5-8 New stream
E Drag the connection to the node that you want to include and release the mouse button.
Note: You can remove new connections from the node and restore the original by bypassing the node.
Choose Edit to open the dialog box for the selected node.
Choose Connect to manually connect one node to another.
Choose Disconnect to delete all links to and from the node.
Choose Rename and Annotate to open the Annotations tab of the editing dialog box.
Choose Cut or Delete to remove the selected node(s) from the stream canvas. Note: Choosing Cut allows you to paste nodes, while Delete does not.
Choose Copy to make a copy of the node with no connections. This can be added to a new or existing stream.
Choose Load Node to open a previously saved node and load its options into the currently selected node. Note: The nodes must be of identical type.
Choose Save Node to save the node's details in a file. You can load node details only into another node of the same type.
Choose Cache to expand the menu, with options for caching the selected node.
Choose Data Mapping to expand the menu, with options for mapping data to a new source or specifying mandatory fields.
Choose Create SuperNode to expand the menu, with options for creating a SuperNode in the current stream. For more information, see Creating SuperNodes in Chapter 8 in Clementine 12.0 Source, Process, and Output Nodes.
Choose Generate User Input Node to replace the selected node. Examples generated by this node will have the same fields as the current node. For more information, see User Input Node in Chapter 2 in Clementine 12.0 Source, Process, and Output Nodes.
Choose Execute From Here to execute all terminal nodes downstream from the selected node.
To Enable a Cache
E On the stream canvas, right-click the node and choose Cache from the context menu.
E From the caching submenu, choose Enable.
E You can turn the cache off by right-clicking the node and choosing Disable from the caching submenu.
For streams executed in the database, data can be cached midstream to a temporary table in the database rather than the file system. When combined with SQL optimization, this may result in significant gains in performance. For example, the output from a stream that merges multiple tables to create a data mining view may be cached and reused as needed. By automatically generating SQL for all downstream nodes, performance can be further improved.
To take advantage of database caching, both SQL optimization and database caching must be enabled. Note that Server optimization settings override those on the Client. For more information, see Setting Optimization Options in Chapter 12 on p. 191.
With database caching enabled, simply right-click on any nonterminal node to cache data at that point, and the cache will be created automatically directly in the database the next time the stream is executed. If database caching or SQL optimization is not enabled, the cache will be written to the file system instead.
Note: The following databases support temporary tables for the purpose of caching: DB2, Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database caching. The SQL code can be customized for specific databases by editing properties in the relevant configuration file, for example, C:\Program Files\SPSSInc\Clementine12.0\config\odbc-teradata-properties.cfg. For more information, see the comments in the default configuration file, odbc-properties.cfg, installed in the same folder.
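The rules above for where a cache is written can be summarized as a small decision function. The Python sketch below is only a restatement of those rules; the function name, its arguments, and the use of None to mean "the stream is not executed in a database" are assumptions made for illustration.

```python
# Databases listed in the text as supporting temporary tables for caching.
TEMP_TABLE_DATABASES = {"DB2", "Netezza", "Oracle", "SQL Server", "Teradata"}


def cache_target(sql_optimization, database_caching, database=None):
    """Return where a node cache would be written under the rules described above.

    database is the name of the database the stream executes in, or None
    (an assumption of this sketch) when the stream does not run in a database.
    """
    if database is None or not (sql_optimization and database_caching):
        return "file system"
    if database in TEMP_TABLE_DATABASES:
        return "temporary table"
    return "normal table"


print(cache_target(True, True, "Oracle"))
print(cache_target(True, False, "Oracle"))
```

Both settings must be enabled before the database is used at all; the choice between a temporary table and a normal table depends only on the database.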
To Flush a Cache
A white document icon on a node indicates that its cache is empty. When the cache is full, the document icon becomes solid green. If you want to replace the contents of the cache, you must first flush the cache and then reexecute the data stream to refill it.
E On the stream canvas, right-click the node and choose Cache from the context menu.
E From the caching submenu, choose Flush.
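The empty, full, and flushed states of a cache described above behave like simple memoization. The following Python sketch is an analogy only; the CachedNode class and its attribute names are hypothetical and are not part of Clementine.

```python
class CachedNode:
    """Toy node whose output is cached after the first execution."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the node's data
        self._cache = None        # None means the cache is empty
        self.executions = 0       # how many times the data was recomputed

    @property
    def cache_full(self):
        return self._cache is not None

    def execute(self):
        """Return the node's data, filling the cache if it is empty."""
        if self._cache is None:
            self.executions += 1
            self._cache = list(self._compute())
        return self._cache

    def flush(self):
        """Empty the cache so that the next execution refills it."""
        self._cache = None


node = CachedNode(lambda: [1, 2, 3])
node.execute()
node.execute()            # served from the cache, no recomputation
first_count = node.executions
node.flush()
node.execute()            # cache refilled after the flush
print(first_count, node.executions)
```

As in the procedure above, the cached contents are replaced only by flushing first and then executing again.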
To Save a Cache
You can save the contents of a cache as an SPSS data file (*.sav). You can then either reload the file as a cache, or you can set up a node that uses the cache file as its data source. You can also load a cache that you saved from another project.
E On the stream canvas, right-click the node and choose Cache from the context menu.
E From the caching submenu, choose Save Cache.
E In the Save Cache dialog box, browse to the location where you want to save the cache file.
E Enter a name in the File Name text box.
E Be sure that *.sav is selected in the Files of Type drop-down list, and click Save.
To Load a Cache
If you have saved a cache file before removing it from the node, you can reload it.
E On the stream canvas, right-click the node and choose Cache from the context menu.
E From the caching submenu, choose Load Cache.
E In the Load Cache dialog box, browse to the location of the cache file, select it, and click Load.
Editing a node opens a tabbed dialog box containing an Annotations tab used to set a variety of annotation options. You can also open the Annotations tab directly.
E To annotate a node, right-click on the node on the stream canvas and choose Rename and Annotate.
The editing dialog box opens with the Annotations tab visible.
E To annotate a stream, choose Stream Properties from the Tools menu. (Alternatively, you can right-click a stream in the managers window and choose Stream Properties.)
Figure 5-12 Annotations tab options
Name. Select Custom to adjust the autogenerated name or to create a unique name for the node as displayed on the stream canvas.
ToolTip text. Enter text used as a ToolTip on the stream canvas. This is particularly useful when working with a large number of similar nodes.
Keywords. Specify keywords to be used in project reports and when searching or tracking objects stored in the Predictive Enterprise Repository. (For more information, see SPSS Predictive Enterprise Repository in Chapter 9 on p. 140.) Multiple keywords can be separated by semicolons, for example, income; crop type; claim value. White spaces at the beginning and end of each keyword are trimmed, for example, income ; crop type will produce the same results as income;crop type. (White spaces within keywords are not trimmed, however. For example, crop type with one space and crop  type with two spaces are not the same.)
The main text window can be used to enter lengthy annotations regarding the operations of the node or decisions made in the node. For example, when you are sharing and reusing streams, it is helpful to take notes on decisions such as discarding a field with numerous blanks using a Filter node. Annotating the node stores this information with the node. You can also choose to include these annotations in a project report created with the projects tool. For more information, see Introduction to Projects in Chapter 11 on p. 170.
ID. Displays a unique ID that can be used to reference the node for the purpose of scripting or automation. This value is automatically generated when the node is created and will not change. Also note that to avoid confusion with the letter O, zeros are not used in node IDs. Use the copy button at the right to copy and paste the ID into scripts or elsewhere as needed. For more information, see Referencing Nodes in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
Note: When opening streams saved in releases of Clementine prior to 11.0, each node in the stream is assigned an ID as it loads. Until the stream is saved in Clementine 11.0 (or later), however, a different ID will be assigned each time the stream is reopened. To avoid this, save the stream in the current release before proceeding.
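The keyword rules described above (semicolon separators, trimming at the ends of each keyword, no trimming inside a keyword) can be illustrated with a short Python sketch. The function name is hypothetical, and dropping empty entries is an assumption; the text does not say how empty keywords are treated.

```python
def parse_keywords(text):
    """Split on semicolons and trim white space at the ends of each keyword.

    Empty entries are dropped (an assumption of this sketch). White space
    inside a keyword is left untouched.
    """
    return [kw.strip() for kw in text.split(";") if kw.strip()]


print(parse_keywords("income ; crop type"))
print(parse_keywords("income;crop type"))
```

Because only the ends are trimmed, a keyword with one internal space and the same keyword with two internal spaces remain different values.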
Figure 5-13 Streams tab in the managers tool with context menu options
From this tab, you can:
Access streams.
Save streams.
Save streams to the current project.
Close streams.
Open new streams.
Store and retrieve streams from the Predictive Enterprise Repository (if available at your site). For more information, see SPSS Predictive Enterprise Repository in Chapter 9 on p. 140.
Right-click a stream on the Streams tab to access these options.
Calculations in. Select Radians or Degrees as the unit of measurement to be used in trigonometric
CLEM expressions.
Import date/time as. Select whether to use date/time storage for date/time fields or whether to import them as string variables.
Date format. Select a date format to be used for date storage fields or when strings are interpreted as dates by CLEM date functions.
Time format. Select a time format to be used for time storage fields or when strings are interpreted as times by CLEM time functions.
Rollover days/mins. For time formats, select whether negative time differences should be
of decimal places to be used when displaying or printing real numbers. This option is specified separately for each display format.
Decimal symbol. Select either a comma (,) or a period (.) as a decimal separator.
Grouping symbol. For number display formats, select the symbol used to group values (for example, the comma in 3,000.00). Options include none, period, comma, space, and locale-defined (in which case the default for the current locale is used).
Date baseline (1st Jan). Select the baseline years (always January 1) to be used by CLEM date functions that work with a single date.
2-digit dates start from. Specify the cutoff year to add century digits for years denoted with only two digits. For example, specifying 1930 as the cutoff year will roll over 05/11/02 to the year 2002. The same setting will use the 1900s for two-digit years of 30 or later, such as 05/11/73 (1973).
Encoding. Specify the encoding method used (Server default or UTF-8).
Maximum set size. Select to specify a maximum number of members for set fields after which the type of the field becomes typeless. This option is disabled by default, but it is useful when working with large set fields. Note: The direction of fields set to typeless is automatically set to none. This means that the fields are not available for modeling.
Limit set size for Neural, Kohonen, and K-Means modeling. Select to specify a maximum number of members for set fields used in neural nets, Kohonen nets, and K-Means modeling. The default set size is 20, after which the field is ignored and a warning is raised, providing information on the field in question.
Ruleset Evaluation. Determines how ruleset models are evaluated. By default, rulesets use Voting to combine predictions from individual rules and determine the final prediction. To ensure that rulesets use the first hit rule by default, select First Hit. For more information, see Rule Set Nodes in Chapter 6 in Clementine 12.0 Modeling Nodes. Note that this option does not apply to Decision List models, which always use the first hit as defined by the algorithm.
Refresh source nodes on execution. Select to automatically refresh all source nodes when executing the current stream. This action is analogous to clicking the Refresh button on a source node, except that this option automatically refreshes all source nodes (except User Input nodes) for the current stream.
Note: Selecting this option flushes the caches of downstream nodes even if the data haven't changed. Flushing occurs only once per execution, though, which means that you can still use downstream caches as temporary storage for a single execution. For example, say that you've set a cache midstream after a complex derive operation and that you have several graphs and reports attached downstream of this Derive node. When executing, the cache at the Derive node will be flushed and refilled but only for the first graph or report. Subsequent terminal nodes will read data from the Derive node cache.
Display field and value labels in output. Displays field and value labels in tables, charts, and other output. If labels don't exist, the field names and data values will be displayed instead. Labels are turned off by default; however, you can toggle labels on an individual basis elsewhere in Clementine. You can also choose to display labels on the output window using a toggle button available on the toolbar.
Figure 5-15 Toolbar icon used to toggle field and value labels
The options specified above apply only to the current stream. To set these options as the default for all streams, click Save As Default.
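The effect of the 2-digit dates start from option described earlier in this section can be illustrated with a short calculation. The Python sketch below assumes that two-digit years are mapped into the 100-year window beginning at the cutoff year, which is consistent with the 1930 example in the text; the function name is hypothetical.

```python
def expand_two_digit_year(yy, cutoff=1930):
    """Map a two-digit year into the 100-year window starting at cutoff.

    Assumes the window runs from cutoff to cutoff + 99 inclusive.
    """
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    century = cutoff - cutoff % 100   # for example, 1900 for a cutoff of 1930
    year = century + yy
    if year < cutoff:
        year += 100                   # roll over into the next century
    return year


print(expand_two_digit_year(2))    # the year in 05/11/02
print(expand_two_digit_year(73))   # the year in 05/11/73
```

With a cutoff of 1930, the two-digit years 30 through 99 fall in the 1900s and 00 through 29 fall in the 2000s.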
Stream canvas width. Specify the width of the stream canvas in pixels. Stream canvas height. Specify the height of the stream canvas in pixels. Stream scroll rate. Specify the scrolling rate for the stream canvas. Higher numbers specify a
nodes on the stream canvas using an invisible grid. The default grid cell size is 0.25.
Snap to Grid. Select to align icons to an invisible grid pattern (selected by default).
The options specified above apply only to the current stream. To set these options as the default for all streams, click Save As Default.
In addition to messages regarding stream operations, error messages are reported here. When stream execution is terminated because of an error, this dialog box will open to the Messages tab with the error message visible. Additionally, the node with errors is highlighted in red on the stream canvas.
If SQL optimization and logging options are enabled in the User Options dialog box, then information on generated SQL is also displayed. For more information, see Setting Optimization Options in Chapter 12 on p. 191. You can save messages reported here for a stream by choosing Save Messages from the save button drop-down list on the Messages tab. You can also clear all messages for a given stream by choosing Clear All Messages from the save button drop-down list.
Parameters can also be set for SuperNodes, in which case they are visible only to nodes encapsulated within that SuperNode. For more information, see Defining SuperNode Parameters in Chapter 8 in Clementine 12.0 Source, Process, and Output Nodes. For information on setting and using parameters in scripts, see Stream, Session, and SuperNode Parameters in Chapter 3.
To Set Stream and Session Parameters through the User Interface
E To set stream parameters, from the menus choose: Tools > Stream Properties > Parameters
E To set session parameters, choose Set Session Parameters from the Tools menu.
Figure 5-19 Setting parameters for streams
Name. Parameter names are listed here. You can create a new parameter by entering a name in this field. For example, to create a parameter for the minimum temperature, you could type minvalue. Do not include the $P- prefix that denotes a parameter in CLEM expressions. This name is
Storage. Select a storage type from the drop-down list. Storage indicates how the data values are stored in the parameter. For example, when working with values containing leading zeros that you want to preserve (such as 008), you should select String as the storage type. Otherwise, the zeros will be stripped from the value. Available storage types are string, integer, real, time, date, and timestamp. For date parameters, note that values must be specified using ISO standard notation as below.
Value. Lists the current value for each parameter. Adjust the parameter as desired. Note that for date parameters, values must be specified in ISO standard notation (that is, YYYY-MM-DD). Dates specified in other formats are not accepted.
Type (optional). If you plan to deploy the stream to an external application, select a usage type
from the drop-down list. Otherwise, it is advisable to leave the Type column as is. Note that long name, storage, and type options can be set for parameters through the user interface only. These options cannot be set using scripts. Click the arrows at the right to move the selected parameter further up or down the list of available parameters. Use the delete button (marked with an X) to remove the selected parameter.
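The remarks above about storage types (leading zeros survive only in string storage, and dates must be written as YYYY-MM-DD) can be illustrated with a small Python sketch. This is an analogy rather than Clementine behavior; the function name is hypothetical, and only four of the six storage types are covered.

```python
from datetime import date


def coerce_parameter(value, storage):
    """Convert a typed-in value according to the chosen storage type."""
    if storage == "string":
        return value                       # kept exactly as entered
    if storage == "integer":
        return int(value)                  # leading zeros are lost here
    if storage == "real":
        return float(value)
    if storage == "date":
        return date.fromisoformat(value)   # raises ValueError for other formats
    raise ValueError(f"unsupported storage type: {storage}")


print(coerce_parameter("008", "string"))
print(coerce_parameter("008", "integer"))
print(coerce_parameter("2008-01-31", "date"))
```

A value such as 31/01/2008 is rejected by the date branch, in the same spirit as the rule that dates in other formats are not accepted.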
Type. Displays the currently selected type. You can change the type to reflect the way that you intend to use the parameter in Clementine.
Storage. Displays the storage type if known. Storage types are unaffected by the usage type (range, set, flag) that you choose for work in Clementine. You can alter the storage type on the main Parameters tab.
The bottom half of the dialog box dynamically changes depending on the type selected above.
Range Types
Lower. Specify a lower limit for the parameter values.
Upper. Specify an upper limit for the parameter values.
Set Types
Values. This option allows you to specify values for a parameter that will be used as a set field. Values will not be coerced in the Clementine stream but will be used in a drop-down list for external deployment applications. Using the arrow and delete buttons, you can modify existing values as well as reorder or delete values.
Flag Types
True. Specify a flag value for the parameter when the condition is met.
False. Specify a flag value for the parameter when the condition is not met.
Figure 5-21 Viewing global values available for the stream
Globals available. Available globals are listed in this table. You cannot edit global values here,
but you can clear all global values for a stream using the Clear All Values button to the right of the table.
Executing Streams
Once you have specified the desired options for streams and connected the desired nodes, you can execute the stream by running the data through nodes in the stream. There are several ways to execute a stream within Clementine:
You can choose Execute from the Tools menu.
You can also execute your data streams by clicking one of the execute buttons on the toolbar. These buttons allow you to execute the entire stream or simply the selected terminal node. For more information, see Clementine Toolbars in Chapter 3 on p. 28.
You can execute a single data stream by right-clicking a terminal node and choosing Execute from the context menu.
You can execute part of a data stream by right-clicking any non-terminal node and choosing Execute From Here from the context menu, which executes all operations after the selected node.
To halt the execution of a stream in progress, you can click the red stop button on the toolbar or choose Stop Execution from the Tools menu. If any stream takes longer than three seconds to execute, the Execution Feedback dialog box is displayed to indicate the progress.
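The idea behind Execute From Here, described above, is to find everything downstream of the selected node. The following Python sketch shows one way to collect the terminal nodes reachable from a starting node; the links dictionary, the node names, and the rule that a node with no outgoing links counts as terminal are assumptions made for illustration.

```python
def terminals_downstream(links, start):
    """Return the terminal nodes reachable from start, in sorted order.

    links maps each node name to the list of nodes it feeds. A node with
    no outgoing links is treated as terminal (an assumption of this sketch).
    """
    found, seen, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        outputs = links.get(node, [])
        if not outputs and node != start:
            found.add(node)
        stack.extend(outputs)
    return sorted(found)


# Hypothetical stream: one source feeding a Type node, which branches.
links = {
    "source": ["type"],
    "type": ["derive", "table"],
    "derive": ["histogram", "model"],
}
print(terminals_downstream(links, "derive"))
print(terminals_downstream(links, "type"))
```

Starting from the Derive node, only the two branches below it are collected; the Table node attached to the Type node is not reached.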
Some nodes have further displays giving additional information about stream execution. These are displayed by selecting the corresponding row in the dialog box. The first row is selected automatically.
Clicking Save stores the stream with the extension *.str in the specified directory.
Automatic backup files. Each time a stream is saved, the previously saved version of the file is automatically preserved as a backup, with a hyphen appended to the filename (for example mystream.str-). To restore the backed-up version, simply delete the hyphen and reopen the file.
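The backup naming convention described above (the same filename with a hyphen appended) can be handled outside Clementine with ordinary file operations. The Python sketch below renames a backup over the current file; the function name is hypothetical, and overwriting the current file is an assumption of this sketch, so copy it elsewhere first if you want to keep it.

```python
import tempfile
from pathlib import Path


def restore_backup(stream_path):
    """Replace a stream file with its backup (same name plus a trailing hyphen)."""
    stream_path = Path(stream_path)
    backup = stream_path.with_name(stream_path.name + "-")
    if not backup.exists():
        raise FileNotFoundError(backup)
    backup.replace(stream_path)   # overwrites the current file, if any
    return stream_path


# Demonstration in a temporary folder with placeholder file contents.
with tempfile.TemporaryDirectory() as folder:
    current = Path(folder) / "mystream.str"
    current.write_text("current version")
    (Path(folder) / "mystream.str-").write_text("previous version")
    restore_backup(current)
    restored_text = current.read_text()
    backup_still_exists = (Path(folder) / "mystream.str-").exists()

print(restored_text)
```

After the rename, the file without the hyphen holds the previously saved version and can be reopened as usual.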
Saving States
In addition to streams, you can save states, which include the currently displayed stream diagram and any model nuggets that you have created (listed on the Models tab in the managers window).
To Save a State
E From the File menu, choose: State > Save State or Save State As
E In the Save dialog box, browse to the folder in which you want to save the state file.
Clicking Save stores the state with the extension *.cst in the specified directory.
Saving Nodes
You can also save an individual node by right-clicking the node on the stream canvas and choosing Save Node from the context menu. Use the file extension *.nod.
E Simply select the check boxes for the objects that you want to save. E Click OK to save each object in the desired location.
You will then be prompted with a standard Save dialog box for each object. After you have finished saving, the application will close as originally instructed.
Saving Output
Tables, graphs, and reports generated from Clementine output nodes can be saved in output object (*.cou) format.
E When viewing the output you want to save, from the output window menus choose: File > Save
E Specify a name and location for the output file.
E Optionally, select Add file to project in the Save dialog box to include the file in the current project.
For more information, see Introduction to Projects in Chapter 11 on p. 170. Alternatively, you can right-click on any output object listed in the managers window and select Save from the context menu.
E Select Encrypt this file. E Optionally, for further security, select Mask password. This displays anything you enter as a
series of dots.
E Enter the password. Warning: If you forget the password, the file or model cannot be opened.
E If you selected Mask password, reenter the password to confirm that you entered it correctly.
E Click OK to return to the Save dialog box.
Note: If you save a copy of any encryption-protected item, the new item is automatically saved in an encrypted format using the original password unless you change the settings in the Encryption Options dialog box.
Loading Files
You can reload a number of saved objects in Clementine:
Streams (.str)
States (.cst)
Models (.gm)
Models palette (.gen)
Nodes (.nod)
Output (.cou)
Projects (.cpj)
Opening New Files
All other file types can be opened using the submenu items available on the File menu. For example, to load a model, from the File menu, choose:
Models > Open Model or Load Models Palette
When loading streams created with earlier versions of Clementine, some nodes may be out of date. In some cases, the nodes will be automatically updated, and in others you will need to convert them using a utility.
The Cache File node has been replaced by the SPSS Import node. Any Cache File nodes in streams that you load will be replaced by SPSS Import nodes.
The Build Rule node has been replaced by the C&R Tree node. Any Build Rule nodes in streams that you load will be replaced by C&R Tree nodes.
Opening Recently Used Files
For quick loading of recently used files, you can use the options at the bottom of the File menu.
Figure 5-25 Opening recently used options from the File menu
Select Recent Streams, Recent Projects, or Recent States to expand a list of recently used files.
node to replace; then, using the Replacement option from the context menu, select the node with which to replace it. This method is particularly suitable for mapping data to a template.
Map to. This method starts with the node to be introduced to the stream. First, select the node to introduce; then, using the Map option from the context menu, select the node to which it should join. This method is particularly useful for mapping to a terminal node. Note: You cannot map to Merge or Append nodes. Instead, you should simply connect the stream to the Merge node in the normal manner.
In contrast to earlier versions of Clementine, data mapping is now tightly integrated into stream building, and if you try to connect to a node that already has a connection, you will be offered the option of replacing the connection or mapping to that node.
for the template source node, choose Select Replacement Node. Then select the source node for the replacement data.
Step 4: Check mapped fields. In the dialog box that opens, check that the software is mapping fields properly from the replacement data source to the stream. Any unmapped essential fields are displayed in red. These fields are used in stream operations and must be replaced with a similar field in the new data source in order for downstream operations to function properly. For more information, see Examining Mapped Fields on p. 75.
After using the dialog box to ensure that all essential fields are properly mapped, the old data source is disconnected and the new data source is connected to the template stream using a Filter node called Map. This Filter node directs the actual mapping of fields in the stream. An Unmap Filter node is also included on the stream canvas. The Unmap Filter node can be used to reverse field name mapping by adding it to the stream. It will undo the mapped fields, but note that you will have to edit any downstream terminal nodes to reselect the fields and overlays.
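The roles of the mapping check, the Map filter, and the Unmap filter described above can be illustrated with dictionary renaming. The Python sketch below is an analogy only; the function names and the field names (Age, Income, Sex, customer_age, annual_income, region) are hypothetical.

```python
def check_mapping(essential_fields, mapping):
    """Return essential template fields that have no mapped field in the new source."""
    return [field for field in essential_fields if field not in mapping]


def apply_map(record, mapping):
    """Rename new-source fields to template names (the role of the Map filter)."""
    reverse = {new: template for template, new in mapping.items()}
    return {reverse.get(name, name): value for name, value in record.items()}


def apply_unmap(record, mapping):
    """Restore the new source's own field names (the role of the Unmap filter)."""
    return {mapping.get(name, name): value for name, value in record.items()}


# mapping: template field name -> field name in the replacement data source
mapping = {"Age": "customer_age", "Income": "annual_income"}
record = {"customer_age": 41, "annual_income": 52000, "region": "N"}

unmapped = check_mapping(["Age", "Income", "Sex"], mapping)
mapped = apply_map(record, mapping)
print(unmapped)
print(mapped)
```

Here the essential field Sex has no counterpart in the new source, so it is reported as unmapped, much as the dialog box displays unmapped essential fields in red.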
Figure 5-28 New data source successfully mapped to the template stream
terminal nodes and copying and pasting between streams. Note: Using the Map to option, you cannot map to Merge or Append nodes or to any type of source node.
Figure 5-29 Mapping a stream from its Sort node to the Type node of another stream
E Using the Field Chooser, you can add or remove fields from the list. To open the Field Chooser,
Original. Lists all fields in the template or existing stream, that is, all of the fields that are present further downstream. Fields from the new data source will be mapped to these fields.
Mapped. Lists the fields selected for mapping to template fields. These are the fields whose names may have to change to match the original fields used in stream operations. Click in the table cell for a field to activate a drop-down list of available fields.
If you are unsure of which fields to map, it may be useful to examine the source data closely before mapping. For example, you can use the Types tab in the source node to review a summary of the source data.
F2 to begin a connection, press Tab to move to the desired node, and press the spacebar to complete the connection. Press F3 to disconnect all inputs and outputs to the selected node.
Customize the Nodes Palette tab with your favorite nodes. From the Tools menu, choose Palettes
to open a dialog box for adding, removing, or moving the nodes shown on the Nodes Palette.
Figure 5-32 Palette Manager
Rename nodes and add ToolTips. Each node dialog box includes an Annotations tab on which
you can specify a custom name for nodes on the canvas as well as add ToolTips to help organize your stream. You can also include lengthy annotations to track progress, save process details, and denote any business decisions required or achieved.
Figure 5-33 ToolTip and custom node name
Insert values automatically into a CLEM expression. Using the Expression Builder, accessible
from a variety of dialog boxes (such as those for Derive and Filler nodes), you can automatically insert field values into a CLEM expression. Click the values button on the Expression Builder to choose from existing field values.
Figure 5-34 Values button
Browse for files quickly. When browsing for files, use the File drop-down list (yellow diamond
button) to access previously used directories as well as Clementine default directories. Use the forward and back buttons to scroll through accessed directories.
Figure 5-35 File-browsing options
Minimize output window clutter. You can close and delete output quickly using the red X button
at the top right corner of all output windows. This enables you to keep only promising or interesting results on the Outputs tab of the managers window. A full range of keyboard shortcuts is available for the software. For more information, see Keyboard Accessibility in Appendix A on p. 209.
Did you know that you can...
Drag and select a group of nodes on the stream canvas using your mouse.
Copy and paste nodes from one stream to another.
Access Help from every dialog box and output window.
Get Help on CRISP-DM, the Cross-Industry Standard Process for Data Mining. (From the Help menu, choose CRISP-DM Help.)
Chapter 6
Handling Missing Values
During the Data Preparation phase of data mining, you will often want to replace missing values in the data. Missing values are values in the dataset that are unknown, uncollected, or incorrectly entered. Usually, such values are invalid for their fields. For example, the field Sex should contain the values M and F. If you discover the values Y or Z in the field, you can safely assume that such values are invalid and should therefore be interpreted as blanks. Likewise, a negative value for the field Age is meaningless and should also be interpreted as a blank. Frequently, such obviously wrong values are purposely entered, or fields left blank, during a questionnaire to indicate a nonresponse. At times, you may want to examine these blanks more closely to determine whether a nonresponse, such as the refusal to give one's age, is a factor in predicting a specific outcome.
Some modeling techniques handle missing data better than others. For example, GRI, C5.0, and Apriori cope well with values that are explicitly declared as missing in a Type node. Other modeling techniques have trouble dealing with missing values and experience longer training times, resulting in less-accurate models.
There are several types of missing values recognized by Clementine:
Null or system-missing values. These are nonstring values that have been left blank in the
database or source file and have not been specifically defined as missing in a source or Type node. System-missing values are displayed as $null$. Note that empty strings are not considered nulls in Clementine, although they may be treated as nulls by certain databases.
Empty strings and white space. Empty string values and white space (strings with no visible
characters) are treated as distinct from null values. Empty strings are treated as equivalent to white space for most purposes. For example, if you select the option to treat white space as blanks in a source or Type node, this setting applies to empty strings as well.
Blank or user-defined missing values. These are values such as unknown, 99, or -1 that are
explicitly defined in a source node or Type node as missing. Optionally, you can also choose to treat nulls and white space as blanks, which allows them to be flagged for special treatment and to be excluded from most calculations. For example, you can use the @BLANK function to treat these values, along with other types of missing values, as blanks. For more information, see Using the Values Dialog Box in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
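The three kinds of missing value described above can be told apart mechanically. The following Python sketch runs outside Clementine and is only an illustration, not the product's implementation. The names classify, is_blank, and USER_BLANKS are hypothetical, and the set of user-defined blanks is an assumed example.

```python
# Hypothetical user-defined blanks, as they might be declared in a Type node.
USER_BLANKS = {"unknown", 99, -1}

def classify(value, blanks=USER_BLANKS):
    """Return 'null', 'white space', 'blank', or 'valid' for a single value."""
    if value is None:                      # stands in for the system-missing $null$
        return "null"
    if isinstance(value, str) and value.strip() == "":
        return "white space"               # empty strings are treated like white space
    if value in blanks:
        return "blank"                     # explicitly declared as missing
    return "valid"

def is_blank(value, blanks=USER_BLANKS, nulls_are_blank=True, whitespace_is_blank=True):
    """Rough analogue of @BLANK when nulls and white space are optionally treated as blanks."""
    kind = classify(value, blanks)
    return (kind == "blank"
            or (kind == "null" and nulls_are_blank)
            or (kind == "white space" and whitespace_is_blank))
```

Keeping the classification separate from the blank test mirrors the text: nulls and white space are distinct categories that become blanks only when the corresponding option is selected.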
Figure 6-1 Specifying missing values for a range variable
Reading in mixed data. Note that when you are reading in fields with numeric storage (either
integer, real, time, timestamp, or date), any non-numeric values are set to null or system missing. This is because, unlike some applications, Clementine does not allow mixed storage types within a field. To avoid this, any fields with mixed data should be read in as strings by changing the storage type in the source node or external application as necessary. For more information, see Setting Field Storage and Formatting in Chapter 2 in Clementine 12.0 Source, Process, and Output Nodes.
Reading empty strings from Oracle. When reading from or writing to an Oracle database, be aware that, unlike Clementine and unlike most other databases, Oracle treats and stores empty string values as equivalent to null values. This means that the same data extracted from an Oracle database may behave differently than when extracted from a file or another database, and the data may return different results.
In general terms, there are two approaches you can follow:
You can exclude fields or records with missing values.
You can impute, replace, or coerce missing values using a variety of methods.
Both of these approaches can be largely automated using the Data Audit node. For example, you can generate a Filter node that excludes fields with too many missing values to be useful in modeling, and generate a Supernode that imputes missing values for any or all of the fields that remain. This is where the real power of the audit comes in, allowing you not only to assess the current state of your data, but to take action based on the assessment. For more information, see Preparing Data for Analysis (Data Audit) in Chapter 7 in Clementine 12.0 Applications Guide.
In determining which method to use, you should also consider the type of field with missing values.
Numeric range fields. For numeric field types, such as range, you should always eliminate any
non-numeric values before building a model, because many models will not function if blanks are included in numeric fields.
Categorical fields. For categorical fields, such as set and flag, altering missing values is not
necessary but will increase the accuracy of the model. For example, a model that uses the field Sex will still function with meaningless values, such as Y and Z, but removing all values other than M and F will increase the accuracy of the model.
Screening or Removing Fields
To screen out fields with too many missing values, you have several options:
You can use a Data Audit node to filter fields based on quality. For more information, see Filtering Fields with Missing Data in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes.
You can use a Feature Selection node to screen out fields with more than a specified percentage of missing values and to rank fields based on importance relative to a specified target. For more information, see Feature Selection Node in Chapter 4 in Clementine 12.0 Modeling Nodes.
Instead of removing the fields, you can use a Type node to set the field's direction to None. This will keep the fields in the dataset but exclude them from the modeling processes.
you specify).
Random. Substitutes a random value based on a normal or uniform distribution. Expression. Allows you to specify a custom expression. For example, you could replace values
imputed using this method, there will be a separate C&RT model, along with a Filler node that replaces blanks and nulls with the value predicted by the model. A Filter node is then used to remove the prediction fields generated by the model. For more information, see Imputing Missing Values in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes. Alternatively, to coerce values for specific fields, you can use a Type node to ensure that the field types cover only legal values and then set the Check column to Coerce for the fields whose blank values need replacing. For more information, see Type Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
When using any of the functions that accept a list of fields as input, the special functions @FIELDS_BETWEEN and @FIELDS_MATCHING can be used, as shown in the following example:
count_nulls(@FIELDS_MATCHING('card*'))
Figure 6-2 Using a Filler node to replace null values with 0 in the selected fields
You can use the undef function to fill fields with the system-missing value, displayed as $null$. For example, to replace any numeric value, you could use a conditional statement, such as:
if not(Age > 17) or not(Age < 66) then undef else Age endif
This replaces anything that is not in the range with a system-missing value, displayed as $null$. By using the not() function, you can catch all other numeric values, including any negatives. For more information, see Functions Handling Blanks and Null Values in Chapter 8 on p. 133.
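To see which values the conditional above keeps, the rule can be mimicked outside Clementine. The Python sketch below is an illustration only. The function name age_or_null is hypothetical, None stands in for $null$, and the explicit None check is an assumption added because Python cannot compare None with a number.

```python
def age_or_null(age):
    """Keep ages strictly between 17 and 66; map everything else to None."""
    # Same shape as: if not(Age > 17) or not(Age < 66) then undef else Age endif
    if age is None or not (age > 17) or not (age < 66):
        return None
    return age

ages = [30, 17, 66, -5, 18, 65]
cleaned = [age_or_null(a) for a in ages]
```

Because both comparisons are strict, the boundary values 17 and 66 are replaced along with negatives and other out-of-range values.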
Note on Discarding Records
When using a Select node to discard records, note that Clementine syntax uses three-valued logic and automatically includes null values in select statements. To exclude null values (system-missing) in a select expression, you must explicitly specify this by using and not in the expression. For example, to select and include all records where the type of prescription drug is Drug C, you would use the following select statement:
Drug = 'drugC' and not(@NULL(Drug))
Chapter 7
Building CLEM Expressions
What Is CLEM?
The Clementine Language for Expression Manipulation (CLEM) is a powerful language for analyzing and manipulating the data that flows along Clementine streams. Data miners use CLEM extensively in stream operations to perform tasks as simple as deriving profit from cost and revenue data or as complex as transforming Web log data into a set of fields and records with usable information. CLEM is used within Clementine to:
Compare and evaluate conditions on record fields.
Derive values for new fields.
Derive new values for existing fields.
Reason about the sequence of records.
Insert data from records into reports.
Scripting and batch mode. A subset of the CLEM language can also be used when scripting in
either the user interface or batch mode. This allows you to perform many of the same data manipulations in an automated fashion. For more information, see Scripting Overview in Chapter 2 in Clementine 12.0 Scripting and Automation Guide. CLEM expressions are indispensable for data preparation in Clementine and can be used in a wide range of nodes, from record and field operations (Select, Balance, Filler) to plots and output (Analysis, Report, Table). For example, you can use CLEM in a Derive node to create a new field based on a formula such as ratio.
Figure 7-1 Derive node creating a new field based on a formula
CLEM expressions can also be used for global search and replace operations. For example, the expression @NULL(@FIELD) can be used in a Filler node to replace system-missing values with the integer value 0. (To replace user-missing values, also called blanks, use the @BLANK function.)
Figure 7-2 Filler node replacing system-missing values with 0
More complex CLEM expressions can also be created. For example, you can derive new fields based on a conditional set of rules.
Figure 7-3 Conditional Derive comparing values of one field to those of the field before it
CLEM Examples
To illustrate correct syntax as well as the types of expressions possible with CLEM, example expressions follow.
Simple Expressions
Formulas can be as simple as this one, which derives a new field based on the values of the fields After and Before:
(After - Before) / Before * 100.0
Notice that field names are unquoted when referring to the values of the field. Similarly, the following expression simply returns the log of each value for the field salary.
log(salary)
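As a quick check of the first formula, (After - Before) / Before * 100.0, here is a small Python sketch with made-up values. It runs outside Clementine, and the function name percent_change and the sample figures are hypothetical.

```python
def percent_change(before, after):
    """Percentage change from before to after, matching (After - Before) / Before * 100.0."""
    return (after - before) / before * 100.0

# Two hypothetical records: a rise from 50 to 75 and a fall from 200 to 150.
rise = percent_change(50, 75)
fall = percent_change(200, 150)
```

Writing the constant as 100.0 rather than 100 mirrors the original expression and signals that a real-valued result is intended.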
Complex Expressions
Expressions can also be lengthy and more complex. The following expression returns true if the values of two fields ($KX-Kohonen and $KY-Kohonen) fall within the specified ranges. Notice that here the field names are single-quoted because the field names contain special characters.
('$KX-Kohonen' >= -0.2635771036148072 and '$KX-Kohonen' <= 0.3146203637123107 and '$KY-Kohonen' >= -0.18975617885589602 and '$KY-Kohonen' <= 0.17674794197082522) -> T
Several functions, such as string functions, require you to enter several parameters using correct syntax. For example, the function subscrs is used below to return the first character of a produce_ID field, indicating whether an item is organic, genetically modified, or conventional. The results of an expression are described by "-> Result".
subscrs(1,produce_ID) -> `c`
It is important to note that characters are always encapsulated within single backquotes.
Combining Functions in an Expression
Frequently, CLEM expressions consist of a combination of functions. The function below combines subscr and lowertoupper to return the first character of produce_ID and convert it to upper case.
lowertoupper(subscr(1,produce_ID)) -> `C`
This expression locates the character `n` within the values of the field web_page reading backward from the last character of the field value. By including the length function as well, the expression dynamically calculates the length of the current value rather than using a static number, such as 7, which will be invalid for values with fewer than seven characters.
Special Functions
Numerous special functions (preceded with an @ symbol) are available. Commonly used functions include:
@BLANK('referrer ID') -> T
Frequently, special functions are used in combination, which is a commonly used method of flagging blanks in more than one field at a time.
@BLANK(@FIELD) -> T
Additional examples are discussed throughout the CLEM documentation. For more information, see CLEM Reference Overview in Chapter 8 on p. 104.
where Product is the name of a field from a market basket dataset, $P-NextField is the name of a parameter, and the value of the expression is the value of the named field. Typically, field names start with a letter and may also contain digits and underscores (_). You can use names that do not follow these rules if you place the name within quotation marks. CLEM values can be any of the following:
Strings, for example, "c1", "Type 2", "a piece of free text"
Integers, for example, 12, 0, 189
Real numbers, for example, 12.34, 0.0, 0.0045
Date/time fields, for example, 05/12/2002, 12/05/2002, 12/05/02
It is also possible to use the following elements:
Character codes, for example, `a` or 3
Lists of items, for example, [1 2 3], ['Type 1' 'Type 2']
Character codes and lists do not usually occur as field values. Typically, they are used as arguments of CLEM functions.
Quoting Rules
Although the software is flexible when determining the fields, values, parameters, and strings used in a CLEM expression, the following general rules provide a list of best practices to use when creating expressions:
Strings. Always use double quotes when writing strings ("Type 2" or "value"). Single quotes
can be used instead but at the risk of confusion with quoted fields.
Characters. Always use single backquotes like this ` . For example, note the character d in
the function stripchar(`d`,"drugA"). The only exception to this is when you are using an integer to refer to a specific character in a string. For example, note the character 5 in the function lowertoupper("druga"(5)) -> "A". Note: On a standard U.K. and U.S. keyboard, the key for the backquote character (grave accent, Unicode 0060) can be found just below the Esc key.
Fields. Fields are typically unquoted when used in CLEM expressions (subscr(2,arrayID)) ->
CHAR). You can use single quotes when necessary to enclose spaces or other special characters ('Order Number'). Fields that are quoted but undefined in the dataset will be misread as strings.
Parameters. Always use single quotes ('$P-threshold').
Or, they can evaluate true or false (used when selecting on a condition), for example:
Drug = "drugA"
Age < 16
not(PowerFlux) and Power > 2000
You can combine operators and functions arbitrarily in CLEM expressions, for example:
sqrt(abs(Signal)) * max(T1, T2) + Baseline
Brackets and operator precedence determine the order in which the expression is evaluated. In this example, the order of evaluation is:
abs(Signal) is evaluated, and sqrt is applied to its result.
max(T1, T2) is evaluated.
The two results are multiplied: * has higher precedence than +.
Finally, Baseline is added to the result.
The descending order of precedence (that is, operations that are executed first to operations that are executed last) is as follows:
Function arguments
Function calls
**
* / mod div rem
+ -
> < >= <= /== == = /=
If you want to override precedence, or if you are in any doubt of the order of evaluation, you can use parentheses to make it explicit, for example,
Parameters are represented in CLEM expressions by $P-pname, where pname is the name of the parameter. When used in CLEM expressions, parameters must be placed within single quotes, for example, '$P-scale'. Available parameters are easily viewed using the Expression Builder. To view current parameters:
E In any dialog box accepting CLEM expressions, click the Expression Builder button.
E From the Fields drop-down list, select Parameters.
You can select parameters from the list for insertion into the CLEM expression. For more information, see Selecting Fields, Parameters, and Global Variables on p. 98.
Removing leading or trailing whitespace from values: trim(STRING), trim_start(STRING), or trimend(STRING).
Extracting the first or last n characters from a string: startstring(LENGTH, STRING) or endstring(LENGTH, STRING). For example, suppose you have a field named item that combines a product name with a four-digit ID code (ACME CAMERA-D109). To create a new field that contains only the four-digit code, specify the following formula in a Derive node:
endstring(4, item)
Matching a specific pattern: STRING matches PATTERN. For example, to select persons with "market" anywhere in their job title, you could specify the following in a Select node:
job_title matches "*market*"
Replacing all instances of a substring within a string: replace(SUBSTRING, NEWSUBSTRING, STRING). For example, to replace all instances of an unsupported character, such as a vertical pipe ( | ), with a semicolon prior to text mining, use the replace function in a Filler node. Under Fill in fields:, select all fields where the character may occur. For the Replace: condition, select Always, and specify the following condition under Replace with:
replace('|',';',@FIELD)
Deriving a flag field based on the presence of a specific substring. For example, you could use a string function in a Derive node to generate a separate flag field for each response with an expression such as:
hassubstring(museums,"museum_of_design")
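The string operations listed above have close everyday equivalents, which can help when checking expected results by hand. The Python sketch below runs outside Clementine and is not the CLEM implementation. The helper names endstring and matches are hypothetical stand-ins, the sample strings are invented, and the wildcard behavior (an asterisk standing for any run of characters) is an assumption based on the "*market*" example.

```python
from fnmatch import fnmatchcase

def endstring(length, text):
    """Last `length` characters of text (empty string when length <= 0)."""
    return text[-length:] if length > 0 else ""

def matches(text, pattern):
    """Case-sensitive wildcard match, assuming * stands for any run of characters."""
    return fnmatchcase(text, pattern)

item = "ACME CAMERA-D109"
code = endstring(4, item)                        # the four-digit ID code
trimmed = "  ACME CAMERA  ".strip()              # analogue of trimming both ends
is_marketing = matches("Senior marketing analyst", "*market*")
cleaned = "red|green|blue".replace("|", ";")     # analogue of replacing every pipe
has_design = "museum_of_design" in "science_museum;museum_of_design"
```

Each result corresponds to one item in the list: the extracted code, the trimmed value, the pattern match used for selection, the cleaned text, and the flag for the presence of a substring.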
For more information, see Functions Handling Blanks and Null Values in Chapter 8 on p. 133.
You can easily calculate the time passed from a baseline date using a family of functions similar to the one below. This function returns the time in months from the baseline date to the date represented by the date string DATE as a real number. This is an approximate figure, based on a month of 30.0 days.
date_in_months(Date)
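The 30-day approximation is easy to verify by hand. The following Python sketch mimics the calculation outside Clementine and is an illustration only. Passing the baseline as an explicit argument is an assumption made for the example, and the dates are invented.

```python
from datetime import date

def date_in_months(d, baseline):
    """Approximate months from baseline to d, using a month of 30.0 days."""
    return (d - baseline).days / 30.0

baseline = date(2000, 1, 1)
one_month = date_in_months(date(2000, 1, 31), baseline)   # 30 days later
two_months = date_in_months(date(2000, 3, 1), baseline)   # 60 days later (2000 is a leap year)
```

Because every month counts as 30.0 days, the result drifts from calendar months over long spans, which is why the text calls it an approximate figure.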
Values of date/time fields can be compared across records using functions similar to the one below. This function returns a value of true if the date string DATE1 represents a date prior to that represented by the date string DATE2. Otherwise, this function returns a value of 0.
date_before(Date1, Date2)
Calculating Differences
You can also calculate the difference between two times and two dates using functions, such as:
date_weeks_difference(Date1, Date2)
This function returns the time in weeks from the date represented by the date string DATE1 to the date represented by the date string DATE2 as a real number. This is based on a week of 7.0 days. If DATE2 is prior to DATE1, this function returns a negative number.
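The comparison and difference functions described above can be sketched in a few lines. The Python below is an illustration outside Clementine, not the product's code. It works on date objects rather than date strings, which is a simplifying assumption, and the sample dates are invented.

```python
from datetime import date

def date_before(date1, date2):
    """True when date1 falls strictly before date2."""
    return date1 < date2

def date_weeks_difference(date1, date2):
    """Weeks from date1 to date2 as a real number, using a week of 7.0 days."""
    return (date2 - date1).days / 7.0

start = date(2024, 1, 1)
end = date(2024, 1, 15)
forward = date_weeks_difference(start, end)    # 14 days apart
backward = date_weeks_difference(end, start)   # negative, because the second date is earlier
```

The sign convention follows the text: the result is negative whenever the second date is prior to the first.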
Today's Date
The current date can be added to the dataset using the function @TODAY. Today's date is added as a string to the specified field or new field using the date format selected in the stream properties dialog box. For more information, see Date and Time Functions in Chapter 8 on p. 123.
You can compare values across multiple fields using the min_n and max_n functions, for example:
max_n(['card1fee' 'card2fee' 'card3fee' 'card4fee'])
You can also use a number of counting functions to obtain counts of values that meet specific criteria, even when those values are stored in multiple fields. For example, to count the number of cards that have been held for more than five years:
count_greater_than(5, ['cardtenure' 'card2tenure' 'card3tenure'])
Note that this example counts the number of cards being held, not the number of people holding them. For more information, see Comparison Functions in Chapter 8 on p. 112.
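To make the per-record behavior concrete, the sketch below applies the same two ideas to a single hypothetical customer record in Python. It runs outside Clementine. The helper signatures take the record explicitly, which CLEM does not require, and the fee and tenure values are invented.

```python
# One hypothetical record; the field names follow the examples above.
record = {
    "card1fee": 25.0, "card2fee": 0.0, "card3fee": 40.0, "card4fee": 15.0,
    "cardtenure": 7, "card2tenure": 3, "card3tenure": 6,
}

def max_n(record, fields):
    """Largest value among the listed fields of one record."""
    return max(record[f] for f in fields)

def count_greater_than(threshold, record, fields):
    """Number of listed fields whose value exceeds threshold."""
    return sum(1 for f in fields if record[f] > threshold)

highest_fee = max_n(record, ["card1fee", "card2fee", "card3fee", "card4fee"])
long_held = count_greater_than(5, record, ["cardtenure", "card2tenure", "card3tenure"])
```

For this record, two of the three cards have been held for more than five years, so the count describes cards for one customer rather than customers.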
Numeric Functions
You can obtain statistics across multiple fields using the sum_n, mean_n, and sdev_n functions, for example:
sum_n(['card1bal' 'card2bal' 'card3bal'])
mean_n(['card1bal' 'card2bal' 'card3bal'])
When using any of the functions that accept a list of fields as input, the special functions @FIELDS_BETWEEN(start, end) and @FIELDS_MATCHING(pattern) can be used as input. For example, assuming the order of fields is as shown in the sum_n example above, the following would be equivalent:
sum_n(@FIELDS_BETWEEN(card1bal, card3bal))
Alternatively, to count the number of null values across all fields beginning with card:
count_nulls(@FIELDS_MATCHING('card*'))
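How the two field-list functions select their inputs can be sketched with an ordered record. The Python below is an illustration outside Clementine. All helper names are hypothetical stand-ins, the record values are invented, and the sketch assumes that field order is the order in which fields appear in the record and that the pattern uses * as a wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical record; dictionary order stands in for the stream's field order.
record = {"id": 101, "card1bal": 120.0, "card2bal": 0.0, "card3bal": 80.0, "card4bal": None}

def fields_between(record, start, end):
    """Field names from start to end inclusive, in record order."""
    names = list(record)
    return names[names.index(start): names.index(end) + 1]

def fields_matching(record, pattern):
    """Field names that match a wildcard pattern such as 'card*'."""
    return [name for name in record if fnmatchcase(name, pattern)]

def sum_n(record, fields):
    return sum(record[f] for f in fields)

def mean_n(record, fields):
    return sum_n(record, fields) / len(fields)

def count_nulls(record, fields):
    """Number of listed fields holding None, the stand-in for $null$."""
    return sum(1 for f in fields if record[f] is None)

balance_fields = fields_between(record, "card1bal", "card3bal")
total = sum_n(record, balance_fields)
average = mean_n(record, balance_fields)
nulls = count_nulls(record, fields_matching(record, "card*"))
```

The range selection depends on field order while the pattern selection depends only on names, so reordering fields upstream changes the first result but not the second.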
Similarly, suppose you have asked customers to rank three cars in order of likelihood to purchase and coded the responses in three separate fields, as follows:
customer id   car1   car2   car3
101           1      3      2
102           3      2      1
103           2      3      1
In this case, you could determine the index of the field for the car they like most (ranked #1, or the lowest rank) using the min_index function:
min_index(['car1' 'car2' 'car3'])
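Applied to the three customers in the table, the calculation can be worked through as follows. This Python sketch runs outside Clementine and is an illustration only. It assumes the returned index is 1-based (1 for the first listed field), which is not stated in this section.

```python
# Rankings from the table above, in the order car1, car2, car3.
rankings = {101: [1, 3, 2], 102: [3, 2, 1], 103: [2, 3, 1]}

def min_index(values):
    """1-based position of the smallest value (assumed indexing convention)."""
    return values.index(min(values)) + 1

favorite = {customer: min_index(ranks) for customer, ranks in rankings.items()}
```

Under that assumption, customer 101 prefers the first car listed, while customers 102 and 103 both prefer the third.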
The special @MULTI_RESPONSE_SET function can be used to reference all of the fields in a multiple-response set. For example, if the three car fields in the above example are included in a multiple-response set named car_rankings, the following would return the same result:
min_index(@MULTI_RESPONSE_SET("car_rankings"))
For more information, see Editing Multiple Response Sets in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
Creating Expressions
The Expression Builder provides not only complete lists of fields, functions, and operators but also access to data values if your data is instantiated.
To Create an Expression Using the Expression Builder
E Type in the expression window, using the function and field lists as references.
or
E Select the desired fields and functions from the scrolling lists.
E Double-click or click the yellow arrow button to add the field or function to the expression window.
E Use the operand buttons in the center of the dialog box to insert the operations into the expression.
Selecting Functions
The function list displays all available CLEM functions and operators. Scroll to select a function from the list, or, for easier searching, use the drop-down list to display a subset of functions or operators. Available functions are grouped into categories for easier searching.
Figure 7-7 Functions drop-down list
After you have selected a group of functions, double-click to insert the functions into the expression window at the point indicated by the position of the cursor.
For more information, see Stream, Session, and SuperNode Parameters on p. 90. In addition to fields, you can also choose from the following items:
Parameters. For more information, see Stream, Session, and SuperNode Parameters on p. 90. Global values. For more information, see Set Globals Node in Chapter 6 in Clementine 12.0
and values are known. For more information, see Using the Values Dialog Box in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
Figure 7-9 Fields list with values shown for Before
E To view values for a field from the Expression Builder or a Time Intervals node, select the desired
field and click the value picker button to open a dialog box listing values for the selected field. You can then select a value and click Insert to paste the value into the current expression or list.
Figure 7-10 Value picker button
For flag and set fields, all defined values are listed. For numeric range fields, the minimum and maximum values are displayed.
The following items are checked:
Correct quoting of values and field names
Correct usage of parameters and global variables
Valid usage of operators
Existence of referenced fields
Existence and definition of referenced globals
If you encounter errors in syntax, try creating the expression using the lists and operator buttons rather than typing the expression manually. This method automatically adds the proper quotes for fields and values.
E With the cursor in a text area, press Ctrl-F to access the Find/Replace dialog box.
E Enter the text you want to search for, or choose from the drop-down list of recently searched items.
E Enter the replacement text, if any.
E Choose Find Next to start the search.
E Choose Replace to replace the current selection, or choose Replace All to update all or selected instances.
E The dialog closes after each operation. Press F3 from any text area to repeat the last find operation,
matches myVar. Replacement text is always inserted exactly as entered, regardless of this setting.
Whole words only. Specifies whether the find operation matches text embedded within words. If
selected, for example, a search on spider will not match spiderman or spider-man.
Regular expressions. Specifies whether regular expression syntax is used (see below). When
selected, the Whole words only option is disabled and its value is ignored.
Selected text only. Controls the scope of the search when using the Replace All option.
Regular Expression Syntax
Regular expressions allow you to search on special characters such as tabs or newline characters, classes or ranges of characters such as a through d, any digit or non-digit, and boundaries such as the beginning or end of a line. The following types of expressions are supported.
Character Matches
Characters   Matches
x            The character x
\\           The backslash character
Matches
The character with octal value 0n (0 <= n <= 7)
The character with octal value 0nn (0 <= n <= 7)
The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
The character with hexadecimal value 0xhh
The character with hexadecimal value 0xhhhh
The tab character (\u0009)
The newline (line feed) character (\u000A)
The carriage-return character (\u000D)
The form-feed character (\u000C)
The alert (bell) character (\u0007)
The escape character (\u001B)
The control character corresponding to x
Boundary Matches
Boundary matchers   Matches
^                   The beginning of a line
$                   The end of a line
\b                  A word boundary
\B                  A non-word boundary
\A                  The beginning of the input
\Z                  The end of the input but for the final terminator, if any
\z                  The end of the input
Chapter 8
This section describes the Clementine Language for Expression Manipulation (CLEM), which is a powerful tool used to analyze and manipulate the data used in Clementine streams. You can use CLEM within nodes to perform tasks ranging from evaluating conditions or deriving values to inserting data into reports. For more information, see What Is CLEM? in Chapter 7 on p. 84. A subset of the CLEM language can also be used when you are scripting in either the user interface or batch mode. This allows you to perform many of the same data manipulations in an automated fashion. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide. CLEM expressions consist of values, field names, operators, and functions. Using the correct syntax, you can create a wide variety of powerful data operations. For more information, see CLEM Examples in Chapter 7 on p. 86.
CLEM Datatypes
CLEM datatypes can be made up of any of the following:
Integers
Reals
Characters
Strings
Lists
Fields
Date/Time
Rules for Quoting
Although Clementine is flexible when you are determining the fields, values, parameters, and strings used in a CLEM expression, the following general rules provide a list of good practices to use in creating expressions:
Strings. Always use double quotes when writing strings, such as "Type 2". Single quotes can be used instead but at the risk of confusion with quoted fields.
Fields. Use single quotes only where necessary to enclose spaces or other special characters, such as 'Order Number'. Fields that are quoted but undefined in the dataset will be misread as strings.
Parameters: Always use single quotes when using parameters, such as '$P-threshold'.
Characters: Always use single backquotes (`), such as stripchar(`d`, "drugA").
For more information, see Values and Data Types in Chapter 7 on p. 88. Additionally, these rules are covered in more detail in the following topics.
Integers
Integers are represented as a sequence of decimal digits, for example, 1234, 999, 77. Optionally, you can place a minus sign (-) before the integer to denote a negative number. The CLEM language handles integers of arbitrary precision. The maximum integer size depends on your platform. If the values are too large to be displayed in an integer field, changing the field type to Real usually restores the value.
Reals
Real refers to a floating-point number. Reals are represented by one or more digits followed by a decimal point followed by one or more digits, for example, 1.234, 0.999, 77.001. CLEM reals are held in double precision. Optionally, you can place a minus sign (-) before the real to denote a negative number. Use the form <number> e <exponent> to express a real number in exponential notation, for example, 1234.0e5, 1.7e2.
When the Clementine application reads number strings from files and converts them automatically to numbers, numbers with no leading digit before the decimal point or with no digit after the point are accepted, for example, 999. or .11. However, these forms are illegal in CLEM expressions.
Note: When referencing real numbers in CLEM expressions, a period must be used as the decimal separator, regardless of any settings for the current stream or locale. For example, specify
Na > 0.6
rather than
Na > 0,6
This applies even if a comma is selected as the decimal symbol in the Stream Properties dialog box and is consistent with the general guideline that code syntax should be independent of any specific locale or convention.
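The rules above for real literals in expressions can be sketched as a small check. This is a Python emulation, not CLEM itself; the helper name is_clem_real_literal and the exact grammar (lowercase e, optional minus on the exponent) are assumptions made for illustration.

```python
import re

# Assumed grammar: optional minus, digits, a period, digits, optional exponent.
# A comma separator, a missing leading digit, or a missing trailing digit is rejected.
_CLEM_REAL = re.compile(r"-?\d+\.\d+(e-?\d+)?")

def is_clem_real_literal(text: str) -> bool:
    """Return True if text looks like a real literal that the rules above allow."""
    return _CLEM_REAL.fullmatch(text) is not None
```

Under these assumptions, 0.6 and 1234.0e5 pass, while 0,6, 999. and .11 are rejected.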
Characters
Characters (usually shown as CHAR) are typically used within a CLEM expression to perform tests on strings. For example, you can use the function isuppercode to determine whether the first character of a string is upper case. The following CLEM expression uses a character to indicate that the test should be performed on the first character of the string:
isuppercode(subscrs(1, "MyString"))
To express the code (in contrast to the location) of a particular character in a CLEM expression, use single backquotes of the form `<character>`, for example, `A`, `Z`.
Note: There is no CHAR storage type for a field, so if a field is derived or filled with an expression that results in a CHAR, then that result will be converted to a string.
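The character test shown above, isuppercode(subscrs(1, "MyString")), can be mimicked in Python as follows. This is only an emulation of the described behavior; the 1-based subscript follows the text, while returning None for an out-of-range position is an assumption standing in for a CLEM null.

```python
from typing import Optional

def subscrs(n: int, string: str) -> Optional[str]:
    """Return the character at 1-based position n, or None if n is out of range."""
    if 1 <= n <= len(string):
        return string[n - 1]
    return None  # assumption: stands in for a CLEM null value

def isuppercode(char: Optional[str]) -> bool:
    """Return True when char is a single uppercase letter."""
    return char is not None and char.isupper()
```

With these definitions, the test on the first character of "MyString" is true and the test on its second character is false.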
Strings
Generally, you should enclose strings in double quotation marks. Examples of strings are "c35product2" and "referrerID". To indicate special characters in a string, use a backslash, for example, "\$65443". You can use single quotes around a string, but the result is indistinguishable from a quoted field ('referrerID'). For more information, see String Functions on p. 119.
Lists
A list is an ordered sequence of elements, which may be of mixed type. Lists are enclosed in square brackets ([]). Examples of lists are [1 2 4 16] and ["abc" "def"]. Lists are not used as the value of Clementine fields. They are used to provide arguments to functions, such as member and oneof.
Fields
Names in CLEM expressions that are not names of functions are assumed to be field names. You can write these simply as Power, val27, state_flag, and so on, but if the name begins with a digit or includes non-alphabetic characters, such as spaces (with the exception of the underscore), place the name within single quotation marks, for example, 'Power Increase', '2nd answer', '#101', '$P-NextField'.
Note: Fields that are quoted but undefined in the dataset will be misread as strings.
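As a rough illustration of the quoting rule for field names, the following Python helper decides whether a name needs single quotes. The function name quote_field is hypothetical, and the sketch assumes that digits are allowed after the first character (as in val27) and that names contain only ASCII characters and no embedded single quotes.

```python
import re

# Assumption: a bare name starts with a letter or underscore and continues
# with letters, digits, or underscores; anything else must be quoted.
_BARE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def quote_field(name: str) -> str:
    """Return the field name as it could be written in an expression."""
    return name if _BARE_NAME.fullmatch(name) else f"'{name}'"
```

For example, quote_field("state_flag") leaves the name unchanged, while quote_field("2nd answer") wraps it in single quotes.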
Dates
Date calculations are based on a baseline date, which is specified in the stream properties dialog box. The default baseline date is January 1, 1900. For more information, see Setting Options for Streams in Chapter 5 on p. 57. The CLEM language supports the following date formats:
Format         Examples
DDMMYY         150163
MMDDYY         011563
YYMMDD         630115
YYYYMMDD       19630115
YYYYDDD        Four-digit year followed by a three-digit number representing the day of the year, for example, 2000032 represents the 32nd day of 2000 or February 1, 2000.
DAY            Day of the week in the current locale, for example, Monday, Tuesday, ..., in English.
MONTH          Month in the current locale, for example, January, February, ...
DD/MM/YY       15/01/63
DD/MM/YYYY     15/01/1963
MM/DD/YY       01/15/63
MM/DD/YYYY     01/15/1963
DD-MM-YY       15-01-63
DD-MM-YYYY     15-01-1963
MM-DD-YY       01-15-63
MM-DD-YYYY     01-15-1963
DD.MM.YY       15.01.63
DD.MM.YYYY     15.01.1963
MM.DD.YY       01.15.63
MM.DD.YYYY     01.15.1963
DD-MON-YY      15-JAN-63, 15-jan-63, 15-Jan-63
DD/MON/YY      15/JAN/63, 15/jan/63, 15/Jan/63
DD.MON.YY      15.JAN.63, 15.jan.63, 15.Jan.63
DD-MON-YYYY    15-JAN-1963, 15-jan-1963, 15-Jan-1963
DD/MON/YYYY    15/JAN/1963, 15/jan/1963, 15/Jan/1963
DD.MON.YYYY    15.JAN.1963, 15.jan.1963, 15.Jan.1963
MON YYYY       Jan 2004
q Q YYYY       Date represented as a digit (1-4) representing the quarter followed by the letter Q and a four-digit year, for example, 25th Dec 2004 would be represented as 4 Q 2004.
ww WK YYYY     Two-digit number representing the week of the year followed by the letters WK and then a four-digit year. The week of the year is calculated assuming that the first day of the week is Monday and there is at least one day in the first week.
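Two of the less familiar formats above, the day-of-year form and the quarter form, can be illustrated with Python's standard datetime module. This is a sketch of the formats as described, not of how Clementine parses them; the helper names are hypothetical.

```python
from datetime import date, datetime

def parse_yyyyddd(text: str) -> date:
    """Parse a four-digit year followed by a three-digit day of the year."""
    return datetime.strptime(text, "%Y%j").date()

def to_quarter_format(d: date) -> str:
    """Render a date as quarter digit, the letter Q, and a four-digit year."""
    quarter = (d.month - 1) // 3 + 1
    return f"{quarter} Q {d.year}"
```

Here parse_yyyyddd("2000032") gives February 1, 2000, and to_quarter_format(date(2004, 12, 25)) gives "4 Q 2004", matching the two examples in the text.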
Time
The CLEM language supports the following time formats:
Format            Examples
HHMMSS            120112, 010101, 221212
HHMM              1223, 0745, 2207
MMSS              5558, 0100
HH:MM:SS          12:01:12, 01:01:01, 22:12:12
HH:MM             12:23, 07:45, 22:07
MM:SS             55:58, 01:00
(H)H:(M)M:(S)S    12:1:12, 1:1:1, 22:12:12
(H)H:(M)M         12:23, 7:45, 22:7
(M)M:(S)S         55:58, 1:0
HH.MM.SS          12.01.12, 01.01.01, 22.12.12
HH.MM             12.23, 07.45, 22.07
MM.SS             55.58, 01.00
(H)H.(M)M.(S)S    12.1.12, 1.1.1, 22.12.12
(H)H.(M)M         12.23, 7.45, 22.7
(M)M.(S)S         55.58, 1.0
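The variable-width formats above, such as (H)H:(M)M:(S)S, accept components without leading zeros. A minimal Python sketch of that idea follows; parse_hms is a hypothetical helper, and it assumes the value always has exactly three components separated by the given character.

```python
from datetime import time

def parse_hms(text: str, sep: str = ":") -> time:
    """Parse hours, minutes, and seconds that may lack leading zeros."""
    hours, minutes, seconds = (int(part) for part in text.split(sep))
    return time(hours, minutes, seconds)
```

For example, parse_hms("1:1:1") and parse_hms("01:01:01") produce the same time value, and parse_hms("22.12.12", sep=".") handles the period-separated variant.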
CLEM Operators
The following operators are available:
Operation    Comments    Precedence (see below)
or    Used between two CLEM expressions. Returns a value of true if either is true or if both are true.    10
and    Used between two CLEM expressions. Returns a value of true if both are true.    9
=    Used between any two comparable items. Returns true if ITEM1 is equal to ITEM2.    7
==    Identical to the above.    7
/=    Used between any two comparable items. Returns true if ITEM1 is not equal to ITEM2.    7
/==    Identical to the above.    7
>    Used between any two comparable items. Returns true if ITEM1 is strictly greater than ITEM2.    6
>=    Used between any two comparable items. Returns true if ITEM1 is greater than or equal to ITEM2.    6
<    Used between any two comparable items. Returns true if ITEM1 is strictly less than ITEM2.    6
<=    Used between any two comparable items. Returns true if ITEM1 is less than or equal to ITEM2.    6
&&=_0    Used between two integers. Equivalent to the Boolean expression INT1 && INT2 = 0.    6
&&/=_0    Used between two integers. Equivalent to the Boolean expression INT1 && INT2 /= 0.    6
+    Adds two numbers: NUM1 + NUM2.    5
><    Concatenates two strings; for example, STRING1 >< STRING2.    5
-    Subtracts one number from another: NUM1 - NUM2. Can also be used in front of a number: - NUM.    5
*    Used to multiply two numbers: NUM1 * NUM2.    4
Operation    Comments
&&    Used between two integers. The result is the bitwise and of the integers INT1 and INT2.
&&~~    Used between two integers. The result is the bitwise and of INT1 and the bitwise complement of INT2.
||    Used between two integers. The result is the bitwise inclusive or of INT1 and INT2.
~~    Used in front of an integer. Produces the bitwise complement of INT.
||/&    Used between two integers. The result is the bitwise exclusive or of INT1 and INT2.
INT1 << N    Used between two integers. Produces the bit pattern of INT1 shifted left by N positions.
INT1 >> N    Used between two integers. Produces the bit pattern of INT1 shifted right by N positions.
/    Used to divide one number by another: NUM1 / NUM2.
**    Used between two numbers: BASE ** POWER. Returns BASE raised to the power POWER.
rem    Used between two integers: INT1 rem INT2. Returns the remainder, INT1 - (INT1 div INT2) * INT2.
div    Used between two integers: INT1 div INT2. Performs integer division.
Operator Precedence
Precedences determine the parsing of complex expressions, especially unbracketed expressions with more than one infix operator. For example,
3+4*5
parses as 3 + (4 * 5) rather than (3 + 4) * 5 because the relative precedences dictate that * is to be parsed before +. Every operator in the CLEM language has a precedence value associated with it; the lower this value, the more important it is on the parsing list, meaning that it will be processed sooner than other operators with higher precedence values.
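The effect of precedence values can be sketched with a small Python routine that adds explicit brackets to an unbracketed expression. It uses the values listed for +, -, and * in the operator table (5, 5, and 4); left-to-right grouping for operators of equal precedence is an assumption, and the routine handles only non-negative integers and these three operators.

```python
import re

# Precedence values taken from the operator table; a lower value binds more tightly.
PRECEDENCE = {"*": 4, "+": 5, "-": 5}

def parenthesize(expr: str) -> str:
    """Return expr with brackets showing how the precedences group it."""
    tokens = re.findall(r"\d+|[*+-]", expr)
    operands, operators = [], []

    def apply_top() -> None:
        # Combine the two most recent operands with the most recent operator.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append(f"({left} {op} {right})")

    for tok in tokens:
        if tok in PRECEDENCE:
            # Apply pending operators that bind at least as tightly
            # (assumed left-to-right grouping for equal precedence).
            while operators and PRECEDENCE[operators[-1]] <= PRECEDENCE[tok]:
                apply_top()
            operators.append(tok)
        else:
            operands.append(tok)
    while operators:
        apply_top()
    return operands[0]
```

Calling parenthesize("3+4*5") returns "(3 + (4 * 5))", which matches the parse described above.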
Functions Reference
The following CLEM functions are available for working with data in Clementine. You can enter these functions as code in a variety of dialog boxes, such as Derive and Set To Flag nodes, or you can use the Expression Builder to create valid CLEM expressions without memorizing function lists or field names.
Function Type    Description
Information    Used to gain insight into field values. For example, the function is_string returns true for all records whose type is a string.
Conversion    Used to construct new fields or convert storage type. For example, the function to_timestamp converts the selected field to a timestamp.
Comparison    Used to compare field values to each other or to a specified string. For example, <= is used to compare whether the values of two fields are lesser or equal.
Logical    Used to perform logical operations, such as if, then, else operations.
Numeric    Used to perform numeric calculations, such as the natural log of field values.
Trigonometric    Used to perform trigonometric calculations, such as the arccosine of a specified angle.
Probability    Return probabilities based on various distributions, such as probability that a value from Student's t distribution will be less than a specific value.
Bitwise    Used to manipulate integers as bit patterns.
Random    Used to randomly select items or generate numbers.
String    Used to perform a wide variety of operations on strings, such as stripchar, which allows you to remove a specified character.
SoundEx    Used to find strings when the precise spelling is not known; based on phonetic assumptions about how certain letters are pronounced.
Date and time    Used to perform a variety of operations on date, time, and timestamp fields.
Sequence    Used to gain insight into the record sequence of a dataset or perform operations based on that sequence.
Global    Used to access global values created by a Set Globals node. For example, @MEAN is used to refer to the mean average of all values for a field across the entire dataset.
Blanks and null    Used to access, flag, and frequently fill user-specified blanks or system-missing values. For example, @BLANK(FIELD) is used to raise a true flag for records where blanks are present.
Special fields    Used to denote the specific fields under examination. For example, @FIELD is used when deriving multiple fields.
Convention    Description
INT, INT1, INT2    Any integer, such as 1 or 77.
CHAR    A character code, such as `A`.
STRING    A string, such as "referrerID".
LIST    A list of items, such as ["abc" "def"].
ITEM    A field, such as Customer or extract_concept.
DATE    A date field, such as start_date, where values are in a format such as DD-MON-YYYY.
TIME    A time field, such as power_flux, where values are in a format such as HHMMSS.
Functions in this guide are listed with the function in one column, the result type (integer, string, and so on) in another, and a description (where available) in a third column. For example, the rem function description is shown below:
Function    Result    Description
INT1 rem INT2    Number    Returns the remainder of INT1 divided by INT2. For example, INT1 - (INT1 div INT2) * INT2.
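The remainder identity in the rem description can be checked with a short Python emulation. Whether CLEM integer division truncates toward zero or rounds down for negative operands is not stated here, so truncation toward zero is an assumption; the helper names are hypothetical.

```python
def clem_div(int1: int, int2: int) -> int:
    """Integer division, assumed here to truncate toward zero."""
    quotient = abs(int1) // abs(int2)
    return quotient if (int1 >= 0) == (int2 >= 0) else -quotient

def clem_rem(int1: int, int2: int) -> int:
    """Remainder defined as INT1 - (INT1 div INT2) * INT2."""
    return int1 - clem_div(int1, int2) * int2
```

Under that assumption, clem_rem(17, 5) is 2 and clem_rem(-17, 5) is -2. Python's own % operator would give 3 for the second case because it rounds the quotient down instead.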
Details on usage conventions, such as how to list items or specify characters in a function, are described elsewhere. For more information, see CLEM Datatypes on p. 104.
Information Functions
Information functions are used to gain insight into the values of a particular field. They are typically used to derive flag fields. For example, you can use the @BLANK function to create a flag field indicating records whose values are blank for the selected field. Similarly, you can check the storage type for a field using any of the storage type functions, such as is_string.
Function    Result    Description
@BLANK(FIELD)    Boolean    Returns true for all records whose values are blank according to the blank-handling rules set in an upstream Type node or source node (Types tab). Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
@NULL(ITEM)    Boolean    Returns true for all records whose values are undefined. Undefined values are system null values, displayed in Clementine as $null$. Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
is_date(ITEM)    Boolean    Returns true for all records whose type is a date.
is_datetime(ITEM)    Boolean    Returns true for all records whose type is a date, time, or timestamp.
is_integer(ITEM)    Boolean    Returns true for all records whose type is an integer.
is_number(ITEM)    Boolean    Returns true for all records whose type is a number.
is_real(ITEM)    Boolean    Returns true for all records whose type is a real.
is_string(ITEM)    Boolean    Returns true for all records whose type is a string.
is_time(ITEM)    Boolean    Returns true for all records whose type is a time.
is_timestamp(ITEM)    Boolean    Returns true for all records whose type is a timestamp.
Conversion Functions
Conversion functions allow you to construct new fields and convert the storage type of existing files. For example, you can form new strings by joining strings together or by taking strings apart. To join two strings, use the operator ><. For example, if the field Site has the value "BRAMLEY", then "xx" >< Site returns "xxBRAMLEY". The result of >< is always a string, even if the arguments are not strings. Thus, if field V1 is 3 and field V2 is 5, then V1 >< V2 returns "35" (a string, not a number).
Conversion functions (and any other functions that require a specific type of input, such as a date or time value) depend on the current formats specified in the Stream Options dialog box. For example, if you want to convert a string field with values Jan 2003, Feb 2003, and so on, select the matching date format MON YYYY as the default date format for the stream. For more information, see Setting Options for Streams in Chapter 5 on p. 57.
Function    Result    Description
ITEM1 >< ITEM2    String    Concatenates values for two fields and returns the resulting string as ITEM1ITEM2.
to_integer(ITEM)    Integer    Converts the storage of the specified field to an integer.
to_real(ITEM)    Real    Converts the storage of the specified field to a real.
to_number(ITEM)    Number    Converts the storage of the specified field to a number.
to_string(ITEM)    String    Converts the storage of the specified field to a string.
to_time(ITEM)    Time    Converts the storage of the specified field to a time.
to_date(ITEM)    Date    Converts the storage of the specified field to a date.
to_timestamp(ITEM)    Timestamp    Converts the storage of the specified field to a timestamp.
to_datetime(ITEM)    Datetime    Converts the storage of the specified field to a date, time, or timestamp value.
datetime_date(ITEM)    Date    Returns the date value for a number, string, or timestamp. Note this is the only function that allows you to convert a number (in seconds) back to a date. If ITEM is a string, creates a date by parsing a string in the current date format. The date format specified in the stream properties dialog box must be correct for this function to be successful. If ITEM is a number, it is interpreted as a number of seconds since the base date (or epoch). Fractions of a day are truncated. If ITEM is a timestamp, the date part of the timestamp is returned. If ITEM is a date, it is returned unchanged.
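Two behaviors described in this section, string concatenation with >< and the numeric case of datetime_date, can be approximated in Python. This is an emulation under stated assumptions: the baseline is the default January 1, 1900 mentioned in the Dates section, only non-negative second counts are considered, and the helper names are hypothetical.

```python
from datetime import date, timedelta

BASELINE = date(1900, 1, 1)  # default baseline date; configurable in the stream properties

def concat(item1, item2) -> str:
    """Emulate ><: the result is always a string, even for numeric arguments."""
    return f"{item1}{item2}"

def date_from_seconds(seconds: float, baseline: date = BASELINE) -> date:
    """Interpret seconds as an offset from the baseline; fractions of a day are dropped."""
    whole_days = int(seconds // 86400)
    return baseline + timedelta(days=whole_days)
```

For example, concat(3, 5) gives the string "35", and date_from_seconds(129600), which is one and a half days, gives January 2, 1900.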
Comparison Functions
Comparison functions are used to compare field values to each other or to a specified string. For example, you can check strings for equality using =. An example of string equality verification is: Class = "class 1".
For purposes of numeric comparison, greater means closer to positive infinity, and lesser means closer to negative infinity. That is, all negative numbers are less than any positive number.
Function    Result    Description
count_equal(ITEM1, LIST)    Integer    Returns the number of values from a list of fields that are equal to ITEM1 or null if ITEM1 is null. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
count_greater_than(ITEM1, LIST)    Integer    Returns the number of values from a list of fields that are greater than ITEM1 or null if ITEM1 is null.
count_less_than(ITEM1, LIST)    Integer    Returns the number of values from a list of fields that are less than ITEM1 or null if ITEM1 is null.
count_not_equal(ITEM1, LIST)    Integer    Returns the number of values from a list of fields that are not equal to ITEM1 or null if ITEM1 is null.
count_nulls(LIST)    Integer    Returns the number of null values from a list of fields.
date_before(DATE1, DATE2)    Boolean    Used to check the ordering of date values. Returns a true value if DATE1 is before DATE2.
first_index(ITEM, LIST)    Integer    Returns the index of the first field containing ITEM from a LIST of fields or 0 if the value is not found. Supported for string, integer, and real types only. For more information, see Working with Multiple-Response Data in Chapter 7 on p. 94.
first_non_null(LIST)    Any    Returns the first non-null value in the supplied list of fields. All storage types supported.
first_non_null_index(LIST)    Integer    Returns the index of the first field in the specified LIST containing a non-null value or 0 if all values are null. All storage types are supported.
ITEM1 = ITEM2    Boolean    Returns true for records where ITEM1 is equal to ITEM2.
ITEM1 /= ITEM2    Boolean    Returns true if the two strings are not identical or 0 if they are identical.
ITEM1 < ITEM2    Boolean    Returns true for records where ITEM1 is less than ITEM2.
ITEM1 <= ITEM2    Boolean    Returns true for records where ITEM1 is less than or equal to ITEM2.
ITEM1 > ITEM2    Boolean    Returns true for records where ITEM1 is greater than ITEM2.
ITEM1 >= ITEM2    Boolean    Returns true for records where ITEM1 is greater than or equal to ITEM2.
last_index(ITEM, LIST)    Integer    Returns the index of the last field containing ITEM from a LIST of fields or 0 if the value is not found. Supported for string, integer, and real types only. For more information, see Working with Multiple-Response Data in Chapter 7 on p. 94.
last_non_null(LIST)    Any    Returns the last non-null value in the supplied list of fields. All storage types supported.
last_non_null_index(LIST)    Integer    Returns the index of the last field in the specified LIST containing a non-null value or 0 if all values are null. All storage types are supported.
max(ITEM1, ITEM2)    Any    Returns the greater of the two items: ITEM1 or ITEM2.
Function    Result    Description
max_index(LIST)    Integer    Returns the index of the field containing the maximum value from a list of numeric fields or 0 if all values are null. For example, if the third field listed contains the maximum, the index value 3 is returned. If multiple fields contain the maximum value, the one listed first (leftmost) is returned. For more information, see Working with Multiple-Response Data in Chapter 7 on p. 94.
max_n(LIST)    Number    Returns the maximum value from a list of numeric fields or null if all of the field values are null. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
member(ITEM, LIST)    Boolean    Returns true if ITEM is a member of the specified LIST. Otherwise, a false value is returned. A list of field names can also be specified. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
min(ITEM1, ITEM2)    Any    Returns the lesser of the two items: ITEM1 or ITEM2.
min_index(LIST)    Integer    Returns the index of the field containing the minimum value from a list of numeric fields or 0 if all values are null. For example, if the third field listed contains the minimum, the index value 3 is returned. If multiple fields contain the minimum value, the one listed first (leftmost) is returned. For more information, see Working with Multiple-Response Data in Chapter 7 on p. 94.
min_n(LIST)    Number    Returns the minimum value from a list of numeric fields or null if all of the field values are null.
time_before(TIME1, TIME2)    Boolean    Used to check the ordering of time values. Returns a true value if TIME1 is before TIME2.
value_at(INT, LIST)    Any    Returns the value of each listed field at offset INT or NULL if the offset is outside the range of valid values (that is, less than 1 or greater than the number of listed fields). All storage types supported.
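The index conventions used by the list-based comparison functions (1-based positions, 0 for "not found", leftmost match on ties) can be sketched in Python. The lists below stand in for the values of several fields in one record, and None stands in for null; these are emulations of the descriptions above, not the Clementine implementations.

```python
from typing import Any, Optional, Sequence

def first_index(item: Any, values: Sequence[Any]) -> int:
    """1-based index of the first value equal to item, or 0 if none matches."""
    for position, value in enumerate(values, start=1):
        if value == item:
            return position
    return 0

def first_non_null(values: Sequence[Any]) -> Optional[Any]:
    """First value that is not None, or None if every value is None."""
    return next((value for value in values if value is not None), None)

def max_index(values: Sequence[Optional[float]]) -> int:
    """1-based index of the maximum, leftmost on ties, or 0 if all values are None."""
    best = 0
    for position, value in enumerate(values, start=1):
        if value is not None and (best == 0 or value > values[best - 1]):
            best = position
    return best
```

For example, max_index([1, 9, None, 9]) returns 2 because the leftmost of the two maxima wins, and first_index("z", ["a", "b"]) returns 0.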
Logical Functions
CLEM expressions can be used to perform logical operations.
Function    Result    Description
COND1 and COND2    Boolean    This operation is a logical conjunction and returns a true value if both COND1 and COND2 are true. If COND1 is false, then COND2 is not evaluated; this makes it possible to have conjunctions where COND1 first tests that an operation in COND2 is legal. For example, length(Label) >=6 and Label(6) = 'x'.
COND1 or COND2    Boolean    This operation is a logical (inclusive) disjunction and returns a true value if either COND1 or COND2 is true or if both are true. If COND1 is true, COND2 is not evaluated.
not(COND)    Boolean    This operation is a logical negation and returns a true value if COND is false. Otherwise, this operation returns a value of 0.
Function    Result    Description
if COND then EXPR1 else EXPR2 endif    Any    This operation is a conditional evaluation. If COND is true, this operation returns the result of EXPR1. Otherwise, the result of evaluating EXPR2 is returned.
if COND1 then EXPR1 elseif COND2 then EXPR2 else EXPR_N endif    Any    This operation is a multibranch conditional evaluation. If COND1 is true, this operation returns the result of EXPR1. Otherwise, if COND2 is true, this operation returns the result of evaluating EXPR2. Otherwise, the result of evaluating EXPR_N is returned.
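The short-circuit behavior described for conjunctions, where the first condition guards an operation in the second, works the same way in Python, so it can be illustrated directly. The function name is hypothetical; the test mirrors the example of checking the length of a label before looking at its sixth character.

```python
def sixth_char_is_x(label: str) -> bool:
    """Check the sixth character only when the label is long enough."""
    # The right-hand side is skipped when the length test fails,
    # so short labels never trigger an out-of-range access.
    return len(label) >= 6 and label[5] == "x"
```

Calling sixth_char_is_x("abc") returns False without raising an error, because the second condition is never evaluated.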
Numeric Functions
CLEM contains a number of commonly used numeric functions.
Function    Result    Description
-NUM    Number    Used to negate NUM. Returns the corresponding number with the opposite sign.
NUM1 + NUM2    Number    Returns the sum of NUM1 and NUM2.
NUM1 - NUM2    Number    Returns the value of NUM2 subtracted from NUM1.
NUM1 * NUM2    Number    Returns the value of NUM1 multiplied by NUM2.
NUM1 / NUM2    Number    Returns the value of NUM1 divided by NUM2.
INT1 div INT2    Number    Used to perform integer division. Returns the value of INT1 divided by INT2.
INT1 rem INT2    Number    Returns the remainder of INT1 divided by INT2. For example, INT1 - (INT1 div INT2) * INT2.
INT1 mod INT2    Number    This function has been deprecated. It is recommended that the rem function be used instead.
BASE ** POWER    Number    Returns BASE raised to the power POWER, where either may be any number (except that BASE must not be zero if POWER is zero of any type other than integer 0). If POWER is an integer, the computation is performed by successively multiplying powers of BASE. Thus, if BASE is an integer, the result will be an integer. If POWER is integer 0, the result is always a 1 of the same type as BASE. Otherwise, if POWER is not an integer, the result is computed as exp(POWER * log(BASE)).
abs(NUM)    Number    Returns the absolute value of NUM, which is always a number of the same type.
exp(NUM)    Real    Returns e raised to the power NUM, where e is the base of natural logarithms.
fracof(NUM)    Real    Returns the fractional part of NUM, defined as NUM - intof(NUM).
intof(NUM)    Integer    Truncates its argument to an integer. It returns the integer of the same sign as NUM and with the largest magnitude such that abs(INT) <= abs(NUM).
log(NUM)    Real    Returns the natural (base e) logarithm of NUM, which must not be a zero of any kind.
log10(NUM)    Real    Returns the base 10 logarithm of NUM, which must not be a zero of any kind. This function is defined as log(NUM) / log(10).
negate(NUM)    Number    Used to negate NUM. Returns the corresponding number with the opposite sign.
Function    Result    Description
round(NUM)    Integer    Used to round NUM to an integer by taking intof(NUM + 0.5) if NUM is positive or intof(NUM - 0.5) if NUM is negative.
sign(NUM)    Number    Used to determine the sign of NUM. This operation returns -1, 0, or 1 if NUM is an integer. If NUM is a real, it returns -1.0, 0.0, or 1.0, depending on whether NUM is negative, zero, or positive.
sqrt(NUM)    Real    Returns the square root of NUM. NUM must be positive.
sum_n(LIST)    Number    Returns the sum of values from a list of numeric fields or null if all of the field values are null. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
mean_n(LIST)    Number    Returns the mean value from a list of numeric fields or null if all of the field values are null.
sdev_n(LIST)    Number    Returns the standard deviation from a list of numeric fields or null if all of the field values are null.
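The truncation, fractional-part, rounding, and sign rules described above can be written out in Python to show how they differ from Python's built-in round, which rounds halves to the nearest even integer. The function names other than intof are chosen for this sketch, and the handling of exactly zero in clem_round is an assumption because the description covers only positive and negative values.

```python
import math

def intof(num: float) -> int:
    """Truncate toward zero."""
    return math.trunc(num)

def fractional_part(num: float) -> float:
    """Fractional part, defined as NUM - intof(NUM)."""
    return num - intof(num)

def clem_round(num: float) -> int:
    """Round half away from zero, following the intof-based rule."""
    return intof(num + 0.5) if num > 0 else intof(num - 0.5)

def sign_of(num):
    """Return -1, 0, or 1, as a real when the input is a real."""
    result = (num > 0) - (num < 0)
    return float(result) if isinstance(num, float) else result
```

For example, clem_round(2.5) is 3 and clem_round(-2.5) is -3, whereas Python's round(2.5) is 2.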
Trigonometric Functions
All of the functions in this section either take an angle as an argument or return one as a result. In both cases, the units of the angle (radians or degrees) are controlled by the setting of the relevant stream option.
Function    Result    Description
arccos(NUM)    Real    Computes the arccosine of the specified angle.
arccosh(NUM)    Real    Computes the hyperbolic arccosine of the specified angle.
arcsin(NUM)    Real    Computes the arcsine of the specified angle.
arcsinh(NUM)    Real    Computes the hyperbolic arcsine of the specified angle.
arctan(NUM)    Real    Computes the arctangent of the specified angle.
arctan2(NUM_X, NUM_Y)    Real    Computes the arctangent of NUM_Y / NUM_X and uses the signs of the two numbers to derive quadrant information. The result is a real in the range -pi < ANGLE <= pi (radians) or -180 < ANGLE <= 180 (degrees).
arctanh(NUM)    Real    Computes the hyperbolic arctangent of the specified angle.
cos(NUM)    Real    Computes the cosine of the specified angle.
cosh(NUM)    Real    Computes the hyperbolic cosine of the specified angle.
pi    Real    This constant is the best real approximation to pi.
sin(NUM)    Real    Computes the sine of the specified angle.
sinh(NUM)    Real    Computes the hyperbolic sine of the specified angle.
tan(NUM)    Real    Computes the tangent of the specified angle.
tanh(NUM)    Real    Computes the hyperbolic tangent of the specified angle.
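The quadrant-aware arctangent deserves a small illustration because its argument order, X first and then Y, is the reverse of the atan2 function found in many languages. The following Python sketch wraps math.atan2 accordingly; the degrees flag stands in for the stream option that selects the angle unit and is an assumption of this example.

```python
import math

def arctan2(num_x: float, num_y: float, degrees: bool = False) -> float:
    """Arctangent of num_y / num_x using the signs of both to pick the quadrant."""
    angle = math.atan2(num_y, num_x)  # note: math.atan2 takes y first
    return math.degrees(angle) if degrees else angle
```

For example, arctan2(-1, 0) returns pi, and arctan2(-1, -1, degrees=True) returns approximately -135.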
Probability Functions
Probability functions return probabilities based on various distributions, such as the probability that a value from Student's t distribution will be less than a specific value.
Function    Result    Description
cdf_chisq(NUM, DF)    Real    Returns the probability that a value from the chi-square distribution with the specified degrees of freedom will be less than the specified number.
cdf_f(NUM, DF1, DF2)    Real    Returns the probability that a value from the F distribution, with degrees of freedom DF1 and DF2, will be less than the specified number.
cdf_normal(NUM, MEAN, STDDEV)    Real    Returns the probability that a value from the normal distribution with the specified mean and standard deviation will be less than the specified number.
cdf_t(NUM, DF)    Real    Returns the probability that a value from Student's t distribution with the specified degrees of freedom will be less than the specified number.
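To make the meaning of "probability that a value will be less than the specified number" concrete, the normal-distribution case can be computed with the error function from Python's standard library. This is a standard textbook formula rather than Clementine's implementation, and the function name normal_cdf is hypothetical.

```python
import math

def normal_cdf(num: float, mean: float = 0.0, stddev: float = 1.0) -> float:
    """Probability that a normal(mean, stddev) value is less than num."""
    z = (num - mean) / (stddev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))
```

For example, normal_cdf(10, 10, 2) is 0.5, because half of the distribution lies below its mean.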
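Bitwise Integer Operations
Function    Result    Description
~~ INT1    Integer    Produces the bitwise complement of INT1.
INT1 || INT2    Integer    The result is the bitwise inclusive or of INT1 and INT2.
INT1 ||/& INT2    Integer    The result is the bitwise exclusive or of INT1 and INT2.
INT1 && INT2    Integer    The result is the bitwise and of the integers INT1 and INT2.
INT1 &&~~ INT2    Integer    The result is the bitwise and of INT1 and the bitwise complement of INT2.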
Function    Result    Description
INT << N    Integer    Produces the bit pattern of INT shifted left by N positions. A negative value for N produces a right shift.
INT >> N    Integer    Produces the bit pattern of INT shifted right by N positions. A negative value for N produces a left shift.
INT1 &&=_0 INT2    Boolean    Equivalent to the Boolean expression INT1 && INT2 == 0 but is more efficient.
INT1 &&/=_0 INT2    Boolean    Equivalent to the Boolean expression INT1 && INT2 /== 0 but is more efficient.
integer_bitcount(INT)    Integer    Counts the number of 1 or 0 bits in the two's-complement representation of INT. If INT is non-negative, N is the number of 1 bits. If INT is negative, it is the number of 0 bits. Owing to the sign extension, there are an infinite number of 0 bits in a non-negative integer or 1 bits in a negative integer. It is always the case that integer_bitcount(INT) = integer_bitcount(-(INT+1)).
integer_leastbit(INT)    Integer    Returns the bit position N of the least-significant bit set in the integer INT. N is the highest power of 2 by which INT divides exactly.
integer_length(INT)    Integer    Returns the length in bits of INT as a two's-complement integer. That is, N is the smallest integer such that INT < (1 << N) if INT >= 0, or INT >= (-1 << N) if INT < 0. If INT is non-negative, then the representation of INT as an unsigned integer requires a field of at least N bits. Alternatively, a minimum of N+1 bits is required to represent INT as a signed integer, regardless of its sign.
testbit(INT, N)    Boolean    Tests the bit at position N in the integer INT and returns the state of bit N as a Boolean value, which is true for 1 and false for 0.
Random Functions
The following functions are used to randomly select items or randomly generate numbers.
Function    Result    Description
oneof(LIST)    Any    Returns a randomly chosen element of LIST. List items should be entered as [ITEM1,ITEM2,...,ITEM_N]. Note that a list of field names can also be specified. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
random(NUM)    Number    Returns a uniformly distributed random number of the same type (INT or REAL), starting from 1 to NUM. If you use an integer, then only integers are returned. If you use a real (decimal) number, then real numbers are returned (decimal precision determined by the stream options). The largest random number returned by the function could equal NUM.
random0(NUM)    Number    This has the same properties as random(NUM), but starting from 0. The largest random number returned by the function will never equal NUM.
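The difference between the two range conventions, 1 to NUM inclusive versus 0 up to but not including NUM, can be shown for the integer case with Python's random module. The function names are hypothetical, and only integer arguments are covered; the real-valued case is not sketched here.

```python
import random

def clem_random(num: int, rng: random.Random) -> int:
    """Integer from 1 to num, where num itself can be returned."""
    return rng.randint(1, num)

def clem_random0(num: int, rng: random.Random) -> int:
    """Integer from 0 up to num - 1, so num itself is never returned."""
    return rng.randrange(0, num)
```

With a six-sided range, clem_random(6, rng) yields values from 1 through 6, while clem_random0(6, rng) yields values from 0 through 5.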
String Functions
In CLEM, you can perform the following operations with strings:
Compare strings
Create strings
Access characters
In CLEM, a string is any sequence of characters between matching double quotation marks ("string quotes"). Characters (CHAR) can be any single alphanumeric character. They are declared in CLEM expressions using single backquotes in the form of `<character>`, such as `z`, `A`, or `2`. Characters that are out-of-bounds or negative indices to a string will result in a null value.
Function    Result    Description
allbutfirst(N, STRING)    String    Returns a string, which is STRING with the first N characters removed.
allbutlast(N, STRING)    String    Returns a string, which is STRING with the last N characters removed.
alphabefore(STRING1, STRING2)    Boolean    Used to check the alphabetical ordering of strings. Returns true if STRING1 precedes STRING2.
endstring(LENGTH, STRING)    String    Extracts the last N characters from the specified string. If the string length is less than or equal to the specified length, then it is unchanged.
hasendstring(STRING, SUBSTRING)    Integer    This function is the same as isendstring(SUB_STRING, STRING).
hasmidstring(STRING, SUBSTRING)    Integer    This function is the same as ismidstring(SUB_STRING, STRING) (embedded substring).
hasstartstring(STRING, SUBSTRING)    Integer    This function is the same as isstartstring(SUB_STRING, STRING).
hassubstring(STRING, N, SUBSTRING)    Integer    This function is the same as issubstring(SUB_STRING, N, STRING), where N defaults to 1.
count_substring(STRING, SUBSTRING)    Integer    Returns the number of times the specified substring occurs within the string. For example, count_substring("foooo.txt", "oo") returns 3.
hassubstring(STRING, SUBSTRING)    Integer    This function is the same as issubstring(SUB_STRING, 1, STRING), where N defaults to 1.
isalphacode(CHAR)    Boolean    Returns a value of true if CHAR is a character in the specified string (often a field name) whose character code is a letter. Otherwise, this function returns a value of 0. For example, isalphacode(produce_num(1)).
isendstring(SUBSTRING, STRING)    Integer    If the string STRING ends with the substring SUB_STRING, then this function returns the integer subscript of SUB_STRING in STRING. Otherwise, this function returns a value of 0.
islowercode(CHAR) | Boolean | Returns a value of true if CHAR is a lowercase letter character for the specified string (often a field name). Otherwise, this function returns a value of 0. For example, both islowercode(``) and islowercode(country_name(2)) are valid expressions.
ismidstring(SUBSTRING, STRING) | Integer | If SUBSTRING is a substring of STRING but does not start on the first character of STRING or end on the last, then this function returns the subscript at which the substring starts. Otherwise, this function returns a value of 0.
isnumbercode(CHAR) | Boolean | Returns a value of true if CHAR for the specified string (often a field name) is a character whose character code is a digit. Otherwise, this function returns a value of 0. For example, isnumbercode(product_id(2)).
isstartstring(SUBSTRING, STRING) | Integer | If the string STRING starts with the substring SUBSTRING, then this function returns the subscript 1. Otherwise, this function returns a value of 0.
issubstring(SUBSTRING, N, STRING) | Integer | Searches the string STRING, starting from its Nth character, for a substring equal to the string SUBSTRING. If found, this function returns the integer subscript at which the matching substring begins. Otherwise, this function returns a value of 0. If N is not given, this function defaults to 1.
issubstring(SUBSTRING, STRING) | Integer | Searches the string STRING, starting from its Nth character, for a substring equal to the string SUBSTRING. If found, this function returns the integer subscript at which the matching substring begins. Otherwise, this function returns a value of 0. If N is not given, this function defaults to 1.
issubstring_count(SUBSTRING, N, STRING) | Integer | Returns the index of the Nth occurrence of SUBSTRING within the specified STRING. If there are fewer than N occurrences of SUBSTRING, 0 is returned.
issubstring_lim(SUBSTRING, N, STARTLIM, ENDLIM, STRING) | Integer | This function is the same as issubstring, but the match is constrained to start on or before the subscript STARTLIM and to end on or before the subscript ENDLIM. The STARTLIM or ENDLIM constraints may be disabled by supplying a value of false for either argument. For example, issubstring_lim(SUBSTRING, N, false, false, STRING) is the same as issubstring.
isuppercode(CHAR) | Boolean | Returns a value of true if CHAR is an uppercase letter character. Otherwise, this function returns a value of 0. For example, both isuppercode(``) and isuppercode(country_name(2)) are valid expressions.
last(CHAR) | String | Returns the last character CHAR of STRING (which must be at least one character long).
length(STRING) | Integer | Returns the length of the string STRING, that is, the number of characters in it.
locchar(CHAR, N, STRING) | Integer | Used to identify the location of characters in symbolic fields. The function searches the string STRING for the character CHAR, starting the search at the Nth character of STRING. This function returns a value indicating the location (starting at N) where the character is found. If the character is not found, this function returns a value of 0. If the function has an invalid offset (N) (for example, an offset that is beyond the length of the string), this function returns $null$. For example, locchar(`n`, 2, web_page) searches the field called web_page for the `n` character beginning at the second character in the field value. Note: Be sure to use single backquotes to encapsulate the specified character.
locchar_back(CHAR, N, STRING) | Integer | Similar to locchar, except that the search is performed backward starting from the Nth character. For example, locchar_back(`n`, 9, web_page) searches the field web_page starting from the ninth character and moving backward toward the start of the string. If the function has an invalid offset (for example, an offset that is beyond the length of the string), this function returns $null$. Ideally, you should use locchar_back in conjunction with the function length(<field>) to dynamically use the length of the current value of the field. For example, locchar_back(`n`, (length(web_page)), web_page).
lowertoupper(CHAR), lowertoupper(STRING) | CHAR or String | Input can be either a string or character, which is used in this function to return a new item of the same type, with any lowercase characters converted to their uppercase equivalents. For example, lowertoupper(`a`), lowertoupper("My string"), and lowertoupper(field_name(2)) are all valid expressions.
matches | Boolean | Returns true if a string matches a specified pattern, for example, company_name matches "SPSS". The pattern must be a string literal; it cannot be a field name containing a pattern. A question mark (?) can be included in the pattern to match exactly one character; an asterisk (*) matches zero or more characters. To match a literal question mark or asterisk (rather than using these as wildcards), a backslash (\) can be used as an escape character.
replace(SUBSTRING, NEWSUBSTRING, STRING) | String | Within the specified STRING, replace all instances of SUBSTRING with NEWSUBSTRING.
replicate(COUNT, STRING) | String | Returns a string that consists of the original string copied the specified number of times.
stripchar(CHAR, STRING) | String | Enables you to remove specified characters from a string or field. You can use this function, for example, to remove extra symbols, such as currency notations, from data to achieve a simple number or name. For example, using the syntax stripchar(`$`, 'Cost') returns a new field with the dollar sign removed from all values. Note: Be sure to use single backquotes to encapsulate the specified character.
skipchar(CHAR, N, STRING) | Integer | Searches the string STRING for any character other than CHAR, starting at the Nth character. This function returns an integer subscript indicating the point at which one is found or 0 if every character from the Nth onward is a CHAR. If the function has an invalid offset (for example, an offset that is beyond the length of the string), this function returns $null$. locchar is often used in conjunction with the skipchar functions to determine the value of N (the point at which to start searching the string). For example, skipchar(`s`, (locchar(`s`, 1, "MyString")), "MyString").
skipchar_back(CHAR, N, STRING) | Integer | Similar to skipchar, except that the search is performed backward, starting from the Nth character.
startstring(LENGTH, STRING) | String | Extracts the first LENGTH characters from the specified string. If the string length is less than or equal to the specified length, then it is unchanged.
strmember(CHAR, STRING) | Integer | Equivalent to locchar(CHAR, 1, STRING). It returns an integer subscript indicating the point at which CHAR first occurs or 0. If the function has an invalid offset (for example, an offset that is beyond the length of the string), this function returns $null$.
subscrs(N, STRING) | CHAR | Returns the Nth character CHAR of the input string STRING. This function can also be written in a shorthand form as STRING(N). For example, lowertoupper(name(1)) is a valid expression.
substring(N, LEN, STRING) | String | Returns a string SUBSTRING, which consists of the LEN characters of the string STRING, starting from the character at subscript N.
substring_between(N1, N2, STRING) | String | Returns the substring of STRING, which begins at subscript N1 and ends at subscript N2.
trim(STRING) | String | Removes leading and trailing whitespace characters from the specified string.
trim_start(STRING) | String | Removes leading whitespace characters from the specified string.
trimend(STRING) | String | Removes trailing whitespace characters from the specified string.
unicode_char(NUM) | CHAR | Returns the character with Unicode value NUM.
unicode_value(CHAR) | NUM | Returns the Unicode value of CHAR.
uppertolower(CHAR), uppertolower(STRING) | CHAR or String | Input can be either a string or character and is used in this function to return a new item of the same type with any uppercase characters converted to their lowercase equivalents. Note: Remember to specify strings with double quotes and characters with single backquotes. Simple field names should appear without quotes.
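The subscript conventions used by the search functions above (1-based positions, 0 for "not found", $null$ for an invalid offset) and the wildcard rules for matches can be illustrated outside CLEM. The following is a minimal Python sketch, not CLEM code. It assumes that locchar returns a position counted from the start of the string, that matches compares the whole value case-sensitively, and that Python's None stands in for $null$. The function names mirror the CLEM ones only for readability, and clem_matches is a hypothetical helper name.

```python
import re

def issubstring(sub, string, n=1):
    """1-based subscript of the first match at or after position n, or 0."""
    # str.find returns -1 when nothing is found, so adding 1 gives 0.
    return string.find(sub, n - 1) + 1

def count_substring(string, sub):
    """Count occurrences, including overlapping ones."""
    count, start = 0, string.find(sub)
    while start != -1:
        count += 1
        start = string.find(sub, start + 1)  # step by one to allow overlaps
    return count

def locchar(char, n, string):
    """1-based position of char at or after position n, 0 if absent."""
    if n < 1 or n > len(string):
        return None  # stands in for $null$ on an invalid offset
    return string.find(char, n - 1) + 1

def clem_matches(value, pattern):
    """Wildcard match: ? is one character, * is zero or more, \\ escapes."""
    parts, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            i += 1
            parts.append(re.escape(pattern[i]))  # escaped literal
        elif ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

print(count_substring("foooo.txt", "oo"))   # overlapping matches are counted
print(locchar("n", 2, "index.html"))
print(clem_matches("SPSS Inc.", "SPSS*"))
```

Counting overlapping matches is what makes the documented count_substring("foooo.txt", "oo") example come out as 3; a non-overlapping count would give 2.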
SoundEx Functions
SoundEx is a method used to find strings when the sound is known but the precise spelling is not. Developed in 1918, it searches out words with similar sounds based on phonetic assumptions about how certain letters are pronounced. It can be used to search names in a database, for example, where spellings and pronunciations for similar names may vary. The basic SoundEx algorithm is documented in a number of sources and, despite known limitations (for example, leading letter combinations such as ph and f will not match even though they sound the same), is supported in some form by most databases.
Function | Result | Description
soundex(STRING) | Integer | Returns the four-character SoundEx code for the specified STRING.
soundex_difference(STRING1, STRING2) | Integer | Returns an integer between 0 and 4 that indicates the number of characters that are the same in the SoundEx encoding for the two strings, where 0 indicates no similarity and 4 indicates strong similarity or identical strings.
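To make the four-character code and the 0 to 4 similarity score concrete, here is a Python sketch of the commonly published American SoundEx rules. It is an assumption that CLEM follows exactly this variant, and the position-by-position comparison used for the difference is only one plausible reading of "the number of characters that are the same". The sketch returns the code as a string.

```python
# Letter groups of the commonly published SoundEx scheme.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}

def soundex(name):
    """Return a four-character code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":      # H and W do not separate repeated codes
            prev = digit       # vowels reset prev, so repeats are kept
    return (code + "000")[:4]  # pad or truncate to four characters

def soundex_difference(a, b):
    """Number of positions (0 to 4) at which the two codes agree."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))

print(soundex("Robert"), soundex("Rupert"))
print(soundex_difference("Fisher", "Phisher"))
```

Under these rules "Fisher" and "Phisher" share their three digits but not their first letter, which is the ph/f limitation mentioned above.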
Note: Date and time functions cannot be called from scripts. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
Function | Result | Description
@TODAY | String | If you select Rollover days/mins in the stream properties dialog box, this function returns the current date as a string in the current date format. If you use a two-digit date format and do not select Rollover days/mins, this function returns $null$ on the current server. Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
to_time(ITEM) | Time | Converts the storage of the specified field to a time.
to_date(ITEM) | Date | Converts the storage of the specified field to a date.
to_timestamp(ITEM) | Timestamp | Converts the storage of the specified field to a timestamp.
to_datetime(ITEM) | Datetime | Converts the storage of the specified field to a date, time, or timestamp value.
datetime_date(ITEM) | Date | Returns the date value for a number, string, or timestamp. Note this is the only function that allows you to convert a number (in seconds) back to a date. If ITEM is a string, creates a date by parsing a string in the current date format. The date format specified in the stream properties dialog box must be correct for this function to be successful. If ITEM is a number, it is interpreted as a number of seconds since the base date (or epoch). Fractions of a day are truncated. If ITEM is a timestamp, the date part of the timestamp is returned. If ITEM is a date, it is returned unchanged.
date_before(DATE1, DATE2) | Boolean | Returns a value of true if DATE1 represents a date before that represented by DATE2. Otherwise, this function returns a value of 0.
date_days_difference(DATE1, DATE2) | Integer | Returns the time in days from the date represented by DATE1 to the date represented by DATE2, as an integer. If DATE2 is before DATE1, this function returns a negative number.
date_in_days(DATE) | Integer | Returns the time in days from the baseline date to the date represented by DATE, as an integer. If DATE is before the baseline date, this function returns a negative number. You must include a valid date for the calculation to work appropriately. For example, you should not specify February 29, 2001, as the date. Because 2001 is not a leap year, this date does not exist.
date_in_months(DATE) | Real | Returns the time in months from the baseline date to the date represented by DATE, as a real number. This is an approximate figure based on a month of 30.0 days. If DATE is before the baseline date, this function returns a negative number. You must include a valid date for the calculation to work appropriately. For example, you should not specify February 29, 2001, as the date. Because 2001 is not a leap year, this date does not exist.
date_in_weeks(DATE) | Real | Returns the time in weeks from the baseline date to the date represented by DATE, as a real number. This is based on a week of 7.0 days. If DATE is before the baseline date, this function returns a negative number. You must include a valid date for the calculation to work appropriately. For example, you should not specify February 29, 2001, as the date. Because 2001 is not a leap year, this date does not exist.
date_in_years(DATE) | Real | Returns the time in years from the baseline date to the date represented by DATE, as a real number. This is an approximate figure based on a year of 365.0 days. If DATE is before the baseline date, this function returns a negative number. You must include a valid date for the calculation to work appropriately. For example, you should not specify February 29, 2001, as the date. Because 2001 is not a leap year, this date does not exist.
date_months_difference(DATE1, DATE2) | Real | Returns the time in months from DATE1 to DATE2, as a real number. This is an approximate figure based on a month of 30.0 days. If DATE2 is before DATE1, this function returns a negative number.
datetime_date(YEAR, MONTH, DAY) | Date | Creates a date value for the given YEAR, MONTH, and DAY. The arguments must be integers.
datetime_day(DATE) | Integer | Returns the day of the month from a given DATE or timestamp. The result is an integer in the range 1 to 31.
datetime_day_name(DAY) | String | Returns the full name of the given DAY. The argument must be an integer in the range 1 (Sunday) to 7 (Saturday).
datetime_hour(TIME) | Integer | Returns the hour from a TIME or timestamp. The result is an integer in the range 0 to 23.
datetime_in_seconds(DATETIME) | Real | Returns the number of seconds in a DATETIME.
datetime_minute(TIME) | Integer | Returns the minute from a TIME or timestamp. The result is an integer in the range 0 to 59.
datetime_month(DATE) | Integer | Returns the month from a DATE or timestamp. The result is an integer in the range 1 to 12.
datetime_month_name(MONTH) | String | Returns the full name of the given MONTH. The argument must be an integer in the range 1 to 12.
datetime_now | Timestamp | Returns the current time as a timestamp.
datetime_second(TIME) | Integer | Returns the second from a TIME or timestamp. The result is an integer in the range 0 to 59.
datetime_day_short_name(DAY) | String | Returns the abbreviated name of the given DAY. The argument must be an integer in the range 1 (Sunday) to 7 (Saturday).
datetime_month_short_name(MONTH) | String | Returns the abbreviated name of the given MONTH. The argument must be an integer in the range 1 to 12.
datetime_time(HOUR, MINUTE, SECOND) | Time | Returns the time value for the specified HOUR, MINUTE, and SECOND. The arguments must be integers.
datetime_time(ITEM) | Time | Returns the time value of the given ITEM.
datetime_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) | Timestamp | Returns the timestamp value for the given YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND.
datetime_timestamp(DATE, TIME) | Timestamp | Returns the timestamp value for the given DATE and TIME.
datetime_timestamp(NUMBER) | Timestamp | Returns the timestamp value of the given number of seconds.
datetime_weekday(DATE) | Integer | Returns the day of the week from the given DATE or timestamp.
datetime_year(DATE) | Integer | Returns the year from a DATE or timestamp. The result is an integer such as 2002.
date_weeks_difference(DATE1, DATE2) | Real | Returns the time in weeks from the date represented by DATE1 to the date represented by DATE2, as a real number. This is based on a week of 7.0 days. If DATE2 is before DATE1, this function returns a negative number.
date_years_difference(DATE1, DATE2) | Real | Returns the time in years from the date represented by DATE1 to the date represented by DATE2, as a real number. This is an approximate figure based on a year of 365.0 days. If DATE2 is before DATE1, this function returns a negative number.
time_before(TIME1, TIME2) | Boolean | Returns a value of true if TIME1 represents a time before that represented by TIME2. Otherwise, this function returns a value of 0.
time_hours_difference(TIME1, TIME2) | Real | Returns the time difference in hours between the times represented by TIME1 and TIME2, as a real number. If you select Rollover days/mins in the stream properties dialog box, a higher value of TIME1 is taken to refer to the previous day. If you do not select the rollover option, a higher value of TIME1 causes the returned value to be negative.
time_in_hours(TIME) | Real | Returns the time in hours represented by TIME, as a real number. For example, under time format HHMM, the expression time_in_hours('0130') evaluates to 1.5.
time_in_mins(TIME) | Real | Returns the time in minutes represented by TIME, as a real number.
time_in_secs(TIME) | Integer | Returns the time in seconds represented by TIME, as an integer.
time_mins_difference(TIME1, TIME2) | Real | Returns the time difference in minutes between the times represented by TIME1 and TIME2, as a real number. If you select Rollover days/mins in the stream properties dialog box, a higher value of TIME1 is taken to refer to the previous day (or the previous hour, if only minutes and seconds are specified in the current format). If you do not select the rollover option, a higher value of TIME1 causes the returned value to be negative.
time_secs_difference(TIME1, TIME2) | Integer | Returns the time difference in seconds between the times represented by TIME1 and TIME2, as an integer. If you select Rollover days/mins in the stream properties dialog box, a higher value of TIME1 is taken to refer to the previous day (or the previous hour, if only minutes and seconds are specified in the current format). If you do not select the rollover option, a higher value of TIME1 causes the returned value to be negative.
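The effect of the Rollover days/mins option on the time difference functions is easiest to see with numbers. The Python sketch below models the option as a hypothetical rollover argument and assumes a full hours, minutes, and seconds time format, so only the "previous day" case is covered (not the "previous hour" case for minute and second formats). It is an illustration of the documented rule, not the CLEM implementation.

```python
from datetime import time

def _seconds(t):
    """Seconds since midnight for a datetime.time value."""
    return t.hour * 3600 + t.minute * 60 + t.second

def time_secs_difference(t1, t2, rollover=False):
    """Seconds from t1 to t2; with rollover, a later t1 means 'yesterday'."""
    diff = _seconds(t2) - _seconds(t1)
    if rollover and diff < 0:
        diff += 24 * 3600  # treat t1 as belonging to the previous day
    return diff

def time_mins_difference(t1, t2, rollover=False):
    return time_secs_difference(t1, t2, rollover) / 60.0

def time_hours_difference(t1, t2, rollover=False):
    return time_secs_difference(t1, t2, rollover) / 3600.0

def time_in_hours(t):
    return _seconds(t) / 3600.0

late, early = time(23, 30), time(0, 15)
print(time_hours_difference(late, early))                 # negative without rollover
print(time_hours_difference(late, early, rollover=True))  # 45 minutes, as hours
print(time_in_hours(time(1, 30)))
```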
For this conversion to work, select the matching date format MON YYYY as the default date format for the stream. For more information, see Setting Options for Streams in Chapter 5 on p. 57. For an example that converts string values to dates using a Filler node, see the stream broadband_create_models.str, installed in the \Demo folder under the Classification_Module folder. For more information, see Forecasting with the Time Series Node in Chapter 16 in Clementine 12.0 Applications Guide.
Dates stored as numbers. Note that DATE in the above example is the name of a field, while to_date is a CLEM function. If you have dates stored as numbers, you can convert them using the datetime_date function, where the number is interpreted as a number of seconds since the base date (or epoch).
datetime_date(DATE)
By converting a date to a number of seconds (and back), you can perform calculations such as computing the current date plus or minus a fixed number of days, for example:
datetime_date((date_in_days(DATE)-7)*60*60*24)
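The round trip in the expression above (date to days, days to seconds, seconds back to a date) can be traced step by step in Python. This sketch assumes a baseline date of January 1, 1900, purely for illustration; in Clementine the baseline comes from the stream properties. The constant BASELINE and the helper week_earlier are hypothetical names introduced here.

```python
from datetime import date, timedelta

BASELINE = date(1900, 1, 1)   # assumed baseline date, for illustration only
SECONDS_PER_DAY = 60 * 60 * 24

def date_in_days(d):
    """Whole days from the baseline to d (negative before the baseline)."""
    return (d - BASELINE).days

def date_in_weeks(d):
    return date_in_days(d) / 7.0

def date_in_months(d):
    return date_in_days(d) / 30.0    # approximate: 30.0-day months

def date_in_years(d):
    return date_in_days(d) / 365.0   # approximate: 365.0-day years

def datetime_date(seconds):
    """Date for a number of seconds since the baseline, whole days only."""
    return BASELINE + timedelta(days=int(seconds // SECONDS_PER_DAY))

def week_earlier(d):
    # Same shape as datetime_date((date_in_days(DATE)-7)*60*60*24)
    return datetime_date((date_in_days(d) - 7) * SECONDS_PER_DAY)

print(week_earlier(date(2001, 3, 1)))
print(date_in_months(date(1900, 1, 31)))
```

Python's date type also rejects February 29, 2001, with a ValueError, which mirrors the warning above about supplying only valid dates.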
Sequence Functions
For some operations, the sequence of events is important. The application allows you to work with the following record sequences:
- Sequences and time series
- Sequence functions
- Record indexing
- Averaging, summing, and comparing values
- Monitoring change: differentiation
- @SINCE
- Offset values
- Additional sequence facilities
For many applications, each record passing through a stream can be considered as an individual case, independent of all others. In such situations, the order of records is usually unimportant. For some classes of problems, however, the record sequence is very important. These are typically time series situations, in which the sequence of records represents an ordered sequence of events or occurrences. Each record represents a snapshot at a particular instant in time; much of the richest information, however, might be contained not in instantaneous values but in the way in which such values are changing and behaving over time. Of course, the relevant parameter may be something other than time. For example, the records could represent analyses performed at distances along a line, but the same principles would apply.
Sequence and special functions are immediately recognizable by the following characteristics:
- They are all prefixed by @.
- Their names are given in upper case.
Sequence functions can refer to the record currently being processed by a node, the records that have already passed through a node, and even, in one case, records that have yet to pass through a node. Sequence functions can be mixed freely with other components of CLEM expressions, although some have restrictions on what can be used as their arguments.
Examples
You may find it useful to know how long it has been since a certain event occurred or a condition was true. Use the function @SINCE to do this, for example:
@SINCE(Income > Outgoings)
This function returns the offset of the last record where this condition was true, that is, the number of records before this one in which the condition was true. If the condition has never been true, @SINCE returns @INDEX + 1. Sometimes you may want to refer to a value of the current record in the expression used by @SINCE. You can do this using the function @THIS, which specifies that a field name always applies to the current record. To find the offset of the last record that had a Concentration field value more than twice that of the current record, you could use:
@SINCE(Concentration > 2 * @THIS(Concentration))
In some cases the condition given to @SINCE is true of the current record by definition, for example:
@SINCE(ID == @THIS(ID))
For this reason, @SINCE does not evaluate its condition for the current record. Use a similar function, @SINCE0, if you want to evaluate the condition for the current record as well as previous ones; if the condition is true in the current record, @SINCE0 returns 0. Note: @ functions cannot be called from scripts. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
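The difference between @SINCE and @SINCE0, and the role of @THIS, can be traced with a small Python model. This is a sketch under stated assumptions: an offset of 1 means the immediately preceding record, records are held in a list, and the hypothetical condition(past, current) callback plays the part of the CLEM expression, with current standing for whatever @THIS refers to.

```python
def since(records, condition, include_current=False):
    """Return the @SINCE value (or @SINCE0 if include_current) per record."""
    results = []
    for i, current in enumerate(records):   # i is 0-based, so @INDEX == i + 1
        start = 0 if include_current else 1
        for offset in range(start, i + 1):
            if condition(records[i - offset], current):
                results.append(offset)
                break
        else:
            results.append(i + 2)            # never true so far: @INDEX + 1
    return results

rows = [
    {"Income": 5, "Outgoings": 3},
    {"Income": 2, "Outgoings": 4},
    {"Income": 1, "Outgoings": 4},
    {"Income": 9, "Outgoings": 1},
]
in_credit = lambda past, current: past["Income"] > past["Outgoings"]
print(since(rows, in_credit))                        # like @SINCE(Income > Outgoings)
print(since(rows, in_credit, include_current=True))  # like @SINCE0(...)

readings = [{"Concentration": c} for c in (10, 3, 4, 1)]
doubled = lambda past, current: past["Concentration"] > 2 * current["Concentration"]
print(since(readings, doubled))  # like @SINCE(Concentration > 2 * @THIS(Concentration))
```

The first record has no predecessors, so the @SINCE model yields 2 for it (@INDEX + 1), while the @SINCE0 model yields 0 whenever the condition already holds for the current record.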
Function | Result | Description
@MEAN(FIELD) | Real | Returns the mean average of values for the specified FIELD or FIELDS.
@MEAN(FIELD, EXPR) | Real | Returns the mean average of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted or if it exceeds the number of records received so far, the average over all of the records received so far is returned. Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
@MEAN(FIELD, EXPR, INT) | Real | Returns the mean average of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted or if it exceeds the number of records received so far, the average over all of the records received so far is returned. INT specifies the maximum number of values to look back. This is far more efficient than using just two arguments.
@DIFF1(FIELD) | Real | Returns the first differential of FIELD. The single-argument form thus simply returns the difference between the current value and the previous value of the field. Returns 0 if the relevant previous records do not exist.
@DIFF1(FIELD1, FIELD2) | Real | The two-argument form gives the first differential of FIELD1 with respect to FIELD2. Returns 0 if the relevant previous records do not exist.
@DIFF2(FIELD) | Real | Returns the second differential of FIELD. The single-argument form thus simply returns the difference between the current value and the previous value of the field. Returns 0 if the relevant previous records do not exist.
@DIFF2(FIELD1, FIELD2) | Real | The two-argument form gives the second differential of FIELD1 with respect to FIELD2. Returns 0 if the relevant previous records do not exist.
@INDEX | Integer | Returns the index of the current record. Indices are allocated to records as they arrive at the current node. The first record is given index 1, and the index is incremented by 1 for each subsequent record.
@LAST_NON_BLANK(FIELD) | Any | Returns the last value for FIELD that was not blank, as defined in an upstream source or Type node. If there are no nonblank values for FIELD in the records read so far, $null$ is returned. Note that blank values, also called user-missing values, can be defined separately for each field.
@MAX(FIELD) | Number | Returns the maximum value for the specified FIELD.
@MAX(FIELD, EXPR) | Number | Returns the maximum value for FIELD over the last EXPR records received so far, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0.
@MAX(FIELD, EXPR, INT) | Number | Returns the maximum value for FIELD over the last EXPR records received so far, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the maximum value over all of the records received so far is returned. INT specifies the maximum number of values to look back. This is far more efficient than using just two arguments.
@MIN(FIELD) | Number | Returns the minimum value for the specified FIELD.
@MIN(FIELD, EXPR) | Number | Returns the minimum value for FIELD over the last EXPR records received so far, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0.
@MIN(FIELD, EXPR, INT) | Number | Returns the minimum value for FIELD over the last EXPR records received so far, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the minimum value over all of the records received so far is returned. INT specifies the maximum number of values to look back. This is far more efficient than using just two arguments.
@OFFSET(FIELD, EXPR) | Any | Returns the value of FIELD in the record offset from the current record by the value of EXPR. A positive offset refers to a record that has already passed, while a negative one specifies a lookahead to a record that has yet to arrive. For example, @OFFSET(Status, 1) returns the value of the Status field in the previous record, while @OFFSET(Status, -4) looks ahead four records in the sequence (that is, to records that have not yet passed through this node) to obtain the value. Note that a negative (look ahead) offset must be specified as a constant. For positive offsets only, EXPR may also be an arbitrary CLEM expression, which is evaluated for the current record to give the offset. In this case, the three-argument version of this function is recommended to improve performance (see below). If the expression returns anything other than a non-negative integer, this causes an error; that is, it is not legal to have calculated lookahead offsets.
@OFFSET(FIELD, EXPR, INT) | Any | Performs the same operation as the @OFFSET function with the addition of a third argument, INT, which specifies the maximum number of values to look back. In cases where the offset is computed from an expression, this third argument is recommended to improve performance. For example, in an expression such as @OFFSET(Foo, Month, 12), the system knows to keep only the last twelve values of Foo; otherwise, it has to store every value just in case. In cases where the offset value is a constant (including negative lookahead offsets, which must be constant), the third argument is pointless and the two-argument version of this function is recommended.
@SDEV(FIELD) | Real | Returns the standard deviation of values for the specified FIELD or FIELDS.
@SDEV(FIELD, EXPR) | Real | Returns the standard deviation of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the standard deviation over all of the records received so far is returned.
@SDEV(FIELD, EXPR, INT) | Real | Returns the standard deviation of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the standard deviation over all of the records received so far is returned. INT specifies the maximum number of values to look back. This is far more efficient than using just two arguments.
@SINCE(EXPR) | Any | Returns the number of records that have passed since EXPR, an arbitrary CLEM expression, was true.
@SINCE(EXPR, INT) | Any | Adding the second argument, INT, specifies the maximum number of records to look back. If EXPR has never been true, INT is @INDEX+1.
@SINCE0(EXPR) | Any | Considers the current record, while @SINCE does not; @SINCE0 returns 0 if EXPR is true for the current record.
@SINCE0(EXPR, INT) | Any | Adding the second argument, INT, specifies the maximum number of records to look back.
@SUM(FIELD) | Number | Returns the sum of values for the specified FIELD or FIELDS.
@SUM(FIELD, EXPR) | Number | Returns the sum of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the sum over all of the records received so far is returned.
@SUM(FIELD, EXPR, INT) | Number | Returns the sum of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the sum over all of the records received so far is returned. INT specifies the maximum number of values to look back. This is far more efficient than using just two arguments.
@THIS(FIELD) | Any | Returns the value of the field named FIELD in the current record. Used only in @SINCE expressions.
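The reason the three-argument forms are described as more efficient is that a bound on the lookback lets the system discard older values. The Python sketch below models that idea for a single field with a bounded buffer. The class SequenceWindow and its method names are hypothetical, the window size n corresponds to EXPR (assumed to be greater than 0), max_lookback corresponds to INT, and returning None for a missing earlier record is an assumption rather than documented @OFFSET behavior.

```python
from collections import deque

class SequenceWindow:
    """History of one field, limited to the last max_lookback values."""

    def __init__(self, max_lookback=None):
        # maxlen=None keeps every value, like the forms without INT.
        self.history = deque(maxlen=max_lookback)

    def push(self, value):
        """Record the field value of the record that has just arrived."""
        self.history.append(value)

    def last(self, n=None):
        """The last n retained values, or all of them if n is too large."""
        values = list(self.history)
        return values if n is None or n >= len(values) else values[-n:]

    def mean(self, n=None):      # in the spirit of @MEAN(FIELD, EXPR)
        values = self.last(n)
        return sum(values) / len(values)

    def total(self, n=None):     # in the spirit of @SUM(FIELD, EXPR)
        return sum(self.last(n))

    def maximum(self, n=None):   # in the spirit of @MAX(FIELD, EXPR)
        return max(self.last(n))

    def diff1(self):             # in the spirit of @DIFF1(FIELD)
        if len(self.history) < 2:
            return 0             # no previous record yet
        return self.history[-1] - self.history[-2]

    def offset(self, k):         # positive offsets only; 0 is the current record
        return self.history[-1 - k] if k < len(self.history) else None

window = SequenceWindow(max_lookback=3)
for value in (10, 20, 40, 80):
    window.push(value)
print(window.mean(2), window.diff1(), window.offset(1))
```

With max_lookback=3 only the values 20, 40, and 80 are retained after the fourth record, so an offset of 3 is no longer available.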
Global Functions
The functions @MEAN, @SUM, @MIN, @MAX, and @SDEV work on, at most, all of the records read up to and including the current one. In some cases, however, it is useful to be able to work out how values in the current record compare with values seen in the entire dataset. Using a Set Globals node to generate values across the entire dataset, you can access these values in a CLEM expression using the global functions. For example,
@GLOBAL_MAX(Age)
returns the highest value of Age in the dataset, while the expression
(Value - @GLOBAL_MEAN(Value)) / @GLOBAL_SDEV(Value)
expresses the difference between this record's Value and the global mean as a number of standard deviations. You can use global values only after they have been calculated by a Set Globals node. All current global values can be canceled by clicking the Clear Global Values button on the Globals tab in the stream properties dialog box. Note: @ functions cannot be called from scripts. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
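The two-step pattern described here (compute the globals over the whole dataset first, then use them per record) can be sketched in Python as follows. The helper names set_globals and z_scores are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption; this guide does not state which formula @GLOBAL_SDEV uses.

```python
import statistics

def set_globals(records, field):
    """First pass: compute dataset-wide values, as a Set Globals node would."""
    values = [r[field] for r in records]
    return {
        ("MAX", field): max(values),
        ("MIN", field): min(values),
        ("SUM", field): sum(values),
        ("MEAN", field): statistics.mean(values),
        ("SDEV", field): statistics.stdev(values),  # assumption: sample SD
    }

def z_scores(records, field, global_values):
    """Second pass: (Value - global mean) / global SD for each record."""
    # A missing key raises KeyError, comparable to the error that occurs
    # when the corresponding global value has not been set.
    mean = global_values[("MEAN", field)]
    sdev = global_values[("SDEV", field)]
    return [(r[field] - mean) / sdev for r in records]

data = [{"Value": 1}, {"Value": 2}, {"Value": 3}]
global_values = set_globals(data, "Value")
print(z_scores(data, "Value", global_values))
```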
@GLOBAL_MAX(FIELD). Result: Number. Returns the maximum value for FIELD over the whole dataset, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs. Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
@GLOBAL_MIN(FIELD). Result: Number. Returns the minimum value for FIELD over the whole dataset, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs.
@GLOBAL_SDEV(FIELD). Result: Number. Returns the standard deviation of values for FIELD over the whole dataset, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs.
@GLOBAL_MEAN(FIELD). Result: Number. Returns the mean average of values for FIELD over the whole dataset, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs.
@GLOBAL_SUM(FIELD). Result: Number. Returns the sum of values for FIELD over the whole dataset, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs.
@BLANK(FIELD). Result: Boolean.
@LAST_NON_BLANK(FIELD). Result: Any.
@NULL(FIELD). Result: Boolean.
undef. Result: Any.
Blank fields may be filled in with the Filler node. In both Filler and Derive nodes (multiple mode only), the special CLEM function @FIELD refers to the current field(s) being examined.
Special Fields
Special functions are used to denote the specific fields under examination, or to generate a list of fields as input. For example, when deriving multiple fields at once, you should use @FIELD to denote "perform this derive action on the selected fields." Using the expression log(@FIELD) derives a new log field for each selected field. Note: @ functions cannot be called from scripts. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
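The effect of log(@FIELD) in a multiple-mode Derive node can be pictured as a loop over the selected fields. The Python sketch below is only an analogy: derive_multiple and the _log suffix are hypothetical names chosen for illustration, not Clementine behavior documented here.

```python
import math


def derive_multiple(record, selected_fields, expression, suffix):
    """Apply one expression to each selected field, as @FIELD does in multiple mode.

    The suffix used to name the new fields is an illustrative assumption.
    """
    derived = dict(record)  # leave the input record untouched
    for name in selected_fields:
        derived[name + suffix] = expression(record[name])
    return derived


record = {"Age": 40.0, "Income": 52000.0, "Region": "North"}
result = derive_multiple(record, ["Age", "Income"], math.log, "_log")
```

Fields that are not selected, such as Region here, pass through unchanged and get no derived counterpart.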
@FIELD. Result: Any. Performs an action on all fields specified in the expression context. Note that this function cannot be called from a script. For more information, see CLEM Expressions in Scripts in Chapter 3 in Clementine 12.0 Scripting and Automation Guide.
@TARGET. Result: Any. When a CLEM expression is used in a user-defined analysis function, @TARGET represents the target field or "correct value" for the target/predicted pair being analyzed. This function is commonly used in an Analysis node.
@PREDICTED. Result: Any. When a CLEM expression is used in a user-defined analysis function, @PREDICTED represents the predicted value for the target/predicted pair being analyzed. This function is commonly used in an Analysis node.
@PARTITION_FIELD. Result: Any. Substitutes the name of the current partition field.
@TRAINING_PARTITION. Result: Any. Returns the value of the current training partition. For example, to select training records using a Select node, use the CLEM expression @PARTITION_FIELD = @TRAINING_PARTITION. This ensures that the Select node will always work regardless of which values are used to represent each partition in the data.
@TESTING_PARTITION. Result: Any. Returns the value of the current testing partition.
@VALIDATION_PARTITION. Result: Any. Returns the value of the current validation partition.
@FIELDS_BETWEEN(start, end). Result: Any. Returns the list of field names between the specified start and end fields (inclusive) based on the natural (that is, insert) order of the fields in the data. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
@FIELDS_MATCHING(pattern). Result: Any. Returns a list of field names matching a specified pattern. A question mark (?) can be included in the pattern to match exactly one character; an asterisk (*) matches zero or more characters. To match a literal question mark or asterisk (rather than using these as wildcards), a backslash (\) can be used as an escape character. For more information, see Summarizing Multiple Fields in Chapter 7 on p. 93.
@MULTI_RESPONSE_SET. Result: Any. Returns the list of fields in the named multiple response set. For more information, see Working with Multiple-Response Data in Chapter 7 on p. 94.
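The selection rules described for @FIELDS_BETWEEN and @FIELDS_MATCHING can be sketched in Python as follows. This is an illustration of the stated wildcard rules only, not the Clementine implementation; the function names are hypothetical, and case-sensitive matching is an assumption.

```python
import re


def pattern_to_regex(pattern):
    """Translate ? (one character), * (zero or more) and backslash escapes to a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def fields_matching(field_names, pattern):
    """Return names that match the whole pattern (case sensitivity is assumed)."""
    regex = pattern_to_regex(pattern)
    return [name for name in field_names if regex.fullmatch(name)]


def fields_between(field_names, start, end):
    """Return names from start to end inclusive, in the natural field order."""
    lo, hi = field_names.index(start), field_names.index(end)
    return field_names[lo:hi + 1]


fields = ["id", "q1", "q2", "q10", "why?", "total"]
```

With this field list, the pattern q? selects q1 and q2 but not q10, while q* selects all three.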
Chapter 9
Deploying to the SPSS Predictive Enterprise Repository
To ensure consistent access to enterprise data, deployed streams must be accessed through the SPSS Predictive Enterprise View. This means that there must be at least one Enterprise View source node within each designated scoring or modeling branch in the stream. For more information, see Enterprise View Node in Chapter 2 in Clementine 12.0 Source, Process, and Output Nodes.
To use the Enterprise View node, SPSS Predictive Enterprise Services must be installed and configured at your site, with an Enterprise View, Application Views, and DPDs already defined. For more information, contact your local administrator, or see the SPSS Web site at http://www.spss.com/predictive_enterprise_services/.
In addition, the SPSS Predictive Enterprise View driver must be installed on each computer used to modify or execute the stream. For Windows, simply install the driver on the computer where Clementine Client or Clementine Server is installed, and no further configuration of the driver is needed. On UNIX, a reference to the pev.sh script must be added to the startup script. Contact your local administrator for details on installing the PEV driver.
A DPD is defined against a particular ODBC data source. To use a DPD from Clementine, you must have an ODBC data source defined on the Clementine server host that has the same name and that connects to the same data store as the one referenced in the DPD.
About SPSS Predictive Enterprise Services
SPSS Predictive Enterprise Services is an enterprise-level application that enables widespread use and deployment of predictive analytics. SPSS Predictive Enterprise Services provides centralized, secure, and auditable storage of analytical assets, and advanced capabilities for management and control of predictive analytic processes, as well as sophisticated mechanisms for delivering the results of analytical processing to end users. The benefits of SPSS Predictive Enterprise Services include safeguarding the value of analytical assets, ensuring compliance with regulatory requirements, improving the productivity of analysts, and minimizing the IT costs of managing analytics.
Other Deployment Methods
While SPSS Predictive Enterprise Services offers the most extensive features for managing enterprise content, a number of other mechanisms for deploying or exporting streams are also available, including:
Use the Predictive Applications 4.x Wizard to export streams for deployment to that version of Predictive Applications. For more information, see Predictive Applications 4.x Wizard in Chapter 10 on p. 155.
Use the Cleo Wizard to prepare a stream for deployment as a Cleo scenario for real-time scoring over the Web. For more information, see Exporting to Cleo in Chapter 10 on p. 163.
Export the stream and model for later use with Clementine Solution Publisher Runtime. For more information, see Clementine Solution Publisher in Chapter 2 in Clementine 12.0 Solution Publisher.
Export one or more models in PMML, an XML-based format for encoding model information. For more information, see Importing and Exporting Models as PMML in Chapter 10 on p. 165.
In Clementine, you can access enterprise data through the Enterprise View node, which allows you to define and manage the settings for each connection. For example, if you want to use Champion/Challenger analysis to compare the performance of a number of different models, using the same data connection will ensure the same data is available to all models. For more information, see Enterprise View Node in Chapter 2 in Clementine 12.0 Source, Process, and Output Nodes.
SPSS Predictive Enterprise View consists of three parts, which are typically defined by a data specialist or system administrator:
Enterprise View. Lists the complete set of tables and columns available to the enterprise, regardless of where they actually reside. For example, some columns may be drawn from an operational database, others from a transactional database, but all are listed in the Enterprise View. There can only be one Enterprise View within each repository.
Application View. A subset of the Enterprise View tailored to the needs of a specific application or analysis. Each repository supports a single Enterprise View, from which multiple Application Views can be derived, each a subset of the EV intended for a different purpose. The table and column definitions in the Enterprise and Application Views do not contain actual data, but rather define the specific field names and types that will be available as inputs to the modeling process. For example, if you want to predict response rates based on predictors such as age, income, and debt_equity_ratio, then those columns must be defined in your view.
Data Provider Definition (DPD). Maps the virtual table and column definitions from the Application View to the physical tables where data resides, whether in a data warehouse, an operational data store, or an online transactional database. Multiple DPDs may be used with the same Application View, in order to support the different stages of a project. For example, the historic data used to build the model may come from one database, while operational data comes from another.
The definitions of the Enterprise View, Application Views, and DPDs are stored in the SPSS Predictive Enterprise Services. The actual data, as noted, can reside anywhere. Once defined, these views allow consistent, managed access to users across different departments, allowing for a collaborative modeling and deployment process across the enterprise.
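The relationship among the three parts can be made concrete with a small Python sketch. Everything below is a hypothetical model for illustration: the class names, the column types, and the table names (warehouse.customers, crm.live_customers) are assumptions and do not reflect how SPSS Predictive Enterprise Services stores these definitions.

```python
from dataclasses import dataclass


@dataclass
class EnterpriseView:
    columns: dict  # column name -> type name; definitions only, no data


@dataclass
class ApplicationView:
    name: str
    columns: list  # must be a subset of the Enterprise View columns

    def missing_from(self, enterprise_view):
        return [c for c in self.columns if c not in enterprise_view.columns]


@dataclass
class DataProviderDefinition:
    name: str
    mapping: dict  # virtual column -> (physical table, physical column)

    def unmapped(self, application_view):
        return [c for c in application_view.columns if c not in self.mapping]


ev = EnterpriseView({"age": "integer", "income": "real",
                     "debt_equity_ratio": "real", "region": "string"})
av = ApplicationView("response_model", ["age", "income", "debt_equity_ratio"])

# Two DPDs for the same Application View: one for model building, one for scoring.
historic = DataProviderDefinition(
    "historic", {c: ("warehouse.customers", c) for c in av.columns})
operational = DataProviderDefinition(
    "operational", {"age": ("crm.live_customers", "age"),
                    "income": ("crm.live_customers", "income")})
```

In this sketch the operational DPD is deliberately incomplete, so unmapped reports the column that still lacks a physical source.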
Figure 9-1 Stream Deployment options
Scoring branch. A branch that contains at least one valid scoring model and ends with a terminal output, export, or plot node. While the stream can actually contain any number of valid branches, models, and terminal nodes, one and only one scoring branch must be designated for purposes of deployment. This is the most basic requirement to deploy any stream. The specified branch must contain at least one Enterprise View node used to read in data for scoring.
Scoring Parameters. Allows you to specify parameters that can be modified when the scoring branch is executed. For more information, see Scoring and Modeling Parameters on p. 140.
Scoring model. For model refresh, specifies the model that will be updated or regenerated each time the Scenario is updated (typically as part of a scheduled job). While multiple models may exist on the scoring branch, only one can be designated. Note that when the Scenario is initially created this may effectively be a placeholder model that is updated or regenerated as new data is available.
Model builder. Specifies the model building node used to generate or update the scoring model. Must be a modeling node of the same type as the specified scoring model.
Model Build Parameters. Allows you to specify parameters that can be modified when the model building node is executed. For more information, see Scoring and Modeling Parameters on p. 140.
Note: Multiple Enterprise View nodes can also be used within a given branch. If so, using a single data connection for all Enterprise nodes within a given branch is recommended in most cases, and is required for Champion Challenger support.
► To make a parameter visible so it can be viewed or edited after the Scenario is deployed, select it from the list in the dialog box. The list of available parameters is defined on the Parameters tab in the Stream Properties dialog box. For more information, see Setting Stream and Session Parameters in Chapter 5 on p. 62.
Figure 9-3 SPSS Predictive Enterprise Repository
For example, suppose you create a stream and store it in the repository where it can be shared with researchers from other divisions. If you later update the stream, you can add it to the repository without overwriting the previous version. All versions remain accessible and can be searched by name, label, fields used, or other attributes. For example, you could search for all model versions that use net revenue as a predictor or all models created by a particular author. (To do this with a traditional file system, you would have to save each version under a different filename, and the relationships between versions would be unknown to the software.)
Settings are specific to each site or installation. For specific port, password, and domain information, contact your local system administrator. Note: A separate license is required to access this component. For more information, see http://www.spss.com/clementine/.
Repository. The SPSS Predictive Enterprise Repository installation you want to access. Generally, this matches the name of the host server where the repository is installed. You can connect to only one repository at a time.
Port. The port used to host the connection, typically 8080 by default.
Ensure secure connection (use SSL). Specifies whether an SSL (Secure Sockets Layer) connection should be used. SSL is a commonly used protocol for securing data sent over a network. To use this feature, SSL must be enabled on the server hosting SPSS Predictive Enterprise Repository. If necessary, contact your local administrator for details. For more information, see Using SSL to Encrypt Data in Chapter 4 in Clementine 12.0 Server Administration and Performance Guide.
User ID and password. Specify a valid user name and password for logging on. In many cases, this may be the same password you use to log on to the local network. If necessary, contact your local administrator for more information.
Domain. The network domain where the user is defined, for example, AD/SPSS or LDAP/SPSS, where the specified prefix matches the ID configured for the provider. For domains configured to use Active Directory with local override, the format ADL/SPSS is set by default, but again this may be customized for each provider. Unless you are using a Windows Active Directory or LDAP domain, this field can typically be left blank. Contact your local administrator for specific login information if necessary.
Set as default repository. Saves the current settings as the default so that you do not have to reenter them each time you want to connect. Note: You must reenter the password whenever you open a new connection.
Storing Objects
Figure 9-5 Storing a model
You can store streams, nodes, models, projects, and output objects.
► To store the current stream, from the menus choose:
File Store Stream...
► To store a model, project, or output object, select it on the manager palette in Clementine, and
or
File Projects Store Project...
or
File Outputs Store Output...
► Alternatively, right-click on an object in the manager palette and choose Store.
► To store a node, right-click the node in the stream canvas and choose Store Node.
Figure 9-6 Storing a model
► Select the folder where you want to store the object. (To create a new folder, click the icon in the
Labels can be moved from one version of an object to another. For example, you could label the current production version, and then reuse that label with a newer version when it becomes available. Note that only one version may carry a particular label at once.
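The rule that a label belongs to at most one version at a time can be sketched in Python as a mapping from label to version. The class, method names, and version identifiers below are hypothetical and serve only to illustrate the behavior described above.

```python
class VersionedObject:
    """Hypothetical model of a stored object with versions and movable labels."""

    def __init__(self, name):
        self.name = name
        self.versions = []  # version identifiers, oldest first
        self.labels = {}    # label -> version identifier (one version per label)

    def store(self, version_id):
        """Add a new version without overwriting earlier ones."""
        self.versions.append(version_id)

    def set_label(self, label, version_id):
        """Attach a label to a version; if already in use, the label moves."""
        if version_id not in self.versions:
            raise KeyError(version_id)
        self.labels[label] = version_id

    def labels_for(self, version_id):
        return sorted(l for l, v in self.labels.items() if v == version_id)


stream = VersionedObject("churn.str")
stream.store("2008-01-10T09:00")
stream.set_label("PRODUCTION", "2008-01-10T09:00")
stream.store("2008-02-01T14:30")
stream.set_label("PRODUCTION", "2008-02-01T14:30")  # label moves to the newer version
```

Because the mapping is keyed by label, reusing a label on a newer version automatically removes it from the older one, while both versions remain stored.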
Storing Projects
Because a project file is a container for other Clementine objects, you need to tell Clementine where to store the project's objects: in the local file system or in SPSS Predictive Enterprise Repository. You do this using a setting in the Project Properties dialog box. Once you configure a project to store objects in the repository, whenever you add a new object to the project, Clementine automatically prompts you to store the object. When you have finished your Clementine session, you must store a new version of the project file so that it remembers your additions. The project file automatically contains (and retrieves) the latest versions of its objects. If you did not add any objects to a project during a Clementine session, then you do not have to store the project file again. You must, however, store new versions for the project objects (streams, output, and so forth) that you changed.
or
File Projects Retrieve Project...
or
File Outputs Retrieve Output...
► Alternatively, right-click in the manager or project palette and choose Retrieve from the context menu.
► To retrieve a node, from the Clementine menus choose:
Insert Node (or SuperNode) from Repository...
Figure 9-7 Retrieving an object from SPSS Predictive Enterprise Repository
► Browse to select the object you want to retrieve.
► Select the desired object and version.
Selecting a Version
To retrieve a version other than the latest, click the version browse button (...). Detailed information for all versions is displayed, allowing you to choose the one you want.
Figure 9-8 Retrieving a version of an object
► To sort the list by version, label, size, or date, click on the header of the appropriate column.
You can access retrieved objects from the Stream, Model, or Output managers (as appropriate for the object type) in the Clementine window.
Figure 9-9 Exploring SPSS Predictive Enterprise Repository folders
► To display a tree view of the folder hierarchy, click the Folders tab in the upper left pane.
► To locate stored objects by type, label, date, or other criteria, click the Search tab.
Objects that match the current selection or search criterion are listed in the right pane, with detailed information on the selected version displayed in the lower right pane. The attributes displayed apply to the most recent version.
► To browse or retrieve other versions of an object, click the object, and from the menus choose:
Edit Retrieve Version...
Figure 9-10 Retrieving a version of an object
Object Properties
The Object Properties dialog box in SPSS Predictive Enterprise Repository allows you to view and edit properties. Although some properties cannot be changed, you can always update an object by adding a new version.
To View Object Properties
► In the SPSS Predictive Enterprise Repository window, right-click the desired object.
► Choose Object Properties.
Figure 9-11 Object properties
General Tab
Name. The name of the object as viewed in SPSS Predictive Enterprise Repository.
Created on. Date the object (not the version) was created.
Last modified. Date the most recent version was modified.
Author. The user's login name.
Description. By default, this contains the description specified on the object's Annotation tab in Clementine.
Linked topics. SPSS Predictive Enterprise Repository allows models and related objects to be organized by topics if desired. The list of available topics is provided by the local administrator.
Keywords. You specify keywords on the Annotation tab for a stream, model, or output object. Multiple keywords should be separated by spaces, up to a maximum of 255 characters. (If keywords contain spaces, use quotation marks to separate them.)
Versions Tab
Objects stored in SPSS Predictive Enterprise Repository may have multiple versions. The Versions tab displays information about each version.
Figure 9-12 Version properties
The following properties can be specified or modified for specific versions of a stored object:
Version. Unique identifier for the version, generated based on the time when the version was stored.
Label. Current label for the version, if any. Unlike the version identifier, labels can be moved from one version of an object to another.
The file size, creation date, and author are also displayed for each version.
Permissions Tab
The Permissions tab lets you set read and write permissions for the object. All users and groups with access to the current object are listed. Permissions follow a hierarchy. For example, if you do not have read permission, you cannot have write permission. If you do not have write permission, you cannot have delete permission.
Add group. Click the Add Group icon on the right side of the Permissions tab to assign access to additional users and groups. The list of available users and groups is controlled by the administrator.
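The permission hierarchy described for the Permissions tab (read before write, write before delete) can be expressed as a simple consistency check. The following Python sketch is illustrative only; the LEVELS list and the is_consistent function are hypothetical names, not part of the repository software.

```python
# Ordered from least to most privileged, following the hierarchy described above.
LEVELS = ["read", "write", "delete"]


def is_consistent(granted):
    """Return True if every granted permission also has all lower levels granted."""
    for index, level in enumerate(LEVELS):
        if level in granted:
            if any(lower not in granted for lower in LEVELS[:index]):
                return False
    return True
```

For example, a grant of write without read fails the check, as does delete without write.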
Figure 9-14 Searching objects by name
Searching objects by name. When searching on objects by name, an asterisk (*) can be used as a wildcard character to match any string of characters, and a question mark matches any single character. For example, *cluster* matches all objects that include the string cluster anywhere in the name. The search string m0?_* matches M01_cluster.str and M02_cluster.str but not M01a_cluster.str. Searches are not case sensitive (cluster matches Cluster matches CLUSTER). Note: If the number of objects is large, searches may take a few moments.
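The name-search rules above can be approximated in Python with the standard fnmatch module, lowercasing both sides to mimic case-insensitive matching. This is a sketch of the described behavior, not the repository's search code; search_by_name is a hypothetical helper, and fnmatch additionally treats square brackets as character classes, which the text above does not mention.

```python
from fnmatch import fnmatchcase


def search_by_name(names, pattern):
    """Case-insensitive wildcard match: * is any string, ? is any single character."""
    lowered = pattern.lower()
    return [name for name in names if fnmatchcase(name.lower(), lowered)]


names = [
    "M01_cluster.str",
    "M02_cluster.str",
    "M01a_cluster.str",
    "Churn_Cluster_v2.str",
    "sales.str",
]
by_prefix = search_by_name(names, "m0?_*")
by_substring = search_by_name(names, "*cluster*")
```

Here by_prefix holds the first two names only, because the single ? cannot absorb the extra character in M01a_cluster.str.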
Refining the Search
You can refine the search based on object type, label, date, or keyword. Only objects that match all specified search criteria will be found. For example, you could locate all streams containing one or more clustering models that also have a specific label applied, and/or were modified after a specific date.
Figure 9-15 Searching for streams containing a specific type of model
Object Type. You can restrict the search to models, streams, output, or other types of objects.
Field Name. You can search by fields used, for example, all models that use a field named
Keywords. Search on specific keywords. In Clementine, keywords are specified on the Annotation tab for a stream, model, or output object. Multiple search phrases can be separated by semicolons, for example, income; crop type; claim value. (Note that within a search phrase, spaces matter. For example, crop type with one space and crop type with two spaces are not the same.)
choose:
Tools Predictive Enterprise Repository Explore...
► Click the Folders tab.
► To add a new folder, right-click the parent folder and choose New Folder.
► To delete a folder, right-click it and choose Delete Folder.
► To rename a folder, right-click it and choose Rename Folder.
Folder Properties
To view properties for any folder in the SPSS Predictive Enterprise Repository window, right-click the desired folder. Choose Folder Properties.
Figure 9-16 Folder properties
General tab. Displays the folder name, creation, and modification dates.
Permissions tab. Specifies read and write permissions for the folder. All users and groups with access to the parent folder are listed. Permissions follow a hierarchy. For example, if you do not have read permission, you cannot have write permission. If you do not have write permission, you cannot have delete permission.
Add group. Click the Add Group icon on the right side of the Permissions tab to assign access to additional users and groups. The list of available users and groups is controlled by the administrator.
Cascade all permissions. Cascades permissions settings from the current folder to all child and descendant folders. This is a quick way to set permissions for several folders at once. Set permissions as desired for the parent folder, and then cascade as desired.
Cascade changes only. Cascades only changes made since the last time changes were applied. For example, if a new group has been added and you want to give it access to all folders under the Sales branch, you can give the group access to the root Sales folder and cascade the change to all subfolders. All other permissions to existing subfolders remain as before.
Do not cascade. Any changes made apply to the current folder only and do not cascade to child folders.
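The difference among the three cascade options can be illustrated with a small folder tree in Python. This sketch is an interpretation for illustration: the Folder class, the mode names, and in particular the assumption that "cascade all" replaces the permissions of every descendant are not confirmed by this guide.

```python
class Folder:
    """Hypothetical folder with a group -> permission mapping and child folders."""

    def __init__(self, name, permissions=None):
        self.name = name
        self.permissions = dict(permissions or {})
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()


def apply_permissions(folder, changes, mode="none"):
    """Apply changes to a folder, then cascade according to mode.

    mode "all": copy the folder's full permissions to every descendant (assumed to replace).
    mode "changes": apply only the changed entries to every descendant.
    mode "none": change the current folder only.
    """
    folder.permissions.update(changes)
    if mode == "all":
        for descendant in folder.descendants():
            descendant.permissions = dict(folder.permissions)
    elif mode == "changes":
        for descendant in folder.descendants():
            descendant.permissions.update(changes)
    elif mode != "none":
        raise ValueError(mode)


sales = Folder("Sales", {"Managers": "write"})
east = sales.add(Folder("East", {"Managers": "write", "EastTeam": "write"}))

# Give a new group access to the whole Sales branch without touching other settings.
apply_permissions(sales, {"Analysts": "read"}, mode="changes")
```

After this call the East subfolder keeps its own EastTeam entry and gains the Analysts entry, which matches the Sales example in the text.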
Chapter 10
Exporting to Predictive Applications
Under normal circumstances, here's how you can use Clementine to expand the data mining and deployment capabilities available with predictive applications.
► Begin in your predictive application. Using options in the Customer View Builder, export the Unified Customer View (UCV) data model as an XML file. Note the location of the XML file because you will need it to guide your work in Clementine.
► Next, set up source nodes in Clementine to access all data sources (databases, flat files, etc.) that contain the fields for the UCV (these are listed in the XML file exported earlier). You may choose to include all fields referenced by the UCV, or you may use only the portion of the UCV that you will need for the models you are building. Typically, you will use several source nodes in Clementine to access the modeling data.
► Use Clementine to perform any data merging, transformations, or derivations that are necessary
node is used not only for directionality when modeling but also to ensure that your data matches the field information defined in the XML file. It is a good idea to compare settings in the Type node to attribute specifications in the XML file generated earlier. For more information, see Step 3: Selecting a UCV Node on p. 158.
► Next, consider the type of model you are creating in Clementine. If you are deploying a value model, such as a neural network, you may want to export binary predictions (for example, Churn True/False) as a propensity, which will make the prediction comparable with predictions from models generated by the application. For more information, see Exporting Binary Predictions as Propensity Scores on p. 156.
► Once you are satisfied with the model and have converted confidences to propensities, add a Terminal node to the deployment branch of the stream. Many people use a Table node, but any Terminal node will suffice. Ensure that only fields you want to be visible in the outside application are visible at the Terminal node. In other words, prior to this Terminal node, you may need to filter out fields that you do not want to deploy.
► In addition, make sure that any prediction fields generated by the model are instantiated before being exported. If necessary, this can be done by adding a Type node between the generated model and the terminal export node.
► As a final step before using the wizard, ensure that your stream is prepared for deployment by performing a test execution.
The stream is now ready for deployment. You can access the Predictive Applications Wizard from the Tools menu in Clementine. Follow the wizard steps described in this documentation to produce a Clementine Deployment Package (.cdp) containing stream information and metadata required for publishing in the Real Time Environment.
Before exporting with the Predictive Applications Wizard, consider whether the scoring output of your Clementine model is consistent with predictions generated by the predictive application. In many cases, you may want to export binary predictions as propensity scores that allow you to compare the strength of predictions accurately across several models.
A value model creates a single propensity score for each record that ranks the likelihood of a specific "yes" or "no" outcome on a scale from 0.0 to 1.0. For example, a churn model might produce a score ranging from 1.0 (likely to churn) to 0.0 (not likely to churn). Since propensity scores are not probabilities, a score of 0.5 does not necessarily mean "50% likely to churn" or even "twice as likely to churn as someone with a 0.25 score," but it does mean "more likely to churn than someone with a 0.4 score." Propensity scores can be used for ranking and can be used, for example, to find the 10% of customers most likely to churn.
An offer model creates prediction and confidence values that, when submitted to the application, will be used as virtual attributes in the UCV. These may be numeric or continuous range values.
A number of Clementine models can produce binary predictions, including neural networks, decision trees, and logistic regression models. If you are deploying a value model, you may need to enable propensity scores prior to exporting. Propensity scores can be enabled on the Analyze tab in the modeling node or on the Settings tab in the generated model nugget. For more information, see Modeling Node Analyze Options in Chapter 3 in Clementine 12.0 Modeling Nodes.
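Because propensity scores are meant for ranking rather than for reading as probabilities, a typical use is selecting the highest-scoring share of customers. The Python sketch below illustrates that idea only; the function name, the customer identifiers, the scores, and the choice to round the cutoff up are all assumptions made for the example.

```python
def top_percent(scores, percent):
    """Return the ids with the highest propensity scores, covering the given percent.

    scores maps an id to a propensity in the range 0.0 to 1.0.
    The count is rounded up (an assumption) so a non-empty input always yields a result.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    count = -(-len(scores) * percent // 100)  # integer ceiling division
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:count]


# Hypothetical churn propensities for ten customers.
scores = {
    "c01": 0.12, "c02": 0.91, "c03": 0.40, "c04": 0.50, "c05": 0.25,
    "c06": 0.77, "c07": 0.05, "c08": 0.63, "c09": 0.33, "c10": 0.58,
}
most_likely = top_percent(scores, 10)
```

Only the order of the scores matters here: the selection would be the same for any scoring that preserves the ranking.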
The rest of the wizard takes you through the process of generating a package for deployment into the Real Time Environment. Before proceeding, use the prerequisite checklist to ensure that the stream is prepared for deployment. For more information, see Before Using the Predictive Applications Wizard on p. 155.
It is important to distinguish between Terminal nodes of the same name, since the drop-down list in the wizard provides only the name and node type for each Terminal node in the stream. To avoid confusion, give Terminal nodes unique names in the stream. Also, ensure that only fields you want to be visible in the application environment are visible at the Terminal node. In other words, prior to this Terminal node, you may need to filter out fields that you do not want to deploy.
Figure 10-3 Selecting a Type node used as a UCV node and a UCV metadata file (XML file)
UCV node. A UCV node is a Type node from your stream that is used to ensure that all data matches the definitions in the Unified Customer View (UCV). These specifications are stored in an XML file (exported previously from the Customer View Builder). When you click Next, the wizard automatically validates settings in the Type node against specifications in the XML file that you specify here using the UCV Metadata File option below.
UCV metadata file. A UCV metadata file is the XML file that you generated previously from the Customer View Builder. The XML file you choose here contains the data attributes required for deployment to the application.
Data Mismatch Errors
If the wizard has generated data mismatch errors, go back to the Clementine stream and examine your Type node specifications. Compare field information in the Type node to that in the XML file generated by the Customer View Builder. (You can open the XML file in a text editor, such as Notepad.) Do your settings in the Type node match the UCV attributes? For example, the XML file may state that a field named Cholesterol is required by the UCV and that it contains string values:
<UcvAttribute Name="Cholesterol" Domain="String"/>
In the Clementine stream, check the Type node settings to ensure that only fields useful to the Real Time Environment are exported from Clementine.
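When the metadata file is long, listing its attributes programmatically can be easier than reading the XML by eye. The sketch below is an illustration only: the UcvAttribute element with its Name and Domain attributes is taken from the example above, while the Ucv wrapper element, the Age attribute, and the function name are assumptions made so that the sample is well-formed.

```python
import xml.etree.ElementTree as ET

# Only the <UcvAttribute Name=... Domain=.../> shape comes from the guide.
# The <Ucv> wrapper and the "Age" entry are assumptions for this sample.
SAMPLE_UCV_XML = """
<Ucv>
  <UcvAttribute Name="Cholesterol" Domain="String"/>
  <UcvAttribute Name="Age" Domain="Integer"/>
</Ucv>
"""

def read_ucv_attributes(xml_text):
    """Return {attribute name: domain} for every UcvAttribute element."""
    root = ET.fromstring(xml_text.strip())
    # iter() matches elements at any depth, so the real nesting of the
    # metadata file does not have to match this sample.
    return {el.get("Name"): el.get("Domain") for el in root.iter("UcvAttribute")}

for name, domain in read_ucv_attributes(SAMPLE_UCV_XML).items():
    print(f"{name}: {domain}")
```

The resulting name-to-domain listing can then be compared side by side with the field names and storage shown in the Type node.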
Figure 10-4 Checking Type node settings and storage in the Values subdialog box
In some cases, the mapping between Type node settings and UCV specifications is not obvious. In this integration, domain as defined in the UCV XML file is equivalent to storage in Clementine. Field settings for each of these will be matched during export.
Table 10-1 Storage and domain mapping
Application Domains String Character Bit Float Double Decimal Long Integer Date Timestamp No match
You can alter storage type using conversion functions, such as to_integer, in a Derive or Filler node. For more information, see Storage Conversion Using the Filler Node in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
Click Next to automatically check the stream metadata and your specifications. If all are specified properly, the .cdp file will be generated. Note the target location because you will need to access this file later when publishing in the application.
Field names and types from the Clementine stream are verified against the XML file generated from the UCV. If field information does not match, the wizard may exit automatically and display relevant error messages.
Figure 10-7 Error messages resulting from metadata mismatch
In case of an error, return to the Clementine stream and check the following:
Check that all field names in the UCV node (a Type node in the stream) are present in the XML file defining the UCV. Also check that their types (domains) are compatible.
Note the case of field names, since both Clementine and other applications may be case sensitive. Field order, however, is not important.
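The checks above can be sketched as a small comparison routine. This is a hedged illustration, not the wizard's actual validation logic: the function name, the message wording, and the sample data are hypothetical, and the storage-to-domain compatibility table is passed in by the caller (for example, transcribed from Table 10-1) rather than assumed here.

```python
def find_mismatches(stream_fields, ucv_attributes, compatible):
    """List problems between stream fields and UCV attributes.

    stream_fields:  {field name: Clementine storage}
    ucv_attributes: {attribute name: UCV domain}
    compatible:     {storage: set of acceptable domains}, supplied by the caller
    """
    problems = []
    # Case-insensitive index, used only to give a more helpful hint.
    lowered = {name.lower(): name for name in ucv_attributes}
    for name, storage in stream_fields.items():  # field order is irrelevant
        if name not in ucv_attributes:            # exact, case-sensitive match
            hint = lowered.get(name.lower())
            if hint:
                problems.append(f"{name}: case differs from UCV attribute {hint}")
            else:
                problems.append(f"{name}: not defined in the UCV")
            continue
        domain = ucv_attributes[name]
        if domain not in compatible.get(storage, set()):
            problems.append(f"{name}: storage {storage} does not match domain {domain}")
    return problems

# Hypothetical data; the identity mapping below is a placeholder only.
placeholder_mapping = {"String": {"String"}, "Integer": {"Integer"}}
stream = {"cholesterol": "String", "Age": "String", "BP": "String"}
ucv = {"Cholesterol": "String", "Age": "Integer"}

for line in find_mismatches(stream, ucv, placeholder_mapping):
    print(line)
```

Reporting a case difference separately from a missing field reflects the note above that field names may be case sensitive.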
Note: The size of the generated .cdp package needs to be under 5 KB. In particular, if the list of generated fields is long, some of them might need to be removed from the description in order to stay within this limit.
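A quick way to confirm the size limit before publishing is to check the file size directly. The sketch below assumes that 1 KB means 1024 bytes, which the guide does not state; the function name and the example path are hypothetical.

```python
import os

# Assumption: 1 KB = 1024 bytes. The guide only says "under 5 KB".
LIMIT_BYTES = 5 * 1024

def within_package_limit(path, limit=LIMIT_BYTES):
    """Return True if the file at `path` is strictly smaller than `limit` bytes."""
    return os.path.getsize(path) < limit

# Example use with a hypothetical package name:
# within_package_limit("churn_model.cdp")
```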
Step 6: Summary
When your Clementine Deployment Package is successfully generated, you have finished your work in Clementine.
Taking the Next Step
Next, you can import the model (a .cdp package file) into Interaction Builder.
Exporting to Cleo
Using options within Clementine, you can easily publish streams for use with Cleo, a customizable solution that allows you to extend the power of predictive analytics to a Web-based audience. The Cleo interface can be completely customized for your target application and operates seamlessly within the SPSS Web Deployment Framework (SWDF). To help you package streams, the Cleo Wizard has been added to Clementine. To access the wizard, from the menus choose:
Tools > Cleo Wizard
This opens the Cleo Wizard, which takes you through the steps of specifying fields and other import information needed by Cleo. Note: A separate license is required to access this component. For more information, see http://www.spss.com/clementine/.
A Cleo scenario is a Clementine data mining solution published to SWDF for use with Cleo. Selecting Publish now in step 11 of the Cleo Wizard deploys the current stream and various specifications as a Cleo scenario ready for immediate use. Cleo end users can access the Cleo scenario to analyze data from a database or make predictions based on a single input record.
The HTML interface for a Cleo scenario is entirely customizable, depending on the settings that you specify in the Cleo Wizard and the options described in the Cleo Implementation Guide, available with the Cleo product.
What Is a Cleo Bundle?
A Cleo bundle contains all of the components of a Cleo scenario without actually publishing to the SWDF. Selecting Save scenario in step 11 of the Cleo Wizard creates a .jar file containing the necessary ingredients for a Cleo scenario, which you can publish or alter at a later date. For example, the Cleo bundle includes stream and data specifications as well as a blueprint for the look of Cleo Web pages. To modify a Cleo scenario, you can open the Cleo bundle using the Cleo Wizard and make any changes on the Wizard pages. To apply your new specifications, republish or save the bundle.
Terminate the stream with a Publisher node.
Make sure that all settings are complete in the Publisher node dialog box and deselect the Quote strings option.
Perform a test execute to ensure that the stream is fully functional.
Check that any ODBC connections used and mapped drives are available on the server machine.
Always save a scenario, as well as publish it, to ensure that you can restore settings in the event of an error.
Note: A separate license is required to access this component. For more information, see http://www.spss.com/clementine/.
The Cleo Wizard walks you through the process of generating a bundle for deployment into the SPSS Web Deployment Framework. For additional hints on each screen, click the Help button to open the relevant topic in the online Help. Before proceeding, you may want to use the prerequisite checklist to ensure that the stream is prepared for deployment. For more information, see Cleo Stream Prerequisites on p. 164.
PMML export is supported for most of the model types generated in Clementine. For more information, see Model Types Supporting PMML on p. 167.
E Right-click a model nugget on the Models tab in the managers window.
E From the context menu, choose Export PMML.
Figure 10-10 Exporting a model in PMML format
E In the Export dialog box, specify a target directory and a unique name for the model.
Note: You can change options for PMML export in the User Options dialog box. For more information, see Setting PMML Export Options in Chapter 12 on p. 193.
To Import a Model Saved as PMML
Models exported as PMML from Clementine or another application can be imported into the model nuggets palette. For more information, see Model Types Supporting PMML on p. 167.
E In the model nuggets palette, right-click on the palette and select Import PMML from the context menu.
Figure 10-11 Importing a model in PMML format
E Select the file to import and specify options for variable and value labels as desired.
Figure 10-12 Selecting the XML file for a model saved using PMML
Use variable labels. The PMML may specify both variable names and variable labels (such as
Referrer ID for RefID) for variables in the data dictionary. Select this option to use variable labels if they are present in the originally exported PMML.
Use value labels. The PMML may specify both values and value labels (such as Male for M
or Female for F) for a variable. Select this option to use the value labels if they are present in the PMML. If you have selected the above label options but there are no variable or value labels in the PMML, the variable names and literal values are used as normal. By default, both options are selected.
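The fallback behavior described above (use labels when present, otherwise names and literal values) can be illustrated with a short sketch. In PMML, a DataField in the DataDictionary may carry a displayName attribute and its Value children may carry displayValue. The sample document, the function names, and the two keyword flags below are hypothetical and only mimic the two dialog options; this is not how Clementine itself reads PMML.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment using the RefID and M/F examples from the text.
SAMPLE_PMML = """
<PMML version="3.1" xmlns="http://www.dmg.org/PMML-3_1">
  <DataDictionary numberOfFields="2">
    <DataField name="RefID" displayName="Referrer ID" optype="categorical" dataType="string"/>
    <DataField name="Gender" optype="categorical" dataType="string">
      <Value value="M" displayValue="Male"/>
      <Value value="F" displayValue="Female"/>
    </DataField>
  </DataDictionary>
</PMML>
"""

def local(tag):
    """Strip the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def read_labels(pmml_text, use_variable_labels=True, use_value_labels=True):
    """Return {shown field name: {raw value: shown value}}."""
    root = ET.fromstring(pmml_text.strip())
    result = {}
    for el in root.iter():
        if local(el.tag) != "DataField":
            continue
        name = el.get("name")
        # Fall back to the variable name when no label is present.
        shown = el.get("displayName", name) if use_variable_labels else name
        values = {}
        for child in el:
            if local(child.tag) == "Value":
                raw = child.get("value")
                # Fall back to the literal value when no label is present.
                values[raw] = child.get("displayValue", raw) if use_value_labels else raw
        result[shown] = values
    return result

print(read_labels(SAMPLE_PMML))
print(read_labels(SAMPLE_PMML, use_variable_labels=False, use_value_labels=False))
```

In the sample, Gender has no displayName, so its variable name is used even when variable labels are requested.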
Model Type               PMML Export (Version 3.1)
Text Extraction          not available
Feature Selection        not available
Anomaly Detection        not available
Time Series              not available
Unrefined (GRI, CEMI)    not available
Discriminant             not available
SLRM                     not available
Cox Regression           not available
Database native models. For models generated using database-native algorithms, PMML export is available for IBM Intelligent Miner models only. Models created using Analysis Services from Microsoft or Oracle Data Miner cannot be exported. Also note that IBM models exported as PMML cannot be imported back into Clementine. For more information, see Database Modeling Overview in Chapter 2 in Clementine 12.0 In-Database Mining Guide.
PMML 3.1 Import
Clementine can import and score PMML 3.1 models generated by current versions of all SPSS products, including models exported from Clementine as well as model or transformation PMML generated by SPSS 15.0 or later. Essentially, this means any PMML that the SPSS SmartScore component can score, with the following exceptions:
Apriori, CARMA, and Anomaly Detection models cannot be imported.
PMML models may not be browsed after importing into Clementine even though they can be used in scoring. (Note that this includes models that were exported from Clementine to begin with. To avoid this limitation, export the model as a generated model file [*.gm] rather than PMML.)
Models that cannot be scored will not be imported.
IBM Intelligent Miner models exported as PMML cannot be imported.
Importing Earlier Versions of PMML (2.1 or 3.0)
PMML import for legacy models exported from earlier releases of Clementine (prior to 11.0) is supported for some, but not all, model types, as indicated below:
Model Type               PMML Import (2.1 or 3.0)
Neural Network           not available
C&R Tree                 yes
CHAID Tree               yes
QUEST Tree               yes
C5.0 Tree                not available
Ruleset                  not available
Kohonen Net              not available
K-Means                  not available
TwoStep                  yes
Model Type               PMML Import (2.1 or 3.0)
Linear Regression        yes
Logistic Regression      yes
Factor/PCA               not available
Sequence                 not available
CARMA                    not available
Apriori                  not available
Text Extraction          not available
Feature Selection        not available
Anomaly Detection        not available
Unrefined (GRI, CEMI)    not available
Chapter 11
Introduction to Projects
A project is a group of files related to a data mining task. Projects include data streams, graphs, generated models, reports, and anything else that you have created in Clementine. At first glance, it may seem that Clementine projects are simply a way to organize output, but they are actually capable of much more. Using projects, you can:
Annotate each object in the project file.
Use the CRISP-DM methodology to guide your data mining efforts. Projects also contain a CRISP-DM Help system that provides details and real-world examples on data mining with CRISP-DM.
Add non-Clementine objects to the project, such as a PowerPoint slide show used to present your data mining goals or white papers on the algorithms that you plan to use.
Produce both comprehensive and simple update reports based on your annotations. These reports can be generated in HTML for easy publishing on your organization's intranet.
Note: If the Projects tool is not visible in the Clementine window, choose Project from the View menu.
Objects that you add to a project can be viewed in two ways: Classes view and CRISP-DM view. Anything that you add to a project is added to both views, and you can toggle between views to create the organization that works best.
Figure 11-1 CRISP-DM view and Classes view of a project file
CRISP-DM View
By supporting the Cross-Industry Standard Process for Data Mining (CRISP-DM), Clementine projects provide an industry-proven and non-proprietary way of organizing the pieces of your data mining efforts. CRISP-DM uses six phases to describe the process from start (gathering business requirements) to finish (deploying your results). Even though some phases do not typically involve work in Clementine, the projects tool includes all six phases so that you have a central location for storing and tracking all materials associated with the project. For example, the Business Understanding phase typically involves gathering requirements and meeting with colleagues to determine goals rather than working with data in Clementine. The projects tool allows you to store your notes from such meetings in the Business Understanding folder for future reference and inclusion in reports.
Figure 11-2 CRISP-DM view of the projects tool
The CRISP-DM projects tool is also equipped with its own Help system to guide you through the data mining life cycle. From Clementine, this help can be accessed by choosing CRISP-DM Help from the Help menu. Note: If the Projects tool is not visible in the window, choose Project from the View menu.
Classes View
The Classes view in the projects tool organizes your work in Clementine categorically by the types of objects created. Saved objects can be added to any of the following categories:
Streams
Nodes
Models
Tables, graphs, reports
Other (non-Clementine files, such as slide shows or white papers relevant to your data mining work)
Figure 11-3 Classes view in the projects tool
Adding objects to the Classes view also adds them to the default phase folder in the CRISP-DM view. Note: If the Projects tool is not visible in the window, choose Project from the View menu.
Building a Project
A project is essentially a file containing references to all of the files that you associate with the project. This means that project items are saved both individually and as a reference in the project file (.cpj). Because of this referential structure, note the following:
Project items must first be saved individually before being added to a project. If an item is unsaved, you will be prompted to save it before adding it to the current project.
Objects that are updated individually, such as streams, are also updated in the project file.
Manually moving or deleting objects (such as streams, nodes, and output objects) from the file system will render links in the project file invalid.
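Because a project stores references rather than copies, a moved or deleted file leaves a dangling link. The sketch below shows the general idea of checking a list of referenced paths. The guide does not describe the internal layout of a .cpj file, so the list of paths is assumed to have been obtained by some other means, and the function name and example paths are hypothetical.

```python
import os

def find_broken_references(paths):
    """Return the referenced paths that no longer exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Example use with hypothetical paths:
# find_broken_references(["streams/churn.str", "notes/kickoff.ppt"])
```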
Adding to a Project
Once you have created or opened a project, you can add objects, such as data streams, nodes, and reports, using several methods.
Adding Objects from the Managers
Using the managers in the upper right corner of the Clementine window, you can add streams or output.
E Select an object, such as a table or a stream, from one of the managers tabs.
E Right-click and choose Add to Project.
If the object has been previously saved, it will automatically be added to the appropriate objects folder (in Classes view) or to the default phase folder (in CRISP-DM view).
E Alternatively, you can drag and drop objects from the managers to the project workspace.
Note: You may be asked to save the object first. When saving, be sure that Add file to project is selected in the Save dialog box. This will automatically add the object to the project after you save it.
You can add individual nodes from the stream canvas by using the Save dialog box.
E Select a node on the canvas.
E Right-click and choose Save Node. Alternatively, from the menus choose:
Edit > Node > Save Node...
E In the Save dialog box, select Add file to project.
E Create a name for the node and click Save.
This saves the file and adds it to the project. Nodes are added to the Nodes folder in Classes view and to the default phase folder in CRISP-DM view.
Adding External Files
You can add a wide variety of non-Clementine objects to a project. This is useful when you are managing the entire data mining process within Clementine. For example, you can store links to data, notes, presentations, and graphics in a project. In CRISP-DM view, external files can be added to the folder of your choice. In Classes view, external files can be saved only to the Other folder.
or
E Right-click the target folder in CRISP-DM or Classes view.
E From the menu, choose Add to Folder.
E Select a file in the dialog box and click Open.
This will add a reference to the selected object inside Clementine projects.
Make sure that the project you want to transfer is open in the Projects tool.
To transfer a project:
E Right-click the root project folder and choose Transfer Project.
E If prompted, log in to SPSS Predictive Enterprise Repository.
E Specify the new location for the project and click OK.
Created. Shows the project's creation date (not editable).
Summary. You can enter a summary for your data mining project that will be displayed in the project report.
Contents. Lists the type and number of components referenced by the project file (not editable).
Save unsaved object as. Specifies whether unsaved objects should be saved to the local file system, or stored in the Predictive Enterprise Repository. For more information, see SPSS Predictive Enterprise Repository in Chapter 9 on p. 140.
Update object references when loading project. Select this option to update the project's references to its components.
Note: The files added to a project are not saved in the project file itself. Rather, a reference to the files is stored in the project. This means that moving or deleting a file will remove that object from the project.
Annotating a Project
The projects tool provides a number of ways to annotate your data mining efforts. Project-level annotations are often used to track big-picture goals and decisions, while folder or node annotations provide additional detail. The Annotations tab provides enough space for you to document project-level details, such as the exclusion of data with irretrievable missing data or promising hypotheses formed during data exploration.
To annotate a project:
E Select the project folder in either CRISP-DM or Classes view.
E Right-click the folder and choose Project Properties.
E Click the Annotations tab.
Figure 11-6 Annotations tab in the project properties dialog box
In CRISP-DM view, folders are annotated with a summary of the purpose of each phase as well as guidance on completing the relevant data mining tasks. You can remove or edit any of these annotations.
Name. This area displays the name of the selected field.
Tooltip text. Create custom ToolTips that will be displayed when you hover the mouse pointer over a project folder. This is useful in CRISP-DM view, for example, to provide a quick overview of each phase's goals or to mark the status of a phase, such as In progress or Complete.
Annotation field. Use this field for more lengthy annotations that can be collated in the project report. The CRISP-DM view includes a description of each data mining phase in the annotation, but you should feel free to customize this for your own project.
Include in report. To include the annotation in reports, select Include in report.
Object Properties
You can view object properties and choose whether to include individual objects in the project report. To access object properties:
E Right-click an object in the project window.
E From the menu, choose Object Properties.
Figure 11-8 Object properties dialog box
Name. This area lists the name of the saved object.
Path. This area lists the location of the saved object.
Include in report. Select this option to include the object details in a generated report.
Closing a Project
When you exit Clementine or open a new project, the existing project is closed, including all associated files. Alternatively, you can choose to close the project file itself and leave all associated files open.
To close a project file:
E From the File menu, choose Close Project.
E If you are prompted to close or leave open all files associated with the project, click Leave Open to close the project file (.cpj) itself but to leave open all associated files, such as streams, nodes, or graphs.
If you modify and save any associated files after the close of a project, these updated versions will be included in the project the next time you open it. To prevent this behavior, remove the file from the project or save it under a different filename.
Generating a Report
One of the most useful features of projects is the ability to generate reports based on the project items and annotations. You can generate a report directly into one of several file types or to an output window on the screen for immediate viewing. From there, you can print, save, or view the report in a Web browser. You can distribute saved reports to others in your organization. Reports are often generated from project files several times during the data mining process for distribution to those involved in the project. The report culls information about the objects referenced from the project file as well as any annotations created. You can create reports based on either the Classes view or CRISP-DM view.
To generate a report:
E Select the project folder in either CRISP-DM or Classes view.
E Right-click the folder and choose Project Properties.
E In the project properties dialog box, click the Report tab.
E Specify the report options and click Generate Report.
Figure 11-10 Selecting options for a report
The options in the report dialog box provide several ways to generate the type of report you need:
Output name. Specify the name of the output window if you choose to send the output of the
report to the screen. You can specify a custom name or let Clementine automatically name the window for you.
Output to screen. Select this option to generate and display the report in an output window. Note that you have the option to export the report to various file types from the output window.
Output to file. Select this option to generate and save the report as a file of the type specified
\bin directory. Use the ellipsis button (...) to specify a different location.
File type. Available file types are:
HTML document. The report is saved as a single HTML file. If your report contains graphs, they are saved as PNG files and are referenced by the HTML file. When publishing your report on the Internet, make sure to upload both the HTML file and any images it references.
Text document. The report is saved as a single text file. If your report contains graphs, only
Microsoft Word document. The report is saved as a single document, with any graphs
Note: In order to export to a Microsoft Office file, you must have the corresponding application installed.
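For the HTML document option above, it can help to list the image files that a saved report references before uploading it, so that none are left behind. The sketch below assumes the report refers to its graphs with ordinary img tags, which the guide does not spell out; the class name, function name, and sample file names are hypothetical.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def referenced_images(html_text):
    """Return image paths in the order they appear in the HTML."""
    parser = ImageCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources

# Hypothetical report fragment.
sample_report = (
    "<html><body><h1>Project report</h1>"
    '<img src="graph1.png"><img src="graph2.png"/>'
    "</body></html>"
)
print(referenced_images(sample_report))
```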
Title. Specify a title for the report. Report structure. Select either CRISP-DM or Classes. CRISP-DM view provides a status report
with big-picture synopses as well as details about each phase of data mining. Classes view is an object-based view that is more appropriate for internal tracking of data and streams.
Author. The default user name is displayed, but you can change it.
Report includes. Select a method for including objects in the report. Select all folders and objects to include all items added to the project file. You can also include items based on whether Include in Report is selected in the object properties. Alternatively, to check on unreported items, you can choose to include only items marked for exclusion (where Include in Report is not selected).
Select. This option allows you to provide project updates by selecting only recent items in the report. Alternatively, you can track older and perhaps unresolved issues by setting parameters for old items. Select all items to dismiss time as a parameter for the report.
Order by. You can select a combination of the following object characteristics to order them within a folder:
Type. Group objects by type.
Name. Organize objects alphabetically.
Added date. Sort objects using the date they were added to the project.
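Ordering by a combination of characteristics amounts to a multi-key sort, where the first key groups objects and later keys break ties. The sketch below illustrates the idea only; the dictionary keys, object names, and dates are invented, and it does not reflect how Clementine stores project objects.

```python
from datetime import date

SORT_KEYS = ("type", "name", "added")

def order_objects(objects, keys):
    """Sort objects by the given keys in priority order, e.g. ["type", "name"]."""
    unknown = [k for k in keys if k not in SORT_KEYS]
    if unknown:
        raise ValueError(f"unsupported sort keys: {unknown}")
    # A tuple key makes earlier characteristics take precedence over later ones.
    return sorted(objects, key=lambda obj: tuple(obj[k] for k in keys))

# Hypothetical project objects.
objects = [
    {"name": "churn.str", "type": "stream", "added": date(2008, 3, 2)},
    {"name": "age_hist.cou", "type": "graph", "added": date(2008, 3, 5)},
    {"name": "basket.str", "type": "stream", "added": date(2008, 3, 1)},
]

print([o["name"] for o in order_objects(objects, ["type", "name"])])
print([o["name"] for o in order_objects(objects, ["added"])])
```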
Model appliers. Generated models, also known as nuggets. For more information, see
Features in Chapter 5 in Clementine 12.0 Source, Process, and Output Nodes. For more information, see Overview of Output Nodes in Chapter 6 in Clementine 12.0 Source, Process, and Output Nodes.
Other. Any other nodes related to the project. For example, those available on the Field Ops
The report is saved in the format you chose. You can export to the following file types:
HTML
Text
Microsoft Word
Microsoft Excel
Microsoft PowerPoint
Note: In order to export to a Microsoft Office file, you must have the corresponding application installed.
Use the buttons at the top of the window to:
Print the report.
View the report as HTML in an external Web browser.
Chapter 12
Customizing Clementine
Customizing Clementine Options
There are a number of operations you can perform to customize Clementine to your needs. Primarily, this customization consists of setting specific user options such as memory allocation, default directories, and use of sound and color. You can also customize the Nodes palette located at the bottom of the Clementine window.
System Options
You can specify the preferred language or locale for Clementine by choosing System Options from the Tools > Options menu. Here you can also set the maximum memory usage for Clementine. Note that changes made in this dialog box will not take effect until you restart Clementine.
some platforms, Clementine limits its process size to reduce the toll on computers with limited resources or heavy loads. If you are dealing with large amounts of data, this may cause an out of memory error. You can ease memory load by specifying a new threshold.
Use system locale. This option is selected by default and set to English (United States). Deselect
to specify another language from the drop-down list of available languages and locales.
Managing Memory
In addition to the Maximum memory setting specified in the System Options dialog box, there are several ways you can optimize memory usage:
Set up a cache on any nonterminal node so that the data are read from the cache rather than retrieved from the data source when you execute the data stream. This will help decrease the memory load for large datasets. For more information, see Caching Options for Nodes in Chapter 5 on p. 53.
Adjust the Maximum set size option in the stream properties dialog box. This option specifies a maximum number of members for set fields after which the type of the field becomes typeless. For more information, see Setting Options for Streams in Chapter 5 on p. 57.
Force Clementine to free up memory by clicking in the lower right corner of the window where the memory that Clementine is using and the amount allocated are displayed (xxMB / xxMB). Clicking this region turns it a darker shade, after which memory allocation figures will drop. Once the region returns to its regular color, Clementine has freed up all the memory possible.
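The effect of the Maximum set size option can be pictured as a simple threshold on the number of distinct values in a field. The sketch below is a conceptual illustration, not Clementine's implementation; the function name is hypothetical, and treating a field with exactly the maximum number of members as still being a set is an assumption.

```python
def field_type(values, max_set_size):
    """Return "set" while the distinct values fit the limit, else "typeless"."""
    distinct = set(values)
    # Assumption: exactly max_set_size members still counts as a set.
    return "set" if len(distinct) <= max_set_size else "typeless"

print(field_type(["red", "green", "blue", "red"], max_set_size=3))
print(field_type(range(1000), max_set_size=250))
```

A lower limit means that high-cardinality fields stop being tracked member by member sooner, which is why the option appears among the memory-saving measures above.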
Set Directory. You can use this option to set the working directory. The default working directory is based on the installation path of your version of Clementine or from the command line path used to launch Clementine. In local mode, the working directory is the path used for all client-side operations and output files (if they are referenced with relative paths).
Set Server Directory. The Set Server Directory option on the File menu is enabled whenever there is a remote server connection. Use this option to specify the default directory for all server files and data files specified for input or output. The default server directory is $CLEO/data, where $CLEO is the directory in which the Server version of Clementine is installed. Using the command line, you can also override this default by using the -server_directory flag with the clementine command line argument.
Figure 12-2 User Options dialog box, Notifications tab
Show stream execution feedback dialog. Select to display a progress indicator when stream execution has been in progress for three seconds. Warn when a node overwrites a file. Select to warn with an error message when node operations
Note: The .wav files used to create sounds in Clementine are stored in the /media/sounds directory of your installation.
Mute all sounds. Select to turn off sound notification for all events.
New Output / New Model. The options on the right side of this dialog box are used to specify the behavior of the Outputs and Models managers tabs when new items are generated. Select New Output or New Model from the drop-down list to specify the behavior of the corresponding tab. The following options are available:
Select tab. Choose whether to switch to the Outputs or Models tab when the corresponding object is generated during stream execution. Select Always to switch to the corresponding tab in the managers window.
Select If generated by current stream to switch to the corresponding tab only for objects generated by the stream currently visible in the canvas. Select Never to restrict the software from switching to the corresponding tab to notify you of generated outputs or models.
Flash tab. Select whether to flash the Outputs or Models tab in the managers window when new outputs or models have been generated. Select If not selected to flash the corresponding tab (if not already selected) whenever new objects are generated in the managers window. Select Never to restrict the software from flashing the corresponding tab to notify you of generated objects.
Open window (New Output only). Select whether to automatically open an output window upon
generation. Select Always to always open a new output window. Select If generated by current stream to open a new window for output generated by the stream currently visible in the canvas. Select Never to restrict the software from automatically opening new windows for generated output.
Warn when outputs exceed [n] (New Output only). Select whether to display a warning when the number of items on the Outputs tab exceeds a prespecified quantity. The default quantity is 20; however, you can change this if needed.
Scroll palette to make visible (New Model only). Select whether to automatically scroll the Models
tab in the managers window to make the most recent model visible. Select Always to enable scrolling. Select If generated by current stream to scroll only for objects generated by the stream currently visible in the canvas. Select Never to restrict the software from automatically scrolling the Models tab.
Replace previous model (New Model only). Select to overwrite previous iterations of the same
model. Click Default Values to revert to the system default settings for this tab.
Figure 12-3 User Options dialog box, Display tab
Show welcome dialog on startup. Select to cause the welcome dialog to be displayed on startup. The welcome dialog has options to launch the application examples tutorial, open a demo stream or an existing stream or project, or to create a new stream. Standard Fonts & Colors. Options in this control box are used to specify the color scheme of
Clementine and the size of the fonts displayed. Options selected here are not applied until you close and restart the software.
Use Clementine settings. Select to use the default blue-themed Clementine interface. Use Windows settings. Select to use the Windows display settings on your computer. This may
in the stream canvas. Note: Node size for a stream can be specified on the Layout tab of the stream properties dialog box.
Custom Colors. For each of the items listed in the table, select a color from the drop-down list. To specify a custom color, scroll to the bottom of the color drop-down list and select Color.
Chart Category Color Order. This table lists the currently selected colors used for display in newly created graphs. The order of the colors reflects the order in which they will be used in the chart. For example, if a set field used as a color overlay contains four unique values, then only the first four colors listed here will be used. You can specify different colors using the drop-down list for each color number. To specify a custom color, scroll to the bottom of the drop-down list and select Color. Changes made here do not affect previously created graphs.
Click Default Values to revert to the system default settings for this tab.
Enable stream rewriting. Select this option to enable stream rewriting in Clementine. Two types of rewriting are available, and you can select one or both. Stream rewriting reorders the nodes in a stream behind the scenes for more efficient execution, without altering stream semantics.
Optimize SQL generation. This option allows nodes to be reordered within the stream so that more operations can be pushed back using SQL generation for execution in the database. When it finds a node that cannot be rendered into SQL, the optimizer will look ahead to see if there are any downstream nodes that can be rendered into SQL and safely moved in front of the problem node without affecting the stream semantics. Not only can the database perform operations more efficiently than Clementine, but such pushbacks act to reduce the size of the dataset that is returned to Clementine for processing. This, in turn, can reduce network traffic and speed stream operations. Note that Generate SQL must be selected (see below) for this option to have any effect. For more information, see SQL Optimization in Chapter 6 in Clementine 12.0 Server Administration and Performance Guide.
Optimize other execution. This method of stream rewriting increases the efficiency of operations that cannot be delegated to the database. Optimization is achieved by reducing the amount of data in the stream as early as possible. While maintaining data integrity, the stream is rewritten to push operations closer to the data source, thus reducing data downstream for costly operations, such as joins.
Enable parallel processing. When running on a computer with multiple processors, this option allows the system to balance the load across those processors, which may result in faster performance. Use of multiple nodes or use of the following individual nodes may benefit from parallel processing: C5.0, Merge (by key), Sort, Bin (rank and tile methods), and Aggregate (using one or more key fields).
Generate SQL. Select this option to enable SQL optimization, allowing stream operations to be
pushed back to the database by using SQL code to generate execution processes, which may improve performance. To further improve performance, Optimize SQL generation can also be selected in order to maximize the number of operations pushed back to the database. When operations for a node have been pushed back to the database, the node will be highlighted in purple during execution.
Database caching. For streams executed in the database, data can be cached midstream to a temporary table in the database rather than to the file system. When combined with SQL optimization, this may result in significant gains in performance. For example, the output from a stream that merges multiple tables to create a data mining view may be cached and reused as needed. With database caching enabled, simply right-click any nonterminal node to cache data at that point, and the cache is automatically created directly in the database the next time the stream is executed. This allows SQL to be generated for downstream nodes, further improving performance. Alternatively, this option can be disabled if needed, such as when policies or permissions preclude data being written to the database. If database caching or SQL optimization is not enabled, the cache will be written to the file system instead. For more information, see Caching Options for Nodes in Chapter 5 on p. 53.
Note: Due to minor differences in SQL implementation, streams executed in a database may return slightly different results than when executed in Clementine. For similar reasons, these differences may also vary depending on the database vendor.
Display SQL in the messages log during stream execution. Specifies whether SQL generated while
preview, specifies whether a preview of SQL that would be executed is passed to the messages log.
Display SQL. Specifies whether any SQL that is displayed in the log should contain native SQL functions or standard ODBC functions of the form {fn FUNC()}, as generated by Clementine. The former relies on ODBC driver functionality that may not be implemented. For example, this control would have no effect for SQL Server.
Reformat SQL for improved readability. Specifies whether SQL displayed in the log should be
nodes. Specify a number that is used for updating the status every N records. Click Default Values to revert to the system default settings for this tab.
Export PMML. Here you can configure variations of PMML that work best with your target application. Select with extensions for SPSS SmartScore to allow PMML extensions for special cases where there is no standard PMML equivalent. Note that in most cases this will produce the same result as standard PMML. Select as standard PMML V3.1 to export PMML that adheres as closely as possible to the PMML standard.
Standard PMML Options. When standard PMML is selected above, you can choose one of two valid ways to export linear and logistic regression models:
Palette Name. Each available palette tab, whether shown on the Nodes Palette or not, is listed.
This includes any palette tabs that you have created. For more information, see Creating a Palette Tab on p. 196.
No. of nodes. The number of nodes displayed on each palette tab. A high number here means you may find it more convenient to create sub palettes to divide up the nodes on the tab. For more information, see Creating a Sub Palette on p. 197.
Shown?. Select this field to display the palette tab on the Nodes Palette. For more information, see Displaying Palette Tabs on the Nodes Palette on p. 196.
Sub Palettes. To select sub palettes for display on a palette tab, highlight the required Palette Name and click this button to display the Sub Palettes dialog box. For more information, see
and sub palettes and return to the default palette settings, click this button. Note: The Extensions tab on the Palette Manager contains options for displaying user-provided extensions, such as a data-processing routine or a modeling algorithm, created using CLEF nodes in Clementine. For more information, see Introduction to CLEF in Chapter 1 in Clementine 12.0 CLEF Developers Guide. The CEMI tab on the Palette Manager contains options for displaying nodes created using the Clementine External Module Interface (CEMI). Note that the CEMI functionality is deprecated in Clementine 12.0 and will no longer be supported for Clementine 13.0. The CEMI functionality has been replaced with the Clementine Extension Framework (CLEF).
box is displayed.
E Type in a unique Palette name. E In the Nodes available area, select the node to be added to the palette tab. E Click the Add Node right-arrow button to move the highlighted node to the Selected nodes area.
Repeat until you have added all the nodes you want. After you have added all of the required nodes, you can change the order in which they will appear on the palette tab:
E Use the simple arrow buttons to move a node up or down one row. E Use the line-arrow buttons to move a node to the bottom or top of the list. E To remove a node from a palette, highlight the node and click the Delete button to the right of the Selected nodes area.
Figure 12-9 Palette Manager showing the tabs displayed on the Nodes Palette
To permanently remove a palette tab from the Nodes Palette, highlight the node and click the Delete button to the right of the Shown? column. Once deleted, a palette tab cannot be recovered. Note: You cannot delete the default palette tabs supplied with Clementine, except for the favorites tab.
Changing the display order on the Nodes Palette
After you have selected which palette tabs you want to display, you can change the order in which they will appear on the Nodes Palette:
E Use the simple arrow buttons to move a palette tab up or down one row. Moving them up moves
frequently for creating your streams, you could create four sub palettes that break the selections down by source node, field operations, modeling, and output. Note: You can only select sub palette nodes from those added to the parent palette tab.
Figure 12-10 Sub palette creation on the Create/Edit Sub Palette dialog box
The sub palettes you create are displayed on the Nodes Palette when you select their parent palette tab. For more information, see Changing a Palette Tab View on p. 199.
Figure 12-11 Sub palettes available for the Modeling Palette tab
Delete button to the right of the Shown? column. Note: You cannot delete the default sub palettes supplied with the Modeling palette tab.
Changing the display order on the Palette Tab
After you have selected which sub palettes you want to display, you can change the order in which they will appear on the parent palette tab:
E Use the simple arrow buttons to move a sub palette up or down one row. E Use the line-arrow buttons to move a sub palette to the bottom or top of the list.
The sub palettes you create are displayed on the Nodes Palette when you select their parent palette tab. For more information, see Changing a Palette Tab View on p. 199.
For example, for the Modeling palette tab you can choose to display the modeling nodes for one of the following module types: Classification, Association, Segmentation, or Automated (those that enable you to create multiple models). For more information, see Clementine Modules in Chapter 1 on p. 2. To change the nodes shown on a palette tab, select the palette tab and then, from the menu on the left, select to display either all nodes, or just those in a specific sub palette.
Figure 12-12 Modeling palette tab showing the Classification sub palette
Chapter 13
Performance Considerations for Streams and Nodes
You can design your streams to maximize performance by arranging the nodes in the most efficient configuration, by enabling node caches when appropriate, and by paying attention to other considerations as detailed in this section. Aside from the considerations discussed here, additional and more substantial performance improvements can typically be gained by making effective use of your database, particularly through SQL optimization. For more information, see Performance Overview in Chapter 5 in Clementine 12.0 Server Administration and Performance Guide.
Order of Nodes
Even when you are not using SQL optimization, the order of nodes in a stream can affect performance. The general goal is to minimize downstream processing; therefore, when you have nodes that reduce the amount of data, place them near the beginning of the stream. Clementine Server can apply some reordering rules automatically during compilation to bring forward certain nodes when it can be proven safe to do so. (This feature is enabled by default. Check with your system administrator to make sure it is enabled in your installation.)
When using SQL optimization, you want to maximize its availability and efficiency. Since optimization halts when the stream contains an operation that cannot be performed in the database, it is best to group SQL-optimized operations together at the beginning of the stream. This strategy keeps more of the processing in the database, so less data are carried into Clementine.
The following operations can be done in most databases. Try to group them at the beginning of the stream:
Merge by key (join)
Select
Aggregate
Sort
Sample
Append
Distinct operations in include mode, in which all fields are selected
Filler operations
Basic derive operations using standard arithmetic or string manipulation (depending on which operations are supported by the database)
Set-to-flag
The following operations cannot be performed in most databases. They should be placed in the stream after the operations in the above list:
Operations on any nondatabase data, such as flat files
Merge by order
Balance
Distinct operations in discard mode or where only a subset of fields are selected as distinct
Any operation that requires accessing data from records other than the one being processed
State and count field derivations
History node operations
Operations involving @ (time-series) functions
Type-checking modes Warn and Abort
Model construction, application, and analysis
Note: Decision trees, rulesets, linear regression, and factor-generated models can generate SQL and can therefore be pushed back to the database.
Data output to anywhere other than the same database that is processing the data
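The effect of node order on SQL pushback can be sketched in a few lines of Python. This is only an illustration of the rule that optimization halts at the first operation that cannot be performed in the database; the operation labels and the split_pushback helper are hypothetical names, not Clementine identifiers, and the sketch models a simple linear stream.

```python
# Illustrative labels for operations that most databases can perform.
# These names are assumptions for this sketch, not Clementine identifiers.
SQL_CAPABLE = {
    "merge_by_key", "select", "aggregate", "sort", "sample", "append",
    "distinct_include", "filler", "derive_basic", "set_to_flag",
}


def split_pushback(stream):
    """Split a linear stream into (in_database, in_clementine) parts.

    Pushback stops at the first operation that is not SQL-capable;
    everything from that point on is processed outside the database.
    """
    for position, operation in enumerate(stream):
        if operation not in SQL_CAPABLE:
            return list(stream[:position]), list(stream[position:])
    return list(stream), []
```

With the stream ["select", "balance", "sort"], only the Select is pushed back, because Balance halts optimization. If the Sort can be placed before the Balance without changing the results, ["select", "sort", "balance"] keeps two operations in the database.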
Node Caches
To optimize stream execution, you can set up a cache on any nonterminal node. When you set up a cache on a node, the cache is filled with the data that pass through the node the next time you execute the data stream. From then on, the data are read from the cache (which is stored on disk in a temporary directory) rather than from the data source. Caching is most useful following a time-consuming operation such as a sort, merge, or aggregation. For example, suppose that you have a source node set to read sales data from a database and an Aggregate node that summarizes sales by location. You can set up a cache on the Aggregate node rather than on the source node because you want the cache to store the aggregated data rather than the entire dataset.
Note: Caching at source nodes, which simply stores a copy of the original data as they are read into Clementine, will not improve performance in most circumstances and is not typically recommended.
Nodes with caching enabled are displayed with a small document icon at the top right corner. When the data are cached at the node, the document icon is green.
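The caching behavior described above (fill the cache on the first execution, read from it afterward) can be illustrated with a small Python sketch. The CachedNode class and the sales_by_location function are hypothetical; they mimic the idea of caching the aggregated result in a temporary file and do not represent how Clementine implements its caches.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class CachedNode:
    """Wrap an expensive step; fill a disk cache once, then read from it."""

    def __init__(self, compute):
        self._compute = compute
        self._path = None
        self.compute_calls = 0  # counts how often the real work was done

    def execute(self, records):
        # A filled cache short-circuits the expensive computation.
        if self._path is not None and os.path.exists(self._path):
            with open(self._path, "rb") as handle:
                return pickle.load(handle)
        self.compute_calls += 1
        result = self._compute(records)
        descriptor, self._path = tempfile.mkstemp(suffix=".cache")
        with os.fdopen(descriptor, "wb") as handle:
            pickle.dump(result, handle)
        return result

    def flush(self):
        """Discard the cached data so the next execution recomputes it."""
        if self._path is not None and os.path.exists(self._path):
            os.remove(self._path)
        self._path = None


def sales_by_location(records):
    """Aggregate (location, amount) pairs into totals per location."""
    totals = defaultdict(float)
    for location, amount in records:
        totals[location] += amount
    return dict(totals)
```

Wrapping sales_by_location rather than the raw data source mirrors the advice above: the cache then holds the small aggregated result, not the entire dataset.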
Figure 13-1 Caching at the Type node to store newly derived fields
To Enable a Cache
E On the stream canvas, right-click the node and choose Cache from the context menu.
E From the caching submenu, choose Enable.
E You can turn the cache off by right-clicking the node and choosing Disable from the caching submenu.
Caching Nodes in the Database
For streams executed in the database, data can be cached midstream to a temporary table in the database rather than the file system. When combined with SQL optimization, this may result in significant gains in performance. For example, the output from a stream that merges multiple tables to create a data mining view may be cached and reused as needed. By automatically generating SQL for all downstream nodes, performance can be further improved.
To take advantage of database caching, both SQL optimization and database caching must be enabled. Note that Server optimization settings override those on the Client. For more information, see Setting Optimization Options in Chapter 12 on p. 191.
With database caching enabled, simply right-click on any nonterminal node to cache data at that point, and the cache will be created automatically directly in the database the next time the stream is executed. If database caching or SQL optimization is not enabled, the cache will be written to the file system instead.
Note: The following databases support temporary tables for the purpose of caching: DB2, Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database caching. The SQL code can be customized for specific databases by editing properties in the relevant configuration file, for example, C:\Program Files\SPSSInc\Clementine12.0\config\odbc-teradata-properties.cfg. For more information, see the comments in the default configuration file, odbc-properties.cfg, installed in the same folder.
before it allocates records to bins. The dataset is cached while the boundaries are computed; then it is rescanned for allocation. When the binning method is fixed-width or mean+standard deviation, the dataset is cached directly to disk. These methods have a linear running time and require enough disk space to store the entire dataset. When the binning method is ranks or tiles, the dataset is sorted using the sort algorithm described above, and the sorted dataset is used as the cache. Sorting gives these methods a running time of M*N*log(N), where M is the number of binned fields and N is the number of records; it requires disk space equal to twice the dataset size. Generating a Derive node based on generated bins will improve performance in subsequent passes. Derive operations are much faster than binning. For more information, see Previewing the Generated Bins in Chapter 4 in Clementine 12.0 Source, Process, and Output Nodes.
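To show why the tile method inherits the N*log(N) cost of a sort for each binned field, here is a minimal in-memory Python sketch of tile binning. The tile_bins function is a hypothetical helper; it ignores disk caching and uses a simple rank-based rule for ties, which may differ from the rules Clementine applies.

```python
def tile_bins(values, n_tiles):
    """Assign each value a 1-based tile index with roughly equal counts.

    The sort dominates the cost: O(N log N) for one field, so binning
    M fields this way costs on the order of M * N * log(N).
    """
    if n_tiles < 1:
        raise ValueError("n_tiles must be at least 1")
    count = len(values)
    # Indices of the records in ascending order of value (the sort pass).
    order = sorted(range(count), key=lambda index: values[index])
    bins = [0] * count
    # Allocation pass: a record's rank in the sorted order fixes its tile.
    for rank, index in enumerate(order):
        bins[index] = rank * n_tiles // count + 1
    return bins
```

For eight values split into quartiles, tile_bins([50, 10, 40, 20, 30, 60, 80, 70], 4) places the two smallest values in tile 1 and the two largest in tile 4.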
Merge by Key (Join). The Merge node, when the merge method is keys (equivalent to a database join), sorts each of its input datasets by the key fields. This part of the procedure has a running time of M*N*log(N), where M is the number of inputs and N is the number of records in the largest input; it requires sufficient disk space to store all of its input datasets plus a second copy of the largest dataset. The running time of the merge itself is proportional to the size of the output dataset, which depends on the frequency of matching keys. In the worst case, where the output is the Cartesian product of the inputs, the running time may approach N^M. This is rare; most joins have many fewer matching keys. If one dataset is relatively larger than the other(s), or if the incoming data are already sorted by a key field, then you can improve the performance of this node using the Optimization tab. For more information, see Merge Optimization Settings in Chapter 3 in Clementine 12.0 Source, Process, and Output Nodes.
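The two phases described above (sort each input by key, then merge) can be sketched as a two-input sort-merge join in Python. The merge_by_key function is a hypothetical, in-memory illustration of the technique; it is not Clementine's implementation, which works on disk and supports more than two inputs.

```python
def merge_by_key(left, right, key):
    """Inner-join two lists of dicts on `key` using sort, then merge."""
    # Phase 1: sort each input by the key field.
    left = sorted(left, key=lambda record: record[key])
    right = sorted(right, key=lambda record: record[key])

    # Phase 2: advance through both sorted inputs in step.
    output = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of records sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][key] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Every pairing within the two runs is an output record,
            # which is why output size depends on matching-key frequency.
            for left_record in left[i:i_end]:
                for right_record in right[j:j_end]:
                    output.append({**left_record, **right_record})
            i, j = i_end, j_end
    return output
```

When every record on both sides shares the same key, the inner loops emit the full Cartesian product, which corresponds to the worst case noted above.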
Aggregate. When the Keys are contiguous option is not set, this node reads (but does not store) its entire input dataset before it produces any aggregated output. In the more extreme situations, where the size of the aggregated data reaches a limit (determined by the Clementine Server configuration option Memory usage multiplier), the remainder of the dataset is sorted and processed as if the Keys are contiguous option were set. When this option is set, no data are stored because the aggregated output records are produced as the input data are read.
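The difference between the two Aggregate modes can be sketched in Python. Both functions below are hypothetical illustrations that sum (key, amount) pairs: the first assumes that records with the same key are adjacent and emits each total as soon as the key changes; the second makes no such assumption and must hold a running total for every key until the input is exhausted.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_contiguous(records):
    """Yield (key, total) as soon as each run of equal keys ends.

    Only the current run is held, so memory use does not grow with
    the number of distinct keys. Requires contiguous keys.
    """
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


def aggregate_unordered(records):
    """Return totals per key; nothing is output until all input is read."""
    totals = {}
    for key, amount in records:
        totals[key] = totals.get(key, 0) + amount
    return totals
```

If the contiguous version is fed data whose keys are not adjacent, it emits a separate partial total for each run, which is why the option should be set only when the input is known to be grouped by key.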
Distinct. The Distinct node must store all of the unique key fields in the input dataset. In the worst case, where all fields are key fields and all records are unique, it stores the entire dataset. The Distinct node does not perform well for large datasets. If you have a large dataset and the order of the output dataset is not important, you can try an alternative: sort the data on the key fields and then use the CLEM expression @OFFSET with a Select node to select (or discard) the first distinct record from each group.
Type. In some instances, the Type node caches the input data when reading values; the cache is used for downstream processing. The cache requires sufficient disk space to store the entire dataset but speeds up processing.
Evaluation. The Evaluation node must sort the input data to compute tiles. The sort is repeated for each model evaluated because the scores and consequent record order are different in each case. The running time is M*N*log(N), where M is the number of models and N is the number of records.
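The sort-then-select alternative suggested for the Distinct node above can be sketched in Python. The first_of_each_group function is a hypothetical illustration: after sorting, it keeps a record only when its key differs from the key of the record immediately before it, which is roughly what comparing a field with its value one record back achieves. Only one previous key is remembered, rather than every unique key seen so far.

```python
def first_of_each_group(records, key):
    """Keep the first record of each key group after sorting by key.

    `key` is a function that extracts the key from a record.
    """
    ordered = sorted(records, key=key)
    kept = []
    previous = object()  # sentinel that never equals a real key
    for record in ordered:
        current = key(record)
        if current != previous:
            kept.append(record)
        previous = current
    return kept
```

The output follows the sorted order rather than the original record order, which is why this approach suits cases where the order of the output dataset is not important.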
CEMI. CEMI nodes are run as external processes, where any data passed to or from the process
must be stored on disk. A CEMI process node may require two copies of the data on disk: one as input and one as output.
make many passes over the training data. The data are stored in memory up to a limit, and the excess is spilled to disk. Accessing the training data from disk is expensive because the access method is random, which can lead to excessive disk activity. You can disable the use of disk storage for these algorithms, forcing all data to be stored in memory, by selecting the Optimize for speed option on the Model tab of the node's dialog box. Note that if the amount of memory required to store the data is greater than the working set of the server process, part of it will be paged to disk and performance will suffer accordingly. When Optimize for memory is enabled, a percentage of physical RAM is allocated to the algorithm according to the value of the Clementine Server configuration option Modeling memory limit percentage. To use more memory for training neural networks, either provide more RAM or increase the value of this option, but note that setting the value too high will cause paging. The running time of the neural network algorithms depends on the desired level of accuracy. You can control the running time by setting a stopping condition in the node's dialog box.
K-Means. The K-Means clustering algorithm has the same options for controlling memory usage
as the neural network algorithms. Performance on data stored on disk is better, however, because access to the data is sequential.
offset value is not a literal integer; for example, @OFFSET(Sales, Month). The offset value is the field name Month, whose value is unknown until executed. The server must save all values of the Sales field in order to ensure accurate results. Where an upper bound is known, you should provide it as an additional argument; for example, @OFFSET(Sales, Month, 12). This operation instructs the server to store no more than the 12 most recent values of Sales. Sequence functions, bounded or otherwise, almost always inhibit SQL generation.
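The benefit of supplying an upper bound can be illustrated with a bounded buffer in Python. The OffsetLookup class is a hypothetical sketch of the idea, not the server's implementation: with a known maximum look-back, only that many recent values need to be kept, whereas an unbounded look-back forces every value seen so far to be retained.

```python
from collections import deque


class OffsetLookup:
    """Keep at most `max_lookback` recent values of one field."""

    def __init__(self, max_lookback):
        # A deque with maxlen silently drops the oldest value when full.
        self._history = deque(maxlen=max_lookback)

    def push(self, value):
        """Record the field value of the record just processed."""
        self._history.append(value)

    def lookup(self, offset):
        """Return the value `offset` records back (1 = most recent pushed).

        Returns None when the offset is outside the retained window.
        """
        if offset < 1 or offset > len(self._history):
            return None
        return self._history[-offset]

    def stored(self):
        """Number of values currently retained."""
        return len(self._history)
```

With max_lookback set to 12, memory use stays constant no matter how many records pass through, which corresponds to the bounded form of the expression described above.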
Appendix A
Accessibility in Clementine
Overview of Accessibility in Clementine
This release offers greatly enhanced accessibility for all users, as well as specific support for users with visual and other functional impairments. This section describes the features and methods of working using accessibility enhancements, such as screen readers and keyboard shortcuts.
You can select colors for the display of graphs. You can also choose to use your specific Windows settings for the software itself. This may help to increase visual contrast.
E To set display options, from the Tools menu, choose User Options. E Then click the Display tab. The options on this tab include the software color scheme, chart
By turning on or off sounds, you can control the way you are alerted to particular operations in the software. For example, you can activate sounds for events such as node creation and deletion or the generation of new output or models.
E To set notification options, from the Tools menu, choose User Options. E Then click the Notifications tab.
The Notifications tab on the User Options dialog box is also used to control whether newly generated output, such as tables and charts, is launched in a separate window. It may be easier for you to disable this option and open an output window only when desired.
E To set these options, from the Tools menu, choose User Options. E Then click the Notifications tab. E In the dialog box, select New Output from the drop-down list on the right. E Then, in the Open Window group, select Never.
Node Size
Nodes can be displayed using either a standard or small size. You may want to adjust these sizes to fit your needs.
E To set node size options, from the File menu, choose Stream Properties. E Then click the Layout tab. E From the Icon Size drop-down list, select Standard.
Screen readers tend to perform better when the visual contrast is greater on the screen. If you already have a high-contrast Windows setting, you can choose to use these Windows settings for the software itself.
E To set display options, from the Tools menu, choose User Options. E Then click the Display tab.
By turning on or off sounds, you can control the way you are alerted to particular operations in the software. For example, you can activate sounds for events such as node creation and deletion or the generation of new output or models.
E To set notification options, from the Tools menu, choose User Options. E Then click the Notifications tab.
The Notifications tab on the User Options dialog box is also used to control whether newly generated output is launched in a separate window. It may be easier for you to disable this option and open an output window as needed.
E To set these options, from the Tools menu, choose User Options. E Then click the Notifications tab. E In the dialog box, select New Output from the drop-down list on the right. E Then, in the Open Window group, select Never.
Keyboard Accessibility
Major new features have been added to make the product's functionality accessible from the keyboard. At the most basic level, you can press Alt plus the appropriate key to activate window menus (such as Alt-F to access the File menu) or press the Tab key to scroll through dialog box controls. However, there are special issues related to each of the product's main windows and helpful hints for navigating dialog boxes. This section will cover the highlights of keyboard accessibility, from opening a stream to using node dialog boxes to working with output. Additionally, lists of keyboard shortcuts are provided for even more efficient navigation.
Function
Moves focus to the node palettes.
Moves focus to the stream canvas.
Moves focus to the managers window.
Moves focus to the projects window.
Shortcut Key        Function
Ctrl-Alt-R          Displays the Annotations tab for a selected node, enabling you to rename the node.
Ctrl-Alt-U          Creates a User Input source node.
Ctrl-Alt-C          Toggles the cache for a node on or off.
Ctrl-Alt-F          Flushes the cache for a node.
Tab                 On the stream canvas, cycles through all the source nodes in the current stream. On a selected sub-palette, toggles focus between
Any key             In the current stream, gives focus and cycles to the next node whose name starts with the key pressed.
F1                  Opens the Help system at a topic relevant to the focus.
F2                  Starts the connection process for a node selected in the canvas. Use the Tab key to move to the desired node on the canvas, and press the spacebar to finish the connection.
F3                  Deletes all connections for the selected node on the canvas.
Delete, Backspace   Deletes a selected node from the canvas.
Shift-F10           Opens the context menu.
Esc                 Closes a context menu or dialog window.
Ctrl-Alt-X          Expands a SuperNode.
Ctrl-Alt-Z          Zooms in or out of a SuperNode.
Ctrl-E              With focus in the stream canvas, executes the current stream.
A number of standard shortcut keys are also used in Clementine, such as Ctrl-C to copy. For more information, see Using Shortcut Keys in Chapter 3 on p. 30.
Table Shortcuts
Table shortcuts are used for output tables as well as table controls in dialog boxes for nodes such as Type, Filter, and Merge. Typically, you will use the Tab key to move between table cells and Ctrl-Tab to leave the table control. Note: Occasionally, a screen reader may not immediately begin reading the contents of a cell. Pressing the arrow keys once or twice will reset the software and start the speech.
Shortcut Key   Function
Ctrl-W         For tables, reads the short description of the selected roW. For example, "Selected row 2 values are sex, flag, m/f, etc."
Ctrl-Alt-W     For tables, reads the long description of the selected roW. For example, "Selected row 2 values are field = sex, type = flag, sex = m/f, etc."
Ctrl-D         For tables, reads the short Description of the selected area. For example, "Selection is one row by six columns."
Ctrl-Alt-D     For tables, provides the long Description of the selected area. For example, "Selection is one row by six columns. Selected columns are Field, Type, Missing. Selected row is 1."
Ctrl-T         For tables, provides a short description of the selected columns. For example, "Fields, Type, Missing."
Ctrl-Alt-T     For tables, provides a long description of the selected columns. For example, "Selected columns are Fields, Type, Missing."
Ctrl-R         For tables, provides the number of Records in the table.
Ctrl-Alt-R     For tables, provides the number of Records in the table as well as column names.
Ctrl-I         For tables, reads the cell Information, or contents, for the cell that has focus.
Ctrl-Alt-I     For tables, reads the long description of cell Information (column name and contents of the cell) for the cell that has focus.
Ctrl-G         For tables, provides short General selection information.
Ctrl-Alt-G     For tables, provides long General selection information.
Ctrl-Q         For tables, provides a Quick toggle of the table cells. Ctrl-Q reads long descriptions, such as "Sex=Female," as you move through the table using the arrow keys. Pressing Ctrl-Q again toggles to short descriptions (cell contents).
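The difference between the short and long row descriptions in the table above can be illustrated with a small Python sketch. This is not Clementine code; the function name describe_row and the column names are hypothetical and chosen only to mirror the wording of the examples in the table.

```python
from typing import Sequence


def describe_row(row_number: int,
                 columns: Sequence[str],
                 values: Sequence[str],
                 long: bool = False) -> str:
    """Build a spoken-style description of one table row.

    Short form: cell contents only.
    Long form: each cell is announced as "column = contents".
    """
    if long:
        parts = [f"{column} = {value}" for column, value in zip(columns, values)]
    else:
        parts = [str(value) for value in values]
    return f"Selected row {row_number} values are " + ", ".join(parts)


# Hypothetical column names and cell contents, for illustration only.
columns = ["field", "type", "values"]
values = ["sex", "flag", "m/f"]

print(describe_row(2, columns, values))             # short description
print(describe_row(2, columns, values, long=True))  # long description
```

The short form is quicker to listen to when you already know the column order; the long form repeats the column name with every cell, which helps when moving around an unfamiliar table.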
► Right arrow. Moves focus to the Variable File node.
► Spacebar. Selects the Variable File node.
► Ctrl-Enter. Adds the Variable File node to the stream canvas but keeps focus on the node palette.
This way, you are ready to move to the next node and add it without moving back and forth between the canvas and the palette. This key combination also keeps selection on the Variable File node so that the next node added will be connected to it.
► Right arrow 4 times. Moves to the Derive node.
► Spacebar. Selects the Derive node.
► Alt-Enter. Adds the Derive node to the canvas and moves selection to the Derive node. This node
options. Then close the dialog box.
At this point, you can add additional nodes or execute the current stream. Keep in mind the following tips when you are building streams:
When manually connecting nodes, use F2 to create the start and end points of a connection, and use the spacebar to finalize the connection.
Use F3 to delete all connections for a selected node in the canvas.
Once you have created a stream, use Ctrl-E to execute the current stream.
A complete list of shortcut keys is available. For more information, see Shortcuts for Navigating the Main Window on p. 209.
A Clementine dictionary file (Awt.JDF) is available for inclusion with JAWS. To use this file:
► Navigate to the /accessibility subdirectory of your Clementine installation and copy the dictionary file (Awt.JDF).
► Copy it to the directory with your JAWS scripts.
You may already have a file named Awt.JDF on your machine if you have other Java applications running. In this case, you may not be able to use this dictionary file without manually editing it.
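The copy step above can also be scripted. The following Python sketch is only an illustration under stated assumptions: the function name install_dictionary is hypothetical, and the installation and JAWS script directories shown in the comments are placeholders that you must replace with the actual locations on your machine. The sketch refuses to overwrite an existing Awt.JDF, because in that case the two dictionary files need to be merged by hand.

```python
import shutil
from pathlib import Path
from typing import Optional


def install_dictionary(clementine_dir: Path,
                       jaws_scripts_dir: Path,
                       name: str = "Awt.JDF") -> Optional[Path]:
    """Copy the dictionary file into the JAWS scripts directory.

    Returns the path of the new copy, or None if a file with the same
    name already exists there (it is left untouched so that it can be
    edited manually instead of being overwritten).
    """
    source = clementine_dir / "accessibility" / name
    target = jaws_scripts_dir / name
    if target.exists():
        return None
    shutil.copy2(source, target)  # raises FileNotFoundError if source is missing
    return target


# Example call; both paths are placeholders, not real defaults:
# install_dictionary(Path(r"C:\path\to\Clementine"), Path(r"C:\path\to\JAWS\scripts"))
```

A return value of None signals that an Awt.JDF file from another Java application is already present and that the dictionary entries must be combined manually.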
This displays a view similar to the standard tree map, but one that JAWS can read correctly. You can move up, down, right, or left using the standard arrow keys. As you navigate the accessible window, the focus in the Tree Builder window moves accordingly. Use the spacebar to change the selection, or use Ctrl-Space to extend the current selection.
key to move between option buttons. The arrow keys will not work in this context.
Drop-down lists. In a drop-down list for dialog boxes, you can use either the Escape key or the spacebar to select an item and then close the drop-down list. You can also use the Escape key to close drop-down lists that do not close when you have tabbed to another control.
Execution status. When you are executing a stream on a large database, JAWS can lag behind in reading the execution status to you. Press the Ctrl key periodically to update the status reporting.
Using the node palettes. When you first enter a tab of the node palettes, JAWS will sometimes read "groupbox" instead of the name of the node. In this case, you can use Ctrl-right arrow and then Ctrl-left arrow to reset the screen reader and hear the node name.
Reading menus. Occasionally, when you are first opening a menu, JAWS may not read the first menu item. If you suspect that this may have happened, use the down arrow and then the up arrow to hear the first item in the menu.
Cascaded menus. JAWS does not read the first level of a cascaded menu. If you hear a break in speaking while moving through a menu, press the right arrow key to hear the child menu items.
Additionally, if you have Text Mining for Clementine installed, the following tips can make the interactive workbench interface more accessible to you.
Entering dialog boxes. You may need to press the Tab key to put the focus on the first control
pane, extracted results pane, or library tree, you can type the first letter of the element when the pane has the focus. This will select the next occurrence of an element beginning with the letter you entered.
Drop-down lists. In a drop-down list for dialog boxes, you can use the spacebar to select an item and then close the drop-down list.
Additional tips for use are discussed at length in the following topics.
Appendix
Unicode Support
The goal of the Unicode standard is to provide a consistent way to encode multilingual text so that it can be easily shared across borders, locales, and applications. The Unicode Standard, now at version 4.0.1, defines a character set that is a superset of all of the character sets in common use in the world today and assigns to each character a unique name and code point. The characters and their code points are identical to those of the Universal Character Set (UCS) defined by ISO-10646. For more information, see the Unicode Home Page (http://www.unicode.org).
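The idea that each character has a unique name and code point can be seen in a short Python sketch. This example is independent of Clementine; it uses the unicodedata module from the Python standard library, and the helper name describe is an illustrative choice.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point (U+XXXX notation) and official Unicode name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# A Latin letter, an accented letter, and the euro sign.
for char in ["A", "\u00e9", "\u20ac"]:
    print(describe(char))
# prints:
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN
```

The same code point identifies the character regardless of locale or application, which is what allows multilingual text to be exchanged consistently.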