Multi-factor analysis model approaches: 8 determining factors for selecting the model approach - CSDN Blog

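This article discusses how to choose a suitable multi-factor analysis model in data science. It lists eight important factors that determine the choice of model, including the nature of the problem, data type, predictive power, and interpretability, and offers guidance for building models in machine learning and artificial intelligence projects.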

Multi-factor analysis model approaches

Finally, all the data were cleansed and ready to analyze. Andy enthusiastically started visualizing the data to get a first impression. There were so many dimensions and variables that he spent several days analyzing them visually and determining the best methods to apply. At the end of that week, the team manager told him that he would need a draft presentation of the outcomes by next Tuesday, because the manager had to present it to a steering committee in one week.

Andy told him that he had no results yet. But there was no room for negotiation. On Tuesday, conclusions had to be delivered and integrated into a PowerPoint presentation.

Hastily, Andy produced some regression analyses and integrated them into the presentation.

After the steering committee meeting, the team manager told him that the project would not be continued.

Andy was very frustrated. That was his second project, and the second time it ended with the same decision. He had chosen this position because of the potential for doing great data science work on the large amount of data available.

This story is a real case, and it is not an atypical situation in corporations. I assume that some of you have already experienced a similar situation, too.

The reason this happens is not your skills.

When you are thrown into a data science project in a corporate environment, the situation is different from the learning context you came from.

My experience is that most data scientists struggle to manage the project, given the many corporate constraints and expectations.

More than a few data scientists are disappointed and frustrated after their first projects and start looking for another position.

Why?

They are trained in handling data, technical methods, and programming. Nobody ever taught them project, stakeholder, or corporate data management, or educated them about corporate business KPIs.

It is the lack of experience with unspoken corporate practices.

Unfortunately, there are more potential pitfalls in that area than in all your technical work.

If you know the determining factors, you can plan your data science tasks accordingly, pursue satisfying projects, and steer your work.

In the following, I give you the eight most important drivers of model approach selection in the corporate environment and show how to address them.

1. Time, timelines, and deadlines

What you need to know

Corporations have defined project processes. Stage-gate or steering committee meetings, where outcomes must be presented, are part of them. Presentations have to be submitted a few days in advance and must contain certain expected information. Also, corporates are always under pressure to deliver financial results, which leads to consistently tight deadlines. These processes are part of the corporate culture: they are unspoken, and employees are simply expected to know them.

How to address it?

Ask, ask, ask. Ask about the milestones, e.g., the dates of the meetings where project decisions will be made.

Set up a time budget. Start at the milestone’s date and calculate the project schedule backward.

Include not only your tasks but also the surrounding activities, like coordination meetings, presentations, and deadlines for submitting the presentations. Do not forget that there is a review round for each presentation, so plan a few extra days before submission. Include time margins for unexpected tasks and troubleshooting.

Only then choose your approaches, based on whether you can carry them out within that schedule. Choose methods that run quickly and that you are familiar with. Once you have a few successful results and, hopefully, still some time, start experimenting with more complex and new methods.
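
The backward calculation from a milestone can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the task names, the durations, and the milestone date are hypothetical, working days are taken to be Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta


def subtract_workdays(end: date, workdays: int) -> date:
    """Step back `workdays` working days (Monday to Friday) from `end`."""
    current = end
    remaining = workdays
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def backward_schedule(milestone: date, tasks: list[tuple[str, int]]):
    """Plan tasks backward from the milestone.

    `tasks` is given in chronological order as (name, working days).
    Returns (name, latest start, due date) in chronological order.
    """
    plan = []
    due = milestone
    for name, days in reversed(tasks):
        start = subtract_workdays(due, days)
        plan.append((name, start, due))
        due = start  # the previous task must be finished by this date
    return list(reversed(plan))


# Hypothetical example: steering committee on Tuesday, 18 June 2024.
tasks = [
    ("Analysis", 5),
    ("Buffer for troubleshooting", 2),
    ("Review round", 2),
    ("Submission lead time", 3),
]
for name, start, due in backward_schedule(date(2024, 6, 18), tasks):
    print(f"{name}: start by {start}, due {due}")
```

The first row of the output is the date by which the analysis has to start. If that date is already in the past, the scope or the method has to shrink, not the review round or the buffer.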

Example

Human Resources (HR) urgently needed to know the patterns behind HR management’s key success factors with respect to the business departments and employees. Setting up the schedule backward from the deadline, we decided to perform only simple linear regressions, without considering any interdependencies between these key success factors, e.g., between the level of education and the training courses attended. We focused on accurately fitting simpler models and on identifying single contributing factors with high reliability.

2. Required accuracy of the models and the results

What you need to know

The data that are available and ready to use determine the accuracy a model can achieve. So, the level of detail of the model and the granularity of the data must match. The same is true for the expected granularity of the outcome: the method must match the expectations. Any mismatch will give unreliable results.

How to address it?

Select the model according to the granularity of the available data. Do not waste your time fitting a very detailed and accurate model when there is no proper data. When the data quality is poor, aggregating the data and using a less granular model gives more reliable results.

When the level of accuracy needed for decision-making does not match the level that the data can deliver, escalate it as early as possible. Do not try to make something up. Only transparent communication helps: it prevents surprises and manages expectations. Otherwise, you will be blamed.
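
The aggregation step mentioned above can be sketched as follows. The record layout (dictionaries with `month` and `value` fields) and the numbers are assumptions made for illustration only; in practice the grouping key would be whatever coarser level the decision actually needs.

```python
from collections import defaultdict
from statistics import mean


def aggregate_mean(records, key):
    """Group records by `key(record)` and average their 'value' field."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["value"])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical noisy daily records, aggregated to monthly level.
daily = [
    {"month": "2024-01", "value": 10},
    {"month": "2024-01", "value": 20},
    {"month": "2024-02", "value": 30},
    {"month": "2024-02", "value": 50},
]
monthly = aggregate_mean(daily, key=lambda record: record["month"])
print(monthly)
```

A simpler model is then fitted on the aggregated values instead of the raw records. The price is a loss of detail, which should be communicated together with the results.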

Example

When we analyzed the patterns influencing nursing homes’ profitability, the granular data were too inhomogeneous, and the results made no economic sense. So, we aggregated the data and applied simpler models. Based on these results, the authority could already make essential decisions and put guidelines in place for future data management and collection.

3. The relevance of the methods

What you need to know

The right problem must be solved with a suitable method. The question to be answered must be clear and should not permit any ambiguity. Also, the form of the outcomes must be comparable with other internal and external analyses. Both point toward the methodology that should be used.

How to address it?

Make sure that you understand the question that has to be answered. Please do not assume it! Ask! The most accurate method does not help when it answers the wrong question.

Based on that, you can determine whether the problem falls into the descriptive, predictive, or prescriptive field. If you are looking for the most influential factors, choose descriptive methods. When the question is to forecast, choose a predictive approach. Only when the aim is optimized decision-making under the various effects, choose prescriptive models. Do not try to be creative. My experience is that, in most cases, it goes wrong.

Example

Three years ago, my former team strongly opposed me and pushed to implement a trendy new time series method for asset return forecasts. In the end, they simply went ahead with it. Oh yes, I was angry, but we could not move back because of the deadline. For three years, they could not get adequate results without a lot of adjustment effort. Recently, one of my former team members told me that they had finally moved back to the old model because the new model included several features that were not relevant to the outcome but added too much noise.

4. Accuracy of data

What you need to know

The accuracy of the data restricts the pool of possible methods. Very accurate methods do not bring any value when used with less accurate data; the error term will be high. Again, the accuracy of the data and the accuracy of the methods must match. Bad quality affects the results: garbage in, garbage out.

How to address it?

Understand the data as well as the requirements of the models. Do not apply methods just by trial and error. Do not replicate a method just because it has given excellent results in other, similar cases. You need to tailor the methods to the accuracy of the data.

Example

In optimizing the operating room capacities of two hospitals, we had to apply two different approaches. In one hospital, granular data were available for every step, e.g., beginning of anesthesia, entering the operating room, beginning of the surgery, and so on. The data were of good quality because they were recorded electronically in real time.

In the other hospital, the data were recorded manually, sometimes with hours of delay, and were thus very imprecise. For example, the data showed eight surgeries running in parallel in six operating rooms.

In the first case, we could fit granular time series and agent-based models and consider the data’s seasonality. In the second case, in contrast, we had to rebuild the models: we worked with regression analysis and smoothed out the inconsistencies before using the results as input for a less granular agent-based model.
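
An implausible record like "eight parallel surgeries in six rooms" can be detected with a simple plausibility check before any model is chosen. The following is a minimal sketch, not the check used in the project: it assumes each surgery is available as a (start, end) pair, here in minutes since midnight, and the sample intervals are made up.

```python
def max_concurrent(intervals):
    """Return the peak number of overlapping [start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a surgery begins
        events.append((end, -1))    # a surgery ends
    # Sort by time; at equal times, ends (-1) are processed before starts (+1),
    # so back-to-back surgeries in one room do not count as overlapping.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def is_overbooked(intervals, rooms):
    """True if the records claim more parallel surgeries than rooms exist."""
    return max_concurrent(intervals) > rooms


# Hypothetical records: three surgeries, two operating rooms.
surgeries = [(480, 600), (540, 660), (600, 720)]
print(max_concurrent(surgeries), is_overbooked(surgeries, rooms=2))
```

If such a check fails on a noticeable share of the days, that is a strong hint to move to aggregated data and a less granular model.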

5. Data availability and cost to make data ready to use

What you need to know

How often have I heard, ‘We would have the perfect model if we could have this and that data, but unfortunately, we cannot access them in due time.’ The fact is that today, corporates are only able to use between 12% and about 30% of their data. In the discussions I have, companies mostly state that they are using around 20% of their data. The cost of accessing the rest is, in most cases, too high, and there is no business case to justify it. If no business case covers the cost of making the data available, you will not get the data in due time.

How to address it?

Before you give all your thought to the fancy models you could apply, clarify what data are available in due time and what it costs to get them. Just because ‘the data is available’ in a company does not mean that it is available within a reasonable time frame and at a reasonable cost.

Prioritize the data based on the other seven drivers given in this article, and make a cost-benefit analysis in each case: what is the additional benefit of having the data, from the business perspective, compared to the cost of getting them? Never ask, ‘Can you give me all the data?’ It shows that you have no understanding of the corporate’s business processes, and you will be de-prioritized when you need support, e.g., from IT.
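
One possible way to make this cost-benefit comparison explicit is a small ranking like the one below. It is a sketch under assumptions: the `DataSource` fields, the source names, and all figures are hypothetical, and benefit and cost are rough estimates in the same currency unit that you would have to agree on with the business.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    benefit: float       # estimated additional business benefit
    cost: float          # estimated cost to make the data ready to use
    lead_time_days: int  # estimated time until the data are usable


def prioritize(sources, deadline_days):
    """Keep sources that arrive in time and pay off, ranked by net benefit."""
    feasible = [
        s for s in sources
        if s.lead_time_days <= deadline_days and s.benefit > s.cost
    ]
    return sorted(feasible, key=lambda s: s.benefit - s.cost, reverse=True)


# Hypothetical candidates for a project with 30 days until the milestone.
candidates = [
    DataSource("archived tapes", benefit=100.0, cost=80.0, lead_time_days=120),
    DataSource("CRM extract", benefit=50.0, cost=10.0, lead_time_days=10),
    DataSource("web logs", benefit=30.0, cost=5.0, lead_time_days=5),
    DataSource("survey", benefit=5.0, cost=20.0, lead_time_days=3),
]
for source in prioritize(candidates, deadline_days=30):
    print(source.name, source.benefit - source.cost)
```

A list like this also gives you a concrete, limited request to bring to IT instead of asking for everything.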

Example

In the pattern recognition work on a global bank’s intra-day liquidity data, we were unexpectedly faced with storage format issues. One of the required transaction data sets from the prior year was archived on magnetic tapes. Because of release cycles and the transformation into accessible formats, it would have taken several months until the data were available. We had to assess alternative data and adjust the models.

6. Data privacy and confidentiality

What you need to know

Customer data are often confidential. Data privacy is regulated by law, e.g., the GDPR in the EU or the CCPA in the State of California. Financial institutions have their own regulations to protect so-called CID, client identifying data. Only authorized people have access to such data, and data scientists are rarely amongst them. The data can only be used in anonymized, encrypted, or aggregated form, and only after approval from the data owners, the security officer, and legal counsel.

How to address it?

Before you start the project, clarify whether it involves any personal data that fall under these restrictions. If yes, address it as early as possible: on the one hand with IT, because they may already have encryption tools to deal with it, and on the other hand with the legal counsel. Work with the data only after you have all approvals and appropriate encryption in place. I have seen many projects that could not be carried out, not because of the data privacy laws themselves but because the issue was addressed too late and there was not enough time to get the approvals and encrypt the data.
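
To give an idea of what such a tool might do, here is a minimal sketch of replacing client identifiers with keyed hashes using Python's standard `hmac` and `hashlib` modules. This is an assumption for illustration, not a description of any bank's tooling: the key value is a placeholder, and a keyed hash is pseudonymization rather than anonymization. Whether it is sufficient is for the data owner, the security officer, and legal counsel to decide.

```python
import hashlib
import hmac


def pseudonymize(client_id: str, secret_key: bytes) -> str:
    """Replace a client identifier with a keyed SHA-256 hash (hex string).

    The same id and key always give the same token, so records can still be
    joined, but the id cannot be read off the token without the key.
    """
    return hmac.new(secret_key, client_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key: in practice it would be issued and stored by IT or security.
secret_key = b"replace-with-a-managed-secret"

# Hypothetical transaction records with a client identifier.
transactions = [
    {"client_id": "C-1001", "amount": 250.0},
    {"client_id": "C-1002", "amount": 80.0},
    {"client_id": "C-1001", "amount": 40.0},
]
safe = [
    {"client": pseudonymize(t["client_id"], secret_key), "amount": t["amount"]}
    for t in transactions
]
print(safe[0]["client"] == safe[2]["client"])  # same client, same token
```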

Example

In a project where credit card transaction data had to be used for third-party service analytics, the lawyers needed seven months to clarify and approve the data use. The clarification covered not only the legal aspects but also the method of encryption, the aggregation level to be used, and technical requirements like access rights and containerization of software.

7. Resources, infrastructure, and tools availability

What you need to know

Projects in a corporate environment involve many different departments: IT, the business, an innovation team, or an internal consulting group. All of them are involved in several projects in parallel, and their time is limited.

You need storage and computational power. Corporate rules about software installation are in place, and the corresponding approvals are required. If a tool requires a paid license, it has to go through a corporate approval process. As a data scientist, you need not only Python and Jupyter Notebook but most probably other tools like Tableau or Alteryx as well. Some companies require containers like Docker. And some tools are not permitted by corporate policy.

How to address it?

Clarify the tools and infrastructure before you start the actual project. Estimate the storage and computational power needed, and ensure that they will be available. Clarify the corporate’s policy on data science software and which tools are available. Inform the people from the other departments early about the support you will need so that they can plan some dedicated time. When working in an existing data science team, you can clarify this first with your line manager. But even in an established data science team, do not assume that everything you will need for a project is in place.

Example

While working on a large amount of transactional data in a bank, we needed more computational and storage power. We worked in a private cloud environment, where it typically takes only a few minutes to a few hours until capacity is added. However, because we worked with client identifying data in a so-called red zone environment, a virtual zone with very restrictive security, the infrastructure had to be ‘red zone’ certified by the security officer. And that took two weeks.

8. Product and project management KPIs of the company

What you need to know

Corporates measure product and project management with KPIs. There are quantitative measures like the net present value for short-term projects or the break-even point for products. And there are qualitative benefits like a shortened time to market, learnings from a project that can be leveraged in other projects, etc. Decisions on and approvals of projects are based on such metrics.

How to address it?

No matter how great the results of your data science work are, they should always be translated into the company’s KPIs. So, clarify with your line manager what the company’s steering measures are. Translate your outcomes into these metrics and communicate what the benefits for the company are. My experience is that decision-makers then stop fewer projects, more projects are implemented in the company’s processes, and, finally, it builds a lot of trust in the data science team’s work.
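
As an illustration of such a translation, the sketch below computes a net present value and a simple break-even period from a series of estimated cash flows. The discount rate and the cash flows are hypothetical, and break-even is interpreted here as the undiscounted payback period, which may differ from your company's definition, so check the finance department's template first.

```python
def net_present_value(rate, cash_flows):
    """Discount cash flows to today.

    cash_flows[0] occurs now (t = 0); cash_flows[t] occurs at the end of period t.
    """
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def break_even_period(cash_flows):
    """First period in which the cumulative (undiscounted) cash flow is >= 0.

    Returns None if the project never pays back within the given horizon.
    """
    total = 0.0
    for t, cf in enumerate(cash_flows):
        total += cf
        if total >= 0:
            return t
    return None


# Hypothetical project: 100 invested now, savings of 40 per year for four years.
flows = [-100.0, 40.0, 40.0, 40.0, 40.0]
print(round(net_present_value(0.08, flows), 2))
print(break_even_period(flows))
```

The cash flows are where the data science result enters: for example, the expected savings from a forecast model, estimated conservatively and agreed with the business side.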

Example

One department of a life sciences company tried for months to get internal funding for its intended data science projects, even though data and data science are pillars of the company’s strategy. They finally asked me to support them. We found out that the finance department has investment templates for projects that include the company’s metrics. So, we asked for that template and put all the data science blueprints into it. After the next presentation round, 60% of all their projects were approved. The trigger was that the executive committee could now compare them with the company’s KPIs and other projects’ performance.

Connecting the Dots

Many data scientists are not aware that working in a corporate environment involves up to 80% of tasks other than setting up models and analyzing data. And you may be a bit frustrated after reading all my comments.

But knowing the above factors and addressing them early and proactively puts you back in the driver’s seat and avoids bad surprises. The goal is to gain as much freedom as possible for our tasks. It increases project success, and you can keep time free for experimenting with more complex and new approaches.

Data scientists are not trained in managing such factors and often do not expect them. Managing them properly is more important than all your detailed technical knowledge.

None of my tips and tricks for addressing these determining factors is rocket science or a secret. But it is vital to raise your awareness of them. I hope this enables you to have more control over, and more fun with, your projects.