How to improve your data instincts
Published: 2019-05-11

With recent advances in machine learning and AI research making headlines on a regular basis these days, it’s little surprise that data science has become an area of real mainstream interest.

It certainly makes a great career choice for the analytically minded, requiring a blend of solid programming skills and in-depth technical knowledge.

However, behind the show-stealing acts of dueling neural networks and distributed computing are some fundamental statistical practices that every aspiring data scientist should be deeply familiar with.

You can read up on the latest programming frameworks or advances in the scientific literature as required for a specific project. But there are no shortcuts towards acquiring the underlying statistical know-how that makes for an effective data scientist.

Only practice, patience, and maybe just a little learning-the-hard-way, will truly sharpen your “data instincts”.

The principle of parsimony

It’s repeated to the point of cliché in introductory stats courses, but the words of British statistician George Box are perhaps more relevant today than ever before:

“All models are wrong, but some are useful”

What does this statement actually mean?

It means that when seeking to model a real-world system, you necessarily have to simplify and generalise at the expense of explanatory power.

The real world is messy and noisy and difficult to understand to the finest detail. Statistical modelling therefore strives not to achieve perfect predictive power but, rather, maximal predictive power with the minimal necessary model.

This concept can appear counter-intuitive to those new to the world of data. Why not include as many terms in a model as possible? Surely extra terms can only add further explanatory power to the model?

Well, yes… and no. You need only care about terms which bring with them a statistically significant increase in explanatory power.

Consider the different types of model that can be fitted to a given data set.

The most basic is the null model, which has only one parameter: the overall mean of the response variable (plus some random error).

This model posits that the response variable doesn’t depend on any of the explanatory variables. Instead, its values are entirely explained by random fluctuation about the overall mean. This obviously limits the model’s explanatory power somewhat.

The polar opposite concept is the saturated model, which has one parameter for every single data point. Here, you have a perfectly fitted model, but one which has no explanatory power should you throw any new data at it.

Including one term per data point also neglects to simplify in any meaningful way. Again — not exactly useful.

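A minimal Python sketch of these two extremes may help. The data and the helper names (`fit_null`, `fit_saturated`, `rss`) are invented for illustration and are not part of the original article.

```python
from statistics import mean

# Hypothetical training data: (x, y) pairs invented for illustration.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

def fit_null(data):
    """Null model: one parameter, the overall mean of the response."""
    y_bar = mean(y for _, y in data)
    return lambda x: y_bar

def fit_saturated(data):
    """Saturated model: one parameter per data point (a lookup table)."""
    table = dict(data)
    return lambda x: table[x]  # raises KeyError for any x it has not seen

def rss(model, data):
    """Residual sum of squares of a model on a data set."""
    return sum((y - model(x)) ** 2 for x, y in data)

null_model = fit_null(train)
saturated_model = fit_saturated(train)

print("null RSS on training data:     ", rss(null_model, train))
print("saturated RSS on training data:", rss(saturated_model, train))

# The saturated model fits perfectly but has nothing to say about new data.
try:
    saturated_model(6)
except KeyError:
    print("saturated model cannot predict x = 6")
```

The null model predicts the same value everywhere, so it misses the trend. The saturated model reproduces the training data exactly, yet fails outright on an unseen x.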

Clearly, those are extreme cases. You should seek a model somewhere in between — one which fits the data well and has good explanatory power. You could try fitting the maximal model. This model includes terms for all factors and interaction terms under consideration.

For example, say you have a response variable y which you want to model as a function of explanatory variables x₁ and x₂, multiplied by coefficients β. The maximal model would look like this:

y = intercept + β₁x₁ + β₂x₂ + β₃(x₁x₂) + error

This maximal model will hopefully fit the data pretty well, and also provide good explanatory power. It includes one term for each explanatory variable, and an interaction term, x₁x₂.

Removing terms from the model will increase the overall residual deviance, or the proportion of observed variation the model’s predictions fail to account for.

However, not all terms are equal. You may be able to remove one (or more) terms, without seeing a statistically significant increase in deviance.

Such terms can be considered insignificant, and removed from the model. You can remove insignificant terms one-by-one (remembering to recalculate the residual deviance at each step). Repeat this until all terms remaining carry statistical significance.

Now you have arrived at the minimal adequate model. The estimates for each term’s coefficient β are significantly different from zero. The step-by-step eliminative approach used to arrive here is a form of “stepwise” regression known as backward elimination.

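The simplification steps described above can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the data are simulated so that y depends on x1 only, the helpers (`solve`, `fit_rss`, `f_statistic`) are hypothetical names, and the residual sum of squares from a least-squares fit stands in for the residual deviance.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_rss(columns, y):
    """Least-squares fit of y on the given columns; returns the residual sum of squares."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[r] for b, col in zip(beta, columns)) for r in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))

def f_statistic(rss_full, rss_reduced, n, p_full):
    """F statistic for dropping ONE term from a model with p_full parameters."""
    return (rss_reduced - rss_full) / (rss_full / (n - p_full))

# Simulated data (an assumption for illustration): y depends on x1 only.
random.seed(0)
n = 100
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 3.0 * a + random.gauss(0, 1) for a in x1]

ones = [1.0] * n
x1x2 = [a * b for a, b in zip(x1, x2)]

rss_maximal = fit_rss([ones, x1, x2, x1x2], y)  # intercept + x1 + x2 + x1*x2
rss_no_inter = fit_rss([ones, x1, x2], y)       # interaction term removed
rss_no_x1 = fit_rss([ones, x2], y)              # x1 removed as well

f_interaction = f_statistic(rss_maximal, rss_no_inter, n, 4)
f_x1 = f_statistic(rss_no_inter, rss_no_x1, n, 3)
print(f"F for dropping the interaction: {f_interaction:.2f}")
print(f"F for dropping x1:              {f_x1:.2f}")
```

As a rough guide, the F value for a single dropped term can be compared against the 5% critical value of the F distribution, which is roughly 3.9 for 1 and about 96 degrees of freedom. Dropping x1 should give a value far above that. The interaction term plays no part in how the data were simulated, so it would usually fall below it.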

The philosophical principle underpinning this drive towards model simplicity is known as the principle of parsimony.

It bears some resemblance to the medieval philosopher William of Ockham’s famous heuristic, Occam’s razor. This goes along the lines of: “given two or more equally acceptable explanations for a phenomenon, work with the one which introduces the fewest assumptions.”

Or, in other words: can you usefully explain something complex in the simplest way possible? Arguably, this is the defining pursuit of data science — efficiently translating complexity into insight.

Always be sceptical

Hypothesis testing is an important data science concept.

Simply put, hypothesis testing works by reducing a problem to two mutually exclusive hypotheses, and asking under which hypothesis the observed value of a given test statistic is most probable. The test statistic is, of course, calculated from some appropriate set of experimental or observational data.

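One concrete way to see this logic at work is a permutation test, sketched below. The choice of test, the function name `permutation_test` and the measurements are all assumptions made for illustration; the article does not prescribe a particular test.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: the group labels are irrelevant, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)

# Hypothetical measurements from two groups.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
variant = [12.0, 12.2, 11.9, 12.1, 12.3, 11.8, 12.4, 12.0]

p_value = permutation_test(control, variant)
print(f"estimated p-value: {p_value:.3f}")
```

The test statistic here is the absolute difference in group means. The p-value estimates how often a difference at least that large would turn up if the null hypothesis were true.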

When it comes to hypothesis testing, you are usually asking whether you accept or reject the null hypothesis.

Often, you hear people describe the null hypothesis as something of a let-down, or even evidence of experimental failure.

Maybe it stems from how hypothesis testing is taught to beginners, but it seems many researchers and data scientists have a subconscious bias against the null hypothesis. They seek to reject it in favour of the supposedly more exciting, more interesting, alternative hypothesis.

This isn’t just an anecdotal problem. Entire papers have been written on this bias within the scientific literature. One can only wonder how this tendency manifests itself within a commercial context.

Yet the fact of the matter is this: for any properly designed experiment or complete-enough data set, accepting the null hypothesis should be just as interesting as accepting the alternative.

Indeed, the null hypothesis is a cornerstone of inferential statistics. It defines what we do as data scientists, which is to turn data into insights. Insights are worth nothing if we’re not hyper-selective about what findings pass muster, and it is for this reason it pays to be ultra-sceptical at all times.

This is especially so, given how easy it is to “accidentally” reject the null hypothesis (at least when applying a frequentist approach naïvely).

Data dredging (or ‘p-hacking’) can throw up all manner of meaningless results, which nevertheless appear statistically significant. Where multiple comparisons are unavoidable, there are no excuses for not taking steps to minimise type I errors (false positives, or “seeing effects which are not really there”).

  • For a start, when it comes to statistical tests, pick one which is inherently cautious. Check that the test’s assumptions about your data are properly met.

  • It is also important to look into corrections for multiple comparisons. However, these methods are sometimes criticised for being overly cautious. They can reduce statistical power by producing too many type II errors (false negatives, or “ignoring effects which do actually exist”).

  • Look for “null” explanations for your results. How suitable were your sampling/data collection procedures? Can you rule out any systematic errors? Could some other, less interesting effect account for what you see?

  • And finally, how plausible are any potential relationships you’ve found? Never take anything at face value, no matter how low the p-value may be!

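To make the multiple-comparisons problem concrete, here is a small Python sketch. The Bonferroni correction is used as one well-known example of a cautious correction; it is chosen here for illustration, and the p-values are hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni correction: test each p-value against alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# With 20 independent tests at alpha = 0.05, a false positive is more likely than not.
print(f"{family_wise_error_rate(0.05, 20):.3f}")  # 0.642

# Hypothetical p-values from three comparisons.
p_values = [0.001, 0.02, 0.04]
print([p <= 0.05 for p in p_values])  # uncorrected: [True, True, True]
print(bonferroni_reject(p_values))    # corrected:   [True, False, False]
```

The corrected threshold is 0.05 / 3, so only the smallest p-value survives. This also shows the trade-off mentioned in the list: if the second or third effect is real, the correction has just produced a type II error.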

Scepticism is healthy, and in general it is good practice to always be mindful of null explanations for your data.

But avoid paranoia! If you’ve designed your experiment well, and analysed your data cautiously, then go ahead and take your findings as real!

Know your methods

Recent technological and theoretical advances have provided data scientists with a range of powerful new tools for solving complex problems that would not have been feasible to tackle even a decade or two ago.

There is a great deal of excitement surrounding these advances in machine learning, and for good reason. However, it is all-too-easy to overlook any limitations there might be in applying them to a given problem.

As an example, neural networks might be brilliant at classifying images and recognising handwriting, but they’re by no means a perfect solution for all problems. For a start, they are very prone to overfitting: that is, getting too familiar with the training data and being unable to generalise to new cases.

Take their opacity as well. The predictive power of neural networks often comes at the cost of transparency. Thanks to the internalisation of feature selection, even if a network makes an accurate prediction, you don’t necessarily understand how it arrived at its answer.

In many business and commercial applications, understanding “how-and-why” is often the most important outcome of an analytical project. Ceding this understanding for the sake of predictive accuracy may or may not be a trade-off worth making.

Likewise, it’s tempting to rely on the accuracy of a sophisticated machine learning algorithm, but they’re absolutely not infallible.

For example, Google’s image recognition, which is generally very impressive, can be fooled by even a small amount of noise in an image. Conversely, another fascinating research paper has shown how deep neural networks can be fooled into confidently recognising objects in images that contain none.

It’s not just cutting-edge Machine Learning methods which need to be used with wariness.

Even with more traditional modelling approaches, care needs to be taken that key assumptions are met. Always eye extrapolation beyond the scope of the training data, if not with suspicion, then at least with caution. With every conclusion you draw, always ask if your methods justify doing so.

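A short Python sketch shows why extrapolation deserves caution. It assumes a hypothetical true relationship (y = x squared) that the modeller does not know, and fits a straight line to data covering only a narrow range. The function names are illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def truth(x):
    """Hypothetical 'real' relationship, unknown to the modeller."""
    return x ** 2

# Training data only covers 0 <= x <= 1.
train_x = [i / 10 for i in range(11)]
train_y = [truth(x) for x in train_x]
intercept, slope = fit_line(train_x, train_y)

def predict(x):
    """Prediction from the fitted straight line."""
    return intercept + slope * x

worst_in_range = max(abs(predict(x) - truth(x)) for x in train_x)
error_at_10 = abs(predict(10) - truth(10))
print(f"worst error inside the training range: {worst_in_range:.2f}")  # 0.15
print(f"error when extrapolating to x = 10:    {error_at_10:.2f}")     # 90.15
```

Inside the training range the straight line is a tolerable approximation. Ten times further out it is wrong by almost the full size of the true value, and nothing in the fit itself warns you of that.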

This isn’t to say don’t trust any method at all — just to be aware at all times why you’re using one method over another, and what the relative pros/cons might be.

As a general rule, if you can’t come up with at least one drawback of a method you’re considering, research it further before proceeding. Always use the simplest tool that will do the job.

Knowing when it is and isn’t appropriate to use a given approach is a key skill in data science. It is a skill which improves with experience and genuine understanding of the methods.

Communication

Communication is the essence of data science. Unlike in academic disciplines, where your target audience will be highly-trained experts in your exact field of study, the audience of a commercial Data Scientist will likely be experts in a wide range of other areas.

Even the best insights in the world are worth nothing if communicated poorly. Many aspiring data scientists come from academic/research backgrounds, and will be used to communicating with technically-specialised audiences.

In a commercial environment, however, it cannot be stressed enough how important it is to explain your findings in a way that a general audience can understand and work with.

For example, your results may be relevant to a range of different departments within an organisation — from marketing, to operations, to product development. Members of each will be experts in their respective fields of work, and will benefit from clear, concise, and relevant summaries of your findings.

As important as the actual results are the known limitations of your findings. Make sure your audience is aware of any key assumptions, missing data, or degrees of uncertainty in your workflow.

The cliche “a picture is worth a thousand words” is especially true in data science. To this end, data visualisation tools are invaluable.

Software such as Tableau, and dedicated plotting libraries, are great ways of communicating complex data very effectively. They are worth mastering as much as any technical concept.

Some awareness of basic design principles will go a long way in making your diagrams look smart and professional.

Be sure to write clearly, too. Evolution has shaped us humans into impressionable creatures full of subconscious biases, and we’re inherently more inclined to trust better presented, well-written information.

Sometimes, the best way to understand a concept is to interact with it yourself, so it may be worth learning a few front-end web skills to produce interactive visualisations that your audience can play around with. There’s no need to reinvent the wheel. Libraries and tools such as D3.js and R’s Shiny make your task much easier.

Thanks for reading! If you have any feedback or comments, please leave a response below — I look forward to reading them!

Reposted from: http://fcwzd.baihongyu.com/
