Copyright information

Translator: StoneDemo, a member of the Cloud+ Community translation group
Original article: Data Security for Data Scientists
Original author: Andrew Therriault

Data Security for Data Scientists

Ten practical tips for protecting your data

Another day, another breach. The Equifax credit data breach is just the latest in a series of stories about major organizations’ data being exposed. It happened to Target’s customer credit card database, it happened to Anthem’s health insurance records, and it even happened to the federal Office of Personnel Management’s background check forms. Even worse, these are just a handful of the biggest examples — critical servers and databases are compromised every single day. This problem is becoming even more frequent over time, and it’s safe to say it’ll get a lot worse before it gets better.

Anxious yet? Good. You should be. Security is no longer just a niche specialty of database admins and network engineers. Everyone who creates, manages, analyzes, or even just has access to data is a potential point of failure in an organization’s security plan. So if you use data which is at all sensitive — that is, any data you wouldn’t freely give out to any random stranger on the internet — then it’s your responsibility to make sure that data is protected appropriately.

I had the importance of data security hammered home for me in 2016. While the DNC hack targeted the organization’s email servers (which my team didn’t interact with, except as normal email users), it’s not hard to imagine that someone who could get into those systems could also have found their way into our databases of voters and campaigns. Whether that actually happened or not, I have no idea (I had already left that job by the time the extent of the hack became known), but the incident underlined how important it is for data scientists to be invested in the security of their own data. Simply put, we cannot naively assume that someone else will take care of security for us.

For most data scientists, this topic is probably an unfamiliar one, as your typical grad school or bootcamp training programs cover security little if at all. (They certainly should, but I assume most readers are already past the point where that would help them.) That’s no excuse for neglecting data security, though, especially when one small misstep in that area can potentially overshadow everything else you’re trying to do. So if you want to start taking better care of your data now, where do you begin?

Before starting to worry about things that are specific to data science, I’d recommend going through a basic security checkup, starting with some general best practices for keeping control of your accounts and assets online. Some specific recommendations:

Use (but don’t reuse!) complex passwords, and keep track of them in a password manager like LastPass or Dashlane.

Activate two-factor authentication wherever you can, preferably with a physical component such as an authenticator app or physical token, which are harder to compromise than text messages.

Don’t mess around with sketchy wifi / computers / flash drives / software, and always encrypt any portable device (flash drive, portable hard drive, phone, tablet, laptop, etc.) which you could potentially lose.

Always connect through a VPN (a paid one, not just a free one, and be sure to do your research) when on public wifi networks.

Update your computer and phone regularly and use anti-virus / anti-malware / firewall software (though note that the value of these tools is increasingly being questioned).

Don’t store your account credentials in text files or embed them in scripts.

And make sure that you maintain control over the physical security of your computers, tablets, and phones as well, or all of these steps could be wasted in the time it takes for your cell phone to be accidentally left on a table.
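
The advice above about keeping credentials out of scripts can be sketched roughly as follows. This is only a minimal illustration, assuming the credentials are supplied through environment variables; the names `DB_USER` and `DB_PASSWORD` and the helper `load_db_credentials` are hypothetical, and a dedicated secrets manager may be a better fit in practice.

```python
import os


def load_db_credentials(env=os.environ):
    """Read database credentials from the environment instead of the script.

    DB_USER and DB_PASSWORD are hypothetical variable names for illustration.
    """
    missing = [name for name in ("DB_USER", "DB_PASSWORD") if not env.get(name)]
    if missing:
        # Fail loudly rather than falling back to a hard-coded default.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["DB_USER"], env["DB_PASSWORD"]
```

With this pattern the script can be shared or committed to version control without exposing the secret itself, since the values live only in the environment of the machine that runs it.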

These practices are good advice for everyone who has an email address, but if you work with data for a living, they’re just the beginning. So to help you take the next step toward becoming a security-literate data scientist, here are ten things I’ve learned throughout my career that I’d recommend for every data scientist:

Take only what you need. The old spy’s mantra — sharing only on a “need to know basis” — is also the first rule of data security. You can’t lose data that you don’t have in the first place, so don’t collect sensitive data unless you have a clear need that justifies the risk. And even then, only get the absolute minimum you need to accomplish a task. As tempting as it is for data scientists to collect more data just in case they need it later, this kind of stockpiling can mean the difference between a minor cybersecurity incident and a major disaster, so don’t do it.
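
One way to put this into practice is to strip a record down to an explicit allowlist of fields at the moment it is collected. The sketch below assumes records arrive as dictionaries; the field names and the helper `minimize` are hypothetical.

```python
# Hypothetical allowlist: only the fields the analysis actually needs.
NEEDED_FIELDS = ("age_bracket", "region", "purchase_total")


def minimize(record, needed=NEEDED_FIELDS):
    """Keep only allowlisted fields; everything else is dropped at ingestion."""
    return {key: record[key] for key in needed if key in record}
```

Using an allowlist rather than a blocklist means a newly added sensitive field is excluded by default instead of being collected by accident.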

Understand the data you have, and don’t keep data you don’t need anymore. Presumably you already have some data, so you should also apply the same principles from #1 above to your existing data. Keep a regular inventory of the data you have on hand, analyze the sensitivity of each dataset, get rid of data you don’t need, and consider taking steps to mitigate the risks inherent in data you do keep — for example, by removing or redacting unstructured text fields, which can hide potentially sensitive data like names and phone numbers. And when you think about data sensitivity, don’t just think about your own interests: if you have data about other people, be sure to put yourself in their shoes.
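
As a rough illustration of redacting unstructured text fields, the sketch below masks email addresses and US-style phone numbers with regular expressions. The patterns are deliberately simple assumptions, and the helper `redact` is hypothetical; real free text needs broader coverage.

```python
import re

# Deliberately simple, illustrative patterns; real data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def redact(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Note that pattern matching like this does not catch names or other free-form identifiers, so dropping the free-text field entirely is often the safer choice when it is not needed.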

Encrypt data when and wherever possible. Encrypting your data (both when “in motion” and “at rest”) is not the magic bullet we’d like it to be, but it’s typically a low-cost way to add an extra layer of protection in case a hard drive or network connection gets compromised. Unless you’re working on applications which demand extremely high performance, the impact of encryption on performance is no longer a big deal, so sensitive data should be encrypted by default. Even the performance excuse is becoming less valid over time: plenty of high-performance applications and services now come with built-in encryption (it is a standard feature in Microsoft’s Azure SQL Database, for example). (Also, note that several early readers suggested I was too forgiving here; in their view, there is no excuse at this point for not encrypting sensitive data at every step in the process.)
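
For the “in motion” half, a minimal sketch using Python's standard `ssl` module is shown below; the helper name `strict_tls_context` is hypothetical. Encryption “at rest” is usually better handled by disk or database features, or by a vetted cryptography library, and is not shown here.

```python
import ssl


def strict_tls_context():
    """Build a TLS context that verifies certificates and refuses old protocols."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=strict_tls_context())`, so that a connection fails outright instead of silently falling back to a weaker or unverified channel.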

Use secure sharing services rather than email, web servers, or basic FTP servers. The simple and quick methods of sharing files are fine for turning in class papers or sending cute dog pictures, but they’re risky ways to share files with sensitive data. Instead, use a service which is specifically designed for sharing files securely. For some, this might mean an access-controlled S3 bucket on AWS (where you can manage sharing of encrypted files with other AWS users) or an SFTP server (which implements secure file transfers over an encrypted connection). But even just moving to a service like Dropbox or Google Drive is an improvement. Though they’re not meant to be as security-focused as some other tools, they still provide better fundamental security (both Dropbox and Google encrypt files at rest, for example) and allow for more fine-grained access control than sending files via email or dumping them on a minimally-secured server. For those looking for an upgrade from Dropbox or Google, a service like SpiderOak One can provide end-to-end encryption for file storage and sharing while maintaining an easy-to-use interface, and at a price point that’s accessible for almost anyone ($5/mo for 100GB, $12/mo for 1TB).

If you use cloud services like AWS or Azure, be sure to lock them down. Don’t make the mistake of assuming that because someone else runs the servers, you don’t have to worry about security. Quite the opposite, actually — there are a whole host of best practices for securing these systems that you need to be aware of. (I’d also suggest reading some of those services’ users’ own recommendations such as this one.) These include things like making sure you turn on authentication for S3 buckets and other file stores, securing ports on servers so only the ones you need are accessible, and limiting access to your services to only approved IP addresses or through a VPN tunnel.
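
The idea of limiting access to approved IP addresses can be illustrated with the standard `ipaddress` module. In practice this rule normally lives in the cloud provider's firewall or security-group settings rather than in application code; the sketch below only shows the logic, and the networks listed are hypothetical documentation ranges.

```python
import ipaddress

# Hypothetical allowlist: an office range and a single VPN exit address.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True only if the client address falls inside an approved network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

Anything not explicitly approved is rejected, which mirrors the "only the ones you need" approach the paragraph recommends for ports.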

Share conscientiously. For sensitive data, grant access to individual users (both internal and external) and datasets rather than granting access in bulk, and only give access when it’s actually needed (think #1 above, but for other people). Likewise, only give access for specific use cases and timeframes (think #2 above). Make your collaborators sign on to nondisclosure and data usage agreements — even if they’re not punitively enforced, they lay out expectations for how others will handle data you’re giving access to — and regularly check logs to ensure they’re complying with the intended usage.
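
A per-user, per-dataset, time-limited grant might be modeled as in the sketch below. The `AccessGrant` structure and the `can_read` helper are hypothetical; a real system would rely on the access controls of the database or storage service itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessGrant:
    """One user's access to one dataset, for one purpose, until a set time."""
    user: str
    dataset: str
    purpose: str
    expires_at: datetime


def can_read(grants, user, dataset, now=None):
    """Allow access only if an unexpired grant matches this user and dataset."""
    now = now or datetime.now(timezone.utc)
    return any(
        grant.user == user and grant.dataset == dataset and now < grant.expires_at
        for grant in grants
    )
```

Because every grant carries an expiry and a stated purpose, access lapses on its own instead of lingering after a project ends, and the grant list doubles as a record to compare against the access logs.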

Secure not only data stores but also applications, backup copies, analytic servers, and so forth. Basically, anything that touches your data should be secured. Otherwise, you might create the Fort Knox of databases, but all that work is useless if your dashboard server caches all that data to disk and isn’t protected. And likewise, remember that your system backups will often make copies of your data files as well, and these will likely endure even after you delete the files themselves (which is, after all, the point of a backup copy). So these backups should not only be protected themselves, they should also be purged when no longer needed. Otherwise, they could become a buried treasure for a hacker — why bother with your carefully-pruned operational database when everything you’ve ever had is still on the backup drive?
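
Purging backups that are past their retention window could look something like the sketch below, which assumes backups are plain files in a single directory and uses file modification time as the age. The function names and the retention period are hypothetical, and deleting a file this way does not securely erase the underlying storage.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def expired_backups(backup_dir, max_age_days, now=None):
    """List backup files whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path for path in Path(backup_dir).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def purge_backups(backup_dir, max_age_days, now=None):
    """Delete expired backups and return the paths that were removed."""
    removed = expired_backups(backup_dir, max_age_days, now)
    for path in removed:
        path.unlink()
    return removed
```

Running `expired_backups` on its own first gives a dry-run list to review before anything is actually deleted.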

Make sure raw data isn’t hidden in outputs that you might share. Some machine learning models package up data (such as words and phrases from original documents) as part of a trained model object, so sharing a model result could potentially reveal training data accidentally. Along the same lines, dashboards, graphs, or maps might have the raw data embedded in the final output, while all you see on the surface is the aggregate result. And even if you’re just sharing a static image of a chart, there are tools out there to reconstruct the original datasets, so don’t assume that you’re not revealing raw data just because you’re not sharing tables. Know what it is you’re sharing, and think through what someone with bad intentions could do with it.
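
To make the risk concrete, the toy sketch below "trains" a bag-of-words model, serializes it with `pickle`, and then checks whether sensitive terms from the training text can be found verbatim in the serialized bytes. The model, the documents, and the helper names are all hypothetical stand-ins; real model objects differ, but the same kind of check can be applied before sharing one.

```python
import pickle


def train_bag_of_words(documents):
    """Toy 'model': word counts learned from the training documents."""
    counts = {}
    for doc in documents:
        for word in doc.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return {"kind": "bag_of_words", "vocabulary": counts}


def leaked_terms(model_bytes, sensitive_terms):
    """Return the sensitive terms that appear verbatim in the serialized model."""
    return [
        term for term in sensitive_terms
        if term.lower().encode("utf-8") in model_bytes
    ]


# Hypothetical training text containing a patient's surname.
documents = [
    "Patient Hernandez reported chest pain",
    "Follow up with Hernandez next week",
]
model_bytes = pickle.dumps(train_bag_of_words(documents))
print(leaked_terms(model_bytes, ["Hernandez", "Okafor"]))
```

Here the surname from the training documents survives inside the shared artifact even though no table of raw data was ever exported.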

Understand the privacy implications of “de-identified” or “anonymized” data (and make sure you’ve done it correctly). If you don’t need to keep personally-identifiable information (PII) in a dataset, removing those fields is an obvious way to reduce the potential impact of a breach, and it’s a mandatory step to take before you share data publicly. But even when you’ve removed PII from a dataset, that doesn’t guarantee that someone else couldn’t figure out who’s who. Could the data be re-identified if combined with some other data? Are the non-PII characteristics unique enough to only apply to specific people? Did you believe somebody who foolishly told you that hashing was a good idea? I once received an “anonymized” consumer data file in which it took me less than 2 minutes to find my own record. (I was the only person with my unique combination of age, race, gender, and length of residence in my Census block.) Without much effort I could have found the records for many others as well, and with the help of a voter registration file (which is public record in most places) it would’ve been feasible to match most of those records to individuals’ names, addresses, and dates of birth. There’s no perfect standard for de-identification, but if you plan to rely on it to protect privacy, I’d highly recommend following the Department of Health and Human Services’ standards for de-identification of protected health information. It’s not an absolute guarantee of privacy protection, but it’s the closest thing you’re going to find which still allows your data to be useful.
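
Two of the questions above can be checked with a few lines of code. The sketch below first counts how many records share each combination of non-PII characteristics (often described as a k-anonymity check), and then shows why an unsalted hash of a low-entropy value offers little protection, using a 5-digit ZIP code as the example. The field names, the choice of SHA-256, and the helper names are assumptions for illustration only.

```python
import hashlib
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def unique_records(records, quasi_identifiers):
    """Return records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]


def reverse_hashed_zip(digest):
    """Recover a 5-digit ZIP code from its unsalted SHA-256 hash by trying all 100,000."""
    for n in range(100_000):
        candidate = f"{n:05d}"
        if hashlib.sha256(candidate.encode("ascii")).hexdigest() == digest:
            return candidate
    return None
```

Any record returned by `unique_records` is in the same position as the author's own record in the story above: one person with a unique combination of traits. The hash example works because the space of possible inputs is small enough to enumerate, which is also true of phone numbers, birth dates, and many ID numbers.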

Know your worst case scenarios. Even after all of these precautions, you can’t eliminate risk entirely, so think through what the worst potential outcomes would be if your data got out. After you’ve done that, go back to #1 and #2. No matter how hard you try to stop breaches, no solution is foolproof, so if you can’t tolerate the potential risks you shouldn’t keep sensitive data around in the first place.

To be clear, these steps won’t protect you from every danger out there. If a state-sponsored hacking group is trying to find a way in, you’re probably over-matched. (That’s why I put recommendation #1 where I did — not having anything worth hacking is the only guaranteed defense!) But for the 99.9% of data scientists whose data is mainly of interest to a less-elite class of hackers, these tips should cover most of the topics you’ll need to know. That doesn’t mean you’re done — figuring out how to do all these things well is a much longer read — but you’re at least on the path to being a responsible guardian of your (and our) data.

Hopefully this helps make you a bit better at securing data, and therefore a better data scientist. If I missed anything here, or if you have other suggestions to add for other readers, please leave a response below.
