AWS Athena at Scale


If you’re using AWS for data transformation, you’re going to run into Athena sooner or later. Athena allows you to query data across multiple data stores with a well-known SQL syntax (Presto 0.172).

With S3 as a storage solution, Athena promises to handle the complexity of a huge database for you. Serverless compute and storage means an entirely serverless database experience. All you need to do is know where all of the red flags are. In this article, I’ve listed some of the situations I’ve found myself in over the past few months.

Partitioning Is Non-Negotiable With Athena

When you’re writing out your data into AWS Glue tables, there should be one word at the forefront of your conversation: partitioning. Partitioning instructs AWS Glue on how to group your files together in S3 so that your queries can run over the smallest possible set of data.

Poor partitioning strategies have been the bane of databases for decades. It is very difficult to get right, since optimising for one access pattern inevitably means getting worse at another. Here are the questions to ask yourself when you’re designing your partition:

  • How is this data going to be queried?
  • How much data per partition does that mean?
  • How often are we going to be querying this data?
  • Is this datastore going to morph into something completely different? How would we handle that?

A very common partitioning strategy is to partition on a date key. The AWS Glue libraries come fitted with a mechanism for specifying your partition columns out of the box. Using these libraries, your code may look something like this:

# Write the DynamicFrame to S3 as Parquet, partitioned by the date column.
glue_context.write_dynamic_frame.from_options(
    frame=projectedEvents,
    connection_type="s3",
    connection_options={
        "path": "$outpath",
        "partitionKeys": ["date"],
    },
    format="parquet",
)

In this example, we’re telling Glue to write the output in Parquet format and to partition on the date field. If we were to open up S3, we would see Hive-style partitions of the form:

/date=2020-05-01/…/date=2020-05-02/…/date=2020-05-03/…

The statement we’ve made is this: “We want to optimise for queries that stay within a single day.” If we were planning on running lots of queries that spanned many days, this partitioning strategy would not help us optimise our costs. Remember: Athena charges by the amount of data scanned, nothing else.

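Because billing is purely per byte scanned, it is worth sanity-checking what a query pattern costs before committing to a partition scheme. Here is a minimal sketch, assuming Athena’s published price of $5 per TB scanned and the 10 MB per-query minimum (check current pricing for your region):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate the cost of a single Athena query.

    Athena charges per terabyte of data scanned, with a 10 MB minimum
    billed per query, so pruning partitions directly cuts cost.
    """
    tb = 1024 ** 4
    min_bytes = 10 * 1024 ** 2  # 10 MB minimum billed per query
    billed = max(bytes_scanned, min_bytes)
    return billed / tb * price_per_tb

# A query pruned to one 2 GB daily partition vs. a full 30-day scan:
one_day = athena_query_cost(2 * 1024 ** 3)
one_month = athena_query_cost(30 * 2 * 1024 ** 3)
```

Scanning thirty daily partitions costs thirty times as much as scanning one, which is exactly why the partition key has to match how the data is actually queried.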
Don’t Be Afraid to Store Multiple Views on the Data

Your AWS storage costs are nothing compared to the read/write costs. If you can greatly optimise your S3 I/O by storing a duplicated set of data with different partitions, it’ll usually work out as savings.

Parquet Can Save You a Lot of Money

The output format you choose to write in can seem like a matter of personal preference to the uninitiated (read: me a few weeks ago). However, this choice can profoundly impact the operational cost of your system. Parquet is a columnar storage format, meaning it doesn’t store whole rows together: values are grouped by column, so a reader can fetch a subset of the columns without touching the rest of each row.

If you run a query like this against a stack of JSON files, what do you think Athena will have to do?

SELECT name, age, dob FROM my_huge_json_table WHERE dob = '2020-05-01';

It will be forced to pull every whole JSON document and scan through it just to find the rows matching that dob. There could be 100 different columns in your JSON file, but you’re only interested in three of them.

The same query run against Parquet is far easier to optimise. Instead of pulling every file in full, Athena can read just the columns it needs. It won’t be perfect. Depending on the size of your files, Athena may be forced to sift through some extra data, but this additional dimension means that specific queries can operate over specific slices of the data. This results in potentially significant cost savings.

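The row-versus-column trade-off is easy to see in miniature. This sketch is plain Python and purely illustrative (Parquet’s actual on-disk format adds row groups, encodings, and compression); it contrasts how much data each layout must touch to answer the three-column query above:

```python
# Row-oriented: every record is stored whole, like a line in a JSON file.
rows = [
    {"name": "Ada", "age": 36, "dob": "2020-05-01", "city": "London", "notes": "…"},
    {"name": "Lin", "age": 29, "dob": "2020-05-01", "city": "Leeds", "notes": "…"},
]

# Column-oriented: values are grouped by column, like Parquet column chunks.
columns = {
    "name": ["Ada", "Lin"],
    "age": [36, 29],
    "dob": ["2020-05-01", "2020-05-01"],
    "city": ["London", "Leeds"],
    "notes": ["…", "…"],
}

# SELECT name, age, dob: the row store must deserialise every field of
# every record, while the column store reads only the three lists asked for.
row_fields_read = sum(len(r) for r in rows)  # all 10 stored values touched
col_fields_read = sum(len(columns[c]) for c in ("name", "age", "dob"))  # 6 values
```

With 100 columns instead of five, the gap between the two layouts grows accordingly, and with Athena that gap translates directly into bytes scanned and dollars billed.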
Athena Is Good for More Than Just Extracting Data

Athena is often discussed in the documentation as a way of extracting the data from your tables once you’re happy with it. It’s almost a presentational layer that APIs can hook into. This is correct but limited.

SQL is a powerful data transformation language that, when used properly, can result in very fast-running jobs. Some operations, such as window functions and aggregate functions, work nicely in a SQL syntax and result in much more straightforward, elegant code. With the introduction of CTAS, you can write metadata directly to the Glue datastore without the need for a crawler.

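As a sketch of the CTAS pattern (the table names, columns, and S3 location here are hypothetical; the WITH options follow Athena’s documented CTAS syntax):

```sql
-- Clean and re-partition a raw table in one statement. Athena registers
-- the result in the Glue catalog itself, so no crawler run is needed.
CREATE TABLE clean_events
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/clean_events/',
    partitioned_by = ARRAY['date']
) AS
SELECT
    name,
    age,
    dob,
    date  -- partition columns must come last in the SELECT list
FROM raw_events
WHERE name IS NOT NULL;
```

One transformation step, and the output is already Parquet, already partitioned, and already queryable.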
One of the lessons we learned was that Athena can be used to clean the data itself. Athena makes use of Presto 0.172; have a read of the documentation. The Presto engine has a plethora of great functions to tap into.

Athena is powerful, but it has some quirks that took us a while to work out.

Athena Doesn’t Like Hyphens

This is a small one, but it can result in some bizarre behaviour. Don’t put hyphens in your table names. They can break your queries. And not in the “Oh, everything is suddenly very broken” kind of way. In the “Oh, this query is doing something completely random now” kind of way.

Avoid the dumpster fire and go for underscores. It doesn’t change readability too much and is one less thing to worry about. If you’re dead set on using hyphens, you can wrap your table names in double quotes ("). This has fixed the issue whenever I have seen it crop up, but I don’t know whether it’s a genuine fix or whether it has quirks of its own.

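As a quick illustration (the table name is hypothetical), the quoted form is the one that behaves predictably:

```sql
-- Unquoted, the hyphen can be parsed as part of an expression:
--   SELECT * FROM my-events-table;    -- unpredictable
SELECT * FROM "my-events-table";      -- quoted, parsed as a single identifier
```

Underscores sidestep the whole question, which is why they remain the safer default.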
Athena Really Doesn’t Like Global ORDER BY

This crops up in other database systems too, but it was the cause of some of our Athena queries hitting the “resource exhausted” error:

AWS Athena Error: Query exhausted resources at this scale factor

In every case where this has popped up, we’ve found that the best way to optimise our queries is to limit the number of ORDER BY statements we have in our query to the bare minimum. This is because an ORDER BY over your whole dataset means moving your data onto a single node so that it can be sorted.

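In practice that meant reserving ORDER BY for a final, already-reduced result. A sketch of the pattern (table and columns hypothetical):

```sql
-- Risky at scale: a global sort pulls the entire table onto one node.
--   SELECT * FROM big_events ORDER BY created_at;

-- Cheaper: aggregate or filter first, then sort the small result,
-- and bound it with LIMIT so the sorted set stays tiny.
SELECT date, count(*) AS events
FROM big_events
GROUP BY date
ORDER BY events DESC
LIMIT 100;
```

Sorting a few hundred aggregated rows is trivial; sorting billions of raw ones is what exhausts the resources.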
The ORDER BY statement is just one of the culprits for greedy Athena queries. Fortunately, AWS has put together a great list of options for you to make the most out of Athena without setting fire to a server somewhere in Dublin.

Translated from: https://medium.com/better-programming/aws-athena-at-scale-a90c58c3b110

