Building Cloud Native Analytical Pipelines on AWS

Irene Cai
Software Engineer
2019.08.20

Outline
- Background
- Challenges
○ Performance
○ Spiky Load Handling
○ Expensive rename operation on S3
- Solution with Alluxio
- Ideas for Critical Future Improvement in Alluxio

Background
As cloud offerings have matured and become cost-efficient, we decided to move our
data pipelines, which process hundreds of TBs of data daily, from a team-managed
Hadoop cluster to AWS, with S3 for storage and EMR for compute.
This lets us leverage the elastic compute that EMR offers and offload
cluster operations to AWS. With S3 as the storage layer, data is also
easy to share across teams and forms a single data lake.

Challenges
- I/O Performance
○ Read and write performance against S3
- Spiky Load Handling
○ Read and write load can get spiky (e.g. when jobs are in the final stage of outputting results),
and different storage systems handle spiky load differently.
- Expensive rename operation on S3

I/O Performance - S3 Bottleneck
Read and write performance against S3 is one of the bottlenecks we
encountered. Remote reads and writes are very expensive and need to be
performed multiple times in each pipeline job.

Alluxio as a cache buffer
- Reading
- Alluxio acts as a shared read buffer and reduces the number of reads needed in
each pipeline job
- Writing
- Transient Output
Suitable for output that can be deleted after use and is cheap to recompute if lost. This is a great
performance boost, as such output can be written to Alluxio only and never written to S3. In
our benchmark test, we could write ~100 GB to Alluxio in 1 minute.
- Persistent Output
Output that will be consumed later or is expensive to recompute. This type of output needs
to be persisted to persistent storage such as HDFS/S3. Alluxio helps accelerate and simplify
writing to the persistent store at the application layer.
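
The transient/persistent split can be pictured as a per-output choice of write policy. The sketch below is illustrative only: the `Output` type and the `choose_write_type` helper are hypothetical, and the returned strings are Alluxio write-type names (`MUST_CACHE` keeps data in Alluxio only, `ASYNC_THROUGH` persists it to the under store in the background). The slides do not say which settings were actually used.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical description of one pipeline output."""
    name: str
    needed_later: bool            # consumed by future jobs or other teams
    expensive_to_recompute: bool  # losing it would force costly recomputation


def choose_write_type(out: Output) -> str:
    """Pick a write policy for an output (illustrative rule, not the authors')."""
    if out.needed_later or out.expensive_to_recompute:
        # Persistent output: must eventually land in S3/HDFS.
        return "ASYNC_THROUGH"
    # Transient output: keep it in Alluxio only and never write it to S3.
    return "MUST_CACHE"


shuffle_tmp = Output("stage1-intermediate", needed_later=False, expensive_to_recompute=False)
daily_report = Output("daily-aggregates", needed_later=True, expensive_to_recompute=True)
print(choose_write_type(shuffle_tmp), choose_write_type(daily_report))
```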

Alluxio Performance
As an in-memory file system, Alluxio has excellent read and write performance.
In our experiments, we were able to write ~100 GB of data to Alluxio in 1 minute and
persist it to S3 from Alluxio in 7 minutes, whereas writing directly to S3 often gets
throttled or takes much longer.
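
For a sense of scale, the figures above can be turned into rough sustained throughput. This is back-of-the-envelope arithmetic only, assuming "G" means decimal gigabytes and that the 7-minute persist covers the same ~100 GB; `throughput_gb_per_s` is a helper introduced here for illustration.

```python
def throughput_gb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in GB/s for moving size_gb in the given minutes."""
    return size_gb / (minutes * 60)


alluxio_write = throughput_gb_per_s(100, 1)  # write into Alluxio
s3_persist = throughput_gb_per_s(100, 7)     # persist from Alluxio to S3

print(f"Alluxio write: {alluxio_write:.2f} GB/s")  # Alluxio write: 1.67 GB/s
print(f"S3 persist:    {s3_persist:.2f} GB/s")     # S3 persist:    0.24 GB/s
```

Under these assumptions, the application sees its output accepted about 7 times faster than S3 absorbs it, which is the gap the buffer hides.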

Storage System Response to Spiky Load
- S3 Throttling
S3 doesn't have a hard limit on request rate. However, it throttles requests
when the request rate increases dramatically.
- HDFS NameNode slowness
HDFS handles requests sequentially, but responses get increasingly
slow when the NameNode is under stress.

Handling Spiky Load
- Complicated for applications
- Data engines such as Hadoop handle throttling poorly
- Data applications do not have mechanisms to tune their output pace
- To cope with throttling, pipelines have to retry, which is very
expensive because of the repeated compute costs
- Alluxio as a simplifying solution
Alluxio handles this problem by serving as a buffer that smooths out the I/O
stream and avoids throttling or slowdowns.
Pipeline applications can simply write to Alluxio without explicitly handling
spiky loads in the application logic.
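
The smoothing effect can be illustrated with a toy simulation. This is a minimal sketch, not a model of Alluxio internals: `backend_rate`, the tick-based load, and the drain rate of 100 requests per tick are all made-up assumptions.

```python
def backend_rate(arrivals, drain_per_tick=None):
    """Return the number of requests reaching the backing store in each tick.

    arrivals: requests produced by the pipeline in each tick.
    drain_per_tick: None means no buffer (requests hit the store directly);
    otherwise the buffer forwards at most this many requests per tick.
    """
    if drain_per_tick is None:
        return list(arrivals)
    sent, backlog = [], 0
    for n in arrivals:
        backlog += n
        out = min(backlog, drain_per_tick)
        sent.append(out)
        backlog -= out
    # Keep draining after the job has stopped producing output.
    while backlog > 0:
        out = min(backlog, drain_per_tick)
        sent.append(out)
        backlog -= out
    return sent


# Made-up load: a burst in tick 4, when the job writes its final results.
spiky = [10, 10, 10, 500, 10]
direct = backend_rate(spiky)
buffered = backend_rate(spiky, drain_per_tick=100)
print(max(direct), max(buffered))  # 500 100
```

The same total number of requests reaches the store in both cases; the buffer only spreads the burst over more ticks so the peak stays below the level that would trigger throttling.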

How Alluxio helps with Spiky Load
- Offers users a mechanism to control the output pace to S3 and avoid throttling
- We used hadoop distcp with Alluxio 1.8 and tuned the pace of the copy to S3 with it.
- Persisting to the underlying file system can be asynchronous, so from the application's
view files are available for use as soon as they are written to Alluxio.
- Alluxio is memory-based and avoids blocking replication, which provides much
better performance than EMRFS
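
The asynchronous-persist behaviour described above can be sketched as a write buffer whose `write` returns immediately while a background worker copies data to the backing store at a controlled pace. Everything here is a hypothetical illustration: `BufferedStore`, `persist_fn`, and `min_interval_s` are names invented for this sketch and are not Alluxio APIs.

```python
import queue
import threading
import time


class BufferedStore:
    """Toy write buffer: writes return at once, persistence happens later."""

    def __init__(self, persist_fn, min_interval_s=0.0):
        self.cache = {}                # data readable right after write()
        self.persisted = []            # paths already copied to the backing store
        self._persist_fn = persist_fn  # hypothetical callback, e.g. an upload
        self._min_interval_s = min_interval_s  # pacing knob between persists
        self._pending = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def write(self, path, data):
        self.cache[path] = data
        self._pending.put(path)        # schedule persistence, do not wait for it

    def read(self, path):
        return self.cache[path]

    def _worker(self):
        while True:
            path = self._pending.get()
            self._persist_fn(path, self.cache[path])
            self.persisted.append(path)
            self._pending.task_done()
            time.sleep(self._min_interval_s)  # spread requests out over time

    def wait_until_persisted(self):
        self._pending.join()


backing = {}  # stands in for S3 in this sketch
store = BufferedStore(lambda path, data: backing.__setitem__(path, data))
store.write("/out/part-0", b"rows")
readable_immediately = store.read("/out/part-0")  # available before persist
store.wait_until_persisted()
```

Raising `min_interval_s` slows the copy to the backing store without slowing the application's writes, which is the kind of pacing control the first bullet refers to.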

Rename Operations on S3
The move operation is very expensive on S3, as it deletes the old object and creates a
new one.
However, Spark/Hive typically write into a temp directory and move the result to the final
destination when computation finishes. This creates unnecessary stress on S3.
With Alluxio as the middle layer, we persist only final results to S3.
Alternatively, users can use EMRFS as the middle layer. However, its performance is
worse than Alluxio's.
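
To make the cost concrete, the sketch below counts storage requests for the two write patterns against a toy in-memory object store. `ToyObjectStore` is a stand-in invented for this example and is not the S3 API; it ignores listing calls and multipart uploads, so real request counts would differ.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # number of simulated storage requests

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_prefix(self, src_prefix, dst_prefix):
        # No native rename: each object under the prefix is copied, then deleted.
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            self.copy(key, dst_prefix + key[len(src_prefix):])
            self.delete(key)


def write_via_temp(store, n_files):
    """Write to a temp prefix, then move to the final prefix (Spark/Hive style)."""
    for i in range(n_files):
        store.put(f"_temporary/part-{i}", b"x")
    store.rename_prefix("_temporary/", "final/")
    return store.requests


def write_final_only(store, n_files):
    """Persist only the final results, as when a middle layer absorbs the rename."""
    for i in range(n_files):
        store.put(f"final/part-{i}", b"x")
    return store.requests


print(write_via_temp(ToyObjectStore(), 1000))    # 3000
print(write_final_only(ToyObjectStore(), 1000))  # 1000
```

In this simplified model the temp-then-move pattern triples the number of requests, and the copies also rewrite every byte a second time.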

Ideas for Critical Future Improvement in Alluxio
- Data completeness on node failure
- This is currently mitigated by data replication.
- In-memory replication is costly
- Persistent storage replication is slow and can still result in corrupt data if the data has not been
replicated to disk before the node fails.
- When data is corrupted, pipelines need to rerun the job to regenerate the complete output. It
would be great if Alluxio could manage the data loss automatically by launching only the tasks
required to recompute the missing blocks.

- Stronger guarantee of write success to persistent storage
- Alluxio currently gives users a way to tune the output pace to reduce throttling. It would be great if
Alluxio had a built-in mechanism that better guarantees data is successfully persisted
to the underlying storage without user intervention.

- Deeper data engine integration
- Accessing Alluxio from various compute engines is very easy. It would be great to see such
integrations get deeper to provide a stronger guarantee when compute engines write to
Alluxio. For example, Alluxio could help avoid recomputing previously successful tasks with
the same input, reducing the cost of job failures.
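
The partial-recompute idea from the list above can be sketched as a lookup from missing blocks to the tasks that produced them. This is a hypothetical illustration of the wished-for behaviour, not an existing Alluxio feature; `tasks_to_rerun` and the block-to-task mapping are invented for the example.

```python
def tasks_to_rerun(expected_blocks, surviving_blocks, block_to_task):
    """Return the sorted ids of tasks whose output blocks are missing."""
    missing = set(expected_blocks) - set(surviving_blocks)
    return sorted({block_to_task[block] for block in missing})


# Made-up job: three tasks produced four blocks; block "b2" was lost with a node.
block_to_task = {"b0": 0, "b1": 0, "b2": 1, "b3": 2}
surviving = ["b0", "b1", "b3"]
print(tasks_to_rerun(block_to_task.keys(), surviving, block_to_task))  # [1]
```

Only task 1 would be relaunched here, instead of rerunning the whole job.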

Takeaway
Introducing Alluxio into the pipeline gives significant performance advantages and
helps address multiple challenges caused by underlying file system behavior.
As a memory-based shared cache, Alluxio provides a large performance advantage.
It also provides a good abstraction, so applications don't need to deal with the underlying
storage systems directly.
