This document presents a case study of using Apache Spark to speed up data processing. An organization was processing pharmaceutical data batches of up to 1 billion records, which took 2.2 hours on a 5-node Vertica cluster. After migrating to a 3-node Apache Spark cluster on AWS, it processed 1.2 billion records in 1 hour, a reduction of about 62% in processing time per record. The key steps were ingesting the data into DataFrames, replacing procedures with user-defined functions (UDFs), using Spark SQL, and partitioning the DataFrame so that processing runs in parallel across nodes.
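The 62% figure does not match a plain wall-clock comparison of 2.2 hours against 1 hour, so it is probably a per-record comparison. The short check below is a sketch under one assumption not stated in the summary: that the 2.2-hour Vertica run covered the full 1 billion records. The variable and function names are illustrative only.

```python
# Figures quoted in the summary. Pairing the 2.2 h Vertica run with exactly
# 1.0 billion records is an assumption ("up to 1 billion" in the text).
vertica_hours, vertica_records = 2.2, 1.0e9
spark_hours, spark_records = 1.0, 1.2e9


def hours_per_billion(hours: float, records: float) -> float:
    """Processing time normalized to one billion records."""
    return hours / (records / 1e9)


vertica_rate = hours_per_billion(vertica_hours, vertica_records)
spark_rate = hours_per_billion(spark_hours, spark_records)

# Reduction in time per record (accounts for the larger Spark batch).
per_record_reduction = 1 - spark_rate / vertica_rate

# Reduction in raw elapsed time (ignores the difference in batch size).
wall_clock_reduction = 1 - spark_hours / vertica_hours

print(f"per-record reduction: {per_record_reduction:.0%}")   # 62%
print(f"wall-clock reduction: {wall_clock_reduction:.0%}")   # 55%
```

Under that assumption the per-record result reproduces the quoted 62%, while the raw elapsed-time reduction is closer to 55%. The Spark cluster also used fewer nodes (3 against 5), which this calculation leaves out.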