The document discusses data science workflows on Hadoop. It describes data science as comprising three phases: data plumbing, to ingest and transform data; exploratory analytics, to investigate and analyze data; and operational analytics, to build and deploy models. It gives examples of the tools used in each phase, including Spark, Hadoop Streaming, SAS, and Python for exploratory analytics, and MLlib and Spark for operational analytics. The document also discusses lambda architectures for handling both batch and real-time analytics.
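
To make the operational-analytics phase concrete, here is a minimal sketch of training and persisting a model with Spark MLlib. It is not taken from the document: the input path, the column names f1, f2, and label, and the choice of logistic regression are all illustrative assumptions.

```python
# Minimal sketch of the operational-analytics phase with Spark MLlib.
# Assumptions (illustrative, not from the document): a CSV of training
# data with numeric feature columns f1, f2 and a binary label column.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("operational-analytics").getOrCreate()

# Data plumbing: ingest raw data into a model-ready DataFrame.
df = spark.read.csv("/data/training.csv", header=True, inferSchema=True)

# MLlib expects the features assembled into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

# Build the model and persist it so it can be deployed for scoring.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("/models/lr_pipeline")

spark.stop()
```

A deployed scoring job would then reload the pipeline with PipelineModel.load("/models/lr_pipeline") and call transform on incoming data.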
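The lambda architecture pairs a batch layer, which periodically recomputes a complete view over all master data, with a speed layer, which keeps a low-latency view over newly arriving data; a serving layer merges the two at query time. The Spark sketch below illustrates the idea only; the paths, the count-by-user query, and the (user_id, ts) schema are assumptions, not details from the document.

```python
# Minimal lambda-architecture sketch: the same count-by-user aggregation
# is computed by a batch layer (full recompute over master data) and a
# speed layer (incremental updates over new data). All paths and the
# (user_id, ts) schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

schema = StructType([StructField("user_id", StringType()),
                     StructField("ts", TimestampType())])

# Batch layer: recompute the full view from the master dataset.
batch_view = (spark.read.schema(schema).json("/data/master/")
              .groupBy("user_id").agg(count("*").alias("events")))
batch_view.write.mode("overwrite").parquet("/views/batch/")

# Speed layer: maintain a low-latency view over data arriving since the
# last batch run, using Structured Streaming.
speed_view = (spark.readStream.schema(schema).json("/data/incoming/")
              .groupBy("user_id").agg(count("*").alias("events")))
query = (speed_view.writeStream.outputMode("complete")
         .format("memory").queryName("speed_view").start())

# Serving layer (conceptually): answer queries by merging the batch
# Parquet view with the in-memory speed view, e.g.
#   spark.read.parquet("/views/batch/").union(spark.table("speed_view"))
query.awaitTermination()
```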