How to pass the Google Cloud Professional Data Engineer certification – as a sales guy
Due to the positive feedback regarding the Cloud Architect study guide, the following article centers on my subjective experience with Google Cloud’s Professional Data Engineer certification: whether it is the right certification for you, how to prepare, and what to watch out for in the exam. None of what follows should be considered official Google advice in any way.
Data Engineer or Machine Learning Engineer: which is the right fit for me?
Even though the Cloud Architect certification provides a decent, horizontal overview of the Google Cloud infrastructure, what working with data actually looks like remained something of a blind spot for me. In other words, everything that falls under the umbrella of “data science”: how data, once it sits in BigQuery or Bigtable, is somewhat mysteriously turned into mind-blowing use cases and business value with BI or ML.
I was uncertain whether the Professional Data Engineer or the Professional Machine Learning Engineer study path would be the logical next step to build deeper, vertical knowledge in the data science field, and initially opted for the latter.
When I began to study the Machine Learning content on Coursera in September 2020, the Professional Machine Learning Engineer certification was still in its beta phase. This changed with the official release in October, at which point I had already worked through the first 6 of the 11 recommended courses:
- Google Cloud Platform Big Data and Machine Learning Fundamentals
- Machine Learning with TensorFlow on Google Cloud Platform Specialization: (1) How Google does Machine Learning; (2) Launching into Machine Learning; (3) Introduction to TensorFlow; (4) Feature Engineering; (5) Art and Science of Machine Learning
- Advanced Machine Learning with TensorFlow on Google Cloud Platform Specialization: (1) End-to-End Machine Learning with TensorFlow on GCP; (2) Production Machine Learning Systems; (3) Image Understanding with TensorFlow on GCP; (4) Sequence Models for Time Series and Natural Language Processing; (5) Recommendation Systems with TensorFlow on GCP
I soon realized that the courses did not give me the kind of energy I had experienced during the Cloud Architect track. One reason was the curriculum, which is at least two years old (Diane Greene was still the Google Cloud CEO when it was recorded). The content often felt redundant or oscillated between easy and very challenging within minutes, making it difficult to fully commit to a long-term learning experience. Another reason was the course focus, which is (rightly!) tailored to an expert audience. For me personally, it was simply too technical to proceed, and as a result I decided to stop this adventure after finishing the Machine Learning with TensorFlow on Google Cloud Platform Specialization.
Having said that, a few modules were highly exciting and worthwhile to study, such as Launching into ML (2.2) or Introduction to TensorFlow (2.3). And with the recently introduced Professional Machine Learning Engineer certification, the Coursera courses will probably undergo a content refresh. I might revisit the curriculum at some point in the future to give it another try.
At this stage, however, the better fit for me was the Professional Data Engineer certification, since it bridges the gap between the holistic/horizontal Architect certification and the very technical/vertical Machine Learning study track.
Going for the Data Engineer: What is covered
Although the PDE certification only scratches the surface of Machine Learning (maybe 10% of the exam questions), the lion’s share focuses on how Google Cloud does data ingestion (Pub/Sub), data preparation and processing (Dataprep, Dataflow, Data Fusion and Dataproc), what role the underlying databases play (BigQuery, Bigtable, Cloud SQL, Spanner and Datastore), and how to activate and monitor the data effectively (Datalab, Data Studio, Composer, Stackdriver). An understanding of how open-source alternatives such as Apache Hadoop, Spark, Pig, Hive, Kafka and Airflow, as well as MongoDB, MariaDB and Cassandra, can be used with Google Cloud products is essential. Looker is not part of the curriculum yet. This is more or less how the solutions map (a short Pub/Sub sketch follows the list):
- Pub/Sub => Apache Kafka
- Dataflow => Apache Beam
- Dataproc => Apache Hadoop
- Composer => Apache Airflow
- BigQuery => Apache Hive
- ML Engine / AI Platform => Apache SparkML
- Bigtable => Apache HBase
- NoSQL databases such as Bigtable or Datastore => MongoDB, Cassandra
- SQL databases such as Cloud SQL or Spanner => MariaDB
- Dataprep => Trifacta
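To make the first mapping concrete, here is a minimal sketch of publishing a message with the google-cloud-pubsub Python client; conceptually it is the same produce-to-a-topic step you would perform with a Kafka producer. The project and topic IDs are hypothetical, and the snippet assumes application-default credentials are configured.

```python
# Minimal Pub/Sub publisher sketch; project and topic IDs are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "rides")  # hypothetical IDs

# Pub/Sub messages are raw bytes, just like Kafka record values.
future = publisher.publish(topic_path, data=b'{"bike_id": "bike-1"}')
print(future.result())  # blocks until the server assigns a message ID
```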
Going for the Data Engineer: How to prepare
Similar to the Architect preparation, I used a mix of Coursera, A Cloud Guru (formerly Linux Academy), Google Cloud’s documentation and Whizlabs. On Coursera, I studied the following four courses:
- Google Cloud Platform Big Data and Machine Learning Fundamentals
- Modernizing Data Lakes and Data Warehouses with GCP
- Building Batch Data Pipelines on GCP
- Building Resilient Streaming Analytics Systems on GCP
Google Cloud Platform Big Data and Machine Learning Fundamentals (15h) and Modernizing Data Lakes and Data Warehouses with GCP (10h) both cover the Google Cloud products from a holistic perspective. If you want to choose only one of the two, I’d recommend the latter: the great Qwiklabs exercises and the Gojek customer example make it a worthwhile learning investment. It also helps to explain the differences between OLTP and OLAP, and how database design is a tradeoff between performance and consistency. This might be straightforward for many of you, but for me it was partially new and helped me understand the benefits and shortcomings of GCP solutions, which derive directly from their design and purpose. Here is my short summary (with a small BigQuery sketch after the list):
- Cloud SQL is an example of an OLTP database. It is write-focused and record-based, covers transactional data, and its purpose is to control and run business tasks. The key words are consistency and referential integrity. It can only give you a snapshot of the business, and is defined by short and fast updates. Data size is small by design, given that entire records must be processed on every operation. Tables are normalized to avoid redundancy.
- BigQuery is an example of an OLAP database. It is read-focused and column-based, covers analytical workloads, and its purpose is to plan and support solving business problems with high performance. Eventual consistency is sufficient. It can give you a multi-view of the business based on big data, because indexes and filters can be leveraged for the analysis. Tables are denormalized and nested with Structs and Arrays. If you wanted to query normalized data, you would need a lot of costly joins; this is why, for analytical purposes in BigQuery, you first normalize the data and then denormalize it again. The advantage of denormalizing the data is that you can use wide Structs (type=“RECORD”) and deep Arrays (mode=“REPEATED”). This nested structure allows for higher query performance, given that 1) there is less data in columns and 2) no joins between tables are necessary. Parallel processing is possible. If arrays are to be queried, one needs to UNNEST() them first, which brings their elements back into rows. All elements of an array must share a single data type.
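To illustrate the Struct/Array/UNNEST() point, here is a minimal sketch using the google-cloud-bigquery Python client with inline data; the station names and ride counts are made up, and the snippet assumes application-default credentials are configured.

```python
# Minimal BigQuery sketch: nested Structs/Arrays, flattened with UNNEST().
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT trip.station, ride
FROM UNNEST([
  STRUCT('Station A' AS station, [12, 7, 31] AS rides),  -- hypothetical data
  STRUCT('Station B' AS station, [5, 42] AS rides)
]) AS trip,
UNNEST(trip.rides) AS ride  -- correlated UNNEST turns array elements into rows
"""

for row in client.query(sql).result():
    print(row.station, row.ride)
```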
Building Batch Data Pipelines on GCP (15h) dives deep into Dataproc and the Hadoop ecosystem. This is a crucial topic for the exam, and also for the general understanding of when to go for a more DevOps-focused approach compared with serverless solutions such as Pub/Sub and Dataflow. A topic covered in a few exam questions was the role of Cloud Storage in all of these scenarios, e.g. as a source or sink, or compared with HDFS.
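As a small illustration of the Cloud Storage-versus-HDFS point, here is a minimal PySpark word-count sketch as it might run on a Dataproc cluster: the Cloud Storage connector (installed on Dataproc by default) lets gs:// paths stand in for hdfs:// paths without any code changes. The bucket names are hypothetical.

```python
# Minimal PySpark sketch: Cloud Storage as both source and sink on Dataproc.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

# Read text files from Cloud Storage exactly as you would from HDFS.
lines = spark.read.text("gs://my-input-bucket/logs/*.txt")  # hypothetical bucket

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Write the result back to Cloud Storage as the sink.
counts.toDF(["word", "count"]).write.csv("gs://my-output-bucket/wordcount/")  # hypothetical

spark.stop()
```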
Building Resilient Streaming Analytics Systems on GCP (10h) centers on Pub/Sub, Dataflow windows, Bigtable and how to design good queries in BigQuery. I found the Qwiklabs bike-sharing exercises excellent for getting a good understanding of the DOs and DON’Ts of querying.
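Since Dataflow windowing comes up a lot in preparation, here is a minimal Apache Beam sketch of fixed windows, runnable locally on the DirectRunner; the bike IDs and timestamps are made up for illustration.

```python
# Minimal Apache Beam sketch: fixed 60-second windows over in-memory events.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (p
     | beam.Create([("bike-1", 10), ("bike-2", 25), ("bike-1", 130)])  # (id, seconds)
     # Attach event-time timestamps so windowing has something to work with.
     | beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
     # Assign each element to a 60-second fixed window.
     | beam.WindowInto(window.FixedWindows(60))
     # Count rides per bike within each window.
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```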
Next, I went for the 181 practice exam questions on Whizlabs, of which I answered 121 correctly (34/50, 32/50, 34/50 and 21/31 in the respective tests). This was a better result than in my Cloud Architect preparation with Whizlabs, so I felt confident that I was on the right track.
I then signed up for A Cloud Guru (formerly Linux Academy), given that I had found its content and practice exam questions very relevant for the Cloud Architect exam. Matthew Ulasien did not disappoint this time either. The practice exam questions in the Data Engineer track are spot on (I scored 60%), and the course content is not only a good refresher of what Coursera covered but also highlights the core knowledge necessary for the exam. Another 15 hours well spent.
Together with the official Google Cloud practice exam (I scored 55%), this showed me where I needed to focus, and I doubled down on those topics in the official Google Cloud documentation.
The Data Engineer exam
As with my Cloud Architect exam, the online proctoring service from Kryterion did not go smoothly. The exam launch button did not show up on time, and I had to wait 20 minutes to show my ID and get the green light to start. Speaking to Kryterion’s support team on an additional device, in my case my cell phone, unfortunately seems to be necessary, so make sure to have a second device available for troubleshooting. I hope that Kryterion steps up its game soon, since the current experience does not do justice to the overall quality of the certification.
From a content perspective, I was surprised that my main study areas saw low coverage in the exam. I had put a lot of emphasis on DOs and DON’Ts in BigQuery and Bigtable, and how to ensure validity, accuracy, completeness, consistency and uniformity. I also do not recall Dataflow window questions at all, a topic that was all over the place in the practice exams.
Plenty of questions centered on how best to choose between Google Cloud products and their open-source alternatives. It also felt like access control (e.g. IAM) and security and monitoring measures (e.g. authorized views, Stackdriver and CMEK/CSEK) had a significant share in the overall picture. AutoML versus Vision API showed up at least twice, so this is a distinction one should be able to make quickly (see the sketch below).
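The distinction, in short: the pre-trained Vision API returns generic labels out of the box, whereas AutoML Vision requires you to train a model on your own labeled images first. Here is a minimal sketch of the former with the google-cloud-vision Python client; the image file path is hypothetical.

```python
# Minimal Vision API sketch: pre-trained label detection, no training required.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("bike.jpg", "rb") as f:  # hypothetical local image
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```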
In general, I found the questions to be harder than in the Cloud Architect track. While I would label 40%+ of the Architect questions as easily answerable if one studied the content well (maybe a result of the Mountkirk, TerramEarth and Dress4Win case studies), I found only 20-30% of the Data Engineer questions to be a cakewalk. The rest required careful reading and thinking, especially since the majority of the questions are quite long and provide a lot of detail. I flagged at least 60% of the questions for a second pass, and finished only 10 minutes ahead of time. With the Architect exam, I had roughly half an hour left.
If you have any questions, please feel free to reach out; I'm more than happy to help. Happy new year and good luck!