Latest Databricks Databricks-Certified-Data-Engineer-Professional Actual Free Exam Questions & Community Discussion, Page 4

Question #43

A data engineer is tasked with building a nightly batch ETL pipeline that processes very large volumes of raw JSON logs from a data lake into Delta tables for reporting. The data arrives in bulk once per day, and the pipeline takes several hours to complete. Cost efficiency is important, but performance and reliability of completing the pipeline are the highest priorities. Which type of Databricks cluster should the data engineer configure?

A. An all-purpose cluster always kept running to ensure low-latency job startup times.

B. A job cluster configured to autoscale across multiple workers during the pipeline run.

C. A high-concurrency cluster designed for interactive SQL workloads.

D. A lightweight single-node cluster with low worker node count to reduce costs.

Correct Answer: B
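
As a rough illustration of option B, a job cluster with autoscaling can be described in the `new_cluster` block of a Jobs API job definition. The sketch below only builds that payload as a Python dict. The job name, runtime version, node type, and worker counts are placeholder assumptions, not values from the question.

```python
# Sketch of a Jobs API job definition whose job cluster autoscales.
# Runtime version, node type, worker counts, and names are placeholders.

def autoscaling_job_cluster(min_workers: int, max_workers: int) -> dict:
    """Return a job-cluster spec that scales between min_workers and max_workers."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder, cloud-specific node type
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
    }


job_settings = {
    "name": "nightly-json-to-delta",  # hypothetical job name
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",  # hypothetical key
            "new_cluster": autoscaling_job_cluster(min_workers=2, max_workers=16),
        }
    ],
}
```

Because the cluster exists only for the duration of the run, nothing is billed between nightly executions, while autoscaling lets the run add workers when the daily volume is large.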

Question #44

A data engineer is building a customer data pipeline in Lakeflow Spark Declarative Pipelines. The source is a cloud-based event stream with limited retention containing inserts, updates, and deletes for customer records. These changes are being applied using the AUTO CDC INTO syntax to maintain an SCD Type 1 table as the target table, customer_dim. How should the data engineer build a downstream job that streams from the customer_dim table to act only on update and delete events, processing data incrementally?

A. Streaming from customer_dim table would only be possible in the case of SCD 2 retention.

B. Use ignoreChanges flag while streaming from customer_dim to avoid breaking the pipeline during updates and deletes.

C. When stored as SCD 1, the target of AUTO CDC INTO includes updates and deletes. Streaming from customer_dim can fail due to these operations. Instead, build another stream from the original source.

D. Read change data feed from customer_dim table and apply filters to incrementally act on the change events.

Correct Answer: D
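
A minimal sketch of option D, assuming change data feed is available on customer_dim and that `spark` is an active SparkSession (for example the one a Databricks notebook provides). The function name and the choice to keep only post-update images are illustrative assumptions.

```python
# Keep only the change types the downstream job cares about.
# "update_postimage" is the row after an update; "delete" marks a removed row.
CHANGE_FILTER = "_change_type IN ('update_postimage', 'delete')"


def stream_updates_and_deletes(spark, table_name="customer_dim"):
    """Return a streaming DataFrame of update and delete events from the change data feed."""
    return (
        spark.readStream
        .option("readChangeFeed", "true")  # stream row-level changes, not the snapshot
        .table(table_name)
        .filter(CHANGE_FILTER)
    )
```

The stream checkpoint tracks which table versions have been consumed, so each run picks up only new change events.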

Question #45

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?

A. Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

B. Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.

C. Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

D. Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.

E. Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.

Correct Answer: B

Question #46

A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?

A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.

B. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.

C. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.

D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.

E. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.

Correct Answer: A

Question #47

A data engineer and a platform engineer are working together to automate their system tasks. A script needs to be executed outside of Databricks only if a particular daily Databricks job finishes successfully for the day. A Databricks CLI command is used to check the last execution of the job. What are the required command options for that task?

A. databricks jobs list-runs --job-id JOB_ID --start-time-from TODAY_MIDNIGHT_EPOCH_MS --completed-only

B. databricks jobs list-runs --job-id JOB_ID --start-time-from TODAY_MIDNIGHT_EPOCH_MS --active-only

C. databricks jobs list-runs --job-id JOB_ID --start-time-to TODAY_MIDNIGHT_EPOCH_MS --completed-only

D. databricks jobs list-runs --job-id JOB_ID --start-time-to TODAY_MIDNIGHT_EPOCH_MS --active-only

Correct Answer: A
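
The TODAY_MIDNIGHT_EPOCH_MS placeholder has to be computed by the calling script. The sketch below shows one way to derive it and assemble the arguments from option A. The helper names and the job ID 123 are hypothetical, and the example uses a fixed UTC timestamp so it is reproducible.

```python
from datetime import datetime, timezone


def midnight_epoch_ms(now: datetime) -> int:
    """Epoch milliseconds for 00:00 on the same calendar day (and time zone) as `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)


def list_runs_command(job_id: int, start_from_ms: int) -> list[str]:
    """Argument list for listing completed runs that started at or after start_from_ms."""
    return [
        "databricks", "jobs", "list-runs",
        "--job-id", str(job_id),
        "--start-time-from", str(start_from_ms),
        "--completed-only",
    ]


# Fixed timestamp for illustration; a real script would pass the current time.
example_now = datetime(2024, 1, 2, 15, 30, tzinfo=timezone.utc)
cmd = list_runs_command(job_id=123, start_from_ms=midnight_epoch_ms(example_now))
print(" ".join(cmd))
```

A completed run can still have failed, so the external script would also need to inspect the result state of the returned run before proceeding.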

Question #48

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?

A. preds.write.format("delta").save("/preds/churn_preds")

B.

C.

D. preds.write.mode("append").saveAsTable("churn_preds")

E.

Correct Answer: D

Question #49

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

A. Deep clone the table for the analytics team.

B. Create a new table for the analytics team using a CTAS statement.

C. Shallow clone the table for the analytics team.

D. Give the analytics team direct access to the production table.

Correct Answer: C
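
A shallow clone copies only the table metadata and points at the source data files, which is why it is quick to create even for a table of this size. The sketch below builds the corresponding SQL statement. The helper name and the three-level table names are hypothetical.

```python
def shallow_clone_sql(source_table: str, target_table: str) -> str:
    """Build a Delta Lake statement that shallow-clones source_table into target_table."""
    return f"CREATE TABLE IF NOT EXISTS {target_table} SHALLOW CLONE {source_table}"


# Hypothetical names; substitute the real catalog, schema, and table.
statement = shallow_clone_sql(
    source_table="prod.sales.customer_transactions",
    target_table="sandbox.analytics.customer_transactions_exp",
)
print(statement)
```

Writes the analytics team makes to the clone create new files under the clone and do not modify the production table.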

Question #50

Predictive Optimization is an automated Databricks service enabled by default for Unity Catalog Managed tables. It helps maintain Delta tables by continuously optimizing them to ensure optimal performance and costs. Which two operations does Predictive Optimization run to maintain the Delta tables? (Choose two.)

A. OPTIMIZE

B. ANALYZE

C. PARTITION BY

D. COMPACT

E. BUCKETING

Correct Answer: A, B

Question #51

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

A. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

B. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

C. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

D. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

E. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

Correct Answer: B
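
For reference, the trigger interval is set on the streaming writer. This is a minimal sketch, assuming `stream_df` is a streaming DataFrame and that the checkpoint path and target table are supplied by the caller. The function name is hypothetical.

```python
def start_stream(stream_df, checkpoint_path: str, target_table: str,
                 interval: str = "5 seconds"):
    """Start a streaming write that fires a microbatch every `interval`."""
    return (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)  # the question's current setting is "10 seconds"
        .toTable(target_table)
    )
```

Smaller, more frequent batches keep the amount of data per microbatch down, which is the reasoning option B gives for avoiding backlog and spill.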

Question #52

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

A. Cmd 4

B. Cmd 3

C. Cmd 2

D. Cmd 5

E. Cmd 6

Correct Answer: E

Question #53

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

A. Run source env/bin/activate in a notebook setup script

B. Use %sh install in a notebook cell

C. Use %pip install in a notebook cell

D. Install libraries from PyPI using the cluster UI

Correct Answer: C

Question #54

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?

A. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

B. No; the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

C. No; the change data feed only tracks inserts and updates, not deleted records.

D. No; files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

E. Yes; the change data feed uses foreign keys to ensure delete consistency throughout the Lakehouse.

Correct Answer: D
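
DELETE only marks the affected data files as removed in the transaction log; the files stay in storage and remain reachable through time travel until they are vacuumed. The sketch below builds the follow-up statement. The helper name is hypothetical, and 168 hours reflects Delta Lake's default 7-day retention threshold.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, the default Delta Lake retention threshold


def vacuum_sql(table_name: str, retain_hours: int = DEFAULT_RETENTION_HOURS) -> str:
    """Build a VACUUM statement that removes unreferenced files older than retain_hours."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return f"VACUUM {table_name} RETAIN {retain_hours} HOURS"


print(vacuum_sql("user_aggregates"))
```

Until the retention window has passed and VACUUM has run, earlier table versions can still expose the deleted records.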

Question #55

A data organization has adopted Delta Sharing to securely distribute curated datasets from a Unity Catalog-enabled workspace. The data engineering team shares large Delta tables internally via Databricks-to-Databricks and externally via Open Sharing for aggregated reports. While testing, they encounter challenges related to access control, data update visibility, and shareable object types. What is a limitation of the Delta Sharing protocol or implementation when used with Databricks-to-Databricks or Open Sharing?

A. With Open Sharing, recipients cannot access Volumes, Models, or notebooks -- only static Delta tables are supported.

B. With Databricks-to-Databricks sharing, Unity Catalog recipients must re-ingest data manually using COPY INTO or REST APIs.

C. Delta Sharing (both Databricks-to-Databricks and Open Sharing) allows recipients to modify the source data if they have select privileges.

D. Delta Sharing does not support Unity Catalog-enabled tables; only legacy Hive Metastore tables are shareable.

Correct Answer: A

Question #56

Which approach demonstrates a modular and testable way to use DataFrame transform for ETL code in PySpark?

A.

B.

C.

D.

Correct Answer: C
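
The code for the answer options is not included on this page. As a general illustration of the pattern the question describes, and not a reproduction of any option, each step can be a small function that takes a DataFrame and returns a DataFrame, chained with `DataFrame.transform`. The function and column names below are hypothetical.

```python
def drop_invalid_amounts(df):
    """Keep only rows with a positive amount."""
    return df.filter("amount > 0")


def rename_customer_key(df):
    """Standardize the customer key column name."""
    return df.withColumnRenamed("cust_id", "customer_id")


def run_pipeline(raw_df):
    """Compose the steps; each one can be unit-tested on a small DataFrame by itself."""
    return (
        raw_df
        .transform(drop_invalid_amounts)
        .transform(rename_customer_key)
    )
```

Because every step has the same DataFrame-in, DataFrame-out shape, steps can be reordered, reused across pipelines, and tested in isolation.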
