Thursday, July 4, 2024

Educating ChatGPT on Data Lakehouse

As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics. ChatGPT is an excellent resource for gaining high-level insights and building awareness of any technology. However, caution is necessary when delving deeper into a particular technology. ChatGPT is trained on historical data and, depending on how one phrases the question, it may offer inaccurate or misleading information.

I took the free version of ChatGPT on a test drive (in March 2023) and asked some simple questions on the data lakehouse and its components. Here are some responses that weren’t exactly right, along with our explanation of where and why they went wrong. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good.

I thought this was a fairly comprehensive list. The one key component that is missing is a common, shared table format that can be used by all analytic services accessing the lakehouse data. When implementing a data lakehouse, the table format is a critical piece because it acts as an abstraction layer, making it easy for any engine or tool to access all the structured and unstructured data in the lakehouse, concurrently. The table format provides the structure that is missing in a data lake, using a schema or metadata definition, to bring it closer to a data warehouse. Some of the popular table formats are Apache Iceberg, Delta Lake, Hudi, and Hive ACID.
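To make the "abstraction layer" idea concrete, here is a minimal sketch in Python. It is a toy illustration only and does not reflect how Apache Iceberg, Delta Lake, or any other real table format is implemented; the class names, the `plan_scan` helper, the engine names, and the file paths are all hypothetical. It shows a single metadata object (a schema plus snapshots of data files) that two different engines consult instead of listing storage directories themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str


@dataclass
class TableMetadata:
    """Toy stand-in for table-format metadata: a schema plus snapshots."""
    schema: List[Column]
    snapshots: List[List[str]] = field(default_factory=list)

    def commit(self, added_files: List[str]) -> int:
        """Record a new snapshot extending the previous one; return its id."""
        previous = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(previous + added_files)
        return len(self.snapshots) - 1

    def files(self, snapshot_id: int = -1) -> List[str]:
        """Return the data files visible in a snapshot (latest by default)."""
        if not self.snapshots:
            return []
        return list(self.snapshots[snapshot_id])


def plan_scan(engine: str, table: TableMetadata) -> Dict[str, object]:
    """Each engine plans its read from the shared metadata, not from storage listings."""
    return {
        "engine": engine,
        "columns": [c.name for c in table.schema],
        "files": table.files(),
    }


# Hypothetical table whose data files live in two different stores.
orders = TableMetadata(schema=[Column("order_id", "long"), Column("amount", "double")])
orders.commit(["s3://lake/orders/part-0.parquet"])
orders.commit(["hdfs://nn/lake/orders/part-1.parquet"])

sql_plan = plan_scan("sql-engine", orders)
ml_plan = plan_scan("ml-engine", orders)
```

Because both plans are derived from the same metadata, the two engines see the same columns and the same set of files, and an older snapshot can still be read by its id. Real table formats add much more on top of this idea, such as atomic commits and schema evolution.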

Also, the data lake layer is not limited to cloud object stores. Many companies still have massive amounts of data on premises, and data lakehouses are not limited to public clouds. They can be built on premises or as hybrid deployments leveraging private clouds, HDFS stores, or Apache Ozone.

At Cloudera, we also provide machine learning as part of our lakehouse, so data scientists get easy access to reliable data in the data lakehouse to quickly launch new machine learning projects and build and deploy new models for advanced analytics.

I like how ChatGPT started this answer, but it quickly jumps into features and even gives an incorrect response on the feature comparison. Features are not the only way of deciding which is the better table format. The choice also depends on compatibility, openness, versatility, and other factors that can guarantee broader usage for varied data users, guarantee security and governance, and future-proof your architecture.

Here is a high-level feature comparison chart if you want to go into the details of what’s available on Delta Lake versus Apache Iceberg.

 

This response is rather dangerous because of its incorrectness, and it demonstrates why I feel these tools are not ready for deeper analysis. At first glance it may look like a reasonable response, but its premise is wrong, which makes you doubt the entire response and other responses as well. Saying “Delta Lake is built on top of Apache Iceberg” is incorrect, as the two are completely different, unrelated table formats, and one has nothing to do with the conception of the other. They were created by different organizations to solve common data problems.

 

I’m impressed that ChatGPT got this one right, although it made a few mistakes with our product names and missed a few that are critical for a lakehouse implementation.

CDP’s components that support a data lakehouse architecture include:

  1. Apache Iceberg table format, which is integrated into CDP to provide structure to the massive amounts of structured and unstructured data in your data lake.
  2. Data services, including a cloud-native data warehouse called CDW, a data engineering service called CDE, a data streaming service called data in motion, and a machine learning service called CML.
  3. Cloudera Shared Data Experience (SDX), which provides a unified data catalog with automatic data profilers, unified security, and unified governance over all your data in both the public and private cloud.

ChatGPT is a great tool to get a high-level view of new technologies, but I’d say use it carefully, validate its responses, and use it only for the awareness stage of the buying cycle. As you go into the consideration or comparison stage, it’s not reliable yet.

Also, answers on ChatGPT keep updating, so hopefully it corrects itself before you read this blog.

To learn more about Cloudera’s lakehouse, visit the webpage, and if you are ready to get started, watch the Cloudera Now demo.
