Your Data’s (Finally) In The Cloud. Now, Stop Acting So On-Prem
3 Ways Data Teams Changed with the Cloud
5 Ways Data Teams Still Act Like They Are On-Prem
Be a learn-it-all

The modern data stack lets you do things differently, not just at a bigger scale. Take advantage of it.

Towards Data Science
Photo by Massimo Botturi on Unsplash

Imagine you’ve been building houses with a hammer and nails for most of your career, and then I gave you a nail gun. But instead of pressing it to the wood and pulling the trigger, you turn it sideways and hit the nail just as you would if it were a hammer.

You’d probably think it’s expensive and not particularly effective, while the site inspector would rightly view it as a safety hazard.

Well, that’s because you’re using modern tooling with legacy thinking and processes. And while this analogy isn’t a perfect encapsulation of how some data teams operate after moving from on-premises to a modern data stack, it’s close.

Teams quickly understand how hyper-elastic compute and storage services enable them to handle more diverse data types at previously unheard-of volume and velocity, but they don’t always understand the impact of the cloud on their workflows.

So perhaps a better analogy for these recently migrated data teams would be if I gave you 1,000 nail guns…and then watched you turn them all sideways to hit 1,000 nails at the same time.

Regardless, the important thing to understand is that the modern data stack doesn’t just let you store and process data bigger and faster; it lets you handle data fundamentally differently to accomplish new goals and extract different types of value.

This is partly due to the increase in scale and speed, but also a result of richer metadata and more seamless integrations across the ecosystem.

Image courtesy of Shane Murray and the author.

In this post, I highlight three of the more common ways I see data teams change their behavior in the cloud, and five ways they don’t (but should). Let’s dive in.

3 Ways Data Teams Changed with the Cloud

There are reasons data teams move to a modern data stack (beyond the CFO finally freeing up budget). These use cases are typically the first and easiest behavior shifts for data teams once they enter the cloud. They are:

Moving from ETL to ELT to speed up time-to-insight

You can’t just load anything into your on-premises database, especially not if you want a query to return before the weekend hits. As a result, these data teams need to carefully consider what data to pull and how to transform it into its final state, often via a pipeline hardcoded in Python.

That’s like cooking made-to-order meals for every data consumer rather than putting out a buffet, and as anyone who has been on a cruise ship knows, when you need to feed an insatiable demand for data across the organization, a buffet is the way to go.

This was the case for AutoTrader UK technical lead Edward Kent, who spoke with my team last year about data trust and the demand for self-service analytics.

“We want to empower AutoTrader and its customers to make data-informed decisions and democratize access to data through a self-serve platform…. As we migrate trusted on-premises systems to the cloud, the users of those older systems need to trust that the new cloud-based technologies are as reliable as the older systems they’ve used in the past,” he said.

When data teams migrate to the modern data stack, they gleefully adopt automated ingestion tools like Fivetran and transformation tools like dbt and Spark, along with more sophisticated data curation strategies. Analytical self-service opens up a whole new can of worms, and it’s not always clear who should own data modeling, but on the whole it’s a much more efficient way of addressing analytical (and other!) use cases.
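As a minimal sketch of the ELT pattern (the Snowflake connection, stage, and table names below are assumptions, and an ingestion tool like Fivetran would normally handle the load step for you):

```python
# A minimal ELT sketch: land the raw data first, then transform it inside the warehouse.
# Connection parameters, stage, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="...", warehouse="LOAD_WH"
)
cur = conn.cursor()

# 1. "L" before "T": copy the raw, untransformed records into a staging table.
cur.execute(
    "COPY INTO raw.orders FROM @raw_stage/orders/ "
    "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
)

# 2. Transform in-warehouse, where elastic compute does the heavy lifting.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.daily_orders AS
    SELECT order_date, customer_id, SUM(amount) AS total_amount
    FROM raw.orders
    GROUP BY order_date, customer_id
""")
```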

Real-time data for operational decision making

In the modern data stack, data can move fast enough that it no longer needs to be reserved for those daily metric pulse checks. Data teams can take advantage of Delta Live Tables, Snowpark, Kafka, Kinesis, micro-batching, and more.
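As a rough sketch of what micro-batching can look like in practice, here is Spark Structured Streaming reading a Kafka topic and writing one-minute batches to a Delta table (the broker, topic, and paths are assumptions, not a recommendation):

```python
# A sketch of micro-batching with Spark Structured Streaming.
# Broker address, topic name, and output paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "order_events")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

# Write in small batches every minute instead of waiting for a nightly job.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/checkpoints/order_events")
    .trigger(processingTime="1 minute")
    .start("/tables/order_events")
)
query.awaitTermination()
```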

Not every team has a real-time data use case, but those that do are typically well aware of it. These are often companies with significant logistics needs requiring operational support, or technology companies with reporting tightly integrated into their products (although a good portion of the latter were born in the cloud).

Challenges still exist, of course. These can sometimes involve running parallel architectures (analytical batch and real-time streaming) and trying to reach a level of quality control that isn’t possible to the degree most would like. But most data leaders quickly understand the value unlocked by being able to more directly support real-time operational decision making.

Generative AI and machine learning

Data teams are well aware of the GenAI wave, and many industry watchers suspect this emerging technology is driving a huge wave of infrastructure modernization and utilization.

But before ChatGPT generated its first essay, machine learning applications had slowly moved from cutting edge to standard best practice for many data-intensive industries, including media, e-commerce, and advertising.

Today, many data teams immediately start examining these use cases the minute they have scalable storage and compute (although some would benefit from building a better foundation).

If you recently moved to the cloud and haven’t asked the business how these use cases could better support it, put it on the calendar. For this week. Or today. You’ll thank me later.

5 Ways Data Teams Still Act Like They Are On-Prem

Now, let’s take a look at some of the unrealized opportunities that formerly on-premises data teams can be slower to exploit.

Side note: I want to be clear that while my earlier analogy was a bit humorous, I’m not making fun of the teams that still operate on-premises or that operate in the cloud using the processes below. Change is hard. It’s even harder when you’re facing a constant backlog and ever-increasing demand.

Data testing

On-premises data teams don’t have the scale or the rich metadata from central query logs and modern table formats to easily run machine-learning-driven anomaly detection (in other words, data observability).

Instead, they work with domain teams to understand data quality requirements and translate them into SQL rules, or data tests. For example, customer_id should never be NULL, or currency_conversion should never have a negative value. There are on-premises tools designed to help speed up and manage this process.
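Sketched as code, a couple of those rules might look like this (the table names and the bare-bones harness are hypothetical; in practice these rules usually live in dbt tests or a testing framework):

```python
# A sketch of hand-written data tests: each rule is a SQL query that counts
# violating rows and should return zero. Table names and connection are hypothetical.
import snowflake.connector

TESTS = {
    "customer_id should never be NULL":
        "SELECT COUNT(*) FROM analytics.customers WHERE customer_id IS NULL",
    "currency_conversion should never be negative":
        "SELECT COUNT(*) FROM analytics.transactions WHERE currency_conversion < 0",
}

conn = snowflake.connector.connect(account="my_account", user="tester", password="...")
cur = conn.cursor()

for name, sql in TESTS.items():
    cur.execute(sql)
    failures = cur.fetchone()[0]
    status = "PASS" if failures == 0 else f"FAIL ({failures} bad rows)"
    print(f"{name}: {status}")
```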

When these data teams get to the cloud, their first thought isn’t to approach data quality differently; it’s to execute data tests at cloud scale. It’s what they know.

I’ve seen case studies that read like horror stories (and no, I won’t name names) where a data engineering team is running millions of tasks across thousands of DAGs to monitor data quality across hundreds of pipelines. Yikes!

What happens when you run a half million data tests? I’ll tell you. Even if the vast majority pass, there are still tens of thousands that will fail. And they will fail again tomorrow, because there is no context to expedite root cause analysis or even begin to triage and figure out where to start.

You’ve somehow alert-fatigued your team AND still not reached the level of coverage you need. Not to mention that wide-scale data testing is both time- and cost-intensive.

Image courtesy of the author. Source.

Instead, data teams should leverage technologies that can detect, triage, and help root-cause potential issues, while reserving data tests (or custom monitors) for the clearest thresholds on the most important values within the most-used tables.
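For contrast, here is a toy sketch of the metadata-driven approach: flag a table whose daily row-count growth deviates sharply from its recent history. The logged history and z-score threshold are assumptions, and real observability tools monitor freshness, volume, schema, and distributions across every table automatically:

```python
# A toy sketch of anomaly detection on warehouse metadata (daily new-row counts).
# Assumes you already log row counts per table per day; values are hypothetical.
from statistics import mean, stdev

def is_anomalous(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    """Flag the latest daily row count if it deviates strongly from recent history."""
    if len(history) < 7:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Example: a table that normally grows by ~10k rows a day suddenly receives almost nothing.
daily_new_rows = [10_120, 9_870, 10_340, 10_050, 9_990, 10_210, 10_105]
print(is_anomalous(daily_new_rows, latest=312))  # True -> likely a broken pipeline
```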

Data modeling for data lineage

There are many legitimate reasons to support a central data model, and you’ve probably read all of them in a great Chad Sanderson post.

But every now and then I run into data teams in the cloud that are investing considerable time and resources into maintaining data models for the sole reason of maintaining and understanding data lineage. When you are on-premises, that is basically your best bet, unless you want to read through long blocks of SQL code and create a corkboard so full of flashcards and yarn that your significant other starts asking if you are OK.

Photo by Jason Goodman on Unsplash

(“No, Lior! I’m not OK, I’m trying to understand how this WHERE clause changes which columns are in this JOIN!”)

Multiple tools within the modern data stack, including data catalogs, data observability platforms, and data repositories, can leverage metadata to create automated data lineage. It’s just a matter of picking a flavor.
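As a small illustration of lineage falling out of metadata rather than a hand-maintained model, this sketch parses a query (say, one pulled from the warehouse’s query history) with the open-source sqlglot library and lists the tables it touches (the query itself is made up):

```python
# A sketch of metadata-driven lineage: parse SQL (e.g. pulled from query logs)
# and extract the tables it references. The query below is hypothetical.
import sqlglot
from sqlglot import exp

query = """
CREATE TABLE analytics.daily_orders AS
SELECT o.order_date, c.region, SUM(o.amount) AS total_amount
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region
"""

parsed = sqlglot.parse_one(query)

# Every table expression in the statement becomes a node in the lineage graph.
tables = {f"{t.db}.{t.name}" for t in parsed.find_all(exp.Table)}
print(tables)  # e.g. {'analytics.daily_orders', 'raw.orders', 'raw.customers'}
```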

Customer segmentation

In the old world, the view of the customer is flat, whereas we know it really needs to be a 360-degree view.

This limited customer view is the result of pre-modeled data (ETL), experimentation constraints, and the length of time required for on-premises databases to calculate more sophisticated queries (unique counts, distinct values) on larger data sets.

Unfortunately, data teams don’t always remove the blinders from their customer lens once those constraints have been removed in the cloud. There are often several reasons for this, but the biggest culprits by far are good old-fashioned data silos.

The customer data platform that the marketing team operates is still alive and kicking. That team could benefit from enriching its view of prospects and customers with other domains’ data stored in the warehouse/lakehouse, but the habits and sense of ownership built from years of campaign management are hard to break.

So instead of targeting prospects based on the highest estimated lifetime value, it will be cost per lead or cost per click. This is a missed opportunity for data teams to contribute value to the organization in a direct and highly visible way.
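To make the alternative concrete, here is a sketch of cross-domain enrichment: join marketing’s prospect list to order history in the warehouse and rank prospects by historical spend as a crude lifetime-value proxy (every table and column name here is a made-up example):

```python
# A sketch of enriching the marketing view with warehouse data: rank prospects by
# historical spend as a crude lifetime-value proxy. All names are hypothetical.
import snowflake.connector

LTV_QUERY = """
SELECT
    p.prospect_id,
    p.campaign,
    COALESCE(SUM(o.amount), 0) AS historical_spend  -- crude LTV proxy
FROM marketing.prospects p
LEFT JOIN analytics.orders o
    ON p.email = o.customer_email
GROUP BY p.prospect_id, p.campaign
ORDER BY historical_spend DESC
LIMIT 1000
"""

conn = snowflake.connector.connect(account="my_account", user="analyst", password="...")
cur = conn.cursor()
cur.execute(LTV_QUERY)
top_prospects = cur.fetchall()  # feed these back into the campaign tooling
```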

Exporting data for external sharing

Copying and exporting data is the worst. It takes time, adds costs, creates versioning issues, and makes access control virtually impossible.

Instead of taking advantage of your modern data stack to create a pipeline that exports data to your usual partners at blazing-fast speeds, more data teams in the cloud should leverage zero-copy data sharing. Just as managing the permissions on a cloud file has largely replaced the email attachment, zero-copy data sharing allows access to data without having to move it from the host environment.

Both Snowflake and Databricks have announced and heavily featured their data sharing technologies at their annual summits over the last two years, and more data teams need to start taking advantage.
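To give a feel for the mechanics, here is a sketch of the provider side of a Snowflake share run from Python (the database, schema, table, and consumer account names are placeholders, and the grants you need will differ):

```python
# A sketch of zero-copy sharing on the provider side in Snowflake.
# Database, schema, table, and consumer account names are hypothetical.
import snowflake.connector

statements = [
    "CREATE SHARE partner_share",
    "GRANT USAGE ON DATABASE analytics TO SHARE partner_share",
    "GRANT USAGE ON SCHEMA analytics.public TO SHARE partner_share",
    "GRANT SELECT ON TABLE analytics.public.daily_orders TO SHARE partner_share",
    # The consumer account can now query the table in place: no copies, no exports.
    "ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account",
]

conn = snowflake.connector.connect(
    account="my_account", user="admin", password="...", role="ACCOUNTADMIN"
)
cur = conn.cursor()
for sql in statements:
    cur.execute(sql)
```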

Optimizing cost and performance

Within many on-premises systems, it falls to the database administrator to oversee all the variables that could impact overall performance and adjust as necessary.

Within the modern data stack, on the other hand, you frequently see one of two extremes.

In a few cases, the role of DBA remains or is farmed out to a central data platform team, which can create bottlenecks if not managed properly. More common, however, is that cost or performance optimization becomes the wild west until a particularly eye-watering bill hits the CFO’s desk.

This typically occurs when data teams don’t have the proper cost monitors in place and there is a particularly aggressive outlier event (perhaps bad code or exploding JOINs).

Additionally, some data teams fail to take full advantage of the “pay for what you use” model and instead opt to commit to a predetermined amount of credits (typically at a discount)…and then exceed it. While there is nothing inherently wrong with credit-commit contracts, having that runway can create bad habits that build up over time if you aren’t careful.

The cloud enables and encourages a more continuous, collaborative, and integrated approach to DevOps/DataOps, and the same is true when it comes to FinOps. The teams I see that are most successful with cost optimization within the modern data stack are those that make it part of their daily workflows and incentivize those closest to the cost.

“The rise of consumption-based pricing makes this even more critical, as the release of a new feature could potentially cause costs to rise exponentially,” said Tom Milner at Tenable. “As the manager of my team, I check our Snowflake costs daily and will make any spike a priority in our backlog.”

This creates feedback loops, shared learnings, and hundreds of small, quick fixes that drive big results.

“We have alerts set up for when someone queries anything that would cost us more than $1. This is quite a low threshold, but we’ve found that it doesn’t need to cost more than that. We found this to be a feedback loop. [When this alert occurs] it’s often someone forgetting a filter on a partitioned or clustered column, and they can learn quickly,” said Stijn Zanders at Aiven.
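A minimal version of that kind of alert might look like the sketch below: scan the last hour of query history and flag anything whose estimated cost crosses a small threshold. The cost math (elapsed time multiplied by an assumed credit rate and credit price) and all names are assumptions, not how Aiven actually does it:

```python
# A sketch of a simple query-cost alert: flag recent queries whose estimated cost
# exceeds a small threshold. The credit-rate math and names are rough assumptions.
import snowflake.connector

CREDITS_PER_SECOND = 0.0011  # ~4 credits/hour for a MEDIUM warehouse (assumption)
DOLLARS_PER_CREDIT = 3.00    # contract-dependent (assumption)
THRESHOLD_DOLLARS = 1.00

QUERY = """
SELECT query_id, user_name, warehouse_name, total_elapsed_time / 1000 AS seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
"""

conn = snowflake.connector.connect(account="my_account", user="finops_bot", password="...")
cur = conn.cursor()
cur.execute(QUERY)

for query_id, user_name, warehouse_name, seconds in cur.fetchall():
    estimated_cost = seconds * CREDITS_PER_SECOND * DOLLARS_PER_CREDIT
    if estimated_cost > THRESHOLD_DOLLARS:
        # In practice this would go to Slack or the team's backlog, not stdout.
        print(f"{user_name} ran {query_id} on {warehouse_name}: ~${estimated_cost:.2f}")
```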

Finally, deploying charge-back models across teams, previously unfathomable in the pre-cloud days, is a complicated but ultimately worthwhile endeavor I’d like to see more data teams evaluate.

Be a learn-it-all

Microsoft CEO Satya Nadella has spoken about how he deliberately shifted the company’s organizational culture from “know-it-alls” to “learn-it-alls.” This would be my best advice for data leaders, whether you have just migrated or have been at the forefront of data modernization for years.

I understand just how overwhelming it can be. New technologies are coming fast and furious, as are calls from the vendors hawking them. Ultimately, it’s not going to be about having the “most modern” data stack in your industry, but rather about creating alignment between modern tooling, top talent, and best practices.

To do that, always be ready to learn how your peers are tackling many of the challenges you are facing. Engage on social media, read Medium, follow analysts, and attend conferences. I’ll see you there!

What other on-prem data engineering activities no longer make sense in the cloud? Reach out to Barr on LinkedIn with any comments or questions.
