Navigating the Unknown in Data Projects: An Iterative Approach for Data Teams

June 16, 2025
Written by Palak Dhawan
In data engineering, we often focus on ETL pipelines, scalable data platforms, and real-time ingestion. But behind the technical talk lies a common challenge: being expected to start delivering before there's full clarity on the data. Sound familiar?

Whether the cause is legacy system limitations, missing documentation, unclear business rules, the absence of the right stakeholders, or data that is still being sourced, many data teams find themselves building solutions amid ambiguity.

So the real question is: how do we deliver meaningful data products in such uncertain conditions?

Let's break it down.

1. How Do You Understand the Problem Statement and Overall Scope?

Before any delivery begins, one of the most critical steps is to gain a solid grasp of the problem and define the scope clearly.

A structured discovery phase with all key stakeholders is essential. This process lays the groundwork by mapping out current (as-is) business processes, surfacing core pain points, and identifying operational constraints. It helps the team frame the problem statement accurately and build a shared understanding of what needs to be solved.

Insights gathered during this phase should guide the development of a future-state roadmap, define the Minimum Viable Product (MVP), sketch out desired (to-be) user journeys, and inform the high-level solution architecture.

Equally important is a deep dive into the legacy system’s codebase. Often, there are gaps between what stakeholders believe the system does and what the logic actually reflects. Analysing the code early helps surface these discrepancies before they become blockers down the line.

💡 Tip: Document the discovered business logic clearly, validate it with stakeholders, and integrate it into the scope. This ensures alignment across the board—and helps avoid surprises later in the project.

2. Starting Delivery Without Fully Understanding the Data

You might be handed raw data or legacy code and asked to “just migrate it.” But without understanding how that data is generated—or the business logic behind it—you’re likely to make wrong assumptions.

Often, there’s no schema documentation. Just code. In such cases, you’ll need to reverse-engineer the system: map inputs to outputs, trace joins, and infer meaning from variable names and function flows.

💡 Tip: Pay close attention to transformation logic, filters, and conditionals in the legacy code—they often reveal embedded business rules. 
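To make that concrete, here is a minimal sketch of what surfacing an embedded rule can look like. The column names and thresholds are invented for illustration; the point is to rewrite the inferred logic in plain, reviewable form rather than leave it buried in legacy code.

```python
from typing import Optional

# Hypothetical legacy behaviour, re-expressed in plain Python so the inferred
# business rules can be reviewed with stakeholders. All names and thresholds
# here are illustrative, not taken from any real system.

def classify_order(order: dict) -> Optional[str]:
    # Inferred rule 1: the legacy filter silently dropped orders under 5 units.
    if order.get("quantity", 0) < 5:
        return None
    # Inferred rule 2: status_code values of 900 and above appear to mean a
    # manual override rather than an error, based on how downstream joins use them.
    if order.get("status_code", 0) >= 900:
        return "manual_review"
    return "standard"
```

Once the rules are written out like this, stakeholders can confirm or correct each one, and the validated version can feed straight into the scope and the test cases.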

3. How Do You Access the Data?

Before any transformation begins, you need access to the raw source—whether it’s a database, API, file dump, or third-party service.

Here’s a quick checklist to guide you:

  • Is access controlled via IAM roles, API keys, or VPN?
  • What tools are available to explore the data? (e.g., dbt, Jupyter, DBeaver)
  • Is sample data accessible for early-stage exploration?
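If the source turns out to be a relational database, an early access check can be as simple as the sketch below. The connection details, table name, and choice of SQLAlchemy and pandas are assumptions; adapt them to whatever your source and secret management actually look like.

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details, read from environment variables rather
# than hard-coded credentials.
url = (
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:5432/{os.environ.get('DB_NAME', 'analytics')}"
)
engine = create_engine(url)

# Pull a small sample first: enough to confirm access and eyeball the structure
# without putting load on a production system.
sample = pd.read_sql("SELECT * FROM orders LIMIT 100", engine)
print(sample.dtypes)
print(sample.head())
```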

4. How to Interpret the Data?

When a schema lacks context, it’s on you to decode it. Take a column like status_code: is it an HTTP response code? An order state? Something else entirely?

Approach:

  • Reverse-engineer meaning from pipelines or legacy code
  • Profile the data: look for patterns, nulls, and duplicates
  • If real data isn’t available, mock input variations to test system behavior

You don’t always need full datasets. Often, simulating edge cases or input variations is enough to uncover validation rules and logic relationships—without needing production access.
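A lightweight profiling pass, plus a handful of mocked rows, is often enough to start answering those questions. The sketch below assumes a hypothetical orders extract and column names; the pattern, not the specifics, is what matters.

```python
import pandas as pd

df = pd.read_csv("orders_sample.csv")  # hypothetical extract; any sample will do

# Profile the data: null rates, actual values, and duplicate keys often say
# more than the column names do.
print(df.isna().mean().sort_values(ascending=False))           # null rate per column
print(df["status_code"].value_counts(dropna=False).head(10))   # which values really occur?
print(df.duplicated(subset=["order_id"]).sum())                # duplicate keys?

# No real data yet? Mock a few input variations and observe how the system
# (or the reverse-engineered logic) behaves.
mock_rows = pd.DataFrame([
    {"order_id": 1, "status_code": 200, "quantity": 10},
    {"order_id": 2, "status_code": 999, "quantity": 0},    # suspicious edge case
    {"order_id": 3, "status_code": None, "quantity": -1},  # missing and invalid values
])
```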

5. Aligning Data with Business Requirements

When context is missing, it's your job to connect the dots between raw data and business logic. Stay closely aligned with stakeholders to understand how data drives outcomes.

Key Questions to Ask:

  • What does this metric actually represent?
  • How is it used in decision-making?
  • What's the expected output format—dashboard, CSV, API, or something else?

Clear answers here ensure the data serves real business needs, not just technical completeness.
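One way to keep those answers from evaporating after the meeting is to record them next to the technical definition. A minimal sketch, with an invented metric purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str
    business_meaning: str  # what the metric represents, in stakeholder language
    decision_use: str      # how it feeds into decision-making
    output_format: str     # dashboard, CSV, API, or something else
    source_logic: str      # the query or transformation it maps to

# Hypothetical entry, filled in from stakeholder answers rather than guesswork.
active_customers = MetricDefinition(
    name="active_customers",
    business_meaning="Customers with at least one paid order in the last 30 days",
    decision_use="Reviewed weekly; a sustained dip triggers retention campaigns",
    output_format="dashboard",
    source_logic="SELECT COUNT(DISTINCT customer_id) FROM orders WHERE ...",
)
```

Keeping the business meaning and the source logic side by side makes it obvious when the data and the intent start to drift apart.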

6. Defining User Stories in Data Projects – When Nothing Is Documented

Data often takes a backseat in user stories, leading to unclear requirements, misaligned expectations, and testing delays—especially when documentation is sparse or nonexistent.

To avoid this, take a structured approach:

  • Run workshops/interviews to define clear As-Is and To-Be journey maps
  • Keep refining process flows as scope and understanding evolve
  • Center data in your requirement gathering—not just functionality
  • Include relevant datasets and definitions directly in user stories
  • Break stories into small, testable units to enable fast feedback and continuous validation

This helps ensure developers are aligned and solutions stay grounded in real business needs. 

7. Testing Without the Full Dataset

Lack of access to full production data can complicate testing—but it doesn’t have to stall progress.

Workarounds:

  • Use synthetic or scaled-down datasets that preserve variation and structure
  • Focus on edge cases and boundary conditions using mock inputs
  • Validate transformations by comparing results to expected output formats

You don’t need real data to test logic—just realistic, representative scenarios.
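In practice, that can look like a small pytest-style test built on synthetic rows. The transformation and thresholds below are hypothetical; the point is that a few carefully chosen boundary rows exercise the logic without any production data.

```python
import pandas as pd

# Hypothetical transformation under test: flag orders that need manual review.
def flag_for_review(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["needs_review"] = (out["status_code"] >= 900) | (out["quantity"] <= 0)
    return out

def test_flag_for_review_edge_cases():
    # Synthetic rows chosen to cover boundaries, not to mirror production volume.
    df = pd.DataFrame([
        {"order_id": 1, "status_code": 899, "quantity": 1},  # just below the threshold
        {"order_id": 2, "status_code": 900, "quantity": 5},  # exactly at the threshold
        {"order_id": 3, "status_code": 200, "quantity": 0},  # zero quantity
    ])
    result = flag_for_review(df)
    assert result["needs_review"].tolist() == [False, True, True]
```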

8. Delays in Getting Feedback from End Users

Late feedback in data projects can lead to expensive rework across ingestion, transformation, and output layers. The earlier you involve end users, the smoother the delivery.

How to Avoid This:

  • Set up a strong feedback loop from day one
  • Engage end users throughout—not just at the finish line
  • Involve stakeholders during requirements, grooming, reviews, and demos
  • Make UAT a central part of your testing strategy

End users know the data best—their input is key to building the right solution the first time.

Best Practices: Delivering Data Projects on Time Amid Multiple Moving Parts

Data projects involve complex interdependencies—source systems, transformation logic, business rules, and downstream consumers. To stay on track and avoid delivery slippage, apply these principles:

  • Invest in Discovery & Data Profiling: Understand data quality, source complexity, and transformation challenges upfront.
  • Avoid Premature Scope Commitment: Don’t lock in timelines or deliverables before clarity is achieved.
  • Define Short, Actionable Milestones: Break the project into small, achievable chunks to keep progress visible and risks manageable.
  • Continuously Refine the Backlog: Conduct mini-discoveries to validate assumptions and adjust scope as needed.
  • Estimate Collaboratively: Involve the entire team. Data tasks often hide complexity behind seemingly simple labels.
  • Build Incrementally, Test Early: Validate as you go to catch issues early and reduce rework.
  • Maintain Strong Delivery Hygiene: Clear user stories and disciplined sprint practices reduce spillovers.
  • Proactively Manage Dependencies: Track and mitigate risks, assumptions, and inter-team dependencies early.
  • Communicate Timeline Shifts Early: Keep stakeholders informed to align priorities and manage expectations.
  • Don’t Skip Retrospectives: Use retros to improve collaboration and continuously refine your delivery process.

Conclusion

Data projects often start in ambiguity: limited documentation, unclear logic, evolving requirements. That’s expected. What truly drives success is curiosity, clear communication, and thoughtful iteration. Ask the right questions. Test early. Adapt often. Because in the end, it’s not just about moving data; it’s about shaping it into something people can trust and act on.

Authors

Palak Dhawan, Lead Consultant
Ankon Chakraborty, Senior Consultant