
5 Brutal Truths: Your Data Access Bottlenecks Are Killing Model Iteration (Not Your Models)

If experimentation takes days, your problem isn’t ML—it’s data friction.

Read time: 2.5 minutes

The uncomfortable truth is that you are limited by data access speed, not your modeling abilities.

As a data scientist, you start with a solid idea and a good plan. Then reality hits: you have to request access to the data, wait, follow up, and wait some more. Finally, you receive partial access to the data, have to redo the joins, and clean it up... again.

By the time you run your first experiment, several days have passed.

Meanwhile, another team with fast access to data has run multiple iterations and shipped.

They have the same data science skill set, but different results, simply because one team had fast data access and the other did not.

In machine learning, iteration is the key to success... if you cannot iterate quickly, your model will not succeed. It is not the model's fault; it is everything that happens before the model.

Five ways to eliminate data congestion and accelerate iteration:

1. Remove the Pause Between Data and Workflow
• Allow direct querying of data (e.g., Databricks, Snowflake, or Google BigQuery).
• Keep data science schemas persistent so environments survive between sessions.
• Cache reusable feature tables.
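The caching idea above can be sketched in a few lines. This is a minimal, hypothetical sketch, not a warehouse client: `run_query` stands in for whatever client call (Databricks, Snowflake, BigQuery) actually executes the SQL, and the `feature_cache/` directory name is an assumption.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("feature_cache")  # hypothetical local cache location

def cached_query(run_query, sql, cache_dir=CACHE_DIR):
    """Return cached rows for `sql` if present; otherwise run and cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(sql.encode()).hexdigest()[:16]
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    rows = run_query(sql)  # the real warehouse call happens only on a miss
    cache_file.write_text(json.dumps(rows))
    return rows
```

The second time the same query runs, the warehouse is never touched, which is exactly the kind of pause this section is about removing.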

2. Stop Rebuilding the Same Data
• Use shared feature stores (Feast or Tecton).
• Version datasets (for example, with Delta Lake).
• Keep all transformations in one place.
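Delta Lake handles dataset versioning for real workloads; the core idea, a stable version id derived from the data's content, can be sketched like this (a toy illustration, not Delta Lake's mechanism):

```python
import hashlib
import json

def dataset_version(rows):
    """Content-addressed version id: identical data always gets
    the same id, so experiments can record exactly what they ran on."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]
```

Logging this id with every experiment is what makes "which data did this model see?" answerable months later.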

3. Align Training and Production Data
• Build both from the same pipeline.
• Pull data from production-ready tables.
• Validate schema and distributions before deploying to production.
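A pre-deployment check like the one above can be sketched as a simple schema-and-drift gate. This is a minimal sketch assuming rows are dictionaries; the relative-mean-shift check is a stand-in for a proper distribution test.

```python
import statistics

def validate(train_rows, prod_rows, tol=0.1):
    """Fail if columns differ, or if any numeric column's mean has
    shifted by more than `tol` (relative) between training and prod."""
    train_cols, prod_cols = set(train_rows[0]), set(prod_rows[0])
    if train_cols != prod_cols:
        return False, f"schema mismatch: {train_cols ^ prod_cols}"
    for col in sorted(train_cols):
        t = [r[col] for r in train_rows if isinstance(r[col], (int, float))]
        p = [r[col] for r in prod_rows if isinstance(r[col], (int, float))]
        if t and p:
            mt, mp = statistics.mean(t), statistics.mean(p)
            if mt and abs(mt - mp) / abs(mt) > tol:
                return False, f"drift in {col}: {mt:.2f} vs {mp:.2f}"
    return True, "ok"
```

Running a gate like this before every deploy turns "the model broke in production" into a failed check you see in minutes.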

4. Shorten the Experiment Cycle
• Iteration speed determines the rate of progress.
• Use self-service data and compute.
• Pre-load small test samples alongside full datasets.
• Track experiments (MLflow or Weights & Biases).
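MLflow and Weights & Biases provide full experiment tracking; the underlying idea is just "record params and metrics for every run". Here is a minimal, hypothetical JSON-lines logger to show the shape of it (the `experiments/` directory name is an assumption, not any library's convention):

```python
import json
import time
from pathlib import Path

LOG_DIR = Path("experiments")  # hypothetical local log location

def log_run(params, metrics, log_dir=LOG_DIR):
    """Append one experiment run (params + metrics) as a JSON line."""
    log_dir.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), "params": params, "metrics": metrics}
    with open(log_dir / "runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Even this much means you never re-run an experiment because nobody wrote down what it scored.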

5. Fix Access, Not Just Models
• Debugging data access is expensive in time and attention.
• Standardize schemas and data libraries.
• Use a data catalog and data lineage tooling.
• Automate data access with role-based access control (RBAC) and identity and access management (IAM).
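In practice RBAC lives in your warehouse or IAM system, not in application code. As a toy sketch of the concept only, here is a role-to-table grant map; the role and table names are purely illustrative:

```python
# Hypothetical grants; real setups use warehouse GRANT statements or IAM policies.
GRANTS = {
    "data_scientist": {"features.users", "features.orders"},
    "analyst": {"features.orders"},
}

def can_read(role, table):
    """True if `role` has been granted read access to `table`."""
    return table in GRANTS.get(role, set())
```

The point of automating this: access becomes a lookup that succeeds in milliseconds instead of a ticket that resolves in days.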

💡Key Takeaway: 

It is not your model that is inefficient... it is the way you access your data, and that is what is costing you results.

👉 LIKE this if you've ever experienced an overall slowdown due to slow data access.

👉 SUBSCRIBE now to get tips and tricks on how to create efficient, intelligent systems for accessing and working with data.

👉 Follow Glenda Carnate to stay informed about the latest trends in Data Science and effective data use.

👉 COMMENT on what the largest area of delay in your data workflow is currently.

👉 SHARE this with someone you know is still waiting for data before starting a build.
