5 Major Business Mistakes When Working with Big Data: Lessons from a Company Managing 16 TB of Data

Over a quarter of data and analytics professionals worldwide estimate that poor-quality data costs companies over $5 million annually, with 7% putting the figure at $25 million or more. Using low-quality data is just one of many common mistakes that, over time, can lead to serious financial losses.

I’ve been working with the production analytics system at Waites for over 10 years. During this time, we’ve collected more than 16 terabytes of data and receive over 10 billion new data points daily from industrial equipment. Based on my experience working with Big Data, I’ve identified five major mistakes businesses often make when they first begin working with data.

Undefined data retention period

The length of time data is stored before being deleted is determined by the Time to Live (TTL) parameter. Typically, TTL for active data ranges from 1 to 2 years, while archived data is retained for around 5 years.

Each company can set its own TTL based on business needs, but not setting it at all can lead to issues. If the TTL for active data is much longer than actually needed, the system starts to slow down. Even when querying only the last six months, the database may have to scan excessive volumes of information — especially if the data hasn’t been archived. In addition, storing outdated data for too long leads to unnecessary infrastructure costs.
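As a minimal sketch, assuming the active data lives in MongoDB (the connection string, database, and field names here are placeholders, not our actual schema), a retention window can be enforced with a TTL index so expired records are removed automatically:

```python
from pymongo import MongoClient, ASCENDING

# Placeholder connection and collection names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
readings = client["plant_analytics"]["sensor_readings"]

# MongoDB deletes documents automatically once "created_at" is older than
# expireAfterSeconds. Here, active data lives for roughly two years.
TWO_YEARS = 2 * 365 * 24 * 60 * 60
readings.create_index([("created_at", ASCENDING)], expireAfterSeconds=TWO_YEARS)
```

Other databases expose the same idea under different names (retention policies, default TTLs); the point is that the limit is set deliberately rather than left open-ended.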

Archival formats like Apache Avro or Protobuf allow data to be stored more compactly. Combined with compression algorithms, they help reduce storage costs. However, reading such archives requires more computing resources and processing time. That’s why it’s important to strike a balance between access speed, infrastructure costs, and storage volume.
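As a rough illustration of the archival side, writing records to Avro with a compression codec via the fastavro library might look like this (the schema, values, and file name are assumptions made up for the example):

```python
from fastavro import writer, parse_schema

# Illustrative schema for an archived sensor reading.
schema = parse_schema({
    "name": "SensorReading",
    "type": "record",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "vibration", "type": "float"},
    ],
})

records = [
    {"sensor_id": "pump-17", "timestamp": 1700000000, "vibration": 0.42},
    {"sensor_id": "pump-17", "timestamp": 1700000060, "vibration": 0.45},
]

# Binary Avro plus deflate compression keeps the archive compact,
# at the cost of extra CPU time whenever the archive is read back.
with open("readings_2023.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")
```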

Poor data quality and lack of standardization

In 2022, Unity Software — the company behind a widely-used video game engine — launched its own advertising system. This move came after Apple’s IDFA policy changes rendered traditional ad models less effective. Unity’s system relied on its own user interaction data rather than Apple’s, and while the idea seemed promising, it was built on poor-quality data. As a result, targeting was inaccurate and ad performance dropped. Clients began switching to competitors, leading to a $5 billion loss in market capitalization.

Often, businesses only become aware of data quality issues once they’ve already caused significant financial damage.

Issues with data types, timestamps, duplicates, or lack of standardization can distort analytics and lead AI models to make incorrect conclusions. To prevent this, it’s essential to establish a unified standard for storing and processing data during the design phase, clearly document it, and ensure alignment across all teams. Regular data quality checks are also crucial — this includes removing duplicates, standardizing formats, and monitoring data completeness.
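A simple automated check can catch many of these problems before they reach analytics or ML models. Here is a sketch using pandas on a hypothetical readings table (column names and sample values are illustrative):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return basic data quality metrics for a readings DataFrame."""
    # Normalize timestamps to a single format; unparsable values become NaT.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)

    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "bad_timestamps": int(df["timestamp"].isna().sum()),
        # Share of missing values per column, to monitor completeness.
        "missing_ratio": df.isna().mean().round(3).to_dict(),
    }

# Illustrative usage with made-up data.
df = pd.DataFrame({
    "sensor_id": ["pump-17", "pump-17", "fan-03"],
    "timestamp": ["2024-01-01 10:00", "2024-01-01 10:00", "not-a-date"],
    "temperature": [71.2, 71.2, None],
})
print(quality_report(df))
```

Running a report like this on every ingestion batch, and alerting when the numbers drift, turns data quality from a one-off audit into a routine check.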

The autoscaling trap during peak loads

Every company experiences periods of increased demand — holidays, seasonal promotions, or new product launches. We work with manufacturing companies, and during such peak periods, we observe a 5–10x increase in data volume and the number of requests. The infrastructure must be prepared to handle this level of load.

Using cloud autoscaling may seem like an obvious solution, but if you don’t set upper limits, the system can automatically add servers far beyond what’s actually needed. I witnessed a case where the number of servers jumped from the required 8 to 80 overnight. The company ended up with a bill for several hundred thousand dollars. While clients were happy with the performance, the business wasn’t — because the same results could have been achieved without the excessive cost.

To avoid this, companies need to not only configure scaling limits but also closely monitor what’s happening in real time. Tools like Datadog or AWS CloudWatch can help by alerting teams when thresholds are exceeded.
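As a rough sketch of what that configuration can look like on AWS with boto3 (the group name, thresholds, and SNS topic ARN are placeholders), you cap the Auto Scaling group and raise an alarm when it approaches the ceiling:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Hard upper limit so a traffic spike cannot scale the group past 16 instances.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="analytics-workers",   # placeholder group name
    MinSize=2,
    MaxSize=16,
)

# Alert the team when the group is running near its ceiling.
cloudwatch.put_metric_alarm(
    AlarmName="analytics-workers-near-max",
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "analytics-workers"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=14,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```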

Sometimes seasonal spikes can be managed not by scaling infrastructure, but through optimization — such as changing data formats, restructuring databases, or distributing the load across multiple servers. While this may require extra effort from architects and developers, in the long run, it proves to be a much more cost-effective solution.
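For example, distributing the load can be as simple as routing each data stream to a fixed server based on a hash of a stable key. The sketch below assumes an invented list of shard endpoints and uses the sensor ID as the key:

```python
import hashlib

# Placeholder list of shard endpoints; in practice this would come from config.
SHARDS = ["db-01.internal", "db-02.internal", "db-03.internal"]

def shard_for(sensor_id: str) -> str:
    """Map a sensor to the same shard every time, spreading load evenly."""
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("pump-17"))  # always routes pump-17's data to the same server
```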

Excessive costs caused by surplus data

Often, companies collect not only the metrics needed for decision-making but also technical user information that goes unanalyzed. According to research by Snowflake, only 6% of companies have achieved high data efficiency and are gaining real business benefits from it. And only 38% of companies use data as a basis for decision-making.

Smaller companies storing up to a million records may not notice issues from surplus data for a while. But when dealing with billions or trillions of records, the costs start to add up — spending on infrastructure increases, more developers are needed, system integration becomes more complex, and overall performance slows down.

For example, in industrial analytics, IIoT sensors transmit not only regularly updated equipment status parameters — like vibration levels or temperature — but also technical details that rarely change, such as the sensor’s firmware version or report type. Storing these technical details in every record can increase data volume by 10 to 20 times. That’s why we only save technical information when it changes. This approach saves storage space, reduces costs, and speeds up processing.
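Here is a sketch of that change-only approach (the field names are illustrative): rarely changing technical fields are written to storage only when their value differs from the last one seen for that sensor.

```python
# Cache of the last stored technical metadata per sensor.
last_metadata = {}

def split_record(sensor_id, record):
    """Separate fast-changing measurements from rarely changing metadata.

    Returns (measurements, metadata_or_None): measurements are always stored,
    metadata is returned only when it differs from the last stored value.
    """
    measurements = {k: record[k] for k in ("timestamp", "vibration", "temperature")}
    metadata = {k: record[k] for k in ("firmware_version", "report_type")}

    if last_metadata.get(sensor_id) == metadata:
        return measurements, None            # metadata unchanged, skip writing it
    last_metadata[sensor_id] = metadata
    return measurements, metadata            # metadata changed, store it once

reading = {
    "timestamp": 1700000000, "vibration": 0.42, "temperature": 71.2,
    "firmware_version": "2.4.1", "report_type": "full",
}
print(split_record("pump-17", reading))                               # metadata stored
print(split_record("pump-17", {**reading, "timestamp": 1700000060}))  # metadata skipped
```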

When working with data, it’s important from the very start to clearly define which information is essential for analytics and decision-making, and what can be either omitted or stored temporarily.

Choosing the wrong database

One of the most underestimated mistakes is using a single database as a one-size-fits-all solution. What works well for one case may be inefficient in another.

To explain in the context of production analytics: NoSQL databases like MongoDB work well when you need to display all data related to a specific piece of equipment on a single page — parameters, logs, notes, ML reports, and so on. But using a NoSQL database to build lists that need only a few fields from many different objects can cause issues. Document stores don't handle relationships between data the way relational databases do, so the system may have to load entire documents when only a small part of each is needed, which lowers performance and slows query processing.

Before designing your architecture, make sure the chosen database fits your data type and business needs. InfluxDB, TimescaleDB, or OpenTSDB are suitable for time series data; MongoDB, Cassandra, or DynamoDB work for flexible or semi-structured data; and MySQL, PostgreSQL, Oracle, or MS SQL are best for structured data. Avoid relying on just one database.
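As a hedged sketch of this split (connection details, table, and collection names are placeholders), high-volume time-series readings could go to TimescaleDB while flexible equipment documents go to MongoDB:

```python
import psycopg2
from pymongo import MongoClient

# Placeholder connections; real credentials would come from configuration.
timescale = psycopg2.connect("dbname=metrics user=analytics host=localhost")
mongo = MongoClient("mongodb://localhost:27017")["plant_analytics"]

def store_measurement(sensor_id, ts, vibration):
    """High-volume time-series readings go to a time-series database."""
    with timescale, timescale.cursor() as cur:
        cur.execute(
            "INSERT INTO readings (sensor_id, ts, vibration) "
            "VALUES (%s, to_timestamp(%s), %s)",
            (sensor_id, ts, vibration),
        )

def store_equipment_document(doc):
    """Flexible, nested equipment documents go to a document store."""
    mongo["equipment"].replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

The extra operational cost of running two databases is usually repaid by each workload hitting a store that was designed for it.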

Effective data management doesn’t start with data collection but with asking yourself: what do you need to record to make decisions, which data is necessary, and how will you store and process it?
