Getting the most out of your data lake
As engineers, a big part of our job is to ensure that our users are getting real value out of the products and systems that we’re building. At MarketFinance, we’re working hard to ensure that our business is data-driven; that is to say, people know what data they can use to solve problems and, as per our company value, #MakeItHappen.
The “build it and they will come” approach may work when it comes to getting animals onto an ark, but the same can’t be said for data platforms or data lakes! It’s simply not enough to build them well; we must also give our users the knowledge and tools to use them effectively. As Jeff Atwood (co-founder of Stack Overflow) once said:
“Writing code? That’s the easy part. Getting your application in the hands of users, and creating applications that people actually want to use – now that’s the hard stuff.”
Over the last 12 months, our data engineering team has been focussed on making our data lake more robust and reliable. We’ve also learnt a few things along the way which have been instrumental in ensuring the business is getting real value from it.
A “data lake”, in its simplest form, is a repository of data stored in its natural/raw format. This architecture has in recent years become increasingly popular due to its flexibility and scalability in contrast to traditional relational databases where data is transformed or structured prior to ingestion. The “data platform” refers to the general infrastructure which enables us to build systems.
A more detailed explanation of our approach is available from Jitse-Jan van Waterschoot, Head of Data Engineering at MarketFinance.
For most companies, data is the lifeblood; we all need it to know what’s going on and to help us do our jobs. How much data we need, how often we need it and how we access it, however, will change from person to person or team to team. That’s why we provide multiple ways for our users to access it. Namely:
- Looker for operational reporting and dashboarding of company KPIs
- Internal APIs (using GraphQL) for integration with internal and third-party systems
- AWS Athena for general SQL queries
- Jupyter Notebooks for data science model development
One of our key principles in building out our data lake is consistency. Simple questions like “what do I name this database”, “what compression should I use” or “how should I format my code” may seem trivial, but your users and colleagues will thank you in the long run if you define a standard, document it and stick to it. Using a data indexer and data catalogue tool such as AWS Glue is a big help in automating this, and means that data can be queried and joined in the same way regardless of its format (e.g. CSV, JSON, Parquet). We use the Python formatter ‘Black’ to automate all code formatting.
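A documented standard is easiest to stick to when it can be checked automatically. Here is a minimal sketch of such a check. The convention itself (a layer prefix, snake_case names, a short list of allowed compression codecs) is purely an assumption for illustration and not MarketFinance’s actual standard; `check_dataset` and the constants are hypothetical names.

```python
import re

# Hypothetical standard: "<layer>_<snake_case_name>", e.g. "raw_click_stream".
ALLOWED_LAYERS = ("raw", "clean", "reporting")
ALLOWED_COMPRESSION = {"snappy", "gzip"}
NAME_PATTERN = re.compile(
    r"^(?P<layer>[a-z]+)_(?P<rest>[a-z][a-z0-9]*(?:_[a-z0-9]+)*)$"
)


def check_dataset(name: str, compression: str) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    match = NAME_PATTERN.match(name)
    if match is None:
        problems.append(f"'{name}' is not lowercase snake_case with a layer prefix")
    elif match.group("layer") not in ALLOWED_LAYERS:
        problems.append(f"'{name}' uses unknown layer '{match.group('layer')}'")
    if compression not in ALLOWED_COMPRESSION:
        problems.append(f"compression '{compression}' is not in the standard")
    return problems


if __name__ == "__main__":
    for problem in check_dataset("tmp_clicks", "zip"):
        print(problem)
```

A check like this could run in CI next to the code formatter, so deviations from the standard are caught before they reach the data lake.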
Logging is another simple thing that, when done well, becomes a game-changer. We all write code with bugs (yes, even you), but identifying and fixing them is much easier when you and your team are logging exceptions in a similar way. When we run any code on the data platform, we log it to an Elasticsearch cluster for easy searchability. This has also enabled us to ensure that any big problem triggers a PagerDuty incident, which is assigned to someone in our team to fix immediately.
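To illustrate what “logging exceptions in a similar way” can look like, here is a minimal sketch using only Python’s standard `logging` module. It writes one JSON object per line, a shape that is easy to ship to a search cluster. The field names, the `job` attribute and the `run_job` wrapper are assumptions made for this example, not a description of our actual setup.

```python
import json
import logging
import traceback


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "job" is a hypothetical field supplied through `extra=`.
            "job": getattr(record, "job", None),
        }
        if record.exc_info:
            exc_type, exc_value, exc_tb = record.exc_info
            payload["exception_type"] = exc_type.__name__
            payload["traceback"] = "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            )
        return json.dumps(payload)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that emits JSON lines; handlers are attached only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def run_job(job_name, func):
    """Run a job; log any exception in the shared format, then re-raise it."""
    logger = get_logger("data_platform")
    try:
        return func()
    except Exception:
        logger.exception("job failed", extra={"job": job_name})
        raise
```

Because every job goes through the same wrapper, every failure carries the same fields, which makes searching and alerting on them far simpler.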
As Peter Parker’s Uncle Ben once said:
“With great amounts of data come painfully slow queries!”
Okay, that’s not the real quote but you get the point. More data means more waiting around while your jobs or queries scan all that data. Let’s say for example you’re storing 10 years of click-stream data, but only the most recent month’s worth is actually being used for analysis. You could hugely improve your query performance if you:
- Partition your data by the time at which it’s generated. This allows a query to determine which partitions contain relevant data and skip the rest, avoiding unnecessary full-table scans.
- Create a “time to live” policy which archives all data older than a few months. If you’re on AWS this can easily be achieved with S3 Lifecycle Policies. This would also have the added bonus of saving you money as you’re not paying for the additional cloud storage.
There’s no “one size fits all” solution when it comes to query optimization, so it’s imperative that you understand what your users are trying to do!
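The two ideas in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our production code: the Hive-style `year=/month=/day=` layout, the table name `clicks`, the 90-day threshold and the `GLACIER` storage class are all hypothetical choices. The rule dictionary follows the shape that boto3’s `put_bucket_lifecycle_configuration` accepts inside `LifecycleConfiguration={"Rules": [...]}`, but nothing here actually calls AWS.

```python
from datetime import date, timedelta
from typing import List


def partition_prefix(table: str, day: date) -> str:
    """Build a Hive-style partition path for one day of data."""
    return f"{table}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(table: str, start: date, end: date) -> List[str]:
    """List only the partitions a query over [start, end] has to read."""
    days = (end - start).days
    return [partition_prefix(table, start + timedelta(days=i)) for i in range(days + 1)]


def archive_rule(prefix: str, after_days: int) -> dict:
    """Describe a lifecycle rule that moves old objects to archive storage."""
    return {
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Assumption: archive to Glacier rather than delete outright.
        "Transitions": [{"Days": after_days, "StorageClass": "GLACIER"}],
    }


if __name__ == "__main__":
    # A one-week query touches 7 daily partitions instead of 10 years of data.
    week = prefixes_for_range("clicks", date(2020, 3, 1), date(2020, 3, 7))
    print(len(week), "partitions to scan")
    print(archive_rule("clicks/", after_days=90))
```

With ten years of daily partitions, a query restricted to the most recent month reads roughly 30 prefixes out of more than 3,600, which is where the performance gain comes from.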
As engineers or developers, it’s very easy to silo yourself away and come up with a solution to a problem that makes sense to you. In most cases, though, this is not what your users want or need. That’s why at MarketFinance our teams are structured into Agile squads and tribes. This way, we work closely with our users and stakeholders to help them define a problem statement, list clear requirements, and then work through a solution together.
Finally, and perhaps most importantly, share what you’ve shipped! Update your documentation, post on Slack, hold a meeting and show your stakeholders what you’ve been doing. However you do it, make sure that your users know what you’ve been working on. After all, what’s the point of throwing a great party if you don’t invite anyone along?