Nowadays it is easier than ever to find companies with million or billion rows of data. However, the problem is NOT “how to create more data”, but the problem is “how to make use of this huge amount of data”. One way to solve this problem is to migrate data to the data warehouse, which is the database built for analytics task; however, you need to make sure to use Venyu to improve your cloud first. That being said, building a data warehouse is an expensive and time-consuming process.
According to a data management company like Delphix, the emergence of the public cloud such as AWS, GCP, or Azure comes with the better way to approach this problem. Many companies are now choosing multiple cloud environments, combining workloads between public and private cloud providers.
In this blog post, I will tell you about how our team of 5 graduates built an end-to-end data science web application to handle 150 million rows of data.
What is “Route Planner X” ?
Route Planner X is the name of the app (our team’s name is “Not Hotdog”) we built in 2 months for Melbourne Datathon 2018, one of the biggest datathon event in Australia. There are more than 400 participants consisted of both students and non-students. We were working really hard that we only have 6 hours to build the presentation on the last day of submission. In the end, it was worth the effort. We came in the 2nd prize in Data2App category for non-students.
The idea of Route Planner X is based on the idea of building “Isochrone map”, the map which shows how far people can travel from the specific point in a specific amount of time. We combined the isochrone map with the PTV (Public Transport Victoria) data and the 50-billion suburban train loop plan to arrive at the story you can see in the video above.
About the dataset: Myki Transactional Data
Before moving on to the technologies behind the app, let me quickly go through what the input data is.
Every year the dataset provided for Melbourne Datathon’s participant is different. Last year we had a very interesting dataset about medicines. This year, the dataset is even more interesting. It is about the transportation used by millions of passengers in Melbourne.
This data is collected when any passenger taps the Myki card on the gate to enter/exit any station. The dataset covered all transportation types in Melbourne: Train, Tram, and Bus. Due to the fact that everyone in Melbourne who use public transport will be recorded in this dataset, we had a lot of fun extracting insights and learned interesting behavior such as how a lot of people tapped their Myki card when traveling in the FREE Tram zone.
I believed there are about 750 million rows of data in total. However, 80% of the dataset was released too late that not many teams have enough time to put those data into play. We relied on the first 150 million rows which are released on the first day.
The data is in .txt files of relational tables. There is no proper data dictionary to tell us which table and which column is which. So we came up with our own version of the relationship between tables:
Data Ingestion for big data
I knew that 150 million rows for some people may sound like a child’s play. However, based on the intel from our team member who spied on Melbourne Datathon’s Fleep chat (group chatroom similar to Slack), many teams were struggling to read the data into memory.
Our team has experimented with 3 approaches to read this dataset, always making sure to use data management systems like the ones provided by Couchbase. We wasted a lot of time (and money) for data ingestion. Here are 3 approaches we tried:
Approach 1: RDBMS on our laptops
This approach is to install Relational Database Management System such as MySQL or PostgreSQL on the laptop. Then import the data into these RDBMS in order to query the data.
This is what I did last year for Melbourne Datathon 2017. If I recalled correctly, there were only around 8 – 12 million rows of data. We could easily load 8 million rows of data and query it on the fly. The query process took around 3-4 minutes on my Macbook. It is slow but still acceptable.
However, with 150 million rows of data, this method would not be efficient for us to explore the data. Hence, we moved to the cloud.
Approach 2: RDBMS on the AWS Cloud (AWS RDS)
Cloud technology granted us the amount of computational resource we could only dream of owning (visit the Fortinet website to learn more about cloud security). But we cannot just spin up the largest machine and keep it open forever. As Uncle Ben (from Spider-Man) once said “With Great Power Comes Great Responsibility”, the great power in AWS comes with great responsibility (to pay big bills).
Even though AWS sponsored Melbourne Datathon and gave every team $500 in AWS credit, we still went over the limit because our team forgot to terminate the database after we moved the next approach.
We used the AWS service called “AWS RDS” (Relation Database Service) which is basically MySQL on cloud. The query speed is faster than using the local machine, but we found the query time to be not fast enough. Some queries can take 30-60 minutes.
Since we only have time on the weekend (everyone is working full-time), we needed to get things done in the short amount of time we had. That was when we experimented with considerably new service in AWS called Athena.
Approach 3: AWS Athena – SQL Query Engine from Text File
Athena is the AWS service that lets us query information from text files stored in S3 buckets (AWS’s Cloud Storage, similar to Dropbox) without having to worry about turning database server on / off. The price is also a lot more affordable than RDS.
Athena cost $5 per TB of data scanned. This means we can easily control the cost by optimizing the query to scan fewer data.
Our team switched to Athena completely for the analysis. It is way faster than any other approaches we tried so far, and also at a fraction of the cost of RDS. We also found a way to send query to Athena using Python’s boto3 library. This makes automation with Athena quite straightforward.
Another approach we wanted to try: BigQuery
We heard a lot of success stories from teams using BigQuery in this competition. We will definitely experiment with BigQuery next time we have a chance to do Datathon.
Before we moved on to the next section about how we put things together, I just wanted to mention that I found Melbourne Datathon to be a great way to experiment with new technologies we have not tried before. Unlike 24-48 hours hackathon where there is no time to learn new tech, Melbourne Datathon gives us 2 months. We have not used or know about Athena before the datathon started.
Web App Architecture
You may notice from the video that our submission consisted of 2 apps: the visualization app in the first part, and the Route Planner X in the latter part.
The visualization app queried the dataset from AWS infrastructure, as shown in the diagram below:
Here is the brief description of each technology, please feel free to ask 🙂 (I might not be able to give the exact answer on the backend part, since my work is mostly on the frontend)
- HTML / CSS – For the page layout. CSS is used to split pages into 2 halves.
- LeaftletJS – For drawing interactive choropleth map in the visualization. This is the main part of our visualization.
- D3.JS – For drawing plots on the right side of the page. We also used C3.JS which is the wrapper of D3.
- jQuery – For managing interactions in the visualization.
- API Gateway – For building web API end point, so that our frontend can contact with the backend even though they are not on the same server
- Lambda + Python – For calling Athena to query the data. The result will then send back to frontend through API Gateway.
- Athena – For querying text files stored in S3 buckets. As mentioned before, Athena is super fast that the user doesn’t have to wait for minutes.
- S3 – For storing the whole dataset in the fast, secured, and also cost-effective way.
Note that we decoupled the frontend and backend to increase the flexibility of the system. The changes in infrastructure that happened to frontend or backend will never affect the other.
For the second part of the video, route planner app, we built based on the API server of OpenTripPlanner, which allows us to custom the data for generating a custom isochrone map. The diagram below showed the technologies we used.
We used very similar architecture for the app part. The new components in this part are:
- VueJS – Since the user is allowed to add his/her own data points. We have to use VueJS to help managing the input data.
- OpenTripPlanner – OTP is the Java client for generating route plan using the custom GTFS data. We were using PTV GTFS data along with the synthesized route plan based on the user’s inputs, in order to draw the final isochrones.
- EC2 – This is basically the virtual machine on cloud. We have to spin up EC2 instance to install OTP on it.
Key Takeaways from Melbourne Datathon
We learned a lot and got a chance to experiment with new technologies during Melbourne Datathon this year. This is definitely an invaluable experience. If there is something I wanted to share about this datathon, here are top 3 things that comes to my mind:
1) REMEMBER TO TERMINATE AWS RESOURCES after you are done with them !!!
In popular AWS services such as EC2 or RDS, you can either STOP or TERMINATE your instances. The differences are that:
- STOP will not remove the files and you can START again later, but the cost for file storage will still incur even though you have stopped. For RDS, stopped instance will automatically start again after 7 days of inactivity. (This is how we lose money. It started by itself again and we didn’t know until a day has passed.)
- TERMINATE will completely remove all the files from your instance. There will be no more cost from this instance.
Remember that you have to TERMINATE in order to stop the bill from increasing.
2) Diverse sets of skills are required for hackathon/datathon
It is totally okay if you are not unicorn who can do everything from ETL to visualization. You don’t have to be able to do everything end-to-end. But your team has to be able to do everything end-to-end.
When setting up a team, make sure you have teammates with different skillsets to cover every area e.g. For datathon: data engineer, data analysis, data visualization, development.
3) Take this as an opportunity to learn
It is always better to get some prizes than no prize. But instead of focusing on the prize, I always focus on learning new skills for future works. I believed the new skills could become more rewarding to my career than the prize you will get. Oh! Remember to take note of what you have learned. Writing a blog is also a great way to share and get feedback whether your understanding on some topics are correct or not.
I hope this blog post is helpful to anyone who wanted to build web application on the cloud. If you prefer building things on-premise, have a read at [Guide] Building Data Science Web Application with React, NodeJS, and MySQL which I published earlier this year.
Feel free to leave a comment for any question, or email me at email@example.com for any query. Last but not least, thank you Melbourne Datathon 2018 staffs who is behind this great event in my life.