As learning by doing is at the heart of BeCode’s approach, we are constantly on the lookout for real-life projects for our juniors to practice on and learn from. To put the knowledge of our Brussels AI juniors to the test, we teamed up with KPMG. Their challenge? Automating documents using NLP techniques.
Curious about what the project entailed and how our juniors have experienced it? Ankita will tell you all about it in the interview below.
Can you tell me a bit more about yourself?
“I’m from India and moved to Belgium four years ago. Before coming to Belgium, I worked in a bank for over eight years. Once I arrived here, I wanted to launch myself in a new career. BeCode seemed like the perfect opportunity to dig deeper into my interest in IT as I have a bachelor’s degree in computer science. More specifically, this seemed like a unique occasion to broaden my knowledge of the data science field, which is a very hot field right now. I decided to give it a try and the journey has been great so far.”
What convinced you to start a training at BeCode and not at a similar training center?
“Most of the other trainings that I stumbled upon were given in French or Dutch, but this training is given in English. Also, for the juniors, this course is free. Both factors made me want to apply and I got selected.”
How far along are you in the training right now?
“The end of the training is approaching. We only have ten days left.”
What have you learned so far?
“A lot. This journey has been tremendous. Before starting the training, I had some experience with Python and other tools that are frequently used in the field, but I used to take my time to learn new things as I didn’t feel rushed. But during the training, we’ve studied so many topics in such a short span of time. In these seven months I’ve learned so much more than I would’ve been able to learn by myself.”
“We have learned how to code in Python, but have also focused on data visualization tools such as Power BI, as well as on machine learning, deep learning, computer vision and NLP.”
As part of the AI-bootcamp, you’ve been working on a use case, provided by KPMG. Can you tell me a bit more about that use case?
“One of KPMG’s clients approached them for help in dealing with their data, more specifically CLAs (Central Government Licenses). Previously, they had to look manually for new CLAs on the government’s website, which is very time-consuming. The goal of this use case was to automate this process. They therefore first wanted us to create a system that would notify them when a new CLA had been uploaded, which would in turn allow them to update their information much faster and more easily. In a second phase, we had to work on a system that stores all of these CLAs and lets you easily search for CLAs on a specific topic.”
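The notification step Ankita describes comes down to comparing the documents already known with those currently listed on the website. The snippet below is a minimal sketch of that idea, not the team's actual code: the function name `find_new_documents` and the example links are hypothetical, and fetching the listing from the government's website is left out.

```python
def find_new_documents(known_links, listed_links):
    """Return links that are listed now but were not seen before, in listing order."""
    known = set(known_links)
    new_links = []
    for link in listed_links:
        if link not in known:
            new_links.append(link)
            known.add(link)  # also guards against duplicates in the listing
    return new_links


# Hypothetical example data; real links would come from the government's website.
known = [
    "https://example.org/cla/1.pdf",
    "https://example.org/cla/2.pdf",
]
listed = [
    "https://example.org/cla/1.pdf",
    "https://example.org/cla/2.pdf",
    "https://example.org/cla/3.pdf",
]

for link in find_new_documents(known, listed):
    print(f"New CLA found: {link}")
```

In a setup like this, the returned links would be the trigger for sending a notification and for adding the new documents to the database.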
What was the timeline of the project?
“It was the first time that we worked on a project for three weeks. We’ve had one project that lasted two weeks, but all other projects were just a week.”
What were the different phases of the project?
“We started by brainstorming about how we could tackle this challenge and what techniques we would need to do so. Before we dived into the project, KPMG warned us that we would need to use OCR (Optical Character Recognition) techniques for the model to be able to read and understand the PDFs. In our search for such tools, we stumbled upon text extraction tools that were new to us, namely Tesseract and Pytesseract.”
“Some PDFs were written in Dutch, some in French, and others contained both languages, which made it difficult to develop a database. We therefore decided to only use the French texts.”
“The PDFs were also structured differently. Sometimes there was only one column, while other documents had two. Sometimes the document consisted of one page, sometimes of several pages. We therefore had to put in place a mechanism that could detect how the document was structured.”
“As you can see, we faced several challenges during the text extraction phase, but we tackled them one by one. In the end, we developed a good system that could extract the text very accurately.”
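One way to detect whether a page has one or two columns is to look at where the recognized words sit horizontally. The sketch below is an assumption about how such a mechanism could work, not the team's implementation: `count_columns`, `gutter_ratio` and the 5% threshold are hypothetical choices. The word boxes are taken to be (left, width) pairs in pixels, such as those that can be read from the `left` and `width` fields returned by pytesseract's `image_to_data`.

```python
def count_columns(word_boxes, page_width, gutter_ratio=0.04):
    """Guess whether a page has 1 or 2 text columns.

    word_boxes: list of (left, width) tuples in pixels, one per recognized word.
    page_width: width of the page image in pixels.
    gutter_ratio: assumed width of the empty strip between two columns,
                  as a fraction of the page width.
    """
    if not word_boxes:
        return 1

    center = page_width / 2
    half_gutter = page_width * gutter_ratio / 2
    left_side = right_side = crossing = 0

    for left, width in word_boxes:
        right = left + width
        if right <= center - half_gutter:
            left_side += 1       # word lies fully in the left half
        elif left >= center + half_gutter:
            right_side += 1      # word lies fully in the right half
        else:
            crossing += 1        # word touches the central strip

    # Two columns: text on both sides and (almost) nothing in the middle strip.
    if left_side > 0 and right_side > 0 and crossing / len(word_boxes) < 0.05:
        return 2
    return 1


# Hypothetical word boxes on a page that is 1000 pixels wide.
single_column = [(left, 80) for left in range(50, 950, 100)]
two_columns = [(left, 80) for left in (50, 150, 250, 350, 550, 650, 750, 850)]
```

A heuristic like this only separates one column from two; pages with tables, headers that span both columns, or more than two columns would need extra handling.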
What happened after the text extraction?
“We provided KPMG with an end-to-end solution. Firstly, we created a database using SQL libraries such as SQLite. In this database, the different documents, together with their link to the government’s website, were stored. We also built a user interface that would allow the user to look for specific information. Lastly, we deployed our solution using Heroku and Streamlit.”
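To give an idea of what storing documents with their links in SQLite and searching them can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the function names and the example documents are assumptions made for illustration; the team's real schema is not described in the interview.

```python
import sqlite3


def create_database(path=":memory:"):
    """Open (or create) the database and make sure the documents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            link TEXT UNIQUE NOT NULL,   -- link to the original PDF
            content TEXT NOT NULL        -- text extracted from the PDF
        )
        """
    )
    return conn


def add_document(conn, link, content):
    """Store a document; a link that is already present is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO documents (link, content) VALUES (?, ?)",
        (link, content),
    )
    conn.commit()


def search_documents(conn, keyword):
    """Return the links of all documents whose text contains the keyword."""
    rows = conn.execute(
        "SELECT link FROM documents WHERE content LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


# Hypothetical example documents.
conn = create_database()
add_document(conn, "https://example.org/cla/1.pdf", "Accord sur le salaire minimum")
add_document(conn, "https://example.org/cla/2.pdf", "Accord sur la prime de fin d'année")
results = search_documents(conn, "salaire")
```

A plain `LIKE` search is the simplest option; for larger collections, SQLite's full-text search extension would likely be a better fit.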
You had to work in groups on this use case. How did the collaboration go?
“We were a team of five and we touched base regularly. We had a morning call and an evening call every day, but when we faced problems during the day, we didn’t hesitate to reach out to each other.”
“We distributed the tasks. Everyone knew exactly what to do. Orhan built a very accurate database. Opaps, our project manager, was brilliant. He made sure that everybody delivered their part on time. Together with Adam, I worked on text extraction and text structuring. Later on, I also deployed the model. Lastly, there was Dilara. She focused on creating the user interface and preparing our presentation.”
Soft skills, such as working in a group, are vital in such projects. What other soft skills, that you’ve learned during your BeCode training, have helped you during this project?
“Being able to compromise. Sometimes we disagreed, but at some point you need to be able to put aside these differences. What matters is that you deliver a good project.”
How did you deal with these disagreements?
“We consulted each team member, but the final decision was made by the project manager. You cannot stay stuck on one point of discussion; you need to move forward.”
You’ve used tools such as Heroku to deploy the model. What other tools or technologies did you use during this project?
“For text extraction, we used Tesseract and Pytesseract. There were a few other libraries that we had never encountered before, such as SQLite, which we used to build the database.”
“When looking in the database for documents regarding a specific topic, you might not always get all of the results as documents can contain spelling mistakes. Orhan found out about a Google API that could correct those spelling mistakes, which made our system much more efficient.”
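The interview does not say which Google API was used, so the snippet below does not try to reproduce it. As a stand-in, it shows the same idea with Python's standard `difflib` module: a misspelled search term is mapped to the closest word in a known vocabulary before the search runs. The function name `correct_query`, the cutoff and the vocabulary are hypothetical.

```python
import difflib


def correct_query(word, vocabulary, cutoff=0.75):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical vocabulary, for example built from the words in the stored documents.
vocabulary = ["salaire", "prime", "pension"]
corrected = correct_query("salarie", vocabulary)
```

The corrected term can then be passed to the search instead of the raw input, so that a typo no longer hides relevant documents.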
Next to discovering some new tools, did you acquire any new skills due to this use case?
“Some NLP techniques, which we had hoped to use to build a question-answering system, but we ended up creating a search engine. I’m sure that if we had had more time, we would’ve been able to deliver a better end result.”
You have faced challenges during the text extraction process and have experienced some disagreements in the group. Did you face any other challenges?
“Time management. As we were also looking for an internship at that time, some group members were not always available to work on their part of the project. Besides, when we intended to meet for a couple of minutes, the meeting sometimes went on for several hours. We lost a lot of time this way, but in the end it turned out fine.”
You had to present the final result to KPMG. How did the presentations go?
“They were happy with the final result, but I do have the feeling that they expected more than the search engine that we built. The question-answering system that we first wanted to create would probably have been more helpful to them, but we weren’t able to finish it.”
What feedback did you receive?
“They had expectations and we somewhat fulfilled them. They would’ve liked a solution that focused more on NLP, which I can understand. But overall, they were very happy with the final result. They were really impressed with the Google API that we used to correct spelling mistakes and really loved our user interface.”
How did you experience the project in general?
“It was a great learning experience. Even when we worked overtime, nobody felt like they were doing something extra.”
What, according to you, is the added value of integrating use cases in the AI-bootcamp?
“When you read the theory, you may think that you actually understand the concepts. But it’s when working on such projects that you actually learn how to put your knowledge into practice. You learn much more than just theoretical knowledge. If, for example, KPMG hadn’t provided us with this use case, we wouldn’t have had concrete experience with OCR techniques and building a database from scratch. To me, it’s therefore extremely important to integrate such projects into the training.”
What would your advice be for someone that is about to start with a similar project?
“Make sure you know the client’s exact requirements before starting the project. In our case there was a mismatch: we only found out in the second week what they truly expected from us. As a result, we lost a lot of time on things that were not needed or meaningful to them.”
The training is coming to an end. What are your plans for after the training?
“Finding an internship. I’m having interviews every day, but I haven’t found one yet.”
What kind of internship are you looking for?
“A position as a data analyst or a data scientist. If I cannot find an internship, a job would do too.”
Do you feel ready for your new career?
“I feel confident enough to pursue a career in the field. There are gaps, I agree, but if I focus more on closing them, I’ll be able to move forward.”
Are you interested in following the AI Bootcamp?
Amazing! We have new classes starting in Brussels, Antwerp, Ghent and Liège.