Ari Yacobi, Chief Data Scientist at Knowledgent, was recently interviewed on the topic of our Intelligent Trial Planning platform built on AWS services. He discusses the data implications, data science models, and AWS components of the platform, among other topics.
Read the full interview below:
Q: Can you tell us what Knowledgent does with respect to AWS?
Knowledgent is a data intelligence consulting firm. Our primary clients are healthcare and life sciences companies. From an engineering standpoint, we move them into AWS environments, and go all the way through building machine learning and AI-based applications for them.
Q: What is the application you’re going to discuss with us?
The application we’re going to discuss today is our Intelligent Trial Planning application, better known as ITP. It uses machine learning and AI to predict what the feasibility of a clinical trial will be, and what the recruitment timeline would look like.
Q: ITP must be really important for that industry because clinical trials can take a long time, or be very expensive to run, and getting some visibility into that would be valuable, is that correct?
Absolutely. It’s not just important; the process of getting there is also very time-consuming. They usually have a team of statisticians and researchers work through each clinical trial protocol to determine what the impact would be and how they would recruit in different countries based on that. It’s a very lengthy process, even just arriving at what those timelines would look like. We have re-imagined it using the machine learning technology that’s now available to us.
Q: Where do we start?
Let’s start with the data. We are looking at internal and external data sources for a pharmaceutical company. Internal data sources include historical trial data from their past trials and their clinical trial management system data. Externally, we’re looking at clinicaltrials.gov, which is a public data source, Thomson Reuters cancer health, which is an incidence and prevalence data source, and other commercial data sources that are out there. We bring the internal and external data sources into S3, where the raw data sits.
This application is relatively new, and we’re still on version 1.0 in production. As we built it out, we wanted to leverage the latest and greatest tools and technologies available through Amazon, so we used Glue for ETL: data cleansing, aggregation, and integration, as we brought in hundreds of features and variables. Once the data was cleansed, it went back into S3 as a formatted data set, ready for the data scientists to work with.
Q: How do the data scientists get access to that data, now that it’s been formatted?
Once the data is there, we give them the ability to query it using Athena. It’s serverless, and it’s a nice, easy way for them to quickly get a sense of the makeup of the data, run queries on it, and visualize it as they work out what the features should be. Once they have a sense of the data, we use Glue again for feature extraction.
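As a rough illustration of that query step (an editorial sketch, not code from the ITP platform), the function below submits a SQL query to Athena and waits for it to finish. It uses the boto3 Athena operations `start_query_execution` and `get_query_execution`; the database name, table name, and S3 output location in the usage comment are hypothetical placeholders.

```python
import time

# States after which Athena will not change the query status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and block until Athena reports a terminal state.

    `client` is expected to behave like boto3.client("athena").
    Returns (query_execution_id, final_state).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)


# Hypothetical usage (all names below are assumptions):
#   import boto3
#   athena = boto3.client("athena")
#   run_athena_query(
#       athena,
#       "SELECT phase, COUNT(*) FROM trials GROUP BY phase",
#       database="itp_formatted",
#       output_location="s3://example-bucket/athena-results/",
#   )
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before pointing it at a real AWS account.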
Q: What’s the difference between the earlier use of Glue and this use of Glue?
The earlier use of Glue is where the data cleansing is done, and this second use is where we’re doing feature extraction from the formatted data set.
Q: What is feature extraction?
I’ll give you an example. One of the important things we look at in clinical trials is whether or not a patient is biologic-naïve, and that translates into Feature 01. A biologic-naïve patient is one who has never been exposed to a treatment for the indication they’re suffering from. Next, the features go into a feature repo, which is in S3. Essentially, we make those features available for a distributed team of data scientists to try different machine learning models on.
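To make the biologic-naïve example concrete, here is a minimal editorial sketch in plain Python rather than the Glue job the interview describes. The record layout (`PriorTreatment` with `indication` and `is_biologic` fields) and the feature name are hypothetical, and the sketch assumes "treatment" means a biologic therapy for the trial’s indication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorTreatment:
    """One entry in a patient's treatment history (hypothetical schema)."""
    indication: str
    is_biologic: bool


def is_biologic_naive(history, indication):
    """Return True if no biologic treatment for `indication` is on record."""
    return not any(
        t.is_biologic and t.indication == indication for t in history
    )


def extract_features(history, indication):
    """Turn a treatment history into a feature row (one feature shown)."""
    return {
        "feature_01_biologic_naive": int(is_biologic_naive(history, indication)),
    }


# Example: a patient treated with a biologic for a different indication
# still counts as biologic-naive for the indication under study.
history = [
    PriorTreatment(indication="psoriasis", is_biologic=True),
    PriorTreatment(indication="crohns", is_biologic=False),
]
row = extract_features(history, indication="crohns")
```

Encoding the flag as 0 or 1 keeps the feature directly usable by most machine learning libraries.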
Q: What do you use EMR for?
EMR is where the magic happens. This is where the machine learning is done. We use a Jupyter interface with Python, and EMR is where the machine learning jobs are run. You can refer to this S3 location as Production Model Artifacts. The features and the production version feed into EMR for machine learning. The resulting predictions are made available to researchers, or anyone else using the application, through a web-based interface. There, they can specify their protocol definition, run it, and see the output. We’re using DynamoDB as our data store for all of that data. With that said, it’s always evolving. We’re looking at things to implement to make it even better.
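As an editorial sketch of the storage step, the code below shapes a prediction as a DynamoDB item and writes it with the boto3 `Table.put_item` call. It assumes predictions are among the data kept in DynamoDB; the table name, key, and attribute names are hypothetical. Numeric values are converted to `Decimal` because boto3’s DynamoDB resource API does not accept Python floats.

```python
from decimal import Decimal


def build_prediction_item(protocol_id, feasibility_score, recruitment_months):
    """Shape one prediction as a DynamoDB item (hypothetical attribute names)."""
    return {
        "protocol_id": protocol_id,
        # Going through str() avoids binary floating-point noise in the Decimal.
        "feasibility_score": Decimal(str(feasibility_score)),
        "recruitment_months": Decimal(str(recruitment_months)),
    }


def store_prediction(table, item):
    """Write the item; `table` should behave like a boto3 DynamoDB Table."""
    table.put_item(Item=item)


# Hypothetical usage (table name is an assumption):
#   import boto3
#   table = boto3.resource("dynamodb").Table("itp_predictions")
#   store_prediction(table, build_prediction_item("PROTO-001", 0.82, 14.5))
```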
Q: What would some of those ideas be? What would you like to do next?
One thing we’re experimenting with right now is bringing in SageMaker and leveraging it for machine learning, so that the platform runs completely end-to-end on AWS.