Top 20 Latest Research Problems in Big Data and Data Science

Even though big data has been part of mainstream operations as of 2020, there are still open issues and challenges that researchers can address. Some of these issues overlap with the data science field. In this article, I cover the top 20 latest research problems at the intersection of big data and data science, based on my personal experience (with due respect to the intellectual property of my organizations) and the latest trends in these domains [1,2]. These problems are covered under 5 categories, namely:

Core Big data area to handle the scale

Handling Noise and Uncertainty in the data

Security and Privacy aspects

Data Engineering

Intersection of Big data and Data science

The article also covers a research methodology to solve specified problems and top research labs to follow which are working in these areas.

I encourage researchers to solve applied research problems, which will have more impact on society at large. The reason to stress this point is that we are analyzing hardly 1% of the available data, while we generate terabytes of data every day. These problems are not specific to one domain and can be applied across domains.

Let me first introduce the 8 V’s of big data (based on an interesting article from Elena), namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality. If we look closely at the questions on the individual V’s in Fig 1, they raise interesting points for researchers. Even though they are business questions, there are underlying research problems. For instance, 02-Value: “Can you find it when you most need it?” translates into analyzing the available data and giving context-sensitive answers when needed.

Having understood the 8 V’s of big data, let us look into the details of the research problems to be addressed. General big data research topics [3] fall along these lines:

  • Scalability — Scalable Architectures for parallel data processing
  • Real-time big data analytics — Stream data processing of text, image, and video
  • Cloud Computing Platforms for Big Data Adoption and Analytics — Reducing the cost of complex analytics in the cloud
  • Security and Privacy issues
  • Efficient storage and transfer
  • How to efficiently model uncertainty
  • Graph databases
  • Quantum computing for Big Data Analytics

Next, let me cover some specific research problems across the five categories listed above.

The problems related to core big data area of handling the scale:-

1. Scalable architectures for parallel data processing:

Hadoop- or Spark-style environments are used for offline or online processing of data. The industry is looking for scalable architectures to carry out parallel processing of big data. There has been a lot of progress in recent years; however, there is still huge potential to improve performance.

2. Handling real-time video analytics in a distributed cloud:

With increased access to the internet, even in developing countries, video has become a common medium of data exchange. Telecom infrastructure, operators, Internet of Things (IoT) deployments, and CCTV all play a role here. Can the existing systems be enhanced with lower latency and higher accuracy? Once real-time video data is available, the questions are how the data can be transferred to the cloud and how it can be processed efficiently both at the edge and in a distributed cloud.

3. Efficient graph processing at scale:

Social media analytics is one such area that demands efficient graph processing. The role of graph databases in big data analytics is covered extensively in the reference article [4]. Handling efficient graph processing at a large scale is still a fascinating problem to work on.

The research problems to handle noise and uncertainty in the data:-

4. Identify fake news in near real-time:

Handling fake news in real time and at scale is a pressing issue, as fake news spreads like a virus, in bursts. The data may come from Twitter, fake URLs, or WhatsApp. Sometimes it may appear to come from an authenticated source and still be fake, which makes the problem more interesting to solve.

5. Dimensionality Reduction approaches for large scale data:

One can extend the existing approaches to dimensionality reduction to handle large scale data, or propose new approaches. This also includes visualization aspects. One can start from existing open-source contributions and contribute back to the open-source community.

6. Training / Inference in noisy environments and incomplete data:

Sometimes, one may not get the complete distribution of the input data, or data may be lost due to a noisy environment. Can the data be augmented in a meaningful way by simple oversampling, the Synthetic Minority Oversampling Technique (SMOTE), or Generative Adversarial Networks (GANs)? Can the augmentation help improve performance? How to train and infer under these conditions is the challenge to be addressed.
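To make the augmentation idea concrete, here is a minimal sketch of the core SMOTE step: a synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the toy data are assumptions for illustration only; this is not the reference implementation, and a maintained library would normally be used in practice.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (core SMOTE idea)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical toy minority class with two features
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real samples, it stays inside the region already covered by the minority class. Whether that actually improves a model is exactly the open question raised above.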

7. Handling uncertainty in big data processing:

There are multiple ways to handle uncertainty in big data processing [4]. This includes sub-topics such as how to learn from low-veracity, incomplete, or imprecise training data, and how to handle uncertainty with unlabeled data when the volume is high. We can try active learning, distributed learning, deep learning, and fuzzy logic theory to solve these sets of problems.

The research problems in the security and privacy [5] area:-

8. Anomaly Detection in Very Large Scale Systems:

Anomaly detection is a standard problem, but it is not trivial at large scale and in real time. Application domains include health care, telecom, and finance.
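As a small illustration of the real-time constraint, the sketch below flags outliers in a stream using a running mean and variance (Welford's online update), so no history has to be stored. The class name, the threshold of 3 standard deviations, the warm-up length, and the readings are all assumptions chosen for the example; a production system would need far more than this.

```python
import math

class StreamingZScoreDetector:
    """Flags a value as anomalous when it lies more than `threshold`
    standard deviations from the running mean of earlier values."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold
        self.warmup = warmup  # values to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's online update of mean and squared deviations
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = StreamingZScoreDetector()
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0, 10.0]  # hypothetical sensor values
flags = [detector.update(r) for r in readings]
```

Note one design choice in this sketch: flagged values are still folded into the running statistics, which inflates the variance after an outlier. Whether to exclude them is a judgment call that depends on the application.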

9. Effective anonymization of sensitive fields in the large scale systems:

Let me take an example from healthcare systems. A chest X-ray image may contain PHR (Personal Health Record) information. How can one anonymize the sensitive fields to preserve privacy in a large scale system in near real-time? The same question applies to other fields where privacy must be preserved.
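One simple building block, sketched below, is to replace sensitive metadata fields with keyed hashes so that records stay linkable without exposing the original values. Strictly speaking this is pseudonymization rather than full anonymization, and it does nothing about text burned into the image pixels. The field names, the key, and the record are hypothetical.

```python
import hashlib
import hmac

# Hypothetical names of the fields considered sensitive
SENSITIVE_FIELDS = {"patient_name", "patient_id"}

def pseudonymize(record, secret_key, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of `record` with sensitive fields replaced by a
    truncated HMAC-SHA256 digest; other fields are passed through."""
    out = {}
    for field, value in record.items():
        if field in sensitive_fields:
            digest = hmac.new(
                secret_key, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
            out[field] = digest[:16]
        else:
            out[field] = value
    return out

key = b"replace-with-a-managed-secret"  # assumption: key comes from a secret store
record = {
    "patient_name": "Jane Doe",
    "patient_id": "P-1001",
    "finding": "no abnormality",
}
safe = pseudonymize(record, key)
```

The same input and key always give the same token, which keeps joins across datasets possible. That property is also a risk if the key leaks, so key management is part of the problem.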

10. Secure federated learning with real-world applications:

Federated learning enables model training on decentralized data. It can be adopted where the data cannot be shared due to regulatory or privacy issues, but models still need to be built locally and then shared across boundaries. Whether we can make federated learning work at scale, and make it secure with standard software- and hardware-level security, is the next challenge to be addressed. Interested researchers can explore further information from RISELab at UC Berkeley in this regard.
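The train-locally-then-share-models idea can be sketched with federated averaging on a one-parameter model. Each client runs gradient descent on its own data, and only the resulting weight is sent to the server, which averages the weights in proportion to client data size. The model, the learning rate, and the client datasets below are assumptions for illustration, and the sketch covers none of the security aspects mentioned above.

```python
def local_update(w, data, lr=0.05, epochs=20):
    """Gradient descent on one client's private data for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

# Hypothetical private datasets; the raw (x, y) pairs never leave a client
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2), (5.0, 9.9)],
]

global_w = 0.0
for _ in range(10):  # communication rounds
    local_ws = [local_update(global_w, data) for data in clients]
    global_w = federated_average(local_ws, [len(d) for d in clients])
```

Here the global weight settles close to 2, the slope underlying both toy datasets, even though the server never sees a single data point.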

11. Scalable privacy preservation on big data:

Privacy preservation for large scale data is a challenging research problem, as the range of applications spans text, images, and video. Differences in country- and region-level privacy regulations make the problem even more challenging to handle.

The research problems related to data engineering aspects:-

12. Lightweight Big Data analytics as a Service:

Offering everything as a service, as in Software as a Service (SaaS), is a current trend in the industry. Can we work towards providing lightweight big data analytics as a service?

13. Auto conversion of algorithms to MapReduce problems:

MapReduce is a well-known programming model in big data. It is not just a pair of map and reduce functions; it also provides scalability and fault tolerance to applications. However, not many algorithms support MapReduce directly. Can we build a library that automatically converts standard algorithms to MapReduce form?
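For readers new to the model, the shape that an algorithm has to be converted into can be sketched in-process with a word count: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase combines each group. This single-machine sketch only shows the programming model; the scalability and fault tolerance come from a real framework. The function names and documents are illustrative.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)

documents = ["big data at scale", "data science and big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

An automatic converter would have to find this decomposition for algorithms whose steps are not obviously independent per key, which is where the difficulty lies.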

14. Automated Deployment of Spark Clusters:

A lot of progress has been made in the usage of Spark clusters in recent times, but they are not yet completely ready for automated deployment. This is yet another challenging problem to explore further.

The research problems at the intersection of big data with data science:-

15. Approaches to make the models learn with fewer data samples:

In the last 10 years, the complexity of deep learning models has increased with the availability of more data and compute power. Some researchers proudly claim that they solved a complex problem with hundreds of layers in a deep network. For instance, image segmentation may need a 100-layer network. However, the recent trend is to ask whether the same problem can be solved with a smaller amount of relevant data and a less complex model. The reason behind this thinking is the need to run models on edge devices, not only in a cloud environment with GPUs/TPUs. For instance, deep learning models trained on big data might need to be deployed in CCTV cameras or drones for real-time usage. This is fundamentally changing the approach to solving complex problems. You may work on challenging problems in this sub-topic.

16. Neural Machine Translation to Local languages:

One can use Google Translate for neural machine translation (NMT) tasks. However, a lot of research is under way in local universities, with support from governments, on neural machine translation for local languages. The latest advances in Bidirectional Encoder Representations from Transformers (BERT) are changing the way these problems are solved. One can collaborate with those efforts to solve real-world problems.

17. Handling Data and Model drift for real-world applications:

Do we need to run the model on inference data if we know that the data pattern is changing and the performance of the model will drop? Can we identify the drift in the data distribution even before passing the data to the model? If one can identify the drift, why pass the data to the model for inference and waste compute power? This is a compelling research problem to solve at scale in the real world. Active learning and online learning are some of the approaches to solving the model drift problem.
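One way to check for drift before inference is to compare the binned distribution of an incoming batch with a reference sample. The sketch below uses the Population Stability Index (PSI) for a single numeric feature. This is only one of several possible drift measures, and the bin count, the smoothing constant, the 0.25 cut-off (a common rule of thumb, not a guarantee), and the data are assumptions for illustration.

```python
import math

def population_stability_index(reference, incoming, n_bins=5, eps=1e-6):
    """PSI between two samples of one feature, using equal-width bins
    defined on the reference sample. Larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]  # avoid log(0)

    ref_p, inc_p = proportions(reference), proportions(incoming)
    return sum((q - p) * math.log(q / p) for p, q in zip(ref_p, inc_p))

def should_run_inference(reference, batch, threshold=0.25):
    """Gate: skip the model when the batch has drifted past the threshold."""
    return population_stability_index(reference, batch) < threshold

reference = [i / 10 for i in range(100)]  # hypothetical training-time feature values
shifted = [x + 5 for x in reference]      # same shape, shifted upwards

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A gate like `should_run_inference` costs one pass over the batch, which is usually far cheaper than a forward pass through a deep model. Doing this per feature, for high-dimensional data, at scale is the open part.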

18. Handling interpretability of deep learning models in real-time applications:

Explainable AI is the recent buzzword. Interpretability is a subset of explainability. Machine learning and deep learning models no longer have to be treated as black boxes. A few models, such as decision trees, are interpretable by design. However, as complexity increases, the base model itself may no longer be useful for interpreting the results. We may then need to depend on surrogate-style explanation methods such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP). These can help decision-makers justify the results produced, for instance the rejection of a loan application or the classification of a chest X-ray as COVID-19 positive. Can interpretability methods handle large scale real-time applications?
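To show what a Shapley-style explanation computes, here is a brute-force sketch that attributes the difference between one prediction and a baseline prediction to the individual features. It is not the SHAP library and not its API; it enumerates every feature subset, so its cost grows exponentially with the number of features, which hints at why real-time use at scale is hard. The scoring function, the applicant, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline).
    Features outside a coalition are set to their baseline value."""
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def loan_score(x):
    """Hypothetical toy scoring model: income, debt, years at current job."""
    income, debt, years = x
    return 0.5 * income - 0.8 * debt + 0.1 * years

applicant = [4.0, 3.0, 2.0]
baseline = [5.0, 1.0, 5.0]  # assumption: a typical approved applicant
phi = shapley_values(loan_score, applicant, baseline)
```

In this toy case the largest negative contribution comes from the debt feature, which is the kind of justification a loan officer could pass on to the applicant. The contributions sum to the gap between the applicant's score and the baseline score.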

19. Building context-sensitive large scale systems:

Building large scale context-sensitive systems is the latest trend. There are some open-source efforts to start from. However, it requires a lot of effort to collect the right set of data and to build context-sensitive systems that improve search capability. You can choose a research problem in this topic if you have a background in search, knowledge graphs, and Natural Language Processing (NLP). This is applicable across domains.

20. Building large scale generative based conversational systems (Chatbot frameworks):

One specific area gaining momentum is building conversational systems such as Q&A and generative chatbot systems. A lot of chatbot frameworks are available. Making them generative and producing summaries of conversations in real time are still challenging problems. The complexity of the problem increases as the scale increases. A lot of research is going on in this area. It requires a good understanding of Natural Language Processing and the latest advances such as Bidirectional Encoder Representations from Transformers (BERT) to expand the scope of what conversational systems can solve at scale.

Research Methodology:

I hope you can frame specific problems from the topics highlighted above, using your domain and technical expertise. Let me recommend a methodology to solve any of these problems. Some points may look obvious to researchers; however, let me cover them in the interest of a larger audience:

Identify your core strengths, whether they are in theory, implementation, tools, security, or a specific domain. You can acquire other new skills while doing the research. Identifying the right research problem with suitable data gets you about halfway to the milestone. The problem may overlap with other technology areas such as the Internet of Things (IoT), Artificial Intelligence (AI), and Cloud. Your passion for research will determine how far you can go in solving that problem. The trend is towards interdisciplinary research problems across departments. So, one may choose a specific domain in which to apply the skills of big data and data science.

Literature survey: I strongly recommend following only reputable publications such as IEEE, ACM, Springer, Elsevier, ScienceDirect, etc. Do not fall into the trap of “International journal …” titles which publish without peer review. At the same time, please do not limit the literature survey to IEEE/ACM papers. A lot of interesting papers are available in and paperswithcode. One needs to follow the top research labs in industry and academia for the shortlisted topic. That gives the latest research updates and helps to identify the gaps to fill.

Lab ecosystem: Create a good lab environment to carry out strong research. This can be your research lab with professors, post-docs, Ph.D. scholars, and master's and bachelor's students in an academic setup, or with senior and junior researchers in an industry setup. Having the right partnership is the key to collaboration, and you may try virtual groups as well. A good ecosystem boosts the results, as members can challenge each other's approaches to improve the results further.

Publish in the right venues: As mentioned under the literature survey, publish your research papers in the right forum, where you will receive peer reviews from experts around the world. You may face obstacles in this process in the form of rejections. However, as long as you receive constructive feedback, you should be thankful to the anonymous reviewers. You may also see an opportunity to patent the ideas if the approach is novel, non-obvious, and inventive. The recent trend is to open source the code while publishing the paper. If your institution permits open sourcing, you may do so by uploading the relevant code to GitHub with appropriate licensing terms and conditions.

Top Research labs to follow:

Some of these research areas are active in the top research centers around the world. I encourage you to follow them and identify further gaps to continue the work. Here are some of the top research centers to follow in the big data and data science area:

RISE Lab at the University of California, Berkeley, USA

Doctoral Research Centre in Data Science, The University of Edinburgh, United Kingdom

Data Science Institute, Columbia University, USA

The Institute for Data Intensive Engineering and Science, Johns Hopkins University, USA

Facebook Data Science research

Big Data Institute, University of Oxford, United Kingdom

Center for Big Data Analytics, The University of Texas at Austin, USA

Center for data science and big data analytics, Oakland University, USA

Institute for Machine Learning, ETH Zurich, Switzerland

The Alan Turing Institute, United Kingdom

IISc Computational and Data Sciences Research

Data Lab, Carnegie Mellon University, USA

If you wish to continue your learning in big data, here are my recommendations:

Coursera Big Data Specialization

Big data course from the University of California San Diego

Top 10 books based on your need can be picked up from the summary article in Analytics India Magazine.

Data Challenges:

In the process of solving real-world problems, one may come across these data-related challenges:

  • What is the relevant data in the available data?
  • The lack of international standards for data privacy regulations
  • General Data Protection Regulation (GDPR)-style rules that differ across countries
  • Federated learning concepts to adhere to the rules: one can build the model and share it, while the data still belongs to the country/organization.


In this article, I briefly introduced big data research issues in general and listed the top 20 latest research problems in big data and data science in 2020. These problems are divided and presented in 5 categories so that researchers can pick a problem based on their interests and skill set. This list is by no means exhaustive. However, I hope these inputs can excite some of you to solve real problems in big data and data science. I covered these points, along with some background on big data, in a webinar for your reference [7]. You may also refer to my other article, which lists the problems to solve with data science amid Covid-19 [8]. Let us come together to build a better world with technology.










