Human Genome Data | Division of Health Medical Intelligence

Development of Human Genome Data Analysis Technology

Cancer Genome: Cancer is a disease caused by the combination of multiple mutations that accumulate in the genome and cause cells to lose control of their growth. Advanced sequencing technologies have made it possible to obtain multi-omics data, such as DNA sequences, RNA expression levels, and epigenomes, from individual cancer genomes. We are developing data analysis techniques that use these omics data to identify the various genomic aberrations in cancer.

Immunogenomics: Being infected with a virus does not necessarily make us ill, because our bodies have an immune system, a mechanism that eliminates viruses and other "non-self" entities that have invaded the body. Cancer is also a target of immune attack, because genomic mutations give it a genome that differs from the original, making it "non-self". However, cancer cells escape immune attack in a variety of ways. Nivolumab, a well-known cancer drug, allows the immune system to attack cancer cells by releasing the brakes on the immune system. Cancer cells have many other mechanisms for escaping immune attack. In this research, we are developing computational techniques to analyze cancer cells and the immune system together as a single system.

Long read sequencing data analysis: The sequencers most widely used for genome sequencing today fragment DNA into pieces of several hundred bases and read about 100 to 200 bases from each end. These data are called short reads. Genome reconstruction would clearly be easier and more accurate if longer stretches of the genome could be read at once. Long-read sequencers such as those from Oxford Nanopore can read tens of thousands of bases or more in a single read. However, long reads contain more errors than short reads. In this research, we are developing deep neural networks to determine genome sequences more accurately from long-read sequencing data and to detect genomic mutations, such as structural mutations, that are difficult to identify with short reads.
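To illustrate why read errors can be corrected when several reads cover the same position, here is a minimal sketch of a majority-vote consensus. This is a deliberately simple stand-in, not the deep learning approach described above. It assumes the reads have already been aligned to the same coordinates, and the function name `consensus` and the toy reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over reads aligned to the same coordinates.

    Assumption: '-' marks a gap or an uncovered position in a read.
    Positions covered by no read are reported as 'N'.
    """
    length = max(len(read) for read in aligned_reads)
    result = []
    for i in range(length):
        # Collect the bases observed at column i, ignoring gaps.
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        if not column:
            result.append("N")
            continue
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Toy data: five noisy reads of the same region, each with at most one error.
reads = [
    "ACGTTGCA",
    "ACGATGCA",  # error at index 3
    "ACGTTGCA",
    "ACCTTGCA",  # error at index 2
    "ACGTTGGA",  # error at index 6
]
print(consensus(reads))
```

Because the errors fall at different positions in different reads, the vote at each column recovers the underlying sequence. Real long-read errors include insertions and deletions that shift the alignment, which is one reason more expressive models are needed in practice.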

Single cell sequencing data analysis: Conventionally, DNA and RNA sequencing data have been obtained by sequencing a group of cells together, for example a group of cancer cells taken from a cancer patient. Single-cell sequencing, in contrast, provides the DNA sequence and RNA expression of each individual cell. Using this information, it is possible to separate immune cells such as T cells and NK cells based on the RNA expression of marker genes, or to analyze the diversity of cancer cells at single-cell resolution. However, because the cells are destroyed to extract DNA or RNA, temporal and spatial information is lost. In this research, we are developing data analysis technology that enables spatio-temporal analysis by probabilistically representing single-cell data based on information such as the cell cycle.
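Separating cell types by marker gene expression can be sketched as follows. This is a toy illustration under stated assumptions, not the analysis pipeline used in this research: the marker lists (CD3D and CD3E for T cells, NCAM1 and KLRF1 for NK cells), the `min_score` threshold, and the expression values are illustrative choices, and `label_cell` is a hypothetical helper.

```python
# Illustrative marker genes per cell type (an assumption for this sketch).
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "NK cell": ["NCAM1", "KLRF1"],
}


def label_cell(expression, markers=MARKERS, min_score=1.0):
    """Label one cell by the type whose markers have the highest mean expression.

    `expression` maps gene name -> normalized expression for a single cell.
    Genes missing from the dict count as 0. Cells whose best score is below
    `min_score` are labeled 'unknown'.
    """
    scores = {
        cell_type: sum(expression.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "unknown"


# Made-up expression profiles for three cells.
cells = {
    "cell_1": {"CD3D": 4.2, "CD3E": 3.8, "NCAM1": 0.0},
    "cell_2": {"CD3E": 0.1, "NCAM1": 2.9, "KLRF1": 3.3},
    "cell_3": {"EPCAM": 5.0},
}
labels = {cell_id: label_cell(expr) for cell_id, expr in cells.items()}
print(labels)
```

A fixed threshold on mean marker expression is the simplest possible rule. It ignores dropout and noise in single-cell data, which is part of the motivation for probabilistic representations.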

Clinical sequencing enhanced with Artificial Intelligence (AI): By making full use of the data analysis techniques described above, we can comprehensively detect genomic mutations in cancer cells. We are conducting research to apply this technology to cancer treatment, in collaboration with the Research Hospital in our institute, the Advanced Clinical Research Center, and affiliated hospitals. Whole genome sequencing of cancer patients' genomes has revealed thousands to hundreds of thousands of genomic mutations. The goal is to identify the cancer-causing mutations (driver mutations) among them and use this information to find effective anti-cancer drugs for each patient. This requires reading and interpreting the relevant literature for each mutation, as well as genetic pathway information and drug patent information, a process called clinical translation of genome information. Doing this for each of thousands of mutations is daunting. Moreover, PubMed, a literature database, currently contains more than 30 million papers in the life sciences, far more than a single researcher can cover. We are studying the use of artificial intelligence for this clinical translation of genomic information.
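The first step of such a workflow, splitting detected mutations into those with existing annotations and those that still need literature review, might look roughly like the sketch below. Everything in it is hypothetical: the `Mutation` type, the `KNOWLEDGE` table, and the placeholder gene, mutation, and drug names stand in for curated databases and do not describe real clinical findings or the system under development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    gene: str
    protein_change: str


# Hypothetical hand-written knowledge table with placeholder names.
# A real system would draw on curated databases and the literature.
KNOWLEDGE = {
    ("GENE_A", "p.X100Y"): {"evidence": "reported driver", "drugs": ["drug_1"]},
    ("GENE_B", "p.Z25W"): {"evidence": "reported driver", "drugs": []},
}


def triage(mutations, knowledge=KNOWLEDGE):
    """Split mutations into annotated ones and ones needing manual review."""
    annotated, unreviewed = [], []
    for m in mutations:
        entry = knowledge.get((m.gene, m.protein_change))
        if entry is None:
            unreviewed.append(m)
        else:
            annotated.append((m, entry))
    return annotated, unreviewed


detected = [
    Mutation("GENE_A", "p.X100Y"),
    Mutation("GENE_C", "p.Q7R"),
    Mutation("GENE_B", "p.Z25W"),
]
annotated, unreviewed = triage(detected)
print(len(annotated), "annotated;", len(unreviewed), "need literature review")
```

With thousands of mutations per genome, the unreviewed list is where the manual reading burden accumulates, and it is this part that motivates automated support.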

Division of Health Medical Intelligence,

Laboratory of Sequence Analysis,

Human Genome Center

Institute of Medical Science,

University of Tokyo