Neisseria & Phylogenetic tree (step by step guide)

Brunomalta
6 min read · Mar 31, 2024

Neisseria? Phylogenetic tree?

Neisseria is a large genus of bacteria that includes both non-pathogenic and pathogenic species (N. mucosa, N. gonorrhoeae…). Neisseria are Gram-negative bacteria, and approximately 11 species of the genus colonize humans. Of these 11, “only” 2 are known to be pathogenic and harmful: N. gonorrhoeae and N. meningitidis.

The evolutionary history of species and the relationships between them can be studied by building phylogenetic trees. A phylogenetic tree, or “dendrogram”, is essentially a two-dimensional graph in which the evolutionary relationships between organisms can be analyzed through a simple “tree” structure. The clusters formed within these trees group species that share a common ancestor and are therefore genetically closer, and the length of a branch represents the amount of evolutionary change, which is often read as the time taken for evolution.

https://www.yourgenome.org/theme/what-is-phylogenetics/

We can use three different approaches to build phylogenetic trees: Parsimony, Distance Matrix Based, and Maximum Likelihood. In this project, I am going to delve into the Distance Matrix Based approach using BioPython. This method relies on the distance, or dissimilarity, between two or more aligned sequences.
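To make the idea of a distance between aligned sequences concrete, here is a minimal sketch in plain Python. The function name p_distance and the toy sequences are illustrative assumptions, not part of the project's code. It simply measures the fraction of positions at which two aligned sequences differ.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions where the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned (equal, non-zero length)")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# Toy aligned sequences, made up for illustration only.
toy = {
    "species_1": "ACGTACGT",
    "species_2": "ACGTACGA",
    "species_3": "TCGAACGA",
}

# One distance per unordered pair: more differing positions, larger distance.
pairwise = {
    (x, y): p_distance(toy[x], toy[y])
    for x in toy
    for y in toy
    if x < y
}

for pair, dist in pairwise.items():
    print(pair, dist)
```

A distance-based tree method then groups the sequences with the smallest pairwise distances first.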

Pipeline

This project is divided into 3 major parts:

  1. Web Scraping → In this chapter, I am going to create an algorithm that automatically retrieves a set of Neisseria .fasta files (containing the DNA sequences) from LPSN.
  2. Neisseria Dataframe → In this chapter, I am going to create an algorithm that uses the data scraped from LPSN to build a dataframe with some biological information (GC content, DNA size…).
  3. Phylogenetic Tree → In this chapter, I am going to use the scraped data to create a phylogenetic tree, using the distance matrix based method.

Web Scraping

LPSN (List of Prokaryotic names with Standing in Nomenclature) is one of the best-known websites for prokaryotic information, and for that reason I chose it as the main source of data for creating the phylogenetic tree. Navigating the LPSN website is very straightforward: we enter the name of the species or genus we are looking for, and the information is easily accessible.

https://lpsn.dsmz.de/genus/neisseria

One of the main things a “web scraper” wants from a website is organized and stable HTML. This means that, for the information we are trying to scrape, the hierarchy of the pages stays the same or changes very little. In the case of LPSN, a few workarounds are needed if we want all the species of the Neisseria genus.

Code

First, we will need some libraries. We will import BeautifulSoup for the scraping part (along with a few additional ones such as requests, re, os…), and then Selenium to create a bot that downloads the .fasta files from the LPSN website.

(Installing Selenium is not as straightforward as a single “pip install”, since you also need to install a webdriver for the browser you are going to use (Chrome, Firefox…); in my case I used Chrome. Another important point is that webdrivers are updated frequently, and your code will stop working if you don’t keep them up to date.)

Within the web scraper code, I created two important functions: getting_species_list and download_species.

The getting_species_list function takes a genus as input and parses the HTML content of that genus's page. It then finds the <a> elements with the class “color-species”, which represent the links to the individual species within the genus. The rest of the code was basically trial and error to find the best approach for extracting the species names.
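The parsing step described above can be sketched roughly as follows. This is not the original getting_species_list code: the helper name extract_species_names and the sample HTML are assumptions made for illustration, and only the “color-species” class comes from the description above.

```python
from bs4 import BeautifulSoup


def extract_species_names(html: str) -> list[str]:
    """Collect the text of every <a class="color-species"> link, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for link in soup.find_all("a", class_="color-species"):
        # Collapse internal whitespace so names compare cleanly.
        name = " ".join(link.get_text().split())
        if name and name not in names:
            names.append(name)
    return names


# Made-up fragment imitating a genus page; the real LPSN markup may differ.
SAMPLE_HTML = """
<ul>
  <li><a class="color-species" href="#"><i>Neisseria mucosa</i></a></li>
  <li><a class="color-species" href="#"><i>Neisseria  gonorrhoeae</i></a></li>
  <li><a class="color-species" href="#"><i>Neisseria mucosa</i></a></li>
  <li><a class="other-link" href="#">Not a species</a></li>
  <li><a class="color-species" href="#"><i>Neisseria meningitidis</i></a></li>
</ul>
"""

species_names = extract_species_names(SAMPLE_HTML)
print(species_names)
```

In the real scraper, the HTML string would come from the genus page (for example https://lpsn.dsmz.de/genus/neisseria) fetched with requests instead of a hard-coded sample.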

The download_species function receives as parameters the species list obtained by getting_species_list, directory (where the data is stored) and main_domain (the LPSN website). In this code, I first set the Chrome options and create the webdriver. Then, using the names from the species list, Selenium opens the webpage for each species and downloads its file. I use the Chrome options to save the files in a specific directory, and each file is renamed after its species. There are some exceptions: when the download button is not present, the download is not possible (this is tracked with the dont_exists flag). This code is the result of a lot of trial and error, which is normal, since the HTML of some websites is quite inconsistent.
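As a rough sketch of two pieces of this step, the snippet below shows one way to point Chrome at a download directory and to rename the most recent download after its species. The helper names build_driver and rename_latest_download, and the renaming strategy itself, are assumptions for illustration and not the original download_species implementation.

```python
import tempfile
from pathlib import Path


def build_driver(download_dir):
    """Create a Chrome webdriver that saves downloads into download_dir."""
    # Imported here so the renaming helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option(
        "prefs",
        {
            "download.default_directory": str(Path(download_dir).resolve()),
            "download.prompt_for_download": False,
        },
    )
    return webdriver.Chrome(options=options)


def rename_latest_download(download_dir, species_name):
    """Rename the newest finished file in download_dir to <species>.fasta."""
    folder = Path(download_dir)
    finished = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix != ".crdownload"  # skip partial Chrome downloads
    ]
    if not finished:
        return None  # nothing was downloaded, e.g. no download button on the page
    newest = max(finished, key=lambda p: p.stat().st_mtime)
    target = folder / (species_name.replace(" ", "_") + ".fasta")
    newest.rename(target)
    return target


# Small offline demo of the renaming helper (no browser involved).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sequence.fasta").write_text(">demo\nACGT\n")
    renamed = rename_latest_download(tmp, "Neisseria mucosa")
    renamed_name = renamed.name
    renamed_exists = renamed.exists()
    print(renamed_name)
```

Returning None when no finished file is found plays a role similar to the flag mentioned above for pages without a download button.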

To wrap everything up:

We first call the getting_species_list function for the Neisseria genus, and then pass the resulting list as input to download_species.

Neisseria Dataframe

We have our files, so now we need some Bio metadata. For that reason, I created an algorithm that reads the files and processes them to give us more granularity. There are two main classes in this code:

  1. Bacterium class: Designed to represent a species and its DNA characteristics (DNA size, GC content, DNA frequency…). It has several important methods, such as read_file, species_finder, long_dna, frequency, gc_content…, used for reading the files and creating the Bio metadata. The update_dataframe method adds the DNA data to a class-level DataFrame.
  2. FolderScrap class: Iterates over the files in the folder, creates a Bacterium instance for each file and aggregates the DNA data into a single DataFrame.
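To give an idea of the kind of metadata involved, here is a simplified, hypothetical stand-in for the Bacterium class. The method names long_dna, frequency and gc_content are borrowed from the description above, but their bodies, the from_fasta_text constructor and the as_row helper are my own assumptions, and the sketch assumes one sequence per FASTA file.

```python
from collections import Counter


class BacteriumSketch:
    """Hypothetical, simplified stand-in for the Bacterium class."""

    def __init__(self, species: str, dna: str):
        self.species = species
        self.dna = dna.upper()

    @classmethod
    def from_fasta_text(cls, text: str) -> "BacteriumSketch":
        # Assumes a single-record FASTA: one ">" header, then sequence lines.
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        header = lines[0].lstrip(">")
        dna = "".join(ln for ln in lines[1:] if not ln.startswith(">"))
        return cls(header, dna)

    def long_dna(self) -> int:
        """DNA size: number of bases in the sequence."""
        return len(self.dna)

    def frequency(self) -> dict:
        """Count of each base in the sequence."""
        return dict(Counter(self.dna))

    def gc_content(self) -> float:
        """Percentage of bases that are G or C."""
        if not self.dna:
            return 0.0
        gc = self.dna.count("G") + self.dna.count("C")
        return 100 * gc / len(self.dna)

    def as_row(self) -> dict:
        """One dataframe-style row of metadata for this species."""
        return {
            "species": self.species,
            "dna_size": self.long_dna(),
            "gc_content": self.gc_content(),
        }


demo = BacteriumSketch.from_fasta_text(">Neisseria demo\nATGC\nGGCC\n")
row = demo.as_row()
print(row)
```

A list of such rows, one per file, could then be turned into a table with pandas.DataFrame(rows).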

(There is much more to say about the methods used for the Bio metadata, but that is not the main goal of this article.)

In the end I got a dataframe with the specific species and the specific Bio metadata.

Phylogenetic Tree

As I said at the beginning of this article, I used the Distance Matrix Based approach. Before creating the dendrogram, it is important to perform a Multiple Sequence Alignment (MSA). The main goal of MSA is to identify regions of similarity that may indicate functional, structural and, of course, evolutionary relationships between sequences. MSA works by aligning multiple sequences and scoring how well they match at each position. Positions with a high score usually have identical characters.

First, we need to ensure that all the DNA is stored in just one file.
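One simple way to do this is sketched below. The function name merge_fasta_files and the file names in the demo are assumptions for illustration; the sketch just concatenates every .fasta file in a folder into a single multi-record file.

```python
import tempfile
from pathlib import Path


def merge_fasta_files(folder, output_path):
    """Concatenate all .fasta files in folder into output_path; return how many were merged."""
    output_path = Path(output_path)
    merged = 0
    with output_path.open("w") as out:
        for fasta in sorted(Path(folder).glob("*.fasta")):
            if fasta.resolve() == output_path.resolve():
                continue  # never read the output file back into itself
            text = fasta.read_text().strip()
            if text:
                out.write(text + "\n")  # make sure each record ends with a newline
                merged += 1
    return merged


# Offline demo with two tiny made-up records.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "Neisseria_mucosa.fasta").write_text(">Neisseria mucosa\nACGT\n")
    (tmp / "Neisseria_sicca.fasta").write_text(">Neisseria sicca\nACGA")
    merged_path = tmp / "all_neisseria.fasta"
    merged_count = merge_fasta_files(tmp, merged_path)
    merged_text = merged_path.read_text()
    print(merged_text)
```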

After this, it is important to read the newly generated fasta file and load its sequences into an alignment using MultipleSeqAlignment() from BioPython. Note that MultipleSeqAlignment only holds sequences that are already aligned, so all of them must have the same length. After that, the script initializes a DistanceCalculator object with the identity model, which scores sequences by the fraction of identical positions. Then, it uses the get_distance method to obtain the distance matrix.
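The distance matrix step can be illustrated with a small sketch. The three toy sequences and their ids are made up for illustration (real Neisseria sequences are much longer), and they already have equal length, which MultipleSeqAlignment requires.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy, already-aligned sequences of equal length (made up for illustration).
toy_records = [
    SeqRecord(Seq("ACGTACGTAC"), id="sp_a"),
    SeqRecord(Seq("ACGTACGTAA"), id="sp_b"),
    SeqRecord(Seq("TTGTACCTGG"), id="sp_c"),
]
alignment = MultipleSeqAlignment(toy_records)

# "identity" distance: 1 minus the fraction of identical positions.
calculator = DistanceCalculator("identity")
distance_matrix = calculator.get_distance(alignment)

print(distance_matrix)
```

In a real pipeline, the alignment would more likely be read from a file, for instance with AlignIO.read on an aligned FASTA file, rather than typed in by hand.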

Finally, the script initializes a DistanceTreeConstructor object and calls its upgma() method (UPGMA, Unweighted Pair Group Method with Arithmetic Mean), which builds the phylogenetic tree from the distance matrix. There is one drawback to using upgma(): it assumes that sequences evolve at a constant rate over time, which is not true, since mutation rates are not constant.
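As a minimal sketch of this last step, the snippet below builds a UPGMA tree from a small hand-written distance matrix. The species ids and distance values are invented for illustration; in the project, the matrix would come from the alignment instead.

```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Hand-written lower-triangular distance matrix (toy values, diagonal included).
names = ["sp_a", "sp_b", "sp_c"]
toy_matrix = DistanceMatrix(
    names,
    [
        [0.0],
        [0.1, 0.0],
        [0.5, 0.5, 0.0],
    ],
)

# UPGMA repeatedly joins the two closest clusters until one tree remains.
constructor = DistanceTreeConstructor()
tree = constructor.upgma(toy_matrix)

# Text rendering of the tree; Phylo.draw(tree) would plot it with matplotlib.
Phylo.draw_ascii(tree)

leaves = {clade.name: clade for clade in tree.get_terminals()}
closest_pair = tree.common_ancestor(leaves["sp_a"], leaves["sp_b"])
```

Because sp_a and sp_b have the smallest distance, they are joined first and end up as sister leaves in the resulting tree.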

In the end, we will get a tree like this:

Hope I have been helpful in some way :)

Brunomalta

Data-scientist, AI-Engineering, Clinical Physiology, Neuroscience/Philosophy enthusiast