Learn Scatter Plots for Sequence Properties | Biological Data Visualization


Scatter plots are powerful tools for visualizing the relationship between two numerical variables. In biology, scatter plots are often used to compare properties of biological sequences, such as the length of DNA sequences and their GC content. By plotting these properties against each other, you can quickly see if there is any correlation or trend, such as whether longer sequences tend to have higher or lower GC content. This kind of visualization helps you identify patterns, outliers, or clusters that may be biologically meaningful, such as differences between genes, chromosomes, or species.


import matplotlib.pyplot as plt

# Example DNA sequences from different species
sequences = [
    {"id": "seq1", "species": "Human", "sequence": "ATGCGCGTACGTAGCTAGCGT"},
    {"id": "seq2", "species": "Mouse", "sequence": "ATGCGTACGTAGCTAGC"},
    {"id": "seq3", "species": "Yeast", "sequence": "ATGCGCGCGCGT"},
    {"id": "seq4", "species": "Human", "sequence": "ATGCGTAGCTAGCTAGCGCGT"},
    {"id": "seq5", "species": "Mouse", "sequence": "ATGCTAGCTAG"},
    {"id": "seq6", "species": "Yeast", "sequence": "ATGCGCGCGCGCGCGT"},
]

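# Calculate sequence length and GC content
def gc_content(seq):
    """Return the percentage of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0  # assumption: report 0% rather than divide by zero on an empty sequence
    seq = seq.upper()  # count lowercase bases as well
    g = seq.count("G")
    c = seq.count("C")
    return 100 * (g + c) / len(seq)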

lengths = []
gc_contents = []
for entry in sequences:
    seq = entry["sequence"]
    lengths.append(len(seq))
    gc_contents.append(gc_content(seq))

plt.scatter(lengths, gc_contents)
plt.xlabel("Sequence Length")
plt.ylabel("GC Content (%)")
plt.title("Scatter Plot of Sequence Length vs. GC Content")
plt.show()

The code starts from a list of DNA sequences, each stored with an ID, a species label, and a sequence string. For each sequence you compute two values: its length, and its GC content, using a small function that counts the G and C nucleotides, divides by the total length, and multiplies by 100 to give a percentage. These values are collected in two lists, one for sequence lengths and one for GC content percentages. The plt.scatter() function from matplotlib then draws the scatter plot, with sequence length on the x-axis and GC content on the y-axis. You label the axes with plt.xlabel() and plt.ylabel(), add a title with plt.title(), and display the figure with plt.show(). Each point in the plot represents one sequence. Clusters or trends can point to biological patterns, such as one species having sequences with higher GC content, or longer sequences falling within a particular GC content range. With only six short example sequences, any pattern seen here is illustrative rather than conclusive.
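A trend that is visible in a scatter plot can also be summarized as a single number. The sketch below is one possible way to do that, not part of the lesson's own code: it computes the Pearson correlation coefficient between sequence length and GC content for the six example sequences. The helper name `pearson_r` and the list name `example_seqs` are hypothetical names chosen for this illustration.

```python
import math

def gc_content(seq):
    """Return the GC percentage of a DNA sequence (same formula as in the lesson)."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Hypothetical helper written for this example.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# The six example sequences used in this chapter
example_seqs = [
    "ATGCGCGTACGTAGCTAGCGT",
    "ATGCGTACGTAGCTAGC",
    "ATGCGCGCGCGT",
    "ATGCGTAGCTAGCTAGCGCGT",
    "ATGCTAGCTAG",
    "ATGCGCGCGCGCGCGT",
]

lengths = [len(s) for s in example_seqs]
gc_contents = [gc_content(s) for s in example_seqs]

r = pearson_r(lengths, gc_contents)
print(f"Pearson r between length and GC content: {r:.2f}")
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none. With so few points the coefficient is unstable, so it is best read alongside the plot rather than instead of it.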


import matplotlib.pyplot as plt

# Reuses `sequences` and `gc_content` defined in the previous example

# Assign a color to each species
species_colors = {"Human": "blue", "Mouse": "green", "Yeast": "red"}

lengths = []
gc_contents = []
colors = []

for entry in sequences:
    seq = entry["sequence"]
    lengths.append(len(seq))
    gc_contents.append(gc_content(seq))
    colors.append(species_colors[entry["species"]])

plt.scatter(lengths, gc_contents, c=colors)
plt.xlabel("Sequence Length")
plt.ylabel("GC Content (%)")
plt.title("Sequence Length vs. GC Content (Colored by Species)")
plt.show()
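The color-coded plot above passes one list of colors to a single plt.scatter() call, so the figure has no legend telling the reader which color belongs to which species. One possible alternative, sketched here under stated assumptions, is to group the points by species and draw one scatter series per group, each with its own label. The function names `group_by_species` and `plot_by_species` are hypothetical names introduced for this illustration; the matplotlib import sits inside the plotting function only so that the grouping step can run without matplotlib installed.

```python
from collections import defaultdict

def gc_content(seq):
    """Return the GC percentage of a DNA sequence (same formula as in the lesson)."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

def group_by_species(records):
    """Map each species to a pair of lists: (lengths, GC percentages).

    Hypothetical helper written for this example.
    """
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        xs, ys = groups[rec["species"]]
        xs.append(len(rec["sequence"]))
        ys.append(gc_content(rec["sequence"]))
    return dict(groups)

def plot_by_species(groups, colors):
    """Draw one scatter series per species so each gets a legend entry."""
    import matplotlib.pyplot as plt  # local import: only needed when plotting

    fig, ax = plt.subplots()
    for species, (xs, ys) in groups.items():
        ax.scatter(xs, ys, color=colors[species], label=species)
    ax.set_xlabel("Sequence Length")
    ax.set_ylabel("GC Content (%)")
    ax.set_title("Sequence Length vs. GC Content (Colored by Species)")
    ax.legend(title="Species")
    plt.show()

# Same example data and colors as in the lesson
sequences = [
    {"id": "seq1", "species": "Human", "sequence": "ATGCGCGTACGTAGCTAGCGT"},
    {"id": "seq2", "species": "Mouse", "sequence": "ATGCGTACGTAGCTAGC"},
    {"id": "seq3", "species": "Yeast", "sequence": "ATGCGCGCGCGT"},
    {"id": "seq4", "species": "Human", "sequence": "ATGCGTAGCTAGCTAGCGCGT"},
    {"id": "seq5", "species": "Mouse", "sequence": "ATGCTAGCTAG"},
    {"id": "seq6", "species": "Yeast", "sequence": "ATGCGCGCGCGCGCGT"},
]
species_colors = {"Human": "blue", "Mouse": "green", "Yeast": "red"}

groups = group_by_species(sequences)
for species, (xs, ys) in groups.items():
    print(species, xs, [round(y, 1) for y in ys])
```

Calling plot_by_species(groups, species_colors) then draws the figure with a species legend. Grouping first also makes it easy to inspect or summarize each species separately before plotting.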

1. What can a scatter plot of sequence length vs. GC content reveal?

2. Fill in the blank: In a scatter plot, each point represents a _____ with two properties.

3. How can color coding enhance the interpretability of scatter plots in biology?


Section 3. Chapter 6
