I use HMMER a lot in my current PhD project, and Pfam is a very popular database that relies heavily on HMMER. It is also quite interesting that some key contributors to the HMMER package play essential roles in Pfam. So I recently read several Pfam papers to get a feeling for the “right way” to use HMMER :)
Before jumping into my understanding of Pfam, I also want to share some personal impressions from reading the Pfam papers. I already knew that Pfam was first created a long time ago. But only after going back to the first Pfam paper, published in 1997, did I realise that Pfam, like any other great database, has been continuously updated to keep up with new data and new knowledge in bioinformatics. Many great scientists and software engineers have spent a large part of their professional lives maintaining and improving the Pfam database. They are amazing!
The first Pfam paper (1997)
Pfam was first developed as a comprehensive protein family database that combines good-quality multiple sequence alignments (MSAs) with semi-automatic updating, to cope with the reality that protein sequence databases grow exponentially.
Ignore, for the moment, how the members of each protein family are chosen. Presumably I will update this later :) Building a good-quality alignment for a protein family usually needs manual curation. At the same time, the MSAs of these families need to be updated continuously, because new protein sequences are generated every day. If we redid the full MSA manually every time, the computation time and labor cost would be insanely high. So Pfam proposed building a profile for each protein family (a profile HMM) from a “seed alignment”, which contains only representative sequences of that family. These profile HMMs are then used to search a whole protein sequence database to find all the proteins belonging to each family. Seed alignments are intentionally kept stable and are only changed if new protein sequences cannot be mapped to any existing profile HMM. So most newly sequenced proteins are only added to the full alignment of a family and do not affect the profile HMM, which is based on the seed alignment. Seed alignments are either collected from other databases or built from scratch (mostly with ClustalW).
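To make the seed-to-full workflow a bit more concrete, here is a minimal sketch of how I would write it down for one family. It is only my assumption of what the steps look like with the `hmmbuild` and `hmmsearch` commands from current HMMER 3 releases (the 1997 paper predates that version), and the family name and file names are made up for illustration. The helper only builds the command lines; running them needs HMMER installed and real input files.

```python
import shutil
import subprocess


def pfam_style_commands(family, seed_alignment, sequence_db):
    """Build (but do not run) the two HMMER commands for one family.

    All file names derived from `family` are hypothetical, chosen for this sketch.
    """
    hmm_file = f"{family}.hmm"

    # Step 1: build a profile HMM from the small, curated seed alignment.
    build = ["hmmbuild", hmm_file, seed_alignment]

    # Step 2: search a whole sequence database with that profile.
    # -A saves an alignment of all hits (playing the role of the "full" alignment),
    # --tblout saves a per-sequence table of hits.
    search = [
        "hmmsearch",
        "-A", f"{family}.full.sto",
        "--tblout", f"{family}.hits.tbl",
        hmm_file,
        sequence_db,
    ]
    return [build, search]


def run_if_available(commands):
    """Run the commands only if HMMER is on PATH; return True if they were run."""
    if shutil.which("hmmbuild") is None or shutil.which("hmmsearch") is None:
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Hypothetical family and file names, printed for inspection only.
for cmd in pfam_style_commands("globin", "globin.seed.sto", "proteins.fasta"):
    print(" ".join(cmd))
```

The point of the split is visible in the arguments: when the sequence database grows, only the second command needs to be rerun, while the seed alignment and the profile built from it stay the same.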