The structural biology programme is based upon a well-defined and mature pipeline which ranges from bioinformatics for target identification through molecular biology, protein production and biochemistry, biophysics and chemical biology through to crystallisation, crystallography and structure determination and deposition.
We apply this pipeline to groups of related proteins that exhibit similar biological properties, allowing us to dissect and better understand the subtle functional differences between such proteins in terms of their structural biology.
The SGC currently focuses upon the following areas:
The quality criteria defined here are the minimal acceptance criteria for SGC structures; most structures are of considerably higher quality (Acta Crystallogr D Biol Crystallogr. 2007 63:941-50).
X-ray Crystallographic Structures
I- Resolution
- 2.8 Å or better for soluble domain
- 4.0 Å or better for integral membrane protein
II- Data Quality
- Resolution limit is the last shell where:
- completeness ~100% (or before it drops off, for pathological non-reproducible datasets with overall completeness significantly < 100%).
- I/sigI > 2 (SCALA, use "Mean(I)/sd"; SCALEPACK, calculate I/error in last table)
- Redundancy > 3 (anomalous merged)
- No sigma cutoff
- 1 < χ2 < 2 ("normal probability plot" in SCALA)
- Usable data may extend beyond the 100% completeness cutoff; this data should still be used in refinement.
- Rmerge < 10% in lowest resolution shell
- Rmerge < 80% in highest resolution shell (mainly sanity check)
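The resolution-limit rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Shell` record and the `resolution_limit` function are hypothetical names, the per-shell statistics are assumed to have been copied by hand from the SCALA or SCALEPACK log, "completeness ~100%" is approximated by an arbitrary 95% threshold, and "last shell" is read as the last shell reached when walking outward from low resolution before any criterion fails.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Shell:
    d_min: float         # high-resolution edge of the shell, in Å
    completeness: float  # percent
    i_over_sigma: float  # mean I/sigI for the shell
    redundancy: float


def resolution_limit(
    shells: Sequence[Shell],
    min_completeness: float = 95.0,  # assumption: stand-in for "~100%"
    min_i_over_sigma: float = 2.0,
    min_redundancy: float = 3.0,
) -> Optional[float]:
    """Return d_min of the last shell that still meets every criterion.

    Shells are visited from low to high resolution; the walk stops at the
    first shell that fails. Returns None if even the first shell fails.
    """
    limit = None
    for shell in sorted(shells, key=lambda s: s.d_min, reverse=True):
        passes = (
            shell.completeness >= min_completeness
            and shell.i_over_sigma > min_i_over_sigma
            and shell.redundancy > min_redundancy
        )
        if not passes:
            break
        limit = shell.d_min
    return limit


# Made-up example statistics, not taken from any real dataset.
example_shells = [
    Shell(d_min=4.0, completeness=99.8, i_over_sigma=25.0, redundancy=6.5),
    Shell(d_min=3.0, completeness=99.5, i_over_sigma=12.0, redundancy=6.2),
    Shell(d_min=2.5, completeness=99.0, i_over_sigma=4.1, redundancy=5.8),
    Shell(d_min=2.2, completeness=97.0, i_over_sigma=1.6, redundancy=4.0),
]
print(resolution_limit(example_shells))  # prints 2.5
```

Data beyond the reported limit are not discarded by this sketch; as noted above, usable data past the cutoff should still go into refinement.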
III- Model quality
- Rfree < 30% (< 35% for 4 Å structures), calculated with all data (NO sigma cutoff! NO low resolution cutoff)
- NO Ramachandran outliers (according to Richardson/Richardson contours; see the MolProbity server) unless a specific and rational explanation exists (e.g. ligand induced)
- Bond-length rmsd ~ 0.018 Å for high-resolution structures (better than ~2 Å). Tighter geometry restraints may be required at lower resolutions, until PROCHECK reports no protein aromatic planarity outliers (planarity weights can NOT be individually adjusted)
- Messy active sites are considered complete if all expected sidechains and obvious ligands have been accounted for. Clear alternate conformations should be modeled, but residual density does not need to be completely accounted for, particularly if it is not obviously biochemically meaningful.
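The two numerical model-quality criteria (Rfree and Ramachandran outliers) lend themselves to a simple automated check. The sketch below is an assumption-laden illustration: `model_quality_issues` and its parameters are hypothetical names, the values are assumed to come from the refinement log and a MolProbity report, and whether a structure counts as a "4 Å structure" is left to the caller because the text does not define an exact boundary.

```python
from typing import List


def model_quality_issues(
    r_free_percent: float,
    ramachandran_outliers: int,
    four_angstrom_class: bool = False,  # caller decides; no boundary is defined in the text
    outliers_explained: bool = False,   # True if every outlier has a rational explanation
) -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    issues = []
    r_free_limit = 35.0 if four_angstrom_class else 30.0
    if r_free_percent >= r_free_limit:
        issues.append(
            f"Rfree {r_free_percent:.1f}% is not below {r_free_limit:.0f}%"
        )
    if ramachandran_outliers > 0 and not outliers_explained:
        issues.append(
            f"{ramachandran_outliers} unexplained Ramachandran outlier(s)"
        )
    return issues


# Made-up values for illustration only.
print(model_quality_issues(r_free_percent=27.4, ramachandran_outliers=0))
print(model_quality_issues(r_free_percent=33.0, ramachandran_outliers=2))
```

Geometry restraints and active-site completeness still need manual judgement and are deliberately left out of this check.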
Unlike X-ray crystal structures, there is no generally accepted R-factor or direct measure of resolution that can provide quantitative estimates of the quality or accuracy of NMR structures. Instead, NMR solution structures are reported as an ensemble of structures that satisfy the experimental data. The pairwise rmsd (in Å) between equivalent backbone atoms within the ensemble is often used as a rough measure of resolution or precision; however, this value is not equivalent to resolution as in an X-ray structure, nor is it necessarily a measure of accuracy (given the dynamic nature of proteins in solution and the "sparse" nature of NMR restraints). Given these limitations, there are some generally agreed-upon measures of structure quality that can be used to guide an NMR structure refinement. Below is a set of minimum quality requirements for SGC NMR structures.
II- Extent of Resonance assignment
A sufficient number of multidimensional, multinuclear NMR experiments should be collected (with sufficient resolution and S/N) to allow assignment of:
- 98% of backbone 15N, 1H and 13Cα resonances in ordered regions of the molecule
- 95% of side chain resonances in ordered regions of the molecule
- 90% of the observed NOEs for both 15N and 13C bound protons, including ambiguous NOEs
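As a rough illustration, the three assignment targets above can be checked from simple counts. Everything in this sketch is an assumption for illustration: the names `ASSIGNMENT_TARGETS` and `assignment_report` are hypothetical, the counts are assumed to cover ordered regions only, and the targets are read as "at least" the stated percentage.

```python
from typing import Dict, Tuple

# Minimum percentage assigned, per category, as listed above.
ASSIGNMENT_TARGETS = {
    "backbone": 98.0,    # backbone 15N, 1H and 13C-alpha resonances
    "side_chain": 95.0,  # side chain resonances
    "noe": 90.0,         # observed NOEs, including ambiguous ones
}


def assignment_report(
    counts: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[float, bool]]:
    """Map each category to (percent assigned, meets target).

    `counts` maps a category name to (assigned, expected) for the ordered
    regions of the molecule.
    """
    report = {}
    for category, (assigned, expected) in counts.items():
        if expected <= 0:
            raise ValueError(f"no expected items for {category!r}")
        percent = 100 * assigned / expected
        report[category] = (percent, percent >= ASSIGNMENT_TARGETS[category])
    return report


# Made-up counts for illustration only.
example_counts = {
    "backbone": (295, 300),
    "side_chain": (570, 600),
    "noe": (1800, 2100),
}
for name, (percent, ok) in assignment_report(example_counts).items():
    print(f"{name}: {percent:.1f}% {'OK' if ok else 'below target'}")
```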
III- Assessment of Data Quality
- The extent and quality of the data should be assessed on a per-residue basis (see, for example, Nabuurs et al., J. Biomol. NMR (2005) 33, 123) in order to identify the globular, well-defined regions of the protein and to flag potential problem areas during refinement.
- Attempts should be made to identify a medium for orienting the sample for measurements of residual dipolar couplings, and data collected for 1DNH and 1DCH, if possible. Refinement against (N,H) residual dipolar couplings alone only slightly improves geometric measures of structure quality, but should improve accuracy. Use of a sufficient number of both N,H and Cα,H can greatly improve the structure quality.
- Experimental data on the oligomeric state of the protein should be obtained using gel filtration and/or AUC, preferably combined with NMR-derived 15N relaxation data and heteronuclear 15N-1H NOE data. If the protein is oligomeric, isotope-filtered/edited NOESY spectra should be acquired in order to identify unambiguous, experimentally derived inter-subunit NOEs.
IV- Structure quality
- Geometric measures of structure quality, such as deviations of bond lengths and steric clashes in the interior of the protein, are usually reported as Z-scores, based on the Gaussian distribution of the parameter in a reference set of high-resolution X-ray structures. The usual NMR structure determination protocol consists of automated or semi-automated NOESY assignment and simulated annealing of a structure using programs such as CYANA, ARIA or AUTOSTRUCTURE. However, the geometric Z-scores for structures "straight" out of these programs are often poor. Refinement of structures in explicit water (see, for example, Linge et al (2003) Proteins, 50:496-506) has been shown to improve Z-scores, side-chain packing and the appearance of the Ramachandran plot. Two software packages that calculate Z-scores from PDB structures submitted via the web are WHATIF and the Protein Structure Validation Suite (PSVS).
- All Z-score values must be greater than -5, and preferably > -2
- At least 85% of residues should be in the most favored region of the Ramachandran plot.
- Agreement with experimental restraints. The SGC encourages the use of one or more recently developed validation strategies for NMR-derived protein structures that report on the agreement between the NMR restraints and the structure. Scores from such analyses can be used during structure refinement to identify and fix incorrect assignments, NOESY peaks and restraints. An example is the information provided by the QUEEN software package [Quantitative Evaluation of Experimental NMR Restraints; S. B. Nabuurs et al, JACS (2003) 125, 12026-12034]. This program can rank distance restraints according to their information content, and can identify restraints that are either redundant or unique (unsupported by other restraints). The PSVS package also reports RPF scores, which assess the agreement of the structure with NOE peak lists. The recall score measures the fraction of NOE cross peaks that are predicted by the structure. The precision is the fraction of nearby proton pairs that correspond to observed NOEs. The F-measure characterizes the combined recall and precision, and the DP (discriminating power) score measures how the structure differs from a freely rotating chain model. Problematic regions of the structure can be indicated by a graphical tool that displays the distribution of false positives (short distances without corresponding NOESY peaks) on the molecular structure.
- F-measure and Recall scores (from PSVS) should both be >0.8 (the latter means that fewer than 20% of NOESY peaks cannot be assigned on the basis of the structure)
- The PSVS DP score should be > 0.7
- No NOE restraint should be violated by > 0.5 Å, and no dihedral angle restraint by > 5 degrees, in > 50% of the structures in the ensemble.
- If RDCs are used, the Q factor (after refinement) should be < 0.20 (see Bax and Grishaev (2005) Curr Opin Struct Biol. 15, 563-570, and references therein)
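The restraint-agreement criteria above can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text: the F-measure is taken to be the usual harmonic mean of recall and precision, the Q factor is computed as rms(Dobs - Dcalc) / rms(Dobs), which is one common definition among several, and the function names and input layouts are hypothetical. In practice these values would be read from PSVS or the refinement program rather than recomputed.

```python
import math
from typing import List, Sequence


def f_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (assumed definition)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def rdc_q_factor(observed: Sequence[float], calculated: Sequence[float]) -> float:
    """Q = rms(Dobs - Dcalc) / rms(Dobs); other normalisations exist."""
    if not observed or len(observed) != len(calculated):
        raise ValueError("need two non-empty lists of equal length")
    numerator = sum((o - c) ** 2 for o, c in zip(observed, calculated))
    denominator = sum(o ** 2 for o in observed)
    return math.sqrt(numerator / denominator)


def consistently_violated(
    violations: Sequence[Sequence[float]],
    limit: float,
    max_fraction: float = 0.5,
) -> List[int]:
    """Return indices of restraints violated by more than `limit` in more
    than `max_fraction` of the models.

    `violations[i][j]` is the violation of restraint i in model j
    (Å for NOE restraints, degrees for dihedral restraints).
    """
    flagged = []
    for index, per_model in enumerate(violations):
        n_over = sum(1 for v in per_model if v > limit)
        if n_over > max_fraction * len(per_model):
            flagged.append(index)
    return flagged


# Made-up numbers for illustration only.
print(f_measure(recall=0.9, precision=0.8) > 0.8)
print(rdc_q_factor([10.0, -5.0, 8.0], [9.0, -4.0, 8.5]) < 0.20)
noe_violations = [
    [0.6, 0.7, 0.1, 0.8],  # over 0.5 Å in 3 of 4 models: flagged
    [0.6, 0.1, 0.0, 0.2],  # over 0.5 Å in 1 of 4 models
    [0.6, 0.6, 0.0, 0.0],  # over 0.5 Å in exactly half: not "> 50%"
]
print(consistently_violated(noe_violations, limit=0.5))  # prints [0]
```

The same `consistently_violated` call with `limit=5.0` would apply to dihedral angle violations expressed in degrees.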