Python Scripts for Bioinformatics, 2017

Identification_Information:

Citation:

Citation_Information:

Originator: Menning, Damian M (ORCID: 0000-0003-3547-3062)
Originator: Talbot, Sandra L (ORCID: 0000-0002-3312-7214)
Publication_Date: 20170425
Title: Python Scripts for Bioinformatics, 2017
Geospatial_Data_Presentation_Form: software
Publication_Information:

Publication_Place: Anchorage, AK
Publisher: U.S. Geological Survey, Alaska Science Center

Other_Citation_Details:

Suggested Citation: Menning, D. M., Talbot, S. L., Python Scripts for Bioinformatics, 2017: U.S. Geological Survey software release, https://doi.org/10.5066/F74F1NZ4

Online_Linkage: https://doi.org/10.5066/F74F1NZ4
Larger_Work_Citation:

Citation_Information:

Originator: Talbot, S.L.
Publication_Date: 2000
Title:

Developing and Applying Molecular Tools to Natural Resource Problems in Alaska

Geospatial_Data_Presentation_Form: web site
Publication_Information:

Publication_Place: Anchorage, AK
Publisher: U.S. Geological Survey, Alaska Science Center

Online_Linkage: http://alaska.usgs.gov/portal/project.php?project_id=378

Description:

Abstract:

Bioinformatics software repository containing Python scripts for searching and downloading genetic information from the NCBI GenBank database, in support of developing PCR primers, targeted genetic databases, genetic analyses, and data interpretation. Includes multiple functions to streamline this process.

This repository represents an effort to strengthen the software community within the U.S. Geological Survey and follow the open source efforts promoted by the executive level of the federal government. The intent is for this to become a community-built project. All users are encouraged to contribute and comment on issues and pull requests.

Purpose:

To provide free, easy-to-use Python/Biopython scripts for the development, processing, and interpretation of biological sequence information.

Supplemental_Information:

These scripts were written as part of the USGS ASC Python scripts for Bioinformatics code library. They were written in Python 2.7 and require Biopython 1.68.

Time_Period_of_Content:

Time_Period_Information:

Single_Date/Time:

Calendar_Date: 2017

Currentness_Reference: publication date

Status:

Progress: Complete
Maintenance_and_Update_Frequency: As needed

Spatial_Domain:

Bounding_Coordinates:

West_Bounding_Coordinate: -180.00000
East_Bounding_Coordinate: 180.00000
North_Bounding_Coordinate: 90.00000
South_Bounding_Coordinate: -90.00000

Keywords:

Theme:

Theme_Keyword_Thesaurus: USGS Metadata Identifier
Theme_Keyword: USGS:ASC189

Theme:

Theme_Keyword_Thesaurus: None
Theme_Keyword: Python
Theme_Keyword: Biopython
Theme_Keyword: GenBank
Theme_Keyword: NCBI
Theme_Keyword: Web Portal Services
Theme_Keyword: Catalog Service for the Web
Theme_Keyword: Web Coverage Service
Theme_Keyword: Web Feature Service

Theme:

Theme_Keyword_Thesaurus: ISO 19115 Topic Category
Theme_Keyword: Biota
Theme_Keyword: Environment

Theme:

Theme_Keyword_Thesaurus: USGS Science
Theme_Keyword: data communications
Theme_Keyword: data packaging
Theme_Keyword: markup and query language development
Theme_Keyword: scientific software
Theme_Keyword: genetics

Theme:

Theme_Keyword_Thesaurus: USGS Biocomplexity Thesaurus
Theme_Keyword: Genetics
Theme_Keyword: Genetic Resources
Theme_Keyword: Technology
Theme_Keyword: Technology Transfer
Theme_Keyword: Data Processing
Theme_Keyword: Automation

Theme:

Theme_Keyword_Thesaurus: MeSH-National Library of Medicine
Theme_Keyword: Work Simplification
Theme_Keyword: Efficiency
Theme_Keyword: Task Performance and Analysis
Theme_Keyword: Data Display
Theme_Keyword: Computing Methodologies
Theme_Keyword: Automatic Data Processing
Theme_Keyword: Information Storage and Retrieval
Theme_Keyword: Pattern Recognition, Automated

Theme:

Theme_Keyword_Thesaurus: NASA GCMD Earth Science Keywords Thesaurus
Theme_Keyword: Earth Science Services
Theme_Keyword: Data Management/Data Handling
Theme_Keyword: Data Mining
Theme_Keyword: Data Search and Retrieval
Theme_Keyword: Transformation/Conversion
Theme_Keyword: Subsetting/Supersetting
Theme_Keyword: Web Services
Theme_Keyword: Data Processing Services
Theme_Keyword: Data Alignment Services
Theme_Keyword: Format Conversion Services

Access_Constraints: None
Use_Constraints:

For purposes of publication or dissemination, it is requested that citations or credit be given to the author and the U.S. Geological Survey Alaska Science Center. It is strongly recommended that careful attention be paid to the contents of the metadata file associated with these computer programs to evaluate limitations, restrictions or intended use.

Point_of_Contact:

Contact_Information:

Contact_Organization_Primary:

Contact_Organization: U.S. Geological Survey, Alaska Science Center

Contact_Address:

Address_Type: mailing and physical
Address: 4210 University Drive
City: Anchorage
State_or_Province: Alaska
Postal_Code: 99508
Country: USA

Contact_Voice_Telephone: 907-786-7000
Contact_Electronic_Mail_Address: ascweb@usgs.gov

Data_Set_Credit: None
Native_Data_Set_Environment:

These scripts were written on a Windows 7 Enterprise 64-bit operating system using Python 2.7.10 and Biopython 1.68. They should run on Unix-based systems but have not been tested on any platform other than Windows.

Data_Quality_Information:

Attribute_Accuracy:

Attribute_Accuracy_Report:

Data obtained using these scripts are only as reliable as the user-entered information and the accuracy of the NCBI GenBank Nucleotide database.

Data downloaded from NCBI GenBank using these scripts was validated when the scripts were completed by performing the identical search manually and comparing the results manually.

Logical_Consistency_Report:

Consistency of values in data extracted is unknown and is dependent on chosen fields and value parameters in NCBI and what constraints the authors of that data applied.

Completeness_Report:

Completeness of data extracted is unknown and is dependent on chosen fields and value parameters in NCBI and what constraints the authors of that data applied. A manual check of the extracted data is highly recommended to ensure completeness.

Lineage:

Source_Information:

Source_Citation:

Citation_Information:

Originator: National Center for Biotechnology Information
Publication_Date: 1982
Title: GenBank
Geospatial_Data_Presentation_Form: website
Publication_Information:

Publication_Place: Bethesda, MD
Publisher:

National Center for Biotechnology Information, U.S. National Library of Medicine

Online_Linkage: http://www.ncbi.nlm.nih.gov/nuccore/

Type_of_Source_Media: database
Source_Time_Period_of_Content:

Time_Period_Information:

Single_Date/Time:

Calendar_Date: 201703

Source_Currentness_Reference: publication date

Source_Citation_Abbreviation: NCBI
Source_Contribution:

The National Center for Biotechnology Information Nucleotide database is a repository for genetic sequences and associated information.

Data obtained consist primarily of sequence identifiers (usually the Genus species name) and sequence data (nucleotides).

Source_Information:

Source_Citation:

Citation_Information:

Originator: Python Community
Publication_Date: 199505
Title: Python
Edition: 2.7
Geospatial_Data_Presentation_Form: software library
Online_Linkage: https://www.python.org/

Type_of_Source_Media: computer program
Source_Time_Period_of_Content:

Time_Period_Information:

Single_Date/Time:

Calendar_Date: 20160217

Source_Currentness_Reference: publication date

Source_Citation_Abbreviation: Python
Source_Contribution:

Code libraries upon which the bioinformatics scripts are built and on which they depend for execution.

Source_Information:

Source_Citation:

Citation_Information:

Originator: Biopython
Publication_Date: 2000
Title: Biopython
Edition: 1.68
Geospatial_Data_Presentation_Form: software library
Online_Linkage: http://biopython.org/wiki/Biopython

Type_of_Source_Media: computer program
Source_Time_Period_of_Content:

Time_Period_Information:

Single_Date/Time:

Calendar_Date: 20160825

Source_Currentness_Reference: publication date

Source_Citation_Abbreviation: Biopython
Source_Contribution:

Biopython provides additional code libraries that supplement the Python code libraries and extend the functionality of the Python scripts.

Process_Step:

Process_Description:

checkGenBank.py

Description: Searches the National Center for Biotechnology Information GenBank database using two different user developed text (.txt) files. One text file contains a list of Genus species names and a different text file (.txt) contains the search terms (usually genes). Outputs a .csv file containing the Genus species name, search term, and number of hits in separate columns for every possible combination. Also outputs two .txt files, one of all successful searches and one of all failed searches. If the Genus species file contains both Genus species name and common name, they must be separated by a comma.
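The search grid described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released script: the function names `read_species`, `build_queries`, and `tabulate_hits` are hypothetical, a search is treated as "failed" when it returns zero hits, and the hit count comes from a caller-supplied function so that the sketch makes no network calls.

```python
import itertools


def read_species(lines):
    """Return Genus species names, dropping any common name after a comma."""
    names = []
    for line in lines:
        line = line.strip()
        if line:
            names.append(line.split(",")[0].strip())
    return names


def build_queries(species, terms):
    """Pair every Genus species name with every search term."""
    return list(itertools.product(species, terms))


def tabulate_hits(queries, count_hits):
    """Return CSV-style rows plus the successful and failed searches.

    count_hits is a hypothetical callable (name, term) -> int; in practice
    it would wrap a GenBank search.
    """
    rows, found, missing = [], [], []
    for name, term in queries:
        hits = count_hits(name, term)
        rows.append((name, term, hits))
        if hits > 0:
            found.append((name, term))
        else:
            missing.append((name, term))
    return rows, found, missing
```

Each row corresponds to one line of the .csv output, and the two lists correspond to the two .txt files.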

Process_Date: 201703

Process_Step:

Process_Description:

convertFASTQtoFASTA.py

Description: Converts a .fastq file to a .fasta file.
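A conversion of this kind can be sketched in plain Python as shown here. The sketch assumes unwrapped, four-line FASTQ records, and the function name `fastq_to_fasta` is hypothetical. It is not taken from the released script, which may instead rely on Biopython (for example `Bio.SeqIO.convert`).

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ lines (4 lines per record)."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, _quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError("malformed FASTQ record: {0}".format(header))
            # Keep the identifier and sequence; the quality line is dropped.
            yield ">" + header[1:]
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
```

The generator can be fed an open file handle, and its output written line by line to a .fasta file.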

Process_Date: 201703

Process_Step:

Process_Description:

convertFASTQtoFASTAfolder.py

Description: Converts all .fastq files in this folder to .fasta files.

Process_Date: 201703

Process_Step:

Process_Description:

cropFASTA.py

Description: Opens a .fasta file, requests primers of interest, searches the file for those primers, and outputs one .fasta file with all sequences that contain those primers.
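The selection step might look like the sketch below. It assumes records are already held in memory as (identifier, sequence) pairs and that a primer "matches" when it occurs as an exact, case-insensitive substring on the given strand. Both the matching rule and the function names are assumptions for illustration; the released script may match differently.

```python
def contains_all_primers(sequence, primers):
    """True when every primer occurs as an exact substring of the sequence."""
    upper = sequence.upper()
    return all(primer.upper() in upper for primer in primers)


def select_records(records, primers):
    """Keep the (identifier, sequence) pairs that contain every primer."""
    return [(identifier, sequence)
            for identifier, sequence in records
            if contains_all_primers(sequence, primers)]
```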

Process_Date: 201703

Process_Step:

Process_Description:

filterFASTA.py

Description: Checks a .fasta file for identical sequence and taxonomic information. If identical sequences are found that do not have identical taxonomic information, those data are written to a text file ending with _rnw.txt (repeats not written). If identical sequences are found that do have identical taxonomic information, one sequence with associated taxonomic information is written to a .fasta file ending with _rr.fasta (repeats written).
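The filtering rule can be illustrated with a short sketch. It assumes records are (taxon, sequence) pairs, that taxonomic information is compared as an exact string, and that a sequence occurring only once is kept. The name `split_repeats` is hypothetical and the code is not the released implementation.

```python
from collections import OrderedDict


def split_repeats(records):
    """Group records by sequence and return (kept, conflicting).

    kept        -> one record per sequence whose copies all share one taxon
    conflicting -> every record whose sequence appears under several taxa
    """
    by_sequence = OrderedDict()
    for taxon, sequence in records:
        by_sequence.setdefault(sequence, []).append(taxon)

    kept, conflicting = [], []
    for sequence, taxa in by_sequence.items():
        if len(set(taxa)) == 1:
            kept.append((taxa[0], sequence))
        else:
            for taxon in taxa:
                conflicting.append((taxon, sequence))
    return kept, conflicting
```

In the terms used above, `kept` corresponds to the _rr.fasta output and `conflicting` to the _rnw.txt output.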

Process_Date: 201703

Process_Step:

Process_Description:

filterFASTAfolder.py

Description: Checks all .fasta files in current folder, removes all identical sequences that contain more than one unique Genus species identifier for each .fasta file individually. Outputs one .fasta file containing only unique sequences with unique Genus species identifiers for each .fasta file. Also outputs a .txt file containing all removed sequences for each .fasta file.

Process_Date: 201703

Process_Step:

Process_Description:

findPrimers.py

Description: Using an aligned .fasta file, searches for consensus sequences (segments) of a predetermined length (can be user modified, default is 18). If these segments are greater than or equal to the predetermined length they are considered potential primer sites and written to two .txt files. One .txt file contains just the segments, the other contains the segments and their position number in the sequence. Note: this will not find degenerate nucleotides! A manual search of the aligned .fasta should be conducted after running this script in order to find potential segments containing degenerate nucleotides.
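One way to find such segments is to scan the alignment column by column, as in the sketch below. It assumes a column counts as conserved only when every sequence carries the same unambiguous base (A, C, G, or T), so gaps and degenerate codes break a segment, consistent with the note above. The function name and the 1-based position convention are assumptions, not details confirmed by the released script.

```python
def conserved_segments(aligned, min_length=18):
    """Return (start, segment) for each fully conserved run >= min_length.

    aligned is a list of equal-length aligned sequences; start is 1-based.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must share one length")

    segments = []
    start = None
    # Iterate one position past the end so a trailing run is also closed.
    for pos in range(length + 1):
        conserved = False
        if pos < length:
            column = set(seq[pos].upper() for seq in aligned)
            conserved = len(column) == 1 and column <= set("ACGT")
        if conserved:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_length:
                segments.append((start + 1, aligned[0][start:pos].upper()))
            start = None
    return segments
```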

Process_Date: 201703

Process_Step:

Process_Description:

getFASTA.py

Description: Searches the National Center for Biotechnology Information GenBank database using one text file (.txt) containing a list of Genus species names. User defined inputs include desired output file format (.gb or .fasta), number of output files (all data in one file or separate files for each Genus species name), and search term(s) (usually genes) that utilize Boolean operators. Internally stores only unique Accession numbers and outputs all data in either GenBank (.gb) or FASTA (.fasta) format. If the Genus species file contains both Genus species name and common name, they must be separated by a comma.
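Two pieces of this workflow are easy to illustrate without contacting NCBI: composing a Boolean search term and keeping only unique Accession numbers. The sketch below is hypothetical. The exact query string the released script sends is not documented here, and the use of the Entrez `[Organism]` field tag is an assumption.

```python
def build_term(species, gene_query):
    """Combine a Genus species name with a Boolean gene expression."""
    return '"{0}"[Organism] AND ({1})'.format(species, gene_query)


def unique_accessions(accessions):
    """Drop repeated Accession numbers while preserving first-seen order."""
    seen = set()
    unique = []
    for accession in accessions:
        if accession not in seen:
            seen.add(accession)
            unique.append(accession)
    return unique
```

For example, `build_term("Ursus arctos", "COI OR cytb")` produces a single term covering both genes for one species.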

Process_Date: 201703

Process_Step:

Process_Description:

getFragments.py

Description: Cuts an aligned .fasta file into fragments using a list of potential primer sites and returns all potential fragment combinations in .fasta format. This script assumes the required length of potential fragments is no shorter than 41 bps and no longer than 350 bps. These values can be changed by the user. The potential primer sites input file must be in .txt format.
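The pairing of primer sites into candidate fragments can be sketched as below for a single sequence. The sketch assumes each site is located by its first exact occurrence, that a fragment runs from the start of the left site to the end of the right site, and that length is counted on the string as given. These choices and the name `candidate_fragments` are illustrative assumptions rather than the released behavior.

```python
import itertools


def candidate_fragments(sequence, sites, min_len=41, max_len=350):
    """Return fragments spanning every pair of primer sites within the limits."""
    positions = []
    for site in sites:
        start = sequence.find(site)
        if start != -1:
            positions.append((start, start + len(site)))
    positions.sort()

    fragments = []
    # Every pair (left site, right site) defines one candidate fragment.
    for (left_start, _), (_, right_end) in itertools.combinations(positions, 2):
        fragment = sequence[left_start:right_end]
        if min_len <= len(fragment) <= max_len:
            fragments.append(fragment)
    return fragments
```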

Process_Date: 201703

Process_Step:

Process_Description:

Process_Date: 201703

Process_Step:

Process_Description:

matchRefDB.py

Description: Opens a .fasta file to be used as the reference database, then searches each .fasta file in the folder individually for sequence matches to the reference database. Outputs a .csv file with the number of matches to reference sequences and one .fasta file with all sequences that did not match any sequence within the reference database.
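The matching step can be illustrated with a dictionary lookup, as in this sketch. It assumes a "match" means an exact, case-insensitive, full-length sequence identity, which may be stricter than the released script. The function name and the (identifier, sequence) record layout are hypothetical.

```python
def match_to_reference(reference, samples):
    """Count sample sequences per reference entry and collect non-matches.

    reference and samples are lists of (identifier, sequence) pairs.
    """
    reference_ids = {}
    for ref_id, sequence in reference:
        # If a reference sequence is duplicated, the first identifier wins.
        reference_ids.setdefault(sequence.upper(), ref_id)

    counts = dict((ref_id, 0) for ref_id in reference_ids.values())
    unmatched = []
    for sample_id, sequence in samples:
        ref_id = reference_ids.get(sequence.upper())
        if ref_id is None:
            unmatched.append((sample_id, sequence))
        else:
            counts[ref_id] += 1
    return counts, unmatched
```

Here `counts` corresponds to the .csv output and `unmatched` to the .fasta file of sequences absent from the reference database.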

Process_Date: 201703

Process_Step:

Process_Description:

mergePairedEndFASTAs.py

Description: Searches all files in current folder, merges forward (R1) and reverse (R2) Illumina MiSeq reads using a user-adjustable global alignment.
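The idea of merging a read pair can be shown with a deliberately simplified sketch. Note the substitution: instead of the global alignment used by the script, this example reverse-complements R2 and joins the reads on their longest exact suffix/prefix overlap, so it tolerates no mismatches. The function names and the `min_overlap` parameter are hypothetical.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(sequence):
    """Return the reverse complement of an A/C/G/T/N sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(sequence.upper()))


def merge_pair(forward, reverse, min_overlap=10):
    """Merge R1 with the reverse complement of R2, or return None.

    The reads are joined on the longest exact overlap between the end of
    R1 and the start of reverse-complemented R2.
    """
    forward = forward.upper()
    rc_reverse = reverse_complement(reverse)
    longest = min(len(forward), len(rc_reverse))
    for size in range(longest, max(min_overlap, 1) - 1, -1):
        if forward[-size:] == rc_reverse[:size]:
            return forward + rc_reverse[size:]
    return None
```

An alignment-based merge would additionally score mismatches and gaps within the overlap, which matters for reads with sequencing errors.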

Process_Date: 201703

Process_Step:

Process_Description:

verifyFASTA.py

Description: Checks a .fasta file, counts the number of times a sequence is repeated, writes the sequence, the number of times that sequence occurs, and the identifiers associated with the sequence. Only writes information for sequences that have two (2) or more identifiers. Best used after filterFASTA.py to verify all remaining sequences only have one (1) unique identifier.
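A report of this kind can be sketched as follows. The sketch assumes records are (identifier, sequence) pairs and reads "two or more identifiers" as two or more distinct identifiers. The name `repeated_sequence_report` is hypothetical and the output layout of the released script is not reproduced here.

```python
from collections import OrderedDict


def repeated_sequence_report(records):
    """Return (sequence, occurrences, identifiers) for ambiguous sequences.

    Only sequences carried by two or more distinct identifiers are reported.
    """
    grouped = OrderedDict()
    for identifier, sequence in records:
        grouped.setdefault(sequence, []).append(identifier)

    report = []
    for sequence, identifiers in grouped.items():
        distinct = sorted(set(identifiers))
        if len(distinct) >= 2:
            report.append((sequence, len(identifiers), distinct))
    return report
```

An empty report indicates that every remaining sequence is associated with a single unique identifier.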

Process_Date: 201703

Process_Step:

Process_Description:

verifyFASTAfolder.py

Description: Checks all .fasta files in current folder individually, counts the number of times a sequence is repeated, writes the sequence, the number of times that sequence occurs, and the identifiers associated with the sequence. Only writes information for sequences that have two (2) or more identifiers. Best used after filterFASTA.py to verify all remaining sequences only have one (1) unique identifier.

Process_Date: 201703

Distribution_Information:

Distributor:

Contact_Information:

Contact_Organization_Primary:

Contact_Organization: U.S. Geological Survey, Alaska Science Center

Contact_Address:

Address_Type: mailing and physical
Address: 4210 University Drive
City: Anchorage
State_or_Province: AK
Postal_Code: 99508
Country: USA

Contact_Voice_Telephone: 907-786-7000
Contact_Electronic_Mail_Address: ascweb@usgs.gov

Resource_Description:

Code repository on a USGS managed Bitbucket installation. This repository represents the authoritative source for the bioinformatics scripts.

Distribution_Liability:

This software has been approved for release by the U.S. Geological Survey (USGS). Although the software has been subjected to rigorous review, the USGS reserves the right to update the software as needed pursuant to further analysis and review. No warranty, expressed or implied, is made by the USGS or the U.S. Government as to the functionality of the software and related material nor shall the fact of release constitute any such warranty. Furthermore, the software is released on condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from its authorized or unauthorized use. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Standard_Order_Process:

Digital_Form:

Digital_Transfer_Information:

Format_Name: Online_Git_Repository
Format_Information_Content:

These Python/Biopython scripts are hosted at the USGS Bitbucket website. Bitbucket is a web-based hosting service that uses the Git version control system. Scripts can be downloaded from the USGS Bitbucket site as a ZIP file or cloned to a local folder. Readme documents in the repository provide further code documentation. Online help for Bitbucket is available from the site.

Digital_Transfer_Option:

Online_Option:

Computer_Contact_Information:

Network_Address:

Network_Resource_Name: https://doi.org/10.5066/F74F1NZ4

Fees: None

Technical_Prerequisites:

Windows 7, Python 2.7, Biopython 1.68

These scripts should work on most platforms but have only been tested on a Windows 7 64-bit operating system.


Metadata: